Field Configuration: Best Practices - Analyzers

Facet configuration can affect index size and payload size, which in turn affect indexing time and engine performance. The following guidelines are standard best practices; your configurations may vary depending on your data, business requirements, and use cases.

 


Types of Analyzers

This section outlines the available analyzers. Business users can choose from five analyzer options. Selecting an analyzer overrides the default indexing configuration, “Hawk Search Analyzer”. Each analyzer is explained below.

Hawk Analyzer 

The Hawksearch Analyzer has the same properties as the Snowball Analyzer (see below) but also takes into account synonyms configured in the Hawksearch Workbench. It is the only analyzer that takes synonyms into account. This is the default analyzer applied when a field is set to be queried.

Snowball Analyzer

The Snowball Analyzer applies stemming, a step that converts a word into its stem. For example, if the word “climbing” is entered, the analyzer converts it to “climb” and uses that stem to search any field that has the Snowball Analyzer set on it. The search returns items that contain climber, climbing, or climb. Stemming can lose information about the original form of your text. For example, the terms universe, university, and universal all stem to the same root, “univers”, and would all return the same results. This is likely not the desired result.

Description:

  • Stemming is applied

  • Stop words are removed

  • Colons, #, %, $, parentheses, and slashes are removed

  • Removes underscores, hyphens, @, and & symbols unless they are part of words or numbers

  • Removes an apostrophe if it is (a) at the beginning of a word, (b) at the end of a word, or (c) followed by the letter s

  • Separates numbers from text when numbers are at the beginning of a word

  • Letter characters are converted to lowercase

Best Used For:

Fields that have content consisting of multiple versions of a word.
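
The effect of stemming on matching can be sketched in a few lines of Python. This is a deliberately naive suffix stripper written for illustration; it is not the Snowball algorithm and not Hawksearch code, and the suffix list and the function names (toy_stem, matches) are hypothetical. It only shows why different word forms end up matching each other once they share a stem.

```python
# Toy suffix stripper for illustration only; this is NOT the Snowball algorithm.
SUFFIXES = ("ing", "ity", "er", "al", "e", "s")  # hypothetical, hand-picked list


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, field_term: str) -> bool:
    """A query term matches a field term when both reduce to the same stem."""
    return toy_stem(query) == toy_stem(field_term)


for term in ("climb", "climber", "climbing"):
    print(term, "->", toy_stem(term))  # each prints the stem "climb"

for term in ("universe", "university", "universal"):
    print(term, "->", toy_stem(term))  # each prints the stem "univers"

print(matches("universe", "university"))  # True: unrelated words collide
```

The last line shows the trade-off described above: once terms are reduced to a shared stem, the index can no longer tell them apart.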

Standard Analyzer 

The Standard Analyzer accounts for the following: 

Description:

  • Separates text “smartly”, accounting for the following lexical types:

    • Alphanumerics

    • Acronyms

    • Company names

    • Email addresses

    • Computer hostnames

    • Numbers

    • Words with an interior apostrophe

    • Serial numbers

    • IP addresses

    • Chinese and Japanese characters

  • Stop words are removed

  • Letter characters are converted to lowercase

  • No stemming applied

Best Used For:

Fields that contain English words, such as units of measure, as well as fields with the value types listed above.

Simple Analyzer

The Simple Analyzer accounts for the following: 

Description:

  • Separates text at non-letter characters and removes all non-letter characters

  • Letter characters are converted to lowercase

  • No stop words are removed

  • No stemming applied

Best Used For:

Fields that only have alphabetical characters and don’t need the advanced interpretation of the Standard Analyzer. For example, consider a field that stores famous one-line quotes that will be queried. If a user searches “to be, or not to be”, removing the standard stop words would leave nothing to search on. Additionally, if stemming were applied to this field, the results would be less relevant than they would be without stemming. In a case like this, the Simple Analyzer is a good choice.
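
The quote example can be illustrated with a minimal Python sketch of the behavior described for the Simple Analyzer. This is an approximation and not the engine's implementation; the function names and the small STOP_WORDS sample are assumptions made for the illustration, and the engine's actual stop word list may differ.

```python
import re
from typing import List

# Assumption: a small sample of English stop words, not the engine's actual list.
STOP_WORDS = {"a", "an", "and", "be", "not", "or", "the", "to", "with"}


def simple_analyze(text: str) -> List[str]:
    """Split at non-letter characters, drop them, and lowercase the remaining tokens."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop every token found in the sample stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]


quote = "To be, or not to be"
tokens = simple_analyze(quote)
print(tokens)                     # ['to', 'be', 'or', 'not', 'to', 'be']
print(remove_stop_words(tokens))  # [] (nothing left to search on)
```

Because the Simple Analyzer keeps stop words, every term of the quote stays searchable.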

Stop Analyzer

The Stop Analyzer accounts for the following: 

Description:

  • Stop words are removed

  • Divides text at non-letter characters and removes all non-letter characters

  • Letter characters are converted to lowercase

  • No stemming applied

Best Used For:

When a simple, text-only analyzer is needed that also removes stop words.  This should be used on fields that are intended to only have values made up of alphabetic characters.

Example:

Stop word: with

Original Product Name: Men's Long Mesh Short With Pockets

Product name with Stop Analyzer implemented: men s long mesh short pockets
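
The example above can be reproduced with a short Python sketch of the steps listed for the Stop Analyzer. It is an approximation under stated assumptions, not the engine's implementation: the function name stop_analyze is hypothetical, and STOP_WORDS is a small sample that includes "with", not the engine's actual list.

```python
import re
from typing import List

# Assumption: a small sample of English stop words, not the engine's actual list.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to", "with"}


def stop_analyze(text: str) -> List[str]:
    """Split at non-letter characters, lowercase the tokens, then drop stop words."""
    tokens = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in STOP_WORDS]


product_name = "Men's Long Mesh Short With Pockets"
print(" ".join(stop_analyze(product_name)))  # men s long mesh short pockets
```

The apostrophe is a non-letter character, so "Men's" is split into "men" and "s", which matches the analyzed product name shown above.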

White Space Analyzer

The White Space Analyzer accounts for the following: 

Description:

  • Search terms divided at whitespace

  • No characters are removed

  • No characters are converted to lowercase

  • No stop words are removed

  • No stemming applied

Best Used For:

Searching by exactly what the user enters. This could be useful on a field that may be queried with terms that are both proper names and common nouns, such as: polish vs. Polish, bill vs. Bill, case vs. Case.
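
A minimal Python sketch of whitespace-only analysis shows why case is preserved in matching. This is an illustration of the behavior listed above, not the engine's implementation, and the function names are hypothetical.

```python
from typing import List


def whitespace_analyze(text: str) -> List[str]:
    """Split at whitespace only; keep case, punctuation, and stop words untouched."""
    return text.split()


def matches(query: str, field_value: str) -> bool:
    """True when any query term appears verbatim among the field's terms."""
    field_terms = set(whitespace_analyze(field_value))
    return any(term in field_terms for term in whitespace_analyze(query))


field_value = "Furniture polish and wax"
print(whitespace_analyze(field_value))  # ['Furniture', 'polish', 'and', 'wax']
print(matches("polish", field_value))   # True
print(matches("Polish", field_value))   # False: case must match exactly
```

Note that punctuation also stays attached to terms, so a value such as "case," is indexed with its trailing comma and will not match the query "case".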