Field Configuration: Best Practices - Analyzers
Facet configuration can impact index size and payload size, which in turn affects indexing time and engine performance. The following guidelines are standard best practices; your configuration may vary depending on your data, business requirements, and use cases.
Types of Analyzers
This section outlines the available analyzers. In addition to the default, there are five analyzer options that the business user can select. Selecting one of them overrides the default indexing configuration, the “Hawk Search Analyzer”. Each analyzer is explained below.
Hawk Analyzer
The Hawk Analyzer has the same properties as the Snowball Analyzer (see below) but also takes into account synonyms configured in the Hawksearch workbench. It is the only analyzer that takes synonyms into account. This is the default analyzer applied when a field is set to be queried.
Snowball Analyzer
The Snowball Analyzer applies stemming, which converts a word into its stem. For example, if the word “climbing” is entered, the analyzer converts it to “climb” and uses that to search a field that has the Snowball Analyzer set on it. The search would return items that contain climber, climbing, and climb. Stemming can lose information about the original form of your text. For example, the terms universe, university, and universal all stem to the same root, “univers”, and would all return the same results. This is likely not the desired result.
Description:
Stemming is applied
Stop words are removed
Colons, #, %, $, parentheses, and slashes are removed
Underscores, hyphens, @, and & symbols are removed unless they are part of words or numbers
Apostrophes are removed when they are (a) at the beginning of a word, (b) at the end of a word, or (c) followed by the letter s
Numbers are separated from text when the numbers are at the beginning of a word
Letter characters are converted to lowercase
Best Used For:
Fields that have content consisting of multiple versions of a word.
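The effect of stemming on matching can be sketched in a few lines of Python. This is only an illustration: `toy_stem` and its suffix list are hypothetical and much cruder than the real Snowball algorithm, and they are chosen solely to reproduce the two examples above. The point is that every word sharing a stem lands in the same group, whether or not the words are related in meaning.

```python
from collections import defaultdict

# Toy suffix list, for illustration only. The real Snowball algorithm
# uses a much larger rule set with conditions on word structure.
TOY_SUFFIXES = ("ity", "ing", "al", "er", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the original words that collapse onto it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(
    ["climb", "climbing", "climber", "universe", "university", "universal"]
)
for stem, originals in groups.items():
    print(f"{stem}: {originals}")
```

A query for any word in a group matches documents containing any other word in that group, which is helpful for the "climb" family and probably unwanted for the "univers" family.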
Standard Analyzer
The Standard Analyzer accounts for the following:
Description:
Separates text “smartly”, accounting for the following lexical types:
Alphanumerics
Acronyms
Company names
Email addresses
Computer hostnames
Numbers
Words with an interior apostrophe
Serial numbers
IP addresses
Chinese and Japanese characters
Stop words are removed
Letter characters are converted to lowercase
No stemming applied
Best Used For:
Fields that contain English words, such as units of measure, as well as fields that contain the kinds of values listed above.
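The idea of "smart" separation can be approximated with a handful of regular expressions. The sketch below is a rough stand-in, not the analyzer's actual grammar: the pattern, the `standard_like_analyze` name, and the stop word list are all assumptions made for illustration, and only a few of the lexical types above (email addresses, dotted hostnames and IP addresses, interior apostrophes, alphanumerics) are covered.

```python
import re

# Hypothetical stop list for illustration; the engine's actual list may differ.
STOP_WORDS = {"a", "an", "and", "at", "of", "or", "the", "to", "with"}

# Alternatives are tried left to right, so the more specific shapes come first.
TOKEN_PATTERN = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses
    | \w+(?:\.\w+)+                 # dotted tokens: hostnames, IP addresses
    | [a-z]+(?:'[a-z]+)+            # words with an interior apostrophe
    | \w+                           # plain alphanumerics and numbers
    """,
    re.VERBOSE,
)


def standard_like_analyze(text):
    """Lowercase, keep recognised token shapes whole, drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(standard_like_analyze(
    "Email sales@example.com about the O'Reilly server at 192.168.0.1"
))
```

In this sketch the email address, the name with an apostrophe, and the IP address each survive as a single term, while "the" and "at" are dropped and no stemming takes place.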
Simple Analyzer
The Simple Analyzer accounts for the following:
Description:
Separates text at non-letter characters and removes all non-letter characters
Letter characters are converted to lowercase
No stop words are removed
No stemming applied
Best Used For:
Fields that only have alphabetical characters and don’t need the advanced interpretation of the Standard Analyzer. For example, consider a field that stores famous one-line quotes that will be queried. If a user searches “to be, or not to be”, removing the standard stop words would leave nothing to search on. Additionally, if stemming were applied to this field, the results would be less relevant than they would be without stemming. In a case like this, the Simple Analyzer is a good choice.
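The behavior described above is simple enough to sketch directly. The following is a minimal approximation, assuming ASCII letters only; `simple_analyze` is a hypothetical name and not part of any Hawksearch API.

```python
import re


def simple_analyze(text):
    """Lowercase the text and keep only runs of letters as terms."""
    # Everything that is not a letter (digits, punctuation, spaces)
    # acts as a separator and is discarded.
    return re.findall(r"[a-z]+", text.lower())


print(simple_analyze("To be, or not to be"))
print(simple_analyze("Size 10-12, 100% cotton"))
```

Because no stop words are removed, every word of the quote remains searchable. The second call shows the trade-off: digits and symbols disappear entirely, so this approach suits fields that hold only alphabetic text.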
Stop Analyzer
The Stop Analyzer accounts for the following:
Description:
Stop words are removed
Divides text at non-letter characters and removes all non-letter characters
Letter characters are converted to lowercase
No stemming applied
Best Used For:
Cases where a simple, text-only analyzer is needed that also removes stop words. Use it on fields whose values are intended to consist only of alphabetic characters.
Example:
Stop word: with
Original Product Name: Men's Long Mesh Short With Pockets
Product name with Stop Analyzer applied: men s long mesh short pockets
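The example above can be reproduced with a short sketch. It assumes ASCII letters and a small, hypothetical stop word list; the list the engine actually uses may differ, and `stop_analyze` is an illustrative name only.

```python
import re

# Hypothetical stop list for illustration; the engine's actual list may differ.
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to", "with"}


def stop_analyze(text):
    """Lowercase, split at non-letter characters, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(" ".join(stop_analyze("Men's Long Mesh Short With Pockets")))
print(stop_analyze("to be, or not to be"))
```

The apostrophe splits "Men's" into "men" and "s", and "with" is dropped, which matches the product name shown above. The second call returns an empty list, which is the situation described under the Simple Analyzer: a query made up entirely of stop words leaves nothing to search on.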
White Space Analyzer
The White Space Analyzer accounts for the following:
Description:
Search terms divided at whitespace
No characters are removed
No characters are converted to lowercase
No stop words are removed
No stemming applied
Best Used For:
Searching by exactly what is entered by the user. This could be useful on a field that may be queried with terms that are both proper names and common nouns, such as: polish vs. Polish, bill vs. Bill, case vs. Case.
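A whitespace-only analyzer is the simplest of all to sketch, and doing so makes its strictness visible. `whitespace_analyze` below is a hypothetical helper, not a Hawksearch function.

```python
def whitespace_analyze(text):
    """Split at whitespace only; case and punctuation are left untouched."""
    return text.split()


terms = whitespace_analyze("Bill asked for the Polish bill.")
print(terms)

# Matching is exact, so case matters.
print("Polish" in terms)   # the proper noun is present
print("polish" in terms)   # the common noun is not
print("bill" in terms)     # not present either: the stored term is "bill."
```

The last check shows the cost of removing no characters: trailing punctuation stays attached to the term, so a query must match it exactly as well.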