How is the data being indexed?
This section outlines how the system is configured to index data by default. It also outlines how to alter the way a specific data field is indexed.
Default Indexing Configuration
Every Hawksearch implementation is set up with a default configuration. If there are issues with the way the engine is indexing the data, please see the section on "Adding a Field".
The system runs through several different filters when indexing the data.
1. Standard Filter – tokenizing the data.
2. Lowercase filter – converts the data to lowercase characters.
3. ASCII folding filter – converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
4. Synonym filter – applies synonyms to the terms in the dictionary for the fields that have the synonyms flag turned on.
5. Stop filter – removes all words that have been added to the stop words feature within the workbench.
6. Length filter – removes words that are too long or too short from the stream. Note: Length is calculated as the number of UTF-16 code units.
7. Snowball filter – applies stemming to the dictionary based on the fields that have the stemming flag turned on.
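The filter order above can be illustrated with a rough sketch. This is not the engine's implementation: the function names, the regular expression used for tokenizing, and the sample synonym and stop word lists are all assumptions made for illustration, and the stemming step is left out because it needs a stemmer library.

```python
import re
import unicodedata

def tokenize(text):
    # Stand-in for the standard filter: split into runs of word characters.
    return re.findall(r"\w+", text)

def ascii_fold(token):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text, synonyms, stop_words, min_len=1, max_len=255):
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]          # lowercase filter
    tokens = [ascii_fold(t) for t in tokens]      # ASCII folding filter
    expanded = []
    for t in tokens:                              # synonym filter
        expanded.append(t)
        expanded.extend(synonyms.get(t, []))
    tokens = [t for t in expanded if t not in stop_words]  # stop filter
    # Length filter. Note: len() counts code points here, not UTF-16 code units.
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Hypothetical synonym and stop word configuration.
print(analyze("The Café Sofa", synonyms={"sofa": ["couch"]}, stop_words={"the"}))
```

Each step receives the output of the previous one, which is why the order of the filters matters: for instance, stop words are matched after lowercasing, so "The" is removed by the stop word "the".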
Adding a Field
Adding a field to the engine requires a variety of information. Several key configuration options control how the data is indexed within the engine. This section outlines how to influence the way the data is indexed.
Query this Field
The Query this Field option must be enabled for the system to index the data for search. After this feature is enabled, additional configuration options become available. These options allow the user to customize how the data in this field is indexed.
Analyzers
The Analyzer feature becomes available after Query this Field is enabled. If no analyzer is selected, the system applies the default configuration to the field. This means it executes all of the filters provided within the "HawkSnowBallAnalyzer". This is the standard implementation. There may be situations where the configuration needs to be changed.
Examples
This section outlines examples of why the business user would override the default indexing configuration, "HawkSnowBallAnalyzer".
Special Characters
One of the data fields contains special characters. The user may search with or without the special characters, and the business team would like the products to be found in either scenario. The default configuration removes all special characters. To accommodate this requirement, the business user will need to index the data field in two different ways.
This requires the business user to add the field twice.
o Once with the default configuration "HawkSnowBallAnalyzer". No analyzer is selected.
o Once with the "White Space Analyzer"
Note: when the field is added twice, each copy must be created with a different name. It is important that the naming convention helps identify how the data is configured. Examples: item_number and item_number_special_characters.
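A minimal sketch of the two-field approach follows. The tokenizing functions are simplified stand-ins for the default configuration and the White Space Analyzer, not the engine's actual behavior, and the sample item number is made up.

```python
import re

def default_style_tokens(value):
    # Stand-in for the default configuration: special characters are dropped,
    # so only runs of letters and digits survive, lowercased.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", value)]

def whitespace_tokens(value):
    # Stand-in for the White Space Analyzer: split on whitespace only,
    # leaving special characters and case untouched.
    return value.split()

record = {"item_number": "AB-123/X"}  # hypothetical sample value

indexed = {
    "item_number": default_style_tokens(record["item_number"]),
    "item_number_special_characters": whitespace_tokens(record["item_number"]),
}
print(indexed)
```

The first field matches a search typed without the special characters, while the second keeps the value intact so a search for the full item number can still match.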
Content Data
The website may have a separate landing page for articles. The article records may have a body data field. The data in this field may need less manipulation than the other data fields, and it may be desirable to make only limited changes to the existing data. If the only adjustments required are dividing text at non-letter characters and lowercasing the words, then the Simple Analyzer would be used instead of the default.
Types of Analyzers
This section outlines the available analyzers. There are five analyzer options that are available for the business user to select. The selection of an analyzer will override the default indexing configuration "HawkSnowBallAnalyzer". Each analyzer is explained below.
Standard Analyzer
Standard Analyzer holds the honor as the most generally useful built-in analyzer. A JFlex-based grammar underlies it, tokenizing with cleverness for the following lexical types: alphanumerics, acronyms, company names, email addresses, computer hostnames, numbers, words with an interior apostrophe, serial numbers, IP addresses, and Chinese and Japanese characters. Standard Analyzer also includes stop-word removal, using the same mechanism as the Stop Analyzer. Standard Analyzer makes a great first choice.
Simple Analyzer
Divides text at non-letter characters and lowercases.
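As a rough illustration of this behavior (a sketch, not the engine's code; the function name and sample text are assumptions):

```python
import re

def simple_analyze(text):
    # Split at any non-letter character, then lowercase each token.
    # [^\W\d_] matches a word character that is neither a digit nor "_".
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

print(simple_analyze("Top 10 Tips: Wi-Fi setup"))
```

Digits and punctuation act only as separators here, so "10" disappears and "Wi-Fi" becomes two tokens.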
Snowball Analyzer
The Snowball Analyzer is a stemming analyzer. When a stemmer such as Snowball is used, information describing the original form of the text can be lost. Sometimes this will be useful, sometimes not.
For example, Snowball will stem "organization" into "organ", so a search for "organization" will return results with "organ", without any scoring penalty.
Whether or not this is appropriate depends upon the content data and on the type of searches supported (for example, are the searches very basic, or are users very sophisticated and using search to accurately filter down the results). Other less aggressive stemmers could also be considered, such as KStem.
Stop Analyzer
Divides text at non-letter characters, lowercases and removes stop words.
Option 1: Divides text at non-letter characters and lowercases, and removes stop words.
Option 2: With this option, beyond doing basic word splitting and lowercasing, it also removes special words called stop words. Stop words are words that are very common, such as "the," and thus assumed to carry very little standalone meaning for searching since nearly every product or document will contain the word.
Applying the stop analyzer will do the following:
1. Tokenize the data provided within the data field. This involves replacing special characters with white space.
2. Update the format of the data within the data field to lower case.
3. Remove all stop words from the data within the data field.
Stop Analyzer Example:
Stop word: with
Original Product Name
Men's Long Mesh Short With Pockets
Product name with Stop Analyzer implemented
men s long mesh short pockets
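The example above can be reproduced with a short sketch. The function below is a simplified stand-in for the Stop Analyzer, written under the assumption that every non-letter character acts as a separator; it is not the engine's implementation.

```python
import re

def stop_analyze(text, stop_words):
    # 1. Tokenize: keep runs of letters, treating everything else as a separator.
    tokens = re.findall(r"[^\W\d_]+", text)
    # 2. Lowercase every token.
    tokens = [t.lower() for t in tokens]
    # 3. Remove the stop words.
    return [t for t in tokens if t not in stop_words]

tokens = stop_analyze("Men's Long Mesh Short With Pockets", {"with"})
print(" ".join(tokens))  # men s long mesh short pockets
```

The apostrophe in "Men's" is a non-letter character, which is why the name is indexed as the two tokens "men" and "s".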
White Space Analyzer
Divides text at whitespace.
Boosting Fields
The Boost feature allows business users to indicate which fields should be given higher priority when establishing relevancy. If no boost is added, all fields are considered equally relevant. It is recommended that the team rank the fields from most to least important. Based on this list, a boost value should be added to the fields that are the most important. The least important fields can be left untouched.
Boost Example
This example provides a list of fields ordered from most to least important. Let's assume that the name field is the most important and should be ranked the highest. The category field is the second most important, so it should be boosted, but not as high as the name field. The rest of the fields are of equal importance. The list below shows how the values can be added to each of the fields.
Fields Boost Value
1. Name 100
2. Category 50
3. Keywords 1
4. Long description 1
5. Item number 1
6. Manufacturer part number 1
7. UPC 1
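To show how such values could influence ranking, here is a deliberately simplified sketch that multiplies the number of query term matches in each field by that field's boost. The scoring formula, field keys, and sample documents are assumptions for illustration; the engine's real relevancy calculation is more involved.

```python
# Boost values from the example above (field keys are hypothetical).
BOOSTS = {
    "name": 100,
    "category": 50,
    "keywords": 1,
    "long_description": 1,
    "item_number": 1,
    "manufacturer_part_number": 1,
    "upc": 1,
}

def score(document, query_terms, boosts):
    # Sum, over all fields, of (boost x number of matching terms in the field).
    total = 0
    for field, text in document.items():
        terms = text.lower().split()
        matches = sum(terms.count(q) for q in query_terms)
        total += boosts.get(field, 1) * matches
    return total

docs = {
    "a": {"name": "mesh short", "category": "apparel"},
    "b": {"name": "running sock", "category": "mesh apparel"},
    "c": {"name": "running sock", "category": "apparel", "keywords": "mesh"},
}
ranked = sorted(docs, key=lambda d: score(docs[d], ["mesh"], BOOSTS), reverse=True)
print(ranked)
```

A match in the name field outranks a match in the category field, which in turn outranks a match in any unboosted field.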
OmitNorms
Norms allow for field length normalization. This allows the length of the field value to help influence relevance. For example, if a document has 50,000 lines of text and another document has 50 lines of text, the smaller document would score higher with norms enabled. The Omit Norms feature will disable this option for this field. Enable this feature for fields that are very short (e.g. ids, names) and if there are many fields in the engine. If this does not apply, leave Omit Norms disabled.
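The effect of norms can be sketched with the classic Lucene-style length norm, 1 / sqrt(number of terms in the field). Whether the engine uses exactly this formula is an assumption; the function names and numbers below are illustrative only.

```python
import math

def length_norm(num_terms):
    # Classic length normalization: longer fields get a smaller factor.
    return 1.0 / math.sqrt(num_terms)

def field_score(term_frequency, num_terms, omit_norms=False):
    # With norms omitted, field length no longer influences the score.
    norm = 1.0 if omit_norms else length_norm(num_terms)
    return term_frequency * norm

short_doc = field_score(term_frequency=1, num_terms=100)
long_doc = field_score(term_frequency=1, num_terms=10000)
print(short_doc, long_doc)

# With Omit Norms enabled, both documents score the same for one match.
print(field_score(1, 100, omit_norms=True), field_score(1, 10000, omit_norms=True))
```

For very short fields such as IDs, the lengths barely differ between records, so the norm adds little and can be omitted.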
Include in Dictionary
The Include in Dictionary feature indicates to the engine that the data from this field should be included in the dictionary. The dictionary is used to provide the auto-correct feature with keyword terms.
This feature is used for fields that contain keyword terms that should be added to the dictionary. Item number fields do not contain keywords and are therefore not included in the dictionary.
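The idea can be sketched as follows: only fields flagged for the dictionary contribute terms, and a misspelled search term is corrected against those terms. The flag name, field configuration, and the use of Python's difflib for matching are assumptions for illustration, not the engine's auto-correct algorithm.

```python
import difflib

# Hypothetical field configuration.
FIELDS = {
    "name": {"include_in_dictionary": True},
    "item_number": {"include_in_dictionary": False},
}

def build_dictionary(records, fields):
    # Collect lowercase terms from every field flagged for the dictionary.
    words = set()
    for record in records:
        for field, value in record.items():
            if fields.get(field, {}).get("include_in_dictionary"):
                words.update(value.lower().split())
    return words

def suggest(term, dictionary):
    # Return the closest dictionary term, or None if nothing is close enough.
    matches = difflib.get_close_matches(term.lower(), sorted(dictionary), n=1, cutoff=0.6)
    return matches[0] if matches else None

records = [{"name": "Mesh Jacket", "item_number": "JK-1001"}]
dictionary = build_dictionary(records, FIELDS)
print(suggest("jaket", dictionary))
```

Because item_number is not flagged, "jk-1001" never enters the dictionary and cannot be offered as a correction.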
Strip HTML
The Strip HTML feature indicates to the engine that this field contains HTML and that the HTML should be removed. Some data fields are created with HTML to assist with the layout of the page. If the data field has HTML and it should be queried, the Strip HTML feature should be enabled. If it is important that the HTML be returned with the content for display, the field should be added twice: once with Strip HTML enabled for querying and once with the HTML left intact for display.
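A minimal sketch of stripping HTML from a field while keeping the original for display, using Python's standard html.parser module. The field names and sample markup are hypothetical, and the engine's own stripping logic may differ.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(value):
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    # Join the text pieces and collapse repeated whitespace.
    # Note: text split by inline tags ends up separated by a space.
    return " ".join(" ".join(parser.parts).split())

record = {"description": "<p>Lightweight <b>mesh</b> short</p>"}

indexed = {
    "description": record["description"],                   # kept as-is for display
    "description_text": strip_html(record["description"]),  # hypothetical query field
}
print(indexed["description_text"])
```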