Regular expressions in an entity extraction dictionary do not match expected terms on web pages

Summary: After dictionaries containing regular expressions are uploaded, the expected terms on web pages are not recognized as entities.

Cause: Regular expressions for entity recognition currently have the following limitations:

  • Regular expression patterns are matched against one word at a time. For example, the pattern "AZBU\s*\d{6}" does not match "AZBU 123456": the search appliance tests the pattern against "AZBU" and then against "123456", and because neither word matches on its own, the entity is not picked up.
  • Regular expression patterns match whole words only, not parts of words. For example, if a web page contains "AZBU123456" (no space), the pattern "AZBU\s*\d{6}" picks it up, as expected. If the page contains "AZBU123456cv" instead, the pattern does not match, because the search appliance currently matches only whole words.
  • This limitation applies when <store_regex_or_name> is set to regex, which is the default. Currently only the whole matched string can be stored; you cannot store part of it. For example, the pattern "AZBU\d{6}\w*" matches "AZBU123456cv", and the whole matched string "AZBU123456cv" is stored. Storing only "123456" is not currently possible.
  • As noted in the documentation, a regular expression is always case sensitive, so the "Case sensitive" flag on the Entity Recognition configuration page does not apply to regular expressions.
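
The first two limitations can be illustrated with a small Python sketch. This is not the search appliance's implementation: the find_entities helper is hypothetical, it assumes that words are separated by whitespace, and it uses Python's re module rather than RE2 (the two agree on the patterns used here).

```python
import re


def find_entities(pattern: str, text: str) -> list[str]:
    """Hypothetical model of the behavior described above.

    Assumption: the text is split into whitespace-separated words and
    the pattern must match an entire word (never across words, never
    only part of a word).
    """
    compiled = re.compile(pattern)
    return [word for word in text.split() if compiled.fullmatch(word)]


pattern = r"AZBU\s*\d{6}"

# Limitation 1: "AZBU" and "123456" are tested separately, so neither matches.
print(find_entities(pattern, "Container AZBU 123456 arrived"))   # []

# Without the space the whole word matches the pattern.
print(find_entities(pattern, "Container AZBU123456 arrived"))    # ['AZBU123456']

# Limitation 2: the trailing "cv" means the whole word no longer matches.
print(find_entities(pattern, "Container AZBU123456cv arrived"))  # []
```

Running a candidate pattern through a model like this before uploading a dictionary may help show whether it depends on matching across a space or inside a longer word.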

Resolution:

  • For the first and third limitations, keep in mind that a regular expression currently cannot match multiple words, and that the whole matched string is stored when <store_regex_or_name> is set to regex. These limitations will be addressed in Feature Requests 7991666 (Recognize multiple terms with regular expression for entity recognition) and 7723643 (Allow storing part of matched string in entity recognition regular expression).
  • The issue caused by the second limitation can be avoided by specifying a regular expression that matches the whole word. For example, the pattern "AZBU\d{6}" does not match the whole word "AZBU123456cv", but "AZBU\d{6}\w*" does.
  • If case-insensitive matching is desired, the (?i) flag can be used in the regular expression pattern itself. For example, "(?i)approver:[a-z]+" would match "Approver:ABC", "approver:xYZ", etc.
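
The two pattern-level workarounds above can be checked with a short Python sketch. As before, this only models the described behavior: find_entities is a hypothetical helper, whitespace-separated words are an assumption, and Python's re module stands in for RE2, which also accepts the (?i) flag.

```python
import re


def find_entities(pattern: str, text: str) -> list[str]:
    """Hypothetical model: whole-word matching on whitespace-separated words."""
    compiled = re.compile(pattern)
    return [word for word in text.split() if compiled.fullmatch(word)]


text = "Shipment AZBU123456cv signed Approver:ABC and approver:xYZ"

# Without \w* the pattern does not cover the trailing "cv".
print(find_entities(r"AZBU\d{6}", text))            # []

# Adding \w* lets the pattern match the whole word.
print(find_entities(r"AZBU\d{6}\w*", text))         # ['AZBU123456cv']

# Case-sensitive by default: neither word matches in full.
print(find_entities(r"approver:[a-z]+", text))      # []

# The inline (?i) flag makes the whole pattern case-insensitive.
print(find_entities(r"(?i)approver:[a-z]+", text))  # ['Approver:ABC', 'approver:xYZ']
```

Note that in this model, as in the third limitation, the value returned is always the whole matched word; "AZBU123456cv" cannot be reduced to "123456".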

Additional Information:

  1. Discovering and Indexing Entities
  2. Help Center page for entity recognition
  3. RE2 syntax (this is the version of regular expression syntax that the entity recognition feature is based on)