pattern.search

The pattern.search module offers a pattern matching system similar to regular expressions. It can be used to search a string syntactically (word function) or semantically (word meaning), using a taxonomy.

It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.


Documentation

 


Searching + matching in a nutshell

The search() function takes a word (or a sequence of words) that you want to retrieve from a string. It returns a list of non-overlapping matches. The match() function returns the first match (or None). The compile() function returns a Pattern object with a Pattern.search() and Pattern.match() method.

search(pattern, sentence)
match(pattern, sentence)
compile(pattern)
>>> from pattern.search import search
>>> print search('rabbit', 'big white rabbit')

[Match(words=[Word('rabbit')])]

Search words can contain wildcard characters (*front, back*, *both*, in*between):

>>> print search('rabbit*', 'big white rabbit')
>>> print search('rabbit*', 'big white rabbits')

[Match(words=[Word('rabbit')])]
[Match(words=[Word('rabbits')])]

Search words can be composed of different options separated by a vertical bar:

>>> print search('rabbit|cony|bunny', 'big black bunny')

[Match(words=[Word('bunny')])]
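For plain strings, these simple patterns behave much like small regular expressions. Below is a stdlib-only sketch of how wildcards and options could be translated to re (to_regex and simple_search are hypothetical helpers, not the pattern implementation):

```python
import re

def to_regex(pattern):
    # Hypothetical helper: '*' becomes a run of word characters,
    # '|' separates alternative words.
    parts = [re.escape(p).replace(r'\*', r'\w*') for p in pattern.split('|')]
    return re.compile(r'\b(?:%s)\b' % '|'.join(parts), re.IGNORECASE)

def simple_search(pattern, sentence):
    # Return all non-overlapping matching words, in the spirit of search().
    return to_regex(pattern).findall(sentence)
```

For example, simple_search('rabbit*', 'big white rabbits') yields ['rabbits'].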

 

Syntactical pattern matching

The above can also be accomplished with regular expressions (which is faster). It gets more interesting when we involve the parser from the pattern.en module. The parser takes a string, identifies words and sentences, and assigns a part-of-speech tag to each word, for example: NN (noun), VB (verb) or JJ (adjective). It also groups words into chunks; an NP (noun phrase), for example, consists of successive NN, JJ, DT (determiner), ... words.

A parsed Sentence or Text can be searched for part-of-speech tags:

>>> from pattern.search import search
>>> from pattern.en import parsetree
>>> t = parsetree("big white rabbit")
>>> print t
>>> print
>>> print search('JJ', t) # all adjectives
>>> print search('NN', t) # all nouns
>>> print search('NP', t) # all noun phrases

[Sentence('big/JJ/B-NP/O white/JJ/I-NP/O rabbit/NN/I-NP/O')]

[Match(words=[Word(u'big/JJ')]), Match(words=[Word(u'white/JJ')])]
[Match(words=[Word(u'rabbit/NN')])]
[Match(words=[Word(u'big/JJ'), Word(u'white/JJ'), Word(u'rabbit/NN')])]
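A parsed sentence can be pictured as a list of (word, tag) pairs; searching for a tag is then a filter over those pairs. A minimal sketch (search_tag is a hypothetical stand-in, not the pattern API):

```python
from fnmatch import fnmatch

def search_tag(tag, tagged):
    # `tagged` stands in for a parsed sentence: a list of (word, tag) tuples.
    # The tag may contain a wildcard, e.g. 'NN*' also matches NNS, NNP.
    return [w for w, t in tagged if fnmatch(t, tag)]
```

For example, search_tag('JJ', [('big', 'JJ'), ('white', 'JJ'), ('rabbit', 'NN')]) yields ['big', 'white'].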

 

Semantical pattern matching

The pattern.search module includes a Taxonomy class that can be used to define semantic categories for words. Assume that you want to extract flower names from a text. Enumerating them all in a single pattern quickly becomes long and clumsy, e.g.: compile("rose|lily|daisy|daffodil|begonia").

A more robust approach is to work with the taxonomy:

>>> from pattern.search import taxonomy, search
>>> from pattern.en import parsetree
>>> for f in ('rose', 'lily', 'daisy', 'daffodil', 'begonia'):
>>>     taxonomy.append(f, type='flower')
>>> t = parsetree('A field of white daffodils.', lemmata=True)
>>> print t
>>> print
>>> print search('FLOWER', t)

[Sentence('A/DT/B-NP/O/a field/NN/I-NP/O/field of/IN/B-PP/B-PNP/of' 
          'white/JJ/B-NP/I-PNP/white daffodils/NNS/I-NP/I-PNP/daffodil ././O/O/.')]

[Match(words=[Word(u'white/JJ'), Word(u'daffodils/NNS')])]

Since the pattern has "FLOWER" in uppercase, it is recognized as the taxonomic category (rose + lily + daisy + daffodil + begonia) instead of the word "flower". The parser will compute word lemmata, so that "daffodils" is recognized as the plural form of "daffodil" (the lemma), and therefore part of FLOWER.

Note that the returned match is actually "white daffodils". This is because (by default) the system assumes that you are underspecifying: white daffodils is a more specific version of daffodil, so the whole chunk is returned.
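The two steps involved, lemma lookup and chunk expansion, can be pictured with a short stdlib sketch (match_category and the naive lemmatizer are hypothetical, not the pattern API):

```python
FLOWER = {'rose', 'lily', 'daisy', 'daffodil', 'begonia'}

# Naive lemmatizer for the example: strip a plural -s.
lemma = lambda w: w[:-1] if w.endswith('s') else w

def match_category(chunk, category):
    # If the lemma of any word in the chunk is in the category,
    # the whole (underspecified) chunk is returned.
    if any(lemma(w) in category for w in chunk):
        return chunk
    return None
```

Here match_category(['white', 'daffodils'], FLOWER) returns the whole chunk, because the lemma of "daffodils" is in FLOWER.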

 


Pattern

A Pattern is a sequence of constraints that matches certain phrases in a (parsed) sentence. Each Constraint can match a word in the sentence. If a number of successive words corresponds to the entire sequence of constraints, the phrase is a match.

Constraints can be constructed for grammatical syntax (e.g., any adjective) and semantics (e.g., any product). For example: Pattern.fromstring("NP be * than NP"), where NP means any noun phrase, matches phrases such as "a cat is faster than a dog" and "G. I. Joe was cooler than He-Man". 

pattern = Pattern(sequence=[])
pattern = Pattern.fromstring(string, taxonomy=TAXONOMY)
pattern.sequence            # List of Constraint objects.
pattern.groups              # List of groups, each a list of Constraint objects.
pattern.strict              # Disable underspecification (chunk head matching)?
pattern.greedy              # lambda chunk, constraint: True 
pattern.search(sentence)
pattern.match(sentence, start=0)
  • Pattern.search() returns a list of Match objects from the given sentence.
  • Pattern.match() returns the first Match found in the given sentence, or None.
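The core idea, sliding a sequence of constraints over the words of a sentence, can be sketched in a few lines (a toy stand-in over (word, tag) pairs, not the real implementation):

```python
def sequence_search(constraints, tagged):
    # Each constraint is a callable (word, tag) -> bool.
    # Returns non-overlapping spans where all constraints match in order.
    matches, i = [], 0
    while i <= len(tagged) - len(constraints):
        span = tagged[i:i + len(constraints)]
        if all(c(w, t) for c, (w, t) in zip(constraints, span)):
            matches.append([w for w, _ in span])
            i += len(constraints)  # skip past the match: non-overlapping
        else:
            i += 1
    return matches
```

With constraints for "JJ NN" and the tagged words of "big white rabbit", this yields [['white', 'rabbit']].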

 


Pattern constraint

A constraint matches a range of (tagged) words and taxonomy terms. For example:

  • Constraint.fromstring('with|of') matches either "with" or "of".
  • Constraint.fromstring('JJ?') matches any adjective tagged JJ, but it is optional.
  • Constraint.fromstring('NP|SBJ') matches subject noun phrases.
  • Constraint.fromstring('QUANTITY|QUALITY') matches quantity-type and quality-type taxa.
constraint = Constraint(
     words = [], 
      tags = [], 
    chunks = [], 
     roles = [], 
      taxa = [], 
  optional = False, 
  multiple = False, 
     first = False,
  taxonomy = TAXONOMY,
   exclude = None)
constraint = Constraint.fromstring(string, **kwargs)
constraint.index
constraint.string
constraint.words            # List of allowed words/lemmata (of, with, ...)
constraint.tags             # List of allowed parts-of-speech (NN, JJ, ...)
constraint.chunks           # List of allowed chunk types (NP, VP, ...)
constraint.roles            # List of allowed chunk roles (SBJ, OBJ, ...)
constraint.taxa             # List of allowed word categories.
constraint.taxonomy         # Taxonomy used for lookup.
constraint.optional         # True => matches zero or one word.
constraint.multiple         # True => matches one or more words.
constraint.first            # True => can only match first word.
constraint.exclude          # None, or Constraint of disallowed options.
constraint.match(word)

Constraint string syntax

Constraint.fromstring() returns a new Constraint from the given string. It takes the same optional parameters as the constructor. Uppercase words in the given string indicate either a part-of-speech tag (e.g. NN, JJ, VP) or a taxonomy term (e.g. PRODUCT, PERSON); see the parser documentation for an overview of all part-of-speech tags. The search itself is case-insensitive.

Some characters, like | or ?, have special meaning. They affect how the words and tags around them are interpreted:

Character   Example                      Description
(           (JJ)                         Wrapper for an optional constraint (deprecated).
[           [Mac OS X | Windows Vista]   Wrapper for a constraint that contains spaces.
{           DT {JJ?+} NN                 Wrapper for a match group, e.g., Match.group(1).
_           Windows_Vista                Converted to a space.
|           ADJP|ADVP                    Separator for different options.
*           soft*|JJ*                    Wildcard character.
!           !be|VB*                      Prefix for words/tags that are not allowed.
?           JJ?                          Suffix: the constraint is optional.
+           RB|JJ+ or JJ?+ or *+         Suffix: the constraint can span multiple words.
^           ^hello                       Prefix: the constraint can only match the first word.

These characters need to be escaped if used as content: "\?".
You can use the module's escape() function: escape("hello?") => "hello\?"
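A minimal escape() along these lines (a sketch; which characters the module's own function covers is an assumption here):

```python
SPECIAL = '([{_|!*?+^'

def escape(string):
    # Backslash-escape characters with special meaning in the
    # constraint syntax, so they are matched literally.
    return ''.join('\\' + ch if ch in SPECIAL else ch for ch in string)
```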

Constraint matching

Constraint.match() returns True if the given string or Word is part of the constraint:

  • the word (or its lemma) occurs in Constraint.words, OR,
  • the word (or its lemma) occurs in Constraint.taxa taxonomy tree, AND
  • the word and/or chunk tags match those defined in the constraint.

The search is case-insensitive. Individual terms in Constraint.words can contain wildcards (*). Some part-of-speech-tags can also contain wildcards: NN*, VB*, JJ*, RB*.

The following example demonstrates the use of optional and multiple constraints:

>>> from pattern.search import Pattern
>>> from pattern.en import parsetree
>>> t = parsetree('tasty cat food')
>>> p = Pattern.fromstring('DT? RB? JJ? NN+')
>>> print t
>>> print
>>> print p.search(t)

[Sentence('tasty/JJ/B-NP/O cat/NN/I-NP/O food/NN/I-NP/O')]

[Match(words=[Word(u'tasty/JJ'), Word(u'cat/NN')]), 
 Match(words=[Word(u'food/NN')])]

The pattern looks for successive nouns (NN), but it will also include a preceding determiner (DT), adverb (RB) and adjective (JJ). It matches anything from food to cat food, tasty food, the tasty cat food, etc.
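The correspondence with regular expressions is direct if we write the pattern over the sequence of tags: ? becomes an optional group and + a repetition. A sketch of 'DT? RB? JJ? NN+' in stdlib re (note that the real matcher also respects chunk boundaries, so its grouping can differ):

```python
import re

def tag_spans(tags):
    # `tags` is the sentence's tags as a space-separated string.
    return [m.group().split() for m in
            re.finditer(r'(?:DT )?(?:RB )?(?:JJ )?NN(?: NN)*', tags)]
```

For example, tag_spans('DT JJ NN') yields [['DT', 'JJ', 'NN']].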

Constraint underspecification

The parser in pattern.en groups words that belong together into chunks. For example, the black cat is one chunk, tagged NP (noun phrase). The head of the chunk is cat. By default, the pattern matching algorithm compares the words in a Sentence by their chunk head. Thus, if you look for cat and the sentence has a big black cat, the entire chunk is returned. The extra words in the chunk carry important information: how the cat looks and feels, an opinion (my silly cat), etc. Expert users can tweak which chunks are matched with a custom Pattern.greedy function.
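Chunk head matching can be pictured with a short sketch, using the simplifying assumption that the head of a noun phrase is its last word (search_head is hypothetical, not the pattern API):

```python
def search_head(word, chunks):
    # `chunks` stands in for a chunked sentence: (chunk tag, words) tuples.
    # A match on the head returns the whole chunk.
    return [words for tag, words in chunks if words[-1] == word]
```

For example, search_head('cat', [('NP', ['the', 'black', 'cat']), ('VP', ['is', 'lurking'])]) returns [['the', 'black', 'cat']].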

The behavior can be disabled by passing a STRICT flag to Pattern, compile(), search() or match():

>>> from pattern.search import Pattern
>>> from pattern.en import parsetree
>>> t = parsetree('The black cat is lurking in the tree.')
>>> p = Pattern.fromstring('cat')
>>> print t
>>> print
>>> print p.search(t)

[Sentence('The/DT/B-NP/O black/JJ/I-NP/O cat/NN/I-NP/O is/VBZ/B-VP/O'
          'lurking/VBG/I-VP/O in/IN/B-PP/B-PNP the/DT/B-NP/I-PNP' 
          'tree/NN/I-NP/I-PNP ././O/O')]

[Match(words=[Word(u'The/DT'), Word(u'black/JJ'), Word(u'cat/NN')])]

>>> from pattern.search import STRICT
>>> p = Pattern.fromstring('cat', STRICT) # Or: compile('cat', STRICT)
>>> print p.search(t)

[Match(words=[Word(u'cat/NN')])]

 


Pattern matching

Pattern.search() returns a list of Match objects, where each match is a list of successive Word objects.

match = Match(pattern, words=[])
match.pattern               # Pattern source.
match.words                 # List of Word objects.
match.string                # String of words separated with a space.
match.start                 # Index of first word in sentence.
match.stop                  # Index of last word in sentence + 1.
match.group(index, chunked=False)
match.constraint(word)
match.constraints(chunk)
match.constituents(constraint=None)
  • Match.group() returns a list of Words matching the constraints in a { } group.
  • Match.constraint() returns the Constraint that matched the given Word, or None.
  • Match.constraints() returns the list of constraints that matched the given Chunk.
  • Match.constituents() returns a list of Word and Chunk objects, with successive words grouped into their chunks whenever possible. Optionally, returns only chunks/words that matched the given Constraint, (or list of constraints or constraint indices). Chunks are only available if a Sentence or Text was given (i.e., not for plain strings).
>>> from pattern.search import match
>>> from pattern.en import parsetree
>>> t = parsetree('the turtle was faster than the hare', lemmata=True)
>>> m = match('NP be more? ADJP|ADVP than NP', t)
>>> print t
>>> print
>>> for w in m.words:
>>>    print w, '\t =>', m.constraint(w)

[Sentence('the/DT/B-NP/O/the turtle/NN/I-NP/O/turtle was/VBD/B-VP/O/be'
          'faster/RBR/B-ADVP/O/faster than/IN/B-PP/B-PNP/than the/DT/B-NP/I-PNP/the'
          'hare/NN/I-NP/I-PNP/hare')]

Word(u'the/DT')     => Constraint(chunks=['NP'])
Word(u'turtle/NN')  => Constraint(chunks=['NP'])
Word(u'was/VBD')    => Constraint(words=['be'])
Word(u'faster/RBR') => Constraint(chunks=['ADJP', 'ADVP'])
Word(u'than/IN')    => Constraint(words=['than'])
Word(u'the/DT')     => Constraint(chunks=['NP'])
Word(u'hare/NN')    => Constraint(chunks=['NP'])

Match groups offer a convenient way to retrieve what you need from a Match:

>>> t = parsetree('the big black dog')
>>> m = match('DT {JJ?+ NN}', t)
>>> print m.group(0) # full pattern
>>> print m.group(1) # {JJ?+ NN}
>>> print m.group(1).string

[Word(u'the/DT'), Word(u'big/JJ'), Word(u'black/JJ'), Word(u'dog/NN')]
[Word(u'big/JJ'), Word(u'black/JJ'), Word(u'dog/NN')]
'big black dog'

Each Word in a Match or Match.group() has the following attributes:

word = Word(sentence, string, tag=None, index=0)
word.string
word.tag                    # Part-of-speech tag (e.g. NN, JJ).
word.sentence               # Sentence (a list of successive Words).
word.index                  # Sentence index.

If search() or match() is called with a string, Word objects are created when the Match is returned. When called with a parsed Sentence, Word objects are linked from the sentence. These have extra attributes. For an overview of what is possible with Sentence, Chunk and Word, see the parser documentation.

 


Taxonomy

A taxonomy is a hierarchical tree of words classified by semantic type. For example: begonia is a flower, a flower is a plant. Taxonomy terms can be used in a constraint: "FLOWER" will match "flower" as well as "begonia", or any other flower that has been defined in the taxonomy. By default, constraints will look up terms in the global taxonomy.

taxonomy = Taxonomy()
taxonomy.case_sensitive     # False by default.
taxonomy.classifiers        # List of Classifier objects.
taxonomy.append(term, type=None, value=None)
taxonomy.remove(term)
taxonomy.classify(term)
taxonomy.parents(term, recursive=False)
taxonomy.children(term, recursive=False)
taxonomy.value(term)
  • Taxonomy.classify() returns the (most recently added) semantic type for a given term.
    If the term is not in the dictionary, tries Taxonomy.classifiers.
  • Taxonomy.parents() returns a list of all semantic types for the given term.
    Taxonomy.children() returns a list of all terms of the given semantic type.
    With recursive=True, traverses the entire branch.

For example:

>>> from pattern.search import taxonomy, search
>>> taxonomy.append('chicken', type='food')
>>> taxonomy.append('chicken', type='bird')
>>> taxonomy.append('penguin', type='bird')
>>> taxonomy.append('bird', type='animal')
>>> print taxonomy.parents('chicken')
>>> print taxonomy.children('animal', recursive=True)
>>> print
>>> print search('FOOD', "I'm eating chicken.")

['bird', 'food']
['bird', 'penguin', 'chicken']

[Match(words=[Word('chicken')])]
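Under the hood a taxonomy is little more than a term-to-types mapping with recursive traversal. A dict-based sketch of the children() behavior above (not the Taxonomy implementation):

```python
def children(term, taxonomy, recursive=False):
    # `taxonomy` maps each term to the list of types it belongs to.
    found = [t for t, types in taxonomy.items() if term in types]
    if recursive:
        for t in list(found):
            found += children(t, taxonomy, recursive=True)
    return found

taxonomy = {'chicken': ['food', 'bird'], 'penguin': ['bird'], 'bird': ['animal']}
```

Here sorted(children('animal', taxonomy, recursive=True)) gives ['bird', 'chicken', 'penguin'].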

Taxonomy classifiers

A Classifier offers a rule-based approach to enrich the taxonomy. If a term is not in the taxonomy, the taxonomy will go over the list of its classifiers. Each classifier is a group of functions that can be customized to yield the semantic type for a given set of terms.

classifier = Classifier(
     parents = lambda term: [], 
    children = lambda term: [], 
       value = lambda term: None)
classifier.parents          # Returns list of parents for a term.
classifier.children         # Returns list of children for a term.
classifier.value            # Returns value for a term, or None.

This is useful since taxonomy terms can't include wildcards (the * character is taken literally). The example below creates a classifier that tags any word ending in -ness as QUALITY. This is more concise than manually adding roughness, sharpness, ... to the taxonomy.

>>> from pattern.search import Classifier, taxonomy, search
>>> c = Classifier(parents=lambda s: s.endswith('ness') and ['quality'] or [])
>>> taxonomy.classifiers.append(c)
>>> taxonomy.append('chicken', type='animal')
>>> print search('QUALITY of a|an|the ANIMAL', 'the spryness of a chicken')

[Match(words=[Word('spryness'), Word('of'), Word('a'), Word('chicken')])]

If you plan to use classifiers for pattern matching you need to implement Classifier.parents() (Classifier.children() is not called in Constraint.match() for performance).

Finally, this example creates a rule-based taxonomy from the pattern.en.wordnet module:
>>> from pattern.search import taxonomy, WordNetClassifier
>>> from pattern.en import wordnet
>>> taxonomy.classifiers.append(WordNetClassifier(wordnet))
>>> print taxonomy.parents('duck', pos='NN')
>>> print taxonomy.parents('duck', pos='VB')

['anseriform bird']
['move']

 


Useful list functions

The pattern.search module has a number of interesting list functions:

unique(list)          # Returns a new list with unique items.
find(function, list)  # Returns first item for which function(item) is True.
combinations(list, n) # Returns a generator of all combinations of length n.
variations(list, optional=lambda item: False)
odict(data, reversed=False)
  • combinations() returns a generator of all combinations of length n of the items in the list.
    For example: list(combinations([1, 2, 3], n=2)) yields:
    [[1,1], [1,2], [1,3], [2,1], [2,2], [2,3], [3,1], [3,2], [3,3]].
  • variations() returns all variations of a list containing optional items (in-order).
  • odict() is a dictionary with ordered keys – like a stack.
    With reversed=True, the latest keys will be returned first when traversing the dictionary.
    odict.append() takes a (key, value)-tuple and sets the given key to the given value. If the key exists, pushes the updated item to the head (or tail) of the dictionary.
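In stdlib terms, the combinations-with-repetition described above corresponds to itertools.product, and odict can be pictured with collections.OrderedDict (sketches of the documented behavior, not the module's code):

```python
from collections import OrderedDict
from itertools import product

def combinations(items, n):
    # Length-n combinations with repetition: the n-fold Cartesian product.
    return [list(c) for c in product(items, repeat=n)]

class odict(OrderedDict):
    # append() sets a (key, value)-tuple; an existing key is pushed
    # back to the tail by re-inserting it.
    def append(self, item):
        k, v = item
        if k in self:
            del self[k]
        self[k] = v
```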