pattern.en

The pattern.en module contains a fast, regular expressions-based tagger/chunker (identifies nouns, adjectives, verbs, etc. in a sentence), a WordNet interface and tools for sentiment analysis, verb conjugation and noun singularization & pluralization.

It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.


Documentation

 


Indefinite article

The article is the most common determiner (DT) in English. It defines whether the noun following it is definite (the cat) or indefinite (a cat). The definite article is always the. The indefinite article can be either a or an, depending on how the word that follows it is pronounced.

article(word, function=INDEFINITE)   # DEFINITE | INDEFINITE
referenced(word, article=INDEFINITE) # Returns article + word.
>>> from pattern.en import referenced
>>> print referenced('university')
>>> print referenced('hour')

a university
an hour

Reference: Granger, M. (2006). Ruby Linguistics Framework, http://deveiate.org/projects/Linguistics

 


Pluralization + singularization

The pluralize() function returns the plural form of a singular noun. It handles 96% of exceptions correctly. The singularize() function returns the singular form of a plural noun. The pos parameter (part-of-speech) can be set to NOUN or ADJECTIVE, but only a small number of possessive adjectives inflect (e.g. my => our). The custom dictionary is for user-defined replacements.

pluralize(word, pos=NOUN, custom={}, classical=True)
singularize(word, pos=NOUN, custom={})
>>> from pattern.en import pluralize, singularize
>>> print pluralize('child')
>>> print singularize('wolves')

children
wolf

Reference:
Conway, D. (1998). An Algorithmic Approach to English Pluralization.
Ferrer, B. (2005). Inflector for Python, http://www.bermi.org/projects/inflector

 


Comparative + superlative

The comparative() and superlative() functions return the comparative or superlative form of an adjective. Words with three or more syllables are simply preceded by more or most.

comparative(adjective)      # big => bigger
superlative(adjective)      # big => biggest
>>> from pattern.en import comparative, superlative
>>> print comparative('bad')
>>> print superlative('bad')

worse
worst

 


Verb conjugation

The pattern.en module has a lexicon of 8,500 common English verbs and their conjugated forms (infinitive, 3rd singular present, present participle, past and past participle – verbs such as be have other forms as well). The following verbs can be negated: be, can, do, will, must, have, may, need, dare, ought.

conjugate(verb, 
    tense = PRESENT,        # INFINITIVE, PRESENT, PAST, FUTURE
   person = 3,              # 1, 2, 3 or None
   number = SINGULAR,       # SG, PL
     mood = INDICATIVE,     # INDICATIVE, IMPERATIVE, CONDITIONAL, SUBJUNCTIVE
   aspect = IMPERFECTIVE,   # IMPERFECTIVE, PERFECTIVE, PROGRESSIVE 
  negated = False)          # True or False
lemma(verb)                 # Base form, e.g. are => be.
lexeme(verb)                # List of possible forms: be => is, was, ...
tenses(verb)                # List of possible tenses of the given form.

The conjugate() function takes the following optional parameters:

Tense       Person  Number  Mood        Aspect        Alias    Tag  Example
INFINITIVE  None    None    None        None          "inf"    VB   be
PRESENT     1       SG      INDICATIVE  IMPERFECTIVE  "1sg"    VBP  I am
PRESENT     2       SG      INDICATIVE  IMPERFECTIVE  "2sg"         you are
PRESENT     3       SG      INDICATIVE  IMPERFECTIVE  "3sg"    VBZ  he is
PRESENT     None    PL      INDICATIVE  IMPERFECTIVE  "pl"          are
PRESENT     None    None    INDICATIVE  PROGRESSIVE   "part"   VBG  being

PAST        None    None    None        None          "p"      VBD  were
PAST        1       SG      INDICATIVE  IMPERFECTIVE  "1sgp"        I was
PAST        2       SG      INDICATIVE  IMPERFECTIVE  "2sgp"        you were
PAST        3       SG      INDICATIVE  IMPERFECTIVE  "3sgp"        he was
PAST        None    PL      INDICATIVE  IMPERFECTIVE  "ppl"         were
PAST        None    None    INDICATIVE  PROGRESSIVE   "ppart"  VBN  been

Instead of the optional parameters, a single short alias, a part-of-speech tag, or PARTICIPLE / PAST+PARTICIPLE can also be given. With no parameters, the infinitive form of the verb is returned.

For example:

>>> from pattern.en import conjugate, lemma, lexeme, tenses, PAST, PL
>>> print lexeme('purr')
>>> print lemma('purring')
>>> print conjugate('purred', '3sg')
>>> print PAST in tenses('purred') # 'p' in tenses() also works.
>>> print (PAST, 1, PL) in tenses('purred')

['purr', 'purrs', 'purring', 'purred']
purr
purrs
True
True
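
A minimal sketch of the alias notation and negation, based on the table above (only the verbs listed above support negation; the negated output is an assumption):

>>> from pattern.en import conjugate
>>> print conjugate('be', '1sg')                # Present, 1st person singular.
>>> print conjugate('be', 'part')               # Present participle.
>>> print conjugate('be', '1sg', negated=True)  # Negated form (assumed output).

am
being
am not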

Reference: XTAG English morphology (1999), University of Pennsylvania, http://www.cis.upenn.edu/~xtag

 

Rule-based conjugation

All of the verb functions have an optional parse=True parameter that enables a rule-based parser for unknown verbs. This does not work for irregular verbs, however, and it is error-prone for verbs ending in -e in the past tense or present participle (overall accuracy is about 91%).

With parse=False, conjugate() and lemma() yield None for unknown verbs.

>>> from pattern.en import conjugate, VERBS, PARTICIPLE
>>> print 'googled' in VERBS
>>> print conjugate('googled', tense=PARTICIPLE, parse=False)
>>> print conjugate('googled', tense=PARTICIPLE, parse=True)

False
None
googling

 


Quantification

The number() function returns a float or int parsed from the given (numeric) string. If no number can be parsed from the string, it returns 0.

The numerals() function spells out the given int or float as a string of words. By default, the fraction is rounded to two decimals. Because of this rounding, number(numerals(x)) == x is not always True.

The quantify() function returns a wordcount approximation. Two similar words are a pair, three to eight several, and so on. Words can be given as a list, a word → count dictionary, or a string + amount.

The reflect() function quantifies Python objects – see the examples bundled with the module.

number(string)              # "seventy-five point two" => 75.2
numerals(n, round=2)        # 2.245 => "two point twenty-five"
quantify([word1, word2, ...], plural={})
reflect(object, quantify=True, replace=[])
>>> from pattern.en import quantify
>>> print quantify(['goose', 'goose', 'duck', 'chicken', 'chicken', 'chicken'])
>>> print quantify('carrot', amount=1000)
>>> print quantify({'carrot': 100, 'parrot': 20})

several chickens, a pair of geese and a duck
hundreds of carrots
dozens of carrots and a score of parrots
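
A short round-trip sketch of number() and numerals(), using the values from the comments in the signature block above (assuming the default round=2):

>>> from pattern.en import number, numerals
>>> print number("seventy-five point two")
>>> print numerals(2.245, round=2)

75.2
two point twenty-five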

 


Spelling

The spelling() function returns a list of spelling suggestions for a given word. Each suggestion is a (word, confidence)-tuple. It is about 70% accurate.

spelling(string)
>>> print spelling("adress")

[("dress", 0.64), ("address", 0.36)]

Reference: Norvig, P. (2007). How to Write a Spelling Corrector. http://norvig.com/spell-correct.html 

 


n-grams

The ngrams() function returns a list of n-grams (i.e., tuples of n successive words) from the given string. Alternatively, you can supply a Text or Sentence object (see further). n-grams will not run over sentence markers (i.e., .!?), unless continuous is True.

ngrams(string, n=3, continuous=False)
>>> from pattern.en import ngrams
>>> print ngrams("I am eating a pizza.", n=3)

[('I', 'am', 'eating'), ('am', 'eating', 'a'), ('eating', 'a', 'pizza')] 

 


Parser

The core of the pattern.en module is a rule-based shallow parser. To a machine, a text document is nothing more than a string of characters. A shallow parser adds meaning by distinguishing between abbreviation periods and sentence breaks, by adding part-of-speech tags to words (is can used as a noun or a verb in this sentence?) and by grouping words that belong together (chunking).

The parser uses a regular expressions-based approach, which is fast but not always accurate. The parse() function and the Text, Sentence, Chunk and Word objects (discussed in the next section) are identical to those in MBSP – a shallow parser that uses a statistical machine learning approach. It is more robust, but slower. Output from both parsers can be used in the pattern.search and pattern.vector modules.

The parse() function takes a string of text and returns a tagged Unicode string.
Sentences in the output are separated by newline characters.

parse(string, 
   tokenize = True,         # Separate punctuation from words?
       tags = True,         # Parse part-of-speech tags? (NN, JJ, ...)
     chunks = True,         # Parse chunks? (NP, VP, PNP, ...)
  relations = False,        # Find relations? (SBJ, OBJ, ...)
    lemmata = False,        # Find word lemmata? (ate => eat)
   encoding = 'utf-8',      # Default string encoding?
    default = 'NN',         # Default part-of-speech tag.
      light = False)        # True => disables contextual rules.

For example:

>>> from pattern.en import parse
>>> print parse('I eat pizza with a fork.')

I/PRP/B-NP/O eat/VBP/B-VP/O pizza/NN/B-NP/O with/IN/B-PP/B-PNP a/DT/B-NP/I-PNP
fork/NN/I-NP/I-PNP ././O/O

Each token (i.e. tagged word) in a sentence has a number of annotations: tags=True includes the word part-of-speech tag, chunks=True the chunk tag + PNP tag (prepositional noun phrase). With tokenize set to False, no tokenization is carried out (the input string is expected to be tokenized). The encoding parameter defines the character encoding of the input string.

The parser is built on a Brill lexicon of tagged words and rules that improve the tags based on context. With light=False, it uses Brill's contextual rules. With light=True, it uses Jason Wiener's simpler ruleset, which is 5-10x faster but about 25% less accurate.

Reference: Brill, E. (1992). A simple rule-based part of speech tagger. ANLC '92 Proceedings.

Parser tags

Let's examine the word fork and the tags assigned by the parser in the example above:

word   part-of-speech   chunk   pnp
fork   NN               I-NP    I-PNP

The word's part-of-speech tag is NN, which means that it is a noun. The word occurs in a NP chunk, a noun phrase (a fork). It is also part of a prepositional noun phrase (with a fork).

Common part-of-speech tags include NN (noun), JJ (adjective) and VB (verb).
Common chunk tags include NP (noun phrase) and VP (verb phrase).
Common relations include SBJ (subject) and OBJ (object).

The Penn Treebank II tags page gives an overview of all the possible tags generated by the parser.

Parser shortcuts

The tag() function returns a list of (word, POStag)-tuples. With light=True, this is the fastest and simplest way to get an idea of a sentence's constituents.

tag(string, tokenize=True, encoding='utf-8')
>>> from pattern.en import tag
>>> for word, pos in tag('The cat felt happy.', light=True):
>>>     if pos == "JJ": # Retrieve all adjectives from the input string.
>>>         print word

happy

Parser output

The output of the parse() function is a string of sentences in which each token has been annotated with the requested tags. The pprint() function (extra p is for pretty) gives a good overview of the tags:

>>> from pattern.en import parse, pprint
>>> s = parse('I ate pizza.', relations=True, lemmata=True)
>>> pprint(s) 

    WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA
       I   PRP    NP      SBJ    1      -      i   
     ate   VBP    VP      -      1      -      eat         
   pizza   NN     NP      OBJ    1      -      pizza         
       .   .      -       -      -      -      .        

The output string is a TaggedString object that behaves as a Python string, but with a TaggedString.split() method that yields (without parameters) a list of sentences, where each sentence is a list of tokens, in which each token is a list of the word + its tags.
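
A minimal sketch of TaggedString.split(); the nested lists below illustrate the format (word + its tags) rather than guaranteed tagger output:

>>> from pattern.en import parse
>>> s = parse('I ate pizza.')
>>> print s.split()

[[['I', 'PRP', 'B-NP', 'O'],
  ['ate', 'VBP', 'B-VP', 'O'],
  ['pizza', 'NN', 'B-NP', 'O'],
  ['.', '.', 'O', 'O']]]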

If you want to analyze the output (i.e. examine the relations between words and groups of words), the most convenient way is to construct a parse tree from the output.

 


Parse trees

A parse tree stores a tagged string as a network of linked Python objects that can be traversed to analyze the constituents in the text. The parsetree() function takes the same parameters as parse() and returns a Text() object. A Text is a list of Sentence objects. Each Sentence consists of Word objects. Word objects are also grouped in Chunk objects, which are related to other Chunk objects in various ways.

parsetree(string,
   tokenize = True,         # Separate punctuation from words?
       tags = True,         # Parse part-of-speech tags? (NN, JJ, ...)
     chunks = True,         # Parse chunks? (NP, VP, PNP, ...)
  relations = False,        # Find relations? (SBJ, OBJ, ...)
    lemmata = False,        # Find word lemmata? (ate => eat)
   encoding = 'utf-8',      # Default string encoding?
    default = 'NN',         # Default part-of-speech tag.
      light = False)        # True => disables contextual rules.

We'll run the sentence "The cat sat on the mat." through the parse tree:

>>> from pattern.en import parsetree
>>> s = parsetree('The cat sat on the mat.', relations=True, lemmata=True)
>>> print repr(s)

[Sentence(
 'The/DT/B-NP/O/NP-SBJ-1/the 
  cat/NN/I-NP/O/NP-SBJ-1/cat 
  sat/VBD/B-VP/O/VP-1/sit 
  on/IN/B-PP/B-PNP/O/on 
  the/DT/B-NP/I-PNP/O/the 
  mat/NN/I-NP/I-PNP/O/mat 
  ././O/O/O/O/.')]
>>> print s[0].chunks

[Chunk('The cat/NP-SBJ-1'), 
 Chunk('sat/VP-1'), 
 Chunk('on/PP'), 
 Chunk('the mat/NP')]

If you already have a tagged string from parse() you can transform it to a Text with the split() function. In effect, parsetree() = parse() + split(). The tagged string might have been previously stored in a file, for example.
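For example, a tagged string (possibly read back from a file) can be turned into a parse tree with the Text constructor described below. A minimal sketch:

>>> from pattern.en import parse, Text
>>> s = parse('The cat sat on the mat.', relations=True, lemmata=True)
>>> # The tagged string s could be saved to and loaded from a file here.
>>> t = Text(s)
>>> print t.sentences[0].chunks

[Chunk('The cat/NP-SBJ-1'), Chunk('sat/VP-1'), Chunk('on/PP'), Chunk('the mat/NP')]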

Text

A Text is a list of Sentence objects (i.e. you can do: for sentence in text).

text = Text(taggedstring)
text = Text.from_xml(xml)  # Reads an XML-string generated with Text.xml.
text.string                # 'The cat sat on the mat .'
text.sentences             # [Sentence('The cat sat on the mat .')]
text.copy()
text.xml

Sentence

A Sentence is a list of Word objects, with attributes + methods that organize words in Chunk objects.

sentence = Sentence(taggedstring)
sentence = Sentence.from_xml(xml) 
sentence.parent            # Sentence parent (for a Slice), or None.
sentence.id                # Unique for each sentence.
sentence.start             # 0
sentence.stop              # Sentence length.
sentence.string            # Tokenized string, without tags.
sentence.words             # List of Word objects.
sentence.chunks            # List of Chunk objects.
sentence.subjects          # List of NP-SBJ chunks.
sentence.objects           # List of NP-OBJ chunks.
sentence.verbs             # List of VP chunks.
sentence.relations         # {'SBJ': {1: Chunk('the cat/NP-SBJ-1')},
                           #   'VP': {1: Chunk('sat/VP-1')},
                           #  'OBJ': {}}
sentence.pnp               # List of PNPChunks: [Chunk('on the mat/PNP')]
sentence.constituents(pnp=False)
sentence.slice(start, stop)
sentence.copy()
sentence.xml
  • Sentence.constituents() returns an in-order list of Word and Chunk objects.
    With pnp=True, also groups into PNPChunk objects whenever possible.
  • Sentence.slice() returns a Slice (subclass of Sentence) starting with the word at index start and containing all the words up to (before) index stop.
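
A short sketch of the attributes listed above (relations=True is required for subjects, objects and verbs):

>>> from pattern.en import parsetree
>>> s = parsetree('The cat sat on the mat.', relations=True)[0]
>>> print s.subjects
>>> print s.verbs
>>> print s.objects

[Chunk('The cat/NP-SBJ-1')]
[Chunk('sat/VP-1')]
[]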

Sentence words

A Sentence is made up of Word objects, which are also grouped in Chunk objects:

word = Word(sentence, string, lemma=None, type=None, index=0)
word.sentence              # Sentence parent.
word.index                 # Sentence index of word.
word.string                # String (Unicode).
word.lemma                 # String lemma, e.g. 'sat' => 'sit'.
word.type                  # Part-of-speech tag (NN, JJ, VBD, ...)
word.chunk                 # Chunk parent, or None.
word.pnp                   # PNPChunk parent, or None.
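
For example (a minimal sketch; tags as in the parse tree example above):

>>> from pattern.en import parsetree
>>> s = parsetree('The cat sat on the mat.', lemmata=True)[0]
>>> w = s.words[1]
>>> print w.string, w.type, w.lemma
>>> print w.chunk.string

cat NN cat
The cat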

Sentence chunks

A Chunk is a list of Word objects that belong together.
Chunks can be part of a PNPChunk, which starts with a PP chunk followed by NP chunks.

chunk = Chunk(sentence, words=[], type=None, role=None, relation=None)
chunk.sentence             # Sentence parent.
chunk.start                # Sentence index of first word.
chunk.stop                 # Sentence index of last word + 1.
chunk.string               # String of words (Unicode).
chunk.words                # List of Word objects.
chunk.head                 # Primary Word in the chunk.
chunk.type                 # Chunk tag (NP, VP, PP, ...)
chunk.role                 # Role tag (SBJ, OBJ, ...)
chunk.relation             # Relation id, e.g. NP-SBJ-1 => 1.
chunk.relations            # List of (id, role)-tuples.
chunk.related              # List of Chunks with same relation id.
chunk.subject              # NP-SBJ chunk with same id.
chunk.object               # NP-OBJ chunk with same id.
chunk.verb                 # VP chunk with same id.
chunk.modifiers            # []
chunk.conjunctions         # []
chunk.pnp                  # PNPChunk parent, or None.
chunk.previous(type=None)
chunk.next(type=None)
chunk.nearest(type='VP')
  • Chunk.head yields the last (i.e. primary) Word in the chunk: the big cat => cat.
  • Chunk.relations  contains all relations the chunk is involved in.
    Some chunks have multiple relations, for example functioning as both SBJ and OBJ, or being the OBJ of multiple VP chunks.
  • For VP chunks, Chunk.modifiers is a list of nearby adjectives and adverbs with no relations.
    For example in the cat really wants out: really and out are ADVP with no relations.
    The parse tree will assume that they have something to do with the VP wants.
    What does the cat want? → out.
    How badly does the cat want out? → really.
  • Chunk.conjunctions is a list of chunks linked by and & or to this chunk.
    For example in going up and down: the up chunk has conjunctions: [(Chunk('down'), AND)]
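
A short sketch of some of these attributes (relations=True is required for role, relation and related):

>>> from pattern.en import parsetree
>>> s = parsetree('The cat sat on the mat.', relations=True)[0]
>>> np = s.chunks[0]
>>> print np.head.string
>>> print np.type, np.role, np.relation
>>> print np.related

cat
NP SBJ 1
[Chunk('sat/VP-1')]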

Prepositional noun phrases

PNPChunk is a subclass of Chunk. It has the same attributes and methods.
It groups PP + NP chunks in a prepositional noun phrase (PNP).

pnp = PNPChunk(sentence, words=[], type=None, role=None, relation=None)
pnp.string                 # String of words (Unicode).
pnp.chunks                 # List of Chunk objects.
pnp.preposition            # First PP-chunk in the PNP.

Words and chunks that are part of a PNP will have their Word.pnp and Chunk.pnp attribute set.
All the prepositional noun phrases in a sentence can be retrieved with Sentence.pnp.
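
A minimal sketch, reusing the example sentence from the parse tree section:

>>> from pattern.en import parsetree
>>> s = parsetree('The cat sat on the mat.')[0]
>>> print s.pnp
>>> print s.pnp[0].preposition.string
>>> print s.pnp[0].chunks

[Chunk('on the mat/PNP')]
on
[Chunk('on/PP'), Chunk('the mat/NP')]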

 


Sentiment

Text can be broadly categorized into two types: facts and opinions. Opinions carry people's sentiments, appraisals and feelings toward the world. The module bundles a lexicon of adjectives that occur frequently in product reviews, tagged with scores for sentiment polarity (positive/negative) and subjectivity. 

The sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence (based on the adjectives in it), with polarity between -1.0 and +1.0 and subjectivity between 0.0 and 1.0. The sentence can be a string, Text, Sentence, Chunk, Word or a Synset (see further).

sentiment(sentence)
polarity(sentence)
subjectivity(sentence)
positive(s, threshold=0.1)
>>> from pattern.en import sentiment
>>> print sentiment(
>>>     "The movie attempts to be surreal by incorporating various time paradoxes,"
>>>     "but it's presented in such a ridiculous way it's seriously boring.") 

(-0.34, 1.0) 

In the example above, -0.34 is a compromise between surreal, various, ridiculous and seriously boring.

The polarity() function returns the sentence polarity (positive/negative sentiment). 

The subjectivity() function returns the sentence subjectivity (objective/subjective).

The positive() function returns True if the given sentence's polarity is above the threshold. The threshold can be lowered or raised, but overall +0.1 gives the best results for product reviews. Accuracy is 73% (P 0.73, R 0.73) for movie reviews.
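
A sketch of the helper functions (exact scores depend on the bundled lexicon and are not shown):

>>> from pattern.en import polarity, subjectivity, positive
>>> s = "The movie was seriously boring."
>>> print polarity(s)       # Negative for this sentence.
>>> print subjectivity(s)   # Between 0.0 and 1.0.
>>> print positive(s)       # False, since polarity falls below the +0.1 threshold.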

 


Mood & modality

Linguistic modality deals with necessity, permissibility and probability. 

mood(sentence) # INDICATIVE | IMPERATIVE | CONDITIONAL | SUBJUNCTIVE
modality(sentence, type=EPISTEMIC)
negated(sentence)

The mood() function tries to identify a parsed Sentence as indicative, imperative, conditional or subjunctive:

Mood         Form                                Use                       Example
INDICATIVE   none of the below                   fact, belief              It is raining.
IMPERATIVE   infinitive without to               warning, instruction      Do your homework!
CONDITIONAL  would|could|should, will|can + if   possible or imaginary     I could show you.
SUBJUNCTIVE  wish|were, it is + infinitive       wish, judgement, opinion  I wish I knew.

The mood() function has an optional predictive=True parameter that determines how conditional sentences are handled. When False, sentences with will/shall must have an explicit if/when/once clause in order to be identified as conditional. For example: while "You will help me" is imperative, "I will help you" is predictive conditional and "I will help you when I get back" is speculative conditional. Sentences with can/may always need an explicit if-clause.

The modality() function returns a value between -1.0 and +1.0, expressing the degree of possibility: "I wish it would stop raining" scores -0.35 whereas "It will surely stop raining" scores +0.75. Roughly, >0.5 can be seen as certain. Accuracy (F1-score) when predicting certain vs. uncertain is around 67% for Wikipedia texts.

The negated() function returns True if the Sentence contains never, not or n't (as in wouldn't).
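
A minimal sketch; mood() and modality() expect a parsed Sentence, and the modality score is the one cited above:

>>> from pattern.en import parsetree
>>> from pattern.en import mood, modality, negated
>>> s = parsetree("I wish it would stop raining.", lemmata=True)[0]
>>> print mood(s)       # e.g. SUBJUNCTIVE (wish).
>>> print modality(s)   # About -0.35, as noted above.
>>> print negated(s)    # False: no never, not or n't.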

 


WordNet

The pattern.en module comes bundled with WordNet 3.0 and Oliver Steele's PyWordNet module. WordNet is a lexical database for the English language that groups words into Synset objects (sets of synonyms). Each synset provides a short definition and various semantic relations to other synsets:

synset = synsets(word, pos=NOUN)[i]
synset.pos                  # Part-of-speech: NOUN | VERB | ADJECTIVE | ADVERB.
synset.synonyms             # List of word forms (i.e. synonyms).
synset.gloss                # Definition string.
synset.lexname              # Category string, or None.
synset.ic                   # Information Content value.
synset.weight               # Tuple of (polarity, subjectivity), using SentiWordNet.
synset.antonym              # A Synset (semantical opposite).
synset.hypernym             # A Synset (semantical parent).
synset.hypernyms(recursive=False, depth=None)
synset.hyponyms(recursive=False, depth=None)
synset.meronyms()           # List of synsets (members/parts).
synset.holonyms()           # List of synsets (of which this is a member).
synset.similar()            # List of synsets (similar adjectives/verbs) 
  • Synset.hypernyms() returns a list of parent synsets (i.e. more general).
  • Synset.hyponyms() returns a list of child synsets (i.e. more specific).
    With recursive=True, returns all parents of parents / all children of children.
    Optionally, the depth parameter limits the recursion to the given depth.

For example:

>>> from pattern.en import wordnet
>>> s = wordnet.synsets('bird')[0]
>>> print 'Definition:', s.gloss
>>> print '  Synonyms:', s.synonyms
>>> print ' Hypernyms:', s.hypernyms()
>>> print '  Hyponyms:', s.hyponyms()
>>> print '  Holonyms:', s.holonyms()
>>> print '  Meronyms:', s.meronyms()

Definition: u'warm-blooded egg-laying vertebrates characterized 
              by feathers and forelimbs modified as wings'
  Synonyms: ['bird']
 Hypernyms: [Synset('vertebrate')]
  Hyponyms: [Synset('dickeybird'), Synset('cock'), Synset('hen'), ...]
  Holonyms: [Synset('Aves'), Synset('flock')]
  Meronyms: [Synset('beak'), Synset('furcula'), Synset('feather'), ...]

Reference: Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Cambridge, MIT Press.

Synset similarity

The ancestor() function returns the common ancestor (or least common subsumer) of two synsets.
The similarity() function returns the semantic similarity of two synsets, as a number.

ancestor(synset1, synset2)
similarity(synset1, synset2) # Lower value = higher similarity.
>>> a = wordnet.synsets('cat')[0]
>>> b = wordnet.synsets('dog')[0]
>>> print wordnet.ancestor(a, b)

Synset('carnivore')
>>> c = wordnet.synsets('teapot')[0]
>>> print wordnet.similarity(a, b)
>>> print wordnet.similarity(a, c)

3.29
4152.56

The similarity weight is based on Lin-similarity and Information Content (IC). IC values for each synset are derived from the word's occurrence in a given corpus (e.g. Brown). The idea is that less frequent words convey more information. Lower values indicate higher similarity:

lin = 2.0 * ancestor(synset1, synset2).ic / (synset1.ic + synset2.ic)

Synset sentiment

SentiWordNet is a third-party lexical resource for opinion mining, with polarity and subjectivity scores for all WordNet synsets. SentiWordNet is free for non-commercial research purposes. To enable SentiWordNet, request a download from the authors and place the file SentiWordNet*.txt in pattern/en/wordnet/. You can then use wordnet.sentiwordnet and Synset.weight in your script:

>>> from pattern.en import wordnet
>>> from pattern.en import ADJECTIVE
>>> print wordnet.sentiwordnet['lamp']
>>> print wordnet.synsets('happy', ADJECTIVE)[0].weight
>>> print wordnet.synsets('sad', ADJECTIVE)[0].weight

(0.0, 0.0)
(0.375, 0.875)
(-0.625, 0.875)

 


Wordlists

The pattern.en module includes a number of general-purpose word lists:

List       Description                Size   Example
ACADEMIC   English academic words     500    criterion, proportionally, research
BASIC      English basic words        1,000  chicken, pain, road
PROFANITY  English swear words        350
TIME       English time & date words  100    Christmas, past, saturday
>>> from pattern.en.wordlist import ACADEMIC
>>> words = open("paper.txt").read().split(" ")
>>> words = [w for w in words if w not in ACADEMIC] 

 


See also

  • MBSP (GPL): robust parser using a memory-based learning approach, in Python.
  • NLTK (Apache): full natural language processing toolkit for Python.