pattern.dev

Pattern is a web mining module for the Python programming language.

Pattern is written in Python, with extensions in JavaScript for web apps. The source code is hosted on GitHub. It is released under a BSD-license, so it can be freely incorporated in commercial applications. All contributions are welcome.

There are six core modules in the pattern package: web | db | en | search | vector | graph.




Contribute

Source code updates

The source code is hosted on GitHub, a web-based hosting service for software projects using version control. Version control keeps track of all the changes to the source code. For example, you can step back to an earlier version, or merge revisions from different contributors into the latest version.

The best way to get involved is to create a fork of the project. Detailed information can be found here on GitHub. You can then work on your own local copy – implement new functionality or improve existing code. When your changes are ready, you can push them to GitHub. Several desktop applications can help out in this process; we use Tower, and GitHub for Mac is free. You can also issue a pull request (as described here). This notifies the developers of Pattern. If your changes are interesting to a wider audience, we can pull them into the main branch so they become part of the next Pattern release.

Bug reports

Please let us know if you encounter a bug. We prefer if you create an issue on GitHub. This way, the bug is known to all users of Pattern and we can work together to solve it.

 


Dependencies

There are six core modules in the package:

Module Functionality
web Asynchronous requests, cached downloads, search engine APIs, HTML parser + spider.
db Wrappers for MySQL + SQLite databases and CSV files.
en English text parser, parse trees, word inflection, sentiment, mood & modality.
search N-gram pattern matching algorithm for parsed text.
vector Vector space model for corpora, clustering, classification.
graph Graph networks & graph visualization.

Design philosophy

Decentralized

Each module is designed to be standalone. One should be able to take the web folder out of the package and use it as a standalone project. Modules that refer to other modules should degrade silently if that module is not present. Inside a single module there may be a lot of interdependency between classes, hence the oversized __init__.py files. They will have to be split into submodules in the future.

Pure Python

Pattern is a pure-Python package written for Python 2.4. This means that it should not include C/C++ code, Perl scripts or (ideally) Python 2.7 functionality. Pattern's audience is diverse (from computer scientists to web designers) so we prefer out-of-the-box installation without INSTALL.txt over speed. Many fast, specialized alternatives already exist. Pattern's strength is in ease-of-use. If you must include C/C++ code, provide precompiled binaries for different platforms and use ctypes.

Web-enabled

For visualization purposes, graphics that display natively in the browser are preferred. This is reflected by the canvas.js helper module, which offers browser-based visualization without dependencies. You can include Python code that generates JavaScript for canvas.js apps. Also, some users may want to incorporate Pattern in Google App Engine, so data files should be no larger than 10MB.

Pragmatic

If it is non-trivial and it already exists, don't rewrite it. Many Python projects are BSD-licensed and can be included in Pattern without restriction. Do not include GPL-licensed projects. Do not include big projects (e.g., NLTK, Django). Instead provide tools for mapping data structures from/to Pattern.

Base classes

  • pattern.web: the idea is to wrap existing tools (e.g., Beautiful Soup, PDFMiner) instead of writing code from scratch – the code in this module requires a lot of maintenance since the web changes constantly. New search engines inherit from SearchEngine to provide a uniform API.
  • pattern.db: there is no strategy to include new database engines next to MySQL and SQLite. If you think this is a challenge you can start by inspecting Database.connect(), Database.escape(), Database._field_SQL() and Table._update().
  • pattern.text: the Lexicon class can be reused by Brill-based taggers. This is explained in more detail in the language support section. An example is the pattern.nl module which inherits from the English parser. Note that the package tree can benefit from a reorganization in this module.
  • pattern.vector: classifiers inherit from the Classifier base class. Clustering algorithms must be available through Corpus.cluster(), for example Corpus.cluster(method=KMEANS).
  • pattern.graph: the Node, Edge and Graph classes can be subclassed. Pay attention to Graph.add_node() and Graph.add_edge(): we need to pass the subclass of Node and Edge as an optional parameter here. Layout algorithms can be implemented by subclassing GraphLayout.

 


Documentation

Each function or method should have a docstring:

def find(match=lambda item: False, list=[]):
    """ Returns the first item in the given list for which match(item) is True.
    """
    for item in list:
        if match(item) is True: 
            return item

Provide a concise description of the expected input and output. Our docstrings usually start with "Returns the ..." If you add new functionality, remember to also add unit tests for other developers and a fun example for users. Some users prefer to learn by reading documentation, others by tinkering with examples. Unit tests are important since they ensure that individual parts work correctly.
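A matching unit test for the find() function above, written with the standard unittest module, might look like this (the test class name is illustrative):

```python
import unittest

def find(match=lambda item: False, list=[]):
    """ Returns the first item in the given list for which match(item) is True.
    """
    for item in list:
        if match(item) is True:
            return item

class TestFind(unittest.TestCase):

    def test_find(self):
        # Returns the first matching item.
        self.assertEqual(find(lambda x: x % 2 == 0, [1, 3, 4, 6]), 4)

    def test_find_none(self):
        # Returns None when no item matches.
        self.assertEqual(find(lambda x: x > 10, [1, 2, 3]), None)
```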

Unfortunately, we do not have an open documentation framework. We will update the online documentation with your contributions when the next version is released. The offline documentation is automatically generated.

 


Coding conventions

Whitespace

Unfortunately, the source code does not always follow the PEP8 style guide for Python code. This is the case for whitespace. We add additional whitespace to object property assignments so that the inline comments are aligned as a block:

class Table(object):
    def __init__(self, name, database):
        """ A collection of rows with one or more fields of a certain type.
        """
        self.database    = database
        self.name        = name
        self.fields      = [] # List of field names (i.e., column names).
        self.schema      = {} # Dictionary of (field, Schema)-items.
        self.default     = {} # Default values for Table.insert().
        self.primary_key = None
        self._update()

We sometimes use whitespace to align the keys and values in a dictionary:

url = URL("http://search.twitter.com/search.json?", method=GET, query={
       "q": query,
    "page": start,
     "rpp": min(count, 100) 
})

Class and function names

Class names use CamelCase: AsynchronousRequest, SearchEngine, HTTP404Error. Function and method names use lowercase_with_underscore. Preferably, function and method names are single, short, descriptive words. Furthermore, if a method takes no arguments and is not a heavy operation, it should be a property:

class AsynchronousRequest:
    @property
    def done(self):
        return not self._thread.isAlive()
while not request.done:
   ... 

Variable names

We often use single-character variable names, usually the first letter of what the variable name would otherwise be (for example, dict keys and values are k and v). This is done to make the structure of the algorithm more visible. In the example below, s is short for "string". This draws attention to the actual string methods, not to the s = [...] assignment part.

def normalize(s, punctuation="!?.:;,()[] "):
    s = s.decode("utf-8")
    s = s.lower()
    s = s.strip(punctuation)
    return s

Overview of frequently used single character variable names:

Variable Meaning Example
a array, all a = [normalize(w) for w in words]
b boolean while b is False:
f file, filter, function f = open('data.csv', 'r')
i index for i in range(len(matrix)):
j index for j in range(len(matrix[i])):
k key for k in vector.keys():
p parser, pattern p = pattern.search.compile('NN')
s string s = s.decode('utf-8').strip()
t time t = time.time() - t0
v value, vector for k, v in vector.items():
w word for i, w in enumerate(sentence.words):
x horizontal position node.x = 0
y vertical position node.y = 0

Dictionaries

Large portions of the code deal with dictionaries. For example, pattern.en is built around a word → part-of-speech tag dictionary, or verb infinitive → inflected verbs. In pattern.vector, document vectors are represented as word → weight dictionaries. In pattern.graph, node adjacency is represented as a dictionary indexed by node ids, in which each value is a node id → weight dictionary. Python's dict is fast for lookup operations (key in dict). This can be exploited by using dict as a sparse data format. For example, a document vector stores only the words in the document, not all words in the entire corpus. When comparing two documents, dict.get() is used with an optional default value for missing terms:

v1 = document1.vector
v2 = document2.vector
similarity = sum(v1.get(w,0) * f for w, f in v2.items()) / (v1.norm * v2.norm or 1)

Once similarity is calculated it can be stored in a cache helper dictionary indexed by (v1.id, v2.id). Since Python is slower than Java or C, profiling an algorithm and optimizing it with a caching mechanism is often useful. See also pattern.metrics.profile().

List comprehensions

We make abundant use of list comprehensions (single line for). They are often faster than for or map(). Another advantage is that multiple lines of nested for and if can be bundled into a single statement. A disadvantage is that sometimes it is harder to read the code – in which case a comment should be added:

# Split string into words and remove punctuation.
words = s.replace("\n", "\n ")
words = (w.strip(punctuation) for w in words.split(" ")) # generator
words = [w for w in words if w != ""]

Ternary operator

Pattern supports Python 2.4 (some web servers still run it), which does not have the ternary operator. The ternary operator (single line if) is available since Python 2.5:

s = s.lower() if lowercase is True else s # Python 2.5

Instead we use a boolean condition:

s = lowercase is True and s.lower() or s  # Python 2.4

This is useful in combination with list comprehensions. Care must be taken for values 0, '', [], (), {}, and None, since they evaluate to False and might trigger the or-clause.
For example, score = word not in vector and 0 or 1  always results in 1.
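To demonstrate the pitfall: because 0 is falsy, the and-clause falls through to the or-clause. A Python 2.4-safe rewrite inverts the test so that the and-clause yields a truthy value:

```python
vector = {"cat": 1}

# Broken: when the word is missing, the and-clause yields 0, which is
# falsy, so the or-clause fires and the result is 1 anyway.
broken = lambda w: w not in vector and 0 or 1

# Safe: invert the test so that the and-clause yields a truthy 1.
safe = lambda w: w in vector and 1 or 0

assert broken("dog") == 1  # wrong: "dog" is not in the vector
assert safe("dog") == 0    # correct
assert safe("cat") == 1    # correct
```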

 


Code quality

The source code numbers about 25,000 lines of Python code + 20,000 lines of bundled external projects (Universal Feed Parser, simplejson, PDFMiner, Beautiful Soup, PyWordNet, LIBSVM). You can use pylint from the shell to check the quality of the source code and look for possible bugs:

> cd pattern
> pylint pattern --rcfile=.pylintrc 

The configuration file .pylintrc defines a number of customized settings:

  1. pylint complains about lines longer than 80 characters, but we use a more relaxed 100 characters.
  2. Ignore pylint id C0103. We use single-character variable names (see above).
  3. Ignore pylint id W0142. We think *args and **kwargs are powerful Python features, not magic.
  4. Ignore bundled third-party modules (e.g., Beautiful Soup, pywordnet, libsvm).

With these settings the code scores about 7.3. It scores about 4.0 with the default settings. Message id's to look out for are those starting with E, which are possible bugs.

Over 20% of the source code is unit testing (see PyUnit). If you implement new functionality, also add unit tests for it. This ensures that the code continues to run correctly even after further changes are made.

 


Language support

Pattern bundles a parser for English that can be used to annotate words with their part-of-speech tag (is can a noun or a verb?), to inflect verbs, to predict positive or negative opinion in text, etc. As of 2005, there are 340 million native speakers of English. We are interested in adding support for other languages, if we can collaborate with native speakers of those languages. 

Here is an overview of some languages on the to-do list (but we are interested in all languages):

Language Code Speakers Example countries
Spanish es 350m Argentina (40m), Colombia (40m), Mexico (110m), Spain (45m)
English en 340m Canada (30m), United Kingdom (60m), United States (300m)
German de 100m Austria (10m), Germany (80m), Switzerland (7m)
French fr 70m France (65m), Côte d'Ivoire (20m)
Italian it 60m Italy (60m)
Dutch nl 25m Netherlands (15m), Belgium (5m), Suriname (1m)

Reference corpus

The parser consists of a Brill tagger and regular expressions for tokenization and chunking (see pattern/text/en/parser/). 

Tokenization splits punctuation marks from words. Chunking groups words that belong together (e.g., the black cat). The English tokenizer can be reused for any language. The English chunker can be reused for Germanic languages (en, de, nl) but probably needs to be updated for Romance languages (es, fr, it) because the chunk word order is different: the black cat → el gato negro.
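A simplified illustration of splitting punctuation marks from words with a regular expression (this is a sketch, not Pattern's actual tokenizer, which also handles abbreviations, contractions, URLs, etc.):

```python
import re

def tokenize(s, punctuation="!?.:;,()[]"):
    """ Returns a list of tokens with punctuation marks split from words.
    """
    # Pad each punctuation mark with spaces, then split on whitespace.
    s = re.sub("([%s])" % re.escape(punctuation), r" \1 ", s)
    return s.split()
```

For example, tokenize("The cat sleeps.") yields ["The", "cat", "sleeps", "."].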

Brill's tagger is based on a lookup dictionary of common words and their part-of-speech tag (black → adjective). This dictionary (or lexicon) is "trained" on text that has been manually tagged by people (called a reference corpus). To support a new language in Pattern, a Brill lexicon is required. Therefore, a reference corpus is required. We can then count the most frequent words in the corpus and include them in the lexicon together with their part-of-speech tag. The bigger the available corpus (e.g., 1 million words) the better the lexicon. For example, for English there is the Brown corpus.
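Building the lexicon then amounts to counting, for each frequent word in the tagged corpus, its most frequent tag. A minimal sketch (the corpus format and function name are illustrative):

```python
from collections import defaultdict

def build_lexicon(tagged_sentences, top=100000):
    """ Returns a word -> most frequent part-of-speech tag dictionary,
        trained on a corpus of [(word, tag), ...] sentences.
    """
    counts = defaultdict(lambda: defaultdict(int))  # {word: {tag: count}}
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep the most frequent words, each mapped to its most frequent tag.
    frequent = sorted(counts, key=lambda w: -sum(counts[w].values()))[:top]
    return dict((w, max(counts[w], key=counts[w].get)) for w in frequent)
```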

Transformation rules

Some words can have a different part-of-speech depending on how they are used in a sentence. For example: in "the cat can have a can of tuna" the word can is used both as a verb and as a noun. Brill's tagger uses the lexicon together with transformation rules that improve the predicted part-of-speech tag according to where a word occurs in a sentence. A set of rules can be trained from the reference corpus, using Brill's algorithm. The original algorithm is written in C and is available here. A Python implementation is bundled in NLTK.
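Each rule rewrites a tag in a given context. A toy example of applying one such rule (the rule format is a simplification; Brill's actual rule templates are richer):

```python
def apply_rule(tagged, old, new, prev):
    """ Returns the tagged sentence with tag old replaced by new
        whenever the previous word is tagged prev.
    """
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == old and out[i - 1][1] == prev:
            out[i] = (out[i][0], new)
    return out

# Rule: a noun (NN) right after a modal verb (MD) is probably a verb (VB).
s = [("the", "DT"), ("cat", "NN"), ("can", "MD"), ("purr", "NN")]
s = apply_rule(s, old="NN", new="VB", prev="MD")
```

Here "purr" is retagged VB (it follows the modal "can"), while "cat" keeps its NN tag.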

The Brill tagger typically achieves an accuracy of 94-97%. More robust methods exist (for example, MBSP) that achieve an accuracy of 98-99%. We use a Brill tagger because it is fast and economic in memory usage, making it well-suited for web applications.

Inflection rules

The parser is further augmented with regular expressions to singularize nouns (cats → cat), to conjugate verbs, and to inflect adjectives (un garçon curieux, des filles curieuses). This is used to implement the parser's lemmatizer (i.e., word base forms), which is important for text mining purposes (we want to bring curieux and curieuses together). See for example pattern/text/en/inflect/. In general, simple regular expressions for inflection can attain a fair accuracy (90%+), but the task is non-trivial and requires some effort (1-2 weeks of work).
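A few regular expressions already cover the most common English plural forms. Below is a simplified sketch; Pattern's actual inflection module handles many more rules and exceptions:

```python
import re

def singularize(noun):
    """ Returns the singular form of a plural English noun (simplified).
    """
    rules = [
        (r"(ch|sh|ss|x)es$", r"\1"),  # wishes -> wish, boxes -> box
        (r"ies$", "y"),               # ponies -> pony
        (r"s$", ""),                  # cats -> cat
    ]
    for suffix, replacement in rules:
        if re.search(suffix, noun):
            return re.sub(suffix, replacement, noun)
    return noun
```

Rules are tried in order, most specific first, and irregular nouns (mice, children) would be looked up in an exception dictionary before falling through to the rules.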

Other support modules such as tree.py are language-independent and can be inherited from the English parser. To get an idea of how this works, have a look at the Dutch parser (pattern/text/nl), which inherits from the English parser.