Cheshire3

28th February 2012 (2012-02-28)

Description
Authors
Latest Version
Installation
Requirements / Dependencies
- Additional / Optional Features
Documentation
Development
Bugs, Feature requests etc.
Licensing
Coding Examples
- Initialization
- Loading Data
  - Pre-Processing (PreParsing)
- Searching
- Retrieving
- Transforming Records
- Indexes (Looking Under The Hood)

Description

Cheshire3 is a fast XML search engine, written in Python for extensability and using C libraries for speed. Cheshire3 is feature rich, including support for XML namespaces, unicode, a distributable object oriented model and all the features expected of a digital library system.

Standards are foremost, including SRU and CQL, as well as Z39.50 and OAI. It is highly modular and configurable, enabling very specific needs to be addressed with a minimum of effort. The API is stable and fully documented, allowing easy third party development of components.

Given a set of documents records, Cheshire3 can extract data into one or more indexes after processing with configurable workflows to add extra normalization and processing. Once the indexes have been constructed, it supports such operations as search, retrieve, browse and sort.

The abstract protocolHandler allows integration of Cheshire3 into any environment that will support Python. For example using Apache handlers or WSGI applications, any interface from standard APIs like SRU, Z39.50 and OAI (all included by default in the cheshire3.web sub-package), to an online shop front can be provided.

Authors

Cheshire3 Team at the University of Liverpool http://www.liv.ac.uk:

Robert Sanderson
John Harrison john.harrison@liv.ac.uk
Catherine Smith
Jerome Fuselier

(Current maintainer in bold)

Latest Version

We are currently working towards a version 1.0 release for inclusion in PyPI. This effort will be mainly focused on distribution and installation mechanisms; as previously mentioned, Cheshire3's API is stable and documented.

The latest stable version will be available from our website: http://www.cheshire3.org/download/

Source code is under version control and available from: http://github.com/cheshire3/cheshire3

Previously, source code was available from our own Subversion server: http://cvs.cheshire3.org/repos/cheshire3

The SVN repository is being kept alive for the time being as read-only, and best efforts will be made to keep it up-to-date with the master (i.e. "production") branch from the Git repository.

Previous versions, including code + dependency bundles, and auto-installation scripts are available from: http://www.cheshire3.org/download/

Installation

As mentioned, we're currently working on the installation process to make it easier for both lay-persons and developers to get started with Cheshire3.

The following guidelines assume that you administrative privileges on the machine you're installing on. If this is not the case, then you might need to use the option --user. For more details, see: http://docs.python.org/install/index.html#alternate-installation

Users (i.e. those not wanting to actually develop Cheshire3):

Clone cheshire3 repository from GitHub
Run python setup.py install

Developers:

Fork the cheshire3 repository in GitHub
Clone your fork of the cheshire3 repository from GitHub
Run python setup.py develop
Read the Development section of this README

Requirements / Dependencies

Cheshire3 requires Python 2.6.0 or later. It has not yet been verified as Python 3 compliant.

Python package dependencies should be resolved by the standard Python package management mechanisms (setuptools, easy_install) so need not be discussed here.

Cheshire3's extensive use of third party C-libraries (e.g. BerkeleyDB, libxml) may require installation of some pre-requisites, and we have historically recommended that rather than just installing the pure Python package, you should download the necessary Cheshire3 packages from: http://www.cheshire3.org/download/ and install them using the scripts provided.

We're trying to get away from this situation, so that Cheshire3 can be installed as a pure Python package, but on some systems, you may still need to download and install the dependencies. The bundles available from http://www.cheshire3.org/download may continue to be a useful place to get hold of the source code for these dependencies, with the build scripts providing hints on how to compile and install them.

Additional / Optional Features

Certain features within the Cheshire3 Information Framework will have additional dependencies (e.g. web APIs will require a web application server). We'll try to maintain an accurate list of these in the README file for each sub-package.

Documentation

HTML Documentation is provided in the docs directory. There is additional documentation for the source code in the form of comments and docstrings. Documentation for most default object configurations can be found within the <docs> tag in the config XML for each object. We would encourage users to take advantage of this tag to provide documentation for their own custom object configurations.

Development

This section is intended for those who are intending to develop code to contribute back to Cheshire3.

The project will make use of a Distributed Version Control System (DVCS), called Git, for managing change within the Cheshire3 code base, configurations and documentation. A repository will be maintained for existing code, and any new developments, on github.

Development in the GitHub repository will follow (at least to begin with) Vincent Driessen's branching model, and use git-flow to facilitate this.

So your workflow should be something like:

Fork the GitHub repository
Clone your forked repository onto you local development machine
Develop in the develop branch (or develop in you own feature branch and merge back into the develop branch.)
Push your changes back to you github fork
Issue a pull request

If you're developing Cheshire3, and intend to create your own releases, you may find it useful to install the package setuptools-git.

Developed code for intended to be contributed back to Cheshire3 should follow the recommendations made by the standard Style Guide for Python Code (which includes the provision that guidelines may be ignored in situations where following them would make the code less readable.)

Particular attention should be paid to documentation and source code annotation (comments). All developed modules, functions, classes, and methods should be documented in the source code. Newly configured objects at the server level should be documented using the <docs> tag. Comments and Documentation should be accurate and up-to-date, and should never contradict the code itself.

Bugs, Feature requests etc.

Bug reports, feature requests etc. should be made using the GitHub issue tracker: https://github.com/cheshire3/cheshire3/issues

Licensing

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the University of Liverpool nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Coding Examples

Initialization

The first thing that we need to do is create a Session and build a Server. The Session object will be passed around amongst the processing objects to maintain details of the current environment. It stores, for example, user and identifier for the database.

    >>> from cheshire3.baseObjects import Session
    >>> session = Session()

The Server looks after all of our objects, databases, indexes ... everything.

Its constructor takes session and one argument, the filename of the top level configuration file. You can find it dynamically as follows:

    >>> import os
    >>> from cheshire3.server import SimpleServer
    >>> from cheshire3.internal import cheshire3Root
    >>> serverConfig = os.path.join(cheshire3Root, 'configs', 'serverConfig.xml')
    >>> serv = SimpleServer(session, serverConfig)
    >>> serv
    <cheshire3.server.SimpleServer object...

The next thing you'll probably want to do is get a database.

    >>> db = serv.get_object(session, 'db_test')
    >>> db
    <cheshire3.database.SimpleDatabase object...

After this you MUST set session.database to the identifier for your database, in this case 'db_test':

    >>> session.database = 'db_test'

This is primarily for efficiency in the workflow processing (objects are cached by their identifier, which might be duplicated for different objects in different databases). More on workflows later.

Another useful path to know is the database's default path:

    >>> dfp = db.get_path(session, 'defaultPath')

Loading Data

In order to load data into your database you'll need a document factory to find your documents, a parser to parse the XML and a record store to put the parsed XML into. The most commonly used are defaultDocumentFactory and LxmlParser. Each database needs its own record store.

    >>> df = db.get_object(session, "defaultDocumentFactory")
    >>> parser = db.get_object(session, "LxmlParser")
    >>> recStore = db.get_object(session, "recordStore")

Before we get started, we need to make sure that the stores are all clear.

    >>> recStore.clear(session)
    <cheshire3.recordStore.BdbRecordStore object...
    >>> db.clear_indexes(session)

First you should call db.begin_indexing() in order to let the database initialise anything it needs to before indexing starts. Ditto for the record store.

    >>> db.begin_indexing(session)
    >>> recStore.begin_storing(session)

Then you'll need to tell the document factory where it can find your data:

    >>> df.load(session, 'data', cache=0, format='dir')
    <cheshire3.documentFactory.SimpleDocumentFactory object...

DocumentFactory's load function takes session, plus:

data -- this could be a filename, a directory name, the data as a string, a URL to the data and so forth.

If data ends in [(numA):(numB)], and the preceding string is a filename, then the data will be extracted from bytes numA through to numB (this is pretty advanced though - you'll probably never need it!)
cache -- setting for how to cache documents in memory when reading them in. This will depend greatly on use case. e.g. if loading 3Gb of documents on a machine with 2Gb memory, full caching will obviously not work very well. On the other hand, if loading a reasonably small quantity of data over HTTP, full caching would read all of the data in one shot, closing the HTTP connection and avoiding potential timeouts. Possible values:

0 = no document caching. Just locate the data and get ready to discover and yield documents when they're requested from the documentFactory. This is probably the option you're most likely to want.

1 = Cache location of documents within the data stream by byte offset.

2 = Cache full documents.

format -- The format of the data parameter. Many options, the most common are:

  * xml  -- xml file. Can have multiple records in single file.
  * dir  -- a directory containing files to load
  * tar  -- a tar file containing files to load
  * zip  -- a zip file containing files to load
  * marc -- a file with MARC records (library catalogue data)
  * http -- a base HTTP URL to retrieve

tagName -- the name of the tag which starts (and ends!) a record. This is useful for extracting sections of documents and ignoring the rest of the XML in the file. Medline, Wikipedia and ModsCollection do this for example.
codec -- the name of the codec in which the data is encoded. Normally 'ascii' or 'utf-8'

You'll note above that the call to load returns itself. This is because the document factory acts as an iterator. The easiest way to get to your documents is to loop through the document factory:

    >>> for doc in df:
    ...    rec = parser.process_document(session, doc)  # [1]
    ...    recStore.create_record(session, rec)         # [2]
    ...    db.add_record(session, rec)                  # [3]
    ...    db.index_record(session, rec)                # [4]
    recordStore/...

In this loop, we first use the Lxml Parser to create a record object in 1.
Then in 2, we store the record in the recordStore. This assigns an identifier to it, by default a sequential integer. In 3 we add the record to the database. This stores database level metadata such as how many words in total, how many records, average number of words per record, average number of bytes per record and so forth. The last line, 4, indexes the record against all indexes known to the database -- typically all indexes in the indexStore in the database's 'indexStore' path setting.

Then we need to ensure this data is commited to disk:

    >>> recStore.commit_storing(session)
    >>> db.commit_metadata(session)

And, potentially taking longer, merge any temporary index files created:

    >>> db.commit_indexing(session)

Pre-Processing (PreParsing)

As often than not, documents will require some sort of pre-processing step in order to ensure that they're valid XML in the schema that you want them in. To do this, there are PreParser objects which take a document and transform it into another document.

The simplest preParser takes raw text, escapes the entities and wraps it in a element:

    >>> from cheshire3.document import StringDocument
    >>> doc = StringDocument("This is some raw text with an & and a < and a >.")
    >>> pp = db.get_object(session, 'TxtToXmlPreParser')
    >>> doc2 = pp.process_document(session, doc)
    >>> doc2.get_raw(session)
    '<data>This is some raw text with an &amp; and a &lt; and a &gt;.</data>'

Searching

In order to allow for translation between query languages (if possible) we have a query factory, which defaults to CQL (SRU's query language, and our internal language).

    >>> qf = db.get_object(session, 'defaultQueryFactory')
    >>> qf
    <cheshire3.queryFactory.SimpleQueryFactory object ...

We can then use this factory to build queries for us:

    >>> q = qf.get_query(session, 'c3.idx-text-kwd any "compute"')
    >>> q
    <cheshire3.cqlParser.SearchClause ...

And then use this parsed query to search the database:

    >>> rs = db.search(session, q)
    >>> rs
    <cheshire3.resultSet.SimpleResultSet ...
    >>> len(rs)
    3

The 'rs' object here is a result set which acts much like a list. Each entry in the result set is a ResultSetItem, which is a pointer to a record.

    >>> rs[0]
    Ptr:recordStore/1

Retrieving

Each result set item can fetch its record:

    >>> rec = rs[0].fetch_record(session)
    >>> rec.recordStore, rec.id
    ('recordStore', 1)

Records can expose their data as xml:

    >>> rec.get_xml(session)
    '<record>...

As SAX events:

    >>> rec.get_sax(session)
    ["4 None, 'record', 'record', {}...

Or as DOM nodes, in this case using the Lxml Etree API:

    >>> rec.get_dom(session)
    <Element record at ...

You can also use XPath expressions on them:

    >>> rec.process_xpath(session, '/record/header/identifier')
    [<Element identifier at ...
    >>> rec.process_xpath(session, '/record/header/identifier/text()')
    ['oai:CiteSeerPSU:2']

Transforming Records

Records can be processed back into documents, typically in a different form, using Transformers:

    >>> dctxr = db.get_object(session, 'DublinCoreTxr')
    >>> doc = dctxr.process_record(session, rec)

And you can get the data from the document with get_raw()

    >>> doc.get_raw(session)
    '<?xml version="1.0"?>...

This transformer uses XSLT, which is common, but other transformers are equally possible.

It is also possible to iterate through stores. This is useful for adding new indexes or otherwise processing all of the data without reloading it.

First find our index, and the indexStore:

    >>> idx = db.get_object(session, 'idx-creationDate')

Then start indexing for just that index, step through each record, and then commit the terms extracted.

    >>> idxStore.begin_indexing(session, idx)
    >>> for rec in recStore:
    ...     idx.index_record(session, rec)
    recordStore/...   
    >>> idxStore.commit_indexing(session, idx)

Indexes (Looking Under the Hood)###

Configuring Indexes, and the processing required to populate them requires some further object types, such as Selectors, Extractors, Tokenizers and TokenMergers. Of course, one would normally configure these for each index in the database and the code in the examples below would normally be executed automatically. However it can sometimes be useful to get at the objects and play around with them manually, particularly when starting out to find out what they do, or figure out why things didn't work as expected, and Cheshire3 makes this possible.

Selector objects are configured with one or more locations from which data should be selected from the Record. Most commonly (for XML data at least) these will use XPaths. A selector returns a list of lists, one for each configured location.

    >>> xp1 = db.get_object(session, 'identifierXPathSelector')
    >>> rec = recStore.fetch_record(session, 1)
    >>> elems = xp1.process_record(session, rec)
    >>> elems
    [[<Element identifier at ...

However we need the text from the matching elements rather than the XML elements themselves. This is achieved using an Extractor, which processes the list of lists returned by a Selector and returns a doctionary a.k.a an associative array or hash:

    >>> extr = db.get_object(session, 'SimpleExtractor')
    >>> hash = extr.process_xpathResult(session, elems)
    >>> hash
    {'oai:CiteSeerPSU:2 ': {'text': 'oai:CiteSeerPSU:2 ', ...

And then we'll want to normalize the results a bit. For example we can make everything lowercase:

    >>> n = db.get_object(session, 'CaseNormalizer')
    >>> h2 = n.process_hash(session, h)
    >>> h2
    {'oai:citeseerpsu:2 ': {'text': 'oai:citeseerpsu:2 ', ...

And note the extra space on the end of the identifier...

    >>> s = db.get_object(session, 'SpaceNormalizer')
    >>> h3 = s.process_hash(session, h2)
    >>> h3
    {'oai:citeseerpsu:2': {'text': 'oai:citeseerpsu:2',...

Now the extracted and normalized data is ready to be stored in the index!

This is fine if you want to just store strings, but most searches will probably be at word or token level. Let's get the abstract text from the record:

    >>> xp2 = db.get_object(session, 'textXPathSelector')
    >>> elems = xp2.process_record(session, rec)
    >>> elems
    [[<Element {http://purl.org/dc/elements/1.1/}description ...

Note the {...} bit ... that's lxml's representation of a namespace, and needs to be included in the configuration for the xpath in the Selector.

    >>> extractor = db.get_object(session, 'ProxExtractor')
    >>> hash = extractor.process_xpathResult(session, elems)
    >>> hash
    {'The Graham scan is a fundamental backtracking...

ProxExtractor records where in the record the text came from, but otherwise just extracts the text from the elements. We now need to split it up into words, a process called tokenization.

    >>> tokenizer = db.get_object(session, 'RegexpFindTokenizer')
    >>> hash2 = tokenizer.process_hash(session, hash)
    >>> h
    {'The Graham scan is a fundamental backtracking...

Although the key at the beginning looks the same, the value is now a list of tokens from the key, in order. We then have to merge those tokens together, such that we have 'the' as the key, and the value has the locations of that type.

    >>> tokenMerger = db.get_object(session, 'ProxTokenMerger')
    >>> hash3 = tokenMerger.process_hash(session, hash2)
    >>> hash3
    {'show': {'text': 'show', 'occurences': 1, 'positions': [12, 41]},...

After token merging, the multiple terms are ready to be stored in the index!

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
code		code
configs		configs
dbs		dbs
docs		docs
www		www
.gitignore		.gitignore
README.mdown		README.mdown
distribute_setup.py		distribute_setup.py
setup.cfg		setup.cfg
setup.py		setup.py

Cheshire-Grampa/cheshire3

Folders and files

Latest commit

History

Repository files navigation