Full Text Search with Storm and PosgreSQL 8.3

Posted on November 13, 2008 by oubiwann

Content
Metadata

Introduction

At Canonical, we recently decided to provide a knowledge base in our Landscape product. The pros and cons for developing this in-house were weighed judiciously, and in the end, the balance tipped to building our own. Ordinarily, one might think a team would be better off integrating 3rd party software... but our team isn't ordinary :-) There is a lot of fantastic stuff these guys have created over the last few years that can be brought to bear in just about any software project, enabling visions to reach fruition quickly.

Last November, Twisted developer Thomas Hervé and I worked on the KB code together, joined by long-time Landscape developer Jamu Kakar for some sweet, sweet styles and page layouts. We made amazing progress and were able to deliver ahead of schedule. One of the reasons for this is the awesome tool chain we have, in particular, bzr (multiple, dependent, in-progress branches easily managed and updated with each other and trunk).

As a result of being ahead of schedule, our tech lead Gustavo Niemeyer had started looking for future-planned features that we could research and prepare, if not implement, immediately. One of the features that made this list was full text search.

The Postgre SQL 8.3 manual, chapter 12, gives an excellent definition of full text search in their opening paragraph:

Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query.

Mind you, this is not anywhere close to simple LIKE or even regex searches.

So let's get startred :-)

Postgres Preparation

One of the things we wanted to support with text search in our knowledge base was strong stemming (Google's search is famous for its good stemming). In order to do this, we needed some dictionary files that supported this, and the Postgres docs recommended ispell. These files needed to be converted to UTF-8, so here's what we did:

TBD

Martin Pitt at Canonical has been working with the Postgres folks to get this included in a future release so that you won't have to do all that work anymore.

Schema Changes

With the proper files in place, we then needed to override the default (ASCII) text search. These requirements resulted in the following SQL:

TBD

Postgres Text Search SQL

The SQL that performs full text search in Postgres looks a little odd at first:

TBD

The parts that might jar the eyes a bit are:

the new to_tsvector, to_tsquery, tsrankcd, and ts_headline functions
the match operator
compound search expression

For a full and accurate understanding of these, I recommend a thorough reading of the afore-mentioned manual, Chapter 12. For now, though, here are some quick pass definitions:

ts_vector - Converts a document (text) to a preprocessed representation that is needed for text search.

ts_query - Converts and normalizes query strings to that they can be easily compared to documents in tsvector-format.

tsrankcd - Measures query relavance to documents; it "take[s] into account lexical, proximity, and structural information [...] consider[s] how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur."

ts_headline - Returns an excerpt from a matched document with found search terms highlighted in markup.
@@ - The match operator is an odd-looking creature that does what you might think: returns a Boolean for the search expression as matched against a document's text search vector.
search terms - These are single words separated by the Boolean operators & (AND), | (OR) and ! (NOT) and may be grouped with parenthesis to determine proper precedence.

The sample SQL query above does not take indexing into consideration, nor a table column that preprocesses the document title and body into a text search vector. Here is what we used to set these up:

TBD

Storm Customizations

Most of my Storm usage over the past couple years has been pretty tame. Probably, the most exciting thing I've done is used Thomas' Twisted in tegration branch. Or 1-to-many relationships. So when I put together the SQL we were going to need, I had to talk to the team. Chris Armstrong gave me some awesome pointers, and I'm delighted to share these with a wider audience. The cleverness involved is all under the covers, making it very easy for devs to extend Storm – but I highly recommend taking a look at how Storm makes this possible... it's a delight :-)

Chris showed me some files with examples, the end result being a few simple subclasses:

TBD

Storm Full Text Search

With the Postgres configuration in place, the schema set up, and our text-search-specific Storm customizations made, we're ready to perform full text search with Storm:

TBD

I've created a storm branch with the full text search example code in it. You can browse the code here. You can also do the usual:

bzr branch lp:~oubiwann/storm/storm-examples

Summary

The biggest chunk of this work is DBA stuff, really. Then some sysadmin tasks. Once all that's done, the programmer only has to do a few tasks: create a class object, define the functions, and then reap the usability and productivity benefits of working with a well thought-out ORM. This is world-class search with a world-class ORM: both a user's and a programmer's dream :-)

Author	oubiwann
Date	November 13, 2008
Time	09:19:08
Category
Tags	canonical landscape postgres postgresql ubuntu
Line Count	1
Word Count	990
Character Count	7328

Comments?

This blog doesn't use standard (embedded) comments; however, since the site is hosted on Github, if there is something you'd like to share, please do so by opening a "comment" ticket!