Building an F# powered indexing system (part 2)

This is the second in a series of posts documenting a from-scratch indexing application I'm building in F#.  The first post in the series gives a good overview of the application. I've gone ahead and put the source up as well, so you can skip straight to that if you think I'm a jackass.

Now that I have a basic set of types for defining what goes into an index (a schema), it's time to have a look at types for the actual data going in and out.  I'll call these documents, where the document type looks like this:

type DocField = string * string
type Document = { Schema : IndexSchema; Fields : DocField list }

Documents are associated with our IndexSchema type, and contain a list of DocFields.  A document field is a simple two-string tuple, the first is a field name, the second is some text to cram into the index.  Given the sample schema from the previous post, a basic document declaration might look like this:

let sampleDoc = {
    Schema = defaultSchema;
    Fields =
        [
            ("ID", "ASDF");
            ("PublishDate", DateTime.Now.ToString());
            ("Title", "This is a title");
            ("Excerpt", "My title was non-obvious");
            ("Body", "Lorem ipsum ad infinitum");
        ] }

Pretty basic, really, but it's nice an easy to read.

At this point, I need to do a little groundwork for the Lucene.net library.  So, I'll add a reference!  Right clicking on an F# project in Visual Studio brings up an options dialog with a field for additional compiler flags.  I wanted to reference a DLL in a "Vendor" directory, so I put in something like this: -R "..\Vendor\Lucene.net 2.1\Lucene.Net.dll" (with "all configurations" selected up top, it wouldn't make much sense to only reference it for debug builds).  This is admittedly a warty way of adding a reference to a project, but I expect we'll get a real "add reference" dialog when the F# CTP is released later this year.  On the bright side, I didn't have to wait 8 minutes for the dialog to show up.

There are a couple of different things steps to turning the Document type into something Lucene can handle.  Lucene lives in an imperative world, so the process of creating an indexed document goes something like: create a Lucene document, append fields (with options) one by one.  For this particular library, indexing instructions for each field live live in the Schema, and I'll grab those as necessary when I'm preparing Lucene compatible fields from my Document.

Rather than writing any real logic to search my Schema's field list, I decided to change it to store a Map of fields (where a Map is a HashTable, Dictionary, or whatever else you want to call it).  This turned out to be incredibly easy since there's a Map.of_list function that takes a list of tuples (which I was using before), and breaks each one up into a key/value pair:

type IndexSchema = {
    Name : string;
    Version : float;
    Fields : Map<string, FieldOption list> }
 
let defaultSchema =
    let options = [ Indexed; Tokenized; Type(String) ]
    {   Name = "Mubble.Content";
        Version = 1.0;
        Fields = Map.of_list
                  [ ("ID", [ Unique; Indexed; Stored; Required; ] );
                    ( "PublishDate", [ Indexed; Type(Date Minute) ] );
                    ( "Title", options );
                    ( "Excerpt", options );
                    ( "Body", MultiValue :: options ) ]
    }

It's pretty cool that my defaultSchema declaration only required a minor modification to handle the new Map.

Each field in the schema has a number of associated options that need to be translated to their Lucene equivalents.  The Lucene document field constructor (at least, one of the overloads) takes four parameters: name, value, store and index.  I wrote a small function that takes a DocField and gives back a corresponding Lucene field.  My first hack looks something like this:

open Lucene.Net
let buildField (schema : IndexSchema) (field : DocField) =
    let name = ref (fst field)
    let value = ref (snd field)
    let store = ref Documents.Field.Store.NO
    let index = ref Documents.Field.Index.TOKENIZED
 
    let options = getSchemaField schema !name
 
    let setOption o =
        match o with
        | FieldOption.Stored -> store := Documents.Field.Store.YES
        | FieldOption.Compressed -> store := Documents.Field.Store.COMPRESS
        | FieldOption.Indexed -> index := Documents.Field.Index.UN_TOKENIZED
        | FieldOption.Tokenized -> index := Documents.Field.Index.TOKENIZED
        | FieldOption.Type(t) -> value := format t !value
        // The following aren't relevant here
        | FieldOption.Required | FieldOption.Unique
        | FieldOption.MultiValue -> ()
 
    options |> List.iter setOption
 
    let lfield = new Documents.Field(!name, !value, !store, !index)
    lfield

There are a couple of new bits of F# in that function, starting with the ref function calls.  let name = ref (fst field) says "bind a new reference cell to name using the first value in the tuple field. The reference cell stuff in the F# library is defined like this:

type 'a ref = { mutable contents: 'a }
let (!) r = r.contents
let (:=) r v = r.contents <- v
let ref v = { contents = v }
 

The first line creates the ref type, with a mutable contents label.  "Mutable" means exactly that... the value in contents can change within the scope of ref.  The next two lines are some helper operators, which I've used in my buildField function above.  ! is a bit of a non-intuitive choice, but it returns the contents of the passed in reference.  The assignment operator := sets the value of contents for the passed in reference.  The last line is a helper function for creating reference cell instances.  I could actually create an instance directly if I were so inclined: let name = { contents = "something" }, but the ref function call is "the standard".

My buildField function uses a few of these reference cells which will ultimately be passed into the Lucene field constructor, based on the options getSchemaField gives us.  I defined a local function named setOption that sets the value of the appropriate reference cell (based on the passed in option). The setOption function is just a basic pattern match over a discriminated union, though the FieldOption.Type(t) case is somewhat special. Assuming the value of o is a FieldOption.Type, it binds the actual type to t and passes it and the raw string value to a helper format function:

let format (t : FieldType) raw =
    match t with
    | FieldType.Date(r) ->
        let d = DateTime.Parse(raw)
        DateTools.DateToString(d, (convertResolution r))
    | String -> raw

This function is pretty straightforward (since I'm only dealing with strings and dates at the moment). If the FieldType is a Date, it uses the Lucene DateTools utility to format the string according to the specified resolution. Yet another helper function, convertResolution, turns our pretty DateResolution value into a janky Lucene value:

let convertResolution r =
    match r with
    | Day -> DateTools.Resolution.DAY
    | Hour -> DateTools.Resolution.HOUR
    | Millisecond -> DateTools.Resolution.MILLISECOND
    | Minute -> DateTools.Resolution.MINUTE
    | Month -> DateTools.Resolution.MONTH
    | Second -> DateTools.Resolution.SECOND
    | Year -> DateTools.Resolution.YEAR

Pattern matching is a nice, succinct way of doing something like that, which is great!  It does, however, have another compelling ability that I've fallen in love with.  The F# compiler understands the patterns and will give an "incomplete pattern match" warning at compile time if every possible option isn't accounted for.  So if I decided to add a Microsecond option to my DateResolution type and didn't account for it in my convertResolution function, the compiler would let me know.  I've actually taken to making that an error rather than a warning using the --warn-as-error 25 compiler directive.  It's sweet.

With the buildField function as a starting point, it's pretty easy to create a function that takes one of my Document values and converts it to its final Lucene representation:

let convert (doc : Document) =
    let ldoc = new Documents.Document()
 
    let lFields = doc.Fields |> List.map (buildField doc.Schema)
 
    lFields |> List.iter ldoc.Add
    ldoc

This is as particularly good example of what I like about this style of programming. Given the right supporting functions, the entire document conversion process is incredibly expressive. This function creates a new Lucene document to work with. It then uses the built in List.map function to create a list of fields that the Lucene document can cope with, and iterates through them, appending each one.

I actually used something called "partial function application" with the List.map function.  This basically means "I'll give the compiler a function and only one of its two arguments, the compiler will then create a new function that takes the remaining argument for me".  I could have been a bit more explicit and used a lambda (or anonymous function, if you want to call it that) like so:

let lFields = doc.Fields |> List.map (fun x -> buildField doc.Schema x)

The partial function version it a little less encumbered by syntax, and I've found myself using those wherever possible.

There are two other bits of function coolness going on here as well.  The first is the pipeline operator, |>, which works just like a pipeline from the command shell.  Everything to the left gets passed to the right side as a parameter.  I'm also passing a the ldoc.Add function from the Lucene library as a first class function to the built in List.iter function.  List.iter takes a single argument function and passes each element of the list to it, so the ldoc.Add function works fine there.

That's basically it for converting documents (with the appropriate schema information) to something we can ultimately index. It's interesting that it takes me about 8x as long to write these posts as the code, though.

As an aside, my buildField function is very much imperative. If I wanted to, I could rewrite it like this:

let buildField2 (schema : IndexSchema) (field : DocField) =
    let getOptions l init =
        let name, value, store, index = init
        match l with
        | h :: t ->
            match h with
            | FieldOption.Stored ->
                (name, value, Documents.Field.Store.YES, index)
            | FieldOption.Compressed ->
                (name, value, Documents.Field.Store.COMPRESS, index)
            | FieldOption.Indexed ->
                (name, value, store, Documents.Field.Index.UN_TOKENIZED)
            | FieldOption.Tokenized ->
                (name, value, store, Documents.Field.Index.TOKENIZED)
            | FieldOption.Type(t) ->
                (name, format t value, store, index)
            // The following aren't relevant here
            | FieldOption.Required | FieldOption.Unique
            | FieldOption.MultiValue -> init
        | [] -> init
 
    let name, value, store, index =
        (
            fst field,
            snd field,
            Documents.Field.Store.NO,
            Documents.Field.Index.TOKENIZED
        )
            |> getOptions (getSchemaField schema (fst field))
 
    let lfield = new Documents.Field(name, value, store, index)
    lfield

I find that ugly. Given that the imperative bits are well encapsulated in a function, I'm sticking with the first version for now... although it will probably bother me that I can't come up with a pretty, functional way of doing that.

The source code to this point is available on Google Code.  It's revision 2 in the Subversion repository, and there's also a zipped up snapshot.

kick it on DotNetKicks.com

Trackbacks/Pingbacks (2)

  1. Dew Drop - May 14, 2008 | Alvin Ashcraft's Morning Dew on Wednesday, May 14, 2008 at 9:54 am

    [...] Building an F# Powered Indexing System (Part 2) (Kurt) [...]

  2. Building an F# powered indexing system < Trying This Again on Thursday, May 15, 2008 at 8:39 am

    [...] again, and again, and once more < C# vs F#: some parallel refactoring (and generalization) Building an F# powered indexing system (part 2) [...]