Building an F# powered indexing system

When I first started dabbling in F#, I really struggled to understand how someone (in particular, me) would sit down and start writing an application from scratch.  Project Euler puzzles are a great way to learn syntax (and probably the best place to start), but I would have loved to see a real application's source with a sort of "here's how it was built" narrative.  So that's what I'm going to do!

I sat down this morning to start moving one of the tools I use to F#.  Doing a rewrite is partially a learning exercise, but there's quite a bit of work that I'd have to do even if I were keeping it in C# and various bits of it really lend themselves to my newfound functional abilities. The application is a utility for managing the content index for my publishing system. It's based on Lucene, and every single page on a Mubble powered site uses it to generate lists of content.

In theory, this will be a series of posts covering these steps (part two is now available):

  • Writing an indexer
  • Building queries and running them
  • Testing the application
  • Turning it into a useful .NET library (most of my code is in C#, afterall)

Lucene.NET is a really good place to start for something like this, but I really want to divorce the exposed functionality from Lucene.NET itself. It's a near direct port of Java Lucene, which is good in some ways. It *is* relatively easy to find how-tos and such for the Java version that apply directly to the .NET version. It's bad because many Java idioms don't match up real well to .NET idioms, leaving me with a dirty feeling when I deal directly with it.

Lucene's API is also pretty basic, and in a server side context has a number of gotchas that can make things difficult. I need to account for concurrency, in particular, since I can only write to any given index once at a time.

Indexing applications are divided into roughly two areas: maintaining the index (indexing) and querying the index. I chose to start with the "indexing" portion of that equation, mainly because I have a working "prototype" to base it on.

My first few hours were spent building the structure of an "update" instruction. I chose to break the instruction into two areas: a loose schema defining how various fields were treated, and a document consisting of field/value combinations. The schema structure is relatively straightforward, and an F# record type seemed appropriate. A schema needs a name, a version and a collection of fields (with options):

type IndexSchema = { Name : string; Version : float; Fields : SchemaField list }

The SchemaField type needs two basic pieces of information, a label and a set of options. I don't know about you, but when I read "two pieces of information", I think tuple:

type SchemaField = string * FieldOption list

Field options map almost directly to the Lucene underneath, and are usually just a flag indicating whether something should be tokenized, stored, etc. So, basically an enum:

type FieldOption =
    Unique | Indexed | Stored | Compressed | MultiValue
    | Required | Tokenized

In addition to the various option flags listed above, we need some way of controlling how values are formatted on their way into Lucene. Dates, in particular, need some special treatment before indexing. Discriminated unions are a great way of accounting for this type of requirement, making the FieldOption declaration look something like this:

type FieldType = String | Date
 
type FieldOption =
    Unique | Indexed | Stored | Compressed | MultiValue
    | Required | Tokenized | Type of FieldType

A discriminated union also comes in handy for the FieldType type. Lucene will vary date formatting based on a precision specifier, and that information can actually be encoded as part of the date option for the type:

type DateResolution = Day | Hour | Millisecond | Minute | Month | Second | Year
 
type FieldType = String | Date of DateResolutio

I was relatively impressed with the succinctness of these definitions, especially compared to my corresponding C# where I ended up solving the problem with a base FieldType and a date subtype.

At this point, I can actually create a schema. I just need a set of default options, which I can append to or replace as needed. It looks something like this:

let defaultSchema =
    let options = [ Indexed; Tokenized; Type(String) ]
    {   Name = "Mubble.Content";
        Version = 1.0;
        Fields =
            [
                ("ID", [ Unique; Indexed; Stored; Required; ] );
                ( "PublishDate", [ Indexed; Type(Date Minute) ] );
                ( "Title", options );
                ( "Excerpt", options );
                ( "Body", MultiValue :: options );
            ]
    }

Note that the compiler was able to infer that defaultSchema should have type IndexSchema based on the labels I used. If necessary, I could have told it explicitly that I wanted an IndexSchema, but I didn't have to.

It's always therapeutic to "see" something after I've written code, so at this point I worked up a small SchemaField printer function. The goal was the run through the fields defined in a particular schema, printing the name and options for each. The quick and dirty version looks like this:

let printFields (schema : IndexSchema) =
    let optionToString o =
        match o with
        | FieldOption.Type(t) ->
            match t with
            | Date(p) -> sprintf "Type=Date:%A" p
            | _ -> sprintf "Type=%A" t
        | _ -> sprintf "%A" o
 
    schema.Fields |> List.iter (fun (n,f) ->
        printfn "Field %s" n
        f |> List.iter (fun o -> printfn "\t%s" (optionToString o)))
 
defaultSchema |> printFields

My printFields function takes a schema, defines an inner optionToString function, then iterates over each field in the schema. optionToString is a decent example of pattern matching over discriminated unions. It looks to see if the value passed to it is a FieldOption.Type, assigning it to t if it is. If t is a FieldType.Date, it extracts the precision to p and prints out something like Type=Date:Minute. The underscores catch anything not previously specified, sorta like an "else" block.

On to part two!

kick it on DotNetKicks.com

Comment (1)

  1. raj wrote::

    hi, i am searching for a indeing system that require following.
    i have multiple file which have gbs of data and i want to index that file format
    adv,campaign,ad,chnl,impression,date,clicks,

    if i anyone want to search on clicks on adv and particular date, it should show. will this system will full fill my requirement. or any other.

    Monday, May 26, 2008 at 7:18 am #

Trackbacks/Pingbacks (3)

  1. Dew Drop - May 8, 2008 | Alvin Ashcraft's Morning Dew on Thursday, May 8, 2008 at 8:18 am

    [...] Building an F# Powered Indexing System (Kurt) [...]

  2. [...] Building an F# powered indexing system - The first part of what I hope will be a really good series on using F# to produce some real functionality. [...]

  3. [...] Trying This Again We’ll pretend the first time never happened < Building an F# powered indexing system [...]