Saturday, March 2, 2013

Full Text searching in NHibernate

There's a library called NHibernate.Search available from NuGet that is supposed to allow for using a Lucene-based full-text search, regardless of what database you're persisting to... I thought this was a great idea.

Problem is, it doesn't work!

The code hasn't been updated in years, the libraries are obsolete and deprecated, and the common components aren't compatible with other libraries from NuGet.  In other words, at this point, it is just this side of completely unusable without a lot of work.

But if you're looking to implement non-database-dependent Full-Text searching on your NHibernate project, there are ways... with surprisingly little effort.  This article will describe how I got it going for one of my NHibernate projects.

First, I installed the Lucene.Net and Common.Logging packages from the NuGet repository.  If you don't know how to do that, google it!  You will be dismayed that you have gone this long without knowing how to do this.

I wanted to define the string fields in my entities that could use a full-text index, so I created a FullTextIndexedAttribute class:


[AttributeUsage(AttributeTargets.Property, AllowMultiple = false)]
public class FullTextIndexedAttribute : Attribute
{
public int IndexPriority { get; set; }
public FullTextIndexedAttribute() { }
public FullTextIndexedAttribute(int indexPriority)
{
IndexPriority = indexPriority;
}
}

Note the constructor that lets me set the index priority.  This is useful when trying to determine relevance.  Some fields have higher significance in their contents than others.  This class allowed me to decorate properties in my model domain like so:

public partial class Address
{
public virtual Guid AddressId { get; set; }

[FullTextIndexed(2)]
[NotNull]
public virtual string Address1 { get; set; }

[FullTextIndexed(3)]
public virtual string Address2 { get; set; }

[FullTextIndexed(1)]
[NotNull]
public virtual string City { get; set; }

[NotNull]
public virtual string State { get; set; }

[NotNull]
public virtual string ZipCode { get; set; }

[FullTextIndexed(0)]
public virtual string Phone { get; set; }
}

At this point I should probably point out that I am using Fluent NHibernate Automapping; all of my ID columns are regularly named, and all are GUIDs.  This really cut down on the reflection and coding time for implementing Full Text Indexing.  If you have things manually mapped without a regular identifier pattern, you'll have some extra coding to do at some point, but that's out of scope for this article.

I created a class of static functions I called FullText that actually did the "grunt work" of interfacing with the Lucene engine:

public class FullText
{
private static readonly Common.Logging.ILog Logger = Common.Logging.LogManager.GetCurrentClassLogger();
private static string indexPath;

private static FSDirectory _FullSearchDirectory;
public static FSDirectory FullSearchDirectory
{
get
{
if (_FullSearchDirectory == null)
{
// this is collected in my configuration settings.
indexPath = MyCustomSettings.Instance.FullTextSearching.IndexLocation;

_FullSearchDirectory = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath));

if (IndexWriter.IsLocked(_FullSearchDirectory))
{
IndexWriter.Unlock(_FullSearchDirectory);
}

var lockFilePath = System.IO.Path.Combine(indexPath, "write.lock");
if (System.IO.File.Exists(lockFilePath))
{
System.IO.File.Delete(lockFilePath);
}
}

return _FullSearchDirectory;
}
}

/// <summary>
/// Collects the list of PropertyInfo objects decorated with the 
/// FullTextIndexedAttribute class for the supplied object.
/// </summary>
/// <param name="obj"></param>
/// <returns></returns>
public static IEnumerable<PropertyInfo> GetIndexableProperties(object obj)
{
return GetIndexableProperties(obj.GetType());
}

/// <summary>
/// Collects the list of PropertyInfo objects decorated with the
/// FullTextIndexedAttribute class for the supplied type.
/// </summary>
/// <param name="type"></param>
/// <returns></returns>
public static IEnumerable<PropertyInfo> GetIndexableProperties(Type type)
{
var props =
type.GetProperties()
.Where(x => x.GetCustomAttributes<FullTextIndexedAttribute>().Any())
.Select(x => new
{
PropInfo = x,
Priority = x.GetCustomAttributes<FullTextIndexedAttribute>().First().IndexPriority
})
.OrderBy(x => x.Priority)
.Select(x => x.PropInfo);

return props;
}

/// <summary>
/// Determines if the type has Full-Text indexable properties.
/// </summary>
/// <param name="type"></param>
/// <returns></returns>
public static bool IsIndexable(Type type)
{
return type.GetProperties()
.Select(x => x.GetCustomAttributes<FullTextIndexedAttribute>())
.Any();
}

/// <summary>
/// Determines if the object has Full-Text indexable properties.
/// </summary>
/// <param name="obj"></param>
/// <returns></returns>
public static bool IsIndexable(object obj)
{
return IsIndexable(obj.GetType());
}

/// <summary>
/// Returns a list of object IDs that match the search criteria.
/// </summary>
/// <param name="searchText"></param>
/// <param name="type"></param>
/// <param name="maxRows"></param>
/// <returns></returns>
public static Guid[] Search(string searchText, Type type, int maxRows)
{
if (string.IsNullOrWhiteSpace(searchText)) return new Guid[] { };

Logger.Trace(t => t("Searching Type {0} for max results of {2} with search string '{3}'...", type.Name, maxRows, searchText));
using (var searcher = new IndexSearcher(FullSearchDirectory, true))
using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
{
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30,
GetIndexableProperties(type).Select(x => x.Name).ToArray(), analyzer);
Query query;
try
{
query = parser.Parse(searchText.Trim());
}
catch (ParseException)
{
// if a ParseException is thrown, it's likely due to extraneous odd symbols
// in the search text, so this will escape them.
query = parser.Parse(QueryParser.Escape(searchText.Trim()));
}

var hits = searcher.Search(query, null, maxRows, Sort.RELEVANCE).ScoreDocs;
var results = hits.Select(x => Guid.Parse(searcher.Doc(x.Doc).Get(string.Format("{0}Id", type.Name))))
.ToArray();

Logger.Trace(t => t("Found {0} hits for '{1}'.", results.Count(), searchText));
return results;
}
}

/// <summary>
/// Removes all indices from the Full Text index; used for regular maintenance.
/// </summary>
public static void ClearIndices()
{
using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
using (var writer = new IndexWriter(FullSearchDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
writer.DeleteAll();
}
}

/// <summary>
/// Refreshes the index of the supplied entity, assuming that the entity is indexable.
/// </summary>
/// <param name="entity"></param>
public static void UpdateIndexOnEntity(object entity)
{
var indexables = GetIndexableProperties(entity);
if (indexables.Count() == 0) return;

var entityType = entity.GetType();
var entityName = entityType.Name;
var entityIdName = string.Format("{0}Id", entityName);
var entityIdValue = entityType.GetProperty(entityIdName).GetValue(entity).ToString();

Logger.Trace(t => t("Updating FT Index for {0} {1}...", entityIdName, entityIdValue));

var term = new Term(entityIdName);
using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
using (var writer = new IndexWriter(FullSearchDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
var doc = CreateDocument(entity, entityIdName, entityIdValue, indexables);
writer.UpdateDocument(term, doc);
Logger.Trace(t => t("Updated the FT index for {0} {1}", entityIdName, entityIdValue));
}
}

/// <summary>
/// Removes the index of the supplied entity.
/// </summary>
/// <param name="entity"></param>
public static void DeleteIndexForEntity(object entity)
{
var indexables = GetIndexableProperties(entity);
if (indexables.Count() == 0) return;

var entityType = entity.GetType();
var entityName = entityType.Name;
var entityIdName = string.Format("{0}Id", entityName);
var entityIdValue = entityType.GetProperty(entityIdName).GetValue(entity).ToString();

Logger.Trace(t => t("Deleting FT Index for {0} {1}...", entityIdName, entityIdValue));
var searchQuery = new TermQuery(new Term(entityIdName, entityIdValue));
using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
using (var writer = new IndexWriter(FullSearchDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
writer.DeleteDocuments(searchQuery);
Logger.Trace(t => t("Deleted the FT index for {0} {1}", entityIdName, entityIdValue));
}
}

/// <summary>
/// Creates a new index for the supplied entity.
/// </summary>
/// <param name="entity"></param>
public static void InsertIndexOnEntity(object entity)
{
var indexables = FullText.GetIndexableProperties(entity);
if (indexables.Count() == 0) return;

//the just-persisted entity has full-text indexable fields, so create an indexed document
var entityType = entity.GetType();
var entityName = entityType.Name;
var entityIdName = string.Format("{0}Id", entityName);
var entityIdValue = entityType.GetProperty(entityIdName).GetValue(entity).ToString();

Logger.Trace(t => t("Inserting FT Index for {0} {1}...", entityIdName, entityIdValue));

using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
using (var writer = new IndexWriter(FullSearchDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
var doc = CreateDocument(entity, entityIdName, entityIdValue, indexables);
writer.AddDocument(doc);
Logger.Trace(t => t("Inserted the FT index for {0} {1}", entityIdName, entityIdValue));
}
}

/// <summary>
/// Creates new indices for the collection of indexable objects supplied
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="entities"></param>
public static void PopulateIndex<T>(IEnumerable<T> entities) where T : class
{
var entityType = typeof(T);
if (!IsIndexable(entityType)) return;

var entityName = entityType.Name;
var entityIdName = string.Format("{0}Id", entityName);

var indexables = GetIndexableProperties(entityType);

Logger.Info(i => i("Populating the Full-text index with values from the {0} entity...", entityName));
using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
using (var writer = new IndexWriter(FullSearchDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
foreach (var entity in entities)
{
var entityIdValue = entityType.GetProperty(entityIdName).GetValue(entity).ToString();
var doc = CreateDocument(entity, entityIdName, entityIdValue, indexables);
writer.AddDocument(doc);
}
}
Logger.Info(i => i("Index population of {0} is complete.", entityName));
}

private static Document CreateDocument(object entity, string entityIdName, string entityIdValue, IEnumerable<PropertyInfo> indexables)
{
var doc = new Document();
Logger.Trace(t => t("Adding {0} of {1}...", entityIdName, entityIdValue));
doc.Add(new Field(entityIdName, entityIdValue, Field.Store.YES, Field.Index.NOT_ANALYZED));
foreach (var property in indexables)
{
var propertyName = property.Name;
var propertyValue = property.GetValue(entity).ToStringOrEmpty();
Logger.Trace(t => t("Adding property '{0}'...", propertyName));
doc.Add(new Field(propertyName, propertyValue, Field.Store.YES, Field.Index.ANALYZED));
}
return doc;
}

Giving props where props are due, a lot of this logic and code was derived from Mikhail Tsennykh's blog here. The class above allows me to insert, update, delete, clean up, and search on the Lucene repository, and also determine which entities/ properties are indexable.  Note that I created an extension method .ToStringOrEmpty() to address potential null values.

Now we have to tell NHibernate that whenever an entity is changed, the Lucene repository should be similarly updated.  We do this by implementing the IPostUpdateEventListener, IPostDeleteEventListener, and IPostInsertEventListener interfaces in the NHibernate.Event namespace, and then attaching it to the NHibernate configuration.  Sound complicated?  It's shockingly simple:


public class LuceneFTIndexEventListener : IPostUpdateEventListener, IPostDeleteEventListener, IPostInsertEventListener
{
private static readonly Common.Logging.ILog Logger = Common.Logging.LogManager.GetCurrentClassLogger();

/// <summary>
/// Updates the Full-Text index on the entity.
/// </summary>
/// <param name="e"></param>
public void OnPostUpdate(PostUpdateEvent e)
{
FullText.UpdateIndexOnEntity(e.Entity);
}

/// <summary>
/// Removes Full-Text Indices on deleted entities.
/// </summary>
/// <param name="e"></param>
public void OnPostDelete(PostDeleteEvent e)
{
FullText.DeleteIndexForEntity(e.Entity);
}

/// <summary>
/// Adds a Full-Text index on the entity.
/// </summary>
/// <param name="e"></param>
public void OnPostInsert(PostInsertEvent e)
{
FullText.InsertIndexOnEntity(e.Entity);
}
}


That's all there is to the implementation of the interfaces.  To attach to the NHibernate configuration, in your NHibernate helper class (where you instantiate your ISessionFactory) you'll need to put something like this in:


//collect database-level info from web.config file
Logger.Debug(d => d("Collecting database information from the configuration file..."));
var webcfg = new Configuration();
webcfg.Configure();

//add the Lucene-based Full-Text Search population interceptors.
Logger.Debug(d => d("Attaching Full-Text engine listeners..."));
webcfg.SetListener(NHibernate.Event.ListenerType.PostUpdate, new LuceneFTIndexEventListener());
webcfg.SetListener(NHibernate.Event.ListenerType.PostInsert, new LuceneFTIndexEventListener());
webcfg.SetListener(NHibernate.Event.ListenerType.PostDelete, new LuceneFTIndexEventListener());


OK, at this point, the Lucene Full-Text Index will be populated and maintained.  Great!  Only one thing left to do, and that's actually use it for full-text searching.  In my Address repository class, I create a method like so:


public IList<Address> Search(string searchString, int maxRows)
{
if (maxRows <= 0) throw new ArgumentOutOfRangeException("Max Rows must be greater than zero.");
if (string.IsNullOrEmpty(searchString)) return GetFirst(maxRows);

return _Session
.CreateCriteria(typeof(Address))
.Add(Restrictions.In("AddressId", FullText.Search(searchString, typeof(Address), maxRows)))
.List<Address>();
}

Then, by invoking the Search method, I will collect a set of record Id's from the Full Text engine that correspond to the search text submitted.  I then collect the actual entities from the database using the ids.

That's all there is to it!





Friday, January 18, 2013

A few aggregate functions Sql Server ought to have

There's a few aggregate functions I would love for Sql Server to have, but don't exist.  Fortunately, with the CLR integration, you can make up for these.

One of the ones I have always been miffed about is FIRST.  How many times are you neck-deep in an aggregate query and you just don't care what data is in a certain column, since all the results will be the same anyway?  Using MAX() might seem perfectly reasonable, but if there are minute differences in the data you can end up "mixing and matching" against different rows.  Consider the following situation:

FirstName LastName MiddleInitial EmployeeID Paycheck PayDate
John Smith A 12345 $2,000.00 Jan 25, 2013
JOHN SMITH ALBERT 12345 $2,000.00 Jan 19, 2013

MAX() will examine the data in the column and return the maximum value, but you can get odd results.

select MAX(firstname), MAX(lastname), MAX(middleinital) from #foo;

John Smith ALBERT

As you can see here, it's more or less the same data, but it can be jarring.  This gets to be a bigger deal when dealing with columns that can depend on each other, like Address1 and Address2; different data can be in these, and together they represent a consistent, usable mailing address.  Individually, however, they may give and end result that is really confusing:

Employeeid Address1 Address2
12345 25 East Nowhere Suite 001A
12345 25 East Nowhere St. Suite 001A Attention: Bill

...try THAT with MAX()...

So what to do?  The answer is, FIRST.  Grab the first row that has data in that column.  The CLR code looks like this:


using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.UserDefined, MaxByteSize = -1)]
public struct First : IBinarySerialize
{
public string first;
public bool HasRows;

public void Init()
{
HasRows = false;
first = null;
}

public void Accumulate(SqlString Value)
{
HasRows = true;

if (!Value.IsNull && first == null)
{
first = Value.Value;
}
}

public void Merge(First Group)
{
HasRows = Group.HasRows;
if (Group.first != null) first = Group.first;
}

public SqlString Terminate()
{
if (!HasRows) return SqlString.Null;
return new SqlString(first);
}

public void Read(System.IO.BinaryReader r)
{
HasRows = r.ReadBoolean();
var isNull = r.ReadBoolean();
first = isNull ? null : r.ReadString();
}

public void Write(System.IO.BinaryWriter w)
{
w.Write(HasRows);
var isNull = (first == null);
w.Write(isNull);
if (!isNull) w.Write(first);
}
}

How about a bitwise OR aggregation?  Looking for any 1 in a group of bits...

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct GroupOR
{
public bool hasRows;
public bool hasTrue;

public void Init()
{
hasRows = false;
hasTrue = false;
}

public void Accumulate(SqlBoolean Value)
{
hasRows = true;
if (Value.IsTrue) hasTrue = true;
}

public void Merge(GroupOR Group)
{
hasTrue = Group.hasTrue;
hasRows = Group.hasRows;
}

public SqlBoolean Terminate()
{
if (!hasRows) return SqlBoolean.Null;
if (hasTrue) return SqlBoolean.One;
return SqlBoolean.Zero;
}
}

...likewise, there is a bitwise AND... where all bits MUST be one...

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct GroupAND
{
public bool hasRows;
public bool hasFalse;
public bool hasNull;

public void Init()
{
hasRows = false;
hasFalse = false;
hasNull = false;
}

public void Accumulate(SqlBoolean Value)
{
hasRows = true;
if (Value.IsFalse) hasFalse = true;
if (Value.IsNull) hasNull = true;
}

public void Merge(GroupAND Group)
{
hasFalse = Group.hasFalse;
hasNull = Group.hasNull;
hasRows = Group.hasRows;
}

public SqlBoolean Terminate()
{
if (!hasRows) return SqlBoolean.Null;
if (hasFalse || hasNull) return SqlBoolean.Zero;
return SqlBoolean.One;
}
}

You get the idea.  I use these kinds of functions every day.