## Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server

Quite often we come across a requirement where we may need to perform some sort of fuzzy string grouping or data correlation. For example, we may want to correlate the customer records of a database by identifying records that are similar but not necessarily exactly the same (due to spelling mistakes for example). Obviously a simple group by, will not successfully group such data. We will need to employ what is commonly referred to as a distance algorithm or a string metric in order to determine how close 2 string values are.

In this post:

### 1. SoundEx

SoundEx is one such algorithm built in to Sql server, but it doesnt really perform well on calculating string value distances since it is a phonetic encoding algorithm. Vowels are ignored and the resulting value consists of a letter followed by 3 digits calculated like so:

1. Replace consonants with digits as follows (but do not change the first letter):

• b, f, p, v => 1
• c, g, j, k, q, s, x, z => 2
• d, t => 3
• l => 4
• m, n => 5
• r => 6
2. Collapse adjacent identical digits into a single digit of that value.
3. Remove all non-digits after the first letter.
4. Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a letter and three digits.

So let's try SoundEx with a few variations and see how it performs

`select Soundex('LLoyds') -- returns L432 select Soundex('Loyds')  -- returns L320 select Soundex('Brighton') -- returns B623 select Soundex('Bristol')  -- returns B623 `

As you can see, Soundex returned the same code for Brighton and Bristol, while at the same time it returned a different code for Loyds and LLoyds. Such results make you think twice before employing this function in a production environment.

So, is this all that is available to us? Of course not.

### 2. SimMetrics

The Web Intelligence Group within the University of Sheffield have released a wonderful library over at sourceforge just for this purpose. SimMetrics is the library you would need to use. The developer of this library and credit for its development goes to Sam Chapman

Here are only some of the similarity metrics in SimMetrics that can be briefly described in simple English. For an even more detailed explanation and for the full list of included functions have a look over at http://www.dcs.shef.ac.uk/~sam/stringmetrics.html or on Wikipedia

• Hamming distance
Measures the minimum number of substitutions required to change one into the other
• Levenshtein distance
Measures the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character
• Needleman-Wunch distance or Sellers Algorithm
Similar to Levenstein with an added cost adjustment to the cost of a gap of the operation
• Smith-Waterman distance
Similar to Levenstein but with two adjustable parameters for cost functions
• Gotoh Distance or Smith-Waterman-Gotoh distance
Extension of the Smith Waterman with variable gap costs
• Jaro distance metric
This distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match. The higher the distance for two strings is, the more similar the strings are
• Jaro Winkler
Similar to Jaro but with more favourable ratings for strings that match from the beginning
• Cosine similarity
Often used to compare documents in text mining.

There are plenty more in this library, but frankly, the descriptions are pretty baffling and dont fit in simple one-liner explanation. Follow the links above for even more information on this.

What I would like instead to focus on is how we can exploit the power of such a library within Sql Server.

### 3. Adding string Metric functions in MS Sql Server In Sql Server 2005 upwards we can leverage the power of CLR to introduce these string metrics into our SQL queries. And I will walk you through the process, step by step.

2. Open and compile the project so you have your SimMetrics.dll in your bin folder.
3. Create a new C# Class Library project and call it TextFunctions (or if you're feeling lazy, download it here)
4. Add a reference to the SimMetrics.dll
5. Add a class and call it SimMetrics
6. Copy and paste the following Code into your Class file
```using System;
using System.Collections.Generic;
using System.Text;
using System.Data.SqlTypes;
using SimMetricsMetricUtilities;

public class StringMetrics     {

static StringMetrics()
{
_Levenstein = new Levenstein();
_NeedlemanWunch = new NeedlemanWunch();
_SmithWaterman = new SmithWaterman();
_SmithWatermanGotoh = new SmithWatermanGotoh();
_SmithWatermanGotohWindowedAffine = new SmithWatermanGotohWindowedAffine();
_Jaro = new Jaro();
_JaroWinkler = new JaroWinkler();
_ChapmanLengthDeviation = new ChapmanLengthDeviation();
_ChapmanMeanLength = new ChapmanMeanLength();
_QGramsDistance = new QGramsDistance();
_BlockDistance = new BlockDistance();
_CosineSimilarity = new CosineSimilarity();
_DiceSimilarity = new DiceSimilarity();
_EuclideanDistance = new EuclideanDistance();
_JaccardSimilarity = new JaccardSimilarity();
_MatchingCoefficient = new MatchingCoefficient();
_MongeElkan = new MongeElkan();
_OverlapCoefficient = new OverlapCoefficient();
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble Levenstein(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_Levenstein.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble NeedlemanWunch(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_NeedlemanWunch.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble SmithWaterman(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_SmithWaterman.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble SmithWatermanGotoh(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_SmithWatermanGotoh.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble SmithWatermanGotohWindowedAffine(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_SmithWatermanGotohWindowedAffine.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble Jaro(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_Jaro.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble JaroWinkler(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_JaroWinkler.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble ChapmanLengthDeviation(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_ChapmanLengthDeviation.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble ChapmanMeanLength(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_ChapmanMeanLength.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble QGramsDistance(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_QGramsDistance.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble BlockDistance(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_BlockDistance.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble CosineSimilarity(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_CosineSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble DiceSimilarity(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_DiceSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble EuclideanDistance(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_EuclideanDistance.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble JaccardSimilarity(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_JaccardSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble MatchingCoefficient(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_MatchingCoefficient.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble MongeElkan(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_MongeElkan.GetSimilarity(firstWord.Value, secondWord.Value));
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlDouble OverlapCoefficient(SqlString firstWord, SqlString secondWord)
{
if (firstWord.IsNull || secondWord.IsNull)
return 0;

return new SqlDouble(_OverlapCoefficient.GetSimilarity(firstWord.Value, secondWord.Value));
}

}
```

2. We need to make sure that our target sql server is CLR Enabled so we need to run:
```EXEC sp_configure'clr enabled', 1
RECONFIGURE```
1. Install the library and the functions onto SQL Server
```CREATE ASSEMBLY [TextFunctions]
AUTHORIZATION [dbo]
FROM 'c:\SqlAssembly\TextFunctions.dll' WITH PERMISSION_SET = SAFE GO CREATE FUNCTION Levenstein(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.Levenstein
GO CREATE FUNCTION NeedlemanWunch(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.NeedlemanWunch
GO CREATE FUNCTION SmithWaterman(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWaterman
GO CREATE FUNCTION SmithWatermanGotoh(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWatermanGotoh
GO CREATE FUNCTION SmithWatermanGotohWindowedAffine(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWatermanGotohWindowedAffine
GO CREATE FUNCTION Jaro(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.Jaro
GO CREATE FUNCTION JaroWinkler(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.JaroWinkler
GO CREATE FUNCTION ChapmanLengthDeviation(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.ChapmanLengthDeviation
GO CREATE FUNCTION ChapmanMeanLength(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.ChapmanMeanLength
GO CREATE FUNCTION QGramsDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.QGramsDistance
GO CREATE FUNCTION BlockDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.BlockDistance
GO CREATE FUNCTION CosineSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.CosineSimilarity
GO CREATE FUNCTION DiceSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.DiceSimilarity
GO CREATE FUNCTION EuclideanDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.EuclideanDistance
GO CREATE FUNCTION JaccardSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.JaccardSimilarity
GO CREATE FUNCTION MatchingCoefficient(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.MatchingCoefficient
GO CREATE FUNCTION MongeElkan(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.MongeElkan
GO CREATE FUNCTION OverlapCoefficient(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.OverlapCoefficient
GO ```
1. Now we are ready to use the functions our newly available string metrics functions.

### 4. Evaluating metric accuracy and comparing Metrics

All string metric functions accept two string arguments and return a float value between 0 and 1. The closer the score reaches 1 means that the metric we have applied is telling us the 2 input strings are a match.

Now that we have plenty of Metrics to choose from, we will want to evaluate them and see which one is best for us. This is where the following function comes in handy:

```CREATE FUNCTION dbo.CompareStringMetrics(@firstword [nvarchar](255), @secondword [nvarchar](255)) RETURNS TABLE AS RETURN (     SELECT dbo.Jaro(@firstword, @secondword) as Score, 'Jaro' as Metric
UNION SELECT dbo.JaroWinkler(@firstword, @secondword), 'JaroWinkler'     UNION SELECT dbo.BlockDistance(@firstword, @secondword), 'BlockDistance'     UNION SELECT dbo.ChapmanLengthDeviation(@firstword, @secondword), 'ChapmanLengthDeviation'     UNION SELECT dbo.ChapmanMeanLength(@firstword, @secondword), 'ChapmanMeanLength'     UNION SELECT dbo.CosineSimilarity(@firstword, @secondword), 'CosineSimilarity'     UNION SELECT dbo.DiceSimilarity(@firstword, @secondword), 'DiceSimilarity'     UNION SELECT dbo.EuclideanDistance(@firstword, @secondword), 'EuclideanDistance'     UNION SELECT dbo.JaccardSimilarity(@firstword, @secondword), 'JaccardSimilarity'
UNION SELECT dbo.Levenstein(@firstword, @secondword), 'Levenstein'     UNION SELECT dbo.MatchingCoefficient(@firstword, @secondword), 'MatchingCoefficient'     UNION SELECT dbo.MongeElkan(@firstword, @secondword), 'MongeElkan'     UNION SELECT dbo.NeedlemanWunch(@firstword, @secondword), 'NeedlemanWunch'     UNION SELECT dbo.OverlapCoefficient(@firstword, @secondword), 'OverlapCoefficient'     UNION SELECT dbo.QGramsDistance(@firstword, @secondword), 'QGramsDistance'     UNION SELECT dbo.SmithWaterman(@firstword, @secondword), 'SmithWaterman'     UNION SELECT dbo.SmithWatermanGotoh(@firstword, @secondword), 'SmithWatermanGotoh'     UNION SELECT dbo.SmithWatermanGotohWindowedAffine(@firstword, @secondword), 'SmithWatermanGotohWindowedAffine' )```

This means that now we can easily evaluate the performance and the accuracy of each metric by running queries like:

```Select * from CompareStringMetrics('Loyds', 'LLoyds') where score > 0.6
Select * from CompareStringMetrics('bristol', 'brighton') where score < 0.5``` Try out various functions on large database tables, the performance is great!

### 5. Conclusion

SimMetrics is a great library for fuzzy text comparisons. This article showed you how you can bring its power into Sql Server. There are plenty of functions to choose from, experiment and see what works best for you. You can be creative on what you want to apply your metric on, e.g  Instead of applying directly on a field you can choose to make simple searchkeys by lower casing fields, striping common text patterns that you dont need (eg Inc, Ltd etc) and removing numbers. There are many ways to apply these sort of functions, what is more suitable for you is your call, since it depends on the context of your problem. If you find any innovative ways of correlating your data share it with us!

source code: (including sql install and uninstall scripts)
TextFunctions.zip