Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server

Quite often we come across a requirement where we may need to perform some sort of fuzzy string grouping or data correlation. For example, we may want to correlate the customer records of a database by identifying records that are similar but not necessarily exactly the same (due to spelling mistakes for example). Obviously a simple group by, will not successfully group such data. We will need to employ what is commonly referred to as a distance algorithm or a string metric in order to determine how close 2 string values are.

In this post:

  1. SoundEx in Sql Server
  2. SimMetrics
  3. Adding string Metric functions in MS Sql Server
  4. Evaluating metric accuracy and comparing Metrics
  5. Conclusion + code

1. SoundEx

SoundEx is one such algorithm built in to Sql server, but it doesnt really perform well on calculating string value distances since it is a phonetic encoding algorithm. Vowels are ignored and the resulting value consists of a letter followed by 3 digits calculated like so:

    1. Replace consonants with digits as follows (but do not change the first letter):

      • b, f, p, v => 1
      • c, g, j, k, q, s, x, z => 2
      • d, t => 3
      • l => 4
      • m, n => 5
      • r => 6
    2. Collapse adjacent identical digits into a single digit of that value.
    3. Remove all non-digits after the first letter.
    4. Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a letter and three digits.

 

So let's try SoundEx with a few variations and see how it performs

select Soundex('LLoyds') -- returns L432 
select Soundex('Loyds') -- returns L320
select Soundex('Brighton') -- returns B623
select Soundex('Bristol') -- returns B623

As you can see, Soundex returned the same code for Brighton and Bristol, while at the same time it returned a different code for Loyds and LLoyds. Such results make you think twice before employing this function in a production environment.

So, is this all that is available to us? Of course not.

2. SimMetrics

The Web Intelligence Group within the University of Sheffield have released a wonderful library over at sourceforge just for this purpose. SimMetrics is the library you would need to use. The developer of this library and credit for its development goes to Sam Chapman

Here are only some of the similarity metrics in SimMetrics that can be briefly described in simple English. For an even more detailed explanation and for the full list of included functions have a look over at http://www.dcs.shef.ac.uk/~sam/stringmetrics.html or on Wikipedia

  • Hamming distance 
    Measures the minimum number of substitutions required to change one into the other
  • Levenshtein distance
    Measures the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character
  • Needleman-Wunch distance or Sellers Algorithm
    Similar to Levenstein with an added cost adjustment to the cost of a gap of the operation
  • Smith-Waterman distance
    Similar to Levenstein but with two adjustable parameters for cost functions
  • Gotoh Distance or Smith-Waterman-Gotoh distance
    Extension of the Smith Waterman with variable gap costs
  • Jaro distance metric
    This distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match. The higher the distance for two strings is, the more similar the strings are
  • Jaro Winkler 
    Similar to Jaro but with more favourable ratings for strings that match from the beginning
  • Cosine similarity
    Often used to compare documents in text mining.

There are plenty more in this library, but frankly, the descriptions are pretty baffling and dont fit in simple one-liner explanation. Follow the links above for even more information on this.

What I would like instead to focus on is how we can exploit the power of such a library within Sql Server.

3. Adding string Metric functions in MS Sql Server

string metrics in sql server

In Sql Server 2005 upwards we can leverage the power of CLR to introduce these string metrics into our SQL queries. And I will walk you through the process, step by step.

  1. Download the source code of the SimMetrics library from http://sourceforge.net/projects/simmetrics/
  2. Open and compile the project so you have your SimMetrics.dll in your bin folder.
  3. Create a new C# Class Library project and call it TextFunctions (or if you're feeling lazy, download it here)
  4. Add a reference to the SimMetrics.dll
  5. Add a class and call it SimMetrics
  6. Copy and paste the following Code into your Class file
using System;
using System.Collections.Generic;
using System.Text;
using System.Data.SqlTypes;
using SimMetricsMetricUtilities;

    public class StringMetrics     {

        private static readonly Levenstein _Levenstein;
        private static readonly NeedlemanWunch _NeedlemanWunch;
        private static readonly SmithWaterman _SmithWaterman;
        private static readonly SmithWatermanGotoh _SmithWatermanGotoh;
        private static readonly SmithWatermanGotohWindowedAffine _SmithWatermanGotohWindowedAffine;
        private static readonly Jaro _Jaro;
        private static readonly JaroWinkler _JaroWinkler;
        private static readonly ChapmanLengthDeviation _ChapmanLengthDeviation;
        private static readonly ChapmanMeanLength _ChapmanMeanLength;
        private static readonly QGramsDistance _QGramsDistance;
        private static readonly BlockDistance _BlockDistance;
        private static readonly CosineSimilarity _CosineSimilarity;
        private static readonly DiceSimilarity _DiceSimilarity;
        private static readonly EuclideanDistance _EuclideanDistance;
        private static readonly JaccardSimilarity _JaccardSimilarity;
        private static readonly MatchingCoefficient _MatchingCoefficient;
        private static readonly MongeElkan _MongeElkan;
        private static readonly OverlapCoefficient _OverlapCoefficient;

        static StringMetrics()
        {
            _Levenstein = new Levenstein();
            _NeedlemanWunch = new NeedlemanWunch();
            _SmithWaterman = new SmithWaterman();
            _SmithWatermanGotoh = new SmithWatermanGotoh();
            _SmithWatermanGotohWindowedAffine = new SmithWatermanGotohWindowedAffine();
            _Jaro = new Jaro();
            _JaroWinkler = new JaroWinkler();
            _ChapmanLengthDeviation = new ChapmanLengthDeviation();
            _ChapmanMeanLength = new ChapmanMeanLength();
            _QGramsDistance = new QGramsDistance();
            _BlockDistance = new BlockDistance();
            _CosineSimilarity = new CosineSimilarity();
            _DiceSimilarity = new DiceSimilarity();
            _EuclideanDistance = new EuclideanDistance();
            _JaccardSimilarity = new JaccardSimilarity();
            _MatchingCoefficient = new MatchingCoefficient();
            _MongeElkan = new MongeElkan();
            _OverlapCoefficient = new OverlapCoefficient();
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble Levenstein(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_Levenstein.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble NeedlemanWunch(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_NeedlemanWunch.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble SmithWaterman(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_SmithWaterman.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble SmithWatermanGotoh(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_SmithWatermanGotoh.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble SmithWatermanGotohWindowedAffine(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_SmithWatermanGotohWindowedAffine.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble Jaro(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_Jaro.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble JaroWinkler(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_JaroWinkler.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble ChapmanLengthDeviation(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_ChapmanLengthDeviation.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble ChapmanMeanLength(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_ChapmanMeanLength.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble QGramsDistance(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_QGramsDistance.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble BlockDistance(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_BlockDistance.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble CosineSimilarity(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_CosineSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble DiceSimilarity(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_DiceSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
        }

        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble EuclideanDistance(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_EuclideanDistance.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble JaccardSimilarity(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_JaccardSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble MatchingCoefficient(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_MatchingCoefficient.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble MongeElkan(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_MongeElkan.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble OverlapCoefficient(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_OverlapCoefficient.GetSimilarity(firstWord.Value, secondWord.Value));
        }


    }

 

  1. Compile your project and copy the path to your dll.
  2. We need to make sure that our target sql server is CLR Enabled so we need to run:
EXEC sp_configure'clr enabled', 1
RECONFIGURE
  1. Install the library and the functions onto SQL Server
CREATE ASSEMBLY [TextFunctions]
AUTHORIZATION [dbo]
FROM 'c:\SqlAssembly\TextFunctions.dll' WITH PERMISSION_SET = SAFE GO CREATE FUNCTION Levenstein(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.Levenstein
GO CREATE FUNCTION NeedlemanWunch(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.NeedlemanWunch
GO CREATE FUNCTION SmithWaterman(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWaterman
GO CREATE FUNCTION SmithWatermanGotoh(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWatermanGotoh
GO CREATE FUNCTION SmithWatermanGotohWindowedAffine(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWatermanGotohWindowedAffine
GO CREATE FUNCTION Jaro(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.Jaro
GO CREATE FUNCTION JaroWinkler(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.JaroWinkler
GO CREATE FUNCTION ChapmanLengthDeviation(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.ChapmanLengthDeviation
GO CREATE FUNCTION ChapmanMeanLength(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.ChapmanMeanLength
GO CREATE FUNCTION QGramsDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.QGramsDistance
GO CREATE FUNCTION BlockDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.BlockDistance
GO CREATE FUNCTION CosineSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.CosineSimilarity
GO CREATE FUNCTION DiceSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.DiceSimilarity
GO CREATE FUNCTION EuclideanDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.EuclideanDistance
GO CREATE FUNCTION JaccardSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.JaccardSimilarity
GO CREATE FUNCTION MatchingCoefficient(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.MatchingCoefficient
GO CREATE FUNCTION MongeElkan(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.MongeElkan
GO CREATE FUNCTION OverlapCoefficient(@firstword NVARCHAR(255),@secondword NVARCHAR(255)) RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.OverlapCoefficient
GO 
  1. Now we are ready to use the functions our newly available string metrics functions.

4. Evaluating metric accuracy and comparing Metrics

All string metric functions accept two string arguments and return a float value between 0 and 1. The closer the score reaches 1 means that the metric we have applied is telling us the 2 input strings are a match.

Now that we have plenty of Metrics to choose from, we will want to evaluate them and see which one is best for us. This is where the following function comes in handy:

CREATE FUNCTION dbo.CompareStringMetrics(@firstword [nvarchar](255), @secondword [nvarchar](255)) RETURNS TABLE AS RETURN (     SELECT dbo.Jaro(@firstword, @secondword) as Score, 'Jaro' as Metric
    UNION SELECT dbo.JaroWinkler(@firstword, @secondword), 'JaroWinkler'     UNION SELECT dbo.BlockDistance(@firstword, @secondword), 'BlockDistance'     UNION SELECT dbo.ChapmanLengthDeviation(@firstword, @secondword), 'ChapmanLengthDeviation'     UNION SELECT dbo.ChapmanMeanLength(@firstword, @secondword), 'ChapmanMeanLength'     UNION SELECT dbo.CosineSimilarity(@firstword, @secondword), 'CosineSimilarity'     UNION SELECT dbo.DiceSimilarity(@firstword, @secondword), 'DiceSimilarity'     UNION SELECT dbo.EuclideanDistance(@firstword, @secondword), 'EuclideanDistance'     UNION SELECT dbo.JaccardSimilarity(@firstword, @secondword), 'JaccardSimilarity' 
    UNION SELECT dbo.Levenstein(@firstword, @secondword), 'Levenstein'     UNION SELECT dbo.MatchingCoefficient(@firstword, @secondword), 'MatchingCoefficient'     UNION SELECT dbo.MongeElkan(@firstword, @secondword), 'MongeElkan'     UNION SELECT dbo.NeedlemanWunch(@firstword, @secondword), 'NeedlemanWunch'     UNION SELECT dbo.OverlapCoefficient(@firstword, @secondword), 'OverlapCoefficient'     UNION SELECT dbo.QGramsDistance(@firstword, @secondword), 'QGramsDistance'     UNION SELECT dbo.SmithWaterman(@firstword, @secondword), 'SmithWaterman'     UNION SELECT dbo.SmithWatermanGotoh(@firstword, @secondword), 'SmithWatermanGotoh'     UNION SELECT dbo.SmithWatermanGotohWindowedAffine(@firstword, @secondword), 'SmithWatermanGotohWindowedAffine' )

 

This means that now we can easily evaluate the performance and the accuracy of each metric by running queries like:

Select * from CompareStringMetrics('Loyds', 'LLoyds') where score > 0.6
Select * from CompareStringMetrics('bristol', 'brighton') where score < 0.5

String Metric Comparison

Try out various functions on large database tables, the performance is great!

5. Conclusion

SimMetrics is a great library for fuzzy text comparisons. This article showed you how you can bring its power into Sql Server. There are plenty of functions to choose from, experiment and see what works best for you. You can be creative on what you want to apply your metric on, e.g  Instead of applying directly on a field you can choose to make simple searchkeys by lower casing fields, striping common text patterns that you dont need (eg Inc, Ltd etc) and removing numbers. There are many ways to apply these sort of functions, what is more suitable for you is your call, since it depends on the context of your problem. If you find any innovative ways of correlating your data share it with us!

source code: (including sql install and uninstall scripts)
TextFunctions.zip

Comments are closed