Hi and welcome to my blog. Im Tasos, a software engineer working in the UK. This is where i share some of my findings related with SQL, c#, asp.net and javascript with you. I hope you find something helpful and Im looking forward to your feedback!

Recent Comments

Popular Posts

Recent Posts

Archives

Post Categories

Blog Stats

  • Posts: 15
  • Comments: 116
  • Trackbacks: 9
  • Articles: 1

Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server

Sunday, January 11, 2009 12:08 AM, Filed Under SQL C# SQLCLR

In this post:

  1. SoundEx in Sql Server
  2. SimMetrics
  3. Adding string Metric functions in MS Sql Server
  4. Evaluating metric accuracy and comparing Metrics
  5. Conclusion + code

Quite often we come across a requirement where we may need to perform some sort of fuzzy string grouping or data correlation. For example, we may want to correlate the customer records of a database by identifying records that are similar but not necessarily exactly the same (due to spelling mistakes for example). Obviously a simple group by, will not successfully group such data. We will need to employ what is commonly referred to as a distance algorithm or a string metric in order to determine how close 2 string values are.

1. SoundEx

SoundEx is one such algorithm built in to Sql server, but it doesnt really perform well on calculating string value distances since it is a phonetic encoding algorithm. Vowels are ignored and the resulting value consists of a letter followed by 3 digits calculated like so:

    1. Replace consonants with digits as follows (but do not change the first letter):

      • b, f, p, v => 1
      • c, g, j, k, q, s, x, z => 2
      • d, t => 3
      • l => 4
      • m, n => 5
      • r => 6
    2. Collapse adjacent identical digits into a single digit of that value.
    3. Remove all non-digits after the first letter.
    4. Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a letter and three digits.

 

So let's try SoundEx with a few variations and see how it performs

select Soundex('LLoyds') -- returns L432
select Soundex('Loyds')  -- returns L320

select Soundex('Brighton') -- returns B623
select Soundex('Bristol')  -- returns B623

As you can see, Soundex returned the same code for Brighton and Bristol, while at the same time it returned a different code for Loyds and LLoyds. Such results make you think twice before employing this function in a production environment.

So, is this all that is available to us? Of course not.

2. SimMetrics

The Web Intelligence Group within the University of Sheffield have released a wonderful library over at sourceforge just for this purpose. SimMetrics is the library you would need to use. The developer of this library and credit for its development goes to Sam Chapman

Here are only some of the similarity metrics in SimMetrics that can be briefly described in simple English. For an even more detailed explanation and for the full list of included functions have a look over at http://www.dcs.shef.ac.uk/~sam/stringmetrics.html or on Wikipedia

  • Hamming distance 
    Measures the minimum number of substitutions required to change one into the other
  • Levenshtein distance
    Measures the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character
  • Needleman-Wunch distance or Sellers Algorithm
    Similar to Levenstein with an added cost adjustment to the cost of a gap of the operation
  • Smith-Waterman distance
    Similar to Levenstein but with two adjustable parameters for cost functions
  • Gotoh Distance or Smith-Waterman-Gotoh distance
    Extension of the Smith Waterman with variable gap costs
  • Jaro distance metric
    This distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match. The higher the distance for two strings is, the more similar the strings are
  • Jaro Winkler 
    Similar to Jaro but with more favourable ratings for strings that match from the beginning
  • Cosine similarity
    Often used to compare documents in text mining.

There are plenty more in this library, but frankly, the descriptions are pretty baffling and dont fit in simple one-liner explanation. Follow the links above for even more information on this.

What I would like instead to focus on is how we can exploit the power of such a library within Sql Server.

3. Adding string Metric functions in MS Sql Server

image

In Sql Server 2005 upwards we can leverage the power of CLR to introduce these string metrics into our SQL queries. And I will walk you through the process, step by step.

  1. Download the source code of the SimMetrics library from http://sourceforge.net/projects/simmetrics/
  2. Open and compile the project so you have your SimMetrics.dll in your bin folder.
  3. Create a new C# Class Library project and call it TextFunctions (or if you're feeling lazy, download it here)
  4. Add a reference to the SimMetrics.dll
  5. Add a class and call it SimMetrics
  6. Copy and paste the following Code into your Class file
using System;
using System.Collections.Generic;
using System.Text;
using System.Data.SqlTypes;
using SimMetricsMetricUtilities;

    public class StringMetrics
    {

        private static readonly Levenstein _Levenstein;
        private static readonly NeedlemanWunch _NeedlemanWunch;
        private static readonly SmithWaterman _SmithWaterman;
        private static readonly SmithWatermanGotoh _SmithWatermanGotoh;
        private static readonly SmithWatermanGotohWindowedAffine _SmithWatermanGotohWindowedAffine;
        private static readonly Jaro _Jaro;
        private static readonly JaroWinkler _JaroWinkler;
        private static readonly ChapmanLengthDeviation _ChapmanLengthDeviation;
        private static readonly ChapmanMeanLength _ChapmanMeanLength;
        private static readonly QGramsDistance _QGramsDistance;
        private static readonly BlockDistance _BlockDistance;
        private static readonly CosineSimilarity _CosineSimilarity;
        private static readonly DiceSimilarity _DiceSimilarity;
        private static readonly EuclideanDistance _EuclideanDistance;
        private static readonly JaccardSimilarity _JaccardSimilarity;
        private static readonly MatchingCoefficient _MatchingCoefficient;
        private static readonly MongeElkan _MongeElkan;
        private static readonly OverlapCoefficient _OverlapCoefficient;

        static StringMetrics()
        {
            _Levenstein = new Levenstein();
            _NeedlemanWunch = new NeedlemanWunch();
            _SmithWaterman = new SmithWaterman();
            _SmithWatermanGotoh = new SmithWatermanGotoh();
            _SmithWatermanGotohWindowedAffine = new SmithWatermanGotohWindowedAffine();
            _Jaro = new Jaro();
            _JaroWinkler = new JaroWinkler();
            _ChapmanLengthDeviation = new ChapmanLengthDeviation();
            _ChapmanMeanLength = new ChapmanMeanLength();
            _QGramsDistance = new QGramsDistance();
            _BlockDistance = new BlockDistance();
            _CosineSimilarity = new CosineSimilarity();
            _DiceSimilarity = new DiceSimilarity();
            _EuclideanDistance = new EuclideanDistance();
            _JaccardSimilarity = new JaccardSimilarity();
            _MatchingCoefficient = new MatchingCoefficient();
            _MongeElkan = new MongeElkan();
            _OverlapCoefficient = new OverlapCoefficient();
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble Levenstein(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_Levenstein.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble NeedlemanWunch(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_NeedlemanWunch.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble SmithWaterman(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_SmithWaterman.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble SmithWatermanGotoh(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_SmithWatermanGotoh.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble SmithWatermanGotohWindowedAffine(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_SmithWatermanGotohWindowedAffine.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble Jaro(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_Jaro.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble JaroWinkler(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_JaroWinkler.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble ChapmanLengthDeviation(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_ChapmanLengthDeviation.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble ChapmanMeanLength(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_ChapmanMeanLength.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble QGramsDistance(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_QGramsDistance.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble BlockDistance(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_BlockDistance.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble CosineSimilarity(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_CosineSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble DiceSimilarity(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_DiceSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
        }

        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble EuclideanDistance(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_EuclideanDistance.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble JaccardSimilarity(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_JaccardSimilarity.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble MatchingCoefficient(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_MatchingCoefficient.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble MongeElkan(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_MongeElkan.GetSimilarity(firstWord.Value, secondWord.Value));
        }


        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlDouble OverlapCoefficient(SqlString firstWord, SqlString secondWord)
        {
            if (firstWord.IsNull || secondWord.IsNull)
                return 0;

            return new SqlDouble(_OverlapCoefficient.GetSimilarity(firstWord.Value, secondWord.Value));
        }


    }

 

  1. Compile your project and copy the path to your dll.
  2. We need to make sure that our target sql server is CLR Enabled so we need to run:
EXEC sp_configure 'clr enabled', 1
RECONFIGURE
  1. Install the library and the functions onto SQL Server
CREATE ASSEMBLY [TextFunctions]
AUTHORIZATION [dbo]
FROM 'c:\SqlAssembly\TextFunctions.dll'
WITH PERMISSION_SET = SAFE

GO

CREATE FUNCTION Levenstein(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.Levenstein
GO

CREATE FUNCTION NeedlemanWunch(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.NeedlemanWunch
GO

CREATE FUNCTION SmithWaterman(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWaterman
GO

CREATE FUNCTION SmithWatermanGotoh(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWatermanGotoh
GO

CREATE FUNCTION SmithWatermanGotohWindowedAffine(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.SmithWatermanGotohWindowedAffine
GO

CREATE FUNCTION Jaro(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.Jaro
GO

CREATE FUNCTION JaroWinkler(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.JaroWinkler
GO

CREATE FUNCTION ChapmanLengthDeviation(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.ChapmanLengthDeviation
GO

CREATE FUNCTION ChapmanMeanLength(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.ChapmanMeanLength
GO

CREATE FUNCTION QGramsDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.QGramsDistance
GO

CREATE FUNCTION BlockDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.BlockDistance
GO

CREATE FUNCTION CosineSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.CosineSimilarity
GO

CREATE FUNCTION DiceSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.DiceSimilarity
GO

CREATE FUNCTION EuclideanDistance(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.EuclideanDistance
GO

CREATE FUNCTION JaccardSimilarity(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.JaccardSimilarity
GO

CREATE FUNCTION MatchingCoefficient(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.MatchingCoefficient
GO

CREATE FUNCTION MongeElkan(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.MongeElkan
GO

CREATE FUNCTION OverlapCoefficient(@firstword NVARCHAR(255),@secondword NVARCHAR(255))
RETURNS float EXTERNAL NAME TextFunctions.StringMetrics.OverlapCoefficient
GO

  1. Now we are ready to use the functions our newly available string metrics functions.

4. Evaluating metric accuracy and comparing Metrics

All string metric functions accept two string arguments and return a float value between 0 and 1. The closer the score reaches 1 means that the metric we have applied is telling us the 2 input strings are a match.

Now that we have plenty of Metrics to choose from, we will want to evaluate them and see which one is best for us. This is where the following function comes in handy:

CREATE FUNCTION dbo.CompareStringMetrics(@firstword [nvarchar](255), @secondword [nvarchar](255))
RETURNS TABLE
AS
RETURN
(
    SELECT dbo.Jaro(@firstword, @secondword) as Score, 'Jaro' as Metric
    UNION SELECT dbo.JaroWinkler(@firstword, @secondword), 'JaroWinkler'
    UNION SELECT dbo.BlockDistance(@firstword, @secondword), 'BlockDistance'
    UNION SELECT dbo.ChapmanLengthDeviation(@firstword, @secondword), 'ChapmanLengthDeviation'
    UNION SELECT dbo.ChapmanMeanLength(@firstword, @secondword), 'ChapmanMeanLength'
    UNION SELECT dbo.CosineSimilarity(@firstword, @secondword), 'CosineSimilarity'
    UNION SELECT dbo.DiceSimilarity(@firstword, @secondword), 'DiceSimilarity'
    UNION SELECT dbo.EuclideanDistance(@firstword, @secondword), 'EuclideanDistance'
    UNION SELECT dbo.JaccardSimilarity(@firstword, @secondword), 'JaccardSimilarity' 
    UNION SELECT dbo.Levenstein(@firstword, @secondword), 'Levenstein'
    UNION SELECT dbo.MatchingCoefficient(@firstword, @secondword), 'MatchingCoefficient'
    UNION SELECT dbo.MongeElkan(@firstword, @secondword), 'MongeElkan'
    UNION SELECT dbo.NeedlemanWunch(@firstword, @secondword), 'NeedlemanWunch'
    UNION SELECT dbo.OverlapCoefficient(@firstword, @secondword), 'OverlapCoefficient'
    UNION SELECT dbo.QGramsDistance(@firstword, @secondword), 'QGramsDistance'
    UNION SELECT dbo.SmithWaterman(@firstword, @secondword), 'SmithWaterman'
    UNION SELECT dbo.SmithWatermanGotoh(@firstword, @secondword), 'SmithWatermanGotoh'
    UNION SELECT dbo.SmithWatermanGotohWindowedAffine(@firstword, @secondword), 'SmithWatermanGotohWindowedAffine'
)

 

This means that now we can easily evaluate the performance and the accuracy of each metric by running queries like:

Select * from CompareStringMetrics('Loyds', 'LLoyds') where score > 0.6
Select * from CompareStringMetrics('bristol', 'brighton') where score < 0.5

String Metric Comparison

Try out various functions on large database tables, the performance is great!

5. Conclusion

SimMetrics is a great library for fuzzy text comparisons. This article showed you how you can bring its power into Sql Server. There are plenty of functions to choose from, experiment and see what works best for you. You can be creative on what you want to apply your metric on, e.g  Instead of applying directly on a field you can choose to make simple searchkeys by lower casing fields, striping common text patterns that you dont need (eg Inc, Ltd etc) and removing numbers. There are many ways to apply these sort of functions, what is more suitable for you is your call, since it depends on the context of your problem. If you find any innovative ways of correlating your data share it with us!

source code: (including sql install and uninstall scripts)
TextFunctions.zip

Share this post!
digg it
Kick it on DotNetKicks.com

Comments

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Chris Marisic on 1/19/2009 3:11 PM

So would Jaro Winkler being one of the most optimal operations for implementing fuzzy search on first name / last name search fields?

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Tasos Yalanopoulos on 1/19/2009 5:06 PM

Hi Chris, indeed jaro winkler seems to have worked nicely for me so far for such fields

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Campgrounds on 1/22/2009 3:47 PM

I am downloading and getting ready to set this up. Thanks for such an informative post.

I need to compare business names which can be several words long. I'm not real familiar with fuzzy matching. Do you know if one of these operations is better than the others for this, or should I try to write a function to compare them 1 word at a time?

Thanks.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Tasos Yalanopoulos on 1/26/2009 8:27 PM

Try stripping out spaces and common company suffixes (Ltd, inc) and create a search key. Of course this is no silver bullet, experiment with your suggestion too and see what works out better for you quality/performance wise. There is no optimum solution and results depend on the input data.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Tim Wilson-Brown on 2/2/2009 2:10 AM

The SQL Server implementation of Soundex differs from the SimMetrics implementation - SimMetrics counts initial double consonants as one letter, whereas SQL Server counts them as a letter and number. See www.dcs.shef.ac.uk/~sam/stringmetrics.html#soundex

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Coder on 2/12/2009 5:13 AM

Wow what a useful library. Great for keeping duplicates out of the database.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Caroline_B on 4/15/2009 1:46 PM

Campgrounds, the suggestion to remove white space and common words is a good one. You might also find bigram/trigram/n-gram analysis very useful (breaking the contents of the field down into strings of 2, 3 or n characters and seeing how many of these are common).

If you don't have a good reference list for company suffixes etc, try writing something to extract all words with their frequency of appearance in the data (across all rows). Typically the common suffixes and words like "the" and "and" will come up at the top of your frequency table and can then easily be stripped. You will also get a better idea of any terms which are peculiar to your data (we had one client where the name 'Brenda' cropped up over 400 times in the customer_name field!!).

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by statey on 4/22/2009 1:10 AM

I really like this. I've read many posts on this subject and this is by far the best. Is there anyway to get an overall score or percentage of similarity derived from the combined algorithms.

Alternatively does anyone know when if / when it will be possible to automate the fuzzy match functionality in SSIS from C# (without using a package)?

Thanks

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by aCoder on 5/9/2009 2:01 AM

Any ports for Oracle?

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Even Mien on 6/18/2009 6:35 PM

FYI, the link to download your TextFunctions project is using a relative link, so it does not work when someone browses from your archived page.

http://anastasiosyal.com/code/textfunctions.zip

Thanks some much for sharing this information.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Jamie K on 7/13/2009 9:32 PM

JaroWinkler seems to be the best for firstname/lastname matches, but not so good for addresses, maybe my users mis-hear addresses differently to the way they mis-hear customer names?. Used in combination with the first part of a postcode it works well in my application. Thanks for the posting!

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Paul on 8/6/2009 2:12 PM

If you want to see really great solution for this problem please check www.matchlogics.com.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Alejo L on 8/11/2009 12:41 AM

Great Help THANKS

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by gn on 8/28/2009 4:50 PM

You list as step 2 '2.Open and compile the project so you have your SimMetrics.dll in your bin folder. ' What project? I extracted the jar file and I find no project. Can you please clarify. These routines look great, and I would really like to be able to use them.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by TimR on 10/30/2009 7:13 PM

This looks great. One question, how would i create a simple SQL statement to select Names <similar to> @userEnteredName with these functions. In other words, I want to get customer names from the customer table where the user typed in smyth and get back smith, smitty, etc....

@userEnteredLastName = 'smyth'
@userEnteredFirstName = 'Jon'


SELECT * FROM Customer WHERE JaroWinkler(CustomerLastName, @userEnteredLastName) > .75 AND JaroWinkler(CustomerFirstName, @userEnteredFirstName) > .75

Can a column be used for one of your values? If so, I will install this in our system. Thanks!

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Gary Gonzales on 11/1/2009 9:42 AM

Hello,

I am mining data from an ERP system. The users of this system are not using the ERP's part number field and are instead entering the part numbers, along with other data (model, part description), in a free form text description field within the ERP's PO line items. As a result I have a ton of user embellishments occurring within the data set (ex. part # 12345-AB, P/N#1234AB...sometimes the part numbers are in the middle or at the end of the string and not all contain #) etc. So I need a way to extract the part numbers from this string of data. I realize that SoundEx in any form is meant for phonetic names but I'm wondering if it would be applied to my challenge...somehow? Thanks

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by ANDRES on 3/15/2010 2:11 PM

EXCELENT PAGE

# re: Good!, Posted by Jaime on 4/26/2010 9:35 AM

Nice stuff, David.
Really interesting, and really well explained.

Thanks a lot

Best

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by AEM on 5/15/2010 12:58 PM

Thanks. It helps me

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Black Mold on 5/19/2010 7:03 PM

Thanks a ton!

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Oshimune on 5/21/2010 12:40 AM

Would you please clarifying step 2.
you mention,
"Open and compile the project so you have your SimMetrics.dll in your bin folder".
I'm new at this so please bear with me a little bit.

thx

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by shmokie on 5/21/2010 2:11 PM

First Name, Last Name, Date of Birth, Address Line 1, [Address Line 2], [Address Line 3], Admission Date, [Discharge Date], [Consultant Code]
Where an MRN is not available, Claimsure will do its best to identify a patient from the patient’s name, address and date of birth, then try to match this with episodes. The more information is available, the fewer the number of errors will be produced.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by shmokie on 5/21/2010 2:21 PM

Ignore the last comment.

@Anastasios,

Thanks for this, it's a great resource.

@Oshimune,

On the SimMetrics page, go to the Files tab, scroll to the bottom and get the .Net project.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Alan on 6/25/2010 4:37 AM

Awesom Post, exactly what I needed :)

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by carlos on 7/12/2010 6:47 PM

Hello Anastasios!

Excellent post! It helps me greatly. The implementation in SQL Server has had a great performance. Now I'm trying to do this in Oracle. Any idea how to do this? Any advice is welcome. Thanks.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by capn on 7/14/2010 7:51 PM

For you Oracle guys, you need to be running 10g R2 to run code implemented in the CLR (at least stored procedures, not sure about user defined functions):

www.oracle.com/.../williams_sps.html

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Oil-Filled Radiator on 8/3/2010 1:57 PM

Wow! I think I understand what you said. Hopefully this will get me going in the right direction. Thanks.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Ry on 8/3/2010 5:43 PM

I also am wondering how you "Open and compile the project so you have your SimMetrics.dll in your bin folder".
Any help would be great. Thanks.

# re: Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, Posted by Deodorants on 9/3/2010 7:02 AM

I'm kinda confused and having trouble with the whole SimMetrics.dll thing too. If you could elaborate just a bit more on that it would be awesome.

Comments

Title: *
Name: *
Email: (never displayed)
Website:
Comment: *  
Please add 7 and 5 and type the answer here: