Fuzzy Matching – Definition, Process and Techniques

An Accenture survey showed that 75% of consumers prefer to buy from retailers who know their name and buying behavior, and 52% are more likely to switch brands if a brand doesn’t offer personalized experiences. With millions of data points captured by brands almost every day, identifying unique customers and building their profiles is one of the biggest challenges facing most businesses.

When a business uses multiple tools to capture data, it’s very common for a customer’s name to be misspelled or for an email address to be accepted in an invalid format. Additionally, when disparate data applications hold different information about the same customer, it becomes very difficult to gain insight into your customers’ behavior and preferences.

In this article, we’ll look at what fuzzy matching is, how it’s implemented, the common techniques used, and the challenges encountered. Let’s get started.

Fuzzy matching is a data matching technique that compares two or more records and calculates the probability that they belong to the same entity. Rather than simply classifying records as matched or unmatched, fuzzy matching generates a score (usually between 0 and 100%) that indicates the likelihood that these records belong to the same customer, product, employee, etc.
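
As a minimal sketch of such a score, the snippet below uses Python's standard `difflib.SequenceMatcher` to turn two strings into a number between 0 and 100. The function name `match_score` is a hypothetical choice for illustration, and the ratio is a similarity measure rather than a calibrated probability.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Return a similarity score between 0 and 100 for two strings."""
    # Trim and lowercase so trivial formatting differences do not lower the score.
    a, b = a.strip().lower(), b.strip().lower()
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# A likely typo scores high, an unrelated name scores low.
print(round(match_score("Jonathan Smith", "Jonathon Smith"), 1))
print(round(match_score("Jonathan Smith", "Maria Garcia"), 1))
```

Instead of a yes/no answer, each pair gets a graded score that can later be compared against a threshold.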

An effective fuzzy matching algorithm handles a range of data ambiguities, such as first/last name inversions, acronyms, shortened names, phonetic and deliberate misspellings, abbreviations, and added or deleted punctuation.

Fuzzy matching process

The fuzzy matching process goes as follows:

  1. Profile records for basic normalization errors. These errors are corrected in order to obtain a uniform and standardized view for all records.
  2. Select and map the attributes on which fuzzy matching will take place. Because these attributes may be titled differently in each source, they must be mapped across sources.
  3. Choose a fuzzy matching technique for each attribute. For example, names can be matched based on keyboard distance or name variants, while phone numbers can be matched based on numeric similarity metrics.
  4. Select a weight for each attribute, so attributes assigned higher weights (or higher priority) will have more impact on the overall match confidence level compared to fields with lower weights.
  5. Set the threshold level: records with a fuzzy match score above the threshold are considered a match, and those below it are considered a non-match.
  6. Run fuzzy matching algorithms and analyze match results.
  7. Review and correct any false positives and false negatives that arise.
  8. Merge, deduplicate, or simply eliminate duplicate records.
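
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production implementation: the field names, the `WEIGHTS` and `THRESHOLD` values, and the sample records are all hypothetical, and a single similarity technique (`difflib.SequenceMatcher`) is used for every attribute for brevity.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; real values should be tuned on your data.
WEIGHTS = {"name": 0.5, "email": 0.3, "phone": 0.2}
THRESHOLD = 0.85


def normalize(record: dict) -> dict:
    """Step 1: basic standardization (collapse spaces, lowercase, digits-only phone)."""
    return {
        "name": " ".join(record["name"].lower().split()),
        "email": record["email"].strip().lower(),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }


def similarity(a: str, b: str) -> float:
    """Step 3: the matching technique; here the same one for every attribute."""
    return SequenceMatcher(None, a, b).ratio()


def match_confidence(rec_a: dict, rec_b: dict) -> float:
    """Step 4: weighted average of per-attribute similarities (0.0 to 1.0)."""
    a, b = normalize(rec_a), normalize(rec_b)
    total = sum(WEIGHTS.values())
    return sum(w * similarity(a[f], b[f]) for f, w in WEIGHTS.items()) / total


def is_match(rec_a: dict, rec_b: dict) -> bool:
    """Step 5: compare the overall confidence against the threshold."""
    return match_confidence(rec_a, rec_b) >= THRESHOLD


crm = {"name": "Elizabeth  Turner", "email": "E.Turner@example.com", "phone": "(555) 010-2345"}
web = {"name": "Elisabeth Turner", "email": "e.turner@example.com", "phone": "555-010-2345"}
other = {"name": "Robert Chan", "email": "rchan@example.org", "phone": "555-019-8800"}

print(is_match(crm, web))    # same customer with a spelling variant
print(is_match(crm, other))  # different customer
```

Because the weights are divided by their sum, they do not have to add up to 1, which makes it easier to experiment with different priorities.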

Fuzzy match settings

From the process defined above, you can see that a fuzzy matching algorithm has a number of parameters that form the basis of this technique. These include the attribute weights, the fuzzy matching technique, and the score threshold level.

To get the best results, you should run fuzzy matching techniques with varying parameters and find the values that best suit your data. Many vendors build this into their fuzzy matching solutions: the settings are chosen automatically but can be customized to suit your needs.
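
One simple way to try varying parameters is to score a small set of labeled pairs at several thresholds and count the errors at each one. The sketch below assumes you have such labeled pairs; `LABELED_PAIRS`, the candidate thresholds, and the function names are hypothetical, and the same idea extends to weights and technique choices.

```python
from difflib import SequenceMatcher

# Hypothetical labeled pairs: (string_a, string_b, is_same_entity).
LABELED_PAIRS = [
    ("Jon Smith", "John Smith", True),
    ("Katherine Lee", "Catherine Lee", True),
    ("Acme Corp", "Acme Corporation", True),
    ("John Smith", "Joan Smyth", False),
    ("Maria Lopez", "Mario Lopes", False),
    ("Acme Corp", "Apex Corp", False),
]


def score(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def errors_at(threshold: float) -> tuple[int, int]:
    """Return (false_positives, false_negatives) for a given threshold."""
    fp = fn = 0
    for a, b, same in LABELED_PAIRS:
        predicted = score(a, b) >= threshold
        if predicted and not same:
            fp += 1
        elif same and not predicted:
            fn += 1
    return fp, fn


def best_threshold(candidates: list[float]) -> float:
    """Pick the candidate with the fewest total errors (ties go to the earliest)."""
    return min(candidates, key=lambda t: sum(errors_at(t)))


candidates = [0.60, 0.70, 0.80, 0.90]
for t in candidates:
    fp, fn = errors_at(t)
    print(f"threshold={t:.2f} false_positives={fp} false_negatives={fn}")
print("chosen:", best_threshold(candidates))
```

Minimizing the total error count treats both kinds of error as equally costly; if wrongly merging two customers is worse than missing a duplicate, the selection rule would need to weight false positives more heavily.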

There are many fuzzy matching techniques in use today, and they differ in the exact algorithm or formula used to compare and match fields. Depending on the nature of your data, you can choose the technique that suits your needs. Here is a list of common fuzzy matching techniques:

  1. Character-based similarity metrics, which are best for matching strings. These include:
    1. Edit distance: Calculates the distance between two strings character by character, as the number of single-character edits needed to turn one into the other.
    2. Affine gap distance: Calculates the distance between two strings while also taking into account the gaps or spaces between the strings.
    3. Smith–Waterman distance: Calculates the distance between two strings while also accounting for the presence or absence of prefixes and suffixes.
    4. Jaro distance: Best for matching first and last names.
  2. Token-based similarity metrics, which are best for matching complete words in strings. These include:
    1. Atomic Strings: Splits long strings into punctuation-delimited words and compares on individual words.
    2. WHIRL: Similar to atomic strings, but WHIRL also assigns weights to each word.
  3. Phonetic similarity metrics, which are best for comparing words that sound alike but have a totally different character composition. These include:
    1. Soundex: Best for comparing surnames that are spelled differently but sound similar.
    2. NYSIIS: Similar to Soundex, but also retains details about vowel position.
    3. Metaphone: Compares similar-sounding words that exist in English, other words familiar to Americans, and commonly used first and last names in the United States.
  4. Numerical similarity metrics, which compare numbers, how far apart they are, the distribution of numerical data, etc.
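
To make three of these families concrete, here is a small sketch with one example each: Levenshtein edit distance (character-based), a word-overlap score (token-based), and American Soundex (phonetic). The function names are hypothetical, and the Jaccard word overlap is an assumed, simplified stand-in for atomic-string comparison rather than a definitive version of it.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_similarity(a: str, b: str) -> float:
    """Split on punctuation and whitespace, then compare the sets of words."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)  # Jaccard overlap


_CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not break a run of same-coded consonants
        code = _CODES.get(c, "")
        if code and code != last:
            result += code
        last = code  # a vowel (empty code) ends the current run
    return (result + "000")[:4]


print(edit_distance("kitten", "sitting"))             # 3
print(token_similarity("Smith, John", "John Smith"))  # 1.0
print(soundex("Smith"), soundex("Smyth"))             # S530 S530
```

Each function answers a different question: how many characters differ, how many whole words are shared, and whether two names sound alike, which is why different attributes usually call for different techniques.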

The fuzzy matching process – despite the incredible benefits it offers – can be quite difficult to implement. Here are some common challenges businesses face:

1. High rate of false positives and negatives

Many fuzzy matching solutions have a high rate of false positives and false negatives. This happens when the algorithm incorrectly classifies non-matches as matches, or vice versa. Configurable match definitions and fuzzy parameters can help reduce incorrect links as much as possible.

2. Computational complexity

During the matching process, each record is compared to all other records in the same dataset. And if you are dealing with multiple datasets, the number of comparisons increases further. Note that the number of comparisons grows quadratically with the number of records. For this reason, you should use a system that can handle resource-intensive calculations.
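
To see the quadratic growth, note that comparing every record with every other record in a dataset of n records takes n(n-1)/2 comparisons. The short sketch below checks that formula against an explicit enumeration; `comparison_count` is a hypothetical helper name.

```python
from itertools import combinations


def comparison_count(n: int) -> int:
    """Number of unordered record pairs in a dataset of n records."""
    return n * (n - 1) // 2


# Enumerating the pairs for a tiny dataset agrees with the formula.
records = ["r1", "r2", "r3", "r4"]
pairs = list(combinations(records, 2))
print(len(pairs), comparison_count(len(records)))

# Ten times more records means roughly a hundred times more comparisons.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {comparison_count(n):>13,} comparisons")
```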

3. Validation testing

Matching records are merged to represent a complete 360° view of entities. Any mistakes made during this process can add risk to your business operations. This is why detailed validation tests should be performed to ensure that the tuned algorithm consistently produces results with a high accuracy rate.

Businesses often view fuzzy matching solutions as complex, resource-intensive, and exhausting projects that take too long. The truth is that investing in the right solution, one that produces fast and accurate results, is the key. Organizations need to consider a number of factors when opting for a fuzzy matching tool, such as the time and money they are willing to invest, the scalability they have in mind, and the nature of their datasets. This will help them choose a solution that will allow them to get the most out of their data.