Karthikeyan Guruswamy

Short Text Disambiguation - Weighted QGRAMs and Cosine Similarity

Blog Post created by Karthikeyan Guruswamy on Jul 27, 2017

Some of you at Partners 2016 wanted to play with the QGRAM function to do text matching besides the IdentityMatch() function. I explain here how to invoke it and you can add that to your environment and try it out. 

 

Algorithm Background

Read more about QGRAMs, Cosine Similarity etc., in my previous blog post to understand different text similarity techniques.

 

Data Science - Fuzzy Matching & Similarity Measures

 

In this blog post, I only explain a simple use case of matching short text, especially merchant names in transactions.

 

Overview

If a business has transaction data that has merchant names, it's often a common problem to 'disambiguate' those names and map that to a yellow page/registered business name.  If you have millions of business names that mutate over time, morph etc., in the data, it is almost next to impossible with traditional techniques such as string matching/regular expressions/SQL LIKEs etc.,This doc. here explains how we solved this using a technique called 'Weighted QGRAMS with Cosine similarity'. The SQL/MR function QGRAM was created by the Aster PS and combined that with the 6.10 Analytic function called VectorDistance() that computes the cosine similarity metric. It uses a probabilistic technique to break down the merchant names to Q-grams (or letter ngrams), weigh it based on the position in the string and then by applying Cosine Similarity to find the best match.

 

SQL/MR Functions Used:

 

QGRAM() must be installed in your cluster schema.  Download the SQL MR: QGRAM  and install it your schema or request your DBA to install in your cluster using:

 

\install qgram.zip

from act. 

 

VectorDistance() is available in the later Analytic foundation libraries and should be used in conjunction with QGRAM.

 

Inputs:

 

Yellow Page Dictionary (standardized names you want to rollup to):

 

id

merchant

1

Ace Mortgage Inc

2

Ace Credit Card Inc

3

Ace Credit Bank

4

Ceasar Frozen Yogurt LLC

5

8/12 Convienence Store

6

Teradata Corp

7

Walmart Super Market

8

Apple 

9

Wholefoods

10

San Mateo City

 

Merchant names that need to be 'disambiguated' or fuzzy joined with above table:

 

id

bizname

1

Ace MG

2

ACC

3

ACB

4

Ceasar Frozen Yogurt

5

8/12 Store

6

Teradata Corporation

7

Walmart Super M

8

apple

9

Whole Foods Market

10

San Mateo City

Final Output (after the Weighted QGRAM/Cosine Similarity Workflow next section):

 

 

bizname

merchant

cos_score (or fuzzy score/confidence)

Ace MG

Ace Mortgage Inc

0.435152

ACC

Ace Credit Card Inc

0.0549411

ACB

Ace Credit Bank

0.0149411

Ceasar Frozen Yogurt

Ceasar Frozen Yogurt LLC

0.960521

8/12 Store

8/12 Convienence Store

0.566962

Teradata Corporation

Teradata Corp

0.844375

Walmart Super M

Walmart Super Market

0.890216

apple

Apple 

0.757017

Whole Foods Market

Wholefoods

0.545401

San Mateo City

San Mateo City

1.0

SQL, SQL/MR Workflow code to try in your Teradata Studio:

 

drop table if exists bizname_yellow_pages;
create table bizname_yellow_pages
(
 id bigint,
 merchant varchar
) distribute by hash(id);

 

insert into bizname_yellow_pages values (1,'Ace Mortgage Inc');
insert into bizname_yellow_pages values (2,'Ace Credit Card Inc');
insert into bizname_yellow_pages values (3,'Ace Bank Inc');
insert into bizname_yellow_pages values (4,'Ceasar Frozen Yogurt LLC');
insert into bizname_yellow_pages values (5,'8/12 Convienence Store');
insert into bizname_yellow_pages values (6,'Teradata Corp');
insert into bizname_yellow_pages values (7,'Walmart Super Market');
insert into bizname_yellow_pages values (8,'Apple');
insert into bizname_yellow_pages values (9,'Wholefoods');
insert into bizname_yellow_pages values (10,'San Mateo City');

 

drop table if exists operational_biznames;
create table operational_biznames
(
 id bigint,
 bizname varchar
) distribute by hash(id);

 

insert into operational_biznames values(1,'Ace MG');
insert into operational_biznames values(2,'ACC');
insert into operational_biznames values(3,'ACB');
insert into operational_biznames values(4,'Ceasar Frozen Yogurt');
insert into operational_biznames values(5,'8/12 Store');
insert into operational_biznames values(6,'Teradata Corporation');
insert into operational_biznames values(7,'Walmart Super M');
insert into operational_biznames values(8,'apple');
insert into operational_biznames values(9,'Whole Foods Market');
insert into operational_biznames values(10,'San Mateo City');

 

drop table if exists qgram_yp;
create table qgram_yp distribute by hash(qgram)
as
select id, qgram, weight
   from qgram(
       on bizname_yellow_pages
       decay(0.95)
       gram(3)
);

drop table if exists qgram_ops;
create table qgram_ops distribute by hash(qgram)
as
select id, qgram, weight
 from qgram(
     on operational_biznames
     decay(0.95)
     gram(3)
);

drop table if exists cosine_similarity_match;
create table cosine_similarity_match distribute by hash(target_id)
as
select * from VectorDistance(
   on qgram_ops as target partition by id
   on qgram_yp as ref dimension
   targetIdColumns('id')
   targetFeatureColumn('qgram')
   targetValueColumn('weight')
   measure('Cosine')
   topK(1)
);

select b.id id, b.bizname, c.merchant, 1- a.distance as cosine_score
 from
cosine_similarity_match a
   inner join operational_biznames b on (a.target_id = b.id)
   inner join bizname_yellow_pages c on (a.ref_id = c.id)
order by id
;

 

Outcomes