## Recommendations with simple correlation metrics on implied preference data


When you sit down to write a recommendations system, there are quite a few well-practiced techniques you can use, and it’s difficult to know in advance how well they are going to work out when applied to your data. Thanks to the Netflix prize, which was initiated in 2006 and awarded in 2009, a lot has been written on recommender systems for the Netflix data set. If you happen to have a product catalogue similar to Netflix’s (those movies from the 60s are still being viewed and rated), and your users happen to have scored it with a 5-point explicit ratings system, there are some awesome advanced techniques and frameworks that you can take for a spin. Does that sound like you? Show of hands? I didn’t think so. Our data is certainly nothing like that.

What to do? I decided to start with something simple, before our inevitable trek into the forests of matrix factorization, stochastic gradient descent, Markov clusters and other impressive-sounding stuff: more on all that in subsequent posts.

So George and I began with the most obvious available literature, the O’Reilly book Programming Collective Intelligence by Toby Segaran (if there’s an O’Reilly book on it, you must be able to just do it, right?), and with the simplest data set we could imagine: a set of relations between users and items, which we interpret as the user’s preference for the item. This might be an item-view, an item-purchase (unless closely followed by a return), or any other event we think might come in handy. We need a general term for this activity, for discussion purposes, not ‘view’ or ‘purchase’: let’s call it ‘flagging’. This data is most like the book’s del.icio.us example: people either linked to something or did not. We’re also going to limit ourselves to the simplest possible tool, a SQL interface to something that is more or less a bunch of tables. We have tried the following, or at least parts of it, on MS SQL Server, Netezza and Hive.

It’s impractical for us to load our entire data set into memory, or even to represent the relationships of all users and all items as explicit data, so we look for a sparse durable representation: a relational table in which a record means that a user flagged an item within a certain context. The context depends on the data source: a user viewed an item in a particular month/week/day/session, a user purchased an item in a particular month/week/day/order, etc.

Now let’s compute the following:

- For each user-item pair: how many times did the user flag the item?
- For each user: how many items were flagged? how many total flag events?
- For each pair of items flagged by at least one user: how many users flagged both items? We call this ‘overlap’. We exclude outliers at this point: users with too few items flagged, or too many flag events.
- For each item: how many users flagged it? We call this ‘popularity’.
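The four computations above can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for a real warehouse. The table names (`flag`, `user_item`, `user_summary`), the toy rows, and the outlier thresholds are all assumptions made for the example, not the schema we actually used.

```python
import sqlite3

# Hypothetical toy data: one row per flag event (user, item, context).
FLAG_EVENTS = [
    ("u1", "A", "2012-01"), ("u1", "A", "2012-02"), ("u1", "B", "2012-01"),
    ("u2", "A", "2012-01"), ("u2", "B", "2012-01"), ("u2", "C", "2012-02"),
    ("u3", "B", "2012-03"), ("u3", "C", "2012-03"),
]
MIN_ITEMS, MAX_EVENTS = 2, 1000  # made-up outlier thresholds

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flag (user_id TEXT, item_id TEXT, context TEXT)")
conn.executemany("INSERT INTO flag VALUES (?, ?, ?)", FLAG_EVENTS)

# Per user-item pair: how many times did the user flag the item?
conn.execute("""
    CREATE TABLE user_item AS
    SELECT user_id, item_id, COUNT(*) AS flag_count
    FROM flag GROUP BY user_id, item_id""")

# Per user: distinct items flagged and total flag events.
conn.execute("""
    CREATE TABLE user_summary AS
    SELECT user_id, COUNT(*) AS item_count, SUM(flag_count) AS flag_events
    FROM user_item GROUP BY user_id""")

# Overlap: users who flagged both items, skipping outlier users
# (too few items flagged, or too many flag events).
overlap = {
    (a, b): n
    for a, b, n in conn.execute("""
        SELECT x.item_id, y.item_id, COUNT(*)
        FROM user_item x
        JOIN user_item y ON y.user_id = x.user_id AND x.item_id < y.item_id
        JOIN user_summary s ON s.user_id = x.user_id
        WHERE s.item_count >= ? AND s.flag_events <= ?
        GROUP BY x.item_id, y.item_id""", (MIN_ITEMS, MAX_EVENTS))
}

# Popularity: how many users flagged each item?
popularity = dict(conn.execute(
    "SELECT item_id, COUNT(*) FROM user_item GROUP BY item_id"))
```

The `x.item_id < y.item_id` condition keeps each unordered pair once. On a real data set the self-join on `user_item` is the expensive step, which is why trimming outlier users before it matters.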

Now we’re ready to compute any of the correlation metrics in the book. Which ones make sense? Not Pearson correlation or Euclidean distance. You can compute them well enough, but try to imagine what they mean in the context of this data. What kind of straight line, or triangle’s hypotenuse, are you fitting these data points to? None that I can picture. The data is too much of a degenerate case of anything to which those concepts might usefully apply: you get a lot of scores of exactly ‘1’ or ‘0’ or $\sqrt{2}$. Raw frequency makes sense, but it’s a bit of a blunt instrument. Before we set up this system, there was something on the site that essentially used frequency along the lines we’re talking about here. It overvalued very popular things, to be sure, but in the end people clicked and bought things off those recommendations, so it wasn’t terrible.

But what about the Jaccard coefficient (sometimes called the Tanimoto coefficient)? $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. Sounds plausible. If we take A and B to be the sets of users who flagged each item, the coefficient is (overlap of A and B)/(popularity of A + popularity of B − overlap of A and B), and we’ll interpret 1 minus that ratio as the distance between our items A and B. Makes sense to me, and it’s straightforward in SQL! Our final table (let’s call it ‘item_affinity_jaccard’) will have at least 3 columns: the id of A, the id of B, and the resulting score.
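As a rough sketch of that last step, here is the Jaccard computation over pre-aggregated overlap and popularity tables, again using `sqlite3` from Python. The input table names (`item_overlap`, `item_popularity`), the column names, and the numbers in them are invented for illustration. The `jaccard` column holds the plain ratio, and the final query also reports 1 minus the ratio, so both readings of the score are visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical inputs: users per item, and users shared by each item pair.
    CREATE TABLE item_popularity (item_id TEXT PRIMARY KEY, popularity INTEGER);
    CREATE TABLE item_overlap (item_a TEXT, item_b TEXT, overlap INTEGER);
    INSERT INTO item_popularity VALUES ('A', 2), ('B', 3), ('C', 2);
    INSERT INTO item_overlap VALUES ('A', 'B', 2), ('A', 'C', 1), ('B', 'C', 2);

    -- |A and B| / |A or B|, where |A or B| = pop(A) + pop(B) - overlap.
    -- The 1.0 factor forces floating-point rather than integer division.
    CREATE TABLE item_affinity_jaccard AS
    SELECT o.item_a,
           o.item_b,
           1.0 * o.overlap / (pa.popularity + pb.popularity - o.overlap) AS jaccard
    FROM item_overlap o
    JOIN item_popularity pa ON pa.item_id = o.item_a
    JOIN item_popularity pb ON pb.item_id = o.item_b;
""")

# Collect both the similarity (ratio) and the distance (1 - ratio) per pair.
affinity = {
    (a, b): (similarity, distance)
    for a, b, similarity, distance in conn.execute("""
        SELECT item_a, item_b, jaccard, 1.0 - jaccard
        FROM item_affinity_jaccard
        ORDER BY item_a, item_b""")
}

for pair, (similarity, distance) in affinity.items():
    print(pair, round(similarity, 3), round(distance, 3))
```

Whichever form you store, be consistent about the sort order when you pick recommendations: high ratio means similar, high distance means dissimilar.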

We placed those results in a test harness, and the results were visibly, obviously better than the frequency-based thing that was there before. But could we trust our eyes? Hard to know without trying it. We replaced it on the site, and clickthrough rose 18%. That we can trust! If you’re starting out with recommendations, I’d say give that a try.

For extra points, let’s move on to a less degenerate case. We’ll add a new product C to our A and B, and observe that if A is connected to B, and B is connected to C, then A is connected, in a way, to C. This will be quick, dirty, and not scalable at all (scalable in the sense that, if you wanted to add a D, E or F, you would quickly be out of luck). But if you’ve got this far, and you’ve never gone through an exercise to convince yourself that graph processing gets ugly fast when your only tools are a bunch of relational tables, try the following:

- Summarize previous results in a table: for each item, compute the count of users who flagged it, total flags, and the number of other items for which you can compute the Jaccard coefficient (let’s call this the ‘recommendation count’, and these items ‘recommendable items’).
- Make an item_relationship_step2 table that contains all the connected pairs. Avoid combinatorial explosion by only including items where recommendation count is greater than 0 and less than something that excludes items for which you already have so many direct pairs that you don’t really need the farther-away things.
- Join item_affinity_jaccard to itself and then to item_relationship_step2, and compute the two-hop distance in whatever way you think best.
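One possible reading of those three steps, sketched with `sqlite3` from Python. Several things here are assumptions rather than anything prescribed above: the `item_summary` table name, the toy chain of items A-B-C-D, storing each pair in both directions, the cap of 2 on recommendation count, and above all the choice to score a two-hop pair as the best product of the two one-hop similarities. Pick whatever combination rule suits your data.

```python
import sqlite3

MAX_REC_COUNT = 2  # made-up cap: items with this many direct pairs are skipped

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_affinity_jaccard (item_a TEXT, item_b TEXT, jaccard REAL);
    -- Hypothetical similarity scores for a chain A-B-C-D, in both directions.
    INSERT INTO item_affinity_jaccard VALUES
        ('A', 'B', 0.5), ('B', 'A', 0.5),
        ('B', 'C', 0.4), ('C', 'B', 0.4),
        ('C', 'D', 0.5), ('D', 'C', 0.5);

    -- Step 1: recommendation count = number of direct pairs per item.
    CREATE TABLE item_summary AS
    SELECT item_a AS item_id, COUNT(*) AS recommendation_count
    FROM item_affinity_jaccard GROUP BY item_a;

    CREATE TABLE item_relationship_step2 (item_a TEXT, item_c TEXT);
""")

# Step 2: candidate pairs exactly two hops apart, only for source items
# that have some direct pairs but fewer than the cap.
conn.execute("""
    INSERT INTO item_relationship_step2
    SELECT DISTINCT ab.item_a, bc.item_b
    FROM item_affinity_jaccard ab
    JOIN item_affinity_jaccard bc ON bc.item_a = ab.item_b
    JOIN item_summary s ON s.item_id = ab.item_a
    WHERE ab.item_a <> bc.item_b
      AND s.recommendation_count > 0
      AND s.recommendation_count < ?
      AND NOT EXISTS (SELECT 1 FROM item_affinity_jaccard d
                      WHERE d.item_a = ab.item_a AND d.item_b = bc.item_b)
""", (MAX_REC_COUNT,))

# Step 3: join the affinity table to itself, then to the candidate pairs,
# keeping the strongest path through any middle item.
two_hop = {
    (a, c): score
    for a, c, score in conn.execute("""
        SELECT r.item_a, r.item_c, MAX(ab.jaccard * bc.jaccard)
        FROM item_affinity_jaccard ab
        JOIN item_affinity_jaccard bc ON bc.item_a = ab.item_b
        JOIN item_relationship_step2 r
          ON r.item_a = ab.item_a AND r.item_c = bc.item_b
        GROUP BY r.item_a, r.item_c""")
}
```

Even on four items you can see the problem coming: every extra hop is another self-join, and the candidate table grows with the square of the neighbourhood size unless the recommendation-count filter is tight.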

## Responses

May 25th, 2012

Pretty interesting application of the Jaccard coefficient in the collaborative filtering realm. As a different approach, how about using the individual user’s behavior to provide recommendations?

Let’s say we have a user ‘U’ (in a simple scenario, probably with an account at Wayfair or at least some transaction history that can be tracked). Now we define two dimensions, ‘item’ and ‘action’, where ‘action’ = {‘flagged item’, ‘viewed item’, ‘bought item’, ‘social action e.g. FB Like’}

Here we are treating every action a user takes as part of a complete system. More precisely, an action ‘a’ taken on item ‘i’ would affect other items in the universe of user ‘U’, and we would want to be able to score the items ‘i’, from where we can then easily branch out to doing “more items like ‘i’”.

We could use a simple scoring function as follows:

$f(c_i) = \frac{s_{c_i}^{C}}{\sum_{j=1}^{n} s_{c_j}}$ where $0 < C < 1$, $s_{c_i}$ is the score of category $c_i$, and $i = 1, \dots, n$ (i.e. 'n' categories). The constant C is the penalty or discount factor on the greedy action. Since the scores of all categories appear in the denominator of the function, every single action on any one item belonging to a category affects the other categories in the system, thereby having a normalizing effect on the decay rate.

As is clearly evident, the function tries to adapt itself to the frequency and value of the actions chosen on item categories, altering the scoring values and thus the overall ranking and the recommendation strategies.

To see the effect, we could run a simulation with the following parameters:

Total number of item categories – N : C1, C2, C3, C4, … , CN

Total number of available actions – 3 : 'Flagged item', 'Viewed item', 'Bought item'

Value of actions – 'Flagged item':1, 'Viewed item':2, 'Bought item':3 (we could also use an equal-weight strategy)

Total number of user actions (trials) – N = 100000

Initial score of taking an action on an item category – 1/N (equally likely)

I ran the simulation with 4 categories (C_N = 4) and N = 10 trials (for brevity), and here are the results:

[Initialize]

C2:0.250 C3:0.250 C4:0.250 C1:0.250

Sum : 1.00000

N = 0

Category : [C1] Action : ['Flagged item'] Value : [1]

Sum : 2.00000

C2:0.125 C3:0.125 C4:0.125 C1:0.500

N = 1

Category : [C4] Action : ['Bought item'] Value : [3]

Sum : 3.87500

C2:0.032 C3:0.032 C4:0.774 C1:0.129

N = 2

Category : [C1] Action : ['Flagged item'] Value : [1]

Sum : 1.96774

C2:0.016 C3:0.016 C4:0.393 C1:0.508

N = 3

Category : [C3] Action : ['Bought item'] Value : [3]

Sum : 3.93443

C2:0.004 C3:0.763 C4:0.100 C1:0.129

N = 4

Category : [C2] Action : ['Viewed item'] Value : [2]

Sum : 2.99583

C2:0.668 C3:0.255 C4:0.033 C1:0.043

N = 5

Category : [C1] Action : ['Flagged item'] Value : [1]

Sum : 1.99861

C2:0.334 C3:0.127 C4:0.017 C1:0.500

N = 6

Category : [C4] Action : ['Bought item'] Value : [3]

Sum : 3.97843

C2:0.084 C3:0.032 C4:0.754 C1:0.126

N = 7

Category : [C4] Action : ['Bought item'] Value : [3]

Sum : 3.99580

C2:0.021 C3:0.008 C4:0.751 C1:0.031

N = 8

Category : [C2] Action : ['Viewed item'] Value : [2]

Sum : 2.81129

C2:0.711 C3:0.003 C4:0.267 C1:0.011

N = 9

Category : [C2] Action : ['Viewed item'] Value : [2]

Sum : 2.99253

C2:0.668 C3:0.001 C4:0.089 C1:0.004

[Frequency]

C2:3.000 C3:1.000 C4:3.000 C1:3.000
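The printed numbers in the trace are consistent with a simple update rule, sketched below for the first three trials. This rule is inferred from the output alone and is an assumption, not the original simulation code. In particular, it does not apply the exponent C from the scoring function, whose value is not stated. The function name `apply_action` is made up for the example.

```python
def apply_action(scores, category, value):
    """Apply one user action and return (total, new_scores).

    Rule inferred from the trace (an assumption):
      total            = sum of all current scores + value of the action
      acted category   = value / total
      other categories = old score / total
    """
    total = sum(scores.values()) + value
    new_scores = {c: s / total for c, s in scores.items()}
    new_scores[category] = value / total
    return total, new_scores


# Four categories, equally likely to start with.
scores = {c: 0.25 for c in ("C1", "C2", "C3", "C4")}

# First three trials from the trace: (category, value of the action).
actions = [("C1", 1), ("C4", 3), ("C1", 1)]

totals = []
for category, value in actions:
    total, scores = apply_action(scores, category, value)
    totals.append(total)
    line = " ".join(f"{c}:{s:.3f}" for c, s in scores.items())
    print(f"Sum : {total:.5f}  {line}")
```

Under this reading, a category that is not acted on shrinks by a factor of roughly the action's value on every trial, which matches how quickly C3 fades toward zero in the last rows of the trace.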

Would love to know your thoughts on this.

Thanks

Udayan