Fun with math: Similarity Scores and the Kentucky Wildcats
One of the fun things to discuss about sports is how players and teams relate to their historical counterparts. Who does a certain player remind you of? What team plays like this one? Most of the time these discussions are limited by our own experience, memory and observation, which shrinks the available pool of comparisons. Many years ago the famed baseball writer Bill James introduced the idea of Similarity Scores - a method of quantifying how similar the careers of two players were. This provided a way of identifying potentially similar baseball players without the prerequisite of having seen them play - an advantage given the long history of the sport. The idea has since been applied to basketball (mostly the NBA) and is used quite frequently in making player projections.
I've been fiddling around with doing something similar for college basketball teams. I was interested in what kind of results I would get, and I wanted to know whether applying the method to the current season might lend any insight into the future prospects of Kentucky. While I wouldn't take this too seriously, some of the results I got were quite intriguing, and if nothing else I think they might provoke some good discussions.
The use of similarity scores as a means of comparing teams - while not unheard of - is not common, but I did find a discussion on the APBRmetrics board where someone did pretty much the same thing I did only for NBA teams. For the sake of completeness, here is an article from a few years ago discussing some of the philosophy behind the use of Similarity Scores.
The method I used is fairly straightforward: I used the 4 Factors for both offense and defense as a means to compare two teams. This gives me 8 categories to compare and provides a solid fundamental description of every team. I computed the z-score for each category (how many standard deviations a team sits above or below that season's average) and used the Euclidean distance across all 8 z-scores to measure how "close" two teams are. If you don't know what a z-score or Euclidean distance is then don't worry - just know the lower the score, the better the match. After that it's a simple process to sort the scores to find the closest matches.
Fortunately for me, Ken Pomeroy has data on the 4 factors available in a handy csv file for each season starting with 2003-2004. This gave me data on 2,011 team-seasons to use for comparisons and made it relatively easy to set up a spreadsheet with all the pertinent information.
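For anyone who wants to see the arithmetic, here is a minimal sketch in Python of the calculation described above. It is only an illustration, not the spreadsheet itself: the `FACTORS` labels are hypothetical shorthand for the 8 categories, and the team names and z-score values are made up.

```python
import math

# Hypothetical labels for the 8 categories: the 4 Factors on offense and defense.
FACTORS = ["off_efg", "off_to", "off_or", "off_ftr",
           "def_efg", "def_to", "def_or", "def_ftr"]

def z_scores(team, means, stdevs):
    """Convert a team's raw factor values into z-scores for its season."""
    return [(team[f] - means[f]) / stdevs[f] for f in FACTORS]

def similarity(z_a, z_b):
    """Euclidean distance between two z-score vectors; lower means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_a, z_b)))

def closest_matches(target, candidates, n=10):
    """Return the n (score, name) pairs with the smallest distance to target."""
    scored = [(similarity(target, z), name) for name, z in candidates.items()]
    return sorted(scored)[:n]

# Made-up z-score vectors, purely for illustration.
team_x = [1.0, -0.5, 0.8, 0.2, -1.1, 0.3, -0.4, 0.0]
pool = {
    "Team A": [1.1, -0.4, 0.7, 0.1, -1.0, 0.2, -0.5, 0.1],
    "Team B": [-1.0, 0.5, -0.8, 0.0, 1.0, -0.3, 0.4, 0.0],
}
for score, name in closest_matches(team_x, pool):
    print(f"{name}: {score:.2f}")
```

A team compared with itself always scores 0, which is why each table below leads with a 0 for the team being matched.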
Here I must admit to a bit of mathematical fudging. While Pomeroy provides the statistical means for each category for each season, he does not list the standard deviations, so getting exact figures would require me to go team-by-team to get the exact number of rebounds, possessions, field goal attempts, and free throws for each. That's a rather overwhelming task for 2,011 team-seasons, so rather than use exact values for the standard deviation and the mean, I estimated them for each season using the values in the csv file. I don't think this makes too much difference, however, as the difference between the estimated mean and the true mean tends to be less than one half of one percent (0.5%), except for defensive free throw rate, where the difference is closer to 1%. I suspect the estimated and true standard deviations are similarly close.
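To make the estimation step concrete, here is a rough sketch in Python. It assumes the estimate is simply the unweighted mean and spread of the per-team values in one season's file. The post doesn't say whether a population or sample standard deviation was used, so `pstdev` here is a guess, and all of the numbers are invented for illustration.

```python
import statistics

def estimate_season_stats(values):
    """Estimate one category's mean and standard deviation for a season
    from the per-team values (an unweighted average across teams)."""
    return statistics.mean(values), statistics.pstdev(values)

def pct_diff(estimated, true_value):
    """Relative difference between an estimate and the true value, in percent."""
    return abs(estimated - true_value) / true_value * 100

# Made-up effective FG% values for a five-team "season".
efg = [48.0, 50.0, 52.0, 49.0, 51.0]
est_mean, est_sd = estimate_season_stats(efg)

# Made-up "true" national mean, standing in for the published figure.
true_mean = 50.2
print(f"estimated mean {est_mean:.2f}, sd {est_sd:.2f}, "
      f"off by {pct_diff(est_mean, true_mean):.2f}%")
```

The estimated and true means can differ because averaging each team's percentage treats every team equally, while a national figure built from raw totals gives more weight to teams with more attempts or possessions.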
On to some results!
I started with finding some comps for UK's two best teams of the last 6 years: 2004 and 2005.
| Year | Team | Score | Record | PostSeason |
| 2004 | Kentucky | 0 | 27 - 5 | 2nd rd |
| 2005 | Oklahoma | 1.12 | 25 - 8 | 2nd rd |
| 2007 | North Dakota St. | 1.24 | 20 - 8 | |
| 2004 | Stanford | 1.26 | 30 - 2 | 2nd rd |
| 2006 | UCLA | 1.26 | 32 - 7 | Champ. Gm |
| 2007 | Kansas | 1.3 | 33 - 5 | Elite 8 |
| 2009 | Louisville | 1.34 | 31 - 6 | Elite 8 |
| 2004 | Central Florida | 1.37 | 25 - 6 | 1st rd |
| 2004 | Florida St. | 1.4 | 19 - 14 | NIT |
| 2006 | Winthrop | 1.4 | 23 - 8 | 1st rd |
| 2008 | Illinois St. | 1.44 | 25 - 10 | NIT |

| Year | Team | Score | Record | PostSeason |
| 2005 | Kentucky | 0 | 28 - 6 | Elite 8 |
| 2008 | Duke | 0.92 | 28 - 6 | 2nd rd |
| 2006 | Arkansas | 1.15 | 22 - 10 | 1st rd |
| 2005 | Wisconsin Milwaukee | 1.36 | 26 - 6 | Sweet 16 |
| 2009 | East Tennessee St. | 1.49 | 23 - 11 | 1st rd |
| 2009 | Duke | 1.51 | 30 - 7 | Sweet 16 |
| 2006 | Pennsylvania | 1.57 | 20 - 9 | 1st rd |
| 2008 | Akron | 1.58 | 24 - 11 | NIT |
| 2007 | Purdue | 1.6 | 22 - 12 | 2nd rd |
| 2004 | East Tennessee St. | 1.63 | 27 - 6 | 1st rd |
| 2005 | George Washington | 1.64 | 22 - 8 | 1st rd |
The 2004 team has some interesting comps. Stanford was also a #1 seed in 2004 and, like UK, was upset in the 2nd round of the NCAA Tournament (by Alabama, in Stanford's case). There are also a couple of teams that made the Elite 8 and a UCLA squad that reached the championship game. The top 10 is rounded out by some smaller schools that either made the NIT or were early exits in the Big Dance. The 2005 UK team has a slightly different makeup in its top 10: only 1 NIT team, with every other school making the NCAA tournament but none getting past the Sweet 16.
Next up is last year's UK squad that went to the NIT:
| Year | Team | Score | Record | PostSeason |
| 2009 | Kentucky | 0 | 22 - 14 | NIT |
| 2004 | Virginia Commonwealth | 1.37 | 23 - 8 | 1st rd |
| 2008 | Texas Arlington | 1.39 | 21 - 12 | 1st rd |
| 2006 | UCLA | 1.51 | 32 - 7 | Champ. Gm |
| 2007 | Duke | 1.51 | 22 - 11 | 1st rd |
| 2009 | Kansas | 1.56 | 27 - 8 | Sweet 16 |
| 2009 | Wake Forest | 1.6 | 24 - 7 | 1st rd |
| 2005 | Texas A&M Corpus Chris | 1.61 | 20 - 8 | |
| 2007 | Central Florida | 1.62 | 22 - 9 | |
| 2007 | Arkansas | 1.68 | 21 - 13 | 1st rd |
| 2004 | St. Mary's | 1.7 | 19 - 12 | |
I think the results here are actually rather informative. There are a few NCAA participants and some small schools that missed the tournament, and I think this lines up with what we saw from last year's squad: they definitely had the talent to make the NCAAs and were well on their way after starting 4 - 0 in SEC play but - as we know all too well - just fell apart down the stretch.
By the way, you'll notice that 2006 UCLA team pops up again. They are actually an interesting case. Nearly every comparison I've done so far has had a dozen teams with a comparison score of 2 or less - except for 2006 UCLA. That team has around 85 teams with a comparison score of 2 or less. The 2005 UCLA team has 180 such scores and the 2007 UCLA team has around 40. I'm not sure what that means exactly, but I have yet to find another team with such a large number of low scores.
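Counting how many teams land under a cutoff like this is easy to automate. The snippet below is a small Python sketch of that count, assuming each team is already represented by its list of 8 z-scores; the function name and the sample values are hypothetical.

```python
import math

def count_close_comps(target, pool, threshold=2.0):
    """Count teams in pool whose Euclidean distance to target is <= threshold."""
    return sum(1 for z in pool.values() if math.dist(target, z) <= threshold)

# Made-up z-score vectors (8 categories each), purely for illustration.
target = [0.0] * 8
pool = {
    "Team A": [1.0] + [0.0] * 7,   # distance 1 from the target
    "Team B": [2.0] + [0.0] * 7,   # distance 2, right at the cutoff
    "Team C": [3.0] + [0.0] * 7,   # distance 3, outside the cutoff
}
print(count_close_comps(target, pool))
```

Running the same count for every team-season would show how unusual the UCLA totals really are.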
Here are some more non-UK results: The 2008 Memphis team and last year's national champs.
| Year | Team | Score | Record | PostSeason | Year | Team | Score | Record | PostSeason |
| 2008 | Memphis | 0 | 38 - 2 | Champ game | 2009 | North Carolina | 0 | 34 - 4 | NCAA Champ |
| 2009 | Memphis | 1.44 | 33 - 4 | Sweet 16 | 2007 | North Carolina | 1.27 | 31 - 7 | Elite 8 |
| 2007 | North Carolina | 1.72 | 31 - 7 | Elite 8 | 2007 | Notre Dame | 1.45 | 24 - 8 | 1st rd |
| 2008 | Wisconsin | 1.78 | 31 - 5 | Sweet 16 | 2008 | Pittsburgh | 1.55 | 27 - 10 | 2nd rd |
| 2004 | Cincinnati | 1.79 | 25 - 7 | 2nd rd | 2007 | Texas | 1.62 | 25 - 10 | 2nd rd |
| 2004 | Nevada | 1.81 | 25 - 9 | Sweet 16 | 2007 | Ohio St. | 1.66 | 35 - 4 | Champ game |
| 2007 | Notre Dame | 1.82 | 24 - 8 | 1st rd | 2008 | North Carolina | 1.78 | 36 - 3 | Final 4 |
| 2008 | St. Mary's | 1.85 | 25 - 7 | 1st rd | 2008 | UCLA | 1.8 | 35 - 4 | Final 4 |
| 2008 | Kansas | 1.89 | 37 - 3 | NCAA Champ | 2005 | Charlotte | 1.88 | 21 - 8 | 1st rd |
| 2007 | Wisconsin | 1.91 | 30 - 6 | 2nd rd | 2008 | St. Mary's | 1.92 | 25 - 7 | 1st rd |
| 2008 | UCLA | 1.92 | 35 - 4 | Final 4 | 2009 | St. Mary's | 1.96 | 28 - 7 | NIT |
These were elite teams and that's reflected in the comparisons. All of the top 10 comps for both schools made the NCAA tournament with the exception of UNC's #10. For those who don't recall, that 2009 St. Mary's team starred Patty Mills and would have easily made the tournament had Mills not been injured against Gonzaga halfway through the season. As it was, they were one of the teams right on the bubble on Selection Sunday that year. Otherwise you see a lot of teams that had considerable success in the tournament: 6 of Memphis's 10 comps got past the first weekend, as did 4 of UNC's 10.
You'll also note that the #1 comp for each team is the same school from a neighboring season (2009 Memphis and 2007 North Carolina). That makes a lot of sense when you think about the makeup and stability of each squad from year to year, and I think it lends some credibility to this method. As an aside, I found similar results for other teams, where nearby seasons from the same school appeared in the top 10 comps. These tended to be schools with several important returning players and no coaching changes.
Okay, so now the moment you've all been waiting for. Here is the current edition of the Wildcats. I actually have two sets to show you: one was done on Monday, before the Hartford game, and the other was done Wednesday. Since this is an in-season comparison, I wanted to see how the list changes as more games are played.
| Year | Team | Score | Record | PostSeason | Year | Team | Score | Record | PostSeason |
| 2010 | Kentucky | 0 | 13 – 0 | ? | 2010 | Kentucky | 0 | 14 – 0 | ? |
| 2006 | North Carolina | 0.99 | 23 – 8 | 2nd rd | 2006 | North Carolina | 1.11 | 23 – 8 | 2nd rd |
| 2005 | Pittsburgh | 1.17 | 20 – 9 | 1st rd | 2004 | Mississippi St. | 1.38 | 26 – 4 | 2nd rd |
| 2004 | Mississippi St. | 1.4 | 26 – 4 | 2nd rd | 2005 | Pittsburgh | 1.48 | 20 – 9 | 1st rd |
| 2008 | New Mexico St. | 1.55 | 21 – 14 | | 2008 | North Carolina | 1.53 | 36 – 3 | Final 4 |
| 2008 | Syracuse | 1.61 | 21 – 14 | NIT | 2008 | New Mexico St. | 1.73 | 21 – 14 | |
| 2008 | North Carolina | 1.82 | 36 – 3 | Final 4 | 2008 | Syracuse | 1.81 | 21 – 14 | NIT |
| 2007 | Providence | 1.84 | 18 – 13 | NIT | 2007 | North Carolina | 1.81 | 31 – 7 | Elite 8 |
| 2005 | Mississippi St. | 1.85 | 23 – 11 | 2nd rd | 2009 | Pittsburgh | 1.85 | 31 – 5 | Elite 8 |
| 2007 | North Dakota St. | 1.91 | 20 – 8 | | 2007 | North Dakota St. | 1.89 | 20 – 8 | |
| 2006 | Louisiana St. | 1.92 | 27 – 9 | Final 4 | 2006 | Louisiana St. | 1.98 | 27 – 9 | Final 4 |
For the most part the lists are identical: 8 of the 10 comps are the same and in mostly the same order. The differences are in the 7th and 8th spots, where the comps changed to more impressive teams after the win over Hartford. These comps suggest that right now UK is good enough to make the Sweet 16 or Elite 8.

Of particular interest to me is the presence of three recent UNC teams. Those teams had a lot of young talent, much like our Wildcats this year, and they did pretty well collectively. Recall that the 2007 UNC team starred sophomore Tyler Hansbrough and freshmen Ty Lawson & company. The game they lost to Georgetown in the Elite 8 was one in which they dominated the first 30+ minutes and had a huge lead, only to fall apart at the end. Despite some struggles early, I have a lot more confidence in the Cats' ability to close out teams late.

The only team that looks really out of place is North Dakota St., but let me tell you, that 2007 North Dakota St. team has a lot of good BCS-conference teams in its own top 10.
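The overlap between the Monday and Wednesday lists can also be checked mechanically, which may be handy as the season goes on and more snapshots pile up. This is just a sketch in Python using the team-seasons from the table above, with each list typed in by hand in ranked order.

```python
# Top 10 comps for 2010 Kentucky, in ranked order, copied from the table above.
monday = [
    "2006 North Carolina", "2005 Pittsburgh", "2004 Mississippi St.",
    "2008 New Mexico St.", "2008 Syracuse", "2008 North Carolina",
    "2007 Providence", "2005 Mississippi St.", "2007 North Dakota St.",
    "2006 Louisiana St.",
]
wednesday = [
    "2006 North Carolina", "2004 Mississippi St.", "2005 Pittsburgh",
    "2008 North Carolina", "2008 New Mexico St.", "2008 Syracuse",
    "2007 North Carolina", "2009 Pittsburgh", "2007 North Dakota St.",
    "2006 Louisiana St.",
]

shared = set(monday) & set(wednesday)
dropped = [team for team in monday if team not in shared]    # fell off the list
added = [team for team in wednesday if team not in shared]   # new to the list

print(f"{len(shared)} shared; dropped: {dropped}; added: {added}")
```

This matches the count in the text: the two lists share 8 teams, with Providence and 2005 Mississippi St. replaced by 2007 North Carolina and 2009 Pittsburgh.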
I'm going to continue tracking UK this way for the rest of the season. I won't do it after every game, but maybe every couple of weeks while I also play around with the lists and look for any interesting patterns. If you would like a copy of the spreadsheet I'm using you can email me and I'll send you a copy.
Comments
Absolutely fascinating.
I love this kind of exercise, and I think there is some insight to be gained by this method. I think I’ll alert Ken Pomeroy to what you’ve done, I would be interested to hear his comments, and I think he would be fascinated to see how you’ve used the data.
I think that the comparisons that wind up in UK’s top ten are pretty darn apt. There are a lot of very interesting “coincidences” revealed in this effort, and this is just the kind of analysis that really fascinates me.
Unfortunately, I’m not enough of a statistician to pull off something like this. :-) Well done.
A Sea of Blue -- Kentucky Sports for the Discerning Fan
Thanks, I'm glad you like it
What I love most about it is that I had absolutely no idea what I was going to get for results – it might have just come out with random sets of good and bad teams.
There’s not a whole lot of heavy lifting, statistically speaking. Just some basic knowledge and the time and patience to get everything set up properly.
3 > 2, except for very large values of 2.
Wow
Very interesting and am sure it took tons of work. I know these results only go back so far, but I would be fascinated at how the ’96 Cats compare to the greatest teams of all time and which teams are their closest comps.
It might be possible to estimate it
Using Jon Scott’s website you could get their 4 factors information. It would then be necessary to extrapolate the NCAA mean and standard deviation from other data. I’m not sure what the best way is to go about that, but there’s probably some way to do it and get an answer that has some hope of being accurate.
Actually, that’s not a bad idea, I’ll give it some thought.
3 > 2, except for very large values of 2.
Very Nice!
I know what you mean about trying to get all the data on all the teams. I was trying to get at the same data for the Pre-game widget. I wanted to use it to calculate the adjusted OE and DE for each game for some other goodness I wanted to include….alas, that was too daunting a task.
Off subject but,
I watched the UT-Memphis game today, and UT really struggled against the Memphis zone. Something UK should take note of…
"You are what you are and you ain't what you ain't"
Vols-Tigers
Totally unimpressed by both teams. Tigers play like they’re coached by a young, inexperienced guy and Pastner doesn’t seem to have them under even minimum control. UT’s defense was effective but due in great part to Memphis’ lack of patience and organization. UT dominated boards but again much due to Memphis lack of size and position. Conference USA must be euphoric that Calipari has gone and league will finally have some balance.
"Learn(ing) without thinking begets ignorance. Think(ing) without learning is dangerous."
-Confucius
Stats ignore the intangibles of history
Equipment, nutrition, advances in training, medical advances, the “bar” of shooting for a record.
With all due respect to the numbers guys, I’d prefer to honor history and compare apples to apples in the present.
No matter where you're at, there you are
Well, this is the same era
There are ways to account for the differences in eras. That is something that was particularly important for baseball with its long history. For basketball this would be an issue if we were trying to compare teams of today against, say, teams from the early 80’s but that’s not an issue here since I’m only looking at teams since 2003-2004.
This type of comparison doesn’t replace what you are describing – it enhances it by helping to reveal patterns and similarities that would not be apparent to someone who didn’t actually get to see teams play.
3 > 2, except for very large values of 2.
