NBA: Pythagoras and Close Games



Latrinsorm
10-05-2013, 03:09 PM
The Pythagorean model (http://en.wikipedia.org/wiki/Pythagorean_expectation) is not perfectly accurate. What if one source of error we can remove is that close games are a .500 proposition for every team, and any deviation from that is merely statistical noise?
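For reference, here is a minimal sketch of the Pythagorean prediction itself. The function name and the exponent of 14 are assumptions for illustration (14 is a value commonly used for basketball; the post does not state which exponent its source uses).

```python
def pythagorean_wins(points_for, points_against, games=82, exponent=14.0):
    """Expected wins from points scored and points allowed.

    exponent=14.0 is an assumed value, not one taken from the post.
    Win% = PF^k / (PF^k + PA^k), rewritten as 1 / (1 + (PA/PF)^k)
    so that the large powers never have to be formed directly.
    """
    ratio = points_against / points_for
    win_pct = 1.0 / (1.0 + ratio ** exponent)
    return games * win_pct


# A team that scores exactly what it allows is predicted to go 41-41.
even_team = pythagorean_wins(8000, 8000)
```

A team outscoring its opponents gets more than 41 predicted wins, and the predictions for the two sides of the same point totals always add up to the number of games played.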

So I tabulated all the NBA team-seasons where 82 games were played in a 30-team league, which gives us 238 (2005-2011 and 2013, less Boston and Indiana in 2013, who each played 81 games). That gives us...


12 1 17
13 1 16
15 2 35
17 2 43
18 2 45
19 3 57
20 2 41
21 2 33
22 4 99
23 5 114
24 6 163
25 2 52
26 6 170
27 5 148
28 2 57
29 5 148
30 3 90
31 2 67
32 6 187
33 10 338
34 10 337
35 5 174
36 5 183
37 4 147
38 4 140
39 3 112
40 7 275
41 12 490
42 7 289
43 5 209
44 8 339
45 9 400
46 4 184
47 6 272
48 3 141
49 6 291
50 10 508
51 3 156
52 5 267
53 5 267
54 8 432
55 4 209
56 5 281
57 5 279
58 4 240
59 5 304
60 2 122
61 3 174
62 3 181
63 1 61
64 1 60
65 1 61
66 3 194
67 1 61
...where the first column is actual games won, the second column is the number of team-seasons with that win total, and the last column is total predicted wins for those team-seasons. If our hypothesis is correct, we would expect teams with 40 or fewer wins (sub-.500) to have more actual than predicted wins, teams with 42 or more (super-.500) to have fewer, and teams with 41 to have the same. As it turns out...


rec n a p dif
<.500 109 3219 3288 69
>.500 12 492 490 -2
=.500 117 6038 5982 -56
...that is pretty much what we see! But a few things:

1. What if a team was predicted to have 41 wins and actually had 43? This fast read would say that the sub-.500 got 2 games closer to .500, but they really only got 1 game closer to .500 and then went even further. How should we count that? It's hard to say, and while those bridge-crossers only accounted for 11 of 238 team-seasons, they accounted for -38 of the -56 (but only 5 of the 69). I think the smart thing to do would be to count 41 to 43 as 1 game towards .500, 1 game away from .500, net result 0 for our metric. This reduces the total observed value of [69+56-2 = 123] by 24, because a good part of that -38 was in the correct direction to start with, so over 238 team-seasons we see 99 games' worth of close games being crapshoots, or about two-fifths of a game per team per season.

2. More importantly, I'm not at all sure how to quantify error bars for this measurement.

3. I'm using the basketball-reference stated numbers for Pythagorean Wins, but obviously the formalism returns long decimals, and when we're talking about decimal differences that may be significant.

But for now I'm encouraged to go back through the years of 82 games and see what we can see.

Latrinsorm
10-05-2013, 05:45 PM
Alright, went back to 1985 and it's looking very promising. Raw data...


11 2 22
12 2 33
13 2 35
14 1 15
15 8 139
16 1 17
17 10 184
18 5 100
19 10 202
20 10 213
21 12 252
22 11 271
23 10 236
24 16 414
25 10 260
26 16 444
27 10 279
28 14 394
29 13 395
30 14 419
31 16 528
32 12 376
33 17 561
34 15 511
35 13 454
36 20 717
37 13 488
38 12 440
39 15 584
40 19 750
41 35 1449
42 33 1387
43 18 777
44 27 1182
45 21 933
46 12 552
47 27 1279
48 12 559
49 15 739
50 29 1447
51 11 553
52 17 879
53 15 809
54 17 912
55 22 1194
56 16 871
57 22 1263
58 12 701
59 13 760
60 7 416
61 9 535
62 10 588
63 6 358
64 3 185
65 2 123
66 3 194
67 4 254
69 1 68
72 1 70

...relative to .500 (also, in the post above I misordered the labels; obviously more than 12 teams finished above .500, and the labels should look like this)...


rec n a p dif
<.500 329 9513 9733 220
=.500 35 1435 1449 14
>.500 385 19752 19588 -164
...which gives us a first-order figure of 370, minus 76 for overcorrection of near-.500 teams. It also occurred to me that using non-absolute values for .500 teams hid some deviation there, because every deviation from .500 is away from .500, so minus another 46, leaving us with 262 games towards .500 for 749 team-seasons, or just about one third of one game per team per season. That the value is very nearly the same even with 3 times the data makes it look very promising from a standard deviation perspective.
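The grouping behind these relative-to-.500 tables can be sketched roughly as follows. The function name and the dictionary layout are made up for illustration; only the three rows for 40, 41 and 42 wins are copied in from the raw data above, so the output covers just those rows, not the full table.

```python
from collections import defaultdict


def summarize(rows, even=41):
    """Group (actual_wins, team_seasons, total_predicted_wins) rows by record.

    Returns, per group: n (team-seasons), a (total actual wins),
    p (total predicted wins) and dif (p - a, as in the tables above).
    """
    totals = defaultdict(lambda: {"n": 0, "a": 0, "p": 0})
    for wins, n, predicted in rows:
        if wins < even:
            key = "<.500"
        elif wins > even:
            key = ">.500"
        else:
            key = "=.500"
        totals[key]["n"] += n
        totals[key]["a"] += wins * n  # every team-season in the row won `wins` games
        totals[key]["p"] += predicted
    for group in totals.values():
        group["dif"] = group["p"] - group["a"]
    return dict(totals)


# Three rows from the 1985-onward raw data: 40, 41 and 42 actual wins.
rows = [(40, 19, 750), (41, 35, 1449), (42, 33, 1387)]
summary = summarize(rows)
```

With only these rows, the =.500 group comes out as n = 35, a = 1435, p = 1449, dif = 14, which is the =.500 line of the table. Feeding in every row of the raw data would be the way to check the other two lines.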

And that's all a fine start, but it also turns out there were at least 10 team-seasons at every win total from 19 through 59, so why don't we graph them?

http://img.photobucket.com/albums/v456/johnnyoldschool/NBAPythagorasandCloseGames_zpse9d1a227.jpg

I've taken the liberty of marking 41 games as that corresponds to a .500 winning percentage in the NATIONAL BASKETBALL ASSOCIATION and reminds us that the graph is lopsided by our constraint. It's very noisy, but there is the hint of a more discriminatory effect than "less than .500? more?", which makes sense: a .200 team stands more to gain with a subset of .500 games than a .499 team.

Also, it makes Jordan's average of 70.5 wins over 2 seasons seem even more unapproachable when we find out his team was 3 games higher than we would expect at first order, 3.7 higher if this 1/3 per year formalism holds up, and possibly even higher given how far his team was from .500.

Latrinsorm
10-06-2013, 06:12 PM
"a .200 team stands more to gain with a subset of .500 games than a .499 team."

And it occurred to me that we could just work that out algebraically if we have an estimate of how many close games a team tends to play, which from 2002-2012 turns out to be 28% of games...

predicted winning percentage = p
games played = 82
on average, games played that will be close (within 5 points) = 82 * .28
on average, winning percentage in those games = .500
adjusted prediction of wins = (82 - 82 * .28) * p + 82 * .28 * .5
= 59.04 * p + 11.48
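
The same algebra as a small function, in case anyone wants to try other values. The function and parameter names are made up for illustration; the defaults (82 games, 28% close, .500 in close games) are the figures above.

```python
def adjusted_wins(p, games=82, close_share=0.28, close_win_pct=0.5):
    """Adjusted win prediction when a share of games is treated as coin flips.

    p is the predicted winning percentage, applied only to the
    games that are not close.
    """
    close_games = games * close_share
    return (games - close_games) * p + close_games * close_win_pct


# How much each kind of team gains over the unadjusted 82 * p.
gain_200 = adjusted_wins(0.200) - 82 * 0.200
gain_499 = adjusted_wins(0.499) - 82 * 0.499
```

With the defaults this reduces to 59.04 * p + 11.48, so a .500 team stays at 41 wins while a .200 team moves from 16.4 to about 23.3, a much bigger gain than a .499 team gets.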

Using this, our sum of squared residuals goes from 6952 to 16315, so that's not right. What we could do instead of extrapolating from an empirical point is just to try values for the % of games that are 50-50 propositions and see if any reduce our squares, but as it turns out, none do. Even saying one-thousandth of games (or about 4 minutes per season) are 50-50 increases the squares. I'm not sure what to make of this, but another thing we can try is returning to a flat binary effect of the form...

if prediction > 41 games, revise prediction to prediction - x
if prediction < 41 games, revise to prediction + x
if prediction = 41 games, leave alone

...which does marginally reduce our squares for x on the range of 0 to 0.15 and reaches a minimum at .079, but the effect is so much smaller than the ~1/3 observed above that I think something must be wrong. Perhaps returning to our sums relative to .500 will help:


rec p p2 a
<.500 9733 9763.5 9513
>.500 19588 19555.3 19752

And it does, because it suddenly occurs to me that I had an extra minus sign running around, and >.500 teams are better than predicted while <.500 teams are worse. Whoops! :D
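
For anyone who wants to rerun the flat adjustment with the sign sorted out, here is a rough sketch of the x search. All names are made up for illustration, and the sample pairs are invented, not real team-seasons. The sign of x is left free: positive x pulls predictions toward 41, negative x pushes them away.

```python
def revise(prediction, x, even=41):
    """Shift a predicted win total by x toward `even` (away from it if x < 0)."""
    if prediction > even:
        return prediction - x
    if prediction < even:
        return prediction + x
    return prediction


def sum_squared_residuals(pairs, x):
    """pairs: (predicted_wins, actual_wins) per team-season."""
    return sum((actual - revise(pred, x)) ** 2 for pred, actual in pairs)


def best_shift(pairs, lo=-1.0, hi=1.0, steps=2001):
    """Try evenly spaced x values on [lo, hi] and keep the one with the smallest squares."""
    candidates = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda x: sum_squared_residuals(pairs, x))


# Invented (predicted, actual) pairs, purely to exercise the search.
sample = [(30, 29), (35, 35), (41, 41), (48, 49), (55, 55)]
x_best = best_shift(sample)
```

On this invented sample the best x is negative, because the made-up teams sit further from .500 than their predictions. Real (predicted, actual) pairs would have to replace `sample` before reading anything into the result.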