A Statistics Thread



Latrinsorm
09-17-2010, 05:09 PM
It occurred to me recently that while baseball players are very prone to go 0 for 10, 15, 20, 25, etc., baseball players almost never go 10 for 10, 15 for 15, etc. Of course, no baseball player bats .500, so this is to be expected from our delicious binomial distribution, but I wondered: just how do streaks play out for any given hitter? So I checked.

David Wright, in 2008, who batted 189/626 = .302, had the following streaks. Each row gives the streak type ("1" for a streak of hits, "0" for a streak of non-hits), the streak length in at bats, and how many times a streak of exactly that length occurred. So "1  6  1" means he went 6 for 6 once, "1  2  28" means he went 2 for 2 twenty-eight times, and "0  19  1" means he went 0 for 19 once:

Type  Length  Count
1     6       1
1     5       1
1     4       3
1     3       6
1     2       28
1     1       92
0     1       45
0     2       27
0     3       14
0     4       12
0     5       9
0     6       13
0     7       4
0     8       1
0     9       2
0     10      1
0     11      0
0     12      0
0     13      1
0     14      1
0     15      1
0     16      0
0     17      0
0     18      0
0     19      1


Now, am I crazy to think "what in the name of Eratosthenes is going on here??!"!!? I am totally flummoxed by the (by far) most likely streak being 1 hit for 1 at bat. If I restrict my distribution to 1 case, of course the most likely result is not a hit. Is that improper, and if so, in what other way would I model the relative likelihoods of an 0 for 1 streak versus a 1 for 1 streak versus an 0 for 19 streak, and so on?
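
For anyone who wants to reproduce this kind of streak table, here is a minimal sketch in Python. It assumes the season is available as a chronological list of 1s (hits) and 0s (non-hits). The function name `streak_table` and the ten at bats in `sample` are made up for illustration and are not real data.

```python
from collections import Counter
from itertools import groupby

def streak_table(at_bats):
    """Tally runs of consecutive identical outcomes.

    at_bats: iterable of 1 (hit) and 0 (non-hit) in chronological order.
    Returns a Counter keyed by (outcome, run_length).
    """
    table = Counter()
    for outcome, run in groupby(at_bats):
        table[(outcome, len(list(run)))] += 1
    return table

# Hypothetical 10-at-bat sequence, NOT real data.
sample = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
table = streak_table(sample)

# Print rows as "type length count", hit streaks first.
for (outcome, length), count in sorted(table.items(), reverse=True):
    print(outcome, length, count)
```

Every at bat belongs to exactly one streak, so length times count summed over all rows should equal the number of at bats, which is a handy sanity check on any table built this way.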

Sean
09-17-2010, 05:17 PM
How does your statistical model account for the quality of pitcher faced? I'd say it's a lot easier to go 10 for 10 in a series against a subpar pitching staff than it is against a quality pitching staff.

Rinualdo
09-17-2010, 05:20 PM
Walks.

eta- relevant for streaks, not batting average of course.

Latrinsorm
09-17-2010, 05:56 PM
I am discounting all incidentals such as "a soft-tossing lefty" or "any pitcher on any Pirates' staff" and using simply the brutish average. I am sure that this is not wholly correct, but I am not clear on how this could be incorrect one way and not the other (only 6 for 6 vs. 0 for 19 in the extreme, and 1 for 1 vs. 0 for 1 in the particular). I am eager to concede that there are specific anomalies to blame for this disparity, but I would feel rather... rather slimy to do so when I am not sure that systematic errors in my approach are not to blame.

I have also wholly discounted walks, sac flies (Mr. Wright regularly leads the league in this category), et cetera that are not factored into batting average. Only hits, only at bats, and only batting average are enumerated in the first post for this reason.

Bobmuhthol
09-17-2010, 06:03 PM
I think the counterintuitive results are because of dependence. Assuming batting average represents the purely random chance of getting a hit, most hits will not be followed up with another hit, but there is a good chance that a not-hit will be followed up with another not-hit. The easiest way to see this is to look at where most of the weight of hits or not-hits lies: 1 or 2 for hits, 1 through 6 for not-hits. You are saying "if I am starting a streak, what is the most likely streak I am starting?" which leads you to it being a 1-hit streak, but 1-no-hit and 2-no-hit streaks combined are more likely to occur. I think, and this is pure speculation because I have nothing to back it up, that your results will make more intuitive sense if you include "multiple" streaks, i.e., a streak of 5 no-hits would count as a streak of 1, 2, 3, 4, and 5; that way, you can ask the question for a particular hit and it will be effectively the beginning of a streak.

I probably am not making sense but the idea is that your results only allow you to ask "what is the most likely streak?" for the first at bat. Afterward, you're already in a streak, so you would have to know that the next result would be different than the previous result in order for any prediction to have meaning, but if you already know the next result then you aren't predicting anything at all; i.e., you can't do what you're trying to do with the data that you have.
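
The point that most hits are not followed by another hit can be put in numbers. The sketch below assumes every at bat is an independent trial with hit probability p = 189/626, so a hit streak ends with probability 1 - p on each at bat and a no-hit streak ends with probability p. It takes the 131 hit streaks and 132 no-hit streaks from the first post as given and ignores the cutoff at the end of the season. The function name `expected_run_counts` is made up for illustration.

```python
def expected_run_counts(total_runs, p_continue, max_len):
    """Expected number of runs of each exact length, assuming run lengths
    are geometric: P(length = k) = p_continue**(k - 1) * (1 - p_continue)."""
    return {
        k: total_runs * p_continue ** (k - 1) * (1 - p_continue)
        for k in range(1, max_len + 1)
    }

p = 189 / 626  # batting average from the first post

# Run totals are the column sums of the table in the first post.
hit_runs = expected_run_counts(131, p, 6)       # hit streaks continue with prob p
out_runs = expected_run_counts(132, 1 - p, 19)  # no-hit streaks continue with prob 1 - p

for k in range(1, 7):
    print(k, round(hit_runs[k], 1), round(out_runs[k], 1))
```

Under these assumptions the model expects roughly 91 streaks of 1 for 1 against roughly 40 streaks of 0 for 1, which is close to the 92 and 45 in the first post. A hit streak usually dies right away, while no-hit streaks spread their count over many lengths.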

Sean
09-17-2010, 07:18 PM
Originally Posted by Latrinsorm:
> I am discounting all incidentals such as "a soft-tossing lefty" or "any pitcher on any Pirates' staff" and using simply the brutish average. I am sure that this is not wholly correct, but I am not clear on how this could be incorrect one way and not the other (only 6 for 6 vs. 0 for 19 in the extreme, and 1 for 1 vs. 0 for 1 in the particular). I am eager to concede that there are specific anomalies to blame for this disparity, but I would feel rather... rather slimy to do so when I am not sure that systematic errors in my approach are not to blame.
>
> I have also wholly discounted walks, sac flies (Mr. Wright regularly leads the league in this category), et cetera that are not factored into batting average. Only hits, only at bats, and only batting average are enumerated in the first post for this reason.

I think another interesting metric to look at would be for example when he went 6 for 6, what was the quality of the defense of the team he was playing.

peam
09-17-2010, 07:41 PM
NFL started last week, dude. You don't have to watch baseball anymore.

Latrinsorm
09-17-2010, 10:50 PM
Originally Posted by Sean:
> I think another interesting metric to look at would be for example when he went 6 for 6, what was the quality of the defense of the team he was playing.

The Phillies, who during those two games fielded Bruntlett at SS, Werth in CF, and Burrell in LF. So... I would say pretty bad.

Originally Posted by Bobmuhthol:
> I probably am not making sense but the idea is that your results only allow you to ask "what is the most likely streak?" for the first at bat.

Everything up to this statement made sense, then nothing made sense. If I count every streak of n also as a streak of n-1, n-2, ... , 1, then I get the following numbers:
Type  At least  Count
1     6         1
1     5         2
1     4         5
1     3         11
1     2         39
1     1         131
0     1         132
0     2         87
0     3         60
0     4         46
0     5         34
0     6         25
0     7         12
0     8         8
0     9         7
0     10        5
0     11        4
0     12        4
0     13        4
0     14        3
0     15        2
0     16        1
0     17        1
0     18        1
0     19        1


...which I like a lot more, but I'm not totally (very) [slightly] satisfied with the methodology.
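
The recount above is a running total taken from the longest streak downward. A minimal sketch, using the hit-streak counts from the first post as input (the function name `at_least_counts` is made up for illustration):

```python
def at_least_counts(exact):
    """Convert {length: streaks of exactly that length}
    into {length: streaks of at least that length}."""
    running = 0
    cumulative = {}
    # Walk from the longest streak down, accumulating as we go.
    for length in sorted(exact, reverse=True):
        running += exact[length]
        cumulative[length] = running
    return cumulative

# Exact hit-streak counts from the first post.
exact_hits = {1: 92, 2: 28, 3: 6, 4: 3, 5: 1, 6: 1}
cumulative_hits = at_least_counts(exact_hits)
print(cumulative_hits)
```

Only lengths present in the input get an entry, so lengths that never occurred need an explicit 0 in `exact` if they should show up in the result.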

Rinualdo
09-18-2010, 12:49 AM
Originally Posted by peam:
> NFL started last week, dude. You don't have to watch baseball anymore.

wait, what? I thought Lantrin was a chick?

Latrinsorm
09-21-2010, 02:35 PM
I figured it out: X for X vs. 0 for X are only comparable streaks when X is the same. Comparing 0 for 1 and 1 for 1 and 0 for 2 makes no sense. When I break His year down into strings of equal length, the distributions are almost as expected, but it turns out that the following streaks happened an unlikely number of times assuming Gaussian noise:

0 for 12 (4)
0 for 13 (4)
8 for 13 (6)
1 for 15 (4)
1 for 16 (5)
1 for 17 (4)

Where the number in parentheses indicates the number of standard deviations away the observed number of streaks was from the predicted. For reference, 6 is a fucking shitload of standard deviations. The next step in the analysis will be to examine the teams (and more importantly pitchers) he played against during these streaks, but for now I am advancing the tentative hypothesis that David Wright displayed characteristics of streaky hitting in 2008. Now that I've established a good methodology (and spreadsheet), I also want to look at other hitters, starting with Albert Pujols to see if the widely held subjective measurement of him being an unflinching hitting cyborg is accurate.
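
A sketch of one way to do this kind of equal-length comparison. The post does not say how the strings were cut, so this version assumes consecutive non-overlapping chunks and an independent binomial model for the hits in each chunk. The function names are made up for illustration, and the season generated here is synthetic random data, not David Wright's actual at bats.

```python
import math
import random

def binom_pmf(n, k, p):
    """Probability of exactly k hits in n independent at bats."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def chunk_z_scores(at_bats, size, p):
    """Cut at_bats into consecutive non-overlapping chunks of `size`
    (an assumption) and return {hits: z}, where z is how many standard
    deviations the observed number of chunks with that many hits sits
    from the binomial prediction."""
    chunks = [at_bats[i:i + size]
              for i in range(0, len(at_bats) - size + 1, size)]
    n = len(chunks)
    observed = [0] * (size + 1)
    for chunk in chunks:
        observed[sum(chunk)] += 1
    scores = {}
    for k in range(size + 1):
        prob = binom_pmf(size, k, p)
        sd = math.sqrt(n * prob * (1 - prob))
        if sd > 0:
            scores[k] = (observed[k] - n * prob) / sd
    return scores

p = 189 / 626
rng = random.Random(2008)  # fixed seed so the synthetic season is repeatable
season = [1 if rng.random() < p else 0 for _ in range(626)]  # synthetic data

z = chunk_z_scores(season, 13, p)
for hits, score in z.items():
    print(hits, round(score, 2))
```

When the predicted count for an outcome is well below 1, as it is for something like 8 for 13 over a few dozen chunks, a single occurrence produces a large z and the Gaussian reading of it is unreliable. If overlapping windows are used instead, neighboring windows share at bats, so the standard deviation computed here would be too small.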

Latrinsorm
09-21-2010, 02:38 PM
Oh, and in addition I figured out why I got the counterintuitive results in my first post, given what I figured out before I figured out why I got the counterintuitive results in my first post. The distribution I used doesn't take into account microstates at all, so any ordering information is lost. If I take any 3 at bats, I am most likely to get 1 hit, but it could just as easily be hit, no hit, no hit; no hit, hit, no hit; or no hit, no hit, hit. In the analysis I did, only the middle ordering would give me "streaks" of 0 for 1 (two of them, one on each side of the hit), while each of the three orderings would give me one "streak" of 1 for 1. 3 > 2, hence I will see more 1 for 1 streaks than 0 for 1. I will assume without loss of generality that this can be extended to any X at bats.
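
The 3 > 2 count can be checked by brute force. This sketch lists the three orderings of one hit in three at bats and counts the streaks of length one of each kind:

```python
from itertools import groupby

# The three ways to get exactly one hit (1) in three at bats.
orderings = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

one_for_one = 0   # hit streaks of length 1
zero_for_one = 0  # no-hit streaks of length 1

for seq in orderings:
    for outcome, run in groupby(seq):
        if len(list(run)) == 1:
            if outcome == 1:
                one_for_one += 1
            else:
                zero_for_one += 1

print(one_for_one, zero_for_one)  # 3 2
```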

Valthissa
09-21-2010, 04:13 PM
Latrin,

A little poking around tells me that you might want to look at a copy of "A statistical analysis of hitting streaks in baseball" written by S. C. Albright and published in the Journal of the American Statistical Association in 1993. His data set was every at bat in both leagues for 1987 to 1990. Unfortunately the online version of the journal is only searchable back to 2000.

His conclusion from the data is that there is little evidence for hitters exhibiting a hot hand. If you used his methodology you could then compare David Wright's 2008 season and determine if he was prone to hitting streaks.
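
One standard check for streakiness that needs only the numbers already posted in this thread is a runs test (Wald-Wolfowitz). It is not claimed here to be Albright's methodology. The sketch assumes the 131 hit streaks and 132 no-hit streaks from the first post cover the whole season, which gives 263 runs in total, and the function name `runs_test_z` is made up for illustration.

```python
import math

def runs_test_z(n_hits, n_outs, n_runs):
    """Wald-Wolfowitz runs test: z-score of the observed number of runs
    against what a random ordering of the same hits and outs would give."""
    n = n_hits + n_outs
    mean = 2 * n_hits * n_outs / n + 1
    variance = (mean - 1) * (mean - 2) / (n - 1)
    return (n_runs - mean) / math.sqrt(variance)

# 189 hits and 437 non-hits in 626 at bats; 131 + 132 streaks in total.
z = runs_test_z(189, 437, 131 + 132)
print(round(z, 2))
```

A streaky hitter bunches hits together and so produces fewer runs than expected, which shows up as a clearly negative z. With these inputs z comes out around -0.18, less than a fifth of a standard deviation below expectation, so on this one measure the 2008 season looks like an ordinary random ordering.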

C/Valth

Paradii
09-21-2010, 11:16 PM
Originally Posted by Rinualdo:
> wait, what? I thought Lantrin was a chick?

Nah, he just wears eye liner.

Latrinsorm
09-22-2010, 12:33 AM
Originally Posted by Valthissa:
> Latrin,
>
> A little poking around tells me that you might want to look at a copy of "A statistical analysis of hitting streaks in baseball" written by S. C. Albright and published in the Journal of the American Statistical Association in 1993. His data set was every at bat in both leagues for 1987 to 1990. Unfortunately the online version of the journal is only searchable back to 2000.
>
> His conclusion from the data is that there is little evidence for hitters exhibiting a hot hand. If you used his methodology you could then compare David Wright's 2008 season and determine if he was prone to hitting streaks.
>
> C/Valth

I can only find the last page of it online, but I'll certainly look it up next time I'm in a library. :)

Mighty Nikkisaurus
09-22-2010, 12:39 AM
Originally Posted by Paradii:
> Nah, he just wears eye liner.

And nail polish. GOOD nail polish. OPI ftw.

Also, I vote that 'fucking shitload' becomes a technical mathematical definition, if it's not already.

Delias
09-22-2010, 04:13 AM
I don't know if you guys know, but I invented math. I don't do it anymore because now that it's all popular I'd feel like a sell out. Just thought you should know.