Sabermetric Overview Series Part V: Correlation
Correlation
Our Sabermetric Overview Series has lost some momentum (which doesn't really exist in baseball, but never mind) so I thought I'd kick-start it with a big diary about the C-word.
Baseball statheads love saying the word correlation. Sabermetricians toss around "correlation" like Joe Morgan does "consistency" (in terms of frequency, not accuracy). But what does it mean? Let's get to the bottom of it and lay the groundwork (read: ruin the surprises) for further installments in this series.
A Note: This is about as math-y as the new baseball thinking gets. I mean it's scary 'rithmetic--we're talking weird stuff like "linear regression" and "covariance". If you want to do cool original baseball research these days you have to know how to do this stuff. Fortunately you don't need to know much math to understand the concepts. Take me. I have no idea how any of this stuff works and I'm writing a Goddamn primer on it. That's why I'm carefully citing all the charts and figures here. I didn't do any of this work (with one exception). I'm just cribbing it from people who kindly posted their efforts on the internet. My prediction: this diary will have enough weird concepts to freak out mathophobes while having plenty of incorrect information to piss off the people who actually know about this stuff. And away we go!
Correlation (or the correlation coefficient) describes a relationship between two variables. In our case the variables are going to be baseball statistics--strikeouts and runs scored, or BA w/RISP '06 and BA w/RISP '07. It could be any two sets of data ("Things", for our purposes). Statisticians have many different ways to measure the relationship between the two Things. The most common is called the Pearson product-moment method, named for some guy called Pearson (or maybe some other guy called Galton). This is what the guts of what we're talking about look like:
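For reference, the standard textbook form of Pearson's r for n paired observations (x_i, y_i), where x-bar and y-bar are the two averages, is usually written like this. The notation is the conventional one and is assumed here, not taken from any of the sources cited in this diary:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The top measures how much the two Things move together (the covariance part); the bottom scales that by how much each Thing moves on its own, which is what keeps r between -1 and 1.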

Don't worry if you don't understand this! Nobody does. Scientists have been studying it for years and so far they have only concluded that math is hard.
The important thing is that you get something designated 'r', or the correlation coefficient, that describes the relationship. Let's talk real world and use visuals.

Here's a chart of boys' heights vs their age. As you can see there's some positive correlation. As the boys get older they generally get taller. The dots kind of go from the bottom left to the upper right. Sometimes statisticians will draw a "best fit line" through the data to give you the idea. Now here's a chart of boys' heights vs the month in which they were born.

It's just a big blob! There's no correlation, and why would there be? What month a boy is born has no bearing on how tall he'll be. Remember, when you see a big blob there's no correlation (these charts courtesy of Science Buddies.org: Free Science Fair Project Ideas, Answers and Tools for Serious Students1).
The correlation coefficient, r, is expressed as a number between -1 and 1. If there's no correlation it's 0. If the correlation is between 0 and -1 (say -0.25) there's an inverse correlation. If it's between 0 and 1 (say 0.58) there's a positive correlation. The closer to 1 or -1, the stronger the correlation.
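To make the range of r concrete, here is a minimal Python sketch that computes r by hand. The function name `pearson_r` and the toy numbers are made up for illustration; they are not taken from any of the sources cited in this diary.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (how the two Things move together).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (how much each Thing moves on its own).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one Thing rises in lockstep with age, one falls in lockstep.
ages = [8, 9, 10, 11, 12]
rising = [50, 52, 54, 56, 58]
falling = [58, 56, 54, 52, 50]

r_up = pearson_r(ages, rising)      # perfect positive correlation
r_down = pearson_r(ages, falling)   # perfect inverse correlation
print(round(r_up, 3), round(r_down, 3))
```

Real data never lines up this neatly, so real values of r land somewhere between these two extremes.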
Now let's talk baseball (finally). There are many nifty things we can do with this. One is seeing which stats correlate best with scoring or preventing runs so we know how best to judge players. Here are the correlation coefficients between run scoring and various offensive measures (courtesy of Dan Fox at the HardballTimes2):
BB 0.590
HR 0.719
AVG 0.843
OBP 0.910
SLG 0.913
OPS 0.955
RC 0.964
RC is Runs Created, a fancy saberstat. As you can see, all of these measures have a positive correlation with run scoring. The more advanced measures paint a fuller picture. Some data--say GIDP--would have a negative correlation with run scoring and the numbers would be below zero. SPOILER ALERT: strikeouts are right about at zero when it comes to scoring runs.
Another useful way to use this tool is to check for correlations between a statistic in one year and that same statistic the next year. If a player has a skill, like Adam Dunn's home run power, it will strongly correlate year-to-year. If a player's stats are the result of normal statistical fluctuation (luck, to use a loaded term) they won't correlate year-to-year--think Bronson Arroyo's home run power. Here's how some basic batting stats correlated (r-squared) from 2005 to 2006 (courtesy of David Appelman at FanGraphs3):
AVG .12
OBP .36
OPS .36
SLG .38
Batting average correlates only a third as well as the others. It really fluctuates. Remember in Bull Durham when Kevin Costner made that speech about how if he could just get one extra flukey hit a month he'd hit .300 instead of .250 and be in the show? He's talking about batting average's low correlation year-to-year even if he doesn't know it. If Crash's GM were savvy he'd look beyond the average (inflated on balls in play) and see that his slugging and on base skills were the same. Maybe in September, Crash. SPOILER ALERT: any sort of stat designed to measure clutchness--hitting with RISP, Late Inning Pressure Situations, October accomplishments--does not correlate from one year to the next. This is why stats geeks say clutch is not a skill.
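One wrinkle: the figures in that table are r-squared, not r. Here is a quick sketch of converting back, assuming the year-to-year correlations are positive (the square root alone can't tell you the sign). The variable names are just for illustration.

```python
import math

# r-squared values quoted in the table above (2005 to 2006).
r_squared = {"AVG": 0.12, "OBP": 0.36, "OPS": 0.36, "SLG": 0.38}

# Taking the square root recovers the size of r (sign assumed positive).
r = {stat: math.sqrt(value) for stat, value in r_squared.items()}

for stat in r_squared:
    print(f"{stat}: r-squared = {r_squared[stat]:.2f}, r = {r[stat]:.2f}")

# On the r-squared scale, AVG is a third of OBP.
ratio = r_squared["AVG"] / r_squared["OBP"]
```

On the r scale the gap looks smaller (about 0.35 for AVG versus 0.60 for OBP), which is worth remembering when comparing numbers from different sources.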
Turning to pitching stats, here's a graph showing the correlation of one year's ERA to the next year's (courtesy of JC Bradbury at the HardballTimes4):

As you can see from the best fit line the blob is slightly moving in the right direction. The r-squared is .13, about the same as batting average--better than nothing but we can do a lot better. Here's the same thing with strikeout rate:

Oh yeah! Now there's some correlation. Point-six-one baby! So we just learned that if we're trying to predict a pitcher's future performance we'd be much better off looking at his K-rate than ERA. SPOILER ALERT: stats that are independent of the pitcher's defense--K-rate, BB-rate and HR-rate--are the basis of a nerdy way to evaluate pitchers called DIPS (Defense Independent Pitching Statistics).
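Why would one stat repeat from year to year and another not? A toy simulation can sketch the idea. Everything here is invented: each simulated player gets a fixed "talent" number, and each season's stat is talent plus random luck. The `noise_sd` settings (0.5 for a skill-heavy stat, 2.0 for a luck-heavy one) are arbitrary assumptions, not estimates for K-rate or ERA.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(2007)      # fixed seed so the run is repeatable
N_PLAYERS = 2000       # made-up player pool

def year_to_year_r(noise_sd):
    """Correlate two simulated seasons that share the same underlying talent."""
    talent = [random.gauss(0, 1) for _ in range(N_PLAYERS)]
    year1 = [t + random.gauss(0, noise_sd) for t in talent]
    year2 = [t + random.gauss(0, noise_sd) for t in talent]
    return pearson_r(year1, year2)

r_skill = year_to_year_r(noise_sd=0.5)   # stat that is mostly skill
r_luck = year_to_year_r(noise_sd=2.0)    # stat that is mostly luck
print(f"skill-heavy stat: r = {r_skill:.2f}")
print(f"luck-heavy stat:  r = {r_luck:.2f}")
```

The talent is equally real in both cases; the only difference is how much luck is piled on top, and that alone is enough to make the second stat a poor predictor of itself.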
If you'd like to do this wizardry yourself you easily can in Excel (here comes the original research of this piece). Just click on the paste function button, under Function Category select Statistical, under Function Name select Pearson (Galton?), then in the popup highlight one column or row of data for Array1 and another set of data for Array2. The result is the correlation between your two Things. Square it for the r-squared. Now I can tell you that in the Joe Morgan Red Reporter Fantasy League, the correlation between a team's standing and its number of roster moves is a whopping .73!
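For those without Excel, roughly the same calculation can be sketched in Python. The standings points and roster-move counts below are invented placeholders, not the actual Joe Morgan Red Reporter Fantasy League data, so the output will not match the .73 above. The names `pearson_r`, `standings_points` and `roster_moves` are likewise just illustrative.

```python
import math

def pearson_r(array1, array2):
    """Same idea as Excel's PEARSON(array1, array2): correlation of two columns."""
    n = len(array1)
    mean1, mean2 = sum(array1) / n, sum(array2) / n
    s12 = sum((a - mean1) * (b - mean2) for a, b in zip(array1, array2))
    s11 = sum((a - mean1) ** 2 for a in array1)
    s22 = sum((b - mean2) ** 2 for b in array2)
    return s12 / math.sqrt(s11 * s22)

# Hypothetical league: one entry per team (made-up numbers).
standings_points = [92, 88, 81, 77, 70, 66, 58, 51]
roster_moves = [41, 35, 38, 22, 25, 14, 17, 9]

r = pearson_r(standings_points, roster_moves)
r_squared = r ** 2   # "Square it for the r-squared"
print(f"r = {r:.2f}, r-squared = {r_squared:.2f}")
```

On Python 3.10 or later, `statistics.correlation` in the standard library computes the same coefficient, so the hand-rolled function is only there to keep the sketch self-contained.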

These Things are closely correlated...
Update [2007-8-7 9:57:58 by Red Menace]: I forgot to mention the important maxim that correlation does not equal causation (technically, as Gray points out, one should say correlation does not imply causation, because sometimes correlation does equal causation). Or to be properly anal: empirically observed covariation is a necessary but not sufficient condition for causality. Example: there's a strong correlation between ice cream sales and drownings. If one forgot that C does not imply C, one would say ice cream causes drownings. In fact both are affected by a third factor: hot weather. For a baseball example, there's a slight positive correlation between strikeouts and run scoring, but you shouldn't tell all your hitters to try to strike out more. Both strikeouts and run scoring seem to be the result of going deep into counts by waiting for a pitch to drive.
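The ice cream example can be sketched as a toy simulation. All the numbers are invented: daily temperature drives both ice cream sales and a "drownings index", and neither of those two affects the other. The last step uses the standard first-order partial correlation formula to ask what is left once temperature is held fixed.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(42)   # fixed seed so the run is repeatable
DAYS = 500        # made-up sample size

# Hot weather is the hidden third factor behind both series.
temps = [random.uniform(10, 35) for _ in range(DAYS)]
ice_cream = [20 * t + random.gauss(0, 50) for t in temps]   # sales rise with heat
drownings = [0.3 * t + random.gauss(0, 1) for t in temps]   # index rises with heat

r_xy = pearson_r(ice_cream, drownings)   # looks like ice cream "causes" drownings
r_xz = pearson_r(ice_cream, temps)
r_yz = pearson_r(drownings, temps)

# Partial correlation of ice cream and drownings, controlling for temperature.
r_partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"ice cream vs drownings: r = {r_xy:.2f}")
print(f"same, holding temperature fixed: r = {r_partial:.2f}")
```

The raw correlation comes out strongly positive, while the partial correlation sits near zero, which is the "third factor" story in numbers.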
Comments
Outstanding
by TheC on Aug 7, 2007 8:00 AM EDT 0 recs
mmmm, numbers
Good job, Menace. This article has an r-squared of .92 with my understanding of correlation.
by Slyde on Aug 7, 2007 8:34 AM EDT 0 recs
Nice writeup of this...
There are some assumptions made in deriving this correlation coefficient. Most importantly, the only relationship we're looking at is a linear one. If the relationship between two things isn't linear, this will only pick up some component of it, and not the "true" relationship. Like, say you were looking at age and batting average. I would assume that batting average tends to increase with age for some age range, then decrease. This is not a linear relationship. Reducing the relationship to a linear one will affect your conclusions about it.
Also, this measure assumes that both variables are normally distributed, i.e. if you were to plot the values of one, they would generate a shape like a bell curve. This can be something of a problem if the distribution is actually pretty heavily skewed towards one side, or truncated at some value. I'm sure there are some decent examples in baseball, but I'm not coming up with any at the moment.
And finally, always, always remember that correlation does not imply causation. Even if two factors are perfectly correlated, you cannot conclude that one of them causes the other. Red Menace suggested nothing of the sort, but I feel that it probably needs to be mentioned anyway.
by Gray on Aug 7, 2007 9:48 AM EDT 0 recs
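Gray's point about linearity can be sketched with a toy aging curve. The peak age of 27, the .300 peak and the shape of the curve are all made up for illustration; the only claim is that a perfectly predictable rise-and-fall pattern can still produce an r of about zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical hitter: average climbs to .300 at age 27, then declines symmetrically.
ages = list(range(20, 35))
averages = [0.300 - 0.001 * (age - 27) ** 2 for age in ages]

# Age determines the average exactly here, yet the linear measure sees nothing.
r_full = pearson_r(ages, averages)

# Looking only at the climb (ages 20 through 27) shows a strong positive r.
r_climb = pearson_r(ages[:8], averages[:8])

print(f"all ages: r = {r_full:.2f}")
print(f"ages 20-27 only: r = {r_climb:.2f}")
```

So an r near zero means "no straight-line relationship", which is not the same thing as "no relationship".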
I forgot to mention
/ C.
by Red Menace on Aug 7, 2007 9:52 AM EDT 0 recs
Forgetting C/C may be the reason
by Madville on Aug 7, 2007 12:19 PM EDT 0 recs
RE-TYPE
that Wayne picked Pete Mariachi over you for the Reds' manager's job, RM. Although it is apparent after this post that you (RM) should have gotten the job.
by Madville on Aug 7, 2007 12:20 PM EDT 0 recs
It's not that difficult.

Menace, I enjoyed this diary. I was a little bit hungry and this hit the spot. Filled me right up. Dessert? Oh, thank you but I am stuffed. No really, I couldn't eat another bite. I think I need to go lie down. Where's your restroom?
by Fat Vegas Alan on Aug 7, 2007 5:04 PM EDT 0 recs
my favorite part was when
wait, that wasn't you? oh, never mind...
by Charlie Scrabbles on Aug 7, 2007 11:10 AM EDT 0 recs
Hey Menace
Could you show the scatter-plots for OPS vs. runs scored and strikeouts vs. runs scored? Also, could you also show a scatter-plot for strikeouts by pitchers versus runs allowed? I just thought it would be good to show why statisticians think strikeouts for hitters are not that bad, while strikeouts for pitchers are good. That's the intuitive roadblock for a lot of people who bash Dunn for his strikeouts.
Also, as Gray mentioned, relationships are not always linear. Is there any non-linear relationship between strikeouts and runs scored?
It's also interesting that OPS actually does correlate better with runs scored than does either OBP or SLG, even though OBP correlates with SLG to a certain extent, and you might get some effects of multicollinearity.
by Paul Householder on Aug 7, 2007 11:28 AM EDT 0 recs
thanks
I changed correlation to the c-word at the last minute in a fit of ribaldry, but you're right. Clutch is the word (is the word... is the word...).
by Red Menace on Aug 7, 2007 12:26 PM EDT 0 recs
How does momentum not exist?
by Zach K on Aug 7, 2007 11:31 AM EDT 0 recs
When there is a lot of inertia.
An object at rest will remain at rest unless acted upon by an external and unbalanced force.
by Paul Householder on Aug 7, 2007 12:04 PM EDT 0 recs
When involved in a faith vs. reason treatise
I mean, after all: correlation does not imply causation. And we all know what that means in the discussion of whether CLUTCH is a verb or a noun.
Momentum exists, I saw it in the dictionary.
by Madville on Aug 7, 2007 12:07 PM EDT 0 recs
big mo
by Red Menace on Aug 7, 2007 12:30 PM EDT 0 recs
you are right to point out that
by Madville on Aug 7, 2007 1:20 PM EDT 0 recs
Ok I agree with that
by Zach K on Aug 7, 2007 2:34 PM EDT 0 recs
Yes, although...
Example: Most teams the Reds play have a lot of momentum in the 8th inning, it seems.
Another Example: Most teams the Reds play have seemed to gain momentum every time Gary McJeffsky pitches. They have less momentum when Coutlangus pitches.
by Paul Householder on Aug 7, 2007 2:37 PM EDT 0 recs
Yeah I agree with that haha
by Zach K on Aug 7, 2007 2:41 PM EDT 0 recs
Or
by Man Mountain on Aug 7, 2007 6:24 PM EDT 0 recs
This reminds me...
Very often the team that wins the first two games of a series is winning those games because they are simply the better team. So if the series were to be played out to a "best of 61" it would still be likely that the team that won the first two games would win the series.
Also (in every sport except baseball) the championship series usually begins at the home of the team with the better record (which is usually an indicator that they are the better team) so if a team goes up 2-0 before taking the series on the road they are, as the saying goes- "doing what they are supposed to be doing" but they are also doing what they might have been expected to do regardless of where the series began.
by Fat Vegas Alan on Aug 7, 2007 6:45 PM EDT 0 recs
this reminds me...
by Man Mountain on Aug 7, 2007 6:51 PM EDT 0 recs
This reminds me...
He ran the numbers and actually concluded that if you want to go with this argument, you want to be the team that scores...oh, I can't remember which. Maybe second or third. But you know, "Score second!" doesn't seem as good of a motivator...
by Gray on Aug 7, 2007 7:28 PM EDT 0 recs
This reminds me..

Crazy, dude.
by Fat Vegas Alan on Aug 7, 2007 8:28 PM EDT 0 recs
Crazy.
But the message for ESPN is that it's showing the wrong graphic. If a goal is scored and ESPN flashes a graphic saying, "Teams that have scored first are 22-3-3," I, the typical American sports fan who doesn't care about soccer, will think, "Well, there's about a four-in-five chance that this baby's over. I believe I will turn off the TV, kick my dog, curse some foreigners and play with my assault rifle."
But if that graphic said, "Teams that have scored second are 17-2-3," I'm going to want to stick around to see which team can come up with that all-important tally. Better for me, better for ESPN and way better for the dog.
by Gray on Aug 7, 2007 8:47 PM EDT 0 recs
This reminds me...
by Red Menace on Aug 7, 2007 10:08 PM EDT 0 recs
This reminds me...
by Fat Vegas Alan on Aug 7, 2007 10:53 PM EDT 0 recs
Oh yeah?
Ha!
by Fat Vegas Alan on Aug 7, 2007 11:44 PM EDT 0 recs
You could say that
by Zach K on Aug 7, 2007 8:25 PM EDT 0 recs
Momentum....
This is the limitation of the quantitative approach. Periodically, it assumes, or those who believe in the central theorem assume, that if you cannot measure something, it lacks validity. This is why the use of the scientific method outside of the hard or real sciences has limitations, even in the world of baseball (I see it in my discipline every day [much more extensively than in baseball]). Of course, for those who played, there is such a thing as momentum, but it is oftentimes a state of mind or psychological. It is the same phenomenon as those who say there is no such thing as clubhouse chemistry (strange to give such a subjective concept a scientific name). Well, Albert Belle and Dave Kingman never won a WS, but on the other hand look at the self-hating, clubhouse poison of the Yankees in the late '70s. It cuts both ways, and anytime you hear someone in baseball (like in economics, politics, or anything) say, "on the other hand," you know you have a problem.
by tonywf on Aug 8, 2007 9:22 AM EDT 0 recs
it doesn't really
Nicely done, in any case.
by andromache on Aug 7, 2007 1:35 PM EDT 0 recs
Well...
by Gray on Aug 7, 2007 1:41 PM EDT 0 recs
You really should have handled this one
by Red Menace on Aug 7, 2007 1:54 PM EDT 0 recs
No, I'm sure I would have botched it...
by Gray on Aug 7, 2007 2:32 PM EDT 0 recs
Evaluating pitchers
Second, my question is in regards to pitcher analysis. Are any of the batting metrics useful in evaluating/projecting pitchers? For example, batting average against, OPS against, RC against, etc. A pitcher making 30 starts can face about 800 batters in a season.
Also, seems that the correlation of pitching stats may vary between starters and relievers. It seems that ERA is a better indicator of a starting pitcher's ability, but I wouldn't use ERA to judge a reliever. The WHIP stat has been one I've looked at a lot for relievers. For example, look at Gary Majewski's "great" 2005 season. A WHIP of 1.48. That's a lot of baserunners for a reliever to allow. Throw in the 7 batters he hit with a pitch, and it bumps to 1.55 runners per inning.
by omnired on Aug 7, 2007 1:40 PM EDT 0 recs
pitchers
I don't have any data handy right now, but I imagine BA against isn't that consistent year to year for the same reason as for hitters.
by Red Menace on Aug 7, 2007 1:59 PM EDT 0 recs
Correlation question
Is there any correlation between the number of times a player reaches due to an error and speed?
Maybe there's a correlation between how hard a player hits the ball and how often he reaches on an error.
Just a thought.
by JJ on Aug 7, 2007 5:08 PM EDT 0 recs
not sure
A not very scientific way to get started would be to look at the reach-on-error leaders and see what type of players they are. I agree there would probably be more Reyes types than Howard types.
by Red Menace on Aug 7, 2007 6:25 PM EDT 0 recs
Speed and power
2007
-----------
Derek Jeter 15
Placido Polanco 9
Shane Victorino 9
Brandon Phillips 9
Julio Lugo 9
Jose Guillen 8
Travis Hafner 8
David DeJesus 8
Randy Winn 8
Johnny Damon 8
Jose Reyes 8
Ryan Theriot 8
2006
-----------
Kenji Johjima 13
Juan Pierre 12
Carlos Beltran 12
Ichiro Suzuki 11
Josh Willingham 11
Brandon Inge 11
Clint Barmes 11
Adrian Beltre 11
Jay Payton 10
Brian Anderson 10
Adam Everett 10
Orlando Hudson 10
2005
-----------
Jason Kendall 15
Freddy Sanchez 13
Jose Reyes 12
Derek Jeter 11
Jose Guillen 11
Jack Wilson 11
Grady Sizemore 11
Carlos Beltran 11
Chone Figgins 10
Garrett Atkins 10
Johnny Damon 10
Adrian Beltre 10
2004
-----------
Miguel Tejada 16
Ichiro Suzuki 15
Albert Pujols 14
Derek Jeter 13
Juan Pierre 13
Alex Rodriguez 12
Luis Castillo 12
Brian Roberts 12
Chipper Jones 11
Mark Loretta 11
Carl Crawford 11
Angel Berroa 11
2003
-----------
Ty Wigginton 15
Aaron Boone 13
Craig Biggio 13
Miguel Tejada 12
Cristian Guzman 12
Ken Harvey 11
Dave Roberts 11
Marquis Grissom 11
Joe Randa 11
Ichiro Suzuki 10
Casey Blake 10
Juan Pierre 10
2002
-----------
Sammy Sosa 13
Shea Hillenbrand 13
Jeff Kent 13
Craig Biggio 13
Rondell White 13
Junior Spivey 12
Vinny Castilla 12
Michael Young 11
Aaron Boone 11
Jacque Jones 11
Jeff Cirillo 11
Randy Winn 11
by Slyde on Aug 7, 2007 8:36 PM EDT 0 recs
Ahem... Park adjusted?
by Fat Vegas Alan on Aug 7, 2007 8:41 PM EDT 0 recs
I don't get it
by Slyde on Aug 7, 2007 8:44 PM EDT 0 recs
The Joy of Stats............
by tonywf on Aug 8, 2007 9:09 AM EDT 0 recs
Correlation vs. Causation
It's true that you can get better evidence about causation by doing an experiment. But just because correlations between two variables can be caused by a third causal factor doesn't mean that they always are. Therefore, one can test hypotheses using correlational evidence. If that weren't true, we wouldn't be able to do any hypothesis testing in baseball. Period. And clearly that's not the case, even though all we have to work with are correlational data.
The key issue is one of timing. If I create a correlation matrix of 20 different variables and I find that one or two of the variables are correlated, I can't then say that one causes the other. All I've done is make the observation that one variable is related to the other.
However, if I make an a priori hypothesis that variable A causes change in variable B, and then I do a study and find a correlation between the two variables, I can certainly say that I've supported my hypothesis that variable A causes change in variable B. Yes, an experiment would be better evidence, but experiments aren't always logistically possible (e.g. unless we work for a team, we can't actually manipulate how teams execute game strategy in some sort of controlled manner).
The key point is that experiments are not the only way that one can test a causal hypothesis. Yes, one must be tentative about findings from correlational studies because of the possibility of confounds from other variables. But then again, because an experiment can never be perfectly controlled, one has to be tentative about findings from any experimental study as well...
-j
by JinAZ on Aug 10, 2007 12:11 PM EDT 0 recs













