Our Sabermetric Overview Series has lost some momentum (which doesn't really exist in baseball, but never mind) so I thought I'd kick-start it with a big diary about the C-word.
Baseball statheads love saying the word correlation. Sabermetricians toss around "correlation" like Joe Morgan does "consistency" (in terms of frequency, not accuracy). But what does it mean? Let's get to the bottom of it and lay the groundwork (read: ruin the surprises) for further installments in this series.
A Note: This is about as math-y as the new baseball thinking gets. I mean it's scary 'rithmetic--we're talking weird stuff like "linear regression" and "covariance". If you want to do cool original baseball research these days you have to know how to do this stuff. Fortunately you don't need to know much math to understand the concepts. Take me. I have no idea how any of this stuff works and I'm writing a Goddamn primer on it. That's why I'm carefully citing all the charts and figures here. I didn't do any of this work (with one exception). I'm just cribbing it from people who kindly posted their efforts on the internet. My prediction: this diary will have enough weird concepts to freak out mathophobes while having plenty of incorrect information to piss off the people who actually know about this stuff. And away we go!
Correlation (or the correlation coefficient) describes a relationship between two variables. In our case the variables are going to be baseball statistics--strikeouts and runs scored, or BA w/RISP '06 and BA w/RISP '07. It could be any two sets of data ("Things", for our purposes). Statisticians have many different ways to measure the relationship between the two Things. The most common is called the Pearson product-moment method, named for some guy called Pearson (or maybe some other guy called Galton). This is what the guts of it look like:
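(The picture that went here didn't survive, so as a stand-in here is the usual textbook form of Pearson's r. The x's and y's are the paired values of your two Things, n is how many pairs you have, and the bars mean "average". The notation is the standard one, not necessarily what the original image used.)

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```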
Don't worry if you don't understand this! Nobody does. Scientists have been studying it for years and so far they have only concluded that math is hard.
The important thing is that you get something designated `r' or the correlation coefficient that describes the relationship. Let's talk real world and use visuals.
Here's a chart of boys' heights vs their age. As you can see there's some positive correlation. As the boys get older they generally get taller. The dots kind of go from the bottom left to the upper right. Sometimes statisticians will draw a "best fit line" through the data to give you the idea. Now here's a chart of boys' heights vs the month in which they were born.
It's just a big blob! There's no correlation, and why would there be? What month a boy is born has no bearing on how tall he'll be. Remember, when you see a big blob there's no correlation (these charts courtesy of Science Buddies.org: Free Science Fair Project Ideas, Answers and Tools for Serious Students [1]).
The correlation coefficient, r, is expressed as a number between -1 and 1. If there's no correlation it's 0. If it's between 0 and -1 (say -0.25) there's an inverse correlation: as one Thing goes up, the other tends to go down. If it's between 0 and 1 (say 0.58) there's a positive correlation. The closer to 1 or -1, the stronger the correlation.
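If you'd rather poke at those ranges than take my word for it, here's a minimal Python sketch. The function name `pearson_r` and all the toy numbers are made up for illustration; none of them come from the studies cited in this diary.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Top of the fraction: how the two Things move together.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Bottom of the fraction: how much each Thing moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list never changes")
    return co_movement / math.sqrt(spread_x * spread_y)


# Toy data, invented for this example.
thing = [1, 2, 3, 4, 5]
moves_with_it = [2, 4, 6, 8, 10]   # rises in lockstep
moves_against = [10, 8, 6, 4, 2]   # falls in lockstep
big_blob = [3, 1, 4, 1, 3]         # no upward or downward trend

r_up = pearson_r(thing, moves_with_it)
r_down = pearson_r(thing, moves_against)
r_blob = pearson_r(thing, big_blob)

print(round(r_up, 2), round(r_down, 2), round(r_blob, 2))
```

The lockstep lists land at the two ends of the scale and the blob lands at (or within a rounding error of) zero. Real baseball data will always fall somewhere in between.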
Now let's talk baseball (finally). There are many nifty things we can do with this. One is seeing which stats correlate best with scoring or preventing runs so we know how best to judge players. Here are the correlation coefficients between run scoring and various offensive measures (courtesy of Dan Fox at the HardballTimes [2]):
RC is Runs Created, a fancy saberstat. As you can see, all of these measures have a positive correlation with run scoring. The more advanced measures paint a fuller picture. Some data--say GIDP--would have a negative correlation with run scoring and the numbers would be below zero. SPOILER ALERT: strikeouts are right about at zero when it comes to scoring runs.
Another useful way to use this tool is to check for correlations between a statistic in one year and that same statistic the next year. If a player has a skill, like Adam Dunn's home run power, it will strongly correlate year-to-year. If a player's stats are the result of normal statistical fluctuation (luck, to use a loaded term) they won't correlate year-to-year--think Bronson Arroyo's home run power. Here's how some basic batting stats correlated from 2005 to 2006 (courtesy of David Appelman at FanGraphs [3]). The numbers are r-squared, which is just r multiplied by itself:
Batting average correlates only a third as well as the others. It really fluctuates. Remember in Bull Durham when Kevin Costner made that speech about how if he could just get one extra fluky hit a week he'd hit .300 instead of .250 and be in the show? He's talking about batting average's low correlation year-to-year even if he doesn't know it. If Crash's GM were savvy he'd look beyond the average (inflated on balls in play) and see that his slugging and on-base skills were the same. Maybe in September, Crash. SPOILER ALERT: any sort of stat designed to measure clutchness--hitting with RISP, Late Inning Pressure Situations, October accomplishments--does not correlate from one year to the next. This is why stats geeks say clutch is not a skill.
Turning to pitching stats, here's a graph showing the correlation of one year's ERA to the next year's (courtesy of JC Bradbury at the HardballTimes [4]):
As you can see from the best fit line the blob is slightly moving in the right direction. The r-squared is .13, about the same as batting average--better than nothing but we can do a lot better. Here's the same thing with strikeout rate:
Oh yeah! Now there's some correlation. Point-six-one baby! So we just learned that if we're trying to predict a pitcher's future performance we'd be much better off looking at his K-rate than his ERA. SPOILER ALERT: stats that are independent of the pitcher's defense--K-rate, BB-rate and HR-rate--are the basis of a nerdy way to evaluate pitchers called DIPS (Defense Independent Pitching Stats).
If you'd like to do this wizardry yourself you easily can in Excel (here comes the original research of this piece). Just click on the paste function button, under Function Category select Statistical, under Function Name select Pearson (Galton?), then in the popup highlight one column or row of data for Array1 and another set of data for Array2. The result is the correlation between your two Things. Square it for the r-squared. Now I can tell you that in the Joe Morgan Red Reporter Fantasy League, the correlation between a team's standing and its number of roster moves is a whopping .73!
These Things are closely correlated...
Update [2007-8-7 9:57:58 by Red Menace]: I forgot to mention the important maxim that correlation does not equal causation (technically, as Gray points out, one should say correlation does not imply causation, because sometimes correlation does equal causation). Or to be properly anal: empirically observed covariation is a necessary but not sufficient condition for causality. Example: there's a strong correlation between ice cream sales and drownings. If one forgot that C does not imply C, one would say ice cream causes drownings. In fact both are affected by a third factor: hot weather. For a baseball example, there's a slight positive correlation between strikeouts and run scoring, but you shouldn't tell all your hitters to try to strike out more. Both strikeouts and run scoring seem to be the result of going deep into counts by waiting for a pitch to drive.