Big data and new processing power have really opened the door in data analysis – applying reliable methods to determine the causes of variation in mountains of every kind of data, and thereby getting closer to predicting what might happen in the future. Being good at this requires formal education, or maybe being really smart, or at the very least reading thoroughly on the topic. I ask you this: who has time for that?
What I enjoy is just coming up with some kind of algorithm, rotely applying it to a random pile of data I feel is relevant to my interests and looking through the results to amuse myself. Generally speaking, this is very pointless. But we live in a world of profound, mindblowing, just incredible stupidity, so I feel like it’s only right that I bring you with me on my journey.
So on Twitter, I was doing this thing for a while: picking one Reds win that happened historically for each day. That was pretty fun, but the CovidEnnui set in quickly for me, and after that I was just totally overcome with anxiety trying to figure out how to get through the backlog, and so I quit. But I did put together a resource to help me out – an Excel chart of every Reds game from when baseball-reference considers the current Reds franchise to have started (1882) to the current day (2019). That’s where all the data comes from, to give credit. I only grabbed the game data for each season from the equivalent of this page: https://www.baseball-reference.com/teams/CIN/1938-schedule-scores.shtml. But that still provides access to a few important values, including the date, the game number and the outcome.
Incidentally, this makes it pretty easy to count the number of wins in each and every 60-game stretch. And if you’re only interested in 60-game stretches that occur within a single season (this includes all the overlapping ones), you can just filter out every stretch that ends on a game number less than 60. Convenient, if your favorite Major League Baseball team is the Reds (is it?) and they’re about to set out on a 60-game season.
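For anyone who wants to try the counting outside of a spreadsheet, here is a minimal Python sketch of the same idea. It is not the actual Excel setup: the `(game_number, won)` row layout and the function name `wins_in_stretches` are made up for illustration, and it assumes you feed it one season at a time, already sorted by date. That way no stretch can spill across seasons, which is the same effect as filtering out game numbers below 60.

```python
def wins_in_stretches(results, window=60):
    """Count wins in every overlapping `window`-game stretch of one season.

    results: list of (game_number, won) tuples in date order (hypothetical layout).
    Returns a list of (ending_game_number, wins_in_that_stretch).
    """
    out = []
    running = 0
    for i, (game_no, won) in enumerate(results):
        running += int(won)
        if i >= window:
            # Drop the game that just slid out of the back of the window.
            running -= int(results[i - window][1])
        if i >= window - 1:
            # Only report once a full window of games has been played.
            out.append((game_no, running))
    return out


# Toy season, NOT real Reds data: 70 games, winning every odd-numbered game.
season = [(n, n % 2 == 1) for n in range(1, 71)]
stretches = wins_in_stretches(season)
print(len(stretches), stretches[0], stretches[-1])
```

A 70-game toy season has 11 full 60-game stretches (ending on games 60 through 70), which is a quick sanity check that the window is sliding the way you'd expect.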
I mean, some caveats. I ordered the games by date, so it’s very possible that some of these stretches split a mis-ordered doubleheader. Meh. Also, while the 2020 season is a 60-game stretch of baseball, I think it is safe to say it is an unprecedented 60-game stretch of baseball, and there are a lot of problems comparing it to 60 games in the middle of a normal, full season. Also, 60 games is a pretty significant portion of a season, so the overall season record and the 60-game record really aren’t exactly independent. And for that matter, the 1882 Reds only played like 80 games, and even 150-160ish games isn’t necessarily a perfect… But headaches like these are why I DO NOT DO useful analysis; we, like, just talked about that.
So here’s a pretty basic display. The orange line tracks the overall season win percentage for the Reds, and there is a gray rectangle for the number of wins in each 60-game stretch. I’ve set the scales the same here, for comparison’s sake. Not a ton to say here, other than 1981 can go to hell.
To hit some of the records: obviously, the best 60-game stretch in Cincinnati Reds history came in 1869, but we’re ignoring that for similarly obvious reasons. Since ’82, the most wins for the Reds in a 60-game stretch came in 1919, when the Reds won 47 out of 60 games a few times in basically the same timeframe – July and August. That, of course, wasn’t all that out of the ordinary for the outstanding 1919 Reds, who went 96-44 on the season, and went on to win the World Series that year totally fairly and with nothing else going on.
The sole worst record in 60 games came in 1914 – with only 13 wins in that period. The 1914 Reds were about as bad as the 1919 Reds were good. Between September 5 and September 23, they went on a 19-game losing streak, which is, as far as I know, the worst Reds losing streak by quite a margin.
But I also wanted to look at how the distribution of wins in a 60-game segment compared to the overall win-loss rate for the season. For that, I felt like it was easier to move over to R, so that’s what’s up with my wonky figures for the rest of this … whatever this is. I think joining tables in Excel is hard, and joining tables in R is easy. That meant I could compare an "expected wins" number (season win % * 60) against the actual wins in the 60-game stretch. I figured this particular configuration would make the units consistent across seasons. A win is a win is a win, right? This looks at how much a team overperformed or underperformed compared to their overall season record.
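The join-then-subtract step is simple enough to sketch. The version below is in Python rather than R, and it is only an illustration, not the code behind the figures: the names `season_pct`, `stretches` and `over_under` are invented, the "join" is just a dictionary lookup by year, and the two sample rows use numbers quoted in this post (the 96-44 record and 47-win stretch from 1919, and the 0.414 win percentage and 35-win stretch from 2018).

```python
WINDOW = 60

# Season win percentage keyed by year: the table being joined against.
season_pct = {
    1919: 96 / (96 + 44),  # 96-44 season record
    2018: 0.414,           # season win percentage as quoted in the post
}

# One row per 60-game stretch: (year, actual wins in that stretch).
stretches = [(1919, 47), (2018, 35)]


def over_under(stretches, season_pct, window=WINDOW):
    """Attach expected wins (season win % * window) and actual - expected."""
    rows = []
    for year, actual in stretches:
        expected = season_pct[year] * window
        rows.append((year, actual, expected, actual - expected))
    return rows


rows = over_under(stretches, season_pct)
for year, actual, expected, diff in rows:
    print(f"{year}: actual {actual}, expected {expected:.1f}, diff {diff:+.1f}")
```

Because both numbers are in wins per 60 games, the difference reads the same way for an 1880s team as for a 2018 one: positive means the stretch beat the season pace, negative means it lagged.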
Our biggest overachievers were the August/September 1927 team, whose 40 wins over 60 games well outpaced their season win percentage of 0.49 (an expected 29 wins in 60 games). But the 2018 Reds have a spot up there too. From about mid-May to mid-July, the 2018 Reds won 35 games. It’s just that that team finished the season so poorly that the overall win percentage was 0.414, an average of 25 wins over 60 games. The biggest underachievers were still the 1914 Reds at the end of the season. Even though their overall season win percentage only works out to winning 23 games out of 60, they still somehow did so much worse than that.
So… can we see any patterns? Like, I think this cursed 2020 Reds team has the makings of a really quality baseball team (2020 is cursed; the Reds, it remains to be seen). If we assume the "true talent level" of this team is "good," what does that say about the likelihood of any given final record in a 60-game season? Here’s a full histogram of the actual wins in 60 games minus the expected wins (season win percentage * 60).
Pretty standard normal distribution with the center at 0, like you’d expect. I calculated the skewness, and it’s very slightly negative. I also tried to create a graph similar to the Excel one above – tracking the range of 60-game records over the years, but this time using that "actual wins – expected wins" value so you’re not just looking at how good the team is each season.
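If "slightly negative skewness" is fuzzy, here is one common way to compute it, sketched in Python. This is the plain moment-based definition (third central moment divided by the second central moment to the power 1.5); the post doesn't say which R function or skewness variant was used, so treat the formula choice as an assumption. The sample values are made up, not the real actual-minus-expected numbers.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5.

    Negative means the left tail (big underperformances) is longer,
    positive means the right tail is longer, zero means symmetric.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up "actual minus expected" values with one bad underperformance.
toy_diffs = [-3, -1, 0, 0, 1, 1, 2]
print(skewness(toy_diffs))
```

A value just below zero would fit a histogram that is basically bell-shaped but with a few more extreme cold stretches than extreme hot ones.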
Meh. I even color coded by the overall season win% to see if anything stood out. It doesn’t? It is a pretty graph though. Yay!
OK, but what if I summarize all results by year, to get the distribution of 60-game performance compared to season performance for each individual year? Here’s the standard deviation of the actual-minus-expected value for each season, plotted against the overall season win percentage for each season, color coded by year this time.
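The per-season summary boils down to a group-by followed by a standard deviation. Here is a rough Python sketch of that step; `spread_by_season` and the sample rows are invented for illustration. It uses the sample standard deviation (dividing by n - 1), which is also what R's `sd()` computes.

```python
from collections import defaultdict
from statistics import stdev


def spread_by_season(rows):
    """rows: (year, actual_minus_expected) pairs, one per 60-game stretch.

    Returns {year: sample standard deviation of that year's values}.
    Years with a single stretch are skipped, since stdev needs 2+ points.
    """
    by_year = defaultdict(list)
    for year, diff in rows:
        by_year[year].append(diff)
    return {year: stdev(diffs) for year, diffs in by_year.items() if len(diffs) > 1}


# Made-up values, NOT the real Reds numbers.
toy_rows = [(1914, -10), (1914, -6), (1914, -2), (1919, 4), (1919, 6)]
spread = spread_by_season(toy_rows)
print(spread)
```

A bigger number for a season means its 60-game stretches wandered further from the season pace, which is the thing the plot is trying to line up against how good the team was overall.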
OK, one last try. What if I sorted each 60-game record into bins based on the season win percentage, and then compared the distribution of the 60-game performance between those bins in a boxplot?
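The binning step could be sketched like this in Python. The bin width is an assumption (the post doesn't say how the bins were cut, so this uses bins 0.1 wide), the helper names are hypothetical, and the summary is a slimmed-down version of what a boxplot draws: just the minimum, median and maximum per bin, without quartiles or outlier whiskers. The sample rows are toy values.

```python
from collections import defaultdict
from statistics import median


def bin_label(season_pct):
    """Label the 0.1-wide bin a season win percentage falls in, e.g. '0.4-0.5'."""
    lo = int(season_pct * 10)
    return f"{lo / 10:.1f}-{(lo + 1) / 10:.1f}"


def summarize_bins(rows):
    """rows: (season_win_pct, actual_minus_expected) pairs.

    Returns {bin_label: (min, median, max)} for each occupied bin.
    """
    bins = defaultdict(list)
    for pct, diff in rows:
        bins[bin_label(pct)].append(diff)
    return {label: (min(d), median(d), max(d)) for label, d in sorted(bins.items())}


# Toy rows, for illustration only.
toy = [(0.49, 10.6), (0.414, 10.16), (0.686, 5.86), (0.45, -3.0)]
summary = summarize_bins(toy)
for label, stats in summary.items():
    print(label, stats)
```

If bad, average and good seasons all produce boxes of about the same height centered near zero, then season quality isn't telling you much about how streaky a 60-game stretch will be.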
I guess we’re just going to have to watch the baseball games after all.
So yeah. This is what passes for fun in my household. If I was a writer, I probably would’ve organized this better, so it wasn’t like 2 or 3 kernels of arguably interesting information in the middle, followed by nothing but meaningless graphs, but I’m not a writer, either. I’m just a girl, sitting in front of a computer, hoping she can think of a joke to put here before she hits publish.