I've spent a lot of time over the last couple of weeks in my labs working on the "hot zone" graphs to hopefully make them easier to read and more accurate. I'm not sure if I've succeeded, which is why I need your help. I'd like some feedback from you on how effective these graphs are. Do they make sense? What could help make them easier to read? Are they worth the effort? What is wrong with my methodology (which will be explained after the jump)?
Below you'll see an example of the new format with a graph for Joey Votto. After the jump are links to several other players as well as some notes on how I generate these graphs. Again, any feedback you can provide would be greatly appreciated.
Each graph represents a different aspect of how a hitter performs at the plate:
- The top left graph is the hitter's rate of swinging per pitch in a zone.
- The top right graph is the rate of contact per swing in a zone - foul balls count as contact.
- The bottom left graph is based on the slugging rate per at bat in a zone. In this case, strikeouts where the third strike was in that zone count as an AB.
- The bottom right graph is a differentiation between zones that have a higher fly ball rate versus zones that have a higher ground ball rate for that hitter.
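A minimal sketch of how the first three rates could be tallied per zone. The record layout is an assumption made for illustration: each pitch is a dict with hypothetical keys `zone`, `swung`, `contact`, `ended_ab`, and `total_bases`, and `ended_ab` is assumed to be true only for pitches that end an official at bat (including a third strike).

```python
from collections import defaultdict


def zone_rates(pitches):
    """Return per-zone swing, contact, and slugging rates.

    `pitches` is a list of dicts using hypothetical keys:
    zone, swung, contact, ended_ab, total_bases.
    """
    totals = defaultdict(
        lambda: {"pitches": 0, "swings": 0, "contact": 0, "ab": 0, "bases": 0}
    )
    for p in pitches:
        t = totals[p["zone"]]
        t["pitches"] += 1
        if p["swung"]:
            t["swings"] += 1
            if p["contact"]:  # foul balls count as contact
                t["contact"] += 1
        if p["ended_ab"]:  # includes strikeouts whose third strike was in this zone
            t["ab"] += 1
            t["bases"] += p["total_bases"]

    rates = {}
    for zone, t in totals.items():
        rates[zone] = {
            # swings per pitch seen in the zone
            "swing": t["swings"] / t["pitches"],
            # contact per swing; None when the hitter never swung in the zone
            "contact": t["contact"] / t["swings"] if t["swings"] else None,
            # total bases per at bat ending in the zone; None when no AB ended there
            "slugging": t["bases"] / t["ab"] if t["ab"] else None,
        }
    return rates
```

Returning `None` instead of 0 for an empty denominator keeps "no data" distinct from "a true rate of zero", which matters once minimum sample sizes come into play.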
For each graph besides the batted ball graph, the colors of the graph are relative to how the rest of the league performs in that zone. The comparison is only made to hitters of the same handedness who faced the same handed pitchers and the same pitch type (in cases where we are only looking at a single pitch type). This means that if we are only looking at how a right-handed batter fared on sliders from RHP, we are only comparing to how other RHB did against sliders from RHP.
Batted ball data is only relative to the hitter being examined because I thought it was more informative scouting-wise to know if a hitter tended to hit more flyballs than groundballs from a zone, not whether he did it more than the rest of the league.
Some technical details on the steps I use to generate the graphs. Feel free to skip this section. If you do read it, please critique my methods if something doesn't make sense. On a lot of this stuff I had to feel my way through it:
- Each pitch that has location data is assigned to 1 of 99 zones based on that location data. All of the zones touch the strike zone except for the outermost zones, which represent all pitches that are more than a radius of the ball outside of the zone.
- Rates for the league are calculated thusly:
- All data is aggregated at the individual player level per zone, pitcher handedness, and pitch type. In order to smooth the data, the 8 zones that surround each given zone are averaged into that zone as well.
- All hitters that meet the minimum requirement for a rate in a zone are used to create percentile buckets for that zone. The minimum requirements are:
swing rate: 20 pitches
contact rate: 10 swings
power rate: 10 AB
These values are admittedly arbitrary, but when I'm looking at all pitch types, they give me a pretty large sample to use to create buckets. If somebody with more statistical background can teach me how to find the proper significant value so that I know I am getting a good cross-section of data, please speak up. I'm willing to do whatever it takes to make these more accurate.
- When I look at specific pitch types, the amount of data in each zone becomes screwy - in some cases to the point where I don't have enough sample to generate percentiles. In those cases, I calculate the average rate for that zone across all instances meeting the batter, pitcher, and pitch type criteria. I then set the average as the 50th percentile rate (I know this isn't right, but I'm compensating), and then generate buckets on either side of that at intervals that are +/- 10% of the average. So, if I have an average rate that is 62%, my intervals for my "percentile" buckets would be:
37% 43% 50% 56% 62% 68% 74% 81% 87%
- Once I have all of my league rate buckets, I compare the individual hitter who we are graphing to the league rates and assign each zone to a bucket. That bucket determines the hue and the shade of the color that we see on the graph.
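The bucketing steps above can be sketched in Python roughly as follows. This is an illustration under several assumptions, not the actual code behind the graphs: the nine boundaries are read as the 10th through 90th percentiles, `MIN_HITTERS` is a made-up cutoff for "enough sample to generate percentiles", the "inclusive" quantile method is an arbitrary choice, and all function names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles

MIN_HITTERS = 30  # hypothetical cutoff; the post does not give a number


def percentile_edges(rates):
    """Nine decile cut points (10th..90th) from the qualifying hitters' rates."""
    return quantiles(rates, n=10, method="inclusive")


def fallback_edges(avg_rate):
    """Treat the average as the 50th 'percentile' and step by 10% of the average."""
    step = 0.10 * avg_rate
    return [avg_rate + k * step for k in range(-4, 5)]


def league_edges(rates, avg_rate):
    """Use real percentiles when enough hitters qualify, else the fallback."""
    if len(rates) >= MIN_HITTERS:
        return percentile_edges(rates)
    return fallback_edges(avg_rate)


def bucket_for(rate, edges):
    """Bucket index 0..9 for a hitter's rate; a higher index means a higher rate."""
    return bisect_right(edges, rate)


# Worked example from the text: an average rate of 62%
edges = fallback_edges(0.62)
print([round(e * 100) for e in edges])  # [37, 43, 50, 56, 62, 68, 74, 81, 87]
print(bucket_for(0.70, edges))          # 6
```

The bucket index would then be mapped to a hue and shade. Note that the fallback boundaries are not clamped, so for a high enough average the top boundaries can exceed 100%.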
The biggest problem that I am having is with the fact that the data is not normally distributed. So, if I change my minimum criteria, I can dramatically change the percentile buckets. For instance, if I lower the contact rate criterion to 5 pitches, I end up with zones that have a 40th percentile of 0% contact. So, a lot of my time was spent trying to compensate for that in a realistic manner. I'm not sure if I've done that, but hopefully through your feedback we can find out.