Jim Caple posted some criticism of the Wins Above Replacement statistic over at ESPN this morning. Overall, I thought his criticisms were warranted and well-stated, but I wanted to add some of my own thoughts to clear up any misunderstandings and to add a different perspective to some of the points raised.
My issue is this: I don’t like the increasing over-use of (and over-reliance on) WAR as THE definitive evaluation of a player’s worth.
Certainly this issue is more apparent with WAR than with most other statistics, simply by its nature. Smashing offense, defense, positional adjustments, and replacement level into one statistic can give the impression that this one number alone paints an entire picture of a player’s season or career. However, over-reliance on a single statistic long predates WAR. Saves have often been the only barometer by which a closer’s value is judged; RBIs for hitters; won-lost record for starting pitchers. When possible, you should always use multiple methods of evaluation and be mindful of all their limitations, no matter which stats you’re using.
Moving on to Caple’s direct criticisms…
Almost no one knows how to calculate WAR. […] If we can’t figure a stat out on our own, then how do we verify whether it is accurate?
WAR is, as far as I know, the most complex statistic used in sports right now. It is arduous to learn everything that goes into creating it, much less to reproduce the results. However, it is not impossible to learn, and its complexity is neither a feature nor a detriment. The utility of a statistic should be judged on how well it describes what happened or helps predict future results. It doesn’t matter whether your statistic is one simple arithmetic function or a concoction of mathematical trickery, as long as it does its job well.
WAR does its job well (but not perfectly!). Sort any WAR list from greatest to least, or least to greatest, and it will almost always line up with what you observed with your eyes or judged with more traditional statistics. 2012’s WAR top ten included Mike Trout, Buster Posey, Ryan Braun, Robinson Cano, David Wright, Chase Headley, Andrew McCutchen, Miguel Cabrera, Jason Heyward, and Adrian Beltre: most likely all players you’d have included in a top-ten list yourself. Maybe you disagree with the specific ordering, and that’s fine. As Tango says, “Everyone has their own WAR.” If you think FanGraphs weights defense improperly, you can make your own adjustments. FanGraphs’ and Baseball Reference’s versions of WAR are not sacrosanct.
Disagreements with WAR arise when detractors cite a questionable result (“Miguel Cabrera third in WAR? Nonsense!”) but aren’t transparent and consistent about the methods behind their own evaluations. WAR is useful because you know what’s going into it and the components are applied uniformly and without bias.
Actually, we know it isn’t always accurate because depending on your source — FanGraphs or Baseball-reference.com — you can get wildly different WAR scores.
I would argue this is a feature of WAR. In fact, we should have even more versions of the stat because we will further learn what does work and what doesn’t work. FanGraphs uses FIP in its calculation of pitcher WAR, while Baseball Reference does not, which is why a pitcher like Ricky Nolasco — someone who has under-performed his defense-independent statistics — has 18.1 career fWAR but only 7.8 rWAR. Should we be adjusting pitchers’ results due to park factor, quality of defense, and batted ball luck? The debate isn’t settled, but we are aware of the disparity because we have divergent methods of calculating a similar statistic.
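To make the fWAR/rWAR split concrete, here is a minimal sketch of FIP (Fielding Independent Pitching), the stat FanGraphs builds its pitcher WAR on, while Baseball Reference starts from runs actually allowed. The season constant (roughly 3.10 here) is an assumption — it is set each year so that league FIP matches league ERA — and the stat line below is hypothetical, not Nolasco’s actual numbers.

```python
# FIP credits a pitcher only for outcomes his defense can't touch:
# home runs, walks, hit batters, and strikeouts.
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Defense-independent pitching estimate on the ERA scale."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# A hypothetical pitcher who allowed a 4.50 ERA can carry a much
# lower FIP -- exactly the kind of gap that makes a Nolasco-type
# pitcher's fWAR and rWAR diverge.
print(round(fip(hr=18, bb=45, hbp=5, k=150, ip=200), 2))  # 3.52
```

If the two systems started from the same inputs, their WAR figures would rarely disagree; the disagreement is the information.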
Think of fWAR and rWAR as scientific experiments. When you get different results, you continue to tweak and control for different variables. You don’t pack up your stuff and never experiment again.
If a player’s batting average varied from .245 to .307 from ranking to ranking, would you trust either statistic?
If a statistic is descriptive, as batting average is, there is nothing to distrust. What you must be wary of is instead the interpretation of the data. Ryan Howard hit .313 in 2006 but finished at .219 last season. If those were our only data points, and we were trying to predict his future performance, we would heavily regress him back to the league average. However, since we have much more information at our fingertips, we know that Howard’s batting average was affected by his Achilles injury, his age, the heavy use of the infield shift against him, and his increasing futility against left-handed pitching. That goes to Caple’s above point, that we should not use any single statistic in isolation.
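Here is a minimal sketch of what “heavily regress him back to the league average” means in practice. The approach shown — blending the observed rate with the league rate, weighted by sample size — is a standard shrinkage idea; the .255 league average, the roughly 1,200 at-bats of prior weight, and the 57-for-260 stat line are all assumptions for illustration, not fitted values.

```python
# Shrink an observed batting average toward the league mean by
# treating prior_ab phantom league-average at-bats as a prior.
# Small samples get pulled hard toward the mean; large ones barely move.
def regressed_average(hits, at_bats, league_avg=0.255, prior_ab=1200):
    return (hits + league_avg * prior_ab) / (at_bats + prior_ab)

# A .219 average over an assumed 260 at-bats tells us far less than it
# seems: the regressed estimate lands much closer to league average.
print(round(regressed_average(hits=57, at_bats=260), 3))  # 0.249
```

With only the batting-average data points, that regressed figure would be our best guess — which is precisely why the extra context (injury, age, the shift, platoon splits) matters so much.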
When you see wildly divergent WARs for players, that should be a telltale sign to investigate further, rather than throwing your hands up in defeat.
The fielding metrics used by baseball-reference (Baseball Info Solutions Defensive Runs Saved) seem to lift or lower their WAR scores much more than the fielding metrics used by FanGraphs (Ultimate Zone Rating). And that presents another issue with WAR. […]
In other words, most baseball stats are based entirely on indisputable math calculations. WAR has an element of theory and assumption to it.
This is a very good point and it’s my biggest issue with WAR. Our methods of evaluating defense are still in their infancy and must be met with a lot of skepticism. Whenever I use WAR for hitters, I make a note of how heavily it is affected by UZR. For instance, Darwin Barney posted 2.5 fWAR in 2012. Broken into specific components, he posted negative 15 batting runs, 4.3 base running runs, and 13 fielding runs. Because of his playing time and position, he was also credited with 19.6 replacement level runs and 2.2 positional runs. I don’t know too much about Barney — I’m not a Cubs fan and don’t get to see too many Cubs games during the season, so I don’t have any prior knowledge about his fielding capabilities. UZR graded him positively but much less favorably the previous season, crediting him with 6.1 fielding runs in a similar amount of playing time. Upon seeing that, I would be skeptical about UZR’s read on Barney’s defense.
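The arithmetic behind those components is worth seeing in one place. This is a rough sketch of how FanGraphs assembles a position player’s WAR: sum the run components, then convert runs to wins. The component values are Barney’s 2012 figures cited above; the roughly 9.6 runs-per-win conversion is an assumption, since the actual factor varies with each season’s run environment.

```python
# Darwin Barney, 2012, per the FanGraphs breakdown above (runs):
batting = -15.0       # runs above average at the plate
baserunning = 4.3
fielding = 13.0       # UZR -- the piece to be most skeptical of
positional = 2.2      # adjustment for playing second base
replacement = 19.6    # playing-time credit over a replacement player

runs_above_replacement = (batting + baserunning + fielding
                          + positional + replacement)
RUNS_PER_WIN = 9.6    # assumed league-wide conversion for 2012

war = runs_above_replacement / RUNS_PER_WIN
print(round(war, 1))  # 2.5, matching Barney's listed fWAR
```

Note how the fielding component alone swings Barney from below-average to a 2.5-win player, which is exactly why the UZR input deserves scrutiny.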
On the other hand, when I look at Chase Utley’s page, I see that he has been an above-average (elite, actually) defender throughout his career. He was credited with 5.3 fielding runs last season, a decline from 2007-10, when he ranged from 10-20 fielding runs. Given my prior knowledge, having watched Utley very closely throughout his career and knowing that he has had knee problems, I am comfortable accepting UZR’s evaluation of his defense, while still being skeptical of UZR overall. It would be great to have defensive statistics as precise as our offensive ones, but we just aren’t there yet. However, that is no reason to toss WAR to the curb, nor is there another method out there proven to be more accurate at evaluating defense.
The same approach should apply to WAR. We need to look at many stats to assess players, and one of them should be WAR. But it shouldn’t be the only stat we look at or cite.
If you take away one thing from Caple’s article it should be this, but it doesn’t just apply to WAR — it applies to all statistics. Don’t just look at RBI; look at RBI opportunities, where runners are on base, and the base running ability of those runners. Don’t just look at xFIP; look at BABIP, park factors, and the quality of defense, among many other factors.