Some Additional Thoughts on WAR

Jim Caple posted some criticism of the Wins Above Replacement statistic over at ESPN this morning. Overall, I thought his criticisms were warranted and well-stated, but I wanted to add some of my own thoughts to clear up any misunderstandings and to add a different perspective to some of the points raised.

Caple writes:

My issue is this: I don’t like the increasing over-use of (and over-reliance on) WAR as THE definitive evaluation of a player’s worth.

Certainly this is an issue more apparent with WAR than most other statistics simply by its nature. Smashing offense plus defense plus positional adjustments plus replacement level into one statistic would give one the idea that this one number and one number alone can paint an entire picture of a player’s season or career. However, such reliance on one statistic was common long before WAR was ever born. Saves have often been the only barometer by which a closer’s value is judged; the same goes for RBIs with hitters and won-lost record with starting pitchers. When possible, you should always use multiple methods of evaluation and be mindful of all their limitations, no matter which stats you’re using.

Moving on to Caple’s direct criticisms…

Almost no one knows how to calculate WAR. [...] If we can’t figure a stat out on our own, then how do we verify whether it is accurate?

WAR is, as far as I know, the most complex statistic used in sports right now. It is arduous to learn everything that goes into creating it, let alone to reproduce the results. However, it is not impossible to learn, and its complexity is neither a feature nor a detriment. The utility of a statistic should be judged on its ability to describe what happened or its ability to help predict future results. It doesn’t matter if your statistic has one simple arithmetic function or is a concoction of mathematical trickery as long as it does its job well.

WAR does its job well (but not perfectly!). Sort any WAR list from greatest to least, or least to greatest, and it will almost always line up with what you observed with your eyes or judged with more traditional statistics. 2012’s WAR top-ten included Mike Trout, Buster Posey, Ryan Braun, Robinson Cano, David Wright, Chase Headley, Andrew McCutchen, Miguel Cabrera, Jason Heyward, and Adrian Beltre. All are players you’d most likely have included in a top-ten list. Maybe you disagree with the specific ordering, and that’s fine. As Tango says, “Everyone has their own WAR.” If you think FanGraphs weights defense improperly, you can make your own adjustments. FanGraphs’ and Baseball Reference’s versions of WAR are not sacrosanct.

Disagreements with WAR arise when detractors cite a questionable result (“Miguel Cabrera third in WAR? Nonsense!”) but aren’t transparent and consistent about the methods behind their own evaluations. WAR is useful because you know what’s going into it and the components are applied uniformly and without bias.

Caple continues…

Actually, we know it isn’t always accurate because depending on your source — FanGraphs or Baseball-reference.com — you can get wildly different WAR scores.

I would argue this is a feature of WAR. In fact, we should have even more versions of the stat because we will further learn what does work and what doesn’t work. FanGraphs uses FIP in its calculation of pitcher WAR, while Baseball Reference does not, which is why a pitcher like Ricky Nolasco — someone who has under-performed his defense-independent statistics — has 18.1 career fWAR but only 7.8 rWAR. Should we be adjusting pitchers’ results due to park factor, quality of defense, and batted ball luck? The debate isn’t settled, but we are aware of the disparity because we have divergent methods of calculating a similar statistic.

Think of fWAR and rWAR as scientific experiments. When you get different results, you continue to tweak and control for different variables. You don’t pack up your stuff and never experiment again.

If a player’s batting average varied from .245 to .307 from ranking to ranking, would you trust either statistic?

If a statistic is descriptive, as batting average is, there is nothing to distrust. What you must be wary of is instead the interpretation of the data. Ryan Howard hit .313 in 2006 but finished at .219 last season. If those were our only data points, and we were trying to predict his future performance, we would heavily regress him back to the league average. However, since we have much more information at our fingertips, we know that Howard’s batting average was affected by his Achilles injury, his age, the heavy use of the infield shift against him, and his increasing futility against left-handed pitching. That goes to Caple’s above point, that we should not use any single statistic in isolation.

When you see wildly divergent WARs for players, that should be a telltale sign to investigate further, rather than throwing your hands up in defeat.
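As a rough illustration of treating a divergence as a prompt to dig deeper, here is a minimal sketch using the Ricky Nolasco figures cited above (18.1 fWAR, 7.8 rWAR). The function names and the 25 percent cutoff are my own assumptions for the example, not anything FanGraphs or Baseball Reference publishes.

```python
def war_gap(fwar, rwar):
    """Return (absolute gap in wins, gap as a fraction of the two values' mean)."""
    gap = abs(fwar - rwar)
    mean = (fwar + rwar) / 2
    if mean == 0:
        # Avoid dividing by zero: identical zeros have no gap, anything else is unbounded.
        relative = 0.0 if gap == 0 else float("inf")
    else:
        relative = gap / abs(mean)
    return gap, relative


def needs_closer_look(fwar, rwar, max_relative_gap=0.25):
    """Flag a player whose two WAR figures disagree by more than the cutoff.

    max_relative_gap is a hypothetical threshold chosen for this sketch.
    """
    _, relative = war_gap(fwar, rwar)
    return relative > max_relative_gap


# Career figures for Ricky Nolasco as quoted in the post.
gap, relative = war_gap(18.1, 7.8)
print(f"gap: {gap:.1f} wins, relative gap: {relative:.0%}")
print("investigate further:", needs_closer_look(18.1, 7.8))
```

A flag here says nothing about which version is right; it only marks a player whose components (FIP versus runs allowed, in Nolasco’s case) are worth pulling apart.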

The fielding metrics used by baseball-reference (Baseball Info Solutions Defensive Runs Saved) seem to lift or lower their WAR scores much more than the fielding metrics used by FanGraphs (Ultimate Zone Rating). And that presents another issue with WAR. [...]

In other words, most baseball stats are based entirely on indisputable math calculations. WAR has an element of theory and assumption to it.

This is a very good point and it’s my biggest issue with WAR. Our methods of evaluating defense are still in their infancy and must be met with a lot of skepticism. Whenever I use WAR for hitters, I make a note of how heavily it is affected by UZR. For instance, Darwin Barney posted 2.5 fWAR in 2012. Broken into specific components, he posted negative 15 batting runs, 4.3 base running runs, and 13 fielding runs. Because of his playing time and position, he was also credited with 19.6 replacement level runs and 2.2 positional runs. I don’t know too much about Barney (I’m not a Cubs fan and don’t get to see too many Cubs games during the season), so I don’t have any prior knowledge about his fielding capabilities. UZR graded him positively but much less favorably the year before in a similar amount of playing time, crediting him with 6.1 fielding runs. Upon seeing that, I would be skeptical about the accuracy of UZR’s evaluation of Barney’s defense.

On the other hand, when I look at Chase Utley’s page, I see that he has been an above average (elite, actually) defender throughout his career. He was credited with 5.3 fielding runs last season, a decline from 2007-10, when he ranged from 10-20 fielding runs. Given my prior knowledge, having watched Utley very closely throughout his career and knowing that he has had knee problems, I am comfortable accepting UZR’s evaluation of his defense, while still being skeptical of UZR overall. It would be great to have the specificity of offensive statistics with our defensive statistics, but we just aren’t there yet. However, that is no reason to toss WAR out to the curb, nor is there another method out there proven to be more accurate at evaluating defense.
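To make the Barney breakdown concrete, the sketch below adds up the run components quoted above and shows how much his WAR moves when only the fielding number changes. The runs-per-win figure is backed out from the quoted totals (so it inherits their rounding) rather than taken from FanGraphs, and the function names are hypothetical.

```python
# Darwin Barney's 2012 run components, as quoted in the post.
BARNEY_2012 = {
    "batting": -15.0,
    "base_running": 4.3,
    "fielding": 13.0,      # the UZR-based component under scrutiny
    "replacement": 19.6,
    "positional": 2.2,
}
REPORTED_FWAR = 2.5


def total_runs(components):
    """Sum every run component into runs above replacement."""
    return sum(components.values())


def war_with_fielding(components, fielding_runs, runs_per_win):
    """Recompute WAR after swapping in a different fielding value."""
    adjusted = dict(components, fielding=fielding_runs)
    return total_runs(adjusted) / runs_per_win


runs = total_runs(BARNEY_2012)
# Assumption: runs per win is inferred from the quoted numbers, not looked up.
runs_per_win = runs / REPORTED_FWAR

scenarios = [
    ("UZR as reported (13.0)", 13.0),
    ("lower UZR figure (6.1)", 6.1),
    ("league-average fielder (0.0)", 0.0),
]
for label, fielding in scenarios:
    war = war_with_fielding(BARNEY_2012, fielding, runs_per_win)
    print(f"{label:>30}: {war:.1f} WAR")
```

The components sum to 24.1 runs, which implies a bit under 10 runs per win. Swapping in the 6.1 fielding figure drops Barney to roughly 1.8 WAR, and treating him as an average fielder drops him to roughly 1.2, so about half of his 2012 value rests on a single season of UZR.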

The same approach should apply to WAR. We need to look at many stats to assess players, and one of them should be WAR. But it shouldn’t be the only stat we look at or cite.

If you take away one thing from Caple’s article it should be this, but it doesn’t just apply to WAR; it applies to all statistics. Don’t just look at RBI; look at RBI opportunities, where the runners are on the bases, and the base running ability of those runners. Don’t just look at xFIP; look at BABIP, park factors, and the quality of the defense, among many other factors.


29 comments

  1. TomG

    February 01, 2013 03:46 PM

    Not many people know that the original lyrics to the Edwin Starr hit were:

    UZR! Huh!
    What can it tell us about a player’s worth?
    Absolutely debatable!

    Till someone realized it didn’t quite scan. So they made it about WAR. And the reason not many know this little factoid is I just made it up right now.

    Please tell me I’m the first to go the Edwin Starr route with the Wins Above Replacement stat in a comment on this blog. Even if I’m not.

    (Actual true little-known fact: Edwin Starr “War” wasn’t the first version …)

  2. LTG

    February 01, 2013 04:00 PM

    I heart TomG’s post.

  3. LTG

    February 01, 2013 04:00 PM

    Also, I was just about to request a response to Caple’s article. I should have known better than to think I had to request one.

  4. Quoting Joe Simpson

    February 01, 2013 04:06 PM

    “But what if the replacement player is Mike Trout?!”

  5. TomG

    February 01, 2013 04:18 PM

    So I finally click over to Caple’s article and see he begins it with a reference to Starr’s hit song. That gave me a been-there, done-that sad. But I’m still holding out hope for being the first on this blog …?

    I heart LTG’s post. The first one. The second one? Meh.

  6. Phillie697

    February 01, 2013 04:55 PM

    I’m pretty sure nuclear physics is hard to understand and difficult to “calculate.” Anybody wants to question that nuclear reactors and nuclear bombs work? That’s the worst criticism of something complex ever. “Well I can’t calculate it.” “Well, buddy, I would suggest either you learn, or you get smarter. Either or would work for me.”

  7. JM

    February 01, 2013 05:11 PM

    Bill…RBI% would be so much better than just RBIs…make this happen please…

  8. JM

    February 01, 2013 05:16 PM

    Here’s my problem with WAR. It makes JMJ look like a schlub. He’s obviously the biggest power hitter in the Phillies OF who isn’t a douchebag, and should be recognized for that greatness. He is also wicked smart, and doesn’t get enough credit for changing his batting stance to improve his swing…basically he’s Jayson Werth 6 yrs ago…

  9. LTG

    February 01, 2013 05:52 PM

    JM, sarcasm?

  10. luke ustaszewski

    February 01, 2013 06:31 PM

    JM I hope you are kidding about JMJ.

  11. luke ustaszewski

    February 01, 2013 06:34 PM

    Jm I really hope you are kidding about JMJ.

  12. JM

    February 01, 2013 06:37 PM

    JMJ is in fact very smart!

  13. JRFarmer

    February 01, 2013 06:50 PM

    >> Bill…RBI% would be so much better than
    >> just RBIs…make this happen please…

    It’s already been shown there is no such thing as a “clutch” hitter, and RBI% looks like a stat that tries really hard to identify one.

    I’m imagining those 7-8-9 batters in the lineup just cleaning-up in RBI%, solely because their opportunities are fewer and the stat would suffer from sample size issues.

    It’s also entirely possible that RBI% tracks very closely to batting average, and is therefore redundant.

    I get the feeling this would not be such a helpful stat. Just my hunches, but I’d love to hear more.

  14. JM

    February 01, 2013 06:50 PM

    And he changed his batting stance!

  15. JM

    February 01, 2013 06:57 PM

    @JR. Unless the 4-6 hitters suck, sample size shouldn’t be a real problem. Sac flies could over inflate a hitters worth though, since all he did was hit a little fly ball, and didn’t actually do anything more special than not strike out. I personally do believe in clutch. This is why some players are great in the regular season, but not in the playoffs. There is also real value in identifying which hitter actually makes contact with runners on base. Pressure matters, that is why we practice in non pressure situations, so our muscle memory takes over when our brains lock up.

  16. JM

    February 01, 2013 06:58 PM

    Bill, that mermaid was so hot!

  17. JRFarmer

    February 01, 2013 07:00 PM

    >> This is why some players are great in the
    >> regular season, but not in the playoffs.

    Examples? I’m skeptical, simply because playoff games surely are the smallest of small sample sizes.

  18. JRFarmer

    February 01, 2013 07:01 PM

    e.g. Delmon Young was the MVP of the ALCS. Even if you prove he has that “clutch” ability, is it worth having if you have to suffer through 162 games of that slop?

  19. JM

    February 01, 2013 07:06 PM

    A-Rod in any playoff year not 2009. Every Yankee last year not named Ibanez. Bill Buckner 1986. Andy Pettitte. Any Braves team in the 90’s…Seattle 1998…

  20. JM

    February 01, 2013 07:07 PM

    It doesn’t prove that douche young isn’t…but I hope I don’t have to find that out with the phillies

  21. LTG

    February 01, 2013 07:52 PM

    I’m starting to believe JM just is John Mayberry, Sr. And John Mayberry, Sr. is either hilarious or hilariously deluded.

  22. Phillie697

    February 01, 2013 09:51 PM

    DON’T MAKE FUN OF MY MERMAIDS!!!

  23. hk

    February 04, 2013 07:19 AM

    JM and JRF,

    Clutch exists, but only in a descriptive way. Having said that, JM’s examples of un-clutch players and teams are laughable.

    1. ARod produced WRC+ of 125 in the 1997 post-season, 163 in the 2000 post-season and 166 in the 2004 post-season. He has also had some poor post-seasons, which makes him the poster child for the fact that “clutchness” only describes what happened and is useless to predict what to expect.

    2. The 1998 Mariners like the 2011 Phillies dominated the regular season only to lose to the small sample size randomness of the playoffs. Are the 2011 Phillies another example of un-clutch performers? If so, you’d be saying that 2011 Cliff Lee was un-clutch because of one unlucky performance 2 short years after he was dominant in the post-season.

    3. Bill Buckner’s error, the ultimate in small sample sizes.

  24. JRFarmer

    February 04, 2013 10:34 AM

    hk,

    >> Clutch exists, but only in a descriptive way.

    I think you’re agreeing with me. I believe in clutch hits, and I believe in clutch performances.

    I don’t believe in clutch players. It has not been demonstrated that a player can have a sustainable “skill” of being clutch.

    Delmon Young’s ALCS MVP award demonstrates this well. He had a good series of games. You might even say he was “clutch” during that series, but he has not demonstrated the ability to be clutch in a consistent manner. Rather, it looks very much, as you say, like an artifact of the “small sample size randomness of the playoffs”.

    I think of Cody Ross in 2010 as well.

  25. Phillie697

    February 04, 2013 11:23 AM

    Why do we need to give a word to describe something where someone just did what they always did, except at the right place and the right time? Is someone winning the lottery “clutch”? Because that’s really along the lines of what we’re talking about here.

    That’s the problem with the word “clutch” as opposed to the word “luck.” The latter, people realize, has nothing to do with the person, whereas the former people subconsciously attribute to some mystical and imaginary power of said person. I’m just fine sticking to something generic like, “that was awesome!” or “that was an amazing performance by such and such!”, because while it has nothing to do with being “clutch,” it DOES have something to do with the player’s skills as a baseball player; Matt Stairs can hit that HR, but Freddy Galvis will probably just fly out.

  26. KH

    February 07, 2013 02:19 PM

    Caple was dead right about defensive ratings though. Is there any form of WAR that just looks at offensive numbers at this point? Defense is really hard to quantify while offensive production is really simple. There is so much that goes into defense in baseball; a lot can be determined just by the defensive philosophy of the coaching staff in terms of fielder positioning.
