A Model-Recommended Win Total for Your Consideration

Fun fact: I have never bet on a season win total before. I admittedly always thought they were a bit pointless. Even if I thought a team’s win total over/under had value, why would I tie up my money for an entire season when I could just leverage my disagreement from game to game? It’s a legitimate critique that I wouldn’t fault any bettor for expressing. But given my relatively recent foray into predictive sports modeling, I’ve had the opportunity to evaluate team win totals in a more concrete way.

Seeing in hindsight the profit I could have accumulated with my NFL model by taking action on large model disagreements gives me some assurance that doing the same with the MLB model I’ve built for the upcoming 2019 season should have positive expected value. And with you guys waiting on my impending announcement of what my plans are for the MLB model, I thought I’d share some of that information and give you something to chew on.

As a Cubs fan, seeing that the Cubs were the most overvalued team in the team totals market hurt quite a bit, but it’s really not that hard to see how and why this team would be overvalued. In terms of offseason moves, the Cubs did absolutely nothing of substance whereas the rest of the NL Central did their best to ensure this year’s divisional race is a bloodbath. The Cardinals added Paul Goldschmidt and Andrew Miller; the Brewers added Yasmani Grandal and Mike Moustakas; and the Reds added Tanner Roark, Yasiel Puig, Sonny Gray, and Matt Kemp (while getting rid of Homer Bailey).

On top of that, the Cubs have a bunch of talent that is either already regressing or a prime candidate for regression (Jon Lester, Cole Hamels, etc.). Javier Baez in particular is due for regression in the biggest of ways this season. Last year he batted .290 with 34 HRs and led the National League with 111 RBIs, which was a massive improvement over his 2017 campaign in which he batted .273 with 23 HRs and 75 RBIs. Granted, Baez did have 27% more plate appearances in 2018, but the bump in his counting stats vastly outpaced the increase in his opportunities.

That production increase came despite maintaining a similar and abnormally high BABIP (.345 in 2017, .347 in 2018), and can be best explained by his increase in power (.207 ISO in 2017, .265 in 2018) which then led to the massive jump in his slugging percentage (.480 in 2017, .554 in 2018) and HRs. Despite those spikes, his BaseRuns only jumped from 3.8 to 3.9 across those two seasons and it’d be much more likely for Baez to have a 2019 that looks more like 2017 than 2018.
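As a quick sanity check on those figures, here is a small Python sketch that recomputes isolated power and the year-over-year growth in Baez’s counting stats from the numbers quoted above. It assumes the standard definition ISO = SLG - AVG, and the function and variable names are just illustrative.

```python
# Figures quoted in this write-up for Javier Baez.
# Assumption: ISO (isolated power) is slugging percentage minus batting average.
baez = {
    2017: {"avg": 0.273, "slg": 0.480, "hr": 23, "rbi": 75},
    2018: {"avg": 0.290, "slg": 0.554, "hr": 34, "rbi": 111},
}

def iso(season):
    """Isolated power from a season's rate stats."""
    return season["slg"] - season["avg"]

def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100

iso_2017 = iso(baez[2017])
iso_2018 = iso(baez[2018])
hr_growth = pct_change(baez[2017]["hr"], baez[2018]["hr"])
rbi_growth = pct_change(baez[2017]["rbi"], baez[2018]["rbi"])

print(f"ISO 2017: {iso_2017:.3f}, ISO 2018: {iso_2018:.3f}")
print(f"HR growth: {hr_growth:.1f}%, RBI growth: {rbi_growth:.1f}%")
```

Both counting stats grew by roughly 48%, well ahead of the 27% bump in plate appearances. Working from the rounded AVG and SLG gives a 2018 ISO of .264 rather than the quoted .265, which is the kind of one-point gap you would expect from rounding.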

And while the Cubs have a ton of talent trending downwards, the rest of the NL Central has budding talent. The Cardinals have Marcell Ozuna, Jack Flaherty, and Alex Reyes; the Brewers have Ben Gamel and Keston Hiura; and the Reds have Luis Castillo, Nick Senzel, and a Sonny Gray who finally has a non-destructive pitching coach. Considering 35% of the Cubs’ games will be against the Cardinals, Brewers, and Reds, they have a tough task to get wins as-is.

Another 20% of their games will be against the NL East, which is full of playoff contenders (ATL, PHI, NYM, WAS). A look at the supporting data doesn’t make their case any better, as the model projects this team to be a top-ten team only in starting pitching while ranking league-average in run production, tenth-worst in relief pitching, and near the bottom third in fielding. That doesn’t sound like a team that should be tied for the seventh-highest win total, and the model agrees. Take the Chicago Cubs to go under 88.5 wins.

Kyle Freeland’s Very Weird (and Profitable) Home/Away Splits

In last week’s write-up we took a look at the offensive side of baseball and the continued evolution of batting metrics, and for this week I was originally going to do a similarly structured write-up on the pitching side of sabermetrics. However, after giving it a second thought, I decided to switch it up and use this topic as an opportunity to demonstrate how modern-day pitching metrics can be used to evaluate a pitcher and his development, while also using it as an excuse to talk about my favorite pitcher in all of baseball, Kyle Freeland.

Kyle Freeland is an anomaly that I could probably write an entire book about, and he has only played two full seasons of major league ball. In 2017, Kyle Freeland put together an above average season for a rookie starting pitcher. In the following season, Kyle Freeland took a monstrous leap and became one of the best pitchers in the National League, finishing fourth in NL Cy Young voting. He markedly improved in basically every statistical category fathomable as you can see below.

NOTE: The “minus” metrics for pitching are similar to the “plus” metrics for batting which we covered in the previous write-up, in that they are park and league adjusted and are scaled in a way that 100 is league average. However, the inverse is true with the “minus” stats in that each point below 100 represents the percent better a pitcher is than the league average, and each point above 100 represents the percent worse a pitcher is than the league average.

But it isn’t Kyle Freeland’s incredible development as a pitcher that is what is most fascinating about him. What makes Kyle Freeland so fascinating is that despite having the misfortune of pitching in the extremely hitter-friendly Coors Field for half of his games, his home/away splits show that he is actually a much better pitcher at Coors.

The result of these astonishing splits is my favorite betting “trend” ever: In Kyle Freeland’s starts in his first two seasons:

  • The under went 41-16 for +23.4 units (37% ROI)
  • The under in home games went 25-5 for +19.5 units (59% ROI)
  • The F5 under in home games went 25-4-1 for +20.6 units (62% ROI)
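The unit and ROI figures above can be reproduced with a short Python sketch. It assumes every bet was placed at a flat -110 price (risking 1.1 units to win 1) and that pushes count toward the amount risked; neither assumption is stated in the records themselves, but together they match the listed numbers. The function name is hypothetical.

```python
def flat_bet_results(wins, losses, pushes=0, risk=1.1, to_win=1.0):
    """Profit in units and ROI for flat bets.

    Assumption: every bet risks `risk` units to win `to_win` units
    (a -110 price). Pushes return the stake, so they add nothing to
    profit but are still counted in the total amount risked.
    """
    profit = wins * to_win - losses * risk
    total_risked = (wins + losses + pushes) * risk
    return profit, profit / total_risked

records = {
    "All unders": (41, 16, 0),
    "Home unders": (25, 5, 0),
    "Home F5 unders": (25, 4, 1),
}

for label, (wins, losses, pushes) in records.items():
    profit, roi = flat_bet_results(wins, losses, pushes)
    print(f"{label}: {profit:+.1f} units ({roi:.0%} ROI)")
```

If the pushes were left out of the amount risked instead, the F5 ROI would come out a few points higher than the listed figure.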

Like most betting “trends”, the Kyle Freeland under “trend” is not some magical force that leads to guaranteed profits over time for the rest of eternity. Instead, it is merely a reflection of a previously uncorrected and/or inefficient pricing pattern or strategy in the betting market. To help illustrate this, let’s take a look at some of the underlying elements that may have led to this incredibly profitable trend with Kyle Freeland’s starts.

The first thing that makes Kyle Freeland’s success in his first two seasons, and his success at hitter-friendly Coors Field in particular, so astounding is the 25-year performance trend of pitchers drafted by the Rockies. As you can see in the below, not only are the Rockies dead last in WARP accrued by drafted pitchers with their drafting team, they are the only team with a negative figure in that regard. The forward-projecting sentiment that comes as a result of this historical performance can explain a part of the undervaluing and mispricing of Kyle Freeland’s starts, especially the starts from his 2017 rookie season.

Part of Freeland’s comfort at Coors can be attributed to the fact that the ace was born and raised in Denver, allowing him to be acclimated essentially since birth to the effects of the altitude. As a result, Kyle Freeland experiences a Freaky Friday-like home/road phenomenon in which the altitude and conditions of Coors are his “comfort zone” while pitching elsewhere is what pitching at Coors is like for the rest of the league. Nevertheless, whatever adjustments betting markets made following the 2017 season to better capture Freeland’s comfort at home were never going to be enough to capture his development as a pitcher heading into and during the 2018 season.

The biggest change for Kyle Freeland in 2018 was his pitch selection. A big reason for this shift is directly tied to the development of sabermetrics and its ability over the years to identify which pitches work better and worse at Coors due to the conditions. In summary for the unaware, curveballs and sinkers have a measurably sharp drop-off in performance whereas sliders and cutters are proposed as better alternatives. As you can see in the above, Freeland opted to cut his sinker usage by over half in 2018 and reallocated that volume to the rest of his more Coors-friendly arsenal. The change was certainly deliberate as Freeland himself noted, “Last year we discovered after the first half that guys were looking for sinkers down and away, because they knew I would be throwing them, and I started getting hurt throwing those pitches”. Freeland also worked on his command to punch his fastballs up and in, especially against right-handed batters, as you can see in the below with 2017 on the left and 2018 on the right and with both charts being from the catcher’s perspective.

The end goal of these shifts in Kyle Freeland’s game was to cause as much soft contact as possible. As Freeland said, “Getting in on their hands is going to induce a lot of weak contact, especially if they aren’t able to get that barrel around and then once you do that, it opens up options to where you can throw your changeup down and away, and it comes out of your hand looking like a fastball, and then the next thing they know it’s off the end of their bat for a weak ground ball or a weak fly ball”. This concerted effort to induce soft contact not only proved to be fruitful for Freeland (who finished 18th in groundball percentage), but it seemed to be an effort pushed by the Rockies pitching staff that helped Jon Gray (10th) and German Marquez (12th) achieve similar results in 2018.

But what should we expect from Kyle Freeland in 2019? The vast majority of projections I’ve seen (PECOTA, FanGraphs, etc.) seem to suggest that Kyle Freeland’s 2019 season will be more similar to his 2017 season than his 2018 season. Considering that generating weak contact does not seem to be a pitcher skill that typically translates year over year, I can see why those projections are positioning Kyle Freeland as a major regression candidate. And given the historical performance of some of those projection systems, Freeland is likely to show significant regression in 2019.

But I don’t really care, because Kyle Freeland fascinates me endlessly and I’ll root for him (and those unders) until we both fail.

The Evolution of Batting Metrics

Last week, I started off this year’s set of MLB write-ups with an introduction to cluster luck and the BaseRuns metric. This week, I thought it’d be best to expand on the offensive side of baseball and dive deeper into the evolution of batting metrics. The collective understanding of batting performance and efficiency has evolved over time and continues to see significant developments to this day. As a result, it can be difficult to determine which metrics are best for your own personal use when trying to model the offensive side of baseball. There is no all-encompassing “right” answer when it comes to batting metrics, but there is a great chance there is a “right” answer when it comes to finding the metrics that best fit your own beliefs as to what should matter and how much it should matter.

Batting Average (AVG)

Batting Average is obviously the most widely known batting metric there is, and is calculated by simply dividing total hits by total at bats. At best, batting average is a surface-level measurement showing how often a player gets a hit (duh). The primary shortcomings of batting average are that it fails to quantify the plate appearances that don’t register as at bats (walks, sacrifice hits, etc.) and it fails to give any weight to the varying types of hits (a single and a home run are equally just one hit).

On-Base Percentage (OBP) and Slugging Percentage (SLG)

On-base percentage is exactly what it says it is and aims to address the first aforementioned shortcoming of batting average. OBP takes the batter’s instances of getting on base (H + BB + HBP) and divides them by plate appearances (strictly, AB + BB + HBP + SF), which tells us the rate at which a batter gets on base. Slugging percentage aims to address the second shortcoming of batting average by weighing each type of hit by the number of bases a batter takes for each:
SLG = [ (1B) + (2 x 2B) + (3 x 3B) + (4 x HR) ] / AB
But although OBP and SLG each present a solution to one of the two major shortcomings of AVG, they also each fail to address the remaining one. Which brings us to…

On-Base Plus Slugging (OPS)

OPS brings us the first official step into “sabermetrics” territory, having first been popularized in 1984 in The Hidden Game of Baseball by John Thorn and Pete Palmer. As the name suggests, OPS is calculated by adding OBP and SLG together, and aims to represent a player’s ability to get on base and hit for power. For your own reference, an OPS of .900 or higher is typically considered to be a great mark to hit. The main problem with OPS lies more with the underlying problem with SLG, in that it linearly weighs the different types of hits according to the number of bases they equate to.
A variant of OPS called OPS+ has since been developed that accounts for park factors and normalizes the stat across each of the two leagues (NL and AL). OPS+ is also scaled in a way where 100 is league average and each point of deviation above/below 100 equates to the percentage that the player is better/worse than league average. For example, a player with a 120 OPS+ is 20% better than league average whereas a player with an 85 OPS+ is 15% worse than league average.
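To tie the rate stats covered so far together, here is a minimal Python sketch that computes AVG, OBP, SLG, and OPS from a made-up stat line. The player totals and league averages are hypothetical, and the `ops_plus` helper uses a commonly cited simplification, 100 x (OBP / lgOBP + SLG / lgSLG - 1), with the park adjustment left out entirely, so treat it as an approximation rather than any site’s exact formula.

```python
def rate_stats(singles, doubles, triples, hr, bb, hbp, ab, sf):
    """Return AVG, OBP, SLG, and OPS from basic counting stats."""
    hits = singles + doubles + triples + hr
    avg = hits / ab
    # OBP denominator: at bats plus walks, hit-by-pitches, and sacrifice flies.
    obp = (hits + bb + hbp) / (ab + bb + hbp + sf)
    # SLG weighs each hit by the number of bases it is worth.
    slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}

def ops_plus(obp, slg, lg_obp, lg_slg):
    """Simplified OPS+ (assumption: no park adjustment applied)."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)

# Hypothetical stat line, invented for illustration only.
line = rate_stats(singles=100, doubles=30, triples=5, hr=25,
                  bb=60, hbp=5, ab=550, sf=5)
for name, value in line.items():
    print(f"{name}: {value:.3f}")

# Hypothetical league averages, also invented for illustration.
print(f"OPS+: {ops_plus(line['OBP'], line['SLG'], 0.320, 0.410):.0f}")
```

Note that a player whose OBP and SLG both match the league averages comes out at exactly 100 under this simplification, which is the scaling described above.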

Weighted On-Base Average (wOBA)

wOBA is a much more recently developed sabermetric, originally introduced in The Book in 2007. The metric was actually developed and presented as an improvement on what OPS represented, as the authors felt OBP and SLG had significant overlap and that the on-base element of the statistic was being underrepresented. wOBA’s formula assigns “linear weights” that represent the average number of runs scored in a half-inning after a given event occurs. These run value weights are then scaled to fit wOBA on the same scale as OBP (0.000 to 1.000). The formula for wOBA has evolved over time, with the first formula below being the original iteration and the following one being the FanGraphs version of the formula for the 2018 season.

Despite the creators of wOBA hailing their newly created sabermetric as superior to OPS in nearly every way, some research since has shown otherwise. In 2013, a professor from San Antonio College compared the predictive performance of OPS and wOBA, and his results using the 2003-2012 seasons showed that OPS had a higher correlation to team run production rates than wOBA did. In 2018 Baseball Prospectus stepped in and conducted their own research, expanding the sample (1986-2016) and expanding the analysis to include a look at:

  • Descriptive performance: the correlation between the metric and same-year team runs/PA
  • Reliability performance: the correlation between the metric and itself in the following year
  • Predictive performance: the correlation between the metric and the following year’s runs/PA.
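The three checks above can be sketched in a few lines of Python. Everything in the data here is invented purely to show the mechanics (five hypothetical teams across two consecutive seasons), so the printed correlations say nothing about the real OPS versus wOBA comparison.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical team-level values, invented for illustration only.
ops_year1 = [0.700, 0.720, 0.745, 0.760, 0.790]
runs_per_pa_year1 = [0.105, 0.110, 0.116, 0.118, 0.127]
ops_year2 = [0.715, 0.710, 0.750, 0.740, 0.775]
runs_per_pa_year2 = [0.108, 0.107, 0.118, 0.114, 0.123]

# Descriptive: metric vs. same-year runs/PA.
descriptive = pearson(ops_year1, runs_per_pa_year1)
# Reliability: metric vs. itself in the following year.
reliability = pearson(ops_year1, ops_year2)
# Predictive: metric vs. the following year's runs/PA.
predictive = pearson(ops_year1, runs_per_pa_year2)

print(f"Descriptive: {descriptive:.3f}")
print(f"Reliability: {reliability:.3f}")
print(f"Predictive:  {predictive:.3f}")
```

Running the same three correlations for each metric over the same team-seasons is what makes a side-by-side comparison possible.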

The findings essentially confirmed that OPS was superior to wOBA:

Runs Created (RC), Weighted Runs Created (wRC), and Weighted Runs Created Plus (wRC+)

The original Runs Created metric was created by Bill James and serves as an estimate of how many runs a player contributed to his team. Weighted Runs Created was an evolution of Bill James’ original work that incorporated the aforementioned wOBA into the formulation. The inherent problem with both iterations was that the stat was ultimately still just a counting stat, much like HRs or RBIs. Weighted Runs Created Plus (wRC+) did to Weighted Runs Created what OPS+ did to OPS, in that it took an otherwise context-less stat and turned it into a rate (while also controlling for park and league factors). Just like OPS+, a player with a wRC+ of 118 has contributed 18% more runs to his team than the league average player.

Wins Above Replacement Player (WAR or WARP)

The Runs Created trio above isn’t the only set of metrics that tries to estimate a player’s contributions to his team’s offensive production. WAR is the number of wins a player has added to his team above the number of wins expected if that player were replaced with a replacement-level player. WAR as a whole incorporates batting, baserunning, and defense for position players, but the batting element of WAR can be singled out and is often represented as bWARP. The various baseball analytics sites (Baseball Prospectus, Baseball Reference, FanGraphs, etc.) have different formulations for WAR, so you may see some varying WAR figures.

Batting Average on Balls in Play (BABIP)

BABIP is another self-explanatory metric, as it essentially shows how many of a batter’s batted balls in play go for hits rather than outs. However, BABIP is much different than any of the stats discussed thus far. The primary purpose of BABIP is to serve as a potential warning that a batter is possibly performing above (high BABIP) or below expectation (low BABIP). In essence, a BABIP that deviates significantly from the “normal” .300 typically signals that the batter is due for regression towards the mean. That regression expectation can be applied within a season or from year to year. To show BABIP in action, here is a chart showing the BABIP leaders from 2017 (minimum 250 plate appearances) and their performance in 2018. Significant increases in performances are highlighted in green, significant decreases are highlighted in red, and relatively similar performances are left uncolored:
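For reference, BABIP is conventionally computed as (H - HR) / (AB - K - HR + SF), since home runs and strikeouts never put the ball in play. The Python sketch below applies that formula to a hypothetical stat line and flags it against the .300 baseline; the 30-point threshold is an arbitrary choice of mine for illustration, not an established cutoff.

```python
def babip(hits, hr, ab, strikeouts, sf):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    return (hits - hr) / (ab - strikeouts - hr + sf)

def regression_flag(value, baseline=0.300, threshold=0.030):
    """Label a BABIP relative to the baseline.

    Assumption: `threshold` is an arbitrary illustrative cutoff.
    """
    if value > baseline + threshold:
        return "high (possible negative regression)"
    if value < baseline - threshold:
        return "low (possible positive regression)"
    return "near normal"

# Hypothetical stat line, invented for illustration only.
example = babip(hits=160, hr=25, ab=550, strikeouts=130, sf=5)
print(f"BABIP: {example:.3f} -> {regression_flag(example)}")
```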

BaseRuns (BsR)

BABIP begins to lean into the world of expectation and the “should’ve could’ve would’ve” world of sabermetrics, and that is arguably the largest development to date in the space. We covered BaseRuns quite in depth last week but as a refresher, BaseRuns is a metric that was originally developed by David Smyth and aims to estimate how many runs a team should have scored over the course of a season. Since we played with BaseRuns a lot in the previous write-up, I won’t expand any further in this write-up.

Deserved Runs Created Plus (DRC+)

Deserved Runs Created Plus is the newest development in batting sabermetrics, having just been introduced by the Baseball Prospectus team this past December. I could try my best to explain the premise of DRC+, but Baseball Prospectus did exactly that in their aptly titled “Why DRC+?” article that accompanied the introduction of the metric:

“Why another batting metric? Because existing batting metrics (including ours) have two serious problems: (1) they purport to offer summaries of player contributions, when in fact they merely average play outcomes in which the players participated; and (2) they treat all outcomes, whether it be a walk or a single, as equally likely to be driven by the player’s skill, even though no one believes that is actually true. DRC+ addresses the first problem by rejecting the assumption that play outcomes automatically equal player contributions, and forces players to demonstrate a consistent ability to generate those outcomes over time to get full credit for them. DRC+ addresses the second problem by recognizing that certain outcomes (walks, strikeouts) are more attributable to player skill than others (singles, triples).”

Like the rest of the “plus” metrics, DRC+ is scaled in a way where 100 is league average and deviations above/below signal how much better/worse a player is in terms of percentage. DRC+ was met with some criticisms following its unveiling, and the Baseball Prospectus team has since updated the metric in response. The team also did extensive research to compare the updated DRC+’s descriptive, reliability, and predictive performance compared to OPS+, wRC+, and the original DRC+:

Bringing It All Together

For your reference, above is a chart showing how many runs each team scored in 2018 as well as how they performed in each of the batting metrics we covered with the exception of WAR (too many variations to choose from) and BABIP (not really purposeful for this chart). It’s easy to see how our starting point of batting average fails to be an accurate measure of batting production and/or efficiency, as team performance in that category has the lowest correlation to runs scored of any metric represented. More importantly, this chart illustrates that most batting metrics will give you a similar general idea of a team’s batting ability, showing that each metric typically has some value when it comes to evaluating batting performance.

And that is essentially the abridged version of the evolution of batting metrics. Hopefully this gives you a better understanding of what it is that you’re exactly looking at the next time you pull up a stats page on Baseball Prospectus or FanGraphs. As always, if you have any questions about the topics covered in this write-up you are more than encouraged to reach out to me via Twitter.

Until next time.

Cluster Luck and BaseRuns

It is officially time for us to set our eyes on Major League Baseball. From now until the end of the season, I will be providing a weekly baseball write-up (hopefully) every Wednesday. Between now and Opening Day, I thought it’d be best to cover some introductory (yet still very comprehensive and higher level) topics, metrics, and ideas that populate the analytical side of professional baseball. For today’s write-up in particular, I will be answering the question “Is it better to be lucky than good?” by taking a look at cluster luck and BaseRuns.

Cluster luck is a term coined by Trading Bases author Joe Peta that serves as the underlying explanation for why the number of games a team actually won would differ from the number of games they should have won. Cluster luck itself is not an actual metric with a formula, but there are plenty of ways to calculate expected wins and see cluster luck in action. The Pythagorean win theorem, a formula created by Bill James (the founding father of sabermetrics), estimates the percentage of wins a team should have had given the number of runs they have scored and the runs they have allowed. The original formula was:

(Runs Scored)^2  /  [(Runs Scored)^2 + (Runs Allowed)^2]

Since the original formula was published, sabermetricians have found that an exponent of 1.83 fits more accurately. Putting the formula to action, the 2018 World Series-winning Red Sox scored 876 runs and allowed 647 runs. Plugging those into the modern iteration of the Pythagorean expectation formula with 1.83 as the exponent, the Red Sox should have had a 0.635 win percentage, good for 102.9 wins. The Red Sox finished the regular season with 108 actual wins, a difference of 5.1 wins. Generally speaking, a four win difference between actual wins and Pythagorean expected wins is considered to be significant and generally non-repeatable. In other words, the 2018 Boston Red Sox very likely benefited from cluster luck.
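The Pythagorean calculation is easy to reproduce. Here is a minimal Python sketch using the runs scored and allowed quoted above, with the exponent exposed as a parameter so the original exponent of 2 and the updated 1.83 can be compared; the function names are my own.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=1.83):
    """Expected win percentage from runs scored and runs allowed."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)

def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=1.83):
    """Expected wins over a season of `games` games."""
    return games * pythagorean_win_pct(runs_scored, runs_allowed, exponent)

# 2018 Red Sox run totals as quoted in this write-up.
pct = pythagorean_win_pct(876, 647)
wins = pythagorean_wins(876, 647)
print(f"Expected win pct: {pct:.3f}")   # about 0.635
print(f"Expected wins:    {wins:.1f}")  # about 102.9

# The original exponent of 2 gives a somewhat higher estimate.
print(f"With exponent 2:  {pythagorean_wins(876, 647, exponent=2):.1f}")
```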

But what exactly is “cluster luck”? Cluster luck is the idea that the particular sequencing of plate appearance outcomes leads to very different run-scoring and run-allowing results. Given that each plate appearance by any given player can be numerically boiled down to a set of expected probabilities assigned to each possible event, pure chance has the ability to cluster positive or negative events sequentially, thus leading to very different outcomes. Take an inning in which a team has two walks, one single, one triple, two strikeouts, and one popout. Depending on the sequencing of those events, you can have two very different outcomes:

  • Sequence A: Single, BB, Strikeout, BB, Strikeout, Triple, Popout
  • Outcome A: Three runs scored / allowed
  • Sequence B: Triple, Single, BB, Strikeout, Strikeout, BB, Popout
  • Outcome B: One run scored / allowed
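The two sequences above can be checked with a tiny inning simulator. This Python sketch makes simplifying assumptions that are mine, not part of the example: every runner advances exactly as many bases as the hit is worth, walks only move runners who are forced, and outs never advance anyone.

```python
HIT_BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}

def simulate_inning(events):
    """Return runs scored for an ordered list of plate appearance outcomes.

    Assumptions: runners advance exactly the number of bases of the hit,
    walks ("BB") only advance forced runners, and "K"/"OUT" add an out.
    """
    bases = [False, False, False]  # first, second, third
    outs = 0
    runs = 0
    for event in events:
        if outs >= 3:
            break
        if event in ("K", "OUT"):
            outs += 1
        elif event == "BB":
            if all(bases):
                runs += 1            # bases loaded: runner on third is forced home
            elif bases[0] and bases[1]:
                bases[2] = True      # runner on second is forced to third
            elif bases[0]:
                bases[1] = True      # runner on first is forced to second
            bases[0] = True
        else:
            advance = HIT_BASES[event]
            new_bases = [False, False, False]
            for index, occupied in enumerate(bases):
                if occupied:
                    if index + advance >= 3:
                        runs += 1
                    else:
                        new_bases[index + advance] = True
            if advance >= 4:
                runs += 1            # the batter scores on a home run
            else:
                new_bases[advance - 1] = True
            bases = new_bases
    return runs

sequence_a = ["1B", "BB", "K", "BB", "K", "3B", "OUT"]
sequence_b = ["3B", "1B", "BB", "K", "K", "BB", "OUT"]
print("Sequence A runs:", simulate_inning(sequence_a))  # 3
print("Sequence B runs:", simulate_inning(sequence_b))  # 1
```

Same seven events, same three outs, and a two-run swing purely from the order in which they happened.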

Sequence A is an outcome in which the offense benefited from cluster luck, whereas Sequence B is one in which the defense did. Throughout the course of 162 games, teams can become significant victims or beneficiaries of cluster luck. Here is a table showing the lucky teams and the unlucky teams of 2018:

As always (and especially with baseball), we have an ability to go deeper with our evaluation. In particular there is the BaseRuns (BsR) metric which was originally developed by David Smyth and aims to estimate how many runs a team should have scored or allowed over the course of a season. Much like the Pythagorean expectation formula (and the vast majority of sabermetrics), the BaseRuns formula has evolved over time and there are some slight variances depending where you look, but the current iteration I utilize is the FanGraphs version that currently looks like this:

The above table shows BsR scored with the “lucky” teams on the left having more actual runs scored than BaseRuns scored, and the “unlucky” teams having the opposite. We can use the above table to potentially begin answering the “Is it better to be lucky than good?” question. Of the top nine teams in offensive BaseRun differential (actual runs minus BaseRuns), six played in the division tiebreaker games or made the playoffs outright. On the other hand, ten of the bottom eleven teams in differential missed the playoffs. BaseRuns alone are also just a solid measurement of a team’s strength in run production, as last year all ten teams who played in a division tiebreaker or made the playoffs outright finished in the top thirteen in BaseRuns scored. Next, let’s look at BsR allowed:

Obviously the inverse would be true with BsR allowed, with the “lucky” teams on the left having fewer actual runs allowed than BaseRuns allowed, and the “unlucky” teams having the opposite. Five of the top eight teams that have a beneficial differential made the division tiebreakers or playoffs outright, whereas eight of the ten teams with the most unfortunate differentials missed the playoffs. Nine of the ten teams who made it that far also ranked in the top twelve in least BaseRuns allowed. The Rockies are the only such team to miss that cut, but given that they play in run-friendly Coors Field it’s easy to understand why they might not rank favorably in BaseRuns allowed. Nevertheless, now that we have an expected runs scored metric and an expected runs allowed metric, we can use the Pythagorean win formula to see how many wins each team should have had in 2018. Below shows exactly that, with the left table being sorted by differential and the right table being sorted by Pythagorean expected wins using BaseRuns. I’ve also highlighted the ten teams that played past Game 162.

As you can see in the right table, Pythagorean expected wins seems to be a good measure of team strength with nine of the ten “postseason” teams ranking in the top eleven. The Rays might be the team that suffered the worst cluster luck fate given the circumstance of their misfortune. They finished fifth in expected wins but finished eleventh in actual wins. Furthermore they finished seven games behind the Athletics for the final Wild Card spot. If you were to bring them from -5.67 to 0.00 in Pythagorean vs. actual win differential, it would have put the Rays at 95.67 wins. If you do the same for the Athletics with their differential (moving them from 2.49 to 0.00), the Rays would have had a better record and beaten them out for that Wild Card spot. On the other side of the coin, the 91 win Rockies certainly benefited the most from cluster luck as they should have had 84.39 expected wins using BaseRuns, which would have been bested by either the Nationals or the Cardinals when bringing their expected vs. actual win differential to zero. So to answer the question, yes, it is in fact better to be lucky than good (sometimes).
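The Rays versus Athletics comparison boils down to simple arithmetic, sketched below in Python. The differentials are the ones quoted above; the actual win totals (90 for the Rays, 97 for the Athletics) are backed out from the figures in this write-up rather than listed directly in it, so treat them as derived inputs.

```python
def luck_neutral_wins(actual_wins, differential):
    """Wins after zeroing out the actual-minus-expected win differential."""
    return actual_wins - differential

# Differentials (actual wins minus Pythagorean expected wins) from the text.
# Actual win totals are derived: 95.67 - 5.67 = 90 for the Rays, and the
# Athletics finished seven games ahead of them.
teams = {
    "Rays": {"actual": 90, "differential": -5.67},
    "Athletics": {"actual": 97, "differential": 2.49},
}

adjusted = {
    name: luck_neutral_wins(team["actual"], team["differential"])
    for name, team in teams.items()
}

for name, wins in adjusted.items():
    print(f"{name}: {wins:.2f} luck-neutral wins")
```

With both differentials zeroed out, the Rays land at 95.67 wins against 94.51 for the Athletics, which is the flip in the Wild Card race described above.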

That’s going to wrap it up for my inaugural MLB write-up. I hope this was an insightful and educational start to what is hopefully a very insightful and long-lasting series of write-ups for the MLB. If you have any questions regarding the topics I covered today or have any topics in mind that you would like me to cover in future write-ups, don’t be afraid to give me a shout!

Until next time.