EPA (expected points added) is being used too much. There, I said it. Hear me out before you revoke my analytics club membership. No single application of EPA drove me to this conclusion; rather, many over the years did, including my own work quantifying player value with Bayesian models that use EPA! From scrutinizing my own work and others' I have concluded that EPA's use as an evaluation tool should be limited largely to team drives and games, complete quarterback seasons, league-wide trends, and informing other models like win probability. Using EPA to evaluate per-play averages in many specific game situations should, in my opinion, be discontinued.
Virgil Carter & Robert Machol's attempt to frame football in terms of expected points was revolutionary, and perhaps the first influential sports analytics paper, despite being ignored for about 40 years. Thanks to Brian Burke's site and later work to make expected points models public (driven largely by Ron Yurko and Ben Baldwin), expected points and EPA as a way to analyze football have finally become mainstream. I remember the first time I heard my ESPN colleague (at the time) Domonique Foxworth casually use "EPA" without explanation on Around the Horn, and I couldn't stop smiling from my cubicle. We won!
Humility
We should all be quicker to acknowledge flaws in any metric, and EPA is far from perfect. Work by UPenn PhD student Ryan Brill recently showed that expected points models are built on data rife with selection bias, which can affect how confident we can be in 4th down decision models.
Distribution of EPA on a Play
EPA is the difference between the expected net number of points a team will score before the play happens and after the play happens. The problem is that the distribution of expected outcomes on a football play is anything but unimodal or symmetric, let alone normally distributed. So when we use EPA, a 0 EPA play is not necessarily the 50th percentile outcome. Sometimes 0 EPA is well above the 50th percentile outcome, and sometimes it is well below!
Let’s look at a few examples.
Changing Distributional Shape of EPA Outcomes
We’ll start with the empirical distribution of EPA for a pretty neutral situation at midfield for each down and play type (run & pass).
For each down there is a bimodal distribution of outcomes whose variance and skewness increase with each additional down. This matters because the mean is pulled away from the center of the distribution toward the side of the skew. Players and teams in these situations will get credit or blame in terms of EPA for simply performing at the median, or 50th percentile. For example, a 50th percentile run on 3rd down and 10 at the 50 is worth -0.41 EPA despite being simply the median outcome! If we use EPA per play to compare players in certain situations and one player happened to face this situation more than others, that alone will skew the comparison.
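To make the mean-versus-median point concrete, here is a minimal sketch in Python. The mixture below is purely hypothetical (the weights and cluster centers are my assumptions, chosen so the mixture averages to zero, roughly like failed versus converted 3rd-and-long plays); it is not drawn from real play-by-play data. It shows how a bimodal, skewed distribution can have a mean of 0 while the median sits well below 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical 3rd-and-long EPA outcomes: ~80% of plays fail (small EPA loss),
# ~20% convert (large EPA gain). Values chosen so the mixture mean is ~0.
fail = rng.normal(-0.5, 0.15, size=int(n * 0.8))
convert = rng.normal(2.0, 0.4, size=n - fail.size)
epa = np.concatenate([fail, convert])

print(round(float(epa.mean()), 2))       # ≈ 0: the "average" play is 0 EPA by construction...
print(round(float(np.median(epa)), 2))   # ≈ -0.45: ...but the median play loses EPA
```

In a distribution like this, a player who merely hits the median outcome every time racks up negative EPA, which is exactly the credit/blame problem described above.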
We can also look at the most common starting field position of the 25 yard line (usually after a kickoff).
An even more extreme example of this, but still a common situation, is plays inside the 5 yard line in “goal to go” situations.
Player Credit
EPA is often used to compare player performances. In fact, total EPA gained on the season for quarterbacks closely mirrors the results of the MVP voting. In recent years, several attempts have been made to assign credit to players based on how that player's team unit performs in terms of EPA (Paul Sabin Plus-Minus, PFF WAR, Yurko nflWAR, Baldwin EPA+CPOE composite, Kevin Cole Plus-Minus).
Barring extreme situations where a player intentionally gives himself up instead of scoring at the end of the game to preserve possession and ensure victory, or going out of bounds to stop the clock, the goal of each player on a football team on each play is to advance the ball as far as possible towards the end-zone (for the offense) and likewise to prevent that from happening for the defense.
This raises questions such as: do we use EPA as the preferred metric because it best encapsulates individual and team performance on a play, or do we use it simply because it isn't yards? Why can't we just use yards in some situations? Is the same metric preferable for a whole season as for a set of plays in specific situations?
Flaws of EPA
As I see it, EPA has three main flaws:
- The large jump in EPA at the 1st down line vs 1 yard short of the first down line. That one additional yard is attained largely due to chance.
- EPA does not follow a symmetric unimodal distribution, and depending on the situation it can skew left, skew right, be bimodal, unimodal, and more! This means that EPA per play for a team, unit, or player is affected largely by situation before the results of the play occur.
- Selection Bias. Until recently all EPA models were built off observed plays. Better teams have more plays in the opposing territory, biasing the expectation towards better teams in those situations. Brill & Wyner 2024 use catalytic priors in an effort to adjust expected points models for this selection bias.
Yards vs EPA
We all know that 10 yards on 3rd and 10 and 10 yards on 3rd and 20 are not worth the same to the offensive team. We know EPA accounts for this problem.
Let's consider a hypothetical situation: it is 3rd and 10 at exactly midfield in the 1st quarter of a 0-0 game.
- Result A: The team gains 9 yards.
- Result B: The team gains 10 yards.
Now in terms of EPA, according to the nflfastR model, the expected points before the play is 1.71, while the EPA for Result A is -0.09 compared to 2.02 for Result B.
While in yards the team gained 10% more in Result B than Result A, in EPA the two plays are drastically different. That is a massive difference for what is close to a coin-flip difference in result. In causal inference, outcomes this close to a cutoff are treated as essentially random (the logic behind regression discontinuity designs). The difference between a play gaining 9 yards and 10 yards is mostly due to chance, yet EPA produces a massive difference in evaluation. If we are comparing players or teams in specific situations, our sample sizes can be overrun by these cliffs. If we do this as analysts, we are making the same mistake fans make when they overrate a team with an abundance of close wins.
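The cliff at the first-down line is easy to see with a toy expected-points lookup. The three EP values below are back-solved from the numbers quoted above (1.71 before the snap, -0.09 EPA after a 9-yard gain, +2.02 EPA after a 10-yard gain); everything else is an assumption for illustration, not the real nflfastR model:

```python
# Toy expected-points table containing only the three states in the example.
# EP values are implied by the figures quoted in the text (not a real model).
EP = {
    ("3rd & 10", "own 50"): 1.71,   # before the snap
    ("4th & 1",  "opp 41"): 1.62,   # after a 9-yard gain  (1.71 - 0.09)
    ("1st & 10", "opp 40"): 3.73,   # after a 10-yard gain (1.71 + 2.02)
}

def epa(before, after):
    """EPA = expected points after the play minus expected points before it."""
    return round(EP[after] - EP[before], 2)

print(epa(("3rd & 10", "own 50"), ("4th & 1", "opp 41")))   # -0.09
print(epa(("3rd & 10", "own 50"), ("1st & 10", "opp 40")))  # 2.02
```

One yard of field position moves the evaluation by more than two expected points, because crossing the line to gain flips the down-and-distance state entirely.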
Using Yards is OK
EPA is still useful in many situations and so are yards. In fact, in a lot of situations yards are a better evaluation tool than EPA. Plus yards are understood better by casual fans and decision makers than EPA is.
Let's look at quarterbacks on 3rd down with less than 5 yards to go in the first half of games since 2011. I will include quarterback seasons with at least 25 snaps in this situation. I will also define Total Adjusted Net Yards as `yards_gained + 20*touchdown - 45*turnover` for all quarterback dropbacks (i.e. QB sacks, runs, and passes).
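As a per-play function, that definition looks like this (a direct transcription of the formula above; the argument names are mine):

```python
def total_adjusted_net_yards(yards_gained, touchdown, turnover):
    """Total Adjusted Net Yards for one dropback, as defined above:
    yards_gained + 20*touchdown - 45*turnover,
    where touchdown and turnover are 0/1 indicators."""
    return yards_gained + 20 * touchdown - 45 * turnover

print(total_adjusted_net_yards(12, 1, 0))  # 32: a 12-yard touchdown
print(total_adjusted_net_yards(3, 0, 1))   # -42: a 3-yard gain that ends in a turnover
```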
Sorted by EPA per play, the top of this list looks very reasonable. Patrick Mahomes, Jalen Hurts, and Peyton Manning are all at the top, but then we get into some interesting names we don't typically associate with leading an NFL quarterback leaderboard, like Carson Wentz, Derek Carr, and Geno Smith.
Sure, these are smaller sample sizes because I’m filtering down, but this whole exercise is to examine what happens to EPA when we filter down. As analysts we make these comparisons or see others make these comparisons with similar sample sizes all the time!
Even though each season is a much smaller sample, with almost 300 of these quarterback seasons available to us we can draw some conclusions. We can look at the stability of how a quarterback performs on 3rd and less than 5 from one season to the next.
| Stat | Stability |
|---|---|
| EPA | 0.002 |
| TOT ANY | 0.004 |
| YDS | 0.093 |
Small samples are noisy, so none of these have high year-to-year correlation, but ~0.09 is still a lot higher than ~0.003!
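For readers who want to reproduce this kind of stability number, here is a minimal sketch of the approach: pair each player-season with the same player's next season, then correlate the two columns. The column names and the tiny DataFrame are my assumptions for illustration, not the actual study data:

```python
import pandas as pd

def stability(df, stat):
    """Year-over-year correlation of `stat`: match each player-season
    with that player's following season, then correlate the pairs."""
    # Shift next season back one year so year t+1 lines up with year t.
    nxt = df.assign(season=df["season"] - 1)
    pairs = df.merge(nxt, on=["player", "season"], suffixes=("_t", "_t1"))
    return pairs[f"{stat}_t"].corr(pairs[f"{stat}_t1"])

# Tiny hypothetical example: two QBs over three seasons.
df = pd.DataFrame({
    "player": ["A"] * 3 + ["B"] * 3,
    "season": [2021, 2022, 2023] * 2,
    "yds_per_play": [6.0, 6.5, 6.2, 4.8, 5.0, 5.1],
})
print(round(stability(df, "yds_per_play"), 2))  # high positive correlation in this toy data
```

Run on real filtered play-by-play data, this is the calculation behind the table above.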
Why is this?
Well, for a QB over the course of a season, the benefit of accounting for yard line, down, distance, time remaining, etc. makes EPA more predictive than yards. The opposite can be true when we start slicing out certain game situations. When we slice the data for certain situations in a game (or for situations in a play that are correlated with situations in a game), using EPA may hurt predictability because of, among other things, the jump at the first down line coupled with small sample sizes. The goal of every team and player is to gain as many yards as possible (with some very limited exceptions), so it is OK, and can even be preferable, to use yards when we are honing in on specific situations.
Beyond Yards & EPA?
I believe the future of evaluation of players and teams goes beyond using yards, EPA, WPA (win probability added), or success rate. In order to truly quantify performance we need to understand where on the distribution of outcomes (accounting for the situation) the result of that play (or micro-play if using tracking data) falls. This is not trivial, and it is something I am actively working on. Want to help or be involved? Reach out!
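One simple way to sketch the idea of situating a play's result on its situational outcome distribution: compute the empirical percentile of the play's EPA among plays from the same situation. This is only a toy illustration of the concept (the function name and the hardcoded distribution are mine), not the model I am working on:

```python
import numpy as np

def outcome_percentile(result_epa, situation_epas):
    """Fraction of plays from the same situation whose EPA was below
    this play's EPA, i.e. an empirical percentile."""
    situation_epas = np.asarray(situation_epas)
    return float((situation_epas < result_epa).mean())

# Hypothetical 3rd-and-long distribution: 80 failed plays, 20 conversions.
sample = np.concatenate([np.full(80, -0.5), np.full(20, 2.0)])
print(outcome_percentile(0.0, sample))  # 0.8: a "0 EPA" play is an 80th percentile outcome here
```

In a skewed situation like this, a 0 EPA play grades out as an above-average result, which raw EPA per play would never reveal.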