The following is a rough transcript which has not been revised by High Signal or the guest. Please check with us before using any quotations from this transcript. Thank you. === [00:00:00] Today, where we have Hawk-Eye cameras, and these are, in baseball, twelve 300-frames-per-second cameras recording everything that happens on the field, from pigeons landing to streakers running across the field. And most importantly, it directly observes things like the spin axis of a ball. You can construct a skeletal pose for doing biomechanical analyses. That really induced a revolution because you were able to record mechanistic metrics that were directly related to an individual's contribution to the game. That was Chris Fonnesbeck, principal data scientist at PyMC Labs, on the sensor revolution that has turned every pitch and swing in baseball into a measured event, and why this kind of revolution is a leading indicator for the future of data science and AI. Chris is the creator of the Bayesian modeling library PyMC, an adjunct associate professor at Vanderbilt [00:01:00] University Medical Center, and if that's not enough, he has 20 years of experience as a data scientist across academia, industry, and government, including loads of pro baseball work with the Phillies, Yankees, Brewers, and Mets. Chris joins me to unpack why baseball front offices are already running the probabilistic and causal workflows that most industry AI teams won't adopt for years to come. And we trace the arc from early box scores to modern Hawk-Eye cameras that generate terabytes of skeletal pose data per game. So why has baseball been a leading indicator for data analytics for so long? Huge incentives and a culture that treats decisions as quantifiable. With each additional win a player can contribute to a team worth about $8 to $10 million in contract value, and front offices built around probabilistic reasoning, baseball has had every reason to push the methods further and faster than industry. Chris and I dive into the technical patterns AI builders can use today: the [00:02:00] move from measuring outcomes to evaluating decision quality, the use of Bayesian hierarchical models to handle subpopulations with small samples, along with the integration of expert judgment into methods and priors. Chris also makes the case for why causal inference is the next frontier for anyone looking to move beyond simple prediction towards making higher-stakes interventions. This is the skill set that data scientists and AI builders should be putting under their belt right now: probabilistic thinking, hierarchical models, integrating expert judgment, and increasingly, causal inference. Baseball solved these first. The teams running them today are showing the rest of us where the field is going. If you enjoy these conversations, please leave us a review, give us five stars, subscribe to the newsletter, and share it with your friends. Links are in the show notes. I'm Hugo Bowne-Anderson, and welcome to High Signal. Let's now check in with Duncan Gilchrist from Delfina before we jump into the interview. Hey, Hugo. How are you? I'm well, thanks. So before we jump [00:03:00] into the conversation with Chris, I'd love for you to tell us a bit about what you're up to at Delfina and why we make High Signal. Delfina is the AI-managed context layer for messy enterprise data. We speak every day with the best folks in data, and so with the podcast, we're sharing that High Signal. So we covered a lot of ground with Chris.
So I was just wondering if you could let us know what resonated with you the most. You know, Chris describes this shift from observing outcomes to observing causes. A box score will tell you that a pitcher threw 94 miles an hour and got a strikeout. A Hawk-Eye camera tells you the spin axis, the release point, the biomechanical load on the pitcher's shoulder. It is vastly more data, and that is a fundamentally different science. The question that literally gives me tingles: What is my Hawk-Eye camera? What is yours? What would it look like to instrument and analyze your business, with AI of course, at that microscopic level? Wow. Let's get into it. Hey there, Chris, and welcome to the show. Hey, [00:04:00] Hugo. It's been a while. Good to be back. It's... It, it has been a while, and we've, we've chatted about so many fun things o- o- over the years. I... Something we haven't talked about publicly, though, is something near and dear to your heart, something you work on a lot, which is sabermetrics. And- Sabermetrics for people who may not know what that term means, I, I... They'd be forgiven for thinking maybe it was the study of Star Wars or, or something like- ... along those lines. But maybe tell us- Swords, yes ... Exactly. Sabers, right? Can you tell us what sabermetrics is? Yeah, so the name was coined by maybe the most famous sabermetrician, Bill James. So b- noted baseball analyst from '70s, '80s. And the name d- is derived from SABR, S-A-B-R, which is the Society for American Baseball Research, which was founded in, uh, I think it's 1971. And, and so the name was coined, I think, in Baseball Abstracts in, like, [00:05:00] 1977. He coined the term sabermetrics, which is essentially just baseball analytics. And so yeah, it's been broadly adopted to define analytics, uh, sports analytics applied to baseball. Yeah. Awesome. And one of the reasons I, I reached out to you to n- do this podcast was that I was kind of, you know, reading a bunch and thinking a bunch about sports analytics, saber- sabermetrics in particular, but, uh, it just became bleedingly obvious that sabermetrics is a, a leading indicator in, in many ways for techniques that data scientists, data analysts do in other sports but also in industry more generally. And from Bill James to Moneyball era stuff, which some listeners may be a- aware of. And so- Part of this conversation is to look at the history and see what's happening today so people can figure out what techniques they should import into i- industry as well. Sure. I'm wondering if you could just give us a brief history of sabermetrics. Yeah. Uh, sabermat- You can think of sabermetrics as an about [00:06:00] 150 years now apprenticeship in observational data science, right? And, uh, you... I guess it's useful to divide it into kind of pre- and post-Bill James, so pre- and post-sabermetrics. But, and go all the way back to the late or mid-1800s, Henry Chadwick, uh, who was one of the early, you know, one of the founders of original baseball, and he was a, sort of a cricket writer turned baseball evangelist, and he created the first box score. So if you, even if you're not a baseball fan or a follower, you open up any, I was gonna say newspaper, but ESPN or any website, you'll see the box score that kind of is a record of all the events and, that went on in the game, a little s- little visual summary. That's been virtually unchanged since, you know, the, the late 1800s. Mm-hmm. And, and, um, and even as early as early 1900s, a guy by the name of F.C. 
Lane came up with a run estimator that was essentially a linear model [00:07:00] that, uh, uh, weighted different offensive events like singles, doubles, triples, based on their contribution to, to run. So r- really a full linear model, decades before we started formally applying regression to anything. And then, you know, fast-forward to, yeah, Bill James and sabermetrics, which kind of watershed moment there, and people like, uh, Voros McCracken, so a very famous post on, uh, on Usenet in 1999, for those of you that remember Usenet newsgroups, developed one of the first kind of specifically baseball estimators called DIPS, uh, which stands for Defensive Independent Pitching Statistics. And he tried to separate essentially signal from noise, so separating outcomes or things that predict outcomes from things that are essentially random. So he found metrics like strikeout rate, walk rate, home run rate, and there was, they were better correlated year to [00:08:00] year than other things like batting average and runs surrendered and so forth. And i- this kind of started the whole cutting edge, I guess, that baseball developed of really separating signal from noise and trying to come up with metrics that are as independent as possible from, you know, random chance and stochasticity, and using those as predictors for things that we care about. And then, you know, the, I guess the most famous is, you know, Billy Beane and the whole, you know, Moneyball saying, and he gets most of the credit for Moneyball, but it was, it was actually Sandy Alderson, his boss, who was the- General manager of the Oakland Athletics in the early '80s. Uh, he adopted that philosophy before Billy Beane arrived on the scene, and then he kind of inherited that philosophy, he and Paul DePodesta, and he got the book and all of that. But where it became really important was when decision-makers like owners and general managers and managers started taking all of [00:09:00] the products of all these baseball nerds and applying them to actual, the play of baseball on the field and how decisions are made in the front office. Yeah, and I love that you mention product and managers, executives, 'cause part of this story is about cultural integration as well, right? And I... Correct me if I'm wrong, but part of Moneyball era was the commentators had the statistician next to them, or something along those lines. Yeah, and they still do, and statistics have been part of baseball, again, from the beginning. But the... I think the big change that sabermetrics was responsible for was weaning us off of traditional statistics, things that you see in a box score, event data, raw event data, and turning those into model-based estimates and model-based predictions, things that are a little bit more predictive of, you know, what's gonna happen in the future or a better record of what a player's skill actually is. Yeah, that's always been the case. Now we're becoming a little bit more sophisticated [00:10:00] and, and nuanced in our application of statistics in baseball. Super cool. And maybe we'll get to these things, but already what we're seeing in this conversation is, you mention, like, modeling the signal and, uh, uh, and not the noise, right? So we're thinking about making sure we're not o- o- over fitting. We're talking around uncertainty e- estimation and even thinking probabilistically, and I hope we get to all of these places. I am interested, in the world of sports, why baseball first? 
So what made baseball uniquely suited to this type of quantitative analysis compared to other sports? I mean, some of it, y- you know, a lot of it is accidental. Historical accidents, structural accidents. But, um, s- there are certainly characteristics of baseball as a sport that lend themselves to statistical analysis, particularly, you know, in an era that does not yet have ball tracking data and StatCast data and kind of high frequency cameras and things like that. The fact [00:11:00] remains that baseball is a sequence of discrete low interaction events and sort of Markovian states that happen, right? Like, you can... There's 24 configurations that you can have of bases and outs, uh, on a, on a baseball field, and you can ascribe a value to each of those. So... A- and each, each plate appearance which kinda define... You know, baseball's kinda unique in that it- it's not defined by a clock necessarily. It's more about state. And so, uh, each plate appearance is, is roughly this sort of- Self-contained, uh, Bernoulli trial, right? Do, do you get on base or not? Does he get a hit or not? Does he strike out or not? And, um, and so that makes... It lends itself, right, to, you know, linear weights types of approaches or, uh, expected run expectancy modeling work. So you can compute things like, uh, you know, the change in run value per event, things like that. And so the most important event in any, any given play in [00:12:00] baseball is this pitcher-batter match-up, and you can isolate that, and you can attribute outcomes to individuals with very... You don't have to have these structural equation models or anything like that. And it... Contrast that with almost any other sport, right? Uh, basketball, where you've got five players on each team, so five, you know, 25 simultaneously, uh, evolving spatial relationships. You know, you have to deal with continuous spatial temporal data, which, you know, A, didn't exist until very recently, and e- now that e- even now that it does exist, it, it requires... You, it, it... Certain amateurs can't just pick that up and, and run with it like they've done with baseball over the years. Soccer similarly, even sparser in terms of there's 11 a side is there, and then the events we care about, like scoring, is very sparse. So there's a sparse data problem there. It, it's relatively unique to baseball. You could argue American football has discrete plays, but there aren't re- again, it's sparse and distributed over 22 [00:13:00] players that kinda defied measurements until you had these RFID player tracking data and so forth. So the bottom line is that outside of baseball, there's very little public data and, and actually the public was ahe- generally ahead of the baseball clubs for a long time, and they led the way for decades before the advent of sabermetrics and, and Moneyball. One more thing is that the game hasn't changed all that much, so, you know, it's been... The rules have been relatively stable. A 1920 box score is still readable and interpretable today, and, and the rules are more or less the same. You can take decades' worth of data and throw it into the same model. That's a great breakdown, and something... I wanna tease something apart in, in that a lot of this we're actually talking about how well we can predict certain things, right? That isn't all that analytics is, of course, but you know, a, a lot of what we're talking about here in sabermetrics is. And thinking... 
And you and I messaged about this n- n- the other day, but thinking about acts of, uh, [00:14:00] prediction, uh, we can break it down. There are different types of systems, right? There are, like, clock-like systems, right? Which, uh, there's, there's a physics n- to them, a well-known physics, and, uh, there are amazing examples there in, like, celestial mechanics, r- right? So the prediction of Halley's Comet, for example, is an amazing e- example which seems so surprising, but when you, when you think about it, it's like, "Oh, okay, this is somewhat of a clock-like system." Then we have games. Chess, poker Checkers, Go, the, these types of things which are discrete, rule-based, so being able to predict is something we, we can do. Of course, it can get computationally in- intensive in chess and then Go, of course. StarCraft, these types of games as well, con- more c- seemingly continuous, but you know, stuff that we've become very good at with respect to prediction. After games, you get sports, and then you get complex systems, chaotic systems. We, we may get there, but when... Something I'm hearing there is when you go to sports, being able to predict gets tougher than in games, but [00:15:00] baseball is somewhat more game-like in its discreteness, at, at least, and set of rules than something like basketball or the NFL. Yeah. I actually remember when I worked for the Yankees, we had a vendor at some point that came along and said that they had essentially solved baseball, and I don't know what happened to them. We di- I don't think we ended up hiring them or anything. But, yeah, certainly coming up with mechanisms that, you know, and models that are able to separate, again, signal from noise is very helpful. And indeed, some people have charged sabermetricians with ruining the sport because they've, we've made it so predictable and so heavily optimized that we, we've made the game boring, changing pitchers every few batters, and highly optimizing pitches, and designing pitches that are almost impossible to hit. So in that respect, you can make... There's a lot of physics involved there. There's a whole industry in baseball now of pitch design, where you can take a pitcher that was previously middling scrub player, and just by making them hold the ball a little bit differently [00:16:00] or deliver it a little bit differently, turn them into an elite pitcher. And, and a lot of that comes from the fact that you can, based on the data that's available now, the ball tracking data, you can perfectly track spin, and velocity, and release points, and so those things can be engineered to generate optimal, optimal movement. Obviously, predicting further into the future is harder than something that's proximal. Most of the projection systems that we have, both public and private, they do really well. You know, next... If you go to FanGraphs or Baseball-Reference, which are kind of the two main quantitative baseball websites, uh, their projections are one year ahead. They're not predicting five or seven years ahead. That's really difficult and, you know, I find, found that out the hard way working in baseball, being over-fixated on projecting things out, you know, too deep into the future. Uh, a- and, and you'll see things, you know, like during a game now you'll see a win, win probability. So you'll be in the sixth inning of a game and... 
Or, or, or halfway through a hockey game, and it'll say the New York [00:17:00] Rangers have a 62% chance of winning this game based on the state at this point in the game. And as you come closer to that time horizon, it becomes eas- easier to predict. One way of looking at it is that the amount of available variation in the future will determine how easy it is to predict something. I love that you mentioned, talking about time horizons and how long we're predicting because- In baseball, a really important thing is thinking about, like when you, you know, hiring rookies or whatever, thinking about, "Oh, I'm such a tech bro." I was gonna say the lifetime value- ... of a player. Clear- whatever the term is, right? You, you wanna think, uh, like how valuable will this player be throughout their career, right? Yeah. And that's something which we've seen huge advances in, in, in baseball. Mm-hmm. Yeah, that's the name of the game. And, yeah, I think the other aspect of that worth noting is that you can actually develop really good models for predicting future performance in the aggregate, so the distribution of players over time. What's hard [00:18:00] is predicting it for this particular player. And so there's, again, there's so much data. That's the nice thing about baseball, and I think is why it- it's a, it's such a great... a hotbed for the development of methodological, methodological development because there's so much data. There's always more tomorrow to validate your models. And so from a population perspective, there's, you know, you can be an empirical Bayesian and, uh, develop really precise models for future prediction for populations, but individuals, the different levels of stochasticity catch up with you and, you know, you get your, your cone of, of doom that, uh, us- usually defies prediction too far into the future. But that's the name of the game is determining whether it's worth paying a 20-year-old promising player $100 million over seven years. One school of thought is you, you lock up your young talented players as quickly as possible because they'll only get more expensive if they're [00:19:00] good. On the other hand, y- there's a non-zero chance, or probably a p- relatively high chance that it won't be worth the money. There's always that tension. And one thing I'm hearing in there is perhaps considering, once again, not wanting to use too many financial terms, but like a portfolio of players or a, a cohort, right? And almost embracing the un- the uncertainty or hedging risk across a portfolio. Yeah. Yeah, and that's where the whole... M- most North American sports anyways, they, teams are built on a draft process where you essentially pick players i- in a particular order based on how you finished during the year. And you... And in baseball this is particularly important because you'll have entire teams, uh, across multiple levels of players that are under contract for your team, and you have to, uh, nurture them through those levels, uh, so that a player picked when they're, you know, 19, 20 years old, won't actually reach the Major Leagues, if they do at all, until they're in their mid-20s. And so, [00:20:00] so that does provide an opportunity to hedge risk a little bit. You're not, uh, resting the fortunes of the franchise on one player. It's, it's an entire ecosystem or family, if you will, and each of them d- requires a different amount of player development. 
So that's a big, that's a big part of modern sports analytics is, you know, the, the whole player development angle. And some teams seem to be better at it than others, both in terms of nu- numbers of players that matriculate through their system, and then those that are able to stay healthy once they're Major League players. Some teams have a, a knack for, of having all of their players on the injured list all the time, and that's part of player development as well. So certainly hedging risk is a big component of that. I wonder if the other aspect that we've talked around is... So speaking to why so many advances happen in baseball is the incentives, that there's a lot of money at, at stake, and a lot of culture for America, right? Yeah. And that's where putting [00:21:00] a value on player performance is one of the goals of sabermetrics. And very quickly in the last, at least in the last few decades, there's been a premium on expressing player value in terms of things that are important, so notably wins. So the currency of the value of a baseball player is in terms of wins above replacement. Mm. So the number of additional wins you would expect that player to contribute to your team over 162 game schedule, relative not to an average player, but to a, what they call a replacement player, which is somebody who you could pick essentially for free off of the waiver wire or from a minor league team. And that's also re- translated into money in the sense that an additional win is worth about 8 to $10 million, depending on, depending on the season and depending on the team. And all of that goes into the calculus of how you can value a player and how big of a contract to [00:22:00] offer that player. Noting that if you don't overpay for some of these players, you're probably not gonna get them. And there's always that tension, and each team has a different sort of philosophy. Smaller market teams rely on having a really good player development system and then flipping those players to the rich teams- Mm ... getting a big payoff back, and thereby being able to sustain that success over the short term. Whereas the richer teams are able to take their pick, but they end up spending a lot of money, typically losing a lot of money because most of the time those large contracts don't end up paying off. There seem, seems like there's more science to this than there is to a lot of science, to be honest. Yeah. The, unfortunately it's been treated more of like an art than a science, which is why there are so many, are so many records of failure, notable n- records of failure, teams overspending. And, uh, it, it really underlines the ones that do a good job, uh, ones that tend to be analytically driven, not just in baseball, but, you know, in, in other places and [00:23:00] are able to... Again, the whole Moneyball, the whole kernel of the Moneyball idea was, uh, taking advantage of undervalued assets and punching above your weight by spending less money on something that will give you a similar return to- characteristics of players that perhaps are overvalued or widely recognized as valued and, yeah, bec- it's, it's, you know, it's kind of like the Red Queen thing. You've gotta keep changing, just keep changing, keep running to stay in the same place. And so it's all about a competitive advantage and, and that your competitive advantage tends not to last for very long, so you've gotta take advantage of it while it's there. Totally. 
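To put rough numbers on the wins-above-replacement currency from a moment ago, here is a back-of-the-envelope sketch of the arithmetic. The dollars-per-win figure follows the $8 to $10 million range Chris gives; the ten-runs-per-win conversion is a commonly cited rule of thumb, not any team's actual model; and the player's runs above replacement is invented for illustration.

```python
# Illustrative sketch of the wins-above-replacement -> dollars arithmetic.
# The constants are rough, commonly cited approximations, not a team's model.

RUNS_PER_WIN = 10.0      # rough sabermetric rule of thumb
DOLLARS_PER_WIN = 9e6    # roughly $8-10M per marginal win, per the conversation


def war_from_runs(runs_above_replacement: float) -> float:
    """Convert a player's runs above replacement into wins above replacement."""
    return runs_above_replacement / RUNS_PER_WIN


def implied_value(war: float, dollars_per_win: float = DOLLARS_PER_WIN) -> float:
    """Back-of-the-envelope market value of a season's projected WAR."""
    return war * dollars_per_win


if __name__ == "__main__":
    raa = 45.0                # hypothetical: 45 runs above replacement
    war = war_from_runs(raa)  # about 4.5 WAR
    print(f"{war:.1f} WAR ~ ${implied_value(war) / 1e6:.0f}M per season")
```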
And to your point about mistakes and even, e- even big mistakes, I mean, as, uh, as we both know, science works through making lots of mistakes. And I always, you know, when I worked in research, wished there was a journal of negative results because I honestly think the reproducibility crisis wouldn't be solved by that, but would've, it would've helped us all a lot to know what [00:24:00] negative results other labs had, had more so than is purely dispersed in the culture. Yeah, and from a sporting standpoint, the important thing is to be able to learn from the negative results. And, yeah, again, sophisticated front offices will learn from other people's mistakes and their own mistakes. And as you say, c- you know, if you don't measure it, you can't change it. And being able to track these things and, and come up with models to inform decision-making rather than hunches is, was big. And it- that's been the big tension in, in baseball for a long time, is the influence of analytics and data-driven decision-making versus traditional scouting-based decision-making, and how those two can coexist. So what does modern baseball analytics look like? What's changed since Moneyball? Yeah, I mean, a lot. So if, if you go back to, I think the first, the first actual statistician, uh, [00:25:00] was hired by Branch Rickey, who was a baseball executive in the '40s and '50s, hired the very... I think his name was Allan Roth. Allan Roth, hired by the Brooklyn Dodgers in 1947, and he was considered the first full-time statistician, uh, employed by a Major League Baseball team. And, and that didn't s- didn't set the world on fire at the time. But today, all 30 Major League Baseball teams have R&D departments. Typically, that involves 10 to 25 quants. So teams like the Dodgers, Yankees, Phillies, Astros, some of them have 50 or more people, each of whom has a niche in things like player development and, uh, draft strategy, in-game strategy and so forth. So highly specialized individuals within R&D departments. The, I think the, the big, one of the big revolutions was in terms of, of hardware and data acquisition. [00:26:00] So in sort of the mid-2010s, they started implementing, uh, ball tracking data league-wide. It was first a system called StatCast, and then Trackman, which was developed initially for golf. And those were, that was a radar-based system. And to today, where we have Hawk-Eye cameras, the same ones you see in tennis, right? If you watch a tennis match and they challenge a, a, a call, and you see the ball bouncing off, then it's based on Hawk-Eye cameras. And these are, in baseball, twelve 300-frames-per-second cameras recording everything that happens on the field, from pigeons landing to, you know, streakers running across the field. And, um, most importantly, it, it directly observes things like the spin axis of a ball. You can construct a skeletal pose, points for doing, for doing biomechanical analyses, you name it. And that really induced a revolution because you were able to record [00:27:00] mechanistic metrics that were directly related to an, an individual's contribution to the game. So how hard a pitcher can throw, how fast he can spin the ball, how fast a batter's bat speed is coming across the plate, the angle at which they hit the ball, all of that. And so the big shift that allowed for was a shift from outcome-based statistics and modeling to process-based statistics.
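One way to picture that outcome-to-process shift: instead of tallying what happened, you model what should have happened from measured contact quality, which is essentially what the "expected" statistics discussed next are doing. A minimal, hedged sketch on synthetic data; the features, coefficients, and model settings below are all made up for illustration and are not anyone's production expected-batting-average model.

```python
# Minimal sketch of an "expected" statistic: model the probability a batted
# ball becomes a hit from process-level inputs (here just exit velocity and
# launch angle), then credit each batter with the expected value rather than
# the observed outcome. All data below is synthetic and illustrative.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5_000

exit_velo = rng.normal(88, 8, n)        # mph
launch_angle = rng.normal(12, 15, n)    # degrees
# synthetic "truth": hits are more likely on hard, moderately elevated contact
logit = 0.12 * (exit_velo - 88) - 0.01 * (launch_angle - 15) ** 2 + 0.4
is_hit = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([exit_velo, launch_angle])
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X, is_hit.astype(int))

# expected batting average on contact for a hypothetical batter's batted balls
new_contact = np.array([[95.0, 18.0], [78.0, 45.0], [102.0, 10.0]])
xba_on_contact = model.predict_proba(new_contact)[:, 1].mean()
print(f"expected BA on contact: {xba_on_contact:.3f}")
```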
So, so you go today to, you know, a modern baseball website like FanGraphs or, uh, Major League Baseball's own Baseball Savant, most of the things you'll see on there aren't actually statistics, they're model-based predictions. They'll all have an X in front of it, so expected batting average, expected on-base average, and, and so forth. And so they're based on metrics that are reliable, reproducible, and predictive into the future. And that's become kind of [00:28:00] the backbone of modern pitch- pitcher evaluation and player performance estimation. And then intersect the data with data scientists. So the whole reason, or one of the reasons I was able to come to make the leap from academia, you know, I used to be a biostatistician at Vanderbilt University Medical Center here in Nashville, and I jumped more or less straight into the Yankees. It's, and that was because of the data science aspect. What do you do with all this data? A given Major League Baseball game in 2026 using Hawkeye generates six to seven terabytes of data in a single game, and so- Mm ... you know, you need the data scientists to, to help process, so data engineers and data scientists to process and analyze all of this. And so, you know, the advent of things like tree-based models, every team's got an XGBoost model in their repertoire somewhere predicting something important. And then increasingly, things like Bayesian analysis and hierarchical [00:29:00] models and things like that Awesome. And I am excited soon to get into a conversation around Bayesian aspects of, uh, uh, of things and how people working in other disciplines can- Yeah ... can really start to become more probabilistic. I am also aware that, you know, there's been developments in automatic balls and, a- a- and strikes, and I'm just wondering if you could let me know what's ha- happening there. Yeah. That's the latest and greatest change. This may be one of the biggest changes in baseball since, you know, its inception. And that typically w- we rely on the judgment of umpires to call balls and strikes and outs. And as of 2026, this was implemented in every Major League Baseball game, and it's a challenge-based system. So it was initially introduced to the, I think it was the Atlantic League, which is a minor league, as every call was automated. So there was a, a, a headphone, a, a earbud in the ear of every [00:30:00] umpire, and all they were doing was relaying the ball and strike call based on remote sensing, essentially. But the way... It's interesting the way that they implemented it in Major League Baseball is actually a challenge system. Each team gets two challenges a game, and, uh, only the batter and the pitcher, uh, or the catcher can initiate a challenge. So if, if, if an umpire calls a strike and the batter or the batter thinks it was a ball, he taps his head a couple of times and it goes to a challenge. And, uh, this is also similar to what you would get in similar thing in tennis. And, and similarly for a, a pitcher, if they thought that a ball that was called a ball is actually a strike, they can do the same thing. And it, it's been like... I think it's worked quite well, unlike the VAR systems in soccer and football, which have been a disaster. This has actually been relatively engaging for the fan, and, uh, really, it, it's introduced sort of a game theory, uh, aspect to [00:31:00] baseball because there is a, a discrete stake, two-player asymmetric game with, uh, a budget of challenges. 
You can't, you know, you can't use them up all at once. More-- And moreover, when you use one, you're using one on behalf of your team, so you're taking them away potentially from other players. But the und- unintended consequence of all of this is that the strike zone has actually shrunk about 10% league-wide, and batters are taking a lot more pitches. They're taking their chances with them being called balls, so you're getting really high walk rates. And you've actually-- the strike zone has actually ended up being shrunk. But there's lots of strategic dimensions when if it's the first inning versus whether it's the ninth inning, do you, do you use your challenge and so forth versus it being used to essentially replace the umpire. That hasn't happened. And so I think it's... They've done a good job of, of implementing that. Incredible. And I do... The challenge system seems like a good way to at least Kind of deal with, like, cultural challenges that could happen. A- as you [00:32:00] mentioned, like soccer, football, also what a culture to try to do something like that in as, uh, uh, as well. I actually... I went to a soccer game at, um, Boca Juniors in Buenos Aires in Argentina- once, and- You survived to live the tale, did you? Barely, yeah, they- You survived to tell it. Yeah, they let, you know, the away team crowd and the home team crowd out different entrances, different times. Mm-hmm. And I think they don't even allow a away team crowd to come an- anymore because of everything that, that, that happened there. But I... So we've been talking about all the wonderful developments in sabermetrics, why it's been possible, the incentives behind it. I'm interested, we haven't really been talking about the skill set and the ways of thinking, and you've said the B word twice, statistics. And you've also mentioned the cone of doom. We've been talking around thinking probabilistically, thinking in terms of, uh, uncertainty, hedging, uh, risk, thinking about distributions. So I, I am wondering if you can tell us a bit just about maybe the [00:33:00] ways of thinking that make people and cultures good at this and prediction. It's quite the question. So you mean good at, good at what? Good at analytics or good at making sound decisions- Good- ... for their sport? Oh, great, great question. Maybe we could start with good at forecasting. Yeah. Again, I think a lot of it goes back to those original arguments about data availability and the ability to actually quantify the events that go on. But sound probabilistic thinking kind of underlies everything that goes on. In, in a good... You know, let's go back to forecasting, a great, you know, forecasting model in baseball, for example. Everything is kind of small samples with a strong prior, so lots of population information and, uh, but small samples. You know, my player has only had three at bats and they got two [00:34:00] hits. Are they gonna win a batting title or is this just luck? And this hierarchical structure is everywhere. So again, lot, there's lots of data at the season team level. There's less data at the player game, at bat within a game level. And so that naturally lends itself to Bayesian thinking and probabilistic thinking, and being able to share information, share strengths among players and teams and seasons, and be able to make more informed estimates and predictions from things. Being able to integrate, uh, expert information. 
This is another aspect of Bayesian methodology that's, that can be helpful when you, when a, again, when a player is new. They have a history in the minor leagues. They have a history in, you know, in college perhaps. How do you integrate... That's still information that may be helpful in making a prediction about that player. And so how do you integrate scouting information and, and that [00:35:00] prior information? Well, it's naturally done through, uh, Bayesian methods. And so, but again, b- baseball has advantages that other, you know, that, uh, other sports don't. But, you know, that's changing quickly as well. You're getting, you know, within football now, within soccer, uh, you know, you're getting, uh, metrics like expected goals, right? That kind of, uh, quantify the realistic chances that a particular team had to score goals in a game and, and you can compare that to what actually happens and say, "Oh, this team got lucky, this team got... This team was unlucky," and we can make decisions based on the process, w- what should have happened v- versus what actually happened. What's Bayesian thinking? What is Bayesian thinking? It's a proxy for probabilistic modeling, right? It's invoking Bayes' formula. It involves taking information or a state of knowledge, uh, about the world or about a system, and updating that with new information. And again, in a [00:36:00] baseball context, y- you have a little bit of information a- about a player. You have a lot of information about all second basemen in the history of baseball. And so you can combine those to make a prediction about what that player i- is currently like, you know, w- his current state or what he's gonna be in the future. And so Bayes is a way of updating our knowledge of the system, uh, in a principled way, in a probabilistic way, specifically such that you are pushing forward an entire distribution of expected events or probable events so that you can make sound decisions under uncertainty. 'Cause the uncertainty doesn't go away. You just have to, you know, quantify it appropriately and, you know, invoke hedging behavior and make a probabilistic decision based on the risks, potential risks and rewards and costs of making, let's say, an incorrect decision. And then the other [00:37:00] aspect of it is the whole hierarchical multi-level aspect to it, that you have information, valuable information for making predictions and making estimates at multiple levels. So individual player characteristics, characteristics having to do with the stadium, time of day that a game is being played, the fact that you're, you know, a mile high in elevation in Colorado affecting the pitches, and integrating that all in the same system and being able to make a holistic prediction based on holistic information across multiple scales. And so it, it really, you know, the efficient use of information- Not leaving anything on the table and, uh, and making your, you know, best. 'Cause i- ironically, you know, a lot of the times, or unfortunately a lot of the times, you know, you really are the decision maker, general manager. The manager just wants, you know, a list, let's say, of players to trade for or to pick in a draft. And so you've got to integrate a- over all of [00:38:00] that uncertainty in the end and give them a point prediction. But you wanna make sure that you're integrating all of that uncertainty in the modeling process and making sure that list you end up with is an appropriate one for them to make a...
use to make a decision. Awesome. And, and there's so much in there to tell us why Bayesian methods are so effective and can be challenging as well. But the ability to reason and make decisions under uncertainty is so key. And I joke that I'm probably a, a, a Bayesian, and I... Actually, I did a lot of workshops with our mutual friend and colleague Ravin Kumar last year. Yeah. Who's now at DeepMind, and- Mm-hmm ... he said at the start of one of them, "Hugo and I are both Bayesians." And I said, "Ravin, I'm probably a Bayesian." He said, "Hugo- ... it's such high likelihood we can call you..." Yeah. Exactly. Exactly. And several other things that I find really useful about Bayesian methods are being able to, you know, talk in [00:39:00] distributions and express- Mm-hmm uncertainty, not just second moments, third order moments, what- whatever it may be. Also, the fact that it forces you to make your assumptions explicit. As you know, I used to work in, in, in cell biology, in, uh, cytoskeletal dynamics actually. And the amount of people who would do... I don't have huge issues with NHST when done correctly, but a lot of the time people will use tests as, uh, as you know, without realizing what the assumptions underlying the, uh, the tests are. Mm. I think there is an objection to Bayesian stuff that, how do I choose the prior, uh, that type of stuff. But once again, you do choose a prior informed by certain things, and you also make it explicit as well. And you can s- then see what sensitivity ar- around the prior looks like also. Yeah. Mo- certainly model validation is, is very important. And, uh, working for a baseball team, it was great because you could always... It, it was an easy way to integrate experts, both in terms of prior elicitation, but then also in terms of model validation. If you handed a [00:40:00] result to a seasoned baseball professional, uh, it sort of has to pass a, a sniff test. And, uh, and if it doesn't, you know, you have to go back and test your assumptions and, uh, test your sensitivity to choices that you made in the model. And the important thing to remember is that the choice that you make about, you know, a prior or a structural component isn't a permanent decision. It's just one, uh, view of the world that is made explicit. And, uh, and, uh, the nice thing about that is that anybody can go in and criticize that model. Nothing is hidden. There are no- underlying assumptions to worry about. It's, it's all there if you know how to read it. And, and you're always free to go back and revise that. In fact, that's the spirit of Bayes really, is going back and revising a state of knowledge about baseball based on new information and new input and new ideas about what ought to be included in a model and what should be left out. It's iterative. Yep. It's a [00:41:00] loop. Very much. And I think the hierarchical techniques, which I'd love to see diffused more in, i- in the data culture, but we're seeing a lot in baseball, right? But they're applicable everywhere from everywhere you have customers or users or clients, wherever it may be. Like big tech companies who have global users can look at population level stuff, then see... Use hierarchical modeling to see how that applies to subpopulations, right? Yeah. It's bound... One of the things that I've learned my two years now with PyMC Labs is just the range of potential application that I thought was more limited to STEM and science and research. 
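Here is a minimal PyMC sketch of the partial pooling Chris describes, using the small-sample batting example from earlier: one player is 2-for-3, others have hundreds of at-bats, and all of them share strength through a league-level prior. The data, priors, and sampler settings are illustrative assumptions, not a production projection model.

```python
# Minimal sketch of partial pooling: a handful of players with very different
# sample sizes (one has only 3 at-bats), sharing strength through a
# league-level prior. Data and priors are made up for illustration.
import numpy as np
import pymc as pm

at_bats = np.array([600, 550, 480, 120, 3])
hits    = np.array([180, 140, 150,  30, 2])   # last player: 2-for-3

with pm.Model() as batting:
    # league-level ("population") distribution of true batting skill
    mu = pm.Beta("mu", alpha=2, beta=5)            # league mean rate
    kappa = pm.Gamma("kappa", alpha=2, beta=0.01)  # concentration: how similar players are

    # each player's latent true rate, partially pooled toward the league mean
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1 - mu) * kappa, shape=len(hits))

    pm.Binomial("obs", n=at_bats, p=theta, observed=hits)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

# The 2-for-3 player's posterior mean lands well below his raw .667, pulled
# toward the league rate; the full-season players' estimates barely move.
print(idata.posterior["theta"].mean(dim=("chain", "draw")).values)
```

The mean-and-concentration parametrization of the Beta prior is chosen here because it makes the pooling explicit: the league mean is what small-sample players get shrunk toward, and the concentration controls how hard they get shrunk.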
It's also applicable to marketing decisions and bond trading and everything in between, including baseball. Well, the PyMC marketing stuff took off in, in ways that all of us were surprised with. And- Yeah ... needless to say, of course, you, you built PyMC o- originally when working in biostatistics, right? In wildlife biology. It was even, it predated- Right. Wow ... [00:42:00] biostatistics. Yeah. I, I wrote it, the very first version, yeah, when I was a postdoc at the University of Georgia. I was actually doing adaptive management modeling for duck populations, and so we wanted to use Bayesian methods there. And in those days, there were not widely adopted and maintained open source MCMC libraries. There was WinBUGS, and that was it. And, uh, it was closed source and written in Pascal. And to serve the dual goal of learning Python and learning Bayesian inference, PyMC v1 was developed. And with every subsequent release, I've had less and less to do with the important parts of it as it's handed off to people who are better engineers and statisticians than I am. So it's in good hands. Yeah. W- so you can go and do cool baseball analytics with the Mets, for example. Right. Yeah. Do the important stuff. I didn't realize WinBUGS was written in Pascal. That's the worst version of Pascal's wager ever. Object. Object Pascal. [00:43:00] Yeah. For people who wanna find out more in a kind of a more popular way about, about Bayesian methods, I, I do rec- I mean, Nate Silver's The Signal and the Noise, which has a bunch of the ideas we've been talking about today, the final third, like 30 to 35% of that, like for a bestseller, it's trying to tell the world you need to think like a Bayesian, right? Mm-hmm. Mm-hmm. And then Philip Tetlock's book, Superforecasting. For those who don't know, Tetlock's superforecasting project, run with his collaborators, has been dedicated for a long time to, like, figuring out who's good at f- forecasting and why. Incredible work. And as it turns out- Yeah the types of characteristics that make you a really good forecaster are those that define a Bayesian thinker as well, right? Yeah. I'll add another one in a... To give it a baseball flavor. It's quite a, quite an old book now. It's by a guy named Tom Tango, who actually works for Major League Baseball now, Mitchel Lichtman and Andrew Dolphin. It's called The Book. And, uh, it's, and it's essentially, uh, [00:44:00] underpins kind of the modern approach to baseball of creating a, uh, a, a metric of underlying performance that is highly predictive. And it's, it's a really nice sort of recipe for how to get started in baseball a- analytics in a principled way. And, and the cool thing is that, uh, Tom Tango comes up with this model that he calls the Marcel Model, and it's kind of the most... As if it was created by a monkey, named after Marcel the monkey. Mm. So the, the simplest, sim- most simple-minded model that is reasonable. And it essentially involves, you know, a fixed weight, sort of an AR3 type of a model with, with regression to the mean, kind of a shrinkage. So it's not explicitly Bayesian, but it's Bayesian in spirit. And very few components. And, and to this day, it still competes with more sophisticated models and kinda acts as a baseline. So on the subject of books, if you're interested in baseball analytics and don't know where to get started, The Book is a great place to start. I'm [00:45:00] really excited to check that out. Thank you, Chris.
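The Marcel recipe Chris just sketched translates almost directly into code. This is a simplified version using the commonly described constants, a 5/4/3 weighting of the last three seasons plus roughly 1200 plate appearances of league-average ballast for regression to the mean; the full method also includes an age adjustment, omitted here, and the example player is invented.

```python
# Simplified sketch of a Marcel-style projection: a fixed-weight average of
# the last three seasons plus regression toward the league mean. The 5/4/3
# weights and ~1200 PA of ballast follow the commonly described recipe; the
# age adjustment in the full method is omitted.

def marcel_rate(rates, pas, league_rate,
                weights=(5, 4, 3), ballast_pa=1200):
    """Project next season's rate stat from the last three seasons.

    rates / pas: most recent season first, e.g. OBP and plate appearances.
    """
    w_num = sum(w * r * pa for w, r, pa in zip(weights, rates, pas))
    w_den = sum(w * pa for w, pa in zip(weights, pas))
    # regression to the mean: blend in league-average performance
    return (w_num + league_rate * ballast_pa) / (w_den + ballast_pa)


# hypothetical player: three seasons of on-base percentage
projection = marcel_rate(rates=[0.380, 0.350, 0.330],
                         pas=[600, 580, 550],
                         league_rate=0.320)
print(f"projected OBP: {projection:.3f}")
```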
And I'll include that in the show notes as well. I... We've been talking a lot about inference, prediction, forecasting, and talking around intervention as well. And whenever we take an action in the world, it changes the state of the system and what we... Of course, something we know, though out of scope for this conversation, is that the act of forecasting may change the state of the system as well. But where I'm really going with this is causal inference. Mm-hmm. How is causal thinking starting to show up in, in baseball? And why should, you know, data people outside sports care? Yeah. Like I said at, at the outset, you can think of sabermetrics as this apprenticeship in observational data science, and the key there is observational, right? And whenever you're dealing with observational data, you're typically building regression models, boosted tree models. You're, you're, you're dealing in correlations. And, [00:46:00] and causal inference is essentially a bag of statistical tricks to allow you to make causal inferences based on some observational data, provided that certain structures are satisfied. Now, why is that better? It, it... That would be... It would be more predictive, right? You're not relying on potentially spurious correlations in order to, to make decisions. And so putting it in a... Framing it in baseball, uh, you know, let's- Let's say we're interested... Let's say we observe that, uh, pitchers that throw a, a cutter, a cut fastball, have been getting more strikeouts, a particular trend in baseball. And so that's kind of a naive framing that you simply have to throw more cutters and they'll... You'll get more strikeouts. The causal question is, you know, would a random pitcher or an arbitrary pitcher from my pitching staff who started throwing cutters get more strikeouts? And that's confounded by a bunch of things, by skill. Pitchers that are good enough to throw cutters tend to be [00:47:00] probably are ones that tend to throw it more. Selection, teams add a pitch only if it projects. And then there's a bunch of counterfactual ambiguity, like would the re- rest of his pitching arsenal play down, for example. And so building, you know, models based on, you know, sort of causal principles and, and, and really this involves thinking hard about the structure of your model and knowing whether to include variables that are confounders versus leaving them out because they would actually obscure the inference that you're trying to, uh, that you're trying to carry out. And there are other applications as well, like work- workload counterfactuals in baseball, like a player that, uh, g- gets injured. Would that pitcher have stayed healthy under an innings cap, something that could have happened if we had made a different decision? And, and we can only do that by building smart models and having sufficient observational data to, to inform that. You can never run the [00:48:00] randomized controlled trials. You know, some of them look kind of... Baseball often seems random, and, and it seems like there's repeated measurements, and there are. But at the end of the day, you, you never choose with certainty who faces who and the weather during that game and other potential confounding factors. And so it's becoming i- i- increasingly used. One, uh, just off the top of my head when we're building projection models, one of the big problems that seeps in is selection bias. So a player that is still productive into their late 30s and 40s is not a randomly selected player from the population.
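The cutter example is easy to demonstrate with simulated data: when pitcher skill drives both who adds a cutter and how many strikeouts they get, the naive cutter-versus-no-cutter comparison is inflated, and conditioning on the confounder recovers the much smaller effect we baked in. Everything below is invented for illustration; in real data the confounder is rarely measured this cleanly, which is exactly why the structural thinking Chris describes matters.

```python
# Simulated sketch of confounding in the cutter example: latent pitcher
# "skill" drives both cutter adoption and strikeout rate, so the naive
# comparison overstates the pitch's effect. Adjusting for the confounder
# (observable here only because we simulated it) recovers the small true
# effect. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 2_000

skill = rng.normal(0, 1, n)                          # latent pitcher quality
p_cutter = 1 / (1 + np.exp(-(1.5 * skill - 0.5)))    # better pitchers add cutters more often
throws_cutter = rng.random(n) < p_cutter

true_effect = 0.01                                   # +1 point of strikeout rate
k_rate = 0.22 + 0.04 * skill + true_effect * throws_cutter + rng.normal(0, 0.02, n)

# naive comparison: difference in mean strikeout rate, cutter vs no cutter
naive = k_rate[throws_cutter].mean() - k_rate[~throws_cutter].mean()

# adjusted comparison: regress K-rate on cutter usage *and* the confounder
X = np.column_stack([np.ones(n), throws_cutter, skill])
beta, *_ = np.linalg.lstsq(X, k_rate, rcond=None)
adjusted = beta[1]

print(f"naive effect:    {naive:.3f}")    # inflated by confounding
print(f"adjusted effect: {adjusted:.3f}") # close to the true 0.010
```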
It's highly skilled players that have managed to stay healthy and throw the ball hard or hit the ball hard. We use causal effect, causal inference models in order to adjust for those biases. Awesome. And it, it is interesting to hear that this is still, like, actively in development and progress is being made today. 'Cause I do... Maybe I'm impatient. [00:49:00] Well, I, I am impatient, but I do often feel like data science and analytics in industry, I don't know so much about the sabermetrics space, which is one of the reasons I wanted to chat today, but I don't think causal thinking, like robust statistical thinking, things like people in economic, e- econometrics know a, a bunch about as well. I think, you know, the fact that the real, the best version of causal thinking is The A/B test in, in, i-i-in data science, and usually not done, um, correctly or, or, or, or robustly. And I, I am, I, I am wondering, maybe it's still just early days, but your thoughts on why, you know, probabilistic thinking, Bayesian thinking, causal inference haven't diffused more i- in the data culture yet. I can... Speaking for causal inference, it's hard. That's an advanced course in... And there was no causal inference course when, when I was taking statistics both at the undergraduate and the graduate level. [00:50:00] Um, and a lot of this is still quite new. So, um, so again, you know, baseball is probably a leading indicator, uh, again, in the use of causal inference in, in sports. So, and yeah, I know on the Bayesian side of things, as in any other industry, there were barriers in terms of, you know, again, philosophies, people being trained to do null hypothesis testing and statistics that aren't necessarily relevant to baseball modeling. But then also the computational constraints. You couldn't do MCMC 20 or 30 years ago, mainly because of computational constraints. And so having software like Stan and PyMC as really hi-high level languages that, packages that allow users to define their problem of application and then allow them to, um, you know, sort of plug into these more advanced meth-methods. And, and the same goes with causal inference. If you wanna build a causal inference model, you've gotta write it from scratch in your, you know, language [00:51:00] choice, more or less. Those sorts of higher level libraries, uh, kinda lags behind a little bit and, uh... So yeah, it's, it's difficult. It's not easy. You have to think hard about... Causal inference is, is hard to automate because you've gotta think hard about the building your directed acyclic graph and which way the arrows point and, and how to validate such a model, and going about things like colliders and modifiers and all of the... It's a whole other language, I think, that has most, explains most of it. Yeah. A- agreed, it is hard. There is a tooling, t- tooling issue for, for causal, and it involves thinking in a different way. But, uh, for people who wanna do Bayesian stuff, do check out PyMC, which is a wonderful package for Bayesian hierarchical modeling, all of these types of things. A- as a final question, I'm just wondering, you know, if a data analyst or scientist wanted to break into sports analytics, what should they learn first? It doesn't hurt to learn a high level programming language, so R, R and [00:52:00] Python. So in, in Major League Baseball you had R shops and Python shops. The Yankees were an R shop. Phillies were a Python shop. And increasingly it doesn't matter which. 
Probably even be a Julia programmer if you wanted to. And then- Yeah. Being conversant in machine learning methods, database, you know, with all of the, uh, the myriad data sources in baseball, knowing SQL is big. I never wrote as much SQL as I did when I worked in baseball, so it's still important to know languages like SQL. And then from just apart from the technical side of things, showing that you can solve problems and ship software is good. Having a, I don't know, good public GitHub. You know, increasingly your GitHub account is part of your CV, and having some Jupyter notebooks or Marimo notebooks tucked away in there, uh, that show you're able to, you know, apply these methods to baseball data or what- whichever sport that you're [00:53:00] applying these methods to, and, and being able to s- solve a problem and present it and, and translate it to... That's another aspect, is being able to translate this to, you know, smart but non-technical end users, your, your quote-unquote customers, which end up being players and front office decision makers. There's a lot. It's hard, you know, being a good data scientist is hard, right? The- Drew Conway's now-famous Venn diagram. You've gotta be a good software engineer, statistician, and domain expert, and it's hard to be all three. So if you can be good at, an expert at one and conversant in one or two of the others, you're on the path to- Totally, and- ... being a sports analyst. Seemingly you need to know a lot about Venn diagrams as well. There was so... There was a, there was actually a Venn diagram once which had Drew Conway's Venn diagram as a circle in the Venn diagram. Um- We ended up with so- Very meta, yeah ... so many. I love that you mentioned a GitHub repository as, as well, and the reason I, I wanna index on that briefly is it may seem [00:54:00] that having GitHub repositories now could be meaningless 'cause code is cheap to, to generate, right? But what you're talking about isn't code. There may be code in there, but it demonstrates your thinking, your communication skills, plu- plus, plus, which will involve code. But especially i- in this world, demonstrating those two things, thinking and the ability to solve problems and communicate them, is key. Yeah. In the, in, yeah, in the LLM AI world it, it's probably helpful for some people that they don't have to be experts in PyMC in order to build a model. But you do have to know the... You have to choose the model that is appropriate, you have to choose the problem that you want, the appropriate problem to solve, and the data that you need to solve it. So it, it does... AI helps, helps with the software engineering side of things, but not necessarily with all of the pieces. So yeah, it still involves making sound [00:55:00] quantitative choices at every step along the way. Whether or not you're writing the code yourself or getting Claude to do it doesn't absolve you of that responsibility. And to demonstrate that you really have thought about the data generating process. And we haven't said this in this conversation, but Bayesian thinking is the original generative modeling. That's, that's right. Yeah. That's, yeah. So your likelihood is also your, your sampling distribution. So it's where the data comes from, supposedly. Well, Chris, so excited to see what you get up to next and all, all the wonderful work you get to do with, with the Mets. And thanks for coming and sharing your, your time and wisdom and for such a great chat. Thanks, Hugo.
It's always good, and hope to, hope our paths intersect soon. Thanks so much for listening to High Signal, brought to you by Delfina. If you enjoyed this episode, don't forget to sign up for our newsletter, follow us on YouTube, and share the podcast with your friends and colleagues. Like and subscribe on YouTube and give us five stars and a review [00:56:00] on iTunes and Spotify. This will help us bring you more of the conversations you love. All the links are in the show notes. We'll catch you next time.