Cracking "Beat the Streak" with data science
by Randy Gingeleski
Thoughts on winning $5.6 million in "Beat the Streak" from MLB.com using data science and machine learning.
You just need to predict a baseball player getting a hit during their game that day. 57 times in a row.
And you can sit out some days too. Or if you feel really good about a couple guys, “double down” and make two picks in one day.
Easy money! … just kidding.
My roommate brought this whole idea up to me. I’m a sucker for schemes of the write-code-print-money variety. It took me all of an evening to figure out this could be a big time sink without any money on the other side.
However - so as not to have totally wasted that evening of research - here we are in this blog post. I had to get a little mileage out of it. And maybe my research will help others.
Let’s better define the contest before we get into cracking it.
Basically the idea is, you compete in vehicular combat against a bunch of other opponents. The last person surviving gets granted one wish.
… OK, kidding. That’s Twisted Metal.
Supposedly the toughest record in baseball to beat is Joe DiMaggio’s 56-game hitting streak. If you can best it with 57 correct picks in a row, MLB.com will give you $5.6 million.
You need to say, this specific person will get at least one hit in this specific game, 57 times in a row.
The closest anyone has gotten in the contest is Robert Mosley at 51.
And you can read the rules here if you’re a robot.
Heading to backtest land
Trying to beat the streak (!!) with data science and a computer isn’t that unlike algorithmic trading. That’s where you write code that takes in data and spits out stock market predictions.
Take Quantopian, a great web app for learning about that. Quantopian gives you a sandbox with access to a bunch of data sources, where you can write an algorithm and backtest it.
What do you need to do before even thinking about algorithms? Come up with (and program!) a backtesting framework. Something with a lot of historical data that you can run your algorithm through, and it’ll identify how well you would’ve done historically.
Once you’re killing it there, then it seems right to go live.
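To make "backtesting framework" a little more concrete, here is a minimal sketch of the replay loop in Python. Everything in it is my own assumption for illustration: the `(date, player, got_hit)` record layout, the `backtest` and `always_first` names, and the simplification of one pick per day and one game per player per day (so no "double down"). It is not the format of any real data source. The one thing it tries to get right is that the picker only ever sees outcomes from earlier days.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# One record per player per day: (ISO date, player, got_hit).
# Hypothetical layout for illustration only.
GameLog = Tuple[str, str, bool]

# A picker gets the date, that day's candidates, and all earlier logs.
# It returns a player name, or None to sit the day out.
Picker = Callable[[str, List[str], List[GameLog]], Optional[str]]


def backtest(logs: Iterable[GameLog], pick: Picker) -> int:
    """Replay history day by day and return the longest streak achieved."""
    by_date: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for date, player, got_hit in logs:
        by_date[date][player] = got_hit

    history: List[GameLog] = []
    streak = best = 0
    for date in sorted(by_date):  # ISO dates sort chronologically
        results = by_date[date]
        choice = pick(date, sorted(results), history)
        if choice in results:  # None (sitting out) leaves the streak alone
            streak = streak + 1 if results[choice] else 0
            best = max(best, streak)
        # Reveal today's outcomes only after the pick (no look-ahead).
        history.extend((date, p, h) for p, h in results.items())
    return best


def always_first(date: str, candidates: List[str], history: List[GameLog]) -> Optional[str]:
    """Toy picker: always take the alphabetically first candidate."""
    return candidates[0] if candidates else None


sample_logs: List[GameLog] = [
    ("2016-04-01", "A", True), ("2016-04-01", "B", False),
    ("2016-04-02", "A", True), ("2016-04-02", "B", True),
    ("2016-04-03", "A", False), ("2016-04-03", "B", True),
]
longest = backtest(sample_logs, always_first)  # "A" hits twice, then misses
```

Swap `always_first` for a real model and `sample_logs` for real history, and the returned number is the thing you are trying to push to 57.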
Give me data or give me death
Quality data is the bedrock of backtesting and predictive algorithms. Let’s talk about where we can get some historical hit data first.
My understanding of the best place for historical play-by-plays is Retrosheet. It’s run by volunteers who make available logs on games going back to the 1800s. I can’t speak on the format - I didn’t download the .zip’s - but this sounds like good backtest fodder. At least you’ll know who hit when.
In addition to Retrosheet, here are 3 other great stat sources, as pointed out by sabr.org …
(that link at sabr.org is really worth reading too)
For batter-pitcher matchups, already categorized as “hot” or “cold”, see this page on Rotowire.
And I’m sure there’s a ton more. These are free. When I’ve gone to analyze football or soccer for betting, in those domains it seems like you have to pay for good data. In the baseball arena that apparently is not the case.
See my reading list later in this post for more data leads.
Algorithm approach ideas
So where do we start for algorithms? Below are some basic approaches from a related Quora entry.
Matchups - which players have hit the opposing pitcher well in the past?
Opposite dominant hands - which right-handed hitters are facing left-handed pitchers, or vice versa?
Hot hand - which hitters have an above-average number of hits in a recent time frame (e.g. this past week)?
BTS recommended players - what are the 4 recommended players and the 4 popular players listed on Beat the Streak’s website that day?
BTS player-pitcher batting average - which players have the best batting average against the opposing starting pitcher over the last 5 years? This is on Beat the Streak’s site too.
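A few of the approaches above (matchups, opposite dominant hands, hot hand) can be blended into one score per candidate. Here is a rough sketch, assuming you already have the counts in hand. The `Candidate` fields, the 50/50 weighting, the `platoon_bonus` of 0.02, and the `min_at_bats` cutoff are all numbers and names I made up for illustration; tuning them is exactly what a backtest is for.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    bats: str                  # "L", "R", or "S" for switch hitters
    pitcher_throws: str        # "L" or "R" (opposing starter)
    recent_hits: int           # hits in the recent window, e.g. the past week
    recent_at_bats: int
    hits_vs_pitcher: int       # career hits against today's starter
    at_bats_vs_pitcher: int


def score(c: Candidate, platoon_bonus: float = 0.02, min_at_bats: int = 10) -> float:
    """Blend hot-hand average, matchup average, and an opposite-hand bonus."""
    hot = c.recent_hits / c.recent_at_bats if c.recent_at_bats else 0.0
    # Tiny matchup samples are noise, so fall back to the recent average.
    if c.at_bats_vs_pitcher >= min_at_bats:
        matchup = c.hits_vs_pitcher / c.at_bats_vs_pitcher
    else:
        matchup = hot
    # Switch hitters ("S") never match the pitcher's hand, so they get the bonus.
    bonus = platoon_bonus if c.bats != c.pitcher_throws else 0.0
    return 0.5 * hot + 0.5 * matchup + bonus  # arbitrary illustrative weights


def best_pick(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if there are none."""
    return max(candidates, key=score, default=None)
```

A `best_pick` like this could be wrapped into whatever picker your backtest expects. It is a starting point, not a model I am claiming works.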
Insert code here…
This is where you start coding and I wrap up the blog post so you can get to it.
The logical choice is to write everything in Python (3!). I love Python. If you don’t know it already, get learning. This is a biased opinion though.
Remember, start with a backtesting framework. You’ll thank me later. Extrapolating your algorithm results over past data isn’t a guaranteed indication of anything but might save you a bunch of time.
Think - if you can tune now, that might be the difference between a 57-game streak and a 51-game streak.
Consider the approaches above. Consider combinations of the approaches above. But don’t limit yourself.
Evil thoughts on clones
Look, I make my living as an information security consultant (l33t haxx0r), so I’ve got to bring this up.
My understanding is it’s pretty easy to make a “Beat the Streak” account. They’re not asking for your SSN or doing any kind of serious individuality confirmation.
I have not really looked into this. But my hunch is someone could get away with having like 50+ accounts.
You just need to fake them convincingly. A little VPS (unique-ish IP address) for each “person”. Maybe from Vultr because it’s cheap. Maybe some Digital Ocean droplets for the same reason, and to spice up the IP ranges.
Stagger the account creation so you don’t have dozens of accounts with the same creation date.
You can access the game by web app so we don’t have to emulate mobile stuff even. Each “person” has their times when they tend to play, they have a browser (user agent string!) they like to use, some are better at updating it than others (i.e. bump the UA string sometimes). Get creative - you’re playing G0D and making (fake) life, in a way.
50+ chances to guess right with whatever your model is beats just 1. However, if you get close to beating the streak - say even 30 good guesses - MLB.com or whoever will likely perform an audit. I don’t know anyone in cyber world who’s pining to work for MLB.com and certainly not anyone worth their salt. So I’m just saying… be smart about your multiple identities and chances are you’ll get away with it.
But there’ll always be that thought in the back of your mind, as you get close, whether you’ll get caught and be out $5.6 million.
Don’t let this article be found in your browser history.
If you’re going down this rabbit hole, I enjoyed reading the following and think you will too.
I don’t want to understate how freakishly difficult this is, even if you produce a crazy good model. Because probabilities, people.
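To put a number on "because probabilities", here is a back-of-the-envelope sketch. It assumes every pick is independent and succeeds with the same probability `p`, which is a simplification, and it only covers one 57-pick run from a cold start, not the chance of a streak showing up somewhere in a long season of picks. The function names are mine.

```python
STREAK = 57  # correct picks needed in a row


def streak_probability(p: float, n: int = STREAK) -> float:
    """Chance that n independent picks all succeed, each with probability p."""
    return p ** n


def required_accuracy(target: float, n: int = STREAK) -> float:
    """Per-pick success rate needed for an n-pick run to succeed with probability target."""
    return target ** (1.0 / n)


for p in (0.70, 0.80, 0.90):
    odds = 1 / streak_probability(p)
    print(f"per-pick accuracy {p:.0%}: about 1 in {odds:,.0f}")
```

Under those assumptions, even a model that is right 80% of the time finishes a given 57-pick run only about 3 times in a million. `required_accuracy` flips the question around: pick the odds you could live with and see how good each individual pick has to be.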
That’s why I’m not spending time on this (at the moment). Other priorities with better odds of return.
Still - these chances are likely better than your state lottery. The pursuit will be more fun. You’ll learn something about coding and modeling even if you don’t become a millionaire.