Rubber Meeting the Road
Testing Results vs. Fiscal Reality
A very good friend pointed out that while testing results are great, until you have actual results you really don’t have anything. He also stressed being clear about what you are posting so there is no ambiguity between predictions on historical matches and the model’s performance on future ones.
Both are crucial points. With that in mind, I wanted to spend the bulk of this post on the historical side, as that is what drives the training and evaluation process. I will then end the post with the initial set of predictions made by the selected models for each league.
All of the results reported in this post are based on the historical match dataset and in no way represent profits I have actually made. Hopefully that will be handled in future posts, lol.
And again, this is sports, folks. You may have the best model in the world, but it cannot fully account for the humans in the equation. And thank goodness for that!
Training/Testing Split and the Process Itself
I talked quite a bit in a prior post about feature engineering and the underlying stats, so I am not going to delve too deeply into that aspect here. Instead I want to focus on the process of training, testing, and result evaluation. So let’s start with an overall diagram of the process:
This process runs up to 3,500 times for a league as the simulations work to refine the feature set down to the 25 most impactful and profitable features. Think of it as a sifter: we are sifting out the features that aren’t making an impact and keeping those that push the model forward, based on the evaluation metrics described below.
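To make the sifter idea concrete, here is a minimal sketch of what such a loop could look like. This is purely illustrative and not my actual pipeline: the classifier choice, the 80/20 split, and the stand-in scoring function are all assumptions on my part.

import random
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

def sift_features(X, y, n_runs=3500, keep=25):
    """X: DataFrame of engineered features; y: 1 for a draw, 0 for a non-draw."""
    best_subset, best_score = None, float("-inf")
    for _ in range(n_runs):
        # Sample a candidate feature subset for this run.
        subset = random.sample(list(X.columns), keep)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[subset], y, test_size=0.2, stratify=y)
        model = GradientBoostingClassifier().fit(X_tr, y_tr)
        # Stand-in score only; the real evaluation blends accuracy,
        # precision, and value, as described later in this post.
        score = precision_score(y_te, model.predict(X_te), zero_division=0)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset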
Revisiting Draws, Accuracy, Precision and Value
You will recall that a couple of the driving factors that made draws so attractive were 1) the favorable odds you receive and 2) being able to treat the problem as binary rather than multi-class classification. Given the typical odds for a draw, one could expect a minor level of profitability by getting just 1 in 3 predictions correct (a rather low 33%), and any improvement on that should drive a much higher return.
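As a quick back-of-the-envelope check (the 3.20 decimal odds below are purely illustrative, not the actual odds in the dataset):

# Expected profit per $100 bet at an assumed 33% hit rate and 3.20 draw odds.
stake, odds, hit_rate = 100, 3.20, 1 / 3
expected_profit = hit_rate * (stake * odds - stake) - (1 - hit_rate) * stake
print(expected_profit)  # roughly +$6.67 per $100 wagered

So even at a bare 1-in-3 hit rate the ledger stays slightly positive under those assumed odds, and every point of improvement compounds from there.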
The pipeline utilizes 3 main metrics to filter down from a full cross-section of 3,500 model executions to a single one containing the “golden-ticket” feature set. Let me provide a little foundation first. The following diagram depicts the held-out test set graphically, with the matches ending in a draw having a green background (POSITIVE) and the matches not ending in a draw in red (NEGATIVE). The predictions made by the model maintain that color-coding (green in this case does not represent a ‘win’ by the model) and are grouped by the prediction itself, draw vs. non-draw. The key takeaways here are the concepts of TRUE/FALSE POSITIVE and TRUE/FALSE NEGATIVE. Simply put (a short code sketch after these definitions shows the same thing):
True Positive - Model accurately predicted a draw would occur.
False Positive - Model predicted a draw but the result was a non-draw.
True Negative - Model accurately predicted a non-draw would occur.
False Negative - Model predicted a non-draw but the result was a draw.
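For anyone who prefers code to diagrams, here is a tiny sketch of how each prediction lands in one of the four buckets, with a draw treated as the positive class (the 1/0 encoding and names are mine, not the pipeline’s):

def confusion_counts(y_true, y_pred):
    """y_true/y_pred: sequences of 1 (draw) and 0 (non-draw)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted draw, was a draw
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted draw, was not a draw
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # predicted non-draw, was not a draw
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted non-draw, was a draw
    return tp, fp, tn, fn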
With that backdrop, let’s explore the three metrics used during evaluation. First, a few definitions:
PR = The total number of matches that ended in a draw (Positive Results)
TP = True Positive
FP = False Positive
Accuracy
Accuracy, as used here, defines the percentage of actual draws that the model managed to predict (in conventional terms this is closer to recall). Mathematically stated: TP / PR, or the number of True Positives divided by the total number of Positive Results (matches that ended in a draw). PR is basically the total surface area we have to make predictions on.
Precision
Precision defines how well the model does when it actually makes a prediction. Think of it as the model’s winning percentage, mathematically stated as TP / (TP + FP), or the number of True Positives (draw predictions that were actually draws) divided by the total number of draw predictions made.
Value
Value is the monetary value, or profit, that the model would have made if it had actually been wagering on the matches in the test set. Every match has a stored decimal odds value for the draw result, so value is the sum across all draw predictions, taking into account whether each prediction was correct (a profit) or incorrect (a debit). For test purposes I settled on a wager of $100, so each winning prediction earns (100 * the decimal odds value - 100). The ‘- 100’ removes the wager itself from the payout so the “ledger” records only pure profit or loss. In the case of a loss, the $100 wager is deducted from the running total.
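Pulling the three metrics together, a minimal sketch of how they could be computed from the test-set predictions might look like the following. The function and argument names are mine, and the $100 flat stake matches the assumption above:

def evaluate_metrics(y_true, y_pred, draw_odds, stake=100):
    """y_true/y_pred: 1 for draw, 0 for non-draw; draw_odds: decimal odds per match."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pr = sum(y_true)                                  # matches that actually ended in a draw
    accuracy = tp / pr if pr else 0.0                 # share of actual draws the model caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # winning percentage on draw predictions
    value = sum(
        (stake * odds - stake) if t == 1 else -stake  # pure profit on a win, lose the stake otherwise
        for t, p, odds in zip(y_true, y_pred, draw_odds)
        if p == 1                                     # only count matches where a draw was predicted
    )
    return accuracy, precision, value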
Current State of Affairs and League Model Performance
The following table depicts the current performance of each league we are mining. It is sorted by the overall value described above and contains columns for all three of the metrics used in the evaluation process.
The columns in the table are hopefully straightforward, but for completeness’ sake let me define them:
League: League Name
Matches: The total number of matches in the league data set
Draws: The total number of draws in the league data set
Draw %: Percentage of draws that occur in the data set
Predictions: Total number of draw predictions that were made by the model
Wins: The total number of draw predictions that were correct
Accuracy, Precision and Value are defined above.
Return on Bet: The theoretical amount you would walk away with on every $100 wager made.
OK, time to be completely clear. These are THEORETICAL values based on the historical test set and may not translate to future matches in the same manner at all. That is what the next several weeks and blog posts will target: the ACTUAL performance of the models versus the results we are seeing in the testing phase.
And finally, tying it back to the title of the post and making the rubber meet the road, here are the predictions made by the model for this weekend:
It’s a little nerve-wracking actually posting these ahead of time; my thoughts drift back to the movie Office Space and its dreaded misplaced-decimal bug, but I have to do it at some point. Results could be through the roof, could crash spectacularly, or, most likely, will fall somewhere in between. We shall see….