A Late Data Cup 2021 Submission
My dataset consists of 75,000 events from 41 OHL games played by the Erie Otters in 2019–2020. It was made available by the NHL for this year’s annual Data Cup competition. The competition ended in March, but I decided to use the dataset anyway because it was part of what motivated me to take data science more seriously. “I can do this... but how can I do it better?”
In any case, I decided to build three models. They are:
1. The probability of a goal conditional on a shot.
2. The probability of a completed pass conditional on a pass attempt.
3. The expected number of shots in the next 10 seconds of play.
Feature Engineering
Before I could predict what I wanted to know I had to transform the data and engineer features. The features I created that were common to all models were:
1. Total seconds remaining in the period and game
2. The prior two events
3. The x,y coordinates of the prior two events
4. The seconds elapsed since the last event
5. The home skater advantage
6. The score differential
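A minimal sketch of how the lagged and clock features above might be built. The field names (`period`, `clock`, `event`, `x`, `y`), the sample rows, and the 20-minute regulation periods are my assumptions for illustration, not the dataset's actual schema or the code in the notebooks.

```python
REGULATION_PERIODS = 3
PERIOD_SECONDS = 20 * 60  # assumed 20-minute periods


def add_common_features(events):
    """Return copies of the events with lagged and clock features added.

    `events` is a chronological list of dicts from one game with
    hypothetical keys: period, clock (seconds left in the period),
    event, x, y.
    """
    rows = []
    for i, ev in enumerate(events):
        row = dict(ev)
        periods_left = max(REGULATION_PERIODS - ev["period"], 0)
        row["game_seconds_left"] = ev["clock"] + periods_left * PERIOD_SECONDS
        # Type and coordinates of the prior two events (None at game start).
        for lag in (1, 2):
            prev = events[i - lag] if i - lag >= 0 else None
            row[f"prev{lag}_event"] = prev["event"] if prev else None
            row[f"prev{lag}_x"] = prev["x"] if prev else None
            row[f"prev{lag}_y"] = prev["y"] if prev else None
        prev = events[i - 1] if i > 0 else None
        same_period = prev is not None and prev["period"] == ev["period"]
        # The clock counts down, so elapsed time is previous minus current.
        row["seconds_since_last"] = (
            prev["clock"] - ev["clock"] if same_period else None
        )
        rows.append(row)
    return rows


# Made-up events, purely for illustration.
sample = [
    {"period": 1, "clock": 1200, "event": "Faceoff Win", "x": 100, "y": 42},
    {"period": 1, "clock": 1195, "event": "Play", "x": 90, "y": 30},
    {"period": 1, "clock": 1192, "event": "Shot", "x": 170, "y": 40},
]
features = add_common_features(sample)
```

Leaving the lagged values as `None` at the start of a game, rather than filling them in, keeps the choice of how to handle missing history explicit.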
For the pass model I also created a Pass Angle feature (in radians). For the Expected Shots in the Next 10 Seconds model I engineered the target variable, which required a nifty for-loop to sum the shots in qualifying rows!
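Both of these can be sketched in a few lines. This is an illustration under assumptions rather than the notebook code: the key names are hypothetical, the angle is measured from the positive x-axis, and the target counts every later "Shot" event in the same period, by either team, excluding the current event.

```python
import math


def pass_angle(x, y, target_x, target_y):
    """Angle of the pass in radians, measured from the positive x-axis."""
    return math.atan2(target_y - y, target_x - x)


def shots_in_next_window(events, window=10):
    """For each event, count shots in the following `window` seconds.

    `events` are chronological dicts with hypothetical keys: period,
    clock (seconds left in the period, counting down) and event.
    """
    counts = []
    for i, ev in enumerate(events):
        total = 0
        for later in events[i + 1:]:
            if later["period"] != ev["period"]:
                break  # do not look across period boundaries
            if ev["clock"] - later["clock"] > window:
                break  # events are ordered, so nothing later qualifies
            if later["event"] == "Shot":
                total += 1
        counts.append(total)
    return counts


# Made-up events, purely for illustration.
sample = [
    {"period": 1, "clock": 600, "event": "Play"},
    {"period": 1, "clock": 596, "event": "Shot"},
    {"period": 1, "clock": 591, "event": "Shot"},
    {"period": 1, "clock": 580, "event": "Shot"},
]
targets = shots_in_next_window(sample)
angle = pass_angle(0, 0, 10, 10)
```

The early `break` relies on the events being sorted by time, so the inner loop only scans the rows that can fall inside the window.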
Metrics to Evaluate the Models
The Goal Probability Model
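Because goals are so infrequent, accuracy is a poor guide: a model that always predicts “no goal” would score well on it. I therefore decided to focus on the area under the Receiver Operating Characteristic (ROC) curve, usually abbreviated AUC.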
The Pass Completion Model
Passes are usually completed, but the outcome of each pass is still subject to inherent randomness. I therefore chose to focus again on the ROC curve and the area under it.
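To make the metric concrete, here is a small standard-library sketch of the AUC using the rank-sum (Mann-Whitney) identity: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The labels and scores are made up, and in practice a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Tied scores share the average of the ranks they span.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up data: two goals among eight shots.
labels = [0, 0, 0, 0, 1, 0, 0, 1]
model_scores = [0.02, 0.05, 0.03, 0.10, 0.30, 0.04, 0.20, 0.15]
constant_scores = [0.1] * len(labels)

model_auc = roc_auc(labels, model_scores)
constant_auc = roc_auc(labels, constant_scores)
```

A model that gives every shot the same score lands at 0.5 no matter how rare goals are, which is why the AUC is harder to flatter than accuracy on imbalanced outcomes.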
The Expected Shots Model
Because the number of shots is a count, I used Poisson deviance to establish a baseline and compare my models. In the process I learned that XGBoost models are especially flexible! You can specify a number of different objectives and evaluation metrics.
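As a rough illustration of the baseline comparison, the sketch below computes the mean Poisson deviance for a constant prediction (the average shot count) and for a set of model predictions. All numbers are invented. XGBoost does offer a `count:poisson` objective and a `poisson-nloglik` evaluation metric, though whether these were the exact settings used here is my assumption.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: 2 * mean(y * log(y / mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        # The y * log(y / mu) term is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(y_true)


shots = [0, 0, 1, 0, 2, 0, 1, 0]  # made-up shot counts
baseline = [sum(shots) / len(shots)] * len(shots)  # always predict the mean
model = [0.1, 0.2, 0.8, 0.2, 1.5, 0.1, 0.9, 0.2]  # made-up predictions

baseline_dev = mean_poisson_deviance(shots, baseline)
model_dev = mean_poisson_deviance(shots, model)
# Share of the baseline deviance that the model removes.
explained = 1 - model_dev / baseline_dev
```

Predictions must be strictly positive for the deviance to be defined, which a Poisson objective with a log link guarantees.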
Model Results
Goal Model:
My goal model ended up performing well on the test set! Here is a comparison between the XGBoost Classifier and a simple logistic regression model.
Much to my satisfaction, the interaction of the x- and y-coordinates when predicting whether a shot will result in a goal matches what you’d expect if you’ve ever played or watched a hockey game for more than thirty seconds. Note that all x, y coordinates below are within the offensive zone.
Curiously, the goal probability declines faster as you go south on the above plot, but this is not entirely unexpected. Roughly 65% of players shoot left-handed, so shots to the north will be more powerful for the majority of players in the OHL.
Here’s another plot showing the same effect in more dramatic fashion.
The Pass Model
My pass model performed extremely well on the test set. Below is the ROC curve for the XGBoost model.
This result is comparable to the result achieved by a Data Cup finalist working in R with a BART (Bayesian Additive Regression Trees) model. Not quite as good, but comparable.
When examining the interaction of feature variables I discovered relationships that resemble what an experienced hockey player would expect. The likelihood of a pass completion drops off sharply when the pass target is in the center of the ice and increases significantly towards the boards and corners.
The dataset included each player who participated in an event, so with my passing model I computed the expected pass completion probability for every pass each player attempted. I then calculated “pass completions above expectation per 100 pass attempts” for every Erie Otters player.
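The calculation itself is short. In this sketch each pass is a tuple of player, outcome (1 if completed, 0 otherwise) and the model's predicted completion probability; the tuple layout, player names and numbers are placeholders for illustration.

```python
from collections import defaultdict


def completions_above_expectation_per_100(passes):
    """Map each player to (actual - expected completions) per 100 attempts.

    `passes` is an iterable of (player, completed, predicted_probability).
    """
    attempts = defaultdict(int)
    actual = defaultdict(float)
    expected = defaultdict(float)
    for player, completed, probability in passes:
        attempts[player] += 1
        actual[player] += completed
        expected[player] += probability
    return {
        player: 100 * (actual[player] - expected[player]) / attempts[player]
        for player in attempts
    }


# Made-up passes for two placeholder players.
passes = [
    ("Player A", 1, 0.70),
    ("Player A", 1, 0.60),
    ("Player A", 0, 0.90),
    ("Player A", 1, 0.80),
    ("Player B", 0, 0.75),
    ("Player B", 1, 0.75),
]
ratings = completions_above_expectation_per_100(passes)
```

Scaling by attempts puts players with very different pass volumes on the same footing, although players with few attempts will still have noisy ratings.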
Jamie Drysdale and Maxim Golod, the two players leading the team in Passes Completed Above Expectation, are considered the top prospects on the Erie Otters by NHL scouts.
As an exercise, I also used Grid Approximation to do a Bayesian update on the passing skill of Aidan Campbell, a player with only 200 pass attempts (note my prior here could easily be questioned: this was an exercise).
We learned a little bit: Aidan is less likely to be much better or much worse than an average player in the dataset with his pass attempt risk profile.
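For readers unfamiliar with grid approximation, here is a minimal sketch of the idea. It is not the model from the notebook: it treats a player's completion probability as a single unknown, uses a Beta-shaped prior whose centre and strength are arbitrary choices, and applies a binomial likelihood. The counts are invented.

```python
import math


def grid_posterior(completions, attempts, prior_mean, prior_strength,
                   grid_size=201):
    """Grid-approximate the posterior over a completion probability.

    The prior is a Beta-shaped curve centred on `prior_mean`;
    `prior_strength` acts like a number of pseudo-attempts. Both are
    illustrative choices.
    """
    # Midpoints of equal-width cells, so p is never exactly 0 or 1.
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    a = prior_mean * prior_strength
    b = (1 - prior_mean) * prior_strength
    log_post = [
        # log prior + log binomial likelihood, dropping constant terms
        (a - 1 + completions) * math.log(p)
        + (b - 1 + attempts - completions) * math.log(1 - p)
        for p in grid
    ]
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]  # avoids underflow
    total = sum(weights)
    return grid, [w / total for w in weights]


# Made-up numbers: 150 completions in 200 attempts, prior centred on 0.72.
grid, posterior = grid_posterior(150, 200, prior_mean=0.72, prior_strength=50)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
```

With 200 attempts the data outweigh a prior worth 50 pseudo-attempts, so the posterior mean sits closer to the observed rate than to the prior centre while staying narrower than the prior.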
The Shots Model
My shots model was more modest in its ability to predict the future. It improved upon the baseline by explaining roughly 10% of the variance in the shot count over the next 10 seconds. The most important feature by far was the x-coordinate of the event.
The interaction of the features was nevertheless illuminating. The graph below shows both the discontinuity in the effect of the x-coordinate and the stratification of the influence of each event type.
Conclusion
My models are pretty good and the interaction of the features is intuitive and consistent with domain knowledge. The models also allow for player comparisons that accord with the consensus opinions of hockey experts. That’s very satisfying.
There is much more that could be done with this dataset and I look forward to further exploration.
All of the notebooks for the above analysis can be found on my GitHub here.