Case Study: Machine Learning Applications — Making NBA Predictions


Executive Summary

Oursky was commissioned by a client to develop a machine learning algorithm to predict NBA game results. To achieve this goal, we built a tailor-made machine learning model that predicts the probability of each team winning an NBA game and presents the rationale behind each prediction. Our model outperformed several other published models, including FiveThirtyEight's predictions, both in backtesting on past NBA seasons (>67% accuracy) and in the first season of production use (>80% accuracy).

Through this case study, we will also illustrate how a typical AI project is carried out from scratch.

Challenges

What is the Background?

Making predictions and forecasts is one of the top machine learning applications. By using statistical models and algorithms, machine learning can predict likely outcomes and trends. Many previous cases show that machine learning can help predict stock prices, forecast sales in business and finance, and even improve patient care by predicting health conditions.

We decided to try applying machine learning to predicting NBA game results. The NBA is one of the most popular sports leagues in the world, so it is not surprising that NBA fans worldwide are eager to know who will be the champion each season. If machine learning could make accurate predictions of NBA game results, it would create more excitement and engagement for NBA fans all over the world.

The two main objectives of this project are:

  • develop a machine learning model that predicts the winning/losing probability of a team
  • interpret the “reasons” behind those predictions

Since the Oursky team is full of NBA fans, this project was very interesting to us. With our expertise and solid experience in AI projects, we wanted to explore how accurately NBA games can be predicted, and to understand more about machine learning's capabilities and limitations in making predictions. We can also further explore the opportunities of applying machine learning to more dynamic situations and create more business value with technology.

Solutions

How to Start with AI Projects?

Many people have thought of doing an AI project but don't know where to start. Below are the typical steps to go through when starting an AI project:

  1. Pre-Evaluation: For a machine learning project, the first step is usually evaluating the project idea and confirming that we have enough data for later training and testing. If the project or business idea comes from a client, we normally analyse the case and propose 2–3 possible solutions for how AI can be applied to their problem.
  2. Gathering Data: Depending on the nature of the project, we define what data is relevant to feed the machine learning models. Datasets can be obtained from different sources such as free open data or paid databases. Some projects also require internal data from our clients. For example, in another AI project we developed a sales estimation and product recommendation model for a retail client; its datasets consisted mostly of the client's historical transaction records.
  3. Data Cleansing: Cleaning and organising the raw data is required before training any ML model. It helps us sort out missing/extreme values, noise or potential data errors. The dataset is then divided into training and testing sets (see the sketch after this list).
  4. Choosing Model: After fully analysing the project scope and data, we determine which model(s) could be used for training. We may try more than one model and experiment with different designs if there is no well-established model design proven good enough for the problem.
  5. Training: For most supervised learning models, training means feeding the model batches of training data and optimising the model's parameters in the process.
  6. Evaluation: The testing data set is used to validate that the built model provides meaningful results. A validation set is fed to the trained model to test its accuracy. We also evaluate how to modify the model during training (e.g. optimise the parameters or choose another, more suitable model).
  7. Final Modeling: If the concept is proven to work with good enough accuracy, we treat it as the final model and use it in production. If it is a client project, we deliver the final model to the client, who can then feed it production data and make predictions.
  8. Post-Evaluation: At the end of the project, we also produce a review report covering how we trained the models, the results, and recommendations on how to optimise them in the long term.
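
The sketch below illustrates step 3 with a minimal pandas workflow. The file name, column names and thresholds are hypothetical, not the project's actual schema; for time-series data such as game logs, a chronological split avoids leaking future games into training.

```python
# Minimal sketch of data cleansing and splitting (step 3).
# All column names here are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

logs = pd.read_csv("player_game_logs.csv")          # raw per-player box scores

# Drop rows with missing core stats and remove obvious data errors.
logs = logs.dropna(subset=["points", "minutes", "fouls"])
logs = logs[logs["minutes"].between(0, 60)]         # no player exceeds 60 min

# Hold out a test set; shuffle=False keeps the chronological order intact.
logs = logs.sort_values("game_date")
train, test = train_test_split(logs, test_size=0.2, shuffle=False)
```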

Is there an accuracy commitment for Machine Learning Prediction projects?

Nowadays, many machine learning projects are built on existing models, so you can usually obtain a certain accuracy with a proven model. However, it is essential to understand the capabilities of a machine learning project and set the right expectations if you are starting your own. For a new machine learning project built from scratch, there is no guarantee of accuracy, as we do not yet know the quality of the training data, or even whether the data and the prediction target are actually correlated. Therefore, a completely new machine learning project should be regarded as a proof-of-concept (PoC) process. (If interested, see the “black box” problem in the technical section below.)

In this project, we assumed that implicit information derived from historical NBA records is related to the probability of a team winning or losing an NBA game. The project goal was to prove whether it is feasible to predict the results of NBA games with a scientific and systematic approach.

How Does Machine Learning Work In This Project?

Following the process outlined above, we worked through the NBA prediction project step by step:

Pre-evaluation

  • Deep learning, a subset of machine learning, has the strength to learn from raw input features in its hidden layers without domain knowledge. We therefore proposed a few models, including a team-based approach, a player-based approach and a network-based approach.

Gathering Data & Data Cleansing

  • For NBA predictions, we want to predict both the margin of victory (MOV) and the winner of the game. The margin of victory is the difference between the scores of the winning team and the losing team, and helps to determine the significance of the victory (a derivation sketch follows this list).
  • We selected players' logs for each game starting from 1983 (the first year with complete box-score entries) from Basketball Reference as data inputs.
  • Each log contains the box score and other information about the player's performance in the game, including timestamp, experience, attempts, fouls, etc. We then proceeded to clean the data.
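
As a minimal illustration of how the two prediction targets relate, the sketch below derives the MOV and the winner label from game scores, following the home-minus-away convention used by the regression model described later; the scores shown are made up.

```python
# Hypothetical game-level data: one row per game.
import pandas as pd

games = pd.DataFrame({
    "home_score": [110, 98, 121],
    "away_score": [102, 105, 119],
})
games["mov"] = games["home_score"] - games["away_score"]  # margin of victory
games["home_win"] = (games["mov"] > 0).astype(int)        # winner label
print(games)
```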

Choosing Model & Training

  • Since the MOV of a game does not necessarily reflect the strength difference between the competing teams, predicting the MOV alone is not very effective. In practice, the probability distribution output by the classification model can be treated as a confidence level for which team is going to win, which is easier to interpret than the MOV output by the regression model.
  • Please refer to Optional Read section for more technical details if you are interested.

Evaluation

  • We performed testing with data from 2013 to 2017, comparing three models with different parameter initialisations. Our models generally outperformed FiveThirtyEight and were ready to be fed production data for predictions.
  • Instead of just providing numbers, we also used the SHAP framework to give “reasoning” for the predictions, making the results more compelling. Please refer to the Optional Read section for further elaboration.

Post-Evaluation

  • At the end of the PoC, we generated an evaluation report covering the approaches we tried, the way we trained the models, the accuracy of the predictions, and suggestions on how to further optimise results in the future.

Results

For the last NBA season, our model achieved an overall accuracy of 80%, outperforming FiveThirtyEight. To further optimise the results, we are eager to try Temporal Convolutional Networks (TCNs) in the future, as they are more computationally efficient and robust than RNNs.

With the success of this project, we have validated the concept and foresee that machine learning projects like this could be extended to other sporting events such as football and MLB.

If you are interested in making predictions for your business, or are not sure how AI might help, you're welcome to make an appointment with our professional AI consultants to explore different options.


Optional Read

[TL;DR] This section covers the technical details about various feasible approaches and models being used. Recommended for those who have fundamental knowledge of machine learning.

Feasible Approaches

Considering teams and players as the two main entities in a basketball game, we studied three relevant approaches to forecasting:

Team-based approach

  • The team-based approach attempts to predict game results by evaluating the overall strength of a whole team. It aims to predict the margin of victory (MOV) with a linear regression formula, deriving the difference in strength between the opposing teams. Other factors, like home advantage and back-to-back games, are also taken into consideration (a minimal regression sketch follows).
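
A minimal sketch of the team-based idea follows, assuming hypothetical features: the rating gap between the teams, a home-court flag and a back-to-back flag, fitted with ordinary linear regression.

```python
# Team-based approach sketch: MOV as a linear function of team-strength
# difference plus contextual factors. All numbers are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: rating_home - rating_away, home-court flag, back-to-back flag.
X = np.array([
    [ 3.5, 1, 0],
    [-1.2, 1, 1],
    [ 5.0, 1, 0],
    [-4.1, 1, 1],
])
y = np.array([8, -7, 2, -12])           # observed MOV for each game

model = LinearRegression().fit(X, y)
print(model.predict([[2.0, 1, 0]]))     # predicted MOV for a new matchup
```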

Player-based approach

  • The player-based approach takes team composition into account. It considers the strength of each player as a key factor determining the integrated strength of the team. FiveThirtyEight, a popular website providing NBA predictions backed by statistical analysis, also uses this approach. Here, the key factors are average playoff experience and Elo ratings (the standard Elo update is sketched below).
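
For reference, the standard Elo update works as sketched below: the winner gains rating in proportion to how surprising the win was. The K-factor of 20 is a common choice for NBA ratings, not a value taken from this project.

```python
# Standard Elo rating update.
def elo_update(r_winner: float, r_loser: float, k: float = 20.0):
    # Expected win probability of the eventual winner before the game.
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)         # bigger shift for bigger upsets
    return r_winner + delta, r_loser - delta

print(elo_update(1600, 1500))   # favourite wins: small rating shift
print(elo_update(1500, 1600))   # upset: larger rating shift
```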

Network approach

  • The network approach treats a sports league as a network of players, coaches and teams. It uses the working relationships within this network to extract implicit information about each team and forecast its behaviour. In this approach, the key factors include team volatility, roster aggregate volatility, team inexperience, roster aggregate coherence, roster size, etc.

Machine Learning Models

The NBA is an extremely dynamic game with many attributes, so any single model may not be promising enough. We therefore proposed to use a deep learning algorithm for this project. Deep learning is a subset of machine learning and is generally used to teach machines to identify patterns or classify information.

We developed our solution around two predictive models:

Regression

  • To predict the margin of victory (the score difference between the home and away teams), we treated it as a regression problem. We modelled the MOV as a function of the inputs (players' logs of NBA games) and trained the neural network to predict the MOV. A positive MOV implies that the home team is likely to win.

Classification

  • To predict which team is going to win, we classify each NBA game as the home team winning or losing, which lets us forecast the winner of a game with a measure of confidence. This model outputs a probability distribution over the two categories: the home team winning or losing the game.
  • Unlike regression, where the output uses a linear activation function, classification uses a softmax function at the output of the network, so that it produces a normalised vector representing the probability distribution over the classes (a small sketch follows).
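
As a minimal illustration, the sketch below shows how a softmax turns the network's raw outputs (logits) into the normalised win/loss probabilities described above; the logit values are made up.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return z / z.sum()

print(softmax(np.array([2.1, 0.4])))      # ~[0.85, 0.15]: 85% home-win confidence
```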

Both models share the same Recurrent Neural Network (RNN) for player-strength encoding. An RNN is a type of neural network designed to model sequential data such as text or stock prices. We treated players' logs as time-series data that can be fed into the RNN cells. To overcome the vanishing and exploding gradient problems, we used long short-term memory (LSTM) cells instead of generic RNN cells.

We tested window sizes ranging from 1 to 200 to find the optimal window size. The model accuracy was also tested with data from 2017.

Our network passes the 26 players' logs through the RNN with weights shared among all the inputs. The RNN stacks 3 LSTM cells and outputs a 256-dimensional vector per player. The output vectors are then concatenated and fed to a fully connected network of 3 layers with 1,024 nodes each (a sketch follows).
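
The sketch below reproduces that description in PyTorch: one LSTM encoder shared across all 26 player logs, concatenated encodings, and a 3-layer fully connected head. The per-game feature dimension, window length and two-class output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class NBAPredictor(nn.Module):
    """Sketch: shared 3-layer LSTM per player, 256-dim encodings, 3x1024 MLP."""
    def __init__(self, feat_dim=40, hidden=256, players=26, n_classes=2):
        super().__init__()
        # A single LSTM applied to every player's log: weights are shared.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(players * hidden, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),       # logits for home win / home loss
        )
        self.hidden = hidden

    def forward(self, logs):
        # logs: (batch, players, window, feat_dim)
        b, p, w, f = logs.shape
        _, (h, _) = self.encoder(logs.reshape(b * p, w, f))
        enc = h[-1].reshape(b, p * self.hidden)   # final hidden state per player
        return self.head(enc)                     # (batch, n_classes)

model = NBAPredictor()
logits = model(torch.randn(8, 26, 50, 40))        # 8 games, window of 50 logs
probs = torch.softmax(logits, dim=-1)             # win/loss probabilities
```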

Explaining the prediction result

Explanation is important for validating predictions; it shows they are not just numbers produced by a blind guess. However, explaining prediction results is somewhat difficult.

The “black box” is a common problem with deep learning algorithms. Simply put, deep learning builds its own understanding of the data within a hierarchical architecture of neurons, so it is difficult to inspect how a deep learning algorithm analysed the data and accomplished the assigned task. To overcome the black-box problem, SHAP, a general framework for interpreting machine learning models through visualisation, is one possible solution.

By applying this framework, we can quantify the contribution of each input parameter to the prediction result with an activation heat map, and then analyse the heat map to figure out the most critical parameters (a minimal sketch follows).
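
A minimal sketch of this idea with the SHAP library follows. A trivial stand-in model is used so the snippet runs on its own; in practice the trained predictor would be passed instead, and SHAP's explainer support varies by layer type, so treat this purely as an illustration.

```python
import shap
import torch
import torch.nn as nn

# Stand-in model; input shapes match the architecture sketch above.
model = nn.Sequential(nn.Flatten(), nn.Linear(26 * 50 * 40, 2))

background = torch.randn(100, 26, 50, 40)    # reference games for the baseline
explainer = shap.GradientExplainer(model, background)

game = torch.randn(1, 26, 50, 40)            # the game to explain
shap_values = explainer.shap_values(game)    # contribution of each input value
# Large positive values mark the inputs pushing the prediction towards one class.
```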

Originally published at https://blog.oursky.com on November 26, 2019.
