NBA Analytics 1: Home vs. Away Analysis
Examining NBA Data to determine differences in Home vs. Away Games across seasons from 2004-2024.
For my Business Intelligence Capstone, my team and I performed an exploratory data analysis (EDA) of NBA game data from 2004 to 2024. The work included data transformation, visualization, and hypothesis testing. The goal was to determine whether home and away games differ meaningfully in box score statistics, and whether those differences merit further investigation in later capstone projects.
Data Collection and Cleaning
Data was collected from the NBA API, specifically team box scores. The following code can be used to access the data for your own analysis:
import pandas as pd
file_id = '1U2UaHWRSkUXfJBn4kBHPYttd3dvw_CZF'  # Google Drive ID of the shared CSV
url = f'https://drive.google.com/uc?id={file_id}'
df = pd.read_csv(url, encoding='utf-8')
A brief look at the percentage of missing values in each column showed that rows with missing (NA) data could be dropped without meaningfully changing the dataset. From there, we were ready to examine win percentages and possible predictor variables.
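The missing-data check described above can be sketched as follows. This is a minimal illustration rather than the team's actual code: the helper name `missing_percentages` and the toy columns `fg_pct` and `reb_off` are hypothetical stand-ins for the real box score fields.

```python
import pandas as pd

def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return df.isna().mean().mul(100).sort_values(ascending=False)

# Toy frame standing in for the box score data (hypothetical column names).
toy = pd.DataFrame({
    "fg_pct": [0.45, 0.51, None, 0.48],
    "reb_off": [10, 12, 9, 11],
})

pct_missing = missing_percentages(toy)
cleaned = toy.dropna()  # drop every row that has at least one missing value

print(pct_missing)
print(len(toy), "->", len(cleaned), "rows")
```

Checking the percentages before calling `dropna` makes it easy to confirm that only a small share of rows is being discarded.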
Examining Win Percentages for Teams
To examine the win percentages, the data was grouped by season, and the percentage of games won by home teams and by away teams was added to the dataframe. This resulted in the following table:
season | home_win_pct | away_win_pct |
---|---|---|
2004-05 | 60.182% | 39.818% |
2005-06 | 60.787% | 39.213% |
2006-07 | 59.573% | 40.427% |
2007-08 | 61.002% | 38.999% |
2008-09 | 61.217% | 38.783% |
2009-10 | 59.817% | 40.183% |
2010-11 | 60.777% | 39.223% |
2011-12 | 59.349% | 40.651% |
2012-13 | 61.398% | 38.602% |
2013-14 | 57.835% | 42.165% |
2014-15 | 57.722% | 42.278% |
2015-16 | 59.423% | 40.577% |
2016-17 | 58.145% | 41.856% |
2017-18 | 58.405% | 41.595% |
2018-19 | 59.182% | 40.818% |
2019-20 | 54.177% | 45.823% |
2020-21 | 54.382% | 45.618% |
2021-22 | 55.364% | 44.636% |
2022-23 | 57.317% | 42.683% |
2023-24 | 55.190% | 44.810% |
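A grouping step that produces a table of this shape could look like the sketch below. It assumes one row per team per game and uses hypothetical column names (`season`, `is_home`, `win`), since the actual schema of the dataset is not shown here.

```python
import pandas as pd

def win_pct_by_season(games: pd.DataFrame) -> pd.DataFrame:
    """Home and away win percentages per season.

    Assumes one row per team per game with hypothetical columns:
    `season` (str), `is_home` (bool), and `win` (bool).
    """
    side = games["is_home"].map({True: "home_win_pct", False: "away_win_pct"})
    pct = (
        games.assign(side=side, win=games["win"].astype(float))
        .pivot_table(index="season", columns="side", values="win", aggfunc="mean")
        .mul(100)  # convert win share to a percentage
    )
    pct.columns.name = None
    return pct[["home_win_pct", "away_win_pct"]].reset_index()

# Toy data: four games in the first season, two in the second.
toy = pd.DataFrame({
    "season": ["2004-05"] * 8 + ["2005-06"] * 4,
    "is_home": [True, False] * 6,
    "win": [True, False, True, False, True, False, False, True,
            True, False, False, True],
})
print(win_pct_by_season(toy))
```

Because every game has exactly one winner, the two percentages in each row sum to 100.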
In the table, the home win percentage (home_win_pct) trends downward over the period, from roughly 60% in 2004-05 to roughly 55% in 2023-24, and the away win percentage (away_win_pct) rises correspondingly. This aligns with outside analyses, like Michael MacKelvie's excellent analysis of home games in the NBA, which reports that home win rates have been declining since the 1940s.
So what is causing this decline, and can we see it in our data? We can examine statistics in our dataset, like free throws, three-pointers, turnovers, and offensive and defensive rebounds, to see whether the group means and variances are the same for home and away games. If the means are the same, the statistic is unlikely to have a meaningful impact on home win percentage. If the means or variances differ, the statistic may have a meaningful impact on a team's win percentage at home or away.
Factors Contributing to Decline :: An Exploration
After splitting the dataset into separate pandas dataframes for home and away games, several tests were applied. The first step was to plot histograms and QQ plots of each box score statistic and examine the distributions visually. After that, two formal tests of normality, the Shapiro-Wilk test and the Anderson-Darling test (the latter used because of the large size of the data), were run to get a quantitative read on each distribution.
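The numbers behind a histogram and a QQ plot can be computed without any plotting library, as in the sketch below. The helper name `distribution_summary` and the synthetic samples are illustrative assumptions, not the team's data or code.

```python
import numpy as np
from scipy import stats

def distribution_summary(values, bins=20):
    """Histogram counts plus normal QQ-plot data for one statistic."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    # probplot pairs theoretical normal quantiles with the sorted sample and
    # fits a straight line through them; r near 1 means the points hug the line.
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    return {
        "counts": counts,
        "edges": edges,
        "theoretical": theoretical,
        "ordered": ordered,
        "qq_r": r,
    }

rng = np.random.default_rng(0)
normal_like = rng.normal(loc=0.46, scale=0.05, size=2000)  # synthetic, bell-shaped
skewed = rng.exponential(scale=10.0, size=2000)            # synthetic, right-skewed

print("QQ correlation, bell-shaped sample:", distribution_summary(normal_like)["qq_r"])
print("QQ correlation, skewed sample:", distribution_summary(skewed)["qq_r"])
```

To draw the plots themselves, `scipy.stats.probplot` also accepts a `plot` argument (for example a matplotlib axes object), and the `counts` and `edges` arrays can be passed to any bar-chart routine.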
The stats chosen were as follows:
- Field Goal Percentage: making more shots means more points and, hopefully, more wins.
- Three-Pointer Percentage: making higher-value shots should lead to higher scores and, hopefully, more wins.
- Free Throw Percentage: converting clutch and-one plays can give a team an edge in a close game.
- Offensive Rebounds: teams that get more chances to shoot, and score on them, likely improve their chances of winning.
- Defensive Rebounds: teams that secure defensive rebounds deny their opponents scoring opportunities, hopefully leading to larger score differentials in the rebounding team's favor.
- Contested Field Goals: scoring under pressure could indicate a team's ability to score in important games.
- Uncontested Field Goals: these shots are explored in the MacKelvie video, where he states that uncontested field goals are a robust metric for tracking a team's scoring ability while controlling for the effect a crowd may have on a game.
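Shapiro-Wilk Testing
The Shapiro-Wilk Test of Normality is defined as:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$
where $x_{(i)}$ is the $i$-th smallest value in the sample, $\bar{x}$ is the sample mean, and the $a_i$ are constants derived from the expected order statistics of a standard normal sample. The null hypothesis $H_0$ is that the sample comes from a normal distribution.
The test computes the test statistic, $W$, and returns a $p$-value for us to interpret. A $p$-value below the significance level $\alpha$ tells us to reject the null hypothesis $H_0$, meaning we cannot assume normality. This was implemented with the following code: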
def run_shapiro_test(l: list):
    """Run the Shapiro-Wilk test on [home, away] samples and print the results."""
    from scipy.stats import shapiro
    w_home = shapiro(l[0])  # l[0]: home-game values
    w_away = shapiro(l[1])  # l[1]: away-game values
    print(f'Home Games:\n Test Stat (W): {w_home.statistic},\n p-value: {w_home.pvalue}\n')
    print(f'Away Games:\n Test Stat (W): {w_away.statistic},\n p-value: {w_away.pvalue}\n')
Anderson-Darling Testing
The Shapiro-Wilk Test is useful for determining normality of a sample, but has one important limitation: its $p$-values may be inaccurate for $n > 5000$. For this reason another test, the Anderson-Darling (AD) Test, is also used. The Anderson-Darling test statistic is defined as:
$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i - 1)\left[\ln F\left(x_{(i)}\right) + \ln\left(1 - F\left(x_{(n+1-i)}\right)\right)\right]$$
where $x_{(1)} \le \dots \le x_{(n)}$ are the sorted sample values and $F$ is the cumulative distribution function of the distribution being tested.
AD testing checks whether a sample comes from a specified distribution. We provide a sample along with the distribution to test against (the normal distribution in our case) and get a result indicating whether the sample is consistent with that distribution.
We are again using scipy.stats for this test, via the anderson function. The function returns the test statistic, an array of critical values, and an array of the corresponding significance levels (in percent). If the test statistic is larger than the critical value for a given significance level, the null hypothesis is rejected at that level.
def run_ad_test(l: list):
    """Run the Anderson-Darling normality test on [home, away] samples and print the results."""
    from scipy.stats import anderson
    home = anderson(l[0], dist='norm')  # l[0]: home-game values
    away = anderson(l[1], dist='norm')  # l[1]: away-game values
    print(f'Home Games:\n Test Statistic: {home.statistic},\n Critical Values: {home.critical_values},\n Significance Level: {home.significance_level}\n')
    print(f'Away Games:\n Test Statistic: {away.statistic},\n Critical Values: {away.critical_values},\n Significance Level: {away.significance_level}\n')
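The decision rule for the Anderson-Darling output (compare the statistic with the critical value at the chosen significance level) can be wrapped in a small helper. The sketch below is an illustration under assumptions: the helper name `ad_rejects_normality` is hypothetical, and the two samples are constructed deterministically rather than taken from the NBA data.

```python
import numpy as np
from scipy.stats import anderson, norm

def ad_rejects_normality(sample, level_pct=5.0):
    """True if the AD statistic exceeds the critical value at `level_pct` percent."""
    result = anderson(np.asarray(sample, dtype=float), dist="norm")
    levels = np.asarray(result.significance_level, dtype=float)
    idx = int(np.argmin(np.abs(levels - level_pct)))  # position of the requested level
    return bool(result.statistic > result.critical_values[idx])

n = 500
# Evenly spaced normal quantiles: a sample that is normal-shaped by construction.
normal_shaped = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
# Exponentiating gives a strongly right-skewed (log-normal) sample.
skewed = np.exp(normal_shaped)

print("Reject normality (normal-shaped sample):", ad_rejects_normality(normal_shaped))
print("Reject normality (skewed sample):", ad_rejects_normality(skewed))
```

Looking up the critical value by its significance level, rather than by a fixed array position, keeps the rule readable and avoids depending on the order in which the levels are returned.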
Results
The results of these tests can be found in the following table:
Statistic | Home SW p-value | Away SW p-value | Home AD Test Statistic | Away AD Test Statistic |
---|---|---|---|---|
Field Goal Percentage | 8.09e-12 | 1.71e-07 | 7.122 | 3.765 |
Three Pointer Percentage | 1.013e-23 | 7.835e-24 | 18.078 | 14.525 |
Free Throw Percentage | 6.013e-39 | 2.486e-40 | 40.930 | 39.009 |
Offensive Rebounds | 6.432e-48 | 7.994e-49 | 132.236 | 141.451 |
Defensive Rebounds | 3.868e-27 | 5.104e-29 | 43.834 | 46.083 |
Contested Field Goals | 1.879e-102 | 3.030e-102 | 2746.373 | 2714.624 |
Uncontested Field Goals | 1.426e-103 | 1.154e-103 | 2880.563 | 2888.399 |
From the results of each test, the team confirmed that none of these features is normally distributed: every Shapiro-Wilk p-value is far below 0.05, and the Anderson-Darling statistics are correspondingly large. This tells us that, for any future testing and modeling, the data will either need to be transformed or be used with non-parametric models and tests. This guided our choice of the hypothesis tests that follow, which look for differences between home and away games.
Hypothesis Testing
For hypothesis testing, we test for differences in group means and variances between home and away games for field goals, three pointers, free throws, offensive and defensive rebounds, and contested and uncontested field goals. Because none of the features is normally distributed, we use tests that do not assume normality: Levene's Test for group variances, the Kruskal-Wallis Test for group means, and Dunn's Test and Conover's Test for posthoc analysis.
Each function takes in a significance level, which defaults to 0.05.
Kruskal-Wallis Test
The Kruskal-Wallis Test is essentially a non-parametric one-way ANOVA: it compares groups using the ranks of the observations, so it does not assume normally distributed data. It does this by computing a test statistic, $H$, and has hypotheses:
$H_0$: the groups share the same central tendency (no difference between home and away games).
$H_1$: at least one group differs from the others.
The function for this test takes in a list of pandas.Series objects, one per group, for a given feature and runs the scipy.stats.kruskal function to retrieve the results. If the p-value of the test is below the significance level, we reject the null hypothesis and cannot assume equal group means. If the null hypothesis is rejected, Dunn's Test is run as a posthoc analysis to determine how the groups differ.
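A minimal sketch of such a function is shown below. The name `run_kruskal_test`, its return shape, and the synthetic samples are assumptions made for illustration; the team's original function may differ. NumPy arrays are used in place of pandas.Series to keep the example self-contained, and `scipy.stats.kruskal` accepts either.

```python
import numpy as np
from scipy.stats import kruskal

def run_kruskal_test(groups, alpha=0.05):
    """Kruskal-Wallis H-test across the samples in `groups`.

    Returns (H statistic, p-value, reject_null).
    """
    result = kruskal(*groups)
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)

# Synthetic stand-ins for one feature split into home and away games.
rng = np.random.default_rng(42)
home = rng.normal(loc=0.47, scale=0.05, size=1000)
away = rng.normal(loc=0.45, scale=0.05, size=1000)

h_stat, p_value, reject = run_kruskal_test([home, away])
print(f"H = {h_stat:.3f}, p-value = {p_value:.3g}, reject H0: {reject}")
```

Returning the decision alongside the statistic and p-value makes it straightforward to collect results for every feature into one table.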
Levene's Test
Levene's Test is a statistical test used to check equality of variances between groups, and is used here to check whether the variances of home and away games are the same. It computes a test statistic, Levene's $W$, and has hypotheses:
$H_0: \sigma_{\text{home}}^2 = \sigma_{\text{away}}^2$
$H_1: \sigma_{\text{home}}^2 \neq \sigma_{\text{away}}^2$
The function for this test takes in a list of pandas.Series objects, one per group, and runs the scipy.stats.levene function to retrieve the test results. If the p-value of the test is less than the significance level, we reject the null hypothesis and cannot assume equal group variances. If the null hypothesis is rejected, posthoc tests may need to be run.
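A function along these lines might look like the following sketch. The name `run_levene_test`, the return shape, and the synthetic samples are illustrative assumptions. The `center="median"` setting is the `scipy.stats.levene` default and is the variant usually recommended for skewed data.

```python
import numpy as np
from scipy.stats import levene

def run_levene_test(groups, alpha=0.05, center="median"):
    """Levene's test for equal variances across the samples in `groups`.

    Returns (W statistic, p-value, reject_null).
    """
    result = levene(*groups, center=center)
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)

# Synthetic stand-ins: same mean, different spread.
rng = np.random.default_rng(7)
home = rng.normal(loc=11.0, scale=3.0, size=1000)
away = rng.normal(loc=11.0, scale=4.0, size=1000)

w_stat, p_value, reject = run_levene_test([home, away])
print(f"W = {w_stat:.3f}, p-value = {p_value:.3g}, reject H0: {reject}")
```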
Dunn Test
Dunn's Test conducts a posthoc analysis after a Kruskal-Wallis test to see which pairs of groups differ.
The function for this test uses the scikit_posthocs package's posthoc_dunn function and returns the results.
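To show what Dunn's Test computes, the sketch below works out the statistic by hand for two groups using only scipy, instead of calling posthoc_dunn. It is an illustration under assumptions, not the team's function: the name `dunn_two_groups` and the synthetic count data are hypothetical, and no multiple-comparison adjustment is applied because there is only one pair.

```python
import numpy as np
from scipy.stats import kruskal, norm, rankdata

def dunn_two_groups(a, b):
    """Dunn's z statistic and two-sided p-value for two groups (tie-corrected)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    pooled = np.concatenate([a, b])
    ranks = rankdata(pooled)  # tied values receive their average rank
    mean_rank_diff = ranks[:n_a].mean() - ranks[n_a:].mean()
    # Tie correction: sum of (t^3 - t) over each set of t tied values.
    _, tie_counts = np.unique(pooled, return_counts=True)
    tie_term = np.sum(tie_counts**3 - tie_counts) / (12.0 * (n - 1))
    variance = (n * (n + 1) / 12.0 - tie_term) * (1.0 / n_a + 1.0 / n_b)
    z = mean_rank_diff / np.sqrt(variance)
    return z, 2.0 * norm.sf(abs(z))

# Synthetic count data with many ties, loosely resembling rebound counts.
rng = np.random.default_rng(3)
home = rng.poisson(lam=11, size=300)
away = rng.poisson(lam=10, size=300)

z_stat, dunn_p = dunn_two_groups(home, away)
print(f"Dunn z = {z_stat:.3f}, p-value = {dunn_p:.3g}")
print(f"Kruskal-Wallis p-value = {kruskal(home, away).pvalue:.3g}")
```

With exactly two groups, the squared Dunn statistic equals the Kruskal-Wallis $H$, so the two p-values coincide. The posthoc step becomes informative mainly when three or more groups are compared.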
Conover Test
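Conover's test is a non-parametric posthoc test, used here when Levene's test finds significant differences in group variances. We can again use scikit_posthocs to run it. Note that the Conover test in scikit_posthocs is a pairwise comparison based on rank sums, so its p-values are best read alongside Levene's result rather than as a separate measure of variance.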
Results:
Statistic | KW Test p-value | Dunn Test p-value | Levene Test p-value | Conover Test p-value |
---|---|---|---|---|
Field Goal Percentage | 4.179e-100 | 4.18e-100 | 0.003 | 1.61e-100 |
Three Pointer Percentage | 1.257e-13 | 1.26e-13 | 0.057 | N/A |
Free Throw Percentage | 0.003 | 2.92e-03 | 7.737e-05 | 2.92e-03 |
Offensive Rebounds | 1.309e-17 | 1.31e-17 | 0.001 | 1.28e-17 |
Defensive Rebounds | 3.423e-84 | 3.42e-84 | 0.101 | N/A |
Contested Field Goals | 1.371e-4 | 1.37e-04 | 5.233e-06 | 1.37e-04 |
Uncontested Field Goals | 2.759e-4 | 2.76e-04 | 1.367e-4 | 2.76e-04 |
Given the results of the distribution and hypothesis tests above, the chosen statistics are not normally distributed, and most differ significantly in both mean and variance between home and away games. This indicates a significant difference between these game types that should be examined further, and these stats could help explain why home win rates have declined over time.
The exceptions are Three Pointer Percentage and Defensive Rebounds. For these two, Levene's Test fails to reject the null hypothesis at the 0.05 level (p = 0.057 and p = 0.101), indicating similar variances between home and away games, although their means still differ significantly. There is not much to glean from this, aside from their variances not being of interest in future explorations.
Conclusion
With the tests of normality and of differences in means and variances complete, the team now has a springboard for analyzing differences between home and away games. While the non-normality of each feature is not a massive issue, since many modeling workflows include scaling or transformation steps, it is a factor to account for to prevent inaccuracy in future modeling and analysis.
From the hypothesis testing of group means and variances, we can conclude that every chosen statistic differs significantly in mean between home and away games, and that all but Three Pointer Percentage and Defensive Rebounds also differ significantly in variance. These differences should be investigated further to determine what is causing the disparity in win rates.
In future articles, we will be using Regression Analysis and Classification Methods to examine NBA stats further.
References
Statistical Tests:
- Shapiro-Wilk Test
- Anderson-Darling Test
- Kruskal-Wallis Test
- Levene's Test
- Dunn's Test
- scipy.stats Documentation
- scikit_posthocs Documentation