NBA Analytics 1: Home vs. Away Analysis

Examining NBA Data to determine differences in Home vs. Away Games across seasons from 2004-2024.


For my Business Intelligence Capstone, my team and I performed an exploratory data analysis (EDA) of NBA game data from 2004 to 2024. This included data transformation, visualization, and hypothesis testing. The goal of the analysis was to determine whether home and away games differ meaningfully in box score statistics, a question that could be investigated further in future projects during the capstone.

Data Collection and Cleaning

Data was collected from the NBA API, specifically Box Scores of teams. The following code can be used to access the data for your own analysis:

import pandas as pd

file_id = '1U2UaHWRSkUXfJBn4kBHPYttd3dvw_CZF'
url = f'https://drive.google.com/uc?id={file_id}'
df = pd.read_csv(url, encoding='utf-8')

A brief look at the percentages of missing data in the dataset revealed that missing or NA data could be dropped without making a meaningful difference in the data. From there, we were ready to examine win percentages and possible predictor variables.
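As a rough sketch of the missing-data check described above (not the team's actual code), the percentage of missing values per column can be computed before deciding whether dropping rows is safe. The helper name `missing_pct` and the toy column names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def missing_pct(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return frame.isna().mean().mul(100).sort_values(ascending=False)


# Toy frame standing in for the box-score data (column names are hypothetical).
toy = pd.DataFrame({
    "fg_pct": [0.45, 0.51, np.nan, 0.48],
    "reb": [40, 38, 44, 41],
})

print(missing_pct(toy))

# If the percentages are small, rows with any missing value can be dropped.
cleaned = toy.dropna()
```

On the real data, the same call would be `missing_pct(df)` followed by `df.dropna()`.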

Examining Win Percentages for Teams

To examine win percentages, the data was grouped by season, and the share of games won by home teams and by away teams was added to the dataframe. This resulted in the following table:

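| season  | home_win_pct | away_win_pct |
|---------|--------------|--------------|
| 2004-05 | 60.182%      | 39.818%      |
| 2005-06 | 60.787%      | 39.213%      |
| 2006-07 | 59.573%      | 40.427%      |
| 2007-08 | 61.002%      | 38.999%      |
| 2008-09 | 61.217%      | 38.783%      |
| 2009-10 | 59.817%      | 40.183%      |
| 2010-11 | 60.777%      | 39.223%      |
| 2011-12 | 59.349%      | 40.651%      |
| 2012-13 | 61.398%      | 38.602%      |
| 2013-14 | 57.835%      | 42.165%      |
| 2014-15 | 57.722%      | 42.278%      |
| 2015-16 | 59.423%      | 40.577%      |
| 2016-17 | 58.145%      | 41.856%      |
| 2017-18 | 58.405%      | 41.595%      |
| 2018-19 | 59.182%      | 40.818%      |
| 2019-20 | 54.177%      | 45.823%      |
| 2020-21 | 54.382%      | 45.618%      |
| 2021-22 | 55.364%      | 44.636%      |
| 2022-23 | 57.317%      | 42.683%      |
| 2023-24 | 55.190%      | 44.810%      |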
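A table like this could be produced along the following lines. This is only a sketch under an assumed schema: the column names `season`, `is_home`, and `win` and the function `win_pct_by_season` are hypothetical and may not match the actual dataset.

```python
import pandas as pd


def win_pct_by_season(games: pd.DataFrame) -> pd.DataFrame:
    """Home and away win percentages per season.

    Assumed (hypothetical) schema, one row per team per game:
      season  - season label such as "2004-05"
      is_home - True if the team played at home
      win     - True if the team won
    """
    home_rows = games.loc[games["is_home"]]
    home_win_pct = home_rows.groupby("season")["win"].mean() * 100
    out = pd.DataFrame({"home_win_pct": home_win_pct})
    # Every game has exactly one winner, so the away share is the complement.
    out["away_win_pct"] = 100 - out["home_win_pct"]
    return out.round(3)


# Tiny synthetic example: two games per season.
toy_games = pd.DataFrame({
    "season": ["2004-05"] * 4 + ["2005-06"] * 4,
    "is_home": [True, False, True, False] * 2,
    "win": [True, False, False, True, True, False, True, False],
})
season_table = win_pct_by_season(toy_games)
print(season_table)
```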

From the resultant table, the home win percentage (home_win_pct) trends downward over the period, from about 60% in 2004-05 to about 55% in 2023-24, and the away win percentage (away_win_pct) rises correspondingly. This aligns with outside analyses, like Michael MacKelvie's excellent analysis of home games in the NBA, which reports that home win rates have been declining since the 1940s.

So what is causing this decline, and can we see it in our data? We can examine statistics in our dataset such as free throws, three-pointers, turnovers, and defensive/offensive rebounds to see whether the group means and variances are the same for home and away games. If the means are the same, we can intuit that the statistic does not have a meaningful impact on home win percentage. If the means or variances are not the same, the statistic may be related to a team's win percentage at home or away and is worth investigating further.

Factors Contributing to Decline :: An Exploration

After splitting the dataset into separate pandas dataframes for home and away games, several tests were applied. The first step was to plot histograms and QQ plots of each box score statistic and examine the distributions visually. Once this was done, two formal tests of normality, the Shapiro-Wilk test and the Anderson-Darling test (the latter used because of the large size of the data), were run to get a quantitative assessment of each distribution.
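The visual check can be sketched as follows. This is an illustrative example on synthetic data, not the team's plotting code; the function name `plot_hist_and_qq` is an assumption, and the QQ plot relies on `scipy.stats.probplot`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


def plot_hist_and_qq(values, title):
    """Draw a histogram and a normal QQ plot side by side.

    Returns the figure and r, the correlation of the QQ points with
    their least-squares line (values near 1 suggest normality).
    """
    fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(values, bins=30)
    ax_hist.set_title(f"{title}: histogram")
    # probplot returns ((theoretical, ordered), (slope, intercept, r))
    _, (_, _, r) = stats.probplot(values, dist="norm", plot=ax_qq)
    ax_qq.set_title(f"{title}: QQ plot")
    fig.tight_layout()
    return fig, r


# Synthetic right-skewed data standing in for a box score statistic.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10, size=2000)
fig, r = plot_hist_and_qq(skewed, "Synthetic skewed stat")
# fig.savefig("hist_qq.png") would write the figure to disk.
```

Points that bend away from the straight line in the QQ panel are the visual counterpart of the low p-values reported by the formal tests.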

[Add example image of a Histogram and QQ Plot here, with a caption outlining insights from the distribution]

The stats chosen were as follows:

  • Field Goal Percentage: making more points = more wins, hopefully
  • Three-Pointer Percentage: making higher value points should lead to higher scores, and hopefully more wins
  • Free Throw Percentage: making clutch and-one plays can give teams an edge in a close game, therefore this stat was chosen.
  • Offensive Rebounds: teams that get more chances to shoot and end up scoring likely boost their chances of winning.
  • Defensive Rebounds: teams who get defensive rebounds deny scoring opportunities for their opponents, hopefully leading to larger score differentials in the rebounding team's favor
  • Contested Field Goals: Making points under pressure could indicate a team's ability to score in important games or when the team is under pressure.
  • Uncontested Field Goals: these shots are explored in the MacKelvie video, where he states that uncontested field goals are a robust metric for tracking a team's scoring ability, blocking the factor that a crowd may have on a game.

Shapiro-Wilk Testing

The Shapiro-Wilk Test of Normality is defined as:

\begin{align*}
H_0 &\text{: The data has been sampled from a normal distribution, } N(\mu,\sigma^2) \\
H_1 &\text{: The data has not been sampled from a normal distribution, } N(\mu,\sigma^2)
\end{align*}

The test is run by computing a test statistic, $W$, and returns a $p$-value for us to interpret. From this test, a $p$-value below the significance level tells us to reject the null hypothesis $H_0$, and we cannot assume normality. This was implemented with the following code:

from scipy.stats import shapiro

def run_shapiro_test(samples: list):
    # samples[0] holds the home-game values, samples[1] the away-game values
    w_home = shapiro(samples[0])
    w_away = shapiro(samples[1])
    print(f'Home Games:\n Test Stat (W): {w_home.statistic},\n p-value: {w_home.pvalue}\n')
    print(f'Away Games:\n Test Stat (W): {w_away.statistic},\n p-value: {w_away.pvalue}\n')
    return w_home, w_away

Anderson-Darling Testing

The Shapiro-Wilk Test is useful for determining normality of a sample, but it has one significant limitation: its $p$-values may be inaccurate for $N > 5000$. Given the size of our data, another test, the Anderson-Darling (AD) Test, is used alongside it. The Anderson-Darling Test is as follows:

\begin{align*}
H_0 &\text{: The data comes from the chosen (normal) distribution} \\
H_1 &\text{: The data does not come from the chosen (normal) distribution}
\end{align*}

AD testing checks whether a sample comes from a specified distribution. We provide a sample along with the distribution to test against (the normal distribution $N$ in our case), and the result indicates whether the sample is consistent with that distribution.

We again use scipy.stats for this test, via the anderson function. The function returns the test statistic, an array of critical values, and an array of the corresponding significance levels (expressed as percentages). If the test statistic is larger than the critical value for a given significance level, the null hypothesis should be rejected at that level.

from scipy.stats import anderson

def run_ad_test(samples: list):
    # samples[0] holds the home-game values, samples[1] the away-game values
    home = anderson(samples[0], dist='norm')
    away = anderson(samples[1], dist='norm')

    print(f'Home Games:\n Test Statistic: {home.statistic},\n Critical Values: {home.critical_values},\n Significance Level: {home.significance_level}\n')
    print(f'Away Games:\n Test Statistic: {away.statistic},\n Critical Values: {away.critical_values},\n Significance Level: {away.significance_level}\n')

    return home, away
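To turn the printed output into a decision, the test statistic can be compared against each critical value in turn. The following is a minimal sketch on synthetic data; the helper name `ad_rejections` is an assumption and is not part of the original analysis.

```python
import numpy as np
from scipy.stats import anderson


def ad_rejections(values):
    """Map each significance level (in percent) to whether normality is rejected."""
    result = anderson(values, dist="norm")
    return {
        float(level): bool(result.statistic > crit)
        for level, crit in zip(result.significance_level, result.critical_values)
    }


# Synthetic right-skewed sample standing in for a box score statistic.
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(scale=10, size=1000)
decisions = ad_rejections(skewed_sample)
print(decisions)
```

A value of `True` at the 5.0 key means the null hypothesis of normality is rejected at the 5% level.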

Results

The results of these tests can be found in the following table:

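| Statistic                | Home SW p-value | Away SW p-value | Home AD Test Statistic | Away AD Test Statistic |
|--------------------------|-----------------|-----------------|------------------------|------------------------|
| Field Goal Percentage    | 8.09e-12        | 1.71e-07        | 7.122                  | 3.765                  |
| Three Pointer Percentage | 1.013e-23       | 7.835e-24       | 18.078                 | 14.525                 |
| Free Throw Percentage    | 6.013e-39       | 2.486e-40       | 40.930                 | 39.009                 |
| Offensive Rebounds       | 6.432e-48       | 7.994e-49       | 132.236                | 141.451                |
| Defensive Rebounds       | 3.868e-27       | 5.104e-29       | 43.834                 | 46.083                 |
| Contested Field Goals    | 1.879e-102      | 3.030e-102      | 2746.373               | 2714.624               |
| Uncontested Field Goals  | 1.426e-103      | 1.154e-103      | 2880.563               | 2888.399               |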

From the results of each test, the team confirmed that none of these features are normally distributed: every Shapiro-Wilk p-value falls far below 0.05. This tells us that for any future testing and models, the data will either need to be transformed or used with non-parametric models and tests. This guided our decision-making process for the following hypothesis tests, which were used to look for differences between home and away games.

Hypothesis Testing

For hypothesis testing, we test for differences in group means and variances between home and away games for field goals, three-pointers, free throws, offensive and defensive rebounds, and contested and uncontested field goals. Because none of these features is normally distributed, we use tests that do not assume normality: Levene's Test for group variances, the Kruskal-Wallis Test for group means, and Dunn's Test for posthoc analysis.

Each function takes a significance level, which defaults to 0.05.

Kruskal-Wallis Test

The Kruskal-Wallis Test is essentially a non-parametric ANOVA, meaning it compares groups without assuming normally distributed data (strictly, it works on ranks, so it is often read as a comparison of medians rather than means). It computes a test statistic, $H$, and has hypotheses:

\begin{align*}
H_0 &\text{: } \mu_{g1} = \mu_{g2} = \cdots = \mu_{gk} \\
H_1 &\text{: At least one group has a different mean}
\end{align*}

The following function will take in a list of pandas.Series objects of each feature, and will run the scipy.stats.kruskal function to retrieve the results. If the p-value of the test is below the significance level, then we reject the null hypothesis and cannot assume group mean equality. If the Null Hypothesis is rejected, Dunn's Test must be run as a posthoc analysis to determine how the group means differ.
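The original function is not reproduced in this section. A minimal sketch of what such a wrapper might look like, assuming the name `run_kruskal_test` and a dictionary return value (both choices are mine, not the author's), is:

```python
import numpy as np
from scipy.stats import kruskal


def run_kruskal_test(groups, alpha=0.05):
    """Run Kruskal-Wallis on a list of array-likes and report the decision."""
    result = kruskal(*groups)
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_h0": bool(result.pvalue < alpha),
    }


# Synthetic home/away samples with a clear location shift.
rng = np.random.default_rng(1)
home_sample = rng.normal(loc=50.0, scale=5.0, size=500)
away_sample = rng.normal(loc=45.0, scale=5.0, size=500)

kw_shifted = run_kruskal_test([home_sample, away_sample])
kw_same = run_kruskal_test([home_sample, home_sample])
print(kw_shifted)
```

With real data, `groups` would be the home and away `pandas.Series` for one feature, for example a two-element list built from the home and away dataframes.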

Levene's Test

Levene's Test is a statistical test used to check equality of variances between groups, and will be used to check if variances between home and away games are the same. This is done by computing a test statistic, $W$, and has hypotheses:

\begin{align*}
H_0 &\text{: } \sigma_{g1}^2 = \sigma_{g2}^2 = \cdots = \sigma_{gk}^2 \\
H_1 &\text{: At least one group has a different variance}
\end{align*}

The following function takes in a list of pandas.Series features, and runs the scipy.stats.levene function to retrieve the test results. If the p-value of the test is less than the significance level, then we reject the null hypothesis and cannot assume group variance equality. If the null hypothesis is rejected, posthoc tests may need to be run.
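The original function is not shown here either. One way it might look, as a sketch with an assumed name `run_levene_test` and assumed return format, is below. It passes `center="median"`, which is SciPy's default and is generally the more robust choice for skewed data.

```python
import numpy as np
from scipy.stats import levene


def run_levene_test(groups, alpha=0.05):
    """Run Levene's test on a list of array-likes and report the decision."""
    result = levene(*groups, center="median")
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_h0": bool(result.pvalue < alpha),
    }


# Synthetic samples: same mean, very different spread.
rng = np.random.default_rng(2)
narrow = rng.normal(loc=45.0, scale=5.0, size=500)
wide = rng.normal(loc=45.0, scale=10.0, size=500)

lev_different = run_levene_test([narrow, wide])
lev_same = run_levene_test([narrow, narrow])
print(lev_different)
```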

Dunn Test

Dunn's Test conducts a posthoc analysis of the group means after a Kruskal-Wallis one-way ANOVA to see how group means differ.

\begin{align*}
H_0 &\text{: } \mu_{g1} = \mu_{g2} \\
H_1 &\text{: } \mu_{g1} \neq \mu_{g2}
\end{align*}

The following function uses the scikit_posthocs package's posthoc_dunn function and returns the results.
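The author's wrapper is not reproduced in this section. To show what Dunn's test computes for a single pair of groups, here is a hand-rolled sketch using only NumPy and SciPy. It is a swapped-in illustration, not a call into scikit_posthocs; the function name `dunn_two_groups` is an assumption, and no p-value adjustment is applied.

```python
import numpy as np
from scipy.stats import kruskal, norm, rankdata


def dunn_two_groups(a, b):
    """Dunn's z statistic and two-sided p-value for one pair of groups."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    pooled = np.concatenate([a, b])
    ranks = rankdata(pooled)  # tied values receive their average rank
    mean_rank_diff = ranks[:n_a].mean() - ranks[n_a:].mean()
    # Tie correction: sum of (t^3 - t) over each set of tied values.
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = (counts**3 - counts).sum() / (12.0 * (n - 1))
    variance = (n * (n + 1) / 12.0 - tie_term) * (1.0 / n_a + 1.0 / n_b)
    z = mean_rank_diff / np.sqrt(variance)
    p = 2.0 * norm.sf(abs(z))
    return float(z), float(p)


# Synthetic integer-valued samples (so ties occur, as with rebound counts).
rng = np.random.default_rng(3)
home_counts = rng.integers(30, 60, size=300)
away_counts = rng.integers(28, 58, size=300)

z, p = dunn_two_groups(home_counts, away_counts)
h = kruskal(home_counts, away_counts)
print(z, p, h.statistic, h.pvalue)
```

With exactly two groups and no adjustment, $z^2$ equals the Kruskal-Wallis $H$ statistic, so the two tests give the same p-value. That is consistent with the near-identical Kruskal-Wallis and Dunn columns in the results table in this section.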

Conover Test

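Conover's test is a non-parametric posthoc test, used here when Levene's test finds significant differences in group variances. We can again use scikit_posthocs in the function. Note that scikit_posthocs.posthoc_conover is documented as a pairwise comparison of mean rank sums, so its p-values are best read alongside the Dunn results rather than as a direct test of variances.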

Results:

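| Statistic                | KW Test p-value | Dunn Test p-value | Levene Test p-value | Conover Test p-value |
|--------------------------|-----------------|-------------------|---------------------|----------------------|
| Field Goal Percentage    | 4.179e-100      | 4.18e-100         | 0.003               | 1.61e-100            |
| Three Pointer Percentage | 1.257e-13       | 1.26e-13          | 0.057               | N/A                  |
| Free Throw Percentage    | 0.003           | 2.92e-03          | 7.737e-05           | 2.92e-03             |
| Offensive Rebounds       | 1.309e-17       | 1.31e-17          | 0.001               | 1.28e-17             |
| Defensive Rebounds       | 3.423e-84       | 3.42e-84          | 0.101               | N/A                  |
| Contested Field Goals    | 1.371e-4        | 1.37e-04          | 5.233e-06           | 1.37e-04             |
| Uncontested Field Goals  | 2.759e-4        | 2.76e-04          | 1.367e-4            | 2.76e-04             |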

Given the results of the above distribution and hypothesis testing, we see that the statistics chosen are not normally distributed, and differ significantly in both mean and variance between Home and Away games. This indicates that there is a significant difference in these game types that should be examined further, and that these stats could indicate a reason why home game win rates have declined over time.

The only stats that differ from this conclusion are Three Pointer Percentage and Defensive Rebounds. For these two, Levene's Test fails to reject the null hypothesis at the 0.05 level (p = 0.057 and p = 0.101, respectively), indicating similar variances between home and away games. There is not much to glean from this, aside from the variances of these two stats not being variables of interest in future explorations.

Conclusion

From the tests of normality and of differences in means and variances, the team now has a springboard to start analyzing differences in home and away games. While the non-normality of each feature is not a massive issue considering most models have scaling functionalities, it is a factor to account for to prevent inaccuracy in future modeling and analysis.

From the hypothesis testing of group means and variances, we can conclude that each chosen statistic differs significantly in mean between home and away games, and that all but Three Pointer Percentage and Defensive Rebounds also differ significantly in variance. These differences should be investigated further to determine what is causing the disparity in win rates.

In future articles, we will be using Regression Analysis and Classification Methods to examine NBA stats further.

References

Github Repository

Code

Statistical Tests: