NBA Analytics 1: Home vs. Away Analysis

April 23, 2025

Examining NBA Data to determine differences in Home vs. Away Games across seasons from 2004-2024.

For my Business Intelligence Capstone, my team and I performed an exploratory data analysis (EDA) of NBA game data from 2004 to 2024. This included data transformation, visualization, and hypothesis testing. The goal of the analysis was to determine whether box score statistics differ meaningfully between home and away games, and whether those differences could be investigated further in future capstone projects.

Data Collection and Cleaning

Data was collected from the NBA API, specifically Box Scores of teams. The following code can be used to access the data for your own analysis:

import pandas as pd
# Read the team box score CSV hosted on Google Drive
file_id = '1U2UaHWRSkUXfJBn4kBHPYttd3dvw_CZF'
url = f'https://drive.google.com/uc?id={file_id}'
df = pd.read_csv(url, encoding='utf-8')

A brief look at the percentages of missing data in the dataset revealed that missing or NA data could be dropped without making a meaningful difference in the data. From there, we were ready to examine win percentages and possible predictor variables.
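A minimal sketch of that missing-data check, assuming a small toy dataframe in place of the real box scores. The column names `fg_pct` and `reb` and the helper `missing_report` are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Toy stand-in for the box score data; column names are hypothetical.
toy = pd.DataFrame({
    "fg_pct": [0.45, 0.51, np.nan, 0.48],
    "reb": [40, 38, 44, 41],
})

report = missing_report(toy)   # share of NA values in each column
cleaned = toy.dropna()         # drop any row that has a missing value
print(report)
print(f"Rows before: {len(toy)}, rows after dropping NA: {len(cleaned)}")
```

If the reported percentages are small, as they were for this dataset, dropping the affected rows should leave the overall picture largely unchanged.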

Examining Win Percentages for Teams

To examine the win percentages, the data was grouped by season, and the percentage of games won by home and away teams was computed for each season. This resulted in the following table:

season	home_win_pct	away_win_pct
2004-05	60.182%	39.818%
2005-06	60.787%	39.213%
2006-07	59.573%	40.427%
2007-08	61.002%	38.999%
2008-09	61.217%	38.783%
2009-10	59.817%	40.183%
2010-11	60.777%	39.223%
2011-12	59.349%	40.651%
2012-13	61.398%	38.602%
2013-14	57.835%	42.165%
2014-15	57.722%	42.278%
2015-16	59.423%	40.577%
2016-17	58.145%	41.856%
2017-18	58.405%	41.595%
2018-19	59.182%	40.818%
2019-20	54.177%	45.823%
2020-21	54.382%	45.618%
2021-22	55.364%	44.636%
2022-23	57.317%	42.683%
2023-24	55.190%	44.810%
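The grouping step behind a table like the one above can be sketched as follows. This assumes one row per game and two hypothetical columns, `season` and `home_win`. The real dataset may be laid out differently (for example, one row per team per game), so treat it as an illustration rather than the exact code used:

```python
import pandas as pd

def win_pct_by_season(games: pd.DataFrame) -> pd.DataFrame:
    """Home and away win percentages per season.

    Assumes one row per game with hypothetical columns
    `season` and `home_win` (True when the home team won).
    """
    home = games.groupby("season")["home_win"].mean() * 100
    table = pd.DataFrame({"home_win_pct": home, "away_win_pct": 100 - home})
    return table.reset_index()

# Tiny synthetic example, not real NBA results.
toy_games = pd.DataFrame({
    "season": ["2004-05"] * 5 + ["2005-06"] * 4,
    "home_win": [True, True, True, False, False, True, True, True, False],
})

table = win_pct_by_season(toy_games)
print(table.round(3))
```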

From the resulting table, the home win percentage (home_win_pct) trends downward over the period, from roughly 60% in the mid-2000s to roughly 55% in the early 2020s, and the away win percentage (away_win_pct) rises correspondingly. This aligns with outside analyses, like Michael MacKelvie's excellent analysis of home games in the NBA, which shows home win rates declining since the 1940s.

So what is causing this decline, and can we examine it in our data? We can look at statistics in our dataset such as free throws, three-pointers, turnovers, and defensive/offensive rebounds to see whether the group means and variances are the same for home and away games. If the means are the same, we can intuit that the statistic does not have a meaningful impact on home win percentage. If the means or variances differ, we can infer that the statistic may have a meaningful impact on a team's win percentage at home or away.

Factors Contributing to Decline :: An Exploration

After splitting the dataset into separate pandas dataframes for home and away games, several tests were applied. The first step was to plot histograms and QQ plots of each box score statistic and visually examine the distributions. Once this was done, more formal tests of normality, the Shapiro-Wilk test and the Anderson-Darling test (used because of the large size of the data), were applied to get a quantitative assessment of each distribution.

[Add example image of a Histogram and QQ Plot here, with a caption outlining insights from the distribution]
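As a rough sketch of the numbers behind a histogram and a QQ plot, the helper below (a hypothetical `qq_summary`, run on synthetic data rather than the NBA dataset) bins a sample with `numpy.histogram` and computes normal QQ points with `scipy.stats.probplot`. The returned arrays could be passed to any plotting library to draw the figures:

```python
import numpy as np
from scipy import stats

def qq_summary(values, bins=20):
    """Histogram counts plus the normal Q-Q fit for one statistic."""
    x = np.asarray(values, dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    # probplot returns the Q-Q points and a least-squares line through them;
    # r is the correlation of that fit (closer to 1 means closer to normal).
    (theoretical_q, ordered_x), (slope, intercept, r) = stats.probplot(x, dist="norm")
    return {"counts": counts, "edges": edges, "qq_r": r,
            "theoretical_q": theoretical_q, "ordered_x": ordered_x}

rng = np.random.default_rng(0)
bell = qq_summary(rng.normal(0.46, 0.05, size=2000))    # roughly bell shaped
skewed = qq_summary(rng.exponential(10.0, size=2000))   # right skewed

print(f"Q-Q r, bell-shaped sample: {bell['qq_r']:.4f}")
print(f"Q-Q r, skewed sample:      {skewed['qq_r']:.4f}")
```

A sample whose QQ points bend away from the straight line, like the skewed one here, is a visual hint that formal normality tests will likely reject.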

The stats chosen were as follows:

Field Goal Percentage: making a higher share of shots means more points and, hopefully, more wins.
Three-Pointer Percentage: making higher-value shots should lead to higher scores and, hopefully, more wins.
Free Throw Percentage: making clutch and-one plays can give teams an edge in a close game.
Offensive Rebounds: more offensive rebounds mean more chances to shoot and score, which likely boosts a team's chances of winning.
Defensive Rebounds: teams that get defensive rebounds deny scoring opportunities to their opponents, hopefully leading to larger score differentials in the rebounding team's favor.
Contested Field Goals: making shots under pressure could indicate a team's ability to score in important games.
Uncontested Field Goals: these shots are explored in the MacKelvie video, where he states that uncontested field goals are a robust metric for tracking a team's scoring ability because they block out the effect a crowd may have on a game.

Shapiro-Wilk Testing

The Shapiro-Wilk Test of Normality is defined as:

\begin{align*} H_0 &\text{: The data has been sampled from a normal distribution, } N(\mu,\sigma^2) \\ H_1 &\text{: The data has not been sampled from a normal distribution, } N(\mu,\sigma^2) \end{align*}

The test computes a test statistic, $W$, and returns a $p$-value for us to interpret. A $p$-value below the significance level tells us to reject the null hypothesis $H_0$, so we cannot assume normality. This was implemented with the following code:

from scipy.stats import shapiro

def run_shapiro_test(samples: list):
    """Print the Shapiro-Wilk W statistic and p-value for [home, away] samples."""
    w_home = shapiro(samples[0])
    w_away = shapiro(samples[1])
    print(f'Home Games:\n Test Stat (W): {w_home.statistic},\n p-value: {w_home.pvalue}\n')
    print(f'Away Games:\n Test Stat (W): {w_away.statistic},\n p-value: {w_away.pvalue}\n')
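To show how the decision rule plays out, here is a small variant that returns the result instead of printing it. The helper name `shapiro_decision` and the synthetic, deliberately skewed sample are assumptions for illustration, not part of the original analysis:

```python
import numpy as np
from scipy.stats import shapiro

def shapiro_decision(sample, alpha=0.05):
    """Return (W, p-value, reject_normality) for one sample."""
    result = shapiro(sample)
    return result.statistic, result.pvalue, result.pvalue < alpha

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=10.0, size=500)   # synthetic, clearly non-normal

w, p, reject = shapiro_decision(skewed)
print(f"W = {w:.4f}, p = {p:.3e}, reject normality: {reject}")
```

Returning the values rather than printing them makes it easier to collect results for several statistics into one table.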

Anderson-Darling Testing

The Shapiro-Wilk Test is useful for determining normality of a sample, but it has one significant limitation: for samples larger than 5,000 observations ($N > 5000$), the reported p-values may not be accurate. Given the size of our data, another test, the Anderson-Darling (AD) Test, is used as well. The hypotheses of the Anderson-Darling Test are as follows:

\begin{align*} H_0&\text{: The data comes from the chosen (normal) distribution} \\ H_1&\text{: The data does not come from the chosen (normal) distribution} \\ \end{align*}

The AD test checks whether a sample comes from a specified distribution. We provide the sample along with the distribution to test against (the normal distribution $N$ in our case), and the result tells us whether the sample is consistent with that distribution.

We are again using scipy.stats for this test, through the anderson function. The function returns the test statistic, an array of critical values, and an array of the corresponding significance levels. If the test statistic is larger than the critical value at a given significance level, the null hypothesis is rejected at that level.

from scipy.stats import anderson

def run_ad_test(samples: list):
    """Print Anderson-Darling results against the normal distribution for [home, away] samples."""
    home = anderson(samples[0], dist='norm')
    away = anderson(samples[1], dist='norm')

    print(f'Home Games:\n Test Statistic: {home.statistic},\n Critical Values: {home.critical_values},\n Significance Level: {home.significance_level}\n')
    print(f'Away Games:\n Test Statistic: {away.statistic},\n Critical Values: {away.critical_values},\n Significance Level: {away.significance_level}\n')
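Because the function reports arrays rather than a single verdict, it can help to turn the comparison into an explicit yes/no decision. The sketch below assumes the result attributes used above (`statistic`, `critical_values`, `significance_level`). The helper `ad_rejects_normality` is hypothetical and the sample is synthetic:

```python
import numpy as np
from scipy.stats import anderson

def ad_rejects_normality(sample, level=5.0):
    """True when the AD statistic exceeds the critical value at `level` percent."""
    result = anderson(sample, dist="norm")
    # significance_level is given in percent; pick the entry closest to `level`
    levels = np.asarray(result.significance_level, dtype=float)
    idx = int(np.argmin(np.abs(levels - level)))
    return bool(result.statistic > result.critical_values[idx])

rng = np.random.default_rng(7)
skewed = rng.exponential(scale=10.0, size=2000)   # synthetic, right-skewed sample

rejected = ad_rejects_normality(skewed)
print(f"Reject normality at the 5% level: {rejected}")
```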

Results

The results of these tests can be found in the following table:

Statistic	Home SW p-value	Away SW p-value	Home AD Test Statistic	Away AD Test Statistic
Field Goal Percentage	8.09e-12	1.71e-07	7.122	3.765
Three Pointer Percentage	1.013e-23	7.835e-24	18.078	14.525
Free Throw Percentage	6.013e-39	2.486e-40	40.930	39.009
Offensive Rebounds	6.432e-48	7.994e-49	132.236	141.451
Defensive Rebounds	3.868e-27	5.104e-29	43.834	46.083
Contested Field Goals	1.879e-102	3.030e-102	2746.373	2714.624
Uncontested Field Goals	1.426e-103	1.154e-103	2880.563	2888.399

From the results of each test, the team confirmed that none of these features are normally distributed: every Shapiro-Wilk p-value is far below .05, and every Anderson-Darling statistic is well above the critical values scipy reports for the normal case (roughly 1.09 at the 1% level). This tells us that for any future testing and models, the data will either need to be transformed or used with non-parametric models and tests. This guided our decision-making process for the following hypothesis tests, which were used to look for differences between home and away games.

Hypothesis Testing

For hypothesis testing, we will be testing for differences in group means and variances for home and away field goals, three pointers, free throws, offensive and defensive rebounds, and contested and uncontested field goals. Because none of the features is normally distributed, we will perform these hypothesis tests using non-parametric tests. These tests are Levene's Test for group variances and the Kruskal-Wallis Test for group means, with Dunn's Test and Conover's Test for post hoc analysis.

Each function takes a significance level, which defaults to .05.

Kruskal-Wallis Test

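The Kruskal-Wallis Test is essentially a non-parametric counterpart to one-way ANOVA: it compares groups using the ranks of the observations, so it does not assume normally distributed data. Strictly speaking, it tests whether the groups share the same distribution (often read as equal medians); the hypotheses below keep the group-mean notation used in the rest of this analysis. The test computes a statistic, $H$, and has hypotheses: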

\begin{align*} H_0&\text{:}\mu_{g1} = \mu_{g2} = \cdots = \mu_{gk} \\ H_1&\text{: At least one group has a different mean} \\ \end{align*}

The following function takes a list of pandas.Series objects, one per group, and runs the scipy.stats.kruskal function to retrieve the results. If the p-value of the test is below the significance level, then we reject the null hypothesis and cannot assume group mean equality. If the null hypothesis is rejected, Dunn's Test is run as a post hoc analysis to determine how the groups differ.
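A minimal sketch of such a function, assuming it receives `[home, away]` samples and returns a small dictionary. The name `run_kruskal_test`, the return format, and the synthetic values are illustrative assumptions rather than the capstone's exact code:

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def run_kruskal_test(groups, alpha=0.05):
    """Kruskal-Wallis H-test across a list of samples (e.g. [home, away])."""
    result = kruskal(*groups)
    return {"H": result.statistic,
            "p_value": result.pvalue,
            "reject_null": bool(result.pvalue < alpha)}

rng = np.random.default_rng(1)
home = pd.Series(rng.normal(0.47, 0.05, size=1000))   # synthetic stand-in values
away = pd.Series(rng.normal(0.45, 0.05, size=1000))

kw = run_kruskal_test([home, away])
print(kw)
```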

Levene's Test

Levene's Test is a statistical test used to check equality of variances between groups, and will be used to check whether variances between home and away games are the same. The test computes a statistic, $W$, and has hypotheses:

\begin{align*} H_0&\text{: }\sigma_{g1}^2 = \sigma_{g2}^2 = \cdots = \sigma_{gk}^2 \\ H_1&\text{: At least one group has a different variance} \end{align*}

The following function takes in a list of pandas.Series features, and runs the scipy.stats.levene function to retrieve the test results. If the p-value of the test is less than the significance level, then we reject the null hypothesis and cannot assume group variance equality. If the null hypothesis is rejected, posthoc tests may need to be run.
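A sketch along the same lines for the variance check, again with a hypothetical function name and synthetic data. `center="median"` is scipy's default and is spelled out here only to make the choice visible:

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

def run_levene_test(groups, alpha=0.05, center="median"):
    """Levene's test for equal variances across a list of samples."""
    result = levene(*groups, center=center)
    return {"W": result.statistic,
            "p_value": result.pvalue,
            "reject_null": bool(result.pvalue < alpha)}

rng = np.random.default_rng(2)
home = pd.Series(rng.normal(10.0, 4.0, size=1000))   # synthetic, wider spread
away = pd.Series(rng.normal(10.0, 3.0, size=1000))   # synthetic, narrower spread

lev = run_levene_test([home, away])
print(lev)
```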

Dunn Test

Dunn's Test conducts a posthoc analysis of the group means after a Kruskal-Wallis one-way ANOVA to see how group means differ.

\begin{align*} H_0&\text{: }\mu_{g1} = \mu_{g2} \\ H_1&\text{: }\mu_{g1} \neq \mu_{g2} \\ \end{align*}

The following function uses the scikit_posthocs package's posthoc_dunn function and returns the results.
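To make the mechanics concrete without the extra dependency, the sketch below computes Dunn's z statistic by hand for the two-group case, using the standard tie-corrected formula. It is not the scikit_posthocs implementation, and the function name and synthetic data are assumptions. With only two groups, $z^2$ equals the Kruskal-Wallis $H$, so the two p-values coincide, which would be consistent with the near-identical Kruskal-Wallis and Dunn columns in the results table further down:

```python
import numpy as np
from scipy.stats import kruskal, norm, rankdata

def dunn_two_groups(a, b):
    """Hand-rolled Dunn z-test for two groups, with tie correction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    pooled = np.concatenate([a, b])
    ranks = rankdata(pooled)                         # average ranks for ties
    mean_rank_diff = ranks[:n_a].mean() - ranks[n_a:].mean()
    _, tie_counts = np.unique(pooled, return_counts=True)
    tie_term = (tie_counts**3 - tie_counts).sum() / (12 * (n - 1))
    variance = (n * (n + 1) / 12 - tie_term) * (1 / n_a + 1 / n_b)
    z = mean_rank_diff / np.sqrt(variance)
    p_value = 2 * norm.sf(abs(z))                    # two-sided p-value
    return z, p_value

rng = np.random.default_rng(3)
home = rng.integers(5, 20, size=400)   # synthetic rebound counts, ties included
away = rng.integers(4, 19, size=400)

z, p_dunn = dunn_two_groups(home, away)
kw_result = kruskal(home, away)
print(f"Dunn p = {p_dunn:.3e}, Kruskal-Wallis p = {kw_result.pvalue:.3e}")
```

With more than two groups the pairwise p-values would generally differ from the overall Kruskal-Wallis p-value, and a multiple-comparison adjustment would usually be applied.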

Conover Test

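Conover's test is a non-parametric post hoc test, used here when Levene's test finds significant differences in group variances. We can again use scikit_posthocs in the function. Note that posthoc_conover in scikit_posthocs is a pairwise comparison of mean ranks, so for two groups its p-values track the Kruskal-Wallis result rather than measuring a difference in variance directly.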

Results:

Statistic	KW Test p-value	Dunn Test p-value	Levene Test p-value	Conover Test p-value
Field Goal Percentage	4.179e-100	4.18e-100	0.003	1.61e-100
Three Pointer Percentage	1.257e-13	1.26e-13	0.057	N/A
Free Throw Percentage	0.003	2.92e-03	7.737e-05	2.92e-03
Offensive Rebounds	1.309e-17	1.31e-17	0.001	1.28e-17
Defensive Rebounds	3.423e-84	3.42e-84	0.101	N/A
Contested Field Goals	1.371e-4	1.37e-04	5.233e-06	1.37e-04
Uncontested Field Goals	2.759e-4	2.76e-04	1.367e-4	2.76e-04

Given the results of the above distribution and hypothesis testing, we see that the statistics chosen are not normally distributed, and differ significantly in both mean and variance between Home and Away games. This indicates that there is a significant difference in these game types that should be examined further, and that these stats could indicate a reason why home game win rates have declined over time.

The only stats that differ from this conclusion are Three Pointer Percentage and Defensive Rebounds. For both, Levene's Test fails to reject the null hypothesis at the .05 level (p = 0.057 and p = 0.101), indicating similar variances between home and away games. There is not much to glean from this, other than that the variances of these two stats are unlikely to be of interest in future explorations.

Conclusion

From the tests of normality and of differences in means and variances, the team now has a springboard to start analyzing differences in home and away games. While the non-normality of each feature is not a massive issue considering most models have scaling functionalities, it is a factor to account for to prevent inaccuracy in future modeling and analysis.

From the hypothesis testing of group means and variances, we can conclude that each chosen statistic differs significantly between home and away games in mean, and all but Three Pointer Percentage and Defensive Rebounds differ in variance as well. These differences should be investigated further to determine what is driving the disparity in win rates.

In future articles, we will be using Regression Analysis and Classification Methods to examine NBA stats further.
