I used Python to calculate correlations between the returns of stocks in the S&P 500 over the past ~12 months (more notes on method at the bottom of the post). I cut this data a couple of ways, which I’ll cover below:

- Grouped stocks by industry, and calculated which industries correlate most closely with the S&P 500 index (I used the SPY, an ETF designed to track S&P 500 movement, as my proxy)
- Grouped stocks by industry, and calculated how correlated individual stocks are within a given industry (said another way – how related are the returns of a set of stocks in each industry?)

**Pearson’s correlation coefficient**

*(tl;dr: in the charts below, the higher the # the more correlated the variables are)*

I used Pearson’s correlation coefficient for the purposes of this analysis; it’s a measure of linear correlation between two variables, and can be used to see how tightly correlated different stocks are. The coefficient always falls between -1 and 1, and is read as follows:

- Positive values denote positive linear correlation
- Negative values denote negative linear correlation
- A value of 0 denotes no linear correlation
- The closer the value is to 1 or –1, the stronger the linear correlation
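
To make the rules above concrete, here is a minimal sketch of the coefficient computed straight from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the return series are made up for illustration; this is not the code behind the charts.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of co-deviations, normalised by the size of each series' deviations
    co_dev = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return co_dev / norm  # raises ZeroDivisionError if either series is constant

# Made-up monthly returns, for illustration only
stock_a = [0.02, -0.01, 0.03, 0.00, -0.02]
moves_with_a = [2 * r + 0.01 for r in stock_a]  # increasing linear function of stock_a
moves_against_a = [-r for r in stock_a]         # decreasing linear function of stock_a

r_same = pearson_r(stock_a, moves_with_a)         # approximately +1
r_opposite = pearson_r(stock_a, moves_against_a)  # approximately -1
```

Real stocks are never exact linear functions of each other, so in practice the values land somewhere strictly between the two extremes.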

**Which industry correlates most closely with the overall movement of the S&P 500?**

We can see here that stocks making up the Industrials sector within the S&P 500 track its overall movement most closely, while Consumer Discretionary has the weakest correlation with S&P 500 movement.
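
One way a sector-level ranking like this could be produced with pandas is sketched below. It assumes each stock's returns are correlated with SPY's and the results are then averaged per sector; the exact aggregation behind the chart may differ. The tickers `AAA` through `DDD`, their sector labels, and all the return values are hypothetical.

```python
import pandas as pd

# Made-up monthly returns; tickers and sector assignments are hypothetical
returns = pd.DataFrame({
    "SPY": [0.010, -0.020, 0.030, 0.000, 0.015, -0.010],
    "AAA": [0.012, -0.018, 0.028, 0.002, 0.013, -0.012],
    "BBB": [0.008, -0.025, 0.035, -0.001, 0.020, -0.008],
    "CCC": [0.030, 0.010, -0.020, 0.015, -0.005, 0.020],
    "DDD": [-0.010, 0.020, 0.010, -0.015, 0.005, 0.000],
})
sector_of = pd.Series({
    "AAA": "Industrials",
    "BBB": "Industrials",
    "CCC": "Consumer Discretionary",
    "DDD": "Consumer Discretionary",
})

# Pearson correlation of each stock's returns with SPY's returns
stock_vs_spy = returns.drop(columns="SPY").corrwith(returns["SPY"])

# Average the per-stock correlations within each sector, highest first
sector_vs_spy = stock_vs_spy.groupby(sector_of).mean().sort_values(ascending=False)
print(sector_vs_spy)
```

Averaging per-stock correlations treats every company equally regardless of size; correlating a cap-weighted sector return with SPY would be a different, equally defensible choice.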

**There’s little relationship between total industry market cap and how closely an industry tracks the S&P 500.**

Astute observers may note that the S&P 500 index is weighted by the market capitalization of the 500 equities it comprises, meaning industries with the largest total market caps would, in theory, have the largest influence on the S&P 500’s movement.

Interestingly, this is not the case:

The r-squared value is 0.09, indicating only a weak relationship between the total market cap of an industry within the S&P 500 and how closely that industry tracks the SPY over our one-year time frame.
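
For readers who want to see where an r-squared figure comes from, here is a minimal sketch of a straight-line fit with NumPy. The five data points are invented and will not reproduce the 0.09 above; the variable names `market_cap` and `corr_spy` are assumptions about what sits on each axis.

```python
import numpy as np

# Hypothetical sector-level inputs (not the data from this post)
market_cap = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # total sector market cap, arbitrary units
corr_spy = np.array([0.80, 0.60, 0.90, 0.65, 0.85])   # sector correlation with SPY

# Least-squares straight line through the points
slope, intercept = np.polyfit(market_cap, corr_spy, 1)
predicted = slope * market_cap + intercept

# r-squared: share of the variance in corr_spy explained by the line
ss_res = np.sum((corr_spy - predicted) ** 2)
ss_tot = np.sum((corr_spy - corr_spy.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For a one-variable linear fit this equals the squared Pearson coefficient
r = np.corrcoef(market_cap, corr_spy)[0, 1]
```

Because r-squared here is just the square of Pearson's r, a value of 0.09 corresponds to a correlation of about 0.3 in absolute value.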

**How correlated are individual stocks within a given industry? (or, how similarly do the equities in an industry move?)**

Next, I calculated how tightly correlated individual stocks were within a given industry. Below, we can see that Utilities tops the list, followed by Telecom.

This means that, if I have a portfolio made up solely of Utilities stocks, I don’t have much diversification, since the different companies operating in that sector tend to move in lockstep. Additionally, it means that I can have a pretty well-diversified portfolio made up solely of equities in Consumer Discretionary, or Healthcare, since those industries have weakly correlated equities.
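
A within-industry score like this can be sketched as the average of all pairwise correlations among a sector's stocks. That averaging step is an assumption about the method, and the helper `mean_pairwise_corr`, the tickers, and the prices below are all made up for illustration.

```python
import numpy as np
import pandas as pd

def mean_pairwise_corr(returns: pd.DataFrame) -> float:
    """Average Pearson correlation over every distinct pair of columns."""
    corr = returns.corr().to_numpy()
    upper = np.triu_indices_from(corr, k=1)  # each pair once, diagonal excluded
    return float(corr[upper].mean())

# Made-up month-end closing prices; tickers are hypothetical
prices = pd.DataFrame({
    "U1": [100.0, 102.0, 101.0, 104.0, 103.0, 106.0],
    "U2": [50.0, 51.2, 50.4, 52.1, 51.6, 53.0],
    "U3": [20.0, 20.3, 20.2, 20.9, 20.6, 21.3],
    "D1": [30.0, 33.0, 31.0, 30.0, 34.0, 33.0],
    "D2": [80.0, 78.0, 82.0, 85.0, 83.0, 88.0],
    "D3": [10.0, 10.5, 11.0, 10.2, 10.4, 10.1],
})
sectors = {
    "Utilities": ["U1", "U2", "U3"],
    "Consumer Discretionary": ["D1", "D2", "D3"],
}

# Period-over-period returns from closing prices (dividends ignored)
returns = prices.pct_change().dropna()

within_sector = pd.Series(
    {name: mean_pairwise_corr(returns[tickers]) for name, tickers in sectors.items()}
).sort_values(ascending=False)
print(within_sector)
```

In this toy data the three "Utilities" tickers rise and fall in the same months, so their average pairwise correlation comes out much higher than that of the "Consumer Discretionary" tickers.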

**Some notes on this analysis:**

- Correlation does not imply causation.
- Time frame covered was 8/10/2015 to 8/10/2016, or roughly the past year’s worth of stock movement data. Different time windows will yield different results, meaning these measures are only accurate for the time window specified.
- I am comparing the correlation of the stock’s returns using the month-end close price from Yahoo (or Verizon :p) data. This does not account for dividends, but it seems to be a decent approximation.
- This isn’t a particularly refined way to do this (there are people who do this for a living), but it acts as a good approximation.
- Disadvantages of using Pearson’s correlation coefficient:
  - The approach is only valid for linear dependencies, which are not always observed.
  - The approach only captures the first two moments of the relationship. This means a value of 0 does not necessarily indicate a relationship does not exist.
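
The last caveat is easy to demonstrate with a toy example: below, `y` is completely determined by `x`, yet the Pearson coefficient is zero because the relationship is symmetric rather than linear. The numbers are synthetic and have nothing to do with stock data.

```python
import numpy as np

# y depends perfectly on x, but not linearly
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# Pearson's r sees no linear trend: the rise on the right cancels the fall on the left
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))
```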