Investigating Lifestyle and its Impact on BMI of Teenagers

Jay Machado and Jeffrey Wong

Background and Motivation

The Youth Risk Behavior Surveillance System (YRBSS) was created in the 1990s to periodically survey schools across the United States and gather information about student health behaviors. These surveys change from year to year and are often conducted by different branches and levels of government. The CDC has attempted to aggregate data from different years with different questions and publish the data on its website. The combined data is available for surveys at the national, state, and district levels.

The survey measures behaviors related to sex, diet, physical activity, drug use, and mental health. We decided to look at trends in body mass index (BMI) across students, using the state-level surveys conducted over the years of the program. BMI is calculated from an individual's weight and height and gives a general idea of whether someone is at a healthy weight for their size. While it has its limitations, it is far easier to collect than measurements such as body fat percentage. We want to find general trends in BMI values with respect to time, location, and behaviors in the survey, such as drinking milk or eating breakfast. Our hope is to find which predictors have a greater (if any) effect on BMI. We focused only on the state datasets because combining other levels might introduce overlapping data.
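For reference, BMI is weight in kilograms divided by the square of height in meters (equivalently, 703 times weight in pounds divided by the square of height in inches). A minimal helper, shown here purely for illustration and not used in the analysis below:

def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# e.g. 70 kg at 1.75 m:
print(round(bmi(70, 1.75), 1))  # 22.9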

Background Readings

The WHO calls childhood obesity "one of the most serious public health challenges of the 21st century". The NIH hosts a decent article on why obesity prevention is better than a cure, and the WHO has an article on why it matters. The goal of this project is to make it easier to identify students with at-risk behaviors.

Tutorial

At the end of this tutorial, you should be able to:

  • Import YRBSS data from a CSV file
  • Tidy the data to be usable in pandas
  • Plot BMI data with respect to time
  • Run simple regressions to predict BMI over time
  • Calculate and plot behavior prevalence
  • Run regressions to predict BMI based on behavior

Preliminary Data Import

The datasets are large and available only as SAS or Access databases. To make them accessible to pandas without extra drivers, we exported the sets to .csv files. The original Access databases can be found here.

After exporting the relevant files to .csv, they can be read into pandas and merged. A reasonably powerful computer is recommended; on a lower-end machine or inside a virtualization layer, you will likely need to sample the data and quickly discard the rest, as sketched below.
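One possible approach (a sketch, assuming pandas is imported as pd as in the cell below, and an arbitrary 10% sampling fraction): read the CSV in chunks and keep only a random sample of each chunk.

sampled_chunks = []
for chunk in pd.read_csv('databases/states_a-m1.csv', chunksize=100_000):
    # Keep a random 10% of each chunk; the rest is discarded immediately
    sampled_chunks.append(chunk.sample(frac=0.1, random_state=0))
dataset_sample = pd.concat(sampled_chunks, ignore_index=True)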

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import folium
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn import metrics
In [2]:
# States A - M
dataset = pd.read_csv('databases/states_a-m1.csv')

# States N - Z (DataFrame.append was removed in pandas 2.0, so use pd.concat)
dataset = pd.concat([dataset, pd.read_csv('databases/states_n-z1.csv')])

# Drop rows missing BMI or age data
dataset = dataset.dropna(subset=["bmi", "age"])

dataset.head()
Out[2]:
sitecode sitename sitetype sitetypenum year survyear weight stratum PSU record ... qnwenthungry qnmusclestrength qnsunscreenuse qnindoortanning qnsunburn qnconcentrating qncurrentasthma qnwheresleep qnspeakenglish qntransgender
73898 AL Alabama (AL) State 2.0 1999.0 5.0 203.9870 7.0 2.0 263599.0 ... NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
73899 AL Alabama (AL) State 2.0 1999.0 5.0 135.9913 9.0 2.0 263600.0 ... NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
73900 AL Alabama (AL) State 2.0 1999.0 5.0 65.0445 10.0 2.0 263601.0 ... NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
73901 AL Alabama (AL) State 2.0 1999.0 5.0 116.8801 9.0 2.0 263602.0 ... NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
73904 AL Alabama (AL) State 2.0 1999.0 5.0 66.6849 22.0 1.0 263605.0 ... NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 191 columns

Tidying the Data

We could jump right in, but it is more prudent to first translate everything we need into human-readable values. This data was designed to be used with statistical software, so many values are represented numerically with a separate key. We only want to keep certain demographic and lifestyle columns, so we will first select those. Specifically, we keep all information pertaining to BMI, race, and sex, along with the behavior questions relating to diet and exercise that have been present in most of the data.

In [3]:
# Keep columns pertaining to demographics
demographics = ["sitecode","year","age","sex","race4","race7","stheight","stweight","bmi"]

# Select behavior survey questions
behaviors = ["qnfr1", "qnveg1", "qnsoda1", "qnmilk1", "qnbk7day", "qnpa0day", "qnpa7day"]

# Drop extra values
dataset = dataset[demographics + behaviors]
In [4]:
# Restructure indicator variables: these questions are coded 1 = yes, 2 = no,
# so abs(x - 2) maps them to 1 = yes, 0 = no
dataset['veg'] = np.abs(dataset['qnveg1']-2)
dataset['fr'] = np.abs(dataset['qnfr1']-2)
dataset['soda'] = np.abs(dataset['qnsoda1']-2)
dataset['milk'] = np.abs(dataset['qnmilk1']-2)
dataset['brk7'] = np.abs(dataset['qnbk7day']-2)

# qnpa0day's "yes" means zero active days, so x - 1 flips it:
# act0 = 1 means active on at least one day
dataset['act0'] = dataset['qnpa0day']-1
dataset['act7'] = np.abs(dataset['qnpa7day']-2)
In [5]:
# Rename race values (unmapped codes become NaN)
race_labels = {
    1.0: "American Indian",
    2.0: "Asian",
    3.0: "Black",
    4.0: "Hispanic",
    5.0: "Pacific Islander",
    6.0: "White",
    7.0: "Multiracial",
}

dataset['race7'] = dataset['race7'].map(race_labels)
In [6]:
# Label sex as male or female for x-axis readability
# (unmapped codes become NaN)
dataset["sexlabel"] = dataset["sex"].map({1.0: "female", 2.0: "male"})

The simplest place to begin is visualizing BMI across different time periods.

In [7]:
# Scatter of BMIs over time
dataset.plot.scatter(y="bmi", x="year", title="BMI over year")
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x17577063fa0>

We find that the above scatterplot is not extremely helpful, though it does point out that some BMI values are clearly incorrect. A 6-foot person weighing 1,400 lbs would only have a BMI of about 190, so values above 200 must be errors. To keep things simple, we will only look at the BMI range of 9 to 60. We will also only look at people aged 13 - 17 to keep ages in a range of known values.
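A quick sanity check of that arithmetic, using the imperial BMI formula:

# BMI = 703 * weight (lb) / height (in)^2; a 6' (72 in), 1400 lb person:
print(703 * 1400 / 72 ** 2)  # ~189.9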

To make the distribution easier to visualize, we will use a violin plot. With it, we can more easily see where the values are concentrated (the wider parts of the "violin").

In [8]:
# Clearly there is some error with BMIs over 200, as that would imply weights
# over 500 lbs in children, so let's cap BMI at 60. Some values also drop as
# low as 0, so we introduce a floor of 9
dataset = dataset[dataset["bmi"] < 60]
dataset = dataset[dataset["bmi"] > 9]

# Only look at ages 13 - 17 (age is coded 1 = "12 or younger" through
# 7 = "18 or older", so codes 2 - 6 correspond to ages 13 - 17)
dataset = dataset[dataset["age"] < 7]
dataset = dataset[dataset["age"] > 1]

# Since the scatterplots are quite dense per survey year, we'd prefer
# violin plots

# Set the style
sns.set(style="whitegrid")

# Plot year vs BMI
ax = sns.violinplot(x="year", y="bmi", data=dataset, palette="bright")

# Improve look and Labeling
plt.xlabel("Year")
plt.ylabel("BMI")
plt.title("BMI Violin Distributions")
Out[8]:
Text(0.5, 1.0, 'BMI Violin Distributions')

We would like to run a simple cubic regression on the data: this looks for the cubic polynomial that best fits the data, specifically the one that minimizes the least-squares error. This is just to look for any possible trends over time.
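To make "minimizes the least-squares error" concrete: np.polyfit solves a linear least-squares problem over the polynomial's coefficients. A small sketch on synthetic data (not part of the analysis) showing it agrees with solving the Vandermonde system directly:

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 0.5 * x**3 - x + rng.normal(scale=5.0, size=x.size)

# polyfit's cubic coefficients...
coeffs_polyfit = np.polyfit(x, y, 3)

# ...match the least-squares solution of V c = y, where V is the
# Vandermonde matrix with columns x^3, x^2, x, 1
V = np.vander(x, 4)
coeffs_lstsq, *_ = np.linalg.lstsq(V, y, rcond=None)

print(np.allclose(coeffs_polyfit, coeffs_lstsq))  # True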

In [9]:
plt.figure()

# Create a regression
reg = np.poly1d(np.polyfit(dataset["year"], dataset["bmi"], 3))

# Plot the general trend (sort the years so the line draws left to right)
years = np.sort(dataset["year"].unique())
plt.plot(years, reg(years), color="violet")
plt.title("BMI Regression over Time")
plt.xlabel("Year")
plt.ylabel("BMI")
plt.show()

There appears to be an upward trend in BMI over time, but this could be an artifact of increased participation in the survey over the years. It would be prudent to look at trends for each state to see if they follow similar patterns. Since some states participated for fewer years, we will choose the polynomial degree based on the number of years each state participated. (We don't want to overfit states that participated in too few years.)

In [10]:
plt.figure(figsize=(20, 10))

# Create a regression per State
for state, datasubset in dataset.groupby("sitecode"):
    # Pick a degree low enough that the fit is well-conditioned
    # (some states have very few years of participation)
    deg = max(int(datasubset["year"].unique().size / 3), 1)

    # Regression
    reg = np.poly1d(np.polyfit(datasubset["year"], datasubset["bmi"], deg))

    # Plot the fitted trend over the state's survey years
    years = np.sort(datasubset["year"].unique())
    plt.plot(years, reg(years), label=state)

plt.title("BMI in states over time")
plt.xlabel("Year")
plt.ylabel("BMI")
plt.legend(loc="upper left")
plt.show()

We can see from this that most states also have an upward trend, though to varying degrees. To illustrate differences in BMI by location, we can show the mean BMI per state on a map.

In [11]:
# Get the format of state lines for the map
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
state_geo = f'{url}/us-states.json'

# Define the map
m = folium.Map(location=[37, -102], zoom_start=3)

# Calculate the average BMI of each state
avgs = dataset.groupby('sitecode')['bmi'].mean().to_frame().reset_index()

# Color each state by their average BMI
folium.Choropleth(
    geo_data = state_geo,
    name = 'Average BMI Across the United States',
    data = avgs,
    columns = ['sitecode', 'bmi'],
    key_on = 'feature.id',
    fill_color = 'YlGn',
    nan_fill_color = 'white',
    fill_opacity = 0.7,
    line_opacity = 0.2,
    legend_name = 'BMI'
).add_to(m)

m
Out[11]:

Before trying to explain the trends, we first want to see if there is any significant variation between the BMI distributions of different sexes or races. We will use violin plots and regressions for this.

In [12]:
# Set the style
sns.set(style="whitegrid")

# Plot Sex vs BMI
ax = sns.violinplot(x="sexlabel", y="bmi", data=dataset, palette="bright")

# Improve look and Labeling
plt.xlabel("Sex")
plt.ylabel("BMI")
plt.title("BMI Distributions between the Sexes")
Out[12]:
Text(0.5, 1.0, 'BMI Distributions between the Sexes')
In [13]:
plt.figure()

# Create a regression per sex
for sex, datasubset in dataset.groupby("sexlabel"):

    # Regression
    reg = np.poly1d(np.polyfit(datasubset["year"], datasubset["bmi"], 3))

    # Plot the fitted trend (sorted so the line draws left to right)
    years = np.sort(datasubset["year"].unique())
    plt.plot(years, reg(years), label=sex)
    
plt.title("BMI between Sexes over time")
plt.xlabel("Year")
plt.ylabel("BMI")
plt.legend(loc="upper left")
plt.show()
In [14]:
plt.figure(figsize=(20, 10))

# Set the style
sns.set(style="whitegrid")

# Plot race vs BMI
ax = sns.violinplot(x="race7", y="bmi", data=dataset, palette="bright")

# Improve look and labeling
plt.xlabel("Race")
plt.ylabel("BMI")
plt.title("BMI Distributions by Race")

# Zoom in on the bulk of the distribution so the labels are readable
plt.ylim(15,30)
plt.show()
In [15]:
plt.figure()

# Create a regression per race
for race, datasubset in dataset.groupby("race7"):

    # Regression
    reg = np.poly1d(np.polyfit(datasubset["year"], datasubset["bmi"], 3))

    # Plot the fitted trend per race
    years = np.sort(datasubset["year"].unique())
    plt.plot(years, reg(years), label=race)
    
plt.title("BMI between Races over time")
plt.xlabel("Year")
plt.ylabel("BMI")

# Change the location of the legend
plt.legend(loc="upper left")
plt.show()

There does not appear to be significant variation across these demographics.

To see what might be causing the upward trend, we want to look at the behavior questions over time. To do this, we will create a function that takes a question name, calculates the percentage of students who answered affirmatively, and returns the years and percentages. We can then graph the percentage of students who drank milk daily, ate vegetables daily, drank soda daily, and ate breakfast daily.

In [16]:
def qpercreg(label):
    # Collect one point per survey year
    years = []
    percent = []

    # For each survey year with responses to this question
    # (groupby only yields non-empty groups, so no emptiness check is needed)
    for year, datasubset in dataset.dropna(subset=[label]).groupby("year"):
        # Record the year and the percentage answering affirmatively (== 1.0)
        years.append(year)
        percent.append(datasubset[datasubset[label] == 1.0].shape[0] * 100 / datasubset.shape[0])

    return years, percent
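For what it's worth, the same calculation can be written more compactly with a single groupby; a sketch with the same semantics as qpercreg above (the mean of the boolean value == 1.0, times 100, is the affirmative percentage per year):

def qperc_compact(label):
    # Group the non-null responses by year and take the affirmative fraction
    grouped = dataset.dropna(subset=[label]).groupby("year")[label]
    percent = grouped.apply(lambda s: (s == 1.0).mean() * 100)
    return percent.index.tolist(), percent.tolist()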
In [17]:
plt.figure()

xvals, yvals = qpercreg("qnmilk1")
plt.plot(xvals, yvals)

plt.title("Milk Drinking Over Time")
plt.xlabel("Year")
plt.ylabel("Dairy Percentage")

# Change percentage limits to 0 to 100
plt.ylim(0, 100)
plt.show()
In [18]:
xvals, yvals = qpercreg("qnveg1")

plt.plot(xvals, yvals)

plt.title("Veggie Consumption >= 1 per day Over Time")
plt.xlabel("Year")
plt.ylabel("Veggie Percentage")

# Change percentage limits to 0 to 100
plt.ylim(0, 100)
plt.show()
In [19]:
xvals, yvals = qpercreg("qnsoda1")

plt.plot(xvals, yvals)

plt.title("Soda/Pop Drinking Over Time")
plt.xlabel("Year")
plt.ylabel("Pop Percentage")

# Change percentage limits to 0 to 100
plt.ylim(0, 100)
plt.show()
In [20]:
xvals, yvals = qpercreg("qnbk7day")

plt.plot(xvals, yvals)

plt.title("Daily Breakfast Over Time")
plt.xlabel("Year")
plt.ylabel("Breakfast Percentage")

# Change percentage limits to 0 to 100
plt.ylim(0, 100)
plt.show()

There appears to be a downward trend in vegetable and milk consumption, but also a similar trend in soda consumption. Perhaps it would be more useful to create a model that predicts BMI based on these variables. That might give us a better sense of which behaviors are more relevant.

Developing a Model

We want to develop a linear model predicting BMI from a number of categorical predictor variables. We select the subset of the data containing the predictor variables and the dependent variable, BMI, and drop all incomplete entries, accepting any bias introduced by discarding them.

In [21]:
# Indicator variables
indicator_vars = ['veg','fr','soda','milk','brk7','act7']

# Drop rows where an indicator variable or bmi are null
model_data = dataset.dropna(subset=indicator_vars+['bmi']).reset_index()

We will train our model with a held-out validation approach: 80% of the dataset is designated for training and the remaining 20% is kept as test data. Using the training data, we can fit a linear model.

In [22]:
# Separate data into indicator and dependent values
df_x = model_data[indicator_vars]
df_y = model_data['bmi'].values

# Divide dataset into test and training data
x_train, x_test, y_train, y_test = train_test_split(df_x,df_y,test_size=0.2,random_state=0)

# Create linear model
reg = LinearRegression()
reg.fit(x_train, y_train)
coeff_df = pd.DataFrame(reg.coef_,df_x.columns,columns=['Coefficient'])
coeff_df.head(6)
Out[22]:
Coefficient
veg 0.132134
fr -0.134477
soda 0.255948
milk -0.031214
brk7 -0.626282
act7 -0.560476

Once we have a model, we will use our test data set to make predictions of BMI. We can then calculate the residual between our predicted values and the actual BMI values from the test data.

In [23]:
# Use model to predict BMI from test data
y_pred = reg.predict(x_test)
res = pd.DataFrame({'actual':y_test,'predicted':y_pred,'residual':y_test-y_pred,'flag':0})
res.head(10)
Out[23]:
actual predicted residual flag
0 32.591759 23.394287 9.197472 0
1 20.461759 23.221892 -2.760133 0
2 29.553276 23.526421 6.026856 0
3 28.125000 23.034616 5.090384 0
4 28.323232 23.526421 4.796812 0
5 17.179018 22.308448 -5.129431 0
6 25.761694 23.660898 2.100796 0
7 20.023438 22.902482 -2.879044 0
8 24.122717 23.782368 0.340349 0
9 35.314879 22.871268 12.443611 0

Plotting the actual values against our predicted values helps show the accuracy of our model. A perfectly accurate model would produce a linear trend with a slope of 1. As you can see from the graph, our model is quite conservative, estimating closer to the average. This means the model loses accuracy for individuals at the more extreme ends of the BMI scale. We have also listed the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error, all metrics of the error our model produces.
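For clarity, these three metrics are simple reductions of the prediction errors. Computing them by hand with numpy (equivalent to the sklearn calls below, with y_test and y_pred as defined above):

errors = y_test - y_pred
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error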

In [24]:
# Plot residuals
ax = res.plot(kind='scatter',x='predicted',y='actual',color='violet',title='Actual vs Predicted BMI')
ax.set_xlabel('Predicted BMI')
ax.set_ylabel('Actual BMI')
# Calculate trend line
trend = np.polyfit(res.predicted, res.actual, 1)
model = np.poly1d(trend)
plt.plot(res.predicted, model(res.predicted))
plt.show()

# Calculate error
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 3.6220439562158218
Mean Squared Error: 23.966481329478338
Root Mean Squared Error: 4.895557305300219

We can go further in visualizing our error with a violin plot of the residuals. It shows that the residuals are heavily concentrated at low magnitude, but are also heavily skewed by the more extreme values.

In [25]:
# Create violin plot
bx = sns.violinplot(y='residual',x='flag',data=res)
bx.set_title('Residuals')
bx.set_xlabel('')
bx.set_ylabel('Residual')
plt.show()

10-Fold Cross Validation

We want to see if we can further strengthen our model by expanding our training methodology. To do so, we will use 10-fold cross-validation: we divide the dataset into 10 subsets and run the analysis with each subset in turn as the test data. The results are then averaged to estimate the overall accuracy of the model.

In [26]:
# Define indicator and dependent variables
df_x = model_data[indicator_vars]
df_y = model_data[['bmi']]

reg = LinearRegression()
# Define 10-Fold Cross Validation
cv = KFold(n_splits=10, shuffle=False)
res = []
# Split dataset for cross validation
for train_index, test_index in cv.split(df_x):
    # Set training and test data (row indices are positional after reset_index)
    x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
    y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]
    # Generate model
    reg.fit(x_train, y_train)
    # Predict BMI
    y_pred = reg.predict(x_test)
    # Record accuracy
    res.append(metrics.mean_absolute_error(y_test, y_pred))

# Calculate average error
print('Average Mean Absolute Error from 10-Fold Cross Validation:',np.mean(res))
Average Mean Absolute Error from 10-Fold Cross Validation: 3.639202660848846
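As an aside, scikit-learn can express this loop in a single call via cross_val_score; a sketch (note its scoring convention maximizes, so the MAE comes back negated):

from sklearn.model_selection import cross_val_score

# cv=10 with a regressor uses unshuffled KFold, matching the loop above
scores = cross_val_score(LinearRegression(), df_x, df_y['bmi'], cv=10,
                         scoring='neg_mean_absolute_error')
print('Average MAE:', -scores.mean())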

If we want to change our model to use whether an individual exercises one or more days per week, rather than all 7 days, we can redo the previous models.

In [27]:
# Indicator variables
indicator_vars = ['veg','fr','soda','milk','brk7','act0']

# Drop rows where an indicator variable or bmi are null
model_data = dataset.dropna(subset=indicator_vars+['bmi']).reset_index()

# Separate data into indicator and dependent values
df_x = model_data[indicator_vars]
df_y = model_data['bmi'].values

# Divide dataset into test and training data
x_train, x_test, y_train, y_test = train_test_split(df_x,df_y,test_size=0.2,random_state=0)

# Create linear model
reg = LinearRegression()
reg.fit(x_train, y_train)

# Use model to predict BMI from test data
y_pred = reg.predict(x_test)
res = pd.DataFrame({'actual':y_test,'predicted':y_pred,'residual':y_test-y_pred,'flag':0})

# Plot residuals
ax = res.plot(kind='scatter',x='predicted',y='actual',color='violet',title='Actual vs Predicted BMI')
ax.set_xlabel('Predicted BMI')
ax.set_ylabel('Actual BMI')
# Calculate trend line
trend = np.polyfit(res.predicted, res.actual, 1)
model = np.poly1d(trend)
plt.plot(res.predicted, model(res.predicted))
plt.show()

# Calculate error
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# Create violin plot
bx = sns.violinplot(y='residual',x='flag',data=res)
bx.set_title('Residual')
bx.set_xlabel('')
bx.set_ylabel('Residual')
plt.show()

# Define indicator and dependent variables
df_x = model_data[indicator_vars]
df_y = model_data[['bmi']]

reg = LinearRegression()
# Define 10-Fold Cross Validation
cv = KFold(n_splits=10, shuffle=False)
res = []
# Split dataset for cross validation
for train_index, test_index in cv.split(df_x):
    # Set training and test data (row indices are positional after reset_index)
    x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
    y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]
    # Generate model
    reg.fit(x_train, y_train)
    # Predict BMI
    y_pred = reg.predict(x_test)
    # Record accuracy
    res.append(metrics.mean_absolute_error(y_test, y_pred))
    
# Calculate average error
print('Average Mean Absolute Error from 10-Fold Cross Validation:',np.mean(res))
Mean Absolute Error: 3.6229813302417258
Mean Squared Error: 23.982024895343432
Root Mean Squared Error: 4.897144565493593
Average Mean Absolute Error from 10-Fold Cross Validation: 3.6417481258000763

Conclusion

We found an upward trend in BMI over the years of the survey, and decreasing trends in question responses such as milk and vegetable consumption. That said, it is important to keep in mind that correlation does not imply causation. We were also able to develop a predictive model for BMI from a number of categorical predictor variables; however, the model did not have particularly high accuracy. It made very conservative predictions, resulting in fairly accurate estimates for individuals with average BMI but a loss of accuracy towards the extremes. This indicates that the variables we used are, on their own, not great predictors of a teenager's BMI. Further research is therefore necessary before we would be comfortable making broad generalizations about the effect of specific lifestyle choices on the BMI of a teenager.