Assignment: Running a Lasso Regression Analysis
25 Jan 2017

Program and outputs
Data loading and cleaning
import pandas
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
CSV_PATH = 'gapminder.csv'
data = pandas.read_csv(CSV_PATH)
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
PREDICTORS = [
'incomeperperson', 'alcconsumption', 'armedforcesrate',
'breastcancerper100th', 'co2emissions', 'femaleemployrate',
'hivrate', 'internetuserate',
'polityscore', 'relectricperperson', 'suicideper100th',
'employrate', 'urbanrate'
]
clean = data.copy()
for key in PREDICTORS + ['lifeexpectancy']:
clean[key] = pandas.to_numeric(clean[key], errors='coerce')
clean = clean.dropna()
print('Countries remaining:', len(clean))
clean.head()
Countries remaining: 107
 | country | incomeperperson | alcconsumption | armedforcesrate | breastcancerper100th | co2emissions | femaleemployrate | hivrate | internetuserate | lifeexpectancy | oilperperson | polityscore | relectricperperson | suicideper100th | employrate | urbanrate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2 | Algeria | 2231.993335 | 0.69 | 2.306817 | 23.5 | 2.932109e+09 | 31.700001 | 0.1 | 12.500073 | 73.131 | .42009452521537 | 2.0 | 590.509814 | 4.848770 | 50.500000 | 65.22
4 | Angola | 1381.004268 | 5.57 | 1.461329 | 23.1 | 2.483580e+08 | 69.400002 | 2.0 | 9.999954 | 51.093 |  | -2.0 | 172.999227 | 14.554677 | 75.699997 | 56.70
6 | Argentina | 10749.419238 | 9.35 | 0.560987 | 73.9 | 5.872119e+09 | 45.900002 | 0.5 | 36.000335 | 75.901 | .635943800978195 | 8.0 | 768.428300 | 7.765584 | 58.400002 | 92.00
7 | Armenia | 1326.741757 | 13.66 | 2.618438 | 51.6 | 5.121967e+07 | 34.200001 | 0.1 | 44.001025 | 74.241 |  | 5.0 | 603.763058 | 3.741588 | 40.099998 | 63.86
9 | Australia | 25249.986061 | 10.21 | 0.486280 | 83.2 | 1.297009e+10 | 54.599998 | 0.1 | 75.895654 | 81.907 | 1.91302610912404 | 10.0 | 2825.391095 | 8.470030 | 61.500000 | 88.74
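One subtlety worth noting: dropna() with no arguments drops a row if any column is missing, including columns the model never uses (oilperperson, for example, is blank for several countries above). If that ever removed usable rows, a variant restricted to the modeled columns would keep them; a minimal sketch, replacing the plain dropna() call above:

# Illustrative alternative: only require the predictors and the target
# to be non-missing, ignoring unused columns such as oilperperson
clean = clean.dropna(subset=PREDICTORS + ['lifeexpectancy'])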
from sklearn import preprocessing
predictors = clean[PREDICTORS].copy()
for key in PREDICTORS:
predictors[key] = preprocessing.scale(predictors[key])
targets = clean.lifeexpectancy
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)
print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)
(74, 13) (33, 13) (74,) (33,)
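One caveat on preprocessing: the predictors are scaled before the train/test split, so the held-out rows influence the scaling parameters. A leakage-free sketch of the same step, using scikit-learn's StandardScaler fitted on the training split only, would look like this:

# Illustrative, leakage-free alternative: learn scaling parameters from the
# training split only, then apply the same transform to the test split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(pred_train)
pred_train_scaled = scaler.transform(pred_train)
pred_test_scaled = scaler.transform(pred_test)

With only 107 rows the practical difference is small, but it keeps the test set strictly unseen.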
from collections import OrderedDict
from sklearn.linear_model import LassoLarsCV
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
OrderedDict(sorted(zip(predictors.columns, model.coef_), key=lambda x:x[1], reverse=True))
OrderedDict([('internetuserate', 2.9741932507050883),
('incomeperperson', 1.5624998619776493),
('polityscore', 0.95348158080473089),
('urbanrate', 0.62824156642092388),
('alcconsumption', 0.0),
('armedforcesrate', 0.0),
('breastcancerper100th', 0.0),
('relectricperperson', 0.0),
('co2emissions', -0.065710252825883983),
('femaleemployrate', -0.16966106864470906),
('suicideper100th', -0.83797198915263182),
('employrate', -1.3086675757200679),
('hivrate', -3.6033945847485298)])
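Because lasso shrinks uninformative coefficients to exactly zero, the dropped predictors can be read straight off the fitted model; for example:

# Predictors whose coefficients were shrunk to exactly zero at the CV-chosen alpha
dropped = [name for name, coef in zip(predictors.columns, model.coef_) if coef == 0]
print('dropped:', dropped)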
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
plt.show()
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
plt.show()
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
training data MSE
14.0227968412
test data MSE
22.9565114677
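Since the target is life expectancy in years, taking the square root of the MSE puts the error back into years, which is easier to interpret; a quick sketch:

# RMSE in years (illustrative): square root of the MSE values computed above
print('training RMSE (years):', np.sqrt(train_error))
print('test RMSE (years):', np.sqrt(test_error))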
# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
training data R-square
0.823964900718
test data R-square
0.658213145158
model2 = LassoLarsCV(cv=10, precompute=False).fit(predictors, targets)
print('mse', mean_squared_error(targets, model2.predict(predictors)))
print('r-square', model2.score(predictors, targets))
OrderedDict(sorted(zip(predictors.columns, model2.coef_), key=lambda x:x[1], reverse=True))
mse 17.7754276093
r-square 0.766001466082
OrderedDict([('internetuserate', 2.6765897850358265),
('incomeperperson', 1.4881319407059432),
('urbanrate', 0.62065826306013672),
('polityscore', 0.49665728486271465),
('alcconsumption', 0.0),
('armedforcesrate', 0.0),
('breastcancerper100th', 0.0),
('co2emissions', 0.0),
('femaleemployrate', 0.0),
('relectricperperson', 0.0),
('suicideper100th', 0.0),
('employrate', -0.86922466889577155),
('hivrate', -3.6439368063365305)])
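To make the split fit and the full-data fit easier to compare, the two coefficient vectors can be lined up side by side; a small sketch:

# Side-by-side view (illustrative) of coefficients from both fits
comparison = pandas.DataFrame(
    {'train_split': model.coef_, 'full_data': model2.coef_},
    index=predictors.columns)
print(comparison.sort_values('full_data', ascending=False))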
Summary
After running the Lasso regression, my model showed that HIV rate (coefficient -3.6) and internet use rate (3.0) were the most influential features in determining a country's life expectancy. The model achieved an R-square of 0.66 on the test data, down from 0.82 on the training data. That drop suggests some overfitting, but the test score is still high enough to suggest these are reliable features.

Alcohol consumption, the armed forces rate, breast cancer incidence, and residential electricity consumption had their coefficients shrunk to zero and were effectively dropped from the model.

When I re-ran the model against the entire dataset (with only 107 records, the train/test splits are quite small), it produced an R-square of 0.77, with the same features coming out on top. However, CO2 emissions, the female employment rate, and the suicide rate were also shrunk out of this model.

Lasso regression seems like an extremely useful tool at the start of a data analysis: it quickly identifies the features most likely to produce useful results under other analysis methods.