Assignment: Test a Logistic Regression Model
18 Nov 2016
Program and outputs
Data loading and cleaning
import numpy as np
import pandas
import seaborn
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['armedforcesrate'] = pandas.to_numeric(data['armedforcesrate'], errors='coerce')
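The `errors='coerce'` flag is what makes the later `dropna()` work: any value that can't be parsed as a number becomes NaN instead of raising an error. A small illustrative sketch (the values are made up, not from the gapminder file):

```python
import pandas as pd

# Hypothetical values: non-numeric strings become NaN under errors='coerce',
# while parseable strings are converted to floats.
s = pd.Series(['3.5', '', 'n/a', '42'])
out = pd.to_numeric(s, errors='coerce')
print(out)  # 3.5, NaN, NaN, 42.0
```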
# Since there are no categorical variables in the gapminder dataset,
# I'm creating one by splitting polity scores into two bins (<= 0 vs > 0).
labels = [0, 1]
data['polityscore_bins'] = pandas.cut(data['polityscore'], bins=2, labels=labels)
data['polityscore_bins'] = pandas.to_numeric(data['polityscore_bins'], errors='coerce')
sub1 = data[['polityscore', 'polityscore_bins', 'internetuserate', 'employrate', 'urbanrate', 'armedforcesrate']].dropna()
print('Total remaining countries: {0}'.format(len(sub1)))
Total remaining countries: 148
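One caveat worth noting: `pandas.cut` with `bins=2` splits the *observed range* at its midpoint, so the boundary only lands at 0 because polity scores happen to span -10 to 10. A sketch with illustrative values (not the real polity scores) shows this, along with an explicit-edge version that pins the boundary at 0 regardless of the data:

```python
import pandas as pd

# Illustrative scores spanning the full -10..10 polity range.
scores = pd.Series([-10, -4, 0, 3, 10])

# bins=2 splits the observed range at its midpoint (here, 0);
# note that a score of exactly 0 falls into the lower bin.
auto = pd.cut(scores, bins=2, labels=[0, 1])

# Explicit edges guarantee the cut point is 0 even if the data
# don't span a symmetric range.
explicit = pd.cut(scores, bins=[-10.5, 0, 10.5], labels=[0, 1])

print(list(auto))      # [0, 0, 0, 1, 1]
print(list(explicit))  # [0, 0, 0, 1, 1]
```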
Logistic regression with no confounding variables
lreg1 = smf.logit(formula='polityscore_bins ~ internetuserate', data=sub1).fit()
lreg1.summary()
Logit Regression Results

| Dep. Variable: | polityscore_bins | No. Observations: | 148       |
|----------------|------------------|-------------------|-----------|
| Model:         | Logit            | Df Residuals:     | 146       |
| Method:        | MLE              | Df Model:         | 1         |
| Date:          | Fri, 18 Nov 2016 | Pseudo R-squ.:    | 0.06948   |
| Time:          | 17:03:41         | Log-Likelihood:   | -86.076   |
| converged:     | True             | LL-Null:          | -92.503   |
|                |                  | LLR p-value:      | 0.0003367 |

|                 | coef   | std err | z     | P>\|z\| | [95.0% Conf. Int.] |
|-----------------|--------|---------|-------|---------|--------------------|
| Intercept       | 0.0138 | 0.271   | 0.051 | 0.959   | -0.517  0.545      |
| internetuserate | 0.0255 | 0.008   | 3.298 | 0.001   | 0.010  0.041       |
def odds_ratios(lreg):
    """Return the odds ratios and 95% confidence intervals for a fitted logit model."""
    params = lreg.params
    conf = lreg.conf_int()
    conf['OR'] = params
    conf.columns = ['Lower CI', 'Upper CI', 'OR']
    return np.exp(conf)
odds_ratios(lreg1)
|                 | Lower CI | Upper CI | OR       |
|-----------------|----------|----------|----------|
| Intercept       | 0.596194 | 1.724212 | 1.013886 |
| internetuserate | 1.010417 | 1.041558 | 1.025869 |
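The odds ratio is just the exponentiated logit coefficient: each one-point increase in internet use rate multiplies the odds of a positive polity score by exp(beta). A quick check against the coefficient reported above:

```python
import math

# Coefficient for internetuserate from the regression table above.
beta = 0.0255
odds_ratio = math.exp(beta)
print(odds_ratio)  # ~1.026, matching the OR in the odds_ratios output
```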
Logistic regression with confounding variables
lreg2 = smf.logit(formula='polityscore_bins ~ internetuserate + employrate + urbanrate + armedforcesrate', data=sub1).fit()
lreg2.summary()
Logit Regression Results

| Dep. Variable: | polityscore_bins | No. Observations: | 148       |
|----------------|------------------|-------------------|-----------|
| Model:         | Logit            | Df Residuals:     | 143       |
| Method:        | MLE              | Df Model:         | 4         |
| Date:          | Fri, 18 Nov 2016 | Pseudo R-squ.:    | 0.1908    |
| Time:          | 17:03:47         | Log-Likelihood:   | -74.857   |
| converged:     | True             | LL-Null:          | -92.503   |
|                |                  | LLR p-value:      | 4.046e-07 |

|                 | coef    | std err | z      | P>\|z\| | [95.0% Conf. Int.] |
|-----------------|---------|---------|--------|---------|--------------------|
| Intercept       | 4.2600  | 1.615   | 2.638  | 0.008   | 1.095  7.425       |
| internetuserate | 0.0364  | 0.012   | 3.015  | 0.003   | 0.013  0.060       |
| employrate      | -0.0515 | 0.022   | -2.371 | 0.018   | -0.094  -0.009     |
| urbanrate       | -0.0094 | 0.013   | -0.750 | 0.453   | -0.034  0.015      |
| armedforcesrate | -0.6771 | 0.185   | -3.651 | 0.000   | -1.041  -0.314     |
odds_ratios(lreg2)

|                 | Lower CI | Upper CI    | OR        |
|-----------------|----------|-------------|-----------|
| Intercept       | 2.990108 | 1677.028382 | 70.813105 |
| internetuserate | 1.012822 | 1.061937    | 1.037088  |
| employrate      | 0.910171 | 0.991097    | 0.949772  |
| urbanrate       | 0.966505 | 1.015320    | 0.990612  |
| armedforcesrate | 0.353227 | 0.730799    | 0.508072  |
Summary
My hypothesis was that internet use rate would be fairly well correlated with a nation’s polity score.
In my first logistic regression, I was surprised to find that the association was highly significant but the odds ratio was barely above 1 (OR=1.03, 95% CI = 1.01-1.04, p=.001). The association is real, but far weaker than my hypothesis predicted.
After adjusting for potential confounding factors (employment rate, urban rate, and armed forces rate), the odds of having a polity score greater than zero were roughly halved for each unit increase in armed forces rate (OR=0.51, 95% CI = 0.35-0.73, p<0.001). Internet use rate (OR=1.04, 95% CI = 1.01-1.06, p=0.003) and employment rate (OR=0.95, 95% CI = 0.91-0.99, p=0.018) remained statistically significant, but with odds ratios close to 1; urban rate was not significantly associated with polity score (p=0.45).