Assignment: Test a Logistic Regression Model

Program and outputs

Data loading and cleaning

import numpy as np
import pandas
import seaborn
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['armedforcesrate'] = pandas.to_numeric(data['armedforcesrate'], errors='coerce')

# The gapminder dataset has no categorical variables, so I'm creating a binary
# one from the polity score. pandas.cut with bins=2 splits the observed range
# into two equal-width bins; if the scores span the full -10 to 10 scale, the
# cut falls at 0.
labels = [0, 1]
data['polityscore_bins'] = pandas.cut(data['polityscore'], bins=2, labels=labels)
data['polityscore_bins'] = pandas.to_numeric(data['polityscore_bins'], errors='coerce')

sub1 = data[['polityscore', 'polityscore_bins', 'internetuserate', 'employrate', 'urbanrate', 'armedforcesrate']].dropna()

print('Total remaining countries: {0}'.format(len(sub1)))
Total remaining countries: 148
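
To make the binning step concrete, here is a small sketch of what `pandas.cut` with `bins=2` does. The scores below are made-up values for illustration, not rows from gapminder.csv. The call splits the observed range into two equal-width, right-closed intervals, so a score of exactly 0 lands in the lower bin. The explicit threshold shown alongside it is an alternative I'm assuming would be acceptable here; it is not what the program above uses.

```python
import pandas

# Hypothetical polity scores for illustration (not taken from gapminder.csv)
scores = pandas.Series([-10, -3, 0, 4, 10])

# bins=2 splits the observed range (-10 to 10) into two equal-width,
# right-closed intervals; labels 0 and 1 name the lower and upper bin
binned = pandas.cut(scores, bins=2, labels=[0, 1]).astype(int)

# An explicit threshold does not depend on the observed minimum and maximum
explicit = (scores > 0).astype(int)

print(binned.tolist())
print(explicit.tolist())
```

With these inputs both approaches assign the first three scores to group 0 and the last two to group 1. They would diverge if the observed scores did not span a range centered on 0.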

Logistic regression with no confounding variables

lreg1 = smf.logit(formula='polityscore_bins ~ internetuserate', data=sub1).fit()
lreg1.summary()
Logit Regression Results
Dep. Variable:   polityscore_bins    No. Observations:   148
Model:           Logit               Df Residuals:       146
Method:          MLE                 Df Model:           1
Date:            Fri, 18 Nov 2016    Pseudo R-squ.:      0.06948
Time:            17:03:41            Log-Likelihood:     -86.076
converged:       True                LL-Null:            -92.503
                                     LLR p-value:        0.0003367

                   coef  std err       z   P>|z|  [95.0% Conf. Int.]
Intercept        0.0138    0.271   0.051   0.959   -0.517      0.545
internetuserate  0.0255    0.008   3.298   0.001    0.010      0.041

def odds_ratios(lreg):
    params = lreg.params
    conf = lreg.conf_int()
    conf['OR'] = params
    conf.columns = ['Lower CI', 'Upper CI', 'OR']
    
    return np.exp(conf)

odds_ratios(lreg1)
                 Lower CI  Upper CI        OR
Intercept        0.596194  1.724212  1.013886
internetuserate  1.010417  1.041558  1.025869
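
The same coefficients can be read as predicted probabilities rather than odds. As a minimal sketch, the helper below (`predicted_probability` is a name introduced here for illustration) plugs the rounded estimates from the summary above into the logistic function. The internet use rates of 0, 50, and 100 are arbitrary example inputs, and the results inherit the rounding of the printed coefficients.

```python
import math

# Rounded estimates copied from the lreg1 summary
INTERCEPT = 0.0138
SLOPE = 0.0255  # change in log-odds per one-unit increase in internetuserate


def predicted_probability(internetuserate):
    """Probability of falling in the upper polity-score bin under lreg1."""
    log_odds = INTERCEPT + SLOPE * internetuserate
    return 1.0 / (1.0 + math.exp(-log_odds))


for rate in (0, 50, 100):
    print(rate, round(predicted_probability(rate), 2))
```

Under these rounded values the predicted probability rises from about 0.50 at a rate of 0 to about 0.78 at 50 and about 0.93 at 100.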

Logistic regression with confounding variables

lreg2 = smf.logit(formula='polityscore_bins ~ internetuserate + employrate + urbanrate + armedforcesrate', data=sub1).fit()
lreg2.summary()
Logit Regression Results
Dep. Variable:   polityscore_bins    No. Observations:   148
Model:           Logit               Df Residuals:       143
Method:          MLE                 Df Model:           4
Date:            Fri, 18 Nov 2016    Pseudo R-squ.:      0.1908
Time:            17:03:47            Log-Likelihood:     -74.857
converged:       True                LL-Null:            -92.503
                                     LLR p-value:        4.046e-07

                    coef  std err       z   P>|z|  [95.0% Conf. Int.]
Intercept         4.2600    1.615   2.638   0.008    1.095      7.425
internetuserate   0.0364    0.012   3.015   0.003    0.013      0.060
employrate       -0.0515    0.022  -2.371   0.018   -0.094     -0.009
urbanrate        -0.0094    0.013  -0.750   0.453   -0.034      0.015
armedforcesrate  -0.6771    0.185  -3.651   0.000   -1.041     -0.314

odds_ratios(lreg2)
                 Lower CI     Upper CI         OR
Intercept        2.990108  1677.028382  70.813105
internetuserate  1.012822     1.061937   1.037088
employrate       0.910171     0.991097   0.949772
urbanrate        0.966505     1.015320   0.990612
armedforcesrate  0.353227     0.730799   0.508072

Summary

My hypothesis was that internet use rate would be fairly well correlated with a nation’s polity score.

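In my first logistic regression, internet use rate was significantly associated with having a polity score greater than zero (OR=1.03, 95% CI = 1.01-1.04, p=.001). I was surprised by how small the odds ratio looked, but it describes the change in odds for each one-unit increase in internet use rate, so it compounds over larger differences between countries. The confidence interval excludes 1, so this result supports my hypothesis.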
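
An odds ratio is expressed per one-unit change in the predictor, so a value close to 1 can still describe a sizable effect across a realistic range. The arithmetic below is a sketch using the rounded coefficient 0.0255 from the first model; the function name and the 10- and 50-unit differences are illustrative choices, not part of the original analysis.

```python
import math

# internetuserate coefficient from lreg1 (log-odds per one-unit increase)
COEF = 0.0255


def odds_ratio_for_difference(coef, difference):
    """Odds ratio implied by a `difference`-unit change in the predictor."""
    return math.exp(coef * difference)


for diff in (1, 10, 50):
    print(diff, round(odds_ratio_for_difference(COEF, diff), 2))
```

With this coefficient, a 10-unit difference in internet use rate corresponds to an odds ratio of roughly 1.29, and a 50-unit difference to roughly 3.58.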

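After adjusting for potential confounding factors (employment rate, urban rate, and armed forces rate), internet use rate remained significantly associated with a polity score greater than zero (OR=1.04, 95% CI = 1.01-1.06, p=.003). The odds of having a polity score greater than zero were roughly halved for each one-unit increase in armed forces rate (OR=0.51, 95% CI = 0.35-0.73, p<.001), and they were slightly lower for countries with a higher employment rate (OR=0.95, 95% CI = 0.91-0.99, p=.018). Urban rate was the only factor not significantly associated with the polity score (OR=0.99, 95% CI = 0.97-1.02, p=.453).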