Assignment: Running a Random Forest

Program and outputs

Data loading and cleaning

import pandas
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
import sklearn.metrics
import numpy as np
import matplotlib.pyplot as plt

CSV_PATH = 'gapminder.csv'

data = pandas.read_csv(CSV_PATH)
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
PREDICTORS = [
    'incomeperperson', 'alcconsumption', 'armedforcesrate',
    'breastcancerper100th', 'co2emissions', 'femaleemployrate',
    'hivrate', 'internetuserate', 'oilperperson',
    'polityscore', 'relectricperperson', 'suicideper100th',
    'employrate', 'urbanrate'
]

clean = data.copy()

# Whitespace-only cells are really missing values: convert them to NaN,
# then drop every row that lacks a predictor or the target.
clean = clean.replace(r'\s+', np.nan, regex=True)
clean = clean[PREDICTORS + ['lifeexpectancy']].dropna()

# Convert the remaining string columns to numbers, then truncate to
# integers so the classifier sees discrete labels.
for key in PREDICTORS + ['lifeexpectancy']:
    clean[key] = pandas.to_numeric(clean[key], errors='coerce', downcast='integer')

clean = clean.astype(int)
predictors = clean[PREDICTORS]
targets = clean.lifeexpectancy
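
Most of the dataset disappears in that dropna. As a quick side check (not part of the original assignment), counting missing values per column under the same whitespace-as-NaN convention shows which predictors cost the most rows:

# Per-column missing counts before any rows are dropped.
missing = (
    data[PREDICTORS + ['lifeexpectancy']]
    .replace(r'\s+', np.nan, regex=True)
    .apply(pandas.to_numeric, errors='coerce')
    .isna()
    .sum()
    .sort_values(ascending=False)
)
print(missing)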

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)
(33, 14) (23, 14) (33,) (23,)
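Because train_test_split shuffles at random, these shapes are stable but every score below changes from run to run. For a repeatable experiment one could pin the seed (the value 0 here is arbitrary):

# Fixing random_state makes the split, and thus the scores, reproducible.
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4, random_state=0)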
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print('Confusion matrix:', sklearn.metrics.confusion_matrix(tar_test, predictions), sep='\n')
print('Accuracy score:', sklearn.metrics.accuracy_score(tar_test, predictions))
Confusion matrix:
[[0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 1 0 0 0 0 0 0]
 [0 0 1 0 3 0 2 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 2 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 4 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0]]
Accuracy score: 0.173913043478
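Because lifeexpectancy was truncated to whole years, the classifier only scores when it predicts the exact year, which makes raw accuracy a harsh metric here. As a rough sanity check (not part of the original assignment), treating the predicted labels as numbers gives a more forgiving read:

# Fraction of predictions within two years of the truth, and the mean
# absolute error in years, treating class labels as numeric values.
print('Within 2 years:', np.mean(np.abs(predictions - tar_test) <= 2))
print('MAE:', sklearn.metrics.mean_absolute_error(tar_test, predictions))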
# `table` is a small local helper that prints (name, value) pairs as a table.
from table import table

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)

# display the relative importance of each attribute
importances = list(zip(PREDICTORS, model.feature_importances_))
importances = sorted(importances, key=lambda x: x[1], reverse=True)

table(importances)
urbanrate               0.126317199412
breastcancerper100th    0.105117665487
incomeperperson         0.0948448657848
internetuserate         0.0771414112204
employrate              0.0757854973973
suicideper100th         0.073923975227
alcconsumption          0.0678999029417
armedforcesrate         0.0631145170619
polityscore             0.0601258137948
co2emissions            0.0597650190565
femaleemployrate        0.0555015579106
oilperperson            0.0544900713322
relectricperperson      0.0449603576248
hivrate                 0.041012145749
trees = range(25)
accuracy = np.zeros(25)

# Refit the forest with 1 through 25 trees, recording test accuracy each time.
for idx in trees:
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)

[Plot: test-set accuracy versus number of trees in the forest]

Summary

The gapminder dataset continues to be fairly unimpressive with this form of analysis. The accuracy score of 0.17 indicates that the random forest has very little power for predicting exact life expectancy. I suspect there simply aren't enough datapoints left once the dataset has been cleaned: only 56 of the 213 countries survive the dropna.

Urban rate, incidence of breast cancer, and income per person were the most important predictors in this run, but after repeating the analysis a few times I found that the importances varied widely; nothing consistently stood out as a predictor.
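
One way to firm up that impression would be to average the importances over several seeded fits and look at the spread; a stable predictor would keep a high mean with a small standard deviation. A minimal sketch, assuming ten refits is enough to see the pattern:

# Average Extra Trees feature importances over ten seeded refits.
runs = np.zeros((10, len(PREDICTORS)))
for seed in range(10):
    m = ExtraTreesClassifier(random_state=seed)
    m.fit(pred_train, tar_train)
    runs[seed] = m.feature_importances_

for name, mean, std in sorted(zip(PREDICTORS, runs.mean(axis=0), runs.std(axis=0)),
                              key=lambda x: x[1], reverse=True):
    print('{0}: {1:.3f} +/- {2:.3f}'.format(name, mean, std))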

Running the classifier with different numbers of trees was interesting. Accuracy generally improved as trees were added and seemed to level off around ten, but there is enough run-to-run variance in the curve that I don't think any meaningful conclusion can be drawn from it.
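
To separate that trend from the noise, one could average the accuracy at each forest size over several refits, along these lines (the repeat count of ten is arbitrary):

# Smooth the accuracy-vs-trees curve by averaging over seeded refits.
mean_accuracy = np.zeros(25)
for idx in trees:
    scores = [sklearn.metrics.accuracy_score(
                  tar_test,
                  RandomForestClassifier(n_estimators=idx + 1, random_state=seed)
                  .fit(pred_train, tar_train)
                  .predict(pred_test))
              for seed in range(10)]
    mean_accuracy[idx] = np.mean(scores)

plt.cla()
plt.plot(trees, mean_accuracy)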