Assignment: Running a Chi-Square Test of Independence

Program and outputs

%matplotlib inline
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
# Convert numeric types
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')

sub1 = data[['incomeperperson', 'internetuserate']].dropna()

print('Remaining number of countries: {0}'.format(len(sub1)))

# Since GDP per person isn't categorical data, I'm going to group it by magnitude first
groups = [pow(10, i) for i in range(2, 6)]
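# Pair each power-of-ten edge with the next one to build labels like '100 - 1000'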
labels = ['{0} - {1}'.format(groups[index], i) for index, i in enumerate(groups[1:])]
sub1['incomeperperson'] = pandas.cut(sub1['incomeperperson'], groups, right=False, labels=labels)

# Since Internet Use Rate isn't categorical, I need to group it first.
# From an earlier assignment, I know that ~20 is a noticeable cutoff for internet use rate, so let's use that.
# See: http://erikwiffin.github.io/data-visualization-course//2016/06/30/assignment-4/#internet-use-rate
groups = [0, 20, 100]
labels = ['Less than 20', 'Greater than 20']
sub1['internetuserate'] = pandas.cut(sub1['internetuserate'], groups, right=False, labels=labels)
Remaining number of countries: 183
# contingency table of observed counts
ct1 = pandas.crosstab(sub1['internetuserate'], sub1['incomeperperson'])
print('Contingency table of observed counts')
print('=' * 40)
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print('\nColumn percentages')
print('=' * 40)
print(colpct)

# chi-square
print('\nchi-square value, p value, expected counts')
print('=' * 40)
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# Make the data graphable again. sub1's internetuserate now holds category
# labels, so re-cut the original numeric column from data (it aligns on the
# index) and turn the bucket labels (0 and 20) back into numbers.
sub2 = sub1.copy()
sub2['incomeperperson'] = sub2['incomeperperson'].astype('category')
groups = [0, 20, 100]
sub2['internetuserate'] = pandas.cut(data['internetuserate'], groups, right=False, labels=[0, 20])
sub2['internetuserate'] = pandas.to_numeric(sub2['internetuserate'], errors='coerce')
seaborn.factorplot(x="incomeperperson", y="internetuserate", data=sub2, kind="bar", ci=None)
plt.xlabel('Income per person')
plt.ylabel('Internet use rate')
Contingency table of observed counts
========================================
incomeperperson  100 - 1000  1000 - 10000  10000 - 100000
internetuserate                                          
Less than 20             48            26               0
Greater than 20           4            60              45

Column percentages
========================================
incomeperperson  100 - 1000  1000 - 10000  10000 - 100000
internetuserate                                          
Less than 20       0.923077      0.302326             0.0
Greater than 20    0.076923      0.697674             1.0

chi-square value, p value, expected counts
========================================
(92.356982887599486, 8.8091901262367566e-21, 2, array([[ 21.0273224 ,  34.77595628,  18.19672131],
       [ 30.9726776 ,  51.22404372,  26.80327869]]))

[figure: bar chart of internet use rate by income per person group]

def recode(sub, recoding):
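    # Categories missing from `recoding` map to NaN, so the crosstab below
    # only compares the two groups named in the recoding dict.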
    sub['incomeperpersonV2'] = sub['incomeperperson'].map(recoding)
    
    # Header
    header = 'Comparing {0} and {1}'.format(*recoding.keys())
    print(header)
    print('=' * len(header) + '\n')

    # contingency table of observed counts
    ct = pandas.crosstab(sub['internetuserate'], sub['incomeperpersonV2'])
    print('Contingency table of observed counts')
    print('-' * len('Contingency table of observed counts'))
    print(str(ct) + '\n')

    # column percentages
    colsum = ct.sum(axis=0)
    colpct = ct/colsum
    print('Column percentages')
    print('-' * len('Column percentages'))
    print(str(colpct) + '\n')

    print('chi-square value, p value, expected counts')
    print('-' * len('chi-square value, p value, expected counts'))
    cs = scipy.stats.chi2_contingency(ct)
    print(str(cs) + '\n')
    
recode(sub1.copy(), {'100 - 1000': '100 - 1000', '1000 - 10000': '1000 - 10000'})
recode(sub1.copy(), {'100 - 1000': '100 - 1000', '10000 - 100000': '10000 - 100000'})
recode(sub1.copy(), {'1000 - 10000': '1000 - 10000', '10000 - 100000': '10000 - 100000'})
Comparing 100 - 1000 and 1000 - 10000
=====================================

Contingency table of observed counts
------------------------------------
incomeperpersonV2  100 - 1000  1000 - 10000
internetuserate                            
Less than 20               48            26
Greater than 20             4            60

Column percentages
------------------
incomeperpersonV2  100 - 1000  1000 - 10000
internetuserate                            
Less than 20         0.923077      0.302326
Greater than 20      0.076923      0.697674

chi-square value, p value, expected counts
------------------------------------------
(47.746562413123101, 4.8503395840347456e-12, 1, array([[ 27.88405797,  46.11594203],
       [ 24.11594203,  39.88405797]]))

Comparing 100 - 1000 and 10000 - 100000
=======================================

Contingency table of observed counts
------------------------------------
incomeperpersonV2  100 - 1000  10000 - 100000
internetuserate                              
Less than 20               48               0
Greater than 20             4              45

Column percentages
------------------
incomeperpersonV2  100 - 1000  10000 - 100000
internetuserate                              
Less than 20         0.923077             0.0
Greater than 20      0.076923             1.0

chi-square value, p value, expected counts
------------------------------------------
(78.577956612666426, 7.6902772386092302e-19, 1, array([[ 25.73195876,  22.26804124],
       [ 26.26804124,  22.73195876]]))

Comparing 1000 - 10000 and 10000 - 100000
=========================================

Contingency table of observed counts
------------------------------------
incomeperpersonV2  1000 - 10000  10000 - 100000
internetuserate                                
Less than 20                 26               0
Greater than 20              60              45

Column percentages
------------------
incomeperpersonV2  1000 - 10000  10000 - 100000
internetuserate                                
Less than 20           0.302326             0.0
Greater than 20        0.697674             1.0

chi-square value, p value, expected counts
------------------------------------------
(15.126175119023959, 0.00010055931031856319, 1, array([[ 17.06870229,   8.93129771],
       [ 68.93129771,  36.06870229]]))

Summary

With a Bonferroni-adjusted threshold of 0.05/3 ≈ 0.017, there is a significant difference in internet use rate between every pair of income categories: all three pairwise p-values fall well below the adjusted threshold.
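
The check is easy to make explicit; a minimal sketch, hard-coding the three post-hoc p-values printed above:

# Compare each pairwise chi-square p-value against the
# Bonferroni-adjusted threshold of 0.05 / 3.
p_values = {
    '100 - 1000 vs 1000 - 10000': 4.85e-12,
    '100 - 1000 vs 10000 - 100000': 7.69e-19,
    '1000 - 10000 vs 10000 - 100000': 1.01e-04,
}
alpha = 0.05 / len(p_values)

for comparison, p in p_values.items():
    verdict = 'significant' if p < alpha else 'not significant'
    print('{0}: p = {1:.2e} ({2})'.format(comparison, p, verdict))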

Assignment: Running an analysis of variance

Program and outputs

import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
# Convert numeric types
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')

sub1 = data[['incomeperperson', 'internetuserate']].dropna()

print('Remaining number of countries: {0}'.format(len(sub1)))

# Since GDP per person isn't categorical data, I'm going to group it by magnitude first
groups = [pow(10, i) for i in range(2, 7)]
labels = ['{0} - {1}'.format(groups[index], i) for index, i in enumerate(groups[1:])]
sub1['incomeperperson'] = pandas.cut(sub1['incomeperperson'], groups, right=False, labels=labels)
Remaining number of countries: 183
model = smf.ols(formula='internetuserate ~ C(incomeperperson)', data=sub1).fit()
print(model.summary())
print('\n'*2)

print('Means for internet use rate by income per person')
m = sub1.groupby('incomeperperson').mean()
print(m)
print('\n'*2)

print('standard deviations for internet use rate by income per person')
sd = sub1.groupby('incomeperperson').std()
print(sd)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.673
Method:                 Least Squares   F-statistic:                     188.3
Date:                Tue, 09 Aug 2016   Prob (F-statistic):           7.59e-45
Time:                        21:02:32   Log-Likelihood:                -765.55
No. Observations:                 183   AIC:                             1537.
Df Residuals:                     180   BIC:                             1547.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==========================================================================================================
                                             coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------
Intercept                                  8.3944      2.219      3.783      0.000         4.016    12.773
C(incomeperperson)[T.1000 - 10000]        24.2198      2.811      8.616      0.000        18.673    29.766
C(incomeperperson)[T.10000 - 100000]      62.8513      3.258     19.292      0.000        56.423    69.280
C(incomeperperson)[T.100000 - 1000000]          0          0        nan        nan             0         0
==============================================================================
Omnibus:                        3.852   Durbin-Watson:                   1.985
Prob(Omnibus):                  0.146   Jarque-Bera (JB):                3.473
Skew:                           0.326   Prob(JB):                        0.176
Kurtosis:                       3.175   Cond. No.                          inf
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.



Means for internet use rate by income per person
                  internetuserate
incomeperperson                  
100 - 1000               8.394409
1000 - 10000            32.614232
10000 - 100000          71.245728
100000 - 1000000              NaN



standard deviations for internet use rate by income per person
                  internetuserate
incomeperperson                  
100 - 1000               8.328892
1000 - 10000            19.402752
10000 - 100000          15.484270
100000 - 1000000              NaN
mc = multi.MultiComparison(sub1['internetuserate'], sub1['incomeperperson'])
res = mc.tukeyhsd()
print(res.summary())
     Multiple Comparison of Means - Tukey HSD,FWER=0.05    
===========================================================
   group1        group2     meandiff  lower   upper  reject
-----------------------------------------------------------
 100 - 1000   1000 - 10000  24.2198  17.5765 30.8631  True 
 100 - 1000  10000 - 100000 62.8513  55.1516  70.551  True 
1000 - 10000 10000 - 100000 38.6315  31.6736 45.5894  True 
-----------------------------------------------------------

Summary

With an F-test p-value of 7.59e-45, the overall effect is highly significant, and the Tukey HSD post hoc test rejected the null hypothesis for every pairwise comparison. Income per person appears to be a very strong predictor of internet use rate.
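
As a sanity check, the same F-test can be run directly with scipy; a minimal sketch, assuming sub1 is still in scope with its income categories:

import scipy.stats

# One-way ANOVA on internet use rate across the observed income groups,
# as a cross-check on the OLS F-statistic above.
groups = [g['internetuserate'].values
          for _, g in sub1.groupby('incomeperperson')
          if len(g) > 0]  # skip the empty 100000 - 1000000 category
f_stat, p_value = scipy.stats.f_oneway(*groups)
print('F = {0:.1f}, p = {1:.2e}'.format(f_stat, p_value))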

Assignment: Creating graphs for your data

Program and outputs

%matplotlib inline
import pandas
import numpy
import seaborn
from matplotlib import pyplot

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
# Convert numeric types
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')

# Drop any countries with missing GDP
data = data[pandas.notnull(data['incomeperperson'])]

print('Remaining number of countries: {0}'.format(len(data['incomeperperson'])))
Remaining number of countries: 190
# Since GDP per person isn't categorical data, I'm going to group it by magnitude first
groups = [pow(10, i) for i in range(2, 7)]
labels = ['{0} - {1}'.format(groups[index], i) for index, i in enumerate(groups[1:])]
grouped = pandas.cut(data['incomeperperson'], groups, right=False, labels=labels)

grouped = grouped.astype('category')
graph = seaborn.countplot(x=grouped)
pyplot.xlabel('Income per person')
pyplot.title('Income per person according to gapminder')

[figure: count of countries in each income per person group]

# Now do the above for all of my consumption types
types = [
    ('alcconsumption', 'Alcohol Consumption'),
    ('co2emissions', 'CO2 Emissions'),
    ('internetuserate', 'Internet Use Rate'),
    ('oilperperson', 'Oil per Person'),
    ('relectricperperson', 'Electricity per Person'),
]

# Convert to numeric
clean = data.copy()
for (key, name) in types:
    clean[key] = pandas.to_numeric(clean[key], errors='coerce')

def draw_distplot(series, name):
    # Drop NaNs
    series = series.dropna()

    # Draw a distplot
    graph = seaborn.distplot(series, kde=False)
    pyplot.xlabel(name)
    pyplot.title('{0} according to gapminder'.format(name))

def draw_regplot(data, y, name):
    # Draw a regplot
    seaborn.regplot(x='incomeperperson', y=y, data=data, logx=True)
    pyplot.xlabel('Income per person')
    pyplot.ylabel(name)
    pyplot.title('{0} according to gapminder'.format(name))

for (key, name) in types:
    draw_distplot(clean[key], name)
    pyplot.show()  # finish this figure before starting the next
    draw_regplot(clean, key, name)
    pyplot.show()

Alcohol Consumption

[figures: Alcohol Consumption distribution and regression against income per person]

Unimodal, skewed-right distribution.

CO2 Emissions

[figures: CO2 Emissions distribution and regression against income per person]

Unimodal, skewed-right distribution.

Internet Use Rate

[figures: Internet Use Rate distribution and regression against income per person]

Unimodal, skewed-right distribution.

Oil per Person

[figures: Oil per Person distribution and regression against income per person]

Unimodal, skewed-right distribution.

Electricity per Person

[figures: Electricity per Person distribution and regression against income per person]

Unimodal, skewed-right distribution.

Summary

All of my measured variables were unimodal and skewed right. Each showed some correlation with income per person; internet use rate was the most strongly correlated.
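
A quick way to quantify that impression is to compute the correlations directly; a minimal sketch, assuming the clean DataFrame and types list from above:

# Pearson correlation between income per person and each consumption
# variable, to back up the visual impression from the regression plots.
for (key, name) in types:
    paired = clean[['incomeperperson', key]].dropna()
    r = paired['incomeperperson'].corr(paired[key])
    print('{0}: r = {1:.2f}'.format(name, r))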

Assignment: Making Data Management Decisions

Program and outputs

import pandas
import numpy

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
# Convert numeric types
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')

# Drop any countries with missing GDP
data = data[pandas.notnull(data['incomeperperson'])]

print('Remaining number of countries: {0}'.format(len(data['incomeperperson'])))
Remaining number of countries: 190
# Since GDP per person isn't categorical data, I'm going to group it by magnitude first
groups = [pow(10, i) for i in range(2, 7)]
labels = ['{0} - {1}'.format(groups[index], i) for index, i in enumerate(groups[1:])]
print('Groups: {0}'.format(labels))
Groups: ['100 - 1000', '1000 - 10000', '10000 - 100000', '100000 - 1000000']
grouped = pandas.cut(data['incomeperperson'], groups, right=False, labels=labels)
print('Counts for GDP per person, grouped by magnitude:')
print(grouped.value_counts(sort=False))
print('\nPercentages for GDP per person, grouped by magnitude:')
print(grouped.value_counts(sort=False, normalize=True))
Counts for GDP per person, grouped by magnitude:
100 - 1000          54
1000 - 10000        89
10000 - 100000      46
100000 - 1000000     1
Name: incomeperperson, dtype: int64

Percentages for GDP per person, grouped by magnitude:
100 - 1000          0.284211
1000 - 10000        0.468421
10000 - 100000      0.242105
100000 - 1000000    0.005263
Name: incomeperperson, dtype: float64
# Now do the above for all of my consumption types
types = [
    ('alcconsumption', 'Alcohol Consumption'),
    ('co2emissions', 'CO2 Emissions'),
    ('internetuserate', 'Internet Use Rate'),
    ('oilperperson', 'Oil per Person'),
    ('relectricperperson', 'Electricity per Person'),
]
def summarize(series, name):
    # Convert to numeric and drop NaNs
    series = pandas.to_numeric(series, errors='coerce')
    series.dropna(inplace=True)

    grouped = pandas.qcut(series, 4)
    
    print(name)
    print('-' * len(name))
    
    print('Counts for {0} grouped by percentile:'.format(name))
    print(grouped.value_counts(sort=False))
    
    print('Percentages for {0}, grouped by percentile (should probably be 25%)'.format(name))
    print(grouped.value_counts(sort=False, normalize=True))

for (key, name) in types:
    summarize(data[key], name)
    print('\n')
Alcohol Consumption
-------------------
Counts for Alcohol Consumption grouped by percentile:
[0.05, 2.73]       45
(2.73, 6.12]       45
(6.12, 10.035]     44
(10.035, 23.01]    45
Name: alcconsumption, dtype: int64
Percentages for Alcohol Consumption, grouped by percentile (should probably be 25%)
[0.05, 2.73]       0.251397
(2.73, 6.12]       0.251397
(6.12, 10.035]     0.245810
(10.035, 23.01]    0.251397
Name: alcconsumption, dtype: float64


CO2 Emissions
-------------
Counts for CO2 Emissions grouped by percentile:
[850666.667, 39924500]            45
(39924500, 234864666.667]         45
(234864666.667, 2138961000]       44
(2138961000, 334220872333.333]    45
Name: co2emissions, dtype: int64
Percentages for CO2 Emissions, grouped by percentile (should probably be 25%)
[850666.667, 39924500]            0.251397
(39924500, 234864666.667]         0.251397
(234864666.667, 2138961000]       0.245810
(2138961000, 334220872333.333]    0.251397
Name: co2emissions, dtype: float64


Internet Use Rate
-----------------
Counts for Internet Use Rate grouped by percentile:
[0.21, 9.949]         46
(9.949, 31.00438]     46
(31.00438, 55.646]    45
(55.646, 95.638]      46
Name: internetuserate, dtype: int64
Percentages for Internet Use Rate, grouped by percentile (should probably be 25%)
[0.21, 9.949]         0.251366
(9.949, 31.00438]     0.251366
(31.00438, 55.646]    0.245902
(55.646, 95.638]      0.251366
Name: internetuserate, dtype: float64


Oil per Person
--------------
Counts for Oil per Person grouped by percentile:
[0.0323, 0.505]    16
(0.505, 0.891]     15
(0.891, 1.593]     15
(1.593, 12.229]    15
Name: oilperperson, dtype: int64
Percentages for Oil per Person, grouped by percentile (should probably be 25%)
[0.0323, 0.505]    0.262295
(0.505, 0.891]     0.245902
(0.891, 1.593]     0.245902
(1.593, 12.229]    0.245902
Name: oilperperson, dtype: float64


Electricity per Person
----------------------
Counts for Electricity per Person grouped by percentile:
[0, 226.318]             33
(226.318, 609.335]       32
(609.335, 1484.703]      32
(1484.703, 11154.755]    33
Name: relectricperperson, dtype: int64
Percentages for Electricity per Person, grouped by percentile (should probably be 25%)
[0, 226.318]             0.253846
(226.318, 609.335]       0.246154
(609.335, 1484.703]      0.246154
(1484.703, 11154.755]    0.253846
Name: relectricperperson, dtype: float64

Summary

I began by dropping any rows where GDP per capita was missing, since I’m looking to eventually compare that to my various consumption categories. Then I showed the output of GDP grouped by magnitude (100 - 1000, 1000 - 10000, 10000 - 100000, 100000 - 1000000) since those gave me the most meaningful groupings.

Once that was done, I wanted to summarize each of my consumption types. For those I used the qcut method explained in this week's lesson to break them each into quartiles.

Alcohol, CO2, and Internet use rate generally seem to have pretty thorough coverage of data (~180 countries), but oil and electricity per person are missing a lot more data. I’ll probably end up just using the first three types.
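
The difference between the two grouping approaches is easy to see on a toy series; a minimal illustration, not part of the original assignment:

import pandas

values = pandas.Series([120, 450, 900, 3000, 8000, 25000, 70000, 150000])

# cut: fixed, order-of-magnitude bin edges -> uneven counts per bin
print(pandas.cut(values, [100, 1000, 10000, 100000, 1000000]).value_counts(sort=False))

# qcut: quartiles -> (nearly) even counts of 2 per bin
print(pandas.qcut(values, 4).value_counts(sort=False))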

Assignment: Running Your First Program

Program and outputs

import pandas
import numpy

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))
Total number of countries: 213
# Convert numeric types; countries with missing GDP become NaN
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')

# Count only the countries that have a GDP value
print('Remaining number of countries: {0}'.format(data['incomeperperson'].count()))
Remaining number of countries: 190
# Since GDP per person isn't categorical data, I'm going to group it by magnitude first
groups = [pow(10, i) for i in range(2, 7)]
labels = ['{0} - {1}'.format(groups[index], i) for index, i in enumerate(groups[1:])]
print('Groups: {0}'.format(labels))
Groups: ['100 - 1000', '1000 - 10000', '10000 - 100000', '100000 - 1000000']
grouped = pandas.cut(data['incomeperperson'], groups, right=False, labels=labels)
print('Counts for GDP per person, grouped by magnitude:')
print(grouped.value_counts(sort=False))
print('\nPercentages for GDP per person, grouped by magnitude:')
print(grouped.value_counts(sort=False, normalize=True))
Counts for GDP per person, grouped by magnitude:
100 - 1000          54
1000 - 10000        89
10000 - 100000      46
100000 - 1000000     1
Name: incomeperperson, dtype: int64

Percentages for GDP per person, grouped by magnitude:
100 - 1000          0.284211
1000 - 10000        0.468421
10000 - 100000      0.242105
100000 - 1000000    0.005263
Name: incomeperperson, dtype: float64
# Now do the above for all of my consumption types
types = [
    ('alcconsumption', 'Alcohol Consumption'),
    ('co2emissions', 'CO2 Emissions'),
    ('internetuserate', 'Internet Use Rate'),
    ('oilperperson', 'Oil per Person'),
    ('relectricperperson', 'Electricity per Person'),
]
def summarize(series, name):
    # Convert to numeric and drop NaNs
    series = pandas.to_numeric(series, errors='coerce')
    series.dropna(inplace=True)

    percentiles = numpy.linspace(0, 1, 5)
    groups = list(series.quantile(percentiles))
    labels = ['{0} - {1}'.format(groups[index], i) for index, i in enumerate(groups[1:])]
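    # Note: with right=False, the top bin excludes the series maximum,
    # so the counts below sum to one less than the number of countries.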
    grouped = pandas.cut(series, groups, right=False, labels=labels)
    
    print(name)
    print('-' * len(name))
    
    print('Counts for {0} grouped by percentile:'.format(name))
    print(grouped.value_counts(sort=False))
    
    print('Percentages for {0}, grouped by percentile (should probably be 25%)'.format(name))
    print(grouped.value_counts(sort=False, normalize=True))

for (key, name) in types:
    summarize(data[key], name)
    print('\n')
Alcohol Consumption
-------------------
Counts for Alcohol Consumption grouped by percentile:
0.03 - 2.625     47
2.625 - 5.92     45
5.92 - 9.925     48
9.925 - 23.01    46
Name: alcconsumption, dtype: int64
Percentages for Alcohol Consumption, grouped by percentile (should probably be 25%)
0.03 - 2.625     0.252688
2.625 - 5.92     0.241935
5.92 - 9.925     0.258065
9.925 - 23.01    0.247312
Name: alcconsumption, dtype: float64


CO2 Emissions
-------------
Counts for CO2 Emissions grouped by percentile:
132000.0 - 34846166.66666667             50
34846166.66666667 - 185901833.3333335    50
185901833.3333335 - 1846084166.666665    50
1846084166.666665 - 334220872333.333     49
Name: co2emissions, dtype: int64
Percentages for CO2 Emissions, grouped by percentile (should probably be 25%)
132000.0 - 34846166.66666667             0.251256
34846166.66666667 - 185901833.3333335    0.251256
185901833.3333335 - 1846084166.666665    0.251256
1846084166.666665 - 334220872333.333     0.246231
Name: co2emissions, dtype: float64


Internet Use Rate
-----------------
Counts for Internet Use Rate grouped by percentile:
0.210066325622776 - 9.999603951038267    48
9.999603951038267 - 31.81012075468915    48
31.81012075468915 - 56.41604586287351    48
56.41604586287351 - 95.6381132075472     47
Name: internetuserate, dtype: int64
Percentages for Internet Use Rate, grouped by percentile (should probably be 25%)
0.210066325622776 - 9.999603951038267    0.251309
9.999603951038267 - 31.81012075468915    0.251309
31.81012075468915 - 56.41604586287351    0.251309
56.41604586287351 - 95.6381132075472     0.246073
Name: internetuserate, dtype: float64


Oil per Person
--------------
Counts for Oil per Person grouped by percentile:
0.03228146619272 - 0.5325414918259135      16
0.5325414918259135 - 1.03246988375935      15
1.03246988375935 - 1.6227370046323601      16
1.6227370046323601 - 12.228644991426199    15
Name: oilperperson, dtype: int64
Percentages for Oil per Person, grouped by percentile (should probably be 25%)
0.03228146619272 - 0.5325414918259135      0.258065
0.5325414918259135 - 1.03246988375935      0.241935
1.03246988375935 - 1.6227370046323601      0.258065
1.6227370046323601 - 12.228644991426199    0.241935
Name: oilperperson, dtype: float64


Electricity per Person
----------------------
Counts for Electricity per Person grouped by percentile:
0.0 - 203.65210850945525                  34
203.65210850945525 - 597.1364359554304    34
597.1364359554304 - 1491.145248925905     34
1491.145248925905 - 11154.7550328078      33
Name: relectricperperson, dtype: int64
Percentages for Electricity per Person, grouped by percentile (should probably be 25%)
0.0 - 203.65210850945525                  0.251852
203.65210850945525 - 597.1364359554304    0.251852
597.1364359554304 - 1491.145248925905     0.251852
1491.145248925905 - 11154.7550328078      0.244444
Name: relectricperperson, dtype: float64

Conclusions

Since my data wasn’t categorical, it was a bit tricky to make it all work with value_counts. However, that gave me the opportunity to learn a bit more about the pandas library, and how to create categories out of non-categorical data.

One thing I noticed pretty quickly was that for both GDP and my various consumption variables, once the values start growing, they grow very quickly. For example, most alcohol consumption values centered around 5 liters, but at the high end they reached 23 liters.
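
That long tail is easy to confirm numerically; a minimal sketch, assuming data as loaded above (the quartile edges match the outputs earlier in this assignment):

# Quantiles show the long right tail: the median sits near 6 liters
# while the maximum reaches 23 liters.
alc = pandas.to_numeric(data['alcconsumption'], errors='coerce')
print(alc.quantile([0.25, 0.5, 0.75, 1.0]))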