Assignment: Running Your First Program

18 Jun 2016

Program and outputs

import pandas
import numpy

data = pandas.read_csv('gapminder.csv')
print('Total number of countries: {0}'.format(len(data)))

Total number of countries: 213

# Convert to numeric (unparseable entries become NaN), then drop the NaNs.
# dropna() returns a new Series, so keep the result instead of relying on inplace=True.
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
income = data['incomeperperson'].dropna()

print('Remaining number of countries: {0}'.format(len(income)))

Remaining number of countries: 190
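
The drop from 213 to 190 comes from entries that could not be parsed as numbers. As a minimal sketch of that step, here is the same conversion on a small made-up Series (the values and the 'n/a' marker are assumptions for illustration, not taken from gapminder.csv):

```python
import pandas

# Hypothetical raw column: numbers stored as strings, with two unparseable entries
raw = pandas.Series(['1200.5', ' ', '310', 'n/a', '45000'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numeric = pandas.to_numeric(raw, errors='coerce')

# dropna() returns a new Series with the NaN rows removed
clean = numeric.dropna()

print('Before: {0}, after: {1}'.format(len(raw), len(clean)))
```

Assigning the result of dropna() to a new name keeps the original column untouched, which matters here because the other columns of the same rows are still needed later.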

# GDP per person is continuous rather than categorical, so group it by order of magnitude first
groups = [10 ** i for i in range(2, 7)]
labels = ['{0} - {1}'.format(low, high) for low, high in zip(groups, groups[1:])]
print('Groups: {0}'.format(labels))

Groups: ['100 - 1000', '1000 - 10000', '10000 - 100000', '100000 - 1000000']

grouped = pandas.cut(data['incomeperperson'], groups, right=False, labels=labels)
print('Counts for GDP per person, grouped by magnitude:')
print(grouped.value_counts(sort=False))
print('\nPercentages for GDP per person, grouped by magnitude:')
print(grouped.value_counts(sort=False, normalize=True))

Counts for GDP per person, grouped by magnitude:
100 - 1000          54
1000 - 10000        89
10000 - 100000      46
100000 - 1000000     1
Name: incomeperperson, dtype: int64

Percentages for GDP per person, grouped by magnitude:
100 - 1000          0.284211
1000 - 10000        0.468421
10000 - 100000      0.242105
100000 - 1000000    0.005263
Name: incomeperperson, dtype: float64
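
To make the binning behaviour concrete, here is a small sketch of pandas.cut with right=False on made-up incomes that sit on and around the bin edges (the values are hypothetical, chosen only to show which side of each edge is included):

```python
import pandas

# Hypothetical incomes, not taken from gapminder.csv
incomes = pandas.Series([100, 999, 1000, 25000, 99999.9])

edges = [10 ** i for i in range(2, 6)]  # 100, 1000, 10000, 100000
labels = ['{0} - {1}'.format(low, high) for low, high in zip(edges, edges[1:])]

# right=False makes each bin closed on the left and open on the right:
# [100, 1000), [1000, 10000), [10000, 100000)
binned = pandas.cut(incomes, edges, right=False, labels=labels)

print(binned.value_counts(sort=False))
```

With half-open bins, a value of exactly 1000 lands in '1000 - 10000' rather than '100 - 1000', which is what the labels suggest to a reader.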

# Now do the above for all of my consumption types
types = [
    ('alcconsumption', 'Alcohol Consumption'),
    ('co2emissions', 'CO2 Emissions'),
    ('internetuserate', 'Internet Use Rate'),
    ('oilperperson', 'Oil per Person'),
    ('relectricperperson', 'Electricity per Person'),
]
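def summarize(series, name):
    # Convert to numeric (unparseable entries become NaN) and drop the NaNs
    series = pandas.to_numeric(series, errors='coerce').dropna()

    # Quartile boundaries: minimum, 25th, 50th and 75th percentiles, maximum
    percentiles = numpy.linspace(0, 1, 5)
    groups = list(series.quantile(percentiles))
    labels = ['{0} - {1}'.format(low, high) for low, high in zip(groups, groups[1:])]
    # right=False makes the bins half-open, [low, high), so the maximum value
    # falls outside the last bin and is left out of the counts below
    grouped = pandas.cut(series, groups, right=False, labels=labels)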
    
    print(name)
    print('-' * len(name))
    
    print('Counts for {0} grouped by percentile:'.format(name))
    print(grouped.value_counts(sort=False))
    
    print('Percentages for {0}, grouped by percentile (should probably be 25%)'.format(name))
    print(grouped.value_counts(sort=False, normalize=True))

for (key, name) in types:
    summarize(data[key], name)
    print('\n')

Alcohol Consumption
-------------------
Counts for Alcohol Consumption grouped by percentile:
0.03 - 2.625     47
2.625 - 5.92     45
5.92 - 9.925     48
9.925 - 23.01    46
Name: alcconsumption, dtype: int64
Percentages for Alcohol Consumption, grouped by percentile (should probably be 25%)
0.03 - 2.625     0.252688
2.625 - 5.92     0.241935
5.92 - 9.925     0.258065
9.925 - 23.01    0.247312
Name: alcconsumption, dtype: float64


CO2 Emissions
-------------
Counts for CO2 Emissions grouped by percentile:
132000.0 - 34846166.66666667             50
34846166.66666667 - 185901833.3333335    50
185901833.3333335 - 1846084166.666665    50
1846084166.666665 - 334220872333.333     49
Name: co2emissions, dtype: int64
Percentages for CO2 Emissions, grouped by percentile (should probably be 25%)
132000.0 - 34846166.66666667             0.251256
34846166.66666667 - 185901833.3333335    0.251256
185901833.3333335 - 1846084166.666665    0.251256
1846084166.666665 - 334220872333.333     0.246231
Name: co2emissions, dtype: float64


Internet Use Rate
-----------------
Counts for Internet Use Rate grouped by percentile:
0.210066325622776 - 9.999603951038267    48
9.999603951038267 - 31.81012075468915    48
31.81012075468915 - 56.41604586287351    48
56.41604586287351 - 95.6381132075472     47
Name: internetuserate, dtype: int64
Percentages for Internet Use Rate, grouped by percentile (should probably be 25%)
0.210066325622776 - 9.999603951038267    0.251309
9.999603951038267 - 31.81012075468915    0.251309
31.81012075468915 - 56.41604586287351    0.251309
56.41604586287351 - 95.6381132075472     0.246073
Name: internetuserate, dtype: float64


Oil per Person
--------------
Counts for Oil per Person grouped by percentile:
0.03228146619272 - 0.5325414918259135      16
0.5325414918259135 - 1.03246988375935      15
1.03246988375935 - 1.6227370046323601      16
1.6227370046323601 - 12.228644991426199    15
Name: oilperperson, dtype: int64
Percentages for Oil per Person, grouped by percentile (should probably be 25%)
0.03228146619272 - 0.5325414918259135      0.258065
0.5325414918259135 - 1.03246988375935      0.241935
1.03246988375935 - 1.6227370046323601      0.258065
1.6227370046323601 - 12.228644991426199    0.241935
Name: oilperperson, dtype: float64


Electricity per Person
----------------------
Counts for Electricity per Person grouped by percentile:
0.0 - 203.65210850945525                  34
203.65210850945525 - 597.1364359554304    34
597.1364359554304 - 1491.145248925905     34
1491.145248925905 - 11154.7550328078      33
Name: relectricperperson, dtype: int64
Percentages for Electricity per Person, grouped by percentile (should probably be 25%)
0.0 - 203.65210850945525                  0.251852
203.65210850945525 - 597.1364359554304    0.251852
597.1364359554304 - 1491.145248925905     0.251852
1491.145248925905 - 11154.7550328078      0.244444
Name: relectricperperson, dtype: float64
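
In each table the top bin is slightly smaller than an even split would give. That is consistent with right=False leaving the maximum value outside the last half-open bin. As a sketch of one alternative, pandas.qcut computes the quantile edges itself and keeps both extremes. The values below are hypothetical, not taken from gapminder.csv:

```python
import pandas

# Hypothetical values for illustration only
values = pandas.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Quartile edges built the same way as in summarize(): min, 25%, 50%, 75%, max
edges = list(values.quantile([0, 0.25, 0.5, 0.75, 1]))

# Half-open bins [low, high) leave the maximum (8.0) without a bin
half_open = pandas.cut(values, edges, right=False)

# qcut splits into 4 quantile-based bins and assigns every value
quartiles = pandas.qcut(values, 4)

print('Unassigned with cut(right=False): {0}'.format(half_open.isna().sum()))
print('Unassigned with qcut: {0}'.format(quartiles.isna().sum()))
print(quartiles.value_counts(sort=False))
```

Here cut leaves one value unassigned while qcut leaves none, and each qcut bin holds two of the eight values. Note that qcut generates its own interval labels, so the custom 'low - high' labels would need to be passed in separately.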

Conclusions

Since my data wasn’t categorical, it was a bit tricky to make it all work with value_counts. However, that gave me the opportunity to learn a bit more about the pandas library and how to create categories out of non-categorical data.

One thing I noticed pretty quickly was that, for both GDP and my various consumption variables, the values grow very quickly toward the high end. For example, half of the countries have an alcohol consumption between roughly 2.6 and 9.9 liters, with the median near 5.9 liters, but the top quartile stretches as far as 23 liters.
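
One rough way to put a number on that stretch is to compare how wide each quartile is. The sketch below uses the alcohol consumption quartile edges printed in the output above; the width comparison itself is an illustration added here, not part of the original program:

```python
import numpy

# Quartile edges for alcohol consumption, copied from the program output
edges = numpy.array([0.03, 2.625, 5.92, 9.925, 23.01])

# Width of each quartile: the distance between consecutive edges
widths = numpy.diff(edges)

for low, high, width in zip(edges[:-1], edges[1:], widths):
    print('{0} - {1}: width {2:.3f}'.format(low, high, width))

# How much wider the top quartile is than the widest of the other three
ratio = widths[-1] / widths[:-1].max()
print('Top quartile is {0:.1f}x wider'.format(ratio))
```

Each quartile holds about the same number of countries, so a top quartile that is more than three times wider than any other means the largest values are spread much more thinly.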
