2  Course foundations: data management

Now that we are familiar with the basics of Python syntax and how to interact with it, it’s time to learn how to manage and visualize data. In this section, we’ll explore widely used libraries such as Pandas, which enable us to handle and analyze data effectively. We’ll get to know one of the fundamental data structures in Pandas, the DataFrame. Finally, we’ll learn some visualization techniques using Seaborn. Again, if you are already familiar with these concepts, feel free to move to the next section.

2.1 Data management with Pandas

Pandas is a Python library designed for efficient and practical data analysis tasks (cleaning, manipulating, and analyzing data). Unlike NumPy, which is designed to work with numerical arrays, Pandas is designed to handle relational or labeled data (i.e., data that has been given context or meaning through labels).

Pandas and NumPy share some similarities, as Pandas builds on NumPy. However, rather than arrays, the main object for 2D data management in Pandas is the DataFrame (equivalent to data.frame in R). DataFrames are essentially matrices with labeled rows and columns that can accommodate mixed data types and missing values. Row labels are known as the index (which starts from 0 by default) and column labels as column names.

Pandas is a very good tool for:

  • Dealing with missing data
  • Adding or deleting columns
  • Aligning data on labels or not on labels (i.e., merging and joining)
  • Performing grouped operations on data sets
  • Transforming other Python data structures into DataFrames
  • Multi-indexing hierarchically
  • Loading data from multiple file types
  • Handling time-series data, as it has specific time-series functionalities

2.1.1 Data structures

Pandas builds on two main data structures: Series and DataFrames. A Series represents a 1D array, while a DataFrame is a 2D labeled array. The easiest way to understand these structures is to think of DataFrames as containers for lower-dimensional data. Specifically, the columns of a DataFrame are composed of Series, and each element within a Series (i.e., the rows of the DataFrame) is an individual scalar value, such as a number or a string. In simple terms, Series are columns made of scalar elements, and DataFrames are collections of Series with assigned labels. The image below represents a DataFrame.

Source: Pandas documentation
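To make this container idea concrete, here is a minimal sketch (not from the original notebook; the names and values are illustrative) that builds a DataFrame from two Series:

import pandas

ages = pandas.Series([29, 31], name='Age')               # a 1D labeled array of scalars
cities = pandas.Series(['Madrid', 'Bern'], name='City')

# Each Series becomes one labeled column of the resulting DataFrame
people = pandas.DataFrame({'Age': ages, 'City': cities})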

All Pandas data structures are value-mutable (i.e., we can change the values of their elements), but not all of them are size-mutable: the length of a Series cannot be changed, whereas columns can be inserted into a DataFrame.
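Continuing the sketch above, a quick illustration of this distinction (made-up values):

s = pandas.Series([1, 2, 3])
s[0] = 99                                    # value mutation: allowed
# There is no in-place append for a Series; 'growing' it returns a new object:
s_longer = pandas.concat([s, pandas.Series([4])], ignore_index=True)

df = pandas.DataFrame({'A': [1, 2, 3]})
df['B'] = [4, 5, 6]                          # inserting a column changes the DataFrame's size in place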

2.1.2 Creating a DataFrame

There are two primary methods to create DataFrames and Series in Pandas. Firstly, they can be constructed by transforming built-in Python data structures like lists or dictionaries. Secondly, they can be generated by importing data, for example, by reading .csv files. Let’s begin by examining how to create them from Python structures.

For any code that uses Pandas, just as with NumPy, we first have to import it. In the second line of the cell below, I modify the display options so that we can see all columns of a dataframe; otherwise, there is a default limit and some columns get collapsed in previews.

import pandas
pandas.set_option('display.max_columns', None)

The first example illustrates how to create a DataFrame from a dictionary.

phds = {'Name': ['Joël Marbet', 'Alba Miñano-Mañero','Dmitri Kirpichev',
                'Yang Xun','Utso Pal Mustafi',
                'Christian Maruthiah','Kazuharu Yanagimoto'], 
        'Undergrad University': ['University of Bern', 'University of València',
                                'Universidad Complutense de Madrid',
                                'Renmin University of China',
                                'Presidency University','University of Melbourne','University of Tokyo'],
        'Fields':['Monetary Economics','Urban Economics','Applied Economics',
                'Development Economics','Macro-Finance',
                'Development Economics','Macroeconomics'],
        'PhD Desk': [20,10,11,15,18,23,5]}
phd_df = pandas.DataFrame(phds)
phd_df
Name Undergrad University Fields PhD Desk
0 Joël Marbet University of Bern Monetary Economics 20
1 Alba Miñano-Mañero University of València Urban Economics 10
2 Dmitri Kirpichev Universidad Complutense de Madrid Applied Economics 11
3 Yang Xun Renmin University of China Development Economics 15
4 Utso Pal Mustafi Presidency University Macro-Finance 18
5 Christian Maruthiah University of Melbourne Development Economics 23
6 Kazuharu Yanagimoto University of Tokyo Macroeconomics 5

To find out how many rows and columns a dataframe has, we can call the .shape attribute. For instance, the dataframe we just created has 7 rows and 4 columns.

phd_df.shape
(7, 4)

We can work with single columns or subset our dataframe to a set of columns by calling the column names in square brackets []. The returned column will be of the pandas.Series type.

phd_df['Name']
0            Joël Marbet
1     Alba Miñano-Mañero
2       Dmitri Kirpichev
3               Yang Xun
4       Utso Pal Mustafi
5    Christian Maruthiah
6    Kazuharu Yanagimoto
Name: Name, dtype: object
type(phd_df['Name'])
pandas.core.series.Series
My tip

Using double brackets ([[]]) to subset a DataFrame returns a DataFrame rather than a Series. This is because the inner pair of brackets creates a list of columns, while the outer pair selects data from the DataFrame.

phd_df[['Name']]
Name
0 Joël Marbet
1 Alba Miñano-Mañero
2 Dmitri Kirpichev
3 Yang Xun
4 Utso Pal Mustafi
5 Christian Maruthiah
6 Kazuharu Yanagimoto
type(phd_df[['Name']])
pandas.core.frame.DataFrame

Column names can be accessed by calling the .columns attribute of a dataframe.

phd_df.columns
Index(['Name', 'Undergrad University', 'Fields', 'PhD Desk'], dtype='object')

With this in mind, we can copy a selection of our DataFrame into a new one. We use the .copy() method to generate a copy of the data and avoid propagating any change back to the original frame we are subsetting.

phd_df2 = phd_df[['Name','Fields']].copy()
phd_df2
Name Fields
0 Joël Marbet Monetary Economics
1 Alba Miñano-Mañero Urban Economics
2 Dmitri Kirpichev Applied Economics
3 Yang Xun Development Economics
4 Utso Pal Mustafi Macro-Finance
5 Christian Maruthiah Development Economics
6 Kazuharu Yanagimoto Macroeconomics

We can also create a dataframe from other Python data structures. In the example below, the same DataFrame is created from a list of lists. If we do not specify the column names, Pandas just gives the columns numeric labels from 0 to \(n-1\).

phds_list = [
    ['Joël Marbet', 'University of Bern', 'Monetary Economics', 20],
    ['Alba Miñano-Mañero', 'University of València', 'Urban Economics', 10],
    ['Dmitri Kirpichev', 'Universidad Complutense de Madrid', 
        'Applied Economics', 11],
    ['Yang Xun', 'Renmin University of China', 'Development Economics', 15],
    ['Utso Pal Mustafi', 'Presidency University', 'Macro-Finance', 18],
    ['Christian Maruthiah', 'University of Melbourne', 
        'Development Economics', 23],
    ['Kazuharu Yanagimoto', 'University of Tokyo', 'Macroeconomics', 5]
]
# Create a list of column names
column_names_list = ['Name', 'Undergraduate University', 'Fields','PhD desk']

# Create a DataFrame from the list, using column_names_list as the column names
df = pandas.DataFrame(phds_list, columns=column_names_list)
df
Name Undergraduate University Fields PhD desk
0 Joël Marbet University of Bern Monetary Economics 20
1 Alba Miñano-Mañero University of València Urban Economics 10
2 Dmitri Kirpichev Universidad Complutense de Madrid Applied Economics 11
3 Yang Xun Renmin University of China Development Economics 15
4 Utso Pal Mustafi Presidency University Macro-Finance 18
5 Christian Maruthiah University of Melbourne Development Economics 23
6 Kazuharu Yanagimoto University of Tokyo Macroeconomics 5
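As a quick check of the default labels mentioned above (illustrative data):

# Without a columns argument, Pandas labels the columns 0, 1, ..., n-1
df_nolabels = pandas.DataFrame([['a', 1], ['b', 2]])
df_nolabels.columns
RangeIndex(start=0, stop=2, step=1)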

A DataFrame can also be created starting from a tuple:

phds_tu = (
    ('Joël Marbet', 'University of Bern', 'Monetary Economics', 20),
    ('Alba Miñano-Mañero', 'University of València', 'Urban Economics', 10),
    ('Dmitri Kirpichev', 'Universidad Complutense de Madrid', 
        'Applied Economics', 11),
    ('Yang Xun', 'Renmin University of China', 'Development Economics', 15),
    ('Utso Pal Mustafi', 'Presidency University', 'Macro-Finance', 18),
    ('Christian Maruthiah', 'University of Melbourne', 
        'Development Economics', 23),
    ('Kazuharu Yanagimoto', 'University of Tokyo', 'Macroeconomics', 5)
)
# Create a list of column names
column_names_list = ['Name', 'Undergraduate University', 'Fields','PhD desk']

# Create a DataFrame from the tuple of tuples, using column_names_list as the column names
df = pandas.DataFrame(phds_tu, columns=column_names_list)
df
Name Undergraduate University Fields PhD desk
0 Joël Marbet University of Bern Monetary Economics 20
1 Alba Miñano-Mañero University of València Urban Economics 10
2 Dmitri Kirpichev Universidad Complutense de Madrid Applied Economics 11
3 Yang Xun Renmin University of China Development Economics 15
4 Utso Pal Mustafi Presidency University Macro-Finance 18
5 Christian Maruthiah University of Melbourne Development Economics 23
6 Kazuharu Yanagimoto University of Tokyo Macroeconomics 5
type(phds_tu)
tuple

A NumPy array can also serve as the data source for a DataFrame:

import numpy as np 
data_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a DataFrame from the NumPy array
df_arr = pandas.DataFrame(data_arr, columns=['A', 'B', 'C'])

df_arr
A B C
0 1 2 3
1 4 5 6
2 7 8 9

Similarly, we can also transform built-in structures into a pandas.Series:

week_series= pandas.Series(['Monday','Tuesday','Wednesday']) # from a list
week_series
0       Monday
1      Tuesday
2    Wednesday
dtype: object

The previous line is equivalent to first creating a list and then passing it as an argument to pandas.Series. We can also specify the row index, which can be very useful for join and concatenate operations. Additionally, we can specify a name:

week_list = ['Monday','Tuesday','Wednesday']
week_l_to_series = pandas.Series(week_list,index=range(3,6),name='Week')
week_l_to_series
3       Monday
4      Tuesday
5    Wednesday
Name: Week, dtype: object

We could convert that Series to a pandas.DataFrame that has the column name ‘Week’ and row index (3, 4, 5):

pandas.DataFrame(week_l_to_series)
Week
3 Monday
4 Tuesday
5 Wednesday
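An equivalent, slightly more idiomatic route is the Series method .to_frame(), which uses the Series name as the column name:

week_l_to_series.to_frame()   # same one-column DataFrame, with column ‘Week’ and index 3, 4, 5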

Similarly, Series can also be created from dictionaries as shown below:

weekdays_s = {
    'Monday': 'Work',
    'Tuesday': 'Work',
    'Wednesday': 'Work',
    'Thursday': 'Work',
    'Friday': 'Work',
    'Saturday': 'Weekend',
    'Sunday': 'Weekend'
}
dic_to_series = pandas.Series(weekdays_s,name='Day_type')
dic_to_series
Monday          Work
Tuesday         Work
Wednesday       Work
Thursday        Work
Friday          Work
Saturday     Weekend
Sunday       Weekend
Name: Day_type, dtype: object

Most Pandas operations and methods return DataFrames or Series. For instance, the describe() method, which computes basic statistics of numerical data, returns either a DataFrame or a Series depending on what it is called on.

df_arr['A'].describe()
count    3.0
mean     4.0
std      3.0
min      1.0
25%      2.5
50%      4.0
75%      5.5
max      7.0
Name: A, dtype: float64
x = df_arr['A'].describe() # describe on a Series returns a Series
type(x)
pandas.core.series.Series
x = df_arr[['A']].describe()  # describe on a DataFrame returns a DataFrame
type(x)
pandas.core.frame.DataFrame
x
A
count 3.0
mean 4.0
std 3.0
min 1.0
25% 2.5
50% 4.0
75% 5.5
max 7.0

2.1.3 Importing and exporting data

Besides creating our own tabular data, Pandas can be used to import and read data into Pandas’ own structures. Reading different file types (Excel, Parquet, JSON…) is supported by the family of read_* functions. To illustrate this, let’s take a look at the credit card fraud data from Kaggle.

First, we define the location of our data file. If you have a local copy of the course folder, you can specify the path as ../data/card_transdata.csv. This path tells the program to start from the current working directory (i.e., the folder where this notebook is saved), move up one level, and then navigate to the data folder, where the .csv file is stored.

file_csv ='../data/card_transdata.csv'
data = pandas.read_csv(file_csv)

Printing the first or last n rows of the data is possible using the methods .head(n) and .tail(n), respectively.

data.head(5)
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
0 57.877857 0.311140 1.945940 1.0 1.0 0.0 0.0 0.0
1 10.829943 0.175592 1.294219 1.0 0.0 0.0 0.0 0.0
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0
data.tail(5)
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
999995 2.207101 0.112651 1.626798 1.0 1.0 0.0 0.0 0.0
999996 19.872726 2.683904 2.778303 1.0 1.0 0.0 0.0 0.0
999997 2.914857 1.472687 0.218075 1.0 1.0 0.0 1.0 0.0
999998 4.258729 0.242023 0.475822 1.0 0.0 0.0 1.0 0.0
999999 58.108125 0.318110 0.386920 1.0 1.0 0.0 1.0 0.0

It is also possible to inspect the specific data type of each of our columns:

data.dtypes # it's an attribute of the dataframe
distance_from_home                float64
distance_from_last_transaction    float64
ratio_to_median_purchase_price    float64
repeat_retailer                   float64
used_chip                         float64
used_pin_number                   float64
online_order                      float64
fraud                             float64
dtype: object

Or, to get a quick technical summary of our imported data, we can use the .info() method:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB

The summary is telling us that:

  • We have a pandas.DataFrame.
  • It has 1,000,000 rows, and their index runs from 0 to 999,999.
  • None of the columns has missing data.
  • All columns hold 64-bit floats.
  • It takes up 61 MB of RAM.

While the read_* family of functions allows us to import data from different sources, the to_* methods allow us to export it. In the example below, I am saving a .csv that contains only the first 10 rows of the credit transaction data.

data.head(10).to_csv('./subset.csv')

If we export other data to the same file, it will typically overwrite the existing one (unless we are specifying sheets in an Excel file, for instance). This happens without warning, so it is important to be careful with our naming to avoid unwanted overwrites.

data.head(5).to_csv('./subset.csv')
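If silent overwrites are a concern, one defensive pattern (a sketch, not part of the original workflow) is to check whether the file exists before writing:

import os

out_path = './subset.csv'
if os.path.exists(out_path):
    raise FileExistsError(out_path + ' already exists; pick another name.')
data.head(10).to_csv(out_path, index=False)  # index=False drops the row labels from the file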

2.1.4 Filtering and subsetting dataframes

We can use an approach similar to column selection to subset our dataframe to a given set of rows.

For example, let’s extract from the credit transaction data the transactions that happened through online orders.

online = data[data['online_order']==1]
online.head()
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0
6 3.724019 0.956838 0.278465 1.0 0.0 0.0 1.0 0.0
9 8.839047 2.970512 2.361683 1.0 0.0 0.0 1.0 0.0

We can check that in our new subset the variable online_order is always 1.

online['online_order'].describe()
count    650552.0
mean          1.0
std           0.0
min           1.0
25%           1.0
50%           1.0
75%           1.0
max           1.0
Name: online_order, dtype: float64
online['online_order'].unique()
array([1.])

That is, to select rows based on a column condition, we first write the condition within []. This selection bracket gives us a Series of booleans (i.e., a mask), with True values for the rows that satisfy the condition.

mask_data = data['online_order']==1
mask_data
0         False
1         False
2          True
3          True
4          True
          ...  
999995    False
999996    False
999997     True
999998     True
999999     True
Name: online_order, Length: 1000000, dtype: bool
type(mask_data)
pandas.core.series.Series

Then, when we wrap that condition in the outer [], we are subsetting the original dataframe to the rows that are True, that is, those that satisfy our condition.

If our condition involves two discrete values, we can write both conditions within the inner brackets, using the | separator for or conditions and & for and. That is, we could write data[(data['var']==x) | (data['var']==y)]. This line of code is equivalent to using the .isin() method with the argument [x, y].

Sometimes, it is also useful to get rid of missing values in given columns using the .notna() method.
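As an illustration of these patterns on the credit card data loaded above (a hedged sketch; the column names are those of the dataset):

# Two accepted values of the same column: explicit | versus .isin()
subset_or   = data[(data['used_chip'] == 0) | (data['used_chip'] == 1)]
subset_isin = data[data['used_chip'].isin([0, 1])]       # equivalent to the line above

# Combining conditions on different columns with & (and)
online_fraud = data[(data['online_order'] == 1) & (data['fraud'] == 1)]

# Keeping only rows with non-missing values in a column (this data has none, so nothing is dropped)
non_missing = data[data['distance_from_home'].notna()]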

If we want to select rows and columns simultaneously, we can use the .loc[condition, columns to keep] operator or its positional counterpart, .iloc[row positions, column positions]. For instance, to extract the distance from home only for the fraudulent transactions, we could do:

data.loc[data['fraud']==1,'distance_from_home'] # first element is the condition,  
                                            # second one is the column we want. 
13          2.131956
24          3.803057
29         15.694986
35         26.711462
36         10.664474
             ...    
999908     45.296658
999916    167.139756
999919    124.640118
999939     51.412900
999949     15.724799
Name: distance_from_home, Length: 87403, dtype: float64

.loc is also useful for replacing the values of a column in rows that satisfy a certain condition. For instance, let’s create a new variable that takes value 1 only for those transactions happening more than 5 km away from home.

data['more_5km'] = 0

data.loc[data['distance_from_home'] > 5, 'more_5km'] = 1

data[data['more_5km']==1]['distance_from_home'].describe()
count    688790.000000
mean         37.534356
std          76.321463
min           5.000021
25%           9.407164
50%          17.527437
75%          37.590001
max       10632.723672
Name: distance_from_home, dtype: float64

To extract rows and columns by their integer positions, we typically rely on .iloc instead. In the snippet of code below, I am extracting rows 10 to 20 and the columns in positions 2 and 3 (the third and fourth columns):

data.iloc[10:21,2:4]
ratio_to_median_purchase_price repeat_retailer
10 1.136102 1.0
11 1.370330 1.0
12 0.551245 1.0
13 6.358667 1.0
14 2.798901 1.0
15 0.535640 1.0
16 0.516990 1.0
17 1.846451 1.0
18 2.530758 1.0
19 0.307217 1.0
20 1.838016 1.0

Whatever selection you are doing, keep in mind that the reasoning is always the same: you first generate a mask, and then you keep the rows for which the mask holds.
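To make the two-step logic explicit, a short sketch reusing the columns created above:

# Step 1: build the boolean mask (here, fraudulent transactions far from home)
mask = (data['fraud'] == 1) & (data['more_5km'] == 1)

# Step 2: apply it; with .loc we can select a column at the same time
data.loc[mask, 'distance_from_home'].head()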

2.1.5 Creating new columns

Pandas applies the traditional mathematical operators element-wise, without the need to write for loops. For instance, to convert the distance from home from kilometers to miles, we divide by 1.609:

data['distance_from_home_miles'] = data['distance_from_home']/1.609
data
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud more_5km distance_from_home_miles
0 57.877857 0.311140 1.945940 1.0 1.0 0.0 0.0 0.0 1 35.971322
1 10.829943 0.175592 1.294219 1.0 0.0 0.0 0.0 0.0 1 6.730853
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0 1 3.164126
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0 0 1.396870
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0 1 27.464845
... ... ... ... ... ... ... ... ... ... ...
999995 2.207101 0.112651 1.626798 1.0 1.0 0.0 0.0 0.0 0 1.371722
999996 19.872726 2.683904 2.778303 1.0 1.0 0.0 0.0 0.0 1 12.350979
999997 2.914857 1.472687 0.218075 1.0 1.0 0.0 1.0 0.0 0 1.811595
999998 4.258729 0.242023 0.475822 1.0 0.0 0.0 1.0 0.0 0 2.646818
999999 58.108125 0.318110 0.386920 1.0 1.0 0.0 1.0 0.0 1 36.114434

1000000 rows × 10 columns

We can also operate with more than one column. Let’s sum the indicators for repeat retailer and used chip, so that the new column equals 2 exactly when a row satisfies both conditions.

data['sum_2'] = data['repeat_retailer'] + data['used_chip']
data
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud more_5km distance_from_home_miles sum_2
0 57.877857 0.311140 1.945940 1.0 1.0 0.0 0.0 0.0 1 35.971322 2.0
1 10.829943 0.175592 1.294219 1.0 0.0 0.0 0.0 0.0 1 6.730853 1.0
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0 1 3.164126 1.0
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0 0 1.396870 2.0
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0 1 27.464845 2.0
... ... ... ... ... ... ... ... ... ... ... ...
999995 2.207101 0.112651 1.626798 1.0 1.0 0.0 0.0 0.0 0 1.371722 2.0
999996 19.872726 2.683904 2.778303 1.0 1.0 0.0 0.0 0.0 1 12.350979 2.0
999997 2.914857 1.472687 0.218075 1.0 1.0 0.0 1.0 0.0 0 1.811595 2.0
999998 4.258729 0.242023 0.475822 1.0 0.0 0.0 1.0 0.0 0 2.646818 1.0
999999 58.108125 0.318110 0.386920 1.0 1.0 0.0 1.0 0.0 1 36.114434 2.0

1000000 rows × 11 columns

Because sum_2 is not a very intuitive name, let’s take advantage of this to show how we can rename columns:

data = data.rename(columns={'sum_2':'both_conditions'})
data
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud more_5km distance_from_home_miles both_conditions
0 57.877857 0.311140 1.945940 1.0 1.0 0.0 0.0 0.0 1 35.971322 2.0
1 10.829943 0.175592 1.294219 1.0 0.0 0.0 0.0 0.0 1 6.730853 1.0
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0 1 3.164126 1.0
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0 0 1.396870 2.0
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0 1 27.464845 2.0
... ... ... ... ... ... ... ... ... ... ... ...
999995 2.207101 0.112651 1.626798 1.0 1.0 0.0 0.0 0.0 0 1.371722 2.0
999996 19.872726 2.683904 2.778303 1.0 1.0 0.0 0.0 0.0 1 12.350979 2.0
999997 2.914857 1.472687 0.218075 1.0 1.0 0.0 1.0 0.0 0 1.811595 2.0
999998 4.258729 0.242023 0.475822 1.0 0.0 0.0 1.0 0.0 0 2.646818 1.0
999999 58.108125 0.318110 0.386920 1.0 1.0 0.0 1.0 0.0 1 36.114434 2.0

1000000 rows × 11 columns

2.1.6 Merging tables

Sometimes, in the process of cleaning data, we need to put together different DataFrames, either vertically or horizontally. This can be achieved by concatenating along indexes or, when the dataframes have a common identifier, by using .merge().

Let’s start by showing how we can concatenate along an axis. The default axis is 0, which means that it will concatenate rows (i.e., append dataframes vertically).

data_1 = data.iloc[0:100]
data_2 = data.iloc[100:200]
concat_rows = pandas.concat([data_1,data_2], axis = 0)
concat_rows 
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud more_5km distance_from_home_miles both_conditions
0 57.877857 0.311140 1.945940 1.0 1.0 0.0 0.0 0.0 1 35.971322 2.0
1 10.829943 0.175592 1.294219 1.0 0.0 0.0 0.0 0.0 1 6.730853 1.0
2 5.091079 0.805153 0.427715 1.0 0.0 0.0 1.0 0.0 1 3.164126 1.0
3 2.247564 5.600044 0.362663 1.0 1.0 0.0 1.0 0.0 0 1.396870 2.0
4 44.190936 0.566486 2.222767 1.0 1.0 0.0 1.0 0.0 1 27.464845 2.0
... ... ... ... ... ... ... ... ... ... ... ...
195 45.031736 0.055077 1.312067 1.0 0.0 0.0 1.0 0.0 1 27.987406 1.0
196 242.913187 4.424776 0.741428 1.0 0.0 0.0 0.0 0.0 1 150.971527 1.0
197 4.586564 3.365070 2.454288 1.0 0.0 0.0 0.0 0.0 0 2.850568 1.0
198 19.153042 0.403533 1.111563 1.0 0.0 1.0 1.0 0.0 1 11.903693 1.0
199 0.718005 10.054520 1.341927 0.0 1.0 0.0 0.0 0.0 0 0.446243 1.0

200 rows × 11 columns

If instead we concatenate horizontally (axis = 1), concat aligns the rows on their index. In the example below, since the two frames are subsets of different sets of rows, the positions that do not share an index with the other dataframe are simply filled with NaN. To align on something other than the current index, we can first reset or set the index (e.g., with .reset_index() or .set_index()) or use .merge().

data_c1 =data_1[['distance_from_home','repeat_retailer' ]]
data_c2 =data_2[['distance_from_last_transaction','used_chip' ]]
data_ch = pandas.concat([data_c1,data_c2],axis = 1) 
data_ch 
distance_from_home repeat_retailer distance_from_last_transaction used_chip
0 57.877857 1.0 NaN NaN
1 10.829943 1.0 NaN NaN
2 5.091079 1.0 NaN NaN
3 2.247564 1.0 NaN NaN
4 44.190936 1.0 NaN NaN
... ... ... ... ...
195 NaN NaN 0.055077 0.0
196 NaN NaN 4.424776 0.0
197 NaN NaN 3.365070 0.0
198 NaN NaN 0.403533 0.0
199 NaN NaN 10.054520 1.0

200 rows × 4 columns

Using concat for horizontal concatenation can be very useful when we have more than two datasets to combine and want to avoid repeated lines using .merge(), which can only join two dataframes at a time.

data_c1 =data_1[['distance_from_home','repeat_retailer' ]]
data_c2 =data_1[['distance_from_last_transaction','used_chip' ]]
data_c3 =data_1[['used_pin_number','online_order']]
data_ch = pandas.concat([data_c1,data_c2,data_c3],axis = 1) 
data_ch 
distance_from_home repeat_retailer distance_from_last_transaction used_chip used_pin_number online_order
0 57.877857 1.0 0.311140 1.0 0.0 0.0
1 10.829943 1.0 0.175592 0.0 0.0 0.0
2 5.091079 1.0 0.805153 0.0 0.0 1.0
3 2.247564 1.0 5.600044 1.0 0.0 1.0
4 44.190936 1.0 0.566486 1.0 0.0 1.0
... ... ... ... ... ... ...
95 11.411763 1.0 2.233248 1.0 0.0 1.0
96 2.987854 1.0 9.315192 0.0 0.0 1.0
97 0.530985 0.0 1.501575 1.0 0.0 0.0
98 6.136181 1.0 2.579574 1.0 1.0 1.0
99 1.532551 0.0 3.097159 1.0 0.0 0.0

100 rows × 6 columns

If instead we wanted to use the .merge() method, we would have to call it twice:

data_m1 = data_c1.reset_index().merge(
                                data_c2.reset_index(),
                                on='index',
                                how='inner',
                                validate='1:1')
data_m1
index distance_from_home repeat_retailer distance_from_last_transaction used_chip
0 0 57.877857 1.0 0.311140 1.0
1 1 10.829943 1.0 0.175592 0.0
2 2 5.091079 1.0 0.805153 0.0
3 3 2.247564 1.0 5.600044 1.0
4 4 44.190936 1.0 0.566486 1.0
... ... ... ... ... ...
95 95 11.411763 1.0 2.233248 1.0
96 96 2.987854 1.0 9.315192 0.0
97 97 0.530985 0.0 1.501575 1.0
98 98 6.136181 1.0 2.579574 1.0
99 99 1.532551 0.0 3.097159 1.0

100 rows × 5 columns

data_m1 = data_m1.merge(data_c3.reset_index(),
                        on='index',
                        how='inner',
                        validate='1:1')
data_m1
index distance_from_home repeat_retailer distance_from_last_transaction used_chip used_pin_number online_order
0 0 57.877857 1.0 0.311140 1.0 0.0 0.0
1 1 10.829943 1.0 0.175592 0.0 0.0 0.0
2 2 5.091079 1.0 0.805153 0.0 0.0 1.0
3 3 2.247564 1.0 5.600044 1.0 0.0 1.0
4 4 44.190936 1.0 0.566486 1.0 0.0 1.0
... ... ... ... ... ... ... ...
95 95 11.411763 1.0 2.233248 1.0 0.0 1.0
96 96 2.987854 1.0 9.315192 0.0 0.0 1.0
97 97 0.530985 0.0 1.501575 1.0 0.0 0.0
98 98 6.136181 1.0 2.579574 1.0 1.0 1.0
99 99 1.532551 0.0 3.097159 1.0 0.0 0.0

100 rows × 7 columns

2.1.7 Summary statistics and groupby operations

Pandas also offers various statistical measures that can be applied to numerical data columns. By default, these operations exclude missing data and are computed across rows (i.e., one result per column).

print('Mean distance from home:', data['distance_from_home'].mean())
print('Minimum distance from home:',data['distance_from_home'].min())
print('Median distance from home:', data['distance_from_home'].median())
print('Maximum distance from home:', data['distance_from_home'].max())
Mean distance from home: 26.62879219257128
Minimum distance from home: 0.0048743850667442
Median distance from home: 9.967760078697681
Maximum distance from home: 10632.723672241103

If we want more flexibility, we can specify a set of statistics and the columns we want them computed on using .agg:

data.agg(
    {
           "distance_from_home": ["min", "max", "median", "skew"],
            "distance_from_last_transaction": [ "max", "median", "mean"],  
       }
)
distance_from_home distance_from_last_transaction
min 0.004874 NaN
max 10632.723672 11851.104565
median 9.967760 0.998650
skew 20.239733 NaN
mean NaN 5.036519
Tip

Computing the mean of a dummy variable gives us the share of observations within that category.

data['fraud'].mean()  # about 8.7% of transactions are frauds
0.087403

Besides these aggregating operations, we can also compute statistics within categories, or by groups.

data.groupby("fraud")["distance_from_home"].mean()
fraud
0.0    22.832976
1.0    66.261876
Name: distance_from_home, dtype: float64

Because the default return type is a pandas.Series, it is often useful to convert the result to a pandas.DataFrame, which can be done as follows:

pandas.DataFrame(data.groupby("fraud")["distance_from_home"].mean())
distance_from_home
fraud
0.0 22.832976
1.0 66.261876

Notice that the grouping variable automatically becomes the new index. We can undo this change to make the groupby variable accessible as a regular column again:

pandas.DataFrame(data.groupby("fraud")["distance_from_home"].mean()).reset_index()
fraud distance_from_home
0 0.0 22.832976
1 1.0 66.261876
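The groupby machinery also combines with .agg, which is handy when we want several statistics per group (a sketch using the same columns):

# Several statistics of distance_from_home, computed separately for each fraud group
data.groupby('fraud')['distance_from_home'].agg(['mean', 'median', 'max']).reset_index()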

2.2 Visualization: Matplotlib and Seaborn

The library that allows us to do data visualization in Python, whether static, animated, or interactive, is Matplotlib.

Matplotlib plots our data on Figures, each one containing one or more Axes, where data points are defined as x-y coordinates. This means that Matplotlib allows us to manipulate the display by changing the elements of these two classes:

Figure:

The entire figure that contains the axes, which are the actual plots.

Axes:

Contains the region for plotting data, and includes the x and y axis (and z if the plot is 3D) where we actually plot the data. The figure below, from the package’s original documentation, illustrates all the elements within a figure.

Source: Matplotlib documentation
import matplotlib.pyplot as plt

For the following examples, we will use data different from the card transaction dataset. However, just as in most typical real-life situations, we will have to do some pre-processing before we can start plotting our data. In this case, we are going to use data from the World Health Organization on mortality induced by pollution and country GDP per capita data from the World Bank.

death_rate = pandas.read_csv('../data/data.csv')
gdp = pandas.read_csv('../data/gdp-per-capita-worldbank.csv')
death_rate[(death_rate['Location']=='Algeria') 
            & (death_rate['Dim1']=='Both sexes') 
            & (death_rate['Dim2']=='Total') 
            & (death_rate['IndicatorCode']=='SDGAIRBOD')]
IndicatorCode Indicator ValueType ParentLocationCode ParentLocation Location type SpatialDimValueCode Location Period type Period IsLatestYear Dim1 type Dim1 Dim1ValueCode Dim2 type Dim2 Dim2ValueCode Dim3 type Dim3 Dim3ValueCode DataSourceDimValueCode DataSource FactValueNumericPrefix FactValueNumeric FactValueUoM FactValueNumericLowPrefix FactValueNumericLow FactValueNumericHighPrefix FactValueNumericHigh Value FactValueTranslationID FactComments Language DateModified
2342 SDGAIRBOD Ambient and household air pollution attributab... numeric AFR Africa Country DZA Algeria Year 2019 True Sex Both sexes SEX_BTSX Cause Total ENVCAUSE_ENVCAUSE000 NaN NaN NaN NaN NaN NaN 41.43 NaN NaN 28.46 NaN 55.68 41 [28-56] NaN NaN EN 2022-08-25T22:00:00.000Z

We keep the variables we are interested in.

keep_vars = ['Indicator','Location','ParentLocationCode',
            'ParentLocation','SpatialDimValueCode',
            'Location','FactValueNumeric']

death_rate = death_rate[(death_rate['Dim1']=='Both sexes') 
                        & (death_rate['Dim2']=='Total') 
                        & (death_rate['IndicatorCode']=='SDGAIRBOD')][keep_vars]
death_rate
Indicator Location ParentLocationCode ParentLocation SpatialDimValueCode Location FactValueNumeric
389 Ambient and household air pollution attributab... Bahamas AMR Americas BHS Bahamas 10.04
450 Ambient and household air pollution attributab... Democratic Republic of the Congo AFR Africa COD Democratic Republic of the Congo 100.30
453 Ambient and household air pollution attributab... Afghanistan EMR Eastern Mediterranean AFG Afghanistan 100.80
454 Ambient and household air pollution attributab... Bangladesh SEAR South-East Asia BGD Bangladesh 100.90
456 Ambient and household air pollution attributab... Mozambique AFR Africa MOZ Mozambique 102.50
... ... ... ... ... ... ... ...
3272 Ambient and household air pollution attributab... Papua New Guinea WPR Western Pacific PNG Papua New Guinea 94.90
3277 Ambient and household air pollution attributab... Sao Tome and Principe AFR Africa STP Sao Tome and Principe 95.96
3278 Ambient and household air pollution attributab... Togo AFR Africa TGO Togo 96.39
3285 Ambient and household air pollution attributab... Republic of Moldova EUR Europe MDA Republic of Moldova 98.45
3291 Ambient and household air pollution attributab... Hungary EUR Europe HUN Hungary 99.49

183 rows × 7 columns

gdp.head()
Entity Code Year GDP per capita, PPP (constant 2017 international $)
0 Afghanistan AFG 2002 1280.4631
1 Afghanistan AFG 2003 1292.3335
2 Afghanistan AFG 2004 1260.0605
3 Afghanistan AFG 2005 1352.3207
4 Afghanistan AFG 2006 1366.9932
gdp = gdp[(gdp['Year']==2019) & 
            (gdp['Code'].notna()) & (gdp['Entity']!='World')] # the World aggregate has Code 'OWID_WRL', so we filter on Entity
gdp
Entity Code Year GDP per capita, PPP (constant 2017 international $)
17 Afghanistan AFG 2019 2079.9219
49 Albania ALB 2019 13655.6650
81 Algeria DZA 2019 11627.2800
113 Angola AGO 2019 6602.4240
145 Antigua and Barbuda ATG 2019 23035.6580
... ... ... ... ...
6215 Vanuatu VUT 2019 3070.3508
6247 Vietnam VNM 2019 10252.0050
6311 Zambia ZMB 2019 3372.3590
6343 Zimbabwe ZWE 2019 2203.3967

193 rows × 4 columns

Now we are ready to merge both.

gdp_pollution = gdp.merge(death_rate, 
                        left_on='Code',
                        right_on='SpatialDimValueCode',
                        how='inner',validate='1:1')
gdp_pollution
Entity Code Year GDP per capita, PPP (constant 2017 international $) Indicator Location ParentLocationCode ParentLocation SpatialDimValueCode Location FactValueNumeric
0 Afghanistan AFG 2019 2079.9219 Ambient and household air pollution attributab... Afghanistan EMR Eastern Mediterranean AFG Afghanistan 100.80
1 Albania ALB 2019 13655.6650 Ambient and household air pollution attributab... Albania EUR Europe ALB Albania 164.80
2 Algeria DZA 2019 11627.2800 Ambient and household air pollution attributab... Algeria AFR Africa DZA Algeria 41.43
3 Angola AGO 2019 6602.4240 Ambient and household air pollution attributab... Angola AFR Africa AGO Angola 59.47
4 Antigua and Barbuda ATG 2019 23035.6580 Ambient and household air pollution attributab... Antigua and Barbuda AMR Americas ATG Antigua and Barbuda 21.54
... ... ... ... ... ... ... ... ... ... ... ...
171 Uzbekistan UZB 2019 7348.1470 Ambient and household air pollution attributab... Uzbekistan EUR Europe UZB Uzbekistan 93.10
172 Vanuatu VUT 2019 3070.3508 Ambient and household air pollution attributab... Vanuatu WPR Western Pacific VUT Vanuatu 138.20
173 Vietnam VNM 2019 10252.0050 Ambient and household air pollution attributab... Viet Nam WPR Western Pacific VNM Viet Nam 107.80
174 Zambia ZMB 2019 3372.3590 Ambient and household air pollution attributab... Zambia AFR Africa ZMB Zambia 63.04
175 Zimbabwe ZWE 2019 2203.3967 Ambient and household air pollution attributab... Zimbabwe AFR Africa ZWE Zimbabwe 80.64

176 rows × 11 columns

For a quick visualization, we can use the Pandas .plot() method, which generates a Matplotlib figure object. Our sample data makes this harder because we do not have an explicit x-axis variable, so we get a line graph where the x-axis is the index (i.e., the row indicator) and the y-axis is the value we are plotting.

gdp_pollution['FactValueNumeric'].plot()
plt.show()

But this is not very informative. Let’s exploit the functionalities of the .plot() method to get a better grasp of our data.

gdp_pollution.plot.scatter(x='GDP per capita, PPP (constant 2017 international $)',
                           y='FactValueNumeric')

This scatterplot is starting to say something: there seems to be a negative relationship between GDP per capita and deaths due to pollution. We could have produced the same graph using Matplotlib directly, as follows:

plt.plot(gdp_pollution['GDP per capita, PPP (constant 2017 international $)'],gdp_pollution['FactValueNumeric'],'o') # 'o' = circle markers, i.e., a scatterplot
plt.xlabel('GDP Per Capita')  
plt.ylabel('Air pollution attributable death rate (per 100 000 population)') 
Text(0, 0.5, 'Air pollution attributable death rate (per 100 000 population)')

While the Pandas functionality allows us to do quick, direct visualizations, most of the time further customization is required to achieve the desired output. This is where Matplotlib comes in handy.

f, axs = plt.subplots(figsize = (5,5)) # figsize is in inches; we can define a conversion factor for other units
gdp_pollution.plot(ax=axs, 
                    x ='GDP per capita, PPP (constant 2017 international $)', 
                    y ='FactValueNumeric', 
                    kind='scatter')

axs.set_xlabel('GDP Per Capita')
axs.set_ylabel('Air pollution attributable death rate (per 100 000 population)')
axs.set_title('Title of this subplot')

f.suptitle('My first graph')
f.savefig('./fig1.png')

It is also possible to add labels indicating the country code using the annotate method.

f, ax = plt.subplots(figsize = (5,5)) # figsize is in inches; we can define a conversion factor for other units
gdp_pollution.plot(ax=ax, 
                    x ='GDP per capita, PPP (constant 2017 international $)', 
                    y ='FactValueNumeric', 
                    kind='scatter')

ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Air pollution attributable death rate (per 100 000 population)')
ax.set_title('Title of this subplot')

x = gdp_pollution['GDP per capita, PPP (constant 2017 international $)'].values
y = gdp_pollution['FactValueNumeric'].values
for i in range(len(gdp_pollution)): 
    plt.annotate(gdp_pollution['Code'][i], (x[i], y[i] + 0.5), fontsize=7)
f.suptitle('My first graph')
f.savefig('./fig1.png')

We can also plot multiple subplots within the same figure.

f, axs = plt.subplots(nrows=1, ncols=2, figsize = (8,5)) # Now axs contains two elements

gdp_pollution.plot(ax=axs[0], 
                    x ='GDP per capita, PPP (constant 2017 international $)', 
                    y ='FactValueNumeric',
                    kind='scatter')
                    
axs[0].set_xlabel('GDP Per Capita')
axs[0].set_ylabel('Air pollution attributable death rate (per 100 000 population)')
axs[0].set_title('Title of subplot in axs 0')

gdp_pollution.plot(ax=axs[1], 
                    x ='GDP per capita, PPP (constant 2017 international $)', 
                    y ='FactValueNumeric', 
                    kind='scatter')

axs[1].set_xlabel('GDP Per Capita')
axs[1].set_ylabel('Air pollution attributable death rate (per 100 000 population)')
axs[1].set_title('Title of subplot in axs 1')

f.suptitle('My first figure with two subplots')
Text(0.5, 0.98, 'My first figure with two subplots')

To perform more advanced statistical graphs, we can rely on Seaborn, a library that facilitates the creation of statistical graphics by leveraging matplotlib and integrating with pandas data structures.

import seaborn as sns
sns.set_theme(rc={'figure.figsize':(4,4)})

Functions in Seaborn are classified as:

  • Figure-level: they internally create their own Matplotlib figure. When we call this type of function, it initializes its own figure, so we cannot draw it into an existing axes. To customize its axes, we need to access the Matplotlib axes generated within the figure and then add or modify elements.

  • Axis-level: the returned plot is a matplotlib.pyplot.Axes object, which means we can use it within a Matplotlib figure we set up ourselves, as the sketch after this list shows.
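For example, the axis-level counterpart of the scatter produced by relplot is scatterplot, which accepts an ax argument and therefore slots into a figure we control (a minimal sketch using the data above):

f, ax = plt.subplots(figsize=(4, 4))
sns.scatterplot(data=gdp_pollution,
                x="GDP per capita, PPP (constant 2017 international $)",
                y="FactValueNumeric",
                ax=ax)                        # drawn into the axes we created
ax.set_title('Axis-level function inside our own figure')

The figure-level relplot, by contrast, always creates its own figure: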

sns.relplot(data=gdp_pollution, 
            x="GDP per capita, PPP (constant 2017 international $)",
            y="FactValueNumeric")

In Seaborn, we can add an additional dimension to the scatterplot by using different colors for observations in different categories, specifying the hue parameter.

sns.relplot(data=gdp_pollution, 
            x="GDP per capita, PPP (constant 2017 international $)", 
            y="FactValueNumeric", 
            hue = 'ParentLocation')

Furthermore, it is also straightforward to use a different marker for each category by specifying it in the style option. While here we use the same variable for both, it is also possible to specify different variables in hue and style.

sns.relplot(data=gdp_pollution, 
            x="GDP per capita, PPP (constant 2017 international $)",
            y="FactValueNumeric", 
            hue = 'ParentLocation', 
            style='ParentLocation')

To explore more options, let’s merge our data with population at the country level.

population = pandas.read_csv('../data/population-unwpp.csv')
population = population[(population['Year']==2019) & (population['Code'].notna())]
gdp_pollution = gdp_pollution.merge(population, 
                how='inner', on='Code', validate='1:1')

Now, we can make each point’s size depend on the population of the country:

sns.relplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", 
            y="FactValueNumeric", 
            hue = 'ParentLocation', 
            size='Population (historical estimates)', 
            sizes = (15,250))

population.sort_values('Population (historical estimates)')
Entity Code Year Population (historical estimates)
16674 Tokelau TKL 2019 1775
12058 Niue NIU 2019 1943
5562 Falkland Islands FLK 2019 3729
11106 Montserrat MSR 2019 4528
14002 Saint Helena SHN 2019 5470
... ... ... ... ...
7722 Indonesia IDN 2019 269582880
17580 United States USA 2019 334319680
7650 India IND 2019 1383112064
3383 China CHN 2019 1421864064
18354 World OWID_WRL 2019 7764951040

237 rows × 4 columns

Seaborn also has functionality for drawing scatterplots with regression lines (regplot). The lmplot function additionally allows drawing the regression lines conditional on other variables (i.e., by category):

sns.regplot(data=gdp_pollution, 
            x="GDP per capita, PPP (constant 2017 international $)",
            y="FactValueNumeric")

sns.lmplot(data=gdp_pollution,
         x="GDP per capita, PPP (constant 2017 international $)",
         y="FactValueNumeric",
         hue="ParentLocation" )

Specifying “col” or “row” will draw separate graphs for each category.

sns.lmplot(data=gdp_pollution, 
            x="GDP per capita, PPP (constant 2017 international $)",          
            y="FactValueNumeric", 
            hue="ParentLocation", 
            col ="ParentLocation" )

lmplot also performs polynomial regressions.

sns.lmplot(data=gdp_pollution, 
            x="GDP per capita, PPP (constant 2017 international $)", 
            y="FactValueNumeric", 
            order=2)

With a similar syntax, we can draw histograms and density plots, which can also be helpful for understanding the distribution of continuous variables.

For instance, in the code below we draw the histogram of GDP per capita for each parent location separately.

a = sns.displot(data = gdp_pollution, 
                x = "GDP per capita, PPP (constant 2017 international $)",
                kind = 'hist',
                col='ParentLocation')

a.set_axis_labels("GDP", "Count")
a.set_titles("{col_name}")
a.savefig('./export.png')

Similarly, we could have used:

f, ax = plt.subplots()
sns.histplot(data = gdp_pollution, 
            x = "GDP per capita, PPP (constant 2017 international $)",
            color="skyblue", 
            label="MaxTemp", kde=True, ax = ax)

plt.legend()
plt.xlabel('GDP per capita')
Text(0.5, 0, 'GDP per capita')

Heatmaps are also built into Seaborn and are a very popular tool for displaying the correlation between the variables of a dataframe. It is like visualizing a correlation matrix with colors.

sns.heatmap(gdp_pollution[[ "GDP per capita, PPP (constant 2017 international $)",
                         "FactValueNumeric", 
                         'Population (historical estimates)']].corr())
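Two optional parameters worth knowing (a sketch): annot=True prints the coefficient inside each cell, and vmin/vmax pin the color scale to the full correlation range.

sns.heatmap(gdp_pollution[["GDP per capita, PPP (constant 2017 international $)",
                           "FactValueNumeric",
                           'Population (historical estimates)']].corr(),
            annot=True,            # write the correlation coefficient in each cell
            vmin=-1, vmax=1)       # fix the color scale between -1 and 1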

2.3 Practice exercises

  1. Write a loop that describes all columns of the credit transaction data.

    Show solution
    for c in data.columns:
        print(data[c].describe())
  2. Write a loop that plots in a different subplot the scatterplot of GDP vs mortality for each Parent Location. It should have 3 columns and 2 rows.

    Show solution
    #Defining a 2x3 grid of subplots
    fig, axs = plt.subplots(nrows=2,ncols=3,figsize=(16,10))
    plt.subplots_adjust(wspace=0.5) #adjusting white space between individual plots
    
    # get list of parent locations
    parentl = gdp_pollution['ParentLocation'].unique()
    # create a dictionary of colors for each parent location
    # Define six colors
    palette = sns.color_palette()
    
    # Use the palette to assign colors to parent locations
    colors = palette[:len(parentl)]
    # Create a dictionary mapping parent locations to colors
    parent_color_dict = dict(zip(parentl, colors))
    
    # convert to an array with the shapes of the figure we want so that we can loop over them 
    parentl_arr= np.array(parentl)
    parentl_arr= parentl_arr.reshape(2, 3)
    
    # looping over the 2x3 grid
    for i in range(2):
        for j in range(3):
    
            #extract the part of the data for that subset of countries
            aux = gdp_pollution[gdp_pollution['ParentLocation']==parentl_arr[i,j]]
    
            #Making the scatterplot
            aux.plot(ax=axs[i,j], x ='GDP per capita, PPP (constant 2017 international $)', 
                    y ='FactValueNumeric', kind='scatter', color=parent_color_dict[parentl_arr[i,j]])
    
            # label axes
    
            axs[i,j].set_xlabel('GDP per capita')
            axs[i,j].set_ylabel('Pollution induced mortality')
    
            # set title of subplot
            axs[i,j].set_title(parentl_arr[i,j])
            #Putting a dollar sign, and thousand-comma separator on x-axis labels
            axs[i,j].xaxis.set_major_formatter('${x:,.0f}')
    
            # increasing font size of axis labels
            axs[i,j].tick_params(axis = 'both',labelsize=6)
    