> Data science course by AMA & ICAML - Practical Part

Using pandas

What we did so far are very common and basic tasks. Using Python we should recall that there are usually packages that simplify those basic tasks. A famous package for working with tabular data is pandas. Reading a csv file in pandas can be done within a single function call.

import pandas as pd  # pandas renamed to pd

table = pd.read_csv('titanic2.csv', sep=';') # we have to specify the seperator since pandas assumes a comma

Pandas stores the data in a so called data frame.

print(type(table))
<class 'pandas.core.frame.DataFrame'>

This data type has some nice functionalities e.g. a nice textual representation of the data itself. In a notebook, we can simply type the name of the table to access this representation.

table
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

The function head() will print only the first elements, which is often enough to get an impression of the table.

table.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Another very useful feature is the function describe() that will compute several descriptive values like the mean, the quantiles and the number of values.

table.describe(include='all')
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Williams, Mr. Charles Duane male NaN NaN NaN CA. 2343 NaN B96 B98 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN

Using the function hist(), we can directly compute and show histograms of multiple attributes.

plot = table.hist(column= ['Age', 'Fare', 'Pclass', 'Survived' ], figsize=(15, 10))

Single columns of the table are accessed by using the column name in square brackets or via the dot operator. Each column is stored as a data series which is again some sort of list with special functions e.g. a nice textual representation.

fares = table['Fare']         # Extracting the column 'Fare' (alternatively we could use table.Fare)
print(type(fares))            # This is the data type of the column representation
print(fares.dtype)            # This is the data type of each element in the column
fares.head()
<class 'pandas.core.series.Series'>
float64





0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64
table['Pclass'].dtype
dtype('int64')

Data cleaning and preprocessing

In order to prepare our data for the later processing, we will first remove the attributes, we are not interested in.

cleaned_table = table.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin']) # step 1: drop some columns
cleaned_table.describe(include='all')
Survived Pclass Sex Age SibSp Parch Fare Embarked
count 891.000000 891.000000 891 714.000000 891.000000 891.000000 891.000000 889
unique NaN NaN 2 NaN NaN NaN NaN 3
top NaN NaN male NaN NaN NaN NaN S
freq NaN NaN 577 NaN NaN NaN NaN 644
mean 0.383838 2.308642 NaN 29.699118 0.523008 0.381594 32.204208 NaN
std 0.486592 0.836071 NaN 14.526497 1.102743 0.806057 49.693429 NaN
min 0.000000 1.000000 NaN 0.420000 0.000000 0.000000 0.000000 NaN
25% 0.000000 2.000000 NaN 20.125000 0.000000 0.000000 7.910400 NaN
50% 0.000000 3.000000 NaN 28.000000 0.000000 0.000000 14.454200 NaN
75% 1.000000 3.000000 NaN 38.000000 1.000000 0.000000 31.000000 NaN
max 1.000000 3.000000 NaN 80.000000 8.000000 6.000000 512.329200 NaN

In a second step, we remove all entries, where not all attributes are present.

cleaned_table = cleaned_table.dropna()
cleaned_table.describe(include='all')
Survived Pclass Sex Age SibSp Parch Fare Embarked
count 712.000000 712.000000 712 712.000000 712.000000 712.000000 712.000000 712
unique NaN NaN 2 NaN NaN NaN NaN 3
top NaN NaN male NaN NaN NaN NaN S
freq NaN NaN 453 NaN NaN NaN NaN 554
mean 0.404494 2.240169 NaN 29.642093 0.514045 0.432584 34.567251 NaN
std 0.491139 0.836854 NaN 14.492933 0.930692 0.854181 52.938648 NaN
min 0.000000 1.000000 NaN 0.420000 0.000000 0.000000 0.000000 NaN
25% 0.000000 1.000000 NaN 20.000000 0.000000 0.000000 8.050000 NaN
50% 0.000000 2.000000 NaN 28.000000 0.000000 0.000000 15.645850 NaN
75% 1.000000 3.000000 NaN 38.000000 1.000000 1.000000 33.000000 NaN
max 1.000000 3.000000 NaN 80.000000 5.000000 6.000000 512.329200 NaN

We can see, that the number of entries per attribute is now the same for all attributes.

In the last step we want to convert the categorical entries for ‘Sex’ to numerical values. Similar to ‘survived’ ‘male’ should be represented by 1 and ‘female’ by 0. This is done in the next cell.

sex_types = ['female', 'male']
cleaned_table.Sex = cleaned_table.Sex.astype("category").cat.codes
cleaned_table.describe(include='all')
#cleaned_table.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked
count 712.000000 712.000000 712.000000 712.000000 712.000000 712.000000 712.000000 712
unique NaN NaN NaN NaN NaN NaN NaN 3
top NaN NaN NaN NaN NaN NaN NaN S
freq NaN NaN NaN NaN NaN NaN NaN 554
mean 0.404494 2.240169 0.636236 29.642093 0.514045 0.432584 34.567251 NaN
std 0.491139 0.836854 0.481420 14.492933 0.930692 0.854181 52.938648 NaN
min 0.000000 1.000000 0.000000 0.420000 0.000000 0.000000 0.000000 NaN
25% 0.000000 1.000000 0.000000 20.000000 0.000000 0.000000 8.050000 NaN
50% 0.000000 2.000000 1.000000 28.000000 0.000000 0.000000 15.645850 NaN
75% 1.000000 3.000000 1.000000 38.000000 1.000000 1.000000 33.000000 NaN
max 1.000000 3.000000 1.000000 80.000000 5.000000 6.000000 512.329200 NaN

Finally we will save the cleaned table to a new csv file.

cleaned_table.to_csv('titanic_cleaned.csv', sep=';', index=False)
Author: Dennis Wittich, Markus Rokicki, Artem Leichter
Last modified: 15.10.2019