> Data science course by AMA & ICAML - Practical Part
Using pandas
What we did so far are very common and basic tasks. Using Python we should recall that there are usually packages that simplify those basic tasks. A famous package for working with tabular data is pandas. Reading a csv file in pandas can be done within a single function call.
import pandas as pd # pandas renamed to pd
table = pd.read_csv('titanic2.csv', sep=';') # we have to specify the seperator since pandas assumes a comma
Pandas stores the data in a so called data frame.
print(type(table))
<class 'pandas.core.frame.DataFrame'>
This data type has some nice functionalities e.g. a nice textual representation of the data itself. In a notebook, we can simply type the name of the table to access this representation.
table
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NaN | S |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
861 | 862 | 0 | 2 | Giles, Mr. Frederick Edward | male | 21.0 | 1 | 0 | 28134 | 11.5000 | NaN | S |
862 | 863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
864 | 865 | 0 | 2 | Gill, Mr. John William | male | 24.0 | 0 | 0 | 233866 | 13.0000 | NaN | S |
865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S |
866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C |
867 | 868 | 0 | 1 | Roebling, Mr. Washington Augustus II | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S |
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | NaN | S |
870 | 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26.0 | 0 | 0 | 349248 | 7.8958 | NaN | S |
871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S |
872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S |
874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C |
875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C |
876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
The function head()
will print only the first elements, which is often enough to get an impression of the table.
table.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Another very useful feature is the function describe()
that will compute several descriptive values like the mean, the quantiles and the number of values.
table.describe(include='all')
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
top | NaN | NaN | NaN | Williams, Mr. Charles Duane | male | NaN | NaN | NaN | CA. 2343 | NaN | B96 B98 | S |
freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
Using the function hist()
, we can directly compute and show histograms of multiple attributes.
plot = table.hist(column= ['Age', 'Fare', 'Pclass', 'Survived' ], figsize=(15, 10))
Single columns of the table are accessed by using the column name in square brackets or via the dot operator. Each column is stored as a data series which is again some sort of list with special functions e.g. a nice textual representation.
fares = table['Fare'] # Extracting the column 'Fare' (alternatively we could use table.Fare)
print(type(fares)) # This is the data type of the column representation
print(fares.dtype) # This is the data type of each element in the column
fares.head()
<class 'pandas.core.series.Series'>
float64
0 7.2500
1 71.2833
2 7.9250
3 53.1000
4 8.0500
Name: Fare, dtype: float64
table['Pclass'].dtype
dtype('int64')
Data cleaning and preprocessing
In order to prepare our data for the later processing, we will first remove the attributes, we are not interested in.
cleaned_table = table.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin']) # step 1: drop some columns
cleaned_table.describe(include='all')
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891 | 714.000000 | 891.000000 | 891.000000 | 891.000000 | 889 |
unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | 3 |
top | NaN | NaN | male | NaN | NaN | NaN | NaN | S |
freq | NaN | NaN | 577 | NaN | NaN | NaN | NaN | 644 |
mean | 0.383838 | 2.308642 | NaN | 29.699118 | 0.523008 | 0.381594 | 32.204208 | NaN |
std | 0.486592 | 0.836071 | NaN | 14.526497 | 1.102743 | 0.806057 | 49.693429 | NaN |
min | 0.000000 | 1.000000 | NaN | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN |
25% | 0.000000 | 2.000000 | NaN | 20.125000 | 0.000000 | 0.000000 | 7.910400 | NaN |
50% | 0.000000 | 3.000000 | NaN | 28.000000 | 0.000000 | 0.000000 | 14.454200 | NaN |
75% | 1.000000 | 3.000000 | NaN | 38.000000 | 1.000000 | 0.000000 | 31.000000 | NaN |
max | 1.000000 | 3.000000 | NaN | 80.000000 | 8.000000 | 6.000000 | 512.329200 | NaN |
In a second step, we remove all entries, where not all attributes are present.
cleaned_table = cleaned_table.dropna()
cleaned_table.describe(include='all')
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|
count | 712.000000 | 712.000000 | 712 | 712.000000 | 712.000000 | 712.000000 | 712.000000 | 712 |
unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | 3 |
top | NaN | NaN | male | NaN | NaN | NaN | NaN | S |
freq | NaN | NaN | 453 | NaN | NaN | NaN | NaN | 554 |
mean | 0.404494 | 2.240169 | NaN | 29.642093 | 0.514045 | 0.432584 | 34.567251 | NaN |
std | 0.491139 | 0.836854 | NaN | 14.492933 | 0.930692 | 0.854181 | 52.938648 | NaN |
min | 0.000000 | 1.000000 | NaN | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN |
25% | 0.000000 | 1.000000 | NaN | 20.000000 | 0.000000 | 0.000000 | 8.050000 | NaN |
50% | 0.000000 | 2.000000 | NaN | 28.000000 | 0.000000 | 0.000000 | 15.645850 | NaN |
75% | 1.000000 | 3.000000 | NaN | 38.000000 | 1.000000 | 1.000000 | 33.000000 | NaN |
max | 1.000000 | 3.000000 | NaN | 80.000000 | 5.000000 | 6.000000 | 512.329200 | NaN |
We can see, that the number of entries per attribute is now the same for all attributes.
In the last step we want to convert the categorical entries for ‘Sex’ to numerical values. Similar to ‘survived’ ‘male’ should be represented by 1 and ‘female’ by 0. This is done in the next cell.
sex_types = ['female', 'male']
cleaned_table.Sex = cleaned_table.Sex.astype("category").cat.codes
cleaned_table.describe(include='all')
#cleaned_table.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|
count | 712.000000 | 712.000000 | 712.000000 | 712.000000 | 712.000000 | 712.000000 | 712.000000 | 712 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 |
top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | S |
freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 554 |
mean | 0.404494 | 2.240169 | 0.636236 | 29.642093 | 0.514045 | 0.432584 | 34.567251 | NaN |
std | 0.491139 | 0.836854 | 0.481420 | 14.492933 | 0.930692 | 0.854181 | 52.938648 | NaN |
min | 0.000000 | 1.000000 | 0.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN |
25% | 0.000000 | 1.000000 | 0.000000 | 20.000000 | 0.000000 | 0.000000 | 8.050000 | NaN |
50% | 0.000000 | 2.000000 | 1.000000 | 28.000000 | 0.000000 | 0.000000 | 15.645850 | NaN |
75% | 1.000000 | 3.000000 | 1.000000 | 38.000000 | 1.000000 | 1.000000 | 33.000000 | NaN |
max | 1.000000 | 3.000000 | 1.000000 | 80.000000 | 5.000000 | 6.000000 | 512.329200 | NaN |
Finally we will save the cleaned table to a new csv file.
cleaned_table.to_csv('titanic_cleaned.csv', sep=';', index=False)
Author: | Dennis Wittich, Markus Rokicki, Artem Leichter |
Last modified: | 15.10.2019 |