> Data science course by AMA & ICAML - Practical Part

Using pandas

What we did so far are very common and basic tasks. Using Python we should recall that there are usually packages that simplify those basic tasks. A famous package for working with tabular data is pandas. Reading a csv file in pandas can be done within a single function call.

import pandas as pd  # pandas renamed to pd

table = pd.read_csv('titanic2.csv', sep=';') # we have to specify the seperator since pandas assumes a comma

Pandas stores the data in a so called data frame.

print(type(table))

<class 'pandas.core.frame.DataFrame'>

This data type has some nice functionalities e.g. a nice textual representation of the data itself. In a notebook, we can simply type the name of the table to access this representation.

table

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
18	19	0	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C
20	21	0	2	Fynney, Mr. Joseph J	male	35.0	0	0	239865	26.0000	NaN	S
21	22	1	2	Beesley, Mr. Lawrence	male	34.0	0	0	248698	13.0000	D56	S
22	23	1	3	McGowan, Miss. Anna "Annie"	female	15.0	0	0	330923	8.0292	NaN	Q
23	24	1	1	Sloper, Mr. William Thompson	male	28.0	0	0	113788	35.5000	A6	S
24	25	0	3	Palsson, Miss. Torborg Danira	female	8.0	3	1	349909	21.0750	NaN	S
25	26	1	3	Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...	female	38.0	1	5	347077	31.3875	NaN	S
26	27	0	3	Emir, Mr. Farred Chehab	male	NaN	0	0	2631	7.2250	NaN	C
27	28	0	1	Fortune, Mr. Charles Alexander	male	19.0	3	2	19950	263.0000	C23 C25 C27	S
28	29	1	3	O'Dwyer, Miss. Ellen "Nellie"	female	NaN	0	0	330959	7.8792	NaN	Q
29	30	0	3	Todoroff, Mr. Lalio	male	NaN	0	0	349216	7.8958	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
861	862	0	2	Giles, Mr. Frederick Edward	male	21.0	1	0	28134	11.5000	NaN	S
862	863	1	1	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	0	0	17466	25.9292	D17	S
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
864	865	0	2	Gill, Mr. John William	male	24.0	0	0	233866	13.0000	NaN	S
865	866	1	2	Bystrom, Mrs. (Karolina)	female	42.0	0	0	236852	13.0000	NaN	S
866	867	1	2	Duran y More, Miss. Asuncion	female	27.0	1	0	SC/PARIS 2149	13.8583	NaN	C
867	868	0	1	Roebling, Mr. Washington Augustus II	male	31.0	0	0	PC 17590	50.4958	A24	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
869	870	1	3	Johnson, Master. Harold Theodor	male	4.0	1	1	347742	11.1333	NaN	S
870	871	0	3	Balkic, Mr. Cerin	male	26.0	0	0	349248	7.8958	NaN	S
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
872	873	0	1	Carlsson, Mr. Frans Olof	male	33.0	0	0	695	5.0000	B51 B53 B55	S
873	874	0	3	Vander Cruyssen, Mr. Victor	male	47.0	0	0	345765	9.0000	NaN	S
874	875	1	2	Abelson, Mrs. Samuel (Hannah Wizosky)	female	28.0	1	0	P/PP 3381	24.0000	NaN	C
875	876	1	3	Najib, Miss. Adele Kiamie "Jane"	female	15.0	0	0	2667	7.2250	NaN	C
876	877	0	3	Gustafsson, Mr. Alfred Ossian	male	20.0	0	0	7534	9.8458	NaN	S
877	878	0	3	Petroff, Mr. Nedelio	male	19.0	0	0	349212	7.8958	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
880	881	1	2	Shelley, Mrs. William (Imanita Parrish Hall)	female	25.0	0	1	230433	26.0000	NaN	S
881	882	0	3	Markun, Mr. Johann	male	33.0	0	0	349257	7.8958	NaN	S
882	883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	NaN	S
883	884	0	2	Banfield, Mr. Frederick James	male	28.0	0	0	C.A./SOTON 34068	10.5000	NaN	S
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

The function head() will print only the first elements, which is often enough to get an impression of the table.

table.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Another very useful feature is the function describe() that will compute several descriptive values like the mean, the quantiles and the number of values.

table.describe(include='all')

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
count	891.000000	891.000000	891.000000	891	891	714.000000	891.000000	891.000000	891	891.000000	204	889
unique	NaN	NaN	NaN	891	2	NaN	NaN	NaN	681	NaN	147	3
top	NaN	NaN	NaN	Williams, Mr. Charles Duane	male	NaN	NaN	NaN	CA. 2343	NaN	B96 B98	S
freq	NaN	NaN	NaN	1	577	NaN	NaN	NaN	7	NaN	4	644
mean	446.000000	0.383838	2.308642	NaN	NaN	29.699118	0.523008	0.381594	NaN	32.204208	NaN	NaN
std	257.353842	0.486592	0.836071	NaN	NaN	14.526497	1.102743	0.806057	NaN	49.693429	NaN	NaN
min	1.000000	0.000000	1.000000	NaN	NaN	0.420000	0.000000	0.000000	NaN	0.000000	NaN	NaN
25%	223.500000	0.000000	2.000000	NaN	NaN	20.125000	0.000000	0.000000	NaN	7.910400	NaN	NaN
50%	446.000000	0.000000	3.000000	NaN	NaN	28.000000	0.000000	0.000000	NaN	14.454200	NaN	NaN
75%	668.500000	1.000000	3.000000	NaN	NaN	38.000000	1.000000	0.000000	NaN	31.000000	NaN	NaN
max	891.000000	1.000000	3.000000	NaN	NaN	80.000000	8.000000	6.000000	NaN	512.329200	NaN	NaN

Using the function hist(), we can directly compute and show histograms of multiple attributes.

plot = table.hist(column= ['Age', 'Fare', 'Pclass', 'Survived' ], figsize=(15, 10))

Single columns of the table are accessed by using the column name in square brackets or via the dot operator. Each column is stored as a data series which is again some sort of list with special functions e.g. a nice textual representation.

fares = table['Fare']         # Extracting the column 'Fare' (alternatively we could use table.Fare)
print(type(fares))            # This is the data type of the column representation
print(fares.dtype)            # This is the data type of each element in the column
fares.head()

<class 'pandas.core.series.Series'>
float64





0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64

table['Pclass'].dtype

dtype('int64')

Data cleaning and preprocessing

In order to prepare our data for the later processing, we will first remove the attributes, we are not interested in.

cleaned_table = table.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin']) # step 1: drop some columns
cleaned_table.describe(include='all')

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
count	891.000000	891.000000	891	714.000000	891.000000	891.000000	891.000000	889
unique	NaN	NaN	2	NaN	NaN	NaN	NaN	3
top	NaN	NaN	male	NaN	NaN	NaN	NaN	S
freq	NaN	NaN	577	NaN	NaN	NaN	NaN	644
mean	0.383838	2.308642	NaN	29.699118	0.523008	0.381594	32.204208	NaN
std	0.486592	0.836071	NaN	14.526497	1.102743	0.806057	49.693429	NaN
min	0.000000	1.000000	NaN	0.420000	0.000000	0.000000	0.000000	NaN
25%	0.000000	2.000000	NaN	20.125000	0.000000	0.000000	7.910400	NaN
50%	0.000000	3.000000	NaN	28.000000	0.000000	0.000000	14.454200	NaN
75%	1.000000	3.000000	NaN	38.000000	1.000000	0.000000	31.000000	NaN
max	1.000000	3.000000	NaN	80.000000	8.000000	6.000000	512.329200	NaN

In a second step, we remove all entries, where not all attributes are present.

cleaned_table = cleaned_table.dropna()
cleaned_table.describe(include='all')

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
count	712.000000	712.000000	712	712.000000	712.000000	712.000000	712.000000	712
unique	NaN	NaN	2	NaN	NaN	NaN	NaN	3
top	NaN	NaN	male	NaN	NaN	NaN	NaN	S
freq	NaN	NaN	453	NaN	NaN	NaN	NaN	554
mean	0.404494	2.240169	NaN	29.642093	0.514045	0.432584	34.567251	NaN
std	0.491139	0.836854	NaN	14.492933	0.930692	0.854181	52.938648	NaN
min	0.000000	1.000000	NaN	0.420000	0.000000	0.000000	0.000000	NaN
25%	0.000000	1.000000	NaN	20.000000	0.000000	0.000000	8.050000	NaN
50%	0.000000	2.000000	NaN	28.000000	0.000000	0.000000	15.645850	NaN
75%	1.000000	3.000000	NaN	38.000000	1.000000	1.000000	33.000000	NaN
max	1.000000	3.000000	NaN	80.000000	5.000000	6.000000	512.329200	NaN

We can see, that the number of entries per attribute is now the same for all attributes.

In the last step we want to convert the categorical entries for ‘Sex’ to numerical values. Similar to ‘survived’ ‘male’ should be represented by 1 and ‘female’ by 0. This is done in the next cell.

sex_types = ['female', 'male']
cleaned_table.Sex = cleaned_table.Sex.astype("category").cat.codes
cleaned_table.describe(include='all')
#cleaned_table.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
count	712.000000	712.000000	712.000000	712.000000	712.000000	712.000000	712.000000	712
unique	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3
top	NaN	NaN	NaN	NaN	NaN	NaN	NaN	S
freq	NaN	NaN	NaN	NaN	NaN	NaN	NaN	554
mean	0.404494	2.240169	0.636236	29.642093	0.514045	0.432584	34.567251	NaN
std	0.491139	0.836854	0.481420	14.492933	0.930692	0.854181	52.938648	NaN
min	0.000000	1.000000	0.000000	0.420000	0.000000	0.000000	0.000000	NaN
25%	0.000000	1.000000	0.000000	20.000000	0.000000	0.000000	8.050000	NaN
50%	0.000000	2.000000	1.000000	28.000000	0.000000	0.000000	15.645850	NaN
75%	1.000000	3.000000	1.000000	38.000000	1.000000	1.000000	33.000000	NaN
max	1.000000	3.000000	1.000000	80.000000	5.000000	6.000000	512.329200	NaN

Finally we will save the cleaned table to a new csv file.

cleaned_table.to_csv('titanic_cleaned.csv', sep=';', index=False)

Author:	Dennis Wittich, Markus Rokicki, Artem Leichter
Last modified:	15.10.2019