Introduction to the Pandas module#

Scope#

This notebook presents some key functions for working with tabular data using the pandas module (https://pandas.pydata.org/).

Many examples and extensive documentation for this module are available online:

http://pandas.pydata.org/pandas-docs/stable/10min.html

http://www.python-simple.com/python-pandas/panda-intro.php

# Setup
# %matplotlib notebook # uncomment for interactive plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

Load data and create a dataframe from a CSV file#

More explanation can be found here: https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/

df = pd.read_csv("./_DATA/Note_csv.csv", delimiter=";")
df

	section	TD	name	ET	CC
0	MM	A	ami	14.50	11.75
1	MM	A	joyce	8.50	11.50
2	MM	C	lola	9.50	13.25
3	MM	B	irma	7.50	6.00
4	IAI	D	florence	14.50	13.25
...	...	...	...	...	...
90	MM	A	james	13.75	12.75
91	IAI	D	richard	15.25	7.00
92	MM	A	caprice	18.25	15.00
93	IAI	D	al	12.50	9.75
94	MM	B	constance	3.00	7.00

95 rows × 5 columns
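
To experiment without the original file, the same call can be tried on an in-memory text buffer. The sketch below assumes a semicolon-separated layout like the one implied by `delimiter=";"`; the five rows are copied from the table displayed above, and `csv_text` and `sample` are names introduced here for illustration only.

```python
import io

import pandas as pd

# A few rows copied from the table above, written as semicolon-separated text
# (the exact layout of Note_csv.csv is an assumption).
csv_text = """section;TD;name;ET;CC
MM;A;ami;14.50;11.75
MM;A;joyce;8.50;11.50
MM;C;lola;9.50;13.25
MM;B;irma;7.50;6.00
IAI;D;florence;14.50;13.25
"""

# sep=";" and delimiter=";" are two names for the same option
sample = pd.read_csv(io.StringIO(csv_text), sep=";")

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # ET and CC are parsed as floating-point numbers
```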

Display the dataframe#

# replace missing values with 0.0
df = df.fillna(0.0)
# return the first 10 rows of the dataframe
df.head(10)

	section	TD	name	ET	CC
0	MM	A	ami	14.50	11.75
1	MM	A	joyce	8.50	11.50
2	MM	C	lola	9.50	13.25
3	MM	B	irma	7.50	6.00
4	IAI	D	florence	14.50	13.25
5	MM	B	vi	11.00	7.50
6	MM	B	brian	14.00	16.25
7	MM	B	antoinette	14.50	17.00
8	IAI	D	fred	9.50	11.50
9	IAI	D	gaston	12.25	5.75
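
The `fillna(0.0)` call above replaces missing values before the table is displayed. Here is a minimal sketch of its effect, assuming a missing grade shows up as `NaN` after loading; the table `grades` and its contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical grades; NaN marks a missing grade
grades = pd.DataFrame(
    {
        "name": ["ann", "bob", "cat"],
        "ET": [12.0, np.nan, 9.5],
        "CC": [np.nan, 8.0, 11.0],
    }
)

print(grades.isna().sum())  # number of missing values per column

filled = grades.fillna(0.0)  # returns a new dataframe; grades is unchanged
print(filled)
```

Counting a missing grade as 0.0 lowers the means computed later, so whether this is the right choice depends on the grading rules.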

# return the end of the dataframe
df.tail(10)

	section	TD	name	ET	CC
85	MM	A	vin	11.00	13.00
86	MM	A	jeunesse	12.00	10.50
87	MM	A	victoire	11.75	12.00
88	MM	B	joseph	8.00	10.00
89	MM	A	fꭩx	13.00	14.50
90	MM	A	james	13.75	12.75
91	IAI	D	richard	15.25	7.00
92	MM	A	caprice	18.25	15.00
93	IAI	D	al	12.50	9.75
94	MM	B	constance	3.00	7.00

Selecting data in a dataframe#

# get data from index 2
df.loc[2]

section       MM
TD             C
name        lola
ET           9.5
CC         13.25
Name: 2, dtype: object

# get name from index 2
df.name[2]

'lola'

# Slicing also works

df.name[2:6]

2        lola
3        irma
4    florence
5          vi
Name: name, dtype: object
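
Attribute access such as `df.name` is a shortcut that only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute; bracket access and `.loc` / `.iloc` always work. Here is a minimal sketch on a few rows copied from the table above (the variable `small` is introduced here for illustration):

```python
import pandas as pd

small = pd.DataFrame(
    {
        "TD": ["A", "A", "C", "B"],
        "name": ["ami", "joyce", "lola", "irma"],
        "ET": [14.5, 8.5, 9.5, 7.5],
    }
)

print(small["name"][2])        # bracket access to the column, then label 2
print(small.loc[2, "name"])    # label-based: row label 2, column "name"
print(small.iloc[2, 1])        # position-based: third row, second column
print(small.loc[1:2, "name"])  # .loc slices include the end label
print(small["name"][1:2])      # plain slices are positional and exclude the end
```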

Get one column of the dataframe#

df.TD

0     A
1     A
2     C
3     B
4     D
     ..
90    A
91    D
92    A
93    D
94    B
Name: TD, Length: 95, dtype: object

Get the number of students in each group#

df.TD.value_counts()

B    25
A    24
C    23
D    23
Name: TD, dtype: int64

Get the proportion of students in each group#

df.TD.value_counts(normalize=True)

B    0.263158
A    0.252632
C    0.242105
D    0.242105
Name: TD, dtype: float64
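
`value_counts()` sorts its result by decreasing count, not by label, which matters as soon as the result is paired with a hand-written list of labels. Here is a minimal sketch with a hypothetical list of group letters (`td`, `counts`, `by_label` and `shares` are illustrative names):

```python
import pandas as pd

# Hypothetical group assignments
td = pd.Series(["B", "A", "B", "C", "B", "A"], name="TD")

counts = td.value_counts()
print(counts.index.tolist())  # most frequent group first

by_label = counts.sort_index()  # reorder alphabetically by group label
print(by_label.index.tolist())

shares = td.value_counts(normalize=True).sort_index()
print(shares.sum())  # the proportions add up to 1
```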

Display the proportion of students in each group#

Using the plot function of pandas:

The visualization options of pandas are described here: http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html

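fig = plt.figure()
# value_counts() sorts by decreasing count, so sort by group label and let
# pandas take the wedge labels from the index instead of hard-coding them
df.TD.value_counts(normalize=True).sort_index().plot.pie(
    colors=["r", "g", "b", "y"], autopct="%.1f"
)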
plt.show()

../../_images/51a370403579ceb47682d905d9e53a286892c6f926042481efb22c7c31ec8a6c.png

Using the plot function of matplotlib:

# sort by group label so that values and labels are in the same order
shares = df.TD.value_counts(normalize=True).sort_index()
val = shares.values
explode = (0.5, 0, 0.2, 0)
labels = shares.index
fig1, ax1 = plt.subplots()
ax1.pie(
    val, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True, startangle=90
)
ax1.axis("equal")  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

../../_images/89ffc98727d36b77e096919257a8df0fccee417c73f1d1ec166cdd409f9212b1.png

Get the list of students with a grade higher than 14 on both ET and CC#

df[(df.ET > 14.0) & (df.CC > 14.0)]

	section	TD	name	ET	CC
7	MM	B	antoinette	14.50	17.0
16	MM	C	louis	17.50	15.5
77	MM	C	karl	14.50	17.5
81	MM	B	mari	15.00	15.0
82	IAI	C	rose	17.50	15.0
92	MM	A	caprice	18.25	15.0
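
The filter works by building one boolean Series per condition and combining them element-wise with `&`; the parentheses are required because `&` binds more tightly than `>`. Here is a sketch on rows copied from the tables above (`small` and `mask` are illustrative names), with `query` shown as an equivalent spelling:

```python
import pandas as pd

small = pd.DataFrame(
    {
        "name": ["ami", "brian", "antoinette", "gaston"],
        "ET": [14.5, 14.0, 14.5, 12.25],
        "CC": [11.75, 16.25, 17.0, 5.75],
    }
)

mask = (small.ET > 14.0) & (small.CC > 14.0)  # element-wise AND
print(mask.tolist())

selected = small[mask]  # keep the rows where the mask is True
print(selected)

# the same selection written as a query string
print(small.query("ET > 14.0 and CC > 14.0"))
```

Note that the comparison is strict: a grade of exactly 14.0 does not pass the filter.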

Make calculations on the data#

The mean ET grade over all students#

df.ET.mean()

10.810526315789474

The mean ET grade over the students of group B#

df.ET[df.TD == "B"].mean()

9.72

Statistical description of the data by TD using the ‘groupby()’ function#

df.groupby(["TD"]).describe()  # summary statistics (count, mean, std, quartiles) of each grade for each group

	ET								CC
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
TD
A	24.0	12.552083	2.953424	4.5	11.5625	12.875	14.125	18.25	24.0	12.531250	2.350199	7.50	11.4375	12.75	13.625	17.50
B	25.0	9.720000	3.445498	3.0	7.5000	9.500	13.000	15.50	25.0	9.690000	3.906645	0.00	7.5000	10.25	11.750	17.00
C	23.0	10.630435	4.295128	1.0	7.5000	9.500	14.500	17.50	23.0	11.913043	2.903376	6.50	9.8750	12.00	13.750	17.50
D	23.0	10.358696	5.376097	0.0	6.7500	11.750	14.500	17.75	23.0	9.076087	3.244256	0.25	8.1250	10.00	11.125	13.25
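
When only a few statistics are needed, `agg` gives a more compact table than `describe()`. Here is a minimal sketch on hypothetical grades (the table `grades` and its values are invented for illustration):

```python
import pandas as pd

grades = pd.DataFrame(
    {
        "TD": ["A", "A", "B", "B"],
        "ET": [10.0, 14.0, 8.0, 12.0],
        "CC": [12.0, 16.0, 9.0, 11.0],
    }
)

# one row per group, one (column, statistic) pair per output column
summary = grades.groupby("TD")[["ET", "CC"]].agg(["mean", "max"])
print(summary)

# mean of a single column per group
et_mean = grades.groupby("TD")["ET"].mean()
print(et_mean)
```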

Display the grades with a histogram plot#

# CC grades
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()

../../_images/5cc1ffd40a8bdf693c91968536f3048e331f7ff0da8a61710c29210007f47b16.png

# ET grades
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()

../../_images/55efc59eb631048d074123a12f639e853f0e6f2d2bd11cb953933442fa8bf6c7.png
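
Note that `bins=np.arange(1, 20)` defines bin edges from 1 to 19, so grades outside that range (for example a 0.0 introduced by `fillna`) are not drawn. The sketch below uses `np.histogram` on hypothetical grades to show which values are counted; the 0 to 20 grading scale used for the second set of edges is an assumption.

```python
import numpy as np

# Hypothetical grades, including values outside the bin range
grades = np.array([0.0, 0.25, 5.5, 12.0, 12.75, 18.25, 19.5])

edges = np.arange(1, 20)  # edges 1, 2, ..., 19, i.e. 18 bins
counts, _ = np.histogram(grades, bins=edges)
print(len(counts), counts.sum())  # number of bins, grades inside [1, 19]

# edges 0, 1, ..., 20 cover an assumed 0-20 grading scale
full_counts, _ = np.histogram(grades, bins=np.arange(0, 21))
print(full_counts.sum())
```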

# ET and CC grades together; DataFrame.plot creates its own figure here,
# so no plt.figure() call is needed
df.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()

../../_images/f7e348940ac925eabc5f8627231924caa117d88e77974da2fbf81b3f6f73f440.png

Let’s compute the mean of both grades#

First we need to add a new column to the dataframe#

df["FinalNote"] = 0.0  # add a column filled with 0.0
df

	section	TD	name	ET	CC	FinalNote
0	MM	A	ami	14.50	11.75	0.0
1	MM	A	joyce	8.50	11.50	0.0
2	MM	C	lola	9.50	13.25	0.0
3	MM	B	irma	7.50	6.00	0.0
4	IAI	D	florence	14.50	13.25	0.0
...	...	...	...	...	...	...
90	MM	A	james	13.75	12.75	0.0
91	IAI	D	richard	15.25	7.00	0.0
92	MM	A	caprice	18.25	15.00	0.0
93	IAI	D	al	12.50	9.75	0.0
94	MM	B	constance	3.00	7.00	0.0

95 rows × 6 columns

df.head()

	section	TD	name	ET	CC	FinalNote
0	MM	A	ami	14.5	11.75	0.0
1	MM	A	joyce	8.5	11.50	0.0
2	MM	C	lola	9.5	13.25	0.0
3	MM	B	irma	7.5	6.00	0.0
4	IAI	D	florence	14.5	13.25	0.0

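Let’s compute the weighted mean#

# weighted mean of the two grades (70% ET, 30% CC), computed for every row at once
df["FinalNote"] = 0.7 * df.ET + 0.3 * df.CC
# for an unweighted mean, df[["ET", "CC"]].mean(axis=1) averages across the columns of each row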

df.head()

	section	TD	name	ET	CC	FinalNote
0	MM	A	ami	14.5	11.75	13.675
1	MM	A	joyce	8.5	11.50	9.400
2	MM	C	lola	9.5	13.25	10.625
3	MM	B	irma	7.5	6.00	7.050
4	IAI	D	florence	14.5	13.25	14.125
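
The weighted mean is computed for all rows in a single vectorized expression. Here is a sketch on rows copied from the table above, checking the result against `np.average` (the names `small`, `weights` and `check` are illustrative):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"ET": [14.5, 8.5, 9.5], "CC": [11.75, 11.5, 13.25]})
weights = [0.7, 0.3]  # weight of ET, weight of CC

small["FinalNote"] = weights[0] * small.ET + weights[1] * small.CC

# np.average applies the same weights across the two columns of each row
check = np.average(small[["ET", "CC"]].to_numpy(), axis=1, weights=weights)
print(np.allclose(small.FinalNote, check))
```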

fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()

../../_images/5eaccc84e8ec9d51e1366c6e0ddc00363f405a72627d16878be3b3a4576d1f30.png

What is the overall mean?#

df.FinalNote.mean()

10.80657894736842