Introduction to the Pandas module#

Scope#

This notebook presents some key functions for working with tabular data using the pandas module (https://pandas.pydata.org/).

Many examples and extensive documentation for this module are available online:

http://pandas.pydata.org/pandas-docs/stable/10min.html

http://www.python-simple.com/python-pandas/panda-intro.php

# Setup
# %matplotlib notebook # uncomment for interactive plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

Load data and create a dataframe from a CSV file#

More explanation can be found here: https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/

df = pd.read_csv("./_DATA/Note_csv.csv", delimiter=";")
df
section TD name ET CC
0 MM A ami 14.50 11.75
1 MM A joyce 8.50 11.50
2 MM C lola 9.50 13.25
3 MM B irma 7.50 6.00
4 IAI D florence 14.50 13.25
... ... ... ... ... ...
90 MM A james 13.75 12.75
91 IAI D richard 15.25 7.00
92 MM A caprice 18.25 15.00
93 IAI D al 12.50 9.75
94 MM B constance 3.00 7.00

95 rows × 5 columns
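
Since the course file itself may not be at hand, the same call can be tried on an in-memory string. This is a minimal sketch: the three-row sample and the variable names `csv_text` and `sample` are made up for illustration, and one ET field is left empty on purpose to show how a missing grade is read.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the course file:
# semicolon-separated, with one empty ET field (a missing grade).
csv_text = """section;TD;name;ET;CC
MM;A;ami;14.5;11.75
MM;A;joyce;;11.5
IAI;D;florence;14.5;13.25
"""

# sep=";" plays the same role as delimiter=";" in the call above
sample = pd.read_csv(io.StringIO(csv_text), sep=";")

print(sample.shape)            # (number of rows, number of columns)
print(sample.dtypes)           # ET and CC are parsed as floats
print(sample.ET.isna().sum())  # number of missing ET grades
```

Empty fields are read as NaN, which is why the next section fills them before doing any computation.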

Display the dataframe#

# replace missing grades (NaN) with 0.0
df = df.fillna(0.0)
# return the first 10 rows of the dataframe
df.head(10)
section TD name ET CC
0 MM A ami 14.50 11.75
1 MM A joyce 8.50 11.50
2 MM C lola 9.50 13.25
3 MM B irma 7.50 6.00
4 IAI D florence 14.50 13.25
5 MM B vi 11.00 7.50
6 MM B brian 14.00 16.25
7 MM B antoinette 14.50 17.00
8 IAI D fred 9.50 11.50
9 IAI D gaston 12.25 5.75
# return the end of the dataframe
df.tail(10)
section TD name ET CC
85 MM A vin 11.00 13.00
86 MM A jeunesse 12.00 10.50
87 MM A victoire 11.75 12.00
88 MM B joseph 8.00 10.00
89 MM A félix 13.00 14.50
90 MM A james 13.75 12.75
91 IAI D richard 15.25 7.00
92 MM A caprice 18.25 15.00
93 IAI D al 12.50 9.75
94 MM B constance 3.00 7.00
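
Filling missing values with 0.0, as done with `fillna(0.0)` above, changes every statistic computed afterwards. The sketch below uses a made-up three-grade series (`grades` is a hypothetical name) to compare the two conventions.

```python
import numpy as np
import pandas as pd

# Hypothetical grades, one of them missing
grades = pd.Series([12.0, np.nan, 9.0], name="ET")

# By default, mean() skips NaN: (12 + 9) / 2
mean_skipping_nan = grades.mean()

# After fillna(0.0), the missing grade counts as a zero: (12 + 0 + 9) / 3
mean_with_zeros = grades.fillna(0.0).mean()

print(mean_skipping_nan, mean_with_zeros)
```

Which convention is appropriate depends on what a missing grade means (an absence counted as zero, or a grade not yet entered).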

Selecting data in a dataframe#

# get data from index 2
df.loc[2]
section       MM
TD             C
name        lola
ET           9.5
CC         13.25
Name: 2, dtype: object
# get name from index 2
df.name[2]
'lola'
# Slicing also works

df.name[2:6]
2        lola
3        irma
4    florence
5          vi
Name: name, dtype: object
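
With the default index 0, 1, 2, ... labels and positions coincide, so the difference between label-based and position-based selection is not visible above. The following sketch uses a small made-up dataframe (`people`, with arbitrary index labels 10, 20, 30) to tell `loc` and `iloc` apart.

```python
import pandas as pd

# Hypothetical dataframe whose index labels differ from the row positions
people = pd.DataFrame(
    {"name": ["ami", "joyce", "lola"], "ET": [14.5, 8.5, 9.5]},
    index=[10, 20, 30],
)

by_label = people.loc[20]          # row whose index label is 20
by_position = people.iloc[1]       # second row, whatever its label
one_cell = people.loc[30, "name"]  # single value: row label 30, column "name"

label_slice = people.loc[10:20, "name"]    # label slices include both ends
position_slice = people.iloc[0:2]["name"]  # position slices exclude the end

print(by_label["name"], by_position["name"], one_cell)
```

Note that a row returned by `loc` is a Series whose `.name` attribute holds the row label (the `Name: 2` line in the output above), so the "name" entry of such a row is safer to read with brackets, as in `by_label["name"]`.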

Get one column of the dataframe#

df.TD
0     A
1     A
2     C
3     B
4     D
     ..
90    A
91    D
92    A
93    D
94    B
Name: TD, Length: 95, dtype: object

Get the number of students in each group#

df.TD.value_counts()
B    25
A    24
C    23
D    23
Name: TD, dtype: int64

Get the proportion of students in each group#

df.TD.value_counts(normalize=True)
B    0.263158
A    0.252632
C    0.242105
D    0.242105
Name: TD, dtype: float64
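
As a quick check of what `normalize=True` does, the sketch below recomputes the proportions by hand on a made-up series (`td` is a hypothetical name) and shows how to get the groups back in alphabetical order.

```python
import pandas as pd

# Hypothetical group assignments
td = pd.Series(["B", "A", "B", "C", "B", "A"], name="TD")

counts = td.value_counts()                # ordered by decreasing count
shares = td.value_counts(normalize=True)  # same order, values sum to 1
by_hand = counts / len(td)                # the same proportions, computed manually

alphabetical = shares.sort_index()        # reorder by group label instead

print(counts)
print(alphabetical)
```

The result is ordered by count, not by label, which matters whenever labels are supplied separately, for instance when plotting.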

Display the proportion of students in each group#

Using the plot function of pandas:

The visualization options of pandas are described here: http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html

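fig = plt.figure()
# No explicit labels: pandas labels each wedge with its group from the index,
# which value_counts() orders by decreasing count (B, A, C, D), not alphabetically.
df.TD.value_counts(normalize=True).plot.pie(
    colors=["r", "g", "b", "y"], autopct="%.1f"
)
plt.show()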

Using the plot function of matplotlib:

counts = df.TD.value_counts(normalize=True)
val = counts.values
explode = (0.5, 0, 0.2, 0)
labels = counts.index  # group labels, in the same order as the values
fig1, ax1 = plt.subplots()
ax1.pie(
    val, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True, startangle=90
)
ax1.axis("equal")  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Get the list of students who got a grade higher than 14 on both ET and CC#

df[(df.ET > 14.0) & (df.CC > 14.0)]
section TD name ET CC
7 MM B antoinette 14.50 17.0
16 MM C louis 17.50 15.5
77 MM C karl 14.50 17.5
81 MM B mari 15.00 15.0
82 IAI C rose 17.50 15.0
92 MM A caprice 18.25 15.0
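
The filter above combines two boolean masks with `&`. The parentheses around each comparison are required because `&` binds more tightly than `>`. Here is a minimal sketch on a made-up dataframe (`marks` and its values are hypothetical), including the equivalent `query()` form.

```python
import pandas as pd

# Hypothetical grades
marks = pd.DataFrame(
    {
        "name": ["ami", "antoinette", "irma", "gaston"],
        "ET": [14.5, 14.5, 7.5, 12.25],
        "CC": [11.75, 17.0, 6.0, 15.0],
    }
)

# One boolean per row; & means "and", | means "or"
mask = (marks.ET > 14.0) & (marks.CC > 14.0)
both = marks[mask]

# Same selection written as a query string
same = marks.query("ET > 14 and CC > 14")

# Students above 14 in at least one of the two grades
either = marks[(marks.ET > 14.0) | (marks.CC > 14.0)]

print(both)
print(either)
```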

Make calculations on the data#

The mean ET grade over all students#

df.ET.mean()
10.810526315789474

The mean ET grade over the students of group B#

df.ET[df.TD == "B"].mean()
9.72
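
Selecting one group with a boolean mask works well for a single value. To get the mean of every group at once, `groupby` can be used, as in this sketch on made-up data (`marks` is a hypothetical dataframe).

```python
import pandas as pd

# Hypothetical grades for two groups
marks = pd.DataFrame(
    {"TD": ["A", "B", "B", "A"], "ET": [12.0, 8.0, 10.0, 14.0]}
)

# Mean of one group, using a boolean mask as above
mean_b = marks.ET[marks.TD == "B"].mean()

# Mean of every group in one call: a Series indexed by group label
means = marks.groupby("TD")["ET"].mean()

print(mean_b)
print(means)
```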

Statistical description of the data by TD using the ‘groupby()’ function#

df.groupby(["TD"]).describe()  # summary statistics (count, mean, std, quartiles, ...) of each grade for each group
ET CC
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
TD
A 24.0 12.552083 2.953424 4.5 11.5625 12.875 14.125 18.25 24.0 12.531250 2.350199 7.50 11.4375 12.75 13.625 17.50
B 25.0 9.720000 3.445498 3.0 7.5000 9.500 13.000 15.50 25.0 9.690000 3.906645 0.00 7.5000 10.25 11.750 17.00
C 23.0 10.630435 4.295128 1.0 7.5000 9.500 14.500 17.50 23.0 11.913043 2.903376 6.50 9.8750 12.00 13.750 17.50
D 23.0 10.358696 5.376097 0.0 6.7500 11.750 14.500 17.75 23.0 9.076087 3.244256 0.25 8.1250 10.00 11.125 13.25
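
When only a few of these statistics are needed, `agg` with a list of function names gives a narrower table. The sketch below assumes a small made-up dataframe (`marks`); the resulting columns form a two-level index (grade, statistic), like the `describe()` output above.

```python
import pandas as pd

# Hypothetical grades for two groups
marks = pd.DataFrame(
    {
        "TD": ["A", "B", "B", "A"],
        "ET": [12.0, 8.0, 10.0, 14.0],
        "CC": [10.0, 9.0, 11.0, 12.0],
    }
)

# Keep only three statistics per grade and per group
summary = marks.groupby("TD")[["ET", "CC"]].agg(["count", "mean", "max"])

# A single cell is addressed by (row label, (grade, statistic))
et_mean_a = summary.loc["A", ("ET", "mean")]

print(summary)
print(et_mean_a)
```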

Display the grades with a histogram plot#

# CC grades
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()
# ET grades
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()
# ET and CC grades on the same axes
# (DataFrame.plot creates its own figure, so no plt.figure() call is needed here)
df.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()
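
The histograms above use `bins=np.arange(1, 20)`, that is, bin edges from 1 to 19. Values outside the outermost edges are not counted, so grades below 1 (the summary table shows minima of 0) do not appear in these plots. The sketch below illustrates this with `np.histogram` on made-up grades. The alternative `np.arange(0, 21)` is only a suggestion for grades on a 0 to 20 scale, not what the notebook uses.

```python
import numpy as np

# Hypothetical grades, including values near both ends of a 0-20 scale
grades = np.array([0.0, 0.5, 3.0, 7.5, 18.25, 19.5])

# Edges 1, 2, ..., 19: values below 1 or above 19 are dropped
narrow_counts, narrow_edges = np.histogram(grades, bins=np.arange(1, 20))

# Edges 0, 1, ..., 20: every grade on a 0-20 scale is counted
full_counts, full_edges = np.histogram(grades, bins=np.arange(0, 21))

print(narrow_counts.sum(), "of", grades.size, "grades counted with edges 1..19")
print(full_counts.sum(), "of", grades.size, "grades counted with edges 0..20")
```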

Let’s compute the mean of both grades#

First, we need to add a new column to the dataframe#

df["FinalNote"] = 0.0  # add a column filled with 0.0
df
section TD name ET CC FinalNote
0 MM A ami 14.50 11.75 0.0
1 MM A joyce 8.50 11.50 0.0
2 MM C lola 9.50 13.25 0.0
3 MM B irma 7.50 6.00 0.0
4 IAI D florence 14.50 13.25 0.0
... ... ... ... ... ... ...
90 MM A james 13.75 12.75 0.0
91 IAI D richard 15.25 7.00 0.0
92 MM A caprice 18.25 15.00 0.0
93 IAI D al 12.50 9.75 0.0
94 MM B constance 3.00 7.00 0.0

95 rows × 6 columns

df.head()
section TD name ET CC FinalNote
0 MM A ami 14.5 11.75 0.0
1 MM A joyce 8.5 11.50 0.0
2 MM C lola 9.5 13.25 0.0
3 MM B irma 7.5 6.00 0.0
4 IAI D florence 14.5 13.25 0.0

Let’s compute the weighted mean#

# weighted mean of the two grades: 70% ET, 30% CC (computed row by row)
df["FinalNote"] = 0.7 * df.ET + 0.3 * df.CC
df.head()
section TD name ET CC FinalNote
0 MM A ami 14.5 11.75 13.675
1 MM A joyce 8.5 11.50 9.400
2 MM C lola 9.5 13.25 10.625
3 MM B irma 7.5 6.00 7.050
4 IAI D florence 14.5 13.25 14.125
fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(1, 20))
plt.show()

What is the overall mean?#

df.FinalNote.mean()
10.80657894736842
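
As a cross-check of the weighted mean, the sketch below recomputes it with `np.average` on the first three rows shown earlier (ami, joyce, lola). The names `marks` and `weights` are introduced here for illustration only. It also verifies that the overall mean of the final grade equals the same weighted combination of the ET and CC means, since the mean is linear.

```python
import numpy as np
import pandas as pd

# First three rows of the grades shown in this notebook
marks = pd.DataFrame({"ET": [14.5, 8.5, 9.5], "CC": [11.75, 11.5, 13.25]})
weights = [0.7, 0.3]  # 70% ET, 30% CC, as in the formula above

# Column arithmetic: one final grade per row
marks["FinalNote"] = 0.7 * marks.ET + 0.3 * marks.CC

# Same result with np.average; axis=1 averages across the columns of each row
via_average = np.average(marks[["ET", "CC"]].to_numpy(), axis=1, weights=weights)

# Mean of the final grade vs. weighted combination of the two means
overall = marks.FinalNote.mean()
combined = 0.7 * marks.ET.mean() + 0.3 * marks.CC.mean()

print(marks)
print(overall, combined)
```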