Introduction to the Pandas module#


This notebook presents some key functions for working with tabular data using the pandas module.

Many examples and extensive documentation for this module are available online.

# Setup
# %matplotlib notebook # uncomment for interactive plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

Load data and create a dataframe from a CSV file#

More explanation can be found in the pandas documentation on reading CSV files.

df = pd.read_csv("./_DATA/Note_csv.csv", delimiter=";")
df
section TD name ET CC
0 MM A ami 14.50 11.75
1 MM A joyce 8.50 11.50
2 MM C lola 9.50 13.25
3 MM B irma 7.50 6.00
4 IAI D florence 14.50 13.25
... ... ... ... ... ...
90 MM A james 13.75 12.75
91 IAI D richard 15.25 7.00
92 MM A caprice 18.25 15.00
93 IAI D al 12.50 9.75
94 MM B constance 3.00 7.00

95 rows × 5 columns
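As a minimal sketch of what the `delimiter=";"` argument does, the block below reads a semicolon-separated table from an in-memory string instead of the file. The `csv_text` string and the `sample` name are assumptions made for illustration; the rows are copied from the output above.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the CSV file (rows taken from the output above)
csv_text = """section;TD;name;ET;CC
MM;A;ami;14.50;11.75
MM;A;joyce;8.50;11.50
MM;C;lola;9.50;13.25
MM;B;irma;7.50;6.00
"""

# `delimiter` is an alias of `sep`: both tell pandas which character separates fields
sample = pd.read_csv(io.StringIO(csv_text), delimiter=";")

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the grade columns are parsed as floats
```

Without the `delimiter` argument pandas would split on commas and read each line as a single column.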

Display the dataframe#

# fill missing values with 0.0, then return the beginning of the dataframe
df = df.fillna(0.0)
df.head(10)
section TD name ET CC
0 MM A ami 14.50 11.75
1 MM A joyce 8.50 11.50
2 MM C lola 9.50 13.25
3 MM B irma 7.50 6.00
4 IAI D florence 14.50 13.25
5 MM B vi 11.00 7.50
6 MM B brian 14.00 16.25
7 MM B antoinette 14.50 17.00
8 IAI D fred 9.50 11.50
9 IAI D gaston 12.25 5.75
# return the end of the dataframe
df.tail(10)
section TD name ET CC
85 MM A vin 11.00 13.00
86 MM A jeunesse 12.00 10.50
87 MM A victoire 11.75 12.00
88 MM B joseph 8.00 10.00
89 MM A félix 13.00 14.50
90 MM A james 13.75 12.75
91 IAI D richard 15.25 7.00
92 MM A caprice 18.25 15.00
93 IAI D al 12.50 9.75
94 MM B constance 3.00 7.00
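The following sketch illustrates `fillna`, `head` and `tail` on a small frame. The `grades` frame and its missing value are hypothetical, introduced only to make the effect of `fillna` visible; here the replacement is restricted to the grade columns by passing a dictionary.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the NaN is a hypothetical missing grade
grades = pd.DataFrame(
    {
        "name": ["ami", "joyce", "lola", "irma"],
        "ET": [14.5, 8.5, np.nan, 7.5],
        "CC": [11.75, 11.5, 13.25, 6.0],
    }
)

# fillna returns a new frame; `grades` itself is left unchanged
filled = grades.fillna({"ET": 0.0, "CC": 0.0})

first_two = filled.head(2)  # first 2 rows
last_two = filled.tail(2)   # last 2 rows
print(first_two)
print(last_two)
```

Replacing a missing grade by 0.0 lowers every mean computed afterwards, so it is a modelling choice rather than a neutral cleanup.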

Selecting data in a dataframe#

# get data from index 2
df.loc[2]
section       MM
TD             C
name        lola
ET           9.5
CC         13.25
Name: 2, dtype: object
# get name from index 2
df.name[2]
# Slicing also works
df.name[2:6]
2        lola
3        irma
4    florence
5          vi
Name: name, dtype: object
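With the default index, labels and positions coincide, which hides the difference between label-based and position-based selection. The sketch below uses a hypothetical frame with a shifted index (an assumption for illustration) to separate the two.

```python
import pandas as pd

# Hypothetical frame whose index labels (10..13) differ from positions (0..3)
sample = pd.DataFrame(
    {
        "TD": ["A", "A", "C", "B"],
        "name": ["ami", "joyce", "lola", "irma"],
        "ET": [14.5, 8.5, 9.5, 7.5],
    },
    index=[10, 11, 12, 13],
)

row_by_label = sample.loc[12]      # row whose index label is 12
row_by_position = sample.iloc[2]   # third row, whatever its label

names_by_position = sample["name"].iloc[1:3]  # positional slice, end excluded
names_by_label = sample.loc[11:12, "name"]    # label slice, end included

print(row_by_label)
print(names_by_position)
```

Both slices return the same two names here, but for different reasons: `iloc` stops before position 3, while `loc` includes label 12.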

Get one column of the dataframe#

df.TD
0     A
1     A
2     C
3     B
4     D
     ..
90    A
91    D
92    A
93    D
94    B
Name: TD, Length: 95, dtype: object

Get the number of students in each group#

df.TD.value_counts()
B    25
A    24
C    23
D    23
Name: TD, dtype: int64

Get the proportion of students in each group#

df.TD.value_counts(normalize=True)
B    0.263158
A    0.252632
C    0.242105
D    0.242105
Name: TD, dtype: float64
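A small sketch of `value_counts` follows, using the TD values of the first ten rows shown earlier. The `td` series is an assumption for illustration, so the counts differ from those of the full dataset.

```python
import pandas as pd

# TD values of the first ten rows displayed earlier (not the full dataset)
td = pd.Series(["A", "A", "C", "B", "D", "B", "B", "B", "D", "D"], name="TD")

counts = td.value_counts()                # sorted by decreasing count
shares = td.value_counts(normalize=True)  # same order, values sum to 1
by_label = counts.sort_index()            # reordered alphabetically by group

print(counts)
print(shares)
print(by_label)
```

The result is ordered by frequency, not by group name, which matters whenever the values are paired with a hand-written list of labels.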

Display the proportion of students in each group#

Using the plot function of pandas:

The visualization options of pandas are described in the pandas documentation.

fig = plt.figure()
df.TD.value_counts(normalize=True).sort_index().plot.pie(  # sorted so slices match labels
    labels=["A", "B", "C", "D"], colors=["r", "g", "b", "y"], autopct="%.1f"
)

Using the plot function of matplotlib:

# sort_index() orders the groups A to D so that they match the labels below
val = df.TD.value_counts(normalize=True).sort_index().values
explode = (0.5, 0, 0.2, 0)
labels = "A", "B", "C", "D"
fig1, ax1 = plt.subplots()
ax1.pie(
    val, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True, startangle=90
)
ax1.axis("equal")  # Equal aspect ratio ensures that pie is drawn as a circle.
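Because `value_counts` orders groups by frequency, a hand-written label list can end up attached to the wrong slices. One way to avoid this, sketched here on a hypothetical `td` series (the TD values of the first ten rows), is to take the labels from the index of the counted series. The `Agg` backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# TD values of the first ten rows displayed earlier (not the full dataset)
td = pd.Series(["A", "A", "C", "B", "D", "B", "B", "B", "D", "D"], name="TD")
shares = td.value_counts(normalize=True).sort_index()

fig, ax = plt.subplots()
# Labels come from the series index, so each slice keeps its own group name
ax.pie(shares.values, labels=list(shares.index), autopct="%1.1f%%", startangle=90)
ax.axis("equal")  # draw the pie as a circle
fig.savefig("td_shares.png")  # hypothetical output file name
```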

Get the list of students with a grade higher than 14 on both ET and CC#

df[(df.ET > 14.0) & (df.CC > 14.0)]
section TD name ET CC
7 MM B antoinette 14.50 17.0
16 MM C louis 17.50 15.5
77 MM C karl 14.50 17.5
81 MM B mari 15.00 15.0
82 IAI C rose 17.50 15.0
92 MM A caprice 18.25 15.0
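The selection above combines two boolean masks with `&`. The sketch below repeats it on a few rows taken from the tables shown earlier; the `sample` frame is an assumption for illustration, and `query` is shown as an alternative spelling of the same filter.

```python
import pandas as pd

# A few rows copied from the outputs above
sample = pd.DataFrame(
    {
        "name": ["ami", "joyce", "antoinette", "richard", "caprice"],
        "ET": [14.5, 8.5, 14.5, 15.25, 18.25],
        "CC": [11.75, 11.5, 17.0, 7.0, 15.0],
    }
)

# Each comparison yields a boolean Series; the parentheses are required
# because & and | bind more tightly than > in Python
both = sample[(sample["ET"] > 14.0) & (sample["CC"] > 14.0)]
either = sample[(sample["ET"] > 14.0) | (sample["CC"] > 14.0)]

# Same filter as `both`, written as a query string
same = sample.query("ET > 14.0 and CC > 14.0")

print(both)
print(either)
```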

Make calculations on the data#

The mean of the ET grade over all students#

df.ET.mean()

The mean of ET over students from group B#

df.ET[df.TD == "B"].mean()
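As a sketch of the same computation on a reduced dataset, the block below uses the first ten rows displayed earlier (the `sample` frame is an assumption for illustration) and selects rows and column in a single `.loc` call.

```python
import pandas as pd

# TD and ET values of the first ten rows displayed earlier
sample = pd.DataFrame(
    {
        "TD": ["A", "A", "C", "B", "D", "B", "B", "B", "D", "D"],
        "ET": [14.5, 8.5, 9.5, 7.5, 14.5, 11.0, 14.0, 14.5, 9.5, 12.25],
    }
)

mean_all = sample["ET"].mean()

# Boolean mask for the rows, column name for the column
mean_b = sample.loc[sample["TD"] == "B", "ET"].mean()

print(mean_all)
print(mean_b)
```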

Statistical description of the data by TD using the ‘groupby()’ function#

df.groupby(["TD"]).describe()  # compute summary statistics of each grade for each group
TD | ET: count mean std min 25% 50% 75% max | CC: count mean std min 25% 50% 75% max
A 24.0 12.552083 2.953424 4.5 11.5625 12.875 14.125 18.25 24.0 12.531250 2.350199 7.50 11.4375 12.75 13.625 17.50
B 25.0 9.720000 3.445498 3.0 7.5000 9.500 13.000 15.50 25.0 9.690000 3.906645 0.00 7.5000 10.25 11.750 17.00
C 23.0 10.630435 4.295128 1.0 7.5000 9.500 14.500 17.50 23.0 11.913043 2.903376 6.50 9.8750 12.00 13.750 17.50
D 23.0 10.358696 5.376097 0.0 6.7500 11.750 14.500 17.75 23.0 9.076087 3.244256 0.25 8.1250 10.00 11.125 13.25
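When only a few statistics are needed, `agg` can replace `describe`. The sketch below is one possible variant, run on the first ten rows displayed earlier; the `sample` frame and the choice of statistics are assumptions for illustration.

```python
import pandas as pd

# First ten rows displayed earlier (TD, ET and CC columns only)
sample = pd.DataFrame(
    {
        "TD": ["A", "A", "C", "B", "D", "B", "B", "B", "D", "D"],
        "ET": [14.5, 8.5, 9.5, 7.5, 14.5, 11.0, 14.0, 14.5, 9.5, 12.25],
        "CC": [11.75, 11.5, 13.25, 6.0, 13.25, 7.5, 16.25, 17.0, 11.5, 5.75],
    }
)

# One row per group; columns form a two-level index (grade, statistic)
summary = sample.groupby("TD")[["ET", "CC"]].agg(["count", "mean"])

print(summary)
print(summary.loc["B", ("ET", "mean")])  # a single cell of the summary
```

As with `describe`, the result has two header levels: the grade name on top and the statistic below it.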

Display the grades with a histogram plot#

# CC grades (the bins cover the full 0-20 grade range)
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(0, 21))
# ET grades
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(0, 21))
# Both grades together; DataFrame.plot creates its own figure
df.plot.hist(alpha=0.5, bins=np.arange(0, 21))

Let’s compute the mean of both grades#

We first need to add a new column to the dataframe#

df["FinalNote"] = 0.0  # add a column filled with 0.0
df
section TD name ET CC FinalNote
0 MM A ami 14.50 11.75 0.0
1 MM A joyce 8.50 11.50 0.0
2 MM C lola 9.50 13.25 0.0
3 MM B irma 7.50 6.00 0.0
4 IAI D florence 14.50 13.25 0.0
... ... ... ... ... ... ...
90 MM A james 13.75 12.75 0.0
91 IAI D richard 15.25 7.00 0.0
92 MM A caprice 18.25 15.00 0.0
93 IAI D al 12.50 9.75 0.0
94 MM B constance 3.00 7.00 0.0

95 rows × 6 columns

df.head()
section TD name ET CC FinalNote
0 MM A ami 14.5 11.75 0.0
1 MM A joyce 8.5 11.50 0.0
2 MM C lola 9.5 13.25 0.0
3 MM B irma 7.5 6.00 0.0
4 IAI D florence 14.5 13.25 0.0

Let’s compute the weighted mean#

# weighted mean: ET counts for 70 % and CC for 30 %
df["FinalNote"] = 0.7 * df.ET + 0.3 * df.CC
# for an unweighted mean, the axis option of mean() selects rows or columns
df.head()
section TD name ET CC FinalNote
0 MM A ami 14.5 11.75 13.675
1 MM A joyce 8.5 11.50 9.400
2 MM C lola 9.5 13.25 10.625
3 MM B irma 7.5 6.00 7.050
4 IAI D florence 14.5 13.25 14.125
fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(0, 21))
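The comment about the `axis` option can be illustrated with a short sketch. It compares the weighted formula used above with the unweighted `mean(axis=1)` on the first five rows; the `sample` frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# ET and CC grades of the first five rows displayed above
sample = pd.DataFrame(
    {
        "ET": [14.5, 8.5, 9.5, 7.5, 14.5],
        "CC": [11.75, 11.5, 13.25, 6.0, 13.25],
    }
)

# Weighted mean, as in the notebook: one value per student
weighted = 0.7 * sample["ET"] + 0.3 * sample["CC"]

# axis=1 averages across the columns of each row: one value per student
unweighted = sample[["ET", "CC"]].mean(axis=1)

# axis=0 (the default) averages down each column: one value per grade
column_means = sample[["ET", "CC"]].mean(axis=0)

print(weighted)
print(unweighted)
print(column_means)
```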

What is the overall mean?#
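One way to answer this is to call `mean()` on the `FinalNote` column. The sketch below does so on the first five rows only (the `sample` frame is an assumption for illustration, so the value differs from that of the full dataset) and checks that the result equals the same weighting applied to the column means.

```python
import pandas as pd

# ET and CC grades of the first five rows displayed above
sample = pd.DataFrame(
    {
        "ET": [14.5, 8.5, 9.5, 7.5, 14.5],
        "CC": [11.75, 11.5, 13.25, 6.0, 13.25],
    }
)
sample["FinalNote"] = 0.7 * sample["ET"] + 0.3 * sample["CC"]

# Mean of the final grade over all students in the sample
overall = sample["FinalNote"].mean()

# The mean is linear, so weighting the column means gives the same number
from_column_means = 0.7 * sample["ET"].mean() + 0.3 * sample["CC"].mean()

print(overall)
print(from_column_means)
```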
