Introduction to the Pandas module#
Scope#
This notebook presents some key functions for working with tabular data using the pandas module (https://pandas.pydata.org/)
Many examples and extensive documentation for this module are available on the web:
http://pandas.pydata.org/pandas-docs/stable/10min.html
http://www.python-simple.com/python-pandas/panda-intro.php
# Setup
# %matplotlib notebook # uncomment for interactive plot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
Load data and create a dataframe from a csv file#
More explanation can be found here: https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/
df = pd.read_csv("./_DATA/Note_csv.csv", delimiter=";")
df
| | section | TD | name | ET | CC |
---|---|---|---|---|---|
0 | MM | A | ami | 14.50 | 11.75 |
1 | MM | A | joyce | 8.50 | 11.50 |
2 | MM | C | lola | 9.50 | 13.25 |
3 | MM | B | irma | 7.50 | 6.00 |
4 | IAI | D | florence | 14.50 | 13.25 |
... | ... | ... | ... | ... | ... |
90 | MM | A | james | 13.75 | 12.75 |
91 | IAI | D | richard | 15.25 | 7.00 |
92 | MM | A | caprice | 18.25 | 15.00 |
93 | IAI | D | al | 12.50 | 9.75 |
94 | MM | B | constance | 3.00 | 7.00 |
95 rows × 5 columns
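To see how `read_csv` handles a semicolon-separated file without needing the course data, here is a minimal sketch. The text in `csv_text` is a hypothetical three-row sample that only mimics the layout of `Note_csv.csv`; it is not the real file.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout of Note_csv.csv (not the real file)
csv_text = """section;TD;name;ET;CC
MM;A;ami;14.5;11.75
MM;B;irma;7.5;
IAI;D;fred;9.5;11.5
"""

# sep=";" is the same setting as delimiter=";" used above
sample = pd.read_csv(io.StringIO(csv_text), sep=";")

print(sample.dtypes)        # column types inferred by pandas
print(sample.isna().sum())  # the empty CC cell is read as a missing value (NaN)
```

Checking `isna().sum()` right after loading is a quick way to find out whether the file contains missing grades.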
Display the dataframe#
# replace missing values (NaN) with 0.0
df = df.fillna(0.0)
# return the beginning of the dataframe
df.head(10)
| | section | TD | name | ET | CC |
---|---|---|---|---|---|
0 | MM | A | ami | 14.50 | 11.75 |
1 | MM | A | joyce | 8.50 | 11.50 |
2 | MM | C | lola | 9.50 | 13.25 |
3 | MM | B | irma | 7.50 | 6.00 |
4 | IAI | D | florence | 14.50 | 13.25 |
5 | MM | B | vi | 11.00 | 7.50 |
6 | MM | B | brian | 14.00 | 16.25 |
7 | MM | B | antoinette | 14.50 | 17.00 |
8 | IAI | D | fred | 9.50 | 11.50 |
9 | IAI | D | gaston | 12.25 | 5.75 |
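Replacing missing values with 0.0 is a choice that affects every statistic computed afterwards. The sketch below uses a hypothetical series of three grades, one of them missing, to compare the mean before and after `fillna(0.0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical grades, one of them missing
grades = pd.Series([12.0, np.nan, 8.0], name="CC")

mean_skipping_nan = grades.mean()           # NaN is ignored: (12 + 8) / 2
mean_with_zero = grades.fillna(0.0).mean()  # missing counted as 0: (12 + 0 + 8) / 3

print(mean_skipping_nan)  # 10.0
print(mean_with_zero)
```

Whether a missing grade should count as zero or be left out depends on the grading rules; the notebook assumes the former.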
# return the end of the dataframe
df.tail(10)
| | section | TD | name | ET | CC |
---|---|---|---|---|---|
85 | MM | A | vin | 11.00 | 13.00 |
86 | MM | A | jeunesse | 12.00 | 10.50 |
87 | MM | A | victoire | 11.75 | 12.00 |
88 | MM | B | joseph | 8.00 | 10.00 |
89 | MM | A | fꭩx | 13.00 | 14.50 |
90 | MM | A | james | 13.75 | 12.75 |
91 | IAI | D | richard | 15.25 | 7.00 |
92 | MM | A | caprice | 18.25 | 15.00 |
93 | IAI | D | al | 12.50 | 9.75 |
94 | MM | B | constance | 3.00 | 7.00 |
Selecting data in a dataframe#
# get data from index 2
df.loc[2]
section MM
TD C
name lola
ET 9.5
CC 13.25
Name: 2, dtype: object
# get name from index 2
df.name[2]
'lola'
# Slicing also works
df.name[2:6]
2 lola
3 irma
4 florence
5 vi
Name: name, dtype: object
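Label-based and position-based slicing differ on whether the end point is included. A minimal sketch, using a hypothetical four-element series whose index starts at 2:

```python
import pandas as pd

# Hypothetical series whose index labels do not start at 0
names = pd.Series(["lola", "irma", "florence", "vi"], index=[2, 3, 4, 5], name="name")

by_label = names.loc[2:4]      # label-based: both end points are included
by_position = names.iloc[0:2]  # position-based: the end point is excluded

print(list(by_label))     # ['lola', 'irma', 'florence']
print(list(by_position))  # ['lola', 'irma']
```

Using `.loc` or `.iloc` explicitly makes it clear which of the two behaviours is intended.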
Get one column of the dataframe#
df.TD
0 A
1 A
2 C
3 B
4 D
..
90 A
91 D
92 A
93 D
94 B
Name: TD, Length: 95, dtype: object
Get the number of students in each group#
df.TD.value_counts()
B 25
A 24
C 23
D 23
Name: TD, dtype: int64
Get the proportion of students in each group#
df.TD.value_counts(normalize=True)
B 0.263158
A 0.252632
C 0.242105
D 0.242105
Name: TD, dtype: float64
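Note that `value_counts()` orders its result by decreasing frequency, not by group name. The sketch below, on a hypothetical six-student series, shows the difference and how `sort_index()` restores alphabetical order, which matters whenever labels or colors are supplied by hand.

```python
import pandas as pd

# Hypothetical group assignment for six students
td = pd.Series(["B", "A", "B", "C", "B", "A"], name="TD")

shares = td.value_counts(normalize=True)  # ordered by decreasing frequency
shares_sorted = shares.sort_index()       # ordered alphabetically by group name

print(list(shares.index))         # ['B', 'A', 'C']
print(list(shares_sorted.index))  # ['A', 'B', 'C']
```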
Display the proportion of students in each group#
Using the plot function of pandas:
The visualization options of pandas are described here: http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html
fig = plt.figure()
# value_counts() orders the groups by frequency (B, A, C, D), so sort by group
# name first; the slice labels are then taken from the index
df.TD.value_counts(normalize=True).sort_index().plot.pie(
    colors=["r", "g", "b", "y"], autopct="%.1f"
)
plt.show()
Using the plot function of matplotlib:
# sort by group name so that values, labels and explode offsets line up
shares = df.TD.value_counts(normalize=True).sort_index()
val = shares.values
labels = shares.index
explode = (0.5, 0, 0.2, 0)  # offsets for groups A, B, C, D
fig1, ax1 = plt.subplots()
ax1.pie(
    val, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True, startangle=90
)
ax1.axis("equal")  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Get the list of students with a grade higher than 14 on both ET and CC#
df[(df.ET > 14.0) & (df.CC > 14.0)]
| | section | TD | name | ET | CC |
---|---|---|---|---|---|
7 | MM | B | antoinette | 14.50 | 17.0 |
16 | MM | C | louis | 17.50 | 15.5 |
77 | MM | C | karl | 14.50 | 17.5 |
81 | MM | B | mari | 15.00 | 15.0 |
82 | IAI | C | rose | 17.50 | 15.0 |
92 | MM | A | caprice | 18.25 | 15.0 |
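The selection above relies on a boolean mask: each comparison yields a series of True/False values, and `&` combines them element by element (the parentheses are required because `&` binds more tightly than `>`). A minimal sketch on a hypothetical three-row frame, with `DataFrame.query` shown as an equivalent alternative:

```python
import pandas as pd

# Hypothetical three-student sample
sample = pd.DataFrame(
    {
        "name": ["ami", "lola", "louis"],
        "ET": [14.5, 9.5, 17.5],
        "CC": [11.75, 13.25, 15.5],
    }
)

mask = (sample.ET > 14.0) & (sample.CC > 14.0)  # element-wise "and"
selected = sample[mask]

# Same selection written as a query string
same_rows = sample.query("ET > 14.0 and CC > 14.0")

print(list(mask))              # [False, False, True]
print(list(selected["name"]))  # ['louis']
```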
Make calculations on data#
The mean ET grade over all students#
df.ET.mean()
10.810526315789474
The mean ET grade over students from group B#
df.ET[df.TD == "B"].mean()
9.72
Statistical description of the data by TD using the ‘groupby()’ function#
df.groupby(["TD"]).describe() # summary statistics (count, mean, std, quartiles) of each grade for each group
TD | ET count | ET mean | ET std | ET min | ET 25% | ET 50% | ET 75% | ET max | CC count | CC mean | CC std | CC min | CC 25% | CC 50% | CC 75% | CC max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 24.0 | 12.552083 | 2.953424 | 4.5 | 11.5625 | 12.875 | 14.125 | 18.25 | 24.0 | 12.531250 | 2.350199 | 7.50 | 11.4375 | 12.75 | 13.625 | 17.50 |
B | 25.0 | 9.720000 | 3.445498 | 3.0 | 7.5000 | 9.500 | 13.000 | 15.50 | 25.0 | 9.690000 | 3.906645 | 0.00 | 7.5000 | 10.25 | 11.750 | 17.00 |
C | 23.0 | 10.630435 | 4.295128 | 1.0 | 7.5000 | 9.500 | 14.500 | 17.50 | 23.0 | 11.913043 | 2.903376 | 6.50 | 9.8750 | 12.00 | 13.750 | 17.50 |
D | 23.0 | 10.358696 | 5.376097 | 0.0 | 6.7500 | 11.750 | 14.500 | 17.75 | 23.0 | 9.076087 | 3.244256 | 0.25 | 8.1250 | 10.00 | 11.125 | 13.25 |
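When only a few statistics are needed, `agg` can replace `describe()`. The sketch below is an illustration on a hypothetical four-row frame; the result has the same two-level column layout (grade, statistic) as the table above.

```python
import pandas as pd

# Hypothetical sample with two groups
sample = pd.DataFrame(
    {
        "TD": ["A", "A", "B", "B"],
        "ET": [10.0, 14.0, 8.0, 12.0],
        "CC": [12.0, 16.0, 9.0, 11.0],
    }
)

# One row per group, one (grade, statistic) column pair per requested statistic
summary = sample.groupby("TD")[["ET", "CC"]].agg(["mean", "max"])
print(summary)

# A single cell is addressed with the group label and a (grade, statistic) tuple
et_mean_a = summary.loc["A", ("ET", "mean")]
print(et_mean_a)  # 12.0
```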
Display the grades with a histogram#
# CC grades
fig = plt.figure()
# bin edges 0, 1, ..., 20 so that grades below 1 are not dropped
df.CC.plot.hist(alpha=0.5, bins=np.arange(0, 21))
plt.show()
# ET grades
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(0, 21))
plt.show()
# both grades on the same axes; DataFrame.plot creates its own figure,
# so no plt.figure() call is needed here
df.plot.hist(alpha=0.5, bins=np.arange(0, 21))
plt.show()
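The choice of bin edges determines which grades appear in a histogram: values outside the outermost edges are silently left out. A small sketch with `np.histogram`, which applies the same binning rule as the plotting functions, on a hypothetical set of grades assumed to lie on a 0 to 20 scale:

```python
import numpy as np

# Hypothetical grades, assumed to lie on a 0-20 scale
grades = np.array([0.0, 0.25, 3.0, 9.5, 18.25])

counts_narrow, _ = np.histogram(grades, bins=np.arange(1, 20))  # edges 1, 2, ..., 19
counts_full, _ = np.histogram(grades, bins=np.arange(0, 21))    # edges 0, 1, ..., 20

print(counts_narrow.sum())  # 3: the two grades below 1 are not counted
print(counts_full.sum())    # 5: every grade falls in a bin
```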
Let’s compute the mean of both grades#
We first need to add a new column to the dataframe#
df["FinalNote"] = 0.0 # add a column filled with 0.0
df
| | section | TD | name | ET | CC | FinalNote |
---|---|---|---|---|---|---|
0 | MM | A | ami | 14.50 | 11.75 | 0.0 |
1 | MM | A | joyce | 8.50 | 11.50 | 0.0 |
2 | MM | C | lola | 9.50 | 13.25 | 0.0 |
3 | MM | B | irma | 7.50 | 6.00 | 0.0 |
4 | IAI | D | florence | 14.50 | 13.25 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
90 | MM | A | james | 13.75 | 12.75 | 0.0 |
91 | IAI | D | richard | 15.25 | 7.00 | 0.0 |
92 | MM | A | caprice | 18.25 | 15.00 | 0.0 |
93 | IAI | D | al | 12.50 | 9.75 | 0.0 |
94 | MM | B | constance | 3.00 | 7.00 | 0.0 |
95 rows × 6 columns
df.head()
| | section | TD | name | ET | CC | FinalNote |
---|---|---|---|---|---|---|
0 | MM | A | ami | 14.5 | 11.75 | 0.0 |
1 | MM | A | joyce | 8.5 | 11.50 | 0.0 |
2 | MM | C | lola | 9.5 | 13.25 | 0.0 |
3 | MM | B | irma | 7.5 | 6.00 | 0.0 |
4 | IAI | D | florence | 14.5 | 13.25 | 0.0 |
Let’s compute the weighted mean#
# weighted mean of the two grades: 70% ET and 30% CC
df["FinalNote"] = 0.7 * df.ET + 0.3 * df.CC
df.head()
| | section | TD | name | ET | CC | FinalNote |
---|---|---|---|---|---|---|
0 | MM | A | ami | 14.5 | 11.75 | 13.675 |
1 | MM | A | joyce | 8.5 | 11.50 | 9.400 |
2 | MM | C | lola | 9.5 | 13.25 | 10.625 |
3 | MM | B | irma | 7.5 | 6.00 | 7.050 |
4 | IAI | D | florence | 14.5 | 13.25 | 14.125 |
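The same weighted mean can also be written with `np.average`, where `axis=1` asks for one result per row (across the grade columns) rather than one per column. This is only an alternative sketch; it reuses the first two rows shown above as sample data.

```python
import numpy as np
import pandas as pd

# First two rows of the table above, used as sample data
sample = pd.DataFrame({"ET": [14.5, 8.5], "CC": [11.75, 11.5]})

weights = [0.7, 0.3]  # 70% ET, 30% CC

# axis=1: average across the columns, giving one value per student
final = np.average(sample[["ET", "CC"]].to_numpy(), axis=1, weights=weights)

# Same result as the explicit formula
explicit = 0.7 * sample.ET + 0.3 * sample.CC

print(final)
print(np.allclose(final, explicit))  # True
```

The explicit formula is easier to read for two grades; `np.average` scales better when there are many weighted columns.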
fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(0, 21))  # bin edges 0, 1, ..., 20
plt.show()
What is the overall mean?#
df.FinalNote.mean()
10.80657894736842