Linear regression - Linear model#
Linear regression is a fundamental technique in data science used to model the relationship between a dependent variable and one or more independent variables. It is widely used for predictive analysis, trend forecasting, and determining the strength of predictors. By fitting a linear equation to observed data, linear regression helps in understanding how the dependent variable changes with variations in the independent variables, making it a powerful tool for both explanatory and predictive modeling.
Scope#
A finite number \(N\) of data points is available: find the best line going through these \(N\) points.
https://en.wikipedia.org/wiki/Linear_regression
# setup
%matplotlib ipympl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import ipywidgets as ipw
Introduction with a simple two-point line equation#
Let's analyse how to determine the equation of a linear function (\(y=ax+b\)) knowing two points of the line.
# the known points
y = np.array([-1, 2])
x = np.array([0.5, 3])
x, y
(array([0.5, 3. ]), array([-1, 2]))
fig = plt.figure()
plt.plot(x, y, "o")
plt.grid()
To determine the \(a\) and \(b\) coefficients, we have to solve the following system:

$$
\begin{cases}
y_1 = a x_1 + b \\
y_2 = a x_2 + b
\end{cases}
$$

Let's reorganize this system a bit:

$$
\begin{cases}
b + a x_1 = y_1 \\
b + a x_2 = y_2
\end{cases}
$$

It is now easy to write it as a linear system:

$$
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \end{bmatrix}
\begin{bmatrix} b \\ a \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
\qquad \text{i.e.} \qquad X \beta = Y
$$

Therefore the \(a\) and \(b\) coefficients can be determined by inverting the linear system:

$$
\beta = \begin{bmatrix} b \\ a \end{bmatrix} = X^{-1} Y
$$
Let’s do this with python#
# definition of the involved matrices
X = np.array([[1, x[0]], [1, x[1]]])
print("X=", X)
Y = np.array([y[0], y[1]]).T
print("Y=", Y)
X= [[1. 0.5]
[1. 3. ]]
Y= [-1 2]
# Inversion of the system
coef = np.dot(np.linalg.inv(X), Y)
a = coef[1]
b = coef[0]
print("a=", a)
print("b=", b)
a= 1.2000000000000002
b= -1.6
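As an aside, the same coefficients can be obtained without forming the explicit inverse, by solving the linear system directly; this is a minimal sketch using the X and Y defined above (np.linalg.solve is generally preferred to np.linalg.inv for numerical accuracy).
# solve X @ coef = Y directly, without computing the inverse of X
coef_solve = np.linalg.solve(X, Y)
print("b, a =", coef_solve)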
# draw the obtained line
N = 10
xmin, xmax = 0.0, 4
xi = np.linspace(xmin, xmax, N)
yi = a * xi + b
fig = plt.figure()
plt.plot(x, y, "o")
plt.plot(xi, yi, "r")
plt.grid()
What happens if we add a third point?#
# the known points
y = np.array([-1, 2, 0])
x = np.array([0.5, 3, 1.5])
x, y
(array([0.5, 3. , 1.5]), array([-1, 2, 0]))
To determine the \(a\) and \(b\) coefficients, we have to solve the following system:

$$
\begin{cases}
y_1 = a x_1 + b \\
y_2 = a x_2 + b \\
y_3 = a x_3 + b
\end{cases}
$$

This system is overdetermined: we have 2 unknowns (\(a\), \(b\)) and 3 equations.
The goal is now to find the \(a\) and \(b\) coefficients (i.e. to find the linear equation) that best fit the given data.
We can rewrite the equations in the form of the following linear system:

$$
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}
\begin{bmatrix} b \\ a \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
$$

The system can be expressed in a general form:

$$
X \beta = Y
$$

The matrix \(X\) is not square, and therefore not invertible. The method presented above is no longer applicable.
The trick is to multiply both sides of the system by the transpose of \(X\):

$$
X^t X \beta = X^t Y
$$

The term \(X^t X\) is now a square matrix.
Let's check this claim:#
# Construction of the X matrix
X = np.array([np.ones(x.size), x]).T
X
array([[1. , 0.5],
[1. , 3. ],
[1. , 1.5]])
# Computation of Xt X
np.dot(X.T, X)
array([[ 3. , 5. ],
[ 5. , 11.5]])
The obtained matrix is square, and it can be proved that it is symmetric and positive definite (as long as the \(x\) values are not all identical), and therefore invertible.
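As a quick sanity check (a small sketch, not part of the derivation), we can verify numerically that \(X^t X\) is symmetric and that its eigenvalues are strictly positive for these data points:
# X^t X should be symmetric with strictly positive eigenvalues (the x values are not all equal)
XtX = np.dot(X.T, X)
print("symmetric:", np.allclose(XtX, XtX.T))
print("eigenvalues:", np.linalg.eigvalsh(XtX))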
From the previous equation:

$$
X^t X \beta = X^t Y
$$

the \(\beta\) vector can be expressed as:

$$
\beta = (X^t X)^{-1} X^t Y
$$

This operator is the least squares estimator.
Let’s do this with python#
# Construction of the X matrix
X = np.array([np.ones(x.size), x]).T
# Construction of the Y matrix
Y = y.T
print("X=", X)
print("Y=", Y)
X= [[1. 0.5]
[1. 3. ]
[1. 1.5]]
Y= [-1 2 0]
# Computation of the least square estimator
beta = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, Y))
print("beta=", beta)
print("The fitted linear function is:")
print("y=", beta[1], "x +", beta[0])
beta= [-1.68421053 1.21052632]
The fitted linear function is:
y= 1.210526315789474 x + -1.6842105263157898
Now it's time to plot the line#
# draw the obtained line
N = 10
xmin, xmax = 0.0, 4
xi = np.linspace(xmin, xmax, N)
yi = beta[0] + beta[1] * xi
fig = plt.figure()
plt.plot(x, y, "o")
plt.plot(xi, yi, "r")
plt.grid()
With a larger data set?#
The data set: a synthetic data set with some noise#
N = 100
xmin, xmax = 0.0, 4
x = np.linspace(xmin, xmax, N) + np.random.rand(N)
y = (3.3 * x - 2) + 2 * np.random.randn(N)
fig = plt.figure()
plt.plot(x, y, "o", label="Data")
plt.legend()
plt.grid()
Determination of the coefficients by regression#
# Construction of the X matrix
X = np.array([np.ones(x.size), x]).T
# Construction of the Y matrix
Y = y.T
# Computation of the least square estimator
beta = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, Y))
print("The fitted linear function is:")
print("y=", beta[1], "x +", beta[0])
The fitted linear function is:
y= 3.238256797082874 x + -1.3589376336839862
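To quantify how well the fitted line explains the noisy data, we can also compute the coefficient of determination \(R^2\) (the same quantity is reused in the interactive example at the end of this notebook); a minimal sketch with the beta just obtained:
# goodness of fit: R^2 = 1 - SS_res / SS_tot
y_pred = beta[0] + beta[1] * x
SS_res = ((y - y_pred) ** 2).sum()
SS_tot = ((y - y.mean()) ** 2).sum()
print("R2 =", 1 - SS_res / SS_tot)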
Draw the obtained regression line#
N = 10
xmin, xmax = 0.0, 5
xi = np.linspace(xmin, xmax, N)
yi = beta[0] + beta[1] * xi
fig = plt.figure()
plt.plot(x, y, "o", label="Data")
plt.plot(xi, yi, "r", label="Regression")
plt.legend()
plt.grid()
Does python know how to do that in one line?#
Of course yes!#
Using the polyfit function of numpy:#
# compute the coefficients using the least squares method
beta = np.polyfit(x, y, 1)
print("beta=", beta)
# create the associated 1d polynomial function
fit = np.poly1d(beta)
beta= [ 3.2382568 -1.35893763]
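For reference, numpy's newer polynomial API can do the same fit in one call; a sketch assuming a reasonably recent numpy version (the coefficients are returned lowest degree first, i.e. (intercept, slope)):
# same fit with the numpy.polynomial API; convert() maps the coefficients back to the unscaled domain
from numpy.polynomial import Polynomial
p = Polynomial.fit(x, y, 1).convert()
print("coefficients =", p.coef)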
Draw the obtained regression line#
N = 10
xmin, xmax = 0.0, 5
xi = np.linspace(xmin, xmax, N)
yi_numpy = fit(xi)
fig = plt.figure()
plt.plot(x, y, "o", label="Data")
plt.plot(xi, yi, "+r", label="Regression")
plt.plot(xi, yi_numpy, "g", label="Regression Numpy")
plt.legend()
plt.grid()
Using the numpy linalg lstsq function:#
For this method, the \(X\) and \(Y\) matrices must be provided by the user.
beta, r, rank, s = np.linalg.lstsq(X, Y, rcond=None)
print("beta=", beta)
beta= [-1.35893763 3.2382568 ]
Lucky for us, all methods give the same results!
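We can check this agreement programmatically; note that lstsq returns the coefficients in the order (intercept, slope), whereas polyfit returns (slope, intercept):
# compare the lstsq estimate with the polyfit estimate (reversed to match the coefficient ordering)
print(np.allclose(beta, np.polyfit(x, y, 1)[::-1]))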
What is the effect of noise on the resulting regression?#
N = 100
xmin, xmax = 0.0, 4
def get_data(N, sigma):
    # synthetic data: N points on the line y = 3.3 x - 2, plus Gaussian noise of standard deviation sigma
    x = np.linspace(xmin, xmax, N)
    y = (3.3 * x - 2) + sigma * np.random.randn(N)
    return x, y
def compute_reg(x, y):
    # least squares fit of a degree-1 polynomial, keeping the residuals
    beta, residuals, rank, singular_values, rcond = np.polyfit(x, y, 1, full=True)
    fit = np.poly1d(beta)
    # coefficient of determination: R^2 = 1 - SS_res / SS_tot
    SStot = ((y - y.mean()) ** 2).sum()
    SSres = residuals[0]
    R2 = 1 - SSres / SStot
    return beta, fit([xmin, xmax]), R2
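For reference, the quantity returned by compute_reg as R2 is the usual coefficient of determination:

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
$$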
x, y = get_data(100, 0.0)
beta, fity, R2 = compute_reg(x, y)
fig = plt.figure()
(points_plot,) = plt.plot(x, y, "o", label="Data")
(line,) = plt.plot([xmin, xmax], fity, "-r", label="Regression")
plt.plot([xmin, xmax], fity, "-g", label="Non Noisy data")
plt.grid()
plt.legend()
@ipw.interact(N=(5, 100, 10), sigma=(0.0, 2.0, 0.1))
def update(N, sigma):
    x, y = get_data(N, sigma)
    beta, fity, R2 = compute_reg(x, y)
    points_plot.set_xdata(x)
    points_plot.set_ydata(y)
    line.set_ydata(fity)
    plt.title("$R^2$={:0.2f}".format(R2))
    fig.canvas.draw_idle()  # force the ipympl figure to redraw with the new data
Note
To go further, the reader can read the Linear regression - Advanced model notebook.