Simple Linear Regression¶
Importing the dataset¶
import pandas as pd
df = pd.read_csv('Salary_Data.csv')
df.head(10)
| | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343.0 |
| 1 | 1.3 | 46205.0 |
| 2 | 1.5 | 37731.0 |
| 3 | 2.0 | 43525.0 |
| 4 | 2.2 | 39891.0 |
| 5 | 2.9 | 56642.0 |
| 6 | 3.0 | 60150.0 |
| 7 | 3.2 | 54445.0 |
| 8 | 3.2 | 64445.0 |
| 9 | 3.7 | 57189.0 |
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
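The two slices above produce arrays of different shapes, which matters because scikit-learn estimators expect a 2D feature matrix and a 1D target. The sketch below illustrates this on a made-up three-row frame; `demo`, `X_demo`, `y_demo` and the values are hypothetical and are not taken from `Salary_Data.csv`.

```python
import pandas as pd

# Hypothetical three-row frame with the same two column names as the dataset
demo = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0],
    "Salary": [40000.0, 45000.0, 50000.0],
})

X_demo = demo.iloc[:, :-1].values  # every column except the last -> 2D feature matrix
y_demo = demo.iloc[:, -1].values   # last column only -> 1D target vector

print(X_demo.shape)  # (3, 1)
print(y_demo.shape)  # (3,)
```

Slicing with `:-1` keeps the column dimension even when only one feature remains, so no reshape is needed before fitting.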
Splitting the dataset into the Training set and Test set¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
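To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on synthetic data. The 30-sample array and the helper `split` are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 samples with a single feature each
X_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

def split(seed):
    """Split the demo data with one third held out, using the given seed."""
    return train_test_split(X_demo, y_demo, test_size=1/3, random_state=seed)

X_tr, X_te, y_tr, y_te = split(0)
X_tr_again, X_te_again, _, _ = split(0)

print(len(X_tr), len(X_te))                # sizes of the two partitions
print(np.array_equal(X_te, X_te_again))    # same seed -> same held-out rows
```

Fixing `random_state` makes the split reproducible, so the metrics reported later can be regenerated exactly.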
Training the Simple Linear Regression model on the Training set¶
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
LinearRegression()
Parameters
| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |
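Fitting a simple linear regression estimates a slope and an intercept for the line ŷ = intercept + slope · x. After `fit`, scikit-learn exposes them as `coef_` and `intercept_`. The sketch below uses hypothetical noiseless data on a known line (the names `X_demo`, `y_demo` and `model` are illustrative) to show that the fitted attributes recover that line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data lying exactly on y = 3x + 5
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 3.0 * X_demo.ravel() + 5.0

model = LinearRegression().fit(X_demo, y_demo)

slope = model.coef_[0]        # one coefficient per feature; here there is one feature
intercept = model.intercept_  # predicted value at x = 0

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On the notebook's own model the same attributes would be read as `regressor.coef_[0]` and `regressor.intercept_`.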
Predicting the Test set results¶
y_pred = regressor.predict(X_test)
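Before computing summary metrics, it can help to look at predictions next to the actual values. The following is a small sketch with made-up numbers; `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted values, for illustration only
y_true_demo = np.array([40000.0, 52000.0, 61000.0])
y_pred_demo = np.array([41500.0, 50000.0, 61800.0])

comparison = pd.DataFrame({
    "Actual": y_true_demo,
    "Predicted": y_pred_demo,
    "Residual": y_true_demo - y_pred_demo,  # positive -> the model under-predicted
})
print(comparison)
```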
Model Performance Evaluation¶
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
Mean Absolute Error (MAE): 3426.43
Mean Squared Error (MSE): 21026037.33
Root Mean Squared Error (RMSE): 4585.42
R² Score: 0.97
Interpretation of Results¶
1. Mean Absolute Error (MAE): 3426.43¶
On average, the model’s salary predictions on the test set are off by about 3,426 (in the currency units of the Salary column).
Each error counts in proportion to its size, so large errors receive no extra weight.
A lower MAE is better.
Interpretation: Averaged over the test set, predictions miss the actual salary by about 3,400; individual errors can be smaller or larger than this.
2. Mean Squared Error (MSE): 21,026,037.33¶
This measures the average squared difference between predicted and actual values.
Because it squares errors, large mistakes are penalized heavily.
Its value is in squared salary units, so it is hard to read directly, but it is the quantity that least-squares fitting minimizes on the training data.
Interpretation: The squared prediction errors average around 21 million, which corresponds to an RMSE (below) of about 4,585 in real salary terms.
3. Root Mean Squared Error (RMSE): 4,585.42¶
This is the square root of MSE, putting the error back in the same units as the target variable.
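It can be read roughly as the standard deviation of the residuals (prediction errors), provided the residuals average close to zero.
A lower RMSE means better model performance.
Interpretation: The model’s salary predictions typically differ from the true values by about 4,585. This is higher than the MAE of 3,426 because squaring gives the larger errors more weight.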
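The three error metrics above can be reproduced directly from the residuals. The sketch below does so with NumPy on hypothetical values (not the notebook's test set) and checks the results against `sklearn.metrics`; one deliberately large error shows why RMSE ends up above MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values; the last prediction is off by 7,000 on purpose
y_true_demo = np.array([40000.0, 52000.0, 61000.0, 75000.0])
y_pred_demo = np.array([41000.0, 51000.0, 60000.0, 82000.0])

errors = y_true_demo - y_pred_demo      # residuals: -1000, 1000, 1000, -7000

mae_manual = np.mean(np.abs(errors))    # mean of absolute residuals
mse_manual = np.mean(errors ** 2)       # mean of squared residuals
rmse_manual = np.sqrt(mse_manual)       # back in the units of the target

# The manual values agree with scikit-learn's implementations
assert np.isclose(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
assert np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))

print(f"MAE={mae_manual:.2f}  MSE={mse_manual:.2f}  RMSE={rmse_manual:.2f}")
```

Here the MAE is 2,500 while the RMSE is about 3,606: the single 7,000 error dominates the squared average.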
4. R² Score: 0.97¶
R² (coefficient of determination) indicates how much of the variance in the target variable is explained by the model.
Reference points:
1.0 → perfect prediction
0.0 → no better than always predicting the mean
Below 0 → worse than always predicting the mean (possible on held-out data)
Interpretation: An R² of 0.97 means the model explains about 97% of the variation in test-set salaries from years of experience alone, which is a very strong fit.
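R² is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below implements that definition in a hypothetical helper, `r2_manual`, and compares it with `r2_score` on made-up values. It also shows that on held-out data the score can drop below 0 when predictions are worse than the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from the definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and three sets of predictions
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
close_preds = np.array([1.1, 1.9, 3.2, 3.8])   # near the truth
mean_preds = np.full(4, y_true_demo.mean())    # always predict the mean
reversed_preds = y_true_demo[::-1].copy()      # systematically wrong

for name, preds in [("close", close_preds),
                    ("mean", mean_preds),
                    ("reversed", reversed_preds)]:
    # The manual value matches scikit-learn's r2_score
    assert np.isclose(r2_manual(y_true_demo, preds), r2_score(y_true_demo, preds))
    print(f"{name}: R2 = {r2_manual(y_true_demo, preds):.2f}")
```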
Overall Interpretation¶
The linear regression model performs very well:
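Errors are small relative to salary values: an RMSE of about 4,585 against salaries in the tens of thousands.
The model captures nearly all of the linear trend between experience and salary.
Only about 3% of the variance in test-set salaries is left unexplained.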
Visualising the Results¶
import seaborn as sns
import matplotlib.pyplot as plt

def plot_linear_regression(X, y, x_label='X', y_label='Y', title=None,
                           caption_text="© 2025 Thomas Uhuru"):
    """
    Plots a linear regression using Seaborn with separate X and y inputs.
    Works whether X/y are lists, Series, or 2D arrays.
    """
    # Flatten
    X = np.array(X).ravel()
    y = np.array(y).ravel()

    # Create DataFrame
    df = pd.DataFrame({x_label: X, y_label: y})

    # Plot
    plt.figure(figsize=(8, 5))
    sns.regplot(
        x=x_label,
        y=y_label,
        data=df,
        scatter_kws={'color': 'red', 's': 50, 'alpha': 0.8},
        line_kws={'color': 'blue', 'linewidth': 2}
    )
    plt.title(title or f'{y_label} vs {x_label}', fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)

    # Add caption
    plt.figtext(
        0.99, 0.01,
        caption_text,
        ha="right", va="bottom",
        fontsize=9, color="gray", style="italic"
    )
    plt.tight_layout()
    plt.show()
(a) The Training set results¶
plot_linear_regression(X_train, y_train,
x_label='Years of Experience',
y_label='Salary', title='Salary vs Experience (Training Data)')
(b) The Test set results¶
# Plot the held-out test points. Note that sns.regplot fits its own line to the data it is given.
plot_linear_regression(X_test, y_test,
                       x_label='Years of Experience',
                       y_label='Salary', title='Salary vs Experience (Test Data)')
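Because `sns.regplot` estimates its own regression line from whatever points it receives, the line it draws is not the one learned by the trained model. One way to see how an already-fitted model sits against a set of points is to draw its predictions directly with Matplotlib. The sketch below is one possible approach, not the notebook's method; `plot_model_on_points` and the synthetic data are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def plot_model_on_points(model, X, y, ax=None):
    """Scatter the given points and draw the already-fitted model's line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 5))
    X = np.asarray(X).reshape(-1, 1)
    ax.scatter(X.ravel(), y, color="red", s=50, alpha=0.8)
    # A straight line is fully determined by its values at the two ends of the x-range
    x_line = np.array([[X.min()], [X.max()]])
    ax.plot(x_line.ravel(), model.predict(x_line), color="blue", linewidth=2)
    return ax

# Synthetic data for illustration only (not the salary dataset)
rng = np.random.default_rng(0)
X_fit = rng.uniform(1, 10, size=(20, 1))
y_fit = 9000 * X_fit.ravel() + 26000 + rng.normal(0, 4000, size=20)
X_new = rng.uniform(1, 10, size=(10, 1))
y_new = 9000 * X_new.ravel() + 26000 + rng.normal(0, 4000, size=10)

model = LinearRegression().fit(X_fit, y_fit)   # fitted on one set of points...
ax = plot_model_on_points(model, X_new, y_new)  # ...drawn against another
ax.set_xlabel("Years of Experience")
ax.set_ylabel("Salary")
ax.set_title("Fitted line against held-out points")
```

With the notebook's variables, the equivalent call would be `plot_model_on_points(regressor, X_test, y_test)`.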
Predicting Salary from Years of Experience¶
- What is the estimated salary for a person who has 7 years of experience?
experience = 7
predicted_salary = regressor.predict(np.array([[experience]]))[0]
print(predicted_salary)
92237.78934588778
- What is the estimated salary for a person who has 4.5 years of experience?
experience = 4.5
predicted_salary = regressor.predict(np.array([[experience]]))[0]
print(predicted_salary)
68872.93323808187
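Several experience values can be scored in a single call, as long as the input is a 2D array with one row per query. The sketch below shows this on a hypothetical model fitted to made-up data and confirms that `predict` matches the line equation evaluated by hand; all names and numbers here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying on y = 10000x + 25000
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([35000.0, 45000.0, 55000.0, 65000.0])
model = LinearRegression().fit(X_demo, y_demo)

experience_values = np.array([4.5, 7.0])
X_query = experience_values.reshape(-1, 1)   # shape (2, 1): one row per query

batch_predictions = model.predict(X_query)

# Same result from the line equation: intercept + slope * x
manual_predictions = model.intercept_ + model.coef_[0] * experience_values

print(batch_predictions)
print(np.allclose(batch_predictions, manual_predictions))
```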