Multiple Linear Regression

Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset

In [3]:
df = pd.read_csv('50_Startups.csv')
df.head(10)
Out[3]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
5 131876.90 99814.71 362861.36 New York 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 145530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96
In [4]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
In [5]:
print(X[0:3])  # The last column (State) contains string values.
[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']]

Encoding categorical data

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
In [7]:
print(X[0:3])  # State is now one-hot encoded in the first three columns.
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]]

Splitting the dataset into the Training set and Test set

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Training the Multiple Linear Regression model on the Training set

In [9]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[9]:
LinearRegression()

Predicting the Test set results

In [10]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]

Model Performance Evaluation

In [11]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
Mean Absolute Error (MAE): 7514.29
Mean Squared Error (MSE): 83502864.03
Root Mean Squared Error (RMSE): 9137.99
R² Score: 0.93
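To make the four metrics concrete, they can be recomputed by hand from the ten prediction/actual pairs printed in the previous section. This is a sketch, not part of the original notebook: the values are copied from the rounded printout, so the results agree with the scikit-learn output only up to rounding.

```python
import numpy as np

# Predicted and actual profits copied from the printed test-set comparison
# (rounded to two decimals in the printout).
y_pred = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                   116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

residuals = y_test - y_pred

mae = np.mean(np.abs(residuals))          # average absolute error, in dollars
mse = np.mean(residuals ** 2)             # average squared error, in dollars squared
rmse = np.sqrt(mse)                       # back in dollars

ss_res = np.sum(residuals ** 2)                   # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)    # total variation around the mean
r2 = 1 - ss_res / ss_tot                          # share of variation explained

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.2f}")
```

MAE and RMSE are in dollars, so they can be read directly against profit: a typical prediction here is off by roughly $7,500 to $9,100.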

Interpret Model Coefficients

In [12]:
# After model training
feature_names = list(ct.get_feature_names_out())  # Get encoded + passthrough feature names

# Combine feature names with coefficients
coefficients = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": regressor.coef_
})

coefficients
Out[12]:
Feature Coefficient
0 encoder__x3_California 86.638369
1 encoder__x3_Florida -872.645791
2 encoder__x3_New York 786.007422
3 remainder__x0 0.773467
4 remainder__x1 0.032885
5 remainder__x2 0.036610
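
1. Location (Categorical Features)

These three rows (encoder__x3_...) are the dummy variables created from the "State" column. Because OneHotEncoder() is used here without its drop option, no state is dropped: all three dummies are kept alongside the intercept, so there is no baseline state. Only the differences between the state coefficients are meaningful. In this fit the three coefficients sum to approximately zero, so each one can be read as an offset from the average of the three state effects.

State (coefficient): meaning

  • California (+86.64): predicted profit is about $87 above the three-state average, with spending held fixed.
  • Florida (−872.65): predicted profit is about $873 below the three-state average.
  • New York (+786.01): predicted profit is about $786 above the three-state average.

For example, with identical spending, a New York firm is predicted to earn about $1,659 more than a Florida firm (786.01 − (−872.65) = 1,658.66).

So: location has some influence, but every state offset is under $1,000, well below the model's RMSE of about $9,138, and far weaker than the effect of R&D spending.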
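The role of the baseline can be checked directly. The sketch below uses synthetic data (all values are made up, and the variable names are hypothetical) to compare the full encoding used in this notebook with `OneHotEncoder(drop="first")`, which removes California and makes it the baseline. The coefficients themselves differ between the two encodings, but the differences between states and the predictions do not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic data: 60 hypothetical firms, 20 per state, one numeric feature.
states = np.repeat(["California", "Florida", "New York"], 20).reshape(-1, 1)
spend = rng.uniform(0, 100, size=(60, 1))
state_effect = {"California": 5.0, "Florida": -3.0, "New York": 10.0}
y = (
    50.0
    + 0.8 * spend[:, 0]
    + np.array([state_effect[s] for s in states[:, 0]])
    + rng.normal(0, 1, size=60)
)

# Full encoding (as in this notebook): three dummy columns plus an intercept.
full = OneHotEncoder().fit_transform(states).toarray()
X_full = np.hstack([full, spend])
model_full = LinearRegression().fit(X_full, y)

# drop="first": the California column is removed and becomes the baseline.
dropped = OneHotEncoder(drop="first").fit_transform(states).toarray()
X_drop = np.hstack([dropped, spend])
model_drop = LinearRegression().fit(X_drop, y)

ca, fl, ny = model_full.coef_[:3]
fl_vs_ca, ny_vs_ca = model_drop.coef_[:2]

print("full encoding:       ", ca, fl, ny, "sum:", ca + fl + ny)
print("baseline encoding:   ", fl_vs_ca, ny_vs_ca)
print("full-encoding diffs: ", fl - ca, ny - ca)
```

With the full encoding the three state coefficients sum to about zero; with `drop="first"` each remaining coefficient is the gap to California. Either choice gives the same fitted model, so the decision is about which reading is more convenient.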

2. Numerical Predictors (Continuous Variables)

Feature (coefficient): meaning

  • R&D Spend (x0, +0.773): each extra dollar spent on R&D raises predicted profit by about $0.77, holding the other features fixed. This is the largest per-dollar effect.
  • Administration (x1, +0.033): each extra dollar spent on Administration raises predicted profit by about $0.03 (small effect).
  • Marketing Spend (x2, +0.037): each extra dollar spent on Marketing raises predicted profit by about $0.04 (small effect).

R&D spending is the main driver of predicted profit; administrative and marketing spending have minimal direct impact in this model.
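Because the model is linear, the change in predicted profit is simply the coefficient multiplied by the change in the feature. A small worked example using the three slope coefficients from the table above; the $10,000 increase is a hypothetical figure chosen for illustration:

```python
# Slope coefficients reported in the coefficient table
# (dollars of predicted profit per dollar spent).
slopes = {
    "R&D Spend": 0.773467,
    "Administration": 0.032885,
    "Marketing Spend": 0.036610,
}

extra_spend = 10_000  # hypothetical increase in one budget line, in dollars

# Change in prediction = coefficient * change in feature, other features fixed.
changes = {feature: coef * extra_spend for feature, coef in slopes.items()}

for feature, change in changes.items():
    print(f"{feature}: +${extra_spend:,} spent -> predicted profit {change:+,.2f}")
```

The same $10,000 moves the prediction by about $7,735 when it goes to R&D, but only by about $329 or $366 when it goes to Administration or Marketing.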

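Insights

The model suggests that R&D spending has the largest positive association with profit: each additional dollar of R&D corresponds to about $0.77 of additional predicted profit. Marketing and Administration spending contribute only a few cents per dollar, and location shifts the prediction by less than $1,000 (New York slightly higher, Florida slightly lower). So, in this dataset, startups that invest more in R&D tend to make much higher profits in any of the three states. The model describes an association in a small sample, not a proven causal effect.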

Predict new data

What would be the predicted profit for a company located in California that spends $160,000 on R&D, $130,000 on Administration, and $400,000 on Marketing?

In [13]:
# Input data (same structure as training data)
new_data = np.array([[160000, 130000, 400000, 'California']], dtype=object)

# Encode and predict
new_data_encoded = ct.transform(new_data)
predicted_profit = regressor.predict(new_data_encoded)

# Display answer
print(f"Predicted Profit: ${predicted_profit[0]:,.2f}")
Predicted Profit: $185,227.93
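Under the hood, a prediction is the intercept plus the dot product of the coefficients with the encoded row. The sketch below illustrates this on synthetic data (all values are made up, and names such as `X_demo` and `ct_demo` are hypothetical), using the same object-array layout as above. Note that new rows go through `transform`, not `fit_transform`, so the encoder keeps the categories it learned during fitting.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)

# Synthetic stand-in for the training data: three spend columns and a state.
n = 30
spend = rng.uniform(50_000, 150_000, size=(n, 3))
state = np.tile(["California", "Florida", "New York"], n // 3)
X_demo = np.empty((n, 4), dtype=object)
X_demo[:, :3] = spend
X_demo[:, 3] = state
y_demo = 0.5 * spend[:, 0] + 0.1 * spend[:, 2] + rng.normal(0, 1_000, size=n)

# Fit the encoder and the regression on the synthetic training data.
ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_enc = np.asarray(ct_demo.fit_transform(X_demo), dtype=float)
model = LinearRegression().fit(X_enc, y_demo)

# Encode a new row with transform (not fit_transform), then predict two ways.
new_row = np.array([[100_000, 90_000, 120_000, "Florida"]], dtype=object)
new_enc = np.asarray(ct_demo.transform(new_row), dtype=float)

manual = model.intercept_ + new_enc[0] @ model.coef_
from_predict = model.predict(new_enc)[0]

print(f"manual:  {manual:,.2f}")
print(f"predict: {from_predict:,.2f}")
```

Both lines print the same number, which confirms that `predict` is applying the linear formula to the encoded row.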

Visualize Model Coefficients

In [14]:
import seaborn as sns
import matplotlib.pyplot as plt

# Sort by absolute coefficient size
coef_df = coefficients.reindex(coefficients.Coefficient.abs().sort_values(ascending=False).index)

plt.figure(figsize=(8, 5))

# Assign hue to same as y and disable legend to avoid duplication
sns.barplot(
    x="Coefficient",
    y="Feature",
    hue="Feature",           # Added hue for coloring
    data=coef_df,
    palette="coolwarm",
    legend=False             # Disable redundant legend
)

# Labels & title
plt.title("Feature Importance (Linear Regression Coefficients)", fontsize=14)
plt.xlabel("Coefficient Value", fontsize=12)
plt.ylabel("Feature", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.4)

# Copyright footer
plt.figtext(0.99, 0.01, "thomasuhuru.com", ha="right", fontsize=9, color="gray")

plt.tight_layout()
plt.show()
[Figure: horizontal bar chart titled "Feature Importance (Linear Regression Coefficients)"; x-axis "Coefficient Value", y-axis "Feature"]
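One caution when reading this chart: raw coefficients are in different units. The state dummies change by 0 or 1, while the spend columns vary over tens of thousands of dollars, so the state bars (hundreds of dollars) dwarf the spend bars (under 1) even though R&D spending drives the predictions. One common way to put features on a comparable footing is to multiply each coefficient by its feature's standard deviation. A minimal sketch on synthetic data (all values made up, names hypothetical), not a reanalysis of the notebook's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data: a 0/1 indicator and a dollar-scale feature.
n = 200
indicator = rng.integers(0, 2, size=n).astype(float)
dollars = rng.uniform(0, 150_000, size=n)
X_demo = np.column_stack([indicator, dollars])
y_demo = 1_000 * indicator + 0.8 * dollars + rng.normal(0, 500, size=n)

model = LinearRegression().fit(X_demo, y_demo)

# Raw coefficients: the indicator's is far larger in magnitude
# because its unit (0 -> 1) is not comparable to one dollar.
raw = model.coef_

# Scaled effect: change in predicted y for a one-standard-deviation
# change in each feature.
scaled = raw * X_demo.std(axis=0)

print("raw coefficients:", raw)
print("scaled effects:  ", scaled)
```

In this synthetic example the ranking flips once the coefficients are scaled: the dollar feature has the smaller raw coefficient but by far the larger effect on the prediction. Applying the same scaling to `regressor.coef_` with the standard deviations of `X_train` would give a chart that is easier to compare with the interpretation in the text.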