Intro to Machine Learning
Supervised Learning with scikit-learn
The target values are known in the training data -> predict the target values of unseen data, given the features
- Feature = predictor variable = independent variable
- Target variable = dependent variable = response variable
Classification
Predicting categories from labeled data (e.g., spam or not spam)
- Steps in Classification:
- Train a model using labeled data.
- Model learns patterns from the data.
- Pass unseen data to the model.
- Model predicts class labels.
K-Nearest Neighbors (KNN)
Predicts labels based on k closest neighbors (majority vote).
from sklearn.neighbors import KNeighborsClassifier
# Fit on labeled training data, then predict labels for unseen data
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
Tuning k (overfitting vs underfitting):
- Smaller k = more complex model = can lead to overfitting
- Larger k = less complex model = can cause underfitting
Model Evaluation
\(Accuracy = \frac{\text{Correct Predictions}}{\text{Total Predictions}}\)
Train/Test Split:
from sklearn.model_selection import train_test_split
# stratify=y keeps class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # Accuracy
Plot Model Complexity:
plt.plot(neighbors, train_accuracies.values(), label="Train Accuracy")
plt.plot(neighbors, test_accuracies.values(), label="Test Accuracy")
plt.legend()
plt.show()
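The plot above assumes neighbors, train_accuracies, and test_accuracies already exist; a minimal sketch of one way to build them (the range of k values is an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Record train and test accuracy for each candidate k
neighbors = np.arange(1, 26)
train_accuracies, test_accuracies = {}, {}
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracies[k] = knn.score(X_train, y_train)
    test_accuracies[k] = knn.score(X_test, y_test)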
Regression
Predicting continuous values from labeled data (e.g., house prices, blood glucose levels, stock prices)
Linear Regression
Equation: \(y = ax + b\)
- a (slope), b (intercept) are model parameters.
Loss Function: Minimize Residual Sum of Squares (RSS).
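For a dataset with \(n\) observations, the residual sum of squares is \(RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), where \(\hat{y}_i\) is the model's prediction for observation \(i\).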
Model Evaluation
- R-squared: Proportion of the target's variance explained by the features
- 1 = perfect fit, 0 = model explains none of the variance
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(reg.score(X_test, y_test)) # R-squared
Mean Squared Error (MSE): Average squared difference between predicted and actual values
Root Mean Squared Error (RMSE): Square root of MSE; the typical size of the model's prediction errors
- Same units as target variable
- Lower RMSE = better model
from sklearn.metrics import mean_squared_error
# squared=False returns the RMSE (newer scikit-learn versions also provide root_mean_squared_error)
print(mean_squared_error(y_test, y_pred, squared=False))  # RMSE
Cross-validation
Splitting data once may not reflect true model performance -> cross-validation gives a more reliable estimate and helps detect overfitting.
- Perform k-fold cross-validation:
- Split data into k equal parts.
- Train on (k-1) parts and test on 1 part.
- Repeat for all k parts.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle data; fix seed for reproducibility
scores = cross_val_score(reg, X, y, cv=kf)
print(scores.mean())  # Average performance
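cross_val_score runs the fold loop internally; a minimal sketch of the equivalent manual loop (assuming X and y are NumPy arrays; use .iloc indexing for DataFrames):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
# One fit/score per fold: train on k-1 parts, test on the held-out part
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])
    fold_scores.append(reg.score(X[test_idx], y[test_idx]))
print(np.mean(fold_scores))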
Regularized Regression
Standard linear regression can overfit when coefficients grow large; regularization adds a penalty on coefficient size to the loss function.
- Ridge Regression: L2 regularization. Penalizes the squared magnitude of coefficients, shrinking them toward zero to reduce overfitting.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)  # higher alpha = stronger shrinkage
ridge.fit(X_train, y_train)
- Lasso Regression: L1 regularization. Performs feature selection by shrinking some coefficients exactly to zero.
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
plt.bar(feature_names, lasso.coef_)  # feature_names: e.g., X.columns for a DataFrame
plt.show()
Fine-tuning your model
Class imbalance: uneven frequency of classes (e.g., 99% legitimate vs. 1% fraudulent transactions), which makes accuracy a misleading metric.
Confusion Matrix
| | Predicted: Positive (1) | Predicted: Negative (0) |
|---|---|---|
| Actual: Positive (1) | True Positive (TP) | False Negative (FN) |
| Actual: Negative (0) | False Positive (FP) | True Negative (TN) |
Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Correct predictions out of all predictions
- Best for balanced datasets
Precision = TP / (TP + FP)
- How many of the predicted positives are actually positive
- High precision means fewer false positives
- Useful when false positives are costly (e.g., spam filtering, fraud detection)
Recall = TP / (TP + FN)
- How many actual positives are correctly identified
- High recall means fewer false negatives
- Useful when false negatives are costly (e.g., cancer detection)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
- High F1 score means a good balance between precision and recall.
- Useful when both false positives and false negatives are costly
- Best for imbalanced datasets
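A quick worked example with hypothetical counts (TP = 30, FP = 10, FN = 20, TN = 40): \(Precision = \frac{30}{30 + 10} = 0.75\), \(Recall = \frac{30}{30 + 20} = 0.60\), \(F1 = \frac{2 \cdot 0.75 \cdot 0.60}{0.75 + 0.60} \approx 0.67\), while \(Accuracy = \frac{30 + 40}{100} = 0.70\).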
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Logistic Regression & ROC Curve
Logistic Regression outputs probabilities.
- If p > 0.5 → Predicted 1, otherwise 0.
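The ROC code below uses y_pred_probs; a minimal sketch of producing it with logistic regression (taking the predicted probability of the positive class):
from sklearn.linear_model import LogisticRegression
# predict_proba returns one column per class; column 1 is the positive class
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_probs = logreg.predict_proba(X_test)[:, 1]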
ROC Curve (Receiver Operating Characteristic):
- Plots True Positive Rate (Recall) vs. False Positive Rate.
- Helps find the best classification threshold.
Plot ROC Curve:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random-guessing baseline
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
Area under the ROC curve (ROC AUC): Measures model performance across all classification thresholds
- 1 = perfect model; 0.5 = random guessing
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))
Hyperparameter tuning
Hyperparameters are model settings chosen before training that affect performance (e.g., k in KNN, alpha in Ridge/Lasso)
GridSearchCV: Tests all combinations (best for small parameter sets)
import numpy as np
from sklearn.model_selection import GridSearchCV
# 10 evenly spaced alpha values between 0.0001 and 1
param_grid = {"alpha": np.linspace(0.0001, 1, 10)}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=5)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)
RandomizedSearchCV: Randomly selects parameter combinations (faster for large sets)
from sklearn.model_selection import RandomizedSearchCV
# n_iter = number of random parameter combinations to sample
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=5, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)
Preprocessing and Pipelines
Handling Categorical Features
Machine learning models require numeric data, but some features are text-based (e.g., “Male”, “Female”).
Convert categorical data using:
One-Hot Encoding (Dummy Variables): creates binary columns for each category (e.g., “Male” → [1,0], “Female” → [0,1])
pd.get_dummies(df['category'], drop_first=True)
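get_dummies on a single column returns only the dummy columns; a minimal sketch of swapping them into the original DataFrame ('category' is the placeholder column name from the example above):
import pandas as pd
# Drop the text column and append its dummy columns
dummies = pd.get_dummies(df['category'], drop_first=True)
df = pd.concat([df.drop(columns='category'), dummies], axis=1)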
Scikit-learn OneHotEncoder: Converts categories into binary indicator columns (one column per category)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()  # returns a sparse matrix by default
encoder.fit_transform(df[['category']])
Handling Missing Data
Drop missing values: Remove rows with missing data
df.dropna(subset=["column_name"])
Imputation: Fill missing values
- Numeric data: Replace with mean/median.
- Categorical data: Replace with most frequent value.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="mean") # Can also use "median" or "most_frequent"
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)
Scaling: Standardization & Normalization
Features on very different scales (e.g., income vs. age) can dominate distance and gradient computations, causing models to behave poorly. Models such as KNN, Logistic Regression, and Neural Networks benefit strongly from scaled data.
Standardization: Centers data at 0 with unit variance: \(z = \frac{x - \mu}{\sigma}\)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics (no leakage)
Normalization: Scales data to a fixed range, e.g., 0 to 1: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Building Pipelines
- Pipelines automate preprocessing & modeling.
- Prevents data leakage: transformations are fitted on training data only, then applied to test data.
# Example Pipeline: Imputation + Scaling + Model
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
# Example Pipeline: Scaling + KNN
from sklearn.pipeline import Pipeline
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
# Example Pipeline: GridSearchCV with Pipeline
import numpy as np
from sklearn.model_selection import GridSearchCV
# Parameter names use the pipeline step name plus a double underscore
param_grid = {"knn__n_neighbors": np.arange(1, 50)}
cv = GridSearchCV(pipeline, param_grid=param_grid)
cv.fit(X_train, y_train)
print(cv.best_params_, cv.best_score_)
Evaluating multiple models
Model Selection:
| Criteria | Example Models |
|---|---|
| Small Data | KNN, Logistic Regression |
| Fast Training | Decision Tree, Naive Bayes |
| High Accuracy | Random Forest, Neural Networks |
| Interpretable | Linear Regression, Decision Tree |
Guiding Principles:
- Size of the dataset
  - Fewer features = simpler model, faster training time
  - Some models require large amounts of data to perform well
- Interpretability
  - Some models are easier to explain, which can be important for stakeholders
  - Linear regression is highly interpretable, since its coefficients can be understood directly
- Flexibility
  - May improve accuracy by making fewer assumptions about the data
  - KNN is a more flexible model; it does not assume any linear relationships
Note on Metrics:
- Regression model performance: RMSE, R-squared
- Classification model performance: confusion matrix, accuracy, precision, recall, F1-score, ROC AUC
Note on Scaling:
- Best to scale our data before evaluating models
- Models affected by scaling: KNN, Linear Regression (plus Ridge, Lasso), Logistic Regression, Artificial Neural Network
# Evaluating multiple models
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier()
}
results = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # same folds for every model
for model in models.values():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(scores)
# Boxplot of model performance
import matplotlib.pyplot as plt
plt.boxplot(results, labels=models.keys())
plt.show()
# Evaluating on test data
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(f"{name} Test Accuracy: {model.score(X_test_scaled, y_test)}")