import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    precision_score,
    average_precision_score,
    recall_score,
    roc_auc_score,
    classification_report,
)
from sklearn.model_selection import train_test_split

sns.set(style='whitegrid')
np.random.seed(42)
def resample_with_prevalence(X, y, sample_size, prevalence):
""" Resamples the dataset X, y to have the desired prevalence of the positive class.
Parameters:
X (numpy.ndarray): The input features, shape (n_samples, n_features).
y (numpy.ndarray): The binary labels, shape (n_samples, ).
sample_size (int): The desired size of the resampled dataset.
prevalence (float): The desired prevalence of the positive class (0 <= prevalence <= 1).
Returns:
(numpy.ndarray, numpy.ndarray): Resampled X and y with the specified prevalence.
"""
# Ensure inputs are numpy arrays
= np.array(X)
X = np.array(y)
y
# Indices of positive and negative samples
= np.where(y == 1)[0]
pos_indices = np.where(y == 0)[0]
neg_indices
# Calculate number of positive and negative samples we need
= int(sample_size * prevalence)
num_pos_samples = sample_size - num_pos_samples
num_neg_samples
# Sanity check to avoid trying to sample more than available
if num_pos_samples > len(pos_indices) or num_neg_samples > len(neg_indices):
raise ValueError("Desired prevalence cannot be met with the available samples.")
# Sample from positive and negative indices
= np.random.choice(pos_indices, num_pos_samples, replace=False)
sample_pos_indices = np.random.choice(neg_indices, num_neg_samples, replace=False)
sample_neg_indices
# Combine and shuffle indices
= np.concatenate([sample_pos_indices, sample_neg_indices])
sampled_indices
np.random.shuffle(sampled_indices)
# Create the resampled dataset
= X[sampled_indices]
X_sampled = y[sampled_indices]
y_sampled
return X_sampled, y_sampled
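As a quick, illustrative check of this helper (not part of the analysis itself; the synthetic dataset below is only an assumption for demonstration), we can confirm that the resampled labels hit the requested prevalence:

# Illustrative usage: draw a 1,000-row sample with exactly 20% positives
# from a synthetic population, then confirm the achieved prevalence.
X_demo, y_demo = make_classification(n_samples=5000, weights=(0.7, 0.3), random_state=0)
X_small, y_small = resample_with_prevalence(X_demo, y_demo, sample_size=1000, prevalence=0.2)
print(y_small.mean())  # -> 0.2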
def prevalence_sensitivity(n_samples, class_sep, n_clusters_per_class, weights):
"""
Calculate metrics
"""
= 10, 5, 1, 0
n_features, n_informative, n_redundant, n_repeated
# Create a population to draw from
= make_classification(
X, y =n_samples,
n_samples=n_informative,
n_informative=n_redundant,
n_redundant=n_repeated,
n_repeated=weights,
weights=class_sep,
class_sep=n_clusters_per_class
n_clusters_per_class
)
= train_test_split(X, y, test_size=0.9, random_state=42)
X_train, X_test, y_train, y_test
= RandomForestClassifier(random_state=42)
model
model.fit(X_train, y_train)
= [], [], []
p, r, auc = np.arange(0.1, 0.9, 0.05)
prevalence_range
for prev in prevalence_range:
= resample_with_prevalence(X_test, y_test, 1000, prev)
X_testp, y_testp = model.predict(X_testp)
preds = model.predict_proba(X_testp)[:, 1]
probas
p.append(average_precision_score(y_testp, probas))
r.append(recall_score(y_testp, preds))
auc.append(roc_auc_score(y_testp, probas))
= pd.DataFrame({'prevalence': prevalence_range, 'average_precision': p, 'recall': r, 'auroc': auc})
results return results
# Default prevalence rate is 0.5
n_samples = 10000
class_sep = 1.0
n_clusters_per_class = 5
weights = (0.5, 0.5)  # (class 0, class 1) prevalence

results = prevalence_sensitivity(n_samples, class_sep, n_clusters_per_class, weights)

fig, ax = plt.subplots(figsize=(6, 4))

results.plot(
    x='prevalence',
    y=['average_precision', 'auroc'],
    lw=2,
    ax=ax
)
ax.set_xlabel('Positive Class Prevalence')
ax.set_ylabel('Metric Value');
Many of the machine learning problems I have worked on are disease or symptom detection problems: for example, predicting whether an individual has the flu given signals in their wearable data. What makes these problems particularly tricky (especially in infectious disease) is that the underlying prevalence of the disease often changes rapidly. This rapid change means you very quickly get a mismatch between your training and testing data. It also means you need to be very judicious about which model evaluation metrics you choose to track, because many metrics are insensitive to this form of distribution drift.
When analyzing this phenomenon, a helpful starting place is to look at Bayes’ Theorem:
\[P(+ | Test^{+}) = \frac{P(Test^{+} | +) P(+)}{P(Test^{+})}\]
In the above formulation, the posterior probability is the precision of the model, and the term \(P(+)\) (the prior, in Bayesian parlance) is the prevalence of the disease. From this basic analysis we would expect the precision of any model to increase as the underlying prevalence of the positive class increases.
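Expanding the denominator with the law of total probability makes the dependence on prevalence explicit; writing sensitivity for \(P(Test^{+} | +)\) and specificity for \(P(Test^{-} | -)\):

\[P(+ | Test^{+}) = \frac{P(Test^{+} | +) P(+)}{P(Test^{+} | +) P(+) + P(Test^{+} | -) P(-)} = \frac{\text{sens} \cdot P(+)}{\text{sens} \cdot P(+) + (1 - \text{spec}) (1 - P(+))}\]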
We can simulate this phenomenon and verify it holds empirically.
So our Bayes’ Law analysis clearly holds empirically. What’s interesting is that the model we constructed has a fairly high AUROC, and yet average precision drops dramatically as the underlying class prevalence falls. This really highlights the limitations of certain metrics: your incoming data can have an 80% reduction in the underlying prevalence of a condition, and AUROC and recall would remain relatively constant even though the model’s utility may be deteriorating dramatically.
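To put rough numbers on that last claim, consider a hypothetical classifier with a fixed 90% sensitivity and 90% specificity (values chosen purely for illustration). Sensitivity and specificity do not move with prevalence, yet plugging them into the expanded Bayes formula shows precision falling from 0.90 to 0.50 when prevalence drops from 50% to 10%:

def precision_from_rates(sensitivity, specificity, prevalence):
    # Bayes' rule with the denominator expanded:
    # P(+ | Test+) = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.5, 0.1):
    print(f"prevalence={prev:.2f}  precision={precision_from_rates(0.9, 0.9, prev):.2f}")
# prevalence=0.50  precision=0.90
# prevalence=0.10  precision=0.50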
Our model above was trained with a 50% class balance. Another interesting question we could ask is how the class balance at training time impacts performance at inference time when the underlying prevalence shifts.
results = []

for p in [0.1, 0.2, 0.3, 0.4, 0.5]:
    # Run the simulation ten times for each prevalence to average out the noise
    for i in range(10):
        r = prevalence_sensitivity(n_samples, class_sep, n_clusters_per_class, (1 - p, p))
        r['train_prevalence'] = p
        results.append(r)

results = pd.concat(results)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for m, ax in zip(['average_precision', 'auroc'], axes.flatten()):
    sns.lineplot(x='prevalence', y=m, hue='train_prevalence', data=results, ax=ax)
    ax.set_xlabel('Test Prevalence')
    ax.set_ylabel(m)
Based on this analysis, it appears that models trained across a variety of prevalence levels will have similar precision for a fixed inference-time prevalence, and all models will see an increase in precision as test-time prevalence increases. The same cannot be said of AUROC, which exhibits an almost “opposite” phenomenon: AUROC is largely invariant to test-time prevalence, but varies greatly across models trained with different levels of training-time prevalence.
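One quick way to see this in the numbers, sketched against the `results` DataFrame built above, is to compare how much each metric moves across the swept test prevalences within each training prevalence:

# For each training prevalence, the mean and spread of each metric
# across the swept test prevalences.
summary = (
    results
    .groupby('train_prevalence')[['average_precision', 'auroc']]
    .agg(['mean', 'std'])
)
print(summary)
# Expect a large std for average_precision (it tracks test prevalence) and a
# small std for auroc, whose mean instead shifts with train_prevalence.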
So what’s the moral of this story? Nothing earth-shattering, just a reemphasis of the fundamentals: get clear on what you are optimizing for, and judiciously select metrics that let you isolate the aspects of performance tied to that goal.