Your First Step with Auto-Sklong
Let's walk through your first use of Auto-Sklong, the AutoML system for longitudinal data.
This guide helps you set up a basic AutoML run using the GamaLongitudinalClassifier from the GAMA library,
which is designed to handle temporal dependencies in longitudinal datasets.
Have Read Beforehand:
Tech Prerequisites:
- Install dependencies: pip install scikit-longitudinal
- Ensure your data is in CSV format, with columns representing features across time waves plus a target column. See the Sklong wide/long format guide for more details.
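If your raw data is in long format (one row per subject and wave), it needs to be pivoted to the wide layout described above before loading. The following is a minimal, dependency-free sketch of that pivot, not the library's own converter. The column names `id` and `wave`, the sample rows, and the helper `long_to_wide` are all assumptions made for illustration; the `_w<N>` suffix simply mirrors the naming used in this tutorial's dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format input: one row per (subject, wave).
long_csv = """id,wave,smoke,cholesterol
1,1,0,1
1,2,1,1
2,1,0,0
2,2,0,1
"""


def long_to_wide(rows, id_col="id", wave_col="wave"):
    """Pivot long-format rows into one dict per subject.

    Each feature value is stored under a `<feature>_w<wave>` key.
    """
    wide = defaultdict(dict)
    for row in rows:
        subject = row[id_col]
        for name, value in row.items():
            if name in (id_col, wave_col):
                continue  # identifiers are not features
            wide[subject][f"{name}_w{row[wave_col]}"] = value
    return dict(wide)


wide_rows = long_to_wide(csv.DictReader(io.StringIO(long_csv)))
print(wide_rows["1"])
```

Values stay as strings here because `csv.DictReader` does not infer types; in practice a pandas-based pivot would likely be more convenient for real datasets.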
🔍 Auto-Sklong: Explore Your First AutoML Run
Dataset Used in Tutorials
The tutorials use a synthetic dataset mimicking health-related longitudinal data. It's generated for illustrative purposes and does not represent real-world data.
Dataset Generation Code
import pandas as pd
import numpy as np
n_rows = 500
columns = [
    'age', 'gender',
    'smoke_w1', 'smoke_w2',
    'cholesterol_w1', 'cholesterol_w2',
    'blood_pressure_w1', 'blood_pressure_w2',
    'diabetes_w1', 'diabetes_w2',
    'exercise_w1', 'exercise_w2',
    'obesity_w1', 'obesity_w2',
    'stroke_w2'
]
data = []
for i in range(n_rows):
    row = {}
    row['age'] = np.random.randint(40, 71)
    row['gender'] = np.random.choice([0, 1])
    for feature in ['smoke', 'cholesterol', 'blood_pressure', 'diabetes', 'exercise', 'obesity']:
        # Wave 1 is positive 30% of the time; wave 2 tends to repeat wave 1
        w1 = np.random.choice([0, 1], p=[0.7, 0.3])
        if w1 == 1:
            w2 = np.random.choice([0, 1], p=[0.2, 0.8])
        else:
            w2 = np.random.choice([0, 1], p=[0.9, 0.1])
        row[f'{feature}_w1'] = w1
        row[f'{feature}_w2'] = w2
    # Stroke is more likely when any of the three risk factors is present at wave 2
    if row['smoke_w2'] == 1 or row['cholesterol_w2'] == 1 or row['blood_pressure_w2'] == 1:
        p_stroke = 0.2
    else:
        p_stroke = 0.05
    row['stroke_w2'] = np.random.choice([0, 1], p=[1 - p_stroke, p_stroke])
    data.append(row)
# Create DataFrame, enforcing the column order declared above
df = pd.DataFrame(data, columns=columns)
# Save to a new CSV file
csv_file = './extended_stroke_longitudinal.csv'
df.to_csv(csv_file, index=False)
print(f"Extended CSV file '{csv_file}' created successfully.")
The dataset looks like:
| age | gender | smoke_w1 | smoke_w2 | cholesterol_w1 | cholesterol_w2 | blood_pressure_w1 | blood_pressure_w2 | diabetes_w1 | diabetes_w2 | exercise_w1 | exercise_w2 | obesity_w1 | obesity_w2 | stroke_w2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 66 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 59 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 63 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 47 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 44 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 69 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 49 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
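The probabilities hard-coded in the generation code determine how imbalanced the target is, which matters when choosing a scoring metric. As a rough worked calculation (assuming only the probabilities shown in the generation code, and that the three risk factors are drawn independently, as they are there), the expected share of positive `stroke_w2` labels comes out at about 15%:

```python
# Probabilities copied from the dataset generation code.
p_w1 = 0.3           # P(feature = 1 at wave 1)
p_w2_given_1 = 0.8   # P(feature = 1 at wave 2 | it was 1 at wave 1)
p_w2_given_0 = 0.1   # P(feature = 1 at wave 2 | it was 0 at wave 1)

# Marginal probability that a single feature equals 1 at wave 2.
p_w2 = p_w1 * p_w2_given_1 + (1 - p_w1) * p_w2_given_0

# Smoking, cholesterol and blood pressure are drawn independently,
# so P(at least one is 1 at wave 2) = 1 - P(all three are 0).
p_any_risk = 1 - (1 - p_w2) ** 3

# Stroke probability is 0.2 with any risk factor present, 0.05 otherwise.
p_stroke = 0.2 * p_any_risk + 0.05 * (1 - p_any_risk)

print(f"P(feature_w2 = 1) = {p_w2:.3f}")      # 0.310
print(f"P(stroke_w2 = 1)  = {p_stroke:.3f}")  # 0.151
```

With roughly one positive in seven rows, plain accuracy can look good for a model that rarely predicts a stroke, which is one reason a ranking metric such as ROC-AUC is a reasonable default for this tutorial.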
Step 1: Load and Prepare Data
Using the synthetic extended_stroke_longitudinal.csv from the dataset generation code above:
from scikit_longitudinal.data_preparation import LongitudinalDataset
dataset = LongitudinalDataset('./extended_stroke_longitudinal.csv')
dataset.load_data_target_train_test_split(target_column='stroke_w2', test_size=0.2, random_state=42)
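# Each inner list holds the column indices of one longitudinal feature across waves,
# e.g. [2, 3] is smoke_w1 and smoke_w2. A preset such as `elsa` can be used instead; read more in
# https://scikit-longitudinal.readthedocs.io/latest/tutorials/temporal_dependency/#pre-set-features_group-and-non_longitudinal_features
dataset.setup_features_group([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]])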
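Writing the index lists by hand gets error-prone as the number of features grows. The sketch below derives them from column names instead. It is a hypothetical helper, not part of scikit-longitudinal or Auto-Sklong, and it assumes that wave columns follow the `<feature>_w<N>` naming used in this tutorial and that the target column has already been excluded from the name list.

```python
import re
from collections import defaultdict


def derive_features_group(feature_names, wave_pattern=r"^(?P<base>.+)_w(?P<wave>\d+)$"):
    """Group column indices by base feature name, ordered by wave number.

    Returns (features_group, non_longitudinal_indices).
    """
    groups = defaultdict(list)
    non_longitudinal = []
    for index, name in enumerate(feature_names):
        match = re.match(wave_pattern, name)
        if match:
            groups[match["base"]].append((int(match["wave"]), index))
        else:
            non_longitudinal.append(index)
    # Sort each group by wave so the earliest wave comes first.
    features_group = [[idx for _, idx in sorted(waves)] for waves in groups.values()]
    return features_group, non_longitudinal


# Feature columns of the tutorial dataset, target (stroke_w2) excluded.
feature_names = [
    "age", "gender",
    "smoke_w1", "smoke_w2",
    "cholesterol_w1", "cholesterol_w2",
    "blood_pressure_w1", "blood_pressure_w2",
    "diabetes_w1", "diabetes_w2",
    "exercise_w1", "exercise_w2",
    "obesity_w1", "obesity_w2",
]

features_group, non_longitudinal = derive_features_group(feature_names)
print(features_group)    # [[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]]
print(non_longitudinal)  # [0, 1]
```

The resulting list matches the one passed to `setup_features_group` above, so it could be supplied in its place.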
Step 2: Initialise and Fit the AutoML System
Use GamaLongitudinalClassifier to automate pipeline search, prioritizing temporal-aware models:
from gama.GamaLongitudinalClassifier import GamaLongitudinalClassifier
from gama.search_methods.bayesian_optimisation import BayesianOptimisation
automl = GamaLongitudinalClassifier(
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    feature_list_names=dataset.data.columns.tolist(),
    max_total_time=60,  # In seconds. Short run for the tutorial; increase for real use
    scoring='roc_auc',  # Metric to optimise; can be changed to another scoring metric
    search=BayesianOptimisation(),  # Other options exist; explore the API reference for details
    random_state=42  # For reproducibility
)
automl.fit(dataset.X_train, dataset.y_train)
Step 3: Predict and Evaluate
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
# Hard class predictions and per-class probabilities on the held-out test set
y_pred = automl.predict(dataset.X_test)
y_prob = automl.predict_proba(dataset.X_test)
print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(dataset.y_test, y_prob[:, 1]):.4f}")
print("Classification Report:")
print(classification_report(dataset.y_test, y_pred))
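The ROC-AUC call above is fed `y_prob[:, 1]`, the predicted probability of the positive class, because ROC-AUC scores how well the model ranks positives above negatives rather than how many labels it gets right. The sketch below illustrates that definition with a plain pairwise count. It is an illustration on made-up scores, not how scikit-learn computes the metric internally, and `pairwise_roc_auc` is a hypothetical name.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and positive-class probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_roc_auc(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.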
Step 4: Export and Reuse the Best Pipeline
automl.export_script('best_autosklong_pipeline.py')
print("Best pipeline exported! Load it for production use.")
This covers basic Auto-Sklong usage. Experiment with longer search times or different scoring metrics for better results. To understand what happens under the hood, we recommend reading the Search Space Guide and the paper; see Publications for more.