Your First Step with Auto-Sklong

Let's walk you through your first use of Auto-Sklong, the AutoML system for longitudinal data. This guide helps you set up a basic AutoML run using the GamaLongitudinalClassifier from the GAMA library, which is designed to handle temporal dependencies in longitudinal datasets.

Have Read Beforehand:

Tech Prerequisites:

Install dependencies: pip install scikit-longitudinal
Ensure your data is in CSV format with columns representing features across time waves and a target column. Please look at the Sklong wide/long format guide for more details.

🔍 Auto-Sklong: Explore Your First AutoML Run

Dataset Used in Tutorials

The tutorials use a synthetic dataset mimicking health-related longitudinal data. It's generated for illustrative purposes and does not represent real-world data.

Dataset Generation Code

import pandas as pd
import numpy as np

n_rows = 500

columns = [
    'age', 'gender',
    'smoke_w1', 'smoke_w2',
    'cholesterol_w1', 'cholesterol_w2',
    'blood_pressure_w1', 'blood_pressure_w2',
    'diabetes_w1', 'diabetes_w2',
    'exercise_w1', 'exercise_w2',
    'obesity_w1', 'obesity_w2',
    'stroke_w2'
]

data = []

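for i in range(n_rows):
    row = {}
    row['age'] = np.random.randint(40, 71)  # Age in years, 40 to 70 inclusive
    row['gender'] = np.random.choice([0, 1])  # Binary-coded gender

    # Each longitudinal feature is binary and measured at two waves (w1, w2).
    for feature in ['smoke', 'cholesterol', 'blood_pressure', 'diabetes', 'exercise', 'obesity']:
        w1 = np.random.choice([0, 1], p=[0.7, 0.3])  # 30% positive at wave 1
        if w1 == 1:
            w2 = np.random.choice([0, 1], p=[0.2, 0.8])  # Usually stays positive at wave 2
        else:
            w2 = np.random.choice([0, 1], p=[0.9, 0.1])  # Rarely turns positive at wave 2
        row[f'{feature}_w1'] = w1
        row[f'{feature}_w2'] = w2

    # Stroke is more likely when any of three wave-2 risk factors is present.
    if row['smoke_w2'] == 1 or row['cholesterol_w2'] == 1 or row['blood_pressure_w2'] == 1:
        p_stroke = 0.2  # Elevated risk
    else:
        p_stroke = 0.05  # Baseline risk
    row['stroke_w2'] = np.random.choice([0, 1], p=[1 - p_stroke, p_stroke])

    data.append(row)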

# Create DataFrame
df = pd.DataFrame(data, columns=columns)

# Save to a new CSV file
csv_file = './extended_stroke_longitudinal.csv'
df.to_csv(csv_file, index=False)
print(f"Extended CSV file '{csv_file}' created successfully.")

The dataset looks like:

age	gender	smoke_w2	cholesterol_w1	cholesterol_w2	blood_pressure_w1	blood_pressure_w2	diabetes_w1	diabetes_w2	exercise_w1	exercise_w2	obesity_w1	obesity_w2	stroke_w2
66	0	1	0	0	0	0	1	1	0	1	0	0	0
59	0	0	1	1	0	0	1	1	1	1	1	1	1
63	0	0	1	1	0	0	0	0	0	0	0	0	1
47	0	0	1	1	0	0	0	0	0	0	1	0	0
44	0	0	1	1	1	1	0	0	0	0	1	1	1
69	1	0	0	0	1	1	0	0	0	0	0	0	0
63	0	0	0	0	0	0	0	0	0	0	0	0	0
48	1	0	0	0	0	0	0	0	0	1	0	0	0
49	1	0	0	0	0	0	0	0	0	1	0	1	0
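The class balance of the target follows directly from the probabilities in the generation code. The following is a minimal sketch of that arithmetic, assuming only the constants that appear in the code above (the variable names are illustrative, not part of Auto-Sklong):

```python
# Probabilities copied from the dataset generation code.
P_W1 = 0.3                # P(feature_w1 == 1)
P_W2_GIVEN_W1 = 0.8       # P(feature_w2 == 1 | feature_w1 == 1)
P_W2_GIVEN_NOT_W1 = 0.1   # P(feature_w2 == 1 | feature_w1 == 0)
P_STROKE_AT_RISK = 0.2    # P(stroke_w2 == 1 | any risk factor present at wave 2)
P_STROKE_BASELINE = 0.05  # P(stroke_w2 == 1 | no risk factor present at wave 2)
N_RISK_FEATURES = 3       # smoke, cholesterol, blood_pressure

# Marginal probability that a single feature is 1 at wave 2.
p_w2 = P_W1 * P_W2_GIVEN_W1 + (1 - P_W1) * P_W2_GIVEN_NOT_W1

# The features are drawn independently, so the chance that at least one of
# the three risk factors is 1 at wave 2 is the complement of "all are 0".
p_any_risk = 1 - (1 - p_w2) ** N_RISK_FEATURES

# Law of total probability over the two risk branches.
p_stroke = p_any_risk * P_STROKE_AT_RISK + (1 - p_any_risk) * P_STROKE_BASELINE

print(f"P(feature_w2 = 1)    = {p_w2:.3f}")        # 0.310
print(f"P(any risk factor)   = {p_any_risk:.3f}")  # 0.671
print(f"P(stroke_w2 = 1)     = {p_stroke:.3f}")    # 0.151
```

Roughly 15% of rows are therefore expected to be positive, so the dataset is imbalanced; the exact proportion in any generated file varies because the generation code does not fix a random seed.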

Step 1: Load and Prepare Data

Using the synthetic extended_stroke_longitudinal.csv from the dataset generation code above:

from scikit_longitudinal.data_preparation import LongitudinalDataset

dataset = LongitudinalDataset('./extended_stroke_longitudinal.csv')
dataset.load_data_target_train_test_split(target_column='stroke_w2', test_size=0.2, random_state=42)
dataset.setup_features_group([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]]) # Or use a preset, such as `elsa`. Read more in https://scikit-longitudinal.readthedocs.io/latest/tutorials/temporal_dependency/#pre-set-features_group-and-non_longitudinal_features
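The index pairs passed to setup_features_group above are hard-coded. As a sketch of how they relate to the column names, the hypothetical helper below (infer_features_group is not part of Scikit-longitudinal) groups columns by their `_w<N>` suffix. It assumes the indices refer to column positions once the target column is set aside, which matches the hard-coded list because the target is the last column here.

```python
import re
from collections import defaultdict


def infer_features_group(columns, target, wave_pattern=r"^(?P<name>.+)_w(?P<wave>\d+)$"):
    """Hypothetical helper: group column indices by base feature name.

    Returns (features_group, non_longitudinal), where each inner list of
    features_group holds the indices of one feature ordered by wave number.
    """
    feature_columns = [c for c in columns if c != target]
    waves = defaultdict(list)
    non_longitudinal = []
    for index, column in enumerate(feature_columns):
        match = re.match(wave_pattern, column)
        if match:
            waves[match["name"]].append((int(match["wave"]), index))
        else:
            non_longitudinal.append(index)
    # Sort each feature's (wave, index) pairs so earlier waves come first.
    features_group = [[index for _, index in sorted(pairs)] for pairs in waves.values()]
    return features_group, non_longitudinal


columns = [
    'age', 'gender',
    'smoke_w1', 'smoke_w2',
    'cholesterol_w1', 'cholesterol_w2',
    'blood_pressure_w1', 'blood_pressure_w2',
    'diabetes_w1', 'diabetes_w2',
    'exercise_w1', 'exercise_w2',
    'obesity_w1', 'obesity_w2',
    'stroke_w2',
]

features_group, non_longitudinal = infer_features_group(columns, target='stroke_w2')
print(features_group)    # [[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]]
print(non_longitudinal)  # [0, 1]
```

The output reproduces the list used in Step 1, with age and gender left over as non-longitudinal features. For real datasets, check the result against the Sklong documentation on feature groups before relying on it.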

Step 2: Initialise and Fit the AutoML System

Use GamaLongitudinalClassifier to automate pipeline search, prioritizing temporal-aware models:

from gama.GamaLongitudinalClassifier import GamaLongitudinalClassifier
from gama.search_methods.bayesian_optimisation import BayesianOptimisation

automl = GamaLongitudinalClassifier(
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    feature_list_names=dataset.data.columns.tolist(),
    max_total_time=60,  # (in seconds) Short run for tutorial; increase for real use
    scoring='roc_auc',  # Metric to optimise; can be changed
    search=BayesianOptimisation(), # Other options exist, explore the API reference for more details
    random_state=42 # For reproducibility
)

automl.fit(dataset.X_train, dataset.y_train)

Step 3: Predict and Evaluate

from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

y_pred = automl.predict(dataset.X_test)
y_prob = automl.predict_proba(dataset.X_test)

print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(dataset.y_test, y_prob[:, 1]):.4f}")

print("Classification Report:")
print(classification_report(dataset.y_test, y_pred))
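Because the generation code assigns a stroke with probability 0.2 or 0.05, positives are a minority, and accuracy alone can flatter a model that never predicts a stroke. The sketch below illustrates this with made-up labels (15 positives out of 100, chosen for illustration and not taken from any actual run), computing the metrics by hand so the arithmetic stays visible:

```python
# Hypothetical labels for illustration only: 15 positives out of 100.
y_true = [1] * 15 + [0] * 85
# Trivial baseline: always predict the majority class ("no stroke").
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the positive class: fraction of true strokes that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_positive = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")                # 0.85
print(f"Recall (stroke): {recall_positive:.2f}")  # 0.00
```

This is why the step above reports ROC-AUC and the per-class classification report next to accuracy: the positive-class recall row exposes a model that ignores the minority class.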

Step 4: Export and Reuse the Best Pipeline

automl.export_script('best_autosklong_pipeline.py')
print("Best pipeline exported! Load it for production use.")

This covers basic Auto-Sklong usage. Experiment with longer search times or different scoring metrics for better results. To understand what happens under the hood, we recommend reading the Search Space Guide and the paper; see Publications for details.
