Embed Machine Learning Projects for First‑Year Students

The Applied Statistics and Machine Learning course gives students practical experience with modern AI tools.
Photo by Sharad Bhat on Pexels

Embedding machine learning projects in a first-year statistics class is as simple as adding a ready-made Jupyter notebook that pulls data from Kaggle, runs a complete pipeline, and grades itself automatically.

Did you know that 70% of beginners drop out because they cannot turn theory into a finishable project? Let us show you the exact notebook you can import in 3 minutes.

Machine Learning in an Undergraduate Stats Class

When I first introduced an end-to-end ML pipeline to my introductory statistics students, I used a single notebook that walked them through data collection, feature engineering, model training, and validation. The notebook starts with a short data-scraping script that pulls a public CSV from a Kaggle competition. I then ask students to write a one-line function that creates lag features, reinforcing the idea that feature engineering is a repeatable step.
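
A minimal version of that lag helper, essentially one line of pandas wrapped in a function, might look like the sketch below; the 'sales' column name is purely illustrative.

import pandas as pd

def add_lag_feature(df: pd.DataFrame, col: str, lag: int = 1) -> pd.DataFrame:
    # Shift the column down by `lag` rows so earlier observations become predictors
    df[f'{col}_lag{lag}'] = df[col].shift(lag)
    return df

# Example usage on a hypothetical 'sales' column:
# df = add_lag_feature(df, 'sales', lag=7)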

To keep results reproducible, I lock the random seed at the top of the notebook and show them how to commit the notebook to a Git repository. By cloning the repo on the campus GPU cluster, every student gets the same seed, the same library versions, and therefore identical baseline scores. This tiny habit teaches version control early and eliminates the "my code works on my machine" excuse.
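
The seed lock at the top of the notebook is only a handful of lines; a minimal sketch covering Python, NumPy, and scikit-learn's random_state convention is:

import random
import numpy as np

SEED = 42               # single constant reused everywhere in the notebook
random.seed(SEED)       # Python's built-in RNG
np.random.seed(SEED)    # NumPy, and therefore pandas sampling and sklearn shuffles

# Pass the same constant to every estimator and split, for example:
# train_test_split(X, y, test_size=0.2, random_state=SEED)
# RandomForestClassifier(random_state=SEED)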

Throughout the four-hour lab, I weave hypothesis testing into the workflow. After fitting a logistic regression, students run a chi-square test on the confusion matrix to see if their model’s accuracy is statistically better than a random guess. This reinforces statistical rigor while they get hands-on experience with scikit-learn, pandas, and seaborn.
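
That test is a couple of lines of SciPy on top of the confusion matrix; a sketch, assuming y_test and y_pred come from the fitted logistic regression above, is:

from scipy.stats import chi2_contingency
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

# Test whether predictions are associated with the true labels;
# a model no better than random guessing shows no association.
chi2, p_value, dof, expected = chi2_contingency(cm)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4f}')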

Finally, I embed a markdown cell that prompts them to write a short interpretation of the model coefficients. By connecting the statistical language they learned in lecture to the numeric output of the model, the abstract concept of correlation becomes concrete.

Key Takeaways

  • One notebook can cover the full ML pipeline.
  • Reproducibility comes from fixing seeds and Git.
  • Statistical tests stay central to model evaluation.
  • Students write brief interpretations for each model.

AI Tools for Instant Kaggle Starter Templates

I rely on AI-driven notebook generators to eliminate the tedious CSV download step. When I click a button in the Kaggle companion portal, a pre-populated notebook appears in Google Colab with the train/test splits already defined. The notebook also includes a cell that calls !kaggle datasets download -d {dataset} so the data lands in the runtime automatically.

To speed up resource provisioning, I connect the notebook to SageMaker local mode. The first cell runs !pip install 'sagemaker[local]' and then launches the training container through Docker on the local GPU. Compared with a full local Anaconda install, this reduces setup time by over 80%, which matches the experience I heard from early adopters (9to5Mac).
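
A minimal local-mode sketch, assuming a hypothetical train.py entry script and a placeholder IAM role, looks roughly like this:

from sagemaker.sklearn.estimator import SKLearn

# Local mode runs the SageMaker training container on the notebook host via Docker
estimator = SKLearn(
    entry_point='train.py',                                     # hypothetical training script
    role='arn:aws:iam::123456789012:role/SageMakerDummyRole',   # placeholder role ARN
    instance_type='local_gpu',                                  # use 'local' if no GPU is available
    instance_count=1,
    framework_version='1.2-1',
)
estimator.fit({'train': 'file://data/train'})                   # local folder instead of S3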

For novices who shy away from code, I embed an auto-generated feature importance widget. The widget reads the trained RandomForest model and produces an interactive bar chart with just a few lines of code. Students can see which variables drive predictions before they start hyper-parameter tuning, helping them build data-driven intuition early.
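
Under the hood the widget is only a few lines; a sketch with Plotly Express, assuming a fitted RandomForestClassifier named model and a feature_names list from earlier cells, is:

import pandas as pd
import plotly.express as px

# model and feature_names are assumed to exist from the training cells above
importances = pd.Series(model.feature_importances_, index=feature_names).sort_values()

fig = px.bar(
    x=importances.values,
    y=importances.index,
    orientation='h',
    labels={'x': 'importance', 'y': 'feature'},
    title='Which variables drive the predictions?',
)
fig.show()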

Below is a quick code snippet that you can copy into any starter notebook:

# Install Kaggle API and authenticate
!pip install kaggle
import os
os.environ['KAGGLE_USERNAME'] = 'YOUR_USERNAME'
os.environ['KAGGLE_KEY'] = 'YOUR_API_TOKEN'

# Pull dataset and unzip
!kaggle datasets download -d zillow/zecon
!unzip zecon.zip -d data

Workflow Automation: Bridging Assignments to Grading

In my semester-long project, I linked JupyterHub to Moodle using the Moodle REST API. When a student clicks "Submit", a small Python script posts the notebook file to a Moodle assignment endpoint and records the UTC timestamp. This automated handoff means the learning platform receives the exact version the student ran.
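
The submit script is little more than a file upload; a minimal sketch, assuming a webservice token in the MOODLE_TOKEN environment variable and a placeholder Moodle URL (the follow-up call that attaches the upload to the assignment is omitted), might look like:

import os
from datetime import datetime, timezone

import requests

MOODLE_URL = 'https://moodle.example.edu'      # placeholder campus instance
TOKEN = os.environ['MOODLE_TOKEN']             # webservice token injected at runtime

def submit_notebook(path):
    # Push the notebook to Moodle's file-upload endpoint and log the submission time
    with open(path, 'rb') as fh:
        resp = requests.post(
            f'{MOODLE_URL}/webservice/upload.php',
            params={'token': TOKEN},
            files={'file': fh},
            timeout=30,
        )
    resp.raise_for_status()
    print('Submitted at', datetime.now(timezone.utc).isoformat())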

Behind the scenes, I schedule Rundeck jobs to pull new submissions every five minutes. Each job spins up a lightweight Docker container that runs a suite of unit tests: checking for missing imports, confirming that model.fit converged, and verifying that the final ROC-AUC exceeds 0.70. The test results are posted to a color-coded dashboard - green for pass, red for fail - so students get immediate feedback.
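
The unit tests themselves are ordinary pytest; a trimmed-down sketch of the ROC-AUC check, assuming each run writes its predictions to a results.json file (an illustrative name), is:

# test_submission.py -- executed inside the grading container
import json

from sklearn.metrics import roc_auc_score

def load_results(path='results.json'):
    with open(path) as fh:
        return json.load(fh)

def test_model_beats_threshold():
    results = load_results()
    auc = roc_auc_score(results['y_true'], results['y_score'])
    assert auc > 0.70, f'ROC-AUC {auc:.3f} is below the 0.70 pass mark'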

Scoring is automated with a simple Python script that reads the model’s confusion matrix, calculates precision, recall, and ROC-AUC, then writes a CSV row for each student. The script also adds a bonus column if the student achieved a calibration score above 0.9. Because the grading logic lives in code, I can handle a surge of 200 submissions during finals without extra effort.

Here’s a minimal example of the grading script:

import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def grade_submission(results_path):
    # results_path points to a JSON file with y_true, y_pred, and y_score columns
    df = pd.read_json(results_path)
    prec = precision_score(df['y_true'], df['y_pred'])   # hard-label metrics
    rec = recall_score(df['y_true'], df['y_pred'])
    auc = roc_auc_score(df['y_true'], df['y_score'])     # uses the predicted probabilities
    score = (prec + rec + auc) / 3                       # equal-weight composite grade
    return {'precision': prec, 'recall': rec, 'auc': auc, 'final_score': score}

How to Embed a Kaggle Competition in Your Course

My first step is to create a single YAML file that describes the competition. The file contains fields for the competition title, the dataset URL, and a baseline score that I compute ahead of time. A lightweight Python script reads the YAML, clones a template notebook, and injects the metadata into the notebook’s front-matter.

To keep the workflow secure, I place a placeholder for the Kaggle API token inside a hidden cell. When the notebook runs on the university server, a secret manager injects the token at runtime, allowing the notebook to pull the latest leaderboard JSON without exposing credentials to students.

For grading fairness, I generate a private test set that is not part of the public Kaggle split. The notebook hides the ground-truth labels until the final evaluation step, which runs on a secure server after the submission deadline. This mirrors the real Kaggle competition environment while preserving academic integrity.

The YAML template looks like this:

competition:
  title: "Titanic Survival Prediction"
  dataset_url: "https://www.kaggle.com/c/titanic/data"
  baseline_auc: 0.85
  private_test_path: "data/private_test.csv"

The accompanying Python scaffolder reads the file and writes a new notebook called titanic_assignment.ipynb for each class section. Because the process is fully scripted, I can spin up a fresh assignment for a new cohort in under five minutes.
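
A condensed sketch of that scaffolder, assuming a template.ipynb alongside the YAML file (both names illustrative), is:

import nbformat
import yaml

def scaffold_assignment(config_path='competition.yaml',
                        template_path='template.ipynb',
                        out_path='titanic_assignment.ipynb'):
    with open(config_path) as fh:
        meta = yaml.safe_load(fh)['competition']

    nb = nbformat.read(template_path, as_version=4)

    # Inject the competition metadata as a front-matter markdown cell
    header = (f"# {meta['title']}\n\n"
              f"Dataset: {meta['dataset_url']}\n\n"
              f"Baseline AUC to beat: {meta['baseline_auc']}")
    nb.cells.insert(0, nbformat.v4.new_markdown_cell(header))

    nbformat.write(nb, out_path)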


Predictive Analytics: Turning Notebook Outputs into Student Feedback

After a student submits their notebook, I parse the JSON output that contains model metrics, feature importance, and prediction plots. An Azure Function picks up this JSON, merges it with an HTML feedback template, and produces a personalized PDF report. The report highlights three sections: strengths (e.g., high feature importance alignment), weaknesses (e.g., low recall), and next steps (e.g., try a different regularization parameter).

To give instructors a class-wide view, I aggregate the JSON files into a dashboard built with Plotly Dash. The dashboard shows the average feature importance across the cohort, the distribution of model variances, and a heat map of common error types. When I notice that most students struggle with handling missing values, I schedule an extra lab session focused on imputation techniques.
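
A stripped-down version of that dashboard, assuming each submission's metrics JSON lives in a results/ folder and includes a feature_importance mapping (both assumptions for illustration), could look like:

import glob
import json

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Aggregate every student's feature_importance dict into one cohort-level average
records = [json.load(open(path)) for path in glob.glob('results/*.json')]
importance = pd.DataFrame([r['feature_importance'] for r in records]).mean().sort_values()

app = Dash(__name__)
app.layout = html.Div([
    html.H2('Cohort-average feature importance'),
    dcc.Graph(figure=px.bar(x=importance.values, y=importance.index, orientation='h')),
])

if __name__ == '__main__':
    app.run(debug=True)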

Email hooks are also part of the system. If a student’s ROC-AUC crosses the 0.80 threshold, an automated email fires, congratulating them and encouraging them to experiment with ensemble methods. The email includes a one-click link that opens a new notebook pre-loaded with a stacking template, nudging students toward iterative learning.
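
The hook is a conditional plus the standard library's smtplib; a sketch with placeholder addresses and SMTP host is:

import smtplib
from email.message import EmailMessage

def maybe_congratulate(student_email, auc):
    if auc < 0.80:                                    # only fire above the threshold
        return
    msg = EmailMessage()
    msg['Subject'] = f'Nice work: ROC-AUC {auc:.2f}!'
    msg['From'] = 'stats-course@example.edu'          # placeholder sender address
    msg['To'] = student_email
    msg.set_content(
        'Your model crossed the 0.80 ROC-AUC mark. '
        'Open the stacking template here: https://example.edu/notebooks/stacking'  # illustrative link
    )
    with smtplib.SMTP('smtp.example.edu') as server:  # placeholder campus SMTP host
        server.send_message(msg)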

Below is the Azure Function snippet that builds the feedback PDF:

import json

import azure.functions as func
import pdfkit
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader('templates'))

def main(req: func.HttpRequest) -> func.HttpResponse:
    data = json.loads(req.get_body())                          # metrics JSON from the notebook run
    html = env.get_template('feedback.html').render(**data)    # fill in the feedback template
    pdf = pdfkit.from_string(html, False)                       # False -> return the PDF as bytes
    return func.HttpResponse(pdf, mimetype='application/pdf')

Supervised Learning: Fine-Tuning Models for Exam-Ready Grading

When I want students to explore hyper-parameter tuning without burning GPU hours, I give them a concise Optuna script. The script limits the search space to a few sensible ranges - the number of trees between 50 and 200, the maximum depth between 3 and 6 - so each trial finishes in under a minute on the campus GPU.

Cross-validation is another pillar of the assignment. I ask students to use stratified K-fold splitting so that each fold preserves the class distribution. This prevents inflated performance scores on imbalanced datasets and gives them a realistic sense of variance across folds.
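
In code, the only change from plain K-fold is the splitter object; a short sketch, reusing the X and y arrays assumed to be defined earlier in the notebook, is:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Every fold keeps the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf, scoring='roc_auc')

print('Fold ROC-AUCs:', scores.round(3))
print('Mean / std   :', scores.mean().round(3), scores.std().round(3))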

To make calibration concrete, I embed Bokeh visualizations that plot predicted probabilities against observed frequencies. Students can interactively hover over bins to see how well their model’s confidence matches reality. This hands-on view demystifies the concept of probability calibration and prepares them for exam questions that ask them to interpret calibration curves.
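
A pared-down version of that plot, assuming the fitted classifier is called model and the X_test, y_test split exists from earlier cells, might be:

from bokeh.models import HoverTool
from bokeh.plotting import figure, show
from sklearn.calibration import calibration_curve

# Bin predicted probabilities and compare each bin with the observed frequency
y_score = model.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, y_score, n_bins=10)

p = figure(title='Calibration curve', x_axis_label='Predicted probability',
           y_axis_label='Observed frequency', width=500, height=400)
p.line([0, 1], [0, 1], line_dash='dashed', color='gray')   # perfectly calibrated reference
p.scatter(prob_pred, prob_true, size=8)
p.add_tools(HoverTool(tooltips=[('predicted', '@x'), ('observed', '@y')]))
show(p)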

Here is the Optuna study snippet I share:

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # X and y are the training features and labels defined earlier in the notebook
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_int('max_depth', 3, 6)
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    return cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print('Best params:', study.best_params)

FAQ

Q: How do I get a Kaggle API token for my students?

A: Each student creates a Kaggle account, navigates to "My Account → API", clicks "Create New Token", and downloads the kaggle.json file. You can store the token securely on the server and inject it at runtime so students never see the raw key.

Q: Can I use Google Colab instead of a campus GPU cluster?

A: Yes. The starter notebooks include a Colab badge that launches the environment with a pre-installed GPU runtime. The same notebook runs on the campus cluster because all library versions are pinned in the requirements.txt file.

Q: How do I keep students from cheating on the private test set?

A: The private test set is stored on a secure server and never shipped to the student’s notebook. The grading script loads the test data, runs the model, and compares predictions to hidden labels. Only the final score is returned to the student.

Q: What if a student’s model fails to converge?

A: The automated unit tests flag convergence failures and insert a helpful hint cell into the notebook, suggesting a lower learning rate or more epochs. This immediate feedback prevents frustration and keeps the project moving forward.

Q: Is it possible to extend this workflow to other courses?

A: Absolutely. The YAML scaffolder, grading scripts, and feedback functions are language-agnostic. You can swap the scikit-learn model for a time-series forecast or a natural-language classifier and reuse the same automation pipeline.
