GitHub Actions for Model Training and Versioning
This lab demonstrates how to use GitHub Actions to automate the process of training a machine learning model, storing the model, and versioning it. This allows you to easily update and improve your model in a collaborative environment.
Prerequisites
- GitHub Lab-1
Getting started
Follow these steps, which are similar to GitHub Lab-1:
- Open the labs working directory (e.g., mlops_labs) using VSC and open a new terminal inside VSC.
- Create a new GitHub repository (e.g., github_lab2) for this lab in your GitHub account.
- Clone the GitHub repository into your working directory and move into the cloned repository using the following command in the terminal.
cd github_lab2
- Set up the project structure with data, src, and test directories, and add an empty __init__.py file to each of them.
- Create .gitignore and requirements.txt files
- Add, commit, and push your changes.
- You don't need to create a virtual environment for this lab, as all the scripts will run on a GitHub-hosted virtual machine.
- However, if you are modifying the scripts for your own projects, it is good practice to create a virtual environment, install the required packages, and run the scripts locally to verify that they work without errors before running them through GitHub Actions.
Project Structure:
mlops_labs/
└── github_lab2/
    ├── README.md
    ├── data/
    │   └── __init__.py
    ├── src/
    │   └── __init__.py
    ├── test/
    │   └── __init__.py
    ├── .gitignore
    └── requirements.txt
Running the Workflow
- Customize Model Training:
- Create or copy the train_model.py and evaluate_model.py files into your src folder (a hedged sketch of a minimal training script is shown after the project structure below).
- Add the necessary packages to your requirements.txt, similar to this.
- Create a .github directory with a workflows sub-directory inside it, and copy the model_calibration.yml and model_calibration_on_push.yml files into .github/workflows.
- This lab generates synthetic data for demonstration purposes.
- For your projects, modify the train_model.py script in the src/ directory according to your dataset and model requirements.
- The project structure should look similar to this:
mlops_labs/
└── github_lab2/
    ├── README.md
    ├── data/
    │   └── __init__.py
    ├── src/
    │   ├── __init__.py
    │   ├── train_model.py
    │   └── evaluate_model.py
    ├── test/
    │   └── __init__.py
    ├── .gitignore
    ├── requirements.txt
    └── .github/
        └── workflows/
            ├── model_calibration.yml
            └── model_calibration_on_push.yml
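For reference, below is a minimal sketch of what a training script along these lines might look like. It is only an illustration under assumptions made here (synthetic data from scikit-learn's make_classification, a joblib-serialized model, and a timestamp-based filename); the actual train_model.py provided with the lab may differ.

```python
# Hypothetical sketch of a minimal train_model.py; the lab's actual script may differ.
import os
from datetime import datetime

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def main():
    # Generate synthetic classification data for demonstration purposes.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # Train a random forest classifier on the synthetic data.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Save the model with a timestamp-based version into the models/ directory.
    os.makedirs("models", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    joblib.dump(model, os.path.join("models", f"model_{timestamp}.joblib"))


if __name__ == "__main__":
    main()
```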
- Modify GitHub credentials:
- Update user.email and user.name in model_calibration.yml and model_calibration_on_push.yml files.
- You can find your user.email by running the following in your VSC terminal.
git config user.email
- You can find your user.name by running the following in your VSC terminal.
git config user.name
- Make sure your GITHUB_TOKEN has permission to read and write files to your repository.
- You can do this in your repository's Settings -> Actions -> General -> Workflow permissions -> select "Read and write permissions" -> Save.
- Push Your Changes: Commit your changes and push them to your GitHub repository.
- GitHub Actions Workflow: Once you push changes to the main branch, the GitHub Actions workflow will be triggered automatically.
- View Workflow Progress: You can track the progress of the workflow by going to the “Actions” tab in your GitHub repository.
- Retrieve the Trained Model: After the workflow completes successfully, the trained model will be stored in the models/ directory.
Model Evaluation
The model evaluation is performed automatically within the GitHub Actions workflow. The evaluation results (e.g., F1 score) are stored in the metrics/ directory.
Versioning the Model
Each time you run the workflow, a new version of the model is created and stored. You can access and use these models for your projects.
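For example, here is one way you might load the most recent versioned model in your own code. This assumes the timestamp-based .joblib naming used in the training sketch above; adjust the path and filename pattern to match what your workflow actually produces.

```python
# Hypothetical example of loading the latest versioned model from models/.
import glob
import os

import joblib

# Pick the most recently created model file (timestamped names sort chronologically).
model_files = sorted(glob.glob(os.path.join("models", "model_*.joblib")))
latest_model_path = model_files[-1]

model = joblib.load(latest_model_path)
print(f"Loaded model from {latest_model_path}")
# predictions = model.predict(X_new)
```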
GitHub Actions Workflow Details
The workflow consists of the following steps:
- Generate and Store Timestamp: A timestamp is generated and stored in a file for versioning.
- Model Training: The train_model.py script is executed, which trains a random forest classifier on synthetic data and stores the model in the models/ directory.
- Model Evaluation: The evaluate_model.py script is executed to evaluate the model's F1 score on new synthetic data, and the results are stored in the metrics/ directory (a hedged sketch of such an evaluation script follows this list).
- Store and Version the New Model: The trained model is moved to the models/ directory with a timestamp-based version.
- Commit and Push Changes: The metrics and updated model are committed to the repository, allowing you to track changes.
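For reference, below is a hedged sketch of an evaluation script along these lines. It assumes the timestamped .joblib model files from the earlier training sketch and writes the F1 score to a plain-text file under metrics/; the lab's actual evaluate_model.py may be organized differently.

```python
# Hypothetical sketch of a minimal evaluate_model.py; the lab's actual script may differ.
import glob
import os

import joblib
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score


def main():
    # Load the most recently saved model from the models/ directory.
    latest_model_path = sorted(glob.glob(os.path.join("models", "model_*.joblib")))[-1]
    model = joblib.load(latest_model_path)

    # Generate fresh synthetic data to evaluate the model on.
    X_test, y_test = make_classification(n_samples=200, n_features=20, random_state=7)
    predictions = model.predict(X_test)
    f1 = f1_score(y_test, predictions)

    # Store the evaluation result in the metrics/ directory.
    os.makedirs("metrics", exist_ok=True)
    with open(os.path.join("metrics", "f1_score.txt"), "w") as f:
        f.write(f"F1 Score: {f1:.4f}\n")


if __name__ == "__main__":
    main()
```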
Model Calibration Workflow
Overview
The model_calibration_on_push.yml workflow is a part of the automation process for machine learning model calibration within this repository. It is essential for ensuring that the model’s predictions are accurate and well-calibrated, a critical step in machine learning applications.
Workflow Purpose
This workflow’s primary purpose is to calibrate/train a machine learning model after each push to the main branch of the repository. Calibration is a crucial step to align model predictions with reality, particularly when dealing with classification tasks.
Workflow Execution
Let’s break down how this workflow operates step by step:
Step 1: Trigger on Push to Main Branch
- This workflow is automatically initiated when changes are pushed to the main branch of the repository. It ensures that the model remains calibrated and up-to-date with the latest data and adjustments.
Step 2: Prepare Environment
- The workflow begins by setting up a Python environment and installing the necessary Python libraries and dependencies. This is crucial to ensure that the model calibration process can be executed without any issues.
Step 3: Generate Synthetic Data
- Synthetic data for classification is randomly generated.
Step 4: Calibrate Model
- In this step, a RandomForestClassifier model is trained/calibrated on the synthetic data and saved in the Ubuntu virtual machine’s models/ directory.
- Training metrics such as accuracy and F1-score are logged using MLflow and saved in the mlruns/ directory.
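As an illustration of this logging step, a training script might record metrics with MLflow roughly as follows. This is a sketch with assumed metric names; by default, MLflow writes run data to a local mlruns/ directory.

```python
# Hypothetical sketch of logging training metrics with MLflow.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Generate synthetic data and hold out a validation split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_val)

# By default, MLflow records runs under the local mlruns/ directory.
with mlflow.start_run():
    mlflow.log_metric("accuracy", accuracy_score(y_val, predictions))
    mlflow.log_metric("f1_score", f1_score(y_val, predictions))
```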
Step 5: Evaluate Saved Model
- The latest calibrated model is retrieved from the models/ directory on GitHub's Ubuntu virtual machine and evaluated on new synthetic data.
- Test metrics such as the F1-score are computed and stored in the repository's metrics/ directory.
- The latest calibrated model is then stored in the repository’s models/ directory.
Step 6: Commit and Push Changes
- This final step involves committing the calibrated model and any other relevant files to the repository. It is essential to keep track of the changes made during the calibration process and to store the calibrated model in the repository for future applications and reference.
Customization
- The model_calibration_on_push.yml workflow can be customized to align with your specific project requirements. You can modify calibration methods, the directory where the calibrated model is saved, or any other aspects of the calibration process to meet your project’s unique needs.
- The model_calibration.yml workflow works similarly to the model_calibration_on_push.yml workflow, but runs at a scheduled time every day instead of being triggered on every push to the main branch.