Knowledge Distillation for Customer Support LLMs

IMPORTANT: Please read the following before proceeding. This AMP includes or otherwise depends on certain third party software packages. Information about such third party software packages are made available in the notice file associated with this AMP. By configuring and launching this AMP, you will cause such third party software packages to be downloaded and installed into your environment, in some instances, from third parties' websites. For each third party software package, please see the notice file and the applicable websites for more information, including the applicable license terms.

If you do not wish to download and install the third party software packages, do not configure, launch or otherwise use this AMP. By configuring, launching or otherwise using the AMP, you acknowledge the foregoing statement and agree that Cloudera is not responsible or liable in any way for the third party software packages.

Knowledge Distillation for Customer Support LLMs

Project Overview

This project addresses the challenge of improving the accuracy and speed of a customer support LLM while adhering to data privacy constraints. By leveraging synthetic data generation and fine-tuning techniques, we demonstrate how to train a smaller, faster LLM (Meta-Llama-3.1-8B-Instruct) for real-time analysis of customer support requests and compare the results to a base model. The workflow is divided into four core steps:

Setup & Model Initialization
Data Preparation
Fine-Tuning with LoRA
Inference, Evaluation, and Benchmarking

Step 0: Environment Setup & Model Initialization

Overview

This is the starting notebook of the project. In this step, we download all the required models and install the needed libraries.

Purpose:

Download models required for this project

Key Components:

Initializes two foundational models:
- Meta-Llama-3.1-8B-Instruct (target for fine-tuning)
- Microsoft Phi-4 (used later as an evaluation judge)

Output:

Ready-to-use models and libraries for subsequent steps

Step 1: Data Preparation

Overview

In this notebook, we use the output data from Synthetic Data Studio (SDS) and process it for finetuning and evaluation. Cloudera's customer support team separates and processes customer and Cloudera comments using two different output formats. Thus, we use different SDS generated data for each comment type. In addition, the SDS output is a list of topics and each topic contains the relevant prompt, completion, and evaluation. We use the evaluation score to filter low-quality data and combine the prompt with the expected completion to teach the LLM using finetuning. For LLM finetuning, we combine the customer and Cloudera comments into one LLM for efficiency. We also leave 1000 samples out for processing Cloudera and customer comments.

Purpose:

Generate structured training data from raw customer support comments
Split data into training/evaluation sets

Process:

Loads raw data from ClouderaComments.json and CustomerComments.json
Filters high-quality entries (score >4.9)
Formats entries into prompt-answer pairs:
- Prompt: Customer support comment + structured questions
- Completion: Model answers to the questions (e.g., scores, summaries)
Splits data into Train_Clean (3500 samples) and Evaluation_Clean (500 samples for Cloudera comments and 500 samples for customer comments)

Variables to Customize

Filenames for input data (e.g., filename='Data/ClouderaComments').
Data split sizes (e.g., 3500 for training vs 500 for evaluation).

Output:

Cleaned datasets in AllComments_Clean_Train.json and evaluation files

Step 2: Fine-Tuning with LoRA

Overview

In this notebook, we finetune the LLM using distilled knowledge from SDS. At a high-level, we add the special tokens before fine-tuning, split the data into training and dev sets, finetune lora adapters, and merge and store the model. Purpose:
Adapts the Meta-Llama-3.1-8B-Instruct model to the customer support domain
Uses LoRA (Low-Rank Adaptation) for efficient parameter updates

Key Configurations:

LoRA Parameters:
- Rank (lora_r): 128
- Alpha (lora_alpha): 64
- Dropout: 0.05
Training:
- Dataset from Step 2 formatted into chat templates
- Trained for 1 epoch with gradient accumulation
- Saves fine-tuned model to ./tmp/merged_...

Variables to Customize

LoRA Parameters
- lora_r: Rank (default 128).
- lora_alpha: Scaling factor (default 64).
- lora_dropout: Dropout rate (default 0.05).
Data_Size: Number of samples used for training (default 5000). if the number of samples is more than data samples available, it uses the maximum available.
FT-num_train_epochs: Number of training epochs (default 1).

Output:

A domain-specific model optimized for customer support tasks

Step 3: Inference, Evaluation, and Benchmarking

Overview

In this final notebook, we infer the output (completion) for each Cloudera and customer comments separately. Using the generated answers, we parse the output, extract the relevant information and instruct an LLM-as-a-judge to compare the outputs of the two LLMs (score if A or B model is best or if it is a tie). Here, we evaluate only on answers that there is no tie between the models and compute the winrate and the percentage of ties. Also, this step shows example outputs from each LLM.

Purpose:

Compare the fine-tuned model against the baseline (Meta-Llama-3.1-8B-Instruct)
Use an external judge (Phi-4 14B) to evaluate output quality

Process:

Generate outputs for both models on evaluation data
Format outputs into structured comparisons for the judge
Judge evaluates pairs of answers and selects the better-performing model
Compute the winrate of the Finetuned model compared to baseline for each question and average.

Variables to Customize

Customer: 0 for Cloudera comments, 1 for customer comments.
EvalLLM: Path to the evaluation model (e.g., microsoft/phi-4).

Setup & Installation

Environment requirements:
- Python 3.11+
Tested models:
- Tested for Finetuning: unsloth/Meta-Llama-3.1-8B-Instruct and unsloth/gemma-2-2b-it
- Tested for LLM-as-a-judge: microsoft/phi-4
Memory and GPU Requirements:
- Step 2 and 3 require a GPU with ~48GB VRAM for 8B model training
- 48GB of cpu memory
Cleaning resources
- After finishing step 3, we need to reset the kernel of the notebook to release the GPU.

Usage Guide

Run Step 0 first to download models.
Execute Step 1 to process raw data into training/evaluation sets.
Proceed to Step 2 to fine-tune the LLM.
Run Step 3 to evaluate performance against the baseline.

Advanced Usage Guide

The AMP enables three main categories of customizations: custom input data, custom model choice, and custom LLM-as-a-judge evaluation

Custom input data: To modify the input data with your own custom data you need to update the input files used in Step1 (such as ClouderaComments.json). The data need to follow the format exported by SDS. The format also expects data quality evaluation scores to be used for filtering as exported by SDS. Note that if you need to update the prompts, we need to update the CommentText and PreppendClouderaQuestions variables to reflect the new prompts.
Custom model choice: To use your own model for finetuning, you need to download the model in step0 and set the ModelFT variable to the new model (step2 and step3). Note that the second cell and the Config/Config.py provides variables finetuning parameters to choose from, for example, the learning rate etc. Finally, here we can modify the target output path by setting the TargetDir variable.
Custom LLM-as-a-judge evaluation: To select your own LLM-as-a-judge, you need to download the model in step0 and set the variable EvalLLM in step3. In addition, you can load a new evaluation set by replacing the files evaluation files (such as Data/CustomerComments_Evaluation_Clean.json). If you need to ignore the first lines of the finetuned LLM or the base LLM you can use the variables: StartLineFT=1, StartLineBase=1 to ignore 1 line for example. Finally, you can modify the LLM-as-a-judge instructions by changing cells 18 and 19.

Expected Outputs

Step 2: A fine-tuned model saved to ./tmp/merged_*.
Step 3: Evaluation metrics (win rate, tie percentage) printed in the notebook.

Testing correct execution

Run all steps with unsloth/Meta-Llama-3.1-8B-Instruct for finetuning
Evaluate using LLM-as-a-judge microsoft/phi-4
The final cell in step 3 tests if the average winrate is 82%

Known issues

Both the AMP deployment and session need to run the same python version

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
0_session-install-dependencies		0_session-install-dependencies
Config		Config
Data		Data
Libs		Libs
assets		assets
.project-metadata.yaml		.project-metadata.yaml
Config.py		Config.py
DataSynthesisLib.py		DataSynthesisLib.py
LICENSE		LICENSE
NOTICE.txt		NOTICE.txt
Readme.md		Readme.md
Step0-Installation.ipynb		Step0-Installation.ipynb
Step1-DataPreparation.ipynb		Step1-DataPreparation.ipynb
Step2-Finetuning.ipynb		Step2-Finetuning.ipynb
Step3-Evaluation.ipynb		Step3-Evaluation.ipynb
catalog.yaml		catalog.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Knowledge Distillation for Customer Support LLMs

Project Overview

Step 0: Environment Setup & Model Initialization

Step 1: Data Preparation

Step 2: Fine-Tuning with LoRA

Step 3: Inference, Evaluation, and Benchmarking

Setup & Installation

Usage Guide

Advanced Usage Guide

Expected Outputs

Testing correct execution

Known issues

About

Uh oh!

Releases

Packages

Languages

License

cloudera/CML_AMP_Knowledge_Distillation_With_Private_Data

Folders and files

Latest commit

History

Repository files navigation

Knowledge Distillation for Customer Support LLMs

Project Overview

Step 0: Environment Setup & Model Initialization

Step 1: Data Preparation

Step 2: Fine-Tuning with LoRA

Step 3: Inference, Evaluation, and Benchmarking

Setup & Installation

Usage Guide

Advanced Usage Guide

Expected Outputs

Testing correct execution

Known issues

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages