CyberKatsu

May 2, 2026


# Model Documentation

## Transfer Learning Strategy

### Why CodeBERT?

CodeOriginClassifier uses **microsoft/codebert-base** as its pre-trained encoder. CodeBERT is a RoBERTa-base model (12 transformer layers, 768 hidden dimensions, 125M parameters) that was further pre-trained on the CodeSearchNet corpus using two objectives:

1. **Masked Language Modelling (MLM):** Randomly masks 15% of tokens in code and trains the model to predict them. This forces the model to learn syntactic patterns (bracket matching, indentation), semantic patterns (variable naming conventions, API usage), and control flow structures.
2. **Replaced Token Detection (RTD):** A small "generator" network produces plausible replacement tokens; the model must distinguish real tokens from replacements. This objective is especially relevant to our task: it trains the model to detect subtle token-level anomalies, which is analogous to detecting the "statistically average" token choices that characterise LLM-generated code.

### Why Not Train From Scratch?

Training a 125M-parameter transformer from scratch on a 10k-sample dataset would:

- **Overfit immediately:** the model has ~12,500x more parameters than training samples.
- **Fail to learn code structure:** 10k samples is insufficient to learn even basic syntax, let alone the nuanced stylistic differences between human and LLM code.
- **Waste compute:** CodeBERT already encodes a rich understanding of code structure from 6.4M functions across 6 languages.

Transfer learning lets us inherit this structural understanding and train only a small classification head that learns to map CodeBERT's representations to the binary human/LLM decision.

### Why Not Use a Larger Model (e.g., CodeLlama, StarCoder)?

Decoder-only models like CodeLlama and StarCoder are optimised for code *generation*, not code *understanding*. Their autoregressive architecture means they process tokens left-to-right and cannot attend to future context.
For classification, we need bidirectional attention: the model must see the *entire* snippet before making a prediction. CodeBERT's bidirectional encoder architecture is the natural fit. Additionally, the 125M parameter size is practical for CPU-only inference in a Docker container without requiring a GPU.

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Input: [CLS] def greet ( name ) : ... [SEP] [PAD]      │
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │            CodeBERT Encoder (FROZEN)            │    │
│  │            12 transformer layers                │    │
│  │            768 hidden dimensions                │    │
│  │            125M parameters (all non-trainable)  │    │
│  └──────────────────┬──────────────────────────────┘    │
│                     │                                   │
│      CLS token output (768-d vector)                    │
│                     │                                   │
│  ┌──────────────────▼──────────────────────────────┐    │
│  │         Classification Head (TRAINABLE)         │    │
│  │                                                 │    │
│  │  Dropout(0.3)                                   │    │
│  │  Dense(256, activation='relu')                  │    │
│  │  Dropout(0.2)                                   │    │
│  │  Dense(1, activation='sigmoid')                 │    │
│  │                                                 │    │
│  │  ~400K trainable parameters                     │    │
│  └──────────────────┬──────────────────────────────┘    │
│                     │                                   │
│     Output: P(LLM-Generated) ∈ [0, 1]                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

### Why Freeze the Encoder?

With ~10k training samples, the ratio of trainable parameters to data points is critical:

| Configuration       | Trainable Params | Param-to-Sample Ratio |
|---------------------|------------------|-----------------------|
| Full fine-tuning    | ~125M            | ~12,500:1             |
| Frozen encoder      | ~400K            | ~40:1                 |

A 40:1 ratio is within the range where gradient-based optimisation can generalise without severe overfitting, especially with dropout regularisation and early stopping. If the dataset grows to 100k+ samples, the encoder could be partially unfrozen (top 2-4 layers) for a second training phase, a technique known as **gradual unfreezing** (Howard & Ruder, 2018). The `--unfreeze` flag in the training script enables this.
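The parameter-to-sample ratios above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the head is exactly the two `Dense` layers drawn in the diagram, fed by the 768-d CLS vector, and that the dataset holds 10,000 samples. The helper `dense_params` and the constant names are hypothetical and do not come from the training script.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


HIDDEN_SIZE = 768             # width of the CLS vector, from the diagram
N_SAMPLES = 10_000            # approximate training-set size quoted in the text
ENCODER_PARAMS = 125_000_000  # approximate CodeBERT size quoted in the text

# Dropout layers have no parameters, so only the two Dense layers count.
head_params = dense_params(HIDDEN_SIZE, 256) + dense_params(256, 1)

# Full fine-tuning trains the encoder and the head; freezing trains the head only.
full_ratio = (ENCODER_PARAMS + head_params) / N_SAMPLES
frozen_ratio = head_params / N_SAMPLES

print(f"head parameters:        {head_params:,}")
print(f"full fine-tuning ratio: {full_ratio:,.0f}:1")
print(f"frozen encoder ratio:   {frozen_ratio:,.1f}:1")
```

With the layers exactly as drawn, the head comes to 197,121 parameters, about half of the ~400K quoted, which would put the frozen ratio nearer 20:1 than 40:1. The quoted figure may reflect a wider hidden layer or components the diagram does not show. Under either count, freezing the encoder leaves several hundred times fewer trainable parameters than full fine-tuning.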
## Training Procedure

| Hyperparameter       | Value                               |
|----------------------|-------------------------------------|
| Optimiser            | Adam                                |
| Learning rate        | 2e-5 (with ReduceLROnPlateau)       |
| Batch size           | 16                                  |
| Epochs               | 5 (with early stopping, patience=2) |
| Loss                 | Binary cross-entropy                |
| Dropout (head)       | 0.3 → 0.2                           |
| Max token length     | 512                                 |

### Early Stopping

Training halts when the validation loss does not improve for 2 consecutive epochs. The best weights (by validation loss) are restored automatically. This prevents the classification head from memorising the training set.

### Experiment Tracking

All training runs are logged to MLflow, including per-epoch loss/accuracy curves, test-set metrics (accuracy, precision, recall, F1, AUC-ROC), the confusion matrix, and the final model checkpoint.

## Explainability: Integrated Gradients

### The Problem

A binary classifier that outputs "LLM-Generated: 87% confidence" is useful but opaque. For this tool to be actionable, users need to understand *which parts* of the code influenced the prediction.

### The Approach

We implement **Integrated Gradients** (Sundararajan, Taly & Yan, 2017), a gradient-based attribution method that assigns an importance score to each input token. The core idea:

1. Define a **baseline** input: a zero-vector embedding that represents "no code."
2. Construct a straight-line path from the baseline to the actual input embeddings.
3. At each step along this path, compute the gradient of the output with respect to the interpolated embeddings.
4. **Integrate** (average) these gradients across all steps.
5. Multiply by the difference between the actual and baseline embeddings to get per-token attributions.

### Why Not Attention Weights?

Attention weights show what the model *looked at*, not what *caused* the prediction. In a frozen-encoder setup, the attention patterns were learned during CodeBERT's pre-training and remain fixed regardless of the downstream task.
Integrated Gradients, by contrast, traces the causal path from input tokens through the entire model (including the trained classification head) to the output.

### Why Not SHAP?

SHAP (Lundberg & Lee, 2017) is model-agnostic but requires 2^N evaluations for N features in the exact case, or uses approximations (KernelSHAP) that can be noisy for high-dimensional inputs like token sequences. Integrated Gradients is deterministic for a fixed baseline and step count, satisfies formal axioms (sensitivity and implementation invariance), and needs only O(n_steps) forward and backward passes, which is practical for real-time serving.

### Implementation Detail

Because the encoder is frozen and its embedding layer is internal to `TFRobertaModel`, we cannot simply differentiate with respect to `input_ids` (which are discrete integers). Instead, we:

1. Extract the continuous embeddings from the encoder's embedding layer.
2. Construct interpolated embeddings between the zero baseline and the actual values.
3. Feed these directly into the encoder via `inputs_embeds` (bypassing the embedding lookup).
4. Aggregate per-token attributions using the L2 norm across the 768 hidden dimensions.

The top-5 tokens by attribution score are returned alongside the prediction.

## Limitations

1. **Explainability is approximate:** Integrated Gradients assumes a linear interpolation path, which may not capture non-linear interactions between tokens.
2. **Frozen encoder limits adaptation:** The encoder's representations are optimised for general code understanding, not specifically for human-vs-LLM discrimination. Fine-tuning the top encoder layers could improve both accuracy and attribution quality.
3. **Token-level granularity:** Attribution is at the sub-word token level (BPE units), which can be unintuitive (e.g., a variable name split into 3 tokens will show 3 separate scores). A production system might aggregate attributions at the word or line level.

## References

- Feng, Z., et al. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages. *arXiv:2002.08155*.
- Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic Attribution for Deep Networks. *ICML 2017*.
- Howard, J. & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. *ACL 2018*.
- Lundberg, S.M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. *NeurIPS 2017*.
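
## Appendix: Integrated Gradients on a Toy Model

The five steps listed under "The Approach" can be illustrated on a toy problem. The sketch below is a stand-in under stated assumptions, not the project's implementation: a three-feature logistic model replaces CodeBERT plus the classification head, the gradient is written out analytically instead of being obtained from TensorFlow autodiff, and every name (`toy_model`, `toy_gradient`, `integrated_gradients`, the weights and inputs) is hypothetical. The integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(x, weights, bias):
    """Hypothetical stand-in for encoder + head: a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def toy_gradient(x, weights, bias):
    """Analytical gradient of toy_model with respect to each input feature."""
    p = toy_model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, n_steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg_grad = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # position on the straight-line path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = toy_gradient(point, weights, bias)
        avg_grad = [a + g / n_steps for a, g in zip(avg_grad, grad)]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


weights, bias = [1.5, -2.0, 0.5], 0.1   # illustrative values only
x = [1.0, 0.5, 2.0]                     # "actual" input
baseline = [0.0, 0.0, 0.0]              # zero baseline, as in the text

attributions = integrated_gradients(x, baseline, weights, bias)
delta = toy_model(x, weights, bias) - toy_model(baseline, weights, bias)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("f(x) - f(baseline): ", delta)
```

The last two printed values should agree closely. This is the completeness property of Integrated Gradients: the attributions sum to the change in model output between the baseline and the input. In the real pipeline each token carries a 768-dimensional attribution vector that is then reduced to one score with the L2 norm. That reduction discards sign, so the per-token scores no longer sum to the output difference.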
