# Adding Additional Evals Exercises
This guide explains how to add new coding exercises to the Zentara Code evals system. The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments to test AI coding capabilities across multiple programming languages.
## Table of Contents
1. [What is an "Eval"?](#what-is-an-eval)
2. [System Overview](#system-overview)
3. [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages)
4. [Adding Support for New Programming Languages](#adding-support-for-new-programming-languages)
## What is an "Eval"?
An **eval** (evaluation) is a coding exercise whose known solution is expressed as a set of unit tests: a submission is considered correct only if every test passes. Each eval consists of:
- **Problem Description**: Clear instructions explaining what needs to be implemented
- **Implementation Stub**: A skeleton file with function signatures but no implementation
- **Unit Tests**: Comprehensive test suite that validates the correctness of the solution
- **Success Criteria**: The AI must implement the solution such that all unit tests pass
The key principle is that the tests define the contract - if all tests pass, the solution is considered correct. This provides an objective, automated way to measure AI coding performance across different programming languages and problem domains.
**Example Flow**:
1. AI receives a problem description (e.g., "implement a function that reverses a string")
2. AI examines the stub implementation and test file
3. AI writes code to make all tests pass
4. System runs tests to verify correctness
5. Success is measured by test pass/fail rate
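The flow above boils down to "run the test command, then check whether it succeeded". The following is a minimal sketch of that idea, not the actual controller or runner code: `EvalResult` and `run_eval` are hypothetical names, and it assumes the test command signals failure through a non-zero exit code (as `go test`, `pytest`, and `cargo test` do).

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool  # True only if the test command exited with status 0
    output: str   # Combined stdout and stderr, useful for debugging failures


def run_eval(test_command: list[str], exercise_dir: str = ".") -> EvalResult:
    """Run an exercise's test command and report pass/fail from its exit code."""
    completed = subprocess.run(
        test_command,
        cwd=exercise_dir,
        capture_output=True,
        text=True,
    )
    return EvalResult(
        passed=completed.returncode == 0,
        output=completed.stdout + completed.stderr,
    )


# Usage: a stand-in "test command" that exits with 0, as a passing suite would.
result = run_eval([sys.executable, "-c", "print('ok')"])
print(result.passed)
```

In a real run the command would be the language's test runner, executed inside the exercise directory.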
## System Overview
The evals system consists of several key components:
- **Exercises Repository**: [`Zentara-Code-Evals`](https://github.com/ZentaraCodeInc/Zentara-Code-Evals) - Contains all exercise definitions
- **Web Interface**: [`apps/web-evals`](../apps/web-evals) - Management interface for creating and monitoring evaluation runs
- **Evals Package**: [`packages/evals`](../packages/evals) - Contains both controller logic for orchestrating evaluation runs and runner container code for executing individual tasks
- **Docker Configuration**: Container definitions for the `controller` and `runner` as well as a Docker Compose file that provisions Postgres and Redis instances required for eval runs.
### Current Language Support
The system currently supports these programming languages:
- **Go** - `go test` for testing
- **Java** - Maven/Gradle for testing
- **JavaScript** - Node.js with Jest/Mocha
- **Python** - pytest for testing
- **Rust** - `cargo test` for testing
## Adding Exercises to Existing Languages
TL;DR - Here's a pull request that adds a new JavaScript eval: https://github.com/ZentaraCodeInc/Zentara-Code-Evals/pull/3
### Step 1: Understand the Exercise Structure
Each exercise follows a standardized directory structure:
```
/evals/{language}/{exercise-name}/
├── docs/
│ ├── instructions.md # Main exercise description
│ └── instructions.append.md # Additional instructions (optional)
├── {exercise-name}.{ext} # Implementation stub
├── {exercise-name}_test.{ext} # Test file
└── {language-specific-files} # go.mod, package.json, etc.
```
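Before opening a pull request, it can help to check a new exercise against this layout. The helper below is a hypothetical sketch, not part of the evals tooling. It assumes file names use underscores where the directory name uses hyphens, as in the `reverse-string` examples later in this guide; other languages may follow different conventions.

```python
from pathlib import Path
from typing import Optional


def missing_exercise_files(
    exercise_dir: Path, ext: str, stem: Optional[str] = None
) -> list[str]:
    """Return the expected files that are absent from an exercise directory.

    `stem` is the base file name; by default it is derived from the directory
    name with hyphens replaced by underscores (an assumption based on the
    examples in this guide).
    """
    if stem is None:
        stem = exercise_dir.name.replace("-", "_")
    expected = [
        Path("docs") / "instructions.md",  # main exercise description
        Path(f"{stem}.{ext}"),             # implementation stub
        Path(f"{stem}_test.{ext}"),        # test file
    ]
    return [p.as_posix() for p in expected if not (exercise_dir / p).is_file()]
```

For example, `missing_exercise_files(Path("python/reverse-string"), "py")` returns an empty list once the instructions, stub, and test file are all in place. Language-specific files such as `go.mod` are not checked here.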
### Step 2: Create Exercise Directory
1. **Clone the evals repository**:
```bash
git clone https://github.com/ZentaraCodeInc/Zentara-Code-Evals.git evals
cd evals
```
2. **Create exercise directory**:
```bash
mkdir -p {language}/{exercise-name}/docs
cd {language}/{exercise-name}
```
### Step 3: Write Exercise Instructions
Create `docs/instructions.md` with a clear problem description:
```markdown
# Instructions
Create an implementation of [problem description].
## Problem Description
[Detailed explanation of what needs to be implemented]
## Examples
- Input: [example input]
- Output: [expected output]
## Constraints
- [Any constraints or requirements]
```
**Example from a simple reverse-string exercise**:
```markdown
# Instructions
Create a function that reverses a string.
## Problem Description
Write a function called `reverse` that takes a string as input and returns the string with its characters in reverse order.
## Examples
- Input: `reverse("hello")` → Output: `"olleh"`
- Input: `reverse("world")` → Output: `"dlrow"`
- Input: `reverse("")` → Output: `""`
- Input: `reverse("a")` → Output: `"a"`
## Constraints
- Input will always be a valid string
- Empty strings should return empty strings
```
### Step 4: Create Implementation Stub
Create the main implementation file with function signatures but no implementation:
**Python example** (`reverse_string.py`):
```python
def reverse(text):
    pass
```
**Go example** (`reverse_string.go`):
```go
package reversestring

// Reverse returns the input string with its characters in reverse order
func Reverse(s string) string {
	// TODO: implement
	return ""
}
```
### Step 5: Write Comprehensive Tests
Create test files that validate the implementation:
**Python example** (`reverse_string_test.py`):
```python
import unittest

from reverse_string import reverse


class ReverseStringTest(unittest.TestCase):
    def test_reverse_hello(self):
        self.assertEqual(reverse("hello"), "olleh")

    def test_reverse_world(self):
        self.assertEqual(reverse("world"), "dlrow")

    def test_reverse_empty_string(self):
        self.assertEqual(reverse(""), "")

    def test_reverse_single_character(self):
        self.assertEqual(reverse("a"), "a")
```
**Go example** (`reverse_string_test.go`):
```go
package reversestring

import "testing"

func TestReverse(t *testing.T) {
	tests := []struct {
		input    string
		expected string
	}{
		{"hello", "olleh"},
		{"world", "dlrow"},
		{"", ""},
		{"a", "a"},
	}
	for _, test := range tests {
		result := Reverse(test.input)
		if result != test.expected {
			t.Errorf("Reverse(%q) = %q, expected %q", test.input, result, test.expected)
		}
	}
}
```
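The Go test above is table-driven, and the same pattern is available in Python through `unittest`'s `subTest`. The sketch below is one possible way to structure it, not a required style for this repository. It defines a stand-in `reverse` so that it runs on its own; a real test file would import `reverse` from `reverse_string` instead.

```python
import unittest


def reverse(text):
    # Stand-in implementation so the sketch is self-contained;
    # a real test file would import `reverse` from reverse_string.
    return text[::-1]


class ReverseStringTableTest(unittest.TestCase):
    # Each entry is (input, expected output), mirroring the Go test table.
    CASES = [
        ("hello", "olleh"),
        ("world", "dlrow"),
        ("", ""),
        ("a", "a"),
    ]

    def test_reverse_cases(self):
        for text, expected in self.CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first mismatch.
            with self.subTest(text=text):
                self.assertEqual(reverse(text), expected)
```

Adding an edge case then only requires a new row in `CASES`.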
### Step 6: Add Language-Specific Configuration
**For Go exercises**, create `go.mod`:
```go
module reverse-string
go 1.18
```
**For Python exercises**, ensure the parent directory has `pyproject.toml`:
```toml
[project]
name = "python-exercises"
version = "0.1.0"
description = "Python exercises for Zentara Code evals"
requires-python = ">=3.9"
dependencies = [
"pytest>=8.3.5",
]
```
### Step 7: Test Locally
Before committing, test your exercise locally:
**Python**:
```bash
cd python/reverse-string
uv run python3 -m pytest -o markers=task reverse_string_test.py
```
**Go**:
```bash
cd go/reverse-string
go test
```
The tests should **fail** with the stub implementation and **pass** when properly implemented.
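This fail-then-pass property can be checked quickly with a throwaway script. The sketch below is an illustration only: `failing_cases`, `stub_reverse`, and `reference_reverse` are hypothetical names, and the reference solution is an assumption used to validate the test cases, not something that belongs in the exercise directory.

```python
def failing_cases(implementation, cases):
    """Return (input, expected, actual) for every case the implementation gets wrong."""
    failures = []
    for text, expected in cases:
        actual = implementation(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def stub_reverse(text):
    # Mirrors the stub from Step 4: no implementation yet, so it returns None.
    pass


def reference_reverse(text):
    # Hypothetical reference solution, used only to validate the test cases.
    return text[::-1]


CASES = [("hello", "olleh"), ("world", "dlrow"), ("", ""), ("a", "a")]

stub_failures = failing_cases(stub_reverse, CASES)
reference_failures = failing_cases(reference_reverse, CASES)

# A well-formed exercise fails with the stub and passes with a correct solution.
print(len(stub_failures) > 0, len(reference_failures) == 0)
```

If the stub passes any meaningful portion of the suite, or a correct solution fails, the tests need to be revisited before the exercise is committed.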
## Adding Support for New Programming Languages
Adding a new programming language requires changes to both the evals repository and the main Zentara Code repository.
### Step 1: Update Language Configuration
1. **Add language to supported list** in [`packages/evals/src/exercises/index.ts`](../packages/evals/src/exercises/index.ts):
```typescript
export const exerciseLanguages = [
"go",
"java",
"javascript",
"python",
"rust",
"your-new-language", // Add here
] as const
```
### Step 2: Create Language-Specific Prompt
Create `prompts/{language}.md` in the evals repository:
```markdown
Your job is to complete a coding exercise described in the markdown files inside the `docs` directory.
A file with the implementation stubbed out has been created for you, along with a test file (the tests should be failing initially).
To successfully complete the exercise, you must pass all the tests in the test file.
To confirm that your solution is correct, run the tests with `{test-command}`. Do not alter the test file; it should be run as-is.
Do not use the "ask_followup_question" tool. Your job isn't done until the tests pass. Don't attempt completion until you run the tests and they pass.
You should start by reading the files in the `docs` directory so that you understand the exercise, and then examine the stubbed out implementation and the test file.
```
Replace `{test-command}` with the appropriate testing command for your language.
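If the prompt is kept as a template, the substitution can be sketched as follows. `render_prompt` and `PLACEHOLDER` are hypothetical names used for illustration; the guide itself only asks for the placeholder to be replaced by hand in `prompts/{language}.md`.

```python
PLACEHOLDER = "{test-command}"


def render_prompt(template: str, test_command: str) -> str:
    """Fill the test command into a prompt template.

    Raises ValueError if the placeholder is missing, so a template that
    never tells the model how to run the tests is caught early.
    """
    if PLACEHOLDER not in template:
        raise ValueError(f"template does not contain {PLACEHOLDER}")
    return template.replace(PLACEHOLDER, test_command)


# Usage with one sentence of the template and the Rust test command.
line = "To confirm that your solution is correct, run the tests with `{test-command}`."
print(render_prompt(line, "cargo test"))
```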
### Step 3: Update Docker Configuration
Modify [`packages/evals/Dockerfile.runner`](../packages/evals/Dockerfile.runner) to install the new language runtime:
```dockerfile
# Install your new language runtime
RUN apt update && apt install -y your-language-runtime
# Or for languages that need special installation:
ARG YOUR_LANGUAGE_VERSION=1.0.0
RUN curl -sSL https://install-your-language.sh | sh -s -- --version ${YOUR_LANGUAGE_VERSION}
```
### Step 4: Update Test Runner Integration
If your language requires special test execution, update [`packages/evals/src/cli/runUnitTest.ts`](../packages/evals/src/cli/runUnitTest.ts) to handle the new language's testing framework.
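The actual logic lives in `runUnitTest.ts` and is not reproduced here. As a rough Python analogue of what per-language test dispatch can look like, the sketch below maps a language to its test command and treats a timeout as a failure. The table, the function names, and the 120-second default are all assumptions; only commands that appear in this guide are listed.

```python
import subprocess

# Hypothetical dispatch table built from the commands shown in this guide.
# Java and JavaScript are omitted rather than guessed.
TEST_COMMANDS = {
    "go": ["go", "test"],
    "python": ["uv", "run", "python3", "-m", "pytest", "-o", "markers=task"],
    "rust": ["cargo", "test"],
}


def command_for(language: str) -> list[str]:
    """Look up the test command for a language, failing loudly if unsupported."""
    try:
        return list(TEST_COMMANDS[language])
    except KeyError:
        raise ValueError(f"no test command registered for {language!r}") from None


def run_unit_tests(language: str, exercise_dir: str, timeout_seconds: float = 120.0) -> bool:
    """Run the language's test command in the exercise directory.

    Returns True only if the command exits with status 0 before the timeout.
    """
    try:
        completed = subprocess.run(
            command_for(language),
            cwd=exercise_dir,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False  # A hung test run counts as a failed eval in this sketch.
    return completed.returncode == 0
```

Supporting a new language in a design like this means adding one entry to the table, which is why a language whose test runner follows the usual exit-code convention may need little or no special handling.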
### Step 5: Create Initial Exercises
Create at least two or three exercises for the new language, following the structure described in [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages).