# Evaluation Scripts
## Create Golden Test Set
The `create_test_set.py` script helps you interactively build a golden test dataset for evaluating the retrieval system.
### Quick Start
```bash
python scripts/create_test_set.py
```
### How It Works
1. **Shows random items** from your collection with full details
2. **Helps you write natural-language queries** for each item
3. **Automatically records ground truth** (item IDs, categories, etc.)
4. **Saves to `data/eval/test_queries.json`**
### Interactive Example
```
🔍 Golden Test Set Builder
Options:
1. Browse random items from all categories
2. Browse items from a specific category
How many items to browse? [10]: 5
════════════════════════════════════════
ITEM 1
════════════════════════════════════════
Category: Food
Headline: Tofuya Ukai dining spot beneath Tokyo Tower
Summary: Beautiful restaurant beneath Tokyo Tower...
Create a query for this item?
(y)es, (n)o, (m)ulti-item, (q)uit [y]: y
Query: What restaurants are in Tokyo?
Query Type: 1 (location_search)
Reference answer (optional): Tofuya Ukai
✓ Created query q001
```
### Features
#### Query Creation Modes
- **Single-item** (`y`) - One query → one item
- **Multi-item** (`m`) - One query → multiple items (e.g., "all Tokyo restaurants")
- **Skip** (`n`) - Skip current item
- **Quit** (`q`) - Save and exit
#### Query Types
1. **location_search** - "restaurants in Tokyo"
2. **category_search** - "beauty products"
3. **specific_question** - "what perfume brands..."
4. **object_content** - "images with text"
5. **complex_multi_part** - "Japanese food and shopping"
### Output Format
Creates `data/eval/test_queries.json`:
```json
{
"queries": [
{
"id": "q001",
"query": "What restaurants are in Tokyo?",
"type": "location_search",
"ground_truth_items": ["item-id-1", "item-id-2"],
"expected_category": "Food",
"min_expected_results": 2,
"reference_answer": "Tofuya Ukai"
}
]
}
```
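Before running an evaluation, it can help to sanity-check the generated file. The following is a minimal sketch, not part of `create_test_set.py`: the function name `validate_test_set` is hypothetical, and treating `id`, `query`, `type`, and `ground_truth_items` as required fields is an assumption based on the sample entry above.

```python
import json
from pathlib import Path

# Assumption: these fields are required in every entry; the sample above
# suggests it, but the script itself may be more lenient.
REQUIRED_FIELDS = ("id", "query", "type", "ground_truth_items")


def validate_test_set(data):
    """Return a list of human-readable problems found in a parsed test set."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(data.get("queries", [])):
        label = entry.get("id", f"entry #{index}")
        # Report every required field that is absent from this entry.
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{label}: missing field '{field}'")
        # Query IDs should be unique across the file.
        entry_id = entry.get("id")
        if entry_id is not None:
            if entry_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(entry_id)
        # A query with an empty ground-truth list cannot be scored.
        if "ground_truth_items" in entry and not entry["ground_truth_items"]:
            problems.append(f"{label}: no ground truth items")
    return problems


if __name__ == "__main__":
    path = Path("data/eval/test_queries.json")
    if path.exists():
        issues = validate_test_set(json.loads(path.read_text(encoding="utf-8")))
        print("\n".join(issues) or "No problems found")
```

An empty result means every entry has the expected fields, a unique ID, and at least one ground-truth item.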
### Best Practices
✅ **Do:**
- Create 20-30 diverse queries
- Use natural language
- Include various query types
- Add reference answers for key queries
❌ **Don't:**
- Create only one query type
- Use overly specific/technical queries
- Skip categories you want to evaluate
### Keyboard Shortcuts
- `y` - Create query
- `n` - Skip item
- `m` - Multi-item query
- `q` - Quit
- `Ctrl+C` - Cancel current input
### Next Steps
After creating queries:
```bash
# Review the test set
python -m json.tool data/eval/test_queries.json
# Run evaluation (see below)
python scripts/evaluate_retrieval.py
```
## Evaluate Retrieval Quality
The `evaluate_retrieval.py` script measures search quality using standard information retrieval metrics: Precision@K, Recall@K, MRR, and NDCG@K.
### Quick Start
```bash
# Standard run against golden database (uses subdomain routing)
python scripts/evaluate_retrieval.py
# Verbose output with detailed progress
python scripts/evaluate_retrieval.py --verbose
```
### Prerequisites
1. **Start the API server** on port 8000:
```bash
uvicorn main:app --port 8000
```
The evaluation script automatically uses `golden.localhost` subdomain routing to access the golden database.
2. **Ensure you have the evaluation dataset**:
- `data/eval/retrieval_evaluation_dataset.json` (50 test queries)
### How It Works
1. **Validates API connection** - Finds a running API server (tries ports 8000, 8001, 8080, 3000)
2. **Routes to golden database** - Uses `golden.localhost` subdomain routing (via the `Host` header)
3. **Verifies item count** - Ensures you're testing against the golden DB (55 items)
4. **Runs all queries** - Executes 50 test queries against the search endpoint
5. **Calculates metrics** - Computes Precision@K, Recall@K, MRR, NDCG@K
6. **Generates reports** - Creates markdown and JSON reports with detailed results
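Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the helper names `candidate_base_urls` and `golden_request` are hypothetical, the order in which a non-default port is tried is a guess, and sending `golden.localhost` without a port in the `Host` header is an assumption about what the server expects.

```python
import urllib.request

# Ports the evaluation script is documented to try.
CANDIDATE_PORTS = (8000, 8001, 8080, 3000)
GOLDEN_HOST = "golden.localhost"


def candidate_base_urls(preferred_port=8000):
    """List base URLs to probe, starting with the preferred port (assumed order)."""
    ports = [preferred_port] + [p for p in CANDIDATE_PORTS if p != preferred_port]
    return [f"http://localhost:{port}" for port in ports]


def golden_request(base_url, path):
    """Build a GET request whose Host header selects the golden database.

    The connection still goes to base_url; only the Host header changes,
    which is what lets the server route by subdomain.
    """
    return urllib.request.Request(
        url=f"{base_url}{path}",
        headers={"Host": GOLDEN_HOST},
        method="GET",
    )


if __name__ == "__main__":
    # "/search" is a placeholder path; the real endpoint is not documented here.
    request = golden_request(candidate_base_urls()[0], "/search")
    print(request.full_url, request.get_header("Host"))
```

Overriding the `Host` header means `golden.localhost` does not need to resolve in DNS or `/etc/hosts`; the request is still sent to `localhost`.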
### Metrics Calculated
For each K value (1, 3, 5, 10):
- **Precision@K** - Of the top K results, what fraction are relevant?
- **Recall@K** - Of all relevant items, what fraction appear in top K?
- **NDCG@K** - Normalized Discounted Cumulative Gain (accounts for ranking quality)
- **MRR** - Mean Reciprocal Rank (average, over all queries, of 1/rank of the first relevant result)
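To make the definitions concrete, here is a minimal sketch of the four metrics for a single query. It assumes binary relevance (an item is either in the ground truth or not) and uses illustrative function names; `evaluate_retrieval.py` may implement them differently, for example with graded relevance.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant items (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG of the top k divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # The ideal ranking places every relevant item first, up to k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]   # system output, best first
    relevant = {"b", "d"}           # ground truth for this query
    print(precision_at_k(ranked, relevant, 3))  # 1 relevant item in 3 slots
    print(recall_at_k(ranked, relevant, 3))     # 1 of 2 relevant items found
    print(reciprocal_rank(ranked, relevant))    # first hit at rank 2
    print(ndcg_at_k(ranked, relevant, 3))
```

MRR is then the mean of `reciprocal_rank` over all queries, and each @K metric is averaged across queries in the same way.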
### Command-Line Options
```bash
python scripts/evaluate_retrieval.py [OPTIONS]
Options:
--port PORT API port (default: 8000)
--base-url URL Full base URL (overrides port)
--use-golden-subdomain Use golden.localhost routing (default: True)
--no-golden-subdomain Disable golden routing (test against production)
--dataset PATH Evaluation dataset path
--output-dir PATH Report output directory (default: data/eval/reports)
--top-k VALUES K values for metrics (default: 1,3,5,10)
--expected-items N Expected DB items (default: 55)
--skip-item-check Skip item count validation
--verbose Show detailed progress
```
### Examples
```bash
# Run against golden database (default)
python scripts/evaluate_retrieval.py
# Run against production database
python scripts/evaluate_retrieval.py --no-golden-subdomain --skip-item-check
# Custom K values
python scripts/evaluate_retrieval.py --top-k 1,5,10,20
# Remote server
python scripts/evaluate_retrieval.py --base-url http://192.168.1.100:8000
# Custom output location
python scripts/evaluate_retrieval.py --output-dir ./my_reports
```
### Output Reports
Creates two timestamped files in `data/eval/reports/`:
1. **`eval_YYYYMMDD_HHMMSS_report.md`** - Human-readable markdown report
- Summary metrics table
- Breakdown by query type
- Detailed results for all 50 queries
2. **`eval_YYYYMMDD_HHMMSS_report.json`** - Machine-readable JSON
- Full metric data
- Individual query results
- Timing statistics
- Configuration details
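Because the timestamp in each filename sorts chronologically as plain text, the most recent JSON report can be located with a simple glob. The sketch below is an illustration only: `latest_report` is a hypothetical helper, and since the report's internal keys are not documented here, it just lists the top-level keys rather than assuming any.

```python
import json
from pathlib import Path


def latest_report(report_dir="data/eval/reports"):
    """Return the path of the newest JSON report, or None if there is none."""
    # YYYYMMDD_HHMMSS timestamps sort chronologically as plain strings.
    reports = sorted(Path(report_dir).glob("eval_*_report.json"))
    return reports[-1] if reports else None


if __name__ == "__main__":
    newest = latest_report()
    if newest is None:
        print("No reports found")
    else:
        report = json.loads(newest.read_text(encoding="utf-8"))
        # Only inspect the structure; no specific keys are assumed.
        if isinstance(report, dict):
            print(newest.name, sorted(report))
        else:
            print(newest.name, type(report).__name__)
```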
### Sample Output
```
============================================================
Retrieval Evaluation Script
============================================================
✓ API endpoint: http://localhost:8000
✓ Item count validated: 55 items
Loaded dataset: 50 queries
Evaluating 50 queries...
Progress: 50/50 (100%)
Completed in 8.43s
✓ Reports generated:
- data/eval/reports/eval_20241214_153022_report.json
- data/eval/reports/eval_20241214_153022_report.md
============================================================
Evaluation complete!
============================================================
```
### Safety Features
The script prevents accidental evaluation against production:
- **Golden subdomain routing** - Automatically routes to golden database by default
- **Item count validation** - Checks that DB has exactly 55 items (golden dataset size)
- **Port auto-discovery** - Tries common ports if default fails
- **Clear warnings** - Shows warning if item count doesn't match expected
Use `--no-golden-subdomain --skip-item-check` to run against production instead.