# AGENTS.md
This file provides guidance to AI coding assistants when working with dictionary data in this directory.
## Overview
The `Source/Data/` directory contains all dictionary data files and the build system for generating McBopomofo's language model data. These files define the mapping between Bopomofo phonetic input and Traditional Chinese characters/phrases, along with frequency data and special features like macros and symbols.
**For comprehensive dictionary development workflows**, see [Wiki: 詞庫開發說明](https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明). This document provides technical reference for AI assistants.
**GitHub Copilot users:** See `.github/instructions/Data.instructions.md` for path-specific Copilot instructions for this directory.
**General guidelines** (emoji, date/time format, Conventional Commits, etc.) are defined in the root `AGENTS.md` and `.github/copilot-instructions.md`.
**Key Principles:**
- In most cases, you'll be **adding** new characters or phrases rather than deleting existing ones
## File Descriptions
### Source Data Files (Input)
| File | Purpose | Format |
|------|---------|--------|
| `BPMFBase.txt` | Single character Bopomofo mappings | `character bopomofo pinyin tone tag` |
| `BPMFMappings.txt` | Multi-character phrases (2-6 chars) | `phrase bpmf1 bpmf2 ...` (space-separated) |
| `BPMFPunctuations.txt` | Punctuation marks mapping | Similar to BPMFBase |
| `phrase.occ` | Phrase frequency/occurrence data | `phrase frequency` (tab-separated) |
| `heterophony1.list` | Primary heterophony readings | `character bopomofo` |
| `heterophony2.list` | Secondary heterophony readings | `character bopomofo` |
| `heterophony3.list` | Tertiary heterophony readings | `character bopomofo` |
| `exclusion.txt` | Phrase frequency exclusions | `phrase context_to_exclude` (tab-separated) |
| `Symbols.txt` | Special symbols (era names, etc.) | `symbol bopomofo score` |
| `Macros.txt` | Text macros (date/time) | `MACRO@NAME bopomofo score` |
| `associated-punctuation.txt` | Punctuation for phrase associations | Special format for derive script |
### Generated Output Files
| File | Generated By | Purpose |
|------|--------------|---------|
| `data.txt` | `make all` | Main language model data for McBopomofo |
| `data-plain-bpmf.txt` | `make all` | Data for traditional Bopomofo IME mode |
| `associated-phrases-v2.txt` | `make all` | Associated phrase suggestions |
| `PhraseFreq.txt` | `make all` | Compiled frequency data |
**Important:** Generated output files are built locally and NOT committed to git (listed in `.gitignore`).
## Common Commands
```bash
# Essential Make Targets
make all # Generate all output files
make sort # Sort all data files with correct C locale
make check # Validate data integrity
make tidy # Clean up formatting issues
make clean # Clean generated files
# Recommended workflow: format, sort, validate, and build
make tidy sort check all
```
### Building via Xcode
```bash
xcodebuild -project ../McBopomofo.xcodeproj -target Data -configuration Debug build
# Or select "Data" scheme in Xcode and build (⌘+B)
```
This runs `make all` as part of the Xcode build process. For dictionary development, using `make` directly is recommended.
### Critical: C Locale Sorting
**All data files MUST be sorted with C locale** for binary search compatibility:
```bash
# Primary data files
LC_ALL=C sort -o BPMFMappings.txt BPMFMappings.txt
LC_ALL=C sort -o phrase.occ phrase.occ
# Heterophony lists
env LANG=C sort -k1 heterophony1.list | uniq > tmp && mv tmp heterophony1.list
env LANG=C sort -k1 heterophony2.list | uniq > tmp && mv tmp heterophony2.list
env LANG=C sort -k1 heterophony3.list | uniq > tmp && mv tmp heterophony3.list
```
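For tooling written in Python, the same ordering can be reproduced by comparing lines as raw UTF-8 bytes, which is what `sort` does in the C locale. The following is a minimal sketch and not part of the build system; `c_locale_sorted` and `is_c_locale_sorted` are hypothetical helper names.

```python
def c_locale_sorted(lines):
    """Return lines ordered by raw UTF-8 bytes, like `LC_ALL=C sort`."""
    return sorted(lines, key=lambda line: line.encode("utf-8"))


def is_c_locale_sorted(lines):
    """Return True when lines are already in byte-wise (C locale) order."""
    encoded = [line.encode("utf-8") for line in lines]
    return all(a <= b for a, b in zip(encoded, encoded[1:]))
```

A check like `is_c_locale_sorted` could run before a commit to catch files that were sorted under a different locale.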
## Workflow: Adding/Modifying Phrases
### Adding a New Multi-Character Phrase
1. **Add to BPMFMappings.txt:**
```
新詞彙 ㄒㄧㄣ ㄘˊ ㄏㄨㄟˋ
```
Format: phrase, then Bopomofo for each character (space-separated)
2. **Add to phrase.occ with frequency:**
```
新詞彙 100
```
- Frequency must be a non-negative integer (0 is acceptable, negative values are NOT)
- Use tab separator between phrase and frequency
- Higher frequency = more common phrase
3. **Sort both files:**
```bash
make sort
# Or manually:
LC_ALL=C sort -o BPMFMappings.txt BPMFMappings.txt
LC_ALL=C sort -o phrase.occ phrase.occ
```
4. **Validate and build:**
```bash
make check # Validate integrity
make all # Generate output files
```
### Adding a Single Character
Single characters go into `BPMFBase.txt`:
```
字 ㄗˋ zi4 -4 big5
```
Format: `character bopomofo pinyin tone tag`
### Handling Heterophony Characters (破音字)
Do NOT suggest any changes to `heterophony1.list`, `heterophony2.list` or `heterophony3.list`. If a PR contains any changes to those files, highlight them and ask human reviewers to pay attention to them.
For Mandarin heteronyms, an entry in `heterophony1.list` sets the primary reading and causes other readings of the same character to be demoted. This can be problematic for high-frequency characters whose other readings are also common. To compensate, `heterophony2.list` and `heterophony3.list` "promote" those readings back, but with a frequency discount to ensure they don't outrank the primary one.
This is a complex language model tuning mechanism, so agents must NOT suggest changes on their own.
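The review rule above (highlight any change to the heterophony lists) can be sketched as a small filter over the paths touched by a PR. This is an illustrative sketch only; `flag_protected_files` and `PROTECTED_PATTERN` are hypothetical names, and how the list of changed paths is obtained is left to the caller.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumption: the protected files are exactly those named heterophony*.list.
PROTECTED_PATTERN = "heterophony*.list"


def flag_protected_files(changed_paths):
    """Return the changed paths whose file name matches the protected pattern."""
    return [
        path
        for path in changed_paths
        if fnmatchcase(PurePosixPath(path).name, PROTECTED_PATTERN)
    ]
```

Any path returned by this function should be called out for human reviewers rather than edited by the agent.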
### Adding Emojis and Symbols
Emojis/symbols are allowed but **MUST NOT be the default candidate**. Add to `Symbols.txt` with negative score (e.g., `-8`).
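As a rough illustration of this rule, the sketch below lists `Symbols.txt` entries whose score is not negative. It assumes every non-blank line ends with a numeric score, as in the `symbol bopomofo score` format; `non_negative_symbol_entries` is a hypothetical helper, not an existing project function.

```python
def non_negative_symbol_entries(lines):
    """Return (line_number, line) pairs whose trailing score is not negative.

    Assumes each non-blank line is `symbol bopomofo score` with a numeric score.
    """
    offenders = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        score = float(line.split()[-1])
        if score >= 0:
            offenders.append((number, line))
    return offenders
```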
## File Format Specifications
| File | Format | Rules | Example |
|------|--------|-------|---------|
| **BPMFMappings.txt** | `phrase bpmf1 bpmf2 ...` | Space-separated; character count = Bopomofo count; C locale sorted | `小麥注音 ㄒㄧㄠˇ ㄇㄞˋ ㄓㄨˋ ㄧㄣ` |
| **phrase.occ** | `phrase<TAB>frequency` | Tab-separated; frequency ≥ 0 (no negatives); C locale sorted | `小麥<TAB>120` |
| **heterophony*.list** | `character bopomofo` | One reading per line; sorted by character | `中 ㄓㄨㄥ` |
| **exclusion.txt** | `phrase<TAB>context` | Tab-separated; excludes phrase when in context | `一下<TAB>國一下` |
| **Symbols.txt** | `symbol bopomofo score` | Space-separated; negative score for low priority | `平成 ㄆㄧㄥˊ-ㄔㄥˊ -8` |
| **Macros.txt** | `MACRO@NAME bopomofo score` | Space-separated; runtime expansion | `MACRO@DATE_TODAY ㄐㄧㄣ-ㄊㄧㄢ -8` |
**Critical Format Rules:**
- BPMFMappings.txt and phrase.occ use **different separators** (space vs tab)
- Character count in phrase must equal number of Bopomofo readings
- Frequencies must be non-negative integers (0 acceptable, negatives NOT allowed)
- All files must be C locale sorted for binary search compatibility
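The per-line rules above can be expressed as two small predicates. This is a sketch under the stated format assumptions, not the project's actual `make check` logic; `mapping_line_ok` and `occ_line_ok` are hypothetical names, and each line is expected without its trailing newline.

```python
def mapping_line_ok(line):
    """Check a BPMFMappings.txt line: one Bopomofo reading per character."""
    fields = line.split(" ")
    if len(fields) < 2 or "" in fields:
        return False  # missing readings, or doubled/trailing spaces
    phrase, readings = fields[0], fields[1:]
    # len() counts Unicode code points, so each character counts once.
    return len(phrase) == len(readings)


def occ_line_ok(line):
    """Check a phrase.occ line: phrase, one tab, non-negative integer."""
    parts = line.split("\t")
    if len(parts) != 2 or not parts[0]:
        return False
    # ASCII digits only: rejects "-5", "", and non-numeric text.
    return parts[1].isascii() and parts[1].isdigit()
```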
## Python Package Structure
The `curation/` package contains library modules organized into submodules:
| Submodule | Purpose |
|-----------|---------|
| `curation.builders` | Data building and processing tools (frequency_builder, phrase_deriver) |
| `curation.compilers` | Data compilation tools (main_compiler, plain_bpmf_compiler) |
| `curation.validators` | Validation and analysis tools (score_validator) |
| `curation.utils` | General utilities (text_filter) |
Scripts with side effects are located in `scripts/`:
| Script | Purpose |
|--------|---------|
| `count_occurrences.py` | Counts phrase occurrences in text corpus |
| `analyze_data.py` | Analyzes dictionary data and generates reports |
| `map_bpmf.py` | Helper for automatic Bopomofo mapping |
### Python Development Guidelines
**CRITICAL RULES for AI coding assistants:**
#### 1. Module Organization Rules
**Library modules** (in `curation/` package) **MUST NOT** have side effects at module level:
- **PROHIBITED**: Opening files, reading/writing data, printing output at module level
- **PROHIBITED**: Executing code immediately when module is imported
- **REQUIRED**: All initialization must be in functions
- **REQUIRED**: Module must be importable without executing code
**Example of BAD module (violates rules):**
```python
# BAD: Has side effects at import time
import configparser
config = configparser.ConfigParser()
config.read('config.ini') # WRONG: Reads file at import!
corpus = open('corpus.txt').read() # WRONG: Opens file at import!
```
**Example of GOOD module:**
```python
# GOOD: No side effects, importable as library
import configparser


def load_config(config_path='config.ini'):
    """Load configuration from file."""
    config = configparser.ConfigParser()
    config.read(config_path)
    return config


def main():
    """Main entry point for CLI usage."""
    config = load_config()
    # ... rest of logic


if __name__ == '__main__':
    main()
```
#### 2. Import Guidelines
- **REQUIRED**: All imports at top of file
- **NEVER** use inline imports (unless explicitly necessary for specific technical reasons)
- Use relative imports within package (e.g., `from .compiler_utils import HEADER`)
- Use absolute imports for external packages (e.g., `import argparse`)
**Example of BAD imports:**
```python
# BAD: Inline import
def process_data(data):
    import pandas as pd  # WRONG: NEVER do this!
    return pd.DataFrame(data)
```
**Example of GOOD imports:**
```python
# GOOD: All imports at top
import argparse
import sys
from typing import List, Dict

from .compiler_utils import HEADER


def process_data(input_data):
    result = []
    for item in input_data:
        result.append(item.upper())
    return result
```
#### 3. Side Effect Management
**Scripts with side effects belong in `scripts/` directory, NOT in `curation/` package.**
Scripts that do any of the following must be in `scripts/`:
- Read configuration files at module level
- Open and process data files at module level
- Print output or generate reports at module level
- Execute analysis immediately when imported
**When to use `scripts/` vs `curation/`:**
| Location | Purpose | Characteristics |
|----------|---------|-----------------|
| `scripts/` | Pure CLI tools | Has side effects; not importable as library; immediate execution |
| `curation/` | Library modules | No side effects; importable; reusable functions |
#### 4. Script vs Library Separation
**Library modules** in `curation/` should:
- Provide reusable functions and classes
- Have a `main()` function for CLI usage
- Be declared in `pyproject.toml` `[project.scripts]`
- Follow pattern: `mcbpmf-tool-name = "curation.module:main"`
**Scripts** in `scripts/` should:
- Be standalone executables
- Have all logic in `if __name__ == '__main__':` block
- Be called directly: `python3 scripts/script_name.py`
- NOT be imported by other modules
#### 5. PEP-8 Naming Conventions
- Module names: `lowercase_with_underscores.py`
- Function names: `lowercase_with_underscores()`
- Class names: `CapitalizedWords`
- Constants: `UPPERCASE_WITH_UNDERSCORES`
**Examples:**
- GOOD: `frequency_builder.py`, `main_compiler.py`, `text_filter.py`
- BAD: `buildFreq.py`, `nonCJK_filter.py`, `cook-plain-bpmf.py`
#### 6. Package Installation
Install as editable package for development:
```bash
pip install -e . # Install package
pip install -e ".[dev]" # Install with dev dependencies
pip install -e ".[notebook]" # Install with notebook dependencies
```
After installation, use console scripts:
```bash
mcbpmf-build-freq # Instead of: python3 -m curation.builders.frequency_builder
mcbpmf-compile # Instead of: python3 -m curation.compilers.main_compiler
mcbpmf-validate-scores # Instead of: python3 -m curation.validators.score_validator
```
## Project Path Configuration
All scripts and modules use centralized path constants from the `curation` package:
```python
import configparser
from pathlib import Path

from curation import PROJECT_ROOT, CONFIG_FILE
# PROJECT_ROOT = Source/Data/ directory (where pyproject.toml lives)
# CONFIG_FILE = Source/Data/textpool.rc

# Example usage in scripts
config = configparser.ConfigParser()
config.read(CONFIG_FILE)
corpus_path = Path(config.get('data', 'corpus_path')).expanduser()
```
**Do NOT** compute paths relatively in individual scripts:
- BAD: `Path(__file__).parent.parent`
- BAD: `os.path.abspath(sys.argv[0]).split('/')`
- GOOD: `from curation import PROJECT_ROOT`
This ensures:
- Single source of truth for project structure
- Easy refactoring if directory structure changes
- Consistent behavior across all tools
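How the `curation` package computes these constants is not shown in this document. As one possible approach, a central module could locate the project root once by searching upward for `pyproject.toml`. The sketch below is an assumption for illustration; `find_project_root` is a hypothetical helper.

```python
from pathlib import Path


def find_project_root(start, marker="pyproject.toml"):
    """Walk upward from start until a directory containing marker is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"{marker} not found at or above {start}")
```

The point of the design is that this lookup lives in exactly one place; individual scripts keep importing `PROJECT_ROOT` rather than repeating the search.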
## Historical Context: Tool Evolution (2012-2025)
### Migration from bin/ to curation/ (October 2024)
Prior to October 2024, all Python tools were located in the `bin/` directory (now renamed to `bin_legacy/`). This directory accumulated tools over 13+ years with contributions from multiple developers.
#### Tool Creation Timeline
- **2012-08-06**: `cook.py` created by Mengjuei Hsieh, replacing Ruby implementation
- **2012-09-16**: `buildFreq.py` created, replacing bash version
- **2013-01-02**: `self-score-test.py` added for quality validation
- **2013-01-21**: C version moved to `C_Version/` subdirectory ("phasing out")
- **2024-03-15**: `derive_associated_phrases.py` added by Lukhnos Liu (v2 system)
- **2024-08-25**: `audit_encoding.swift` added by zonble
- **2025-03-08**: `cook.py` modernized with Black formatting and argparse
#### Why Migration Was Needed
The bin/ structure had accumulated issues:
1. Flat organization (~30 files, no logical grouping)
2. Mixed concerns (library code, CLI scripts, config files, legacy tools)
3. Inconsistent naming conventions
4. Some modules had side effects at import time
5. Not installable as proper Python package
6. Each script calculated paths differently
#### What Was Migrated vs Preserved
**Migrated to `curation/` package** (actively used):
- All compilation and build tools
- Frequency calculation
- Data validation
- Text processing utilities
**Moved to `scripts/`** (CLI-only, with side effects):
- Corpus occurrence counting
- Data analysis reports
- BPMF mapping helpers
**Preserved in `bin_legacy/`** (historical reference):
- `audit_encoding.swift` - Still usable standalone tool (2024)
- `C_Version/` - Fast C implementation, phased out in 2013
- `Sample_Prep/` - Corpus preparation methodology
- `disabled/` - Legacy Perl/Ruby/Bash implementations
#### Path Configuration Changed
- **Before**: Each script calculated paths relatively
- **After**: Import from `curation` package: `from curation import PROJECT_ROOT, CONFIG_FILE`
#### Using Legacy Tools
The `audit_encoding.swift` tool is still functional:
```bash
cd bin_legacy
swift audit_encoding.swift # Validates BPMFBase.txt encoding categories
```
C version (for performance comparison):
```bash
cd bin_legacy/C_Version
export TEXTPOOL=/path/to/corpus
./count.bash 測試詞彙
```
For complete migration history and tool details, see `bin_legacy/DEPRECATED.md`.
#### Key Contributors
- **Mengjuei Hsieh**: Original Python implementation (2012-2013)
- **Lukhnos Liu**: Modernization and associated phrases v2 (2024-2025)
- **zonble**: Encoding audit tool (2024)
## Data Generation Pipeline
The build process flows: `phrase.occ` + `exclusion.txt` → `buildFreq.py` → `PhraseFreq.txt` → `cook.py` (+ other inputs) → `data.txt` → `derive_associated_phrases.py` → `associated-phrases-v2.txt`.
For detailed pipeline diagram, frequency calculation algorithms, and heterophony processing logic, see `algorithm.md` section "字典資料的生成與使用".
## Editorial Guidelines
For editorial policies, see [Wiki: 詞庫開發說明](https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明).
### Phrase Quality Control
Check phrase rarity with a web search restricted to Taiwanese sites, using the query `site:.tw "phrase"`. A phrase with under 1,000 results is likely safe to remove.
### Data Integrity Checks
```bash
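# Character count matches Bopomofo count (empty output = good).
# Assumes length() counts bytes (forced here with LC_ALL=C) and that
# every character in the phrase is 3 bytes in UTF-8.
LC_ALL=C awk 'length($1)/3!=NF-1' BPMFMappings.txt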
# Phrase consistency check
diff -u <(awk '{print $1}' BPMFMappings.txt|sort -u) \
<(awk 'length($1)>3{print $1}' phrase.occ|sort -u)
```
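The phrase consistency check can also be written in Python, where `len()` counts characters rather than bytes. This is a minimal sketch mirroring the `diff` command above, not an existing project tool; `phrase_mismatches` is a hypothetical name, and the inputs are the lines of `BPMFMappings.txt` and `phrase.occ`.

```python
def phrase_mismatches(mapping_lines, occ_lines):
    """Compare multi-character phrases between the two files.

    Returns (only_in_mappings, only_in_occ) as sorted lists.
    """
    mapped = {line.split()[0] for line in mapping_lines if line.strip()}
    counted = set()
    for line in occ_lines:
        if not line.strip():
            continue
        phrase = line.split("\t")[0]
        if len(phrase) > 1:  # single characters are not in BPMFMappings.txt
            counted.add(phrase)
    return sorted(mapped - counted), sorted(counted - mapped)
```

Two empty lists correspond to an empty `diff` output.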
## Testing Your Changes
```bash
make tidy sort # Format and sort
make check # Validate integrity
make all # Build output files
make _install # Install to ~/Library/Input Methods/McBopomofo.app/
pkill -HUP McBopomofo # Restart
```
## Common Issues and Solutions
| Issue | Solution |
|-------|----------|
| Sort order is wrong | Always use C locale: `LC_ALL=C sort -o file file` |
| Phrase in BPMFMappings.txt not showing | Verify phrase exists in `phrase.occ` with non-zero frequency |
| make check fails (character count) | Ensure character count = Bopomofo reading count |
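| Heterophony shows wrong default | Do not edit the heterophony lists yourself; report the issue and ask human reviewers to adjust `heterophony1.list` (primary reading) or `heterophony2.list`/`heterophony3.list` (promoted readings) |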
| Emoji appears as first candidate | Add to `Symbols.txt` with negative score (e.g., -8) |
## Reference Files for Context
- Root `AGENTS.md`: Overall project architecture and build system
- `algorithm.md`: Detailed algorithm explanation (Chinese)
- [Wiki: 程式架構](https://github.com/openvanilla/McBopomofo/wiki/程式架構): Program architecture
- [Wiki: 詞庫開發說明](https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明): Dictionary development guide
- [Wiki: 使用手冊](https://github.com/openvanilla/McBopomofo/wiki/使用手冊): User manual