AGENTS.md

# AGENTS.md

This file provides guidance to AI coding assistants when working with dictionary data in this directory.

## Overview

The `Source/Data/` directory contains all dictionary data files and the build system for generating McBopomofo's language model data. These files define the mapping between Bopomofo phonetic input and Traditional Chinese characters/phrases, along with frequency data and special features like macros and symbols.

**For comprehensive dictionary development workflows**, see [Wiki: 詞庫開發說明](https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明). This document provides technical reference for AI assistants.

**GitHub Copilot users:** See `.github/instructions/Data.instructions.md` for path-specific Copilot instructions for this directory.

**General guidelines** (emoji, date/time format, Conventional Commits, etc.) are defined in the root `AGENTS.md` and `.github/copilot-instructions.md`.

**Key Principles:**

- In most cases, you'll be **adding** new characters or phrases rather than deleting existing ones

## File Descriptions

### Source Data Files (Input)

| File | Purpose | Format |
|------|---------|--------|
| `BPMFBase.txt` | Single character Bopomofo mappings | `character bopomofo pinyin tone tag` |
| `BPMFMappings.txt` | Multi-character phrases (2-6 chars) | `phrase bpmf1 bpmf2 ...` (space-separated) |
| `BPMFPunctuations.txt` | Punctuation marks mapping | Similar to BPMFBase |
| `phrase.occ` | Phrase frequency/occurrence data | `phrase frequency` (tab-separated) |
| `heterophony1.list` | Primary heterophony readings | `character bopomofo` |
| `heterophony2.list` | Secondary heterophony readings | `character bopomofo` |
| `heterophony3.list` | Tertiary heterophony readings | `character bopomofo` |
| `exclusion.txt` | Phrase frequency exclusions | `phrase context_to_exclude` (tab-separated) |
| `Symbols.txt` | Special symbols (era names, etc.) | `symbol bopomofo score` |
| `Macros.txt` | Text macros (date/time) | `MACRO@NAME bopomofo score` |
| `associated-punctuation.txt` | Punctuation for phrase associations | Special format for derive script |

### Generated Output Files

| File | Generated By | Purpose |
|------|--------------|---------|
| `data.txt` | `make all` | Main language model data for McBopomofo |
| `data-plain-bpmf.txt` | `make all` | Data for traditional Bopomofo IME mode |
| `associated-phrases-v2.txt` | `make all` | Associated phrase suggestions |
| `PhraseFreq.txt` | `make all` | Compiled frequency data |

**Important:** Generated output files are built locally and NOT committed to git (listed in `.gitignore`).

## Common Commands

```bash
# Essential Make Targets
make all      # Generate all output files
make sort     # Sort all data files with correct C locale
make check    # Validate data integrity
make tidy     # Clean up formatting issues
make clean    # Clean generated files

# Recommended workflow: format, sort, validate, and build
make tidy sort check all
```

### Building via Xcode

```bash
xcodebuild -project ../McBopomofo.xcodeproj -target Data -configuration Debug build
# Or select "Data" scheme in Xcode and build (⌘+B)
```

This runs `make all` as part of the Xcode build process. For dictionary development, using `make` directly is recommended.

### Critical: C Locale Sorting

**All data files MUST be sorted with C locale** for binary search compatibility:

```bash
# Primary data files
LC_ALL=C sort -o BPMFMappings.txt BPMFMappings.txt
LC_ALL=C sort -o phrase.occ phrase.occ

# Heterophony lists
env LANG=C sort -k1 heterophony1.list | uniq > tmp && mv tmp heterophony1.list
env LANG=C sort -k1 heterophony2.list | uniq > tmp && mv tmp heterophony2.list
env LANG=C sort -k1 heterophony3.list | uniq > tmp && mv tmp heterophony3.list
```

## Workflow: Adding/Modifying Phrases

### Adding a New Multi-Character Phrase

1. **Add to BPMFMappings.txt:**

   ```
   新詞彙 ㄒㄧㄣ ㄘˊ ㄏㄨㄟˋ
   ```

   Format: phrase, then Bopomofo for each character (space-separated)

2. **Add to phrase.occ with frequency:**

   ```
   新詞彙	100
   ```

   - Frequency must be a non-negative integer (0 is acceptable, negative values are NOT)
   - Use tab separator between phrase and frequency
   - Higher frequency = more common phrase

3. **Sort both files:**

   ```bash
   make sort
   # Or manually:
   LC_ALL=C sort -o BPMFMappings.txt BPMFMappings.txt
   LC_ALL=C sort -o phrase.occ phrase.occ
   ```

4. **Validate and build:**

   ```bash
   make check  # Validate integrity
   make all    # Generate output files
   ```

### Adding a Single Character

Single characters go into `BPMFBase.txt`:

```
字 ㄗˋ zi4 -4 big5
```

Format: `character bopomofo pinyin tone tag`

### Handling Heterophony Characters (破音字)

Do NOT suggest any changes to `heterophony1.list`, `heterophony2.list` or `heterophony3.list`. If a PR contains any changes to those files, highlight them and ask human reviewers to pay attention to them.

For Mandarin heteronyms, an entry in `heterophony1.list` sets the primary reading and causes other readings of the same character to be demoted. This can be problematic for high-frequency characters whose other readings are also common. To compensate, `heterophony2.list` and `heterophony3.list` "promote" those readings back, but with a frequency discount to ensure they don't outrank the primary one. This is a complex language model tuning mechanism, so agents must NOT suggest changes on their own.

### Adding Emojis and Symbols

Emojis/symbols are allowed but **MUST NOT be the default candidate**. Add to `Symbols.txt` with negative score (e.g., `-8`).
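The entry rules from the phrase workflow above (one Bopomofo reading per character, a tab-separated non-negative frequency, byte-order sorting) can be sketched as small Python checks. This is a minimal illustration only: `check_mapping_line`, `check_occ_line`, and `is_c_locale_sorted` are hypothetical helper names, not functions from the `curation/` package, and `make check` remains the authoritative validation.

```python
"""Illustrative entry checks (hypothetical helpers, not part of curation/)."""
from __future__ import annotations


def check_mapping_line(line: str) -> bool:
    """Return True if a BPMFMappings.txt-style line has one reading per character."""
    fields = line.split(" ")
    if len(fields) < 2 or "" in fields:
        return False
    phrase, readings = fields[0], fields[1:]
    # len() on a str counts code points, so each CJK character counts once.
    return len(phrase) == len(readings)


def check_occ_line(line: str) -> bool:
    """Return True if a phrase.occ-style line is `phrase<TAB>frequency`, frequency >= 0."""
    parts = line.split("\t")
    if len(parts) != 2 or not parts[0]:
        return False
    # isdigit() rejects a leading minus sign, so negative values fail here.
    return parts[1].isascii() and parts[1].isdigit()


def is_c_locale_sorted(lines: list[str]) -> bool:
    """Return True if lines are in UTF-8 byte order, as `LC_ALL=C sort` orders them."""
    encoded = [line.encode("utf-8") for line in lines]
    return all(a <= b for a, b in zip(encoded, encoded[1:]))
```

For instance, `check_mapping_line("新詞彙 ㄒㄧㄣ ㄘˊ ㄏㄨㄟˋ")` passes because three characters are followed by three readings, while a line with two readings merged into one field would fail.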
## File Format Specifications

| File | Format | Rules | Example |
|------|--------|-------|---------|
| **BPMFMappings.txt** | `phrase bpmf1 bpmf2 ...` | Space-separated; character count = Bopomofo count; C locale sorted | `小麥注音 ㄒㄧㄠˇ ㄇㄞˋ ㄓㄨˋ ㄧㄣ` |
| **phrase.occ** | `phrase<TAB>frequency` | Tab-separated; frequency ≥ 0 (no negatives); C locale sorted | `小麥<TAB>120` |
| **heterophony*.list** | `character bopomofo` | One reading per line; sorted by character | `中 ㄓㄨㄥ` |
| **exclusion.txt** | `phrase<TAB>context` | Tab-separated; excludes phrase when in context | `一下<TAB>國一下` |
| **Symbols.txt** | `symbol bopomofo score` | Space-separated; negative score for low priority | `平成 ㄆㄧㄥˊ-ㄔㄥˊ -8` |
| **Macros.txt** | `MACRO@NAME bopomofo score` | Space-separated; runtime expansion | `MACRO@DATE_TODAY ㄐㄧㄣ-ㄊㄧㄢ -8` |

**Critical Format Rules:**

- BPMFMappings.txt and phrase.occ use **different separators** (space vs tab)
- Character count in phrase must equal number of Bopomofo readings
- Frequencies must be non-negative integers (0 acceptable, negatives NOT allowed)
- All files must be C locale sorted for binary search compatibility

## Python Package Structure

The `curation/` package contains library modules organized into submodules:

| Submodule | Purpose |
|-----------|---------|
| `curation.builders` | Data building and processing tools (frequency_builder, phrase_deriver) |
| `curation.compilers` | Data compilation tools (main_compiler, plain_bpmf_compiler) |
| `curation.validators` | Validation and analysis tools (score_validator) |
| `curation.utils` | General utilities (text_filter) |

Scripts with side effects are located in `scripts/`:

| Script | Purpose |
|--------|---------|
| `count_occurrences.py` | Counts phrase occurrences in text corpus |
| `analyze_data.py` | Analyzes dictionary data and generates reports |
| `map_bpmf.py` | Helper for automatic Bopomofo mapping |

### Python Development Guidelines

**CRITICAL RULES for AI coding assistants:**

#### 1. Module Organization Rules

**Library modules** (in `curation/` package) **MUST NOT** have side effects at module level:

- **PROHIBITED**: Opening files, reading/writing data, printing output at module level
- **PROHIBITED**: Executing code immediately when module is imported
- **REQUIRED**: All initialization must be in functions
- **REQUIRED**: Module must be importable without executing code

**Example of BAD module (violates rules):**

```python
# BAD: Has side effects at import time
import configparser

config = configparser.ConfigParser()
config.read('config.ini')  # WRONG: Reads file at import!
corpus = open('corpus.txt').read()  # WRONG: Opens file at import!
```

**Example of GOOD module:**

```python
# GOOD: No side effects, importable as library
import configparser


def load_config(config_path='config.ini'):
    """Load configuration from file."""
    config = configparser.ConfigParser()
    config.read(config_path)
    return config


def main():
    """Main entry point for CLI usage."""
    config = load_config()
    # ... rest of logic


if __name__ == '__main__':
    main()
```

#### 2. Import Guidelines

- **REQUIRED**: All imports at top of file
- **NEVER** use inline imports (unless explicitly necessary for specific technical reasons)
- Use relative imports within package (e.g., `from .compiler_utils import HEADER`)
- Use absolute imports for external packages (e.g., `import argparse`)

**Example of BAD imports:**

```python
# BAD: Inline import
def process_data():
    import pandas as pd  # WRONG: NEVER do this!
    return pd.DataFrame(data)
```

**Example of GOOD imports:**

```python
# GOOD: All imports at top
import argparse
import sys
from typing import List, Dict

from .compiler_utils import HEADER


def process_data(input_data):
    result = []
    for item in input_data:
        result.append(item.upper())
    return result
```

#### 3. Side Effect Management

**Scripts with side effects belong in `scripts/` directory, NOT in `curation/` package.**

Scripts that do any of the following must be in `scripts/`:

- Read configuration files at module level
- Open and process data files at module level
- Print output or generate reports at module level
- Execute analysis immediately when imported

**When to use `scripts/` vs `curation/`:**

| Location | Purpose | Characteristics |
|----------|---------|-----------------|
| `scripts/` | Pure CLI tools | Has side effects; not importable as library; immediate execution |
| `curation/` | Library modules | No side effects; importable; reusable functions |

#### 4. Script vs Library Separation

**Library modules** in `curation/` should:

- Provide reusable functions and classes
- Have a `main()` function for CLI usage
- Be declared in `pyproject.toml` `[project.scripts]`
- Follow pattern: `mcbpmf-tool-name = "curation.module:main"`

**Scripts** in `scripts/` should:

- Be standalone executables
- Have all logic in `if __name__ == '__main__':` block
- Be called directly: `python3 scripts/script_name.py`
- NOT be imported by other modules

#### 5. PEP-8 Naming Conventions

- Module names: `lowercase_with_underscores.py`
- Function names: `lowercase_with_underscores()`
- Class names: `CapitalizedWords`
- Constants: `UPPERCASE_WITH_UNDERSCORES`

**Examples:**

- GOOD: `frequency_builder.py`, `main_compiler.py`, `text_filter.py`
- BAD: `buildFreq.py`, `nonCJK_filter.py`, `cook-plain-bpmf.py`

#### 6. Package Installation

Install as editable package for development:

```bash
pip install -e .               # Install package
pip install -e ".[dev]"        # Install with dev dependencies
pip install -e ".[notebook]"   # Install with notebook dependencies
```

After installation, use console scripts:

```bash
mcbpmf-build-freq        # Instead of: python3 -m curation.builders.frequency_builder
mcbpmf-compile           # Instead of: python3 -m curation.compilers.main_compiler
mcbpmf-validate-scores   # Instead of: python3 -m curation.validators.score_validator
```

## Project Path Configuration

All scripts and modules use centralized path constants from the `curation` package:

```python
import configparser
from pathlib import Path

from curation import PROJECT_ROOT, CONFIG_FILE

# PROJECT_ROOT = Source/Data/ directory (where pyproject.toml lives)
# CONFIG_FILE = Source/Data/textpool.rc

# Example usage in scripts
config = configparser.ConfigParser()
config.read(CONFIG_FILE)
corpus_path = Path(config.get('data', 'corpus_path')).expanduser()
```

**Do NOT** compute paths relatively in individual scripts:

- BAD: `Path(__file__).parent.parent`
- BAD: `os.path.abspath(sys.argv[0]).split('/')`
- GOOD: `from curation import PROJECT_ROOT`

This ensures:

- Single source of truth for project structure
- Easy refactoring if directory structure changes
- Consistent behavior across all tools

## Historical Context: Tool Evolution (2012-2025)

### Migration from bin/ to curation/ (October 2024)

Prior to October 2024, all Python tools were located in the `bin/` directory (now renamed to `bin_legacy/`). This directory accumulated tools over 13+ years with contributions from multiple developers.
#### Tool Creation Timeline

- **2012-08-06**: `cook.py` created by Mengjuei Hsieh, replacing Ruby implementation
- **2012-09-16**: `buildFreq.py` created, replacing bash version
- **2013-01-02**: `self-score-test.py` added for quality validation
- **2013-01-21**: C version moved to `C_Version/` subdirectory ("phasing out")
- **2024-03-15**: `derive_associated_phrases.py` added by Lukhnos Liu (v2 system)
- **2024-08-25**: `audit_encoding.swift` added by zonble
- **2025-03-08**: `cook.py` modernized with Black formatting and argparse

#### Why Migration Was Needed

The bin/ structure had accumulated issues:

1. Flat organization (~30 files, no logical grouping)
2. Mixed concerns (library code, CLI scripts, config files, legacy tools)
3. Inconsistent naming conventions
4. Some modules had side effects at import time
5. Not installable as proper Python package
6. Each script calculated paths differently

#### What Was Migrated vs Preserved

**Migrated to `curation/` package** (actively used):

- All compilation and build tools
- Frequency calculation
- Data validation
- Text processing utilities

**Moved to `scripts/`** (CLI-only, with side effects):

- Corpus occurrence counting
- Data analysis reports
- BPMF mapping helpers

**Preserved in `bin_legacy/`** (historical reference):

- `audit_encoding.swift` - Still usable standalone tool (2024)
- `C_Version/` - Fast C implementation, phased out in 2013
- `Sample_Prep/` - Corpus preparation methodology
- `disabled/` - Legacy Perl/Ruby/Bash implementations

#### Path Configuration Changed

- **Before**: Each script calculated paths relatively
- **After**: Import from `curation` package: `from curation import PROJECT_ROOT, CONFIG_FILE`

#### Using Legacy Tools

The `audit_encoding.swift` tool is still functional:

```bash
cd bin_legacy
swift audit_encoding.swift  # Validates BPMFBase.txt encoding categories
```

C version (for performance comparison):

```bash
cd bin_legacy/C_Version
export TEXTPOOL=/path/to/corpus
./count.bash 測試詞彙
```

For complete migration history and tool details, see `bin_legacy/DEPRECATED.md`.

#### Key Contributors

- **Mengjuei Hsieh**: Original Python implementation (2012-2013)
- **Lukhnos Liu**: Modernization and associated phrases v2 (2024-2025)
- **zonble**: Encoding audit tool (2024)

## Data Generation Pipeline

The build process flows: `phrase.occ` + `exclusion.txt` → `buildFreq.py` → `PhraseFreq.txt` → `cook.py` (+ other inputs) → `data.txt` → `derive_associated_phrases.py` → `associated-phrases-v2.txt`.

For a detailed pipeline diagram, frequency calculation algorithms, and heterophony processing logic, see `algorithm.md` section "字典資料的生成與使用" (generation and use of dictionary data).

## Editorial Guidelines

For editorial policies, see [Wiki: 詞庫開發說明](https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明).

### Phrase Quality Control

Check phrase rarity with a web search for `site:.tw "phrase"` (under 1,000 results → likely safe to remove).

### Data Integrity Checks

```bash
# Character count matches Bopomofo count (empty output = good)
# length() is assumed to count bytes here (3 per CJK character in UTF-8)
awk 'length($1)/3!=NF-1' BPMFMappings.txt

# Phrase consistency check
diff -u <(awk '{print $1}' BPMFMappings.txt|sort -u) \
        <(awk 'length($1)>3{print $1}' phrase.occ|sort -u)
```

## Testing Your Changes

```bash
make tidy sort          # Format and sort
make check              # Validate integrity
make all                # Build output files
make _install           # Install to ~/Library/Input Methods/McBopomofo.app/
pkill -HUP McBopomofo   # Restart
```

## Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Sort order is wrong | Always use C locale: `LC_ALL=C sort -o file file` |
| Phrase in BPMFMappings.txt not showing | Verify phrase exists in `phrase.occ` with non-zero frequency |
| make check fails (character count) | Ensure character count = Bopomofo reading count |
| Heterophony shows wrong default | The common reading belongs in `heterophony1.list`, less common ones in `heterophony2.list`; agents must not edit these lists, so flag the issue for a human reviewer |
| Emoji appears as first candidate | Add to `Symbols.txt` with negative score (e.g., -8) |

## Reference Files for Context

- Root `AGENTS.md`: Overall project architecture and build system
- `algorithm.md`: Detailed algorithm explanation (Chinese)
- [Wiki: 程式架構](https://github.com/openvanilla/McBopomofo/wiki/程式架構): Program architecture
- [Wiki: 詞庫開發說明](https://github.com/openvanilla/McBopomofo/wiki/詞庫開發說明): Dictionary development guide
- [Wiki: 使用手冊](https://github.com/openvanilla/McBopomofo/wiki/使用手冊): User manual
