AGENTS

Agent Notes (updated 2025-11-13)

TUIs at a glance:
- timestamp_textual_app.py (TimestampLogApp): capture constraints and view artifact.md (Markdown). It doesn’t run NLCO.
- agent_manual_pkg/src/agent_manual_pkg/tui.py (Agent Manual TUI): interactive agent runner (satisfaction, memory, DSPy logs).
- Legacy nlco_textual.py was removed; use the two TUIs above instead.
Headless alternative is nlco_iter.py (console loop). Run: python nlco_iter.py.
You do NOT need both at once; the TUI runs iterations itself. Avoid concurrent runs (shared files).
Run commands:
- Timestamp (wrapper, recommended): ./timestamp_tui.sh
- Timestamp (one-liner, hardened): stty iutf8; LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 ./timestamp_textual_app.py --lenient-input --fallback-encoding cp1252
- Agent Manual: python -m agent_manual_pkg.cli (supports --model and --max-tokens).
Files touched: none (informational change only).

Things to keep in mind

Textual and Rich must be installed to run the Timestamp TUI.
MLflow is optional for headless; structured schedule JSON is no longer produced by the refiner.
Timestamp TUI default paths now live under the user’s private directory:
- Default base: ~/.nlco/private (override with NLCO_PRIVATE_DIR).
- Resolved files: constraints.md, artifact.md in that directory.
- Override per-file via env: TIMESTAMP_CONSTRAINTS_PATH, TIMESTAMP_ARTIFACT_PATH or CLI flags --constraints-path / --artifact-path.
timestamp_textual_app.py appends to the resolved constraints file and can be used alongside NLCO tools, but beware of concurrent writes to the same file.

Path migration note (2025-11-13)

We did NOT move any existing constraints.md/artifact.md. They remain at repo root, untracked.
To keep using them without moving: TIMESTAMP_CONSTRAINTS_PATH="$PWD/constraints.md" TIMESTAMP_ARTIFACT_PATH="$PWD/artifact.md" ./timestamp_textual_app.py.
To migrate to the new defaults (~/.nlco/private):
- One‑liner (hardened): IFS=$'\n\t'; set -euo pipefail; umask 077; base="${NLCO_PRIVATE_DIR:-$HOME/.nlco/private}"; install -d -m 700 "$base"; for f in constraints.md artifact.md; do src="$PWD/$f"; dst="$base/$f"; if [ -f "$src" ] && [ ! -e "$dst" ]; then mv -- "$src" "$dst" && echo "moved $f -> $dst"; else echo "skipped $f"; fi; done

Path migration executed (2025-11-13)

On this machine, we moved artifact.md → ~/.nlco/private/artifact.md and found ~/.nlco/private/constraints.md already present; repo‑root constraints.md was untouched.
TUI now reads/writes from the private paths by default; override via TIMESTAMP_CONSTRAINTS_PATH / TIMESTAMP_ARTIFACT_PATH if needed.
Context now includes weekday explicitly: Datetime: YYYY-MM-DD HH:MM:SS (Friday) for better temporal grounding.
Auto-backups: Before any write to constraints.md, we snapshot the current file to .nlco/backups/{hourly|daily|weekly}/constraints-*.md if the period’s file doesn’t exist yet. Env override: NLCO_BACKUP_DIR.
Constraints tail sizing: In timestamp_app_core, tail now always derives from pane height (tail = max(height - 2, 1)). The old TIMESTAMP_CONSTRAINTS_TAIL numeric env is ignored for rendering.
Mobile SSH tip: some clients clip the last column. Use either --right-margin N (env TIMESTAMP_RIGHT_MARGIN) or --pad-eol (env TIMESTAMP_PAD_EOL=1) to add space at the right; both affect rendering only (no file changes).

Key bindings (Timestamp TUI)

gi focus input • ga focus artifact • F1 toggle help (more reliable than Ctrl+H on some phones/SSH) • Ctrl+C exit • PageUp/PageDown scroll artifact.

Next Steps (2025-11-13)

46a. Wire TimewarriorModule.run() into nlco_iter.iteration_loop() behind env NLCO_TIMEW=1; log a short status line and add 2 unit tests (timew present/absent). Recommended.
46b. Add a minimal “unchanged twice” stop rule to headless iterations to prevent endless runs; 2 tests (no-change stops, change resets counter).
46c. Apply TIMESTAMP_RIGHT_MARGIN padding in timestamp_app_core.TimestampLogApp (Constraints Markdown) and add one style assertion test. (Tracking.)
46d. Prune remaining legacy references to nlco_textual.py in docs and code comments; keep the file but mark clearly as deprecated.
46e. Harden JSONL model logging for path errors (permission/dir missing) with a tiny try/except and one test; keep code minimal.
46f. (Done) Lightweight advisory file lock to avoid simultaneous writes to constraints.md when headless + timestamp/web app run together.
- New: file_lock.locked_file(path, mode='a+') (fcntl LOCK_EX; Linux only).
- Used by: constraints_io.append_line, timestamp_textual_app._append_to_constraints, webapp/nlco_htmx/utils.write_constraints_entry.
- Test: tests/test_constraints_locking_utils.py spawns two processes appending concurrently; asserts one heading and all lines present.

Proposed Next Steps (80)

80a. Split TUI.apply_memory_updates into *_create/_update/_delete helpers with one focused unit test. Minimal code; lowers CC hotspot. (Recommended.)
80b. Style cleanup in tui.py: remove semicolons flagged by ruff (E702/E703); no behavior change; quick win + keep lint clean.
80c. Wire TimewarriorModule.run into nlco_iter behind NLCO_TIMEW=1 with 2 tests (timew present/absent). Small, controlled change toward earlier goals.
80d. Add a tiny /help command that echoes HelpScreen text into #log; add one assertion in tests to lock UX.

Shell hardening cheats

One-liner (ad‑hoc): stty iutf8; LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 ./timestamp_textual_app.py --lenient-input --fallback-encoding cp1252
Persistent (bash/zsh): add [ -t 0 ] && stty iutf8 || true to your shell rc.

Models & budgets (NLCO iter)

Primary LM: deepseek/deepseek-reasoner with max_tokens=40000 (see nlco_iter.py).
Support LM for subsystems: deepseek/deepseek-chat with max_tokens=4000, temperature=0 (used in various support modules).
Memory now uses the primary LM (reasoner) in headless.

Memory module limits

MemoryModule.max_iters = 4 by default. Each ReAct step can call a tool (e.g., replace_memory or append_memory).
One replace_memory call replaces all occurrences of the search string in memory.md (not just one), and increments the edit counter.
Effective bound per invocation: up to 4 tool-driven edits, but fewer if the ReAct loop decides to stop early.
No explicit OpenRouter reasoning budget is set; if routed via OpenRouter, provider defaults apply. We do not pass reasoning/max_reasoning_tokens today.

Reasoning trace display

nlco_iter prints a “Model Reasoning · Refiner” panel when the provider returns native reasoning. Critic is currently disabled.
- DeepSeek API: reads message.reasoning_content.
- OpenRouter: reads message.reasoning.{text|summary}.
To force reasoning over OpenRouter, pass the unified reasoning parameter (e.g., {"enabled": true} or {"effort": "medium"}) via your LM config; we kept runtime code minimal and non-intrusive.

Model output logging

All model outputs (including reasoning if present) append to JSONL at .nlco/model_log.jsonl.
Set NLCO_MODEL_LOG=/path/to/file.jsonl to change the destination.
Each line: { ts, stage, output, reasoning }.

Textual Markdown

Textual provides a Markdown widget that parses Markdown with a GFM-like parser (tables, task lists, strikethrough, autolinks).
For interactive/spreadsheet-like tables, use DataTable; Markdown tables are static.
TimestampLogApp now renders artifact.md via Markdown (read-only) instead of a TextArea.

Timestamp TUI complexity reductions (2025-11-12)

Merged CLI/TTY preflight across modules: timestamp_textual_app.py now delegates to timestamp_app_core for CLI parsing and UTF‑8 TTY checks, with a tiny wrapper that preserves the stty iutf8 behavior required by tests.
Extracted _build_append to assemble constraint writes; keeps _append_to_constraints linear and short.
Kept richer lenient‑input helper in wrapper (prints one warning, patches read) since tests assert those messages; core retains a minimal variant.
Result: wrapper no longer has any CC ≥ C; worst offenders are B(10/8/6). Core remains B‑level for _ensure_utf8_tty and _parse_cli.
Constraints pane: factored wrapper’s constraints logic into helpers — _md_preserve_lines, _maybe_scroll_constraints_end. The old _tail_env helper was removed; the wrapper now derives the tail from the pane height (tail = max(height − 2, 1)), matching the core. _load_constraints stays straight‑line and ≤ B. Tests updated accordingly.
- Added tests: tests/test_timestamp_constraints_helpers.py covers _tail_env defaults/values/invalids, newline preservation, and autoscroll focus/no-focus behavior.

Legacy: Old Vim Input

_OldVimInput (previously in timestamp_textual_app.py) has been removed. The app uses VimInput from timestamp_vim_input.py.

Release

v0.1.1 (2025-11-06): reasoning trace panels in nlco_iter, JSONL model logging, TimestampLogApp Markdown view + gi input focus, tests updated.
v0.1.2 (2025-11-10): TimestampLogApp adds a minimal UTF-8 TTY preflight to avoid Textual UnicodeDecodeError on misconfigured terminals; new tests in tests/test_timestamp_textual_preflight.py.
v0.1.3 (2025-11-11): Add a 1‑column right padding to the Log widget in timestamp_textual_app.py to avoid last‑column clipping observed on some mobile SSH clients (e.g., JuiceSSH/Termux). New test tests/test_timestamp_textual_layout_margin.py pins this CSS.
v0.1.4 (2025-11-11): Optional lenient input hook for Textual: set TIMESTAMP_LENIENT_INPUT=1 to monkeypatch textual.drivers.linux_driver.decode to fall back to cp1252 (or TEXTUAL_FALLBACK_ENCODING) when non‑UTF‑8 bytes arrive. Tests: tests/test_timestamp_textual_lenient_input.py.
v0.1.5 (2025-11-11): Fix script entry NameError by defining the lenient hook before main()/__main__ guard; add tests/test_timestamp_textual_entry_order.py to ensure running the script directly doesn’t raise NameError and exits cleanly.
v0.1.6 (2025-11-11): Add tests: constraints append behavior (tests/test_timestamp_constraints_append.py), lenient warn-once (tests/test_timestamp_textual_lenient_warn_once.py), preflight success path (tests/test_timestamp_textual_preflight_success.py), and timestamp formatting (tests/test_timestamp_format_line.py).
v0.1.7 (2025-11-11): Add CLI flags to TimestampLogApp script: --lenient-input to enable decode fallback and --fallback-encoding ENC to select the fallback codec (default cp1252). Test tests/test_timestamp_textual_cli_lenient.py verifies the flag path without launching a real UI.
v0.1.8 (2025-11-11): Strengthen lenient input: in addition to patching linux_driver.decode, also patch linux_driver.read to sanitize non‑UTF‑8 bytes (fallback → UTF‑8) before they reach the decoder. This addresses cases where the driver binds the original decode at definition time.
v0.1.9 (2025-11-11): Add timestamp_tui.sh wrapper which sets UTF‑8 locale, enables iutf8, and runs the app with --lenient-input --fallback-encoding cp1252. Test tests/test_timestamp_shell_wrapper.py asserts wrapper contents.
v0.1.10 (2025-11-11): Ensure timestamp_tui.sh has executable bit set in repo workspace.
v0.1.11 (2025-11-11): Document copy‑paste one‑liner and step‑by‑step shell hardening commands.
v0.1.11 (2025-11-11): Document quick shell hardening commands (UTF‑8 locale + iutf8) for ad‑hoc sessions.
v0.1.12 (2025-11-11): Add --right-margin N to adjust Log right padding at runtime (env TIMESTAMP_RIGHT_MARGIN). Helps phones/SSH clients that clip the last column. Test: tests/test_timestamp_cli_right_margin.py.
v0.1.13 (2025-11-11): Add --pad-eol (env TIMESTAMP_PAD_EOL=1) to append a single space when rendering each log line, without writing it to file—works around last-column clipping that margins don’t solve. Test: tests/test_timestamp_cli_pad_eol.py.
v0.1.14 (2025-11-11): Make help reliable on mobile/SSH by adding F1 binding for toggle_help (some clients translate Ctrl+H to backspace). Added tests: tests/test_timestamp_help_binding_and_action.py.
v0.1.15 (2025-11-11): Make artifact Markdown view scrollable via CSS overflow: auto, allow focusing it (ga shortcut) and mark focusable. Tests: tests/test_timestamp_artifact_scrollable.py, tests/test_timestamp_artifact_focus_shortcut.py.
v0.1.16 (2025-11-11): Replace the upper Log with a scrollable Constraints Markdown pane. It renders constraints.md directly and auto-refreshes. Shortcut gi focuses input, ga focuses artifact. Tests: tests/test_timestamp_constraints_view_load.py; updated CSS test to reference #constraints-view padding.
v0.1.17 (2025-11-11): Default the Constraints pane to show the bottom (latest entries). On each load/refresh, we call scroll_end(); if supported, auto_scroll=True is set. Test updated in tests/test_timestamp_constraints_view_load.py to assert scrolling.
v0.1.18 (2025-11-11): Make auto-scroll polite: it’s disabled while the constraints pane is focused, and can be turned off with --no-auto-scroll (env TIMESTAMP_AUTO_SCROLL=0). Tests: tests/test_timestamp_no_auto_scroll_flag.py, tests/test_timestamp_constraints_focus_blocks_autoscroll.py.
v0.1.19 (2025-11-11): More tests for constraints pane: mtime-driven refresh (tests/test_timestamp_constraints_refresh_mtime.py), missing-file handling (tests/test_timestamp_constraints_missing_file.py), and gi input-focus shortcut (tests/test_timestamp_input_focus_shortcut.py).
v0.1.20 (2025-11-11): Remove structured schedule output from the Refiner. RefineSignature now returns only refined_artifact; headless and TUI paths no longer write or parse structured_schedule.json. Tests updated accordingly.
v0.1.21 (2025-11-11): Remove Critic module and input from Refiner. RefineSignature drops the critique field; headless and Textual flows no longer call or display Critic. TUI “Critique” panel removed. Tests updated.
v0.1.22 (2025-11-11): Add SystemState Pydantic model with last_artifact_update (ISO). Passed to the Refiner right after constraints in both headless and Textual flows. Tests: tests/test_system_state_refiner_input_headless.py, tests/test_system_state_refiner_input_textual.py.
v0.1.23 (2025-11-11): (historical) nlco_textual.py was made executable. This TUI is now legacy and not maintained.
v0.1.24 (2025-11-11): Repo housekeeping — commit and push TUI + pipeline changes (mobile SSH fixes, scrollable panes, constraints view overhaul, removal of Critic/structured schedule, new SystemState input) and tests.
v0.1.25 (2025-11-11): (historical) nlco_textual.py removal was planned. The file may still be present but should be treated as deprecated; use headless loop (python nlco_iter.py) or timestamp_textual_app.py.
v0.1.26 (2025-11-11): TimestampLogApp now tails constraints.md by default (last 200 lines) and scrolls to bottom. Flags/env: --constraints-tail N (env TIMESTAMP_CONSTRAINTS_TAIL) to adjust; --no-auto-scroll to stop snapping to end.
v0.1.27 (2025-11-11): Tests for artifact scroll actions and fallbacks (tests/test_timestamp_artifact_scroll_actions.py).
v0.1.28 (2025-11-11): TimestampLogApp respects newlines in the constraints view by emitting Markdown line breaks. Test: tests/test_timestamp_constraints_newlines.py.
v0.1.29 (2025-11-12): Remove unused _OldVimInput from timestamp_textual_app.py; the TUI relies solely on VimInput.
v0.1.30 (2025-11-12): Add hourly/daily/weekly auto-backups for constraints.md (locked writes). New module backups.py; used by constraints_io, timestamp_textual_app, and HTMX utils. Tests: tests/test_constraints_backups.py.
v0.1.31 (2025-11-12): Constraints tail now always tracks pane height in timestamp_app_core. Tests: tests/test_timestamp_constraints_tail_auto.py and updated display/tail default tests.
v0.1.32 (2025-11-12): Ran ruff check . across repo; 397 findings, 166 auto-fixable. Consider adding a minimal pyproject.toml Ruff config and staged fixes.
v0.1.33 (2025-11-12): Applied safe Ruff auto-fixes (ruff check . --fix). Findings reduced to 221 from 397; remaining include E402/E70x/F841 and a few F821/E722. No code semantics changes intended.
v0.1.34 (2025-11-12): Added minimal Ruff config .ruff.toml (py311, line-length 100, ignore E501; per-file ignores for legacy/intentional patterns; excluded one experimental file). Current findings with config: 102.
v0.1.35 (2025-11-12): Fixed high-signal Ruff issues: F821 in agent_manual_b.py, interactive_chat.py, textual_dspy/app.py; E722 in abbrev_decoder/... and online_optimization_system.py; minor F841 cleanups. Added targeted per-file ignores for tests, world_model, and submodules. ruff check now passes clean with .ruff.toml.
v0.1.46 (2025-11-17): Headless path unification + zsh helper + tests.
- nlco_iter now resolves artifact/constraints/memory paths via timestamp_app_core.resolve_*, defaulting to ~/.nlco/private/ (keeps files out of the repo). Env overrides still work: NLCO_{CONSTRAINTS,ARTIFACT,MEMORY,SHORT_TERM}_PATH or TIMESTAMP_*_PATH.
- Added minimal CLI scripts/constraints_add_entry.py to append a single constraints line with HH:MM:SS time and daily headings. Flag --now exists only for tests.
- Added scripts/zsh_functions.zsh exposing a(); usage: a your message appends to constraints using the same helpers/paths as the TUI/HTMX.
- Tests: tests/test_add_constraint_script.py ensures day heading and line formatting, deduping headings on same day, and new day headings.
- Artifact view defaults to top scroll already; tests in place: tests/test_timestamp_artifact_default_top.py and tests/test_timestamp_artifact_no_autoscroll.py (env TIMESTAMP_AUTO_SCROLL=0 disables it).
v0.1.47 (2025-11-17): Quick usage cheat‑sheet for constraints appends.
- Zsh function: source scripts/zsh_functions.zsh:1 then run a do the thing.
- CLI: python3 scripts/constraints_add_entry.py did the thing (uses HH:MM:SS; daily # YYYY-MM-DD).
- Path resolution: honors NLCO_CONSTRAINTS_PATH → default ~/.nlco/private/constraints.md.
- Verification: tail -n 5 ~/.nlco/private/constraints.md (or the env‑overridden path).

Notes/Learnings

Shell stays minimal by delegating to the tested Python helper; consistent behavior with TUI/HTMX.
Memory replace tool prints unified diffs; look for them in nlco_iter console output.

File locations (default; outside repo)

Constraints: ~/.nlco/private/constraints.md (set NLCO_CONSTRAINTS_PATH to override)
Artifact: ~/.nlco/private/artifact.md (set NLCO_ARTIFACT_PATH)
Memory: ~/.nlco/private/memory.md (set NLCO_MEMORY_PATH)
Short-term: ~/.nlco/private/short_term_memory.md (set NLCO_SHORT_TERM_PATH)

Quick zsh function (a)

Source once in your shell rc: source /home/tom/git/agent/scripts/zsh_functions.zsh
Use: a did a quick thing → writes HH:MM:SS did a quick thing under today’s # YYYY-MM-DD heading.

What we learned / keep in mind

Sharing the path resolvers across TUI and headless keeps artifacts out of the repo by default and avoids FileNotFound races.
Keep shell glue minimal; delegate formatting and locking to Python helpers we already test.
v0.1.36 (2025-11-12): Simplify Timestamp TUI constraints pane — wrapper tail derives from visible pane height; env TIMESTAMP_CONSTRAINTS_TAIL is ignored for rendering. Removed _tail_env. Tests updated: tests/test_timestamp_constraints_tail_view.py, tests/test_timestamp_constraints_helpers.py, and tests/test_timestamp_constraints_newlines.py adjusted to inject pane height.
v0.1.37 (2025-11-12): Delegate wrapper helpers to core: added tiny shared helpers in timestamp_app_core.py (md_preserve_lines, constraints_tail_from_height, scroll_end) and made the wrapper call them. Added tests/test_timestamp_constraints_equivalence.py to assert wrapper/core render identical content for the same pane height.
v0.1.38 (2025-11-12): Wrapper now respects TIMESTAMP_CONSTRAINTS_ROWS like the core. On mount, it sets #constraints-container height and uses it to derive tail (rows-2). Added tests/test_timestamp_constraints_rows_env_wrapper.py to validate height + tail.
v0.1.39 (2025-11-12): Unified TUI apps: timestamp_textual_app.TimestampLogApp now subclasses the core timestamp_app_core.TimestampLogApp to remove duplicate CSS/compose and reuse helpers. Wrapper overrides only: key bindings, timers/refresh, input submit formatting, help toggle, artifact scroll actions, and a focus‑aware _scroll_constraints_end (skips autoscroll when constraints pane focused). Tests added: tests/test_timestamp_constraints_equivalence.py (wrapper vs core rendering), tests/test_timestamp_artifact_scroll_helpers_delegate.py (wrapper delegates to core scroll helpers).
v0.1.40 (2025-11-12): Extract constraints append formatting to shared helper constraints_io.build_append_block(existing, needs_heading, date_str, line). Wrapper now uses it in _append_to_constraints. New tests: tests/test_constraints_append_helper.py covers first-entry, same-day, and next-day w/o trailing newline.
v0.1.41 (2025-11-12): Add smoke tests using Textual run_test() for both core and wrapper TimestampLogApp to ensure they compose and render without launching a real UI (tests/test_timestamp_smoke_run_test.py).
v0.1.42 (2025-11-12): Wrapper CLI adds --constraints-rows N to set TIMESTAMP_CONSTRAINTS_ROWS (mirrors core env). Test: tests/test_timestamp_cli_constraints_rows.py.
v0.1.43 (2025-11-12): HTMX writer now uses shared build_append_block for consistent constraints formatting. Updated webapp/nlco_htmx/utils.write_constraints_entry. New test tests/test_web_htmx_append_block_consistency.py covers next-day insert with missing trailing newline.
v0.1.44 (2025-11-12): Change timestamp format from HHMM to HH:MM:SS across TUI and HTMX. Updated _format_line in timestamp_textual_app.py and write_constraints_entry in webapp/nlco_htmx/utils.py. Tests updated accordingly (Textual app formatting, HTMX POST/API, and consistency test).
v0.1.45 (2025-11-12): Artifact pane now scrolls to the top by default (polite auto_scroll gating). Core calls scroll_home after load; wrapper inherits behavior. Added tests: tests/test_timestamp_artifact_default_top.py for core and wrapper.

Things learned / to keep in mind (2025-11-12)

Two TimestampLogApp classes exist (core and wrapper). Their behaviors can drift; we aligned tailing behavior to reduce divergence. Consider consolidating or delegating constraints logic from the wrapper to the core in a future change.
CLI still accepts --constraints-tail and sets TIMESTAMP_CONSTRAINTS_TAIL for backward compatibility, but the wrapper ignores it during rendering. Tests only assert env propagation.
Textual's Markdown.update may emit a benign "no running event loop" message when called on stubbed views in tests; currently harmless and can be ignored.
Helper naming: the core calls _scroll_constraints_end. For backward‑compat with older tests, the wrapper keeps _maybe_scroll_constraints_end() and forwards it to _scroll_constraints_end().
Append logic centralization: both headless and TUI paths should use constraints_io.build_append_block for consistent spacing/heading behavior; today only the wrapper calls it. Consider adopting it in any other path that writes constraints to avoid drift.

Structured Memory — Options (2025-11-11)

Option A (light): add sectioned headings in memory.md (Policies/Procedures/Glossary) and constrain tools to edit within a selected section; add tests for section targeting.
Option B (tags): require a tiny YAML front‑matter per block (tags: [policy, time], updated:). Provide a minimal append_memory --tags helper and validate via tests.
Option C (index JSON): maintain .nlco/memory_index.json with {id, title, tags, updated, offset} for quick lookup; unit‑test index build and lookup.
Option D (RAG-lite): embed memory blocks once (e.g., 384‑d float per block) and select top‑k to show to the LM; start with a toy cosine impl + tests on selection only.
Option E (recency window): inject only blocks updated in last N days into context when memory changed; add a test that verifies injection happens only on recent edits.
Option F (write rules): add a 2‑rule acceptance gate: “non‑transient + reusable”; if not satisfied, do not write. Test with examples.
Option G (CLI): tiny ./mem.py list|show|append to manipulate memory deterministically; add smoke tests for the commands.

Quick Run — Textual Apps (cheat sheet)

Install deps once: source .venv/bin/activate || true; pip install -r requirements.txt
Legacy NLCO TUI: deprecated and not maintained; examples removed. Use headless nlco_iter.py or timestamp_textual_app.py.
Timestamp TUI (notes/constraints): ./timestamp_tui.sh (recommended)
- Alt: ./timestamp_textual_app.py --lenient-input --fallback-encoding cp1252
- Phone/SSH hardening: stty iutf8 && export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8

TTY / UTF-8 preflight (TimestampLogApp)

Symptom: UnicodeDecodeError from textual.drivers.linux_driver on launch when the terminal isn’t UTF-8 or stty iutf8 isn’t set.
Change: timestamp_textual_app.py now checks for (a) TTY stdin/stdout, (b) UTF-8 locale, and (c) stty iutf8. If any fail, it exits with a clear message rather than crashing inside Textual.
Additionally: a small main() wrapper prints a concise hint and exits if a UnicodeDecodeError escapes the app, though Textual may still render its own traceback from a background thread. The fix remains to correct the terminal environment.
How to fix env: ensure a UTF-8 locale (e.g., export LANG=en_US.UTF-8) and enable UTF-8 input on the TTY: stty iutf8.
Tests: tests/test_timestamp_textual_preflight.py exercises the three failure modes with monkeypatching.

Troubleshooting: iutf8 disabled

Enable for current shell: stty iutf8 (run in the same pane you launch the app from).
Verify: stty -a | grep -E -- '-?iutf8' → should show iutf8 (not -iutf8).
Persist for interactive shells (bash/zsh): add to ~/.bashrc or ~/.zshrc:
- [ -t 0 ] && stty iutf8 || true
tmux/screen: run stty iutf8 inside each pane; to persist, keep the shell-rc line above (it runs only in interactive TTYs).

Right edge clipping (mobile SSH)

Symptom: rightmost character of many lines is missing when running TimestampLogApp over JuiceSSH/Termux.
Likely cause: terminal last-column/autowrap quirk or off‑by‑one width reporting over SSH. Textual/Rich will happily write into the last cell; some terminals fail to render it.
Mitigation (2025-11-11): increased right padding on the Log widget (padding: 1 2;) so content stays one cell away from the terminal’s right edge.
Quick check on client: compare tput cols vs stty -a | grep -o 'columns [0-9]\+'; they should match. Ensure TERM=xterm-256color and locale is UTF‑8. Minimal probe: printf '%*sX\n' "$COLUMNS" '' should visibly print an X in the last column.

Non‑UTF‑8 input over SSH (2025-11-11)

Symptom: Textual prints a background thread traceback with UnicodeDecodeError: invalid start byte 0x.. from textual.drivers.linux_driver.decode.
Cause: the SSH client sends bytes that aren’t UTF‑8 (often CP1252 like 0x99 for ™). iutf8 doesn’t transcode; it just changes line editing. Textual expects UTF‑8 and crashes.
Fix on the remote shell: export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8; stty iutf8 (run inside tmux panes too). Verify with locale charmap → UTF-8 and stty -a shows iutf8.
Fix on the client (examples): set JuiceSSH/Termux character encoding to UTF‑8 and disable any legacy encoding. Avoid pasting content that yields CP1252 bytes; the hex for ™ should be e2 84 a2 (UTF‑8), not 99.
Probe: run xxd -p then paste a ™ and press Enter; if you see 99, your client isn’t sending UTF‑8.
App behavior: we warn pre‑launch, but Textual may still show its own traceback if non‑UTF‑8 bytes arrive later. We prefer environment fixes over code fallbacks to keep the app minimal.
Opt‑in fallback: set TIMESTAMP_LENIENT_INPUT=1 (and optionally TEXTUAL_FALLBACK_ENCODING=cp1252) to enable a small monkeypatch that decodes bad bytes via cp1252. This is intentionally off by default to avoid hiding real issues.
CLI alternative: run ./timestamp_textual_app.py --lenient-input [--fallback-encoding cp1252] to toggle without env vars. Flags are parsed minimally and ignore unknown args. Quick run (lenient input)
One line: ./timestamp_textual_app.py --lenient-input --fallback-encoding cp1252
With env hardening for this shell:
- export LANG=en_US.UTF-8 (or keep C.UTF-8)
- export LC_ALL=en_US.UTF-8
- stty iutf8
- ./timestamp_textual_app.py --lenient-input --fallback-encoding cp1252

Shell wrapper (shortest path)

Run: ./timestamp_tui.sh
It sets LANG/LC_ALL, runs stty iutf8 if on a TTY, then executes the TUI with lenient input.

Quick shell hardening (no run)

Per‑pane: stty iutf8
Locale: export LANG=en_US.UTF-8 and export LC_ALL=en_US.UTF-8
Optional: export TERM=xterm-256color

Copy‑paste one‑liner (only harden shell)

stty iutf8 && export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 TERM=xterm-256color

Repo housekeeping (2025-11-10)

Committed and pushed v0.1.2 changes: UTF-8 TTY preflight in timestamp_textual_app.py, error hint in main(), and tests test_timestamp_textual_preflight.py+test_timestamp_textual_main.py.

Repo housekeeping (2025-11-13)

README refocused on NLCO headless loop and Timestamp TUI. DeepSeek batch docs moved to a short subpackages note. Added hardened one‑liner and wrapper as primary run paths.
README polished for readability: compact contents list, consistent section headers, and concise quick‑start. No behavior changes.
README now features centered heading and Material‑style badges (shields.io); purely visual.
Pushed documentation commits to origin/main.

Nootropics log (read-only)

NLCO Textual UI now shows the last 72h of entries from ~/.nootropics_log.jsonl in a side panel.
Strictly read-only: the loader never writes or truncates the file.
Env var NLCO_NOOTROPICS_LOG can point to a different JSONL file for testing.
Minimal schema: each line must be JSON with an ISO ts field; lines without ts are skipped.
Helper: append_nootropics_section(context) now appends the section; both headless and TUI call this instead of duplicating string glue.

Caching placement

To maximize DSPy cache reuse, nootropics data is appended to the context input (not constraints).
This keeps the constraints string stable across runs and pushes variable nootropics lines "behind" constraints in the prompt ordering.
nlco_iter.py appends a Nootropics (last 72h) section at the end of context.

NLCO iter tests

Added tests covering headless iteration behavior without touching real LMs:
- tests/test_nlco_iter_logging_and_schedule.py ensures artifact update, structured schedule JSON write, and JSONL model logs with reasoning.
- tests/test_nlco_iter_nootropics_context.py asserts nootropics appear only in context and that constraints remain unchanged.
- Existing tests/test_nlco_iter.py validates async memory invocation and artifact write.
Note: Running the entire repo test suite may fail due to duplicate test module names in packages/deepseek-batch/tests/. Run selective tests for NLCO iter.

Quick usage

Headless (nlco_iter): scheduler decides when to run; finished-check is currently disabled.
Timestamp app: gi focuses the input if focus is elsewhere; use Enter to append a timestamped line.

Iteration counts

After a constraints.md change (headless), the loop runs up to MAX_ITERATIONS in one invocation. Override with NLCO_MAX_ITERS (default 3).
On scheduled hourly ticks (no changes), it runs exactly 1 iteration per tick.
In the TUI, each press of r runs exactly 1 iteration.

Layout tweak (2025-11-07)

Removed NLCO TUI layout. Timestamp app has its own constraints height (8) documented below.

Timestamp app constraints

TimestampLogApp now uses a fixed constraints height of 8 (#constraints-container { height: 8; }).
Uses the shared constraints_io.tail_lines for tailing and scrolls to bottom by default.
Test: tests/test_timestamp_constraints_height.py pins the height.
Tests: tests/test_timestamp_constraints_display.py verifies the tail content and scroll-to-end behavior; tests/test_timestamp_constraints_tail_default.py ensures default tail (200) doesn't trim small files.
Wrapper: tests/test_timestamp_wrapper_exports.py smoke tests that timestamp_textual_app exposes the app and helpers, and that main() wires parse/tty/lenient/run calls without launching a real UI.
Env: tests/test_timestamp_constraints_autoscroll_env.py asserts TIMESTAMP_AUTO_SCROLL=0 disables scrolling to end on reload.

Refactor: Timestamp app split

timestamp_vim_input.py contains the minimal VimInput widget.
timestamp_app_core.py holds the app class and helper functions.
timestamp_textual_app.py is a thin wrapper that re-exports the app, helpers, and main().

Constraints pane behavior (Timestamp app)

Tail count = max(pane height − 2, 1). No implicit fallback.
Env TIMESTAMP_CONSTRAINTS_ROWS=N sets the container height to N rows at mount time and drives tail count (N-2).
Scrolls to the end after refresh so the bottom is visible by default.

File-first constraints behavior

Shared helpers constraints_io.tail_lines and constraints_io.append_line centralize file tailing and appending.

Short‑term memory

File: short_term_memory.md.
Producers:
- TimewarriorModule: appends a one‑line event whenever it starts/stops or prints a notable status.
- ExecutiveModule: appends a one‑line trace per tool action when used (not currently wired in headless; TUI doesn’t invoke ExecutiveModule).
Consumers: none at runtime; we don’t read it back into model context today. It’s for lightweight, append‑only breadcrumbs.

Potential rough edges observed

nlco_iter.py disables finished-check (if False:), so it may iterate indefinitely unless externally stopped.
Legacy nlco_textual.py also writes files in place; avoid running it alongside headless to prevent clobbering.
Timewarrior context: headless nlco_iter.py currently does not call TimewarriorModule.run, so no Timewarrior info is added to the model context. In the Textual UI, TimewarriorModule.run is invoked and its output is shown in the "Timewarrior" pane, but it is not injected into the context string passed to Critic/Refiner.
Critic stage disabled (2025-11-08): We skip the Critic call and pass an empty critique to Refiner. The TUI shows “Critic disabled” in the Critique panel.
Planner stage: In the Textual UI we call PlanningModule.run and show its output; in headless nlco_iter.py the planner is instantiated but not called yet.

Future test ideas

Add a Textual App.run_test() smoke test to ensure NLCOTextualApp composes and updates logs without launching a real UI.
Validate that running one iteration updates the artifact. Structured schedule JSON is no longer produced by default.
Model lineage update (2025-09-29)
- DeepSeek docs state both deepseek-chat and deepseek-reasoner were upgraded to DeepSeek-V3.2-Exp; chat = non-thinking mode, reasoner = thinking mode. Over OpenRouter, reasoning appears in the reasoning field; via DeepSeek API it appears as message.reasoning_content. Timewarrior Control — Conceptual Plan
Current state: TimewarriorModule is wired for use but not yet invoked in the headless loop (nlco_iter.py). We plan to gate it behind NLCO_TIMEW=1 and log a concise status line each iteration.
Likely failure cause: headless flow never invokes Timewarrior; summary parsing is brittle (looks for a "Tags" line and the phrase "there is currently no active time tracking").
Minimal enablement: call timewarrior_tracker.run(...) early in iteration_loop (before memory/planning) and log the result. Add two tests that monkeypatch the tool to simulate summary and start/stop success.
Robust detection: prefer timew export JSON (or timew get dom.active*) to detect active state + tags instead of scraping summary. Keep it simple—no fallbacks unless asked.
Control policy: default to deterministic rules from structured_schedule.json (start/stop on block boundaries; derive tags from the active block). Use the LLM reviewer only when schedule vs. context disagree.
Safety and UX: add a 2–3 minute hysteresis to avoid flapping; add a dry-run flag; allow manual overrides via constraints lines (e.g., timew:start tag1 tag2, timew:stop).
Observability: record {ts, action, tags, justification, stdout/stderr} to .nlco/timew_log.jsonl; unit-test both “timew missing” and normal paths.

Artifact improvement (concept-only, 2025-11-08)

Minimal acceptance gate: only replace the baseline artifact when a tiny rubric score improves. No fallbacks.
Rubric inputs: constraints-derived checklist coverage, schedule consistency, and fewer TODO/TBD markers; return 0..100.
Loop tweak: compute score(prev) and score(candidate); accept candidate if strictly higher (or equal with fewer TODOs). Keep a “best so far”.
Focus cue: use Affect.suggested_focus to generate a one-liner “focus for next iteration” and pass it to the refiner.
Tests proposed (not yet implemented): two-unchanged-hashes stop, accept-only-on-improvement, and context-frozen-per-iteration. Memory handling (summary, 2025-11-11)
File memory.md is the persistent knowledge base; edits are made only when durable info should be kept.
Module MemoryModule runs a small ReAct loop with tools show_memory, replace_memory, append_memory, reset_memory and writes back only if changes occurred.
Headless and Textual flows both use the primary LM for memory updates; a short result string is printed to the Memory pane when changes happen.
We don’t inject memory.md back into the model context yet (display-only except for edits). Short-term breadcrumbs go to short_term_memory.md separately. Code Quality Snapshot (2025-11-12)
Refreshed Radon (post-constraints refactor): repo avg CC C ≈ 14.89. Top hotspots unchanged; constraints pane no longer contributes.
Radon CC (C–F) hotspots (function · score) — latest:
- dspy_programs/memory_gan.main · D (24)
- dspy_programs/taskwarrior_agent.main · D (21)
- refiner_signature.render_schedule_timeline · C (20)
- agent_manual_pkg.tui.TUI._process_job · C (20)
- executive_module._execute_step · C (16)
- nlco_iter.iteration_loop · C (15)
- timewarrior_module._apply_decision · C (15)
- Full listing captured via radon cc -s -n C -a ..
Radon MI: no grade‑C production files observed; agent_manual_pkg/.../tui.py remains MI “C” (legacy/test-heavy context). See radon mi -s ..

Actionable Quick Wins

timestamp_vim_input._handle_normal_mode_key refactored into helpers — now below C and no longer listed.
nlco_iter.iteration_loop refactored internally (helpers for reading state, building context, logging, and refiner print) with no behavior change; CC dropped below C.
Agent TUI: unified command routing. Merged _handle_inline_commands into a single _route_command, and kept a tiny _handle_command wrapper for compatibility. Overlap eliminated for /model, /modules, /max_tokens, and /layout.
Tests for unified router: added test_layout_command_via_on_input and a full /modules flow test to assert state transitions and configuration calls. The suite now covers /layout, /model, /modules, and /max_tokens (prompted and inline). See agent_manual_pkg/tests/test_tui_router_commands.py and existing tests.
Split satisfaction update: _process_job now calls _update_goals(prompt) and _update_score() separately; added unit tests for each. Kept _update_satisfaction as a tiny wrapper for compatibility.
Radon artifact (58d): added scripts/gen_radon_report.py which writes JSON + minimal HTML to .nlco/meta/. Test: tests/test_radon_report.py monkeys patches subprocess to avoid external dependency.
timestamp_app_core._load_constraints split into small helpers (_tail_count, _constraints_text, _scroll_constraints_end) to simplify the constraints pane logic without behavior changes.
Extract subroutines from nlco_iter.iteration_loop (context build, model calls, writeback) to lower CC without changing behavior.
In timewarrior_module._apply_decision, add early returns for NONE/denied cases to flatten nesting. Radon Snapshot (2025-11-12, 01:35)
Repo average CC: C ≈ 14.89 (46 C–F blocks).
Top hotspots unchanged: agent_manual_pkg.tui.TUI.on_input_submitted D(26), dspy_programs/memory_gan.main D(24), dspy_programs/taskwarrior_agent.main D(21), plus several C(20–16) functions.
MI: no grade‑C production files; agent_manual_pkg/.../tui.py MI “C” (legacy/test heavy). Security scan (2025-11-13)
Scope: working tree, full Git history (patterns), and TruffleHog v2 entropy/regex pass.
High‑risk secrets: none found (no private keys, AWS/GitHub/Slack tokens, Bearer tokens).
.env files: only .env.example is tracked (placeholders). Any .env* in the tree are untracked.
Emails: only test/example emails in content; commit metadata naturally contains author emails (not part of file contents).
Nested repo note: telegram-mcp-repo/.git/ exists locally but is not tracked; avoid bundling this folder when archiving.
Largest blobs in history are code/log artifacts; no credential patterns detected in those blobs.

How to re‑run locally

Quick grep (HEAD): `rg -n --hidden -P -g '!/.git' -g '!/.git/' -e '-----BEGIN (RSA |DSA |EC |OPENSSH )?PRIVATE KEY-----' -e 'ghp_[A-Za-z0-9]{36,}' -e 'github_pat_[A-Za-z0-9_]{80,}' -e 'AKIA[0-9A-Z]{16}' -e 'ASIA[0-9A-Z]{16}' -e 'xox[bap]-' -e 'hooks.slack.com/services/' -e 'Authorization: Bearer [A-Za-z0-9_-.]+'
Full history (patterns): git rev-list --all | while read c; do git grep -I -n --full-name -E '(AKIA[0-9A-Z]{16}|ASIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36,}|github_pat_[A-Za-z0-9_]{80,}|xox[bap]-|hooks\\.slack\\.com/services/|BEGIN [A-Z ]+PRIVATE KEY|Authorization: Bearer )' "$c" || true; done
TruffleHog v2 (optional): python -m pip install --user trufflehog==2.2.1 && trufflehog --regex --entropy=True --json --since_commit $(git rev-list --max-parents=0 HEAD | tail -n1) --branch $(git rev-parse --abbrev-ref HEAD) file://$PWD

Recommended hardening

Add a pre-commit hook with gitleaks or trufflehog (kept minimal; fail on verified secrets only).
Enable GitHub secret scanning & push protection on the repo (if using GitHub).
Convert telegram-mcp-repo to a proper submodule or add packaging excludes so its local .git/ never ships.

PII scrub (2025-11-13)

Removed from Git history and remote: constraints.md, memory.md, short_term_memory.md, notes.md, info.md.
Local copies preserved (untracked) and added to .gitignore.
Safety: mirror backup at /tmp/agent-pre-scrub-mirror-YYYYmmdd-HHMMSS, tag pre-scrub-20251113-060334, branch backup/pre-scrub-20251113-060334.
Force-pushed rewritten history to origin (all branches + tags).
If collaborators exist: they must git fetch --all --prune and hard reset their branches (history was rewritten).

Pre-commit secrets scan (2025-11-13)

Hook: .pre-commit-config.yaml with a local secrets-scan that runs scripts/secrets_scan.sh over staged files.
Patterns are high-confidence only (no entropy): private keys, GitHub tokens, GitHub PATs, AWS AKIA/ASIA, Slack webhooks, Bearer tokens.
Setup once: python -m pip install --user pre-commit && pre-commit install
Run ad-hoc: bash scripts/secrets_scan.sh $(git diff --cached --name-only)
Tests: tests/test_secrets_scan.py covers clean and leaked cases.

PII double‑check (2025-11-13)

Local repo: no occurrences of constraints.md, memory.md, short_term_memory.md, notes.md, info.md in any commit; high‑risk patterns only appear in documentation and the scan script (not secrets).
Remote main: clean (no target files present).
Remote PR refs: refs/pull/{1..17}/head still contain the removed files (GitHub stores PR heads separately). Action: close these PRs and recreate from the new main. GitHub will GC unreachable objects over time; to accelerate removal, contact GitHub Support.
Re‑run locally: see “How to re‑run” in Security scan; remote check: git clone --mirror $REPO_URL /tmp/repo.git && cd /tmp/repo.git && <same scans>.

PII prevention policy (2025-11-13)

Separation: personal logs/notes live outside the repo (default paths remain constraints.md, memory.md, short_term_memory.md, notes.md, info.md but are .gitignored and treated as local state).
Blocking hooks: pre-commit secrets-scan is mandatory for contributors (pre-commit install). A second hook forbid-paths blocks staging any of the five Markdown files (matched by basename).
CI gate (planned): GitHub Actions job runs the same scans on PRs and fails if secrets or forbidden paths are touched.
Repo settings: enable GitHub “Secret scanning” and “Push protection” in Settings → Code security and analysis.
Packaging: add .gitattributes export-ignore entries for those files to keep them out of git archive and release tarballs (planned).
Incident playbook: if a leak occurs, run the scrub script (filter-repo), force-push with backups/tags, close PR refs, notify collaborators to hard reset.

Forbid-paths hook (2025-11-13)

Hook: .pre-commit-config.yaml forbid-paths runs scripts/forbid_paths.sh.
Deny-list: constraints.md, memory.md, short_term_memory.md, notes.md, info.md (basename match).
Tests: tests/test_forbid_paths.py ensures safe files pass and forbidden names fail.

Release (continued)

v0.1.46 (2025-11-13): Temporary headless hotfix — added a local artifact.md to avoid a first-run crash while investigating path alignment. The file remained ignored by Git.
v0.1.47 (2025-11-13): Proper fix — headless now uses the shared resolver (timestamp_app_core.resolve_artifact_path) so artifact.md lives under ~/.nlco/private (or env override). Removed the temporary repo file and added a tiny FileNotFoundError guard in _read_artifact_and_state() so first-run works without creating files up front. Tests: tests/test_nlco_iter_paths_env.py, tests/test_nlco_iter_missing_artifact.py.
v0.1.48 (2025-11-13): Align headless constraints path with TUI — CONSTRAINTS_FILE now uses timestamp_app_core.resolve_constraints_path() (defaults to ~/.nlco/private/constraints.md). Updated test tests/test_nlco_iter_paths_env.py to assert both artifact and constraints use shared resolvers.
v0.1.49 (2025-11-13): Add shared resolvers for memory files — resolve_memory_path() and resolve_short_term_path() default to ~/.nlco/private/{memory.md, short_term_memory.md} with env overrides. Headless now passes these to MemoryModule and TimewarriorModule. Tests updated: extended tests/test_timestamp_paths_env.py and tests/test_nlco_iter_paths_env.py.
v0.1.50 (2025-11-13): Add tests asserting artifact does not auto-scroll when TIMESTAMP_AUTO_SCROLL=0: tests/test_timestamp_artifact_no_autoscroll.py (core + wrapper). Complements default-top tests.
v0.1.51 (2025-11-13): Web app uses shared defaults for memory files — WebConfig.memory_path and WebConfig.short_term_memory_path now default via timestamp_app_core.resolve_memory_path/_short_term_path. Test: tests/test_web_paths_resolvers.py.
v0.1.52 (2025-11-15): Memory diffs — on replace_memory, NLCO now prints a minimal unified diff to the console after a successful replacement. Kept implementation tiny (uses difflib.unified_diff). Test: tests/test_memory_replace_diff.py asserts diff headers and changed lines.
v0.1.53 (2025-11-15): Added dspy_programs/concept_worldmodel_experiment.py, a minimal concept-worldmodel experiment using DSPy + DeepSeek. Concepts are Pydantic models, tagging uses a DSPy Signature with Pydantic output, and metrics cover tagging quality plus concept-structure via logistic regression. New tests: tests/test_concept_worldmodel_experiment.py exercise synthetic data generation, the ConceptTagger Pydantic path, and the metric helpers.
v0.1.54 (2025-11-15): Made dspy_programs/concept_worldmodel_experiment.py executable (shebang + chmod) so it can be run directly as ./dspy_programs/concept_worldmodel_experiment.py once dependencies and environment are configured.
v0.1.55 (2025-11-16): Added dspy_programs/concept_world_model.py, a multi-step concept world-model + control experiment. It simulates a hidden-state reactor env, tags STATE concepts via an LLM, represents ACTION concepts in the same concept universe, fits a linear RewardModel on discounted returns G_t ≈ f([state_concepts_t, action_bits_t]), discovers high-reward concept pairs, asks the LLM to mint new abstract STATE concepts from those pairs, re-tags with the expanded schema, and runs a greedy actor that chooses actions by argmax predicted discounted reward.
v0.1.56 (2025-11-16): Added dspy_programs/concept_world_model_v3.py, an updated world-model + control script that fixes the pre-/post-action mismatch (training on PRE-action observations so Q(s_t, a_t) semantics are consistent), increases meltdown frequency via more aggressive glitch dynamics and lower thresholds, and uses a fully unified concept feature space: X_t is a K-dimensional vector over all concepts (STATE bits from the LLM, ACTION one-hots from the env, MODEL concepts currently zeroed). It computes Monte-Carlo returns G_t under a random policy, fits a linear RewardModel G_t ≈ f(X_t), and acts greedily by embedding current STATE bits + candidate ACTION bits into X_t and taking argmax predicted return. Default run is small (num_episodes=5, max_steps=12) to keep LLM calls manageable.
v0.1.57 (2025-11-17): Updated dspy_programs/concept_world_model_v3.py concept encoding. STATE concept activations now use a ternary scheme in training data: 1 = present, -1 = explicitly absent, 0 = missing / not defined at that time (e.g. concepts introduced later). The LLM still returns boolean activations; the tagger maps them into {-1, 0, 1}. Future-occupancy analysis now treats presence as value > 0 when building discounted occupancy signals, so S_j(t) remains “discounted future presence” even with the richer encoding. RewardModel features embed these ternary STATE values directly; ACTION concepts remain one-hot (0/1). Running v3 still requires valid OpenRouter credentials for DSPy.
v0.1.58 (2025-11-17): Switched the default LM for concept_world_model_v3 to Gemini 2.5 Flash over OpenRouter (openrouter/google/gemini-2.5-flash) and added a --lm CLI flag so the model can be overridden (e.g. --lm openrouter/deepseek/deepseek-reasoner). DSPy is now configured inside main() from this flag; the module still defines a default LM at import time for ad-hoc use, but script runs will reconfigure it based on --lm. Without a valid OPENROUTER_API_KEY, runs will fail during tagging or concept creation.
v0.1.59 (2025-11-17): Added tests/test_concept_world_model_v3.py to cover key pieces of the v3 world-model script without calling real LMs. Tests stub the DSPy Predict call to feed a fixed activations_json into LLMConceptTagger.tag_state and assert the ternary encoding ([1, -1, 0] for present / absent / missing). Additional tests exercise EpisodeDataset.build_discounted_reward() for (a) a no-meltdown episode (checking discounted Monte Carlo returns) and (b) an episode that ends in meltdown (reward at the meltdown step is 0 and earlier steps do not see any future reward beyond that point). A fourth test patches LogisticRegression and ConceptCreator so that _analyze_concept_future() deterministically appends a new meta-concept to the Experiment.universe, verifying that new STATE concepts are added correctly without depending on real model weights. Two more tests assert that malformed activations_json (invalid JSON or wrong schema) causes Pydantic ValidationError to propagate from tag_state, so schema violations fail loudly instead of being silently treated as zeros. Two further tests cover future-occupancy analysis: one asserts that when all concept future-occupancy targets are constant, _analyze_concept_future() prints the “all zero/constant” message along with per-concept y means (always 0/1); the other constructs a mixed target for a single concept and asserts that this message is not emitted. All eight tests pass with pytest -q tests/test_concept_world_model_v3.py.
v0.1.60 (2025-11-17): Made Pydantic validation failures in the concept-tagging path very visible at runtime. LLMConceptTagger.tag_state now wraps ConceptActivations.model_validate_json(...) in a try/except ValidationError block; on failure it uses rich to draw a red rule with a clear message, prints the raw activations_json payload, and then re-raises the original exception. This keeps the behavior non-fallback (errors still fail fast) while making it obvious in logs when the LLM returned malformed or schema-incompatible activations.
v0.1.61 (2025-11-17): Relaxed the default cosine-similarity threshold for future-occupancy meta-concept creation. Experiment now takes a corr_threshold parameter (CLI flag --corr-threshold, default 0.0); _analyze_concept_future uses this instead of a hard-coded 0.8. With the default, any positively correlated pair (sim > 0) is eligible, but we still cap additions with max_new and keep the “no structure” path when weight vectors are all zero. Experiment.run_greedy_actor_demo also prints concepts_added_total and samples_total in each step, and _analyze_concept_future reports how many concepts were created in that analysis plus the running total. To avoid sklearn shape errors after adding concepts, _greedy_action resets the LinearRegression coefficients when the feature dimension changes.
v0.1.62 (2025-11-17): Removed vestigial baseline_* fields from run_greedy_actor_demo output; the earlier Monte Carlo warmup (which had populated baseline_overall_mean and baseline_action_means) is gone, so those baselines were always 0/empty and misleading. Tests now assert that per-step logs contain no baseline_mean_G or baseline_action_means strings. A --difficulty CLI flag was added to Experiment/ReactorEnv to scale instability: difficulty >1.0 multiplies glitch probability, scales stress/margin drift, and tightens meltdown thresholds slightly; difficulty <1.0 does the opposite. Default remains 1.0.
v0.1.63 (2025-11-17): ReactorEnv now distinguishes between base difficulty and a per-step current_difficulty() which ramps up with the episode progress (step_idx / max_steps) and resets on reset(). Glitch probability, drift, and meltdown thresholds use current_difficulty() rather than the raw base value, and the per-step log line prints difficulty=<current> so you can see the ramp. The reward model is refit once per episode at the end, using Monte-Carlo returns over that episode; a compact summary of the fit (samples, mean_G, and top 3 weights) is logged after each episode. RewardModel.predict pads/truncates feature vectors to the learned feature dimension instead of resetting the model when new concepts are added; if the model has never been fit, it behaves as a zero predictor. Tests cover padding/truncation, meltdown behavior, epsilon-greedy exploration, operator notes (including steam/turbine hints), and the reward-model fit log line. The reward model now uses Lasso(alpha=0.01) instead of plain LinearRegression/Ridge to encourage sparse concept weights; Experiment tracks how often each concept’s reward-model coefficient is (effectively) zero across fits, and the fit summary prints out always_zero=[...] as a list of concepts that have been zero in every fit so far (natural pruning candidates). A new --max-concepts CLI flag (default 15) sets a hard cap on total concepts; when exceeded, the most frequently zero-weight STATE concept is dropped after each episode, never dropping ACTION/MODEL concepts.
v0.1.64 (2025-11-17): Simplified the base STATE concept definitions in concept_world_model_v3 to be short and generic (“core under high stress”, “safety margin small”, “channel has frequent disturbances”, “operator annoyed/angry”, “situation clearly dangerous”). IDs are unchanged (STRAINED_CORE, THIN_MARGIN, STORMY_CHANNEL, AGITATED_OPERATOR, CRITICAL_MODE) so tests and downstream logic remain valid, but prompts to the LLM are now less opinionated and easier to swap out for other environments.

Next Steps (2025-11-13)

90a. Add a test verifying artifact does not auto‑scroll when TIMESTAMP_AUTO_SCROLL=0 (symmetry with constraints). Recommended.
90b. Optional: CLI/env toggle to select artifact top/bottom; add 1–2 tests.
90c. Consider constraints_io.append_daily_line(now, message) helper to encapsulate last‑date detection; 1 unit test.

Things learned (2025-11-13)

Align paths across headless + TUI to avoid surprises; using the shared resolver eliminates the repo-root dependency.
Minimal guard (not a broad fallback): treating missing artifact as empty input allows first-run without creating files ahead of time. Things learned (2025-11-15)
Concept-worldmodel experiments live under dspy_programs/concept_worldmodel_experiment.py; they rely on DSPy’s Pydantic integration (Pydantic models as input/output fields) and scikit-learn for simple logistic-regression structure probes. Tests stub the DSPy Predict call so no real LM traffic occurs; when adding similar experiments, follow this pattern to keep tests fast and offline-friendly.
When adding new Python scripts meant to be run directly (not just imported), add a shebang and ensure the file has the executable bit set in the workspace so ./path/to/script.py works without extra chmod steps.
Running ./dspy_programs/concept_worldmodel_experiment.py without OpenRouter credentials currently fails with a litellm AuthenticationError (HTTP 401). This is expected; configure OPENROUTER_API_KEY (and any other dspy/OpenRouter settings) in the environment before running the full experiment to avoid noisy stack traces and repeated failures.
Invoking the concept-worldmodel script via zsh -ic ./dspy_programs/concept_worldmodel_experiment.py does not change this requirement; an interactive shell run still needs valid OpenRouter credentials. A previous run was aborted mid-call, but after sourcing ~/.env_api_keys in bash the script completed end-to-end and printed tagging + structure metrics, confirming the wiring works with a valid key.
The concept-worldmodel script seeds both Python’s random and NumPy (random.seed(42), np.random.seed(42)) at the start of main(), so the synthetic dataset (observations + ground-truth labels) is reproducible across runs. LM outputs may still vary depending on the model’s temperature/config.
The concept-worldmodel script now prints the full concept list once at startup via print_concepts() (id + definition for each concept) before generating data. Tests (test_print_concepts_lists_ids) assert this helper emits all concept IDs, so future changes to the concept set must keep the description output in sync.
The concept-worldmodel script logs the first few examples (indices, observations, true labels, and predicted activations) to .nlco/meta/concept_worldmodel_samples.jsonl via log_samples_jsonl(). Tests (test_log_samples_jsonl_writes_lines) cover the helper; when debugging or comparing runs, inspect this JSONL rather than scrolling back through the full console output.
Concept-worldmodel hyperparameter search:
- For the structure models (C_j ~ other concepts), we now run a tiny random hyperparameter search over logistic regression settings once, globally, instead of per concept:
  - We sample three (C, max_iter) pairs with C ~ 10**U[-2,2] and max_iter ~ U{200..800}, fit a model for each concept and each trial, and aggregate validation CE/acc/AUC + total ΔCE across all non-degenerate concepts.
  - We print a compact table (Hyperparam search for structure models (shared across concepts)) showing all trials and their average scores plus sum_ΔCE, then select the best trial by sum_ΔCE and reuse its (C, max_iter) for all per-concept reports and ΔCE_self.
- The usefulness summary is labeled “ΔCE_self (predictable from others)” and sorted by ΔCE_self descending, with frac_of_total showing each concept’s share of the total ΔCE_self sum. A one-line explanation clarifies that ΔCE_self = baseline_CE − best_CE for predicting that concept from the others, so higher values mean a concept is more structured/predictable in the learned concept space.
Concept-worldmodel pipeline (detailed notes):
- High-level goal: let an LLM create and use abstract concepts as a compact world model, then use a separate ML layer to learn structure over those concepts (predictive relations, redundancies, higher-level abstractions). The outer objective is “make the system less surprised” (better prediction / compression), not just classify labels.
- Concept representation: concepts are treated as predicates over observations (C(o) ∈ {0,1} now, optionally [0,1] later). In code we keep them as Pydantic Concept(id, definition) objects so they can be passed into DSPy signatures dynamically and used as stable keys in matrices.
- Binary vs scalar activations: v1 uses binary activations (0/1) for simplicity and for clean metrics; the design allows upgrading to scalar scores ([0,1] confidence) later, with thresholding for discrete models and raw scores for richer ML.
- What makes a concept “good”:
  - Compression/MDL view: adding the concept should reduce L(data | K) + L(K); i.e. it lets us encode many observations with fewer bits.
  - Predictive view: the concept should improve prediction of external targets or of other concepts.
  - Invariants view: conditioned on the concept being true, some other properties become very stable (low entropy).
- Concept–observation matrix: for N observations and K concepts we build Z ∈ {0,1}^{N×K} (or scores). On top of this, we run standard ML:
  - For each concept C_j, train logistic regression C_j ~ Z_{-j}; compare baseline CE (constant pos_rate) vs model CE to get ΔCE, plus acc/AUC.
  - Concepts that are highly predictable from others are likely composite/redundant; concepts that are hard to predict are more primitive/independent.
  - The same machinery can be pointed at external labels (e.g. failure) to get concept importances.
- Causality vs compression: we treat MDL/compression + predictive fit as a pragmatic heuristic for structural learning in concept space, not as full causal discovery. Score-based causal-learning ideas inspire the design (simpler factorisations that still predict well) but we do not try to identify causal direction; we just learn “useful, stable dependencies” between concepts.
- Synthetic worlds: to avoid contamination from real-world knowledge, we use synthetic simulators with hidden numeric rules. The current script uses a reactor log world (hidden stress, margin, glitches, tone → textual logs + concept labels STRAINED_CORE, THIN_MARGIN, STORMY_CHANNEL, AGITATED_OPERATOR, CRITICAL_MODE), but the same pattern could be reused for other toy worlds.
- Logprobs and fuzziness: the inner concept layer does not require token logprobs; we can treat the LLM as a black-box concept tagger. Logprobs are only needed if the outer objective is “reduce raw text perplexity given a world-model prefix”. For concept learning and structure analysis we work entirely in concept space (bits/scores + CE/ΔCE).
- Concept auto-regression and higher-level concepts: in the full design, we:
  - Fit sparse models C_j ~ other Cs to see which concepts cluster and which combinations have strong predictive power.
  - Use those patterns (e.g. A & B & ¬C) as candidates for new higher-level concepts, then ask the LLM to propose a short natural-language definition from positive vs negative examples.
  - Keep both: a machine-side pattern (function of base concepts) and a compressed text definition (handle for the LLM).
- Multi-step / future-aware concepts: for sequential data, instead of failure_step_23 we define trajectory-wide labels like WILL_EVENTUALLY_FAIL(t) or FAIL_WITHIN_H(t) and treat them as concepts that depend on current state. This lets us train per-step “failure risk” concepts without exploding the concept set by time index.
- Text definitions and compression: for new composite concepts, the definition prompt should not expose the raw logical pattern (A, B, C). Instead we show positive and negative examples and ask for a short abstraction, explicitly discouraging “laundry list” definitions. We can enforce compression by constraining length and rejecting definitions that are not materially shorter than spelling out the components. Things learned (2025-11-16)
The multi-step concept world-model script lives at dspy_programs/concept_world_model.py. It keeps the LLM’s role minimal (STATE tagging + concept naming) and pushes control and value learning into a tiny linear RewardModel over concept bits, plus a greedy actor that selects actions by predicted discounted reward.
The new script currently uses bare dspy.InputField() / dspy.OutputField() (no positional ...) in its Signatures; older InputField(...) usage now fails against the local DSPy version with TypeError: InputField() takes 0 positional arguments but 1 was given. When adding new Signatures, follow the zero-positional-args pattern so they stay compatible.
Running python dspy_programs/concept_world_model.py without valid OpenRouter credentials fails during the first tagging pass with a litellm.AuthenticationError (“No auth credentials found” / HTTP 401) from the OpenRouter backend. This is expected; configure OPENROUTER_API_KEY (e.g., via source ~/.env_api_keys) before running the full multi-step experiment, especially since it makes many LLM calls (one per step, plus concept-creation calls).
The multi-step script also seeds random and numpy inside Experiment.run() (random.seed(42), np.random.seed(42)), so the synthetic trajectories (hidden state, actions under the random behavior policy, and meltdown events) are reproducible across runs. LLM outputs (STATE tags and new concept definitions) may still vary with model temperature/config.
To keep runs tractable in this environment, concept_world_model.py now defaults to num_episodes=5 in main(), which means ~50 LLM calls for the initial tagging pass plus a few more for the greedy-actor demo. Bigger runs (e.g., 40 episodes) are possible by editing the Experiment arguments or wiring a small CLI/env override, but be aware they scale linearly in LLM calls and wall time.
For debugging transparency, concept_world_model.py now prints all LLM inputs and outputs for STATE tagging and concept creation: before each tagging call it prints the observation and the current STATE concept set (id + source), then prints the returned activation dict; before each new-concept call it prints the pattern concepts, pattern description, and a couple of positive/negative examples, followed by the generated concept id/definition. This is intentionally verbose and should be left as-is while we’re iterating on the world-model design; if it becomes too noisy for larger runs, consider gating it behind a simple verbosity flag or env var rather than removing it outright.
The v3 script (concept_world_model_v3.py) keeps the same broad structure but makes a few semantic fixes and runtime tweaks:
- RL semantics: episodes are logged with PRE-action observations (obs_t) and actions a_t, and Monte-Carlo returns G_t are computed from step t onward; the RewardModel is trained on (obs_t, a_t, G_t) so its predictions are genuinely Q-like for the random behavior policy. At inference time the greedy actor also operates on PRE-action observations, embedding STATE bits and each candidate ACTION bit into the unified concept vector before querying the RewardModel.
- Environment tuning: glitch probability now ramps up more strongly with high stress/low margin, and the meltdown rule is relaxed to stress > 85, margin < 25, glitches >= 2, which produces meltdowns with reasonable frequency over ~10–15 steps. This gives the RewardModel a more informative target signal than the earlier almost-always-survive regime.
- DSPy integration: to stay compatible with the current DSPy version, v3 uses InputField(desc=...) / OutputField(desc=...) and a JSON-string output (activations_json) for the tagging Signature, parsing into the ConceptActivations Pydantic model via model_validate_json. Avoid using a field named labels on Signatures, since that collided with internal attributes in earlier experiments and produced confusing 'function' object errors at runtime.
- Runtime: a full v3 run with num_episodes=5 and max_steps=12 (60 decision steps, plus 3 greedy episodes) completes in ~5 minutes in this environment with the current OpenRouter-backed DeepSeek model. Increasing episodes/steps will raise runtime roughly linearly due to one LLM call per state tag, so any future CI-style tests for this script should stub the LM instead of hitting the real endpoint.
- LLM I/O visibility: like the original concept_world_model.py, v3 now prints all LLM inputs and outputs for STATE tagging and concept creation. For tagging it prints the STATE concept set once, then for each observation logs [LLM TAG INPUT v3] (index + full text) and [LLM TAG OUTPUT v3] (concept→0/1 dict). For concept creation (once we have strong pairs) it logs [LLM NEW-CONCEPT INPUT v3] with pattern concepts, description, and a couple of positive/negative examples, followed by [LLM NEW-CONCEPT OUTPUT v3] with the proposed id/definition. This is intentionally verbose for debugging; if it becomes cumbersome for larger experiments, consider adding a simple verbosity flag or env gate instead of removing the logs.
- Output formatting: v3 now uses Rich (rich.console.Console, rich.table.Table) for key sections. The random-policy simulation, baseline stats, and train/test split are wrapped in console.rule() headers and printed with color; baseline per-action returns are shown in a small Rich table. The greedy-actor demo prints one line per step with colored fields for the chosen action, predicted per-action returns, baseline means, and running average rewards (global + per-episode), plus colored meltdown/no-meltdown markers. The LLM tagging logs remain plain print() calls to keep them simple; only the high-level experiment/RewardModel/actor sections use Rich styling.
- Control knobs: the --episodes/--epochs CLI arg now controls both (a) the number of random-policy episodes used to collect training data and (b) the number of greedy-actor episodes in the demo (we pass num_episodes into both phases). An additional --epsilon arg toggles epsilon-greedy exploration during the greedy phase; with --epsilon > 0, each step chooses a random action with probability ε instead of the argmax predicted return. This keeps the model-learning phase Monte-Carlo over a random policy, while allowing exploration in the evaluation/control phase without touching the RewardModel itself.
- Environment hardness + reward: we simplified away the separate random warmup and now start directly in ε-greedy episodes with a trivial zero-initialized RewardModel. The env gained a simple demand variable ("low"/"high") which is exposed in the observation text (“grid demand is modest” vs “grid demand is elevated; operators are asked to keep output up”). Reward is now demand-sensitive: meltdown yields a large penalty; otherwise we add a base survival reward plus an “output” term (push > steady > cool) that is rewarded when demand is high and mildly penalized when demand is low if output is far from a moderate level. This creates a minimal, sensible trade-off where “always cool” or “always push” is no longer obviously best; the agent must read both STATE concepts and demand text to balance safety vs output over time.
- Minimal implementation scope: the current script covers the first step only:
  - Hand-defined concepts (5 reactor concepts).
  - Synthetic data with hidden numeric rules.
  - LLM tagging vs ground truth (acc/F1).
  - Concept-from-concept structure via logistic regression (ΔCE, acc, AUC).
  - A small JSONL log (concept_worldmodel_samples.jsonl) with a few observations, labels, and predicted activations for inspection.
  - No concept importance ranking, automatic concept creation, MDL selection, or sequence modeling yet—those are future layers on top of the same representation.
v0.1.61 (2025-11-17): Updated the base STATE concepts in dspy_programs/concept_world_model_v3.py to be simple, generic text features instead of reactor-specific ones. The new seed vocabulary is TEXT_LENGTHY, STRUCTURED_TEXT, TECHNICAL_TONE, and SPECIAL_SYMBOLS, with short definitions about length, structure, technicality, and unusual symbols. This better matches the intent to have schema-level concepts like “text length / structured text / technical / special symbols” as the starting point, and keeps the LLM prompts more portable across different simulated worlds. Tests for v3 still pass since they only depend on counts/shapes and not on the specific IDs.
v0.1.62 (2025-11-17): Reduced L1 regularization strength in the v3 RewardModel from alpha=0.001 to alpha=0.0001 (Lasso(alpha=0.0001, max_iter=10000) in concept_world_model_v3.py). This keeps the sparsity signal for concept pruning but lets more weight mass survive, which should make Q-value fits less biased under small data. tests/test_concept_world_model_v3.py still passes (21 tests); sklearn still emits convergence warnings in a couple of tiny synthetic tests, which we accept for now to keep the implementation minimal.
v0.1.63 (2025-11-17): Switched TagConcepts in concept_world_model_v3 from a raw JSON-string output (activations_json) to a typed ConceptActivations output field. LLMConceptTagger.tag_state now calls ConceptActivations.model_validate(out.activations) instead of model_validate_json, so DSPy/Pydantic handle the structure directly and we no longer ask the LM to hand-build JSON. Tests were updated to stub out.activations with Python objects instead of JSON strings, and two error-path tests now assert that malformed activations values still raise ValidationError. This keeps the “fail loudly on bad schema” behavior while reducing the chance of brittle JSON formatting bugs like stray f{ in the payload. tests/test_concept_world_model_v3.py continues to pass (21 tests), and a 3-episode manual run with Gemini 2.5 Flash completed without tagger validation errors.
v0.1.64 (2025-11-17): Switched the v3 RewardModel from L1 (Lasso) to L2 (Ridge) while keeping the same concept-importance and zero-weight tracking logic. This keeps the model strictly linear but removes the hard sparsity assumption, which should make Q-value fits smoother under small, noisy datasets. We still treat coefficients with |w| < 1e-6 as “effectively zero” for pruning stats. tests/test_concept_world_model_v3.py still passes (21 tests) after the change.
v0.1.65 (2025-11-17): Changed v3’s concept-pruning rule: when max_concepts is exceeded, we now drop the STATE concept with the lowest average importance under the RewardModel, rather than the one with the highest zero-weight count. Experiment tracks concept_importance_sums and concept_importance_updates and computes a simple mean score per concept; pruning chooses the minimum of that mean over LLM concepts only. The console message now reports avg_importance=... instead of zero-counts. Test test_max_concepts_drops_lowest_importance_state_concept was updated to synthesize importance sums and assert that the lowest-importance concept is removed when pruning triggers; the v3 test module still passes (21 tests).
v0.1.66 (2025-11-17): Tuned v3’s RewardModel Ridge regularization; current setting is alpha=0.1, a mild L2 penalty that stabilizes fits without making weights overly small. This still behaves like a plain linear Q-model in concept space but avoids some of the ill-conditioning seen with alpha=0.0. tests/test_concept_world_model_v3.py continues to pass after the change.
v0.1.67 (2025-11-17): Added a global noise factor to the v3 reactor environment and CLI. ReactorEnv now takes noise (>=0) and scales all stochastic dynamics with it: the random uniform deltas for stress, margin, and output in step() are multiplied by noise, glitch probability is scaled by difficulty * noise, and demand-flip probability is min(1, 0.2 * noise). Experiment accepts noise and forwards it, and concept_world_model_v3.py exposes --noise (default 1.0). A new test test_noise_factor_scales_push_dynamics seeds Python’s RNG and asserts that a PUSH step with noise=2.0 produces a larger |Δstress| than with noise=0.1 under the same underlying random draw. The v3 test module now has 22 passing tests.
v0.1.68 (2025-11-17): Extended the v3 reactor observation string to include the exact numeric power output. ReactorEnv._make_observation now appends (measured reactor output={self.output:.3f}). to the operator note, so logs include both qualitative steam/turbine hints and a precise scalar in the same sentence. New test test_observation_includes_exact_power_output sets a fixed env.output and asserts that the formatted value appears in the observation. The v3 tests now total 23 and all pass.
v0.1.69 (2025-11-17): After each episode in v3, the summary block now prints a running average episode reward across all episodes seen so far, in addition to per-episode totals and the cumulative sum. Specifically, run_greedy_actor_demo computes running_avg = cumulative / len(episode_summaries) and logs Running avg episode reward: ... after the “Episode totals so far” list and “Cumulative total reward”. New test test_episode_summary_includes_running_avg stubs the env/policy, runs two episodes, and asserts that the string appears in the captured console output. The v3 test module now has 24 passing tests.
v0.1.70 (2025-11-17): Changed v3’s greedy-learning phase to use a replay-style buffer and always train the RewardModel on all collected samples, not just the last episode. Experiment now keeps memory_states (dicts concept_id→value), memory_actions, and memory_returns; after each episode it computes discounted returns for that episode, appends them to the buffers, then calls _build_training_data() to embed every stored sample into the current concept universe (STATE bits by id, ACTION one-hot, missing concepts as 0). RewardModel.fit is then called on this full dataset each time. This guarantees that earlier episodes continue to influence Q-value estimates even after new concepts are introduced, without misaligning feature indices. Tests still pass (24); the reward-fit logging now reports samples= as the total replay size instead of per-episode length.
v0.1.71 (2025-11-17): Added a small regression test to ensure that the numeric power output in v3 stays in sync with the operator note. test_power_output_in_note_tracks_updates sets env.output twice (0.500, then 1.250) and checks that _make_observation() shows the corresponding (measured reactor output=...) substring each time, confirming the note always reflects the latest output. The v3 test module now has 25 passing tests.
v0.1.72 (2025-11-17): Fixed a v3 pruning quirk where the meta-concept created in a given future-occupancy analysis could be immediately dropped by the max_concepts budget. Experiment now tracks new_state_concepts_last_analysis (reset at the start of _analyze_concept_future) and appends any newly added STATE concept ids to it; the pruning candidates list excludes those ids, so only pre-existing STATE concepts are considered for removal in that episode. This keeps freshly minted concepts around for at least one training/pruning cycle. All v3 tests still pass (25).
v0.1.73 (2025-11-17): Added test_newly_created_concepts_not_pruned_immediately to the v3 tests. It stubs ConceptCreator.create to return a known meta-concept id, runs _analyze_concept_future() to add it, seeds concept_importance_sums so an older base concept is weakest, and then reconstructs the pruning candidate set as in run_greedy_actor_demo. The assertion ensures the new concept id is not in the candidates, locking in the “don’t immediately drop the concept we just created” behavior. The v3 test module now has 26 passing tests.
v0.1.74 (2025-11-17): Tightened the v3 meta-concept prompt to prefer “cause-of-correlation” concepts rather than simple conjunctions. ProposeNewConcept’s docstring and field descriptions now explicitly ask the LLM to describe a plausible underlying situation or cause that would make the two pattern concepts tend to be true together (given their high future-occupancy correlation), and to avoid restating or conjoining the input IDs. The pattern_desc we pass from PairwiseConceptDiscovery.discover() was updated to mention strongly correlated discounted occupancy and to ask for an explanation of that shared cause. No code flow changes; tests still pass (26).

Related Documents

Browser-only development

Claude Agents — Reference & Recommendations

Golden DKG Prototype -- Master Plan

Swarms Examples Index