Automated end-to-end testing for AI agent skills (agentskills.io). Launches Claude Code and Cursor as subprocesses, runs scenarios in real workspaces, and asserts what the model actually does.
# skillprobe

Release notes: see [CHANGELOG.md](CHANGELOG.md) or the [GitHub Releases page](https://github.com/Anyesh/skillprobe/releases).

Automated testing for LLM skills. Launches Claude Code or Cursor as subprocesses, runs scenarios in isolated workspaces, and reports what passed and what didn't.

Skills are just text injected into the LLM context, and LLMs are probabilistic, so skills will get ignored some percentage of the time no matter how carefully you word them. If you want hard enforcement, hooks are the right tool since they run deterministically every time. But hooks can only check things after the fact (linting, file restrictions, blocked commands). They can't guide the model toward better architectural decisions, teach it your team's domain conventions, set the tone of code review feedback, or help it reason through a multi-step workflow. Skills handle that side, and skillprobe measures how reliably they do it.

## When you need this

If you write a few personal skills and tweak them by feel, you probably don't need this. That loop is fast and good enough for individual use. Where it breaks down:

- **Model updates break skills silently.** Anthropic ships a new Sonnet, Cursor updates its agent, and a skill that worked last week now produces different output. Nobody notices because nobody retested.
- **Teams sharing skills.** When 20 engineers share a "code review" skill, one person's gut check isn't repre
Agent that generates comprehensive documentation, API references, architecture diagrams, and developer onboarding guides from existing code.
Agent configuration for systematic bug investigation that traces issues from error logs through the codebase to root cause with suggested fixes.
Agent for integrating third-party APIs including SDK setup, type generation, error handling, retry logic, and rate limit management.
Cursor's built-in autonomous coding agent that can make multi-file edits, run terminal commands, search the codebase, and iteratively build features with minimal human intervention.
Cloud-based autonomous coding agent that runs in the background on remote sandboxed environments, handling complex multi-step tasks while you continue working.
Cursor's multi-file editing agent within Composer mode that can create, edit, and delete files across your entire project in a single conversation.