agentsec-eval added to PyPI
AI‑Agent Framework: LLM‑as‑Judge + SSH, Markdown, JSON, YAML, & Policy‑Driven Execution
(A deep‑dive summary of the full article, which runs to ~15,739 characters (~2,500 words). The summary below expands on every major section, component, workflow, evaluation, and implication, using Markdown formatting for clarity.)
1. Introduction
The article opens by positioning the rise of large language models (LLMs) as a catalyst for autonomous, “agent‑like” software that can reason, plan, and act in complex environments. While LLMs excel at language understanding and generation, deploying them as decision‑making engines in real‑world systems requires careful orchestration: safety, interpretability, and enforceable policies. The authors propose a modular framework that treats an LLM not simply as a language generator but as a judge that evaluates and orchestrates lower‑level agents capable of interacting with external resources via SSH, Markdown, JSON, YAML, and policy enforcement.
Key motivations highlighted include:
- Trust & Safety: Preventing unbounded or malicious actions by the LLM.
- Auditability: Providing transparent logs of decisions, outputs, and policy compliance.
- Scalability: Enabling many agents to run concurrently while sharing a single LLM instance.
- Modularity: Decoupling high‑level reasoning from low‑level execution.
2. Background & Related Work
2.1 LLM‑as‑Agent Paradigms
Previous research has explored LLM‑based agents (e.g., ReAct, Plan‑Act‑Check, and Tool‑Use models). These agents typically interleave language prompts with calls to external APIs or tools. However, most approaches treat the LLM as an executor rather than a verifier or judge, making it hard to constrain its behavior beyond a simple prompt.
2.2 Policy‑Driven Systems
Existing policy enforcement mechanisms (e.g., Open Policy Agent, XACML) operate at the level of requests and responses, but rarely integrate with LLMs that generate unstructured text. The article argues that blending LLM reasoning with policy checks yields a safer architecture.
2.3 Agent Observation & Logging
Previous agent tooling (e.g., LangChain and retrieval‑augmented generation pipelines) provides observation logs but often relies on proprietary logging pipelines. The authors introduce Agent Observation as a standardized JSON format that captures every step, decision, and tool invocation.
3. System Overview
At a high level, the framework consists of the following layers:
| Layer | Purpose | Key Components |
|-------|---------|----------------|
| 1. Input Layer | Accepts user commands, natural language queries, or task specifications. | Prompt template, context enrichment |
| 2. LLM‑as‑Judge | Evaluates task feasibility, generates a high‑level plan, and oversees policy compliance. | Prompt engine, internal state |
| 3. Agent Adapter | Translates LLM outputs into actionable tool calls. | YAML parsing, command scheduler |
| 4. Tool Execution Layer | Executes commands on remote hosts via SSH or local processes. | SSHExecutor, local exec |
| 5. Observation & Logging | Records every step, decision, and output. | Agent Observation, JSON logs |
| 6. Policy & Authorization | Enforces rules and access controls. | SSHExecutor CommandPolicy, authorization middleware |
| 7. Output Layer | Formats results back to user in Markdown or JSON. | Formatter, output renderer |
The LLM‑as‑Judge sits between the user and the underlying tools, acting as a gatekeeper that must assert that each proposed action is permissible. The Agent Adapter parses the LLM’s natural‑language instructions into a structured YAML that the executor can consume. The Observation module serializes each step into a JSON log for auditability.
4. Core Components
4.1 LLM‑as‑Judge
Role: Instead of blindly following the LLM’s instructions, the judge evaluates them.
- Prompt Engineering: The judge prompt includes a policy header that reminds the model of its constraints (e.g., “Do not execute commands that modify system state without explicit authorization”).
- State Management: The judge maintains a history of actions and decisions, enabling it to reason about dependencies and temporal constraints.
- Assertion Generation: The judge issues assertions like “Can I execute `rm -rf /tmp`?” which the policy layer evaluates.
Benefits:
- Centralized reasoning reduces the risk of divergent behaviors across agents.
- Allows dynamic adjustment of policies at runtime.
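The three judge responsibilities above (policy header, state, assertion) can be sketched as a small prompt builder. This is a minimal illustration rather than the framework's actual code: `POLICY_HEADER`, `JudgeState`, and `build_judge_prompt` are hypothetical names, and the header wording simply reuses the example constraint quoted above.

```python
from dataclasses import dataclass, field

# Hypothetical policy header; the wording reuses the example constraint above.
POLICY_HEADER = (
    "You are a judge. Do not execute commands that modify system state "
    "without explicit authorization."
)


@dataclass
class JudgeState:
    """History of proposed actions and the decisions made on them."""
    history: list[tuple[str, str]] = field(default_factory=list)

    def record(self, action: str, decision: str) -> None:
        self.history.append((action, decision))


def build_judge_prompt(state: JudgeState, proposed_action: str) -> str:
    """Combine the policy header, prior decisions, and the new assertion."""
    lines = [POLICY_HEADER, "", "Previous decisions:"]
    if state.history:
        lines += [f"- {action} -> {decision}" for action, decision in state.history]
    else:
        lines.append("- (none)")
    lines += ["", f"Assertion: Can I execute `{proposed_action}`?"]
    return "\n".join(lines)


state = JudgeState()
state.record("ls /tmp", "allowed")
prompt = build_judge_prompt(state, "rm -rf /tmp")
print(prompt)
```

Keeping the history in one object is what would let a single judge reason about dependencies between earlier and later actions.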
4.2 AgentAdapter
Purpose: Bridge the gap between the LLM’s free‑form output and the executor’s strict command interface.
- YAML Parsing: The adapter accepts a YAML structure that lists `commands`, `arguments`, `environment`, etc.
- Validation: Ensures the YAML conforms to the schema (e.g., no unsupported keys).
- Conversion: Transforms YAML into a sequence of SSH command objects.
Schema Example:

    commands:
      - name: grep
        args: ["-r", "TODO", "/home/user/projects"]
      - name: find
        args: ["./", "-name", "*.py"]

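The validation and conversion steps might look roughly like the following sketch. It assumes the YAML has already been parsed into plain Python dicts (for instance with PyYAML's `yaml.safe_load`), and the names `SSHCommand`, `PlanValidationError`, and `plan_to_commands` are hypothetical. The allowed key set (`name`, `args`, `environment`) is likewise an assumption pieced together from the schema example and the bullet list above.

```python
from dataclasses import dataclass, field

# Assumed schema: only these keys are accepted (illustrative, not the real schema).
ALLOWED_TOP_LEVEL_KEYS = {"commands"}
ALLOWED_COMMAND_KEYS = {"name", "args", "environment"}


class PlanValidationError(ValueError):
    """Raised when a parsed plan does not match the assumed schema."""


@dataclass
class SSHCommand:
    name: str
    args: list[str] = field(default_factory=list)
    environment: dict[str, str] = field(default_factory=dict)


def plan_to_commands(plan: dict) -> list[SSHCommand]:
    """Validate a parsed YAML plan and convert it to command objects."""
    unknown = set(plan) - ALLOWED_TOP_LEVEL_KEYS
    if unknown:
        raise PlanValidationError(f"unsupported top-level keys: {sorted(unknown)}")
    commands = []
    for index, entry in enumerate(plan.get("commands", [])):
        if not isinstance(entry, dict):
            raise PlanValidationError(f"command {index} is not a mapping")
        unknown = set(entry) - ALLOWED_COMMAND_KEYS
        if unknown:
            raise PlanValidationError(f"command {index}: unsupported keys {sorted(unknown)}")
        if not isinstance(entry.get("name"), str):
            raise PlanValidationError(f"command {index}: 'name' must be a string")
        args = entry.get("args", [])
        if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
            raise PlanValidationError(f"command {index}: 'args' must be a list of strings")
        environment = entry.get("environment", {})
        if not isinstance(environment, dict):
            raise PlanValidationError(f"command {index}: 'environment' must be a mapping")
        commands.append(SSHCommand(entry["name"], list(args), dict(environment)))
    return commands


plan = {
    "commands": [
        {"name": "grep", "args": ["-r", "TODO", "/home/user/projects"]},
        {"name": "find", "args": ["./", "-name", "*.py"]},
    ]
}
commands = plan_to_commands(plan)
```

Rejecting unknown keys outright, rather than ignoring them, keeps a malformed or over-eager plan from reaching the executor.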
4.3 Agent Observation
Design: Each step, including user input, judge decision, adapter translation, and tool execution, is logged in a structured JSON object.
- Fields: `timestamp`, `agent_id`, `action`, `status`, `output`, `policy_check`, `confidence`.
- Versioning: The log format includes a `schema_version` to support future extensions.
- Storage: Logs are stored in a secure append‑only ledger (e.g., a PostgreSQL table or a blockchain‑style log).
Use Cases:
- Debugging: trace why a command failed.
- Compliance: audit trail for regulatory reviews.
- Machine learning: replay logs to fine‑tune future models.
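A log record with the fields listed in this section could be modeled as below. This is a sketch under stated assumptions: the `Observation` class, the `"1.0"` version string, and the helper names are hypothetical, and a plain file opened in append mode stands in for the secure append‑only ledger.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

SCHEMA_VERSION = "1.0"  # hypothetical version string


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Observation:
    """One logged step, using the field names listed in this section."""
    agent_id: str
    action: str
    status: str
    output: str
    policy_check: str
    confidence: float
    timestamp: str = field(default_factory=_utc_now)
    schema_version: str = SCHEMA_VERSION


def to_json_line(obs: Observation) -> str:
    """Serialize one observation as a single JSON line."""
    return json.dumps(asdict(obs), sort_keys=True)


def append_to_log(path: str, obs: Observation) -> None:
    # Mode "a" only appends; a file stands in for the append-only ledger here.
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(to_json_line(obs) + "\n")


obs = Observation(
    agent_id="agent-1",
    action="ssh",
    status="success",
    output="main.py\n",
    policy_check="allowed",
    confidence=0.9,
)
record = json.loads(to_json_line(obs))
```

One JSON object per line keeps the log easy to replay or audit step by step.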
4.4 JudgeRouter
Function: Routes the judge’s decisions to the appropriate execution path.
- Routing Rules: Based on the `action` field, the router selects either the SSHExecutor or a local tool.
- Fallbacks: If a command is rejected by policy, the router can generate a clarification request back to the LLM.
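The routing and fallback behavior can be sketched as a dispatch table keyed on the `action` field. Everything here is illustrative: the decision dict shape (`action`, `command`, `allowed`, `reason`), the `route` function, and the two placeholder handlers are assumptions, not the framework's interface.

```python
from typing import Callable


def run_over_ssh(command: str) -> str:
    """Placeholder for the SSHExecutor; a real one would run the command remotely."""
    return f"[ssh] {command}"


def run_locally(command: str) -> str:
    """Placeholder for a local tool."""
    return f"[local] {command}"


# Hypothetical routing table: the value of `action` selects the execution path.
ROUTES: dict[str, Callable[[str], str]] = {"ssh": run_over_ssh, "local": run_locally}


def route(decision: dict) -> dict:
    """Dispatch an approved decision, or ask the LLM to clarify a rejected one."""
    command = decision["command"]
    if not decision.get("allowed", False):
        reason = decision.get("reason", "no reason given")
        return {
            "status": "clarification_requested",
            "message": f"{command!r} was rejected by policy ({reason}); "
                       "please propose an alternative.",
        }
    handler = ROUTES.get(decision["action"])
    if handler is None:
        raise ValueError(f"no route for action {decision['action']!r}")
    return {"status": "executed", "output": handler(command)}


print(route({"action": "ssh", "command": "ls", "allowed": True}))
```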
4.5 Authorization Layer
The framework incorporates fine‑grained SSHExecutor CommandPolicy rules.
- Policy Language: Policies are written in JSON or YAML and express conditions like “allow only commands with `ls` or `cat` if the user is `admin`”.
- Enforcement: Before any SSH command is sent, the policy engine checks the command against the policy.
- Audit Trail: Denied attempts are logged with the reason field.
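A default‑deny check along these lines could be sketched as follows. The policy shape, `PolicyResult`, and `check_command` are hypothetical; the example rule mirrors the "`ls` or `cat` for `admin`" condition above. Rejecting shell metacharacters outright is an extra assumption of this sketch, added so that an allowed program name cannot be used to smuggle in a second command.

```python
import shlex
from dataclasses import dataclass

# Hypothetical policy, shaped like something loaded from JSON or YAML.
POLICY = {"rules": [{"role": "admin", "allow": ["ls", "cat"]}]}

# Assumption of this sketch: any shell metacharacter causes a denial.
SHELL_METACHARACTERS = set(";|&$`<>")


@dataclass(frozen=True)
class PolicyResult:
    allowed: bool
    reason: str  # recorded in the audit trail for denied attempts


def check_command(policy: dict, role: str, command_line: str) -> PolicyResult:
    """Return an allow/deny decision with a reason; anything unmatched is denied."""
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return PolicyResult(False, "shell metacharacters are not permitted")
    try:
        tokens = shlex.split(command_line)
    except ValueError as exc:
        return PolicyResult(False, f"unparseable command: {exc}")
    if not tokens:
        return PolicyResult(False, "empty command")
    program = tokens[0]
    for rule in policy["rules"]:
        if rule["role"] == role and program in rule["allow"]:
            return PolicyResult(True, f"{program!r} is allowed for role {role!r}")
    return PolicyResult(False, f"{program!r} is not allowed for role {role!r}")


print(check_command(POLICY, "admin", "ls -la /tmp"))
print(check_command(POLICY, "admin", "rm -rf /tmp"))
```

Returning the reason alongside the decision gives the audit trail something concrete to log for each denial.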
4.6 SSHExecutor
Capabilities:
- Command Execution: Executes arbitrary shell commands over SSH, respecting policies.
- Timeout & Resource Limits: Enforces maximum execution time and CPU usage.
- Output Capture: Streams stdout/stderr back to the Observation logger.
Security Features:
- Uses key‑based authentication only.
- Does not allow shell escape sequences that could be used for privilege escalation.
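A rough sketch of timeout enforcement and output capture, assuming the executor shells out to the OpenSSH `ssh` client (the article does not say how SSHExecutor is implemented). `build_ssh_argv` and `run_with_limits` are hypothetical helpers; CPU limits are not covered here. The demo at the end runs a local Python process as a stand‑in so the sketch works without a remote host.

```python
import shlex
import subprocess
import sys


def build_ssh_argv(host: str, key_path: str, remote_argv: list[str]) -> list[str]:
    """Build an ssh invocation that uses key-based, non-interactive auth only."""
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",               # never prompt for a password
        "-o", "PasswordAuthentication=no",
        host,
        shlex.join(remote_argv),             # quote each argument for the remote shell
    ]


def run_with_limits(argv: list[str], timeout_s: float) -> dict:
    """Run a command, capture stdout/stderr, and enforce a wall-clock timeout."""
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": "", "stderr": ""}
    return {
        "status": "ok" if completed.returncode == 0 else "error",
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Placeholder host and key path, for illustration only.
argv = build_ssh_argv("user@example.com", "/path/to/key", ["head", "-n", "10", "my file.py"])

# Local stand-in so the sketch runs without a remote host.
demo = run_with_limits([sys.executable, "-c", "print('hello')"], timeout_s=30)
```

Passing an argument list to `subprocess.run` (rather than a shell string) avoids local shell interpretation, and `shlex.join` quotes the remote arguments so a filename with spaces stays a single word.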
4.7 Output Formatter
Formats: Supports Markdown (for human consumption) and JSON (for machine consumption).
- Markdown Renderer: Highlights code blocks, adds tables for structured data, and includes embedded images when available.
- JSON Renderer: Serializes the final observation log, making it easy to parse downstream.
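The two renderers can be approximated in a few lines. This is only a sketch: `render_markdown_table` and `render_json` are hypothetical names, and escaping pipes and newlines inside cells is a choice made here to keep the table well formed, not something the article specifies.

```python
import json


def render_markdown_table(headers: list[str], rows: list[list[object]]) -> str:
    """Render rows as a Markdown pipe table, escaping characters that break cells."""
    def cell(value: object) -> str:
        return str(value).replace("|", "\\|").replace("\n", "<br>")

    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


def render_json(observations: list[dict]) -> str:
    """Serialize the observation log for machine consumption."""
    return json.dumps(observations, indent=2, sort_keys=True)


table = render_markdown_table(["File", "Lines"], [["main.py", 2], ["utils.py", 2]])
print(table)
print(render_json([{"action": "ssh", "status": "success"}]))
```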
5. Workflow Walkthrough
- User Input – The user submits a natural language query:
“List all Python files in the repo and show the first 10 lines of each.”
- LLM‑as‑Judge – The judge processes the query, checks the policy, and produces a YAML plan.

      commands:
        - name: find
          args: ["./", "-name", "*.py"]
        - name: head
          args: ["-n", "10", "$FILE"]

- AgentAdapter – Parses the YAML, expands `$FILE` with actual filenames.
- JudgeRouter – Sends each command to the SSHExecutor.
- SSHExecutor – Executes the commands on the remote host, each respecting the `CommandPolicy`.
- Observation – Every step (execution, success/failure, output) is logged.
- Output Formatter – Renders the aggregated results in Markdown:
## List of Python files
| File | First 10 Lines |
|------|----------------|
| `main.py` | `def main():\n print("Hello")` |
| `utils.py` | `def helper():\n return 42` |
- Return to User – The formatted output is presented back to the user.
Throughout, if any policy violation is detected (e.g., attempting to delete a critical file), the JudgeRouter blocks the action and the Observation log records the denial.
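The `$FILE` expansion step in the walkthrough could work roughly as follows. This sketch assumes the plan has already been parsed into dicts and that the list of filenames (here hard‑coded) comes from the output of the earlier `find` command. `expand_plan` is a hypothetical helper, and using `string.Template` is one possible way to substitute the placeholder, not necessarily the framework's.

```python
from string import Template


def expand_plan(plan: dict, files: list[str]) -> list[dict]:
    """Emit one concrete command per file for every command that uses $FILE."""
    expanded = []
    for command in plan["commands"]:
        args = command["args"]
        if any("$FILE" in arg for arg in args):
            for path in files:
                expanded.append({
                    "name": command["name"],
                    # safe_substitute leaves any other text untouched
                    "args": [Template(arg).safe_substitute(FILE=path) for arg in args],
                })
        else:
            expanded.append({"name": command["name"], "args": list(args)})
    return expanded


plan = {
    "commands": [
        {"name": "find", "args": ["./", "-name", "*.py"]},
        {"name": "head", "args": ["-n", "10", "$FILE"]},
    ]
}
# In the real workflow these names would come from the `find` output.
for command in expand_plan(plan, ["main.py", "utils.py"]):
    print(command)
```

Each expanded command would still pass through the policy check individually before execution.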
6. Evaluation & Findings
The authors conducted an extensive evaluation, summarized as follows.
| Metric | Value | Baseline |
|--------|-------|----------|
| Policy Compliance | 99.8% of commands respected policy | 82% in naive LLM agents |
| Task Success Rate | 96% of user tasks completed | 88% in other frameworks |
| Latency | Avg. 2.3 s per command | 3.1 s with monolithic LLM |
| Auditability | 100% traceable logs | 70% (missing intermediate states) |
6.1 Deduped Findings
The article lists 10 core findings, each deduped from the raw results:
- LLM‑as‑Judge improves safety – By vetting each action, policy violations drop dramatically.
- YAML adapters reduce error rates – Structured command definitions lower parsing failures.
- Agent Observation enables compliance audits – Full logs meet regulatory requirements (e.g., SOC 2).
- Policy enforcement is lightweight – 0.1 ms per command check, negligible overhead.
- Multi‑agent orchestration scales – 20 concurrent agents share a single LLM instance without performance degradation.
- SSHExecutor protects against privilege escalation – Input sanitization blocks dangerous sequences.
- Markdown + JSON outputs enhance usability – Human‑friendly readouts plus machine‑parseable data.
- Dynamic policy updates – Administrators can push new policies without redeploying agents.
- Agent Observation can feed training data – Replay logs to fine‑tune future LLMs.
- The framework is open‑source – Facilitates community contributions and transparency.
7. Discussion
7.1 Strengths
- Safety‑First Design: By positioning the LLM as a judge rather than an executor, the framework inherently mitigates the risk of malicious or accidental actions.
- Modularity: Each component (adapter, executor, logger, policy engine) can be swapped out or upgraded independently.
- Auditability: Structured logs support compliance with industry standards (ISO 27001, GDPR).
- Extensibility: Adding new tools (e.g., REST APIs, database queries) requires only new adapters and policy rules.
7.2 Limitations
- Policy Complexity: Writing expressive policies in JSON/YAML can become cumbersome for large systems.
- Performance Bottlenecks: Although each policy check is cheap, the cumulative cost can become noticeable if the command rate spikes to thousands per second.
- LLM Prompt Drift: The judge’s reasoning can drift over time if the underlying LLM is updated without prompt revision.
- SSH Dependency: Relying on SSH may not be suitable for containerized or micro‑service architectures that prefer API calls.
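To put the performance limitation in numbers, the reported 0.1 ms per policy check can be multiplied by the command rate. The sketch below assumes checks run serially on a single core, which the article does not state; under that assumption the overhead is roughly 10% of one core at 1,000 commands per second.

```python
CHECK_COST_S = 0.1e-3  # 0.1 ms per check, the figure reported in the findings


def policy_overhead_fraction(commands_per_second: float,
                             check_cost_s: float = CHECK_COST_S) -> float:
    """Fraction of one core's time spent on policy checks (serial checks assumed)."""
    return commands_per_second * check_cost_s


for rate in (10, 1_000, 5_000):
    print(f"{rate:>5} cmd/s -> {policy_overhead_fraction(rate):.1%} of one core")
```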
7.3 Comparison with Related Frameworks
- ReAct: Focuses on iterative reasoning but lacks a built‑in policy engine.
- Plan‑Act‑Check: Similar high‑level flow but does not formalize observation logging.
- Tool‑Use Models: Provide tool interfaces but treat LLMs as executors, not judges.
The article positions its framework as the first to integrate policy enforcement, structured observation, and a judge‑centric LLM approach in a unified system.
8. Future Work
The authors outline several avenues for continued research:
- Policy Language Evolution: Developing a domain‑specific language (DSL) for easier policy specification.
- Adaptive Policy Learning: Using reinforcement learning to refine policies based on observed failures.
- Cross‑Agent Coordination: Enabling agents to negotiate resources or tasks in a decentralized manner.
- Container‑Native Execution: Extending SSHExecutor to support Docker or Kubernetes exec calls.
- User‑Facing DSL: Providing end‑users with a higher‑level scripting language that compiles to the internal YAML.
9. Conclusion
The article presents a comprehensive, safety‑oriented architecture that transforms an LLM into a judge that evaluates, approves, and routes commands to a policy‑controlled execution layer. By integrating YAML adapters, structured observation logs, and a lightweight policy engine, the framework achieves high task success rates, near‑zero policy violations, and full auditability. The modular design invites community adoption and future extensions, making it a promising foundation for building reliable, compliant AI‑driven agents across diverse domains.