Agent Architecture Overview

Stavily uses a two-agent system for efficient and secure automation. All communication between agents and the central orchestrator is secured via HTTPS JSON-RPC with Bearer token authentication.

Agent Types

Action Agents

Action Agents execute actions based on data collected by Sensor Agents or on instructions from the orchestrator. They serve as the hands of the platform, performing tasks such as restarting services or running commands.

Sensor Agents

Sensor Agents collect data from your systems, including metrics, logs, and API endpoints. They act as the eyes of the platform, monitoring for trigger conditions.

Core Design Principles

The Stavily Agent is a lightweight, cross-platform, general-purpose worker that operates as the hands of the central Stavily orchestrator. Its design is intentionally simple: it does not contain any independent logic, monitoring capabilities, or decision-making engines. Agents are simple, mostly stateless execution environments.

Key Principles:

Agent types and runtime: The codebase records agent type metadata (e.g., sensor, action) and uses these types to influence capability discovery and orchestration behavior. Agents share a common runtime but agent type affects intent and orchestration logic.
Orchestrator-driven communication: Agents primarily poll the orchestrator for instructions (the polling loop is implemented in shared agent code). Agents can also participate in provisioning flows and may make outbound requests when required.
Mostly stateless execution: Agents execute instructions in ephemeral sandboxes; they may maintain local caches (plugin cache, configuration) for performance and resilience.
Instruction-based work: The orchestrator delivers instruction objects that contain a plugin ID, configuration, and input data; agents execute them and report the results.
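
To make the shape of an instruction object more concrete, here is a minimal sketch of how one might be modeled and parsed. It is written in Python for brevity (the agent itself is a Go binary), and the field names `id`, `plugin_id`, `configuration`, and `input_data` are assumptions for illustration, not the actual wire format.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Instruction:
    """Hypothetical instruction object; field names are assumed."""
    id: str
    plugin_id: str
    configuration: Dict[str, Any] = field(default_factory=dict)
    input_data: Dict[str, Any] = field(default_factory=dict)


def parse_instruction(raw: str) -> Instruction:
    """Parse a JSON document into an Instruction, rejecting incomplete ones."""
    doc = json.loads(raw)
    missing = [key for key in ("id", "plugin_id") if key not in doc]
    if missing:
        raise ValueError(f"instruction is missing required fields: {missing}")
    return Instruction(
        id=doc["id"],
        plugin_id=doc["plugin_id"],
        # Configuration and input are optional in this sketch.
        configuration=doc.get("configuration", {}),
        input_data=doc.get("input_data", {}),
    )
```

Keeping the object immutable reflects the idea that an agent only executes what it is handed and does not rewrite instructions locally.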

API Communication Architecture

Agents communicate exclusively with the secure Public Agent API. All other APIs are inaccessible to them. The communication pattern is a simple, robust polling loop.
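
As a rough illustration of a single authenticated call to the Public Agent API, the sketch below builds (but does not send) a poll request. It is written in Python for brevity; the base URL, agent ID, and token are placeholder assumptions, the path mirrors the instructions endpoint used in the API flow, and the Bearer scheme follows the overview.

```python
import urllib.request


def build_poll_request(base_url: str, agent_id: str, token: str) -> urllib.request.Request:
    """Build a GET request for the agent's instruction endpoint.

    base_url, agent_id, and token are hypothetical inputs; nothing is sent
    over the network here.
    """
    if not base_url.startswith("https://"):
        # Agent traffic is described as HTTPS-only, so refuse anything else.
        raise ValueError("the Public Agent API must be reached over HTTPS")
    url = f"{base_url.rstrip('/')}/agents/v1/{agent_id}/instructions"
    return urllib.request.Request(
        url,
        method="GET",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )
```

A caller would pass the returned request to `urllib.request.urlopen` (or an equivalent HTTP client) and branch on the response status.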

API Flow Pattern

graph LR
    subgraph "Stavily Agent"
        A[Agent]
    end

    subgraph "Stavily Orchestrator"
        API[Public Agent API]
        WF[Workflow Service]
    end

    A -->|"1. Poll for work GET /agents/v1/{id}/instructions"| API
    API -->|"2. No work pending 204 No Content"| A

    WF -->|"3. Workflow needs action"| API
    A -->|"4. Poll for work GET /agents/v1/{id}/instructions"| API
    API -->|"5. Deliver instruction 200 OK with Instruction"| A
    A -->|"6. Execute plugin"| A
    A -->|"7. Report result POST /agents/v1/{id}/instructions/{instr_id}/result"| API
    API -->|"8. Update workflow"| WF
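
The flow above can be sketched as a small polling loop. This is a minimal Python illustration, not the agent's actual Go implementation: `fetch`, `execute`, and `report` are hypothetical callables standing in for the HTTP call, the plugin executor, and the result endpoint, and the `failed` status is an assumed counterpart to `completed`.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# fetch() returns (http_status, instruction_or_None); all three callables
# are hypothetical stand-ins supplied by the caller.
Fetch = Callable[[], Tuple[int, Optional[Dict[str, Any]]]]
Execute = Callable[[Dict[str, Any]], Any]
Report = Callable[[str, Dict[str, Any]], None]


def poll_once(fetch: Fetch, execute: Execute, report: Report) -> bool:
    """Run one poll cycle. Returns True if an instruction was handled."""
    status, instruction = fetch()
    if status == 204:  # no work pending
        return False
    if status != 200 or instruction is None:
        raise RuntimeError(f"unexpected poll response: {status}")
    try:
        result = {"status": "completed", "output": execute(instruction)}
    except Exception as exc:  # report plugin failures instead of crashing
        result = {"status": "failed", "error": str(exc)}
    report(instruction["id"], result)
    return True


def run(fetch: Fetch, execute: Execute, report: Report,
        interval: float = 5.0, max_polls: Optional[int] = None,
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll repeatedly, sleeping only when the last poll found no work."""
    polls = 0
    while max_polls is None or polls < max_polls:
        handled = poll_once(fetch, execute, report)
        polls += 1
        if not handled:
            sleep(interval)
```

Injecting the three callables keeps the loop itself free of transport and plugin details, which also makes it easy to exercise with fakes.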

Complete Workflow Execution Example

This example demonstrates how an action is executed on an Action Agent after a Sensor Agent detects a trigger condition and reports it to the Stavily orchestrator.

Scenario: High CPU Remediation

sequenceDiagram
    participant SENSOR as Sensor Agent
    participant GRAFANA as Grafana API
    participant ORCH as Stavily Orchestrator
    participant API as Public Agent API
    participant ACTION as Action Agent

    loop Continuous Monitoring
        SENSOR->>GRAFANA: Query metrics (CPU usage)
        GRAFANA-->>SENSOR: Return current metrics
        SENSOR->>SENSOR: Evaluate user-defined condition (CPU > 90%)
    end

    SENSOR->>API: POST /triggers (condition met: CPU > 90%)
    API->>ORCH: Forward trigger to orchestrator
    ORCH->>ORCH: Process trigger and determine remediation action
    ORCH->>API: Queue "restart-service" instruction for Action Agent

    loop Action Agent Polling
        ACTION->>API: GET /instructions
        API-->>ACTION: 204 No Content (no work yet)
    end

    ACTION->>API: GET /instructions
    API-->>ACTION: 200 OK (returns "restart-service" instruction)

    ACTION->>ACTION: Downloads and executes action plugin
    ACTION->>API: POST /instructions/{id}/result (status: completed)

    API->>ORCH: Update workflow status

    loop Post-Remediation Check
        SENSOR->>GRAFANA: Query metrics again
        GRAFANA-->>SENSOR: Return updated metrics
        SENSOR->>SENSOR: Check if CPU < 90% (condition resolved)
    end

    alt CPU normalized
        SENSOR->>API: POST /triggers (success: CPU normalized)
        API->>ORCH: Forward success trigger
        ORCH->>ORCH: Generate success output/notification
    else CPU still high
        SENSOR->>API: POST /triggers (failure: CPU still high)
        API->>ORCH: Forward failure trigger
        ORCH->>ORCH: Generate failure output/alert
    end

Detailed Execution Flow

Continuous Monitoring (Sensor Agent): The Sensor Agent continuously queries the Grafana API for metrics and evaluates user-defined conditions.
Trigger Detection (Sensor Agent): When the condition is met (e.g., CPU > 90%), the Sensor Agent reports the trigger event to the orchestrator.
Workflow Evaluation (Orchestrator): The orchestrator processes the trigger and determines the appropriate remediation action.
Action Queuing (Orchestrator): The orchestrator queues an instruction for the Action Agent via the Public Agent API.
Action Execution (Action Agent): The Action Agent polls for work, receives the instruction, then downloads and executes the remediation plugin.
Result Reporting (Action Agent): The Action Agent reports the execution result back to the orchestrator.
Post-Remediation Verification (Sensor Agent): The Sensor Agent checks the metrics again to verify if the issue is resolved.
Output Generation (Orchestrator): Based on the verification result, the orchestrator generates appropriate outputs (success notification or failure alert).
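
The trigger-to-instruction part of this flow can be sketched in memory as follows. This is an illustrative Python model only: the `Condition` type, the metric name `cpu_usage_percent`, the trigger payload fields, and the `Orchestrator` stand-in with its remediation table are all assumptions; only the `CPU > 90%` condition and the `restart-service` instruction come from the scenario.

```python
import operator
from collections import deque
from dataclasses import dataclass
from typing import Any, Dict, Optional

_OPERATORS = {">": operator.gt, ">=": operator.ge,
              "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Condition:
    """A user-defined threshold condition (hypothetical shape)."""
    metric: str
    op: str
    threshold: float

    def is_met(self, value: float) -> bool:
        return _OPERATORS[self.op](value, self.threshold)


def evaluate(condition: Condition, value: float) -> Optional[Dict[str, Any]]:
    """Sensor side: return a trigger payload if the condition is met."""
    if not condition.is_met(value):
        return None
    return {"metric": condition.metric, "value": value,
            "condition": f"{condition.metric} {condition.op} {condition.threshold}"}


class Orchestrator:
    """Stand-in for the orchestrator: maps triggers to queued instructions."""

    def __init__(self, remediation: Dict[str, str]) -> None:
        self.remediation = remediation  # metric name -> plugin id
        self.queue: deque = deque()

    def handle_trigger(self, trigger: Dict[str, Any]) -> None:
        plugin_id = self.remediation.get(trigger["metric"])
        if plugin_id is not None:
            self.queue.append({"plugin_id": plugin_id, "input": trigger})

    def next_instruction(self) -> Optional[Dict[str, Any]]:
        """What an Action Agent poll would receive (None means no work)."""
        return self.queue.popleft() if self.queue else None


cpu_high = Condition("cpu_usage_percent", ">", 90.0)
orchestrator = Orchestrator({"cpu_usage_percent": "restart-service"})
for sample in (42.0, 95.5):  # simulated metric readings
    trigger = evaluate(cpu_high, sample)
    if trigger is not None:
        orchestrator.handle_trigger(trigger)
```

Only the second reading crosses the threshold, so a single instruction is queued for the next Action Agent poll.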

Agent Runtime Architecture

Go Agent Binary: A single, cross-platform binary with minimal dependencies.
Instruction Poller: The core loop that communicates with the orchestrator API.
Plugin Executor: Executes action plugins in isolated, sandboxed environments.
Configuration Manager: Handles dynamic configuration updates sent from the orchestrator.
Secure Communicator: Manages secure mTLS communication with the Public Agent API.
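
As one possible reading of the Plugin Executor component, the sketch below runs a plugin as a separate process with a scratch working directory, a stripped-down environment, and a timeout. It is a Python illustration under stated assumptions: the plugin is taken to be an executable command that reads JSON input on stdin, and process-level separation like this is much weaker than a real sandbox, so it should not be read as how the Go agent actually isolates plugins.

```python
import json
import os
import subprocess
import sys
import tempfile
from typing import Any, Dict, List


def run_plugin(command: List[str], input_data: Dict[str, Any],
               timeout: float = 30.0) -> Dict[str, Any]:
    """Run a plugin command in a throwaway directory and summarize the outcome."""
    # The scratch directory is deleted when the block exits, so nothing the
    # plugin writes there survives the run.
    with tempfile.TemporaryDirectory(prefix="plugin-") as workdir:
        try:
            proc = subprocess.run(
                command,
                input=json.dumps(input_data),  # hand input to the plugin on stdin
                capture_output=True,
                text=True,
                cwd=workdir,
                # Pass only PATH so agent secrets in the environment do not leak.
                env={"PATH": os.environ.get("PATH", "")},
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"status": "failed", "error": f"timed out after {timeout}s"}
    status = "completed" if proc.returncode == 0 else "failed"
    return {"status": status, "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}


if __name__ == "__main__":
    # Hypothetical plugin: echo the service name it was asked to restart.
    demo = [sys.executable, "-c",
            "import json, sys; print(json.load(sys.stdin)['service'])"]
    print(run_plugin(demo, {"service": "nginx"}))
```

The returned dictionary maps directly onto the kind of result an agent would report back after execution.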

Benefits of this Architecture

Simplicity & Reliability: With no complex logic, agents are simple to maintain, deploy, and scale. Their stateless nature makes them highly resilient.
Security: The attack surface is minimized. Agents have no standing permissions and only communicate with a single, dedicated API endpoint via mTLS.
Centralized Control: All automation logic, monitoring, and decision-making resides in the central orchestrator, providing a single pane of glass for management and auditing.
Scalability: Since agents are lightweight workers, scaling out to thousands of nodes is straightforward.

This unified agent architecture provides a robust, secure, and maintainable foundation for the Stavily automation platform.