Future Plans

This document showcases advanced workflow features that are planned for future releases. These enhancements will significantly expand Stavily’s automation capabilities, enabling more complex, resilient, and observable workflows.

Advanced Features Preview

The example below demonstrates several upcoming features working together in a comprehensive CPU remediation workflow with escalation and recovery capabilities.

Advanced Workflow Example

name: "Advanced CPU Remediation with Escalation and Recovery"
description: "Comprehensive remediation workflow featuring looping, approvals, compensation, and observability"
version: "2.0.0"
created_by: "stavily-system"
tags: ["remediation", "cpu", "ops", "monitoring"]

metadata:
  revision: 3
  environment: "production"
  category: "infrastructure"
  ui:
      color: "#4B9EFF"
      icon: "cpu"

agents:
  pools:
      - pool_name: "prod-servers"
          agent_regex: "prod-.*"
          auto_install_plugins: true
          on_plugin_missing: "skip_agent"
  single_agent: "fallback-agent-001"

secrets:
  - name: "ssh_key"
      type: "vault"
      ref: "vault:infra/ssh_ops_key"
  - name: "email_smtp_creds"
      type: "env"
      ref: "SMTP_CREDS"

variables:
  cpu_threshold: 90
  alert_recipients: ["[email protected]", "[email protected]"]

triggers:
  agents:
      single_agent: "trigger-agent"
  steps:
      - name: "cpu-alert"
          plugin: "prometheus-trigger-v1.1.0"
          config:
              metric: "cpu_usage_percent"
              condition: "> {{ variables.cpu_threshold }}"
              duration: "2m"
      - name: "manual-trigger"
          plugin: "manual-approval-trigger"
          config:
              message: "Run manually for testing or escalation"
          enabled: false

actions:
  agents:
      pools:
          - pool_name: "actions-pool"
              agent_regex: "actions-.*"
  steps:
      - name: "analyze-processes"
          plugin: "python-script-action-v2.1.0"
          config:
              script: "analyze_cpu_processes.py"
              timeout: "30s"
          error_handling:
              on_failure: "continue"
              max_retries: 3
              retry_delay: "15s"

      - name: "analyze-all-servers"
          plugin: "remote-analyzer-v3.0.0"
          for_each: "{{ agents.pools[0].agent_regex }}"
          parallel: true
          config:
              script: "remote_cpu_check.py"

      - name: "approval"
          plugin: "manual-approval-action-v1.0.0"
          depends_on: ["analyze-processes"]
          type: "manual"
          config:
              approvers: ["[email protected]"]
              message: "CPU remediation requires approval to kill processes."

      - name: "kill-high-cpu-processes"
          plugin: "system-command-action-v1.2.0"
          depends_on: ["approval"]
          condition: "analyze-processes.output.high_cpu_count > 3"
          config:
              command: "pkill -f high_cpu_process"
              timeout: "10s"
          error_handling:
              on_failure: "invoke-compensation"

      - name: "restart-service"
          plugin: "service-management-action-v1.5.0"
          depends_on: ["kill-high-cpu-processes"]
          condition: "kill-high-cpu-processes.status == 'success'"
          config:
              service: "webapp"
              action: "restart"

      - name: "subworkflow-escalation"
          plugin: "subworkflow-caller-v1.0.0"
          depends_on: ["restart-service"]
          condition: "restart-service.status == 'failed'"
          config:
              workflow_id: "incident-escalation-v2"
              parameters:
                  severity: "high"
                  category: "cpu"

compensation:
  - name: "rollback-service"
      plugin: "service-management-action-v1.5.0"
      config:
          service: "webapp"
          action: "rollback"

outputs:
  agents:
      single_agent: "output-agent"
  steps:
      - name: "send-summary"
          plugin: "formatted-email-output-v1.5.0"
          depends_on: ["restart-service"]
          config:
              recipients: "{{ variables.alert_recipients }}"
              template: "cpu_remediation_summary"
              attachments:
                  - path: "logs/workflow-{{ workflow.id }}.txt"

monitoring:
  timeout: "15m"
  heartbeat_interval: "30s"
  progress_tracking: true
  log_level: "debug"
  metrics:
      enabled: true
      export_to: "prometheus"

observability:
  tracing: true
  record_logs: true
  ui_annotations:
      highlight_steps: ["kill-high-cpu-processes", "subworkflow-escalation"]

permissions:
  owner: "devops-team"
  allow_run_by: ["ops", "admins"]
  read_only_for: ["auditors"]

Advanced Workflow Architecture

graph TD
    A[Workflow Trigger] --> B{Parallel Execution Engine}
    B --> C[For-Each Iterator]
    B --> D[Sequential Steps]

    C --> E[Parallel Step 1]
    C --> F[Parallel Step 2]
    C --> G[Parallel Step N]

    D --> H[Sequential Step 1]
    D --> I[Sequential Step 2]

    E --> J{Error Check}
    F --> J
    G --> J
    H --> K{Condition Check}
    I --> K

    J -->|Failure| L[Compensation Handler]
    K -->|Failure| L

    L --> M[Rollback Actions]
    M --> N[Recovery Workflow]

    J -->|Success| O[Continue]
    K -->|Success| O

    O --> P[Observability Layer]
    P --> Q[Metrics Export]
    P --> R[Tracing]
    P --> S[UI Annotations]

Key Advanced Features Explained

For-Each Loops

The for_each directive enables iteration over collections, executing steps multiple times with different parameters.

Example: for_each: "{{ agents.pools[0].agent_regex }}" - runs the step once for each matching agent.

Benefits:

Process multiple items concurrently
Dynamic scaling based on agent pools
Reduced execution time for batch operations

Parallel Execution

The parallel: true flag allows steps to run concurrently, significantly reducing total execution time for independent operations.

Use Cases:

Multi-server deployments
Independent validation checks
Parallel data processing tasks

# Example: Parallel server analysis
- name: "analyze-all-servers"
plugin: "remote-analyzer-v3.0.0"
for_each: "{{ agents.pools[0].agent_regex }}"
parallel: true
config:
  script: "remote_cpu_check.py"

Compensation Blocks

Define automatic rollback procedures that execute when steps fail, ensuring system consistency and reliability.

Features:

Automatic failure detection
Configurable rollback actions
State restoration capabilities

Error Recovery

Advanced error handling with invoke-compensation triggers rollback workflows when critical failures occur.

Strategies:

Service rollback to previous versions
Configuration restoration
Data cleanup operations

# Example: Compensation configuration
compensation:
- name: "rollback-service"
  plugin: "service-management-action-v1.5.0"
  config:
    service: "webapp"
    action: "rollback"
    backup_version: "v1.2.3"

# Trigger compensation on failure
error_handling:
on_failure: "invoke-compensation"

Distributed Tracing

Full request tracing across workflow steps and sub-workflows for debugging complex execution paths.

Capabilities:

End-to-end request tracking
Performance bottleneck identification
Cross-service dependency mapping

UI Annotations

Visual enhancements in the web UI to highlight critical steps and provide better workflow monitoring.

Features:

Step highlighting and status indicators
Real-time progress visualization
Interactive workflow diagrams

# Example: Enhanced observability
observability:
tracing: true
record_logs: true
ui_annotations:
  highlight_steps: ["kill-high-cpu-processes", "subworkflow-escalation"]
  color_scheme: "error-states"
  progress_indicators: true

RBAC Integration

Role-based access control for workflow execution, with granular permissions for different user types.

Access Levels:

Owner: Full control
Admin: Execute and modify
Operator: Execute only
Auditor: Read-only access

Secret Management

Integration with external secret stores (Vault, environment variables) for secure credential handling.

Supported Backends:

HashiCorp Vault
AWS Secrets Manager
Azure Key Vault
Environment variables

# Example: Security configuration
secrets:
- name: "ssh_key"
  type: "vault"
  ref: "vault:infra/ssh_ops_key"
- name: "email_smtp_creds"
  type: "env"
  ref: "SMTP_CREDS"

permissions:
owner: "devops-team"
allow_run_by: ["ops", "admins"]
read_only_for: ["auditors"]

Development Roadmap

Category	Feature	Status	Description / Implementation Notes
Core DSL	Workflow metadata (`id`, `name`, `description`, `version`)	✅	Already implemented — clear and semantic.
	Agent definitions (workflow/type/plugin level)	✅	Fully implemented and flexible with regex, pools, single agent.
	Conditional execution (`condition`, `depends_on`)	✅	Present and working.
	Error handling (`on_failure`, retries)	✅	Already defined with retry logic.
	Plugin architecture (versioned plugins)	✅	Excellent — version pinning adds consistency.
Advanced Execution	Loops / For-Each	🧩 In Progress	Example included (`for_each`). Needs parser and scheduler support for concurrent runs.
	Parallel Execution (`parallel: true`)	🧩 In Progress	Needs execution engine concurrency support.
	Sub-workflows / Nested workflows	🧩 In Progress	DSL supports via `subworkflow-caller` — backend orchestration logic to implement.
	Manual Approval Step	✅	Implemented as `manual-approval-action`.
	Compensation / Rollback	🧩 In Progress	DSL supports `compensation` block; needs engine hooks.
Data / Context	Variable & templating system	✅	Implemented (`{{ variables.* }}` syntax).
	Outputs referencing & propagation	✅	Already supported (e.g. `analyze-processes.output.*`).
	Global and step-level environment variables	🚧 To Do	Should add `.env` injection or `env:` per step.
	Type-safe variables / schema	🚧 To Do	Would improve validation and autocompletion.
Secrets / Security	Secrets injection from vault/env	🧩 In Progress	DSL design defined; need integration with secret backends.
	Permissions & RBAC per workflow	🧩 In Progress	DSL supports via `permissions` block; requires enforcement logic.
	Signed workflows / integrity check	🚧 To Do	Could add checksum or digital signature for verified deployment.
Observability	Monitoring (timeout, heartbeat, progress)	✅	Already implemented.
	Structured logging & tracing	🧩 In Progress	Partially in `observability` — needs backend support.
	Metrics export (Prometheus)	🧩 In Progress	DSL supports it; exporter to be implemented.
Usability	Tags, metadata, UI hints	✅	Present (`metadata`, `ui_annotations`).
	Validation & static analysis	🚧 To Do	YAML linter / validator should check for circular dependencies.
	Visual DAG support	🚧 To Do	Auto-generate DAG visualization from dependencies.
DevOps / Platform	Version control / rollback	🧩 In Progress	Version field exists; implement history & rollback mechanism.
	Plugin marketplace / registry	🚧 To Do	Backend feature for discoverability and updates.
	Cron / schedule triggers	🚧 To Do	Extend triggers with `cron` or time-based scheduling.