# Workflows (Coming Soon)
Scalable document processing pipelines with scatter/gather parallelism.
> **In Development:** The workflow system is under active development and not yet available for general use. This page describes the concepts and direction.
## The Problem
Processing large document collections requires many operations: OCR, text extraction, structure parsing, description generation, clustering, and embedding. Running these by hand doesn't scale. A research archive with 10,000 scanned pages needs automation that can:
- Handle thousands of items concurrently
- Chain operations together (OCR → structure extraction → description)
- Recover gracefully from failures
- Track progress across the entire pipeline
## The Approach
Arke's workflow system is built on two primitives:
**Kladoi** (singular: *klados*) are discrete action units: external services that perform a single, well-defined operation. Each klados declares what it accepts, what it produces, and what permissions it needs. Examples:
- OCR klados: accepts images, produces text
- Structure extraction klados: accepts text, produces hierarchical entities
- Description klados: accepts entities, produces summaries
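As a sketch of what such a declaration could look like, here is a hypothetical manifest for the OCR klados. The class and field names (`accepts`, `produces`, `permissions`) are illustrative assumptions, not a published API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a klados declaration. The real manifest
# format is not yet published; all names here are illustrative.
@dataclass
class KladosManifest:
    name: str
    accepts: list[str]       # input content types the klados handles
    produces: list[str]      # output content types it emits
    permissions: list[str]   # scoped permissions it requests

ocr = KladosManifest(
    name="ocr",
    accepts=["image/png", "image/tiff"],
    produces=["text/plain"],
    permissions=["collection:read", "collection:write"],
)
```

Declaring inputs and outputs up front is what lets a workflow engine check, before anything runs, that each step's output type matches the next step's input type.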
**Rhizai** (singular: *rhiza*) compose multiple kladoi into directed acyclic graphs (DAGs). A rhiza defines:
- Which klados runs first
- How data flows between steps
- Where to fan out (scatter) for parallel processing
- Where to collect results (gather) before continuing
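A rhiza covering the four points above might be written something like this sketch, where `mode` markers indicate where the graph fans out and where it reconverges. The structure is an assumption for illustration, not the released format:

```python
# Hypothetical rhiza definition: an ordered DAG of kladoi with
# scatter/gather markers. Illustrative only -- not a published schema.
rhiza = {
    "name": "digitize-archive",
    "steps": [
        {"klados": "pdf-splitter"},                          # runs first
        {"klados": "ocr", "mode": "scatter"},                # fan out: one job per page
        {"klados": "structure-extraction", "mode": "gather"},# collect results, then continue
        {"klados": "description"},                           # final sequential step
    ],
}
```

Data flow is implicit in step order here; a richer format could name edges explicitly for non-linear graphs.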
## Scatter/Gather Parallelism
The key to scale is the scatter/gather pattern:
```
            [PDF Splitter]
                  │
                  ▼  scatter (1 PDF → 500 pages)
   ┌────┬────┬────┼────┬─── ... ───┐
   ▼    ▼    ▼    ▼    ▼           ▼
 [OCR][OCR][OCR][OCR][OCR] ... [OCR]   (500 concurrent)
   │    │    │    │    │           │
   └────┴────┴────┼────┴─── ... ───┘
                  ▼  gather
         [Structure Extraction]
                  │
                  ▼
        [Description Generation]
                  │
                  ▼
                done
```

A single PDF becomes 500 concurrent OCR jobs, which reconverge for structure extraction. The workflow handles the coordination; you just define the graph and invoke it.
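The pattern itself can be shown with a toy sketch in plain Python, using a thread pool in place of real kladoi. `fake_ocr` is a stand-in function, not part of Arke:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy scatter/gather sketch: split work into units, process them
# concurrently, then gather results before the next stage.
def fake_ocr(page: int) -> str:
    # Stand-in for a real OCR klados processing one page.
    return f"text of page {page}"

pages = list(range(500))                       # scatter: 1 PDF -> 500 pages
with ThreadPoolExecutor(max_workers=32) as pool:
    texts = list(pool.map(fake_ocr, pages))    # concurrent OCR jobs
combined = "\n".join(texts)                    # gather before the next stage
```

The real system distributes these jobs across services rather than threads, but the shape is the same: fan out over independent units, then block until every unit has reported back.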
## Use Cases
| Pipeline | What It Does |
|---|---|
| Document digitization | PDF → pages → OCR → structured text → searchable entities |
| Knowledge graph generation | Documents → structure extraction → entity linking → graph |
| Batch description | Thousands of images → concurrent description generation |
| Format conversion | Mixed media → normalized formats → unified processing |
| Clustering analysis | Entities → embedding → similarity clustering → labeled groups |
## The Domino Effect
Setting up a workflow is like arranging dominoes. You define the structure once—what connects to what, where to fan out, where to collect. Then you knock down the first one.
A single invocation cascades through potentially thousands of parallel operations. Each klados does its work, passes results forward, and the next stage begins automatically. You watch progress, not manage execution.
## Security Model
Workflows operate under strict security constraints:
- **Temporal permissions** — Each klados receives time-limited access that expires after the job completes
- **Collection-scoped** — Permissions are always scoped to specific collections, never global
- **Signed requests** — All invocations are cryptographically signed to prevent tampering
- **Audit trail** — Every operation is logged to a job collection for full traceability
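As a general illustration of how a time-limited, collection-scoped, signed grant can work, here is a minimal HMAC-based sketch. The field names and signing scheme are assumptions; Arke's actual mechanism is not published:

```python
import hashlib
import hmac
import json
import time

# Shared signing key -- in a real system this would come from a secret store.
SECRET = b"shared-signing-key"

def sign_grant(collection, ttl_seconds, now=None):
    """Issue a grant scoped to one collection, valid for ttl_seconds."""
    claims = {
        "collection": collection,                      # collection-scoped, never global
        "expires_at": (now or time.time()) + ttl_seconds,  # temporal permission
    }
    payload = json.dumps(claims, sort_keys=True).encode()
    return {**claims,
            "signature": hmac.new(SECRET, payload, hashlib.sha256).hexdigest()}

def verify_grant(grant, now=None):
    """Check the signature and expiry before honoring a grant."""
    claims = {k: v for k, v in grant.items() if k != "signature"}
    payload = json.dumps(claims, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(grant["signature"], expected)
            and (now or time.time()) < claims["expires_at"])
```

Signing the claims means a klados cannot widen its own scope: changing the collection name or the expiry invalidates the signature.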
## What's Next
When the workflow system launches, you'll be able to:
- Invoke pre-built workflows for common processing pipelines
- Compose custom workflows from available kladoi
- Build your own kladoi for specialized processing
- Monitor workflow progress in real-time
- Resume failed workflows from the point of failure
For now, document processing is available through the web interface at arke.institute.