Semantic Search

How to use Arke's semantic search powered by Pinecone vector embeddings.

Overview

Arke uses Pinecone for semantic similarity search. When entities are created or updated, their text content is embedded into vectors and indexed. You can search by meaning, not just keywords.

Search Endpoints

Arke provides specialized search endpoints for different use cases:

Endpoint                          Purpose
POST /search/collections          Search for collections by text query
POST /search/entities             Search entities within specific collection(s)
POST /search/discover             Two-step discovery: find collections, then search within them
POST /search/agents               Search for agents across the network
POST /search/similar/collections  Find collections similar to a given collection
POST /search/similar/items        Find items similar to a given entity (cross-collection)

Search for collections by text query:

POST /search/collections
Content-Type: application/json
Authorization: Bearer <token>

{
  "query": "medical research papers",
  "limit": 10
}
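
As a rough illustration, the request above could be assembled from Python with only the standard library. The base URL and the `build_search_request` helper are assumptions made for this sketch; they are not part of the Arke API. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical base URL: substitute the host of your own deployment.
BASE_URL = "https://arke.example.com"


def build_search_request(path, payload, token):
    """Build (but do not send) a JSON POST request for a search endpoint."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Same body as the collection search example above.
req = build_search_request(
    "/search/collections",
    {"query": "medical research papers", "limit": 10},
    token="<token>",
)
# To send it: urllib.request.urlopen(req) returns the HTTP response.
```

The same helper should work for the other endpoints on this page, since they all take a JSON body and a bearer token.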

Entity Search Within Collections

Search within one or more collections:

POST /search/entities
Content-Type: application/json
Authorization: Bearer <token>

{
  "collection_pi": "01KFNR0H0Q791Y1SMZWEQ09FGV",
  "query": "cetology research",
  "limit": 20,
  "types": ["chapter", "document"]
}

Or search multiple collections in parallel:

{
  "collection_pis": ["01KCOLL1...", "01KCOLL2..."],
  "query": "whale sightings",
  "limit": 20,
  "per_collection_limit": 5
}
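
The two request shapes above differ only in whether they carry `collection_pi` or `collection_pis`. A small helper can choose between them. This is a sketch: the function name is hypothetical, and it assumes `types` is accepted in both shapes, which this page only shows for the single-collection form.

```python
def entity_search_payload(query, collections, limit=20,
                          per_collection_limit=None, types=None):
    """Build a request body for POST /search/entities.

    Uses "collection_pi" for one collection and "collection_pis" for
    several, mirroring the two request shapes shown above.
    """
    if not collections:
        raise ValueError("at least one collection PI is required")

    payload = {"query": query, "limit": limit}
    if len(collections) == 1:
        payload["collection_pi"] = collections[0]
    else:
        payload["collection_pis"] = list(collections)
        if per_collection_limit is not None:
            payload["per_collection_limit"] = per_collection_limit
    if types:
        # Assumption: type filtering applies to both request shapes.
        payload["types"] = list(types)
    return payload
```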

When you do not know which collections to search, use the discover endpoint:

POST /search/discover
Content-Type: application/json
Authorization: Bearer <token>

{
  "query": "white whale sightings",
  "limit": 20,
  "collection_limit": 10,
  "per_collection_limit": 5,
  "types": ["file", "document"]
}

This performs a two-step search and then merges the results:

  1. Finds collections semantically related to your query
  2. Searches within each collection in parallel
  3. Aggregates and ranks results across all collections
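
The aggregation step can be pictured with a short sketch. The server performs this merge itself; the code below only illustrates one plausible way `per_collection_limit` and `limit` might interact, assuming results are ranked by `score` alone. The function name and the input shape are hypothetical.

```python
def aggregate_results(per_collection_results, limit=20, per_collection_limit=5):
    """Merge per-collection hits into one ranked list.

    per_collection_results maps a collection PI to a list of hits, where
    each hit is a dict with at least a "score" key.
    """
    merged = []
    for collection_pi, hits in per_collection_results.items():
        # Keep only the best hits from each collection.
        top = sorted(hits, key=lambda h: h["score"], reverse=True)
        for hit in top[:per_collection_limit]:
            merged.append({**hit, "collection_pi": collection_pi})

    # Rank across all collections, then apply the overall limit.
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:limit]
```

Capping each collection before the global sort keeps one large collection from crowding out the others, which is presumably the purpose of `per_collection_limit`.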

Entity Expansion

All search endpoints support the expand parameter to control how much entity data is returned:

Value                  Description
"preview" (default)    Lightweight preview with label, timestamps, truncated description
"full"                 Complete entity manifest with all properties and relationships
"none"                 Search metadata only (fastest, smallest payload)

Example with expansion:

{
  "query": "research papers",
  "limit": 10,
  "expand": "preview"
}

The response includes an entity_preview or entity field, depending on the expand mode:

{
  "results": [
    {
      "pi": "01KENTITY...",
      "label": "Research Paper.pdf",
      "type": "file",
      "score": 0.92,
      "collection_pi": "01KCOLL...",
      "entity_preview": {
        "id": "01KENTITY...",
        "type": "file",
        "label": "Research Paper.pdf",
        "description_preview": "Analysis of entity management patterns...",
        "created_at": "2025-01-15T10:00:00.000Z",
        "updated_at": "2025-01-20T14:30:00.000Z"
      }
    }
  ],
  "metadata": {
    "query": "research papers",
    "result_count": 1
  }
}
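
A client reading this response needs to handle whichever expansion field is present. The sketch below assumes the response shape shown above and assumes that the `entity` field returned for "full" carries the same `label` and `updated_at` keys as the preview; `summarize_results` is a hypothetical helper name.

```python
import json


def summarize_results(response_body):
    """Return one small dict per hit, whatever the expand mode was."""
    data = json.loads(response_body)
    rows = []
    for hit in data.get("results", []):
        # "preview" yields entity_preview, "full" yields entity,
        # and "none" yields neither of them.
        detail = hit.get("entity_preview") or hit.get("entity") or {}
        rows.append({
            "pi": hit["pi"],
            "label": hit.get("label") or detail.get("label"),
            "score": hit["score"],
            "updated_at": detail.get("updated_at"),
        })
    return rows
```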

Find entities similar to a known entity:

POST /search/similar/items
Content-Type: application/json
Authorization: Bearer <token>

{
  "pi": "01KENTITY...",
  "collection_pi": "01KCOLL...",
  "limit": 20,
  "tier1_limit": 10,
  "tier2_limit": 5,
  "include_same_collection": true
}

This performs a two-tier search and then aggregates the results:

  1. Finds collections similar to the entity's collection
  2. Searches within each collection for similar items
  3. Aggregates results with diversity weighting

How Indexing Works

  1. Entity is created or updated
  2. Text content is extracted (from properties, OCR output, etc.)
  3. Content is embedded into a vector using the embedding model
  4. Vector is stored in Pinecone with entity metadata
  5. Searches compare query vectors against indexed vectors using cosine similarity
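
The comparison in the last step can be illustrated in plain Python. In practice Pinecone performs this search; the sketch below only shows what cosine similarity computes, using toy vectors and hypothetical function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen for this sketch: a zero vector matches nothing.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, indexed, top_k=3):
    """Score every indexed vector against the query and keep the best top_k.

    indexed maps an entity ID to its embedding vector.
    """
    scored = [
        (entity_id, cosine_similarity(query_vec, vec))
        for entity_id, vec in indexed.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because cosine similarity depends only on the angle between vectors and not on their length, two texts with similar meaning can score highly even when one is much longer than the other.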

Two-Tier Discovery Model

Arke uses a two-tier search model for discovery:

  1. Collection-level -- Find collections with similar content
  2. Entity-level -- Drill into specific entities within those collections

Narrowing to relevant collections before searching entities keeps costs sustainable while encouraging thoughtful curation.
