Documents API - PDF and File Processing

Upload documents, images, videos, and audio files. Hebbrix automatically extracts content, generates embeddings, and makes everything searchable.

Supported Formats

Documents - PDF, DOCX, TXT, MD, HTML, CSV
Images - PNG, JPG, WebP, GIF, SVG
Videos - MP4, WebM, MOV (transcription)
Audio - MP3, WAV, M4A (transcription)
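
A client can check a file against the list above before spending bandwidth on an upload. The sketch below is illustrative only: `SUPPORTED` and `media_kind` are hypothetical names, the extensions are copied from the list above with no extra aliases (such as `.jpeg`) assumed, and the server remains the authority on what it accepts.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats list; the group names are illustrative.
SUPPORTED = {
    "document": {".pdf", ".docx", ".txt", ".md", ".html", ".csv"},
    "image": {".png", ".jpg", ".webp", ".gif", ".svg"},
    "video": {".mp4", ".webm", ".mov"},
    "audio": {".mp3", ".wav", ".m4a"},
}


def media_kind(filename: str) -> Optional[str]:
    """Return the media group for a filename, or None if its extension is not listed."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the final suffix
    for kind, extensions in SUPPORTED.items():
        if ext in extensions:
            return kind
    return None
```

Calling `media_kind("research_paper.pdf")` returns `"document"`; an unlisted extension returns `None`, which a caller can treat as "do not upload".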

Processing Pipeline

Upload

File uploaded and validated

Extract

Text extraction, OCR for images, transcription for audio/video

Chunk

Content split into semantic chunks with overlap

Embed

Generate vector embeddings for each chunk

Index

Store in vector database for fast retrieval
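
To make the Chunk step concrete, here is a minimal sketch of splitting with overlap. It uses a fixed-size word window rather than semantic boundaries, and the `size` and `overlap` defaults are arbitrary; it is not Hebbrix's chunker, whose internals are not described on this page.

```python
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.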

Endpoints

POST /v1/documents/upload - Upload a file as multipart/form-data
GET /v1/documents/{document_id} - Fetch a document, including its processing status

Code Examples

Upload a Document

Python

import os
import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# POST /v1/documents/upload is multipart/form-data.
# `collection_id` is a FORM field (not a query param). Omit it to let the
# server auto-assign your default collection. The response field
# `collection_auto_assigned` will be true when that happens.
with open("research_paper.pdf", "rb") as f:
    r = requests.post(
        f"{BASE}/documents/upload",
        headers=H,
        files={"file": ("research_paper.pdf", f, "application/pdf")},
        data={
            "collection_id": "col_xyz",  # optional
            "category": "research",      # optional
            "tags": "ml,2024",           # optional, comma-separated
        },
    )
r.raise_for_status()  # surface HTTP errors (e.g. 422 in strict mode) before parsing
body = r.json()
doc = body["document"]

print(f"Document ID: {doc['id']}")
# Legacy internal status (fine-grained enum):
#   uploaded / processing / searchable / enriching / enriched / processed / failed / deleted
print(f"status = {doc['status']}")
# PDF-contract lifecycle status (prefer this for new integrations):
#   pending / processing / completed / failed / deleted
print(f"processing_status = {doc['processing_status']}")
print(f"auto_assigned_default = {body['collection_auto_assigned']}")

Poll until processing completes

Python (with polling)

import time

doc_id = doc["id"]
while True:
    r = requests.get(f"{BASE}/documents/{doc_id}", headers=H)
    r.raise_for_status()  # fail fast on HTTP errors instead of a KeyError below
    doc = r.json()["document"]
    if doc["processing_status"] in ("completed", "failed"):
        break
    print(f"Processing ({doc['processing_status']})…")
    time.sleep(2)

if doc["processing_status"] == "failed":
    raise RuntimeError(f"Processing failed: {doc.get('processing_error')}")

print(f"Document ready: {doc['chunk_count']} chunks, {doc['memory_count']} memories")
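
The loop above polls at a fixed interval and never gives up. For production use, a bounded variant with exponential backoff may be preferable. The following is a sketch under assumptions: `wait_for_document`, its parameters, and the default timings are hypothetical, and `deleted` is treated as terminal because it appears in the `processing_status` value list. The `fetch` callable is injected so the helper does not depend on any HTTP library.

```python
import time
from typing import Callable, Dict

# processing_status values after which polling is pointless.
TERMINAL = {"completed", "failed", "deleted"}


def wait_for_document(
    fetch: Callable[[], Dict],
    timeout: float = 300.0,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch() until the document reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        doc = fetch()
        if doc["processing_status"] in TERMINAL:
            return doc
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"still {doc['processing_status']!r} after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off: 1s, 2s, 4s, ... capped at max_delay
```

With the names from the upload example, `fetch` could be `lambda: requests.get(f"{BASE}/documents/{doc_id}", headers=H).json()["document"]`. The caller still needs to inspect the returned `processing_status`, since `failed` and `deleted` also end the wait.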

cURL Example

The endpoint is multipart/form-data. collection_id is a form field (not a query param); omit it to let the server auto-assign the caller's default collection. Add -H "X-Hebbrix-Require-Collection: true" to forbid silent defaulting.

Upload Document

# Upload to a specific collection
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz" \
  -F "category=research" \
  -F "tags=ml,2024"

# Let the server auto-assign to your default collection
# (response will include  "collection_auto_assigned": true )
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf"

# Strict mode: with this header the server returns 422 when collection_id is missing,
# so collection_id is supplied explicitly here
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -H "X-Hebbrix-Require-Collection: true" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz"
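
For Python callers, the three cURL variants differ only in one header and one form field. The sketch below builds both; `upload_params` is a hypothetical helper, not part of any Hebbrix SDK, and the local `ValueError` is a design choice of this sketch that mirrors the 422 the server is described as returning in strict mode.

```python
from typing import Dict, Optional, Tuple


def upload_params(
    api_key: str,
    collection_id: Optional[str] = None,
    require_collection: bool = False,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, form_fields) for POST /v1/documents/upload."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data: Dict[str, str] = {}
    if require_collection:
        # Strict mode: forbid silent assignment to the default collection.
        headers["X-Hebbrix-Require-Collection"] = "true"
        if collection_id is None:
            # Fail locally rather than waiting for the server's 422.
            raise ValueError("strict mode requires collection_id")
    if collection_id is not None:
        data["collection_id"] = collection_id  # sent as a form field, not a query param
    return headers, data
```

The returned pair can be passed as `headers=` and `data=` to `requests.post`, alongside the `files=` argument shown in the upload example.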

Processing Status

Document processing is asynchronous. Check the processing_status field:

pending - Queued for processing
processing - Currently being processed
completed - Ready for search
failed - Processing failed; check the processing_error field

Processing lifecycle & status fields

A document moves through a single lifecycle. The fields below describe the same progression from different angles, which is why they can look like they overlap. Use this section as the canonical reference.

State machine

uploaded → extracting → processing → indexing → searchable

At any step the document can transition to failed instead.
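
The state machine can be written down as a small transition function, which is handy for validating status changes observed while polling. This is a sketch: `ORDER` and `next_states` are hypothetical names, and it assumes that `failed` is reachable from every state before `searchable` and that `searchable` and `failed` have no outgoing transitions.

```python
from typing import Set

# Happy-path order taken from the state machine above.
ORDER = ("uploaded", "extracting", "processing", "indexing", "searchable")


def next_states(state: str) -> Set[str]:
    """Return the states a document may move to from `state`."""
    if state == "failed":
        return set()  # assumed terminal
    if state not in ORDER:
        raise ValueError(f"unknown state: {state!r}")
    i = ORDER.index(state)
    if i == len(ORDER) - 1:
        return set()  # searchable is the end of the happy path
    return {ORDER[i + 1], "failed"}
```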

Field meanings

Field	Type	Description
processing_status	string	The authoritative lifecycle state (e.g. `"pending"`, `"processing"`, `"completed"`, `"failed"`).
status	string	A legacy, finer-grained internal status (e.g. `"processed"`); prefer `processing_status` for new integrations.
index_status	string	Indexing sub-state: `"indexing"` while embeddings/BM25 are being written, `"completed"` when done.
is_searchable	boolean	`true` means the document's memories are retrievable via search right now. This can be `true` before `memories_indexed` equals `memories_total`, because search becomes available incrementally.
memories_created	integer	Number of memories extracted from the document.
memories_indexed	integer	Number of those memories fully embedded/indexed so far.
memories_total	integer	Total memories expected for the document.

Treat is_searchable: true as the readiness signal for querying; wait for memories_indexed to equal memories_total only if you need every memory fully indexed.
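
The readiness rule above can be captured in one helper. This is a minimal sketch: `is_ready` is a hypothetical function, it reads only the fields named in the table, and it assumes the document payload is a plain dictionary as in the polling example.

```python
from typing import Any, Mapping


def is_ready(doc: Mapping[str, Any], require_full_index: bool = False) -> bool:
    """Return True when the document can be queried.

    With require_full_index=True, also demand that every memory has been indexed.
    """
    if doc.get("processing_status") == "failed":
        return False
    if not doc.get("is_searchable"):
        return False
    if not require_full_index:
        return True  # search is available incrementally
    total = doc.get("memories_total")
    indexed = doc.get("memories_indexed")
    return total is not None and indexed == total
```

Most callers can use the default; pass `require_full_index=True` when a query must see the whole document, for example before running an evaluation over it.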
