Do you replace the model providers?

No — we sit in front of them. Keep your OpenAI, Anthropic, and self-hosted endpoints; gain routing, caching, observability, and governance.

What about self-hosted LLMs?

Full support for vLLM, TGI, Ollama, and any OpenAI-compatible endpoint. Mix and match per-request.

How do you handle rate limits?

Per-key, per-team, per-model token-aware rate limiting with queueing and burst control.
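One common way to get token-aware limiting with burst control is a token bucket whose budget is counted in LLM tokens rather than requests. The sketch below is illustrative only: `TokenBucket`, `RateLimiter`, and the composite key are hypothetical names and a simplification, not the Deepstack implementation. Queueing is left out; a caller would queue a request when `allow` returns false.

```typescript
// Hypothetical sketch: names and shapes here are assumptions, not the Deepstack API.
interface BucketConfig {
  capacity: number;        // burst ceiling, in LLM tokens
  refillPerSecond: number; // sustained rate, in LLM tokens per second
}

class TokenBucket {
  private config: BucketConfig;
  private available: number;
  private lastRefillMs: number;

  constructor(config: BucketConfig, nowMs: number) {
    this.config = config;
    this.available = config.capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  // Refill for the elapsed time, then deduct the cost if the request fits.
  tryConsume(tokenCost: number, nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.available = Math.min(
      this.config.capacity,
      this.available + elapsedSeconds * this.config.refillPerSecond,
    );
    this.lastRefillMs = nowMs;
    if (tokenCost > this.available) return false;
    this.available -= tokenCost;
    return true;
  }
}

class RateLimiter {
  private config: BucketConfig;
  private buckets = new Map<string, TokenBucket>();

  constructor(config: BucketConfig) {
    this.config = config;
  }

  // One bucket per (key, team, model) combination; a simplification of
  // separate per-key, per-team, and per-model limits.
  allow(apiKey: string, team: string, model: string, tokenCost: number, nowMs: number): boolean {
    const id = `${apiKey}|${team}|${model}`;
    let bucket = this.buckets.get(id);
    if (!bucket) {
      bucket = new TokenBucket(this.config, nowMs);
      this.buckets.set(id, bucket);
    }
    return bucket.tryConsume(tokenCost, nowMs);
  }
}
```

Passing the clock in as `nowMs` keeps the sketch deterministic; in practice you would likely pass `Date.now()`.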

PII detection and redaction are built in, applied at request time, and configurable per policy.
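To make request-time redaction concrete, here is a minimal regex-based sketch. The `RedactionPolicy` shape, the `redact` function, and the two patterns are assumptions for illustration; production PII detection usually covers many more types and does not rely on regular expressions alone.

```typescript
// Hypothetical sketch: the policy shape and patterns are illustrative assumptions.
type PiiKind = "email" | "ssn";

interface RedactionPolicy {
  redact: PiiKind[]; // which PII kinds this policy removes
}

const PATTERNS: Record<PiiKind, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g, // US Social Security number format
};

// Replace every match of each enabled PII kind with a placeholder tag.
function redact(text: string, policy: RedactionPolicy): string {
  let out = text;
  for (const kind of policy.redact) {
    out = out.replace(PATTERNS[kind], `[${kind.toUpperCase()}]`);
  }
  return out;
}

const prompt = "Mail jane@example.com, SSN 123-45-6789.";
console.log(redact(prompt, { redact: ["email", "ssn"] }));
// prints "Mail [EMAIL], SSN [SSN]."
```

Because the policy lists the kinds to redact, two policies can treat the same prompt differently, which is the "configurable per policy" idea in miniature.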

Platform

The enterprise AI OS — every layer in one platform.

Deepstack is a single operating layer for the hard problems of enterprise AI: model orchestration, agent infrastructure, neural search / RAG, observability, and governance.

Capabilities

Explore every layer.

AI Infrastructure
Neural Infrastructure
Enterprise AI Stack
Agent Infrastructure
AI Orchestration
Workflow Engine
Neural Search
LLM Infrastructure

Architecture / Live

The enterprise AI OS, top to bottom.

Five core modules — governance, observability, agents, orchestration, and retrieval — designed to interoperate as one operating system.

Layer 00

Enterprise Governance

Permissions + RBAC

Audit Log

AI Policies

Compliance

Deploy Controls

Layer 01

AI Observability

Prompts + Traces

Latency

Hallucinations

Failures

Tokens + Cost

Layer 02

Agent Infrastructure

AI Agents

Workflow Chains

Memory

Tool Execution

Multi-Agent

Layer 03

Model Orchestration

OpenAI

Claude

Gemini

Local Models

Open-Source LLMs

Layer 04

Neural Search / RAG

Vector Search

Semantic Retrieval

Embeddings

Context Engineering

Rerankers

Developer experience

One API. Any model. Any agent.

A single endpoint routes to the right model with policy-aware fallback. The SDK turns it into agents in a few lines.

HTTP / curl

curl https://api.deepstack.dev/v1/route \
  -H "Authorization: Bearer $DEEPSTACK_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task": "reasoning",
    "messages": [{ "role": "user", "content": "Summarize Q3 earnings." }],
    "policy": { "max_cost_usd": 0.02, "max_latency_ms": 2000 },
    "fallback": ["gpt-4o", "claude-3.5-sonnet", "llama-3.1-405b"]
  }'
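The request above pairs a `policy` with an ordered `fallback` list. One plausible reading, sketched below, is that models whose estimated cost or latency exceeds the policy are skipped, and the rest are tried in order until one succeeds. The `ModelEstimate` type and both function names are hypothetical; the actual routing logic behind `/v1/route` is not described on this page.

```typescript
// Hypothetical sketch of policy-aware fallback; not the actual /v1/route logic.
interface Policy {
  maxCostUsd: number;
  maxLatencyMs: number;
}

interface ModelEstimate {
  model: string;
  estCostUsd: number;   // assumed per-request cost estimate
  estLatencyMs: number; // assumed latency estimate
}

// Keep the fallback order, dropping models that are unknown or break the policy.
function eligibleModels(fallback: string[], estimates: ModelEstimate[], policy: Policy): string[] {
  const byName = new Map<string, ModelEstimate>();
  for (const e of estimates) byName.set(e.model, e);
  return fallback.filter((name) => {
    const e = byName.get(name);
    return (
      e !== undefined &&
      e.estCostUsd <= policy.maxCostUsd &&
      e.estLatencyMs <= policy.maxLatencyMs
    );
  });
}

// Try each candidate in order and return the first successful result.
async function routeWithFallback<T>(
  candidates: string[],
  call: (model: string) => Promise<T>,
): Promise<{ model: string; result: T }> {
  let lastError: unknown = new Error("no eligible model");
  for (const model of candidates) {
    try {
      return { model, result: await call(model) };
    } catch (err) {
      lastError = err; // remember the failure and move to the next model
    }
  }
  throw lastError;
}
```

Filtering before calling keeps a request from being sent to a model that is already known to break the budget.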

TypeScript SDK

import { Deepstack } from "@deepstack/sdk";

const ds = new Deepstack();

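// searchDocs, runSQL, and sendEmail are tool definitions assumed to be declared elsewhere in your code.
const agent = ds.agent({
  tools: [searchDocs, runSQL, sendEmail],
  retrieval: { index: "company-kb", topK: 8 }, // retrieve the top 8 matches from the "company-kb" index
  observability: { trace: true, evals: ["faithfulness"] }, // trace each run and evaluate faithfulness
});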

const result = await agent.run("Draft the investor update.");

Observability

See every token. Every trace. Every dollar.

Cost, latency, quality and drift — unified across models, agents, and retrieval.

Production · us-east-1 · Last 24h · env=prod

Live

Tokens / sec

12,840

+8.4%

p95 latency

182 ms

-12 ms

p50 94 · p95 182 · p99 318

Cost / 1M tok

$1.84

-21%

[Chart: inference latency (ms) at p50 / p95 / p99, from 00:00 to now]
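For readers unfamiliar with the p50 / p95 / p99 figures, the sketch below computes them from raw latency samples with the nearest-rank method. This is one standard definition of a percentile; the page does not say which method the dashboard uses, and the `percentile` function name is an assumption.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // copy so the input is not mutated
  const rank = Math.ceil((p * sorted.length) / 100);  // 1-based rank
  return sorted[rank - 1];
}

// Example with made-up latencies in milliseconds.
const latenciesMs = [90, 10, 50, 30, 70, 20, 100, 40, 80, 60];
console.log(percentile(latenciesMs, 50)); // prints 50
console.log(percentile(latenciesMs, 95)); // prints 100
```

Tail percentiles such as p95 and p99 matter because a handful of slow requests can dominate user experience even when the median looks healthy.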

Model mix

gpt-4o: 38%
claude-3.5-sonnet: 27%
llama-3.1-405b: 18%
mistral-large: 11%
others: 6%

Security & Compliance

Enterprise-ready from day one.

SOC 2 Type II: Audited

HIPAA: BAA eligible

GDPR: EU residency

ISO 27001: In progress

Self-host / BYOC: Available

Zero-retention mode: Per request

Start building

The neural infrastructure layer is ready.

Free to start. Production-grade by default. Built for the systems you'll ship tomorrow.
