DATAFABRIX VISION · Layers 2–3

Unified observability
across the entire AI stack.

Datafabrix Vision is the observability module of the platform. It unifies telemetry across PCIe, GPU, SSD, power, network, and workload into a single coherent stream — so engineers can trace any AI workload's performance from kernel launch all the way down to the substrate.

Roadmap · 2027
DATAFABRIX VISION · MODULE

Observability — engineered for AI-class workloads.

End-to-end traces across the AI stack. OpenTelemetry-native. AI-specific golden signals.

WHY IT MATTERS FOR AI DATA CENTERS

The problem we solve.

AI workloads are notoriously hard to observe. A training run touches a hundred subsystems on every iteration: GPU compute, NVLink, host PCIe, NIC, storage, controller firmware, power, cooling. When something is slow, the answer almost always lives across multiple layers — but the tools to see across layers don't exist.

Vision was built specifically for this. It correlates traces and metrics across the full AI stack — workload-level traces from your scheduler, system-level metrics from BMCs and host OS, device-level signal from PCIe and storage controllers, and substrate-level data from the Datafabrix PCIe Gen6 Thermal-Aware Smart Backplane.

The result: when a training run is slow, you can answer 'why' in one query — not five tools, not three teams, not a war room.

End-to-end · Trace coverage
OpenTelemetry · Native
Per-workload · Attribution
Prometheus · Compatible

CAPABILITIES

What Datafabrix Vision does.

  1. End-to-end traces

    A training iteration is traceable from kernel launch through PCIe transfer through storage access through power draw. One trace ID, every layer.

  2. Golden-signal dashboards

    Pre-built dashboards for AI workloads: throughput, latency, throttle events, error rates, energy per token. Tuned for the metrics that actually matter.

  3. Correlation engine

    When metric A spikes, Vision automatically surfaces every other signal that correlates — across layers, across tenants, across time.

  4. OpenTelemetry-native

    First-class OpenTelemetry support means Vision drops into your existing observability pipeline without disruption. Your existing tools keep working; Vision adds the AI-stack-specific layer.

  5. Prometheus-compatible

    Every metric Vision emits is also available via Prometheus. Your existing Grafana dashboards, alerts, and integrations keep working.

  6. Per-workload attribution

    Every unit of resource consumption (compute time, I/O, energy) is attributable to a workload, a tenant, or a service. The unit economics of every AI workload become measurable.
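
As a rough illustration of per-workload attribution, the sketch below sums tagged resource readings per (tenant, workload) pair. It is not Vision's API: the `Sample` record, its field names, and the resource labels are all hypothetical, and the numbers are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One resource reading tagged with its owner (all fields hypothetical)."""
    tenant: str
    workload: str
    resource: str   # e.g. "gpu_seconds", "nvme_bytes", "energy_joules"
    amount: float


def attribute(samples):
    """Sum each resource per (tenant, workload) pair."""
    totals = defaultdict(lambda: defaultdict(float))
    for s in samples:
        totals[(s.tenant, s.workload)][s.resource] += s.amount
    # Convert to plain dicts so the result is easy to print or serialize.
    return {owner: dict(usage) for owner, usage in totals.items()}


# Illustrative readings, not measurements.
samples = [
    Sample("team-a", "llm-train", "gpu_seconds", 120.0),
    Sample("team-a", "llm-train", "energy_joules", 84000.0),
    Sample("team-b", "checkpoint", "nvme_bytes", 5.0e9),
    Sample("team-a", "llm-train", "gpu_seconds", 60.0),
]

usage = attribute(samples)
for owner, resources in sorted(usage.items()):
    print(owner, resources)
```

The only design point here is that every reading carries its owner at the source, so roll-ups by tenant or by service are a grouping step rather than a later reconciliation.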

HOW IT HELPS AI DATA CENTERS

Real scenarios. Real outcomes.

Three representative engagements that illustrate the kind of value Datafabrix Vision delivers in the field.

The Problem

The slow training iteration

A 4-hour training iteration is now taking 4 hours 12 minutes. Sometimes. Not always. Engineering has no idea why.

Our Approach

Vision's correlation engine surfaces that the slow iterations (12 minutes, or 5%, over the 4-hour baseline) align with a specific tenant's bursty checkpoint-write pattern saturating a shared NVMe channel. The training tenant and the checkpoint tenant are on the same rack.
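
The idea behind this kind of correlation can be sketched in a few lines. The example below ranks candidate signals by their Pearson correlation with iteration time. It is a simplified stand-in, not Vision's correlation engine; the signal names and all values are illustrative assumptions, not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Illustrative data: iteration time in minutes (240 = on time, 252 = slow)
# and two hypothetical signals sampled over the same iterations.
iteration_minutes = [240, 240, 252, 240, 252, 240, 240, 252]
signals = {
    "neighbour_nvme_write_gbps": [0.2, 0.3, 6.1, 0.2, 5.8, 0.4, 0.3, 6.0],
    "inlet_temp_c": [24.0, 24.1, 24.0, 23.9, 24.1, 24.0, 24.2, 23.9],
}

# Rank signals by strength of correlation with iteration time.
ranked = sorted(
    ((name, pearson(iteration_minutes, values)) for name, values in signals.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

With this made-up data the neighbouring tenant's write rate ranks first by a wide margin. Correlation alone does not prove cause, which is why the shared NVMe channel and rack placement matter as confirming evidence.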

The Outcome

Re-pack the two tenants so they no longer share the NVMe channel. p95 iteration time stabilises. The training team gets the answer in one query.

The Problem

Per-token energy accounting

The sustainability team wants per-token energy attribution for the company's largest LLM inference service. The CFO wants per-token cost for the same service.

Our Approach

Vision aggregates power draw, cooling cost (via Thermal module), and compute time per inference request. Output: a verifiable per-token energy and per-token cost figure.
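
The arithmetic behind such a figure can be sketched as follows. This is a minimal model under stated assumptions, not Vision's accounting method: cooling is treated as a flat fraction of IT energy, cost covers electricity only, and the `InferenceRequest` fields, the overhead, the price, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3_600_000.0  # 1 kWh = 3.6 MJ


@dataclass
class InferenceRequest:
    avg_power_watts: float   # IT power drawn while serving the request
    compute_seconds: float
    tokens: int


def per_token_figures(requests, cooling_overhead, price_per_kwh):
    """Return (joules per token, cost per token) across a batch of requests.

    cooling_overhead is cooling energy as a fraction of IT energy
    (0.25 means 25% extra); price_per_kwh is in any currency unit.
    """
    it_joules = sum(r.avg_power_watts * r.compute_seconds for r in requests)
    total_joules = it_joules * (1.0 + cooling_overhead)
    total_tokens = sum(r.tokens for r in requests)
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    joules_per_token = total_joules / total_tokens
    cost_per_token = joules_per_token / JOULES_PER_KWH * price_per_kwh
    return joules_per_token, cost_per_token


# Illustrative inputs: two requests at 400 W, 500 tokens each.
requests = [
    InferenceRequest(avg_power_watts=400.0, compute_seconds=2.0, tokens=500),
    InferenceRequest(avg_power_watts=400.0, compute_seconds=3.0, tokens=500),
]
joules_per_token, cost_per_token = per_token_figures(
    requests, cooling_overhead=0.25, price_per_kwh=0.10
)
print(f"{joules_per_token:.2f} J/token, {cost_per_token:.2e} per token")
```

For these inputs the IT energy is 2,000 J, cooling adds 25% for 2,500 J in total, and 1,000 tokens gives 2.5 J per token. A real cost model would also include hardware amortisation and other overheads that this sketch leaves out.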

The Outcome

Sustainability report has its numbers. CFO has a pricing-margin model. Both teams have a single source of truth.

The Problem

End-to-end perf debugging

An NVLink-aware all-reduce is running 18% below its expected performance. The team has been on it for a week. Three different tools each say 'looks fine to me'.

Our Approach

Vision's trace shows the all-reduce stalls on a specific PCIe path during a specific phase. Substrate-level signal confirms a marginal connector on one slot. The connector is reseated.
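
To show what reading such a trace might look like, the sketch below takes the spans that share one trace ID and picks the phase that overran its baseline the most. It is a simplified illustration, not Vision's trace format: the `Span` fields, layer labels, span names, baselines, and timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    layer: str       # e.g. "nvlink", "pcie", "storage", "substrate"
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def worst_overrun(spans, trace_id, expected_ms):
    """Return the span in one trace that exceeds its baseline the most.

    Spans with no baseline entry are treated as having zero overrun.
    Raises ValueError if the trace has no spans.
    """
    in_trace = [s for s in spans if s.trace_id == trace_id]

    def overrun(span):
        return span.duration_ms - expected_ms.get(span.name, span.duration_ms)

    return max(in_trace, key=overrun)


# Illustrative phases of one all-reduce, all under the same trace ID.
spans = [
    Span("iter-42", "nvlink", "reduce_scatter", 0.0, 40.0),
    Span("iter-42", "pcie", "host_copy_slot3", 40.0, 76.0),
    Span("iter-42", "nvlink", "all_gather", 76.0, 118.0),
]
expected_ms = {"reduce_scatter": 40.0, "host_copy_slot3": 18.0, "all_gather": 41.0}

worst = worst_overrun(spans, "iter-42", expected_ms)
print(f"{worst.layer}/{worst.name}: {worst.duration_ms:.0f} ms "
      f"vs {expected_ms[worst.name]:.0f} ms expected")
```

Because every phase carries the same trace ID and a layer label, the stall can be pinned to one path and one phase, which is the point at which substrate-level data on that slot becomes the next thing to check.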

The Outcome

All-reduce returns to spec. One trace, one root cause, one hour of debugging instead of one week.

INTEGRATIONS

Drops cleanly into your existing stack.

Open-standards first. Your existing tooling keeps working — Datafabrix Vision adds the AI-infrastructure-specific layer you've been missing.

OpenTelemetry · Prometheus · Grafana · Datadog · Jaeger · Honeycomb

EXPLORE THE PLATFORM

Datafabrix Vision works best with...

Ready to see Vision in action?

Tell us about your fleet and your top operational pain. We will map Datafabrix Vision to a 90-day pilot scope — and quantify the expected outcome — within five business days.