OpenTelemetry integration architecture
Goal
- Replace Gestalt's custom logging/event/metrics with OpenTelemetry for logs, metrics, and traces.
- Keep OTLP as the on-the-wire standard to avoid vendor lock-in.
- Run a local collector that can later be swapped for remote backends.
Scope
- Backend: emit OTLP logs, metrics, traces from Go services.
- Collector: otelcol-gestalt started with the server and stopped on exit.
- Frontend: read logs and metrics via Collector HTTP endpoints (OTLP/HTTP).
High-level design
- Gestalt server owns a local Collector lifecycle (start before serving HTTP, stop on exit).
- Backend SDK exports OTLP to the local Collector:
- Traces: OTLP/HTTP by default, gRPC optional.
- Metrics: OTLP/HTTP by default, gRPC optional.
- Logs: OTLP/HTTP by default, gRPC optional.
- Collector pipeline:
- Receiver: otlpreceiver (grpc + http)
- Processor: batchprocessor
- Exporters: otlpexporter (future remote), fileexporter (local persistence)
Runtime topology
- Server process:
- starts collector as a child process
- initializes OTel SDK providers and exporters
- wires HTTP middleware and event/log bridges
- Collector process:
- listens on localhost (OTLP gRPC 4317, OTLP HTTP 4318)
- writes persistent log/metric/trace files to .gestalt/otel/
Configuration
- New config block (server):
- otel.enabled (bool, default true)
- otel.endpoint (string, default http://127.0.0.1:4318)
- otel.service_name (string, default gestalt)
- otel.resource_attributes (map[string]string)
- otel.exporter (http|grpc, default http)
- otel.log_level (minimum level for frontend logs, default INFO)
- Collector config file:
- Location: .gestalt/otel/collector.yaml
- Owned by Gestalt (rendered at startup with ports and file paths)
- Environment variables (runtime):
- GESTALT_OTEL_ENABLED (collector on/off)
- GESTALT_OTEL_COLLECTOR (collector binary path)
- GESTALT_OTEL_CONFIG (collector config path)
- GESTALT_OTEL_DATA_DIR (collector data dir)
- GESTALT_OTEL_GRPC_ENDPOINT / GESTALT_OTEL_HTTP_ENDPOINT (collector listen endpoints)
- GESTALT_OTEL_REMOTE_ENDPOINT (optional OTLP gRPC exporter target)
- GESTALT_OTEL_REMOTE_INSECURE (true to skip TLS verification for remote exporter)
- GESTALT_OTEL_SELF_METRICS (true to enable collector self-metrics)
- GESTALT_OTEL_MAX_RECORDS (cap records read from local otel.json for APIs)
- GESTALT_OTEL_SDK_ENABLED (SDK on/off)
- GESTALT_OTEL_SERVICE_NAME (service.name override)
- GESTALT_OTEL_RESOURCE_ATTRIBUTES (comma-separated key=value list)
- Port selection:
- Defaults to 127.0.0.1:4317 (gRPC) and 127.0.0.1:4318 (HTTP).
- If defaults are occupied and no endpoint env vars are set, Gestalt picks an available adjacent port pair and logs the selection.
- Setting GESTALT_OTEL_GRPC_ENDPOINT or GESTALT_OTEL_HTTP_ENDPOINT disables automatic port selection for the collector.
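The port fallback can be sketched as follows. This is one possible reading of "adjacent port pair": two consecutive ports, searched upward from the defaults in steps of two. The function names and the attempt limit are assumptions for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// portFree reports whether a TCP listener can bind 127.0.0.1:port right now.
func portFree(port int) bool {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	ln.Close()
	return true
}

// pickPortPair returns the first (grpc, http) pair of consecutive ports for
// which free reports true, starting at grpcBase and trying at most
// `attempts` pairs.
func pickPortPair(grpcBase, attempts int, free func(int) bool) (int, int, error) {
	for i := 0; i < attempts; i++ {
		grpcPort := grpcBase + 2*i
		httpPort := grpcPort + 1
		if free(grpcPort) && free(httpPort) {
			return grpcPort, httpPort, nil
		}
	}
	return 0, 0, errors.New("no free adjacent port pair found")
}

func main() {
	grpcPort, httpPort, err := pickPortPair(4317, 10, portFree)
	if err != nil {
		fmt.Println("port selection failed:", err)
		return
	}
	fmt.Printf("collector ports: grpc=%d http=%d\n", grpcPort, httpPort)
}
```

Probing and then binding later is inherently racy, since another process can take the port in between. The collector's own bind error remains the final check.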
Resource model
- Resource attributes (static):
- service.name=gestalt
- service.version=<build version>
- service.instance.id=<hostname or random instance id>
- os.type, os.version
- build.commit, build.time (if available)
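To show how the comma-separated key=value list in GESTALT_OTEL_RESOURCE_ATTRIBUTES might be combined with the static attributes, here is a small sketch. The function names and the rule that overrides win over defaults are assumptions, not confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseResourceAttributes parses "k1=v1,k2=v2" into a map. Empty segments
// are skipped; a segment without "=" or with an empty key is an error.
func parseResourceAttributes(raw string) (map[string]string, error) {
	attrs := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		key, value, ok := strings.Cut(pair, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid resource attribute %q", pair)
		}
		attrs[key] = strings.TrimSpace(value)
	}
	return attrs, nil
}

// mergeAttributes copies defaults and then overlays overrides on top.
func mergeAttributes(defaults, overrides map[string]string) map[string]string {
	merged := make(map[string]string, len(defaults)+len(overrides))
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range overrides {
		merged[k] = v
	}
	return merged
}

func main() {
	static := map[string]string{"service.name": "gestalt"}
	extra, err := parseResourceAttributes("deployment.environment=dev, team=core")
	if err != nil {
		fmt.Println("bad attributes:", err)
		return
	}
	fmt.Println(mergeAttributes(static, extra))
}
```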
Log mapping
- internal/logging.LogEntry -> OTel LogRecord
- body: message
- severity: mapped from Level
- attributes: context map
- timestamp: LogEntry.Timestamp
- Event bus records -> LogRecord with type attributes:
- event.bus
- event.type
- event.payload (structured where possible)
Metrics mapping
- Replace internal/metrics.Registry metrics with OTel instruments:
- flow.activities.succeeded/failed
- event_bus.subscribers (gauge)
- events.published/dropped (counter)
- sessions.active (gauge)
Tracing model
- HTTP server spans: per-request spans with standard http.* attributes.
- Explicit spans for key actions:
- session.create, session.delete, agent.input, session.output
- Propagate trace context:
- from inbound HTTP headers (traceparent)
- into WebSocket connect spans
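For reference, the traceparent header carries four dash-separated lowercase hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters). The sketch below parses version 00 only and is meant to illustrate the format. In practice the OTel SDK's W3C trace-context propagator would do this work, and the type and function names here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// traceParent holds the fields extracted from a version-00 header.
type traceParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// isLowerHex reports whether s contains only the characters 0-9 and a-f.
func isLowerHex(s string) bool {
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceparent validates and splits a version-00 traceparent value.
func parseTraceparent(header string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return traceParent{}, errors.New("traceparent: expected 4 fields")
	}
	version, traceID, spanID, flags := parts[0], parts[1], parts[2], parts[3]
	if version != "00" {
		return traceParent{}, fmt.Errorf("traceparent: unsupported version %q", version)
	}
	if len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 ||
		!isLowerHex(traceID) || !isLowerHex(spanID) || !isLowerHex(flags) {
		return traceParent{}, errors.New("traceparent: malformed field")
	}
	// All-zero IDs are invalid.
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return traceParent{}, errors.New("traceparent: all-zero id")
	}
	f, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return traceParent{}, err
	}
	// The lowest flag bit marks the trace as sampled.
	return traceParent{TraceID: traceID, SpanID: spanID, Sampled: f&1 == 1}, nil
}

func main() {
	tp, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Printf("%+v\n", tp)
}
```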
Frontend access
- Logs ingest: POST /api/otel/logs (OTLP LogRecords).
- Traces: /api/otel/traces (trace_id/span_name/since/until/limit/query).
- Metrics: /api/otel/metrics (name/since/until/limit/query).
- Log stream: /api/logs/stream (SSE, OTLP LogRecords) with a last-hour replay on connect.
Log retention and replay
- Collector writes otel.json; Gestalt rotates it by size/age/count limits.
- Retrieval is via /api/logs/stream (last-hour replay from LogHub) and local files.
Migration plan (high level)
- Phase 1: run OTel in parallel with existing logging and event bus.
- Phase 2: wire OTel to all API endpoints and key events.
- Phase 3: swap frontend log source to OTel.
- Phase 4: remove internal/logging and internal/metrics usage.
Testing strategy
- Unit tests for log and event mapping to OTel attributes.
- Integration tests for OTLP exporter wiring (mock OTLP endpoint).
- End-to-end tests for HTTP span creation and error tagging.
Open questions
- Final storage format for fileexporter (JSON or OTLP/Protobuf)
- Retention policy for .gestalt/otel/ data
- Whether to expose Collector port to LAN or keep localhost only