Observability is a first-class architectural concern in distributed systems. Unlike traditional monitoring, which focuses on predefined metrics, observability provides the ability to understand system behavior by examining its outputs—logs, metrics, and traces. This document explains the five core architectural patterns that make Tuturuuu deeply observable.
Observability vs Monitoring: Monitoring tells you when something is wrong. Observability tells you why it’s wrong and how to fix it. Our architecture provides both.
1. Distributed Tracing Provides Full Request Visibility
Architectural Choice
The architecture implements distributed tracing across all microservices, allowing a single user request to be tracked as it flows through multiple services, providing complete end-to-end visibility of request execution.
Impact and Justification
In a microservices architecture, a single user action (like “Create Order”) can trigger a cascade of operations across dozens of services. Without distributed tracing, understanding the complete execution path is nearly impossible. Each service’s logs exist in isolation, making it extremely difficult to correlate related operations.
Distributed tracing solves this by assigning each request a unique trace ID that flows through all services involved in handling that request. Each service records spans (units of work) that are linked together by the trace ID, creating a complete map of the request’s journey through the system.
This enables powerful debugging capabilities: When a user reports a slow checkout, we can search for their trace ID and see exactly which service or database query caused the delay, how long each step took, and where failures occurred.
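Under the hood, that propagation relies on the W3C trace-context (traceparent) header. The following is a minimal sketch using the OpenTelemetry JavaScript API; the helper names are illustrative, and in practice tooling such as @vercel/otel typically injects and extracts these headers automatically for outgoing requests:
import { context, propagation } from '@opentelemetry/api';
// Outgoing call: inject the active trace context into the request headers.
// The carrier gains a W3C `traceparent` header like `00-<traceId>-<spanId>-01`.
export async function callDownstream(url: string) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return fetch(url, { headers });
}
// Receiving service: extract the incoming context so spans created inside
// the handler join the caller's trace instead of starting a new one.
export function withIncomingTrace<T>(req: Request, handler: () => T): T {
  const extracted = propagation.extract(context.active(), Object.fromEntries(req.headers));
  return context.with(extracted, handler);
}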
In Tuturuuu - Distributed Tracing Architecture:
// Next.js instrumentation for OpenTelemetry
// apps/web/instrumentation.ts
import { registerOTel } from '@vercel/otel';
export function register() {
registerOTel({
serviceName: 'tuturuuu-web',
// Traces automatically propagate across services
});
}
// Automatic trace propagation in API calls
export async function createWorkspace(data: WorkspaceData) {
// Trace context automatically included in headers
const trace = getActiveTrace();
// Start a new span for this operation
return await trace.span('workspace.create', async (span) => {
span.setAttribute('workspace.name', data.name);
span.setAttribute('user.id', data.ownerId);
// 1. Create workspace in database
const workspace = await createWorkspaceInDb(data);
span.addEvent('workspace.created', { id: workspace.id });
// 2. Trigger background jobs
await trigger.event({
name: 'workspace.created',
payload: { workspaceId: workspace.id },
// Trace ID automatically propagated to background job
});
span.addEvent('events.triggered');
return workspace;
});
}
// Background job continues the trace
client.defineJob({
id: 'workspace-setup',
name: 'Setup new workspace',
version: '1.0.0',
trigger: eventTrigger({ name: 'workspace.created' }),
run: async (payload, io) => {
// Trace context automatically available
const span = io.getSpan();
span.addEvent('workspace.setup.started');
// All operations recorded in the same trace
await io.sendEmail({ ... });
await io.createDefaultResources({ ... });
span.addEvent('workspace.setup.completed');
}
});
Querying traces for debugging:
// Find slow requests
const slowTraces = await tracing.query({
service: 'tuturuuu-web',
operation: 'workspace.create',
minDuration: 5000, // Slower than 5 seconds
timeRange: 'last-24h'
});
// Analyze a specific failed request
const trace = await tracing.getTrace('trace-abc-123');
console.log(trace.spans.map(span => ({
service: span.serviceName,
operation: span.name,
duration: span.duration,
error: span.error,
// See complete execution path
})));
Clarifying Additions
Architects gain insight across the entire flow of a request. Instead of examining logs from individual services in isolation, distributed tracing provides a unified view of how a request moves through the system.
Bottlenecks become easy to identify. When performance degrades, trace visualization immediately reveals which service, function, or database query is responsible for the slowdown.
This supports informed performance decisions. With complete visibility into request execution, optimization efforts can be targeted precisely at the actual bottlenecks rather than based on guesswork.
2. Centralized Logging Increases Diagnosability
Architectural Choice
All services emit structured logs to a centralized logging system where they can be aggregated, searched, and correlated across service boundaries.
Impact and Justification
In a distributed system with dozens of services, each potentially running multiple instances, managing logs becomes a critical challenge. If each service writes logs to its local filesystem, diagnosing issues requires:
- Knowing which service(s) to examine
- Finding the specific instance that handled the request
- SSH-ing into individual servers
- Manually correlating timestamps across different log files
This is completely impractical at scale. Centralized logging solves this by streaming all logs to a unified platform (such as Vercel Logs, Datadog, or an ELK stack) where they can be:
- Searched across all services simultaneously
- Filtered by trace ID, user ID, workspace ID, error type, etc.
- Correlated with metrics and traces
- Alerted on specific patterns
In Tuturuuu - Structured Logging:
// Structured logging with context
import { logger } from '@tuturuuu/logging';
export async function processPayment(
workspaceId: string,
amount: number,
userId: string
) {
// Create contextual logger
const log = logger.child({
workspaceId,
userId,
operation: 'payment.process',
// Context automatically included in all log entries
});
log.info('Payment processing started', { amount });
try {
const result = await chargeCustomer(amount);
log.info('Payment successful', {
transactionId: result.id,
amount: result.amount,
status: result.status
});
return result;
} catch (error) {
log.error('Payment failed', {
error: error.message,
errorCode: error.code,
amount,
// Structured data for easy querying
});
throw error;
}
}
Log query examples:
// Find all errors for a specific user
logs.query({
level: 'error',
userId: 'user-123',
timeRange: 'last-24h'
});
// Find all failed payments in a workspace
logs.query({
operation: 'payment.process',
level: 'error',
workspaceId: 'ws-456',
timeRange: 'last-7d'
});
// Correlate logs with trace
logs.query({
traceId: 'trace-abc-123',
// See all log entries from all services involved in this request
});
Log aggregation architecture:
// All services use consistent log format
export interface LogEntry {
timestamp: string;
level: 'debug' | 'info' | 'warn' | 'error';
service: string;
traceId?: string;
spanId?: string;
userId?: string;
workspaceId?: string;
operation: string;
message: string;
metadata: Record<string, unknown>;
}
// Logs automatically shipped to centralized platform
// Vercel automatically collects Next.js logs
// Custom services can use winston/pino with transport
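As a hedged sketch of that last point, a pino logger can be configured so every entry carries the fields of the shared LogEntry interface; the SERVICE_NAME environment variable and the bound context fields are assumptions for illustration:
import pino from 'pino';

// Base logger: fields bound here appear in every entry, mirroring LogEntry
export const logger = pino({
  level: 'info',
  base: {
    service: process.env.SERVICE_NAME,
  },
  // ISO-8601 timestamps and string level labels to match the shared format
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }),
  },
});

// Request-scoped child logger carries traceId / userId / workspaceId
const requestLog = logger.child({
  traceId: 'trace-abc-123',
  userId: 'user-123',
  workspaceId: 'ws-456',
});
requestLog.info({ operation: 'payment.process' }, 'Payment processing started');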
Clarifying Additions
Logs from all services become searchable in one place. Engineers can investigate issues without needing to know which specific service or instance to examine—they can search globally across the entire system.
Patterns emerge clearly across system boundaries. Centralized logging makes it easy to spot trends like “error rates spiking across multiple services” or “all finance-related operations failing,” which would be invisible when examining services in isolation.
This simplifies incident analysis. During outages, teams can quickly query logs to understand what went wrong, which users were affected, and how to remediate the issue.
3. Per-Service Metrics Enable Fine-Grained Monitoring
Architectural Choice
Each microservice exposes its own health metrics (request rates, error rates, latency percentiles, resource utilization) that can be collected, visualized, and alerted on independently.
Impact and Justification
Coarse-grained, application-level metrics (like “total requests per second”) provide limited value in a microservices architecture. When an alert fires for “high error rate,” we need to immediately know which service is failing, which specific operations are affected, and which resources are saturated.
Per-service metrics provide this granularity. Each service exposes detailed metrics about its own health and performance, allowing us to:
- Detect failures quickly at the service level
- Scale specific services based on their individual load
- Set targeted alerts for critical operations
- Understand resource utilization per service for cost optimization
In Tuturuuu - Service Metrics:
// Each service exposes Prometheus-style metrics
import { metrics } from '@tuturuuu/observability';
// Counter: Incrementing values (requests, errors)
const requestCounter = metrics.counter({
name: 'http_requests_total',
description: 'Total HTTP requests',
labels: ['service', 'method', 'route', 'status']
});
// Histogram: Distribution of values (latency, response size)
const latencyHistogram = metrics.histogram({
name: 'http_request_duration_seconds',
description: 'HTTP request latency',
labels: ['service', 'method', 'route'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5] // Bucket boundaries in seconds
});
// Gauge: Current value (active connections, queue depth)
const activeConnectionsGauge = metrics.gauge({
name: 'active_connections',
description: 'Currently active connections',
labels: ['service']
});
// Instrument application code
export async function handleRequest(
request: Request,
service: string
) {
const start = Date.now();
try {
// Track active connections; the request itself is counted exactly once,
// with its final status label, in the success and error paths below
activeConnectionsGauge.inc({ service });
const response = await processRequest(request);
// Record success
requestCounter.inc({
service,
method: request.method,
route: request.url,
status: response.status
});
return response;
} catch (error) {
// Record error
requestCounter.inc({
service,
method: request.method,
route: request.url,
status: 500
});
throw error;
} finally {
// Record latency
const duration = (Date.now() - start) / 1000;
latencyHistogram.observe({
service,
method: request.method,
route: request.url
}, duration);
// Decrease active connections
activeConnectionsGauge.dec({ service });
}
}
Metric visualization and alerting:
// Query metrics for monitoring dashboard
const dashboardMetrics = {
// Request rate per service
requestRate: query('sum by (service) (rate(http_requests_total[5m]))'),
// Error rate per service (share of requests returning 5xx)
errorRate: query('sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m]))'),
// P95 latency per service, computed from histogram buckets
p95Latency: query('histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))'),
// Resource utilization
cpuUsage: query('sum by (service) (rate(container_cpu_usage_seconds_total[5m]))'),
memoryUsage: query('sum by (service) (container_memory_usage_bytes)'),
};
// Alert on anomalies
const alerts = [
{
name: 'HighErrorRate',
condition: 'error_rate > 0.05', // 5% error rate
for: '5m',
severity: 'critical',
notify: ['oncall-team']
},
{
name: 'HighLatency',
condition: 'p95_latency > 2', // 2 seconds
for: '10m',
severity: 'warning',
notify: ['engineering-team']
}
];
Business metrics:
// Custom business metrics for product insights
const workspaceCreations = metrics.counter({
name: 'workspaces_created_total',
description: 'Total workspaces created',
labels: ['plan']
});
const activeUsers = metrics.gauge({
name: 'active_users',
description: 'Currently active users',
labels: ['workspaceId']
});
const aiTokensUsed = metrics.counter({
name: 'ai_tokens_used_total',
description: 'AI tokens consumed',
labels: ['model', 'workspace_id']
});
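A brief usage sketch follows; it assumes the counter and gauge objects accept an optional amount or value after the labels, in the style of prom-client, which is not shown in the snippets above:
// Record a new workspace on a given plan
workspaceCreations.inc({ plan: 'pro' });

// Record AI usage; assumes inc() accepts an amount after the labels
aiTokensUsed.inc({ model: 'model-name', workspace_id: 'ws-456' }, 1250);

// Report the current number of active users in a workspace;
// assumes the gauge exposes a set() method
activeUsers.set({ workspaceId: 'ws-456' }, 42);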
Clarifying Additions
Each service exposes its own health signals. Instead of relying on external monitoring that only observes symptoms, services self-report their internal state, providing deep insight into their operation.
Issues are detected quickly and precisely. Alerts can be configured for specific services and operations, ensuring the right team is notified immediately when their service degrades.
This helps maintain consistent performance. Continuous monitoring of per-service metrics allows teams to proactively identify and resolve performance issues before they impact users.
4. Health Checks Enable Automated Healing
Architectural Choice
Every service exposes standardized health check endpoints that report the service’s current state. Orchestration platforms continuously probe these endpoints and automatically restart or replace unhealthy instances.
Impact and Justification
In traditional monolithic deployments, when a service becomes unhealthy (due to memory leaks, deadlocks, or resource exhaustion), manual intervention is often required to restart the application. This leads to extended downtime and requires on-call engineers to be available 24/7.
Automated health checks combined with orchestration platforms (like Kubernetes, Vercel, or AWS ECS) create a self-healing system. When a service instance fails its health check, the orchestrator automatically:
- Stops routing new requests to the unhealthy instance
- Starts a replacement instance
- Terminates the unhealthy instance once the replacement is ready
This automation dramatically improves system availability and reduces operational burden.
In Tuturuuu - Health Check Implementation:
// Standardized health check endpoint
// apps/web/src/app/api/health/route.ts
export async function GET() {
const checks = await Promise.allSettled([
checkDatabaseConnection(),
checkSupabaseConnection(),
checkTriggerConnection(),
checkMemoryUsage(),
checkDiskSpace(),
]);
// Each check catches its own errors and reports a status, so inspect the
// resolved value as well as the promise outcome
const checkStatus = (check: PromiseSettledResult<{ status: string }>) =>
check.status === 'fulfilled' ? check.value.status : 'failed';
const isHealthy = checks.every((check) => checkStatus(check) !== 'failed');
const status = isHealthy ? 200 : 503;
return new Response(JSON.stringify({
status: isHealthy ? 'healthy' : 'unhealthy',
timestamp: new Date().toISOString(),
checks: {
database: checkStatus(checks[0]),
supabase: checkStatus(checks[1]),
trigger: checkStatus(checks[2]),
memory: checkStatus(checks[3]),
disk: checkStatus(checks[4]),
}
}), {
status,
headers: { 'Content-Type': 'application/json' }
});
}
// Detailed dependency checks
async function checkDatabaseConnection() {
try {
await database.query('SELECT 1');
return { status: 'ok' };
} catch (error) {
return { status: 'failed', error: error.message };
}
}
async function checkMemoryUsage() {
const used = process.memoryUsage();
const heapUsedPercent = (used.heapUsed / used.heapTotal) * 100;
if (heapUsedPercent > 90) {
return { status: 'degraded', heapUsedPercent };
}
return { status: 'ok', heapUsedPercent };
}
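The other dependency checks follow the same shape. As one hedged example, a Supabase connectivity check could issue the cheapest possible query through the client; the environment variable names and the workspaces table are assumptions:
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

async function checkSupabaseConnection() {
  try {
    // Minimal round trip: select a single row from a known table
    const { error } = await supabase.from('workspaces').select('id').limit(1);
    if (error) {
      return { status: 'failed', error: error.message };
    }
    return { status: 'ok' };
  } catch (error) {
    return { status: 'failed', error: (error as Error).message };
  }
}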
Liveness vs Readiness Checks:
// Liveness: Is the service running?
// If this fails, restart the container
export async function GET(_req: Request, { params }: { params: { type: string } }) {
if (params.type === 'liveness') {
// Simple check: can we respond?
return new Response(JSON.stringify({ alive: true }), { status: 200 });
}
// Readiness: Is the service ready to handle traffic?
// If this fails, remove from load balancer
if (params.type === 'readiness') {
const canHandleTraffic = await checkAllDependencies();
return new Response(
JSON.stringify({ ready: canHandleTraffic }),
{ status: canHandleTraffic ? 200 : 503 }
);
}
// Unknown check type
return new Response(JSON.stringify({ error: 'unknown check type' }), { status: 404 });
}
Orchestrator configuration (conceptual):
# Vercel automatically handles health checks
# For custom deployments (Kubernetes):
apiVersion: v1
kind: Pod
spec:
containers:
- name: tuturuuu-web
image: tuturuuu/web:latest
livenessProbe:
httpGet:
path: /api/health/liveness
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/health/readiness
port: 3000
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 2
Clarifying Additions
Unhealthy components recover automatically. Instead of requiring manual intervention, the system detects and replaces failing instances without human involvement.
The system maintains functionality without manual effort. Self-healing capabilities ensure that transient failures (like temporary resource exhaustion) are resolved automatically, minimizing downtime.
This reduces operational load significantly. On-call engineers are freed from responding to routine failures that the system can handle itself, allowing them to focus on more complex issues.
5. Event Streams Provide Historical Insights
Architectural Choice
The event-driven architecture preserves a complete, immutable history of all significant system events, enabling powerful historical analysis, debugging, and system recovery capabilities.
Impact and Justification
Traditional request-response architectures are ephemeral—once a request completes, the only record is what was explicitly written to logs. This makes it extremely difficult to:
- Understand how the system arrived at its current state
- Reproduce bugs that occurred in production
- Analyze trends and patterns over time
- Recover from data corruption by replaying events
The event sourcing pattern, enabled by our event-driven architecture, preserves the complete history of state changes. Instead of storing only the current state, we store every event that led to that state. This provides:
- Perfect auditability: Complete record of what happened and when
- Time travel debugging: Reconstruct system state at any point in history
- Event replay: Reproduce production bugs by replaying event sequences
- Trend analysis: Analyze patterns over weeks or months
In Tuturuuu - Event Stream Architecture:
// All events persisted in ordered stream
export async function publishEvent(event: DomainEvent) {
await eventStore.append({
id: uuid(),
type: event.type,
aggregate: event.aggregateId, // e.g., workspace-123
version: event.version,
timestamp: new Date(),
payload: event.payload,
metadata: {
userId: event.userId,
traceId: getCurrentTraceId(),
causationId: event.causationId, // What caused this event
correlationId: event.correlationId, // Related events
}
});
}
// Query event history
export async function getWorkspaceHistory(workspaceId: string) {
return await eventStore.query({
aggregate: workspaceId,
orderBy: 'timestamp',
// Returns complete history of workspace changes
});
}
// Example output:
const history = [
{ type: 'workspace.created', timestamp: '2024-01-01T00:00:00Z', data: {...} },
{ type: 'member.added', timestamp: '2024-01-01T00:05:00Z', data: {...} },
{ type: 'member.role.changed', timestamp: '2024-01-01T00:10:00Z', data: {...} },
{ type: 'settings.updated', timestamp: '2024-01-01T00:15:00Z', data: {...} },
// Complete, ordered history of all changes
];
Event replay for debugging:
// Reproduce a production bug locally
export async function replayEvents(aggregateId: string, untilTimestamp?: Date) {
const events = await eventStore.query({
aggregate: aggregateId,
until: untilTimestamp
});
// Start with empty state
let state = createEmptyState();
// Replay each event to reconstruct state
for (const event of events) {
state = applyEvent(state, event);
}
// Now we have exact state at that point in time
return state;
}
// Example: Reproduce the state of workspace when bug occurred
const bugState = await replayEvents('workspace-123', new Date('2024-11-15T14:30:00Z'));
// Can now debug why the bug happened
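For completeness, a minimal sketch of the applyEvent reducer referenced above; the state shape and handled event types are assumptions chosen to match the workspace history example earlier in this section:
interface WorkspaceState {
  id?: string;
  name?: string;
  members: Record<string, { role: string }>;
  settings: Record<string, unknown>;
}

export function createEmptyState(): WorkspaceState {
  return { members: {}, settings: {} };
}

export function applyEvent(state: WorkspaceState, event: DomainEvent): WorkspaceState {
  switch (event.type) {
    case 'workspace.created':
      return { ...state, id: event.aggregateId, name: event.payload.name };
    case 'member.added':
    case 'member.role.changed':
      return {
        ...state,
        members: { ...state.members, [event.payload.userId]: { role: event.payload.role } },
      };
    case 'settings.updated':
      return { ...state, settings: { ...state.settings, ...event.payload } };
    default:
      // Unknown event types are ignored so older streams replay under newer code
      return state;
  }
}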
Historical trend analysis:
// Analyze patterns over time
export async function analyzeUserBehavior(userId: string, days: number) {
const events = await eventStore.query({
userId,
from: Date.now() - (days * 86400000),
});
// Analyze patterns
const patterns = {
mostActiveHours: calculateActiveHours(events),
commonWorkflows: identifyWorkflows(events),
errorPatterns: findErrorSequences(events),
featureUsage: countFeatureUsage(events),
};
return patterns;
}
System recovery through event replay:
// Recover from data corruption
export async function rebuildProjection(projectionName: string) {
// Projections are derived from events
// If projection is corrupted, rebuild from event stream
await clearProjection(projectionName);
const events = await eventStore.query({ all: true });
for (const event of events) {
await updateProjection(projectionName, event);
}
// Projection fully rebuilt from source of truth (events)
}
Clarifying Additions
Long-term patterns reveal recurring issues. By analyzing months or years of events, teams can identify systemic problems that aren’t visible in short-term logs or metrics.
Debugging becomes easier through historical replay. Instead of trying to guess what state the system was in when a bug occurred, engineers can precisely reconstruct that state by replaying the event stream.
This strengthens reliability over time. The ability to analyze historical patterns and replay events enables continuous improvement, as teams can identify and fix the root causes of recurring issues.
Observability Architecture Summary
| Capability | Technology | What It Provides | Use Case Example |
|---|---|---|---|
| Distributed Tracing | OpenTelemetry, Vercel | End-to-end request visibility | “Why is checkout slow for this user?” |
| Centralized Logging | Vercel Logs, Datadog | Unified log search and correlation | “Find all errors in last 24h across all services” |
| Service Metrics | Prometheus, Vercel Analytics | Granular health and performance data | “Which service is causing high error rate?” |
| Health Checks | HTTP endpoints + orchestration | Self-healing and automated recovery | “Replace unhealthy instances automatically” |
| Event Streams | Trigger.dev, Event Store | Historical analysis and replay | “Reconstruct workspace state from 2 weeks ago” |
Observability Best Practices in Tuturuuu
1. Instrument Everything
- Add trace spans to all critical operations
- Log structured data, not strings
- Expose metrics for all business operations
- Capture events for all state changes
2. Standardize Observability Data
- Consistent log format across services
- Standard metric naming conventions
- Trace context propagation everywhere
- Unified event schema
3. Make Observability Queryable
- Centralize all observability data
- Index logs by key dimensions (userId, workspaceId, traceId)
- Build dashboards for common queries
- Enable ad-hoc exploration
4. Alert on Symptoms, Not Causes
- Alert when users are impacted (high error rate, high latency)
- Use observability data to diagnose root causes
- Avoid alert fatigue with proper thresholds
5. Use Observability for Business Insights
- Track feature adoption through events
- Monitor business KPIs through metrics
- Analyze user behavior through event streams
- Measure product performance, not just system performance
Observability Stack in Tuturuuu
// Complete observability configuration
export const observability = {
// Distributed tracing
tracing: {
provider: 'OpenTelemetry',
backend: 'Vercel',
sampleRate: 1.0, // 100% sampling initially
exporters: ['vercel', 'console']
},
// Centralized logging
logging: {
provider: 'Vercel Logs',
level: 'info',
format: 'json',
includeContext: true, // traceId, userId, workspaceId
},
// Metrics collection
metrics: {
provider: 'Vercel Analytics',
exportInterval: 60000, // 1 minute
defaultLabels: {
service: process.env.SERVICE_NAME,
environment: process.env.NODE_ENV,
}
},
// Health checks
healthChecks: {
liveness: '/api/health/liveness',
readiness: '/api/health/readiness',
interval: 10000, // 10 seconds
},
// Event streaming
events: {
provider: 'Trigger.dev',
retention: '90d',
enableReplay: true,
}
};