Monitoring and Observability
Comprehensive monitoring enables proactive incident response and data-driven optimization. YeboLearn maintains 99.9% uptime through robust observability practices.
Monitoring Philosophy
Three Pillars of Observability
1. Logs (What happened)
- Structured event records
- Debugging and auditing
- Historical analysis
2. Metrics (How much/many)
- Time-series numerical data
- Performance trends
- Alerting thresholds
3. Traces (Request flow)
- End-to-end request tracking
- Latency breakdown
- Dependency mapping
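As an illustration, a single request can surface through all three pillars at once. The sketch below builds the log record, metric increment, and trace span for one hypothetical request; the field names are examples, not the platform's actual schema:

```typescript
// Illustrative only: one request observed through all three pillars.
interface Observation {
  log: { level: string; message: string; requestId: string };
  metric: { name: string; labels: Record<string, string>; delta: number };
  span: { name: string; requestId: string; durationMs: number };
}

function observeRequest(
  requestId: string,
  route: string,
  durationMs: number,
  status: number
): Observation {
  return {
    // Log: what happened, with context for debugging
    log: {
      level: status >= 500 ? "error" : "info",
      message: `Handled ${route}`,
      requestId,
    },
    // Metric: a number suitable for aggregation and alerting
    metric: {
      name: "http_requests_total",
      labels: { route, status: String(status) },
      delta: 1,
    },
    // Trace span: where the time went for this specific request
    span: { name: route, requestId, durationMs },
  };
}
```

The shared `requestId` is what ties the three views together when debugging a specific incident.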
Monitoring Goals
Proactive Over Reactive:
- Detect issues before users report them
- Alert on trends, not just failures
- Prevent incidents through early warnings
Actionable Over Comprehensive:
- Monitor what matters
- Every alert must be actionable
- Reduce noise, increase signal
Fast Mean Time to Detection (MTTD):
- Target: <2 minutes
- Current: 2 minutes
- Real-time monitoring and alerting
Fast Mean Time to Resolution (MTTR):
- Target: <1 hour
- Current: 25 minutes
- Quick access to relevant data
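MTTD and MTTR are straightforward to derive from incident timestamps. A minimal sketch, assuming a hypothetical Incident record with start, detection, and resolution times (not the actual incident-tracker schema):

```typescript
// All timestamps in milliseconds since epoch.
interface Incident {
  startedAt: number;  // when the fault began
  detectedAt: number; // when an alert fired or it was noticed
  resolvedAt: number; // when service was restored
}

function meanMinutes(incidents: Incident[], pick: (i: Incident) => number): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce((sum, i) => sum + pick(i), 0);
  return totalMs / incidents.length / 60_000;
}

// MTTD: fault start -> detection; MTTR here: detection -> resolution
const mttd = (xs: Incident[]) => meanMinutes(xs, i => i.detectedAt - i.startedAt);
const mttr = (xs: Incident[]) => meanMinutes(xs, i => i.resolvedAt - i.detectedAt);
```

Note that some teams measure MTTR from fault start rather than detection; pick one definition and keep it consistent across reports.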
Monitoring Stack
Infrastructure Monitoring
Google Cloud Monitoring:
Platform Metrics:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O
- Network traffic
Cloud Run:
- Container instances
- Request count
- Request latency
- Cold starts
- Error rate
Cloud SQL:
- CPU/Memory usage
- Connections (active/max)
- Query performance
- Replication lag (HA mode)
- Storage usage
Dashboard Example:
Infrastructure Health Dashboard
├─ Cloud Run
│ ├─ Active instances: 2 (avg), 8 (max)
│ ├─ CPU: 35% (avg), 78% (peak)
│ ├─ Memory: 68% (avg), 85% (peak)
│ └─ Cold starts: <1% of requests
├─ Cloud SQL
│ ├─ CPU: 45% (avg), 82% (peak)
│ ├─ Memory: 62%
│ ├─ Connections: 12/25
│ ├─ Query time: 35ms (avg)
│ └─ Storage: 38GB/100GB
└─ Network
├─ Ingress: 45 MB/s (avg)
├─ Egress: 32 MB/s (avg)
└─ Latency: 12ms (avg)
Application Monitoring
Custom Metrics (Prometheus):
// Metrics instrumentation
import { Counter, Histogram, Gauge } from 'prom-client';
// Request counter
export const httpRequests = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status'],
});
// Request duration
export const httpDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.5, 1, 2, 5],
});
// Active users
export const activeUsers = new Gauge({
name: 'active_users_total',
help: 'Currently active users',
});
// Usage in middleware
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
httpRequests.inc({
method: req.method,
route: req.route?.path || 'unknown',
status: res.statusCode,
});
httpDuration.observe(
{
method: req.method,
route: req.route?.path || 'unknown',
status: res.statusCode,
},
duration
);
});
next();
});
Business Metrics:
// Quiz completion tracking
export const quizCompletions = new Counter({
name: 'quiz_completions_total',
help: 'Total quiz completions',
labelNames: ['subject', 'difficulty'],
});
// AI feature usage
export const aiFeatureUsage = new Counter({
name: 'ai_feature_usage_total',
help: 'AI feature usage count',
labelNames: ['feature'], // quiz_gen, essay_grade, etc
});
// Payment transactions
export const paymentTransactions = new Counter({
name: 'payment_transactions_total',
help: 'Payment transactions',
labelNames: ['provider', 'status'], // mpesa/stripe, success/failed
});
// Usage
quizCompletions.inc({ subject: 'mathematics', difficulty: 'medium' });
aiFeatureUsage.inc({ feature: 'quiz_generation' });
paymentTransactions.inc({ provider: 'mpesa', status: 'success' });
Error Tracking
Sentry Integration:
// Initialize Sentry
import * as Sentry from '@sentry/node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
tracesSampleRate: 0.1, // 10% of requests
integrations: [
new Sentry.Integrations.Http({ tracing: true }),
new Sentry.Integrations.Express({ app }),
],
});
// Capture errors
app.use(Sentry.Handlers.requestHandler());
app.use(Sentry.Handlers.tracingHandler());
// Error handler
app.use((err, req, res, next) => {
// Log to Sentry
Sentry.captureException(err, {
tags: {
route: req.route?.path,
method: req.method,
},
user: {
id: req.user?.id,
email: req.user?.email,
},
extra: {
body: req.body,
params: req.params,
},
});
// Send response
res.status(500).json({ error: 'Internal server error' });
});
Error Categories:
Sentry Dashboard Organization:
├─ By Environment
│ ├─ Production (high priority)
│ ├─ Staging (medium priority)
│ └─ Development (low priority)
├─ By Severity
│ ├─ Fatal (immediate attention)
│ ├─ Error (high priority)
│ ├─ Warning (monitor)
│ └─ Info (log only)
└─ By Component
├─ API errors
├─ Database errors
├─ AI integration errors
├─ Payment errors
└─ Frontend errorsPerformance Monitoring
Real User Monitoring (RUM):
// Frontend performance tracking
export function trackPagePerformance() {
if (typeof window === 'undefined') return;
window.addEventListener('load', () => {
const perfData = window.performance.timing;
const pageLoadTime = perfData.loadEventEnd - perfData.navigationStart;
const domReadyTime = perfData.domContentLoadedEventEnd - perfData.navigationStart;
const ttfb = perfData.responseStart - perfData.requestStart;
// Send to analytics
analytics.track('page_performance', {
page: window.location.pathname,
loadTime: pageLoadTime,
domReady: domReadyTime,
ttfb,
connection: navigator.connection?.effectiveType,
deviceMemory: navigator.deviceMemory,
});
// Alert if slow
if (pageLoadTime > 3000) {
console.warn('Slow page load:', pageLoadTime);
}
});
}
// Core Web Vitals
import { getCLS, getFID, getLCP } from 'web-vitals';
function sendToAnalytics(metric) {
analytics.track('web_vital', {
name: metric.name,
value: metric.value,
rating: metric.rating,
page: window.location.pathname,
});
}
getCLS(sendToAnalytics);
getFID(sendToAnalytics);
getLCP(sendToAnalytics);
API Performance Monitoring:
// Track slow database queries
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient({
log: [
{
emit: 'event',
level: 'query',
},
],
});
prisma.$on('query', (e) => {
if (e.duration > 100) {
// Log slow queries (>100ms)
logger.warn('Slow query detected', {
query: e.query,
duration: e.duration,
params: e.params,
});
// Track metric
slowQueries.inc({
model: extractModel(e.query),
});
}
});
// Track AI API latency
export async function callGeminiAPI(prompt: string) {
const start = Date.now();
try {
const response = await geminiClient.generateContent(prompt);
const duration = Date.now() - start;
// Track metric
aiApiDuration.observe({ status: 'success' }, duration / 1000);
return response;
} catch (error) {
const duration = Date.now() - start;
aiApiDuration.observe({ status: 'error' }, duration / 1000);
throw error;
}
}
Log Aggregation
Structured Logging:
// Winston logger configuration
import winston from 'winston';
export const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'yebolearn-api',
environment: process.env.NODE_ENV,
},
transports: [
// Use one transport per environment; enabling both would write every
// log line to stdout twice
process.env.NODE_ENV === 'production'
? // JSON to stdout in production; Cloud Logging ingests and indexes it
new winston.transports.Stream({
stream: process.stdout,
format: winston.format.json(),
})
: // Colorized, human-readable console output for local development
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
),
}),
],
});
// Usage with context
logger.info('Quiz completed', {
userId: 'user-123',
quizId: 'quiz-456',
score: 85,
duration: 1200,
});
logger.error('Payment failed', {
userId: 'user-123',
amount: 500,
provider: 'mpesa',
error: 'Insufficient funds',
transactionId: 'txn-789',
});
logger.warn('High API latency', {
endpoint: '/api/student/dashboard',
duration: 850,
threshold: 500,
});
Log Levels:
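The per-environment behavior below reduces to a numeric threshold check. A sketch using winston's npm level ordering (error=0 through debug=5), which the logger configuration above relies on:

```typescript
// winston's npm levels: lower number = higher severity.
const LEVELS: Record<string, number> = {
  error: 0,
  warn: 1,
  info: 2,
  http: 3,
  verbose: 4,
  debug: 5,
};

// A message is emitted only if its level is at or above the severity
// of the configured LOG_LEVEL (i.e. its number is <= the threshold's).
function shouldLog(messageLevel: string, configuredLevel: string): boolean {
  return LEVELS[messageLevel] <= LEVELS[configuredLevel];
}
```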
Production (LOG_LEVEL=warn):
ERROR: Critical errors, failures
WARN: Potential issues, degraded performance
(INFO and DEBUG disabled in production)
Staging (LOG_LEVEL=info):
ERROR: Critical errors
WARN: Warnings
INFO: Important events (user actions, API calls)
(DEBUG disabled)
Development (LOG_LEVEL=debug):
ERROR: All errors
WARN: All warnings
INFO: All significant events
DEBUG: Detailed debugging information
Distributed Tracing
Google Cloud Trace:
// Trace API requests
import * as traceAgent from '@google-cloud/trace-agent';
// Initialize (must run before the modules you want auto-instrumented)
const tracer = traceAgent.start({
projectId: 'yebolearn-prod',
samplingRate: 10, // at most 10 traced requests per second (not a percentage)
});
// HTTP requests are traced automatically
// Manual spans for specific operations:
export async function generateQuiz(topic: string) {
const span = tracer.createChildSpan({ name: 'generateQuiz' });
try {
const prompt = buildQuizPrompt(topic); // helper that turns the topic into a Gemini prompt
// Call Gemini API
const quizSpan = tracer.createChildSpan({ name: 'gemini-api-call' });
const quiz = await geminiClient.generateContent(prompt);
quizSpan.endSpan();
// Save to database
const dbSpan = tracer.createChildSpan({ name: 'save-quiz' });
await db.quiz.create({ data: quiz });
dbSpan.endSpan();
return quiz;
} finally {
span.endSpan();
}
}
Trace Analysis:
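Insights like the one in the trace example below ("database queries taking 260ms (58%)") fall out of summing child-span durations. A sketch with illustrative span names:

```typescript
interface Span {
  name: string;
  durationMs: number;
}

// Sum the durations of spans matching a predicate and express them
// as a share of the request's total time.
function shareOfTotal(
  spans: Span[],
  totalMs: number,
  predicate: (s: Span) => boolean
): { ms: number; pct: number } {
  const ms = spans.filter(predicate).reduce((sum, s) => sum + s.durationMs, 0);
  return { ms, pct: Math.round((ms / totalMs) * 100) };
}
```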
Request Trace Example:
GET /api/student/dashboard
Total: 450ms
├─ Authentication middleware: 25ms
├─ Fetch student data: 180ms
│ ├─ Database query: 145ms
│ └─ Cache lookup: 35ms
├─ Fetch enrollments: 120ms
│ └─ Database query: 115ms
├─ Calculate progress: 85ms
│ └─ Aggregation logic: 80ms
└─ Serialize response: 40ms
Insights:
- Database queries taking 260ms (58%)
- Opportunity: Add caching for enrollments
Dashboards
Executive Dashboard
High-Level Metrics (Grafana):
YeboLearn Platform Health
┌─────────────────────────────────────────┐
│ System Status │
│ ✓ All Systems Operational │
│ Uptime: 99.97% (30 days) │
│ Active Users: 2,340 │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Performance │
│ API Response Time: 145ms (p50) │
│ Page Load Time: 1.9s (avg) │
│ Error Rate: 0.3% │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Business Metrics (Today) │
│ Quiz Completions: 1,240 │
│ AI Features Used: 340 │
│ New Signups: 28 │
│ Revenue: $420 │
└─────────────────────────────────────────┘Engineering Dashboard
Detailed Technical Metrics:
API Performance
├─ Request Rate: 45 req/s (avg), 120 req/s (peak)
├─ Response Time: p50=145ms, p95=380ms, p99=820ms
├─ Error Rate: 0.3% (target: <1%)
└─ Top Endpoints by Latency:
1. /api/student/progress - 210ms
2. /api/analytics/dashboard - 280ms
3. /api/ai/generate-quiz - 8,500ms (AI feature)
Database Performance
├─ Query Time: 35ms (avg), 180ms (p95)
├─ Connections: 12/25 active
├─ Slow Queries (>100ms): 15/hour
├─ Cache Hit Rate: 94%
└─ Index Hit Rate: 98.5%
Infrastructure
├─ Cloud Run Instances: 2 (avg), 8 (max)
├─ CPU Usage: 35% (avg), 78% (peak)
├─ Memory Usage: 68% (avg), 85% (peak)
├─ Database CPU: 45% (avg), 82% (peak)
└─ Storage: 38GB/100GB (38%)
Error Tracking (Last 24 Hours)
├─ Total Errors: 45
├─ New Errors: 3
├─ Resolved Errors: 12
└─ Top Errors:
1. Database timeout - 12 occurrences
2. Gemini API rate limit - 8 occurrences
3. Invalid quiz submission - 6 occurrences
AI Features Dashboard
AI-Specific Metrics:
AI Feature Performance
├─ Quiz Generation
│ ├─ Requests: 180/day
│ ├─ Avg latency: 8s
│ ├─ Success rate: 99.2%
│ ├─ Cost: $0.15/request
│ └─ Quality score: 9.1/10
├─ Essay Grading
│ ├─ Requests: 45/day
│ ├─ Avg latency: 45s
│ ├─ Success rate: 98.5%
│ ├─ Cost: $0.12/request
│ └─ Teacher approval: 87%
└─ Content Recommendations
├─ Requests: 2,400/day
├─ Avg latency: 2s
├─ Click-through: 34%
└─ Cost: $0.02/request
Gemini API Usage
├─ Requests: 12,000/day
├─ Tokens In: 4.2M/day
├─ Tokens Out: 1.6M/day
├─ Total Cost: $6.50/day
├─ Rate Limits: 25/60 req/min (42%)
└─ Error Rate: 1.2%
Alerting
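Most of the alert rules in this section pair a threshold with a duration ("for N minutes"). The sketch below shows that evaluation logic in isolation; in practice the monitoring backend evaluates this server-side, so this is only the idea:

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// Fires only when every sample in the trailing window exceeds the threshold
// AND the window actually spans the required duration (so a single bad
// sample right after startup cannot trigger a "for 10 minutes" rule).
function alertFiring(
  samples: Sample[],
  threshold: number,
  forMs: number,
  nowMs: number
): boolean {
  const window = samples.filter(s => s.timestampMs >= nowMs - forMs);
  if (window.length === 0) return false;
  const oldest = Math.min(...window.map(s => s.timestampMs));
  // Tolerate a 1-minute scrape interval at the window's left edge
  const coversWindow = oldest <= nowMs - forMs + 60_000;
  return coversWindow && window.every(s => s.value > threshold);
}
```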
Alert Configuration
Critical Alerts (PagerDuty):
# API Down
- name: api_down
condition: uptime < 99% for 2 minutes
severity: critical
notify: pagerduty
escalation: immediate
# High Error Rate
- name: high_error_rate
condition: error_rate > 5% for 3 minutes
severity: critical
notify: pagerduty
escalation: after 5 minutes
# Database Down
- name: database_down
condition: db_connections = 0 for 1 minute
severity: critical
notify: pagerduty + cto
escalation: immediate
# Payment Processing Failed
- name: payment_failures
condition: payment_failure_rate > 10% for 2 minutes
severity: critical
notify: pagerduty + finance
escalation: after 10 minutes
Warning Alerts (Slack #engineering):
# Slow API Response
- name: slow_api
condition: p95_latency > 1s for 10 minutes
severity: warning
notify: slack
message: "API response time elevated: {{value}}ms"
# High Memory Usage
- name: high_memory
condition: memory_usage > 80% for 15 minutes
severity: warning
notify: slack
message: "Memory usage: {{value}}%"
# Increased Error Rate
- name: elevated_errors
condition: error_rate > 2% for 10 minutes
severity: warning
notify: slack
message: "Error rate elevated: {{value}}%"
# AI API Rate Limit Approaching
- name: gemini_rate_limit
condition: gemini_requests > 50/min for 5 minutes
severity: warning
notify: slack
message: "Approaching Gemini API rate limit: {{value}} req/min"
Alert Best Practices:
Effective Alerts:
✓ Actionable (team can fix)
✓ Specific (clear what's wrong)
✓ Timely (detect before users)
✓ Relevant (not noise)
Alert Fatigue Prevention:
✓ Group related alerts
✓ Deduplicate similar alerts
✓ Adjust thresholds based on patterns
✓ Auto-resolve when issue clears
✓ Review and tune monthly
On-Call Rotation
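Resolving who is on call from the four-week rotation listed below is simple date arithmetic. A sketch, with a made-up epoch date standing in for the real Week 1 start:

```typescript
const ROTATION = ["Sarah", "John", "Lisa", "Mark"];
// Hypothetical Monday on which a Week-1 (Sarah) rotation began.
const EPOCH_MS = Date.UTC(2024, 0, 1);

function onCallFor(dateMs: number): string {
  const weekMs = 7 * 24 * 60 * 60 * 1000;
  const weeks = Math.floor((dateMs - EPOCH_MS) / weekMs);
  // Double-modulo keeps the index positive for dates before the epoch
  return ROTATION[((weeks % 4) + 4) % 4];
}
```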
Schedule:
Weekly rotation:
- Week 1: Sarah
- Week 2: John
- Week 3: Lisa
- Week 4: Mark
On-call responsibilities:
- Respond to PagerDuty alerts (24/7)
- Triage and resolve P0/P1 incidents
- Escalate if needed
- Document incident in postmortem
- Handoff status to next on-call
On-call capacity:
- Protected from sprint commitments
- Focus on monitoring and incidents
- Handle urgent bugs and hotfixes
Uptime Tracking
External Monitoring
UptimeRobot Configuration:
Monitored Endpoints:
├─ https://api.yebolearn.app/health
│ ├─ Check interval: 1 minute
│ ├─ Timeout: 30 seconds
│ └─ Expected: 200 OK + "healthy" in response
├─ https://yebolearn.app
│ ├─ Check interval: 5 minutes
│ ├─ Timeout: 30 seconds
│ └─ Expected: 200 OK
└─ https://api.yebolearn.app/api/v1/status
├─ Check interval: 5 minutes
├─ Timeout: 10 seconds
└─ Expected: 200 OK + valid JSON
Notifications:
- Alert on: Down for 2 minutes
- Notify: PagerDuty + Slack
- Escalation: Email team lead after 10 minutes
Health Check Endpoint
// Comprehensive health check
export async function healthCheck(): Promise<HealthStatus> {
const checks = await Promise.allSettled([
checkDatabase(),
checkRedis(),
checkGeminiAPI(),
checkEmailService(),
checkPaymentGateway(),
]);
const results = {
database: getCheckResult(checks[0]),
redis: getCheckResult(checks[1]),
gemini: getCheckResult(checks[2]),
email: getCheckResult(checks[3]),
payment: getCheckResult(checks[4]),
};
const allHealthy = Object.values(results).every(
r => r.status === 'healthy'
);
return {
status: allHealthy ? 'healthy' : 'degraded',
timestamp: new Date().toISOString(),
version: process.env.APP_VERSION,
uptime: process.uptime(),
checks: results,
};
}
// Normalize each settled promise into a CheckResult
function getCheckResult(check: PromiseSettledResult<CheckResult>): CheckResult {
return check.status === 'fulfilled'
? check.value
: { status: 'unhealthy', error: String(check.reason) };
}
async function checkDatabase(): Promise<CheckResult> {
try {
const start = Date.now();
await db.$queryRaw`SELECT 1`;
const latency = Date.now() - start;
return {
status: 'healthy',
latency: `${latency}ms`,
};
} catch (error) {
return {
status: 'unhealthy',
error: error.message,
};
}
}
Uptime Targets
SLA: 99.9% uptime (three nines)
Allowed downtime: 43 minutes/month
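The 43-minute allowance follows directly from the SLA arithmetic; a quick check:

```typescript
// Downtime budget: total minutes in the period times the fraction of
// time the SLA permits the service to be unavailable.
function allowedDowntimeMinutes(slaPercent: number, days: number): number {
  const totalMinutes = days * 24 * 60;
  return totalMinutes * (1 - slaPercent / 100);
}
// 99.9% over a 30-day month works out to 43.2 minutes;
// over a full year it is roughly 8.8 hours.
```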
Current performance:
- Last 30 days: 99.97% (13 min downtime)
- Last 90 days: 99.95% (65 min downtime)
- Last 12 months: 99.93% (6.1 hours downtime)
Status: ✓ Exceeding SLA
Incident Management
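The severity definitions below come with response-time targets; a small sketch of mapping severity to those targets (P3's "next business day" is a policy, not a fixed minute count):

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Targets taken from the P0-P3 definitions in this section.
const RESPONSE_TARGET_MINUTES: Record<Severity, number | null> = {
  P0: 0,    // immediate, all hands on deck
  P1: 15,   // within 15 minutes
  P2: 120,  // within 2 hours
  P3: null, // next business day (no fixed minute target)
};

function responseOverdue(severity: Severity, minutesSincePage: number): boolean {
  const target = RESPONSE_TARGET_MINUTES[severity];
  if (target === null) return false; // handled by business-day policy instead
  return minutesSincePage > target;
}
```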
Incident Response Process
1. Detection (Target: <2 min)
- Alert fires
- On-call engineer paged
- Initial triage begins
2. Assessment (Target: <5 min)
- Determine severity (P0-P3)
- Identify affected systems
- Estimate user impact
3. Response (Target: <1 hour)
- Mitigate immediate impact
- Apply fix or rollback
- Communicate status
4. Resolution
- Verify fix deployed
- Monitor for recurrence
- Update status page
5. Postmortem (Within 48 hours)
- Document incident timeline
- Root cause analysis
- Action items to prevent recurrence
Incident Severity Levels
P0 - Critical:
- Complete service outage
- Data loss risk
- Security breach
Response: Immediate, all hands on deck
Example: API completely down, database corruption
P1 - High:
- Major feature broken
- Payment processing down
- Significant user impact
Response: Within 15 minutes
Example: Quiz submissions failing, M-Pesa integration down
P2 - Medium:
- Minor feature degraded
- Performance issues
- Moderate user impact
Response: Within 2 hours
Example: Slow dashboard, AI features timing out
P3 - Low:
- Cosmetic issues
- Minor bugs
- Minimal user impact
Response: Next business day
Example: Typo in UI, broken link in email
Related Documentation
- Quality Overview - Quality standards
- Code Standards - Coding guidelines
- Deployment Process - Deployment monitoring
- Incident Runbooks - Response procedures