Monitoring and Observability
Comprehensive monitoring enables proactive incident response and data-driven optimization. YeboLearn maintains 99.9% uptime through robust observability practices.
Monitoring Philosophy
Three Pillars of Observability
1. Logs (What happened)
- Structured event records
- Debugging and auditing
- Historical analysis
2. Metrics (How much/many)
- Time-series numerical data
- Performance trends
- Alerting thresholds
3. Traces (Request flow)
- End-to-end request tracking
- Latency breakdown
- Dependency mapping
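As an illustration, a single request can surface through all three pillars at once. The sketch below builds the log record, metric increment, and trace span for one hypothetical request; the field names are examples, not the platform's actual schema:

```typescript
// Illustrative only: one request observed through all three pillars.
interface Observation {
  log: { level: string; message: string; requestId: string };
  metric: { name: string; labels: Record<string, string>; delta: number };
  span: { name: string; requestId: string; durationMs: number };
}

function observeRequest(
  requestId: string,
  route: string,
  durationMs: number,
  status: number
): Observation {
  return {
    // Log: what happened, with context for debugging
    log: {
      level: status >= 500 ? "error" : "info",
      message: `Handled ${route}`,
      requestId,
    },
    // Metric: a number suitable for aggregation and alerting
    metric: {
      name: "http_requests_total",
      labels: { route, status: String(status) },
      delta: 1,
    },
    // Trace span: where the time went for this specific request
    span: { name: route, requestId, durationMs },
  };
}
```

The shared `requestId` is what ties the three views together when debugging a specific incident.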
Monitoring Goals
Proactive Over Reactive:
- Detect issues before users report them
- Alert on trends, not just failures
- Prevent incidents through early warnings
Actionable Over Comprehensive:
- Monitor what matters
- Every alert must be actionable
- Reduce noise, increase signal
Fast Mean Time to Detection (MTTD):
- Target: <2 minutes
- Current: 2 minutes
- Real-time monitoring and alerting
Fast Mean Time to Resolution (MTTR):
- Target: <1 hour
- Current: 25 minutes
- Quick access to relevant data
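MTTD and MTTR are straightforward to derive from incident timestamps. A minimal sketch, assuming a hypothetical Incident record with start, detection, and resolution times (not the actual incident-tracker schema):

```typescript
// All timestamps in milliseconds since epoch.
interface Incident {
  startedAt: number;  // when the fault began
  detectedAt: number; // when an alert fired or it was noticed
  resolvedAt: number; // when service was restored
}

function meanMinutes(incidents: Incident[], pick: (i: Incident) => number): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce((sum, i) => sum + pick(i), 0);
  return totalMs / incidents.length / 60_000;
}

// MTTD: fault start -> detection; MTTR here: detection -> resolution
const mttd = (xs: Incident[]) => meanMinutes(xs, i => i.detectedAt - i.startedAt);
const mttr = (xs: Incident[]) => meanMinutes(xs, i => i.resolvedAt - i.detectedAt);
```

Note that some teams measure MTTR from fault start rather than detection; pick one definition and keep it consistent across reports.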
Monitoring Stack
Infrastructure Monitoring
Google Cloud Monitoring:
Platform Metrics:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O
- Network traffic
Cloud Run:
- Container instances
- Request count
- Request latency
- Cold starts
- Error rate
Cloud SQL:
- CPU/Memory usage
- Connections (active/max)
- Query performance
- Replication lag (HA mode)
- Storage usage
Dashboard Example:
Infrastructure Health Dashboard
├─ Cloud Run
│ ├─ Active instances: 2 (avg), 8 (max)
│ ├─ CPU: 35% (avg), 78% (peak)
│ ├─ Memory: 68% (avg), 85% (peak)
│ └─ Cold starts: <1% of requests
├─ Cloud SQL
│ ├─ CPU: 45% (avg), 82% (peak)
│ ├─ Memory: 62%
│ ├─ Connections: 12/25
│ ├─ Query time: 35ms (avg)
│ └─ Storage: 38GB/100GB
└─ Network
├─ Ingress: 45 MB/s (avg)
├─ Egress: 32 MB/s (avg)
└─ Latency: 12ms (avg)
Application Monitoring
Custom Metrics (Prometheus):
// Metrics instrumentation
import { Counter, Histogram, Gauge } from 'prom-client';
// Request counter
export const httpRequests = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status'],
});
// Request duration
export const httpDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.5, 1, 2, 5],
});
// Active users
export const activeUsers = new Gauge({
name: 'active_users_total',
help: 'Currently active users',
});
// Usage in middleware
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
httpRequests.inc({
method: req.method,
route: req.route?.path || 'unknown',
status: res.statusCode,
});
httpDuration.observe(
{
method: req.method,
route: req.route?.path || 'unknown',
status: res.statusCode,
},
duration
);
});
next();
});
Business Metrics:
// Quiz completion tracking
export const quizCompletions = new Counter({
name: 'quiz_completions_total',
help: 'Total quiz completions',
labelNames: ['subject', 'difficulty'],
});
// AI feature usage
export const aiFeatureUsage = new Counter({
name: 'ai_feature_usage_total',
help: 'AI feature usage count',
labelNames: ['feature'], // quiz_gen, essay_grade, etc
});
// Payment transactions
export const paymentTransactions = new Counter({
name: 'payment_transactions_total',
help: 'Payment transactions',
labelNames: ['provider', 'status'], // mpesa/stripe, success/failed
});
// Usage
quizCompletions.inc({ subject: 'mathematics', difficulty: 'medium' });
aiFeatureUsage.inc({ feature: 'quiz_generation' });
paymentTransactions.inc({ provider: 'mpesa', status: 'success' });
Error Tracking
Sentry Integration:
// Initialize Sentry
import * as Sentry from '@sentry/node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
tracesSampleRate: 0.1, // 10% of requests
integrations: [
new Sentry.Integrations.Http({ tracing: true }),
new Sentry.Integrations.Express({ app }),
],
});
// Capture errors
app.use(Sentry.Handlers.requestHandler());
app.use(Sentry.Handlers.tracingHandler());
// Error handler
app.use((err, req, res, next) => {
// Log to Sentry
Sentry.captureException(err, {
tags: {
route: req.route?.path,
method: req.method,
},
user: {
id: req.user?.id,
email: req.user?.email,
},
extra: {
body: req.body,
params: req.params,
},
});
// Send response
res.status(500).json({ error: 'Internal server error' });
});
Error Categories:
Sentry Dashboard Organization:
├─ By Environment
│ ├─ Production (high priority)
│ ├─ Staging (medium priority)
│ └─ Development (low priority)
├─ By Severity
│ ├─ Fatal (immediate attention)
│ ├─ Error (high priority)
│ ├─ Warning (monitor)
│ └─ Info (log only)
└─ By Component
├─ API errors
├─ Database errors
├─ AI integration errors
├─ Payment errors
└─ Frontend errorsPerformance Monitoring
Real User Monitoring (RUM):
// Frontend performance tracking
export function trackPagePerformance() {
if (typeof window === 'undefined') return;
window.addEventListener('load', () => {
const perfData = window.performance.timing;
const pageLoadTime = perfData.loadEventEnd - perfData.navigationStart;
const domReadyTime = perfData.domContentLoadedEventEnd - perfData.navigationStart;
const ttfb = perfData.responseStart - perfData.requestStart;
// Send to analytics
analytics.track('page_performance', {
page: window.location.pathname,
loadTime: pageLoadTime,
domReady: domReadyTime,
ttfb,
connection: navigator.connection?.effectiveType,
deviceMemory: navigator.deviceMemory,
});
// Alert if slow
if (pageLoadTime > 3000) {
console.warn('Slow page load:', pageLoadTime);
}
});
}
// Core Web Vitals
import { getCLS, getFID, getLCP } from 'web-vitals';
function sendToAnalytics(metric) {
analytics.track('web_vital', {
name: metric.name,
value: metric.value,
rating: metric.rating,
page: window.location.pathname,
});
}
getCLS(sendToAnalytics);
getFID(sendToAnalytics);
getLCP(sendToAnalytics);
API Performance Monitoring:
// Track slow database queries
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient({
log: [
{
emit: 'event',
level: 'query',
},
],
});
prisma.$on('query', (e) => {
if (e.duration > 100) {
// Log slow queries (>100ms)
logger.warn('Slow query detected', {
query: e.query,
duration: e.duration,
params: e.params,
});
// Track metric
slowQueries.inc({
model: extractModel(e.query),
});
}
});
// Track AI API latency
export async function callGeminiAPI(prompt: string) {
const start = Date.now();
try {
const response = await geminiClient.generateContent(prompt);
const duration = Date.now() - start;
// Track metric
aiApiDuration.observe({ status: 'success' }, duration / 1000);
return response;
} catch (error) {
const duration = Date.now() - start;
aiApiDuration.observe({ status: 'error' }, duration / 1000);
throw error;
}
}
Log Aggregation
Structured Logging:
// Winston logger configuration
import winston from 'winston';
export const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'yebolearn-api',
environment: process.env.NODE_ENV,
},
transports: [
// Use one transport per environment; enabling both would write every
// log line to stdout twice
process.env.NODE_ENV === 'production'
? // JSON to stdout in production; Cloud Logging ingests and indexes it
new winston.transports.Stream({
stream: process.stdout,
format: winston.format.json(),
})
: // Colorized, human-readable console output for local development
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
),
}),
],
});
// Usage with context
logger.info('Quiz completed', {
userId: 'user-123',
quizId: 'quiz-456',
score: 85,
duration: 1200,
});
logger.error('Payment failed', {
userId: 'user-123',
amount: 500,
provider: 'mpesa',
error: 'Insufficient funds',
transactionId: 'txn-789',
});
logger.warn('High API latency', {
endpoint: '/api/student/dashboard',
duration: 850,
threshold: 500,
});
Log Levels:
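The per-environment behavior below reduces to a numeric threshold check. A sketch using winston's npm level ordering (error=0 through debug=5), which the logger configuration above relies on:

```typescript
// winston's npm levels: lower number = higher severity.
const LEVELS: Record<string, number> = {
  error: 0,
  warn: 1,
  info: 2,
  http: 3,
  verbose: 4,
  debug: 5,
};

// A message is emitted only if its level is at or above the severity
// of the configured LOG_LEVEL (i.e. its number is <= the threshold's).
function shouldLog(messageLevel: string, configuredLevel: string): boolean {
  return LEVELS[messageLevel] <= LEVELS[configuredLevel];
}
```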
Production (LOG_LEVEL=warn):
ERROR: Critical errors, failures
WARN: Potential issues, degraded performance
(INFO and DEBUG disabled in production)
Staging (LOG_LEVEL=info):
ERROR: Critical errors
WARN: Warnings
INFO: Important events (user actions, API calls)
(DEBUG disabled)
Development (LOG_LEVEL=debug):
ERROR: All errors
WARN: All warnings
INFO: All significant events
DEBUG: Detailed debugging information
Distributed Tracing
Google Cloud Trace:
// Trace API requests
import * as traceAgent from '@google-cloud/trace-agent';
// Initialize (must run before the modules you want auto-instrumented)
const tracer = traceAgent.start({
projectId: 'yebolearn-prod',
samplingRate: 10, // at most 10 traced requests per second (not a percentage)
});
// HTTP requests are traced automatically
// Manual spans for specific operations:
export async function generateQuiz(topic: string) {
const span = tracer.createChildSpan({ name: 'generateQuiz' });
try {
const prompt = buildQuizPrompt(topic); // helper that turns the topic into a Gemini prompt
// Call Gemini API
const quizSpan = tracer.createChildSpan({ name: 'gemini-api-call' });
const quiz = await geminiClient.generateContent(prompt);
quizSpan.endSpan();
// Save to database
const dbSpan = tracer.createChildSpan({ name: 'save-quiz' });
await db.quiz.create({ data: quiz });
dbSpan.endSpan();
return quiz;
} finally {
span.endSpan();
}
}
Trace Analysis:
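Insights like the one in the trace example below ("database queries taking 260ms (58%)") fall out of summing child-span durations. A sketch with illustrative span names:

```typescript
interface Span {
  name: string;
  durationMs: number;
}

// Sum the durations of spans matching a predicate and express them
// as a share of the request's total time.
function shareOfTotal(
  spans: Span[],
  totalMs: number,
  predicate: (s: Span) => boolean
): { ms: number; pct: number } {
  const ms = spans.filter(predicate).reduce((sum, s) => sum + s.durationMs, 0);
  return { ms, pct: Math.round((ms / totalMs) * 100) };
}
```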
Request Trace Example:
GET /api/student/dashboard
Total: 450ms
├─ Authentication middleware: 25ms
├─ Fetch student data: 180ms
│ ├─ Database query: 145ms
│ └─ Cache lookup: 35ms
├─ Fetch enrollments: 120ms
│ └─ Database query: 115ms
├─ Calculate progress: 85ms
│ └─ Aggregation logic: 80ms
└─ Serialize response: 40ms
Insights:
- Database queries taking 260ms (58%)
- Opportunity: Add caching for enrollments
Dashboards
Executive Dashboard
High-Level Metrics (Grafana):
YeboLearn Platform Health
┌─────────────────────────────────────────┐
│ System Status │
│ ✓ All Systems Operational │
│ Uptime: 99.97% (30 days) │
│ Active Users: 2,340 │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Performance │
│ API Response Time: 145ms (p50) │
│ Page Load Time: 1.9s (avg) │
│ Error Rate: 0.3% │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Business Metrics (Today) │
│ Quiz Completions: 1,240 │
│ AI Features Used: 340 │
│ New Signups: 28 │
│ Revenue: $420 │
└─────────────────────────────────────────┘Engineering Dashboard
Detailed Technical Metrics:
API Performance
├─ Request Rate: 45 req/s (avg), 120 req/s (peak)
├─ Response Time: p50=145ms, p95=380ms, p99=820ms
├─ Error Rate: 0.3% (target: <1%)
└─ Top Endpoints by Latency:
1. /api/student/progress - 210ms
2. /api/analytics/dashboard - 280ms
3. /api/ai/generate-quiz - 8,500ms (AI feature)
Database Performance
├─ Query Time: 35ms (avg), 180ms (p95)
├─ Connections: 12/25 active
├─ Slow Queries (>100ms): 15/hour
├─ Cache Hit Rate: 94%
└─ Index Hit Rate: 98.5%
Infrastructure
├─ Cloud Run Instances: 2 (avg), 8 (max)
├─ CPU Usage: 35% (avg), 78% (peak)
├─ Memory Usage: 68% (avg), 85% (peak)
├─ Database CPU: 45% (avg), 82% (peak)
└─ Storage: 38GB/100GB (38%)
Error Tracking (Last 24 Hours)
├─ Total Errors: 45
├─ New Errors: 3
├─ Resolved Errors: 12
└─ Top Errors:
1. Database timeout - 12 occurrences
2. Gemini API rate limit - 8 occurrences
3. Invalid quiz submission - 6 occurrences
AI Features Dashboard
AI-Specific Metrics:
AI Feature Performance
├─ Quiz Generation
│ ├─ Requests: 180/day
│ ├─ Avg latency: 8s
│ ├─ Success rate: 99.2%
│ ├─ Cost: $0.15/request
│ └─ Quality score: 9.1/10
├─ Essay Grading
│ ├─ Requests: 45/day
│ ├─ Avg latency: 45s
│ ├─ Success rate: 98.5%
│ ├─ Cost: $0.12/request
│ └─ Teacher approval: 87%
└─ Content Recommendations
├─ Requests: 2,400/day
├─ Avg latency: 2s
├─ Click-through: 34%
└─ Cost: $0.02/request
Gemini API Usage
├─ Requests: 12,000/day
├─ Tokens In: 4.2M/day
├─ Tokens Out: 1.6M/day
├─ Total Cost: $6.50/day
├─ Rate Limits: 25/60 req/min (42%)
└─ Error Rate: 1.2%
Alerting
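Most of the alert rules in this section pair a threshold with a duration ("for N minutes"). The sketch below shows that evaluation logic in isolation; in practice the monitoring backend evaluates this server-side, so this is only the idea:

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// Fires only when every sample in the trailing window exceeds the threshold
// AND the window actually spans the required duration (so a single bad
// sample right after startup cannot trigger a "for 10 minutes" rule).
function alertFiring(
  samples: Sample[],
  threshold: number,
  forMs: number,
  nowMs: number
): boolean {
  const window = samples.filter(s => s.timestampMs >= nowMs - forMs);
  if (window.length === 0) return false;
  const oldest = Math.min(...window.map(s => s.timestampMs));
  // Tolerate a 1-minute scrape interval at the window's left edge
  const coversWindow = oldest <= nowMs - forMs + 60_000;
  return coversWindow && window.every(s => s.value > threshold);
}
```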
Alert Configuration
Critical Alerts (PagerDuty):
# API Down
- name: api_down
condition: uptime < 99% for 2 minutes
severity: critical
notify: pagerduty
escalation: immediate
# High Error Rate
- name: high_error_rate
condition: error_rate > 5% for 3 minutes
severity: critical
notify: pagerduty
escalation: after 5 minutes
# Database Down
- name: database_down
condition: db_connections = 0 for 1 minute
severity: critical
notify: pagerduty + cto
escalation: immediate
# Payment Processing Failed
- name: payment_failures
condition: payment_failure_rate > 10% for 2 minutes
severity: critical
notify: pagerduty + finance
escalation: after 10 minutes
Warning Alerts (Slack #engineering):
# Slow API Response
- name: slow_api
condition: p95_latency > 1s for 10 minutes
severity: warning
notify: slack
message: "API response time elevated: {{value}}ms"
# High Memory Usage
- name: high_memory
condition: memory_usage > 80% for 15 minutes
severity: warning
notify: slack
message: "Memory usage: {{value}}%"
# Increased Error Rate
- name: elevated_errors
condition: error_rate > 2% for 10 minutes
severity: warning
notify: slack
message: "Error rate elevated: {{value}}%"
# AI API Rate Limit Approaching
- name: gemini_rate_limit
condition: gemini_requests > 50/min for 5 minutes
severity: warning
notify: slack
message: "Approaching Gemini API rate limit: {{value}} req/min"
Alert Best Practices:
Effective Alerts:
✓ Actionable (team can fix)
✓ Specific (clear what's wrong)
✓ Timely (detect before users)
✓ Relevant (not noise)
Alert Fatigue Prevention:
✓ Group related alerts
✓ Deduplicate similar alerts
✓ Adjust thresholds based on patterns
✓ Auto-resolve when issue clears
✓ Review and tune monthly
On-Call Rotation
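Resolving who is on call from the four-week rotation listed below is simple date arithmetic. A sketch, with a made-up epoch date standing in for the real Week 1 start:

```typescript
const ROTATION = ["Sarah", "John", "Lisa", "Mark"];
// Hypothetical Monday on which a Week-1 (Sarah) rotation began.
const EPOCH_MS = Date.UTC(2024, 0, 1);

function onCallFor(dateMs: number): string {
  const weekMs = 7 * 24 * 60 * 60 * 1000;
  const weeks = Math.floor((dateMs - EPOCH_MS) / weekMs);
  // Double-modulo keeps the index positive for dates before the epoch
  return ROTATION[((weeks % 4) + 4) % 4];
}
```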
Schedule:
Weekly rotation:
- Week 1: Sarah
- Week 2: John
- Week 3: Lisa
- Week 4: Mark
On-call responsibilities:
- Respond to PagerDuty alerts (24/7)
- Triage and resolve P0/P1 incidents
- Escalate if needed
- Document incident in postmortem
- Handoff status to next on-call
On-call capacity:
- Protected from sprint commitments
- Focus on monitoring and incidents
- Handle urgent bugs and hotfixes
Uptime Tracking
External Monitoring
UptimeRobot Configuration:
Monitored Endpoints:
├─ https://api.yebolearn.app/health
│ ├─ Check interval: 1 minute
│ ├─ Timeout: 30 seconds
│ └─ Expected: 200 OK + "healthy" in response
├─ https://yebolearn.app
│ ├─ Check interval: 5 minutes
│ ├─ Timeout: 30 seconds
│ └─ Expected: 200 OK
└─ https://api.yebolearn.app/api/v1/status
├─ Check interval: 5 minutes
├─ Timeout: 10 seconds
└─ Expected: 200 OK + valid JSON
Notifications:
- Alert on: Down for 2 minutes
- Notify: PagerDuty + Slack
- Escalation: Email team lead after 10 minutes
Health Check Endpoint
// Comprehensive health check
export async function healthCheck(): Promise<HealthStatus> {
const checks = await Promise.allSettled([
checkDatabase(),
checkRedis(),
checkGeminiAPI(),
checkEmailService(),
checkPaymentGateway(),
]);
const results = {
database: getCheckResult(checks[0]),
redis: getCheckResult(checks[1]),
gemini: getCheckResult(checks[2]),
email: getCheckResult(checks[3]),
payment: getCheckResult(checks[4]),
};
const allHealthy = Object.values(results).every(
r => r.status === 'healthy'
);
return {
status: allHealthy ? 'healthy' : 'degraded',
timestamp: new Date().toISOString(),
version: process.env.APP_VERSION,
uptime: process.uptime(),
checks: results,
};
}
// Normalize each settled promise into a CheckResult
function getCheckResult(check: PromiseSettledResult<CheckResult>): CheckResult {
return check.status === 'fulfilled'
? check.value
: { status: 'unhealthy', error: String(check.reason) };
}
async function checkDatabase(): Promise<CheckResult> {
try {
const start = Date.now();
await db.$queryRaw`SELECT 1`;
const latency = Date.now() - start;
return {
status: 'healthy',
latency: `${latency}ms`,
};
} catch (error) {
return {
status: 'unhealthy',
error: error.message,
};
}
}
Uptime Targets
SLA: 99.9% uptime (three nines)
Allowed downtime: 43 minutes/month
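The 43-minute allowance follows directly from the SLA arithmetic; a quick check:

```typescript
// Downtime budget: total minutes in the period times the fraction of
// time the SLA permits the service to be unavailable.
function allowedDowntimeMinutes(slaPercent: number, days: number): number {
  const totalMinutes = days * 24 * 60;
  return totalMinutes * (1 - slaPercent / 100);
}
// 99.9% over a 30-day month works out to 43.2 minutes;
// over a full year it is roughly 8.8 hours.
```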
Current performance:
- Last 30 days: 99.97% (13 min downtime)
- Last 90 days: 99.95% (65 min downtime)
- Last 12 months: 99.93% (6.1 hours downtime)
Status: ✓ Exceeding SLA
Incident Management
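The severity definitions below come with response-time targets; a small sketch of mapping severity to those targets (P3's "next business day" is a policy, not a fixed minute count):

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Targets taken from the P0-P3 definitions in this section.
const RESPONSE_TARGET_MINUTES: Record<Severity, number | null> = {
  P0: 0,    // immediate, all hands on deck
  P1: 15,   // within 15 minutes
  P2: 120,  // within 2 hours
  P3: null, // next business day (no fixed minute target)
};

function responseOverdue(severity: Severity, minutesSincePage: number): boolean {
  const target = RESPONSE_TARGET_MINUTES[severity];
  if (target === null) return false; // handled by business-day policy instead
  return minutesSincePage > target;
}
```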
Incident Response Process
1. Detection (Target: <2 min)
- Alert fires
- On-call engineer paged
- Initial triage begins
2. Assessment (Target: <5 min)
- Determine severity (P0-P3)
- Identify affected systems
- Estimate user impact
3. Response (Target: <1 hour)
- Mitigate immediate impact
- Apply fix or rollback
- Communicate status
4. Resolution
- Verify fix deployed
- Monitor for recurrence
- Update status page
5. Postmortem (Within 48 hours)
- Document incident timeline
- Root cause analysis
- Action items to prevent recurrence
Incident Severity Levels
P0 - Critical:
- Complete service outage
- Data loss risk
- Security breach
Response: Immediate, all hands on deck
Example: API completely down, database corruption
P1 - High:
- Major feature broken
- Payment processing down
- Significant user impact
Response: Within 15 minutes
Example: Quiz submissions failing, M-Pesa integration down
P2 - Medium:
- Minor feature degraded
- Performance issues
- Moderate user impact
Response: Within 2 hours
Example: Slow dashboard, AI features timing out
P3 - Low:
- Cosmetic issues
- Minor bugs
- Minimal user impact
Response: Next business day
Example: Typo in UI, broken link in email
Related Documentation
- Quality Overview - Quality standards
- Code Standards - Coding guidelines
- Deployment Process - Deployment monitoring
- Incident Runbooks - Response procedures