Monitoring and observability

Centralized logs and alerting

Centralizing logs with Loki

Promtail tails the logs of every Kubernetes pod and ships them to Loki, which indexes them by label (namespace, app, and so on) and exposes them to Grafana as a data source.
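To make the data model concrete, here is a minimal sketch of the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`): a set of labels identifying a stream, plus timestamped log lines. `build_push_payload` is a hypothetical helper written for this illustration, and the label values are borrowed from the queries used in this article. Promtail attaches its labels from Kubernetes metadata and may use a different wire encoding, so treat this as an approximation of the idea rather than of Promtail's internals.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON-style body for Loki's POST /loki/api/v1/push endpoint.

    labels: dict of stream labels, e.g. {"namespace": "production"}
    lines:  list of raw log lines (all stamped with the same time here,
            which is a simplification for the example)
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line]
    values = [[str(now_ns), line] for line in lines]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"namespace": "production", "app": "mon-app"},  # assumed labels
    ["INFO request served", "ERROR upstream timeout"],
    now_ns=1_700_000_000_000_000_000,  # fixed timestamp for reproducibility
)
print(json.dumps(payload, indent=2))
```

The labels are what the stream selectors in LogQL (the part between braces) match against, which is why consistent `namespace` and `app` labels matter more than the content of the lines themselves.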

LogQL queries in Grafana

# Show the logs of a specific application
{namespace="production", app="mon-app"}

# Keep only lines containing "ERROR"
{namespace="production", app="mon-app"} |= "ERROR"

# Count error lines per minute (one series per log stream)
count_over_time({namespace="production"} |= "ERROR" [1m])

# Keycloak logs, filtered on login errors
{namespace="auth", app="keycloak"} |= "LOGIN_ERROR"

# Parse JSON logs and filter on an extracted field
{app="mon-app"} | json | status >= 500
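As a rough illustration of what the line filter and the JSON pipeline above do, the sketch below applies the same two steps to a few in-memory lines. The sample lines and the helper names (`line_contains`, `json_status_at_least`) are invented for the example. Loki's real handling of lines that fail to parse (it tags them with an `__error__` label) is not modelled here; this version simply skips them.

```python
import json

# Hypothetical sample lines, as an application might emit them
LINES = [
    '{"level": "INFO", "status": 200, "msg": "ok"}',
    '{"level": "ERROR", "status": 502, "msg": "bad gateway"}',
    '{"level": "ERROR", "status": 404, "msg": "not found"}',
    'ERROR plain-text line that is not JSON',
]


def line_contains(lines, needle):
    """Rough equivalent of the line filter |= "needle" (substring match)."""
    return [line for line in lines if needle in line]


def json_status_at_least(lines, threshold):
    """Rough equivalent of | json | status >= threshold."""
    kept = []
    for line in lines:
        try:
            fields = json.loads(line)
        except json.JSONDecodeError:
            continue  # simplification: unparsable lines are skipped
        if not isinstance(fields, dict):
            continue
        status = fields.get("status")
        if isinstance(status, (int, float)) and status >= threshold:
            kept.append(line)
    return kept


errors = line_contains(LINES, "ERROR")
server_errors = json_status_at_least(LINES, 500)
print(len(errors), "lines contain ERROR")
print(len(server_errors), "lines have status >= 500")
```

The two filters answer different questions: the substring match catches any line mentioning "ERROR", while the parsed filter only keeps lines whose `status` field is numerically at least 500.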

Configuring alerts

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      rules:
        # Alert when the 5xx error ratio exceeds 5%
        - alert: HighErrorRate
          expr: |
            sum by (app) (rate(app_requests_total{status=~"5.."}[5m]))
            / sum by (app) (rate(app_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.app }}"
            description: "The 5xx error ratio has been above 5% for 5 minutes."

        # Alert when a pod restarts too often
        # (any restart within the 15m window satisfies this expression)
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"

        # Alert when CPU usage exceeds 80%
        - alert: HighCpuUsage
          expr: |
            (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "CPU > 80% on {{ $labels.instance }}"
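The interplay between the error-ratio expression and the `for:` clause can be sketched in a few lines of Python. This is only an approximation under stated assumptions: `per_second_rate` ignores the counter resets and extrapolation that Prometheus's `rate()` handles, the one-minute evaluation interval is assumed, and all function names and sample values are hypothetical.

```python
def per_second_rate(first, last, window_seconds):
    """Approximate rate(): counter increase divided by the window length."""
    return (last - first) / window_seconds


def error_ratio(total_first, total_last, errors_first, errors_last,
                window_seconds=300):
    """Ratio of the 5xx rate to the total request rate over one window."""
    total_rate = per_second_rate(total_first, total_last, window_seconds)
    if total_rate == 0:
        return 0.0  # simplification: no traffic is treated as no errors
    errors_rate = per_second_rate(errors_first, errors_last, window_seconds)
    return errors_rate / total_rate


def firing_at(samples, threshold=0.05, for_seconds=300):
    """Return the timestamp at which the alert would fire, or None.

    samples: list of (timestamp_seconds, ratio), one per rule evaluation.
    The alert is pending while the condition holds and fires once it has
    held for at least for_seconds; a single false evaluation resets it.
    """
    pending_since = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            if pending_since is None:
                pending_since = timestamp
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            pending_since = None
    return None


# 3000 requests and 180 errors in a 5-minute window: roughly a 6% ratio
ratio = error_ratio(10_000, 13_000, 100, 280)

# Assumed one evaluation per minute; the dip at t=120 resets the timer
samples = [(0, 0.06), (60, 0.07), (120, 0.02), (180, 0.06), (240, 0.06),
           (300, 0.08), (360, 0.09), (420, 0.07), (480, 0.06)]
print(ratio > 0.05, firing_at(samples))
```

The reset on a single good evaluation is the reason `for:` filters out short spikes: the condition has to hold continuously, not just on average.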

Alertmanager: routing alerts

# alertmanager.yaml
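global:
  resolve_timeout: 5m
  # Email receivers need SMTP settings; the two values below are placeholders
  smtp_smarthost: smtp.example.com:587
  smtp_from: alertmanager@example.com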

route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-critical
    - matchers:
        - severity="warning"
      receiver: email-team

receivers:
  - name: default
    email_configs:
      - to: devops@company.com

  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/xxx
        channel: "#alerts-critical"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ .CommonAnnotations.summary }}"

  - name: email-team
    email_configs:
      - to: team@company.com
        headers:
          Subject: "[WARN] {{ .GroupLabels.alertname }}"

Recommended Grafana dashboards: import dashboards 315 (Kubernetes cluster), 7249 (Kubernetes pods), and 13770 (Node Exporter) by ID from grafana.com/grafana/dashboards.
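Returning to the routing tree in alertmanager.yaml, the way an alert ends up at a receiver can be sketched as a first-match walk over the child routes, falling back to the top-level receiver. The `ROUTE` dictionary is a simplified in-memory copy of that configuration, and `pick_receiver` and the `equals` key are hypothetical names for this illustration. Features such as `continue`, regex matchers, receiver inheritance, and grouping are deliberately left out.

```python
# Simplified copy of the routing tree: a default receiver plus two children
ROUTE = {
    "receiver": "default",
    "routes": [
        {"equals": {"severity": "critical"}, "receiver": "slack-critical"},
        {"equals": {"severity": "warning"}, "receiver": "email-team"},
    ],
}


def pick_receiver(route, labels):
    """Return the receiver name for an alert with the given labels.

    Child routes are checked in order and the first one whose label
    conditions all hold wins; otherwise the parent's receiver is used.
    """
    for child in route.get("routes", []):
        wanted = child.get("equals", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(child, labels)
    return route["receiver"]


for alert_labels in (
    {"alertname": "HighErrorRate", "severity": "critical"},
    {"alertname": "HighCpuUsage", "severity": "warning"},
    {"alertname": "Unlabelled"},
):
    print(alert_labels["alertname"], "->", pick_receiver(ROUTE, alert_labels))
```

An alert without a `severity` label falls through to `default`, which is a useful reminder to give every rule a severity so that it reaches the intended channel.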