Monitoring Azure Infrastructure with Azure Monitor and Grafana

Set up comprehensive monitoring for your Azure infrastructure using Azure Monitor, Log Analytics, and Grafana dashboards for full observability.

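Observability is critical for running production workloads. In this post, we’ll build a monitoring setup for an AKS cluster: Container Insights to collect data into a Log Analytics workspace, metric alerts and action groups for notifications, KQL queries and Workbooks for analysis, and Grafana for dashboards.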

The Three Pillars of Observability

  • Metrics - Numeric measurements over time (CPU, memory, request rates)
  • Logs - Timestamped records of discrete events
  • Traces - Request flows across distributed systems

Setting Up Azure Monitor

Enable Container Insights for AKS

az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring \
  --workspace-resource-id /subscriptions/<sub-id>/resourcegroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>
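
The `--workspace-resource-id` value is a full Azure Resource Manager ID, and a typo in any segment makes the command fail. As a minimal sketch, the helper below assembles the ID from its three parts using the same path layout as the command above. The function name and the sample values are hypothetical.

```python
def workspace_resource_id(subscription_id: str, resource_group: str, workspace: str) -> str:
    """Build the ARM resource ID of a Log Analytics workspace."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace": workspace,
    }
    # Reject empty segments and stray slashes, which would produce a malformed ID.
    for label, value in parts.items():
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty name without '/'")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.OperationalInsights"
        f"/workspaces/{workspace}"
    )


if __name__ == "__main__":
    # Placeholder values, not real resources.
    print(workspace_resource_id(
        "00000000-0000-0000-0000-000000000000", "myResourceGroup", "myWorkspace"
    ))
```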

Create Custom Metric Alerts

az monitor metrics alert create \
  --name "High CPU Alert" \
  --resource-group myResourceGroup \
  --scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/myAKSCluster" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/actionGroups/myActionGroup"
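
With `--window-size 5m` and `--evaluation-frequency 1m`, the rule is re-evaluated every minute against the average of the most recent five minutes, so a short spike does not necessarily fire the alert. The sketch below simulates that windowing on one-sample-per-minute data. It illustrates the idea only and is not how Azure Monitor is implemented; the function name and sample values are made up.

```python
from statistics import mean


def evaluate_alert(samples, window=5, threshold=80.0):
    """Return one boolean per evaluation: True when the window average exceeds the threshold.

    samples: one CPU percentage per minute, oldest first.
    An evaluation happens every minute once a full window is available.
    """
    results = []
    for end in range(window, len(samples) + 1):
        window_avg = mean(samples[end - window:end])
        results.append(window_avg > threshold)
    return results


if __name__ == "__main__":
    spike = [50, 50, 95, 50, 50, 50]      # one-minute spike
    sustained = [70, 85, 90, 95, 90, 85]  # sustained load
    print(evaluate_alert(spike))      # [False, False]
    print(evaluate_alert(sustained))  # [True, True]
```

The single 95% sample is averaged away, while sustained load keeps the five-minute average above 80 on every evaluation.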

Log Analytics Queries (KQL)

Container Insights writes inventory, performance, and log data to the Log Analytics workspace, where you can query it with Kusto Query Language (KQL):

// Node CPU and memory usage (Container Insights writes node counters to the Perf table)
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| summarize AvgValue = avg(CounterValue)
  by bin(TimeGenerated, 5m), Computer, CounterName
| render timechart

// Pod restart tracking (latest record per pod)
KubePodInventory
| where TimeGenerated > ago(24h)
| where PodRestartCount > 0
| summarize arg_max(TimeGenerated, PodRestartCount, ContainerStatus) by Name, Namespace
| order by PodRestartCount desc

// HTTP 5xx error count per 5 minutes
// Assumes access-log lines such as: "GET /api HTTP/1.1" 503
ContainerLog
| where LogEntry matches regex @'HTTP/[0-9.]+"\s+5[0-9]{2}\b'
| summarize ErrorCount = count() by bin(TimeGenerated, 5m)
| render timechart
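
A filter for server errors has to look at the status-code position, because a bare substring match on "5" also hits timestamps, paths, and byte counts. The sketch below tests such a pattern against sample lines and groups the matches into 5-minute bins, mirroring `bin(TimeGenerated, 5m)`. It assumes access-log lines in the `"GET /path HTTP/1.1" 500` layout; the pattern, function name, and sample lines are illustrative, and other log formats would need a different pattern.

```python
import re
from collections import Counter
from datetime import datetime

# A 5xx status code directly after the quoted request line, e.g. "GET / HTTP/1.1" 503
SERVER_ERROR = re.compile(r'HTTP/[0-9.]+"\s+5[0-9]{2}\b')


def count_5xx(entries, bin_minutes=5):
    """Count 5xx lines per time bin.

    entries: iterable of (datetime, log line) pairs.
    bin_minutes: bin width; should divide 60 evenly.
    Returns a Counter keyed by the start of each bin.
    """
    counts = Counter()
    for timestamp, line in entries:
        if SERVER_ERROR.search(line):
            # Round the timestamp down to the start of its bin.
            bin_start = timestamp.replace(
                minute=timestamp.minute - timestamp.minute % bin_minutes,
                second=0,
                microsecond=0,
            )
            counts[bin_start] += 1
    return counts


if __name__ == "__main__":
    sample = [
        (datetime(2024, 1, 1, 12, 1), '10.0.0.5 "GET /api HTTP/1.1" 200 512'),
        (datetime(2024, 1, 1, 12, 2), '10.0.0.5 "GET /api HTTP/1.1" 503 0'),
        (datetime(2024, 1, 1, 12, 7), '10.0.0.7 "POST /orders/5 HTTP/2.0" 500 35'),
    ]
    for bin_start, n in sorted(count_5xx(sample).items()):
        print(bin_start.isoformat(), n)
```

The first sample line contains several 5s but a 200 status, so it is not counted; the other two land in the 12:00 and 12:05 bins.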

Deploying Grafana on AKS

# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.4.2
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin-password
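            # Grafana's config section is [auth.azuread], so the variable prefix is GF_AUTH_AZUREAD_.
            # Azure AD login also needs the client secret, auth URL, and token URL set the same way.
            - name: GF_AUTH_AZUREAD_ENABLED
              value: "true"
            - name: GF_AUTH_AZUREAD_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: azure-client-id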
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
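
The Deployment reads `admin-password` and `azure-client-id` from a Secret named `grafana-secrets`, which must exist in the `monitoring` namespace before the pod can start. As a minimal sketch, the script below renders such a Secret as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name and the placeholder values are hypothetical; in practice the values would come from a secret store rather than source code.

```python
import base64
import json


def grafana_secret_manifest(admin_password: str, azure_client_id: str) -> dict:
    """Build a Kubernetes Secret manifest with the keys the Deployment references."""

    def encode(value: str) -> str:
        # Values under `data` in a Secret must be base64-encoded strings.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "grafana-secrets", "namespace": "monitoring"},
        "type": "Opaque",
        "data": {
            "admin-password": encode(admin_password),
            "azure-client-id": encode(azure_client_id),
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = grafana_secret_manifest(
        "change-me", "00000000-0000-0000-0000-000000000000"
    )
    print(json.dumps(manifest, indent=2))
```

Saved under a name of your choice, the output can be piped straight into `kubectl apply -f -`.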

Azure Monitor Workbooks

Create rich interactive reports with Azure Monitor Workbooks:

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 9,
      "content": {
        "version": "KqlParameterItem/1.0",
        "parameters": [
          {
            "id": "timeRange",
            "version": "KqlParameterItem/1.0",
            "name": "TimeRange",
            "type": 4,
            "value": {
              "durationMs": 3600000
            }
          }
        ]
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "KubeNodeInventory | summarize count() by Status",
        "size": 1,
        "chartType": 3
      }
    }
  ]
}
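
Because a workbook template is plain JSON, it can be generated rather than hand-edited when several workbooks share a layout. The sketch below parameterizes the query and the default time range of the template above. The function name is hypothetical, and the code reuses only the fields shown in this post; it does not validate against the full workbook schema.

```python
import json


def workbook_template(query: str, hours: int = 1) -> dict:
    """Build a two-item workbook template: a time-range parameter and one query chart."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return {
        "version": "Notebook/1.0",
        "items": [
            {
                "type": 9,
                "content": {
                    "version": "KqlParameterItem/1.0",
                    "parameters": [
                        {
                            "id": "timeRange",
                            "version": "KqlParameterItem/1.0",
                            "name": "TimeRange",
                            "type": 4,
                            # The time range is stored in milliseconds.
                            "value": {"durationMs": hours * 60 * 60 * 1000},
                        }
                    ],
                },
            },
            {
                "type": 3,
                "content": {
                    "version": "KqlItem/1.0",
                    "query": query,
                    "size": 1,
                    "chartType": 3,
                },
            },
        ],
    }


if __name__ == "__main__":
    template = workbook_template("KubeNodeInventory | summarize count() by Status")
    print(json.dumps(template, indent=2))
```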

Setting Up Alerts and Notifications

Action Groups

# Create action group with email and Teams webhook
az monitor action-group create \
  --name "DevOps Team" \
  --resource-group myResourceGroup \
  --short-name "devops" \
  --action email oncall devops@company.com \
  --action webhook teams https://company.webhook.office.com/webhookb2/...

Best Practices

  1. Set meaningful thresholds - Derive alert thresholds from measured baselines, not guesses
  2. Use dynamic thresholds - Let Azure Monitor's built-in machine learning learn normal behavior from metric history
  3. Prevent alert fatigue - Use alert processing rules (formerly action rules) to suppress noise
  4. Create runbooks - Link each alert to its remediation documentation
  5. Review regularly - Review and tune alerts monthly
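
For the first practice, one simple way to turn a baseline into a static threshold is to take the mean of historical samples plus a multiple of their standard deviation. The sketch below does this with the standard library. The multiplier of 3, the function name, and the sample data are assumptions, and this is not the algorithm behind Azure Monitor's dynamic thresholds.

```python
from statistics import mean, stdev


def baseline_threshold(history, sigmas=3.0, ceiling=100.0):
    """Suggest a static alert threshold from historical metric samples.

    Returns mean + sigmas * sample standard deviation, capped at `ceiling`
    (100 for percentage metrics).
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return min(mean(history) + sigmas * stdev(history), ceiling)


if __name__ == "__main__":
    cpu_last_week = [42, 45, 40, 48, 44, 46, 43]  # made-up CPU percentages
    print(round(baseline_threshold(cpu_last_week), 1))  # 51.9
```

A threshold derived this way sits just above normal variation for the workload instead of at a generic value such as 80%.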

Conclusion

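Combining Azure Monitor, Log Analytics, and Grafana gives you broad visibility into your Azure infrastructure: Container Insights collects the data, KQL queries and Workbooks help you diagnose problems, and metric alerts with action groups notify the right people. This post covered metrics and logs; distributed traces, the third pillar, need application-level instrumentation on top of this setup.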