Monitoring Azure Infrastructure with Azure Monitor and Grafana

Observability is critical for running production workloads. In this post, we’ll build a monitoring setup for an AKS cluster using Azure Monitor, a Log Analytics workspace, and Grafana.

The Three Pillars of Observability

Metrics - Numeric measurements over time (CPU, memory, request rates)
Logs - Timestamped records of discrete events
Traces - Request flows across distributed systems

Setting Up Azure Monitor

Enable Container Insights for AKS

az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring \
  --workspace-resource-id /subscriptions/<sub-id>/resourcegroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>
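
The `--workspace-resource-id` value does not have to be typed out by hand. A minimal sketch, assuming the workspace already exists in the same resource group as the cluster; the helper name `enable_container_insights` and its argument order are illustrative and not part of the Azure CLI:

```shell
# Hypothetical helper: look up the workspace resource ID, then enable the add-on.
# Arguments: <resource-group> <cluster-name> <workspace-name>
enable_container_insights() {
  local rg="$1" cluster="$2" workspace="$3"
  local workspace_id

  # Ask Azure for the full resource ID instead of assembling it manually.
  workspace_id="$(az monitor log-analytics workspace show \
    --resource-group "$rg" \
    --workspace-name "$workspace" \
    --query id \
    --output tsv)" || return 1

  az aks enable-addons \
    --resource-group "$rg" \
    --name "$cluster" \
    --addons monitoring \
    --workspace-resource-id "$workspace_id"
}

# Usage (placeholder names):
# enable_container_insights myResourceGroup myAKSCluster myWorkspace
```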

Create Custom Metric Alerts

# AKS clusters expose node CPU as node_cpu_usage_percentage ("Percentage CPU" is a VM metric)
az monitor metrics alert create \
  --name "High CPU Alert" \
  --resource-group myResourceGroup \
  --scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/myAKSCluster" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/actionGroups/myActionGroup"
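
To make `--window-size` and `--evaluation-frequency` concrete: roughly once per minute, the rule averages the samples from the preceding five minutes and compares the result to the threshold. The sketch below reproduces that evaluation locally over one-minute samples read from stdin. It is an illustration only; `evaluate_windows` and the sample values are made up, and Azure Monitor's actual evaluation has more moving parts.

```shell
# Illustrative only: read one sample per line (one per minute) and, once a
# full window is available, print the window average and whether it would fire.
# Arguments: <window-size-in-samples> <threshold>
evaluate_windows() {
  local window="$1" threshold="$2"
  awk -v w="$window" -v t="$threshold" '
    {
      samples[NR] = $1
      sum += $1
      if (NR > w) sum -= samples[NR - w]   # drop the sample that left the window
      if (NR >= w) {
        avg = sum / w
        printf "minute=%d avg=%.1f %s\n", NR, avg, (avg > t ? "FIRE" : "ok")
      }
    }'
}

printf '%s\n' 60 70 80 90 100 100 | evaluate_windows 5 80
# minute=5 avg=80.0 ok
# minute=6 avg=88.0 FIRE
```

The first full window averages exactly 80, which does not satisfy `> 80`, so the alert fires one minute later.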

Log Analytics Queries (KQL)

Log Analytics queries are written in Kusto Query Language (KQL). The queries below run against the tables that Container Insights populates:

// Node CPU and memory usage (Container Insights Perf counters)
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| summarize AvgValue = avg(CounterValue)
    by bin(TimeGenerated, 5m), Computer, CounterName
| render timechart

// Pod restart tracking (latest record per pod)
KubePodInventory
| where TimeGenerated > ago(24h)
| where PodRestartCount > 0
| summarize arg_max(TimeGenerated, PodRestartCount, ContainerStatus) by Name, Namespace
| order by PodRestartCount desc

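// HTTP 5xx error count per 5 minutes
// Assumes access-log style entries where the status code follows the quoted request line
ContainerLog
| where LogEntry has "HTTP"
| extend StatusCode = toint(extract(@'HTTP/[\d.]+"\s+(\d{3})', 1, LogEntry))
| where StatusCode between (500 .. 599)
| summarize ErrorCount = count() by bin(TimeGenerated, 5m)
| render timechart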
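
Matching a status code by substring is fragile: a line such as `"GET /items/5 HTTP/1.1" 200` contains both `HTTP` and `5` without being an error. One way to sanity-check a pattern before running it against the workspace is to try it on a few sample lines locally. The sketch assumes access-log style entries where the status code follows the quoted request line; `count_5xx` and the sample lines are invented for illustration.

```shell
# Illustrative helper: count lines whose HTTP status field is 5xx.
# Assumes lines shaped like: ... "<METHOD> <path> HTTP/<ver>" <status> <bytes>
count_5xx() {
  grep -cE 'HTTP/[0-9.]+" 5[0-9]{2}( |$)'
}

cat <<'EOF' | count_5xx
10.0.0.5 - - "GET /items/5 HTTP/1.1" 200 512
10.0.0.7 - - "POST /orders HTTP/1.1" 503 0
10.0.0.9 - - "GET /health HTTP/1.1" 500 17
EOF
# prints 2
```

The first sample line contains the digit 5 in the client address, the path, and the byte count, yet it is correctly excluded because only the status field is inspected.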

Deploying Grafana on AKS

# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.4.2
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin-password
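            # Grafana maps the [auth.azuread] section to GF_AUTH_AZUREAD_* variables.
            # Azure AD login also needs GF_AUTH_AZUREAD_CLIENT_SECRET, _AUTH_URL and _TOKEN_URL.
            - name: GF_AUTH_AZUREAD_ENABLED
              value: "true"
            - name: GF_AUTH_AZUREAD_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: azure-client-id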
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
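
The manifest references a `monitoring` namespace, a `grafana-secrets` secret, and a `grafana-pvc` claim, all of which must exist before the pod can start. A minimal sketch of creating them with kubectl follows. The helper name, the 10Gi size, and the use of the cluster's default storage class are assumptions; in practice you may prefer to source the secret values from a secret store rather than pass them on the command line.

```shell
# Hypothetical helper: create the objects the Grafana Deployment expects.
# Arguments: <admin-password> <azure-client-id>
create_grafana_prereqs() {
  local admin_password="$1" client_id="$2"

  kubectl create namespace monitoring || return 1

  # Keys match the secretKeyRef entries in the Deployment.
  kubectl create secret generic grafana-secrets \
    --namespace monitoring \
    --from-literal=admin-password="$admin_password" \
    --from-literal=azure-client-id="$client_id" || return 1

  # 10Gi and ReadWriteOnce are assumptions; no storage class is named,
  # so the cluster default is used.
  kubectl apply --namespace monitoring -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
}

# Usage (placeholder values):
# create_grafana_prereqs "$GRAFANA_ADMIN_PASSWORD" "$AZURE_CLIENT_ID"
```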

Azure Monitor Workbooks

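Azure Monitor Workbooks combine parameters and query results into interactive reports. The template below defines a time range parameter and a query step that counts nodes by status: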

{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 9,
      "content": {
        "version": "KqlParameterItem/1.0",
        "parameters": [
          {
            "id": "timeRange",
            "version": "KqlParameterItem/1.0",
            "name": "TimeRange",
            "type": 4,
            "value": {
              "durationMs": 3600000
            }
          }
        ]
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "KubeNodeInventory | summarize count() by Status",
        "size": 1,
        "chartType": 3
      }
    }
  ]
}

Setting Up Alerts and Notifications

Action Groups

# Create action group with email and Teams webhook
az monitor action-group create \
  --name "DevOps Team" \
  --resource-group myResourceGroup \
  --short-name "devops" \
  --email-receiver name=oncall email=devops@company.com \
  --webhook-receiver name=teams \
    service-uri=https://company.webhook.office.com/webhookb2/...

Best Practices

Set meaningful thresholds - Base alerts on observed baselines, not guesses
Use dynamic thresholds - Let Azure Monitor's built-in machine learning learn a metric's normal behavior
Prevent alert fatigue - Use alert processing rules (formerly action rules) to suppress noise
Create runbooks - Link each alert to remediation documentation
Review regularly - Review and tune alerts monthly
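
One way to turn "baselines, not guesses" into a number is to derive a static threshold from historical samples, for example the mean plus two standard deviations. The sketch below does this with awk over values exported from a metric or query. The multiplier of 2, the helper name `suggest_threshold`, and the sample values are assumptions for illustration, not a recommendation from Azure.

```shell
# Illustrative: suggest a threshold as mean + k * population standard deviation.
# Reads one historical sample per line from stdin. Argument: [k], default 2.
suggest_threshold() {
  local k="${1:-2}"
  awk -v k="$k" '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) exit 1                # no samples, no suggestion
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0              # guard against floating-point rounding
      printf "%.1f\n", mean + k * sqrt(var)
    }'
}

printf '%s\n' 40 50 60 | suggest_threshold 2
# prints 66.3
```

A threshold derived this way should still be revisited during the monthly review, since the baseline shifts as the workload changes.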

Conclusion

Combining Azure Monitor, Log Analytics, and Grafana gives you visibility into your Azure infrastructure from platform metrics down to container logs. Together, metrics, logs, and traces help you detect, diagnose, and resolve issues quickly.