Observability is critical for running production workloads. In this post, we’ll build a comprehensive monitoring solution using Azure Monitor, Log Analytics Workspace, and Grafana.
The Three Pillars of Observability
- Metrics - Numeric measurements over time (CPU, memory, request rates)
- Logs - Timestamped records of discrete events
- Traces - Request flows across distributed systems
Setting Up Azure Monitor
Enable Container Insights for AKS
az aks enable-addons \
--resource-group myResourceGroup \
--name myAKSCluster \
--addons monitoring \
--workspace-resource-id /subscriptions/<sub-id>/resourcegroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>
Create Custom Metric Alerts
az monitor metrics alert create \
--name "High CPU Alert" \
--resource-group myResourceGroup \
--scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/myAKSCluster" \
--condition "avg node_cpu_usage_percentage > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--action "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/actionGroups/myActionGroup"
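To make the interaction between `--window-size` and `--evaluation-frequency` concrete, here is a minimal Python sketch of a sliding-window average check. It is a simplified model, not Azure Monitor's actual evaluation logic: the function name `evaluate_alert` and the sample data are hypothetical, and it assumes one CPU sample per minute with no missing data.

```python
from statistics import mean


def evaluate_alert(samples, threshold=80.0, window=5, frequency=1):
    """Return the sample indexes at which the averaged condition is true.

    samples:   one CPU percentage per minute (hypothetical data)
    window:    look-back size in minutes (mirrors --window-size 5m)
    frequency: minutes between evaluations (mirrors --evaluation-frequency 1m)
    """
    fired = []
    # Start evaluating once a full window of data is available.
    for end in range(window, len(samples) + 1, frequency):
        window_avg = mean(samples[end - window:end])
        if window_avg > threshold:
            fired.append(end - 1)  # index of the newest sample in the window
    return fired


cpu = [50, 60, 85, 90, 95, 92, 88, 40, 30, 20]
print(evaluate_alert(cpu))  # → [5, 6, 7]
```

A single spike does not fire the alert, because the condition is checked against the average over the whole window. The alert also stays active for a short while after CPU drops, until the high samples age out of the window.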
Log Analytics Queries (KQL)
Kusto Query Language (KQL) is the query language for Log Analytics. The queries below cover resource usage, pod restarts, and HTTP error rates:
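// Node resource usage (Container Insights writes node counters to the Perf table)
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode"
| summarize
    AvgCpuNanoCores = avgif(CounterValue, CounterName == "cpuUsageNanoCores"),
    AvgMemoryBytes = avgif(CounterValue, CounterName == "memoryWorkingSetBytes")
    by bin(TimeGenerated, 5m), Computer
| render timechart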
// Pod restart tracking (latest record per pod)
KubePodInventory
| where TimeGenerated > ago(24h)
| where PodRestartCount > 0
| summarize arg_max(TimeGenerated, PodRestartCount, ContainerStatus) by Name, Namespace
| order by PodRestartCount desc
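// HTTP 5xx error rate
// Assumes access-log lines such as: "GET / HTTP/1.1" 503 ... (adjust the pattern to your log format)
ContainerLog
| where LogEntry contains "HTTP"
| where LogEntry matches regex @"\s5\d{2}(\s|$)"
| summarize ErrorCount = count() by bin(TimeGenerated, 5m)
| render timechart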
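The HTTP error-rate query counts matching log lines per 5-minute bin. The Python sketch below reproduces that idea locally, which can be handy for checking a match pattern against sample lines before running it on a workspace. It assumes an access-log shape like `"GET /path HTTP/1.1" 503 ...`. The names `STATUS_5XX`, `floor_to_bin`, and `count_5xx` and the sample data are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed access-log shape: '"GET /path HTTP/1.1" 503 ...'
# The status code must directly follow the quoted request line.
STATUS_5XX = re.compile(r'HTTP/\d(?:\.\d)?"\s+(5\d{2})\b')


def floor_to_bin(ts, minutes=5):
    """Round a timestamp down to the start of its bin.

    Comparable to bin(TimeGenerated, 5m) for bin sizes that divide 60.
    """
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def count_5xx(entries, minutes=5):
    """entries: iterable of (datetime, log_line). Returns {bin_start: count}."""
    counts = Counter()
    for ts, line in entries:
        if STATUS_5XX.search(line):
            counts[floor_to_bin(ts, minutes)] += 1
    return dict(counts)


sample = [
    (datetime(2024, 1, 1, 12, 1, 10), '"GET /api HTTP/1.1" 500 12'),
    (datetime(2024, 1, 1, 12, 3, 45), '"GET /api HTTP/1.1" 200 512'),  # not an error
    (datetime(2024, 1, 1, 12, 4, 59), '"POST /api HTTP/1.1" 503 0'),
    (datetime(2024, 1, 1, 12, 7, 0), '"GET /health HTTP/2" 502 0'),
]
for bin_start, n in sorted(count_5xx(sample).items()):
    print(bin_start.isoformat(), n)
# 2024-01-01T12:00:00 2
# 2024-01-01T12:05:00 1
```

The second sample line returns status 200 with a 512-byte body. A bare substring check for "5" would count it as an error, which is why the sketch anchors the match to the status-code position.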
Deploying Grafana on AKS
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.4.2
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin-password
            - name: GF_AUTH_AZUREAD_ENABLED
              value: "true"
            - name: GF_AUTH_AZUREAD_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: azure-client-id
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
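The Deployment reads its credentials from a Secret named `grafana-secrets`, which must exist in the `monitoring` namespace before the pod can start. As a minimal sketch, the Python below builds such a Secret manifest with the two keys the Deployment references. The helper name `build_secret` and the placeholder values are assumptions for illustration. The `grafana-pvc` claim is not covered here and would need its own manifest.

```python
import base64
import json


def build_secret(name, namespace, values):
    """Build a Kubernetes Secret manifest.

    Values under `data` must be base64-encoded strings.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


secret = build_secret(
    "grafana-secrets",
    "monitoring",
    {
        "admin-password": "change-me",            # placeholder, not a real credential
        "azure-client-id": "<azure-client-id>",   # placeholder
    },
)
# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(secret, indent=2))
```

Saved under a hypothetical name such as `make_secret.py`, the output could be piped to `kubectl apply -f -`. In practice, real values should come from a secret store or environment variables, not from source code.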
Azure Monitor Workbooks
Create rich interactive reports with Azure Monitor Workbooks:
{
  "version": "Notebook/1.0",
  "items": [
    {
      "type": 9,
      "content": {
        "version": "KqlParameterItem/1.0",
        "parameters": [
          {
            "id": "timeRange",
            "version": "KqlParameterItem/1.0",
            "name": "TimeRange",
            "type": 4,
            "value": {
              "durationMs": 3600000
            }
          }
        ]
      }
    },
    {
      "type": 3,
      "content": {
        "version": "KqlItem/1.0",
        "query": "KubeNodeInventory | summarize count() by Status",
        "size": 1,
        "chartType": 3
      }
    }
  ]
}
Setting Up Alerts and Notifications
Action Groups
# Create action group with email and Teams webhook
az monitor action-group create \
--name "DevOps Team" \
--resource-group myResourceGroup \
--short-name "devops" \
--action email oncall devops@company.com \
--action webhook teams https://company.webhook.office.com/webhookb2/...
Best Practices
- Set meaningful thresholds - Base alerts on baselines, not guesses
- Use dynamic thresholds - Let Azure Monitor's built-in machine learning learn normal behavior from historical data
- Prevent alert fatigue - Use alert processing rules (formerly action rules) to suppress noise
- Create runbooks - Link alerts to remediation documentation
- Review regularly - Tune or retire noisy alerts monthly
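As an illustration of basing a threshold on a baseline, the sketch below derives a static threshold from historical samples as the mean plus three standard deviations. This is a simple heuristic chosen for the example, not the model Azure Monitor uses for dynamic thresholds. The function name `baseline_threshold` and the sample data are hypothetical.

```python
from statistics import mean, stdev


def baseline_threshold(history, sigmas=3.0):
    """Static threshold from history: mean + sigmas * sample standard deviation."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return mean(history) + sigmas * stdev(history)


# Hypothetical hourly CPU averages (percent), shortened for the example.
history = [40, 42, 38, 41, 39, 40, 43, 37]
threshold = baseline_threshold(history)
print(f"alert when CPU > {threshold:.1f}%")  # → alert when CPU > 46.0%
```

For this workload, a fixed 80% threshold would only fire long after CPU had left its normal range of roughly 37 to 43%. A baseline-derived value flags the deviation much earlier.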
Conclusion
A monitoring strategy that combines Azure Monitor, Log Analytics, and Grafana gives you broad visibility into your Azure infrastructure. Together, metrics, logs, and traces help you detect, diagnose, and resolve issues quickly.