[monitoring] Add KubeVirt VM dashboards#2198
mattia-eleuteri wants to merge 1 commit into cozystack:main from mattia-eleuteri:monitoring/vm-dashboards
Conversation
📝 Walkthrough
Adds two new Grafana dashboards for VM monitoring: a "VM Console Logs" logs panel and a comprehensive "Virtual Machines" Prometheus-backed dashboard; registers both in the monitoring dashboards list.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
dashboards/vm/virtual-machines.json (1)
120-145: Clarify whether this is CPU % or CPU cores.
These panels plot raw rate(..._seconds_total) values, so they scale with the number of busy vCPUs. Rendering them as percentunit makes the result ambiguous for multi-vCPU guests. Either normalize by vCPU count before formatting as %, or rename/unit the panels as cores.
Also applies to: 375-414
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dashboards/vm/virtual-machines.json` around lines 120 - 145, The "Average CPU" stat panel is using avg(rate(kubevirt_vmi_cpu_usage_seconds_total...)) but is labeled/unit'd as "percentunit", which is incorrect for raw CPU-seconds (it yields cores * time); update the panel to either (A) normalize the expression by the vCPU count metric (e.g., divide by a vCPU count metric such as kubevirt_vmi_vcpu_count or another available label) and keep the unit as percentunit if you convert to usage fraction*100, or (B) change the unit from "percentunit" to a core-based unit and rename the "Average CPU" title to "Average vCPUs" (and apply the same change to the other similar panels referenced at lines ~375-414) so the legend/unit matches the raw avg(rate(...)) output.
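As a sketch of option (A), the normalization could look like the query below. The vCPU-time metric name and its state label are assumptions to verify against the metrics this KubeVirt version actually exports:

```promql
# Fraction (0-1) of allocated vCPU time in use; pairs with the "percentunit" unit.
# Assumes kubevirt_vmi_vcpu_seconds_total{state="total"} counts allocated vCPU-seconds.
avg(
  rate(kubevirt_vmi_cpu_usage_seconds_total{namespace="$namespace", name=~"$vm"}[$__rate_interval])
  / on(name, namespace)
  sum by (name, namespace) (
    rate(kubevirt_vmi_vcpu_seconds_total{namespace="$namespace", name=~"$vm", state="total"}[$__rate_interval])
  )
)
```

The sum by on the divisor collapses any per-vCPU series so the one-to-one vector match succeeds.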
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dashboards/vm/console-logs.json`:
- Around line 21-30: The "namespace" textbox field currently named "namespace"
with type "textbox" and default/current value "*" allows users to override
tenant scope and causes cluster-wide log queries; fix by removing or converting
this field to a non-editable fixed value (e.g., change type from "textbox" to
"const"/hidden and set a tenant-specific namespace value) or remove it and
enforce namespace via the data source or RBAC so the query referenced (the
namespace filter used in the panel query) cannot be set to "*" by users; update
the "namespace" field definition accordingly and ensure any panel queries use
the enforced namespace variable rather than the editable textbox.
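A minimal sketch of the non-editable variant, assuming the tenant namespace can be baked in at dashboard-provisioning time (tenant-foo is a placeholder value, and hide: 2 is Grafana's "hidden variable" setting):

```json
{
  "name": "namespace",
  "type": "constant",
  "hide": 2,
  "query": "tenant-foo",
  "current": { "selected": false, "text": "tenant-foo", "value": "tenant-foo" }
}
```

This only hardens the dashboard itself; a user with raw datasource access can still query other namespaces, so datasource-level or RBAC enforcement remains the real boundary.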
In `@dashboards/vm/virtual-machines.json`:
- Around line 194-220: The panel queries (refId net_rx, net_tx, iops_r, iops_w)
produce one series per interface/drive and must be aggregated before the
downstream table merge: wrap each rate(...) expression with sum by (name)(...)
so the PromQL becomes sum by
(name)(rate(kubevirt_vmi_network_receive_bytes_total{...}[$__rate_interval]))
(and similarly for net_tx and for iops_r/iops_w while keeping the
drive!="cloudinitdisk" filter), leaving legendFormat, refId and instant settings
unchanged.
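The suggested rewrite for two of those targets might look like the sketch below; metric and label names are taken from the comment above, and the exact selectors should be checked against the dashboard's existing filters:

```promql
# net_rx: one series per VM instead of one per interface
sum by (name) (
  rate(kubevirt_vmi_network_receive_bytes_total{namespace="$namespace", name=~"$vm"}[$__rate_interval])
)

# iops_r: same shape, keeping the cloud-init disk excluded
sum by (name) (
  rate(kubevirt_vmi_storage_iops_read_total{namespace="$namespace", name=~"$vm", drive!="cloudinitdisk"}[$__rate_interval])
)
```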
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a55ed75d-9122-4948-bb8c-1deedc257c76
📒 Files selected for processing (3)
- dashboards/vm/console-logs.json
- dashboards/vm/virtual-machines.json
- packages/system/monitoring/dashboards.list
Actionable comments posted: 1
♻️ Duplicate comments (1)
dashboards/vm/console-logs.json (1)
29-40: ⚠️ Potential issue | 🟠 Major: Namespace filtering still isn't an isolation boundary.
The textbox wildcard is gone, but namespace is still user-selectable via label_values(..., namespace). The filter on Line 74 only constrains the query to the chosen namespace; it does not stop a user from switching to another namespace that the datasource can see. If tenant isolation is part of this dashboard's contract, scope still needs to be enforced outside this variable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dashboards/vm/console-logs.json` around lines 29 - 40, The "namespace" dashboard variable (query "label_values(kubevirt_vmi_memory_available_bytes, namespace)") is still user-selectable and thus does not enforce tenant isolation; replace this dynamic label_values source with a tenant-scoped constant (or remove the variable entirely) and make all panels use the injected tenant namespace instead of the "namespace" variable. Concretely: stop populating "namespace" from label_values(kubevirt_vmi_memory_available_bytes, namespace), either hardcode the tenant namespace or bind it to a secure injected variable like ${tenant_namespace}, and update every panel/filter that currently references the "namespace" variable (the panel filter mentioned around the existing query on Line 74) to use the tenant-scoped value so users cannot switch to other namespaces. Ensure datasource permission-based scoping or external enforcement is used if per-request injection is required.
🧹 Nitpick comments (1)
dashboards/vm/console-logs.json (1)
29-57: Use a selector source that survives VM shutdown.
Lines 36 and 50 drive both dropdowns from kubevirt_vmi_memory_available_bytes, so a VM disappears from the selector as soon as it stops exporting that metric, even if its recent console logs still exist. For a log dashboard, it's safer to back these variables with a source that outlives the VM instead of a live memory series.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dashboards/vm/console-logs.json` around lines 29 - 57, The namespace and vm template variables ("namespace" and "vm") use the ephemeral metric kubevirt_vmi_memory_available_bytes so VMs drop out when that metric stops; change both queries to use a stable selector such as kubevirt_vmi_info or kubevirt_vmi_labels (e.g. label_values(kubevirt_vmi_info, namespace) and label_values(kubevirt_vmi_info{namespace="$namespace"}, name)) instead of kubevirt_vmi_memory_available_bytes so selectors persist after VM shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dashboards/vm/virtual-machines.json`:
- Around line 577-655: The network charts use per-interface Prometheus metrics
(e.g., kubevirt_vmi_network_receive_bytes_total,
kubevirt_vmi_network_transmit_bytes_total,
kubevirt_vmi_network_receive_packets_total,
kubevirt_vmi_network_transmit_packets_total,
kubevirt_vmi_network_receive_errors_total,
kubevirt_vmi_network_transmit_errors_total,
kubevirt_vmi_network_receive_packets_dropped_total,
kubevirt_vmi_network_transmit_packets_dropped_total) but their legendFormat only
shows {{ name }}, collapsing multiple NIC series; either aggregate each query
(wrap with sum by (name)(...)) to show per-VM totals, or preserve per-interface
visibility by updating the legendFormat for all affected targets (Inbound
Throughput, Outbound Throughput (TX), Packets/s, Errors & Drops) to include the
interface label (e.g., change "{{ name }}" to "{{ name }} {{ interface }}" or
"{{ name }} {{ interface }} RX/TX" as appropriate) and apply the same change to
the errors/drops targets.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 74986ee1-b4b3-4f5d-bfca-4b09402aab36
📒 Files selected for processing (2)
- dashboards/vm/console-logs.json
- dashboards/vm/virtual-machines.json
Force-pushed from 60d258b to bcf42b3
Actionable comments posted: 1
♻️ Duplicate comments (1)
dashboards/vm/virtual-machines.json (1)
577-655: ⚠️ Potential issue | 🟡 Minor: Network legends still collapse multi-NIC series.
These targets are still per-interface, but the legends only show the VM and direction. On multi-NIC VMs Grafana will plot multiple indistinguishable lines. Either aggregate to per-VM totals with sum by (name) or include {{ interface }} in every RX/TX/error/drop legend.
Example fix if you want to keep per-interface visibility:
- "legendFormat": "{{ name }}"
+ "legendFormat": "{{ name }} ({{ interface }})"
- "legendFormat": "{{ name }} RX"
+ "legendFormat": "{{ name }} RX ({{ interface }})"
- "legendFormat": "{{ name }} TX"
+ "legendFormat": "{{ name }} TX ({{ interface }})"
- "legendFormat": "{{ name }} RX errors"
+ "legendFormat": "{{ name }} RX errors ({{ interface }})"
- "legendFormat": "{{ name }} TX errors"
+ "legendFormat": "{{ name }} TX errors ({{ interface }})"
- "legendFormat": "{{ name }} RX dropped"
+ "legendFormat": "{{ name }} RX dropped ({{ interface }})"
- "legendFormat": "{{ name }} TX dropped"
+ "legendFormat": "{{ name }} TX dropped ({{ interface }})"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dashboards/vm/virtual-machines.json` around lines 577 - 655, The panel legends for network panels ("Outbound Throughput (TX)", "Packets/s", "Errors & Drops") collapse multi-NIC series because legendFormat uses only "{{ name }}"/direction; update each target in those panels to either aggregate per-VM (wrap the PromQL in sum by (name) e.g. sum by(name)(rate(...))) or keep per-interface visibility by adding the interface label to the legend (change legendFormat to include "{{ interface }}" such as "{{ name }} {{ interface }} TX"/"RX"/"{{ interface }} RX errors" etc); modify the target expressions and/or legendFormat in the targets for kubevirt_vmi_network_receive_*, kubevirt_vmi_network_transmit_*, kubevirt_vmi_network_*_packets_dropped_total, and kubevirt_vmi_network_*_errors_total accordingly.
🧹 Nitpick comments (1)
dashboards/vm/virtual-machines.json (1)
132-143: Revisit CPU thresholds now that these panels show cores.
The dashboard now renders CPU as short, but the cutoffs stayed at 0.7/0.9 and 0.5/0.8. Those values behave like leftover percentage thresholds and will color ordinary load as saturation on larger VMs. If the intent is utilization, normalize by allocated vCPUs; otherwise drop the color thresholds.
Also applies to: 278-294
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dashboards/vm/virtual-machines.json` around lines 132 - 143, The CPU panels set fieldConfig.defaults.unit to "short" but keep thresholds.steps values like 0.7/0.9 and 0.5/0.8 (percentage-style), which will miscolor load when panels show cores; update the panels that define fieldConfig.defaults.decimals and fieldConfig.defaults.thresholds.steps (the entries with values 0.7/0.9 and 0.5/0.8, also present at the other occurrence around the 278-294 region) to either (A) normalize the threshold values by allocated vCPUs (compute threshold = desired utilization * vCPU count so thresholds become absolute core counts), (B) change fieldConfig.defaults.unit to "percent" and keep the fractional thresholds, or (C) remove/disable thresholds.steps entirely if no coloring is desired; apply the chosen fix consistently to both occurrences and ensure decimals remain appropriate for the selected unit.
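If option (B) from the comment above is chosen, the field config might look like this sketch; the fractional thresholds only make sense once the query itself is normalized to 0-1 utilization:

```json
"fieldConfig": {
  "defaults": {
    "unit": "percentunit",
    "decimals": 1,
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 0.7 },
        { "color": "red", "value": 0.9 }
      ]
    }
  }
}
```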
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dashboards/vm/virtual-machines.json`:
- Around line 423-453: The Grafana panels using
rate(kubevirt_vmi_vcpu_wait_seconds_total[...] ) and
rate(kubevirt_vmi_vcpu_delay_seconds_total[...] ) are labeled with unit "s"
which is incorrect because rate() yields seconds/second (dimensionless); update
the panels (identified by title "vCPU Wait" and "vCPU Delay (Steal Time)" or the
target expressions rate(kubevirt_vmi_vcpu_wait_seconds_total{...}) and
rate(kubevirt_vmi_vcpu_delay_seconds_total{...})) to use a dimensionless unit
such as "short" or an explicit "s/s" (or alternatively change the query to
increase(...)/interval if you really want elapsed seconds over the range), and
adjust the fieldConfig.defaults.unit from "s" to the chosen unit so axis and
tooltips display correctly.
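For illustration: rate() of a _seconds_total counter divides seconds of wait by seconds of wall time, so the result is a dimensionless fraction, not an elapsed duration. A minimal sketch of the field-config change, keeping everything else on those panels untouched:

```json
"fieldConfig": {
  "defaults": {
    "unit": "short"
  }
}
```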
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dd64ca1f-3810-462b-b6fc-ef2a53608e7d
📒 Files selected for processing (3)
- dashboards/vm/console-logs.json
- dashboards/vm/virtual-machines.json
- packages/system/monitoring/dashboards.list
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/system/monitoring/dashboards.list
- dashboards/vm/console-logs.json
Hey @mattia-eleuteri, could you please fix the conflict after merging
Signed-off-by: Mattia Eleuteri <mattia@hidora.io>
Signed-off-by: mattia-eleuteri <mattia@hidora.io>
Force-pushed from bcf42b3 to 46104de
Rebased on upstream/main — conflict with #2197 in
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dashboards/vm/virtual-machines.json`:
- Around line 127-134: The panel is using unit "percentunit" while the PromQL
expr returns CPU cores (seconds/second) via
kubevirt_vmi_cpu_usage_seconds_total, which makes multi-vCPU VMs show >100%;
update the panel to either change the fieldConfig.defaults.unit from
"percentunit" to a generic unit like "short" and update the legend/text to
indicate "cores", or normalize the query to produce a 0–1 percentage by dividing
rate(kubevirt_vmi_cpu_usage_seconds_total{...}[$__rate_interval]) by a vCPU
total metric such as kubevirt_vmi_vcpu_seconds_total{state="total"} (and keep
percentunit only if you perform that normalization).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 38339b5d-6bf4-4f6e-a2f8-30b96e01da1c
📒 Files selected for processing (3)
- dashboards/vm/console-logs.json
- dashboards/vm/virtual-machines.json
- packages/system/monitoring/dashboards.list
🚧 Files skipped from review as they are similar to previous changes (1)
- dashboards/vm/console-logs.json
"expr": "avg(rate(kubevirt_vmi_cpu_usage_seconds_total{namespace=\"$namespace\", name=~\"$vm\"}[$__rate_interval]))",
"legendFormat": ""
}
],
"options": { "colorMode": "value", "graphMode": "area", "textMode": "value" },
"fieldConfig": {
  "defaults": {
    "unit": "percentunit",
CPU percentunit may be misleading for multi-vCPU VMs.
rate(kubevirt_vmi_cpu_usage_seconds_total) returns CPU cores consumed (seconds/second), not a 0-1 ratio. For a VM with 4 vCPUs at full utilization, this returns ~4.0, which Grafana displays as "400%" with percentunit.
Consider using short with a label like "cores" or normalize by vCPU count if you truly want percentage:
rate(kubevirt_vmi_cpu_usage_seconds_total{...}[$__rate_interval])
/ on(name, namespace) kubevirt_vmi_vcpu_seconds_total{state="total"}
Alternatively, adjust thresholds to reflect core counts rather than percentages.
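If the cores interpretation is kept instead of normalizing, a sketch of the retitled stat-panel fragment with a neutral unit could be (title wording is a suggestion, not from the dashboard):

```json
{
  "title": "Average vCPUs in Use",
  "fieldConfig": {
    "defaults": {
      "unit": "short",
      "decimals": 2
    }
  }
}
```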
Summary
- dashboards/vm/virtual-machines.json — KubeVirt VM resource metrics (CPU, memory, network, storage) with namespace and VM template variables
- dashboards/vm/console-logs.json — VM serial console log viewer via VictoriaLogs with namespace filtering
- Adds vm/virtual-machines and vm/console-logs entries to dashboards.list
Problem
dashboards.list references vm/virtual-machines and vm/console-logs but the corresponding JSON files don't exist.
The console-logs dashboard includes kubernetes_namespace_name filtering to prevent cross-tenant log exposure.
Relates to #2194
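For context, the registration change is presumably two added lines in packages/system/monitoring/dashboards.list. The format of that file is not shown here, so this sketch assumes one dashboard path per line:

```
vm/virtual-machines
vm/console-logs
```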
Test plan
🤖 Generated with Claude Code