Redundant MetricsCapture in trace_call produces orphan metrics with incomplete resource labels #16173


@waiho-gumloop

Environment details

  • OS type and version: macOS / Linux
  • Python version: 3.13
  • google-cloud-spanner version: 3.63.0 (current main)

Description

Every Spanner operation that goes through trace_call() produces orphan OpenTelemetry metric data points with incomplete resource labels (missing project_id and instance_id). These orphan data points persist for the process lifetime due to cumulative aggregation and are re-exported to Cloud Monitoring every 60 seconds, which rejects them with:

INVALID_ARGUMENT: One or more TimeSeries could not be written:
timeSeries[...]: the set of resource labels is incomplete, missing (instance_id)

Root cause

trace_call() in _opentelemetry_tracing.py wraps every operation with a bare MetricsCapture() (no resource_info). Meanwhile, every caller of trace_call already provides its own MetricsCapture(self._resource_info) with correct labels.

When Python evaluates with trace_call(...) as span, MetricsCapture(self._resource_info):, it enters the two context managers from left to right, so two separate MetricsTracer instances are created:

  1. tracer_A (from trace_call's internal MetricsCapture()): has instance_config, location, client_hash, client_uid, client_name from the factory, but never receives project_id or instance_id
  2. tracer_B (from the caller's MetricsCapture(resource_info)): has correct labels, overwrites tracer_A in the context var

Context managers exit in reverse order, so tracer_B records correct metrics first, then tracer_A records metrics with incomplete labels. The SpannerMetricsTracerFactory never has project_id/instance_id in its _client_attributes (they are only set per tracer, via resource_info or MetricsInterceptor), so tracer_A always starts without them. It is never populated afterwards because the MetricsInterceptor only touches the current context-var tracer (tracer_B).
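The interaction described above can be reproduced without Spanner at all. The following is a minimal simulation, not the library's code: FakeTracer, FakeMetricsCapture, fake_trace_call, fake_interceptor, and the label values are all hypothetical stand-ins that only mimic the behavior reported in this issue.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical stand-in for the context variable holding the current tracer.
current_tracer = contextvars.ContextVar("current_tracer", default=None)

# Label sets of every data point recorded, in recording order.
recorded = []


class FakeTracer:
    def __init__(self, labels):
        self.labels = dict(labels)


class FakeMetricsCapture:
    """Simplified model of MetricsCapture (assumed behavior, not the real class)."""

    # Labels assumed to come from the factory; no project_id/instance_id here.
    FACTORY_LABELS = {"client_name": "demo-client", "location": "global"}

    def __init__(self, resource_info=None):
        self.resource_info = resource_info or {}
        self.tracer = None

    def __enter__(self):
        self.tracer = FakeTracer({**self.FACTORY_LABELS, **self.resource_info})
        # Overwrites whatever tracer was previously in the context var.
        current_tracer.set(self.tracer)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Each capture records with its own tracer's labels on exit.
        recorded.append(dict(self.tracer.labels))
        return False


@contextmanager
def fake_trace_call():
    # Models the bare MetricsCapture() inside trace_call (tracer_A).
    with FakeMetricsCapture():
        yield "span"


def fake_interceptor(project_id, instance_id):
    # Models an interceptor that only touches the current context-var tracer.
    current_tracer.get().labels.update(
        project_id=project_id, instance_id=instance_id
    )


resource_info = {"project_id": "my-project", "instance_id": "my-instance"}
with fake_trace_call() as span, FakeMetricsCapture(resource_info):  # tracer_B
    fake_interceptor("my-project", "my-instance")

for labels in recorded:
    print(sorted(labels))
```

In this model the first recorded point (tracer_B) carries project_id and instance_id, while the second (tracer_A) carries only the factory labels, which matches the orphan data points described above.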

With OpenTelemetry's cumulative aggregation, once these orphan aggregation buckets are created, they persist for the process lifetime and are re-exported every 60 seconds.
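To illustrate why a single orphan recording keeps producing errors, here is a simplified model of cumulative aggregation. It is a sketch under stated assumptions, not the OpenTelemetry SDK: CumulativeCounter, export, and REQUIRED_LABELS are hypothetical names, and the rejection rule only imitates the "set of resource labels is incomplete" check quoted earlier.

```python
# Labels the (simulated) backend requires on every time series.
REQUIRED_LABELS = {"project_id", "instance_id"}


class CumulativeCounter:
    """Toy cumulative counter: one running total per distinct attribute set."""

    def __init__(self):
        self._totals = {}

    def add(self, amount, attributes):
        key = frozenset(attributes.items())
        self._totals[key] = self._totals.get(key, 0) + amount

    def collect(self):
        # Cumulative behavior: every bucket ever created is reported on
        # every collection; nothing is reset or dropped between cycles.
        return [(dict(key), total) for key, total in self._totals.items()]


def export(points):
    """Return the attribute sets the simulated backend would reject."""
    return [attrs for attrs, _ in points if not REQUIRED_LABELS.issubset(attrs)]


counter = CumulativeCounter()
counter.add(1, {"project_id": "my-project", "instance_id": "my-instance"})
counter.add(1, {"client_name": "demo-client"})  # orphan point, recorded once

# Three export cycles with no new recordings in between.
rejections_per_cycle = [len(export(counter.collect())) for _ in range(3)]
print(rejections_per_cycle)
```

The orphan bucket is recorded only once, yet it is rejected again on every export cycle, because the bucket itself never goes away.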

History

Impact

  • Affects every Spanner operation (~27 code paths) on every invocation
  • Creates persistent orphan metric aggregation buckets
  • Produces repeated INVALID_ARGUMENT error logs every 60 seconds
  • Wastes CPU/network on exporting invalid TimeSeries
  • Application functionality is unaffected; valid metrics from the caller's MetricsCapture still work

Steps to reproduce

  1. Create a spanner.Client() with metrics enabled (default)
  2. Perform any Spanner operation (e.g., session.create(), snapshot.execute_sql())
  3. Observe INVALID_ARGUMENT errors logged from the metrics exporter every 60 seconds

Suggested fix

Remove the bare MetricsCapture() from trace_call. It is redundant, since every caller already provides its own. See PR googleapis/python-spanner#1522.
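The effect of the suggested change can be sketched with a small before/after comparison. This is an illustrative model only: capture, trace_call_before, trace_call_after, and incomplete_points are hypothetical helpers, and the real change is the one in the referenced PR.

```python
from contextlib import contextmanager

REQUIRED_LABELS = {"project_id", "instance_id"}
FACTORY_LABELS = {"client_name": "demo-client"}  # assumed factory-level labels


@contextmanager
def capture(sink, resource_info=None):
    """Hypothetical stand-in for MetricsCapture: records one point on exit."""
    labels = {**FACTORY_LABELS, **(resource_info or {})}
    try:
        yield
    finally:
        sink.append(labels)


@contextmanager
def trace_call_before(sink):
    # Current behavior as described in this issue: an extra bare capture.
    with capture(sink):
        yield "span"


@contextmanager
def trace_call_after(sink):
    # Suggested behavior: no internal capture; the caller's capture suffices.
    yield "span"


def incomplete_points(trace_call):
    """Run one simulated operation and return points missing required labels."""
    sink = []
    info = {"project_id": "my-project", "instance_id": "my-instance"}
    with trace_call(sink) as span, capture(sink, info):
        pass  # the Spanner operation would run here
    return [p for p in sink if not REQUIRED_LABELS.issubset(p)]


print(len(incomplete_points(trace_call_before)))
print(len(incomplete_points(trace_call_after)))
```

A check along the lines of incomplete_points could also serve as a regression test: after the change, no recorded data point should lack project_id or instance_id.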

Labels

api: spanner (Issues related to the Spanner API.)
