Research a combined scoring system for catalog search #5517

@FuhuXia

Description


Scoring Overview

  • This proposal ticket aims to develop a comprehensive scoring system for catalog search results. It blends four PostgreSQL-derived signals (rank, similarity, popularity, and freshness) into a single score, so search works well even when the UI doesn't expose manual sorting, or where most users are unlikely to adjust advanced sorting options. Each signal is tunable, letting ops adjust weights until results feel right. All data already lives in indexed columns or computed expressions, so adding or removing factors has negligible performance impact.

  • Final score: rank + similarity + popularity + freshness. Each term is either 0–1 or
    scaled down by a weight, so the combined value stays interpretable.
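The additive combination can be sketched in Python. This is an illustrative helper, not the production SQL expression; it assumes each term has already been normalized or weight-capped upstream, per the bullets above:

```python
def combined_score(rank: float, similarity: float,
                   popularity: float, freshness: float) -> float:
    # Plain sum: rank and similarity are in [0, 1], while popularity
    # and freshness are pre-scaled by their weights, so the total
    # stays interpretable.
    return rank + similarity + popularity + freshness
```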

Rank (0–1)

  • Generated by ts_rank_cd(search_vector, query, 32) over our weighted TSVECTOR.
    Normalization option 32 squashes scores into [0,1).
  • We control emphasis by changing setweight tiers in Dataset.search_vector. Fields tagged
    'A' matter most, 'C' least.
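The effect of normalization option 32 can be mimicked in Python. This is a sketch of the documented rank / (rank + 1) transform that flag 32 applies, not a reimplementation of `ts_rank_cd` itself; `raw_rank` stands for the unnormalized cover-density rank:

```python
def normalize_rank(raw_rank: float) -> float:
    # PostgreSQL's ts_rank_cd(..., 32) divides the rank by itself
    # plus one, mapping any non-negative raw rank into [0, 1).
    return raw_rank / (raw_rank + 1)
```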

Similarity (0–1)

  • Computed via trigram similarity (pg_trgm) on dataset titles. Identical strings score
    1.0; unrelated strings trend toward 0.
  • FUZZY_THRESHOLD filters out weak matches so only meaningful overlaps reach scoring.
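A rough Python approximation of pg_trgm-style trigram similarity, with the FUZZY_THRESHOLD cutoff applied. The real extension is implemented in C and handles more edge cases; the padding scheme below mirrors its word-level trigram extraction, and the threshold value is a placeholder, not the production setting:

```python
import re


def trigrams(text: str) -> set[str]:
    # Approximate pg_trgm: lowercase, split into alphanumeric words,
    # pad each word with two leading and one trailing space, then
    # collect every 3-character window.
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    # Shared trigrams divided by the union, as in pg_trgm's similarity().
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


FUZZY_THRESHOLD = 0.3  # placeholder cutoff; weaker matches are dropped


def similarity_term(a: str, b: str) -> float:
    s = similarity(a, b)
    return s if s >= FUZZY_THRESHOLD else 0.0
```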

Popularity (0–POPULARITY_WEIGHT)

  • Based on the last two weeks of Google Analytics hits. We normalize hits by
    POPULARITY_SCALE (e.g., 5,000 for top traffic) and multiply by POPULARITY_WEIGHT.
  • Formula: popularity = (hits / POPULARITY_SCALE) * POPULARITY_WEIGHT. With a 25% weight, even the hottest dataset contributes at most 0.25.
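A Python sketch of the stated formula. The `min()` cap is an assumption implied by "even the hottest dataset contributes at most 0.25"; the scale and weight values come from the bullets above and are tunable:

```python
POPULARITY_SCALE = 5000   # assumed: hits at/above this count as top traffic
POPULARITY_WEIGHT = 0.25  # assumed: popularity contributes at most 0.25


def popularity(hits: int) -> float:
    # Normalize two-week GA hits to [0, 1] (capped), then scale by the weight.
    return min(hits / POPULARITY_SCALE, 1.0) * POPULARITY_WEIGHT
```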

Freshness (0–FRESHNESS_WEIGHT)

  • Derived from harvest/update timestamps: freshness_base = 0.5 ** (age_in_days /
    half_life_days).
  • half_life_days controls how fast freshness fades (14 → half-value every two weeks).
    Multiply by FRESHNESS_WEIGHT to cap its score impact, e.g., 5% weight → max 0.05.
  • Full term: freshness = freshness_base * FRESHNESS_WEIGHT.
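The half-life decay can be sketched directly from the formulas above; the weight and half-life values are the examples given in the bullets, not fixed settings:

```python
FRESHNESS_WEIGHT = 0.05  # assumed: 5% weight caps the term at 0.05
HALF_LIFE_DAYS = 14      # freshness halves every two weeks


def freshness(age_in_days: float) -> float:
    # Exponential half-life decay, scaled by the weight cap.
    return (0.5 ** (age_in_days / HALF_LIFE_DAYS)) * FRESHNESS_WEIGHT
```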

Operational Notes

  • Treat the scorer as a black box for end users but keep tuning knobs (FUZZY_THRESHOLD, POPULARITY_WEIGHT, POPULARITY_SCALE, half_life_days, FRESHNESS_WEIGHT) available to ops for balancing relevance vs. recency vs. traffic.
  • For advanced users, in addition to the default sorting by the comprehensive score, we also provide alternative sorting options:
    • -- – default (comprehensive score)
    • Relevant – based on ranking and similarity
    • Popular – based on dataset popularity
    • Title – based on dataset slug
    • Harvested Date – based on the most recent harvest date
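The tuning knobs listed above could be grouped in a single mapping for ops. Names follow the ticket; the values here are illustrative placeholders, not recommended settings:

```python
# Hypothetical config grouping the ticket's tuning knobs.
SCORING_CONFIG = {
    "FUZZY_THRESHOLD": 0.3,     # placeholder similarity cutoff
    "POPULARITY_WEIGHT": 0.25,  # example from the Popularity section
    "POPULARITY_SCALE": 5000,   # example top-traffic hit count
    "FRESHNESS_WEIGHT": 0.05,   # example from the Freshness section
    "half_life_days": 14,       # half-value every two weeks
}
```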

Metadata

Projects: 🌈 Catalog UI 60 Day Project