Scoring Overview
- This proposal ticket aims to develop a comprehensive scoring system for catalog search results, blending four PostgreSQL-derived signals, rank + similarity + popularity + freshness, into a single score so search works well even when the UI doesn’t expose manual sorting, or for cases where most users are unlikely to adjust advanced sorting options. Each signal is tunable, letting ops adjust weights until results feel right. All data already lives in indexed columns or computed expressions, so adding/removing factors has negligible performance impact.
- Final score: rank + similarity + popularity + freshness. Each term is either 0–1 or scaled down by a weight, so the combined value stays interpretable.
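As a rough illustration of how the four terms combine, here is a minimal sketch. The function names and the default weights (0.25 and 0.05, taken from the examples in this ticket) are assumptions for illustration only, not the actual implementation.

```python
# Illustrative weights only; the real values are tunable by ops.
POPULARITY_WEIGHT = 0.25
FRESHNESS_WEIGHT = 0.05


def final_score(rank, similarity, popularity, freshness):
    """Sum the four signals; popularity and freshness arrive already weighted."""
    return rank + similarity + popularity + freshness


def max_score(popularity_weight=POPULARITY_WEIGHT, freshness_weight=FRESHNESS_WEIGHT):
    """Upper bound: rank and similarity are at most 1, the other two at most their weight."""
    return 1.0 + 1.0 + popularity_weight + freshness_weight


if __name__ == "__main__":
    print(final_score(rank=0.4, similarity=0.8, popularity=0.1, freshness=0.025))
    print(max_score())
```

Knowing the upper bound makes it easier to judge how much a single weight change can shift the ordering.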
Rank (0–1)
- Generated by ts_rank_cd(search_vector, query, 32) over our weighted TSVECTOR. Normalization option 32 squashes scores into [0,1) by dividing each rank by itself plus 1.
- We control emphasis by changing setweight tiers in Dataset.search_vector. Fields tagged 'A' matter most, 'C' least.
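To make the effect of normalization option 32 concrete, the sketch below reproduces the rank / (rank + 1) mapping in plain Python. The helper name normalize_32 is hypothetical; in practice PostgreSQL applies this inside ts_rank_cd.

```python
def normalize_32(raw_rank: float) -> float:
    """Mimic ts_rank_cd normalization option 32: rank / (rank + 1)."""
    if raw_rank < 0:
        raise ValueError("raw_rank must be non-negative")
    return raw_rank / (raw_rank + 1.0)


if __name__ == "__main__":
    # Larger raw ranks approach 1 but never reach it.
    for raw in (0.0, 0.1, 1.0, 9.0, 99.0):
        print(raw, normalize_32(raw))
```

Because the mapping is monotonic, it changes the scale of the rank term but not the relative order of results ranked by it alone.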
Similarity (0–1)
- Computed via trigram similarity (pg_trgm) on dataset titles. Identical strings score 1.0; unrelated strings trend toward 0.
- FUZZY_THRESHOLD filters out weak matches so only meaningful overlaps reach scoring.
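For intuition about what the trigram score measures, here is a pure-Python approximation of pg_trgm's similarity: lowercase, split into alphanumeric words, pad each word with two leading spaces and one trailing space, then divide shared trigrams by total distinct trigrams. It is a sketch and not a substitute for the database function; the FUZZY_THRESHOLD value and the >= comparison are assumptions.

```python
import re

FUZZY_THRESHOLD = 0.3  # illustrative value; the real setting is tuned by ops


def trigrams(text):
    """Return the set of padded trigrams for each alphanumeric word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def passes_threshold(a, b, threshold=FUZZY_THRESHOLD):
    """Keep a match only if its similarity reaches the threshold (assumed >=)."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    print(similarity("Air Quality", "air quality"))
    print(similarity("water quality", "air quality"))
    print(passes_threshold("census tracts", "air quality"))
```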
Popularity (0–POPULARITY_WEIGHT)
- Based on the last two weeks of Google Analytics hits. We normalize hits by POPULARITY_SCALE (e.g., 5,000 for top traffic) and multiply by POPULARITY_WEIGHT.
- Formula: popularity = (hits / POPULARITY_SCALE) * POPULARITY_WEIGHT. With a 25% weight, even the hottest dataset contributes at most 0.25, provided POPULARITY_SCALE is at least the top dataset's hit count.
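The popularity formula can be sketched as follows. The clamp to 1.0 is an assumption added here so the term cannot exceed POPULARITY_WEIGHT when hits go above POPULARITY_SCALE; the ticket's formula itself does not mention clamping.

```python
POPULARITY_SCALE = 5000   # example value from the ticket
POPULARITY_WEIGHT = 0.25  # example value from the ticket


def popularity(hits, scale=POPULARITY_SCALE, weight=POPULARITY_WEIGHT, clamp=True):
    """popularity = (hits / scale) * weight, optionally capped at weight."""
    ratio = max(hits, 0) / scale
    if clamp:
        # Assumption: cap the ratio so outliers cannot exceed the weight.
        ratio = min(ratio, 1.0)
    return ratio * weight


if __name__ == "__main__":
    for hits in (0, 2500, 5000, 10000):
        print(hits, popularity(hits), popularity(hits, clamp=False))
```

Without the clamp, a dataset with 10,000 hits would contribute 0.5, twice the intended ceiling, which is why the choice of POPULARITY_SCALE matters.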
Freshness (0–FRESHNESS_WEIGHT)
- Derived from harvest/update timestamps: freshness_base = 0.5 ** (age_in_days / half_life_days).
- half_life_days controls how fast freshness fades (14 → half-value every two weeks). Multiply by FRESHNESS_WEIGHT to cap its score impact, e.g., 5% weight → max 0.05.
- Full term: freshness = freshness_base * FRESHNESS_WEIGHT.
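A small sketch of the half-life decay, using the example values from this ticket (14-day half-life, 5% weight). The function signature, and treating timestamps in the future as age 0, are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def freshness(updated_at, now, half_life_days=14.0, weight=0.05):
    """freshness = 0.5 ** (age_in_days / half_life_days) * weight."""
    age_in_days = (now - updated_at).total_seconds() / 86400.0
    age_in_days = max(age_in_days, 0.0)  # assumption: future timestamps count as brand new
    freshness_base = 0.5 ** (age_in_days / half_life_days)
    return freshness_base * weight


if __name__ == "__main__":
    now = datetime(2025, 1, 29, tzinfo=timezone.utc)
    for days in (0, 14, 28, 56):
        print(days, freshness(now - timedelta(days=days), now))
```

Each additional half-life halves the contribution, so a dataset updated 28 days ago adds a quarter of the maximum.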
Operational Notes
- Treat the scorer as a black box for end users but keep tuning knobs (FUZZY_THRESHOLD, POPULARITY_WEIGHT, POPULARITY_SCALE, half_life_days, FRESHNESS_WEIGHT) available to ops for balancing relevance vs. recency vs. traffic.
- For advanced users, in addition to the default sorting by the comprehensive scoring, we also provide alternative sorting options:
  - default
  - Relevant: based on ranking and similarity
  - Popular: based on dataset popularity
  - Title: based on dataset slug
  - Harvested Date: based on the most recent harvest date
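The alternative sort options could be wired up roughly as below. The option keys, the record fields, the sample data, and the sort directions (descending for everything except Title) are all assumptions made for this sketch; "relevant" is taken here as rank plus similarity.

```python
from datetime import date

# Hypothetical sample records; field names are illustrative only.
DATASETS = [
    {"slug": "air-quality", "rank": 0.6, "similarity": 0.9,
     "popularity": 0.05, "harvested": date(2025, 1, 10)},
    {"slug": "water-quality", "rank": 0.7, "similarity": 0.4,
     "popularity": 0.20, "harvested": date(2025, 3, 1)},
    {"slug": "census-tracts", "rank": 0.2, "similarity": 0.1,
     "popularity": 0.10, "harvested": date(2024, 12, 5)},
]

# Each option maps to (key function, descending?).
SORT_KEYS = {
    "relevant": (lambda d: d["rank"] + d["similarity"], True),
    "popular": (lambda d: d["popularity"], True),
    "title": (lambda d: d["slug"], False),
    "harvested": (lambda d: d["harvested"], True),
}


def sort_datasets(datasets, option):
    """Return a new list ordered by the chosen sort option."""
    key, descending = SORT_KEYS[option]
    return sorted(datasets, key=key, reverse=descending)


def slugs(datasets):
    """Convenience helper: extract slugs for display."""
    return [d["slug"] for d in datasets]


if __name__ == "__main__":
    for option in SORT_KEYS:
        print(option, slugs(sort_datasets(DATASETS, option)))
```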
Project: 🌈 Catalog UI 60 Day Project