Research a combined scoring system for catalog search #5517

@FuhuXia

Description


Scoring Overview

  • This proposal ticket aims to develop a comprehensive scoring system for catalog search results. It blends four PostgreSQL-derived signals (rank, similarity, popularity, and freshness) into a single score, so search works well even when the UI doesn't expose manual sorting, or where most users are unlikely to adjust advanced sorting options. Each signal is tunable, letting ops adjust weights until results feel right. All data already lives in indexed columns or computed expressions, so adding or removing factors has negligible performance impact.

  • Final score: rank + similarity + popularity + freshness. Each term is either 0–1 or
    scaled down by a weight, so the combined value stays interpretable.
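The additive combination can be sketched in Python. This is an illustrative helper, not the production SQL expression; it assumes each term has already been normalized or weight-capped upstream, per the bullets above:

```python
def combined_score(rank: float, similarity: float,
                   popularity: float, freshness: float) -> float:
    # Plain sum: rank and similarity are in [0, 1], while popularity
    # and freshness are pre-scaled by their weights, so the total
    # stays interpretable.
    return rank + similarity + popularity + freshness
```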

Rank (0–1)

  • Generated by ts_rank_cd(search_vector, query, 32) over our weighted TSVECTOR.
    Normalization option 32 squashes scores into [0,1).
  • We control emphasis by changing setweight tiers in Dataset.search_vector. Fields tagged
    'A' matter most, 'C' least.
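The effect of normalization option 32 can be mimicked in Python. This is a sketch of the documented rank / (rank + 1) transform that flag 32 applies, not a reimplementation of `ts_rank_cd` itself; `raw_rank` stands for the unnormalized cover-density rank:

```python
def normalize_rank(raw_rank: float) -> float:
    # PostgreSQL's ts_rank_cd(..., 32) divides the rank by itself
    # plus one, mapping any non-negative raw rank into [0, 1).
    return raw_rank / (raw_rank + 1)
```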

Similarity (0–1)

  • Computed via trigram similarity (pg_trgm) on dataset titles. Identical strings score
    1.0; unrelated strings trend toward 0.
  • FUZZY_THRESHOLD filters out weak matches so only meaningful overlaps reach scoring.
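A rough Python approximation of pg_trgm-style trigram similarity, with the FUZZY_THRESHOLD cutoff applied. The real extension is implemented in C and handles more edge cases; the padding scheme below mirrors its word-level trigram extraction, and the threshold value is a placeholder, not the production setting:

```python
import re


def trigrams(text: str) -> set[str]:
    # Approximate pg_trgm: lowercase, split into alphanumeric words,
    # pad each word with two leading and one trailing space, then
    # collect every 3-character window.
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    # Shared trigrams divided by the union, as in pg_trgm's similarity().
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


FUZZY_THRESHOLD = 0.3  # placeholder cutoff; weaker matches are dropped


def similarity_term(a: str, b: str) -> float:
    s = similarity(a, b)
    return s if s >= FUZZY_THRESHOLD else 0.0
```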

Popularity (0–POPULARITY_WEIGHT)

  • Based on the last two weeks of Google Analytics hits. We normalize hits by
    POPULARITY_SCALE (e.g., 5,000 for top traffic) and multiply by POPULARITY_WEIGHT.
  • Formula: popularity = (hits / POPULARITY_SCALE) * POPULARITY_WEIGHT. With a 25% weight, even the hottest dataset contributes at most 0.25.
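A Python sketch of the stated formula. The `min()` cap is an assumption implied by "even the hottest dataset contributes at most 0.25"; the scale and weight values come from the bullets above and are tunable:

```python
POPULARITY_SCALE = 5000   # assumed: hits at/above this count as top traffic
POPULARITY_WEIGHT = 0.25  # assumed: popularity contributes at most 0.25


def popularity(hits: int) -> float:
    # Normalize two-week GA hits to [0, 1] (capped), then scale by the weight.
    return min(hits / POPULARITY_SCALE, 1.0) * POPULARITY_WEIGHT
```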

Freshness (0–FRESHNESS_WEIGHT)

  • Derived from harvest/update timestamps: freshness_base = 0.5 ** (age_in_days /
    half_life_days).
  • half_life_days controls how fast freshness fades (14 → half-value every two weeks).
    Multiply by FRESHNESS_WEIGHT to cap its score impact, e.g., 5% weight → max 0.05.
  • Full term: freshness = freshness_base * FRESHNESS_WEIGHT.
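The half-life decay can be sketched directly from the formulas above; the weight and half-life values are the examples given in the bullets, not fixed settings:

```python
FRESHNESS_WEIGHT = 0.05  # assumed: 5% weight caps the term at 0.05
HALF_LIFE_DAYS = 14      # freshness halves every two weeks


def freshness(age_in_days: float) -> float:
    # Exponential half-life decay, scaled by the weight cap.
    return (0.5 ** (age_in_days / HALF_LIFE_DAYS)) * FRESHNESS_WEIGHT
```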

Operational Notes

  • Treat the scorer as a black box for end users but keep tuning knobs (FUZZY_THRESHOLD, POPULARITY_WEIGHT, POPULARITY_SCALE, half_life_days, FRESHNESS_WEIGHT) available to ops for balancing relevance vs. recency vs. traffic.
  • For advanced users, in addition to the default sorting by the comprehensive score, we also provide alternative sorting options:
    • -- – default (comprehensive score)
    • Relevant – based on ranking and similarity
    • Popular – based on dataset popularity
    • Title – based on dataset slug
    • Harvested Date – based on the most recent harvest date
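The tuning knobs listed above could be grouped in a single mapping for ops. Names follow the ticket; the values here are illustrative placeholders, not recommended settings:

```python
# Hypothetical config grouping the ticket's tuning knobs.
SCORING_CONFIG = {
    "FUZZY_THRESHOLD": 0.3,     # placeholder similarity cutoff
    "POPULARITY_WEIGHT": 0.25,  # example from the Popularity section
    "POPULARITY_SCALE": 5000,   # example top-traffic hit count
    "FRESHNESS_WEIGHT": 0.05,   # example from the Freshness section
    "half_life_days": 14,       # half-value every two weeks
}
```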

Metadata

Projects: 🌈 Catalog UI 60 Day Project