TreeSitter ideas for Crush

The thinking is that TreeSitter can help Crush build safer, smarter AI-assisted workflows.

Why TreeSitter here?

  • Structured code view: Instead of raw text, we operate on functions, types, imports, and statements.
  • Smaller, better context: Feed the LLM only the AST parts that matter, not whole files.
  • Safer edits: AST-aware searches/renames reduce accidental breakage.

What I am trying out already

  • Language loader: detect language by filename and parse with TreeSitter.
  • Query helper: run queries and return captures in document order.
  • Symbol extraction:
    • Go: top-level function extraction
    • JS/TS: basic function/arrow-function export extraction
  • Tools for the agent:
    • symbols: list top‑level symbols for a file.
    • impact: first‑pass change‑impact analysis. Confirms definition presence and finds references via ripgrep; returns buckets (definitions/imports/call_sites/test_files) with a rough “blast‑radius” score.
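
To make the impact buckets a bit more concrete, here is a minimal sketch of how ripgrep-style hits could be sorted into definitions/imports/call_sites/test_files and turned into a rough score. The weights, the `classify` heuristics, and the sample hits are all assumptions for illustration, not what the experimental branch actually does; a real version would classify from the AST rather than from line prefixes.

```python
# Hypothetical weights: a changed definition is assumed to be riskier than a call site.
WEIGHTS = {"definitions": 3, "imports": 2, "call_sites": 1, "test_files": 1}


def classify(path, line, symbol):
    """Put one (path, line) hit into a bucket using crude line-based heuristics."""
    stripped = line.strip()
    if path.endswith("_test.go") or ".test." in path:
        return "test_files"
    if stripped.startswith(("import ", "from ")):
        return "imports"
    if stripped.startswith((f"func {symbol}(", f"function {symbol}(")):
        return "definitions"
    return "call_sites"


def impact(symbol, hits):
    """Return (buckets, score) for a list of (path, line) reference hits."""
    buckets = {name: [] for name in WEIGHTS}
    for path, line in hits:
        buckets[classify(path, line, symbol)].append(path)
    score = sum(WEIGHTS[name] * len(paths) for name, paths in buckets.items())
    return buckets, score


# Made-up hits, shaped like (file, matching line) pairs from a text search.
hits = [
    ("parser.go", "func Parse(src string) error {"),
    ("main.go", "err := Parse(input)"),
    ("parser_test.go", 'Parse("x")'),
    ("ui.ts", "import { Parse } from './parser'"),
]
buckets, score = impact("Parse", hits)
print(buckets, score)
```

A gate could then compare `score` against a threshold before allowing a large edit.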

Near‑term ideas (incremental, pragmatic)

  • Context packer tool

    • Input: (path, symbol[, radius])
    • Output: compact bundle with the symbol’s AST node, nearby helpers, import lines, and N nearest callers.
    • Goal: reduce prompt size and model tunnel vision by giving the LLM the right slice of code.
  • Find‑refs / Go‑to‑def tools (AST‑aware)

    • Per language queries for definitions and call sites.
    • Reduce false positives vs regex and inform safer edits.
  • Preflight edit guardrails (agent middleware)

    • Before any write/multiedit: call impact (+ context packer) and score blast‑radius.
    • Gate large edits: split into stages or ask for approval if risk exceeds threshold.
  • Postflight verification

    • Re‑run impact to ensure no new unresolved refs appeared.
    • Build/lint and run tests for impacted packages/files.
    • Auto‑revise or roll back last edit on regressions.
  • Better classification of references

    • Distinguish imports, interface impls, method calls, and tests per language.
    • Prioritize what the LLM should read/update first.
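
The context packer idea above could look roughly like this. It is only a sketch: it uses Python's built-in `ast` module as a stand-in for TreeSitter so that it runs without extra dependencies, `pack_context` is a hypothetical name, and "nearest callers" is simplified to "the first N calling functions in document order".

```python
import ast


def pack_context(source, symbol, max_callers=3):
    """Bundle a function's source, the file's import lines, and its callers."""
    tree = ast.parse(source)
    imports, target, callers = [], None, []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append(ast.get_source_segment(source, node))
        elif isinstance(node, ast.FunctionDef):
            if node.name == symbol:
                target = ast.get_source_segment(source, node)
            elif any(
                isinstance(n, ast.Call)
                and isinstance(n.func, ast.Name)
                and n.func.id == symbol
                for n in ast.walk(node)
            ):
                # This top-level function calls the symbol somewhere in its body.
                callers.append(node.name)
    return {"symbol": target, "imports": imports, "callers": callers[:max_callers]}


SRC = '''import os
from math import sqrt

def norm(x, y):
    return sqrt(x * x + y * y)

def report(x, y):
    print(norm(x, y))

def unrelated():
    return os.getcwd()
'''

bundle = pack_context(SRC, "norm")
print(bundle["callers"])
```

Only the bundle would go into the prompt, so `unrelated` never reaches the model.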

Medium‑term improvements

  • AST‑aware refactors

    • Rename symbols, change function signatures, or move functions across files using TreeSitter transforms, paired with existing edit tooling.
  • Semantic chunking for RAG

    • Index code by AST units (functions/types/modules), not lines, to improve retrieval quality.
  • Structural diff summaries

    • Summarize changes by AST (added function, changed params, removed branch) for PR descriptions and agent memory.
  • Test scaffolding

    • Generate table‑driven test skeletons from exported units and public APIs.
  • Style/security queries

    • Detect risky patterns (string‑concat SQL, unchecked errors, unsafe exec) and propose targeted fixes.
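
As an illustration of the structural diff summaries mentioned above, here is a small sketch that compares two versions of a file by function name and parameter list instead of by lines. Again, Python's `ast` module stands in for TreeSitter, and `structural_diff` and the sample sources are made up for the example.

```python
import ast


def signatures(source):
    """Map each top-level function name to its positional parameter names."""
    return {
        node.name: [a.arg for a in node.args.args]
        for node in ast.parse(source).body
        if isinstance(node, ast.FunctionDef)
    }


def structural_diff(old, new):
    """Summarize added, removed, and re-parameterized functions."""
    before, after = signatures(old), signatures(new)
    summary = []
    for name in sorted(after.keys() - before.keys()):
        summary.append(f"added function {name}")
    for name in sorted(before.keys() - after.keys()):
        summary.append(f"removed function {name}")
    for name in sorted(before.keys() & after.keys()):
        if before[name] != after[name]:
            summary.append(
                f"changed params of {name}: {before[name]} -> {after[name]}"
            )
    return summary


OLD = "def load(path):\n    pass\n\ndef save(path):\n    pass\n"
NEW = "def load(path, strict):\n    pass\n\ndef dump(obj):\n    pass\n"

diff_summary = structural_diff(OLD, NEW)
for line in diff_summary:
    print(line)
```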

Longer‑term experiments

  • Cross‑language dependency mapping

    • Relate backend endpoints to frontend callers and tests; navigate across boundaries during edits.
  • Prompt budget optimizer

    • Convert code to minimal AST summaries (names, signatures, key literals) for quick overviews, expand only when needed.
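
A minimal sketch of the "minimal AST summaries" idea: keep only names and signatures, and drop the bodies. It assumes Python 3.9+ for `ast.unparse`, uses `ast` in place of TreeSitter, and `outline` is a hypothetical helper name.

```python
import ast


def outline(source):
    """Return one line per top-level function, class, and method; bodies are omitted."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(f"def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    lines.append(f"  def {item.name}({ast.unparse(item.args)})")
    return lines


SRC = (
    "class Store:\n"
    "    def get(self, key):\n"
    "        return 1\n"
    "\n"
    "def open_store(path, readonly=False):\n"
    "    return Store()\n"
)

summary = outline(SRC)
print("\n".join(summary))
```

The agent could start from this outline and expand individual symbols only when it needs their bodies.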

This would also work quite well in combination with GraphRAG. A while back I experimented with using TreeSitter to analyze our platform at work, which consists of multiple codebases in various languages, and built up a connected graph in Neo4j (though any graph database would work).

The benefit there is that relationships become first-class citizens in the context, and graph-based algorithms can be used for all kinds of things, such as finding the shortest path between two nodes, finding the most similar nodes to a given node, or even community detection.
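
For example, "shortest path between two nodes" on such a graph can be sketched with a plain breadth-first search. The graph below is entirely made up (the node names are hypothetical files, and each edge means "depends on"); in a graph database this would be a built-in query rather than hand-written code.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical cross-codebase dependency edges.
GRAPH = {
    "web/checkout.ts": ["api/orders.go"],
    "api/orders.go": ["db/orders.sql", "svc/billing.py"],
    "svc/billing.py": ["db/invoices.sql"],
}

route = shortest_path(GRAPH, "web/checkout.ts", "db/invoices.sql")
print(route)
```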

Just some ideas; I'm not even too sure how viable or useful they are. I'll probably give some of these a try anyway.

Replies: 2 comments · 1 reply

This is a great idea, and totally on our radar.

@TheApeMachine

If it helps, I have done some experimenting here: https://github.com/TheApeMachine/crush/tree/feature/treesitter-integration, which I will likely continue for a little while. I fear that I may not be doing everything correctly, or in line with your team's vision, so I consider this experimentation rather than a realistic implementation.

This would be fantastic. I don't know if this helps you at all, but RooCode and KiloCode (same thing) use tree-sitter for various things, including a codebase indexer that stores embeddings (generated via whatever embedding provider you've selected) in Qdrant. It works quite well.
