das-bikash-dev 24 hours ago
Nice project, especially given the VRAM constraints. A few things I've learned
building production RAG that might help:
1. Separate your query analysis from retrieval. A single LLM call can classify
the query type, decide whether to use hybrid search, and pick search parameters
all at once. This saves a round-trip vs doing them sequentially.
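A minimal sketch of what that single call could look like. The prompt wording, field names, and defaults here are all illustrative, not a known-good production prompt; the point is just that one response carries the classification, the hybrid-search decision, and the parameters together:

```python
import json

# Hypothetical prompt: asks the model for all routing decisions in one reply.
ANALYSIS_PROMPT = """Classify the user query and choose retrieval settings.
Reply with JSON only, e.g.:
{{"query_type": "exact", "use_hybrid": true, "top_k": 8, "keyword_weight": 0.7}}

Query: {query}"""

def parse_analysis(raw: str) -> dict:
    """Parse the model's JSON reply, falling back to safe defaults
    so a malformed reply degrades gracefully instead of failing."""
    defaults = {"query_type": "mixed", "use_hybrid": True,
                "top_k": 8, "keyword_weight": 0.5}
    try:
        parsed = json.loads(raw)
        # Only accept known keys; ignore anything extra the model adds.
        return {**defaults, **{k: v for k, v in parsed.items() if k in defaults}}
    except json.JSONDecodeError:
        return defaults
```

The fallback-to-defaults parse matters in practice: when the analysis call misbehaves, you still get a working (if unoptimized) retrieval pass rather than an error.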
2. If you add BM25 alongside vector search, the blend ratio matters a lot by
query type. Exact-match queries need heavy keyword weighting, while conceptual
questions need more embedding weight. A static 50/50 split leaves performance
on the table.
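One way to sketch the per-query-type blend: min-max normalize each retriever's scores so they're comparable, then weight by query type. The weights in the table are illustrative placeholders to tune against your own eval set, not recommended values:

```python
def blend_scores(bm25, vec, keyword_weight):
    """Min-max normalize each score list onto [0, 1], then blend
    keyword vs embedding scores per the query-type weight."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    b, v = norm(bm25), norm(vec)
    return [keyword_weight * bi + (1 - keyword_weight) * vi
            for bi, vi in zip(b, v)]

# Illustrative per-query-type weights; tune these on your own data.
WEIGHTS = {"exact": 0.8, "mixed": 0.5, "conceptual": 0.2}
```

Normalizing first is the important part: raw BM25 scores and cosine similarities live on different scales, so a fixed ratio applied to raw scores doesn't mean what you think it means.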
3. For your evaluator/generator being the same model — one practical workaround
is to skip LLM-as-judge evaluation entirely and use a small cross-encoder
reranker between retrieval and generation instead. It catches the cases where
vector similarity returns semantically related but not actually useful chunks,
and it gives you a relevance score you can threshold on without needing a
separate evaluation model.
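The rerank-then-threshold step might look like the sketch below. It takes the scorer as a callable so the example stays self-contained; in practice that would be something like `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers (model name is just a common example, and the 0.3 threshold is a placeholder to calibrate on your own data):

```python
from typing import Callable, List, Tuple

def rerank_and_filter(query: str, chunks: List[str],
                      score_fn: Callable[[List[Tuple[str, str]]], List[float]],
                      threshold: float = 0.3, top_k: int = 5) -> List[str]:
    """Score each (query, chunk) pair with the cross-encoder, drop
    anything below the relevance threshold, and return the top_k
    highest-scoring chunks for the generator."""
    scores = score_fn([(query, c) for c in chunks])
    kept = [(s, c) for s, c in zip(scores, chunks) if s >= threshold]
    kept.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in kept[:top_k]]
```

The threshold is what gives you the "no good context found" signal for free: if nothing survives, you can refuse or fall back instead of generating from junk chunks.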
4. Consider a two-level cache: exact match (hash the query, short TTL) plus a
semantic cache (cosine similarity threshold on the query embedding, longer TTL).
The semantic layer catches "how do I X" vs "what's the way to X" without hitting
the retriever again.
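A rough sketch of that two-level cache, assuming unit-normalized query embeddings (so the dot product is cosine similarity) and a brute-force scan over the semantic layer; the TTLs and the 0.92 similarity threshold are made-up starting points:

```python
import hashlib
import time

class TwoLevelCache:
    """Exact layer: hash of the query string, short TTL.
    Semantic layer: cosine similarity against cached query
    embeddings, longer TTL. Embeddings must be pre-normalized."""

    def __init__(self, exact_ttl=300, semantic_ttl=3600, sim_threshold=0.92):
        self.exact = {}      # sha256(query) -> (expires_at, answer)
        self.semantic = []   # (expires_at, embedding, answer)
        self.exact_ttl = exact_ttl
        self.semantic_ttl = semantic_ttl
        self.sim_threshold = sim_threshold

    def get(self, query, embedding, now=None):
        now = time.time() if now is None else now
        key = hashlib.sha256(query.encode()).hexdigest()
        hit = self.exact.get(key)
        if hit and hit[0] > now:
            return hit[1]
        # Fall through to the semantic layer (brute-force scan;
        # swap in an ANN index once this list gets large).
        for expires, emb, answer in self.semantic:
            if expires > now and sum(a * b for a, b in zip(emb, embedding)) >= self.sim_threshold:
                return answer
        return None

    def put(self, query, embedding, answer, now=None):
        now = time.time() if now is None else now
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = (now + self.exact_ttl, answer)
        self.semantic.append((now + self.semantic_ttl, embedding, answer))
```

Usage is: check `get` before retrieval, `put` after generation. Note the semantic layer answers after the exact entry has expired, which is exactly the "how do I X" vs "what's the way to X" case the longer TTL is for.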
What model are you using for generation on the 8GB? That constraint probably
shapes a lot of the architecture choices downstream.