Context
In late 2022 the term 'vector database' became a Twitter buzzword overnight, but very few people seemed to understand the retrieval loop end to end. I wanted to build it from scratch before LLMs made it trivial.
Problem
Semantic search over arbitrary documents.
Goal
Implement a working semantic search pipeline over a small corpus of documents (chunking, embedding, indexing, querying) without leaning on a hosted vector DB or a managed embedding API. Keep it small enough that the whole thing fits in one Jupyter notebook.
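To make the chunking step concrete, here is roughly the kind of splitter I mean. A minimal sketch, assuming paragraph-level splitting and an arbitrary 800-character budget; neither is necessarily the strategy the project shipped:

```python
import re

def chunk_markdown(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk if appending this paragraph would overflow the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```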
Approach
Built around the Haystack library for orchestration, sentence-transformers for embeddings, and FAISS for the index. Documents were Markdown files from my own notes; queries were posed in natural English.
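Stripped of the Haystack orchestration, the core of the loop is just a few calls to sentence-transformers and FAISS. A minimal sketch; the model name is an assumption (any bi-encoder checkpoint behaves the same), and vectors are L2-normalized so FAISS's inner-product index returns cosine similarity:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed checkpoint; not necessarily the model the project used.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus; in practice this is the chunk_markdown() output above.
chunks = ["first note chunk ...", "second note chunk ..."]

# normalize_embeddings=True makes inner product equal cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

def search(query: str, k: int = 5) -> list[tuple[float, str]]:
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    # FAISS pads with -1 when the index holds fewer than k vectors.
    return [(float(s), chunks[i]) for s, i in zip(scores[0], ids[0]) if i != -1]
```

IndexFlatIP is exact brute-force search, which is entirely adequate at the ~12K-vector scale this project ran at.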
Outcome
Working semantic search over ~12K personal-note chunks. Surprisingly useful. More importantly, it gave me the mental model that powers everything I do with RAG today.
Retrospective
I used cosine similarity with no re-ranker; a cross-encoder pass on the top-50 would have been worth the latency cost. I also under-invested in the chunking strategy, which turned out to be 80% of the quality lever.
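For the record, the re-ranking pass I skipped is only a few lines with sentence-transformers' CrossEncoder. A sketch under assumptions: the checkpoint name is illustrative, and the candidates would come from the top-50 of the bi-encoder search above:

```python
from sentence_transformers import CrossEncoder

# Illustrative checkpoint; any MS MARCO-style cross-encoder slots in here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    # The cross-encoder reads each (query, passage) pair jointly, which is
    # slower than bi-encoder cosine but considerably more precise.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [(float(s), c) for s, c in ranked[:top_k]]

# In practice: candidates = [c for _, c in search(query, k=50)], then rerank.
```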