2023 · AI

Haystack Similarity Search

Semantic search over arbitrary documents.

Python · Vector embeddings · FAISS

Context

In late 2022 the term "vector database" became a Twitter buzzword overnight, but very few people seemed to understand the retrieval loop end to end. I wanted to build it from scratch before LLMs made it trivial.

Problem

Implement a working semantic search pipeline over a small corpus of documents — chunking, embedding, indexing, querying — without leaning on a hosted vector DB or a managed embedding API. Keep it small enough that the whole thing fits in one Jupyter notebook.

Approach

Built around the Haystack library for orchestration, sentence-transformers for embeddings, and FAISS for the index. Documents were Markdown files from my own notes; queries were posed in plain English.
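
Not the notebook itself, but a minimal sketch of that wiring, assuming the Haystack 1.x-era API that was current at the time (module paths changed in Haystack 2.x); the `chunks` list and the query string are placeholders:

```python
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

chunks = ["first note chunk...", "second note chunk..."]  # stand-in for real notes

document_store = FAISSDocumentStore(
    embedding_dim=384,               # all-MiniLM-L6-v2 output size
    faiss_index_factory_str="Flat",  # exact search to start with
)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

document_store.write_documents([{"content": c} for c in chunks])
document_store.update_embeddings(retriever)  # embed and index in one pass

results = retriever.retrieve(query="how does IVF indexing work?", top_k=5)
```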

Build

  • Pre-processing pipeline: normalise Markdown → 300-token chunks with 50-token overlap (see the chunking sketch after this list).
  • Embedding step using `all-MiniLM-L6-v2` for speed and memory.
  • FAISS flat index in memory; switched to IVF when the corpus crossed 5K chunks.
  • Query loop: embed → top-k → return chunks with source links (see the index/query sketch after this list).
  • Everything in one notebook, end to end, so each step's output was inspectable.
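
The chunking step from the first bullet, sketched with the embedding model's own tokenizer so that "token" means the same thing at chunk time and at embed time. The function name and defaults are illustrative, not the original notebook code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Window `text` into ~size-token chunks, each sharing `overlap`
    tokens with its predecessor so sentences aren't stranded at a boundary."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = size - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + size]))
        if start + size >= len(ids):  # final window already covers the tail
            break
    return chunks
```

Decoding token windows back into text loses a little whitespace fidelity, which is usually an acceptable trade for retrieval chunks.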
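The index and query loop, sketched against FAISS directly to show the flat-to-IVF switch. Unit-normalising the embeddings makes inner product equal to cosine similarity; the 5K threshold matches the bullet above, but `nlist=256` is an illustrative guess:

```python
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["first note chunk...", "second note chunk..."]  # stand-in for real notes

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)  # float32, unit length
dim = vectors.shape[1]  # 384

if len(chunks) < 5_000:
    index = faiss.IndexFlatIP(dim)  # exact inner product == cosine on unit vectors
else:
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
    index.train(vectors)  # IVF needs a training pass before adding
index.add(vectors)

# Query loop: embed -> top-k -> map ids back to source chunks.
query_vec = model.encode(["natural-language question"], normalize_embeddings=True)
scores, ids = index.search(query_vec, k=min(5, len(chunks)))
hits = [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]
```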

Outcome

Working semantic search over ~12K personal-note chunks. Surprisingly useful. More importantly, it gave me the mental model that powers everything I do with RAG today.

What I would change

I used cosine similarity with no re-ranker; a cross-encoder pass on the top-50 (sketched below) would have been worth the latency cost. I also under-invested in the chunking strategy, which turned out to be 80% of the quality lever.
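
What that re-rank pass might look like, assuming the `CrossEncoder` class from sentence-transformers; the model choice and the `top_50_chunks` list are illustrative:

```python
from sentence_transformers import CrossEncoder

query = "natural-language question"
top_50_chunks = ["candidate chunk A", "candidate chunk B"]  # stand-in for the FAISS top-50

# The cross-encoder reads query and chunk together, so it can catch
# matches the bi-encoder's independent embeddings miss.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in top_50_chunks])
reranked = [c for _, c in sorted(zip(scores, top_50_chunks), reverse=True)[:5]]
```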
