CSC Search Strategy
Full-Text Search Strategy (Optimised for Static Hosting)
To support search across the contents of hundreds of documents from varied sources (including PDFs) within our MkDocs-based site, we needed an approach that simulates full-text search while remaining scalable, performant, and compatible with static hosting (i.e. no backend or database).
Main Goals
- Allow users to search document content (not just titles)
- Ensure performance remains fast even as hundreds of documents are added
- Avoid full-text duplication and payload bloat in both the frontend and the backend (to avoid hitting Git limits)
- Keep compatibility with static site hosting (e.g. GitHub Pages) and with our existing stack (`list.js`, `lunr.js`, MkDocs)
Optimisation Strategy
Rather than store the entire text of every PDF in the search index (which would be slow and bloated), we take the following approach:
- Extract Text from PDFs (first sketch after this list)
  - Using `pdfplumber`, each document's text is extracted and cleaned
  - Unicode characters like curly quotes, dashes, and ellipses are normalised for consistency
- Generate Excerpts (second sketch below)
  - A short excerpt is created from the first meaningful paragraph (ignoring TOCs and headings)
  - This is used in search results and graph visualisation tooltips
- Lemmatise and Tokenise Keywords (third sketch below)
  - Words are lemmatised (e.g. running, ran, runs → run) using `nltk`
  - Common English stopwords are removed
  - This creates a compressed keyword representation of each document
- Apply Document Frequency Filtering (fourth sketch below)
  - Using `CountVectorizer` from `scikit-learn`, we:
    - Remove overly common terms (appearing in >85% of docs)
    - Remove rare noise terms (appearing in fewer than 2 docs)
  - This results in signal-rich keywords per document
- Final Search Index Generation (fifth sketch below)
  - Each optimised document is represented in `docs/search_index.json` with: `doc_id`, `name`, `excerpt`, `url`, `tags`, and optimised `keywords`
  - At runtime, search results display:
    - Matched titles
    - Decoded excerpts
    - A prioritised and randomly sampled subset of keywords (max 20) to avoid repetition or alphabetical bias
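The minimal sketches below illustrate each step in turn; helper names, thresholds, and heuristics not named in the list above are assumptions rather than the project's actual code. First, PDF extraction and cleanup with `pdfplumber`, `unicodedata`, and `re`:

```python
import re
import unicodedata

import pdfplumber


def extract_pdf_text(path: str) -> str:
    """Extract raw text from every page of a PDF, then normalise it."""
    with pdfplumber.open(path) as pdf:
        raw = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # NFKC folds ligatures, fullwidth forms, and ellipses into plain
    # equivalents; curly quotes and dashes need explicit replacement.
    text = unicodedata.normalize("NFKC", raw)
    for src, dst in {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en/em dashes
    }.items():
        text = text.replace(src, dst)
    # Collapse runs of spaces and tabs but keep line breaks for the
    # excerpt step, which works line by line.
    return re.sub(r"[ \t]+", " ", text).strip()
```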
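Second, excerpt generation. The "first meaningful paragraph" heuristic here (a minimum length, skipping ALL-CAPS headings and dotted TOC leaders, a 300-character cap) is an assumption; the real script may use different rules:

```python
def make_excerpt(text: str, max_chars: int = 300) -> str:
    """Return the first line that looks like prose rather than a
    heading or a table-of-contents entry."""
    for line in text.splitlines():
        line = line.strip()
        if len(line) < 60:        # too short to be a real paragraph
            continue
        if line.isupper():        # likely a heading
            continue
        if "...." in line:        # dotted leader from a TOC row
            continue
        if len(line) > max_chars:
            # Trim at a word boundary so the tooltip text reads cleanly.
            line = line[:max_chars].rsplit(" ", 1)[0] + "..."
        return line
    return text[:max_chars]       # fallback: no prose-like line found
```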
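Third, lemmatisation and stopword removal with `nltk`. Lemmatising with the verb part of speech as well as the default noun is what collapses running/ran/runs to run; the tokeniser regex and three-letter minimum are assumptions:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

LEMMATIZER = WordNetLemmatizer()
STOPWORDS = set(stopwords.words("english"))


def extract_keywords(text: str) -> list[str]:
    """Lowercase, tokenise, drop stopwords, and lemmatise each token."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    keywords = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        # Lemmatise as a noun, then as a verb, so that "running",
        # "ran", and "runs" all collapse to "run".
        keywords.append(LEMMATIZER.lemmatize(LEMMATIZER.lemmatize(tok), pos="v"))
    return keywords
```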
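Fourth, corpus-wide document frequency filtering. The `max_df=0.85` and `min_df=2` arguments map directly to the >85% and fewer-than-2-documents thresholds above; the toy corpus is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One space-joined keyword string per document, from the previous step.
docs_keywords = [
    "search index run keyword excerpt",
    "run archive parquet metadata search",
    "run excerpt tooltip graph",
]

# max_df=0.85 drops terms in more than 85% of documents;
# min_df=2 drops terms appearing in fewer than 2 documents.
vectorizer = CountVectorizer(max_df=0.85, min_df=2)
vectorizer.fit(docs_keywords)
vocabulary = set(vectorizer.get_feature_names_out())

# Keep only the surviving, signal-rich keywords per document.
filtered = [[kw for kw in doc.split() if kw in vocabulary]
            for doc in docs_keywords]
print(filtered)  # [['search', 'excerpt'], ['search'], ['excerpt']]
```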
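Fifth, index generation. The field names come from the list above, but the record values are invented for illustration; the max-20 keyword sampling happens at display time, so it is not shown here:

```python
import json

# One record per document, using the fields named above.
index_records = [
    {
        "doc_id": "0001",
        "name": "Example Policy Document",   # shown as the result title
        "excerpt": "A short first paragraph used in tooltips...",
        "url": "documents/example-policy.pdf",
        "tags": ["policy", "example"],
        "keywords": ["search", "excerpt"],   # optimised keywords
    },
]

with open("docs/search_index.json", "w", encoding="utf-8") as fh:
    json.dump(index_records, fh, ensure_ascii=False, indent=2)
```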
Optional Archive (Parquet)
If archival or later ML processing is needed, a full `.parquet` file can also be saved (`admin_scripts/docs_index.parquet`) with complete text and metadata. This is disabled by default to keep the project lightweight, but a flag can be set in the Python index-build script, as in the sketch below.
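A minimal sketch of this archive step; the flag name `SAVE_PARQUET` and the record fields are hypothetical, and `DataFrame.to_parquet` requires `pyarrow` or `fastparquet` to be installed:

```python
import pandas as pd

# Hypothetical flag standing in for the variable in the index-build
# script; disabled by default to keep the repository lightweight.
SAVE_PARQUET = False

if SAVE_PARQUET:
    # One record per document, including the complete extracted text.
    records = [
        {
            "doc_id": "0001",
            "name": "Example Policy Document",
            "full_text": "Complete extracted text of the document...",
            "tags": ["policy", "example"],
        },
    ]
    pd.DataFrame(records).to_parquet(
        "admin_scripts/docs_index.parquet", index=False
    )
```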
Key Libraries Used
| Purpose | Tool |
| --- | --- |
| PDF text extraction | `pdfplumber` |
| Text cleanup | `re`, `unicodedata`, `DOMParser` (JS) |
| Lemmatisation & stopwords | `nltk` |
| Frequency filtering | `scikit-learn` (`CountVectorizer`) |
| Output formats | `json`, `pandas`, `parquet` |
We think this setup keeps search acceptably fast, useful, and scalable, and lets the project grow without sacrificing performance or frontend simplicity. We're now scaling it up for more complete and realistic testing alongside a cyclic review approach.