Site search using Lunr

Site search using Lunr

With lunr_index (and optionally lunr_index_fields) in wmk_config.yaml, wmk will build a search index for Lunr.js and place it in idx.json in the webroot. In order to minimize its size, no metadata about each record is saved to the index. Instead, a simple list of pages (with title and summary) is placed in idx.summaries.json. The summary is taken either from one of the frontmatter fields summary, intro or description (in order of preference) or, failing that, from the start of the page body.

If lunr_languages is present in wmk_config.yaml, stemming rules for those languages will be applied when building the index. The value may be a two-letter lowercase country code (ISO-639-1) or a list of such codes. The currently accepted languages are de, da, en, fi, fr, hu, it, nl, no, pt, ro, and ru (this is the intersection of the languages supported by lunr.js and NLTK, respecively). The default language is en. Attempting to specify a non-supported language will raise an exception.

The index is built via the lunr.py module and the stemming support is provided by the Python Natural Language Toolkit.

For information about the supported syntax of the search expression, see the Lunr documentation.

Note that because Lunr creates a single index file for the whole site, it may not be a practical option for large sites with lots of content – a realistic limit may be somewhere around 1,000 pages or so. In such a case, you may want to consider using solutions that break the index up into smaller chunks, such as Pagefind. Alternatively, you should look into a server-side or hosted solution such as Algolia or Meilisearch.

Limitations

  • Building the index does not mean that the search functionality is complete. It remains to point to lunr.js in the templates and write some javascript to interface with it and display the results. However, since every website is different, this cannot be provided by wmk directly. It is up to the template (or theme) author to actually load the index and present a search interface to the user.

  • Similarly, if a “fancy” preview of results is required which cannot be fulfilled using the information in idx.summaries.json, this must currently be solved independently by the template/theme author.

  • Note that only the raw content document is indexed, not the HTML after the markdown (or other input content) has been processed. The only exception to this is that the binary input formats (DOCX, ODT, EPUB) are converted to markdown before being indexed. The output of Mako templates (including shortcodes called from the content documents) is not indexed either.