DeweySearch

A static search index for .NET.

DeweySearch builds a sharded, Pagefind-style inverted index at build time and queries it in the browser, fetching only the shards a search touches. No server, no runtime service, no dependencies.


Version: 0.1.x
Target: net10.0
License: MIT
Status: Preview

005.1

Installation

DeweySearch ships as a single, dependency-free NuGet package. The build-time indexer targets net10.0; the browser client is plain JavaScript with no build step.

bash

# Add DeweySearch to your project
dotnet add package DeweySearch

powershell

# Visual Studio Package Manager Console
Install-Package DeweySearch

xml

<ItemGroup>
  <PackageReference Include="DeweySearch" Version="0.1.*" />
</ItemGroup>

Hosting in ASP.NET Core? The DeweySearch.Web package serves the JavaScript client as a static web asset, so the client script and the NuGet package it ships with cannot drift apart.

005.13

Quick start

Describe your documents, build the index, and query it from the browser.

Describe your documents

A SearchDocument is a flat record — URL, title, optional description, heading text, and plain-text body. DeweySearch knows nothing about pages, locales, or sections; the host produces these from whatever content model it has.

csharp

using DeweySearch;
  
var documents = new List<SearchDocument>
{
    new(Url: "/guide/install",
        Title: "Installation",
        Description: "Add DeweySearch to your project.",
        Headings: "Requirements Setup",
        Body: "DeweySearch ships as a single dependency-free NuGet package."),
};

Build the index

IndexBuilder.Build produces an in-memory SearchIndex; ToFiles() serializes it to the static JSON artifacts the client fetches (index.json, t-{prefix}.json, f-{docId}.json).

csharp

using DeweySearch;
  
var index = new IndexBuilder(new IndexOptions { ShardPrefixLength = 2 })
    .Build(documents);
  
Directory.CreateDirectory("wwwroot/search-index");
foreach (var (name, bytes) in index.ToFiles())
    File.WriteAllBytes(Path.Combine("wwwroot/search-index", name), bytes);

Query in the browser

Point DeweySearchEngine at the directory holding the artifacts and call search(). It loads the manifest once, then only the shards and fragments a query actually needs.

javascript

// dewey-search.js exposes DeweySearchEngine on the global scope — no build step.
const engine = new DeweySearchEngine('/search-index');
const results = await engine.search('install');
  
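for (const hit of results) {
    // Each hit carries { docId, score, fields }; the URL and title live in the manifest entry.
    const entry = engine.docEntry(hit.docId);
    console.log(entry.u, entry.t, hit.score);
}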

Checkpoint

The index is data, not code. Rebuild it whenever your content changes and commit the JSON alongside your site — there is nothing to deploy and nothing to run.

005.74

Core concepts

Six ideas cover most of DeweySearch's surface — a build-time C# library that emits static JSON, and a tiny client-side engine that queries it.

005.74 · A

Inverted index

A sharded, Pagefind-style inverted index, built once at build time. The whole index is never shipped to the browser.

005.74 · B

Tokenizer & stemmer

An accent-folding tokenizer and a plurals-only stemmer, implemented identically in C# and JavaScript and pinned by shared conformance fixtures.

005.74 · C

Prefix shards

Postings are split into per-term-prefix shards. A query downloads only the shards its terms touch, not the whole corpus.

005.74 · D

BM25 in the browser

All scoring runs client-side: BM25 with field boosts, prefix completion, bounded fuzzy matching, and synonym expansion.

005.74 · E

Facets

An open facet dictionary — any dimension on a document is interned and shipped in the manifest for client-side filtering.

005.74 · F

Fragments

Per-document excerpt fragments are fetched on demand, only for the results actually shown.

How it works

One sharded inverted index, built once at build time and queried in the browser. Only the shards a search actually touches are ever downloaded. The walkthrough follows a single document the whole way: tokenizer, postings, shards, then BM25 scoring in the browser.

005.1 BUILD TIME · C#

005.1 · Document #1 · /headings

TITLE: Heading Styles
HEADING: HeadingStyle
BODY: Choose heading styles.

id = position in list = 1

── Tokenize

café → cafe (accent fold)
"Choose heading styles." → choose, heading, styles (runs)
HeadingStyle → heading, style, headingstyle (camelCase split)
HTTPResponse → http, response, httpresponse (acronym)
utf8 → utf, 8, utf8 (digit boundary)

When a run splits, the whole run is kept too.

── Stem · plurals only

styles → style (−s)
policies → policy (−ies → y)
boxes → box (−es after a sibilant)
indices → index (irregular)

Gerunds are untouched: string stays string (not str), and heading stays heading (not head).

── Inverted index · term → postings

Field flags, OR-ed together:

TITLE 0001 = 1
HEADING 0010 = 2
BODY 1000 = 8
TITLE | HEADING | BODY = 1011 = 11

tf (occurrences) = 3

heading → [[ 1, 11, 3 ]]

── Shard · by 2-char stem prefix

Manifest: index.json (n: 2 · avgdl: 6 · psz: 2 · shards: 6)
Shards: t-ch.json, t-he.json, t-in.json, t-pa.json, t-st.json, t-th.json
Fragments: f-0.json, f-1.json

── In the browser · query

"headings" → tokenize → headings → stem → heading

Contract: byte-for-byte identical to the build-time logic, pinned by shared fixtures.

heading → first 2 chars → he

── Fetch only the shard the prefix points to

On the server (static files): t-ch.json, t-he.json, t-in.json, t-pa.json, t-st.json, t-th.json
Request: GET t-he.json
In the browser: t-he.json + index.json (once)

1 of 6 shards downloaded.

── Score · BM25 × field boost × match quality

Candidates from t-he.json:

term	match	quality	boost	score
heading	exact	1.000	7	≈ 7.12
headingstyle	prefix 7/12	0.525	2	≈ 0.64

score = idf · tf′ · boost · quality

idf = ln(N/df) ≈ 0.693 · k1 = 1.2 · b = 0.75 · field boost: title 4 / heading 2 / desc 2 / body 1

── Result · fragment fetched only for shown hits

/headings · score 7.1
Heading Styles
Choose heading styles.
from f-1.json · doc 1 · fields title+heading+body (11)

One tokenizer, two runtimes; only the shards a query touches ever leave the server.

005.1

A document is just flat fields.

Each page becomes a record with title, headings, description, and body. Its position in the input list is its id — that integer is used everywhere downstream.

005.13

Tokenize.

Fold accents, split on punctuation, then cleave camelCase, acronyms, and digit boundaries. When a run splits, the whole run is kept too — so HeadingStyle is findable as heading, style, and headingstyle.
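
The rules above can be sketched in a few lines of JavaScript. This is an illustrative approximation, not DeweySearch's actual tokenizer: the function name tokenizeSketch, the restriction to ASCII letters and digits after folding, and the specific regular expressions are all assumptions.

```javascript
// Hypothetical sketch of the tokenizer rules described above; not DeweySearch's source.
function tokenizeSketch(text) {
  // Fold accents: decompose, then drop combining marks (café -> cafe).
  const folded = text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  const tokens = [];
  // A "run" is a maximal stretch of ASCII letters and digits (assumption).
  for (const run of folded.match(/[A-Za-z0-9]+/g) ?? []) {
    // Cleave at camelCase, acronym-to-word, and letter/digit boundaries.
    const parts = run
      .replace(/([a-z])([A-Z])/g, '$1 $2')        // HeadingStyle -> Heading Style
      .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')  // HTTPResponse -> HTTP Response
      .replace(/([A-Za-z])([0-9])/g, '$1 $2')     // utf8 -> utf 8
      .replace(/([0-9])([A-Za-z])/g, '$1 $2')     // 8bit -> 8 bit
      .split(' ');
    for (const part of parts) tokens.push(part.toLowerCase());
    // When a run splits, keep the whole run as well.
    if (parts.length > 1) tokens.push(run.toLowerCase());
  }
  return tokens;
}

console.log(tokenizeSketch('HeadingStyle')); // heading, style, headingstyle
console.log(tokenizeSketch('HTTPResponse')); // http, response, httpresponse
```

Keeping the whole run alongside its parts is what lets a query for headingstyle match a document that only ever wrote HeadingStyle.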

005.13

Stem — plurals only.

A tiny ordered stemmer; first rule wins. Gerunds are left alone on purpose: string stays string, heading stays heading. That protects technical vocabulary.
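
A plurals-only stemmer with "first rule wins" semantics might look like the sketch below. The real rule list is not given here, so the patterns, the irregular table, and the short-word guard are assumptions chosen to reproduce the examples in this walkthrough.

```javascript
// Hypothetical irregular plurals; only the example from the walkthrough is listed.
const IRREGULAR_PLURALS = new Map([['indices', 'index']]);

// Hypothetical ordered rule list; the first matching rule wins.
const STEM_RULES = [
  [/ies$/, 'y'],               // policies -> policy
  [/(ch|sh|ss|x)es$/, '$1'],   // boxes -> box (sibilant + es)
  [/([^siu])s$/, '$1'],        // styles -> style; leaves "class" and "status" alone
];

function stemSketch(word) {
  if (IRREGULAR_PLURALS.has(word)) return IRREGULAR_PLURALS.get(word);
  if (word.length <= 3) return word; // too short to strip safely (assumption)
  for (const [pattern, replacement] of STEM_RULES) {
    if (pattern.test(word)) return word.replace(pattern, replacement);
  }
  return word; // no plural suffix: "string" and "heading" pass through unchanged
}

console.log(['styles', 'policies', 'boxes', 'indices', 'heading'].map(stemSketch));
// style, policy, box, index, heading
```

Because no rule strips -ing, technical terms such as string and heading survive intact, which is the stated reason for leaving gerunds alone.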

005.74

Inverted index.

One posting per (term, document): [docId, fieldFlags, tf]. Field bits OR together — heading appears in title (1), heading (2), and body (8): 1 | 2 | 8 = 11.
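
As a rough illustration of how such postings could be accumulated, the sketch below builds a term → postings map from documents whose fields are already tokenized and stemmed. The function name buildPostings, the input shape, and the placeholder document 0 are hypothetical; the bit values are the ones listed for FieldFlags in the API reference.

```javascript
// Field bit values as listed for DeweySearchEngine.FieldFlags.
const FIELD_BITS = Object.freeze({ Title: 1, Heading: 2, Description: 4, Body: 8 });

// Build term -> [[docId, fieldFlags, tf], ...] from already-stemmed tokens per field.
function buildPostings(docs) {
  const index = new Map();
  docs.forEach((doc, docId) => {           // a document's id is its list position
    const perTerm = new Map();             // term -> [flags, tf] for this document
    for (const [field, terms] of Object.entries(doc)) {
      for (const term of terms) {
        const entry = perTerm.get(term) ?? [0, 0];
        entry[0] |= FIELD_BITS[field];     // OR the field bit in
        entry[1] += 1;                     // count the occurrence
        perTerm.set(term, entry);
      }
    }
    for (const [term, [flags, tf]] of perTerm) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push([docId, flags, tf]);
    }
  });
  return index;
}

const postings = buildPostings([
  { Title: ['placeholder'], Body: ['placeholder'] },  // hypothetical document 0
  { Title: ['heading', 'style'],                      // document 1 from the walkthrough
    Heading: ['heading', 'style', 'headingstyle'],
    Body: ['choose', 'heading', 'style'] },
]);
console.log(JSON.stringify(postings.get('heading'))); // [[1,11,3]]
```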

005.74

Shard & emit JSON.

Terms file into a shard keyed by the first two characters of the stem. The whole index becomes data, not code — static JSON you commit beside your site.
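
A minimal sketch of the sharding step, assuming each shard file is a JSON object mapping term to postings. That layout and the function name shardPostings are assumptions; only the t-{prefix}.json naming comes from the text.

```javascript
// Group terms into shard files keyed by the first `prefixLength` characters of the stem.
function shardPostings(index, prefixLength = 2) {
  const shards = new Map();
  for (const [term, postings] of index) {
    const name = `t-${term.slice(0, prefixLength)}.json`;
    if (!shards.has(name)) shards.set(name, {});
    shards.get(name)[term] = postings;   // assumed shard layout: { term: postings }
  }
  return shards;
}

const exampleIndex = new Map([
  ['heading', [[1, 11, 3]]],
  ['headingstyle', [[1, 2, 1]]],
  ['choose', [[1, 8, 1]]],
  ['style', [[1, 11, 3]]],
]);
const shardFiles = shardPostings(exampleIndex);
console.log([...shardFiles.keys()]); // t-he.json, t-ch.json, t-st.json
```

A longer prefix spreads the same terms over more, smaller files, which is the trade-off ShardPrefixLength controls.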

005.133

One tokenizer, two runtimes.

The query runs through byte-for-byte identical tokenize + stem logic. That contract is why "headings" lands on the index key heading.

005.133

Fetch only what it touches.

From the stem heading, derive prefix he and request t-he.json — and nothing else. The other five shards stay on the server.
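
The prefix-to-file mapping is simple enough to show directly. The helper below is a hypothetical sketch, not part of the documented client API; it derives the distinct shard URLs a set of stemmed query terms would need, so two terms sharing a prefix cause only one request.

```javascript
// Derive the distinct shard URLs needed for a list of stemmed query terms.
function shardUrlsFor(stems, basePath, prefixLength = 2) {
  const names = new Set(stems.map((stem) => `t-${stem.slice(0, prefixLength)}.json`));
  return [...names].map((name) => `${basePath}/${name}`);
}

console.log(shardUrlsFor(['heading'], '/search-index'));
// /search-index/t-he.json
console.log(shardUrlsFor(['heading', 'headingstyle', 'style'], '/search-index'));
// /search-index/t-he.json, /search-index/t-st.json
```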

005.133

Score in the browser.

For each candidate term, BM25 × field boost × match quality. heading (exact, boost 7) scores ≈ 7.12; headingstyle (prefix completion, boost 2) ≈ 0.64.
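
The two scores can be reproduced with the standard BM25 term-frequency saturation, which is assumed here to be what tf′ denotes. The document length dl = 8 is also an assumption, obtained by counting the example document's tokens (2 in the title, 3 in the heading, 3 in the body). The quality values are taken from the walkthrough as given rather than derived, and the boost of 7 is consistent with summing the title, heading, and body boosts (4 + 2 + 1).

```javascript
// Standard BM25 term-frequency saturation with length normalization (assumed form of tf').
function bm25Tf(tf, dl, avgdl, k1 = 1.2, b = 0.75) {
  return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (dl / avgdl)));
}

// score = idf * tf' * boost * quality, with idf = ln(N / df).
function scoreTerm({ tf, df, boost, quality }, { n, avgdl, dl }) {
  const idf = Math.log(n / df);
  return idf * bm25Tf(tf, dl, avgdl) * boost * quality;
}

const stats = { n: 2, avgdl: 6, dl: 8 }; // dl = 8 is an assumed token count for doc 1
const exactScore = scoreTerm({ tf: 3, df: 1, boost: 7, quality: 1.0 }, stats);
const prefixScore = scoreTerm({ tf: 1, df: 1, boost: 2, quality: 0.525 }, stats);
console.log(exactScore.toFixed(2), prefixScore.toFixed(2)); // 7.12 0.64
```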

005.133

Result.

Fetch f-1.json only for results actually shown; render the snippet with the match highlighted.

005.133

API reference

The public surface is small and splits in two: a build-time C# library that emits the index, and a browser client that queries it. The C# shapes are pulled live from source; the JavaScript client has no Roslyn to mine, so it is documented by hand.

Build-time · C#

`SearchDocument`

record

One input document fed to Build. The host is responsible for producing these from whatever content model it has — DeweySearch knows nothing about pages, locales, or sections.

Url: string: Canonical URL of the document, surfaced in results and fragments.
Title: string: Title shown in results; weighted highest when ranking.
Description: string? = null: Optional summary; weighted above body.
Headings: string: Space-joined heading text; weighted above body.
Body: string: Plain-text body used for full-text matching and excerpt fragments.
Priority: int = 1: Relative ranking weight; higher wins. Default: 1.
Facets: IReadOnlyDictionary<string, string[]>? = null: Open facet map (dimension name to its values) for client-side filtering, e.g. { "section": ["Guides"], "tag": ["cli", "beginner"] }. Dimensions are arbitrary and caller-defined; values are interned and assigned stable ids at build time. Null for none.
Crumbs: IReadOnlyList<string>? = null: Optional ancestor breadcrumb trail (root-to-parent labels, excluding this document's own Title) for hierarchical results — e.g. a heading record carrying ["Page Title", "Parent Heading"]. Stored verbatim in the manifest for the client to display and group by; DeweySearch does not interpret it. Null or empty for a top-level result.

`IndexOptions`

class

Configuration for building a SearchIndex.

ShardPrefixLength: int = 2: Number of leading characters of a stemmed term used as its shard key. Lower values produce fewer, larger shards; higher values produce more, smaller shards. Default: 2.
MaxEditDistance: int = 2: Upper bound on the edit distance the client applies for typo-tolerant matching. The client also scales the budget down for short terms; this caps it. Set to 0 to require exact matches. Default: 2.
Synonyms: Dictionary<string, string[]> = []: Query-time synonyms. Each entry maps a term to the additional terms it should also match. Keys and values are stemmed at build time and shipped in the index manifest, so callers write natural words (e.g. "config" => ["configuration"]). Default: empty.

IndexBuilder calls

ctor

new IndexBuilder(IndexOptions options)

Create a builder. Use the parameterless overload for defaults (2-character shard prefixes, edit distance 2).

method

SearchIndex Build(IReadOnlyList<SearchDocument> documents)

Build the inverted index, BM25 stats, document table, facets, and excerpt fragments. A document's id is its position in the list.

method

IReadOnlyDictionary<string, byte[]> ToFiles()

Serialize the index to its static artifacts keyed by leaf file name: index.json, one t-{prefix}.json per shard, one f-{docId}.json per document. Call it on the SearchIndex that Build returns.

In the browser · JavaScript

`DeweySearchEngine`

javascript

Self-contained client for the static index — no dependencies, no CDN. Fetches the manifest once, then only the term-prefix shards a query touches and the per-document fragments for results actually shown.

new DeweySearchEngine(basePath): Construct against the directory holding the artifacts from ToFiles() — index.json plus the t-{prefix}.json and f-{docId}.json files. The leaf names are the contract; the host owns the directory.
search(query) → Promise<Result[]>: Tokenize the query, then score every touched document with BM25, field boosts, prefix completion, bounded fuzzy matching and synonyms, and resolve to a ranked list of { docId, score, fields } — best first. Loads the manifest on first call, then fetches only the shards the query's terms touch.
loadManifest() → Promise<Manifest>: Fetch index.json once and cache it. search() calls this implicitly; call it yourself to warm the cache or to read availableFacets() and docEntry() before the first query.
docEntry(docId) → object | null: The manifest record behind a result — u (url), t (title), c (breadcrumb trail), f (facet ids). Use it to render a hit and to group by page via docEntry(id).u.split('#')[0].
loadFragment(docId) → Promise<object | null>: Fetch one document's excerpt fragment (f-{docId}.json) on demand — only for the handful of results you actually render.
availableFacets() → object: The manifest's facet dictionary (dimension → values, interned to ids) for building filter UI. Empty until the manifest is loaded.
matchesFacets(docId, activeFacets) → boolean: True when a document satisfies every active facet selection, so chips can re-filter results without re-fetching shards. activeFacets maps each dimension to a Set of selected ids.
DeweySearchEngine.FieldFlags → static: Frozen bit flags OR-ed into each result's fields — { Title: 1, Heading: 2, Description: 4, Body: 8 } — so a host can tell a title or heading hit from a body-only one and skip a redundant snippet.
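
To show how a host might combine docEntry and FieldFlags when rendering, here is a small sketch that groups ranked hits by page and picks out body-only matches. The helper names and the stub manifest entries are hypothetical; in real use the lookup would be something like (id) => engine.docEntry(id), and the flag values would come from DeweySearchEngine.FieldFlags.

```javascript
// Same bit values as DeweySearchEngine.FieldFlags, copied so the sketch is self-contained.
const Flags = Object.freeze({ Title: 1, Heading: 2, Description: 4, Body: 8 });

// True when the hit matched in the body and in no other field.
function isBodyOnly(fields) {
  return fields === Flags.Body;
}

// Group ranked results by page, dropping any "#fragment" from the entry URL.
function groupByPage(results, docEntry) {
  const pages = new Map();
  for (const hit of results) {
    const entry = docEntry(hit.docId);
    if (!entry) continue;                       // unknown id: skip the hit
    const page = entry.u.split('#')[0];
    if (!pages.has(page)) pages.set(page, []);
    pages.get(page).push(hit);
  }
  return pages;
}

// Stub manifest entries and results, standing in for a loaded engine.
const stubEntries = [
  { u: '/guide/install', t: 'Installation' },
  { u: '/guide/install#requirements', t: 'Requirements' },
  { u: '/guide/query', t: 'Querying' },
];
const stubResults = [
  { docId: 1, score: 3.2, fields: Flags.Heading | Flags.Body },
  { docId: 0, score: 2.9, fields: Flags.Title },
  { docId: 2, score: 1.1, fields: Flags.Body },
];
const grouped = groupByPage(stubResults, (id) => stubEntries[id] ?? null);
console.log([...grouped.keys()]); // /guide/install, /guide/query
console.log(stubResults.filter((hit) => isBodyOnly(hit.fields)).map((hit) => hit.docId)); // 2
```

Only the body-only hit needs its fragment fetched for a snippet; the title and heading hits can be rendered from the manifest entry alone.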

The module also exports tokenize(text) and stem(word) — the cross-language contract primitives, byte-for-byte mirrors of the C# Tokenizer and Stemmer and pinned by shared conformance fixtures.