
ADR 0010: Index Filename Tokens in BM25 Corpus for Recall

Status

Accepted

Context

ADR 0003 established that file names and paths are stored as metadata (FileNameTerms / PathTerms) and used as post-retrieval filters rather than indexed as BM25 content. This design kept the BM25 corpus clean and avoided skewing IDF for common English words that happen to be popular file-name fragments (e.g. service, test, handler).

However, it introduced a recall gap: if a query term appears only in a file’s name and not in its content, the file never enters the candidate set and therefore cannot be returned. For example, searching for scoring would not return search_scoring.go if the word “scoring” does not appear inside that file’s content.

This violates a fundamental expectation: a developer who types scoring should always find the file called search_scoring.go, regardless of whether the word appears in its body.

Metadata-only storage (ADR 0003) remains appropriate for path-level filters (--path flag), but is insufficient as the sole representation of the file name in retrieval scenarios.

Decision

Add a third pass to BuildIndex in bm25_index_service.go that tokenises each document’s filename via domain.TokenizeFileName and inserts those tokens into the BM25 Terms map.

The pass follows the existing content-indexing passes and applies one rule to avoid corrupting content statistics:

If a term derived from the filename is already present in Terms for that specific document (i.e. the same term also occurs in the file’s content), the filename-derived entry is skipped for that document.

This preserves the content-derived statistics (term frequency and positions) already recorded for that document.

The existing IDF pass, now the fourth pass, runs after the new third pass and computes IDF across the fully populated Terms map, including filename-derived entries.
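To make the effect on IDF concrete, here is a minimal sketch. It assumes the commonly used smoothed BM25 formula ln(1 + (N - n + 0.5) / (n + 0.5)); this ADR does not state which formula the index uses, and the function name idf is hypothetical. It illustrates that every filename-derived entry added by the third pass raises a term's document frequency and therefore lowers its IDF slightly.

```go
package main

import (
	"fmt"
	"math"
)

// idf returns a smoothed BM25 inverse document frequency.
// Assumption: this is the widely used ln(1 + (N-n+0.5)/(n+0.5)) variant,
// not necessarily the formula in bm25_index_service.go.
func idf(totalDocs, docsWithTerm int) float64 {
	n := float64(docsWithTerm)
	N := float64(totalDocs)
	return math.Log(1 + (N-n+0.5)/(n+0.5))
}

func main() {
	// Hypothetical corpus: 100 documents, 4 contain "scoring" in their content.
	before := idf(100, 4)
	// The third pass adds one document that matches only through its filename.
	after := idf(100, 5)
	fmt.Printf("before=%.3f after=%.3f\n", before, after)
}
```

Because the IDF pass runs last, it sees the final document frequencies and no separate correction step is needed.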

Tokenisation delegates to domain.TokenizeFileName (introduced alongside ADR 0009), which handles snake_case, CamelCase, dotted extensions, and path separators uniformly.
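As an illustration of the kind of splitting described here, the following is a rough sketch of a filename tokenizer. It is not the actual domain.TokenizeFileName: the function name tokenizeFileName and its exact rules (lowercasing, keeping the extension as a token, splitting only on lower-to-upper case boundaries) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenizeFileName is a hypothetical stand-in for domain.TokenizeFileName.
// It splits on any non-alphanumeric rune (path separators, dots, underscores,
// hyphens) and on lower-to-upper CamelCase boundaries, then lowercases tokens.
// Limitation of this sketch: acronym runs such as "HTTPServer" stay one token.
func tokenizeFileName(name string) []string {
	var tokens []string
	var current []rune
	flush := func() {
		if len(current) > 0 {
			tokens = append(tokens, strings.ToLower(string(current)))
			current = current[:0]
		}
	}
	var prev rune
	for _, r := range name {
		switch {
		case !unicode.IsLetter(r) && !unicode.IsDigit(r):
			flush() // separator: end the current token
		case unicode.IsUpper(r) && unicode.IsLower(prev):
			flush() // CamelCase boundary: start a new token
			current = append(current, r)
		default:
			current = append(current, r)
		}
		prev = r
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(tokenizeFileName("search_scoring.go"))
	fmt.Println(tokenizeFileName("internal/SearchScoring.go"))
}
```

Under these assumed rules, both spellings of the file name yield the tokens search and scoring, which is what lets a query for scoring match either file.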

// Third pass: index filename tokens for recall
for _, document := range documents {
    fileNameTokens := domain.TokenizeFileName(document.Name)
    fileNameFreqs, fileNamePositions := domain.CountTokenFrequencies(fileNameTokens)
    for term, freq := range fileNameFreqs {
        // Skip terms the content passes already recorded for this document,
        // so content-derived frequency and positions are not overwritten.
        if termStats := index.Terms[term]; termStats != nil {
            if _, alreadyIndexed := termStats.Docs[document.Name]; alreadyIndexed {
                continue
            }
        }
        // The term occurs only in the filename: add it so the document is retrievable.
        index.AddTerm(term, document.Name, freq, fileNamePositions[term])
    }
}
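The skip rule can be exercised in isolation with a self-contained sketch. The types below (posting, index) and the FromFileName flag are simplified, hypothetical stand-ins for the real index structures, introduced only so the example runs on its own.

```go
package main

import "fmt"

// posting is a hypothetical, simplified per-document entry for a term.
type posting struct {
	Freq         int
	FromFileName bool // set only for entries added by the filename pass
}

// index is a hypothetical stand-in: term -> document name -> posting.
type index struct {
	Terms map[string]map[string]posting
}

func (ix *index) add(term, doc string, p posting) {
	if ix.Terms[term] == nil {
		ix.Terms[term] = make(map[string]posting)
	}
	ix.Terms[term][doc] = p
}

// addFileNameTokens applies the skip rule: a filename token is added only
// if the content passes have not already recorded that term for this document.
func (ix *index) addFileNameTokens(doc string, tokens []string) {
	freqs := make(map[string]int)
	for _, t := range tokens {
		freqs[t]++
	}
	for term, freq := range freqs {
		if _, alreadyIndexed := ix.Terms[term][doc]; alreadyIndexed {
			continue // keep the content-derived entry untouched
		}
		ix.add(term, doc, posting{Freq: freq, FromFileName: true})
	}
}

func main() {
	ix := &index{Terms: make(map[string]map[string]posting)}
	// Content pass: "search" occurs three times in the file body.
	ix.add("search", "search_scoring.go", posting{Freq: 3})
	// Filename pass: "scoring" and "go" occur only in the name.
	ix.addFileNameTokens("search_scoring.go", []string{"search", "scoring", "go"})
	fmt.Printf("%+v\n", ix.Terms["search"]["search_scoring.go"])
	fmt.Printf("%+v\n", ix.Terms["scoring"]["search_scoring.go"])
}
```

In this sketch the content entry for search keeps its original frequency, while scoring gains a filename-derived entry that makes the document retrievable.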

Decision Drivers

Consequences

Positive

Negative

Operational Notes