Designing a Toy Search Engine with Trie and TF-IDF

A search engine, as the name suggests, is a tool that helps you search through your content. It efficiently retrieves relevant information from a vast pool of data, allowing users to find specific documents, web pages, or files quickly. By indexing content and processing search queries, search engines provide a streamlined way to access the information you need, whether it's for research, entertainment, or everyday tasks.

Technically, a search engine consists of two parts: an indexer and a query engine. The indexer takes raw content or documents as input and turns them into an inverted index that maps words.

"dogs are cute" → doc1  
"cats are cute" → doc2  

Inverted index:
"dogs" → [doc1]
"are"  → [doc1, doc2]
"cute" → [doc1, doc2]
"cats" → [doc2]

The query engine searches the inverted index to find the content or documents you need based on the search query.

What matters is how you design your indexer and the query engine!!!

For this project, I am using Golang, but the concept and implementation are language-agnostic.

Indexer

I used the simplest implementation for the indexer, and here's what it does:

Takes input: a list of documents/content (array of strings).
For each document:
- Assign a docID.
- Break the text into words.
- Update the inverted index.

The in-memory indexer looks like -

package index

type DB map[string][]string

type Dict []string

type InMemoryIndexer struct {
    document map[string]string
    db       DB
    dict     Dict
}

The document is an array of strings that need to be indexed. The db is the inverted index mapping of the documents, and the dict acts as a dictionary for the keys in the db.

And the implementation goes like

func (i *InMemoryIndexer) Index() error {
    for key, doc := range i.document {
        tokens := strings.Split(doc, " ")
        for _, token := range tokens {
            i.db[token] = append(i.db[token], key)
            i.dict = append(i.dict, token)
        }
    }
    return nil
}

func (i *InMemoryIndexer) GetIndex() DB {
    return i.db
}

func (i *InMemoryIndexer) GetDict() []string {
    return i.dict
}

To set up a new indexer, you can create an initializer like this:

func NewInMemoryIndexer(document map[string]string) *InMemoryIndexer {
    return &InMemoryIndexer{
        document: document,
        db:       make(DB),
        dict:     Dict{},
    }
}

This is the simplest implementation of an indexer. You can enhance the indexer by improving its components:

Tokenizer → Can handle punctuation, convert to lowercase, etc.
Reducing words to their root (e.g., running → run)
Dropping uninformative words (e.g., in, the, are)

Now, let’s start with the query engine!!

Query Engine

The logic behind the query engine is straightforward:

Input: a query string.
Break it into words.
Look up each word in the inverted index.
Return matching document IDs and map them back to the original content.

The performance of your query engine depends on the algorithm you use to extract and locate the query string within the content.

For the query engine, you can use the following improved terms:

Prefix Matcher → A tool that identifies all words sharing the same prefix as the query string.
Document Locator → Identifies the documents containing the words found by the prefix matcher.
Relevance Scorer → Evaluates and ranks the documents to determine which is most relevant to the query string.

Prefix Matcher

I used a simple Trie data structure to implement prefix searching over all the words in the dictionary. This helps find words similar to those in the query string.

In a Trie, each node represents a single character of a string. The root node is typically empty, and each subsequent level of the Trie represents the next character in the strings being stored. By traversing the Trie from the root to a particular node, you can reconstruct the prefix that leads to that node. This makes it easy to find all words that share a common prefix.

The implementation of a Trie is as follows:

type Node struct {
    Char     string // to store the character of a node
    Children [26]*Node // to store children nodes
    End      int64 // to store the index of the doc from the dictionary being stored
}

func NewNode(char string) *Node {
    node := &Node{Char: char}
    for i := 0; i < 26; i++ {
        node.Children[i] = nil
    }
    node.End = -1
    return node
}

type Trie struct {
    RootNode *Node
}

func NewTrie() *Trie {
    root := NewNode("\000")
    return &Trie{RootNode: root}
}

For example, let's use this dictionary and query string.

dict := []string{"dog","don","dig","doll","deaf","doggy","dob"}
querystring := "do"

Insertion in a Trie

To insert a word into a Trie, you start at the root node and follow the path corresponding to each character in the word. If a character path does not exist, you create a new node for that character. Once you reach the end of the word, you mark the final node as the end of a word. This process ensures that all prefixes of the word are represented in the Trie.

The implementation for inserting a word into a Trie is as follows:

func (t *Trie) Insert(word string, index int64) {
    current := t.RootNode
    strippedWord := strings.ToLower(strings.ReplaceAll(word, " ", ""))

    for i := 0; i < len(strippedWord); i++ {
        charIndex := strippedWord[i] - 'a'
        if current.Children[charIndex] == nil {
            current.Children[charIndex] = NewNode(string(strippedWord[i]))
        }
        current = current.Children[charIndex]
    }
    current.End = index
}

Searching in a Trie

To search for a word or a prefix in a Trie, you start at the root and follow the path corresponding to each character in the query string. If you can follow the path for all characters in the query, the prefix exists in the Trie. If you reach a node that does not have a path for the next character in the query, the prefix does not exist. For prefix matching, once you reach the end of the query string, you can explore all child nodes to find all words that share the prefix. To find the word, I used DFS, which stands for depth-first search, to locate words from the root to the leaf node.

func (t *Trie) Search(prefix string) []string {
    var results []string
    strippedPrefix := strings.ToLower(strings.ReplaceAll(prefix, " ", ""))
    current := t.RootNode

    for i := 0; i < len(strippedPrefix); i++ {
        index := strippedPrefix[i] - 'a'
        if current.Children[index] == nil {
            return results
        }
        current = current.Children[index]
    }

    t.dfs(current, &results)

    return results
}

func (t *Trie) dfs(node *Node, results *[]string) {
    if node == nil {
        return
    }

    if node.End != -1 {
        *results = append(*results, strconv.FormatInt(node.End, 10))
    }

    for i := 0; i < 26; i++ {
        if node.Children[i] != nil {
            t.dfs(node.Children[i], results)
        }
    }
}

Ref: https://medium.com/@itachisasuke/implementing-a-search-engine-in-golang-trie-data-structure-c45152ddda24

So, the implementation inside the Query engine looks like this:

terms := strings.Split(queryString, " ")

allTerms := []string{}

for _, term := range terms {
    for _, index := range i.trie.Search(term) {
        idx, _ := strconv.Atoi(index)
        allTerms = append(allTerms, i.dict[idx])
    }
}

allTerms := []string{"dog", "doggy", "don", "doll"}

Document Locator

Since all the words with the same prefix for the query string have been found, we now need to locate the document that contains these words or a combination of these words.

The implementation ensures unique document collection by iterating through terms, checking each document's presence in a set, and adding only unseen documents to the final list, preventing duplicates.

documents := []string{}
docSet := make(map[string]struct{})

for _, term := range allTerms {
    docArray := i.db[term]
    for _, doc := range docArray {
        _, exists := docSet[doc]
        if !exists {
            documents = append(documents, doc)
            docSet[doc] = struct{}{}
        }
    }
}

We have all the valid documents for the query string, but we don't know which document is the most relevant and appropriate.

Ranking

To rank the documents' relevance, I have used TF-IDF.

It’s a way to rank search results so that the most “important” documents show up first.

TF (Term Frequency):
How often does a word appear in a document?
- If “dogs” appears 5 times in a doc, it’s more relevant than if it appears once.
IDF (Inverse Document Frequency):
How rare is the word across the whole collection?
- Common words like “are”, “the” appear in many docs → less useful.
- Rare words like “playful” appear in fewer docs → more useful.
TF × IDF = Score:
- High score → word is frequent in this doc, but rare overall.
- Low score → word is either too common or not frequent.

The implementation simply follows what has been described.

rankings := make(map[string]float64)

totalDocuments := len(i.documents)

for _, docID := range documents {
    docContent, ok := i.documents[docID]
    if !ok {
        continue
    }
    score := 0.0
    docTerms := strings.Split(docContent, " ")
    if len(docTerms) == 0 {
        continue
    }

    for _, term := range allTerms {
        termCount := 0
        for _, docTerm := range docTerms {
            if docTerm == term {
                termCount++
            }
        }

        if termCount > 0 {
            tf := float64(termCount) / float64(len(docTerms))

            numDocsWithTerm := len(i.db[term])
            if numDocsWithTerm == 0 {
                continue
            }
            idf := math.Log(float64(totalDocuments) / float64(numDocsWithTerm))

            score += tf * idf
        }
    }
    rankings[docID] = score
}

    // sort documents by score
sortedDocuments := make([]string, 0, len(documents))
for docID := range rankings {
    sortedDocuments = append(sortedDocuments, docID)
}

sort.Slice(sortedDocuments, func(i, j int) bool {
    return rankings[sortedDocuments[i]] > rankings[sortedDocuments[j]]
})

That's it! By combining all three components, you have a functional query engine. You can further customize it by changing the ranking model to something like BM25.

The combined code looks like this:

package query

import (
    "context"
    "encoding/json"
    "math"
    "sort"
    "strconv"
    "strings"
    "time"

    "github.com/redis/go-redis/v9"
    "github.com/vr-varad/search-engine/index"
)

type InMemoryQueryEngine struct {
    db        index.DB
    dict      index.Dict
    documents map[string]string
    rdb       *redis.Client
    trie      *Trie
}

func NewInMemoryQueryEngine(db index.DB, dict index.Dict, documents map[string]string, client *redis.Client) *InMemoryQueryEngine {
    trie := NewTrie()
    for index, word := range dict {
        trie.Insert(word, int64(index))
    }

    return &InMemoryQueryEngine{
        db:        db,
        dict:      dict,
        documents: documents,
        rdb:       client,
        trie:      trie,
    }
}

func (i *InMemoryQueryEngine) Search(ctx context.Context, queryString string) ([]string, error) {

    if len(queryString) == 0 {
        return []string{""}, nil
    }

    cachedResults, err := i.rdb.Get(ctx, queryString).Result()
    if err == nil {
        var results []string
        if err := json.Unmarshal([]byte(cachedResults), &results); err == nil {
            return results, nil
        }
    }

    terms := strings.Split(queryString, " ")

    allTerms := []string{}

    for _, term := range terms {
        for _, index := range i.trie.Search(term) {
            idx, _ := strconv.Atoi(index)
            allTerms = append(allTerms, i.dict[idx])
        }
    }

    documents := []string{}
    docSet := make(map[string]struct{})

    for _, term := range allTerms {
        docArray := i.db[term]

        for _, doc := range docArray {
            _, exists := docSet[doc]
            if !exists {
                documents = append(documents, doc)
                docSet[doc] = struct{}{}
            }
        }
    }

    // ranking
    rankings := make(map[string]float64)

    totalDocuments := len(i.documents)

    for _, docID := range documents {
        docContent, ok := i.documents[docID]
        if !ok {
            continue
        }
        score := 0.0
        docTerms := strings.Split(docContent, " ")
        if len(docTerms) == 0 {
            continue
        }

        for _, term := range allTerms {
            termCount := 0
            for _, docTerm := range docTerms {
                if docTerm == term {
                    termCount++
                }
            }

            if termCount > 0 {
                tf := float64(termCount) / float64(len(docTerms))

                numDocsWithTerm := len(i.db[term])
                if numDocsWithTerm == 0 {
                    continue
                }
                idf := math.Log(float64(totalDocuments) / float64(numDocsWithTerm))

                score += tf * idf
            }
        }
        rankings[docID] = score
    }

    // sort documents by score
    sortedDocuments := make([]string, 0, len(documents))
    for docID := range rankings {
        sortedDocuments = append(sortedDocuments, docID)
    }

    sort.Slice(sortedDocuments, func(i, j int) bool {
        return rankings[sortedDocuments[i]] > rankings[sortedDocuments[j]]
    })

    // cache the results
    resultsJSON, err := json.Marshal(sortedDocuments)
    if err == nil {
        i.rdb.Set(ctx, queryString, resultsJSON, 30*time.Second) // Cache for 30 seconds
    }

    return sortedDocuments, nil
}

This is a basic implementation of a search engine and serves as the foundation for industry-level search engines like ElasticSearch.

There are several improvements and modifications you can make to this, including:

Refactoring the indexing process to cover a wider range of indexing.
Adding persistent data storage.
Trying different ranking algorithms.

This blog mostly covers the implementation. You can add an HTTP server in front to make it an API. Then, create a frontend and connect them both. Boom! You have a fully functional search engine.

Search engines are fast. Keep this in mind, as the implementations above have some bottlenecks and latency. I will work on these issues and update the code on this GitHub link.

Conclusion

This project shows how we can build a simple search engine using a Trie and TF-IDF to index and rank documents. It provides a clear understanding of the basics of searching and ranking, while also leaving room for future enhancements, such as permanently saving data and utilizing more effective ranking methods. With more work, this small project can turn into a much stronger and more useful search system.
You can check out more of my work here: GitHub | LinkedIn.

Command Palette