A trie is an effective data structure that could be an excellent addition to your problem-solving toolbox. Let’s see which areas the trie usage is more applicable than a standard set or map implementation. And what other benefits you can get from using it.
Aperitif
I like the name trie more than prefix tree or digital tree, although prefix tree is the most accurate definition. The name trie comes from retrieval but is pronounced as “try,” not “tree.” The idea was to distinguish the trie from the tree.
Try to guess what a trie is. I built trie from 4 words: apple, banana, bastard, ban:
I highly recommend to use a visualization tool from the University of San Francisco to play with a trie.
Trie stores keys that are represented as a sequence (usually a string) compactly. Then you can use it as a set or a hash map to query keys. In addition to that, you get an opportunity to query keys and values by the key prefix.
Naive implementation
I do not focus on the optimal algorithm. The goal of this article is to get you familiar with trie and when you can use it.
Let’s design a desired API:
// Trie represents a Trie data structure.
//
// https://en.wikipedia.org/wiki/Trie
type Trie interface {
// Insert inserts a word into the trie and returns true
// if the word already exists.
Insert(word string) bool
// Contains returns true if an exact match of the word is found, otherwise false.
Contains(word string) bool
// SearchByPrefix finds and returns words by prefix.
SearchByPrefix(prefix string) []string
// StartsWith returns true if there is any word in the trie that starts
// with the given prefix.
StartsWith(prefix string) bool
// Size returns the number of keys in the tree.
Size() int
}
It will be enough to implement these two functions: Insert and Contains, to grasp the idea. And it will be a reasonable basis for the following improvements.
I start by implementing generic rune trie. In Go rune is a superset of ASCII and represents Unicode code points. Think of it as char type in Java or C, but for Unicode.
So, there is Insert function without any optimizations, just the first naive implementation:
// runeTrie is a rune-wise trie implementation.
type runeTrie struct {
root *runeNode
size int
}
type runeNode struct {
children map[rune]*runeNode
last bool
}
// NewRuneTrie creates a rune-wise new trie.
func NewRuneTrie() Trie {
return &runeTrie{root: &runeNode{make(map[rune]*runeNode), false}}
}
// Insert inserts a word into the trie and returns true
// if the word already exists.
func (t *runeTrie) Insert(word string) bool {
exists := true
current := t.root
for _, letter := range word {
n, ok := current.children[letter]
if !ok {
exists = false
n = &runeNode{make(map[rune]*runeNode), false}
current.children[letter] = n
}
current = n
}
current.last = true
if !exists {
t.size++
}
return exists
}
The algorithm is simple:
- Set current node value to the root node.
- For each letter of the word:
- Check if the current node contains a letter and if it does not have it, create a new node for the letter and attach it to the current node.
- But if it includes the letter, get the node under this letter and set it as the current node. Return to 2.
- Mark the last processed node as the end of the word.
Contains function is much simpler and seems straightforward:
// Contains returns true if an exact match of the word is found, otherwise false.
func (t *runeTrie) Contains(word string) bool {
n, _ := t.nodeByPrefix(word)
return n != nil && n.last
}
func (t *runeTrie) nodeByPrefix(prefix string) (*runeNode, rune) {
current := t.root
var r rune
for _, letter := range prefix {
n, ok := current.children[letter]
if !ok {
return nil, 0
}
current = n
r = letter
}
return current, r
}
We traverse a tree by letters until the last node.
Let’s benchmark this implementation. Go has tools to write simple benchmark tests:
func BenchmarkRuneTrieInsert(b *testing.B) {
var r bool
for i := 0; i < b.N; i++ {
t := NewRuneTrie()
for _, w := range words {
r = t.Insert(w)
}
}
benchmarkResult = r
}
And then we run the benchmark:
$ go test -bench=. -benchmem -benchtime=1000x
goos: darwin
goarch: amd64
op 0 allocs/op
BenchmarkRuneTrieInsert-8 1000 7196 ns/op 6984 B/op 129 allocs/op
BenchmarkRuneTrieContains-8 1000 517 ns/op 0 B/op 0 allocs/op
BenchmarkWordHashSetInsert-8 1000 1406 ns/op 1100 B/op 4 allocs/op
BenchmarkWordHashSetContains-8 1000 178 ns/op 0 B/op 0 allocs/op
PASS
Go bench tool does not count stack allocations, so I see that Contains function does not contain any heap allocations, which is excellent.
Trie structure enables us to add one more function that is not feasible for a regular hash map - SearchByPrefix:
// Finds and returns words by prefix.
func (t *runeTrie) SearchByPrefix(prefix string) []string {
node, r := t.nodeByPrefix(prefix)
return search(node, r, []rune(prefix[:len(prefix)-1]))
}
func (t *runeTrie) nodeByPrefix(prefix string) (*runeNode, rune) {
current := t.root
var r rune
for _, letter := range prefix {
n, ok := current.children[letter]
if !ok {
return nil, 0
}
current = n
r = letter
}
return current, r
}
func search(currentNode *runeNode, currentRune rune, prefix []rune) []string {
words := []string{}
if currentNode == nil {
return words
}
newPrefix := append(prefix, currentRune)
if currentNode.last {
words = append(words, string(newPrefix))
}
for letter, node := range currentNode.children {
newWords := search(node, letter, newPrefix)
words = append(words, newWords...)
}
return words
}
The function is not optimized, but it shows that this is possible and not so complex. It calls the function defined earlier nodeByPrefix
.
Applications
Autocomplete
Since the trie allows to search words by prefix, it is a good candidate for autocompletion algorithms.
Routing
fasthttp library for Go uses trie as an internal implementation for routing. It is not only the fastest (probably) implementation of the routing, but it also does not produce any garbage.
You can dive deeper if you wish.
Other options
If you need:
- A set in which you might need to perform a lookup by the prefix.
- Compactly store a lot of similar data.
- You build the structure once and query it after.
Trie might be a good candidate for you.
For example, you build a set of domains, including subdomains in memory at the application startup, and then only query it. You might consider a trie for this task.
But anyway, always check your performance assumptions with benchmarks.
If you do not need to lookup by prefix and your data structure is changing a lot, a hash map or a hash set is a better alternative.
Known Trie implementations
I found two exciting libraries with an exemplary implementation of trie:
You can also use mine, but it is not optimized and created for learning purposes