Hash indexes
Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found, as illustrated in Figure 3-1. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
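A minimal sketch of this idea in Python (the class name, the "key,value" line format, and the file layout are illustrative assumptions, not from the source):

```python
import os

class LogStore:
    """Append-only log file plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # key -> offset of the most recent record
        open(path, "ab").close()  # create the file if it does not exist

    def put(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()               # where this record will start
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset            # works for both insert and update

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                  # jump straight to the record
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]
```

Updating a key simply appends a new record and moves the offset in the map, so `get` always sees the latest value.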
The data file lives on disk, but the index is entirely in RAM. If part of the data file is already in the filesystem cache, a read requires no disk I/O at all.
Since we only ever append to the file, how do we avoid eventually running out of disk space? Break the log into segments and compact them periodically (keeping only the most recent value for each key):
Each segment now has its own in-memory hash table, mapping keys to file offsets; a lookup checks the most recent segment's table first, then falls back to older segments.
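A compaction pass over one segment can be sketched like this (assuming the simple "key,value" line format from above; later records for a key win):

```python
def compact(segment_lines):
    """Keep only the most recent value for each key in a segment.

    `segment_lines` is a list of "key,value" records in append order.
    """
    latest = {}
    for line in segment_lines:
        key, value = line.split(",", 1)
        latest[key] = value  # later appends overwrite earlier ones
    return [f"{k},{v}" for k, v in latest.items()]
```

The same idea extends to merging several segments at once: feed their records in oldest-to-newest order and write the surviving records into a fresh segment.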
Use a binary format instead of CSV: it is faster to parse and simpler, since no escaping of commas or newlines is needed.
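A hypothetical length-prefixed binary encoding illustrates the point: because each field carries its byte length up front, keys and values may contain any bytes at all, with no escaping:

```python
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    # Two 4-byte big-endian lengths, then the raw key and value bytes.
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    key_len, val_len = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + val_len]
    # Also return the offset just past this record, for sequential scans.
    return key, value, start + key_len + val_len
```

This is only a sketch of the idea; a production format would also carry a checksum and perhaps a timestamp per record.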
To delete a key, append a special deletion record (a tombstone); during compaction it tells the merging process to discard all previous values for that key.
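Extending the compaction sketch with tombstones (the sentinel value here is a made-up marker, not from the source):

```python
TOMBSTONE = "__tombstone__"  # hypothetical sentinel marking a deleted key

def compact_with_deletes(segment_lines):
    """Merge "key,value" records, dropping keys whose latest record is a tombstone."""
    latest = {}
    for line in segment_lines:
        key, value = line.split(",", 1)
        latest[key] = value
    return [f"{k},{v}" for k, v in latest.items() if v != TOMBSTONE]
```

Note that the tombstone itself must survive until every older segment that still mentions the key has been merged away; only then is it safe to drop.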
If the database restarts, the in-memory hash indexes are lost. They could be rebuilt by scanning every segment, but that is slow; the idea is to keep snapshots of each segment's hash map on disk and load them on startup.
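One way to sketch snapshotting an index (JSON purely for illustration; a real system would use a compact binary format):

```python
import json, os

def save_index(index, path):
    # Write atomically: dump to a temp file, then rename over the snapshot,
    # so a crash mid-write never leaves a half-written snapshot behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(index, f)
    os.replace(tmp, path)

def load_index(path):
    if not os.path.exists(path):
        return {}  # no snapshot: caller must rebuild by scanning segments
    with open(path) as f:
        return json.load(f)
```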
Writes go through a single writer thread; since segments are append-only and otherwise immutable, they can safely be read by many reader threads concurrently.
Why append and compact instead of simply updating the value in place within a segment?
Appending and segment merging are sequential writes, which are generally much faster than random writes.
Concurrency and crash recovery are simpler: a stored value can never end up partly old and partly new.
Limitations of hash index:
The hash table must fit in RAM (one entry per key), so it works poorly with a very large number of keys.
Range queries are not efficient: keys are not stored in sorted order, so every key in the range must be looked up individually.