# Hash indexes

Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible <mark style="background-color:orange;">indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found</mark>, as illustrated in Figure 3-1. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.

<figure><img src="https://415484505-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LxtoAXZwwOc4XGto8vb%2Fuploads%2FRP3xp1hLBh6d2Uc5OyQq%2FScreenshot%202024-01-21%20at%2015.26.09.png?alt=media&#x26;token=185403e1-d81a-42e9-b2a7-67d7268bff16" alt=""><figcaption></figcaption></figure>

The data file is on disk, the index is in RAM. If the requested part of the data file is already in the filesystem cache, a read requires no disk I/O at all.

Since we only ever append to the file, how do we avoid running out of disk space? Split the log file into segments and compact them (keep only the most recent value for each key):

<figure><img src="https://415484505-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LxtoAXZwwOc4XGto8vb%2Fuploads%2F5FnLyhSfCSwBUPuePaVJ%2FScreenshot%202024-01-21%20at%2015.35.50.png?alt=media&#x26;token=24af2432-91be-44b6-b215-a454be75795c" alt=""><figcaption></figcaption></figure>

```
background:
   compact frozen segments and merge them => write to new files
   onFinish: switch read requests to use new merged and compacted files
      => delete old segment files
```
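The merge step above can be sketched as follows (a simplification that treats each frozen segment as a key→value dict already read from its file):

```python
def compact_and_merge(segments):
    """Merge frozen segments into one compacted segment, keeping only the
    most recent value for each key."""
    merged = {}
    for segment in segments:   # iterate oldest first, so later writes win
        merged.update(segment)
    return merged
```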

<mark style="background-color:orange;">Each segment now has its own in-memory hash table, mapping keys to file offsets.</mark>

```
segment 1 <-> hash index 1 (oldest)
segment 2 <-> hash index 2
segment 3 <-> hash index 3 (youngest)

Find key X. 
Look up index 3 for key X. If not found look up index 2, etc.
```
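The lookup order above might look like this (a sketch where each index is a key→offset dict and the list is ordered oldest to newest):

```python
def lookup(key, indexes):
    """Check segment indexes from newest to oldest; the first hit wins,
    because newer segments contain the more recent writes."""
    for index in reversed(indexes):   # indexes ordered oldest -> newest
        if key in index:
            return index[key]
    return None
```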

### Important aspects

#### File format

Instead of CSV, use a binary format that prefixes each record with its length in bytes. It is faster to parse and simpler: no escaping or delimiter handling is needed.
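A sketch of such a length-prefixed record layout (the 4-byte big-endian lengths are an assumption, not a prescribed format):

```python
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    """Binary record: 4-byte key length, 4-byte value length, then the raw
    bytes -- the reader never has to scan for delimiters."""
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    """Return (key, value, next_offset) for the record starting at offset."""
    klen, vlen = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    key = buf[start:start + klen]
    value = buf[start + klen:start + klen + vlen]
    return key, value, start + klen + vlen
```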

#### Deleting records

Append a special deletion record (a tombstone). During compaction it tells the merge process to discard all previous values for the deleted key.
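A sketch of how a tombstone takes effect at compaction time (here `None` stands in for the tombstone marker; a real format would use a dedicated record type):

```python
TOMBSTONE = None  # special "deleted" marker appended to the log

def compact(records):
    """Replay records in log order; a tombstone removes the key from the
    compacted result, so earlier values for it are dropped."""
    latest = {}
    for key, value in records:
        if value is TOMBSTONE:
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest
```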

#### Crash recovery

If the database restarts, the in-memory hash index is lost. Rebuilding it by scanning the entire log is slow, so the idea is to keep snapshots of each segment's index on disk and load those instead.

#### Concurrency control

Writes are appended in strictly sequential order, so a common choice is a single writer thread. Segments are append-only and otherwise immutable, so they can be read concurrently by many reader threads.

{% hint style="success" %}
Why append and compact instead of updating the value in place?

* Appending and segment merging are sequential writes, which are much faster than random writes.
* Concurrency and crash recovery are much simpler: there is no risk of a value being left partly old and partly new for a key.
{% endhint %}

{% hint style="warning" %}
Limitations of hash indexes:

* The hash table must fit in RAM.
* Range queries are not efficient: every key has to be looked up individually.
{% endhint %}

