Hashing
Why we need hashing
To achieve horizontal scaling, it is important to distribute requests/data efficiently across servers.
Traditional (modulus) Hashing
If you have n cache servers, a common way to balance the load is to use the following hash method:
serverIndex = hash(key) % n --where n is the number of servers in the pool
Suppose we have 4 servers in the pool and 8 string keys with their hashes:
| key | hash | hash % 4 |
|---|---|---|
| key0 | 18358617 | 1 |
| key1 | 26143584 | 0 |
| key2 | 18131146 | 2 |
| key3 | 35863496 | 0 |
| key4 | 34085809 | 1 |
| key5 | 27581703 | 3 |
| key6 | 38164978 | 2 |
| key7 | 22530351 | 3 |
To fetch the server where the key is stored, we perform the modular operation f(key) % 4. So hash(key0) % 4 means the client must contact server 1 fetch the cached data.
This approach works well when the size of the server pool doesn’t change. However, if new servers are added or existing servers removed, the hashing algorith changes. For example, if we removed a server, the hash algorith is now hash(key) % 3. If an existing client already had data in the cache, and they used this updated hash algorithm, they will receive a different server index that doesn’t contain their cached data. This results in cache misses. When one server goes offline or is removed, most cache clients will connect to the wrong servers to fetch data. Consistent Hashing is a method to fix this problem.
Consistent Hashing
Consistent hashing is a technique used in distributed systems to divide data among multiple caching servers or nodes. It aims to evenly distribute the data and minimize the amount of data that needs to be moved when nodes are added or removed from the system.
With consistent hashing, the hash space is represented as a ring, also known as a hash ring. Each server is assigned a position on the ring based on its hash value. The data is also hashed, and its hash value is mapped onto the ring. To determine which server should store the data, the position of the data’s hash value is found on the ring, and the next server in a clockwise direction on the ring becomes the data’s assigned server.
This approach provides several advantages:
- Load balancing: Since the servers are evenly distributed on the ring, the data is also distributed evenly, minimizing hotspots and ensuring a balanced load across the nodes.
- Scalability: When a new server is added, only a portion of the data needs to be remapped to the new server, reducing the overall amount of data movement. Similarly, when a server is removed, only the data assigned to that server needs to be redistributed.
- Fault tolerance: In the event of a server failure, only the data assigned to that server needs to be remapped, minimizing the impact on the overall system.
- Consistency: The term “consistent” in consistent hashing refers to the stability of the mapping between data and servers. In traditional hashing, small changes in the number of servers can drastically change the assignment of data, but consistent hashing minimizes such changes.
Overall, consistent hashing allows for efficient and dynamic data distribution in distributed systems, enabling scalability, fault tolerance, and load balancing.