Because I am a visual learner, when learning about data structures, I usually find myself looking for or drawing a visual representation of what the structure looks like. But when I took an in-depth look at hashes (similar to dictionaries in other languages), I realized that I didn’t have a very deep understanding of what was going on below the surface.
What is a Hash anyway?
Like a lot of data structures, Ruby provides us with a ready to use class Hash
, which we can use by calling Hash.new
or using the hash literal {}
.
Like other data structures, a hash holds a collection of data, but that data is not ordered. Instead, a hash uses keys (which can be any data type), to point to it’s values. Hashes also provide us lot of benefits over similar data structures (like Arrays or LinkedLists), because they provide us an O1 lookup time for access/searches, insertion, and deletion.
In computer memory, we know arrays are a block of memory set aside next to each other, and that each successive item in the array comes immediately after the next. Similarly, data structures like linked lists always have a “pointer” that gets us to the next item in the structure. But what is going on under the surface with a hash?
Well, it turns out they are usually made of both arrays and linked lists!
How it Works
The primary data structure of a hash is usually something like an array. When we want to store an item, Ruby takes the key we provide it and transforms it into an index by running it through a hashing algorithm. We can do this quickly using the Ruby the .hash
method already built in to Ruby.
"karson".hash
# => 3190424403111594787
"ruby".hash
# => -211952950235035342
Unless we are working with an extremely large array, 3190424403111594787
probably isn’t a valid index of our array, so we can shop it down by dividing the hashcode by the size of the array. After invoking the modulo operator we are returned the remainder, which we can confidently assume will always be an index inside this array. From here, we can begin to store our values.
Let’s write a helper method that can do this quickly for us
class HashTable
...
private
def h(key)
key.hash % @size
# For now let's assume @size represents the size of the array.
end
end
Collisions
If we aren’t using a great hashing algorithm, we run the possibility of getting the same hashcode from the hashing algorithm. Though the probability is unlikely, it does exist and could cause issues where we get multiple hashcodes that point to the same index, and thus the same value. Even more likely, is that when dividing our hashcode by the size of the array, we get the same index back. Both of these issues are called collisions, when we try to store data when there is already something located.
We can avoid these by using a linked list, storing both the key and value pairs, and re-hashing to traverse down the linked list.
If we were using a bad hashing algorithm, time complexity again approaches On, but for the most part, we can safely assume that we avoid this with a good implementation and can stay close to O1 runtime.
Building A Hash From Scratch
Let’s begin building out our own hash from scratch in Ruby to get a better idea of what’s going on. We can begin by initializing our class and setting a few instance variables. I’m going to use @size
to set the default size of our array, and @count
to count how many items we have stored in the array.
class HashTable
def initialize
@size = 10
@array = Array.new(@size) { [] }
@count = 0
end
end
Let’s think about how we want to add items into our hash and how we want to retrieve them – commonly called our getter and setter methods. With hashes, these are usually in bracket notation, luckily for us Ruby provides us a way to access this notation inside of a class. Let’s start by writing some psuedo-code.
class HashTable
...
def []=(key, value)
# 1. Take the key and hash it
# 2. Take the hashcode and divide it by the length of the array
# 3. Take the index and see if there is already an item in that spot
# 4. If there is an item in this index and it has the same key, overwrite the value
# 5. If there is an item in this index and it is not the same key, add the key-value pair to the linked list
# 6. If there is no item in this index, add the key-value pair
end
end
This method is trying to do a lot of things up front, let’s invoke our helper method we defined previously.
class HashTable
...
private
def h(key)
key.hash % @size
end
end
NOTE: Instead of using a linked list to chain through all items with the same index, I decided to use another array for the sake of simplicity.
When our HashTable starts to fill up, we need some way of resizing our array to ensure we avoid collisions and have room to continue storing items. Commonly, this is called a load factor and defaults somewhere around 70%. When 70% capacity is reached, we can double the size of the array. To get started, I created a class constant, another helper method, and an attr_accessor
to quickly get our array.
class HashTable
LOAD_FACTOR = 0.7
attr_accessor :array
def initialize
@size = 10
@array = Array.new(@size) { [] }
@count = 0
end
...
private
def h(key)
key.hash % @size
end
def resize
old_array = @array
@size *= 2
@array = Array.new(@size) { [] }
# Reassign everything to our new array
end
end
Let’s now revisit our setter method we defined above. When attempting to put something inside of our array, this would be the perfect time to check if the array needs to be resized.
- Check if the array needs to be resized. ✅
- Take the key and hash it ✅
- Take the hashcode and divide it by the length of the array ✅
class HashTable
...
def []=(key, value)
resize if (@count.to_f / @size > LOAD_FACTOR)
slot = h(key)
...
end
...
end
Now we have the appropriate index
of which array we need to select, called slot
. Let’s select the array and then check if it already contains a key-value pair in that slot. Following our psuedo code, let’s knock out number 4: Take the index and see if there is already an item in that spot.
class HashTable
...
def []=(key, value)
resize if (@count.to_f / @size > LOAD_FACTOR)
slot = h(key)
kv_pair = arr.find { |pair|
pair[0] == key
}
...
end
...
end
We can address the last few items in our psuedocode with the following if statement:
...
if kv_pair
kv_pair[1] = value
# Overwrite the value of the current key-value with the new value.
else
@count += 1
arr << [key, value]
# Oherwise, increase our counter and add our new key-value pair into the array
end
...
Adding a a getter method that mirrors the structure of our setter, our entire HashTable class should look like the following:
class HashTable
LOAD_FACTOR = 0.7
attr_accessor :array
def initialize
@size = 10
@array = Array.new(@size) { [] }
@count = 0
end
def []=(key, value)
resize if (@count.to_f / @size > LOAD_FACTOR)
slot = h(key)
arr = @array[slot]
kv_pair = arr.find { |pair|
pair[0] == key
}
if kv_pair
kv_pair[1] = value
else
@count += 1
arr << [key, value]
end
end
def [](key)
slot = h(key)
arr = @array[slot]
kv_pair = arr.find { |pair|
pair[0] == key
}
kv_pair[1] if kv_pair
end
private
def h(key)
key.hash % @size
end
def resize
old_array = @array
@size *= 2
@array = Array.new(@size) { [] }
count = 0
old_array.each do |arr|
array.each do |key, value|
self[key] = value
end
end
end
end