Hash Indexes on Postgres

If you ever looked at the Postgres documentation on index types you must have seen the prominent box screaming caution when you get to the hash type section.

Hash indexes are currently discouraged for not being WAL-logged. As the documentation says, this means, that in case of a crash, the index may get corrupted. Which will require a reindex operation when the server starts again.

With such a big warning sign, one might ask, why does Postgres still supports Hash indexes? Are there any advantages that may be worth the risks?

In theory, hash indexes should deliver better performance for equality operations when compared to btree indexes. This is because a search on a hash is O(1) while on binary tree it's O(log(N)). But hows does this reflect on practice?

So, I did a bit of experimentation to check weather or not there are any advantages with hash indexes.

First, we need to generate a reasonable amount of data. I did this by running this ruby command:

ruby -e '(1..100000000).each { |i| puts i.to_s.rjust(20, "0") }' > data.txt  

This will create a file with 100M lines in the following format.

00000000000000000001  
00000000000000000002  
...
00000000000099999999  
00000000000100000000  

Next, we can run psql and import this data to two tables. One that we will be indexing with a btree and other that we will be indexing using a hash index.

CREATE TABLE b_table (id VARCHAR(256));  
CREATE TABLE h_table (id VARCHAR(256));

COPY b_table FROM '/PATH/data.txt';  
COPY h_table FROM '/PATH/data.txt';  

Once the import is finished, we can start creating the indexes. For this part I'm enabling the timing flag on psql. You can do this by running \timing. This will print the time it takes to execute each command.

Now, lets run the commands to create indexes on both tables.

CREATE INDEX b_table_id_index ON b_table USING btree (id);

CREATE INDEX  
Time: 3800872.035 ms

CREATE INDEX h_table_id_index ON h_table USING hash (id);

CREATE INDEX  
Time: 290529.397 ms  

The output we get here is very interesting. The btree index takes about an hour to be created (not suprising for 100M records) but the hash index takes only a bit under 5 minutes for the same data. This is a somewhat unexpected find but it's something to keep in mind as this is might be very useful in some use cases.

What about the size of the indexes? We can easily find that out by running the following query:

SELECT relname, pg_size_pretty(pg_relation_size(C.oid))  
FROM pg_class C  
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)  
WHERE nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast') AND relkind = 'i';

     relname      | pg_size_pretty 
------------------+----------------
 h_table_id_index | 4096 MB
 b_table_id_index | 3873 MB
(2 rows)

The results shows that the hash index takes a bit more space than the btree. So, if the concern is space, a hash index will actually perform worse than a btree. As a side note, if space is a major concern for you, a BRIN index might be the way to go, but this feature is only available in Postgres 9.5.

Now, for the test that I'm more excited to see, the read performance. I wrote at the start of this post that, in theory, equality operations should be faster using a hash index. To test if this is true in practice, I wrote a small ruby script to run a benchmark.

require 'pg'  
require 'benchmark/ips'

Benchmark.ips do |x|  
  conn = PG.connect(dbname: 'test')

  seed = Random.new_seed
  b_random = Random.new(seed)
  h_random = Random.new(seed)

  x.report('btree') do
    key = b_random.rand(100000000).to_s.rjust(20, '0')
    conn.exec("SELECT * FROM b_table WHERE id = '#{key}'")
  end

  x.report('hash') do
    key = h_random.rand(100000000).to_s.rjust(20, '0')
    conn.exec("SELECT * FROM h_table WHERE id = '#{key}'")
  end

  x.compare!
end  

I'm using the benchmark-ips gem here. It will run each example for 5 seconds and output how many iterations per second each one is able to perform.

bundle exec ruby bench.rb  
Warming up --------------------------------------  
               btree   262.000  i/100ms
                hash   361.000  i/100ms
Calculating -------------------------------------  
               btree      2.919k (±13.2%) i/s -     14.410k
                hash      3.777k (±14.1%) i/s -     18.411k

Comparison:  
                hash:     3776.5 i/s
               btree:     2918.7 i/s - same-ish: difference falls within error

The output is a bit surprising. Even though there's more iterations per second on the example with the hash index, the results are close enough that the benchmark is unable to determine which one performs better.

Not happy, with this result, I went ahead and did a bit of digging around to understand why isn't the difference more prominent. First of all, lets look at the query plans on each table.

EXPLAIN ANALYSE SELECT id FROM b_table WHERE id = '00000000000000000001';  
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using b_table_id_index on b_table  (cost=0.57..8.59 rows=1 width=21) (actual time=3.913..3.913 rows=0 loops=1)
   Index Cond: (id = '123123123'::text)
   Heap Fetches: 0

EXPLAIN ANALYSE SELECT id FROM h_table WHERE id = '00000000000000000001';  
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using h_table_id_index on h_table  (cost=0.00..8.02 rows=1 width=21) (actual time=1.526..1.529 rows=1 loops=1)
   Index Cond: ((id)::text = '00000000000000000001'::text)

There's a small but fundamental difference between the two query plans. With the btree index, Postgres performs an Index Only Scan while with hash index it opts for an Index Scan. This is very relevant. "Index only scans" were launched with Postgres 9.2 and basically check if all the data that you need is in the index, if so, it doesn't access the actual table row. By doing this, it drastically reduces the amount of IO required for the operation. You can read more about this here and here.

Unfortunately Index only scans are not yet supported on hash indexes.

This skews our benchmark. But, it's worth to check what would happen if the Index only scan was not an option. We could do this by adding more columns to our test tables or just by disabling index only scans. I'm doing the later because it's a lot easier and should yield the same results. Postgres allows you to set a bunch of flags that will influence how the query optimiser will chose what operations are best to fetch your data. To disable index only scans, you can run the following command on psql.

SET enable_indexonlyscan = off;  

With this feature turned off, we can run our benchmark again, and now, both indexes will be used under the same conditions.

bundle exec ruby bench.rb  
Warming up --------------------------------------  
               btree   283.000  i/100ms
                hash   363.000  i/100ms
Calculating -------------------------------------  
               btree      2.912k (±15.5%) i/s -     14.150k
                hash      3.967k (± 6.1%) i/s -     19.965k

Comparison:  
                hash:     3966.7 i/s
               btree:     2911.6 i/s - 1.36x slower

Right now the difference between the two indexes becomes more apparent with the btree lookup being 1.36x slower. This is more in line with the theoretical expectations we had.

In conclusion, there seems to be some advantages from using hash indexes. The time to create the index seems to be the biggest advantage they offer. The read performance is a bit better but not as much that I think it justifies the risks.

It also feels like hash indexes aren't getting much "love" recently as it feels like there's an incomplete implementation. I'd probably hold back on using it production right now, but I hope there will be new developments in the future that will realise the full potential of this feature.