Jared Rader

TIL - GIN Indexes

Been meaning to post about this for a while, but I recently learned about Generalized Inverted Indexes. According to the Postgres docs:

GIN indexes are “inverted indexes” which are appropriate for data values that contain multiple component values, such as arrays. An inverted index contains a separate entry for each component value, and can efficiently handle queries that test for the presence of specific component values.

I plan to write a longer post about my exploration into this, including the pros and cons of denormalization, but I’ll give a short overview of a problem I was trying to solve and how the GIN index came into play.

At Reflektive, I’ve been working on a user profile feature that gives each person a personalized newsfeed of the feedbacks they have sent or received.

Over time, the types of feedback we have in Reflektive has grown, and with each type of feedback, specific rules around who can see the feedback. For example, a user can give somebody private feedback, and only those two individuals should be allowed to see it. Or, Person A can write a private note about Person B, and not share it with Person B. While this is still stored as a Feedback in our database, and Person B is the recipient, Person B should not be allowed to see that Feedback.

In total, we have 9 different types of feedbacks, and unless the feedback is public, the feedback is private and shared with, at most, 3 people.

Querying for these feedbacks got quite complicated, because at the time, we were using a combination of attributes and checking associations to determine the type of feedback. This led to a class that looked something like this (actual queries not shown):

class FeedbackReceivedByUser
  def execute
    gather_queries.inject(:union)
  end

  private

  def gather_queries
    case viewer
    when :admin admin_queries
    when :manager, :peer then manager_and_peer_queries
    when :self then profile_owner_queries
    end
  end

  def admin_queries
    [
      public_received,
      private_received,
      sent_by_viewer,
      feedback_requested_by_viewer
    ]
  end

  def manager_and_peer_queries
    [
      public_received,
      sent_by_viewer,
      feedback_requested_by_viewer
    ]
  end

  def profile_owner_queries
    [public_received, owner_sent_to_self, shared_with_owner]
  end
end

The #execute method makes use of the ActiveRecord Union gem which allows me to UNION together the separate queries into one.

In thinking about the main goal, I realized the question I was answering was just, which feedbacks are visible to the person looking at the profile?

This got me thinking - maybe I could use a bit of denormalization and store an array, called visible_to of the user IDs of the people allowed to view the private feedbacks on each feedback record (empty arrays would signify that the feedback is visible to everyone). Then, instead of writing several separate queries, I could simplify it to one query that simply said ‘Fetch all feedbacks Person B has sent or received that are visible to Person A’.

I knew Postgres allowed storing array datatypes, and this seemed like a promising use case. However, I wanted to be sure that the query would be performant. I noticed another model in our code did something similar using the ANY operator. I could use that as well and construct the query like this:

SELECT feedbacks.* FROM feedbacks
WHERE (?) = ANY(feedbacks.visible_to)
OR feedbacks.visible_to IS NULL

However, in researching this, I realized the performance would be really slow on large tables as it doesn’t use any index, so each record would have to be checked individually whether an ID existed in the array. I found out it was possible to use a GIN index to attain faster lookup performance. Even better, this was an available ActiveRecord option for creating a database index.

In the end, I was able to create a scope on the Feedback class called visible_to and use some custom SQL to make use of the GIN-indexed visible_to array:

class Feedback < ApplicationRecord

  scope :visible_to, ->(user_id) do
    where('visible_to @> ARRAY[?] OR visible_to IS NULL', user_id))
  end
end

And my many separate queries could be simplified to:

def viewable_received_feedbacks
  user.received_feedbacks.visible_to(query_user.id)
end

My new class looked something like this:

class FeedbackReceivedByUser
  def execute
    gather_queries.inject(:union)
  end

  private

  def gather_queries
    case viewer
    when :admin admin_queries
    when :manager, :peer, :self then viewable_received
    end
  end

  def admin_queries
    [viewable_received, viewable_to_admin]
  end

  def viewable_received_feedback_query
    [viewable_received]
  end

  def viewable_to_admin
    target_user.received_feedbacks.where.not(sender: query_user)
  end

  def viewable_received
    target_user.received_feedbacks.visible_to(query_user.id)
  end
end

For managers, peers, and folks looking at their own profile, I have simplified the many queries down to one - it simply fetches the feedbacks that are visible to that person.

Performance

In addition to simpler queries, I found I got a performance boost in all cases using the denormalized visible_to array with a GIN index. Depending on the scenario of the type of viewer, performance gains ranged between 2 to 5 times faster than UNIONing a bunch of separate queries.

looking at self

Comparison:
Using GIN array:         3584.5 i/s
Using old queries:       717.5 i/s - 5.00x  slower

looking at direct report

Comparison:
Using GIN array:         1063.4 i/s
Using old queries:       527.7 i/s - 2.02x  slower

looking at peer

Comparison:
Using GIN array:        1160.0 i/s
Using old queries:      566.0 i/s - 2.05x  slower