Permify Bug: Concurrent Transactions Snapshot Issue

by Dimemap Team

Hey guys! Ever run into a situation where your database seems a little… off? That's what we're diving into today: a bug in Permify where concurrent write transactions can grab the same snapshot, so one transaction's changes can get lost in the shuffle. Let's break down what's happening, how to reproduce it, and what you can do about it. It's a bit of a head-scratcher, but we'll get through it together.

Understanding the Bug in Permify

The core issue is how Permify handles concurrent write transactions. Sometimes two different write operations end up using the same snapshot of the database. Think of a snapshot as a frozen moment in time. When two transactions start from the same frozen moment, neither one sees the change the other makes, and the results can become inconsistent. In the case described, if two transactions delete different relations, only one of the deletions might show up in subsequent read requests, because both transactions operated on the same initial snapshot and the changes from one were not propagated to the other. This is super frustrating because it makes your data seem unreliable. Imagine trying to manage access permissions and having some of your changes just disappear into the ether. It's like a digital magic trick gone wrong.
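To make the "same frozen moment" idea concrete, here's a tiny toy model in Python. To be clear, this is not Permify's storage engine or its actual commit logic; `ToyStore` and both helper functions are made up for illustration. It only assumes what the bug description says: two writers start from the same snapshot, and the second commit doesn't pick up the first one's change.

```python
# Toy model only: NOT Permify's implementation. All names here are hypothetical.


class ToyStore:
    """Keeps every committed version of a set of relation names."""

    def __init__(self, relations):
        self.versions = [frozenset(relations)]  # version 0 is the initial state

    def head(self):
        """Return the id of the latest committed snapshot."""
        return len(self.versions) - 1

    def read(self, snapshot_id):
        """Return the relations visible in the given snapshot."""
        return self.versions[snapshot_id]

    def commit_delete(self, snapshot_id, relation):
        """Naive commit: derive the new state from the writer's own snapshot,
        ignoring anything committed after that snapshot was taken."""
        self.versions.append(self.versions[snapshot_id] - {relation})


def run_shared_snapshot():
    """Both writers grab the same snapshot before either one commits."""
    store = ToyStore({"can_read", "can_comment", "can_edit"})
    snap_a = store.head()
    snap_b = store.head()  # same snapshot id as snap_a
    store.commit_delete(snap_a, "can_comment")
    store.commit_delete(snap_b, "can_edit")  # built from the stale snapshot
    return store.read(store.head())


def run_serialized():
    """Each writer takes its snapshot after the previous commit."""
    store = ToyStore({"can_read", "can_comment", "can_edit"})
    store.commit_delete(store.head(), "can_comment")
    store.commit_delete(store.head(), "can_edit")
    return store.read(store.head())
```

In this toy, `run_shared_snapshot()` ends with `can_comment` still present, because the second commit was built from the stale snapshot. `run_serialized()` leaves only `can_read`, which is the result you'd expect from the real system.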

This bug particularly affects operations that delete or modify multiple relations at the same time. The impact can be significant in systems where real-time accuracy of permissions is critical: users can keep access they shouldn't have, or lose access they should have. The provided example creates three relations and then deletes two of them concurrently. Depending on the timing, one or both of the deletions might not be correctly reflected in subsequent checks, which can lead to unexpected behavior and security vulnerabilities. Debugging is a nightmare because the root cause isn't immediately obvious, and the bug is intermittent: sometimes a run works and sometimes it fails. If you're building systems that rely on Permify for fine-grained access control, understanding and mitigating this bug is crucial.

Steps to Reproduce the Permify Bug

Want to see this bug in action? Here's how to reproduce it. You'll need three files: docker-compose.yaml, schema.perm, and test.sh.

First, the environment. The docker-compose.yaml file provided in the bug report uses Docker Compose to spin up a PostgreSQL database and a Permify instance, so make sure Docker and Docker Compose are installed on your machine.

Next, the schema.perm file. It defines the schema for your access control system, the blueprint for your permission model. In this example, that's a simple 'doc' entity with the relations can_read, can_comment, and can_edit, all related to a 'user' entity.

Finally, the crucial part: the test.sh script. This bash script automates the reproduction. It writes the schema to Permify, grants the initial permissions, and then attempts to delete two of those permissions simultaneously. It runs check_access calls before and after the deletion attempts. The script uses curl to send the write and check requests to the Permify API, and jq to parse and format the JSON payloads.

Run the script and you can witness the bug firsthand. In some runs everything works as expected; in others, one of the deletions doesn't take effect. The test case is deliberately simple, just creating and deleting relations, which makes the bug easier to isolate and understand.
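The key trick in the reproduction is firing the two deletes at (nearly) the same moment. The original script does this in bash with curl; the snippet below is only a rough Python analogue of that one idea, not a port of test.sh. The two `delete_*` functions are hypothetical placeholders, and in a real reproduction each would send its own delete request to your Permify instance.

```python
import threading


def run_concurrently(*actions):
    """Release every action at (nearly) the same instant; return results in call order."""
    barrier = threading.Barrier(len(actions))
    results = [None] * len(actions)

    def worker(index, action):
        barrier.wait()  # block until every worker thread is ready to go
        results[index] = action()

    threads = [
        threading.Thread(target=worker, args=(i, action))
        for i, action in enumerate(actions)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical placeholders: swap in real HTTP calls to your Permify instance.
def delete_can_comment():
    return "deleted can_comment"


def delete_can_edit():
    return "deleted can_edit"


if __name__ == "__main__":
    print(run_concurrently(delete_can_comment, delete_can_edit))
```

The barrier is there so neither delete gets a head start. Without it, the first thread could finish before the second even begins, and you'd rarely hit the overlap.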

Running the Test Script

To run the test script, save the docker-compose.yaml, schema.perm, and test.sh files in the same directory. First, bring up the Docker containers with docker-compose up. This starts the PostgreSQL database and the Permify instance; make sure there are no errors during startup. Once the containers are running, execute the test script with ./test.sh. It runs the series of API calls designed to replicate the bug and prints the results of the permission checks, so you can see whether the deletions succeeded.

Examine the output carefully and compare it across runs. Some runs will correctly reflect both deletions, while others will reflect only one. When you hit the bug, a permission that was supposed to be removed is still recognized in the checks after the concurrent delete operations.
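Since the bug only shows up in some runs, it helps to tally outcomes rather than eyeball them. Here's a small Python sketch for that. It assumes you've already turned each run's post-delete checks into a dict of relation name to "still allowed?", and that can_comment and can_edit are the two relations being deleted. The function names and the sample input are made up for illustration and are not real output from the script.

```python
from collections import Counter

# Assumption: these are the two relations the test deletes concurrently.
DELETED = ("can_comment", "can_edit")


def classify_run(allowed_after):
    """Label one run. `allowed_after` maps a relation name to True if the
    post-delete check still allows access through that relation."""
    stale = [rel for rel in DELETED if allowed_after.get(rel, False)]
    if not stale:
        return "ok"
    return "stale: " + ", ".join(stale)


def tally(runs):
    """Count how many runs landed in each outcome."""
    return Counter(classify_run(run) for run in runs)


if __name__ == "__main__":
    # Made-up sample input, only to show the shape of the data.
    sample_runs = [
        {"can_read": True, "can_comment": False, "can_edit": False},
        {"can_read": True, "can_comment": False, "can_edit": True},
    ]
    print(tally(sample_runs))
```

Any run that lands in a "stale" bucket is one where a deletion went missing.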

Expected Behavior and Workarounds

So, what should happen? The expected behavior is that all runs should consistently reflect the changes made by the concurrent transactions. Specifically, after the deletions, the permission checks should accurately reflect which relations have been removed. This means that if you delete the 'can_comment' and 'can_edit' relations, subsequent 'check' requests for those permissions should return