Introduction
This blog post describes Zalando's recent contribution to the Postgres JDBC driver (PgJDBC), which addresses a long-standing issue in the driver's integration with Postgres logical replication that caused runaway write-ahead log (WAL) growth. We'll describe the issue and its impact at Zalando, and detail the fix we made upstream in the JDBC driver, which resolves the issue for all other clients of Debezium and the Postgres JDBC driver as well.
Postgres logical replication at Zalando
Zalando's builders can use a low-code solution that lets them declare a stream of events originating from a Postgres database. Each event stream declaration provisions a micro-application, powered by Debezium Engine, that uses Postgres logical replication to publish table-level change events as they occur. These event streams, with the ability to publish events to a variety of technologies and transform them at will via AWS Lambda, form a core part of Zalando's infrastructure offering. At the time of writing, Zalando already has hundreds of event streams originating from Postgres.
A common problem with logical replication is excessive growth of the Postgres write-ahead log (WAL). In some cases the WAL can grow until it takes up all the available disk space on the database node, causing the node to be downgraded to read-only - something that is highly undesirable in a production environment! This issue is most common when there is little or no write traffic on the streamed tables - yet as soon as a write does occur, the excess WAL disappears immediately. In recent years, as Postgres-sourced event streams have become more popular at Zalando, we've seen this issue come up more and more frequently.
So, what's going on under the hood of this event streaming process? How does Postgres reliably ensure that all data change events are emitted and captured by interested clients? The answers to these questions are essential to understanding the problem and finding a solution.
To explain the problem and how we fixed it, we first need to cover the internals of Postgres replication. In Postgres, the write-ahead log (WAL) is a strictly ordered sequence of the events that occur in the database. These WAL events are the source of truth for the database, and streaming and replaying WAL events is how both physical and logical replication work. Physical replication is used for replicating entire databases. Logical replication, the topic of this blog post, allows clients to subscribe to WAL events describing data changes. In both cases, a replication client tracks its progress through the WAL by checkpointing its position, known as the log sequence number (LSN), directly on the primary database. WAL events stored on the primary can only be discarded once all replication clients, both physical and logical, have confirmed that they have processed them. If a client fails to confirm that it has handled a WAL event, the primary retains that WAL event and all subsequent WAL events until the acknowledgment arrives.
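You can observe this bookkeeping directly on the primary: the pg_replication_slots view exposes the position each logical client has acknowledged. Here is a minimal sketch using plain JDBC; the connection URL and credentials are placeholders, not our actual setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SlotPositionCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret");
             Statement stmt = conn.createStatement();
             // confirmed_flush_lsn is the position each logical slot has acknowledged;
             // WAL older than the minimum of these positions cannot be recycled.
             ResultSet rs = stmt.executeQuery(
                     "SELECT slot_name, confirmed_flush_lsn, " +
                     "       pg_current_wal_lsn() AS current_lsn " +
                     "FROM pg_replication_slots WHERE slot_type = 'logical'")) {
            while (rs.next()) {
                System.out.printf("%s acknowledged up to %s (server is at %s)%n",
                        rs.getString("slot_name"),
                        rs.getString("confirmed_flush_lsn"),
                        rs.getString("current_lsn"));
            }
        }
    }
}
```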
Simple, right?
Well, the happy path really is that simple. But as you might have guessed, this blog post is not about the happy path.
The problem
Before we go any further, let me briefly describe the architecture that runs into this problem:
A Postgres database with logical replication set up for two of its three tables
We have a database with multiple tables, represented here by different colors: blue (1), pink (2), purple (3), and so on. We are specifically listening for changes to the blue and pink tables. These changes are delivered to the blue and pink clients, respectively, via logical replication. In our case, these clients are our Postgres-sourced event streaming applications, which use Debezium and PgJDBC under the hood to bridge the gap between Postgres byte-array messages and Java by providing a user-friendly API.
The key thing to note here is that all table changes go into the same WAL. The WAL exists at the server level; it cannot be broken down into table-level or schema-level concepts. All changes to all tables in all schemas in all databases on that server go into the same WAL.
To track the progress of the blue and pink replication separately, the database server uses a structure called a replication slot. A replication slot is created for each client - so in this case we have a blue (upper, denoted 1) and a pink (lower, denoted 2) replication slot - and each slot holds information about its client's progress through the WAL. It does this by storing the LSN of the last WAL event the client has flushed, along with a few other details that we'll leave out to keep things simple.
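For illustration, this is roughly how such a slot can be created through PgJDBC's replication API (Debezium normally does this for you); the slot name, output plugin, and connection details are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;
import org.postgresql.PGConnection;
import org.postgresql.PGProperty;

public class CreateSlot {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        PGProperty.USER.set(props, "postgres");
        PGProperty.PASSWORD.set(props, "secret");
        // The replication API requires a replication connection.
        PGProperty.REPLICATION.set(props, "database");
        PGProperty.ASSUME_MIN_SERVER_VERSION.set(props, "9.4");
        PGProperty.PREFER_QUERY_MODE.set(props, "simple");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", props)) {
            PGConnection pgConn = conn.unwrap(PGConnection.class);
            // One slot per client: e.g. a "blue" slot tracking the blue client's progress.
            pgConn.getReplicationAPI()
                    .createReplicationSlot()
                    .logical()
                    .withSlotName("blue_slot")
                    .withOutputPlugin("pgoutput") // or whichever plugin your setup uses
                    .make();
        }
    }
}
```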
If we zoom in on the WAL, we can picture it as follows:
Each client has a replication slot whose progress is tracked through the WAL.
For clarity, I'll write LSNs here as decimal numbers. In reality, they are represented as a hexadecimal combination of page number and position.
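For reference, PgJDBC exposes LSNs through its LogSequenceNumber class; the value below is just an illustrative example:

```java
import org.postgresql.replication.LogSequenceNumber;

public class LsnExample {
    public static void main(String[] args) {
        // LSNs are written as a hexadecimal "segment/offset" pair, not a decimal counter.
        LogSequenceNumber lsn = LogSequenceNumber.valueOf("16/B374D848");
        System.out.println(lsn.asString()); // 16/B374D848
        System.out.println(lsn.asLong());   // the same position as a single 64-bit number
    }
}
```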
When a write occurs on any table in the database, it is written to the WAL - here, to the next available log position, #7. If the write was on the blue table, a message with this information is sent to the blue client, and once the client acknowledges receipt of change #7, the blue replication slot advances to #7. However, the WAL event at LSN #7 cannot be reclaimed and its disk space cannot be freed yet, because the pink replication slot is still only at #6.
As changes occur in the blue table, the blue client's replication slot advances, but there is no reason for the pink slot to move
If the blue table continues to receive writes but no writes occur on the pink table, the pink replication slot never gets a chance to move forward, and all the blue WAL events pile up, taking up space.
This continues, and the WAL becomes very large, at risk of using all the disk space of the entire server
However, once a write occurs on the pink table, that change is written to the next available WAL position, e.g. #14, the pink client acknowledges receipt, and the pink replication slot advances to position #14. Now we have the following situation:
As soon as a write occurs on the pink table, the pink replication slot moves forward, and WAL events up to position #13 can be removed, as no slot needs them any longer
This is the heart of the matter. The pink client is not interested in these WAL events, yet Postgres cannot delete them until the pink client acknowledges a later LSN in its replication slot. This continues indefinitely, until disk space is completely used up by old WAL events that cannot be deleted until a write occurs on the pink table.
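For context, this is roughly what acknowledgment looks like from the client side with PgJDBC's replication API (Debezium does the equivalent internally); the slot name and polling details are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import org.postgresql.PGConnection;
import org.postgresql.replication.PGReplicationStream;

public class BlueClient {
    // Consume changes from the (illustrative) "blue_slot" and report progress back,
    // so the server knows which WAL it may recycle.
    static void consume(PGConnection pgConn) throws Exception {
        PGReplicationStream stream = pgConn.getReplicationAPI()
                .replicationStream()
                .logical()
                .withSlotName("blue_slot")
                .withStatusInterval(10, TimeUnit.SECONDS)
                .start();

        while (true) {
            ByteBuffer msg = stream.readPending(); // non-blocking; null if nothing new
            if (msg == null) {
                TimeUnit.MILLISECONDS.sleep(10);
                continue;
            }
            // ... decode and handle the change event here ...

            // Acknowledge: this is what advances the slot on the server.
            stream.setAppliedLSN(stream.getLastReceiveLSN());
            stream.setFlushedLSN(stream.getLastReceiveLSN());
        }
    }
}
```

Note that the slot only advances when the client acknowledges something - which it can only do when it receives a message for a table it subscribes to.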
Mitigation strategies
Plenty of blog posts discuss this bug - or phenomenon, or behavior, whatever you want to call it - and workaround hacks abound. By far the most popular is to create a scheduled job that writes dummy data to the pink table to force its slot forward. This approach has been used at Zalando in the past, but it is a stopgap that doesn't address the core problem, and it adds extra setup overhead to every Postgres logical replication deployment going forward.
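A minimal sketch of that workaround, assuming a hypothetical heartbeat table in the streamed schema and placeholder connection details (Debezium also offers built-in heartbeat settings that serve a similar purpose):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatWriter {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Periodically touch a hypothetical "heartbeat" table so the otherwise idle
        // slot receives a change it can acknowledge, letting old WAL be recycled.
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret");
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                        "INSERT INTO heartbeat (id, ts) VALUES (1, now()) " +
                        "ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}
```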
Even former Debezium lead Gunnar Morling has written about this topic.
Byron Wolfman alludes to a purist solution in a blog post, but then abandons that prospect in favor of the same stopgap. The following quote is taken from his article on the topic:
Excerpt from a blog post describing the purist cursor-forwarding solution as well as the "fake write" hack
This is indeed the purist solution. In our case, a Java application is the end consumer, and the first port of call for a message from Postgres is PgJDBC, the Java driver for Postgres. If we can solve the problem at this level, it is abstracted away from all Java applications (including Debezium) and solved for good.
Our solution
It's important to note that while Postgres only sends replication messages when a write occurs, it also periodically sends KeepAlive messages to maintain the connection between itself and the client (in this case PgJDBC). A KeepAlive message contains very little data: some identifiers, a timestamp, a flag indicating whether a reply is required - but most importantly, it contains the database server's current WAL LSN. Historically, PgJDBC did not respond to KeepAlive messages, and nothing changed on the server side after a KeepAlive was sent. This needed to change.
The original message flow between the database server and the PgJDBC driver. Only replication messages receive an acknowledgment from the driver.
The fix involves tracking both the LSN of the last replication message received from the server and the LSN of the latest message acknowledged by the client. If these two LSNs are equal and the client then receives a KeepAlive message with a higher LSN, the client can infer that it has flushed all relevant changes and that some unrelated changes are happening in the database that it doesn't care about. The client can then safely confirm receipt of that LSN to the server, advancing its replication slot and allowing the Postgres server to delete the irrelevant WAL events. This approach is conservative about which LSNs it confirms, guaranteeing that no relevant events are skipped.
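The essence of the change can be sketched as follows. This is a conceptual illustration of the rule described above, not the actual PgJDBC patch, and the class, method, and parameter names are made up:

```java
import java.sql.SQLException;
import org.postgresql.replication.LogSequenceNumber;
import org.postgresql.replication.PGReplicationStream;

public class KeepAliveHandler {
    // Called when a KeepAlive message carrying the server's current WAL end arrives.
    static void onKeepAlive(PGReplicationStream stream,
                            LogSequenceNumber serverWalEnd,
                            LogSequenceNumber lastReceivedLSN,
                            LogSequenceNumber lastFlushedLSN) throws SQLException {
        // Only advance on a KeepAlive if everything received so far has been flushed;
        // any WAL beyond that point must belong to changes this slot does not subscribe to.
        if (lastReceivedLSN.equals(lastFlushedLSN)
                && serverWalEnd.asLong() > lastFlushedLSN.asLong()) {
            stream.setFlushedLSN(serverWalEnd);
            stream.setAppliedLSN(serverWalEnd);
            stream.forceUpdateStatus(); // send the status update (our acknowledgment) now
        }
    }
}
```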
The updated message flow now also includes an acknowledgment response to each KeepAlive message, allowing all replication clients to continuously acknowledge receipt of WAL changes
The fix was implemented, tested, and submitted to PgJDBC via a pull request. It was merged on August 31, 2023 and is scheduled for release in PgJDBC 42.7.0.
Rollout
Our Debezium-powered streaming applications rely on backwards-compatibility features that have been removed from newer versions of Debezium. To maintain this backwards compatibility, our application does not use the latest version of Debezium, and therefore does not use the latest version of PgJDBC that Debezium pulls in as a transitive dependency. To take advantage of the fix while keeping this backwards compatibility, we modified our build script to selectively override the version of the transitive PgJDBC dependency, and we used this option to build not one but two Docker images for our app: one unchanged, and one with a locally built PgJDBC version, 42.6.1-patched, containing our fix. We rolled out the modified Docker image to our test environment while continuing to use the unchanged image in production. This allowed us to safely verify that our event streaming applications continued to function as expected, and to monitor their behavior to confirm that the WAL growth issue had been resolved.
To verify that the issue had indeed gone away, we monitored the total WAL size for a low-activity database over a few days. Before the fix, the total WAL size chart typically looked like this, showing the issue in action over a 36-hour period:
Out-of-control WAL growth before the fix
The same database after the fix, over the same time frame and with no other changes to the persistence tier, service tier, or activity, now has a WAL size graph like this:
WAL growth (or lack thereof!) after the fix
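If you want to cross-check charts like these against the database itself, the amount of WAL each slot forces the server to retain can be read straight from pg_replication_slots. A sketch, with placeholder connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RetainedWalCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret");
             Statement stmt = conn.createStatement();
             // How much WAL each slot is forcing the server to keep around.
             ResultSet rs = stmt.executeQuery(
                     "SELECT slot_name, " +
                     "       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal " +
                     "FROM pg_replication_slots")) {
            while (rs.next()) {
                System.out.printf("%s retains %s of WAL%n",
                        rs.getString("slot_name"), rs.getString("retained_wal"));
            }
        }
    }
}
```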
Since the fix itself is conservative about which LSNs it confirms - guaranteeing that we never skip or miss an event - this evidence was enough to confidently roll out the newer Docker image to our production clusters, resolving the out-of-control WAL growth for hundreds of Postgres-sourced event streams at Zalando. No more hacks :)
Author: Declan Murphy, Software Engineer
Source: https://engineering.zalando.com/posts/2023/11/patching-pgjdbc.html