Purpose
I was tasked with building a data synchronization pipeline between Elasticsearch and PostgreSQL, in order to incrementally sync search data into ES. I didn’t want to handcraft too much, so I looked into three existing middlewares: PGSync, Logstash + JDBC, and Elastic-Connector. This blog discusses their pros and cons based on my actual experiments. In the last section, I also discuss other options for syncing data and why I didn’t consider them.
The conclusion is that:
- Logstash + JDBC is the best choice for my specific use case, despite some disadvantages.
- PGSync is a pretty good middleware, but it has a bug, and it offers less flexibility and customization in terms of table relations.
- Elastic-Connector for PostgreSQL is trash unless you only need very basic functionality. It is more like an open-source framework for people to build their own connectors on.
- Refer to the last section for why I didn’t consider other options.
What I need
I need two important features:
- The middleware should be capable of incrementally transforming and transferring data between SQL and ES. Some middlewares (e.g. some functionalities of Elastic-Connector) only support full-sync. Full-sync means syncing everything by overwriting the whole old ES index with a new one.
- The middleware should be able to perform complex data transformations. Most importantly, it should nest/de-normalize SQL tables, i.e. nest child SQL tables into their parent tables to build one big JSON document to store in ES.
I don’t need exact real-time syncing of new changes, so syncing at intervals (e.g. every 20 seconds) is acceptable.
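To make the second requirement concrete, here is a minimal sketch (with hypothetical table and field names) of the de-normalization I have in mind: child rows are folded into their parent rows to produce one JSON document per parent, which is what would be stored in ES.

```python
# Sketch of de-normalization: nest child rows (reviews) under their
# parent row (movie) to form one big JSON document per movie.
# All table/field names here are hypothetical.

def denormalize(movies, reviews):
    """Group each review under its parent movie by movie_id."""
    docs = {m["id"]: {**m, "reviews": []} for m in movies}
    for r in reviews:
        docs[r["movie_id"]]["reviews"].append(r)
    return list(docs.values())

movies = [{"id": 1, "title": "Dune"}]
reviews = [{"id": 10, "movie_id": 1, "text": "Great!"}]
print(denormalize(movies, reviews))
# one document per movie, with its reviews nested inside
```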
PGSync VS Logstash
Logstash is the most popular and widely used middleware; it is developed and officially supported by the ES team. However, I was initially attracted to PGSync and decided to give it a try, since it has several advantages over Logstash.
Logstash works by querying the database with SQL, extracting changed data, then transforming and sending the data to Elasticsearch. PGSync, instead, utilizes the logical decoding feature of PostgreSQL and captures changes through the transaction logs; it doesn’t query the database directly.
This yields several advantages:
| Key points | Logstash | PGSync |
|---|---|---|
| Impact on DB | High, since it queries the DB directly using SQL | ✔️ Low, since it passively captures changes through logs |
| Tracking column | Needed. Logstash uses it to check for changes; a timestamp is often used. All tables to be synced need this column, and it must be updated whenever changes happen | ✔️ Not needed |
| Deletion | Hard deletions can’t be detected by SQL polling, so soft deletion (e.g. a “deleted” flag column) is required | ✔️ Can achieve hard/soft deletion |
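For context, the tracking-column requirement for Logstash looks something like this in PostgreSQL (a sketch; table and column names are hypothetical): a timestamp column plus a trigger that keeps it fresh on every update. The `EXECUTE FUNCTION` syntax needs PostgreSQL 11+.

```sql
-- Hypothetical tracking-column setup for Logstash polling
ALTER TABLE movie ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now();

-- Keep updated_at fresh automatically on every UPDATE
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at = now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER movie_touch BEFORE UPDATE ON movie
FOR EACH ROW EXECUTE FUNCTION touch_updated_at();
```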
PGSync (3.0.0)
A Bug
I gave PGSync a try. One thing I have to complain about: the documentation is very broken and hard to understand. The project is maintained by a single developer and has a small community.
The version I used is 3.0.0. Syncing a single table works pretty smoothly. However, I encountered a bug when testing nested relationships: incremental CREATE operations on child tables in a nested relationship won’t be reflected in ES. For example, suppose we nest the Review table inside the Movie table. During the initial full-sync, everything works fine. However, in subsequent syncing stages, creating a new Review row won’t be synced to ES.
I found a similar issue raised (https://github.com/toluaina/pgsync/pull/493), but the pull request doesn’t seem to fix the problem.
Here is a table summarizing what works and what doesn’t (CUD stands for create/update/delete):

| | Single table | Parent-child table nesting |
|---|---|---|
| Will full-sync capture everything? | YES | YES |
| Will incremental CUD operations be captured and reflected in ES? | YES | CUD on the parent table: YES. UPDATE/DELETE on a child table: YES. CREATE on a child table: NO |
Workaround
There is a workaround for this issue. The problem is that CREATE operations on the child table do not get properly reflected in ES. However, if we create a child row and then update an irrelevant field in the parent table, all changes get captured and synced to ES.
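In SQL terms, the workaround looks roughly like this (hypothetical schema): create the child row, then make a dummy update to the parent so PGSync sees a change on it.

```sql
BEGIN;
-- the new child row alone would NOT be picked up by PGSync 3.0.0
INSERT INTO review (movie_id, text) VALUES (42, 'Nice movie');
-- a dummy touch on the parent forces the whole nested document to re-sync
UPDATE movie SET updated_at = now() WHERE id = 42;
COMMIT;
```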
Low customization
However, I soon realized another issue: the customization options are pretty limited. Our project uses generic foreign keys to build relationships between some tables, and such relationships can’t be specified in PGSync’s descriptive JSON file. The JSON file is quite user-friendly, but not as powerful as raw SQL, especially for customizing relationships.
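For reference, the Movie/Review nesting is expressed in PGSync’s JSON file roughly like this (a sketch from memory with hypothetical names; check PGSync’s docs for the exact shape). Notice there is simply no place here to describe a generic-foreign-key relationship:

```json
[
  {
    "database": "mydb",
    "index": "movies",
    "nodes": {
      "table": "movie",
      "columns": ["id", "title"],
      "children": [
        {
          "table": "review",
          "columns": ["id", "text"],
          "relationship": { "variant": "object", "type": "one_to_many" }
        }
      ]
    }
  }
]
```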
Logstash + JDBC
So I went with Logstash. It has a big community and a lot of useful articles.
The cons I mentioned earlier still hold: it took some effort to modify DB tables and backend server logic to incorporate the tracking column and soft deletion. But I am happy with the strong customization capability provided by SQL and Logstash’s internal filters.
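To give an idea of the setup, here is a minimal Logstash pipeline sketch (connection details, paths, and table/column names are placeholders): the jdbc input polls on a schedule, using the tracking column through `:sql_last_value`, and the output upserts documents into ES.

```conf
input {
  jdbc {
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "user"
    jdbc_password => "secret"
    schedule => "*/20 * * * * *"   # every 20 seconds
    statement => "SELECT * FROM movie WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "movies"
    document_id => "%{id}"
    doc_as_upsert => true
    action => "update"
  }
}
```

Soft-deleted rows can then be handled in the SQL statement or in a filter, e.g. by routing rows whose deleted flag is set to a delete action.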
Elastic-Connector
- Official guide on PostgreSQL-ES syncing using the connector
- General guide on how to run a connector in Docker
I also tried Elastic-Connector. After connecting ES to a connector, I could view it directly in the Kibana GUI to schedule jobs and set up sync rules. This sounded pretty appealing to me.
It is a pretty new middleware and has literally zero community resources. I only knew about it because it ranks at the top of Google’s search results… That was how it tricked me… 😭
Problems
It took me a long time just to set up a connector. Due to the lack of community resources, I sometimes had to check its source code to debug.
- For example, it doesn’t even have a configuration option to disable SSL certificate validation during the initial setup stages. I had to manually modify the source code to disable it, and I hate doing this. Probably this is because the middleware is primarily made for native Elastic Cloud deployments…
After I got everything working, I tested with some dummy data. There are “basic sync rules” and “advanced sync rules”.
- Basic sync rules work for both incremental sync and full sync, while advanced sync rules only work for full sync.
- Basic sync rules contain only some very basic matching conditions.
- The “table nesting” I want is only available via the advanced sync rules for the PostgreSQL connector (Link).
It is definitely not capable of handling my use case. I don’t see any advantage of using this instead of Logstash. Probably the only good thing is its GUI integration inside Kibana.
Other options!!!
Some other doable options:
- PostgreSQL -> Debezium -> Kafka -> ES Connector/Logstash -> ES: I didn’t take this into consideration for our project because it seems like a resource-intensive solution, since two more middlewares are used. But it looks very interesting to me. Debezium is a CDC (change data capture) tool and also uses PostgreSQL’s logical decoding feature. Plus, this solution provides great flexibility, customization, and stability. Maybe one day I will give it a try.
- Third-party data ETL tools such as AirByte, Estuary…
- In your backend framework, it is doable to perform create/update/delete against both the main SQL database and ES simultaneously. It works, but it sounds like a bad idea to me.
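For completeness, the dual-write idea from the last bullet can be sketched like this (with hypothetical in-memory stand-ins for the database and ES). The sketch also shows why I consider it a bad idea: if the second write fails, the two stores silently diverge and nothing reconciles them.

```python
# Dual-write sketch: the backend writes to both the SQL DB and ES.
# FakeStore is a hypothetical in-memory stand-in for either system.

class FakeStore(dict):
    def save(self, doc_id, doc):
        self[doc_id] = doc

def create_movie(db, es, movie):
    db.save(movie["id"], movie)      # write 1: the source of truth
    try:
        es.save(movie["id"], movie)  # write 2: the search index
    except Exception:
        # no transaction spans both systems; if this write fails,
        # db and es are now inconsistent and we can only log/retry
        raise

db, es = FakeStore(), FakeStore()
create_movie(db, es, {"id": 1, "title": "Dune"})
```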
End
Thanks for reading! It is a long article, and hopefully I didn’t make any mistakes. I am not an ES expert, but I did spend quite some time researching this stuff. Please feel free to drop a comment below if I got anything wrong.