Purpose
I was tasked with building a data synchronization pipeline between Elasticsearch and PostgreSQL, in order to incrementally sync search data into ES. I didn’t want to handcraft too much, so I looked into three existing middlewares: PGSync, Logstash + JDBC, and Elastic-Connector. This blog discusses their pros and cons based on my actual experiments. In the last section, I also discuss other options for syncing data and why I didn’t consider them.
The conclusion is that:
- Logstash + JDBC is the best choice for my specific use case, despite some disadvantages.
- PGSync is a pretty good middleware, but it has a bug, and it offers less flexibility and customization in terms of table relations.
- Elastic-Connector for PostgreSQL is trash unless you only need very basic functionality. It is more like an open-source framework for people to build their own connectors on.
- Refer to the last section for why I didn’t consider other options.
What I need
I need two important features:
- The middleware should be capable of incrementally transforming and transferring data between SQL and ES. Some middlewares (e.g. some functionalities of Elastic-Connector) only support full-sync. Full-sync means syncing everything by overwriting the whole old ES index with a new one.
- The middleware should be able to perform complex data transformations. Most importantly, it should nest/de-normalize SQL tables, i.e. nest child SQL tables into their parent tables to build one big JSON document to store in ES.
I don’t need exact real-time syncing of new changes, so syncing at intervals (e.g. every 20 seconds) is acceptable.
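To make the second requirement concrete, here is a minimal sketch (with hypothetical table and field names) of the de-normalization I have in mind: child rows are folded into their parent rows to produce one JSON document per parent, which is what would be stored in ES.

```python
# Sketch of de-normalization: nest child rows (reviews) under their
# parent row (movie) to form one big JSON document per movie.
# All table/field names here are hypothetical.

def denormalize(movies, reviews):
    """Group each review under its parent movie by movie_id."""
    docs = {m["id"]: {**m, "reviews": []} for m in movies}
    for r in reviews:
        docs[r["movie_id"]]["reviews"].append(r)
    return list(docs.values())

movies = [{"id": 1, "title": "Dune"}]
reviews = [{"id": 10, "movie_id": 1, "text": "Great!"}]
print(denormalize(movies, reviews))
# one document per movie, with its reviews nested inside
```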
PGSync VS Logstash
Logstash is the most popular and widely used middleware; it is developed and officially supported by the ES team. However, I was initially attracted to PGSync and decided to give it a try, since it has several advantages over Logstash.
Logstash works by querying the database with SQL, extracting changed data, then transforming and sending the data to Elasticsearch. PGSync, instead, utilizes the logical decoding feature of PostgreSQL and captures changes through the transaction logs; it doesn’t query the database directly.
This yields several advantages:
| Key points | Logstash | PGSync |
|---|---|---|
| Impact on DB | High, since it queries the DB directly using SQL | ✔️ Low, since it passively captures changes through logs |
| Tracking column | Needed. Logstash uses it to check for changes; a timestamp is often used. All tables to be synced need this column, and it must be updated whenever changes happen | ✔️ Not needed |
| Deletion | Hard deletions can’t be detected by SQL polling, so soft deletion (e.g. a “deleted” flag column) is required | ✔️ Can achieve hard/soft deletion |
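For context, the tracking-column requirement for Logstash looks something like this in PostgreSQL (a sketch; table and column names are hypothetical): a timestamp column plus a trigger that keeps it fresh on every update. The `EXECUTE FUNCTION` syntax needs PostgreSQL 11+.

```sql
-- Hypothetical tracking-column setup for Logstash polling
ALTER TABLE movie ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now();

-- Keep updated_at fresh automatically on every UPDATE
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at = now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER movie_touch BEFORE UPDATE ON movie
FOR EACH ROW EXECUTE FUNCTION touch_updated_at();
```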
PGSync (3.0.0)
A Bug
I gave PGSync a try. One thing I have to complain about: the documentation is very broken and hard to understand. The project is maintained by a single developer and has a small community.
The version I used is 3.0.0. Syncing a single table works pretty smoothly. However, I encountered a bug when testing nested relationships: incremental CREATE operations on child tables in a nested relationship won’t be reflected in ES. For example, suppose we nest the Review table inside the Movie table. During the initial full-sync, everything works fine. However, in subsequent syncing stages, creating a new Review row won’t be synced to ES.
I found a similar issue raised (https://github.com/toluaina/pgsync/pull/493), but the pull request doesn’t seem to fix the problem.
Here is a table summarizing what works and what doesn’t (CUD stands for create/update/delete):

| | Single table | Parent-child table nesting |
|---|---|---|
| Will full-sync capture everything? | YES | YES |
| Will incremental CUD operations be captured and reflected in ES? | YES | CUD on the parent table: YES. UPDATE/DELETE on a child table: YES. CREATE on a child table: NO |
Workaround
There is a workaround for this issue. The problem is that CREATE operations on the child table do not get properly reflected in ES. However, if we create a child row and then update an irrelevant field in the parent table, all changes get captured and synced to ES.
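In SQL terms, the workaround looks roughly like this (hypothetical schema): create the child row, then make a dummy update to the parent so PGSync sees a change on it.

```sql
BEGIN;
-- the new child row alone would NOT be picked up by PGSync 3.0.0
INSERT INTO review (movie_id, text) VALUES (42, 'Nice movie');
-- a dummy touch on the parent forces the whole nested document to re-sync
UPDATE movie SET updated_at = now() WHERE id = 42;
COMMIT;
```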
Low customization
However, I soon realized another issue: the customization options are pretty limited. Our project uses generic foreign keys to build relationships between some tables, and such relationships can’t be specified in PGSync’s descriptive JSON file. The JSON file is quite user-friendly, but not as powerful as raw SQL, especially for customizing relationships.
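For reference, the Movie/Review nesting is expressed in PGSync’s JSON file roughly like this (a sketch from memory with hypothetical names; check PGSync’s docs for the exact shape). Notice there is simply no place here to describe a generic-foreign-key relationship:

```json
[
  {
    "database": "mydb",
    "index": "movies",
    "nodes": {
      "table": "movie",
      "columns": ["id", "title"],
      "children": [
        {
          "table": "review",
          "columns": ["id", "text"],
          "relationship": { "variant": "object", "type": "one_to_many" }
        }
      ]
    }
  }
]
```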
Logstash + JDBC
So I went with Logstash. It has a big community and a lot of useful articles.
The cons I mentioned earlier still hold: it took some effort to modify DB tables and backend server logic to incorporate the tracking column and soft deletion. But I am happy with the strong customization capability provided by SQL and Logstash’s internal filters.
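To give an idea of the setup, here is a minimal Logstash pipeline sketch (connection details, paths, and table/column names are placeholders): the jdbc input polls on a schedule, using the tracking column through `:sql_last_value`, and the output upserts documents into ES.

```conf
input {
  jdbc {
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "user"
    jdbc_password => "secret"
    schedule => "*/20 * * * * *"   # every 20 seconds
    statement => "SELECT * FROM movie WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "movies"
    document_id => "%{id}"
    doc_as_upsert => true
    action => "update"
  }
}
```

Soft-deleted rows can then be handled in the SQL statement or in a filter, e.g. by routing rows whose deleted flag is set to a delete action.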
Elastic-Connector
- Official guide on PostgreSQL-ES syncing using the connector
- General guide on how to run a connector in Docker
I also tried Elastic-Connector. After connecting ES to a connector, I could view it directly in the Kibana GUI to schedule jobs and set up sync rules. This sounded pretty appealing to me.
It is a pretty new middleware and has literally zero community resources. I only knew about it because it ranks at the top of Google’s search results… That was how it tricked me… 😭
Problems
It took me a long time just to set up a connector. Due to the lack of community resources, I sometimes had to check its source code to debug.
- For example, it doesn’t even have a configuration option to disable SSL certificate validation during the initial setup stages. I had to manually modify the source code to disable it, and I hate doing this. Probably this is because the middleware is primarily made for native Elastic Cloud deployments…
After I got everything working, I tested with some dummy data. There are “basic sync rules” and “advanced sync rules”.
- Basic sync rules work for both incremental sync and full sync, while advanced sync rules only work for full sync.
- Basic sync rules contain only some very basic matching conditions.
- The “table nesting” I want is only available via the advanced sync rules for the PostgreSQL connector (Link).
It is definitely not capable of handling my use case. I don’t see any advantage of using this instead of Logstash. Probably the only good thing is its GUI integration inside Kibana.
Other options!!!
Some other doable options:
- PostgreSQL -> Debezium -> Kafka -> ES Connector/Logstash -> ES: I didn’t take this into consideration for our project because it seems like a resource-intensive solution, since two more middlewares are used. But it looks very interesting to me. Debezium is a CDC (change data capture) tool and also uses PostgreSQL’s logical decoding feature. Plus, this solution provides great flexibility, customization, and stability. Maybe one day I will give it a try.
- Third-party data ETL tools such as AirByte, Estuary…
- In your backend framework, it is doable to perform create/update/delete against both the main SQL database and ES simultaneously. It works, but it sounds like a bad idea to me.
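For completeness, the dual-write idea from the last bullet can be sketched like this (with hypothetical in-memory stand-ins for the database and ES). The sketch also shows why I consider it a bad idea: if the second write fails, the two stores silently diverge and nothing reconciles them.

```python
# Dual-write sketch: the backend writes to both the SQL DB and ES.
# FakeStore is a hypothetical in-memory stand-in for either system.

class FakeStore(dict):
    def save(self, doc_id, doc):
        self[doc_id] = doc

def create_movie(db, es, movie):
    db.save(movie["id"], movie)      # write 1: the source of truth
    try:
        es.save(movie["id"], movie)  # write 2: the search index
    except Exception:
        # no transaction spans both systems; if this write fails,
        # db and es are now inconsistent and we can only log/retry
        raise

db, es = FakeStore(), FakeStore()
create_movie(db, es, {"id": 1, "title": "Dune"})
```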
End
Thanks for reading! It is a long article, and hopefully I didn’t make any mistakes. I am not an ES expert, but I did spend quite some time researching this stuff. Please feel free to drop a comment below if I got anything wrong.