Kafka Streams .NET – Is this the right approach for merging product data from Debezium?
Hi everyone 👋
I’m currently migrating a legacy system where product information is spread across 8 different tables in a SQL Server database.
📌 Goal:
Merge all the data from those tables into a single enriched product message, and store it in a NoSQL database.
🛠️ What I’ve done so far:
- I’m using Debezium to capture real-time changes from the SQL Server tables.
- Each table has its own Kafka topic.
- Then, I use Kafka Streams (via Streamiz.Kafka.Net) to perform joins between the KTables and produce the enriched product.
❓My questions:
- Is this a good fit for Kafka Streams?
Would you say this is the right approach for this type of use case?
- Is this normal behavior?
Each time I refresh Kafka UI and look at the output topic, another 500 to 1000 messages seem to have been produced. It feels a bit heavy or slow, so I’m wondering if I’m missing something in the setup.
To illustrate, I’m sharing below a simplified example using only 2 tables, and it’s this version that generates between 500 and 1000 messages in the output topic.
When I say “refresh,” I mean I’m spamming the F5 key on Kafka UI to see the number of messages appearing in the topic.
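Here’s a sketch of that 2-table version (topic names, store names, and the string values are placeholders, and I’m quoting the Streamiz API from memory, so double-check the exact method names):

```csharp
using System;
using System.Threading.Tasks;
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;
using Streamiz.Kafka.Net.Table;

public static class ProductEnricher
{
    public static async Task Main()
    {
        // Default string serdes for keys and values; the real app uses typed serdes
        var config = new StreamConfig<StringSerDes, StringSerDes>
        {
            ApplicationId = "product-enricher",  // placeholder
            BootstrapServers = "localhost:9092"  // placeholder
        };

        var builder = new StreamBuilder();

        // One KTable per Debezium topic (2-table version of the real 8-table join)
        var products = builder.Table("sqlserver.dbo.products", InMemory.As<string, string>("products-store"));
        var prices = builder.Table("sqlserver.dbo.prices", InMemory.As<string, string>("prices-store"));

        // KTable-KTable join: every update on either side re-emits the joined record,
        // so the output topic keeps growing while Debezium replays the initial snapshot
        products.Join(prices, (product, price) => $"{product}|{price}")
                .ToStream()
                .To("enriched-products");

        var stream = new KafkaStream(builder.Build(), config);
        Console.CancelKeyPress += (o, e) => stream.Dispose();
        await stream.StartAsync();
    }
}
```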
Thanks a lot in advance for any guidance! 🙏
6 Replies
I think the general idea sounds pretty solid. What's the overall plan for sunsetting the old application? Are clients expected to keep writing through the old system, so that you need to keep forwarding writes from the old system to the new DB?
When it comes to the volume of messages, how many entries are in the tables?
On a side note, that's a very well prepared question, much appreciated 🙏
Exactly, it's an initial phase. Around 10 million entries per day across all tables combined.
That doesn't sound too far off then, right? Not sure how quick the consumers are; maybe you can set up some metrics to get a feel? I haven't worked with Debezium for well over a year, but IIRC it would initially snapshot the database, so that might cause a lot of messages as well.
If that's a concern, you could maybe also have a batch job transfer the current state for starters, and configure Debezium to capture only future changes.
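IIRC it's the `snapshot.mode` setting on the connector. Roughly like this for the SQL Server connector (hostnames, names, and topic prefix are placeholders, and depending on the Debezium version the mode is called `schema_only` or `no_data`):

```json
{
  "name": "products-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "sqlserver",
    "database.port": "1433",
    "database.user": "debezium",
    "database.password": "*****",
    "database.names": "products_db",
    "topic.prefix": "sqlserver",
    "table.include.list": "dbo.products,dbo.prices",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.products",
    "snapshot.mode": "no_data"
  }
}
```

That way the table topics only ever contain changes made after the connector starts, and the batch job owns the initial load.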
From Debezium to the table topics there's no issue; everything is very fast. What really surprises me is the performance difference between simply producing messages and performing the JOIN operation. Kafka itself can handle millions of messages without problems, and I expected Kafka Streams to offer similar performance.
When I look at typical large-scale data processing scenarios, like ETL pipelines where joins are also common, I can't be the only one doing this kind of operation. So even though I'm only using a single Kafka partition for now, the huge gap between the speed of raw message ingestion and the slowness of the join really makes me think I'm doing something wrong somewhere.
Do you know where the bottleneck is (as in, is it a hardware bottleneck)? I haven't used Kafka Streams in .NET, but looking at the library I've seen some configuration parameters which might be interesting for throughput optimization (NumStreamThreads, commit behavior, etc.). I've also read through https://lgouellec.github.io/streamiz/threading-model.html a bit
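For example, something along these lines (property names from the Streamiz config docs, values purely illustrative):

```csharp
var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "product-enricher",
    BootstrapServers = "localhost:9092",
    // More threads only help if there are enough tasks to go around,
    // i.e. enough input partitions to spread across them
    NumStreamThreads = 4,
    // How often offsets and state are committed; part of the commit behavior worth tuning
    CommitIntervalMs = 1000
};
```

With a single partition there's basically one task per sub-topology, so I'd also look at repartitioning the input topics before adding threads.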
Yes, I've already tested the NumStreamThreads parameter, and it does improve performance. But honestly, I expected better baseline performance even with just a single thread. Anyway, I definitely don't want to bother you. I was just wondering if anyone had already experienced similar performance. If not, it's clearly up to me to investigate further, not you! 😄
Thanks a lot for your help and your advice! ❤️