Implementing Batch Processes with Feldera

Ben Pfaff, Chief Engineer / Co-Founder
December 18, 2024

Batch jobs are a classic application of databases. When data arrives in large collections of records, queries can run after each new arrival. Similarly, if records arrive gradually but query results are only needed on, say, a daily basis, this is a kind of batch job as well.

When a batch job is processing all the data in a growing database, the job will slow down over time, because each time it runs, the database redoes all of the computations from previous runs and adds new computations for the new data. As the database grows larger, the batch job takes more and more time, and it can eventually take longer than the acceptable time budget. It can even take over a day to run a daily batch job!

In contrast, Feldera performs incremental computation (as we described previously). As new data arrives, it reuses work from its previous computations to efficiently update query results used by batch jobs. As the database grows, the time to update query results grows much more slowly than for queries in a database, because it can reuse much of the work from the previous runs. Feldera turns time-consuming database batch jobs into fast incremental updates.
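The difference between the two approaches can be sketched with a toy aggregate. This is only an illustration under simplifying assumptions, not Feldera's actual algorithm (which also handles joins, deletions, and much more): the `batch_totals` function, the `IncrementalTotals` class, and the sample records are all hypothetical names and data invented for this example.

```python
from collections import defaultdict


def batch_totals(all_records):
    """Batch style: recompute every total from all records seen so far."""
    totals = defaultdict(float)
    for key, amount in all_records:
        totals[key] += amount
    return dict(totals)


class IncrementalTotals:
    """Incremental style: keep the totals and fold in only the new records."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, new_records):
        for key, amount in new_records:
            self.totals[key] += amount
        return dict(self.totals)


# Hypothetical (region, revenue) records arriving in three batches.
batches = [
    [("ASIA", 10.0), ("EUROPE", 5.0)],
    [("ASIA", 2.5)],
    [("EUROPE", 1.0), ("AFRICA", 4.0)],
]

history = []
incremental = IncrementalTotals()
batch_work = 0        # records touched by the recompute-everything approach
incremental_work = 0  # records touched by the incremental approach

for batch in batches:
    history.extend(batch)
    recomputed = batch_totals(history)
    batch_work += len(history)
    updated = incremental.apply(batch)
    incremental_work += len(batch)
    assert recomputed == updated  # both approaches agree on the result

print(updated)
print("batch work:", batch_work, "incremental work:", incremental_work)
```

Both approaches produce the same totals, but the recomputing version touches every historical record on every run, so its per-run cost grows with the size of the history, while the incremental version only touches the records in the new batch.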

This blog entry will look at an example of this kind of transformation, using TPC-H, an industry-standard benchmark that simulates reporting for business-related questions. First, we’ll show the trend of how TPC-H performs using a popular open-source database, watching its runtime grow as we insert more and more data. Then we’ll demonstrate the same process with Feldera, where we can see that each new data arrival takes approximately the same amount of time.

Batching with a Database

With a database, a batch job inserts a large collection of records and then runs a collection of queries. To demonstrate performance trends with batch jobs on a traditional database, we ran TPC-H with a popular open source database on a desktop computer. (Because our goal is to show trends, not to compare the performance of particular databases, we won’t say which database.) We used scale factor 100, meaning that the input data is about 100 GB in CSV format, consisting of about 1.6 billion records.

With the database and machine that we chose, it took about 38 minutes to load the data. We then ran TPC-H query 5, a 6-way join that we selected as representative of the set of TPC-H queries; it took about 4 minutes. We did not use indexes because we found that adding them, whether before or after loading the data, made the overall process longer.

To demonstrate the performance trend as data is inserted, we divided the “orders” and “lineitem” tables in the generated input, which together hold about half the input records, into 10 equal-sized batches, each comprising about 75,000,000 records across the two tables. Because “lineitem” records refer to “orders” records, we ensured that if an “orders” record was in a batch, so were its “lineitem” records. Then, for each batch in turn, we inserted its records and reran the query. Again, we did not use indexes because the process ran faster without them.
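One way such a split could be done is sketched below. The post does not say how the batches were actually produced, so this is an assumption: the `split_into_batches` helper and the tiny sample data are hypothetical, while `o_orderkey` and `l_orderkey` are the standard TPC-H column names linking the two tables. Orders are cut into contiguous chunks, and each line item follows its order into the same batch.

```python
def split_into_batches(orders, lineitems, num_batches=10):
    """Split orders into contiguous chunks and co-locate their line items.

    orders:    list of dicts with an 'o_orderkey' field
    lineitems: list of dicts with an 'l_orderkey' field
    """
    batches = [{"orders": [], "lineitem": []} for _ in range(num_batches)]
    batch_of_order = {}

    # Assign each order to a batch by its position in the input.
    for i, order in enumerate(orders):
        b = i * num_batches // len(orders)
        batch_of_order[order["o_orderkey"]] = b
        batches[b]["orders"].append(order)

    # Each line item goes wherever its parent order went.
    for item in lineitems:
        b = batch_of_order[item["l_orderkey"]]
        batches[b]["lineitem"].append(item)

    return batches


# Tiny hypothetical data set: 20 orders, each with 1 to 3 line items.
orders = [{"o_orderkey": k} for k in range(1, 21)]
lineitems = [
    {"l_orderkey": k, "l_linenumber": n}
    for k in range(1, 21)
    for n in range(1, k % 3 + 2)
]

batches = split_into_batches(orders, lineitems)
for number, batch in enumerate(batches, start=1):
    print(number, len(batch["orders"]), len(batch["lineitem"]))
```

Because the number of line items per order varies, splitting by order count gives batches whose total record counts are only approximately equal.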

The following table shows the results. The “runtime” is the number of seconds to load the additional batch and run the query. Notice the clear upward trend in runtime: as one would expect, running a query on more data takes more time.

Batch   Runtime
1       219 s
2       256 s
3       250 s
4       282 s
5       302 s
6       331 s
7       345 s
8       383 s
9       416 s
10      441 s

Batching in Feldera

Feldera is not a database. Instead, it is an incremental computation engine. The user supplies queries to execute using SQL, the same language used for a database (in fact, we used the same SQL for Feldera and for the database), but Feldera processes data and produces output as the data arrives, updating the output whenever it changes.

With Feldera on the same machine that we used for the database, with storage enabled to minimize memory consumption, running the full 100 GB of data through it takes about 28 minutes. The table below breaks down the time that Feldera takes to process each batch. Notice that the time to process each batch is relatively constant, varying between 170 s and 173 s per batch. This demonstrates how Feldera’s incremental computation enables it to accept new data without re-processing the existing data. As data grows, Feldera doesn’t slow down!

Batch   Runtime
1       173 s
2       171 s
3       170 s
4       171 s
5       170 s
6       170 s
7       172 s
8       172 s
9       170 s
10      173 s

Conclusion

The graph below shows the performance trends for Feldera versus the database that we used. Ignoring the absolute performance, and looking just at trends, the data shows that databases slow down as the total data size increases, whereas Feldera’s performance remains steady. In other words, Feldera is applicable to batch processes as well as to real-time data processing, with performance that will not let customers down as their data grows.

Batch runtime for Feldera vs. Database (a line graph of the tables above)
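To put rough numbers on the two trends, the per-batch runtimes from the tables above can be fitted with a straight line. The following is a small sketch rather than part of the original benchmark: the `slope` helper is a hypothetical name, and a least-squares line is just one simple way to summarize the trend.

```python
# Per-batch runtimes in seconds, copied from the two tables in this post.
database = [219, 256, 250, 282, 302, 331, 345, 383, 416, 441]
feldera = [173, 171, 170, 171, 170, 170, 172, 172, 170, 173]


def slope(runtimes):
    """Least-squares slope of runtime against batch number (1-based)."""
    n = len(runtimes)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(runtimes) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, runtimes))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


db_slope = slope(database)
feldera_slope = slope(feldera)
db_total = sum(database)
feldera_total = sum(feldera)

print(f"database: {db_slope:.1f} s more per batch, {db_total} s in total")
print(f"feldera:  {feldera_slope:.2f} s more per batch, {feldera_total} s in total")
```

On these figures, the fit gives roughly 24 s of additional runtime per batch for the database and well under 1 s per batch for Feldera. The totals come to 3,225 s (about 54 minutes) for the database and 1,712 s (about 28.5 minutes) for Feldera, which lines up with the 28 minutes quoted earlier.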
