Feature Wars: SQL vs DSL for Feature Engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to an ML model.
Two distinct approaches to feature engineering have emerged recently:
- The first approach relies on general-purpose tools like SQL or Spark
- The second approach turns to domain-specific languages (DSLs) tailored explicitly for feature engineering. Examples include Tecton, Chronon, Fennel, and others.
The emergence of DSLs for feature engineering is intriguing. Building a new language is no small feat—it involves designing the language, implementing a compiler, runtime, and tooling, and, last but not least, convincing users to adopt it. Most of these languages are implemented as extensions inside a host language, usually Python, thus they are technically embedded DSLs (EDSLs), which simplifies things. Still, their development remains a large and risky undertaking.
Evolving the DSL may be the biggest challenge of all. While the language may start small and clean, as users push its boundaries, it grows in size and complexity, often losing much of its original appeal.
In short, the decision to build a new language should never be taken lightly. These DSLs are not hobbyist projects; companies are betting their future on these new languages.
So why take the DSL route? Is there something inherently limiting in SQL and other general-purpose tools that makes DSLs a more compelling option?
Personally, I believe the answer is NO. There is nothing special about feature engineering that requires a new language. Done right, SQL is both more powerful and easier to use than a DSL. Thus, instead of turning feature engineering into a language design problem, we should focus on building better general-purpose query engines that meet the needs of modern ML applications.
Disclaimer
I have a horse in the race. As a developer of an incremental SQL engine, which, among other applications, is well suited for real-time feature engineering, I bring a certain bias to the table.
Despite the deliberately provocative title, I am hoping to start a discussion, not a war, and would genuinely love to hear from the other camp:
- How do folks building DSLs see the problem?
- What played into your decision to build a new language?
- Do you see the DSL as a temporary solution or the right long-term choice?
The goal of this post isn't to critique any particular DSL. Instead, I focus on the broader merits of the two approaches rather than on specific design choices made by the developers of any one DSL.
Why DSL?
The primary reason for building DSLs is developer productivity. By using language constructs tailored to an application domain, a DSL can express complex problems clearly and concisely. A classic example is regular expressions—a DSL designed for pattern matching. With only a few characters, developers can specify patterns that would otherwise require hundreds of lines of error-prone code in a general-purpose language.
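As a tiny illustration, a single regular expression inside a SQL predicate does work that would otherwise take a page of hand-rolled string handling. This is only a sketch: it assumes a dialect with a REGEXP_LIKE-style predicate, and the table and column names are hypothetical.
-- Flag records with a malformed email field (a sketch; assumes a dialect
-- with REGEXP_LIKE; the signup table and its columns are hypothetical).
SELECT *
FROM signup
WHERE NOT REGEXP_LIKE(email, '^[^@]+@[^@]+\.[A-Za-z]{2,}$');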
Are feature engineering DSLs indeed better at expressing ML features than a general-purpose query language like SQL? In my experience, they aren't. SQL lends itself nicely to implementing a wide range of complex feature queries in a natural way. To illustrate this point, the next section explores common types of features and shows how they can be implemented in SQL.
Moreover, today's feature engineering DSLs do not introduce any concepts that are not already present in SQL. On the contrary, they support only a narrow subset of SQL's capabilities. Thus, developer productivity cannot be the driving factor behind these DSLs.
So why DSL, really?
Besides ease of use, the second main reason people build DSLs is ease of implementation: by restricting the expressive power of the language, one can make it amenable to an efficient implementation.
The problem with SQL is not that it is hard to write or lacks expressiveness, but rather that feature engineering applications have unique requirements that most existing SQL engines are unable to meet.
- Dual-mode operation. Feature engineering requires the same set of queries to operate in two modes: offline on historical data for model training and testing, and online on live data for real-time inference.
- Offline/online parity. In order to maintain model accuracy, feature vectors computed in the online mode must be identical to those computed offline given identical inputs.
- Feature freshness. In real-time ML applications, it is often necessary to update feature vectors based on the latest inputs with latencies in the order of 100ms or less.
- Unbounded inputs. Real-time ML applications often process unbounded data streams, e.g., financial transactions, IoT sensor data, or system logs. The feature engine must be able to compute on such unbounded data efficiently.
No established general-purpose query engine today meets all (or, frankly, any) of these requirements. Users often rely on two separate engines for offline and online feature computation, which requires implementing feature queries twice—a terrible user experience to begin with. But the biggest issues arise with implementing the online mode. While platforms like Flink or Spark Streaming can handle many feature queries, using them effectively to achieve the desired latency, throughput, and accuracy requires deep expertise that most ML engineers just don't have.
As a result, while it may only take a few days to train a model using offline features, building a production feature pipeline that works on live input data often takes many months.
This is where the idea of a DSL starts to look attractive. By narrowing down the set of queries the user can write, a DSL can hide a lot of the rough edges and deliver ease-of-use and performance, at the expense of generality. Here are some of the things a DSL designer can do:
- Compiling to multiple backends. A DSL can present a unified syntax to the developer, while using different offline and online query engines under the hood.
- Subsetting. By restricting the types of queries the user can write, the DSL can avoid queries that the underlying engine cannot handle well. Example: most streaming engines handle rolling aggregates poorly, so many feature engineering DSLs simply don't allow this type of query.
- Optimizations. The DSL compiler can generate highly optimized code for supported queries, getting the best possible performance within the restricted query set.
- Custom engines. Finally, DSL developers can choose to build their own query engine from scratch instead of relying on off-the-shelf tools. In this case, a smaller, more focused language is much easier to support than full SQL.
In summary, it is not surprising that many users prefer a restrictive, yet easy-to-use, DSL to unwieldy general-purpose frameworks. However, it is also clear that the ideal long-term solution is a general-purpose engine that offers both performance and ease of use.
Why SQL?
Imagine we had such an engine: one that can run arbitrary SQL queries, in both online and offline mode, with zero feature skew, high throughput, and low latency. How would we go about implementing ML features with it? Let's look at some common classes of features we encounter in ML applications:
Pattern 1: Aggregation
The purpose of feature engineering is to summarize a large array of raw input data into a low-dimensional representation as feature vectors. The primary means to achieve this in SQL is aggregation. SQL supports many flavors of aggregation: you can aggregate values over the entire table, or partition the table into groups using GROUP BY and aggregate within each group. When working with time-series data, it is often useful to aggregate data for a particular time frame. To this end, modern SQL dialects support multiple forms of time-based windows: tumbling, hopping, and session windows, as well as rolling (aka sliding) windows.
Feature engineering DSLs typically support a small subset of these capabilities, which can be restrictive, since each one can be useful in different situations.
Here is an example showing rolling aggregates in SQL. Note how different aggregate functions (AVG and SUM) over multiple time windows (1 and 30 days) are combined in a single query.
-- Credit card transactions.
CREATE TABLE transaction (
    ts TIMESTAMP LATENESS INTERVAL 10 MINUTES, -- Transaction time
    amt DOUBLE,                                -- Transaction amount
    cc_num BIGINT NOT NULL                     -- Credit card number
);
-- Compute two rolling aggregates:
-- 1. Average spend per transaction over a 1-day window.
-- 2. Total spend over a 30-day window.
CREATE VIEW rolling_aggregates AS
SELECT
    *,
    AVG(amt) OVER window_1_day AS avg_1day,
    SUM(amt) OVER window_30_day AS sum_30day
FROM
    transaction
WINDOW
    window_1_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND CURRENT ROW),
    window_30_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 30 DAYS PRECEDING AND CURRENT ROW);
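The other window flavors mentioned above are just as direct to express. Here is a minimal sketch of a tumbling-window aggregate over the same table; it assumes a dialect that supports the TUMBLE table function (as in Flink SQL or Apache Calcite), so treat it as illustrative rather than as the syntax of any specific engine.
-- Total spend per card in each one-hour tumbling window (a sketch; assumes
-- a dialect with the TUMBLE table function, as in Flink SQL or Calcite).
SELECT
    cc_num,
    window_start,
    window_end,
    SUM(amt) AS hourly_spend
FROM TABLE(TUMBLE(TABLE transaction, DESCRIPTOR(ts), INTERVAL 1 HOUR))
GROUP BY cc_num, window_start, window_end;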
Real-world feature pipelines often compute dozens of such aggregates. In a recent article, I argue that they might be the most important class of ML features for real-time fraud detection and other threat mitigation use cases.
Pattern 2: Data enrichment
In feature engineering we often need to combine data from multiple sources, for instance to enrich credit card transaction records with additional attributes extracted from customer profiles. Such data enrichment can be naturally expressed as SQL joins. If there's one thing SQL does extremely well, it's joins. In addition to the standard inner, outer, left, and right joins, modern SQL dialects support as-of joins, which are particularly useful when working with time series data:
-- Find the most recent customer record for each transaction; flag the
-- transaction as out-of-state if the state where the transaction was
-- performed doesn't match the customer's home state.
CREATE VIEW enriched_transaction AS
SELECT
    transaction.*,
    transaction.state != customer.state AS out_of_state
FROM
    transaction LEFT ASOF JOIN customer
    MATCH_CONDITION ( transaction.ts >= customer.ts )
    ON transaction.customer_id = customer.id;
Pattern 3: Composing features
Production ML models typically use hundreds of features, with complex features often constructed using simpler ones as subqueries. While SQL has limited modular programming capabilities, it allows defining reusable queries as views and composing them into more complex views. The example below uses an as-of join to flag out-of-state transactions. The resulting view is used to construct a more complex view that maintains the rolling 30-day count of out-of-state transactions per customer.
-- Flag out-of-state transactions.
CREATE VIEW enriched_transaction AS
SELECT
    transaction.*,
    (transaction.state != customer.state) AS out_of_state
FROM
    transaction LEFT ASOF JOIN customer
    MATCH_CONDITION ( transaction.ts >= customer.ts )
    ON transaction.customer_id = customer.id;
-- Compute the number of out-of-state transactions within the 30-day time frame.
CREATE VIEW transaction_with_history AS
SELECT
    *,
    SUM(CASE WHEN out_of_state THEN 1 ELSE 0 END) OVER window_30_day AS out_of_state_count
FROM
    enriched_transaction
WINDOW window_30_day AS (PARTITION BY customer_id ORDER BY ts RANGE BETWEEN INTERVAL 30 DAYS PRECEDING AND CURRENT ROW);
Other capabilities
An important advantage of a general-purpose engine is that it's, well, general. Feature queries are not fundamentally different from any other analytical workload and, as such, can benefit from many advanced features of a modern query engine, including:
- Support for semi-structured data, which would enable feature engineers to parse, flatten, and query JSON documents (see the first sketch after this list).
- Window operators like LAG and LEAD can be used to create feature vectors that refer to the previous or next event in the stream (see the second sketch after this list).
- User-defined functions allow extending the language with new functionality.
- Vector search can be used to enrich features based on similarity in high-dimensional spaces.
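For instance, here is a minimal sketch of extracting fields from JSON payloads. It assumes a dialect that supports the SQL:2016 JSON functions; the raw_event table and its fields are hypothetical.
-- Extract typed fields from a JSON payload (a sketch; assumes a dialect with
-- the SQL:2016 JSON functions; table and field names are hypothetical).
CREATE VIEW parsed_event AS
SELECT
    JSON_VALUE(payload, '$.device_id') AS device_id,
    CAST(JSON_VALUE(payload, '$.reading') AS DOUBLE) AS reading
FROM raw_event;
And here is a sketch of a LAG-based feature over the transaction table from the earlier examples, computing the time elapsed since the previous transaction on the same card:
-- Time elapsed since the previous transaction on the same card
-- (standard LAG over a partitioned window).
CREATE VIEW transaction_gaps AS
SELECT
    *,
    ts - LAG(ts) OVER (PARTITION BY cc_num ORDER BY ts) AS time_since_prev
FROM transaction;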
Conclusion
Feature engineering DSLs are undeniably a valuable technology, addressing a significant practical need. However, their existence highlights the limitations of general-purpose query engines. If users could solve their feature engineering problems effectively with tools like SQL or Spark, they wouldn’t need to adopt specialized systems that require learning new languages and APIs and offer only a fraction of the capabilities of a general-purpose engine.
To reiterate, in order to be suitable for the job, a general-purpose query engine must support the same queries over batch and streaming inputs. For streaming inputs, it must deliver sub-100ms latency while guaranteeing query outputs identical to those produced on batch inputs.
We happen to have built just such an engine. Check it out at https://www.feldera.com. Told you, I'm biased 😉