Skip to main content

Streaming SQL extensions

In order to implement features that are supported by streaming engines, we offer a few extensions to standard SQL.

Append-only tables

By specifying the property 'append_only' = 'true' a user instructs Feldera that a table will only receive INSERT updates (no deletions or updates). This type of table is frequently used in streaming programs. This usage pattern enables the compiler to apply additional optimizations.

Example

CREATE TABLE T (
...
) WITH (
'append_only' = 'true'
);

LATENESS expressions

LATENESS is a property of the data in a column of a table or a view that is relevant for the case of stream processing. LATENESS is described by an expression that evaluates to a constant value. The expression must have a type that can be subtracted from the column type. For example, a column of type TIMESTAMP may have a lateness specified as an INTERVAL type.

To specify LATENESS for a table column, the column declaration can be annotated in the CREATE TABLE statement. For example:

CREATE TABLE order_pickup (
when TIMESTAMP NOT NULL LATENESS INTERVAL '1:00' HOURS TO MINUTES,
location VARCHAR
);

To specify LATENESS for a view, our custom SQL statement LATENESS must be used. The statement specifies a view, a column of the view, and an expression for the latness value. The statement may appear before or after the view declaration. For example:

LATENESS V.COL1 INTERVAL '1' HOUR;
CREATE VIEW V AS SELECT T.COL1, T.COL2 FROM T;

The LATENESS property of a column allows values that are too much "out of order" to be ignored. A value X is "out of order" if the table or view has previously has contained a row with a value Y for the respective column, such that Y > X. A value X is too much out of order if Y - lateness > X.

For example, consider the table above, and a sequence of insertions, each as a separate transaction:

INSERT INTO T VALUES('2020-01-01 00:00:00', 'home');
INSERT INTO T VALUES('2020-01-01 01:00:00', 'office');
-- next row is late, but not too late
INSERT INTO T VALUES('2020-01-01 00:10:00', 'shop');
INSERT INTO T VALUES('2020-01-01 02:00:00', 'home');
-- next row is too late, and it will be ignored
INSERT INTO T VALUES('2020-01-01 00:20:00', 'friend');

The second insertion is not out of order, since its timestamp is larger than the timestamp of the first insertion. The third insertion is out of order, since its timestamp value is smaller than the second insertion. Buy the third insertion is not too late, since it is only 50 minutes late, whereas the specified column lateness is 1 hour. The fifth row is too late, though, since it is 100 minutes late with respect to the fourth row.

Lateness is used to instruct the data processing system to ignore the rows that are too much out of order; the system behaves as if such insertions never took place. The "current timestamp", used to decide if a value is late, is updated only after the current program step. So if all 5 insertions above are executed within one step, all five will take place. If they are executed as 5 steps, only the first 4 will take place.

The LATENESS annotation ensures that some computed results that reflect past data may not be updated due to very late coming data. This also enables the runtime system to avoid maintaining very old state, which may never impact future results.

A table or view can have any number of columns annotated with lateness. An inserted row is considered "too late" if any of its annotated columns is too late.

WATERMARK expressions

WATERMARK is an annotation on the data in a column of a table that is relevant for the case of stream processing. WATERMARK is described by an expression that evaluates to a constant value. The expression must have a type that can be subtracted from the column type. For example, a column of type TIMESTAMP may have a watermark specified as an INTERVAL type.

To specify WATERMARK for a table column, the column declaration can be annotated in the CREATE TABLE statement. For example:

CREATE TABLE order_pickup (
when TIMESTAMP NOT NULL WATERMARK INTERVAL '1:00' HOURS TO MINUTES,
location VARCHAR
);

The effect of the WATERMARK is to delay the processing of the input rows until they are less likely to arrive out of order with respect to other rows. More precisely, the system maintains the largest value encountered so far in any input row for the columns that have a watermark. Given a WATERMARK annotation with value W, an input row with a value X for a watermarked column will be "held up" until an input row with a value X + W has been received. The program will behave as if the row with value X has only just been received.

If a table has multiple columns annotated with WATERMARK values, a row is released only when all the delays have been exceeded.