Parquet Format

Feldera can ingest and output data in the Parquet format.

This page documents the Parquet format and how it interacts with Feldera's SQL types.

Types

An ingested file must be a valid Parquet file with a schema. The schema (column names and types) must match the table definition in the Feldera pipeline program. Feldera uses Apache Arrow to specify the data types in Parquet. The following table shows the mapping between Feldera SQL types and Arrow types.

| Feldera SQL Type | Apache Arrow Type |
|---|---|
| `BOOLEAN` | `Boolean` |
| `TINYINT`, `SMALLINT`, `INTEGER`, `BIGINT` | `Int8`, `Int16`, `Int32`, `Int64` |
| `FLOAT`, `DOUBLE`, `DECIMAL` | `Float32`, `Float64`, `Decimal` |
| `VARCHAR`, `CHAR`, `STRING` | `LargeUtf8` |
| `BINARY`, `VARBINARY` | `Binary` |
| `TIME` | `UInt64` (time in nanoseconds) |
| `TIMESTAMP` | `Timestamp(TimeUnit::Millisecond, None)` (milliseconds since the UNIX epoch) |
| `DATE` | `Int32` (days since the UNIX epoch) |
| `ARRAY` | `LargeList` |
| `STRUCT` | `Struct` |
| `MAP` | `Dictionary` |
| `VARIANT` | `LargeUtf8` (JSON-encoded string, see the `VARIANT` documentation) |

Example

In this example, we configure a table to load data from a Parquet file.

```sql
create table PARTS (
    part bigint not null,
    vendor bigint not null,
    price bigint not null
) with ('connectors' = '[{
    "transport": {
        "name": "url_input",
        "config": { "path": "https://feldera-basics-tutorial.s3.amazonaws.com/parts.parquet" }
    },
    "format": {
        "name": "parquet",
        "config": {}
    }
}]');
```

For reference, the following Python script was used to generate the `parts.parquet` file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    'PART': [1, 2, 3],
    'VENDOR': [2, 1, 3],
    'PRICE': [10000, 15000, 9000]
}
table = pa.Table.from_pydict(data)
pq.write_table(table, 'parts.parquet')
```