ordeq_spark
SparkCSV dataclass
Bases: IO[DataFrame]
IO for loading and saving CSV using Spark.
Example:
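A minimal sketch, assuming the IO is constructed with a `path` field (the actual field names may differ):

```python
from ordeq_spark import SparkCSV

# "path" is an assumed field name for this sketch.
customers = SparkCSV(path="data/customers.csv")

df = customers.load(infer_schema=True, header=True)
customers.save(df, sep=";")  # extra keyword options go to Spark's CSV writer
```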
load(infer_schema=True, header=True, **load_options)
Load a CSV into a DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `infer_schema` | `bool` | Whether to infer the schema. | `True` |
| `header` | `bool` | Whether the CSV has a header row. | `True` |
| `load_options` | `Any` | Additional options for loading. | `{}` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | The loaded DataFrame. |
save(df, **save_options)
Save a DataFrame to CSV.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The DataFrame to save. | *required* |
| `save_options` | `Any` | Additional options for saving. | `{}` |
SparkDataFrame dataclass
Bases: Input[DataFrame]
Allows a Spark DataFrame to be hard-coded in Python. This is suitable for small tables, such as very simple dimension tables that are unlikely to change. It can also be useful in unit testing.
Example usage:
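A minimal sketch; the constructor fields (`data`, `schema`) are assumptions and may differ from the actual dataclass:

```python
from ordeq_spark import SparkDataFrame

# Hard-coded dimension table; "data" and "schema" are assumed field names.
colours = SparkDataFrame(
    data=[("R", "red"), ("G", "green"), ("B", "blue")],
    schema="code string, name string",
)

df = colours.load()  # a pyspark.sql.DataFrame with three rows
```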
SparkExplainHook
SparkGlobalTempView dataclass
Bases: IO[DataFrame]
IO for reading from and writing to Spark global temporary views.
Examples:
Create and save a DataFrame to a global temp view:
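A minimal sketch, assuming the view name is passed as a `name` field (actual field names may differ):

```python
from pyspark.sql import SparkSession
from ordeq_spark import SparkGlobalTempView

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], "id int, value string")

view = SparkGlobalTempView(name="my_view")  # "name" is an assumed field name
view.save(df)  # registers the DataFrame as global_temp.my_view
```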
Load the DataFrame from the global temp view:
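Continuing the sketch above:

```python
df = view.load()  # reads back from the global temp view
```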
SparkHiveTable dataclass
Bases: SparkTable, IO[DataFrame]
IO for reading from and writing to Hive tables in Spark.
Examples:
Save a DataFrame to a Hive table:
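A minimal sketch, assuming the table is identified by a `table` field (actual field names may differ):

```python
from pyspark.sql import SparkSession
from ordeq_spark import SparkHiveTable

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "a")], "id int, value string")

table = SparkHiveTable(table="my_db.my_table")  # "table" is an assumed field name
table.save(df)
```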
SparkIcebergTable dataclass
Bases: SparkTable, IO[DataFrame]
IO for loading and saving Iceberg tables using Spark.
Example usage:
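A minimal sketch, assuming the fully qualified table name is passed as a `table` field (actual field names may differ):

```python
from ordeq_spark import SparkIcebergTable

# "table" is an assumed field name for this sketch.
orders = SparkIcebergTable(table="catalog.db.orders")

df = orders.load()
orders.save(df, mode="append")
orders.save(df, mode="overwritePartitions", partition_by="dt")
```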
Saving is idempotent: if the target table does not exist, it is created with the configuration set in the save options.
Table properties can be specified via the `properties` attribute. Currently, the properties are taken into account on write only.
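A sketch of setting properties; `write.format.default` is a standard Iceberg table property, while the field names remain assumptions:

```python
from ordeq_spark import SparkIcebergTable

orders = SparkIcebergTable(
    table="catalog.db.orders",  # "table" is an assumed field name
    # Per the note above, properties are applied on write only.
    properties={"write.format.default": "parquet"},
)
```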
Currently, only a subset of Iceberg writes is supported.
save(df, mode='overwritePartitions', partition_by=())
Saves the DataFrame to the Iceberg table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame to save. | *required* |
| `mode` | `SparkIcebergWriteMode` | Write mode, one of: `"create"` (create the table, fail if it exists), `"createOrReplace"` (create the table, replace if it exists), `"overwrite"` (overwrite the table), `"overwritePartitions"` (overwrite partitions of the table), `"append"` (append to the table). | `'overwritePartitions'` |
| `partition_by` | `SparkIcebergPartitionType` | Columns to partition by. Can be a single column name as a string (e.g. `"colour"`), a list of column names (e.g. `["colour", "dt"]`), or a tuple of tuples for more complex partitioning (e.g. `(("colour",), (F.years, "dt"))`). | `()` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If `mode` is not one of the supported modes. |
| `RuntimeError` | If the Spark captured exception cannot be parsed. |
| `CapturedException` | If there is an error during the write operation. |
SparkJSON dataclass
Bases: IO[DataFrame]
IO for loading and saving JSON using Spark.
Example:
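A minimal sketch, assuming a `path` field (actual field names may differ); extra keyword options are forwarded to Spark's JSON reader and writer:

```python
from ordeq_spark import SparkJSON

events = SparkJSON(path="data/events.json")  # "path" is an assumed field name

df = events.load(multiLine=True)  # "multiLine" is a standard Spark JSON option
events.save(df)
```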
load(**load_options)

Load a JSON file into a DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `load_options` | `Any` | Additional options for loading. | `{}` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | The loaded DataFrame. |
save(df, **save_options)
Save a DataFrame to JSON.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The DataFrame to save. | *required* |
| `save_options` | `Any` | Additional options for saving. | `{}` |
SparkJobGroupHook
Bases: NodeHook
Node hook that sets the Spark job group to the node name. Please make sure the Spark session is initialized before using this hook.
Example usage:
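A minimal sketch; how hooks are registered with the runner is assumed here and may differ in the actual framework:

```python
from pyspark.sql import SparkSession
from ordeq_spark import SparkJobGroupHook

# The Spark session must be initialized before the hook runs.
spark = SparkSession.builder.getOrCreate()

hooks = [SparkJobGroupHook()]
# Pass the hooks to your pipeline runner (registration API assumed), e.g.:
# run(my_nodes, hooks=hooks)
# Each node's Spark jobs then appear under the node's name in the Spark UI.
```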
SparkParquet dataclass
Bases: IO[DataFrame]
IO for loading and saving Parquet files using Spark.
Basic usage:
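A minimal sketch, assuming a `path` field (actual field names may differ):

```python
from ordeq_spark import SparkParquet

ratings = SparkParquet(path="data/ratings.parquet")  # "path" is an assumed field name

df = ratings.load()
ratings.save(df)
```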
Loading with options:
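Continuing the sketch, load options are forwarded to `spark.read.parquet`:

```python
df = ratings.load(mergeSchema=True)  # "mergeSchema" is a standard Parquet read option
```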
Saving with options:
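Likewise, save options are forwarded to `DataFrame.write.parquet`:

```python
ratings.save(df, compression="zstd")  # "compression" is a standard Parquet write option
```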
load(**load_options)
Loads a Parquet file into a Spark DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `load_options` | `Any` | Additional options to pass to Spark's `read.parquet` method. | `{}` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | A Spark DataFrame containing the data from the Parquet file. |
save(df, **save_options)
Saves a Spark DataFrame to a Parquet file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The Spark DataFrame to save. | *required* |
| `save_options` | `Any` | Additional options to pass to Spark's `write.parquet` method. | `{}` |
SparkSession dataclass
Bases: Input[SparkSession]
Input representing the active Spark session. Useful for accessing the active Spark session in nodes.
Example:
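A minimal sketch; note that `ordeq_spark.SparkSession` shares its name with the pyspark class:

```python
import pyspark.sql
from ordeq_spark import SparkSession

pyspark.sql.SparkSession.builder.getOrCreate()  # make sure a session is active
spark = SparkSession().load()  # the active pyspark.sql.SparkSession
```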
Example in a node:
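A sketch of using the session inside a node; the node wiring is assumed and may differ from the actual framework API:

```python
import pyspark.sql
from ordeq_spark import SparkSession

def build_lookup(spark: pyspark.sql.SparkSession) -> pyspark.sql.DataFrame:
    # The node receives the active session via the SparkSession input.
    return spark.createDataFrame([("R", "red")], "code string, name string")

# When wired as a node, SparkSession() would be declared as an input
# (wiring API assumed); calling the function directly behaves the same:
df = build_lookup(SparkSession().load())
```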
load()
Gets the active SparkSession. Raises an error if there is no active Spark session.
Returns:

| Type | Description |
|---|---|
| `SparkSession` | The active `pyspark.sql.SparkSession`. |
SparkTempView dataclass
Bases: IO[DataFrame]
IO for reading from and writing to Spark temporary views.
Examples:
Create and save a DataFrame to a temp view:
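A minimal sketch, assuming a `name` field (actual field names may differ):

```python
from pyspark.sql import SparkSession
from ordeq_spark import SparkTempView

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], "id int, value string")

view = SparkTempView(name="my_view")  # "name" is an assumed field name
view.save(df)  # registers the DataFrame as a temp view
```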
Load the DataFrame from the temp view:
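Continuing the sketch above:

```python
df = view.load()  # reads back from the temp view
```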