utils.py

apply_schema(df, schema)

Applies a schema to the DataFrame, meaning:

- select only those columns in the schema from the DataFrame
- cast these to the types specified in the schema

Note that this method does not check for nullability in the schema.

Parameters:

Name | Type | Description | Default
df | DataFrame | the DataFrame to apply the schema to | required
schema | StructType | the schema (StructType) to apply | required

Returns:

Type | Description
DataFrame | the DataFrame with schema applied

create_schema(schema)

Utility method that creates a Spark StructType (that is, a schema) from tuple inputs, without the need to import and write out StructType and StructField.

Parameters:

Name | Type | Description | Default
schema | tuple[tuple[str, DataType, bool], ...] | tuple of schema inputs | required

Returns:

Type | Description
StructType | the created StructType

get_spark_session()

Helper to get the SparkSession

Returns:

Type | Description
SparkSession | the Spark session object

Raises:

Type | Description
RuntimeError | when the Spark session is not active