
IO

IO refers to the loading and saving of data. For example, you might have an IO that loads a CSV file. In Ordeq, an IO is represented by an IO class.

Why use IOs?

To understand why IOs are useful, let's look at a simple example. Suppose we want to load a simple CSV file. We could write a function that loads the CSV file directly:

import csv
from pathlib import Path


def load_csv(path: Path) -> list[list[str]]:
    with path.open(mode="r") as f:
        reader = csv.reader(f)
        return list(reader)

The main downside of this approach is that the data is loaded as soon as the IO details, such as the path to the file, are supplied:

>>> load_csv(Path("to/data.csv"))
[['1', 'kiwis', '7.2'], ['2', 'grapefruits', '1.4']]

Instead, we can use the CSV IO from ordeq_files to represent the CSV file:

from pathlib import Path

from ordeq_files import CSV

io = CSV(path=Path("to/data.csv"))

Defining the IO does not load the data. That only happens when we tell it to:

>>> io.load()
[(1, "kiwis", 7.2), (2, "grapefruits", 1.4)]

IOs do not hold any data

IOs do not hold any data themselves; they only know how to load and save data.

This means IOs can be defined separately from when they are used. It also means IOs can be easily reused in different places.
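To make this concrete, here is a minimal sketch of the idea using only the standard library. The `LazyCSV` class is a hypothetical stand-in written for this illustration, not Ordeq's implementation. It stores a path and reads the file only when `load()` is called, so it can be defined before the file even exists:

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LazyCSV:
    """Hypothetical stand-in for an IO: it stores the path, never the data."""

    path: Path

    def load(self) -> list[list[str]]:
        # The file is opened only when load() is called.
        with self.path.open(mode="r", newline="") as f:
            return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    # Define the IO first. The file does not even exist yet.
    io = LazyCSV(path=Path(tmp) / "data.csv")
    existed_at_definition = io.path.exists()

    # Create the file afterwards, then load it through the same IO.
    with io.path.open(mode="w", newline="") as f:
        csv.writer(f).writerows([(1, "kiwis", 7.2), (2, "grapefruits", 1.4)])
    rows = io.load()

print(existed_at_definition)  # False
print(rows)  # [['1', 'kiwis', '7.2'], ['2', 'grapefruits', '1.4']]
```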

The same IO can be used to save data as well:

>>> data_to_save = [(1, "apples", 3.5), (2, "bananas", 4.0)]
>>> io.save(data_to_save)

The data argument

The first argument to the save method, which is the data to be saved, is positional-only. This is to avoid ambiguity, as it ensures the required data argument precedes any save options.
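The effect of a positional-only parameter can be shown with plain Python. The `save` function below is a hypothetical illustration, not Ordeq's actual signature. It only assumes that the data comes first and that any options follow as keyword arguments:

```python
def save(data, /, **save_options):
    """Hypothetical save function: `data` can only be passed by position."""
    return data, save_options


# The data always comes first; options follow as keywords.
saved, options = save([1, 2, 3], index=False)

# Because `data` is positional-only, an option may even be named "data"
# without clashing with the first argument.
_, clash_free = save([1, 2, 3], data="some-option")

# Passing the data by keyword is rejected with a TypeError.
try:
    save(data=[1, 2, 3])
except TypeError:
    keyword_rejected = True
else:
    keyword_rejected = False

print(options)  # {'index': False}
print(clash_free)  # {'data': 'some-option'}
print(keyword_rejected)  # True
```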

Lastly, IOs serve as convenient and lightweight representations of data in your project:

>>> print(io)
CSV(path=PosixPath('to/data.csv'))

More complex IOs

A key feature of IOs is that they abstract the loading and saving behaviour away from the user. IOs are typically used to handle the interaction with file systems, cloud storage, APIs, databases and other data sources. Unlike the simple CSV example above, these more complex IOs manage everything from authentication to (de)serialization.

Ordeq offers many off-the-shelf IOs for common data formats, such as CSV, Excel, JSON, Parquet, and SQL databases. Refer to the API documentation for a full list of available IOs.

Using IOs

IOs can be used stand-alone, for instance when exploring data in a Jupyter notebook. Suppose you just received an Excel file from a colleague and want to take a look at it. You can use the PandasExcel IO from ordeq_pandas to load and inspect the data:

>>> from ordeq_pandas import PandasExcel
>>> from pathlib import Path
>>> fruit_sales = PandasExcel(path=Path("fruit_sales.xlsx"))
>>> df = fruit_sales.load()
>>> df.head(2)
     Fruit  Quantity (kg)  Price  Store
0   apples          '3.5'    1.2      A
1  bananas          '4.0'    0.8      B
>>> df.dtypes
Fruit           object
Quantity (kg)   object
Price           float64
Store           object
dtype: object

Unfortunately the data types are not quite right. We want to convert the Quantity (kg) column to float and rename the columns to be more convenient. Furthermore, we want to drop the Store column as we don't need it.

Load & save options

You can alter the loading behaviour of an IO through its load options:

>>> fruit_sales = fruit_sales.with_load_options(
...     dtype={"Quantity (kg)": float},
...     usecols="A:C",
...     names=["fruit", "quantity_kg", "price"],
... )
>>> df = fruit_sales.load()
>>> df.dtypes
fruit           object
quantity_kg     float64
price           float64
dtype: object

Here, the load options are used to specify the data types, select specific columns, and rename the columns. Under the hood, these options are passed to pandas.read_excel.

Building IO load and save options

The with_load_options and with_save_options methods return a new IO instance with the updated options. The original IO instance remains unchanged.
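One common way to get this copy-on-update behaviour is a frozen dataclass combined with `dataclasses.replace`. The sketch below shows the pattern only. `SketchIO` is hypothetical, and whether Ordeq merges new options into existing ones (as assumed here) or replaces them is not stated on this page:

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SketchIO:
    """Hypothetical IO used to illustrate the pattern; not Ordeq's code."""

    path: str
    load_options: dict = field(default_factory=dict)

    def with_load_options(self, **options) -> "SketchIO":
        # Build a copy with the merged options instead of mutating self.
        # Merging (rather than replacing) is an assumption of this sketch.
        return replace(self, load_options={**self.load_options, **options})


base = SketchIO(path="fruit_sales.xlsx")
typed = base.with_load_options(usecols="A:C")

print(base.load_options)  # {}
print(typed.load_options)  # {'usecols': 'A:C'}
print(base is typed)  # False
```

A practical consequence is that the result must be assigned to a name, as in the examples on this page. Calling with_load_options without keeping the return value has no effect on the original IO.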

Similarly, you can alter the saving behaviour of an IO through its save options:

>>> fruit_sales = fruit_sales.with_save_options(index=False)
>>> fruit_sales.save(df)

For more information on the available load and save options, refer to the documentation of the specific IO you are using.

IOs should not apply transformations

IOs should only be concerned with loading and saving data. Therefore, IOs should not apply any transformation on load or save. Some load or save options do incur what can be considered a transformation, like the casting or renaming done above. As a rule of thumb:

  • if the option is specific to your use case, it should be done outside the IO.
  • if the option refers to an operation that is likely to be useful to others, it might be appropriate as a load/save option.
  • if the option is closely tied to the IO implementation, it is likely appropriate as a load/save option.
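As an illustration of the first rule, a use-case-specific step such as filtering on one store can live in an ordinary function that runs after loading. Everything in this sketch is hypothetical: `load_rows` merely stands in for a call to `io.load()`, and the row layout is invented for the example:

```python
def load_rows() -> list[dict]:
    """Stand-in for `io.load()`: returns the data as stored, untransformed."""
    return [
        {"fruit": "apples", "quantity_kg": 3.5, "store": "A"},
        {"fruit": "bananas", "quantity_kg": 4.0, "store": "B"},
    ]


def only_store(rows: list[dict], store: str) -> list[dict]:
    """Use-case-specific filter, kept outside the IO."""
    return [row for row in rows if row["store"] == store]


store_a = only_store(load_rows(), "A")
print(store_a)  # [{'fruit': 'apples', 'quantity_kg': 3.5, 'store': 'A'}]
```

Keeping the filter outside means other users of the same data still get it exactly as it is stored.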

Attributes

Often it can be useful to annotate IOs with additional attributes. Examples of useful attributes include:

  • the description (e.g., "Sales data by month")
  • the layer of the data (e.g., raw, processed, final)
  • the source of the data (e.g., internal, external, third-party)
  • the owner of the data (e.g., team or person responsible)

Attributes can be assigned using the with_attributes method:

from pathlib import Path

from ordeq_files import CSV

sales = CSV(path=Path("sales.csv")).with_attributes(
    description="Sales data by month",
    layer="gold",
    source="internal",
    owner="dwh-team@company.com",
)

Attributes are stored on the IO instance. Framework extensions like ordeq-viz can leverage these attributes to provide additional functionality.
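To give a feel for how tooling could make use of such attributes, the sketch below groups a handful of IOs by their layer. It is an assumption-laden illustration: `AnnotatedIO`, its `attributes` field and its `name` field are hypothetical, and this page does not describe how Ordeq or ordeq-viz actually expose attributes:

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class AnnotatedIO:
    """Hypothetical IO that keeps its attributes in a plain dict."""

    name: str
    attributes: dict = field(default_factory=dict)

    def with_attributes(self, **attributes) -> "AnnotatedIO":
        # Return a copy carrying the extra attributes.
        return replace(self, attributes={**self.attributes, **attributes})


catalog = [
    AnnotatedIO("sales").with_attributes(layer="gold", source="internal"),
    AnnotatedIO("sales_raw").with_attributes(layer="raw", source="internal"),
    AnnotatedIO("weather").with_attributes(layer="raw", source="third-party"),
]

# Group IO names by layer, the way a catalog or visualization tool might.
by_layer: dict[str, list[str]] = {}
for io in catalog:
    by_layer.setdefault(io.attributes.get("layer", "unknown"), []).append(io.name)

print(by_layer)  # {'gold': ['sales'], 'raw': ['sales_raw', 'weather']}
```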

Resources

Your project may contain multiple IOs for the same underlying resource. For instance, you might have one IO that saves a raw CSV and another IO that manipulates the CSV data. The raw CSV can be handled with a lower-level library like Boto3, and the manipulation done with a DataFrame-like library such as Polars.

IOs can be assigned a resource using the @ operator:

from ordeq_boto3 import S3Object
from ordeq_polars import PolarsEagerCSV

sales_raw = S3Object(bucket="bucket", key="sales.csv") @ "sales-file"
sales_df = PolarsEagerCSV(path="s3://bucket/sales.csv") @ "sales-file"

This tells Ordeq that sales_raw and sales_df use the same resource (in this case, a file on S3). Defining resources informs other developers and, most importantly, helps with running and visualizing your project, as we will see later.
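In Python, the @ operator is implemented by the `__matmul__` method. The sketch below shows how a class could use it to attach a resource label, and how a framework might then find IOs that share a resource. `ResourceIO` and its fields are hypothetical and say nothing about how Ordeq implements this internally:

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResourceIO:
    """Hypothetical IO that can be tagged with a resource via `@`."""

    name: str
    resource: Optional[str] = None

    def __matmul__(self, resource: str) -> "ResourceIO":
        # `io @ "label"` returns a copy tagged with the resource label.
        return replace(self, resource=resource)


sales_raw = ResourceIO("sales_raw") @ "sales-file"
sales_df = ResourceIO("sales_df") @ "sales-file"
other = ResourceIO("other") @ "other-file"

# Group IO names by resource to see which ones touch the same thing.
shared = defaultdict(list)
for io in (sales_raw, sales_df, other):
    shared[io.resource].append(io.name)

print(dict(shared))
# {'sales-file': ['sales_raw', 'sales_df'], 'other-file': ['other']}
```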

Where to go from here?