IO

IO refers to the loading and saving of data. For example, you might have an IO that loads a CSV file. In Ordeq, an IO is represented by an IO class.

Why use IOs?

To understand why IOs are useful, let's look at a simple example. Suppose we want to load a simple CSV file. We could write a function that loads the CSV file directly:

import csv
from pathlib import Path

def load_csv(path: Path) -> list[list[str]]:
    with path.open(mode='r') as f:
        reader = csv.reader(f)
        return list(reader)

The main downside of this approach is that defining the IO details, like the path to the file, and loading the data happen in the same call:

>>> load_csv(Path("to/data.csv"))
[['1', 'kiwis', '7.2'], ['2', 'grapefruits', '1.4']]

Instead, we can use the CSV IO from ordeq_files to represent the CSV file:

from ordeq_files import CSV
from pathlib import Path

io = CSV(path=Path("to/data.csv"))

Defining the IO does not load the data. The data is only loaded when we ask for it:

>>> io.load()
[(1, "kiwis", 7.2), (2, "grapefruits", 1.4)]

IOs do not hold any data

IOs do not hold any data themselves; they only know how to load and save data.

This means IOs can be defined separately from when they are used. It also means IOs can be easily reused in different places.

The same IO can be used to save data as well:

data_to_save = [(1, "apples", 3.5), (2, "bananas", 4.0)]
io.save(data_to_save)

Lastly, IOs serve as convenient and lightweight representations of data in your project:

>>> print(io)
CSV(path=PosixPath('to/data.csv'))
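
To make the separation between defining and loading concrete, here is a minimal sketch of a CSV-like IO built on the standard library. The class `SketchCSV` is hypothetical and is not how ordeq_files implements its CSV IO; it only illustrates that creating the object touches nothing on disk, and that the same object can both save and load.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchCSV:
    """Hypothetical stand-in for a CSV IO; not Ordeq's implementation."""

    path: Path

    def load(self) -> list[list[str]]:
        # The file is only opened here, not when the object is created.
        with self.path.open(mode="r", newline="") as f:
            return list(csv.reader(f))

    def save(self, rows: list[list[str]]) -> None:
        with self.path.open(mode="w", newline="") as f:
            csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    # Defining the IO is cheap: the file does not even exist yet.
    sketch_io = SketchCSV(path=Path(tmp) / "data.csv")
    exists_before_save = sketch_io.path.exists()

    # Saving and loading happen only when explicitly requested.
    sketch_io.save([["1", "apples", "3.5"], ["2", "bananas", "4.0"]])
    rows = sketch_io.load()
```

Because the object stores only the path, it can be defined in one module and loaded or saved wherever it is needed.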

More complex IOs

A key feature of IOs is that they abstract the loading and saving behaviour from the user. IOs are typically used to handle the interaction with file systems, cloud storage, APIs, databases and other data sources. Unlike the example above, these more complex IOs manage everything from authentication to (de)serialization.

Ordeq offers many off-the-shelf IOs for common data formats, such as CSV, Excel, JSON, Parquet, and SQL databases. Refer to the API documentation for a full list of available IOs.

Using IOs

IOs can be used stand-alone, for instance when exploring data in a Jupyter notebook. Suppose you just received an Excel file from a colleague and want to take a look at it. You can use the PandasExcel IO from ordeq_pandas to load and inspect the data:

>>> from ordeq_pandas import PandasExcel
>>> from pathlib import Path
>>> fruit_sales = PandasExcel(path=Path("fruit_sales.xlsx"))
>>> df = fruit_sales.load()
>>> df.head(2)
     Fruit Quantity (kg)  Price Store
0   apples         '3.5'    1.2     A
1  bananas         '4.0'    0.8     B
>>> df.dtypes
Fruit           object
Quantity (kg)   object
Price           float64
Store           object
dtype: object

Unfortunately, the data types are not quite right. We want to convert the Quantity (kg) column to float and rename the columns to something more convenient. We also want to drop the Store column, as we don't need it.

Load & save options

You can alter the loading behaviour of an IO through its load options:

>>> fruit_sales = fruit_sales.with_load_options(
...     dtype={"Quantity (kg)": float},
...     usecols="A:C",
...     names=["fruit", "quantity_kg", "price"],
... )
>>> df = fruit_sales.load()
>>> df.dtypes
fruit           object
quantity_kg     float64
price           float64
dtype: object

Here, the load options are used to specify the data types, select specific columns, and rename the columns. Under the hood, these options are passed to pandas.read_excel.

Building IO load and save options

The with_load_options and with_save_options methods return a new IO instance with the updated options. The original IO instance remains unchanged.

Similarly, you can alter the saving behaviour of an IO through its save options:

>>> fruit_sales = fruit_sales.with_save_options(index=False)
>>> fruit_sales.save(df)

For more information on the available load and save options, refer to the documentation of the specific IO you are using.
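
The note that with_load_options and with_save_options return a new IO instance can be illustrated with a small sketch. The class `SketchExcel` below is hypothetical, and whether Ordeq merges new options with existing ones or replaces them is an assumption made here for illustration; only the "new instance, original unchanged" behaviour is taken from the text above.

```python
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class SketchExcel:
    """Hypothetical IO used only to illustrate copy-on-update options."""

    path: Path
    load_options: dict[str, Any] = field(default_factory=dict)

    def with_load_options(self, **options: Any) -> "SketchExcel":
        # Return a new instance instead of mutating this one.
        # Merging with the existing options is an assumption of this sketch.
        return replace(self, load_options={**self.load_options, **options})


base = SketchExcel(path=Path("fruit_sales.xlsx"))
typed = base.with_load_options(dtype={"Quantity (kg)": float})
```

After this runs, `base` still has no load options, while `typed` carries the `dtype` option. This is why the examples above reassign the result, as in `fruit_sales = fruit_sales.with_load_options(...)`.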

IOs should not apply transformations

IOs should only be concerned with loading and saving data. Therefore, IOs should not apply any transformation on load or save. Some load or save options do perform what can be considered a transformation, like the casting and renaming done above. As a rule of thumb:

  • if the option is specific to your use case, it should be done outside the IO.
  • if the option refers to an operation that is likely to be useful to others, it might be appropriate as a load/save option.
  • if the option is closely tied to the IO implementation, it is likely appropriate as a load/save option.
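
The first rule of thumb can be sketched as follows. The function names `load_rows` and `total_quantity_kg` are hypothetical and do not come from Ordeq; the point is only that the loading step returns the data as stored, while the use-case-specific step lives in a separate function.

```python
import csv
from io import StringIO
from typing import TextIO


def load_rows(source: TextIO) -> list[list[str]]:
    # Loading concern only: parse CSV text into rows of strings.
    return list(csv.reader(source))


def total_quantity_kg(rows: list[list[str]]) -> float:
    # Use-case-specific transformation, kept outside the loading step.
    return sum(float(quantity) for _fruit, quantity, _price in rows)


raw = StringIO("apples,3.5,1.2\nbananas,4.0,0.8\n")
rows = load_rows(raw)
total = total_quantity_kg(rows)
```

Keeping the aggregation out of the loading function means the same loader can serve other consumers that need the raw rows.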

Where to go from here?