Skip to content

Creating a IO class

This guide will help you create a new IO class by extending the base classes provided in ordeq.framework.io. IO classes are basic building block in ordeq to abstract IO operations from data transformations.

Frequently used IO implementations are offered out-of-the-box as ordeq packages. For instance, there is support for JSON, YAML, Pandas, NumPy, Polars and many more. These can be used where applicable and serve as reference implementation for new IO classes.

1. Understanding the IO Base Class

The IO class is an abstract base class that defines the structure for loading and saving data. It includes the following key methods:

  • load(): Method to be implemented by subclasses for loading data.
  • save(data): Method to be implemented by subclasses for saving data.

2. Example: FaissIndex

Before creating a custom IO class step-by-step, first consider the following implementation using the Faiss library. The FaissIndex class extends the IO class and implements the load and save methods:

>>> from dataclasses import dataclass
>>> from pathlib import Path
>>> import faiss
>>> from ordeq.framework.io import IO
>>> @dataclass(frozen=True, kw_only=True)
... class FaissIndex(IO[faiss.Index]):
...     path: Path
...
...     def load(self) -> faiss.Index:
...         return faiss.read_index(str(self.path))
...
...     def save(self, index: faiss.Index) -> None:
...         faiss.write_index(index, str(self.path))

3. Creating Your Own IO Class

In this section, we will go step-by-step through the creation of a simple text-based file dataset.

Step 3.1: Define Your IO Class

Create a new class that extends the IO class and implement the load and save methods.

>>> from dataclasses import dataclass
>>> from pathlib import Path
>>> from ordeq.framework.io import IO
>>> @dataclass(frozen=True, kw_only=True)
... class CustomIO(IO):
...     path: Path
...
...     def load(self):
...         pass
...
...     def save(self, data):
...         pass

Step 3.2: Implement the load Method

This method should contain the logic for loading your data. For example:

def load(self):
    return self.path.read_text()

Step 3.3: Implement the save Method

This method should contain the logic for saving your data. For example:

def save(self, data):
    self.path.write_text(data)

Load- or save arguments

The path attribute is used by both the load and save method. It's also possible to provide parameters to the individual methods. For instance, we could let the user control the newline character used by write_text:

>>> from dataclasses import dataclass
>>> from pathlib import Path
>>> from ordeq.framework.io import IO
>>> @dataclass(frozen=True, kw_only=True)
... class CustomIO(IO):
...     path: Path
...
...     def load(self):
...         return self.path.read_text()
...
...     def save(self, data, newline: str = "\n"):
...         self.path.write_text(data, newline=newline)

A common pattern when using third party functionality is to delegate keyword arguments to another function.

Below is an example of this for a Pandas CSV IO class:

>>> from dataclasses import dataclass
>>> from pathlib import Path
>>> import pandas as pd
>>> from ordeq.framework.io import IO
>>> @dataclass(frozen=True, kw_only=True)
... class PandasCSV(IO[pd.DataFrame]):
...     path: Path
...
...     def load(self, **load_args) -> pd.DataFrame:
...        return pd.read_csv(self.path, **load_args)
...
...     def save(self, data: pd.DataFrame, **save_args):
...         data.write_csv(self.path, **save_args)

Providing type information

We can provide the str argument to IO to indicate that CustomIO class loads and saves strings. This type should match the return type of the load method and the first parameter of the save method.

>>> from dataclasses import dataclass
>>> from pathlib import Path
>>> from ordeq.framework.io import IO
>>> @dataclass(frozen=True, kw_only=True)
... class CustomIO(IO[str]):  # IO operates on type `str`
...     path: Path
...
...     def load(self) -> str:  # and thus returns a `str` on load
...         return self.path.read_text()
...
...     def save(self, data: str) -> None:  # and takes a `str` as first argument to `save`
...         self.path.write_text(data)

Read-only and write-only classes

While most data need to be loaded and saved alike, this is not always the case. If in our code one of these operations is not necessary, then we can choose to not implement them.

Practical examples are:

  • Read-only: When loading machine learning models from a third party registry where we have only read permissions (e.g. HuggingFace).
  • Write-only: when a Matplotlib plot is rendered to a PNG file, we cannot load the Figure back from the PNG data.

Creating a Read-only class using Input

For a practical example of a class that is Read-only, we will consider generating of synthetic sensor data. The SensorDataGenerator class will extend the Input class, meaning it will only have to implement the load method.

>>> import random
>>> from dataclasses import dataclass
>>> from typing import Any
>>> from ordeq.framework.io import Input
>>> @dataclass(frozen=True, kw_only=True)
... class SensorDataGenerator(Input[dict[str, Any]]):
...     """Example Input class to generate synthetic sensor data
...
...     Examples:
...     >>> generator = SensorDataGenerator(sensor_id="sensor_3")
...     >>> data = generator.load()
...     {'sensor_id': 'sensor_3', 'temperature': 22.001252691230633, 'humidity': 35.2674852725557}
...     """
...
...     sensor_id: str
...
...     def load(self) -> dict[str, Any]:
...         """Simulate reading data from a sensor"""
...         return {
...             "sensor_id": self.sensor_id,
...             "temperature": random.uniform(20.0, 30.0),
...             "humidity": random.uniform(30.0, 50.0)
...         }

Saving data using this dataset would raise a ordeq.framework.io.IOException explaining the save method is not implemented.

Similarly, you can inherit from the Output class for IO that only require to implement the save method. The ordeq-matplotlib package contains an example of this in MatplotlibFigure.