Creating an IO class
This guide will help you create a new IO class by extending the base classes provided by Ordeq.
IO classes are the basic building blocks in Ordeq for abstracting IO operations from data transformations.
Frequently used IO implementations are offered out-of-the-box as ordeq packages.
For instance, there is support for JSON, YAML, Pandas, NumPy, Polars and many more.
These can be used where applicable and serve as reference implementations for new IO classes.
Creating your own IO class
In this section, we will go step-by-step through the creation of a simple text-based file dataset.
All IO classes extend the IO class.
The IO class is an abstract base class that defines the structure for loading and saving data.
It includes the following key methods:
- load(): Method to be implemented by subclasses for loading data.
- save(data): Method to be implemented by subclasses for saving data.
First, create a new class that extends the IO class and implement these load and save methods.
The class should also have an __init__ method to initialize the necessary attributes, such as the file path.
Which IO attributes should be in the __init__?
Attributes that are necessary for both loading and saving data should be defined in the __init__ method.
For example, a file path or database connection string.
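A minimal skeleton of such a class might look as follows. To keep the sketch runnable without Ordeq installed, it defines a small stand-in IO base class; in a real project you would import IO from ordeq instead. The call to super().__init__() is an assumption about what the base class expects.

```python
from pathlib import Path


class IO:
    """Stand-in for Ordeq's IO base class, so the sketch runs without Ordeq."""


class CustomIO(IO):
    def __init__(self, path: Path):
        # The path is needed by both load and save, so it is set here.
        self.path = path
        super().__init__()

    def load(self):
        ...  # loading logic goes here

    def save(self, data):
        ...  # saving logic goes here


dataset = CustomIO(Path("example.txt"))
```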
The load method should contain the logic for loading your data.
For example:
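A sketch of a load method for the text-based dataset, shown on a plain class so that it runs on its own. In Ordeq the class would extend IO; the base class is left out here for brevity.

```python
import tempfile
from pathlib import Path


class CustomIO:  # would extend Ordeq's IO; base class omitted in this sketch
    def __init__(self, path: Path):
        self.path = path

    def load(self) -> str:
        # Read the whole file and return its contents as text.
        return self.path.read_text()


# Usage: load a file that was written beforehand.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_text("hello")
    loaded = CustomIO(path).load()
```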
The save method should contain the logic for saving your data.
For example:
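A matching sketch of a save method, again on a plain class with the Ordeq base class omitted. Note that Path.write_text returns the number of characters written, so the method discards that value rather than returning it.

```python
import tempfile
from pathlib import Path


class CustomIO:  # would extend Ordeq's IO; base class omitted in this sketch
    def __init__(self, path: Path):
        self.path = path

    def save(self, data: str) -> None:
        # The return value of write_text is not passed on, so save returns None.
        self.path.write_text(data)


# Usage: save a string and read the file back to inspect the result.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    result = CustomIO(path).save("hello")
    written = path.read_text()
```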
Save methods should not return anything
Save methods should always return None.
Ordeq will raise an error if a save method returns another type.
Load and save arguments
The path attribute is used by both the load and save methods.
It's also possible to provide parameters to the individual methods.
For instance, we could let the user control the newline character used by write_text:
- The newline argument is specific to the save method.
All arguments to the load and save methods (except self and data) should have a default value.
A common pattern when using third-party functionality is to delegate keyword arguments to another function.
Below is an example of this for the CustomIO class:
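A sketch of this delegation pattern, assuming the underlying functions are Path.read_text and Path.write_text. The IO class is a stand-in so the snippet runs without Ordeq installed, and the names load_options and save_options are illustrative.

```python
import tempfile
from pathlib import Path


class IO:
    """Stand-in for Ordeq's IO base class, so the sketch runs without Ordeq."""


class CustomIO(IO):
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    def load(self, **load_options) -> str:
        # Forwarded unchanged to read_text, e.g. encoding or errors.
        return self.path.read_text(**load_options)

    def save(self, data: str, **save_options) -> None:
        # Forwarded unchanged to write_text, e.g. encoding or errors.
        self.path.write_text(data, **save_options)


# Usage: the caller chooses the encoding for each operation.
with tempfile.TemporaryDirectory() as tmp:
    dataset = CustomIO(Path(tmp) / "example.txt")
    dataset.save("hello", encoding="utf-16")
    loaded = dataset.load(encoding="utf-16")
```

Because the keyword arguments are optional, load() and save(data) can still be called without any of them.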
The CustomIO class can then be used by passing such keyword arguments directly to the load and save methods.
Tips & tricks
Providing type information
We can provide the str argument to IO to indicate that the CustomIO class loads and saves strings.
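A sketch of a typed CustomIO. The generic IO class below is a stand-in so the snippet runs without Ordeq installed; it is assumed that Ordeq's own IO accepts a type argument in the same way.

```python
import tempfile
from pathlib import Path
from typing import Generic, TypeVar

T = TypeVar("T")


class IO(Generic[T]):
    """Stand-in for Ordeq's generic IO base class."""


class CustomIO(IO[str]):
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    def load(self) -> str:
        return self.path.read_text()

    def save(self, data: str) -> None:
        self.path.write_text(data)


# Usage: a round trip of a string through the typed IO.
with tempfile.TemporaryDirectory() as tmp:
    dataset = CustomIO(Path(tmp) / "example.txt")
    dataset.save("typed")
    loaded = dataset.load()
```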
- IO[str] indicates that the IO operates on type str
- The load method returns a str
- The save method takes a str as first argument
Ordeq will check that the signatures of the load and save methods match the specified type.
For instance, the following implementation would raise a type error:
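A sketch of such a mismatching implementation. The generic IO class below is a stand-in and performs no check of its own, so nothing is raised here; the last two lines only make the conflicting annotations visible with typing.get_type_hints.

```python
from pathlib import Path
from typing import Generic, TypeVar, get_type_hints

T = TypeVar("T")


class IO(Generic[T]):
    """Stand-in for Ordeq's generic IO base class; it does no type checking."""


class CustomIO(IO[str]):
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    def load(self) -> bool:  # mismatch: IO[str] calls for a str return type
        return bool(self.path.read_text())

    def save(self, data: bool) -> None:  # mismatch: IO[str] calls for str data
        self.path.write_text(str(data))


# Both annotations differ from the str declared in IO[str].
load_returns = get_type_hints(CustomIO.load)["return"]
save_takes = get_type_hints(CustomIO.save)["data"]
```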
- IO[str] indicates that the IO operates on type str
- This raises a type error: load should return str
- This raises a type error: save should take str
Ordeq also supports static load and save methods.
In this case the self argument is omitted.
Using dataclass for IO classes
To simplify the definition of IO classes, you can use the dataclass decorator from the dataclasses library.
This allows us to define the attributes of the class in a more concise way.
Let's reconsider our running example using @dataclass:
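A sketch of the dataclass variant, with a stand-in IO base class so it runs without Ordeq installed. The decorator options are assumptions made for this sketch: eq=False stops the dataclass from generating its own __eq__, and frozen=True makes instances immutable.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Generic, TypeVar

T = TypeVar("T")


class IO(Generic[T]):
    """Stand-in for Ordeq's generic IO base class."""


@dataclass(frozen=True, eq=False)
class CustomIO(IO[str]):
    # The dataclass generates __init__(self, path) from this field.
    path: Path

    def load(self) -> str:
        return self.path.read_text()

    def save(self, data: str) -> None:
        self.path.write_text(data)


# Usage: the generated __init__ accepts the field as a keyword argument.
with tempfile.TemporaryDirectory() as tmp:
    dataset = CustomIO(path=Path(tmp) / "example.txt")
    dataset.save("hello")
    loaded = dataset.load()
```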
Using @dataclass to define IO classes is optional and purely for convenience.
The load and save methods can be implemented as usual.
Please refer to the dataclasses documentation for more information.
IO classes should not implement __eq__ or __hash__
All IOs inherit the __eq__ and __hash__ methods from the IO base class.
IO classes should not override these methods, and doing so issues a warning.
Stay close to the underlying API
In most cases, the IO class will be a thin adapter around an existing API or library. When creating a new IO class, try to stay close to the underlying API to make it easier for users to understand and use your IO class:
- try to use the same parameter names and types as the underlying API.
- create one IO class per API or data format.
Here is an example that is not recommended:
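A sketch of what such a discouraged class could look like. The class name PandasFile is hypothetical, the IO class is a stand-in, and pandas is imported lazily so the class can be defined even where pandas is not installed.

```python
from pathlib import Path


class IO:
    """Stand-in for Ordeq's IO base class, so the sketch runs without Ordeq."""


class PandasFile(IO):  # hypothetical name, for illustration only
    def __init__(self, path: Path, is_excel: bool = False):
        self.path = path
        self.is_excel = is_excel
        super().__init__()

    def load(self):
        import pandas as pd  # lazy import; pandas is needed to call load

        # The flag silently switches between two different pandas readers.
        if self.is_excel:
            return pd.read_excel(self.path)
        return pd.read_csv(self.path)


# The file format is only known after inspecting the flag.
ambiguous = PandasFile(Path("data.xlsx"), is_excel=True)
```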
- This is not recommended because the is_excel parameter makes it unclear what the load method will do.
Instead, create two separate IO classes: PandasCSV and PandasExcel.
This makes it clearer what each class does and avoids confusion about the parameters.
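A sketch of the two separate classes. It is an illustration only, not the implementation shipped in any Ordeq package; the IO class is a stand-in and pandas is imported lazily so the classes can be defined even where pandas is not installed.

```python
from pathlib import Path


class IO:
    """Stand-in for Ordeq's IO base class, so the sketch runs without Ordeq."""


class PandasCSV(IO):
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    def load(self, **load_options):
        import pandas as pd  # lazy import; pandas is needed to call load

        # Keyword arguments keep the names used by pandas.read_csv.
        return pd.read_csv(self.path, **load_options)

    def save(self, data, **save_options) -> None:
        data.to_csv(self.path, **save_options)


class PandasExcel(IO):
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    def load(self, **load_options):
        import pandas as pd  # lazy import; pandas is needed to call load

        # Keyword arguments keep the names used by pandas.read_excel.
        return pd.read_excel(self.path, **load_options)

    def save(self, data, **save_options) -> None:
        data.to_excel(self.path, **save_options)


# The class name alone now states which format is read and written.
csv_io = PandasCSV(Path("data.csv"))
excel_io = PandasExcel(Path("data.xlsx"))
```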
Read-only and write-only classes
While most data needs to be both loaded and saved, this is not always the case. If one of these operations is not needed in our code, we can choose not to implement it.
Practical examples are:
- Read-only: when loading machine learning models from a third party registry where we have only read permissions (e.g. HuggingFace).
- Write-only: when a Matplotlib plot is rendered to a PNG file, we cannot load the Figure back from the PNG data.
Creating a read-only class using Input
For a practical example of a read-only class, we will consider generating synthetic sensor data.
The SensorDataGenerator class will extend the Input class, meaning it will only have to implement the load method.
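A sketch of such a generator, with a stand-in Input base class so it runs without Ordeq installed. The field names, value ranges, and the n_readings and seed parameters are assumptions made for illustration.

```python
import random
from typing import Generic, TypeVar

T = TypeVar("T")


class Input(Generic[T]):
    """Stand-in for Ordeq's Input base class, so the sketch runs without Ordeq."""


class SensorDataGenerator(Input[list[dict[str, float]]]):
    def __init__(self, n_readings: int = 5, seed: int = 0):
        self.n_readings = n_readings
        self.seed = seed
        super().__init__()

    def load(self) -> list[dict[str, float]]:
        # A seeded generator makes the synthetic readings reproducible.
        rng = random.Random(self.seed)
        return [
            {
                "temperature": rng.uniform(15.0, 30.0),
                "humidity": rng.uniform(30.0, 70.0),
            }
            for _ in range(self.n_readings)
        ]


# Only load is implemented; there is no save method on this class.
readings = SensorDataGenerator(n_readings=3, seed=42).load()
```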
- Input indicates to Ordeq that this class is read-only.
Saving data using this dataset would raise an ordeq.IOException explaining that the save method is not implemented.
Similarly, you can inherit from the Output class for IOs that only need to implement the save method.
The ordeq-matplotlib package contains an example of this in MatplotlibFigure.
Advanced typing
Overloading load and save
Sometimes it is useful to provide multiple signatures for the load and save methods.
For example, we might want to allow loading data as either a string or bytes.
We can achieve this using the @overload decorator from Python's built-in typing module.
More on method overloading
For more information on function overloading in Python, refer to the documentation.
Here is a simplified snippet from the Gzip IO in ordeq-files:
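A sketch in the spirit of that snippet; it is not copied from ordeq-files. The IO class is a stand-in so the code runs without Ordeq installed, and the mode parameter with the values "rb" and "rt" is an assumption made for illustration.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Generic, Literal, TypeVar, Union, overload

T = TypeVar("T")


class IO(Generic[T]):
    """Stand-in for Ordeq's generic IO base class."""


class Gzip(IO[Union[bytes, str]]):
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    @overload
    def load(self, mode: Literal["rb"] = "rb") -> bytes: ...
    @overload
    def load(self, mode: Literal["rt"]) -> str: ...
    def load(self, mode: str = "rb") -> Union[bytes, str]:
        # gzip.open yields a binary file for "rb" and a text file for "rt".
        with gzip.open(self.path, mode) as fh:
            return fh.read()

    def save(self, data: Union[bytes, str]) -> None:
        # Pick text or binary mode based on the type of the data.
        mode = "wt" if isinstance(data, str) else "wb"
        with gzip.open(self.path, mode) as fh:
            fh.write(data)


with tempfile.TemporaryDirectory() as tmp:
    gz = Gzip(Path(tmp) / "example.gz")
    gz.save("hello")
    as_bytes = gz.load()  # type checkers infer bytes
    as_text = gz.load(mode="rt")  # type checkers infer str
```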
- The Gzip IO can load and save both str and bytes.
- The first @overload defines the signature for loading bytes.
- The second @overload defines the signature for loading str.
The @overload decorator indicates to type checkers that the load method can return different types.
Furthermore, type checkers can infer the return type from the arguments provided at the call site.
Mixed type IO
When inheriting from IO, the load and the save method are expected to operate on the same type.
In some cases, you may want to create an IO class that loads a different type than it saves.
These IOs are called mixed type IOs.
To create a mixed type IO, simply inherit from Input and Output instead of IO.
Here's a snippet of a mixed type IO in the ordeq-chromadb package:
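The ordeq-chromadb code is not reproduced here. As a stand-in illustration of the same inheritance pattern, the sketch below uses a hypothetical, standard-library-only LineFile class that saves a single str but loads a list[str]. The Input and Output classes are stand-ins so the code runs without Ordeq installed.

```python
import tempfile
from pathlib import Path
from typing import Generic, TypeVar

T = TypeVar("T")


class Input(Generic[T]):
    """Stand-in for Ordeq's Input base class."""


class Output(Generic[T]):
    """Stand-in for Ordeq's Output base class."""


class LineFile(Input[list[str]], Output[str]):  # hypothetical example class
    def __init__(self, path: Path):
        self.path = path
        super().__init__()

    def load(self) -> list[str]:
        # Loading yields one list entry per line of the file.
        return self.path.read_text().splitlines()

    def save(self, data: str) -> None:
        # Saving takes the whole text as a single string.
        self.path.write_text(data)


with tempfile.TemporaryDirectory() as tmp:
    lines_io = LineFile(Path(tmp) / "example.txt")
    lines_io.save("first\nsecond")
    lines = lines_io.load()
```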
- The ChromaDBCollection IO loads chromadb.Collection and saves dict[str, Any].
- The load method returns a chromadb.Collection.
- The save method takes a dict[str, Any] as first argument.
A common reason to use mixed type IOs is when the IO leverages a library that has different types for reading and writing data.
Mixed type IOs should be used sparingly
Mixed type IOs can make code harder to understand and maintain. Use them only when there is a clear need for different load and save types.