
Managing multi-dimensional datasets with metadata in Python

What many machine learning tutorial datasets lack is metadata. Example datasets are, with good reason, kept as simple as possible and only capture the core of the problem. In the real world, however, a business goal very rarely comes as one neat pack of uniform data. You may have CT images obtained with different scanners, manufacturing data from different facilities, sales data covering different marketing campaigns… When working on a particular task, you want to be aware of such information, as disregarding it may lead to unexpected model behavior.

Detecting unintended bias is a crucial topic in itself. In this post, however, I want to investigate how to write clean and efficient code for handling metadata alongside the actual dataset. A few tricks and design choices make working with metadata a lot easier and can facilitate a more thorough model evaluation.

/posts/datascience/metadata-in-machinelearning/teaser.png

What this post is NOT about

The concepts described in this article do not apply to all machine learning problems. When working with tabular data, Pandas provides all necessary tools out of the box. The main challenge comes with handling multidimensional data (images, audio, etc.), which does not fit into a data frame, in combination with additional metadata, which does.

Also, this post mainly concerns small-ish datasets that do not run into memory limitations. For larger-scale applications, other considerations such as distributed computing and storage come into play that overshadow the issues addressed here.

Finally, we are not really looking at problems where both tabular and multidimensional data are used as input to a machine learning model. Here is an article on Medium highlighting that topic.

Multiple parallel arrays of data are difficult to maintain

Most tutorials and example scripts for machine learning solutions contain lines similar to the following, which result in a set of data arrays holding the X and y data.

# https://keras.io/api/datasets/fashion_mnist/
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
assert x_train.shape == (60000, 28, 28)
assert x_test.shape == (10000, 28, 28)
assert y_train.shape == (60000,) and y_test.shape == (10000,)

# https://scikit-learn.org/stable/datasets/real_world.html#the-olivetti-faces-dataset
import sklearn.datasets

data = sklearn.datasets.fetch_olivetti_faces()
x_data, y_data = data["images"], data["target"]
assert x_data.shape == (400, 64, 64) and y_data.shape == (400,)

Often, x_data is a multidimensional array with the input values and y_data is a one-dimensional vector of targets. These are the ready-to-use datasets that let you focus on the actual tutorial content rather than cumbersome logistics. You can simply use the arrays with model.fit(x_train, y_train).

Now, consider the following scenario: you are working on a machine learning model for a multidimensional dataset like Fashion-MNIST. Besides the actual data, you also have some information associated with each sample (i.e., each image). For the Fashion-MNIST example, this could be the release date or the designated season of each piece of clothing, or even the name of the designer. Sticking to the style seen in so many tutorials, you use separate arrays to store the corresponding metadata. To illustrate this, I am creating some random metadata for the Fashion-MNIST dataset:

# x_train/test, y_train/test loaded as before through Keras datasets.
import numpy as np

year_train = np.random.choice(np.arange(2015, 2023, dtype=int), len(y_train))
year_test = np.random.choice(np.arange(2015, 2023, dtype=int), len(y_test))

seasons = "Spring Summer Autumn Winter".split()
season_train = np.random.choice(seasons, len(y_train))
season_test = np.random.choice(seasons, len(y_test))

During your experiments, you decide to only include data after a certain year for model training and start off like this:

# First, one-hot encode targets.
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

x_train_after_2020 = x_train[year_train > 2020]
y_train_after_2020 = y_train[year_train > 2020]

# define and compile your Keras model
# ...

model.fit(x_train_after_2020, y_train_after_2020)

For model evaluation, the test data is filtered in the same way:

x_test_after_2020 = x_test[year_test > 2020]
y_test_after_2020 = y_test[year_test > 2020]

print(model.evaluate(x_test_after_2020, y_test_after_2020))

This works well with a single condition for subdividing the data. But the code gets out of hand quickly if you want to add more filters. Let’s say you want to test the model performance separately for each design season and write a loop:

data_filter = year_test > 2020
for season in seasons:
    season_filter = data_filter & (season_test == season)
    print(season, model.evaluate(x_test[season_filter], y_test[season_filter]))

In the last example, we at least defined each condition as a reusable boolean array. This reduces the amount of duplicated code, but it is still quite cumbersome and difficult to maintain. Wouldn’t it be nice if we could use some Pandas features for handling the data? Unfortunately, Pandas is not built to handle multidimensional arrays. So what other options do you have?
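To see what goes wrong, here is a minimal illustration of the limitation (the exact error message depends on your pandas version):

import pandas as pd

# Each DataFrame column must be one-dimensional, so a stack of images
# cannot simply become a column next to the metadata:
try:
    pd.DataFrame({"image": x_train, "year": year_train})
except ValueError as err:
    print(err)  # e.g. "Per-column arrays must each be 1-dimensional"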

Potential solutions for handling ML datasets with metadata

The following sections suggest different approaches to moving away from multiple arrays holding data and metadata to more maintainable and readable solutions.

1) Pandas data frame with metadata and sample indices

While multidimensional ML inputs do not fit into a Pandas data frame, metadata typically does. For each sample in x_data, you have a number of attributes (year and season in the example above). We can create a data frame with the metadata and then use its internal numerical index, which should map 1:1 to the samples in x_data and y_data.

import pandas as pd
data_train = pd.DataFrame({
    "year": year_train, "season": season_train,
    "target": np.argmax(y_train, axis=1)
})
data_test = pd.DataFrame({
    "year": year_test, "season": season_test,
    "target": np.argmax(y_test, axis=1)
})

assert np.array_equal(x_train, x_train[data_train.index])

data_after2020 = data_train.query("year > 2020")
model.fit(x_train[data_after2020.index], y_train[data_after2020.index])

With Pandas, the code becomes shorter, more readable and easier to maintain. The internal index is preserved in most Pandas operations, so the per-season evaluation from before could now be written with a groupby.

for season, df in data_test.query("year > 2020").groupby("season"):
    print(season, model.evaluate(x_test[df.index], y_test[df.index]))

However, with data_test, x_test and y_test, we still need to keep track of three different objects. With the next solution, once initialized, there is a single data frame to deal with.

2) Subclassing Pandas data frame to include input and target data

Although it is not the most straightforward solution, we can create a custom class based on a data frame that keeps track of ML inputs (x_data) and targets (y_data) while providing the ease of use of Pandas. In action, such an MLDataFrame could work as follows: one can use many of the features provided by Pandas to filter and group the data, and easily access the corresponding inputs and targets for the ML task.

winter_df = mldf.query("season == 'Winter'")
model.fit(winter_df.x_data, winter_df.y_data)

Let’s see how we get there. The Pandas documentation describes the necessary steps to subclass its data structures, which is surprisingly easy. Obviously, pd.DataFrame must be specified as the base class. In addition, if we want inherited methods to return the subclass rather than a plain data frame, a special _constructor property must be defined. Finally, any class attributes that should be retained by the constructor have to be declared in the _metadata attribute. Here is everything we need to properly subclass a data frame.

class MLDataFrame(pd.DataFrame):
    _metadata = ["_x_data", "_y_data"]

    @property
    def _constructor(self):
        return MLDataFrame

Now, we can define our own initializer. We give up a couple of positional arguments from the base initializer, and setting the index manually through the constructor is not allowed, as we rely on the built-in numeric index.

# class MLDataFrame(pd.DataFrame):
    def __init__(self, data, x_data=None, y_data=None, **kwargs):
        if kwargs.pop("index", None) is not None:
            raise ValueError("Cannot set a custom index on MLDataFrame.")
        super().__init__(data, **kwargs)

        self._x_data = x_data
        self._y_data = y_data

We store the x and y data in private class attributes. These will always hold the full dataset, regardless of any splitting or sub-setting of the MLDataFrame. Two properties, aptly named x_data and y_data, then provide the actual (potentially filtered) data arrays.

    @property
    def x_data(self):
        if self._x_data is None:
            raise ValueError("Cannot access unspecified x_data")
        return self._x_data[self.index]

    @property
    def y_data(self):
        if self._y_data is None:
            raise ValueError("Cannot access unspecified y_data")
        return self._y_data[self.index]

Some more adaptations and checks would be necessary to handle bad user input and unintended use (for example, set_index would need to be disallowed; a possible guard is sketched after the figure below). But as long as a user does not try anything too fancy, the bare minimum class definition above is enough. Creating an MLDataFrame is almost like creating a regular data frame. In this case, I am providing the metadata as a dictionary of vectors. Additionally, the ML inputs x_train and targets y_train are given to the data frame. Here is a visualization of what we have achieved:

/posts/datascience/metadata-in-machinelearning/MLDataFrame.png
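As a quick aside, such a guard could look like the following sketch, which simply continues the class body from above (it is not part of the minimal implementation shown in this post):

# class MLDataFrame(pd.DataFrame):
    def set_index(self, *args, **kwargs):
        # The data frame relies on its default numeric index to look up
        # samples in x_data/y_data, so re-indexing is explicitly disabled.
        raise NotImplementedError("set_index is not supported on MLDataFrame.")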

Sub-setting the data for training and evaluation becomes very clean.

mldf = MLDataFrame(
    {"year": year_train, "season": season_train,
     "target": np.argmax(y_train, axis=1)},
    x_data=x_train,
    y_data=y_train,
)

assert np.array_equal(mldf.query("year > 2020").x_data, x_train_after_2020)
assert np.array_equal(mldf.query("year > 2020").y_data, y_train_after_2020)

winter_df = mldf.query("season == 'Winter'")
model.fit(winter_df.x_data, winter_df.y_data)

The group-by evaluation loop is now even more concise:

mldf_test = MLDataFrame(
    {"year": year_test, "season": season_test,
     "target": np.argmax(y_test, axis=1)},
    x_data=x_test,
    y_data=y_test,
)

for season, season_df in mldf_test.query("year > 2020").groupby("season"):
    print(season, model.evaluate(season_df.x_data, season_df.y_data))

As a nice bonus, the MLDataFrame is compatible with scikit-learn’s most important function: train_test_split.

from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(mldf, test_size=0.3)
print(f"Training split: {train_df.x_data.shape} -> {train_df.y_data.shape}")
print(f"Validation split: {val_df.x_data.shape} -> {val_df.y_data.shape}")

#> Training split: (42000, 28, 28) -> (42000, 10)
#> Validation split: (18000, 28, 28) -> (18000, 10)

Note that after filtering an MLDataFrame, the underlying x and y data stay the same. In fact, x_data and y_data are passed to the constructor as references, so mldf._x_data is exactly the same object as x_train. However, due to NumPy’s advanced indexing, the mldf.x_data and mldf.y_data properties always return copies of the indexed array.

# _x_data references corresponding array
assert mldf._x_data is x_train
assert mldf._x_data is winter_df._x_data

# but: x_data returns a copy
assert mldf.x_data is not x_train
assert winter_df.x_data is not x_train  # obviously - not even the same shape

3) Sticking to the data arrays and other ML dataset solutions

If you are not actually dealing with a lot of metadata or you are not required to cycle through different subsets of the same data for training and testing, it is simplest to just stick to the original format with separate data arrays. There is no need to over-engineer the metadata issue, as long as it does not get out of hand.
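If you go down that route, even a little structure helps, for example keeping named boolean masks in a plain dictionary (a minimal sketch reusing the variables from the earlier examples; the mask names are purely illustrative):

# Named boolean masks keep simple filters readable without any custom classes.
masks = {
    "after_2020": year_test > 2020,
    "winter": season_test == "Winter",
}
subset = masks["after_2020"] & masks["winter"]
print(model.evaluate(x_test[subset], y_test[subset]))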

Alternatively, one may consider using existing dataset solutions that address other challenges in data handling. In PyTorch, for example, you can define a custom Dataset (consumed by a DataLoader) to map files on disk directly into a dataset object for training/testing and even do some more complicated stuff. TensorFlow’s tf.data.Dataset can be used to transform and augment datasets in a flexible and efficient manner. I would also have loved to use Xarray, but I found it not as intuitive and user-friendly for the metadata use case as I had hoped.
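As a rough sketch of the tf.data route (assuming the NumPy arrays from the Fashion-MNIST example above), metadata can travel alongside the inputs and targets and be used for filtering before training:

import tensorflow as tf

# Bundle inputs, targets and metadata into one dataset of dictionaries.
ds = tf.data.Dataset.from_tensor_slices(
    {"x": x_train, "y": y_train, "year": year_train}
)

# Filter on the metadata, then drop it again before feeding the model.
ds_after_2020 = (
    ds.filter(lambda sample: sample["year"] > 2020)
    .map(lambda sample: (sample["x"], sample["y"]))
    .batch(32)
)
# model.fit(ds_after_2020)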

Conclusion

Metadata plays an important role in many machine learning and data science projects. Typically, there is some nuance and context to consider when working on a particular dataset. Handling metadata in code can be tricky, and if not done properly, it can result in very convoluted code that becomes error-prone and difficult to maintain. In this post, I showed a couple of options for moving away from parallel arrays and handling metadata alongside ML datasets in a more flexible and cleaner fashion.

A custom MLDataFrame class can make filtering and sub-setting data a lot less verbose. But using a non-standard solution for this fairly specialized problem may not always be the best choice. The implementation described above only covers the simplest use cases and would need several adaptations to improve robustness. Also, as mentioned in the beginning, there are ML applications where other concerns outweigh the issues addressed with the MLDataFrame.

Nonetheless, I have found myself juggling multiple parallel data arrays many times, in situations where a tool like the MLDataFrame would have helped a great deal. You, too, may want to reconsider some of your data handling procedures, especially when there are different scenarios and groupings to deal with.


Below is the complete MLDataFrame implementation. You can find the complete example code from this post on GitHub.

"""Provides MLDataFrame class holding machine learning data with metadata."""
import pandas as pd

class MLDataFrame(pd.DataFrame):
    """Data frame holding multidimensional data with corresponding metadata.

    Args:
        data: any data input acceptable by pd.DataFrame
        x_data (np.ndarray): input data array, samples indexed along first dim.
        y_data (np.ndarray): target data array, indexed along first dimension.
        kwargs: additional keyword arguments accepted by pd.DataFrame.

    Attributes:
        x_data: machine learning inputs.
        y_data: machine learning targets.

    """
    _metadata = ["_x_data", "_y_data"]

    @property
    def _constructor(self):
        return MLDataFrame

    def __init__(self, data, x_data=None, y_data=None, **kwargs):
        if kwargs.pop("index", None) is not None:
            raise ValueError("Cannot set a custom index on MLDataFrame.")
        super().__init__(data, **kwargs)

        self._x_data = x_data
        self._y_data = y_data

    @property
    def x_data(self):
        """Machine learning inputs."""
        if self._x_data is None:
            raise ValueError("Cannot access unspecified x_data")
        return self._x_data[self.index]

    @property
    def y_data(self):
        """Machine Learning targets."""
        if self._y_data is None:
            raise ValueError("Cannot access unspecified y_data")
        return self._y_data[self.index]