Sunday, July 23, 2023

Dataframe API design

A dataframe is a convenient way to represent tabular data.

However, dataframe libraries are notoriously tedious to implement, because users expect a dataframe library to be feature-rich.

There have been multiple attempts to simplify the problem.

One notable approach is to implement only operations on rows. If you need to do something with columns, you first transpose the dataframe to convert the columns into rows. This simple trick reduced the size of the API (the number of methods) by roughly one third. Unfortunately, heterogeneous dataframes (composed of multiple different data types) are not the easiest to transpose. The author of the approach solved it with dynamic typing - each cell carried information about its own data type.

Can we reduce the size of the API further? I argue that the answer is yes. Each dataframe should have its own metaframe, which is nothing but a dataframe holding metadata about the dataframe's columns: column names, data types (as in information_schema.columns in SQL databases) and statistics such as the count of missing values, the count of unique values, the average, the standard deviation and so on, which can be used for queries or for query optimization. And this metadata should be manipulable and editable with exactly the same API as the dataframe itself.

Hypothetical examples in Python syntax:
print(df)  # some dataframe instance
print(df.mf)  # the dataframe's metaframe
print(df.some_column_name)  # prints a column named "some_column_name"
 
# Rename a column.
# Commonly we would have a dedicated function or a method for this like:
#    column_name(df)[3] = "new_name"
# or
#    df.rename(columns={"old_name": "new_name"})
# but we reuse ordinary syntax for dataframe manipulation:
df.mf.column_name[3] = "new_name"


# Select columns without any missing value.
# Commonly we would have a dedicated function or a method for this like:
#    df[:, ~any(isnull(df))]
# or
#    df[:, ~df.isnull().any()]
# but we reuse ordinary syntax for dataframe manipulation by using missing_values column in the metaframe:
df[:, df.mf.missing_values == 0]

# Select columns whose names contain the substring "good".
df[:, regex(df.mf.column_name, ".*good.*")]

# Select integer typed columns.
df[:, df.mf.data_type=="integer"]

# Cast a column to string. Implementable with https://stackoverflow.com/questions/51885246/callback-on-variable-change-in-python
df.mf.data_type[3] = "string"
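
One way the cast could be triggered is sketched below, assuming a hypothetical MetaColumn wrapper (not part of any existing library) that fires a callback whenever one of its cells is assigned:
class MetaColumn:
    # Hypothetical metaframe column that notifies its owner when a cell changes.
    def __init__(self, values, on_change):
        self._values = list(values)
        self._on_change = on_change  # called as on_change(position, new_value)

    def __getitem__(self, position):
        return self._values[position]

    def __setitem__(self, position, new_value):
        self._values[position] = new_value
        self._on_change(position, new_value)  # e.g. cast the corresponding dataframe column

# The dataframe could wire it up roughly like this (cast_column is an assumed internal method):
#     data_types = MetaColumn(["integer", ...], on_change=lambda pos, dtype: df.cast_column(pos, dtype))
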
The metaframe should be open to the addition of new columns as needed. For example, do you want to calculate feature importance and then filter columns based on it? Sure enough, we can store the feature importances in an independent array and then filter based on that array:
feature_importance = get_feature_importance(df)
df[:, feature_importance>0]

But what if there are intermediate steps between the feature_importance calculation and the filtering, in which you reorder, add or delete columns? Suddenly, you have to keep the feature_importance array synchronized with the df. And that can be tedious and error-prone. But if you store feature_importance in the metaframe, the dataframe library will take care of keeping it synchronized (when you add a column, the corresponding feature_importance value will be null - no magic).
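
With a metaframe, the same workflow might look like this (assuming the metaframe accepts a new feature_importance column via plain attribute assignment):
df.mf.feature_importance = get_feature_importance(df)
# ... reorder, add or delete columns in between ...
df[:, df.mf.feature_importance > 0]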

However, if you wanted a bit of magic, the library might keep track of the operations performed on columns and maintain the lineage of each column. For example, the library might track which dataframes and columns were used in the computation of a column. This is useful in complex systems where it is sometimes difficult to trace the origin of some data.
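
Purely hypothetically (neither the lineage column nor the column names below exist in any current library), the lineage could be exposed as just another metaframe column:
df.total_price = df.unit_price * df.quantity  # a derived column
print(df.mf.lineage[df.mf.column_name == "total_price"])
# might print something like [("df", "unit_price"), ("df", "quantity")]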

Implementation

Because we want to treat a metaframe as if it were a dataframe, a metaframe has its own metadata, stored in yet another metaframe. Hence, we get a never-ending chain of frames. This can be implemented as a linked list.

Without loss of generality, let's assume that each dataframe has 2 mandatory metadata attributes:
    column_names,
    data_types.
Additionally, each dataframe can have an unlimited number of optional metadata attributes, e.g.:
    roles,
    interpretations,
    notes,
    ...

Then the chain of frames will eventually start to repeat with a frame that has just the 2 mandatory columns [column_names, data_types] and 2 rows whose column_names values are ["column_names", "data_types"].
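
Written out, this repeating frame looks as follows (assuming both metadata columns are stored as strings):
column_names      data_types
"column_names"    string
"data_types"      string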

We can treat it as a "sentinel node" known from linked lists, which marks the end of the list - once we encounter it, it just references itself. Note that a single global "sentinel" can be shared by all frame chains.

However, if we want to make the data structure easier to serialize, we had better avoid never-ending loops. To do that (and to reduce the space overhead), we might just use null in the places where we would otherwise reference the "sentinel" - it would then be up to the code (be it in the library or in user code) to handle nullable metaframes.
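
A minimal sketch of the resulting structure, under the assumption of a toy Frame class (plain Python lists, no existing library):

from typing import Optional

class Frame:
    # A toy dataframe: named columns plus an optional link to its metaframe.
    def __init__(self, columns: dict, mf: Optional["Frame"] = None):
        self.columns = columns
        self.mf = mf  # None takes the role of the "sentinel" and keeps serialization loop-free

# A dataframe with two columns and its null-terminated metaframe chain:
df = Frame(
    columns={"id": [1, 2, 3], "name": ["a", "b", "c"]},
    mf=Frame(
        columns={
            "column_names": ["id", "name"],
            "data_types": ["integer", "string"],
        },
        mf=None,  # alternatively, a shared self-referencing sentinel Frame could sit here
    ),
)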