Saturday, December 5, 2020

Pandas violates composability

The Python Pandas library violates composability. What is composability? It is the ability to take the output of one piece of code and use it as the input of another.

For example, SQL is composable - SQL takes relations and produces a relation. Lisp is composable - Lisp takes lists and produces a list. Matlab is composable - classical Matlab code takes arrays and produces an array.

But Pandas is not composable. A Pandas method that takes a DataFrame may return anything, be it:

  • Categorical
  • DataFrame
  • GroupBy
  • List
  • Mask (a NumPy vector)
  • Index
  • IntervalIndex
  • Scalar (be it a Python data type, NumPy data type or Pandas data type)
  • Series
  • Slice
  • TimedeltaIndex

And these are just the ones I found after one minute of looking at the Pandas API.
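To make the variety concrete, here is a minimal sketch that applies a handful of common operations to one small DataFrame and records the type of each result. The frame `df` and the labels in `results` are made up for illustration, and the exact type names may differ between Pandas versions.

```python
import numpy as np
import pandas as pd

# A tiny, made-up DataFrame used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Each entry starts from the same DataFrame but ends up as a different kind of object.
results = {
    "column subset": df[["a"]],          # DataFrame
    "single column": df["a"],            # Series
    "comparison": df["a"] > 1,           # boolean Series (a mask)
    "reduction": df["a"].sum(),          # NumPy scalar
    "grouping": df.groupby("a"),         # GroupBy object
    "row labels": df.index,              # Index
    "column names": df.columns.tolist(), # plain Python list
}

type_names = {label: type(value).__name__ for label, value in results.items()}

for label, name in type_names.items():
    print(f"{label:15} -> {name}")
```

Only the first entry can be fed straight back into code that expects a DataFrame; every other result needs some conversion first.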

Now you may argue: "What's the problem? Python uses duck typing." The issue is that many Pandas methods that accept a DataFrame refuse to work on anything but a DataFrame. And that's the better case. In the worse case, the method accepts the input but produces something other than what you would get if you passed the same data as a DataFrame.
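A small sketch of both cases, using a one-column DataFrame and a Series that hold the same values. The variable names are made up for illustration, and the behavior shown is what I would expect from recent Pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
one_col_frame = df[["a"]]  # DataFrame with a single column
series = df["a"]           # Series holding the same values

# Worse case: the same code runs on both, but means something different.
frame_items = list(one_col_frame)  # iterating a DataFrame yields column labels
series_items = list(series)        # iterating a Series yields the values

frame_max = one_col_frame.max()    # reducing a DataFrame gives a Series
series_max = series.max()          # reducing a Series gives a scalar

# Better case: the operation is simply missing, so the code fails loudly.
frame_can_merge = hasattr(one_col_frame, "merge")
series_can_merge = hasattr(series, "merge")

print(frame_items, series_items)
print(type(frame_max).__name__, type(series_max).__name__)
print(frame_can_merge, series_can_merge)
```

Duck typing only helps when the ducks agree on what each call means; here the two objects share method names but not semantics.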

To make it even worse, Pandas mixes the ways you do things. Sometimes you have to call a method of the instance. Sometimes you have to access an attribute of the instance. And sometimes you have to call a module-level function.

If Pandas used module-level functions everywhere, like NumPy does, it would be possible to hide the differences between the data structures. But you cannot hide a missing attribute or method of an instance.
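The sketch below illustrates the three access styles on a made-up DataFrame and then contrasts them with NumPy's function style, where one function accepts several kinds of input. It is only an illustration of the argument, not a proposal for how Pandas should be redesigned.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Three different access styles in Pandas:
sorted_df = df.sort_values("a")  # method of the instance
transposed = df.T                # attribute of the instance
stacked = pd.concat([df, df])    # module-level function

# NumPy's function style treats different input kinds uniformly.
inputs = [[1, 2, 3], (1, 2, 3), np.array([1, 2, 3]), 6]
sums = [int(np.sum(x)) for x in inputs]
shapes = [np.shape(x) for x in inputs]

# The attribute style cannot be retrofitted onto a plain list.
list_has_shape = hasattr([1, 2, 3], "shape")

print(sums)
print(shapes)
print(list_has_shape)
```

A function such as `np.shape` can decide how to handle a list, a tuple, an array or a scalar, whereas `x.shape` only works if the object itself happens to define that attribute.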

Composability is not a small thing. For example, GQL, a new language for graph databases, has composability as a governing principle. And people seem to gravitate toward composability, knowingly or not. If nothing else, datatable, an alternative to Pandas in Python, ditches the Series data structure because it is considered unnecessary.