Saturday, December 5, 2020

Pandas violates composability

The Python Pandas library violates composability. What is composability? It is the ability to take the output of one piece of code and use it as the input for another piece of code.

For example, SQL is composable - SQL takes relations and produces a relation. Lisp is composable - Lisp takes lists and produces a list. Matlab is composable - classical Matlab code takes arrays and produces an array.

But Pandas is not composable. A Pandas method that takes a DataFrame may return almost anything, be it:

  • Categorical
  • DataFrame
  • Group
  • List 
  • Mask (a Numpy vector)
  • Index
  • IntervalIndex
  • Scalar (be it a Python data type, Numpy data type or Pandas data type)
  • Series
  • Slice
  • TimeDeltaIndex

And these are just the ones I collected after one minute of browsing the Pandas API.
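
A minimal sketch of how quickly the return types diverge - a handful of everyday operations on a single DataFrame already covers half of the list above:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "x", "y"]})

print(type(df[["a"]]))        # DataFrame
print(type(df["a"]))          # Series
print(type(df["a"].sum()))    # Numpy scalar (numpy.int64)
print(type(df["b"] == "x"))   # Series of booleans (a mask)
print(type(df.columns))       # Index
print(type(df.groupby("b")))  # DataFrameGroupBy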

Now you may argue: "What's the problem? Python uses duck typing". The issue is that many Pandas methods that accept a DataFrame refuse to work on anything but a DataFrame. And that's the better case. In the worse case, the method accepts the input but produces something different from what you would get if you had passed the data as a DataFrame.

To make it even worse, Pandas mixes the ways you do things. Sometimes you have to call a method on the instance. Sometimes you have to access an attribute of the instance. And sometimes you have to use a class method or a module-level function (pd.concat, pd.to_datetime,...).

If Pandas used module-level functions everywhere, the way Numpy does, it would be possible to hide the differences between the different data structures behind a uniform interface. But there is no way to paper over a missing attribute or method of an instance.
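
A short sketch of the difference: a module-level function can dispatch on whatever it receives, while an instance either has the method or it doesn't:

import numpy as np
import pandas as pd

# A module-level function can accept many input types:
print(np.sum([1, 2, 3]))             # works on a plain list
print(np.sum(np.array([1, 2, 3])))   # works on an ndarray
print(np.sum(pd.Series([1, 2, 3])))  # works on a Series

# An instance method either exists or it doesn't:
try:
    [1, 2, 3].sum()
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'sum'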

Composability is not a small thing. For example, GQL, a new language for graph databases, has composability as its governing principle. And people seem to, knowingly or not, gravitate toward composability. If nothing else, datatable, an alternative to Pandas in Python, ditches the Series data structure, as it is considered unnecessary.

Wednesday, November 18, 2020

What is wrong with Wikipedia

I like Wikipedia. But I am worried about the future of Wikipedia. Why? Because it keeps growing without limit.

When Wikipedia was founded, it got one thing right: the delivery of information has to snowball. You begin with the minimal quantum of information that is self-sustainable. And then you keep iteratively expanding it. A good Wikipedia article starts with a self-sustainable sentence, which can exist alone and provides a basic description of the keyword. Then there is the rest of the first paragraph, which slightly expands the first sentence. And as a whole, the first paragraph is self-sustainable. Then there are the remaining paragraphs that make up the header. They slightly expand the description of the first paragraph. And together with the first paragraph, they are self-sustainable. And finally, there is the body, which provides the rest of the information and which is, by the fact that it completes the whole article, also self-sustainable.

A good metaphor for this concept is looking at a picture. With the snowball approach, you first look at the picture from a distance. And you recognize a house. Then you move closer and recognize the front door and the windows. You move even closer and recognize individual parts of the door. And finally, when you move the closest, you see details like the cracks in the door panels.

In contrast, with the pinhole approach you scan the picture pixel by pixel. And you have to reconstruct the whole picture in your mind.

While the pinhole approach is perfectly fine for computers, humans generally prefer the snowball approach. If nothing else, it allows them to skip irrelevant information like the details of the clouds, because from the previous step they already know that a given patch of pixels is clouds.

For a long time, Wikipedia followed the snowball approach. But now it keeps shifting to the pinhole approach, as the bodies of the articles keep growing without any limit.

For me, the current transition from the header to the body is frequently too abrupt. I cope with that by switching from the English version of an article to some non-English version, where the articles are smaller. Most of the time, the smaller version provides all the information that I need. But even when it does not answer everything, I at least know which information I seek. And I can then quickly jump to the relevant parts of the English version of the article.

But how can the situation be systematically rectified? There are multiple options:
    1) Identify an optimal size of the articles and start truncating the overgrown articles.
    2) Allow fast time travel to the revision when the article was closest to the optimal size.
    3) An expandable TOC. Currently, the TOC often doesn't even fit on one screen. Could we hide the lowest-level headers by default and expand them on click?
    4) Inform Wikipedists that the article is too long and that they should abstain from making it longer. If anything, they should make it shorter.
    5) Introduce another layer of granularity.

The first approach is politically unacceptable. Wikipedists like to expand articles, not to reduce them. The second approach is meaningful for topics that do not evolve rapidly, like interpretations of ancient events, but fails miserably for contemporary topics. The third approach is a nice technical solution. The fourth approach might help a bit. But the last option is the only real solution that might have a chance to succeed.


Saturday, April 11, 2020

My requirements for a data-scientist programming language

Unified data types and data structures

Reasoning: As a data scientist, you may want to be able to quickly apply different libraries to your data. But that can work only if they all use the same data representation.

Example: R doesn't have a canonical implementation of sparse matrices. Hence, each library uses its own implementation. And if you want to process your sparse data with two different libraries, you frequently have to perform the format conversion through a dense matrix. That's a no-go for any non-toy matrix. Matlab got it right in the case of sparse matrices: there is only a single format.
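
A rough back-of-the-envelope sketch in Python with SciPy (which, unlike R, ships a canonical set of sparse formats) of why converting through a dense matrix is a no-go:

import numpy as np
from scipy import sparse

# A million-by-million matrix with ~10 nonzeros per row fits in memory...
m = sparse.random(10**6, 10**6, density=10 / 10**6, format="csr")
megabytes = (m.data.nbytes + m.indices.nbytes + m.indptr.nbytes) / 1e6
print(f"CSR storage: ~{megabytes:.0f} MB")

# ...but its dense equivalent would need 10**12 * 8 bytes = 8 TB:
print(f"dense storage: {m.shape[0] * m.shape[1] * 8 / 1e12:.0f} TB")

# With a shared format, conversion never materializes the dense matrix:
m_csc = m.tocsc()  # direct CSR -> CSC conversion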

True pass-by-value

Many languages use pass-by-value. But the approaches to making it computationally feasible differ. Python and Java frequently pass only a reference by value. R and Matlab, on the other hand, use copy-on-write (CoW), which delivers the behaviour that I call true pass-by-value.

Reasoning: Data in operating systems and databases are true pass-by-value. Hence, the expectation is set. R and Matlab got it right.
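
A minimal Python sketch of the difference; the function names are made up for illustration:

import copy

def double(column):
    # Python passes the reference by value, so this mutates the caller's data:
    for i in range(len(column)):
        column[i] *= 2

data = [1, 2, 3]
double(data)
print(data)  # [2, 4, 6] -- the caller's data changed

def double_safe(column):
    column = copy.deepcopy(column)  # what CoW languages do lazily for you
    for i in range(len(column)):
        column[i] *= 2
    return column

data = [1, 2, 3]
print(double_safe(data))  # [2, 4, 6]
print(data)               # [1, 2, 3] -- true pass-by-value behaviour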

Working autocomplete

It is nice when autocomplete works on table names, column names, file paths, function names, function argument names... It decreases typo rate, speeds up typing and provides real-time validation - if the autocomplete found the file/table/column/function/whatever, it exists.  

Example: Thanks to clause ordering in SQL - SELECT names the columns before FROM names the tables - table and column name autocomplete doesn't work very well: when you type the column names, the editor doesn't yet know which tables are in scope. LINQ got it right: first define the tables and only then the columns.

Working documentation

As a data scientist, you may have to work with many different tools. And a good manual can make all the difference when you are learning something new.

Example: Python uses multiple formats for docstrings (reStructuredText, Google style, Numpy style), and that can cause errors in documentation rendering when the wrong formatter is used. Java got it right: provide (and enforce) a single documentation format.
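
For illustration, the same function documented in two of the common Python docstring dialects - a renderer configured for one will garble the other:

def mean(values):
    """Compute the arithmetic mean.

    :param values: the numbers to average (reStructuredText/Sphinx style)
    :returns: the mean as a float
    """
    return sum(values) / len(values)

def mean_numpy_style(values):
    """Compute the arithmetic mean.

    Parameters
    ----------
    values : list of float
        The numbers to average (Numpy style).

    Returns
    -------
    float
    """
    return sum(values) / len(values)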

Simple copy-paste of matrices

It is nice to be able to simply copy-paste matrices from the result of print(), from Excel, or from a publication directly into the interpreter or script code.

Example: Python (and Numpy, Pandas,...) requires commas between the values in a matrix. But when you print a Numpy matrix, it is printed without commas (for improved legibility). That means you can't simply copy-paste the printed matrix back into your code: you first have to add the missing commas. Matlab got it right: copy-paste from/to Excel works. And the parsing of tables copy-pasted from PDFs/web pages frequently works even better than in Excel.
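
A small demonstration of the round-trip failure, plus two partial workarounds:

import numpy as np

m = np.array([[1, 2], [3, 4]])

print(m)        # [[1 2]
                #  [3 4]]   -- no commas; pasting this back is a SyntaxError

print(repr(m))  # array([[1, 2],
                #        [3, 4]]) -- this form round-trips into code

# For tables copied from Excel or a web page, Pandas can parse the clipboard
# (assuming tab-separated clipboard content):
# import pandas as pd
# df = pd.read_clipboard()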

This requirement can be extended to all other data structures.

High-quality and interactive plotting

As a data scientist you may greatly benefit from visualizing the data.

Example: The built-in plots in R are static. Matlab got it right: you may interact with the plot - move overlapping labels a bit, or read the data value below the cursor.

Support for functions with many arguments (default values,...)

Argument parsing & validation can take a lot of code if the language doesn't handle it for you.

Example: Matlab doesn't support named arguments in its syntax. But many functions accept something like f('argument_name', 'argument_value'). Since the argument names are strings, argument name autocomplete doesn't work. And when you are passing many string arguments, it is difficult for a reader to recognize what is the argument name and what is the argument value - they are both strings! R got it right, just like almost any other functional language: named arguments with default values are part of the syntax.
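
For comparison, the Python take on what R offers; the function and its arguments are made up for illustration:

# Named arguments with default values are part of the syntax,
# so the name and the value are visually and syntactically distinct:
def draw_points(x, y, color="black", marker="o", label=None):
    print(x, y, color, marker, label)

draw_points([1, 2], [3, 4], color="red", label="sample")

# Compare with the Matlab-style string convention,
# where names and values look exactly the same:
# draw_points([1, 2], [3, 4], 'color', 'red', 'label', 'sample')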

True raw strings

As a data scientist, you may want to embed different languages (be it regex, SQL or HTML) in your language. And being able to simply copy-paste the foreign code without the need to escape/unescape makes life easier.

Example: In Java, you have to escape every backslash in a regex with another backslash. Groovy got it right with slashy strings: write the regex as /\d+/ and the backslash stays a backslash.
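
Python comes close with raw string literals, though they are not fully "true" raw strings - a raw string still can't end with a single backslash:

import re

# The r prefix keeps backslashes as-is, so a regex can be copy-pasted
# from a regex tester without doubling every backslash:
pattern = re.compile(r"\d+\.\d+")
print(pattern.findall("pi is roughly 3.14"))  # ['3.14']

# Without the r prefix, the same pattern must be written as "\\d+\\.\\d+".
# Caveat: r"...\" is a syntax error, so Python's raw strings are not "true".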

Python Pandas

Pandas is convenient but sometimes also a bit inconsistent. For example, None == None evaluates to True in pure Python and in Numpy. In Pandas, it evaluates to False:
import pandas as pd
import numpy as np

df = pd.DataFrame([None])
x = np.array([None])

print(None == None) # Python says True
print(x == x) # Numpy says True
print(df == df) # Pandas says False
While we should generally avoid equality comparison for detecting None in code (and should use is, or isnull(), instead), when we are comparing two variables coming from the outside, we may end up comparing None to None. And since we care about the result of the comparison (we do, otherwise why would we bother with the comparison in the first place?), we have to be careful about whether we are comparing the variables with the "Pandas logic" or with "the rest of the Python world" logic.
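
One way around it is DataFrame.equals(), which treats missing values at the same positions as equal - a small sketch:

import pandas as pd

a = pd.DataFrame([None])
b = pd.DataFrame([None])

print((a == b).iloc[0, 0])  # False: elementwise == follows NaN semantics
print(a.equals(b))          # True: aligned missing values count as equal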

Tuesday, February 18, 2020

Civilization 6

One of the most criticized features of Civ 5 & 6 is the "one unit per tile" (1UPT) limitation - in the older versions of Civ, it was possible to stack an unlimited number of units on a single tile.

The benefit of the change is more opportunities for tactics. For example, it is possible to replay the Battle of Marathon in the Jadwiga's Legacy scenario - a single well-placed pikeman can indefinitely block a mountain passage against a horde of knights, because only a single knight can attack the pikeman per turn. Or you may place a city between two large lakes, like the real-world Madison, Wisconsin, making it difficult to "flank" the city and put it under siege. And of course, 1UPT fixes the infamous Stack of Doom issue.

On the other hand, it makes logistics difficult. It happened to me multiple times that I ordered a unit to move forward, only to watch in horror as it moved backward, because all the "resting places" in the forward direction were already occupied by my units. A simple mitigation of the described nuisance is to introduce a keyboard shortcut particularly popular in Microsoft Word: Ctrl+Z. The implementation can take inspiration from The Battle for Wesnoth turn-based strategy, which gives the user the opportunity to take back a misclick. But to prevent players from abusing the feature, it is not possible to take back a move that uncovers a part of the map or lifts the fog-of-war. Wesnoth also prevents you from taking back a move that results in combat. Since Civ 6 still uses randomness in the damage calculation, this limitation would have to be ported as well. The long loading times caused by a silly misclick would mostly become a thing of the distant past...

Another irritating thing is that sometimes a victory (or a loss) is inevitable, but you still have to get through many turns to reach it. Some people enjoy this part of the game. But I find it boring. It would be nice to have something like the "auto combat" or "quick battle" from Heroes of Might & Magic III that would allow you to watch the march to victory from the resting position of your seat, or to just skip to the victory screen because you really need some sleep. Didn't you get the outcome you expected? No worries, just press the sweet Ctrl+Z combo and you can show the computer how it should have been done.

The biggest advantage of having the "fast end" option is that as a developer you do not have to stress so much about fine-tuning the winning conditions. Does the game slip too frequently into a war of attrition? Just add the "fast end" option. Problem mitigated. Release the game. And leave the re-balancing to the expansion pack.

And finally, the game logic for city states and civilizations should be the same thing. City states are great. But I want to be able to:
  1. Befriend not only city states but also civilizations with amenities.
  2. Get unique bonuses not only from being suzerain of city states, but also from being a friend to civilizations.  
And reversely, civilizations are great, but I want to be able to:
  1. Select not only civilizations that appear in the game, but also city states.
All these issues can be nicely resolved by adding a switch to the "create new game" GUI next to each civilization, which would "cripple" the selected civilization into a city state. This switch would:
  1. Remove the leader and her/his traits. But the civilization traits would stay there.
  2. Limit the count of cities to one.
  3. Adjust things like loyalty bonus,... to actually make the city states work.