Saturday, April 11, 2020

My requirements for a data-scientist programming language

Unified data types and data structures

Reasoning: As a data scientist, you may want to be able to quickly apply different libraries to your data. But that can work only if they all use the same data representation.

Example: R doesn't have a canonical implementation of sparse matrices. Hence, each library uses its own implementation. And if you want to process your sparse data with two different libraries, you frequently have to perform the format conversion through a dense matrix. That's a no-go for any non-toy matrix. Matlab got it right, at least for sparse matrices: there is only a single sparse matrix format.
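To put a number on why conversion through a dense matrix is a no-go, here is a rough Python/SciPy sketch (purely an illustration with made-up sizes, not the R workflow itself):

from scipy import sparse

n = 100_000
m = sparse.random(n, n, density=1e-5, format='coo')  # ~100k non-zeros

csr = m.tocsr()        # direct sparse-to-sparse conversion: a few MB of memory
# dense = m.toarray()  # a dense 100,000 x 100,000 intermediate would need ~80 GB of RAM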

True pass-by-value

Many languages use pass-by-value. But the approaches to make it computationally feasible differ. Python and Java frequently pass only a reference by value, while R and Matlab use copy-on-write (CoW), which delivers the behaviour I call true pass-by-value.

Reasoning: Data handed over to operating systems and databases behave as true pass-by-value. Hence, the expectation is set. R and Matlab got it right.
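A minimal Python sketch of the difference (the function is made up): the callee mutates its argument and the caller's data changes with it, which is exactly what copy-on-write would prevent.

import numpy as np

def zero_first(x):
    x[0] = 0  # mutates the argument in place

a = np.array([1, 2, 3])
zero_first(a)
print(a)  # [0 2 3] -- the caller's array changed; under CoW it would still be [1 2 3]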

Working autocomplete

It is nice when autocomplete works on table names, column names, file paths, function names, function argument names... It decreases the typo rate, speeds up typing and provides real-time validation: if the autocomplete finds the file/table/column/function/whatever, it exists.

Example: Because of the clause ordering in SQL, table and column name autocomplete doesn't work very well. LINQ got it right: first define the tables and only then the columns.
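A rough Python/pandas analogue of the "define the table first" ordering (an illustration only, not LINQ): because the table object already exists when you type the dot, a completion engine can offer its column names.

import pandas as pd

orders = pd.DataFrame({"price": [10.0, 20.0], "qty": [1, 3]})

# The table comes before the columns, so completing orders.<TAB>
# can offer price and qty; SQL's SELECT-first ordering cannot do this.
revenue = (orders.price * orders.qty).sum()
print(revenue)  # 70.0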

Working documentation

As a data scientist, you may have to work with many different tools. And a good manual can make all the difference when you are learning something new.

Example: Python uses multiple formats for documentation, which can cause rendering errors when the wrong formatter is used. Java got it right: provide (and enforce) a single documentation format.
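For instance, these two docstrings document the same (made-up) function in two common Python conventions, reST/Sphinx style and NumPy style; a renderer configured for the wrong one produces garbled output.

def scale(x, factor):
    """Scale x by factor.

    :param x: input value
    :param factor: multiplier
    :return: x * factor
    """
    return x * factor

def scale2(x, factor):
    """Scale x by factor.

    Parameters
    ----------
    x : float
        Input value.
    factor : float
        Multiplier.

    Returns
    -------
    float
        x * factor.
    """
    return x * factor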

Simple copy-paste of matrices

It is nice to be able to simply copy-paste matrices from the output of print(), from Excel or from a publication directly into the interpreter or script code.

Example: Python (and Numpy, Pandas,...) requires commas between the values in a matrix. But when you print a Numpy matrix, it is printed without commas (for improved legibility). That means you can't simply copy-paste the printed matrix back into your code: you first have to add the missing commas. Matlab got it right: copy-paste from/to Excel works. And parsing of tables copy-pasted from PDFs or web pages frequently works even better than in Excel.
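A small demonstration of the failed round trip in Python:

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a)
# [[1 2]
#  [3 4]]
# The printed form is not valid Python: np.array([[1 2], [3 4]]) is a SyntaxError.
# repr() at least round-trips, because it keeps the commas:
print(repr(a))
# array([[1, 2],
#        [3, 4]])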

This requirement can be extended to all other data structures.

High-quality and interactive plotting

As a data scientist, you may greatly benefit from visualizing the data.

Example: The built-in plots in R are static. Matlab got it right: you may interact with the plot, move overlapping labels a bit, or read the data value below the cursor.

Support for functions with many arguments (default values,...)

Argument parsing and validation can take a lot of code if the language doesn't handle it for you.

Example: Matlab doesn't support named arguments in its syntax. But many functions accept something like f('argument_name', 'argument_value'). Since the argument names are strings, argument name autocomplete doesn't work. And when you pass many string arguments, it is difficult for a reader to tell which string is the argument name and which is the argument value - they are both strings! R got it right, just like almost any other functional language.
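Python sits on the R side here: named arguments with default values are part of the syntax, so no 'name', 'value' string pairs and no manual parsing are needed (the function below is made up for illustration).

def train(data, learning_rate=0.01, epochs=10, verbose=False):
    # The signature documents the arguments, the interpreter validates the names,
    # and an IDE can autocomplete them.
    return {"learning_rate": learning_rate, "epochs": epochs, "verbose": verbose}

print(train([1, 2, 3], epochs=50, verbose=True))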

True raw strings

As a data scientist, you may want to embed different languages (be it regex, SQL or HTML) in your language. And being able to simply copy-paste the foreign code without the need to escape/unescape it makes life easier.

Example: In Java, you have to escape each backslash in a regex with another backslash. Groovy got it right: just use slashy strings, /.../, in which backslashes don't need to be escaped.
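Python's raw strings solve the same problem for regexes, as a small sketch shows (they are not perfectly raw, e.g. they cannot end with a single backslash, but they cover the common cases):

import re

escaped = "\\d+\\.\\d+"   # without a raw string, every backslash must be doubled
raw = r"\d+\.\d+"         # with a raw string, the regex is pasted as-is

print(escaped == raw)                 # True -- both spell the same pattern
print(re.findall(raw, "pi is 3.14"))  # ['3.14']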

Python Pandas

Pandas is convenient but sometimes also a bit inconsistent. For example, None == None evaluates to True in pure Python and in Numpy, but in Pandas it evaluates to False:
import pandas as pd
import numpy as np

df = pd.DataFrame([None])
x = np.array([None])

print(None == None)  # Python says True
print(x == x)        # Numpy says True
print(df == df)      # Pandas says False
While we should generally avoid equality comparisons for detecting None (and use is, or isnull(), instead), when we compare two variables coming from the outside, we may well end up comparing None to None. And if we care about the result of the comparison (we do - otherwise, why would we bother with the comparison in the first place?), we have to be careful about whether we are comparing the variables with "Pandas logic" or with "the rest of the Python world" logic.
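When the goal is to compare data that may contain missing values, a safer route is Pandas' explicit missing-value API; a minimal sketch:

import pandas as pd

df = pd.DataFrame([None])

print(df.isnull())    # explicit missing-value test instead of ==
print(df.equals(df))  # True -- equals() treats missing values in the same positions as equal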