Datafilos: August 2017

Wednesday, 30 August 2017

Not invented here syndrome

Whenever I find that my code can be replaced with a system call, I go a long way to replace my code with that call, because I generally trust the system libraries more than my own code. When it comes to third-party libraries, I am conservative. Sometimes they are outright buggy; sometimes they are correct but get deprecated and removed. And sometimes the whole system gets so complex, due to the sheer amount of third-party code, that no one wants to deal with the mess anymore and the project dies off. Hence, I like to try new third-party libraries and study them, but I commit to using them extensively only after thorough consideration of the alternatives and testing.

Friday, 11 August 2017

Proper handling of nulls

I am familiar with two approaches to dealing with nulls in programming languages: either avoid them completely or embrace them. Logic programming languages generally avoid nulls (a great defense of that choice is given in Design and Implementation of the LogicBlox System). But languages that permit nulls should provide the following facilities:

1) meta information about why the value is missing
2) nullable & non-nullable variables

For example, SAS got it right 40 years ago. In SAS, a missing value is represented with a dot. That by itself is not great whenever you need to print out the code or the data, because you never know whether the dot really represents a missing value or is just an imperfection of the paper or the printer. But it makes it easy to record the reason why the value is missing (the names below are illustrative; SAS itself labels special missing values with a single letter or an underscore, such as .A or .Z):
   .                 // Generic missing value
   .refusedToAnswer  // Missing value with a metadatum
   .didntKnow        // Missing value with a different metadatum

Hence, generic algorithms can treat all missing values the same way. But if you want to treat them differently, for example because refusedToAnswer can have a vastly different meaning in a questionnaire than didntKnow, you can do it.
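
The same idea can be sketched outside SAS. The Python below is only an illustration, not how SAS implements it: the `Missing` class, the `Answer` alias, and the two counting helpers are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Missing:
    """A missing value; `reason` stays None for the generic case."""
    reason: Optional[str] = None


# Hypothetical reasons mirroring the questionnaire example.
GENERIC = Missing()
REFUSED = Missing("refusedToAnswer")
DIDNT_KNOW = Missing("didntKnow")

Answer = Union[int, Missing]


def count_missing(answers: List[Answer]) -> int:
    """Generic code: every Missing counts the same, whatever the reason."""
    return sum(isinstance(a, Missing) for a in answers)


def count_by_reason(answers: List[Answer]) -> Dict[Optional[str], int]:
    """Reason-aware code: tally the missing values per reason."""
    counts: Dict[Optional[str], int] = {}
    for a in answers:
        if isinstance(a, Missing):
            counts[a.reason] = counts.get(a.reason, 0) + 1
    return counts


answers = [4, REFUSED, 2, DIDNT_KNOW, REFUSED, GENERIC]
print(count_missing(answers))    # 4
print(count_by_reason(answers))
```

A generic routine only needs the `isinstance` check, while an analysis that cares about the difference between a refusal and a "don't know" can look at `reason`.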

Furthermore, SAS provides optional non-null constraints on attributes, just like SQL. The only wart in SAS's implementation is that violations are raised only at runtime, not at compile time as, for example, in Kotlin. Note also that nullability must be configurable for all variables. In C#, for example, nullability is configurable only for value types, not for class types, and this omission is a source of many null-pointer exceptions.
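
To make the runtime versus compile-time distinction concrete, here is a minimal Python sketch. It is an assumption-laden illustration: `require_not_null` and `age_next_year` are hypothetical helpers, and the static side relies on an external type checker such as mypy, since Python itself does not enforce annotations.

```python
from typing import Optional


def require_not_null(value: Optional[int], name: str) -> int:
    """Runtime non-null constraint: fail as soon as a null shows up."""
    if value is None:
        raise ValueError(f"{name} must not be null")
    return value


def age_next_year(age: int) -> int:
    # Declared non-nullable: a static checker rejects a call that passes
    # an Optional[int] here without narrowing it first.
    return age + 1


maybe_age: Optional[int] = None  # declared nullable

try:
    age_next_year(require_not_null(maybe_age, "age"))
except ValueError as err:
    print(err)  # age must not be null
```

The runtime check only fires when the offending value actually arrives, whereas the annotation on `age_next_year` lets a checker flag the problem before the program runs.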

There is just one thing where I am not sure which approach is better. If we have a function sum, it can:

1) Accept nullable variables and use some default strategy to deal with nulls (as SQL does).
2) Accept nullable variables and blow up at runtime when a null is encountered, unless some strategy is defined (R is close to this approach: its sum returns NA, rather than raising an error, unless na.rm = TRUE is passed).
3) Have a dedicated sumnan function, which accepts nullable variables and takes a parameter that determines how nulls should be treated. Non-nullable variables get accepted by the plain sum function (something like this is used in MATLAB, minus the type control).

The first approach is convenient to use but potentially dangerous, because the user may never realize that a null leaked into the data and that the conclusions are wrong.

The second approach is safer, because the user learns that something is wrong as soon as a null is passed to the function without a strategy for dealing with nulls. The disadvantage is that it is a runtime check, not a static check. Nevertheless, programmers who are concerned about safety may use a linter to identify calls that pass nullable variables without defining what to do with nulls.

The third approach is easier for a compiler to validate than the second approach. But the naming convention can be difficult to enforce.
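
The three approaches above can be sketched side by side. This is a rough Python illustration, not the actual behavior of SQL, R, or MATLAB: all function names (`sum_default`, `sum_strict`, `sum_non_null`, `skip_nulls`, `nulls_as_zero`) are hypothetical, `sumnan` is borrowed from the text, and the type annotations only help if a static checker is run.

```python
from typing import Callable, Iterable, List, Optional

Strategy = Callable[[Iterable[Optional[float]]], List[float]]


def skip_nulls(values: Iterable[Optional[float]]) -> List[float]:
    """Strategy: drop the nulls."""
    return [v for v in values if v is not None]


def nulls_as_zero(values: Iterable[Optional[float]]) -> List[float]:
    """Strategy: replace each null with zero."""
    return [0.0 if v is None else v for v in values]


# Approach 1: silently apply a default strategy.
def sum_default(values: Iterable[Optional[float]]) -> float:
    return sum(skip_nulls(values))


# Approach 2: fail at runtime on a null unless a strategy is given.
def sum_strict(values: Iterable[Optional[float]],
               strategy: Optional[Strategy] = None) -> float:
    items = list(values)
    if strategy is not None:
        return sum(strategy(items))
    if any(v is None for v in items):
        raise ValueError("null encountered and no strategy given")
    return sum(v for v in items if v is not None)


# Approach 3: two functions, split by the nullability of the input.
def sum_non_null(values: Iterable[float]) -> float:
    return sum(values)


def sumnan(values: Iterable[Optional[float]], strategy: Strategy) -> float:
    return sum(strategy(values))


data = [1.0, None, 2.0]
print(sum_default(data))            # 3.0, the null goes unnoticed
print(sum_strict(data, skip_nulls)) # 3.0, the caller chose the strategy
print(sumnan(data, nulls_as_zero))  # 3.0, the strategy is mandatory
```

Calling `sum_strict(data)` without a strategy raises `ValueError`, which is the runtime signal the second approach relies on. In the third approach the strategy is a required parameter of `sumnan`, so forgetting it is an error even without looking at the data.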

Do you have any thoughts on this issue?