Sunday, December 15, 2019

How to kill processes without the necessary privileges

Windows has one strange property: shutdown is not an atomic operation. Hence, if you do not have the privilege to terminate programs (like an antivirus on a corporate machine) but still have the privilege to perform a shutdown (quite common on laptops), you may still succeed in killing the unwanted processes.

The procedure:
  1. Open Excel.
  2. Invoke Windows shutdown.
  3. Windows will tell you that Excel has unsaved documents. Do nothing. Just wait until all unwanted processes are killed. 
  4. Cancel the shutdown.
How to mitigate this security weakness:
  1. Run regedit
  2. Go to HKEY_CURRENT_USER\Control Panel\Desktop
  3. Set AutoEndTasks to 1
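If you prefer the command line, the same setting can be sketched as a one-liner (AutoEndTasks is a string value, hence REG_SZ; a sketch, not tested on every Windows version):

reg add "HKCU\Control Panel\Desktop" /v AutoEndTasks /t REG_SZ /d 1 /f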

Friday, November 22, 2019

Old products were reliable, the new ones not so much

I hear this line pretty frequently. But the data do not support this statement. One possible explanation of the belief that older products were more reliable is selection bias.

Products have a variable lifespan. Hence, when we use a century-old item, the item does not break easily, because it was stress-tested for a century and the weaklings were weeded out a long time ago. On the other hand, when we use a new item, there is a good chance that it won't survive long, because it hasn't gone through a century of stress-testing and weeding like the old products that survived to this day.

Another factor is the variable variance of durability. When many distinct manufacturers produce the product and the manufacturers are independent of each other (e.g.: they use only local resources), we may expect high variance in the durability of the products. On the other hand, if there are just a few manufacturers, or if they all use the same components or design, we may expect low variance in the durability of the products. Hence, some of the products produced during the high-variance period are likely to have outstanding durability (just as it is likely that some of them had terribly low durability). The "issue" with new products is that we currently live in a fairly globalized world and many technologies that we use daily have been commoditized (standardized and made widely available). Hence, many new items that we use daily have a fairly predictable lifespan. A lifespan which does not span centuries (because that would be overkill). Consequently, we may sometimes find "indestructible" items that were created when the technology was new. But keep in mind that just as "indestructible" items were produced, there were also "rubbish" items, which were quickly thrown out.

Overall, the data suggest that the quality of manufacturing keeps improving over time. But thanks to selection bias and variable variance, the reverse appears to hold from the user's point of view.

Addendum: We could model it analytically. For simplicity, assume that the time to failure of a product follows a log-normal distribution (I picked this distribution because it fulfills three basic properties: it does not allow negative values, it has a long tail, and people are familiar with it).

Selection bias can then be illustrated as the difference between the probability that a new product fails in the next 10 years and the probability that a 100-year-old product fails in the next 10 years. The first probability is going to be large, because the log-normal distribution is "fat" at the beginning. But once we get to the tail, the density (the derivative of the distribution function) is going to be close to 0. In other words, if a product survives 100 years, it is actually more likely to fail after more than another 10 years than during the next 10-year period.
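For illustration, a minimal numeric sketch in Matlab (it assumes the Statistics and Machine Learning Toolbox; the parameters are made up - a median lifetime of 15 years and sigma = 1.2):

% Conditional probability of failing within the next 10 years,
% given that the product has already survived t years.
mu = log(15); sigma = 1.2;    % hypothetical log-normal lifetime parameters
failNext10 = @(t) (logncdf(t + 10, mu, sigma) - logncdf(t, mu, sigma)) ./ (1 - logncdf(t, mu, sigma));
failNext10(0)      % a brand-new product: a sizeable chance of failing within 10 years
failNext10(100)    % a 100-year-old survivor: a noticeably smaller chance for the next 10 years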

The variable variance can be illustrated with the observation that whenever an engineer doesn't know how to estimate something accurately, he/she prefers to overestimate the parameters and build the thing robustly. However, sometimes the initial design has a flaw which reduces the lifespan of the product (hence, the peak is wide - the product can last long but also fail quickly). The next phase is fixing these flaws, while everything else is left as before (the fat at the beginning is removed but the fat tail is preserved - this is the period from which we observe many "eternal" products). Over time, the product is price-optimized (the fat tail is removed - the products have a predictable lifespan without extremes and they all look like garbage in comparison to the eternal products of the past).

Thursday, November 14, 2019

An app for climbing shoe recommendation

One of the most important factors of a good climbing shoe is a good fit. Unfortunately, human feet vary greatly. Feet vary not only in length, but also in the length/width and length/height ratios, toe lengths, and deviations like bunions, hammer toes and so on. In the case of a walking shoe, a single "size" measure is enough to guarantee a good enough fit, i.e.: the shoe doesn't slip but it also doesn't hurt anywhere. But a single measure is enough for a walking shoe only because in a walking shoe we tolerate vast spaces between the foot and the shoe (e.g.: between the toes and the shoe). In a climbing shoe, each such empty space degrades the climbing performance, because the foot does not have good contact with the rock at that particular "empty" spot. Hence, power climbers generally prefer as snug a fit as they can handle.

Ideally, an experienced shop assistant should be able to recommend a well-fitting climbing shoe based on a look at the client's foot. A shoe which can provide a snug fit on the client's foot without causing deformities. But my experience is that the salesperson is commonly (and naturally) biased toward the shoes that fit him/her well. Only exceptionally do you encounter an expert who can overcome this bias. But these experts generally (and naturally) work for some brand, and if this brand does not make a shoe for your type of foot, you are out of luck.

Hence my proposal: a phone app which would take a photo of your feet (a self-photo while you are standing barefoot on the floor), perform some rudimentary calculations (like the length-to-width ratio, the ratios of individual toe lengths to the foot length, ...), provide some result illustrations in order to persuade the users that the app actually does something (e.g.: overlay the user's foot outline on a prototypical foot) and display shoes ranked from best fit to worst fit.

There are three obstacles to getting this working:
  1. Data collection
  2. Monetization
  3. Machine learning  

Data collection

First, you have to have some data to feed the recommendation system. The best option would be to contact some climbing shoe manufacturer, present them your aim and ask them for their shoe profiles. I do not think that the biggest manufacturers like La Sportiva are going to be supportive (they may see it as too risky). But the small manufacturers may see it as a good opportunity for showing progressiveness and improving their visibility without risking too much. Plus, thanks to the fact that they are small, they can make their decision quickly. And finally, there are many small manufacturers and they vary wildly - it would be surprising if none of them were eccentric enough to provide you with the data (or allow you to take photos of the shoe lasts...).

But the initial measurements are just the beginning. You also have to track what people actually buy and what they return, and alter the recommendations based on that.

Monetization

Second, if you want the app to be successful in the long term, it has to make money. Otherwise it will eventually die due to technological obsolescence (and your lack of motivation to keep it alive). In this case, the monetization model is simple: a shop/brand that can provide good recommendations will make better sales. The reasoning is simple: whenever customers have doubts about which product to buy, they prefer to postpone their action. And whenever they postpone their action, you risk that they will eventually perform the action (the purchase) somewhere else or that they will not perform the action at all (they stick with their current shoes or they even completely abandon climbing). Hence, the app should allow purchases and returns to be realized through it in order to collect data and make money.

Machine learning

The app must be simple and fast to use. Hence, it is a good idea to not require any measurement with a ruler - a single photo of the foot/feet should be enough.
But how to process the photo? I would simply train a convolutional neural network to identify which pixels belong to the feet and which to the background. I would collect training images of feet from the internet and build the ground truth masks either in Photoshop (for tough photos) or with local thresholding methods (Sauvola thresholding for photos with a clean background). Since everything relies on getting a good outline of the feet, I would also instruct the users to take the photo of their feet on a plain white background like a sheet of paper or a wall. The paper has the advantage that you can identify the paper format (A4 or letter) from its aspect ratio and use it for foot size estimation (toe to heel) and for perspective correction (as the camera will not always be at the same position and point in the same direction), based on the knowledge that the paper should have right angles.
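As a rough sketch of the paper-based perspective correction in Matlab (it assumes the Image Processing Toolbox; the file name and corner coordinates are made up, and in the real app the corners would come from a detection step):

img = imread('feet.jpg');                                  % hypothetical input photo
paperCorners = [812 301; 2110 345; 2190 1820; 760 1795];  % [x y] photo pixels, clockwise from top-left
paperMm = [0 0; 210 0; 210 297; 0 297];                    % an A4 sheet is 210 x 297 mm
tform = fitgeotrans(paperCorners, paperMm, 'projective');
rectified = imwarp(img, tform, 'OutputView', imref2d([297 210]));  % roughly 1 pixel per millimetre
imshow(rectified)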

Once we have an outline of the foot (let's say the right foot), we should normalize the outline. This is important for visualization (the overlay of the user's foot on the prototypical foot) and for feature extraction. I would simply use an affine projection of an ideal foot outline onto the obtained outline in OpenCV (or whatever your favourite tool is).

Once we have the user's normalized foot outline, we can compare it to the inner shape of the shoe. Ideally, there should be a perfect overlap. And the measure of the overlap can be used for ranking the shoes, as sketched below.
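A toy sketch of the overlap measure in Matlab (the two outlines below are synthetic placeholders; real normalized foot and shoe-last outlines would go in their place):

foot = polyshape([0 10 10 0], [0 0 25 25]);    % a crude 10 x 25 "foot" rectangle
last = polyshape([1 11 11 1], [0 0 24 24]);    % a slightly offset shoe last
iou = area(intersect(foot, last)) / area(union(foot, last))   % 1 would mean a perfect fit
% Ranking the catalogue then means sorting the shoe models by their iou.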

Later on, once we collect enough data from sales and return records, we could even retrain the convolutional network to rank the shoes directly from the foot photo. The idea is that the photo may contain more information than the outline alone, and that the neural network could be better at deciding which parts of the feet tolerate overly tight/loose fits and which do not.

Evaluation

At the beginning, the goal will be to get repeatable results. I.e.: when we snap two photos of the same foot, we expect to get the same foot outline and the same shoe ranking. On the other hand, when we snap photos of two wildly different feet, we expect different outlines and different shoe rankings.

Later on, we can simply maximize profit (sales margin minus the returns).

 

Edit

It looks like there is already at least one company that takes body measurements through a camera: Menro, which makes smart suits. As a reference for the body size, they use an A4 paper with two corners blacked out to get a good contrast against a (likely white-painted) wall.

Monday, October 28, 2019

Student performance

Once, a professor told me: "A student must be either smart or diligent to perform well". Back then, I thought that it was an oversimplification. But I was wrong.

When I built a predictive model which predicts whether a student passes to the next term based on their past performance (~100 features), the model had AUC = 0.8. Not bad, but not useful either. What struck me: the model could be simplified to 2 features:
  1. Performance in an off-topic but obligatory course that does not require anything more than handing in simple homework on time (the teachers realized that the course was off-topic for these students and treated it more as a recruitment opportunity than as an opportunity to filter out bad students).
  2. Performance in a mathematical logic course. 
This simplified model had AUC = 0.78, which is not much worse than the 0.8 of the full model. When visualized, the model looked like this:
On the horizontal axis, we have the performance from the off-topic course, which exercises student's diligence. On the vertical axis, we have the performance from mathematical logic, which exercises student's intelligence.

There are 3 interesting takeaways:
  1. The professor was right. A student has to be either smart or diligent in order to perform well. And while I was right as well - it is a simplification - it is also an extremely practical and accurate simplification.
  2. The student does not have to be both smart and diligent in order to perform well. It is enough if the student is smart or diligent. This is something that I didn't expect. But the shape of the decision border speaks for itself: the north-west and south-east corners are green, not red. And the decision border is convex, not concave. Hence, this takeaway cannot be dismissed just by saying that the teachers are "too soft".
  3. The decision border looks like an arch and not like a line. This is also surprising, because it suggests that whenever we have multiple unrelated scores, we should move away from Manhattan space (where we just sum the scores) to Euclidean space (where we first square the scores); a toy numeric example follows below. Some universities already do that to some extent - if an applicant is outstandingly good in sport or art, the applicant is preferred. But the scores from Math and Languages are still generally just summed.
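A toy Matlab illustration of the difference (the scores are made-up numbers normalized to [0, 1]): a specialist beats a mediocre all-rounder under the Euclidean rule but not under the plain sum.

specialist = [0.9 0.2];                        % smart but lazy
allrounder = [0.6 0.6];                        % average in both subjects
sums  = [sum(specialist)  sum(allrounder)]     % 1.10 vs 1.20 -> the all-rounder is preferred
norms = [norm(specialist) norm(allrounder)]    % 0.92 vs 0.85 -> the specialist is preferred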


Sunday, October 27, 2019

File format for Helmer GPS locator

The file format of 'p_data.txt' is CSV. Unfortunately, the columns do not have labels. The inferred meanings of the columns follow (together with a parsing sketch after the list), in the hope that it will save someone a few minutes:
  1. IMEI
  2. The operation mode (monitor/tracker)
  3. Timestamp in YYMMDDHH24MISS format
  4. ?
  5. ?
  6. Time in HH24MISS format
  7. ?
  8. Latitude in Degrees Decimal Minutes format
  9. Latitude hemisphere (N/S)
  10. Longitude in Degrees Decimal Minutes format
  11. Longitude hemisphere (E/W)
  12. Speed
  13. Direction in degrees
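A parsing sketch in Matlab based on the inferred columns above (untested against a real device; it assumes the coordinates are plain DDMM.MMMM numbers):

raw = readtable('p_data.txt', 'FileType', 'text', 'Delimiter', ',', 'ReadVariableNames', false);
ddm2deg = @(x) fix(x/100) + mod(x, 100)/60;                  % degrees + decimal minutes -> decimal degrees
lat = ddm2deg(raw.Var8)  .* (1 - 2*strcmp(raw.Var9,  'S'));  % negative for the southern hemisphere
lon = ddm2deg(raw.Var10) .* (1 - 2*strcmp(raw.Var11, 'W'));  % negative for the western hemisphere
plot(lon, lat, '.-')                                         % a crude preview of the track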

Tuesday, October 22, 2019

Braess's Paradox

Braess's paradox states that adding a new path to a network may decrease the total flow through the network.

The paradox can be nicely illustrated with springs. However, beware that Nature's illustration of the spring experiment is flawed:

The lengths of the springs should be different when we cut the rope (while the lengths of the remaining ropes should remain unchanged). The corrected illustration:

On Wikipedia, you can read that you can also observe it in electrical networks "at low temperatures using a scanning gate microscopy":

But you don't need fancy technology to observe the paradox in electrical circuits. Nature's article actually got close to a "simple circuit":

But it requires a source of constant current. A simpler schema, which utilizes a source of constant voltage, is:

We just need 2 light bulbs, 2 resistors, a 9V battery (or any other source of constant voltage) and an ammeter to measure the change in the current. Optionally, we may include a button or a switch, but a piece of wire is enough.

How does it work? Incandescent bulbs have an interesting property: when they are cold, they behave almost like a short circuit. Only once they heat up does their resistance increase to the nominal value (the diagram is for a 60W light bulb):

The nominal resistance of the light bulb in our schema is 5V/(0.45W/5V) = 56Ω. When the button is open, the nominal current through a single light bulb is 9V/(56Ω+68Ω) = 0.073A. But once we close the button, more current goes through the light bulbs than through the resistors, because the light bulbs have a nominal resistance of 56Ω while the resistors have 68Ω. The increased current through the light bulbs heats up the filaments, the light bulb resistance increases and the total current through the network decreases.

Experimental currents through the whole network for different resistors:
  resistance [Ω]   open [mA]   closed [mA]
  32               160         160.5
  46               140         140
  68               125         124
  100              112         113
The paradox was observed only for one value of the resistors' resistance (68Ω). The optimal resistance for maximizing the paradox effect is somewhere around 62Ω. If you have a tandem potentiometer that can withstand ~1 watt (e.g.: an old wire potentiometer), I would be happy to hear what the actual optimal value is and how big the effect gets.

Applications: a teaser at an open day or science fair at a school. You ask the visitors to guess when the circuit is going to consume more energy:
  1. When the button is visibly closed and the light bulbs are shining brightly?
  2. Or when the button is visibly open and the light bulbs are just dim?
The paradoxical answer is that the circuit consumes more energy when the light bulbs are dim. 

To puzzle the visitors even more, you may stress that the circuit contains only passive components and that the trick is not in the power supply.

Monday, September 30, 2019

Litmus paper of classifiers

Naive Bayes (NB) is one of the first classifiers that I like to run on classification problems. The reasoning follows. 
  1. NB is fast. NB is one of the fastest nontrivial and widely available classifiers that you can use. This property is particularly useful whenever you want to test the whole workflow, from the data collection (e.g.: a web form) to the action (e.g.: displaying the result in the form).
  2. NB is tune-less. That is nice because you can immediately use the NB accuracy as a reference against which you can compare the accuracy of more sophisticated algorithms - if your advanced classifier delivers worse accuracy than NB, you immediately know that the parameters of the advanced classifier must be terribly wrong.
  3. NB is not picky. It handles numerical and nominal features, missing values, high-cardinality features, multi-class labels and rare classes without sweating. No fancy data preprocessing is required.
  4. NB has the right sensitivity to data imperfections. Real-world data tend to be messy - full of outliers, irrelevant features, redundant features... And NB has high enough sensitivity to data imperfections to reward you with a non-zero improvement in the classification accuracy when you improve data quality. On the other hand, NB is robust enough to extract some signal even from the murkiest data where other models just give up (be it because of perfect feature collinearity, widely different feature scales or some other potentially lethal nuisance).
Of course, NB is never the last classifier that I test on a data set. But it is a pragmatic first choice to test.
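A minimal baseline sketch in Matlab (it assumes the Statistics and Machine Learning Toolbox; the built-in Fisher iris data stands in for a real data set):

load fisheriris                            % meas: numeric features, species: class labels
model = fitcnb(meas, species);             % no hyperparameters to tune
cv = crossval(model, 'KFold', 10);
referenceError = kfoldLoss(cv)             % anything fancier should beat this number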

Tuesday, June 25, 2019

Drug discount regulation

This is a reaction to The insulin racket: why a drug made free 100 years ago is recently expensive.

The problem that my proposal aims to fix:
The difference between the "official list price" (which has gone up) and the "price actually paid by large providers" (which has gone down).

The proposal:
The proposal calls for the same cost of a drug for everyone (in the US) regardless of whether the subject is an uninsured person or a large insurance company.

Sure enough, some order-quantity discounts would be permissible but only up to a capped rate. E.g.: 10% discount for ordering a million pills is ok. But 99% discount is not ok.

Comparison to alternatives:
"Pre-restitution" could solve the issue as well. But it is politically hardly acceptable (the last time when a pharmaceutical company was nationalized in the US was, I think, during the second world war). Discount regulation, on the other end, preserves the ownership.

A cap on the cost could work as well. But it is tough to set a fair price, particularly for new drugs. While discount regulation does not eliminate the risk of bad pricing, it shifts the responsibility for good pricing to the drug manufacturers. Hence, discount regulation is less risky for lawmakers than a price cap. And the drug manufacturers are better positioned to quickly adjust a bad price, as the price itself remains unregulated.

Single-payer healthcare could work as well. The issue is that it is a big change. And big changes are risky. Sure enough, the discount cap is a big change as well. But it is going to negatively impact just "pharmacy benefit managers", while single-payer healthcare would negatively impact both pharmacy benefit managers and insurance companies.

Additional advantages of the discount regulation:
  1. In principle, the proposal is item agnostic. While it makes sense to initially limit the scope of the bill to a few troublesome drugs in order to limit the impact of unexpected consequences, there is no evident reason why the bill should not, in the end, apply to all treatments.
  2. It eliminates a lot of "unproductive" hassle associated with price negotiation. Essentially, it could shrink the pharmacy benefit manager industry (they would still exist - medical care is not just about the cost of treatments but also about cost-benefit analysis, which will remain a valuable tool in the presence of multiple treatments).

What it does not attempt to solve:
The proposal aims to lower the gap between the price of a drug for an uninsured person and the price for an insurance company. But it does not attempt to decrease the price of the drug for the insurance company. I.e.: a drug that is already too expensive for an insurance company will remain too expensive. But that is politically acceptable because the drug will remain expensive for _everyone_.

Corner scenarios:
People are extremely creative in avoiding the law. Hence, the law would have to strike a balance between the generality of the formulation and enforceability.
Examples that should be illegal:
  1. The manufacturer decides to stick with the high price. But in order to compensate the insurance company, it provides some drugs "that are close to expiration" for free.
  2. The manufacturer decides to stick with the high price. But from time to time it decreases the price just for a moment, until it signs a contract with the insurance company.
  3. The manufacturer decides to stick with the high price, but only for the US. Hence, the insurance companies will simply import the drug for a much lower price.
As I am not skilled in law-speak, I do not provide a suggested formulation of the law. Nevertheless, it is reasonable to assume that the amount of text necessary to describe the discount cap is going to be lower than in the case of the cost cap (as the cost cap has to provide a unique cap for each drug).


Friday, March 22, 2019

How to fix backlight bleed

Sometimes a laptop display gets damaged by pressure or water. And the damage manifests itself by the presence of bright spots:

The best way to get rid of the white speckles is to replace the display. But that can cost as much as a new laptop. A fast and free solution is to use a dark theme throughout the system - the bright spots are not as irritating on a dark background as they are on a bright one. But if we are determined to stick with the light theme, we can overlay the screen with an image which is dark where the speckles are and transparent elsewhere. Note that this solution is not perfect - when you see bright spots on the display, it is because the backlight layer got damaged. And because there is some nonzero space between the backlight and the pixels, we can get a perfect alignment only for a single observation point - when we move our head a bit, the overlay image gets slightly misaligned and the edges of the speckles become visible. Furthermore, it took me a few hours to get acceptable results with this approach.

Walkthrough for using the overlay method:
    1) Get an application that can permanently overlay an image over the screen. On MacOS, I used Uberlayer.
    2) Take a good photo of the display with a completely white background. The camera should be roughly at the position from which you commonly look at the screen. Also, we want to use a longer exposure (e.g.: 1 second) in order to avoid capturing display refresh artifacts (they look like a moiré pattern on my display). Hence, a tripod can be handy. Finally, it can be better to take the photo in a darkened room in order to avoid capturing reflections on the display.
    3) Convert the photo to shades of gray.
    4) Write down the position of the bright spots on the photo in pixels.
    5) Write down the position of the bright spots on your display in pixels. I used Ruler.
    6) Align the photo to the screen with a combination of RANSAC and projective transformation.
    7) Homogenize the illumination of the photo.
    8) Invert the color of the image - bright spots will become dark spots.
    9) Set the transparency.
  10) Use the generated image as the screen overlay.

The script in Matlab:
% Load data
original = rgb2gray(imread('photo.tif'));

% Location of spots on the photo (as read from: imshow(original))
% Coordinates are [distance from left, distance from top] in photo pixels.
photo = [
    556     1125    % the brightest speckle
    101     961     % the single speckel on the left
    61      1578    % the bottom left single pixel
    2422    1161    % right: the top bright spot
    2465    1216    % right: the lowest bright spot on right
    1065    698     % middle: the brightest speckle (north west)
    15      31      % corners...
    2545    1637
    22      1630
    2548    67
];

% Location of bright points on the display
% Averaged from 2 measures
screen = [
    302.5   611     % the brightest speckle
    47      523     % the single speckel on the left
    23      870     % the bottom left single pixel
    1367.5  627     % right: the top bright spot
    1393.5  658.5   % right: the lowest bright spot on right
    589     368     % middle: the brightest speckle (north west)
    1       1       % corners...
    1440    900
    1       900
    1440    1
];

% Visualization
figure('Name', 'Photo')
imshow(original);
hold on
plot(photo(:,1), photo(:,2), 'or')

% Map photo to screen coordinates
rng(2001); % RANSAC is stochastic (it may exclude outliers) -> set seed
[tform, inlierpoints1, inlierpoints2] = estimateGeometricTransform(photo, screen, 'projective', 'MaxDistance', 3);

% Transform the photo
outputView = imref2d(size(original));
image = imwarp(original, tform, 'OutputView', outputView, 'FillValues', 255);

% Crop the image to the size of the screen
image = image(1:900, 1:1440);

% Plot the aligned photo
figure('Name', 'Photo after transformation')
imshow(image)
hold on
plot(screen(:,1), screen(:,2), 'ob')

%% Illumination homogenization
% First, we remove the white speckles. See:
%   https://www.mathworks.com/help/images/correcting-nonuniform-illumination.html
se = strel('disk', 25);
background = imopen(image, se);

% Gaussian smoothing creates nicely smooth transitions (at least with doubles)
smoothed = imgaussfilt(double(background), 25);

% Subtract background illumination from the image to get the foreground
foreground = double(image) - smoothed;

% Remove too dark pixels (essentially the leaking border)
threshold = median(foreground(:)) + std(foreground(:));
borderless = max(threshold, foreground);
imagesc(borderless)

%% Invert the colors
result = uint8(255-(borderless-threshold));
imshow(result)

%% Set transparency
black = uint8(ones(900, 1440));

maximum = max(borderless(:));
minimum = min(borderless(:));
alpha = (borderless-minimum)/(maximum-minimum);

%% Store the image
imwrite(black, 'overlay.png', 'alpha', alpha)

Sunday, January 6, 2019

Matlab

It took me a while to realize why I reproduce results from articles in Matlab (and not in Java, Python or R).

Here are the reasons:
  1. Matlab has succinct syntax for matrix operations, which is similar enough to equations in articles (this is where Java gets beaten).
  2. It is extremely simple to get example data from an article into the Matlab code - you just copy-paste the data from the table in the article, surround them with square brackets and that's it. You have a matrix. There is no need to separate the numbers with commas (but you can - they are optional). There is no need to separate the individual rows with brackets. It just works (see the snippet below). This is where Python and R get beaten.
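For example (made-up numbers, pasted as-is between the brackets):

A = [
    0.12  1.30  2.40
    3.50  4.60  5.70
]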
Who would guess that great copy-paste support would affect the choice of the language...

Another nice thing is that Matlab (and R in RStudio) autocompletes paths to files and directories - once again, it is a small thing, but it helps to avoid typos.