I read multiple articles on the internet about automatically opening/closing your gate for your car based on the GPS in your cellphone. When you enter/leave your home zone, the cellphone triggers the opening/closing. To make sure it works properly, stick a sensor on your gate to sense that the gate was properly closed. And use a tiny script to integrate it. A task for an hour, at worst a day, right?
I thought so as well. But I was wrong. Murphy's Law says that if something can fail, it will fail. In this case, everything initially failed. If you plan to automate your gate, I hope my observations will save you some time. The content is divided into 3 sections: location sensing, gate position sensing, and integration.
Location sensing: I use an iPhone. It has GPS and cellular internet. It should work, right? And it did. However, the iPhone checks whether you have entered/left a zone only once every 30 seconds (or a minute, I do not recall exactly). That means that if you enter the home zone, the gate can start opening immediately, after 30 seconds, or anything in between. One option is to start opening the gate while I am still far away. However, what if I stop by a nearby shop or a restaurant when returning home? Then the gate will stay open, potentially for hours (or until a timeout). That is not ideal. Another option is to wait in front of the gate for up to 30 seconds until it opens. That is more irritating than pushing a button on the remote control. No thank you.
I researched how to increase the zone testing frequency. However, the iPhone does not allow you to touch the hardware. And it does not allow you to run a background process indefinitely. Hence, you can't just keep triggering the GPS in a wait loop without running the app in the foreground. So I decided to use dedicated hardware: iBeacons. With an iBeacon, the maximal delay in location sensing was less than 1 second. Awesome! However, the reach was ~50 meters. And I wanted 170 meters. After a bit more research, I ordered devices with 4× higher reach. But that was false advertising: they had exactly the same reach. I concluded that this is a dead end.
After rereading the internet articles, I realized that all the authors were using Android. So I tested Android. It had a maximal delay of 5 seconds in sensing the zone changes. That was... acceptable. Unfortunately, it was draining the battery so much that by evening the Android was dead. But I wasn't giving up. The Android doesn't have to keep sensing the location frequently all the time. It is fine if it does so only when it is in my car. Commonly, it is done by sensing a Bluetooth connection between the cellphone and the car. But my car does not have Bluetooth. I could have sensed the iBeacon, which I had in the car from the previous experimentation. But I decided to mitigate the battery drain issue as well and I purchased a wireless charger for the Android. When the Android is charged by that specific wireless charger, it switches to high-frequency location sensing. Once it disconnects from the wireless charger, it switches to the normal refresh frequency. And that worked. Until it didn't.
When a cellphone attempts to get its location after a long pause (e.g., in the morning), it needs to update its GPS ephemeris data. It can download the data directly from the satellites, but it is (usually) faster to get it from the internet (i.e., "assisted GPS"). Unfortunately, assisted GPS does not work well when you are in the middle of a Wi-Fi to cellular handover (a common occurrence when I am leaving my home). While the handover is close to seamless for normal processes, for assisted GPS, it isn't. The download process first has to exhaust memory and get killed by the kernel... And only once it is restarted does it successfully download the ephemeris data over cellular internet, and I finally get my location. I mitigated the issue by switching to an unlimited data plan on my phone and always using cellular internet. Another nuisance was that GPS sometimes provided location estimates that were kilometers off, only to correct itself in the next second. Fortunately, all that was needed was to reject measurements whose inaccuracy, as reported by the GPS itself, was higher than ~100 meters.
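The outlier rejection in the last step can be sketched as follows (the fix format and the exact threshold are my illustrative assumptions, not a real phone API):

```python
# Drop GPS fixes whose self-reported accuracy is worse than ~100 m.
MAX_ACCURACY_M = 100

def plausible_fixes(fixes):
    """Keep only fixes that the receiver itself considers accurate enough."""
    return [f for f in fixes if f["accuracy_m"] <= MAX_ACCURACY_M]

fixes = [
    {"lat": 50.0810, "lon": 14.4310, "accuracy_m": 12},    # believable
    {"lat": 50.2100, "lon": 14.6000, "accuracy_m": 2500},  # kilometers off
]
print(plausible_fixes(fixes))  # only the first fix survives
```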
Gate position sensing: For sectional garage doors, it is customary to use a tilt sensor or two magnetic sensors to verify that the door is truly closed/opened. After experiencing all the issues above, it was clear to me that this cannot be skipped. However, I didn't want to place a battery-powered magnetic sensor on my swing gate, as the battery would not survive long in the winter. So I had the bright idea to reuse my camera, which was already pointed at the gate. I got it working in the day, at night, at dusk (tough, as neither IR nor the visual spectrum gives a good picture at that part of the day), in haze, and with partial occlusion by people, cars, snowflakes, and flying bugs. I dealt with sun glare by 3D printing a lens hood. And when I realized that there would still be a brief moment in the year when the sun blinds the camera even with the lens hood, I felt like Indiana Jones in Raiders of the Lost Ark. So I planned to install a second camera aimed at the gate, just from a different angle. That way, as long as there is at most one sun in the sky, I get a clear picture of the gate. For a while, it worked. Until it didn't. Somehow, dandelion seeds managed to attach themselves to the glass lens and screw up my image processing. I had enough. I decided to install the magnetic sensors that I should have installed right away. After studying the exploded-view diagram of the motor that powers my gate, I found convenient places for wired magnetic sensors. After debouncing the signal from the mechanical relays in software, it worked reliably. Success.
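The debouncing mentioned in the last step can be sketched like this (the 50 ms window and the API are illustrative assumptions, not the author's actual code):

```python
import time

# Software debounce for a mechanical relay / magnetic sensor: a state change
# is accepted only after the raw signal has held steady for DEBOUNCE_S.
DEBOUNCE_S = 0.05

class Debouncer:
    def __init__(self, initial=False):
        self.stable = initial        # last accepted (debounced) state
        self._candidate = initial    # raw state waiting to be confirmed
        self._since = time.monotonic()

    def update(self, raw, now=None):
        """Feed one raw reading; return the debounced state."""
        now = time.monotonic() if now is None else now
        if raw != self._candidate:
            # The raw signal changed: restart the confirmation timer.
            self._candidate = raw
            self._since = now
        elif raw != self.stable and now - self._since >= DEBOUNCE_S:
            # The new state has been held long enough: accept it.
            self.stable = raw
        return self.stable

d = Debouncer(False)
print(d.update(True, now=0.00))   # False: change not yet confirmed
print(d.update(False, now=0.01))  # False: contact bounced back
print(d.update(True, now=0.02))   # False: confirmation timer restarted
print(d.update(True, now=0.08))   # True: held for 60 ms, accepted
```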
Integration: At this point, it would be suspicious if everything worked right away, wouldn't it? But all I had to deal with was a race condition in the processing code.
TL;DR: For automatic gate opening:
iPhone DIDN'T WORK
iBeacon DIDN'T WORK
Android WORKED
Camera DIDN'T WORK
Magnetic sensors WORKED
Datafilos
Sunday, May 3, 2026
How not to automatically open the gate
Thursday, December 25, 2025
How to identify faulty sensors in home automation
Home automation tends to be unreliable, as it is composed of many individual components. If each component has 99.9% reliability and we have 100 statistically independent components, then there is only a 0.999^100 ≈ 90.5% chance that everything works. That is not much. Hence, we need to monitor the components and alert when they break down.
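The arithmetic above, checked in one line:

```python
# Combined reliability of 100 independent components, each 99.9% reliable.
overall = 0.999 ** 100
print(f"{overall:.1%}")  # ~90.5% chance that everything works
```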
I use 5 types of heuristics to trigger an alert that there is an issue with a sensor. For illustration, I will explain them on a wireless battery-powered thermometer sensor:
- Too low/too high. When the temperature in a boiler is above 90 °C or below 30 °C, it is a sign of trouble.
- Unavailable. When the temperature reading is unavailable for 15 minutes, the battery in the sensor has died or the sensor got water damaged.
- Stuck. When the temperature reading AND the signal strength (RSSI) are stuck at the same value for over 24 hours, the battery voltage is too low or the sensor needs to be restarted. I like to combine multiple readings from the same device to decrease the false alert probability when it is plausible that the constant value is legitimate. For example, the temperature in my basement is constant as long as no one opens the basement doors. Hence, if I measured only temperature, it would result in false alerts. On the other hand, the battery-powered thermometer next to the server is so close that the RSSI is always at the maximal value. Hence, an alert based on RSSI alone would also give false alerts. By requiring temperature AND RSSI to be stuck, I can use the same code for all my thermometers without false alerts. And I do not have to think about whether a particular sensor has more variable temperature or RSSI.
- Rapid change. Generally, when the temperature increases too quickly, it might be a sign of fire. Or that the battery is almost empty and the sensor has become erratic.
- Noisiness. When the standard deviation of the temperature over the last hour is too high, the battery is dying.
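The five heuristics above can be sketched in a few lines (all thresholds and the sampling format are illustrative assumptions, not tuned values):

```python
from statistics import pstdev

# `history` holds the last 24 hours of (temp_c, rssi) samples, newest last,
# assuming one sample per minute.
def alerts(history, minutes_since_last_reading):
    if minutes_since_last_reading > 15:
        return ["unavailable"]            # dead battery or water damage
    issues = []
    temps = [t for t, _ in history]
    rssis = [r for _, r in history]
    if not 30 <= temps[-1] <= 90:
        issues.append("too low/too high")
    if len(set(temps)) == 1 and len(set(rssis)) == 1:
        issues.append("stuck")            # temp AND RSSI frozen for 24 hours
    if len(temps) >= 2 and abs(temps[-1] - temps[-2]) > 10:
        issues.append("rapid change")     # possible fire or erratic sensor
    last_hour = temps[-60:]
    if len(last_hour) >= 2 and pstdev(last_hour) > 5:
        issues.append("noisiness")        # dying battery
    return issues

healthy = [(50 + i % 2, -60 - i % 3) for i in range(24 * 60)]
print(alerts(healthy, 5))                  # []
print(alerts([(50, -60)] * (24 * 60), 5))  # ['stuck']
```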
Nevertheless, my single favorite alert trigger for network-based sensors is a check whether the sensor's web page loads. It happened to me multiple times that a sensor was answering pings but the device was otherwise unresponsive. Hence, a ping is not sufficient. However, a simple HTTP status code check has so far worked reliably and universally across all my network devices.
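A minimal sketch of such an HTTP status check, using only the standard library (the sensor address below is a placeholder):

```python
import urllib.request

# Alert-worthy when the sensor's web page stops answering with HTTP 200.
# urllib's URLError (and its subclass HTTPError) inherit from OSError, so a
# single except clause covers timeouts, refused connections, and 4xx/5xx.
def web_page_ok(url, timeout_s=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False

print(web_page_ok("http://192.0.2.1/", timeout_s=1))  # unreachable -> False
```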
Does that mean ping alerts are useless? No. Once I had a faulty device, so I filed a warranty claim. But the claim was denied because "I have an unreliable network and should hire professionals to fix it". So I presented them with the historical ping and HTTP status code logs for the device. The device was answering pings, but the web interface returned 404. This was enough for them to accept the warranty claim. If I didn't have the ping logs, they could have claimed that the Ethernet cable was faulty... But the fact that ping worked continuously for a week silenced them. Hence, having multiple alert triggers, even if they partially overlap, paid off for me.
Wednesday, January 22, 2025
Adapter for the parallel port on Brother printers
Brother printers do not use the traditional 36-pin Centronics connector that you can find on other printers, nor the 2-row 25-pin D-SUB parallel port (LPT) connector that you find on computers. Instead, they use a smaller 3-row 26-pin D-SUB connector. And new printers do not come with a cable/adapter to LPT. You have to go and buy the adapter from Brother under the name "PC-5000".
On the one hand, I understand Brother. If I were them, I would also want to know how many people still need LPT. And by selling the adapter separately, you get the count. On the other hand, the cable sells for a quarter of the price of the printer.
But the adapter is nothing more than a simple wire adapter that you can solder yourself from an old LPT cable and a new 3-row 26-pin D-SUB connector for 2 dollars.
The wiring is simple. Pin 1 on one connector goes to pin 1 on the other connector. Pin 2 goes to pin 2. And so on. Pin 26 does not have a counterpart and is left unconnected. Ground goes to ground.
For reference, I include the wire colors of my cable (note: they might differ from your cable). The pin numbers are the numbers printed on the D-SUB connectors of the cable.
D-SUB 26:
| 1 | black | 10 | white | 19 | black white |
| 2 | brown | 11 | pink | 20 | brown white |
| 3 | red | 12 | azure | 21 | red white |
| 4 | orange | 13 | red black | 22 | orange white |
| 5 | yellow | 14 | orange black | 23 | green white |
| 6 | green | 15 | yellow black | 24 | blue white |
| 7 | blue | 16 | green black | 25 | purple white |
| 8 | purple | 17 | grey black | 26 | not wired |
| 9 | grey | 18 | pink black | ground | shielding of the cable |
D-SUB 25:
| 1 | black | 14 | orange black |
| 2 | brown | 15 | yellow black |
| 3 | red | 16 | green black |
| 4 | orange | 17 | grey black |
| 5 | yellow | 18 | pink black |
| 6 | green | 19 | black white |
| 7 | blue | 20 | brown white |
| 8 | purple | 21 | red white |
| 9 | grey | 22 | orange white |
| 10 | white | 23 | green white |
| 11 | pink | 24 | blue white |
| 12 | azure | 25 | purple white |
| 13 | red black | ground | shielding of the cable |
Just note that the 3-row D-SUB is a bit overcrowded in comparison to the 2-row D-SUB. You might want to start from the middle row and use a micro soldering iron.
Monday, November 25, 2024
Better debugging UI for CTE
One of the great advancements in SQL was the introduction of Common Table Expressions (CTEs), which made it convenient to write long queries without creating intermediate tables on disk (or temporary tables, if the database supports them).
However, debugging long CTEs is difficult. To find a bug, I currently use bisection. I modify the query to show the first rows of some intermediate CTE somewhere in the middle of the query. If the CTE result looks OK, the bug is in the second half of the query. If the result does not look OK, the bug is in the first half. And this process repeats recursively, until I pinpoint the single offending CTE.
This approach works well on quickly running queries. However, once each query execution takes a long time, the debugging drags on, as it requires multiple executions just to find the offending CTE, let alone fix it.
A possible mitigation is to enhance the query editor. Just as some query editors allow you to fold CTEs, the editor should present the following information for each CTE:
- the CTE's row count
- an icon that, on click, opens the CTE result
This information should appear in the editor progressively, as it becomes available. With this approach, identifying the offending CTE should require at most 1 complete query execution. If you are lucky, just by looking at the row counts of the intermediate CTEs, you spot the issue while the query is still running.
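A minimal sketch of how an editor could obtain these row counts, assuming SQLite and that the query's CTEs are available as (name, body) pairs (the helper name is hypothetical):

```python
import sqlite3

# For each CTE, run the chain up to that CTE and count its rows, so the
# offending step can be spotted without manual bisection. Assumes each CTE
# references only earlier CTEs.
def cte_row_counts(conn, ctes):
    counts = {}
    for i, (name, _) in enumerate(ctes):
        prefix = ", ".join(f"{n} AS ({body})" for n, body in ctes[: i + 1])
        query = f"WITH {prefix} SELECT count(*) FROM {name}"
        counts[name] = conn.execute(query).fetchone()[0]
    return counts

conn = sqlite3.connect(":memory:")
ctes = [
    ("nums", "SELECT 1 AS x UNION ALL SELECT 2 UNION ALL SELECT 3"),
    ("big", "SELECT x FROM nums WHERE x > 1"),
    ("empty", "SELECT x FROM big WHERE x > 99"),  # the offending step
]
print(cte_row_counts(conn, ctes))  # {'nums': 3, 'big': 2, 'empty': 0}
```

In a real editor, the counts would stream in as each CTE finishes, instead of being computed by re-running prefixes.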
Technical limitations:
- The database ought to support temporary tables. This is needed for a simple implementation of the CTEs' result preview.
- While in the debugging mode, optimisation techniques that work across CTEs, like predicate pushdown, are not applicable. This is acceptable, as programmers expect a slow-down in debug mode. However, it means that there has to be an additional "Debug" button next to the ordinary "Run" button. The two modes can't be mixed.
Future enhancements:
Do we want to store whole CTE results or only the top 1000 rows? If we store whole tables, we can use them to warm-start the query execution once we modify the query. The disadvantage of storing whole results is that even a single query in debugging mode can exhaust disk space and consequently terminate prematurely, while in the ordinary run mode the query may finish without any issue. Hence, the limit mode will likely have to be implemented. Nevertheless, for testing the core idea, the limit/warm-start functionality is not needed.
Wednesday, November 22, 2023
Google Maps' color palette
Once again, Google Maps changed its color palette. And once again, people are unhappy about it. The change is supposed to improve clarity on low-quality car displays. But it seems to make people with high-quality displays unhappy.
I say that a single color map can't make everyone happy. Some people are color-blind, hence preferring "50 shades of gray". Other people see colors but have a crappy display that can't distinctly show more than a few levels of gray, hence preferring "papagayo colors".
The solution is to let people define their own palette per navigation type (no navigation, by car, by public transit, on foot, by bike, ...) and share the palette configurations. This will take care of:
- eye and display imperfections,
- differences in the opinions of what type of information is important (if nothing else, it is reasonable to assume that this differs from one biome to another),
- personal preferences (when you have been accustomed to one palette for years, you might prefer to stick to it, simply because your brain can navigate the old palette faster than the new one).
Sunday, July 23, 2023
Dataframe API design
Dataframe libraries are notoriously tedious to implement, because a dataframe library should be feature-rich.
There were multiple attempts to simplify the problem.
One notable approach is to implement only operations on rows. If you need to do something with columns, you first transpose the dataframe to convert the columns to rows. This simple trick reduces the size of the API (the count of methods) by roughly 1/3. Unfortunately, heterogeneous dataframes (composed of multiple different data types) are not the easiest to transpose. The author of the approach solved it with dynamic typing: each cell contains information about its data type.
Can we further reduce the size of the API? I argue that the answer is yes. Each dataframe should have its own metaframe, which is nothing else but a dataframe with metadata about the dataframe's columns. Metadata like column names and data types (as in information_schema.columns in SQL databases), and statistics like the count of missing values, count of unique values, average, standard deviation, and so on, which can be used for queries or query optimization. And this metadata should be manipulable and editable with exactly the same API as the dataframe.
Hypothetical examples in Python syntax:
print(df) # some dataframe instance
print(df.mf) # the dataframe's metaframe
print(df.some_column_name) # prints a column named "some_column_name"
# Rename a column.
# Commonly we would have a dedicated function or a method for this like:
# column_name(df)[3] = "new_name"
# or
# df.rename(columns={"old_name": "new_name"})
# but we reuse ordinary syntax for dataframe manipulation:
df.mf.column_name[3] = "new_name"
# Select columns without any missing value.
# Commonly we would have a dedicated function or a method for this like:
# df[:, ~any(isnull(df))]
# or
# df[:, ~df.isnull().any()]
# but we reuse ordinary syntax for dataframe manipulation by using the missing_values column in the metaframe:
df[:, df.mf.missing_values == 0]
# Select columns with "good" substring.
df[:, regex(df.mf.column_name, ".*good.*")]
# Select integer typed columns.
df[:, df.mf.data_type == "integer"]
# Cast a column to string. Implementable with https://stackoverflow.com/questions/51885246/callback-on-variable-change-in-python
df.mf.data_type[3] = "string"
The metaframe should be open to the addition of new columns as needed. For example, do you want to calculate feature importance and then filter based on it? Sure enough, we can store the feature importance in an independent array and then filter based on the array:
feature_importance = get_feature_importance(df)
df[:, feature_importance > 0]
But what if there are intermediate steps between the feature_importance calculation and the filtering, where you manipulate column positions or add or delete columns? Suddenly, you have to keep the feature_importance array synchronized with the df. And that can be tedious and error-prone. But if you store feature_importance in the metaframe, the dataframe library will take care of keeping it synchronized (when you add columns, the corresponding feature_importance value will be null - no magic).
However, if you wanted a bit of magic, the library might track the operations performed on columns and keep a lineage of the columns. For example, the library might track which dataframes and columns were used in the computation of each column. This is useful in complex systems, where it is sometimes difficult to trace the origin of some data.
Implementation
Because we want to treat a metaframe as if it were a dataframe, a metaframe has its own metadata. Hence, we get a never-ending chain of frames. This can be implemented as a linked list.
Without loss of generality, let's assume that each dataframe has 2 mandatory metadata attributes:
- column_names,
- data_types.
Additionally, each dataframe can have an unlimited count of optional metadata attributes, e.g.:
- roles,
- interpretations,
- notes,
- ...
Then the chain of frames will eventually start to repeat with a frame that has the 2 mandatory columns [column_names, data_types] and 2 rows whose column_names values are ["column_names", "data_types"].
We can treat it as the "sentinel node" known from linked lists, which marks the end of the list: once we encounter it, it just references itself. Note that a single global sentinel can be shared by all frame chains.
However, if we want to make the data structure easier to serialize, we had better avoid never-ending loops. To do that (and to make the space overhead smaller), we might just use null in the places where we would reference the sentinel; it is then up to the code to handle nullable metaframes (be it in the library or in user code).
The final option is to use dataframes even in the back end. We can have a dataframe with an ordered list of dataframes, where for two neighboring dataframes it holds that the bottom one is the metaframe of the upper one. And the last row is the loopy metaframe. The advantage of this design is that, in comparison to the sentinel, it is easy to serialize. And in comparison to the null design, the loopy metaframe can have additional columns, which can be important for some applications and generally for the evolution of the format.
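A minimal sketch of the sentinel design in plain Python (the Frame class and the make_frame helper are hypothetical, just to show the self-referencing chain; this is not a real library API):

```python
# A frame is columns plus a reference to its metaframe.
class Frame:
    def __init__(self, columns, mf=None):
        self.columns = columns  # dict: column name -> list of cell values
        self.mf = mf            # this frame's metaframe

# The sentinel describes itself and terminates every chain by self-reference;
# one global instance can be shared by all frame chains.
SENTINEL = Frame({"column_names": ["column_names", "data_types"],
                  "data_types": ["string", "string"]})
SENTINEL.mf = SENTINEL

def make_frame(data):
    """Build a dataframe whose metaframe lists its column names and types."""
    meta = Frame({"column_names": list(data),
                  "data_types": [type(v[0]).__name__ for v in data.values()]},
                 mf=SENTINEL)
    return Frame(data, mf=meta)

df = make_frame({"a": [1, 2], "b": ["x", "y"]})
print(df.mf.columns["column_names"])  # ['a', 'b']
print(df.mf.mf is SENTINEL)           # True: the chain ends at the sentinel
```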