Thursday, December 31, 2015

Missing values in a decision tree

There are multiple ways a decision tree can deal with missing values in the data.
  1. When a decision has to be made on an attribute that is missing, the scoring of the instance can terminate and the class probabilities of the current node can be returned as the prediction. Note that the implementation has to keep class probabilities not only for the leaves, but for all the nodes in the tree.
  2. Or we may keep statistics about how many training samples go into each node. If a decision has to be made on a missing value, the instance goes into the most frequent descendant, i.e. the one that received the most training samples.
  3. We may also train the tree to decide where instances with a missing value should go. For example, suppose a split learned on a continuous attribute sends 90% of the training samples into the right descendant. The model can still learn that instances with a missing value are, judging by their class labels, more similar to the instances in the left descendant. Hence it sends the instances with the missing attribute value left.
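
The three strategies above can be sketched on a toy tree. This is only a minimal illustration, not any particular library's implementation: the `Node` class, the `predict` function, the strategy names (`"stop"`, `"majority"`, `"learned"`) and the numbers in the example tree are all made up for this post, and the learned direction in strategy 3 is assumed to be already stored in the node rather than actually trained here.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    class_probs: List[float]          # kept for every node, not only leaves (strategy 1)
    n_samples: int                    # training samples that reached this node (strategy 2)
    feature: Optional[int] = None     # index of the split attribute; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    missing_goes_left: bool = True    # direction assumed to be learned in training (strategy 3)


def is_missing(value) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def predict(node: Node, x, strategy: str) -> List[float]:
    """Return class probabilities for instance x under the chosen strategy."""
    while node.feature is not None:
        value = x[node.feature]
        if is_missing(value):
            if strategy == "stop":
                # Strategy 1: terminate and answer with the current node.
                return node.class_probs
            elif strategy == "majority":
                # Strategy 2: follow the descendant with more training samples.
                go_left = node.left.n_samples >= node.right.n_samples
            elif strategy == "learned":
                # Strategy 3: follow the direction stored in the node.
                go_left = node.missing_goes_left
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        else:
            go_left = value <= node.threshold
        node = node.left if go_left else node.right
    return node.class_probs


# Hypothetical tree: 90% of the training samples go right,
# but missing values are assumed to have been learned to go left.
root = Node(
    class_probs=[0.45, 0.55], n_samples=100, feature=0, threshold=1.0,
    left=Node(class_probs=[0.9, 0.1], n_samples=10),
    right=Node(class_probs=[0.4, 0.6], n_samples=90),
    missing_goes_left=True,
)

for strategy in ("stop", "majority", "learned"):
    print(strategy, predict(root, [None], strategy))
```

With the attribute missing, each strategy gives a different answer on this tree: `"stop"` returns the root's probabilities, `"majority"` returns those of the right leaf, and `"learned"` returns those of the left leaf.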

Saturday, December 12, 2015

Money


There is one thing that isolates a person from humanity better than a prison ever could - money. If you are rich, people are not going to forgive you. There are different strategies for coping with the isolation:

  1. Bribe artists to like you by buying their art.
  2. Bribe scientists to like you by supporting their research.
  3. Bribe women to like you by hiring prostitutes.
  4. Seize power to force people to listen to you.
  5. Take drugs to forget about the world.
  6. Substitute civilization and humanity with nature.
  7. Believe in something with your whole heart.



Comparison of import.io and OutWit

When import.io was released, I was excited. However, the excitement disappeared. The reasons follow:
  1. Whenever you define a crawler, you always have to provide at least 5 examples, even when you know that just 2 examples would be enough.
  2. The interface is sluggish even in the offline version of import.io.
  3. The crawling is approximately 10 times slower than in OutWit.
  4. The export is not satisfactory. If you tell import.io to export the data to CSV, it strips all commas from the scraped text. If you need to preserve the commas, you can still export the data to XLS or JSON. But Excel has a limit on the length of text in a cell, and when you exceed the limit, you cannot open the file. JSON is not a workable solution either, because the characters in the text are not always correctly escaped, which makes the JSON invalid. Hence, after several hours of web scraping with import.io, you find yourself unable to scrape import.io.
While OutWit irritates me with its deep context menus, at least it does its work.