Datafilos: January 2014

Wednesday, January 29, 2014

My knowledge of SAS Enterprise Miner

Currently I am familiar with 48 of the 70 nodes (69%) in Enterprise Miner.

Thursday, January 23, 2014

Executing MBR score.sas generated by SAS Enterprise Miner in SAS Enterprise Guide

Non-data-step models such as MBR and SVM are executed differently from data-step models such as tree and regression. Hence it is not possible to call score.sas like this:

data output;
   set input;
   %include "C:\score.sas";
run;

Instead you have to use something like:

/* Note: For k-NN, score.sas is not enough by itself.
   k-NN also needs the whole training set. */

/* Define EMSCORE library where MBR will look for training
   and scoring data. It has to be named EMSCORE. */
libname EMSCORE base "C:\EMSCORE";

/* Define training data. It has to be named EM_TRAIN_MBR
   and reside in EMSCORE. */
data EMSCORE.EM_TRAIN_MBR;
    set learning;
run;

/* Define scoring data. Scoring data will be overwritten
   with the MBR prediction. The macro variable pointing to it
   has to be named em_score_output. */
data EMSCORE.scoring;
    set scoring;
run;
%let em_score_output= EMSCORE.scoring;

/* Execute MBR */
data _null_;
   %include "C:\score.sas";
run;
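The note that k-NN needs the whole training set at scoring time can be illustrated with a minimal sketch. It is written in plain Python rather than SAS, it is not how Enterprise Miner implements MBR, and the function name `knn_predict` and the toy data are made up for illustration:

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Distances are computed against every stored training row, which is
    # why the training set itself must be available at scoring time.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train_rows = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9)]
train_labels = ["rock", "rock", "mine", "mine"]
prediction = knn_predict(train_rows, train_labels, (3.8, 4.0), k=3)
```

Unlike a tree or a regression, there are no fitted coefficients to ship: the "model" is the stored rows, so the score code alone cannot produce a prediction.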

Monday, January 20, 2014

Comparison of variable selection methods

Since I didn't know which variable selection method to use, I performed a trivial test on the Sonar dataset. The Sonar dataset has 60 attributes, and I arbitrarily decided to reduce the number of attributes to 10. Then I measured classification accuracy with ten-fold cross-validation. To get an idea of how much the feature selection methods depend on the classifier, I tried three different models: naive Bayes, k-NN and classification tree.

Based on the comparison, the best method to use is SVM attribute selection. However, this method requires parameter tuning. The next best variable selection method is Chi2. The disadvantage of this method is that it favors attributes with many levels. Hence the performance of Chi2 could be severely hindered on data sets whose attributes differ widely in their number of levels, unlike the Sonar set. The last method of the top three is information gain ratio. Its advantage is that, unlike Chi2, it can handle attributes with diverse numbers of levels.
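The claim about the number of levels can be sketched on a toy example. The code below is a minimal illustration in plain Python, assuming the raw Pearson chi-square statistic (no p-value or degrees-of-freedom correction) and the usual gain ratio definition. The data and the names `row_id` and `noisy_flag` are invented, and this is not the tool used for the comparison above:

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def chi2_statistic(attr, target):
    """Raw Pearson chi-square statistic of the attr x target contingency table."""
    n = len(attr)
    observed = Counter(zip(attr, target))
    attr_totals, target_totals = Counter(attr), Counter(target)
    stat = 0.0
    for a, a_count in attr_totals.items():
        for t, t_count in target_totals.items():
            expected = a_count * t_count / n
            stat += (observed[(a, t)] - expected) ** 2 / expected
    return stat


def gain_ratio(attr, target):
    """Information gain divided by the entropy of the attribute itself."""
    n = len(attr)
    conditional = 0.0
    for a, a_count in Counter(attr).items():
        subset = [t for x, t in zip(attr, target) if x == a]
        conditional += a_count / n * entropy(subset)
    info_gain = entropy(target) - conditional
    split_info = entropy(attr)
    return info_gain / split_info if split_info > 0 else 0.0


# Invented toy data: 8 rows, two balanced classes.
target = ["A"] * 4 + ["B"] * 4
row_id = list(range(8))                  # useless ID-like attribute, 8 levels
noisy_flag = [0, 0, 0, 1, 1, 1, 1, 1]    # informative attribute, 2 levels

chi2_scores = {"row_id": chi2_statistic(row_id, target),
               "noisy_flag": chi2_statistic(noisy_flag, target)}
ratio_scores = {"row_id": gain_ratio(row_id, target),
                "noisy_flag": gain_ratio(noisy_flag, target)}
```

On this toy data the raw chi-square statistic ranks the ID-like `row_id` above `noisy_flag`, while the gain ratio ranks them the other way round, because it divides by the entropy of the attribute and so penalizes attributes with many levels.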

Sunday, January 19, 2014

Oh people of SAS, you are amazing!

Sometimes you have to use a comma between parameters, for example in the definition of a macro:
%macro append(columnName=, labelName=);
But in other cases, like dropping, you can't use a comma between parameters:
data table(drop=attr1 attr2 attr3);
If you type DAAT instead of DATA, your program will run anyway. A polite note in your SAS log tells you that it assumed you meant DATA and went ahead and executed on that assumption, but if not, hey, feel free to let it know.

Thursday, January 16, 2014

SAS experience

How can you tell that I have been working in SAS? My hair is unbelievably swept back. I am afraid that if they tell me tomorrow, too, that I will be working with SAS, I will end up like Homer Simpson, who lost part of his hair every time he learned he was going to have another child.

Work

What don't I like about my current workplace? The carpet in the office. As I shuffle my feet across the carpet, I charge myself with static electricity. Then all it takes is to touch the water faucet and sparks fly.

But at least I can say that I positively glow at work.

Tuesday, January 14, 2014

Nest Thermostat

Since we have self-learning thermostats, I predict that sooner or later we will have self-learning electric kettles. When you wake up, the water for the morning coffee will be preheated, so when you push the button you don't have to wait an eternity for the morning cup of coffee. And in the evening, when it detects you are returning home, it preheats the water. It doesn't matter whether you use the hot water for a cup of tea or for dinner. It will be there, ready at the push of a button. But in the meantime the kettle is not going to keep the water warm, since it knows no one is home. Comfort and efficiency blended together.

PS: Did you notice how hard it is to estimate how much water to pour into the kettle, so you always pour a bit more than necessary just to be sure you don't end up with too little? If you have a favorite cup, the intelligent kettle can learn it and preheat the exact amount of water. And it can go so far that you push the button and, based on its sensors, it fills any cup just the way you like - to the brim or a centimeter below. Your choice.

Thursday, January 9, 2014

SAS Miner error messages

Are you getting error messages when you run MBR (Memory Based Reasoning, also known as k-nearest neighbors)?

ERROR in MBR: Run time error was encountered. Please see log for more details.

SOLUTION: Edit the variables entering MBR and deselect all ID attributes like _dataobs_. That should do the trick.

NOTE: MBR processes only interval variables, and it's not sufficient to just change the attribute level from binary to interval.

ERROR in Partial Least Squares: No factor was extracted. Use different classifier.

SOLUTION: Use variable selection to reduce the search space.

Saturday, January 4, 2014

Optimism

I saw the tram arriving. It was already past the point where my running speed could make a difference, and whether I would catch it depended only on the driver and whether he would wait. But I told myself: it's morning, and in the morning people tend to be more forgiving. And it's a new year, people are optimistic. And out of that optimism I set off to catch the tram. I was already standing in the tram doorway and planning to step in as quickly as possible. What if sadistic urges awoke in the driver?

But as I was stepping into the tram, part of my body was still flying along parallel to it. I sprawled out in that doorway in textbook fashion.

Comparison of SAS Enterprise Miner and RapidMiner

Enterprise Miner (EM)
+ Nice interactive visualizations. Whenever you click on a label, the corresponding data in the chart gets highlighted. Still, the plots could support interactive exploration similar to KNIME, where when you highlight a sample in one plot, it gets highlighted in all other plots.
+ Ingenious default settings. You plug in the operator and it works. In RapidMiner you often have to preprocess the data and fiddle with the settings to get a usable result in a reasonable time.
+ You can selectively execute a portion of a flow, while in RapidMiner you can execute only the whole flow. Selective execution in EM is a great feature that accelerates development: if you make a mistake at the end of the flow, you don't have to recalculate everything from scratch. Instead, EM (by default) recalculates only the branches affected by the change. Unfortunately, this feature takes a high toll. While RapidMiner can process data "on the fly", EM has to process data in steps - each operator in the flow has to finish before the subsequent operators can start. Furthermore, in EM you need fast and big storage for intermediate calculations.
- Uninformative error messages. You often have to blindly test many different things until you find the root of the problem.
- It's full of bugs. Not that RapidMiner is free of bugs. But after a day of bug hunting in RapidMiner you can at least fix the bug in the source code yourself. In the case of SAS you either have to be a better hacker than I am, or you have to go through the frustrating process of communicating with the infamous SAS support.
- The EM ecosystem lacks ETL operators - instead of them, you are supposed to use the SAS Code operator. Personally, I prefer to perform all the transformations in Enterprise Guide and then just load the final table(s) into EM.
- EM has no undo command. Instead of undo, you have to confirm each action, since each change is irreversible. This approach goes against the UI best practice of limiting the number of modal windows and making each action easily reversible.
- EM has no save command. Beware of blindly confirming modal windows! It happened to me several times that instead of deleting a single operator, the whole flow was deleted because I misclicked the operator and selected the drawing board instead.
- Setting up the SAS ecosystem takes some time. Count on roughly one month.
- Keeping the SAS ecosystem running takes a lot of energy. Count on two days of repairs for each working day.
- Importing data into EM is tedious, as you first have to define a library and then walk through a long wizard. Count on 5 minutes for even the simplest data set. But at least the example data sets from SASHELP are fast to load.
- Moving EM projects from one directory to another is trivial but unintuitive. First, you must make sure that the name of the project directory is the same as the project name. Second, you open the moved project via "New project", selecting the project directory as the destination directory.
- Metadatabase. So far all it does is double the maintenance time. I am sure there must be some benefit to having a metadatabase, but so far I do not see it. Hence I strongly suggest that everyone install the standalone version instead of the server version unless you have a good reason to do otherwise.
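The trade-off behind selective execution (recalculate only the affected branches, at the price of storing every intermediate result) can be sketched in a few lines. This is a minimal illustration in plain Python, not how EM is implemented; the `Node` class and the three-step flow are hypothetical:

```python
class Node:
    """One operator in a flow; caches its output until an upstream change."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = list(inputs)
        self.dependents = []
        self.cache = None
        self.dirty = True
        for node in self.inputs:
            node.dependents.append(self)

    def invalidate(self):
        """Mark this node and everything downstream of it for recalculation."""
        self.dirty = True
        self.cache = None
        for node in self.dependents:
            node.invalidate()

    def run(self, log):
        """Return the output, recomputing only if the cached result is stale."""
        if self.dirty:
            args = [node.run(log) for node in self.inputs]
            self.cache = self.func(*args)
            self.dirty = False
            log.append(self.name)
        return self.cache


# Hypothetical three-step flow: load -> filter -> summarize.
load = Node("load", lambda: [3, 1, 4, 1, 5, 9, 2, 6])
keep = Node("filter", lambda rows: [r for r in rows if r > 2], [load])
total = Node("summarize", lambda rows: sum(rows), [keep])

first_log = []
first_result = total.run(first_log)       # every node executes once

keep.func = lambda rows: [r for r in rows if r > 4]   # edit one operator
keep.invalidate()

second_log = []
second_result = total.run(second_log)     # "load" is served from its cache
```

After the edit only "filter" and "summarize" run again. The price is that every node holds on to its full output, which mirrors the need for fast and big storage for intermediate results.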

RapidMiner (RM)
+ Self-sufficient ecosystem. You can do all ETL within RapidMiner.
+ You can truly visually program in RM since you have loop operators and conditions.
+ You have the source code.
+ The drawing board is editable during the runtime. Hence you can prepare a new experiment while RapidMiner is still performing calculations on the last experiment.
- Every time, you have to rerun the whole flow even though you only want to make a small refinement.
- It's too easy to set up flows/operators so that they take an eternity to calculate. For comparison, everything in SAS is optimized for big data and you don't get trapped in computational black holes too often.
- You can't stop a running node. You can only tell RapidMiner not to execute the subsequent nodes. Hence forced application restarts are common.
~ Since each connection between operators in RM fulfills a distinct function, it's common that you have to wire two operators with more than one connection (for example one connection for training data and a second connection for testing data), making the flow look overcrowded. In EM a single connection between two operators is always enough, as each connection can transfer any type of data (training, testing, validation...). Hence flows in EM look tidier.
~ On the other hand, RM allows grouping of operators into a single operator, allowing a "divide and conquer" strategy. While the absence of operator nesting in EM makes small EM flows easy to understand, bigger flows look messier in EM than in RM.
- SVD and PCA are ridiculously slow and memory consuming in comparison to SAS versions.
- Graphical presentation of hierarchical clustering is just a joke in comparison to output from Orange, KNIME or MATLAB.
- I miss Partial Least Squares regression and full Bayes (not just Naive Bayes). But you can get them from the WEKA plugin.
- The ROC plot should show a reference line for simpler visual evaluation.

Wednesday, January 1, 2014

Programming languages/scripts in which I wrote something

Pascal
VisualBasic
Assembler
C
C++
C#
Python
MATLAB
Lisp
Prolog
R
NetLogo
Java
JavaScript
PHP
SQL
SAS (if we relax the requirements for the definition of programming languages/scripts)

And I am not good at any of them.