Saturday, January 4, 2014

Comparison of SAS Enterprise Miner and RapidMiner

Enterprise Miner (EM)
+ Nice interactive visualizations. Whenever you click on a label, the corresponding data in the chart gets highlighted. Still, the plots could support linked exploration as in KNIME, where highlighting a sample in one plot highlights it in all other plots.
+ Ingenious default settings. You plug in an operator and it works. In RapidMiner you often have to preprocess the data and fiddle with the settings to get a usable result in a reasonable time.
+ You can selectively execute a portion of a flow, while in RapidMiner you can execute only the whole flow. Selective execution in EM is a great feature that accelerates development: if you make a mistake at the end of the flow, you don't have to recalculate everything from scratch. Instead, EM (by default) recalculates only the branches affected by the change. Unfortunately, this feature comes at a high price. While RapidMiner can process data "on the fly", EM has to process data in steps: each operator in the flow has to finish before the subsequent operators can start. Furthermore, EM needs fast and large storage for the intermediate results.
- Uninformative error messages. You often have to blindly test many different things until you find the root of the problem.
- It's full of bugs. Not that RapidMiner is free of bugs, but after a day of bug hunting in RapidMiner you can at least fix the bug in the source code yourself. In the case of SAS you either have to be a better hacker than I am, or you have to go through the frustrating process of communicating with the infamous SAS support.
- The EM ecosystem lacks ETL operators; instead, you are supposed to use the SAS code operator. Personally, I prefer to perform all the transformations in Enterprise Guide and then just load the final table(s) into EM.
- EM has no undo command. Instead of undo, you have to confirm each action, since each change is irreversible. This approach goes against the UI best practice of limiting the number of modal windows and making each action easily reversible.
- EM has no save command. Beware of blindly confirming modal windows! It happened to me several times that instead of a single operator, the whole flow was deleted, because I misclicked the operator and selected the drawing board instead.
- Setting up the SAS ecosystem takes some time. Count on about one month.
- Keeping the SAS ecosystem running takes a lot of energy. Count on two days of repairs for each working day.
- Importing data into EM is tedious, as you first have to define a library and then walk through a long wizard. Count on 5 minutes for even the simplest data set. But at least the example data sets from SASHELP are fast to load.
- Moving EM projects from one directory to another is trivial but unintuitive. First, you must make sure that the name of the project directory is the same as the project name. Second, you open the moved project via "New project" and select the project directory as the destination directory.
- Metadatabase. So far all it does is double the maintenance time. I am sure there must be some benefit to having a metadatabase, but so far I do not see it. Hence I strongly suggest installing the standalone version instead of the server version unless you have a good reason to do otherwise.
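
The trade-off behind selective execution mentioned in the list above can be illustrated with a toy sketch. This is not how EM is implemented; the `Flow` class and the node names are hypothetical and only show the general idea: every node stores its result (which costs storage), and a change invalidates only the changed node and the nodes downstream of it.

```python
class Flow:
    """Toy data flow in which every node stores its intermediate result."""

    def __init__(self):
        self.funcs = {}    # node name -> function of the upstream results
        self.inputs = {}   # node name -> list of upstream node names
        self.cache = {}    # node name -> stored intermediate result
        self.runs = []     # log of the nodes that were actually executed

    def add(self, name, func, inputs=()):
        # Adding or replacing a node invalidates it and everything after it.
        self.funcs[name] = func
        self.inputs[name] = list(inputs)
        self.invalidate(name)

    def invalidate(self, name):
        # Drop the stored result of this node and of all downstream nodes.
        self.cache.pop(name, None)
        for other, upstream in self.inputs.items():
            if name in upstream:
                self.invalidate(other)

    def run(self, name):
        # Execute a node only if its result is not stored yet.
        if name not in self.cache:
            args = [self.run(up) for up in self.inputs[name]]
            self.cache[name] = self.funcs[name](*args)
            self.runs.append(name)
        return self.cache[name]


flow = Flow()
flow.add("load", lambda: [3, 1, 2])
flow.add("sort", lambda rows: sorted(rows), ["load"])
flow.add("top", lambda rows: rows[-1], ["sort"])

flow.run("top")                                  # executes load, sort, top
flow.add("top", lambda rows: rows[0], ["sort"])  # fix a mistake in the last node
result = flow.run("top")                         # executes only "top" again
```

After the fix, `flow.runs` lists `load` and `sort` once and `top` twice: the upstream results were reused from the cache, at the price of keeping them stored.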

RapidMiner (RM)
+ Self-sufficient ecosystem. You can do all ETL within RapidMiner.
+ You can truly program visually in RM, since you have loop and condition operators.
+ You have the source code.
+ The drawing board is editable while a flow is running. Hence you can prepare a new experiment while RapidMiner is still performing calculations for the previous one.
- You have to rerun the whole flow every time, even if you only want to make a small refinement.
- It's too easy to set up flows/operators so that they take an eternity to calculate. For comparison, everything in SAS is optimized for big data and you don't get trapped in computational black holes too often.
- You can't stop a running node. You can only tell RapidMiner to prevent the execution of subsequent nodes. Hence forced application restarts are common.
~ Since each connection between operators in RM fulfills a distinct function, you commonly have to wire two operators with more than one connection (for example, one connection for training data and a second for testing data), which makes the flow look overcrowded. In EM a single connection between two operators is always enough, as each connection can transfer any type of data (training, testing, validation...). Hence flows in EM look tidier.
~ On the other hand, RM allows grouping operators into a single operator, which enables a "divide and conquer" strategy. While the absence of operator nesting makes small EM flows easy to understand, bigger flows look messier in EM than in RM.
- SVD and PCA are ridiculously slow and memory-hungry in comparison to the SAS versions.
- The graphical presentation of hierarchical clustering is just a joke in comparison to the output from Orange, KNIME or MATLAB.
- I miss Partial Least Squares regression and full Bayes (not just Naive Bayes). But you can get them from the WEKA plugin.
- The ROC plot should show a reference line for simpler visual evaluation.
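
The reference line wished for in the ROC plot is the diagonal from (0, 0) to (1, 1), which is the expected curve of a classifier that guesses at random. A minimal sketch in plain Python (unrelated to RapidMiner's own implementation; the labels and scores are made-up illustration data, and the scores are assumed to be distinct) shows what the line is compared against:

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of a ROC curve.

    Assumes binary labels (1 = positive, 0 = negative) and distinct scores.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the decision threshold one sample at a time.
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up example data: higher score means "more likely positive".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
reference = [(0.0, 0.0), (1.0, 1.0)]   # the diagonal reference line

print("AUC of the model:   ", auc(curve))
print("AUC of the reference:", auc(reference))
```

The area under the diagonal is 0.5, so a curve that stays above the line belongs to a model that ranks better than chance. With the line drawn in the plot, this is visible at a glance.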
