Data is exploding and is now everywhere. As finding meaningful patterns in data, and then acting on them or using them to make predictions, becomes an everyday requirement in all segments of society, data science software tools have become much easier to use. What we’re seeing now with these tools is analogous to the rapid growth of VisiCalc in the 80s and Excel in the 90s.
Data analytics software is rapidly evolving. What started as glorified, Excel-like dedicated apps has matured to embrace advanced statistical analysis, machine learning and impressive visuals. These offerings have made great strides in using libraries, data models and visual programming environments to encapsulate mathematical complexity and to guide users toward constructing data pipelines and models more intuitively.
More complicated, high-end dedicated tools like TensorFlow and Apache Spark are also thriving in parallel. These two address high-performance, large-scale, real-time data analysis: TensorFlow with specialized models for deep learning neural nets, and Apache Spark with algorithms amenable to these more demanding environments.
Presently there is an interesting mix of lower-level programmable solutions such as Python and R, alongside high-level, visual, mostly non-programmable solutions like RapidMiner and Orange. Here is a Global 2015 Usage Survey from KDnuggets.com:
The accompanying article explains that while R is currently the leader, Python usage is growing faster and should surpass R within 2-3 years at the current rate. R’s lead is a legacy of being the first high-level, open-source data analytics package, with a very large body of statistical functionality popular in statistics, the social sciences, machine learning, etc. Although R has a huge collection of statistical and mathematical libraries and a very nice interactive web publishing system in Shiny, it is a niche language compared to Python, which receives far more developer contributions across much broader areas than just data science.
Python has subsumed much of R’s core functionality through libraries like pandas, matplotlib, etc. and has much greater interoperability as the data and Internet glue language for everything from web servers to message queues. Python is becoming the lingua franca of machine learning. For example, Orange has hooks for Python widgets, and TensorFlow only recently announced an R API (which binds only to the Python API layer instead of the underlying, much more performant C++ code). RapidMiner, however, has add-ons for custom modules programmed in both Python and R.
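To make the overlap with R concrete, here is a minimal sketch of two everyday R-style operations done in pandas. The data frame, its column names and its values are made up for illustration, and the R calls named in the comments are only rough equivalents, not exact matches in output format.

```python
import pandas as pd

# Small made-up data set, purely illustrative.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Roughly what aggregate(value ~ group, data = df, FUN = mean) does in R.
group_means = df.groupby("group")["value"].mean()

# Roughly comparable to summary(df$value) in R, though the exact set of
# statistics differs (pandas adds count and std, for example).
overview = df["value"].describe()

print(group_means)
print(overview)
```

The same data frame can then be handed to matplotlib, a web framework or a message queue client without leaving the language, which is the interoperability point made above.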
Visual programming environments can be useful for learning a language and a domain like data science. One of the most popular courses at Harvard is an introduction to Computer Science for non-majors which uses the same MIT Scratch visual programming system my middle schooler uses. The LEGO Mindstorms visual programming system also traces its roots back to MIT and shares code with National Instruments LabVIEW, commercial software for data acquisition, instrument control and industrial automation.
I’ve been evaluating both Orange and RapidMiner since they provide the most sophisticated visual programming environments among the open-source data science software packages I’ve researched. RapidMiner may have a bit more functionality, but the UX and workflow seem slightly more intuitive in Orange.
Orange lets power users inject their own Python code, packaged as widgets, into the software, which is a nice escape hatch for more powerful data manipulations. On the other hand, RapidMiner has a nice feature called “wisdom of crowds” that suggests options based upon the usage patterns of its 100,000+ user base when constructing models, setting parameters, etc. Both Orange and RapidMiner have modular plug-ins that extend core functionality to include specialized text, image and other features often found in other data science packages like Weka.
As visual programming environments, they are slow to work in compared to standard keyboard-driven IDE programming, but since you’re working at a higher level of abstraction, the process feels fast and is less error-prone. Both Orange and RapidMiner are great for learning concepts in data science, prototyping pipelines and exploratory data analysis. A great feature would be the ability to export models as code that could be scripted, automated and executed much more efficiently without the overhead of the visual interface.
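As a rough idea of what such an export could look like, here is a hypothetical sketch in plain Python. It is not an actual export format from Orange or RapidMiner; the node names, the sample rows and the `run_pipeline` helper are all invented for illustration. Each function stands in for one box on the canvas, and the list of steps stands in for the wires between them.

```python
from statistics import mean

# Hypothetical rows standing in for the output of a data-loading node.
ROWS = [
    {"label": "x", "score": 0.75},
    {"label": "y", "score": 0.5},
    {"label": "x", "score": 0.25},
    {"label": "y", "score": None},  # missing value
]


def drop_missing(rows):
    """Stand-in for a 'remove missing values' node."""
    return [r for r in rows if r["score"] is not None]


def mean_by_label(rows):
    """Stand-in for a 'group and aggregate' node."""
    groups = {}
    for r in rows:
        groups.setdefault(r["label"], []).append(r["score"])
    return {label: mean(scores) for label, scores in groups.items()}


def run_pipeline(rows, steps):
    """Apply each step in order, the way a canvas follows its wires."""
    result = rows
    for step in steps:
        result = step(result)
    return result


result = run_pipeline(ROWS, [drop_missing, mean_by_label])
print(result)
```

Once a workflow is in this form it can be put under version control, scheduled and run headless, which is exactly what the visual interface makes awkward.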