Intro Primer For WEKA Machine Learning Software
Weka is a machine learning and data mining workbench. The name is an acronym for the Waikato Environment for Knowledge Analysis. It contains a collection of visualization tools and algorithms for data analysis and predictive modeling, and its graphical user interfaces make it convenient to experiment with machine learning and data mining models on your data.
Hoang Pham Truc Phuong, hptphuong@gmail.com, is the author of this article and he contributes to RobustTechHouse Blog for our Machine Learning column. RobustTechHouse is a web & mobile app development house focusing on Financial (Fintech) and ECommerce sectors and likes to dabble with data analysis and machine learning too.
[Updated on 25 May 2015] Also see our follow-up post, Intro Primer To WEKA Explorer For Machine Learning.
Why Weka?
Weka supports several standard data mining tasks with many standard data mining algorithms, ranging from simple to highly complex ones. All of Weka’s techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes. Here are some main features of Weka:
Data Preprocessing
Weka supports various file formats, e.g. CSV and Matlab, as well as its own file format (ARFF). It also supports most common database management systems (DBMS), including HSQL, SQL Server, MySQL and PostgreSQL, through Java database connections (JDBC). For data processing, Weka has over 75 filtering methods, ranging from basic operators to advanced ones such as principal component analysis.
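To give a feel for what an ARFF file looks like, here is a minimal sketch that converts a small CSV string into ARFF text. It is a hand-rolled illustration, not Weka's own converter: the class name ArffSketch and the toy weather data are made up for this example, and the code assumes a header row, no quoted fields and no missing values.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: shows the ARFF layout (@relation, @attribute, @data).
// Not Weka's converter; quoting and missing values are ignored (assumption).
class ArffSketch {

    // Returns true when every value in the column parses as a number.
    static boolean isNumeric(List<String[]> rows, int col) {
        for (String[] row : rows) {
            try {
                Double.parseDouble(row[col]);
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    // Converts CSV text (first line is the header) into ARFF text.
    static String toArff(String relation, String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] header = lines[0].split(",");
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            rows.add(lines[i].split(","));
        }

        StringBuilder out = new StringBuilder("@relation " + relation + "\n\n");
        for (int c = 0; c < header.length; c++) {
            if (isNumeric(rows, c)) {
                out.append("@attribute ").append(header[c]).append(" numeric\n");
            } else {
                // Nominal attribute: list the distinct values in order of appearance.
                Set<String> values = new LinkedHashSet<>();
                for (String[] row : rows) {
                    values.add(row[c]);
                }
                out.append("@attribute ").append(header[c])
                   .append(" {").append(String.join(",", values)).append("}\n");
            }
        }
        out.append("\n@data\n");
        for (String[] row : rows) {
            out.append(String.join(",", row)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up toy data for illustration.
        String csv = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n";
        System.out.print(toArff("weather", csv));
    }
}
```

In practice you would let Weka load the CSV directly; the sketch is only meant to show how attributes and data rows are declared in the flat, single-relation format Weka expects.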
Classification
Weka offers many classification methods. Classifiers can be divided into “Bayesian” methods (Naive Bayes, Bayesian nets etc.), lazy methods (nearest neighbor and variants), rule-based methods (decision tables, OneR, RIPPER), tree learners (J48, which is Weka’s implementation of C4.5, Naive Bayes trees, M5 etc.), function-based learners (linear regression, SVMs, Multilayer Perceptron, Gaussian processes) and miscellaneous methods.
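To make one of the rule-based methods concrete, here is a minimal sketch of the idea behind OneR: build a one-attribute rule for each attribute (each attribute value predicts its majority class) and keep the attribute whose rule makes the fewest training errors. This is a simplified illustration for nominal data only, not Weka's OneR class; the class name OneRSketch and the toy data are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the OneR idea for nominal data; not Weka's implementation.
class OneRSketch {

    // For one attribute column, maps each attribute value to its majority class.
    static Map<String, String> ruleFor(String[][] data, int attr, int classIdx) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] row : data) {
            counts.computeIfAbsent(row[attr], k -> new HashMap<>())
                  .merge(row[classIdx], 1, Integer::sum);
        }
        Map<String, String> rule = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            rule.put(e.getKey(), best);
        }
        return rule;
    }

    // Counts the training rows that the rule misclassifies.
    static int errors(String[][] data, int attr, int classIdx, Map<String, String> rule) {
        int wrong = 0;
        for (String[] row : data) {
            if (!rule.get(row[attr]).equals(row[classIdx])) {
                wrong++;
            }
        }
        return wrong;
    }

    // Returns the index of the attribute whose one-attribute rule makes the fewest errors.
    static int bestAttribute(String[][] data, int classIdx) {
        int best = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < data[0].length; a++) {
            if (a == classIdx) {
                continue;
            }
            int err = errors(data, a, classIdx, ruleFor(data, a, classIdx));
            if (err < bestErrors) {
                best = a;
                bestErrors = err;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up toy data: columns are outlook, windy, play (the class).
        String[][] data = {
            {"sunny", "false", "no"}, {"sunny", "true", "no"},
            {"overcast", "false", "yes"}, {"overcast", "true", "yes"},
            {"rainy", "false", "yes"}, {"rainy", "true", "no"}
        };
        System.out.println("Best attribute index: " + bestAttribute(data, 2));
    }
}
```

On this toy data the outlook rule misclassifies one row while the windy rule misclassifies two, so the sketch picks outlook.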
Clustering
Weka provides most of the classic clustering algorithms, such as Simple KMeans, hierarchical clustering and simple expectation maximization (EM).
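As a rough illustration of what a k-means style algorithm does, the sketch below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. It is not Weka's Simple KMeans: the class name KMeansSketch is made up, and the initial centroids are supplied by the caller (an assumption that keeps the example deterministic).

```java
import java.util.Arrays;

// Illustrative k-means sketch; not Weka's implementation.
class KMeansSketch {

    // Index of the centroid closest to p (squared Euclidean distance).
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[k][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = k;
            }
        }
        return best;
    }

    // Repeats assignment and update rounds until no assignment changes or maxIter is hit.
    // Centroids are updated in place; returns the final cluster index of each point.
    static int[] cluster(double[][] points, double[][] centroids, int maxIter) {
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int k = nearest(points[i], centroids);
                if (k != assign[i]) {
                    assign[i] = k;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move each centroid to the mean of its points.
            for (int k = 0; k < centroids.length; k++) {
                double[] sum = new double[centroids[k].length];
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] != k) {
                        continue;
                    }
                    for (int j = 0; j < sum.length; j++) {
                        sum[j] += points[i][j];
                    }
                    n++;
                }
                if (n == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                for (int j = 0; j < sum.length; j++) {
                    centroids[k][j] = sum[j] / n;
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        // Made-up points forming two obvious groups.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centroids, 100)));
    }
}
```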
Attribute Selection
The set of attributes used is essential for classification performance. Various selection criteria and search methods are available.
Data Visualization
Data can be inspected visually by plotting attribute values against the class, or against other attribute values. Classifier output can be compared to training data in order to detect outliers and observe classifier characteristics and decision boundaries. For specific methods, there are specialized tools for visualization, such as a tree viewer for any method that produces classification trees, a Bayes network viewer with automatic layout, and a dendrogram viewer for hierarchical clustering.
Time Series Forecasting
This is a new function in Weka from version 3.7.x (the developer version). Weka supports many methods for predicting time series, such as function-based learners (Gaussian processes, linear regression, multilayer perceptron neural networks, SMOreg support vector machines for regression), lazy methods (k-nearest neighbours, locally weighted learning and KStar) and trees (random forest, random tree).
From my experience, here are some reasons which make Weka a good toolbox for Machine Learning:
1. Easy to use graphical user interfaces.
2. Contains most of the powerful algorithms published for machine learning.
3. Free availability under the GNU General Public License.
4. Portability, since it is fully implemented in the Java programming language and runs on almost any modern computing platform.
5. A comprehensive collection of data pre-processing and modelling techniques.
Intro to the Weka GUI
1. Download and Install
Download from Weka Download Link. There are two versions of Weka: the stable version (3.6.12) and the developer version (3.7.12). I personally prefer the developer version because it allows me to install more packages, e.g. time series forecasting.
After downloading, unzip the zip file and run this command:
> java -Xmx1000M -jar weka.jar
The startup window differs subtly between the stable and developer versions.
To connect to a DBMS, follow these steps:
1. Download the Java database driver (JDBC) compatible with your DBMS, e.g. mysql-connector-java, sql-connector-java
2. Use this syntax to run Weka with the driver on the classpath:
> java -Xmx1000M -cp weka_path:java_connection_path weka.gui.GUIChooser
Here is the example I used to connect to mysql:
> java -Xmx1000M -cp /home/phuong/weka-3-7-12/weka.jar:/home/phuong/java_conn/mysql-connector-java-5.1.34-bin.jar weka.gui.GUIChooser
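One detail worth noting about the command above is that the classpath separator depends on the operating system: ':' on Linux and macOS, ';' on Windows. The sketch below assembles the same command in a portable way using File.pathSeparator. The class name WekaLauncherSketch and the jar paths are hypothetical placeholders, not part of Weka.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: builds the Weka launch command with a JDBC driver on the classpath.
class WekaLauncherSketch {

    // Joins the two jar paths with the given classpath separator and returns the full command.
    static List<String> command(String wekaJar, String driverJar, String separator) {
        return Arrays.asList(
            "java", "-Xmx1000M",
            "-cp", wekaJar + separator + driverJar,
            "weka.gui.GUIChooser");
    }

    public static void main(String[] args) {
        // Hypothetical paths; replace them with the locations on your machine.
        List<String> cmd = command(
            "/path/to/weka.jar",
            "/path/to/mysql-connector-java.jar",
            File.pathSeparator);
        System.out.println(String.join(" ", cmd));
        // To actually launch Weka from Java, you could pass cmd to ProcessBuilder:
        // new ProcessBuilder(cmd).inheritIO().start();
    }
}
```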
2. Weka Explorer
In Weka Explorer, you can visualize and clean your data and try some algorithms for clustering, classification and forecasting. Some features differ between the stable and developer versions of Weka. Here, I am using “Weka Explorer” in the developer version.
The Explorer interface is divided into 11 tabs in two rows: the top row contains 5 features and the bottom row contains 6. The top row is only available in the developer version.
- RConsole: An extension that combines Weka with the R language and reuses many functions from R.
- Parallel Coordinates Plot: a common way of visualizing high-dimensional geometry and analyzing multivariate data.
- Projection Plot: To apply algorithms such as clustering algorithms and visualize the results on the graph directly.
- Visualize 3D: Plot your data in 3D space!
- Forecasting: Used for time series forecasting. You will find well-known algorithms such as SVM and regression here.
- Preprocess: Load a dataset and manipulate the data into a form that you want to work with.
- Classify: Select and run classification and regression algorithms to operate on your data.
- Cluster: Select and run clustering algorithms on your dataset.
- Associate: Run association algorithms to extract insights from your data.
- Select Attributes: Run attribute selection algorithms on your data to select those attributes that are relevant to the feature you want to predict.
- Visualize: Visualize the relationship between attributes.
3. Weka Experimenter
Unlike Weka Explorer, which is used for analysis and experimenting with algorithms, “Weka Experimenter” is for designing experiments with your selection of algorithms and datasets, running the experiments and analyzing the results. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyze the results to determine if one of the schemes is statistically better than the other schemes.
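To illustrate what "statistically better" can mean, here is a minimal sketch of a plain paired t-statistic computed over the accuracies of two schemes on the same runs. This is a simplification: the Experimenter's own tester may apply corrections that this sketch omits, and the class name PairedTTestSketch and the accuracy figures are made up for illustration.

```java
// Illustrative sketch of a plain paired t-statistic; not the Experimenter's own tester.
class PairedTTestSketch {

    // Computes t = mean(d) / (sd(d) / sqrt(n)), where d[i] = a[i] - b[i].
    static double tStatistic(double[] a, double[] b) {
        int n = a.length;
        double mean = 0;
        for (int i = 0; i < n; i++) {
            mean += (a[i] - b[i]) / n;
        }
        double sumSq = 0;
        for (int i = 0; i < n; i++) {
            double dev = (a[i] - b[i]) - mean;
            sumSq += dev * dev;
        }
        double sd = Math.sqrt(sumSq / (n - 1)); // sample standard deviation of the differences
        return mean / (sd / Math.sqrt(n));
    }

    public static void main(String[] args) {
        // Made-up accuracies of two schemes on the same five runs.
        double[] schemeA = {0.80, 0.82, 0.78, 0.85, 0.81};
        double[] schemeB = {0.75, 0.80, 0.74, 0.79, 0.78};
        System.out.println("t = " + tStatistic(schemeA, schemeB));
    }
}
```

With five paired results there are 4 degrees of freedom, for which the two-sided 5% critical value is about 2.776, so a t-statistic larger than that in absolute value would suggest the difference between the two schemes is unlikely to be due to chance alone.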
4. Knowledge Flow
Knowledge Flow helps you create a process to apply machine learning. It helps you graphically design your process and run the design that you created. The analysis process goes like this: loading and transforming of input data, followed by running of algorithms and then presentation of results.
References
You can review some links below for more information about Weka.
- An Introduction To The WEKA Data Mining System
- Data Mining With Weka
- More Data Mining With Weka
- WEKA Documentations
Conclusion
Here we provided an Intro Primer For WEKA Machine Learning Software. Hope you found it useful.