Intro Primer To WEKA Explorer For Machine Learning
Previously, in our Intro Primer For WEKA Machine Learning Software post, we introduced you to Weka and suggested that the Weka Explorer tool could be useful. In this post, we will show you why it is a useful tool for exploring your data, from doing the simplest to the most complex analysis on your data. We will guide you step by step through the analysis of simple problems using Weka Explorer tools for preprocessing, classification, clustering, association, attribute selection, and visualization of data. At the end of the tutorial, you should be able to analyze your own data with Weka Explorer using the various tools and interpret the results.
Hoang Pham Truc Phuong, firstname.lastname@example.org, is the author of this article and he contributes to RobustTechHouse Blog for our Machine Learning column. RobustTechHouse focusses on Mobile App Development in Singapore.
1. Launch Weka Explorer
Click on the “Explorer” button on “Weka GUI Chooser” and the Weka Explorer window will launch.
1.1 Status Box
The status box sits at the bottom left of the window. It displays messages that keep you informed about what’s going on in Weka. For example, if the Explorer is busy loading a file, the status box will say so.
Likewise, if the Explorer is working on data transfers, the status box will show messages about them.
Tip: Right-click inside the status box and a sub-menu will appear with two options:
- Memory information. Shows the amount of memory available to Weka.
- Run garbage collection. Forces the Java garbage collector to search for memory that is no longer needed and free it up, allowing more space for new tasks. Note that the garbage collector is constantly running as a background task anyway.
1.2 Log Button
Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in Weka, the log keeps a record of what has happened. For people using the command line or the SimpleCLI, the log also contains the full setup strings for classification, clustering, attribute selection, etc., so it is possible to copy/paste them elsewhere. Options for the dataset(s) and, if applicable, the class attribute still need to be provided by the user (e.g., -t for classifiers or -i and -o for filters).
1.3 Weka Status Icon
To the right of the status box is the Weka status icon. When no processes are running, the bird icon is taking a nap. The number beside the X symbol gives the number of concurrent processes running. When the system is idle it is zero, but it increases as the number of processes increases. When any process is started, the bird icon gets up and starts moving around. If it’s standing but stops moving for a long time, it’s sick: something has gone wrong! In that case you should restart the WEKA Explorer.
2. Preprocessing Data
Real-world data is frequently dirty, so preprocessing is an important step for successful data mining. Weka has solid support for preprocessing data. Here we take you, step by step, through data preprocessing in Weka.
2.1 Opening file from a local file system
In the stable version, you can load only basic file types such as *.arff, *.arff.gz, *.names, *.data, *.csv, *.libsvm, *.dat, *.bsi, *.xrff and *.xrff.gz. The developer version supports more file types, such as *.json, *.json.gz,…
2.2 Opening a File From a Website
You can directly load data from a given URL. For example, you can choose “open URL” and input this link http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/contact-lenses.arff
2.3 Reading data from a database
In our last post, Intro Primer For WEKA Machine Learning Software, we mentioned that Weka supports connecting to database management systems (DBMS). From Weka Explorer, you can connect to a database and load data from it. Here are the steps to load data in Weka Explorer after connecting to a DBMS:
- Choose Open DB
- The URL should read “jdbc:odbc:dbname” where dbname is the name you gave the user DSN.
- Click Connect
- Enter a Query, e.g., “select * from tablename” where tablename is the name of the database table you want to read. Or you could put a more complicated SQL query here instead.
- Click Execute
- When you’re satisfied with the returned data, click OK to load the data into the Preprocess panel.
2.4 Generate artificial data
Weka can also generate artificial sample data. Here are some of the data generators that Weka supports:
- Agrawal: Generates a people database and is based on the paper by Agrawal et al.
- BayesNet: Generates random instances based on a Bayes network
- Led24: This generator produces data for a display with 7 LEDs.
- RandomRBF: Data is generated by first creating a random set of centers for each class.
- RDG1: A data generator that produces data randomly by producing a decision list.
2.5 The Current Relation
When the data is loaded, the Preprocess panel shows a variety of information. The Current relation box describes the loaded relation, which can be thought of as a single relational table in database terminology. It has three entries:
- Relation. The name of the relation, as given in the file it was loaded from. Filters, described below, modify the name of a relation.
- Instances. The number of instances (data points/records) in the data.
- Attributes. The number of attributes (features) in the data.
The current relation box is labelled as 1 in the screenshot below.
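To make the three entries concrete, here is a minimal Python sketch that derives them from ARFF text. It is not Weka's own loader: the function name summarize_arff and the toy relation are made up for illustration, and the parser assumes a simple file with one instance per line after @data.

```python
def summarize_arff(text):
    """Return (relation name, instance count, attribute count) for simple ARFF text."""
    relation = None
    attributes = 0
    instances = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            instances += 1  # every row after @data is one instance
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            attributes += 1
        elif lower.startswith("@data"):
            in_data = True
    return relation, instances, attributes


# A tiny made-up relation (hypothetical data, not one of Weka's sample files).
SAMPLE = """\
% toy example
@relation toy_weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
rainy,65,no
"""

relation, instances, attributes = summarize_arff(SAMPLE)
print("Relation:", relation)
print("Instances:", instances)
print("Attributes:", attributes)
```

For the toy relation this reports the relation name toy_weather with 4 instances and 3 attributes, which is the same kind of summary the Current relation box gives for a loaded file.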
2.6 Working With Attributes
Below the current relation box is a box titled Attributes (labelled as number 2 in screenshot above). There are four buttons, and beneath them is a list of the attributes in the current relation.
The list has three columns:
- No. A number that identifies the attribute, following the order in which the attributes are specified in the data file.
- Selection tick boxes. These allow you to select which attributes are present in the relation.
- Name. The name of the attribute, as it was declared in the data file.
When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute (labelled as number 3 in screenshot above). This box displays the characteristics of the currently highlighted attribute in the list:
- The name of the attribute, the same as that given in the attribute list.
- The type of attribute, most commonly Nominal or Numeric.
- The number and percentage of instances in the data for which this attribute is missing.
- The number of different values that the data contains for this attribute.
- The number and percentage of instances in the data having a value for this attribute that no other instances have.
For example, load weather.arff and delete the temperature value in row 4 (press the “Edit” button and edit the value directly).
Select the “temperature” attribute and the Selected attribute box shows these statistics:
- Type: Nominal. This means the attribute is not numeric; it takes one of a fixed set of named values.
- Missing: 1 (7%). This means that one instance has no value for this attribute, which is 7% of all records.
- Distinct: 3. This means that there are 3 distinct values among the records: hot, mild and cool.
- Unique: 0. This means that no value occurs in exactly one instance; every value is shared by at least two instances.
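The Missing, Distinct and Unique figures are simple counts, and the short Python sketch below reproduces them. It is an illustration rather than Weka's code: attribute_stats is a made-up helper, and the temperature column is assumed to match the 14-row weather sample with the fourth value deleted (None marks the missing value).

```python
from collections import Counter


def attribute_stats(values):
    """Summarize one attribute column; None marks a missing value."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    missing = total - len(present)
    return {
        "missing": missing,
        "missing_pct": round(100 * missing / total),
        # number of different values that occur at all
        "distinct": len(counts),
        # number of values that occur in exactly one instance
        "unique": sum(1 for c in counts.values() if c == 1),
    }


# Assumed temperature column of the nominal weather data, row 4 removed.
temperature = ["hot", "hot", "hot", None, "cool", "cool", "cool",
               "mild", "cool", "mild", "mild", "mild", "hot", "mild"]

stats = attribute_stats(temperature)
print(stats)  # {'missing': 1, 'missing_pct': 7, 'distinct': 3, 'unique': 0}
```

Unique is 0 here because hot, mild and cool each appear several times. It would become 1 if, say, a single record had a value that no other record shared.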
2.7 Working With Filters
WEKA contains filters for discretization, normalization, resampling, attribute selection, and transformation and combination of attributes. Some techniques, such as association rule mining, need nominal rather than numeric values. In Weka, the Discretize filter performs this conversion.
To illustrate, let’s go through a small example. Load the file “weather.numeric.arff” from Weka’s sample data.
In this data set, the “temperature” attribute is numeric and continuous. For some techniques, though, we don’t need the exact temperature; we just need its level, such as cold or hot. Weka can do this with a filter. Just follow the steps below:
- In the ‘Filter’ box, click the ‘Choose’ button.
- A pull-down menu with the list of available filters appears. Select unsupervised -> attribute -> Discretize.
- Click the area marked by the red rectangle to open the Discretize options, and set bins to 3 (here we want to divide the values into three levels).
- Click the “Apply” button to get the following result:
The temperature attribute is now divided into three ranges: (-inf,71], (71,78] and (78,inf). You can then use the RenameNominalValues filter to change these to whatever labels you want.
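The cut points 71 and 78 are what equal-width binning gives for this attribute: the range from the minimum to the maximum is split into three intervals of the same width. The Python sketch below shows the arithmetic. It is not Weka's implementation, the helper names are made up, and the temperature values are assumed to be those of the weather.numeric sample file.

```python
def equal_width_cut_points(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_index(value, cut_points):
    """Index of the interval (a, b] that contains value."""
    for i, cut in enumerate(cut_points):
        if value <= cut:
            return i
    return len(cut_points)


# Assumed temperature column of weather.numeric.arff (14 instances).
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

cuts = equal_width_cut_points(temperature, 3)
counts = [0] * 3
for t in temperature:
    counts[bin_index(t, cuts)] += 1

print(cuts)    # [71.0, 78.0]
print(counts)  # [6, 4, 4]
```

With a minimum of 64 and a maximum of 85, each bin is 7 degrees wide, so the cut points fall at 71 and 78, and the 14 instances split 6, 4 and 4 across the three ranges.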
3. Data Visualization
Weka can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d visualizations (Xgobi-style). It has a “Jitter” option that helps with nominal attributes by revealing “hidden” data points that are plotted on top of one another. To open the Visualization screen, click on the ‘Visualize’ tab.
Select a square that corresponds to the attributes you would like to visualize. For example, let’s choose ‘outlook’ for the X-axis and ‘play’ for the Y-axis. Click anywhere inside the square that corresponds to ‘play’ on the left and ‘outlook’ at the top.
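To see why jitter helps with a plot such as ‘outlook’ against ‘play’, note that two nominal attributes can only produce a handful of distinct positions, so many instances land on exactly the same spot. The Python sketch below illustrates the idea by adding a small random offset to each point. It is only an assumption about how jitter works in general, not Weka's actual algorithm; the jitter function, the amount parameter and the numeric encoding of the nominal values are all hypothetical.

```python
import random


def jitter(points, amount=0.2, seed=0):
    """Shift every (x, y) point by a small random offset of at most `amount`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount))
            for x, y in points]


# Hypothetical encoding: outlook sunny=0; play no=0, yes=1.
# Five "sunny" instances: three with play=no, two with play=yes.
points = [(0, 0), (0, 0), (0, 0), (0, 1), (0, 1)]
jittered = jitter(points)

print(len(set(points)))    # distinct positions before jitter
print(len(set(jittered)))  # distinct positions after jitter
```

Before jittering, the five instances occupy only two positions on the plot. After jittering, each instance gets its own position close to the original one, so the density of each cell becomes visible.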
Here we provided an Intro Primer To WEKA Explorer For Machine Learning. Hope you found it useful.