Monday 13 October 2014

How to Train My own Model using Mahout

Mahout Training A Model

       Mahout is a scalable machine learning library focused primarily on collaborative filtering, clustering, and classification. Follow the step-by-step process below to train your own model on your own data. In this example we train on the 20news-all data folder.
 
1. Create a working directory and copy all the data into it. Working on a copy ensures that your original data is not lost.
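
Step 1 could be scripted roughly as follows. This is only a sketch: the function name make_workdir and the paths in the example are placeholders made up for illustration, and only the folder name 20news-all comes from this post.

```shell
#!/usr/bin/env bash
# Copy a dataset into a working directory so the original stays untouched.
# Usage: make_workdir <source-data-dir> <work-dir>   (hypothetical helper)
make_workdir() {
  local src="$1" work="$2"
  # Stop early if the source folder does not exist.
  [ -d "$src" ] || { echo "source not found: $src" >&2; return 1; }
  mkdir -p "$work"
  # -R copies the whole tree; the copy keeps the source folder name.
  cp -R "$src" "$work/"
}

# Example (placeholder paths):
# make_workdir /path/to/20news-all mahout-work
```

All later commands would then be run against the copy inside the working directory.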

2.    Create sequence files from the training data using the following command:
a.    Command:
./bin/mahout seqdirectory -i 20news-all -o 20news-seq
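
seqdirectory reads a directory tree of text files, so it can help to check the input folder first. The sketch below assumes the data is laid out with one subfolder per category, as in the 20 newsgroups data; the helper name count_per_category is made up for this example.

```shell
# Print "<category> <file count>" for each subfolder of a data folder.
# (hypothetical helper, assumes one subfolder per category)
count_per_category() {
  local root="$1" dir
  for dir in "$root"/*/; do
    # Skip the unexpanded glob when there are no subfolders.
    [ -d "$dir" ] || continue
    printf '%s %s\n' "$(basename "$dir")" "$(find "$dir" -type f | wc -l | tr -d ' ')"
  done
}

# Example:
# count_per_category 20news-all
```

A category with zero or very few files is worth fixing before any vectors are built.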
3.    Next, convert the sequence files to vectors. Mahout's algorithms work on vector files only, so the data files must first be converted to sequence files and the sequence files then to vector files.
a.    Command:
./bin/mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
The -wt (--weight) option accepts TF or TFIDF.
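
The two weighting schemes differ in how a term count becomes a vector entry: TF uses the term count itself, while TF-IDF also scales down terms that occur in many documents. The sketch below uses the textbook form tf * ln(N / df); this is an assumption for illustration, and Mahout's own implementation may use a different variant of the formula.

```shell
# Textbook TF-IDF weight: tf * ln(n / df)
#   tf = count of the term in one document
#   df = number of documents that contain the term
#   n  = total number of documents
# (illustrative helper, not Mahout's exact formula)
tfidf() {
  LC_ALL=C awk -v tf="$1" -v df="$2" -v n="$3" \
    'BEGIN { printf "%.4f\n", tf * log(n / df) }'
}

# Example: a term that appears in every document gets weight zero.
# tfidf 3 1000 1000
```

With this form, a term found in all documents carries no weight, whereas a rare term keeps most of its count.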
4.    In practice, training examples are typically divided into two parts. One part, known as the training data, consists of 80 to 90 percent of the available data and is used to produce the model. The second part, called the test data, is then given to the model without telling it the desired answers, although they are known. The model's output on the test data is then compared with the desired output. Here, we split the data set randomly into 80% training data and 20% test data. The value 20 after the --randomSelectionPct parameter indicates that 20% of the data will be selected randomly for the test data set.
a.    Command:
./bin/mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
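
The idea of a random percentage split can be tried on any text file with one record per line. The sketch below illustrates the concept only and is not how Mahout implements the split internally; the helper name random_split and the default seed are assumptions.

```shell
# Send roughly <pct> percent of the input lines to a test file and the rest
# to a training file. Each line is assigned independently at random.
# Usage: random_split <input> <pct> <train-out> <test-out> [seed]
random_split() {
  local input="$1" pct="$2" train_out="$3" test_out="$4" seed="${5:-42}"
  # Start from empty output files.
  : > "$train_out"
  : > "$test_out"
  awk -v pct="$pct" -v seed="$seed" -v train_out="$train_out" -v test_out="$test_out" '
    BEGIN { srand(seed) }
    {
      if (rand() * 100 < pct) print > test_out
      else print > train_out
    }
  ' "$input"
}

# Example:
# random_split all-docs.txt 20 train.txt test.txt
```

Every record lands in exactly one of the two files, so the test data never overlaps with the training data.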
5.    Training the Naive Bayes model
a.    Command:
./bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow


6.    Self-testing on the training set: Your own model is now ready, but we still need to check its accuracy. First, test it on the training set itself.
a.    Command:
./bin/mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing
b.    Output:
i.    Kappa                               0.4955
ii.   Accuracy                            99.103%
iii.  Reliability                         99.0858%
iv.   Reliability (standard deviation)    0.0952
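
The accuracy figure is the share of documents that were classified correctly. A minimal sketch of that calculation is shown below; the helper name accuracy and the counts in the example are invented for illustration and are not taken from the run above.

```shell
# Accuracy as a percentage: correctly classified documents / all documents.
# (hypothetical helper)
accuracy() {
  LC_ALL=C awk -v correct="$1" -v total="$2" 'BEGIN {
    if (total == 0) print "n/a"
    else printf "%.3f%%\n", 100 * correct / total
  }'
}

# Example with invented counts: 90 of 100 documents classified correctly.
# accuracy 90 100
```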
7.    Testing on the holdout set: Here we use the holdout set to measure the model's accuracy on data it did not see during training.
a.    Command:
./bin/mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing
b.    Output:
i.    Accuracy                            90.093%
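
The commands from steps 2 to 7 can be chained into one script. The sketch below is one possible way to do that: the run wrapper, the DRY_RUN switch and the MAHOUT variable are assumptions added for convenience, while the Mahout commands themselves are the ones used in this post.

```shell
#!/usr/bin/env bash
# Chain the commands from steps 2 to 7. With DRY_RUN=1 the commands are only
# printed, which makes it easy to review them before a real run.
MAHOUT="${MAHOUT:-./bin/mahout}"   # assumed location of the Mahout launcher

# Print a command and, unless DRY_RUN=1, execute it.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Each step runs only if the previous one succeeded.
train_pipeline() {
  run "$MAHOUT" seqdirectory -i 20news-all -o 20news-seq &&
  run "$MAHOUT" seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf &&
  run "$MAHOUT" split -i 20news-vectors/tfidf-vectors \
      --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors \
      --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential &&
  run "$MAHOUT" trainnb -i 20news-train-vectors -el -o model -li labelindex -ow &&
  run "$MAHOUT" testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing &&
  run "$MAHOUT" testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing
}

# Example: review the commands without running Mahout.
# DRY_RUN=1 train_pipeline
```

Note that both testnb calls write to 20news-testing with -ow, as in the steps above, so the second run replaces the output of the first.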

This is how to train a model in Apache Mahout. The next post will show how to use this model to get the sentiment of a comment.
