Mahout Training A Model
Apache Mahout is a scalable machine learning library whose algorithms focus primarily on collaborative filtering, clustering, and classification. Follow this step-by-step process to train your own model on your own data. In this example, we train on the 20news-all data folder.
1. Create a working directory and copy all the data into it. Work on a duplicate so that your original data is not lost.
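A minimal sketch of this step, assuming a POSIX shell. The function name backup_data and the directory name mahout-work are hypothetical and only serve as illustration; 20news-all is the data folder used in this post.

```shell
# Copy the data folder into a working directory so the original stays untouched.
# backup_data is a hypothetical helper name, not part of Mahout.
backup_data() {
    src="$1"    # original data folder, e.g. 20news-all
    work="$2"   # working directory, e.g. mahout-work (assumed name)
    mkdir -p "$work" || return 1
    cp -r "$src" "$work/" || return 1
}

# Example call (uncomment to run):
# backup_data 20news-all mahout-work
```

After the call, the copy lives at mahout-work/20news-all and the later commands can be run against it.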
2. Create sequence files from the training data using the following command:
a. Command:
./bin/mahout seqdirectory -i 20news-all -o 20news-seq
3. Convert the sequence files to vectors. Mahout's algorithms work on vector files only, so the data must first be converted to sequence files and then from sequence files to vector files.
a. Command:
./bin/mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
The -wt (--weight) option accepts TF or TFIDF.
4. Split the vectors into training and test sets. In practice, training examples are typically divided into two parts. One part, known as the training data, consists of 80 to 90 percent of the available data and is used to produce the model. The second part, called the test data, is then given to the model without telling it the desired answers, although they are known. This makes it possible to compare the output of the classifier with the desired output. Here, we split the data set randomly into 80% training data and 20% test data. The value 20 after the --randomSelectionPct parameter indicates that 20% of the data will be selected randomly for the test data set.
a. Command:
./bin/mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
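To get a feel for what the percentage means, here is a small arithmetic sketch. The helper name split_sizes and the document count of 1000 are made up for illustration. Because Mahout selects the test documents randomly, treat the result as the approximate size of each set, not an exact guarantee.

```shell
# Approximate sizes of the training and test sets for a given percentage.
# split_sizes is a hypothetical helper, not a Mahout command.
split_sizes() {
    total="$1"   # number of documents (1000 below is an invented figure)
    pct="$2"     # value passed to --randomSelectionPct
    test_n=$(( total * pct / 100 ))
    train_n=$(( total - test_n ))
    echo "train=$train_n test=$test_n"
}

split_sizes 1000 20   # prints: train=800 test=200
```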
5. Train the Naive Bayes model:
a. Command:
./bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow
6. Self-test on the training set: your model is now ready, but we need to check its accuracy, first by testing it against the training set itself.
a. Command:
./bin/mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing
b. Output:
7. Test on the holdout set: here we use the holdout (test) set to measure the model's accuracy.
a. Command:
./bin/mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing
b. Output:
This is how to train a model in Apache Mahout. The next post will cover how to use this model to get the sentiment of a comment.
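For convenience, the steps of this post can be chained into one script. The following is only a sketch under a few assumptions: the MAHOUT and DRY_RUN variables and the run and train_pipeline helpers are my own naming, not part of Mahout, and the paths are the ones used in this post. By default it only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch: run the tutorial's steps in order.
# MAHOUT and DRY_RUN are assumed variable names, not Mahout settings.
set -e

MAHOUT="${MAHOUT:-./bin/mahout}"   # path to the Mahout launcher
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

train_pipeline() {
    # Step 2: text files -> sequence files
    run "$MAHOUT" seqdirectory -i 20news-all -o 20news-seq
    # Step 3: sequence files -> TF-IDF vectors
    run "$MAHOUT" seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
    # Step 4: 80% training / 20% test split
    run "$MAHOUT" split -i 20news-vectors/tfidf-vectors \
        --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors \
        --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
    # Step 5: train Naive Bayes
    run "$MAHOUT" trainnb -i 20news-train-vectors -el -o model -li labelindex -ow
    # Step 6: self-test on the training set
    run "$MAHOUT" testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing
    # Step 7: test on the holdout set
    run "$MAHOUT" testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing
}

train_pipeline
```

Set DRY_RUN=0 to actually execute the commands. Because of set -e, the script stops at the first step that fails, so a broken conversion does not silently feed into training.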