This chapter describes the wagon program for building standard classification and regression trees.
In spite of everyone talling us that CART building software was freely available on the net, and in spite of following up many false leads, we couldn't find any freely available software that could build classification and regression trees.
Also we knew we wanted to have a generic system to do this as we know we
want to experiement with our own cost functions and probably later our
own pruning methods so a well-designed programmable system was what we
were looking for. wagon
is the result. It it still under
development features but it seems to work and works
quite efficiently. The book "Classification and Regression Trees," by
Breiman, L., Freidman, J., Olshen, R. and Stone, C. breiman84
seems to contain all the explanation necessary and simply implement what
we thought we needed from there. Wagon
supports building of both
classification and regression trees from features and produces trees in
a Lisp format, particularly suitable to Festival but the trees it builds
may be used in any system.
There are things missing, particularly well defined ways to add new cost functions. We do want this to be well defined so we can add them easily so expect that part to improve.
We should also say that there are many underspecified aspects to the
implementation of a CART building system, thus result you get from this
system are unlikely to exactly the same as those when built with another
tree building system. We have done test with data against other
statistical systems and wagon
gets similar results. We are
confident that the results from test data are correct. If you want to
make statistical claims about technique rather than just use the result,
you should check the code more closely.
If you do think wagon
is performing worse than you expect on
data please let us know, it may be a bug.
`Wagon' requires a data description file and a datafile, optionally you may specify a test datafile too. This may be the real test set (if you are building simple trees) or held data from the training set in the case of stepwise. An external program, `wagon_test', may also be used to test a tree against data.
A data record consists of a number of field, each field
must be described to `wagon'. The database
description file identified with the -desc
option
should contain a single list of field descriptors. Each
field descriptor consists of a field name and a type
The types currently supported are
binary
float
ignore
list of symbols
An example description file is
((duration float) (Syllable.position_type single initial final mid 0) (pos_in_syl float) (syl_initial 0 1) (syl_final 0 1) )
It is not unusual for the field names to be Festival feature names as then the resulting tree may be directly used in Festival.
A database consists of vectors, one per line. Each vector must have she same number of fields and appropriate values, as defined in the description file.
The first field is the field that is to be predicted.
If the first field is a float a regression tree is built. If the first field is a class then a classification tree is built.
If the first field is of cluster type, the integer numbers in
that field are treated as indeices into the distance matrix specified
by -distmatrix
. In that case the imppurity measure will be
the mean distance between all memebers of the split.
-desc filename
-data filename
-stop N
-test filename
-frs N
-balance N
-held_out I
-noprune
-noprune
is specified no single class sub-trees are pruned and all the probability
density functions are preserved.
-output FILE
-distmatrix FILE
FMatrix::binsave()
defining distances
between the categories in the predictee.
-stepwise
-swlimit N
To show how to use wagon
this chapter shows a general
example of extracting data from a database and using it
to build a tree.
For example suppose we wish to predict segmental duration based on the following features: segment name, vowel/consonant, position in syllable, stress of syllable, accent of syllable, position of syllable in word and break level after and before the syllable this segment is within Within Festival these features are identified by the following names
duration name ph_vc pos_in_syl R:SylStructure.parent.stress R:SylStructure.parent.accented R:SylStructure.parent.position_type R:SylStructure.parent.R:Syllable.p.syl_break R:SylStructure.parent.syl_break
using a festival script dumpfeats
in `festival/examples/'
we can dump the features of a set of utterances. This function
requires utterances to be the format loaded by utt.load
the creation of such files is discussed elsewhere.
Alternatively you may create the feature data files from your own database format with your own programs as long as the resulting output looks like
0.186 # - 0 0 0 0 0 0 0.018 dh - 0 0 0 single 0 1 0.062 i + 1 0 0 single 0 1 0.075 s - 2 0 0 single 0 1 0.05 w - 0 0 0 single 1 1 0.059 @ + 1 0 0 single 1 1 0.061 z - 2 0 0 single 1 1 0.142 ii + 0 1 1 initial 1 0 0.041 z - 1 1 1 initial 1 0 0.086 ii + 0 0 0 final 0 1 0.127 f - 0 0 0 single 1 1 0.031 @ + 1 0 0 single 1 1 0.088 r - 2 0 0 single 1 1 0.189 uh + 0 0 1 single 1 4 0.188 s - 1 0 1 single 1 4 0.1 # - 0 0 0 0 0 0
Given out above datafile we need a description file for the fields it contains. In this case it would be
((duration float) (name uh e a o i u ii uu oo aa @@ ai ei oi au ou e@ i@ u@ @ p t k b d g s z sh zh f v th dh ch jh h m n ng l y r w #) (ph_vc + -) (pos_in_syl float) (R:SylStructure.parent.stress 0 1) (R:SylStructure.parent.accented 0 1) (R:SylStructure.parent.position_type single initial final mid 0) (R:SylStructure.parent.R:Syllable.p.syl_break 0 1 2 3 4) (R:SylStructure.parent.syl_break 0 1 2 3 4) )
Note the field names match the Festival feature names. This is highly recommend if you are going to use the resulting trees in Festival as they trees can then be more easily applied.
Note that we define R:SylStructure.parent.syl_break
and
R:SylStructure.parent.R:Syllable.p.syl_break
as categorial even though their values consist only of the integers 0
through 4. It is of course a choice and either would work but it
is sometimes better to represent features as classes rather than
as ranges. Experimentation of different representations of the same
information is of course the only, real test.
Let us assume our datafile is saved in `train.data' and our data description is in `train.desc'. We can no build a CART tree for our data with the following command
wagon -desc train.desc -data train.data
This will print detaisl of the tree as it is built, then the full tree at the end, followed a short summary of its predictive capacity on the training data itself. We can make it do that final test on a different file. Suppose we had removed 20% of our data and saved it in `test.data' we could use that to test the final tree with the command
wagon -desc train.desc -data train.data -test test.data
The program `wagon_test' allows the independent use and testing of a CART as generated from `wagon'. `wagon_test' takes a datafile and description file in same format as `wagon' as well as a CART in the format generated by `wagon'. It can test the tree against the data give summary of the predictive power. It can also simple predict the values from the tree for the given data. The prediction can be done in two modes. Simple predict where the appropriate leaf node of the tree is returned, that is the whole probability density function (in the classification case) or mean and standard deviation (in the regression tree case). The second case is simply predicting the most probable value (item in class with highest probability or the mean).
For example to test a generated tree of a new set of test data.
wagon_test -desc train.desc -data test.data
Note that the CART building program `wagon' itself may also test on specified test data but this allows later tests to be done without rebuilding the tree.
To predict values for a set of feature vectors and save the output to a file you use
wagon_test -desc train.desc -data newdata.data -predict_val -o newdata.out
Go to the first, previous, next, last section, table of contents.