Wagon

This chapter describes the wagon program for building standard classification and regression trees.

Wagon introduction

In spite of everyone telling us that CART building software was freely available on the net, and in spite of following up many false leads, we couldn't find any freely available software that could build classification and regression trees.

Also we knew we wanted a generic system to do this, as we know we want to experiment with our own cost functions and probably, later, our own pruning methods, so a well-designed programmable system was what we were looking for. wagon is the result. It is still under development but it seems to work, and works quite efficiently. The book "Classification and Regression Trees" by Breiman, L., Friedman, J., Olshen, R. and Stone, C. breiman84 seems to contain all the explanation necessary, and we simply implemented what we thought we needed from there. Wagon supports the building of both classification and regression trees from features, and produces trees in a Lisp format particularly suited to Festival, though the trees it builds may be used in any system.

There are things missing, particularly a well-defined way to add new cost functions. We do want this to be well defined so we can add them easily, so expect that part to improve.

We should also say that there are many underspecified aspects to the implementation of a CART building system, thus the results you get from this system are unlikely to be exactly the same as those from trees built with another tree building system. We have run tests against other statistical systems and wagon gets similar results. We are confident that the results on test data are correct. If you want to make statistical claims about the technique, rather than just use the results, you should check the code more closely.

If you think wagon is performing worse than you expect on your data, please let us know; it may be a bug.

Wagon command options

`Wagon' requires a data description file and a datafile; optionally you may also specify a test datafile. This may be the real test set (if you are building simple trees) or data held out from the training set in the case of stepwise building. An external program, `wagon_test', may also be used to test a tree against data.

Wagon data description file

A data record consists of a number of fields; each field must be described to `wagon'. The data description file, identified with the -desc option, should contain a single list of field descriptors. Each field descriptor consists of a field name and a type. The types currently supported are

binary
the field takes only the values 0 and 1.
float
the field may take any floating point value.
ignore
the field is ignored; it may have any value.
list of symbols
this defines a class; if the first item is ignore the whole field is ignored (see the second example below).

An example description file is

((duration float)
 (Syllable.position_type single initial final mid 0)
 (pos_in_syl float)
 (syl_initial 0 1)
 (syl_final 0 1)
)
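
For illustration only, here is a second, hypothetical description showing both forms of ignore described above: a field ignored by its type (fileid is an invented name), and a class whose first item is ignore.

((duration float)
 (fileid ignore)
 (Syllable.position_type ignore single initial final mid 0)
 (pos_in_syl float)
)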

It is not unusual for the field names to be Festival feature names, as the resulting tree may then be used directly in Festival.

Wagon data file

A database consists of vectors, one per line. Each vector must have the same number of fields, with appropriate values, as defined in the description file.

The first field is the field that is to be predicted.

If the first field is a float a regression tree is built. If the first field is a class then a classification tree is built.

If the first field is of a cluster type, the integer values in that field are treated as indices into the distance matrix specified by -distmatrix. In that case the impurity measure will be the mean distance between all members of the split.
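
To make that measure concrete, here is a minimal Python sketch of one plausible reading of it; this is our interpretation of the description, not wagon's actual code, and wagon's exact normalisation may differ.

# Mean distance between all (distinct) members of a split, looked up
# in the matrix given by -distmatrix. An interpretation of the manual
# text above, not wagon's source.
def cluster_impurity(members, dist):
    # members: integer indices into dist; dist: square distance matrix
    if len(members) < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in members:
        for j in members:
            if i != j:
                total += dist[i][j]
                pairs += 1
    return total / pairs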

Wagon command line options

-desc filename
the data description filename.
-data filename
the data filename.
-stop N
The minimum number of examples that must be in a leaf of the decision tree. Splits that would cause the number of examples to fall below this value are not taken. The default value is 50.
-test filename
A datafile used to test the fully grown tree after it has been built. Results are printed as a confusion matrix for classification trees, or as correlation and mean error for regression trees.
-frs N
The float range split. When searching for tree splits based on float features, this is the number of partitions the range of the floats is divided into. The default value is 10.
-balance N
Vary the size of the stop value with the number of items at the current node. This allows the building of more balanced trees, and seems to be useful particularly for regression trees rather than classification trees. If the balance is non-0 (the default) the stop value for a node is the number of items at that node divided by the balance, unless that is smaller than the specified stop value. For example, if N is 20 and the base stop value is 50, the stop value used is 5% of the number of examples at the node while there are more than 1000 examples (see the sketch after this list).
-held_out I
Percentage of the training data to hold out for final pruning; 10-15% seems reasonable. Ideally, given enough data, building a tree with a small stop value and then pruning it with the held out data should give a reasonable result. This effectively gives you a dynamically sized stop value that depends on the accuracy of the predictions. The disadvantage is that by removing I% you have less training data, and hence accuracy may go down on unseen data.
-noprune
When building a classification tree, if every leaf in a sub-tree selects the same class, the sub-tree is removed and its top node becomes a leaf node. However, if you wish to use the probability density functions generated by the tree, the further distinctions within a single-class sub-tree might still be needed. If -noprune is specified, no single-class sub-trees are pruned and all the probability density functions are preserved.
-output FILE
An output filename in which to save the generated tree.
-distmatrix FILE
A distance matrix (made by FMatrix::binsave()) defining distances between the categories of the predictee.
-stepwise
Find which features contribute most to building a tree. This very computationally intensive option first ignores all features and then greedily adds features until the best possible tree is found (subject to the other options specified). Unlike question selection, which finds the best feature to ask about at a given point in the tree, this optimises which features to use over the whole tree. This can often significantly improve results, especially when there are many closely correlated features. Note it won't find the optimal set. This option uses the provided test data to check the built trees, which means the provided test data should not be your real test set, as the build process is effectively optimising for that data.
-swlimit N
Specifies the improvement required during stepwise construction before a new feature is added. The default is 0, meaning any positive improvement is enough to add a new feature. The number is a percentage: a value of 1 requires a 1% improvement in the measure (correlation for regression, percentage correct for classification).
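
To make the -balance arithmetic concrete, here is a minimal Python sketch of our reading of the description above (an interpretation, not wagon's source):

# Effective stop value under -balance: a non-zero balance scales the
# stop value with the size of the current node, with the base stop
# value acting as a floor.
def effective_stop(n_items, stop=50, balance=20):
    if balance == 0:
        return stop
    return max(stop, n_items / balance)

# e.g. effective_stop(2000) == 100.0, while effective_stop(600) == 50
# because 600 / 20 = 30 is below the base stop value.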

CART building

To show how to use wagon, this chapter works through a general example of extracting data from a database and using it to build a tree.

Getting the data

For example, suppose we wish to predict segmental duration based on the following features: segment name, vowel/consonant distinction, position in syllable, stress of the syllable, accent of the syllable, position of the syllable in the word, and the break level before and after the syllable this segment is within. Within Festival these features are identified by the following names

duration
name
ph_vc
pos_in_syl
R:SylStructure.parent.stress
R:SylStructure.parent.accented
R:SylStructure.parent.position_type
R:SylStructure.parent.R:Syllable.p.syl_break
R:SylStructure.parent.syl_break

Using the Festival script dumpfeats in `festival/examples/' we can dump these features for a set of utterances. This script requires utterances to be in the format loaded by utt.load; the creation of such files is discussed elsewhere.
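
For example, an invocation along the following lines dumps one vector per segment. The filenames are illustrative and the exact options should be checked against your version of dumpfeats; `duration.feats' would contain the feature names listed above, one per line.

dumpfeats -relation Segment -feats duration.feats -output duration.data utts/*.utt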

Alternatively you may create the feature data files from your own database format with your own programs, as long as the resulting output looks like the following (a minimal sketch for generating such a file appears after the example)

0.186 # - 0 0 0 0 0 0 
0.018 dh - 0 0 0 single 0 1 
0.062 i + 1 0 0 single 0 1 
0.075 s - 2 0 0 single 0 1 
0.05 w - 0 0 0 single 1 1 
0.059 @ + 1 0 0 single 1 1 
0.061 z - 2 0 0 single 1 1 
0.142 ii + 0 1 1 initial 1 0 
0.041 z - 1 1 1 initial 1 0 
0.086 ii + 0 0 0 final 0 1 
0.127 f - 0 0 0 single 1 1 
0.031 @ + 1 0 0 single 1 1 
0.088 r - 2 0 0 single 1 1 
0.189 uh + 0 0 1 single 1 4 
0.188 s - 1 0 1 single 1 4 
0.1 # - 0 0 0 0 0 0 
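
A minimal Python sketch of generating such a file (the records here are hypothetical, mirroring the sample above):

# Write wagon-format data: one whitespace-separated vector per line,
# with the field to be predicted (duration) first, matching the field
# order in the description file.
records = [
    (0.062, "i", "+", 1, 0, 0, "single", 0, 1),
    (0.075, "s", "-", 2, 0, 0, "single", 0, 1),
]
with open("train.data", "w") as out:
    for rec in records:
        out.write(" ".join(str(v) for v in rec) + "\n")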

Constructing feature files

Given the above datafile, we need a description file for the fields it contains. In this case it would be

((duration float)
 (name uh e a o i u ii uu oo aa @@ ai ei oi au ou e@ i@ u@ @ p 
       t k b d g s z sh zh f v th dh ch jh h m n ng l y r w #)
 (ph_vc + -)
 (pos_in_syl float)
 (R:SylStructure.parent.stress 0 1)
 (R:SylStructure.parent.accented 0 1)
 (R:SylStructure.parent.position_type single initial final mid 0)
 (R:SylStructure.parent.R:Syllable.p.syl_break 0 1 2 3 4)
 (R:SylStructure.parent.syl_break 0 1 2 3 4)
)

Note the field names match the Festival feature names. This is highly recommended if you are going to use the resulting trees in Festival, as the trees can then be applied more easily.

Note that we define R:SylStructure.parent.syl_break and R:SylStructure.parent.R:Syllable.p.syl_break as categorical even though their values consist only of the integers 0 through 4. This is of course a choice, and either representation would work, but it is sometimes better to represent features as classes rather than as ranges. Experimentation with different representations of the same information is of course the only real test.

Building and testing the tree

Let us assume our datafile is saved in `train.data' and our data description is in `train.desc'. We can now build a CART tree for our data with the following command

wagon -desc train.desc -data train.data

This will print details of the tree as it is built, then the full tree at the end, followed by a short summary of its predictive capacity on the training data itself. We can make it do that final test on a different file. Suppose we had held out 20% of our data and saved it in `test.data'; we could use that to test the final tree with the command

wagon -desc train.desc -data train.data -test test.data
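
The options described earlier compose in the obvious way. For instance, to have wagon select features greedily over held out data and save the resulting tree for later use (filenames illustrative):

wagon -desc train.desc -data train.data -test held.data -stepwise -output train.tree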

Using CART

The program `wagon_test' allows the independent use and testing of a CART generated by `wagon'. `wagon_test' takes a datafile and description file in the same format as `wagon', as well as a CART in the format generated by `wagon'. It can test the tree against the data and give a summary of its predictive power, or it can simply predict values from the tree for the given data. Prediction can be done in two modes. In the first, the appropriate leaf node of the tree is returned: the whole probability density function (in the classification case) or the mean and standard deviation (in the regression case). In the second, only the most probable value is predicted (the item in the class with the highest probability, or the mean).

For example, to test a generated tree against a new set of test data (here `train.tree' is a tree previously saved with the -output option):

wagon_test -desc train.desc -data test.data -tree train.tree

Note that the CART building program `wagon' itself may also test against specified test data, but `wagon_test' allows later tests to be done without rebuilding the tree.

To predict values for a set of feature vectors and save the output to a file, use

wagon_test -desc train.desc -data newdata.data -tree train.tree -predict_val -o newdata.out

