Changed the example data from solder to solder.balance. The full version of the data is available in the survival package.
rpart() would fail with a formula having ~ . - x on the right-hand side, because of a simple bookkeeping error in creating an index. This has been fixed.
Added a section to the vignette on user-written functions, which explains why and when one can avoid checking all 2^k splits for a categorical predictor with k levels.
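A minimal sketch of the usual shortcut, assuming a continuous response and a squared-error criterion (the vignette's own treatment may differ in detail): sort the levels by their mean response, and then only the k - 1 cuts along that ordering need to be checked. All names and data below are hypothetical and are not part of rpart; the exhaustive search is included only to confirm that both approaches reach the same minimum on this toy data.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(groups, left_levels):
    """Total within-child SSE when the levels in left_levels go left."""
    left = [y for lev, ys in groups.items() if lev in left_levels for y in ys]
    right = [y for lev, ys in groups.items() if lev not in left_levels for y in ys]
    return sse(left) + sse(right)


def best_split_exhaustive(groups):
    """Try every non-trivial subset of levels as the left child."""
    levels = sorted(groups)
    best = None
    for size in range(1, len(levels)):
        for left in combinations(levels, size):
            cost = split_cost(groups, set(left))
            if best is None or cost < best[0]:
                best = (cost, set(left))
    return best


def best_split_ordered(groups):
    """Order levels by mean response, then try only the k - 1 cuts."""
    ordered = sorted(groups, key=lambda lev: sum(groups[lev]) / len(groups[lev]))
    best = None
    for cut in range(1, len(ordered)):
        left = set(ordered[:cut])
        cost = split_cost(groups, left)
        if best is None or cost < best[0]:
            best = (cost, left)
    return best


# Hypothetical toy data: response values grouped by factor level.
groups = {
    "a": [1.0, 2.0],
    "b": [9.0, 10.0, 11.0],
    "c": [2.5, 3.0],
    "d": [8.0, 7.5],
}

exhaustive_cost, exhaustive_left = best_split_exhaustive(groups)
ordered_cost, ordered_left = best_split_ordered(groups)
```

The ordered search evaluates 3 candidate splits here, while the exhaustive one evaluates every non-empty proper subset of the 4 levels.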
The C and R code has been reformatted for legibility.
The old compatibility function rpconvert() has been removed.
The cross-validation functions allow for user interrupt at the end of evaluating each split.
Variable Reliability in data set car90 is corrected to be an ordered factor, as documented.
Surrogate splits are now considered only if they send two or more cases with non-zero weight each way. For numeric/ordinal variables only the restriction to non-zero weights is new; for categorical variables the whole restriction is new.
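The eligibility rule above can be illustrated with a short sketch. This is not the package's C code: the function name, its arguments, and the assumption that cases with a missing surrogate value have already been dropped are all hypothetical.

```python
def surrogate_is_eligible(goes_left, weights, min_each_way=2):
    """Return True if at least `min_each_way` cases with non-zero weight
    are sent to each side by the candidate surrogate split.

    goes_left -- one boolean per case (True = sent left); hypothetical input
    weights   -- one case weight per case; zero-weight cases are not counted
    """
    left = sum(1 for g, w in zip(goes_left, weights) if g and w > 0)
    right = sum(1 for g, w in zip(goes_left, weights) if not g and w > 0)
    return left >= min_each_way and right >= min_each_way


# Two weighted cases go each way: eligible.
ok = surrogate_is_eligible([True, True, False, False], [1, 1, 1, 1])

# Two cases go left, but one of them has zero weight: not eligible.
rejected = surrogate_is_eligible([True, True, False, False], [1, 0, 1, 1])
```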
Surrogate splits which improve only by rounding error over the default split are no longer returned. Where weights and missing values are present, the splits component for some of these was not returned correctly.
A fit of class ‘"rpart"’ now contains a component for variable ‘importance’, which is reported by the summary() method.
The text() method gains a minlength argument, like the labels() method. This adds finer control: the default remains pretty = NULL, minlength = 1L.
The handling of fits with zero and fractional weights has been corrected: the results may be slightly different (or even substantially different when the proportion of zero weights is large).
Some memory leaks have been plugged.
There is a second vignette, ‘longintro.Rnw’, a version of the original Mayo Technical Report on rpart.
Added dataset car90, a corrected version of the S-PLUS dataset car.all (used with permission).
This version does not use paste0() and so works with R 2.14.x.
Merged in a set of S-PLUS code changes that had accumulated at Mayo over the course of a decade. The primary one is a change in how indexing is done in the underlying C code, which leads to a major speed increase for large data sets. Essentially, for the lower leaves all our time used to be eaten up by bookkeeping, and this was replaced by a different approach. The primary routine also uses .Call() so as to be more memory efficient.
The other major change fixes an error for asymmetric loss matrices, prompted by a user query. With an asymmetric loss matrix L, the altered priors were computed incorrectly: they used L' instead of L. As a result, the tree would not necessarily choose optimal splits for the given loss matrix. Once chosen, splits were evaluated correctly. The printed “improvement” values were of course wrong as well. Interestingly, in my small test case with L quite asymmetric, the early splits in the tree are unchanged: a good split still looks good.
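To see why the transpose matters, here is a small sketch of the altered-priors calculation. It assumes the convention that L[i][j] is the loss for classifying a true class i case as class j, and that the altered prior for class i is proportional to prior_i times the i-th row sum of L. The function names and numbers are made up for illustration and are not taken from the rpart sources.

```python
def altered_priors(priors, loss):
    """Altered priors: prior_i * (row sum of loss for class i), renormalised.

    Assumes loss[i][j] is the cost of calling a true class i case class j.
    """
    row_sums = [sum(row) for row in loss]
    weighted = [p * s for p, s in zip(priors, row_sums)]
    total = sum(weighted)
    return [w / total for w in weighted]


def transpose(matrix):
    """Return the transpose of a square list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical two-class problem: missing a true class 2 case costs 4,
# missing a true class 1 case costs 1.
priors = [0.5, 0.5]
loss = [[0, 1],
        [4, 0]]

correct = altered_priors(priors, loss)              # weight shifts to class 2
buggy = altered_priors(priors, transpose(loss))     # weight shifts to class 1
```

With a symmetric L the two results coincide, which is consistent with the error showing up only for asymmetric loss matrices.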
Add the return.all argument to xpred.rpart().
Added a set of formal tests, i.e., cases with known answers to which we can compare.
Add a ‘usercode’ vignette, explaining how to add user-defined splitting functions.
The class method now also returns the node probability.
Add the stagec data set, used in some tests.
The plot.rpart routine needs to store a value that will be visible to the rpartco routine at a later time. This is now done in an environment in the namespace.
Force use of registered symbols in R >= 2.16.0.
Update Polish translations.
Work on message formats.
Add Polish translations.
rpart, rpart.matrix: allow backticks in formulae.
‘tests/backtick.R’: regression test.
‘src/xval.c’: ensure unused code is not compiled in.
Change description of ‘margin’ in ?plot.rpart as suggested by Bill Venables.