mlpack
2.0.1
|
The HoeffdingTree object represents all of the necessary information for a Hoeffding-bound-based decision tree. More...
Public Types | |
typedef CategoricalSplitType< FitnessFunction > | CategoricalSplit |
Allow access to the categorical split type. More... | |
typedef NumericSplitType< FitnessFunction > | NumericSplit |
Allow access to the numeric split type. More... | |
Public Member Functions | |
template<typename MatType > | |
HoeffdingTree (const MatType &data, const data::DatasetInfo &datasetInfo, const arma::Row< size_t > &labels, const size_t numClasses, const bool batchTraining=true, const double successProbability=0.95, const size_t maxSamples=0, const size_t checkInterval=100, const size_t minSamples=100, const CategoricalSplitType< FitnessFunction > &categoricalSplitIn=CategoricalSplitType< FitnessFunction >(0, 0), const NumericSplitType< FitnessFunction > &numericSplitIn=NumericSplitType< FitnessFunction >(0)) | |
Construct the Hoeffding tree with the given parameters and given training data. More... | |
HoeffdingTree (const data::DatasetInfo &datasetInfo, const size_t numClasses, const double successProbability=0.95, const size_t maxSamples=0, const size_t checkInterval=100, const size_t minSamples=100, const CategoricalSplitType< FitnessFunction > &categoricalSplitIn=CategoricalSplitType< FitnessFunction >(0, 0), const NumericSplitType< FitnessFunction > &numericSplitIn=NumericSplitType< FitnessFunction >(0), std::unordered_map< size_t, std::pair< size_t, size_t >> *dimensionMappings=NULL) | |
Construct the Hoeffding tree with the given parameters, but training on no data. More... | |
HoeffdingTree (const HoeffdingTree &other) | |
Copy another tree (warning: this will duplicate the tree entirely, and may use a lot of memory. More... | |
~HoeffdingTree () | |
Clean up memory. More... | |
template<typename VecType > | |
size_t | CalculateDirection (const VecType &point) const |
Given a point and that this node is not a leaf, calculate the index of the child node this point would go towards. More... | |
size_t | CheckInterval () const |
Get the number of samples before a split check is performed. More... | |
void | CheckInterval (const size_t checkInterval) |
Modify the number of samples before a split check is performed. More... | |
const HoeffdingTree & | Child (const size_t i) const |
Get a child. More... | |
HoeffdingTree & | Child (const size_t i) |
Modify a child. More... | |
template<typename VecType > | |
size_t | Classify (const VecType &point) const |
Classify the given point, using this node and the entire (sub)tree beneath it. More... | |
template<typename VecType > | |
void | Classify (const VecType &point, size_t &prediction, double &probability) const |
Classify the given point and also return an estimate of the probability that the prediction is correct. More... | |
template<typename MatType > | |
void | Classify (const MatType &data, arma::Row< size_t > &predictions) const |
Classify the given points, using this node and the entire (sub)tree beneath it. More... | |
template<typename MatType > | |
void | Classify (const MatType &data, arma::Row< size_t > &predictions, arma::rowvec &probabilities) const |
Classify the given points, using this node and the entire (sub)tree beneath it. More... | |
void | CreateChildren () |
Given that this node should split, create the children. More... | |
size_t | MajorityClass () const |
Get the majority class. More... | |
size_t & | MajorityClass () |
Modify the majority class. More... | |
double | MajorityProbability () const |
Get the probability of the majority class (based on training samples). More... | |
double & | MajorityProbability () |
Modify the probability of the majority class. More... | |
size_t | MaxSamples () const |
Get the maximum number of samples before a split is forced. More... | |
void | MaxSamples (const size_t maxSamples) |
Modify the maximum number of samples before a split is forced. More... | |
size_t | MinSamples () const |
Get the minimum number of samples for a split. More... | |
void | MinSamples (const size_t minSamples) |
Modify the minimum number of samples for a split. More... | |
size_t | NumChildren () const |
Get the number of children. More... | |
template<typename Archive > | |
void | Serialize (Archive &ar, const unsigned int) |
Serialize the split. More... | |
size_t | SplitCheck () |
Check if a split would satisfy the conditions of the Hoeffding bound with the node's specified success probability. More... | |
size_t | SplitDimension () const |
Get the splitting dimension (size_t(-1) if no split). More... | |
double | SuccessProbability () const |
Get the confidence required for a split. More... | |
void | SuccessProbability (const double successProbability) |
Modify the confidence required for a split. More... | |
template<typename MatType > | |
void | Train (const MatType &data, const arma::Row< size_t > &labels, const bool batchTraining=true) |
Train on a set of points, either in streaming mode or in batch mode, with the given labels. More... | |
template<typename VecType > | |
void | Train (const VecType &point, const size_t label) |
Train on a single point in streaming mode, with the given label. More... | |
Private Attributes | |
CategoricalSplitType< FitnessFunction >::SplitInfo | categoricalSplit |
If the split is categorical, this holds the splitting information. More... | |
std::vector< CategoricalSplitType< FitnessFunction > > | categoricalSplits |
Information for splitting of categorical features (used before split). More... | |
size_t | checkInterval |
The number of samples that should be seen before checking for a split. More... | |
std::vector< HoeffdingTree * > | children |
If the split has occurred, these are the children. More... | |
const data::DatasetInfo * | datasetInfo |
The dataset information. More... | |
std::unordered_map< size_t, std::pair< size_t, size_t > > * | dimensionMappings |
This structure is owned by this node only if it is the root of the tree. More... | |
size_t | majorityClass |
The majority class of this node. More... | |
double | majorityProbability |
The empirical probability of a point this node saw having the majority class. More... | |
size_t | maxSamples |
The maximum number of samples we can see before splitting. More... | |
size_t | minSamples |
The minimum number of samples for splitting. More... | |
size_t | numClasses |
The number of classes this node is trained on. More... | |
NumericSplitType< FitnessFunction >::SplitInfo | numericSplit |
If the split is numeric, this holds the splitting information. More... | |
std::vector< NumericSplitType< FitnessFunction > > | numericSplits |
Information for splitting of numeric features (used before split). More... | |
size_t | numSamples |
The number of samples seen so far by this node. More... | |
bool | ownsInfo |
Whether or not we own the dataset information. More... | |
bool | ownsMappings |
Indicates whether or not we own the mappings. More... | |
size_t | splitDimension |
The dimension that this node has split on. More... | |
double | successProbability |
The required probability of success for a split to be performed. More... | |
The HoeffdingTree object represents all of the necessary information for a Hoeffding-bound-based decision tree.
This class is able to train on samples in streaming settings and batch settings, and perform splits based on the Hoeffding bound. The Hoeffding tree (also known as the "very fast decision tree" – VFDT) is described in the following paper:
The class is modular, and takes three template parameters. The first, FitnessFunction, is the fitness function that should be used to determine whether a split is beneficial; examples might be GiniImpurity or InformationGain. The NumericSplitType determines how numeric attributes are handled, and the CategoricalSplitType determines how categorical attributes are handled. As far as the actual splitting goes, the meat of the splitting procedure will be contained in those two classes.
FitnessFunction | Fitness function to use. |
NumericSplitType | Technique for splitting numeric features. |
CategoricalSplitType | Technique for splitting categorical features. |
Definition at line 62 of file hoeffding_tree.hpp.
typedef CategoricalSplitType<FitnessFunction> mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CategoricalSplit |
Allow access to the categorical split type.
Definition at line 68 of file hoeffding_tree.hpp.
typedef NumericSplitType<FitnessFunction> mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::NumericSplit |
Allow access to the numeric split type.
Definition at line 66 of file hoeffding_tree.hpp.
mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::HoeffdingTree | ( | const MatType & | data, |
const data::DatasetInfo & | datasetInfo, | ||
const arma::Row< size_t > & | labels, | ||
const size_t | numClasses, | ||
const bool | batchTraining = true , |
||
const double | successProbability = 0.95 , |
||
const size_t | maxSamples = 0 , |
||
const size_t | checkInterval = 100 , |
||
const size_t | minSamples = 100 , |
||
const CategoricalSplitType< FitnessFunction > & | categoricalSplitIn = CategoricalSplitType< FitnessFunction >(0, 0) , |
||
const NumericSplitType< FitnessFunction > & | numericSplitIn = NumericSplitType< FitnessFunction >(0) |
||
) |
Construct the Hoeffding tree with the given parameters and given training data.
The tree may be trained either in batch mode (which looks at all points before splitting, and propagates these points to the created children for further training), or in streaming mode, where each point is only considered once. (In general, batch mode will give better-performing trees, but will have higher memory and runtime costs for the same dataset.)
data | Dataset to train on. |
datasetInfo | Information on the dataset (types of each feature). |
labels | Labels of each point in the dataset. |
numClasses | Number of classes in the dataset. |
batchTraining | Whether or not to train in batch. |
successProbability | Probability of success required in Hoeffding bounds before a split can happen. |
maxSamples | Maximum number of samples before a split is forced (0 never forces a split); ignored in batch training mode. |
checkInterval | Number of samples required before each split; ignored in batch training mode. |
minSamples | If the node has seen this many points or fewer, no split will be allowed. |
mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::HoeffdingTree | ( | const data::DatasetInfo & | datasetInfo, |
const size_t | numClasses, | ||
const double | successProbability = 0.95 , |
||
const size_t | maxSamples = 0 , |
||
const size_t | checkInterval = 100 , |
||
const size_t | minSamples = 100 , |
||
const CategoricalSplitType< FitnessFunction > & | categoricalSplitIn = CategoricalSplitType< FitnessFunction >(0, 0) , |
||
const NumericSplitType< FitnessFunction > & | numericSplitIn = NumericSplitType< FitnessFunction >(0) , |
||
std::unordered_map< size_t, std::pair< size_t, size_t >> * | dimensionMappings = NULL |
||
) |
Construct the Hoeffding tree with the given parameters, but training on no data.
The dimensionMappings parameter is only used if it is desired that this node does not create its own dimensionMappings object (for instance, if this is a child of another node in the tree).
dimensionality | Dimensionality of the dataset. |
numClasses | Number of classes in the dataset. |
datasetInfo | Information on the dataset (types of each feature). |
successProbability | Probability of success required in Hoeffding bound before a split can happen. |
maxSamples | Maximum number of samples before a split is forced. |
checkInterval | Number of samples required before each split check. |
minSamples | If the node has seen this many points or fewer, no split will be allowed. |
dimensionMappings | Mappings from dimension indices to positions in numeric and categorical split vectors. If left NULL, a new one will be created. |
mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::HoeffdingTree | ( | const HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType > & | other | ) |
Copy another tree (warning: this will duplicate the tree entirely, and may use a lot of memory.
Make sure it's what you want before you do it).
other | Tree to copy. |
mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::~HoeffdingTree | ( | ) |
Clean up memory.
size_t mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CalculateDirection | ( | const VecType & | point | ) | const |
Given a point and that this node is not a leaf, calculate the index of the child node this point would go towards.
This method is primarily used by the Classify() function, but it can be used in a standalone sense too.
point | Point to classify. |
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CheckInterval().
|
inline |
Get the number of samples before a split check is performed.
Definition at line 218 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CalculateDirection(), mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::checkInterval, mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Classify(), mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CreateChildren(), and mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Serialize().
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CheckInterval | ( | const size_t | checkInterval | ) |
Modify the number of samples before a split check is performed.
|
inline |
Get a child.
Definition at line 198 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::children.
|
inline |
Modify a child.
Definition at line 200 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::children.
size_t mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Classify | ( | const VecType & | point | ) | const |
Classify the given point, using this node and the entire (sub)tree beneath it.
The predicted label is returned.
point | Point to classify. |
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CheckInterval().
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Classify | ( | const VecType & | point, |
size_t & | prediction, | ||
double & | probability | ||
) | const |
Classify the given point and also return an estimate of the probability that the prediction is correct.
(This estimate is simply the probability that a training point was from the majority class in the leaf that this point binned to.)
point | Point to classify. |
prediction | Predicted label of point. |
probability | An estimate of the probability that the prediction is correct. |
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Classify | ( | const MatType & | data, |
arma::Row< size_t > & | predictions | ||
) | const |
Classify the given points, using this node and the entire (sub)tree beneath it.
The predicted labels for each point are returned.
data | Points to classify. |
predictions | Predicted labels for each point. |
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Classify | ( | const MatType & | data, |
arma::Row< size_t > & | predictions, | ||
arma::rowvec & | probabilities | ||
) | const |
Classify the given points, using this node and the entire (sub)tree beneath it.
The predicted labels for each point are returned, as well as an estimate of the probability that the prediction is correct for each point. This estimate is simply the MajorityProbability() for the leaf that each point bins to.
data | Points to classify. |
predictions | Predicted labels for each point. |
probabilities | Probability estimates for each predicted label. |
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CreateChildren | ( | ) |
Given that this node should split, create the children.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CheckInterval().
|
inline |
Get the majority class.
Definition at line 185 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::majorityClass.
|
inline |
Modify the majority class.
Definition at line 187 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::majorityClass.
|
inline |
Get the probability of the majority class (based on training samples).
Definition at line 190 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::majorityProbability.
|
inline |
Modify the probability of the majority class.
Definition at line 192 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::majorityProbability.
|
inline |
Get the maximum number of samples before a split is forced.
Definition at line 213 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::maxSamples.
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::MaxSamples | ( | const size_t | maxSamples | ) |
Modify the maximum number of samples before a split is forced.
|
inline |
Get the minimum number of samples for a split.
Definition at line 208 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::minSamples.
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::MinSamples | ( | const size_t | minSamples | ) |
Modify the minimum number of samples for a split.
|
inline |
Get the number of children.
Definition at line 195 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::children.
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Serialize | ( | Archive & | ar, |
const unsigned | int | ||
) |
Serialize the split.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CheckInterval().
size_t mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::SplitCheck | ( | ) |
Check if a split would satisfy the conditions of the Hoeffding bound with the node's specified success probability.
If so, the number of children that would be created is returned. If not, 0 is returned.
|
inline |
Get the splitting dimension (size_t(-1) if no split).
Definition at line 182 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::splitDimension.
|
inline |
Get the confidence required for a split.
Definition at line 203 of file hoeffding_tree.hpp.
References mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::successProbability.
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::SuccessProbability | ( | const double | successProbability | ) |
Modify the confidence required for a split.
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Train | ( | const MatType & | data, |
const arma::Row< size_t > & | labels, | ||
const bool | batchTraining = true |
||
) |
Train on a set of points, either in streaming mode or in batch mode, with the given labels.
data | Data points to train on. |
label | Labels of data points. |
batchTraining | If true, perform training in batch. |
void mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Train | ( | const VecType & | point, |
const size_t | label | ||
) |
Train on a single point in streaming mode, with the given label.
point | Point to train on. |
label | Label of point to train on. |
|
private |
If the split is categorical, this holds the splitting information.
Definition at line 332 of file hoeffding_tree.hpp.
|
private |
Information for splitting of categorical features (used before split).
Definition at line 298 of file hoeffding_tree.hpp.
|
private |
The number of samples that should be seen before checking for a split.
Definition at line 312 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::CheckInterval().
|
private |
If the split has occurred, these are the children.
Definition at line 336 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::Child(), and mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::NumChildren().
|
private |
The dataset information.
Definition at line 316 of file hoeffding_tree.hpp.
|
private |
This structure is owned by this node only if it is the root of the tree.
Definition at line 301 of file hoeffding_tree.hpp.
|
private |
The majority class of this node.
Definition at line 327 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::MajorityClass().
|
private |
The empirical probability of a point this node saw having the majority class.
Definition at line 330 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::MajorityProbability().
|
private |
The maximum number of samples we can see before splitting.
Definition at line 310 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::MaxSamples().
|
private |
The minimum number of samples for splitting.
Definition at line 314 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::MinSamples().
|
private |
The number of classes this node is trained on.
Definition at line 308 of file hoeffding_tree.hpp.
|
private |
If the split is numeric, this holds the splitting information.
Definition at line 334 of file hoeffding_tree.hpp.
|
private |
Information for splitting of numeric features (used before split).
Definition at line 296 of file hoeffding_tree.hpp.
|
private |
The number of samples seen so far by this node.
Definition at line 306 of file hoeffding_tree.hpp.
|
private |
Whether or not we own the dataset information.
Definition at line 318 of file hoeffding_tree.hpp.
|
private |
Indicates whether or not we own the mappings.
Definition at line 303 of file hoeffding_tree.hpp.
|
private |
The dimension that this node has split on.
Definition at line 325 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::SplitDimension().
|
private |
The required probability of success for a split to be performed.
Definition at line 320 of file hoeffding_tree.hpp.
Referenced by mlpack::tree::HoeffdingTree< FitnessFunction, NumericSplitType, CategoricalSplitType >::SuccessProbability().