Random Forest



Preliminaries: decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning “come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining”, say Hastie et al., because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate.[3]:352

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[3]:587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Tree bagging

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, …, B:

  1. Sample, with replacement, n training examples from XY; call these XbYb.
  2. Train a classification or regression tree fb on XbYb.

After training, predictions for unseen samples x’ can be made by averaging the predictions from all the individual regression trees on x’:

{\displaystyle {\hat {f}}={\frac {1}{B}}\sum _{b=1}^{B}f_{b}(x’)}{\displaystyle {\hat {f}}={\frac {1}{B}}\sum _{b=1}^{B}f_{b}(x')}

or by taking the majority vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x’:

{\displaystyle \sigma ={\sqrt {\frac {\sum _{b=1}^{B}(f_{b}(x’)-{\hat {f}})^{2}}{B-1}}}.}{\displaystyle \sigma ={\sqrt {\frac {\sum _{b=1}^{B}(f_{b}(x')-{\hat {f}})^{2}}{B-1}}}.}

The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample.[12] The training and test error tend to level off after some number of trees have been fit.

From bagging to random forests

The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called “feature bagging”. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.[13]

Typically, for a classification problem with p features, p (rounded down) features are used in each split.[3]:592 For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default.[3]:592


Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. These are trained using bagging and the random subspace method, like in an ordinary random forest, but additionally the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal feature/split combination (based on, e.g., information gain or the Gini impurity), for each feature under consideration, a random value is selected for the split. This value is selected from the feature’s empirical range (in the tree’s training set, i.e., the bootstrap sample)[14]


My Ten Favorite “Practical Theory” Papers By: Jennifer Rexford


As the saying goes, “In theory there is no difference between theory and practice. But, in practice, there is.” Networking research has a wealth of good papers on both sides of the theory-practice divide. However, many practical papers stop short of having a sharp problem formulation or a rigorously considered solution, and many theory papers overlook or assume away some key aspect of the system they intend to model. Still, every so often, a paper comes along that nails a practical question with just the right bit of theory. When that happens, it’s a thing of beauty. These are my ten favorite examples. In some cases, I mention survey papers that cover an entire body of work, or a journal paper that presents a more mature overview of one or more conference papers, rather than single out an individual research result.

An Axiomatic Basis for Computer Programming. C. A. R. Hoare


In this paper an attempt is made to explore the logical foundations of computer programming by use of techniques which were first applied in the study of geometry and have later been extended to other branches of mathematics. This involves the elucidation of sets of axioms and rules of inference which can be used in proofs of the properties of computer programs. Examples are given of such axioms and rules, and a formal proof of a simple theorem is displayed. Finally, it is argued that important advantage, both theoretical and practical, may follow from a pursuance of these topics.

Graph Theorems


Cayley’s Formula • Mader’s Theorem • Menger’s Theorem • Euler’s Theorem • The Nordhaus-Gaddum Theorem • Dirac’s Theorem • The Chv´atal-Erd˝os Theorem • Hall’s Theorem • In any bipartite graph, the maximum size of a matching is equal to the minimum size of a vertex cover. • Tutte’s Theorem • Petersen’s Theorem • The chromatic index of any bipartite graph is equal to its maximum degree. • Vizing’s Theorem • Brooks’ Theorem • Tur´an’s Theorem • The K˝ov´ari-S´os-Tur´an Theorem • Ramsey’s Theorem • Erd˝os’ lower bound for the Ramsey number r(k, k). • Euler’s Formula for planar connected graphs. • The Four Color Theorem (with a proof that five colors suffice).

Graph Isomorphism


Read for Lazlo Babai


And Retraction also.

In graph theory, an isomorphism of graphs G and H is a bijection between the vertex sets of G and H

such that any two vertices u and v of G are adjacent in G if and only if ƒ(u) and ƒ(v) are adjacent in H. This kind of bijection is commonly described as “edge-preserving bijection”, in accordance with the general notion of isomorphism being a structure-preserving bijection.

If an isomorphism exists between two graphs, then the graphs are called isomorphic and denoted as {\displaystyle G\simeq H}G\simeq H. In the case when the bijection is a mapping of a graph onto itself, i.e., when G and H are one and the same graph, the bijection is called an automorphism of G.

Graph isomorphism is an equivalence relation on graphs and as such it partitions the class of all graphs into equivalence classes. A set of graphs isomorphic to each other is called an isomorphism class of graphs.

The two graphs shown below are isomorphic, despite their different looking drawings.

Graph G Graph H An isomorphism
between G and H
Graph isomorphism a.svg Graph isomorphism b.svg f(a) = 1f(b) = 6

f(c) = 8

f(d) = 3

f(g) = 5

f(h) = 2

f(i) = 4

f(j) = 7