iris virginica in a field

R is an excellent language for machine learning. R primarily excels in the following three areas:

  1. Data manipulation
  2. Plotting
  3. Built-in libraries

R manipulates data using a native, in-built data structure called a data frame. Let’s load some data now and see how R can help us to work with it.

The data

Luckily for us, R comes with some built-in datasets that we can simply load and have ready to go. We’ll use one such dataset called iris to test some of R’s machine learning capabilities. This dataset contains 50 flowers, each one of three different different species: iris setosa, iris versicolor, and iris virginica. We can classify these data points using a data frame containing 4 different measurement attributes: sepal length, sepal width, petal length, and petal width. Let’s load the data and get started.


The R head command lets us inspect the first few elements of a dataset. You’ll see that the first elements all have the same class label, given by the Species column. When training a machine learning model, we want our data to be randomized so let’s fix this.

Shuffling the data

The elements of our data frame are the rows, so to shuffle the data we need to permute the rows. To do this we can use the built-in R functions, nrow and samplenrow(iris) returns a range the size of the iris dataset. Applying sample to this returns a permutation of the range. We can then index into the dataset using the shuffled row indices as shown below.

shuffled <- iris[sample(nrow(iris)),]

Now, when we take a look at the head of the dataset, we see that we’re getting a mix of different flowers, which is great!

Train and validation splits

Now that we’ve got the data shuffled, let’s partition it into a training and validation set. We’ll hold out 20 percent of the data points for the validation set. We can get the two partitions as follows:

n <- nrow(shuffled)
split <- .2 * n
train <- shuffled[1:split,]
validate <- shuffled[split:n,]

Defining a model

Once that we have our data divided into train and validation splits, we can start training a model on it. We’ll first explore using boosted decision trees for classification with an R package called xgboost. Let’s first install it.


Now, we can construct the model. The XGBoost interface requires that we section the training data into the actual X data and the species labels. To do this, we just select the corresponding columns from the data frame. We also need to convert the Species labels into numerical values in ({0, 1, 2}) as XGBoost only handles numbered class labels. We’ll do the same for the validation set.

# Training set
dcols <- c(1:4)
lcols <- c(5)
train$Species <- as.character(train$Species)
train$Species[train$Species == "setosa"] <- 0
train$Species[train$Species == "virginica"] <- 1
train$Species[train$Species == "versicolor"] <- 2
X <- train[,dcols]
labels <- train[,lcols]

# Validation Set
validate$Species <- as.character(validate$Species)
validate$Species[validate$Species == "setosa"] <- 0
validate$Species[validate$Species == "virginica"] <- 1
validate$Species[validate$Species == "versicolor"] <- 2
Xv <- validate[,dcols]
labelsV <- validate[,lcols]

Training the model

Now that we’ve gotten that out of the way, we can begin training the model. The interface for xgboost is super simple; we can train the model with a single call.

booster <- xgboost(data = as.matrix(X), label = labels, max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "multi:softmax", num.class = 3)
[1] train-merror:0.000000 
[2] train-merror:0.000000 

The parameters we used deserve some explanation.

  1. We set max.depth = 2 to use decision trees that have a maximum depth of 2. Deeper trees can give higher accuracy but are also higher in variance. Depth 2 should be a good balance for our purposes.
  2. eta specifies the learning rate for our model. This is a hyperparameter that requires some experimentation and tuning to get right. We’ll use 1 for this example.
  3. nthread specifies the number of CPU threads to use while training the model. This is a small dataset so we don’t need many threads, but we would for larger datasets.
  4. nrounds specifies the number of training epochs (passes) to perform over the data. We’ll start with 2.

Evaluating model performance

Now that we’ve trained our boosted tree model, we can evaluate its performance on the validation set. To do this, we can use the predict function that comes packaged with xgboost in order to generate our predictions.

preds <- predict(booster, as.matrix(Xv))
[1] 1 1 0 0 1 0

You can see that our boosted model is predicting nicely. All that is left is to calculate is its accuracy. We will calculate the mean of all entries of preds that are not equal to the corresponding entries of labelsV.

err <- mean(preds != labelsV)
print(paste("validation-error=", err))
[1] "validation-error= 0.0495867768595041"
print(paste("validation-accuracy=", 1 - err))
[1] "validation-accuracy= 0.950413223140496"

Nice! The validation accuracy on the dataset is equal to about 95 percent, which is a great performance from the model.

Visualizing our results

To finish off the project, we’ll just visualize our predictions. First, we will plot the native data using a library called ggvis. Before that though, we must install the package.


Now, since we’re visually limited to two dimensions, we will choose to plot our classes vs. the Sepal Length and Width attributes. This can be done as follows.

Registered S3 method overwritten by 'dplyr':
  method           from
data <- iris
data %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()

To visualize how our model predicts in comparison, we’ll quickly run it across all data points, not just the validation set. To combine the training and validation data, we use the rbind function, which just joins the two data frames vertically. We also undo the work we did before, converting from numeric labels back to species names.

all_data <- rbind(X, Xv)
preds <- predict(booster, as.matrix(all_data))
all_data["Species"] <- preds
all_data$Species[all_data$Species == 0] <- "setosa"
all_data$Species[all_data$Species == 1] <- "virginica"
all_data$Species[all_data$Species == 2] <- "versicolor"

Everything looks good! Let’s plot the final result, using the same code as before.

all_data %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()

We can visually compare the true plot with the plot of our predictions. There should only be a few points misclassified between the two!

To learn more about machine learning models in R, visit