Factors in R are data structures for representing categorical variables, i.e. data that can assume only a finite number of distinct values. Although it is possible to store such data in any of the data structures that we have already covered (vectors, matrices, arrays, lists and data frames), doing so does not take advantage of the fact that these variables can take only a finite number of distinct values. When we know beforehand that our observations can take one of a predetermined set of values, R can optimize both storage and permitted operations for us.

Factors in R can represent both numeric and string data. If you consider the six atomic data types, R already optimizes for the logical data type since it can take only two distinct values. When you encounter a variable, whether it holds strings or numbers, that can take only a limited number of unique values, you should consider using a factor.

## Example

Suppose in an experiment we roll a die 10 times. Each time the die can show a value from 1 to 6. Our payout is defined as twice the value that shows up on the die. We record our payouts as follows:

```r
> dice.throw.result <- c(2, 4, 6, 8, 10, 12, 8, 6, 4, 2)
```

Since the outcome of our experiment can take only six distinct values (2, 4, 6, 8, 10 and 12), we should really be using the factor data structure as follows:

```r
> dice.throw.result.factor <- factor(dice.throw.result)
> str(dice.throw.result.factor)
 Factor w/ 6 levels "2","4","6","8",..: 1 2 3 4 5 6 4 3 2 1
```

As the output of str function shows, R has created 6 levels – “2”, “4”, “6”, “8”, “10”, and “12”.

R also creates a vector of integers that stores indexes in the levels vector.

In this case R did not store the numbers 2, 4, 6, 8, 10… as the values. Instead R stored integers 1, 2, 3 … which are indexes in the levels vector.
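The mapping between the stored integers and the levels can be inspected directly. The following is a minimal sketch that rebuilds the dice example; the names `codes` and `payouts` are introduced here only for illustration. It also shows a common pitfall when converting a factor of numbers back to numeric values.

```r
dice.throw.result <- c(2, 4, 6, 8, 10, 12, 8, 6, 4, 2)
dice.throw.result.factor <- factor(dice.throw.result)

# the unique values R found, stored as strings
levels(dice.throw.result.factor)

# the integer codes actually stored (indexes into the levels vector)
codes <- as.integer(dice.throw.result.factor)
codes

# pitfall: as.numeric() on a factor returns the codes, not the payouts
as.numeric(dice.throw.result.factor)

# to recover the original numbers, go through the level labels
payouts <- as.numeric(as.character(dice.throw.result.factor))
payouts
```

Going through `as.character()` first is what turns the labels "2", "4", ... back into the numbers we recorded.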

## Creating Factors in R

There are three popular ways of creating factors in R –

- The ‘constructor’ factor function
- The gl function
- The as.factor transformation or casting function

### 1. factor() Function for Creating Factors

The factor() function is used to create a, well, factor.

The only argument you normally need to supply to this function is a container data structure, such as a vector, holding the values that you want to represent as a factor.

#### Example

Let’s say we want to record the type of government in various countries. In the current politico-cultural scenario one can say that the type of a government in any present day country can be classified either as a democracy or autocracy. We can create a factor for the types of governments as follows –

```r
> # create a vector
> govts <- c("Democracy", "Autocracy")
> # pass the vector to the factor function to create a factor
> govt.types <- factor(govts)
> typeof(govt.types)
[1] "integer"
> class(govt.types)
[1] "factor"
```

A few points are worth noting from the code above. The factor function takes the values as its main argument and returns a factor data structure. The values can be in one of the container data structures, such as a vector, matrix or array, that we discussed earlier. In this example we passed in a vector of character data type containing the types of government and R returned a data structure of class "factor". Note that the typeof function gives integer as the type; we will return to this point shortly. Let's try to gather more details about this new data structure:

```r
> # look into the newly created factor
> govt.types
[1] Democracy Autocracy
Levels: Autocracy Democracy
> # some more details
> str(govt.types)
 Factor w/ 2 levels "Autocracy","Democracy": 2 1
```

R stores our factor with 2 levels – Autocracy and Democracy and values 2 1. To know what this means we need to look at how R stores a factor internally. We will do that in the next section after analyzing all the ways of creating factors in R.

### 2. gl() Function for Creating Factors

The gl() function is typically used to ‘generate’ rather than create a factor. It takes as argument the number of levels desired and the number of times each value must be repeated in the resulting factor.

```r
> gl.demo <- gl(2, 3)
> str(gl.demo)
 Factor w/ 2 levels "1","2": 1 1 1 2 2 2
```

R interpreted the first argument 2 to imply that there are 2 distinct values (levels) in the data. The second argument 3 tells how many times each distinct value must be repeated to fill our factor with values.

It is also possible to pass a labels argument to the gl function. If you pass this argument R names the levels accordingly.

```r
> # create a factor containing names of 2 major rivers
> # repeat each 3 times
> gl.demo <- gl(2, 3, labels=c("Amazon", "Ganga"))
> str(gl.demo)
 Factor w/ 2 levels "Amazon","Ganga": 1 1 1 2 2 2
> gl.demo
[1] Amazon Amazon Amazon Ganga  Ganga  Ganga
Levels: Amazon Ganga
```
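The gl function also accepts a `length` argument, which makes the generated pattern repeat until the factor has the requested number of elements. As a small sketch (the name `gl.alternating` is chosen here for illustration), repeating each level once and asking for six elements yields an alternating pattern:

```r
# 2 levels, each repeated once, recycled until the factor has 6 elements
gl.alternating <- gl(2, 1, length = 6, labels = c("Amazon", "Ganga"))
gl.alternating

# the underlying codes alternate between the two levels
as.integer(gl.alternating)
```

This can be handy for laying out the treatment column of a designed experiment.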

### 3. as.factor() Transformer Function for Creating Factors

True to the spirit of R, even factors have a transformer function. The as.factor function creates a factor from existing data on a best-effort basis.

Suppose we have a vector of rivers which we want to convert into a factor. We can do this by using the as.factor function:

```r
> rivers = c("Amazon", "Ganga", "Ganga")
> rivers.factor = as.factor(rivers)
> rivers.factor
[1] Amazon Ganga  Ganga
Levels: Amazon Ganga
```
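One practical difference between as.factor and factor shows up when the input is already a factor. The sketch below (with `ganga.only` as an illustrative name) suggests that as.factor hands an existing factor back unchanged, while factor rebuilds the levels from the values that are actually present:

```r
rivers.factor <- as.factor(c("Amazon", "Ganga", "Ganga"))

# keep only the "Ganga" observations; the unused level "Amazon" is retained
ganga.only <- rivers.factor[rivers.factor == "Ganga"]
levels(ganga.only)

# as.factor() returns an existing factor as it is
levels(as.factor(ganga.only))

# factor() recomputes the levels and drops the unused one
levels(factor(ganga.only))
```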

## Internal Structure of a Factor

To take advantage of the finite number of possible values that a factor can have, R stores a factor internally in a simple lookup structure that resembles an associative array. We will discuss its design in a short while.

A factor is stored internally using two vectors – an integer vector and a character vector.

The first vector, of integers, stores the values that are currently present in our factor. This vector of integer values is associated with a vector of characters (i.e. strings) called levels. The levels vector stores the unique values that the factor is capable of taking. The values are stored as a vector of integers, where each integer is an index into the levels vector. To summarize, the integer vector of values holds indexes into the character vector of levels.

### Example

The diagram below illustrates how R stores factors internally. We have a factor whose elements can take one of two values, “A” or “B”. We perform an experiment and record our observations as “A”, “B”, “B”, “A”.

If we create a factor in this scenario with levels {“A”, “B” } and recorded values as {“A”, “B”, “B”, “A”}, R is going to create a vector of integers to hold values and will store {1, 2, 2, 1} in it. It will also create a character vector and store “A” and “B” in it. The vector of values {1, 2, 2, 1} is actually the index value in the second vector. “A” is assumed to reside at index 1 and “B” at index 2.
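The same scenario can be reproduced at the console. This is a minimal sketch; the names `observations`, `codes`, `lookup` and `rebuilt` are introduced here for illustration only.

```r
observations <- factor(c("A", "B", "B", "A"), levels = c("A", "B"))

# strip the class to see the integer vector and its levels attribute
unclass(observations)

codes <- as.integer(observations)   # the integer vector of values
lookup <- levels(observations)      # the character vector of levels

# indexing the levels vector with the codes rebuilds the original strings
rebuilt <- lookup[codes]
rebuilt
```

Indexing `lookup` by `codes` is exactly the "index into the levels vector" relationship described above.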

**Note for computer science students:** The above data structure in essence implements a dictionary. A dictionary maps a set of distinct keys to their values. This form of dictionary implementation is called an associative array here, as there are multiple arrays referencing each other. Other possible implementations of factors could have been a hash table, an index-based array, a binary search tree, etc. Java happens to call its dictionaries Maps.

## Nominal and Ordinal Variables in Factors

A nominal variable is a kind of categorical variable where the levels don’t bear any qualitative or quantitative relationship to each other.

An ordinal variable, on the other hand, is a categorical variable whose levels have a meaningful order, although the distance between two levels is not necessarily defined.

For example, a factor recording the number showing on the face of a die is treated here as nominal: for the purpose of recording outcomes, a die showing 2 is not related in any manner to a die showing 1. But a factor that records grades is ordinal. A student getting “A” can be deemed to have performed better than a student getting “B”, who in turn performed better than a student getting “C”, even though we cannot say by how much.

By default R creates nominal factors and stores the levels in sorted order (alphabetical for strings). This sometimes works fine for nominal factors but not for ordinal variables. For ordinal variables you need to pass in the levels and tell R that the levels are ordered. You can also change the order of levels for nominal variables using the levels argument to the factor function.
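As an illustration of the levels argument on a nominal factor, the sketch below uses a made-up vector of sizes (the variable names are hypothetical). The default sorts the levels alphabetically, while an explicit levels vector fixes the order and, with it, the integer codes.

```r
sizes <- c("medium", "small", "large", "small")

# default: levels are sorted alphabetically
default.sizes <- factor(sizes)
levels(default.sizes)

# explicit level order, e.g. for display purposes
custom.sizes <- factor(sizes, levels = c("small", "medium", "large"))
levels(custom.sizes)

# the codes now follow the order given in the levels argument
as.integer(custom.sizes)
```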

### Example

Consider the example below where we create a factor containing grades of two students.

```r
> grades = factor(c("A", "F"), order=T, levels=c("F", "E", "D", "C", "B", "A"))
> str(grades)
 Ord.factor w/ 6 levels "F"<"E"<"D"<"C"<..: 6 1
```

R creates an ordered factor which has only 2 values at present (indexes 6 and 1) and has 6 distinct levels. R orders them according to the order in which they appear in the levels vector. Since we have listed grade “F” first, it is given index 1, and “A”, being last, is stored at index 6.

This example also illustrates that we may create levels with values that are not present in our values at the moment. For example, grades “B”, “C”, “D”, and “E” don’t appear as values but R still knows that our ordinal values might assume these values at some point.
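Two consequences of declaring the factor as ordered can be sketched as follows. The code recreates the grades factor (spelling out the `ordered` argument in full) and assumes nothing beyond base R: comparison operators respect the level order, and table reports every level, including those without observations.

```r
grades <- factor(c("A", "F"), ordered = TRUE,
                 levels = c("F", "E", "D", "C", "B", "A"))

# ordered factors support comparison operators
grades[1] > grades[2]

# comparison against a level name works element-wise
at.least.c <- grades >= "C"
at.least.c

# the highest grade observed
max(grades)

# table() lists all six levels; unobserved grades get a count of 0
grade.counts <- table(grades)
grade.counts
```

On an unordered factor the same comparisons would not be meaningful, which is why the ordering has to be declared explicitly.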

R also allows you to rename the levels primarily for output purposes. Continuing with the above example, suppose the meaning of grades is not readily apparent to your audience. You want to help them figure out that A is outstanding and E is almost out. You can accomplish this using the labels argument of the factor function. We extend the example above to have a few more data points so that our final plot doesn’t look empty.

```r
> grades = factor(c("A", "B", "A", "B", "B", "F"), order=T, levels=c("F", "E", "D", "C", "B", "A") ,
+   labels=c("Fail", "Barely Made It", "Hopeless", "Needs Improvement", "Good", "Excellent")
+ )
> str(grades)
 Ord.factor w/ 6 levels "Fail"<"Barely Made It"<..: 6 5 6 5 5 1
> plot(grades)
```

The output of the plot command is shown below. The bar plot uses the labels instead of the original values.

## Factors in Data Frame

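In versions of R before 4.0.0, the data.frame function, which is used to construct data frames, by default treats all character data as factors. Since R 4.0.0 the default is stringsAsFactors = FALSE, so this conversion happens only if you pass stringsAsFactors = TRUE. The conversion works well in many cases because strings are often used to record categorical data. Consider the example below, which shows the pre-4.0.0 behavior: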

```r
> programming_languages = c("LISP", "FORTRAN", "ALGOL", "C")
> year_programming_languages = c(1957, 1959, 1966, 1969)
> langs_data = data.frame(programming_languages, year_programming_languages)
> str(langs_data)
'data.frame':	4 obs. of  2 variables:
 $ programming_languages     : Factor w/ 4 levels "ALGOL","C","FORTRAN",..: 4 3 1 2
 $ year_programming_languages: num  1957 1959 1966 1969
```

Note that R converted the character vector named programming_languages to a factor with 4 levels.

Sometimes you may want to retain the data as a character vector rather than converting to a factor. For example, if you are recording names of students, it is much more intuitive to store the data as character vector because you are not likely to find much repetition in data.

You can avoid the conversion (the default before R 4.0.0) by setting stringsAsFactors = FALSE in the data.frame function as shown below:

```r
> langs_data2 = data.frame(programming_languages, year_programming_languages, stringsAsFactors = F)
> str(langs_data2)
'data.frame':	4 obs. of  2 variables:
 $ programming_languages     : chr  "LISP" "FORTRAN" "ALGOL" "C"
 $ year_programming_languages: num  1957 1959 1966 1969
```
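If a data frame already holds a character column, that column can be converted to a factor afterwards. The sketch below uses shortened, hypothetical column names (`name`, `year`) and sets stringsAsFactors explicitly so that it behaves the same regardless of the R version.

```r
langs <- data.frame(
  name = c("LISP", "FORTRAN", "ALGOL", "C"),
  year = c(1957, 1959, 1966, 1969),
  stringsAsFactors = FALSE
)

# convert a single column to a factor after the data frame exists
langs$name <- factor(langs$name)

# name is now a factor with 4 levels; year is still numeric
str(langs)
```

Converting column by column keeps the choice explicit: only the variables that are truly categorical become factors.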

## Accessing Members of a Factor

You can access members of a factor in much the same way as you would access members of an unnamed vector:

```r
> programming_langs = factor(c("LISP", "SCHEME", "C++", "Python", "R", "PHP", "Java"))
> programming_langs[1]
[1] LISP
Levels: C++ Java LISP PHP Python R SCHEME
> programming_langs[c(1,5)]
[1] LISP R
Levels: C++ Java LISP PHP Python R SCHEME
> programming_langs[c(T, F, F, F, T, F)]
[1] LISP R    Java
Levels: C++ Java LISP PHP Python R SCHEME
> programming_langs[-1]
[1] SCHEME C++    Python R      PHP    Java
Levels: C++ Java LISP PHP Python R SCHEME
> programming_langs[-5]
[1] LISP   SCHEME C++    Python PHP    Java
Levels: C++ Java LISP PHP Python R SCHEME
```
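Notice that every subset above still carries all seven levels. When the unused levels are not wanted, they can be dropped. This is a short sketch using base R's droplevels function and the `drop` argument of the subsetting operator; `subset.langs` and `trimmed` are illustrative names.

```r
programming_langs <- factor(c("LISP", "SCHEME", "C++", "Python", "R", "PHP", "Java"))

subset.langs <- programming_langs[c(1, 5)]

# the subset has 2 values but still remembers all 7 levels
nlevels(subset.langs)

# remove levels that no longer occur in the subset
trimmed <- droplevels(subset.langs)
levels(trimmed)

# the same effect at subsetting time
levels(programming_langs[c(1, 5), drop = TRUE])
```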

## Conclusion

After examining basic data structures in R, we examined how to store categorical variables in R. Categorical variables are pretty common in statistics, and R has suitable optimizations to take advantage of the categorical nature of such data. We will now turn our attention to functions in R.
