What are Data Frames in R?
Data frames in R are a two-dimensional data structure used to hold data of heterogeneous types in a table-like format.
A data frame consists of rows and columns. Columns are the variables that you record in an experiment. Rows are the observations: each row is a single instance of an observation.
Another way of looking at a data frame is as a matrix that can hold data of different types. That is, a row can contain values of different types, but all values in a column share the same type.
Yet another way of looking at a data frame is as a list of equal-length vectors or, put simply, a collection of vectors that have been column-bound (as if cbind had been applied to them).
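The "list of vectors" view can be checked directly. The snippet below is a minimal sketch; the readings data frame and its column names are made up purely for illustration.

```r
# Hypothetical example data; the names are invented for this sketch.
readings <- data.frame(
  sensor = c("a", "b", "c"),
  value  = c(1.5, 2.5, 3.5),
  ok     = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is_a_list   <- is.list(readings)        # a data frame is a list underneath
col_classes <- sapply(readings, class)  # one type per column
col_lengths <- lengths(readings)        # every column has the same length

print(is_a_list)
print(col_classes)
print(col_lengths)
```

Each column keeps its own type, while every column has the same number of elements, which is exactly what makes the table rectangular.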
The following code snippet shows the structure of a data frame called mtcars that ships with R:
> class(mtcars)
[1] "data.frame"
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Creating a Data Frame in R
As with other data structures in R, there are two related ways of creating data frames:
- Using the 'constructor' function data.frame
- Using the as.data.frame function (which data.frame calls internally)
Creating Data Frame using data.frame Function
Let's create a data frame that we will use for illustration in the rest of the article. We will record data about our favorite lowlifes, viz. politicians. We want to record their names (which I will hide for my security), the last criminal offence they committed and the date of their last conviction in a court of law. I agree this is going to be a pretty nasty example, but there is a message: name and shame politicians when you can.
Let's first prepare the data. We will then pump all of it into a data frame.
> politicians <- c("P1", "P2", "P3", "P4", "P5")
> offences <- c("rape", "murder", "kidnapping", "terrorism", "mass murder")
> convicted.on <- as.Date(c("1999-01-01", "2003-03-04" ,
+                           "2008-09-17", "2016-06-06", "2022-02-21"))
Note that all three vectors have the same length because they hold data pertaining to five different low-lifes err… politicians. Our next task is to hook these three scattered vectors into a single data frame. We do that by using the data.frame function as shown below:
> lowlifes <- data.frame(politicians, offences, convicted.on, stringsAsFactors = FALSE)
> lowlifes
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> class(lowlifes) # pretty unfortunate that
                  # class of these lowlifes is data.frame
[1] "data.frame"
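The only new thing in the code above is the stringsAsFactors argument to data.frame. In versions of R before 4.0.0 this argument defaulted to TRUE, so character data was automatically converted to factors unless you set it to FALSE. Since R 4.0.0 the default is FALSE, but setting it explicitly keeps the behavior the same across versions.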
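To see what the argument actually changes, here is a small sketch that builds the same column both ways. The vector x and the two data frame names are illustrative only.

```r
# Illustrative data; not part of the running example.
x <- c("a", "b", "a")

as_chr <- data.frame(x, stringsAsFactors = FALSE)  # keep strings as strings
as_fct <- data.frame(x, stringsAsFactors = TRUE)   # convert strings to factors

print(class(as_chr$x))   # "character"
print(class(as_fct$x))   # "factor"
print(levels(as_fct$x))  # "a" "b"
```

Factors are useful for categorical variables, but for free-form text such as names they usually get in the way, which is why the example above keeps them as character vectors.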
Creating Data Frames with Non-Default Row Names
Among the other arguments of the data.frame function, there is one that you will probably use quite often: row.names. Using this argument you can pass in a vector that holds a distinct, unique name for each row of your data.
> lowlifes.named <- data.frame(
+     offences, convicted.on, row.names=politicians, stringsAsFactors = FALSE)
> lowlifes.named
      offences convicted.on
P1        rape   1999-01-01
P2      murder   2003-03-04
P3  kidnapping   2008-09-17
P4   terrorism   2016-06-06
P5 mass murder   2022-02-21
Instead of the integer row numbers we had previously, we are now using the names of the politicians as row names. (Sympathies for the row)
Now we have a data frame called lowlifes that holds all the data about these …, well, you get the point.
Of course, data.frame can be used to create data frames of much more socially useful and better things. Otherwise there would be no sense in learning about them (and I mean data frames).
Creating Data Frame using as.data.frame Generic Function
The as.data.frame function can be used to convert another data structure, such as a vector or a matrix, to a data frame, as demonstrated below:
> M1 <- matrix(1:4, 2, 2)
> as.data.frame(M1)
  V1 V2
1  1  3
2  2  4
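The default V1, V2 column names come from a matrix that has no column names of its own. As a rough sketch (all object names here are made up), a matrix with column names keeps them, and a named list converts just as easily:

```r
# A matrix with column names; the names carry over to the data frame.
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))
df_from_matrix <- as.data.frame(m)

# A named list of equal-length vectors becomes one column per element.
df_from_list <- as.data.frame(
  list(id = 1:3, label = c("x", "y", "z")),
  stringsAsFactors = FALSE
)

print(df_from_matrix)
print(df_from_list)
```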
Characteristic Features of Data Frames in R
The following are defining features of data frames:
- Every row of a data frame has a unique name. Duplicate row names are not allowed.
- No row can have its name missing, i.e. NA is not a valid row name.
- Row names must be of integer or character type. You cannot, for example, have logicals as row names.
- If you don't explicitly provide row names, R uses row numbers starting with the integer 1 as row names.
- NULL is a valid value when assigning row names. Assigning NULL resets the row names to seq_len(nrow(x)) for a data frame x, or in other words the rows will be named 1 to the number of rows.
- The column names should be non-empty. It's technically allowed to have duplicate column names, but quite often doing that makes no sense at all. For example, during an observation you don't want to have two variables to record whether a politician is also a criminal (anyway, the outcome is a foregone conclusion).
- You can skip providing column names explicitly, in which case R derives each column name from the expression you passed in: the variable name where there is one, otherwise a syntactically valid name built from the expression's text.
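Several of the rules above can be checked in a few lines. This is only a sketch on a throwaway data frame; df, v and the other names are invented for illustration.

```r
# Hypothetical data frame, used only to illustrate the rules above.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

default_names <- rownames(df)         # automatic row names "1", "2", "3"

rownames(df) <- c("r1", "r2", "r3")   # custom, unique row names are fine

# Duplicate row names are rejected with an error (R also raises a warning,
# which is silenced here to keep the output tidy).
dup_result <- tryCatch(
  suppressWarnings({
    rownames(df) <- c("r1", "r1", "r3")
    "accepted"
  }),
  error = function(e) "rejected"
)

rownames(df) <- NULL                  # reset to automatic row names
reset_names <- rownames(df)

# Column names derived from the expressions passed to data.frame.
v <- 1:3
derived_names <- names(data.frame(v, 1:3))

print(dup_result)
print(reset_names)
print(derived_names)
```

The derived name for the unnamed expression 1:3 is not pretty, which is a good reason to name columns explicitly.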
Accessing Members of a Data Frame
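You can access members of a data frame in a number of ways, shown in the next few sections. In the transcripts that follow, df2 and df3 hold the same data as the lowlifes and lowlifes.named data frames created earlier.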
Accessing Single Element By Its Address
You can use the position of a member to access it in a data frame –
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2[3,2] # 3rd row, 2nd col.
           # What was the offence of the 3rd politician?
[1] "kidnapping"
Similarly, you can use row and column names instead of positions. For example, you can also find out about the atrocities of the 3rd politician as follows:
> df3
      offences convicted.on
P1        rape   1999-01-01
P2      murder   2003-03-04
P3  kidnapping   2008-09-17
P4   terrorism   2016-06-06
P5 mass murder   2022-02-21
> df3['P3' , 'offences']
[1] "kidnapping"
Accessing Rows From a Data Frame
Using the '[' operator, you can select entire rows.
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2[3, ]
  politicians   offences convicted.on
3          P3 kidnapping   2008-09-17
> df2[c(3,5) , ] # know the good deeds of mr. 3 & 5
  politicians    offences convicted.on
3          P3  kidnapping   2008-09-17
5          P5 mass murder   2022-02-21
> df2[c(T,F,TRUE,FALSE,TRUE), ]
  politicians    offences convicted.on
1          P1        rape   1999-01-01
3          P3  kidnapping   2008-09-17
5          P5 mass murder   2022-02-21
You can, of course, use names of rows and columns instead of their positions –
> df3
      offences convicted.on
P1        rape   1999-01-01
P2      murder   2003-03-04
P3  kidnapping   2008-09-17
P4   terrorism   2016-06-06
P5 mass murder   2022-02-21
> df3['P2', ] # Know more about the meaningful life of politician #2
   offences convicted.on
P2   murder   2003-03-04
> df3[c('P2' ,'P5'), ] # Know more about the meaningful
                       # lives of politicians number 2,5
      offences convicted.on
P2      murder   2003-03-04
P5 mass murder   2022-02-21
> df3[c(T,F,T,F,F), ]
     offences convicted.on
P1       rape   1999-01-01
P3 kidnapping   2008-09-17
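In practice the logical vector is rarely typed by hand; it usually comes from a condition on one or more columns. A minimal sketch using the built-in mtcars data (the variable name efficient is made up):

```r
# Build the logical row selector from a condition instead of typing it out.
is_efficient <- mtcars$cyl == 4 & mtcars$mpg > 30

# Rows where the condition is TRUE are kept; all columns are returned.
efficient <- mtcars[is_efficient, ]

print(efficient[, c("mpg", "cyl")])
```

The comma after the row selector still matters: leaving it out would make R treat the logical vector as a column selector instead.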
Accessing Columns From a Data Frame
Different ways of extracting columns mirror the ways of extracting rows.
Continuing our journey into knowing more about the illustrious politicians around us, we will now extract their ‘attributes’ – (Pun intended)
Using '['
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2['offences'] # An incomplete list of what all politicians do
     offences
1        rape
2      murder
3  kidnapping
4   terrorism
5 mass murder
> df2[c(2)] # The same column, selected by position
     offences
1        rape
2      murder
3  kidnapping
4   terrorism
5 mass murder
> df2[c(3)] # An imaginary list of conviction dates of politicians
  convicted.on
1   1999-01-01
2   2003-03-04
3   2008-09-17
4   2016-06-06
5   2022-02-21
> df2[c(TRUE, TRUE)] # use logical vector to
                     # know more about illogical parasites
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2[c(FALSE, TRUE, TRUE)]
     offences convicted.on
1        rape   1999-01-01
2      murder   2003-03-04
3  kidnapping   2008-09-17
4   terrorism   2016-06-06
5 mass murder   2022-02-21
Using names
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2['politicians'] # wonderful, lovely 'people', aren't they?
  politicians
1          P1
2          P2
3          P3
4          P4
5          P5
Retrieving Columns of Data Frames in R
Suppose that you are so enamored of the good deeds of politicians that you want to gather all their wonderful deeds as a vector for further analysis, exploration and citation. R can help you in this charitable act in three different ways:
1. Extracting Column Vector Using $ Operator
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2$offences
[1] "rape"        "murder"      "kidnapping"  "terrorism"   "mass murder"
There you go, R gives a character vector with all the details.
2. Extracting Column Vector Using [ Operator
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2[ ,'politicians']
[1] "P1" "P2" "P3" "P4" "P5"
> df2[ , 1] # You are telling R
            # to give you the column called 'politicians'
            # which is the first column.
[1] "P1" "P2" "P3" "P4" "P5"
3. Extracting Column Vector Using [[ Operator
'[[' is a source of perpetual pain in R. But you can use it with a column number or name to obtain a vector:
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2[[1]]
[1] "P1" "P2" "P3" "P4" "P5"
> df2[['politicians']]
[1] "P1" "P2" "P3" "P4" "P5"
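Much of the pain goes away once you notice what each form returns. The following sketch compares them on a small made-up data frame; df and its columns are illustrative.

```r
# Hypothetical data frame for comparing the extraction operators.
df <- data.frame(id = 1:3, label = c("x", "y", "z"), stringsAsFactors = FALSE)

single_bracket <- df["label"]                  # one-column data frame
double_bracket <- df[["label"]]                # plain vector
with_comma     <- df[, "label"]                # plain vector (dimension dropped)
kept_as_frame  <- df[, "label", drop = FALSE]  # one-column data frame

print(class(single_bracket))  # "data.frame"
print(class(double_bracket))  # "character"
print(class(with_comma))      # "character"
print(class(kept_as_frame))   # "data.frame"
```

Passing drop = FALSE is handy inside functions, where you often want a data frame back no matter how many columns were selected.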
Operations on Data Frames
There are a number of operations permitted on data frames –
Transpose of a Data Frame
As indicated in the opening paragraph, a data frame can be viewed as a matrix whose elements are of heterogeneous data types. You can calculate the transpose of a data frame using the t function. Though the transpose of a data frame does not make much mathematical sense, it can be used to swap rows with columns. Note that the result is a matrix rather than a data frame, and because a matrix holds a single type, the values here are all converted to character strings.
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> t(df2)
             [,1]         [,2]         [,3]         [,4]         [,5]
politicians  "P1"         "P2"         "P3"         "P4"         "P5"
offences     "rape"       "murder"     "kidnapping" "terrorism"  "mass murder"
convicted.on "1999-01-01" "2003-03-04" "2008-09-17" "2016-06-06" "2022-02-21"
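The quotes around every value in the output are a hint that t does not hand back a data frame. A quick sketch on a made-up two-column data frame (df and tm are illustrative names) makes this explicit:

```r
# Hypothetical data frame with one integer and one character column.
df <- data.frame(id = 1:2, label = c("x", "y"), stringsAsFactors = FALSE)

tm <- t(df)

print(is.matrix(tm))      # TRUE: the result is a matrix
print(is.data.frame(tm))  # FALSE
print(typeof(tm))         # "character": mixed columns are coerced to one type
print(rownames(tm))       # the old column names become row names
```

If you need a data frame again, as.data.frame(tm) converts the matrix back, although the original column types are not restored automatically.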
Adding Columns to Data Frame
Suppose that you also want to record the number of pending criminal cases of each, well, politician. No hassles. You can use the $ operator even after the data frame is constructed.
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2$pending.cases <- c(1024, 2048, 3072, 4096, 5120)
> df2
  politicians    offences convicted.on pending.cases
1          P1        rape   1999-01-01          1024
2          P2      murder   2003-03-04          2048
3          P3  kidnapping   2008-09-17          3072
4          P4   terrorism   2016-06-06          4096
5          P5 mass murder   2022-02-21          5120
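The $ operator is not the only way in. As a rough sketch on a throwaway data frame (all names below are made up), the same thing can be done with [[ or cbind, and R complains when the new column does not fit the number of rows:

```r
# Hypothetical data frame with three rows.
df <- data.frame(id = 1:3)

df$squared    <- df$id^2                          # computed from a column
df[["label"]] <- c("x", "y", "z")                 # same idea, using [[
df <- cbind(df, flag = c(TRUE, FALSE, TRUE))      # bind a new named column

# A vector of length 2 cannot be fitted into 3 rows, so R raises an error.
mismatch <- tryCatch({
  df$bad <- 1:2
  "accepted"
}, error = function(e) "rejected")

print(names(df))
print(mismatch)
```

Be aware that a shorter vector is silently recycled when the number of rows is an exact multiple of its length, so a single value fills the whole column.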
Removing Columns from a Data Frame
Removing a column from a data frame is equally easy. Using the $ operator you set the column to NULL –
> df2
  politicians    offences convicted.on pending.cases
1          P1        rape   1999-01-01          1024
2          P2      murder   2003-03-04          2048
3          P3  kidnapping   2008-09-17          3072
4          P4   terrorism   2016-06-06          4096
5          P5 mass murder   2022-02-21          5120
> df2$pending.cases <- NULL # remove the pending.cases col
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
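Setting a column to NULL changes the data frame in place. If you would rather keep the original and get a trimmed copy, selection works too. A minimal sketch with made-up names:

```r
# Hypothetical data frame with three columns.
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)

no_b     <- df[setdiff(names(df), "b")]     # drop a column by name
no_first <- df[-1]                          # drop a column by position
only_ac  <- df[names(df) %in% c("a", "c")]  # keep columns via a logical vector

print(names(no_b))      # "a" "c"
print(names(no_first))  # "b" "c"
print(names(df))        # the original is untouched: "a" "b" "c"
```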
Adding Rows to Data Frame
Suppose you have the inner urge to model more politicians and add them to your data frame. R makes that pretty straightforward. In the example here we create a new data frame called more.rascals and rbind it to the existing data frame df2.
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> more.rascals <- data.frame (
+   politicians = c("P6", "P7") ,
+   offences = c("Espionage", "Blackmail"),
+   convicted.on = as.Date(c("2040-01-01", "2049-01-01")),
+   stringsAsFactors = FALSE
+ )
> more.rascals
  politicians  offences convicted.on
1          P6 Espionage   2040-01-01
2          P7 Blackmail   2049-01-01
> lowlives <- rbind(df2, more.rascals)
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
6          P6   Espionage   2040-01-01
7          P7   Blackmail   2049-01-01
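One detail worth knowing: when both arguments are data frames, rbind lines up columns by name, not by position, and refuses to combine frames whose column names differ. The sketch below uses small made-up data frames (a, b and odd are illustrative names).

```r
# Two hypothetical data frames with the same columns in a different order.
a <- data.frame(id = 1:2, label = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(label = "z", id = 3L, stringsAsFactors = FALSE)

combined <- rbind(a, b)   # columns are matched by name

# A data frame with a different column name cannot be appended.
odd <- data.frame(id = 4L, other = "q", stringsAsFactors = FALSE)
odd_result <- tryCatch({
  rbind(a, odd)
  "accepted"
}, error = function(e) "rejected")

print(combined)
print(odd_result)
```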
Deleting Rows from a Data Frame
Suppose some politician P6 threatens you and you agree to remove his details from the lowlives data frame. Deleting rows can be accomplished by using a negative index (the - operator) as shown below. Note that the remaining rows keep their original row names.
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
6          P6   Espionage   2040-01-01
7          P7   Blackmail   2049-01-01
> lowlives <- lowlives[-6, ]
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
7          P7   Blackmail   2049-01-01
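Rows can also be dropped by condition rather than by position, and the gap left in the row names can be closed afterwards. A small sketch with invented data (df and kept are illustrative names):

```r
# Hypothetical data frame with four rows.
df <- data.frame(id = 1:4, label = c("w", "x", "y", "z"),
                 stringsAsFactors = FALSE)

kept <- df[df$label != "x", ]      # drop every row whose label is "x"
names_before <- rownames(kept)     # the original row names survive: "1" "3" "4"

rownames(kept) <- NULL             # renumber the rows from 1
names_after <- rownames(kept)      # "1" "2" "3"

print(names_before)
print(names_after)
```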
Sorting a Data Frame
You can sort a data frame based on values in one or more columns using the order() function. Let’s sort the lowlives data frame based on the alphabetical ordering of the offences that they have committed.
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
7          P7   Blackmail   2049-01-01
> lowlives[order(lowlives[,2]), ]
  politicians    offences convicted.on
7          P7   Blackmail   2049-01-01
3          P3  kidnapping   2008-09-17
5          P5 mass murder   2022-02-21
2          P2      murder   2003-03-04
1          P1        rape   1999-01-01
4          P4   terrorism   2016-06-06
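The order function also accepts several keys and can sort in descending order. Here is a minimal sketch on a made-up data frame; grp and val are illustrative column names.

```r
# Hypothetical data frame with a grouping column and a numeric column.
df <- data.frame(grp = c("b", "a", "b", "a"),
                 val = c(1, 4, 3, 2),
                 stringsAsFactors = FALSE)

# Sort by grp ascending, then by val descending within each group.
by_group <- df[order(df$grp, -df$val), ]

# Sort the whole data frame by val, largest first.
by_val_desc <- df[order(df$val, decreasing = TRUE), ]

print(by_group)
print(by_val_desc)
```

Negating a key only works for numeric columns; for a single key of any type, the decreasing argument does the job.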
Changing Order of Rows or Columns in a Data Frame
The order of columns and rows in a data frame can be altered using various selection mechanisms discussed in the beginning of this article. Consider the trivial example below –
# Changing the order of columns of these lowlifes
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
7          P7   Blackmail   2049-01-01
> lowlives[, c(2, 1, 3)]
     offences politicians convicted.on
1        rape          P1   1999-01-01
2      murder          P2   2003-03-04
3  kidnapping          P3   2008-09-17
4   terrorism          P4   2016-06-06
5 mass murder          P5   2022-02-21
7   Blackmail          P7   2049-01-01
# Changing the order of rows (but not the row.names)
# using same logic
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
> df2[c(2,4,5,1), ]
  politicians    offences convicted.on
2          P2      murder   2003-03-04
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
1          P1        rape   1999-01-01
Conclusion
This article introduced one of the most frequently used data structures in R, viz. data frames. We have already discussed vectors and matrices. We will look at lists in R next.