Coding Raptor

Programmers' Nest

October 15, 2017

Basic Data Structures III: Data Frames in R

What are Data Frames in R?

A data frame in R is a two-dimensional data structure that holds data of heterogeneous types in a table-like format.

A data frame consists of rows and columns. Columns are the variables that you record in an experiment. Rows are the observations: each row is one instance of an observation.

Another way of looking at a data frame is as a matrix that can hold data of different types. Each row can contain values of different types, but all values within a column share the same type.

Yet another way of perceiving a data frame is as a list of equal-length vectors or, put simply, a collection of vectors that have been bound together column-wise (as the cbind operation does).

The following code snippet shows the structure of a data frame called mtcars that ships with R:

> class(mtcars)
[1] "data.frame"
 
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>
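The "list of vectors" view described above can be checked directly. This is a small sketch using only base R and the built-in mtcars data; the variable name col_lengths is just for illustration.

```r
# A data frame is stored as a list whose elements are the columns.
is.list(mtcars)    # TRUE
length(mtcars)     # number of columns: 11
dim(mtcars)        # rows and columns: 32 11

# Every column is a vector with exactly one value per row.
col_lengths <- sapply(mtcars, length)
all(col_lengths == nrow(mtcars))   # TRUE

# str() prints a compact per-column summary of types and values.
str(mtcars)
```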

Creating a Data Frame in R

As with other data structures in R, there are two interconnected ways of creating data frames:

  1. Using the ‘constructor’ function data.frame
  2. Using the as.data.frame function (data.frame calls it internally)

Creating Data Frame using data.frame Function

Let’s create a data frame that we will use for illustration in the rest of the article. We will record data about our favorite lowlifes, viz. politicians. We want to record their names (which I will hide for my own security), the last criminal offence they committed, and the date of their last conviction in a court of law. I agree this is going to be a pretty nasty example, but there is a message: name and shame politicians when you can.

Let’s first prepare the data; we will then pump all of it into a data frame.

> politicians <- c("P1", "P2", "P3", "P4", "P5")
 
> offences <- c("rape", "murder",
+ "kidnapping", "terrorism",
+ "mass murder")
 
> convicted.on <- as.Date(c("1999-01-01", "2003-03-04" ,
+ "2008-09-17", "2016-06-06", "2022-02-21"))

Note that all three vectors have the same length, because each holds data about the same five low-lifes, err… politicians. Our next task is to hook these three scattered vectors into a single data frame. We do that using the data.frame function, as shown below:

> lowlifes <- data.frame(politicians, offences,
+ convicted.on, stringsAsFactors = FALSE)
> lowlifes
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> class(lowlifes) # pretty unfortunate that
  # class of these lowlifes is data.frame
[1] "data.frame"
>

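The only new thing in the code above is the stringsAsFactors argument to data.frame. In R versions before 4.0.0 this logical argument defaulted to TRUE, so character data was automatically converted to factors unless you set it to FALSE. Since R 4.0.0 the default is FALSE, but passing it explicitly keeps the code behaving the same way across versions.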
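To see what stringsAsFactors changes, here is a minimal sketch that builds the same one-column data frame both ways. The names as_factor and as_text are made up for this illustration.

```r
politicians <- c("P1", "P2", "P3")

# Strings converted to a factor.
as_factor <- data.frame(politicians, stringsAsFactors = TRUE)
class(as_factor$politicians)   # "factor"

# Strings kept as plain character data.
as_text <- data.frame(politicians, stringsAsFactors = FALSE)
class(as_text$politicians)     # "character"
```

Factors are handy for categorical variables, but identifiers such as names are usually easier to work with as plain character vectors.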

Creating Data Frames with Non-Default Row Names

Among the other arguments of the data.frame function, one that you will probably use quite often is row.names. With this argument you can pass a vector that holds a distinct, unique name for each row of your data.

> lowlifes.named <- data.frame(
+ offences, convicted.on, row.names=politicians,
+ stringsAsFactors = FALSE)
 
> lowlifes.named
      offences convicted.on
P1        rape   1999-01-01
P2      murder   2003-03-04
P3  kidnapping   2008-09-17
P4   terrorism   2016-06-06
P5 mass murder   2022-02-21
>

Instead of the integer row numbers used previously, we are now using the names of the politicians as row names. (Sympathies for the rows.)

Now we have a data frame called lowlifes that holds all the data about these …, well, you get the point.

Of course, data.frame can be used to create data frames of much more socially useful and better things. Otherwise there would be no sense in learning about them (and I mean data frames).

Creating Data Frame using as.data.frame Generic Function

The as.data.frame function can be used to transform another data structure, such as a vector or a matrix, into a data frame, as demonstrated below:

> M1 <- matrix(1:4, 2, 2)
 
> as.data.frame(M1)
  V1 V2
1  1  3
2  2  4
>
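Two more conversions are worth a quick look. The sketch below assumes nothing beyond base R; the object names m, from_matrix and from_list are invented for the example.

```r
# A matrix with column names keeps them as variable names.
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("x", "y")))
from_matrix <- as.data.frame(m)

# A named list of equal-length vectors becomes one column per element.
from_list <- as.data.frame(
  list(id = 1:2, label = c("a", "b")),
  stringsAsFactors = FALSE
)

names(from_matrix)   # "x" "y"
names(from_list)     # "id" "label"
```

Without dimnames the columns get the default names V1, V2 and so on, as in the M1 example above.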

Characteristic Features of Data Frames in R

The following are defining features of data frames –

  1. Every row of a data frame has a unique name. Duplicate row names are not allowed.
  2. No row can have a missing name, i.e. NA is not a valid row name.
  3. Row names are stored as integer or character values. Anything else you assign, such as logicals, is converted to character rather than kept as is.
  4. If you don’t explicitly provide row names, R uses the row numbers, starting at 1, as row names.
  5. NULL is a valid value when assigning row names. It resets them to seq_len(nrow(x)) for a data frame x; in other words, the rows are renumbered from 1 to the number of rows.
  6. Column names should be non-empty. It’s technically allowed to have duplicate column names, but quite often doing that makes no sense at all. For example, during an observation you don’t want to have two variables to record whether a politician is also a criminal (anyway, the outcome is a foregone conclusion).
  7. You can skip providing column names explicitly, in which case R derives each name from the expression you passed. A variable keeps its own name, while a literal vector such as c(1, 2) yields a name built from the expression and its values (c.1..2.).
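A few of these rules can be tried out in a short session. This is only a sketch on a throwaway data frame; df, dup_result and unnamed are illustrative names, and tryCatch is used so the rejected assignment does not stop the script.

```r
df <- data.frame(x = 1:3, row.names = c("a", "b", "c"))

# Duplicate row names are rejected with an error.
dup_result <- tryCatch(
  {
    row.names(df) <- c("a", "a", "b")
    "accepted"
  },
  error = function(e) "rejected"
)
dup_result       # "rejected"

# Assigning NULL resets the row names to 1, 2, 3.
row.names(df) <- NULL
row.names(df)    # "1" "2" "3"

# An unnamed argument gets a column name derived from its expression.
unnamed <- data.frame(c(1, 2))
names(unnamed)   # "c.1..2."
```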

Accessing Members of a Data Frame

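You can access members of a data frame in a number of ways, shown in the next few sections. In the examples, df2 holds the same data as lowlifes and df3 holds the same data as lowlifes.named.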

Accessing Single Element By Its Address

You can use the position of a member to access it in a data frame –

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2[3,2] # 3rd row, 2nd col.
  # What was the offence of the 3rd politician?
[1] "kidnapping"
 
>

Similarly, you can use the names of rows and columns. For example, you can also find out about the atrocities of the 3rd politician as follows:

> df3
      offences convicted.on
P1        rape   1999-01-01
P2      murder   2003-03-04
P3  kidnapping   2008-09-17
P4   terrorism   2016-06-06
P5 mass murder   2022-02-21
 
> df3['P3' , 'offences']
[1] "kidnapping"
>

Accessing Rows From a Data Frame

Using the ‘[]‘ operator, you can select entire rows.

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2[3, ]
  politicians   offences convicted.on
3          P3 kidnapping   2008-09-17
 
> df2[c(3,5) , ] # know the good deeds of mr. 3 & 5
  politicians    offences convicted.on
3          P3  kidnapping   2008-09-17
5          P5 mass murder   2022-02-21
 
> df2[c(TRUE,FALSE,TRUE,FALSE,TRUE), ]
  politicians    offences convicted.on
1          P1        rape   1999-01-01
3          P3  kidnapping   2008-09-17
5          P5 mass murder   2022-02-21
>

You can, of course, use names of rows and columns instead of their positions –

> df3
      offences convicted.on
P1        rape   1999-01-01
P2      murder   2003-03-04
P3  kidnapping   2008-09-17
P4   terrorism   2016-06-06
P5 mass murder   2022-02-21
 
> df3['P2', ] # Know more about the meaningful life of politician #2
   offences convicted.on
P2   murder   2003-03-04
 
> df3[c('P2' ,'P5'), ] # Know more about the meaningful
  # lives of politicians number 2,5
      offences convicted.on
P2      murder   2003-03-04
P5 mass murder   2022-02-21
 
> df3[c(TRUE,FALSE,TRUE,FALSE,FALSE), ]
     offences convicted.on
P1       rape   1999-01-01
P3 kidnapping   2008-09-17
>
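In practice the logical vector is rarely typed by hand; it usually comes from a condition on a column. Here is a minimal sketch on a trimmed copy of the data; the cut-off date and the names is_recent and recent are arbitrary choices for the illustration.

```r
df <- data.frame(
  politicians  = c("P1", "P2", "P3", "P4", "P5"),
  convicted.on = as.Date(c("1999-01-01", "2003-03-04", "2008-09-17",
                           "2016-06-06", "2022-02-21")),
  stringsAsFactors = FALSE
)

# The comparison yields one TRUE/FALSE value per row.
is_recent <- df$convicted.on > as.Date("2005-01-01")

# Use the logical vector as the row index and keep all columns.
recent <- df[is_recent, ]
recent$politicians   # "P3" "P4" "P5"
```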

Accessing Columns From a Data Frame

Different ways of extracting columns mirror the ways of extracting rows.

Continuing our journey into knowing more about the illustrious politicians around us, we will now extract their ‘attributes’ – (Pun intended)

Using ‘[‘

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2['offences'] # An incomplete list of what all politicians do
     offences
1        rape
2      murder
3  kidnapping
4   terrorism
5 mass murder
> df2[c(2)] # Selecting by position: column 2 is offences
     offences
1        rape
2      murder
3  kidnapping
4   terrorism
5 mass murder
 
> df2[c(3)] # An imaginary list of conviction dates of politicians
  convicted.on
1   1999-01-01
2   2003-03-04
3   2008-09-17
4   2016-06-06
5   2022-02-21
 
> df2[c(TRUE, TRUE)] # use logical vector to
  #    know more about illogical parasites
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2[c(FALSE, TRUE, TRUE)]
     offences convicted.on
1        rape   1999-01-01
2      murder   2003-03-04
3  kidnapping   2008-09-17
4   terrorism   2016-06-06
5 mass murder   2022-02-21
>

Using names

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2['politicians'] # wonderful, lovely 'people', aren't they?
  politicians
1          P1
2          P2
3          P3
4          P4
5          P5
>

Retrieving Columns of Data Frames in R

Suppose that you are so enamored by the good deeds of politicians that you want to gather all their wonderful deeds as a vector for further analysis, exploration and citation. R can help you in this charitable act in three different ways –

1. Extracting Column Vector Using $ Operator

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2$offences
[1] "rape"        "murder"      "kidnapping"  "terrorism"   "mass murder"

There you go, R gives a character vector with all the details.

2. Extracting Column Vector Using [ Operator

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2[ ,'politicians']
[1] "P1" "P2" "P3" "P4" "P5"
 
> df2[ , 1] # You are telling R
  # to give you the column called 'politicians'
  # which is the first column.
[1] "P1" "P2" "P3" "P4" "P5"
>

3. Extracting Column Vector Using [[ Operator

‘[[‘ is a source of perpetual pain in R. But you can use it with column number or name to obtain a vector –

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2[[1]]
[1] "P1" "P2" "P3" "P4" "P5"
 
> df2[['politicians']]
[1] "P1" "P2" "P3" "P4" "P5"
>
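The practical difference between these forms is the type of what comes back. The sketch below uses a small made-up data frame (the pending column and the variable names are assumptions for the example) to compare them side by side.

```r
df <- data.frame(
  politicians = c("P1", "P2", "P3"),
  pending     = c(1024, 2048, 3072),
  stringsAsFactors = FALSE
)

as_frame  <- df["politicians"]                  # one-column data frame
as_vector <- df[["politicians"]]                # character vector
kept      <- df[, "politicians", drop = FALSE]  # data frame, not a vector

is.data.frame(as_frame)    # TRUE
is.character(as_vector)    # TRUE
is.data.frame(kept)        # TRUE
```

Passing drop = FALSE is useful when code downstream expects a data frame even if only one column is selected.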

Operations on Data Frames

There are a number of operations permitted on data frames –

Transpose of a Data Frame

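As indicated in the opening paragraph, a data frame can be viewed as a matrix whose columns have different types. You can transpose a data frame using the t function. The transpose of a data frame does not make much mathematical sense, but it can be used to swap rows with columns. Note that t first converts the data frame to a matrix, so the result is a matrix rather than a data frame; with mixed column types, as here, every value becomes a character string.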

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> t(df2)
             [,1]         [,2]         [,3]         [,4]         [,5]        
politicians  "P1"         "P2"         "P3"         "P4"         "P5"        
offences     "rape"       "murder"     "kidnapping" "terrorism"  "mass murder"
convicted.on "1999-01-01" "2003-03-04" "2008-09-17" "2016-06-06" "2022-02-21"
>
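The quoted values in the output above are a hint that the result is no longer a data frame. A short sketch, using two invented data frames (mixed and numbers_only), shows how the column types decide what t returns.

```r
mixed <- data.frame(id = c("P1", "P2"), pending = c(1024, 2048),
                    stringsAsFactors = FALSE)
numbers_only <- data.frame(a = c(1, 2), b = c(3, 4))

t_mixed   <- t(mixed)
t_numbers <- t(numbers_only)

is.matrix(t_mixed)       # TRUE: the result is a matrix
is.data.frame(t_mixed)   # FALSE
typeof(t_mixed)          # "character"
typeof(t_numbers)        # "double"
```

If you need a data frame again, as.data.frame can wrap the transposed matrix, though the columns stay character for mixed data.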

Adding Columns to Data Frame

Suppose that you also want to record the number of pending criminal cases of each, well, politician. No hassles. You can use the $ operator even after the data frame is constructed.

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2$pending.cases <- c(1024, 2048, 3072, 4096, 5120)
 
> df2
  politicians    offences convicted.on pending.cases
1          P1        rape   1999-01-01          1024
2          P2      murder   2003-03-04          2048
3          P3  kidnapping   2008-09-17          3072
4          P4   terrorism   2016-06-06          4096
5          P5 mass murder   2022-02-21          5120
>
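New columns do not have to be typed in by hand. Here is a hedged sketch of two common variations, plus what happens when the lengths do not line up; the column names conviction.year, reviewed and broken are made up for the example.

```r
df <- data.frame(
  politicians  = c("P1", "P2", "P3"),
  convicted.on = as.Date(c("1999-01-01", "2003-03-04", "2008-09-17")),
  stringsAsFactors = FALSE
)

# Derive a new column from an existing one.
df$conviction.year <- as.integer(format(df$convicted.on, "%Y"))

# A single value is recycled to every row.
df$reviewed <- FALSE

# A vector whose length does not fit the number of rows is rejected.
bad_length <- tryCatch(
  {
    df$broken <- c(1, 2)
    "accepted"
  },
  error = function(e) "rejected"
)

df$conviction.year   # 1999 2003 2008
bad_length           # "rejected"
```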

Removing Columns from a Data Frame

Removing a column from a data frame is equally easy. Using the $ operator you set the column to NULL –

> df2
  politicians    offences convicted.on pending.cases
1          P1        rape   1999-01-01          1024
2          P2      murder   2003-03-04          2048
3          P3  kidnapping   2008-09-17          3072
4          P4   terrorism   2016-06-06          4096
5          P5 mass murder   2022-02-21          5120
 
> df2$pending.cases <- NULL # remove the pending.cases col
 
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
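Setting a column to NULL modifies the data frame in place. If you would rather keep the original intact, selecting everything except the unwanted column works too. This is a sketch on a small invented data frame; keep, without_pending and without_third are illustrative names.

```r
df <- data.frame(
  politicians   = c("P1", "P2", "P3"),
  convicted.on  = as.Date(c("1999-01-01", "2003-03-04", "2008-09-17")),
  pending.cases = c(1024, 2048, 3072),
  stringsAsFactors = FALSE
)

# Keep every column except the named one; df itself is unchanged.
keep <- setdiff(names(df), "pending.cases")
without_pending <- df[keep]

# Drop a column by position with a negative index.
without_third <- df[-3]

names(without_pending)   # "politicians" "convicted.on"
names(df)                # still has all three columns
```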

Adding Rows to Data Frame

Suppose you have the inner urge to model more politicians and add them to your data frame. R makes that pretty straightforward. In the example here we create a new data frame called more.rascals and rbind it to the existing data frame df2:

> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> more.rascals <- data.frame (
+ politicians = c("P6", "P7") ,
+ offences = c("Espionage", "Blackmail"),
+ convicted.on = as.Date(c("2040-01-01", "2049-01-01")),
+ stringsAsFactors = FALSE
+ )
 
> more.rascals
  politicians  offences convicted.on
1          P6 Espionage   2040-01-01
2          P7 Blackmail   2049-01-01
 
> lowlives <- rbind(df2, more.rascals)
 
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
6          P6   Espionage   2040-01-01
7          P7   Blackmail   2049-01-01
>
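One thing to keep in mind is that rbind lines up data frame columns by name, so the names have to agree. The following sketch uses tiny one-row data frames; first, reordered, misnamed and the column crime are made up to show both the matching and the failure case.

```r
first <- data.frame(politicians = "P1", offences = "murder",
                    stringsAsFactors = FALSE)

# Same column names in a different order: rbind matches them by name.
reordered <- data.frame(offences = "Blackmail", politicians = "P7",
                        stringsAsFactors = FALSE)
combined <- rbind(first, reordered)
combined$politicians   # "P1" "P7"

# A column name missing from the first data frame is an error.
misnamed <- data.frame(politicians = "P8", crime = "Espionage",
                       stringsAsFactors = FALSE)
mismatch <- tryCatch(
  {
    rbind(first, misnamed)
    "accepted"
  },
  error = function(e) "rejected"
)
mismatch               # "rejected"
```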

Deleting Rows from a Data Frame

Suppose some politician P6 threatens you and you agree to remove his details from the lowlives data frame. Rows can be deleted by putting a negative index (the - operator) in the row position, as shown below:

> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
6          P6   Espionage   2040-01-01
7          P7   Blackmail   2049-01-01
 
> lowlives <- lowlives[-6, ]
 
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
7          P7   Blackmail   2049-01-01
>
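Notice that the row names above jump from 5 to 7 after the deletion. The sketch below, on a cut-down data frame with an invented pending column, removes a row by condition instead of by position and then renumbers the rows.

```r
df <- data.frame(
  politicians = c("P5", "P6", "P7"),
  pending     = c(5120, 6144, 7168),
  stringsAsFactors = FALSE
)

# Drop rows by condition rather than by position.
kept <- df[df$politicians != "P6", ]
before <- row.names(kept)   # "1" "3": the original row names survive

# Renumber the rows from 1.
row.names(kept) <- NULL
row.names(kept)             # "1" "2"
```

Deleting by condition is safer than deleting by position when the row order may have changed since you last looked.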

Sorting a Data Frame

You can sort a data frame based on values in one or more columns using the order() function. Let’s sort the lowlives data frame based on the alphabetical ordering of the offences that they have committed.

> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
7          P7   Blackmail   2049-01-01
 
> lowlives[order(lowlives[,2]), ]
  politicians    offences convicted.on
7          P7   Blackmail   2049-01-01
3          P3  kidnapping   2008-09-17
5          P5 mass murder   2022-02-21
2          P2      murder   2003-03-04
1          P1        rape   1999-01-01
4          P4   terrorism   2016-06-06
>
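The same order() call extends to several sort keys and to descending order. This is a sketch only: the party and pending columns are hypothetical additions, not part of the data used in the article.

```r
df <- data.frame(
  politicians = c("P1", "P2", "P3", "P4"),
  party       = c("B", "A", "B", "A"),
  pending     = c(10, 40, 20, 30),
  stringsAsFactors = FALSE
)

# Sort by party, then by pending cases within each party.
by_party <- df[order(df$party, df$pending), ]
by_party$politicians       # "P4" "P2" "P1" "P3"

# Sort by pending cases, largest first.
most_pending <- df[order(df$pending, decreasing = TRUE), ]
most_pending$politicians   # "P2" "P4" "P3" "P1"
```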

Changing Order of Rows or Columns in a Data Frame

The order of columns and rows in a data frame can be altered using various selection mechanisms discussed in the beginning of this article. Consider the trivial example below –

# Changing the order of columns of these lowlifes
> lowlives
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
7          P7   Blackmail   2049-01-01
 
> lowlives[, c(2, 1, 3)]
     offences politicians convicted.on
1        rape          P1   1999-01-01
2      murder          P2   2003-03-04
3  kidnapping          P3   2008-09-17
4   terrorism          P4   2016-06-06
5 mass murder          P5   2022-02-21
7   Blackmail          P7   2049-01-01
 
# Changing the order of rows (but not the row.names)
# using same logic
> df2
  politicians    offences convicted.on
1          P1        rape   1999-01-01
2          P2      murder   2003-03-04
3          P3  kidnapping   2008-09-17
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
 
> df2[c(2,4,5,1), ]
  politicians    offences convicted.on
2          P2      murder   2003-03-04
4          P4   terrorism   2016-06-06
5          P5 mass murder   2022-02-21
1          P1        rape   1999-01-01
>
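Column positions tend to shift as a data frame grows, so reordering by name can be more robust than reordering by number. A minimal sketch, with front and reordered as illustrative names:

```r
df <- data.frame(
  politicians  = c("P1", "P2"),
  offences     = c("murder", "kidnapping"),
  convicted.on = as.Date(c("2003-03-04", "2008-09-17")),
  stringsAsFactors = FALSE
)

# Move one column to the front by name; the rest keep their order.
front <- "convicted.on"
reordered <- df[c(front, setdiff(names(df), front))]

names(reordered)   # "convicted.on" "politicians" "offences"
```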

Conclusion

This article introduced one of the most frequently used data structures in R, viz. data frames. We have already discussed vectors and matrices. We will look at lists in R next.

 

Article by P T
