#-----------------------------------------------------------------------------
# order function
#
# You can use the order function to put the rows of a dataframe in sorted
# "order" based on the contents of one or more columns.
#
#
#
# WARNING: DON'T USE sort
#
# The sort function will NOT help you to do this at all!!!
# sort only works for individual vectors!!!
#-----------------------------------------------------------------------------
37 dataframes - order - stringsAsFactors
37.1 Use the order function to sort the rows of a dataframe. DON’T USE THE sort FUNCTION
This page discusses two separate issues:
How to sort the rows of a dataframe using the order function.
The stringsAsFactors argument to the data.frame function
37.2 using order with a vector
The order function is not usually used with a vector. However, it is easier to understand the order function if we start by discussing how order works when used with a vector.
The order function returns a numeric vector. This is best explained with an example.
x = c(20,30,40,10)
order(x)
[1] 4 1 2 3
The 1st entry in the returned value shows the position in x that contains the lowest value in x.
The 2nd entry in the returned value shows the position in x that contains the 2nd lowest value in x.
etc.
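A minimal sketch to check this reading of order, using a made-up vector y that is not part of the examples on this page: the first entry of order(y) should be the position of the smallest value, and the last entry should be the position of the largest value.

```r
# y is a hypothetical vector used only for illustration
y = c(50, 10, 40, 30)

pos = order(y)
pos                                # positions of y, from lowest value to highest

# The first entry is the position of the smallest value ...
pos[1] == which.min(y)
# ... and the last entry is the position of the largest value
pos[length(pos)] == which.max(y)

# Indexing y by those positions gives the values in ascending order
y[pos]
```

Both comparisons should come out TRUE, since which.min and which.max report the positions of the smallest and largest values directly.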
This can then be used to sort x in the following way:
x = c(20,30,40,10)
x[order(x)]
[1] 10 20 30 40
However, this is rarely done with a plain vector because the sort function does the same thing.
x = c(20,30,40,10)
sort(x)
[1] 10 20 30 40
Unfortunately the sort function does NOT work with dataframes. However, we can use the order function to sort the rows of a dataframe.
Let’s start with the following dataframe
gradebook = data.frame(student = c("joe", "sue", "sam", "anne", "bob", "carla", "dana", "david"),
test1 = c(90, 80, 90, 75, 80, 90, 100, 60),
test2 = c(95, 97, 88, 87, 81, 92, 99, 73),
year = factor(c("fr", "fr", "so", "so", "fr", "se", "so", "so"),
ordered=TRUE, levels=c("fr","so","ju","se")),
honors = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)
gradebook
  student test1 test2 year honors
1     joe    90    95   fr  FALSE
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE
8   david    60    73   so  FALSE
sort doesn’t work with dataframes
sort(gradebook)   # error
Error in xtfrm.data.frame(x): cannot xtfrm data frames
However, the following shows which row should be 1st, 2nd, 3rd, etc. if we were to order the results by the names of the students
order(gradebook$student)
[1] 4 5 6 7 8 1 3 2
We can now use that to specify which rows we want 1st,2nd,3rd, etc.
gradebook [ order(gradebook$student) , ]
  student test1 test2 year honors
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE
8   david    60    73   so  FALSE
1     joe    90    95   fr  FALSE
3     sam    90    88   so  FALSE
2     sue    80    97   fr  FALSE
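The row names on the left (4, 5, 6, ...) still show each row's original position. If that is not wanted, one common approach is to save the sorted result in a new variable and reset its row names. A small sketch, using a made-up dataframe named pets rather than the gradebook:

```r
# pets is a hypothetical dataframe used only for illustration
pets = data.frame(name = c("rex", "ace", "max"),
                  age  = c(3, 7, 5),
                  stringsAsFactors = FALSE)

sorted_pets = pets[ order(pets$name) , ]
rownames(sorted_pets)          # "2" "3" "1" - the original row positions

rownames(sorted_pets) = NULL   # renumber the rows 1, 2, 3
sorted_pets
```

Resetting the row names is optional. It only changes the labels that are printed, not the data in the rows.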
We can order in reverse order by specifying decreasing=TRUE
gradebook [ order(gradebook$student, decreasing=TRUE) , ]
  student test1 test2 year honors
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
1     joe    90    95   fr  FALSE
8   david    60    73   so  FALSE
7    dana   100    99   so   TRUE
6   carla    90    92   se   TRUE
5     bob    80    81   fr  FALSE
4    anne    75    87   so  FALSE
To order by the grades in test1 do this
gradebook [ order(gradebook$test1) , ]
  student test1 test2 year honors
8   david    60    73   so  FALSE
4    anne    75    87   so  FALSE
2     sue    80    97   fr  FALSE
5     bob    80    81   fr  FALSE
1     joe    90    95   fr  FALSE
3     sam    90    88   so  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE
Notice that some students have the same test1 grade. For example, there are 3 students who all got a 90. However, notice that those 3 students are not sorted by any other column. The order function simply leaves tied rows in their original relative order.
It would be nice if students who all got a 90 on test1 were listed in order of their test2 grades. (Similarly for students who all got 80 on test1, etc.)
This is done by specifying more than one column in the call to the order function. The first column listed is used to sort the data. If more than one row has the same value in that column, then those rows are sorted by the 2nd column specified. You can continue doing this for as many columns as you like.
gradebook [ order(gradebook$test1, gradebook$test2) , ]
  student test1 test2 year honors
8   david    60    73   so  FALSE
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
6   carla    90    92   se   TRUE
1     joe    90    95   fr  FALSE
7    dana   100    99   so   TRUE
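Note that decreasing=TRUE reverses every column passed to order. To sort one column downwards and another upwards, two approaches are commonly used: negate a numeric column, or pass one decreasing value per column together with method="radix". The sketch below uses a made-up dataframe named scores. It lists the highest test1 first and, within ties, the lowest test2 first.

```r
# scores is a hypothetical dataframe used only for illustration
scores = data.frame(student = c("ann", "ben", "cal", "dee"),
                    test1   = c(90, 80, 90, 80),
                    test2   = c(95, 70, 85, 75),
                    stringsAsFactors = FALSE)

# Approach 1: negate the numeric column that should be descending
by_negation = scores[ order(-scores$test1, scores$test2) , ]

# Approach 2: one decreasing value per column (this needs method = "radix")
by_radix = scores[ order(scores$test1, scores$test2,
                         decreasing = c(TRUE, FALSE), method = "radix") , ]

by_negation$student                 # "cal" "ann" "ben" "dee"
identical(by_negation, by_radix)    # both approaches give the same rows
```

Negation only works for numeric columns, so the decreasing vector is the more general of the two approaches.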
Another example - show honors students and non-honors students separately
gradebook [ order(gradebook$honors, gradebook$test1, gradebook$test2) , ]
  student test1 test2 year honors
8   david    60    73   so  FALSE
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
1     joe    90    95   fr  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE
This time specify decreasing=TRUE
gradebook [ order(gradebook$honors, gradebook$test1, gradebook$test2, decreasing=TRUE) , ]
  student test1 test2 year honors
7    dana   100    99   so   TRUE
6   carla    90    92   se   TRUE
1     joe    90    95   fr  FALSE
3     sam    90    88   so  FALSE
2     sue    80    97   fr  FALSE
5     bob    80    81   fr  FALSE
4    anne    75    87   so  FALSE
8   david    60    73   so  FALSE
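The year column of gradebook is an ordered factor. When a factor is passed to order, the rows are arranged by the order of the factor's levels (fr, so, ju, se), not alphabetically. A small sketch with a made-up dataframe named roster, assuming the same levels as gradebook:

```r
# roster is a hypothetical dataframe used only for illustration
roster = data.frame(student = c("joe", "carla", "sam", "bob"),
                    stringsAsFactors = FALSE)
roster$year = factor(c("so", "se", "fr", "so"),
                     levels = c("fr", "so", "ju", "se"), ordered = TRUE)

# Sorted by level order: fr first, then so, then se
roster[ order(roster$year) , ]

# For comparison, sorting the same values as plain text is alphabetical,
# which puts "se" before "so"
roster[ order(as.character(roster$year)) , ]
```

This is one practical reason to store a column such as year as a factor with explicitly specified levels.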
37.3 stringsAsFactors=FALSE or stringsAsFactors=TRUE
The data.frame function contains an argument named stringsAsFactors that is expected to be TRUE or FALSE. The default value is FALSE. (see the documentation for data.frame, i.e. ?data.frame)
IMPORTANT NOTE - the default value used to be TRUE, but it was changed to FALSE in R 4.0.0 (released in April 2020). The old default caused a lot of confusion, which is why it was changed. You might therefore still see older code that assumes stringsAsFactors=TRUE when it’s not explicitly specified.
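Because the default depends on the version of R, one way to make code behave the same everywhere is to always pass stringsAsFactors explicitly. A minimal sketch, using made-up dataframes whose names start with fruit:

```r
# the fruit_* dataframes are hypothetical and used only for illustration
fruit_default = data.frame(name = c("apple", "pear"))
fruit_char    = data.frame(name = c("apple", "pear"), stringsAsFactors = FALSE)
fruit_fact    = data.frame(name = c("apple", "pear"), stringsAsFactors = TRUE)

class(fruit_default$name)   # depends on the version of R
class(fruit_char$name)      # "character" on any version
class(fruit_fact$name)      # "factor" on any version

getRversion() >= "4.0.0"    # TRUE means the default is stringsAsFactors = FALSE
```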
rm(list = ls() )
# WHAT IS A STRING???
#
# Don't get confused by the word "string". The term "string" means the same
# thing as "an element of a character vector". The term "string" is used a LOT
# in other languages, e.g. Java, Python, etc. instead
# of what we call an element of a "character vector". The word seeped into
# R in a few places. One of them is in the name of the argument
# stringsAsFactors. Perhaps a better name for this argument
# could have been charactersAsFactors but that's not what it is.
#
# Are you curious about why an element of a character vector is known
# as a "string" in many other languages? The word string comes from
# "stringing together" many individual characters,
# e.g. 'a' and 'p' and 'p' and 'l' and 'e' can be strung together
# like a string of beads on a necklace to make a single
# "string of characters" e.g. "apple".
#
#
#
# WHAT DOES stringsAsFactors=FALSE DO ?
#
# If you create a dataframe using character vectors and specify
# stringsAsFactors = TRUE, the character vectors will be converted into
# factors before they are stored in the dataframe. If that is not what you
# want then you can specify stringsAsFactors = FALSE (the default since
# R 4.0.0), which keeps them as character vectors.
# EXAMPLE : stringsAsFactors = TRUE
# (this was the default before R 4.0.0 if you didn't specify anything for stringsAsFactors)
gradebook_fact = data.frame(first = c("joe", "sue", "sam", "anne", "bob", "carla", "dana", "david"),
last = c("baker", "jones", "smith", "fox", "cohen", "jones", "schwartz", "rosen"),
test1 = c(70, 80, 90, 75, 85, 95, 100, 60),
test2 = c(81, 77, 88, 87, 91, 92, 99, 73),
year = c("fr", "fr", "so", "so", "fr", "se", "so", "se"),
honors = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
stringsAsFactors = TRUE) # NOT THE DEFAULT SINCE R 4.0.0, SO IT MUST BE SPECIFIED
gradebook_fact
  first     last test1 test2 year honors
1   joe    baker    70    81   fr  FALSE
2   sue    jones    80    77   fr  FALSE
3   sam    smith    90    88   so  FALSE
4  anne      fox    75    87   so  FALSE
5   bob    cohen    85    91   fr  FALSE
6 carla    jones    95    92   se   TRUE
7  dana schwartz   100    99   so   TRUE
8 david    rosen    60    73   se  FALSE
# character vectors were converted to factors in the dataframe
class(gradebook_fact$first)
[1] "factor"
class(gradebook_fact$last)
[1] "factor"
class(gradebook_fact$year)
[1] "factor"
summary(gradebook_fact$first)
 anne   bob carla  dana david   joe   sam   sue
    1     1     1     1     1     1     1     1
summary(gradebook_fact$last)
   baker    cohen      fox    jones    rosen schwartz    smith
       1        1        1        2        1        1        1
summary(gradebook_fact$year)
fr se so
 3  2  3
# EXAMPLE : stringsAsFactors = FALSE
gradebook_char = data.frame(first = c("joe", "sue", "sam", "anne", "bob", "carla", "dana", "david"),
last = c("baker", "jones", "smith", "fox", "cohen", "jones", "schwartz", "rosen"),
test1 = c(70, 80, 90, 75, 85, 95, 100, 60),
test2 = c(81, 77, 88, 87, 91, 92, 99, 73),
year = c("fr", "fr", "so", "so", "fr", "se", "so", "se"),
honors = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
stringsAsFactors = FALSE)
# character vectors were NOT converted to factors in the dataframe
class(gradebook_char$first)
[1] "character"
class(gradebook_char$last)
[1] "character"
class(gradebook_char$year)
[1] "character"
summary(gradebook_char$first)
   Length     Class      Mode
        8 character character
summary(gradebook_char$last)
   Length     Class      Mode
        8 character character
summary(gradebook_char$year)
   Length     Class      Mode
        8 character character
# QUESTION
#
# In the gradebook_char variable we created above, the year is a character
# vector but it should be a factor. Create a new variable named
# gradebook, that changes the year column into a factor. You should
# NOT use the data.frame function at all. Rather replace the year
# column from gradebook_char with a factor that has the same data.
# QUESTION
#
# In the gradebook_fact variable we created above, the first and last
# name columns are factor columns. However, they should NOT be factors.
# Create a new variable named gradebook, that changes the
# first and last columns into character vectors. You should
# NOT use the data.frame function at all. Rather replace the
# first and last columns from gradebook_fact with character vectors
# that have the same data.