37 dataframes - order - stringsAsFactors

37.1 Use the order function to sort the rows of a dataframe. DON’T USE THE sort FUNCTION

This page discusses two separate issues:

How to sort the rows of a dataframe using the order function.
The stringsAsFactors argument to the data.frame function

#-----------------------------------------------------------------------------
# order function
#
# You can use the order function to put the rows of a dataframe in sorted 
# "order" based on the contents of one or more columns.
#
#
#
# WARNING: DON'T USE sort
#
# The sort function will NOT help you to do this at all!!!
# sort only works for individual vectors!!!
#-----------------------------------------------------------------------------

37.2 using order with a vector

The order function is not usually used with a vector. However, it is easier to understand the order function if we start by discussing how order works when used with a vector.

The order function returns a numeric vector. This is best explained with an example.

x = c(20,30,40,10)
order(x)

[1] 4 1 2 3

The 1st entry in the returned value shows the position in x that contains the lowest value in x.
The 2nd entry in the returned value shows the position in x that contains the 2nd lowest value in x.
etc.

This can then be used to sort x in the following way:

x = c(20,30,40,10)
x[order(x)]

[1] 10 20 30 40

However, this is rarely done in practice because the sort function does the same thing more directly.

x = c(20,30,40,10)
sort(x)

[1] 10 20 30 40
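One thing order can do that sort cannot is rearrange a second vector so that it stays matched up with the sorted one. The following is a minimal sketch of that idea; the vectors scores and students are made-up data for illustration and are not part of the gradebook used later on this page.

```r
# Two parallel vectors (made-up data): students[i] goes with scores[i]
scores   = c(20, 30, 40, 10)
students = c("joe", "sue", "sam", "anne")

idx = order(scores)                # positions of scores, lowest value first
idx                                # 4 1 2 3

sorted_scores   = scores[idx]      # same result as sort(scores)
sorted_students = students[idx]    # students rearranged to match the sorted scores

identical(sorted_scores, sort(scores))   # TRUE
sorted_students                          # "anne" "joe" "sue" "sam"
```

This is the same idea that is used to sort a dataframe: compute the positions from one column and use them to rearrange every column at once.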

Unfortunately the sort function does NOT work with dataframes. However, we can use the order function to sort the rows of a dataframe.

Let’s start with the following dataframe

gradebook = data.frame(student =      c("joe", "sue", "sam", "anne", "bob", "carla", "dana", "david"),
                       test1 =        c(90,     80,    90,    75,    80,    90,      100,    60),
                       test2 =        c(95,     97,    88,    87,    81,    92,      99,     73),
                       year  = factor(c("fr",   "fr",  "so",  "so",  "fr",  "se",    "so",   "so"), 
                                      ordered=TRUE, levels=c("fr","so","ju","se")),
                       honors =       c(FALSE,  FALSE, FALSE, FALSE, FALSE, TRUE,    TRUE,   FALSE)
                      )
                       
gradebook

  student test1 test2 year honors
1     joe    90    95   fr  FALSE
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE
8   david    60    73   so  FALSE

sort doesn’t work with dataframes

sort(gradebook)   # error

Error in xtfrm.data.frame(x): cannot xtfrm data frames

However, the following shows which row should be 1st, 2nd, 3rd, etc. if we were to order the results by the names of the students

order(gradebook$student)

[1] 4 5 6 7 8 1 3 2

We can now use that to specify which rows we want 1st,2nd,3rd, etc.

gradebook [ order(gradebook$student)  , ]

  student test1 test2 year honors
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE
8   david    60    73   so  FALSE
1     joe    90    95   fr  FALSE
3     sam    90    88   so  FALSE
2     sue    80    97   fr  FALSE

We can order in reverse order by specifying decreasing=TRUE

gradebook [ order(gradebook$student, decreasing=TRUE)  , ]

  student test1 test2 year honors
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
1     joe    90    95   fr  FALSE
8   david    60    73   so  FALSE
7    dana   100    99   so   TRUE
6   carla    90    92   se   TRUE
5     bob    80    81   fr  FALSE
4    anne    75    87   so  FALSE
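The order function also works with factor columns. For a factor, the rows follow the order of the factor's levels rather than alphabetical order. The following is a minimal sketch using a small made-up dataframe (mini is a hypothetical name) whose year column has the same levels as the year column in gradebook.

```r
# Small made-up dataframe; year is an ordered factor with levels fr < so < ju < se
mini = data.frame(student = c("joe", "sam", "carla", "kim"),
                  year    = factor(c("so", "fr", "se", "ju"),
                                   ordered = TRUE,
                                   levels = c("fr", "so", "ju", "se")))

# Ordering by the factor follows the levels: fr, so, ju, se
mini[ order(mini$year) , ]

# Ordering the same values as plain character data is alphabetical: fr, ju, se, so
mini[ order(as.character(mini$year)) , ]
```

This is one reason a column such as year is often better stored as a factor: the levels carry the intended ordering.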

To order by the grades in test1 do this

gradebook [ order(gradebook$test1)  , ]

  student test1 test2 year honors
8   david    60    73   so  FALSE
4    anne    75    87   so  FALSE
2     sue    80    97   fr  FALSE
5     bob    80    81   fr  FALSE
1     joe    90    95   fr  FALSE
3     sam    90    88   so  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE

Notice that some students have the same test1 grade. For example, there are 3 students who all got exactly 90. Notice that those 3 students are not sorted by any other column. Rows that tie simply stay in the same relative order that they had in the original dataframe.

It would be nice if the students who all got a 90 on test1 were listed in order of their test2 grades (and similarly for the students who all got 80 on test1, etc.).

This is done by specifying more than one column in the call to the order function. The first column listed will be used to sort the data. If more than one row has the same value in that column then those rows will be sorted by the 2nd column specified. You can continue doing this for as many columns as you like.

gradebook [ order(gradebook$test1, gradebook$test2)  , ]

  student test1 test2 year honors
8   david    60    73   so  FALSE
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
6   carla    90    92   se   TRUE
1     joe    90    95   fr  FALSE
7    dana   100    99   so   TRUE

Another example - show honors students and non-honors students separately

gradebook [ order(gradebook$honors, gradebook$test1, gradebook$test2)  , ]

  student test1 test2 year honors
8   david    60    73   so  FALSE
4    anne    75    87   so  FALSE
5     bob    80    81   fr  FALSE
2     sue    80    97   fr  FALSE
3     sam    90    88   so  FALSE
1     joe    90    95   fr  FALSE
6   carla    90    92   se   TRUE
7    dana   100    99   so   TRUE

This time specify decreasing=TRUE. This reverses the sort direction for all of the columns.

gradebook [ order(gradebook$honors, gradebook$test1, gradebook$test2, decreasing=TRUE)  , ]

  student test1 test2 year honors
7    dana   100    99   so   TRUE
6   carla    90    92   se   TRUE
1     joe    90    95   fr  FALSE
3     sam    90    88   so  FALSE
2     sue    80    97   fr  FALSE
5     bob    80    81   fr  FALSE
4    anne    75    87   so  FALSE
8   david    60    73   so  FALSE
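A single decreasing=TRUE reverses every column at once. To sort one column from highest to lowest and another from lowest to highest, one option described in ?order is to pass a logical vector for decreasing together with method="radix". The following is a minimal sketch using a small made-up dataframe (scores is a hypothetical name, not the gradebook above).

```r
# Small made-up dataframe with ties in test1
scores = data.frame(student = c("joe", "sue", "sam", "anne"),
                    test1   = c(90, 80, 90, 80),
                    test2   = c(95, 97, 88, 81))

# test1 from highest to lowest; ties broken by test2 from lowest to highest
idx = order(scores$test1, scores$test2,
            decreasing = c(TRUE, FALSE), method = "radix")
scores[ idx , ]

# For numeric columns, negating the column has the same effect
idx2 = order(-scores$test1, scores$test2)
identical(idx, idx2)   # TRUE
```

The negation trick only works for numeric columns. For character columns the decreasing vector with method="radix" is the more general approach.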

37.3 stringsAsFactors=FALSE or stringsAsFactors=TRUE

The data.frame function contains an argument named stringsAsFactors that is expected to be TRUE or FALSE. The default value is FALSE. (see the documentation for data.frame, i.e. ?data.frame)

IMPORTANT NOTE - the default value used to be TRUE. This caused a lot of confusion, so the default was changed to FALSE in R 4.0.0 (released in April 2020). Therefore you might see older code that assumes stringsAsFactors=TRUE when it’s not explicitly specified.
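Because the default depends on the version of R, one way to make code behave the same everywhere is to always state stringsAsFactors explicitly. The following is a minimal sketch of that habit; the names df_chr, df_fct and df_default are made up for illustration.

```r
# Stating the argument explicitly gives the same result in old and new versions of R
df_chr = data.frame(name = c("joe", "sue"), stringsAsFactors = FALSE)
df_fct = data.frame(name = c("joe", "sue"), stringsAsFactors = TRUE)

class(df_chr$name)   # "character"
class(df_fct$name)   # "factor"

# Leaving the argument out uses the default of whichever version of R runs the code
df_default = data.frame(name = c("joe", "sue"))
class(df_default$name)
```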

rm(list = ls() )

# WHAT IS A STRING???
#
# Don't get confused by the word "string". The term "string" means the same
# thing as "an element of a character vector". The term "string" is used a LOT
# in other languages, e.g. Java, Python, etc. instead
# of what we call an element of a "character vector". The word seeped into
# R in a few places. One of them is in the name of the argument
# stringsAsFactors. Perhaps a better name for this argument 
# could have been charactersAsFactors but that's not what it is.
#
# Are you curious about why an element of a character vector is known
# as a "string" in many other languages? The word string comes from
# "stringing together" many individual "characters", 
# e.g. 'a' and 'p' and 'p' and 'l' and 'e' can be strung together
# like a string of beads on a necklace to make a single
# "string of characters" e.g. "apple".
#
#
#
# WHAT DOES stringsAsFactors DO ?
#
# If you specify stringsAsFactors = TRUE when you create a dataframe using
# character vectors, the character vectors will be converted into factors
# before they are stored in the dataframe. If you specify
# stringsAsFactors = FALSE (or specify nothing in R 4.0.0 and later, since
# FALSE is now the default) the character vectors are stored as they are.


# EXAMPLE : stringsAsFactors = TRUE 
#           (this was the default before R 4.0.0; now it must be specified explicitly)

gradebook_fact = data.frame(first = c("joe", "sue", "sam", "anne", "bob", "carla", "dana", "david"),
                       last =  c("baker", "jones", "smith", "fox", "cohen", "jones", "schwartz", "rosen"),    
                       test1 = c(70,     80,    90,    75,    85,    95,      100,    60),
                       test2 = c(81,     77,    88,    87,    91,    92,      99,     73),
                       year  = c("fr",   "fr",  "so",  "so",  "fr",  "se",    "so",   "se"),
                       honors =       c(FALSE,  FALSE, FALSE, FALSE, FALSE, TRUE,    TRUE,   FALSE),
                stringsAsFactors = TRUE)   # MUST BE SPECIFIED EXPLICITLY - THE DEFAULT IS NOW FALSE

gradebook_fact

  first     last test1 test2 year honors
1   joe    baker    70    81   fr  FALSE
2   sue    jones    80    77   fr  FALSE
3   sam    smith    90    88   so  FALSE
4  anne      fox    75    87   so  FALSE
5   bob    cohen    85    91   fr  FALSE
6 carla    jones    95    92   se   TRUE
7  dana schwartz   100    99   so   TRUE
8 david    rosen    60    73   se  FALSE

# character vectors were converted to factors in the dataframe
class(gradebook_fact$first)

[1] "factor"

class(gradebook_fact$last)

[1] "factor"

class(gradebook_fact$year)

[1] "factor"

summary(gradebook_fact$first)

 anne   bob carla  dana david   joe   sam   sue 
    1     1     1     1     1     1     1     1

summary(gradebook_fact$last)

   baker    cohen      fox    jones    rosen schwartz    smith 
       1        1        1        2        1        1        1

summary(gradebook_fact$year)

fr se so 
 3  2  3

# EXAMPLE : stringsAsFactors = FALSE

gradebook_char = data.frame(first = c("joe", "sue", "sam", "anne", "bob", "carla", "dana", "david"),
                       last =  c("baker", "jones", "smith", "fox", "cohen", "jones", "schwartz", "rosen"),    
                       test1 = c(70,     80,    90,    75,    85,    95,      100,    60),
                       test2 = c(81,     77,    88,    87,    91,    92,      99,     73),
                       year  = c("fr",   "fr",  "so",  "so",  "fr",  "se",    "so",   "se"),
                       honors =       c(FALSE,  FALSE, FALSE, FALSE, FALSE, TRUE,    TRUE,   FALSE),
                stringsAsFactors = FALSE)
# character vectors were NOT converted to factors in the dataframe
class(gradebook_char$first)

[1] "character"

class(gradebook_char$last)

[1] "character"

class(gradebook_char$year)

[1] "character"

summary(gradebook_char$first)

   Length     Class      Mode 
        8 character character

summary(gradebook_char$last)

   Length     Class      Mode 
        8 character character

summary(gradebook_char$year)

   Length     Class      Mode 
        8 character character

# QUESTION
#
# In the gradebook_char variable we created above, the year is a character
# vector but it should be a factor. Create a new variable named 
# gradebook, that changes the year column into a factor. You should
# NOT use the data.frame function at all. Rather replace the year 
# column from gradebook_char with a factor that has the same data.


# QUESTION
#
# In the gradebook_fact variable we created above, the first and last
# name columns are factor columns. However, they should NOT be factors. 
# Create a new variable named gradebook, that changes the
# first and last columns into character vectors. You should
# NOT use the data.frame function at all. Rather replace the
# first and last columns from gradebook_fact with character vectors
# that have the same data.