Answer by Rui Barradas for What can R do about a messy data format?

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credit for reminding me of the arguments comment.char and strip.white goes to @nicola and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
   X       Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01    A   A1        0                        0  NA
2 NA 2018-06-03    A   A2        0                        1  NA
3 NA 2018-06-03    A   A3        0                        2  NA
4 NA 2018-06-03    A   A4        1                        1  NA
5 NA 2018-06-03    A   A5        2                        1  NA
6 NA 2018-06-04    A   A6        0                        3  NA
7 NA 2018-06-01    B   B1        0                        1  NA
8 NA 2018-06-02    B   B2        0                        2  NA
9 NA 2018-06-03    B   B3        0                        3  NA

Because every data row starts and ends with a separator, R believes those separators delimit extra columns, which is not what the OP of the original question meant.
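The origin of the extra columns can be seen by splitting a single row by hand. The sketch below runs scan on one row copied from the table, with the same sep and strip.white settings; the variable names one_row and fields are only illustrative.

```r
# One data row from the table, including its leading and trailing "|".
one_row <- "| 2018-06-01 | A    | A1   |        0 |                        0 |"

# Split on "|" and trim the blanks around each field.
fields <- scan(text = one_row, what = "", sep = "|",
               strip.white = TRUE, quiet = TRUE)

length(fields)                 # two more than the five real columns
fields[c(1, length(fields))]   # the empty fields before the first and after the last "|"
```

The empty fields at both ends are what read.table turns into the columns X and X.1.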

So the second step is to keep only the real columns. I will do this by subsetting the columns by position, which is easy because the spurious columns are normally the first and the last ones.

dat <- dat[-c(1, ncol(dat))]
dat
        Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01    A   A1        0                        0
2 2018-06-03    A   A2        0                        1
3 2018-06-03    A   A3        0                        2
4 2018-06-03    A   A4        1                        1
5 2018-06-03    A   A5        2                        1
6 2018-06-04    A   A6        0                        3
7 2018-06-01    B   B1        0                        1
8 2018-06-02    B   B2        0                        2
9 2018-06-03    B   B3        0                        3
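If the positions of the empty columns are not known in advance, one alternative, which is not part of the steps above, is to drop every column that is entirely NA. This is only a sketch: it assumes no genuine column is completely missing, and the small data frame raw is made up for illustration.

```r
# Made-up stand-in for the result of read.table: two spurious all-NA columns.
raw <- data.frame(
  X        = c(NA, NA),
  Date     = c("2018-06-01", "2018-06-03"),
  Priority = c(0L, 1L),
  X.1      = c(NA, NA)
)

# Flag the columns in which every value is NA, then keep the others.
all_na <- vapply(raw, function(col) all(is.na(col)), logical(1))
clean <- raw[!all_na]

names(clean)
```

Subsetting by position is shorter, but this variant does not depend on where the separators happen to sit.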

That wasn't too hard, and the result is much better.
In this case there is still one problem left: the column Date must be coerced to class Date.

dat$Date <- as.Date(dat$Date)
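as.Date needs no further arguments here because the dates are already in year-month-day form. For other layouts the format argument has to be supplied. A minimal sketch, with a made-up day/month/year string:

```r
# Year-month-day strings are parsed without a format.
iso <- as.Date("2018-06-03")

# Hypothetical input in day/month/year order: the format must be spelled out.
dmy <- as.Date("03/06/2018", format = "%d/%m/%Y")

iso == dmy   # TRUE: both are the same calendar day
class(dmy)   # "Date"
```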

And the result is satisfactory.

str(dat)
'data.frame':   9 obs. of  5 variables:
 $ Date                    : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1                    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case                    : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority                : int  0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int  0 1 2 1 1 3 1 2 3

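Note that I have not set the more or less standard argument stringsAsFactors = FALSE, so in R versions before 4.0.0, where the default is TRUE, the character columns are read as factors, as the str output shows. If character columns are needed, the argument should be set in the call to read.table; from R 4.0.0 onward FALSE is already the default.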

The whole process took only 3 lines of base R code.
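Those three steps can also be collected into a small helper. This is only a sketch: the function name read_ascii_table and its date_cols argument are made up here, and it assumes that every row starts and ends with "|" and that every border line starts with "+".

```r
# Hypothetical helper wrapping the three steps: read, drop edge columns, fix dates.
read_ascii_table <- function(txt, date_cols = character(0)) {
  dat <- read.table(text = txt, header = TRUE, sep = "|",
                    comment.char = "+", strip.white = TRUE,
                    stringsAsFactors = FALSE)
  # The leading and trailing "|" produce an empty first and last column.
  dat <- dat[-c(1, ncol(dat))]
  # Coerce the requested columns to class Date.
  for (col in date_cols) dat[[col]] <- as.Date(dat[[col]])
  dat
}

# Shortened, made-up table in the same layout as the one in the question.
txt <- "
+------------+------+----------+
|    Date    | Emp1 | Priority |
+------------+------+----------+
| 2018-06-01 | A    |        0 |
| 2018-06-03 | B    |        1 |
+------------+------+----------+
"

small <- read_ascii_table(txt, date_cols = "Date")
str(small)
```

Setting stringsAsFactors = FALSE inside the helper is a choice made for this sketch, so that the text columns come back as character in every R version.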

Finally, the end result in dput format, as it should have been posted in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685,
17685, 17686, 17683, 17684, 17685), class = "Date"),
Emp1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
Case = c("A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2", "B3"),
Priority = c(0, 0, 0, 1, 2, 0, 0, 0, 0),
PriorityCountinLast7days = c(0, 1, 2, 1, 1, 3, 1, 2, 3)),
row.names = c(NA, -9L), class = "data.frame")
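The numbers in the Date column of the dput output are day counts since 1970-01-01, which is how R stores objects of class Date. A small check, using only the first value from the output:

```r
# Rebuild a Date from its internal day count.
d <- structure(17683, class = "Date")
format(d)                           # "2018-06-01"

# And the other way round: the day count behind a date.
as.numeric(as.Date("2018-06-01"))   # 17683
```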
