Using the tidyr Package in R



The tidyr package: an evolution of reshape2

The tidyr package was written by Hadley Wickham. It is commonly used together with dplyr.

This article demonstrates the following four tidyr functions:

gather—converts wide data to long data; similar to the melt function in the reshape2 package

spread—converts long data to wide data; similar to the cast functions in the reshape2 package

unite—combines multiple columns into one

separate—splits one column into multiple columns

The demonstrations below use the mtcars dataset from the datasets package.

library(tidyr)

library(dplyr)

head(mtcars)

mpg cyl disp  hp drat    wt  qsec vs am gear carb

Mazda RX4        21.0  6  160 110 3.90 2.620 16.46  0  1    4    4

Mazda RX4 Wag    21.0  6  160 110 3.90 2.875 17.02  0  1    4    4

Datsun 710        22.8  4  108  93 3.85 2.320 18.61  1  1    4    1

Hornet 4 Drive    21.4  6  258 110 3.08 3.215 19.44  1  0    3    1

Hornet Sportabout 18.7  8  360 175 3.15 3.440 17.02  0  0    3    2

Valiant          18.1  6  225 105 2.76 3.460 20.22  1  0    3    1

For convenience, add a car column holding the row names and move it to the front:

mtcars$car <- rownames(mtcars)

mtcars <- mtcars[, c(12, 1:11)]

gather

gather is called as follows:

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

Here, ... specifies the columns to gather.

Just as with melt in the reshape2 package, we get the following result:

mtcarsNew <- mtcars %>% gather(attribute, value, -car)

head(mtcarsNew)

car attribute value

1        Mazda RX4      mpg  21.0

2    Mazda RX4 Wag      mpg  21.0

3        Datsun 710      mpg  22.8

4    Hornet 4 Drive      mpg  21.4

5 Hornet Sportabout      mpg  18.7

6          Valiant      mpg  18.1

tail(mtcarsNew)

car attribute value

347  Porsche 914-2      carb    2

348  Lotus Europa      carb    2

349 Ford Pantera L      carb    4

350  Ferrari Dino      carb    6

351  Maserati Bora      carb    8

352    Volvo 142E      carb    2

As you can see, every column except car is gathered into two columns, named attribute and value.

A nice feature of tidyr is that you can gather only selected columns and leave the others untouched. If you want to gather all the columns between mpg and gear while keeping carb and car unchanged, do this:

mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear)

head(mtcarsNew)

car carb attribute value

1        Mazda RX4    4      mpg  21.0

2    Mazda RX4 Wag    4      mpg  21.0

3        Datsun 710    1      mpg  22.8

4    Hornet 4 Drive    1      mpg  21.4

5 Hornet Sportabout    2      mpg  18.7

6          Valiant    1      mpg  18.1

spread

spread is called as follows:

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)

Just as with cast in the reshape2 package, we get the following result:

mtcarsSpread <- mtcarsNew %>% spread(attribute, value)

head(mtcarsSpread)

car carb  mpg cyl disp  hp drat    wt  qsec vs am gear

1        AMC Javelin    2 15.2  8  304 150 3.15 3.435 17.30  0  0    3

2 Cadillac Fleetwood    4 10.4  8  472 205 2.93 5.250 17.98  0  0    3

3        Camaro Z28    4 13.3  8  350 245 3.73 3.840 15.41  0  0    3

4  Chrysler Imperial    4 14.7  8  440 230 3.23 5.345 17.42  0  0    3

5        Datsun 710    1 22.8  4  108  93 3.85 2.320 18.61  1  1    4

6  Dodge Challenger    2 15.5  8  318 150 2.76 3.520 16.87  0  0    3

unite

unite is called as follows:

unite(data, col, ..., sep = "_", remove = TRUE)


Here, ... specifies the columns to combine, and col is the name of the new, combined column.

First, let's make up some data:

set.seed(1)

date <- as.Date('2016-01-01') + 0:14

hour <- sample(1:24, 15)

min <- sample(1:60, 15)

second <- sample(1:60, 15)

event <- sample(letters, 15)

data <- data.frame(date, hour, min, second, event)

data

date hour min second event

1  2016-01-01    7  30    29    u

2  2016-01-02    9  43    36    a

3  2016-01-03  13  58    60    l

4  2016-01-04  20  22    11    q

5  2016-01-05    5  44    47    p

6  2016-01-06  18  52    37    k

7  2016-01-07  19  12    43    r

8  2016-01-08  12  35      6    i

9  2016-01-09  11  7    38    e

10 2016-01-10    1  14    21    b

11 2016-01-11    3  20    42    w

12 2016-01-12  14  1    32    t

13 2016-01-13  23  19    52    h

14 2016-01-14  21  41    26    s

15 2016-01-15    8  16    25    o

Now we want to combine the date, hour, min, and second columns into a new datetime column. Date-times in R are conventionally written as "Year-Month-Day Hour:Min:Second".

dataNew <- data %>%

unite(datehour, date, hour, sep = ' ') %>%

unite(datetime, datehour, min, second, sep = ':')

dataNew

datetime event

1  2016-01-01 7:30:29    u

2  2016-01-02 9:43:36    a

3  2016-01-03 13:58:60    l

4  2016-01-04 20:22:11    q

5  2016-01-05 5:44:47    p

6  2016-01-06 18:52:37    k

7  2016-01-07 19:12:43    r

8  2016-01-08 12:35:6    i

9  2016-01-09 11:7:38    e

10  2016-01-10 1:14:21    b

11  2016-01-11 3:20:42    w

12  2016-01-12 14:1:32    t

13 2016-01-13 23:19:52    h

14 2016-01-14 21:41:26    s

15  2016-01-15 8:16:25    o

separate

separate is called as follows:

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,

convert = FALSE, extra = "warn", fill = "warn", ...)

We can use separate to restore the data to the form it had when first created, as shown below:

data1 <- dataNew %>%

separate(datetime, c('date', 'time'), sep = ' ') %>%

separate(time, c('hour', 'min', 'second'), sep = ':')

data1

date hour min second event

1  2016-01-01    7  30     29     u

2  2016-01-02    9  43     36     a

3  2016-01-03   13  58     60     l

4  2016-01-04   20  22     11     q

5  2016-01-05    5  44     47     p

6  2016-01-06   18  52     37     k

7  2016-01-07   19  12     43     r

8  2016-01-08   12  35      6     i

9  2016-01-09   11   7     38     e

10 2016-01-10    1  14     21     b

11 2016-01-11    3  20     42     w

12 2016-01-12   14   1     32     t

13 2016-01-13   23  19     52     h

14 2016-01-14   21  41     26     s

15 2016-01-15    8  16     25     o

First, datetime is split into date and time columns; then time is split into hour, min, and second columns.

In this chapter, you will learn a consistent way to organise your data in R, an organisation called tidy data . Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

This chapter will give you a practical introduction to tidy data and the accompanying tools in the tidyr package. If you’d like to learn more about the underlying theory, you might enjoy the Tidy Data paper published in the Journal of Statistical Software, http://www.jstatsoft.org/v59/i10/paper .

In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.

You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables country , year , population , and cases , but each dataset organises the values in a different way.

These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.

There are three interrelated rules which make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Figure 12.1 shows the rules visually.

Figure 12.1: Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.

These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:

1. Put each dataset in a tibble.
2. Put each variable in a column.

In this example, only table1 is tidy. It’s the only representation where each column is a variable.

Why ensure that your data is tidy? There are two main advantages:

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with table1 .
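A sketch of that kind of work, using the table1 dataset that ships with tidyr and a couple of dplyr verbs:

```r
library(tidyr)
library(dplyr)

# Compute rate per 10,000 people: each variable is a column,
# so mutate() can refer to cases and population directly
rate_tbl <- table1 %>%
  mutate(rate = cases / population * 10000)

# Count total cases per year
cases_per_year <- table1 %>%
  count(year, wt = cases)

rate_tbl
cases_per_year
```

Because the data is tidy, both computations are one-liners; with table2 or table4a the same questions would need reshaping first.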

The principles of tidy data seem so obvious that you might wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:

This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:

Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr: pivot_longer() and pivot_wider() .

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Take table4a : the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables. To describe that operation we need three parameters:

1. The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
2. The name of the variable to move the column names to. Here it is year.
3. The name of the variable to move the column values to. Here it is cases.

Together those parameters generate the call to pivot_longer() :

The columns to pivot are specified with [dplyr::select()](https://dplyr.tidyverse.org/reference/select.html) style notation. Here there are only two columns, so we list them individually. Note that “1999” and “2000” are non-syntactic names (because they don’t start with a letter) so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see the documentation for dplyr::select().

year and cases do not exist in table4a so we put their names in quotes.
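Together, those pieces give the pivot_longer() call (a sketch using the table4a dataset that ships with tidyr):

```r
library(tidyr)
library(dplyr)

# The backticks are needed because 1999 and 2000 are non-syntactic names
tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4a
```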

Figure 12.2: Pivoting table4 into a longer, tidy form.

In the final result, the pivoted columns are dropped, and we get new year and cases columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure 12.2 .

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns. I don’t believe it makes sense to describe a dataset as being in “long form”. Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B.

We can use pivot_longer() to tidy table4b in a similar fashion. The only difference is the variable stored in the cell values:

To combine the tidied versions of table4a and table4b into a single tibble, we need to use [dplyr::left_join()](https://dplyr.tidyverse.org/reference/mutate-joins.html) , which you’ll learn about in relational data .
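A sketch of the whole sequence: pivot both tables, then join them on their common columns:

```r
library(tidyr)
library(dplyr)

tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# left_join() matches on the shared columns country and year
combined <- left_join(tidy4a, tidy4b)
combined
```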

pivot_wider() is the opposite of pivot_longer() . You use it when an observation is scattered across multiple rows. For example, take table2 : an observation is a country in a year, but each observation is spread across two rows.

To tidy this up, we first analyse the representation in similar way to pivot_longer() . This time, however, we only need two parameters:

1. The column to take variable names from. Here, it’s type.
2. The column to take values from. Here, it’s count.

Once we’ve figured that out, we can use pivot_wider() , as shown programmatically below, and visually in Figure 12.3 .
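A sketch of that call, using the table2 dataset that ships with tidyr:

```r
library(tidyr)
library(dplyr)

# names_from: the column whose values become new column names
# values_from: the column whose values fill those new columns
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)
wide2
```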

Figure 12.3: Pivoting table2 into a “wider”, tidy form.

As you might have guessed from their names, pivot_wider() and pivot_longer() are complements. pivot_longer() makes wide tables narrower and longer; pivot_wider() makes long tables shorter and wider.

So far you’ve learned how to tidy table2 and table4 , but not table3 . table3 has a different problem: we have one column ( rate ) that contains two variables ( cases and population ). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate() : unite() , which you use if a single variable is spread across multiple columns.

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3 :

The rate column contains both cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure 12.4 and the code below.

Figure 12.4: Separating table3 makes it tidy

By default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter). For example, in the code above, separate() split the values of rate at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate() . For example, we could rewrite the code above as:
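A sketch of both forms, using tidyr's built-in table3:

```r
library(tidyr)
library(dplyr)

# Default: split at any non-alphanumeric character (here, the "/")
sep_default <- table3 %>%
  separate(rate, into = c("cases", "population"))

# Equivalent, with the separator made explicit
sep_explicit <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")

sep_explicit
```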

(Formally, sep is a regular expression, which you’ll learn more about in strings .)

Look carefully at the column types: you’ll notice that cases and population are character columns. This is the default behaviour in separate() : it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE :
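A sketch of the same split with convert = TRUE, which turns the new columns into numbers:

```r
library(tidyr)
library(dplyr)

converted <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)

# cases and population are now integer columns rather than character
sapply(converted[c("cases", "population")], class)
```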

You can also pass a vector of integers to sep . separate() will interpret the integers as positions to split at. Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings. When using integers to separate strings, the length of sep should be one less than the number of names in into .

You can use this arrangement to separate the last two digits of each year. This makes the data less tidy, but it is useful in other cases, as you’ll see in a little bit.
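A sketch: splitting year after the second character with sep = 2:

```r
library(tidyr)
library(dplyr)

# sep = 2 splits each string between positions 2 and 3
century_split <- table3 %>%
  separate(year, into = c("century", "year"), sep = 2)
century_split
```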

unite() is the inverse of separate() : it combines multiple columns into a single column. You’ll need it much less frequently than separate() , but it’s still a useful tool to have in your back pocket.

Figure 12.5: Uniting table5 makes it tidy

We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as [tidyr::table5](https://tidyr.tidyverse.org/reference/table1.html) . unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in [dplyr::select()](https://dplyr.tidyverse.org/reference/select.html) style:

In this case we also need to use the sep argument. The default will place an underscore ( _ ) between the values from different columns. Here we don’t want any separator so we use "" :
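A sketch using table5, which ships with tidyr:

```r
library(tidyr)
library(dplyr)

# The default sep = "_" would produce values like "19_99";
# sep = "" rejoins century and year with no separator
rejoined <- table5 %>%
  unite(new, century, year, sep = "")
rejoined
```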

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

Let’s illustrate this idea with a very simple data set:
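The dataset this passage refers to (the stocks example from R for Data Science) can be reconstructed as:

```r
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks
```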

There are two missing values in this dataset:

1. The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.
2. The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.

The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

Because these explicit missing values may not be important in other representations of the data, you can set values_drop_na = TRUE in pivot_longer() to turn explicit missing values implicit:
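A sketch of the round trip on the stocks data (reconstructed above from the text): widening makes the missing 2016 Q1 value explicit; pivoting back with values_drop_na = TRUE then drops every explicit NA:

```r
library(tidyr)
library(dplyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Years in columns: the implicit 2016 Q1 gap becomes an explicit NA
wide <- stocks %>%
  pivot_wider(names_from = year, values_from = return)

# Round trip with values_drop_na = TRUE: all NAs become implicit again
long <- wide %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )
long
```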

Another important tool for making missing values explicit in tidy data is complete() :

complete() takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NA s where necessary.
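For example, on the same stocks data (reconstructed above from the text):

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# All combinations of year and qtr: the missing 2016 Q1 row
# is added, with an explicit NA for return
completed <- stocks %>% complete(year, qtr)
completed
```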

There’s one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

You can fill in these missing values with [fill()](https://tidyr.tidyverse.org/reference/fill.html) . It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
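A sketch using the small treatment data frame from this example (values as given in R for Data Science):

```r
library(tidyr)
library(tibble)

treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          9,
  "Katherine Burke",  1,          4
)

# Each NA in person is replaced by the most recent non-missing value
filled <- treatment %>% fill(person)
filled
```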

To finish off the chapter, let’s pull together everything you’ve learned to tackle a realistic data tidying problem. The [tidyr::who](https://tidyr.tidyverse.org/reference/who.html) dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report , available at http://www.who.int/tb/country/data/download/en/ .

There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided:

This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.

The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:

So we need to gather together all the columns from new_sp_m014 to newrel_f65 . We don’t know what those values represent yet, so we’ll give them the generic name "key" . We know the cells represent the count of cases, so we’ll use the variable cases . There are a lot of missing values in the current representation, so for now we’ll use values_drop_na just so we can focus on the values that are present.
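That pivot can be written as follows (who ships with tidyr):

```r
library(tidyr)
library(dplyr)

who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  )

# Counting the new key column hints at the structure of the codes
who1 %>% count(key)
```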

We can get some hint of the structure of the values in the new key column by counting them:

You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel we have newrel (it’s hard to spot this here but if you don’t fix it we’ll get errors in subsequent steps). You’ll learn about str_replace() in strings , but the basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.

We can separate the values in each code with two passes of separate() . The first pass will split the codes at each underscore.

Then we might as well drop the new column because it’s constant in this dataset. While we’re dropping columns, let’s also drop iso2 and iso3 since they’re redundant.

Next we’ll separate sexage into sex and age by splitting after the first character:

The who dataset is now tidy!

I’ve shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn’t how you’d work interactively. Instead, you’d gradually build up a complex pipe:
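The full pipeline might look like this (the stringr package is assumed to be installed for str_replace()):

```r
library(tidyr)
library(dplyr)

who_tidy <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  # Make the inconsistent "newrel" codes match the "new_*" pattern
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  # First pass: split the codes at each underscore
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  # Drop the constant and redundant columns
  select(-new, -iso2, -iso3) %>%
  # Second pass: split sexage after the first character
  separate(sexage, c("sex", "age"), sep = 1)

who_tidy
```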

Before we continue on to other topics, it’s worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term “messy” to refer to non-tidy data. That’s an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:

Either of these reasons means you’ll need something other than a tibble (or data frame). If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to use other structures; tidy data is not the only way.

If you’d like to learn more about non-tidy data, I’d highly recommend this thoughtful blog post by Jeff Leek: http://simplystatistics.org/2016/02/17/non-tidy-data/