R Plotting Packages 05: Drawing Venn Diagrams with ggvenn and VennDiagram

Python010


R plotting packages series:

This package (ggvenn) accepts either a list or a data frame as input.
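The two input forms can be sketched as follows. This is a minimal example, assuming the ggvenn package is installed; the set names and members are made up for illustration.

```r
library(ggvenn)

# List input: each named element holds the members of one set (hypothetical data).
sets_list <- list(
  A = c(1, 2, 3, 4, 5),
  B = c(4, 5, 6, 7),
  C = c(1, 5, 7, 8)
)
p_list <- ggvenn(sets_list, c("A", "B", "C"))

# Data frame input: one row per element, one logical column per set.
sets_df <- data.frame(
  value = 1:8,
  A = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  B = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE),
  C = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)
p_df <- ggvenn(sets_df, c("A", "B", "C"))

print(p_list)
```

Both calls describe the same three sets, so the two diagrams should show the same region counts.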

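1.4.1 Customizing colors and sizes

Fill parameters:

fill_color - default is c("blue", "yellow", "green", "red")

fill_alpha - default is 0.5

Outline parameters:

stroke_color - default is "black"

stroke_alpha - default is 1

stroke_size - default is 1

stroke_linetype - default is "solid"

Set name parameters:

set_name_color - default is "black"

set_name_size - default is 6

Text parameters inside the diagram:

text_color - default is "black"

text_size - default is 4

All of the parameters above can be used with both ggvenn() and geom_venn()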
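As a rough illustration of these styling parameters, the sketch below overrides most of the defaults in one call. The sets and the chosen colors are assumptions for demonstration only.

```r
library(ggvenn)

# Hypothetical sets used only to have something to draw.
sets <- list(A = 1:5, B = 4:8, C = c(1, 5, 8, 9))

p_styled <- ggvenn(
  sets,
  fill_color = c("#0073C2", "#EFC000", "#CD534C"),  # one fill color per set
  fill_alpha = 0.4,                                 # more transparent than the default 0.5
  stroke_color = "grey30",
  stroke_size = 0.5,
  stroke_linetype = "dashed",
  set_name_color = "navy",
  set_name_size = 5,
  text_color = "black",
  text_size = 3.5
)

print(p_styled)
```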

1.4.2 Showing elements

show_elements - default is FALSE

label_sep - the text used to join elements; default is ","
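A small sketch of these two parameters, assuming an installed ggvenn version that provides label_sep; the fruit names are invented.

```r
library(ggvenn)

# Hypothetical sets with short labels so the elements fit inside the regions.
fruit_sets <- list(
  A = c("apple", "pear", "kiwi"),
  B = c("kiwi", "plum")
)

# Show the elements themselves instead of counts, one element per line.
p_elements <- ggvenn(fruit_sets, show_elements = TRUE, label_sep = "\n")

print(p_elements)
```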

1.4.3 Hiding percentages and changing their number of decimal places

show_percentage - default is TRUE

digits - default is 1
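The effect of these two parameters can be compared with a sketch like the following; the sets are hypothetical.

```r
library(ggvenn)

sets <- list(A = 1:6, B = 4:9)  # hypothetical sets

# Counts only, no percentages.
p_counts_only <- ggvenn(sets, show_percentage = FALSE)

# Percentages shown with two decimal places instead of one.
p_two_digits <- ggvenn(sets, digits = 2)

print(p_counts_only)
print(p_two_digits)
```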

Parameters:

The VennDiagram package can draw Venn diagrams for at most five sets.
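For comparison, a minimal VennDiagram sketch is shown here. It assumes the VennDiagram package is installed and uses made-up sets; passing filename = NULL asks venn.diagram() to return grid objects instead of writing an image file.

```r
library(VennDiagram)
library(grid)

# Hypothetical sets; up to five are supported.
sets <- list(A = 1:10, B = 6:15, C = c(1, 8, 14, 20))

venn_grobs <- venn.diagram(
  x = sets,
  filename = NULL,                        # return grobs instead of saving to disk
  fill = c("blue", "yellow", "green"),
  alpha = 0.5
)

# Draw the returned grobs on a fresh page.
grid.newpage()
grid.draw(venn_grobs)
```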

References:

https://github.com/yanlinlin82/ggvenn

Drawing Venn diagrams in R: ggvenn

venn.diagram function documentation

The R Graph Gallery plotting tutorial

https://cloud.tencent.com/developer/article/1675092

https://www.jianshu.com/p/f858521828a5

R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.

The tidyverse bundles packages such as ggplot2, readr, dplyr, tibble, and purrr, so reading and writing data, data processing, and data visualization can all be done in one place.

You only need to install a package once, but you need to reload it every time you start a new session.

If we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function(). For example, ggplot2::ggplot() tells you explicitly that we’re using the ggplot() function from the ggplot2 package.
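The install-once, load-every-session pattern and the package::function() form look roughly like this. The installation lines are left commented out so the sketch does not modify the local library.

```r
# One-time installation (uncomment on a machine that lacks the packages).
# install.packages("tidyverse")   # installs ggplot2 along with the other packages
# install.packages("ggplot2")     # or just ggplot2 on its own

# Loading is required in every new R session.
library(ggplot2)

# Explicit form: no doubt about which package ggplot() and mpg come from.
p_empty <- ggplot2::ggplot(data = ggplot2::mpg)

print(p_empty)  # an empty plot, since no layers have been added yet
```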

Before we start, here is the case study used in this part:

Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

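mpg is a data set built into the ggplot2 package:

Here int means integer, dbl means double precision, and chr means character.

The meaning of the variables above: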
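One way to inspect the data set and its variables is sketched below. The descriptions in mpg_variables are paraphrased from the ?mpg help page; the vector itself is only an illustrative helper, not part of ggplot2.

```r
library(ggplot2)

# Printing the tibble shows the type abbreviation (int, dbl, chr) under each column name.
print(mpg)

# The first class of every column, as a named character vector.
column_types <- vapply(mpg, function(col) class(col)[1], character(1))
print(column_types)

# Variable meanings, paraphrased from ?mpg (helper vector for illustration only).
mpg_variables <- c(
  manufacturer = "manufacturer name",
  model        = "model name",
  displ        = "engine displacement, in litres",
  year         = "year of manufacture",
  cyl          = "number of cylinders",
  trans        = "type of transmission",
  drv          = "drive train: f = front-wheel, r = rear-wheel, 4 = four-wheel",
  cty          = "city miles per gallon",
  hwy          = "highway miles per gallon",
  fl           = "fuel type",
  class        = "type of car"
)
```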

Wilkinson (2005) proposed the grammar → Wickham (2009) wrote ggplot2

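In 2005 Wilkinson proposed a grammar for describing the deep features of all statistical graphics: a statistical graphic is a mapping from data to the aesthetic attributes (aes for short, such as color, shape, and size) of geometric objects (geom for short, such as points, lines, and bars). The graphic may also contain statistical transformations of the data (stats for short), and it is finally drawn in a particular coordinate system (coord for short). Faceting (facet, which splits the plotting window into several sub-windows) can be used to draw the same plot for different subsets of the data (毛里里求斯).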

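The ggplot2 package was written by Hadley Wickham (2009a). It provides a graphics system based on the grammar of graphics described by Wilkinson (2005), and Wickham (2009b) also extended that grammar. The goal of ggplot2 is to provide a comprehensive, grammar-based, and coherent system for producing graphics, one that lets users create novel and innovative data visualizations. The power of this approach has made ggplot2 an important tool for data visualization in R (攀董).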

ggplot2 has the following characteristics (黄宝臣):

Below is a diagram of the ggplot2 layer functions:

Basic commands:

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg.
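The three paragraphs above correspond to a call of roughly this shape, using displ (engine size) and hwy (highway fuel efficiency) from mpg. The commented template at the end is a general pattern with placeholders, not runnable code.

```r
library(ggplot2)

# ggplot() sets up the coordinate system; geom_point() adds a layer of points.
p_scatter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p_scatter)

# General template (placeholders in angle brackets):
# ggplot(data = <DATA>) +
#   <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```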

Next, let's expand on the <MAPPINGS> part.

The plot we just drew clearly contains some outliers. How can we analyze them?

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:

Note: when class is mapped to shape, ggplot2 warns: "The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them."
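One possible way to follow the advice in that warning is to supply the shapes by hand with scale_shape_manual(). The particular shape codes 0 to 6 below are an arbitrary choice for illustration.

```r
library(ggplot2)

# class has 7 levels, one more than the default shape palette handles,
# so seven shape codes are specified manually.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = c(0, 1, 2, 3, 4, 5, 6))

print(p_shape)
```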

By color

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car.

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.
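Mapping class to color inside aes() can be sketched like this:

```r
library(ggplot2)

# color is placed inside aes(), so ggplot2 scales it: one color per class, plus a legend.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

print(p_color)
```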

Note: if color is set outside of mapping (that is, outside aes()), it only changes the color of all points; no mapping is created.
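For contrast, a sketch with color set as a fixed argument of geom_point() rather than inside aes(); "blue" is just an example value.

```r
library(ggplot2)

# color is outside aes(): every point is drawn blue and no legend is produced.
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

print(p_blue)
```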

Why two-seaters?

The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

By size

In the above example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a warning here, because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea.
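The size mapping described above would look roughly as follows. Building or printing the plot is expected to trigger the warning about using size for a discrete variable.

```r
library(ggplot2)

# class is unordered, size is ordered: this works but ggplot2 advises against it.
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

print(p_size)
```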

Besides classifying by color, shape, and so on, we can also try the following:

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
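One way to explore both questions is sketched below. The choice of shape 21 and the stroke width are assumptions made for the demonstration; ?geom_point describes stroke as the border width for shapes that have a border.

```r
library(ggplot2)

# Mapping an expression: displ < 5 is evaluated per row, giving a TRUE/FALSE color scale.
p_expr <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))

# stroke sets the border width; shape 21 is a filled circle with a separate border.
p_stroke <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, colour = "black", fill = "white", size = 3, stroke = 2
  )

print(p_expr)
print(p_stroke)
```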

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.

If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name.
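The three faceting forms can be sketched on the mpg scatterplot like this; the choice of class, drv, and cyl as faceting variables is only for illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One variable: a ribbon of panels, one per class.
p_wrap <- base_plot + facet_wrap(~ class)

# Two variables: rows by drv, columns by cyl.
p_grid <- base_plot + facet_grid(drv ~ cyl)

# Columns only: a . in the rows position means "do not facet on rows".
p_cols <- base_plot + facet_grid(. ~ cyl)

print(p_wrap)
print(p_grid)
print(p_cols)
```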

What are the benefits of faceting?

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

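When a variable has many levels, colors become hard to tell apart and no longer separate the observations well, whereas facets still do. The drawback is that points in different panels are harder to compare. So when there are few, easily distinguished levels, map the variable to an aesthetic; when there are many and colors or sizes become hard to distinguish, consider faceting (TidyFridy笔记本).

Faceting by one variable and by two variables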

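Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow and ncol control how the facet panels are arranged. For facet_grid(), the number of panels in the x and y directions is already fixed by the faceting variables, so there is nothing to set.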
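A short sketch of nrow and ncol in facet_wrap(), using class (7 levels) as the faceting variable:

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Seven panels arranged in two rows.
p_two_rows <- base_plot + facet_wrap(~ class, nrow = 2)

# The same seven panels arranged in two columns.
p_two_cols <- base_plot + facet_wrap(~ class, ncol = 2)

print(p_two_rows)
print(p_two_cols)
```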

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.

To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use this code:
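A sketch of the same data drawn with two different geoms, assuming the plots in question are the displ versus hwy scatterplot and its smoothed counterpart:

```r
library(ggplot2)

# Point geom: one point per car.
p_points <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Smooth geom: a fitted line through the same data.
p_smooth <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

print(p_points)
print(p_smooth)
```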

Adjusting the line type

Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.

Here geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drivetrain. One line describes all of the points with a 4 value, one line describes all of the points with an f value, and one line describes all of the points with an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.
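The linetype mapping described above can be sketched as:

```r
library(ggplot2)

# One smooth line per drv value (4, f, r), each with its own linetype.
p_linetype <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

print(p_linetype)
```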

Comparing group and color

Many geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
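The difference between group and color can be sketched with two otherwise identical plots:

```r
library(ggplot2)

# group only: three separate lines, all drawn in the same style, no legend.
p_group <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

# color: the data is grouped automatically and each line gets its own color.
p_group_color <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

print(p_group)
print(p_group_color)
```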

Multiple geoms

To display multiple geoms in the same plot, add multiple geom functions to ggplot():
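A minimal sketch with a point layer and a smooth layer, each carrying its own copy of the mapping:

```r
library(ggplot2)

p_layers <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

print(p_layers)
```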

Global mappings

This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display cty instead of hwy. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:
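With the mapping moved into ggplot(), a point-plus-smooth plot can be sketched like this:

```r
library(ggplot2)

# The mapping is written once and inherited by both layers.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

print(p_global)
```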

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.

You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

se controls whether the confidence band around the smooth line is displayed

filter(mpg, class == "subcompact") keeps only the cars whose class is subcompact
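The local mapping and local data ideas can be combined in one sketch. To keep the example free of extra dependencies it uses base R subset() in place of dplyr's filter(); for this condition the two select the same rows.

```r
library(ggplot2)

# Same rows as filter(mpg, class == "subcompact"), using base R.
subcompact_cars <- subset(mpg, class == "subcompact")

p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +          # local mapping: color only for points
  geom_smooth(data = subcompact_cars, se = FALSE)     # local data, no confidence band

print(p_local)
```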

Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
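The basic bar chart described here can be sketched as:

```r
library(ggplot2)

# Only x is mapped; the bar heights (counts per cut) are computed by geom_bar().
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

print(p_bar)
```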

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().

Default stat

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called “Computed variables”. That describes how it computes two new variables: count and prop.

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():
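A sketch of the interchangeability, building the cut bar chart once through the geom and once through the stat:

```r
library(ggplot2)

# Via the geom (default stat is "count").
p_geom_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Via the stat (default geom is a bar).
p_stat_count <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

print(p_stat_count)
```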

This works because every geom has a default stat and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:

Note: group = 1 treats all the data as a single group. Without it, the proportion is computed within each cut, so all bars have the same height.
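The effect of group = 1 can be checked with a sketch like the one below. It uses after_stat(prop), which assumes ggplot2 3.3.0 or later; older code writes the same thing as ..prop...

```r
library(ggplot2)

# With group = 1 the proportions are computed over the whole dataset.
p_prop_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Without it each cut is its own group, so every proportion is 1.
p_prop_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

print(p_prop_grouped)
print(p_prop_ungrouped)
```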

ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. ?stat_bin. To see a complete list of stats, try the ggplot2 cheatsheet.

One variable: outline and fill

There’s one more piece of magic associated with bar charts. You can colour a bar chart using either the colour aesthetic, or, more usefully, fill:
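The colour and fill aesthetics on a bar chart can be sketched as follows. The third plot maps fill to a different variable (clarity), which makes the bars stack automatically.

```r
library(ggplot2)

# colour affects only the bar outlines.
p_outline <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))

# fill colors the whole bar.
p_filled <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

# fill mapped to another variable: each bar is split into stacked segments.
p_stacked <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

print(p_outline)
print(p_filled)
print(p_stacked)
```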

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".

The above is the automatic adjustment. Next, using bar charts as the example, let's look at the position adjustments that ggplot2 supports.

1. position = "identity"

position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.

The identity position adjustment is more useful for 2d geoms, like points, where it is the default.

2. position = "fill"

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

3. position = "dodge"

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
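The three position adjustments listed above can be compared side by side with a sketch like this; the alpha value of 1/5 is an arbitrary choice.

```r
library(ggplot2)

# 1. identity: bars overlap, so they are made semi-transparent to stay visible.
p_identity <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1 / 5, position = "identity")

# 2. fill: stacked bars rescaled to the same height, for comparing proportions.
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# 3. dodge: bars placed beside one another, for comparing individual values.
p_dodge <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

print(p_identity)
print(p_fill)
print(p_dodge)
```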

There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?

The reason is that some points are hidden behind others (overplotting). geom_point(position = "jitter") can alleviate this.
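A sketch of the jitter adjustment, together with a quick count of how many distinct (displ, hwy) positions the data actually contains:

```r
library(ggplot2)

# Number of distinct point positions versus number of observations.
n_positions <- nrow(unique(mpg[, c("displ", "hwy")]))
n_rows <- nrow(mpg)
cat("distinct positions:", n_positions, "of", n_rows, "observations\n")

# Jitter adds a small amount of random noise so overlapping points spread out.
p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

print(p_jitter)
```

Because the noise is random, the plot looks slightly different on every run; geom_jitter() is a shorthand for the same layer.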

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
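Two of the alternative coordinate systems, coord_flip() and coord_polar(), can be sketched as follows. The choice of plots is illustrative rather than taken from the text above.

```r
library(ggplot2)

# coord_flip() swaps the x and y axes, which helps with long category labels.
p_box <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
p_flipped <- p_box + coord_flip()

# coord_polar() redraws a bar chart in polar coordinates.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1)
p_polar <- p_bar + coord_polar()

print(p_flipped)
print(p_polar)
```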

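References:

R for Data Science

Five minutes a day: getting started with R the easy way (part 1)

How to use ggplot2?

R visualization | The ggplot2 framework and its main functions

ggplot2 topic analysis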