单细胞数据处理小细节汇总

Python014

单细胞数据处理小细节汇总,第1张

1. Seurat对象查看当前的Assay

在进行了SCTransform操作后,矩阵默认会变成SCT矩阵,如果不加设置,后续的PCA等操作都是基于SCT矩阵。

修改DefaultAssay:

2. Seurat使用FindVariables找到高变基因后增删高变基因

3. 不同运行步骤中的文件来源和储存位置⚠️

⚠️:PCA的值是可以被覆盖的,使用三步法对矩阵进行标准化后进行PCA后再使用SCT矩阵进行标准化,PCA的矩阵变成了SCT的PCA矩阵,原有的PCA矩阵不会保留。后续的TSNE和UMAP降维图也和三步法不一样。

4. @data标准化矩阵 和 @scale.data 归一化矩阵 的区别

细胞RNA 测序数据中,文库之间测序覆盖率的系统差异通常是由细胞间的cDNA 捕获或PCR 扩增效率方面的技术差异引起的,这归因于用最少的起始材料难以实现一致的文库制备。标准化旨在消除这些差异(例如长度,GC 含量),以使它们不干扰细胞之间表达谱的比较。这样可以确保在细胞群体中观察到的任何异质性或差异表达都是由生物学而不是技术偏倚引起的(批次矫正仅在批次之间发生,并且必须同时考虑技术偏差和生物学差异,标准化只需考虑技术差异)。

软件Seurat 提供了三种标准化的方法,分别为LogNormalize、CLR、RC,通常情况下我们采用LogNormalize 的方式进行标准化,计算公式为:log1p((Feature counts/total counts) ∗ scale. factor)

归一化的目的则是使特征具有相同的度量尺度

参考: Seurat的normalization和scaling

5. 关于有些细胞属于同一个cluster但是在umap或者tsne图上相聚较远的问题:UMAP和TSNE是各自的算法在PCA降维的基础上再进行非线形降维,在二维图上把其各自算法认为相近的细胞聚在一起。但是FindClusters输入的不是UMAP或TSNE降维的数据,而是FindNeighbors的数据,而FindNeighbors输入的数据是PCA降维数据,是用另外一种算法计算的细胞之间的距离。因此会出现有些细胞被认为是同一个cluster,但是在umap或者tsne图上相聚较远(尤其是一些散在的,脱离主群的细胞)

6. marker基因鉴定,查看marker基因的表达是使用RNA矩阵还是sct矩阵?

这是一个争议性问题,两个都可以,目前建议最好使用RNA矩阵。

sct的到的count并不是真实的基因表达值,而是通过scaledata倒推出来的,它是一个回归,运算之后的残差。

7. 关于FindAllMarks找到的基因

如下图,先看cluster0的Marker基因:cluster0的差异基因是cluster0的细胞和剩下的所有的cluster合在一起的细胞做对比得到的。pct.1是这个基因在cluster0中的表达比例(S100A8在cluster0的细胞中的表达比例是100%),pct.2是这个基因在除了cluster0以外的所有细胞中的表达比例(S100A8在除cluster0以外的细胞中的表达比例是51.2%)。avg_log2FC是表达差异倍数,p_val_adj是校正后的p值。

8. 在提取Marker基因时比较好的办法:因为单细胞矩阵算出来的结果,p_val_adj==0的有很多,所以可以先把p_val_adj==0先提出来。再把p_val_adj<0.01的按差异倍数靠前的20/30/500...(按需要)个基因提出来,然后把两个矩阵合在一起(取交集)用来做细胞鉴定。(结合p值和fc来做筛选效果更好)

⚠️:提取没有核糖体和线粒体的marker基因更好。(这些基因对鉴定没有帮助)

有些基因比如Foxp3,对细胞鉴定很重要,但常常在筛选Marker基因的时候筛选不出来。 非负矩阵分解 可能更好。

参考: 过滤线粒体核糖体基因

9. 提取亚群

⚠️ 新提取的亚群需要重新进行降维聚类 (和大群相比,标记基因发生了变化),并重新寻找marker基因,重新分群,注释。❗️subset提取子集后,不同样本间批次校正的信息也被去除,需要重新进行批次矫正

参考: Seurat取子集时会用到的函数和方法 ⚠️⚠️⚠️

10. 取子集后如何去除空子集(还存在这个level,但里面包含的细胞为0,如何去除) as.factor(as.character())

11. 双细胞的预测和去除如DoubletFinder建议单样本进行,不建议双样本一起预测。除此之外,其他步骤都可以多样本一起做,质控的时候也可以多样本一起做,但是建议每个样本都单独看一看。

12. 单细胞多样本整合:merge();多样本拆分:SplitObject()

13. 在做多组数据整合,每个组又有多个样本的时候,最好把单独的每个样本当成批次,而不是把不同的组当成批次。

14. 多核运算

参考: 单细胞数据分析中future包的使用

15. pbmc3k.final@commands$FindClusters 可以查看FindClusters运行时间和记录。Seurat是记录其分析过程的,也可查看command下其他操作

16. 关于质控标准:同一组织的最好用同样的标准,不同组织的可以不一样。不同组织线粒体含量等可能存在差异。

17. 可视化的方法总结

参考: https://www.jianshu.com/p/0d1e2c7d21a4

18. circos图绘制

19. 单细胞数据思维导图,有利于查看单细胞数据格式。

https://www.jianshu.com/p/7560f4fd0d77

20. 对于旧版本Seurat对象的更新

scRNA <- UpdateSeuratObject(scRNA)

UpdateSeuratObject {SeuratObject} :Update old Seurat object to accommodate new features

21. 对有些操作需要用到python设置的情况

22. 单细胞数据做pooling的好处:可以尽量的降低dropout的问题。(dropout就是矩阵中的zero,这些zero实际上并不是0,而是每个液滴里面起始反应量太低了。而一般的反转录效率只能到30%左右,70%的转录本实际上在反转录那一步是被丢掉的,这是单细胞测序一个比较大的问题)。

但是一旦做了pooling,你必须要证明pooling对结果是没有影响的(下图的右面三个图)。

23. Seurat的VlnPlot中的combine参数,在如下画三个基因的情况下,设置成T就画一张图,设置成False,会将三个基因各画一张图。

24. rev()这一步是将横坐标的基因反过来排序

这两个画出来的图横坐标基因的顺序是相反的(见NicheNet)

25. 堆叠小提琴图的绘制

完成这个需求有以下几种实现方法:

1. Seurat包直接就可以实现(stack = T)

2. 通过Seuart->scanpy来实现,第一张是Seurat包VlnPlot函数画的图,第二张是scanpy中stacked_violin函数画的图,那么现在问题就变成为Seurat对象到scanpy对象的转换

3. 用R原生函数实现StackedVlnPlot的方法

4. 使用基于scanpy包衍生的scanyuan包来说实现

5. 使用R包MySeuratWrappers来实现

最简单的方法1如下:

如果不设置level,会按字母顺序排列,case会自动排在con前面。

使用Seurat的 RenameIdents 函数也可以

x: table

margin: a vector giving the margins to split by. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. When x has named dimnames, it can be a character vector selecting dimension names.

得到的HC_1样本的orig.ident默认是样本名中第一个_号的前一部分。所以要保证矩阵的列名是 样本名_细胞barcode 这样的格式。

如果有多个分组,例如两个样本矩阵中细胞分别命名为HC_1_barcode,HC_2_barcode,在直接通过如下方法得到两个Seurat对象,再对其进行merge之后,两个样本会被合并成一个。也就是样本信息只保留了第一个_号之前的HC,没有保留_号之后的1和2。

为了避免这种情况,可以在构建Seurat对象时通过参数进行设置

⚠️PC数的选择:Seurat官网提供的三种方法只能给出PC数的粗略范围,选择不同PC数目,细胞聚类效果差别较大,因此,需要一个更具体的PC数目。作者提出一个确定PC阈值的三个标准:

一般先选默认分辨率(0.8),大概可能会分出十几个群。因为最终都是要注释到每一个barcode,所以首先可以看大类marker的分布(不受分辨率影响),可以根据marker基因的分布来调整分辨率。是否需要精细的分群得看精细的分群对研究有没有决定作用,还有很重要的一点是 看分出的各个cluster在Findallmarkers给出的结果中marker的热图是不是能明显分开 。精细划分的细胞本来就很类似,如果有部分小群的热图明显分不开或者非常类似,就可以考虑把分辨率调小。

这实际上是没有必要的必须保持一致的。下游的都是用pca之后的,pca是为了压缩数据。

umap和tsne是为了可视化(仅仅是可视化),但是FindNeighbor是计算细胞间距离矩阵。找类群数目和可视化可以说没有关系。

map函数:

R语言循环第三境界:purrr包map函数!

浅析R语言中map(映射)与reduce(规约)

参考: monocle2

查看不同细胞群的中位基因也是一样

查看不同样品的中位基因也是一样

或者也可以

R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.

tidyverse 包含了 ggplot2, readr, dplyr, tibble, purrr 等工具包,可以一站式完成数据读写、数据处理和数据可视化的任务。

You only need to install a package once, but you need to reload it every time you start a new session.

If we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function(). For example, ggplot2::ggplot() tells you explicitly that we’re using the ggplot() function from the ggplot2 package.

在开始前,先介绍这部分用到的案例信息:

Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

mpg是ggplot2包内置的数据集:

其中,int 整型,dbl 双精度,chr 字符型,

以上变量的含义:

Wilkinson(2005)提出语法规则→Wickham(2009)编写ggplot2

Wilkinson在2005年提出一套用来描述所有统计图形深层特性的语法规则:一张统计图形就是从数据到几何对象(geometric object,缩写为geom,如点、线、条形等)的图形属性(aesthetic attributes,缩写为aes,如颜色、形状、大小等)的一个映射,此外,图形中还可能包含数据的统计变换(statistical system,缩写为stats),最后绘制在某个特定的坐标系(coordinate system,缩写为coord)中,而分面(facet,指将绘图窗口划分为若干个子窗口)则可以用来生成数据不同子集的图形(毛里里求斯)。

ggplot2包由Hadley Wickham(2009a)编写,提供了一种基于Wilkinson(2005)所述图形语法的图形系统,Wickham(2009b)还对该语法进行了扩展。ggplot2包的目标是提供一个全面的、基于语法的、连贯一致的图形生成系统,允许用户创建新颖的、有创新性的数据可视化图形。该方法的力量已经使得ggplot2成为使用R进行数据可视化的重要工具(攀董)。

ggplot2有以下特点(黄宝臣):

以下是ggplot2图层函数的示意图:

基础的命令:

With ggplot2, you begin a plot with the function ggplot() . ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph.

You complete your graph by adding one or more layers to ggplot() . The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg .

接下来我们从<MAPPINGS>映射关系拓展开来~

很容易看出刚刚绘制的图形中有一些异常值,如何来分析呢?

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic.An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:

注意:The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminateyou have 7. Consider specifying shapes manually if you must have them.

按颜色

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car.

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes() . ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

注意:如果在mapping外部设置color时,只是改变了所有点的颜色,并没有做映射。

为什么会是两座车?

The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

按大小

In the above example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a warning here, because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea.

除了按颜色、形状等分类外,我们还可以有如下的操作:

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point )

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ <5) ? Note, you’ll also need to specify x and y.

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets , subplots that each display one subset of the data.

To facet your plot by a single variable , use facet_wrap() . The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

To facet your plot on the combination of two variables , add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~ .

If you prefer to not facet in the rows or columns dimension , use a . instead of a variable name.

分面有什么好处

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

当变量较多的时候,图形属性颜色区分度不高,不能很好区分各个样本点,而分面可以,但是分面后不同面上的点之间不好比较,所以 变量少容易区分时可以用图形属性映射,多的时候颜色大小等不容易区分可以考虑分面 (TidyFridy笔记本)。

单变量和双变量的分面

Read ·?facet_wrap·. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t ·facet_grid()· have nrow and ncol arguments?

nrow 和 ncol 控制分面子图的排版,facet_grid() 对应 x 方向和 y 方向的分面图个数是确定的,所有不用设置。

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trendthey use the point geom.

To change the geom in your plot, change the geom function that you add to ggplot() . For instance, to make the plots above, you can use this code:

调整线段形式

Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.

Here geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drivetrain. One line describes all of the points with a 4 value, one line describes all of the points with an f value, and one line describes all of the points with an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.

对比group和color

Many geoms, like geom_smooth() , use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.

多几何对象

To display multiple geoms in the same plot , add multiple geom functions to ggplot():

全局映射

This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display cty instead of hwy. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph . In other words, this code will produce the same plot as the previous code:

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only . This makes it possible to display different aesthetics in different layers.

You can use the same idea to specify different data for each layer . Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

se 代表是否在图形中显示标准差

filter(mpg, class == "subcompact") 只选择车型为subcompact的汽车

Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar() . The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut . The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price , carat , color , clarity , and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.

On the x-axis, the chart displays cut , a variable from diamonds . On the y-axis, it displays count, but count is not a variable in diamonds ! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().

默认属性

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat _count() . stat_count() is documented on the same page as geom_bar() , and if you scroll down you can find a section called “Computed variables”. That describes how it computes two new variables: count and prop.

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar() :

This works because every geom has a default statand every stat has a default geom . This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly :

注:group = 1 将所有的数据看作一组,如果不设置,所有的 bar 将是等高的

ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. ?stat_bin . To see a complete list of stats, try the ggplot2 cheatsheet.

单变量:边缘和填充

There’s one more piece of magic associated with bar charts. You can colour a bar chart using either the colour aesthetic, or, more usefully, fill :

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill" .

以上是自动调整,接下来以条形图为例来看刊 ggplot 支持的几种位置调整方式。

1. position = 'identity'

position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA .

The identity position adjustment is more useful for 2d geoms, like points, where it is the default.

2. position = "fill"

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

3. position = "dodge"

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

**There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots. **Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?

原因是有些点被覆盖了,可以用 geom_point(position = 'jitter') 来缓解

Coordinate systems are probably the **most complicated part of ggplot2. **The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.

参考资料:

R for Data Science

每天 5 分钟,轻轻松松上手 R 语言(一)

如何使用 ggplot2 ?

R-可视化 | ggplot2框架与主要函数

ggplot2 专题分析