How to Learn R Well

2023-02-26 21:14:01

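My junior labmate recently started learning R, and every day it's "Senior, senior...", "How do I do this...", "Why am I getting another error...", "Senior, senior...". He was driving me crazy, so I wrote this post.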

Beginners are bound to run into all kinds of problems while learning R, enough to make you tear your hair out.

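But there is no need to be afraid or to think about quitting. Using R is a continual cycle of hitting errors and tracking down their causes. When a problem comes up, though, your first reaction should be to search online for an answer rather than to ask the people around you right away, because asking first skips the thinking. Learning bioinformatics is really one long process of self-study.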

Recommended search engine: Bing, Bing, Bing! Please stop using Baidu! Of course, if you can find a way to use Google, so much the better.

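Searching can solve more than ninety percent of problems. If it does not, your search skills probably still need work. For a newcomer, the process of searching, trying fixes, and thinking things through is valuable in itself, and getting better at searching is a big gain on its own.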

If you have tried for a long time and really cannot solve it, then... go ask someone more experienced~

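I actually picked up this habit of searching and solving problems independently at Tongji University, in the group of Professor Xiaole Liu (刘小乐), a leading bioinformatician. Her group runs a month-long bioinformatics training course every year, and I was lucky enough to attend for a while. The instructors hand out many bioinformatics problems along with some teaching videos, and most of the training time goes to doing the homework, that is, working out solutions on your own. The senior students were with us the whole time, but they would not simply tell us the answers; they guided us to think and to solve things ourselves. It was genuinely overwhelming at the time, because I knew nothing and had no idea where to start. Some days I could not answer a single question.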

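Looking back now, though, I gained a great deal. I gradually learned to think independently, and today, with the help of a search engine, I can handle almost any R problem by myself.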

This is probably the idea behind the saying "teaching someone to fish is better than giving them a fish".

R is not hard. As long as you want to learn it, you will.

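Below are the beginner exercises that Professor Liu's group at Tongji University gives to newcomers in the first week of training. I hope they help.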

First, you need to install a few of the most commonly used data-processing packages.
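The post does not say which packages are meant. As a rough sketch, assuming a hypothetical list (only HistData is named in the exercises; dplyr and ggplot2 are guesses at "commonly used" packages), one way to install only what is missing could be:

```r
# Hypothetical package list: only HistData is named in the exercises;
# dplyr and ggplot2 are assumed here as common data-handling choices.
wanted <- c("HistData", "dplyr", "ggplot2")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed.
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
to_install <- wanted[!is_installed]

# Install only the missing packages, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking first avoids reinstalling packages that are already present.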

You can use the mean() function to compute the mean of a vector like so:

However, this does not work if the vector contains NAs:

Please use the R documentation to find the mean after excluding NAs (hint: ?mean).
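The snippets that accompanied this exercise are missing from this copy. A minimal reconstruction, using made-up values rather than the original ones, might look like this (the exercise itself is left unsolved):

```r
# Made-up values; the vectors from the original handout are not shown here.
x <- c(2, 4, 6, 8)
mean(x)        # 5

# The same call on a vector that contains a missing value.
x_na <- c(2, 4, NA, 8)
mean(x_na)     # NA, because one element is missing
```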

In this question, we will practice data manipulation using a dataset collected by Francis Galton in 1886 on the heights of parents and their children. This is a very famous dataset, and Galton used it to come up with regression and correlation.

The data is available as GaltonFamilies in the HistData package. Here, we load the data and show the first few rows. To find out more information about the dataset, use ?GaltonFamilies.
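The loading code is not included in this copy. A sketch of how the dataset could be loaded, assuming the HistData package is already installed, is:

```r
# Load GaltonFamilies only if the HistData package is available.
if (requireNamespace("HistData", quietly = TRUE)) {
  data("GaltonFamilies", package = "HistData")  # creates the GaltonFamilies data frame
  print(head(GaltonFamilies))                   # first six rows
}
```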

a. Please report the height of the 10th child in the dataset.

b. What is the breakdown of male and female children in the dataset?

c. How many observations are in Galton's dataset? Please answer this question without consulting the R help.

d. What is the mean height for the 1st child in each family?

e. Create a table showing the mean height for male and female children.

f. What was the average number of children each family had?

g. Convert the children's heights from inches to centimeters and store it in a column called childHeight_cm in the GaltonFamilies dataset. Show the first few rows of this dataset.
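For readers who have never touched a data frame, the kinds of operations these questions call for can be tried on a toy table first. The data frame below (kids, with columns family, gender, height_in) is made up for illustration and is not the Galton data:

```r
# A tiny made-up data frame; names and values are hypothetical.
kids <- data.frame(
  family    = c("A", "A", "B", "B", "B"),
  gender    = c("male", "female", "female", "male", "female"),
  height_in = c(70, 64, 63, 69, 65)
)

kids$height_in[3]                  # value in the 3rd row: 63
table(kids$gender)                 # counts per category
nrow(kids)                         # number of observations: 5
aggregate(height_in ~ gender, data = kids, FUN = mean)  # mean height per gender
mean(table(kids$family))           # average number of rows per family: 2.5

kids$height_cm <- kids$height_in * 2.54   # derived column (1 inch = 2.54 cm)
head(kids)
```

Working out how to adapt each line to GaltonFamilies is the actual exercise.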

In the code above, we generate ngroups groups of N observations each. In each group, we have X and Y, where X and Y are independent normally distributed data and have 0 correlation.
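The generating code referred to here is not reproduced in this copy. A sketch of what such a simulation might look like follows; the values of ngroups and N, the seed, and the names dat and group are all assumptions:

```r
set.seed(1)        # arbitrary seed, for reproducibility
ngroups <- 1000    # assumed value; the original setting is not shown
N <- 10            # assumed value; the original setting is not shown

# One row per observation; X and Y are drawn independently of each other.
dat <- data.frame(
  group = rep(seq_len(ngroups), each = N),
  X = rnorm(ngroups * N),
  Y = rnorm(ngroups * N)
)
dim(dat)   # ngroups * N rows, 3 columns
```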

a. Find the correlation between X and Y for each group, and display the highest correlations.

Hint: since the data is quite large and your code might take a few moments to run, you can test your code on a subset of the data first (e.g. you can take the first 100 groups like so):
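The subsetting snippet is missing from this copy. Assuming the data sits in a data frame with a numeric group column (dat and group are hypothetical names, and the stand-in data below is arbitrary), taking the first 100 groups could be written as:

```r
# Stand-in data with hypothetical names (dat, group); sizes are arbitrary.
set.seed(2)
dat <- data.frame(group = rep(1:500, each = 10),
                  X = rnorm(5000), Y = rnorm(5000))

dat_small <- dat[dat$group <= 100, ]   # keep only groups 1 to 100
length(unique(dat_small$group))        # 100
nrow(dat_small)                        # 1000
```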

In general, this is good practice whenever you have a large dataset: If you are writing new code and it takes a while to run on the whole dataset, get it to work on a subset first. By running on a subset, you can iterate faster.

However, please do run your final code on the whole dataset.

b. The highest correlation is around 0.8. Can you explain why we see such a high correlation when X and Y are supposed to be independent and thus uncorrelated?

Show a plot of the data for the group that had the highest correlation you found in Problem 4.

We generate some sample data below. The data is numeric, and has 3 columns: X, Y, Z.
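The data-generating code is not reproduced in this copy, so the original design is unknown. A hypothetical stand-in with the same three numeric columns, where Z takes a few distinct levels, could be built like this (every number and the name sample_data are assumptions):

```r
set.seed(3)   # arbitrary seed
n <- 300      # assumed sample size

# Z is numeric but takes only three distinct values (levels).
Z <- sample(1:3, n, replace = TRUE)

# X and Y each shift with Z but are otherwise independent noise.
X <- Z + rnorm(n, sd = 0.3)
Y <- Z + rnorm(n, sd = 0.3)

sample_data <- data.frame(X, Y, Z)
str(sample_data)
```

With a layout like this, the questions can be worked through on sample_data even without the original data.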

a. Compute the overall correlation between X and Y.

b. Make a plot showing the relationship between X and Y. Comment on the correlation that you see.

c. Compute the correlations between X and Y for each level of Z.

d. Make a plot showing the relationship between X and Y, but this time, color the points using the value of Z. Comment on the result, especially any differences between this plot and the previous plot.

You can label data points with the text() function. Call text() after the plotting call: draw the plot first, then add the labels.
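A minimal sketch of that plot-then-label order, using made-up points and labels, might be:

```r
# Made-up points and labels, purely for illustration.
x <- c(1, 2, 3, 4)
y <- c(2, 5, 3, 6)
point_names <- c("a", "b", "c", "d")

plot(x, y, xlim = c(0, 5), ylim = c(0, 7))   # draw the plot first
text(x, y, labels = point_names, pos = 3)    # then label each point just above it
```

Calling text() before any plot exists raises an error, because there is no plot to draw on yet.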

Below is the R help page for the text() function.

Description

text draws the strings given in the vector labels at the coordinates given by x and y. y may be missing since xy.coords(x, y) is used for construction of the coordinates.

Usage

text(x, ...)

## Default S3 method:

text(x, y = NULL, labels = seq_along(x$x), adj = NULL, pos = NULL, offset = 0.5, vfont = NULL, cex = 1, col = NULL, font = NULL, ...)

Arguments

x, y

numeric vectors of coordinates where the text labels should be written. If the length of x and y differs, the shorter one is recycled.

labels

a character vector or expression specifying the text to be written. An attempt is made to coerce other language objects (names and calls) to expressions, and vectors and other classed objects to character vectors by as.character. If labels is longer than x and y, the coordinates are recycled to the length of labels.

adj

one or two values in [0, 1] which specify the x (and optionally y) adjustment of the labels. On most devices values outside that interval will also work.

pos

a position specifier for the text. If specified this overrides any adj value given. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the specified coordinates.

offset

when pos is specified, this value gives the offset of the label from the specified coordinate in fractions of a character width.

vfont

NULL for the current font family, or a character vector of length 2 for Hershey vector fonts. The first element of the vector selects a typeface and the second element selects a style. Ignored if labels is an expression.

cex

numeric character expansion factor; multiplied by par("cex") yields the final character size. NULL and NA are equivalent to 1.0.

col, font

the color and (if vfont = NULL) font to be used, possibly vectors. These default to the values of the global graphical parameters in par().

...

further graphical parameters (from par), such as srt, family and xpd.
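To see how pos, offset, cex and col interact, a small sketch like the following can help; the point, labels and colors are arbitrary choices for illustration:

```r
# A single point in the middle of an otherwise empty plot.
plot(0, 0, pch = 19, xlim = c(-1, 1), ylim = c(-1, 1))

# pos = 1, 2, 3, 4 place the label below, left of, above, right of the point.
pos_names <- c("below", "left", "above", "right")
for (p in 1:4) {
  text(0, 0, labels = pos_names[p], pos = p, col = "blue")
}

# A larger offset moves the label farther from the point; cex scales its size.
text(0, 0, labels = "far right", pos = 4, offset = 4, cex = 1.5, col = "red")
```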

Details

labels must be of type character or expression (or be coercible to such a type). In the latter case, quite a bit of mathematical notation is available such as sub- and superscripts, greek letters, fractions, etc.
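As an illustration of expression labels, the sketch below draws three plotmath expressions; the particular expressions and coordinates are made up:

```r
# Hypothetical labels chosen to show the notation mentioned above:
# a subscript with a superscript, Greek letters, and a fraction.
math_labels <- expression(x[i]^2, alpha + beta, frac(a, b))

plot(1:3, 1:3, type = "n", xlim = c(0, 4), ylim = c(0, 4))  # empty canvas
text(1:3, 1:3, labels = math_labels)
```

The help page reached with ?plotmath lists the supported notation.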

adj allows adjustment of the text with respect to (x, y). Values of 0, 0.5, and 1 specify left/bottom, middle and right/top alignment, respectively. The default is for centered text, i.e., adj = c(0.5, NA). Accurate vertical centering needs character metric information on individual characters which is only available on some devices. Vertical alignment is done slightly differently for character strings and for expressions: adj = c(0,0) means to left-justify and to align on the baseline for strings but on the bottom of the bounding box for expressions. This also affects vertical centering: for strings the centering excludes any descenders whereas for expressions it includes them. Using NA for strings centers them, including descenders.

The pos and offset arguments can be used in conjunction with values returned by identify to recreate an interactively labelled plot.

Text can be rotated by using graphical parameters srt (see par); this rotates about the centre set by adj.
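A short sketch of rotation with srt, using arbitrary angles and positions, could look like:

```r
# Rotation angles in degrees; the values are arbitrary examples.
angles <- c(0, 45, 90)

plot(1:3, 1:3, type = "n", xlim = c(0, 4), ylim = c(0, 4))  # empty canvas
for (i in seq_along(angles)) {
  # srt is passed through ... as a graphical parameter
  text(i, 2, labels = paste(angles[i], "degrees"), srt = angles[i])
}
```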

Graphical parameters col, cex and font can be vectors and will then be applied cyclically to the labels (and extra values will be ignored). NA values of font are replaced by par("font"), and similarly for col.

Labels whose x, y or labels value is NA are omitted from the plot.

What happens when font = 5 (the symbol font) is selected can be both device- and locale-dependent. Most often labels will be interpreted in the Adobe symbol encoding, so e.g. "d" is delta, and "\300" is aleph.

Euro symbol

The Euro symbol may not be available in older fonts. In current versions of Adobe symbol fonts it is character 160, so text(x, y, "\xA0", font = 5) may work. People using Western European locales on Unix-alikes can probably select ISO 8859-15 (Latin-9) which has the Euro as character 165: this can also be used for postscript and pdf. It is \u20ac in Unicode, which can be used in UTF-8 locales.

In all the European Windows encodings the Euro is symbol 128 and \u20ac will work in all locales: however not all fonts will include it. It is not in the symbol font used for windows and related devices, including the Windows printer.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Murrell, P. (2005) R Graphics. Chapman & Hall/CRC Press.