


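An Introduction to R: A comprehensive, systematic introduction to the R language, suitable as a first reference. It is a PDF document and one of the official R manuals.

Try R: Highly recommended. A very short course that you work through in the browser. The site provides an in-browser R environment, so you do not need to install R; you start from the very basics of the language and learn R through hands-on practice.

Computing for Data Analysis: A video course of about four weeks.

Introduction to R for Data Mining: Material on data mining with R, including slides and videos.

RStudio: An integrated development environment for R. Installing it is strongly recommended; RStudio will noticeably improve your productivity.

Getting started with R and Hadoop: Material on R and Hadoop projects.

ggplot2: An excellent plotting package for R. The site breaks down and explains every ggplot2 command and comes with a large number of examples.

Learning Time Series with R: Material on time series analysis with R.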




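Recommended search engine: Bing, Bing, Bing! Please, no more of that certain "-du" search engine! Of course, if you can find a way to use Google, so much the better.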

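Searching solves more than ninety percent of problems. If it does not, the likely reason is that your search skills are not yet good enough. For a beginner, the process of searching, trying solutions, and thinking things through is a major gain in itself, and so is the improvement in search skills that comes with it.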


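I actually learned this habit of searching and solving problems independently at Tongji University, in the research group of Professor Xiaole Liu (刘小乐), a leading figure in bioinformatics. Every year the group runs a month-long bioinformatics training program, and I was lucky enough to take part for a while. The trainees are given many bioinformatics exercises along with some teaching videos, and most of the training time is spent doing the assignments and working out solutions on your own. The senior students, all very strong themselves, stayed with us throughout but never told us the answers directly; they guided us to think and to solve the problems ourselves. At the time it was overwhelming, because I really knew nothing and had no idea where to start. A whole day could pass without my answering a single question.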


This is probably the point of the saying "it is better to teach someone to fish than to give them a fish."




You can use the mean() function to compute the mean of a vector like


However, this does not work if the vector contains NAs:

Please use the R documentation to find out how to compute the mean after excluding NAs (hint: ?mean).
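As a minimal sketch of the behaviour described above, here is what happens with two made-up vectors (x and y are illustrative, not taken from the assignment). The na.rm argument is documented in ?mean.

```r
# Made-up example vectors, for illustration only
x <- c(1, 2, 3, 4)
mean(x)                  # ordinary mean of a complete vector

y <- c(1, 2, NA, 4)
mean(y)                  # returns NA because one element is missing

# na.rm = TRUE drops missing values before averaging
mean(y, na.rm = TRUE)
```

The same na.rm argument is accepted by several other summary functions, such as sum() and sd().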

In this question, we will practice data manipulation using a dataset
collected by Francis Galton in 1886 on the heights of parents and their
children. This is a very famous dataset, and Galton used it to come up
with regression and correlation.

The data is available as GaltonFamilies in the HistData package.
Here, we load the data and show the first few rows. To find out more
information about the dataset, use ?GaltonFamilies.

a. Please report the height of the 10th child in the dataset.

b. What is the breakdown of male and female children in the dataset?

c. How many observations are in Galton's dataset? Please answer this
question without consulting the R help.

d. What is the mean height for the 1st child in each family?

e. Create a table showing the mean height for male and female children.

f. What was the average number of children each family had?

g. Convert the children's heights from inches to centimeters and store
it in a column called childHeight_cm in the GaltonFamilies dataset.
Show the first few rows of this dataset.
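The kinds of operations these questions call for can be sketched in base R on a tiny invented data frame. The object toy and all of its values are made up; its column names (family, childNum, gender, childHeight) are assumed to mirror those of GaltonFamilies, so check ?GaltonFamilies before reusing any of this on the real data.

```r
# Toy stand-in for GaltonFamilies: invented values, column names assumed
toy <- data.frame(
  family      = factor(c("001", "001", "001", "002", "002", "003")),
  childNum    = c(1, 2, 3, 1, 2, 1),
  gender      = factor(c("male", "female", "female", "male", "female", "male")),
  childHeight = c(73.2, 69.2, 69.0, 73.5, 65.5, 71.0)
)

toy$childHeight[3]                         # height in the 3rd row (the assignment asks for the 10th)
table(toy$gender)                          # number of children per gender
nrow(toy)                                  # number of observations
mean(toy$childHeight[toy$childNum == 1])   # mean height of first-born children
tapply(toy$childHeight, toy$gender, mean)  # mean height per gender
mean(table(toy$family))                    # average number of children per family

# 1 inch is 2.54 cm
toy$childHeight_cm <- toy$childHeight * 2.54
head(toy)
```

Packages such as dplyr offer equivalent grouped summaries; base R is used here only to keep the sketch free of extra dependencies.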

In the code above, we generate ngroups groups of N observations
each. In each group, we have X and Y, where X and Y are independent,
normally distributed variables with zero correlation.

a. Find the correlation between X and Y for each group, and display
the highest correlations.

Hint: since the data is quite large and your code might take a few
moments to run, you can test your code on a subset of the data first
(e.g. you can take the first 100 groups like so):

In general, this is good practice whenever you have a large dataset:
If you are writing new code and it takes a while to run on the whole
dataset, get it to work on a subset first. By running on a subset, you
can iterate faster.

However, please do run your final code on the whole dataset.
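The subset-first workflow can be sketched as follows. The simulation here is a stand-in for the assignment's own data: the values ngroups = 1000 and N = 10, the column name group, the object name sim, and the helper group_cor are all assumptions made for illustration.

```r
set.seed(1)        # arbitrary seed so the sketch is reproducible
ngroups <- 1000    # hypothetical; not the assignment's value
N <- 10            # hypothetical; not the assignment's value

# Stand-in data: independent standard normal X and Y in every group
sim <- data.frame(
  group = rep(seq_len(ngroups), each = N),
  X = rnorm(ngroups * N),
  Y = rnorm(ngroups * N)
)

# Correlation between X and Y within one group's rows
group_cor <- function(d) cor(d$X, d$Y)

# Step 1: develop and debug on the first 100 groups only
small <- sim[sim$group <= 100, ]
cors_small <- sapply(split(small, small$group), group_cor)

# Step 2: once the code works, run it on the whole dataset
cors <- sapply(split(sim, sim$group), group_cor)
head(sort(cors, decreasing = TRUE))   # highest correlations first
```

Because split() returns one data frame per group, the same group_cor helper runs unchanged on the subset and on the full data, which is what makes the quick iteration safe.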

b. The highest correlation is around 0.8. Can you explain why we see
such a high correlation when X and Y are supposed to be independent and
thus uncorrelated?
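One way to explore this question is to look at how the largest sample correlation behaves as the number of groups grows. The sketch below assumes a group size of N = 10 (hypothetical) and uses a made-up helper, max_cor; it simulates independent X and Y, so any correlation it reports is due to sampling noise alone.

```r
set.seed(2)   # arbitrary seed so the sketch is reproducible
N <- 10       # hypothetical group size; not the assignment's value

# Largest sample correlation among k independent groups of size N
max_cor <- function(k) {
  max(replicate(k, cor(rnorm(N), rnorm(N))))
}

# Compare the maximum for 1, 10, 100, and 1000 groups
sapply(c(1, 10, 100, 1000), max_cor)
```

With few observations per group, each sample correlation is a noisy estimate of the true value of zero, and picking the maximum over many groups selects the most extreme of those noisy estimates.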

Show a plot of the data for the group that had the highest correlation
you found in Problem 4.

We generate some sample data below. The data is numeric, and has 3
columns: X, Y, Z.

a. Compute the overall correlation between X and Y.

b. Make a plot showing the relationship between X and Y. Comment on
the correlation that you see.

c. Compute the correlations between X and Y for each level of Z.

d. Make a plot showing the relationship between X and Y, but this
time, color the points using the value of Z. Comment on the result,
especially any differences between this plot and the previous plot.
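Overall and per-level correlations can disagree sharply, and a small simulation makes that concrete. The data below is invented for illustration and is not the assignment's sample data: it assumes three levels of Z, with Y falling as X rises inside each level while higher levels shift both X and Y upward.

```r
set.seed(3)   # arbitrary seed so the sketch is reproducible

# Hypothetical data: Z has three levels with 50 points each
Z <- rep(1:3, each = 50)
X <- rnorm(150, mean = 3 * Z)                     # X is centered on 3 * Z
Y <- 3 * Z - (X - 3 * Z) + rnorm(150, sd = 0.5)   # Y falls as X rises within a level
dat <- data.frame(X, Y, Z)

# Overall correlation, ignoring Z
overall <- cor(dat$X, dat$Y)

# Correlation within each level of Z
by_level <- sapply(split(dat, dat$Z), function(d) cor(d$X, d$Y))

overall
by_level
```

In this construction the overall correlation is positive while every within-level correlation is negative. Passing col = dat$Z to plot(dat$X, dat$Y) colors the points by level, which makes the three downward-sloping clusters visible.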