Basic Data Analysis in R

2023-02-26 15:13:02


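This article covers basic statistical data analysis in R: basic plotting, linear regression, logistic regression, bootstrap sampling, and ANOVA (analysis of variance), each with a short implementation and application.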

Without further ado, here is the code, with comments inline.

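1. Basic plots (box plot, Q-Q plot)

#basic plot

x = rnorm(50) #sample data, assumed here so that the calls below run

y = rnorm(50)

boxplot(x) #box plot of x

qqplot(x,y) #Q-Q plot comparing the distributions of x and y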
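The numbers behind these two plots can also be inspected directly. The following is a minimal sketch, assuming simulated data (the seed, the sample size and the shifted mean of y are illustrative choices, not from the original):

```r
set.seed(1)                          # fixed seed so the sketch is reproducible
x <- rnorm(100)                      # assumed sample data
y <- rnorm(100, mean = 1)            # second sample, shifted by 1

# plot = FALSE returns the statistics without drawing anything
b <- boxplot(x, y, names = c("x", "y"), plot = FALSE)
five_num_x <- b$stats[, 1]           # lower whisker, Q1, median, Q3, upper whisker of x

# plot.it = FALSE returns the sorted quantile pairs behind the Q-Q plot
qq <- qqplot(x, y, plot.it = FALSE)

boxplot(x, y, names = c("x", "y"))   # draw the two box plots side by side
qqplot(x, y)                         # draw the Q-Q plot of y against x
abline(0, 1, lty = 2)                # reference line y = x; points off it indicate differing distributions
```

If the two samples came from the same distribution, the Q-Q points would fall near the dashed line; the shift in y moves them above it.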

2. Linear regression

#linear regression

n = 10

x1 = rnorm(n) #variable 1

x2 = rnorm(n) #variable 2

y = rnorm(n)*3 #response (pure noise here, independent of x1 and x2)

mod = lm(y~x1+x2)

model.matrix(mod) #design matrix of the model

plot(mod) #diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage (with Cook's distance)

summary(mod) #coefficients, standard errors, R-squared and other summary statistics

hatvalues(mod) #leverage of each observation; very useful for detecting abnormal samples
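Because y above is pure noise, the fitted coefficients carry no signal. As a hedged sketch of what the same calls give when y really depends on the predictors, the example below simulates data from an assumed model (the coefficients 1, 2 and -0.5, the noise level and the 2p/n leverage cutoff are illustrative assumptions, not from the original):

```r
set.seed(42)                                  # reproducible simulation
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)  # assumed true model

mod <- lm(y ~ x1 + x2)
est <- coef(mod)                              # estimates should be close to 1, 2, -0.5

h <- hatvalues(mod)                           # leverage of each observation
p <- length(est)                              # number of fitted coefficients
high_leverage <- which(h > 2 * p / n)         # common rule of thumb for flagging points

print(round(est, 2))
print(high_leverage)
```

The leverages always sum to the number of coefficients, so 2p/n is twice the average leverage; flagged points are candidates for inspection, not automatically outliers.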

3. Logistic regression

#logistic regression

x <- c(0, 1, 2, 3, 4, 5)

y <- c(0, 9, 21, 47, 60, 63) # the number of successes

n <- 70 #the number of trials

z <- n - y #the number of failures

b <- cbind(y, z) # column bind

fitx <- glm(b~x,family = binomial) # a particular type of generalized linear model

print(fitx)

plot(x,y,xlim=c(0,5),ylim=c(0,65)) #plot the points (x,y)

beta0 <- coef(fitx)[1] #intercept

beta1 <- coef(fitx)[2] #slope

fn <- function(x) n*exp(beta0+beta1*x)/(1+exp(beta0+beta1*x)) #expected number of successes at x

curve(fn,0,5,add=TRUE) # overlay the logistic regression curve on the existing axes
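The hand-written curve can be cross-checked against R's own prediction machinery. This sketch refits the same data and compares `predict(..., type = "response")` with the inverse-logit formula computed through `plogis()`; the variable names `p_hat`, `p_manual` and `expected_successes` are introduced here for illustration only:

```r
x <- c(0, 1, 2, 3, 4, 5)
y <- c(0, 9, 21, 47, 60, 63)                  # number of successes
n <- 70                                       # number of trials per x

fitx <- glm(cbind(y, n - y) ~ x, family = binomial)

# fitted success probabilities from the model object
p_hat <- predict(fitx, newdata = data.frame(x = x), type = "response")

# the same probabilities from the logistic formula, via the built-in inverse logit
p_manual <- plogis(coef(fitx)[1] + coef(fitx)[2] * x)

expected_successes <- n * p_hat               # what the curve fn(x) plots
print(round(cbind(x, observed = y, expected = expected_successes), 1))
```

Both routes give the same probabilities, so `predict()` is usually the safer choice once the model has more than one predictor.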

4. Bootstrap sampling

# bootstrap

# Application: resample the rows at random, compute the ratio of the largest eigenvalue to the sum of all eigenvalues, and plot its distribution

dat = matrix(rnorm(100*5),100,5)

no.samples = 200 #number of bootstrap resamples

# theta = matrix(rep(0,no.samples*5),no.samples,5)

theta = rep(0,no.samples) #one ratio per resample

for (i in 1:no.samples)

{

j = sample(1:100,100,replace = TRUE) #draw 100 row indices with replacement

datrnd = dat[j,] #the resampled data set (100 rows)

lambda = princomp(datrnd)$sdev^2 #eigenvalues of the covariance matrix

# theta[i,] = lambda

theta[i] = lambda[1]/sum(lambda) #ratio of the largest eigenvalue

}

# hist(theta[,1]) #plot the histogram of the first (largest) eigenvalue

hist(theta) #plot the distribution of the ratio of the largest eigenvalue

sd(theta) #bootstrap standard error of the ratio

#To plot the distribution of the largest eigenvalue itself instead, uncomment each commented-out statement above and comment out the statement that follows it
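The same bootstrap can be written more compactly and extended with a percentile confidence interval. This is a sketch under assumptions: the helper `top_share` is a hypothetical name, and it uses `eigen(cov(m))` instead of `princomp()`; the two differ only by a constant factor in the eigenvalues, which cancels in the ratio.

```r
set.seed(7)                                   # reproducible resampling
dat <- matrix(rnorm(100 * 5), 100, 5)

# hypothetical helper: share of total variance carried by the largest eigenvalue
top_share <- function(m) {
  ev <- eigen(cov(m), symmetric = TRUE, only.values = TRUE)$values
  ev[1] / sum(ev)
}

# 200 bootstrap resamples of the rows, with replacement
theta <- replicate(200, top_share(dat[sample(nrow(dat), replace = TRUE), ]))

se_boot <- sd(theta)                          # bootstrap standard error
ci <- quantile(theta, c(0.025, 0.975))        # 95% percentile interval

print(se_boot)
print(ci)
```

With five independent standard normal columns the true share is 0.2, but the estimate is always at least 0.2 because the largest eigenvalue is being picked; the bootstrap shows the spread of that estimate, not its bias.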

5. ANOVA (analysis of variance)

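#Application: test whether a factor has an effect (suppose we feed 3 kinds of vitamins to 3 pigs and want to know whether the vitamins make a difference)

y = rnorm(9) #weight gain per pig (Yij, where i is the treatment and j is the pig id); normally entered by the user

#y = c(1,10,1,2,10,2,1,9,1) #alternative hand-entered data, as a plain vector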

Treatment <- factor(c(1,2,3,1,2,3,1,2,3)) #each {1,2,3} is a group

mod = lm(y~Treatment) #linear regression

print(anova(mod))

#Explanation: Df (degrees of freedom)

#Sum Sq: sum of squares (between groups for Treatment, within groups for Residuals)

# Mean Sq: mean square, i.e. Sum Sq / Df

# compare the contribution given by Treatment and Residual

#F value: Mean Sq(Treatment)/Mean Sq(Residuals)

#Pr(>F): p-value. Use it to decide whether to reject the null hypothesis H0: all group population means are equal (significance level 0.05)

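qqnorm(residuals(mod)) #normal Q-Q plot of the model residuals

#If the Q-Q plot of the residuals is close to a straight line, the residuals are approximately normal, which is the assumption the F test relies on. Whether Treatment has an effect (whether feeding more or less vitamin makes a difference) is read from the F value and p-value above.

The two figures below show the results for

(left) y = c(1,10,1,2,10,2,1,9,1) and

(right) y = rnorm(9).

With the data in which the pigs fed vitamin 2 gain markedly more weight, the F value is large and the p-value falls far below 0.05, i.e., the vitamin affects the pigs' weight. The residuals in the Q-Q plot also no longer lie on a straight line, but with such coarse integer data that mainly reflects the few distinct residual values rather than the treatment effect itself.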
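To make the ANOVA table less of a black box, the F value can be recomputed by hand for the hand-entered data. This is a minimal sketch; the names `ss_between`, `ss_within`, `f_manual` and `p_manual` are introduced here for illustration:

```r
y <- c(1, 10, 1, 2, 10, 2, 1, 9, 1)           # hand-entered weight gains
Treatment <- factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3))

mod <- lm(y ~ Treatment)
tab <- anova(mod)

group_means <- tapply(y, Treatment, mean)     # mean weight gain per treatment
group_sizes <- table(Treatment)               # 3 observations per treatment

# between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - mean(y))^2)
ss_within  <- sum((y - group_means[Treatment])^2)

df_between <- nlevels(Treatment) - 1          # 2
df_within  <- length(y) - nlevels(Treatment)  # 6

f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

print(tab)
print(c(F = f_manual, p = p_manual))
```

The hand-computed F value matches the `F value` column of `anova(mod)`, and the p-value is far below 0.05, so H0 (equal group means) is rejected for this data set.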

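What you are describing is a hash table: a dictionary in Python, an unordered_map in C++.

There is a package for this, called hash. Usage:

First install the hash package and load it with library().

The two statements have the same effect: the keys are the 26 lowercase English letters and the corresponding values are 1 to 26.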
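The original statements using the hash package are not reproduced here. As a dependency-free sketch of the same letter-to-number mapping, base R offers two stand-ins: an environment (a real hash table) and a named vector. Note that this swaps in base R rather than the hash package's own API:

```r
# environment used as a hash table: keys are strings, values are arbitrary R objects
h <- new.env(hash = TRUE)
for (i in seq_along(letters)) {
  assign(letters[i], i, envir = h)            # key "a" -> 1, ..., key "z" -> 26
}

val_c <- get("c", envir = h)                  # look up one key
has_z <- exists("z", envir = h, inherits = FALSE)  # test for a key
all_keys <- sort(ls(h))                       # list all keys

# named vector: simpler when all values have the same type
v <- setNames(1:26, letters)
val_z <- v[["z"]]

print(c(val_c, val_z))
```

Environments give constant-time lookup and reference semantics; a named vector is easier to work with for small, fixed mappings like this one.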

You can label data points with the text() function. Call text() after the plotting statement: draw the plot first, then add the labels.
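A minimal sketch of that plot-then-label order, using made-up coordinates and label names (all values here are illustrative assumptions):

```r
x <- c(1, 2, 3, 4)                            # made-up data points
y <- c(2, 5, 3, 6)
labs <- paste0("P", seq_along(x))             # labels "P1" ... "P4"

plot(x, y, xlim = c(0, 5), ylim = c(0, 7))    # draw the plot first
text(x, y, labels = labs,                     # then add one label per point
     pos = 3,                                 # 3 = above the point
     offset = 0.5,                            # distance from the point, in character widths
     cex = 0.8, col = "blue")
```

Using `pos` keeps the labels from being drawn on top of the plotting symbols; without it, each label is centred on its point.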

The following is the R help page for the text() function.

Description

text draws the strings given in the vector labels at the coordinates given by x and y. y may be missing since xy.coords(x, y) is used for construction of the coordinates.

Usage

text(x, ...)

## Default S3 method:

text(x, y = NULL, labels = seq_along(x$x), adj = NULL, pos = NULL, offset = 0.5, vfont = NULL, cex = 1, col = NULL, font = NULL, ...)

Arguments

x, y

numeric vectors of coordinates where the text labels should be written. If the length of x and y differs, the shorter one is recycled.

labels

a character vector or expression specifying the text to be written. An attempt is made to coerce other language objects (names and calls) to expressions, and vectors and other classed objects to character vectors by as.character. If labels is longer than x and y, the coordinates are recycled to the length of labels.

adj

one or two values in [0, 1] which specify the x (and optionally y) adjustment of the labels. On most devices values outside that interval will also work.

pos

a position specifier for the text. If specified this overrides any adj value given. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the specified coordinates.

offset

when pos is specified, this value gives the offset of the label from the specified coordinate in fractions of a character width.

vfont

NULL for the current font family, or a character vector of length 2 for Hershey vector fonts. The first element of the vector selects a typeface and the second element selects a style. Ignored if labels is an expression.

cex

numeric character expansion factor; multiplied by par("cex") yields the final character size. NULL and NA are equivalent to 1.0.

col, font

the color and (if vfont = NULL) font to be used, possibly vectors. These default to the values of the global graphical parameters in par().

...

further graphical parameters (from par), such as srt, family and xpd.

Details

labels must be of type character or expression (or be coercible to such a type). In the latter case, quite a bit of mathematical notation is available such as sub- and superscripts, Greek letters, fractions, etc.

adj allows adjustment of the text with respect to (x, y). Values of 0, 0.5, and 1 specify left/bottom, middle and right/top alignment, respectively. The default is for centered text, i.e., adj = c(0.5, NA). Accurate vertical centering needs character metric information on individual characters which is only available on some devices. Vertical alignment is done slightly differently for character strings and for expressions: adj = c(0,0) means to left-justify and to align on the baseline for strings but on the bottom of the bounding box for expressions. This also affects vertical centering: for strings the centering excludes any descenders whereas for expressions it includes them. Using NA for strings centers them, including descenders.

The pos and offset arguments can be used in conjunction with values returned by identify to recreate an interactively labelled plot.

Text can be rotated by using graphical parameters srt (see par); this rotates about the centre set by adj.

Graphical parameters col, cex and font can be vectors and will then be applied cyclically to the labels (and extra values will be ignored). NA values of font are replaced by par("font"), and similarly for col.

Labels whose x, y or labels value is NA are omitted from the plot.

What happens when font = 5 (the symbol font) is selected can be both device- and locale-dependent. Most often labels will be interpreted in the Adobe symbol encoding, so e.g. "d" is delta, and "\300" is aleph.

Euro symbol

The Euro symbol may not be available in older fonts. In current versions of Adobe symbol fonts it is character 160, so text(x, y, "\xA0", font = 5) may work. People using Western European locales on Unix-alikes can probably select ISO 8859-15 (Latin-9) which has the Euro as character 165: this can also be used for postscript and pdf. It is \u20ac in Unicode, which can be used in UTF-8 locales.

In all the European Windows encodings the Euro is symbol 128 and \u20ac will work in all locales: however not all fonts will include it. It is not in the symbol font used for windows and related devices, including the Windows printer.
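Several of the behaviours described under Details can be tried in a few lines. The sketch below is illustrative only: the coordinates, the label expressions and the colours are assumptions chosen for the demonstration.

```r
# empty canvas to draw labels on
plot(1:3, 1:3, type = "n", xlim = c(0, 4), ylim = c(0, 4))

# expression labels give access to mathematical notation (plotmath)
lab_expr <- expression(alpha[1]^2, sqrt(x), frac(a, b))

# cex and col are shorter than labels, so they are applied cyclically
text(1:3, 1:3, labels = lab_expr, cex = c(1, 1.5), col = c("red", "blue"))

# srt rotates the string about the anchor point set by adj (here left/baseline)
text(2, 3.5, "rotated", srt = 45, adj = c(0, 0))

# a label whose value is NA is omitted: only "kept" is drawn
text(c(1, 2), c(0.5, 0.5), labels = c("kept", NA))
```

Passing `expression()` rather than a character string is what switches text() into mathematical typesetting; a character string such as "alpha^2" would be drawn literally.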
