《R语言实战》 (R in Action) Self-Study Notes 26: Probability Functions



In R, probability functions take the form:

[dpqr]distribution_abbreviation

where the first letter denotes the aspect of the distribution being computed:

d = density function

p = distribution function (cumulative probability)

q = quantile function

r = random generation (random deviates)

Take the normal distribution as an example.

1 What Is the Normal Distribution

The normal distribution, also known as the Gaussian distribution, is one of the most common continuous probability distributions in statistics. The normal curve is bell-shaped: low at both ends, high in the middle, and symmetric about the mean. Because of this bell shape, it is often called the bell curve.

2 The Two Parameters of the Normal Distribution and Its Shape

The normal distribution has two parameters: the mean and the standard deviation.

1) The probability density curve reaches its maximum at the mean and is symmetric about it.

2) Once the mean and standard deviation are fixed, the normal curve is fully determined.

3) As X extends indefinitely to the left and right along the horizontal axis, the two tails of the curve approach the axis asymptotically but, in theory, never touch it.

4) The probability that a normal random variable falls in a given interval is the area under the normal curve over that interval, and the total area under the curve equals 1.

5) The mean can take any value on the real line and determines the location of the curve; the standard deviation determines how steep or flat the curve is: the larger the standard deviation, the flatter the curve; the smaller, the steeper. A small standard deviation means most values lie a short distance from the mean, so they cluster tightly around it and the curve looks tall and narrow. Conversely, a large standard deviation means the data span a wide range and are more dispersed, so the curve covers more values and looks short and wide, as the sketch below illustrates.
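As a quick illustration, here is a minimal base-R sketch (the means, standard deviations, and plotting range are my own illustrative choices) of how the standard deviation changes the shape of the curve:

# Normal densities with the same mean but different standard deviations
curve(dnorm(x, mean = 0, sd = 1), from = -6, to = 6, ylab = "Density")
curve(dnorm(x, mean = 0, sd = 0.5), add = TRUE, lty = 2)  # smaller sd: taller, narrower
curve(dnorm(x, mean = 0, sd = 2), add = TRUE, lty = 3)    # larger sd: shorter, wider
legend("topright", legend = c("sd = 1", "sd = 0.5", "sd = 2"), lty = 1:3)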

3 The Standard Normal Distribution

If no mean and standard deviation are specified, the functions assume the standard normal distribution (mean 0, standard deviation 1).

4 Probability Functions for the Normal Distribution

The probability density function is dnorm(), the cumulative distribution function pnorm(), the quantile function qnorm(), and the random-number generation function rnorm().

dnorm(x, mean = 0, sd = 1, log = FALSE)

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

rnorm(n, mean = 0, sd = 1)

x, q - a vector of quantiles.

p - a vector of probabilities.

n - the number of observations (sample size).

mean - the mean of the distribution; defaults to 0.

sd - the standard deviation; defaults to 1.
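A short sketch of the four functions in use (the commented values are approximate):

dnorm(0)                      # density of the standard normal at 0, about 0.3989
pnorm(1.96)                   # P(Z <= 1.96), about 0.975
qnorm(0.975)                  # 97.5th percentile of the standard normal, about 1.96
set.seed(123)                 # fix the seed so the draws are reproducible
rnorm(5, mean = 50, sd = 10)  # five random draws from N(50, 10^2)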

pretty() creates pretty breakpoints: it chooses about n + 1 equally spaced, rounded values that divide the range of a continuous variable x into n intervals.

pretty(x, n)

x: the numeric vector to be divided.

n: the desired number of intervals; the result is a vector of roughly n + 1 breakpoints.

Returns: a vector of equally spaced breakpoints.
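For example:

x <- c(3.7, 18.2, 11.9)
pretty(x, 4)   # rounded, equally spaced breakpoints covering the range of x
               # likely 0 5 10 15 20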

Setting the random seed

set.seed()

This function sets the seed of the random number generator. Fixing the seed makes results reproducible: the same "random" numbers are generated each time the code is run, so results stay unchanged across runs and debugging sessions.

runif(n, min = 0, max = 1)

This function generates uniformly distributed random deviates. n is the number of observations; min and max are the lower and upper bounds of the distribution.
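A small sketch showing how set.seed() makes runif() reproducible:

set.seed(1234)
runif(5)        # five uniform random numbers on [0, 1]
set.seed(1234)
runif(5)        # resetting the seed reproduces exactly the same five numbers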

Other probability distributions follow the same [dpqr] naming scheme; common abbreviations include beta, binom (binomial), chisq (chi-squared), exp (exponential), f, gamma, pois (Poisson), t, and unif (uniform).
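For instance, the same four prefixes work for the binomial distribution:

dbinom(3, size = 10, prob = 0.5)    # P(X = 3) for X ~ Binomial(10, 0.5)
pbinom(3, size = 10, prob = 0.5)    # P(X <= 3)
qbinom(0.5, size = 10, prob = 0.5)  # smallest x with P(X <= x) >= 0.5
rbinom(10, size = 10, prob = 0.5)   # ten random draws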

Reference material:

anova() is a function in the stats package; be careful not to confuse it with Anova() in the car package, which computes Type II and Type III analysis-of-variance tables.

| anova {stats} | R Documentation |

Compute analysis of variance (or deviance) tables for one or more fitted model objects. Note that "deviance" here is a goodness-of-fit measure used for generalized linear models, a generalization of the residual sum of squares; it should not be confused with the "deviation" of an individual measurement from the mean, nor with bias in sampling or estimation (systematically over- or underestimating the quantity of interest).

My own understanding is that it computes the model's deviance, or the explained variance: the larger the variance explained by a factor, the greater that factor's influence on the dependent variable.

For a linear regression, anova() produces an ANOVA table based on F tests; if a predictor has only two levels, the p-value should agree with that from a t test.

For a logistic regression, anova() produces an analysis-of-deviance table based on chi-squared tests, as the sketch below shows.
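A minimal sketch of both uses on the built-in mtcars data (my choice of variables is purely illustrative):

# linear regression: F-test ANOVA table; am has two levels, so the p-value
# agrees with a t test
fit_lm <- lm(mpg ~ factor(am), data = mtcars)
anova(fit_lm)

# logistic regression: analysis-of-deviance table with chi-squared tests
fit_glm <- glm(am ~ mpg, family = binomial, data = mtcars)
anova(fit_glm, test = "Chisq")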

anova(object, ...)

object: an object containing the results returned by a model fitting function (e.g., lm or glm).

...: additional objects of the same type.

This (generic) function returns an object of class anova. These objects represent analysis-of-variance and analysis-of-deviance tables.

When given a single argument it produces a table which tests whether the model terms are significant.

When given a sequence of objects, anova tests the models against one another in the order specified.

The print method for anova objects prints tables in a 'pretty' form.

The comparison between two or more models will only be valid if they are fitted to the same dataset. This may be a problem if there are missing values and R's default of na.action = na.omit is used.
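For example, a sketch comparing two nested models fitted to the same data set:

fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
anova(fit1, fit2)   # F test of whether adding hp significantly improves the fit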

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.

| summary {base} | R Documentation |

summary is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.
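For instance:

fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)   # dispatches to summary.lm: coefficients, R-squared, F statistic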

The Beauty of Data Analysis: Decision Trees in R

Implementing a decision tree in R

1. Prepare the data


>install.packages("tree")

>library(tree)

>library(ISLR)

>attach(Carseats)

>High=ifelse(Sales<=8,"No","Yes")   # label each store's sales as high ("Yes") or not, for classification

>Carseats=data.frame(Carseats,High)   # add the High column to the data set

>fix(Carseats)

2. Grow the decision tree


>tree.carseats=tree(High~.-Sales,Carseats)

>summary(tree.carseats)


# output: the training error is 9%

Classification tree:

tree(formula = High ~ . - Sales, data = Carseats)

Variables actually used in tree construction:

[1] "ShelveLoc" "Price" "Income" "CompPrice" "Population"

[6] "Advertising" "Age" "US"

Number of terminal nodes: 27

Residual mean deviance: 0.4575 = 170.7 / 373

Misclassification error rate: 0.09 = 36 / 400

3. Plot the decision tree


>plot(tree.carseats)

>text(tree.carseats, pretty=0)

4. Test error


# prepare training data and test data
# We begin by using the sample() function to split the set of observations
# into two halves, by selecting a random subset of 200 observations out of
# the original 400 observations.

>set.seed(1)

>train=sample(1:nrow(Carseats),200)

>Carseats.test=Carseats[-train,]

>High.test=High[-train]

# fit the tree model on the training data

>tree.carseats=tree(High~.-Sales, Carseats, subset=train)

# compute the test error with the tree model, the test data, and predict()
# predict is a generic function for predictions from the results of various
# model fitting functions.

>tree.pred=predict(tree.carseats, Carseats.test, type="class")

>table(tree.pred, High.test)

         High.test
tree.pred  No Yes
      No   86  27
      Yes  30  57

>(86+57)/200

[1] 0.715

5. Prune the decision tree


# Next, we consider whether pruning the tree might lead to improved results.
# The function cv.tree() performs cross-validation in order to determine the
# optimal level of tree complexity; cost-complexity pruning is used in order
# to select a sequence of trees for consideration.
#
# For regression trees, only the default, deviance, is accepted. For
# classification trees, the default is deviance and the alternative is
# misclass (number of misclassifications or total loss).
#
# We use the argument FUN=prune.misclass in order to indicate that we want the
# classification error rate to guide the cross-validation and pruning process,
# rather than the default for the cv.tree() function, which is deviance.
#
# If the tree were a regression tree, the analogous plot would be:
# plot(cv.boston$size, cv.boston$dev, type="b")

>set.seed(3)

>cv.carseats=cv.tree(tree.carseats, FUN=prune.misclass, K=10)

# The cv.tree() function reports the number of terminal nodes of each tree
# considered (size), the corresponding error rate (dev), and the value of the
# cost-complexity parameter used (k, which corresponds to α).

>names(cv.carseats)

[1] " size" "dev " "k" " method "

>cv.carseats

$size   # the number of terminal nodes of each tree considered

[1] 19 17 14 13 9 7 3 2 1

$dev    # the corresponding error rate

[1] 55 55 53 52 50 56 69 65 80

$k      # the value of the cost-complexity parameter used

[1]      -Inf  0.0000000  0.6666667  1.0000000  1.7500000  2.0000000  4.2500000
[8]  5.0000000 23.0000000

$method # misclass for a classification tree

[1] " misclass "

attr (," class ")

[1] " prune " "tree. sequence "


# plot the error rate against tree size to see which node size is best

>plot(cv.carseats$size, cv.carseats$dev, type="b")

# Note that, despite the name, dev corresponds to the cross-validation error
# rate in this instance. The tree with 9 terminal nodes results in the lowest
# cross-validation error rate, with 50 cross-validation errors. We plot the
# error rate as a function of both size and k.

>prune.carseats=prune.misclass(tree.carseats, best=9)

>plot(prune.carseats)

>text(prune.carseats, pretty=0)

# compute the test error again to see how this pruned tree performs on the test data set

>tree.pred=predict(prune.carseats, Carseats.test, type="class")

>table(tree.pred, High.test)

         High.test
tree.pred  No Yes
      No   94  24
      Yes  22  60

>(94+60)/200

[1] 0.77