β

PostgreSQL 9.5 new feature - table | mview data sa

PostgreSQL research 131 阅读
PostgreSQL 9.5支持的新特性,数据采样。有了这个功能,如果能结合pg_dump使用的话,可以为比较大的生产数据库创建较小的测试环境。当然,如果是这样的话采样还需要考虑数据关联的问题。

用户可以自定义采样方法,或者使用系统自带的两种采样方法(基于所有数据块的选择性采样,以及基于全表的选择性采样)
基于数据块的采样扫描的数据块比较少,只需要扫描筛选出的数据块并返回最终的采样数据(粒度较大)。
基于全表的采样,需要扫描全表,并筛选最终的采样数据(粒度细,但是开销大)。
语法:
[ TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ]
repeatable指定采样seed.
sampling_method 参数是采样函数的参数,目前指采样比例。
例子:
postgres=# create table test(id int);CREATE TABLEpostgres=# insert into test select generate_series(1,100000000);INSERT 0 100000000Time: 128199.864 ms
采样速度很快,2毫秒就完成了。
postgres=# select count(*) from test tablesample system (0.000001); count -------   909(1 row)Time: 1.930 ms
正常的话应该返回100条采样数据,但是返回了909,因为我们使用的是数据块采样方法SYSTEM,返回了1个数据块的所有记录。
postgres=# select 100000000*0.000001;  ?column?  ------------ 100.000000(1 row)Time: 0.631 ms
postgres=# select count(*) from (select ctid,* from test limit 100000) t where ctid::text ~ '\(1,'; count -------   909(1 row)Time: 196.232 ms
这种采样方法离散度比较低,因为集中返回某些数据块的记录。

接下来使用记录采样方法BERNOULLI
postgres=# select relpages,reltuples from pg_class where relname='test'; relpages |  reltuples  ----------+-------------   110012 | 2.72691e+07(1 row)Time: 0.786 mspostgres=# select count(*) from test;   count   ----------- 100000000(1 row)Time: 12041.390 ms
扫描全表的采样方法明显速度下降了
postgres=# select count(*) from test tablesample BERNOULLI (0.000001) ; count -------     1(1 row)Time: 6245.059 mspostgres=# select count(*) from test tablesample BERNOULLI (0.00001) ; count -------    13(1 row)Time: 6305.808 ms
同样,采样结果也不对,这个应该是BUG。
postgres=# select 2.72691e+07*0.00001; ?column?  ----------- 272.69100(1 row)postgres=# vacuum analyze test;VACUUMTime: 9880.728 mspostgres=# select relpages,reltuples from pg_class where relname='test'; relpages | reltuples ----------+-----------   110012 |     1e+08(1 row)Time: 0.438 mspostgres=# select count(*) from test tablesample BERNOULLI (0.00001) ; count -------     6(1 row)Time: 6243.548 mspostgres=# select count(*) from test tablesample BERNOULLI (0.00001) ; count -------     9(1 row)Time: 6275.948 mspostgres=# select count(*) from test tablesample BERNOULLI (0.00001) ; count -------     9(1 row)Time: 6243.882 ms
这种方法的采样数据离散度很好,但是采样比较耗时。


PostgreSQL 另外还提供了两个扩展的采样方法模块:
1. http://www.postgresql.org/docs/devel/static/tsm-system-rows.html
这个采样方法和system一样,是基于数据块的,但是精确的返回指定的行数,相当于加了过滤器。
离散度较低。
postgres=# CREATE EXTENSION tsm_system_rows;CREATE EXTENSIONpostgres=# SELECT count(*) FROM test TABLESAMPLE SYSTEM_ROWS(100); count -------   100(1 row)Time: 0.568 mspostgres=# SELECT distinct substring(ctid::text,'(.*),') FROM test TABLESAMPLE SYSTEM_ROWS(100); substring ----------- (75816(1 row)Time: 1.265 ms

2. http://www.postgresql.org/docs/devel/static/tsm-system-time.html
这个采样方法是基于时间的采样方法,从任意一个数据块开始扫描,当扫描时间达到指定的扫描时间时停止扫描.
所以返回的记录是某些数据块的连续数据,离散度较低。
postgres=# create extension tsm_system_time;CREATE EXTENSION
postgres=# SELECT min(ctid),max(ctid),count(*),min(id),max(id),max(id)-min(id) FROM test TABLESAMPLE SYSTEM_TIME(1);   min    |    max     | count |   min   |   max   | ?column? ----------+------------+-------+---------+---------+---------- (8093,1) | (8095,909) |  2727 | 6356538 | 6359264 |     2726(1 row)Time: 2.080 mspostgres=# SELECT min(ctid),max(ctid),count(*),min(id),max(id),max(id)-min(id) FROM test TABLESAMPLE SYSTEM_TIME(1);    min    |     max     | count |   min    |   max    | ?column? -----------+-------------+-------+----------+----------+---------- (35058,1) | (35083,909) | 23634 | 31867723 | 31891356 |    23633(1 row)Time: 12.863 mspostgres=# explain SELECT min(ctid),max(ctid),count(*),min(id),max(id),max(id)-min(id) FROM test TABLESAMPLE SYSTEM_TIME(1);                                     QUERY PLAN                                      ------------------------------------------------------------------------------------- Aggregate  (cost=29697.06..29697.07 rows=1 width=10)   ->  Sample Scan (system_time) on test  (cost=0.00..12198.93 rows=999893 width=10)(2 rows)Time: 0.671 ms

用户还可以自定义采样方法函数,参考:
http://www.postgresql.org/docs/devel/static/tablesample-method.html

[参考]
1. http://www.postgresql.org/docs/devel/static/tsm-system-rows.html
2. http://www.postgresql.org/docs/devel/static/tsm-system-time.html
3. http://www.postgresql.org/docs/devel/static/tablesample-method.html
4. http://www.postgresql.org/docs/devel/static/sql-select.html
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]

Table sample clause after table_name indicates that a sampling_method should be used to retrieve subset of rows in the table. The sampling_method can be any sampling method installed in the database. There are currently two sampling methods available in the standard PostgreSQL distribution:

  • SYSTEM

  • BERNOULLI

Both of these sampling methods currently accept only single argument which is the percent (floating point from 0 to 100) of the rows to be returned. The SYSTEM sampling method does block level sampling with each block having the same chance of being selected and returns all rows from each selected block. The BERNOULLI scans whole table and returns individual rows with equal probability. Additional sampling methods may be installed in the database via extensions.

The optional parameter REPEATABLE uses the seed parameter, which can be a number or expression producing a number, as a random seed for sampling. Note that subsequent commands may return different results even if same REPEATABLE clause was specified. This happens because DML statements and maintenance operations such as VACUUM may affect physical distribution of data. The setseed() function will not affect the sampling result when the REPEATABLE parameter is used.

作者:PostgreSQL research
Free PostgreSQL Support. 公益是一辈子的事, I'm 德哥@Digoal, Just Do it!
原文地址:PostgreSQL 9.5 new feature - table | mview data sa, 感谢原作者分享。