
Hive's lateral view, explode, reflect, row-to-column and column-to-row conversion, window functions and analytic functions: usage and differences explained in detail


Contents

I. Using the explode function to split Map and Array fields in a Hive table

II. Using the explode function to split JSON string data

III. Using explode together with LATERAL VIEW

IV. Column-to-row conversion

V. Row-to-column conversion

VI. The reflect function

VII. Window functions and analytic functions

1. Create a Hive table and load data

2. Window functions

3. The OVER clause

4. Analytic functions

5. Enhanced aggregation: CUBE, GROUPING SETS and ROLLUP

6. Using CUBE and ROLLUP to aggregate over all combinations of the GROUP BY dimensions


I. Using the explode function to split Map and Array fields in a Hive table

lateral view is used together with UDTFs such as split and explode. It can split one row of data into multiple rows, and the split data can then be aggregated. lateral view first calls the UDTF for each row of the base table; the UDTF splits that row into one or more rows; lateral view then combines the results, producing a virtual table with the supplied table alias.

explode on its own can also be used to split a complex array or map column in Hive into multiple rows.

Requirement: we have data in the following format:

zhangsan      child1,child2,child3,child4     k1:v1,k2:v2
lisi  child5,child6,child7,child8     k3:v3,k4:v4

Fields are separated by \t. The requirement is to split all the child values out into a single column:

+----------+--+
| mychild   |
+----------+--+
| child1    |
| child2    |
| child3    |
| child4    |
| child5    |
| child6    |
| child7    |
| child8    |
+----------+--+

We also want to split the map keys and values apart, producing the following result:

+-----------+-------------+
| mymapkey  | mymapvalue  |
+-----------+-------------+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+

Step 1: Create the Hive database

Create the Hive database:

hive (default)> create database hive_explode;
hive (default)> use hive_explode;

Step 2: Create the Hive table, then use explode to split the map and array

hive (hive_explode)> create table t3(name string,children array<string>,address Map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' stored as textFile;

Step 3: Load the data

Run the following commands on node03 to create the data file:

mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray
zhangsan   child1,child2,child3,child4 k1:v1,k2:v2
lisi  child5,child6,child7,child8 k3:v3,k4:v4
 

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;

Step 4: Use explode to split the data in Hive

Split the array data:

hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;

Split the map data:

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
0: jdbc:hive2://node03:10000> select explode(children) as mychild from t3;
+----------+--+
| mychild  |
+----------+--+
| child1   |
| child2   |
| child3   |
| child4   |
| child5   |
| child6   |
| child7   |
| child8   |
+----------+--+
8 rows selected (0.128 seconds)
0: jdbc:hive2://node03:10000> select name,explode(children) as mychild from t3;
Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
0: jdbc:hive2://node03:10000> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+

II. Using the explode function to split JSON string data

Requirement: we have data in the following format:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

The separator between fields is "|".

We want to parse out all the monthSales values into the following column (row-to-column conversion):

4900
2090
6987

Step 1: Create the Hive table

hive (hive_explode)> create table explode_lateral_view(`area` string,`goods_id` string,`sale_info` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS textfile;

Step 2: Prepare and load the data

Prepare the data as follows:

cd /export/servers/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;

Step 3: Use explode to split the Array

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;

Step 4: Use explode to split the Map

hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;

Step 5: Split the JSON field

hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as  sale_info from explode_lateral_view;

Next, we want to use get_json_object to get the value of the monthSales key:

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as  sale_info from explode_lateral_view;

This raises the exception FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

In other words, the UDTF explode cannot be nested inside another function.

If you try to select two fields like this: select explode(split(area,',')) as area,good_id from explode_lateral_view;

you get the error FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'

When using a UDTF, only a single expression is supported in the SELECT clause. This is where LATERAL VIEW comes in.

III. Using explode together with LATERAL VIEW

Query multiple fields with lateral view:

hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2;

Here, LATERAL VIEW explode(split(goods_id,',')) goods acts as a virtual table that is joined with the original table explode_lateral_view as a Cartesian product.

Multiple lateral views can also be chained:

hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2 LATERAL VIEW explode(split(area,',')) area as area2;

The result is the Cartesian product of the three tables.

Finally, with the following statement we can fully unfold this one row of JSON data into a two-dimensional table:

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,
get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales,
get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,
get_json_object(concat('{',sale_info_1,'}'),'$.score') as score
from explode_lateral_view
LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) sale_info as sale_info_1;

Summary:

Lateral View usually appears together with a UDTF, to work around the restriction that a UDTF cannot be selected alongside other expressions.
Multiple Lateral Views can produce something like a Cartesian product.
The OUTER keyword turns the empty output of a UDTF into NULL rows, so that the original rows are not lost.
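
As an illustration of OUTER, a minimal sketch against the explode_lateral_view table above (the difference only shows up for rows where the exploded expression yields no elements, e.g. a NULL goods_id, which this sample data does not actually contain):

-- without OUTER, a row whose UDTF call produces no output disappears from the result;
-- with OUTER the row is kept and goods_id2 is NULL
select goods_id, goods_id2
from explode_lateral_view
LATERAL VIEW OUTER explode(split(goods_id, ',')) goods as goods_id2;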

IV. Column-to-row conversion

1. Function descriptions

CONCAT(string A/col, string B/col…): returns the result of concatenating the input strings; any number of input strings is supported.

CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first parameter is the separator placed between the remaining parameters. The separator can be a string just like the remaining parameters. If the separator is NULL, the result is NULL. The function skips any NULL values that come after the separator argument; the separator is inserted between the strings being concatenated.

COLLECT_SET(col): accepts only primitive data types; it deduplicates the values of a column and aggregates them into an array-typed field.
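
A quick illustration of what these return (a minimal sketch: the literal values are made up, SELECT without FROM assumes a reasonably recent Hive, and the last query assumes the person_info table created below):

-- concat simply joins its arguments
select concat('白羊座', ',', 'A');              -- 白羊座,A
-- concat_ws joins the remaining arguments with the first argument as separator
select concat_ws('|', '孙悟空', '猪八戒');       -- 孙悟空|猪八戒
-- collect_set deduplicates a group's values and returns them as an array
select constellation, collect_set(blood_type) from person_info group by constellation;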

2. Data preparation

Sample data:

name     constellation  blood_type
孙悟空   白羊座         A
老王     射手座         A
宋宋     白羊座         B
猪八戒   白羊座         A
凤姐     射手座         A

3. Requirement

Group people with the same constellation and blood type together. Expected result:

射手座,A            老王|凤姐
白羊座,A            孙悟空|猪八戒
白羊座,B            宋宋

4. Create the local file constellation.txt and import the data

Run the following commands on node03 to create the file; note that fields are separated by \t:

cd /export/servers/hivedatas
vim constellation.txt
孙悟空  白羊座  A
老王     射手座  A
宋宋     白羊座  B
猪八戒  白羊座  A
凤姐     射手座  A

5. Create the Hive table and import the data

Create the Hive table:

hive (hive_explode)> create table person_info(name string,constellation string,blood_type string) row format delimited fields terminated by "\t";

Load the data:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;

6. Query according to the requirement

hive (hive_explode)> select t1.base,concat_ws('|', collect_set(t1.name)) name
from
(select name,concat(constellation, "," , blood_type) base from person_info) t1
group by t1.base;

V. Row-to-column conversion

1. Function descriptions

EXPLODE(col): splits a complex array or map column in Hive into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used together with UDTFs such as split and explode; it splits one column of data into multiple rows, and the split data can then be aggregated.

2. Data preparation

cd /export/servers/hivedatas
vim movie.txt

Fields in the data are separated by \t:

《疑犯追踪》   悬疑,动作,科幻,剧情
《Lie to me》  悬疑,警匪,动作,心理,剧情
《战狼2》 战争,动作,灾难

3. Requirement

Expand the array of movie categories. Expected result:

《疑犯追踪》   悬疑
《疑犯追踪》   动作
《疑犯追踪》   科幻
《疑犯追踪》   剧情
《Lie to me》  悬疑
《Lie to me》  警匪
《Lie to me》  动作
《Lie to me》  心理
《Lie to me》  剧情
《战狼2》 战争
《战狼2》 动作
《战狼2》 灾难

4. Create the Hive table and import the data

Create the Hive table:

create table movie_info(movie string,category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

Load the data:

load data local inpath "/export/servers/hivedatas/movie.txt" into table movie_info;

5. Query according to the requirement

select movie,category_name
from movie_info lateral view explode(category) table_tmp as category_name;

VI. The reflect function

The reflect function lets SQL call built-in Java methods directly, which often saves you from writing a custom UDF.

Use java.lang.Math.max to get the larger value of two columns

Create the Hive table:

Create the table:

create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';

Prepare the data:

cd /export/servers/hivedatas
vim test_udf
1,2
4,3
6,4
7,5
5,6

Load the data:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;

Use java.lang.Math.max to compute the maximum of the two columns:

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Execute a different built-in Java method for each record

Create the Hive table:

hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

Prepare the data:

 
cd /export/servers/hivedatas
vim test_udf2
java.lang.Math,min,1,2
java.lang.Math,max,2,3

Load the data:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;

Run the query:

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;

Check whether a string is a number

Use a function from Apache Commons; the commons jars are already on Hadoop's classpath, so they can be called directly.

Usage:

select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");

VII. Window functions and analytic functions

Hive also ships with many window functions and analytic functions, which are mainly used in scenarios such as:

(1) ranking within partitions
(2) dynamic GROUP BY
(3) Top N
(4) cumulative calculations
(5) hierarchical queries

1. Create a Hive table and load data

Create the table:

hive (hive_explode)> create table order_detail(user_id string,device_id string,user_type string,price double,sales int) row format delimited fields terminated by ',';

Prepare the data file:

cd /export/servers/hivedatas
vim order_detail
zhangsan,1,new,67.1,2
lisi,2,old,43.32,1
wagner,3,new,88.88,3
liliu,4,new,66.0,1
qiuba,5,new,54.32,1
wangshi,6,old,77.77,2
liwei,7,old,88.44,3
wutong,8,new,56.55,6
lilisi,9,new,88.88,5
qishili,10,new,66.66,5

Load the data into the table:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/order_detail' into table order_detail;

2. Window functions

FIRST_VALUE: after sorting within the partition, the first value from the start of the partition up to the current row

LAST_VALUE: after sorting within the partition, the last value from the start of the partition up to the current row

LEAD(col,n,DEFAULT): the value n rows after the current row within the window. The first parameter is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when the value n rows ahead is NULL; if not specified, NULL is returned).

LAG(col,n,DEFAULT): the opposite of LEAD; the value n rows before the current row within the window. The first parameter is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when the value n rows back is NULL; if not specified, NULL is returned).
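
LEAD and LAG are not demonstrated in the examples below, so here is a minimal sketch against the order_detail table created above (output omitted):

select user_id, user_type, sales,
-- sales of the previous row within the same user_type, 0 when there is no previous row
lag(sales, 1, 0) over(partition by user_type order by sales) as prev_sales,
-- sales of the next row within the same user_type, NULL when there is no next row
lead(sales, 1) over(partition by user_type order by sales) as next_sales
from order_detail;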

3. The OVER clause

1. OVER can be used with the standard aggregate functions COUNT, SUM, MIN, MAX and AVG.

2. It can be used with a PARTITION BY clause on one or more columns of primitive data types.

3. It can be used with PARTITION BY and ORDER BY clauses on one or more partitioning and ordering columns.

4. It can be used with a window specification, which supports the following formats:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

When ORDER BY is specified but the window clause is missing, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

The OVER clause also supports the following functions, but they cannot be combined with a window (frame) specification:

Ranking functions: RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK.

The LEAD and LAG functions.

Requirement: use window functions to compute sales statistics

Use sum ... over to compute the sales totals:

select 
user_id,
user_type,
sales,
-- all rows in the partition
sum(sales) over(partition by user_type) AS sales_1,
sum(sales) over(order by user_type) AS sales_2,
-- default frame: from the start of the partition to the current row; rows with equal sales get the same cumulative result
sum(sales) over(partition by user_type order by sales asc) AS sales_3,
-- from the start of the partition to the current row; differs from sales_3 because ties are accumulated row by row, so the result depends on row order
sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) AS sales_4,
-- current row + the 3 preceding rows
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) AS sales_5,
-- current row + 3 preceding rows + 1 following row
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) AS sales_6,
-- current row + all following rows
sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) AS sales_7
from order_detail 
order by user_type,sales,user_id;


The result is:

+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
|  user_id  | user_type  | sales  | sales_1  | sales_2  | sales_3  | sales_4  | sales_5  | sales_6  | sales_7  |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
| liliu     | new        | 1      | 23       | 23       | 2        | 2        | 2        | 4        | 22       |
| qiuba     | new        | 1      | 23       | 23       | 2        | 1        | 1        | 2        | 23       |
| zhangsan  | new        | 2      | 23       | 23       | 4        | 4        | 4        | 7        | 21       |
| wagner    | new        | 3      | 23       | 23       | 7        | 7        | 7        | 12       | 19       |
| lilisi    | new        | 5      | 23       | 23       | 17       | 17       | 15       | 21       | 11       |
| qishili   | new        | 5      | 23       | 23       | 17       | 12       | 11       | 16       | 16       |
| wutong    | new        | 6      | 23       | 23       | 23       | 23       | 19       | 19       | 6        |
| lisi      | old        | 1      | 6        | 29       | 1        | 1        | 1        | 3        | 6        |
| wangshi   | old        | 2      | 6        | 29       | 3        | 3        | 3        | 6        | 5        |
| liwei     | old        | 3      | 6        | 29       | 6        | 6        | 6        | 6        | 3        |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+

Notes:

The results depend on ORDER BY; the default sort order is ascending.

If ROWS BETWEEN is not specified, the default frame runs from the start of the partition to the current row;

If ORDER BY is not specified, all values in the partition are summed;

The key is to understand the meaning of ROWS BETWEEN, also known as the WINDOW clause:

PRECEDING: before the current row

FOLLOWING: after the current row

CURRENT ROW: the current row

UNBOUNDED: no boundary (start or end of the partition)

UNBOUNDED PRECEDING: from the start of the partition

UNBOUNDED FOLLOWING: to the end of the partition

COUNT, AVG, MIN and MAX are used the same way as SUM.
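
For example, a minimal sketch applying COUNT, AVG, MIN and MAX over the same frame as sales_4 above (output omitted):

select user_id, user_type, sales,
-- running count, average, minimum and maximum from the start of the partition to the current row
count(sales) over(partition by user_type order by sales rows between unbounded preceding and current row) as cnt,
avg(sales) over(partition by user_type order by sales rows between unbounded preceding and current row) as avg_sales,
min(sales) over(partition by user_type order by sales rows between unbounded preceding and current row) as min_sales,
max(sales) over(partition by user_type order by sales rows between unbounded preceding and current row) as max_sales
from order_detail;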

Requirement: get the first and last value within each group using first_value and last_value

Use first_value and last_value to get the first and last value within each group:

select user_id,user_type,
ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from order_detail;

The result is:

+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
|  user_id  | user_type  | row_num  | max_sales_user  | min_sales_user  | curr_last_min_user  | curr_last_max_user  |
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
| wutong    | new        | 7        | wutong          | qiuba           | wutong              | wutong              |
| lilisi    | new        | 6        | wutong          | qiuba           | qishili             | lilisi              |
| qishili   | new        | 5        | wutong          | qiuba           | qishili             | lilisi              |
| wagner    | new        | 4        | wutong          | qiuba           | wagner              | wagner              |
| zhangsan  | new        | 3        | wutong          | qiuba           | zhangsan            | zhangsan            |
| liliu     | new        | 2        | wutong          | qiuba           | qiuba               | liliu               |
| qiuba     | new        | 1        | wutong          | qiuba           | qiuba               | liliu               |
| liwei     | old        | 3        | liwei           | lisi            | liwei               | liwei               |
| wangshi   | old        | 2        | liwei           | lisi            | wangshi             | wangshi             |
| lisi      | old        | 1        | liwei           | lisi            | lisi                | lisi                |
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+

4. Analytic functions

ROW_NUMBER():

Generates a sequence number for each record within its partition, starting at 1 and following the sort order; for example, ordering by pv descending gives each day's pv ranking within the partition. ROW_NUMBER() has many use cases, such as getting the top-ranked record of a partition, or the first referrer of a session.

RANK():

Generates the rank of each value within its partition; equal values receive the same rank, and gaps are left in the ranking.

DENSE_RANK():

Generates the rank of each value within its partition; equal values receive the same rank, and no gaps are left in the ranking.

CUME_DIST:

The number of rows less than or equal to the current value divided by the total number of rows in the partition; for example, the proportion of people whose salary is less than or equal to the current salary.

PERCENT_RANK:

(RANK of the current row within the partition - 1) / (total rows in the partition - 1)

NTILE(n):

Splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows cannot be split evenly, the earlier buckets get the extra rows by default. NTILE does not support ROWS BETWEEN, e.g. NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
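
CUME_DIST and PERCENT_RANK are not demonstrated in the examples below; here is a minimal sketch against the same order_detail table (output omitted):

select user_id, user_type, sales,
-- fraction of rows in the partition whose sales is less than or equal to the current row's sales
cume_dist() over(partition by user_type order by sales) as cd,
-- (rank of the current row - 1) / (rows in the partition - 1)
percent_rank() over(partition by user_type order by sales) as pr
from order_detail;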

Using RANK, ROW_NUMBER and DENSE_RANK with OVER

With these functions we can compute the top N within each group.

Requirement: partition by user type and get the top N records by sales

select user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as rank,
ROW_NUMBER() over (partition by user_type order by sales desc) as row_number,
DENSE_RANK() over (partition by user_type order by sales desc) as dense_rank
from order_detail;

The result is:

+-----------+------------+--------+-------+-------------+-------------+--+
|  user_id  | user_type  | sales  | rank  | row_number  | dense_rank  |
+-----------+------------+--------+-------+-------------+-------------+--+
| wutong    | new        | 6      | 1     | 1           | 1           |
| qishili   | new        | 5      | 2     | 2           | 2           |
| lilisi    | new        | 5      | 2     | 3           | 2           |
| wagner    | new        | 3      | 4     | 4           | 3           |
| zhangsan  | new        | 2      | 5     | 5           | 4           |
| qiuba     | new        | 1      | 6     | 6           | 5           |
| liliu     | new        | 1      | 6     | 7           | 5           |
| liwei     | old        | 3      | 1     | 1           | 1           |
| wangshi   | old        | 2      | 2     | 2           | 2           |
| lisi      | old        | 1      | 3     | 3           | 3           |
+-----------+------------+--------+-------+-------------+-------------+--+

Using NTILE to compute percentiles

We can use NTILE to split the data into a number of buckets and then compute percentiles.

Use NTILE to bucket the data:

select
user_type,sales,
-- split each partition into 2 buckets
NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
-- split each partition into 3 buckets
NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
-- split each partition into 4 buckets
NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
-- split all rows into 4 buckets
NTILE(4) OVER(ORDER BY sales) AS all_nt4
from order_detail
order by user_type,sales;

The result is:

+------------+--------+------+------+------+----------+--+
| user_type  | sales  | nt2  | nt3  | nt4  | all_nt4  |
+------------+--------+------+------+------+----------+--+
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 2      | 1    | 1    | 2    | 2        |
| new        | 3      | 1    | 2    | 2    | 3        |
| new        | 5      | 2    | 2    | 3    | 4        |
| new        | 5      | 2    | 3    | 3    | 3        |
| new        | 6      | 2    | 3    | 4    | 4        |
| old        | 1      | 1    | 1    | 1    | 1        |
| old        | 2      | 1    | 2    | 2    | 2        |
| old        | 3      | 2    | 3    | 3    | 2        |
+------------+--------+------+------+------+----------+--+

Use NTILE to get the user_ids in the top 20% by sales:

select user_id
from
(select user_id, NTILE(5) OVER(ORDER BY sales desc) AS nt
from order_detail) A
where nt=1;

+----------+--+
| user_id  |
+----------+--+
| wutong   |
| qishili  |
+----------+--+

5. Enhanced aggregation: CUBE, GROUPING SETS and ROLLUP

These constructs are typically used in OLAP scenarios where the metrics cannot simply be accumulated and instead have to be computed by drilling up and down across different dimensions, for example UV counts by hour, by day and by month.

GROUPING SETS

Within a single GROUP BY query, GROUPING SETS aggregates by several different dimension combinations; it is equivalent to a UNION ALL of the GROUP BY result sets for each dimension (a sketch of this equivalence follows the first example below).

The GROUPING__ID column indicates which grouping set a result row belongs to.

Requirement: group by user_type and by sales separately and compute the statistics

0: jdbc:hive2://node03:10000> select user_type,sales,count(user_id) as pv,GROUPING__ID
from order_detail
group by user_type,sales
GROUPING SETS(user_type,sales)
ORDER BY GROUPING__ID;

The result is:

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
+------------+--------+-----+---------------+--+
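
As a sketch of the UNION ALL equivalence mentioned above (the grouping-id constants are filled in by hand to mirror the output; this only illustrates the semantics, it is not how Hive actually executes GROUPING SETS):

select user_type, cast(null as int) as sales, count(user_id) as pv, 1 as gid
from order_detail group by user_type
union all
select cast(null as string) as user_type, sales, count(user_id) as pv, 2 as gid
from order_detail group by sales;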

Requirement: group by user_type, by sales, and by user_type + sales, and compute the statistics

0: jdbc:hive2://node03:10000> select 
user_type,sales,count(user_id) as pv,GROUPING__ID 
from order_detail
group by user_type,sales
GROUPING SETS(user_type,sales,(user_type,sales))
ORDER BY GROUPING__ID;

The result is:

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 1      | 3   | 2             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 1      | 2   | 3             |
| new        | 2      | 1   | 3             |
+------------+--------+-----+---------------+--+

6. Using CUBE and ROLLUP to aggregate over all combinations of the GROUP BY dimensions

Aggregating with CUBE

Requirement: compute statistics with no grouping at all, grouped by user_type, grouped by sales, and grouped by user_type + sales

0: jdbc:hive2://node03:10000> select 
user_type,sales,count(user_id) as pv,GROUPING__ID
from order_detail
group by user_type,sales
WITH CUBE
ORDER BY GROUPING__ID;

The result is:

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| NULL       | NULL   | 10  | 0             |
| new        | NULL   | 7   | 1             |
| old        | NULL   | 3   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+--+

Aggregating with ROLLUP

ROLLUP is a subset of CUBE; it takes the leftmost dimension as the primary one and aggregates hierarchically from that dimension.

select user_type,sales,count(user_id) as pv,GROUPING__ID
from order_detail
group by user_type,sales
WITH ROLLUP
ORDER BY GROUPING__ID;

The result is:

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| NULL       | NULL   | 10  | 0             |
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+--+

 
