
Day13[20200725]


I. Course Review

1. Common issues

(1) Line breaks


lines terminated by '\n' -> tells Hive (the Java code underneath) to treat the newline character as the separator between two records.

…metastore?\ncreate…

XML configuration files must not contain Chinese characters, extra spaces, tabs, or line breaks inside property values.
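For example, the metastore JDBC URL (presumably the javax.jdo.option.ConnectionURL property, matching the "…metastore?\ncreate…" fragment above) has to stay on a single line. A minimal sketch, with the hostname taken from these notes and the option value assumed to be createDatabaseIfNotExist=true:

<property>
<name>javax.jdo.option.ConnectionURL</name>
<!-- keep the whole value on one line; a stray line break before "createDatabaseIfNotExist" breaks the URL -->
<value>jdbc:mysql://hadoop212:3306/metastore?createDatabaseIfNotExist=true</value>
</property>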

(2) Logging

Hive's log output has to be configured by hand:

[hadoop@hadoop212 conf]$ vim hive-log4j.properties 
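A minimal sketch of the usual keys in hive-log4j.properties; the log directory shown below is an assumption, adjust it to your own install:

# directory and file that Hive writes its logs to (the default is under /tmp/<user>)
hive.log.dir=/opt/modules/hive/logs
hive.log.file=hive.log
# root logger level and appender
hive.root.logger=INFO,DRFA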


2. Key points

(1) About Hive

  • Hive is an application written in Java; it has no graphical interface and only offers a shell-style command line.
  • Hive translates the SQL statements you write (HQL) into MapReduce programs and submits them to the Hadoop cluster to run.
  • Hive maps structured data files stored on HDFS to tables and then analyzes them.

(2) About the metastore (metadata)

  • The default Derby metastore is almost never used, because it supports only one connection.
  • A remote MySQL database provides the metadata storage service (see the configuration sketch below).
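A minimal hive-site.xml sketch for the remote MySQL metastore, to go along with the javax.jdo.option.ConnectionURL property shown in section I; the driver class is the standard MySQL one, while the user name and password here are placeholders:

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<!-- placeholder account; use the MySQL user you created for Hive -->
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>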

II. Hive Statement Practice

1. Analysis of the emp table


hive> show databases;
OK
default
Time taken: 4.961 seconds, Fetched: 1 row(s)
hive> show databases;
OK
default
Time taken: 0.023 seconds, Fetched: 1 row(s)
hive> create database empManager;
OK
Time taken: 0.252 seconds
hive> show databases;
OK
default
empmanager
Time taken: 0.021 seconds, Fetched: 2 row(s)
hive> use empmanager;
OK
Time taken: 0.021 seconds
hive> create table emp(
    > empno int,
    > ename string,
    > job string,
    > mgr int,
    > hiredate string,
    > sal double,
    > comm double,
    > deptno int
    > )
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.223 seconds
hive> load data local inpath '/opt/datas/emp.txt' into table emp;
Loading data to table empmanager.emp
Table empmanager.emp stats: [numFiles=1, totalSize=656]
OK
Time taken: 0.644 seconds
hive> select * from emp;
OK
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
Time taken: 0.309 seconds, Fetched: 14 row(s)
hive> create table dept(
    > deptno int,
    > dname string,
    > loc string
    > )
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.088 seconds
hive> load data inpath '/datas/dept.txt' into table dept;
Loading data to table empmanager.dept
Table empmanager.dept stats: [numFiles=1, totalSize=79]
OK
Time taken: 0.379 seconds
hive> select * from dept;
OK
10	ACCOUNTING	NEW YORK
20	RESEARCH	DALLAS
30	SALES	CHICAGO
40	OPERATIONS	BOSTON
Time taken: 0.085 seconds, Fetched: 4 row(s)

Loading data from the local filesystem (load data local inpath) copies the file onto HDFS, just like bin/hdfs dfs -put.

Loading data that is already on the cluster (load data inpath) moves the file into the warehouse directory, just like bin/hdfs dfs -mv.
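A small recap sketch using the two commands from the transcripts above; after the second form you can run bin/hdfs dfs -ls /datas/ and the source file will be gone, because it was moved rather than copied:

-- local form: copies /opt/datas/emp.txt from the local disk into the table directory on HDFS
load data local inpath '/opt/datas/emp.txt' into table emp;
-- HDFS form: moves /datas/dept.txt into the warehouse directory (the original file is removed)
load data inpath '/datas/dept.txt' into table dept;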

2. Ways to create tables

  • The plain way
hive> create table dept(
    > deptno int,
    > dname string,
    > loc string
    > )
    > row format delimited fields terminated by '\t';
  • Using the as keyword

The as keyword copies both the table structure and the data.

hive> create table emp2 as select * from emp;
Query ID = hadoop_20200725091111_26fe472d-7fb6-4abb-958d-1e857c610b91
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1595637424134_0001, Tracking URL = http://hadoop212:8088/proxy/application_1595637424134_0001/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job  -kill job_1595637424134_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-07-25 09:11:37,321 Stage-1 map = 0%,  reduce = 0%
2020-07-25 09:11:48,452 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.08 sec
MapReduce Total cumulative CPU time: 4 seconds 80 msec
Ended Job = job_1595637424134_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop212:8020/user/hive/warehouse/empmanager.db/.hive-staging_hive_2020-07-25_09-11-17_291_3630837031185948323-1/-ext-10001
Moving data to: hdfs://hadoop212:8020/user/hive/warehouse/empmanager.db/emp2
Table empmanager.emp2 stats: [numFiles=1, numRows=14, totalSize=661, rawDataSize=647]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 4.08 sec   HDFS Read: 3916 HDFS Write: 733 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 80 msec
OK
Time taken: 33.653 seconds
hive> select * from emp2;
OK
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
Time taken: 0.293 seconds, Fetched: 14 row(s)
  • The like keyword

Copies only the table structure, not the data.

hive> create table emp3 like emp;
OK
Time taken: 0.102 seconds
hive> desc emp3;
OK
empno               	int                 	                    
ename               	string              	                    
job                 	string              	                    
mgr                 	int                 	                    
hiredate            	string              	                    
sal                 	double              	                    
comm                	double              	                    
deptno              	int                 	                    
Time taken: 0.095 seconds, Fetched: 8 row(s)
  • The location keyword, used to point the table at a specific directory in the warehouse
[hadoop@hadoop212 hadoop-2.6.0-cdh5.7.6]$ bin/hdfs dfs -mkdir /user/hive/warehouse/dept_location
hive> create table dept_location (
    > deptno int,
    > dname string,
    > loc string
    > )
    > row format delimited fields terminated by '\t'
    > location '/user/hive/warehouse/dept_location';
OK
Time taken: 0.085 seconds
hive> load data local inpath '/opt/datas/dept.txt' into table dept_location;
Loading data to table empmanager.dept_location
Table empmanager.dept_location stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.36 seconds
hive> select * from dept_location;
OK
10	ACCOUNTING	NEW YORK
20	RESEARCH	DALLAS
30	SALES	CHICAGO
40	OPERATIONS	BOSTON
Time taken: 0.077 seconds, Fetched: 4 row(s)

3. Tips and configuration

(1) Local mode

While learning, there is no need to submit every task to the cluster; use local mode instead.

Turn on the setting that automatically runs qualifying queries in local mode:

hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive> set hive.exec.mode.local.auto=true;
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=true

You could also hard-code this setting in hive-site.xml, but that is not recommended:

<property>
<name>hive.exec.mode.local.auto</name>
<value>true</value>
</property>
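Note that even with hive.exec.mode.local.auto=true, Hive only picks local mode for small jobs. A sketch of the related thresholds (property names as in Hive 1.x; the values shown are the usual defaults and are assumptions here):

<!-- maximum total input size (bytes) for a query to still qualify for local mode -->
<property>
<name>hive.exec.mode.local.auto.inputbytes.max</name>
<value>134217728</value>
</property>
<!-- maximum number of input files for a query to still qualify for local mode -->
<property>
<name>hive.exec.mode.local.auto.input.files.max</name>
<value>4</value>
</property>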


After local mode is configured, queries take much less time.


(2) Showing the current database and column headers

Edit hive-site.xml:

<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>


hive (default)> show databases;
OK
database_name
default
empmanager
Time taken: 5.253 seconds, Fetched: 2 row(s)
hive (default)> show databases;
OK
database_name
default
empmanager
Time taken: 0.021 seconds, Fetched: 2 row(s)
hive (default)> select * from wc;
OK
wc.id	wc.word
1	hadoop
2	spark
3	mysql
4	hadoop
5	hadoop
6	mysql
Time taken: 0.658 seconds, Fetched: 6 row(s)

III. Case Studies

1. Which is better suited for machine learning, Python or R?

The survey's scope is limited, so its results are only a reference and have little practical significance.

1,usca,1,1,1,0,0,0,1,1,1,0,0,1,0,7,Anaconda;scikit-learn;Python;R language;SQL language;Tensorflow;Other Deep Learning Tools

(1) Create the database and table, and load the data

hive (default)> create database db_hive;
OK
Time taken: 0.099 seconds
hive (default)> use db_hive;
OK
Time taken: 0.027 seconds
hive (db_hive)> create table db_hive.tb_lc(
              > id string,area string,python string,r string,sql string,rm string,excel string,
              > spark string,ms string,tensorflow string,scikit string,tableau string,knime string,
              > deep string,spHadoop string,ntools int,toolstr string)
              > row format delimited fields terminated by ','
              > lines terminated by '\n';
OK
Time taken: 0.163 seconds
hive (db_hive)> load data local inpath '/opt/datas/sw17.csv' overwrite into table db_hive.tb_lc;
Loading data to table db_hive.tb_lc
Table db_hive.tb_lc stats: [numFiles=1, numRows=0, totalSize=318533, rawDataSize=0]
OK
Time taken: 0.699 seconds
hive (db_hive)> select * from db_hive.tb_lc limit 5;
OK
tb_lc.id  tb_lc.area  tb_lc.python  tb_lc.r  tb_lc.sql  tb_lc.rm  tb_lc.excel  tb_lc.spark  tb_lc.ms  tb_lc.tensorflow  tb_lc.scikit  tb_lc.tableau  tb_lc.knime  tb_lc.deep  tb_lc.sphadoop  tb_lc.ntools  tb_lc.toolstr
1  usca  1  1  1  0  0  0  1  1  1  0  0  Anaconda;scikit-learn;Python;R language;SQL language;Tensorflow;Other Deep Learning Tools
2  euro  0  0  0  0  0  0  0  0  0  0  0  Orange
3  euro  1  0  1  1  0  0  0  0  0  0  0  PyCharm;RapidMiner;Weka;Java;Python;SQL language;Open Source Hadoop Tools;SQL on Hadoop tools
4  asia  1  1  0  1  0  0  0  0  0  0  0  C4.5/C5.0/See5;PyCharm;RapidMiner;XLMiner;Python;R language
5  afme  1  1  0  1  1  1  1  1  1  1  0  22  Anaconda;Dataiku;Excel;H2O.ai;IBM SPSS Statistics;Microsoft SQL Server;Orange;RapidMiner;scikit-learn;Tableau;Weka;Python;R language;Spark;Caffe ;Keras;Microsoft CNTK;mxnet;Tensorflow;Theano;Torch;PyTorch
Time taken: 0.088 seconds, Fetched: 5 row(s)
hive (db_hive)> 

(2) Among Spark/Hadoop (big data) users, do more people use Python or R?

Number of people who use both Spark/Hadoop and Python:

hive (db_hive)> select count(*) cnt_shp from db_hive.tb_lc where sphadoop=1 and python=1;
Query ID = hadoop_20200725142222_fb73d3c3-47eb-4e90-a090-2373e424c815
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Starting Job = job_1595637424134_0004, Tracking URL = http://hadoop212:8088/proxy/application_1595637424134_0004/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job  -kill job_1595637424134_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-25 14:22:35,715 Stage-1 map = 0%,  reduce = 0%
2020-07-25 14:22:46,846 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.57 sec
2020-07-25 14:22:54,515 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.48 sec
MapReduce Total cumulative CPU time: 7 seconds 480 msec
Ended Job = job_1595637424134_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.48 sec   HDFS Read: 327181 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 480 msec
OK
cnt_shp
683
Time taken: 29.315 seconds, Fetched: 1 row(s)

Number of people who use both Spark/Hadoop and R:

hive (db_hive)> select count(*) cnt_shr from db_hive.tb_lc where sphadoop=1 and r=1;
Automatically selecting local only mode for query
Query ID = hadoop_20200725142525_ed3b1bc1-0f3f-4611-876a-245314812521
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 14:25:12,677 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1060290705_0001
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 645368 HDFS Write: 55479660 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
cnt_shr
606
Time taken: 1.605 seconds, Fetched: 1 row(s)

R is used more in scientific research, while Python is used more in application development.

If we want to keep the result, we can save it as a file. Under the hood this is still an MR job, and the output is stored on HDFS.
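Besides create table … as select (used below), a hedged alternative sketch: query results can also be written straight to a directory as files (the output path here is just an example):

-- write the query result to a local directory as plain files
insert overwrite local directory '/opt/datas/result_python_count'
select count(*) from db_hive.tb_lc where sphadoop=1 and python=1;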

The query SQL:

select t1.cnt_shp as shp , t2.cnt_shr as shr from 
(select '1' as id,count(*) cnt_shp from db_hive.tb_lc where sphadoop=1 and python=1) as t1
join
(select '1' as id,count(*) cnt_shr from db_hive.tb_lc where sphadoop=1 and r=1) as t2
on t1.id=t2.id ;

Implementation:

hive (db_hive)> create table if not exists db_hive.tb_result1 as
              > select t1.cnt_shp as shp , t2.cnt_shr as shr from
              > (select '1' as id,count(*) cnt_shp from db_hive.tb_lc where sphadoop=1 and python=1) as t1
              > join
              > (select '1' as id,count(*) cnt_shr from db_hive.tb_lc where sphadoop=1 and r=1) as t2
              > on t1.id=t2.id ;
Automatically selecting local only mode for query
Query ID = hadoop_20200725144848_353a820d-8d7a-4d1d-b00d-ee939387913d
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 14:48:07,178 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1605529441_0007
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 14:48:08,470 Stage-4 map = 100%,  reduce = 100%
Ended Job = job_local710312008_0008
Stage-8 is selected by condition resolver.
Stage-9 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
Execution log at: /tmp/hadoop/hadoop_20200725144848_353a820d-8d7a-4d1d-b00d-ee939387913d.log
2020-07-25 02:48:11	Starting to launch local task to process map join;	maximum memory = 477626368
2020-07-25 02:48:12	Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/hadoop/7a6d78e1-f740-4f09-baa2-ea4b542e1033/hive_2020-07-25_14-48-05_589_5138958313151404537-1/-local-10005/HashTable-Stage-5/MapJoin-mapfile41--.hashtable
2020-07-25 02:48:12	Uploaded 1 File to: file:/tmp/hadoop/7a6d78e1-f740-4f09-baa2-ea4b542e1033/hive_2020-07-25_14-48-05_589_5138958313151404537-1/-local-10005/HashTable-Stage-5/MapJoin-mapfile41--.hashtable (283 bytes)
2020-07-25 02:48:12	End of local task; Time Taken: 0.983 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2020-07-25 14:48:14,681 Stage-5 map = 100%,  reduce = 0%
Ended Job = job_local23033021_0009
Moving data to: hdfs://hadoop212:8020/user/hive/warehouse/db_hive.db/tb_result1
Table db_hive.tb_result1 stats: [numFiles=1, numRows=1, totalSize=8, rawDataSize=7]
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 3186210 HDFS Write: 1076 SUCCESS
Stage-Stage-4:  HDFS Read: 3823276 HDFS Write: 1308 SUCCESS
Stage-Stage-5:  HDFS Read: 1911850 HDFS Write: 794 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
shp	shr
Time taken: 9.404 seconds
hive (db_hive)> select * from  db_hive.tb_result1;
OK
tb_result1.shp	tb_result1.shr
683	606
Time taken: 0.07 seconds, Fetched: 1 row(s)

The result data is also stored as a file on HDFS, under /user/hive/warehouse/db_hive.db/tb_result1.

The column (schema) information is stored in the MySQL metastore.

HQL does support join, but it is not recommended. The data being analyzed lives in files on HDFS, so joining the data in those files requires a lot of compute resources and disk I/O; avoid it unless there is no other choice.

Instead, first use the as keyword to join the two tables into one wide table, then analyze that single table (see the sketch below).

HQL's subquery support is limited: in this Hive version subqueries can only appear in the FROM clause (as in the query above) and as IN/EXISTS conditions in the WHERE clause.
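A minimal sketch of the pre-join pattern with the emp and dept tables from earlier (the table name emp_dept is made up for this example):

-- join once, materialize the wide table, then run all later analysis against it
create table empmanager.emp_dept as
select e.empno, e.ename, e.job, e.sal, e.comm, e.deptno, d.dname, d.loc
from empmanager.emp e
join empmanager.dept d
on e.deptno = d.deptno;

-- later queries read the single wide table instead of re-joining the files
select dname, count(*) as emp_cnt from empmanager.emp_dept group by dname;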

2. Other ways to execute Hive statements

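As a hedged sketch (not necessarily the exact commands shown in class), the usual ways to run Hive statements without the interactive shell are the -e and -f options of the hive command; the file and output paths below are examples:

# run a single statement without entering the shell
bin/hive -e "select * from empmanager.emp limit 5;"
# run all statements in a SQL file
bin/hive -f /opt/datas/emp_analysis.sql
# -S (silent) suppresses progress logs so the output can be redirected to a file
bin/hive -S -e "select * from empmanager.emp;" > /opt/datas/emp_result.txt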

3. Analysis of Lianjia second-hand housing transaction records


The data seems to date from around 2005…

梅园六街坊,2室0厅,47.72,浦东,低区/6层,朝南,500,104777,1992年建

Fields: community name, layout, area, district, floor, orientation, total price (10,000 RMB), unit price (RMB/m²), year built

Create the database and table, and load the data:

hive (default)> show databases;
OK
database_name
db_hive
default
empmanager
Time taken: 4.989 seconds, Fetched: 3 row(s)
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> create database db_lianjia;
OK
Time taken: 0.223 seconds
hive (default)> use db_lianjia;
OK
Time taken: 0.06 seconds
hive (db_lianjia)> create table tb_lj(
                 > village_name string,
                 > house_type string,
                 > house_area string,
                 > region string,
                 > floor_str string,
                 > direction string,
                 > total_price string,
                 > square_price string,
                 > build_date string
                 > )
                 > row format delimited fields terminated by ','
                 > lines terminated by '\n'
                 > stored as textfile;
OK
Time taken: 0.259 seconds
hive (db_lianjia)> load data local inpath '/opt/datas/2nd_house_price.csv' overwrite into table tb_lj;
Loading data to table db_lianjia.tb_lj
Table db_lianjia.tb_lj stats: [numFiles=1, numRows=0, totalSize=2225947, rawDataSize=0]
OK
Time taken: 0.646 seconds
hive (db_lianjia)> select * from tb_lj limit 3;
OK
tb_lj.village_name  tb_lj.house_type  tb_lj.house_area  tb_lj.region  tb_lj.floor_str  tb_lj.direction  tb_lj.total_price  tb_lj.square_price  tb_lj.build_date
梅园六街坊	2室0厅	47.72	浦东	低区/6层	朝南	500	104777	1992年建
碧云新天地(一期)	3室2厅	108.93	浦东	低区/6层	朝南	735	67474	2002年建
博山小区	1室1厅	43.79	浦东	中区/6层	朝南	260	59374	1988年建
Time taken: 0.31 seconds, Fetched: 3 row(s)
hive (db_lianjia)> 

(1) Which floor-area range is the most popular?


Floor-area ranges:

Under 50 m² | 50-70 | 70-90 | 90-110 | 110-130 | 130-150 | over 150 m²

hive (db_lianjia)> select t.area_group,count(*) as total
                 > from
                 > (select
                 > case
                 >   when 0<house_area and house_area<=50 then '50平以下'
                 >   when 50<house_area and house_area<=70 then '50-70平'
                 >   when 70<house_area and house_area<=90 then '70-90平'
                 >   when 90<house_area and house_area<=110 then '90-110平'
                 >   when 110<house_area and house_area<=130 then '110-130平'
                 >   when 130<house_area and house_area<=150 then '130-150平'
                 >   else '150以上'
                 > end as area_group
                 > from db_lianjia.tb_lj ) as t
                 > group by t.area_group order by total desc;
Automatically selecting local only mode for query
Query ID = hadoop_20200725161717_93eaff12-2d84-4c3b-8e8f-bc311e73871b
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:18:02,165 Stage-1 map = 100%,  reduce = 0%
2020-07-25 16:18:03,184 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local243861579_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:18:04,685 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local61328738_0002
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 4468278 HDFS Write: 4452202 SUCCESS
Stage-Stage-2:  HDFS Read: 4469086 HDFS Write: 4452647 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t.area_group	total
50-70平	5775
70-90平	5739
90-110平	4505
50平以下	4291
110-130平	2798
150以上	2775
130-150平	2318
Time taken: 5.321 seconds, Fetched: 7 row(s)

Conclusion: the smaller the area, the more popular the listing, because the total price is lower!

(2) Which house-age range is the most popular?


Within 5 years | within 10 years | within 15 years | within 20 years | over 20 years


hive (db_lianjia)> select t.year_group,count(*) as total
                 > from
                 > (select
                 > case
                 >   when (2020-substring(build_date,0,4)) between 0 and 5 then '5年以内'
                 >   when (2020-substring(build_date,0,4)) between 6 and 10 then '10年以内'
                 >   when (2020-substring(build_date,0,4)) between 11 and 15 then '15年以内'
                 >   when (2020-substring(build_date,0,4)) between 16 and 20 then '20年以内'
                 >   else '20年以上'
                 > end as year_group
                 > from db_lianjia.tb_lj) as t
                 > group by t.year_group order by total desc;
Automatically selecting local only mode for query
Query ID = hadoop_20200725164444_6a3b905e-baa6-4c2d-b165-b45f73d7b42e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:44:18,655 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local602938329_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:44:20,050 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local421398031_0004
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 8929446 HDFS Write: 4453037 SUCCESS
Stage-Stage-2:  HDFS Read: 8930144 HDFS Write: 4453418 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t.year_group	total
20年以上	17654
15年以内	4422
20年以内	3125
10年以内	2954
5年以内	46
Time taken: 2.964 seconds, Fetched: 5 row(s)

Conclusion: the older the house, the more popular it is.

(3) Extension exercises, to be completed after class


  • Define requirements and write SQL for other characteristics of the houses.
  • Save the results as data files.
  • Run all of the steps above from a SQL file; the file must be re-runnable without throwing exceptions (see the sketch below).
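A minimal sketch of such a re-runnable SQL file for the third exercise, assuming it is saved as /opt/datas/lianjia_analysis.sql and run with bin/hive -f, and that tb_lj has already been created and loaded as above (the file and result-table names are made up):

-- /opt/datas/lianjia_analysis.sql : safe to run repeatedly
create database if not exists db_lianjia;
use db_lianjia;

-- drop and recreate the result table so a second run does not fail
drop table if exists tb_result_area;
create table tb_result_area as
select t.area_group, count(*) as total
from (select
        case
          when 0<house_area and house_area<=50 then '50平以下'
          when 50<house_area and house_area<=70 then '50-70平'
          else '70平以上'   -- CASE abbreviated; extend with the full set of area ranges as needed
        end as area_group
      from db_lianjia.tb_lj) as t
group by t.area_group;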

Extension: sorted output from MR (group by + order by)

hive> select word,count(word) cnt from wc group by word order by cnt desc;
Query ID = hadoop_20200725103838_c9ea80b4-e34c-42b7-9b5c-da3c9bf24d5e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Starting Job = job_1595637424134_0002, Tracking URL = http://hadoop212:8088/proxy/application_1595637424134_0002/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job  -kill job_1595637424134_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-25 10:39:03,173 Stage-1 map = 0%,  reduce = 0%
2020-07-25 10:39:13,043 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.21 sec
2020-07-25 10:39:20,579 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.39 sec
MapReduce Total cumulative CPU time: 6 seconds 390 msec
Ended Job = job_1595637424134_0002
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Starting Job = job_1595637424134_0003, Tracking URL = http://hadoop212:8088/proxy/application_1595637424134_0003/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job  -kill job_1595637424134_0003
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-25 10:39:29,542 Stage-2 map = 0%,  reduce = 0%
2020-07-25 10:39:35,943 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.34 sec
2020-07-25 10:39:43,463 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.11 sec
MapReduce Total cumulative CPU time: 4 seconds 110 msec
Ended Job = job_1595637424134_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.39 sec   HDFS Read: 6289 HDFS Write: 169 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.11 sec   HDFS Read: 4698 HDFS Write: 25 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 500 msec
OK
hadoop	3
mysql	2
spark	1
Time taken: 51.236 seconds, Fetched: 3 row(s)
hive> 
