Spark Notes 2: A spark-sql Program

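1. Introduction

This note walks through a small example of running a program from IDEA that connects through spark-sql to the Hive metastore. First make sure spark-sql and spark-shell run correctly on the server, then run the program locally. The versions I used are Spark 1.4.1, Hive 1.2.1 and Hadoop 2.7.1.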

2. Connection code

This builds on the code from note 1.

2.1 Maven configuration:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.4.1</version>
    </dependency>
    <!-- Do not include this one; it causes a version mismatch -->
    <!--<dependency>-->
        <!--<groupId>org.apache.hadoop</groupId>-->
        <!--<artifactId>hadoop-client</artifactId>-->
        <!--<version>1.2.1</version>-->
    <!--</dependency>-->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.4.1</version>
    </dependency>

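2.2 Java code

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.hive.HiveContext;

    public class Demo1 {
        public static final String master = "spark://master:7077";

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo1").setMaster(master);
            conf.set("spark.executor.memory", "256M");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // HiveContext wraps the underlying SparkContext and talks to the Hive metastore
            HiveContext sqlContext = new HiveContext(sc.sc());
            DataFrame df = sqlContext.sql("select * from data_center.shop limit 10");
            // Bring the 10 rows back to the driver and print them
            Row[] rows = df.collect();
            for (Row row : rows) {
                System.out.println(row);
            }
            sc.stop();
        }
    }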

2.3 Scala code

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object hello extends App {
      val master = "spark://master:7077"
      val conf = new SparkConf().setAppName("demo").setMaster(master)
      // "worker_max_heapsize" is not a standard Spark property name; kept from the original
      conf.set("spark.executor.memory", "128M").set("worker_max_heapsize", "64M")
      val sc = new SparkContext(conf)
      val sqlContext = new HiveContext(sc)
      sqlContext.sql("use data_center")
      val df = sqlContext.sql("select * from orderinfo limit 10")
      // Bring the 10 rows back to the driver and print them
      val rows = df.collect()
      rows.foreach(println)
      sc.stop()
    }

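2.4 hive-site.xml configuration:

I won't paste the file here. I copied it down from the server and only changed a few file storage paths. The project directory is:

[Image: project directory layout]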
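Since the file itself is not shown, here is a minimal sketch of what a client-side hive-site.xml often contains, so the shape is clear. Everything in it is an assumption rather than the author's actual file: it supposes a remote metastore service on the host `master` (the host name used in the Spark master URL above) listening on the default Thrift port 9083, and a placeholder warehouse path. It also assumes the file sits on the program's classpath (for a Maven project, typically `src/main/resources`) so that `HiveContext` can find it.

```xml
<?xml version="1.0"?>
<configuration>
    <!-- Assumption: remote metastore on host "master", default port 9083 -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
    </property>
    <!-- Assumption: placeholder path; use the value from the server's own file -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>
```

If the server's file points the metastore at a database over JDBC instead of a Thrift service, the relevant properties differ, so copying the server's file and adjusting paths, as described above, remains the safer route.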

3. Common errors:

3.1 Not enough memory on the server

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

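This warning appears when no worker can offer the resources the job asks for; here the requested executor memory was more than the workers had free. Setting a smaller value is enough: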

conf.set("spark.executor.memory", "128M")

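3.2 Version mismatch: this is usually caused by the versions on the server and on the local machine not matching. Align the local dependency versions with the server's.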

Server IPC version 9 cannot communicate with client version 4
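In this message, IPC version 9 is the RPC protocol of Hadoop 2.x and version 4 is that of Hadoop 1.x, so the local classpath is carrying a Hadoop 1.x client while the server runs 2.x. One possible fix, sketched here under the assumption that the server runs Hadoop 2.7.1 as stated in the introduction, is to pin the client to the same version in the pom instead of the 1.2.1 artifact:

```xml
<!-- Assumption: the server runs Hadoop 2.7.1; match whatever version it really runs -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
</dependency>
```

Running `mvn dependency:tree` shows which dependency pulls in an older Hadoop artifact, if one is still present after this change.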

3.3 There are also assorted permission, path and other miscellaneous problems; fix them as they come up.

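Summary:

1. We put our database data on HDFS and then query it with spark-sql. This is only a simple connection, but it is quite convenient.

2. Most of the problems along the way were version problems. If you are not using Maven, it is best to copy the jars under spark/lib, hive/lib, hadoop/lib and so on from the server to your local machine; that makes conflicts less likely.

3. Keep wrestling with it, it will work out in the end ~.~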