
An Introduction to pySpark DataFrames


1. Column types

pyspark.sql.types module
DataType
NullType
StringType
BinaryType
BooleanType
DateType
TimestampType
DecimalType
DoubleType
FloatType
ByteType
IntegerType
LongType
ShortType
ArrayType
MapType
StructField
StructType
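
These types are the building blocks of an explicit schema. As a minimal sketch (illustrative column names, assuming an existing SparkSession named spark):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# an explicit schema assembled from the types listed above
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("tags", ArrayType(StringType()), True),
])
df = spark.createDataFrame([("alice", 30, ["x", "y"])], schema=schema)
df.printSchema()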

2. Reading only part of a file

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
limit_df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load(label_path).limit(20)
limit_df.show()
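
On Spark 2.x and later, the same read can go through a SparkSession instead of the external databricks CSV package; a sketch, with label_path as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# built-in CSV reader; limit(20) keeps only the first 20 rows
limit_df = spark.read.csv(label_path, header=True, inferSchema=True).limit(20)
limit_df.show()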

3. DataFrame vs. RDD

DataFrame:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# file_schema is a predefined StructType describing the columns of the file
sqlContext.read.csv(filepath, schema=file_schema, header=False, sep='\t').first()

RDD:

sc.textFile(filepath).map(lambda x: str(x).split("\t")).first()

It turns out the RDD read is noticeably faster: the RDD version took 42 s, while the DataFrame version took 1 min 47 s, more than twice as long.

This is a bit puzzling, since a DataFrame and an RDD are both distributed data structures; the difference is that a DataFrame also defines the internal structure (schema) of the data. A likely explanation is that the CSV reader has to parse and type-convert every column against that schema, whereas textFile().map() only splits each raw line into strings.
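
To reproduce the comparison in a notebook, each read can be timed in its own cell with the %%time magic (a sketch; filepath and file_schema as defined above):

%%time
sqlContext.read.csv(filepath, schema=file_schema, header=False, sep='\t').first()

%%time
sc.textFile(filepath).map(lambda x: str(x).split("\t")).first()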

4. dict and DataFrame

Converting a dict to a DataFrame:

import json
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
rdd = sc.parallelize([{'a': 1}, {'a': 1, 'b': 2}])
# read.json expects an RDD of JSON strings, so serialize each dict first
df = sqlContext.read.json(rdd.map(json.dumps))
df.show()

Or:

%%time
from pyspark.sql.types import Row
rdd.map(lambda x: Row(**x)).toDF()

DataFrame to dict:

dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())
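
Here df_count is assumed to be an aggregated DataFrame with a count column, e.g. the result of a groupBy().count(); a minimal sketch:

# hypothetical input: build a per-value count DataFrame, then collect it into a Python dict
df_count = df.groupBy('MYCOLUMN').count()
result = dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())
print(result)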

5. DataFrame and JSON

https://sparkbyexamples.com/spark/spark-read-and-write-json-file/#write-json-file
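
The link above covers the details; as a minimal sketch (paths are placeholders, spark is an existing SparkSession), reading and writing JSON looks like:

# read: by default Spark expects one JSON object per line
df = spark.read.json("/tmp/people.json")

# read a pretty-printed / multi-line JSON file
df_ml = spark.read.option("multiLine", True).json("/tmp/people_multiline.json")

# write: produces a directory of part files
df.write.mode("overwrite").json("/tmp/people_out")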

6. Some DataFrame operations

Renaming DataFrame columns:

import pyspark.sql.functions as F

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X
# e.g.
df_col_rename(X, ['a', 'b', 'c'], ['x', 'y', 'z'])

Standardizing a DataFrame (z-score normalization):

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev_pop

spark = SparkSession.builder.getOrCreate()

def normalize(df, columns):
    aggMean = []
    aggStd = []
    for column in columns:
        aggMean.append(mean(df[column]).alias(column))
        aggStd.append(stddev_pop(df[column]).alias(column))
    averages = df.agg(*aggMean).collect()[0]
    stds = df.agg(*aggStd).collect()[0]
    select_col = []
    for column in columns:
        df = df.withColumn(column + "_norm", (df[column] - averages[column]) / stds[column])
        select_col.append(column + "_norm")
    df = df.select(select_col)
    return df

df1 = spark.createDataFrame([(1, 10), (33, 13), (2, 29)], ["A", "B"])
normalize(df1, ['A', 'B']).show()

Result (the original table, followed by the first row, the per-column means, the population standard deviations, and finally the normalized columns):

+---+---+
|  A|  B|
+---+---+
|  1| 10|
| 33| 13|
|  2| 29|
+---+---+

Row(A=1, B=10)
Row(A=12.0, B=17.333333333333332)
Row(A=14.854853303438128, B=8.339997335464536)
+-------------------+-------------------+
|             A_norm|             B_norm|
+-------------------+-------------------+
|-0.7404987296275803|-0.8792968436751745|
| 1.4136793929253808|-0.5195844985353303|
|-0.6731806632978004| 1.3988813422105053|
+-------------------+-------------------+

7. DataFrame column normalization (MinMaxScaler)

import pyspark.sql.functions as F
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql.types import ArrayType, DoubleType

def normalize(df, columns, skip_col=[]):
    # cast every requested column to double
    df = df.select([F.col(c).cast("double") for c in columns])
    normalize_col = df.columns
    for col in skip_col:
        if col in normalize_col:
            normalize_col.remove(col)
    # assemble the columns into a single vector, then min-max scale it
    vecAssembler = VectorAssembler(inputCols=normalize_col, outputCol="features")
    normalizer = MinMaxScaler(inputCol="features", outputCol="scaledFeatures", min=0, max=1)
    pipeline_normalize = Pipeline(stages=[vecAssembler, normalizer])
    df_transf = pipeline_normalize.fit(df).transform(df)
    # unpack the scaled vector back into ordinary double columns
    # (on Spark >= 3.0 the built-in vector_to_array can replace this UDF)
    to_array = F.udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
    df_array = df_transf.withColumn("scaledFeatures", to_array("scaledFeatures"))
    df_final = df_array.select(
        [F.col('scaledFeatures')[i].alias(normalize_col[i]) for i in range(len(normalize_col))]
        + skip_col)
    return df_final

df = sqlContext.createDataFrame(
    [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
     (2, 3, 8, 5, 6, 7, 8, 9, 10, 11),
     (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)],
    schema=["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"])
normalize(df, df.columns, ['a']).show()
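
On Spark 3.0 and later, the UDF above can be replaced by the built-in vector_to_array function; a minimal sketch of just that step:

from pyspark.ml.functions import vector_to_array

# same unpacking step as the UDF, without Python serialization overhead (Spark >= 3.0)
df_array = df_transf.withColumn("scaledFeatures", vector_to_array("scaledFeatures"))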

References:

  1. Stack Overflow: changing DataFrame column names;
  2. GitHub;
  3. Stack Overflow: Spark: How to normalize all the columns of DataFrame effectively?