1. PySpark data types
DataType, NullType, StringType, BinaryType, BooleanType, DateType,
TimestampType, DecimalType, DoubleType, FloatType, ByteType, IntegerType,
LongType, ShortType, ArrayType, MapType, StructField, StructType
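All of these are defined in the pyspark.sql.types module. Below is a minimal sketch of composing a few of them into a nested schema; the field names are made up for illustration.

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, ArrayType, MapType
)

# A nested schema built from the types listed above:
# two scalar columns, an array column, and a map column.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("tags", ArrayType(StringType()), True),
    StructField("attrs", MapType(StringType(), StringType()), True),
])

# Prints: struct<name:string,age:int,tags:array<string>,attrs:map<string,string>>
print(schema.simpleString())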
2. Example: StructField
The StructField class as it appears in the pyspark.sql.types source (Spark 1.x-era, Python 2 code, hence the basestring check):

class StructField(DataType):
    """A field in :class:`StructType`.

    :param name: string, name of the field.
    :param dataType: :class:`DataType` of the field.
    :param nullable: boolean, whether the field can be null (None) or not.
    :param metadata: a dict from string to simple type that can be
        converted to JSON automatically
    """

    def __init__(self, name, dataType, nullable=True, metadata=None):
        """
        >>> (StructField("f1", StringType(), True)
        ...      == StructField("f1", StringType(), True))
        True
        >>> (StructField("f1", StringType(), True)
        ...      == StructField("f2", StringType(), True))
        False
        """
        assert isinstance(dataType, DataType), "dataType should be DataType"
        assert isinstance(name, basestring), "field name should be string"
        if not isinstance(name, str):
            name = name.encode('utf-8')
        self.name = name
        self.dataType = dataType
        self.nullable = nullable
        self.metadata = metadata or {}
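A quick usage sketch matching the doctests above: StructField equality is structural (name, dataType, nullability), and a field can serialize itself to a JSON-compatible dict via jsonValue(). The exact dict formatting may vary by Spark version.

from pyspark.sql.types import StructField, StringType

f1 = StructField("f1", StringType(), True)
f2 = StructField("f1", StringType(), True)
f3 = StructField("f2", StringType(), True)

print(f1 == f2)  # True  -- same name, dataType, and nullability
print(f1 == f3)  # False -- different field name

# e.g. {'name': 'f1', 'type': 'string', 'nullable': True, 'metadata': {}}
print(f1.jsonValue())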
3. Specifying a DataFrame schema
Declare the data type of each field explicitly when building a DataFrame. The Scala example below constructs a schema, maps an RDD to Rows, and applies the schema; a PySpark equivalent follows after it.
val schema = StructType(List(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))

// Map the RDD to an RDD of Rows
val rowRDD = personRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))

// Apply the schema to the row RDD
val personDataFrame = sqlContext.createDataFrame(rowRDD, schema)
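The same three steps in PySpark, as a sketch: the SparkSession named spark, the sample input lines, and the column names are all assumptions for illustration.

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Hypothetical input: CSV-like lines parsed into an RDD of Rows.
person_rdd = spark.sparkContext.parallelize(["1,Alice,30", "2,Bob,25"]) \
    .map(lambda line: line.split(",")) \
    .map(lambda p: Row(int(p[0]), p[1].strip(), int(p[2])))

# Apply the schema to the row RDD.
person_df = spark.createDataFrame(person_rdd, schema)
person_df.show()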
References:
- Source code for pyspark.sql.types