Basic Statistics
Correlation
Calculating the correlation between two series of data is a common operation in statistics. In spark.ml we provide the flexibility to calculate pairwise correlations among many series. The supported correlation methods are currently Pearson's and Spearman's correlation.
Example code
Correlation computes the correlation matrix for the input Dataset of Vectors using the specified method. The output is a DataFrame that contains the correlation matrix of the column of vectors.
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// toDF on a local Seq needs the session implicits in scope
import spark.implicits._

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Spearman correlation matrix:\n $coeff2")
Hypothesis testing
Hypothesis testing is a powerful tool in statistics for determining whether a result is statistically significant or whether it occurred by chance. spark.ml currently supports Pearson's Chi-squared (χ²) tests for independence.
ChiSquareTest conducts Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix, for which the Chi-squared statistic is computed. All label and feature values must be categorical.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest

// toDF on a local Seq needs the session implicits in scope
import spark.implicits._

val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)

val df = data.toDF("label", "features")
val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics ${chi.getAs[Vector](2)}")
Summarizer
We provide vector column summary statistics for DataFrames through Summarizer. Available metrics are the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Example code
The following example demonstrates how to use Summarizer to compute the mean and variance for a vector column of the input DataFrame, with and without a weight column.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer

// Session implicits for toDF and $-columns; Summarizer._ brings
// metrics, mean, and variance into scope
import spark.implicits._
import Summarizer._

val data = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
)

val df = data.toDF("features", "weight")

val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
    .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()

println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")

val (meanVal2, varianceVal2) = df.select(mean($"features"), variance($"features"))
  .as[(Vector, Vector)].first()

println(s"without weight: mean = ${meanVal2}, variance = ${varianceVal2}")
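The same metrics(...) builder accepts the other column-wise statistics listed above. A minimal sketch gathering max, min, and the nonzero counts in one aggregation pass, reusing df and the imports from the example (the metric name strings are assumed to match the ones Summarizer exposes):

val extras = df.select(
    metrics("max", "min", "numNonZeros")
      .summary($"features").as("summary"))
  .select("summary.max", "summary.min", "summary.numNonZeros")
extras.show(truncate = false)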
Data source
Image data source
The image data source is used to load image files from a directory. It loads compressed images (jpeg, png, etc.) into a raw image representation via ImageIO in the Java library. The loaded DataFrame has one StructType column, "image", containing image data stored in the image schema. The schema of the image column is:
- origin: StringType (the file path of the image)
- height: IntegerType (height of the image)
- width: IntegerType (width of the image)
- nChannels: IntegerType (number of image channels)
- mode: IntegerType (OpenCV-compatible type)
- data: BinaryType (image bytes in OpenCV-compatible order: row-wise BGR in most cases)
scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]

scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
+-----------------------------------------------------------------------+-----+------+
|origin |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg |300 |311 |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg |199 |313 |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg |300 |296 |
+-----------------------------------------------------------------------+-----+------+
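The schema fields are enough to index into the raw bytes. A minimal sketch, assuming the common 3-channel row-wise BGR layout with one byte per channel described above; the offset arithmetic is an assumption for that layout, not part of the data source API:

// Flatten the struct fields of the first image into plain values.
val img = df.select("image.height", "image.width", "image.nChannels", "image.data").head
val (h, w, c) = (img.getInt(0), img.getInt(1), img.getInt(2))
val bytes = img.getAs[Array[Byte]](3)
// Channel k of pixel (row, col) sits at offset (row * w + col) * c + k.
val blueTopLeft = bytes(0) & 0xFF  // channel 0 = blue in BGR order
println(s"$h x $w x $c image; blue at (0,0) = $blueTopLeft")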