public class Statistics
extends Object
| Constructor and Description | 
|---|
| Statistics() | 
| Modifier and Type | Method and Description | 
|---|---|
| `static ChiSqTestResult` | `chiSqTest(Matrix observed)` Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. |
| `static ChiSqTestResult[]` | `chiSqTest(RDD<LabeledPoint> data)` Conduct Pearson's independence test for every feature against the label across the input RDD. |
| `static ChiSqTestResult` | `chiSqTest(Vector observed)` Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of `1 / observed.size`. |
| `static ChiSqTestResult` | `chiSqTest(Vector observed, Vector expected)` Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution. |
| `static MultivariateStatisticalSummary` | `colStats(RDD<Vector> X)` Computes column-wise summary statistics for the input RDD[Vector]. |
| `static double` | `corr(RDD<Object> x, RDD<Object> y)` Compute the Pearson correlation for the input RDDs. |
| `static double` | `corr(RDD<Object> x, RDD<Object> y, String method)` Compute the correlation for the input RDDs using the specified method. |
| `static Matrix` | `corr(RDD<Vector> X)` Compute the Pearson correlation matrix for the input RDD of Vectors. |
| `static Matrix` | `corr(RDD<Vector> X, String method)` Compute the correlation matrix for the input RDD of Vectors using the specified method. |
public static MultivariateStatisticalSummary colStats(RDD<Vector> X)
X - an RDD[Vector] for which column-wise summary statistics are to be computed.
Returns: MultivariateStatisticalSummary object containing column-wise summary statistics.

public static Matrix corr(RDD<Vector> X)
X - an RDD[Vector] for which the correlation matrix is to be computed.

public static Matrix corr(RDD<Vector> X, String method)
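Compute the correlation matrix for the input RDD of Vectors using the specified method. Supported methods: pearson (default), spearman.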
 
 Note that for Spearman, a rank correlation, we need to create an RDD[Double] for each column
 and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
 which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to
 avoid recomputing the common lineage.
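
The rank-then-correlate idea described above can be sketched on plain arrays. This is only an illustration of the arithmetic, not Spark's implementation: the class and method names are hypothetical, in-memory arrays stand in for RDD columns, and giving tied values the average of their ranks is an assumption of this sketch.

```java
import java.util.Arrays;

// Hypothetical illustration: Spearman correlation as Pearson correlation of ranks.
class SpearmanSketch {

    // 1-based ranks; tied values share the average of their positions (sketch assumption).
    static double[] ranks(double[] v) {
        int n = v.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        // Sorting is the costly step the note above refers to.
        Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));
        double[] r = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && v[idx[j + 1]] == v[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) r[idx[k]] = avg;
            i = j + 1;
        }
        return r;
    }

    // Pearson correlation coefficient of two equal-length series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Rank each series, then correlate the ranks.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }
}
```

Because only the ordering matters, any strictly increasing relationship between the two series gives a Spearman coefficient of 1 in this sketch, even when the relationship is not linear.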
 
X - an RDD[Vector] for which the correlation matrix is to be computed.
method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman.

public static double corr(RDD<Object> x, RDD<Object> y)
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
x - RDD[Double] of the same cardinality as y.
y - RDD[Double] of the same cardinality as x.

public static double corr(RDD<Object> x, RDD<Object> y, String method)
Compute the correlation for the input RDDs using the specified method. Supported methods: pearson (default), spearman.
 Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
x - RDD[Double] of the same cardinality as y.
y - RDD[Double] of the same cardinality as x.
method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman.

public static ChiSqTestResult chiSqTest(Vector observed, Vector expected)
 Note: the two input Vectors need to have the same size.
       observed cannot contain negative values.
       expected cannot contain nonpositive values.
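
As a rough illustration of the statistic behind this test, the sketch below applies the input constraints listed above, rescales expected to the observed total (the rescaling this method documents), and sums (observed - expected)^2 / expected over the categories. It works on plain arrays rather than Vectors, the class and method names are hypothetical, and it is not Spark's implementation.

```java
// Hypothetical illustration of Pearson's chi-squared goodness of fit statistic.
class GoodnessOfFitSketch {

    // Returns sum((O_i - E_i)^2 / E_i), with E rescaled so that sum(E) == sum(O).
    static double statistic(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same size");
        }
        double obsSum = 0, expSum = 0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < 0) {
                throw new IllegalArgumentException("observed cannot contain negative values");
            }
            if (expected[i] <= 0) {
                throw new IllegalArgumentException("expected cannot contain nonpositive values");
            }
            obsSum += observed[i];
            expSum += expected[i];
        }
        // Rescale expected so both vectors describe the same total.
        double scale = obsSum / expSum;
        double stat = 0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * scale;
            double d = observed[i] - e;
            stat += d * d / e;
        }
        return stat;
    }

    // Degrees of freedom for a goodness of fit test over k categories: k - 1.
    static int degreesOfFreedom(double[] observed) {
        return observed.length - 1;
    }
}
```

With observed counts (10, 20, 30) and expected weights (1, 1, 1), the rescaled expected counts are (20, 20, 20), so the statistic is (100 + 0 + 100) / 20 = 10 with 2 degrees of freedom.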
 
observed - Vector containing the observed categorical counts/relative frequencies.
expected - Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.

public static ChiSqTestResult chiSqTest(Vector observed)
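Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.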
 
 Note: observed cannot contain negative values.
 
observed - Vector containing the observed categorical counts/relative frequencies.

public static ChiSqTestResult chiSqTest(Matrix observed)
observed - The contingency matrix (containing either counts or relative frequencies).

public static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data)
data - an RDD[LabeledPoint] containing the labeled dataset with categorical features.
             Real-valued features will be treated as categorical for each distinct value.
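
To make "treated as categorical for each distinct value" concrete, the following sketch builds a contingency table for one feature against the label, giving every distinct value its own row or column, and then computes Pearson's independence statistic from it. It is an illustration on plain arrays under stated assumptions, not Spark's implementation; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a per-feature independence test.
class IndependenceSketch {

    // Counts co-occurrences of (feature value, label); each distinct value is a category.
    static double[][] contingency(double[] feature, double[] label) {
        Map<Double, Integer> rows = new LinkedHashMap<>();
        Map<Double, Integer> cols = new LinkedHashMap<>();
        for (double f : feature) rows.putIfAbsent(f, rows.size());
        for (double l : label) cols.putIfAbsent(l, cols.size());
        double[][] table = new double[rows.size()][cols.size()];
        for (int i = 0; i < feature.length; i++) {
            table[rows.get(feature[i])][cols.get(label[i])] += 1.0;
        }
        return table;
    }

    // Pearson's statistic with expected cell value rowSum * colSum / total.
    // Assumes no row or column sums to 0, as the matrix variant requires.
    static double statistic(double[][] t) {
        int r = t.length, c = t[0].length;
        double[] rowSum = new double[r];
        double[] colSum = new double[c];
        double total = 0;
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                rowSum[i] += t[i][j];
                colSum[j] += t[i][j];
                total += t[i][j];
            }
        }
        double stat = 0;
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                double e = rowSum[i] * colSum[j] / total;
                double d = t[i][j] - e;
                stat += d * d / e;
            }
        }
        return stat;
    }

    // Degrees of freedom for an r x c table: (r - 1) * (c - 1).
    static int degreesOfFreedom(double[][] t) {
        return (t.length - 1) * (t[0].length - 1);
    }
}
```

A continuous feature with many distinct values produces one row per value in this construction, which is why such features are usually binned before a test of this kind.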