MLlib (DataFrame-based)
Note
From Apache Spark 4.0.0, all built-in algorithms support Spark Connect.
Pipeline APIs
Abstract class for transformers that transform one dataset into another.
Abstract class for transformers that take one input column, apply transformation, and output the result as a new column.
Abstract class for estimators that fit models to data.
Abstract class for models that are fitted by estimators.
Estimator for prediction tasks (regression and classification).
Model for prediction tasks (regression and classification).
A simple pipeline, which acts as an estimator.
Represents a compiled pipeline with transformers and fitted models.
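The way these abstractions fit together can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration, not the pyspark.ml classes, and modeling a dataset as a list of dicts is an assumption of the sketch. A transformer maps one dataset to another, an estimator's fit returns a model that is itself a transformer, and a pipeline fits its stages in order to produce a fitted pipeline made only of transformers.

```python
class AddConstant:
    """Toy transformer (hypothetical): reads column 'x', writes column 'y'."""

    def __init__(self, value):
        self.value = value

    def transform(self, rows):
        return [{**r, "y": r["x"] + self.value} for r in rows]


class MeanCenterer:
    """Toy estimator (hypothetical): learns the mean of column 'y'."""

    def fit(self, rows):
        mean = sum(r["y"] for r in rows) / len(rows)
        return MeanCentererModel(mean)


class MeanCentererModel:
    """Model returned by MeanCenterer.fit; it is itself a transformer."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "centered": r["y"] - self.mean} for r in rows]


class SketchPipeline:
    """Acts as an estimator: fitting it yields a SketchPipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            # Estimators are fitted on the data produced by earlier stages.
            if hasattr(stage, "fit"):
                stage = stage.fit(current)
            fitted.append(stage)
            current = stage.transform(current)
        return SketchPipelineModel(fitted)


class SketchPipelineModel:
    """Fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = SketchPipeline([AddConstant(10.0), MeanCenterer()]).fit(rows)
result = model.transform(rows)
```

In this toy run the second stage is fitted on the output of the first, so the learned mean refers to column 'y' rather than to the raw input column.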
Parameters
A param with self-contained documentation.
Components that take parameters.
Factory methods for common type conversion functions for Param.typeConverter.
Feature
Binarize a column of continuous features given a threshold.
LSH class for Euclidean distance metrics.
Model fitted by
Maps a column of continuous features to a column of feature buckets.
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
Model fitted by
Extracts a vocabulary from document collections and generates a
Model fitted by
A feature transformer that takes the 1D discrete cosine transform of a real vector.
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector.
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space).
Maps a sequence of terms to their term frequencies using the hashing trick.
Compute the Inverse Document Frequency (IDF) given a collection of documents.
Model fitted by
Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located.
Model fitted by
A
Implements the feature interaction transform.
Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature.
Model fitted by
LSH class for Jaccard distance.
Model produced by
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
Model fitted by
A feature transformer that converts the input array of strings into an array of n-grams.
Normalize a vector to have unit norm using the given p-norm.
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
Model fitted by
PCA trains a model to project vectors to a lower dimensional space of the top
Model fitted by
Perform feature expansion in a polynomial space.
RobustScaler removes the median and scales the data according to the quantile range.
Model fitted by
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false).
Implements the transforms required for fitting a dataset against an R model formula.
Model fitted by
Implements the transforms which are defined by SQL statement.
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Model fitted by
A feature transformer that filters out stop words from input.
A label indexer that maps a string column of labels to an ML column of label indices.
Model fitted by
Target Encoding maps a column of categorical indices into a numerical feature derived from the target.
Model fitted by
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
Feature selector based on univariate statistical tests against labels.
Model fitted by
Feature selector that removes all low-variance features.
Model fitted by
A feature transformer that merges multiple columns into a vector column.
Class for indexing categorical feature columns in a dataset of Vector.
Model fitted by
A feature transformer that adds size information to the metadata of a vector column.
This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
Model fitted by
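As an illustration of the min-max normalization described in this section, the following plain-Python sketch rescales a single column of numbers to a target range. The function name `min_max_rescale` and its parameters are hypothetical and chosen for this example, and the handling of a constant column (mapping it to the midpoint of the target range) is an assumption of the sketch. It shows the arithmetic only, not the PySpark estimator.

```python
def min_max_rescale(column, new_min=0.0, new_max=1.0):
    """Linearly rescale a list of numbers to the range [new_min, new_max].

    The column minimum and maximum play the role of the column summary
    statistics that a fitted scaler would store.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption of this sketch: a constant column maps to the midpoint.
        return [0.5 * (new_min + new_max)] * len(column)
    scale = (new_max - new_min) / (hi - lo)
    return [(x - lo) * scale + new_min for x in column]


scaled = min_max_rescale([2.0, 4.0, 6.0, 10.0])
constant = min_max_rescale([3.0, 3.0], new_min=-1.0, new_max=1.0)
```

Here the smallest input value lands on `new_min` and the largest on `new_max`, with every other value placed proportionally in between.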
Classification
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer.
Model fitted by LinearSVC.
Abstraction for LinearSVC Results for a given model.
Abstraction for LinearSVC Training results.
Logistic regression.
Model fitted by LogisticRegression.
Abstraction for Logistic Regression Results for a given model.
Abstraction for multinomial Logistic Regression Training results.
Binary Logistic regression results for a given model.
Binary Logistic regression training results for a given model.
Decision tree learning algorithm for classification.
Model fitted by DecisionTreeClassifier.
Gradient-Boosted Trees (GBTs) learning algorithm for classification.
Model fitted by GBTClassifier.
Random Forest learning algorithm for classification.
Model fitted by RandomForestClassifier.
Abstraction for RandomForestClassification Results for a given model.
Abstraction for RandomForestClassification Training results.
BinaryRandomForestClassification results for a given model.
BinaryRandomForestClassification training results for a given model.
Naive Bayes Classifiers.
Model fitted by NaiveBayes.
Classifier trainer based on the Multilayer Perceptron.
Model fitted by MultilayerPerceptronClassifier.
Abstraction for MultilayerPerceptronClassifier Results for a given model.
Abstraction for MultilayerPerceptronClassifier Training results.
Reduction of Multiclass Classification to Binary Classification.
Model fitted by OneVsRest.
Factorization Machines learning algorithm for classification.
Model fitted by
Abstraction for FMClassifier Results for a given model.
Abstraction for FMClassifier Training results.
Clustering
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
Model fitted by BisectingKMeans.
Bisecting KMeans clustering results for a given model.
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al.).
Model fitted by KMeans.
Summary of KMeans.
GaussianMixture clustering.
Model fitted by GaussianMixture.
Gaussian mixture clustering results for a given model.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Latent Dirichlet Allocation (LDA) model.
Local (non-distributed) model fitted by
Distributed model fitted by
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen.
Functions
Converts a column of array of numeric type into a column of pyspark.ml.linalg.DenseVector instances.
Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
Given a function which loads a model and returns a predict function for inference over a batch of numpy inputs, returns a Pandas UDF wrapper for inference over a Spark DataFrame.
Vector and Matrix
A dense vector represented by a value array.
A simple sparse vector class for passing data to MLlib.
Factory methods for working with vectors.
Column-major dense matrix.
Sparse Matrix stored in CSC format.
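The two matrix layouts named above can be related with a short plain-Python sketch. The helper `dense_to_csc` is hypothetical and written for this example, and it does not use the pyspark.ml.linalg classes. It assumes the usual conventions: a column-major value array stores entry (i, j) at index i + j * num_rows, and compressed sparse column (CSC) format stores column pointers, row indices, and the nonzero values.

```python
def dense_to_csc(num_rows, num_cols, values):
    """Convert a column-major value array into CSC arrays.

    Returns (col_ptrs, row_indices, nonzeros), where the nonzeros of
    column j are nonzeros[col_ptrs[j]:col_ptrs[j + 1]].
    """
    col_ptrs = [0]
    row_indices = []
    nonzeros = []
    for j in range(num_cols):
        for i in range(num_rows):
            v = values[i + j * num_rows]  # column-major lookup of entry (i, j)
            if v != 0.0:
                row_indices.append(i)
                nonzeros.append(v)
        # Each pointer marks where the next column starts in `nonzeros`.
        col_ptrs.append(len(nonzeros))
    return col_ptrs, row_indices, nonzeros


# The 3 x 2 matrix
#   [[1, 0],
#    [0, 3],
#    [2, 0]]
# laid out column by column:
column_major = [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
col_ptrs, row_indices, nonzeros = dense_to_csc(3, 2, column_major)
```

The column pointer array always has one more entry than the number of columns, and its last entry equals the number of stored nonzeros.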
Recommendation
Alternating Least Squares (ALS) matrix factorization.
Model fitted by ALS.
Regression
Accelerated Failure Time (AFT) Model Survival Regression.
Model fitted by
Decision tree learning algorithm for regression.
Model fitted by
Gradient-Boosted Trees (GBTs) learning algorithm for regression.
Model fitted by
Generalized Linear Regression.
Model fitted by
Generalized linear regression results evaluated on a dataset.
Generalized linear regression training results.
Currently implemented using parallelized pool adjacent violators algorithm.
Model fitted by
Linear regression.
Model fitted by
Linear regression results evaluated on a dataset.
Linear regression training results.
Random Forest learning algorithm for regression.
Model fitted by
Factorization Machines learning algorithm for regression.
Model fitted by
Statistics
Conduct Pearson's independence test for every feature against the label.
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution.
Represents a (mean, cov) tuple.
Tools for vectorized statistics on MLlib Vectors.
A builder object that provides summary statistics about a given column.
Tuning
Builder for a param grid used in grid search-based model selection.
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data.
Validation for hyper-parameter tuning.
Model from train validation split.
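The fold arithmetic in the K-fold description can be checked with a small plain-Python sketch. The helper `k_fold_pairs` is hypothetical and written for this example. It deals shuffled rows into k equal-sized folds, which is a simplifying assumption and not necessarily how the library assigns rows to folds.

```python
import random


def k_fold_pairs(rows, k, seed=0):
    """Return k (training, test) pairs built from non-overlapping folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows round-robin into k folds.
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the held-out one.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs


pairs = k_fold_pairs(range(9), k=3)
```

With 9 rows and k=3, each pair holds 6 training rows and 3 test rows, which matches the 2/3 and 1/3 split described above, and every row appears in exactly one test set.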
Evaluation
Base class for evaluators that compute metrics from predictions.
Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column.
Evaluator for Regression, which expects input columns prediction, label and an optional weight column.
Evaluator for Multiclass Classification, which expects input columns: prediction, label, weight (optional) and probabilityCol (only for logLoss).
Evaluator for Multilabel Classification, which expects two input columns: prediction and label.
Evaluator for Clustering results, which expects two input columns: prediction and features.
Evaluator for Ranking, which expects two input columns: prediction and label.
Frequency Pattern Mining
A parallel FP-growth algorithm to mine frequent itemsets.
Model fitted by FPGrowth.
A parallel PrefixSpan algorithm to mine frequent sequential patterns.
Image
Internal class for pyspark.ml.image.ImageSchema attribute.
Internal class for pyspark.ml.image.ImageSchema attribute.
Distributor
A class to support distributed training on PyTorch and PyTorch Lightning using PySpark.
Utilities
Base class for MLWriter and MLReader.
Helper trait for making simple
Specialization of
Helper trait for making simple
Specialization of
Utility class that can save ML instances in different formats.
Base class for models that provides Training summary.
Object with a unique ID.
Mixin for instances that provide
Utility class that can load ML instances.
Mixin for ML instances that provide
Utility class that can save ML instances.