Latent Dirichlet Allocation
spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
summary to get a summary of the fitted LDA model, spark.posterior to compute
posterior probabilities on new data, spark.perplexity to compute log perplexity on new
data and write.ml/read.ml to save/load fitted models.
Usage
spark.lda(data, ...)
spark.posterior(object, newData)
spark.perplexity(object, data)
# S4 method for SparkDataFrame
spark.lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)
# S4 method for LDAModel
summary(object, maxTermsPerTopic)
# S4 method for LDAModel,SparkDataFrame
spark.perplexity(object, data)
# S4 method for LDAModel,SparkDataFrame
spark.posterior(object, newData)
# S4 method for LDAModel,character
write.ml(object, path, overwrite = FALSE)
Arguments
- data
- A SparkDataFrame for training. 
- ...
- additional argument(s) passed to the method. 
- object
- A Latent Dirichlet Allocation model fitted by spark.lda.
- newData
- A SparkDataFrame for testing. 
- features
- Features column name. Either a libSVM-format column or a character-format column is valid.
- k
- Number of topics. 
- maxIter
- Maximum iterations. 
- optimizer
- Optimizer to train an LDA model, "online" or "em", default is "online". 
- subsamplingRate
- (For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1]. 
- topicConcentration
- Concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms. The default of -1 lets the Spark side set the value automatically. Use summary to retrieve the effective topicConcentration. Only a numeric of length 1 is accepted.
- docConcentration
- Concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta). The default of -1 lets the Spark side set the value automatically. Use summary to retrieve the effective docConcentration. Only a numeric of length 1 or length k is accepted.
- customizedStopWords
- Stop words to remove from the given corpus. Ignore this parameter if a libSVM-format column is used as the features column.
- maxVocabSize
- Maximum vocabulary size. Default is 1 << 18 (262144).
- maxTermsPerTopic
- Maximum number of terms to collect for each topic. Default value of 10. 
- path
- The directory where the model is saved. 
- overwrite
- Whether to overwrite the output path if it already exists. Default is FALSE, which means an exception is thrown if the output path exists.
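The arguments above can be combined as in the following minimal sketch, which fits a model on a character-format column with custom stop words and a k-size docConcentration. The toy corpus, the variable names, and the specific values are hypothetical and chosen only for illustration. The Spark calls are guarded so that they run only when SparkR and a Spark installation (detected here through the SPARK_HOME environment variable, an assumption of this sketch) are available.

```r
# Parameter setup in base R; this part runs without Spark.
k <- 3
# One alpha per topic; a single number would also be accepted.
doc_concentration <- rep(0.1, k)
stop_words <- c("the", "a", "of")

# Hypothetical toy corpus, one document per row.
docs <- data.frame(
  text = c("the cat sat on a mat",
           "a dog chased the cat",
           "stocks rose as the markets rallied"),
  stringsAsFactors = FALSE
)

# Guard (an assumption of this sketch): only call Spark when it is available.
if (requireNamespace("SparkR", quietly = TRUE) && nzchar(Sys.getenv("SPARK_HOME"))) {
  library(SparkR)
  sparkR.session()
  df <- createDataFrame(docs)
  model <- spark.lda(
    data = df,
    features = "text",            # character-format column
    k = k,
    maxIter = 10,
    optimizer = "online",
    subsamplingRate = 1,          # use the whole (tiny) corpus in each iteration
    docConcentration = doc_concentration,
    customizedStopWords = stop_words
  )
  # The effective concentration parameters are reported by summary.
  print(summary(model)$docConcentration)
  sparkR.session.stop()
}
```

Passing a single number for docConcentration instead of a k-size vector should apply the same value to every topic.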
Value
spark.lda returns a fitted Latent Dirichlet Allocation model.
summary returns summary information of the fitted model as a list. The list includes
- docConcentration
- concentration parameter, commonly named alpha, for the prior placed on document distributions over topics (theta)
- topicConcentration
- concentration parameter, commonly named beta or eta, for the prior placed on topic distributions over terms
- logLikelihood
- log likelihood of the entire corpus 
- logPerplexity
- log perplexity 
- isDistributed
- TRUE for a distributed model, FALSE for a local model
- vocabSize
- number of terms in the corpus 
- topics
- top terms (10 by default, see maxTermsPerTopic) and their weights for each topic
- vocabulary
- all terms of the training corpus, NULL if a libsvm-format file was used as the training set
- trainingLogLikelihood
- Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
- logPrior
- Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log
perplexity of the training data if the "data" argument is missing.
spark.posterior returns a SparkDataFrame containing posterior probability
vectors in a column named "topicDistribution".
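As a rough illustration of how the value returned by spark.perplexity might be used, the sketch below compares log perplexities for several candidate values of k and converts them to plain perplexity. The numbers are hypothetical placeholders, not results from a real run, and the conversion assumes that the log perplexity is expressed in natural logarithms.

```r
# Hypothetical log perplexity values, standing in for what
# spark.perplexity(model, heldOut) might return for models fitted with
# different k (heldOut is an assumed held-out SparkDataFrame).
log_perplexity <- c(k5 = 7.9, k10 = 7.4, k20 = 7.6)

# Lower perplexity is better, so pick the candidate with the smallest value.
best <- names(which.min(log_perplexity))

# Assuming natural logarithms, exp() converts to plain perplexity.
perplexity <- exp(log_perplexity)

print(best)
print(perplexity)
```

Because exp is monotonic, ranking candidates by log perplexity or by perplexity gives the same ordering.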
Note
spark.lda since 2.1.0
summary(LDAModel) since 2.1.0
spark.perplexity(LDAModel) since 2.1.0
spark.posterior(LDAModel) since 2.1.0
write.ml(LDAModel, character) since 2.1.0
See also
topicmodels: https://cran.r-project.org/package=topicmodels
Examples
if (FALSE) {
text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, optimizer = "em")
# get a summary of the model
summary(model)
# compute posterior probabilities
posterior <- spark.posterior(model, text)
showDF(posterior)
# compute perplexity
perplexity <- spark.perplexity(model, text)
# save and load the model
path <- "path/to/model"
write.ml(model, path)
savedModel <- read.ml(path)
summary(savedModel)
}