Migration Guide: SparkR (R on Spark)
- Upgrading from SparkR 3.1 to 3.2
- Upgrading from SparkR 2.4 to 3.0
- Upgrading from SparkR 2.3 to 2.4
- Upgrading from SparkR 2.3 to 2.3.1 and above
- Upgrading from SparkR 2.2 to 2.3
- Upgrading from SparkR 2.1 to 2.2
- Upgrading from SparkR 2.0 to 2.1
- Upgrading from SparkR 1.6 to 2.0
- Upgrading from SparkR 1.5 to 1.6
Note that this migration guide describes the items specific to SparkR. Many of the items in the SQL migration guide also apply when migrating SparkR to higher versions. Please refer to Migration Guide: SQL, Datasets and DataFrame.
Upgrading from SparkR 3.1 to 3.2
- Previously, when SparkR ran in a plain R shell or Rscript and the Spark distribution could not be found, SparkR automatically downloaded and installed it into the user's cache directory to complete the SparkR installation. Now, it asks whether users want to download and install it or not. To restore the previous behavior, set the `SPARKR_ASK_INSTALLATION` environment variable to `FALSE`.
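As a minimal sketch (assuming SparkR is attached without a matching local Spark distribution), the environment variable can be set before the session is created:

```r
library(SparkR)

# Opt out of the interactive prompt so the Spark distribution is downloaded
# and installed automatically, as it was in SparkR 3.1 and earlier.
Sys.setenv(SPARKR_ASK_INSTALLATION = "FALSE")

sparkR.session()  # triggers the download if no Spark distribution is found
```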
Upgrading from SparkR 2.4 to 3.0
- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, `read.json` instead.
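For instance, a sketch of migrating off the removed readers (the file paths are placeholders):

```r
library(SparkR)
sparkR.session()

# Before (removed in 3.0): parquetFile(...), jsonFile(...)
# After:
people  <- read.json("people.json")
parquet <- read.parquet("users.parquet")
write.parquet(people, "people.parquet")
```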
Upgrading from SparkR 2.3 to 2.4
- Previously, we did not check the validity of the size of the last layer in `spark.mlp`. For example, if the training data only has two labels, a `layers` param like `c(1, 3)` did not cause an error; now it does.
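For illustration, a sketch using the built-in iris data (four features, three label values); the last element of `layers` must now match the number of classes:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)

# iris has 4 features and 3 classes, so the last layer must be 3.
model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3), maxIter = 10)

# layers = c(4, 5, 2) would previously train silently; it now raises an error
# because the last layer does not match the number of labels.
```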
Upgrading from SparkR 2.3 to 2.3.1 and above
- In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and treated as 0-based. This could lead to inconsistent substring results and did not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
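A brief sketch of the corrected, 1-based behavior:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = "abcdef", stringsAsFactors = FALSE))

# start = 2, stop = 4 is 1-based, matching base R's substr():
# "bcd" in SparkR 2.3.1 and later ("abc" in 2.3.0 and earlier).
head(select(df, substr(df$x, 2, 4)))
```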
Upgrading from SparkR 2.2 to 2.3
- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
- For `summary`, an option specifying which statistics to compute has been added. Its output differs from that of `describe` (see the sketch after this list).
- A warning can be raised if the versions of the SparkR package and the Spark JVM do not match.
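A short sketch of the `collect` and `summary` changes described above:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)

# stringsAsFactors is now honored by collect(); Species comes back as a factor.
local <- collect(df, stringsAsFactors = TRUE)
class(local$Species)  # "factor"

# summary() now accepts the statistics to compute, so its output can differ
# from describe().
showDF(summary(df, "count", "min", "25%", "75%", "max"))
```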
Upgrading from SparkR 2.1 to 2.2
- A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame`. When splitting the data, the partition position calculation has been made to match the one in Scala.
- The method `createExternalTable` has been deprecated and is replaced by `createTable`. Either method can be called to create an external or managed table (a short sketch follows this list). Additional catalog methods have also been added.
- By default, `derby.log` is now saved to `tempdir()`. This will be created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
- `spark.lda` was not setting the optimizer correctly. It has been corrected.
- Several model summary outputs are updated to have `coefficients` as `matrix`. This includes `spark.logit`, `spark.kmeans`, `spark.glm`. Model summary outputs for `spark.gaussianMixture` have added log-likelihood as `loglik`.
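A sketch of the `numPartitions` argument and the `createTable` replacement mentioned above (the table name and path are placeholders):

```r
library(SparkR)
sparkR.session()

# numPartitions controls how the local data is split across partitions.
df <- createDataFrame(iris, numPartitions = 4)
getNumPartitions(df)  # 4

# createTable supersedes createExternalTable: with a path it creates an
# external table, without one a managed table.
createTable("iris_ext", path = "/tmp/iris_parquet", source = "parquet")
```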
Upgrading from SparkR 2.0 to 2.1
- `join` no longer performs a Cartesian product by default; use `crossJoin` instead.
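A minimal sketch of the change:

```r
library(SparkR)
sparkR.session()

df1 <- createDataFrame(data.frame(a = 1:3))
df2 <- createDataFrame(data.frame(b = 4:5))

# join(df1, df2) with no join expression no longer produces a Cartesian
# product; request it explicitly instead.
count(crossJoin(df1, df2))  # 6
```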
Upgrading from SparkR 1.6 to 2.0
- The method `table` has been removed and replaced by `tableToDF`.
- The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
- Spark's `SQLContext` and `HiveContext` have been deprecated and are replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` to instantiate the SparkSession (a short sketch follows this list). Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
- The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix `spark.executorEnv.VAR_NAME`, for example, `spark.executorEnv.PATH`.
- The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
- The method `registerTempTable` has been deprecated and is replaced by `createOrReplaceTempView`.
- The method `dropTempTable` has been deprecated and is replaced by `dropTempView`.
- The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
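A short sketch of the new session entry point and the related replacements above (the executor PATH value and view name are placeholders):

```r
library(SparkR)

# sparkR.session() replaces sparkR.init(); executor environment variables are
# passed as spark.executorEnv.* config properties.
sparkR.session(sparkConfig = list(spark.executorEnv.PATH = "/usr/local/bin"))

df <- createDataFrame(iris)  # no sqlContext argument needed anymore

# createOrReplaceTempView replaces registerTempTable,
# and tableToDF replaces table.
createOrReplaceTempView(df, "iris_view")
same <- tableToDF("iris_view")
```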
Upgrading from SparkR 1.5 to 1.6
- Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API (a short sketch follows this list).
- SparkSQL converts `NA` in R to `null` and vice-versa.
- Since 1.6.1, the `withColumn` method in SparkR supports adding a new column to, or replacing an existing column of the same name in, a DataFrame.
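A brief sketch of the write mode and `withColumn` behavior (the output path is a placeholder):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)

# Since 1.6.0 the default save mode is "error"; pass mode = "append"
# explicitly to keep the pre-1.6 behavior.
write.df(df, path = "/tmp/iris_parquet", source = "parquet", mode = "append")

# Since 1.6.1, withColumn can replace an existing column of the same name.
df2 <- withColumn(df, "Sepal_Length", df$Sepal_Length * 2)
```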