JSON Files
Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row].
This conversion can be done using SparkSession.read.json() on either a Dataset[String],
or a JSON file.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine option to true.
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)
// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset<Row>.
This conversion can be done using SparkSession.read().json() on either a Dataset<String>,
or a JSON file.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine option to true.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");
// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// |  name|
// +------+
// |Justin|
// +------+
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset<String> storing one JSON object per string.
List<String> jsonData = Arrays.asList(
        "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
Dataset<String> anotherPeopleDataset = spark.createDataset(jsonData, Encoders.STRING());
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleDataset);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
This conversion can be done using SparkSession.read.json on a JSON file.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine parameter to True.
# spark is from the previous example.
sc = spark.sparkContext
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)
# The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+
# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. using
the read.json() function, which loads data from a directory of JSON files where each line of the
files is a JSON object.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set a named parameter multiLine to TRUE.
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)
# The inferred schema can be visualized using the printSchema() method.
printSchema(people)
## root
##  |-- age: long (nullable = true)
##  |-- name: string (nullable = true)
# Register this DataFrame as a table.
createOrReplaceTempView(people, "people")
# SQL statements can be run by using the sql methods.
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
##     name
## 1 JustinCREATE TEMPORARY VIEW jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)
SELECT * FROM jsonTableData Source Option
Data source options of JSON can be set via:
- the .option/.optionsmethods of- DataFrameReader
- DataFrameWriter
- DataStreamReader
- DataStreamWriter
 
- the built-in functions below
    - from_json
- to_json
- schema_of_json
 
- OPTIONSclause at CREATE TABLE USING DATA_SOURCE
| Property Name | Default | Meaning | Scope | 
|---|---|---|---|
| timeZone | (value of spark.sql.session.timeZoneconfiguration) | Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of timeZoneare supported:
 | read/write | 
| primitivesAsString | false | Infers all primitive values as a string type. | read | 
| prefersDecimal | false | Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. | read | 
| allowComments | false | Ignores Java/C++ style comment in JSON records. | read | 
| allowUnquotedFieldNames | false | Allows unquoted JSON field names. | read | 
| allowSingleQuotes | true | Allows single quotes in addition to double quotes. | read | 
| allowNumericLeadingZeros | false | Allows leading zeros in numbers (e.g. 00012). | read | 
| allowBackslashEscapingAnyCharacter | false | Allows accepting quoting of all character using backslash quoting mechanism. | read | 
| mode | PERMISSIVE | Allows a mode for dealing with corrupt records during parsing. 
 | read | 
| columnNameOfCorruptRecord | (value of spark.sql.columnNameOfCorruptRecordconfiguration) | Allows renaming the new field having malformed string created by PERMISSIVEmode. This overrides spark.sql.columnNameOfCorruptRecord. | read | 
| dateFormat | yyyy-MM-dd | Sets the string that indicates a date format. Custom date formats follow the formats at datetime pattern. This applies to date type. | read/write | 
| timestampFormat | yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] | Sets the string that indicates a timestamp format. Custom date formats follow the formats at datetime pattern. This applies to timestamp type. | read/write | 
| timestampNTZFormat | yyyy-MM-dd'T'HH:mm:ss[.SSS] | Sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats at Datetime Patterns. This applies to timestamp without timezone type, note that zone-offset and time-zone components are not supported when writing or reading this data type. | read/write | 
| enableDateTimeParsingFallback | Enabled if the time parser policy has legacy settings or if no custom date or timestamp pattern was provided. | Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns. | read | 
| multiLine | false | Parse one record, which may span multiple lines, per file. JSON built-in functions ignore this option. | read | 
| allowUnquotedControlChars | false | Allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. | read | 
| encoding | Detected automatically when multiLineis set totrue(for reading),UTF-8(for writing) | For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files. JSON built-in functions ignore this option. | read/write | 
| lineSep | \r,\r\n,\n(for reading),\n(for writing) | Defines the line separator that should be used for parsing. JSON built-in functions ignore this option. | read/write | 
| samplingRatio | 1.0 | Defines fraction of input JSON objects used for schema inferring. | read | 
| dropFieldIfAllNull | false | Whether to ignore column of all null values or empty array during schema inference. | read | 
| locale | en-US | Sets a locale as language tag in IETF BCP 47 format. For instance, localeis used while parsing dates and timestamps. | read | 
| allowNonNumericNumbers | true | Allows JSON parser to recognize set of “Not-a-Number” (NaN) tokens as legal floating number values. 
 | read | 
| compression | (none) | Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). JSON built-in functions ignore this option. | write | 
| ignoreNullFields | (value of spark.sql.jsonGenerator.ignoreNullFieldsconfiguration) | Whether to ignore null fields when generating JSON objects. | write | 
Other generic options can be found in Generic File Source Options.