pyspark.sql.SparkSession.createDataFrame
SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.

New in version 2.0.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
data : RDD or iterable
    an RDD of any kind of SQL data representation (Row, tuple, int, boolean, dict, etc.), or a list, pandas.DataFrame or numpy.ndarray (see the ndarray example under Examples below).
schema : pyspark.sql.types.DataType, str or list, optional
    a pyspark.sql.types.DataType or a datatype string or a list of column names; default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<>.

    When schema is a list of column names, the type of each column will be inferred from data.

    When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.

    When schema is a pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not a pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be "value". Each record will also be wrapped into a tuple, which can be converted to a row later (see the atomic-type example under Examples below).
samplingRatio : float, optional
    the sample ratio of rows used for inferring the schema. The first few rows will be used if samplingRatio is None. This option is effective only when the input is an RDD (see the sampling example under Examples below).
verifySchema : bool, optional
    verify data types of every row against schema. Enabled by default. When the input is a pandas.DataFrame and spark.sql.execution.arrow.pyspark.enabled is enabled, this option is not effective; Arrow type coercion is applied instead. This option is not supported with Spark Connect.

    New in version 2.1.0.
 
Returns

DataFrame
Notes

Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.

Examples

Create a DataFrame from a list of tuples.

>>> spark.createDataFrame([('Alice', 1)]).show()
+-----+---+
|   _1| _2|
+-----+---+
|Alice|  1|
+-----+---+

Create a DataFrame from a list of dictionaries.

>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).show()
+---+-----+
|age| name|
+---+-----+
|  1|Alice|
+---+-----+

Create a DataFrame with column names specified.

>>> spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+

Create a DataFrame with the explicit schema specified.

>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([('Alice', 1)], schema).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+

Create a DataFrame with the schema in DDL formatted string.

>>> spark.createDataFrame([('Alice', 1)], "name: string, age: int").show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+

Create an empty DataFrame. When initializing an empty DataFrame in PySpark, it's mandatory to specify its schema, as the DataFrame lacks data from which the schema can be inferred.

>>> spark.createDataFrame([], "name: string, age: int").show()
+----+---+
|name|age|
+----+---+
+----+---+

Create a DataFrame from Row objects.

>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> df = spark.createDataFrame([Person("Alice", 1)])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+

Create a DataFrame from a pandas DataFrame.

>>> spark.createDataFrame(df.toPandas()).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
>>> import pandas
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).show()
+---+---+
|  0|  1|
+---+---+
|  1|  2|
+---+---+
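
Create a DataFrame from a numpy.ndarray. This is a minimal sketch of the ndarray input mentioned for the data parameter; it assumes numpy is installed, a Spark version that accepts ndarray input, and the column names 'a' and 'b' are supplied here only for illustration.

>>> import numpy as np
>>> spark.createDataFrame(np.array([[1, 2], [3, 4]]), ['a', 'b']).show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+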
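
Create a DataFrame with an atomic-type schema. As described for the schema parameter, a non-StructType schema is wrapped into a StructType with a single field named "value"; a minimal sketch of that case, using the same spark session as the examples above:

>>> from pyspark.sql.types import IntegerType
>>> spark.createDataFrame([1, 2, 3], IntegerType()).show()
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+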
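
Infer the schema from a sampled subset of an RDD. This sketch illustrates samplingRatio, which only takes effect when data is an RDD and schema is None; the dtypes shown assume the default inference rules (Python int maps to bigint, float to double).

>>> from pyspark.sql import Row
>>> rdd = spark.sparkContext.parallelize([Row(a=i, b=float(i)) for i in range(100)])
>>> spark.createDataFrame(rdd, samplingRatio=0.5).dtypes
[('a', 'bigint'), ('b', 'double')]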