pyspark.pandas.read_spark_io

pyspark.pandas.read_spark_io(path: Optional[str] = None, format: Optional[str] = None, schema: Union[str, StructType] = None, index_col: Union[str, List[str], None] = None, **options: Any) → pyspark.pandas.frame.DataFrame
Load a DataFrame from a Spark data source.

Parameters
path : string, optional
    Path to the data source.
format : string, optional
    Specifies the input data source format. Some common ones are:

    - 'delta'
    - 'parquet'
    - 'orc'
    - 'json'
    - 'csv'
 
schema : string or StructType, optional
    Input schema. If None, Spark tries to infer the schema automatically. The schema can either be a Spark StructType, or a DDL-formatted string like col0 INT, col1 DOUBLE.
index_col : str or list of str, optional, default: None
    Index column of table in Spark.
options : dict
    All other options passed directly into Spark's data source.
 
See Also

DataFrame.to_spark_io
read_table
read_delta
read_parquet
Examples

>>> ps.range(1).to_spark_io('%s/read_spark_io/data.parquet' % path)
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.parquet' % path, format='parquet', schema='id long')
   id
0   0

>>> ps.range(10, 15, num_partitions=1).to_spark_io('%s/read_spark_io/data.json' % path,
...                                                format='json', lineSep='__')
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.json' % path, format='json', schema='id long', lineSep='__')
   id
0  10
1  11
2  12
3  13
4  14

You can preserve the index in the roundtrip as below.

>>> ps.range(10, 15, num_partitions=1).to_spark_io('%s/read_spark_io/data.orc' % path,
...                                                format='orc', index_col="index")
>>> ps.read_spark_io(
...     path=r'%s/read_spark_io/data.orc' % path, format="orc", index_col="index")
       id
index
0      10
1      11
2      12
3      13
4      14