pyspark.SparkContext.sequenceFile#
- SparkContext.sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)[source]#
- Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows: - A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes 
- Serialization is attempted via Pickle pickling 
- If this fails, the fallback is to call ‘toString’ on each key and value 
- CPickleSerializeris used to deserialize pickled objects on the Python side
 - New in version 1.3.0. - Parameters
- pathstr
- path to sequencefile 
- keyClass: str, optional
- fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”) 
- valueClassstr, optional
- fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”) 
- keyConverterstr, optional
- fully qualified name of a function returning key WritableConverter 
- valueConverterstr, optional
- fully qualifiedname of a function returning value WritableConverter 
- minSplitsint, optional
- minimum splits in dataset (default min(2, sc.defaultParallelism)) 
- batchSizeint, optional, default 0
- The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically) 
 
- Returns
- RDD
- RDD of tuples of key and corresponding value 
 
 - See also - Examples - >>> import os >>> import tempfile - Set the class of output format - >>> output_format_class = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat" - >>> with tempfile.TemporaryDirectory(prefix="sequenceFile") as d: ... path = os.path.join(d, "hadoop_file") ... ... # Write a temporary Hadoop file ... rdd = sc.parallelize([(1, {3.0: "bb"}), (2, {1.0: "aa"}), (3, {2.0: "dd"})]) ... rdd.saveAsNewAPIHadoopFile(path, output_format_class) ... ... collected = sorted(sc.sequenceFile(path).collect()) - >>> collected [(1, {3.0: 'bb'}), (2, {1.0: 'aa'}), (3, {2.0: 'dd'})]