pyspark.SparkContext.textFile
SparkContext.textFile(name: str, minPartitions: Optional[int] = None, use_unicode: bool = True) → pyspark.rdd.RDD[str]
- Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. The text files must be encoded as UTF-8.

New in version 0.7.0.

- Parameters
- name : str
- directory of the input data files; the path can be a comma-separated list of paths as multiple inputs
- minPartitions : int, optional
- suggested minimum number of partitions for the resulting RDD 
- use_unicode : bool, default True
- If use_unicode is False, the strings will be kept as str (encoded as UTF-8), which is faster and smaller than unicode.

New in version 1.2.0.
 
- Returns
- RDD
- RDD representing text data from the file(s). 
 
- Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path1 = os.path.join(d, "text1")
...     path2 = os.path.join(d, "text2")
...
...     # Write a temporary text file
...     sc.parallelize(["x", "y", "z"]).saveAsTextFile(path1)
...
...     # Write another temporary text file
...     sc.parallelize(["aa", "bb", "cc"]).saveAsTextFile(path2)
...
...     # Load text file
...     collected1 = sorted(sc.textFile(path1, 3).collect())
...     collected2 = sorted(sc.textFile(path2, 4).collect())
...
...     # Load two text files together
...     collected3 = sorted(sc.textFile('{},{}'.format(path1, path2), 5).collect())

>>> collected1
['x', 'y', 'z']
>>> collected2
['aa', 'bb', 'cc']
>>> collected3
['aa', 'bb', 'cc', 'x', 'y', 'z']
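
As a further illustration, here is a minimal sketch assuming a running SparkContext bound to sc (as in the example above); the directory name "text3" and the sample lines are purely illustrative. It shows that the RDD returned by textFile() supports ordinary RDD operations such as count() and flatMap():

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path = os.path.join(d, "text3")
...
...     # Write a small temporary text file with two lines
...     sc.parallelize(["spark rdd", "text file"]).saveAsTextFile(path)
...
...     # Load it back and apply ordinary RDD operations to the lines
...     lines = sc.textFile(path)
...     n = lines.count()
...     words = sorted(lines.flatMap(lambda line: line.split(" ")).collect())
>>> n
2
>>> words
['file', 'rdd', 'spark', 'text']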