pyspark.sql.datasource.DataSourceReader.partitions

DataSourceReader.partitions()
Returns an iterator of partitions for this data source.

Partitions are used to split data reading operations into parallel tasks. If this method returns N partitions, the query planner will create N tasks. Each task then executes DataSourceReader.read() in parallel, using its respective partition value to read the data.

This method is called once during query planning. By default, it returns a single partition with the value None. Subclasses can override this method to return multiple partitions.

It is recommended to override this method for better performance when reading large datasets.

Returns
    sequence of InputPartition
        A sequence of partitions for this data source. Each partition value must be an instance of InputPartition or a subclass of it.
 
Notes

All partition values must be picklable objects.

Examples

Returns a list of integers:

>>> def partitions(self):
...     return [InputPartition(1), InputPartition(2), InputPartition(3)]

Returns a list of strings:

>>> def partitions(self):
...     return [InputPartition("a"), InputPartition("b"), InputPartition("c")]

Returns a list of ranges:

>>> class RangeInputPartition(InputPartition):
...     def __init__(self, start, end):
...         self.start = start
...         self.end = end

>>> def partitions(self):
...     return [RangeInputPartition(1, 3), RangeInputPartition(5, 10)]
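To show how a partition value flows into read(), here is a minimal end-to-end sketch that pairs the range example above with a matching reader. The RangeReader and RangeDataSource class names and the "my_range" format name are illustrative assumptions, not part of the API:

>>> from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
>>> class RangeReader(DataSourceReader):
...     def partitions(self):
...         # Two partitions -> the planner creates two parallel read tasks
...         return [RangeInputPartition(1, 3), RangeInputPartition(5, 10)]
...     def read(self, partition):
...         # Called once per task, with that task's partition value
...         for value in range(partition.start, partition.end + 1):
...             yield (value,)
>>> class RangeDataSource(DataSource):  # illustrative data source wrapping the reader
...     @classmethod
...     def name(cls):
...         return "my_range"
...     def schema(self):
...         return "value int"
...     def reader(self, schema):
...         return RangeReader()

After registering the source with spark.dataSource.register(RangeDataSource), a query such as spark.read.format("my_range").load() would scan both ranges in parallel, one task per partition.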