| Home | Trees | Indices | Help | 
 | 
|---|
|  | 
PySpark supports custom serializers for transferring data; this can improve performance.
By default, PySpark uses PickleSerializer to serialize objects using Python's 
  cPickle serializer, which can serialize nearly any Python 
  object. Other serializers, like MarshalSerializer, support fewer datatypes but can be 
  faster.
The serializer is chosen when creating SparkContext:
>>> from pyspark.context import SparkContext >>> from pyspark.serializers import MarshalSerializer >>> sc = SparkContext('local', 'test', serializer=MarshalSerializer()) >>> sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10) [0, 2, 4, 6, 8, 10, 12, 14, 16, 18] >>> sc.stop()
By default, PySpark serialize objects in batches; the batch size can 
  be controlled through SparkContext's batchSize parameter 
  (the default size is 1024 objects):
>>> sc = SparkContext('local', 'test', batchSize=2) >>> rdd = sc.parallelize(range(16), 4).map(lambda x: x)
Behind the scenes, this creates a JavaRDD with four partitions, each of which contains two batches of two objects:
>>> rdd.glom().collect() [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]] >>> rdd._jrdd.count() 8L >>> sc.stop()
A batch size of -1 uses an unlimited batch size, and a size of 1 disables batching:
>>> sc = SparkContext('local', 'test', batchSize=1) >>> rdd = sc.parallelize(range(16), 4).map(lambda x: x) >>> rdd.glom().collect() [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]] >>> rdd._jrdd.count() 16L
| Classes | |
| PickleSerializer Serializes objects using Python's cPickle serializer: | |
| MarshalSerializer Serializes objects using Python's Marshal serializer: | |
| Home | Trees | Indices | Help | 
 | 
|---|
| Generated by Epydoc 3.0.1 on Wed Apr 9 15:41:34 2014 | http://epydoc.sourceforge.net |