Spark Core#
Public Classes#
- Main entry point for Spark functionality.
- A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
- A broadcast variable created with SparkContext.broadcast().
- A shared variable that can be accumulated, i.e., has a commutative and associative "add" operation.
- Helper object that defines how to accumulate values of a given type.
- Configuration for a Spark application.
- Resolves paths to files added through SparkContext.addFile().
- Flags for controlling the storage of an RDD.
- Contextual information about a task that can be read or mutated during execution.
- Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together.
- A TaskContext with extra contextual info and tooling for tasks in a barrier stage.
- Carries all task infos of a barrier task.
- Thread that is recommended to be used in PySpark when the pinned thread mode is enabled.
- Provides a utility method to determine Spark versions from a given input string.
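
A minimal sketch of how these classes fit together (assuming a local master and an application name chosen purely for illustration):

```python
from pyspark import SparkConf, SparkContext, StorageLevel

# Build a configuration and use it to create the main entry point.
conf = SparkConf().setAppName("core-overview").setMaster("local[2]")
sc = SparkContext(conf=conf)

# An RDD is the basic abstraction; persist() takes a StorageLevel flag.
rdd = sc.parallelize(range(10)).persist(StorageLevel.MEMORY_ONLY)
print(rdd.count())  # 10

sc.stop()
```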
Spark Context APIs#
- Create an Accumulator with the given initial value.
- Add an archive to be downloaded with this Spark job on every node.
- Add a file to be downloaded with this Spark job on every node.
- Add a tag to be assigned to all the jobs started by this thread.
- Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future.
- A unique identifier for the Spark application.
- Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array.
- Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
- Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions.
- Cancel all jobs that have been scheduled or are running.
- Cancel active jobs for the specified group.
- Cancel active jobs that have the specified tag.
- Clear the current thread's job tags.
- Default minimum number of partitions for Hadoop RDDs when not given by the user.
- Default level of parallelism to use when not given by the user (e.g., for reduce tasks).
- Dump the profile stats into a directory path.
- Create an RDD that has no partitions or elements.
- Return the directory where RDDs are checkpointed.
- Return a copy of this SparkContext's configuration as a SparkConf.
- Get the tags that are currently set to be assigned to all the jobs started by this thread.
- Get a local property set in this thread, or null if it is missing.
- Get or instantiate a SparkContext and register it as a singleton object.
- Get a Java system property, such as java.home.
- Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
- Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
- Returns a list of archive paths that are added to resources.
- Returns a list of file paths that are added to resources.
- Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
- Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
- Distribute a local Python collection to form an RDD.
- Load an RDD previously saved using RDD.saveAsPickleFile().
- Create a new RDD of int containing elements from start to end (exclusive), increased by step every element.
- Return the resource information of this SparkContext.
- Remove a tag previously added to be assigned to all the jobs started by this thread.
- Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
- Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
- Set the directory under which RDDs are going to be checkpointed.
- Set the behavior of job cancellation from jobs started in this thread.
- Set a human-readable description of the current job.
- Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
- Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
- Control the log level.
- Set a Java system property, such as spark.executor.memory.
- Print the profile stats to stdout.
- Get SPARK_USER for the user who is running SparkContext.
- Return the epoch time when the SparkContext was started.
- Return a StatusTracker object.
- Shut down the SparkContext.
- Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
- Return the URL of the SparkUI instance started by this SparkContext.
- Build the union of a list of RDDs.
- The version of Spark on which this application is running.
- Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
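
A short sketch exercising a few of the SparkContext methods listed above (local master, toy data, and names chosen purely for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "context-api-sketch")

# Distribute a local collection and create a numeric range RDD.
nums = sc.parallelize([1, 2, 3, 4])
evens = sc.range(0, 10, step=2)

# Broadcast a read-only lookup table to the executors.
lookup = sc.broadcast({"a": 1, "b": 2})
print(nums.map(lambda x: x + lookup.value["a"]).collect())  # [2, 3, 4, 5]

# Build the union of a list of RDDs and inspect the application.
both = sc.union([nums, evens])
print(both.count(), sc.applicationId, sc.version)

sc.stop()
```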
RDD APIs#
- Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value".
- Aggregate the values of each key, using given combine functions and a neutral "zero value".
- Marks the current stage as a barrier stage, where Spark must launch all tasks together.
- Persist this RDD with the default storage level (MEMORY_ONLY).
- Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
- Mark this RDD for checkpointing.
- Removes an RDD's shuffles and its non-persisted ancestors.
- Return a new RDD that is reduced into numPartitions partitions.
- For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
- Return a list that contains all the elements in this RDD.
- Return the key-value pairs in this RDD to the master as a dictionary.
- When collecting an RDD, use this method to specify a job group.
- Generic function to combine the elements for each key using a custom set of aggregation functions.
- The SparkContext that this RDD was created on.
- Return the number of elements in this RDD.
- Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
- Return approximate number of distinct elements in the RDD.
- Count the number of elements for each key, and return the result to the master as a dictionary.
- Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
- Return a new RDD containing the distinct elements in this RDD.
- Return a new RDD containing only the elements that satisfy a predicate.
- Return the first element in this RDD.
- Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
- Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.
- Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value".
- Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).
- Applies a function to all elements of this RDD.
- Applies a function to each partition of this RDD.
- Perform a full outer join of self and other.
- Gets the name of the file to which this RDD was checkpointed.
- Returns the number of partitions of this RDD.
- Get the ResourceProfile specified with this RDD or None if it wasn't specified.
- Get the RDD's current storage level.
- Return an RDD created by coalescing all elements within each partition into a list.
- Return an RDD of grouped items.
- Group the values for each key in the RDD into a single sequence.
- Alias for cogroup but with support for multiple RDDs.
- Compute a histogram using the provided buckets.
- A unique ID for this RDD (within its SparkContext).
- Return the intersection of this RDD and another one.
- Return whether this RDD is checkpointed and materialized, either reliably or locally.
- Returns true if and only if the RDD contains no elements at all.
- Return whether this RDD is marked for local checkpointing.
- Return an RDD containing all pairs of elements with matching keys in self and other.
- Creates tuples of the elements in this RDD by applying f.
- Return an RDD with the keys of each tuple.
- Perform a left outer join of self and other.
- Mark this RDD for local checkpointing using Spark's existing caching layer.
- Return the list of values in the RDD for the given key.
- Return a new RDD by applying a function to each element of this RDD.
- Return a new RDD by applying a function to each partition of this RDD.
- Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
- Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
- Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
- Find the maximum item in this RDD.
- Compute the mean of this RDD's elements.
- Approximate operation to return the mean within a timeout or meet the confidence.
- Find the minimum item in this RDD.
- Return the name of this RDD.
- Return a copy of the RDD partitioned using the specified partitioner.
- Set this RDD's storage level to persist its values across operations after the first time it is computed.
- Return an RDD created by piping elements to a forked external process.
- Randomly splits this RDD with the provided weights.
- Reduces the elements of this RDD using the specified commutative and associative binary operator.
- Merge the values for each key using an associative and commutative reduce function.
- Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
- Return a new RDD that has exactly numPartitions partitions.
- Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
- Perform a right outer join of self and other.
- Return a sampled subset of this RDD.
- Return a subset of this RDD sampled by key (via stratified sampling).
- Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
- Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
- Save this RDD as a SequenceFile of serialized objects.
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using "org.apache.hadoop.io.Writable" types converted from the RDD's key and value types.
- Save this RDD as a text file, using string representations of elements.
- Assign a name to this RDD.
- Sorts this RDD by the given keyfunc.
- Sorts this RDD, which is assumed to consist of (key, value) pairs.
- Return a StatCounter object that captures the mean, variance and count of the RDD's elements in one operation.
- Compute the standard deviation of this RDD's elements.
- Return each value in self that is not contained in other.
- Return each (key, value) pair in self that has no pair with matching key in other.
- Add up the elements in this RDD.
- Approximate operation to return the sum within a timeout or meet the confidence.
- Take the first num elements of the RDD.
- Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
- Return a fixed-size sampled subset of this RDD.
- A description of this RDD and its recursive dependencies for debugging.
- Return an iterator that contains all of the elements in this RDD.
- Get the top N elements from an RDD.
- Aggregates the elements of this RDD in a multi-level tree pattern.
- Reduces the elements of this RDD in a multi-level tree pattern.
- Return the union of this RDD and another one.
- Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
- Return an RDD with the values of each tuple.
- Compute the variance of this RDD's elements.
- Specify a ResourceProfile to use when computing this RDD.
- Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
- Zips this RDD with its element indices.
- Zips this RDD with generated unique Long ids.
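
A small sketch exercising a handful of the RDD operations above (word counts, numeric summaries, and a pair-RDD join); the data and application name are invented for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-api-sketch")

# Classic transformation/action pipeline: map -> reduceByKey -> collect.
words = sc.parallelize(["spark", "rdd", "spark", "core"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(sorted(counts.collect()))  # [('core', 1), ('rdd', 1), ('spark', 2)]

# Numeric helpers operate on RDDs of numbers.
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.sum(), nums.mean(), nums.stdev())

# Pair-RDD joins keep only the matching keys.
other = sc.parallelize([("spark", "unified engine"), ("sql", "module")])
print(counts.join(other).collect())  # [('spark', (2, 'unified engine'))]

sc.stop()
```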
Broadcast and Accumulator#
- Destroy all data and metadata related to this broadcast variable.
- Write a pickled representation of value to the open file or socket.
- Read a pickled representation of value from the open file or socket.
- Read the pickled representation of an object from the open file and return the reconstituted object hierarchy specified therein.
- Delete cached copies of this broadcast on the executors.
- Return the broadcasted value.
- Adds a term to this accumulator's value.
- Get the accumulator's value; only usable in the driver program.
- Add two values of the accumulator's data type, returning a new value; for efficiency, can also update value1 in place and return it.
- Provide a "zero value" for the type, compatible in dimensions with the provided value (e.g., a zero vector).
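
A sketch of the broadcast/accumulator pattern described by the entries above; the unit-conversion data is invented for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "shared-vars-sketch")

# Broadcast: read-only on the executors; .value reads it back.
factors = sc.broadcast({"kg": 1000, "g": 1})

# Accumulator: executors only add to it; the driver reads .value.
bad_records = sc.accumulator(0)

def to_grams(record):
    unit, amount = record
    if unit not in factors.value:
        bad_records.add(1)
        return 0
    return amount * factors.value[unit]

data = sc.parallelize([("kg", 2), ("g", 500), ("lb", 1)])
print(data.map(to_grams).sum())  # 2500
print(bad_records.value)         # 1, visible only after the action has run

# Release executor-side copies once the broadcast is no longer needed.
factors.unpersist()
sc.stop()
```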
Management#
- Return a thread target wrapper that is recommended to be used in PySpark when the pinned thread mode is enabled.
- Does this configuration contain a given key?
- Get the configured value for some key, or return a default otherwise.
- Get all values as a list of key-value pairs.
- Set a configuration property.
- Set multiple parameters, passed as a list of key-value pairs.
- Set application name.
- Set an environment variable to be passed to executors.
- Set a configuration property, if not already set.
- Set the master URL to connect to.
- Set the path where Spark is installed on worker nodes.
- Returns a printable version of the configuration, as a list of key=value pairs, one per line.
- Get the absolute path of a file added through SparkContext.addFile().
- Get the root directory that contains files added through SparkContext.addFile().
- How many times this task has been attempted.
- CPUs allocated to the task.
- Return the currently active TaskContext.
- Get a local property set upstream in the driver, or None if it is missing.
- The ID of the RDD partition that is computed by this task.
- Resources allocated to the task.
- The ID of the stage that this task belongs to.
- An ID that is unique to this task attempt (within the same SparkContext, no two task attempts will share the same attempt ID).
- Returns a new RDD by applying a function to each partition of the wrapped RDD, where tasks are launched together in a barrier stage.
- Returns a new RDD by applying a function to each partition of the wrapped RDD, while tracking the index of the original partition.
- This function blocks until all tasks in the same stage have reached this routine.
- How many times this task has been attempted.
- Sets a global barrier and waits until all tasks in this stage hit this barrier.
- CPUs allocated to the task.
- Return the currently active BarrierTaskContext.
- Get a local property set upstream in the driver, or None if it is missing.
- Returns the BarrierTaskInfo for all tasks in this barrier stage, ordered by partition ID.
- The ID of the RDD partition that is computed by this task.
- Resources allocated to the task.
- The ID of the stage that this task belongs to.
- An ID that is unique to this task attempt (within the same SparkContext, no two task attempts will share the same attempt ID).
- Given a Spark version string, return the (major version number, minor version number).
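
A sketch combining SparkConf configuration with TaskContext introspection inside a task; the settings and data are illustrative only:

```python
from pyspark import SparkConf, SparkContext, TaskContext

# SparkConf collects key-value settings before the context is created.
conf = (SparkConf()
        .setAppName("management-sketch")
        .setMaster("local[2]")
        .setIfMissing("spark.executor.memory", "1g"))
print(conf.toDebugString())

sc = SparkContext(conf=conf)

# TaskContext.get() is only meaningful inside a task running on an executor.
def tag_with_partition(iterator):
    ctx = TaskContext.get()
    for x in iterator:
        yield (ctx.partitionId(), ctx.stageId(), x)

print(sc.parallelize(range(4), 2).mapPartitions(tag_with_partition).collect())
sc.stop()
```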