# Accessing OpenStack Swift from Spark
Spark's support for Hadoop `InputFormat` allows it to process data in OpenStack Swift using the
same URI formats as in Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://container.PROVIDER/path`. You will also need to set your
Swift security credentials, through `core-site.xml` or via
`SparkContext.hadoopConfiguration`.
The current Swift driver requires Swift to use the Keystone authentication method, or
its Rackspace-specific predecessor.
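Once the credentials are in place, a Swift path can be read like any other Hadoop-compatible filesystem. Below is a minimal sketch, assuming a provider named `SparkTest` (configured as shown later on this page) and a hypothetical container `logs` that already holds a file `data.txt`:

```scala
import org.apache.spark.sql.SparkSession

object SwiftReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SwiftReadExample")
      .getOrCreate()

    // `logs` is a hypothetical container under the provider `SparkTest`;
    // the Keystone credentials are assumed to be set in core-site.xml.
    val lines = spark.sparkContext.textFile("swift://logs.SparkTest/data.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}
```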
## Configuring Swift for Better Data Locality
Although not mandatory, it is recommended to configure the Swift proxy server with
the `list_endpoints` middleware to get better data locality. More information is
available in the Swift `list_endpoints` middleware documentation.
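As a rough sketch, enabling the middleware amounts to adding a `list-endpoints` filter to the proxy server's pipeline in `proxy-server.conf`. The pipeline below is illustrative only; the exact set of filters depends on your deployment:

```ini
# Illustrative proxy-server.conf fragment; the surrounding pipeline
# entries will vary with your Swift deployment.
[pipeline:main]
pipeline = catch_errors cache list-endpoints authtoken keystoneauth proxy-server

[filter:list-endpoints]
use = egg:swift#list_endpoints
```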
## Dependencies
The Spark application should include the `hadoop-openstack` dependency, which can
be done by including the `hadoop-cloud` module for the specific version of Spark used.
For example, for Maven support, add the following to the `pom.xml` file:
```xml
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>hadoop-cloud_2.12</artifactId>
    <version>${spark.version}</version>
  </dependency>
  ...
</dependencyManagement>
```

## Configuration Parameters
Create `core-site.xml` and place it inside Spark's `conf` directory.
The main parameters to configure are the authentication parameters
required by Keystone.
The following table lists the Keystone parameters. `PROVIDER` can be
any (alphanumeric) name.
| Property Name | Meaning | Required |
|---|---|---|
| `fs.swift.service.PROVIDER.auth.url` | Keystone authentication URL | Mandatory |
| `fs.swift.service.PROVIDER.auth.endpoint.prefix` | Keystone endpoints prefix | Optional |
| `fs.swift.service.PROVIDER.tenant` | Tenant | Mandatory |
| `fs.swift.service.PROVIDER.username` | Username | Mandatory |
| `fs.swift.service.PROVIDER.password` | Password | Mandatory |
| `fs.swift.service.PROVIDER.http.port` | HTTP port | Mandatory |
| `fs.swift.service.PROVIDER.region` | Keystone region | Mandatory |
| `fs.swift.service.PROVIDER.public` | Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints | Mandatory |
For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing`
defined for tenant `test`. Then `core-site.xml` should include:
```xml
<configuration>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
```

Notice that `fs.swift.service.PROVIDER.tenant`,
`fs.swift.service.PROVIDER.username`, and
`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in
`core-site.xml` is not always a good approach.
We suggest keeping those parameters in `core-site.xml` for testing purposes when running Spark
via `spark-shell`.
For job submissions, they should be provided via `sparkContext.hadoopConfiguration`, as in the sketch below.
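Here is a minimal sketch of setting the same Keystone parameters programmatically, using the hypothetical `SparkTest` provider and the `tester`/`testing`/`test` credentials from the example above:

```scala
import org.apache.spark.sql.SparkSession

object SwiftCredentialsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SwiftCredentialsExample")
      .getOrCreate()

    // Same Keystone parameters as in the core-site.xml example, set
    // programmatically so no credentials are shipped in a config file.
    val conf = spark.sparkContext.hadoopConfiguration
    conf.set("fs.swift.service.SparkTest.auth.url", "http://127.0.0.1:5000/v2.0/tokens")
    conf.set("fs.swift.service.SparkTest.tenant", "test")
    conf.set("fs.swift.service.SparkTest.username", "tester")
    conf.set("fs.swift.service.SparkTest.password", "testing")
    conf.set("fs.swift.service.SparkTest.http.port", "8080")
    conf.set("fs.swift.service.SparkTest.region", "RegionOne")
    conf.setBoolean("fs.swift.service.SparkTest.public", true)

    // `logs` is a hypothetical container under the SparkTest provider.
    val lines = spark.sparkContext.textFile("swift://logs.SparkTest/data.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}
```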