pyspark.pandas.sql
pyspark.pandas.sql(query: str, index_col: Union[str, List[str], None] = None, args: Union[Dict[str, Any], List, None] = None, **kwargs: Any) → pyspark.pandas.frame.DataFrame
Execute a SQL query and return the result as a pandas-on-Spark DataFrame.

This function acts as a standard Python string formatter and understands the following variable types (the string case is sketched right after this list):

- pandas-on-Spark DataFrame
- pandas-on-Spark Series
- pandas DataFrame
- pandas Series
- string
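The string case does not appear in the Examples below, so here is a minimal sketch of it, assuming an active Spark session (the variable name col is hypothetical): a plain Python string is substituted into the query text as-is, standard-formatter style, so it can stand in for a SQL fragment such as a column name.

>>> import pyspark.pandas as ps
>>> col = "id"
>>> ps.sql("SELECT {col} FROM range(3)", col=col)
   id
0   0
1   1
2   2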
The method can also bind named parameters to SQL literals from args.

Parameters
query : str
    the SQL query
index_col : str or list of str, optional
    Column names to be used in Spark to represent pandas-on-Spark's index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.

    Note: If you want to preserve the index, explicitly use DataFrame.reset_index() and pass the result to the SQL statement with the index_col parameter.

    For example,

    >>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['a', 'b', 'c'])
    >>> new_psdf = psdf.reset_index()
    >>> ps.sql("SELECT * FROM {new_psdf}", index_col="index", new_psdf=new_psdf)
           A  B
    index
    a      1  4
    b      2  5
    c      3  6

    For MultiIndex,

    >>> psdf = ps.DataFrame(
    ...     {"A": [1, 2, 3], "B": [4, 5, 6]},
    ...     index=pd.MultiIndex.from_tuples(
    ...         [("a", "b"), ("c", "d"), ("e", "f")], names=["index1", "index2"]
    ...     ),
    ... )
    >>> new_psdf = psdf.reset_index()
    >>> ps.sql(
    ...     "SELECT * FROM {new_psdf}", index_col=["index1", "index2"], new_psdf=new_psdf)
                   A  B
    index1 index2
    a      b       1  4
    c      d       2  5
    e      f       3  6

    Also note that the index name(s) should match the existing names.
args : dict or list
    A dictionary of parameter names to Python objects, or a list of Python objects, that can be converted to SQL literal expressions. See Supported Data Types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html) for supported value types in Python. For example: dictionary keys "rank", "name", "birthdate"; dictionary values 1, "Steven", datetime.date(2023, 4, 2). A value can also be a Column of a literal expression, in which case it is taken as is (a sketch of this case appears just before the Examples below).

    New in version 3.4.0.

    Changed in version 3.5.0: Added positional parameters.
kwargs
    other variables that the user wants to set that can be referenced in the query
Returns
    pandas-on-Spark DataFrame
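The Column case mentioned under args is not exercised in the Examples below, so here is a minimal sketch of it, assuming an active Spark session: a Column built with pyspark.sql.functions.lit is taken as the literal itself rather than being converted from a Python value.

>>> from pyspark.sql.functions import lit
>>> ps.sql("SELECT * FROM range(10) WHERE id > :bound", args={"bound": lit(7)})
   id
0   8
1   9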
 
Examples

Calling a built-in SQL function.

>>> ps.sql("SELECT * FROM range(10) where id > 7")
   id
0   8
1   9

>>> ps.sql("SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9)
   id
0   8

>>> mydf = ps.range(10)
>>> x = tuple(range(4))
>>> ps.sql("SELECT {ser} FROM {mydf} WHERE id IN {x}", ser=mydf.id, mydf=mydf, x=x)
   id
0   0
1   1
2   2
3   3

Mixing pandas-on-Spark and pandas DataFrames in a join operation. Note that the index is dropped.

>>> ps.sql('''
...     SELECT m1.a, m2.b
...     FROM {table1} m1 INNER JOIN {table2} m2
...     ON m1.key = m2.key
...     ORDER BY m1.a, m2.b''',
...     table1=ps.DataFrame({"a": [1, 2], "key": ["a", "b"]}),
...     table2=pd.DataFrame({"b": [3, 4, 5], "key": ["a", "b", "b"]}))
   a  b
0  1  3
1  2  4
2  2  5

Also, it is possible to query using Series.

>>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['a', 'b', 'c'])
>>> ps.sql("SELECT {mydf.A} FROM {mydf}", mydf=psdf)
   A
0  1
1  2
2  3

And substitute named parameters with the : prefix by SQL literals.

>>> ps.sql("SELECT * FROM range(10) WHERE id > :bound1", args={"bound1": 7})
   id
0   8
1   9

Or positional parameters marked by ? in the SQL query by SQL literals.

>>> ps.sql("SELECT * FROM range(10) WHERE id > ?", args=[7])
   id
0   8
1   9
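One more hedged sketch, assuming an active Spark session: binding a datetime.date through args produces a typed DATE literal rather than raw text interpolation (the name and birthdate keys reuse the illustrative values from the args description).

>>> import datetime
>>> ps.sql(
...     "SELECT :name AS name, :birthdate AS birthdate",
...     args={"name": "Steven", "birthdate": datetime.date(2023, 4, 2)})
     name   birthdate
0  Steven  2023-04-02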