# pyspark.sql.functions.collect_set

`pyspark.sql.functions.collect_set(col)`

Aggregate function: collects the values from a column into a set, eliminating duplicates, and returns this set of objects.

New in version 1.6.0.

Changed in version 3.4.0: Supports Spark Connect.

## Parameters

**col** : `Column` or column name
    The target column on which the function is computed.
 
## Returns

`Column`
    A new `Column` object representing the set of collected values, with duplicates excluded.
 
## Notes

This function is non-deterministic: the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.

## Examples

Example 1: Collect values from a DataFrame and sort the result in ascending order

```python
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
>>> df.select(sf.sort_array(sf.collect_set('value')).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
|    [1, 2]|
+----------+
```

Example 2: Collect values from a DataFrame and sort the result in descending order

```python
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df.select(sf.sort_array(sf.collect_set('age'), asc=False).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
|    [5, 2]|
+----------+
```

Example 3: Collect values from a DataFrame with multiple columns and sort the result

```python
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set'))
>>> df.orderBy(sf.desc("name")).show()
+----+----------+
|name|sorted_set|
+----+----------+
|John|    [1, 2]|
| Ana|       [3]|
+----+----------+
```