pyspark.pandas.Series.groupby
- Series.groupby(by, axis=0, as_index=True, dropna=True)
- Group DataFrame or Series using one or more columns.
- A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
- Parameters
- by: Series, label, or list of labels
- Used to determine the groups for the groupby. If Series is passed, the Series or dict VALUES will be used to determine the groups. A label or list of labels may be passed to group by the columns in self.
- axis: int, default 0 or ‘index’
- Currently, only 0 is supported.
- as_index: bool, default True
- For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output. 
- dropna: bool, default True
- If True, and if group keys contain NA values, the NA values together with their row/column will be dropped. If False, NA values will also be treated as keys in groups.
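The effect of dropna can be sketched in plain Python, without Spark. This is only an illustration of the semantics described above, not how pandas-on-Spark implements grouping. The helper name `group_sum` is hypothetical, and `None` stands in for an NA key; the data are columns "b" and "c" from the example on this page.

```python
from collections import defaultdict


def group_sum(keys, values, dropna=True):
    """Hypothetical helper: sum `values` per key, mimicking the dropna flag."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        if key is None and dropna:
            continue  # dropna=True: rows whose key is missing are skipped
        totals[key] += value  # dropna=False: None becomes a group key too
    return dict(totals)


b = [2, None, 1, 2]  # group keys (column "b" of the example data)
c = [3, 4, 3, 2]     # values to aggregate (column "c")

kept_default = group_sum(b, c)            # NA key dropped
kept_all = group_sum(b, c, dropna=False)  # NA key kept as its own group
print(kept_default)
print(kept_all)
```

With dropna left at its default, the row keyed by the missing value contributes to no group; with dropna=False it forms a group of its own.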
 
- Returns
- DataFrameGroupBy or SeriesGroupBy
- The type depends on the calling object. The returned groupby object contains information about the groups.
 
- See also
- pyspark.pandas.groupby.GroupBy

- Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]},
...                   columns=['Animal', 'Max Speed'])
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

>>> df.groupby(['Animal']).mean().sort_index()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

>>> df.groupby(['Animal'], as_index=False).mean().sort_values('Animal')
...
   Animal  Max Speed
...Falcon      375.0
...Parrot       25.0

We can also choose whether to include NA in group keys by setting the dropna parameter; the default is True:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = ps.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum().sort_index()
     a  c
b
1.0  2  3
2.0  2  5

>>> df.groupby(by=["b"], dropna=False).sum().sort_index()
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
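The split, apply, and combine steps mentioned in the description can be sketched in plain Python. This is a conceptual illustration only and makes no claim about how pandas-on-Spark executes a groupby; the function name `split_apply_combine` is hypothetical, and the data are the Animal and Max Speed columns from the example above.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(keys, values, func):
    """Hypothetical helper showing the three conceptual groupby steps."""
    # Split: collect the values that share a key.
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Apply + combine: reduce each group and gather results, sorted by key.
    return {key: func(vals) for key, vals in sorted(groups.items())}


animals = ["Falcon", "Falcon", "Parrot", "Parrot"]
speeds = [380.0, 370.0, 24.0, 26.0]

result = split_apply_combine(animals, speeds, mean)
print(result)
```

The resulting mapping from group key to aggregated value plays the role of the grouped output indexed by Animal in the mean() example.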