pyspark.pandas.DataFrame.duplicated
DataFrame.duplicated(subset: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, keep: str = 'first') → Series
Return boolean Series denoting duplicate rows, optionally only considering certain columns.

Parameters
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates; by default, use all of the columns.
keep : {'first', 'last', False}, default 'first'
    - first : Mark duplicates as True except for the first occurrence.
    - last : Mark duplicates as True except for the last occurrence.
    - False : Mark all duplicates as True.
 
 
Returns

duplicated : Series
 
Examples

>>> df = ps.DataFrame({'a': [1, 1, 1, 3], 'b': [1, 1, 1, 4], 'c': [1, 1, 1, 5]},
...                   columns=['a', 'b', 'c'])
>>> df
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  1
3  3  4  5

>>> df.duplicated().sort_index()
0    False
1     True
2     True
3    False
dtype: bool

Mark duplicates as True except for the last occurrence.

>>> df.duplicated(keep='last').sort_index()
0     True
1     True
2    False
3    False
dtype: bool

Mark all duplicates as True.

>>> df.duplicated(keep=False).sort_index()
0     True
1     True
2     True
3    False
dtype: bool
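The subset parameter can be exercised the same way. A brief illustrative sketch (df2 is a hypothetical frame introduced here, and ps is pyspark.pandas as in the examples above): rows 0 and 1 below differ as whole rows but share a value in column 'a', so restricting the check to that column flags row 1 as a duplicate.

>>> df2 = ps.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 3]})  # hypothetical example frame
>>> df2.duplicated(subset='a').sort_index()
0    False
1     True
2    False
3     True
dtype: bool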