pyspark.RDD.sampleByKey
- RDD.sampleByKey(withReplacement, fractions, seed=None)
Return a subset of this RDD sampled by key (stratified sampling). Each key is sampled at its own rate, as specified by fractions, a map from key to sampling rate.
New in version 0.7.0.
- Parameters
- withReplacement : bool
whether to sample with or without replacement
- fractions : dict
map of specific keys to sampling rates
- seed : int, optional
seed for the random number generator
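- Returns
- RDD
a new RDD containing the subset of this RDD's key-value pairs selected by the stratified sampling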
See also
Examples
>>> fractions = {"a": 0.2, "b": 0.1}
>>> rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 1000)))
>>> sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())
>>> 100 < len(sample["a"]) < 300 and 50 < len(sample["b"]) < 150
True
>>> max(sample["a"]) <= 999 and min(sample["a"]) >= 0
True
>>> max(sample["b"]) <= 999 and min(sample["b"]) >= 0
True
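The doctest above checks sample sizes against ranges rather than exact counts, which suggests the per-key sample size is approximate. A rough mental model for the without-replacement case is sketched below in plain Python, with no Spark required. It assumes each (key, value) pair is kept independently with probability fractions[key]. The helper name sample_by_key_sketch is hypothetical and is not part of the PySpark API, and this is not Spark's implementation.

```python
import random


def sample_by_key_sketch(pairs, fractions, seed=None):
    """Conceptual model only (hypothetical helper, not PySpark code).

    Assumption: each (key, value) pair is kept independently with
    probability fractions[key]. A key that is missing from `fractions`
    raises KeyError in this sketch.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


fractions = {"a": 0.2, "b": 0.1}

# Same shape of data as the doctest: every key paired with 0..999.
pairs = [(k, v) for k in fractions for v in range(1000)]

sample = sample_by_key_sketch(pairs, fractions, seed=2)

# Count how many values survived for each key.
counts = {k: sum(1 for key, _ in sample if key == k) for k in fractions}
print(counts)
```

Under this model the count for each key lands near fraction * 1000 but varies from seed to seed, which is why range checks are a more robust assertion than exact counts.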