
Commit fdce240

fpm doc
1 parent 8aa560b commit fdce240

5 files changed: +220 −5 lines


docs/_data/menu-ml.yaml

2 additions, 0 deletions

```diff
@@ -8,6 +8,8 @@
       url: ml-clustering.html
     - text: Collaborative filtering
       url: ml-collaborative-filtering.html
+    - text: Frequent Pattern Mining
+      url: ml-frequent-pattern-mining.html
     - text: Model selection and tuning
       url: ml-tuning.html
     - text: Advanced topics
```

docs/ml-frequent-pattern-mining.md

68 additions, 0 deletions (new file)
---
layout: global
title: Frequent Pattern Mining
displayTitle: Frequent Pattern Mining
---

Mining frequent items, itemsets, subsequences, or other substructures is usually among the
first steps in analyzing a large-scale dataset, and has been an active research topic in
data mining for years.
We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
for more information.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## FP-Growth

The FP-growth algorithm is described in the paper
[Han et al., Mining frequent patterns without candidate generation](http://dx.doi.org/10.1145/335191.335372),
where "FP" stands for frequent pattern.
Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) algorithms designed for the same purpose,
the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without explicitly generating
candidate sets, which are usually expensive to generate.
After the second step, the frequent itemsets can be extracted from the FP-tree.
In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffixes of transactions,
and is hence more scalable than a single-machine implementation.
We refer users to the papers for more details.
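As a point of reference for *what* FP-growth computes (not *how* it computes it), here is a hedged sketch in plain Python that enumerates frequent itemsets by brute force over the three toy baskets used in the examples further below; the variable names and the naive enumeration are purely illustrative:

```python
from itertools import combinations

transactions = [{"1", "2", "5"}, {"1", "2", "3", "5"}, {"1", "2"}]
min_support = 0.5  # an itemset must appear in at least half the transactions

# Brute force: enumerate every candidate itemset drawn from the items seen,
# then keep those whose support clears the threshold. FP-growth produces the
# same result but avoids this explicit candidate generation by mining a
# compact FP-tree instead.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count / len(transactions) >= min_support:
            frequent[candidate] = count

print(frequent)  # e.g. ("1", "2") -> 3, ("1", "2", "5") -> 2; ("3",) is absent
```

The exponential candidate space of this naive approach is exactly the cost that the FP-tree encoding sidesteps.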
`spark.ml`'s FP-growth implementation takes the following (hyper-)parameters:

* `minSupport`: the minimum support for an itemset to be identified as frequent.
  For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
* `minConfidence`: the minimum confidence for generating association rules. The parameter has no effect during `fit`; it only specifies
  the minimum confidence for generating association rules from frequent itemsets.
* `numPartitions`: the number of partitions used to distribute the work.
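The support and confidence definitions behind these two thresholds can be sketched in plain Python (this is only the arithmetic, not the Spark API; the transactions are the toy baskets from the examples below):

```python
# Support of an itemset = fraction of transactions containing it.
# Confidence of a rule A => B = support(A | B) / support(A).
transactions = [{"1", "2", "5"}, {"1", "2", "3", "5"}, {"1", "2"}]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"5"}))            # 2/3: "5" appears in 2 of the 3 baskets
print(confidence({"5"}, {"1"}))  # 1.0: every basket with "5" also has "1"
```

With `minSupport = 0.5`, the itemset `{5}` (support 2/3) is kept; with `minConfidence = 0.6`, the rule `5 => 1` (confidence 1.0) is emitted.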
The `FPGrowthModel` provides:

* `freqItemsets`: the frequent itemsets, in the format of DataFrame("items"[Seq], "freq"[Long]).
* `associationRules`: the association rules generated with confidence above `minConfidence`, in the format of
  DataFrame("antecedent"[Seq], "consequent"[Seq], "confidence"[Double]).
* `transform`: the transform method examines the input items against all the association rules and
  summarizes the consequents as the prediction. The prediction column has the same data type as the
  input column and does not contain items already present in the input column.
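That transform semantics can be sketched in plain Python (illustrative only, not the Spark implementation): for every rule whose antecedent is fully contained in the input basket, collect the consequent items the basket does not already contain.

```python
# Each rule maps an antecedent itemset to a consequent itemset;
# these rules are hypothetical, chosen to match the toy baskets below.
rules = [
    (frozenset({"5"}), frozenset({"1"})),
    (frozenset({"5"}), frozenset({"2"})),
    (frozenset({"1", "2"}), frozenset({"5"})),
]

def predict(basket):
    basket = set(basket)
    out = []
    for antecedent, consequent in rules:
        if antecedent <= basket:  # rule fires: antecedent fully present
            # Keep only consequent items not already in the basket.
            out.extend(item for item in consequent if item not in basket)
    # De-duplicate; sorted only to make the output deterministic.
    return sorted(set(out))

print(predict({"1", "2"}))  # ['5'] — only the third rule fires
print(predict({"1", "5"}))  # ['2'] — {5}=>{1} fires too, but "1" is already present
```

This mirrors why the prediction column never repeats items from the input column: fired consequents are filtered against the basket before being collected.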
**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.fpm.FPGrowth) for more details.

{% include_example scala/org/apache/spark/examples/ml/FPGrowthExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/fpm/FPGrowth.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaFPGrowthExample.java %}
</div>

</div>
examples/src/main/java/org/apache/spark/examples/ml/JavaFPGrowthExample.java

73 additions, 0 deletions (new file)

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.ml;

// $example on$
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.fpm.FPGrowth;
import org.apache.spark.ml.fpm.FPGrowthModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
// $example off$

public class JavaFPGrowthExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaFPGrowthExample")
      .getOrCreate();

    // $example on$
    List<Row> data = Arrays.asList(
      RowFactory.create(Arrays.asList("1 2 5".split(" "))),
      RowFactory.create(Arrays.asList("1 2 3 5".split(" "))),
      RowFactory.create(Arrays.asList("1 2".split(" ")))
    );
    StructType schema = new StructType(new StructField[]{ new StructField(
      "features", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
    });
    Dataset<Row> itemsDF = spark.createDataFrame(data, schema);

    // Create an FPGrowth estimator with the mining thresholds.
    FPGrowth fpgrowth = new FPGrowth()
      .setMinSupport(0.5)
      .setMinConfidence(0.6);

    FPGrowthModel model = fpgrowth.fit(itemsDF);

    // Get frequent itemsets.
    model.freqItemsets().show();

    // Get generated association rules.
    model.associationRules().show();

    // transform examines the input items against all the association rules and
    // summarizes the consequents as the prediction.
    Dataset<Row> result = model.transform(itemsDF);

    result.show();
    // $example off$

    spark.stop();
  }
}
```
examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala

71 additions, 0 deletions (new file)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.ml

// scalastyle:off println

// $example on$
import org.apache.spark.ml.fpm.FPGrowth
// $example off$
import org.apache.spark.sql.SparkSession

/**
 * An example demonstrating FP-Growth.
 * Run with
 * {{{
 * bin/run-example ml.FPGrowthExample
 * }}}
 */
object FPGrowthExample {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    import spark.implicits._

    // $example on$
    // Loads data.
    val dataset = spark.createDataset(Seq(
      "1 2 5",
      "1 2 3 5",
      "1 2")
    ).map(t => t.split(" ")).toDF("features")

    // Trains an FPGrowth model.
    val fpgrowth = new FPGrowth().setMinSupport(0.5).setMinConfidence(0.6)
    val model = fpgrowth.fit(dataset)

    // Get frequent itemsets.
    model.freqItemsets.show()

    // Get generated association rules.
    model.associationRules.show()

    // transform examines the input items against all the association rules and
    // summarizes the consequents as the prediction.
    model.transform(dataset).show()

    // $example off$

    spark.stop()
  }
}
// scalastyle:on println
```

mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

6 additions, 5 deletions

```diff
@@ -56,8 +56,8 @@ private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPre
   def getMinSupport: Double = $(minSupport)

   /**
-   * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
-   * partition number of the input dataset is used.
+   * Number of partitions (positive) used by parallel FP-growth. By default the param is not set,
+   * and partition number of the input dataset is used.
    * @group expertParam
    */
   @Since("2.2.0")
@@ -240,12 +240,13 @@ class FPGrowthModel private[ml] (
     val predictUDF = udf((items: Seq[_]) => {
       if (items != null) {
         val itemset = items.toSet
-        brRules.value.flatMap(rule =>
-          if (items != null && rule._1.forall(item => itemset.contains(item))) {
+        brRules.value.flatMap { rule =>
+          if (rule._1.forall(item => itemset.contains(item))) {
             rule._2.filter(item => !itemset.contains(item))
           } else {
             Seq.empty
-          })
+          }
+        }
       } else {
         Seq.empty
       }.distinct }, dt)
```
