matching in strategies.scala

octaviansima · octaviansima · commit 7c89dd136f02 · 2021-02-23T21:16:35.000Z
set up class thing cleanup added test cases for non-equi left anti join rename to serializeEquiJoinExpression added isEncrypted condition set up keys JoinExpr now has condition rename serialization does not throw compile error for BNLJ split up added condition in ExpressionEvaluation.h zipPartitions cpp put in place typo added func to header two loops in place update tests condition fixed scala loop interchange rows added tags ensure cached == match working comparison decoupling in ExpressionEvalulation save compiles and condition works is printing fix swap outer/inner o_i_match show() has the same result tests pass test cleanup added test cases for different condition BuildLeft works optional keys in scala started C++ passes the operator tests comments, cleanup attemping to do it the ~right~ way comments to distinguish between primary/secondary, operator tests pass cleanup comments, about to begin implementation for distinct agg ops is_distinct added test case serializing with isDistinct is_distinct in ExpressionEvaluation.h removed unused code from join implementation remove RowWriter/Reader in condition evaluation (join) easier test serialization done correct checking in Scala set is set up spaghetti but it finally works function for clearing values condition_eval isntead of condition goto comment remove explain from test, need to fix distinct aggregation for >1 partitions started impl of multiple partitions fix added rangepartitionexec that runs partitioning cleanup serialization properly comments, generalization for > 1 distinct function comments about to refactor into logical.Aggregation the new case has distinct in result expressions need to match on distinct removed new case (doesn't make difference?) works Upgrade to OE 0.12 (mc2-project#153) Update README.md Support for scalar subquery (mc2-project#157) This PR implements the scalar subquery expression, which is triggered whenever a subquery returns a scalar value. 
There were two main problems that needed to be solved. First, support for matching the scalar subquery expression is necessary. Spark implements this by wrapping a SparkPlan within the expression and calls executeCollect. Then it constructs a literal with that value. However, this is problematic for us because that value should not be decrypted by the driver and serialized into an expression, since it's an intermediate value. Therefore, the second issue to be addressed here is supporting an encrypted literal. This is implemented in this PR by serializing an encrypted ciphertext into a base64 encoded string, and wrapping a Decrypt expression on top of it. This expression is then evaluated in the enclave and returns a literal. Note that, in order to test our implementation, we also implement a Decrypt expression in Scala. However, this should never be evaluated on the driver side and serialized into a plaintext literal. This is because Decrypt is designated as a Nondeterministic expression, and therefore will always evaluate on the workers. 
match remove RangePartitionExec inefficient implementation refined
Add TPC-H Benchmarks (mc2-project#139)
* logic decoupling in TPCH.scala for easier benchmarking
* added TPCHBenchmark.scala
* Benchmark.scala rewrite
* done adding all support TPC-H query benchmarks
* changed commandline arguments that benchmark takes
* TPCHBenchmark takes in parameters
* fixed issue with spark conf
* size error handling, --help flag
* add Utils.force, break cluster mode
* comment out logistic regression benchmark
* ensureCached right before temp view created/replaced
* upgrade to 3.0.1
* upgrade to 3.0.1
* 10 scale factor
* persistData
* almost done refactor
* more cleanup
* compiles
* 9 passes
* cleanup
* collect instead of force, sf_none
* remove sf_none
* defaultParallelism
* no removing trailing/leading whitespace
* add sf_med
* hdfs works in local case
* cleanup, added new CLI argument
* added newly supported tpch queries
* function for running all supported tests complete instead of partial -> final removed traces of join cleanup
diff --git a/src/enclave/Enclave/Aggregate.cpp b/src/enclave/Enclave/Aggregate.cpp
@@ -30,8 +30,8 @@ void non_oblivious_aggregate(
     count += 1;
   }
 
-  // Skip outputting the final row if the number of input rows is 0 AND
-  // 1. It's a grouping aggregation, OR
+  // Skip outputting the final row if:
+  // 1. The number of input rows is 0 AND it's a grouping aggregation, OR
   // 2. It's a global aggregation, the mode is final
   if (!(count == 0 && (agg_op_eval.get_num_grouping_keys() > 0 || (agg_op_eval.get_num_grouping_keys() == 0 && !is_partial)))) {
     w.append(agg_op_eval.evaluate());
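
The condition in the hunk above can be read as a small truth table. A minimal sketch, assuming a hypothetical free function `should_emit_final_row` that is not part of the enclave code; it takes the same three inputs the `if` inspects:

```cpp
#include <cstddef>

// Hypothetical helper (not in the Opaque source): mirrors the `if` in
// non_oblivious_aggregate. Returns true when the final row should be written.
inline bool should_emit_final_row(std::size_t count,
                                  std::size_t num_grouping_keys,
                                  bool is_partial) {
  const bool grouping = num_grouping_keys > 0;
  const bool global_final = num_grouping_keys == 0 && !is_partial;
  // Skip only when there were no input rows AND the aggregation is either
  // a grouping aggregation or a global aggregation in final mode.
  const bool skip = count == 0 && (grouping || global_final);
  return !skip;
}
```

Read this way, the zero-row requirement applies to both numbered cases in the comment: an empty global aggregation in partial mode still emits a row, while every other empty-input case emits nothing.
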
diff --git a/src/enclave/Enclave/ExpressionEvaluation.h b/src/enclave/Enclave/ExpressionEvaluation.h
@@ -1811,6 +1811,9 @@ class AggregateExpressionEvaluator {
         std::unique_ptr<FlatbuffersExpressionEvaluator>(
           new FlatbuffersExpressionEvaluator(eval_expr)));
     }
+    is_distinct = expr->is_distinct();
+    value_selector = std::unique_ptr<FlatbuffersExpressionEvaluator>(
+        new FlatbuffersExpressionEvaluator(expr->value_selector()));
   }
 
   std::vector<const tuix::Field *> initial_values(const tuix::Row *unused) {
@@ -1824,6 +1827,15 @@ class AggregateExpressionEvaluator {
   std::vector<const tuix::Field *> update(const tuix::Row *concat) {
     std::vector<const tuix::Field *> result;
     for (auto&& e : update_evaluators) {
+      if (is_distinct) {
+        std::string value = to_string(value_selector->eval(concat));
+        /* Check to see if this distinct value has already been counted */
+        if (observed_values.count(value)) { 
+          std::vector<const tuix::Field *> vect(1, nullptr);
+          return vect;
+        }
+        observed_values.insert(value);
+      }
       result.push_back(e->eval(concat));
     }
     return result;
@@ -1837,11 +1849,18 @@ class AggregateExpressionEvaluator {
     return result;
   }
 
+  void clear_observed_values() {
+    observed_values.clear();
+  }
+
 private:
   flatbuffers::FlatBufferBuilder builder;
   std::vector<std::unique_ptr<FlatbuffersExpressionEvaluator>> initial_value_evaluators;
   std::vector<std::unique_ptr<FlatbuffersExpressionEvaluator>> update_evaluators;
   std::vector<std::unique_ptr<FlatbuffersExpressionEvaluator>> evaluate_evaluators;
+  bool is_distinct;
+  std::unique_ptr<FlatbuffersExpressionEvaluator> value_selector;
+  std::set<std::string> observed_values;
 };
 
 class FlatbuffersAggOpEvaluator {
@@ -1880,6 +1899,7 @@ class FlatbuffersAggOpEvaluator {
     // Write initial values to a
     std::vector<flatbuffers::Offset<tuix::Field>> init_fields;
     for (auto&& e : aggregate_evaluators) {
+      e->clear_observed_values();
       for (auto f : e->initial_values(nullptr)) {
         init_fields.push_back(flatbuffers_copy<tuix::Field>(f, builder2));
       }
@@ -1901,6 +1921,7 @@ class FlatbuffersAggOpEvaluator {
   void aggregate(const tuix::Row *row) {
     builder.Clear();
     flatbuffers::Offset<tuix::Row> concat;
+    int a_length = a->field_values()->size();
 
     std::vector<flatbuffers::Offset<tuix::Field>> concat_fields;
     // concat row to a
@@ -1918,9 +1939,18 @@ class FlatbuffersAggOpEvaluator {
     std::vector<flatbuffers::Offset<tuix::Field>> output_fields;
     for (auto&& e : aggregate_evaluators) {
       for (auto f : e->update(concat_ptr)) {
+        if (f == nullptr) { // Only triggered on EXPR(distinct expr ...)
+          output_fields.clear();
+          for (int i = 0; i < a_length; i++) {
+            auto f = concat_ptr->field_values()->Get(i);
+            output_fields.push_back(flatbuffers_copy<tuix::Field>(f, builder2));
+          }
+          goto save_a;
+        } 
         output_fields.push_back(flatbuffers_copy<tuix::Field>(f, builder2));
       }
     }
+save_a:
     a = flatbuffers::GetTemporaryPointer<tuix::Row>(
       builder2, tuix::CreateRowDirect(builder2, &output_fields));
   }
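
The distinct handling above boils down to remembering, per group, the string form of every value already counted and skipping repeats. A minimal standalone sketch, assuming a hypothetical `DistinctCounter` class and plain `std::string` inputs in place of `tuix::Field` values; it is an illustration, not the enclave implementation:

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the distinct path of AggregateExpressionEvaluator.
class DistinctCounter {
public:
  // Returns true if `value` was new and the count was updated,
  // false if it had already been observed in the current group.
  bool update(const std::string &value) {
    // std::set::insert reports via .second whether insertion took place,
    // so one lookup covers both the check and the insert.
    if (!observed_values_.insert(value).second) {
      return false;
    }
    ++count_;
    return true;
  }

  // Call when a new group starts; the diff calls clear_observed_values()
  // at the point where the initial values are written.
  void reset() {
    observed_values_.clear();
    count_ = 0;
  }

  std::size_t count() const { return count_; }

private:
  std::set<std::string> observed_values_;
  std::size_t count_ = 0;
};
```

The sketch signals a repeat through a `bool` return value, whereas the diff returns a one-element vector holding `nullptr` and leaves the running aggregate unchanged; the effect on the count should be the same.
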
diff --git a/src/flatbuffers/operators.fbs b/src/flatbuffers/operators.fbs
@@ -38,6 +38,9 @@ table AggregateExpr {
     initial_values: [Expr];
     update_exprs: [Expr];
     evaluate_exprs: [Expr];
+    // Items below are used for EXPR(distinct col_name ...)
+    is_distinct: bool;
+    value_selector: Expr;
 }
 // Supported: Average, Count, First, Last, Max, Min, Sum
 
diff --git a/src/main/scala/edu/berkeley/cs/rise/opaque/Utils.scala b/src/main/scala/edu/berkeley/cs/rise/opaque/Utils.scala
@@ -1371,13 +1371,25 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray)
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
         )
 
       case c @ Count(children) =>
         val count = c.aggBufferAttributes(0)
         // COUNT(*) should count NULL values
         // COUNT(expr) should return the number or rows for which the supplied expressions are non-NULL
+        // COUNT(distinct expr ...) should return the number of rows that contain UNIQUE values of expr
+
+        val ar = e.aggregateFunction.children(0)
+        val colNum = concatSchema.indexWhere(_.semanticEquals(ar))
+        val (isDistinct, valueSelector) = (e.isDistinct, colNum) match {
+          case (true, x) if x >= 0 => // If colNum < 0, then the given schema does not contain the attribute
+            (true, flatbuffersSerializeExpression(builder, ar, concatSchema))
+          case _ =>
+            (false, 0)
+        }
 
         val (updateExprs: Seq[Expression], evaluateExprs: Seq[Expression]) = e.mode match {
           case Partial => {
@@ -1396,7 +1408,7 @@ object Utils extends Logging {
             val countUpdateExpr = Add(count, Literal(1L))
             (Seq(countUpdateExpr), Seq(count))
           }
-          case _ => 
+          case _ =>
         }
 
         tuix.AggregateExpr.createAggregateExpr(
@@ -1410,7 +1422,9 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray)
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          isDistinct,
+          valueSelector
         )
 
       case f @ First(child, false) =>
@@ -1449,7 +1463,10 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray))
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
+        )
 
       case l @ Last(child, false) =>
         val last = l.aggBufferAttributes(0)
@@ -1487,7 +1504,10 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray))
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
+        )
 
       case m @ Max(child) =>
         val max = m.aggBufferAttributes(0)
@@ -1520,7 +1540,10 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray))
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
+        )
 
       case m @ Min(child) =>
         val min = m.aggBufferAttributes(0)
@@ -1553,7 +1576,10 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray))
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
+        )
 
       case s @ Sum(child) =>
         val sum = s.aggBufferAttributes(0)
@@ -1591,7 +1617,10 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray))
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
+        )
 
       case vs @ ScalaUDAF(Seq(child), _: VectorSum, _, _) =>
         val sum = vs.aggBufferAttributes(0)
@@ -1626,7 +1655,10 @@ object Utils extends Logging {
             updateExprs.map(e => flatbuffersSerializeExpression(builder, e, concatSchema)).toArray),
           tuix.AggregateExpr.createEvaluateExprsVector(
             builder,
-            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray))
+            evaluateExprs.map(e => flatbuffersSerializeExpression(builder, e, aggSchema)).toArray),
+          false,
+          0
+        )
     }
   }
 
diff --git a/src/main/scala/edu/berkeley/cs/rise/opaque/strategies.scala b/src/main/scala/edu/berkeley/cs/rise/opaque/strategies.scala
@@ -109,25 +109,47 @@ object OpaqueOperators extends Strategy {
         if (isEncrypted(child) && aggExpressions.forall(expr => expr.isInstanceOf[AggregateExpression])) =>
 
       val aggregateExpressions = aggExpressions.map(expr => expr.asInstanceOf[AggregateExpression])
-
-      if (groupingExpressions.size == 0) {
-        // Global aggregation
-        val partialAggregate = EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Partial, planLater(child))
-        val partialOutput = partialAggregate.output
-        val (projSchema, tag) = tagForGlobalAggregate(partialOutput)
-
-        EncryptedProjectExec(resultExpressions, 
-          EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Final, 
-            EncryptedProjectExec(partialOutput, 
-              EncryptedSortExec(Seq(SortOrder(tag, Ascending)), true, 
-                EncryptedProjectExec(projSchema, partialAggregate))))) :: Nil
-      } else {
-        // Grouping aggregation
-        EncryptedProjectExec(resultExpressions,
-          EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Final,
-            EncryptedSortExec(groupingExpressions.map(_.toAttribute).map(e => SortOrder(e, Ascending)), true,
-              EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Partial,
-                EncryptedSortExec(groupingExpressions.map(e => SortOrder(e, Ascending)), false, planLater(child)))))) :: Nil
+      val (functionsWithDistinct, functionsWithoutDistinct) = aggregateExpressions.partition(_.isDistinct)
+
+      functionsWithDistinct.size match {
+        case size if size == 0 => // No distinct aggregate operations
+          if (groupingExpressions.size == 0) {
+            // Global aggregation
+            val partialAggregate = EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Partial, planLater(child))
+            val partialOutput = partialAggregate.output
+            val (projSchema, tag) = tagForGlobalAggregate(partialOutput)
+
+            EncryptedProjectExec(resultExpressions, 
+              EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Final, 
+                EncryptedProjectExec(partialOutput,
+                  EncryptedSortExec(Seq(SortOrder(tag, Ascending)), true, 
+                    EncryptedProjectExec(projSchema, partialAggregate))))) :: Nil
+          } else {
+            // Grouping aggregation
+            EncryptedProjectExec(resultExpressions,
+              EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Final,
+                EncryptedSortExec(groupingExpressions.map(_.toAttribute).map(e => SortOrder(e, Ascending)), true,
+                  EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Partial,
+                    EncryptedSortExec(groupingExpressions.map(e => SortOrder(e, Ascending)), false, planLater(child)))))) :: Nil
+          }
+        case size if size == 1 => // One distinct aggregate operation
+          if (groupingExpressions.size == 0) {
+            // Global aggregation
+            val partialAggregate = EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Partial, planLater(child))
+            val partialOutput = partialAggregate.output
+            val (projSchema, tag) = tagForGlobalAggregate(partialOutput)
+
+            EncryptedProjectExec(resultExpressions, 
+              EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Final, 
+                  EncryptedProjectExec(partialOutput,
+                    EncryptedSortExec(Seq(SortOrder(tag, Ascending)), true,
+                      EncryptedProjectExec(projSchema, partialAggregate))))) :: Nil
+          } else {
+            // Grouping aggregation
+            EncryptedProjectExec(resultExpressions,
+              EncryptedAggregateExec(groupingExpressions, aggregateExpressions, Complete,
+                EncryptedSortExec(groupingExpressions.map(e => SortOrder(e, Ascending)), true, planLater(child)))) :: Nil
+          }
       }
 
     case p @ Union(Seq(left, right)) if isEncrypted(p) =>
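
One likely reason the grouped distinct branch is planned as a single `Complete` pass over sorted input rather than `Partial` followed by `Final`: distinct counts computed per partition cannot simply be added, because the same value can appear in more than one partition. A minimal sketch with made-up data and hypothetical helper names, independent of Opaque's operators:

```cpp
#include <cstddef>
#include <set>
#include <vector>

using Partition = std::vector<int>;

// Distinct count computed independently per partition, then summed
// (what a naive Partial -> Final split would do).
inline std::size_t sum_of_partial_distinct(const std::vector<Partition> &parts) {
  std::size_t total = 0;
  for (const auto &p : parts) {
    std::set<int> seen(p.begin(), p.end());
    total += seen.size();
  }
  return total;
}

// Distinct count over all rows at once (the Complete-style pass).
inline std::size_t complete_distinct(const std::vector<Partition> &parts) {
  std::set<int> seen;
  for (const auto &p : parts) {
    seen.insert(p.begin(), p.end());
  }
  return seen.size();
}
```

With partitions `{1, 2, 2}` and `{2, 3}`, the per-partition sum gives 4 while the single pass gives 3, since the value 2 is counted once in each partition.
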
diff --git a/src/test/scala/edu/berkeley/cs/rise/opaque/OpaqueOperatorTests.scala b/src/test/scala/edu/berkeley/cs/rise/opaque/OpaqueOperatorTests.scala
@@ -377,6 +377,13 @@ trait OpaqueOperatorTests extends OpaqueTestsBase { self =>
       .collect.sortBy { case Row(category: String, _) => category }
   }
 
+  testAgainstSpark("aggregate count - distinct") { securityLevel =>
+    val data = (0 until 32).map{ i => (abc(i), i % 8)}.toSeq
+    val words = makeDF(data, securityLevel, "category", "price")
+    words.groupBy("category").agg(countDistinct("price").as("distinctPrices"))
+      .collect.sortBy { case Row(category: String, _) => category }
+  }
+
   testAgainstSpark("aggregate first") { securityLevel =>
     val data = for (i <- 0 until 256) yield (i, abc(i), 1)
     val words = makeDF(data, securityLevel, "id", "category", "price")
