Left/Right Outer support for equi and non-equi joins #162

Merged
merged 44 commits into from
Mar 4, 2021

Conversation

Collaborator

@octaviansima octaviansima commented Feb 23, 2021

TPC-H 13 passes.

This PR implements left and right outer joins for both physical join operators that are currently supported: NonObliviousSortMergeJoin and BroadcastNestedLoopJoin. It adds tests for both, but TPC-H 13 only requires the left outer equi-join implementation.
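
For readers unfamiliar with the semantics, here is a minimal plaintext sketch of a left outer nested-loop join with an arbitrary join condition. It is not the enclave code from this PR: `Row`, `JoinedRow`, `left_outer_join`, `keys_equal`, and `left_key_greater` are hypothetical names, and rows are unencrypted structs purely for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical plaintext row (illustration only; not Opaque's encrypted row format).
struct Row {
    int key;
    int value;
};

// One output row. When right_is_null is true, the right side stands for SQL NULL
// and the contents of `right` must be ignored.
struct JoinedRow {
    Row left;
    Row right;
    bool right_is_null;
};

// Example conditions: an equi-join predicate and a non-equi predicate.
bool keys_equal(const Row& l, const Row& r) { return l.key == r.key; }
bool left_key_greater(const Row& l, const Row& r) { return l.key > r.key; }

// Left outer join: every left row appears at least once in the output.
// Left rows with no match under `condition` are padded with a NULL right side.
std::vector<JoinedRow> left_outer_join(
    const std::vector<Row>& left,
    const std::vector<Row>& right,
    const std::function<bool(const Row&, const Row&)>& condition) {
    std::vector<JoinedRow> out;
    for (std::size_t i = 0; i < left.size(); ++i) {
        bool matched = false;
        for (std::size_t j = 0; j < right.size(); ++j) {
            if (condition(left[i], right[j])) {
                out.push_back(JoinedRow{left[i], right[j], false});
                matched = true;
            }
        }
        if (!matched) {
            out.push_back(JoinedRow{left[i], Row{0, 0}, true});
        }
    }
    return out;
}
```

A right outer join follows from the same loop with the roles of the two inputs swapped.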

octaviansima and others added 2 commits February 23, 2021 20:56
set up class thing

cleanup

added test cases for non-equi left anti join

rename to serializeEquiJoinExpression

added isEncrypted condition

set up keys

JoinExpr now has condition

rename

serialization does not throw compile error for BNLJ

split up

added condition in ExpressionEvaluation.h

zipPartitions

cpp put in place

typo

added func to header

two loops in place

update tests

condition

fixed scala loop

interchange rows

added tags

ensure cached

== match working

comparison decoupling in ExpressionEvaluation

save

compiles and condition works

is printing

fix swap outer/inner

o_i_match

show() has the same result

tests pass

test cleanup

added test cases for different condition

BuildLeft works

optional keys in scala

started C++

passes the operator tests

comments, cleanup

attempting to do it the ~right~ way

comments to distinguish between primary/secondary, operator tests pass

cleanup comments, about to begin implementation for distinct agg ops

is_distinct

added test case

serializing with isDistinct

is_distinct in ExpressionEvaluation.h

removed unused code from join implementation

remove RowWriter/Reader in condition evaluation (join)

easier test

serialization done

correct checking in Scala

set is set up

spaghetti but it finally works

function for clearing values

condition_eval instead of condition

goto

comment

started impl of multiple partitions fix

added rangepartitionexec that runs

partitioning cleanup

serialization properly

comments, generalization for > 1 distinct function

comments

about to refactor into logical.Aggregation

the new case has distinct in result expressions

need to match on distinct

removed new case (doesn't make difference?)

works

remove traces of distinct

more cleanup

address comments

rename equi join

split Join.cpp into two files

Update App.cpp

fixed swap issues

one more swap

stream/broadcast

concatEncryptedBlocks, remove import iostream

comment for for loop

added comments explaining constraints with broadcast side

comments

left semi done, existence serializes

remove existence serialization

fixed
* finishing the in expression. adding more tests and null support. need confirmation on null behavior and also I wonder why integer field is sufficient for string

* adding additional test

* adding additional test

* saving concat implementation and it's passing basic functionality tests

* adding type aware comparison and better error message for IN operator

* adding null checking for the concat operator and adding one additional test

* cleaning up IN&Concat PR

* deleting concat and prepping the in branch for in pr

* fixing null behavior

now it's only null when there's no match and there's null input

* Build failed

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>
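
The null behavior described in the IN commit above ("only null when there's no match and there's null input") lines up with standard SQL three-valued logic. A minimal sketch of that rule follows, assuming integer operands; `TriBool`, `NullableInt`, and `sql_in` are hypothetical names, not identifiers from the codebase, and the actual implementation may differ in detail.

```cpp
#include <cstddef>
#include <vector>

// Three-valued SQL boolean.
enum TriBool { kFalse, kTrue, kNull };

// Hypothetical nullable integer; `value` is ignored when is_null is true.
struct NullableInt {
    int value;
    bool is_null;
};

// Evaluates `needle IN (list...)`:
//   TRUE  if some non-null list element equals the needle,
//   NULL  if there is no match and the needle or any list element is NULL,
//   FALSE otherwise.
TriBool sql_in(const NullableInt& needle, const std::vector<NullableInt>& list) {
    if (needle.is_null) {
        return kNull;  // a NULL needle can never match, so the result is NULL
    }
    bool saw_null = false;
    for (std::size_t i = 0; i < list.size(); ++i) {
        if (list[i].is_null) {
            saw_null = true;
        } else if (list[i].value == needle.value) {
            return kTrue;  // a match wins even if the list also contains NULLs
        }
    }
    return saw_null ? kNull : kFalse;
}
```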

Separate Concat PR  (mc2-project#125)

Implementation of the CONCAT expression.

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Wenting Zheng <[email protected]>

Removed calls to toSet in TPC-H tests (mc2-project#140)

* removed calls to toSet

* added calls to toSet back where queries are unordered

Documentation update (mc2-project#148)

Cluster Remote Attestation Fix (mc2-project#146)

The existing code only had RA working when run locally. This PR adds a sleep for 5 seconds to make sure that all executors are spun up successfully before attestation begins.

Closes mc2-project#147

upgrade to 3.0.1 (mc2-project#144)

Update two TPC-H queries (mc2-project#149)

Tests for TPC-H 12 and 19 pass.

TPC-H 20 Fix (mc2-project#142)

* string to stringtype error

* tpch 20 passes

* cleanup

* implemented changes

* decimal.tofloat

Co-authored-by: Wenting Zheng <[email protected]>

Join update (mc2-project#145)

Migrate from Travis CI to Github Actions (mc2-project#156)

matching in strategies.scala

remove explain from test, need to fix distinct aggregation for >1 partitions

Upgrade to OE 0.12 (mc2-project#153)

Update README.md

Support for scalar subquery (mc2-project#157)

This PR implements the scalar subquery expression, which is triggered whenever a subquery returns a scalar value. There were two main problems that needed to be solved.

First, support for matching the scalar subquery expression is necessary. Spark implements this by wrapping a SparkPlan within the expression and calling executeCollect, then constructing a literal with that value. However, this is problematic for us because that value should not be decrypted by the driver and serialized into an expression, since it's an intermediate value.

Therefore, the second issue to be addressed here is supporting an encrypted literal. This PR implements it by serializing the ciphertext into a base64-encoded string and wrapping a Decrypt expression on top of it. This expression is then evaluated in the enclave and returns a literal. Note that, in order to test our implementation, we also implement a Decrypt expression in Scala. However, the Scala version should never be evaluated on the driver side and serialized into a plaintext literal, because Decrypt is designated as a Nondeterministic expression and therefore always evaluates on the workers.
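
The encrypted-literal flow described above can be sketched roughly as follows. This is an illustration under loud assumptions, not the code from that PR: `DecryptExpr`, `wrap_encrypted_literal`, and `evaluate_in_enclave` are hypothetical names, and `toy_cipher` is a single-byte XOR that stands in for real authenticated encryption and offers no security.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

static const std::string kBase64Alphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 encoding with '=' padding.
std::string base64_encode(const Bytes& bytes) {
    std::string out;
    std::size_t i = 0;
    for (; i + 2 < bytes.size(); i += 3) {  // full 3-byte groups
        std::uint32_t n = (bytes[i] << 16) | (bytes[i + 1] << 8) | bytes[i + 2];
        out += kBase64Alphabet[(n >> 18) & 63];
        out += kBase64Alphabet[(n >> 12) & 63];
        out += kBase64Alphabet[(n >> 6) & 63];
        out += kBase64Alphabet[n & 63];
    }
    const std::size_t rest = bytes.size() - i;  // 0, 1, or 2 trailing bytes
    if (rest == 1) {
        std::uint32_t n = bytes[i] << 16;
        out += kBase64Alphabet[(n >> 18) & 63];
        out += kBase64Alphabet[(n >> 12) & 63];
        out += "==";
    } else if (rest == 2) {
        std::uint32_t n = (bytes[i] << 16) | (bytes[i + 1] << 8);
        out += kBase64Alphabet[(n >> 18) & 63];
        out += kBase64Alphabet[(n >> 12) & 63];
        out += kBase64Alphabet[(n >> 6) & 63];
        out += "=";
    }
    return out;
}

// Inverse of base64_encode; stops at the first padding character.
Bytes base64_decode(const std::string& text) {
    Bytes out;
    std::uint32_t buffer = 0;
    int bits = 0;
    for (std::size_t i = 0; i < text.size() && text[i] != '='; ++i) {
        const std::size_t index = kBase64Alphabet.find(text[i]);
        if (index == std::string::npos) {
            throw std::invalid_argument("not a base64 character");
        }
        buffer = (buffer << 6) | static_cast<std::uint32_t>(index);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}

// ASSUMPTION: stand-in cipher for illustration only. XOR with one byte is NOT encryption.
Bytes toy_cipher(const Bytes& input, std::uint8_t key) {
    Bytes out(input);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] ^= key;
    }
    return out;
}

// Hypothetical expression node: Decrypt(<base64 string literal>).
struct DecryptExpr {
    std::string base64_ciphertext;
};

// Driver side: wraps the ciphertext without ever seeing the plaintext.
DecryptExpr wrap_encrypted_literal(const Bytes& ciphertext) {
    return DecryptExpr{base64_encode(ciphertext)};
}

// Enclave side: only here is the key available to recover the literal.
Bytes evaluate_in_enclave(const DecryptExpr& expr, std::uint8_t enclave_key) {
    return toy_cipher(base64_decode(expr.base64_ciphertext), enclave_key);
}
```

The point of the shape is that the driver only ever handles the base64 text; the plaintext exists only inside `evaluate_in_enclave`.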

Add TPC-H Benchmarks (mc2-project#139)

* logic decoupling in TPCH.scala for easier benchmarking

* added TPCHBenchmark.scala

* Benchmark.scala rewrite

* done adding all support TPC-H query benchmarks

* changed commandline arguments that benchmark takes

* TPCHBenchmark takes in parameters

* fixed issue with spark conf

* size error handling, --help flag

* add Utils.force, break cluster mode

* comment out logistic regression benchmark

* ensureCached right before temp view created/replaced

* upgrade to 3.0.1

* upgrade to 3.0.1

* 10 scale factor

* persistData

* almost done refactor

* more cleanup

* compiles

* 9 passes

* cleanup

* collect instead of force, sf_none

* remove sf_none

* defaultParallelism

* no removing trailing/leading whitespace

* add sf_med

* hdfs works in local case

* cleanup, added new CLI argument

* added newly supported tpch queries

* function for running all supported tests

address comments

added one test case

non-null case working

rename equi join

split Join.cpp into two files

outer and default joins split up

not handling nulls at all

first test case works

force_null to all appends

test, matching in scala

non-nulls working

it works for anti and outer

cleanup

test cases added

one row is not being added in the sort merge implementation

tpc-h 13 passes

comments

outer/inner swap, breaks a bunch of things

Update App.cpp

fixed swap issues

for loop instead of flatten

concatEncryptedBlocks

tpch 13 test passes

one more swap

stream/broadcast

concatEncryptedBlocks, remove import iostream

comment for for loop

added comments explaining constraints with broadcast side

comments
@octaviansima octaviansima marked this pull request as ready for review February 25, 2021 00:30
@octaviansima octaviansima requested a review from wzheng February 25, 2021 00:30
@octaviansima octaviansima changed the title Left Outer support for equi and non-equi joins Left/Right Outer support for equi and non-equi joins Mar 2, 2021
@octaviansima octaviansima force-pushed the left-outer branch 2 times, most recently from 90702ee to c3ed7a5 Compare March 2, 2021 03:28
Comment on lines 170 to 171
write_output_rows(primary_unmatched_rows, w, join_type);
write_output_rows(previous_primary_unmatched_rows, w, join_type);
Collaborator

If the input data is very big, then won't previous_primary_unmatched_rows potentially have to buffer a lot of data, and it will have to write out to w again? Can you write out to w directly earlier?

Collaborator Author

The issue with doing it earlier is that last_foreign_row.get() can be null. This happens when a new primary group is encountered before any foreign row has been encountered.

Collaborator

For outer joins, can you put the dummy row first instead of last?

Collaborator Author

that worked, thanks
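
For context, one reading of this thread is sketched below: if a dummy foreign row is the first row of the sorted stream, a template for the NULL-padded foreign side is always available, so unmatched primary rows can be written as soon as their group ends instead of being buffered. This is a plaintext illustration under that assumption, not the code that was merged; `TaggedRow`, `OutputRow`, and `left_outer_merge` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tagged row (illustration only; the real rows are encrypted).
struct TaggedRow {
    bool is_primary;                  // true: primary side, false: foreign side
    int key;                          // join key; the stream is sorted by it
    std::vector<std::string> fields;  // non-key columns
};

struct OutputRow {
    std::vector<std::string> primary_fields;
    std::vector<std::string> foreign_fields;
    bool foreign_is_null;             // true when the foreign side is NULL padding
};

// ASSUMPTION: stream[0] is a dummy foreign row that only carries the foreign schema.
// The remaining rows are sorted by key, with primary rows ahead of foreign rows per key.
std::vector<OutputRow> left_outer_merge(const std::vector<TaggedRow>& stream) {
    assert(!stream.empty() && !stream[0].is_primary);
    const std::size_t foreign_width = stream[0].fields.size();

    std::vector<OutputRow> out;
    std::vector<const TaggedRow*> group;  // primary rows sharing the current key
    bool group_matched = false;

    for (std::size_t i = 1; i <= stream.size(); ++i) {
        const bool at_end = (i == stream.size());
        const bool new_group = !at_end && stream[i].is_primary &&
                               !group.empty() && group[0]->key != stream[i].key;
        if (at_end || new_group) {
            // The current group is complete. If nothing matched it, emit its rows
            // right away, NULL-padded to the width taken from the dummy row.
            if (!group_matched) {
                for (std::size_t j = 0; j < group.size(); ++j) {
                    out.push_back(OutputRow{group[j]->fields,
                                            std::vector<std::string>(foreign_width),
                                            true});
                }
            }
            group.clear();
            group_matched = false;
        }
        if (at_end) {
            break;
        }
        const TaggedRow& row = stream[i];
        if (row.is_primary) {
            group.push_back(&row);
        } else if (!group.empty() && group[0]->key == row.key) {
            for (std::size_t j = 0; j < group.size(); ++j) {
                out.push_back(OutputRow{group[j]->fields, row.fields, false});
            }
            group_matched = true;
        }
        // Foreign rows without a matching primary group are dropped (left outer).
    }
    return out;
}
```

In this sketch only the current primary group is held in memory; nothing accumulates across groups.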

Collaborator

@wzheng wzheng left a comment

Thanks!

@octaviansima octaviansima merged commit 89515e2 into mc2-project:master Mar 4, 2021
@octaviansima octaviansima deleted the left-outer branch March 15, 2021 22:39
3 participants