This repository was archived by the owner on Oct 29, 2023. It is now read-only.

@iliat
Contributor

@iliat iliat commented Mar 2, 2015

@wbrockman @jean-philippe-martin @dionloy - here is the first version of the sharded BAM reader. Works in the cloud with a workaround for credentials factory.

  • Most of the common files are under readers/bam.
  • The CountReads pipeline is there as an example of using the reader. It can read from the API or BAM file.
  • I added a GCSOptions class to capture what is needed to access GCS.
  • Some files added under htsjdk/samtools because I needed to deal with visibility restrictions in HTSJDK classes (not pretty but fwiw GATK code base does the same).
    Also added dependency on HTSJDK in pom.xml

@coveralls

Coverage Status

Coverage decreased (-31.58%) to 0.0% when pulling d7ba30a on iliat:dev-broad into 408ed35 on googlegenomics:master.

@pgrosu

pgrosu commented Mar 2, 2015

Congrats Ilia! This is a huge step forward!

~p

Contributor

Typo - GATK

@iliat
Contributor Author

iliat commented Mar 5, 2015

@dionloy @deflaux I will fix the comment typo, but LMK if there are any other comments or concerns before I merge.

@iliat
Contributor Author

iliat commented Mar 7, 2015

@dionloy @deflaux Ok, I simplified credentials passing by using OfflineAuth, so there is no need for the factory and side inputs. Will work on adding more tests a bit later.

@pgrosu

pgrosu commented Mar 7, 2015

Hey Ilia,

Travis is not happy :(

[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/pipelines/CountReads.java:[39,48] cannot find symbol
symbol: class WorkerCredentialFactoryWorkaround
location: package com.google.cloud.genomics.dataflow.utils
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/pipelines/CountReads.java:[116,5] cannot find symbol
symbol: variable WorkerCredentialFactoryWorkaround
location: class com.google.cloud.genomics.dataflow.pipelines.CountReads
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/pipelines/CountReads.java:[149,17] no suitable constructor found for ReadReader(com.google.cloud.genomics.utils.GenomicsFactory.OfflineAuth)
constructor com.google.cloud.genomics.dataflow.readers.ReadReader.ReadReader(com.google.cloud.genomics.utils.GenomicsFactory.OfflineAuth,com.google.cloud.genomics.utils.Paginator.ShardBoundary) is not applicable
(actual and formal argument lists differ in length)
constructor com.google.cloud.genomics.dataflow.readers.ReadReader.ReadReader(com.google.cloud.genomics.utils.GenomicsFactory.OfflineAuth,com.google.cloud.genomics.utils.Paginator.ShardBoundary,java.lang.String) is not applicable
(actual and formal argument lists differ in length)
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/readers/bam/BAMShard.java:[70,11] cannot assign a value to final variable end
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/readers/bam/BAMShard.java:[82,13] cannot assign a value to final variable end
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/readers/bam/Sharder.java:[140,13] cannot assign a value to final variable end
[INFO] 6 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.717 s
[INFO] Finished at: 2015-03-07T19:33:50+00:00
[INFO] Final Memory: 19M/136M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project google-genomics-dataflow: Compilation failure: Compilation failure:
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/pipelines/CountReads.java:[39,48] cannot find symbol
[ERROR] symbol: class WorkerCredentialFactoryWorkaround
[ERROR] location: package com.google.cloud.genomics.dataflow.utils
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/pipelines/CountReads.java:[116,5] cannot find symbol
[ERROR] symbol: variable WorkerCredentialFactoryWorkaround
[ERROR] location: class com.google.cloud.genomics.dataflow.pipelines.CountReads
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/pipelines/CountReads.java:[149,17] no suitable constructor found for ReadReader(com.google.cloud.genomics.utils.GenomicsFactory.OfflineAuth)
[ERROR] constructor com.google.cloud.genomics.dataflow.readers.ReadReader.ReadReader(com.google.cloud.genomics.utils.GenomicsFactory.OfflineAuth,com.google.cloud.genomics.utils.Paginator.ShardBoundary) is not applicable
[ERROR] (actual and formal argument lists differ in length)
[ERROR] constructor com.google.cloud.genomics.dataflow.readers.ReadReader.ReadReader(com.google.cloud.genomics.utils.GenomicsFactory.OfflineAuth,com.google.cloud.genomics.utils.Paginator.ShardBoundary,java.lang.String) is not applicable
[ERROR] (actual and formal argument lists differ in length)
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/readers/bam/BAMShard.java:[70,11] cannot assign a value to final variable end
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/readers/bam/BAMShard.java:[82,13] cannot assign a value to final variable end
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/com/google/cloud/genomics/dataflow/readers/bam/Sharder.java:[140,13] cannot assign a value to final variable end
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
The command "eval mvn install -DskipTests=true -Dmaven.javadoc.skip=true -B -V" failed 3 times.
The command "mvn install -DskipTests=true -Dmaven.javadoc.skip=true -B -V" failed and exited with 1 during .
Your build has been stopped.

Paul
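
For readers unfamiliar with the `cannot assign a value to final variable end` errors in the log above, here is a minimal sketch of what triggers them and one way to resolve them. The `Shard` class, its fields, and `withEnd` are hypothetical stand-ins, not the PR's actual `BAMShard` or `Sharder` code, and the fix that was actually pushed may simply have dropped the `final` modifier.

```java
// Hypothetical stand-in for a shard with an immutable end coordinate.
// This is NOT the PR's BAMShard; names and fields are assumptions for illustration.
final class Shard {
    final String contig;
    final long start;
    final long end; // final: writing `shard.end = x;` anywhere after construction will not compile

    Shard(String contig, long start, long end) {
        this.contig = contig;
        this.start = start;
        this.end = end; // the only place a final field may be assigned
    }

    // Instead of mutating `end`, return a new shard that carries the extended end.
    Shard withEnd(long newEnd) {
        return new Shard(contig, start, newEnd);
    }
}
```

With this shape, `new Shard("chr1", 0L, 100L).withEnd(250L)` yields a shard ending at 250 while the original keeps its end of 100. The simpler alternative is to make the field non-final and assign it directly.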

@iliat
Contributor Author

iliat commented Mar 7, 2015

@pgrosu Yeah, I see that. Looking

@pgrosu

pgrosu commented Mar 7, 2015

@iliat, Cool - yeah they're minor fixes - thanks man :)

@iliat iliat force-pushed the dev-broad branch 2 times, most recently from 0a661e4 to 447361c Compare March 7, 2015 20:19
@iliat
Contributor Author

iliat commented Mar 7, 2015

@pgrosu @deflaux @dionloy Now compiles :)

@pgrosu

pgrosu commented Mar 7, 2015

Awesome! Thanks Ilia :)

Contributor

Is there really no built-in operation for this? Rats!

Contributor Author

I could not find anything here: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/package-summary
The only other alternative is to use a Coder with TextIO.Write.withCoder, but unfortunately the only prepackaged coder that can help is TextualIntegerCoder, and it's not suitable for Longs.
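
To illustrate the limitation described above, here is a minimal plain-Java sketch of the text round trip a Long-capable coder would have to perform. `TextualLong` and its methods are hypothetical and are not part of the Dataflow SDK; the sketch only shows that a read count can exceed the `int` range, which is why an integer-only textual coder would not do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Dataflow SDK class: encodes a long as its decimal text.
final class TextualLong {
    private TextualLong() {}

    // Write the value as UTF-8 decimal digits, the form a text file line would hold.
    static byte[] encode(long value) {
        return Long.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    // Parse the decimal text back into a long; throws NumberFormatException on bad input.
    static long decode(byte[] bytes) {
        return Long.parseLong(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

A count such as 5,000,000,000 survives this round trip, whereas `Integer.parseInt` would reject the same text because it is larger than `Integer.MAX_VALUE`.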

Contributor

Yup, thanks for the context - I concur there's no ToString transform that I can find at present. I've checked in with Jeff G from the Dataflow team to make sure I'm not missing anything.

Copy link

So why not just take this out and put it into a common directory for all the extended functionality for Dataflow until it is implemented to make the code more readable? I get the feeling it will be reused.

@coveralls

Coverage Status

Coverage decreased (-7.79%) to 23.67% when pulling 653ae81 on iliat:dev-broad into 4b57fc1 on googlegenomics:master.

Contributor

typo in storgae

@wbrockman
Contributor

LGTM - let's get the unit tests in a separate commit, as you say. Thanks for updating comments on the subtle question of why this code does find all the reads. Many thanks, and nice work - it's outstanding to have a full BAM reader available! Looking forward to experimenting with this further and perhaps iterating on the code design a bit once there are tests available.

@pgrosu

pgrosu commented Mar 11, 2015

Definitely +1 :)

iliat added a commit that referenced this pull request Mar 12, 2015
Sharded BAM reader and a sample read counting pipeline
@iliat iliat merged commit 70bc9f7 into googlegenomics:master Mar 12, 2015
@wbrockman
Contributor

Hurray! Thank you!

On Thu, Mar 12, 2015 at 3:59 PM, Ilia Tulchinsky [email protected]
wrote:

Merged #34.


@pgrosu

pgrosu commented Mar 12, 2015

Things should get very interesting :)

jiridanek pushed a commit to jiridanek/dataflow-java that referenced this pull request Jan 18, 2016
Sharded BAM reader and a sample read counting pipeline

6 participants