
Conversation

@shawwn commented Feb 5, 2020

I was running into an issue with `tensorboard dev export` where it failed to export a certain experiment (it threw an internal error). This prevented me from exporting most of my experiment data.

This PR solves the problem by printing the error rather than aborting.
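
Roughly, the change wraps the per-experiment export so the traceback is printed and the loop moves on to the next experiment. A minimal sketch of the idea (the loop and `export_experiment` are illustrative names, not the exact diff):

```python
# Sketch of the approach in this PR (illustrative names, not the exact diff):
# print the traceback for the failing experiment and keep exporting the rest.
import traceback

for experiment_id in experiment_ids:  # illustrative: experiments to export
    try:
        export_experiment(experiment_id)  # illustrative per-experiment export
    except Exception:
        traceback.print_exc()  # report the error instead of aborting the export
```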

@wchargin self-requested a review February 5, 2020 22:44
@wchargin (Contributor) commented Feb 5, 2020

Hi @shawwn! Glad to see you here, and thanks for the PR. :-)

If you still have it handy, could you attach the reference code that you
saw from the internal server error? I’ll take a look at both the failure
and the PR.

@shawwn (Author) commented Feb 6, 2020

It actually doesn't generate a reference code. The error is:

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Received RST_STREAM with error code 2"
	debug_error_string = "{"created":"@1580951798.685511000","description":"Error received from peer ipv4:34.95.66.171:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Received RST_STREAM with error code 2","grpc_status":13}"

[screenshot attached]

@wchargin (Contributor) commented Feb 6, 2020

Interesting. Thanks for the reference. We haven’t seen this before, and
it’s not obvious to me where the error is coming from in our stack
(there are a couple of layers of gRPC plus an Nginx proxy on our side).
We’ll continue investigating and give you an update by some time
tomorrow. If the upcoming ICML deadline makes this especially urgent,
please let us know.

Comment on lines +128 to +129
import traceback
traceback.print_exc()
@wchargin (Contributor) commented:

I’m a little wary of just printing the exception and continuing, because (if
I’m reading this correctly) this causes the exporter to emit a partially
completed experiment file. Once the export “completes”, the user won’t
have any indication as to which of their experiments are incomplete,
which looks like a data loss bug even though it isn’t one.

One way to handle this would be to collect a list of failed/partial
exports and surface failures both immediately and at the end of the
export session; I’ll get some input from the team about how we want to
surface this.
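
A rough sketch of that alternative, assuming a hypothetical per-experiment export loop (none of these names are from the real exporter):

```python
# Sketch only (hypothetical names): record which exports failed so the user
# gets a summary at the end instead of a silently partial experiment file.
import grpc

failed = []  # experiment IDs whose export failed or is only partial

for experiment_id in experiment_ids:  # hypothetical list of experiments
    try:
        export_experiment(experiment_id)  # hypothetical per-experiment export
    except grpc.RpcError as e:
        print("Failed to export %s: %s" % (experiment_id, e))
        failed.append(experiment_id)

if failed:
    print("The following experiments may be incomplete:")
    for experiment_id in failed:
        print("  %s" % experiment_id)
```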

@shawwn (Author) replied:

Fair! I didn't think it was a very good solution either.

The trouble is that an error when exporting one experiment causes the export process to fail for all experiments, and I seem to have many experiments that trigger this error.

One hypothesis: When I interrupt a tensorboard.dev upload, perhaps it generates an experiment which then fails to export.

(If you have access to my account, feel free to try to export the experiments associated with [email protected]. It'll reproduce the error, I think.)

@wchargin (Contributor) replied:

> (If you have access to my account, feel free to try to export the
> experiments associated with [email protected]. It'll reproduce
> the error, I think.)

Thanks. I’ve just done so and can indeed reproduce the error. I can also
reproduce this when exporting large experiments not written by you, so I
don’t think that it’s anything particular to your account. These
experiments are different shapes and sizes (lots of runs, long time
series, etc.) but one thing that they have in common is that the
RST_STREAM failure occurs about 31–32 seconds after process start, so
I suspect that there’s simply a 30-second timeout somewhere in the
stack. (My understanding was that gRPC was supposed to send RST_STREAM
with payload CANCEL (0x08) rather than INTERNAL (0x02) on streaming
request timeout, so it’s not clear to me exactly why this is manifesting
the way that it is.)
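
For reference, a minimal client-side sketch (the stub, method, and handler names are hypothetical) of how such an RST_STREAM surfaces in gRPC Python, which is where the StatusCode.INTERNAL in the traceback above comes from:

```python
# Minimal sketch (hypothetical stub/method/handler names): how an HTTP/2
# RST_STREAM with error code 2 (INTERNAL_ERROR) reaches a gRPC Python client.
import grpc

try:
    for response in stub.StreamExperimentData(request):  # hypothetical streaming RPC
        handle(response)                                  # hypothetical consumer
except grpc.RpcError as e:
    print(e.code())     # StatusCode.INTERNAL for RST_STREAM error code 2
    print(e.details())  # e.g. "Received RST_STREAM with error code 2"
```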

We’ll keep looking into this next week. We may be able to deploy a
short-term server-side patch to increase the timeouts to unblock you,
and work on a longer-term solution such that these aren’t limited to a
single streaming RPC and thus aren’t subject to timeouts at all.

I’m assuming that you’re able to run this patch locally and partially
export your experiments—is that correct? If not (e.g., if you’re having
trouble building), let me know, and I can send you a modified wheel with
this patch.

(Googlers, see http://b/149120509.)
