Skip to content

devlib.collector: Add PerfettoCollector #618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 6, 2023

Conversation

mrkajetanp
Copy link
Contributor

Add a Collector for accessing Google's Perfetto tracing infrastructure. The Collector takes a path to an on-device config file, starts tracing in the background using the perfetto binary and then stops by killing the tracing process.

Signed-off-by: Kajetan Puchalski [email protected]

Copy link
Collaborator

@douglas-raillard-arm douglas-raillard-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some drive-by comments

@mrkajetanp
Copy link
Contributor Author

I updated the PR with Douglas' suggestions.
I also had to bring back the manual pull timeout bit. I noticed that without it I'd end up getting a pull timeout exception when trying to pull the trace from the device. No idea why but having the manual timeout fixes the problem.

@douglas-raillard-arm
Copy link
Collaborator

I noticed that without it I'd end up getting a pull timeout exception when trying to pull the trace from the device.

That's strange, from a cursory look I could not find any non-None default timeout on the pull() path, except in the transfer manager but it's set to 3600

@mrkajetanp
Copy link
Contributor Author

mrkajetanp commented Feb 15, 2023

Now that I'm thinking about it, are we sure switching to the self.target.background method is the better way? Perfetto's 'intended' way to trace in the background is to pass the --background argument which just returns the PID and then later to kill that PID whenever you're done. If we're using self.target.background we need to use Perfetto's non-background variant designed to be run in a terminal and cancelled with Ctrl-C. It seems to work either way so I'm not going to make a stand here but I'm just wondering if we might not end up interfering with how the tool is meant to operate on account of making the code more 'pythonic' or whatever.

@douglas-raillard-arm
Copy link
Collaborator

I would definitely use background command. The previous version of the code was straightfoward and yet already had issues:

  1. if the process dies, we would still attempt to kill an old PID even though it might point at an entirely different task. Android can exhaust PIDs pretty quickly.
  2. it allowed starting perfetto twice, which would result in failures down the line
  3. less likely, but if perfetto hanged at SIGINT/SIGTERM, devlib would freeze. With BackgroundCommand.cancel(), SIGTERM is sent and then SIGKILL after a timeout.
  4. if devlib disconnects, perfetto will be left running. Background commands are terminated properly when the target is destroyed or disconnected.

The change to the background command API fixed 1. and made 2. obvious. On top of that, we can now capture stdout/stderr if required for debug, instead of having to go hunt for some log file somewhere. Process control is a tricky business and it's definitely the sort of stuff we don't want to duplicate. Ctrl-C in a terminal is simply SIGINT, which will likely be handled like SIGTERM so I expect perfetto to work fine with that.

cmd = "cat {} | {} --txt -c - -o {}".format(
quote(self.config), quote(self.target_binary), quote(self.target_output_file)
)
self.start_time = time.time()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out why the pull times out as it's not really expected to happen instead of this hack. Also time.time() is not monotonic so end - start < 0 is possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not even that it times out, more that there's an issue if we don't specify an explicit timeout?

> /home/kajpuc01/tools/lisa/external/devlib/devlib/collector/perfetto.py(69)get_data()
-> self.target.pull(self.target_output_file, self.output_path)
(Pdb) s
--Call--
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(202)__get__()
-> def __get__(self, *args, **kwargs):
(Pdb) n
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(203)__get__()
-> return self.__class__(
(Pdb)
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(204)__get__()
-> asyn=self.asyn.__get__(*args, **kwargs),
(Pdb)
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(205)__get__()
-> blocking=self.blocking.__get__(*args, **kwargs),
(Pdb)
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(203)__get__()
-> return self.__class__(
(Pdb)
--Return--
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(203)__get__()-><devlib.utils...x7f326301bc10>
-> return self.__class__(
(Pdb)
--Call--
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(208)__call__()
-> def __call__(self, *args, **kwargs):
(Pdb)
> /home/kajpuc01/tools/lisa/external/devlib/devlib/utils/asyn.py(209)__call__()
-> return self.blocking(*args, **kwargs)
(Pdb)
WARNING          Cancelling file transfer ['/sdcard/devlib-target/trace.perfetto-trace'] -> perfetto_test/wk1-speedometer-1/trace.perfetto-trace due to 'transfer inactive'
WARNING          Cancelling file transfer ['/sdcard/devlib-target/trace.perfetto-trace'] -> perfetto_test/wk1-speedometer-1/trace.perfetto-trace due to 'exception during transfer'
TimeoutError: adb -s 18251FDF60083A pull /sdcard/devlib-target/trace.perfetto-trace perfetto_test/wk1-speedometer-1/trace.perfetto-trace

I think the issue might be here:

    def _push_pull(self, action, sources, dest, timeout):
        # [...]
        if timeout or not self.poll_transfers:
            adb_command(self.device, command, timeout=timeout, adb_server=self.adb_server)
        else:
            with self.transfer_mgr.manage(sources, dest, action):
                bg_cmd = adb_command_background(self.device, command, adb_server=self.adb_server)
                self.transfer_mgr.set_transfer_and_wait(bg_cmd)

Depending on whether a timeout was specified we either use adb_command or adb_command_background. The former seems to work fine, the latter not so much in this case.

Copy link
Collaborator

@douglas-raillard-arm douglas-raillard-arm Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so that definitely needs fixing. Can you get a backtrace of that TimeoutError and debug logging ?

There also seems to be a leak of the bg_cmd in the transfer manager (it only gets destroyed when the next transfer happens), and a Thread that should be created with daemon=True but is not so a code robustness review seems to be in order.

EDIT: I also found broken uses of None:

        self.transfer_mgr = SSHTransferManager(self, **transfer_opts) if poll_transfers else None

and then:

            with _handle_paramiko_exceptions(), self.transfer_mgr.manage(sources, dest, action, scp):

so there are definitely a bunch of smelly/brittle stuff around that transfer manager implementation that ought to be fixed ASAP if that is enabled by default

@mrkajetanp
Copy link
Contributor Author

I dropped the explicit timeout argument but that requires #619 so it should be either merged after the other one or at the same time

@mrkajetanp mrkajetanp force-pushed the perfetto branch 2 times, most recently from 4324b7c to 1cb6d0a Compare April 18, 2023 15:22
@mrkajetanp
Copy link
Contributor Author

Since the transfer manager PR has been merged we should now be able to go ahead and merge this as well?

@marcbonnici marcbonnici mentioned this pull request May 16, 2023
Copy link
Collaborator

@marcbonnici marcbonnici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay yes we can proceed with this PR. I just wanted to spend some time understand this a bit and get it running on my end as I had found some issues during testing but it turns out they were unrelated to this PR. (Opened a PR to address these #628)

However I have a couple of questions for you.

IIUC running the perfetto binary requires a running service on the device to connect to, and if this is not running then currently we fail silently. On a linux target (my test device) this is provided by traced binary or both can packaged as part of the tracebox binary.

I'm wondering is you know an easy way to detect whether this service is running, or even adding some config to enable running the service as a background command as part of the collector (I'm currently testing with manually running with ./tracebox traced before starting a workload.

Secondly do you know if we would be able to include a compiled perfetto binary (or tracebox) similar to what we do for trace-cmd etc to make it easier for users to get started?

@mrkajetanp
Copy link
Contributor Author

mrkajetanp commented May 23, 2023

First of all, these are not issues for Android - on Android Perfetto always comes included and is always running so it'll just be fine. For Linux targets there's a bit more steps, true.

I'm wondering is you know an easy way to detect whether this service is running

I'm not sure if there's any official or proper way to check, we could always grep the ps output or something like that if we really wanted to?

or even adding some config to enable running the service as a background command

We could just make the tool use tracebox if it can't find the running service on Linux?

Secondly do you know if we would be able to include a compiled perfetto binary (or tracebox) similar to what we do for trace-cmd etc to make it easier for users to get started?

As mentioned above, it comes bundled with Android so on Android there's no point. On Linux if the tracebox binary already works for you then I see no issues with bundling it.

https://perfetto.dev/docs/quickstart/linux-tracing

Copy link
Collaborator

@marcbonnici marcbonnici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the information. I understand that the requirements for Android and Linux are different, one option would be to check for the OS of the target and either auto deploy the binary and start it as a background task for Linux or assume it is present for Android?

At the minimum I think we should outline what needs to be done in the docsctring for the collector. Additionally it might be worth trying to ensure that if something fails, that we can provide some error message to indicate this (my initial testing was failing silently with an empty output file so it wasn't immediately obvious what steps needed to be performed).

@mrkajetanp mrkajetanp force-pushed the perfetto branch 2 times, most recently from 972c5b8 to bc7522f Compare July 4, 2023 16:32
@mrkajetanp
Copy link
Contributor Author

Sorry for the delay, here are the updates. I added a check to switch Perfetto on for Android 9 & 10, as well as a check to install tracebox on Linux targets if the service is missing. The tracebox binary was bundled along with the copied Perfetto license.

I also added a "is_running" function in Target, as far as I can tell there was no already existing equivalent and it seems like a useful thing to have so I made it a helper inside Target instead of implementing it just inside the collector.

@mrkajetanp
Copy link
Contributor Author

Updated with using the fstring and a more proper implementation of is_running that takes a regexp so that we can match the name more accurately and avoid a false positive. An added upside is that awk will just make execute return an empty string instead of an exception in the absent case so we can avoid the exception handler.

@mrkajetanp
Copy link
Contributor Author

I updated with just directly matching the comm to not overcomplicate things. Busybox ps solves the potential spaces problem already as far as I can tell.
busybox ps with no arguments:

PID   USER     TIME  COMMAND
    1 0         1h22 /system/bin/init second_stage
    2 0         0:03 [kthreadd]
    3 0         0:00 [rcu_gp]

busybox ps -o comm,stat

COMMAND          STAT
init             S
kthreadd         SW
rcu_gp           IW<

None of the tasks that have spaces in the default output have them in the -o variant, it abbreviates the task names to a fixed width column and gets rid of whitespace.

@douglas-raillard-arm
Copy link
Collaborator

None of the tasks that have spaces in the default output have them in the -o variant, it abbreviates the task names to a fixed width column and gets rid of whitespace.

That's good to know, have you tried the args instead of comm ? That would contain the full command line though, which is a bit different.

@mrkajetanp
Copy link
Contributor Author

have you tried the args instead of comm

comm just gives us exactly what we want, args appears slightly broken in the busybox implementation. Namely args,stat only shows the args part while stat,args shows both as expected. Still not really relevant here, the use case for this function is to check whether a background deamon is present or not and just matching comm does that more than adequately imo.

@douglas-raillard-arm
Copy link
Collaborator

comm just gives us exactly what we want

This is not what we want:

None of the tasks that have spaces in the default output have them in the -o variant, it abbreviates the task names to a fixed width column and gets rid of whitespace.

Fortunately I found that it's actually not true, the busybox binary shipped in devlib does report comm with spaces properly, e.g.

/devlib/devlib/bin/x86_64/busybox ps -A -o stat,comm | grep foo
S    foo bar.sh

for a script called "foo bar.sh"

I think the confusion comes from the full command line (args in ps parlance) and the task name (comm). comm is not a simplification of args, it's 2 separate strings:

  • comm is the same as in the kernel and is set using prctl(PR_SET_NAME, ...) syscall (task_rename ftrace event)
  • cmdline (args for busybox ps) is taken from the auxiliary vector of the process, i.e. the concatenated content of argv parameter of main() in C.

@mrkajetanp mrkajetanp force-pushed the perfetto branch 2 times, most recently from ff22f03 to 28e5945 Compare July 20, 2023 13:07
# Android 9 and 10 require traced to be enabled manually
if os_version == '9' or os_version == '10':
target.execute('setprop persist.traced.enable 1')
elif target.os in ['linux', 'android'] and not target.is_running('traced'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When launching traced via tracebox sadly the traced argument does not appear in the ps output causing this check to fail. Perhaps we could check for traced or tracebox?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that's intentional though. We only check for traced so that we can use tracebox if it's missing. If traced is present this means that we don't need to use tracebox and should call perfetto instead.
Calling tracebox with traced running in the background would be an odd halfway setup I think. From their docs:

Due to Perfetto's service-based architecture, in order to capture a trace, the traced (session daemon) and traced_probes (probes and ftrace-interop daemon) need to be running. As per Perfetto v16, the tracebox binary bundles together all the binaries you need in a single executable (a bit like toybox or busybox).

@mrkajetanp mrkajetanp force-pushed the perfetto branch 2 times, most recently from ba5972b to 008811e Compare August 11, 2023 16:44
@mrkajetanp
Copy link
Contributor Author

I updated the PR so that for Android targets it'll write to Android's default
perfetto trace directory instead of the devlib one since Perfetto seems to be
having permission issues on non-rooted phones. This makes the collector work
without root access on Android phones, tested on Android 13 + Pixel 6. I also
renamed the output file to 'devlib-trace.perfetto-trace' just to avoid
potential clashes with perfetto traces collected in other ways.

Add the "is_running" function that can be used to check if a given
process is running on the target device. It will return True if a
process matching the name is found and Falsa otherwise.

Signed-off-by: Kajetan Puchalski <[email protected]>
Add a Collector for accessing Google's Perfetto tracing infrastructure.
The Collector takes a path to an on-device config file, starts tracing
in the background using the perfetto binary and then stops by killing
the tracing process.

Signed-off-by: Kajetan Puchalski <[email protected]>
@marcbonnici marcbonnici merged commit 9b15807 into ARM-software:master Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants