
Conversation

abhinavsingh
Owner

@abhinavsingh abhinavsingh commented Oct 4, 2020

Introduce a mechanism within the HttpParser class that maintains a running hash of the request being processed. The fingerprinting feature is currently only used by the cache plugin, so placing it within HttpParser might be overkill, since request hashes will be calculated even when the cache plugin is not enabled.

Fingerprinting removes the need to maintain an index structure. Here is how it will work (a sketch follows the list):

  1. The cache plugin uses a file named after the fingerprint of the request to store cached data.
  2. For a new incoming request, the cache file matching its fingerprint is looked up on disk. This lookup is O(1): either the cache file exists or it does not. If it exists, the request is served out of the cache.
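
A minimal sketch of the lookup flow, assuming a SHA-256 hash over the raw request bytes and a hypothetical cache_dir (names are illustrative, not the plugin's actual API):

import hashlib
import os
from typing import Optional

def request_fingerprint(raw_request: bytes) -> str:
    # Running hash of the raw request bytes, hex-encoded so it can double as a filename.
    return hashlib.sha256(raw_request).hexdigest()

def lookup_cached_response(cache_dir: str, raw_request: bytes) -> Optional[bytes]:
    # O(1) lookup: the cache file either exists under its fingerprint name or it does not.
    path = os.path.join(cache_dir, request_fingerprint(raw_request))
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            return f.read()
    return None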

Note that serving out of the cache only works for:

  1. HTTP requests
  2. HTTPS requests with TLS interception

For the 2nd case, i.e. with TLS interception, the HTTPS request fingerprint is calculated based upon both the CONNECT request and the actual cached request.
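
For illustration, such a combined fingerprint could be derived roughly like this, assuming both the CONNECT request and the decrypted inner request are available as raw bytes (a sketch, not the actual implementation):

import hashlib

def https_fingerprint(connect_request: bytes, inner_request: bytes) -> str:
    # Fold both the CONNECT request and the intercepted inner request into one running hash.
    h = hashlib.sha256()
    h.update(connect_request)
    h.update(inner_request)
    return h.hexdigest()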

Additionally:

  1. Fix some tests due to change in default cache directory
  2. Fix event propagation using multiprocessing manager
  3. Add ReplayTestCase, which decouples replay functionality from the normal test case. This also removes the need to dynamically enable the cache plugin.

TODO:

  1. Revisit placement of the hash calculation in case it results in higher CPU/RAM consumption under high traffic load.
  2. Fix the cache plugin for TLS interception mode; it is currently broken.

1) Fix some tests due to change in default cache directory
2) Fix event propagation using multiprocessing manager

I/O error on closed resource is still an issue. It doesn't result in a
test error, but a warning is thrown (tested on Python 3.8).
@codecov

codecov bot commented Oct 4, 2020

Codecov Report

Merging #441 into cache-server will increase coverage by 0.53%.
The diff coverage is 100.00%.


@@               Coverage Diff                @@
##           cache-server     #441      +/-   ##
================================================
+ Coverage         81.51%   82.04%   +0.53%     
================================================
  Files                80       82       +2     
  Lines              3083     3046      -37     
================================================
- Hits               2513     2499      -14     
+ Misses              570      547      -23     
Impacted Files Coverage Δ
proxy/dashboard/plugin.py 75.00% <ø> (+5.76%) ⬆️
proxy/__init__.py 100.00% <100.00%> (ø)
proxy/common/flags.py 100.00% <100.00%> (ø)
proxy/http/parser.py 99.39% <100.00%> (+3.14%) ⬆️
proxy/http/proxy/server.py 78.14% <100.00%> (-0.81%) ⬇️
proxy/plugin/cache/base.py 92.30% <100.00%> (+7.49%) ⬆️
proxy/plugin/cache/cache_responses.py 100.00% <100.00%> (ø)
proxy/plugin/cache/store/base.py 100.00% <100.00%> (+22.72%) ⬆️
proxy/plugin/cache/store/disk.py 97.50% <100.00%> (+6.19%) ⬆️
proxy/testing/__init__.py 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff4026b...bcba970. Read the comment docs.

@abhinavsingh

This comment has been minimized.

@pasccom
Contributor

pasccom commented Oct 4, 2020

Hello, I think these warnings come from the fact that we output the message in the cache __del__() method

def close(self) -> None:
    if self.cache_file:
        logger.debug("Closing cache file")
which is executed after the test ends (when the garbage collector releases the cache).
When running isolated tests (using python -m unittest), it is not a problem, but it seems pytest does not like this.
I did not bother to fix this because it gives useful information when using python -m unittest.
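
For illustration, the pattern described above looks roughly like this (a sketch, not the exact plugin code). The finalizer runs whenever the garbage collector releases the object, which can be after pytest has already torn down its captured log streams, hence the I/O error on a closed resource:

import logging

logger = logging.getLogger(__name__)

class CacheStore:
    def __init__(self, cache_file=None):
        self.cache_file = cache_file

    def close(self) -> None:
        if self.cache_file:
            logger.debug('Closing cache file')
            self.cache_file.close()
            self.cache_file = None

    def __del__(self) -> None:
        # Runs at garbage collection time, possibly after the test (and its
        # log capture) has finished, so the debug message hits a closed stream.
        self.close()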

@abhinavsingh
Owner Author

When running isolated tests (using python -m unittest), it is not a problem, but it seems pytest does not like this.
I did not bother to fix this because it gives useful information when using python -m unittest.

Thank you for the context. Yes, I wasn't sure why these warnings were being thrown. IIRC, these warnings come from HTTP proxy plugin testing with TLS interception on. I don't see these warnings otherwise. Will dig further into it this week.

@abhinavsingh
Owner Author

abhinavsingh commented Oct 5, 2020

Need a cleaner way to dynamically enable / disable a plugin. The current approach used by the cache response plugin doesn't work universally. The problems are:

  1. A multiprocessing.Event declared as a global variable gets forked instead of being shared between parent and child. This only impacts Python 3.8 on macOS. It will likely fail on Windows too.

  2. Replacing multiprocessing.Event with multiprocessing.Manager().Event() fixes the issue, but then we start to run into the following runtime errors:

> python -m proxy --num-workers 1
2020-10-05 10:18:55,090 - pid:1685 [I] load_plugins:318 - Loaded plugin proxy.http.proxy.HttpProxyPlugin
2020-10-05 10:18:55,091 - pid:1685 [I] listen:116 - Listening on ::1:8899
2020-10-05 10:18:55,093 - pid:1685 [I] start_workers:139 - Started 1 workers
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/__init__.py", line 14, in <module>
    from .testing.test_case import TestCase
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/testing/test_case.py", line 19, in <module>
    from ..plugin import CacheResponsesPlugin
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/__init__.py", line 11, in <module>
    from .cache import CacheResponsesPlugin, BaseCacheResponsesPlugin
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/__init__.py", line 11, in <module>
    from .base import BaseCacheResponsesPlugin
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/base.py", line 44, in <module>
    class BaseCacheResponsesPlugin(HttpProxyBasePlugin):
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/base.py", line 52, in BaseCacheResponsesPlugin
    enabled = Event()
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/base.py", line 29, in __init__
    self.manager = multiprocessing.Manager()
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/managers.py", line 579, in start
    self._process.start()
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Because it is a global variable, the event tries to initialize itself as soon as the module is imported. Manager creates a process of its own, resulting in the freeze_support issue above.
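
A minimal illustration of the difference, assuming the spawn start method (the default on macOS under Python 3.8); the commented-out pattern mirrors the failure above, while the guarded version defers Manager creation until after bootstrapping:

import multiprocessing

# Problematic: evaluated at import time in every spawned child, so the Manager
# tries to start its own helper process during bootstrapping and multiprocessing
# raises the freeze_support RuntimeError shown above.
#
#   class BaseCacheResponsesPlugin:
#       enabled = multiprocessing.Manager().Event()

def main() -> None:
    # Safe: the Manager (and its helper process) is created only once, inside
    # the main-module guard, and the Event proxy is then passed to workers.
    manager = multiprocessing.Manager()
    enabled = manager.Event()
    enabled.set()
    print(enabled.is_set())

if __name__ == '__main__':
    main()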

@abhinavsingh
Owner Author

abhinavsingh commented Oct 5, 2020

@pasccom Some notable changes:

  1. Removed the need for dynamic plugin enable / disable functionality, which was flaky in some environments.
  2. Using the fingerprint, we can remove the need to maintain an index file. See the PR description. (Yet to deprecate usage of list.txt.)

1) Temporarily disable cache response plugin with TLS interception. These
   are broken due to fingerprint mismatch. For an https request, the fingerprint
   must capture both the CONNECT and the subsequent sub-request state.
@abhinavsingh abhinavsingh changed the title from "Add request fingerprint" to "Add request fingerprint as cached filename" Oct 5, 2020
@pasccom
Contributor

pasccom commented Oct 5, 2020

Hello @abhinavsingh, I introduced list.txt because I needed to have a means to know which requests were cached and which cache file corresponds to which request (sometimes upstream sends random server errors, and I need to remove them manually from the test cache).
On second thought, I think it is not a bad idea to get rid of it. I think I can make do with the log file. Maybe we will have to improve the logging messages, I do not know how accurate they are.

@abhinavsingh
Owner Author

Hello @abhinavsingh, I introduced list.txt because I needed to have a means to know which requests were cached and which cache file corresponds to which request (sometimes upstream sends random server errors, and I need to remove them manually from the test cache).

Agreed about the various server response codes. We'll definitely disable caching for certain HTTP status codes. For replay testing, caching can be enabled for all response codes.

On second thought, I think it is not a bad idea to get rid of it. I think I can make do with the log file. Maybe we will have to improve the logging messages, I do not know how accurate they are.

Yes, the log file can work very well here. We'll have to add the fingerprint as one of the logging format parameters.

Currently we are only caching responses, but we can also cache requests (most likely behind a flag), e.g. fingerprint.request and fingerprint.response. We can add some utility code to build an index out of the cached files, either offline or in real time using eventing.
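
For example, a hypothetical offline indexer could walk the cache directory and emit a fingerprint-to-file mapping (illustrative names only; this is not existing proxy.py code):

import os
from typing import Dict

def build_cache_index(cache_dir: str) -> Dict[str, str]:
    # Map each fingerprint to its cached response file, assuming files are
    # named '<fingerprint>.response' (and optionally '<fingerprint>.request').
    index: Dict[str, str] = {}
    for name in os.listdir(cache_dir):
        fingerprint, _, kind = name.partition('.')
        if kind == 'response':
            index[fingerprint] = os.path.join(cache_dir, name)
    return index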

@abhinavsingh
Owner Author

@pasccom Created #443 to track it separately. I don't think we can cleanly proceed forward for HTTPS + TLS interception mode. The problem, as you mentioned before, is that the core expects an active upstream connection. I have discussed the other problems arising with no upstream connection (see the issue description).

Overall, it's unclear how to approach this for now. Ideally, caching should work without an upstream connection; it kind of defeats the purpose of the cache if we have to re-connect to upstream even when serving out of the cache. At the same time, we have some edge cases to handle if the upstream connection is not made (see the issue description).

Merging this one for now.

@abhinavsingh abhinavsingh merged commit 2a37cc8 into cache-server Oct 6, 2020
@abhinavsingh abhinavsingh deleted the caching branch October 6, 2020 08:13