
Conversation

pasccom
Contributor

@pasccom pasccom commented Aug 18, 2020

Here is a basic implementation of a cache plugin which supports serving responses out of cache. This should be the base of the solution to #319. Tests are attached for caching and serving HTTP and HTTPS requests.

The implementation works as follows. A list file (list.txt) has been added which lists all completed requests (method, host, path and SHA512 hash of the body). When a request arrives, the plugin searches the list for it. If it is not stored, the plugin creates the entry and fetches the response from the server; otherwise it queues the cached response to the client (and does not connect to the server).
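
For illustration, a minimal sketch of that lookup; lookup_cache_entry and its signature are hypothetical helpers for this description, not the PR's actual code:

import hashlib
import os
from typing import Optional


def lookup_cache_entry(list_path: str, method: str, host: str,
                       path: str, body: Optional[bytes]) -> Optional[str]:
    # Each line of list.txt holds: method, host, path, SHA512 of the body
    # (or 'None'), and the cache file name.
    body_hash = hashlib.sha512(body).hexdigest() if body else 'None'
    if not os.path.isfile(list_path):
        return None
    with open(list_path, 'rt') as cache_list:
        for line in cache_list:
            fields = line.split()
            if len(fields) == 5 and fields[:4] == [method, host, path, body_hash]:
                return fields[4]  # name of the cache file to serve from
    return None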

It does not support interception of the SSL handshake, which means it still needs to connect to the server even when the request is cached. I think this requires a modification of the proxy.http.proxy.server class, so that it can perform an SSL handshake with the client without performing one with an upstream server.

pasccom added 16 commits August 18, 2020 15:25
In fact, it will not, because the file name is generated from plugin
UID. A more advanced scheme is needed.

This is required if we want to create responses from the cache.
The cache list is a text file whose lines are composed of five
space-separated fields:
  - The HTTP(S) request method
  - The HTTP(S) request host
  - The HTTP(S) request path
  - A SHA512 sum of the request body, if any (else 'None').
  - The name of the cache file.
Since the list file will be shared across all processes, we must ensure
reads are synchronized.
To be done:
  - Tests (currently only test for the presence of the file).
  - No support for HTTPS (explicitly disabled).
Cache is enabled by default.
In this case, the cache should be disabled by default.
@abhinavsingh
Owner

Thanks for the PR. Working on a release this week at my end. I'll review the PR later this week or over the weekend. Cheers!!!

@pasccom
Contributor Author

pasccom commented Aug 19, 2020

OK, I am working on a mode where the plugin does not contact upstream (even when intercepting TLS or when the request is not cached). This will be very useful for automated tests using proxy.py. I am mostly done; I only need to write tests for the response when the request is not cached. I will send another PR for this feature.

@codecov

codecov bot commented Aug 25, 2020

Codecov Report

Merging #425 into develop will increase coverage by 0.21%.
The diff coverage is 83.48%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #425      +/-   ##
===========================================
+ Coverage    81.08%   81.29%   +0.21%     
===========================================
  Files           79       79              
  Lines         2955     3047      +92     
===========================================
+ Hits          2396     2477      +81     
- Misses         559      570      +11     
Impacted Files                           Coverage Δ
proxy/plugin/cache/cache_responses.py    100.00% <ø> (ø)
proxy/plugin/cache/store/base.py          77.27% <75.00%> (-2.73%) ⬇️
proxy/plugin/cache/base.py                84.81% <80.00%> (-8.30%) ⬇️
proxy/plugin/cache/store/disk.py          91.30% <86.95%> (-8.70%) ⬇️
proxy/testing/test_case.py               100.00% <100.00%> (ø)
proxy/http/proxy/server.py                78.00% <0.00%> (+1.03%) ⬆️
proxy/http/parser.py                      96.25% <0.00%> (+1.87%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f63747f...b996cdb. Read the comment docs.

@abhinavsingh
Owner

@pasccom apologies for the lack of review, unfortunately I am still working towards the release at my end. It's unlikely I will be able to review it this week, but I will surely jump onto it over the weekend sometime. Thank you!!!

Owner

@abhinavsingh abhinavsingh left a comment


Hi @pasccom. Apologies for the no-show for about a month, I was swamped with releases at my end. Getting back to this PR, I left a few comments. Also, when I ran the tests locally, tests/testing/test_embed.py::TestProxyPyEmbedded::test_proxy_no_vcr failed.

Overall, I think this PR is an excellent first step and I would love to take it forward. Let's not merge this PR into the develop branch but instead into a branch named cache-server. We can iterate over the cache server implementation and make changes to the core as necessary before merging into the develop branch. Wdyt?

=================================================================================================== FAILURES ====================================================================================================
_____________________________________________________________________________________ TestProxyPyEmbedded.test_proxy_no_vcr _____________________________________________________________________________________

self = <tests.testing.test_embed.TestProxyPyEmbedded testMethod=test_proxy_no_vcr>

    def test_proxy_no_vcr(self) -> None:
        self.make_http_request_using_proxy()
    
>       self.assertIsNone(self.waitCached(Path(tempfile.gettempdir()) / 'list.txt',
                                          'GET', 'localhost', '/', 'None', 2))
E       AssertionError: 'proxy-cache-afd0b9587e804d25a50651812095c216' is not None

tests/testing/test_embed.py:117: AssertionError
--------------------------------------------------------------------------------------------- Captured stderr setup ---------------------------------------------------------------------------------------------
2020-10-02 13:33:26,846 - pid:25384 [I] load_plugins:541 - Loaded plugin proxy.http.proxy.HttpProxyPlugin
2020-10-02 13:33:26,846 - pid:25384 [I] load_plugins:541 - Loaded plugin proxy.http.server.HttpWebServerPlugin
2020-10-02 13:33:26,846 - pid:25384 [I] listen:71 - Listening on 0.0.0.0:61304
2020-10-02 13:33:26,849 - pid:25384 [D] start_workers:88 - Started acceptor#0 process 25418
2020-10-02 13:33:26,849 - pid:25384 [I] start_workers:94 - Started 1 workers
---------------------------------------------------------------------------------------------- Captured log setup -----------------------------------------------------------------------------------------------
INFO     proxy.common.flags:flags.py:541 Loaded plugin proxy.http.proxy.HttpProxyPlugin
INFO     proxy.common.flags:flags.py:541 Loaded plugin proxy.http.server.HttpWebServerPlugin
INFO     proxy.core.acceptor.pool:pool.py:71 Listening on 0.0.0.0:61304
DEBUG    proxy.core.acceptor.pool:pool.py:88 Started acceptor#0 process 25418
INFO     proxy.core.acceptor.pool:pool.py:94 Started 1 workers
============================================================================================ short test summary info ============================================================================================
FAILED tests/testing/test_embed.py::TestProxyPyEmbedded::test_proxy_no_vcr - AssertionError: 'proxy-cache-afd0b9587e804d25a50651812095c216' is not None
========================================================================================= 1 failed, 134 passed in 8.67s =========================================================================================
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/Users/abhinavsingh/Dev/proxy.py/venv38/lib/python3.8/site-packages/_pytest/capture.py", line 424, in write
    return self.buffer.write(s.encode(self.encoding, "replace"))
ValueError: I/O operation on closed file
Call stack:
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/store/disk.py", line 41, in __del__
    logger.debug("Closing cache list")
Message: 'Closing cache list'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/Users/abhinavsingh/Dev/proxy.py/venv38/lib/python3.8/site-packages/_pytest/capture.py", line 424, in write
    return self.buffer.write(s.encode(self.encoding, "replace"))
ValueError: I/O operation on closed file
Call stack:
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/store/disk.py", line 41, in __del__
    logger.debug("Closing cache list")
Message: 'Closing cache list'
Arguments: ()
make: *** [lib-test] Error 1

logger.info("Upstream connexion %s:%d %s" %
(text_(request.host), request.port if request.port else 0, text_(request.path)))

if request.port == 443:
Owner


We can perform a TLS handshake with the client without connecting to upstream. See the utils/wrap_socket utility here https://github.com/abhinavsingh/proxy.py/blob/develop/proxy/common/utils.py#L154-L166 . We'll have to re-use the server side certificates generated previously during TLS interception for the handshake with the client. Wdyt?
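
Roughly, with the standard library (a sketch only; the actual helper is the linked wrap_socket utility in proxy/common/utils.py):

import socket
import ssl


def wrap_client_connection(conn: socket.socket,
                           certfile: str, keyfile: str) -> ssl.SSLSocket:
    # Server-side TLS handshake with the downstream client, re-using a
    # certificate previously generated during TLS interception; no upstream
    # connection is involved in this step.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx.wrap_socket(conn, server_side=True)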

Contributor Author


Yes, I already looked in this direction. This is what I do locally for automatic testing purposes. I pushed my work to a branch on my fork: https://github.com/pasccom/proxy.py/tree/develop_cache_no_upstream
I did not create a PR yet because I wanted this one to be fixed first before going further. The major issue with not contacting upstream is that I ensure do_connect is falsy in

if do_connect:
    self.connect_upstream()

and consequently self.server is None at

if self.server and not self.server.closed:

Owner


Correct, the before_upstream_connection callback can return None to avoid connecting to upstream. Internally it will keep do_connect = False.

And thanks for bringing this to notice, on_client_data does ensure a server connection before proceeding. We might have to revisit the request state diagram to clean up some of these bits/expectations.
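
A rough sketch of the idea (method signatures follow the HttpProxyBasePlugin interface of proxy.py 2.x; the class name is illustrative only):

from typing import Optional

from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin


class NoUpstreamCachePlugin(HttpProxyBasePlugin):
    """Illustrative: keep do_connect False by never asking for an upstream connection."""

    def before_upstream_connection(self, request: HttpParser) -> Optional[HttpParser]:
        # Returning None signals the proxy not to connect upstream.
        return None

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        # This is where the cached response would be queued back to the client.
        return request

    def handle_upstream_chunk(self, chunk: memoryview) -> memoryview:
        return chunk

    def on_upstream_connection_close(self) -> None:
        pass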

logger.info("Client request %s:%d %s" %
(text_(request.host), request.port if request.port else 0, text_(request.path)))

if request.port == 443:
Owner


Assuming we have already completed the TLS handshake with the client above, in the before_upstream_connection callback, here we can return the response from the cache. There are two possibilities:

  1. Save the unencrypted response in the cache and encrypt it every time while serving from cache. In fact, encryption will be automatic when serving a clear-text response over the TLS connection.
  2. Save the encrypted response in the cache. Decrypt it to parse as necessary. Finally serve the data over the established TLS connection. This involves an edge case when the generated server-side certificate expires, resulting in an inability to decrypt the stored response or, in the worst case, in serving garbage data to the client.

Contributor Author


I would prefer we use solution 1. It seems more practical to me: I sometimes had to look at the cache to understand what data got into it (upstream has sporadic faults).
That is what I used on https://github.com/pasccom/proxy.py/tree/develop_cache_no_upstream

Owner


+1, agreed. Privacy concerns can be raised against solution 1, but this aspect can be controlled by using proper ACLs/permissions over the stored cache directories or files.

if not os.path.isdir(self.cache_dir):
    os.mkdir(self.cache_dir)
logger.debug("Opening cache list")
self.cache_list = open(os.path.join(self.cache_dir, 'list.txt'), 'at')
Owner

@abhinavsingh abhinavsingh Oct 2, 2020


Going forward we must remove the hardcoded list.txt; this should ideally be user provided. Please leave a TODO for now. Optionally, address it by reading the path from an environment variable, e.g. PROXY_PY_CACHE_FILENAME, and fall back to list.txt when it is unavailable.
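
Sketch of the fallback (PROXY_PY_CACHE_FILENAME is only the name suggested above, not an existing variable):

import os


def cache_list_path(cache_dir: str) -> str:
    # Fall back to 'list.txt' when PROXY_PY_CACHE_FILENAME is not set.
    return os.path.join(cache_dir,
                        os.environ.get('PROXY_PY_CACHE_FILENAME', 'list.txt'))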

Contributor Author


Yes, I agree, I will add a TODO.

Owner


jFYI, yesterday I merged #438 which will soon allow plugins to define custom flags. I will be migrating configurable parameters within various plugins (e.g. --blacklist-ip-addresses, --cache-dir, ....) to flags this week. We should be able to re-use this mechanism for list.txt.

Talking of list.txt, at some point in the far future :) we must also revisit the cache schema to ensure the best performance under various scenarios (long running instances, gigabytes of cache data, ....)
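
A hedged sketch of what a plugin-defined flag could look like once the #438 mechanism lands; the flags import path and add_argument helper here are assumptions based on how later proxy.py versions expose plugin flags, not this PR's code:

import tempfile

from proxy.common.flag import flags

flags.add_argument(
    '--cache-dir',
    type=str,
    default=tempfile.gettempdir(),
    help='Directory under which cache files and the index are stored.',
)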

for f in tmpDir.glob('proxy-cache-*'):
    if f.is_file():
        os.remove(f)
if tmpDir.joinpath('list.txt').is_file():
Owner

@abhinavsingh abhinavsingh Oct 2, 2020


During parallel testing, os.remove results in a deadlock as list.txt is also being used by other tests. See https://github.com/abhinavsingh/proxy.py/pull/425/checks?check_run_id=1197597223 . We won't be able to merge this until it is resolved or CI is bypassed.

Contributor Author


Ah OK! I did not know CI was using parallel testing. When I run the tests locally, I think they execute sequentially. Am I right?
I tried to investigate this issue but I could not understand what was happening (I am on Linux and have no way of running the Windows tests locally).

Actually, multi-threading is already a major issue with the disk-based cache, since I have to find a way to synchronize file accesses. I used a multiprocessing.Lock(), but I am not sure it is the best solution out there. Indeed, readers can access the file simultaneously, while writers must access it sequentially. A "read-write lock" would be more appropriate. Or we could go for OS file locks, but that would be platform-dependent.
For multi-processing, I am not sure what we can do, since multiprocessing.Lock() will be insufficient, as it is not aware of the other instances of tests running in parallel, right?
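
For reference, a minimal single-process readers-writer lock sketch (it only helps across threads within one process, not across acceptor processes):

import threading


class ReadWriteLock:
    """Many concurrent readers, exclusive writers."""

    def __init__(self) -> None:
        self._readers = 0
        self._readers_lock = threading.Lock()   # protects the reader counter
        self._writer_lock = threading.Lock()    # held while readers or a writer are active

    def acquire_read(self) -> None:
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()

    def release_read(self) -> None:
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()

    def acquire_write(self) -> None:
        self._writer_lock.acquire()

    def release_write(self) -> None:
        self._writer_lock.release()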

Owner


Ah OK! I did not know CI was using parallel testing. When I run the tests locally, I think they execute sequentially. Am I right?

Yes, tests run sequentially. Thinking about it further, as the same command invocation is used for CI, there is no reason they should run in parallel on GitHub actions.

For multi-processing, I am not sure what we can do, since multiprocessing.Lock() will be insufficient, as it is not aware of the other instances of tests running in parallel, right?

Yes. And the idea of sharing information about other tests doesn't sound right in the first place. There are some sporadic issues related to multiprocessing and Windows too. I have a Windows VM locally, will give it a try.

We can bypass tests conditionally. Example

@unittest.skipIf(
    os.name == 'nt',
    'Open file limit tests disabled for Windows')
class TestSetOpenFileLimit(unittest.TestCase):

Actually, multi-threading is already a major issue with the disk-based cache, since I have to find a way to synchronize file accesses.

Yes. A single file will require synchronization. Avoiding locks/synchronization can also boost cache storage performance many fold. list.txt is essentially an index file; we can even rename it appropriately in later iterations. It contains metadata about stored items in a random order, i.e. stored indexes are not ordered by any attribute. We can also see list.txt as a serialized index store, as indexes exist in the order of requests.

Currently, we don't have an update flow, i.e. once written, the cache is immutable and never updated. But, design wise, we must prepare for an update cycle :) where the cache is updated based upon time-based expiry, query parameters, etc. This decision making can be moulded into a cache plugin lifecycle callback.

I used a multiprocessing.Lock(), but I am not sure it is the best solution out there. Indeed, readers can access the file simultaneously, while writers must access it sequentially. A "read-write lock" would be more appropriate. Or we could go for OS file locks, but that would be platform-dependent.

The task of on-disk storage leads us into database design wonderland :) The good thing is that our cache storage is pluggable and we can easily add new storage mechanisms (leveldb, sqlite, cloud storages, ....)

A few thoughts around approaching disk-based storage with a zero-dependency philosophy:

  1. Storage: We are currently caching responses as a single file per request, so storage is pretty straightforward. But for a long running server or with lots of data, I do expect us to run into situations where the OS starts to run out of file descriptors (possibly fixable by tweaking kernel parameters). Let's not bother about it for now, as fixing this is relatively easy if we get our indexes right.

  2. Indexes: Without any ordering on stored indexes, we indirectly increase one or both of:
    a) Startup time: the cache plugin is busy ordering the indexes in-memory (if we decide to pre-warm the cache index).
    b) Lookup time: the cache plugin is busy iterating over unordered indexes (if we decide to consult the index file for every read).

In short, ordering (fully or partially) the indexes on disk can greatly benefit overall performance. Thinking of primitive ways :) we could maintain one index file per alphabet letter, e.g. a-list.txt, b-list.txt, ...., where each file stores data about domains starting with the respective letter (or any unicode character). Such partitioning strategies can greatly reduce lock contention across workers.

At some point we can also use an in-memory cache, e.g. an LRU cache, to minimize disk hits. Ideally, I would like to have a running copy of the indexes in-memory, but that is not always practical with lots of indexes (and not enough available RAM).
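
A tiny in-memory LRU sketch of that idea, using only the standard library; the key shape is illustrative:

from collections import OrderedDict
from typing import Optional, Tuple


class LRUIndexCache:
    """Maps (method, host, path, body_hash) to a cache file name, evicting least recently used."""

    def __init__(self, max_entries: int = 1024) -> None:
        self.max_entries = max_entries
        self._entries: 'OrderedDict[Tuple[str, str, str, str], str]' = OrderedDict()

    def get(self, key: Tuple[str, str, str, str]) -> Optional[str]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)            # mark as most recently used
        return self._entries[key]

    def put(self, key: Tuple[str, str, str, str], cache_file: str) -> None:
        self._entries[key] = cache_file
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)     # evict the least recently used entry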

  3. Synchronization: There is an in-built pubsub mechanism in proxy.py. Try it independently here: https://github.com/abhinavsingh/proxy.py/blob/develop/examples/pubsub_eventing.py

For some background, the eventing mechanism was developed for devtools integration. It allows proxy.py traffic to be inspected using Chrome devtools. devtools expects a websocket connection sending a stream of events as described in the devtools protocol specification. In short, we do have a mechanism to produce a single stream of custom events from various worker processes. Most importantly, the eventing mechanism relieves us from synchronization pains.

An idea could be to simply publish the cache payload as events from within the cache plugin. Then, a single global subscriber thread, or multiple subscriber threads (one per worker process), can consume the events and write indexes to disk either synchronously (as events are received) or in batches.
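
Sketching the single-writer idea with a plain multiprocessing queue rather than proxy.py's eventing API, just to show the shape:

import multiprocessing
import os
import tempfile
import threading


def index_writer(queue, list_path):
    # Single consumer: drains index entries published by the workers and appends
    # them to the index file, so no file-level locking is required.
    with open(list_path, 'at') as cache_list:
        while True:
            entry = queue.get()
            if entry is None:              # sentinel to stop the writer
                break
            cache_list.write(entry + '\n')
            cache_list.flush()


# Usage sketch: workers call queue.put('GET example.org / None proxy-cache-...')
# while a single background thread runs index_writer().
queue = multiprocessing.Queue()
writer = threading.Thread(
    target=index_writer,
    args=(queue, os.path.join(tempfile.gettempdir(), 'list.txt')),
    daemon=True)
writer.start()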

Hopefully, we can mix some of the above ideas to build our own storage layer :)

Contributor Author


We can bypass tests conditionally. Example

@unittest.skipIf(
    os.name == 'nt',
    'Open file limit tests disabled for Windows')
class TestSetOpenFileLimit(unittest.TestCase):

Yes, I knew we could do this, but I wanted to check first whether you had a better understanding of what happens on Windows. The current output from the tests is a bit terse and I did not want to push test commits only to get more verbose output.

Yes. A single file will require synchronization. Avoiding locks/synchronization can also boost cache storage performance many fold. list.txt is essentially an index file; we can even rename it appropriately in later iterations. It contains metadata about stored items in a random order, i.e. stored indexes are not ordered by any attribute. We can also see list.txt as a serialized index store, as indexes exist in the order of requests.

Exactly, that is how I designed it.

Currently, we don't have an update flow, i.e. once written, the cache is immutable and never updated. But, design wise, we must prepare for an update cycle :) where the cache is updated based upon time-based expiry, query parameters, etc. This decision making can be moulded into a cache plugin lifecycle callback.

Yes, I kept that for later, because I think there is quite a large amount of work needed to implement all the cache expiry features.

The task of on-disk storage leads us into database design wonderland :) The good thing is that our cache storage is pluggable and we can easily add new storage mechanisms (leveldb, sqlite, cloud storages, ....)

A few thoughts around approaching disk-based storage with a zero-dependency philosophy:

  1. Storage: We are currently caching responses as a single file per request, so storage is pretty straightforward. But for a long running server or with lots of data, I do expect us to run into situations where the OS starts to run out of file descriptors (possibly fixable by tweaking kernel parameters). Let's not bother about it for now, as fixing this is relatively easy if we get our indexes right.

I do not think we will run out of file descriptors, because there is only one open file per in-flight request, plus the index. This should not be too many for the OS, unless we open a lot of requests in parallel on purpose.
However, if we use it for a long time, we will run into the limit of inodes per filesystem. For Ext2/3/4 filesystems, it can be configured when formatting the disk. To avoid this, we can use subfolders, as is done e.g. in the .git/objects "cache".
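
Illustrative only: fanning cache files out into two-character subfolders, the way .git/objects does:

import os


def cache_file_path(cache_dir: str, cache_name: str) -> str:
    # e.g. 'proxy-cache-afd0b958...' ends up under '<cache_dir>/af/'.
    digest = cache_name.rsplit('-', 1)[-1]
    subdir = os.path.join(cache_dir, digest[:2])
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, cache_name)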

  2. Indexes: Without any ordering on stored indexes, we indirectly increase one or both of:
    a) Startup time: the cache plugin is busy ordering the indexes in-memory (if we decide to pre-warm the cache index).
    b) Lookup time: the cache plugin is busy iterating over unordered indexes (if we decide to consult the index file for every read).

In short, ordering (fully or partially) the indexes on disk can greatly benefit overall performance. Thinking of primitive ways :) we could maintain one index file per alphabet letter, e.g. a-list.txt, b-list.txt, ...., where each file stores data about domains starting with the respective letter (or any unicode character). Such partitioning strategies can greatly reduce lock contention across workers.

Yes, this will become an issue for large caches. The per-letter index seems a good compromise, but we have to find a reliable way to determine the letter, without having everything end up under w (many websites still use www). Or we could even go for a per-host index file; I do not know how fast os.path.is_file() is.
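
A quick sketch of the per-letter idea with a naive www strip (purely illustrative):

import string


def index_file_for(host: str) -> str:
    # Strip a leading 'www.' so most hosts do not all end up under 'w-list.txt'.
    if host.startswith('www.'):
        host = host[len('www.'):]
    first = host[0].lower() if host else '_'
    if first not in string.ascii_lowercase + string.digits:
        first = '_'   # bucket for unusual or internationalized hosts
    return first + '-list.txt'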

@abhinavsingh abhinavsingh changed the base branch from develop to cache-server October 2, 2020 08:24
@abhinavsingh
Owner

Will go ahead and merge this under cache-server branch. Want to iterate over your work this weekend. Thank you again!!!

@abhinavsingh abhinavsingh merged commit 4bef06d into abhinavsingh:cache-server Oct 3, 2020
@pasccom
Contributor Author

pasccom commented Oct 3, 2020

Hello @abhinavsingh! Nice to hear from you, I feared you had caught a bad case of COVID.
Yes, there is still some work to do, especially regarding multi-threading and multi-processing.
I think there is an issue with Transfer-Encoding: chunked as well; I ran into it a few days ago on my prototype. I did not have time to check whether it works or not with this version; on my prototype I just removed the header, which was set by the server.
I will let you work over the weekend and come back to see what you did. Hopefully, I will have some time to fix the remaining issues next week.
Best regards,

@abhinavsingh
Owner

Hello @abhinavsingh! Nice to hear from you, I feared you had caught a bad case of COVID.

Thank you :) I am doing fine, simply got super busy with other releases.

Yes, there is still some work to do, especially regarding multi-threading and multi-processing.
I think there is an issue with Transfer-Encoding: chunked as well; I ran into it a few days ago on my prototype. I did not have time to check whether it works or not with this version; on my prototype I just removed the header, which was set by the server.

Totally, chunked encoding is another problem we must resolve. This is probably an indication that the cache must be written only after the connection has been closed or once the request is complete.

return None

# Dynamically enable / disable cache
enabled = EnabledDescriptor(True)
Owner


I must also note that the dynamic enabled flag doesn't work on macOS Python 3.8: https://github.com/abhinavsingh/proxy.py/pull/425/checks?check_run_id=1197597150

I am able to reproduce it locally as well. Somehow the event changes are not propagated within the plugin method checks.
