
Conversation

pasccom
Contributor

@pasccom pasccom commented Aug 18, 2020

Here is a basic implementation of a cache plugin which supports serving responses out of cache. This should be the base of the solution to #319. Tests are attached for caching and serving HTTP and HTTPS requests.

The implementation works as follows. A list file (list.txt) has been added which lists all completed requests (method, host, path and SHA512 hash of the body). When a request arrives, the plugin searches the list for it. If it is not stored, the plugin creates the entry and fetches the response from the server; otherwise it queues the cached response to the client (and does not connect to the server).
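
For illustration, a minimal sketch of that lookup; lookup_cache_entry and its signature are hypothetical helpers for this description, not the PR's actual code:

import hashlib
import os
from typing import Optional


def lookup_cache_entry(list_path: str, method: str, host: str,
                       path: str, body: Optional[bytes]) -> Optional[str]:
    # Each line of list.txt holds: method, host, path, SHA512 of the body
    # (or 'None'), and the cache file name.
    body_hash = hashlib.sha512(body).hexdigest() if body else 'None'
    if not os.path.isfile(list_path):
        return None
    with open(list_path, 'rt') as cache_list:
        for line in cache_list:
            fields = line.split()
            if len(fields) == 5 and fields[:4] == [method, host, path, body_hash]:
                return fields[4]  # name of the cache file to serve from
    return None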

It does not support interception of the SSL handshake, which means it still needs to connect to the server even when the request is cached. I think this requires a modification of the proxy.http.proxy.server class, so that it can perform an SSL handshake with the client without performing one with an upstream server.

pasccom added 16 commits August 18, 2020 15:25
In fact, it will not, because the file name is generated from plugin
UID. A more advanced scheme is needed.

This is required if we want to create responses from the cache.
The cache list is a text file whose lines are composed of five
space-separated fields:
  - The HTTP(S) request method
  - The HTTP(S) request host
  - The HTTP(S) request path
  - A SHA512 sum of the request body, if any (else 'None').
  - The name of the cache file.
Since the list file will be shared across all processes, we must ensure
reads are synchronized.
To be done:
  - Tests (currently only test for the presence of the file).
  - No support for HTTPS (explicitly disabled).
Cache is enabled by default.
In this case, the cache should be disabled by default.
@abhinavsingh
Owner

Thanks for the PR. Working on a release this week at my end. I'll review the PR later this week or over the weekend. Cheers!!!

@pasccom
Contributor Author

pasccom commented Aug 19, 2020

OK, I am working on a mode where the plugin does not contact upstream (even when intercepting TLS or when the request is not cached). This will be very useful for automated tests using proxy.py. I am mostly done; I only need to write tests for the response when the request is not cached. I will send another PR for this feature.

@codecov

codecov bot commented Aug 25, 2020

Codecov Report

Merging #425 into develop will increase coverage by 0.21%.
The diff coverage is 83.48%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #425      +/-   ##
===========================================
+ Coverage    81.08%   81.29%   +0.21%     
===========================================
  Files           79       79              
  Lines         2955     3047      +92     
===========================================
+ Hits          2396     2477      +81     
- Misses         559      570      +11     
Impacted Files                           Coverage Δ
proxy/plugin/cache/cache_responses.py    100.00% <ø> (ø)
proxy/plugin/cache/store/base.py          77.27% <75.00%> (-2.73%) ⬇️
proxy/plugin/cache/base.py                84.81% <80.00%> (-8.30%) ⬇️
proxy/plugin/cache/store/disk.py          91.30% <86.95%> (-8.70%) ⬇️
proxy/testing/test_case.py               100.00% <100.00%> (ø)
proxy/http/proxy/server.py                78.00% <0.00%> (+1.03%) ⬆️
proxy/http/parser.py                      96.25% <0.00%> (+1.87%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f63747f...b996cdb. Read the comment docs.

@abhinavsingh
Owner

@pasccom apologies for the lack of review, unfortunately I am still working towards the release at my end. It's unlikely I will be able to review it this week, but I will surely jump onto it over the weekend sometime. Thank you!!!

Owner

@abhinavsingh abhinavsingh left a comment


Hi @pasccom. Apologies for the no-show for about a month, I was swamped with releases at my end. Getting back to this PR, I left a few comments. Also, when I ran the tests locally, tests/testing/test_embed.py::TestProxyPyEmbedded::test_proxy_no_vcr failed.

Overall, I think this PR is an excellent first step and I would love to take it forward. Let's not merge this PR into the develop branch but instead into a branch named cache-server. We can iterate over the cache server implementation and make changes to the core as necessary before merging into the develop branch. Wdyt?

=================================================================================================== FAILURES ====================================================================================================
_____________________________________________________________________________________ TestProxyPyEmbedded.test_proxy_no_vcr _____________________________________________________________________________________

self = <tests.testing.test_embed.TestProxyPyEmbedded testMethod=test_proxy_no_vcr>

    def test_proxy_no_vcr(self) -> None:
        self.make_http_request_using_proxy()
    
>       self.assertIsNone(self.waitCached(Path(tempfile.gettempdir()) / 'list.txt',
                                          'GET', 'localhost', '/', 'None', 2))
E       AssertionError: 'proxy-cache-afd0b9587e804d25a50651812095c216' is not None

tests/testing/test_embed.py:117: AssertionError
--------------------------------------------------------------------------------------------- Captured stderr setup ---------------------------------------------------------------------------------------------
2020-10-02 13:33:26,846 - pid:25384 [I] load_plugins:541 - Loaded plugin proxy.http.proxy.HttpProxyPlugin
2020-10-02 13:33:26,846 - pid:25384 [I] load_plugins:541 - Loaded plugin proxy.http.server.HttpWebServerPlugin
2020-10-02 13:33:26,846 - pid:25384 [I] listen:71 - Listening on 0.0.0.0:61304
2020-10-02 13:33:26,849 - pid:25384 [D] start_workers:88 - Started acceptor#0 process 25418
2020-10-02 13:33:26,849 - pid:25384 [I] start_workers:94 - Started 1 workers
---------------------------------------------------------------------------------------------- Captured log setup -----------------------------------------------------------------------------------------------
INFO     proxy.common.flags:flags.py:541 Loaded plugin proxy.http.proxy.HttpProxyPlugin
INFO     proxy.common.flags:flags.py:541 Loaded plugin proxy.http.server.HttpWebServerPlugin
INFO     proxy.core.acceptor.pool:pool.py:71 Listening on 0.0.0.0:61304
DEBUG    proxy.core.acceptor.pool:pool.py:88 Started acceptor#0 process 25418
INFO     proxy.core.acceptor.pool:pool.py:94 Started 1 workers
============================================================================================ short test summary info ============================================================================================
FAILED tests/testing/test_embed.py::TestProxyPyEmbedded::test_proxy_no_vcr - AssertionError: 'proxy-cache-afd0b9587e804d25a50651812095c216' is not None
========================================================================================= 1 failed, 134 passed in 8.67s =========================================================================================
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/Users/abhinavsingh/Dev/proxy.py/venv38/lib/python3.8/site-packages/_pytest/capture.py", line 424, in write
    return self.buffer.write(s.encode(self.encoding, "replace"))
ValueError: I/O operation on closed file
Call stack:
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/store/disk.py", line 41, in __del__
    logger.debug("Closing cache list")
Message: 'Closing cache list'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.8.3/Frameworks/Python.framework/Versions/3.8/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/Users/abhinavsingh/Dev/proxy.py/venv38/lib/python3.8/site-packages/_pytest/capture.py", line 424, in write
    return self.buffer.write(s.encode(self.encoding, "replace"))
ValueError: I/O operation on closed file
Call stack:
  File "/Users/abhinavsingh/Dev/proxy.py/proxy/plugin/cache/store/disk.py", line 41, in __del__
    logger.debug("Closing cache list")
Message: 'Closing cache list'
Arguments: ()
make: *** [lib-test] Error 1

logger.info("Upstream connexion %s:%d %s" %
(text_(request.host), request.port if request.port else 0, text_(request.path)))

if request.port == 443:
Owner


We can perform a TLS handshake with the client without connecting to upstream. See the utils/wrap_socket utility here https://github.com/abhinavsingh/proxy.py/blob/develop/proxy/common/utils.py#L154-L166 . We'll have to re-use the server side certificates generated previously during TLS interception for the handshake with the client. Wdyt?
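
Roughly, with the standard library (a sketch only; the actual helper is the linked wrap_socket utility in proxy/common/utils.py):

import socket
import ssl


def wrap_client_connection(conn: socket.socket,
                           certfile: str, keyfile: str) -> ssl.SSLSocket:
    # Server-side TLS handshake with the downstream client, re-using a
    # certificate previously generated during TLS interception; no upstream
    # connection is involved in this step.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx.wrap_socket(conn, server_side=True)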

Contributor Author


Yes, I already looked in this direction. This is what I do locally for automatic testing purposes. I pushed my work to a branch on my fork: https://github.com/pasccom/proxy.py/tree/develop_cache_no_upstream
I did not create a PR yet because I wanted this one to be fixed first before going further. The major issue with not contacting upstream is that I ensure do_connect is falsy in

if do_connect:
    self.connect_upstream()

and consequently self.server is None at

if self.server and not self.server.closed:

Owner


Correct, the before_upstream_connection callback can return None to avoid connecting to upstream. Internally it will keep do_connect = False.

And thanks for bringing this to notice, on_client_data does ensure a server connection before proceeding. We might have to revisit the request state diagram to clean up some of these bits/expectations.
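
A rough sketch of the idea (method signatures follow the HttpProxyBasePlugin interface of proxy.py 2.x; the class name is illustrative only):

from typing import Optional

from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin


class NoUpstreamCachePlugin(HttpProxyBasePlugin):
    """Illustrative: keep do_connect False by never asking for an upstream connection."""

    def before_upstream_connection(self, request: HttpParser) -> Optional[HttpParser]:
        # Returning None signals the proxy not to connect upstream.
        return None

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        # This is where the cached response would be queued back to the client.
        return request

    def handle_upstream_chunk(self, chunk: memoryview) -> memoryview:
        return chunk

    def on_upstream_connection_close(self) -> None:
        pass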

logger.info("Client request %s:%d %s" %
(text_(request.host), request.port if request.port else 0, text_(request.path)))

if request.port == 443:
Owner


Assuming we have already completed the TLS handshake with the client above, in the before_upstream_connection callback, here we can return the response from the cache. There are two possibilities:

  1. Save the unencrypted response in the cache and encrypt it every time while serving from cache. In fact, encryption will be automatic when serving a clear-text response over the TLS connection.
  2. Save the encrypted response in the cache. Decrypt it to parse as necessary. Finally serve the data over the established TLS connection. This involves an edge case when the generated server-side certificate expires, resulting in an inability to decrypt the stored response or, in the worst case, in serving garbage data to the client.

Contributor Author


I would prefer we use solution 1. It seems more practical to me: I sometimes had to look at the cache to understand what data got into it (upstream has sporadic faults).
That is what I used on https://github.com/pasccom/proxy.py/tree/develop_cache_no_upstream

Owner


+1, agreed. Privacy concerns can be raised against solution 1, but this aspect can be controlled by using proper ACLs/permissions over the stored cache directories or files.

if not os.path.isdir(self.cache_dir):
    os.mkdir(self.cache_dir)
logger.debug("Opening cache list")
self.cache_list = open(os.path.join(self.cache_dir, 'list.txt'), 'at')
Owner

@abhinavsingh abhinavsingh Oct 2, 2020


Going forward we must remove the hardcoded list.txt; this should ideally be user provided. Please leave a TODO for now. Optionally, address it by reading the path from an environment variable, e.g. PROXY_PY_CACHE_FILENAME, and fall back to list.txt when it is unavailable.
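
Sketch of the fallback (PROXY_PY_CACHE_FILENAME is only the name suggested above, not an existing variable):

import os


def cache_list_path(cache_dir: str) -> str:
    # Fall back to 'list.txt' when PROXY_PY_CACHE_FILENAME is not set.
    return os.path.join(cache_dir,
                        os.environ.get('PROXY_PY_CACHE_FILENAME', 'list.txt'))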

Contributor Author


Yes, I agree, I will add a TODO.

Owner


jFYI, yesterday I merged #438 which will soon allow plugins to define custom flags. I will be migrating configurable parameters within various plugins (e.g. --blacklist-ip-addresses, --cache-dir, ....) to flags this week. We should be able to re-use this mechanism for list.txt.

Talking of list.txt, at some point in the far future :) we must also revisit the cache schema to ensure the best performance under various scenarios (long running instances, gigabytes of cache data, ....)
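
A hedged sketch of what a plugin-defined flag could look like once the #438 mechanism lands; the flags import path and add_argument helper here are assumptions based on how later proxy.py versions expose plugin flags, not this PR's code:

import tempfile

from proxy.common.flag import flags

flags.add_argument(
    '--cache-dir',
    type=str,
    default=tempfile.gettempdir(),
    help='Directory under which cache files and the index are stored.',
)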

for f in tmpDir.glob('proxy-cache-*'):
    if f.is_file():
        os.remove(f)
if tmpDir.joinpath('list.txt').is_file():
Owner

@abhinavsingh abhinavsingh Oct 2, 2020


During parallel testing, os.remove results in a deadlock as list.txt is also being used by other tests. See https://github.com/abhinavsingh/proxy.py/pull/425/checks?check_run_id=1197597223 . We won't be able to merge this until it is resolved or CI is bypassed.

Contributor Author


Ah OK! I did not know CI was using parallel testing. When I run the tests locally, I think they execute sequentially. Am I right?
I tried to investigate this issue but I could not understand what was happening (I am on Linux and have no way of running the Windows tests locally).

Actually, multi-threading is already a major issue with the disk-based cache, since I have to find a way to synchronize file accesses. I used a multiprocessing.Lock(), but I am not sure it is the best solution out there. Indeed, readers can access the file simultaneously, while writers must access it sequentially. A "read-write lock" would be more appropriate. Or we could go for OS file locks, but that would be platform-dependent.
For multi-processing, I am not sure what we can do, since multiprocessing.Lock() will be insufficient, as it is not aware of the other instances of tests running in parallel, right?
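
For reference, a minimal single-process readers-writer lock sketch (it only helps across threads within one process, not across acceptor processes):

import threading


class ReadWriteLock:
    """Many concurrent readers, exclusive writers."""

    def __init__(self) -> None:
        self._readers = 0
        self._readers_lock = threading.Lock()   # protects the reader counter
        self._writer_lock = threading.Lock()    # held while readers or a writer are active

    def acquire_read(self) -> None:
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()

    def release_read(self) -> None:
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()

    def acquire_write(self) -> None:
        self._writer_lock.acquire()

    def release_write(self) -> None:
        self._writer_lock.release()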

Owner


Ah OK! I did not know CI was using parallel testing. When I run the tests locally, I think they execute sequentially. Am I right?

Yes, tests run sequentially. Thinking about it further, as the same command invocation is used for CI, there is no reason they should run in parallel on GitHub actions.

For multi-processing, I am not sure what we can do, since multiprocessing.Lock() will be insufficient, as it is not aware of the other instances of tests running in parallel, right?

Yes. And the idea of sharing information about other tests doesn't sound right in the first place. There are some sporadic issues related to multiprocessing and Windows too. I have a Windows VM locally, will give it a try.

We can bypass tests conditionally. Example

@unittest.skipIf(
    os.name == 'nt',
    'Open file limit tests disabled for Windows')
class TestSetOpenFileLimit(unittest.TestCase):

Actually, multi-threading is already a major issue with the disk-based cache, since I have to find a way to synchronize file accesses.

Yes. A single file will require synchronization. Avoiding locks/synchronization can also boost cache storage performance many fold. list.txt is essentially an index file; we can even rename it appropriately in later iterations. It contains metadata about stored items in a random order, i.e. stored indexes are not ordered by any attribute. We can also see list.txt as a serialized index store, as indexes exist in the order of requests.

Currently, we don't have an update flow, i.e. once written, the cache is immutable and never updated. But, design wise, we must prepare for an update cycle :) where the cache is updated based upon time-based expiry, query parameters, etc. This decision making can be moulded into a cache plugin lifecycle callback.

I used a multiprocessing.Lock(), but I am not sure it is the best solution out there. Indeed, readers can access the file simultaneously, while writers must access it sequentially. A "read-write lock" would be more appropriate. Or we could go for OS file locks, but that would be platform-dependent.

The task of on-disk storage leads us into database design wonderland :) The good thing is that our cache storage is pluggable and we can easily add new storage mechanisms (leveldb, sqlite, cloud storages, ....)

A few thoughts around approaching disk-based storage with a zero-dependency philosophy:

  1. Storage: We are currently caching responses as a single file per request, so storage is pretty straightforward. But for a long running server or with lots of data, I do expect us to run into situations where the OS starts to run out of file descriptors (possibly fixable by tweaking kernel parameters). Let's not bother about it for now, as fixing this is relatively easy if we get our indexes right.

  2. Indexes: Without any ordering on stored indexes, we indirectly increase one or both of:
    a) Startup time: the cache plugin is busy ordering the indexes in-memory (if we decide to pre-warm the cache index).
    b) Lookup time: the cache plugin is busy iterating over unordered indexes (if we decide to consult the index file for every read).

In short, ordering (fully or partially) the indexes on disk can greatly benefit overall performance. Thinking of primitive ways :) we could maintain one index file per alphabet letter, e.g. a-list.txt, b-list.txt, ...., where each file stores data about domains starting with the respective letter (or any unicode character). Such partitioning strategies can greatly reduce lock contention across workers.

At some point we can also use an in-memory cache, e.g. an LRU cache, to minimize disk hits. Ideally, I would like to have a running copy of the indexes in-memory, but that is not always practical with lots of indexes (and not enough available RAM).
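
A tiny in-memory LRU sketch of that idea, using only the standard library; the key shape is illustrative:

from collections import OrderedDict
from typing import Optional, Tuple


class LRUIndexCache:
    """Maps (method, host, path, body_hash) to a cache file name, evicting least recently used."""

    def __init__(self, max_entries: int = 1024) -> None:
        self.max_entries = max_entries
        self._entries: 'OrderedDict[Tuple[str, str, str, str], str]' = OrderedDict()

    def get(self, key: Tuple[str, str, str, str]) -> Optional[str]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)            # mark as most recently used
        return self._entries[key]

    def put(self, key: Tuple[str, str, str, str], cache_file: str) -> None:
        self._entries[key] = cache_file
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)     # evict the least recently used entry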

  3. Synchronization: There is an in-built pubsub mechanism in proxy.py. Try it independently here: https://github.com/abhinavsingh/proxy.py/blob/develop/examples/pubsub_eventing.py

For some background, the eventing mechanism was developed for devtools integration. It allows proxy.py traffic to be inspected using Chrome devtools. devtools expects a websocket connection sending a stream of events as described in the devtools protocol specification. In short, we do have a mechanism to produce a single stream of custom events from various worker processes. Most importantly, the eventing mechanism relieves us from synchronization pains.

An idea could be to simply publish the cache payload as events from within the cache plugin. Then, a single global subscriber thread, or multiple subscriber threads (one per worker process), can consume the events and write indexes to disk either synchronously (as events are received) or in batches.
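
Sketching the single-writer idea with a plain multiprocessing queue rather than proxy.py's eventing API, just to show the shape:

import multiprocessing
import os
import tempfile
import threading


def index_writer(queue, list_path):
    # Single consumer: drains index entries published by the workers and appends
    # them to the index file, so no file-level locking is required.
    with open(list_path, 'at') as cache_list:
        while True:
            entry = queue.get()
            if entry is None:              # sentinel to stop the writer
                break
            cache_list.write(entry + '\n')
            cache_list.flush()


# Usage sketch: workers call queue.put('GET example.org / None proxy-cache-...')
# while a single background thread runs index_writer().
queue = multiprocessing.Queue()
writer = threading.Thread(
    target=index_writer,
    args=(queue, os.path.join(tempfile.gettempdir(), 'list.txt')),
    daemon=True)
writer.start()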

Hopefully, we can mix some of the above ideas to build our own storage layer :)

Contributor Author


We can bypass tests conditionally. Example

@unittest.skipIf(
    os.name == 'nt',
    'Open file limit tests disabled for Windows')
class TestSetOpenFileLimit(unittest.TestCase):

Yes, I knew we could do this, but I wanted to check first whether you had a better understanding of what happens on Windows. The current output from the tests is a bit terse and I did not want to push test commits only to get more verbose output.

Yes. A single file will require synchronization. Avoiding locks/synchronization can also boost cache storage performance many fold. list.txt is essentially an index file; we can even rename it appropriately in later iterations. It contains metadata about stored items in a random order, i.e. stored indexes are not ordered by any attribute. We can also see list.txt as a serialized index store, as indexes exist in the order of requests.

Exactly, that is how I designed it.

Currently, we don't have an update flow, i.e. once written, the cache is immutable and never updated. But, design wise, we must prepare for an update cycle :) where the cache is updated based upon time-based expiry, query parameters, etc. This decision making can be moulded into a cache plugin lifecycle callback.

Yes, I kept that for later, because I think there is quite a large amount of work needed to implement all the cache expiry features.

The task of on-disk storage leads us into database design wonderland :) The good thing is that our cache storage is pluggable and we can easily add new storage mechanisms (leveldb, sqlite, cloud storages, ....)

A few thoughts around approaching disk-based storage with a zero-dependency philosophy:

  1. Storage: We are currently caching responses as a single file per request, so storage is pretty straightforward. But for a long running server or with lots of data, I do expect us to run into situations where the OS starts to run out of file descriptors (possibly fixable by tweaking kernel parameters). Let's not bother about it for now, as fixing this is relatively easy if we get our indexes right.

I do not think we will run out of file descriptors, because there is only one open file per in-flight request, plus the index. This should not be too many for the OS, unless we open a lot of requests in parallel on purpose.
However, if we use it for a long time, we will run into the limit of inodes per filesystem. For Ext2/3/4 filesystems, it can be configured when formatting the disk. To avoid this, we can use subfolders, as is done e.g. in the .git/objects "cache".
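
Illustrative only: fanning cache files out into two-character subfolders, the way .git/objects does:

import os


def cache_file_path(cache_dir: str, cache_name: str) -> str:
    # e.g. 'proxy-cache-afd0b958...' ends up under '<cache_dir>/af/'.
    digest = cache_name.rsplit('-', 1)[-1]
    subdir = os.path.join(cache_dir, digest[:2])
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, cache_name)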

  2. Indexes: Without any ordering on stored indexes, we indirectly increase one or both of:
    a) Startup time: the cache plugin is busy ordering the indexes in-memory (if we decide to pre-warm the cache index).
    b) Lookup time: the cache plugin is busy iterating over unordered indexes (if we decide to consult the index file for every read).

In short, ordering (fully or partially) the indexes on disk can greatly benefit overall performance. Thinking of primitive ways :) we could maintain one index file per alphabet letter, e.g. a-list.txt, b-list.txt, ...., where each file stores data about domains starting with the respective letter (or any unicode character). Such partitioning strategies can greatly reduce lock contention across workers.

Yes, this will become an issue for large caches. The per-letter index seems a good compromise, but we have to find a reliable way to determine the letter, without having everything end up under w (many websites still use www). Or we could even go for a per-host index file; I do not know how fast os.path.is_file() is.
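
A quick sketch of the per-letter idea with a naive www strip (purely illustrative):

import string


def index_file_for(host: str) -> str:
    # Strip a leading 'www.' so most hosts do not all end up under 'w-list.txt'.
    if host.startswith('www.'):
        host = host[len('www.'):]
    first = host[0].lower() if host else '_'
    if first not in string.ascii_lowercase + string.digits:
        first = '_'   # bucket for unusual or internationalized hosts
    return first + '-list.txt'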

@abhinavsingh abhinavsingh changed the base branch from develop to cache-server October 2, 2020 08:24
@abhinavsingh
Owner

Will go ahead and merge this under cache-server branch. Want to iterate over your work this weekend. Thank you again!!!

@abhinavsingh abhinavsingh merged commit 4bef06d into abhinavsingh:cache-server Oct 3, 2020
@pasccom
Contributor Author

pasccom commented Oct 3, 2020

Hello @abhinavsingh! Nice to hear from you, I feared you had caught a bad case of COVID.
Yes, there is still some work to do, especially regarding multi-threading and multi-processing.
I think there is an issue with Transfer-Encoding: chunked as well; I ran into it a few days ago on my prototype. I did not have time to check whether it works or not with this version; on my prototype I just removed the header, which was set by the server.
I will let you work over the weekend and come back to see what you did. Hopefully, I will have some time to fix the remaining issues next week.
Best regards,

@abhinavsingh
Owner

Hello @abhinavsingh! Nice to hear from you, I feared you had caught a bad case of COVID.

Thank you :) I am doing fine, simply got super busy with other releases.

Yes, there is still some work to do, especially regarding multi-threading and multi-processing.
I think there is an issue with Transfer-Encoding: chunked as well; I ran into it a few days ago on my prototype. I did not have time to check whether it works or not with this version; on my prototype I just removed the header, which was set by the server.

Totally, chunked encoding is another problem we must resolve. This is probably an indication that the cache must be written only after the connection has been closed or once the request is complete.

return None

# Dynamically enable / disable cache
enabled = EnabledDescriptor(True)
Owner


I must also note that the dynamic enabled flag doesn't work on macOS Python 3.8: https://github.com/abhinavsingh/proxy.py/pull/425/checks?check_run_id=1197597150

I am able to reproduce it locally as well. Somehow the event changes are not propagated within the plugin method checks.
