This repository was archived by the owner on May 31, 2021. It is now read-only.

Commit 306ae08 (parent: b3c3da7)
Author: Vincent Michel

    Update webscraper page to match the examples

1 file changed: +16 lines, -74 lines

webscraper.rst (+16, -74)
@@ -55,9 +55,7 @@ Let's have a look into the details.
 This provides a simple multi-threaded web server:

 .. literalinclude:: examples/simple_server.py
-    :language: python
-    :start-after: ENCODING = 'utf-8'
-    :end-before: class MyRequestHandle
+    :pyobject: ThreadingHTTPServer

 It uses multiple inheritance.
 The mix-in class ``ThreadingMixIn`` provides the multi-threading support and
@@ -68,9 +66,7 @@ The request handler only has a ``GET`` method:


 .. literalinclude:: examples/simple_server.py
-    :language: python
-    :start-after: pass
-    :end-before: def run(
+    :pyobject: MyRequestHandler

 It takes the last entry in the paths with ``self.path[1:]``, i.e.
 our ``2.5``, and tries to convert it into a floating point number.
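The file ``examples/simple_server.py`` is not part of this diff, so here is only a hedged sketch of what a handler along these lines might look like, reconstructed from the prose (the path-to-float conversion and the ``Waited for ... seconds.`` output are stated in the page; the class name, exact response format, and error handling are assumptions):

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ENCODING = 'utf-8'


class MyRequestHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: sleep for the number of seconds in the URL path."""

    def do_GET(self):
        try:
            # '/2.5' -> 2.5; fall back to no waiting for non-numeric paths
            duration = float(self.path[1:])
        except ValueError:
            duration = 0.0
        time.sleep(duration)
        body = f"Waited for {duration:4.2f} seconds.\nThat's all.\n".encode(ENCODING)
        self.send_response(200)
        self.send_header('Content-Type', f'text/plain; charset={ENCODING}')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


if __name__ == '__main__':
    # Serve on an ephemeral port and issue one request against ourselves.
    server = ThreadingHTTPServer(('127.0.0.1', 0), MyRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    with urllib.request.urlopen(f'http://127.0.0.1:{port}/0.1') as resp:
        print(resp.read().decode(ENCODING))
    server.shutdown()
```

Running the module starts a throwaway server on an ephemeral port and fetches one page from it, waiting 0.1 seconds.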
@@ -94,9 +90,7 @@ the encoding specified by ``charset``.
 This is our helper to find out what the encoding of the page is:

 .. literalinclude:: examples/synchronous_client.py
-    :language: python
-    :start-after: ENCODING = 'ISO-8859-1'
-    :end-before: def get_page
+    :pyobject: get_encoding

 It falls back to ``ISO-8859-1`` if it cannot find a specification of the
 encoding.
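``examples/synchronous_client.py`` is likewise not included in this diff. A minimal sketch of such an encoding helper could look like the following; only the ``ISO-8859-1`` fallback is stated in the text, the header-parsing details are assumptions:

```python
ENCODING = 'ISO-8859-1'


def get_encoding(http_response):
    """Return the charset declared in the response headers, if any."""
    for line in http_response.splitlines():
        if line.lower().startswith('content-type:'):
            # e.g. 'Content-Type: text/plain; charset=utf-8'
            for entry in line.split(';'):
                if entry.strip().lower().startswith('charset='):
                    return entry.strip().split('=', 1)[1]
    return ENCODING  # fall back if no charset is specified


print(get_encoding('Content-Type: text/plain; charset=utf-8'))  # utf-8
print(get_encoding('Content-Length: 42'))                       # ISO-8859-1
```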
@@ -106,16 +100,12 @@ The response is a bytestring and ``.decode()`` is needed to convert it into a
 string:

 .. literalinclude:: examples/synchronous_client.py
-    :language: python
-    :start-after: return ENCODING
-    :end-before: def get_multiple_pages
+    :pyobject: get_page

 Now, we want multiple pages:

 .. literalinclude:: examples/synchronous_client.py
-    :language: python
-    :start-after: return html
-    :end-before: if __name__ == '__main__':
+    :pyobject: get_multiple_pages

 We just iterate over the waiting times and call ``get_page()`` for all
 of them.
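The iteration the text describes can be illustrated with a hedged, self-contained sketch; the stand-in ``get_page()`` below only simulates the server delay, since the real socket code is not shown in this diff:

```python
import time


def get_page(host, port, wait):
    """Stand-in for the real socket-based get_page(): simulate the delay only."""
    time.sleep(wait)
    return f"Waited for {wait:4.2f} seconds.\nThat's all.\n"


def get_multiple_pages(host, port, waits):
    """Fetch one page per waiting time, strictly one after the other."""
    pages = []
    for wait in waits:
        pages.append(get_page(host, port, wait))
    return pages


pages = get_multiple_pages('localhost', 8000, [0.1, 0.2])
print(len(pages))  # 2
```

Because the calls are sequential, the total runtime is the sum of all waiting times.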
@@ -132,13 +122,10 @@ and get this output::
     It took 11.08 seconds for a total waiting time of 11.00.
     Waited for 1.00 seconds.
     That's all.
-
     Waited for 5.00 seconds.
     That's all.
-
     Waited for 3.00 seconds.
     That's all.
-
     Waited for 2.00 seconds.
     That's all.

@@ -164,16 +151,13 @@ if found.
 Again, the default encoding is ``ISO-8859-1``:

 .. literalinclude:: examples/async_page.py
-    :language: python
-    :start-after: ENCODING = 'ISO-8859-1'
-    :end-before: async def get_page
+    :pyobject: get_encoding

 The next function is way more interesting because it actually works
 asynchronously:

 .. literalinclude:: examples/async_page.py
-    :language: python
-    :start-after: return ENCODING
+    :pyobject: get_page

 The function ``asyncio.open_connection()`` opens a connection to the given URL.
 It returns a coroutine.
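``asyncio.open_connection()`` can be tried in isolation: awaiting the coroutine it returns yields a ``(reader, writer)`` pair of streams. The sketch below pairs it with a throwaway ``asyncio.start_server()`` echo server so it is self-contained; the request line mirrors the tutorial's ``GET /2.5``, everything else is an assumption of this illustration:

```python
import asyncio


async def echo_once(reader, writer):
    """Hypothetical peer: echo one line back to the client, then hang up."""
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()


async def main():
    # Throwaway echo server on an ephemeral port.
    server = await asyncio.start_server(echo_once, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]

    # Awaiting the open_connection() coroutine yields a (reader, writer) pair.
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    writer.write(b'GET /2.5 HTTP/1.0\r\n')
    await writer.drain()
    line = await reader.readline()
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    return line


line = asyncio.run(main())
print(line)  # b'GET /2.5 HTTP/1.0\r\n'
```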
@@ -224,32 +208,7 @@ The interesting things happen in a few lines in ``get_multiple_pages()``
     :start-after: pages = []
     :end-before: duration

-The ``closing`` from the standard library module ``contextlib`` starts
-the event loop within a context and closes the loop when leaving the context:
-
-.. code-block:: python
-
-    with closing(asyncio.get_event_loop()) as loop:
-        <body>
-
-The two lines above are equivalent to these five lines:
-
-.. code-block:: python
-
-    loop = asyncio.get_event_loop()
-    try:
-        <body>
-    finally:
-        loop.close()
-
-We call ``get_page()`` for each page in a loop.
-Here we decide to wrap each call in ``loop.run_until_complete()``:
-
-.. code-block:: python
-
-    for wait in waits:
-        pages.append(loop.run_until_complete(get_page(host, port, wait)))
-
+We await ``get_page()`` for each page in a loop.
 This means we wait until each page has been retrieved before asking for
 the next.
 Let's run it from the command-line to see what happens::
@@ -283,24 +242,17 @@ waiting for the answer before asking for the next page:

 The interesting part is in this loop:

-.. code-block:: python
-
-    with closing(asyncio.get_event_loop()) as loop:
-        for wait in waits:
-            tasks.append(get_page(host, port, wait))
-        pages = loop.run_until_complete(asyncio.gather(*tasks))
+.. literalinclude:: examples/async_client_blocking.py
+    :start-after: start = time.perf_counter()
+    :end-before: duration

 We append all return values of ``get_page()`` to our list of tasks.
 This allows us to send out all requests, in our case four, without
 waiting for the answers.
 After sending all of them, we wait for the answers, using:

-.. code-block:: python
-
-    loop.run_until_complete(asyncio.gather(*tasks))
+    await asyncio.gather(*tasks)

-We used ``loop.run_until_complete()`` already for each call to ``get_page()``
-in the previous section.
 The difference here is the use of ``asyncio.gather()``, which is called with all
 our tasks in the list ``tasks`` as arguments.
 The ``asyncio.gather(*tasks)`` means for our example with four list entries:
@@ -370,11 +322,8 @@ The whole program looks like this:

 The function to get one page is asynchronous, because of the ``async def``:

-
 .. literalinclude:: examples/aiohttp_client.py
-    :language: python
-    :start-after: import aiohttp
-    :end-before: def get_multiple_pages
+    :pyobject: fetch_page

 The arguments are the same as those for the previous function to retrieve one
 page plus the additional argument ``session``.
@@ -394,13 +343,9 @@ we need to ``await`` again to return the body of the page, using the method

 This is the interesting part of ``get_multiple_pages()``:

-.. code-block:: python
-
-    with closing(asyncio.get_event_loop()) as loop:
-        with aiohttp.ClientSession() as session:
-            for wait in waits:
-                tasks.append(fetch_page(session, host, port, wait))
-            pages = loop.run_until_complete(asyncio.gather(*tasks))
+.. literalinclude:: examples/aiohttp_client.py
+    :start-after: start = time.perf_counter()
+    :end-before: duration

 It is very similar to the code in the example of the time-saving implementation
 with ``asyncio``.
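Both time-saving variants boil down to the same gather pattern. A self-contained sketch (with ``asyncio.sleep()`` standing in for real network I/O, which is an assumption of this illustration) shows why the total time tracks the longest wait rather than the sum:

```python
import asyncio
import time


async def get_page(host, port, wait):
    """Stand-in for the real coroutine: simulate the server-side delay only."""
    await asyncio.sleep(wait)
    return f"Waited for {wait:4.2f} seconds.\nThat's all.\n"


async def get_multiple_pages(host, port, waits):
    # Create all coroutines first, then wait for all answers at once.
    tasks = [get_page(host, port, wait) for wait in waits]
    return await asyncio.gather(*tasks)


waits = [0.1, 0.2, 0.15]
start = time.perf_counter()
pages = asyncio.run(get_multiple_pages('localhost', 8000, waits))
duration = time.perf_counter() - start
# duration is close to max(waits) = 0.20, not sum(waits) = 0.45
print(len(pages))  # 3
```

``asyncio.gather()`` also preserves the order of its arguments, so ``pages`` lines up with ``waits`` even though the sleeps finish in a different order.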
@@ -413,13 +358,10 @@ Finally, we run this program::
     It took 5.04 seconds for a total waiting time of 11.00.
     Waited for 1.00 seconds.
     That's all.
-
     Waited for 5.00 seconds.
     That's all.
-
     Waited for 3.00 seconds.
     That's all.
-
     Waited for 2.00 seconds.
     That's all.
