Management of multiple proxies #128

Merged

merged 2 commits into code4craft:master on May 27, 2014

Conversation

yxssfxwzy
Contributor

A proxy list is imported first; the worker threads then use these proxies evenly, preferring the faster ones.
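
To make this concrete, here is a minimal, self-contained sketch of the idea in Java. It is not the pool implementation from this pull request; the class and method names (SimpleProxyPool, borrow, giveBack) are invented for illustration. Proxies sit in a priority queue ordered by their last measured response time, so the fastest free proxy is handed out first, and every borrowed proxy must be given back, which spreads use evenly across the worker threads.

    import java.util.List;
    import java.util.concurrent.PriorityBlockingQueue;

    // Minimal sketch only; not the pull request's actual proxy pool.
    public class SimpleProxyPool {

        // One entry per proxy, ordered by how fast the proxy responded last time.
        static class ProxyEntry implements Comparable<ProxyEntry> {
            final String host;
            final int port;
            volatile long lastResponseTimeMillis;

            ProxyEntry(String host, int port) {
                this.host = host;
                this.port = port;
            }

            @Override
            public int compareTo(ProxyEntry other) {
                return Long.compare(this.lastResponseTimeMillis, other.lastResponseTimeMillis);
            }
        }

        private final PriorityBlockingQueue<ProxyEntry> queue = new PriorityBlockingQueue<>();

        // Import the proxy list as "host:port" strings.
        public SimpleProxyPool(List<String> hostPortList) {
            for (String hostPort : hostPortList) {
                String[] parts = hostPort.split(":");
                queue.offer(new ProxyEntry(parts[0], Integer.parseInt(parts[1])));
            }
        }

        // Blocks until a proxy is free; the fastest known proxy is handed out first.
        public ProxyEntry borrow() throws InterruptedException {
            return queue.take();
        }

        // The caller hands the proxy back with the time its request took, so the
        // ordering keeps reflecting recent performance.
        public void giveBack(ProxyEntry proxy, long responseTimeMillis) {
            proxy.lastResponseTimeMillis = responseTimeMillis;
            queue.offer(proxy);
        }
    }

A downloader thread would call borrow() before each request and giveBack() afterwards with the elapsed time.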

@yxssfxwzy yxssfxwzy mentioned this pull request May 19, 2014
code4craft added a commit that referenced this pull request May 27, 2014
@code4craft code4craft merged commit e310139 into code4craft:master May 27, 2014
code4craft added a commit that referenced this pull request May 27, 2014
@code4craft
Owner

A very nice feature. I reviewed it, and the implementation is good as well.

But I am not sure how to test it. Judging by the timestamps it dates back to February, so I assume this code has already been running in your own project for a while?

@code4craft
Owner

I made a small change: I moved the following code

    if (site.getHttpProxyPool().isEnable()) {
        site.returnHttpProxyToPool((HttpHost) request.getExtra(Request.PROXY),
                (Integer) request.getExtra(Request.STATUS_CODE));
    }

from Spider into HttpClientDownloader, because proxy handling is really HttpClientDownloader's own logic and should not intrude into the main flow.

Also, could statusCode and proxy be handled entirely inside HttpClientDownloader instead of being stored on the request? Or do they have some special purpose there?
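
To make the question concrete, here is a self-contained sketch of that alternative in plain java.net code (not webmagic's HttpClientDownloader): the downloader borrows a proxy, performs the request, and gives the proxy back itself, so neither the proxy nor the status code is ever attached to the request. It reuses the hypothetical SimpleProxyPool sketched earlier.

    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;

    // Illustrative only: the proxy and the status code are consumed inside the
    // downloader instead of being stored on the request.
    public class SelfContainedDownloader {

        private final SimpleProxyPool pool;

        public SelfContainedDownloader(SimpleProxyPool pool) {
            this.pool = pool;
        }

        public int download(String url) throws Exception {
            SimpleProxyPool.ProxyEntry entry = pool.borrow();
            long start = System.currentTimeMillis();
            try {
                Proxy proxy = new Proxy(Proxy.Type.HTTP,
                        new InetSocketAddress(entry.host, entry.port));
                HttpURLConnection connection =
                        (HttpURLConnection) new URL(url).openConnection(proxy);
                // The status code is read and used right here; the request (here
                // just a URL string) carries no proxy-related state.
                return connection.getResponseCode();
            } finally {
                pool.giveBack(entry, System.currentTimeMillis() - start);
            }
        }
    }

As the next comments note, this only works when nothing outside the downloader needs to know which proxy was used or how the request ended.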

@code4craft
Owner

Looking more closely, it seems the PageProcessor can also operate on the proxy to some extent, so this still has to live in Spider. I have moved it back for now; let me think about whether there is a cleaner solution.

@yxssfxwzy
Contributor Author

I originally developed this feature to crawl a particular site. I kept tweaking it while crawling, and in the end it worked well, so I submitted it. statusCode is stored on the request because when a site blocks the crawler, this can only be detected while parsing the page, for example when a CAPTCHA is requested; it cannot be detected inside HttpClientDownloader. The PageProcessor therefore needs to set page.statusCode, which is passed on as request.statusCode, so that the proxy can be adjusted accordingly when it is returned to the pool. The related tests have not been written yet.
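
A rough sketch of that flow, assuming webmagic's PageProcessor interface: the class name, the marker value 999, and the CAPTCHA check are invented for illustration; only the idea of setting the page's status code inside the processor comes from the discussion above.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Illustrative processor that flags a soft ban so the proxy pool can react
    // when the proxy is returned together with this status code.
    public class BanAwarePageProcessor implements PageProcessor {

        // Assumed marker value meaning "this proxy looks banned".
        private static final int STATUS_BANNED = 999;

        private final Site site = Site.me().setRetryTimes(3);

        @Override
        public void process(Page page) {
            // The downloader only sees an HTTP 200 here; the ban (for example a
            // CAPTCHA page) becomes visible only after parsing, which is why the
            // status code is set by the processor and carried back on the request.
            if (page.getHtml().toString().contains("captcha")) {
                page.setStatusCode(STATUS_BANNED);
                return;
            }
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }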

@yxssfxwzy yxssfxwzy deleted the proxy branch June 5, 2014 05:22