Management of multiple proxies #128

Merged

merged 2 commits into code4craft:master on May 27, 2014

Conversation

yxssfxwzy
Contributor

A proxy list is imported first; the worker threads then use these proxies evenly, preferring the faster ones.
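
To make this concrete, here is a minimal, self-contained sketch of the idea in Java. It is not the pool implementation from this pull request; the class and method names (SimpleProxyPool, borrow, giveBack) are invented for illustration. Proxies sit in a priority queue ordered by their last measured response time, so the fastest free proxy is handed out first, and every borrowed proxy must be given back, which spreads use evenly across the worker threads.

    import java.util.List;
    import java.util.concurrent.PriorityBlockingQueue;

    // Minimal sketch only; not the pull request's actual proxy pool.
    public class SimpleProxyPool {

        // One entry per proxy, ordered by how fast the proxy responded last time.
        static class ProxyEntry implements Comparable<ProxyEntry> {
            final String host;
            final int port;
            volatile long lastResponseTimeMillis;

            ProxyEntry(String host, int port) {
                this.host = host;
                this.port = port;
            }

            @Override
            public int compareTo(ProxyEntry other) {
                return Long.compare(this.lastResponseTimeMillis, other.lastResponseTimeMillis);
            }
        }

        private final PriorityBlockingQueue<ProxyEntry> queue = new PriorityBlockingQueue<>();

        // Import the proxy list as "host:port" strings.
        public SimpleProxyPool(List<String> hostPortList) {
            for (String hostPort : hostPortList) {
                String[] parts = hostPort.split(":");
                queue.offer(new ProxyEntry(parts[0], Integer.parseInt(parts[1])));
            }
        }

        // Blocks until a proxy is free; the fastest known proxy is handed out first.
        public ProxyEntry borrow() throws InterruptedException {
            return queue.take();
        }

        // The caller hands the proxy back with the time its request took, so the
        // ordering keeps reflecting recent performance.
        public void giveBack(ProxyEntry proxy, long responseTimeMillis) {
            proxy.lastResponseTimeMillis = responseTimeMillis;
            queue.offer(proxy);
        }
    }

A downloader thread would call borrow() before each request and giveBack() afterwards with the elapsed time.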

@yxssfxwzy yxssfxwzy mentioned this pull request May 19, 2014
code4craft added a commit that referenced this pull request May 27, 2014
@code4craft code4craft merged commit e310139 into code4craft:master May 27, 2014
code4craft added a commit that referenced this pull request May 27, 2014
@code4craft
Owner

A very nice feature. I reviewed it, and the implementation is good as well.

But I am not sure how to test it. Judging by the timestamps it dates back to February, so I assume this code has already been running in your own project for a while?

@code4craft
Owner

I made a small change: I moved the following code

    if (site.getHttpProxyPool().isEnable()) {
        site.returnHttpProxyToPool((HttpHost) request.getExtra(Request.PROXY),
                (Integer) request.getExtra(Request.STATUS_CODE));
    }

from Spider into HttpClientDownloader, because proxy handling is really HttpClientDownloader's own logic and should not intrude into the main flow.

Also, could statusCode and proxy be handled entirely inside HttpClientDownloader instead of being stored on the request? Or do they have some special purpose there?
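
To make the question concrete, here is a self-contained sketch of that alternative in plain java.net code (not webmagic's HttpClientDownloader): the downloader borrows a proxy, performs the request, and gives the proxy back itself, so neither the proxy nor the status code is ever attached to the request. It reuses the hypothetical SimpleProxyPool sketched earlier.

    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;

    // Illustrative only: the proxy and the status code are consumed inside the
    // downloader instead of being stored on the request.
    public class SelfContainedDownloader {

        private final SimpleProxyPool pool;

        public SelfContainedDownloader(SimpleProxyPool pool) {
            this.pool = pool;
        }

        public int download(String url) throws Exception {
            SimpleProxyPool.ProxyEntry entry = pool.borrow();
            long start = System.currentTimeMillis();
            try {
                Proxy proxy = new Proxy(Proxy.Type.HTTP,
                        new InetSocketAddress(entry.host, entry.port));
                HttpURLConnection connection =
                        (HttpURLConnection) new URL(url).openConnection(proxy);
                // The status code is read and used right here; the request (here
                // just a URL string) carries no proxy-related state.
                return connection.getResponseCode();
            } finally {
                pool.giveBack(entry, System.currentTimeMillis() - start);
            }
        }
    }

As the next comments note, this only works when nothing outside the downloader needs to know which proxy was used or how the request ended.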

@code4craft
Owner

Looking more closely, it seems the PageProcessor can also operate on the proxy to some extent, so this still has to live in Spider. I have moved it back for now; let me think about whether there is a cleaner solution.

@yxssfxwzy
Contributor Author

I originally developed this feature to crawl a particular site. I kept tweaking it while crawling, and in the end it worked well, so I submitted it. statusCode is stored on the request because when a site blocks the crawler, this can only be detected while parsing the page, for example when a CAPTCHA is requested; it cannot be detected inside HttpClientDownloader. The PageProcessor therefore needs to set page.statusCode, which is passed on as request.statusCode, so that the proxy can be adjusted accordingly when it is returned to the pool. The related tests have not been written yet.
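
A rough sketch of that flow, assuming webmagic's PageProcessor interface: the class name, the marker value 999, and the CAPTCHA check are invented for illustration; only the idea of setting the page's status code inside the processor comes from the discussion above.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Illustrative processor that flags a soft ban so the proxy pool can react
    // when the proxy is returned together with this status code.
    public class BanAwarePageProcessor implements PageProcessor {

        // Assumed marker value meaning "this proxy looks banned".
        private static final int STATUS_BANNED = 999;

        private final Site site = Site.me().setRetryTimes(3);

        @Override
        public void process(Page page) {
            // The downloader only sees an HTTP 200 here; the ban (for example a
            // CAPTCHA page) becomes visible only after parsing, which is why the
            // status code is set by the processor and carried back on the request.
            if (page.getHtml().toString().contains("captcha")) {
                page.setStatusCode(STATUS_BANNED);
                return;
            }
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }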

@yxssfxwzy yxssfxwzy deleted the proxy branch June 5, 2014 05:22