Closed
Description
In Spider, the main loop polls all URLs from the scheduler and dispatches them to worker threads. But the thread pool (an ExecutorService) is created in a way that has a bug:
public static ExecutorService newFixedThreadPool(int threadSize) {
    if (threadSize <= 0) {
        throw new IllegalArgumentException("ThreadSize must be greater than 0!");
    }
    if (threadSize == 1) {
        return MoreExecutors.sameThreadExecutor();
    }
    return new ThreadPoolExecutor(threadSize - 1, threadSize - 1, 0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<Runnable>(), new ThreadPoolExecutor.CallerRunsPolicy());
}
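With ThreadPoolExecutor.CallerRunsPolicy, a task that is rejected because every worker is busy runs on the calling thread, which here is the main thread. While the main thread is processing that request, URL dispatching stops, so worker threads that finish in the meantime sit idle waiting for new work.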
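In WebMagic's multithreaded implementation, one main thread is responsible for dispatching URLs and several worker threads are responsible for processing requests. However, there is a problem: the thread pool WebMagic uses is configured with ThreadPoolExecutor.CallerRunsPolicy. This means that once the pool is saturated, requests are run on the main thread, so the other threads are left waiting after they finish their own work. This has a large impact on performance.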
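One possible way to avoid this, as a minimal sketch rather than the project's actual fix: make the dispatching thread block until a worker is free, instead of letting it run the task itself. The class `BlockingDispatchPool` and its methods are hypothetical names introduced for illustration; they are not part of WebMagic. Only standard `java.util.concurrent` classes are used.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not a WebMagic class): the caller waits for a free
// worker instead of executing the task itself, so dispatching resumes as
// soon as any worker finishes.
class BlockingDispatchPool {
    private final ExecutorService workers;
    private final Semaphore permits;

    BlockingDispatchPool(int threadSize) {
        if (threadSize <= 0) {
            throw new IllegalArgumentException("ThreadSize must be greater than 0!");
        }
        this.workers = Executors.newFixedThreadPool(threadSize);
        // One permit per worker: at most threadSize tasks are in flight.
        this.permits = new Semaphore(threadSize);
    }

    // Blocks the calling thread until a permit is available, then hands the
    // task to a worker. The task never runs on the calling thread.
    void execute(Runnable task) {
        permits.acquireUninterruptibly();
        try {
            workers.execute(() -> {
                try {
                    task.run();
                } finally {
                    permits.release(); // free the slot even if the task throws
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // the task was never accepted, so give the permit back
            throw e;
        }
    }

    // Stops accepting tasks and waits for running ones; returns false on
    // timeout or if the waiting thread is interrupted.
    boolean shutdownAndAwait(long timeout, TimeUnit unit) {
        workers.shutdown();
        try {
            return workers.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, the main loop would call `pool.execute(...)` for each polled request. When all workers are busy the loop pauses only until the first worker finishes, rather than for the full duration of a request it would otherwise have to process itself.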