##-- Programmable spidering of web sites with Java
zeekEye is a Sina Weibo crawler written in Java. It is based on the hetrix crawler architecture and uses the HTTPClient 4.0 and Apache 4.0 networking packages.
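Feature overview:
- Data storage: data is stored in a MySQL database, with support for concurrent multi-threaded operations.
- Functionality: simulated Weibo login, crawling of user profiles and user comments, data extraction, table creation, data composition analysis, and mutual-follow recommendations. More to come...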
------ Forks are welcome!

    git clone git@github.com:crazyacking/Spider--Java.git
    javac -cp /home/username/Documents/Spider--Java/src/cn/edu/hut/crazyacking/spider/Spider.jar WeiboSpiderStarter.java
    java -cp /home/username/Documents/Spider--Java/src/cn/edu/hut/crazyacking/spider/Spider.jar:. WeiboSpiderStarter

The default IDE is IntelliJ IDEA 14.1.4 and the development environment is jdk1.7.0. Before compiling and running, use IntelliJ IDEA to export the project source as a jar.
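conf/spider.properties is the configuration file for the project's parameters, including the database connection address, the number of parallel threads, and the upper limit on the number of items to crawl. The "options" include the following fields:
- `maxSockets` - Maximum number of parallel threads in the thread pool. Defaults to 4.
- `userAgent` - The User-Agent string sent to the remote server. Defaults to Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7 (a Chrome user-agent string).
- `cache` - The cache object. Caching is off by default; see the latest version of the code for the implementation details of the cache object.
- `pool` - A hash of thread pools containing the agents for requests. If omitted, the global `maxSockets` setting is used.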
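As a rough illustration of how the `maxSockets` and `userAgent` options might be read with their documented defaults, here is a minimal sketch using `java.util.Properties`. The class name `SpiderConfig` and the assumption that the keys appear in conf/spider.properties under exactly these names are hypothetical; the project's actual loading code may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical holder for two of the documented options; not part of zeekEye itself. */
final class SpiderConfig {
    // Defaults taken from the option descriptions above.
    static final int DEFAULT_MAX_SOCKETS = 4;
    static final String DEFAULT_USER_AGENT =
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) "
            + "AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7";

    final int maxSockets;
    final String userAgent;

    private SpiderConfig(int maxSockets, String userAgent) {
        this.maxSockets = maxSockets;
        this.userAgent = userAgent;
    }

    /** Parses properties-format text, falling back to the defaults for missing keys. */
    static SpiderConfig parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // A StringReader should not fail, but load() declares IOException.
            throw new IllegalStateException("Could not read properties", e);
        }
        // Assumed key names: "maxSockets" and "userAgent".
        int maxSockets = Integer.parseInt(
                props.getProperty("maxSockets", String.valueOf(DEFAULT_MAX_SOCKETS)).trim());
        String userAgent = props.getProperty("userAgent", DEFAULT_USER_AGENT).trim();
        return new SpiderConfig(maxSockets, userAgent);
    }
}
```

For a real file, the same `Properties.load` call could be given a `FileReader` opened on conf/spider.properties instead of a `StringReader`.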
The parameters are as follows:
- `hosts` - A string, or an array of strings, representing the host part of the targeted URL(s).
- `pattern` - The pattern against which spider tries to match the remainder (pathname + search + hash) of the URL(s).
- `cb` - A function of the form `function(window, $)`, where:
  - `this` - Will be a variable referencing the `Routes.match` return object/value, with some other goodies added from spider. For more info see http://www.cnblogs.com/crazyacking/category/686354.html
  - `window` - Will be a variable referencing the document's window.
  - `$` - Will be the variable referencing the jQuery object.
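The following sketch shows one way the `hosts` and `pattern` parameters could be checked against a URL in Java. It assumes that `pattern` is a regular expression that must match the whole remainder of the URL; the source does not specify the pattern syntax, and the `Route` class here is hypothetical rather than zeekEye's own code.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical route: a set of hosts plus a pattern for the rest of the URL. */
final class Route {
    private final Set<String> hosts;
    private final Pattern pattern;

    Route(String pattern, String... hosts) {
        this.hosts = new HashSet<String>(Arrays.asList(hosts));
        // Assumption: the pattern is a java.util.regex regular expression.
        this.pattern = Pattern.compile(pattern);
    }

    /** Rebuilds the part after the host: pathname + search + hash. */
    static String remainder(URI uri) {
        StringBuilder sb = new StringBuilder();
        if (uri.getRawPath() != null) sb.append(uri.getRawPath());
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }

    /** True when the URL's host is listed and the remainder matches the pattern. */
    boolean matches(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        return host != null
                && hosts.contains(host)
                && pattern.matcher(remainder(uri)).matches();
    }
}
```

A route built as `new Route("/u/\\d+(\\?.*)?", "weibo.com")` would then accept `http://weibo.com/u/12345?page=2` and reject URLs on other hosts or with other paths.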
`spider.get(url)` - where `url` is the web URL to fetch.
The cache currently provides the following methods:
- `get(url, cb)` - Returns `url`'s `body` field via the `cb` callback/continuation if it exists. Returns `null` otherwise. `cb` must be of the form `function(retval) {...}`.
- `getHeaders(url, cb)` - Returns `url`'s `headers` field via the `cb` callback/continuation if it exists. Returns `null` otherwise. `cb` must be of the form `function(retval) {...}`.
- `set(url, headers, body)` - Sets/saves `url`'s `headers` and `body` in the cache.
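To make the callback style of these three methods concrete, here is a minimal in-memory sketch in Java. The `MemoryCache` class and its `Callback` interface are hypothetical names chosen for illustration, and representing headers as a `Map<String, String>` is an assumption; the project's real cache object may look quite different.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory cache mirroring get / getHeaders / set. */
final class MemoryCache {
    /** Continuation of the form function(retval) {...}. */
    interface Callback<T> {
        void call(T retval);
    }

    /** Immutable pair of headers and body stored per URL. */
    private static final class Entry {
        final Map<String, String> headers;
        final String body;

        Entry(Map<String, String> headers, String body) {
            // Copy so later changes by the caller do not affect the cache.
            this.headers = Collections.unmodifiableMap(new HashMap<String, String>(headers));
            this.body = body;
        }
    }

    // ConcurrentHashMap so several crawler threads can share one cache.
    private final ConcurrentHashMap<String, Entry> entries =
            new ConcurrentHashMap<String, Entry>();

    /** Saves the headers and body for a URL, replacing any earlier entry. */
    void set(String url, Map<String, String> headers, String body) {
        entries.put(url, new Entry(headers, body));
    }

    /** Passes the cached body to cb, or null when the URL is not cached. */
    void get(String url, Callback<String> cb) {
        Entry entry = entries.get(url);
        cb.call(entry == null ? null : entry.body);
    }

    /** Passes the cached headers to cb, or null when the URL is not cached. */
    void getHeaders(String url, Callback<Map<String, String>> cb) {
        Entry entry = entries.get(url);
        cb.call(entry == null ? null : entry.headers);
    }
}
```

On jdk1.7.0 a caller would pass an anonymous `Callback` implementation; on Java 8 or later a lambda works as well.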
`spider.log(level)` - where `level` is a string that can be any of "debug", "info", or "error".
###Source Code
The source code of zeekEye is made available for study purposes only. Neither it, its source code, nor its byte code may be modified and recompiled for public use by anyone except us.
We do accept and encourage private modifications with the intent for said modifications to be added to the official public version.
Thank you for reading this help document. If you have any suggestions, feedback is welcome.