A Node.js crawler for csdn.com (Node crawler, CSDN crawler) that crawls all of a CSDN user's articles. The code is for testing and learning only; do not use it for improper purposes.

zeeklog/csdn-crawler

A Node.js crawler that fetches a user's articles from csdn.com.

For Node.js applications only; it does not work in the browser.

  • Pass options.username to get that user's article list (the default length is 5).
  • Article images are uploaded to your own Qiniu Cloud storage when you provide the config object options.qiniu<object>.
  • Set options.page and options.size to control the page index and page size sent to the API.
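
As a rough illustration of how these options might be defaulted, the sketch below fills in `page` and `size` when they are omitted. `withDefaults` is a hypothetical helper, not part of this package. The default `size` of 5 comes from the list above; the default `page` of 1 is an assumption based on the example config.

```javascript
// Hypothetical helper: not part of csdn-crawler, only illustrates the options above.
function withDefaults(options) {
    return {
        ...options,
        page: options.page ?? 1, // assumed default, matching the example config
        size: options.size ?? 5  // default list length stated above
    }
}

console.log(withDefaults({ username: 'weixin_45534242' }))
```

Values passed by the caller win over the defaults, so `withDefaults({ username: 'someone', size: 20 })` keeps `size: 20`.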

Why did I write this?

  • I wanted some data to fill my database for big-data testing, but writing it myself seemed too hard (because I am so lazy).

  • Many coders probably face the same problem, so let me make this job easier.

  • WARN: This repo is only for testing and study. Do not use it to run pressure tests against csdn.com. And CSDN sucks!

How it works

# dependencies
 cheerio
 html-to-md
 pinyin
 request-promise
 
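 # Fetch the target document with request-promise
 # Parse the HTML document with cheerio to get the article content
 # Convert the HTML content to Markdown with html-to-md
 # Generate the article alias with pinyin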
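
The four steps above could be sketched roughly as follows. This is not the package's actual code: the function names are hypothetical, the hyphen-joined alias format is an assumption, and the nested-array input to `syllablesToAlias` is the shape the pinyin package is assumed to return (for example `[['ni'], ['hao']]`).

```javascript
// A sketch of the fetch -> parse -> convert -> alias steps; names are hypothetical.
// Third-party modules are required lazily, inside the functions that use them.

// Step 1: fetch the page; request-promise resolves with the response body.
function fetchPage(url) {
    const rp = require('request-promise')
    return rp(url)
}

// Steps 2 and 3: pick the article node with cheerio, then convert it to Markdown.
function articleHtmlToMarkdown(pageHtml, contentSelector = '#article_content') {
    const cheerio = require('cheerio')
    const html2md = require('html-to-md')
    const $ = cheerio.load(pageHtml)
    const contentHtml = $(contentSelector).html() || ''
    return html2md(contentHtml)
}

// Step 4: join pinyin syllables into an alias.
// `syllables` is assumed to be a nested array, one inner array per character.
function syllablesToAlias(syllables) {
    return syllables
        .map(group => group[0])
        .filter(Boolean)
        .join('-')
        .toLowerCase()
}

console.log(syllablesToAlias([['ni'], ['hao']])) // prints "ni-hao"
```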
 

Usage guide

1. Fill in your own config

// Example:
const options = {
    username: 'weixin_45534242', // target username
    page: 1, // the page index you are crawling
    size: 5, // page size
    link: '', // the user-center article list API; you can find it on csdn.com with the browser dev tools (F12)
    businessType: 'blog', // article type to crawl; only 'blog' is supported for now
    sleepTime: null, // unit: ms. Pause between requests while crawling; it may save your IP from being blocked
    supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'], // file types supported for upload
    imagePrefixName: 'crawl-', // name prefix for uploaded images
    contentNodeIdentify: '#article_content', // id selector of the article content node in the HTML
    qiniu: {
        zone: '', // your Qiniu Cloud zone
        scope: '', // your Qiniu scope name (the storage name)
        useHttpsDomain: true, // whether to use an HTTPS domain
        useCdnDomain: true, // whether to use your CDN domain; it is used for article list images
        baseQiNiuCdnApi: '', // your CDN domain name
        remoteFilePath: '/openStatic', // the folder path where you want to save images
        isNeedWaterMark: false, // if `true`, you need to provide the Qiniu image style name below
        imageStyleSplitQuote: '&', // the separator used in the image src link, e.g. https://qiniu.com/asd.png&scale-my-img
        imageStyleName: '', // your Qiniu style name
        accessKey: '', // Qiniu Cloud accessKey
        secretKey: '', // Qiniu Cloud secretKey
        imageBaseAlt: '' // prefix for the image alt text
    }
}
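
To make the image-related options more concrete, here is a minimal sketch of how they might combine. Both helpers are hypothetical and the package's real logic may differ; in particular, it assumes `baseQiNiuCdnApi` holds a bare domain name and that the style name is appended only when `isNeedWaterMark` is `true`.

```javascript
// Hypothetical helpers; they only illustrate the config fields above.

// Return true when the URL's file extension is listed in supportImageType.
function isSupportedImage(url, supportImageType) {
    const path = url.split(/[?#]/)[0] // drop query string and fragment
    const dot = path.lastIndexOf('.')
    if (dot === -1) return false
    const ext = path.slice(dot + 1).toLowerCase()
    return supportImageType.includes(ext)
}

// Build the final image link from the qiniu options.
// Assumes baseQiNiuCdnApi is a bare domain such as 'cdn.example.com'.
function buildImageUrl(qiniu, fileName) {
    const protocol = qiniu.useHttpsDomain ? 'https' : 'http'
    const base = `${protocol}://${qiniu.baseQiNiuCdnApi}${qiniu.remoteFilePath}/${fileName}`
    return qiniu.isNeedWaterMark
        ? `${base}${qiniu.imageStyleSplitQuote}${qiniu.imageStyleName}`
        : base
}

const demoQiniu = {
    useHttpsDomain: true,
    baseQiNiuCdnApi: 'cdn.example.com', // placeholder domain
    remoteFilePath: '/openStatic',
    isNeedWaterMark: true,
    imageStyleSplitQuote: '&',
    imageStyleName: 'scale-my-img'
}

console.log(isSupportedImage('https://example.com/a/pic.PNG?x=1', ['jpg', 'png'])) // prints true
console.log(buildImageUrl(demoQiniu, 'crawl-asd.png'))
// prints https://cdn.example.com/openStatic/crawl-asd.png&scale-my-img
```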

2. Start using csdnCrawler

// You can find this code in `./demo.js`
const csdnCrawler = require('./index')
const exampleOptions = {
    username: 'weixin_45534242',
    page: 1,
    size: 5,
    link: '',
    businessType: 'blog',
    sleepTime: null, // Unit is: ms
    supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'],
    imagePrefixName: 'crawl-',
    contentNodeIdentify: '#article_content',
    qiniu: {}
}

csdnCrawler(exampleOptions, data => {
    console.log(data)
    console.log(`==============================`)
    console.log(`===  Demo Crawl Succeed !!!===`)
    console.log(`==============================`)
    console.log(`Total Data length : ${data.length}`)
})
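
If you prefer `async`/`await` over a callback, a thin wrapper along these lines may help. `crawlAsync` is a hypothetical helper, not part of this package. It assumes the callback receives only the data, as in the demo above; how the real crawler reports errors is not documented here, so only synchronous throws reject the promise. A stub crawler stands in for `require('./index')` so the snippet runs on its own.

```javascript
// Hypothetical wrapper: turns a callback-style crawler into a Promise.
// Assumption: the callback is called once with the crawled data and nothing else.
function crawlAsync(crawler, options) {
    // A synchronous throw inside `crawler` rejects the returned promise.
    return new Promise(resolve => crawler(options, resolve))
}

// Stub standing in for the real crawler; replace it with require('./index').
const fakeCrawler = (options, callback) => callback([{ title: 'demo' }])

crawlAsync(fakeCrawler, { username: 'weixin_45534242' }).then(data => {
    console.log(`Total Data length : ${data.length}`)
})
```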

Warning again (to save me from trouble)

  • Don't use this for bad purposes.
  • It may have bad consequences in CN (maybe even break the law...) and will drive you crazy.
  • Please use this only for testing and study purposes.
