DotnetSpider


DotnetSpider, a .NET Standard web crawling library similar to WebMagic and Scrapy. It is a lightweight, efficient and fast high-level web crawling & scraping framework for .NET.

DESIGN

DEVELOPMENT ENVIRONMENT

  • Visual Studio 2017 (15.3 or later)

  • .NET Core 2.0 or later

  • MySQL for data storage. Download MySQL and grant privileges, for example:

      grant all on *.* to 'root'@'localhost' IDENTIFIED BY '' with grant option;
    
      flush privileges;
    

OPTIONAL ENVIRONMENT

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the project DotnetSpider.Sample in the solution.

BASE USAGE

Base usage code: see the DotnetSpider.Sample project; a minimal sketch follows.
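A minimal sketch of the base (non-entity) usage, assuming the DotnetSpider 2.x style API (Site, Spider.Create, BasePageProcessor, ConsolePipeline); namespaces and exact signatures may differ between versions, so treat the sample project as authoritative.

// NOTE: sketch based on the DotnetSpider 2.x sample API; adjust namespaces and
// signatures to the package version you actually reference.
using DotnetSpider.Core;
using DotnetSpider.Core.Pipeline;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Scheduler;
using DotnetSpider.Core.Selector;

public class BaseUsageSpider
{
	public static void Run()
	{
		// Site-level settings: encoding, headers, cookies, start URLs, etc.
		var site = new Site { EncodingName = "UTF-8" };
		site.AddStartUrl("http://news.baidu.com/ns?word=可乐&tn=news&from=news&cl=2&pn=0&rn=20&ct=1");

		Spider spider = Spider.Create(site,
				new QueueDuplicateRemovedScheduler(), // in-memory queue that removes duplicate requests
				new MyPageProcessor())                // extracts data from every downloaded page
			.AddPipeline(new ConsolePipeline());      // prints extracted results to the console
		spider.ThreadNum = 1;
		spider.Run();
	}

	private class MyPageProcessor : BasePageProcessor
	{
		protected override void Handle(Page page)
		{
			// Pull all result titles out of the page with an XPath selector.
			var titles = page.Selectable.SelectList(Selectors.XPath(".//h3[@class='c-title']/a")).GetValues();
			page.AddResultItem("titles", titles);
		}
	}
}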

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code in the DotnetSpider.Sample project.

public class EntityModelSpider
{
	public static void Run()
	{
		Spider spider = new Spider();
		spider.Run();
	}

	private class Spider : EntitySpider
	{
		protected override void MyInit(params string[] arguments)
		{
			// Search keyword ("可乐|雪碧" = "Cola|Sprite").
			var word = "可乐|雪碧";
			// Seed the crawl with a Baidu News search URL and pass the keyword as an
			// environment value so the entity below can read it via SelectorType.Enviroment.
			AddStartUrl(string.Format("http://news.baidu.com/ns?word={0}&tn=news&from=news&cl=2&pn=0&rn=20&ct=1", word), new Dictionary<string, dynamic> { { "Keyword", word } });
			// Register the entity that describes how each search result is extracted.
			AddEntityType<BaiduSearchEntry>();
			// Print extracted entities to the console.
			AddPipeline(new ConsoleEntityPipeline());
		}

		[TableInfo("baidu", "baidu_search_entity_model")]
		[EntitySelector(Expression = ".//div[@class='result']", Type = SelectorType.XPath)]
		class BaiduSearchEntry : BaseEntity
		{
			[Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
			public string Keyword { get; set; }

			[Field(Expression = ".//h3[@class='c-title']/a")]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			public string Title { get; set; }

			[Field(Expression = ".//h3[@class='c-title']/a/@href")]
			public string Url { get; set; }

			[Field(Expression = ".//div/p[@class='c-author']/text()")]
			[ReplaceFormatter(NewValue = "-", OldValue = "&nbsp;")]
			public string Website { get; set; }

			[Field(Expression = ".//div/span/a[@class='c-cache']/@href")]
			public string Snapshot { get; set; }

			[Field(Expression = ".//div[@class='c-summary c-row ']", Option = FieldOptions.InnerText)]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			[ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
			public string Details { get; set; }

			[Field(Expression = ".", Option = FieldOptions.InnerText)]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			[ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
			public string PlainText { get; set; }
		}
	}
}

public static void Main()
{
	EntityModelSpider.Run();
}

Run via Startup

Command: -s:[spider type name | TaskName attribute] -i:[identity] -a:[arg1,arg2...] --tid:[taskId] -n:[name] -c:[configuration file path or name]
  1. -s: Type name of the spider or its TaskName attribute, for example: DotnetSpider.Sample.BaiduSearchSpider
  2. -i: Set the identity.
  3. -a: Pass arguments to the spider's Run method.
  4. --tid: Set the task id.
  5. -n: Set the name.
  6. -c: Set the configuration file path or name, for example to run with a customized config: -c:app.my.config
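For example, assuming the sample project's compiled output is named DotnetSpider.Sample.dll (the assembly name here is only an assumption), an invocation might look like:

# Assembly name and the identity/task-id/name values are placeholders; the flags are those listed above.
dotnet DotnetSpider.Sample.dll -s:DotnetSpider.Sample.BaiduSearchSpider -i:baidu-news-001 --tid:1001 -n:BaiduNews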

WebDriver Support

When you want to crawl a page whose content is loaded by JavaScript, there is only one thing to do: set the downloader to WebDriverDownloader.

Downloader = new WebDriverDownloader(Browser.Chrome);

See a complete sample
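A sketch of the assignment in context, reusing the site and MyPageProcessor from the base-usage sketch above; Browser.Chrome requires a matching ChromeDriver.exe (see the notes below):

// Sketch: assign the downloader on the spider instance before running.
Spider spider = Spider.Create(site,
		new QueueDuplicateRemovedScheduler(),
		new MyPageProcessor())
	.AddPipeline(new ConsolePipeline());
// Download pages through a real Chrome instance so JavaScript-rendered
// content is present before extraction.
spider.Downloader = new WebDriverDownloader(Browser.Chrome);
spider.Run();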

NOTE:

  1. Make sure there is a ChromeDriver.exe in the bin folder when you use Chrome. You can add it to your project via the NuGet package manager: Chromium.ChromeDriver (an install command example follows this list).
  2. Make sure you have already added a *.webdriver Firefox profile when you use Firefox: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
  3. Make sure there is a PhantomJS.exe in the bin folder when you use PhantomJS. You can add it to your project via the NuGet package manager: PhantomJS
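For example, the driver packages can be added from the command line (package IDs as listed above; pick the versions you need):

# Add the WebDriver binaries to the project so they are copied to the bin folder.
dotnet add package Chromium.ChromeDriver
dotnet add package PhantomJS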

Store log and status in a database

  1. Set SystemConnection in app.config (a sketch follows this list).
  2. Update nlog.config following https://github.com/dotnetcore/DotnetSpider/blob/master/src/DotnetSpider.Extension.Test/nlog.config
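A minimal sketch of the SystemConnection entry in app.config, assuming MySQL with the standard Connector/NET connection-string format; adjust server, credentials, and provider to your environment:

<connectionStrings>
	<!-- The name "SystemConnection" comes from this README; the connection string below
	     is only an example for a local MySQL instance with an empty root password. -->
	<add name="SystemConnection"
	     connectionString="Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None"
	     providerName="MySql.Data.MySqlClient" />
</connectionStrings>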

Web Manager

https://github.com/zlzforever/DotnetSpider.Hub

  1. Depends on a CI platform; gitlab-ci is used at the moment.
  2. Depends on Scheduler.NET: https://github.com/zlzforever/Scheduler.NET
  3. More documentation to follow.


NOTICE

When you use the Redis scheduler, please update your Redis config:

timeout 0 
tcp-keepalive 60

Buy me a coffee

AREAS FOR IMPROVEMENT

QQ Group: 477731655 Email: zlzforever@163.com
