ScrapySharp: C# Web Scraping Library

I heard about ScrapySharp in dotnetrocks and I have been meaning to play around with it ever since. It is an excellent open source library for .NET that may have been a port of scrapy.

As for this exercise, I decided to scrape TipidPC (TPC). TPC is an online buy and sell community website for PCs, laptops, computer accessories, and similar gadgets.

My goal is to list all of the new items for sale (IFS) and then do something with those data later on.

When I browse TPC website, there is a button there that allows me to browse the catalog under Items For Sale. Clicking on it redirects me to this page:

Fig 1:

This is a list of all items for sale, arranged by most recent first. Perfect! Just what I need to achieve my goal. Take note of the URL as we will need this later.

Now lets open up developer tools in the browser by hitting F12 and do some dissecting. Looking closer, the particular list that we want is a <ul> with an id of “item-search-results”. This list is composed of items with all the goodies which can lead us to more goodies, all of them ready for harvesting. Again, mind the URLs. I cant wait to get my hands dirty, and I hope you are as excited as I am.

Fig 2: Goodies!

Roll up your sleeves and fire up Visual Studio. Create a new console app and enter the following in the Package Manager Console:

PM> Install-Package ScrapySharp

Once you have that installed, we can now create an instance of HtmlWeb *and call the *Load() method.

var web = new HtmlWeb(); 

It’s that easy! The Load method returns an instance of HtmlDocument that represents the target webpage in its entirety. Note that ScrapySharp can also load HTML via ScrapingBrowser as shown in the official repository page. I just use the above because I find it much easier and cleaner.

But the party is just getting started. If you like JQuery, and by extension, CSS, you would be comfortable with the next piece of code.

var nodes = document.DocumentNode
    .CssSelect("#item-search-results li").ToList();
foreach (var node in nodes) 
    Console.WriteLine("Selling: " 
        + node.**CssSelect**("h2 a").Single().InnerText); 

CssSelect() method accepts CSS syntax to process the target HtmlNode. In the above code, the first CssSelect gets all the <li> items under the <ul> named “item-search-results” (see Fig 2 above). The second CssSelect call simply gets all the h2 links so the program can write them to the console.

The complete code listing is below

static void Main(string[] args) 
    var url = ""; 
    var webGet = new HtmlWeb(); 
    if (webGet.Load(url) is HtmlDocument document) 
        var nodes = document.DocumentNode
            .CssSelect("#item-search-results li")
            foreach (var node in nodes) 
                Console.WriteLine("Selling: " 
                    + node.CssSelect("h2 a").Single().InnerText); 

Running the above results to:


Using ScrapySharp is so convenient! If you have attempted scraping before, using plain HtmlAgilityPack or some other .NET library, I feel your pain.

Now go and play around with it and let me know what cool projects you are and will be working on.

comments powered by Disqus