Building a .NET Scraper

Wouldn't it be cool if you could grab content directly from a Web site and use that content in an application of your own? The idea is nothing new; it's just good old copy and paste. But what if the content you want is dynamic, and your application needs to reflect that ever-changing content? Manual copying and pasting won't suffice. You need an easy-to-use, programmatic way of obtaining the information.

This technique is known as page scraping. Without .NET, it usually meant loading the remote Web page into your application and writing custom parsing procedures to extract the necessary data, and whenever the target Web page changed format, you had to rewrite your parsing logic. With .NET, you can resolve this dilemma with Visual Basic .NET or C#, Web Services, Windows services, and a timer control. This article touches on techniques you can use within .NET to accomplish this.

Here's a brief overview of how I used .NET technology to create a Web page scraper. It is a relatively simple scraper that goes out to a Web site, scrapes a certain portion of text from the page, and saves it to a file in a specified directory. Because the data on this site changes every 5 minutes or so, I use a timer control inside a Windows service; the service acts as a wrapper that calls the Web Service's methods every 5 minutes. Scraping the Web site this frequently keeps the saved data current.
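The timer arrangement described above can be sketched as follows. This is only an illustration, not the article's actual service code: the module name ScrapeTimerSketch, the function CreateScrapeTimer, and the scrape callback are hypothetical names, and System.Timers.Timer is assumed as the timer type.

```vbnet
Imports System

Module ScrapeTimerSketch
    ' Assumption: a fixed 5-minute interval, matching how often the
    ' article says the target site changes.
    Public ReadOnly ScrapeInterval As TimeSpan = TimeSpan.FromMinutes(5)

    ' Builds a timer that runs the supplied scrape routine on every tick.
    ' The caller decides when to Start and Stop it.
    Public Function CreateScrapeTimer(ByVal scrape As Action) As System.Timers.Timer
        Dim t As New System.Timers.Timer(ScrapeInterval.TotalMilliseconds)
        t.AutoReset = True   ' keep firing until the timer is stopped
        AddHandler t.Elapsed, Sub(sender As Object, e As System.Timers.ElapsedEventArgs) scrape()
        Return t
    End Function
End Module
```

In a Windows service, one plausible arrangement is to create the timer and call Start in the service's OnStart method, then call Stop and Dispose in OnStop, passing the routine that calls the Web Service as the scrape argument.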

The following is a snippet of the Web Services Description Language (WSDL) file that was used to define the Web Service.

    <output>
        <tm:text>
            <tm:match name="Start" type="" pattern="##(.*)" ignoreCase="true" />
        </tm:text>
    </output>

Notice the value of the pattern XML attribute, "##(.*)". Built right into the WSDL file is the capability to define string patterns that will be used to scrape Web pages. The pattern is a regular expression: this one finds the literal characters ## in the scraped HTML page, and the parenthesized group (.*) captures whatever follows them. In a regular expression, "." matches any single character except a line break and "*" means "zero or more of the preceding item", so ".*" captures the rest of the line. Note that this differs from the "*" wildcard in file searches, which stands for any run of characters on its own. You could, for example, use a pattern such as "<A HREF" to find the anchors in the HTML Web page. This built-in power is what frees you from writing custom code for scraping different Web pages. Additionally, you can react much more quickly when the structure of the Web pages you are scraping changes.
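To see what that pattern captures, here is a minimal sketch that applies the same expression directly with the .NET Regex class. It does not go through the Web Service; the module name PatternSketch, the function ExtractAfterMarker, and the sample page text are all made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternSketch
    ' Returns the text captured after the "##" marker,
    ' or Nothing when the marker does not appear.
    Public Function ExtractAfterMarker(ByVal html As String) As String
        ' Same pattern and case setting as the WSDL snippet.
        Dim m As Match = Regex.Match(html, "##(.*)", RegexOptions.IgnoreCase)
        If m.Success Then
            Return m.Groups(1).Value   ' group 1 is the (.*) capture
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Hypothetical page content, invented for this example.
        Dim page As String = "<p>Updated</p>" & vbLf & "##Score: 3-1" & vbLf & "<p>footer</p>"
        Console.WriteLine(ExtractAfterMarker(page))
    End Sub
End Module
```

Because "." stops at a line break, the capture here is only "Score: 3-1", not the footer line that follows.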

The following code shows how you can call this Web Service from Visual Basic .NET and retrieve the results of the scraping that the Web Service performs.

    ' textlookup is the proxy used to call the Web Service
    Dim textlookup As New localhost.RetrievText()
    ' match holds what the Web Service returns
    Dim match As localhost.GetTextDetailsMatches
    Dim strText As String

    Try
        ' GetTextDetails is the Web Service method that scrapes the Web page.
        match = textlookup.GetTextDetails()
        ' Assumption: the proxy exposes the text captured by the match
        ' named "Start" in the WSDL as a member called Start.
        strText = match.Start

        ' Store the result of the Web Service call in a text file.
        FileOpen(1, "C:\temp\scraperText.txt", OpenMode.Output)
        Write(1, strText)
        FileClose(1)   ' Close file.

    Catch theException As Exception
        Dim logthis As New WriteToEventLog.WriteToEventLog()
        logthis.Log(Application.CompanyName, theException.ToString(), _
                    Application.CompanyName)
    End Try

Web page scraping can be a useful tool when a well-defined Web Service for retrieving the information is not available. And with the new features in .NET, it becomes an even more attractive option because of how easy it is to build.

The code for this Perspective is available at http://interknowlogy.com/knowledge/articles.aspx?aid=1065

Discuss this Article 2

Sameer garg (not verified)
on May 23, 2004
we have to scrap thye data from the sites like espn.com etc the text may be in HTML,XML format so see ur article but unable to go to link as it crashes. i will be highly thankfull if u help me.
Tom (not verified)
on Aug 9, 2008
Your link at http://interknowlogy.com/knowledge/articles.aspx?aid=1065 on the article is dead.
