Thursday, March 21, 2013

Three ways to scrape?

One: Scrape the Initial/Infant HTML by Code

using System.IO;
using System.Net;
using System.Text;
using System.Web.Mvc;
namespace Scraper.Controllers
{
   public class HomeController : Controller
   {
      public ActionResult Index()
      {
         string url = "https://twitter.com/";
         HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
         string scrapping;
         using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
         using (Stream stream = response.GetResponseStream())
         // Decode as UTF-8 (assumed here) rather than ASCII so non-ASCII characters survive.
         // StreamReader also copes with characters that straddle read boundaries.
         using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
         {
            scrapping = reader.ReadToEnd();
         }
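         // Cast to object so MVC treats the string as the view's model;
         // View(string) would take it as the name of a view instead.
         return View((object) scrapping);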
      }
   }
}

Here the "scrapping" variable will end up holding the initial HTML of https://twitter.com/, which is what one gets served just by visiting the page. (The C# above is a spruced-up version of something I've had in my notes for a while.) This raises the questions: "What if I want to log in at the Twitter site and make some content appear which only appears after the page loads by way of AJAX?" and "What if I want to scrape after that?" Well, I'm getting to that next.

Two: Scrape the Matured HTML by Firebug

  1. Get Firefox and install it.
  2. Install the Firebug plugin for Firefox.
  3. Restart Firefox and then visit https://twitter.com/.
  4. Log in.
  5. Once logged in, scroll down the page, forcing new HTML content for older tweets to appear by way of AJAX.
  6. Click on the Firebug icon at the upper right of Firefox. It will open a pane for Firebug.
  7. Click the icon that looks like a rectangle with a pointer over it, second from the left at the upper left of the Firebug pane. "Click an element in the page to inspect." should appear when you hover over it.
  8. Move the mouse about the browser window. Try to highlight the div holding all of the tweets and then click on it.
  9. The appropriate line of code will be highlighted in the Firebug pane. Right-click on it and pick "Copy innerHTML."
  10. Paste into Notepad!

Three: Scrape the Matured HTML by Code

PhantomJS should be the key to the best of both of the worlds above. Have I used it yet? No I haven't. :(
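
For what it's worth, here is an untested sketch of how that might look from C#. It assumes phantomjs is installed and on the PATH. The names MaturedScraper, Scrape, BuildArguments and PhantomScript are made up for illustration, and the five second wait is an arbitrary guess at how long the AJAX content needs to show up. It does not log in or scroll; it only gives the page time to mature before grabbing the HTML.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

namespace Scraper
{
   public static class MaturedScraper
   {
      // PhantomJS script (JavaScript): open the page, wait a fixed delay so
      // AJAX-driven content can arrive, then print the rendered HTML.
      // The 5000 ms delay is an assumption, not a tuned value.
      private const string PhantomScript = @"
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
   if (status !== 'success') {
      phantom.exit(1);
      return;
   }
   window.setTimeout(function () {
      console.log(page.content);
      phantom.exit(0);
   }, 5000);
});";

      // Quote both arguments so paths or URLs with spaces stay intact.
      public static string BuildArguments(string scriptPath, string url)
      {
         return "\"" + scriptPath + "\" \"" + url + "\"";
      }

      public static string Scrape(string url)
      {
         // Write the script to a temporary file that PhantomJS can run.
         string scriptPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".js");
         File.WriteAllText(scriptPath, PhantomScript);
         try
         {
            ProcessStartInfo startInfo = new ProcessStartInfo
            {
               FileName = "phantomjs", // assumes phantomjs is on the PATH
               Arguments = BuildArguments(scriptPath, url),
               RedirectStandardOutput = true,
               StandardOutputEncoding = Encoding.UTF8,
               UseShellExecute = false,
               CreateNoWindow = true
            };
            using (Process process = Process.Start(startInfo))
            {
               // Read all output before waiting, so a full pipe cannot block PhantomJS.
               string html = process.StandardOutput.ReadToEnd();
               process.WaitForExit();
               if (process.ExitCode != 0)
               {
                  throw new InvalidOperationException("PhantomJS could not load " + url);
               }
               return html;
            }
         }
         finally
         {
            File.Delete(scriptPath);
         }
      }
   }
}
```

In the controller from part one, the whole HttpWebRequest dance would then shrink to something like string scrapping = MaturedScraper.Scrape("https://twitter.com/); with the closing quote in the right place, that is: MaturedScraper.Scrape("https://twitter.com/"). A fixed delay is the crudest possible way to wait; polling for a particular element would be sturdier, but that is more than a sketch needs.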
