| Monday December 22nd 2014


HOWTO: Scrape Data From a Remote Website


Alright, I’m sure you’re saying to yourself, “OK, I have all this data (a web page, file data, it’s all the same to us), but I really want to extract some very specific data out of it.” Does that sound like what you’re looking for? Well, what we’ll do is a basic PHP web scrape, and then pull one piece of data out of the result. For our example, we’ll scrape a Match.com page for a given zip code and return just the city name. This might sound confusing, but it will give you the very basics of parsing remote data.

The whole script is below…

<?php
$zip = $_GET['zip'];
$url = file_get_contents('http://www.match.com/landingpages/country.aspx?postalcode='.$zip);
$pattern = '/Local Singles in (.+?)\;/';
preg_match($pattern, $url, $output);
#var_dump($output);
echo $output[1];
?>

and here’s the basic explanation, line by line…


$zip = $_GET['zip'];
$_GET is a predefined PHP variable that pulls its data from the query string in the address bar. We then set the variable $zip to equal the zip code pulled from the address bar (e.g., http://www.yourwebsite.com/scrape.php?zip=90210).
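Since anybody can type anything into the address bar, it may be worth checking the value before using it. Here’s a minimal sketch, assuming we only care about five-digit US zip codes; cleanZip() is a made-up helper name, not part of the original script.

```php
<?php
// Hypothetical helper (not in the original script): accept only
// five-digit zip codes and return null for anything else.
function cleanZip($raw)
{
    $raw = trim((string) $raw);
    return preg_match('/^\d{5}$/', $raw) === 1 ? $raw : null;
}

// isset() guards against the zip parameter being missing entirely.
$zip = cleanZip(isset($_GET['zip']) ? $_GET['zip'] : '');

if ($zip === null) {
    echo "Please pass a five-digit zip, e.g. scrape.php?zip=90210\n";
}
```

If $zip comes back null, the rest of the script can simply stop instead of asking the remote site about garbage input.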

$url = file_get_contents('http://www.match.com/landingpages/country.aspx?postalcode='.$zip);
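This line fetches the page at that address, with our zip code tacked on after postalcode=, and stores the page’s source code in the variable $url. Despite its name, $url holds the contents of the page, not the address.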
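One thing the short version skips: file_get_contents() returns false when the fetch fails. The sketch below shows one way to deal with that, and uses urlencode() so odd characters can’t break the address. Both buildMatchUrl() and fetchSource() are hypothetical helper names made up for this example.

```php
<?php
// Hypothetical helper: build the lookup address from a zip code.
// urlencode() keeps stray characters from breaking the query string.
function buildMatchUrl($zip)
{
    return 'http://www.match.com/landingpages/country.aspx?postalcode=' . urlencode($zip);
}

// Hypothetical helper: return the page source, or null if the fetch failed.
// The @ hides PHP's warning so we can handle the failure ourselves.
function fetchSource($location)
{
    $html = @file_get_contents($location);
    return $html === false ? null : $html;
}
```

You would call it as $html = fetchSource(buildMatchUrl($zip)); and check for null before parsing. Fetching a remote address this way also depends on allow_url_fopen being enabled in your PHP configuration.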

$pattern = '/Local Singles in (.+?)\;/';
First things first: when we’re scraping a page, we’re scraping the source code of the page, so that’s always what we want to be looking at when we’re picking out what to grab. You’d better know this, or you’re probably lost. Go to view source in your browser, then search for what you’re looking to pull out. Here’s a chunk of the source code we’re going to pull our value out of…

Local Singles in Beverly Hills; Beverly Hills Online Dating

Now that we have the data we want to get the result from, we can get into the meat of the parsing. I know to most of you regex is a big scary thing with all those crazy symbols and patterns. And well, if you want to be a regex master, yes, it’s pretty daunting. But don’t let all those funny chars scare you, because there’s a real simple way to use regex. The regex gurus and preachers will mock you and say you’re bastardizing it, but I say whatever works.

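I’m not going to go into detail here, since we’re just assigning a string to a variable in this statement. Anytime you see $varname = 'something here'; or $varname = "something here"; you know it’s just a value being assigned to a variable. Both single and double quotes make a string, but they aren’t quite interchangeable: PHP expands variables and escape sequences like \n inside double quotes, so single quotes are the safer pick for regex patterns.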

(.+?) is our best friend when it comes to regex. It basically means “match everything, starting right after the text in the beginning and stopping at the first occurrence of our end text.” (I’ll call those bits of text anchors too, so be prepared for me to use the terms interchangeably.) The ? is what makes it stop at the first end anchor instead of the last one.

Pretty easy, huh? Yeah, I thought so. The only other thing to note is the forward slashes in '/stuff/'. Those are delimiters that tell PHP where the pattern starts and ends. Other characters such as # or ~ work as delimiters too, but forward slashes are the most common choice.

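So '/Local Singles in (.+?)\;/' means “I want everything in between ‘Local Singles in ’ and ‘;’.” The backslash before the ; is harmless but not required, since ; has no special meaning in regex.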
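To see why the ? in (.+?) matters, here’s a small sketch that runs the pattern with and without it. The sample text is modeled on the source snippet above, with a second ; added purely for illustration (that extra bit is my assumption, not real Match.com output).

```php
<?php
// Sample text with two semicolons so the difference shows up.
$sample = 'Local Singles in Beverly Hills; Beverly Hills Online Dating; More';

// Lazy version: the capture stops at the first ";".
preg_match('/Local Singles in (.+?);/', $sample, $lazy);

// Greedy version (no "?"): the capture runs on to the last ";".
preg_match('/Local Singles in (.+);/', $sample, $greedy);

echo $lazy[1], "\n";   // Beverly Hills
echo $greedy[1], "\n"; // Beverly Hills; Beverly Hills Online Dating
```

The lazy form is almost always what you want when you’re grabbing a value between two anchors.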

preg_match($pattern, $url, $output);
Ah, a new function’s in town: preg_match(). preg_match() is the PHP function that runs a regex for a single match. So anytime we want to match one thing in our data, we’re going to call the parsing function preg_match().

With preg_match() we’re doing something called passing data to the function for it to work on. In this case we’re passing $pattern, $url, and $output. We know what both $pattern (the parsing string we just made) and $url (the scraped page from Match.com) are, but what is the $output variable? It’s just the variable that our parsed data is going to be returned to, as an array. In plain English, we’re saying: take $url, apply the filter $pattern to it, then dump whatever comes through that filter into $output.
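preg_match() also hands back a return value: 1 if the pattern matched, 0 if it didn’t, and false on error. Here’s a rough sketch that uses it so the script doesn’t choke when the anchors aren’t on the page. extractCity() is a hypothetical helper name, and the sample string stands in for the scraped page.

```php
<?php
// Hypothetical helper: return the captured city, or null when the
// anchors are not found in the page source.
function extractCity($source)
{
    if (preg_match('/Local Singles in (.+?);/', $source, $output) === 1) {
        // $output[0] is the full match, anchors included;
        // $output[1] is only what the parentheses captured.
        return trim($output[1]);
    }
    return null;
}

$city = extractCity('Local Singles in Beverly Hills; Beverly Hills Online Dating');
echo $city === null ? "No city found\n" : $city . "\n";
```

Checking the return value first means you never reach for $output[1] when nothing was matched.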

#var_dump($output);
The function var_dump() is your best friend as a programmer. It says, “Whatever is in this variable or array, dump it out onto the screen so I can see what’s happening.” The # sign in front just means the line is commented out, since you only need it for troubleshooting your code. Delete the # to see exactly what’s going on.

echo $output[1];
What’s with the new notation? If you hadn’t already guessed, that’s how we access the cars in our train. If we have an array and what we want is in car 1, we access it by “referencing” that car, which is what the [1] means. Arrays start counting at 0, so [1] is the second cell: $output[0] holds the whole match with the anchors included, and $output[1] holds only what the parentheses captured. This will output to our screen:

Beverly Hills

Conclusion

You can make some pretty cool tools with just the two very basic things I’ve shared with you so far: pulling data from somewhere using the file_get_contents() function, and parsing it with the preg_match() function. Have fun with it and I’ll see you on the next data scraping tutorial.
