Oneupweb : Web Scraping—The Freaker Way!

It is nothing new. Giants like Google have been scraping the web for ages now. What is this scraping, you ask? Web scraping is the process of collecting data from another web page, usually so that you can use that data in a meaningful way. Our team here has nothing but mad love for the folks over at Freaker USA. It all started with a post by one of our interactive graphic designers, which inspired our PR specialist to give a shout-out, and now I’m joining in on their freaky obsession. So today I am going to give a brief how-to on web scraping, specifically collecting Freaker data and creating your own Freaker catalog with PHP to promote this boss sauce of a product.


Tools of the trade…

We need to fetch the catalog page from the Freaker USA website and parse the contents that are returned. The main workhorses of our script will be:

  1. The PHP function file_get_contents (for fetching the remote web page)
  2. The PHP DOM extension (for parsing the requested page)
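Here is a minimal sketch of the two tools working together. It parses an inline HTML string so nothing is fetched over the network; the sample markup and the $names variable are made up for illustration.

```php
<?php
// Sample markup standing in for a fetched page (made-up content).
$html = '<html><body><ul id="products"><li>First</li><li>Second</li></ul></body></html>';

// For a real page you would fetch the markup instead, for example:
// $html = file_get_contents('http://www.freakerusa.com/collections/all');

// Parse the string into a DOM tree.
$document = new DOMDocument();
$document->loadHTML($html);

// Collect the text of every list item in the document.
$names = array();
foreach ($document->getElementsByTagName('li') as $li) {
	$names[] = $li->nodeValue;
}

print_r($names);
```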

Fetch those Freakers!

Let’s take a look at our function that will get our Freaker data.

function get_freakers() {

	//our site details
	$domain = 'http://www.freakerusa.com';
	$catalog = $domain . '/collections/all';
	$products_id = 'products';

	//read the catalog
	$document = new DOMDocument();
	$page_contents = file_get_contents($catalog);
	@$document->loadHTML($page_contents);

	//get the products
	$productsElem = $document->getElementById($products_id);
	$products = array();
	$listItems = $productsElem->getElementsByTagName('li');

	$i = 0;
	while ( $listItems->item($i) )
	{
		$title = $listItems->item($i)->getElementsByTagName('h3')->item(0)->nodeValue;
		$anchor = $listItems->item($i)->getElementsByTagName('a')->item(0);
		$img = $anchor->getElementsByTagName('img')->item(0);
		$href = $domain . $anchor->getAttribute('href');
		$image = array(
			'alt' => $img->getAttribute('alt'),
			'src' => $img->getAttribute('src')
		);
		$price = $listItems->item($i)->getElementsByTagName('p')->item(0)->nodeValue;
		$products[] = array(
			'title' => $title,
			'href' => $href,
			'img' => $image,
			'price' => $price
		);
		$i++;
	}

	return $products;
}

If we head over to the Freaker USA catalog, we can see that each product is wrapped in a nice list item belonging to an unordered list with an id of “products”:

<li>
	<!-- START IMAGE -->
	<div class="image">
		<div class="align">
			<div><a href="/collections/all/products/america"><img alt="Vin Diesel" src="http://cdn.shopify.com/s/wp-content/uploads/1/0066/5282/products/IMG_1366_medium.jpg?100851"></a></div>
		</div>
	</div>
	<!-- END IMAGE -->
	<h3><a href="/collections/all/products/america">Vin Diesel</a></h3>
	<p>$8.00</p>
</li>

Get the remote page

The first thing we do is get the catalog page from the Freaker site. This is done in the following lines of the get_freakers function:

//our site details
$domain = 'http://www.freakerusa.com';
$catalog = $domain . '/collections/all';
$products_id = 'products';

//read the catalog
$page_contents = file_get_contents($catalog);

First we create a variable for the domain of the Freaker USA site. This helps us if we want to read additional pages as well, and as we will see, it can be used to turn pesky relative URLs into absolute ones.
Next we create a variable for the catalog URL and one for the ID of the main element that holds the individual products. We use the first to fetch the product catalog and the second to find the products when parsing with the DOM extension. Finally, file_get_contents fetches the catalog page and hands its HTML back to us as a string. Now we are all ready to parse out the information that we want.
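One thing the snippet above glosses over is that file_get_contents returns false when the request fails. The sketch below wraps the call in a hypothetical fetch_page helper (not part of the original script) that adds a timeout and returns null on failure; the helper name, the timeout value, and the user agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a URL and return its body, or null on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
	// The 'http' options only apply to http:// and https:// URLs.
	$context = stream_context_create(array(
		'http' => array(
			'timeout'    => $timeoutSeconds,
			'user_agent' => 'FreakerCatalog/1.0 (example)', // made-up identifier
		),
	));

	// @ hides the warning PHP raises on failure; we check the return value instead.
	$contents = @file_get_contents($url, false, $context);

	return $contents === false ? null : $contents;
}

// Usage (commented out so the sketch does not hit the network when run):
// $page_contents = fetch_page('http://www.freakerusa.com/collections/all');
// if ($page_contents === null) { /* show a fallback instead of the catalog */ }
```

Returning null keeps the failure case explicit, so the calling code can decide whether to show a fallback or nothing at all.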

Getting the relevant information

The rest of our function deals with getting the individual products:

	$document = new DOMDocument();
	@$document->loadHTML($page_contents);

	//get the products
	$productsElem = $document->getElementById($products_id);
	$products = array();
	$listItems = $productsElem->getElementsByTagName('li');

	$i = 0;
	while ( $listItems->item($i) )
	{
		$title = $listItems->item($i)->getElementsByTagName('h3')->item(0)->nodeValue;
		$anchor = $listItems->item($i)->getElementsByTagName('a')->item(0);
		$img = $anchor->getElementsByTagName('img')->item(0);
		$href = $domain . $anchor->getAttribute('href');
		$image = array(
			'alt' => $img->getAttribute('alt'),
			'src' => $img->getAttribute('src')
		);
		$price = $listItems->item($i)->getElementsByTagName('p')->item(0)->nodeValue;
		$products[] = array(
			'title' => $title,
			'href' => $href,
			'img' => $image,
			'price' => $price
		);
		$i++;
	}

	return $products;

We first create a new DOMDocument object and load our requested content into it. (The little @ symbol suppresses the warnings that loadHTML raises when it hits markup it considers invalid, which is common with real-world pages; HTML parsers are picky little buggers.) Once we have our page loaded into our document object, we select the product container by its ID (in this case ‘products’). Remember from the HTML above that each product on the Freaker USA catalog page is a list item belonging to the “products” list, so we grab all of those list items and store them in the variable $listItems. This isn’t a tutorial on using the DOM extension, but you should be able to follow the while loop: for each list item it pulls the title from the h3, the link and image from the first anchor, and the price from the first paragraph, prefixing the relative href with $domain to make it absolute. We return all of our Freaker data as an array of associative arrays containing everything we need to output Freakers anywhere on our site!
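If you would rather not lean on @ and would like the loop to survive list items that are missing a piece, here is one possible variation. It is a sketch, not the function used in this post: parse_freakers is a hypothetical name, it swaps getElementById for a DOMXPath query, it uses libxml_use_internal_errors to quiet the parser, and the sample markup is made up.

```php
<?php
// Hypothetical helper: pull product data out of an HTML string.
function parse_freakers(string $html, string $domain, string $productsId): array
{
	if ($html === '') {
		return array(); // nothing to parse
	}

	// Collect parser warnings internally instead of suppressing them with @.
	$previous = libxml_use_internal_errors(true);
	$document = new DOMDocument();
	$document->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	// Select every <li> inside the element with the given id.
	$xpath = new DOMXPath($document);
	$products = array();
	foreach ($xpath->query("//*[@id='" . $productsId . "']//li") as $item) {
		$heading = $item->getElementsByTagName('h3')->item(0);
		$anchor  = $item->getElementsByTagName('a')->item(0);
		$img     = $item->getElementsByTagName('img')->item(0);
		$price   = $item->getElementsByTagName('p')->item(0);

		if (!$heading || !$anchor || !$img || !$price) {
			continue; // skip list items that do not look like products
		}

		$products[] = array(
			'title' => trim($heading->nodeValue),
			'href'  => $domain . $anchor->getAttribute('href'),
			'img'   => array(
				'alt' => $img->getAttribute('alt'),
				'src' => $img->getAttribute('src'),
			),
			'price' => trim($price->nodeValue),
		);
	}

	return $products;
}

// Made-up sample: one real-looking product and one stray list item.
$sample = '<ul id="products">'
	. '<li><div><a href="/collections/all/products/america">'
	. '<img alt="Vin Diesel" src="http://example.com/vin.jpg"></a></div>'
	. '<h3><a href="/collections/all/products/america">Vin Diesel</a></h3>'
	. '<p>$8.00</p></li>'
	. '<li>Not a product</li>'
	. '</ul>';

$result = parse_freakers($sample, 'http://www.freakerusa.com', 'products');
print_r($result);
```

Separating parsing from fetching also makes it easy to try the parser against a saved copy of the page.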

Output your Freaker catalog

Now that we have our handy get_freakers function, we can output our catalog any way we like. I will use a small template like so:

<?php
require_once 'freaker.php'; //we keep our get_freakers function in here
$freakers = get_freakers();
?>
<div id="freakers">
<?php foreach ( $freakers as $freaker ):?>
	<div class="freaker">
		<h3><a target="_blank" href="<?php echo $freaker['href']; ?>"><?php echo $freaker['title']; ?></a></h3>
		<div class="freaker-thumb">
			<a target="_blank" href="<?php echo $freaker['href']; ?>">
				<img src="<?php echo $freaker['img']['src']; ?>" alt="<?php echo $freaker['img']['alt']; ?>" />
			</a>
		</div>
		<p class="price">
			<?php echo $freaker['price']; ?>
		</p>
		<a target="_blank" class="buy" href="<?php echo $freaker['href']; ?>">BUY!</a>
	</div>
<?php endforeach; ?>
</div>
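Since the titles, prices, and URLs come from someone else's page, it is probably wise to escape them before echoing them into your own markup. The sketch below shows one way to do that with htmlspecialchars; the e() helper and the sample product are hypothetical and not part of the template above.

```php
<?php
// Hypothetical helper: escape a value for use in HTML text or attribute values.
function e(string $value): string
{
	return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Made-up product with awkward characters in its title.
$freaker = array(
	'title' => 'Salt & "Pepper" <b>',
	'href'  => 'http://www.freakerusa.com/collections/all/products/america',
);

// Build a link the same way the template does, but with escaped values.
$link = '<a target="_blank" href="' . e($freaker['href']) . '">' . e($freaker['title']) . '</a>';
echo $link, "\n";
```

In the template itself this would mean writing echo e($freaker['title']) in place of echo $freaker['title'], and likewise for the other fields.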

Add a dab of CSS and... voilà! We now have a catalog of our very own to promote such a fabulous product!
Our very own Freaker USA catalog! BOSS!!

How will you get your Freaker on?

The cool thing about this is that it will stay up to date with the Freaker USA catalog (provided they don’t change their markup, but being the pro you are now, you can certainly keep up with that). There are plenty of other possibilities for this type of application. You could use the data you collect to create a handy sidebar widget for WordPress, or maybe a Facebook tab. There is no stopping you! So go get your web scraping on!