
Building a crawler in PHP

Original – by Freek Van der Herten – 4 minute read

When Spatie unleashes a new site on the web, we want to make sure that all links on it, both internal and external, work. To facilitate that process we released a tool to check the status code of every link on a given website. It can easily be installed via Composer:

composer global require spatie/http-status-check

Let's, for example, scan the Laracasts.com domain:

http-status-check scan https://laracasts.com

Our little tool will spit out the status code of all links it finds:

[Screenshot: console output listing the status code of each crawled link]

Once it's finished, a summary with the number of links per status code is displayed.

The package uses a home-grown crawler. Sure, there are already many other crawlers available. I built a custom one partly as a learning exercise, and partly because the other crawlers didn't do exactly what I wanted them to.

Let's take a look at the code that crawls all links in a piece of HTML:

/**
 * Crawl all links in the given html.
 *
 * @param string $html
 */
protected function crawlAllLinks($html)
{
    $allLinks = $this->getAllLinks($html);

    collect($allLinks)
        ->filter(function (Url $url) {
            return !$url->isEmailUrl();
        })
        ->map(function (Url $url) {
            return $this->normalizeUrl($url);
        })
        ->filter(function (Url $url) {
            return $this->crawlProfile->shouldCrawl($url);
        })
        ->each(function (Url $url) {
            $this->crawlUrl($url);
        });
}

So first we get all links. Then we filter out mailto links. The next step normalizes all links. After that we let a CrawlProfile determine whether each link should be crawled. And finally the link gets crawled.

Determining which links there are on a page may sound quite daunting, but Symfony's DomCrawler component makes that very easy. Here's the code:

protected function getAllLinks($html)
{
    $domCrawler = new DomCrawler($html);

    return collect($domCrawler->filterXPath('//a')
        ->extract(['href']))
        ->map(function ($url) {
            return Url::create($url);
        });
}

The DomCrawler returns strings. Those strings get mapped to Url objects to make them easy to work with.
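
The real Url class in the package has quite a bit of behaviour. As a rough idea of what such a value object looks like, here's a heavily simplified, hypothetical version that only covers the methods used in the snippets in this post:

class Url
{
    /** @var string|null */
    public $scheme;

    /** @var string|null */
    public $host;

    /** @var string */
    public $path;

    public static function create($url)
    {
        return new static($url);
    }

    public function __construct($url)
    {
        $parts = parse_url($url) ?: [];

        $this->scheme = isset($parts['scheme']) ? $parts['scheme'] : null;
        $this->host = isset($parts['host']) ? $parts['host'] : null;
        $this->path = isset($parts['path']) ? $parts['path'] : '';
    }

    public function isEmailUrl()
    {
        return $this->scheme === 'mailto';
    }

    public function isRelative()
    {
        return is_null($this->host);
    }

    public function isProtocolIndependent()
    {
        return is_null($this->scheme) && ! is_null($this->host);
    }

    public function __toString()
    {
        $prefix = $this->scheme ? $this->scheme.'://' : ($this->host ? '//' : '');

        return $prefix.$this->host.$this->path;
    }
}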

On a webpage, protocol-independent links (e.g. //domain.com/contactpage) and relative links (/contactpage) may appear. Our little crawler needs absolute links (https://domain.com/contactpage), so all links need to be normalized. Here's the code to do that:

protected function normalizeUrl(Url $url)
{
    if ($url->isRelative()) {
        $url->setScheme($this->baseUrl->scheme)
            ->setHost($this->baseUrl->host);
    }

    if ($url->isProtocolIndependent()) {
        $url->setScheme($this->baseUrl->scheme);
    }

    return $url->removeFragment();
}

$baseUrl in the code above contains the URL of the site we're scanning.

Determining if a URL should be crawled

The crawler delegates the question of whether a URL should be crawled to a dedicated class that implements the CrawlProfile interface:

namespace Spatie\Crawler;

interface CrawlProfile
{
    /**
     * Determine if the given url should be crawled.
     *
     * @param \Spatie\Crawler\Url $url
     *
     * @return bool
     */
    public function shouldCrawl(Url $url);
}

The package provides an implementation that will crawl all URLs. If you want to filter out some URLs, there's no need to change the code of the crawler. Just create your own CrawlProfile implementation.
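
As an example, a profile that only crawls pages on the host you're scanning could look something like this (a sketch; apart from the CrawlProfile interface and the Url object's host property, the class and names are made up for illustration):

namespace App\Crawler;

use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

class CrawlInternalUrls implements CrawlProfile
{
    /** @var string The host we want to stay on, e.g. "spatie.be". */
    protected $host;

    public function __construct($host)
    {
        $this->host = $host;
    }

    /**
     * Only crawl urls that live on the configured host.
     *
     * @param \Spatie\Crawler\Url $url
     *
     * @return bool
     */
    public function shouldCrawl(Url $url)
    {
        return $url->host === $this->host;
    }
}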

Crawling a URL

Guzzle makes fetching the HTML of a URL very simple.

$response = $this->client->request('GET', (string) $url);
$this->crawlAllLinks($response->getBody()->getContents());

There's a little bit more to it, but the code above is the essential part.
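
One of the things the real crawler has to deal with is URLs that return an error status or that can't be reached at all. Here's a simplified sketch of how that could be handled with Guzzle's RequestException; the crawlObserver property and the surrounding logic are assumptions for illustration, not the package's exact code:

use GuzzleHttp\Exception\RequestException;

try {
    $response = $this->client->request('GET', (string) $url);
} catch (RequestException $exception) {
    // Guzzle throws on 4xx/5xx responses and on connection errors. Keep the
    // response if there is one so the observer can still report a status code.
    $response = $exception->getResponse();
}

$this->crawlObserver->hasBeenCrawled($url, $response);

// Only look for more links if we actually got some html back.
if (! is_null($response) && $response->getStatusCode() === 200) {
    $this->crawlAllLinks($response->getBody()->getContents());
}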

Observing the crawl process

Again, you shouldn't have to touch the code of the crawler itself to add behaviour to it. When instantiating the crawler, you're expected to pass it an implementation of CrawlObserver.

Looking at the interface should make things clear:

namespace Spatie\Crawler;

interface CrawlObserver
{
    /**
     * Called when the crawler will crawl the url.
     *
     * @param \Spatie\Crawler\Url $url
     */
    public function willCrawl(Url $url);

    /**
     * Called when the crawler has crawled the given url.
     *
     * @param \Spatie\Crawler\Url                      $url
     * @param \Psr\Http\Message\ResponseInterface|null $response
     */
    public function hasBeenCrawled(Url $url, $response);

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling();
}

The http-status-check tool uses this implementation to display all found links to the console.
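
To give an idea of what such an observer can look like, here's a bare-bones, hypothetical implementation that writes every crawled URL and its status code to the console:

namespace App\Crawler;

use Psr\Http\Message\ResponseInterface;
use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\Url;

class ConsoleLoggingObserver implements CrawlObserver
{
    public function willCrawl(Url $url)
    {
        // Nothing to do before the request in this simple observer.
    }

    public function hasBeenCrawled(Url $url, $response)
    {
        // The response can be null when the url couldn't be reached at all.
        $statusCode = $response instanceof ResponseInterface
            ? $response->getStatusCode()
            : 'failed';

        echo "{$statusCode}: {$url}".PHP_EOL;
    }

    public function finishedCrawling()
    {
        echo 'Crawl finished'.PHP_EOL;
    }
}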

In closing

That concludes the little tour of the code. I hope you've seen that creating a crawler is not that difficult. If you want to know more, just read the code of the Crawler class on GitHub.

My colleague Sebastian had a great idea to create a Laravel package that provides an Artisan command to check all links of a Laravel application. You might see that appear amongst our Laravel packages soon.

