Improving the performance of our PHP based crawler

Today a new major version of our homegrown crawler was released. The crawler is used to power our http-status-check, laravel-sitemap and laravel-link-checker packages. A new major feature is the greatly improved crawling speed. This was accomplished by leveraging multiple concurrent requests.

Let’s take a look at the performance improvements gained by using concurrent requests. In the video below, the crawler is started twice. On the left we have v1 of the crawler, which does one request and waits for the response before launching the next one. On the right we have v2, which uses 10 concurrent requests. Both crawls go over our entire company site, https://spatie.be.

Even though I gave v1 a little head start, it really got blown away by v2. Where v1 is constantly waiting on a response from the server, v2 will just launch another request, while it’s waiting for responses of previous requests.

To make requests, the crawler package uses Guzzle, the well-known HTTP client. A cool feature of Guzzle is that it provides support for concurrent requests out of the box. If you want to know more on that subject, read this excellent blogpost by Hannes Van de Vreken. Here’s the relevant code in our package.
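For readers who haven’t used Guzzle’s concurrency support yet, here is a minimal sketch of the idea. It is not the crawler’s actual code: the function name `fetchStatusCodes` is made up for illustration, the concurrency of 10 simply mirrors the demo above, and the snippet assumes Guzzle 6 or later is installed and Composer’s autoloader has already been loaded.

```php
<?php

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

// Illustrative helper, not part of spatie/crawler.
// Assumes Composer's autoloader has already been required.
function fetchStatusCodes(array $urls, int $concurrency = 10): array
{
    $urls = array_values($urls);

    // With http_errors disabled, 4xx and 5xx responses count as fulfilled.
    $client = new Client(['http_errors' => false]);
    $statusCodes = [];

    // A generator lets the pool pull in new requests as slots free up.
    $requests = function () use ($urls) {
        foreach ($urls as $url) {
            yield new Request('GET', $url);
        }
    };

    $pool = new Pool($client, $requests(), [
        'concurrency' => $concurrency,
        'fulfilled' => function ($response, $index) use (&$statusCodes, $urls) {
            $statusCodes[$urls[$index]] = $response->getStatusCode();
        },
        'rejected' => function ($reason, $index) use (&$statusCodes, $urls) {
            // Connection errors, timeouts and the like: no status code available.
            $statusCodes[$urls[$index]] = null;
        },
    ]);

    // Start the transfers and block until all of them have completed.
    $pool->promise()->wait();

    return $statusCodes;
}
```

The pool keeps at most `$concurrency` requests in flight and starts the next one as soon as a response comes in, which is the behaviour that makes v2 so much faster than waiting for each response in turn.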

Together with the release of crawler v2, these packages got new major versions that make use of it:
http-status-check: this command line tool can scan your entire site and report back the http status codes for each page. We use this tool whenever we launch a new site at Spatie to check if there are broken links.
laravel-sitemap: this Laravel package can generate a sitemap by crawling your entire site.
laravel-link-checker: this one can automatically notify you whenever a broken link is found on your site.

Integrating the crawler into your own project or package is easy. You can set a CrawlProfile to determine which urls should be crawled. A crawlReporter can be used to determine what should be done with the found urls. Want to know more? Then head over to the crawler repo on GitHub.

If you like the crawler, be sure to also take a look at the many other framework agnostic and Laravel specific packages our team has created.

Some request filtering macros

In a gist on GitHub Adam Wathan shares some macros that can be used to clean up a request.

Allows you to trim things, lowercase things, whatever you want. Pass a callable or array of callables that each expect a single argument:

Request::macro('filter', function ($key, $filters) {
    return collect($filters)->reduce(function ($filtered, $filter) {
        return $filter($filtered);
    }, $this->input($key));
});

https://gist.github.com/adamwathan/610a9818382900daac6d6ecdf75a109b
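To see what the reduce in that macro does without booting Laravel, here’s a plain PHP sketch of the same idea. `applyFilters` is a hypothetical stand-in for the macro, with `array_reduce` taking the place of the collection’s `reduce`. In an app that registers the macro, the equivalent call would presumably look like `$request->filter('email', ['trim', 'strtolower'])`.

```php
<?php

// Hypothetical plain-PHP stand-in for the `filter` macro above.
function applyFilters($value, $filters)
{
    // Accept a single callable as well as an array of callables.
    $filters = is_array($filters) ? $filters : [$filters];

    // Feed the value through each filter in turn, from left to right.
    return array_reduce($filters, function ($filtered, $filter) {
        return $filter($filtered);
    }, $value);
}

echo applyFilters('  John@Example.COM ', ['trim', 'strtolower']), PHP_EOL;
// john@example.com
```

Each filter receives the output of the previous one, so the order of the callables matters.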

If you want to hear Adam talk some more about troubles with requests (generated by webforms) and possible solutions, listen to this episode of the Full Stack Radio Podcast.

No Time for a Taxicab

Gary Hockin posted a video of his attempt at solving the Day 1 challenge of http://adventofcode.com/. The video not only shows how he solved the problem codewise but also demonstrates some nice features of PhpStorm.

Mistakes and all, I attempt to code day 1 part 1 of the Advent of Code challenges you can find at http://adventofcode.com.

I deliberately didn’t overly edit, or over complicate the video as I’m trying to get them done as fast as possible, if you like this I’ll do some more!

I know some people will say “Waaa you should have done it like this!”, or “Why didn’t you use library $x”, well I didn’t so get over it. I’m also worried this gives away far too much about my coding quality and how lazy I am, but such is life.

How I refactor to collections

Christoph Rumpel posted some good practical examples on how to refactor common loops to collections.

Refactoring to Collections is a great book by Adam Wathan where he demonstrates, how you can avoid loops by using collections. It sounds great from the beginning, but you need to practice it, in order to be able to use it in your own projects. This is why I refactored some of my older projects. I want to share these examples today with you.

http://christoph-rumpel.com/2016/11/How-I-refactor-to-collections/

Clean up your Vue modules with ES6 Arrow Functions

On the .dev blog Jacob Bennett shared some nice refactorings using arrow functions.

Recently when refactoring a Vue 1.0 application, I utilized ES6 arrow functions to clean up the code and make things a bit more consistent before updating to Vue 2.0. Along the way I made a few mistakes and wanted to share the lessons I learned as well as offer a few conventions that I will be using in my Vue applications moving forward.

https://dotdev.co/clean-up-your-vue-modules-with-es6-arrow-functions-2ef65e348d41

murze.be is two years old 🍰

I’m happy to share that this blog now celebrates its second anniversary. Two years ago I started murze.be to share my bookmarks and interesting links I found with other developers. Along the way I started to write some articles of my own, mainly introductory posts to the now 100+ packages my team and I have released. Like in last year’s anniversary post I’d like to share some numbers about this blog.

In the period spanning from November 2015 until November 2016 this blog served up 397 030 pages. In comparison to 2014-2015 that’s a 400% increase. Though this blog being popular is not my main motivation, it’s still pretty awesome to know that a lot of developers appreciate the links and articles that are being shared here.

In the middle of this year I also changed the theme. Though some people let me know that they liked the old design better, I myself appreciate the calm of the current one. For fun, here’s what the site looked like earlier this year:

[screenshot: the previous design of the site]

murze.be runs on a small DigitalOcean droplet. For the most part it had no problem handling the traffic. The only time it went down was when the post on DigitalOcean having lost one of our servers reached number two on Hacker News. It’s a silly coincidence that a few hours before posting that article I had disabled WP Super Cache and forgot to turn it on again. With that plugin enabled the little droplet had no problem keeping the site up. At its peak more than 900 people were visiting this blog concurrently.

This is the top 10 of most visited articles on the blog in the last year:

  1. Today DigitalOcean lost our entire server
  2. Building a dashboard using Laravel, Vue.js and Pusher
  3. A package to log activity in a Laravel app
  4. A modern backup solution for Laravel apps
  5. Using Elasticsearch in Laravel
  6. Laravel Analytics v2 has been released
  7. A Laravel webshop (this is an article from 2014!!)
  8. How to setup and use the Google Calendar API
  9. A modern package to generate html menus
  10. On open sourcing Blender

Early 2016 I also started to offer a newsletter that’s really nothing more than a digest of this blog. But a lot of people like reading it. Currently the mailing list has 2 290 subscribers. The list has an open rate of 56.7% and a click rate of 27.5%. I think that those rates are pretty good considering that a large part of the target audience is probably blocking Mailchimp tracking.

I still very much enjoy posting and writing on this blog, so we’ll probably get to celebrate its 3rd anniversary.

What’s new in PHP 7.1

With PHP 7.1 scheduled to be released next week, it’s probably a good idea to go over the new features it offers.

The newest version of PHP – 7.1.0 – is already at RC6 (Release Candidate 6) status, which means it will be out soon. After a huge update that took PHP from 5.6 straight to 7.0 increasing speeds considerably, PHP is now focusing on core language features that will help all of us write better code. In this article I’ll take a look at the major additions and features of PHP 7.1.0 which is just around the bend.

https://kinsta.com/blog/php-7-1-0/

An uptime and ssl certificate monitor written in PHP

Today we released our newest package: spatie/laravel-uptime-monitor. It’s a powerful, easy to configure uptime monitor. It’s written in PHP and distributed as a Laravel package. It will notify you when your site is down (and when it comes back up). You can also be notified a few days before an SSL certificate on one of your sites expires. Under the hood, the package leverages Laravel 5.3’s notifications, so it’s easy to use Slack, Telegram or one of the many available notification drivers.

To make sure you can easily work with the package, we’ve written extensive documentation. Topics range from the basic installation setup to some more advanced settings. In this post I’d like to show you how to start using the package and discuss some of the problems we faced (and solved) while creating the code.

The basics

After you’ve set up the package you can use the monitor:create command to monitor a url. Here’s how to add a monitor for https://example.com:

php artisan monitor:create https://example.com

You will be asked if the uptime check should look for a specific string in the response. This is handy if you know a few words that appear on the page you want to monitor. This is optional, but if you do choose to specify a string and the string does not appear in the response when checking the url, the package will consider that uptime check failed.
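The rule described above fits in a few lines. This is only a sketch, not the package’s code: the function name is made up, and treating only 2xx status codes as "up" and matching the string case-sensitively are both assumptions.

```php
<?php

// Hypothetical helper illustrating the rule; not the package's implementation.
// Assumptions: only 2xx counts as reachable, and the match is case-sensitive.
function uptimeCheckSucceeded(int $statusCode, string $body, string $lookForString = ''): bool
{
    if ($statusCode < 200 || $statusCode >= 300) {
        return false;
    }

    // No string configured: a successful response is enough.
    if ($lookForString === '') {
        return true;
    }

    return strpos($body, $lookForString) !== false;
}

var_dump(uptimeCheckSucceeded(200, 'Welcome to Example', 'Example')); // bool(true)
var_dump(uptimeCheckSucceeded(200, 'Maintenance mode', 'Example'));   // bool(false)
```

The second call shows why the string is useful: a maintenance page that still answers with a 200 would otherwise pass as "up".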

Instead of using the monitor:create command you may also manually create a row in the monitors table. The documentation contains a description of all the fields in that table.

Now, if all goes well the package will check the uptime of https://example.com every five minutes. That number can be changed in the configuration. If the site becomes unreachable for any reason, the package will send you a notification of that event. Here’s what that looks like in Slack:

[screenshot: the monitor-failed notification in Slack]

When an uptime check fails the package goes into crunch mode and starts checking that site every single minute. As soon as the connectivity to the site is restored you’ll be notified.

[screenshot: the monitor-recovered notification]

Out of the box the uptime check will be performed in a sane way. You can tweak some settings of the uptime check to your liking.

As mentioned above, the package will also check the validity of the SSL certificate. By default this check is performed daily. If a certificate is expiring soon or is invalid, you’ll get notified. Here’s what such a notification looks like:

[screenshot: the ssl-expiring-soon notification]

You can view all configured monitors with the monitor:list command. You’ll see some output not unlike this:

[screenshot: output of the monitor:list command]

Making the uptime check fast

Like many other agencies, at our company we have to check the uptime of a lot of sites. That’s why this process needs to be as fast as possible. Under the hood the uptime check uses Guzzle promises. So instead of making a request to a site and waiting until there is a response, the package performs multiple HTTP requests at the same time. If you want to know more about Guzzle promises, read this blogpost by Hannes Van de Vreken. Take a look at the MonitorCollection class in the package to learn how promises are being leveraged.
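As a rough illustration of what working with Guzzle promises looks like, here is a small sketch. It is not taken from the package: `checkReachability` is a hypothetical function, the 10 second timeout is an arbitrary choice, and the snippet assumes Composer’s autoloader is loaded and guzzlehttp/promises 1.4 or later is installed (for the `Utils` class).

```php
<?php

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

// Illustrative only; the package's own logic lives in its MonitorCollection class.
// Assumes Composer's autoloader has already been required.
function checkReachability(array $urls): array
{
    $client = new Client(['timeout' => 10]);

    // getAsync() hands back a promise right away instead of a response.
    $promises = [];
    foreach ($urls as $url) {
        $promises[$url] = $client->getAsync($url);
    }

    // settle() waits for every promise, whether it gets fulfilled or rejected.
    $results = Utils::settle($promises)->wait();

    $reachable = [];
    foreach ($results as $url => $result) {
        // With Guzzle's default http_errors setting, 4xx/5xx responses
        // and connection failures both end up as rejected promises.
        $reachable[$url] = $result['state'] === 'fulfilled';
    }

    return $reachable;
}
```

Because all requests are in flight at once, the total time is roughly that of the slowest site rather than the sum of all response times.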

Testing the code

To make sure the package works correctly we’ve built an extensive test suite. It was a challenge to keep the test suite fast. Reaching out to real sites to test reachability is much too slow. A typical response from a site takes 200ms (and that’s a very optimistic number); multiply that by the ±50 tests we have and you’ll get a suite that takes 10 seconds or more to run. That’s much too slow. Another problem we needed to solve is the timing of the tests. By default the package will check a reachable site only once in 5 minutes. Of course our test suite can’t wait five minutes.

To solve the problem of slow responses we built a little node server that mimics a site and included that in the testsuite. The Express framework makes it really easy to do so. Here’s the entire source code of the server:

"use strict";

let app = require('express')();

let bodyParser = require('body-parser');
app.use(bodyParser.json()); // support json encoded bodies
app.use(bodyParser.urlencoded({ extended: true })); // support encoded bodies

let serverResponse = {};

app.get('/', function (request, response) {
    const statusCode = serverResponse.statusCode;

    response.writeHead(statusCode || 200, { 'Content-Type': 'text/html' });
    response.end(serverResponse.body || "This is the testserver");
});

app.post('/setServerResponse', function(request, response) {
    serverResponse.statusCode = request.body.statusCode;
    serverResponse.body = request.body.body;

    response.send("Response set");
});

let server = app.listen(8080, function () {
    const host = 'localhost';
    const port = server.address().port;

    console.log('Testing server listening at http://%s:%s', host, port);
});

By default visiting http://localhost:8080/ returns an http response with content This is the testserver and status code 200. To change that response a post request to setServerResponse can be submitted with the text and status code that a visit to / should return. Unlike making a request to a site on the internet, this server, because it runs locally, is blazing fast.

This is the PHP class in the testsuite that communicates with the node server:

namespace Spatie\UptimeMonitor\Test;

use GuzzleHttp\Client;

class Server
{
    /** @var \GuzzleHttp\Client */
    protected $client;

    public function __construct()
    {
        $this->client = new Client();

        $this->up();
    }

    public function setResponseBody(string $text, int $statusCode = 200)
    {
        $this->client->post('http://localhost:8080/setServerResponse', [
            'form_params' => [
                'statusCode' => $statusCode,
                'body' => $text,
            ],
        ]);
    }

    public function up()
    {
        $this->setResponseBody('Site is up', 200);
    }

    public function down()
    {
        $this->setResponseBody('Site is down', 503);
    }
}

The second problem – testing time based functionality – can be solved by just controlling time. Yeah, you’ve read that right. The whole package makes use of Carbon instances to work with time. Carbon has this method to just set the current time.

use Carbon\Carbon;

Carbon::setTestNow(Carbon::create(2016, 1, 1, 00, 00, 00));

// will return a Carbon instance with a datetime value 
// of 1st January 2016 no matter what the real
// current date or time is
Carbon::now();

To make time progress a couple of minutes we made this function:

public function progressMinutes(int $minutes)
{
   $newNow = Carbon::now()->addMinutes($minutes);

   Carbon::setTestNow($newNow);
}

Now let’s take a look at a real test that uses both the testserver and the time control. Our uptime check will only fire off an UptimeCheckRecovered event after an UptimeCheckFailed event was sent first. The UptimeCheckRecovered event contains a datetime indicating when the uptime check failed for the first time.

Here’s the test to make sure the UptimeCheckRecovered event gets fired at the right time and contains the right info:

/** @test */
public function the_recovered_event_will_be_fired_when_an_uptime_check_succeeds_after_it_has_failed()
{
    /**
     * Get all monitors whose uptime we should check.
     *
     * In this test there is only one monitor configured.
     */
    $monitors = MonitorRepository::getForUptimeCheck();

    /**
     * Bring the testserver down.
     */
    $this->server->down();

    /**
     * To avoid false positives the package will only raise an `UptimeCheckFailed`
     * event after 3 consecutive failures.
     */
    foreach (range(1, 3) as $index) {
        /** The event must not have fired before the third failure. */
        Event::assertNotFired(UptimeCheckFailed::class);

        /** Perform the uptime check */
        $monitors->checkUptime();
    }

    /**
     * The `UptimeCheckFailed` event should have fired by now.
     */
    Event::assertFired(UptimeCheckFailed::class);

    /**
     * Let's simulate a downtime of 10 minutes
     */
    $downTimeLengthInMinutes = 10;
    $this->progressMinutes($downTimeLengthInMinutes);

    /**
     * Bring the testserver up
     */
    $this->server->up();

    /**
     * To be 100% certain let's assert that the `UptimeCheckRecovered` event
     * hasn't been fired yet.
     */
    Event::assertNotFired(UptimeCheckRecovered::class);

    /**
     * Let's go ahead and check the uptime again.
     */
    $monitors->checkUptime();

    /**
     * And now the `UptimeCheckRecovered` event should have fired.
     *
     * We'll also assert that `downtimePeriod` is correct. It should have
     * a startDateTime of 10 minutes ago.
     */
    Event::assertFired(UptimeCheckRecovered::class, function ($event) use ($downTimeLengthInMinutes) {
        if ($event->downtimePeriod->startDateTime->toDayDateTimeString() !== Carbon::now()->subMinutes($downTimeLengthInMinutes)->toDayDateTimeString()) {
            return false;
        }

        if ($event->monitor->id !== $this->monitor->id) {
            return false;
        }

        return true;
    });
}

Sure, there’s a lot going on, but it all should be very readable.

Y U provide no GUI?

Because everybody’s needs are a bit different, the package does not come with any views. If you need some screens to manage and view your configured monitors you should handle this in your own project. There’s only one table to manage – monitors – so that should not be overly difficult. We use semver, so we guarantee we’ll make no breaking changes within a major version. The screens you build around the package should just keep working after upgrading the package. If you’re in a generous mood you could make your fellow developers happy by making a package out of your screens.

A word on recording uptime history

Some notifications, for example UptimeCheckRecovered, have a little bit of knowledge on how long a period of downtime was. Take a look at the notification:

[screenshot: the monitor-recovered notification]

But other than that the package records no history. If you want to, for example, calculate an uptime percentage or draw a graph of the uptime, you can leverage the various events that are fired. The documentation specifies which events get sent by the uptime check and the certificate check. A possible strategy would be to just write all those events to a big table, a kind of event log. You could use that event log to generate your desired reports. This strategy has got a name: event sourcing. If you’re interested in knowing more about event sourcing watch this talk by Mitchell van Wijngaarden given at this year’s Laracon.
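To make the event log idea a bit more concrete, here is a sketch of how an uptime percentage could be derived from such a log. Everything about the log format is an assumption: the rows are plain arrays with a `type` ('failed' or 'recovered', standing in for the UptimeCheckFailed and UptimeCheckRecovered events) and an `at` Unix timestamp, sorted chronologically and falling within the reporting period.

```php
<?php

// Hypothetical report helper; the event log format is an assumption,
// not something the package provides.
// Expects $periodEnd to be greater than $periodStart.
function uptimePercentage(array $events, int $periodStart, int $periodEnd): float
{
    $downSeconds = 0;
    $downSince = null;

    foreach ($events as $event) {
        if ($event['type'] === 'failed' && $downSince === null) {
            // A period of downtime starts.
            $downSince = $event['at'];
        } elseif ($event['type'] === 'recovered' && $downSince !== null) {
            // The period of downtime ends: add its length to the total.
            $downSeconds += $event['at'] - $downSince;
            $downSince = null;
        }
    }

    // Still down at the end of the period: count downtime up to the end.
    if ($downSince !== null) {
        $downSeconds += $periodEnd - $downSince;
    }

    return 100 * (1 - $downSeconds / ($periodEnd - $periodStart));
}

$events = [
    ['type' => 'failed', 'at' => 3600],
    ['type' => 'recovered', 'at' => 4200],
];

// 600 seconds of downtime in a 24 hour period.
printf("%.2f%%\n", uptimePercentage($events, 0, 86400)); // 99.31%
```

The same log could feed a graph: each failed/recovered pair is one downtime interval on the timeline.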

Closing notes

Sure, there already are a lot of excellent services out there to check uptime, both free and paid. We created this package to be in full control of how the uptime and SSL checks work and how notifications get sent. You can also choose from which location the checks should be performed. Just install the package onto a server in that location.

Although it certainly took some time to get it right, the core functionality was fairly easy to implement. Guzzle already had much of the functionality we needed to perform uptime checks quickly. Laravel itself makes it a breeze to schedule uptime checks and comes with an excellent notification system out of the box.

We put a lot of effort into polishing the code and making sure everything just works. At Spatie, we are already using our own monitor to check the uptime of all our sites. If you choose to use our package, we very much hope that you’ll like it. Check out the extensive documentation we’ve written. If you have any remarks or questions, just open up an issue on the repo on GitHub. And don’t forget to send us a postcard! 🙂

We’ve made a lot of other packages in the past, check out this list on our company site. Maybe we’ve made something that could be of use in your projects.

PHP 7 is gaining ground fast

Jordi Boggiano shared some new stats on PHP version usage he collects via Packagist.

A few observations: 5.3 and 5.4 at this point are gone as far as I am concerned! 5.5 still has a good presence but lost 12% in 6 months which is awesome. 5.6 basically stayed stable as I suspect people jumped from 5.5 to 7 directly probably when upgrading Ubuntu LTS. 7.0 gained 15% and is now close to being the most deployed version, 1 year after release! That should definitely encourage more libraries to require it IMO, and I hope it is good encouragement to PHP internals folks as well to see that people actually upgrade these days 🙂

It’s very cool that PHP 7 is being adopted so quickly. I suspected that it would go down this way. Unfortunately the majority of package creators are still targeting PHP 5. Jordi has this to say on that.

As I wrote in the last update: I would like to encourage everyone to be a bit more aggressive in bumping PHP requirements when tagging new major releases of their libs. Don’t forget that the old code does not go away, it’s still there to be used by people using legacy PHP versions.

Amen!

Read Jordi’s blogpost here: https://seld.be/notes/php-versions-stats-2016-2-edition

An unofficial Forge API

You might not know this but Forge already has an API, it’s just not a documented and feature complete one. Open up your dev tools and inspect the web requests being sent while you do various stuff on Forge.

Marcel Pociot published a new package called Blacksmith (great name Marcel) that can make calls to that API. Here are a few code examples taken from the readme:

$activeServers = $blacksmith->getActiveServers();

$server = $blacksmith->getServer(1);

$sites = $server->getSites();

$newSite = $server->addSite($site_name, $project_type = 'php', $directory = '/public', $wildcards = false);

$jobs = $server->getScheduledJobs();

$newJob = $server->addScheduledJob($command, $user = 'forge', $frequency = 'minutely');

Very cool stuff. Because the Forge API doesn’t include a method to login, the package will under the hood just submit a filled in login form.

An official API for Forge has been on my wishlist for quite some time. Because Forge’s code base already includes an API my guess is that it wouldn’t be too much work to grow it into a full, publicly available one. Though I surely cannot read Taylor’s mind, I think that if there were some more indications that a Forge API would be used by enough people, there’s a higher chance that an official API would be built. I think the only reason why the API hasn’t been built yet is because not enough people are asking for it. It makes sense for Taylor to only work on things that would actually be used. So if you are using Forge and do want an official API, go ahead and star the Blacksmith package on GitHub and make some noise about it.

A full web request may be faster than a SPA

At this year’s Chrome Dev Summit Jake Archibald gave an excellent talk on some new features that are coming to the service worker. In case you don’t know, a service worker is a piece of JavaScript that sits between your web page and the network: it can intercept the requests the page makes and decide how to respond to them. A common use case for a service worker is to display a custom page when there is no internet connection available, instead of showing the default error message in your browser. And of course you can use a service worker to have a high degree of control over how things are cached.

I really like Jake’s presentation style in general. He always injects a lot of humour. This time he’s presenting in a tuxedo, with no shoes on and he uses a Wii Controller to control his slides. Go Jake!

A very interesting part of the talk is when he touches on the time needed to display a page. Turns out a full web request can be faster than a fancy single page application. Watch that segment on YouTube by clicking here.

Or you can choose to just watch the whole presentation in the video below.

Let’s talk about phone numbers on mobile sites

Wes Bos shares a quick tip to make phone numbers on websites tappable.

I’m talking about when phone numbers on a website aren’t tapable. Often the HTML is so that mobile operating systems cannot select the phone number alone and you are forced to remember/recite or write down the actual number.

So, when you put a phone number on a website, don’t just use any old element, use a link with the tel protocol.

http://wesbos.com/lets-talk-about-phone-numbers-on-mobile-sites/

PHP 7 at Tumblr

Another big boy on the web upgraded to PHP 7. If you’re not yet on the train heading for PHP7-ville, best get your ticket soon, you won’t regret it.

At Tumblr, we’re always looking for new ways to improve the performance of the site. This means things like adding caching to heavily used codepaths, testing out new CDN configurations, or upgrading underlying software.
Recently, in a cross-team effort, we upgraded our full web server fleet from PHP 5 to PHP 7. The whole upgrade was a fun project with some very cool results, so we wanted to share it with you.

https://engineering.tumblr.com/post/152998126990/php-7-at-tumblr

Laravel service provider examples

On his blog Barry van Veen listed some examples of things you can do within a Laravel service provider.

Currently, I’m working on my first Laravel package. So, it was time to dive into the wonderful world of the service container and service providers.

Laravel has some great docs about, but I wanted to see some real-world examples for myself. And what better way than to have a look at the packages that you already depend on?

This post details the different things that a service provider can be used for, each taken from a real open-source project. I’ve linked to the source of each example.

https://barryvanveen.nl/blog/34-laravel-service-provider-examples

Testing interactive Artisan commands

For a new package I’m working on I had to test some Artisan commands. The commands I want to test contain calls to ask and confirm to interactively get some input from the user. I had a little trouble finding a way to test such commands, but luckily a blogpost by Mohammed Said pointed me in the right direction, which was to leverage partial mocks.

Here’s the most interesting part, Artisan Commands can ask the user to provided specific pieces of information using a predefined methods that cover all the use cases an application might need.

So we mock the command, register the mocked version in Kernel, add our expectations for method calls, and pretend the user response in the form of return values.

http://themsaid.com/building-testing-interactive-console-20160409/

On Being Explicit

Mathias Verraes, one of the organizers of DDD Europe, recently gave a talk at DDD London on how to name things to both improve your code and to improve communication with the business.

“Make the implicit explicit” must be one of the most valuable advices I ever got about software modelling and design. Gather around for some tales from the trenches: stories from software projects where identifying a missing concept, and bringing it front and centre, turned the model inside out. Our tools: metaphors, pedantry, type systems, the age old heuristic of “Follow the money”, visual models, and a healthy obsession with language.

https://skillsmatter.com/skillscasts/8806-ddd-meetup

Does code need to be perfect?

Andreas Creten explains his view on whether you should always try to write perfect code. Spoiler: no.

The engineers want to write perfect code using the latest techniques, make sure that the code is well documented so they can fully understand how everything works and that it has tests so they can easily update things later. Product owners on the other hand just want things to be done, fast and cheap, so they can ship new features or convince new clients.
How can you make these conflicting views work together?

https://medium.com/we-are-madewithlove/does-code-need-to-be-perfect-a53f36ad7163