When I started writing Perfecty Push Notifications for WordPress, a self-hosted push server in PHP built on web-push-php, I decided to target almost every WordPress server on the planet (shared hosts, VPS, dedicated hosts). It was important to offer a plugin that worked out of the box in almost any installation and still gave a good user experience.

A 2300% performance improvement sounds extravagant, and you could say "yeah, it was a bad design from the beginning, that's why". I'll argue the reasons behind it and invite you to read the whole post. Maybe there will be some takeaways for you, especially if you like to build side projects as I do :)

Everything started with the end-user in mind

Sending thousands of notifications from a web form that hangs for minutes on a page that never finishes loading, while it does all the processing, is awful. So I decided to send the notifications with background jobs.

There are sophisticated ways to do background processing in WordPress, like Action Scheduler (used in WooCommerce), which adjusts itself automatically:

[...] Action Scheduler will only process actions in a request until:

  • 90% of available memory is used
  • processing another 3 actions would exceed 30 seconds of total request time, based on the average processing time for the current batch
  • in a single concurrent queue

However, it was particularly problematic in my case because it requires that the website can reach itself over the internet, which is not guaranteed by default in some server configurations.

Considering that WordPress has its own cron system, WP-Cron, which is well supported in the vast majority of installations, I decided to start with that one instead. It is built in and has tons of documentation and help available online for future end users. The only concern was that it relies on web traffic to trigger the jobs. However, wp-cron.php can be executed by a system cron instead, so it was not a big problem.
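
For reference, the usual way to hand WP-Cron over to a system cron is to disable the traffic-based trigger in wp-config.php and call wp-cron.php on a schedule. The sketch below assumes a site at https://example.com (a placeholder) and a one-minute interval, so adjust both to the installation:

```php
<?php
// wp-config.php: stop WordPress from spawning WP-Cron on page loads.
// DISABLE_WP_CRON is a standard WordPress constant. Place this line
// above the "That's all, stop editing!" comment.
define( 'DISABLE_WP_CRON', true );

// Then trigger wp-cron.php from the system crontab instead, for example
// (https://example.com is a placeholder for the real site URL):
//
//   * * * * * curl -s "https://example.com/wp-cron.php?doing_wp_cron" > /dev/null 2>&1
```

With this in place the jobs run on a fixed schedule, regardless of how much traffic the site gets.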

I implemented simple background processing in batches, with conservative defaults so that the plugin worked out of the box in almost any installation. The drawback was that, from the beginning, it was not fast enough for highly demanding websites. I planned to solve that in future iterations, because what mattered was having an MVP that showcased its value.

Highlights of the original mechanism:

  • The batch_size setting defined the total number of notifications sent in one WP-Cron execution. It was adjustable and defaulted to 30 notifications per job execution.
  • After sending batch_size notifications, the job would schedule itself again to send the next batch.
  • It used the default batchSize = 1000 parameter from web-push-php, although the batch_size setting from Perfecty Push would still cap it.

The code at that time looked like this:

    public static function execute_broadcast_batch( $notification_id ) {
        Log::info( 'Executing batch for job id=' . $notification_id );

        $notification = Perfecty_Push_Lib_Db::get_notification( $notification_id );
        if ( ! $notification ) {
            Log::error( "Notification job $notification_id was not found" );
            return false;
        }

        // if it has been taken but not released, that means a wrong state
        if ( $notification->is_taken ) {
            Log::error( 'Halted, notification job already taken, notification_id: ' . $notification_id );
            return false;
        }

        // we check if it's a valid status
        // we only process running or scheduled jobs
        if ( $notification->status !== Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_SCHEDULED &&
        $notification->status !== Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_RUNNING ) {
            Log::error( 'Halted, received a job with an invalid status (' . $notification->status . '), notification_id: ' . $notification_id );
            return false;
        }

        // this is the first time we get here so we mark it as running
        if ( $notification->status == Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_SCHEDULED ) {
            Log::info( 'Marking job id=' . $notification_id . ' as running' );
            Perfecty_Push_Lib_Db::mark_notification_running( $notification_id );
        }

        Perfecty_Push_Lib_Db::take_notification( $notification_id );

        // we get the next batch, starting from $last_cursor we take $batch_size elements
        // we only fetch the active users (only_active)
        $users = Perfecty_Push_Lib_Db::get_users( $notification->last_cursor, $notification->batch_size, 'created_at', 'desc' );

        if ( count( $users ) == 0 ) {
            Log::info( 'Job id=' . $notification_id . ' completed, released' );
            $result = Perfecty_Push_Lib_Db::mark_notification_completed_untake( $notification_id );
            if ( ! $result ) {
                Log::error( "Could not mark the notification job $notification_id as completed" );
                return false;
            }
            return true;
        }

        // we send one batch
        $result = self::send_notification( $notification->payload, $users );
        if ( is_array( $result ) ) {
            $notification                    = Perfecty_Push_Lib_Db::get_notification( $notification_id );
            $total_batch                     = $result[0];
            $succeeded                       = $result[1];
            $notification->last_cursor      += $total_batch;
            $notification->succeeded        += $succeeded;
            $notification->is_taken          = 0;
            $notification->last_execution_at = current_time( 'mysql', 1 );
            $result                          = Perfecty_Push_Lib_Db::update_notification( $notification );

            Log::info( 'Notification batch for id=' . $notification_id . ' sent. Cursor: ' . $notification->last_cursor . ', Total: ' . $total_batch . ', Succeeded: ' . $succeeded );
            if ( ! $result ) {
                Log::error( 'Could not update the notification after sending one batch' );
                return false;
            }
        } else {
            Log::error( 'Error executing one batch for id=' . $notification_id . ', result: ' . $result );
            Perfecty_Push_Lib_Db::mark_notification_failed( $notification_id );
            Perfecty_Push_Lib_Db::untake_notification( $notification_id );
            return false;
        }

        // execute the next batch
        if ( ! wp_next_scheduled( self::BROADCAST_HOOK, array( $notification_id ) ) ) {
            $result = wp_schedule_single_event( time(), self::BROADCAST_HOOK, array( $notification_id ) );
            Log::info( 'Scheduling next batch for id=' . $notification_id . ' . Result: ' . $result );
        } else {
            Log::warning( "Don't schedule next batch, it's already scheduled, id=" . $notification_id );
        }
        return true;
    }

Good for a first version, but problematic

However, the above code is slow, and during my initial tests there was a problem with how the batchSize parameter from web-push-php worked. This parameter defines how many notifications are flushed together as asynchronous HTTP requests. You can see each of those batches as a group of concurrent requests, and they can create high spikes in memory and CPU usage, which can cause some weird errors in the downstream components like:

[14-Dec-2020 19:41:36] WARNING: [pool website.com]
child 4593 said into stderr: "PHP message: ERROR |
Failed to send one notification, error:
cURL error 60: Issuer certificate is invalid.
(see https://curl.haxx.se/libcurl/c/libcurl-errors.html) 
for https://fcm.googleapis.com/fcm/send/XXXXXXXXXXXXXXXXXXXXXXX"

Although I initially suspected problems with cURL and the certificates, it was my server that couldn't handle more than 300 concurrent notifications with those specs. So, instead of tweaking the batchSize parameter from web-push-php, I adjusted the batch_size value in my plugin and never used a value higher than 250 in my production-like environment.
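
To make the interaction between the two settings concrete, here is a minimal sketch. It assumes, as described above, that the library flushes its queue in chunks of batchSize and sends each chunk concurrently. The function name is hypothetical and the code only counts requests, it does not send anything:

```php
<?php
/**
 * Illustrative only: splits a queue of notifications into chunks of
 * $flush_batch_size, where each chunk stands for one group of HTTP
 * requests that are in flight at the same time.
 *
 * @param array $queue            Queued notifications (any values).
 * @param int   $flush_batch_size Maximum number of concurrent requests per chunk.
 * @return int[] Number of concurrent requests in each chunk.
 */
function example_concurrent_requests_per_chunk( array $queue, int $flush_batch_size ): array {
    if ( $flush_batch_size < 1 ) {
        throw new InvalidArgumentException( 'The flush batch size must be at least 1.' );
    }
    return array_map( 'count', array_chunk( $queue, $flush_batch_size ) );
}

// One cron execution with the plugin's batch_size = 250.
$queue = range( 1, 250 );

// Library default (batchSize = 1000): all 250 requests go out together.
print_r( example_concurrent_requests_per_chunk( $queue, 1000 ) );

// A lower flush size of 50: five groups of 50 requests each.
print_r( example_concurrent_requests_per_chunk( $queue, 50 ) );
```

In other words, with the library default left untouched, the plugin's batch_size was effectively the concurrency level, which is why keeping it at 250 or below avoided the errors.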

For a fresh installation of the plugin, the default value of batch_size = 30 had a very decent throughput: it took around 3 minutes to send 1,000 notifications, which is acceptable if you want to send push notifications for free. So I launched it.

However, as some websites started to have more than 10,000 users, the plugin was taking more than 23 minutes to complete the batch processing and it was noticeably slow. I needed to do something...

2300% faster

Recently I published v1.4.0 with performance improvements that make the plugin 2300% faster. It can now send more than 13,000 notifications in 56 seconds on a very basic server with 2 GB of RAM and 2 vCPUs, a huge gain compared to the 23 minutes it was taking before.
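
As a quick sanity check of that figure, here is the arithmetic as a small sketch (the function name is just for illustration). It compares the two reported durations directly, which is slightly conservative because the 23 minutes were measured with a bit more than 10,000 users and the 56 seconds with more than 13,000 notifications:

```php
<?php
/**
 * Percentage by which the new duration is faster than the old one,
 * computed as (old - new) / new * 100.
 */
function example_percent_faster( float $old_seconds, float $new_seconds ): float {
    if ( $new_seconds <= 0 ) {
        throw new InvalidArgumentException( 'The new duration must be positive.' );
    }
    return ( $old_seconds - $new_seconds ) / $new_seconds * 100;
}

$old_seconds = 23 * 60; // 23 minutes reported before v1.4.0
$new_seconds = 56;      // 56 seconds reported with v1.4.0

printf( "%.0f%% faster\n", example_percent_faster( $old_seconds, $new_seconds ) ); // 2364% faster
printf( "%.1f times as fast\n", $old_seconds / $new_seconds );                     // 24.6 times as fast
```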

The server load after the improvements looked like this:

[Image: server usage after the improvements]

Highlights of the new mechanism:

  • Multiple batches are now executed in a single cron job (previously each batch needed its own cron job), which removes the time wasted between cron job executions (~5s to 10s each). If it can send all the notifications at once, it will do it.
  • The execution is split into subsequent cron jobs if it takes more than 80% of the maximum execution time. This can be avoided if the script has no time limit or the limit is very high.
  • The batch_size setting now defaults to 1,500. Previously it was 30, and values higher than 250 caused weird issues. With this mechanism I've used values like 20,000 and it works smoothly :)
  • The parallel flushing (batchSize from web-push-php) is now adjustable and defaults to a low value (50) to avoid the weird cURL issues mentioned above. It can be increased on servers with better specs.

If you want to take a look at the code, it's this:

    public static function execute_broadcast_batch( $notification_id ) {
        Log::info( 'Executing batch for job id=' . $notification_id );

        $notification = Perfecty_Push_Lib_Db::get_notification( $notification_id );
        if ( ! $notification ) {
            Log::error( "Notification job $notification_id was not found" );
            return false;
        }

        // if it has been taken but not released, that means a wrong state
        if ( $notification->is_taken ) {
            Log::error( 'Halted, notification job already taken, notification_id: ' . $notification_id );
            return false;
        }

        // we check if it's a valid status
        // we only process running or scheduled jobs
        if ( $notification->status !== Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_SCHEDULED &&
        $notification->status !== Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_RUNNING ) {
            Log::error( 'Halted, received a job with an invalid status (' . $notification->status . '), notification_id: ' . $notification_id );
            return false;
        }

        // this is the first time we get here so we mark it as running
        if ( $notification->status == Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_SCHEDULED ) {
            Log::info( 'Marking job id=' . $notification_id . ' as running' );
            Perfecty_Push_Lib_Db::mark_notification_running( $notification_id );
        }

        Perfecty_Push_Lib_Db::take_notification( $notification_id );

        // we get the next batch, starting from $last_cursor we take $batch_size elements
        // we only fetch the active users (only_active)
        $total_succeeded = 0;
        $cursor          = $notification->last_cursor;
        $start_time      = microtime( true );
        while ( true ) {
            $users  = Perfecty_Push_Lib_Db::get_users( $cursor, $notification->batch_size );
            $count  = count( $users );
            $cursor = $cursor + $count;

            if ( $count == 0 ) {
                Log::info( 'Job id=' . $notification_id . ' completed, released' );
                $result = Perfecty_Push_Lib_Db::mark_notification_completed_untake( $notification_id );
                if ( ! $result ) {
                    Log::error( "Could not mark the notification job $notification_id as completed" );
                    break;
                }
                break;
            }

            $succeeded = self::send_notification( $notification->payload, $users );
            if ( $succeeded !== 0 ) {
                Log::info( "Completed batch, successful: $succeeded, cursor: $cursor" );
                $total_succeeded += $succeeded;
            } else {
                Log::error( 'Error executing one batch for id=' . $notification_id );
                Perfecty_Push_Lib_Db::mark_notification_failed( $notification_id );
                Perfecty_Push_Lib_Db::untake_notification( $notification_id );
                break;
            }

            // check that we don't exceed 80% of max_execution_time
            // in case we do, we split the execution to a next cron cycle to avoid the termination of the script
            // if max_execution_time=0, we never split
            if ( self::time_limit_exceeded( $start_time ) ) {
                Log::warning( 'Time execution is reaching 80% of max_execution_time, moving to next cycle' );
                break;
            }
        }

        if ( $total_succeeded != 0 ) {
            $notification                    = Perfecty_Push_Lib_Db::get_notification( $notification_id );
            $notification->last_cursor       = $cursor;
            $notification->succeeded        += $total_succeeded;
            $notification->is_taken          = 0;
            $notification->last_execution_at = current_time( 'mysql', 1 );
            $result                          = Perfecty_Push_Lib_Db::update_notification( $notification );

            Log::info( 'Notification cycle for id=' . $notification_id . ' sent. Cursor: ' . $notification->last_cursor . ', Succeeded: ' . $total_succeeded );
            if ( ! $result ) {
                Log::error( 'Could not update the notification after sending one batch' );
                return false;
            }

            if ( $notification->status === Perfecty_Push_Lib_Db::NOTIFICATIONS_STATUS_RUNNING ) {
                // execute the next batch
                if ( ! wp_next_scheduled( self::BROADCAST_HOOK, array( $notification_id ) ) ) {
                    $result = wp_schedule_single_event( time(), self::BROADCAST_HOOK, array( $notification_id ) );
                    Log::info( 'Scheduling next batch for id=' . $notification_id . ' . Result: ' . $result );
                } else {
                    Log::warning( "Don't schedule next batch, it's already scheduled, id=" . $notification_id );
                }
            }
        }

        return true;
    }
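
The time_limit_exceeded() helper is not shown above. A minimal sketch of how such a check could work, assuming it compares the elapsed wall-clock time against 80% of PHP's max_execution_time, is shown below. The function name, parameters and threshold argument are hypothetical and this is not the plugin's actual implementation:

```php
<?php
/**
 * Hypothetical sketch of a time-limit check; not the plugin's actual code.
 *
 * @param float    $start_time Value of microtime( true ) when the cycle started.
 * @param int|null $max_time   Limit in seconds; defaults to PHP's max_execution_time.
 * @param float    $threshold  Fraction of the limit that triggers a split.
 * @return bool True when the elapsed time has reached the threshold.
 */
function example_time_limit_exceeded( float $start_time, ?int $max_time = null, float $threshold = 0.8 ): bool {
    if ( null === $max_time ) {
        $max_time = (int) ini_get( 'max_execution_time' );
    }

    // A limit of 0 means the script may run forever, so never split.
    if ( $max_time <= 0 ) {
        return false;
    }

    $elapsed = microtime( true ) - $start_time;
    return $elapsed >= $threshold * $max_time;
}

// Usage: 25 seconds into a cycle with a 30-second limit (the threshold is 24 seconds).
var_dump( example_time_limit_exceeded( microtime( true ) - 25, 30 ) ); // bool(true)
var_dump( example_time_limit_exceeded( microtime( true ) - 5, 30 ) );  // bool(false)
var_dump( example_time_limit_exceeded( microtime( true ) - 500, 0 ) ); // bool(false)
```

Checking after each batch, rather than before, means one batch can still overshoot the threshold, so the remaining 20% acts as the safety margin for the batch that is in progress.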

The process to adjust the performance of the plugin in a WordPress site with this new mechanism is very well described in the official Perfecty Push documentation here: https://docs.perfecty.org/wp/performance-improvements/

Conclusion

Of course, this time can be reduced much further by adjusting the web server limits (the memory limit or the PHP-FPM parameters), by increasing the server specs, or by moving other components elsewhere if they reside on the same server (mail server, metrics server, external admin panel, etc.). The idea is that with the new approach it's easier for end users to tune it and achieve a much faster push server.

It also demonstrates that working in iterations helps to show the product's value to end users from the beginning, and that it's better to release a good working version on time than to get stuck forever waiting for absolute perfection. Please understand, perfection takes time and a couple of iterations.

Photos

Photo by Jean Gerber on Unsplash

Photo by ray sangga kusuma on Unsplash

Photo by Josh Calabrese on Unsplash

This post is also available on DEV.