Scraping – Node.js vs PHP

One more example of screen scraping, this time using Node.js and PHP. This is more of a benchmark test than an example. The task was simple: get all the team names of the Fantasy Premier League from the first 200 pages, so there were 200 requests altogether (one request per page).
The URL for the first page was http://fantasy.premierleague.com/my-leagues/303/standings/?ls-page=1
The external library used for Node.js was cheerio, and phpQuery for PHP.

Node.js

var start = +new Date();
var request = require('request');
var cheerio = require('cheerio');
var total_page = 200;
var page = 1;
var header = ['', 'Rank', 'Team', 'Name', 'Point', 'Total'];

console.log("Page number, Time taken");

while (page <= total_page) {
    var url = 'http://fantasy.premierleague.com/my-leagues/303/standings/?ls-page=' + page;
    request(url, (function(i) {
        return function (error, response, body) {
            if (error) {
                console.error(i + ", request failed: " + error.message);
                return;
            }
            var $ = cheerio.load(body);
            $('.ismStandingsTable').find('tr').each(function(index, elem) {
                $(this).find('td').each(function(head) {
                    if (head == 2) { // third cell holds the team name
                        //console.log(header[head] + ' : ' + $(this).text());
                        //console.log($(this).text());
                    }
                });
            });
            var end = +new Date();
            console.log(i + ", " + (end - start) / 1000);
        };
    })(page)); // IIFE captures the current page number for the callback
    page++;
}
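The IIFE wrapped around the request callback matters here: since `var` is function-scoped, all 200 callbacks would otherwise see the final value of `page`. A minimal sketch of the difference, using timers instead of HTTP requests (illustrative only, not part of the benchmark):

```javascript
// Without the IIFE, every async callback sees the loop variable's
// final value, because `var` is function-scoped, not block-scoped.
var results = [];
for (var i = 1; i <= 3; i++) {
    setTimeout(function () {
        results.push(i); // by the time this runs, i is 4
    }, 0);
}

// Wrapping the callback in an IIFE captures the current value,
// which is what the (function(i) { ... })(page) above does.
var fixed = [];
for (var j = 1; j <= 3; j++) {
    setTimeout((function (n) {
        return function () {
            fixed.push(n);
        };
    })(j), 0);
}

setTimeout(function () {
    console.log(results); // [ 4, 4, 4 ]
    console.log(fixed);   // [ 1, 2, 3 ]
}, 10);
```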

PHP

<?php
require('phpQuery/phpQuery.php');

$time_start = microtime(true);
$total_page = 200;
$page = 1;
$header = array('', 'Rank', 'Team', 'Name', 'Point', 'Total');

echo ("Page number, Time taken");
while ($page <= $total_page) {
    // newDocumentFileHTML fetches the page synchronously via file_get_contents
    $doc = phpQuery::newDocumentFileHTML('http://fantasy.premierleague.com/my-leagues/303/standings/?ls-page='.$page);
    foreach (pq('.ismStandingsTable tr') as $data) {
        foreach (pq('td', $data) as $key => $val) {
            if ($key == 2) { // third cell holds the team name
                //print pq($val)->text();
            }
        }
    }
    $time_end = microtime(true);
    $execution_time = $time_end - $time_start;
    echo ("\n".$page.", ".$execution_time);
    $page++;
}
?>

Node.js took 175.535 sec to complete, whereas PHP took 711.790 sec. PHP was about four times slower than Node.js.

Here is a graph of the time taken, page by page, to complete the task for each request.
[Graph: page-by-page completion times, Node.js vs PHP]

Updated (Nov 14, 2013)

After reading this response post, "codswallop", I came to know that there exists a PHP library, ReactPHP, which makes PHP behave asynchronously. This is definitely going to improve efficiency while scraping: since phpQuery uses file_get_contents, it has to wait for the response to each request before issuing the next. So I borrowed this code, which uses ReactPHP for the requests and phpQuery just for parsing the HTML; the task was the same as above.
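The blocking-vs-async difference can be illustrated without any HTTP at all: if each "request" takes a fixed delay, issuing them one after another costs N delays, while issuing them concurrently costs roughly one. A toy sketch in modern Node, with timers standing in for page fetches (hypothetical names, not the benchmark code):

```javascript
// Toy model: each "request" resolves after delayMs, standing in for a page fetch.
function fakeRequest(page, delayMs) {
    return new Promise(function (resolve) {
        setTimeout(function () { resolve('page ' + page); }, delayMs);
    });
}

// Sequential, like file_get_contents in a loop: wait for each response.
async function sequential(n, delayMs) {
    for (var p = 1; p <= n; p++) {
        await fakeRequest(p, delayMs);
    }
}

// Concurrent, like the Node loop above or ReactPHP: all requests in flight at once.
function concurrent(n, delayMs) {
    var jobs = [];
    for (var p = 1; p <= n; p++) {
        jobs.push(fakeRequest(p, delayMs));
    }
    return Promise.all(jobs);
}

var seqTime, conTime;
(async function () {
    var t0 = Date.now();
    await sequential(5, 50);
    seqTime = Date.now() - t0;   // roughly 5 x 50 ms

    t0 = Date.now();
    await concurrent(5, 50);
    conTime = Date.now() - t0;   // roughly 50 ms

    console.log('sequential ~' + seqTime + 'ms, concurrent ~' + conTime + 'ms');
})();
```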
Great! PHP took just 39.351 sec to complete.
[Graph: page-by-page completion times, Node.js vs ReactPHP + phpQuery]

Then somebody from HN pointed out that the default maxSockets limit of Node.js's HTTP agent is 5, and that it should be cranked up to 64 for a fair comparison. So I changed its value to 64.

require('http').globalAgent.maxSockets = 64;

[Graph: page-by-page completion times, ReactPHP vs Node.js with maxSockets = 64]
The run times are now almost the same: PHP took 39.351 sec to complete and Node.js took 37.67 sec. I'm not sure about my bandwidth; when I pinged a random European server from my location using speedtest.net, it showed 9.12 Mbps down and 4.45 Mbps up.
The runtime of PHP has improved. The difference is just about 2 sec, not four times like I said earlier.
If somebody comes up with a better option to improve the efficiency of my code, I would be happy to play around with it.
And yes, this article is about which is faster, Node.js or PHP, just for scraping.
As long as the same thing can be achieved on different platforms, people are going to compare them, whether it's "Nodejs vs PHP" or "Nodejs vs Python" or "Nodejs vs Scala" or whatever.
I selected the platforms I was comfortable with, took the best libraries (that I knew of), and did my test.
Clearly phpQuery wasn't the best choice, and ReactPHP does a better job, but I don't see anything wrong with the title here.
If someday someone pops up and says, "Here, try this library OverReactPhp. It makes your code much faster", I would be happy to do my test again.
GitHub repo

    • rojansinha

Now I have a chance to know about ReactPHP http://reactphp.org/

      • Oh, it isn’t mine – just passing it along

        • rojansinha

          I thought it was yours. Sorry about that.

  • Jonas Kuhl

    This python lib owns both =) http://scrapy.org/

    • philsturgeon

      No, Scrapy might own both, but “python” doesn’t.

      • That’s why he said “this python lib” and not “python”

        • philsturgeon

          Deleted, I messed up.

  • philsturgeon

    It is a shame you chose to try and defend your article because your reasoning does not stand up to logic.

    “As long as same thing can be achieved using different platforms people are going to compare. Either its “Nodejs vs PHP” or “Nodejs vs Python” or “Nodejs vs Scala” or whatever.”

    And they would all of course be wrong. Unless you are using built-in features of the standard library (code provided without installing anything) you are not testing the language, you are testing the abilities of third-party developer. That TPD might be the worlds smartest guy, or he might have just picked up a coding for idiots book. Judging an entire language based off of the abilities of a single developer is clearly not scientific, or relevant.

    Furthermore I was not berating you for not knowing about ReactPHP. I used React PURELY as an example, that async code is faster than sync code. By shoving this sync code in an async loop I got that exact same library running in a similar fashion to cheerio, meaning a more fair comparison could be made.

    It’s not your fault you didn’t know about ReactPHP, but it IS your fault for not recognizing that one piece of third-party code (which runs synchronously) is obviously going to be slower than another piece of code (which runs asynchronously), and furthermore you were wrong to suggest that this is anything at all to do with the language.

    Thanks again for the update, but clinging onto “i don’t see anything wrong with title here.” is only embarrassing yourself.

    • rojansinha

      You are completely ignoring the purpose of the test (which was for scraping) and just ranting about async, sync, blocking, nonblocing.

      Of course language cannot be tested just by using any third party library but while choosing any language over other i can consider the availability for its third party libraries.

      • philsturgeon

        I am not “just ranting about async, sync, blocking, nonblocing.”

        Firstly, your spelling and grammar is atrocious but that is besides the point. These four things are also not four different things in regards to the post, async code runs in a non-blocking fashion. So there’s that.

        If you really want to continue to argue that NodeJS v PHP can be fairly compared based entirely on the use of third-party libraries, when flagrantly ignoring the fact that the one you used for one of the languages is using blocking code then I tip my hat to you. You are bad at computer science and you have no idea how benchmarks work. I’d hoped you’d come around, but you can continue to compare apples to oranges as long as you like. Good day.

        • John

          I’m pretty sure he is ESL – I’m happy with everything else you’ve said, but he shouldn’t be put down for trying. I’ve read worse.

          • Vedant Mistry

You got the message, right? And that was the important part. If you were asked to write or read in Russian, Korean or any other language, you would look just as dumb, and considering that fact, what you have emphasised only highlights your own foolishness.

  • Adrien Delorme

    Nobody cares about speed, real projects only care about maintainability, http://www.techempower.com/benchmarks/.
    Use Go if you need a speedy web tech.

  • I would be interested in seeing this post with 0.11 / 0.12 of node with streams3 and the limit of sockets is set to infinity.

  • Sawan Sanghvii

Nice article… I'm gonna try this… Check PDF scraping using PHP on my blog: http://www.webdata-scraping.com/blog

  • Whatever anybody says, I like this article. Thanks.