Have you ever tried to scrape or harvest data from an existing website (I mean, even Ajax-bloated ones)? Have you ever attempted to test JavaScript-dependent interactions within a web application you built? Well, if you answered yes to either of the questions above, you might be interested in PhantomJS.
PhantomJS is a headless WebKit with a JavaScript API. Headless means you can script a real WebKit-based browser without needing a full graphical interface installed.
Installation
On OS X, you can install it using Homebrew (note that Xcode must be installed on your machine):
$ brew install phantomjs
It can take a while for the binaries to build, mostly because of their dependency on Qt4. When it’s done, you can test it this way:
$ phantomjs
Usage: phantomjs [options] script.[js|coffee] [script argument [script argument ...]]
Options:
--load-images=[yes|no] Load all inlined images (default is 'yes').
--load-plugins=[yes|no] Load all plugins (i.e. 'Flash', 'Silverlight', ...) (default is 'no').
--proxy=address:port Set the network proxy.
--disk-cache=[yes|no] Enable disk cache (at desktop services cache storage location, default is 'no').
--ignore-ssl-errors=[yes|no] Ignore SSL errors (i.e. expired or self-signed certificate errors).
Installation instructions for other platforms and alternative methods can be found on the PhantomJS project wiki.
As a side note, there’s also a Python implementation of PhantomJS, PyPhantomJS, which adds plugin support! Also, I’ve had no segfaults with the Python version, while the standard one is a bit more unstable on my box (no trolling please).
To install PyPhantomJS, let’s use pip:
$ pip install PyPhantomJS
The PyPhantomJS executable is named (surprise) pyphantomjs:
$ pyphantomjs
usage: pyphantomjs [options] script.[js|coffee] [script argument [script argument ...]]
Minimalistic headless WebKit-based JavaScript-driven tool
positional arguments:
script.[js|coffee] The script to execute, and any args to pass to it
optional arguments:
-h, --help show this help message and exit
--disk-cache {yes,no}
Enable disk cache (default: no)
--ignore-ssl-errors {yes,no}
Ignore SSL errors (default: no)
--load-images {yes,no}
Load all inlined images (default: yes)
--load-plugins {yes,no}
Load all plugins (i.e. Flash, Silverlight, ...) (default: no)
--proxy address:port Set the network proxy
-v, --verbose Show verbose debug messages
--version show this program's version and license
Usage of the two versions is exactly the same.
Basic usage
PhantomJS scripts can be written in standard JavaScript or in CoffeeScript. It’s mainly a matter of taste, but the CoffeeScript syntax looks really interesting.
So let’s write our first script. We want to retrieve the weather forecast for a given city using Google:
// script: meteo.js
var page = new WebPage()
  , output = { errors: [], results: null };
if (phantom.args.length === 0) {
    console.log('You must specify a city, eg. "Paris, France"');
    phantom.exit(1);
} else {
    // Only open the page when a city was given.
    // The city is encoded so spaces and commas survive in the query string.
    page.open('http://www.google.fr/search?q=meteo+' + encodeURIComponent(phantom.args[0]), function (status) {
        if (status !== 'success') {
            output.errors.push('Unable to access network');
        } else {
            // This function runs inside the loaded page, not in this script
            var cells = page.evaluate(function () {
                try {
                    var cells = document.querySelectorAll('.tpo tr tr')[4].querySelectorAll('td');
                    return Array.prototype.map.call(cells, function (cell) {
                        return cell.innerText.replace(/[^0-9]/g, '');
                    });
                } catch (e) {
                    return [];
                }
            });
            if (!cells || cells.length === 0) {
                output.errors.push('No valid meteo data found');
            } else {
                output.results = {
                    city: phantom.args[0],
                    today: {
                        afternoon: cells[1],
                        morning: cells[2]
                    },
                    tomorrow: {
                        afternoon: cells[3],
                        morning: cells[4]
                    }
                };
            }
        }
        // Print the report in every case, including network failures
        console.log(JSON.stringify(output, null, ' '));
        phantom.exit();
    });
}
Notice that we use the phantom.args array, which contains the parameters passed to the script.
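As a small aside on argument handling, here is a minimal sketch of a helper that turns the script arguments into the search URL. The name buildSearchUrl is hypothetical (it is not part of PhantomJS), and joining all the arguments with a space is my own assumption, so that an unquoted city name still works. It is plain JavaScript, so it can be tried outside PhantomJS too:

```javascript
// Hypothetical helper, not part of the PhantomJS API.
// args is expected to look like phantom.args: an array of strings.
function buildSearchUrl(args) {
    if (!args || args.length === 0) {
        return null; // nothing to search for
    }
    // Join every argument so `meteo.js Paris, France` works without quotes
    var city = Array.prototype.join.call(args, ' ');
    // encodeURIComponent escapes spaces and commas for the query string
    return 'http://www.google.fr/search?q=meteo+' + encodeURIComponent(city);
}

console.log(buildSearchUrl(['Paris,', 'France']));
// prints http://www.google.fr/search?q=meteo+Paris%2C%20France
```

Returning null for an empty argument list leaves it to the caller to print a usage message and exit.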
The main magic happens in the page.evaluate() method: we pass it a JavaScript function which will be evaluated within the retrieved page’s document environment. It’s a kind of non-persistent XSS injection, just to help you operate on the page contents =)
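To make the extraction step a bit more concrete, here is a sketch of the logic that runs inside page.evaluate(), pulled out into a standalone function. The name cellsToDigits and the stand-in objects are hypothetical. In the real script the input is the NodeList returned by querySelectorAll(), which is array-like but not a true Array, hence the Array.prototype.map.call() trick:

```javascript
// Hypothetical standalone version of the callback used in meteo.js.
// cells is any array-like collection of objects exposing innerText,
// for instance the NodeList returned by querySelectorAll('td').
function cellsToDigits(cells) {
    // Borrow Array's map, since array-likes have no map() of their own
    return Array.prototype.map.call(cells, function (cell) {
        // Strip everything that is not a digit, eg. "29°C" becomes "29"
        return cell.innerText.replace(/[^0-9]/g, '');
    });
}

// Stand-in for a NodeList: indexed properties plus a length
var fakeCells = {
    0: { innerText: 'Montpellier' },
    1: { innerText: '29°C' },
    2: { innerText: '17°C' },
    length: 3
};

// First entry is an empty string, followed by "29" and "17"
console.log(cellsToDigits(fakeCells));
```

Note that the regular expression also drops a minus sign, so negative temperatures would need a slightly different pattern.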
Now it’s time to launch the script to see how it goes:
$ phantomjs meteo.js "Montpellier, France"
{
"errors": [],
"results": {
"city": "Montpellier, France",
"today": {
"afternoon": "29",
"morning": "17"
},
"tomorrow": {
"afternoon": "28",
"morning": "17"
}
}
}
Now with an invalid city name:
$ phantomjs meteo.js "Unexistent City"
{
"errors": [
"No valid meteo data found"
],
"results": null
}
Let’s try with another city, an existing one this time:
$ phantomjs meteo.js "Paris, France"
{
"errors": [],
"results": {
"city": "Paris, France",
"today": {
"afternoon": "21",
"morning": "11"
},
"tomorrow": {
"afternoon": "21",
"morning": "11"
}
}
}
As a side note and in case you were wondering, you now understand a bit more why I moved to Montpellier ;)
I CAN HAZ SCREENSHOTS
PhantomJS also allows some nice tricks, like injecting scripts into the remote page, which is very useful when a remote website doesn’t ship with your favorite framework (eg. jQuery), or rendering a captured area of the webpage to a PNG image. The example below saves a capture of the weather forecast area:
// script: meteoclip.js
var page = new WebPage();
page.open('http://www.google.fr/search?q=meteo+montpellier,+France', function (status) {
    if (status !== 'success') {
        // There is no output object in this script, so just log the error
        console.log('Unable to access network');
    } else {
        // Restrict rendering to the weather forecast area (in pixels)
        page.clipRect = {
            top: 127,
            left: 170,
            width: 400,
            height: 114
        };
        page.render('meteo.png');
        console.log('Capture saved');
    }
    phantom.exit();
});
Running the meteoclip.js script will get you a fancy image of the forecast area, stored in meteo.png.
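Hard-coded clip coordinates stop working as soon as the page layout moves. A possible alternative, sketched here under assumptions, is to compute the clip rectangle from an element’s bounding box. The helper name rectToClip and its padding parameter are hypothetical, and the '.tpo' selector is simply borrowed from meteo.js. The PhantomJS part only runs when a phantom object exists:

```javascript
// Hypothetical helper: turn a bounding box into a clipRect-shaped object,
// with an optional padding (in pixels) around the element.
function rectToClip(rect, padding) {
    var pad = padding || 0;
    return {
        top: Math.floor(rect.top - pad),
        left: Math.floor(rect.left - pad),
        width: Math.ceil(rect.width + 2 * pad),
        height: Math.ceil(rect.height + 2 * pad)
    };
}

// Only meaningful inside PhantomJS; skipped anywhere else
if (typeof phantom !== 'undefined') {
    var page = new WebPage();
    page.open('http://www.google.fr/search?q=meteo+montpellier,+France', function (status) {
        if (status === 'success') {
            var rect = page.evaluate(function () {
                var el = document.querySelector('.tpo'); // assumed selector
                if (!el) { return null; }
                var box = el.getBoundingClientRect();
                // Copy plain numbers so the result can be passed back to the script
                return { top: box.top, left: box.left, width: box.width, height: box.height };
            });
            if (rect) {
                page.clipRect = rectToClip(rect, 5);
                page.render('meteo.png');
            }
        }
        phantom.exit();
    });
}
```

Rounding the origin down and the size up keeps the whole element inside the capture even when the bounding box has fractional values.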
There are tons of other cool PhantomJS topics to cover, like navigation handling, automated logins, retrieving external resources, functional testing, code organization… so maybe I’ll post a bit more about it soon, who knows!