The Charity Engine Smart Proxy enables running JavaScript applications within a full-featured web browser on a vast network of volunteer computing devices.

The Smart Proxy service currently uses PhantomJS to run a headless browser and exposes its API; we may add additional browsers/APIs in the future.

Smart Proxy service is under rapid development

Features may be added, changed, or removed at any time without advance notice. The service is in an alpha state.

This document is intended to obtain feedback on the roadmap for implementation and should not be considered final documentation.

Unless otherwise noted, the authentication and configuration mechanisms of the Charity Engine Distributed Proxy still apply, because Smart Proxy is an extension of the generic Distributed Proxy. Refer to the Charity Engine Distributed Proxy documentation for details.

Initiating and configuring Smart Proxy crawls

Smart Proxy crawls are initiated by connecting to the Distributed Proxy service and supplying additional HTTP headers.

x-proxy-phantomjs-script-url

The URL of the script to run against the target URL that was supplied to the proxy.

While currently any URL is accepted, the service will only allow whitelisted URLs or hostnames in the future.

x-proxy-phantomjs-script-md5

An MD5 hash of the script file specified in the x-proxy-phantomjs-script-url HTTP header. It is used to verify that the script was downloaded correctly. Scripts that do not match the supplied MD5 hash will not be run.
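The two headers above can be assembled on the requester side before the request is sent through the proxy. The following Node.js sketch shows one way to do that. The function names, the example script URL and the script contents are hypothetical, and it assumes the service expects the MD5 digest as a lowercase hexadecimal string, which this document does not specify.

```javascript
// Node.js sketch: build the extra HTTP headers for a Smart Proxy crawl.
const crypto = require('crypto');

// Returns the MD5 digest of the script source as lowercase hex.
// Assumption: the service expects a hex-encoded digest.
function md5Hex(scriptSource) {
  return crypto.createHash('md5').update(scriptSource).digest('hex');
}

// Builds the two Smart Proxy headers described above.
// scriptSource must be the exact contents served at scriptUrl,
// otherwise the hash will not match and the script will not run.
function buildSmartProxyHeaders(scriptUrl, scriptSource) {
  return {
    'x-proxy-phantomjs-script-url': scriptUrl,
    'x-proxy-phantomjs-script-md5': md5Hex(scriptSource)
  };
}

// Hypothetical script URL and contents, for illustration only.
const headers = buildSmartProxyHeaders(
  'https://example.com/scripts/get-full-page.js',
  'console.log("hello"); phantom.exit();'
);
console.log(headers);
```

These headers would then be added to the request that is sent through the Distributed Proxy, using the authentication described in the Distributed Proxy documentation.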

Response structure

WIP

{
  body: null,
  headers: null,
  statusCode: null,
  statusMessage: null,
  httpVersion: null
}

Example scripts

For an extended function reference, see the PhantomJS API documentation and examples.

Get full page

Search engines often cannot extract content directly from websites that are essentially JavaScript applications. For such search engines, and for SEO applications, it is therefore desirable to obtain the full source code of the rendered page. The following code retrieves the page source after JavaScript manipulation, together with the HTTP headers and status code, and returns it to the requester.

var page = require('webpage').create(),
    system = require('system'),
    address;

address = system.args[1]; // The URL that is submitted to the proxy service

var result = { // Standard response structure, see Response structure section in the documentation
  body: null,
  headers: null,
  statusCode: null,
  statusMessage: null,
  httpVersion: null,
};

page.onResourceReceived = function (response) { // Used to obtain response headers and status code from the loaded page
    if (decodeURIComponent(response.url) === address) { // Verify that this is the actual page and not an internal resource that has finished loading
        result.headers = {};
        for (var i = 0; i < response.headers.length; i++) { // response.headers is an array of { name, value } pairs
            result.headers[response.headers[i].name] = response.headers[i].value; // Copy headers into the final response
        }

        // Copy HTTP status code and text into the final response
        result.statusCode = response.status;
        result.statusMessage = response.statusText;
    }
};

// A single callback handles both success and failure, so a failed load
// is reported as a failure and is never written out as a result.
page.open(address, function (status) { // Page load including all internal assets has completed
    if (status !== 'success') { // Handle failures
        console.log('FAILED loading the address');
        phantom.exit();
        return;
    }

    result.body = page.content; // Copy page HTML source code (as manipulated by any internal JS scripts) into the final response

    // Write out the final response and exit
    console.log(JSON.stringify(result));
    phantom.exit();
});
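The script above writes either a JSON object or the line "FAILED loading the address", so the requester has to tell the two cases apart. The following Node.js sketch shows one way to parse that output. The function name parseCrawlOutput and the shape of its return value are hypothetical, and the sketch assumes the script's output reaches the requester unchanged as the response body.

```javascript
// Node.js sketch: interpret the text printed by the "Get full page" script.
// Field names follow the response structure used in this document.
const FIELDS = ['body', 'headers', 'statusCode', 'statusMessage', 'httpVersion'];

// Hypothetical helper: returns { ok: true, response } on success
// or { ok: false, error } when the crawl failed or the output is malformed.
function parseCrawlOutput(output) {
  const text = String(output).trim();
  if (text === 'FAILED loading the address') {
    return { ok: false, error: text };
  }

  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch (err) {
    return { ok: false, error: 'Output is not valid JSON' };
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return { ok: false, error: 'Output is not a JSON object' };
  }

  // Copy only the known fields; anything missing is reported as null.
  const response = {};
  FIELDS.forEach(function (name) {
    response[name] = Object.prototype.hasOwnProperty.call(parsed, name) ? parsed[name] : null;
  });
  return { ok: true, response: response };
}

// Usage with a made-up output string.
const sample = parseCrawlOutput('{"body":"<html></html>","statusCode":200,"statusMessage":"OK"}');
console.log(sample.ok, sample.response.statusCode);
```

Filling missing fields with null keeps the result aligned with the response structure even while that structure is still a work in progress.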

Retrieve URLs from Google results

The following code navigates to a submitted Google results page (e.g. http://www.google.com/search?q=example) and returns a plain-text list of the page addresses found on that page.

Note: Google may change its page structure at any time, which would stop this example from working as intended. However, it should be relatively easy to adapt the example to an updated page.

var page = require('webpage').create(),
    system = require('system'),
    address;
address = system.args[1];
// Fake user agent
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36';
page.open(address, function (status) {
    if (status !== 'success') {
        console.log('FAILED loading the address');
    } else {
        var urls = page.evaluate(function () { // Execute code in the scope of the page
            var list = document.querySelectorAll('h3.r a');
            var urls = [];
            for (var i = 0; i < list.length; i++) { // Iterate by index so that only the matched elements are visited
                if (list[i].href) { // Skip links without an address
                    urls.push(list[i].href);
                }
            }
            return urls;
        });

        for (var j = 0; j < urls.length; j++) { // Return URLs, one per line
            console.log(urls[j]);
        }
    }
    phantom.exit();
});
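On the requester side, the plain-text output of this script can be turned back into a list. The Node.js sketch below is one possible approach. The function name parseUrlList and its return shape are hypothetical, and removing duplicate URLs is an added convenience, not something the service does.

```javascript
// Node.js sketch: turn the one-URL-per-line output into an array.
// Hypothetical helper: returns { ok: true, urls } or { ok: false, error }.
function parseUrlList(output) {
  const lines = String(output)
    .split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });

  // The script prints this exact line when the page could not be loaded.
  if (lines.length === 1 && lines[0] === 'FAILED loading the address') {
    return { ok: false, error: lines[0] };
  }

  // Drop duplicates while keeping the original order.
  const seen = new Set();
  const urls = [];
  lines.forEach(function (line) {
    if (!seen.has(line)) {
      seen.add(line);
      urls.push(line);
    }
  });
  return { ok: true, urls: urls };
}

// Usage with made-up output.
console.log(parseUrlList('http://a.example/\nhttp://b.example/\nhttp://a.example/\n'));
```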