
The Charity Engine Smart Proxy allows JavaScript applications to run inside fully-featured web browsers on a vast network of volunteer computing devices. Smart Proxy currently runs scripts under PhantomJS, a headless browser that is controlled via the PhantomJS API. Additional browsers and APIs may be supported in the future.

The Smart Proxy service is under rapid development

  • Features of Smart Proxy may be added, changed, or removed at any time and without advance notice. The service is in an alpha state.
  • This document is intended to support testing on the path to public release of the service and should not be considered final documentation.

Unless otherwise noted, the mechanisms established by the Charity Engine Distributed Proxy for authentication and configuration still apply, as Smart Proxy is an extension of the Distributed Proxy. Refer to the Charity Engine Distributed Proxy documentation for details.

Initiating Smart Proxy crawls

Smart Proxy crawls are initiated by connecting to the Distributed Proxy service and supplying additional HTTP headers.

x-proxy-phantomjs-script-url

This header indicates the URL of the script that Smart Proxy should run. Nodes on the Charity Engine network will download and cache the script from this URL when processing a Smart Proxy request that requires it. The URL of the target page to crawl will then be passed as an argument to this script.

While any URL is currently accepted, the service will only allow whitelisted URLs or hostnames in the future.

x-proxy-phantomjs-script-md5

To ensure the integrity of Smart Proxy results, an MD5 hash of the script file defined by the x-proxy-phantomjs-script-url HTTP header is required with each request. A script that does not match the supplied MD5 hash will not be run.
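
For reference, the following Node.js sketch computes the script hash and initiates a crawl by sending both headers through the proxy. The proxy host, port, and credentials are placeholders, as are the script and target URLs; use the actual endpoint and authentication mechanism described in the Charity Engine Distributed Proxy documentation.

var http = require('http');
var crypto = require('crypto');
var fs = require('fs');

// MD5 hash of the exact script file that is hosted at the script URL
var scriptMd5 = crypto.createHash('md5')
  .update(fs.readFileSync('crawl-script.js'))
  .digest('hex');

var options = {
  host: 'proxy.example.com', // placeholder; see the Distributed Proxy docs
  port: 8080,                // placeholder
  method: 'GET',
  // An HTTP proxy receives the absolute URL of the target page as its path
  path: 'http://www.example.com/page-to-crawl',
  headers: {
    // Placeholder credentials; the Distributed Proxy documentation defines
    // the actual authentication mechanism
    'Proxy-Authorization': 'Basic ' +
      Buffer.from('username:password').toString('base64'),
    'x-proxy-phantomjs-script-url': 'http://www.example.com/crawl-script.js',
    'x-proxy-phantomjs-script-md5': scriptMd5
  }
};

http.request(options, function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() { console.log(res.statusCode + '\n' + body); });
}).end();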

Response structure

Data can be retrieved from PhantomJS either as plain text or as JSON. Plain-text data is passed through as-is, and an HTTP 200 OK status code is generated automatically when a result is returned (see the second example script below). If it is useful to return a different status code or custom HTTP headers, a specially formatted JSON object can be used instead (see the first example script below):

{
  body: null,
  headers: null,
  statusCode: null,
  statusMessage: null,
  httpVersion: null
}
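
As a minimal sketch, a script could use this structure to pass along a redirect it observed; the target URL here is hypothetical, and each field maps onto the corresponding part of the HTTP response:

// Return a custom status code and header instead of plain text output
console.log(JSON.stringify({
  body: 'Moved',
  headers: { 'Location': 'http://www.example.com/new-location' },
  statusCode: 301,
  statusMessage: 'Moved Permanently',
  httpVersion: '1.1'
}));
phantom.exit();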

Known issues

The following limitations are currently expected in the Smart Proxy service:

  • It may be difficult to use the built-in PhantomJS functionality to render the page as an image and return the result through Smart Proxy. PhantomJS renders to an output file, but the proxy requires results to be returned via stdout. One possible solution is to convert the image to base64 and print that to stdout (see the first sketch below this list).
  • It may be difficult to retrieve structured data from multiple pages: all of the data has to be transferred through stdout, most often JSON-encoded, which may be suboptimal. One possibility is to write to an output file or to a global variable with sufficient hierarchy to hold data for multiple pages, and then return all of it as the body of the response (see the second sketch below this list).
  • Scripts currently time out after 20 seconds, which may be insufficient for larger-scale crawls.
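
The first sketch below addresses the image output issue. Instead of rendering to a file and converting it by hand, it uses PhantomJS's built-in page.renderBase64(), which returns the rendered page as a base64 string directly, so no intermediate file is needed:

var page = require('webpage').create();
var system = require('system');

page.open(system.args[1], function(status) {
  if (status !== 'success') {
    console.log('FAILED loading the address');
  } else {
    // Print the rendered page to stdout as a base64-encoded PNG
    console.log(page.renderBase64('PNG'));
  }
  phantom.exit();
});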
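The second sketch addresses the multi-page issue by accumulating data in a single global object and returning it all at once. The URL list and the per-page data (the page title) are hypothetical; a real script might derive the list from the submitted address:

var page = require('webpage').create();

// Hypothetical list of pages to crawl
var urls = ['http://www.example.com/1', 'http://www.example.com/2'];

// Global object with one entry per crawled page
var results = {};

function crawlNext() {
  if (urls.length === 0) {
    // All pages processed; return the accumulated data as one JSON body
    console.log(JSON.stringify(results));
    phantom.exit();
    return;
  }
  var url = urls.shift();
  page.open(url, function(status) {
    // Store a small piece of data per page (the title, in this sketch)
    results[url] = (status === 'success') ? page.title : null;
    crawlNext();
  });
}

crawlNext();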

Example scripts

Sample scripts demonstrating the capabilities of the Smart Proxy service are included below. For an extended function reference to use when customizing these scripts or developing new ones, see both the API documentation and the examples for PhantomJS.

Get full content of JavaScript pages

Search engines cannot extract content directly from JavaScript applications or from web pages that rely on JavaScript to render their content. For both search engines and SEO applications, it is therefore necessary to download and execute the page source in order to obtain the full, rendered output. The following script retrieves the page content after JavaScript manipulation, including HTTP headers and status code, and returns it to the requester:

var page = require('webpage').create();
var system = require('system');
// The URL that is submitted to the proxy service
var address = system.args[1];

// Standard response structure; see the Response structure section above
var result = {
  body: null,
  headers: null,
  statusCode: null,
  statusMessage: null,
  httpVersion: null
};

// Obtain response headers and status code from the loaded page
page.onResourceReceived = function(response) {
  // Verify that it is the actual page that has finished loading (and not
  // internal resources)
  if (decodeURIComponent(response.url) == address) {
    result.headers = {};
    for (var i in response.headers) {
      // Clone headers into the final response
      result.headers[response.headers[i].name] = response.headers[i].value;
    }

    // Clone HTTP status code and text into the final response
    result.statusCode = response.status;
    result.statusMessage = response.statusText;
  }
};

page.onLoadFinished = function(status) {
  if (status !== 'success') {
    // Let the page.open() callback below handle failures
    return;
  }

  // Page load has completed, including all internal assets.
  // Copy the page HTML source (as manipulated by any internal JS scripts)
  // into the final response
  result.body = page.content;

  // Write out the final response and exit
  console.log(JSON.stringify(result));
  phantom.exit();
};

page.open(address, function (status) {
  if (status !== 'success') {
    // Handle failures
    console.log('FAILED loading the address');
    phantom.exit();
  }
});

Retrieve URLs from Google results

In many cases, the vast majority of data transferred in response to a page crawl is unnecessary and a waste of network resources. If the format of the results is known, or pertinent data can be recognized, Smart Proxy can be used to pre-process results before returning them. The following script navigates to a submitted Google results page (e.g. http://www.google.com/search?q=example) and returns a plain-text list of the page addresses found on that page:

var page = require('webpage').create();
var system = require('system');
var address;

// The URL that is submitted to the proxy service
address = system.args[1]; 

// Set up a fake user agent
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36';

page.open(address, function (status) {
  if (status !== 'success') {
    console.log('FAILED loading the address');
  }
  else {
    // Execute code in the scope of the page
    var urls = page.evaluate(function() {
      var list = document.querySelectorAll('h3.r a');
      var urls = [];
      for (var i = 0; i < list.length; i++) {
        if (list[i].href !== undefined) {
          urls.push(list[i].href);
        }
      }
      return urls;
    });
  
    // Return URLs, one per line
    for (var i = 0; i < urls.length; i++) {
      console.log(urls[i]);
    }
  }
  phantom.exit();
});

Note: The structure of the Google search results page can change at any time, which may cause this example script to stop working as intended. However, it should be easy to adapt the script to an updated format or to other search engines.
