The Charity Engine Smart Proxy allows the running of JavaScript applications within fully-featured web browsers on a vast network of volunteer computing devices. Smart Proxy currently uses PhantomJS to run a headless browser, which can be controlled via the PhantomJS API. Additional browsers and APIs may be supported in the future.

Warning | ||
---|---|---|
| ||
The service is in alpha state. Features may be added, changed or removed at any time without advance notice. This document is intended to obtain feedback on the implementation roadmap and should not be considered final documentation. | ||
Info | ||
---|---|---|
Unless otherwise noted, the mechanisms established by the Charity Engine Distributed Proxy for proxy authentication and proxy configuration still apply, as Smart Proxy is an extension of the generic Distributed Proxy. Refer to the Charity Engine Distributed Proxy documentation for details. |
Initiating Smart Proxy crawls
Smart Proxy crawls are initiated by connecting to the Distributed Proxy service and supplying additional HTTP headers.
x-proxy-phantomjs-script-url
This header indicates the URL of the script that Smart Proxy should run. Nodes on the Charity Engine network will download and cache the script from this URL when processing a Smart Proxy request that requires it. The URL of the target page to crawl will then be passed as an argument to this script.
Note |
---|
While any URL is currently accepted, the service will only allow whitelisted URLs or hostnames in the future. |
x-proxy-phantomjs-script-md5
To ensure the integrity of Smart Proxy results, an MD5 hash of the script file defined by the x-proxy-phantomjs-script-url
HTTP header is required with each request. A script that does not match the supplied MD5 hash will not be run.
Response structure
Data can be retrieved from PhantomJS in two different ways: either as plaintext or JSON encoded. Plaintext data is passed as-is, and an HTTP 200 OK
status code is generated automatically when returning a result (see Example 2 in "Example Scripts" below). If it is useful to return a different status code or custom HTTP headers instead, a specifically formatted JSON output object can be used (see Example 1 in "Example Scripts" below):
Code Block | ||
---|---|---|
| ||
{
    body: null,
    headers: {'Content-Type': 'text/plain; charset=UTF-8'},
    statusCode: 200,
    statusMessage: 'OK',
    httpVersion: '1.1'
} |
Performance considerations
The Smart Proxy service benefits from batching multiple requests together into a single script run. For example, downloading 10 URLs through a single Smart Proxy script is significantly more efficient than doing 10 runs with a single URL each, in both time and data transfer.
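The batching pattern can be sketched as a small helper that processes a list of URLs sequentially with a callback-style loader and collects all results before emitting one combined response. The loader below is a stand-in; in a real Smart Proxy script it would wrap page.open():

```javascript
// Process a batch of URLs one at a time with a callback-based loader,
// then hand all results to `done`. In a real PhantomJS crawl script the
// final response would be printed with console.log(JSON.stringify(results))
// followed by phantom.exit().
function crawlBatch(urls, loadOne, done) {
    var results = [];
    function next(index) {
        if (index >= urls.length) {
            done(results);
            return;
        }
        loadOne(urls[index], function(body) {
            results.push({ url: urls[index], body: body });
            next(index + 1);
        });
    }
    next(0);
}

// Demo with a fake loader; a PhantomJS script would fetch real pages
crawlBatch(['http://a.example/', 'http://b.example/'], function(url, cb) {
    cb('content of ' + url);
}, function(results) {
    console.log(JSON.stringify(results));
});
```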
Running complex crawls
Running complex crawls consisting of multiple requests or doing compute-intensive postprocessing may require special configuration of the Smart Proxy requests.
By default, Smart Proxy requests are terminated after 20 seconds just like Distributed Proxy requests. It is possible to extend the Smart Proxy request timeout by sending X-Proxy-Timeout-Hard
and X-Proxy-Timeout-Soft
headers (see Distributed Proxy documentation for details). Both "soft" and "hard" timeouts can be set to a maximum of 600 seconds (10 minutes) for Smart Proxy requests.
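For instance, extending both timeouts toward their maximum for a long-running crawl might look like the request options below. This is a sketch: the endpoint is a placeholder, and the header values are assumed to be given in seconds (see the Distributed Proxy documentation for the authoritative format):

```javascript
// Placeholder request options for a long-running Smart Proxy crawl;
// only the two timeout headers are specific to this section. Values
// are assumed to be seconds, capped at 600 for Smart Proxy requests.
var options = {
    host: 'proxy.example.com', // placeholder proxy endpoint
    port: 8080,
    path: 'http://www.example.com/',
    headers: {
        'X-Proxy-Timeout-Soft': '540',
        'X-Proxy-Timeout-Hard': '600'
    }
};
console.log(options.headers['X-Proxy-Timeout-Hard']); // prints "600"
```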
The connection is kept open until the Smart Proxy request resolves. This may not be desirable when timeouts are extended, so it may be a good idea to close the connection without waiting for the whole timeout period to expire. To do that, modify your Smart Proxy script to send results directly to your server instead of returning a standard response as described in the Response structure section.
Note |
---|
Extending the timeout interval may slightly decrease the success rate of your crawls, as nodes can disappear at any time without advance warning. If you see a high rate of timed-out crawls, try decreasing the size of the crawl executed on a single node. |
Cookie persistence
Cookies are retained across multiple Smart Proxy runs only if the Connection Group feature is used (see Distributed Proxy documentation for details). Cookies follow the same expiration pattern as the Connection Groups themselves.
Cookies are also retained when processing multiple URLs within a single Smart Proxy script execution (it is possible to clear or set custom cookies through the PhantomJS API).
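A minimal sketch of managing cookies from within a crawl script, using the PhantomJS phantom.clearCookies and phantom.addCookie calls, is shown below. The cookie values are illustrative only, and the snippet is guarded so it is inert outside PhantomJS:

```javascript
// Clear retained cookies and set a custom one before crawling.
// Returns false when not running under PhantomJS.
function configureCookies() {
    if (typeof phantom === 'undefined') {
        return false; // not running under PhantomJS
    }
    phantom.clearCookies(); // drop anything retained from earlier URLs
    phantom.addCookie({
        name: 'session',
        value: 'example-session-id', // illustrative value
        domain: 'www.example.com'    // illustrative domain
    });
    return true;
}

console.log(configureCookies());
```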
Known issues
The following limitations are currently expected in the Smart Proxy service:
- It may be difficult to use the built-in PhantomJS functionality to render the page as an image and return the result through Smart Proxy. PhantomJS generates an output file, but the proxy requires results to be returned via stdout. One possible solution would be to convert the image file to base64 format and print that to stdout.
- It may be difficult to retrieve structured data from multiple pages; all of the data would have to be transferred through stdout, most often JSON encoded, which may be suboptimal. A possibility would be to write to an output file or global variable with sufficient hierarchy to contain data for multiple pages and then return all of it as the body of the response.
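The base64 workaround for the first limitation can be sketched as follows. PhantomJS provides page.renderBase64, which returns the rendered page as a base64 string and so avoids the output-file problem entirely; the target URL is a placeholder, and the snippet is guarded so it is inert outside PhantomJS:

```javascript
// Return the rendered page as a base64-encoded PNG string,
// suitable for printing through stdout
function toBase64Png(page) {
    return page.renderBase64('PNG');
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.viewportSize = { width: 1024, height: 768 };
    page.open('http://www.example.com/', function(status) { // placeholder URL
        if (status === 'success') {
            console.log(toBase64Png(page));
        } else {
            console.log('FAILED loading the address');
        }
        phantom.exit();
    });
}
```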
Example scripts
Sample scripts demonstrating the power of the Smart Proxy service are included below. For an extended function reference to use when customizing these scripts or developing new ones, see the PhantomJS API documentation and examples.
Get full content of JavaScript pages
Search engines cannot extract content directly from websites that are essentially JavaScript applications, or from web pages that rely on JavaScript to render content. Therefore, for both search engines and SEO applications, it is necessary to execute the source code of the page in order to receive the full and proper output. The following code retrieves the page content after JavaScript manipulation, including HTTP headers and status code, and returns it to the requester:
Code Block | ||
---|---|---|
| ||
var page = require('webpage').create();
var system = require('system');
var address = system.args[1]; // The URL that is submitted to the proxy service

// Set up a fake user agent
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36';

// Standard response structure, see the Response structure section in the
// documentation
var result = {
    body: null,
    headers: null,
    statusCode: null,
    statusMessage: null,
    httpVersion: null
};

// Obtain response headers and status code from the loaded page
page.onResourceReceived = function(response) {
    // Verify that it is the actual page that has finished loading (and not
    // internal resources)
    if (decodeURIComponent(response.url) == address) {
        // Clone headers into the final response
        result.headers = {};
        for (var i in response.headers) {
            result.headers[response.headers[i].name] = response.headers[i].value;
        }
        // Clone HTTP status code and text into the final response
        result.statusCode = response.status;
        result.statusMessage = response.statusText;
    }
};

page.onLoadFinished = function(status) {
    // Page load, including all internal assets, has completed.
    // Copy page HTML source code (as manipulated by any internal JS scripts)
    // into the final response
    result.body = page.content;

    // Write out the final response and exit
    console.log(JSON.stringify(result));
    phantom.exit();
};

page.open(address, function (status) {
    if (status !== 'success') {
        // Handle failures
        console.log('FAILED loading the address');
        phantom.exit();
    }
}); |
Retrieve URLs from Google results
In many cases, the vast majority of data transferred in response to a page crawl is unnecessary and a waste of network resources. If the format of the results is known or pertinent data can be recognized, Smart Proxy can be used to pre-process results prior to returning them. The following code navigates to a submitted Google result page (e.g. http://www.google.com/search?q=example) and returns a plain-text list of page addresses found on that page:
Code Block | ||
---|---|---|
| ||
var page = require('webpage').create();
var system = require('system');
var address = system.args[1]; // The URL that is submitted to the proxy service

// Set up a fake user agent
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36';

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('FAILED loading the address');
    } else {
        // Execute code in the scope of the page
        var urls = page.evaluate(function() {
            var list = document.querySelectorAll('h3.r a');
            var urls = [];
            for (var i in list) {
                if (list[i].href !== undefined) {
                    urls.push(list[i].href);
                }
            }
            return urls;
        });

        // Return URLs, one per line
        for (var i in urls) {
            console.log(urls[i]);
        }
    }
    phantom.exit();
}); |
Info |
---|
The Google search result page structure could change at any time, causing this example script to stop working as intended. However, it should be easy to adapt the script to an updated format or to other search engines. |