...

Info

Unless otherwise noted, the mechanisms established by the Charity Engine Distributed Proxy for authentication and configuration still apply, as Smart Proxy is an extension of the Distributed Proxy. Refer to the Charity Engine Distributed Proxy Documentation for details.

Initiating Smart Proxy crawls

Smart Proxy crawls are initiated by connecting to the Distributed Proxy service and supplying additional HTTP headers.

x-proxy-puppeteer-script-url

This header indicates the URL of the script that Smart Proxy should run. Nodes on the Charity Engine network will download and cache the script from this URL when processing a Smart Proxy request that requires it. The URL of the target page to crawl will then be passed as an argument to this script.

Note

While any URL is currently accepted, the service may allow only whitelisted URLs or hostnames in the future.

x-proxy-puppeteer-script-md5

To ensure the integrity of Smart Proxy results, an MD5 hash of the script file defined by the x-proxy-puppeteer-script-url HTTP header is required with each request. A script that does not match the supplied MD5 hash will not be run.
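
For illustration, the hash can be computed locally with Node's built-in crypto module and supplied together with the script URL header. The local file name and the script URL below are assumptions for this sketch; how the request is authenticated and routed through the proxy is described in the Distributed Proxy documentation.

Code Block
languagejs
const crypto = require('crypto');
const fs = require('fs');

// Compute the MD5 hash of the exact script file that is hosted at the
// x-proxy-puppeteer-script-url location (the local file name is hypothetical)
const scriptBody = fs.readFileSync('smartproxy-script.js');
const scriptMd5 = crypto.createHash('md5').update(scriptBody).digest('hex');

// Headers to supply with the Smart Proxy request (script URL is illustrative)
const headers = {
  'x-proxy-puppeteer-script-url': 'https://scripts.example.org/smartproxy-script.js',
  'x-proxy-puppeteer-script-md5': scriptMd5
};
console.log(headers);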

Puppeteer script

A script that executes Smart Proxy crawls must define an async function smartproxy() as its entry point. Two parameters - a Puppeteer Page object and the starting URL - are passed to the function, and the function's return value is returned via the proxy.

Code Block
languagejs
async function smartproxy(page, url) {
  await page.goto(url);
  const text = await page.evaluate(() => {
    const el = document.querySelector('h1');
    return el.textContent;
  });
  return text;
}

Testing locally

To test your script locally, you can use a wrapper that approximates the Smart Proxy execution environment. The following example expects the async function smartproxy(){} to be placed in a variable str and the starting URL in a variable url. When run, the example prints the main heading of https://www.example.com/.

Code Block
languagejs
// Starting URL
const url = 'https://www.example.com/';

// The function that gets executed by the Smart Proxy
const str = 'async function smartproxy(page, url) {\n' +
    '  await page.goto(url);\n' +
    '  const text = await page.evaluate(() => {\n' +
    '    const el = document.querySelector(\'h1\');\n' +
    '    return el.textContent;\n' +
    '  });\n' +
    '  return text;\n' +
    '}';

const puppeteer = require('puppeteer-extra');

// Enable stealth plugin
puppeteer.use(require('puppeteer-extra-plugin-stealth')());

// Path to a local Chrome installation (default location on Windows; adjust for your system)
const chromePath = process.env['PROGRAMFILES(X86)'] + '\\Google\\Chrome\\Application\\chrome.exe';

(async () => {
    // Launch the browser in headless mode and set up a page
    const browser = await puppeteer.launch({
        executablePath: chromePath,
        args: ['--no-sandbox'],
        headless: true
    });
    const page = await browser.newPage();

    // Wrap the script string in a helper function, evaluate it, and call it
    eval('async function execSP(page, url) {' + str + '; return smartproxy(page, url)}');
    const res = await execSP(page, url);
    console.log('res: ' + res);
    await browser.close();
})();

Response structure

Any data returned from the smartproxy() function will be sent back through the proxy as a string. If an object is returned, it will be encoded as JSON.

...

Code Block
languagejs
{
  "responses": [
    {
      "url": "https://www.example.com/",
      "statusCode": 200
    },
    {
      "url": "https://www.example.com/non-existant-page",
      "statusCode": 404
    }
  ]
}
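
A response with the structure shown above could, for example, be produced by a script along the following lines; the crawled URLs are illustrative, and any JSON-serializable object may be returned.

Code Block
languagejs
async function smartproxy(page, url) {
  // Visit a couple of pages and record their HTTP status codes (illustrative URLs)
  const targets = [url, url + 'non-existant-page'];
  const responses = [];
  for (const target of targets) {
    const response = await page.goto(target);
    responses.push({ url: target, statusCode: response.status() });
  }
  // Returned objects are encoded as JSON by the Smart Proxy
  return { responses };
}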

Performance considerations

The Smart Proxy service benefits from batching multiple requests into a single script run. For example, downloading 10 URLs through a single Smart Proxy script is significantly more efficient, in both time and data transfer, than performing 10 runs with a single URL each.

We do not impose limits on the data being returned, but some nodes may be on slow connections and may therefore complete fewer requests in the allotted time. Monitor the script's execution time and stop sending additional requests as the time limit approaches, as sketched below.
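
As a sketch, a single script run might iterate over a batch of URLs and track its own elapsed time, stopping early so the results collected so far can still be returned before the timeout. The URL list and the time budget below are illustrative assumptions.

Code Block
languagejs
async function smartproxy(page, url) {
  const TIME_BUDGET_MS = 50 * 1000; // assumed budget; align with your configured timeout
  const started = Date.now();

  // Batch several pages into one run (illustrative URL list)
  const targets = [url + 'page-1', url + 'page-2', url + 'page-3'];
  const results = [];

  for (const target of targets) {
    // Stop early, leaving headroom to return the results collected so far
    if (Date.now() - started > TIME_BUDGET_MS * 0.8) break;
    await page.goto(target);
    results.push({ url: target, title: await page.title() });
  }
  return { results };
}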

Running complex crawls

Complex crawls that consist of multiple requests or perform compute-intensive postprocessing may require special configuration of the Smart Proxy requests.

...

Note

Extending the timeout interval may slightly decrease the success rate of your crawls, as nodes can disappear at any time without advance warning. If you see a high rate of timed-out crawls, try decreasing the size of the crawl executed on a single node.

Cookie persistence

Cookies are not currently persisted between Smart Proxy crawls, but this may change in the future. Please let us know if you need this functionality.