...
The Smart Proxy service benefits from batching multiple requests into a single script run. For example, downloading 10 URLs through a single Smart Proxy script is significantly more efficient, in both time and data transfer, than doing 10 runs with a single URL each.
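As a rough illustration of batching, the sketch below splits a URL list into groups so that each group can be handed to one script run. The helper name `chunkUrls`, the batch size of 10, and the example.com URLs are assumptions made for this example; they are not part of the Smart Proxy API.

```javascript
// Hypothetical helper: split a list of URLs into batches so that each
// Smart Proxy script run handles several URLs instead of one.
// batchSize is an illustrative tuning knob, not a documented parameter.
function chunkUrls(urls, batchSize) {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }
  var batches = [];
  for (var i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}

// Build 25 placeholder URLs and group them into batches of 10.
var urls = [];
for (var n = 1; n <= 25; n++) {
  urls.push("http://example.com/page/" + n);
}
var batches = chunkUrls(urls, 10);
console.log(batches.length + " script runs instead of " + urls.length);
```

Smaller batches finish sooner and lose less work if a node disappears; larger batches save more per-run overhead.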
Running large crawls
Running large crawls consisting of multiple requests or doing compute-intensive postprocessing may require special configuration of the Smart Proxy requests.
By default, Smart Proxy requests are terminated after 20 seconds, just like Distributed Proxy requests. It is possible to extend the Smart Proxy request timeout by sending the X-Proxy-Timeout-Hard and X-Proxy-Timeout-Soft headers (see Distributed Proxy documentation for details). Both "soft" and "hard" timeouts can be set to a maximum of 600 seconds (10 minutes) for Smart Proxy requests.
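A minimal sketch of preparing these two headers on the client side follows. It assumes the header values are whole numbers of seconds, which should be confirmed against the Distributed Proxy documentation; the helper name `buildTimeoutHeaders` is hypothetical.

```javascript
// Documented upper bound for both timeouts on Smart Proxy requests.
var MAX_TIMEOUT_SECONDS = 600;

// Hypothetical helper: build the timeout headers, clamping each value to
// the range 1..600. Assumption: values are plain integers in seconds.
function buildTimeoutHeaders(softSeconds, hardSeconds) {
  function clamp(value) {
    return Math.max(1, Math.min(MAX_TIMEOUT_SECONDS, Math.floor(value)));
  }
  return {
    "X-Proxy-Timeout-Soft": String(clamp(softSeconds)),
    "X-Proxy-Timeout-Hard": String(clamp(hardSeconds))
  };
}

// A requested hard timeout of 900 seconds is clamped to the 600-second limit.
var headers = buildTimeoutHeaders(120, 900);
console.log(JSON.stringify(headers));
```

The resulting object can be merged into whatever header map your HTTP client uses when it sends the Smart Proxy request.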
The connection is kept open until the Smart Proxy request resolves. This may not be desirable when timeouts are extended, so it may be a good idea to close the connection without waiting for the whole timeout period to expire. To do that, modify your Smart Proxy script to send results directly to your server instead of returning a standard response, as noted in section 2.
Note |
---|
Extending the timeout interval may slightly decrease the success rate of your crawls, as nodes can disappear at any time without advance warning. If you see a high rate of timed-out crawls, try decreasing the size of the crawl executed on a single node. |
Cookie persistence
Cookies are retained across multiple Smart Proxy runs only if the Connection Group feature is used (see Distributed Proxy documentation for details). Cookies follow the same expiration pattern as the Connection Groups themselves.
...
- It may be difficult to use the built-in PhantomJS functionality to render the page as an image and return the result through Smart Proxy. PhantomJS generates an output file, but the proxy requires results to be returned via stdout. One possible solution would be to convert the image file to base64 format and print that to stdout.
- It may be difficult to retrieve structured data from multiple pages; all of the data would have to be transferred through stdout, most often JSON encoded, which may be suboptimal. A possibility would be to write to an output file or global variable with sufficient hierarchy to contain data for multiple pages and then return all of it as the body of the response.
- Scripts currently time out after 20 seconds. For larger-scale crawls, this may be insufficient.
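To illustrate the structured-data point in the list above, here is one possible way to collect per-page data in a single object and print it once as JSON on stdout. The result layout (`pages`, `url`, `data`), the helper names, and the sample values are all assumptions made for this sketch; the code avoids PhantomJS-specific calls so it can be adapted to any script.

```javascript
// Hypothetical accumulator: one entry per crawled page, kept in a single
// object so the whole crawl can be returned as one response body.
function addPageResult(results, url, data) {
  results.pages.push({ url: url, data: data });
  return results;
}

// Serialize the accumulated results as one JSON document.
function serializeResults(results) {
  return JSON.stringify(results);
}

// Sample values stand in for data extracted from real pages.
var results = { pages: [] };
addPageResult(results, "http://example.com/a", { title: "A", links: 3 });
addPageResult(results, "http://example.com/b", { title: "B", links: 7 });

// Printing once at the end keeps stdout a single parseable JSON document.
console.log(serializeResults(results));
```

Keeping every page under one top-level key means the consumer parses stdout exactly once, regardless of how many pages the run covered.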
Example scripts
Sample scripts demonstrating the power of the Smart Proxy service are included below. For an extended function reference to use when customizing these scripts or developing new ones, see both the API documentation and the examples for PhantomJS.
...