My current code (see below) uses 147MB of virtual memory!
My provider has allocated 100MB by default and the process is killed once run, causing an internal error.
The code is utilising curl multi and must be able to loop with more than 150 iterations whilst still minimizing the virtual memory. The code below is only set at 150 iterations and still causes the internal server error. At 90 iterations the issue does not occur.
How can I adjust my code to lower the resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
if ($utimestamp === null)
$utimestamp = microtime(true);
$timestamp = floor($utimestamp);
$milliseconds = round(($utimestamp - $timestamp) * 1000);
return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}
$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();
for($i=0; $i<150; $i++)
{
$curl_arr[$i] = curl_init();
curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i=0; $i<150; $i++)
{
$results = curl_multi_getcontent ($curl_arr[$i]);
$results = explode("<br>", $results);
echo $results[0];
echo "<br>";
echo $results[1];
echo "<br>";
echo udate('H:i:s:u');
echo "<br><br>";
usleep(100000);
}
?>
As per your last comment..
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;
require("RollingCurl.php");
function request_callback($response, $info, $request) {
list($result0, $result1) = explode("<br>", $response);
echo "{$result0}<br>{$result1}<br>";
//print_r($info);
//print_r($request);
echo "<hr>";
}
$urls = array_fill(0, $fetch_count, $url);
$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url);
$rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
If the intention is domain snatching,
then using one of the established
services is a better option. Your
script implementation is hardly as
important as the actual connection and
latency.
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
@mario: Cheers. I'm competing against
2 other companies for specific
ccTLD's. They are new to the game and
they are snapping up those domains in
slow time (up to 10 seconds after
purge time). I'm just a little slower
at the moment.
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is held in PHP memory, and by your own evidence the memory available is insufficient: you cannot keep 150 responses in memory at once. You must either stream the responses to files instead of memory buffers, or reduce the number of simultaneous queries and process the list of URLs in batches.
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION. There is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
global $fh;
$bytes = fwrite ($fh, $data, strlen($data));
return $bytes;
}
curl_setopt ($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as a problem for the reader to solve.
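One way to get the right file handle into the callback without a global is a closure that captures it. This is only a minimal sketch, assuming one output file per request; `makeFileWriter` and the file name in the usage comment are hypothetical names, not part of the manual's example.

```php
<?php
// Hypothetical helper: returns a write callback bound to one file handle.
// cURL calls it as callback($ch, $data) and expects the number of bytes handled.
function makeFileWriter($fh)
{
    return function ($ch, $data) use ($fh) {
        $written = fwrite($fh, $data);
        // Returning a count different from strlen($data) makes cURL abort the transfer.
        return $written === false ? 0 : $written;
    };
}

// Usage sketch (not executed here because it needs a live URL):
// $fh = fopen("/tmp/result_{$i}.txt", 'w');
// curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, makeFileWriter($fh));
```

If no per-chunk processing is needed, passing the handle through CURLOPT_FILE should achieve the same thing without a callback.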
<?php
echo str_repeat(' ', 1024); //to make flush work
$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; //0.1 second
//$delay = 1000000; //1 second
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
for ($i=0; $i<$fetch_count; $i++) {
$start = microtime(true);
$result = curl_exec($ch);
list($result0, $result1) = explode("<br>", $result);
echo "{$result0}<br>{$result1}<br>";
flush();
$end = microtime(true);
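// $delay is in microseconds, but microtime(true) differences are in seconds
$sleeping = max(0, (int) round($delay - ($end - $start) * 1000000));
echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
usleep($sleeping);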
}
curl_close($ch);
?>
I have about 15 locations in a mysql table with lat and long information.
Using PHP and the Google Maps API, I am able to calculate the distance between two locations.
function GetDrivingDistance($lat1, $lat2, $long1, $long2)
{
$url = "https://maps.googleapis.com/maps/api/distancematrix/json?origins=".$lat1.",".$long1."&destinations=".$lat2.",".$long2."&mode=driving&language=en-US";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXYPORT, 3128);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$response = curl_exec($ch);
curl_close($ch);
$response_a = json_decode($response, true);
$dist = $response_a['rows'][0]['elements'][0]['distance']['text'];
$time = $response_a['rows'][0]['elements'][0]['duration']['text'];
return array('distance' => $dist, 'time' => $time);
}
I want to select one location as fixed, e.g. row 1, given its lat and long:
$query = "SELECT lat, long FROM table WHERE location=1";
$locationStart = $conn->query($query);
I want to calculate the distance to all the other locations in the table (the other rows) and return the outcome sorted by distance.
I tried to calculate each one separately and ended up with very long code that takes too long to fetch via the API, and I am still not able to sort the results this way!
Any hints?
Disclaimer: This is not a working solution, nor have I tested it, it is just a quick example I've done off the top of my head to provide a sort of code sample to go with my comment.
My brain's still not fully warmed up, but I believe the code below should at least act as a guide to the idea I was making in my comment. I'll try to answer any questions you have when I'm free. Hope it helps.
<?php
define('MAXIMUM_REQUEST_STORE', 5); // Store 5 requests in each multi_curl_handle
function getCurlInstance($url) {
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
return $handle;
}
$data = []; // Build up an array of Endpoints you want to hit. I'll let you do that.
// Initialise Variables
$totalRequests = count($data);
$parallelCurlRequests = [];
$handlerID = 0;
// Set up our first handler
$parallelCurlRequests[$handlerID] = curl_multi_init();
// Loop through each of our curl handles
for ($i = 0; $i < $totalRequests; ++$i) {
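// We want to create a new handler/store every 5 requests. -- Goes off the constant MAXIMUM_REQUEST_STORE
if ($i > 0 && $i % MAXIMUM_REQUEST_STORE == 0) {
++$handlerID;
// Each new store needs its own multi handle before handles are added to it
$parallelCurlRequests[$handlerID] = curl_multi_init();
}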
// Create a Curl Handle for the current endpoint
// ... and store the it in an array for later use.
$curl[$i] = getCurlInstance($data[$i]);
// Add the Curl Handle to the Multi-Curl-Handle
curl_multi_add_handle($parallelCurlRequests[$handlerID], $curl[$i]);
}
// Run each Curl-Multi-Handler in turn
foreach ($parallelCurlRequests as $request) {
$running = null;
do {
curl_multi_exec($request, $running);
} while ($running);
}
$distanceArray = [];
// You can now pull out the data from the request.
foreach ($curl as $response) {
$content = curl_multi_getcontent($response);
if (!empty($content)) {
// Build up some form of array.
$decoded = json_decode($content);
$location = $decoded->someObject[0]->someRow->location;
$distance = $decoded->someObject[0]->someRow->distance;
$distanceArray[$location] = $distance;
}
}
natsort($distanceArray);
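Sorting on the `text` field compares strings such as "12.3 km". The Distance Matrix response also carries a numeric `value` (meters) for each element, and a single request can take several destinations separated by `|`, so one call may be enough for all 15 locations. The following is a rough sketch under those assumptions: the sample JSON is made up for illustration, and `sortByDistance` and the location names are hypothetical.

```php
<?php
// Hypothetical helper: maps each destination name to its driving distance in
// meters and sorts ascending. $names must be in the same order as the
// destinations passed in the request.
function sortByDistance(array $response, array $names)
{
    $distances = [];
    foreach ($response['rows'][0]['elements'] as $index => $element) {
        // Skip destinations the API could not route to.
        if ($element['status'] !== 'OK') {
            continue;
        }
        $distances[$names[$index]] = $element['distance']['value'];
    }
    asort($distances, SORT_NUMERIC); // keeps the name => meters association
    return $distances;
}

// Made-up sample shaped like a Distance Matrix reply for three destinations.
$json = '{"rows":[{"elements":[
    {"status":"OK","distance":{"text":"12.3 km","value":12300},"duration":{"text":"15 mins","value":900}},
    {"status":"ZERO_RESULTS"},
    {"status":"OK","distance":{"text":"4.1 km","value":4100},"duration":{"text":"7 mins","value":420}}
]}]}';

$sorted = sortByDistance(json_decode($json, true), ['Depot', 'Island', 'Office']);
print_r($sorted);
```

Keying the result by name keeps each distance attached to its location after the sort, which is what the question asked for.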
I'm trying to figure out why this PHP code takes so long to output its results.
For example, this is my apitest.php; here is the PHP code:
<?php
function getRankedMatchHistory($summonerId,$serverName,$apiKey){
$k;
$d;
$a;
$timeElapsed;
$gameType;
$championName;
$result;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
curl_close($ch);
$matchHistory = json_decode($response,true); // Is the Whole JSON Response saved at $matchHistory Now locally as a variable or is it requested everytime $matchHistory is invoked ?
for ($i = 9; $i >= 0; $i--){
$farm1 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["minionsKilled"];
$farm2 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralMinionsKilled"];
$farm3 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledTeamJungle"];
$farm4 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledEnemyJungle"];
$elapsedTime = $matchHistory["matches"][$i]["matchDuration"];
settype($elapsedTime, "integer");
$elapsedTime = floor($elapsedTime / 60);
$k = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["kills"];
$d = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["deaths"];
$a = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["assists"];
$championIdTmp = $matchHistory["matches"][$i]["participants"]["0"]["championId"];
$championName = call_user_func('getChampionName', $championIdTmp); // calls another function to resolve championId into championName
$gameType = preg_replace('/[^A-Za-z0-9\-]/', ' ', $matchHistory["matches"][$i]["queueType"]);
$result = (($matchHistory["matches"][$i]["participants"]["0"]["stats"]["winner"]) == "true") ? "Victory" : "Defeat";
echo "<tr>"."<td>".$gameType."</td>"."<td>".$result."</td>"."<td>".$championName."</td>"."<td>".$k."/".$d."/".$a."</td>"."<td>".($farm1+$farm2+$farm3+$farm4)." in ". $elapsedTime. " minutes". "</td>"."</tr>";
}
}
?>
What I'd like to know is how to make the page output faster, as it takes around 10~15 seconds to output the results, which makes the browser think the website is dead, as with a 500 Internal Server Error or something like it.
Here is a simple demonstration of how long it can take : Here
As you might have noticed , yes I'm using Riot API which is sending the response as a JSON encoded type.
Here is an example of the response that this function handles : Here
What I thought of was creating a temporary file called temp.php at the start of the cURL function, saving the whole response there, and then reading the variables from it to speed up the process; after reading the variables, the script would delete temp.php to free up disk space.
But I have no idea how to do that in PHP only.
By the way, I just started using PHP today, so I'd prefer some explanation with the answers if possible.
Thanks for your precious time.
Try benchmarking like this:
// start the timer
$start_curl = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// debugging
curl_setopt($ch, CURLOPT_VERBOSE, true);
// start another timer
$start = microtime(true);
$response = curl_exec($ch);
echo 'curl_exec() in: '.(microtime(true) - $start).' seconds<br><br>';
// start another timer
$start = microtime(true);
curl_close($ch);
echo 'curl_close() in: '.(microtime(true) - $start).' seconds<br><br>';
// how long did the entire CURL take?
echo 'CURLed in: '.(microtime(true) - $start_curl).' seconds<br><br>';
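If the timings show that curl_exec() dominates, the temp-file idea from the question amounts to caching the response between page loads rather than deleting it straight away. A minimal sketch of that, not a tested solution: `cachedFetch`, the cache key and the 60-second lifetime are assumptions, and the callback stands in for the real cURL call.

```php
<?php
// Hypothetical helper: returns the cached body if it is younger than $ttl
// seconds, otherwise calls $fetch() and stores its result in the temp dir.
function cachedFetch($key, $ttl, callable $fetch)
{
    $file = sys_get_temp_dir() . '/cache_' . md5($key) . '.json';
    clearstatcache(true, $file); // avoid stale is_file()/filemtime() results
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }
    $body = $fetch();
    if ($body !== false) {
        file_put_contents($file, $body, LOCK_EX);
    }
    return $body;
}

// Demo with a stand-in fetcher instead of a real API call.
$calls = 0;
$fetch = function () use (&$calls) {
    $calls++;
    return '{"matches":[]}';
};
$key = 'demo-' . uniqid();
$first = cachedFetch($key, 60, $fetch);
$second = cachedFetch($key, 60, $fetch); // served from the file, $fetch is not called again
```

In the real function the callback would wrap the curl_exec() call, and the key could be built from the summoner ID and server name.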
I am new here to get answers for my issues, hoping for your kind advice. Thanks in advance.
I have written an HTTP API to send SMS using cURL. Everything is working fine, except that I cannot work out how to loop and post to cURL in batches of phone numbers. For example: a user uploads 50000 phone numbers to my site in an Excel sheet, I fetch all the mobile numbers from the database, and then post them through cURL.
The SMS gateway I send the request to accepts a maximum of 10000 numbers at once via its HTTP API.
So I want to split the 50000 fetched numbers into groups of 10000 and send one cURL request per group in a loop.
Here is my code
//have taken care of sql injection on live site
$resultRestore = mysql_query("SELECT * FROM temptable WHERE userid = '".$this->user_id."' AND uploadid='".$uploadid."' ");
$rowRestoreCount = mysql_num_rows($resultRestore);
#mysql_data_seek($resultRestore, 0);
$phone_list = "";
while($rowRestore = mysql_fetch_array($resultRestore))
{
$phone_list .= $rowRestore['recphone'].",";
}
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
Now, from $phone_list, I need to loop over every 10000 numbers. How can I achieve this?
It's been 2 days; I have tried several things without getting the result.
Kindly help...
NOTE: I'm going to start off with the obligatory warning about using mysql functions. Please consider switching to mysqli or PDO.
There are a number of different ways you could do this. Personally, I would reconfigure your script to fetch only 10,000 numbers at a time from the database and put that inside a loop. It might look something like this (note that for simplicity I am not updating your mysql* calls to mysqli*). Keep in mind that I haven't run this code, since I can't actually test most of it.
// defines where the query starts from
$offset= 0;
// defines how many to get with the query
$limit = 10000;
// set up base SQL to use over and over updating offset
$baseSql = "SELECT * FROM temptable WHERE userid = '".$this->user_id."' AND uploadid='".$uploadid."' LIMIT ";
// get first set of results
$resultRestore = mysql_query($baseSql . $offset . ', '. $limit);
// now loop
while (mysql_num_rows($resultRestore) > 0)
{
$rowRestoreCount = mysql_num_rows($resultRestore);
$phone_list = "";
while($rowRestore = mysql_fetch_array($resultRestore))
{
$phone_list .= $rowRestore['recphone'].",";
}
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
// now update for the while loop
// increment by value of limit
$offset += $limit;
// now re-query for the next 10000
// this will continue until there are no records left to retrieve
// this should work even if there are 50,123 records (the last loop will process 123 records)
$resultRestore = mysql_query($baseSql . $offset . ', '. $limit);
}
You could also achieve this without using offset and limit in your sql query. This might be a simpler approach for you:
// define our maximum chunk here
$max = 10000;
$resultRestore = mysql_query("SELECT * FROM temptable WHERE userid = '".$this->user_id."' AND uploadid='".$uploadid."' ");
$rowRestoreCount = mysql_num_rows($resultRestore);
#mysql_data_seek($resultRestore, 0);
$phone_list = "";
// hold the current number of processed phone numbers
$count = 0;
while($rowRestore = mysql_fetch_array($resultRestore))
{
$phone_list .= $rowRestore['recphone'].",";
$count++;
// when count hits our max, do the send
if ($count >= $max)
{
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
// now reset count back to zero
$count = 0;
// and reset phone_list
$phone_list = '';
}
}
// if we don't have # of phones evenly divisible by $max then handle any leftovers
if ($count > 0)
{
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
}
I notice that you are retrieving the information in $curl_scraped_page. In either of these scenarios above, you will need to account for the new loop if you're doing any processing on $curl_scraped_page.
Again, please consider switching to mysqli or PDO, and keep in mind that there are likely more efficient and flexible ways to achieve this than what you are doing here. For example, you might want to log successful sends in case your script breaks and incorporate that into your script (for example, by selecting from the database only those numbers that have not yet received this text). This would allow you to re-run your script but only send to those who did NOT yet receive the text, rather than hitting everyone again (or maybe your SMS gateway handles that for you?)
EDIT
Another approach would be to load all the retrieved numbers into a single array, then chunk the array into pieces and process each chunk.
$numbers = array();
while ($rowRestore = mysql_fetch_array($resultRestore))
{
$numbers[] = $rowRestore['recphone'];
}
// split into chunks of 10,000
$chunks = array_chunk($numbers, 10000);
// loop and process the chunks
foreach ($chunks AS $chunk)
{
// $chunk will be an array, so implode it with comma to get the phone list
$phone_list = implode(',', $chunk);
// note that there is no longer a need to substr -1 the $phone_list because it won't have a trailing comma using implode()
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode($phone_list)."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
}
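The chunking step on its own can be tried without a database or gateway. Below is a small sketch with made-up numbers; `buildPhoneLists` is a hypothetical helper name, and the limit of 10 stands in for the gateway's 10,000.

```php
<?php
// Hypothetical helper: turns a flat list of numbers into comma-separated
// strings of at most $max numbers each, one string per gateway request.
function buildPhoneLists(array $numbers, $max)
{
    $lists = [];
    foreach (array_chunk($numbers, $max) as $chunk) {
        $lists[] = implode(',', $chunk);
    }
    return $lists;
}

// Made-up data: 25 numbers with a limit of 10 per request.
$numbers = [];
for ($i = 0; $i < 25; $i++) {
    $numbers[] = '9000000' . str_pad((string) $i, 3, '0', STR_PAD_LEFT);
}
$lists = buildPhoneLists($numbers, 10);
echo count($lists), " requests\n"; // 3 requests: 10, 10 and 5 numbers
```

Each string in `$lists` would then go through urlencode() into the To= parameter of one request.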
I have a function that calls 3 different APIs using cURL multiple times. Each API's result is passed to the next API called in nested loops, so cURL is currently opened and closed over 500 times.
Should I leave cURL open for the entire function or is it OK to open and close it so many times in one function?
There's a performance increase to reusing the same handle. See: Reusing the same curl handle. Big performance increase?
If you don't need the requests to be synchronous, consider using the curl_multi_* functions (e.g. curl_multi_init, curl_multi_exec, etc.) which also provide a big performance boost.
UPDATE:
I tried benching curl with using a new handle for each request and using the same handle with the following code:
ob_start(); //Trying to avoid setting as many curl options as possible
$start_time = microtime(true);
for ($i = 0; $i < 100; ++$i) {
$rand = rand();
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/?rand=" . $rand);
curl_exec($ch);
curl_close($ch);
}
$end_time = microtime(true);
ob_end_clean();
echo 'Curl without handle reuse: ' . ($end_time - $start_time) . '<br>';
ob_start(); //Trying to avoid setting as many curl options as possible
$start_time = microtime(true);
$ch = curl_init();
for ($i = 0; $i < 100; ++$i) {
$rand = rand();
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/?rand=" . $rand);
curl_exec($ch);
}
curl_close($ch);
$end_time = microtime(true);
ob_end_clean();
echo 'Curl with handle reuse: ' . ($end_time - $start_time) . '<br>';
and got the following results:
Curl without handle reuse: 8.5690529346466
Curl with handle reuse: 5.3703031539917
So reusing the same handle actually provides a substantial performance increase when connecting to the same server multiple times. I tried connecting to different servers:
$url_arr = array(
'http://www.google.com/',
'http://www.bing.com/',
'http://www.yahoo.com/',
'http://www.slashdot.org/',
'http://www.stackoverflow.com/',
'http://github.com/',
'http://www.harvard.edu/',
'http://www.gamefaqs.com/',
'http://www.mangaupdates.com/',
'http://www.cnn.com/'
);
ob_start(); //Trying to avoid setting as many curl options as possible
$start_time = microtime(true);
foreach ($url_arr as $url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_exec($ch);
curl_close($ch);
}
$end_time = microtime(true);
ob_end_clean();
echo 'Curl without handle reuse: ' . ($end_time - $start_time) . '<br>';
ob_start(); //Trying to avoid setting as many curl options as possible
$start_time = microtime(true);
$ch = curl_init();
foreach ($url_arr as $url) {
curl_setopt($ch, CURLOPT_URL, $url);
curl_exec($ch);
}
curl_close($ch);
$end_time = microtime(true);
ob_end_clean();
echo 'Curl with handle reuse: ' . ($end_time - $start_time) . '<br>';
And got the following result:
Curl without handle reuse: 3.7672290802002
Curl with handle reuse: 3.0146431922913
Still quite a substantial performance increase.
I'm writing a page scraper for a site that is a little slow, but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to execute and parse all ~150 pages I scrape so far. It will be a crontab'd event, and a temporary table is used while it's being generated, then copied to a "live" table upon completion so it's a seamless transition from a client stand-point, however can you see a way to speed up my code, possibly?
//mysql connection stuff here
function dnl2array($domnodelist) {
$return = array();
$nb = $domnodelist->length;
for ($i = 0; $i < $nb; ++$i) {
$return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
$return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
}
return $return;
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
// Load the html into our object
$doc->loadHTML($html);
$xPath = new DOMXPath( $doc );
// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[@id="food-plan-contents"]//td[@class="product-name"]');
$test['itams'] = dnl2array($results);
foreach($test['itams']['html'] as $get_url){
$prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
foreach($prepared_url as $url){
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xPath = new DOMXPath($doc);
$results = $xPath->query('//h3[@class="product-name"]');
$arr[$i]['name'] = dnl2array($results);
$results = $xPath->query('//div[@class="product-specs"]');
$arr[$i]['desc'] = dnl2array($results);
$results = $xPath->query('//p[@class="product-image-zoom"]');
$arr[$i]['img'] = dnl2array($results);
$results = $xPath->query('//div[@class="groupedTable"]/table/tbody/tr//span[@class="price"]');
$arr[$i]['price'] = dnl2array($results);
$arr[$i]['url'] = $url;
if($i % 5 == 1){
lazy_loader($arr); //lazy loader adds data to sql database
unset($arr); // keep memory footprint light (server is wimpy -- but free!)
}
$i++;
usleep(50000); // Don't be bandwith pig
}
// Get any stragglers
if(count($arr) > 0){
lazy_loader($arr);
$time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
$tab_name = "sr_data_items_" . date("m_d_y", $time);
// and copy table now that script is finished
mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
mysql_query("TRUNCATE TABLE `sr_data_items_skel`");
}
It sounds like you're mostly dealing with slow server response speeds. At even 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way you could speed this up is by using curl_multi_* to run multiple connections at the same time.
So replace the start of the foreach loop (up through the if !html check) with this:
reset($prepared_url); // set internal pointer to first element
$running = array(); // map from curl reference to url
$finished = false;
$mh = curl_multi_init();
$i = 0;
while(!$finished || !empty($running)){
// add urls to $mh up to a maximum
while (count($running) < 15 && !$finished)
{
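$url = current($prepared_url); // calling next() here would skip the first URL after reset()
if ($url === FALSE)
{
$finished = true;
break;
}
next($prepared_url); // advance the pointer for the following iteration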
$c = setupcurl($url);
curl_multi_add_handle($mh, $c);
$running[$c] = $url;
}
curl_multi_exec($mh, $active);
$info = curl_multi_info_read($mh);
if (false === $info) continue; // nothing to report right now
$c = $info['handle'];
$url = $running[$c];
unset($running[$c]);
$result = $info['result'];
if ($result != CURLE_OK)
{
echo "Curl Error: " . $result . "\n";
continue;
}
$html = curl_multi_getcontent($c);
$download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);
curl_multi_remove_handle($mh, $c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>\n" .
"<p>cURL error: " . curl_error($c) . "</p>\n";
}
curl_close($c);
<<rest of foreach loop here>>
That will keep 15 downloads going at the same time, and process them as they finish.
Anyway – so for the history: please see my comments up top.
As for caching: I'm using dnsmasq to cache.
My setup uses a recipe for Chef, which I run through chef-solo. The templates contain my configuration and the attributes contain my settings. It's pretty straightforward.
So the beauty is that this allows me to put this server into DHCP (we use Amazon EC2 and this service distributes all IPs via DHCP to the virtual instances) and then I don't have to make any changes to my application to use them.
I have another recipe to edit /etc/dhclient.conf.
Does this help? Let me know where to elaborate more.
EDIT
Just for clarification: this is not a Ruby solution; I'm just using Chef for configuration management (it makes sure that services are always set up the same way, etc.). Dnsmasq itself acts as a local DNS server and caches lookups, so repeated resolutions are faster.
The manual way is as follows:
On Ubuntu:
apt-get install dnsmasq
Then edit the /etc/dnsmasq.conf:
listen-address=127.0.0.1
cache-size=5000
domain-needed
bogus-priv
log-queries
Restart service and verify it's running (ps aux|grep dnsmasq).
Then put it into your /etc/resolv.conf:
nameserver 127.0.0.1
Test:
dig @127.0.0.1 stackoverflow.com
Execute twice, check time it took to resolve. Second one should be faster.
Enjoy! ;)
The first thing to do is to measure how much time is spent downloading the file from the server. Use function microtime(true) to get a timestamp both before and after the call
file_get_contents($url);
and subtract the values. After you find out that the real bottleneck is inside your code and not on the side of network or remote server, only then you can start thinking about some optimizations.
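The measurement described above can be wrapped in a small helper so the download and the parsing can be timed separately. This is only a sketch: `timeCall` is a hypothetical name, and the usleep() workload stands in for the real file_get_contents($url) call.

```php
<?php
// Hypothetical helper: runs $fn and returns [result, elapsed seconds].
function timeCall(callable $fn)
{
    $start = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

// Stand-in workload; in the scraper this would wrap file_get_contents($url)
// for the download and, in a second call, the DOM/XPath parsing step.
list($value, $seconds) = timeCall(function () {
    usleep(20000); // pretend to wait 20 ms on the network
    return 'done';
});
printf("took %.3f seconds\n", $seconds);
```

Comparing the two numbers per page shows whether the network or the parsing is the part worth optimizing.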
When you say that 150 pages takes 5 minutes to load & parse, that's 2 seconds per page, and my wild guess is that most of that time is spent to download the page from the server.
You should consider using cURL to download the pages instead of file_get_contents() or DOMDocument::loadHTMLFile(), because it is usually faster.
See this question:
https://stackoverflow.com/questions/555523/file-get-contents-vs-curl-what-has-better-performance
You need to benchmark. DNS is not an issue: if you're scraping 150 pages, the DNS result will almost certainly stay cached on your resolver for the 4 minutes you need to fetch the other 149 pages.
Try timing all the page transfers with wget/curl; you may be surprised to find that they are not as fast as you think.
Try requesting in parallel: with 4 parallel requests, the total time should drop to roughly 1 minute.
If you actually find that XPath is the problem, use preg_split() or even an awk script with popen() to get your values.