I am working on a page for a library that will display the latest books, movies, and other items the library has added to its collection.
A friend and I (both of us new to PHP) have been trying to use cURL to accomplish this. We have gotten the code to grab the sections we want and format them the way they should look on the results page.
The problem we are having is that the URL we feed into cURL is generated automatically somehow, expires every few hours, and breaks the page.
Here is the PHP we are using:
<?php
//function storeLink($url,$gathered_from) {
// $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
// mysql_query($query) or die('Error, insert query failed');
//}
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://catalog.yourppl.org/limitedsearch.asp");
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$refreshlink= curl_exec($ch);
$endlink = strpos($refreshlink,'Hot New Items')-2;//end
$startlink = $endlink -249;
$startlink = strpos($refreshlink,'http',$startlink);//start
$endlink = $endlink - $startlink;
$linkurl = substr("$refreshlink",$startlink, $endlink);
//echo $linkurl;
//this is the link that expires
$linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-168&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_URL, $linkurl);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
$content = $html;
$PHolder = 0;
$x = 0;
$y = 0;
$max = strlen($content);
$isbn = array(300=>0);
$stitle = array(300=>0);
$sbookcover = array(300=>0);
while ($x < 200)
{
    $x++;
    $start = strpos($content, 'isbn', $PHolder + 5); // beginning of the next isbn marker
    $start2 = strpos($content, 'Branch=,0,"', $start + 5); // beginning of the title link area
    $start2 = $start2 - 400;
    if ($start2 < 0) break;
    $start2 = strpos($content, '<a href', $start2);
    if ($start2 == "") break;
    $start2 = $start2 - 12;
    $end2 = strpos($content, '</a>', $start);
    $end = strpos($content, '"', $start);
    $offset = 13;
    $offset2 = $end2 - $start2;
    if (substr("$content", $start + 5, $offset) != $isbn)
    {
        if (array_search(substr("$content", $start + 5, $offset), $isbn) == 0)
        {
            $y++;
            $isbn[$y] = substr("$content", $start + 5, $offset);
            $sbookcover[$y] = "
            <img border=\"0\" width=\"170\" alt=\"Book Jacket\" src=\"http://ls2content.tlcdelivers.com/content.html?customerid=7977&requesttype=bookjacket-lg&isbn=$isbn[$y]&isbn=$isbn[$y]\">
            ";
            $stitle[$y] = substr("$content", $start2 + 12, $offset2);
            $bookcover = $sbookcover[$y];
            $title = $stitle[$y] . "</a>";
            $stitle[$y] = str_replace("<a href=\"", "<a href=\"http://catalog.yourppl.org", $title);
            $stitle[$y] = str_replace("\">", "\" rel=\"shadowbox\">", $stitle[$y]);
            $booklinkend = strpos($stitle[$y], "\">");
            $booklink = substr($stitle[$y], 0, $booklinkend + 2);
            $sbookcover[$y] = "$booklink" . $sbookcover[$y] . "</a>";
        }
    }
    $PHolder = $start;
}
echo"
<table class=\"twocolorformat\" width=\"95%\">
";
$xx = 1;
$xy = 0; // row counter
while ($xy <= 6)
{
$xy++;
echo "
<tr>
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</div></td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>
<tr>
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>
";//this is the table row and table data definition. covers and titles are fed to table here.
$xx = $xx +3;
if ($sbookcover[$xx] == "")break;
}
echo"
</table>
";//close your table here
?>
The page that has the link is here:
http://www.catalog.portsmouth.lib.oh.us/limitedsearch.asp
We are looking to grab the books and cover images from 'Hot New Items' on that page and work on the rest after we get it working.
If you click the Hot New Items link, the initial URL is:
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true
but once the page loads, it changes to:
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-178&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true
Is there anything we can do to get around the expiring links? I can provide more code and explanation if needed.
Thanks very much to anyone who can offer help,
Terry
Is there anything we can do to get around the expiring links?
You're interfacing with a system that wasn't designed to be (ab)used the way you're using it. Like many search systems, it looks like they build the results and store them somewhere, and, also like many search systems, those results become invalid after a period of time.
You're going to have to design your code under the assumption that the search results are going to poof into the ether very quickly.
It looks like there's a parameter in the URL (ItemsPerPage) that dictates how many results there are per page. Try changing it to a higher number -- a much higher number. They don't seem to have placed a bounds check on it at the code level; I was able to enter 1000 without it complaining, though it only returned 341 links.
Keep in mind that this is very likely going to cause some pretty noticeable load on their machine, and you should be careful and gentle when making your requests. You don't want to raise attention to yourself by making it look like you're attacking their service.
The page returned from the original link generates the results and then sends you a page whose JavaScript inserts the values into a URL and redirects you to it; that URL fetches the stored results page. The server identifies the results page by a LimitsId (you can see it in the URL of the results page). They apparently use this number to control how long a page lasts, and each request generates a new LimitsId, since not every ID works for the results page. The point of all this is: you can use cURL to fetch the first page (the link off the original page, which generates the results and stores them on the server), search for the text 'LimitsId=-' in the response (for some reason they all have a dash in front of them, though I'm not sure they're really negative, since the numbers go up), and splice that value into the same spot in the URL your script uses, which will get you to the newly generated results.
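A minimal sketch of that two-step fetch, reusing the URLs already shown in the question (the regular expression and the error handling here are assumptions on my part, not something tested against the live catalog):
// Step 1: request the "Hot New Items" limits link; the server builds the result
// set and the freshly generated LimitsId appears in the response.
$seedUrl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true";
$ch = curl_init($seedUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$page = curl_exec($ch);
curl_close($ch);
// Step 2: pull the new LimitsId out of the response and splice it into the
// results URL the script already uses.
if ($page && preg_match('/LimitsId=(-\d+)/', $page, $m)) {
    $linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll"
             . "?NewestSearch&Config=pac&FormId=0&LimitsId=" . $m[1]
             . "&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0"
             . "&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
    // $linkurl can now replace the hard-coded, expiring link in the script above.
} else {
    die('Could not find a LimitsId in the seed page.');
}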
However, as pointed out by Charles, these requests will put a significant load on the server, so maybe you should only generate a new request when the old one expires.
So I'm trying to figure out why this PHP code takes so long to output its results.
For example, this is my apitest.php, and here is my PHP code:
<?php
function getRankedMatchHistory($summonerId,$serverName,$apiKey){
$k;
$d;
$a;
$timeElapsed;
$gameType;
$championName;
$result;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
curl_close($ch);
$matchHistory = json_decode($response,true); // Is the whole JSON response now saved locally in $matchHistory, or is it requested every time $matchHistory is accessed?
for ($i = 9; $i >= 0; $i--){
$farm1 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["minionsKilled"];
$farm2 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralMinionsKilled"];
$farm3 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledTeamJungle"];
$farm4 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledEnemyJungle"];
$elapsedTime = $matchHistory["matches"][$i]["matchDuration"];
settype($elapsedTime, "integer");
$elapsedTime = floor($elapsedTime / 60);
$k = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["kills"];
$d = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["deaths"];
$a = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["assists"];
$championIdTmp = $matchHistory["matches"][$i]["participants"]["0"]["championId"];
$championName = call_user_func('getChampionName', $championIdTmp); // calls another function to resolve championId into championName
$gameType = preg_replace('/[^A-Za-z0-9\-]/', ' ', $matchHistory["matches"][$i]["queueType"]);
$result = (($matchHistory["matches"][$i]["participants"]["0"]["stats"]["winner"]) == "true") ? "Victory" : "Defeat";
echo "<tr>"."<td>".$gameType."</td>"."<td>".$result."</td>"."<td>".$championName."</td>"."<td>".$k."/".$d."/".$a."</td>"."<td>".($farm1+$farm2+$farm3+$farm4)." in ". $elapsedTime. " minutes". "</td>"."</tr>";
}
}
?>
What I'd like to know is how to make the page output faster, as it takes around 10-15 seconds to produce the results, which makes the browser think the site is dead (as if it hit a 500 Internal Server Error or something similar).
Here is a simple demonstration of how long it can take: Here
As you might have noticed, I'm using the Riot API, which sends the response as JSON.
Here is an example of the response that this function handles: Here
What I thought of was creating a temporary file (say temp.php) at the start of the cURL function, saving the whole response there, and then reading the variables from it to speed up the process; after reading the variables it would delete the temp.php that was created, freeing up disk space.
But I have no idea how to do that in PHP alone.
By the way, I just started using PHP today, so I'd appreciate some explanation with the answers if possible.
Thanks for your precious time.
Try benchmarking like this:
// start the timer
$start_curl = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// debugging
curl_setopt($ch, CURLOPT_VERBOSE, true);
// start another timer
$start = microtime(true);
$response = curl_exec($ch);
echo 'curl_exec() in: '.(microtime(true) - $start).' seconds<br><br>';
// start another timer
$start = microtime(true);
curl_close($ch);
echo 'curl_close() in: '.(microtime(true) - $start).' seconds<br><br>';
// how long did the entire CURL take?
echo 'CURLed in: '.(microtime(true) - $start_curl).' seconds<br><br>';
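If the benchmark shows that nearly all of the time is spent in curl_exec(), the bottleneck is the network round-trip to the Riot API, and the temporary-file idea from the question is one way to avoid repeating it on every page load. A rough sketch (the cache file name and the five-minute lifetime are arbitrary choices of mine, not part of the original code):
$cacheFile = sys_get_temp_dir() . '/matchhistory_' . $summonerId . '.json';
$cacheTtl  = 300; // reuse a previous response for up to 5 minutes
if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $cacheTtl) {
    // Reuse the cached response instead of calling the API again.
    $response = file_get_contents($cacheFile);
} else {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $response = curl_exec($ch);
    curl_close($ch);
    file_put_contents($cacheFile, $response); // cache it for the next request
}
$matchHistory = json_decode($response, true);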
I am new here and hoping for your kind advice. Thanks in advance.
I have written an HTTP API to send SMS using cURL. Everything is working fine, except I am failing to loop and post through cURL for large sets of phone numbers. For example: a user uploads 50,000 phone numbers in an Excel sheet on my site; I fetch all the mobile numbers from the database and then post them through cURL.
Now, the SMS gateway I send the request to accepts a maximum of 10,000 numbers at once via its HTTP API.
So from the 50,000 fetched numbers I want to split them into batches of 10,000, loop over the batches, and send a cURL POST for each.
Here is my code
//have taken care of sql injection on live site
$resultRestore = mysql_query("SELECT * FROM temptable WHERE userid = '".$this->user_id."' AND uploadid='".$uploadid."' ");
$rowRestoreCount = mysql_num_rows($resultRestore);
#mysql_data_seek($resultRestore, 0);
$phone_list = "";
while($rowRestore = mysql_fetch_array($resultRestore))
{
$phone_list .= $rowRestore['recphone'].",";
}
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
Now, from $phone_list, I need to loop over every 10,000 numbers. How can I achieve this?
It's been 2 days; I have tried several things and am not getting the result.
Kindly help...
NOTE: I'm going to start off with the obligatory warning about using mysql functions. Please consider switching to mysqli or PDO.
There are a number of different ways you could do this. Personally, I would reconfigure your script to only fetch 10,000 numbers at a time from the database and put that inside a loop. It might look something like this (note that for simplicity I am not updating your mysql* calls to mysqli*). Keep in mind I haven't actually run this, since I can't test most of your code:
// defines where the query starts from
$offset= 0;
// defines how many to get with the query
$limit = 10000;
// set up base SQL to use over and over updating offset
$baseSql = "SELECT * FROM temptable WHERE userid = '".$this->user_id."' AND uploadid='".$uploadid."' LIMIT ";
// get first set of results
$resultRestore = mysql_query($baseSql . $offset . ', '. $limit);
// now loop
while (mysql_num_rows($resultRestore) > 0)
{
$rowRestoreCount = mysql_num_rows($resultRestore);
$phone_list = "";
while($rowRestore = mysql_fetch_array($resultRestore))
{
$phone_list .= $rowRestore['recphone'].",";
}
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
// now update for the while loop
// increment by value of limit
$offset += $limit;
// now re-query for the next 10000
// this will continue until there are no records left to retrieve
// this should work even if there are 50,123 records (the last loop will process 123 records)
$resultRestore = mysql_query($baseSql . $offset . ', '. $limit);
}
You could also achieve this without using offset and limit in your sql query. This might be a simpler approach for you:
// define our maximum chunk here
$max = 10000;
$resultRestore = mysql_query("SELECT * FROM temptable WHERE userid = '".$this->user_id."' AND uploadid='".$uploadid."' ");
$rowRestoreCount = mysql_num_rows($resultRestore);
#mysql_data_seek($resultRestore, 0);
$phone_list = "";
// hold the current number of processed phone numbers
$count = 0;
while($rowRestore = mysql_fetch_array($resultRestore))
{
$phone_list .= $rowRestore['recphone'].",";
$count++;
// when count hits our max, do the send
if ($count >= $max)
{
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
// now reset count back to zero
$count = 0;
// and reset phone_list
$phone_list = '';
}
}
// if we don't have # of phones evenly divisible by $max then handle any leftovers
if ($count > 0)
{
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode(substr($phone_list, 0, -1))."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
}
I notice that you are retrieving the information in $curl_scraped_page. In either of these scenarios above, you will need to account for the new loop if you're doing any processing on $curl_scraped_page.
Again, please consider switching to mysqli or PDO, and keep in mind that there are likely more efficient and flexible ways to achieve this than what you are doing here. For example, you might want to log successful sends in case your script breaks and incorporate that into your script (for example, by selecting from the database only those numbers that have not yet received this text). This would allow you to re-run your script but only send to those who did NOT yet receive the text, rather than hitting everyone again (or maybe your SMS gateway handles that for you?)
EDIT
Another approach would be to load all the retrieved numbers into a single array, then chunk the array into pieces and process each chunk.
$numbers = array();
while ($rowRestore = mysql_fetch_array($resultRestore))
{
$numbers[] = $rowRestore['recphone'];
}
// split into chunks of 10,000
$chunks = array_chunk($numbers, 10000);
// loop and process the chunks
foreach ($chunks AS $chunk)
{
// $chunk will be an array, so implode it with comma to get the phone list
$phone_list = implode(',', $chunk);
// note that there is no longer a need to substr -1 the $phone_list because it won't have a trailing comma using implode()
$url = "http://www.smsgatewaycenter.com/library/send_sms_2.php?UserName=".urlencode($this->param[userid])."&Password=".urlencode($this->param[password])."&Type=Bulk&To=".urlencode($phone_list)."&Mask=".urlencode($this->sendname)."&Message=Hello%20World";
//echo $url;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
}
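The earlier note about logging successful sends could be sketched roughly as follows, assuming a hypothetical sent flag column on temptable (it is not in the original schema) and keeping the same deprecated mysql_* calls as the surrounding code:
// Inside the foreach ($chunks AS $chunk) loop, after a successful gateway call,
// flag the numbers in that chunk so a re-run skips them.
if ($curl_scraped_page !== false) {
    $in = "'" . implode("','", array_map('mysql_real_escape_string', $chunk)) . "'";
    mysql_query("UPDATE temptable SET sent = 1 WHERE userid = '" . $this->user_id . "' AND uploadid = '" . $uploadid . "' AND recphone IN (" . $in . ")");
}
// On a later run, select only the numbers that have not yet received the text.
$resultRestore = mysql_query("SELECT * FROM temptable WHERE userid = '" . $this->user_id . "' AND uploadid = '" . $uploadid . "' AND sent = 0");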
I am working on getting some XML data into a PHP variable so I can easily use it in my HTML page. I am using the code below:
$ch = curl_init();
$timeout = 5;
$url = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/define?key=0b03f103-f6a7-4bb1-9136-11ab4e7b5294";
$definition = simplexml_load_file($url);
echo $definition->entry[0]->def;
However my results are: .
I am not sure what I am doing wrong; I have followed the PHP manual, so I am guessing it is something obvious that I am just not understanding correctly.
The actual XML returned from that link is visible by clicking the link below; I did not post it here because it is rather long:
http://www.dictionaryapi.com/api/v1/references/sd3/xml/test?key=9d92e6bd-a94b-45c5-9128-bc0f0908103d
<?php
$ch = curl_init();
$timeout = 5;
$url = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/define?key=0b03f103-f6a7-4bb1-9136-11ab4e7b5294";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch); // you were missing a semicolon
$definition = new SimpleXMLElement($data);
echo '<pre>';
print_r($definition->entry[0]->def);
echo '</pre>';
// this returns the SimpleXML Object
// to get parts, you can do something like this...
foreach($definition->entry[0]->def[0] as $entry) {
echo $entry[0] . "<br />";
}
// which returns
transitive verb
14th century
1 a
:to determine or identify the essential qualities or meaning of
b
:to discover and set forth the meaning of (as a word)
c
:to create on a computer
2 a
:to fix or mark the limits of :
b
:to make distinct, clear, or detailed especially in outline
3
:
intransitive verb
:to make a
Working Demo
I'm writing a page scraper for a site that is a little slow but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to execute and parse the ~150 pages I scrape so far. It will be a crontab'd event, and a temporary table is used while it's being generated, then copied to a "live" table upon completion, so it's a seamless transition from a client standpoint. However, can you see a way to speed up my code?
//mysql connection stuff here
function dnl2array($domnodelist) {
$return = array();
$nb = $domnodelist->length;
for ($i = 0; $i < $nb; ++$i) {
$return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
$return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
}
return $return;
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
// Load the html into our object
$doc->loadHTML($html);
$xPath = new DOMXPath( $doc );
// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[@id="food-plan-contents"]//td[@class="product-name"]');
$test['itams'] = dnl2array($results);
foreach($test['itams']['html'] as $get_url){
$prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
foreach($prepared_url as $url){
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xPath = new DOMXPath($doc);
$results = $xPath->query('//h3[@class="product-name"]');
$arr[$i]['name'] = dnl2array($results);
$results = $xPath->query('//div[@class="product-specs"]');
$arr[$i]['desc'] = dnl2array($results);
$results = $xPath->query('//p[@class="product-image-zoom"]');
$arr[$i]['img'] = dnl2array($results);
$results = $xPath->query('//div[@class="groupedTable"]/table/tbody/tr//span[@class="price"]');
$arr[$i]['price'] = dnl2array($results);
$arr[$i]['url'] = $url;
if($i % 5 == 1){
lazy_loader($arr); //lazy loader adds data to sql database
unset($arr); // keep memory footprint light (server is wimpy -- but free!)
}
$i++;
usleep(50000); // Don't be bandwith pig
}
// Get any stragglers
if(count($arr) > 0){
lazy_loader($arr);
$time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
$tab_name = "sr_data_items_" . date("m_d_y", $time);
// and copy table now that script is finished
mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
mysql_query("TRUNCATE TABLE `sr_data_items_skel`");
}
It sounds like you're mostly dealing with slow server response speeds. At even 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way you could speed this up is by using curl_multi_* to run multiple connections at the same time.
So replace the start of the foreach loop (up through the if !html check) with this:
reset($prepared_url); // set internal pointer to first element
$running = array();   // map from curl handle id to url
$finished = false;
$mh = curl_multi_init();
$i = 0;
while (!$finished || !empty($running)) {
    // add urls to $mh up to a maximum
    while (count($running) < 15 && !$finished) {
        $url = current($prepared_url); // current()+next() so the first url is not skipped
        next($prepared_url);
        if ($url === FALSE) {
            $finished = true;
            break;
        }
        $c = setupcurl($url);
        curl_multi_add_handle($mh, $c);
        $running[(int) $c] = $url; // cast: a curl handle cannot be used as an array key directly
    }
    curl_multi_exec($mh, $active);
    $info = curl_multi_info_read($mh);
    if (false === $info) continue; // nothing to report right now
    $c = $info['handle'];
    $url = $running[(int) $c];
    unset($running[(int) $c]);
    $result = $info['result'];
    if ($result != CURLE_OK) {
        echo "Curl Error: " . $result . "\n";
        continue;
    }
    $html = curl_multi_getcontent($c);
    $download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);
    curl_multi_remove_handle($mh, $c);
    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>\n" .
             "<p>cURL error: " . curl_error($c) . "</p>\n";
    }
    curl_close($c);
<<rest of foreach loop here>>
That will keep 15 downloads going at the same time, and process them as they finish.
Anyway, for the history, please see my comments up top.
As for caching: I'm using dnsmasq to cache.
My setup uses a recipe for Chef, which I run through chef-solo. The templates contain my configuration and the attributes contain my settings. It's pretty straightforward.
The beauty is that this lets me put the server on DHCP (we use Amazon EC2, and that service distributes all IPs to the virtual instances via DHCP), and then I don't have to make any changes to my application to use them.
I have another recipe to edit /etc/dhclient.conf.
Does this help? Let me know where to elaborate more.
EDIT
Just for clarification: this is not a Ruby solution; I'm just using Chef for configuration management (that part makes sure services are always set up the same way, etc.). Dnsmasq itself acts as a local DNS server and caches responses, which speeds up repeated lookups.
The manual way is as follows:
On a Ubuntu:
apt-get install dnsmasq
Then edit the /etc/dnsmasq.conf:
listen-address=127.0.0.1
cache-size=5000
domain-needed
bogus-priv
log-queries
Restart service and verify it's running (ps aux|grep dnsmasq).
Then put it into your /etc/resolv.conf:
nameserver 127.0.0.1
Test:
dig @127.0.0.1 stackoverflow.com
Execute twice, check time it took to resolve. Second one should be faster.
Enjoy! ;)
The first thing to do is to measure how much time is spent downloading the file from the server. Use function microtime(true) to get a timestamp both before and after the call
file_get_contents($url);
and subtract the values. After you find out that the real bottleneck is inside your code and not on the side of network or remote server, only then you can start thinking about some optimizations.
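For example, a minimal timing sketch (assuming $url already points at the page being fetched):
$start = microtime(true);
$html = file_get_contents($url); // the download being measured
$elapsed = microtime(true) - $start;
echo 'Downloaded ' . $url . ' in ' . round($elapsed, 3) . " seconds\n";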
When you say that 150 pages takes 5 minutes to load & parse, that's 2 seconds per page, and my wild guess is that most of that time is spent to download the page from the server.
You should consider using cURL instead of both file_get_contents() and DOMDocument::loadHTMLFile(), because it's much faster.
See this question:
https://stackoverflow.com/questions/555523/file-get-contents-vs-curl-what-has-better-performance
You need to benchmark. DNS is not an issue: if you're scraping 150 pages, the DNS entry will certainly stay cached on your resolver for the 4 minutes you need to parse the remaining 149 pages.
Try timing all page transfers with wget/curl; you may be surprised that it's not as fast as you think.
Try requesting in parallel; hitting them with 4 parallel requests will get your time down to about 1 minute.
If you actually find that it's an XPath problem, use preg_split() or even an awk script with popen() to get your values.
My current code (see below) uses 147MB of virtual memory!
My provider allocates 100MB by default, and the process is killed once the script runs, causing an internal error.
The code uses curl_multi and must be able to loop through more than 150 iterations while still keeping virtual memory low. The code below is set to only 150 iterations and still causes the internal server error; at 90 iterations the issue does not occur.
How can I adjust my code to lower the resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
if ($utimestamp === null)
$utimestamp = microtime(true);
$timestamp = floor($utimestamp);
$milliseconds = round(($utimestamp - $timestamp) * 1000);
return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}
$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();
for($i=0; $i<150; $i++)
{
$curl_arr[$i] = curl_init();
curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i=0; $i<150; $i++)
{
$results = curl_multi_getcontent ($curl_arr[$i]);
$results = explode("<br>", $results);
echo $results[0];
echo "<br>";
echo $results[1];
echo "<br>";
echo udate('H:i:s:u');
echo "<br><br>";
usleep(100000);
}
?>
As per your last comment..
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;
require("RollingCurl.php");
function request_callback($response, $info, $request) {
list($result0, $result1) = explode("<br>", $response);
echo "{$result0}<br>{$result1}<br>";
//print_r($info);
//print_r($request);
echo "<hr>";
}
$urls = array_fill(0, $fetch_count, $url);
$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url);
$rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
If the intention is domain snatching, then using one of the established services is a better option. Your script implementation is hardly as important as the actual connection and latency.
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
#mario: Cheers. I'm competing against 2 other companies for specific ccTLD's. They are new to the game and they are snapping up those domains in slow time (up to 10 seconds after purge time). I'm just a little slower at the moment.
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is being stored in PHP memory, and by your evidence this is insufficient. The only conclusion is that you cannot keep all 150 results in memory: you must either stream to files instead of memory buffers, or reduce the number of queries and process the list of URLs in batches.
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION; there is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
global $fh;
$bytes = fwrite ($fh, $data, strlen($data));
return $bytes;
}
curl_setopt ($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as a problem for the reader to solve.
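One possible way to handle it (a sketch of mine, not from the manual example) is to keep a map from each handle's integer ID to its own file pointer, opened before the handle is added to the multi stack; variable names follow the question's code:
$files = array(); // map: (int) curl handle => file pointer
function on_curl_write($ch, $data)
{
    global $files;
    // write this handle's data to its own file and report the bytes written
    return fwrite($files[(int) $ch], $data, strlen($data));
}
for ($i = 0; $i < 150; $i++) {
    $curl_arr[$i] = curl_init();
    curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
    $files[(int) $curl_arr[$i]] = fopen("result_{$i}.txt", 'w'); // one file per request
    curl_multi_add_handle($master, $curl_arr[$i]);
}
// ...run curl_multi_exec() as before, then fclose() each file and read the
// results back in small pieces instead of holding all 150 responses in memory.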
<?php
echo str_repeat(' ', 1024); //to make flush work
$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; //0.1 second
//$delay = 1000000; //1 second
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
for ($i=0; $i<$fetch_count; $i++) {
$start = microtime(true);
$result = curl_exec($ch);
list($result0, $result1) = explode("<br>", $result);
echo "{$result0}<br>{$result1}<br>";
flush();
$end = microtime(true);
$sleeping = $delay - (($end - $start) * 1000000); // $delay is in microseconds, so convert the elapsed seconds
echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
if ($sleeping > 0) usleep((int) $sleeping);
}
curl_close($ch);
?>