I'm writing a page scraper for a site that is a little slow, but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to execute and parse the ~150 pages I scrape so far. It will be a crontab'd event; a temporary table is populated while the scrape runs and then copied to a "live" table on completion, so the transition is seamless from a client's standpoint. Can you see a way to speed up my code?
//mysql connection stuff here
function dnl2array($domnodelist) {
$return = array();
$nb = $domnodelist->length;
for ($i = 0; $i < $nb; ++$i) {
$return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
$return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
}
return $return;
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
// Load the html into our object
$doc->loadHTML($html);
$xPath = new DOMXPath( $doc );
// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[@id="food-plan-contents"]//td[@class="product-name"]');
$test['items'] = dnl2array($results);
foreach($test['items']['html'] as $get_url){
$prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
foreach($prepared_url as $url){
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xPath = new DOMXPath($doc);
$results = $xPath->query('//h3[@class="product-name"]');
$arr[$i]['name'] = dnl2array($results);
$results = $xPath->query('//div[@class="product-specs"]');
$arr[$i]['desc'] = dnl2array($results);
$results = $xPath->query('//p[@class="product-image-zoom"]');
$arr[$i]['img'] = dnl2array($results);
$results = $xPath->query('//div[@class="groupedTable"]/table/tbody/tr//span[@class="price"]');
$arr[$i]['price'] = dnl2array($results);
$arr[$i]['url'] = $url;
if($i % 5 == 1){
lazy_loader($arr); //lazy loader adds data to sql database
unset($arr); // keep memory footprint light (server is wimpy -- but free!)
}
$i++;
usleep(50000); // Don't be a bandwidth pig
}
// Get any stragglers
if(count($arr) > 0){
lazy_loader($arr);
$time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
$tab_name = "sr_data_items_" . date("m_d_y", $time);
// and copy table now that script is finished
mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
mysql_query("TRUNCATE TABLE `sr_data_items_skel`");
}
It sounds like you're mostly dealing with slow server response speeds. At even 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way you could speed this up is by using curl_multi_* to run multiple connections at the same time.
So replace the start of the foreach loop (up through the if !html check) with this:
reset($prepared_url); // set internal pointer to first element
$running = array(); // map from curl handle id to url
$finished = false;
$mh = curl_multi_init();
$i = 0;
while (!$finished || !empty($running)) {
    // add urls to $mh up to a maximum
    while (count($running) < 15 && !$finished) {
        $url = current($prepared_url); // current() before next(), so the first url isn't skipped
        next($prepared_url);
        if ($url === false) {
            $finished = true;
            break;
        }
        $c = setupcurl($url);
        curl_multi_add_handle($mh, $c);
        $running[(int) $c] = $url; // use the handle's id as the key (a resource isn't a clean array key)
    }
    curl_multi_exec($mh, $active);
    $info = curl_multi_info_read($mh);
    if (false === $info) {
        curl_multi_select($mh, 1.0); // nothing to report yet; wait for activity instead of spinning
        continue;
    }
    $c = $info['handle'];
    $url = $running[(int) $c];
    unset($running[(int) $c]);
    $result = $info['result'];
    if ($result != CURLE_OK) {
        echo "Curl Error: " . $result . " on URL: " . $url . "\n";
        curl_multi_remove_handle($mh, $c);
        curl_close($c);
        continue;
    }
    $html = curl_multi_getcontent($c);
    $download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);
    curl_multi_remove_handle($mh, $c);
    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>\n" .
             "<p>cURL error: " . curl_error($c) . "</p>\n";
    }
    curl_close($c);
<<rest of foreach loop here>>
That will keep 15 downloads going at the same time, and process them as they finish.
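The setupcurl() call above isn't defined in the snippet; a minimal sketch, assuming it simply wraps the same options used in the single-handle version from the question:
function setupcurl($url) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 20);
    return $c;
}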
Anyway – so for the history: please see my comments up top.
As for caching: I'm using dnsmasq to cache.
My setup uses a Chef recipe, which I run through chef-solo. The templates contain my configuration and the attributes contain my settings. It's pretty straightforward.
So the beauty is that this allows me to put this server into DHCP (we use Amazon EC2 and this service distributes all IPs via DHCP to the virtual instances) and then I don't have to make any changes to my application to use them.
I have another recipe to edit /etc/dhclient.conf.
Does this help? Let me know where to elaborate more.
EDIT
Just for clarification: this is not a Ruby solution; I'm just using Chef for configuration management (it makes sure that services are always set up the same way, etc.). Dnsmasq itself acts as a local DNS server and caches the responses, which speeds up repeated lookups.
The manual way is as follows:
On Ubuntu:
apt-get install dnsmasq
Then edit the /etc/dnsmasq.conf:
listen-address=127.0.0.1
cache-size=5000
domain-needed
bogus-priv
log-queries
Restart service and verify it's running (ps aux|grep dnsmasq).
Then put it into your /etc/resolv.conf:
nameserver 127.0.0.1
Test:
dig @127.0.0.1 stackoverflow.com
Execute twice, check time it took to resolve. Second one should be faster.
Enjoy! ;)
The first thing to do is to measure how much time is spent downloading the file from the server. Use microtime(true) to get a timestamp both before and after the call
file_get_contents($url);
and subtract the values. Only after you find out that the real bottleneck is inside your code, and not on the side of the network or the remote server, should you start thinking about optimizations.
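For example, a quick sketch of that measurement (variable names here are arbitrary):
$start = microtime(true);
$html = file_get_contents($url);
$elapsed = microtime(true) - $start;
echo "Download took " . round($elapsed, 3) . " seconds\n";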
When you say that 150 pages take 5 minutes to load and parse, that's 2 seconds per page, and my wild guess is that most of that time is spent downloading the page from the server.
You should consider using cURL instead of both file_get_contents() and DOMDocument::loadHTMLFile, because it's much faster.
See this question:
https://stackoverflow.com/questions/555523/file-get-contents-vs-curl-what-has-better-performance
You need to benchmark. DNS is not an issue: if you're scraping 150 pages, the DNS entry will certainly stay cached on your resolver for the 4 minutes you need to parse the remaining 149 pages.
Try timing all the page transfers with wget/curl; you may be surprised to find it's not as fast as you think.
Try requesting in parallel: hitting the server with 4 parallel requests will get your time down to about 1 minute.
If you actually find that it's an XPath problem, use preg_split() or even an awk script with popen() to get your values.
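For instance, a rough regex alternative (assuming the markup stays as simple and regular as the class names in the question suggest):
// Pull the price spans out with a regex instead of DOM/XPath
preg_match_all('/<span class="price">(.*?)<\/span>/s', $html, $matches);
$prices = array_map('trim', $matches[1]);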
Related
I am looking to collect the titles of all of the posts on a subreddit, and I wanted to know what would be the best way of going about this?
I've looked around and found some stuff talking about Python and bots. I've also had a brief look at the API and am unsure in which direction to go.
As I do not want to commit to an approach only to find out 90% of the way through that it won't work, I ask if someone could point me in the right direction in terms of language and any extras needed, for example pip for Python.
My own experience is in web languages such as PHP, so I initially thought a web app would do the trick, but I am unsure whether this is the best way and how to go about it.
So as my question stands
What would be the best way to collect the titles (in bulk) of a
subreddit?
Or if that is too subjective
How do I retrieve and store all the post titles of a subreddit?
Preferably needs to :
do more than 1 page of (25) results
save to a .txt file
Thanks in advance.
PHP; in 25 lines:
$subreddit = 'pokemon';
$max_pages = 10;
// Set variables with default data
$page = 0;
$after = '';
$titles = '';
do {
$url = 'http://www.reddit.com/r/' . $subreddit . '/new.json?limit=25&after=' . $after;
// Set URL you want to fetch
$ch = curl_init($url);
// Set curl option of header to false (don't need them)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Set curl option of nobody to false as we need the body
curl_setopt($ch, CURLOPT_NOBODY, 0);
// Set curl timeout of 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
// Set curl to return output as string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Execute curl
$output = curl_exec($ch);
// Get HTTP code of request
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// Close curl
curl_close($ch);
// If http code is 200 (success)
if ($status == 200) {
// Decode JSON into PHP object
$json = json_decode($output);
// Set after for next curl iteration (reddit's pagination)
$after = $json->data->after;
// Loop through each post and collect its title
foreach ($json->data->children as $k => $v) {
$titles .= $v->data->title . "\n";
}
}
// Increment page number
$page++;
// Loop while the current page number is less than the maximum pages
} while ($page < $max_pages);
// Save titles to text file
file_put_contents(dirname(__FILE__) . '/' . $subreddit . '.txt', $titles);
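One small refinement worth considering: reddit's listing returns a null "after" value on the last page, so the loop can stop early instead of always running $max_pages times. A sketch of the extra check, to go inside the do/while right after $after is read:
$after = $json->data->after;
// Stop early if reddit reports there is no next page
if (empty($after)) {
    break;
}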
I'm trying to figure out why this PHP code takes so long to output its results.
For example, this is my apitest.php, and here is my PHP code:
<?php
function getRankedMatchHistory($summonerId,$serverName,$apiKey){
$k = null;
$d = null;
$a = null;
$timeElapsed = null;
$gameType = null;
$championName = null;
$result = null;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
curl_close($ch);
$matchHistory = json_decode($response,true); // Is the whole JSON response saved locally in $matchHistory now, or is it re-requested every time $matchHistory is used?
for ($i = 9; $i >= 0; $i--){
$farm1 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["minionsKilled"];
$farm2 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralMinionsKilled"];
$farm3 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledTeamJungle"];
$farm4 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledEnemyJungle"];
$elapsedTime = $matchHistory["matches"][$i]["matchDuration"];
settype($elapsedTime, "integer");
$elapsedTime = floor($elapsedTime / 60);
$k = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["kills"];
$d = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["deaths"];
$a = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["assists"];
$championIdTmp = $matchHistory["matches"][$i]["participants"]["0"]["championId"];
$championName = call_user_func('getChampionName', $championIdTmp); // calls another function to resolve championId into championName
$gameType = preg_replace('/[^A-Za-z0-9\-]/', ' ', $matchHistory["matches"][$i]["queueType"]);
$result = (($matchHistory["matches"][$i]["participants"]["0"]["stats"]["winner"]) == "true") ? "Victory" : "Defeat";
echo "<tr>"."<td>".$gameType."</td>"."<td>".$result."</td>"."<td>".$championName."</td>"."<td>".$k."/".$d."/".$a."</td>"."<td>".($farm1+$farm2+$farm3+$farm4)." in ". $elapsedTime. " minutes". "</td>"."</tr>";
}
}
?>
What I'd like to know is how to make the page output faster, as it currently takes around
10-15 seconds to output the results, which makes the browser think the website is dead, as if it had hit a 500 Internal Server Error or something like it.
Here is a simple demonstration of how long it can take : Here
As you might have noticed, yes, I'm using the Riot API, which sends the response as JSON.
Here is an example of the response that this function handles : Here
What I thought of was creating a temporary file (say temp.php) at the start of the cURL function, saving the whole response there, and then reading the variables from that file to speed up the process; after reading the variables it would delete temp.php, freeing up disk space.
But I have no idea how to do that in PHP only.
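A minimal sketch of that idea, caching the raw JSON to a file for a few minutes rather than re-requesting it (the file name and lifetime here are assumptions):
$cacheFile = sys_get_temp_dir() . "/matchhistory_" . $summonerId . ".json"; // hypothetical cache location
$cacheTtl  = 300; // keep the cached response for 5 minutes

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $cacheTtl) {
    // Serve the saved response instead of hitting the API again
    $response = file_get_contents($cacheFile);
} else {
    $ch = curl_init("https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    file_put_contents($cacheFile, $response);
}
$matchHistory = json_decode($response, true);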
By the way, I'd like to mention that I just started using PHP today, so I'd prefer some explanation with the answers if possible.
Thanks for your precious time.
Try benchmarking like this:
// start the timer
$start_curl = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// debugging
curl_setopt($ch, CURLOPT_VERBOSE, true);
// start another timer
$start = microtime(true);
$response = curl_exec($ch);
echo 'curl_exec() in: '.(microtime(true) - $start).' seconds<br><br>';
// start another timer
$start = microtime(true);
curl_close($ch);
echo 'curl_close() in: '.(microtime(true) - $start).' seconds<br><br>';
// how long did the entire CURL take?
echo 'CURLed in: '.(microtime(true) - $start_curl).' seconds<br><br>';
(I'm scraping this stuff with the permission of the website in question, by the way).
It's a pretty simple web scraper. It was working fine when I was loading all the links by hand, but when I tried to load them in via JSON and variables (so I can do lots of scraping with the one script and make the process more modular by just adding more links to the JSON), it runs in an infinite loop.
(The page has been loading for about 15 minutes now.)
Here is my JSON. Only one store is in there for testing purposes but there is going to be about 15 more.
[
{
"store":"Incu Men",
"cat":"Accessories",
"general_cat":"Accessories",
"spec_cat":"accessories",
"url":"http://www.incuclothing.com/shop-men/accessories/",
"baseurl":"http://www.incuclothing.com",
"next_select":"a.next",
"prod_name_select":".infobox .fn",
"label_name_select":".infobox .brand",
"desc_select":".infobox .description",
"price_select":"#price",
"mainImg_select":"",
"more_imgs":".product-images",
"product_url":".hproduct .photo-link"
}
]
Here is the PHP scraper code:
<?php
//Set infinite time limit
set_time_limit (0);
// Include simple html dom
include('simple_html_dom.php');
// Defining the basic cURL function
function curl($url) {
$ch = curl_init();
// Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url);
// Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Setting cURL's option to return the webpage data
$data = curl_exec($ch);
// Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch);
// Closing cURL
return $data;
// Returning the data from the function
}
function getLinks($catURL, $prodURL, $baseURL, $next_select) {
$urls = array();
while($catURL) {
echo "Indexing: $url" . PHP_EOL;
$html = str_get_html(curl($catURL));
foreach ($html->find($prodURL) as $el) {
$urls[] = $baseURL . $el->href;
}
$next = $html->find($next_select, 0);
$url = $next ? $baseURL . $next->href : null;
echo "Results: $next" . PHP_EOL;
}
return $urls;
}
$string = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string,true);
foreach ($json_array as $value){
$baseURL = $value['baseurl'];
$catURL = $value['url'];
$store = $value['store'];
$general_cat = $value['general_cat'];
$spec_cat = $value['spec_cat'];
$next_select = $value['next_select'];
$prod_name = $value['prod_name_select'];
$label_name = $value['label_name_select'];
$description = $value['desc_select'];
$price = $value['price_select'];
$prodURL = $value['product_url'];
if (!is_null($value['mainImg_select'])){
$mainImg = $value['mainImg_select'];
}
$more_imgs = $value['more_imgs'];
$allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);
}
?>
Any ideas why the script would be running infinitely and not returning/stopping/printing anything to screen? I'm just going to let it run until it stops. When I was doing this by hand it would only take a minute or so, sometimes less, so I'm sure it's a problem with my variables/JSON, but I can't for the life of me see where the issue lies.
Can anyone take a quick look and point me in the right direction?
There is a problem with your while($catURL) loop: inside it you only ever assign the next-page link to $url, never to $catURL, so the condition never changes and the loop never ends. What do you want it to do?
Also, you can force output to be displayed in your browser as the script runs with the flush() command.
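For reference, a sketch of getLinks() with the loop variable fixed so the pagination actually terminates (same logic as the original otherwise):
function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();
    while ($catURL) {
        echo "Indexing: $catURL" . PHP_EOL;
        $html = str_get_html(curl($catURL));
        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }
        // Advance to the next page, or stop when there is no "next" link
        $next   = $html->find($next_select, 0);
        $catURL = $next ? $baseURL . $next->href : null;
    }
    return $urls;
}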
I am working on a page for a library that will display the latest books, movies and items that the library has added to their collection.
A friend and I (both of us are new to PHP) have been trying to use cURL to accomplish this. We have gotten the code to grab the sections we want and have it formatted as it should look on the results page.
The problem we are having is that the url which we feed into cURL is automatically generated somehow and keeps expiring every few hours and breaks the page.
Here is the PHP we are using:
<?php
//function storeLink($url,$gathered_from) {
// $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
// mysql_query($query) or die('Error, insert query failed');
//}
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://catalog.yourppl.org/limitedsearch.asp");
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$refreshlink= curl_exec($ch);
$endlink = strpos($refreshlink,'Hot New Items')-2;//end
$startlink = $endlink -249;
$startlink = strpos($refreshlink,'http',$startlink);//start
$endlink = $endlink - $startlink;
$linkurl = substr("$refreshlink",$startlink, $endlink);
//echo $linkurl;
//this is the link that expires
$linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-168&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $linkurl);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
$content = $html;
$PHolder = 0;
$x = 0;
$y = 0;
$max = strlen($content);
$isbn = array(300=>0);
$stitle = array(300=>0);
$sbookcover = array(300=>0);
while ($x < 200 )
{
$x++;
$start = strpos($content,'isbn',$PHolder+5);//beginning
$start2 = strpos($content,'Branch=,0,"',$start+5);//beginning
$start2 = $start2 -400;
if ($start2 < 0)break;
$start2 = strpos($content,'<a href',$start2);
if ($start2 == "")break;
$start2 = $start2 - 12;
$end2 = strpos($content,'</a>',$start);
$end = strpos($content,'"',$start);
$offset = 13;
$offset2 = $end2 - $start2;
if (substr("$content", $start+5, $offset) != $isbn)
{
if(array_search(substr("$content", $start+5, $offset), $isbn) == 0 )
{
$y++;
$isbn[$y] = substr("$content", $start+5, $offset);
$sbookcover[$y]="
<img border=\"0\" width = \"170\" alt=\"Book Jacket\"src=\"http://ls2content.tlcdelivers.com/content.html?customerid=7977&requesttype=bookjacket-lg&isbn=$isbn[$y]&isbn=$isbn[$y]\">
";
$stitle[$y]= substr("$content", $start2+12, $offset2);
$bookcover = $sbookcover[$y];
$title = $stitle[$y]."</a>";
$stitle[$y] = str_replace("<a href=\"","<a href=\"http://catalog.yourppl.org",$title);
$stitle[$y] = str_replace("\">","\" rel=\"shadowbox\">",$stitle[$y]);
$booklinkend = strpos($stitle[$y],"\">");
$booklink = substr($stitle[$y], 0, $booklinkend+2);
$sbookcover[$y] = "$booklink".$sbookcover[$y]."</a>";
}
}
$PHolder = $start;
}
echo"
<table class=\"twocolorformat\" width=\"95%\">
";
$xx = 1;
while ($xy <= 6)
{
$xy++;
echo "
<tr>
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</div></td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>
<tr>
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>
";//this is the table row and table data definition. covers and titles are fed to table here.
$xx = $xx +3;
if ($sbookcover[$xx] == "")break;
}
echo"
</table>
";//close your table here
?>
The page that has the link is here:
http://www.catalog.portsmouth.lib.oh.us/limitedsearch.asp
We are looking to grab the books and cover images from 'Hot New Items' on that page and work on the rest after we get it working.
If you click the Hot New Items link, the initial url is:
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true
but once the page loads, changes to:
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-178&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true
Is there anything we can do to get around the expiring links? I can provide more code and explanation if needed.
Thanks very much to anyone who can offer help,
Terry
Is there anything we can do to get around the expiring links?
You're interfacing with a system that wasn't designed to be (ab)used the way you're using it. Like many search systems, it looks like they build the results and store them somewhere. Also like many search systems, those results become invalid after a period of time.
You're going to have to design your code under the assumption that the search results are going to poof into the ether very quickly.
It looks like there's a parameter in the URL that dictates how many results there are per page. Try changing it to a higher number -- a much higher number. They don't seem to have placed a bounds check on it at the code level. I was able to enter 1000 without it complaining, though it only returned 341 links.
Keep in mind that this is very likely going to cause some pretty noticeable load on their machine, and you should be careful and gentle when making your requests. You don't want to draw attention to yourself by making it look like you're attacking their service.
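For example, the results URL from the question with the per-page parameter raised (assuming the server keeps accepting the larger value):
http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-178&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=1000&SearchData=&autohide=true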
The page returned from the original link generates the results and then sends you a page with a JavaScript that inserts the values into a URL and redirects you to it; that URL fetches the stored results page. The results page is identified by the server with a LimitsId (you can see it in the URL of the results page). They must use this number to control how long a page lasts, and each request generates a new LimitsId, because not every ID works for the results page. The point of all this is: you can use cURL to get the first page (the link off of the original page, which will generate the results and store them on the server), search the response for the text 'LimitsId=-' (for some reason they all have a dash in front of them, though I'm not sure they're meant to be negative, since the numbers go up), and paste that value into the same place in the URL you're using in your script, which will get you the newly generated results.
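A rough sketch of that approach in PHP (the regex and the URL pieces are assumptions based on the URLs shown in the question):
// 1) Request the generator link so the server builds a fresh result set
$ch = curl_init("http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$page = curl_exec($ch);
curl_close($ch);

// 2) Pull the freshly generated LimitsId out of the response
if (preg_match('/LimitsId=(-\d+)/', $page, $m)) {
    $limitsId = $m[1];
    // 3) Build the results URL with the new LimitsId and scrape it as before
    $linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=" . $limitsId . "&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
}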
However, as pointed out by Charles, these requests will put a significant load on the server so maybe you can just generate a new request when the old one expires.
My current code (see below) uses 147MB of virtual memory!
My provider has allocated 100MB by default and the process is killed once run, causing an internal error.
The code is utilising curl multi and must be able to loop with more than 150 iterations whilst still minimizing the virtual memory. The code below is only set at 150 iterations and still causes the internal server error. At 90 iterations the issue does not occur.
How can I adjust my code to lower the resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
if ($utimestamp === null)
$utimestamp = microtime(true);
$timestamp = floor($utimestamp);
$milliseconds = round(($utimestamp - $timestamp) * 1000);
return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}
$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();
for($i=0; $i<150; $i++)
{
$curl_arr[$i] = curl_init();
curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i=0; $i<150; $i++)
{
$results = curl_multi_getcontent ($curl_arr[$i]);
$results = explode("<br>", $results);
echo $results[0];
echo "<br>";
echo $results[1];
echo "<br>";
echo udate('H:i:s:u');
echo "<br><br>";
usleep(100000);
}
?>
As per your last comment..
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;
require("RollingCurl.php");
function request_callback($response, $info, $request) {
list($result0, $result1) = explode("<br>", $response);
echo "{$result0}<br>{$result1}<br>";
//print_r($info);
//print_r($request);
echo "<hr>";
}
$urls = array_fill(0, $fetch_count, $url);
$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url);
$rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
If the intention is domain snatching, then using one of the established services is a better option. Your script implementation is hardly as important as the actual connection and latency.
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
@mario: Cheers. I'm competing against 2 other companies for specific ccTLDs. They are new to the game and they are snapping up those domains in slow time (up to 10 seconds after purge time). I'm just a little slower at the moment.
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is being stored in PHP memory, and by your evidence this is insufficient. The only conclusion is that you cannot keep all 150 responses in memory at once. You must either stream to files instead of memory buffers, or simply reduce the number of concurrent queries and process the list of URLs in batches.
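A rough sketch of the batching option (this assumes, like the question, the same $url fetched 150 times; only one batch of handles lives in memory at a time):
$all_urls = array_fill(0, 150, $url);
foreach (array_chunk($all_urls, 30) as $batch) { // 30 requests per batch
    $master  = curl_multi_init();
    $handles = array();
    foreach ($batch as $u) {
        $ch = curl_init($u);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($master, $ch);
        $handles[] = $ch;
    }
    do {
        curl_multi_exec($master, $running);
        curl_multi_select($master);
    } while ($running > 0);
    foreach ($handles as $ch) {
        $results = explode("<br>", curl_multi_getcontent($ch));
        echo $results[0] . "<br>" . $results[1] . "<br>";
        curl_multi_remove_handle($master, $ch);
        curl_close($ch);
    }
    curl_multi_close($master); // release the batch's memory before starting the next one
}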
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION, there is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
global $fh;
$bytes = fwrite ($fh, $data, strlen($data));
return $bytes;
}
curl_setopt ($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as a problem for the reader to solve.
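One way to solve it, as a sketch: keep a map from each handle's integer id to an open file, and have the callback look its file up there (the file names below are made up):
$files = array(); // (int) curl handle id => open file pointer

function on_curl_write($ch, $data)
{
    global $files;
    // Write this handle's data to its own file and report the bytes written
    return fwrite($files[(int) $ch], $data);
}

for ($i = 0; $i < 150; $i++) {
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
    $files[(int) $curl_arr[$i]] = fopen("response_" . $i . ".txt", "w");
    curl_multi_add_handle($master, $curl_arr[$i]);
}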
<?php
echo str_repeat(' ', 1024); //to make flush work
$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; //0.1 second
//$delay = 1000000; //1 second
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
for ($i=0; $i<$fetch_count; $i++) {
$start = microtime(true);
$result = curl_exec($ch);
list($result0, $result1) = explode("<br>", $result);
echo "{$result0}<br>{$result1}<br>";
flush();
$end = microtime(true);
// $delay is in microseconds, but microtime() differences are in seconds, so convert before subtracting
$sleeping = max(0, $delay - (int) (($end - $start) * 1000000));
echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
usleep($sleeping);
}
curl_close($ch);
?>