Optimizing PHP Code and Storing JSON response to parse it faster? - php

so I'm trying to figure out why does this PHP code takes too long to run to output the results.
for example this is my apitest.php and here is my PHP Code
<?php
function getRankedMatchHistory($summonerId,$serverName,$apiKey){
$k
$d;
$a;
$timeElapsed;
$gameType;
$championName;
$result;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
curl_close($ch);
$matchHistory = json_decode($response,true); // Is the Whole JSON Response saved at $matchHistory Now locally as a variable or is it requested everytime $matchHistory is invoked ?
for ($i = 9; $i >= 0; $i--){
$farm1 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["minionsKilled"];
$farm2 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralMinionsKilled"];
$farm3 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledTeamJungle"];
$farm4 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledEnemyJungle"];
$elapsedTime = $matchHistory["matches"][$i]["matchDuration"];
settype($elapsedTime, "integer");
$elapsedTime = floor($elapsedTime / 60);
$k = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["kills"];
$d = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["deaths"];
$a = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["assists"];
$championIdTmp = $matchHistory["matches"][$i]["participants"]["0"]["championId"];
$championName = call_user_func('getChampionName', $championIdTmp); // calls another function to resolve championId into championName
$gameType = preg_replace('/[^A-Za-z0-9\-]/', ' ', $matchHistory["matches"][$i]["queueType"]);
$result = (($matchHistory["matches"][$i]["participants"]["0"]["stats"]["winner"]) == "true") ? "Victory" : "Defeat";
echo "<tr>"."<td>".$gameType."</td>"."<td>".$result."</td>"."<td>".$championName."</td>"."<td>".$k."/".$d."/".$a."</td>"."<td>".($farm1+$farm2+$farm3+$farm4)." in ". $elapsedTime. " minutes". "</td>"."</tr>";
}
}
?>
What I'd like to know is how to make the page output faster as it takes around
10~15 seconds to output the results which makes the browser thinks the website is dead like a 500 Internal error or something like it .
Here is a simple demonstration of how long it can take : Here
As you might have noticed , yes I'm using Riot API which is sending the response as a JSON encoded type.
Here is an example of the response that this function handles : Here
What I thought of was creating a temporarily file called temp.php at the start of the CURL function and saving the whole response there and then reading the variables from there so i can speed up the process and after reading the variables it deletes the temp.php that was created thus freeing up disk space. and increasing the speed.
But I have no idea how to do that in PHP Only.
By the way I'd like to tell you that i just started using PHP today so I'd prefer some explanation with the answers if possible .
Thanks for your precious time.

Try benchmarking like this:
// start the timer
$start_curl = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// debugging
curl_setopt($ch, CURLOPT_VERBOSE, true);
// start another timer
$start = microtime(true);
$response = curl_exec($ch);
echo 'curl_exec() in: '.(microtime(true) - $start).' seconds<br><br>';
// start another timer
$start = microtime(true);
curl_close($ch);
echo 'curl_close() in: '.(microtime(true) - $start).' seconds<br><br>';
// how long did the entire CURL take?
echo 'CURLed in: '.(microtime(true) - $start_curl).' seconds<br><br>';

Related

PHP opens and closes file, but does not write anything to it

I have spent a couple of hours reading up on this but as so yet I find no clear solutions....I am using WAMP to run as my Local server. I have a successful API call set up to return data.
I would like to store that data locally, thus reducing the number of API call being made.
For simplicity I have created a cache.json file in the same folder as my PHP scripts and when I run the process I can see the file has been accessed as the time stamp updates.
But the file remains empty.
Based on research I suspect the issue may come down to a permission issue; I have gone through the folders and files and unchecked read only etc.
Appreciate if someone could validate that my code is correct and if it is hopefully point me in the the direction of a solution.
many thanks
<?php
$url = 'https://restcountries.eu/rest/v2/name/'. $_REQUEST['country'];
$cache = __DIR__."/cache.json"; // make this file in same dir
$force_refresh = true; // dev
$refresh = 60; // once an min (set short time frame for testing)
// cache json results so to not over-query (api restrictions)
if ($force_refresh || ((time() - filectime($cache)) > ($refresh) || 0 == filesize($cache))) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
$decode = json_decode($result,true);
$handle = fopen($cache, 'w');// or die('no fopen');
$json_cache = $decode;
fwrite($handle, $json_cache);
fclose($handle);
}
} else {
$json_cache = file_get_contents($cache); //locally
}
echo json_encode($json_cache, JSON_UNESCAPED_UNICODE);
?>
I managed to solve this by using file_put_contents(), not being an expert I do not understand why this works and the code above doesn't, but maybe this helps someone else.
adjusted code:
<?php
$url = 'https://restcountries.eu/rest/v2/name/'. $_REQUEST['country'];
$cache = __DIR__."/cache.json"; // make this file in same dir
$force_refresh = false; // dev
$refresh = 60; // once an min (short time frame for testing)
// cache json results so to not over-query (api restrictions)
if ($force_refresh || ((time() - filectime($cache)) > ($refresh) || 0 == filesize($cache))) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
$decode = json_decode($result,true);
$handle = fopen($cache, 'w');// or die('no fopen');
$json_cache = $result;
file_put_contents($cache, $json_cache);
} else {
$json_cache = file_get_contents($cache); //locally
$decode = json_decode($json_cache,true);
}
echo json_encode($decode, JSON_UNESCAPED_UNICODE);
?>

php - xml data parsing

I am working on getting some xml data into a php variable so I can easily call it in my html webpage. I am using this code below:
$ch = curl_init();
$timeout = 5;
$url = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/define?key=0b03f103-f6a7-4bb1-9136-11ab4e7b5294";
$definition = simplexml_load_file($url);
echo $definition->entry[0]->def;
However my results are: .
I am not sure what I am doing wrong and I have followed the php manual, so I am guessing it is something obvious but I am just not understanding it correctly.
The actual xml results from that link used in cURL are visible by clicking the link below , I did not post it because it is rather long:
http://www.dictionaryapi.com/api/v1/references/sd3/xml/test?key=9d92e6bd-a94b-45c5-9128-bc0f0908103d
<?php
$ch = curl_init();
$timeout = 5;
$url = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/define?key=0b03f103-f6a7-4bb1-9136-11ab4e7b5294";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch); // you were missing a semicolon
$definition = new SimpleXMLElement($data);
echo '<pre>';
print_r($definition->entry[0]->def);
echo '</pre>';
// this returns the SimpleXML Object
// to get parts, you can do something like this...
foreach($definition->entry[0]->def[0] as $entry) {
echo $entry[0] . "<br />";
}
// which returns
transitive verb
14th century
1 a
:to determine or identify the essential qualities or meaning of
b
:to discover and set forth the meaning of (as a word)
c
:to create on a computer
2 a
:to fix or mark the limits of :
b
:to make distinct, clear, or detailed especially in outline
3
:
intransitive verb
:to make a
Working Demo

scraping "next pages" using PHP, cURL, simplehtmldom

I finally got my scraper working (somewhat), but now I'd like to know how I can automatically go to the next page and scrape the same info from there. I'm using cURL to copy the entire page (otherwise I get a 500 error). Here's my code:
<?
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://example.com/results.asp?&j=t&page_no=1");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$html = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
// print $html . "\n";
require 'simple_html_dom.php';
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[#id='schoolsearch'] tr") as $data){
$tds = $data->find("td");
if(count($tds)==3){
$record = array(
'school' => $tds[1]->plaintext,
'city' => $tds[2]->plaintext
);
print json_encode($record) . "\n";
file_put_contents('schools.csv', json_encode($record) . "\n", FILE_APPEND);
}
}
?>
It's not perfect, but it's what works right now! Anyone know how I can move to the next pages?
Wrap it in a loop:
$maxPages = 10;
for ($i = 1; $i <= $maxPages; $i++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://example.com/results.asp?&j=t&page_no=$i");
etc...
}
You'll need to tidy up a bit more,to avoiding including that file on every page, but you get the idea.

php xPath code optimization

I'm writing a page scraper for a site that is a little slow, but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to execute and parse all ~150 pages I scrape so far. It will be a crontab'd event, and a temporary table is used while it's being generated, then copied to a "live" table upon completion so it's a seamless transition from a client stand-point, however can you see a way to speed up my code, possibly?
//mysql connection stuff here
function dnl2array($domnodelist) {
$return = array();
$nb = $domnodelist->length;
for ($i = 0; $i < $nb; ++$i) {
$return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
$return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
}
return $return;
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
// Load the html into our object
$doc->loadHTML($html);
$xPath = new DOMXPath( $doc );
// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[#id="food-plan-contents"]//td[#class="product-name"]');
$test['itams'] = dnl2array($results);
foreach($test['itams']['html'] as $get_url){
$prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
foreach($prepared_url as $url){
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
"<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xPath = new DOMXPath($doc);
$results = $xPath->query('//h3[#class="product-name"]');
$arr[$i]['name'] = dnl2array($results);
$results = $xPath->query('//div[#class="product-specs"]');
$arr[$i]['desc'] = dnl2array($results);
$results = $xPath->query('//p[#class="product-image-zoom"]');
$arr[$i]['img'] = dnl2array($results);
$results = $xPath->query('//div[#class="groupedTable"]/table/tbody/tr//span[#class="price"]');
$arr[$i]['price'] = dnl2array($results);
$arr[$i]['url'] = $url;
if($i % 5 == 1){
lazy_loader($arr); //lazy loader adds data to sql database
unset($arr); // keep memory footprint light (server is wimpy -- but free!)
}
$i++;
usleep(50000); // Don't be bandwith pig
}
// Get any stragglers
if(count($arr) > 0){
lazy_loader($arr);
$time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
$tab_name = "sr_data_items_" . date("m_d_y", $time);
// and copy table now that script is finished
mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
mysql_query("TRUNCATE TABLE `sr_data_items_skel`");
}
It sounds like you're mostly dealing with slow server response speeds. At even 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way you could speed this up is by using curl_multi_* to run multiple connections at the same time.
So replace the start of the foreach loop (up through the if !html check) with this:
reset($prepared_url); // set internal pointer to first element
$running = array(); // map from curl reference to url
$finished = false;
$mh = curl_multi_init();
$i = 0;
while(!$finished || !empty($running)){
// add urls to $mh up to a maximum
while (count($running) < 15 && !$finished)
{
$url = next($prepared_url);
if ($url === FALSE)
{
$finished = true;
break;
}
$c = setupcurl($url);
curl_multi_add_handle($mh, $c);
$running[$c] = $url;
}
curl_multi_exec($mh, $active);
$info = curl_multi_info_read($mh);
if (false === $info) continue; // nothing to report right now
$c = $info['handle'];
$url = $running[$c];
unset($running[$c]);
$result = $info['result'];
if ($result != CURLE_OK)
{
echo "Curl Error: " . $result . "\n";
continue;
}
$html = curl_multi_getcontent($c);
$download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);
curl_multi_remove_handle($mh, $c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>\n" .
"<p>cURL error: " . curl_error($c) . "</p>\n";
}
curl_close($c);
<<rest of foreach loop here>>
That will keep 15 downloads going at the same time, and process them as they finish.
Anyway – so for the history: please see my comments up top.
As for caching: I'm using dnsmasq to cache.
My setup is using a recipe for chef, which I run through chef-solo. The templates contains my configuration and the attributes contain my settings. It's pretty straight forward.
So the beauty is that this allows me to put this server into DHCP (we use Amazon EC2 and this service distributes all IPs via DHCP to the virtual instances) and then I don't have to make any changes to my application to use them.
I have another recipe to edit /etc/dhclient.conf.
Does this help? Let me know where to elaborate more.
EDIT
Just for clarification: This is not a Ruby solution I'm just using chef for configuration management (this part makes sure that services are always setup the same, etc..). Dnsmasq itself acts as a local DNS server and saves the requests so it speeds up.
The manual way is as follows:
On a Ubuntu:
apt-get install dnsmasq
Then edit the /etc/dnsmasq.conf:
listen-address=127.0.0.1
cache-size=5000
domain-needed
bogus-priv
log-queries
Restart service and verify it's running (ps aux|grep dnsmasq).
Then put it into your /etc/resolv.conf:
nameserver 127.0.0.1
Test:
dig #127.0.0.1 stackoverflow.com
Execute twice, check time it took to resolve. Second one should be faster.
Enjoy! ;)
The first thing to do is to measure how much time is spent downloading the file from the server. Use function microtime(true) to get a timestamp both before and after the call
file_get_contents($url);
and subtract the values. After you find out that the real bottleneck is inside your code and not on the side of network or remote server, only then you can start thinking about some optimizations.
When you say that 150 pages takes 5 minutes to load & parse, that's 2 seconds per page, and my wild guess is that most of that time is spent to download the page from the server.
You should consider using cUrl instead of both file_get_contents() and DOMDocument::loadHTMLFile, because it's much faster.
See this question:
https://stackoverflow.com/questions/555523/file-get-contents-vs-curl-what-has-better-performance
You need to benchmark. DNS is not an issue, if you're scrapping 150 pages, DNS will for sure get cached on your resolver for the 4 minutes you need to parse the rest of the 149 pages.
Try timing page all transfers with wget/curl, you may get surprised that it's not so fast as you may think.
Try requesting in parallel, hitting them with 4 parallel requests will get your time down to 1 minute.
If you actually find that it's xpath problem use preg_split() or even an awk script with popen() to get your values.

How to reduce virtual memory by optimising my PHP code?

My current code (see below) uses 147MB of virtual memory!
My provider has allocated 100MB by default and the process is killed once run, causing an internal error.
The code is utilising curl multi and must be able to loop with more than 150 iterations whilst still minimizing the virtual memory. The code below is only set at 150 iterations and still causes the internal server error. At 90 iterations the issue does not occur.
How can I adjust my code to lower the resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
if ($utimestamp === null)
$utimestamp = microtime(true);
$timestamp = floor($utimestamp);
$milliseconds = round(($utimestamp - $timestamp) * 1000);
return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}
$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();
for($i=0; $i<150; $i++)
{
$curl_arr[$i] = curl_init();
curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i=0; $i<150; $i++)
{
$results = curl_multi_getcontent ($curl_arr[$i]);
$results = explode("<br>", $results);
echo $results[0];
echo "<br>";
echo $results[1];
echo "<br>";
echo udate('H:i:s:u');
echo "<br><br>";
usleep(100000);
}
?>
As per your last comment..
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;
require("RollingCurl.php");
function request_callback($response, $info, $request) {
list($result0, $result1) = explode("<br>", $response);
echo "{$result0}<br>{$result1}<br>";
//print_r($info);
//print_r($request);
echo "<hr>";
}
$urls = array_fill(0, $fetch_count, $url);
$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url);
$rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
If the intention is domain snatching,
then using one of the established
services is a better option. Your
script implementation is hardly as
important as the actual connection and
latency.
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
#mario: Cheers. I'm competing against
2 other companies for specific
ccTLD's. They are new to the game and
they are snapping up those domains in
slow time (up to 10 seconds after
purge time). I'm just a little slower
at the moment.
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is being stored in PHP memory and by your evidence this is insufficient. The only conclusion is that you cannot keep 150 queries in memory. You must have a method of streaming to files instead of memory buffers, or simply reduce the number of queries and processing the list of URLs in batches.
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION, there is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
global $fh;
$bytes = fwrite ($fh, $data, strlen($data));
return $bytes;
}
curl_setopt ($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as problem for the reader to solve.
<?php
echo str_repeat(' ', 1024); //to make flush work
$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; //0.1 second
//$delay = 1000000; //1 second
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
for ($i=0; $i<$fetch_count; $i++) {
$start = microtime(true);
$result = curl_exec($ch);
list($result0, $result1) = explode("<br>", $result);
echo "{$result0}<br>{$result1}<br>";
flush();
$end = microtime(true);
$sleeping = $delay - ($end - $start);
echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
usleep($sleeping);
}
curl_close($ch);
?>

Categories