PhantomJS call from PHP - performance

I'm currently executing PhantomJS (from PHP) to render some HTML reliably (utilizing 3rd party js libraries that can't be easily replicated in PHP) and then sending the rendered HTML back to the client.
$fh = fopen('/dev/shm/graph-'.$sig.'.html', 'w');
fwrite($fh, $html);
fclose($fh);
$stime = microtime(true);
$res = exec('/usr/bin/phantomjs /home/me/www/js/render_svg.js '.
            escapeshellarg($sig), $output, $return_var);
var_dump(microtime(true)-$stime); // 400 ms
print implode("\n", $output);
exit();
render_svg.js:
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
page.onLoadFinished = function() {
    system.stdout.write(page.content);
    phantom.exit(0);
};
content = '';
f = fs.open('/dev/shm/graph-' + system.args[1] + '.html', 'r');
content += f.read();
page.content = content;
The execution time for PhantomJS is around 400ms which is super, but probably too much of a delay to use in production. Is there any way of getting this down by e.g. not using exec to fire up phantomjs each time, but having it already running in the background?

You could try the webserver module:
http://phantomjs.org/api/webserver/
There is a tutorial on it here:
http://benjaminbenben.com/2013/07/28/phantomjs-webserver/
(If you try this, I'd love to hear how you get on and how latency compares to your current 400ms using exec.)
BTW, I think there was a recent change in the Mongoose license, making it incompatible with the PhantomJS license, so it is possible this feature will disappear in future releases. (There was also talk of switching to an alternative embedded web server library in place of Mongoose, in which case it may not disappear!)

Answer thanks to Darren Cook:
$fh = fopen('/dev/shm/graph-'.$sig.'.full.html', 'w');
fwrite($fh, $html);
fclose($fh);
$stime = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://127.0.0.1:8080');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'sig='.$sig);
$output = curl_exec($ch);
curl_close($ch);
var_dump(microtime(true)-$stime); // 150ms
print $output;
exit();
render_svg.js:
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
var server = require('webserver').create();
var service = server.listen('127.0.0.1:8080', function(request, response) {
    var stime = new Date();
    content = '';
    f = fs.open('/dev/shm/graph-' + request.post['sig'] + '.full.html', 'r');
    content += f.read();
    page.content = content;
    page.onLoadFinished = function() {
        response.statusCode = 200;
        response.write(page.content);
        response.close();
    };
});
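One practical wrinkle with this setup is that the PHP side assumes the PhantomJS web server is already listening on 127.0.0.1:8080. A minimal sketch of a guard around the cURL call that falls back to the original exec() approach when the server is unreachable (same $sig file naming and render_svg.js path as above; the helper name render_graph() is made up):
<?php
// Hypothetical helper: try the long-running PhantomJS web server first,
// and fall back to spawning phantomjs directly if it is not reachable.
// Assumes the /dev/shm/graph-<sig> HTML file has already been written as above.
function render_graph($sig) {
    $ch = curl_init('http://127.0.0.1:8080');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'sig='.$sig);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // fail fast if the server is down
    $output = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($output !== false && $code === 200) {
        return $output; // rendered HTML from the PhantomJS web server
    }
    // Fallback: one-shot phantomjs process, as in the original question.
    exec('/usr/bin/phantomjs /home/me/www/js/render_svg.js '.
         escapeshellarg($sig), $lines, $return_var);
    return implode("\n", $lines);
}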

In short, no. PhantomJS can't be run as a daemon or server, so you'll need to execute this script every time. If you want to improve performance, you should try finding another method of rendering the HTML.

Related

PHP gearman - calling a gearman worker within a gearman worker

I am very new to Gearman. I am trying to write a PHP script to download images from a URL and upload them to the user's Google Drive - sort of a backup script.
What I am trying to do is to initiate another Gearman job from within the worker process: first download the image from the source to a temp dir, then upload it to Google Drive. Here is the script:
<?php
require_once "../classes/drive.class.php";
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addFunction('init', 'downloader');
$worker->addFunction('upload', 'uploader');
function downloader($job) {
    // downloads the images from facebook
    $data = unserialize($job->workload()); // receives serialized data
    $url = $data->url;
    $file = rand().'.jpg';
    $saveto = __DIR__.'/tmp/'.$file;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $raw = curl_exec($ch);
    curl_close($ch);
    if (file_exists($saveto)) {
        unlink($saveto);
    }
    $fp = fopen($saveto, 'x');
    fwrite($fp, $raw);
    fclose($fp);
    // create a gearman client to upload the image to google drive
    $client = new GearmanClient();
    $client->addServer();
    $data['file'] = $saveto;
    return $client->doNormal('upload', serialize($data)); // ensure synchronous dispatch
    // can implement a post request return call, to denote a loading point on a loading bar.
}
function uploader($job) {
    $data = unserialize($job->workload());
    $file = $data->file;
    $google = $data->google;
    $drive = new Drive($google);
    return $drive->init($file); // returns boolean
}
?>
The problem is that when I start the worker using php worker.php & the process starts, but it kills itself the moment I start doing something else in the console, with "DONE" printed to my console.
How do I keep this script running so it can carry out my jobs?
This is a vague explanation, but please try to look into it and help. I am really new to Gearman.
Thanks
You're missing the work loop.
// Create the worker and configure its capabilities
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addFunction('init', 'downloader');
$worker->addFunction('upload', 'uploader');
// Start the work loop
while ($worker->work());
// Your function definitions
function downloader($job) {
    // do stuff with $job
}
function uploader($job) {
    // do stuff with $job
}
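If you also want the loop to survive transient errors rather than exit on the first failure, a slightly more defensive sketch (a common pattern with the pecl Gearman extension, not something from the answer above) is:
<?php
// Hypothetical defensive work loop; downloader() and uploader() as defined above.
// Run the script under nohup, supervisord or similar so it is restarted if it exits.
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addFunction('init', 'downloader');
$worker->addFunction('upload', 'uploader');
while (true) {
    $worker->work();
    if ($worker->returnCode() === GEARMAN_SUCCESS) {
        continue; // a job was processed, wait for the next one
    }
    // Any other return code: log it and back off briefly instead of spinning.
    error_log('Gearman worker returned code ' . $worker->returnCode());
    sleep(1);
}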

refresh a page and should post data each time it refreshes?

I want this code to refresh itself and post the data to sendsms.php, but it should not redirect and should remain on this page. Thus, each time it refreshes, it downloads new data and posts it to sendsms.php. Please suggest an approach.
<?php
$page = $_SERVER['PHP_SELF'];
$sec = "15";
header("Refresh: $sec; url=$page");
$url=$_REQUEST['url'];
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$html= curl_exec($ch);
curl_close($ch);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$first = current(explode("|", $title));
$to="888888888";
header('Location: sendsms.php?to=' .$to .'&sms=' .$first);
?>
I would recommend running it through crontab.
You have two options, the first being the better:
Run the file directly on the server as a shell script.
Use curl or wget to access the file on a web server.
The first scenario would look something like this:
* * * * * /usr/bin/php /my/home/path/script.php
The second scenario would be similar:
* * * * * curl http://example.com/path/to/file/script.php
This is not exactly what you asked for, of course. If you wanted this page to simply sit in a web browser and reload every n seconds, you could use JavaScript and some jQuery AJAX:
<script type="text/javascript">
var timer = setInterval(function () {
    $.post('/sendsms.php', { data: yourdata }, function (data) {
        console.log(data);
    });
}, 15000);
</script>
Hopefully this helps!
Any standard refresh will not re-post the data. Depending on the browser, it will either make a normal GET request or ask you to confirm re-submitting the POST data.
You can do what you need by creating a form in HTML and adding all the variables you need in the POST data to that form as <input type="hidden"> fields. Then you can submit that form using JavaScript after the required delay (15 seconds in your case), as sketched below.
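A minimal sketch of that idea - the field names and default values here are made up for illustration:
<?php
// Hypothetical self-posting page: renders a hidden form and re-submits it
// via JavaScript every 15 seconds, so the POST data is sent on each cycle.
$to  = isset($_POST['to'])  ? $_POST['to']  : '888888888';
$sms = isset($_POST['sms']) ? $_POST['sms'] : '';
?>
<form id="resend" method="post" action="">
    <input type="hidden" name="to"  value="<?php echo htmlspecialchars($to); ?>">
    <input type="hidden" name="sms" value="<?php echo htmlspecialchars($sms); ?>">
</form>
<script type="text/javascript">
    // Re-submit the hidden form after 15 seconds; the page reloads with the POST data.
    setTimeout(function () {
        document.getElementById('resend').submit();
    }, 15000);
</script>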

php xPath code optimization

I'm writing a page scraper for a site that is a little slow but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to execute and parse the ~150 pages I scrape so far. It will be a crontab'd event, and a temporary table is used while it's being generated, then copied to a "live" table upon completion, so it's a seamless transition from a client standpoint. However, can you see a way to speed up my code?
//mysql connection stuff here
function dnl2array($domnodelist) {
    $return = array();
    $nb = $domnodelist->length;
    for ($i = 0; $i < $nb; ++$i) {
        $return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
        $return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
    }
    return $return;
}
function get_inner_html($node) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }
    return $innerHTML;
}
// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);
// Grab the data.
$html = curl_exec($c);
// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
    echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
         "<p>cURL error: " . curl_error($c) . "</p>";
}
// $html = file_get_contents($url);
$doc = new DOMDocument;
// Load the html into our object
$doc->loadHTML($html);
$xPath = new DOMXPath( $doc );
// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[@id="food-plan-contents"]//td[@class="product-name"]');
$test['itams'] = dnl2array($results);
foreach ($test['itams']['html'] as $get_url) {
    $prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
foreach ($prepared_url as $url) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 20);
    // Grab the data.
    $html = curl_exec($c);
    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
             "<p>cURL error: " . curl_error($c) . "</p>";
    }
    // $html = file_get_contents($url);
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    $xPath = new DOMXPath($doc);
    $results = $xPath->query('//h3[@class="product-name"]');
    $arr[$i]['name'] = dnl2array($results);
    $results = $xPath->query('//div[@class="product-specs"]');
    $arr[$i]['desc'] = dnl2array($results);
    $results = $xPath->query('//p[@class="product-image-zoom"]');
    $arr[$i]['img'] = dnl2array($results);
    $results = $xPath->query('//div[@class="groupedTable"]/table/tbody/tr//span[@class="price"]');
    $arr[$i]['price'] = dnl2array($results);
    $arr[$i]['url'] = $url;
    if ($i % 5 == 1) {
        lazy_loader($arr); // lazy loader adds data to sql database
        unset($arr);       // keep memory footprint light (server is wimpy -- but free!)
    }
    $i++;
    usleep(50000); // Don't be a bandwidth pig
}
// Get any stragglers
if (count($arr) > 0) {
    lazy_loader($arr);
    $time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
    $tab_name = "sr_data_items_" . date("m_d_y", $time);
    // and copy table now that script is finished
    mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
    mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
    mysql_query("TRUNCATE TABLE `sr_data_items_skel`");
}
It sounds like you're mostly dealing with slow server response speeds. At even 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way you could speed this up is by using curl_multi_* to run multiple connections at the same time.
So replace the start of the foreach loop (up through the if !html check) with this:
reset($prepared_url); // set internal pointer to first element
$running = array();   // map from curl handle (cast to int) to url
$finished = false;
$mh = curl_multi_init();
$i = 0;
while (!$finished || !empty($running)) {
    // add urls to $mh up to a maximum
    while (count($running) < 15 && !$finished) {
        $url = next($prepared_url);
        if ($url === FALSE) {
            $finished = true;
            break;
        }
        $c = setupcurl($url);
        curl_multi_add_handle($mh, $c);
        $running[(int)$c] = $url; // cast: a curl handle cannot be used as an array key directly
    }
    curl_multi_exec($mh, $active);
    $info = curl_multi_info_read($mh);
    if (false === $info) continue; // nothing to report right now
    $c = $info['handle'];
    $url = $running[(int)$c];
    unset($running[(int)$c]);
    $result = $info['result'];
    if ($result != CURLE_OK) {
        echo "Curl Error: " . $result . "\n";
        continue;
    }
    $html = curl_multi_getcontent($c);
    $download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);
    curl_multi_remove_handle($mh, $c);
    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>\n" .
             "<p>cURL error: " . curl_error($c) . "</p>\n";
    }
    curl_close($c);
    <<rest of foreach loop here>>
That will keep 15 downloads going at the same time, and process them as they finish.
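The loop above calls a setupcurl() helper that isn't shown in the answer; presumably it just wraps the per-request options from the question, along the lines of this sketch:
<?php
// Assumed setupcurl() helper for the curl_multi loop above; it mirrors the
// curl_setopt block used in the question for single requests.
function setupcurl($url) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 20);
    return $c;
}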
Anyway – so for the history: please see my comments up top.
As for caching: I'm using dnsmasq to cache.
My setup is using a recipe for chef, which I run through chef-solo. The templates contains my configuration and the attributes contain my settings. It's pretty straight forward.
So the beauty is that this allows me to put this server into DHCP (we use Amazon EC2 and this service distributes all IPs via DHCP to the virtual instances) and then I don't have to make any changes to my application to use them.
I have another recipe to edit /etc/dhclient.conf.
Does this help? Let me know where to elaborate more.
EDIT
Just for clarification: this is not a Ruby solution, I'm just using chef for configuration management (this part makes sure that services are always set up the same, etc.). Dnsmasq itself acts as a local DNS server and caches the responses, which speeds things up.
The manual way is as follows:
On Ubuntu:
apt-get install dnsmasq
Then edit the /etc/dnsmasq.conf:
listen-address=127.0.0.1
cache-size=5000
domain-needed
bogus-priv
log-queries
Restart the service and verify it's running (ps aux | grep dnsmasq).
Then put it into your /etc/resolv.conf:
nameserver 127.0.0.1
Test:
dig @127.0.0.1 stackoverflow.com
Execute it twice and check the time it took to resolve; the second run should be faster.
Enjoy! ;)
The first thing to do is to measure how much time is spent downloading the file from the server. Use microtime(true) to get a timestamp both before and after the call to
file_get_contents($url);
and subtract the values. Only after you find out that the real bottleneck is inside your code, and not on the side of the network or the remote server, should you start thinking about optimizations.
When you say that 150 pages take 5 minutes to load and parse, that's 2 seconds per page, and my wild guess is that most of that time is spent downloading the page from the server.
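A quick sketch of that measurement, using the variables from the question:
<?php
// Time the download and the parse separately to find the real bottleneck.
$stime = microtime(true);
$html = file_get_contents($url);           // $url = one of the ~150 scraped pages
$download = microtime(true) - $stime;
$stime = microtime(true);
$doc = new DOMDocument();
@$doc->loadHTML($html);
// ... run the XPath queries and dnl2array() calls here ...
$parse = microtime(true) - $stime;
printf("download: %.3fs, parse: %.3fs\n", $download, $parse);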
You should consider using cURL instead of both file_get_contents() and DOMDocument::loadHTMLFile, because it's much faster.
See this question:
https://stackoverflow.com/questions/555523/file-get-contents-vs-curl-what-has-better-performance
You need to benchmark. DNS is not an issue: if you're scraping 150 pages, DNS will certainly be cached on your resolver for the 4 minutes you need to parse the remaining 149 pages.
Try timing all the page transfers with wget/curl; you may be surprised to find it's not as fast as you might think.
Try requesting in parallel: hitting them with 4 parallel requests will get your time down to about 1 minute.
If you actually find that it's an XPath problem, use preg_split() or even an awk script with popen() to get your values.
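For example, if profiling did point at the XPath step, a regex-based fallback for the price query could look something like the sketch below (using preg_match_all rather than preg_split, and assuming the simple markup implied by the question's query; not a drop-in replacement):
<?php
// Hypothetical regex fallback: pull the text of <span class="price"> elements
// straight out of the raw HTML instead of building a DOM and running XPath.
if (preg_match_all('~<span class="price"[^>]*>(.*?)</span>~si', $html, $m)) {
    $prices = array_map('trim', array_map('strip_tags', $m[1]));
} else {
    $prices = array();
}
print_r($prices);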

Equivalent of PHP cURL code in ASP.NET

What is the equivalent of this code in ASP.NET?
<?php
$ch = curl_init("http://irnafiarco.com/queue");
$request["queue"] = file_get_contents("/path_to_my_xml_file/my_xml_file.xml");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $request);
$response = curl_exec($ch);
curl_close ($ch);
echo $response;
?>
At http://irnafiarco.com/queue there is a listener that receives the request's XML file and saves it.
Using WebRequest, this will be the basic code
var req = WebRequest.Create(@"http://irnafiarco.com/queue");
// prepare the request
req.ContentType = "application/x-www-form-urlencoded";
req.Method = "POST";
// push the file contents into the request body
var data = "queue=" + System.IO.File.OpenText(filePath).ReadToEnd();
var bytes = System.Text.Encoding.ASCII.GetBytes(data);
req.ContentLength = bytes.Length;
var rs = req.GetRequestStream();
rs.Write(bytes, 0, bytes.Length);
rs.Close();
// get the response
var resp = req.GetResponse();
var sr = new System.IO.StreamReader(resp.GetResponseStream());
var result = sr.ReadToEnd();
Disclaimer: untested code
EDIT:
Added the post parameter name ("queue"), which I had missed in the first draft, and set the content length for the request. This code should get you started. The basic idea is that you need to simulate the exact POST request generated by the PHP code. Use a tool such as Fiddler, or Firebug in Firefox, to inspect and compare the request/response from the PHP and .NET code.
Further, I suspect that the PHP code may be generating the request with content type multipart/form-data. However, I believe the server should also be able to accept a POST body with application/x-www-form-urlencoded (because we have only one parameter in the body), but if that doesn't work and you must generate the POST body as multipart/form-data, it will be a little more involved. See this SO question, where the accepted answer gives sample code for exactly that: Upload files with HTTPWebrequest (multipart/form-data)
Take a look at the WebRequest and WebProxy classes, which are in line with what you're after...
WebRequest request = WebRequest.Create(url);
request.Proxy = new WebProxy("http://blahblahblah", true);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// handle response here
Also see here and here, though they may not be relevant for your implementation.
Examples of using these to fetch XML abound, i.e.:
System.Net.HttpWebRequest webRequest = (HttpWebRequest)System.Net.WebRequest.Create("yourURL.xml");
webRequest.Credentials = System.Net.CredentialCache.DefaultCredentials;
webRequest.Accept = "text/xml";
System.Net.HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
System.IO.Stream responseStream = webResponse.GetResponseStream();
System.Xml.XmlTextReader reader = new XmlTextReader(responseStream);
//Do something meaningful with the reader here
reader.Close();
webResponse.Close();

Executing Javascript in an external file via php script

I have been trying to create a PHP file that serves a mobile shortcode message. Since any output on the page automatically gets sent in the mobile SMS, I cannot use any HTML on the page. But I am having a problem executing some Google Analytics JavaScript code before the text is output on the page. I created an external file, wrote the JavaScript there, and tried to execute that file via cURL, but cURL does not execute JavaScript. My code for the main file, SMSapi.php, is something like this:
<?php
if (isset($_GET['mobile'])) {
    $number = $_GET['mobile'];
}
if (isset($_GET['text'])) {
    $data = $_GET['text'];
}
$brand = "mybrand";
$event = "MyEvent";
$deal = "MyDeal";
$url = "http://myurl.com/sms.php";
$post_string = "brand={$brand}&event={$event}&deal={$deal}";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$success = curl_exec($ch);
curl_close($ch);
echo "Your discount code is XYZ123";
?>
The sms.php code is as follows:
<?php
if (isset($_POST) && !empty($_POST['event']) && !empty($_POST['brand']) && !empty($_POST['deal'])):
    $event = urldecode($_POST['event']);
    $brand = urldecode($_POST['brand']);
    $deal = urldecode($_POST['deal']);
?>
<html>
<head>
<script type="text/javascript">
    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-MyNum']);
    _gaq.push(['_trackPageview']);
    (function() {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
    })();
    _gaq.push(['_trackEvent', '<?php echo $event; ?>', '<?php echo $brand; ?>', '<?php echo $deal; ?>']);
</script>
</head>
</html>
<?php endif ?>
The main SMS API file makes a cURL request to sms.php, and while that file does get executed, the HTML and JavaScript are returned as plain text without ever being run, so the JavaScript shows up in the SMS.
Is there a way to fetch an external URL and execute all the JavaScript in it, right there on the server, via PHP?
This link might be useful for you: http://curl.haxx.se/docs/faq.html#Does_curl_support_Javascript_or
I am presuming this is a script that runs on a call from an SMS provider, and you are trying to log this event in Google Analytics.
Firstly, I would have thought this would be more easily accomplished by logging to a database rather than to Analytics. As this all happens server-side, you won't get any extra information beyond the event itself.
If you really do need to run the Google code, then try searching for "server side javascript google analytics". The solution is going to be highly dependent on what server platform you are running and what is/can be installed.
One interesting link though which may work for your PHP is:
http://code.google.com/p/serversidegoogleanalytics/
but I haven't used it so don't know how well this works.
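Picking up the database suggestion from the answer above, a minimal sketch of logging the event from SMSapi.php instead of calling sms.php at all (the DSN, table, and column names are made up for illustration):
<?php
// Hypothetical replacement for the cURL call in SMSapi.php: record the event
// in a local table (sms_events) instead of trying to run Analytics JavaScript.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'dbuser', 'dbpass');
$stmt = $pdo->prepare(
    'INSERT INTO sms_events (mobile, event, brand, deal, created_at)
     VALUES (?, ?, ?, ?, NOW())'
);
$stmt->execute(array($number, $event, $brand, $deal));
// The SMS body is still just plain text.
echo "Your discount code is XYZ123";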
