PHP: transfer data between 2 remote servers, what is the fastest way?

I have Server A and Server B which exchange some data. On a user request, Server A pulls data from Server B using a simple file_get_contents call with some parameters, so Server B can do all the work (database queries etc.) and return the results to A, which formats them and shows them to the user. Everything is in PHP.
Now I am interested in the fastest way to do this. I ran some tests, and the average round trip for an average response from Server B is ~0.2 sec. Of that 0.2 sec, roughly 0.1 sec is Server B's operational time (pulling data, calling a few databases etc.), which means the average transfer time for a ~50 KB response is about 0.1 sec. (The servers are NOT on the same network.)
Should I try:
cURL instead of file_get_contents?
Or doing the whole thing with sockets? (I have never worked with sockets in PHP, but I suppose it can easily be done, and that way the web server could be skipped.)
Or some third option?
I think time could be 'found' by shortening connection establishment, since right now every request initiates a new connection (I mean the separate file_get_contents calls, or am I wrong?).
Please give me your advice on which directions to try, or if you have some better solution, I am listening.
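For illustration, a minimal sketch of what I mean by reusing the connection (the URLs are placeholders): keep one cURL handle for the whole PHP request, so cURL can hold the TCP connection to Server B open between calls.

// Sketch: reuse a single cURL handle so repeated calls can reuse the same TCP connection.
function serverB_request($url)
{
    static $ch = null;
    if ($ch === null) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    }
    curl_setopt($ch, CURLOPT_URL, $url);
    return curl_exec($ch);
}

// The second call may skip the TCP handshake if Server B kept the connection open.
$first  = serverB_request('http://serverB.example.com/api.php?x=1');
$second = serverB_request('http://serverB.example.com/api.php?x=2');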

Curl:
function curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
Sockets:
function sockets($host) {
    $fp = fsockopen("www." . $host, 80, $errno, $errstr, 30);
    if (!$fp) {
        return false; // connection failed: $errstr ($errno)
    }
    $out  = "GET / HTTP/1.1\r\n";
    $out .= "Host: www." . $host . "\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $f = '';
    while (!feof($fp)) {
        $f .= fgets($fp, 1024);
    }
    fclose($fp);
    return $f;
}
file_get_contents
function fgc($url) {
    return file_get_contents($url);
}
Multicurl
function multiRequest($data, $nobody = false, $options = array(), $oneoptions = array())
{
    $curls  = array();
    $result = array();
    $mh = curl_multi_init();
    foreach ($data as $id => $d) {
        $curls[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curls[$id], CURLOPT_URL, $url);
        curl_setopt($curls[$id], CURLOPT_HEADER, 0);
        curl_setopt($curls[$id], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curls[$id], CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curls[$id], CURLOPT_USERAGENT, "Mozilla/5.0(Windows;U;WindowsNT5.1;ru;rv:1.9.0.4)Gecko/2008102920AdCentriaIM/1.7Firefox/3.0.4");
        //curl_setopt($curls[$id], CURLOPT_COOKIEJAR, 'cookies.txt');
        //curl_setopt($curls[$id], CURLOPT_COOKIEFILE, 'cookies.txt');
        //curl_setopt($curls[$id], CURLOPT_NOBODY, $nobody);
        if (!empty($options)) {
            curl_setopt_array($curls[$id], $options);
        }
        if (!empty($oneoptions[$id])) {
            curl_setopt_array($curls[$id], $oneoptions[$id]);
        }
        if (is_array($d) && !empty($d['post'])) {
            curl_setopt($curls[$id], CURLOPT_POST, 1);
            curl_setopt($curls[$id], CURLOPT_POSTFIELDS, $d['post']);
        }
        curl_multi_add_handle($mh, $curls[$id]);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);
    foreach ($curls as $id => $content) {
        $result[$id] = curl_multi_getcontent($content);
        curl_multi_remove_handle($mh, $content);
    }
    curl_multi_close($mh);
    return $result;
}
Tests:
$url = 'example.com';

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    curl($url);
}
$end = microtime(true);
echo "Curl: " . ($end - $start) . "\n";

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    fgc("http://$url/");
}
$end = microtime(true);
echo "file_get_contents: " . ($end - $start) . "\n";

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    sockets($url);
}
$end = microtime(true);
echo "Sockets: " . ($end - $start) . "\n";

$start = microtime(true);
$arr = array();
for ($i = 0; $i < 100; $i++) {
    $arr[] = $url;
}
multiRequest($arr);
$end = microtime(true);
echo "MultiCurl: " . ($end - $start) . "\n";
Results:
Curl: 5.39667105675
file_get_contents: 7.99799394608
Sockets: 2.99629592896
MultiCurl: 0.736907958984

what is fastest way?
Get your data on a flash drive.
Now seriously.
Come on, it's the network that's slow. You cannot make it faster.
To make Server A respond faster, DO NOT request data from Server B. That's the only way.
You can replicate your data or cache it, or just drop such a clumsy setup altogether.
But as long as you have to make a network lookup on each user's request, it WILL be slow, regardless of the method you are using. It is not the method, it is the medium. Isn't it obvious?
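For instance, a minimal sketch of the caching idea (the cache location, TTL and Server B URL are placeholders, not part of the original setup): Server A keeps a short-lived local copy of Server B's response and only goes over the network when the copy is stale.

// Hypothetical file cache in front of the Server B call.
function cached_serverB($url, $ttl = 60)
{
    $cacheFile = sys_get_temp_dir() . '/b_' . md5($url) . '.cache';

    // Serve the local copy if it is fresh enough: no network round trip at all.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    $data = file_get_contents($url);          // the slow remote call
    if ($data !== false) {
        file_put_contents($cacheFile, $data); // refresh the cache
    }
    return $data;
}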

You can try a different approach: mount the remote filesystem on the local machine. You can do that with sshfs, so you also get the additional security of an encrypted connection.
It may even be more efficient, since PHP will not have to deal with connection negotiation and establishment.
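Roughly, assuming a mount point of /mnt/serverB (the paths are made up for illustration):

// One-time setup on Server A, outside PHP (shown here only as a comment):
//   sshfs userB@serverB.example.com:/var/app/exports /mnt/serverB -o reconnect
//
// After that, Server B's files look like local files to PHP:
$data = file_get_contents('/mnt/serverB/results/latest.json');
if ($data === false) {
    // Mount may be down; fall back to the HTTP call or report an error.
}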

Related

Retrieve URLs from XML file and gather data from the URLs to my database - PHP/cURL/XML

The XML contains around 50,000 different URLs that I am trying to gather data from, to then insert into or update my database.
Currently I am using the code below, which sort of works but times out because of the large amount of data being processed. How can I improve its performance?
URLs.xml (up to 50,000 loc's)
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://url.com/122122-rob-jones?</loc>
    <lastmod>2014-05-05T07:12:41+08:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
</urlset>
index.php
<?php
include 'config.php';
include 'custom.class.php';
require_once('SimpleLargeXMLParser.class.php');

$custom = new custom();
$xml = dirname(__FILE__) . "/URLs.xml";

// create a new object
$parser = new SimpleLargeXMLParser();
// load the XML
$parser->loadXML($xml);
$parser->registerNamespace("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
$array = $parser->parseXML("//urlset:url/urlset:loc");

for ($i = 0, $n = count($array); $i < $n; $i++) {
    $FirstURL = $array[$i];
    $URL = substr($FirstURL, 0, strpos($FirstURL, '?')) . "/";
    $custom->infoc($URL);
}
custom.class.php (included bits)
<?php
// (included bits; $db is assumed to be available from config.php, e.g. a mysqli instance)

public function load($url, $postData = '')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    if ($postData != '') {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    }
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

public function infoc($url)
{
    $get_tag = $this->load($url);

    // Player ID
    $playeridTAG = '/<input type="text" id="player-(.+?)" name="playerid" value="(.+?)" \/>/';
    preg_match($playeridTAG, $get_tag, $playerID);

    // Full Name
    preg_match("/(.+?)-(.+?)\//", $url, $title);
    $fullName = ucwords(preg_replace("/-/", " ", $title[2]));

    // Total
    $totalTAG = '/<li>
<span>(.+?)<\/span><span class="none"><\/span> <label>Total<\/label>
<\/li>/';
    preg_match($totalTAG, $get_tag, $total);

    $query = $db->query('SELECT * FROM playerblank WHERE playerID = ' . $playerID[1]);
    if ($query->num_rows > 0) {
        $db->query('UPDATE playerblank SET name = "' . $fullName . '", total = "' . $total[1] . '" WHERE playerID = ' . $playerID[1]) or die(mysqli_error($db));
        echo "UPDATED " . $playerID[1];
    } else {
        $db->query('INSERT INTO playerblank SET playerID = ' . $playerID[1] . ', name = "' . $fullName . '", total = "' . $total[1] . '"') or die(mysqli_error($db));
        echo "Inserted " . $playerID[1];
    }
}
?>
Gathering each URL (loc) from the XML file is no problem; it's gathering data with cURL for each URL that I am struggling to do without having to wait a very long time.
Try using curl_multi. The PHP documentation has a good example:
// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();

// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "http://lxr.php.net/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch2, CURLOPT_HEADER, 0);

// create the multiple cURL handle
$mh = curl_multi_init();

// add the two handles
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);

$active = null;
// execute the handles
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}

// close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
Try working with an offline copy of the XML file: delete the URLs that have already been updated or inserted, then restart the script and repeat until the offline file has no URLs left. Then fetch a new copy of the XML file if needed.
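A rough sketch of that batching idea (file name and batch size are made up; infoc() is the method from the question):

// Hypothetical batch runner: keep a work list on disk, remove each URL once processed,
// and stop after N items so the script never hits the time limit. Re-run until empty.
$workFile  = 'urls_todo.txt';
$batchSize = 500;

$urls  = file($workFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$batch = array_splice($urls, 0, $batchSize);

foreach ($batch as $url) {
    $custom->infoc($url);   // the existing per-URL work from custom.class.php
}

// Write back only the URLs that are still left to do.
file_put_contents($workFile, implode("\n", $urls));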
The problem is in the "load" function: it blocks execution until a single URL is ready, while you could easily load several URLs at the same time. Here is an explanation of the idea of how to do it. The best way to improve performance is to load several (10-20) URLs in parallel and add a new one for loading "on the fly" when one of the previous ones is done. ParallelCurl will do the trick, something like:
require_once('parallelcurl.php');

// $max_requests = 10 or more; try to pick the best value manually
$parallel_curl = new ParallelCurl($max_requests, $curl_options);

// $array - 50000 urls
$in_urls = array_splice($array, 0, $max_requests);
foreach ($in_urls as $url) {
    $parallel_curl->startRequest($url, 'on_request_done');
}

function on_request_done($content, $url, $ch, $search) {
    global $array, $parallel_curl; // the callback needs access to the queue and the loader

    // here you can parse $content and save the data to the DB,
    // then add the next url for loading
    $next_url = array_shift($array);
    if ($next_url) {
        $parallel_curl->startRequest($next_url, 'on_request_done');
    }
}

// This should be called when you need to wait for the requests to finish.
$parallel_curl->finishAllRequests();

In PHP, how do I read an unreliable web page?

I'm trying to use cURL in PHP to read an unreliable web page. The page is often unavailable because of server errors. However, I still need to read it when it is available. Additionally, I don't want the unreliability of the web page to affect my code. I would like my PHP to fail gracefully and move on. Here is what I have so far:
<?php
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 2;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$handle = get_url_contents('http://www.mydomain.com/mypage.html');
?>
Use this instead; from what I've heard, cURL is not as strongly recommended anymore, since PHP's stream wrappers offer very good performance and are always available anywhere you go:
// Make stream wrappers (including file_get_contents) give up after 2 seconds.
stream_context_set_default(array('http' => array('timeout' => 2)));
$content = @file_get_contents($url);
// Optionally put the default timeout back afterwards:
stream_context_set_default(array('http' => array('timeout' => (int) ini_get('default_socket_timeout'))));
This sets the default stream context to time out after 2 seconds and gets the content of the URL via a stream wrapper, which should be there in all PHP versions from 5.2 and up for sure.
You are not obligated to reset the default context afterwards (it depends on your site's code), but it's always a good thing to do. If you don't, the whole operation takes only 2 lines of code...
You could test the HTTP response code to see whether the page was successfully retrieved. I can't remember if >=200 and <302 are exactly the right code ranges though; have a quick peek at a list of HTTP response codes if you use this method.
<?php
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 2;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret['pagesource'] = curl_exec($crl);
    $httpcode = curl_getinfo($crl, CURLINFO_HTTP_CODE);
    curl_close($crl);
    if ($httpcode >= 200 && $httpcode < 302) {
        $ret['response'] = true;
    } else {
        $ret['response'] = false;
    }
    return $ret;
}

$handle = get_url_contents('http://192.168.1.118/newTest/mainBoss.php');
if ($handle['response'] == false) {
    echo 'page is no good';
} else {
    echo 'page is ok and here it is:' . $handle['pagesource'] . 'DONE.<br>';
}
?>

PHP simple HTML DOM parser: make it loop until no error

I have an app called GrabUrTime; it's a timetable-viewing utility that gets its timetables from another site, my university's webspace. Every day at 2am I run a script that scrapes all the timetables using the parser and dumps them into my database.
But today the uni's server isn't running well and my script keeps getting error 500 from the uni's server, so the script cannot continue to run. It's intermittent, not constant; I tried a few times and it occurs randomly, with no pattern at all.
Hence I want to make my script handle the error and keep retrying until it gets the data.
function grabtable($intakecode, $week) {
    $html = file_get_html("http://webspace.apiit.edu.my/schedule/intakeview_intake.jsp?Intake1=" . $intakecode . "&Week=" . $week);
    $dumb = $html->find('table[border=1] tr');
    $thatarray = array();
    for ($i = 1; $i < sizeof($dumb); ++$i) {
        $arow = $html->find('table[border=1] tr', $i);
        $date = $arow->find('td font', 0)->innertext;
        $time = $arow->find('td font', 1)->innertext;
        $room = $arow->find('td font', 2)->innertext;
        $loca = $arow->find('td font', 3)->innertext;
        $modu = $arow->find('td font', 4)->innertext;
        $lect = $arow->find('td font', 5)->innertext;
        $anarray = array($date, $time, $room, $loca, $modu, $lect);
        $thatarray[$i] = $anarray;
    }
    $html->clear();
    return $thatarray;
}
Try something like this:
function getHttpCode($url)
{
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 300) {
        // YOUR CODE
    } else {
        // What you want to do should it fail.
        // Perhaps this will serve you better as a retry loop, e.g.
        // while (!($httpcode >= 200 && $httpcode < 300)) { ... }
    }
}
Usage:
getHttpCode($url);
It might not plug neatly into your code as it is, but I'm sure it can help with a little refactoring to suit your existing code structure.
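For example, a minimal retry wrapper along those lines (retry count and back-off are arbitrary choices; grabtable() is the function from the question):

// Hypothetical retry wrapper built on the idea above: only call grabtable()
// once the page answers with a 2xx code, and back off between attempts.
function grabtable_retry($intakecode, $week, $maxAttempts = 5)
{
    $url = "http://webspace.apiit.edu.my/schedule/intakeview_intake.jsp?Intake1="
         . $intakecode . "&Week=" . $week;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        curl_exec($ch);
        $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($httpcode >= 200 && $httpcode < 300) {
            return grabtable($intakecode, $week);  // server is answering, scrape it
        }
        sleep(10 * $attempt);  // back off a little more each time before retrying
    }
    return array();  // gave up: server kept failing
}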

Check if a remote page exists using PHP?

In PHP, how can I determine if any remote file (accessed via HTTP) exists?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // follow up to 10 redirections - avoids loops
$data = curl_exec($ch);
curl_close($ch);

if (!$data) {
    echo "Domain could not be found";
} else {
    preg_match_all("/HTTP\/1\.[1|0]\s(\d{3})/", $data, $matches);
    $code = end($matches[1]);
    if ($code == 200) {
        echo "Page Found";
    } elseif ($code == 404) {
        echo "Page Not Found";
    }
}
Modified version of code from here.
I like curl or fsockopen to solve this problem. Either one can provide header data regarding the status of the file requested. Specifically, you would be looking for a 404 (File Not Found) response. Here is an example I've used with fsockopen:
http://www.php.net/manual/en/function.fsockopen.php#39948
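In case that link is unavailable, here is a minimal sketch of the fsockopen approach (host and path are placeholders): send a HEAD request by hand and check the status line.

// Hypothetical fsockopen check: send a HEAD request and inspect the status line.
function remote_file_exists_fsock($host, $path = '/')
{
    $fp = @fsockopen($host, 80, $errno, $errstr, 5);
    if (!$fp) {
        return false;                       // could not even connect
    }
    $request  = "HEAD " . $path . " HTTP/1.1\r\n";
    $request .= "Host: " . $host . "\r\n";
    $request .= "Connection: Close\r\n\r\n";
    fwrite($fp, $request);

    $statusLine = fgets($fp, 128);          // e.g. "HTTP/1.1 404 Not Found"
    fclose($fp);

    return (bool) preg_match('#^HTTP/\d\.\d\s+2\d\d#', $statusLine);
}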
This function will return the response code (the last one in case of redirection), or false in case of a DNS or other error. If one argument (the URL) is supplied, a HEAD request is made. If a second argument is given, a full request is made and the content of the response, if any, is stored by reference in the variable passed as the second argument.
function url_response_code($url, &$contents = null)
{
    $context = null;
    if (func_num_args() == 1) {
        $context = stream_context_create(array('http' => array('method' => 'HEAD')));
    }
    $contents = @file_get_contents($url, false, $context);
    $code = false;
    if (isset($http_response_header)) {
        foreach ($http_response_header as $header) {
            if (strpos($header, 'HTTP/') === 0) {
                list(, $code) = explode(' ', $header);
            }
        }
    }
    return $code;
}
I recently was looking for the same info. Found some really nice code here: http://php.assistprogramming.com/check-website-status-using-php-and-curl-library.html
function Visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 300) {
        return true;
    } else {
        return false;
    }
}

if (Visit("http://www.site.com")) {
    echo "Website OK";
} else {
    echo "Website DOWN";
}
Use Curl, and check if the request went through successfully.
http://w-shadow.com/blog/2007/08/02/how-to-check-if-page-exists-with-curl/
Just a note that these solutions will not work on a site that does not give an appropriate response for a page not found. For example, I just had a problem testing for a page on a site that simply loads its main page whenever it gets a request it cannot handle, so the site nearly always returns a 200 response, even for non-existent pages.
Some sites will show a custom error on a standard page and still not send a 404 header.
There is not much you can do in these situations, unless you know the expected content of the page and start testing that the expected content exists, or test for some expected error text within the page, and that all gets a bit messy...
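If you do know a string that should always appear on the real page, a rough check along those lines could look like this (the marker string and URL are placeholders):

// Hypothetical soft-404 check: a 200 response only counts if the body
// contains a marker string that the real page is known to include.
function page_really_exists($url, $expectedMarker)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return $code == 200 && $body !== false && strpos($body, $expectedMarker) !== false;
}

// e.g. page_really_exists('http://www.example.com/profile/42', 'id="player-');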

Using php to ping a website

I want to create a php script that will ping a domain and list the response time along with the total size of the request.
This will be used for monitoring a network of websites. I tried it with curl, here is the code I have so far:
function curlTest2($url) {
    clearstatcache();
    $return = '';
    if (substr($url, 0, 4) != "http") {
        $url = "http://" . $url;
    }
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
    $execute = curl_exec($ch);
    // Check if any error occurred
    if (!curl_errno($ch)) {
        $bytes = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
        $total_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
        $return = 'Took ' . $total_time . ' / Bytes: ' . $bytes;
    } else {
        $return = 'Error reaching domain';
    }
    curl_close($ch);
    return $return;
}
And here is one using fopen
function fopenTest($link) {
    if (substr($link, 0, 4) != "http") {
        $link = "http://" . $link;
    }
    $timestart = microtime(true);
    $churl = @fopen($link, 'r');
    $timeend = microtime(true);
    $diff = number_format(($timeend - $timestart) * 1000, 4);
    if (!$churl) {
        $message = "Offline";
    } else {
        $message = "Online. Time: " . $diff . " ms";
        fclose($churl);
    }
    return $message;
}
Is there a better way to ping a website using php?
Obviously curl's got all kinds of cool things, but remember, you can always make use of built in tools by invoking them from the command line like this:
$site = "google.com";
ob_start();
system("ping " . escapeshellarg($site));
print ob_end_flush();
The only thing to keep in mind is that this isn't going to be as cross-platform as cURL might be; although the curl extension is not enabled by default either.
When doing quick scripts for one time tasks I just exec() wget:
$response = `wget http://google.com -O -`;
It's simple and takes care of redirects.
If you're using the Suhosin patches together with cURL, you may encounter problems with HTTP redirects (301, 302, ...); Suhosin won't allow them.
I'm not sure about cURL vs. fopen, but this benchmark says file_get_contents has better performance than fopen.
You could use XML-RPC (xmlrpc_client). Not sure what the advantages/disadvantages compared to cURL are.
Drupal uses xmlrpc for this purpose (look at the ping module).
Using cURL is fine.
Not sure if I'd use that user-agent string though. Rather make a custom one, unless you specifically need to pretend to be a browser.
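For instance (the name and URL are just an illustrative convention, and $ch is the existing handle from the code above):

// A descriptive, custom User-Agent instead of pretending to be IE 6:
curl_setopt($ch, CURLOPT_USERAGENT, 'SiteMonitor/1.0 (+http://example.com/monitor-info)');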
Maybe this PEAR package, Net_Ping, is what you are looking for. It's no longer maintained, but it works.
If remote fopen is enabled, file_get_contents() will do the trick too.
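A rough sketch of that, timing the call and reporting the size roughly as the question asks (only standard functions; the scheme-prefixing mirrors the question's code):

// Time a plain file_get_contents() call and report duration plus body size.
function fgcPing($url) {
    if (substr($url, 0, 4) != "http") {
        $url = "http://" . $url;
    }
    $start = microtime(true);
    $body  = @file_get_contents($url);
    $ms    = round((microtime(true) - $start) * 1000, 1);

    if ($body === false) {
        return "Offline";
    }
    return "Online. Time: " . $ms . " ms / Bytes: " . strlen($body);
}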
