How fast is simplexml_load_file()? - php

I'm fetching lots of user data via last.fm's API for my mashup. I do this every week as I have to collect listening data.
I fetch the data through their REST API and XML: more specifically simplexml_load_file().
The script is taking ridiculously long. For about 2,300 users, the script takes 30 minutes to fetch only the names of artists. I have to fix it now, otherwise my hosting company will shut me down. I've ruled out all other options; it is the XML that is slowing the script.
I now have to figure out whether last.fm has a slow API (or is limiting calls without them telling us), or whether PHP's simplexml is actually rather slow.
One thing I realised is that the XML request fetches a lot more than I need, but I can't limit it through the API (i.e. give me info on only 3 bands, not 70). But even "big" XML files only get to about 20 kB. Could that be what is slowing down the script? Having to load 20 kB into an object for each of the 2,300 users?
It doesn't make sense that that could be it... I just need confirmation that it is probably last.fm's slow API. Or is it?
Any other help you can provide?

I don't think SimpleXML is that slow; it has some overhead because it is a parser, but I suspect the 2,300 curl/file_get_contents requests are taking far more time. Also, why not fetch the data and just use simplexml_load_string? Do you really need to put those files on the server's disk?
At least loading from memory should speed things up a bit. Also, what kind of processing are you doing on the loaded XML? Are you sure your processing is as efficient as it could be?
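The fetch-then-parse split this answer suggests might look like the sketch below (the URL and error handling are illustrative assumptions, not last.fm specifics):

```php
<?php
// Sketch: fetch the XML once over cURL, then parse it from memory,
// instead of letting simplexml_load_file handle the download.
function fetch_and_parse($url, $timeout = 10) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $body = curl_exec($ch);
    curl_close($ch);
    if ($body === false) {
        return false; // network error; caller decides what to do
    }
    // Parse from memory; no temp file on disk
    return simplexml_load_string($body);
}
```

Splitting the steps also makes it easy to time the download and the parse separately, which would settle whether the API or the parser is the bottleneck.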

20 kB × 2,300 users is ~45 MB. If you're downloading at ~25 kB/sec, it will take 30 minutes just to download the data, let alone parse it.

Make sure the XML that you download from last.fm is gzipped. You'd probably have to send the correct HTTP header (Accept-Encoding: gzip) to tell the server you support gzip. It would speed up the download but use more server CPU for the decompression.
Also consider using asynchronous downloads to free server resources. It won't necessarily speed the process up, but it should make the server administrators happy.
If the XML itself is big, use a SAX parser, instead of a DOM parser.
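With cURL, for instance, asking for a compressed response is one option. A sketch (the URL is a placeholder; setting CURLOPT_ENCODING to an empty string makes libcurl advertise every encoding it supports and decompress the response automatically):

```php
<?php
// Sketch: request a gzip-compressed response and let cURL decompress it
$ch = curl_init('http://example.com/feed.xml'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, ''); // send Accept-Encoding for all supported codings
$xml_string = curl_exec($ch);           // already decompressed by libcurl
curl_close($ch);
```

If the server honors the header, a 20 kB XML payload typically shrinks to a few kB on the wire.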

I think there's a limit of 1 API call per second. I'm not sure this policy is being enforced through code, but it might have something to do with it. You can ask the Last.fm staff on IRC at irc.last.fm #audioscrobbler if you believe this to be the case.

As suggested, fetch the data and parse using simplexml_load_string rather than relying on simplexml_load_file - it works out about twice as fast. Here's some code:
function simplexml_load_file2($url, $timeout = 30) {
    // Parse the host, path, and query string from the URL
    $url_parts = parse_url($url);
    if (!$url_parts || !array_key_exists('host', $url_parts)) {
        return false;
    }
    $fp = fsockopen($url_parts['host'], 80, $errno, $errstr, $timeout);
    if ($fp) {
        $path = array_key_exists('path', $url_parts) ? $url_parts['path'] : '/';
        if (array_key_exists('query', $url_parts)) {
            $path .= '?' . $url_parts['query'];
        }
        // Make the request
        $out = "GET $path HTTP/1.1\r\n";
        $out .= "Host: " . $url_parts['host'] . "\r\n";
        $out .= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);
        // Read the full response
        $resp = "";
        while (!feof($fp)) {
            $resp .= fgets($fp, 128);
        }
        fclose($fp);
        // Split the headers from the body
        $parts = explode("\r\n\r\n", $resp);
        $headers = array_shift($parts);
        $status_regex = "/HTTP\/1\.\d\s(\d+)/";
        if (preg_match($status_regex, $headers, $matches) && $matches[1] == 200) {
            $xml = join("\r\n\r\n", $parts);
            return simplexml_load_string($xml);
        }
    }
    return false;
}

Related

Send HTTP request from PHP without waiting for response?

I want to have an HTTP GET request sent from PHP. Example:
http://tracker.example.com?product_number=5230&price=123.52
The idea is to do server-side web-analytics: Instead of sending tracking
information from JavaScript to a server, the server sends tracking
information directly to another server.
Requirements:
The request should take as little time as possible, in order to not
noticeably delay processing of the PHP page.
The response from tracker.example.com does not need to be
checked. As examples, some possible responses from tracker.example.com:
200: That's fine, but no need to check that.
404: Bad luck, but - again - no need to check that.
301: Although a redirect would be appropriate, it would delay
processing of the PHP page, so don't do that.
In short: All responses can be discarded.
Ideas for solutions:
In a now deleted answer, someone suggested calling command line
curl from PHP in a shell process. This seems like a good idea,
only that I don't know if forking a lot of shell processes under
heavy load is a wise thing to do.
I found php-ga, a package for doing server-side Google
Analytics from PHP. On the project's page, it is
mentioned: "Can be configured to [...] use non-blocking requests."
So far I haven't found the time to investigate what method php-ga
uses internally, but this method could be it!
In a nutshell: what is the best solution for doing generic server-side
tracking/analytics from PHP?
Unfortunately, PHP is blocking by definition. While this holds true for the majority of functions and operations you will normally use, the current scenario is different.
The process, which I like to call an HTTP ping, requires only that you touch a specific URI, forcing the specific server to bootstrap its internal logic. Some functions allow you to achieve something very similar to this HTTP ping by not waiting for a response.
Take note that pinging a URL is a two-step process:
Resolve the DNS
Making the request
While making the request should be rather fast once the DNS is resolved and the connection is made, there aren't many ways of making the DNS resolution faster.
Some ways of doing an HTTP ping are:
cURL, by setting CURLOPT_CONNECTTIMEOUT to a low value
fsockopen, by closing immediately after writing
stream_socket_client (same idea as fsockopen), additionally passing STREAM_CLIENT_ASYNC_CONNECT
Both cURL and fsockopen block while the DNS is being resolved. I have noticed that fsockopen is significantly faster, even in worst-case scenarios.
stream_socket_client, on the other hand, should fix the DNS-resolution problem and should be the optimal solution in this scenario, but I have not managed to get it to work.
One final solution is to start another thread/process that does this for you. Making a system call should work, and so should forking the current process. Unfortunately, neither is really safe in applications where you can't control the environment PHP runs in:
system calls are more often than not blocked, and pcntl is not enabled by default.
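For reference, the stream_socket_client variant described above might be sketched like this (untested, in line with the answer's caveat that it proved hard to get working; the flags are real PHP constants):

```php
<?php
// Sketch: fire-and-forget GET using an asynchronous connect.
// Caveat: with STREAM_CLIENT_ASYNC_CONNECT the connection may not be
// established yet when fwrite runs, so the request can be lost.
function http_ping($host, $path = '/') {
    $flags = STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT;
    $fp = @stream_socket_client("tcp://$host:80", $errno, $errstr, 1, $flags);
    if ($fp) {
        fwrite($fp, "GET $path HTTP/1.1\r\nHost: $host\r\nConnection: Close\r\n\r\n");
        fclose($fp); // close without reading any response
    }
}
```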
I would call tracker.example.com this way:
get_headers('http://tracker.example.com?product_number=5230&price=123.52');
and in the tracker script:
ob_end_clean();
ignore_user_abort(true);
ob_start();
header("Connection: close");
header("Content-Length: " . ob_get_length());
ob_end_flush();
flush();
// from here the response has been sent. you can now wait as long as you want and do some tracking stuff
sleep(5); //wait 5 seconds
do_some_stuff();
exit;
I implemented a function for a fast GET request to a URL without waiting for the response:
function fast_request($url)
{
    $parts = parse_url($url);
    $fp = fsockopen($parts['host'], isset($parts['port']) ? $parts['port'] : 80, $errno, $errstr, 30);
    if (!$fp) {
        return false; // could not connect
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $out  = "GET " . $path . " HTTP/1.1\r\n";
    $out .= "Host: " . $parts['host'] . "\r\n";
    $out .= "Content-Length: 0\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    fclose($fp); // close without reading the response
}
We were using the fsockopen and fwrite combo, then one day it just stopped working. Or it was kind of intermittent. After a little research and testing, and if you have fopen wrappers enabled, I ended up using file_get_contents with stream_context_create and a timeout set to a hundredth of a second. The timeout parameter accepts floating-point values (https://www.php.net/manual/en/context.http.php). I wrapped it in a try...catch block so it would fail silently. It works beautifully for our purposes. You can do logging in the catch if needed. The timeout is the key if you don't want the function to block the runtime.
function fetchWithoutResponseURL($url)
{
    $context = stream_context_create([
        "http" => [
            "method"  => "GET",
            "timeout" => 0.01
        ]
    ]);
    try {
        file_get_contents($url, false, $context);
    } catch (Exception $e) {
        // Fail silently
    }
}
For those of you working with WordPress as a backend,
it is as simple as:
wp_remote_get( $url, array( 'blocking' => false ) );
Came here whilst researching a similar problem. If you have a database connection handy, one other possibility is to quickly stuff the request details into a table, and then have a separate cron-based process that periodically scans that table for new records to process and makes the tracking request, freeing up your web application from having to make the HTTP request itself.
You can use shell_exec, and command line curl.
For an example, see this question
You can actually do this using cURL directly.
I have implemented it both using a very short timeout (CURLOPT_TIMEOUT_MS) and using curl_multi_exec.
Be advised: I eventually quit this method because not every request was made correctly. This could have been caused by my own server, though I haven't been able to rule out cURL failing.
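The short-timeout variant referred to above might look like the following sketch (CURLOPT_TIMEOUT_MS requires cURL 7.16.2+; very low values can cut the connection before the request is fully sent, which could explain the dropped requests mentioned):

```php
<?php
// Sketch: fire a request and give cURL only ~50ms before giving up
$ch = curl_init('http://tracker.example.com/?product_number=5230&price=123.52');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 50); // overall timeout in milliseconds
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);    // required for sub-second timeouts on some builds
curl_exec($ch);                           // will likely time out; we don't care
curl_close($ch);
```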
I needed to do something similar: just ping a URL and discard all responses. I used proc_open, which lets you end the process right away using proc_close. I'm assuming you have lynx installed on your server:
<?php
function ping($url) {
    // Launch lynx and close the process immediately; we never read its output
    $proc = proc_open("lynx " . escapeshellarg($url), [], $pipes);
    proc_close($proc);
}
?>
<?php
// Create a stream context with a GET method and a custom header
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en"
    )
);
$context = stream_context_create($opts);
// Open the URL using the HTTP context set above
$file = file_get_contents('http://tracker.example.com?product_number=5230&price=123.52', false, $context);
?>

PHP cURL quickly check if website exists [duplicate]

Possible Duplicate:
cURL Mult Simultaneous Requests (domain check)
I'm trying to check whether a website exists (if it responds, that's good enough). The issue is my array of domains is 20,000 long and I'm trying to speed up the process as much as possible.
I've done some research and come across this page which details simultaneous cURL requests ->
http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/
I also found this page, which seems to be a good way of checking if a domain's webpage is up -> http://www.wrichards.com/blog/2009/05/php-check-if-a-url-exists-with-curl/
Any ideas on how to quickly check 20,000 domains to see if they are up?
$http = curl_init($url);
// Return the transfer instead of echoing it, and skip the body (HEAD-style check)
curl_setopt($http, CURLOPT_RETURNTRANSFER, true);
curl_setopt($http, CURLOPT_NOBODY, true);
$result = curl_exec($http);
$http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
curl_close($http);
if ($http_status == 200) // good here
check out RollingCurl
It allows you to execute multiple curl requests.
Here is an example:
require 'curl/RollingCurl.php';
require 'curl/RollingCurlGroup.php';

$rc = new RollingCurl('handle_response');
$rc->window_size = 2;

foreach ($domain_array as $domain => $value) {
    $request = new RollingCurlRequest($value);
    $rc->add($request);
}
$rc->execute();

function handle_response($response, $info) {
    if ($info['http_code'] === 200) {
        // Site exists; handle response data
    }
}
I think that if you really want to speed up the process and save a lot of bandwidth (as I understand you plan to check availability on a regular basis), then you should work with sockets, not with cURL. You may open several sockets at a time and arrange 'asynchronous' treatment of each socket. Then you should send not a "GET $sitename/ HTTP/1.0\r\n\r\n" request but "HEAD $sitename/ HTTP/1.0\r\n\r\n". It will return the same status code as the GET request would, but without the response body. You need to parse only the first row of the response to get an answer, so you could just match it against the good response codes. As one extra optimization, eventually your code will learn which sites sit on the same IPs, so you can cache the name mappings and order the list by IP. Then you may check several sites over one connected socket for those sites (remember to add the 'Connection: keep-alive' header).
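A minimal sketch of the HEAD-over-sockets idea described above (a blocking, one-site-at-a-time version for clarity; the parallel-socket and keep-alive optimizations are left out):

```php
<?php
// Sketch: check a site with a HEAD request and read only the status line
function head_status($host, $timeout = 5) {
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if (!$fp) {
        return false; // host unreachable
    }
    fwrite($fp, "HEAD / HTTP/1.0\r\nHost: $host\r\n\r\n");
    $status_line = fgets($fp, 256); // e.g. "HTTP/1.1 200 OK"
    fclose($fp);
    // Only the first row of the response is needed for the status code
    if (preg_match('#^HTTP/\d\.\d\s+(\d+)#', $status_line, $m)) {
        return (int)$m[1];
    }
    return false;
}
```

A site would then count as "up" when head_status returns a 2xx or 3xx code.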
You can use multi cURL requests, but you probably want to limit them to about 10 at a time. You would have to track jobs in a separate database for processing the queue: Threads in PHP

php fgets stream timeout

I am working on a server-client terminal chat using PHP. I wanted to know if there is any way to detect that the client has not entered anything for, say, 5 or 10 seconds. Using fgets pauses the terminal and waits for each entry, thus making the chat not realtime xD
I am still sort of modifying the code I got from here:
http://codeyoung.blogspot.com/2009/07/simple-php-socket-based-terminal-chat.html
thank you :D
Based on quarry's answer, I tried:
while (true) {
    stream_set_timeout($sock, 1);
    $reply = fread($sock, 4086);
    if ($reply != "") { echo "[Server] " . $reply; }
    stream_set_timeout($uin, 1);
    $resp = fgets($uin);
    if ($resp != "") { fwrite($sock, $resp); }
}
but stream_set_timeout doesn't seem to work. Any solutions?
This is not doable unless PHP supports multi-threading. What I did is use two terminals: one for receiving the messages and another for sending.
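That said, stream_select can watch the socket and stdin from a single process without threads, which comes close to what the question asks for. A sketch (assumes $sock is the connected chat socket from the linked tutorial):

```php
<?php
// Sketch: multiplex the server socket and the user's terminal with stream_select
$uin = fopen('php://stdin', 'r');
while (true) {
    $read = array($sock, $uin);
    $write = $except = null;
    // Block for up to 5 seconds waiting for activity on either stream
    if (stream_select($read, $write, $except, 5) === 0) {
        echo "[idle for 5 seconds]\n"; // nothing typed, nothing received
        continue;
    }
    // $read now contains only the streams that have data ready
    foreach ($read as $stream) {
        if ($stream === $sock) {
            echo "[Server] " . fread($sock, 4096);
        } else {
            fwrite($sock, fgets($uin));
        }
    }
}
```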

Are there any other options for rest clients besides CURL?

Are there alternatives to cURL in PHP that will allow a client to connect to a REST-architecture server?
PUT, DELETE, and file upload are some of the things that need to work.
You can write your own library. It's even possible to do it completely in PHP, using fsockopen and friends. For example:
function httpget($host, $uri) {
    $msg = 'GET ' . $uri . " HTTP/1.1\r\n" .
           'Host: ' . $host . "\r\n" .
           "Connection: close\r\n\r\n";
    $fh = fsockopen($host, 80);
    fwrite($fh, $msg);
    $result = '';
    while (!feof($fh)) {
        $result .= fgets($fh);
    }
    fclose($fh);
    return $result;
}
I recommend Zend_Http_Client (from Zend) or HTTP_Request2 (from PEAR). They both provide a well-designed object model for making HTTP requests.
In my personal experience, I've found the Zend version to be a little more mature (mostly in dealing with edge cases).
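For illustration, a PUT request with Zend_Http_Client (ZF1) might look like the sketch below (from memory; the endpoint URL and payload are placeholders, so check the Zend_Http_Client docs for exact option names):

```php
<?php
require_once 'Zend/Http/Client.php';

// Sketch: send a PUT with a JSON body to a REST endpoint (placeholder URL)
$client = new Zend_Http_Client('http://api.example.com/items/42');
$client->setMethod(Zend_Http_Client::PUT);
$client->setRawData('{"name": "example"}', 'application/json');
$response = $client->request();

if ($response->isSuccessful()) {
    // handle $response->getBody()
}
```

HTTP_Request2 offers an equivalent object model (HTTP_Request2::METHOD_PUT and friends) if you prefer PEAR.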

How do I ping a webpage in php to make sure it's alive?

I have a list of URLs for some webpages, and I want to make sure each URL (webpage) exists and has not been deleted. I want to do it in PHP.
How do I ping a webpage to make sure it's alive?
$urls = array(...);
foreach ($urls as $url) {
    $headers = get_headers($url);
    if (!$headers OR strpos($headers[0], '200 OK') === FALSE) {
        // Site is down.
    }
}
Alternatively you could use ping, though that only tells you the host is reachable, not that the page itself exists. Note that ping takes a hostname, not a full URL:
$host = parse_url($url, PHP_URL_HOST);
$response = shell_exec('ping -c 1 ' . escapeshellarg($host));
// Parse $response.
Update
You mention you want this to be scheduled. Have a look into cron jobs.
Create a PHP script that fires off an HTTP Request to each URL that you want to be kept-alive.
PHP HTTP Request
I suggest setting up a Task in your operating system that accesses this script every 15 minutes in order to keep these applications alive. Here's some info on running PHP from the command line in Windows.
See http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=8&txtCodeId=1786 for a thorough implementation.
