I want to check whether the URLs stored in my database are available. I chose fopen, but testing just 30 rows from my database takes nearly 20 seconds. Is there any way to make it more efficient? Thanks.
<?php
$start_t = microtime(true);
//connect database and select query
while ($row = mysql_fetch_array($result)){
//$url = 'http://www.google.com'; // testing google.com instead of a database row: a single URL takes about 0.49 seconds.
$url = $row['url'];
$res = @fopen($url, "r");
if($res){
echo $row['url'].' yes<br />';
}else{
echo $row['url']. ' no<br />';
}
}
$end_t = microtime(true);
$totaltime = $end_t-$start_t;
echo "<br />".$totaltime." s";
?>
Try using fsockopen, which is faster than fopen:
<?php
$t = microtime(true);
$valid = @fsockopen("www.google.com", 80, $errno, $errstr, 30);
echo (microtime(true)-$t);
if (!$valid) {
echo "Failure";
} else {
echo "Success";
}
?>
Output:
0.0013298988342285
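If it helps, here is a rough sketch of how the fsockopen check might slot into the loop from the question; $result and $row['url'] are assumed to come from the same query, and parse_url() pulls the hostname out of each URL. Note that this only confirms the host accepts connections on port 80, not that the specific path returns 200.
<?php
while ($row = mysql_fetch_array($result)) {
    $host = parse_url($row['url'], PHP_URL_HOST);
    // Short timeout (2 seconds) so one dead host doesn't stall the whole loop.
    $fp = @fsockopen($host, 80, $errno, $errstr, 2);
    if ($fp) {
        echo $row['url'] . ' yes<br />';
        fclose($fp);
    } else {
        echo $row['url'] . ' no<br />';
    }
}
?>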
You can try using CURL with the CURLOPT_NOBODY option set, which uses the HTTP HEAD method and avoids downloading the entire page:
$ch = curl_init($row['url']);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$retcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// 200 means found; a 4xx code such as 404 means not found.
curl_close($ch);
From the CURLOPT_NOBODY documentation:
TRUE to exclude the body from the
output. Request method is then set to
HEAD. Changing this to FALSE does not
change it to GET.
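One way this could be folded into the loop from the question is a small helper built around the snippet above; the url_is_available() name, the timeout value, and the "2xx/3xx means available" rule are my own assumptions, not part of the original answer.
function url_is_available($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body downloaded
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // give up after 5 seconds
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code >= 200 && $code < 400;
}

// Usage inside the question's loop:
// echo $row['url'] . (url_is_available($row['url']) ? ' yes' : ' no') . '<br />';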
Try a bulk URL check, that is, in blocks of 10 or 20, using
cURL multi exec:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
Use the cURL options for NOBODY and header-only requests, so your responses will come back much faster.
Also don't forget to set a TIMEOUT for cURL, or one bad URL may take too much time.
I was doing 50 URL checks in 20 seconds.
Hope that helps.
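For illustration, a rough sketch of the kind of curl_multi batch this answer describes, combining NOBODY and TIMEOUT; $urls is assumed to be an array of URLs already pulled from the database, and the 200-only test is a simplification.
// Rough sketch of a curl_multi batch check.
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // keep one bad URL from hanging the batch
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all handles until every transfer has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo $url . ($code == 200 ? ' yes' : ' no') . "<br />";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);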
You can't speed things up like that.
With 30 rows I assume you are connecting to 30 different URLs. 20 seconds is already a good time for that.
Also, I suggest you use file_get_contents to retrieve the HTML,
or, if you only need the header response, use get_headers().
If you want to speed up the process, just spawn more processes. Each of them will fetch a portion of the URLs.
Addendum
Also don't forget about the great Zend_HTTP_Client(); it is very good for such a task.
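As a quick illustration of the get_headers() suggestion above, a minimal sketch (the 200 test on the status line is my own simplification; note that get_headers() sends a GET request by default):
$headers = @get_headers($row['url']);
// $headers[0] is the status line, e.g. "HTTP/1.1 200 OK"
if ($headers && strpos($headers[0], '200') !== false) {
    echo $row['url'] . ' yes<br />';
} else {
    echo $row['url'] . ' no<br />';
}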
Related
I am facing an issue fetching data from a big file containing many web links.
In my code I use:
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
I used it to speed up the cURL request.
I tested my code on a few links and it works fine, but after using the big file I get bad results.
As I understand it, this option compresses the data during retrieval, which is why I used it to speed up the cURL request.
Could it cause errors in the retrieved data, for example if a site does not use gzip?
My code also uses fork with no sleep time; maybe the problem comes from the fork?
$pid = @pcntl_fork();
$execute++;
if ($execute >= $maxproc)
{
while (pcntl_waitpid(0, $status) != -1)
{
$status = pcntl_wexitstatus($status);
$execute =0;
//usleep(250000);
//sleep(1);
//echo " [$ipkey] Child $status completed\n";
}
}
if (!$pid)
{
// child process: do the work for this link here, then exit
exit;
}
Does anyone have an idea where the problem is?
I have a website that pulls prices from an API. The problem is that if you send more than ~10 requests to this API in a short amount of time, your IP gets blocked temporarily (I'm not sure if this is just a problem on localhost or if it would also be a problem from the webserver; I assume the latter).
The request to the API returns a JSON object, which I then parse and store certain parts of it into my database. There are about 300 or so entries in the database, so ~300 requests I need to make to this API.
I will end up having a cron job that every x hours, all of the prices are updated from the API. The job calls a php script I have that does all of the request and db handling.
Is there a way to have the script send the requests over a longer period of time, rather than immediately? The problem I'm running into is that after ~20 or so requests the ip gets blocked, and the next 50 or so requests after that get no data returned.
I looked into sleep(), but read that it will just store the results in a buffer and wait, rather than wait after each request.
Here is the script that the cron job will be calling:
define('HTTP_NOT_FOUND', false);
define('HTTP_TIMEOUT', null);
function http_query($url, $timeout=5000) {
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_TIMEOUT_MS, $timeout);
$text = curl_exec($curl);
if($text) {
$code = curl_getinfo($curl, CURLINFO_HTTP_CODE);
switch($code){
case 200:
return $text;
case 404:
return -1;
default:
return -1;
}
}
return HTTP_TIMEOUT;
}
function getItemPrices($ID) {
$t = time();
$url = url_to_API;
$result = http_query($url, 5000);
if ($result == -1) { return -1; }
else {
return json_decode($result)->price;
}
}
connectToDB();
$result = mysql_query("SELECT * FROM prices") or die(mysql_error());
while ($row = mysql_fetch_array($result)) {
$id = $row['id'];
$updatedPrice = getItemPrices($id);
.
.
echo $updatedPrice;
. // here I am just trying to make sure I can get all ~300 prices without getting any php errors or the request failing (since the ip might be blocked)
.
}
sleep() should not affect/buffer queries to the database. You can use ob_flush() if you need to print something immediately. Also make sure to set the max execution time with set_time_limit() so your script doesn't time out.
set_time_limit(600);
while ($row = mysql_fetch_array($result)) {
$id = $row['id'];
$updatedPrice = getItemPrices($id);
.
.
echo $updatedPrice;
//Sleep 1 second; use ob_flush() if necessary
sleep(1);
//You can also use usleep(..) to delay the script in microseconds
}
I have around 295 domains to check for whether they contain files in their public_html directories. Currently I am using the PHP FTP functions, but the script takes around 10 minutes to complete. I am trying to shorten this time; what methods could I use to achieve this?
Here is my PHP code:
<?php
foreach($ftpdata as $val) {
if (empty($val['ftp_url'])) {
echo "<p>There is no URL provided</p>";
}
if (empty($val['ftp_username'])) {
echo "<p>The site ".$val['ftp_url']." dosent have a username</p>";
}
if (empty($val['ftp_password'])) {
echo "<p>The site ".$val['ftp_url']." dosent have a password</p>";
}
if($val['ftp_url'] != NULL && $val['ftp_password'] != NULL && $val['ftp_username'] != NULL) {
$conn_id = #ftp_connect("ftp.".$val['ftp_url']);
if($conn_id == false) {
echo "<p></br></br><span>".$val['ftp_url']." isnt live</span></p>";
}
else {
$login_result = ftp_login($conn_id, $val['ftp_username'], $val['ftp_password']);
ftp_chdir($conn_id, "public_html");
$contents = ftp_nlist($conn_id, ".");
if (count($contents) > 3) {
echo "<p><span class='green'>".$val['ftp_url']." is live</span><p>";
}
else {
echo "<p></br></br><span>".$val['ftp_url']." isnt live</span></p>";
}
}
}
}
?>
If it is a publicly available file, you can use file_get_contents() to try to grab it. If it succeeds, you know the file is there; if it fails, it is not. You don't need to download the entire file. Just limit it to a small number of characters so it's fast and doesn't waste bandwidth.
$page = file_get_contents($url, NULL, NULL, 0, 100);
if ($page !== false)
{
// it exists
}
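If the remote host is slow or unreachable, file_get_contents() can hang for a long time, so it may be worth adding a timeout via a stream context. A small sketch (the 5-second timeout is an arbitrary choice of mine):
// A short timeout keeps one dead site from stalling the whole loop.
$context = stream_context_create(array('http' => array('timeout' => 5)));
$page = @file_get_contents($url, false, $context, 0, 100);
if ($page !== false)
{
    // it exists
}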
Use cURL with the CURLOPT_NOBODY option set to true: the request method is then set to HEAD and the body is not transferred.
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://google.com/images/srpr/logo3w.png"); //for example google logo
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
//get content
$content = curl_exec($ch);
// close
curl_close($ch);
//work with result
var_dump($content);
?>
If the output contains "HTTP/1.1 200 OK", then the file/resource exists.
PS. Try to use curl_multi_*. It's very fast.
M, this is really just an explanation of AlexeyKa's answer. The reason for your scan taking 10 minutes is that you are serialising some 300 network transactions, each of which takes roughly 2 seconds on average, and 300 x 2s gives you your total 10 min elapsed time.
The various approaches such as requesting a header and no body can trim the per-transaction cost, but the killer is that you are still running your queries one at a time. What the curl_multi_* routines allow you to do is to run batches in parallel, say 30 batches of 10, taking closer to 30s. Scanning through the PHP documentation's user-contributed notes gives this post, which explains how to set it up:
Executing multiple curl requests in parallel with PHP and curl_multi_exec.
The other option (if you are using php-cli) is simply to kick off, say, ten batch processes, each one much like your current code but with its own sublist of one tenth of the sites to check.
Since either approach is largely latency-bound rather than link-capacity-bound, the time should fall by roughly the same factor.
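A rough sketch of that php-cli fan-out idea; check_sites.php is a hypothetical worker script that would run your existing check over its own tenth of the site list:
// Launch ten background workers, each handling one tenth of the sites.
// check_sites.php (hypothetical) would select rows where id % $parts == $part.
$parts = 10;
for ($part = 0; $part < $parts; $part++) {
    $cmd = sprintf('php check_sites.php %d %d > /dev/null 2>&1 &', $part, $parts);
    exec($cmd);
}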
I'm using PHP cURL to fetch information from another website and insert it into my page. I was wondering if it was possible to have the fetched information cached on my server? For example, when a visitor requests a page, the information is fetched and cached on my server for 24 hours. The page is then entirely served locally for 24 hours. When the 24 hours expire, the information is again fetched and cached when another visitor requests it, in the same way.
The code I am currently using to fetch the information is as follows:
$url = $fullURL;
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
Is this possible? Thanks.
You need to write or download a PHP caching library (an extensible PHP caching library or some such) and adjust your current code to check the cache first.
Let's say your cache library has 2 functions called:
save_cache($result, $cache_key, $timestamp)
and
get_cache($cache_key, $timestamp)
With save_cache() you will save the $result into the cache and with get_cache() you will retrieve the data.
$cache_key would be md5($fullURL), a unique identifier for the caching library to know what you want to retrieve.
$timestamp is the amount of minutes/hours you want the cache to be valid, depending on what your caching library accepts.
Now on your code you can have a logic like:
$cache_key = md5($fullURL);
$timestamp = 24; // assuming your caching library accepts hours as the timestamp
$result = get_cache($cache_key, $timestamp);
if(!$result){
echo "This url is NOT cached, let's get it and cache it";
// do the curl and get $result
// save the cache:
save_cache($result, $cache_key, $timestamp);
}
else {
echo "This url is cached";
}
echo $result;
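As a rough illustration, a minimal file-based version of those two helpers might look like the following. The helper names come from the answer above, but the implementation, the hours-based expiry and the writable cache/ directory are all my own assumptions.
// Minimal file-based cache helpers; assumes a writable cache/ directory
// and that $timestamp is a number of hours.
function save_cache($result, $cache_key, $timestamp) {
    file_put_contents('cache/' . $cache_key, serialize($result));
}

function get_cache($cache_key, $timestamp) {
    $file = 'cache/' . $cache_key;
    // Only valid if the file exists and is younger than $timestamp hours.
    if (file_exists($file) && filemtime($file) > time() - $timestamp * 3600) {
        return unserialize(file_get_contents($file));
    }
    return false;
}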
You can cache it using memcache (or a session), you can cache it using files on your server, and you can cache it using a database like MySQL.
file_put_contents("cache/cachedata.txt",$data);
You will need to set the permissions of the folder you want to write the files to, otherwise you might get some errors.
Then if you want to read from the cache:
if( file_exists("cache/cachedata.txt") )
{ $data = file_get_contents("cache/cachedata.txt"); }
else
{ // curl here, we have no cache
}
Honza's suggestion to use Nette cache worked great for me, and here's the code I wrote to use it. My function returns the HTTP result if it worked, false if not. You'll have to change some path strings.
use Nette\Caching\Cache;
use Nette\Caching\Storages\FileStorage;
Require("/Nette/loader.php");
function cached_httpGet($url) {
$storage = new FileStorage("/nette-cache");
$cache = new Cache($storage);
$result = $cache->load($url);
if ($result) {
echo "Cached: $url";
}
else {
echo "Fetching: $url";
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
if (curl_errno($ch)) {
echo "ERROR " . curl_error($ch) . " loading: $url";
return false;
} else
$cache->save($url, $result, array(Cache::EXPIRE => '1 day'));
curl_close($ch);
}
return $result;
}
Use Nette Cache. It's an all-you-need solution, simple to use and, of course, thread-safe.
If you've got nothing against file system access, you could just store it in a file. Then maybe use a script on the server that checks the file's timestamp against the current time and deletes it if it's too old.
If you don't have access to all aspects of the server you could just use the above idea and store a timestamp with the info. Every time the page is requested check against the timestamp.
And if you're having problems with the fs bottlenecking, you could use a MySQL database stored entirely in RAM.
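A small sketch of that timestamp check using filemtime(), with the 24-hour window from the question (the cache path is a placeholder of my own):
$cacheFile = 'cache/remote.html'; // hypothetical cache path
// Serve the cached copy if it exists and is less than 24 hours old.
if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < 86400) {
    $result = file_get_contents($cacheFile);
} else {
    // fetch with cURL as in the question, then refresh the cache:
    // file_put_contents($cacheFile, $result);
}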
I made a pretty cool, simple snippet to store the data fetched by your cURL for 1 hour or 1 day, based on Antwan van Houdt's comment (shout out to him). First, create a folder named "zcache" in public_html and make sure its permissions are set to 755.
1 hour:
if( file_exists('./zcache/zcache-'.date("Y-m-d-H").'.html') )
{ $result = file_get_contents('./zcache/zcache-'.date("Y-m-d-H").'.html'); }
else
{
// put your curl here
file_put_contents('./zcache/zcache-'.date("Y-m-d-H").'.html',$result);
}
1 day:
if( file_exists('./zcache/zcache-'.date("Y-m-d").'.html') )
{ $result = file_get_contents('./zcache/zcache-'.date("Y-m-d").'.html'); }
else
{
// put your curl here
file_put_contents('./zcache/zcache-'.date("Y-m-d").'.html',$result);
}
You are welcome.
The best way to avoid caching is to append the time or some other random element to the URL, like this:
$url .= '?ts=' . time();
so for example instead of having
http://example.com/content.php
you would have
http://example.com/content.php?ts=1212434353
So in keeping with my last question, I'm working on scraping the friends feed from Twitter. I followed a tutorial to get this script written, pretty much step by step, so I'm not really sure what is wrong with it, and I'm not seeing any error messages. I've never really used cURL before except from the shell, and I'm extremely new to PHP, so please bear with me.
<html>
<head>
<title>Twitcap</title>
</head>
<body>
<?php
function twitcap()
{
// Set your username and password
$user = 'osoleve';
$pass = '****';
// Set site in handler for cURL to download
$ch = curl_init("https://twitter.com/statuses/friends_timeline.xml");
// Set cURL's option
curl_setopt($ch,CURLOPT_HEADER,1); // We want to see the header
curl_setopt($ch,CURLOPT_TIMEOUT,30); // Set timeout to 30s
curl_setopt($ch,CURLOPT_USERPWD,$user.':'.$pass); // Set uname/pass
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // Do not send to screen
// For debugging purposes, comment when finished
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,0);
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST,0);
// Execute the cURL command
$result = curl_exec($ch);
// Remove the header
// We only want everything after <?
$data = strstr($result, '<?');
// Return the data
$xml = new SimpleXMLElement($data);
return $xml;
}
$xml = twitcap();
echo $xml->status[0]->text;
?>
</body>
</html>
Wouldn't you actually need everything after "?>" ?
$data = strstr($result,'?>');
Also, are you using a free web host? I once had an issue where my hosting provider blocked access to Twitter due to people spamming it.
Note that if you use strstr, the returned string will actually include the needle string, so you have to strip the first 2 characters from the string.
I would rather recommend a combination of the functions substr and strpos!
Anyway, I think SimpleXML should be able to handle this header, meaning I think this step is not necessary!
Furthermore, if I open the URL I don't see a header like that! And if strstr doesn't find the string, it returns false, so you don't have any data in your current script.
Instead of $data = strstr($result, '<?'); try this:
if (strpos($result, '?>') !== false) {
$data = strstr($result, '?>');
} else {
$data = $result;
}