The bwshare module and PHP scraping - php

I wrote a script downloading a list of pages from a website. From time to time I receive the following error (the number of seconds is variable):
The bwshare module will refuse your requests for the next 7 seconds.
You have downloaded data too rapidly.
I found when using sleep(2) in the loop, it works much better, however the time delay is too expensive.
What's the best way how to deal with this module? Should I scrape it without any delay and if the response will be similar to the above message simply use sleep for the requested number of seconds?

It all depends on how many pages you can get before the error message.
Try and measure how many pages in average you can get.
4 pages before the bwshare message is the minimum.
If you are getting the error message before reaching 4 page downloads, then il would be faster to sleep(2) after each download.

try this way... it might help u.
$requestTime = 0.1; // s/connection
foreach(/* blah */) {
$start = microtime(true);
// Do your stuff to here.. get_file_content($url) and other processing .........
if($timeTaken = microtime(true)-$start < $requestTime) {
usleep(($requestTime-$timeTaken)*1000000);
}
}
if your problem is solved then try to post your answer so that other people may also be benefited

Related

Long PHP script runs multiple times

I have a products database that synchronizes with product data ever morning.
The process is very clear:
Get all products from database by query
Loop through all products, and get and xml from the other server by product_id
Update data from xml
Log the changes to file.
If I query a low amount of items, but limiting it to 500 random products for example, everything goes fine. But when I query all products, my script SOMETIMES goes on the fritz and starts looping multiple times. Hours later I still see my log file growing and products being added.
I checked everything I could think of, for example:
Are variables not used twice without overwriting each other
Does the function call itself
Does it happen with a low amount of products too: no.
The script is called using a cronjob, are the settings ok. (Yes)
The reason that makes it especially weird is that it sometimes goes right, and sometimes it doesnt. Could this be some memory problem?
EDIT
wget -q -O /dev/null http://example.eu/xxxxx/cron.php?operation=sync its in webmin called on a specific hour and minute
Code is hundreds of lines long...
Thanks
You have:
max_execution_time disabled. Your script won't end until the process is complete for as long as it needed.
memory_limit disabled. There is no limit to how much data stored in memory.
500 records were completed without issues. This indicates that the scripts completes its process before the next cronjob iteration. For example, if your cron runs every hour, then the 500 records are processed in less than an hour.
If you have a cronjob that is going to process large amount of records, then consider adding lock mechanism to the process. Only allow the script to run once, and start again when the previous process is complete.
You can create script lock as part of a shell script before executing your php script. Or, if you don't have an access to your server you can use database lock within the php script, something like this.
class ProductCronJob
{
protected $lockValue;
public function run()
{
// Obtain a lock
if ($this->obtainLock()) {
// Run your script if you have valid lock
$this->syncProducts();
// Release the lock on complete
$this->releaseLock();
}
}
protected function syncProducts()
{
// your long running script
}
protected function obtainLock()
{
$time = new \DateTime;
$timestamp = $time->getTimestamp();
$this->lockValue = $timestamp . '_syncProducts';
$db = JFactory::getDbo();
$lock = [
'lock' => $this->lockValue,
'timemodified' => $timestamp
];
// lock = '0' indicate that the cronjob is not active.
// Update #__cronlock set lock = '', timemodified = '' where name = 'syncProducts' and lock = '0'
// $result = $db->updateObject('#__cronlock', $lock, 'id');
// $lock = SELECT * FROM #__cronlock where name = 'syncProducts';
if ($lock !== false && (string)$lock !== (string)$this->lockValue) {
// Currently there is an active process - can't start a new one
return false;
// You can return false as above or add extra logic as below
// Check the current lock age - how long its been running for
// $diff = $timestamp - $lock['timemodified'];
// if ($diff >= 25200) {
// // The current script is active for 7 hours.
// // You can change 25200 to any number of seconds you want.
// // Here you can send notification email to site administrator.
// // ...
// }
}
return true;
}
protected function releaseLock()
{
// Update #__cronlock set lock = '0' where name = 'syncProducts'
}
}
Your script is running for quite some time (~45m) and wget think it's "timing out" since you don't return any data. By default wget will have a 900s timeout value and a retry count of 20. So first you should probably change your wget command to prevent this:
wget --tries=0 --timeout=0 -q -O /dev/null http://example.eu/xxxxx/cron.php?operation=sync
Now removing the timeout could lead to other issue, so instead you could send (and flush to force webserver to send it) data from your script to make sure wget doesn't think the script "timed out", something every 1000 loops or something like that. Think of this as a progress bar...
Just keep in mind that you will hit an issue when the run time will get close to your period as 2 crons will run in parallel. You should optimize your process and/or have a lock mechanism maybe?
I see two possibilities:
- chron calls the script much more often
- script takes too long somehow.
you can try estimate the time a single iteration of the loop takes.
this can be done with time(). perhaps the result is suprising, perhaps not. you can probably get the number of results too. multiply the two, that way you will have an estimate of how long the process should take.
$productsToSync = $db->loadObjectList();
and
foreach ($productsToSync AS $product) {
it seems you load every result into an array. this wont work for huge databases because obviously a million rows wont fit in memory. you should just get one result at a time. with mysql there are methods that just fetch one thing at a time from the resource, i hope yours allows the same.
I also see you execute another query each iteration of the loop. this is something I try to avoid. perhaps you can move this to after the first query has ended and do all of those in one big query? otoh this may bite my first suggestion.
also if something goes wrong, try to be paranoid when debugging. measure as much as you can. time as much as you can when its a performance issue. put the timings in you log file. usually you will find the bottleneck.
I solved the problem myself. Thanks for all the replies!
My MySQL timed out, that was the problem. As soon as I added:
ini_set('mysql.connect_timeout', 14400);
ini_set('default_socket_timeout', 14400);
to my script the problem stopped. I really hope this helps someone. Ill upvote all the locking answers, because those were very helpful!

execute a PHP method every X seconds?

Context :
I'm making a PHP websocket server (here) running as a DAEMON in which there is obviously a main loop listening for sockets connections and incoming data so i can't just create an other loop with a sleep(x_number_of_seconds); in it because it'll freeze my whole server.
I can't execute an external script with a CRON job or fork a new process too (i guess) because I have to be in the scope of my server class to send data to connected client sockets.
Does anyone knows a magic trick to achieve this in PHP ? :/
Some crazy ideas :
Keeping track of the last loop execution time with microtime(true), and compare it with the current time on each loop, if it's about my desired X seconds interval, execute the method... which would result in a very drunk and inconsistent interval loop.
Run a JavaScript setInterval() in a browser that will communicate with my server trough a websocket and tell it to execute my method... i said they where crazy ideas !
Additional infos about what i'm trying to achieve :
I'm making a little online game (RPG like) in which I would like to add some NPCs that updates their behaviours every X seconds.
Is there an other ways of achieving this ? Am I missing something ? Should I rewrite my server in Node.js ??
Thanks a lot for the help !
A perfect alternative doesn't seams to exists so I'll use my crazy solution #1 :
$this->last_tick_time = microtime(true);
$this->tick_interval = 1;
$this->tick_counter = 0;
while(true)
{
//loop code here...
$t= microtime(true) - $this->last_tick_time;
if($t>= $this->tick_interval)
{
$this->on_server_tick(++$this->tick_counter);
$this->last_tick_time = microtime(true) - ($t- $this->tick_interval);
}
}
Basically, if the time elapsed since the last server tick is greater or equal to my desired tick interval, execute on_server_tick() method. And most importantly : we subtract the time overflow to make the next tick happen faster if this one happened too late. This way we fill the gaps and at the end, if the socket_select timeout is set to 1 second, we will never have a gap greater than 1.99999999+ second.
I also keep track of the tick counter, this way I can use modulo (%) to execute code on multiple intervals like this :
protected function on_server_tick($counter)
{
if($counter%5 == 0)
{
// 5 seconds interval
}
if($counter%10 == 0)
{
// 10 seconds interval
}
}
which covers all my needs ! :D
Don't worry PHP, I won't replace you with Node.js, you still my friend.
It looks to me like the websocket-framework you are using is too primitive to allow your server to do other useful things while waiting for connections from clients. The only call to PHP's socket_select() function is hard-coded to a one second timeout, and it does nothing when the time runs out. It really ought to allow a callback or an outside loop.
Look at the http://php.net/manual/en/function.socket-select.php manual page. The last parameter is a timeout time. socket_select() waits for incoming data on a socket or until the timeout time is up, which sounds like what you want to do, but the library has no provision for it. Then look at how the library uses it in core/classes/SocketServer.php.
I'm assuming you call run() and then it just never returns to your calling code until it gets a message on the socket, which prevents you from doing anything.

php script longer than max execution time (progress bar)

i currently work on a small project, where some data is gathered from the web and the system creates some relations between these. Of course it was not perfect from the beginning, so i needed to make a script which updates all the connections and relations with the updated scripts i made.
Basically the script works, but as there shall be a nice looking backend afterwards, its not really what i want.
The script needs around 10 minutes and because i didnt just want to set up the max_execution_time from php i thought of another method. Instead of loading 1000 sql entries at once i stripped it down to 200 at one time and just repeat it with the next 200 when the first round finished. Therefore i used php http_request. I just show you a stripped down version of the script:
require_once 'HTTP/Request.php';
$max = $db->query("SELECT COUNT(id) as max FROM db_table");
$lower = $_POST['lower'] ? $_POST['lower'] : 0;
$plus = 250;
$entries = $db->query("SELECT * FROM db_table LIMIT {$lower},{$plus}");
foreach($entries as $entry){
DO SOME STUFF TO UPDATE THE RELATIONS BETWEEN THE DATA
}
$lower = $lower + $plus;
if($lower <= $max) {
$request = new HTTP_Request("path to the script");
$request->setMethod(HTTP_REQUEST_METHOD_POST);
$request->addPostData("lower", $lower);
$result = $request->sendRequest();
}
This is it. As i said it works, because it's a new request so that it's not affected by the max_execution_time. But the browser is just loading and loading and loading and after a while it finishes. But of course i cannot show any refreshed data for something like a progress bar.
I saw many entries using php flush(), but that didnt work for me because of the (i guess) stupid way i used to solve my problem.
How would you do this if you need to install something on a webspace and you dont have the possbility to change the max execution time or install http_request?
As i said it should look like a progress bar later on. I guess i have to use ajax, and simply push the round the script is at every round and update the progress bar via javascript.
Can you help me?
You are still having trouble with max_execution_time because the page you requested from web browser is always active one, it doesn't finish until the HTTPRequests finishes. Try locating the page to another with the parameter lower.
header("Location: myscript.php?lower=$lower");

Use non-blocking streams to paralellize REST-Api requests in PHP?

Consider the following scenario:
http://www.restserver.com/example.php returns some content that I want to work with in my web-application.
I don't want to load it using ajax (SEO issues etc.)
My page takes 100ms to generate, the REST resource also takes 100ms to be loaded.
We assume that the 100ms generation time of my website occour before I begin working with the REST resource. What comes after that can be neglected.
Example Code:
Index.php of my website
<?
do_some_heavy_mysql_stuff(); // takes 100 ms
get_rest_resource(); // takes 100 ms
render_html_with_data_from_mysql_and_rest(); // takes neglectable amount of time
?>
Website will take ~200ms to generate.
I want to turn this into:
<?
Restclient::initiate_rest_loading(); // takes 0ms
do_some_heavy_mysql_stuff(); // takes 100 ms
Restclient::get_rest_resource(); // takes 0 ms because 100 ms have already passed since initiation
render_html_with_data_from_mysql_and_rest(); // takes neglectable amount of time
?>
Website will take ~100ms to generate.
To accomplis this I thought about using something like this:
(I am pretty sure this code will not work because this question is all about asking how to accomplish this, and whether its possible. I just thought some naive code could demonstrate it best)
class Restclient {
public static $buffer;
public static function initiate_rest_loading() {
// open resource
$handle = fopen ("http://www.restserver.com/example.php", "r");
// set to non blocking so fgets will return immediately
stream_set_blocking($handle,0);
// initate loading, but return immediately to continue website generation
fgets($handle, 40960);
}
public static function get_rest_resource() {
// set stream to blocking again because now we really want the data
stream_set_blocking($handle,1);
// get the data and save it so templates can work with it
self::$buffer = fgets($handle, 40960); templates
}
}
So final question:
Is this possible and how?
What do I have to keep an eye on (internal buffer overflows, stream lengths etc.)
Are there better methods?
Does this well work with http resources?
Any input is appriciated!
I hope I explained it understandable. If anything is unclear, please leave a comment, so I can rephrase it!
As "any input is appreciated", here is mine:
What you want is called asynchronous (you want to something while something else is being done "in the background").
To solve your problem, I thought on this:
Separate do_some_heavy_mysql_stuff and get_rest_resource in two different PHP scripts.
Use cURL "multi" ability to do simultaneous requests. Please, check:
curl_multi_init and related PHP functions
Simultaneous HTTP requests in PHP with cURL
This way, you can perform both scripts at the same time. Using cURL multi features, you can call http://example.com/do_some_heavy_mysql_stuff.php and http://example.com/get_rest_resource.php at the same time, and then play with the results as soon as they're available.
These are my first thoughts, and Iim sharing them with you. Maybe there are different and more interesting approaches... Good luck!

best way to measure (and refine) performance with PHP?

A site I am working with is starting to get a little sluggish, and I would like to refine it. I think the problem is with the PHP, but I can't be sure. How can I see how long functions are taking to perform?
If you want to test the execution time :
<?php
$startTime = microtime(true);
// Your content to test
$endTime = microtime(true);
$elapsed = $endTime - $startTime;
echo "Execution time : $elapsed seconds";
?>
Try the profiler feature in XDebug or Zend Debugger?
Two things you can do.
place Microtime calls everywhere although its not convenient if you want to test more than one function. So there is a simpler way to do it a better solution if you want to test many functions which i assume you would like to do.
just have a class (click on link to follow tutorial) where you can test how long all your functions take. Rather than place microtime everywhere. you just use this class. which is very convenient
http://codeaid.net/php/calculate-script-execution-time-%28php-class%29
the second thing you can do is to optimize your script is by taking a look at the memory usage.
By observing the memory usage of your scripts, you may be able optimize your code better.
PHP has a garbage collector and a pretty complex memory manager. The amount of memory being used by your script. can go up and down during the execution of a script. To get the current memory usage, we can use the memory_get_usage() function, and to get the highest amount of memory used at any point, we can use the memory_get_peak_usage() function.
view plaincopy to clipboardprint?
echo "Initial: ".memory_get_usage()." bytes \n";
/* prints
Initial: 361400 bytes
*/
// let's use up some memory
for ($i = 0; $i < 100000; $i++) {
$array []= md5($i);
}
// let's remove half of the array
for ($i = 0; $i < 100000; $i++) {
unset($array[$i]);
}
echo "Final: ".memory_get_usage()." bytes \n";
/* prints
Final: 885912 bytes
*/
echo "Peak: ".memory_get_peak_usage()." bytes \n";
/* prints
Peak: 13687072 bytes
*/
http://net.tutsplus.com/tutorials/php/9-useful-php-functions-and-features-you-need-to-know/
PK
You can also make it manually, by recording microtime() value in various places, like this:
<?
$TIMER['start']=microtime(TRUE);
// some code
$query="SELECT ...";
$TIMER['before q']=microtime(TRUE);
$res=mysql_query($query);
$TIMER['after q']=microtime(TRUE);
while ($row = mysql_fetch_array($res)) {
// some code
}
$TIMER['array filled']=microtime(TRUE);
// some code
$TIMER['pagination']=microtime(TRUE);
/and so on
?>
and then visualize it
<?
if ('127.0.0.1' === $_SERVER['REMOTE_ADDR']) {
echo "<table border=1><tr><td>name</td><td>so far</td><td>delta</td><td>per cent</td></tr>";
reset($TIMER);
$start=$prev=current($TIMER);
$total=end($TIMER)-$start;
foreach($TIMER as $name => $value) {
$sofar=round($value-$start,3);
$delta=round($value-$prev,3);
$percent=round($delta/$total*100);
echo "<tr><td>$name</td><td>$sofar</td><td>$delta</td><td>$percent</td></tr>";
$prev=$value;
}
echo "</table>";
}
?>
an IP address check implies that we are doing this profiling on the working site
Though I doubt it's PHP itself. Most likely it's database. So, pay most attention to query execution timing.
however, a "site" term is very broad. It includes also JS, CSS, images and stuff. So, I'd suggest to start form FirebFug's Net page to see what part of whole page takes more time.
Of course, refining can be done only after analysis of profiling results, and cannot be advised here without it.
Your best bet is Xdebug. Im happy as it comes bundled in my PHPed IDE. I can get profiler data at the click of a button.
So maybe you could consider that.
I had similar issues and so I created 2 new tables on the database and two new functions. One was audit_sql and the other was audit_code. Because I used an SQL abstraction class it was easy to time every single SQL call (I used php microtime as some others have suggested). So, I called microtime before and after the SQL call and stored the results on the database.
Similarly with pages. I called microtime at the start and end of each page and if necessary at the start and end of functons, divs - whatever I thought might be a culprit.
The general results were:
SQL calls to MySQL were almost instantaneous and were nto a problem at all. The only thing I would say is that even I was surprised at the number being executed! The site is generated from the database - even the menus, permissions etc. To produce the home page the SQL calls were measured in the 100s.
PHP was not the culprit. This was even more instantaneous that MySQL.
The culprit was.... (big build up!) calls to You Tube and Picassa and other sites like that. I host videos and photo albums on the site (well, I don't actually store them - they are stored on YT etc.) and on the home page are thumbnails that are extracted from You Tube and the like via the You Tube PHP API/Zend Framework. Because this is all http based to the other sites, each one was taking 1, 2 or 3 seconds. This was causing those divs containing these to take between 6 and 12 seconds and the home page up to 17 seconds.
The solution - store all thumbnails on my server. The first time one has to be served from the remote site (YT, Picassa etc.) so do that and then store it on your own site. Future times, you check if you have it and if so serve it always from your server. Cuts the page load time down to 2-3 seconds tops. Granted the first person to view the first home page load after someone has loaded more videos/images will take some time, but not thereafter. People will put a long one-off page load time down to their connection/the internet in general. Too many slow loads of your site and they will stop visiting!
I hope that helps somewhat.

Categories