Processing large amounts of data in PHP without a browser timeout - php

I have an array of about 50,000 mobile numbers. I'm trying to process them and send bulk SMS to these numbers through a third-party API, but the browser freezes for several minutes. I'm looking for a better option.
Processing the data involves checking the mobile number type (e.g. CDMA), assigning unique IDs to all the numbers for further referencing, checking network/country-specific charges, etc.
I thought of queuing the data in the database and using cron to send about 5k per batch every minute, but that will take a long time if there are many messages. What are my other options?
I'm using CodeIgniter 2 on a XAMPP server.

I would write two scripts:
File index.php:
<iframe src="job.php" frameborder="0" scrolling="no" width="1" height="1"></iframe>
<script type="text/javascript">
function progress(percent){
document.getElementById('done').innerHTML=percent+'%';
}
</script><div id="done">0%</div>
File job.php:
<?php
set_time_limit(0);        // ignore PHP's execution time limit
ignore_user_abort(true);  // keep going even if the user pulls the plug*
while (ob_get_level()) {  // remove any output buffers
    ob_end_clean();
}
ob_implicit_flush(true);  // output stuff directly

// * This depends on whether you want the user to be able to stop the process.
//   For example, you might add a Stop! button in index.php; but for that to
//   work, the ignore_user_abort() line above must be commented out.

function progress($percent){
    echo '<script type="text/javascript">parent.progress('.$percent.');</script>';
}

$total = count($mobiles); // $mobiles holds the 50,000 numbers
echo '<!DOCTYPE html><html><head></head><body>'; // webkit hotfix
foreach ($mobiles as $i => $mobile) {
    // send the SMS to $mobile here
    progress($i / $total * 100);
}
progress(100);
echo '</body></html>'; // webkit hotfix

I'm assuming these numbers are in a database; if so, you should add a new column titled isSent (or whatever you fancy).
The processing you describe in the next paragraph should be queued and done nightly/weekly/whenever appropriate. Unless you have a specific reason to, it shouldn't be done in bulk on demand. You could even add a column to the DB recording when each number was last checked, so that if a number hasn't been checked in at least X days you can perform a check on that number on demand.
Processing the data involves checking the mobile number type (e.g. CDMA), assigning unique IDs to all the numbers for further referencing, checking network/country-specific charges, etc.
But that still leads you back to the same question of how to do this for 50,000 numbers at once. Since you mentioned cron jobs, I'm assuming you have SSH access to your server, which means you don't need a browser. These cron jobs can be executed via the command line like so:
/usr/bin/php /home/username/example.com/myscript.php
My recommendation is to process 1,000 numbers at a time every 10 minutes via cron and to time how long this takes, then save it to a DB. Since you're using a cron job, it doesn't seem like these are time-sensitive SMS messages so they can be spread out. Once you know how long it took for this script to run 50 times (50*1000 = 50k) then you can update your cron job to run more/less frequently.
$time_start = microtime(true);
set_time_limit(0);

foreach ($batch as $number) {   // $batch holds the 1,000 numbers for this run
    doSendSMS($number, $msg);
}

$time_end = microtime(true);
$time = $time_end - $time_start;
saveTimeRequiredToSendMessagesInDB($time);
Also, you might have noticed the set_time_limit(0); this tells PHP not to time out after the default 30 seconds. If you are able to modify the php.ini file, you could raise the limit globally instead, but even then I would recommend against it, since you probably still want other pages to time out.
http://php.net/manual/en/function.set-time-limit.php

If this isn't a one-off type of situation, consider engineering a better solution.
What you basically want is a queue that your browser-bound process can write to, and then 1-N worker processes can read from and update.
Putting work in the queue should be rather inexpensive - perhaps a bunch of simple INSERT statements to a SQL RDBMS.
Then you can have a daemon or two (or 100, distributed across multiple servers) that read from the queue and process stuff. You'll want to be careful here and avoid two workers taking on the same task, but that's not hard to code around.
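"Not hard to code around" can be as simple as an atomic claim: a worker flips one pending row to "processing" in a single UPDATE, so no two workers can grab the same task. A hedged sketch, with invented table/column names (sms_queue, status, worker_id) and a placeholder sendSms() call, none of which come from the original post:
<?php
// Claim exactly one pending job atomically, then process it.
$pdo      = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$workerId = getmypid();

// MySQL locks the row during the UPDATE, so only one worker can claim it.
$claim = $pdo->prepare(
    "UPDATE sms_queue
        SET status = 'processing', worker_id = :worker
      WHERE status = 'pending'
      ORDER BY id
      LIMIT 1"
);
$claim->execute([':worker' => $workerId]);

if ($claim->rowCount() === 1) {
    // Fetch the row we just claimed and process it.
    $stmt = $pdo->prepare(
        "SELECT * FROM sms_queue WHERE status = 'processing' AND worker_id = :worker LIMIT 1"
    );
    $stmt->execute([':worker' => $workerId]);
    $job = $stmt->fetch(PDO::FETCH_ASSOC);

    // sendSms($job['mobile'], $job['message']); // placeholder for the third-party API call

    $pdo->prepare("UPDATE sms_queue SET status = 'done' WHERE id = :id")
        ->execute([':id' => $job['id']]);
}
Run one of these per worker (or loop it inside a daemon); scaling out is just a matter of starting more of them.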
So your browser-bound workflow is: click some button that causes a bunch of stuff to get added to the queue, then redirect to some "queue status" interface, where the user can watch the system chew through all their work.
A system like this is nice, because it's easy to scale horizontally quite a ways.
EDIT: Christian Sciberras' answer is going in this direction, except that the browser ends up driving both sides (it adds to the queue, then drives the worker process).

A cron job would be your best bet; I don't see why it would take any longer than doing it in the browser if your only problem at the moment is the browser timing out.
If you insist on doing it via the browser, then the other solution would be doing it in batches of, say, 1,000 and redirecting to the same script with a reference (in a $_GET variable) to where it got up to last time.
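A rough sketch of that batch-and-redirect idea, using an invented numbers table and a placeholder sendSms() helper (neither is from the question):
<?php
// Process 1,000 rows per request, then redirect to the same script with the next offset.
$batchSize = 1000;
$offset    = isset($_GET['offset']) ? (int)$_GET['offset'] : 0;

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query("SELECT id, mobile FROM numbers ORDER BY id LIMIT $offset, $batchSize")
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    // sendSms($row['mobile'], $message); // placeholder for the third-party API call
}

if (count($rows) === $batchSize) {
    // More rows remain: hand off to a fresh request before this one hits the time limit.
    header('Location: ' . $_SERVER['PHP_SELF'] . '?offset=' . ($offset + $batchSize));
    exit;
}

echo 'All batches processed.';
Each redirect starts a fresh request, so the execution time limit applies per batch rather than to the whole run.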

Related

Downloading many web pages with PHP curl

I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URLs from my database and then uses curl to grab a copy of each page. It gets everything between <body> and </body> and writes it to a file. It also takes redirects into account, e.g. if I go to a URL and the response code is 302, it will follow the appropriate link. So far so good.
This all works fine for a number of URLs (maybe 20 or so), but then my script times out because max_execution_time is set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of two workarounds but would like to know if these are a good/bad approach, or if there are better ways.
The first approach is to use a LIMIT on the database query to split the task into 20 rows at a time (i.e. run the script 7 separate times if there were 140 rows). I understand that with this approach it still needs to call the script, download.php, 7 separate times, so it would need to pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then make multiple Ajax requests to it (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could do a query to find the URL in the database, etc. In theory I'd be making 140 separate requests, since it's a one-request-per-URL setup.
I've read some other posts which pointed to queueing systems, but these are beyond my knowledge. If this is the best way, is there a particular system worth looking at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time. So I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic. If the script is running OK and just needs more time to finish, giving it more time is not a poor solution. What you are suggesting makes things more complicated and will not scale well as your URLs increase.
I would suggest moving your script to the command line where there is no time limit and not using the browser to execute it.
When you have a list of unknown size that will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single-page download (like you proposed, download.php?id=X).
From the "main" script, get the list from the database, iterate over it, and send an Ajax call to the script for each entry. Since all the calls fire at once, keep an eye on your bandwidth and CPU time. You can cap it at X active tasks by starting the next call from the success callback.
You can either have the download.php file return success data or save it to a database along with the ID of the website and the result of the call. I recommend the latter, because you can then just leave the main script and collect the results at a later time.
You can't increase the time limit indefinitely, and you can't wait indefinitely for the request to complete, so you need "fire and forget", and that's what asynchronous calls do best.
As #apokryfos pointed out, depending on the timing of this sort of "backup" you could fit it into a task scheduler (like cron). If you call it on demand, put it in a GUI; if you call it every X time, set up a cron task pointing at the main script, and it will do the same thing.
What you are describing sounds like a job for the console. The browser is for the users to see, but your task is something that the programmer will run, so use the console. Or schedule the file to run with a cron-job or anything similar that is handled by the developer.
Execute all the requests simultaneously using stream_socket_client() and save all the socket handles in an array.
Then loop over that array with stream_select() to read the responses as they arrive.
It's almost like multi-tasking within PHP.
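A minimal sketch of that approach, assuming plain HTTP on port 80 and placeholder host names. The connects here are done one at a time for simplicity; waiting for the responses is the part that happens in parallel:
<?php
// Fire several HTTP requests, then read whichever sockets have data via stream_select().
$hosts     = ['example.com', 'example.org', 'example.net']; // placeholders
$sockets   = [];
$responses = [];

foreach ($hosts as $host) {
    $sock = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 10);
    if ($sock === false) {
        echo "Could not connect to {$host}: {$errstr}\n";
        continue;
    }
    fwrite($sock, "GET / HTTP/1.0\r\nHost: {$host}\r\nConnection: close\r\n\r\n");
    stream_set_blocking($sock, false);
    $sockets[$host]   = $sock;
    $responses[$host] = '';
}

// Wait on all sockets at once until every server has closed its connection.
while ($sockets) {
    $read   = array_values($sockets);
    $write  = null;
    $except = null;
    $ready  = stream_select($read, $write, $except, 10);
    if ($ready === false || $ready === 0) {
        break; // error, or nothing ready within 10s; give up on the rest
    }
    foreach ($read as $sock) {
        $host  = array_search($sock, $sockets, true);
        $chunk = fread($sock, 8192);
        if ($chunk === '' || $chunk === false) { // remote side closed the connection
            fclose($sock);
            unset($sockets[$host]);
        } else {
            $responses[$host] .= $chunk;
        }
    }
}

foreach ($responses as $host => $body) {
    echo "{$host}: " . strlen($body) . " bytes received\n";
}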

Php multiple script queuing?

I have an HTML page which runs a PHP script several times (about 20).
Something like
for (var i = 0; i < 20; i++) {
    $.get('script.php?id=' + i, function(){ /* ... */ });
}
Every time the script runs, it fetches some content from a different website, so each call takes 1 to 10 seconds to complete and respond.
I want to run all the scripts simultaneously to make this faster, but two things slow it down. The first is on the HTML page: AJAX requests seem to be queued after the first five, at least according to Chrome's developer tools (this can probably be fixed easily; I haven't looked for a solution yet). The second is on the PHP side: even if the first five scripts are triggered together, they run sequentially, and not even in order. I've put some
echo microtime(true);
around the script to find out where it is slow, and what I found (somewhat to my surprise) is that the timestamp at the beginning of each script (which should be almost the same for all of them) differs considerably: the difference can be as much as 10 seconds, as if the second script waits for the first to end before beginning. How can I have all the scripts running at the same time? Thank you.
I very frankly advise that you should not attempt anything "multi-threaded" here, "nor particularly 'fancy.'"
Design your application to have some kind of queue (it can be a simple array) of "things that need to be done." Then, issue "some n" number of initial AJAX requests: certainly no more than 3 to 5.
Now, wait for notification that each request has succeeded or failed. After somehow recording the status of the request, de-queue another request and start it.
In this way, "n requests, but no more than n requests," will be active at any one time, and yes, they will each take a different amount of time, "but nobody cares."
"JavaScript is not multi-threaded," and we have no need for it. Everything is done by events, which are handled one at a time, and that happens to work beautifully.
Your program runs until the queue is empty and all of the outstanding AJAX requests have been completed (or, have failed).
There's absolutely no advantage in "getting too fancy" when the requests that you are performing might each take several seconds to complete. There's also no particular advantage in trying to queue-up too many TCP/IP network requests. Predictable Simplicity, in this case, will (IMHO) rule the day.

Can 1330316 AJAX requests crash my server?

I'm building a small PHP/JavaScript app which will do some processing for all cities in all US states. This adds up to a total of (52 x 25583) = 1330316 or fewer items that will need to be processed.
The processing of each item will take about 2-3 seconds, so it's possible that the user could have to stare at this page for 1-2 hours (or at least keep it minimized while doing other things).
In order to give the user the maximum feedback, I was thinking of controlling the processing of the page via javascript, basically something like this:
var current = 1;
var max = userItems.length; // 1330316 or less

process();

function process()
{
    if (current >= max)
    {
        alert('done');
        return;
    }
    $.post("http://example.com/process", {id: current}, function()
    {
        $("#current").html(current);
        current++;
        process();
    });
}
In the HTML I will have the following status message, which will be updated whenever the process() function is called:
<div id="progress">
Please wait while items are processed.
<span id="current">0</span> / <span id="max">1330316</span> items have been processed.
</div>
Hopefully you can all see how I want this to work.
My only concern is: if those 1330316 requests hit the server, is there a possibility that this crashes/brings down the server? And if so, would putting an extra wait into the server-side PHP code with sleep(3); make things better?
Or is there a different mechanism for showing the user the rapid feedback such as polling which doesn't require me to mess with apache or the server?
If you can place a cron job on the server, I believe it would work much better. What about using a cron job to do the actual processing and JavaScript to poll the status periodically (say, every 10 seconds)?
Then, the first step would be to set some flag that the cron job's PHP will check. If it's active, the task must be performed (you could use a temporary file to tell the script which records must be processed).
The cronjob would do the task and then, when its iteration is complete, turn off the flag.
This way, the user can even close your application and check it back later, and the server will handle all the processing, uninterrupted by client activity.
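A speculative sketch of that flag-plus-cron idea; the flag file, table, and processCity() helper are all invented for illustration:
<?php
// Cron runs this every few minutes. The web app creates process.flag to start
// the work; the script clears the flag once everything has been processed.
$flagFile = __DIR__ . '/process.flag';
if (!file_exists($flagFile)) {
    exit; // nothing to do on this run
}

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query("SELECT id, city FROM items WHERE processed = 0 LIMIT 500")
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    // processCity($row['city']); // placeholder for the 2-3 second job per item
    $pdo->prepare("UPDATE items SET processed = 1 WHERE id = ?")->execute([$row['id']]);
}

// Turn the flag off once nothing is left, so later cron runs exit immediately.
$remaining = (int)$pdo->query("SELECT COUNT(*) FROM items WHERE processed = 0")->fetchColumn();
if ($remaining === 0) {
    unlink($flagFile);
}
The JavaScript side then only needs to poll a tiny status endpoint that returns the processed/total counts.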
Putting a sleep inside your server-side PHP script can only make things worse. It leads to more processes sticking around, which increases the number of parallel working/sleeping processes and therefore the memory usage.
Don't fear that so many requests will actually be processed in parallel. Usually an Apache server is configured to handle no more than 150 requests in parallel. A well-configured server does not process more requests in parallel than it has resources for (good administrators do some calculations beforehand). The other requests have to wait, and given your request count it's probable that they would time out before being processed.
Your concerns should, however, be about client-side resources, but it looks like your script only starts a new request when the previous one has returned. By the way, well-behaving HTTP clients (which your browser should be) start no more than 6 requests in parallel to the same IP.
Update: Besides the above, you should seriously consider redesigning your approach to mass-processing (similar to what #Joel suggested), but that should go in another question.

Avoid PHP execution time limit

I need to create a script in PHP which performs permutations of numbers. But PHP has an execution time limit set to 60 seconds. How can I run the script so that, if it needs more than 60 seconds, it is not interrupted by the server? I know I can change the maximum execution time limit in PHP, but I want to hear another approach that doesn't require knowing the execution time of the script in advance.
A friend suggested that I sign in and log out of the server frequently, but I have no idea how to do this.
Any advice is welcome. An example code would be useful.
Thanks.
First I need to enter a number, let's say 25. After this the script is launched and it needs to do the following: for every number <= 25 it creates a file with the numbers generated at the current stage; for the next number it opens the previously created file and creates another file based on the lines of the opened one, and so on. Because this takes too long, I need to keep the script from being interrupted by the server.
#emanuel:
I guess when your friend suggested "signing in and logging out of the server frequently", they must have meant "split your script's computation into x pieces of work and run them separately".
For example, the script below redirects to itself until it has run 150 times, computing 150! (a factorial) and then showing the result:
// script name: calc.php
<?php
session_start();
if (!isset($_SESSION['times'])) {
    $_SESSION['times']  = 1;
    $_SESSION['result'] = 1;      // start at 1, not 0, or every product stays 0
    header('Location: calc.php'); // reload to perform the next step
} elseif ($_SESSION['times'] < 150) {
    $_SESSION['times']++;
    $_SESSION['result'] = $_SESSION['result'] * $_SESSION['times'];
    header('Location: calc.php');
} elseif ($_SESSION['times'] == 150) {
    echo "The result is: " . $_SESSION['result'];
    die();
}
?>
BTW (#Davmuz), you can only use the set_time_limit() function on Apache servers; it is not a valid function on Microsoft IIS servers.
set_time_limit(0)
You could try to put the calls you want to make in a queue, which you serialize to a file (or a memory cache?) whenever an operation is done. Then you could use a cron daemon to execute this queue every sixty seconds, so it keeps doing the work and eventually finishes the task.
The drawbacks of this approach are problems with adding to the queue, file locking and the like, and if you need the results immediately, this can prove troublesome. If you are adding the results to a DB, it might work out. Also, it is not very efficient.
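For what it's worth, a loose sketch of such a file-backed queue, with flock() guarding against the cron run and a web request touching the file at the same time; queue.dat and doJob() are invented names:
<?php
// Run from cron every minute: pop a few jobs from the serialized queue file and execute them.
$queueFile = __DIR__ . '/queue.dat';

$fp = fopen($queueFile, 'c+');
if (!flock($fp, LOCK_EX)) {
    exit; // another process holds the queue; try again on the next run
}

$raw   = stream_get_contents($fp);
$queue = $raw ? unserialize($raw) : [];
if (!is_array($queue)) {
    $queue = []; // empty or corrupted file
}

// Work through a handful of jobs per run to stay well under any time limit.
$batch = array_splice($queue, 0, 10);
foreach ($batch as $job) {
    // doJob($job); // placeholder for the real operation
}

// Write back whatever is left in the queue.
ftruncate($fp, 0);
rewind($fp);
fwrite($fp, serialize($queue));
flock($fp, LOCK_UN);
fclose($fp);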
Use set_time_limit(0), but you have to disable safe_mode:
http://php.net/manual/en/function.set-time-limit.php
I suggest using a fixed limit (set_time_limit(300)), because if there is a problem in the script (an endless loop or a memory leak) it cannot run away forever.
The web server, such as Apache, also has a maximum time limit of 300 seconds, so you may have to change that too. If you want to build a Comet application, it may be better to choose a web server other than Apache that can handle long-running requests.
If you need a long execution time for a heavy algorithm, you can also implement parallel processing: http://www.google.com/#q=php+parallel+processing
Or store the input data and compute it with a separate external script run by cron or something similar.

crawling scraping and threading? with php

I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute, which crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels deep: the main page (www.url.com) and the links on that page (www.url.com/post1, www.url.com/post2).
My problem is that as I start to get a larger collection of blogs, they are only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just allow the scripts to process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; the DB insertions and HTML parsing are orders of magnitude faster.
You create a list of the blogs you want to scrape, send them out to curl_multi, wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
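A hedged example of the curl_multi pattern described above: fetch several blog pages in parallel, then handle each response serially once everything has returned. The URL list and parseAndStoreLinks() are placeholders:
<?php
$urls = [
    'http://blog-one.example/',
    'http://blog-two.example/',
    'http://blog-three.example/',
];

$multi   = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}

// Run all transfers at once; curl does the network waiting in parallel.
do {
    $status = curl_multi_exec($multi, $running);
    if ($running) {
        curl_multi_select($multi); // sleep until there is activity on any handle
    }
} while ($running && $status === CURLM_OK);

// Now process the results one by one (parsing and DB inserts are fast by comparison).
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // parseAndStoreLinks($url, $html); // placeholder for the parsing/DB step
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);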
pseudo code for running parallel scanners:
start_a_scan(){
    // Start a MySQL transaction (needs InnoDB afaik)
    BEGIN

    // Get the first entry that has timed out and is not being scanned by someone
    // (and acquire an exclusive lock on the affected row)
    $row = SELECT * FROM scan_targets
           WHERE being_scanned = false AND (scanned_at + 60) < (NOW()+0)
           ORDER BY scanned_at ASC
           LIMIT 1
           FOR UPDATE

    // let everyone know we're scanning this one, so they'll keep out
    UPDATE scan_targets SET being_scanned = true WHERE id = $row['id']

    // Commit the transaction
    COMMIT

    // scan
    scan_target($row['url'])

    // update the entry's state so it can be scanned again in the future
    UPDATE scan_targets SET being_scanned = false, scanned_at = NOW()
    WHERE id = $row['id']
}
You'd probably also need a 'cleaner' that periodically checks for any aborted scans hanging around and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more in the MySQL documentation on locking reads (SELECT ... FOR UPDATE).
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open-source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc. I use it myself.
Due to how PHP works, it seems I cannot just allow the scripts to process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. The script execution time is a safety measure, which you can simply disable for your CLI scripts.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl, as sketched below.
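A minimal sketch of that per-site partitioning, where each cron entry runs the script with a different site argument so two instances never touch the same rows. The table, columns, and crawlBlog() are made up for illustration:
<?php
if (!isset($argv[1])) {
    fwrite(STDERR, "Usage: php crawl.php <site-url>\n");
    exit(1);
}
$site = $argv[1];

$pdo  = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$stmt = $pdo->prepare("SELECT id, url FROM links WHERE site = ? AND crawled = 0");
$stmt->execute([$site]);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $link) {
    // crawlBlog($link['url']); // placeholder for the actual fetch/parse step
    $pdo->prepare("UPDATE links SET crawled = 1 WHERE id = ?")->execute([$link['id']]);
}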
CLI scripts are not limited by max execution time. Memory limits are not normally a problem unless you have large data sets in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)
