Background:
I have two pages (index.php and script.php).
I have a jQuery function that calls script.php from index.php.
script.php will do a ton of data processing and then return that data back to index.php so that it can be displayed to the user.
Problem:
index.php appears to be timing out because script.php is taking too long to finish. script.php can sometimes take up to 4 hours to finish processing before it can return the data to index.php.
The reason I say index.php is timing out is because it never updates and just sits there with an hourglass, even after script.php stops processing.
I know for sure that script.php does finish processing successfully, because I'm writing the output to a log file as well and can see that everything is being processed.
If there is not much data to be processed by script.php, then index.php updates as it is supposed to.
I'm not setting any timeout values within the function inside index.php when calling script.php.
Is there a better way to get index.php to update after waiting a very long time for script.php to finish? I'm using Firefox, so could it be a Firefox issue?
Do you seriously want an ajax call to take four hours to respond? That makes little sense in the normal way the web and browsers work. I'd strongly suggest a redesign.
That said, jQuery's $.ajax() call has a timeout value you can set as an option, described here: http://api.jquery.com/jQuery.ajax/. I have no idea whether the browser will allow it to be set as long as four hours and still operate properly. In any case, keeping a browser connection open and live for four hours is not a high-probability operation. If there's a momentary hiccup, what are you going to do, start all over again? This is just not a good design.
What I would suggest as a redesign is that you break the problem up into smaller pieces that can be satisfied in much shorter ajax calls. If you really want a four hour operation, then I'd suggest you start the operation with one ajax call and then you poll every few minutes from the browser to inquire when the job is done. When it is finally done, then you can retrieve the results. This would be much more compatible with the normal way that ajax calls and browsers work and it wouldn't suffer if there is a momentary internet glitch during the four hours.
If possible, your first ajax call could also return an approximation for how long the operation might take which could provide some helpful UI in the browser that is waiting for the results.
Here's a possibility:
Step 1. Send ajax call requesting that the job start. Immediately receive back a job ID and any relevant information about the estimated time for the job.
Step 2. Calculate a polling interval based on the estimated time for the job. If the estimate is four hours and the estimates are generally accurate, then you can set a timer and ask again in two hours. When asking for the data, you send the job ID returned by the first ajax call.
Step 3. As you near the estimated time of completion, you can narrow the polling interval down to perhaps a few minutes. Eventually, you will get a polling request that says the data is done and it returns the data to you. If I was designing the server, I'd cache the data on the server for some period of time in case the client has to request it again for any reason so you don't have to repeat the four hour process.
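To make the polling concrete, here is a rough sketch of what the status endpoint on the PHP side could look like. Everything here is an assumption for illustration: the job_status.php file name, the jobs table, and its id, status and result columns. The long-running job would be started elsewhere and update that row as it works.

<?php
// job_status.php -- hypothetical polling endpoint (names are made up for this
// sketch). The browser sends the job ID it received from the first ajax call
// and gets back either "still running" or the finished data.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$jobId = isset($_GET['job_id']) ? (int) $_GET['job_id'] : 0;

$stmt = $pdo->prepare('SELECT status, result FROM jobs WHERE id = ?');
$stmt->execute([$jobId]);
$job = $stmt->fetch(PDO::FETCH_ASSOC);

header('Content-Type: application/json');

if (!$job) {
    echo json_encode(['status' => 'unknown']);
} elseif ($job['status'] !== 'done') {
    // Still running: the browser should poll again after its next interval.
    echo json_encode(['status' => $job['status']]);
} else {
    // Finished: the result stays cached in the table, so asking again
    // never repeats the four-hour run.
    echo json_encode(['status' => 'done', 'data' => json_decode($job['result'])]);
}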
Oh, and then I'd think about changing the design on my server so that nothing that is requested regularly ever takes four hours. It should either be pre-built and pre-cached in some sort of batch fashion (e.g. a couple of times a day), or the whole scheme should be redone so that common queries can be satisfied in less than a minute rather than four hours.
Would it be possible to periodically send a response back to index.php just to "keep it alive"? If not, perhaps split up your script into a few smaller scripts, and run them in chunks of an hour at a time as opposed to the 4 hours you mentioned above.
I have a script that parses a large amount of historical data through MySQL queries and other PHP manipulation.
Essentially it is a "Order Report" generator that outputs a lot of data and forms it into an array, and then finally encoding it into json and saving it on my server.
I would like to optimize these more, but generating a 30-day report takes about 10-15 seconds, 90 days takes 45 seconds to a minute, 180 days anywhere from 2-3 minutes, and an entire year typically takes over 5 minutes for the script to finish.
Like eBay and Amazon, I am thinking that, due to these loading times, a "file queue" system is needed, as the user isn't going to want to sit through a 5-minute loading screen waiting for a yearly report.
I understand AJAX requests and how this can be done, and I have even read about hidden iframes that perform the request. My concern is that if the user changes the page (with a normal AJAX request) or exits the browser, the script execution will not finish.
My only other thought would be to have a cron job along with a MySQL table that the user's request is inserted into, but thinking forward, the cron would have to run about once every minute, all day, every single day, non-stop, to check whether a report request was issued.
This seems plausible, but perhaps not ideal, as having a cron run every minute 24/7/365 seems like it would add a fair amount of overhead for something that runs continuously (especially since reports would not be generated very often).
Can anyone recommend/suggest a route of action to take that won't be a burden to my server, and will still run and complete regardless of user action?
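For what it's worth, a rough sketch of the cron-plus-queue-table idea described above might look like the following. The report_queue table, its columns, the output path, and the generate_report() function are all placeholders standing in for your own schema and report code.

<?php
// worker.php -- run from cron (e.g. every minute). If the queue is empty the
// run exits almost immediately, so the per-minute cron costs very little.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$job = $pdo->query(
    "SELECT id, params FROM report_queue WHERE status = 'pending' LIMIT 1"
)->fetch(PDO::FETCH_ASSOC);

if (!$job) {
    exit; // nothing queued this minute
}

// Mark the request as running so an overlapping cron run won't pick it up too.
$pdo->prepare("UPDATE report_queue SET status = 'running' WHERE id = ?")
    ->execute([$job['id']]);

// Stand-in for the existing report generator; it returns the JSON to save.
function generate_report(array $params)
{
    return json_encode(['params' => $params, 'generated_at' => date('c')]);
}

$json = generate_report(json_decode($job['params'], true) ?: []);
file_put_contents("/path/to/reports/report-{$job['id']}.json", $json);

$pdo->prepare("UPDATE report_queue SET status = 'done' WHERE id = ?")
    ->execute([$job['id']]);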
I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URLs from my database and then uses curl to grab a copy of each page. It then gets everything between <body> and </body> and writes it to a file. It also takes redirects into account, e.g. if I go to a URL and the response code is 302, it will follow the appropriate link. So far so good.
This all works OK for a number of URLs (maybe 20 or so), but then my script times out due to the max_execution_time being set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of 2 workarounds but would like to know if these are a good or bad approach, or if there are better ways.
The first approach is to use a LIMIT on the database query such that it splits the task up into 20 rows at a time (i.e. run the script 7 separate times, if there were 140 rows). I understand that with this approach the script, download.php, still needs to be called 7 separate times, so I would need to pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then do multiple AJAX requests to it (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could do a query to find the URL in the database, etc. In theory I'd be doing 140 separate requests, as it's a one-request-per-URL setup.
I've read some other posts which have pointed to queueing systems, but these are beyond my knowledge. If this is the best way then is there a particular system which is worth taking a look at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time. So I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic. If the script is running OK and just needs more time to finish, giving it more time is not a poor solution. What you are suggesting makes things more complicated and will not scale well as your URLs increase.
I would suggest moving your script to the command line, where there is no time limit, rather than using the browser to execute it.
When you have a list of unknown size which will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single page download (like you proposed, download.php?id=X).
From the "main" script get the list from the database, iterate over it and send an ajax call to the script for each one. As all the calls will be fired all at once, check for your bandwidth and CPU time. You could break it into "X active task" using the success callback.
You can either set the download.php file to return success data or to save it to a database with the id of the website and the result of the call. I recommend the later because you can then just leave the main script and grab the results at a later time.
You can't increase the time limit indefinitively and can't wait indefinitively time to complete the request, so you need a "fire and forget" and that's what asynchronous call does best.
As #apokryfos pointed out, depending on the timing of this sort of "backups" you could fit this into a task scheduler (like chron). If you call it "on demand", put it in a gui, if you call it "every x time" put a chron task pointing the main script, it will do the same.
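A rough sketch of what download.php could look like under this approach, assuming a urls table (id, url) and a url_results table (url_id, body); those names, like the rest of the snippet, are just illustrative:

<?php
// download.php?id=X -- fetch one URL and store the result keyed by its id,
// so the "main" page can collect everything later. Table names are made up.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$stmt = $pdo->prepare('SELECT url FROM urls WHERE id = ?');
$stmt->execute([$id]);
$url = $stmt->fetchColumn();

if (!$url) {
    http_response_code(404);
    exit;
}

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,  // follow 302s, as in the original script
    CURLOPT_TIMEOUT        => 20,    // stop one slow site from hanging the call
]);
$html = curl_exec($ch);
curl_close($ch);

// Keep only what sits between <body> and </body>, as the question describes.
$body = '';
if ($html !== false && preg_match('#<body[^>]*>(.*?)</body>#is', $html, $m)) {
    $body = $m[1];
}

$pdo->prepare('REPLACE INTO url_results (url_id, body) VALUES (?, ?)')
    ->execute([$id, $body]);

header('Content-Type: application/json');
echo json_encode(['id' => $id, 'ok' => $body !== '']);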
What you are describing sounds like a job for the console. The browser is for the users to see, but your task is something that the programmer will run, so use the console. Or schedule the file to run with a cron-job or anything similar that is handled by the developer.
Execute all the requests simultaneously using stream_socket_client(). Save all the socket IDs in an array.
Then loop through the array of IDs with stream_select() to read the responses.
It's almost like multi-tasking within PHP.
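A rough sketch of that approach; the host list is made up, and a real script would want better error handling, but it shows the shape of firing the requests together and then collecting the responses with stream_select():

<?php
// Open one socket per host, send the request straight away, then read all the
// responses as they arrive. Hosts here are placeholders.
$hosts = ['example.com', 'example.org', 'example.net'];

$sockets = [];
foreach ($hosts as $host) {
    $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 10);
    if ($s === false) {
        continue; // couldn't connect; skip this one
    }
    fwrite($s, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    stream_set_blocking($s, false);
    $sockets[$host] = $s;
}

$responses = array_fill_keys(array_keys($sockets), '');

while ($sockets) {
    $read   = array_values($sockets);
    $write  = null;
    $except = null;

    // Wait until at least one socket has data; give up after 30 seconds.
    $ready = stream_select($read, $write, $except, 30);
    if ($ready === false || $ready === 0) {
        break;
    }

    foreach ($read as $s) {
        $host  = array_search($s, $sockets, true);
        $chunk = fread($s, 8192);
        if ($chunk !== false && $chunk !== '') {
            $responses[$host] .= $chunk;
        }
        if (feof($s)) {            // server closed the connection: response done
            fclose($s);
            unset($sockets[$host]);
        }
    }
}
// $responses now holds the raw HTTP response (headers + body) per host.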
I have an HTML page which runs a PHP script several times (about 20).
Something like:
for(i=0;i<20; i++){
$.get('script.php?id='+i,function(){...});
}
Every time the script runs, it gets some content from a different website, so each request takes from 1 to 10 seconds to complete and return a response.
I want to run all the scripts simultaneously to be faster, but two things make it slow. The first is on the HTML page: it seems that AJAX requests are queued after the first five, at least according to the Chrome developer tools (this can be fixed easily, I think; I've not yet bothered to find a solution). The second is on the PHP side: even if the first 5 scripts are triggered together, they run sequentially, and not even in order. I've put some
echo microtime(true);
calls around the script to find out where it is slow, and what I found out (a bit surprised) is that the time at the beginning of the script (which should be almost the same for every request) is different: the difference is substantial, as much as 10 seconds, as if the second script waits for the first to end before beginning. How can I have all the scripts running together at the same time? Thank you.
I very frankly advise that you should not attempt anything "multi-threaded" here, "nor particularly 'fancy.'"
Design your application to have some kind of queue (it can be a simple array) of "things that need to be done." Then, issue "some n" number of initial AJAX requests: certainly no more than 3 to 5.
Now, wait for notification that each request has succeeded or failed. After somehow recording the status of the request, de-queue another request and start it.
In this way, "n requests, but no more than n requests," will be active at any one time, and yes, they will each take a different amount of time, "but nobody cares."
"JavaScript is not multi-threaded," and we have no need for it. Everything is done by events, which are handled one at a time, and that happens to work beautifully.
Your program runs until the queue is empty and all of the outstanding AJAX requests have been completed (or, have failed).
There's absolutely no advantage in "getting too fancy" when the requests that you are performing might each take several seconds to complete. There's also no particular advantage in trying to queue-up too many TCP/IP network requests. Predictable Simplicity, in this case, will (IMHO) rule the day.
I am attempting to build a script that will log data that changes every 1 second. The initial thought was "Just run a php file that does a cURL every second from cron" -- but I have a very strong feeling that this isn't the right way to go about it.
Here are my specifications:
There are currently 10 sites I need to gather data from and log to a database -- this number will invariably increase over time, so the solution needs to be scalable. Each site has data that it spits out to a URL every second, but only keeps 10 lines on the page, and they can sometimes spit out up to 10 lines each time, so I need to pick up that data every second to ensure I get all the data.
As I will also be writing this data to my own DB, there's going to be I/O every second of every day for a considerably long time.
Barring magic, what is the most efficient way to achieve this?
It might help to know that the data I am getting every second is very small, under 500 bytes.
The most efficient way is to NOT use cron, but instead make an app that just always runs, keeps the curl handles open, and repeats the requests every second. This way, the connections will be kept alive almost forever and the repeated requests will be very fast.
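A rough sketch of such an always-running collector (meant to be started from the command line, not cron); the site list and the save_to_db() stub are placeholders:

<?php
// Reuse one curl handle per site so the connections stay alive between
// iterations, then poll them all once a second.
$sites = [
    'http://example.com/feed1',  // hypothetical endpoints
    'http://example.com/feed2',
];

function save_to_db($url, $data)
{
    // Placeholder: replace with your real INSERT.
    file_put_contents('poll.log', date('c') . " $url " . strlen($data) . " bytes\n", FILE_APPEND);
}

$handles = [];
foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    ]);
    $handles[$url] = $ch;
}

while (true) {
    $start = microtime(true);

    foreach ($handles as $url => $ch) {
        $data = curl_exec($ch);      // connection is reused on every iteration
        if ($data !== false) {
            save_to_db($url, $data);
        }
    }

    // Sleep out whatever is left of the second before the next round.
    $elapsed = microtime(true) - $start;
    if ($elapsed < 1) {
        usleep((int) ((1 - $elapsed) * 1000000));
    }
}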
However, if the target servers aren't yours or your friends', there's a good chance they will not appreciate you hammering them.
I have a cron job that for the time being runs once every 20 minutes, but ultimately will run once a minute. This cron job will process potentially hundreds of functions, each of which grabs an XML file remotely, processes it, and performs its tasks. The problem is that, due to the speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without [a] the script timing out, [b] overloading the server, or [c] overlapping and not completing its task for that minute before it runs again (would that error out)?
Unfortunately caching isn't an option, as the data changes in near real-time and comes from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the processing script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
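A rough sketch of that split, with the feed URL and the spool directory invented for the example; in practice the two functions would live in two separate scripts run on their own schedules:

<?php
// Fetch half: only downloads, never processes. Each run drops a timestamped
// file into a spool directory (the URL and directory are placeholders).
function fetch_latest()
{
    $xml = @file_get_contents('http://example.com/feed.xml');
    if ($xml !== false) {
        file_put_contents('/tmp/feeds/feed-' . time() . '.xml', $xml);
    }
}

// Process half: always works on the newest file the fetcher has produced,
// however long either script happens to take.
function process_newest()
{
    $files = glob('/tmp/feeds/feed-*.xml');
    if (!$files) {
        return;
    }
    rsort($files);                  // newest timestamp first
    $xml = simplexml_load_file($files[0]);
    if ($xml !== false) {
        // ... actual processing of $xml goes here ...
    }
}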
Have a stack that you keep all the jobs on, and have a handful of threads whose job it is to:
Pop a job off the stack.
Check whether you need to refresh the XML file (check for ETags, Expires headers, etc.).
Grab the XML if need be (this is the bit that could take the time, hence spreading the load over threads). This should time out if it takes too long and flag the failure to someone, as you might have a site that's down, a dodgy RSS generator, or whatever.
Then process it.
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (it would help if you could store the last ETag for each file, etc.).
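For the "do I need to refresh it at all" check, here is a rough sketch using a stored ETag and a conditional GET; where the last ETag is kept (here a file per feed under a made-up /tmp/etags directory) is just an assumption for the example:

<?php
// Returns the feed body when it has changed, or null when the server answers
// 304 Not Modified (or the fetch fails). Assumes /tmp/etags exists.
function fetch_if_changed($url)
{
    $etagFile = '/tmp/etags/' . md5($url);
    $headers  = [];

    if (is_file($etagFile)) {
        $headers[] = 'If-None-Match: ' . trim(file_get_contents($etagFile));
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_TIMEOUT        => 10,   // give up on slow or dead sites
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use ($etagFile) {
            // Remember the new ETag for the next run.
            if (stripos($line, 'ETag:') === 0) {
                file_put_contents($etagFile, trim(substr($line, 5)));
            }
            return strlen($line);
        },
    ]);

    $body = curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ($code === 200 && $body !== false) ? $body : null;
}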
One tip: don't expect any of them to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS RegExp reader, which does a damn fine job of reading most RSS feeds.
Addition: I would say hitting the same sites every minute is not really playing nice with those servers, and it creates a lot of work for your own server. Do you really need to hit them that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing to ensure you are not unnecessarily grabbing feeds before they change. <ttl> holds the update period. So if a feed has <ttl>60</ttl> then it should be only updated every 60 minutes.
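A minimal sketch of honoring <ttl>; the feed URL is a placeholder, and $lastFetched and $storedTtl stand in for values you would keep in your database from the previous fetch:

<?php
// Only re-fetch the feed once its own update period has elapsed.
$feedUrl     = 'http://example.com/feed.xml';
$lastFetched = time() - 1800;   // placeholder: stored time of the last fetch
$storedTtl   = 60;              // placeholder: stored <ttl> in minutes

if (time() - $lastFetched >= $storedTtl * 60) {
    $xml = simplexml_load_file($feedUrl);
    if ($xml !== false) {
        // <ttl> sits inside <channel>; fall back to 60 minutes if it is missing.
        $storedTtl   = isset($xml->channel->ttl) ? (int) $xml->channel->ttl : 60;
        $lastFetched = time();
        // ... process the feed items and save the two values back ...
    }
}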