G'day all,
This is actually the first question I have asked; I use Stack Overflow religiously with its awesome search function, but I have come to a stop here.
I've been writing a bit of PHP code that takes the user's input for Australian airports, fetches the PDFs relevant to the aircraft type (for whatever reason the publisher releases them as single PDFs), and puts them into one PDF file. I've got it working reasonably smoothly now, but the last hitch in the plan is that when you put in lots of airfields (or ones with lots of PDFs) it exceeds the max_execution_time and gives me a 500 Internal Server Error. Unfortunately I'm on GoDaddy's shared hosting and can't change this, either in the php.ini or in a script with set_time_limit(). This guy had the same problem and I have come out as fruitless as he did: PHP GoDaddy maximun execution time not working
Anyway, apart from switching my hosting, my only thought is to break up the PHP code so it doesn't run all at once. The only problem is that I'm running a foreach loop and I haven't the faintest idea where to start.
Here is the code I have for saving the PDFs:
foreach ($pos as $po) {
    // Download each chart PDF and write it to the temp directory
    file_put_contents(
        "/dir/temp/$chartNumber$po",
        file_get_contents("http://www.airservicesaustralia.com/aip/current/dap/$po")
    );
    $chartNumber++;
}
The array $pos is generated by a regex search of the website and takes very little time; it's the saving of the PDF files that kills me, and if it manages to get them all, the combining can take a bit of time as well with this code:
exec("gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=/dir/finalpdf/$date.pdf /dir/temp/*.pdf");
My question is: is there any way I can run each iteration of the foreach loop in a separate script, and then pick up where I left off? Or is it time to get new hosting?
Cheers in advance!
My suggestion would be to use AJAX requests, splitting the work into one request per file.
Here's how I would approach it:
Make a request to generate the $pos array and return it as JSON.
Make a request to generate each file, passing $po and its position in the array (assuming that's the $chartNumber); see the sketch after this list.
Check in jQuery whether the last file was generated (returned true), then call the script that writes the final file and returns the filename for download.
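A minimal sketch of the per-file endpoint (getpdf.php, its parameters and the JSON response are just placeholders; the download URL and temp directory come from the question):

<?php
// getpdf.php?po=...&chartNumber=... : download one chart per request
$po = basename($_GET['po']);             // basename() keeps the write inside /dir/temp
$chartNumber = (int) $_GET['chartNumber'];

$pdf = file_get_contents("http://www.airservicesaustralia.com/aip/current/dap/$po");
$ok = $pdf !== false
    && file_put_contents("/dir/temp/$chartNumber$po", $pdf) !== false;

header('Content-Type: application/json');
echo json_encode(array('po' => $po, 'ok' => $ok));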
But of course the best solution would be to switch to cloud hosting. I personally use digitalocean.com, where I'm running big PHP fetching scripts without any limitations.
I've taken Edvinas' advice and transferred to digitalocean.com, and the script now runs with no problems whatsoever. I have also managed to reduce the time by downloading the files with parallelcurl, which fetches 5 at a time, so a full 100-page file (larger than I expect I'll ever need) is downloaded and generated in just under 5 minutes. I guess other than hosting the PDFs on my own server (in which case I may miss updates of charts), this is about as quick as I can get it to run.
Thanks for the advice!
Breaking down the operations into batches and running them serially will actually take longer than what you are currently doing. If the performance bottleneck is in creation of the component parts, a better solution would be to generate the parts in parallel.
the combining can take a bit of time as well with this code
Well, the first part of fixing any performance issue should be profiling to identify the bottleneck. Without direct admin access to the host there's not a lot you can do to speed up the execution of a single-line shell script, but if you can run shell commands then you can run a background job outside of the webserver process group.
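For example, a rough sketch of pushing the Ghostscript merge into a background job so the web request isn't held up by it (the command and paths are from the question; discarding the output to /dev/null is an assumption):

// The output redirection plus the trailing "&" lets exec() return immediately,
// while gs keeps running outside the request
exec("gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite "
   . "-sOutputFile=/dir/finalpdf/$date.pdf /dir/temp/*.pdf "
   . "> /dev/null 2>&1 &");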
Related
I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URLs from my database and then uses curl to grab a copy of each page. It then gets everything between <body> and </body> and writes it to a file. It also takes redirects into account, e.g. if I go to a URL and the response code is 302, it will follow the appropriate link. So far so good.
This all works OK for a number of URLs (maybe 20 or so), but then my script times out because max_execution_time is set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of two workarounds, but would like to know if these are good or bad approaches, or if there are better ways.
The first approach is to use a LIMIT on the database query so that the task is split into 20 rows at a time (i.e. run the script 7 separate times if there were 140 rows). I understand that with this approach I still need to call the script, download.php, 7 separate times, so I would need to pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then do multiple Ajax requests to them (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could do a query to find the URL in the database, and so on. In theory I'd be doing 140 separate requests, as it's a one-request-per-URL setup.
I've read some other posts which have pointed to queueing systems, but these are beyond my knowledge. If this is the best way then is there a particular system which is worth taking a look at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time. So I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic. If the script is running OK and it just needs more time to finish, give it more time; that is not a poor solution. What you are suggesting makes things more complicated and will not scale well if your URLs increase.
I would suggest moving your script to the command line, where there is no time limit, rather than executing it via the browser.
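A minimal sketch of such a CLI version (the urls table, its columns, the pages/ output directory and the fetch_page() helper are hypothetical stand-ins for the asker's existing code):

<?php
// run from the shell (or cron) as: php download_all.php
set_time_limit(0); // the CLI has no time limit by default, but make it explicit

$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
foreach ($db->query('SELECT id, url FROM urls') as $row) {
    $body = fetch_page($row['url']);                  // the asker's existing curl logic
    file_put_contents("pages/{$row['id']}.html", $body);
}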
When you have an unknown list which will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single page download (like you proposed, download.php?id=X).
From the "main" script get the list from the database, iterate over it and send an ajax call to the script for each one. As all the calls will be fired all at once, check for your bandwidth and CPU time. You could break it into "X active task" using the success callback.
You can either have the download.php file return success data, or have it save to a database with the id of the website and the result of the call. I recommend the latter, because you can then just leave the main script and grab the results at a later time.
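A minimal sketch of what that download.php could look like, assuming a urls table with id, url and body columns (the table, column names and DB credentials are just placeholders):

<?php
// download.php?id=2 : fetch one URL and store the result keyed by its id
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$id = (int) $_GET['id'];

$stmt = $db->prepare('SELECT url FROM urls WHERE id = ?');
$stmt->execute(array($id));
$url = $stmt->fetchColumn();

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // the asker already handles 302s
$html = curl_exec($ch);
curl_close($ch);

$db->prepare('UPDATE urls SET body = ? WHERE id = ?')->execute(array($html, $id));

header('Content-Type: application/json');
echo json_encode(array('id' => $id, 'ok' => $html !== false));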
You can't increase the time limit indefinitely and can't wait indefinitely for the request to complete, so you need "fire and forget", and that's what asynchronous calls do best.
As @apokryfos pointed out, depending on the timing of this sort of "backup" you could fit this into a task scheduler (like cron). If you call it "on demand", put it behind a GUI; if you call it "every X time", set up a cron task pointing at the main script and it will do the same.
What you are describing sounds like a job for the console. The browser is for the users to see, but your task is something that the programmer will run, so use the console. Or schedule the file to run with a cron-job or anything similar that is handled by the developer.
Execute all the requests simultaneously using stream_socket_client(). Save all the socket IDs in an array.
Then loop through the array of IDs with stream_select() to read the responses.
It's almost like multi-tasking within PHP.
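A rough sketch of that approach, with an example host list (the request details are kept deliberately minimal):

<?php
// Fire several HTTP requests, then multiplex the reads with stream_select()
$hosts = array('example.com', 'example.org', 'example.net');

$sockets = array();
$responses = array();

foreach ($hosts as $host) {
    $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 10);
    if ($s === false) {
        continue; // couldn't connect, skip this one
    }
    // Send a minimal HTTP request, then stop blocking so reads can be multiplexed
    fwrite($s, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    stream_set_blocking($s, false);
    $sockets[(int) $s] = $s;
    $responses[(int) $s] = '';
}

// Keep polling until every socket has been fully read
while ($sockets) {
    $read = $sockets;
    $write = null;
    $except = null;
    if (stream_select($read, $write, $except, 30) === false) {
        break;
    }
    foreach ($read as $s) {
        $id = (int) $s;
        $responses[$id] .= fread($s, 8192);
        if (feof($s)) {
            fclose($s);
            unset($sockets[$id]);
        }
    }
}
// $responses now holds the raw HTTP responses (headers included)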
Mostly I find the answers to my questions on Google, but now I'm stuck.
I'm working on a scraper script, which first scrapes some usernames from a website, then gets all the details of each user. There are two scrapers involved: the first goes through the main page, gets the first name, then gets the details from its profile page, then it moves on to the next page...
The first site I'm scraping has a total of 64 names, displayed on one main page, while the second one has 4 pages with over 365 names displayed.
The first one works great; however, the second one keeps giving me the 500 internal error. I've tried limiting the script to scrape only a few names, which works like a charm, so I'm more than sure that the script itself is OK!
The max_execution_time in my php.ini file is set to 1500, so I guess that's not the problem either; however, something is causing the error...
Not sure if adding a sleep command after every 10 names, for example, will solve my situation, but well, I'm trying that now!
So if any of you have any idea what would help solve this situation, I would appreciate your help!
Thanks in advance,
z
Support said I can raise the memory up to 4 gigabytes.
Typical money-gouging support answer. Save your cash and write better code, because what you are doing could easily be run from the shared server of a free web hosting provider, even with their draconian resource limits.
Get/update the list of users first as one job, then extract the details in smaller batches as another. Use a bulk (multi-row) INSERT to reduce round trips to the database; it also runs much faster than looping through individual INSERTs.
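For example, a rough sketch of a single multi-row INSERT instead of one query per user (the users table, its columns, $db and $users are placeholders):

// $db is an existing PDO connection; $users is one scraped batch (both placeholders)
$users = array(
    array('name' => 'alice', 'details' => '...'),
    array('name' => 'bob',   'details' => '...'),
);

$placeholders = array();
$values = array();
foreach ($users as $user) {
    $placeholders[] = '(?, ?)';
    $values[] = $user['name'];
    $values[] = $user['details'];
}

// One multi-row INSERT instead of one query per user
$sql = 'INSERT INTO users (name, details) VALUES ' . implode(', ', $placeholders);
$db->prepare($sql)->execute($values);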
Usernames and details are essentially a static list, so there is no rush to get all the data in real time. Just nibble away with a cron job fetching the details; eventually the script will catch up with new usernames being added to the incoming list, and you end up with a faster, leaner, more efficient system.
This is definitely a memory issue. One of your variables is growing past the memory limit you have defined in php.ini. If you do need to store a huge amount of data, I'd recommend writing your results to a file and/or DB at regular intervals (and then freeing up your vars) instead of storing them all in memory at run time.
get user details
dump to file
clear vars
repeat..
If you set your execution time to infinity and regularly dump the vars to a file/DB, your PHP script should run fine for hours.
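A rough sketch of that pattern, assuming $usernames is the scraped list and scrape_user() stands in for the asker's per-user scraping (the flush size and output file are arbitrary):

set_time_limit(0);

$batch = array();
foreach ($usernames as $username) {
    $batch[] = scrape_user($username);   // hypothetical: returns one line of user details
    if (count($batch) >= 50) {
        // Flush to disk at regular intervals and free the memory
        file_put_contents('results.txt', implode("\n", $batch) . "\n", FILE_APPEND);
        $batch = array();
    }
}
if ($batch) {
    // Write whatever is left over
    file_put_contents('results.txt', implode("\n", $batch) . "\n", FILE_APPEND);
}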
Has anyone had any luck with fixing the Simple_DOM memory problem? I scoured these forums and found only recommendations for other parsing engines.
My script loops through 20,000 files and extracts one word from each. I have to call the file_get_html function each time.
Moved it to a different server. Same result.
Changed the foreach loop to a while loop.
Increased the memory limit on either server; it won't work.
Yes, you can increase the memory with ini_set(), but that's only if you have the permission to do so.
What I recommend is that when you are going through your loop and complete each task, you unset the variables that contain the large sets of data.
for ($i = 0; $i < 30000; $i++) {
    $file = file_get_contents($some_path . $i);
    // do something, like write to file
    // unset the variables
    unset($file);
}
Of course this is just an example, but you can relate it to your code and make sure every iteration is like running your file for the first time.
Wish you Good luck :)
Seems to me like the approach of processing that much data during a single execution is flawed. In my experience, PHP CLI processes aren't really meant to run for long periods of time and process tons of data. It takes very, very careful memory management to do so. Throw in a leaky 3rd-party script, and you have a recipe for banging your head against a desk.
Maybe instead of attempting to run through all 20k files at once, you could process a few hundred at a time, store the results someplace intermediary, like a MySQL database, and then gather the results once all the files have been processed.
I have a cron job that for the time being runs once every 20 minutes, but ultimately will run once a minute. This cron job will run potentially hundreds of functions that grab an XML file remotely, process it, and perform their tasks. The problem is that, due to the speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without [a] the script timing out, [b] overloading the server, or [c] overlapping and not completing its task for that minute before it runs again (would that error out?)
Unfortunately caching isn't an option, as the data changes in near real-time and comes from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the process script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
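A rough sketch of how the processing script could pick up the newest fetched file (the drop directory and the process_feed() helper are assumptions):

// Find the most recently written XML file in the drop directory
$newest = null;
$newestTime = 0;
foreach (glob('/path/to/xml/*.xml') as $file) {
    if (filemtime($file) > $newestTime) {
        $newestTime = filemtime($file);
        $newest = $file;
    }
}
if ($newest !== null) {
    process_feed($newest); // hypothetical: the existing processing logic
}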
Have a stack that you keep all the jobs on, and a handful of threads whose job it is to:
Pop a job off the stack
Check if you need to refresh the XML file (check for ETags, Expires headers, etc.)
Grab the XML if need be (this is the bit that could take the time, hence spreading the load over threads). This should time out if it takes too long and flag that it did to someone, as you might have a site down, a dodgy RSS generator, or whatever.
Then process it
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (it would help if you could store the last ETag for a file, etc.)
One tip: don't expect any of them to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS RegExp reader, which does a damn fine job of reading most RSS feeds.
Addition: I would say hitting the same sites every minute is not really playing nice with their servers and creates a lot of work for your own; do you really need to hit them that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing to ensure you are not unnecessarily grabbing feeds before they change. <ttl> holds the update period. So if a feed has <ttl>60</ttl> then it should only be updated every 60 minutes.
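For example, a small sketch of checking <ttl> before re-fetching a feed (the URLs, the cache path and the 60-minute fallback are assumptions):

$feedUrl = 'http://example.com/feed.xml';     // placeholder
$cachedCopy = '/path/to/cache/feed.xml';      // placeholder

$stale = true;
if (file_exists($cachedCopy)) {
    $rss = simplexml_load_file($cachedCopy);
    // <ttl> is the update period in minutes; 60 is an assumed fallback if the feed omits it
    $ttl = isset($rss->channel->ttl) ? (int) $rss->channel->ttl : 60;
    $ageMinutes = (time() - filemtime($cachedCopy)) / 60;
    $stale = $ageMinutes >= $ttl;
}
if ($stale) {
    // The cached copy is older than the feed's update period, so fetch it again
    file_put_contents($cachedCopy, file_get_contents($feedUrl));
}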
I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels down into the page, i.e. the main page www.url.com and the links on that page, www.url.com/post1, www.url.com/post2.
My problem is that I'm starting to get a larger collection of blogs. They are only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just let the scripts process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts writing to the DB, written so that they do not overwrite each other but instead queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape and send them out to curl_multi. Wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
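A rough sketch of that pattern with curl_multi (the URL list is just an example; the parsing and DB insertion are left as comments):

<?php
$urls = array(
    'http://www.url.com/post1',
    'http://www.url.com/post2',
);

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all the transfers at once, waiting until every one has finished
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // sleep until there is activity on any handle
    }
} while ($running && $status == CURLM_OK);

// Now process the results serially
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html and insert the collected links into the DB here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);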
Pseudo-code for running parallel scanners, sketched here with PDO:
function start_a_scan(PDO $db) {
    // Start a MySQL transaction (needs InnoDB afaik)
    $db->beginTransaction();

    // Get the first entry that has timed out and is not being scanned by someone
    // (and acquire an exclusive lock on the affected row)
    $row = $db->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = false AND scanned_at < (NOW() - INTERVAL 60 SECOND)
         ORDER BY scanned_at ASC
         LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        $db->commit();
        return; // nothing to scan right now
    }

    // Let everyone know we're scanning this one, so they'll keep out
    $db->prepare("UPDATE scan_targets SET being_scanned = true WHERE id = ?")
       ->execute(array($row['id']));

    // Commit the transaction (this releases the row lock)
    $db->commit();

    // Scan
    scan_target($row['url']);

    // Update the entry's state so it can be scanned again in the future
    $db->prepare("UPDATE scan_targets SET being_scanned = false, scanned_at = NOW() WHERE id = ?")
       ->execute(array($row['id']));
}
You'd probably also need a 'cleaner' that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open-source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc... I use it myself.
Due to how PHP works, it seems I cannot just let the scripts process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Memory limit is only a problem if your code leaks memory. You should fix that, rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl.
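A tiny sketch of that partitioning via a CLI argument (crawl.php and crawl_site() are hypothetical stand-ins for the existing code):

<?php
// crawl.php: run from cron as, e.g., php crawl.php www.url.com
$site = isset($argv[1]) ? $argv[1] : null;
if ($site === null) {
    fwrite(STDERR, "usage: php crawl.php <site>\n");
    exit(1);
}
crawl_site($site); // the existing crawling logic, scoped to one site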
CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once; you would have to post the script for anyone to advise further, though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)