Creating a long TCPDF document without timeout (so long running php process) - php

I'm building a feature of a site that will generate a PDF (using TCPDF) into a booklet of 500+ pages. The layout is very simple but just due to the number of records I think it qualifies as a "long running php process". This will only need to be done a handful of times per year and if I could just have it run in the background and email the admin when done, that would be perfect. Considered Cron but it is a user-generated type of feature.
What can I do to keep my PDF rendering for as long as it takes? I am "good" with PHP but not so much with *nix. Even a tutorial link would be helpful.

Honestly you should avoid doing this entirely from a scalability perspective. I'd use a database table to "schedule" the job with the parameters, have a script that is continuously checking this table. Then use JavaScript to poll your application for the file to be "ready", when the file is ready then let the JavaScript pull down the file to the client.
It will be incredibly hard to maintain/troubleshoot this process while you're wondering why is my web server so slow all of a sudden. Apache doesn't make it easy to determine what process is eating up what CPU.
Also by using a database you can do things like limit the number of concurrent threads, or even provide faster rendering time by letting multiple processes render each PDF page and then re-assemble them together with yet another process... etc.
Good luck!

What you need is to change the allowed maximum execution time for PHP scripts. You can do that by several means from the script itself (you should prefer this if it would work) or by changing php.ini.
BEWARE - Changing execution time might seriously lower the performance of your server. A script is allowed to run only a certain time (30sec by default) before it is terminated by the parser. This helps prevent poorly written scripts from tying up the server. You should exactly know what you are doing before you do this.
You can find some more info about:
setting max-execution-time in php.ini here http://www.php.net/manual/en/info.configuration.php#ini.max-execution-time
limiting the maximum execution time by set_time_limit() here http://php.net/manual/en/function.set-time-limit.php
PS: This should work if you use PHP to generate the PDF. It will not work if you use some stuff outside of the script (called by exec(), system() and similar).

This question is already answered, but as a result of other questions / answers here, here is what I did and it worked great: (I did the same thing using pdftk, but on a smaller scale!)
I put the following code in an iframe:
set_time_limit(0); // ignore php timeout
//ignore_user_abort(true); // optional- keep on going even if user pulls the plug*
while(ob_get_level())ob_end_clean();// remove output buffers
ob_implicit_flush(true);
This avoided the page load timeout. You might want to put a countdown or progress bar on the parent page. I originally had the iframe issuing progress updates back to the parent, but browser updates broke that.

Related

PHP page process separation

If some php page is running some long process such as sleep or while loop that makes it take a while till it loads, does it affect on other processes from the same page ?,
I noticed when i try to open the same page with different short process, it also takes so long to load and to be clear it doesn't load before the first one (long process) does,
is it true or something's wrong with my code and how to prevent it ?
i think it has something to do with cache, i don't wanna mess up though before getting a tip or an answer
PHP run in a single process, each time you access the page, it start the process, process, and finish.
Each process won't affect the others.
I noticed when i try to open the same page with different short
process, [...] it doesn't load before the first one (long process) does
The most common reasons:
Your scripts use PHP sessions "as is", which use file locking. The file locking mechanism ensures only one script at a time can edit the session data of each user, but this does mean that provided two requests from the same user happen simultaneously, a second script will not start before the first has finished if they both rely on sessions (two different users have different session files though, so they can't collide)
The browser automatically detects the page is taking long and delays subsequent requests intentionally in the background — I believe this is something Google Chrome does by default.
Both cases are relatively safe however, because the delay is only present in case the same user is trying to load several pages simultaneously which is not usual — different users will not see delays regardless how long the actual page takes to load.
More on the topic in this excellent SO answer.

max_execution_time Alternative

So here's the lowdown:
The client i'm developing for is on HostGator, which has limited their max_execution_time to 30 seconds and it cannot be overridden (I've tried and confirmed it cannot be via their support and wiki)
What I'm have the code doing is take an uploaded file and...
loop though the xml
get all feed download links within the file
download each xml file
individually loop though each xml array of each file and insert the information of each item into the database based on where they are from (i.e. the filename)
Now is there any way I can queue this somehow or split the workload into multiple files possibly? I know the code works flawlessly and checks to see if each item exists before inserting it but I'm stuck getting around the execution_limit.
Any suggestions are appreciated, let me know if you have any questions!
The timelimit is in effect only when executing PHP scripts through a webserver, if you execute the script from CLI or as a background process, it should work fine.
Note that executing an external script is somewhat dangerous if you are not careful enough, but it's a valid option.
Check the following resources:
Process Control Extensions
And specifically:
pcntl-exec
pcntl-fork
Did you know you can trick the max_execution_time by registering a shutdown handler? Within that code you can run for another 30 seconds ;-)
Okay, now for something more useful.
You can add a small queue table in your database to keep track of where you are in case the script dies mid-way.
After getting all the download links, you add those to the table
Then you download one file and process it; when you're done, you check them off (delete from) from the queue
Upon each run you check if there's still work left in the queue
For this to work you need to request that URL a few times; perhaps use JavaScript to keep reloading until the work is done?
I am in such a situation. My approach is similar to Jack's
accept that execution time limit will simply be there
design the application to cope with sudden exit (look into register_shutdown_function)
identify all time-demanding parts of the process
continuously save progress of the process
modify your components so that they are able to start from arbitrary point, e.g. a position in a XML file or continue downloading your to-be-fetched list of XML links
For the task I made two modules, Import for the actual processing; TaskManagement for dealing with these tasks.
For invoking TaskManager I use CRON, now this depends on what webhosting offers you, if it's enough. There's also a WebCron.
Jack's JavaScript method's advantage is that it only adds requests if needed. If there are no tasks to be executed, the script runtime will be very short and perhaps overstated*, but still. The downsides are it requires user to wait the whole time, not to close the tab/browser, JS support etc.
*) Likely much less demanding than 1 click of 1 user in such moment
Then of course look into performance improvements, caching, skipping what's not needed/hasn't changed etc.

How can I make a scheduler in PHP without the help of cron

How can I make a scheduler in PHP without writing a cron script? Is there any standard solution?
Feature [For example]: sent remainder to all subscriber 24hrs b4 the subscription expires.
The standard solution is to use cron on Unix-like operating systems and Scheduled Tasks on Windows.
If you don't want to use cron, I suppose you could try to rig something up using at. But it is difficult to imagine a situation where cron is a problem but at is A-OK.
The solution I see is a loop (for or while) and a sleep(3600*24);
Execute it through a sending ajax call every set interval of yours through javascript
Please read my final opinion at the bottom before rushing to implement.
Cron really is the best way to schedule things. It's simple, effective and widely available.
Having said that, if cron is not available or you absolutely don't want to use it, two general approaches for a non-cron, Apache/PHP pseudo cron running on a traditional web server, is as follows.
Check using a loadable resource
Embed an image/script/stylesheet/other somewhere on each web page. Images are probably the best supported by browsers (if javascript is turned off there's no guarantee that the browser will even load .js source files). This page will send headers and empty data back to the browser (a 1x1 clear .gif is fine - look at fpassthru)
from the php manual notes
<?php
header("Content-Length: 0");
header("Connection: close");
flush();
// browser should be disconnected at this point
// and you can do your "cron" work here
?>
Check on each page load
For each task you want to automate, you would create some sort of callable API - static OOP, function calls - whatever. On each request you check to see if there is any work to do for a given task. This is similar to the above except you don't use a separate URL for the script. This could mean that the page takes a long time to load while the work is being performed.
This would involve a select query to your database on either a task table that records the last time a task has run, or simply directly on the data in question, in your example, perhaps on a subscription table.
Final opinion
You really shouldn't reinvent the wheel on this if possible. Cron is very easy to set up.
However, even if you decide that, in your opinion, cron is not easy to set up, consider this: for each and every page load on your site, you will be incurring the overhead of checking to see what needs to be done. True cron, on the other hand, will execute command line PHP on the schedule you set up (hourly, etc) which means your server is running the task checking code much less frequently.
Biggest potential problem without true cron
You run the risk of not having enough traffic to your site to actually get updates happening frequently enough.
Create a table of cronjob. In which keep the dates of cron job. Keep a condition, if today date is equal to the date in the creonjob table. then call for a method to execute. This works fine like CRON job.

Cached, PHP generated Thumbnails load slowly

Question Part A ▉ (100 bountys, awarded)
Main question was how to make this site, load faster. First we needed to read these waterfalls. Thanks all for your suggestions on the waterfall readout analysis. Evident from the various waterfall graphs shown here is the main bottleneck: the PHP-generated thumbnails. The protocol-less jquery loading from CDN advised by David got my bounty, albeit making my site only 3% faster overall, and while not answering the site's main bottleneck. Time for for clarification of my question, and, another bounty:
Question Part B ▉ (100 bountys, awarded)
The new focus was now to solve the problem that the 6 jpg images had, which are causing the most of the loading-delay. These 6 images are PHP-generated thumbnails, tiny and only 3~5 kb, but loading relatively very slowly. Notice the "time to first byte" on the various graphs. The problem remained unsolved, but a bounty went to James, who fixed the header error that RedBot underlined: "An If-Modified-Since conditional request returned the full content unchanged.".
Question Part C ▉ (my last bounty: 250 points)
Unfortunately, after even REdbot.org header error was fixed, the delay caused by the PHP-generated images remained untouched. What on earth are these tiny puny 3~5Kb thumbnails thinking? All that header information can send a rocket to moon and back. Any suggestions on this bottleneck is much appreciated and treated as possible answer, since I am stuck at this bottleneckish problem for already seven months now.
[Some background info on my site: CSS is at the top. JS at the bottom (Jquery,JQuery UI, bought menu awm/menu.js engines, tabs js engine, video swfobject.js) The black lines on the second image show whats initiating what to load. The angry robot is my pet "ZAM". He is harmless and often happier.]
Load Waterfall: Chronological | http://webpagetest.org
Parallel Domains Grouped | http://webpagetest.org
Site-Perf Waterfall | http://site-perf.com
Pingdom Tools Waterfall | http://tools.pingdom.com
GTmetrix Waterfall | http://gtmetrix.com
First, using those multiple domains requires several DNS lookups. You'd be better off combining many of those images into a sprite instead of spreading the requests.
Second, when I load your page, I see most of the blocking (~1.25s) on all.js. I see that begins with (an old version of) jQuery. You should reference that from the Google CDN, to not only decrease load time, but potentially avoid an HTTP request for it entirely.
Specifically, the most current jQuery and jQuery UI libraries can be referenced at these URLs (see this post if you're interested why I omitted the http:):
//ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js
//ajax.googleapis.com/ajax/libs/jqueryui/1.8.9/jquery-ui.min.js
If you're using one of the default jQuery UI themes, you can also pull its CSS and images off the Google CDN.
With the jQuery hosting optimized, you should also combine awmlib2.js and tooltiplib.js into a single file.
If you address those things, you should see a significant improvement.
I had a similar problem a few days ago & i found head.js.
It's a Javascript Plugin which allows you to load all JS files paralell.
Hope that helps.
I am far from an expert but...
In regards to this:
"An If-Modified-Since conditional request returned the full content unchanged."
and my comments.
The code used to generate the Thumbnails should be checking for the following:
Is there a cached version of the thumbnail.
Is the cached version newer than the original image.
If either of these are false the thumbnail should be generated and returned no matter what. If they are both true then the following check should be made:
Is there a HTTP_IF_MODIFIED_SINCE header
Is the cached version's last modified time the same as the HTTP_IF_MODIFIED_SINCE
If either of these are false the cached thumbnail should be returned.
If both of these are true then a 304 http status should be returned. I'm not sure if its required but I also personally return the Cache-Control, Expires and Last-Modified headers along with the 304.
In regards to GZipping, I've been informed that there is no need to GZip images so ignore that part of my comment.
Edit: I didn't notice your addition to your post.
session_cache_limiter('public');
header("Content-type: " . $this->_mime);
header("Expires: " . gmdate("D, d M Y H:i:s", time() + 2419200) . " GMT");
// I'm sure Last-Modified should be a static value. not dynamic as you have it here.
header("Last-Modified: " . gmdate("D, d M Y H:i:s",time() - 404800000) . " GMT");
I'm also sure that your code needs to check for the HTTP_IF_MODIFIED_SINCE header and react to it. Just setting these headers and your .htaccess file won't provide the required result.
I think you need something like this:
$date = 'D, d M Y H:i:s T'; // DATE_RFC850
$modified = filemtime($filename);
$expires = strtotime('1 year'); // 1 Year
header(sprintf('Cache-Control: %s, max-age=%s', 'public', $expires - time()));
header(sprintf('Expires: %s', date($date, $expires)));
header(sprintf('Last-Modified: %s', date($date, $modified)));
header(sprintf('Content-Type: %s', $mime));
if(isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
if(strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) === $modified) {
header('HTTP/1.1 304 Not Modified', true, 304);
// Should have been an exit not a return. After sending the not modified http
// code, the script should end and return no content.
exit();
}
}
// Render image data
Wow, it's hard to explain things using that image.. But here, some tries:
files 33-36 load that late, because they are dynamically loaded within the swf, and the swf (25) is loaded first completely before it loads any additional content
files 20 & 21 are maybe (I don't know, because I don't know your code) libraries that are loaded by all.js (11), but for 11 to execute, it waits for the whole page (and assets) to load (you should change that to domready)
files 22-32 are loaded by those two libraries, again after those are completely loaded
Just a simple guess because this kind of analysis requires a lot of A/B testing: your .ch domain seems to be hard to reach (long, green bands before the first byte arrives).
This would mean that either the .ch website is poorly hosted or that you ISP does not have a good route to them.
Given the diagrams, this could explain a big performance hit.
On a side note, there is this cool tool cuzillion that could help you sort out things depending on your ordering of ressource loading.
Try running Y!Slow and Page Speed tests on your site/page, and follow the guidelines to sort out possible performance bottlenecks. You should be getting huge performance gains once you score higher in Y!Slow or Page Speed.
These tests will tell you what's wrong and what to change.
So your PHP script is generating the thumbnails on every page load? First off, if the images that are being thumbnailed are not changing that often, could you set up a cache such that they don't have to be parsed each time the page loads? Secondly, is your PHP script using something like imagecopyresampled() to create the thumbnails? That's a non-trivial downsample and the PHP script won't return anything until its done shrinking things down. Using imagecopymerged() instead will reduce the quality of the image, but speed up the process. And how much of a reduction are you doing? Are these thumbnails 5% the size of the original image or 50%? A greater size of the original image likely is leading to a slowdown since the PHP script has to get the original image in memory before it can shrink it and output a smaller thumbnail.
I've found the URL of your website and checked an individual jpg file from the homepage.
While the loading time is reasonable now (161ms), it's waiting for 126ms, which is far too much.
Your last-modified headers are all set to Sat, 01 Jan 2011 12:00:00 GMT, which looks too "round" to be the real date of generation ;-)
Since Cache-control is "public, max-age=14515200", arbitrary last-modified headers will could cause problem after 168 days.
Anyway, this is not the real reason for delays.
You have to check what your thumbnail generator do when the thumbnail already exists and what could consume so much time checking and delivering the picture.
You could install xdebug to profile the script and see where the bottlenecks are.
Maybe the whole thing uses a framework or connects to some database for nothing. I've seen very slow mysql_connect() on some servers, mostly because they were connecting using TCP and not socket, sometimes with some DNS issues.
I understand you can't post your paid generator here but I'm afraid there are too many possible issues...
If there isn't a really good reason (usually there isn't) your images shouldn't invoke the PHP interpreter.
Create a rewrite rule for your web server that servers the image directly if it is found on the file system. If it's not, redirect to your PHP script to generate the image. When you edit the image, change the images filename to force users that have a cached version to fetch the newly edited image.
If it doesn't work at least you will now it doesn't have anything to do with the way the images are created and checked.
Investigate PHP's usage of session data. Maybe (just maybe), the image-generating PHP script is waiting to get a lock on the session data, which is locked by the still-rendering main page or other image-rendering scripts. This would make all the JavaScript/browser optimizations almost irrelevant, since the browser's waiting for the server.
PHP locks the session data for every script running, from the moment the session handling starts, to the moment the script finishes, or when session_write_close() is called. This effectively serializes things. Check out the PHP page on sessions, especially the comments, like this one.
This is just a wild guess since I haven't looked at your code but I suspect sessions may be playing a role here, the following is from the PHP Manual entry on session_write_close():
Session data is usually stored after
your script terminated without the
need to call session_write_close(),
but as session data is locked to
prevent concurrent writes only one
script may operate on a session at any
time. When using framesets together
with sessions you will experience the
frames loading one by one due to this
locking. You can reduce the time
needed to load all the frames by
ending the session as soon as all
changes to session variables are
done.
Like I said, I don't know what your code is doing but those graphs seem oddly suspicious. I had a similar issue when I coded a multipart file serving function and I had the same problem. When serving a large file I couldn't get the multipart functionality to work nor could I open another page until the download was completed. Calling session_write_close() fixed both my problems.
Have you tried replacing the php generated thumnails by regular images to see if there is any difference ?
The problem could be around
- a bug in your php code leading to a regeneration of the thumbnail upon each server invocation
- a delay in your code ( sleep()?) associated with a clock problem
- a hardrive issue causing a very bad race condition since all the thumbnails get loaded/generated at the same time.
I think instead of using that thumbnail-generator script you must give TinySRC a try for rapid fast and cloud-hosted thumbnail generation.
It has a very simple and easy to use API, you can use like:-
http://i.tinysrc.mobi/ [height] / [width] /http://domain.tld/path_to_img.jpg
[width] (optional):-
This is a width in pixels (which overrides the adaptive- or family-sizing). If prefixed with ‘-’ or ‘x’, it will subtract from, or shrink to a percentage of, the determined size.
[height] (optional):-
This is a height in pixels, if width is also present. It also overrides adaptive- or family-sizing and can be prefixed with ‘-’ or ‘x’.
You can check the API summary here
FAQ
What does tinySrc cost me?
Nothing.
When can I start using tinySrc?
Now.
How reliable is the service?
We make no guarantees about the tinySrc service. However, it runs on a major, distributed cloud infrastructure, so it provides high availability worldwide. It should be sufficient for all your needs.
How fast is it?
tinySrc caches resized images in memory and in our datastore for up to 24 hours, and it will not fetch your original image each time. This makes the services blazingly fast from the user’s perspective. (And reduces your server load as a nice side-effect.)
Good Luck. Just a suggestion, since u ain't showing us the code :p
As some browsers only download 2 parallels downloads per domain, could you not add additional domains to shard the requests over two to three different hostnames. e.g. 1.imagecdn.com 2.imagecdn.com
First of all, you need to handle If-Modified-Since requests and such appropriately, as James said. That error states that: "When I ask your server if that image is modified since the last time, it sends the whole image instead of a simple yes/no".
The time between the connection and the first byte is generally the time your PHP script takes to run. It is apparent that something is happening when that script starts to run.
Have you considered profiling it? It may have some issues.
Combined with the above issue, your script may be running many more times than needed. Ideally, it should generate thumbs only if the original image is modified and send cached thumbs for every other request. Have you checked that the script is generating the images unnecessarily (e.g. for each request)?
Generating proper headers through the application is a bit tricky, plus they may get overwritten by the server. And you are exposed to abuse as anyone sending some no-cache request headers will cause your thumbnail generator to run continuously (and raise loads). So, if possible, try to save those generated thumbs, call the saved images directly from your pages and manage headers from .htaccess. In this case, you wouldn't even need anything in your .htaccess if your server is configured properly.
Other than these, you can apply some of the bright optimization ideas from the performance parts of this overall nice SO question on how to do websites the right way, like splitting your resources into cookieless subdomains, etc. But at any rate, a 3k image shouldn't take a second to load, this is apparent when compared to other items in the graphs. You should try to spot the problem before optimizing.
Have you tried to set up several subdomains under NGINX webserver specially for serving static data like images and stylesheets? Something helpful could be already found in this topic.
Regarding the delayed thumbnails, try putting a call to flush() immediately after the last call to header() in your thumbnail generation script. Once done, regenerate your waterfall graph and see if the delay is now on the body instead of the headers. If so you need to take a long look at the logic that generates and/or outputs the image data.
The script that handles the thumbnails should hopefully use some sort of caching so that whatever actions it takes on the images you're serving will only happen when absolutely necessary. It looks like some expensive operation is taking place every time you serve the thumbnails which is delaying any output (including the headers) from the script.
The majority of the slow issue is your TTFB (Time to first byte) being too high. This is a hard one to tackle without getting intimate with your server config files, code and underlying hardware, but I can see it's rampant on every request. You got too much green bars (bad) and very little blue bars (good). You might want to stop optimizing the frontend for a bit, as I believe you've done much in that area. Despite the adage that "80%-90% of the end-user response time is spent on the frontend", I believe yours is occuring in the backend.
TTFB is backend stuff, server stuff, pre-processing prior to output and handshaking.
Time your code execution to find slow stuff like slow database queries, time entering and exiting functions/methods to find slow functions. If you use php, try Firephp. Sometimes it is one or two slow queries being run during startup or initializtion like pulling session info or checking authentication and what not. Optimizing queries can lead to some good perf gains. Sometimes code is run using php prepend or spl autoload so they run on everything. Other times it can be mal configured apache conf and tweaking that saves the day.
Look for inefficient loops. Look for slow fetching calls of caches or slow i/o operations caused by faulty disk drives or high disk space usage. Look for memory usages and what's being used and where. Run a webpagetest repeated test of 10 runs on a single image or file using only first view from different locations around the world and not the same location. And read your access and error logs, too many developers ignore them and rely only on outputted onscreen errors. If your web host has support, ask them for help, if they don't maybe politely ask them for help anyway, it won't hurt.
You can try DNS Prefetching to combat the many domains and resources, http://html5boilerplate.com/docs/DNS-Prefetching/
Is the server your own a good/decent server? Sometimes a better server can solve a lot of problems. I am a fan of the 'hardware is cheap, programmers are expensive' mentality, if you have the chance and the money upgrade a server. And/Or use a CDN like maxcdn or cloudflare or similar.
Good Luck!
(p.s. i don't work for any of these companies. Also the cloudflare link above will argue that TTFB is not that important, I threw that in there so you can get another take.)
Sorry to say, you provide to few data. And you already had some good suggestions.
How are you serving those images ? If you're streaming those via PHP you're doing a very bad thing, even if they are already generated.
NEVER STREAM IMAGES WITH PHP. It will slow down your server, no matter the way you use it.
Put them in a accessible folder, with a meaningful URI. Then call them directly with their real URI.
If you need on the fly generation you should put an .htaccess in the images directory which redirects to a generator php-script only if the request image is missing. (this is called cache-on-request strategy).
Doing that will fix php session, browser-proxy, caching, ETAGS, whatever all at once.
WP-Supercache uses this strategy, if properly configured.
I wrote this some time ago ( http://code.google.com/p/cache-on-request/source/detail?r=8 ), last revisions are broken, but I guess 8 or less should work and you can grab the .htaccess as an example just to test things out (although there are better ways to configure the .htaccess than the way I used to).
I described that strategy in this blog post ( http://www.stefanoforenza.com/need-for-cache/ ). It is probably badly written but it may help clarifying things up.
Further reading: http://meta.wikimedia.org/wiki/404_handler_caching

How to execute a PHP spider/scraper but without it timing out

Basically I need to get around max execution time.
I need to scrape pages for info at varying intervals, which means calling the bot at those intervals, to load a link form the database and scrap the page the link points to.
The problem is, loading the bot. If I load it with javascript (like an Ajax call) the browser will throw up an error saying that the page is taking too long to respond yadda yadda yadda, plus I will have to keep the page open.
If I do it from within PHP I could probably extend the execution time to however long is needed but then if it does throw an error I don't have the access to kill the process, and nothing is displayed in the browser until the PHP execute is completed right?
I was wondering if anyone had any tricks to get around this? The scraper executing by itself at various intervals without me needing to watch it the whole time.
Cheers :)
Use set_time_limit() as such:
set_time_limit(0);
// Do Time Consuming Operations Here
"nothing is displayed in the browser until the PHP execute is completed"
You can use flush() to work around this:
flush()
(PHP 4, PHP 5)
Flushes the output buffers of PHP and whatever backend PHP is using (CGI, a web server, etc). This effectively tries to push all the output so far to the user's browser.
take a look at how Sphider (PHP Search Engine) does this.
Basically you will just process some part of the sites you need, do your thing, and go on to the next request if there's a continue=true parameter set.
run via CRON and split spider into chunks, so it will only do few chunks at once. call from CRON with different parameteres to process only few chunks.

Categories