I have a PHP client that requests an XML file over HTTP (i.e. loads an XML file via URL). As of now, the XML file is only several KB in size. A problem I can foresee is that the XML grows to several MB or GB in size. I know that this is a huge question and that there are probably a myriad of solutions, but what ideas do you have for transporting this data to the client?
Thanks!
Based on your use case, I'd definitely suggest zipping up the data first. In addition, you may want to MD5-hash the file and compare the hash before initiating the download (no need to update if the file has not changed); this will help with point #2.
Also, would it be possible to send just the segment of XML that has changed instead of the whole file?
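A minimal sketch of the hash check suggested above. It assumes the server publishes the MD5 of the current XML somewhere cheap to fetch (a small checksum file or endpoint); that arrangement and the function name `xmlHasChanged` are hypothetical, not something the question describes.

```php
<?php
// Hypothetical helper: decide whether tonight's XML differs from the copy
// fetched last night. $remoteMd5 is assumed to come from a small checksum
// file or endpoint published next to the XML.
function xmlHasChanged(string $localPath, string $remoteMd5): bool
{
    if (!is_file($localPath)) {
        return true; // nothing cached yet, so a download is required
    }
    $localMd5 = md5_file($localPath);
    if ($localMd5 === false) {
        return true; // unreadable local copy: treat as changed
    }
    // Checksum files often carry a trailing newline or upper-case hex.
    return !hash_equals($localMd5, strtolower(trim($remoteMd5)));
}
```

If the function returns false, both the download and the nightly database import can be skipped.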
Ignoring how well a browser may or may-not handle a GB-sized XML file, the only real concern I can think of off the top of my head is if the execution time to generate all the XML is greater than any execution time thresholds that are set in your environment.
PHP's max_execution_time setting
PHP's set_time_limit() function
Apache's TimeOut Directive
Given that the XML is created dynamically by your PHP, the simplest thing I can think of is to ensure that the output is gzipped automatically by the web server, as described here; it offers a general PHP approach and an Apache httpd-specific solution.
Besides that, having a browser (what else can be a PHP client?) do such a job every night for some data synchronizing sounds like there must be a far simpler solution somewhere else.
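To get a feel for what gzip buys on repetitive XML, here is a rough sketch using PHP's zlib functions. `gzipPayload` and `sampleXml` are hypothetical names, and the sample data is made up; real savings depend on the actual document.

```php
<?php
// Compress a payload the way a gzip-enabled server would before sending it.
// Requires PHP's zlib extension.
function gzipPayload(string $xml, int $level = 6): string
{
    $compressed = gzencode($xml, $level);
    if ($compressed === false) {
        throw new RuntimeException('gzip compression failed');
    }
    return $compressed;
}

// Hypothetical sample data: repetitive markup, as generated XML tends to be.
function sampleXml(int $rows): string
{
    $xml = '<rows>';
    for ($i = 0; $i < $rows; $i++) {
        $xml .= '<row><id>' . $i . '</id><status>active</status></row>';
    }
    return $xml . '</rows>';
}
```

In a script that echoes the XML directly, calling `ob_start('ob_gzhandler')` before any output is one way to have PHP compress the response when the client advertises gzip support.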
And, of course, at some point, transferring "a lot" of data is going to take "a lot" of time...
The problem is that he's syncing up two datasets. The problem is completely misstated.
You need to either a) keep a differential log of changes to dataset A so that you can send that log to dataset B, or b) keep two copies of the dataset (last night's and the current dataset), and then compare them so you can send the differential log from A to B.
Welcome to the world of replication.
The problem with (a) is that it's potentially invasive to all of your code, though if you're using an RDBMS you could do some logging perchance via database triggers to keep track of inserts/updates/deletes, and write the information in to a table, then export the relevant rows as your differential log. But, that can be nasty too.
The problem with (b) is the whole "comparing the database" all at once. Fine for 100 rows. Bad for 10^9 rows. Nasty nasty.
In fact, it can all be nasty. Replication is nasty.
A better plan is to look into a "real" replication system designed for the particular databases that you're running (assuming you're running a database). Something that perhaps sends database log records over for synchronization rather than trying to roll your own.
Most of the modern DBMS systems have replication systems.
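For small datasets, option (b) above can be sketched in a few lines. This assumes both snapshots fit in memory as arrays keyed by primary key, which is exactly the assumption that stops holding at 10^9 rows; `diffSnapshots` is a hypothetical name.

```php
<?php
// Build a differential log from two snapshots keyed by primary key.
// Each snapshot maps id => row (an associative array of column values).
function diffSnapshots(array $previous, array $current): array
{
    $log = ['insert' => [], 'update' => [], 'delete' => []];
    foreach ($current as $id => $row) {
        if (!array_key_exists($id, $previous)) {
            $log['insert'][$id] = $row;       // new since last night
        } elseif ($previous[$id] != $row) {   // loose compare: same keys and values
            $log['update'][$id] = $row;       // changed since last night
        }
    }
    foreach ($previous as $id => $row) {
        if (!array_key_exists($id, $current)) {
            $log['delete'][] = $id;           // gone since last night
        }
    }
    return $log;
}
```

Only the rows in the returned log need to be written on the receiving side, which addresses the "minimize writes" requirement.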
Gallery2, which allows you to upload photos over http, makes you set up a couple of php parameters, post_max_size and upload_max_filesize, to allow larger uploads. You might want to look into that.
It seems to me that posting large files has problems with browser time-outs and the like, but on the plus side it works with proxy servers and firewalls better than trying a different file upload protocol.
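The two settings mentioned above use php.ini shorthand such as `8M`. A small sketch for turning those values into bytes, so a script can compare them against an expected upload size; `iniSizeToBytes` is a hypothetical helper, not a PHP built-in.

```php
<?php
// Convert a php.ini shorthand size such as "8M" or "1G" to bytes.
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '') {
        return 0;
    }
    $unit = strtolower(substr($value, -1));
    $number = (int) $value; // the cast reads the leading digits only
    switch ($unit) {
        case 'g':
            return $number * 1024 ** 3;
        case 'm':
            return $number * 1024 ** 2;
        case 'k':
            return $number * 1024;
        default:
            return $number; // plain byte count
    }
}
```

For example, `iniSizeToBytes(ini_get('upload_max_filesize'))` gives the current per-file limit in bytes.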
Thanks for the responses. I failed to mention that transferring the file should be relatively fast (a few minutes max; is this even possible?). The XML that is requested will be parsed and inserted into a database every night. The XML may be the same as the night before, or it may be different. So there are basically two requirements: 1. it has to be relatively fast 2. it should minimize the number of writes to the database.
One solution that was proposed is to zip the XML file and then transfer it, but that only satisfies (1).
Any other ideas?
Are there any algorithms that I could apply to compress the XML? How are large files such as MP3s being downloaded in a matter of seconds?
Having PHP receive GBs of data will take a long time and adds overhead.
It is also even more susceptible to failures.
I would - dispatch the assignment to a shellscript (wget with simple error catching) that is not bothered by execution time and on failure could perhaps even retry on its own merit.
I'm not experienced with this, but although one could use exec() or the like, these sadly block until the command has finished.
Calling a script with **./test.sh &** makes it run in background and solves that problem / I guess. The script could easily let your PHP pick it back up via a wget `http://yoursite.com/continue-xml-stuff.php?id=1049381023&status=0`. The id could be a filename, if you don't need to backtrack lost requests. The status would indicate how the script ended up handling the request.
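A rough sketch of the background dispatch described above, assuming a Unix-like shell. The helper name `backgroundCommand` is hypothetical; `./test.sh` is the script name used in the answer.

```php
<?php
// Build a shell command that runs a script in the background.
// Redirecting the output and appending "&" lets exec() return immediately
// instead of waiting for the script to finish.
function backgroundCommand(string $script, array $args, string $logFile = '/dev/null'): string
{
    $parts = [escapeshellarg($script)];
    foreach ($args as $arg) {
        $parts[] = escapeshellarg((string) $arg); // never pass raw input to the shell
    }
    return implode(' ', $parts) . ' > ' . escapeshellarg($logFile) . ' 2>&1 &';
}
```

Usage would look like `exec(backgroundCommand('./test.sh', ['1049381023']));`. Without the output redirect, PHP keeps waiting for the child process even with the trailing `&`.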
Have you thought about using some sort of version control system to handle this? You could leverage its ability to calculate and send just the differences in the files, plus you get the added benefits of maintaining a version history of your file.
Since I don't know the details of your situation, I'll throw a question out there. Just for the sake of argument, does it have to be HTTP? FTP is much better suited for large data transfer and can be automated easily via PHP or Perl.
If you are using Apache, you might also consider Apache mod_gzip. This should allow you to compress the file automatically and the decompression should also happen automatically, as long as both sides accept gzip compression.
Related
I've got a registration list, which I need to send out a PDF to each person on the list. Each email needs to contain a PDF, which has a base version on the server, but each person's needs to be personalized via name/company etc over the top. This needs to be emailed to each person, which at the moment adds up to be 2,500, but can easily be much higher in the future.
I've only just started working on this project, but the problem I've encountered continuously since last week is that the server doesn't seem to be able to handle doing this. Currently the script is using Zend, which then allows it to use Zend_Pdf and Zend_Mail to create and email the PDFs. Zend_Mail connects to an SMTP server from smtp.com to do the actual emailing.
Since we have quite a few sites running on the server, we can't afford it to be going down, and when I run it in batches it can start to go down. The best solution I have thus far is running curl from my local machine to the script, which then does one person. The curl script then calls it again, over and over in batches. Even this runs into problems at times, and seems to somehow hog memory even after it should be complete (I'm really not sure how).
So what I'm looking for is information on doing this, from libraries, code, information on server setups, anything that can make this much less painful, and much quicker for us to run. I've run out of ideas, and this is something I've not really had to do before (especially at a bulk level).
Thank you.
Edit:
I also forgot to mention that it's using zend_barcode::factory for creating a barcode on the PDF.
The first step I suggest is to work out where the problem lies, if you can. Is it the PDF generation? Is it the emailing? "Server doesn't seem to be able to handle this" doesn't say what is actually failing, and neither does "server goes down"; you need to determine whether you are running out of memory, disk space, time or something else. That will help you determine if you need a tweak or a new approach to your generation. Because you said that even single manual invocations can fail, you should be able to narrow the problem down to the exact cause of the failure.
If you are running near some resource limit (which might be the case with several sites running), you probably need to offload this capability onto another machine. Your options include:
run the same setup on a new host and adjust your applications to use the new system
run a new setup on a new host
use an external system (such as the mentioned PDFCrowd or Docmosis)
Start with the specifics of the problem. I hope that helps. Please note I work for the company that created Docmosis.
Here are some ideas:
Is there a particular reason this has to run on a web server? Why not run the framework
from a different machine, but with the same settings? You might have to create a different
controller to handle the command-line version of the request, but there's no fundamental
reason it can't work.
If creating PDFs programmatically is giving you a headache, you can instead use a service.
In the past, I've used PDFCrowd with good results, and they provided
a useful PHP library. You can give them a blob of HTML, using full URLs for any stylesheets
and images, and they'll create a PDF for you.
The cost varies from 0.5 to 4.5 cents per document depending on your rate plan.
There are other services which do the same thing.
If this kind of batch job is a big deal for your company, you might consider an
asynchronous job queue like beanstalk. You could queue
up thousands of these, and a worker script could handle the requests at whatever pace you
deem reasonable.
From my experience - two options:
Dynamically generate PDFs using one or more PDF libraries (which can be awfully slow).
OR
Use something like wkhtmltopdf which is a simple shell utility to convert html to pdf using the webkit rendering engine, and qt.
Basically, you can loop over n HTML pages and generate PDF's without the overhead of purely dynamic PDF generation!
We've used this to distribute thousands of personalised PDF's on a daily basis as it quickly converts HTML pages to PDF. There are dependencies, but it works and is less intensive (computationally) than 'creating' PDFs individually.
Hope this helps.
If you are trying to call the script over HTTP, the script will time out based on the max_execution_time specified in php.ini.
You need to write a PHP script which can be run from the command line and then schedule it via a cron job. The script can handle one user at a time: read the user, put together their PDF file, and email it. After that, you might have to run some performance checks to see if the server can handle the process.
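The one-user-at-a-time loop could be sketched as below. `processRecipients` and the `$sendOne` callback are hypothetical; the callback stands in for whatever builds and emails a single PDF (for instance the existing Zend_Pdf and Zend_Mail code).

```php
<?php
// Process recipients one at a time so only one PDF is in memory at once.
// $sendOne is a hypothetical callback that builds and emails one PDF and
// returns true on success. Returns the recipients that failed, for a retry.
function processRecipients(iterable $recipients, callable $sendOne, int $pauseMicros = 0): array
{
    $failed = [];
    foreach ($recipients as $recipient) {
        try {
            if (!$sendOne($recipient)) {
                $failed[] = $recipient;
            }
        } catch (Throwable $e) {
            $failed[] = $recipient; // one bad recipient must not stop the run
        }
        gc_collect_cycles();        // release cyclic references left by the PDF objects
        if ($pauseMicros > 0) {
            usleep($pauseMicros);   // throttle so other sites on the server stay responsive
        }
    }
    return $failed;
}
```

Run from cron through the CLI, where max_execution_time defaults to 0 (no limit), this avoids the HTTP timeout entirely.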
I have an application that needs to cache large amounts of data (sometimes even MBs) over multiple page requests (for the same user/session). After doing some Googling etc. I've concluded that it is likely best to implement the caching mechanism by writing cache files to disk (please correct me if you think there are better alternatives).
Now, my idea was to have a root cache folder, within which I create folders for each session ID so as not to overwrite any cached data used in separate sessions. Then for each block of data I will create a unique identifier which can be linked to the data whenever I want to retrieve it again. The data will then be serialized to a string format (using the default PHP 'serialize' function), after which it is written to the appropriate file.
The thing I'm not sure how to implement is the cleanup of the cached files. At some point the data is no longer needed, for example when the session has expired, or for a number of other reasons. Since it will likely be too much overhead to check for this during each page request, I expect to have to do this externally using some kind of scheduler. However, I cannot guarantee that my application will run in a UNIX environment, so I'd have to consider other platforms as well (Windows, Mac). Is there a general solution that anyone can think of that would be cross-platform without too much hassle?
I'm also thinking that there may be a way to intelligently check or mark certain files to be cleaned up, without having to check all the existing files separately. I was considering maybe storing their last accessed timestamp or something, but there may be other criteria besides time that could make the cached data obsolete, such as an exception being triggered in the application (though I could say that whenever that happens the entire cache for that session will be emptied or something like that).
Any suggestions on these issues would be very much appreciated!
If you have Memcache installed, you can use that for caching. It is faster than a file cache, and you can give each entry an expiration time, so it will automatically be removed from the cache after a given period of time.
Both Windows and Unix have scheduled job support - cron for Unix/Linux, and 'at' for Windows. It would be a simple matter to whip up a PHP script to scan your cache directory and apply your deletion criteria to what it finds. Last access timestamp is trivial, basing it on cached file contents or other triggers slightly less so.
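A sketch of such a cleanup script, meant to be run by cron or the Windows scheduler. It uses the modification time rather than the last access time, on the assumption that access times may not be updated on every filesystem; `purgeStaleCache` is a hypothetical name.

```php
<?php
// Delete cache files that have not been modified within $maxAgeSeconds.
// Walks the per-session subfolders recursively and returns the number of
// files removed.
function purgeStaleCache(string $cacheDir, int $maxAgeSeconds, ?int $now = null): int
{
    $now = $now ?? time();
    $stale = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($cacheDir, FilesystemIterator::SKIP_DOTS)
    );
    // Collect first, delete afterwards, so the directory is not modified
    // while it is being iterated.
    foreach ($files as $file) {
        if ($file->isFile() && $now - $file->getMTime() > $maxAgeSeconds) {
            $stale[] = $file->getPathname();
        }
    }
    $removed = 0;
    foreach ($stale as $path) {
        if (unlink($path)) {
            $removed++;
        }
    }
    return $removed;
}
```

Because it only relies on PHP's own filesystem functions, the same script should work on Windows, Linux and Mac; only the scheduler entry differs.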
I created a simple web interface to allow various users to upload files. I set the upload limit to 100 MB, but now it turns out that the client occasionally wants to upload files of 500 MB or more.
I know how to alter the PHP configuration to change the upload limit, but I was wondering if there are any serious disadvantages to uploading files of this size via PHP?
Obviously ftp would be preferable but if possible i'd rather not have two different methods of uploading files.
Thanks
Firstly FTP is never preferable. To anything.
I assume you mean that you are transferring the files via HTTP. While not quite as bad as FTP, it's not a good idea if you can find another way of solving the problem. HTTP (and hence the component programs) is optimized around transferring relatively small files around the internet.
While the protocol supports server to client range requests, it does not allow for the reverse operation. Even if the software at either end were unaffected by the volume, the more data you are pushing across the greater the interval during which you could lose the connection. But the biggest problem is that caveat in the last sentence.
Regardless of the server technology you use (PHP or something else) it's never a good idea to push that big file in one sweep in synchronous mode.
There are lots of plugins for any technology/framework that will do asynchronous upload for you.
Besides the connection timing out, there is one more disadvantage in that file uploading consumes the web server memory. You don't normally want that.
PHP will handle as many and as large a file as you'll allow it. But consider that it's basically impossible to resume an aborted upload in PHP, as scripts are not fired up until AFTER the upload is completed. The larger the file gets, the larger the chance of a network glitch killing the upload and wasting a good chunk of time and bandwidth. As well, without extra work with APC, or using something like uploadify, there's no progress report and users are left staring at a browser showing no visible signs of actual work except the throbber chugging away.
You all know about restrictions that exist in shared environment, so with that in mind, please suggest me a php function or something with the help of which I could stream my videos and other files. I have a lot of videos on the server, unlimited bandwidth and disk space, but I am limited in ram and cpu.
Don't use php to stream the data. Use a header redirect to point to the URL of the actual file. This will offload the work onto the webserver which might run under a different user id and is better optimized for this task.
Hmm, there is XMoov that acts as a "streaming server" but does not much more than serve a file byte by byte, with a few additional options and settings. It promises random access (i.e. arbitrary skipping within a video) but I haven't used it myself yet.
As a server administrator, though, I would frown on anybody using PHP to serve huge files like that because of the strain it puts on the server. I would generally not regard this to be a good idea, and rent a streaming server instead if at all possible. Use at your own risk.
You can use a while loop to load bits of the file, and then sleep for some time, and then output more, and sleep... (that would be the only way to limit the CPU usage).
RAM shouldn't be a problem, as you will just dump parts of the file, so you don't need to load it into RAM.
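The read, output, sleep loop described above might look like this. It is a sketch only: `streamFileInChunks` is a hypothetical name, and it does not handle range requests or content-type headers.

```php
<?php
// Send a file to the client in small chunks, optionally pausing between
// chunks to limit CPU and bandwidth use. Only one chunk is held in memory
// at a time. Returns the number of bytes sent.
function streamFileInChunks(string $path, int $chunkBytes = 8192, int $pauseMicros = 0): int
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $sent = 0;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkBytes);
        if ($chunk === false || $chunk === '') {
            break;
        }
        echo $chunk;
        flush();                  // push the chunk out instead of accumulating it
        $sent += strlen($chunk);
        if ($pauseMicros > 0) {
            usleep($pauseMicros); // the "sleep for some time" step
        }
    }
    fclose($handle);
    return $sent;
}
```

For long videos the script would likely also need `set_time_limit(0)`, and output buffering has to be off for the chunks to leave the server as they are read.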
What's the best way (ways?) to speed up a PHP web site, and how much faster can it get using each approach?
PHP isn't really the kind of language where you can do micro-optimizations, or just work on the code alone. There's really no point. Although PHP isn't particularly fast, PHP itself is rarely the bottleneck in a given web site.
You need to work out where that bottleneck is before you can fix it. There are a lot of common bottlenecks, with common solutions. It's difficult to generalize, given so few details, but there are a lot of performance hints that apply to most web sites.
The first good place to look is actually on the client side, rather than the server side. How large are your pages (including images, CSS, JavaScript and the like)? How many HTTP requests does a single page view require? Use something like Firebug (and the YSlow add-on for Firebug) to see how long your page actually takes to load, and which bits of your page cause the problem. Some general hints:
Work out ways to shrink the CSS and JavaScript - remove anything you don't need, and run the rest through a tool like YUI Compressor.
If you have multiple CSS and JavaScript files, try to combine them into a single file.
Optimize all of your images as much as possible, and see if you can combine any of those into a single file using CSS sprites or similar. PunyPNG is good for lossless images. A decent JPEG encoder (NOT Photoshop) is good for photos.
Move the CSS to the top of the page, and the JavaScript to the bottom, so the browser can render the page before the JavaScript has finished downloading.
Make sure that all of your CSS, JavaScript and HTML are being served compressed.
Make sure that you're using appropriate caching - if a file hasn't changed, there's no point in re-downloading it.
Once you've got the client side out of the way, you might have to turn your attention to the server side.
Install an opcode cache, like APC, XCache, or Zend Optimizer. It's very easy to do, and will always provide some improvement. Once you've done that, profile your pages, to find out where the time is actually being spent.
More likely than not, you'll be spending most of your time waiting for the database to return results. So, at a bare minimum:
Work out which queries are taking the longest, and work on them first. Use your head though - a query that takes five seconds on an admin page that nobody looks at is not as important as a query that takes one second on the front page.
Make sure that your query uses appropriate indexes. No common query should ever need to do a full table scan. Certain kinds of sorting or grouping may be unable to use indexes - try to avoid them, or modify the query so that it can use indexes.
Make sure that your queries aren't using temporary tables.
Use the EXPLAIN keyword - it's very useful.
Tune the database server itself. MySQL's default configuration is generally not optimized for performance.
Once you've done that, it's usually best to start working out how to use caching. The best way to speed PHP code up is to reduce the amount of work it has to do.
Make sure your database's query cache is working properly.
Use something like Memcached to store frequently used results, instead of getting them from the database.
If you have enough memory, try to keep everything in Memcached, resorting to the database only when something isn't present in the cache.
If you have chunks of pages that are dynamic, but the same for all users, try caching those chunks. For example, if two users are looking at an article, the article itself is going to be exactly the same for each user, even if the rest of the page isn't. Generate the HTML for the article, and chuck it in the cache.
If you have lots of non-authenticated users, it's entirely possible that they'll all be seeing the exact same page. Two non-authenticated users looking at the above article won't just see an identical article - they'll see an identical page, right down to the login links. Set your PHP scripts up so you can use HTTP caching headers (check the last modified date, and return a 304 Not Modified if it's not been changed). Once you've done that, stick a Squid reverse-proxy in front of the webserver, and let Squid serve pages out of its cache.
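The last-modified check from the final item can be sketched as two small functions. The names `isNotModified` and `httpDate` are hypothetical, and the sketch ignores ETags entirely.

```php
<?php
// Decide whether a 304 Not Modified can be sent, given the request's
// If-Modified-Since header (or null) and the page's last-modified time.
function isNotModified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false; // unconditional request: send the full page
    }
    $since = strtotime($ifModifiedSince);
    return $since !== false && $lastModified <= $since;
}

// Format a Unix timestamp as an HTTP date for the Last-Modified header.
function httpDate(int $timestamp): string
{
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}
```

In a page script this would be used roughly as: if `isNotModified($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null, $mtime)` is true, call `http_response_code(304)` and exit before generating anything; otherwise send `header('Last-Modified: ' . httpDate($mtime))` and render the page.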
After that point, the general approach is to start using more servers, and the problem becomes one of scaling, rather than raw speed. The general plan is to make sure that your website has a shared-nothing architecture - all persistent data is stored in the database. Then, you install multiple webservers, move the database server to a separate machine, and run the entire thing behind a caching reverse proxy. To add more capacity, you add more machines.
One way: php accelerators, e.g. APC.
Another: read blog articles, e.g. performance tuning overview.
A general question, I would say. Try looking for optimization tips online...
Several parameters are involved:
I/O access (using it a lot - file_exists, is_file overheads)
Database access (optimize queries, use stored procedures, check your db cache)
Using an opcode cache (like APC)
Compressing output
Serving js/css minified and compressed (and using subdomains to deliver them to the browser)
Using memcache to cache data into memory for faster access
You can use benchmarking tools to test your environment before and after the optimizations.
Try apache bench for example.
Filesize.
A file of 500 KB takes longer to download than a file of 300 KB. So optimize and crop as much as you can.
Accelerators
Self-explanatory: List of PHP accelerators
Server upgrade
Though this costs money, when dealing with a lot of traffic it will have an impact on how fast the .php files get processed and how fast data will be sent to the user.
I don't recommend this though since there are other (free) ways to improve speed.
Don't use external resources
If you are linking to images through other sites, the download speed will not be in your control. Instead, if you plan on using images from others, download them to your own server first (or upload them to your own provider) and load them that way.
Review and improve your code
Find short cuts, remove unnecessary code, delete unused variables, reuse others etc.
There are other ways but I believe the above information has the most impact on your speed
You should probably do some search for existing answers to this question, however...
APC for opcode caching
Memcached for object storing (to reduce the number of database queries)
Check for / optimize slow SQL queries
Measure and find bottlenecks
Don't rely on (slow) web services on each page load, etc.
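The Memcached item above usually takes the shape of a cache-aside lookup, sketched here. `remember` and `ArrayCache` are hypothetical names; `ArrayCache` is an in-memory stand-in that ignores the TTL, so the example runs without a cache server. The Memcached extension's `get($key)` and `set($key, $value, $expiration)` methods have the same shape, so an instance of that class could be passed instead.

```php
<?php
// Cache-aside lookup: return the cached value if present, otherwise run
// the expensive loader (e.g. a database query) and store its result.
function remember($cache, string $key, int $ttl, callable $loader)
{
    $value = $cache->get($key);
    if ($value !== false) {
        return $value;              // cache hit: no database work
    }
    $value = $loader();             // cache miss: do the expensive work once
    $cache->set($key, $value, $ttl);
    return $value;
}

// Hypothetical in-memory stand-in for a real cache client.
final class ArrayCache
{
    private $items = [];

    public function get(string $key)
    {
        return $this->items[$key] ?? false; // false signals a miss
    }

    public function set(string $key, $value, int $ttl = 0): bool
    {
        $this->items[$key] = $value;        // the TTL is ignored in this stand-in
        return true;
    }
}
```

One caveat of this shape: a stored value of `false` is indistinguishable from a miss, so such results get recomputed on every call.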
Yahoo has got some good basic advice on speeding up web pages, much of it very easy to implement. You may also want to download yslow + firebug for firefox; they will help indicate possible basic bottlenecks from a client request perspective.
The rest of the advice here is good, so I won't add much else other than: don't bother optimising any code until you're 100% sure that you've found a bottleneck. I can't stress that enough. Don't waste time tweaking code or implementing new things (i.e. caching) because you "feel" they will make things quicker; act only on real evidence (i.e. performance profiling).