Just wondering if anyone has information on what "costs" are associated with including a LARGE (600K or more) PHP file containing hundreds of classes. Does it really make much difference compared to autoloading individual files, where the autoloader might, for instance, search across several directories before finding a match?
Would having APC caching on make this cost negligible?
Basically, the cost of including one big file depends on your use case. Let's say you have a large file with 200 classes.
If you only use 1 class, including the large file will be more expensive than including a small class file for that individual class.
If you use all 200 classes, including the large file will be significantly less expensive than including 200 small files.
Where the cutoff lies is really system dependent. I would imagine it would be somewhere around the 50% mark (i.e. if you're using fewer than 100 of the classes in any one request, autoload).
And using APC will likely shift the break-even point toward fewer classes (so without it, 100 classes used might be the break-even point, but with it, it might be 50 classes used), since it makes the large single include much cheaper but only slightly lowers the overhead of each individual smaller include.
The exact break-even points will be 100% system dependent (how fast is your disk I/O, how fast are your processors, how much memory, etc). So the only way to know for sure on your platform is to test.
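A quick way to test this on your own platform is a throwaway script along these lines (a minimal sketch; all_classes.php and the classes/ directory are hypothetical stand-ins for your own layout):

    <?php
    // Rough benchmark sketch, not a rigorous profiler. Run each variant in a
    // separate request so require_once bookkeeping and the opcode cache
    // don't skew the comparison.
    $start = microtime(true);

    if (isset($_GET['mode']) && $_GET['mode'] === 'big') {
        require __DIR__ . '/all_classes.php';               // one large combined file
    } else {
        foreach (glob(__DIR__ . '/classes/*.php') as $file) {
            require $file;                                   // many small class files
        }
    }

    printf("include time: %.3f ms\n", (microtime(true) - $start) * 1000);
    printf("peak memory:  %.2f MB\n", memory_get_peak_usage(true) / 1048576);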
However, more is at stake than raw performance. Maintainability will suffer with one large file since it's harder to work on multiple classes at the same time (tabs in an IDE become useless). I personally would keep all the classes in separate files and make my life as the developer easier rather than making one giant monstrosity of a file.
Now, if you have Facebook traffic levels, it may be worth investigating further. But if you don't, I personally wouldn't worry about it...
I have conducted some tests on the various costs of PHP include() which I'd like to share, as I see many programmers and CMS platforms overlooking these pre-runtime PHP costs.
The cost of the function itself is quite negligible. 100 file includes (with empty files) costs about 5ms; and no more than one microsecond when using an opcache.
So the cost savings of including a larger php file containing 100 classes, as opposed to 100 separate file includes, is only about 5ms. And using an OpCode cache makes that cost irrelevant.
The real cost comes with the size of your files and what PHP has to parse and/or compile. For a better idea of what those costs are, here are results from tests I conducted on a 2010 Mac Mini Server with a 10,000 RPM drive, running PHP 5.3 with the eAccelerator opcode cache (optimizer enabled).
File size  |  100 includes, w/opcache  |  100 includes, no opcache
empty      |  1µs                      |  5ms
32KB       |  7ms                      |  30ms
64KB       |  14ms                     |  60ms
128KB      |  22ms                     |  100ms
200KB      |  38ms                     |  170ms
Therefore, a 600KB PHP file costs roughly 6ms to include, or about 1ms when using an opcode cache. What you really want to watch instead is the total size of all code included per request.
Merging files in combinations to try to save resources is definitely not a good idea and would be a mistake when using an opcache. My test doesn't account for disk speed much, if at all, as I included the same file 100 times. That said, I don't feel the need to cover disk I/O at all, because having an opcache installed is really a prerequisite in terms of basic performance.
To gain as much performance as possible and save RAM, one must do the opposite: split files contextually as much as possible, using an autoloader or a class factory pattern, so that as little unused code as possible is included on each and every request.
To that effect, misusing include_once() can also have negative performance consequences...
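A minimal sketch of that split-and-autoload approach (the classes/ directory and the class-name-to-file convention are assumptions for illustration, not something from the question):

    <?php
    // Each class lives in its own file under classes/, named after the class,
    // so a request only parses the code it actually uses.
    spl_autoload_register(function ($class) {
        $file = __DIR__ . '/classes/' . str_replace('\\', '/', $class) . '.php';
        if (is_file($file)) {
            require $file;
        }
    });

    $repo = new UserRepository();   // classes/UserRepository.php is read only now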
Regarding your base classes: I have similar circumstances, but I only include a tiny portion of the table schema, mainly the field types and primary key details. For performance reasons, I purposely do not include the rather heavy full schema of the tables all the time, because it is rarely used, and when it is, I use at most a couple of tables per request.
The full column details of a table come to roughly 20-50k per schema array. Including 10-15 of them on any given request costs only about 1-3 ms for the arrays, which in itself is not much. But it becomes worthwhile when combined with a 500k RAM saving per request.
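A hedged sketch of that kind of split (the file layout, array shapes and function name below are mine, purely for illustration):

    <?php
    // Lightweight schema (field types and primary keys) is loaded on every
    // request; the heavy 20-50k column-detail arrays are loaded per table,
    // and only when a request actually asks for them.
    $schemaLight = require __DIR__ . '/schema/light.php';

    function fullTableSchema($table)
    {
        static $loaded = array();
        if (!isset($loaded[$table])) {
            $loaded[$table] = require __DIR__ . '/schema/full/' . basename($table) . '.php';
        }
        return $loaded[$table];
    }

    $columns = fullTableSchema('orders');   // only this table's heavy array is parsed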
APC will save you a lot, but I don't know whether the cost will be negligible if your source is 600k. That is about 15,000 lines of code? Not that much for a website, but quite large for a single file.
You'd do better with a more dynamic approach, isolating specific functionality in specific classes. Then, for each page, you can choose which code is needed.
Especially when you use APC, this approach will be better, because you don't have the overhead of file I/O which you would have when loading many small files from disk. I would implement small, specialized classes and put each of them in a separate file. You can use PHP's class loading mechanism (__autoload) to automatically load the right units.
When you figure out a good naming convention for your classes and units, this will make your development a lot easier.
It's well known that in Windows a directory with too many files will have terrible performance when you try to open one of them. I have a program that will run only on Linux (currently Debian Lenny, but I don't want to be specific about the distro) and writes many files to the same directory (which acts somewhat as a repository). By "many" I mean tens each day, meaning that after one year I expect to have something like 5000-10000 files. They are meant to be kept (once a file is created, it's never deleted), and it is assumed that the hard disk has the required capacity (if not, it should be upgraded). Those files have a wide range of sizes, from a few KB to tens of MB (but not much more than that). The names are always numeric values, incrementally generated.
I'm worried about long-term performance degradation, so I'd ask:
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
Should I require a specific filesystem to be used for such directory?
What would be the more robust alternative? Specialized filesystem? Which?
Any other considerations/recommendations?
It depends very much on the file system.
ext2 and ext3 have a hard limit of 32,000 files per directory. This is somewhat more than you are asking about, but close enough that I would not risk it. Also, ext2 and ext3 will perform a linear scan every time you access a file by name in the directory.
ext4 supposedly fixes these problems, but I cannot vouch for it personally.
XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.
So if you really need a huge number of files, I would use XFS or maybe ext4.
Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
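A small sketch of that scheme in PHP (the directory layout and padding width are illustrative):

    <?php
    // Map an incrementing numeric id such as 1 to a nested path like 00/00/01,
    // keeping roughly 100 entries per directory level.
    function idToPath($id, $baseDir)
    {
        $s   = str_pad((string) $id, 6, '0', STR_PAD_LEFT);     // 1 -> "000001"
        $dir = $baseDir . '/' . substr($s, 0, 2) . '/' . substr($s, 2, 2);
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true);                            // creates 00/00 as needed
        }
        return $dir . '/' . substr($s, 4, 2);
    }

    file_put_contents(idToPath(1, '/var/repository'), 'payload');  // /var/repository/00/00/01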
If you use a filesystem without directory-indexing, then it is a very bad idea to have lots of files in one directory (say, > 5000).
However, if you've got directory indexing (which is enabled by default on more recent distros in ext3), then it's not such a problem.
However, it does break quite a few tools to have many files in one directory (For example, "ls" will stat() all the files, which takes a long time). You can probably easily split it into subdirectories.
But don't overdo it. Don't use many levels of nested subdirectory unnecessarily, this just uses lots of inodes and makes metadata operations slower.
I've seen more cases of "too many levels of nested directories" than I've seen of "too many files per directory".
The best solution I have for you (rather than quoting some values from a micro-filesystem-benchmark) is to test it yourself.
Just use the file system of your choice. Create some random test data for 100, 1000 and 10000 entries. Then, measure the time it takes your system to perform the action you are concerned about time-wise (opening a file, reading 100 random files, etc).
Then, you compare the times and use the best solution (put them all into one directory; put each year into a new directory; put each month of each year into a new directory).
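For instance, a crude PHP harness for that kind of measurement could look like this (the directory, counts and file size are placeholders to adjust):

    <?php
    // Create $count small files, then time $samples random opens/reads.
    // Repeat with 100, 1000 and 10000 entries on the filesystem under test.
    $dir     = '/tmp/fstest';
    $count   = 10000;
    $samples = 100;

    @mkdir($dir, 0755, true);
    for ($i = 0; $i < $count; $i++) {
        file_put_contents("$dir/$i", str_repeat('x', 1024));   // 1 KB of dummy data
    }

    $start = microtime(true);
    for ($i = 0; $i < $samples; $i++) {
        $fh = fopen($dir . '/' . mt_rand(0, $count - 1), 'rb');
        fread($fh, 1024);
        fclose($fh);
    }
    printf("%d random opens: %.2f ms\n", $samples, (microtime(true) - $start) * 1000);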
I do not know in detail what you are using, but creating a directory is a one-time (and probably quite easy) operation, so why not do that instead of changing filesystems or trying other, more time-consuming approaches?
In addition to the other answers, if the huge directory is managed by a known application or library, you could consider replacing it with something else, e.g.:
a GDBM index file; GDBM is a very common library providing indexed files, which associate an arbitrary key (a sequence of bytes) with an arbitrary value (another sequence of bytes).
perhaps a table inside a database like MySQL or PostgreSQL. Be careful about indexing.
some other way to index data
The advantages of the above approaches include:
space performance for a large collection of small items (less than a kilobyte each): a filesystem needs an inode for each item, whereas indexed systems can have much finer granularity
time performance: you don't access the filesystem for every item
scalability: indexed approaches are designed to fit large needs: either a GDBM index file, or a database can handle many millions of items. I'm not sure your directory approach will scale as easily.
The disadvantage of such approaches is that the items no longer show up as files. But as MarkR's answer reminds you, ls behaves quite poorly on huge directories anyway.
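As an illustration of the GDBM route from PHP, something like the dba extension could be used (this assumes PHP was built with the gdbm handler; the path and keys are illustrative):

    <?php
    // Keep the "files" as key/value pairs inside one GDBM index file instead
    // of one filesystem entry each. Requires the dba extension with gdbm support.
    $db = dba_open('/var/repository/store.gdbm', 'c', 'gdbm');   // create or open

    dba_replace('000001', 'file contents go here', $db);   // write or overwrite one item
    $data   = dba_fetch('000001', $db);                     // read it back by key
    $exists = dba_exists('000001', $db);                    // lookup without a directory scan

    dba_close($db);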
If you stick with a filesystem approach, many programs that use a large number of files organize them in subdirectories like aa/ ab/ ac/ ... ay/ az/ ba/ ... bz/ ...
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
In my experience the only slowdown a directory with many files will cause is when you do things such as getting a listing with ls. But that is mostly the fault of ls; there are faster ways of listing the contents of a directory using tools such as echo and find (see below).
Should I require a specific filesystem to be used for such directory?
I don't think so with regard to the number of files in one directory. I am sure some filesystems perform better with many small files in one directory, whilst others do a better job with huge files. It's also a matter of personal taste, akin to vi vs. emacs. I prefer to use the XFS filesystem, so that'd be my advice. :-)
What would be the more robust alternative? Specialized filesystem? Which?
XFS is definitely robust and fast; I use it in many places: as a boot partition, for Oracle tablespaces, as space for source control, you name it. It lacks a bit in delete performance, but otherwise it's a safe bet. Plus it supports growing while still mounted (that's actually a requirement): you just delete the partition, recreate it at the same starting block with an ending block beyond the original partition's, then run xfs_growfs on it with the filesystem mounted.
Any other considerations/recomendations?
See above. With the addition that having 5000 to 10000 files in one directory should not be a problem. In practice it doesn't arbitrarily slow down the filesystem as far as I know, except for utilities such as "ls" and "rm". But you could do:
find . -maxdepth 1 -type f | xargs echo
find . -maxdepth 1 -type f | xargs rm
The benefit of a directory tree of files, such as a directory "a" for file names starting with "a" and so on, is mostly cosmetic: it looks more organised. But then you have less of an overview... So what you're trying to do should be fine. :-)
I neglected to say you could consider using something called "sparse files" http://en.wikipedia.org/wiki/Sparse_file
It is bad for performance to have a huge number of files in one directory. Checking for the existence of a file will typically require an O(n) scan of the directory. Creating a new file will require that same scan with the directory locked to prevent the directory state changing before the new file is created. Some file systems may be smarter about this (using B-trees or whatever), but the fewer ties your implementation has to the filesystem's strengths and weaknesses the better for long term maintenance. Assume someone might decide to run the app on a network filesystem (storage appliance or even cloud storage) someday. Huge directories are a terrible idea when using network storage.
I am writing a PHP framework. I am wondering whether it will slow down when PHP require/include or require_once/include_once is called on too many files during a request.
Well of course it will. Doing anything too many times will cause a slowdown.
On a more serious note though, I/O operations that touch disk are very slow compared to anything that happens in memory. So oftentimes, including files will be a major performance factor when using a large framework (just look at Zend Framework...).
However, there are typically ways to alleviate this such as APC and similar op code caches.
Sometimes programming approaches are also taken. For example, if I remember correctly, Doctrine 1 has the capability to bundle everything into one giant file so as to have fewer I/O calls.
If in doubt, do some in-depth profiling of an application written with your framework and see whether include/require/etc. is one of the major slow points.
Yes, this will slow your application down. *_once calls are generally more expensive, since it must be checked whether that file has already been included. With a lot of includes there is a lot of hard disk access and a lot of memory usage involved. I've developed applications with the Zend Framework that include a total of 150 to 200 files on each request - you really can see the impact that has on overall performance.
The more files you include, the more load you add. However, if you have to choose between require and require_once, require_once/include_once take more load because a check has to be done to see whether the same file has already been included elsewhere. So if you can avoid that, you can at least boost performance a little.
Unless you use cache libraries, every time a request comes in those files will be included again and again. Surely it will slow things down. Create a framework that only includes what needs to be included.
I've just noticed that my app is including over 148 PHP files on one page. Bear in mind this is the back-end admin and not the main site, but is this too many? What impact does a large number of includes have on a server, both under average load and when stressed? Would disk I/O be a problem?
Included File Stats
File Type - Include Count - Combined File Size
Index - 1 - 0.00169 MB
Bootstrap - 1 - 0.01757 MB
Helper - 98 - 0.58557 MB - (11 are Profiler related classes)
Configuration - 8 - 0.00672 MB
Data Store - 23 - 0.10836 MB
Action - 8 - 0.02652 MB
Page - 1 - 0.00094 MB
I18n Resource - 7 - 0.00870 MB
Vendor Library - 1 - 0.02754 MB
Total Files - 148 - 0.78362 MB
Time taken: 0.123920917511 s
Memory used 2.891 MB
Edit 1: It should be noted that this is a worst-case-scenario page. It has many different template models, controllers and associated views because it handles publishing with custom fields.
Edit 2: Also, the front end has aggressive page caching, so the number of includes there is roughly 30-40 at the moment.
Edit 3: When the profiler is turned off it won't include its files, so this will remove quite a few includes.
So, here's a breakdown of the potential problems.
The number of files itself is an issue. Unless you're using a bytecode cache (and you are), and that cache is configured to not stat the file prior to pulling in the compiled bytecode, PHP is going to stat every single one of those files on include, then read them in. In some cases, that can also mean path resolution and a naive autoloader that pokes and prods at numerous directories. This won't be "slow" because the OS will surely have things cached if the files are hit frequently, but it does add precious milliseconds to each request.
If every autoloader is designed properly and the codebase relies entirely on the autoloader to pull in the required classes (meaning nothing uses include/require/include_once/require_once on a class file), you can avoid having to open and read many of the files by gluing every single class together into a single large include. This is a bit on the impractical side of things, mainly because if there is no bytecode cache, PHP still has to parse, compile and interpret it all. Additionally, not every class is going to be used on every request, so it may be a bit wasteful.
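A hedged compromise, if you want the autoloader but not the directory probing, is a generated class map (the classmap.php file and its contents are assumptions for illustration):

    <?php
    // classmap.php is produced by a build step and simply returns
    // array('ClassName' => '/absolute/path/to/ClassName.php', ...), so every
    // lookup is one array access plus one include, with no path guessing.
    $classMap = require __DIR__ . '/classmap.php';

    spl_autoload_register(function ($class) use ($classMap) {
        if (isset($classMap[$class])) {
            require $classMap[$class];
        }
    });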
The bottom line is that a well-configured bytecode cache will completely mitigate this problem. There's nothing wrong with telling your customers that they have to properly configure their servers for optimal performance. If they know what they're doing, they'll have everything correct to begin with.
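For reference, the "don't stat the file" behaviour mentioned above is a single setting in APC; a php.ini sketch might look like this (the shared-memory size is an illustrative value, and with apc.stat = 0 you must clear the cache whenever code changes):

    ; php.ini (APC) - illustrative values
    apc.enabled  = 1
    apc.stat     = 0     ; skip the stat() check on every include
    apc.shm_size = 64M   ; illustrative cache size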
Yes, so many files can be a problem.
No, it is probably not a problem in your case, since this is only a back-end, which is probably accessed by a few people, and not too often.
In general, I would discourage having more than 20 PHP files loaded on each page. This is because even if the website and the server are highly optimized, for every page the server must go and look at every file to see at least whether it has changed since the last request (if there is no cache at this level).
Even if the time to access a file is tiny, it is time you are losing on each request. This tiny period of time multiplied by 148 can become an issue (and a huge scalability problem).
When I worked on a PHP framework project, I used a trick to reduce the number of files. Several files were combined into one minified file, and this single file was cached. Then, if there was a need to update the framework or the website, the cached file was automatically removed and rebuilt.
Even though I personally discourage minifying the source code (because it is difficult to do, difficult to test, and creates a bunch of problems, like meaningless line numbers in error messages), you can probably do the same thing by just combining all your files into a single file.
Be careful, though: if page A uses half of those files and page B the other half, combining everything will probably decrease performance, since the PHP engine will have to parse more code.
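A minimal sketch of that combine-and-cache trick, with the caveat above in mind (the paths are placeholders, and it assumes every source file is a side-effect-free class or function definition):

    <?php
    // Build one combined include from many small source files and rebuild it
    // whenever any source file is newer than the cached copy.
    $sources  = glob(__DIR__ . '/src/*.php');
    $combined = __DIR__ . '/cache/combined.php';

    $stale = !is_file($combined)
          || max(array_map('filemtime', $sources)) > filemtime($combined);

    if ($stale) {
        $code = "<?php\n";
        foreach ($sources as $file) {
            // strip the opening tag so the pieces concatenate into valid PHP
            $code .= preg_replace('/^<\?php\s*/', '', file_get_contents($file)) . "\n";
        }
        file_put_contents($combined, $code);
    }

    require $combined;   // one include instead of many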
Are the includes themselves doing something fancy, like db queries? And are they all at the top of the page, or are they included as-needed?
Those stats don't look bad, so if admin access is infrequent, you may be OK. But you should examine this from a design angle: can things be organized in a way that would prevent you from having to maintain so many includes? Separate from any performance issues, there is a risk here of creating hard-to-track dependency bugs.
(It could be as MainMa said, related to a framework, in which case you may have no control over the above. I only mention it in case you do.)
A couple things in case you didn't know already:
If it's just text or static HTML, you can get the contents with file_get_contents(), readfile(), etc. This is somewhat faster because the loaded file doesn't need parsing. But obviously if it contains PHP code this won't help.
You can use include_once() to prevent the same file from being included twice (if, for instance, it's included by two files that are themselves included by the top-level file).
Disk I/O won't be your problem. The system will cache frequently accessed files in RAM, or if they aren't that frequently accessed, it won't matter.
Load times may be an issue, as each file has to be requested and interpreted by the server separately.
I don't know how the web server will cope with the many requests; it may not care. If the client doesn't do pipelined requests though, you'll pay for many many TCP connections built up and torn down, which also costs a goodly amount of latency.
Honestly, don't worry about it - 148 is nothing. Even if zero caching happened on the PHP side, you'd be hitting filesystem caches almost every time - and in the grand scheme of things, virtually every open-source project out there has far more files without a problem (Drupal, WordPress, Joomla, Elgg, anything).
Really, there's no problem here - even if you managed to shave a millisecond off here or there, it's so far down the list of places where you can make speed gains that it's barely worth considering for more than a second.
Caveat: do try to use require_once and include_once where suited, and ensure you only load the classes/files that are needed to process a given request.
By "common script startup sequence", what I mean is that in the majority of pages on my site, the first order of business is to consult 3 specific files (via include()), which centrally define constants, certain functions used in many scripts, and a class or two, as well as providing the database credentials. I don't know if there's a more standard term for such a setup.
What I want to know is whether it's possible to have too many of these and make things slower as a result. I know that using include() has a certain amount of overhead because it's another file to look for in the filesystem, parse, and execute. If there is such a thing as too many includes, I want to know whether I am anywhere near that point. N.B. Some of my pages include() still more scripts that they specifically, individually need (for example, a script that defines a function used by only a few pages), and I do not count these occasional extra includes, which are used reasonably sparingly anyway. I'm only worrying about the 3 includes that occur on the majority of pages and set everything up.
What are the 3 includes?
Two of them are outside of webroot. common.php defines a bunch of functions, classes and other things that do not vary between the development and production sites. config.php defines various constants and paths that are different in the development and production sites (which database to connect to, among other things). Of course, it's desirable for this file in particular to be outside of webroot. config.php include()s common.php at the bottom.
The other one is inside webroot and contains a single line:
include [path to appropriate directory]/config.php
The directory differs between the development and production sites.
(Feel free to question the rationale behind setting up the includes this way, but I feel that this does provide a good, reliable system for preparing to execute each page, and my question is about whether it is bad to have that many includes as a baseline on each page.)
Use APC and your worries go away. The opcode of your files will be cached in RAM and everything will go super fast. :) Facebook does this, so it'll definitely help you scale.
You may not notice any difference between 1 include and 50 in terms of speed, but for an application with high concurrency, I/O can be a huge bottleneck. So the key is not speed, but scaling.
The best thing to do is use an accelerator of some kind, APC or eAccelerator or something like that, to keep the scripts cached in RAM. The reasons behind this are many, and on a busy site it means a lot.
For example, a friend ran an experiment on his website, which has about 15k users a day and an average page load time of 0.03s. He removed most of the includes he used as templates - the average load time dropped to 0.01 seconds. Then he added an accelerator - 0.002 seconds per page. I hope those numbers convince you that includes must be kept to a minimum on busy sites if you don't use an accelerator of some kind.
This is because of the high I/O needed to scan directories, find the files, open them, read them and so on.
So keep the includes to a minimum. Study the most important parts of your site and optimize there, by moving required parts to general includes and so on.
I don't believe performance has anything to do with the number of includes as such; think of a case where one included file contains 500 lines of code versus a case where you have 50 included files with just one line of code each.
Or, if by any chance you are using Windows as your OS, you can use WinCache:
http://php.net/manual/en/book.wincache.php
I read through the related questions and didn't find my answer. This isn't about require/require_once, the use of the __autoload function, or even the names of the files.
My company builds large sites, and as we've grown, the practice we've settled into is splitting up functions by what they relate to, such as:
inc.functions-user.php
inc.functions-media.php
inc.functions-calendar.php
Each of these files tends to be 1000 to 3000 lines of code. Combining them would make them a monster to maintain and harder for multiple developers to work on.
However, on some of our larger sites, we end up with somewhere between 8 and 15 of these individual function files.
Is including the 15 function files in the header the best way, or should we find a way to combine them? Are 12 includes vs. 5 includes significantly detrimental to the running of our site?
If you care about performance, install an opcode cache like APC, which will save the compiled form of the script in memory.
If you don't want to install APC, the difference is minimal. Yes, accessing fewer files takes less time, but that's not where most of the time is spent, especially as the filesystem should be able to cache the (uncompiled) scripts in memory if they are requested often enough.
Calling the include/require function 5 times instead of 12 times is not so different; what matters is the content of the included file(s).
Also, opcode caches such as APC or XCache are well suited to your purpose.
I would even suggest splitting them into many more files.
Look at the MVC pattern, or other frameworks: they are split up to an extreme degree, so you can easily maintain "only" parts without worrying about breaking something, as long as you follow your structure.
Some other points to consider:
Rasmus Lerdorf has frequently said that "you shouldn't have more than about five includes". I can only assume that he knows what he's talking about, because he made PHP, but I am skeptical about the feasibility of this, especially on large projects.
I've found that it's better for development and milestones to make life easier on your developers. If that means separate files, then that's a good idea.
If you're worried about CPU usage or bandwidth, there are probably more obvious bottlenecks than liberal use of include. Optimizing slow functions is a better way to make the app faster, and paying attention to images and CSS or JS files is a good way to reduce bandwidth.
With vanilla PHP it is generally better to use as few include files as possible, but of course that makes maintenance a pain. Use an opcode cache such as APC and the performance problem will pretty much disappear. Also, 12 files isn't a very large number of includes compared to the big MVC frameworks and other libraries. Keeping the functions separated in a logical structure is the best way by far.