I am in the process of rewriting some scripts that parse machine-generated logs, porting them from Perl to PHP.
The files range from 20 MB to 400 MB.
I am trying to decide whether I should use file() or the fopen()+fgets() combo to go through the file, for better performance.
Here is the basic rundown:
I check the file size before opening it. If the file is larger than 100 MB (a pretty rare case, but it does happen from time to time), I go the fopen()+fgets() route, since I have only bumped the memory limit for the script to 384 MB and any file larger than 100 MB has a chance of causing a fatal error. Otherwise, I use file().
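A minimal sketch of that branching; $logFile, the parseLine() helper, and the exact cutoff are placeholders:

<?php
const SIZE_CUTOFF = 100 * 1024 * 1024; // the 100 MB threshold mentioned above

if (filesize($logFile) > SIZE_CUTOFF) {
    // Large file: stream it line by line to stay well under the memory limit.
    $fh = fopen($logFile, 'r');
    while (($line = fgets($fh)) !== false) {
        parseLine($line);
    }
    fclose($fh);
} else {
    // Small file: read the whole thing into an array in one call.
    foreach (file($logFile, FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES) as $line) {
        parseLine($line);
    }
}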
In both methods I only go through the file once, from beginning to end, line by line.
Here is the question: is it worth keeping the file() part of the code to deal with the small files? I don't know exactly how file() works in PHP (I use the FILE_SKIP_EMPTY_LINES option as well): does it map the file into memory directly, or does it shove it into memory line by line while going through it? I ran some benchmarks on it; performance is pretty close, with an average difference of about 0.1 s on a 40 MB file, and file() has the advantage over fopen()+fgets() about 80% of the time (out of 200 tests on the same file set).
Dropping the file() part would certainly save me some system memory, and considering I have 3 instances of the same script running at the same time, it could save me 1 GB worth of memory on a 12 GB system that is also hosting the database and other things. But I don't want to drag down the performance of the script either, since about 10k of these logs come in per day and that 0.1 s difference actually adds up.
Any suggestions would help. TIA!
I would suggest sticking with one mechanism, like foreach (new \SplFileObject('file.log') as $line). Split your input files and process them in parallel, 2-3 processes per CPU core. Bonus: run them at a lower priority than the database on the same system. In PHP, this would mean spawning off N copies of the script at once, where each copy has its own file list or directory. Since you're talking about a rewrite and I/O performance is an issue, consider other platforms with more capabilities here, e.g. Java 7 NIO, Node.js asynchronous I/O, C# TPL.
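For reference, a sketch of that single mechanism; the file name and flag choices are illustrative, not prescribed by the answer:

<?php
// Iterate one log line at a time; SplFileObject streams the file, so the
// same code path works for a 20 MB log and a 400 MB log.
$file = new \SplFileObject('file.log');
$file->setFlags(
    \SplFileObject::READ_AHEAD |
    \SplFileObject::SKIP_EMPTY |
    \SplFileObject::DROP_NEW_LINE
);

foreach ($file as $line) {
    // parse $line here
}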
Even though there seem to be a few duplicate questions, I think this one is unique. I'm not asking whether there are any limits; it's only about performance drawbacks in the context of Apache, or the Unix file system in general.
Let's say I request a file from an Apache server,
http://example.com/media/example.jpg
does it matter how many files there are in the same directory, "media"?
The reason I'm asking is that my PHP application generates images on the fly.
Once an image is created, it is placed at the same location that would otherwise trigger the PHP script via mod_rewrite. If the file exists, Apache skips the whole PHP execution and serves the static image directly instead. A kind of gateway cache, if you want to call it that.
Apache has basically two things to do:
Check if the file exists
Serve the file or forward the request to PHP
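A hedged sketch of the fallback script described above; the file naming, the renderImage() helper, and the JPEG content type are assumptions for illustration:

<?php
// generate.php - reached via mod_rewrite only when the static file is missing
$name   = basename($_GET['name']);   // e.g. "example.jpg"; basename() blocks path traversal
$target = __DIR__ . '/media/' . $name;

$image = renderImage($name);         // hypothetical on-the-fly generator
file_put_contents($target, $image);  // subsequent requests are served statically by Apache

header('Content-Type: image/jpeg');
echo $image;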
So far, I have about 25,000 files totaling about 8 GB in this single directory. I expect it to grow at least 10 times over the next few years.
While I don't face any issues managing these files, I have a slight feeling that requesting them via HTTP keeps getting slower. So I wondered whether this is really what happens, or whether it's just my subjective impression.
Most file systems based on the Berkeley FFS will degrade in performance with large numbers of files in one directory due to multiple levels of indirection.
I don't know about other file systems like HFS or NTFS, but my suspicion is that they may well suffer from the same issue.
I once had to deal with a similar issue and ended up using a map for the files.
I think it was something like md5 of myfilename-00001 yielding (for example) e5948ba174d28e80886a48336dcdf4a4, which I then stored in a file named e5/94/8ba174d28e80886a48336dcdf4a4. A map file then mapped 'myfilename-00001' to 'e5/94/8ba174d28e80886a48336dcdf4a4'. This not-quite-elegant solution worked for my purposes, and it only took a little bit of code.
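A minimal PHP sketch of that mapping; the two-character split points and the media/ root are assumptions:

<?php
// Maps a logical name to a hashed on-disk path,
// e.g. 'myfilename-00001' -> 'media/e5/94/8ba174d28e80886a48336dcdf4a4' as in the answer above
function hashedPath(string $name, string $root = 'media'): string
{
    $hash = md5($name);
    return $root . '/' . substr($hash, 0, 2) . '/' . substr($hash, 2, 2) . '/' . substr($hash, 4);
}

$path = hashedPath('myfilename-00001');
if (!is_dir(dirname($path))) {
    mkdir(dirname($path), 0755, true); // create the two-level subdirectories on demand
}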
I'm developing a webapp in PHP, and the core library is 94 KB in size at this point. While I think I'm safe for now, how big is too big? Is there a point where the script's size becomes an issue, and if so, can this be ameliorated by splitting the script into multiple libraries?
I'm using PHP 5.3 and Ubuntu 10.04 32-bit in my server environment, if that makes any difference.
I've googled the issue, and everything I can find pertains to PHP upload size only.
Thanks!
Edit: To clarify, the 94 KB file is a single file that contains all my data access and business logic, and a small amount of UI code that I have yet to extract into its own file.
Do you mean you have 1 file that is 94 KB in size, or that your whole library is 94 KB in total?
Regardless, as long as you aren't piling everything into one file and you're organizing your library into different files, your file size should remain manageable.
If a single PHP file is starting to hit a few hundred KB, you have to think about why that file is getting so big and refactor the code to make sure that everything is logically organized.
I've used PHP applications that probably included several megabytes worth of code; the main thing if you have big programs is to use a code caching tool such as APC on your production server. That will cache the compiled (to byte code) PHP code so that it doesn't have to process every file for every page request and will dramatically speed up your code.
I am doing some tests (LAMP):
Basically, I have 2 versions of my custom framework.
A normal version that includes ~20 files.
A lite version that has everything inside one single big file.
Using my lite version more and more, I am seeing a decrease in load time, i.e. from 0.01 s with the normal version to 0.005 s with the lite version.
Let's consider just the "include" part. I always thought PHP would store the included .php files in memory, so the file system doesn't have to retrieve them on every request.
Do you think condensing every class/function into one big file is worth the "chaos"?
Or is there a setting to tell PHP to keep the required PHP files in memory?
Thanks
(PHP 5.3.x, Apache 2.x, Debian 6 on a dedicated server)
Don't cripple your development by mushing everything up in one file.
A speed-up of 5 ms is nothing compared to the pain you will feel maintaining such a beast.
To put it another way, a single incorrect index in your database can give you orders of magnitude more slowdown.
Your page would load faster using the "normal" version and omitting one 2 KB image.
Don't do it, really just don't.
Or you can do this:
Leave the code as it is (located in many different files)
Combine them into one file when you are ready to upload it to the production server
Here's what I use:
cat js/* > all.js
yuicompressor all.js -o all.min.js
First I combine them into a single file, and then I minify it with the YUI Compressor.
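A rough PHP equivalent for combining the framework's class files; this is a naive sketch that assumes a src/ directory, a lite.php output name, and source files that omit the closing ?> tag:

<?php
// build.php - concatenate all class files into a single lite.php for production
$out = "<?php\n";
foreach (glob(__DIR__ . '/src/*.php') as $file) {
    $code = file_get_contents($file);
    // strip each file's opening tag so the combined file still parses
    $out .= preg_replace('/^<\?php\s*/', '', $code) . "\n";
}
file_put_contents(__DIR__ . '/lite.php', $out);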
I'm using DOMPDF to generate about 500 reports from one script. It runs out of memory after about 10-15 PDFs have been generated.
In debugging, it looks like it's loading 8 MB every time it gets to the font-loading code, but this seems like something that should be handled by the font-caching code.
Any ideas of what's going wrong here? I'd like to post a simple code snippet, but most of it is abstracted into multiple layers, so it's not just a simple copy/paste.
If you're using dompdf 0.6 beta, the memory error is the result of an infinite loop that dompdf enters when rendering tables. This is a known issue that I haven't been able to resolve.
Relevant URLs:
http://code.google.com/p/dompdf/issues/detail?id=34
http://code.google.com/p/dompdf/issues/detail?id=91
(The error you see is: PHP Fatal error: Allowed memory size of 268435456 bytes exhausted)
First, if this is for anything remotely commercial, just get Prince XML. It's substantially better and faster than any other HTML-to-PDF solution (and I've looked at them all). The cost will quickly be recouped in saved developer time.
Second, the quickest solution is probably to print each report in a separate process to sidestep any memory-leak problems. If this is running from the command line, have the outer loop be something like a shell script that starts a process for each report. If it's run from the web, fork a process for each script, if you're on an OS that can do that.
Take a look at Convert HTML + CSS to PDF with PHP?.
As indicated by cletus, the quickest solution for you with DOMPDF is probably going to be rendering each report in a separate process. You can write a master script that calls a child script (using exec) which performs the actual rendering. As you can see in this discussion on the DOMPDF support group, it does seem to have the potential to provide a bit of a boost in performance.
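A minimal sketch of that master/child split; render_report.php and the report IDs are hypothetical placeholders:

<?php
// master.php - render each report in its own PHP process so memory is
// fully released when the child exits
$reportIds = range(1, 500);

foreach ($reportIds as $id) {
    $cmd = sprintf('php render_report.php %s 2>&1', escapeshellarg((string) $id));
    exec($cmd, $output, $exitCode);
    if ($exitCode !== 0) {
        error_log("Report $id failed: " . implode("\n", $output));
    }
    $output = []; // exec() appends to $output, so reset it between runs
}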
It's difficult to say what's going on otherwise regarding memory usage without some kind of example that demonstrates the problem. I don't believe there is much optimization in DOMPDF and the underlying CPDF rendering engine for multiple instances in a single script, so the font data is probably being loaded into memory each time, even though a static variable could be used to cache it.
I'm using PHP to make a simple caching system, but I'm going to be caching up to 10,000 files in one run of the script. At the moment I'm using a simple loop with
$file = "../cache/".$id.".htm";
$handle = fopen($file, 'w');
fwrite($handle, $temp);
fclose($handle);
($id being a random string which is assigned to a row in a database)
but it seems a little bit slow. Is there a better method of doing this? Also, I read somewhere that on some operating systems you can't store thousands and thousands of files in one single directory; is this relevant to CentOS or Debian? Bear in mind this folder may well end up having over a million small files in it.
Simple questions, I suppose, but I don't want to start scaling this code and then find out I'm doing it wrong; I'm only testing with caching 10-30 pages at a time at the moment.
Remember that in UNIX, everything is a file.
When you put that many files into a directory, something has to keep track of those files. If you do an
ls -la
you'll probably notice that the '.' entry has grown to some size. This is where all the info on your 10,000 files is stored.
Every seek, and every write into that directory will involve parsing that large directory entry.
You should implement some kind of directory hashing system. This'll involve creating subdirectories under your target dir.
eg.
/somedir/a/b/c/yourfile.txt
/somedir/d/e/f/yourfile.txt
This'll keep the size of each directory entry quite small, and speed up IO operations.
The number of files you can effectively use in one directory depends not on the operating system but on the filesystem.
You can split your cache dir effectively by taking the md5 hash of the filename and using its first 1, 2 or 3 characters as a directory. Of course, you have to create the directory if it doesn't exist, and use the same approach when retrieving files from the cache.
For a few tens of thousands of files, 2 characters (256 subdirs, from 00 to ff) would be enough.
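Applied to the write loop from the question, a hedged sketch might look like this; the ../cache root comes from the question, while the two-character split and the mkdir handling are this answer's suggestion spelled out:

<?php
// Write a cache entry into an md5-prefixed subdirectory,
// e.g. ../cache/e5/<id>.htm instead of ../cache/<id>.htm
function cachePath(string $id, string $root = '../cache'): string
{
    $dir = $root . '/' . substr(md5($id), 0, 2);
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true); // create the subdirectory on first use
    }
    return $dir . '/' . $id . '.htm';
}

file_put_contents(cachePath($id), $temp);
// Reads must use the same mapping: file_get_contents(cachePath($id))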
File I/O in general is relatively slow. If you are looping over thousands of files and writing them to disk, the slowness could be normal.
I would move that over to a nightly job if that's a viable option.
You may want to look at memcached as an alternative to filesystems. Using memory will give a huge performance boost.
http://php.net/memcache/
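For comparison, a minimal sketch using the Memcache extension from that page; the host, port and one-hour TTL are assumptions:

<?php
// Store the rendered page in memcached instead of a file on disk
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$memcache->set($id, $temp, 0, 3600); // cache for one hour
$cached = $memcache->get($id);       // returns false on a miss
if ($cached === false) {
    // regenerate the page and set() it again
}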