I'm working on a PHP project and I need to parse a large XML file (>240MB) from a URL. I used XMLReader; it works on localhost, but on shared hosting (BlueHost) it fails with a 404 error: http://webmashing.com/meilleures-des/cronjob?type=sejours
Does this kind of task need a dedicated server? If yes, please give me a suggestion.
By the way, would splitting the XML file help?
XMLReader is a pull parser, so it doesn't load the entire file into memory as you parse it; splitting the file will have no effect other than adding complexity to your code. However, if you're holding all of the parsed details in your script, that will take up a lot of memory.
You should be getting some error or message from running the script on your shared hosting that identifies what the problem is. Was their version of PHP built with --enable-libxml? Are you getting a memory allocation error?
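For reference, a minimal pull-parsing sketch looks something like this (the feed URL and the element name are placeholders; reading straight from a remote URL assumes allow_url_fopen is enabled, so downloading to a temporary file first is often more reliable on shared hosting):

<?php
// Sketch: stream a large remote XML feed with XMLReader, one node at a time.
$reader = new XMLReader();
if (!$reader->open('http://example.com/feed.xml')) {   // placeholder URL
    die('Unable to open the XML feed');
}

$count = 0;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'sejour') {
        // Process one record here; nothing else is kept in memory.
        $count++;
    }
}
$reader->close();

echo "Processed $count records, peak memory: " . memory_get_peak_usage(true) . " bytes\n";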
You could use a SAX (Simple API for XML) parser, which is also a good solution for reading a huge XML file.
It does not load the whole file into memory, so it avoids the memory-exhaustion problem. Yes, it will take time to read such a huge file.
You may need to check whether your PHP build has the libxml (libxml2-based) extension enabled, using the phpinfo() function.
Better still, go with XMLReader, as it is faster and saves memory. You can check peak memory usage with memory_get_peak_usage().
Read the file record by record, and unset each record from your array once you have finished processing it.
My guess is that it's a memory-related issue; set the memory and execution-time limits accordingly.
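If it is, something along these lines is worth trying first (the values are illustrative, and shared hosts often cap or ignore them):

<?php
// Sketch: relax the limits before running the import.
// Shared hosting providers may override or disallow these settings.
ini_set('memory_limit', '512M');   // illustrative value
set_time_limit(0);                 // remove the execution time limit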
For what it's worth, I have used vtd-xml (the Java implementation) to parse files over 500MB with success (low memory footprint and fast; maybe the fastest execution time).
I'm trying to read an Excel file larger than 100MB using PHPExcel, but it crashes while loading the file. I don't need any styling. I tried using:
$objReader->setReadDataOnly(true);
but it still crashes.
Is there an efficient way to read an Excel file of this size in PHP?
Try Spout: https://github.com/box/spout.
This is a PHP library that was created to solve exactly this problem (reading and writing large spreadsheet files). Here is why it works:
Other libraries keep a representation of the whole spreadsheet in memory, which makes them prone to out-of-memory errors. Caching strategies can help with these errors, but they hurt performance badly.
Spout, on the other hand, uses streams to read and write data. Only one row is kept in memory at any time, and each row is freed once it has been read or written. This allows fast reads and writes of datasets of any size. Give it a try :)
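For reference, a minimal read loop with the Spout 3.x API might look like this (the file path is a placeholder, and method names differ slightly in older 2.x releases, so check the version you install):

<?php
use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;

require 'vendor/autoload.php';

// Sketch: only one row is held in memory at a time.
$reader = ReaderEntityFactory::createXLSXReader();
$reader->open('big-file.xlsx');   // placeholder path

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        $cells = $row->toArray();   // plain array of cell values
        // ... process the row, then let it go out of scope
    }
}

$reader->close();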
Spout just saved my day! I couldn't read a large file with PhpOffice/PhpSpreadsheet without hitting "Allowed memory size" fatal errors, but with Spout it works like a charm.
I am in the process of rewriting some scripts that parse machine-generated logs, from Perl to PHP.
The files range from 20MB to 400MB.
I am trying to decide whether I should use file() or the fopen()+fgets() combo to go through the file, for the best performance.
Here is the basic run-through:
I check the file size before opening it, and if the file is larger than 100MB (a pretty rare case, but it does happen from time to time) I go the fopen()+fgets() route, since I only bumped the memory limit for the script to 384MB and any file larger than 100MB has a chance of causing a fatal error. Otherwise, I use file().
In both methods I only go through the file once, from beginning to end, line by line.
Here is the question: is it worth keeping the file() part of the code to deal with the small files? I don't know exactly how file() works in PHP (I use the FILE_SKIP_EMPTY_LINES flag as well); does it map the file into memory directly, or does it read it into memory line by line as it goes through? I ran some benchmarks: performance is pretty close, with an average difference of about 0.1s on a 40MB file, and file() beats fopen()+fgets() about 80% of the time (out of 200 tests on the same file set).
Dropping the file() part would certainly save me some system memory, and considering I have 3 instances of the same script running at the same time, it could save me 1GB of memory on a 12GB system that also hosts the database and other stuff. But I don't want to hurt the script's performance either, since about 10k of these logs come in per day and that 0.1s difference adds up.
Any suggestion would help and TIA!
I would suggest sticking with one mechanism, like foreach (new \SplFileObject('file.log') as $line). Split your input files and process them in parallel, 2-3 per CPU core. Bonus: lower priority than the database on the same system. In PHP, this would mean spawning N copies of the script at once, where each copy has its own file list or directory. Since you're talking about a rewrite and IO performance is an issue, consider other platforms with more capabilities here, e.g. Java 7 NIO, Node.js asynchronous IO, or C# TPL.
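A minimal sketch of that single mechanism (the file name is a placeholder; the flags are standard SplFileObject constants):

<?php
// Sketch: one reading mechanism for every file size.
// SplFileObject streams the file, so memory stays flat regardless of size.
$file = new SplFileObject('file.log');   // placeholder path
$file->setFlags(SplFileObject::READ_AHEAD | SplFileObject::DROP_NEW_LINE | SplFileObject::SKIP_EMPTY);

foreach ($file as $line) {
    // ... parse one log line at a time
}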
I currently have a PHP file that must read hundreds of XML files. I have no choice about how these XML files are constructed; they are created by a third party.
The first XML file is a large list of titles for the rest of the XML files, so I search the first file to get the file names of the others.
I then read each XML file, searching its values for a specific phrase.
This process is really slow. I'm talking 5 1/2 minute runtimes, which is not acceptable for a website; customers won't stay around that long.
Does anyone know a way to speed my code up, to a maximum runtime of approximately 30s?
Here is a pastebin of my code : http://pastebin.com/HXSSj0Jt
Thanks, sorry for the incomprehensible English...
Your main problem is that you're making hundreds of HTTP downloads to perform the search. Unless you get rid of that restriction, it's only going to go so fast.
If for some reason the files aren't cacheable at all (unlikely), not even some of the time, you can pick up some speed by downloading in parallel. See the curl_multi_*() functions. Alternatively, use wget from the command line with xargs to download in parallel.
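A rough sketch of parallel fetching with the curl_multi_*() functions (the URL list is a placeholder):

<?php
// Sketch: download several XML files in parallel.
$urls = ['http://example.com/a.xml', 'http://example.com/b.xml'];   // placeholders

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);   // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $xml = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    // ... search/parse $xml here
}
curl_multi_close($mh);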
That approach sounds crazy if you have any kind of traffic, though.
Most likely, the files can be cached for at least a short time. Look at the HTTP headers and see what kind of freshness information their server sends. It might say how long until the file expires, in which case you can save it locally until then. Or it might send a Last-Modified or ETag header, in which case you can make conditional GET requests, which should speed things up further.
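For example, a conditional GET against a local cache might look roughly like this (the paths and URL are placeholders; a 304 response means the cached copy is still fresh):

<?php
// Sketch: conditional GET based on the cached file's modification time.
$cacheFile = '/tmp/feed.xml';                      // placeholder cache path
$ch = curl_init('http://example.com/feed.xml');    // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

if (is_file($cacheFile)) {
    curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
    curl_setopt($ch, CURLOPT_TIMEVALUE, filemtime($cacheFile));
}

$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($code === 304) {
    $body = file_get_contents($cacheFile);   // not modified: reuse the cached copy
} elseif ($code === 200) {
    file_put_contents($cacheFile, $body);    // refresh the cache
}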
I would probably set up a local Squid cache and have PHP make these requests through Squid. It will take care of all the "use the local copy if it's fresh, or conditionally retrieve a new version" logic for you.
If you still want more performance, you can transform the cached files into a more suitable format (e.g., stick the relevant data in a database). Or, if you must stick with the XML format, you can do a string search on the file first to test whether it's worth parsing that file as XML at all.
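For instance, a cheap pre-check before parsing (the file path and search phrase are placeholders):

<?php
// Sketch: skip the XML parser entirely when the raw text cannot contain a match.
$raw = file_get_contents('/cache/feed-42.xml');   // placeholder path
if (stripos($raw, $searchPhrase) !== false) {     // $searchPhrase: whatever you are looking for
    $xml = simplexml_load_string($raw);
    // ... inspect the matching nodes properly
}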
First of all, if you have to deal with large XML files on each request to your service, it is wise to download the XMLs once, preprocess them, and cache them locally.
If you cannot preprocess and cache the XMLs and have to download them on each request (which I don't really believe is the case), you can try to optimize by using XMLReader or some SAX event-based XML parser. The problem with SimpleXML is that it uses DOM underneath. DOM (as the letters stand for) builds a document object model in your PHP process memory, which takes a lot of time and eats tons of memory. I would go so far as to say that DOM is useless for parsing large XML files.
XMLReader, on the other hand, lets you traverse a large XML document node by node while barely consuming any memory, with the trade-off that you cannot issue XPath queries or use other non-sequential node access patterns.
For usage details, consult the PHP manual for the XMLReader extension.
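A common pattern is to walk the document with XMLReader and expand only the interesting nodes into SimpleXML for convenient access. A sketch, with the element and field names as placeholders:

<?php
// Sketch: stream with XMLReader, expand single nodes into SimpleXML as needed.
$reader = new XMLReader();
$reader->open('large.xml');   // placeholder path or URL

$doc = new DOMDocument();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'item') {
        // Only this one element is materialised in memory.
        $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
        if (stripos((string) $node->title, 'phrase') !== false) {   // placeholder field and phrase
            // ... found a match
        }
    }
}
$reader->close();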
I'm developing a webapp in PHP, and the core library is 94kb in size at this point. While I think I'm safe for now, how big is too big? Is there a point where the script's size becomes an issue, and if so can this be ameliorated by splitting the script into multiple libraries?
I'm using PHP 5.3 and Ubuntu 10.04 32bit in my server environment, if that makes any difference.
I've googled the issue, and everything I can find pertains to PHP upload size only.
Thanks!
Edit: To clarify, the 94kb file is a single file that contains all my data access and business logic, and a small amount of UI code that I have yet to extract to its own file.
Do you mean you have one file that is 94KB in size, or that your whole library is 94KB in total?
Regardless, as long as you aren't piling everything into one file and you're organizing your library into different files your file size should remain manageable.
If a single PHP file is starting to hit a few hundred KB, you have to think about why that file is getting so big and refactor the code to make sure that everything is logically organized.
I've used PHP applications that probably included several megabytes worth of code; the main thing, if you have big programs, is to use a code caching tool such as APC on your production server. That will cache the compiled (byte code) PHP so that every file doesn't have to be re-parsed on every page request, and it will dramatically speed up your code.
I'm using DOMPDF to generate about 500 reports from one script. It's running out of memory after about 10-15 PDFs have been generated.
In debugging, it looks like it allocates another 8MB every time it gets to the font-loading stuff, but this seems like something that should be handled by the font caching code.
Any ideas of what's going wrong here? I'd like to post a simple code snippet, but most of it is abstracted into multiple layers, so it's not just a simple copy/paste.
If you're using dompdf 0.6 beta, the memory error is the result of an infinite loop that dompdf enters when rendering tables. This is a known issue that I haven't been able to resolve.
Relevant URLs:
http://code.google.com/p/dompdf/issues/detail?id=34
http://code.google.com/p/dompdf/issues/detail?id=91
(The error you see is: PHP Fatal error: Allowed memory size of 268435456 bytes exhausted.)
First, if this is for anything remotely commercial, just get Prince XML. It's substantially better and faster than any other HTML-to-PDF solution (and I've looked at them all). The cost will quickly be recouped in saved developer time.
Second, the quickest solution is probably to render each report in a separate process to sidestep any memory-leak problems. If this is running from the command line, have the outer loop be something like a shell script that starts a process for each report. If it's run from the web, fork a process for each report if you're on an OS that can do that.
Take a look at Convert HTML + CSS to PDF with PHP?.
As indicated by cletus, the quickest solution for you with DOMPDF is probably going to be rendering each report in a separate process. You can write a master script that calls a child script (using exec) which performs the actual rendering. As you can see in this discussion on the DOMPDF support group, it does seem to have the potential to provide a bit of a boost in performance.
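As a rough sketch (the script names and report IDs are placeholders, not part of dompdf's API), the master/child split could look like this:

<?php
// master.php -- sketch: render each report in its own PHP process so the memory
// used by dompdf and its font cache is released between reports.
$reportIds = range(1, 500);   // placeholder list of report IDs

foreach ($reportIds as $id) {
    $output = [];
    exec('php render_report.php ' . escapeshellarg($id) . ' 2>&1', $output, $exitCode);
    if ($exitCode !== 0) {
        error_log("Report $id failed: " . implode("\n", $output));
    }
}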
It's difficult to say what else might be going on with memory usage without some kind of example that demonstrates the problem. I don't believe DOMPDF and the underlying CPDF rendering engine are optimized much for multiple instances in a single script, so the fonts are probably being loaded into memory each time, even though a static variable could be used to cache that data.