I currently have a PHP file that must read hundreds of XML files. I have no choice in how these XML files are constructed; they are created by a third party.
The first XML file is a large list of titles for the rest of the XML files, so I search the first XML file to get the file names for the rest.
I then read each XML file, searching its values for a specific phrase.
This process is really slow. I'm talking 5 1/2 minute runtimes... which is not acceptable for a website; customers won't stay on for that long.
Does anyone know a way to speed my code up, to a maximum runtime of approximately 30 seconds?
Here is a pastebin of my code : http://pastebin.com/HXSSj0Jt
Thanks, sorry for the incomprehensible English...
Your main problem is that you're trying to make hundreds of HTTP downloads to perform the search. Unless you get rid of that restriction, it's only going to go so fast.
If for some reason the files aren't cacheable at all (unlikely), not even some of the time, you can pick up some speed by downloading in parallel. See the curl_multi_*() functions. Alternatively, use wget from the command line with xargs to download in parallel.
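A rough sketch of the curl_multi approach (the URL list, timeout, and the search step are placeholders):

<?php
// Hypothetical list of XML URLs taken from the index file.
$urls = array('http://example.com/a.xml', 'http://example.com/b.xml');

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers in parallel.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $xml = curl_multi_getcontent($ch); // raw XML body
    // ... search $xml for the phrase here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);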
The above sounds crazy if you have any kind of traffic, though.
Most likely, the files can be cached for at least a short time. Look at the HTTP headers and see what kind of freshness info their server sends. It might say how long until the file expires, in which case you can save it locally until then. Or it might give a Last-Modified or ETag header, in which case you can do conditional GET requests, which should still speed things up.
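For example, a conditional GET with curl might look something like this (the validator values and cache path are made up; in practice you'd store whatever their server sent on the previous fetch):

<?php
// Validators saved from a previous response (hypothetical values).
$lastModified = 'Sat, 01 Jan 2011 00:00:00 GMT';
$etag         = '"abc123"';
$cachePath    = '/tmp/titles.xml'; // example local cache location

$ch = curl_init('http://example.com/titles.xml');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'If-Modified-Since: ' . $lastModified,
    'If-None-Match: ' . $etag,
));
$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($code == 304) {
    // Not modified: reuse the locally cached copy.
    $body = file_get_contents($cachePath);
} else {
    // Fresh copy: update the local cache.
    file_put_contents($cachePath, $body);
}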
I would probably set up a local Squid cache and have PHP make these requests through Squid. It'll take care of all the "use the local copy if it's fresh, or conditionally retrieve a new version" logic for you.
If you still want more performance, you can transform cached files into a more suitable format (e.g., stick the relevant data in a database). Or, if you must stick with the XML format, you can do a string search on the file first to test whether you should bother parsing that file as XML at all.
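For instance, a cheap pre-check before parsing (the file path and search term are just examples):

<?php
$cachedFile = '/tmp/feed-123.xml'; // example path to a locally cached file
$phrase     = 'some phrase';       // the term you're searching for

$contents = file_get_contents($cachedFile);

// Only pay the XML parsing cost if the raw text contains the phrase at all.
if (strpos($contents, $phrase) !== false) {
    $xml = simplexml_load_string($contents);
    // ... inspect the matching nodes properly ...
}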
First of all, if you have to deal with large XML files for each request to your service, it is wise to download the XML files once, preprocess them, and cache them locally.
If you cannot preprocess and cache the XML files and have to download them for each request (which I don't really believe is the case), you can try to optimize by using XMLReader or some SAX event-based XML parser. The problem with SimpleXML is that it uses DOM underneath. DOM (as the letters stand for) creates a document object model in your PHP process memory, which takes a lot of time and eats tons of memory. I would go so far as to say that DOM is useless for parsing large XML files.
XMLReader, on the other hand, will let you traverse the large XML node by node while barely using any memory, with the trade-off that you cannot issue XPath queries or use any other non-sequential node access patterns.
For how to use XMLReader, consult the PHP manual for the XMLReader extension.
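A minimal sketch of the streaming approach (the file name and element name are made up; adjust them to your schema):

<?php
$reader = new XMLReader();
$reader->open('big.xml'); // example file

while ($reader->read()) {
    // Only stop on opening <item> elements (hypothetical element name).
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
        // Expand just this one node into a small SimpleXML fragment.
        $node = new SimpleXMLElement($reader->readOuterXml());
        // ... check $node for the phrase, then let it go out of scope ...
    }
}
$reader->close();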
We want to merge a lot of PDF files into one big file and send it to the client. However, the resources on our production server are very restricted, so merging all files in memory first and then sending the finished PDF file results in our script being killed because it exhausts its available memory.
The only solution (besides getting a better server, obviously) would be to start streaming the PDF file before it is fully created, to bypass the memory limit.
However, I wonder if that is even possible. Can PDF files be streamed before they're fully created? Or does the PDF file format not allow streaming unfinished files, because some headers or whatever have to be set after the full contents are known?
If it is possible, which PDF library supports creating a file as a stream? Most libraries that I know of (like TCPDF) seem to create the full file in memory and then, in the end, output this finished result somewhere (i.e. via the $tcpdf->Output() method).
The PDF file format is entirely capable of being streamed. There's certainly nothing in the format that will prevent it.
As an example, we recently had a customer that required reading a single page over an HTTP connection to a remote PDF, without downloading or reading the whole PDF. We're able to do this by making many small HTTP requests for specific content within the PDF. We use the trailer at the end of the PDF and the cross reference table to find the required content without having to parse the whole PDF.
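Just to illustrate the idea (a rough sketch in PHP, not our actual implementation; the URL and range size are arbitrary), a ranged request can pull in only the tail of a remote PDF, where the trailer and the startxref offset live:

<?php
// Fetch only the last 2048 bytes of a remote PDF (a suffix byte range).
$ch = curl_init('http://example.com/big.pdf'); // example URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, '-2048');
$tail = curl_exec($ch);
curl_close($ch);

// The trailer ends with "startxref <offset> %%EOF"; from that offset,
// further ranged requests can pull the cross reference table and the
// specific objects that are actually needed.
if (preg_match('/startxref\s+(\d+)/', $tail, $m)) {
    $xrefOffset = (int) $m[1];
}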
If I understand your problem, it looks like the library you're currently using loads each PDF into memory before creating or streaming out the merged document.
If we look at this problem a different way, the better solution would be for the PDF library to take only references to the PDFs to be merged, and then, when the merged PDF is being created or streamed, pull in the content and resources from those PDFs as and when required.
I'm not sure how many PHP libraries can do this, as I'm not too up to date with PHP, but there are probably a few C/C++ libraries that can. I understand PHP can use extensions to call these libraries. The only downside is that they'll likely have commercial licenses.
Disclaimer: I work for the Mako SDK R&D group, hence why I know for sure there are some libraries which will do this. :)
Good day!
I have a PHP script that reads a huge XML file. I use fgets to read it line by line. At some point, we need to stop the script to check some data integrity. My problem is how to resume from that state (I mean the line at which the script stopped). We don't want to start the script all over again, as it takes days to complete.
Is there a way I can accomplish this? Any suggestion would be greatly appreciated.
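To illustrate the kind of checkpointing in question (a rough sketch; the file names are just examples), the byte offset from ftell can be saved periodically and fed back to fseek on the next run:

<?php
$source  = 'huge.xml';        // the big XML file (example name)
$chkFile = 'huge.xml.offset'; // where the byte offset is saved (example name)

$fp = fopen($source, 'r');

// Resume from the last saved byte offset, if any.
if (file_exists($chkFile)) {
    fseek($fp, (int) file_get_contents($chkFile));
}

while (($line = fgets($fp)) !== false) {
    // ... process the line ...

    // Record how far we got, so the script can be stopped and later
    // resumed from this exact position.
    file_put_contents($chkFile, ftell($fp));
}
fclose($fp);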
DOMDocument?
There's also PHP's SimpleXML library, which needs to be installed before you can use it in your applications, but before you can use any of its functions, it loads the whole XML document into memory.
There's also the XMLReader library, which reads XML files without loading the whole file into memory, and it is the better method for situations like this.
Here you can find information about these two libraries:
http://us.php.net/manual/en/book.xmlreader.php
http://us.php.net/manual/en/book.simplexml.php
And examples of using them:
http://us.php.net/manual/en/simplexml.examples-basic.php
And here's a much more detailed explanation:
How to use XMLReader in PHP?
I'm currently developing a web application with Zend Framework and PHP 5.3. I have an XML file that contains config and mapping information (~1500 lines). On each request I perform an XPath query to get information from that XML file. The information in the XML file is static and does not change after the deployment of the application.
Is it good practice to generate a PHP file that contains the XML information as static arrays on the first request, and then load that PHP file on every request to get the information, instead of querying the XML?
You can cache the parsed config file as a PHP source file with var_export.
Generating code to cache resources is done in several places in Zend Framework, for example in the autoloader, so I presume it is good practice.
There is also another way to cache it: with serialize (make sure to serialize an array, not, for example, a SimpleXML object) or with Zend_Cache, which does more or less the same but is more flexible as to how the result is stored.
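A sketch of the var_export approach (the paths and the SimpleXML-to-array conversion are just examples, assuming the usual APPLICATION_PATH constant from a Zend Framework app):

<?php
$cacheFile = APPLICATION_PATH . '/../data/cache/config.php'; // example path

if (is_file($cacheFile)) {
    // Cheap path: include the pre-generated array (opcode-cacheable too).
    $config = include $cacheFile;
} else {
    // Expensive path: parse the XML once and write it out as PHP source.
    $xml    = simplexml_load_file(APPLICATION_PATH . '/../configs/mapping.xml');
    $config = json_decode(json_encode($xml), true); // plain array, not a SimpleXML object
    file_put_contents($cacheFile, '<?php return ' . var_export($config, true) . ';');
}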
Since the XML doesn't change after deployment, I think it would be best to transform that XML in your local dev environment, and not on the production system when needed. It's not a good idea to generate source code on the production system that will be automatically included without any validation.
I'm not very familiar with XSLT, but it might be an option for you, depending on the concrete structure of that XML.
We run multiple Windows/IIS/.Net sites (up to 30+ sites per server). Each site is customized for the individual customer via a configuration file that contains the settings.
I am tasked with writing a small tool that will 'grep' all of the config files on a certain server for a particular config setting (or settings) and return the values for a nice tabled web page display. It will save many groups lots of time, especially since most groups don't have access to production servers, but they need to know how a customer is currently configured.
I have working code that finds all .config files from a starting path, and I can easily extend it to do the grepping. Here are the challenges:
I want to aggregate this data from MULTIPLE servers. That means the tool will be hosted on its own server and will make calls to a list of servers.
I'm limited to using .NET/ASP on the actual servers (they won't install PHP on IIS), but I'm writing the tool in PHP.
PROPOSED DESIGN: From my vantage point, I'm thinking the best way to accomplish this is to write my PHP tool and have it make AJAX or CURL requests to ASP scripts that live on each server in the list. Each ASP script could do the recursive directory parsing to find the config files and individually grep the files for the data, and return it in the RESPONSE.
Is that the best way to accomplish this? Should the ASP or PHP side do the 'heavy lifting'? Is there a recommended data format I should use to pass the data?
Any ideas or samples would be great. If you need more info, I can provide!
Thanks!
Update: Here's an example of a config. It's a basic ASP file that gets included in other scripts.
custConfig1 = " 8,9,6:5:5 "
custConfig2 = " On "
I think you're bang on using PHP for the "receiving" script, and I'm pretty sure you have that in hand.
Based on the format of your example config file, you could use ExecuteGlobal in classic ASP to load each file as you loop through them in your recursive directory lookup. Then you can use the custConfig1 et al. names in your script, e.g. (pseudocode):
for each file
    ExecuteGlobal fileContents   ' brings custConfig1, custConfig2, ... into scope
    output("custConfig1") = custConfig1
next
Return what you need as JSON using a handy library and then do all the "hard" work of collating it and outputting it in PHP.
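On the PHP side, collecting the responses could be as simple as this (the server list and the endpoint path are placeholders):

<?php
// Hypothetical list of servers exposing the ASP "grep" endpoint.
$servers = array('web01.example.com', 'web02.example.com');
$results = array();

foreach ($servers as $server) {
    $ch = curl_init('http://' . $server . '/tools/configgrep.asp?key=custConfig1');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $json = curl_exec($ch);
    curl_close($ch);

    // Each ASP script is assumed to return JSON like {"siteA": "value", "siteB": "value"}.
    $results[$server] = json_decode($json, true);
}

// $results can now be collated into the tabled web page display.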
Yes. "Grep" (if by that you mean importing a text file and using regular expressions to navigate it) isn't the best solution, in my humble opinion. Use either JSON or XML as the format, and use PHP's built-in XML or JSON tools.
JSON: http://php.net/manual/en/book.json.php
XML: http://php.net/manual/en/book.simplexml.php
You could use DOM as an alternative to SimpleXML to navigate the XML, but SimpleXML is easier to learn (again, in my opinion) and will work for your needs.
I'm working on a PHP project, and I need to parse a large XML file (>240MB) from a URL. I used XMLReader; it works on localhost but is not working on shared hosting (BlueHost), where it shows a 404 error! http://webmashing.com/meilleures-des/cronjob?type=sejours
Does this task need a dedicated server? If yes, please give me a suggestion.
By the way, would splitting the XML file help?
XMLReader is a pull parser, so it doesn't load the entire file into memory as you parse it, which means splitting the file will have no effect other than adding complexity to your code. However, if you're holding all the details that you parse in your script, that will take up a lot of memory.
However, you should be getting some error or message from running the script on your shared hosting to identify what the problem is. Was their version of PHP built with --enable-libxml? Are you getting a memory allocation error?
You may use a SAX (Simple API for XML) parser, which is also a good solution for reading huge XML files.
It will not load the whole file into memory, which prevents the memory-exhaustion problem. Yes, it will take time to read such a huge file.
You may need to check whether your PHP has the libxml and libxml2 modules installed, using the phpinfo() function.
But it's better if you can go with XMLReader, as it is faster and saves memory. You can check peak memory usage with memory_get_peak_usage().
Read the file row by row and unset each row from the array once the operation on that particular row is done.
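A rough sketch of that (the URL and the element name are placeholders):

<?php
$reader = new XMLReader();
$reader->open('http://example.com/huge-feed.xml'); // placeholder URL

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'row') { // placeholder element
        $row = $reader->readOuterXml();
        // ... handle this one row, then drop it before moving on ...
        unset($row);
    }
}
$reader->close();

echo memory_get_peak_usage(true); // peak memory in bytes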
My guess is that it's a memory-related issue (set the memory and execution time limits).
For what it's worth, I have used vtd-xml (the Java implementation) to parse files over 500MB with success (low memory footprint and fast, maybe the fastest execution time).