I've got a few huge XML files, and I cut a few rows out so I could have a manageable-sized file on which to test my parsing script, written in PHP. There is a lot of nesting in the XML file, there are a lot of columns, and there are a lot of blanks, so writing the script was this huge ordeal. Now, I'm hitting my PHP memory limit on the full-sized XML files I want to parse.
Now, one thing I've considered is temporarily upping the PHP memory limit, but I need to rerun this script every... well, week or so. Also, I don't have the best system. Running it hot and letting it melt is an all-too-real possibility and one of my "perfect storms".
I also considered attempting to learn a new language, such as Perl or Python. I could probably stand to know one of these languages anyway, but I would prefer to stick with what I have, if only in the interest of time.
Isn't there some way to have PHP break the XML file up into manageable chunks that won't push my machine to its limit? Since every row in the XML file is wrapped in an ID column, it seems like I should be able to cut at the nth row closure, parse what was sliced, and then sleep, or something?
Any ideas?
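For what it's worth, here is a rough sketch of the slicing idea, written against a hypothetical <row> element whose opening and closing tags each sit on their own line, and with a made-up batch size:

    $fp      = fopen('huge.xml', 'r');
    $slice   = '';      // buffer of complete <row> elements
    $inRow   = false;
    $rows    = 0;
    $perPass = 500;     // rows to parse per slice (tune to your memory limit)

    while (($line = fgets($fp)) !== false) {
        if (!$inRow && strpos($line, '<row') !== false) {
            $inRow = true;
        }
        if ($inRow) {
            $slice .= $line;
        }
        if ($inRow && strpos($line, '</row>') !== false) {
            $inRow = false;
            if (++$rows % $perPass === 0) {
                $xml = simplexml_load_string('<rows>' . $slice . '</rows>');
                // ... run the existing per-row parsing logic on $xml here ...
                $slice = '';
                sleep(1);   // let the machine cool off between slices
            }
        }
    }
    if (trim($slice) !== '') {
        $xml = simplexml_load_string('<rows>' . $slice . '</rows>');
        // ... parse the final partial slice ...
    }
    fclose($fp);

The streaming answers further down (SAX and pull parsing) achieve the same thing more robustly, without relying on tag placement.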
I have two xdebug trace files, about 25 MB each. One of them led to an error and I'm trying to find the spot where the two execution sequences deviated. The problem is that there are a bunch of irrelevant differences between the two files that I want to ignore, such as remote port numbers, query times and other database statistics.
My first attempt at solving this was to open the trace files in Excel and remove the execution times, which are obviously different most of the time. Then I tried using compare/merge apps to get rid of the other irrelevant differences. For example, I replaced the remote port numbers in both of the files with a placeholder string PORT_NUMBER. Some of the differences are repeated again and again, so I need to be able to search and replace globally. The problem is that all of the apps I've tried are extremely slow and often crash. They can't handle rendering with word wrap, search and replace, or even simple editing with files this big.
I've tried many compare/merge apps, including DiffMerge, WinMerge, KDiff3, Meld, Notepad++, Eclipse and Visual Studio Code. I don't think that using diff and sed together would work, because I need to see in-line differences and jump to different parts of the large file quickly. I would also have to copy and paste the differences from diff to use in sed, and switch to another terminal for sed. There are also special characters and extremely long lines, so I don't think sed is a good option.
I'd like to find a way to use the trace files to find the point of deviation in the execution sequence.
The best way to do this is to have two "computerized" trace files (that's xdebug.trace_format=1). This format is tab-separated, which makes it easy to write a script that goes through them and does the comparison. You can ignore arguments to methods/functions and so on, and even compare return values if you wish.
There is a mini script in Xdebug's contrib directory (https://github.com/xdebug/xdebug/blob/master/contrib/tracefile-analyser.php) that shows a little about how to do that.
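As a rough sketch of such a script (the file names are made up, and the column indexes are placeholders you would adjust to the actual trace_format=1 layout), compare only the columns you consider significant and stop at the first line where they differ:

    $keep = array(0, 5);   // assumed: 0 = depth, 5 = function name; ignore time/memory/argument columns

    $a = fopen('good.xt', 'r');
    $b = fopen('bad.xt', 'r');
    $lineNo = 0;

    while (($la = fgets($a)) !== false && ($lb = fgets($b)) !== false) {
        $lineNo++;
        $fa = array_intersect_key(explode("\t", $la), array_flip($keep));
        $fb = array_intersect_key(explode("\t", $lb), array_flip($keep));
        if ($fa !== $fb) {
            echo "Traces deviate at line $lineNo:\n  " . trim($la) . "\n  " . trim($lb) . "\n";
            break;
        }
    }
    fclose($a);
    fclose($b);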
If you can't create them again, then you are in for a much harder task. The first thing I would do is to strip out all the arguments (with a regexp). Vim would probably be your best bet for handling two 25 MB files.
I'm building a website which loads dynamically from a SQL database. To do that, I've created a PHP file that handles ALL AJAX, POST and GET requests (and every page has a couple of PHP includes).
The PHP file is not very long yet (250 lines), but it could get much longer eventually.
Everything is wrapped in <?php **** ?> tags, and is clearly and methodically composed of switch and case. So I only need a couple of lines each time.
My question is: does every include request load/download the whole file, or just the corresponding part? Rephrasing: will a hypothetical 10,000-line script slow down the browser, or just the response time from the server?
I have concerns about all this.
Thanks in advance.
PS: This idea of unifying SQL requests comes from a computer engineer friend who's always insisting on Multitier Architecture.
When you include a file, the entire file's contents are inserted and executed in the script at that point. Depending on what is going on with the includes, you could be slowing down the response if you are including files that are not necessary for the response.
http://us3.php.net/manual/en/function.include.php
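A tiny illustration (file names made up): everything in the included file is compiled and executed right where the include appears, so pulling in code a request doesn't need is pure server-side overhead.

    <?php
    // util.php
    echo "util.php was loaded\n";        // top-level code runs immediately on include
    function formatRow($row) {           // definitions become available from here on
        return implode(',', $row);
    }

    <?php
    // index.php
    include 'util.php';                  // prints "util.php was loaded" at this point
    echo formatRow(array('a', 'b', 'c'));   // a,b,c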
I have a great solution! (And this was my own question)
Instead of passing a variable $option into a switch case case case .... case case endswitch, I thought I could just split the PHP into a lot of different PHP files named after their corresponding cases.
So, instead of using case('sports'), I will now use a <?php include('some_folder/'.$option.'.php');?>, which avoids having the whole file compiled every time it is called.
Since PHP is processed on the server, only the included file comes along, and it just has to deal with 20 lines of code instead of 400 (and, eventually, many more).
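As a sketch of that idea (the folder and option names are made up), with a small whitelist so $option can't be abused to include arbitrary files:

    <?php
    // ajax.php - single entry point for AJAX/POST/GET requests
    $option  = isset($_REQUEST['option']) ? $_REQUEST['option'] : '';
    $allowed = array('sports', 'news', 'weather');   // hypothetical option names

    if (in_array($option, $allowed, true)) {
        // Only this small file is compiled and executed for the request,
        // instead of one big switch/case covering every option.
        include 'some_folder/' . $option . '.php';
    } else {
        http_response_code(400);   // unknown option
    }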
Definitely, thanks.
I do not know much about files and their related security. I have a LOT of data in XML files which I am planning on parsing to put in the database. I get these XML files from 3rd-party people, and I will be getting a minimum of around 1000 files per day. So I will write a script to parse them and enter them in our database. Now I have many questions regarding this.
1. I know how to parse a single file, and I can extend the logic to multiple files in a single loop. But is there a better way to do the same? How can I use multithreaded programming to parse many of the files simultaneously? There will be a script which, given a file, parses that single file and outputs to the database. How can I use this script to parse in multiple threads / parallel processes?
2. The files, as I said, come from a 3rd-party site. So how can I be sure that there are no security loopholes? I mean, I don't know much about file security, but what are the MINIMUM common basic security checks I need to take? (Like SQL injection and XSS being the very basics in web programming.)
3. Again security related: how do I ensure that the incoming XML file is actually XML? I mean, I can check the extension, but is there a possibility to inject scripts and make them run when I parse these files? And what steps should I take while parsing individual files?
You want to validate the XML. This does two things:
Make sure it is "well-formed" - a valid XML document
Make sure it is "valid" - follows a schema, dtd or other definition - it has the elements and you expect to parse.
In PHP 5, the syntax for validating XML documents with the DOM extension is:

    $dom = new DOMDocument();
    $dom->load('articles.xml');              // the document to check
    $dom->validate();                        // against the DTD referenced in the document (e.g. articles.dtd via a DOCTYPE)
    $dom->relaxNGValidate('articles.rng');   // against a RELAX NG schema
    $dom->schemaValidate('articles.xsd');    // against an XML Schema (XSD)
Of course you need an XSD (XML Schema) or DTD (Document Type Definition) to validate against.
I can't speak to point 1, but it sounds fairly simple - each file can be parsed completely independently.
Points 2 and 3 are effectively about the contents of the file. Simply put, you can check that it's valid XML by parsing it and asking the parser to validate as it goes, and that's all you need to do. If you're expecting it to follow a particular DTD, you can validate it against that. (There are multiple levels of validation, depending on what your data is.)
XML files are just data, in and of themselves. While there are "processing instructions" available in XML, they're not instructions in quite the same way as direct bits of script to be executed, and there should be no harm in just parsing the file. Two potential things a malicious file could do:
Try to launch a denial-of-service attack by referring to a huge external DTD, which will make the parser use large amounts of bandwidth. You can probably disable external DTD resolution if you want to guard against this.
Try to take up significant resources just by being very large. You could always limit the maximum file size your script will handle. (A sketch of both guards follows below.)
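For what it's worth, a minimal sketch of both guards (the file path and size cap are made up):

    $path = 'incoming/feed.xml';                // hypothetical path to a received file

    // Guard against the "very large file" case: refuse anything over a size cap.
    if (filesize($path) > 50 * 1024 * 1024) {
        die("File too large, skipping.\n");
    }

    // Guard against the external-DTD case: LIBXML_NONET stops libxml from
    // fetching external DTDs/entities over the network while loading.
    $dom = new DOMDocument();
    if (!$dom->load($path, LIBXML_NONET)) {
        die("Not well-formed XML, skipping.\n");
    }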
I have a project that is done but needs better performance.
The gist of the project is that I'm taking XML and converting it to CSV files. The files represent data to be loaded into a Database.
Right now I'm using PHP to unzip the zip file that contains the XML. Then I parse, convert to CSV, and rezip.
It's been fine till now, but the XML files are getting HUGE, so much so that processing takes a little more than a day. I'm also doing some manipulations to the files in there somewhere, like rearranging columns and trimming values.
What alternatives do you suggest that would help me improve performance?
I've thought about writing this parser in C++, but I'm not sure what route to take. Similar questions have been asked, but this is more of a performance issue, I suppose. Should I switch languages for performance, stick with PHP and optimize that, or should I try to make this parser parallel so more than one file can be done at a time?
What would you suggest?
You should give Perl a try if PHP doesn't deliver what you want, but I doubt that's the problem; maybe you are doing something wrong there (logically).
What kind of XML parser are you using? (It had better be a SAX one...)
Also, it would be nice to see some code (how you parse the XMLs...)
I need to parse pretty big XML in PHP (like 300 MB). How can I do it most effectively?
In particular, I need to locate specific tags and extract their content into a flat TXT file, nothing more.
You can read and parse XML in chunks with an old-school SAX-based parsing approach using PHP's xml parser functions.
Using this approach, there's no real limit to the size of documents you can parse, as you simply read and parse a buffer-full at a time. The parser will fire events to indicate it has found tags, data etc.
There's a simple example in the manual which shows how to pick up start and end of tags. For your purposes you might also want to use xml_set_character_data_handler so that you pick up on the text between tags also.
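A minimal sketch of that approach, assuming made-up file names and a target tag called <title> (the parser upper-cases element names by default, hence TITLE below):

    $target = 'TITLE';   // assumption: the tag whose text you want to extract
    $buffer = null;      // null means "not currently inside the target tag"
    $out    = fopen('output.txt', 'w');

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$buffer, $target) {
            if ($name === $target) {
                $buffer = '';                          // start collecting text
            }
        },
        function ($p, $name) use (&$buffer, $target, $out) {
            if ($name === $target) {
                fwrite($out, trim($buffer) . "\n");    // one extracted value per line
                $buffer = null;
            }
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$buffer) {
        if ($buffer !== null) {
            $buffer .= $data;                          // text can arrive in several chunks
        }
    });

    // Stream the document in 8 KB chunks so memory use stays flat.
    $fp = fopen('huge.xml', 'r');
    while (!feof($fp)) {
        xml_parse($parser, fread($fp, 8192), feof($fp));
    }
    fclose($fp);
    xml_parser_free($parser);
    fclose($out);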
The most efficient way to do that is to create a static XSLT stylesheet and apply it to your XML using XSLTProcessor. The method names are a bit misleading: even though you want to output plain text, you should use either transformToXml() if you need it as a string variable, or transformToUri() if you want to write a file.
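A rough sketch of that route, assuming a made-up stylesheet extract.xsl that declares <xsl:output method="text"/> and prints the tags you care about (note that the DOM side still loads the whole document into memory):

    $xml = new DOMDocument();
    $xml->load('huge.xml');

    $xsl = new DOMDocument();
    $xsl->load('extract.xsl');

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);

    // Returns plain text here because the stylesheet uses method="text".
    file_put_contents('output.txt', $proc->transformToXml($xml));
    // Or write straight to a file instead:
    // $proc->transformToUri($xml, 'output.txt');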
If it's a one-off or occasional job, I'd use XML Starlet. But if you really want to do it on the PHP side, then I'd recommend pre-splitting it into smaller chunks and processing those. If you load it via DOM as one big chunk, it will take a lot of memory. Also, use a CLI PHP script to speed things up.
This is what SAX was designed for. SAX has a low memory footprint, reading in a small buffer of data and firing events when it encounters elements, character data, etc.
It is not always obvious how to use SAX (well, it wasn't to me the first time I used it), but in essence you have to maintain your own state and view of where you are within the document structure. Generally you will end up with variables describing which section of the document you are in, e.g. inFoo, inBar, etc., which you set when you encounter particular start/end elements.
There is a short description and example of a SAX parser here.
Depending on your memory requirements, you can either load it up and parse it with XSLT (the memory-consuming route), or you can create a forward-only cursor and walk the tree yourself, printing the values you're looking for (the memory-efficient route).
Pull parsing is the way to go. This way it's memory-efficient AND easy to process. I have been processing files as large as 50 MB or more.
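A minimal sketch of the forward-only / pull-parsing approach with PHP's XMLReader, assuming made-up file names and a target element called <item>:

    $reader = new XMLReader();
    $reader->open('huge.xml');       // streams the file; it is never fully in memory
    $out = fopen('output.txt', 'w');

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            // readString() returns just this element's text content
            fwrite($out, trim($reader->readString()) . "\n");
        }
    }

    $reader->close();
    fclose($out);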