I need to page an XML dataset in PHP.
The website I'm running is not high-volume, so an implementation that queries the whole serialized XML file for each page is OK, but I'd also be interested in approaches that do it right from the start (for example, slicing the file into many smaller files).
What are some approaches to do this in PHP?
My personal preference is simplexml_load_string, since handling XML with SimpleXMLElement is so much easier than with DOMDocument.
Define how many items are visible on a page
Count all items in your dataset
Select the actual offset from your dataset
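Those three steps might be sketched like this with SimpleXML. The `<items>`/`<item>` element names and the `$page`/`$perPage` values are placeholders, not from the original post:

```php
<?php
// A minimal paging sketch over an in-memory XML dataset.
$xml = '<items>'
     . '<item>one</item><item>two</item><item>three</item>'
     . '<item>four</item><item>five</item>'
     . '</items>';

$data    = simplexml_load_string($xml);
$perPage = 2;                           // 1. items visible per page
$total   = count($data->item);          // 2. count all items in the dataset
$page    = 2;                           // current page, 1-based
$offset  = ($page - 1) * $perPage;      // 3. actual offset into the dataset

// Slice out just the items for the current page.
$pageItems = array_map('strval',
    array_slice(iterator_to_array($data->item, false), $offset, $perPage));
```

Here `$pageItems` ends up holding the two items for page 2.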
I have a big XML file, larger than 100mb, and I want to check if the structure of this file is valid.
I can try to load this file with DOMDocument, or I can read it with the PHP XML parser, which "lets you parse, but not validate, XML documents".
Is there any way to do this without fully loading the XML file into memory?
Firstly, you don't say what kind of schema you are using for validation: DTD, XSD, RelaxNG?
Secondly, you mention PHP, but you don't say whether the solution has to be based on PHP. Could you, for example, use Java?
Generally speaking, validating an XML document against a schema is a streamable operation, it does not require building a tree representation of the XML document in memory. Finding a streaming validator that works in your environment should not be hard, but we need to know what the environment is (and what schema language you are using).
I think you need to look into the XMLReader class. More specifically,
XMLReader::setSchema.
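As a sketch of how that could look: the tiny inline schema and document below are stand-ins for your real files; the point is that validation happens node by node while reading, so the whole document never has to fit in memory.

```php
<?php
// Streaming XSD validation with XMLReader::setSchema.
$xsdFile = tempnam(sys_get_temp_dir(), 'xsd');
file_put_contents($xsdFile,
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    . '<xs:element name="root" type="xs:string"/>'
    . '</xs:schema>');

$xmlFile = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($xmlFile, '<root>hello</root>');

libxml_use_internal_errors(true);       // collect errors instead of warnings
$reader = new XMLReader();
$reader->open($xmlFile);
$reader->setSchema($xsdFile);           // attach the XSD; validates while streaming

$valid = true;
while ($reader->read()) {
    if (!$reader->isValid()) {          // validity of the document so far
        $valid = false;
        break;
    }
}
$reader->close();
```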
Think about what you're saying. You want to do operations on data that is not in memory. That doesn't make sense at all... it will eventually have to be in memory if you want to reference it from operations.
If you don't want to load the data in memory all at once, you could do a divide and conquer approach. If the file is incredibly large, you could run a map reduce job in multiple processes, but this would not decrease the amount of memory used.
If all you want to do is check whether the XML structure is valid, you can use PHP's XML Parser. It will not validate the document against a DTD, which is what the documentation means when it says it will not validate.
All of these error codes can be returned in the event the XML structure is found to be invalid while parsing it.
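A minimal well-formedness checker along those lines might look like this (`xml_is_well_formed` is a made-up helper name). The file is read in small chunks, so it never has to fit in memory; note this checks structure only, not a DTD or XSD.

```php
<?php
// Pure well-formedness check with the expat-based XML Parser extension.
function xml_is_well_formed(string $file, int $chunkSize = 4096): bool
{
    $parser = xml_parser_create();
    $fp = fopen($file, 'rb');
    $ok = true;
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        // the third argument tells the parser whether this is the last chunk
        if (!xml_parse($parser, $chunk, feof($fp))) {
            // xml_get_error_code($parser) / xml_error_string() say why
            $ok = false;
            break;
        }
    }
    fclose($fp);
    xml_parser_free($parser);
    return $ok;
}
```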
I'm trying to figure out how to delete an element (and its children) from an XML file that is very large, in PHP (latest version).
I know I can use DOM and SimpleXML, but that would require the document to be loaded into memory.
I am looking at the XML writer/reader/parser functions and googling, but there seems to be nothing on the subject (all answers recommend using DOM or SimpleXML). That cannot be right; am I missing something?
The closest thing I've found is this (C#):
You can use an XmlReader to sequentially read your xml (ReadOuterXml might be useful in your case to read a whole node at a time). Then use an XmlWriter to write out all the nodes you want to keep.
( Deleting nodes from large XML files )
Really? Is that the approach? I have to copy the entire huge file?
Is there really no other way?
One approach
As suggested,
I could read the data using PHP's XMLReader or parser, possibly buffer it, and write/append it back to a new file.
But is this approach really practical?
I have experience with splitting huge XML files into smaller pieces, basically using the suggested method, and it took a very long time for the process to finish.
My dataset isn’t currently big enough to give me an idea on how this would work out. I could only assume that the results will be the same (a very slow process).
Does anybody have experience of applying this in practice?
There are a couple ways to process large documents incrementally, so that you do not need to load the entire structure into memory at once. In either case, yes, you will need to write back out the elements that you wish to keep and omit those you want to remove.
PHP has an XMLReader implementation of a pull parser. An explanation:
A pull parser creates an iterator that sequentially visits the various
elements, attributes, and data in an XML document. Code which uses
this iterator can test the current item (to tell, for example, whether
it is a start or end element, or text), and inspect its attributes
(local name, namespace, values of XML attributes, value of text,
etc.), and can also move the iterator to the next item. The code can
thus extract information from the document as it traverses it.
Or you could use the SAX XML Parser. Explanation:
Simple API for XML (SAX) is a lexical, event-driven interface in which
a document is read serially and its contents are reported as callbacks
to various methods on a handler object of the user's design. SAX is
fast and efficient to implement, but difficult to use for extracting
information at random from the XML, since it tends to burden the
application author with keeping track of what part of the document is
being processed.
A lot of people prefer the pull method, but either one meets your requirement. Keep in mind that large is relative. If the document fits in memory, then it will almost always be easier to use the DOM. But for really, really large documents that simply might not be an option.
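A sketch of the read-and-rewrite approach with the pull parser: XMLReader streams the source, XMLWriter writes a copy, and the subtree to delete is simply never written. The `<books>`/`<book>` structure and the `id` attribute below are illustrative, not from the original question.

```php
<?php
// Copy an XML file while dropping one element (and its children).
$src = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($src,
    '<books><book id="1"/><book id="2"/><book id="3"/></books>');
$dst = tempnam(sys_get_temp_dir(), 'xml');

$reader = new XMLReader();
$reader->open($src);
$writer = new XMLWriter();
$writer->openUri($dst);
$writer->startDocument('1.0');
$writer->startElement('books');         // root element of the copy

$reader->read();                        // cursor on <books>
$found = $reader->read();               // cursor on the first child
while ($found) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'book'
        && $reader->getAttribute('id') !== '2') {
        // keep this record: copy its outer XML verbatim
        $writer->writeRaw($reader->readOuterXml());
    }
    $found = $reader->next();           // next sibling, skipping the subtree
}
$writer->endElement();
$writer->endDocument();
$writer->flush();
$reader->close();
```

The record with `id="2"` never reaches the writer, so the copy is the original minus that subtree.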
I am trying to avoid XMLReader for an app I'm building that has a huge XML file.
SimpleXML is easy to write, and I was wondering if there is any way to handle it successfully (memory and performance issues) on a quite busy server.
What I will do is echo some data from that XML, mainly from a search form.
Ok, if you really want to do this without XMLReader, here's what you could do.
Use fopen to open the file and read N bytes of it.
Fix the ending: (that's the tough part, but it's perfectly doable)
You do it by closing anything left unclosed and also, if needed, backtracking if you happen to be in the middle of some text.
When that XML chunk is finally valid you can parse it with SimpleXML.
Process that chunk or save it to its own separate XML file,
and create another chunk... until you have all of them.
Obviously if your XML is complex this might get a little painful.
Summary:
By creating your own custom/dirt-cheap XML parser/fixer you can split a huge XML file into multiple smaller files.
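A heavily hedged sketch of that chunk-and-fix idea, for the simple (and common) case where the file is one root element full of repeated records. Tag names are matched literally, so records with attributes, nesting, or CDATA need a smarter scanner; treat this as the "dirt-cheap" parser the summary mentions, not production code.

```php
<?php
// Split a file of repeated <tag>...</tag> records into SimpleXML chunks.
function xml_chunks(string $file, string $tag, int $perChunk = 2, int $bytes = 4096): array
{
    $fp = fopen($file, 'rb');
    $buf = '';
    $open = "<$tag>";
    $close = "</$tag>";
    $records = [];
    $chunks = [];
    while (!feof($fp)) {
        $buf .= fread($fp, $bytes);
        // extract every complete <tag>...</tag> currently in the buffer
        while (($s = strpos($buf, $open)) !== false
            && ($e = strpos($buf, $close, $s)) !== false) {
            $records[] = substr($buf, $s, $e + strlen($close) - $s);
            $buf = substr($buf, $e + strlen($close));
            if (count($records) === $perChunk) {
                // the fixed-up chunk is now valid XML: wrap and parse it
                $chunks[] = simplexml_load_string(
                    '<root>' . implode('', $records) . '</root>');
                $records = [];
            }
        }
    }
    if ($records) {                      // flush the final partial chunk
        $chunks[] = simplexml_load_string(
            '<root>' . implode('', $records) . '</root>');
    }
    fclose($fp);
    return $chunks;
}
```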
If your file is mostly a lot of similar nodes, like a big list of books where the number of books is large but the book record itself is small, you could use a variation of smaura's answer by using XMLReader to walk through each node, then convert the node to an XML string and pass it to SimpleXML. That way, you're using a streaming solution for the big list, but once you get each record you get the benefits of easily accessing the record with SimpleXML.
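A sketch of that hybrid: XMLReader walks the big list of records, and each small record is handed to SimpleXML as a string. The `<books>`/`<book>`/`<title>` names are illustrative.

```php
<?php
// Stream a big list with XMLReader, parse each record with SimpleXML.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file,
    '<books><book><title>A</title></book><book><title>B</title></book></books>');

$reader = new XMLReader();
$reader->open($file);

$titles = [];
while ($reader->read() && $reader->name !== 'book');  // find the first record
while ($reader->name === 'book') {
    // readOuterXml() returns just this record; SimpleXML makes it easy to use
    $book = simplexml_load_string($reader->readOuterXml());
    $titles[] = (string) $book->title;
    $reader->next('book');              // hop to the next sibling <book>
}
$reader->close();
```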
I have a large XML file with 22000 records that I have to import in my DB.
I am looking at how to parse the XML with paging, meaning:
parse.php?start=0; //this script gets the first 0-500 records of the file
parse.php?start=500 //this script gets records 500-1000 of the file
This way I can bypass memory problems.
My problem is how to point at record 500 when loading the XML file.
My code is simple
$data = simplexml_load_file($xmlFile);
foreach ($data->product as $product) {
    foreach ($product->children() as $section) {
        addToDB($section);
    }
}
The code above works fine for 1000-2000 records, but I want to modify it as described so it works with large XMLs.
SimpleXML is a DOM parser which means that it must load the whole document into memory to be able to build an in-memory representation of the whole XML dataset. Chunking the data does not work with this type of parser.
To load XML datasets that large you must switch to so-called pull parsers, such as XMLReader or the very low-level XML Parser extension. Pull parsers work by traversing the XML document element by element and allow you, the developer, to react to the currently parsed element. That reduces the memory footprint because only small fragments of the data have to be loaded into memory at any one time. Using pull parsers is a little uncommon and not as intuitive as the familiar DOM parsers (DOM and SimpleXML).
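A pull-parser rewrite of the question's loop could look like this: XMLReader visits one `<product>` at a time, so memory stays flat no matter how many records the file holds. `addToDB()` is the question's own function, stubbed here; the `expand()`/`importNode()` step is a common recipe for getting a SimpleXMLElement out of the current XMLReader node.

```php
<?php
// Import <product> records one at a time with XMLReader.
$xmlFile = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($xmlFile,
    '<products>'
    . '<product><name>A</name><price>1</price></product>'
    . '<product><name>B</name><price>2</price></product>'
    . '</products>');

$imported = [];
function addToDB($section)              // stand-in for the real DB insert
{
    global $imported;
    $imported[] = $section->getName() . '=' . (string) $section;
}

$reader = new XMLReader();
$reader->open($xmlFile);

while ($reader->read() && $reader->name !== 'product'); // first record
while ($reader->name === 'product') {
    // convert just this node to DOM, then wrap it in SimpleXML
    $dom = new DOMDocument();
    $dom->appendChild($dom->importNode($reader->expand(), true));
    $product = simplexml_import_dom($dom->documentElement);
    foreach ($product->children() as $section) {
        addToDB($section);
    }
    $reader->next('product');           // next record, skipping the subtree
}
$reader->close();
```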
That's not possible.
You should use XMLReader to import large files as described in my blog post.
A very fast way is
$data = preg_split('/(<|>)/m', $xmlFile);
After that, only a single loop is needed.
I have a large XML file (600mb+) and am developing a PHP application which needs to query this file.
My initial approach was to extract all the data from the file and insert it into a MySQL database, then query it that way. The only issue was that it was still slow, and the XML data gets updated regularly, meaning I need to download, parse and insert the data from the XML file into the database every time the XML file is updated.
Is it actually possible to query a 600mb file? (for example, searching for records where TITLE="something here"?) Is it possible to get it to do this in a reasonable amount of time?
Ideally I would like to do this in PHP, though I could also use JavaScript.
Any help and suggestions appreciated :)
Constructing an XML DOM for a 600+ MB document is definitely a way to fail. What you need is a SAX-based API. SAX, though, does not usually allow XPath to be used, but you can emulate it with imperative code.
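For instance, the query "records where TITLE='something here'" could be emulated imperatively by walking the stream record by record and keeping only the matches. The `<records>`/`<record>`/`<TITLE>` names below are assumed for illustration:

```php
<?php
// Imperative stand-in for an XPath query over a streamed document.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file,
    '<records>'
    . '<record><TITLE>foo</TITLE></record>'
    . '<record><TITLE>something here</TITLE></record>'
    . '</records>');

$reader = new XMLReader();
$reader->open($file);

$matches = [];
while ($reader->read() && $reader->name !== 'record'); // first record
while ($reader->name === 'record') {
    $record = simplexml_load_string($reader->readOuterXml());
    if ((string) $record->TITLE === 'something here') { // the "XPath" test
        $matches[] = $record;
    }
    $reader->next('record');            // next sibling record
}
$reader->close();
```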
As for the file being updated, is it possible to somehow retrieve only the differences? That would massively speed up subsequent processing.