I was wondering if a function capable of converting an associative array to an XML document exists in PHP (or some widely available PHP library).
I've searched quite a lot and could only find functions that do not output valid XML. I believe the array I'm testing them on is correctly constructed, since json_encode turns it into a valid JSON document. However, it is rather large and nested four levels deep, which might explain why the functions I've tried so far fail.
Ultimately, I will write the code to generate the XML myself but surely there must be a faster way of doing this.
I realize I am a Johnny-Come-Lately here, but I was working with the VERY same problem -- and the tutorials I found out there would almost (but not quite upon unit testing) cover it.
After much frustration and research, here is what I came up with:
XML To Assoc. Array:
From http://www.php.net/manual/en/simplexml.examples-basic.php
json_decode( json_encode( simplexml_load_string( $string ) ), TRUE );
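To make the one-liner above concrete, here is a minimal sketch of the round trip. The sample document and its element names are made up for illustration; only simplexml_load_string, json_encode and json_decode come from the answer.

```php
<?php
// Hypothetical sample document; the element names are assumptions for illustration.
$string = '<nodes><node>text</node><node><field>hello</field><field>world</field></node></nodes>';

// SimpleXML -> JSON -> associative array, exactly as in the one-liner above.
$array = json_decode(json_encode(simplexml_load_string($string)), true);

print_r($array);
```

Note that the name of the root element (nodes here) is not part of the resulting array; only its children are.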
Assoc. Array to XML
notes:
XML attributes are not handled
Will also handle nested arrays with numeric indices (numeric indices are not valid XML element names!)
From http://www.devexp.eu/2009/04/11/php-domdocument-convert-array-to-xml/
/// Converts an array to XML
/// - http://www.devexp.eu/2009/04/11/php-domdocument-convert-array-to-xml/
/// @param <array> $array The associative array you want to convert; nested numeric indices are OK!
function getXml( array $array ) {
$array2XmlConverter = new XmlDomConstructor('1.0', 'utf-8');
$array2XmlConverter->xmlStandalone = TRUE;
$array2XmlConverter->formatOutput = TRUE;
try {
$array2XmlConverter->fromMixed( $array );
$array2XmlConverter->normalizeDocument ();
$xml = $array2XmlConverter->saveXML();
// echo "\n\n-----vvv start returned xml vvv-----\n";
// print_r( $xml );
// echo "\n------^^^ end returned xml ^^^----\n"
return $xml;
}
catch( Exception $ex ) {
// echo "\n\n-----vvv Rut-roh Raggy! vvv-----\n";
// print_r( $ex->getCode() ); echo "\n";
// print_r( $ex->getMessage() );
// var_dump( $ex );
// echo "\n------^^^ end Rut-roh Raggy! ^^^----\n"
return $ex;
}
}
... and here is the class to use for the $array2XmlConverter object:
/**
* Extends the DOMDocument to implement personal (utility) methods.
* - From: http://www.devexp.eu/2009/04/11/php-domdocument-convert-array-to-xml/
* - `parent::` See http://www.php.net/manual/en/class.domdocument.php
*
* @throws DOMException http://www.php.net/manual/en/class.domexception.php
*
* @author Toni Van de Voorde
*/
class XmlDomConstructor extends DOMDocument {
/**
* Constructs elements and texts from an array or string.
* The array can contain an element's name in the index part
* and an element's text in the value part.
*
* It can also create XML with the same element tagName on the same
* level.
*
* ex:
\verbatim
<nodes>
<node>text</node>
<node>
<field>hello</field>
<field>world</field>
</node>
</nodes>
\endverbatim
*
*
* Array should then look like:
\verbatim
array(
"nodes" => array(
"node" => array(
0 => "text",
1 => array(
"field" => array (
0 => "hello",
1 => "world",
),
),
),
),
);
\endverbatim
*
* @param mixed $mixed An array or string.
*
* @param DOMElement[optional] $domElement The element
* to which the constructed nodes will be appended.
*
*/
public function fromMixed($mixed, DOMElement $domElement = null) {
$domElement = is_null($domElement) ? $this : $domElement;
if (is_array($mixed)) {
foreach( $mixed as $index => $mixedElement ) {
if ( is_int($index) ) {
if ( $index == 0 ) {
$node = $domElement;
}
else {
$node = $this->createElement($domElement->tagName);
$domElement->parentNode->appendChild($node);
}
}
else {
$node = $this->createElement($index);
$domElement->appendChild($node);
}
$this->fromMixed($mixedElement, $node);
}
}
else {
$domElement->appendChild($this->createTextNode($mixed));
}
}
} // end of class
No. At least, there is no such built-in function. It's not a problem at all to write it yourself.
surely there must be a faster way of doing this
How would you represent attributes in an array? I can only assume that keys are tags and values are those tags' content.
A basic PHP array -> JSON conversion works just fine because those structures are... well... almost the same.
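One possible answer to the attribute question is to pick a convention and stick to it. The sketch below assumes (this is not a PHP standard) that a '@attributes' key holds attributes, a '@value' key holds text content, and numeric keys become 'item' elements. arrayToDom() is a hypothetical helper name; only the DOMDocument calls are real API.

```php
<?php
/**
 * Appends $data to $parent.
 * Assumed convention: string keys are tag names, '@attributes' holds
 * attributes, '@value' holds text, numeric keys become <item> elements.
 */
function arrayToDom(DOMDocument $doc, DOMNode $parent, $data): void
{
    if (!is_array($data)) {
        // Scalars become text nodes; DOM takes care of escaping.
        $parent->appendChild($doc->createTextNode((string) $data));
        return;
    }
    foreach ($data as $key => $value) {
        if ($key === '@attributes') {
            foreach ($value as $attrName => $attrValue) {
                $parent->setAttribute($attrName, (string) $attrValue);
            }
        } elseif ($key === '@value') {
            $parent->appendChild($doc->createTextNode((string) $value));
        } else {
            $child = $doc->createElement(is_int($key) ? 'item' : $key);
            $parent->appendChild($child);
            arrayToDom($doc, $child, $value);
        }
    }
}

$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->createElement('book');
$doc->appendChild($root);

arrayToDom($doc, $root, [
    '@attributes' => ['isbn' => '978-0596100087'],
    'title'       => 'XSLT 1.0 Pocket Reference',
    'tags'        => ['xml', 'xslt'],
]);

$xml = $doc->saveXML();
echo $xml;
```

JSON needs no such convention because it has no notion of attributes, which is why json_encode can map arrays directly.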
Call
// $data = array(...);
$dataTransformator = new DataTransformator();
$domDocument = $dataTransformator->data2domDocument($data);
$xml = $domDocument->saveXML();
DataTransformator
class DataTransformator {
/**
* Converts the $data to a \DOMDocument.
* @param array $data
* @param string $rootElementName
* @param string $defaultElementName
* @see MyNamespace\Dom\DataTransformator#data2domNode(...)
* @return \DOMDocument
*/
public function data2domDocument(array $data, $rootElementName = 'data', $defaultElementName = 'item') {
return $this->data2domNode($data, $rootElementName, null, $defaultElementName);
}
/**
* Converts the $data to a \DOMNode.
* If the $elementContent is a string,
* a DOMNode with a nested shallow DOMElement
* will be (created if the argument $parentNode is null and) returned.
* If the $elementContent is an array,
* the function will be applied to each of its elements recursively and
* a DOMNode with nested DOMElements
* will be (created if the argument $parentNode is null and) returned.
* The end result is always a DOMDocument object.
* The reason is that a \DOMElement object
* "is read only. It may be appended to a document,
* but additional nodes may not be appended to this node
* until the node is associated with a document."
* (See {@link http://php.net/manual/en/domelement.construct.php here}.)
*
* @param string|mixed $elementName Used as element tagname. If it's not a string, $defaultElementName is used instead.
* @param string|array $elementContent
* @param \DOMDocument|\DOMElement|null $parentNode The parent node is
* either a \DOMDocument (for calls from outside of the method)
* or a \DOMElement or NULL (for calls from inside).
* Once again: for calls from outside of the method, the argument MUST be either a \DOMDocument object or NULL.
* @param string $defaultElementName If the key of the array element is a string, it determines the DOM element name / tagname.
* For numeric indexes the $defaultElementName is used.
* @return \DOMDocument
*/
protected function data2domNode($elementContent, $elementName, \DOMNode $parentNode = null, $defaultElementName = 'item') {
$parentNode = is_null($parentNode) ? new \DOMDocument('1.0', 'utf-8') : $parentNode;
$name = is_string($elementName) ? $elementName : $defaultElementName;
if (!is_array($elementContent)) {
$content = htmlspecialchars($elementContent);
$element = new \DOMElement($name, $content);
$parentNode->appendChild($element);
} else {
$element = new \DOMElement($name);
$parentNode->appendChild($element);
foreach ($elementContent as $key => $value) {
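// pass $defaultElementName along so a custom default also applies to nested numeric keys
$elementChild = $this->data2domNode($value, $key, $element, $defaultElementName);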
$parentNode->appendChild($elementChild);
}
}
return $parentNode;
}
}
PHP's DOMDocument objects are probably what you are looking for. Here is a link to an example use of this class to convert a multi-dimensional array into an xml file - http://www.php.net/manual/en/book.dom.php#78941
function combArrToXML($arrC=array(), $root="root", $element="element"){
$doc = new DOMDocument();
$doc->formatOutput = true;
$r = $doc->createElement( $root );
$doc->appendChild( $r );
$b = $doc->createElement( $element );
foreach( $arrC as $key => $val)
{
$$key = $doc->createElement( $key );
$$key->appendChild(
$doc->createTextNode( $val )
);
$b->appendChild( $$key );
$r->appendChild( $b );
}
return $doc->saveXML();
}
Example:
$b=array("testa"=>"testb", "testc"=>"testd");
combArrToXML($b, "root", "element");
Output:
<?xml version="1.0"?>
<root>
<element>
<testa>testb</testa>
<testc>testd</testc>
</element>
</root>
surely there must be a faster way of doing this
If you've got PEAR installed, there is. Take a look at XML_Serializer. It's beta, so you'll have to use
pear install XML_Serializer-beta
to install
I needed a solution which is able to convert arrays with non-associative subarrays and content which needs to be escaped with CDATA (<>&). Since I could not find any appropriate solution, I implemented my own based on SimpleXML which should be quite fast.
https://github.com/traeger/SimplestXML (this solution supports an (Associative) Array => XML and XML => (Associative) Array conversion without attribute support). I hope this helps someone.
Symfony converts nested YAML and PHP array translation files to a dot notation, like this: modules.module.title.
I'm writing some code that exports YAML translation files to a database, and I need to flatten the parsed files to a dot notation.
Does anyone know which function Symfony uses to flatten nested arrays to dot notation?
I cannot find it anywhere in the source code.
It's the flatten() method in Symfony\Component\Translation\Loader\ArrayLoader:
<?php
/**
* Flattens an nested array of translations.
*
* The scheme used is:
* 'key' => array('key2' => array('key3' => 'value'))
* Becomes:
* 'key.key2.key3' => 'value'
*
* This function takes an array by reference and will modify it
*
* @param array &$messages The array that will be flattened
* @param array $subnode Current subnode being parsed, used internally for recursive calls
* @param string $path Current path being parsed, used internally for recursive calls
*/
private function flatten(array &$messages, array $subnode = null, $path = null)
{
if (null === $subnode) {
$subnode = &$messages;
}
foreach ($subnode as $key => $value) {
if (is_array($value)) {
$nodePath = $path ? $path.'.'.$key : $key;
$this->flatten($messages, $value, $nodePath);
if (null === $path) {
unset($messages[$key]);
}
} elseif (null !== $path) {
$messages[$path.'.'.$key] = $value;
}
}
}
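Since that method is private, it cannot be called from your own export code. A minimal standalone sketch of the same scheme follows. flattenToDotNotation() is a hypothetical helper name, and unlike the Symfony method it returns a new array instead of modifying its argument by reference.

```php
<?php
/**
 * Returns a flattened copy of $messages using dot-separated keys.
 * Standalone sketch; not part of Symfony.
 */
function flattenToDotNotation(array $messages, string $path = ''): array
{
    $flat = [];
    foreach ($messages as $key => $value) {
        // Build the path for this key, e.g. "modules.module.title".
        $nodePath = $path === '' ? (string) $key : $path . '.' . $key;
        if (is_array($value)) {
            // Recurse into nested arrays and merge their flattened entries.
            $flat = array_replace($flat, flattenToDotNotation($value, $nodePath));
        } else {
            $flat[$nodePath] = $value;
        }
    }
    return $flat;
}

$flat = flattenToDotNotation([
    'modules' => [
        'module' => ['title' => 'Hello', 'description' => 'World'],
    ],
    'save' => 'Save',
]);

print_r($flat);
```

Each key of the result can then be stored as one row in the database.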
I don't know how this worked in previous Symfony versions, but from Symfony 4.2 onwards translations are returned already flattened.
Example controller which returns the messages catalogue translations. In my case I used this response to feed the i18next js library.
<?php
declare(strict_types=1);
namespace Conferences\Controller;
use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\HttpKernel\Exception\ServiceUnavailableHttpException;
use Symfony\Component\Translation\TranslatorBagInterface;
use Symfony\Contracts\Translation\TranslatorInterface;
final class TranslationsController
{
public function __invoke(TranslatorInterface $translator): JsonResponse
{
if (!$translator instanceof TranslatorBagInterface) {
throw new ServiceUnavailableHttpException();
}
return new JsonResponse($translator->getCatalogue()->all()['messages']);
}
}
Route definition:
translations:
path: /{_locale}/translations
controller: App\Controller\TranslationsController
requirements: {_locale: pl|en}
In my Symfony 3 web app I'm serializing some DB rows into Json as follows:
$doc = $this->get ( 'doctrine' );
$repo = $doc->getRepository ( 'AppBundle:Customer' );
$result = $repo->createQueryBuilder ( 'c' )->setMaxResults(25)->getQuery ()->getResult ();
$encoder = new JsonEncoder ();
$normalizer = new GetSetMethodNormalizer ();
$serializer = new Serializer ( array (
new \AppBundle\DateTimeNormalizer(), $normalizer
), array (
$encoder
) );
$json = $serializer->serialize ( $result, 'json' );
This outputs the desired data, e.g:
{companyname:"Microsoft"}
In order to (at least initially) maintain compatibility with a legacy system, I'd like all the Json names to be in uppercase, e.g.
{COMPANYNAME:"Microsoft"}
Is the best way to tackle this by approaching from:
The Encoder
The Normalizer(s)
The Serializer
Some other way?
Please briefly describe the suggested approach
You can implement a custom name converter, i.e. a class that implements the NameConverterInterface, and pass it as the second argument to the GetSetMethodNormalizer. For example:
<?php
namespace AppBundle;
use Symfony\Component\Serializer\NameConverter\NameConverterInterface;
class ToUppercaseNameConverter implements NameConverterInterface
{
/**
* Converts a property name to its normalized value.
*
* @param string $propertyName
*
* @return string
*/
public function normalize($propertyName)
{
return strtoupper($propertyName);
}
/**
* Converts a property name to its denormalized value.
*
* @param string $propertyName
*
* @return string
*/
public function denormalize($propertyName)
{
}
}
?>
and use it as follows:
$uppercaseConverter = new ToUppercaseNameConverter();
$normalizer = new GetSetMethodNormalizer (null, $uppercaseConverter);
You can take a look at the documentation section "Converting Property Names when Serializing and Deserializing".
Hope this helps.
I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?
There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
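For comparison, the same streaming idea with XMLReader, the second API mentioned above, could look roughly like this. The inline sample XML and its element names are assumptions standing in for a large file; with a real file you would call $reader->open() with the file path instead of $reader->XML().

```php
<?php
// Tiny stand-in for a large file; the element names are hypothetical.
$xml = '<dmoz><listing id="1">a</listing><listing id="2">b</listing><other/></dmoz>';

$reader = new XMLReader();
$reader->XML($xml); // for a file on disk, use $reader->open($path) instead

$ids = [];
// read() advances one node at a time, so only the current node is held in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'listing') {
        $ids[] = $reader->getAttribute('id');
    }
}
$reader->close();

print_r($ids);
```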
This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children to the XML root element while streaming the file.
Tested on 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* @package default
* @author Dom Hastings
*/
class Chunk {
/**
* options
*
* @var array Contains all major options
* @access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* @var string The filename being read
* @access public
*/
public $file = '';
/**
* pointer
*
* @var integer The current position the file is being read from
* @access public
*/
public $pointer = 0;
/**
* handle
*
* @var resource The fopen() resource
* @access private
*/
private $handle = null;
/**
* reading
*
* @var boolean Whether the script is currently reading the file
* @access private
*/
private $reading = false;
/**
* readBuffer
*
* @var string Used to make sure start tags aren't missed
* @access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* @param string $file The filename to work with
* @param array $options The options with which to parse the file
* @author Dom Hastings
* @access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* @return void
* @author Dom Hastings
* @access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurrence of the XML element $this->options['element']
*
* @return string The XML string from $this->file
* @author Dom Hastings
* @access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we're found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data upto and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}
This is an old post, but it is the first result in a Google search, so I thought I'd post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml';
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();
I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library - libxml2. Large XMLs are a list of records typically. So you use XMLReader to iterate the records, load a single record into DOM and use DOM methods and Xpath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/@isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/@isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();
I've written a wrapper for XMLReader to (IMHO) make it easier to just get the bits you're after. The wrapper allows you to associate a set of paths of the data elements with a callback to be run when a path is found. The paths allow regular expressions as well as capture groups, which can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As can be seen from the example, you can get the data to be passed as a SimpleXMLElement, a DOMElement or the last one is a string. This will represent only the data which matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.
I tested the following code with 2 GB xml:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();
A very high-performance way is
preg_split('/(<|>)/m', $xmlString);
After that, only a single loop over the resulting pieces is needed.
I have an HTML string, and I want to remove from it all DIVs whose class is "toremove".
This is trivial to do on the client side with jQuery etc., but I want to do it on the server side with PHP.
A simple regular expression won't work, because divs may be nested...
You could use the DOM object and xPath to remove the DIVs.
/** UNTESTED **/
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
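// "//div" matches divs at any depth; "@class" selects the class attribute
$elements = $xpath->query("//div[@class='yourClasshere']");
foreach($elements as $e){
// a node has to be removed from its own parent, not from the document
$e->parentNode->removeChild($e);
}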
$doc->saveHTMLFile($file);
You can replace the load from file and save to file with load from and save to string if you prefer.
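A string-based variant might look like the sketch below. removeDivsByClass() is a hypothetical helper name. The XPath expression matches the class as a whole token, so "toremove" does not also hit "toremovelater", and the class name is assumed to contain no quote characters.

```php
<?php
/**
 * Removes every <div> whose class attribute contains $class as a whole token.
 * Hypothetical helper; assumes $class contains no quote characters.
 */
function removeDivsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    // Copy the result to an array first so removals cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        if ($div->parentNode !== null) {
            $div->parentNode->removeChild($div);
        }
    }

    return $doc->saveHTML();
}

$html = '<div class="a toremove">x<div class="toremove">nested</div></div>'
      . '<div class="toremovelater">keep</div>';
$result = removeDivsByClass($html, 'toremove');
echo $result;
```

Be aware that saveHTML() serializes the whole document, so the returned string is wrapped in the html and body elements that loadHTML() added.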
Here's a snippet of code I'm using to remove content from pages:
/**
* A method to remove unwanted parts of an HTML-page. Can remove elements by
* id, tag name and/or class names.
*
* @param string $html The HTML to manipulate
* @param array $toRemove An array of arrays, with the keys specifying
* what type of values the array holds. The following keys are used:
* 'elements' - An array of element ids to remove from the html
* 'tags' - An array of tag names to remove from the html
* 'classNames' - An array of class names. Each tag that contains one of the
* class names will be removed from the html.
*
* Also, note that descendants of the removed document will also be removed.
*
* @return string The manipulated HTML content
*
* @example removeHtmlParts($html, array (
* 'elements' => array ('headerSection', 'nav', 'footerSection'),
* 'tags' => array ('form'),
* 'classNames' => array ('promotion')
* ));
*/
--
public function removeHtmlParts ($html, array $toRemove = array())
{
$document = new \DOMDocument('1.0', 'UTF-8');
$document->encoding = 'UTF-8';
// Hack to force DOMDocument to load the HTML using UTF-8.
@$document->loadHTML('<?xml encoding="UTF-8">' . $html);
$partsToRemove = array ();
if(isset($toRemove['elements']))
{
$partsToRemove['elements'] = $toRemove['elements'];
}
if(isset($toRemove['tags']))
{
$partsToRemove['tags'] = $toRemove['tags'];
}
if(isset($toRemove['classNames']))
{
$partsToRemove['classNames'] = $toRemove['classNames'];
}
foreach ($partsToRemove as $type => $content)
{
if($type == 'elements')
{
foreach ($content as $elementId)
{
$element = $document->getElementById($elementId);
if($element)
{
$element->parentNode->removeChild($element);
}
}
}
elseif($type == 'tags')
{
foreach($content as $tagName)
{
$tags = $document->getElementsByTagName($tagName);
while($tags->length)
{
$tag = $tags->item(0);
if($tag)
{
$tag->parentNode->removeChild($tag);
}
}
}
}
elseif($type == 'classNames')
{
foreach ($content as $className)
{
$xpath = new \DOMXPath($document);
$xpathExpression = sprintf(
'//*[contains(@class,"%s")]',
$className
);
$domNodeList = $xpath->evaluate($xpathExpression);
for($i = 0; $i < $domNodeList->length; $i++)
{
$node = $domNodeList->item($i);
if($node && $node->parentNode)
{
$node->parentNode->removeChild($node);
}
}
}
}
}
return $document->saveHTML();
}
Note:
This code has not undergone proper unit testing and probably contains bugs in edge cases
This method should be refactored into a class, and the contents of the method split into separate methods to ease testing.
Based on the short answer of jebbench and the long answer of PatrikAkerstrand, I created a medium function that exactly solves my problem:
/**
* remove, from the given xhtml string, all divs with the given class.
*/
function remove_divs_with_class($xhtml, $class) {
$doc = new DOMDocument();
// Hack to force DOMDocument to load the HTML using UTF-8:
$doc->loadHTML('<?xml encoding="UTF-8">'.$xhtml);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(@class,'$class')]");
foreach ($elements as $element)
$element->parentNode->removeChild($element);
return $doc->saveHTML();
}
/* UNIT TEST */
if (basename(__FILE__)==basename($_SERVER['PHP_SELF'])) {
$xhtml = "<div class='near future'>near future</div><div>start</div><div class='future'>future research</div><div class='summary'>summary</div><div class='a future b'>far future</div>";
$xhtml2 = remove_divs_with_class($xhtml, "future");
print "<h2>before</h2>$xhtml<h2>after</h2>$xhtml2";
}
/* OUTPUT:
before
near future
start
future research
summary
far future
after
start
summary
*/
Never, ever try and use regex to parse XML/HTML. Instead use a parsing library. Apparently, one for PHP is http://sourceforge.net/projects/simplehtmldom/files/
I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?
There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
// release the file handle once the whole file has been parsed
fclose($fh);
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
This is very similar to the question "Best way to process large XML in PHP", but here a very good, upvoted answer addresses the specific problem of parsing the DMOZ catalogue.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children of the XML root element while streaming the file.
Tested on a 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* @package default
* @author Dom Hastings
*/
class Chunk {
/**
* options
*
* @var array Contains all major options
* @access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* @var string The filename being read
* @access public
*/
public $file = '';
/**
* pointer
*
* @var integer The current position the file is being read from
* @access public
*/
public $pointer = 0;
/**
* handle
*
* @var resource The fopen() resource
* @access private
*/
private $handle = null;
/**
* reading
*
* @var boolean Whether the script is currently reading the file
* @access private
*/
private $reading = false;
/**
* readBuffer
*
* @var string Used to make sure start tags aren't missed
* @access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* @param string $file The filename to work with
* @param array $options The options with which to parse the file
* @author Dom Hastings
* @access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* @return void
* @author Dom Hastings
* @access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurrence of the XML element $this->options['element']
*
* @return string The XML string from $this->file
* @author Dom Hastings
* @access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we've found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data up to and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}
This is an old post, but it is the first result in a Google search, so I thought I'd post another solution, based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml';
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();
I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
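For readers who have not used the SAX-style (expat) functions, here is a minimal sketch of the idea. The function name collectItemTexts, the <item> element and the block size are made up for illustration; only the xml_* calls are PHP's own. The file is fed to the parser block by block, so memory use does not grow with the file size.

```php
<?php
// Hypothetical helper: collects the text of every <item> element in a file.
function collectItemTexts(string $path, int $blockSize = 4096): array
{
    $items = [];
    $current = null; // text of the <item> being read, or null when outside one

    $parser = xml_parser_create('UTF-8');
    // Case folding is on by default, so the handlers see upper-case names.
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attribs) use (&$current) {
            if ($name === 'ITEM') {
                $current = '';
            }
        },
        function ($parser, $name) use (&$items, &$current) {
            if ($name === 'ITEM' && $current !== null) {
                $items[] = $current;
                $current = null;
            }
        }
    );
    // Text may arrive in several pieces, so append rather than assign.
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$current) {
            if ($current !== null) {
                $current .= $data;
            }
        }
    );

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (!feof($fh)) {
        $data = fread($fh, $blockSize);
        // The third argument tells the parser whether this is the last block.
        if (xml_parse($parser, $data, feof($fh)) === 0) {
            $error = xml_error_string(xml_get_error_code($parser));
            fclose($fh);
            throw new RuntimeException("XML error: $error");
        }
    }
    fclose($fh);
    return $items;
}
```

For a really large file you would probably process each item inside the end-tag handler instead of collecting everything in an array.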
This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
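A rough sketch of that chunking strategy, using simplexml rather than domxml. The function name readListingBatches, the <chunk> wrapper element and the default block size are my own choices for illustration. It assumes that the <listing> elements are not nested, that their start tags carry no attributes, and that the file is UTF-8.

```php
<?php
// Hypothetical helper: yields one SimpleXMLElement per batch of complete
// <listing> elements, wrapped in an artificial <chunk> root element.
function readListingBatches(string $path, string $tag = 'listing', int $blockSize = 1048576): Generator
{
    $open  = '<' . $tag . '>';
    $close = '</' . $tag . '>';

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $buffer = '';
    try {
        while (!feof($fh)) {
            $buffer .= fread($fh, $blockSize);

            $first = strpos($buffer, $open);   // first start tag in the buffer
            $last  = strrpos($buffer, $close); // last end tag in the buffer
            if ($first === false || $last === false || $last < $first) {
                continue; // no complete element buffered yet, read more
            }

            // Cut out everything from the first start tag to the last end tag
            // and keep the incomplete tail for the next round.
            $end      = $last + strlen($close);
            $complete = substr($buffer, $first, $end - $first);
            $buffer   = substr($buffer, $end);

            $xml = simplexml_load_string('<chunk>' . $complete . '</chunk>');
            if ($xml === false) {
                throw new RuntimeException('Chunk is not well-formed XML');
            }
            yield $xml;
        }
    } finally {
        fclose($fh);
    }
}
```

Each yielded batch can then be walked with an ordinary foreach over $batch->listing. Anything before the first start tag (such as the opening <dmoz> tag) is simply dropped.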
You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library - libxml2. Large XMLs are a list of records typically. So you use XMLReader to iterate the records, load a single record into DOM and use DOM methods and Xpath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/@isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/@isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();
I've written a wrapper for XMLReader to (IMHO) make it easier to get just the bits you're after. The wrapper lets you associate a set of paths to data elements with callbacks that are run when a path is found. The paths can be regular expressions, and their capture groups can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As the example shows, the data can be passed to the callback as a SimpleXMLElement, a DOMElement or, as in the last case, a string. It represents only the data that matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.
I tested the following code with a 2 GB XML file:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();
A very fast way is
preg_split('/(<|>)/m', $xmlString);
After that, a single loop over the resulting array is all that is needed.
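A minimal sketch of what that single loop could look like. The sample data and variable names are made up. It assumes the whole document fits in memory as $xmlString and that no < or > appears inside text, attribute values, comments or CDATA sections. Under that assumption the pieces alternate: even indexes hold text, odd indexes hold the inside of a tag. Entities in the text are not decoded.

```php
<?php
$xmlString = '<books><book>XSLT</book><book>XML</book></books>';

$parts = preg_split('/(<|>)/m', $xmlString);

$titles = [];
$inBook = false;
foreach ($parts as $i => $part) {
    if ($i % 2 === 1) {
        // odd index: the inside of a tag, e.g. "book" or "/book"
        if ($part === 'book') {
            $inBook = true;
        } elseif ($part === '/book') {
            $inBook = false;
        }
    } elseif ($inBook && $part !== '') {
        // even index: text between two tags
        $titles[] = $part;
    }
}

print_r($titles);
```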