I am trying to write a function that reads an existing XML file and creates a new one with all the data from the first one, but in a different encoding. As far as I understand it, SimpleXML saves the file in UTF-8 encoding. My original XML file is Windows-1257.
Code:
public static function toUTF8()
{
$remote_file = "data/test/import/test.xml";
$xml = simplexml_load_file($remote_file);
$xml->asXml('data/test/import/utf8/test.xml');
echo var_dump('done');
exit;
}
This way the encoding of the output file is still wrong. I wanted to try this:
$newXML = new SimpleXMLElement($xml);
But this takes only plain XML code as a parameter. How could I get the whole XML code from the object? Or how else could I create a new UTF-8 XML object and insert all the data from the old file?
I tried this out and saw problems importing the XML directly with SimpleXML. Despite the correct encoding declaration in the XML, it would output the wrong characters. So the alternative is to use a function like iconv which can do the conversion for you.
If you don't need to parse the XML, you can just do this directly:
<?php
$remote_file = "data/test/import/test.xml";
$new_file = "data/test/import/utf8/test.xml";
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("CP1257", "UTF-8", $baltic_xml);
file_put_contents($new_file, $unicode_xml);
If you need to do stuff with the XML, it gets a little more complicated because you have to update the character set in the XML declaration.
<?php
$remote_file = "data/test/import/test.xml";
$new_file = "data/test/import/utf8/test.xml";
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("CP1257", "UTF-8", $baltic_xml);
$unicode_xml = str_replace('encoding="CP1257"', 'encoding="UTF-8"', $unicode_xml);
$xml = new SimpleXMLElement($unicode_xml);
// do stuff with $xml
$xml->asXml($new_file);
I tested this out with the following file (saved as CP1257) and it worked fine:
<?xml version="1.0" encoding="CP1257"?>
<Root-Element>
<Test>Łų߯ĒČ</Test>
</Root-Element>
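The str_replace() call above only matches the literal string encoding="CP1257", so it would miss a declaration that names the same code page differently (for example windows-1257). A minimal sketch of a more tolerant rewrite follows; the helper name xmlToUtf8 is hypothetical, and it assumes the iconv extension knows the source encoding.

```php
<?php
// Hypothetical helper: convert an XML string to UTF-8 and rewrite its declaration.
function xmlToUtf8(string $xml, string $fromEncoding): string
{
    $converted = iconv($fromEncoding, 'UTF-8', $xml);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fromEncoding");
    }
    // Replace whatever encoding name the declaration carries with UTF-8,
    // keeping the original quote style. Only the first match is touched.
    $rewritten = preg_replace(
        '/^(\s*<\?xml[^>]*?\bencoding=)(["\'])[^"\']*\2/',
        '${1}${2}UTF-8${2}',
        $converted,
        1
    );
    return $rewritten ?? $converted;
}

// "\xD0" is the CP1257 byte for the letter S with caron.
$in  = "<?xml version=\"1.0\" encoding=\"windows-1257\"?>\n<Test>\xD0</Test>";
$out = xmlToUtf8($in, 'CP1257');
echo $out, PHP_EOL;
```

The result can then be passed to new SimpleXMLElement($out) as in the second snippet above.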
Unless I'm wrong, the SimpleXML extension will just use the same encoding all the way through. UTF-8 is the default if no encoding is given but, if the original document has encoding information such encoding will be used.
You can use DOMDocument as proxy:
$xml = simplexml_load_file(__DIR__ . '/test.xml');
$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->encoding = 'UTF-8';
$xml->asXml('as-utf-8.xml');
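The same idea can be checked without touching the file system. The sketch below is an illustration only: it uses an in-memory ISO-8859-1 string instead of the Windows-1257 file from the question, because ISO-8859-1 is built into libxml, whereas other code pages depend on how libxml was compiled.

```php
<?php
// "\xE9" is the ISO-8859-1 byte for e with acute accent.
$source = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root><t>\xE9</t></root>";

$doc = new DOMDocument();
$doc->loadXML($source);      // libxml decodes according to the declaration
$doc->encoding = 'UTF-8';    // change only the output encoding
$utf8Xml = $doc->saveXML();  // serialized with a UTF-8 declaration and UTF-8 bytes

echo $utf8Xml;
```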
Related
I am trying to export xml with CDATA tags. I use the following code:
$xml_product = $xml_products->addChild('product');
$xml_product->addChild('mychild', htmlentities("<![CDATA[" . $mytext . "]]>"));
The problem is that the < and > of the CDATA markers get escaped as &lt; and &gt;, like the following:
<mychild>&lt;![CDATA[My some long long long text]]&gt;</mychild>
but I need:
<mychild><![CDATA[My some long long long text]]></mychild>
If I use htmlentities() I get lots of errors like "tag raquo is not defined", even though there are no such tags in my text. Probably htmlentities() tries to parse my text inside CDATA and convert it, but I don't want that either.
Any ideas how to fix that? Thank you.
UPD_1 My function which saves xml to file:
public static function saveFormattedXmlFile($simpleXMLElement, $output_file) {
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML(urldecode($simpleXMLElement->asXML()));
$dom->save($output_file);
}
A short example of how to add a CDATA section; note how it switches to DOMDocument to add the section. The code builds up a <product> element, and a new <mychild> element is created in $xml_product. This $newNode is then converted to a DOMElement using dom_import_simplexml. The DOMDocument createCDATASection method then creates the CDATA section, which is appended to the node.
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><Products />');
$xml_product = $xml->addChild('product');
$newNode = $xml_product->addChild('mychild');
$mytext = "<html></html>";
$node = dom_import_simplexml($newNode);
$cdata = $node->ownerDocument->createCDATASection($mytext);
$node->appendChild($cdata);
echo $xml->asXML();
This example outputs...
<?xml version="1.0" encoding="UTF-8"?>
<Products><product><mychild><![CDATA[<html></html>]]></mychild></product></Products>
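If this is needed in several places, the steps above can be wrapped in a small function. This is only a sketch; the name addChildWithCData is made up for illustration and is not part of SimpleXML.

```php
<?php
// Hypothetical helper: add a child element whose content is a CDATA section.
function addChildWithCData(SimpleXMLElement $parent, string $name, string $text): SimpleXMLElement
{
    $child = $parent->addChild($name);
    // Switch to the DOM view of the same node to create the CDATA section.
    $node = dom_import_simplexml($child);
    $node->appendChild($node->ownerDocument->createCDATASection($text));
    return $child;
}

$xml = new SimpleXMLElement('<Products/>');
$product = $xml->addChild('product');
addChildWithCData($product, 'mychild', 'Fish & chips <b>today</b>');
$result = $xml->asXML();
echo $result;
```

Because the text goes into a CDATA section, neither htmlentities() nor any other manual escaping is needed.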
I'm using DOMDocument and SimpleXMLElement to create a formatted XML file. While this all works, the resulting file is saved as ASCII, not as UTF-8. I can't find an answer as to how to change that.
The XML is created as so:
$XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
$rootNode = new \SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
$rootNode->addAttribute('xmlns', $XMLNS);
$url = $rootNode->addChild('url');
$url->addChild('loc', "Somewhere over the rainbow");
//Turn it into an indented file needs a DOMDocument...
$dom = dom_import_simplexml($rootNode)->ownerDocument;
$dom->formatOutput = true;
$path = "C:\\temp";
// This saves an ASCII file
$dom->save($path.'/sitemap.xml');
The resulting XML looks like this (which is as it should be I think):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>Somewhere over the rainbow</loc>
</url>
</urlset>
Unfortunately the file is ASCII encoded and not UTF-8.
How do I fix this?
Edit: Don't use notepad++ to check encoding
I've got it to work now thanks to the accepted answer below. There's one note: I used Notepad++ to open the file and check the encoding. However, when I re-generated the file, Notepad++ would update its tab and for some reason indicate ANSI as the encoding. Closing and reopening the same file in Notepad++ would then indicate UTF-8 again. This caused me a load of confusion.
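Instead of trusting an editor's guess, the bytes can be checked directly. A minimal sketch, with a hypothetical helper name isValidUtf8; it relies on preg_match() failing on invalid UTF-8 when the u modifier is used. Note that a file containing only ASCII characters is also valid UTF-8, which is why tools may report either.

```php
<?php
// Hypothetical helper: true if the byte string is well-formed UTF-8.
function isValidUtf8(string $bytes): bool
{
    // With the "u" modifier, PCRE rejects subjects that are not valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

var_dump(isValidUtf8("Somewhere over the rainbow")); // pure ASCII
var_dump(isValidUtf8("caf\xC3\xA9"));                // e-acute encoded as UTF-8
var_dump(isValidUtf8("caf\xE9"));                    // e-acute encoded as ISO-8859-1
```

For a saved file, the same check could be run on file_get_contents() of the generated sitemap.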
I think there are a couple of things going on here. For one, you need:
$dom->encoding = 'utf-8';
But also, I think we should try creating the DOMDocument manually specifying the proper encoding. So:
<?php
$XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
$rootNode = new \SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
$rootNode->addAttribute('xmlns', $XMLNS);
$url = $rootNode->addChild('url');
$url->addChild('loc', "Somewhere over the rainbow");
// Turn it into an indented file needs a DOMDocument...
$domSxe = dom_import_simplexml($rootNode);
// Create a UTF-8 DOMDocument and import the element (not its owner document) into it.
$dom = new DOMDocument('1.0', 'UTF-8');
$domSxe = $dom->importNode($domSxe, true);
$domSxe = $dom->appendChild($domSxe);
$path = "C:\\temp";
$dom->formatOutput = true;
$dom->save($path.'/sitemap.xml');
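Also ensure that any elements or CDATA you're adding are actually UTF-8 (see utf8_encode(), which assumes ISO-8859-1 input and is deprecated as of PHP 8.2; iconv() and mb_convert_encoding() are alternatives).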
Using the example above, this works for me:
php > var_dump($utf8);
string(11) "ᙀȾᎵ⁸"
php > $XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
php > $rootNode = new \SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
php > $rootNode->addAttribute('xmlns', $XMLNS);
php > $url = $rootNode->addChild('url');
php > $url->addChild('loc', "Somewhere over the rainbow $utf8");
php > $domSxe = dom_import_simplexml($rootNode);
php > $domSxe->encoding = 'UTF-8';
php > $dom = new DOMDocument('1.0', 'UTF-8');
php > $domSxe = $dom->importNode($domSxe, true);
php > $domSxe = $dom->appendChild($domSxe);
php > $dom->save('./sitemap.xml');
$ cat ./sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>Somewhere over the rainbow ᙀȾᎵ⁸</loc></url></urlset>
Your data is probably not in UTF-8. You can convert it like so:
utf8_encode($yourData);
Or, maybe:
iconv('ISO-8859-1', 'UTF-8', $yourData)
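Converting data that is already UTF-8 a second time corrupts it, so it can help to convert only when needed. A rough sketch under the assumption that non-UTF-8 input is ISO-8859-1; the helper name ensureUtf8 and its default are hypothetical.

```php
<?php
// Hypothetical helper: return $data as UTF-8, converting only if it is not valid UTF-8 already.
function ensureUtf8(string $data, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (preg_match('//u', $data) === 1) {
        return $data; // already well-formed UTF-8, leave untouched
    }
    $converted = iconv($assumedEncoding, 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $assumedEncoding");
    }
    return $converted;
}

$fromLatin1 = ensureUtf8("caf\xE9");      // converted
$unchanged  = ensureUtf8("caf\xC3\xA9");  // returned as is
var_dump($fromLatin1 === $unchanged);
```

The check cannot tell which legacy encoding the input uses, so the assumed encoding still has to come from knowledge of the data source.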
I am using the PHP DOMDocument and at one point I am using createTextNode:
$copyrightNode = $doc->createTextNode('&copy;');
$copyrightContainer = $dom_output->createElement('copyright-statement');
$copyrightContainer->appendChild($copyrightNode);
In the XML that is generated some time later, I am getting:
<copyright-statement>&amp;copy;</copyright-statement>
And my goal is to have
<copyright-statement>&copy;</copyright-statement>
Any idea on how to do that?
Thank you in advance.
When PHP outputs an XML document, any characters that cannot be represented in the specified output encoding will be replaced with numeric entities (either decimal or hexadecimal, both are equivalent):
<?php
$dom = new DOMDocument;
$node = $dom->createElement('copyright-statement', '©');
$dom->appendChild($node);
$dom->encoding = 'UTF-8';
print $dom->saveXML(); // <copyright-statement>©</copyright-statement>
$dom->encoding = 'ASCII';
print $dom->saveXML(); // <copyright-statement>&#169;</copyright-statement>
The correct thing to do here is to use the createEntityReference method (e.g. createEntityReference("copy");), and then appendChild this entity.
Example:
<?php
$copyrightNode = $doc->createEntityReference("copy");
$copyrightContainer = $dom_output->createElement('copyright-statement');
$copyrightContainer->appendChild($copyrightNode);
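A self-contained version of the entity reference approach, as a sketch: it creates both nodes from the same DOMDocument (the variable names are illustrative) and serializes only the element. Keep in mind that copy is not one of XML's predefined entities, so a consumer of the file needs a DTD that declares it.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');

// Create the element and the entity reference from the same document.
$container = $dom->createElement('copyright-statement');
$container->appendChild($dom->createEntityReference('copy'));
$dom->appendChild($container);

// Serialize just the element, without the XML declaration.
$fragment = $dom->saveXML($container);
echo $fragment, PHP_EOL;
```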
To emit the literal text &copy; without escaping, you could (I believe) wrap it in a CDATA section:
$copyrightNode = $doc->createCDATASection("&copy;");
$copyrightContainer = $dom_output->createElement('copyright-statement');
$copyrightContainer->appendChild($copyrightNode);
I have an XML file that looks like the example on this site: http://msdn.microsoft.com/en-us/library/ee223815(v=sql.105).aspx
I am trying to parse the XML file using something like this:
$data = file_get_contents('http://mywebsite here');
$xml = new SimpleXMLElement($data);
$str = $xml->Author;
echo $str;
Unfortunately, this is not working, and I suspect it is due to the namespaces. I can dump the $xml using asXML() and it correctly shows the XML data.
I understand I need to insert namespaces somehow, but I'm not sure how. How do I parse this type of XML file?
All you need to do is register the namespace:
$sxe = new SimpleXMLElement($data);
$sxe->registerXPathNamespace("diffgr", "urn:schemas-microsoft-com:xml-diffgram-v1");
$data = $sxe->xpath("//diffgr:diffgram") ;
$data = $data[0];
echo "<pre>";
foreach($data->Results->RelevantResults as $result)
{
echo $result->Author , PHP_EOL ;
}
Output
Ms.Kim Abercrombie
Mr.GustavoAchong
Mr. Samuel N. Agcaoili
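To try this without the remote file, here is a sketch against a small inline document. The XML below is a trimmed-down stand-in invented for illustration, not the real MSDN response; only the diffgr namespace URI is taken from the answer above.

```php
<?php
// Made-up sample: a diffgram wrapper in the diffgr namespace around un-namespaced results.
$data = <<<XML
<Response xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
  <diffgr:diffgram>
    <Results>
      <RelevantResults><Author>Ms. Kim Abercrombie</Author></RelevantResults>
      <RelevantResults><Author>Mr. Gustavo Achong</Author></RelevantResults>
    </Results>
  </diffgr:diffgram>
</Response>
XML;

$sxe = new SimpleXMLElement($data);
// The prefix used in the XPath query must be registered, even if the document uses the same one.
$sxe->registerXPathNamespace('diffgr', 'urn:schemas-microsoft-com:xml-diffgram-v1');

$authors = [];
foreach ($sxe->xpath('//diffgr:diffgram/Results/RelevantResults/Author') as $author) {
    $authors[] = (string) $author;
}
print_r($authors);
```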
I was successfully using the following code to merge multiple large XML files into a new (larger) XML file. I found at least part of this on StackOverflow.
$docList = new DOMDocument();
$root = $docList->createElement('documents');
$docList->appendChild($root);
$doc = new DOMDocument();
foreach($xmlFilenames as $xmlfilename) {
$doc->load($xmlfilename);
$xmlString = $doc->saveXML($doc->documentElement);
$xpath = new DOMXPath($doc);
$query = self::getQuery(); // this is the name of the ROOT element
$nodelist = $xpath->evaluate($query, $doc->documentElement);
if( $nodelist->length > 0 ) {
$node = $docList->importNode($nodelist->item(0), true);
$xmldownload = $docList->createElement('document');
if (self::getShowFileName())
$xmldownload->setAttribute("filename", $xmlfilename);
$xmldownload->appendChild($node);
$root->appendChild($xmldownload);
}
}
$newXMLFile = self::getNewXMLFile();
$docList->save($newXMLFile);
I started running into OUT OF MEMORY issues when the number of files grew as did the size of them.
I found an article here which explained the issue and recommended using XMLWriter
So, now trying to use PHP XMLWriter to merge multiple large XML files together into a new (larger) XML file. Later, I will execute xpath against the new file.
Code:
$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->openUri('mynewFile.xml');
$xmlWriter->setIndent(true);
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('documents');
$doc = new DOMDocument();
foreach($xmlfilenames as $xmlfilename)
{
$fileContents = file_get_contents($xmlfilename);
$xmlWriter->writeElement('document',$fileContents);
}
$xmlWriter->endElement();
$xmlWriter->endDocument();
$xmlWriter->flush();
Well, the resultant (new) XML file is no longer correct since the markup of the merged files is escaped, i.e.
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;CONFIRMOWNX&gt;
&lt;Confirm&gt;
&lt;LglVeh id="GLE"&gt;
&lt;AddrLine1&gt;GLEACHER &amp; COMPANY&lt;/AddrLine1&gt;
&lt;AddrLine2&gt;DESCAP DIVISION&lt;/AddrLine2&gt;
Can anyone explain how to take the content from the XML file and write them properly to new file?
I'm burnt on this and I KNOW it'll be something simple I'm missing.
Thanks.
Robert
See, the problem is that XMLWriter::writeElement is intended to, well, write a complete XML element. That's why it automatically sanitizes (replaces & with &amp;, for example) the contents of what's been passed to it as the second param.
One possible solution is to use XMLWriter::writeRaw method instead, as it writes the contents as is - without any sanitizing. Obviously it doesn't validate its inputs, but in your case it does not seem to be a problem (as you're working with already checked source).
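A sketch of the writeRaw approach follows. It uses in-memory strings instead of files so it can be run as is, and it assumes each source is a well-formed document. The helper stripXmlDeclaration is hypothetical: a nested document must not carry its own XML declaration, so the sketch removes it before writing.

```php
<?php
// Hypothetical helper: drop a leading XML declaration so the document can be nested.
function stripXmlDeclaration(string $xml): string
{
    return preg_replace('/^\s*<\?xml[^>]*\?>\s*/', '', $xml, 1) ?? $xml;
}

// Stand-ins for file_get_contents($xmlfilename) on each input file.
$sources = [
    '<?xml version="1.0" encoding="UTF-8"?><a>GLEACHER &amp; COMPANY</a>',
    '<?xml version="1.0" encoding="UTF-8"?><b>DESCAP DIVISION</b>',
];

$writer = new XMLWriter();
$writer->openMemory(); // for real output, openUri('mynewFile.xml') would be used instead
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('documents');
foreach ($sources as $source) {
    $writer->startElement('document');
    // writeRaw() copies the markup as is, without escaping < > or &.
    $writer->writeRaw(stripXmlDeclaration($source));
    $writer->endElement();
}
$writer->endElement();
$writer->endDocument();
$merged = $writer->outputMemory();
echo $merged;
```

Since writeRaw() performs no validation, a single malformed input would make the whole merged file malformed.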
Hmm, not sure why it's converting it to HTML entities, but you can decode it like so:
htmlspecialchars_decode($data);
It converts special HTML entities back to characters.