I'm getting this error when running this code:
Fatal error: Uncaught exception 'DOMException' with message 'Invalid Character Error' in test.php:29 Stack trace: #0 test.php(29): DOMDocument->createElement('1OhmStable', 'a') #1 {main} thrown in test.php on line 29
The node names taken from the original XML file do contain invalid characters, but since I am stripping those characters out, the nodes should be created. What type of encoding do I need to apply to the original XML document? Do I need to decode the saveXML output?
function __cleanData($c)
{
return preg_replace("/[^A-Za-z0-9]/", "",$c);
}
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->load('test.xml');
$xml->formatOutput = true;
$append = array();
foreach ($xml->getElementsByTagName('product') as $product )
{
foreach($product->getElementsByTagName('name') as $name )
{
$append[] = $name;
}
foreach ($append as $a)
{
$nodeName = __cleanData($a->textContent);
$element = $xml->createElement(htmlentities($nodeName) , 'a');
}
$product->removeChild($xml->getElementsByTagName('details')->item(0));
$product->appendChild($element);
}
$result = $xml->saveXML();
$file = "data.xml";
file_put_contents($file,$result);
This is what the original XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/v1/xsl/xml_pretty_printer.xsl" type="text/xsl"?>
<products>
<product>
<modelNumber>M100</modelNumber>
<itemId>1553725</itemId>
<details>
<detail>
<name>1 Ohm Stable</name>
<value>600 x 1</value>
</detail>
</details>
</product>
</products>
The new document is supposed to look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/v1/xsl/xml_pretty_printer.xsl" type="text/xsl"?>
<products>
<product>
<modelNumber>M100</modelNumber>
<itemId>1553725</itemId>
<1 Ohm Stable>
</1 Ohm Stable>
</product>
</products>
Simply put, an element name cannot start with a number:
1OhmStable <-- rename this
_1OhmStable <-- this is fine
See also: php parse xml - error: StartTag: invalid element name
A nice article: http://www.xml.com/pub/a/2001/07/25/namingparts.html
A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.
You have not written where you get that error. In case it's after you cleaned the value, this is my guess:
preg_replace("/[^A-Za-z0-9]/", "",$c);
This replacement is not written for UTF-8 encoded strings (which are what DOMDocument uses). You can make it UTF-8 compatible by using the u modifier (PCRE_UTF8):
preg_replace("/[^A-Za-z0-9]/u", "",$c);
^
It's just a guess; I suggest you state more precisely in your question which part of your code triggers the error.
Even if __cleanData() removes every character other than the Latin letters a-z and digits, that does not guarantee that the result is a valid XML name. Your function can return strings that begin with a digit, but digits are illegal name start characters in XML; they can only appear after the first character of a name. Spaces are also forbidden in names, so that is another point where your expected XML output would fail.
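A minimal sketch of how the cleaning step could be extended so that it always yields a usable element name. The helper name toElementName() and the underscore prefix are assumptions made for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: turn arbitrary text into a usable XML element name.
function toElementName(string $text): string
{
    // Same stripping as __cleanData(): keep only ASCII letters and digits.
    $name = preg_replace('/[^A-Za-z0-9]/', '', $text);

    // A name must not be empty and must not start with a digit,
    // so prefix an underscore in those cases (an arbitrary choice).
    if ($name === '' || preg_match('/^[0-9]/', $name)) {
        $name = '_' . $name;
    }
    return $name;
}

$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('product'));
$root->appendChild($doc->createElement(toElementName('1 Ohm Stable'), '600 x 1'));

echo $doc->saveXML($root), "\n";
```

With this, createElement() receives _1OhmStable instead of 1OhmStable and no DOMException is thrown.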
Make sure your scripts and files use the same encoding. If it is UTF-8, make sure they are saved without a Byte Order Mark (BOM) at the very beginning of the file.
To do that, open your XML file in a text editor such as Notepad++ and convert it to "UTF-8 without BOM".
I had a similar error, but with a json file
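If re-saving the file in an editor is not an option, the BOM can also be dropped in PHP before parsing. This is only a sketch; the function name stripUtf8Bom() and the sample document are made up for illustration, and whether a BOM is really the cause in your case is a guess:

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present.
function stripUtf8Bom(string $data): string
{
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        return substr($data, 3);
    }
    return $data;
}

// Sample input with a BOM in front of the XML declaration.
$raw = "\xEF\xBB\xBF" . '<?xml version="1.0" encoding="UTF-8"?><products/>';

$doc = new DOMDocument();
$doc->loadXML(stripUtf8Bom($raw));
echo $doc->documentElement->nodeName, "\n";
```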
I am trying to create an XML object from a string with SimpleXML.
Example (http://ideone.com/L4ztum):
$str = "<aoc> САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
$movies = new SimpleXMLElement($str);
But it gives a warning:
PHP Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : PCDATA invalid Char value 2 in /home/nmo2E7/prog.php on line 5
and finally an Exception with the message String could not be parsed as XML.
If I remove two Unicode characters, it works (http://ideone.com/LaMvHN):
$str = "<aoc> САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
^
`-- two invisible characters have been removed here
How can I remove Unicode from string?
The problem is not Unicode as such, but two stray control bytes, \x02 and \x01. You can filter them out with str_replace:
$s = str_replace("\x01", "", $s);
$s = str_replace("\x02", "", $s);
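If other control characters might show up as well, a more general filter is possible. The following is a sketch that keeps only the code points the XML 1.0 specification allows in a document; the function name stripInvalidXmlChars() is an assumption, and the input is assumed to be valid UTF-8:

```php
<?php
// Hypothetical helper: drop every character outside the XML 1.0 "Char" range
// (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF).
function stripInvalidXmlChars(string $s): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    );
    if ($clean === null) {
        // With the u modifier, preg_replace() returns null on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$str    = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант</aoc>";
$movies = new SimpleXMLElement(stripInvalidXmlChars($str));
echo (string) $movies, "\n";
```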
The constructor of SimpleXMLElement needs its first parameter to be well-formed XML.
The string you pass
$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
is not well-formed XML because it contains characters outside the character range allowed in XML, namely:
Unicode Character 'START OF TEXT' (U+0002) at binary offset 24
Unicode Character 'START OF HEADING' (U+0001) at binary offset 25
So instead of using SimpleXMLElement to create it from a hand-mangled XML-string (which is error-prone), use it to create the XML you're looking for. Let's give an example.
In the following example I assume you've got the text you want to create the XML element of. This example creates an XML element similar to the one in your question with the difference that the exact same string is passed in as text-content for the document element ("<aoc>").
$text = 'САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12';
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><aoc/>');
$xml->{0} = $text; // set the document-element's text-content to $text
When done this way, SimpleXML will filter any invalid control-characters for you and the SimpleXMLElement remains stable:
$str = $xml->asXML();
$movies = new SimpleXMLElement($str);
print_r($movies);
/* output:
SimpleXMLElement Object
(
[0] => САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12
)
*/
So to finally answer your question:
How can I remove Unicode from string?
You don't want to remove Unicode from the string. The SimpleXML library accepts Unicode strings only (in the UTF-8 encoding). What you want is to remove the Unicode characters that are invalid in XML. The SimpleXML library does that for you when you set node values, as it has been designed for.
However, if you try to load non-well-formed XML via the constructor or the constructor functions (simplexml_load_string etc.), it will fail and give you the (important) error.
I hope this clarifies the situation for you and answers your question.
Sup peeps. I got an issue here. I receive this data and just want to strip the <SOAP-ENV elements with their respective closing elements.
This is the header and body start part.
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<SOAP-ENV:Header></SOAP-ENV:Header>
<SOAP-ENV:Body>
<VisionDataExchange>
Now I run my regular expression on $xml, the variable containing the entire XML data:
$xml = preg_replace("/<\\/?SOAP(.|\\s)*?>/",'',$xml);
Now my result is this. It actually stripped the opening tags but none of the closing tags. What am I missing here?
<?xml version="1.0" encoding="UTF-8"?>
</SOAP-ENV:Header>
<VisionDataExchange>
I suggest just matching everything inside a tag, not any character or whitespace. Have a look at this regex:
$re = "/<\\/?SOAP[^<>]+?>/";
$str = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:wsse=\"http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd\" xmlns:wsu=\"http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd\">\n <SOAP-ENV:Header></SOAP-ENV:Header>\n <SOAP-ENV:Body>\n <VisionDataExchange>";
$subst = "";
$result = preg_replace($re, $subst, $str);
Ok, so after nearly breaking my skull on my desk, I found what the issue was. The regex is indeed working perfectly! There was a hidden \ in the string that caused the regex to fail.
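As an alternative to the regex, the payload can be pulled out of the envelope with a DOM parser, which does not care about hidden characters inside the tags. This is a different technique from the one above, shown only as a sketch: the sample envelope is shortened and the <item> child is invented for illustration.

```php
<?php
// Shortened sample envelope; the <item> element is made up for this example.
$soap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header></SOAP-ENV:Header>
<SOAP-ENV:Body>
<VisionDataExchange><item>1</item></VisionDataExchange>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
XML;

$doc = new DOMDocument();
$doc->loadXML($soap);

// Find the SOAP Body by namespace URI, independent of the prefix used.
$body = $doc->getElementsByTagNameNS(
    'http://schemas.xmlsoap.org/soap/envelope/',
    'Body'
)->item(0);

// Serialize the first element child of the Body: that is the payload.
$payload = '';
foreach ($body->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $payload = $doc->saveXML($child);
        break;
    }
}

echo $payload, "\n";
```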
I want to add triple spacing in my XML, but the browser collapses triple spacing to a single space. I have found that &#160; can be used to include spacing. I have a tab-delimited text file and I am converting it to XML using PHP. To get triple spacing inside my title node, I am doing this:
$xml->startElement('Products');
while ($line = fgetcsv($fp, 0, "\t")) {
$xml->startElement('Product');
//replacing each single space in the title with a triple space
$title = str_replace(" ", "&#160;&#160;&#160;", $line[1]);
$xml->writeElement('title', $title);
.....
$xml->endElement();
}
$xml->endElement();
}
But &#160; is changed to &amp;#160; and the title ends up like this:
<title>A&amp;#160;&amp;#160;&amp;#160;Test</title>
Basically PHP is changing & to &amp;, but I want the exact &#160; so that I can have triple spacing in the title:
<title>A&#160;&#160;&#160;Test</title>
Any help??
It does not change the spacing. But in HTML/XML, a run of several whitespace characters is usually (not always) rendered as a single space or a line break.
&#160; is the non-breaking space. One of its uses is to separate the parts of a phone number: you don't want one of the spaces in a phone number rendered as a line break.
The UTF-8 bytes for this character are \xC2\xA0.
$xml->writeElement('foo', "foo\xC2\xA0&\xC2\xA0bar");
If the XML document's encoding is ASCII, they will get encoded as character references.
$xml = new XMLWriter();
$xml->openMemory();
$xml->startDocument('1.0', 'ASCII');
$xml->writeElement('foo', "foo\xC2\xA0&\xC2\xA0bar");
$xml->endDocument();
echo $xml->outputMemory(TRUE);
Output:
<?xml version="1.0" encoding="ASCII"?>
<foo>foo&#160;&amp;&#160;bar</foo>
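Applied to the title from the question, that could look like the sketch below. The sample title 'A Test' stands in for $line[1]; the point is to insert the real non-breaking space character (bytes \xC2\xA0 in UTF-8) and let XMLWriter do the escaping, instead of writing the text "&#160;" yourself:

```php
<?php
// Replace each ordinary space with three real non-breaking spaces (U+00A0).
$nbsp  = "\xC2\xA0";
$title = str_replace(' ', str_repeat($nbsp, 3), 'A Test'); // 'A Test' is sample data

$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'ASCII'); // ASCII forces non-ASCII characters into character references
$writer->writeElement('title', $title);
$writer->endDocument();

$out = $writer->outputMemory();
echo $out;

// Reading the document back yields the three non-breaking spaces again.
$roundTrip = (string) simplexml_load_string($out);
```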
I use PHP and SimpleXML to parse XML from a URL. I want to take the value of a SimpleXML node and change it. I first convert it to a string, but ucfirst() doesn't work on that string.
$xml = simplexml_load_file($url);
foreach($xml->offers->offer as $offer)
{
$bodyType = (string) $offer->{"body-type"}; //I convert simplexml to string first
echo ucfirst($bodyType); // In this line ucfirst doesn't work
}
How to deal with it?
UPDATE: The problem was the Cyrillic letters: ucfirst() is not multibyte-aware, so it only works reliably on single-byte (e.g. Latin) characters.
A working solution is to use the mb_* functions:
$bodyType = (string) $offer->{"body-type"};
$encoding='UTF-8';
$str = mb_ereg_replace('^[\ ]+', '', $bodyType);
$str = mb_strtoupper(mb_substr($str, 0, 1, $encoding), $encoding). mb_substr($str, 1, mb_strlen($str), $encoding);
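The same idea can be wrapped in a small reusable helper. This is a sketch assuming the mbstring extension is loaded and the input is UTF-8; the name ucfirstUtf8() is made up here:

```php
<?php
// Hypothetical helper: multibyte-safe ucfirst() for UTF-8 strings.
function ucfirstUtf8(string $s): string
{
    $first = mb_substr($s, 0, 1, 'UTF-8');    // first character, not first byte
    $rest  = mb_substr($s, 1, null, 'UTF-8'); // everything after it
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

echo ucfirstUtf8('фургон'), "\n";  // Фургон
echo ucfirstUtf8('offer 1'), "\n"; // Offer 1
```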
Please share your XML file data as well. I used the following and it works fine:
<?xml version="1.0"?>
<data>
<offers>
<offer>
<body-type>offer 1</body-type>
</offer>
<offer>
<body-type>offer 2</body-type>
</offer>
</offers>
</data>
my output is
Offer 1
Offer 2
HTML: Offer 1<br />Offer 2<br />
using the following PHP code:
<?PHP
$url = "test.xml";
$xml = simplexml_load_file($url);
foreach($xml->offers->offer as $offer)
{
$bodyType = (string) $offer->{"body-type"}; //I convert simplexml to string first
echo ucfirst($bodyType); // In this line ucfirst doesn't work
echo '<br />';
}
?>
Given the test.xml from Farrukh's answer, you can actually even omit the typecasting. This works as well for me:
<?php
$url = "test.xml";
$xml = simplexml_load_file($url);
foreach($xml->offers->offer as $offer) {
echo ucfirst($offer->{"body-type"}) .'<br>';
}
Here's a live demo: http://codepad.viper-7.com/L4VwPL
UPDATE (after URL was provided by OP)
You'll most likely have an encoding issue. When I set the UTF-8 charset explicitly, it works as expected (otherwise simplexml returns corrupted strings only).
$url = "http://carsguru.net/x/used/exchange/4.xml";
$xml = simplexml_load_file($url);
header('Content-Type: text/html; charset=utf-8');
foreach($xml->offers->offer as $offer) {
echo ucfirst($offer->{"body-type"}) .'<br>';
}
When I run the above snippet, I get this output (stripped):
фургон
универсал
хэтчбек
хэтчбек
минивэн
минивэн
минивэн
седан
седан
универсал
хэтчбек
универсал
седан
хэтчбек
седан
NOTE You don't serve a content-type/charset header for the xml! I'd add that.
Anyway, you may want to have a look at this: iconv -> iconv("cp1251", "UTF-8", $str);
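For illustration, here is what that conversion does with a short windows-1251 byte string. This is a sketch assuming the iconv extension is available; the bytes are written out by hand so the example does not depend on the encoding of the source file:

```php
<?php
// "фургон" encoded in windows-1251 (one byte per letter).
$cp1251 = "\xF4\xF3\xF0\xE3\xEE\xED";

// Convert to UTF-8 before handing the string to multibyte-aware functions.
$utf8 = iconv('CP1251', 'UTF-8', $cp1251);

echo $utf8, "\n";                          // фургон
echo mb_strlen($utf8, 'UTF-8'), "\n";      // 6 characters
echo strlen($utf8), "\n";                  // 12 bytes in UTF-8
```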
Actually, the file encoding is Cyrillic windows-1251, which probably makes sense.
Why? You can, of course, use valid UTF-8! Here is an example node from your XML converted with this cp1251-to-utf8-function (might look odd, but renders perfectly!)
<?xml version="1.0" encoding="UTF-8"?>
<auto-catalog>
<creation-date>2013-02-07 02:00:08 GMT+4</creation-date>
<host>carsguru.net</host>
<offers>
<offer type="commercial">
<url>http://carsguru.net/used/5131406/view.html</url>
<date>2013-02-07</date>
<mark>ГАЗ</mark>
<model>2705</model>
<year>2003</year>
<seller-city>Санкт-Петербург</seller-city>
<seller-phone>8-921-997-74-06</seller-phone>
<price>150000</price>
<currency-type>RUR</currency-type>
<steering-wheel>левый</steering-wheel>
<run-metric>км</run-metric>
<run>194</run>
<displacement>2300</displacement>
<stock>в наличии</stock>
<state>Хорошее</state>
<color>синий</color>
<body-type>фургон</body-type>
<engine-type>бензин</engine-type>
<gear-type>задний</gear-type>
<transmission>ручная</transmission>
<horse-power>98</horse-power>
<image>http://carsguru.net/clf/03/af/9c/8b/used.4r9v39h31facog8cs0w0wk8ws.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/ae/51/be/3a/used.bxyc3q9mx80sko0wg80880w0k.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/28/dc/c1/d4/used.8i1b76l1b8o4cwg8gc08oos4s.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/55/3d/37/10/used.7dmn7puczuo0wo4cs8kko0cco.jpg.medium.jpg</image>
<image>http://carsguru.net/clf/49/02/15/54/used.7k8lhomw4j4s4040kssk4kgso.jpg.medium.jpg</image>
<equipment>Магнитола</equipment>
<equipment>Подогрев зеркал</equipment>
</offer>
</offers>
</auto-catalog>
I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another stackoverflow question, line breaks should be valid (even though far less than ideal!) for XML.
Why are they lost? [edit] And how can I preserve them? [/edit]
Here is a demo file script (note that when the line breaks are not in an attribute they are preserved).
PHP File with embedded XML
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
<data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
<data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;
$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';
Output from print_r
SimpleXMLElement Object
(
[data] => Array
(
[0] => SimpleXMLElement Object
(
[#attributes] => Array
(
[Title] => Data Title
[Remarks] => First line of the row. Followed by the second line. Even a third!
)
)
[1] => First line of the row.
Followed by the second line.
Even a third!
)
)
Using SimpleXML, the line breaks seem to be lost.
Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.
If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
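A small sketch that shows the difference. The attribute names Raw and Kept are made up for the demonstration:

```php
<?php
// "Raw" contains a literal newline, "Kept" contains the character reference &#10;.
$xml  = "<Rows><data Raw='one\ntwo' Kept='one&#10;two'/></Rows>";
$rows = new SimpleXMLElement($xml);
$data = $rows->data;

// The raw newline is normalised to a space by the parser ...
var_dump((string) $data['Raw'] === 'one two');    // bool(true)
// ... while the character reference survives as a real newline.
var_dump((string) $data['Kept'] === "one\ntwo");  // bool(true)
```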
The entity for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:
//First remove any indentation at the start of each line:
$xml = preg_replace('/^[ \t]+/m', '', $xml);
//Next unify all new lines into Unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);
//Next replace all new lines with the character reference:
$xml = str_replace("\n","&#10;", $xml);
//Finally, replace any new line references between >< with a real new line:
$xml = str_replace(">&#10;<",">\n<", $xml);
The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.
This of course would fail if your next line had some text that was wrapped in a line-level element.
Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.
$parts = explode("<", $xmlData); //split over <
array_shift($parts); //remove the blank array element
$newParts = array(); //create array for storing new parts
foreach($parts as $p)
{
list($attr,$other) = explode(">", $p, 2); //get attribute data into $attr
$attr = str_replace("\r\n", "&#10;", $attr); //do the replacement
$newParts[] = $attr.">".$other; // put parts back together
}
$xmlData = "<".implode("<", $newParts); // put parts back together prefixing with <
Probably can be done more simply with a regex, but that's not a strong point for me.
Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.
$replaceFunction = function ($matches) {
return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
"/<data Title='[^']+' Remarks='[^']+'/i",
$replaceFunction, $xml);
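A more general variant of the same idea, not tied to the data/Title/Remarks layout, might look like this. It is only a sketch: the function name encodeAttributeNewlines() is made up, and it assumes that attribute values contain no < or > characters and that the document has no comments or CDATA sections containing quotes.

```php
<?php
// Hypothetical helper: replace raw newlines inside quoted attribute values
// with &#10; so that the parser keeps them.
function encodeAttributeNewlines(string $xml): string
{
    // Outer pass: look at each tag (everything between < and >).
    return preg_replace_callback('/<[^<>]+>/', function (array $tag): string {
        // Inner pass: within a tag, quoted strings are attribute values.
        return preg_replace_callback('/(["\'])(.*?)\1/s', function (array $m): string {
            $value = str_replace(["\r\n", "\r", "\n"], '&#10;', $m[2]);
            return $m[1] . $value . $m[1];
        }, $tag[0]);
    }, $xml);
}

$xml  = "<Rows>\n<data Title='T'\n Remarks='a\nb'>x\ny</data>\n</Rows>";
$rows = new SimpleXMLElement(encodeAttributeNewlines($xml));
$data = $rows->data;

var_dump((string) $data['Remarks'] === "a\nb"); // bool(true)
```

Newlines between attributes and in element text are left alone; only those inside quotes are rewritten.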
This is what worked for me:
First, get the xml as a string:
$xml = file_get_contents($urlXml);
Then do the replacement:
$xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);
The "." and "< as:eol/ >" were there because I needed to add breaks in that case. The new lines "\n" can be replaced with whatever you like.
After replacing, just load the xml-string as a SimpleXMLElement object:
$xmlo = new SimpleXMLElement( $xml );
Et Voilà
Well, this question is old, but someone might, like me, come to this page eventually.
I took a slightly different approach, which I think is the most elegant of those mentioned.
Inside the xml, you put some unique word which you will use for new line.
Change xml to
<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />
Then, once you have the value of the desired node from SimpleXML as a string in $output, write something like this:
$findme = '\n'; // single quotes: the literal two-character marker, not a newline
if (strpos($output, $findme) !== false)
{
$output = str_replace($findme, "<br/>", $output);
}
It doesn't have to be '\n'; it can be any unique marker.