I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another Stack Overflow question, line breaks in attribute values are valid XML (even though far less than ideal!).
Why are they lost, and how can I preserve them?
Here is a demo script (note that when the line breaks are not in an attribute, they are preserved).
PHP File with embedded XML
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
<data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
<data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;
$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';
Output from print_r
SimpleXMLElement Object
(
[data] => Array
(
[0] => SimpleXMLElement Object
(
[#attributes] => Array
(
[Title] => Data Title
[Remarks] => First line of the row. Followed by the second line. Even a third!
)
)
[1] => First line of the row.
Followed by the second line.
Even a third!
)
)
Using SimpleXML, the line breaks seem to be lost.
Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.
If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
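To see the normalisation in action, here is a minimal sketch comparing a raw newline with a `&#10;` character reference in an attribute value. The element and attribute values are made up for illustration.

```php
<?php
// Raw newline inside the attribute: the parser normalises it to a space.
$raw = new SimpleXMLElement("<data Remarks='First line.\nSecond line.'/>");

// Character reference: the parser delivers a real newline character.
$ref = new SimpleXMLElement("<data Remarks='First line.&#10;Second line.'/>");

$rawValue = (string) $raw['Remarks'];
$refValue = (string) $ref['Remarks'];

var_dump(strpos($rawValue, "\n") !== false); // bool(false)
var_dump(strpos($refValue, "\n") !== false); // bool(true)
```

So the fix has to happen before parsing: once the parser has seen the raw newline, the information is gone.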
The character reference for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:
//First remove any indentation at the start of each line
//(stripping every single space would also break the tags themselves):
$xml = preg_replace('/^[ \t]+/m', '', $xml);
//Next unify all new lines into Unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);
//Next replace all new lines with the character reference:
$xml = str_replace("\n","&#10;", $xml);
Finally, replace any new line references between > and < with a real new line:
$xml = str_replace(">&#10;<",">\n<", $xml);
The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.
This of course would fail if your next line had some text that was wrapped in a line-level element.
Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.
$parts = explode("<", $xmlData); //split over <
array_shift($parts); //remove the blank array element
$newParts = array(); //create array for storing new parts
foreach($parts as $p)
{
list($attr,$other) = explode(">", $p, 2); //get attribute data into $attr
$attr = str_replace("\r\n", "&#10;", $attr); //do the replacement
$newParts[] = $attr.">".$other; // put parts back together
}
$xmlData = "<".implode("<", $newParts); // put parts back together prefixing with <
Probably can be done more simply with a regex, but that's not a strong point for me.
Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.
$replaceFunction = function ($matches) {
return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
"/<data Title='[^']+' Remarks='[^']+'/i",
$replaceFunction, $xml);
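A more general variant of the same idea, as a rough sketch: find every start tag, then rewrite newlines only inside its quoted attribute values. The function name `encodeAttributeNewlines` is hypothetical, and the sketch assumes attribute values never contain a literal `>` or `<` and that the document has no comments or CDATA sections containing tag-like text.

```php
<?php
function encodeAttributeNewlines(string $xml): string
{
    // Outer pass: start tags and empty-element tags only
    // (skips <?xml ... ?>, end tags, and text content).
    return preg_replace_callback('/<[A-Za-z_][^<>]*>/', function ($tag) {
        // Inner pass: quoted attribute values inside that tag.
        return preg_replace_callback('/"[^"]*"|\'[^\']*\'/', function ($value) {
            // CRLF first, so a Windows line ending becomes one reference.
            return str_replace(["\r\n", "\r", "\n"], '&#10;', $value[0]);
        }, $tag[0]);
    }, $xml);
}

$xml = "<Rows>\n<data Title='T' Remarks='one\ntwo' />\n<data>three\nfour</data>\n</Rows>";
$fixed = encodeAttributeNewlines($xml);
$rows = new SimpleXMLElement($fixed);
echo $rows->data[0]['Remarks'], "\n";
```

Newlines between elements and inside element text are left alone, because only the quoted parts of a tag are touched.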
This is what worked for me:
First, get the xml as a string:
$xml = file_get_contents($urlXml);
Then do the replacement:
$xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);
The "." and "<as:eol/>" were there because I needed to add breaks in that particular case. The replacement new lines "\n" can be swapped for whatever you like.
After replacing, just load the xml-string as a SimpleXMLElement object:
$xmlo = new SimpleXMLElement( $xml );
Et Voilà
Well, this question is old, but like me, someone might come to this page eventually.
I took a slightly different approach, which I think is the most elegant of those mentioned.
Inside the XML, you put some unique marker which you will use to stand for a new line.
Change the XML to
<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />
And then, once you have the value of the desired node from SimpleXML as a string in $output, write something like this:
$findme = '\n'; // single quotes: the two literal characters \ and n
if (strpos($output, $findme) !== false)
{
$output = str_replace($findme, "<br/>", $output);
}
It doesn't have to be '\n'; it can be any unique marker.
"contentDetails" has following data in it:
<p>This is data sample.&nbsp;</p><p>Second part of the paragraph.&nbsp;</p>
str_replace is not working here. Please take a look.
Here is how my XML structure in PHP looks:
$xml = '<?xml version="1.0" encoding="UTF-8"?>';
$xml .= '<root>';
$xml .= '<myData>';
$xml .= '<content>' . str_replace("&nbsp;", "", htmlentities($_POST[contentDetails])) . '</content>';
$xml .= '</myData>';
$xml .= '</root>';
I'm assuming your contentDetails actually contains:
<p>This is data sample.&nbsp;</p><p>Second part of the paragraph.&nbsp;</p>
($nbsp; replaced with &nbsp;)
Your problem is that when you call htmlentities on contentDetails it converts &nbsp; into &amp;nbsp;, so your str_replace won't find any matches. To solve the problem, call str_replace before htmlentities:
$xml .= '<content>' . htmlentities(str_replace("&nbsp;", "", $_POST['contentDetails'])) . '</content>';
Note that associative array keys should be enclosed in quotes; an unquoted key raises a notice or warning in PHP 7 and earlier, and throws an Error as of PHP 8.0.
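The effect of the call order can be checked with a small sketch. The sample string stands in for the posted contentDetails value, and the variable names are only for illustration.

```php
<?php
$input = '<p>This is data sample.&nbsp;</p>';

// Wrong order: htmlentities() has already turned "&nbsp;" into "&amp;nbsp;",
// so the search string "&nbsp;" no longer occurs anywhere.
$wrong = str_replace('&nbsp;', '', htmlentities($input));

// Right order: strip "&nbsp;" first, then escape what is left.
$right = htmlentities(str_replace('&nbsp;', '', $input));

echo $wrong, "\n"; // &lt;p&gt;This is data sample.&amp;nbsp;&lt;/p&gt;
echo $right, "\n"; // &lt;p&gt;This is data sample.&lt;/p&gt;
```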
The htmlentities() function converts &nbsp; to &amp;nbsp;, so try this:
str_replace("&amp;nbsp;", "", htmlentities($_POST['contentDetails']))
Goal: Modify an HTML string that uses backticks to wrap inline code (like Stack Overflow does). At the same time, the string has <code> blocks that can also contain backticks, and those should stay unchanged.
Example:
<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>
<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>
This is the regex I am using to convert all wrapping backticks to <code> elements:
// replace each backtick pair with a wrapping <code> tag
$content = preg_replace('/(.+?)\`(.+?)\`/', '$1<code class="inlinecode">$2</code>', $content);
Required:
Change the regex so that it does not convert backticks that are within a <code> block.
Disclaimer: I tried for several hours to read the HTML string with PHP's DOM parser, extract all nodes of type code, change their content, and write them back, then found out that nodeValue removes all HTML tags (especially the line breaks). Then I tried several solutions found online, still without success. Now I am falling back to regex, even against the odds.
FYI, how I tried it the DOM way:
$code_blocks = $dom->getElementsByTagName('code');
foreach($code_blocks as $codenode) {
// nodeValue strips HTML tags, we need to hack
$nodevalue_html = $codenode->ownerDocument->saveXML($codenode);
// replace, i.e. custom-store each apostroph with '~~~APO~~~' so that they survive
$nodevalue_html = preg_replace('/`/', '~~~APO~~~', $nodevalue_html);
// $codenode->textValue = $nodevalue_html; // fail
// $codenode->nodeValue = $nodevalue_html; // fail
// ...
}
// html to string
$html_new = $dom->saveHTML();
$html_new = preg_replace('/~~~APO~~~/', '`', $html_new);
I wish I could use Markdown like Stack Overflow, but I still need to deal with HTML.
Using an XPath query to avoid text nodes that have a code element as ancestor:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::code)][contains(.,"`")]');
foreach ($textNodes as $textNode) {
$parts = (function($text) { yield from explode('`', $text); })($textNode->nodeValue);
$frag = $dom->createDocumentFragment();
do {
$frag->appendChild($dom->createTextNode($parts->current()));
$parts->next();
if ( $parts->valid() ) {
$codeElt = $dom->createElement('code');
$codeElt->appendChild($dom->createTextNode($parts->current()));
$frag->appendChild($codeElt);
$parts->next();
}
} while ($parts->valid());
$textNode->parentNode->replaceChild($frag, $textNode);
}
echo $dom->saveHTML();
I believe the only way is to explode and reassemble the string:
$html_string = '....................'; // contains apostrophes and <code>...</code> blocks
$delim = "<code>";
$closing_tag = "</code>";
$explode = explode($delim, $html_string);
foreach($explode as &$ex) {
$closing_tag_pos = strpos($ex, $closing_tag);
if ($closing_tag_pos !== false) {
// everything before </code> is the content of the code block: protect its backticks
$pre_closing_tag = substr($ex, 0, $closing_tag_pos);
$post_closing_tag = substr($ex, $closing_tag_pos);
$ex = preg_replace('/`/', '~~~APO~~~', $pre_closing_tag) . $post_closing_tag;
}
}
unset($ex); // break the reference left over from the loop
$mapped_html_string = implode($delim, $explode);
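The same split-and-reassemble idea can be done without a placeholder, as a sketch: `preg_split` with `PREG_SPLIT_DELIM_CAPTURE` keeps the `<code>...</code>` blocks as separate array entries, so the backtick replacement can be applied to the other entries only. The function name `wrapInlineCode` is made up for this example, and the sketch assumes `<code>` blocks are not nested.

```php
<?php
function wrapInlineCode(string $html): string
{
    // With one capture group, the result alternates:
    // even index = text outside <code>, odd index = a whole <code>...</code> block.
    $parts = preg_split('#(<code\b[^>]*>.*?</code>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('/`([^`]+)`/', '<code class="inlinecode">$1</code>', $part);
        }
    }
    return implode('', $parts);
}

$html = '<p>This is my `inline code`.</p><p><code>keep `this` as is</code></p>';
echo wrapInlineCode($html), "\n";
```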
I try to add a string to an XML object with Simple XML.
Example (http://ideone.com/L4ztum):
$str = "<aoc> САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
$movies = new SimpleXMLElement($str);
But it gives a warning:
PHP Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : PCDATA invalid Char value 2 in /home/nmo2E7/prog.php on line 5
and finally an Exception with the message String could not be parsed as XML.
If I remove two Unicode characters, it works (http://ideone.com/LaMvHN):
$str = "<aoc> САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
^
`-- two invisible characters have been removed here
How can I remove Unicode from string?
It is not Unicode, but two stray bytes, valued \x01 and \x02. You can filter them out with str_replace:
$s = str_replace("\x01", "", $s);
$s = str_replace("\x02", "", $s);
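If other control characters might show up as well, a broader filter is possible. The following is a sketch, assuming the input is valid UTF-8: it removes every code point outside the character range that XML 1.0 allows. The function name `stripInvalidXmlChars` is hypothetical.

```php
<?php
function stripInvalidXmlChars(string $s): string
{
    // XML 1.0 allows: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    );
    if ($clean === null) {
        // With the /u modifier, preg_replace() returns null on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$text = stripInvalidXmlChars("САМОЛЕТОМ\x02\x01 ТК Адамант");
$aoc = new SimpleXMLElement('<aoc>' . htmlspecialchars($text, ENT_XML1) . '</aoc>');
echo $aoc, "\n";
```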
The constructor of SimpleXMLElement needs its first parameter to be well-formed XML.
The string you pass
$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
is not well-formed XML because it contains characters outside the character range allowed in XML, namely:
Unicode Character 'START OF TEXT' (U+0002) at binary offset 24
Unicode Character 'START OF HEADING' (U+0001) at binary offset 25
So instead of using SimpleXMLElement to create it from a hand-mangled XML-string (which is error-prone), use it to create the XML you're looking for. Let's give an example.
In the following example I assume you've got the text you want to create the XML element of. This example creates an XML element similar to the one in your question with the difference that the exact same string is passed in as text-content for the document element ("<aoc>").
$text = 'САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12';
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><aoc/>');
$xml->{0} = $text; // set the document-element's text-content to $text
When done this way, SimpleXML will filter any invalid control-characters for you and the SimpleXMLElement remains stable:
$str = $xml->asXML();
$movies = new SimpleXMLElement($str);
print_r($movies);
/* output:
SimpleXMLElement Object
(
[0] => САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12
)
*/
So to finally answer your question:
How can I remove Unicode from string?
You don't want to remove Unicode from the string. The SimpleXML library accepts Unicode strings only (in the UTF-8 encoding). What you want is to remove the Unicode characters that are invalid in XML. The SimpleXML library does that for you when you set node values the way it was designed for.
However, if you try to load non-well-formed XML via the constructor or the constructor functions (simplexml_load_string etc.), it will fail and give you the (important) error.
I hope this clarifies the situation for you and answers your question.
I have an HTML document in a PHP variable $content. I can echo it, but I only need the <a ...> tags with class="pret". From each of them I need the code (e.g. d3852) from the href attribute and the number (e.g. 2352.2345) between <a> and </a>.
I have tried several examples from the web, but I get either empty arrays or PHP errors.
A regex example that gives me an empty array (the <a> tag is in a table):
$pattern = "#<table\s.*?>.*?<a\s.*?class=[\"']pret[\"'].*?>(.*?)</a>.*?</table>#i";
preg_match_all($pattern, $content, $results);
print_r($results[1]);
Another example that gives just an error
$a=$content->getElementsByTagName(a);
Reason for the various errors: invalid HTML and non-UTF-8 characters.
Next I did this on another website, matched the contents in a single SQL table, and the result is a copied website with updated data from my country. No longer will I search the www for matching single results.
Assuming you're trying to parse a valid (or at least valid enough) HTML document, you should use DOM for this:
// Simple example adapted from the comments in the PHP manual
$xml = new DOMDocument();
$xml->loadHTMLFile($url);
$links = array();
foreach($xml->getElementsByTagName('a') as $link) {
$links[] = array('url' => $link->getAttribute('href'),
'text' => $link->nodeValue);
}
Note the use of loadHTML/loadHTMLFile rather than load (the HTML loaders are more robust against errors). You may also set DOMDocument::recover (as suggested in a comment by hakre) so that the parser tries to recover from errors.
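As a sketch of how loading sloppy markup might look in practice: `libxml_use_internal_errors()` collects parser complaints instead of emitting PHP warnings, and the `recover` property is set as described above. The sample HTML is invented for the example.

```php
<?php
// Deliberately sloppy markup: unclosed <td> cells and a stray </b>.
$html = '<table><tr><td><a href="a.html">first</a><td><a href="b.html">second</a></b></table>';

$previous = libxml_use_internal_errors(true); // collect errors instead of warnings
$doc = new DOMDocument();
$doc->recover = true;
$doc->loadHTML($html);
$problems = libxml_get_errors();              // inspect or log these if needed
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

$links = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    $links[] = array('url' => $link->getAttribute('href'),
                     'text' => $link->nodeValue);
}
printf("%d links, %d parser messages\n", count($links), count($problems));
```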
Or you could use xPath (here's explanation of syntax):
$xpath = new DOMXPath($xml); // $xml is the DOMDocument loaded above
$elements = $xpath->query("//a[@class='pret']"); // attributes are selected with @
if ($elements !== false) {
foreach ($elements as $element) {
$links[] = array('url' => $element->getAttribute('href'),
'text' => $element->nodeValue);
}
}
And for case of invalid HTML you may use regexp like this:
$a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"'; # Attribute with " - space tolerant
$a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'"; # Attribute with ' - space tolerant
$a3 = '\s*[^\'"=<>]+\s*=\s*[\w\d]*'; # Unquoted values - space tolerant
# [^'"=<>]* # Junk - I'm not inserting this to regexp but you may have to
$a = "(?:$a1|$a2|$a3)*"; # Any number of attributes
$class = 'class=([\'"])pret\\1'; # Using ?: carefully is crucial for \\1 to work
# otherwise you can use ["']
$reg = "#<a{$a}\s*{$class}{$a}\s*>(.*?)</a>#is";
And then just call preg_match_all. All of these regexps are written off the top of my head, so you may have to debug them.
I got the links like this:
preg_match_all('/<a[^>]*class="pret">(.*?)<\\/a>/si', $content, $links);
print_r($links[0]);
and the result is
Array(
[0] => <span>3340.3570 word</span>..........)
So I still need to get the first number inside the href and the number inside the span.
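One way to get both values, sketched with DOM and XPath rather than a single regex. The sample markup and the shape of the href (`/item/d3852`) are assumptions, since the real page is not shown; adjust the two small patterns to the actual data.

```php
<?php
// Hypothetical sample resembling the described page.
$content = '<table><tr><td><a class="pret" href="/item/d3852"><span>3340.3570 word</span></a></td></tr></table>';

libxml_use_internal_errors(true);   // real-world HTML is rarely clean
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = array();
foreach ($xpath->query("//a[@class='pret']") as $a) {
    // Assumed code format: one letter followed by digits, e.g. d3852.
    preg_match('/[a-z]\d+/i', $a->getAttribute('href'), $code);
    // First decimal number in the link text (the span's text is included).
    preg_match('/\d+(?:\.\d+)?/', $a->textContent, $number);
    $rows[] = array(
        'code'   => isset($code[0]) ? $code[0] : null,
        'number' => isset($number[0]) ? $number[0] : null,
    );
}
print_r($rows);
```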
OK, I have to parse a SOAP request, and in the request some of the values are passed inside an anchor tag. I'm looking for a RegEx (or an alternative method) to strip the tag and return just the value.
// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
// Split on > this should be the end of the right side of the anchor tag
$pieces = explode(">", $sObject->fields->$field);
// Split on < this should be the closing anchor tag
$piece = explode("<", $pieces[1]);
$fields_string .= $piece[0] . "\n";
}
item is a field name, but I would like to replace this check with a RegEx that looks for the anchor tag instead of a specific field.
PHP has a strip_tags() function.
Alternatively you can use filter_var() with FILTER_SANITIZE_STRING (note that this filter is deprecated as of PHP 8.1).
Whatever you do don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
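For the anchor case in the question, a minimal sketch with strip_tags() might look like this. The field value is invented, and the html_entity_decode() step is an optional extra for values that contain entities.

```php
<?php
// Hypothetical field value as it might arrive in the request.
$field = '<a href="https://example.com/item/42">Tom &amp; Jerry</a>';

$plain = strip_tags($field);                                      // drop the tags
$value = html_entity_decode($plain, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // decode entities

echo $plain, "\n"; // Tom &amp; Jerry
echo $value, "\n"; // Tom & Jerry
```

Because strip_tags() leaves plain strings untouched, it can be applied to every field without first checking whether an anchor is present.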
I agree with cletus: using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are so many ways to vary a tag that, unless you know the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!
I'll split this up into sections; there's no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:
'#<a></a>#'
Then we add in the text that could be between the tags.
We want to group this in parentheses so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> it comes across will be the one it uses to end the match.
'#<a>(.*?)</a>#'
Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.
'#<a href\="([^"]*)">(.*?)</a>#'
Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute name matches the following pattern: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*.
Allowing an attribute multiple times, each with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
The resulting RegEx (PCRE) is as following:
'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'
Now, in PHP, use the preg_match_all() function to grab all occurrences in the string. Pass PREG_SET_ORDER so that each entry of $result holds the groups of one match:
$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
{
$href = $link[2]; // value of the href attribute
$text = $link[4]; // text between <a ...> and </a>
}
Use SimpleXML and XPath to retrieve the desired nodes.
If you don't have some kind of request<->class mapping, you can extract the information with the DOM extension. The property textContent contains all the text of the context node and its descendants.
$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
<SOAP:Body>
<foo:bar xmlns:foo="urn:yaddayadda">
<fragment>
Mary had a
little lamb
</fragment>
</foo:bar>
</SOAP:Body>
</SOAP:Envelope>';
$doc = new DOMDocument;
$doc->loadXML($sr);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
echo $ns->item(0)->nodeValue;
}
prints
Mary had a
little lamb
If you want to strip or extract properties from specific tags only, you should try DOMDocument.
Something like this:
$TagWhiteList = array(
// Example of WhiteList
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getTextFromNode($Node, $Text = "") {
// A text node: append its text
if ($Node->nodeType === XML_TEXT_NODE)
return $Text.$Node->textContent;
// You may select a tag here
// Like:
// if (in_array($Node->nodeName, $TagWhiteList))
// DoSomethingWithIt($Text,$Node);
// Recurse into every child, in document order
for ($Child = $Node->firstChild; $Child !== null; $Child = $Child->nextSibling)
$Text = getTextFromNode($Child, $Text);
return $Text;
}
function getTextFromDocument($DOMDoc) {
return getTextFromNode($DOMDoc->documentElement);
}
To use:
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";
The above function shows how to strip tags, but you can modify it a bit to manipulate the elements. For example, if the tag is 'a' (an anchor), you can extract its target and display that instead of the text inside.
Hope this helps.
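A sketch of that modification, under the assumption that an anchor should be replaced by its href value whenever it has one. The function name `textWithLinkTargets` and the sample markup are made up for illustration.

```php
<?php
function textWithLinkTargets(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->textContent;
    }
    // For <a href="...">, emit the target instead of the link text.
    if ($node instanceof DOMElement
        && strtolower($node->nodeName) === 'a'
        && $node->hasAttribute('href')) {
        return $node->getAttribute('href');
    }
    if (!$node->hasChildNodes()) {
        return ''; // comments, empty elements, etc.
    }
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= textWithLinkTargets($child);
    }
    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>See <a href="http://example.com/x">this page</a> now.</p>');
$result = textWithLinkTargets($doc->documentElement);
echo $result, "\n";
```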