If I have a document like this:
<!-- in doc.xml -->
<a>
<b>
greetings?
<c>hello</c>
<d>goodbye</d>
</b>
</a>
Is there any way to use SimpleXML (or any PHP built-in, really) to get a string containing:
greetings?
<c>hello</c>
<d>goodbye</d>
Whitespace and such doesn't matter.
Thanks!
I must admit this wasn't as simple as one would think. This is what I came up with:
$xml = new DOMDocument;
$xml->load('doc.xml');
// find just the <b> node(s)
$xpath = new DOMXPath($xml);
$results = $xpath->query('/a/b');
// get entire <b> node as text
$node = $results->item(0);
$text = $xml->saveXML($node);
// remove encapsulating <b></b> tags
$text = preg_replace('#^<b>#', '', $text);
$text = preg_replace('#</b>$#', '', $text);
echo $text;
Regarding the XPath query
The query returns all matching nodes, so if there are multiple matching <b> tags, you can loop through $results to get them all.
My query for '/a/b' assumes that <a> is the root and <b> is its child/immediate descendant. You could alter it for different scenarios. Here's an XPath reference. Some adjustments might include:
'//a/b' : <b> is a child of <a>, but <a> can be anywhere, not just at the root
'/a//b' : <b> is a descendant of the root <a> no matter how deep, not just a direct child
'//b' –– all <b> nodes anywhere in the document
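To illustrate looping over several matches, here is a minimal sketch. The inline document and the $found array are made up for the example; they are not from the question.

```php
<?php
// Hypothetical sample: one <b> directly under <a>, one nested deeper.
$xml = new DOMDocument;
$xml->loadXML('<a><b>one <c>x</c></b><z><b>two</b></z></a>');

$xpath = new DOMXPath($xml);

// '//b' matches every <b> in the document, whatever its depth.
$found = [];
foreach ($xpath->query('//b') as $node) {
    $found[] = $xml->saveXML($node);
}

// '/a/b' only matches <b> elements that are direct children of the root <a>.
$direct = $xpath->query('/a/b')->length;

print_r($found);
echo $direct, "\n";
```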
Regarding method of obtaining string contents
I tried using $node->nodeValue or $node->textContent, but both of them strip out the <c> and <d> tags, leaving just the text contents of those. I also tried casting it as a DOMText object, but that didn't directly work and was more trouble than it was worth.
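A small sketch of the difference, using an inline copy of the question's document with the whitespace removed to keep the output easy to read:

```php
<?php
$xml = new DOMDocument;
$xml->loadXML('<a><b>greetings?<c>hello</c><d>goodbye</d></b></a>');
$node = $xml->getElementsByTagName('b')->item(0);

// textContent flattens the subtree to plain text, so the tags are lost.
$plain = $node->textContent;

// saveXML() on the node keeps the markup, including the outer <b> tags.
$markup = $xml->saveXML($node);

echo $plain, "\n";
echo $markup, "\n";
```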
Regarding the use of regular expressions
It could be done without regex, but I found it easiest to use them. I wanted to make sure that I only stripped the <b> and </b> at the very beginning and end of the string, just in case there were other <b> nodes within the contents.
How about this? Since you already know the XML format:
<?php
$xml = simplexml_load_file('doc.xml');
$str = (string) $xml->b; // text directly inside <b>, without child elements
$str .= "<c>".$xml->b->c."</c>";
$str .= "<d>".$xml->b->d."</d>";
echo $str;
?>
Here's an alternative using DOM (to balance the SimpleXML answers!) that outputs all of the contents of the first <b> element.
$doc = new DOMDocument;
$doc->load('doc.xml');
$bee = $doc->getElementsByTagName('b')->item(0);
$innerxml = '';
foreach ($bee->childNodes as $node) {
$innerxml .= $doc->saveXML($node);
}
echo $innerxml;
Related
I want to split the text string "hello there" across two lines.
For this I need SimpleXML to create a br tag in the output file result.xml, but all I get is the escaped text &lt;br&gt;.
<?php
// DOMDocument
$dom = new DomDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$html = $dom->appendChild($dom->createElement("html"));
$xmlns = $dom->createAttribute('xmlns');
$xmlns->value = 'http://www.w3.org/1999/xhtml';
$html->appendChild($xmlns);
// SimpleXML
$sxe = simplexml_import_dom($dom);
$head = $sxe->addChild('head', ' ');
$body = $sxe->addChild('body', 'hello <br> there');
echo $sxe->asXML('result.xml');
Result:
hello &lt;br&gt; there
Wanted Result:
hello
there
Firstly, PHP's SimpleXML extension works only with XML, not HTML. You're rightly mentioning XHTML in your setup code, but that means you need to use XML self-closing elements like <br /> not HTML unclosed tags like <br>.
Secondly, the addChild method takes text content as its second parameter, not raw document content; so as you've seen, it will automatically escape < and > for you.
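A quick way to see that escaping, as a minimal sketch with a made-up root element:

```php
<?php
// Hypothetical one-element document, just to show what addChild does to markup.
$sxe = new SimpleXMLElement('<body/>');
$sxe->addChild('p', 'hello <br /> there');

// The angle brackets are serialized as entities, not as a real <br /> element.
$out = $sxe->asXML();
echo $out;
```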
SimpleXML is really designed around the kind of XML that's a strict tree of elements, rather than a markup language with elements interleaved with text content like XHTML, so this is probably a case where you're better off sticking to the DOM.
Even then, there's no equivalent of the JS "innerHTML" property, I'm afraid, so I believe you'll have to add the text and br element as separate nodes, e.g.
$body = $html->appendChild( $dom->createElement('body') );
$body->appendChild( $dom->createTextNode('hello') );
$body->appendChild( $dom->createElement('br') );
$body->appendChild( $dom->createTextNode('there') );
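Put together as a standalone script, that approach might look like the sketch below. It leaves out the xmlns attribute and formatOutput from the question for brevity, and the $xml variable is only there for illustration.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$html = $dom->appendChild($dom->createElement('html'));
$html->appendChild($dom->createElement('head'));

// Mixed content is built node by node: text, element, text.
$body = $html->appendChild($dom->createElement('body'));
$body->appendChild($dom->createTextNode('hello'));
$body->appendChild($dom->createElement('br'));
$body->appendChild($dom->createTextNode('there'));

// saveXML() serializes the empty <br> in self-closing form.
$xml = $dom->saveXML();
echo $xml;
```

To write the file instead of a string, $dom->save('result.xml') should do the same job.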
How, in PHP, I can get the raw content of DOMElement, like JS innerHTML does?
I tried simulating innerHTML by iterating over childNodes and calling saveHTML() or saveXML() on each, but that rewrites the markup, for example turning <br /> into <br> (or into <br/> in the XML version).
This can be achieved in a hacky but reliable way. PHP has the equivalent of outerHTML: pass the node to its parent document's saveHTML() method. Because this output is well-formed and escaped, you can easily strip the single outer tag from the text, leaving the desired innerHTML.
Example:
$dom = new DOMDocument;
$dom->loadHTML('<div><p with="scary<>\'"" attrs=40 ok>Hello <em>World</em></div>');
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//p') as $p) {
$innerHTML = preg_replace('#^<([^>\\s]+)[^>]*>(.*)</\\1>$#s', '$2', $dom->saveHTML($p));
var_dump($innerHTML);
}
Demo of regex: https://regex101.com/r/yEVMQx/2
Note the s flag on the regex is critical.
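To see why the s flag matters, here is a minimal sketch that runs the same pattern with and without it. The $outer string is a made-up stand-in for what saveHTML() might return for an element whose content spans two lines.

```php
<?php
// Hypothetical outerHTML string containing a newline.
$outer = "<p>Hello\n<em>World</em></p>";
$pattern = '#^<([^>\s]+)[^>]*>(.*)</\1>$#';

// With the s flag, "." also matches newlines, so the whole inner content is captured.
$with = preg_replace($pattern . 's', '$2', $outer);

// Without it, "." stops at the newline, the pattern fails, and the string comes back unchanged.
$without = preg_replace($pattern, '$2', $outer);

var_dump($with, $without);
```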
I have a huge file with lots of entries that have one thing in common: the first line. I want to extract all of the text from a paragraph where the first line is:
Type of document: Contract Notice
The HTML code I am working on is here:
<!-- other HTML -->
<p>
<b>Type of document:</b>
" Contract Notice" <br>
<b>Country</b> <br>
... rest of text ...
</p>
<!-- other HTML -->
I have put the HTML into a DOM like this:
$dom = new DOMDocument;
$dom->loadHTML($content);
I need to return all of the text in the paragraph node where the first line is 'Type of document: Contract Notice'. I am sure there is a simple way of doing this using DOM methods or XPath. Please advise!
Speaking of XPath, try the following expression, which selects <p> elements:
whose <b> child element (first one) has the value Type of document:
whose next sibling text node (first one) contains the text Contract Notice
//p[
b[1][.="Type of document:"]
/following-sibling::text()[1][contains(., "Contract Notice")]
]
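Plugged into PHP, that expression might be used as in the sketch below. The two-paragraph $content sample and the $texts array are invented for the example; only the first paragraph is a Contract Notice.

```php
<?php
// Hypothetical sample with two entries.
$content = '<p><b>Type of document:</b> Contract Notice<br><b>Country</b> UK</p>'
         . '<p><b>Type of document:</b> Contract Award<br><b>Country</b> FR</p>';

$dom = new DOMDocument;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$query = '//p[b[1][.="Type of document:"]'
       . '/following-sibling::text()[1][contains(., "Contract Notice")]]';

$texts = [];
foreach ($xpath->query($query) as $p) {
    // textContent concatenates every text node inside the matched paragraph.
    $texts[] = trim($p->textContent);
}

print_r($texts);
```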
With this XPath expression, you select the text inside the child elements of the p element (text that sits directly under <p>, outside any child element, is not included):
//b[text()="Type of document:"]/parent::p/*/text()
I don't like using DOMDocument unless I need to parse a document heavily, but if you want to do so, it could be something like:
//Using DomDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXpath($doc);
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';
foreach($matchedDoms as $domMatch) {
$data .= $domMatch->data . ' ';
}
var_dump($data);
I would prefer a simple regex to do it all; after all, it's just one piece of the document you are looking for:
//Using a Regular Expression (lazy .*? so the capture stops at the first closing </p>)
preg_match('/<p>.*?<b>Type of document:<\/b>.*?Contract Notice(?<data>.*?)<\/p>/si', $content, $matches);
var_dump($matches['data']); //If you want everything in there
var_dump(strip_tags($matches['data'])); //If you just want the text
I can't quite figure it out: I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so <a> gets style="xxxx:yyyy;" added. How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times. Don't use regexes for HTML parsing.
$dom = new DOMDocument();
// @ suppresses warnings about malformed markup
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//a") as $node)
{
$node->setAttribute("style","xxxx");
}
$newHtml = $dom->saveHTML();
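For the "any attribute on any tag" part of the question, the same idea could be wrapped in a small helper. This is only a sketch: addAttribute() is a hypothetical name, and it returns the whole document that DOMDocument builds around the fragment, including the doctype and html/body wrappers.

```php
<?php
// Hypothetical helper: set $name="$value" on every <$tag> element in $html.
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName($tag) as $node) {
        $node->setAttribute($name, $value);
    }

    return $dom->saveHTML();
}

$out = addAttribute('<p>See <a href="#">this</a></p>', 'a', 'style', 'color:red;');
echo $out;
```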
Here is a version using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
Note, however, that regex cannot reliably handle malformed HTML documents.
How can I replace this <p><span class="headline"> with this <p class="headline"><span>
in the easiest way with PHP?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the span to the <p> tag.
Don't parse HTML with regex!
This class should provide what you need:
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is if you can't guarantee the format. If you already know the format of the string, you don't have to worry about having a complete parser.
In your case, if you know that's the format, you can use str_replace
str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
$link->parentNode->replaceChild(
$dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
$node->removeAttribute('class');
$node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a side note, HTML has elements for headings, so why not use an <h*> element instead of the semantically superfluous "headline" class?
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other:
str_replace("part to replace", "replacement", $string);
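Note that str_replace() takes the search string first, then the replacement, then the subject. A minimal sketch with a made-up $data string:

```php
<?php
// Hypothetical input in the format from the question.
$data = '<p><span class="headline">News</span></p>';

// Argument order: search, replace, subject.
$result = str_replace(
    '<p><span class="headline">',
    '<p class="headline"><span>',
    $data
);

echo $result, "\n";
```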