Why does this DOMXPath query merge sibling nodes' values?

Given the following code:
$html = "<h1>foo</h1><h2>bar</h2>";
$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($document);
$h1Nodes = $xpath->query('//h1');
foreach ($h1Nodes as $h1Node) {
    var_dump($h1Node->nodeValue);
}
The h1 tag contains only a text node with the text 'foo'. The text 'bar' is in a sibling heading node (h2). I would expect the output to be 'foo'.
However, the output is 'foobar'.
Why?

Thank you for your comment, hardik solanki.
It led me to the answer: valid markup must have a root element.
The markup I provided doesn't have one, and the flags I used prevent the library from adding one implicitly. So the first tag is treated as the root element, and the result is a bit confusing.
Dropping those flags fixes this issue, but I am using them for a purpose: I just want to manipulate a snippet of HTML, not a whole document, and get this snippet back (after transformations) by calling DOMDocument::saveHTML(), without doctype/<html>/<body> tags.
I ended up doing this:
1. add doctype/<html>/<body> tags to the HTML snippet I want to manipulate, so that I temporarily have a valid document
2. load it with DOMDocument
3. transform it the way I need
4. save it with DOMDocument::saveHTML()
5. get rid of the excess doctype/<html>/<body> markup
It works.
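The steps above could be sketched roughly as follows. This is only an illustration, not the author's actual code: the helper name transformSnippet and the uppercase transformation are made up for the example, and instead of stripping the wrapper from the saved string it serializes the children of <body> one by one with saveHTML($node), which is a swapped-in variation on step 5.

```php
<?php
// Hypothetical helper (not from the original answer): run a transformation
// on an HTML fragment by temporarily wrapping it in a full document.
function transformSnippet(string $snippet, callable $transform): string
{
    $document = new DOMDocument();

    // Step 1: wrap the fragment so the parser sees a single root element.
    $wrapped = '<!DOCTYPE html><html><body>' . $snippet . '</body></html>';

    // Step 2: load it, collecting parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: apply the caller's transformation.
    $transform($document);

    // Steps 4 and 5: serialize only the children of <body>,
    // which leaves the temporary wrapper out of the result.
    $body = $document->getElementsByTagName('body')->item(0);
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $document->saveHTML($child);
    }
    return $result;
}

// Example transformation (an assumption for the demo): uppercase every <h1>.
$output = transformSnippet('<h1>foo</h1><h2>bar</h2>', function (DOMDocument $doc): void {
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//h1') as $h1) {
        $h1->nodeValue = strtoupper($h1->nodeValue);
    }
});

echo $output, "\n";
```

Because the wrapper supplies a real root, the h1 and h2 stay siblings inside <body>, so the query on //h1 only sees the text of the h1 itself.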

Related

How to remove HTML tags as well as HTML content within a string in PHP?

I have a .txt file. Using the following code I read it:
while (!feof($handle)) {
    yield trim(utf8_encode(fgets($handle)));
}
Now from the retrieved string I want to remove not only the HTML tags but also the HTML content inside. Found many solutions to remove the tags but not both - tags + content.
Sample string - Hey my name is <b>John</b>. I am a <i>coder</i>!
Required output string - Hey my name is . I am a !
How can I achieve this?

One way to achieve this is by using DOMDocument and DOMXPath. My solution assumes that the provided HTML string has no container node or that the container node contents are not meant to be stripped (as this would result in a completely empty string).
$string = 'Hey my name is <b>John</b>. I am a <i>coder</i>!';
// create a DOMDocument (an XML/HTML parser)
$dom = new DOMDocument('1.0', 'UTF-8');
// load the HTML string without adding a <!DOCTYPE ...> and <html><body> tags
// and with error/warning reports turned off
// if loading fails, there's something seriously wrong with the HTML
if ($dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING)) {
    // create a DOMXPath instance for the loaded document
    $xpath = new DOMXPath($dom);
    // remember the root node; DOMDocument automatically adds a <p> container if one is not present
    $rootNode = $dom->documentElement;
    // fetch all descendant elements (children, grandchildren, etc.) of the root node;
    // the leading "." makes the query relative to $rootNode instead of the whole document
    $childNodes = $xpath->query('.//*', $rootNode);
    // with each of these descendants...
    foreach ($childNodes as $childNode) {
        // ...remove them from their parent node
        $childNode->parentNode->removeChild($childNode);
    }
    // echo the remaining text
    echo $rootNode->nodeValue . "\n";
}
If you do want to strip a potential container node, then it's going to be a bit harder, because it's difficult to differentiate between an original container node and a container node that's automatically added by DOMDocument.
Also, if an unintended non-closing tag is found, it can lead to unexpected results, as it will strip everything until the next closing tag, because DOMDocument will automatically add a closing tag for invalid non-closing tags.

How can I remove all <span> tags and their respective content, including other nested elements?

I've tried a few solutions, which only remove the tags themselves, leaving the content and any other nested elements.
Regular Expression,
preg_replace('/<span\b[^>]*>(.*?)<\/span>/ig', '', $page->body);
Tried using HTML purifier also,
$purifier->set('Core.HiddenElements', array('span'));
$purifier->set('HTML.ForbiddenElements', array('span'));

Depending on your actual strings and the things you tried you could use a regular expression (assuming your span tags are only span tags).
A more "appropriate" solution, however, would be to use an HTML parser like DOMDocument.
You can use the method getElementsByTagName('span') to get all the span elements and remove them from the document object.
Then use saveHTML to get the HTML code back.
You will get something like this:
$doc = new DOMDocument;
// $yourpage is assumed to hold the HTML as a string
$doc->loadHTML($yourpage);
// copy the live node list into an array first: removing nodes while
// iterating the live list would skip elements
$spans = iterator_to_array($doc->getElementsByTagName('span'));
foreach ($spans as $span) {
    // remove each span (and everything nested in it) from its own parent
    $span->parentNode->removeChild($span);
}
echo $doc->saveHTML();

localizing an html document (hind sight)

I am building a web application in PHP, which I have decided (far along in the process) to make available in different languages.
My question is this:
I do not want to wade through all the HTML code in the template files to look for the "words" that I need to replace with dynamically generated lang variables.
Is there a tool that can highlight the "words" used in the HTML to make my task easier,
so that when I scroll down the HTML doc, I can easily see where the language "words" are?
Normally when I create an app, I add comments as I code, like below
<label><!--lang-->Full Name</label>
<input type="submit" value="<!--lang-->Save Changes" name="submit">
so that when I am done, I can run through and easily identify the bits I need to add dynamic variables to. Unfortunately I am almost through with the app (lots of HTML template files) and I had not done so.
I use a template engine (tinybutstrong) so my HTML is pretty clean (i.e. with no PHP in it)

You can do this, relatively easily even, using DOMDocument to parse the markup, DOMXPath to query for all the comment nodes, and then access each node's parent, extract the nodeValue and list those values as "strings to translate":
$dom = new DOMDocument;
$dom->loadHTMLFile($file);//or loadHTML in case you're working with HTML strings
$xpath = new DOMXPath($dom);//get XPath
$comments = $xpath->query('//comment()');//get all comment nodes
//this array will contain all to-translate texts
$toTranslate = array();
foreach ($comments as $comment)
{
    if (trim($comment->nodeValue) == 'lang')
    {//trim, avoid spaces, use stristr !== false if you need case-insensitive matching
        $parent = $comment->parentNode;//get parent node
        $toTranslate[] = $parent->textContent;//get parent node's text content
    }
}
var_dump($toTranslate);
Note that this can't handle comments used in tag attributes. Using this simple script, you will be able to extract those strings that need to be translated in the "regular" markup. After that, you can write a script that looks for <!--lang--> in tag attributes... I'll have a look if there isn't a way to do this using XPath, too. For now, this should help you to get started, though.
If you have no comments other than <!--lang--> in your markup, then you could simply use an XPath expression that selects the parents of those comment nodes directly, together with the inputs and options that have a value attribute:
$commentsAndInput = $xpath->query('(//input|//option)[@value]|//comment()/..');
foreach ($commentsAndInput as $node)
{
    if ($node->tagName !== 'input' && $node->tagName !== 'option')
    {//get the textContent of the node
        $toTranslate[] = $node->textContent;
    }
    else
    {//get value attribute's value:
        $toTranslate[] = $node->getAttributeNode('value')->value;
    }
}
The xpath expression explained:
//: tells xpath to search for nodes that match the rest of the criteria anywhere in the DOM
input: literal tag name: //input looks for input tags anywhere in the DOM tree
[@value]: the mentioned tag only matches if it has a value attribute
|: OR. //a|//input[@type="button"] matches links OR buttons
//option[@value]: same as above: options with value attributes are matched
(//input|//option): groups both expressions, the [@value] applies to all matches in this selection
//comment(): selects comments anywhere in the dom
/..: selects the parent of the current node, so //comment()/.. matches the parent, containing the selected comment node.
Keep working at the XPath expression to get all of the content you need to translate
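As a rough, self-contained illustration of the combined expression, the sketch below runs it against a tiny form. The markup is invented for the demo (with the <!--lang--> comment placed only in the label, since comments inside attributes are not handled), so treat it as an assumption rather than the asker's real template.

```php
<?php
// Sample markup, made up for this demo.
$markup = '<form><label><!--lang-->Full Name</label>'
        . '<input type="submit" value="Save Changes" name="submit"></form>';

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

$toTranslate = [];
// Inputs/options with a value attribute, plus the parent of every comment node.
foreach ($xpath->query('(//input|//option)[@value]|//comment()/..') as $node) {
    if ($node->nodeName === 'input' || $node->nodeName === 'option') {
        // For form controls, the translatable string is the value attribute.
        $toTranslate[] = $node->getAttribute('value');
    } else {
        // For everything else, take the text content of the comment's parent.
        $toTranslate[] = trim($node->textContent);
    }
}

print_r($toTranslate);
```

The comment node itself does not contribute to textContent, so the label yields only its visible text.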

php regex selecting url from html source

I'm new to stackoverflow and from South Korea.
I'm having difficulties with regex with php.
I want to select all the URLs from user-submitted HTML source.
The restrictions I want to apply are the following.
Select URLs EXCEPT:
1. URLs that are within tags. For example, if the HTML source is like below,
<a href="http://aaa.com">http://aaa.com</a>
neither http://aaa.com should be selected.
2. URLs right after " or =
Here is my current regex stage.
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
but with this regex, I can't achieve the first rule.
I tried to add negative lookahead at the end of my current regex like
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still chooses the second url in the a tag like below.
http://aaa.co
We don't have a developers' Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!

Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error prone, more work, and slower.
The DOM works just like in the browser and you can use getElementsByTagName to get all links.
I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
    var_dump($link->getAttribute('href'));
    // Output: http://aaa.com
}

Don't use Regex. Use DOM
$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        echo $a->getAttribute('href');
    }
    //$a->nodeValue; // If you want the text in <a> tag
}

Seeing as you're not trying to extract urls that are the href attribute of an a node, you'll want to start by getting the actual text content of the dom. This can be easily done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')[0];//get outer tag, in case of a full dom, this is body
$text = $root->textContent;//no tags, no attributes, no nothing.
An alternative approach would be this:
$text = strip_tags($htmlString);//gets rid of markup.
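To also honor the asker's first rule (skip URLs inside a tags), one possible sketch is to drop the a elements before reading the text content, and only then run a URL pattern over what is left. The sample string and the simplified pattern below are assumptions for the demo, not part of the answers above.

```php
<?php
// Sample input, invented for the demo: one bare URL and one linked URL.
$htmlString = '<p>Visit http://example.org/page today. '
            . '<a href="http://aaa.com">http://aaa.com</a></p>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($htmlString);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Copy the live node list to an array first, because removing nodes
// while iterating the live list would skip elements.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    $a->parentNode->removeChild($a);
}

// What remains is plain text: no tags, no attributes, no link text.
$text = $dom->getElementsByTagName('body')->item(0)->textContent;

// Simplified URL pattern based on the asker's regex; the lookbehind for
// quotes and "=" is no longer needed once attributes are gone.
preg_match_all('/https?:\/\/[^"\s<>]+/i', $text, $matches);
$urls = $matches[0];

print_r($urls);
```

Only the bare URL in the paragraph text survives; both occurrences of http://aaa.com disappear with the a element.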

PHP XML tags nested in continuing text -> simpleXML

i am working on XML with mostly unknown content.
I am converting it to a very rough HTML output.
but i struggle with this structure in the XML:
<wrappingTag>
text text text
<formatTag>formatted text</formatTag>
continued text text text text
<formatTag2>much more formatted text</formatTag2>
continued text text text text
</wrappingTag>
as i use the simpleXML element to get the data, simpleXML returns all the normal text as the value of the "wrappingTag" but without the parts from the "formatTag" values. These come separately of course.
So putting the text together as it was before seems to be impossible to me.
is there an easy way to solve this in simplexml or do i have to parse that on my own?
thanx
alex

DOM does not suffer from that and you can convert them into each other.
$element = simplexml_load_string($xml);
$node = dom_import_simplexml($element);
var_dump($node->nodeValue);
DOMElement::$nodeValue is the text content from all descendant text nodes (including cdata).
Another possibility to get the text content from a node is DOMXPath::evaluate().
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump($xpath->evaluate('string(//wrappingTag[1])'));
Demo: https://eval.in/161109
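If the goal is the rough HTML output from the question rather than plain text, a possible sketch is to walk the child nodes of the imported DOM element in document order, so that text and format tags stay interleaved. The shortened sample XML and the span/class mapping are assumptions chosen for the illustration.

```php
<?php
// Shortened sample in the shape of the question's XML.
$xml = '<wrappingTag>text <formatTag>formatted</formatTag> continued</wrappingTag>';

$node = dom_import_simplexml(simplexml_load_string($xml));

$html = '';
foreach ($node->childNodes as $child) {
    if ($child instanceof DOMElement) {
        // Hypothetical mapping: each format tag becomes a <span> whose
        // class is the original tag name.
        $html .= '<span class="' . htmlspecialchars($child->tagName) . '">'
               . htmlspecialchars($child->textContent)
               . '</span>';
    } elseif ($child instanceof DOMText) {
        // Plain text (including CDATA) between the tags keeps its position.
        $html .= htmlspecialchars($child->nodeValue);
    }
}

echo $html, "\n";
```

This only handles one level of nesting; deeper structures would need the same loop applied recursively.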
