Traverse DOM find id backwards


I can't figure out how to solve this.
<div>
<p id="p1"> Price is <span>$ 25</span></p>
<p id='p2'> But this price is $ <span id="s1">50,23</span> </p>
<p id='p3'> This one : $ 14540.12 dollar</p>
</div>
What I'm trying to do is find each element with a price in it and the shortest path to it.
This is what I have so far.
$elements = $dom->getElementsByTagName('*');
foreach($elements as $child)
{
if (preg_match("/.$regex./",$child->nodeValue)){
echo $child->getNodePath(). "<br />";
}
}
This results in
/html
/html/body
/html/body/div
/html/body/div/p[1]
/html/body/div/p[1]/span
/html/body/div/p[2]
/html/body/div/p[2]/span
/html/body/div/p[3]
These are the paths to the elements I want, so that's OK for this test HTML. But in real web pages these paths get very long and are error-prone.
What I'd like to do is find the closest element with an ID attribute and refer to that.
So once I have found an element that matches the $regex, I need to travel up the DOM, find the first element with an ID attribute, and create a new, shorter path from that.
In the HTML example above, there are 3 prices matching the $regex. The prices are in:
//p[@id="p1"]/span
//span[@id="s1"]
//p[@id="p3"]
So that is what I'd like my function to return. That means I also need to get rid of all the other paths, because the elements they point to don't themselves contain a match for $regex.
Any help on this?

You could use XPath to follow the ancestor axis to the first node that has an @id attribute and then cut its path off. I have not cleaned up the code, but something like this:
// snip
$xpath = new DomXPath($doc);
foreach($elements as $child)
{
$textValue = '';
foreach ($xpath->query('text()', $child) as $text)
$textValue .= $text->nodeValue;
if (preg_match("/.$regex./", $textValue)) {
$path = $child->getNodePath();
$id = $xpath->query('ancestor-or-self::*[@id][1]', $child)->item(0);
$idpath = '';
if ($id) {
$idpath = $id->getNodePath();
$path = '//'.$id->nodeName.'[@id="'.$id->attributes->getNamedItem('id')->value.'"]'.substr($path, strlen($idpath));
}
echo $path."\n";
}
}
Printing something like
/html
/html/body
/html/body/div
//p[@id="p1"]
//p[@id="p1"]/span
//p[@id="p2"]
//span[@id="s1"]
//p[@id="p3"]
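The same idea can be wrapped in a function that returns only the id-anchored paths. This is a minimal sketch, not a definitive implementation: the function name shortPathsToMatches and the price pattern are assumptions made for illustration, since the question never shows the actual $regex. It checks only the text directly inside each element, so ancestors such as html or body are not reported.

```php
<?php
// Hypothetical helper: returns id-anchored XPath expressions for every
// element whose own text nodes match $pattern.
function shortPathsToMatches(DOMDocument $doc, string $pattern): array
{
    $xpath = new DOMXPath($doc);
    $paths = [];
    foreach ($doc->getElementsByTagName('*') as $element) {
        // Concatenate only the text directly inside this element.
        $ownText = '';
        foreach ($xpath->query('text()', $element) as $textNode) {
            $ownText .= $textNode->nodeValue;
        }
        if (!preg_match($pattern, $ownText)) {
            continue;
        }
        $path = $element->getNodePath();
        // ancestor-or-self is a reverse axis, so [1] is the nearest node with an id.
        $anchor = $xpath->query('ancestor-or-self::*[@id][1]', $element)->item(0);
        if ($anchor !== null) {
            // Replace the absolute prefix up to the anchor with an id lookup.
            $path = '//' . $anchor->nodeName . '[@id="' . $anchor->getAttribute('id') . '"]'
                . substr($path, strlen($anchor->getNodePath()));
        }
        $paths[] = $path;
    }
    return $paths;
}

$html = <<<'HTML'
<div>
<p id="p1"> Price is <span>$ 25</span></p>
<p id='p2'> But this price is $ <span id="s1">50,23</span> </p>
<p id='p3'> This one : $ 14540.12 dollar</p>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
// Assumed pattern: a number with an optional decimal part.
$paths = shortPathsToMatches($doc, '/\d+(?:[.,]\d+)?/');
print_r($paths);
```

With the sample HTML from the question, this should yield the three paths the question asks for. If no ancestor carries an id, the full path from getNodePath() is kept unchanged.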

Related

How to split XML child into multiple children based on a delimiter using PHP

I have an XML child element which contains a bunch of URLs delimited by '|',
i.e.
<ImageURL>https://example.com/example.jpg|https://example.com/example2.jpg</ImageURL>
I'm trying to write a PHP function which will take each URL and split it into its own child element. So it should look like below:
<ImageURL>
<Image1>https://example.com/example.jpg</Image1>
<Image2>https://example.com/example2.jpg</Image2>
..etc
</ImageURL>
The data is passed into the function as $value which is the contents of that ImageURL. Not entirely sure where to start though. Any assistance would be appreciated!
function split_images($value){
...
}

It's a bit messy but it works.
<?php
$xml = '<ImageURL>https://example.com/example.jpg|https://example.com/example2.jpg</ImageURL>';
// strip_tags() removes the parent <ImageURL> tags, leaving only the '|'-delimited URLs, which explode() turns into an array
$childUrl = explode('|', strip_tags($xml));
// var_dump() confirms that only the image URLs remain
var_dump($childUrl); // array(2) with 'https://example.com/example.jpg' and 'https://example.com/example2.jpg'
// start the output with the opening parent tag <ImageURL>
$imgUrl = '<ImageURL>';
foreach($childUrl as $key => $value){
// wrap each URL in its own numbered <ImageN> element
$imgUrl .= '<Image' . ($key + 1) . '>' . $value . '</Image' . ($key + 1) . '>';
}
// close the parent tag
$imgUrl .= '</ImageURL>';
// var_dump() again to check the result
var_dump($imgUrl); // the string "<ImageURL><Image1>https://example.com/example.jpg</Image1><Image2>https://example.com/example2.jpg</Image2></ImageURL>"
As you can see, the result is a string; you can convert it into JSON or any other format you need.
If it is not what you are looking for, hopefully it will give you some general idea.

Not that difficult with DOM:
$xml = <<<'XML'
<ImageURL>https://example.com/example.jpg|https://example.com/example2.jpg</ImageURL>
XML;
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
// iterate the ImageURL elements
foreach ($xpath->evaluate('//ImageURL') as $imageURL) {
// read the content and separate the URLs
$urls = explode('|', $imageURL->textContent);
// remove the content from the node
$imageURL->textContent = '';
// iterate the urls
foreach ($urls as $index => $url) {
// add a new Image node and set the url as content
$imageURL
->appendChild($document->createElement('Image'))
->textContent = $url;
}
}
$document->formatOutput = TRUE;
echo $document->saveXML();
Numbered (and dynamic) XML element names are an anti-pattern. You should not do this; however, you could append the $index to the element name.
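If numbered element names are required anyway, the index can be appended when each element is created. The following is a minimal sketch using the sample data from the question; the one-based numbering (Image1, Image2) follows the output the question asks for, and the variable names are illustrative only.

```php
<?php
$xml = '<ImageURL>https://example.com/example.jpg|https://example.com/example2.jpg</ImageURL>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

foreach ($xpath->evaluate('//ImageURL') as $imageURL) {
    // read the content and separate the URLs
    $urls = explode('|', $imageURL->textContent);
    // remove the existing text content from the node
    while ($imageURL->firstChild) {
        $imageURL->removeChild($imageURL->firstChild);
    }
    foreach ($urls as $index => $url) {
        // build the numbered element name: Image1, Image2, ...
        $image = $document->createElement('Image' . ($index + 1));
        // a text node escapes special characters such as & in URLs
        $image->appendChild($document->createTextNode($url));
        $imageURL->appendChild($image);
    }
}

// serialize only the root element, without the XML declaration
$result = $document->saveXML($document->documentElement);
echo $result, "\n";
```

Using createTextNode() rather than passing the URL to createElement() avoids problems with URLs that contain an ampersand.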

PHP - preg_replace - html tags and attributes

I'm trying to allow some tags and attributes using an array, and remove the rest.
Here is my example:
$allowed=array("img", "p", "style");
$text='<img src="image.gif" onerror="myFunction()" style="background:red" onclick="myFunction()">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text.
In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
<script>
function myFunction() {
alert(\'The image could not be loaded.\');
}
</script>';
Using $text = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
I could remove the script tag with its content, but I need to remove everything that is not in the $allowed array.

I would suggest using DOMDocument for better readability if you are mixing scripts with HTML like this. Keep an eye on performance if it matters.
http://php.net/manual/en/class.domdocument.php

This function should do what you want. Given a DOMDocument ($doc) and a node ($node) to search from, it recursively iterates over the children of that node, removing any tags that are not in the $allowed_tags array, and, for those tags that are kept, removing any attributes not in the $allowed_attributes array:
function remove_nodes_and_attributes($doc, $node, $allowed_tags, $allowed_attributes) {
$xpath = new DOMXPath($doc);
foreach ($xpath->query('./*', $node) as $child) {
if (!in_array($child->nodeName, $allowed_tags)) {
$node->removeChild($child);
continue;
}
$a = 0;
while ($a < $child->attributes->length) {
$attribute = $child->attributes->item($a)->name;
if (!in_array($attribute, $allowed_attributes)) {
$child->removeAttribute($attribute);
// don't increment the pointer as the list will shift with the removal of the attribute
}
else {
// allowed attribute, skip it
$a++;
}
}
// remove any children as necessary
remove_nodes_and_attributes($doc, $child, $allowed_tags, $allowed_attributes);
}
}
You would use this function like this. Note it is necessary to wrap the HTML in a top-level element which is then stripped off again at the end using substr.
// allow-lists inferred from the sample output below
$allowed_tags = array("img", "p");
$allowed_attributes = array("style");
$doc = new DOMDocument();
$doc->loadHTML("<html>$text</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html = $doc->getElementsByTagName('html')[0];
remove_nodes_and_attributes($doc, $html, $allowed_tags, $allowed_attributes);
echo substr($doc->saveHTML(), 6, -8);
Output (for your sample data):
<img style="background:red">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text. In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>

Using DOMDocument is generally the best way to manipulate HTML, because it understands the structure of the document.
In this solution I use XPath to find any nodes which are not in the allowed list. The XPath expression will look something like...
//body//*[not(name() = "img" or name() = "p" or name() = "style")]
This looks for any element inside the <body> tag (loadHTML will automatically add this tag for you) whose name isn't in the list of allowed tags. The XPath is built dynamically from the $allowed list, so you just change the list of tags to update it...
$allowed=array("img", "p", "style");
$text='<img src="image.gif" onerror="myFunction()" style="background:red" onclick="myFunction()">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text.
In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
<script>
function myFunction() {
alert(\'The image could not be loaded.\');
}
</script>';
$doc = new DOMDocument();
$doc->loadHTML($text);
$xp = new DOMXPath($doc);
$find = '//body//*[not(name() = "'.implode ('" or name() = "', $allowed ).
'")]';
echo "XPath = ".$find.PHP_EOL;
$toRemove = $xp->evaluate($find);
print_r($toRemove);
foreach ( $toRemove as $remove ) {
$remove->parentNode->removeChild($remove);
}
// recreate HTML
$outHTML = "";
foreach ( $doc->getElementsByTagName("body")[0]->childNodes as $tag ) {
$outHTML.= $doc->saveHTML($tag);
}
echo $outHTML;
If you also want to remove attributes, you can do the same process using @* as part of the XPath expression...
$allowedAttribs = array();
$find = '//body//@*[not(name() = "'.implode ('" or name() = "', $allowedAttribs ).
'")]';
$toRemove = $xp->evaluate($find);
foreach ( $toRemove as $remove ) {
$remove->parentNode->removeAttribute($remove->nodeName);
}
It would be possible to combine these two, but it makes the code less legible (IMHO).
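For reference, a combined version might look like the sketch below. It is only one possible way to merge the two steps: the function name filterHtml and the sample input are made up for illustration, and the two XPath expressions are joined with the union operator (|) so that disallowed elements and disallowed attributes are collected in a single query.

```php
<?php
// Hypothetical helper: removes every element and attribute that is not
// on the given allow-lists and returns the remaining body HTML.
function filterHtml(string $html, array $allowedTags, array $allowedAttribs): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xp = new DOMXPath($doc);

    // Build 'name() = "a" or name() = "b"' tests from the allow-lists.
    $tagTest  = 'name() = "' . implode('" or name() = "', $allowedTags) . '"';
    $attrTest = 'name() = "' . implode('" or name() = "', $allowedAttribs) . '"';
    $query = '//body//*[not(' . $tagTest . ')] | //body//@*[not(' . $attrTest . ')]';

    // Copy the result into an array before modifying the document.
    foreach (iterator_to_array($xp->query($query)) as $node) {
        if ($node instanceof DOMAttr) {
            $node->ownerElement->removeAttributeNode($node);
        } elseif ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Recreate the HTML from the children of <body>.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = filterHtml(
    '<p class="x" onclick="evil()">Hi <b>there</b></p><script>alert(1)</script>',
    ['p'],
    ['class']
);
echo $clean, "\n";
```

Note that a disallowed element is removed together with its content, which matches the behaviour of the separate loops above.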

How to get ID using a specific word in regex?

My string:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>
Objective: In the string above, [label*string] and [ref*string] are cross-references. I need to replace each [ref*string] with an <a> element that has class and href attributes: href is the id of the div where the related label* resides, and class is based on the class of that div.
As mentioned above, the class and href of the <a> element come from the class name and ID of the related div. But if a div with class="metadata" exists, it should be ignored: its class name and ID must not be used.
Expected output:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>
</div>
How can I do this in a simpler way, without using a DOM parser?
My idea is to store each label* string and its ID in an array, then loop over the ref* strings; when a ref* string matches a label* string, the related id and class should be inserted in place of the ref* string.
So I have tried this regex to get the label* string and its related id and class name.

This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regular expressions are then used in a second step to extract information from text nodes or attributes:
$classRel = ['sect2' => 'section-ref',
'figure' => 'fig-ref'];
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");
function hasClass($classNode, $className) {
if (!empty($classNode))
return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
return false;
}
$xp->registerPHPFunctions('hasClass');
// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.
$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;
$idNodeList = $xp->query($labelQuery);
$links = [];
// For each div node, a new link node is created in the associative array $links.
// The keys are labels.
foreach($idNodeList as $divNode) {
// The pattern extracts the first text part in group 1 and the label in group 2
if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
$links[$m[2]] = $dom->createElement('a');
$links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
$links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
$links[$m[2]]->nodeValue = $m[1];
}
}
if ($links) { // if $links is empty no need to do anything
$refNodeList = $xp->query("//text()[contains(., '[ref*')]");
foreach ($refNodeList as $refNode) {
// split the text with square brackets parts, the reference name is preserved in a capture
$parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
// create a fragment to receive text parts and links
$frag = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k%2 && isset($links[$part])) { // delimiters are always odd items
$clone = $links[$part]->cloneNode(true);
$frag->appendChild($clone);
} elseif ($part !== '') {
$frag->appendChild($dom->createTextNode($part));
}
}
$refNode->parentNode->replaceChild($frag, $refNode);
}
}
$result = '';
$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
foreach ($childNodes as $childNode) {
$result .= $dom->saveXML($childNode);
}
echo $result;

This is not a task for regular expressions. Regular expressions are (usually) for regular languages, and what you want to do is work on a context-sensitive language (referencing an identifier which has been declared before).
So you should definitely go with a DOM parser. The algorithm for this would be very easy, because you can operate on one node and its children.
So the theoretical answer to your question is: you can't. Though it might work out with the many regex extensions in some crappy way.

Using PHP to get DOM Element

I'm struggling big time understanding how to use the DOMElement object in PHP. I found this code, but I'm not really sure it's applicable to me:
$dom = new DOMDocument();
$dom->loadHTML("index.php");
$div = $dom->getElementsByTagName('div');
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
Basically what I need is to search the DOM for an element with a particular id, after which point I need to extract a non-standard attribute (i.e. one that I made up and put on with JS) so I can see the value of that. The reason is I need one piece from the $_GET and one piece that is in the HTML based from a redirect. If someone could just explain how I use DOMDocument for this purpose, that would be helpful. I'm really struggling understanding what's going on and how to properly implement it, because I clearly am not doing it right.
EDIT (Where I'm at based on comment):
This is my code lines 4-26 for reference:
<div id="column_profile">
<?php
require_once($_SERVER["DOCUMENT_ROOT"] . "/peripheral/profile.php");
$searchResults = isset($_GET["s"]) ? performSearch($_GET["s"]) : "";
$dom = new DOMDocument();
$dom->load("index.php");
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
$div = $dom->getElementById('currentLocation');
$attr = $div->getAttribute('srckey');
echo "<h1>{$attr}</a>";
?>
</div>
<div id="column_main">
Here is the error message I'm getting:
Warning: DOMDocument::load() [domdocument.load]: Extra content at the end of the document in ../public_html/index.php, line: 26 in ../public_html/index.php on line 10
Fatal error: Call to a member function getAttribute() on a non-object in ../public_html/index.php on line 21

getElementsByTagName returns you a list of elements, so first you need to loop through the elements, then through their attributes.
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
In your case, you said you needed a specific ID. Those are supposed to be unique, so to do that, you can use (note getElementById might not work unless you call $dom->validate() first):
$div = $dom->getElementById('divID');
Then to get your attribute:
$attr = $div->getAttribute('customAttr');
EDIT: $dom->load (like $dom->loadHTMLFile) just reads the contents of the file; it doesn't execute them, so index.php won't be run this way. You might have to do something like:
$dom->loadHTML(file_get_contents('http://localhost/index.php'));
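Putting the pieces together, reading a custom attribute from an element with a known id could look like the sketch below. The markup in $html is an assumption that stands in for the rendered output of index.php; only the id currentLocation and the attribute srckey are taken from the question.

```php
<?php
// Assumed markup: stands in for the rendered output of index.php.
$html = '<div id="column_profile"><span id="currentLocation" srckey="abc123">Here</span></div>';

$dom = new DOMDocument();
// collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// look up the element by id, then read the custom attribute
$element = $dom->getElementById('currentLocation');
$srcKey = $element !== null ? $element->getAttribute('srckey') : null;

echo $srcKey, "\n";
```

Checking for null before calling getAttribute() avoids the "Call to a member function getAttribute() on a non-object" fatal error when the id is not found.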

You won't have access to the HTML if the redirect is from an external server. Let me put it this way: the DOM does not exist at the point you are trying to parse it. What you can do is pass the text to a DOM parser and then manipulate the elements that way. Or the better way would be to add it as another GET variable.
EDIT: Are you also aware that the client can change the HTML and have it pass whatever they want? (Using a tool like Firebug)

Regular Expressions, avoiding HTML tags in PHP

I have actually seen this question quite a bit here, but none of the answers are exactly what I want... Let's say I have the following phrase:
Line 1 - This is a TEST phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a TEST link.
Okay, simple right? I am trying the following code:
$linkPin = '#(\b)TEST(\b)(?![^<]*>)#i';
$linkRpl = '$1<a href="newurl">TEST</a>$2';
$html = preg_replace($linkPin, $linkRpl, $html);
As you can see, it takes the word TEST and replaces it with a link to test. The regular expression I am using right now works well to avoid replacing the TEST in line 2, and it also avoids replacing the TEST in the href of line 3. However, it still replaces the text enclosed within the tag on line 3, and I end up with:
Line 1 - This is a TEST phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="newurl">TEST</a> link.
This I do not want as it creates bad code in line 3. I want to not only ignore matches inside of a tag, but also encapsulated by them. (remember to keep note of the /> in line 2)

Honestly, I'd do this with DomDocument and Xpath:
// First, wrap the text in a minimal HTML document.
$html = '<html><body><div id="#content">'.$text.'</div></body></html>';
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
// Select every element other than <a> whose content contains TEST
$query = '//*[not(name() = "a") and contains(., "TEST")]';
// Copy the result to an array so replacing nodes does not disturb the iteration
$nodes = iterator_to_array($xpath->query($query));
$replaceNode = function ($node) {
// Escape the text first, because appendXML() expects well-formed XML
$text = htmlspecialchars($node->nodeValue);
$text = str_replace('TEST', '<a href="newurl">TEST</a>', $text);
$fragment = $node->ownerDocument->createDocumentFragment();
$fragment->appendXML($text);
$node->parentNode->replaceChild($fragment, $node);
};
foreach ($nodes as $node) {
// The child list is live, so take a snapshot before replacing anything
foreach (iterator_to_array($node->childNodes) as $child) {
// Only text nodes directly inside the matching element are replaced
if ($child instanceof DomText && strpos($child->nodeValue, 'TEST') !== false) {
$replaceNode($child);
}
}
}
This should work for you, since it ignores all text inside of <a> elements and only replaces the text directly inside of the matching tags.

Okay... I think I came up with a better solution...
$noMatch = '(</a>|</h\d+>)';
$linkUrl = 'http://www.test.com/test/'.$link['page_slug'];
$linkPin = '#(?!(?:[^<]+>|[^>]+'.$noMatch.'))\b'.preg_quote($link['page_name']).'\b#i';
$linkRpl = '<a href="'.$linkUrl.'">'.$link['page_name'].'</a>';
$page['HTML'] = preg_replace($linkPin, $linkRpl, $page['HTML']);
With this code, it won't process any text within <a> tags and <h#> tags. I figure, any new exclusions I want to add, simply need to be added to $noMatch.
Am I wrong in this method?
