Can I find exact match using phpQuery?

Can I find exact match using phpQuery? - php

I made a script using phpQuery. The script finds td's that contain a certain string:
$match = $dom->find('td:contains("'.$clubname.'")');
It worked good until now. Because one clubname is for example Austria Lustenau and the second club is Lustenau. It will select both clubs, but it only should select Lustenau (the second result), so I need to find a td containing an exact match.
I know that phpQuery are using jQuery selectors, but not all. Is there a way to find an exact match using phpQuery?

Update: It is possible, see the answer from # pguardiario
Original answer. (at least an alternative):
No, unfortunately it is not possible with phpQuery. But it can be done easily with XPath.
Imagine you have to following HTML:
$html = <<<EOF
<html>
<head><title>test</title></head>
<body>
<table>
<tr>
<td>Hello</td>
<td>Hello World</td>
</tr>
</table>
</body>
</html>
EOF;
Use the following code to find exact matches with DOMXPath:
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// create xpath selector
$selector = new DOMXPath($document);
// selects all td node which's content is 'Hello'
$results = $selector->query('//td[text()="Hello"]');
// output the results
foreach($results as $node) {
$node->nodeValue . PHP_EOL;
}
However, if you really need a phpQuery solution, use something like this:
require_once 'phpQuery/phpQuery.php';
// the search string
$needle = 'Hello';
// create phpQuery document
$document = phpQuery::newDocument($html);
// get matches as you suggested
$matches = $document->find('td:contains("' . $needle . '")');
// empty array for final results
$results = array();
// iterate through matches and check if the search string
// is the same as the node value
foreach($matches as $node) {
if($node->nodeValue === $needle) {
// put to results
$results []= $node;
}
}
// output results
foreach($results as $result) {
echo $node->nodeValue . '<br/>';
}

It can be done with filter, but it's not pretty:
$dom->find('td')->filter(function($i, $node){return 'foo' == $node->nodeValue;});
But then, neither is switching back and forth between css and xpath

I've no experience of phpQuery but the jQuery would be something like this :
var clubname = 'whatever';
var $match = $("td").map(function(index, domElement) {
return ($(domElement).text() === clubname) ? domElement : null;
});
The phpQuery documentation indicates that ->map() is available and that it accepts a callback function in the same way as in jQuery.
I'm sure you will be able to perform the translation into phpQuery.
Edit
Here's my attempt based on 5 minutes' reading - probably rubbish but here goes :
$match = $dom->find("td")->map(function($index, $domElement) {
return (pq($domElement)->text() == $clubname) ? $domElement : null;
});
Edit 2
Here's a demo of the jQuery version.
If phpQuery does what it says in its documentation, then (correctly translated from javascript) it should match the required element in the same way.
Edit 3
After some more reading about the phpQuery callback system, the following code stands a better chance of workng :
function textFilter($i, $el, $text) {
return (pq($el)->text() == $text) ? $el : null;
}};
$match = $dom->find("td")->map('textFilter', new CallbackParam, new CallbackParam, $clubname);
Note that ->map() is preferred over ->filter() as ->map() supports a simpler way to define pararameter "places" (see Example 2 in the referenced page).

I know this question is old, but I've written a function based on answer from hek2mgl
<?php
// phpQuery contains function
/**
* phpQuery contains method, this will be able to locate nodes
* relative to the NEEDLE
* #param string element_pattern
* #param string needle
* #return array
*/
function contains($element_pattern, $needle) {
$needle = (string) $needle;
$needle = trim($needle);
$element_haystack_pattern = "{$element_pattern}:contains({$needle})";
$element_haystack_pattern = (string) $element_haystack_pattern;
$findResults = $this->find($element_haystack_pattern);
$possibleResults = array();
if($findResults && !empty($findResults)) {
foreach($findResults as $nodeIndex => $node) {
if($node->nodeValue !== $needle) {
continue;
}
$possibleResults[$nodeIndex] = $node;
}
}
return $possibleResults;
}
?>
Usage
<?php
$nodes = $document->contains("td.myClass", $clubname);
?>

Related

Parsing HTML Table Data from XML with PHP

I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz

The following code can extract you all the youtube links from each content.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The code outputs in the multidimensional format of array with the elements inside are entry_title, author, author_reddit_url and video_url. Hope it helps you!

If you're looking for a specific element you don't need to parse the whole thing. One way of doing it could be to use the DOMXPath class and query directly the xml. The documentation should guide you through.
http://php.net/manual/es/class.domxpath.php .

Remove HTML element from parsed HTML document on a condition

I've parsed a HTML document using Simple PHP HTML DOM Parser. In the parsed document there's a ul-tag with some li-tags in it. One of these li-tags contains one of those dreaded "Add This" buttons which I want to remove.
To make this worse, the list item has no class or id, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it with the parser.
What I want to do is to search for the string 'addthis.com' in all li-elements and remove any element that contains that string.
<ul>
<li>Foobar</li>
<li>addthis.com</li><!-- How do I remove this? -->
<li>Foobar</li>
</ul>
FYI: This is purley a hobby project in my quest to learn PHP and not a case of content theft for profit.
All suggestions are welcome!

Couldn't find a method to remove nodes explicitly, but can remove with setting outertext to empty.
$html = new simple_html_dom();
$html->load(file_get_contents("test.html"), false, false); // preserve formatting
foreach($html->find('ul li') as $element) {
if (count($element->find('a.addthis_button')) > 0) {
$element->outertext="";
}
}
echo $html;

Well what you can do is use jQuery after the parsing. Something like this:
$('li').each(function(i) {
if($(this).html() == "addthis.com"){
$(this).remove();
}
});

This solution uses DOMDocument class and domnode.removechild method:
$str="<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove='addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
$domElemsToRemove = array();
foreach ($elements as $element) {
$pos = strpos($element->textContent, $remove); // or similar $element->nodeValue
if ($pos !== false) {
$domElemsToRemove[] = $element;
}
}
foreach( $domElemsToRemove as $domElement ){
$domElement->parentNode->removeChild($domElement);
}
$str = $doc->saveHTML(); // <ul><li>Foobar</li><li>Foobar</li></ul>

remove script tag from HTML content

I am using HTML Purifier (http://htmlpurifier.org/)
I just want to remove <script> tags only.
I don't want to remove inline formatting or any other things.
How can I achieve this?
One more thing, it there any other way to remove script tags from HTML

Because this question is tagged with regex I'm going to answer with poor man's solution in this situation:
$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
However, regular expressions are not for parsing HTML/XML, even if you write the perfect expression it will break eventually, it's not worth it, although, in some cases it's useful to quickly fix some markup, and as it is with quick fixes, forget about security. Use regex only on content/markup you trust.
Remember, anything that user inputs should be considered not safe.
Better solution here would be to use DOMDocument which is designed for this.
Here is a snippet that demonstrate how easy, clean (compared to regex), (almost) reliable and (nearly) safe is to do the same:
<?php
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item)
{
$remove[] = $item;
}
foreach ($remove as $item)
{
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
I have removed the HTML intentionally because even this can bork.

Use the PHP DOMDocument parser.
$doc = new DOMDocument();
// load the HTML string we want to strip
$doc->loadHTML($html);
// get all the script tags
$script_tags = $doc->getElementsByTagName('script');
$length = $script_tags->length;
// for each tag, remove it from the DOM
for ($i = 0; $i < $length; $i++) {
$script_tags->item($i)->parentNode->removeChild($script_tags->item($i));
}
// get the HTML string back
$no_script_html_string = $doc->saveHTML();
This worked me me using the following HTML document:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>
hey
</title>
<script>
alert("hello");
</script>
</head>
<body>
hey
</body>
</html>
Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
$element = $dom->getElementsByTagName($tag);
foreach($element as $item){
$item->parentNode->removeChild($item);
}
}
$html = $dom->saveHTML();

A simple way by manipulating string.
function stripStr($str, $ini, $fin)
{
while (($pos = mb_stripos($str, $ini)) !== false) {
$aux = mb_substr($str, $pos + mb_strlen($ini));
$str = mb_substr($str, 0, $pos);
if (($pos2 = mb_stripos($aux, $fin)) !== false) {
$str .= mb_substr($aux, $pos2 + mb_strlen($fin));
}
}
return $str;
}

Shorter:
$html = preg_replace("/<script.*?\/script>/s", "", $html);
When doing regex things might go wrong, so it's safer to do like this:
$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;
So that when the "accident" happen, we get the original $html instead of empty string.

this is a merge of both ClandestineCoder & Binh WPO.
the problem with the script tag arrows is that they can have more than one variant
ex. (< = < = &lt;) & ( > = > = &gt;)
so instead of creating a pattern array with like a bazillion variant,
imho a better solution would be
return preg_replace('/script.*?\/script/ius', '', $text)
? preg_replace('/script.*?\/script/ius', '', $text)
: $text;
this will remove anything that look like script.../script regardless of the arrow code/variant and u can test it in here https://regex101.com/r/lK6vS8/1

Try this complete and flexible solution. It works perfectly, and is based in-part by some previous answers, but contains additional validation checks, and gets rid of additional implied HTML from the loadHTML(...) function. It is divided into two separate functions (one with a previous dependency so don't re-order/rearrange) so you can use it with multiple HTML tags that you would like to remove simultaneously (i.e. not just 'script' tags). For example removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally just one as a string. So, without further ado here is the code:
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */
/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */
if (!function_exists('removeAllInstancesOfTag'))
{
function removeAllInstancesOfTag($html, $tag_nm)
{
if (!empty($html))
{
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
$doc = new DOMDocument();
$doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);
if (!empty($tag_nm))
{
if (is_array($tag_nm))
{
$tag_nms = $tag_nm;
unset($tag_nm);
foreach ($tag_nms as $tag_nm)
{
$rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
else if (is_string($tag_nm))
{
$rmvbl_itms = $doc->getElementsByTagName($tag_nm);
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
return $doc->saveHTML();
}
else
{
return '';
}
}
}
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */
/* Prerequisites: 'removeAllInstancesOfTag(...)' */
if (!function_exists('removeAllScriptTags'))
{
function removeAllScriptTags($html)
{
return removeAllInstancesOfTag($html, 'script');
}
}
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */
And here is a test usage example:
$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);
I hope my answer really helps someone. Enjoy!

An example modifing ctf0's answer. This should only do the preg_replace once but also check for errors and block char code for forward slash.
$str = '<script> var a - 1; </script>';
$pattern = '/(script.*?(?:\/|/|/)script)/ius';
$replace = preg_replace($pattern, '', $str);
return ($replace !== null)? $replace : $str;
If you are using php 7 you can use the null coalesce operator to simplify it even more.
$pattern = '/(script.*?(?:\/|/|/)script)/ius';
return (preg_replace($pattern, '', $str) ?? $str);

function remove_script_tags($html){
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item){
$remove[] = $item;
}
foreach ($remove as $item){
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
$html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
$html = str_replace('</p></body></html>', '', $html);
return $html;
}
Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags, this should get rid of it. See https://3v4l.org/82FNP

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.
Don't try to do it with regexps. That way lies madness.

I had been struggling with this question. I discovered you only really need one function. explode('>', $html); The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:
$html = file_get_contents('http://some_page.html');
$h = explode('>', $html);
foreach($h as $k => $v){
$v = trim($v);//clean it up a bit
if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable
$counter = $k;//match opening tag and start counter for backtrace
}elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done
$script_length = $k - $counter;
$counter = 0;
for($i = $script_length; $i >= 0; $i--){
$h[$k-$i] = '';//backtrace and clear everything in between
}
}
}
for($i = 0; $i <= count($h); $i++){
if($h[$i] != ''){
$ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
}
}
$html = implode('>', $ht);//all scripts stripped.
echo $html;
I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.
I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.

This is a simplified variant of Dejan Marjanovic's answer:
function removeTags($html, $tag) {
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
$item->parentNode->removeChild($item);
}
return $dom->saveHTML();
}
Can be used to remove any kind of tag, including <script>:
$scriptlessHtml = removeTags($html, 'script');

use the str_replace function to replace them with empty space or something
$query = '<script>console.log("I should be banned")</script>';
$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);
echo $query;
//this echoes console.log("I should be banned")
?>

How to remove php code from a string?

I have a string that has php code in it, I need to remove the php code from the string, for example:
<?php $db1 = new ps_DB() ?><p>Dummy</p>
Should return <p>Dummy</p>
And a string with no php for example <p>Dummy</p> should return the same string.
I know this can be done with a regular expression, but after 4h I haven't found a solution.

<?php
function filter_html_tokens($a){
return is_array($a) && $a[0] == T_INLINE_HTML ?
$a[1]:
'';
}
$htmlphpstring = '<a>foo</a> something <?php $db1 = new ps_DB() ?><p>Dummy</p>';
echo implode('',array_map('filter_html_tokens',token_get_all($htmlphpstring)));
?>
As ircmaxell pointed out: this would require valid PHP!
A regex route would be (allowing for no 'php' with short tags. no ending ?> in the string / file (for some reason Zend recommends this?) and of course an UNgreedy & DOTALL pattern:
preg_replace('/<\\?.*(\\?>|$)/Us', '',$htmlphpstring);

Well, you can use DomDocument to do it...
function stripPHPFromHTML($html) {
$dom = new DomDocument();
$dom->loadHtml($html);
removeProcessingInstructions($dom);
$simple = simplexml_import_dom($d->getElementsByTagName('body')->item(0));
return $simple->children()->asXml();
}
function removeProcessingInstructions(DomNode &$node) {
foreach ($node->childNodes as $child) {
if ($child instanceof DOMProcessingInstruction) {
$node->removeChild($child);
} else {
removeProcessingInstructions($child);
}
}
}
Those two functions will turn
$str = '<?php echo "foo"; ?><b>Bar</b>';
$clean = stripPHPFromHTML($str);
$html = '<b>Bar</b>';
Edit: Actually, after looking at Wrikken's answer, I realized that both methods have a disadvantage... Mine requires somewhat valid HTML markup (Dom is decent, but it won't parse <b>foo</b><?php echo $bar). Wrikken's requires valid PHP (any syntax errors and it'll fail). So perhaps a combination of the two (try one first. If it fails, try the other. If both fail, there's really not much you can do without trying to figure out the exact reason they failed)...

A simple solution is to explode into arrays using the php tags to remove any content between and implode back to a string.
function strip_php($str) {
$newstr = '';
//split on opening tag
$parts = explode('<?',$str);
if(!empty($parts)) {
foreach($parts as $part) {
//split on closing tag
$partlings = explode('?>',$part);
if(!empty($partlings)) {
//remove content before closing tag
$partlings[0] = '';
}
//append to string
$newstr .= implode('',$partlings);
}
}
return $newstr;
}
This is slower than regex but doesn't require valid html or php; it only requires all php tags to be closed.
For files which don't always include a final closing tag and for general error checking you could count the tags and append a closing tag if it's missing or notify if the opening and closing tags don't add up as expected, e.g. add the code below at the start of the function. This would slow it down a bit more though :)
$tag_diff = (substr_count($str,'<?') - (substr_count($str,'?>');
//Append if there's one less closing tag
if($tag_diff == 1) $str .= '?>';
//Parse error if the tags don't add up
if($tag_diff < 0 || $tag_diff > 1) die('Error: Tag mismatch.
(Opening minus closing tags = '.$tag_diff.')<br><br>
Dumping content:<br><hr><br>'.htmlentities($str));

This is an enhanced version of strip_php suggested by #jon that is able to replace php part of code with another string:
/**
* Remove PHP code part from a string.
*
* #param string $str String to clean
* #param string $replacewith String to use as replacement
* #return string Result string without php code
*/
function dolStripPhpCode($str, $replacewith='')
{
$newstr = '';
//split on each opening tag
$parts = explode('<?php',$str);
if (!empty($parts))
{
$i=0;
foreach($parts as $part)
{
if ($i == 0) // The first part is never php code
{
$i++;
$newstr .= $part;
continue;
}
//split on closing tag
$partlings = explode('?>', $part);
if (!empty($partlings))
{
//remove content before closing tag
if (count($partlings) > 1) $partlings[0] = '';
//append to out string
$newstr .= $replacewith.implode('',$partlings);
}
}
}
return $newstr;
}

If you are using PHP, you just need to use a regular expression to replace anything that matches PHP code.
The following statement will remove the PHP tag:
preg_replace('/^<\?php.*\?\>/', '', '<?php $db1 = new ps_DB() ?><p>Dummy</p>');
If it doesn't find any match, it won't replace anything.

Get xpath of xml node within recursive function

Lets say i have some code to iterate through an XML file recursively like this:
$xmlfile = new SimpleXMLElement('http://www.domain.com/file.xml',null,true);
xmlRecurse($xmlfile,0);
function xmlRecurse($xmlObj,$depth) {
foreach($xmlObj->children() as $child) {
echo str_repeat('-',$depth).">".$child->getName().": ".$subchild."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1);
}
}
How would i calculate the xpath of each node so i can store it for mapping to other code?

The obvious way to do it is to pass the XPath as a third parameter and build it as you dig deeper. You have to account for siblings having the same name, so you have to keep track of the number of precedent siblings with the same name as current child while iterating.
Working example:
function xmlRecurse($xmlObj,$depth=0,$xpath=null) {
if (!isset($xpath)) {
$xpath='/'.$xmlObj->getName().'/';
}
$position = array();
foreach($xmlObj->children() as $child) {
$name = $child->getName();
if(isset($position[$name])) {
++$position[$name];
}
else {
$position[$name]=1;
}
$path=$xpath.$name.'['.$position[$name].']';
echo str_repeat('-',$depth).">".$name.": $path\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$path.'/');
}
}
Attention though, the whole idea of mapping a whole document and storing XPath along the way seems weird. You might actually be working on the wrong solution to a totally different problem.

You can pass to your xmlRecurse third param called $xpath (with current node xPath representation) and add xpath representation of the children on each iteration:
function xmlRecurse($xmlObj,$depth,$xpath) {
$i=0;
foreach($xmlObj->children() as $child) {
echo str_repeat('-',$depth).">".$child->getName().": ".$subchild."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$xpath.'/'.$child->getName().'['.$i++.']');
}
}

With SimpleXML, I think you can only do it as others have pointed out: by recursing the node path as a string argument.
With DOMDocument, you could use the $node->parentNode property to crawl back to the document element and construct it for an arbitrary node (for example if you had a reference to a node and wanted to discover where in the tree it was without prior knowledge of how you got to that node).

$domNode = dom_import_simplexml($node);
$xpath = $domNode->getNodePath();
You need PHP 5 >= 5.2.0 for this to work.

Following up on MightyE's idea about backtracking:
function whereami($node)
{
if ($node instanceof SimpleXMLElement)
{
$node = dom_import_simplexml($node);
}
elseif (!$node instanceof DOMNode)
{
die('Not a node?');
}
$q = new DOMXPath($node->ownerDocument);
$xpath = '';
do
{
$position = 1 + $q->query('preceding-sibling::*[name()="' . $node->nodeName . '"]', $node)->length;
$xpath = '/' . $node->nodeName . '[' . $position . ']' . $xpath;
$node = $node->parentNode;
}
while (!$node instanceof DOMDocument);
return $xpath;
}
I wouldn't recommend it for the case at hand (mapping a whole document, as opposed to a single given node) but it might be useful for future reference.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Can I find exact match using phpQuery? - php

It can be done with filter, but it's not pretty: $dom->find('td')->filter(function($i, $node){return 'foo' == $node->nodeValue;}); But then, neither is switching back and forth between css and xpath

Related

Parsing HTML Table Data from XML with PHP

Remove HTML element from parsed HTML document on a condition

remove script tag from HTML content

How to remove php code from a string?

Get xpath of xml node within recursive function

Categories

Resources