How to find child element based on a specific node in PHP? - php

We can use the childNodes property and the item() method to locate a child element from a parent node. However, if the path between the parent and the child is long, this takes many chained childNodes->item() calls, as in the PHP listed below. I want to get the content of the p tag starting from the table node (see the variable $sentence1). Is this the only way to handle this situation, or is there a better one?
HTML
<table>
<tr>
<td>
<p>Sentence 1</p>
</td>
</tr>
<tr>
<td>
<a>Click 1</a>
</td>
</tr>
</table>
<table>
<tr>
<td>
<p>Sentence 2</p>
</td>
</tr>
<tr>
<td>
<a>Click 2</a>
</td>
</tr>
</table>
Here is my PHP code
$content = file_get_contents('./a.html');
$dom = new \domDocument;
@$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new \DOMXPath($dom);
$table_list = $xpath->query('//table');
foreach($table_list as $k=>$table_node){
$sentence1 = $table_node->childNodes->item(0)->childNodes->item(0)->childNodes->item(0)->nodeValue;
}
I need the value inside the p tag, but the code above is very long for such a simple task. Is there a way to shorten it, for example with something like
$sentence1 = $table_node->FIND('//tr/td/p')
Instead of writing childNodes->item() repeatedly?

$content = file_get_contents('./a.html');
$dom = new \domDocument;
@$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new \DOMXPath($dom);
$table_list = $xpath->query('//table/tr/td/p');
$sentences = array();
foreach($table_list as $k=>$table_node){
$sentences[] = $table_node->nodeValue;
}

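If I understand you correctly, you can simply do this: $xpath->query('//table//p'). It returns every p element that is a descendant of a table element, however deeply nested. (The leading // matters: loadHTML() wraps the markup in html and body elements, so table is never the document root.)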
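To keep the per-table loop from the question while avoiding the childNodes chain, the table node can be passed to query() as the context node. The following is a minimal sketch rather than a definitive solution; the inline $html string stands in for a.html and is an assumption made so the example is self-contained.

```php
<?php
// Assumed sample markup mirroring the question's structure (one <p> per table).
$html = '<table><tr><td><p>Sentence 1</p></td></tr><tr><td><a>Click 1</a></td></tr></table>'
      . '<table><tr><td><p>Sentence 2</p></td></tr><tr><td><a>Click 2</a></td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$sentences = [];
foreach ($xpath->query('//table') as $table_node) {
    // The leading "." makes the path relative to $table_node, not to the document.
    $p = $xpath->query('.//p', $table_node)->item(0);
    if ($p !== null) {
        $sentences[] = $p->nodeValue;
    }
}
print_r($sentences);
```

The second argument of DOMXPath::query() is the context node, which is the closest built-in equivalent of the FIND() call the question asks for.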

Related

Using Xpath to replace href links with a string from the same parent node

I can't seem to get the right expression to modify the href links of a query result with a string (to be set as a new url) taken from another query but on the same parent node. Consider this structure:
<table>
<tr>
<td>
<div class=items>
<span class="working-link">link-1</span>
Item 1
</div>
</td>
<td>
<div class=items>
<span class="working-link">link-2</span>
Item 2
</div>
</td>
</tr>
</table>
So far this is what I have come up with but with no result:
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//div[@class='items']");
foreach( $nodeList as $result) {
$newLink = $xpath->query("//span[@class='working-link']",$result);
foreach($result->getElementsByTagName('a') as $link) {
$link->setAttribute('href', $newLink);
}
echo $doc->saveHTML($result);
}
Never start a relative XPath expression with /, because a leading / always refers to the document root; use ./ instead. In this case span is a direct child of div, so you don't need // either:
$newLink = $xpath->query("./span[@class='working-link']",$result);
or just remove the ./ completely:
$newLink = $xpath->query("span[@class='working-link']",$result);
Solved! The remaining issue was a type mismatch: query() returns a DOMNodeList, not a string, so take its first node
$newLink = $xpath->query("span[@class='working-link']",$result)[0];
and pass that node's text content, which is a string, to setAttribute:
$link->setAttribute('href', $newLink->textContent);
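Putting the two answers together, a complete version might look like the sketch below. Note that the question's sample markup contains no a elements, so the ones here (and their old-1/old-2 href values) are hypothetical, added only so that there is something to rewrite.

```php
<?php
// Hypothetical markup: the <a> elements are an assumption, not part of the question's sample.
$html = '<table><tr>'
      . '<td><div class="items"><span class="working-link">link-1</span><a href="old-1">Item 1</a></div></td>'
      . '<td><div class="items"><span class="working-link">link-2</span><a href="old-2">Item 2</a></div></td>'
      . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach ($xpath->query("//div[@class='items']") as $div) {
    // Relative query: only a span that is a direct child of this div.
    $span = $xpath->query("span[@class='working-link']", $div)->item(0);
    if ($span === null) {
        continue; // no replacement text available for this div
    }
    foreach ($xpath->query('.//a', $div) as $link) {
        // setAttribute() needs a string, so use the span's text, not the node itself.
        $link->setAttribute('href', trim($span->textContent));
    }
}
echo $doc->saveHTML();
```

Using item(0) with a null check avoids an error when a div has no matching span.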

Getting a specific URL using simple_html_dom based on the end of the URL

I need to grab a URL using simple_html_dom based on the end of the URL. The URL has no specific class to make it unique. The only thing unique about it is that it ends with a specific set of numbers. I just cannot figure out the proper syntax to grab that specific URL and then print it.
Any help?
EXAMPLE:
<table class="findList">
<tr class="findResult odd"> <td class="primary_photo"> <a href="/title/tt0080487/?ref_=fn_al_tt_1" ><img src="http://ia.media-imdb.com/images/M/MV5BNzk2OTE2NjYxNF5BMl5BanBnXkFtZTYwMjYwNDQ5._V1_SY44_CR0,0,32,44_.jpg" height="44" width="32" /></a> </td>
That is the code for the beginning of the table. That first href is the one I want to grab. The table continues with more links, etc, but that's not relevant to what I want.
For the first a with a href ending in 1:
$dom->find('a[href$="1"]', 0);
You can simply use DOMDocument:
<?php
$html = '
<table class="findList">
<tr class="findResult odd">
<td class="primary_photo">
<a href="/title/tt0080487/?ref_=fn_al_tt_1" ><img src="http://ia.media-imdb.com/images/M/MV5BNzk2OTE2NjYxNF5BMl5BanBnXkFtZTYwMjYwNDQ5._V1_SY44_CR0,0,32,44_.jpg" height="44" width="32" /></a>
</td>
';
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $td) {
if($td->getAttribute('class') == 'primary_photo'){
$a = $td->getElementsByTagName('a')->item(0)->getAttribute('href');
}
}
echo $a; // /title/tt0080487/?ref_=fn_al_tt_1
// Or, if you're looking to get the img tag
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $td) {
if($td->getAttribute('class') == 'primary_photo'){
$a = $td->getElementsByTagName('img')->item(0)->getAttribute('src');
}
}
echo $a; // http://ia.media-imdb.com/images/M/MV5BNzk2OTE2NjYxNF5BMl5BanBnXkFtZTYwMjYwNDQ5._V1_SY44_CR0,0,32,44_.jpg
?>
Assuming you have your HTML in a file called "tables.html", this will work. It reads the file, finds all the 'a' links, and puts them into an array; the first one ($anchors[0]) is the one you want. Then you get the href from it with $anchors[0]->href.
$html = new simple_html_dom();
$html->load_file('tables.html');
$anchors = $html->find("a");
echo $anchors[0]->href;
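If you would rather stay with DOMDocument and still match on the end of the URL, DOMXPath can do it too. XPath 1.0 has no ends-with() function, so the sketch below compares the tail of the href with substring(). The second link and the suffix value 'tt_1' are assumptions made for illustration.

```php
<?php
// The second <td> and its link are hypothetical, added to show that only one link matches.
$html = '<table class="findList"><tr class="findResult odd">'
      . '<td class="primary_photo"><a href="/title/tt0080487/?ref_=fn_al_tt_1">first</a></td>'
      . '<td class="result_text"><a href="/title/tt0080487/?ref_=fn_al_tt_2">second</a></td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Assumed suffix; it must be plain ASCII without double quotes,
// because it is embedded directly in the expression.
$suffix = 'tt_1';

// Emulate ends-with(): take the last strlen($suffix) characters of @href and compare.
$expr = sprintf(
    '//a[substring(@href, string-length(@href) - %d) = "%s"]',
    strlen($suffix) - 1,
    $suffix
);

$link = $xpath->query($expr)->item(0);
$href = $link !== null ? $link->getAttribute('href') : null;
echo $href, "\n";
```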

XPath keeps returning empty node list

I am trying to parse a folder full of .htm files. All these files contain 1 specific element that needs to be removed.
It's a td element with class="hide". So far, this is my code.
$dir . $entry is the full path to the file.
$page = ($dir . $entry);
$this->domDoc->loadHTMLFile($page);
// Use xpath query to find the menu and remove it
$nodeList = $xpath->query('//td[@class="hide"]');
Unfortunately, this is where things already go wrong. If I do a var_dump of the node list, I get the following:
object(DOMNodeList)#5 (0) { }
Just so you folks get an idea of what I'm trying to select, here's an excerpt:
<td width="160" align="left" valign="top" class="hide">
lots of other TD's and content here
</td>
Does anybody see anything wrong with what I've come up with so far?
Is your initial file XHTML (i.e. with <html xmlns="http://www.w3.org/1999/xhtml">)? If so, your elements will be namespaced, and you'll need to set up a prefix mapping using $xpath->registerNamespace and then use this prefix in the expression:
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodeList = $xpath->query('//xhtml:td[@class="hide"]');
Var dumping an xpath node list object doesn't show anything. Var dump the node list's length.
var_dump($nodeList->length);
If the value is over 0, then you can iterate over it using foreach:
foreach($nodeList as $node)var_dump($node->tagName);
Hope this helps.
For further clarification, here is a full working code snippet:
<?php
$html = <<<END
<html>
<body>
<td>
</td>
<td class="hide"></td>
<td class="hide"></td>
</body>
</html>
END;
$dom = new DOMDocument;
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
$nodeList = $xpath->query('//td[@class="hide"]');
// Shows a blank object
var_dump($nodeList);
// Shows 2
var_dump($nodeList->length);
// Echo out all the tag names.
foreach($nodeList as $node){
echo $node->tagName . "\n";
}
?>
Maybe you have more than one class in the class attribute of your td element:
<td class="hide anotherclass">
In that case '//td[@class="hide"]' would only match:
<td class="hide">
Use contains() instead to check whether the attribute includes the hide class you are looking for:
$nodeList = $xpath->query('//td[contains(@class,"hide")]');
Check out this blog post: XPath: Select element by class
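One caveat with a plain contains() test is that it also matches class names that merely include the text, such as "hidden". A common XPath 1.0 idiom pads the class list with spaces and looks for the whole token. The sketch below uses made-up sample cells and also removes the matches, which is what the question ultimately wants.

```php
<?php
// Assumed sample cells: two carry the "hide" class, one has the unrelated class "hidden".
$html = '<table><tr>'
      . '<td class="hide">menu</td>'
      . '<td class="hide extra">menu 2</td>'
      . '<td class="hidden">keep</td>'
      . '<td>content</td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so "hide" only matches as a whole token.
$expr = '//td[contains(concat(" ", normalize-space(@class), " "), " hide ")]';
$nodeList = $xpath->query($expr);
$matched = $nodeList->length;

// Remove every matched cell from its parent row.
foreach ($nodeList as $node) {
    $node->parentNode->removeChild($node);
}

$remaining = $xpath->query('//td')->length;
echo "$matched removed, $remaining left\n";
```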

Removing the last column of an HTML table - using preg_replace

In a simple HTML table, I would like to remove the last column:
<table>
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
<th rowspan="3">I want to remove this</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td rowspan="3">I want to remove this</td>
</tr>
I am using this code, but the content and the th with rowspan are still left:
$myTable = preg_replace('#</?td rowspan[^>]*>#i', '', $myTable);
echo $myTable;
Question: how do I remove the last column and its content?
<?php
// Create a new DOMDocument and load the HTML
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
// Create a new XPath query
$xpath = new DOMXPath($dom);
// Find all elements with a rowspan attribute
$result = $xpath->query('//*[@rowspan]');
// Loop the results and remove them from the DOM
foreach ($result as $cell) {
$cell->parentNode->removeChild($cell);
}
// Save back to a string
$newhtml = $dom->saveHTML();
See it working
I guess this will do it
preg_replace("/<(?:td|th)[^>]*>.*?<\/(?:td|th)>\s+<\/tr>/i", "</tr>", $myTable);
assuming you have closing tr tag (</tr>) at the end of each row unlike in your example
Edit: this will remove any <td> or <th> elements before closing </tr> no matter if they have any attributes
working example
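A variation on the DOM approach, sketched under the assumption that the column to drop is always the last cell of its row and always carries a rowspan attribute, as in the question's table. It also passes the table node to saveHTML() so that the result is just the table markup, without the doctype, html and body wrapper that loadHTML() adds.

```php
<?php
// The question's table, completed with closing tags so the example is self-contained.
$myTable = '<table><tbody>'
         . '<tr><th>1</th><th>2</th><th>3</th><th rowspan="3">I want to remove this</th></tr>'
         . '<tr><td>1</td><td>2</td><td>3</td><td rowspan="3">I want to remove this</td></tr>'
         . '</tbody></table>';

$dom = new DOMDocument();
$dom->loadHTML($myTable);
$xpath = new DOMXPath($dom);

// Select only the last cell of each row, and only if it has a rowspan attribute.
foreach ($xpath->query('//tr/*[last()][@rowspan]') as $cell) {
    $cell->parentNode->removeChild($cell);
}

// Serialize just the <table> element instead of the whole document.
$table = $dom->getElementsByTagName('table')->item(0);
$myTable = $dom->saveHTML($table);
echo $myTable;
```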

I want php code to find href title and some other infos from html table

This is the code I have so far:
<?php
$url=" SOME HTML URL ";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href');
}
?>
I have HTML pages with tables, and I want the link, the title and the date. Example of the HTML code:
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext" >THIS IS THE TITLE </TD>
<TD align="center" class="plaintext" >THIS IS DATE</TD>
</TR>
It works fine for the link, but I don't know how to get the others.
Thanks.
Where you are doing this:
$tags = $doc->getElementsByTagName('a');
You are getting back all the A tags. There only happens to be one.
If you want to get the text "THIS IS DATE", you aren't going to get it by looking in A tags, because the text is not inside an A tag; it is in a TD tag.
$tds = $doc->getElementsByTagName('td');
... would work to get all the TD elements, or you could assign an ID to the element you want to target and use getElementById instead.
Basically, though, this information is all in the documentation, which you absolutely should read before asking questions. Happy reading!
Once again, that's: http://php.net/manual/en/class.domdocument.php
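As a rough sketch of how the three values could be collected row by row: the example below assumes the link sits inside the title cell (the question's sample row shows no a element, so that part, including page.html, is hypothetical) and that title and date are always the second and third cells.

```php
<?php
// Assumed row layout: number, title (with a hypothetical link), date.
$html = '<table><TR>'
      . '<TD align="center" class="smalltext">3</TD>'
      . '<TD class="plaintext"><a href="page.html">THIS IS THE TITLE</a></TD>'
      . '<TD align="center" class="plaintext">THIS IS DATE</TD>'
      . '</TR></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
// The HTML parser lowercases element names, so query for tr/td, not TR/TD.
foreach ($xpath->query('//tr') as $tr) {
    $cells = $xpath->query('./td', $tr);
    if ($cells->length < 3) {
        continue; // skip rows that do not have the expected three cells
    }
    $a = $xpath->query('.//a', $cells->item(1))->item(0);
    $rows[] = [
        'link'  => $a !== null ? $a->getAttribute('href') : null,
        'title' => trim($cells->item(1)->textContent),
        'date'  => trim($cells->item(2)->textContent),
    ];
}
print_r($rows);
```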
