XPath select TD's inside TR


I want to capture all the content between the td tags, but grouped by their tr, so that I get an array with the content inside every tr.
<div id="box">
<tr align='center'>
<td>1</td>
<td style='padding-left: 0px !important;padding-right: 10px !important;'> <div id=''></div></td>
<td>45</td>
<td>62</td>
</tr><tr align='center'>
<td>2</td>
<td style='padding-left: 0px !important;padding-right: 10px !important;'> <div id=''></div></td>
<td>35</td>
<td>47</td>
</tr><tr align='center'>
<td>3</td>
<td style='padding-left: 0px !important;padding-right: 10px !important;'> <div id=''></div></td>
<td>63</td>
<td>58</td>
</tr>
I've tried with this:
<?php
$url = '';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXpath ($doc);
$expresion = "//div[@id='box']//tr//td";
$node = $xpath->evaluate($expresion);
foreach ($node as $nd)
{
    echo $nd->nodeValue;
}
?>
But the output is:
1
45
62
2
35
47
3
63
58

If you want to group the td values by their tr, I would split the XPath into two queries. The first query selects the <tr> nodes, and the second query, run relative to each <tr>, selects the <td> children of that node.
If you put that into a loop, it can look like this:
<?php
$html = <<<EOF
<div id="box">
... Your HTML comes here
</tr>
EOF;
$doc = new DOMDocument();
$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXpath ($doc);
/* First query: select every tr inside the div */
$expresion = "//div[@id='box']//tr";
$trs = $xpath->evaluate($expresion);
foreach ($trs as $tr)
{
    $tdvals = array();
    /* Second query: select the td children of the current tr */
    foreach($xpath->query('td', $tr) as $td) {
        /* Skip the td with the empty text value */
        if(trim($td->nodeValue) !== '') {
            $tdvals []= $td->nodeValue;
        }
    }
    echo implode(',', $tdvals) . PHP_EOL;
}
which outputs:
1,45,62
2,35,47
3,63,58
One more thing: in your example you are using file_get_contents() to load the HTML. Note that you can use DOMDocument::loadHTMLFile() to load (remote) files directly.
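Since the question asks for an array rather than printed lines, here is a minimal sketch of the same two-query approach that collects the values into a nested array. The sample markup is a shortened, hypothetical version of the question's HTML (a <table> wrapper was added so the fragment is valid), and the variable names $rows and $cells are illustrative only.

```php
<?php
// Hypothetical sample markup, shortened from the question.
$html = <<<HTML
<div id="box"><table>
<tr align="center"><td>1</td><td> </td><td>45</td><td>62</td></tr>
<tr align="center"><td>2</td><td> </td><td>35</td><td>47</td></tr>
</table></div>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
// Outer query: one iteration per tr inside the div.
foreach ($xpath->query("//div[@id='box']//tr") as $tr) {
    $cells = [];
    // Inner query: td children, evaluated relative to the current tr.
    foreach ($xpath->query('td', $tr) as $td) {
        $value = trim($td->nodeValue);
        if ($value !== '') {   // skip cells without text
            $cells[] = $value;
        }
    }
    $rows[] = $cells;
}

print_r($rows);
```

Each entry of $rows then holds the non-empty cell values of one tr, which is the grouping the question describes.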

Related

DOMDocument How get element a from node?

$url = file_get_contents('test.html');
$DOM = new DOMDocument();
$DOM->loadHTML(mb_convert_encoding($url, 'HTML-ENTITIES', 'UTF-8'));
$trs = $DOM->getElementsByTagName('tr');
foreach ($trs as $tr) {
    foreach ($tr->childNodes as $td){
        echo ' ' .$td->nodeValue;
    }
}
test.html
<html>
<body>
<table>
<tbody>
<tr>
<td style="background-color: #FFFF80;">1</td>
<td>test1</td>
</tr>
<tr>
<td style="background-color: #FFFF80;">2</td>
<td>test2</td>
</tr>
<tr>
<td style="background-color: #FFFF80;">3</td>
<td>test3</td>
</tr>
</tbody>
</table>
</body>
</html>
As a result I get:
1 test1 2 test2 3 test3
But how do I get the link from the a inside a td?
And how do I get the HTML of a td?
P.S.: I tried $td->find('a'); and $td->getElementsByTagName('a'); but neither works...

I improved your code a little bit and this version works fine for me:
$url = file_get_contents('test.html'); // as in the question: $url holds the HTML string
$DOM = new DOMDocument();
$DOM->loadHTML(mb_convert_encoding($url, 'HTML-ENTITIES', 'UTF-8'));
$trs = $DOM->getElementsByTagName('tr');
foreach ($trs as $tr) {
    foreach ($tr->childNodes as $td){
        if ($td->hasChildNodes()) { //check if <td> has childnodes
            foreach($td->childNodes as $i) {
                if ($i->hasAttributes()){ //check if childnode has attributes
                    echo $i->getAttribute("href") . "\n"; // get href="" attribute
                }
            }
        }
    }
}
Result:
test1.php
test2.php
test3.php
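A possible reason why $td->getElementsByTagName('a') failed in the question is that $tr->childNodes also yields whitespace text nodes, and text nodes have no getElementsByTagName() method. The sketch below avoids that by iterating over td elements only, and it also covers the second part of the question by serializing each td with DOMDocument::saveHTML(). The sample markup is hypothetical: it assumes each second cell contains a link such as <a href="test1.php">test1</a>, which is what the printed result suggests.

```php
<?php
// Hypothetical markup: the links are assumed from the result shown above.
$html = '<table><tbody>'
      . '<tr><td>1</td><td><a href="test1.php">test1</a></td></tr>'
      . '<tr><td>2</td><td><a href="test2.php">test2</a></td></tr>'
      . '</tbody></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

$links = [];
$cellHtml = [];
// getElementsByTagName() returns elements only, so no text nodes show up here.
foreach ($dom->getElementsByTagName('td') as $td) {
    // Passing a node to saveHTML() serializes just that node and its content.
    $cellHtml[] = $dom->saveHTML($td);
    foreach ($td->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
}

print_r($links);
print_r($cellHtml);
```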

Extract text and image src with PHP DomDocument

I'm trying to extract the img src and the text of the TDs inside the div id="Ajax", but I'm unable to extract the img with my code. It just ignores the img src. How can I also extract the img src and add it to the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
    $contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.

You're using $t->nodeValue to obtain the content of a node. An <img> element has no text content, so nodeValue has nothing to return for it; the URL is stored in its src attribute. The easiest way to get the src attribute would be XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[@id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/@src";
$firstTdExpression = "./td[1]";
foreach($nodes as $node){ // loop over each row
    // select the first td node
    $tdNodes = $xpath->query($firstTdExpression, $node);
    $tdVal = null;
    if($tdNodes->length > 0){
        $tdVal = $tdNodes->item(0)->nodeValue;
    }
    // select the src attribute of the img node
    $imgNodes = $xpath->query($imgSrcExpression, $node);
    $imgVal = null;
    if($imgNodes->length > 0){
        $imgVal = $imgNodes->item(0)->nodeValue;
    }
}
(Caution: Code may contain typos)
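To actually end up with the array the question asks for, the per-row values can be stored instead of only assigned. The following is a minimal sketch under a few assumptions: the markup is a shortened, hypothetical copy of the question's HTML with the missing closing tags added, the array keys 'time', 'img' and 'text' are invented for illustration, and DOMXPath::evaluate() with the XPath string() function is used as a shortcut that returns an empty string when nothing matches.

```php
<?php
// Hypothetical, shortened version of the question's markup.
$html = <<<HTML
<div id="Ajax"><table cellpadding="1" cellspacing="0"><tbody>
<tr id="comment_1"><td>20:28</td><td class="color"></td><td class="last_comment">Text<br/></td></tr>
<tr id="comment_3"><td>20:24</td><td class="color"><img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/></td><td class="comment">Text 3<br/></td></tr>
</tbody></table></div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$contentArray = [];
foreach ($xpath->query("//div[@id='Ajax']//tr") as $row) {
    // string(...) converts the first matching node to text, or '' if none matches.
    $contentArray[] = [
        'time' => trim($xpath->evaluate('string(./td[1])', $row)),
        'img'  => $xpath->evaluate('string(.//img/@src)', $row),
        'text' => trim($xpath->evaluate('string(./td[3])', $row)),
    ];
}

print_r($contentArray);
```

Rows without an image simply get an empty 'img' value, so every entry has the same shape.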

Extract all <a> from table with DOMdocument

I want to extract the multiple <a> tags from this html markup:
<table align="center" width="100%" cellpadding="0" cellspacing="0"><tr>
<td align="left" width="60%" valign="top">
<font size="5" color="#939390">title</font><br><font size="5" color="#939390">LINKS</font>
<a style="color:#000000;" title="title Link 1" href="/page/242808/1/44643.html"> <b style="background:#ff6633">Link 1</b></a>
<a style="color:#000000;" title="title Link 2" href="/page/242808/2/erewe.html"> <b style="background:#ff6633">Link 2</b></a>
</td>
</tr>
</table>
and here is my code :
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // load HTML you can add $html
$selector = new DOMXPath($doc);
$a = $selector->query('//table[1]//a')->item(0);
echo $doc->saveHTML($a);
This gets the first <a>, but what I want is to get all the <a> tags in the document.

To get more than one, you'll need to loop through the results instead of just printing the first one:
$a = $selector->query('//table[1]//a');
foreach($a as $current) {
    echo $doc->saveHTML($current);
}
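If the goal is to work with the links rather than print their markup, the same loop can collect the href and link text of each <a>. This is only a sketch: the markup is a trimmed, hypothetical copy of the question's table, and the $links array with its 'href' and 'text' keys is an illustrative structure, not something from the original code.

```php
<?php
// Hypothetical, trimmed version of the question's markup.
$html = '<table align="center" width="100%"><tr><td align="left">'
      . '<a title="title Link 1" href="/page/242808/1/44643.html"> <b>Link 1</b></a>'
      . '<a title="title Link 2" href="/page/242808/2/erewe.html"> <b>Link 2</b></a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$links = [];
// query() returns a DOMNodeList; iterate over it instead of calling item(0).
foreach ($selector->query('//table[1]//a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```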

making two different class names together in the same h3 tag in code retrieved from remote source

I am fetching the HTML source of a remote page, removing some tbody elements from it,
and then echoing the result in my page. This all works well. The problem I am stuck on
is that I want two class names to end up in the same h3 tag, as this is the only way
the markup displays properly in my page.
<?php
//Get the url
$url = "http://lsh.streamhunter.eu/static/section35.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('tbody');
$toRemove = array();
// gather a list of tbodys to remove
foreach($elements as $el)
    if((strpos($el->nodeValue, 'desktop') !== false) && !in_array($el->parentNode, $toRemove, true))
        $toRemove[] = $el->parentNode;
foreach($elements as $el)
    if((strpos($el->nodeValue, 'Recommended') !== false) && !in_array($el->parentNode, $toRemove, true))
        $toRemove[] = $el->parentNode;
// remove them
foreach($toRemove as $tbody)
    $tbody->parentNode->removeChild($tbody);
echo $doc->saveHTML(); // save new HTML
?>
How can I make the two class names, eventtitle and lshjpane-toggler lshtitle,
appear in the same h3 tag instead of each in a separate tag?
Edit: to make it clear, look at this code:
<h3 class="lshjpane-toggler lshtitle eventtitle150909" onclick="getEvent(150909)"></h3><table cellpadding="0" cellspacing="0" border="0"><tr><td valign="middle" width="20px;" height="20px;" style="padding:0px 0px 0px 5px; background: url(/images/stories/kr.png) no-repeat scroll center;"></td>
<td valign="middle" style="padding:0px 0px 0px 5px;"><span class="lshstart_time">12:00</span></td>
<td valign="middle" style="padding:0px 0px 0px 5px;"><span class="lshevent">Daekyo Kangaroos Women - Incheon Red Angels Women</span></td>
</tr></table><h3 id="preloadevent150909" style="display:none" class="preload-lshjpane-toggler"></h3>
It displays properly only if the </h3> before <table is removed.
How can I remove this tag from the code retrieved from the remote page?

If I were in your place, I would store the whole page in a variable before displaying it:
$page = $doc->saveHTML();
Then replace </h3><table with <table:
$page = str_replace('</h3><table', '<table', $page);
Then echo the result:
echo $page;
But I'm afraid the main problem is that a <table> cannot be placed inside an <h3>, and that is probably why you don't get the markup you want.
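As a small illustration of the replacement step, here is a sketch that applies str_replace() to a hypothetical, shortened fragment of the markup from the question. Note that this only removes the closing </h3> that directly precedes <table; it does not add a new closing tag after the table, so the first h3 stays unclosed in the resulting string.

```php
<?php
// Hypothetical fragment, shortened from the markup in the question.
$page = '<h3 class="lshjpane-toggler lshtitle eventtitle150909" onclick="getEvent(150909)"></h3>'
      . '<table cellpadding="0" cellspacing="0" border="0"><tr><td>12:00</td></tr></table>'
      . '<h3 id="preloadevent150909" style="display:none" class="preload-lshjpane-toggler"></h3>';

// Only a </h3> immediately followed by <table is affected.
$page = str_replace('</h3><table', '<table', $page);

echo $page;
```

The closing tag of the second h3 is untouched because it is not followed by <table.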

PHP DOM accessing the object with same attribute

I want to get the text content of the date cell and of Team 1. But the Team 2 cell has the same align attribute as the date cell. How can I get the right content? If I echo $date, I get the date value together with Team 2. How should I write the conditions?
<table width="100%" cellpadding=2 cellspacing=0 id="tblFixture" border=0>
<tr class=row1 align=center side='home'>
<td align=left>21.09.1928</td>
<td> </td>
<td align='right'><span class='team'>Team 1</span></td>
<td align=left><a href='http://www.foo.com/bar' target='_blank'>Team 2</a></td>
</td>
</tr>
PHP Code:
$url = "http://www.bla.com/bla.html";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$nlig = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\']');
$i = 0;
foreach ($nlig AS $val)
{
    $date = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\'][@class=\'row1\']/td[@align=\'left\']')->item($i)->textContent;
    $first_team = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\']/td[@align=\'right\']/span[@class=\'team\']')->item($i)->textContent;
    echo $date, $first_team, "<br />";
    $i++;
}

You can use a regular expression to find and validate the date.
Something like this (the dots are escaped so that they match a literal "."):
preg_match("/<td align=left>([0-9]{2}\.[0-9]{2}\.[0-9]{4})<\/td>/", $html, $matches);
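As an alternative to a regular expression, the queries can be evaluated relative to each row, so that a positional predicate picks the first left-aligned cell of that row only. This is a sketch, not the asker's code: the markup is a cleaned-up, hypothetical version of the question's table (attributes quoted, the stray </td> removed, the table closed), and the $fixtures array with its 'date' and 'team1' keys is illustrative.

```php
<?php
// Hypothetical, cleaned-up version of the question's markup.
$html = <<<HTML
<table width="100%" id="tblFixture">
<tr class="row1" align="center" side="home">
<td align="left">21.09.1928</td>
<td> </td>
<td align="right"><span class="team">Team 1</span></td>
<td align="left"><a href="http://www.foo.com/bar" target="_blank">Team 2</a></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fixtures = [];
foreach ($xpath->query('//table[@id="tblFixture"]//tr[@side="home"]') as $row) {
    $fixtures[] = [
        // [1] keeps only the first left-aligned td of this row (the date cell).
        'date'  => $xpath->evaluate('string(td[@align="left"][1])', $row),
        'team1' => $xpath->evaluate('string(td[@align="right"]/span[@class="team"])', $row),
    ];
}

print_r($fixtures);
```

Because both expressions start from $row, there is no need for the $i counter or for repeating the full path in every iteration.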
