I want to capture the content of all td tags, grouped by their tr, so that I get an array with the contents of each tr.
<div id="box">
<tr align='center'>
<td>1</td>
<td style='padding-left: 0px !important;padding-right: 10px !important;'> <div id=''></div></td>
<td>45</td>
<td>62</td>
</tr><tr align='center'>
<td>2</td>
<td style='padding-left: 0px !important;padding-right: 10px !important;'> <div id=''></div></td>
<td>35</td>
<td>47</td>
</tr><tr align='center'>
<td>3</td>
<td style='padding-left: 0px !important;padding-right: 10px !important;'> <div id=''></div></td>
<td>63</td>
<td>58</td>
</tr>
I've tried with this:
<?php
$url = '';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXpath ($doc);
$expresion = "//div[@id='box']//tr//td";
$node = $xpath->evaluate($expresion);
foreach ($node as $nd)
{
echo $nd->nodeValue;
}
?>
But the output is:
1
45
62
2
35
47
3
63
58
If you want to group the td values by their tr, I would split the XPath into two queries: one query selects the <tr> nodes, and a second query selects the <td> children of each of those nodes.
If you put that into a loop it can look like this:
<?php
$html = <<<EOF
<div id="box">
... Your HTML comes here
</tr>
EOF;
$url = '';
$doc = new DOMDocument();
$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXpath ($doc);
$expresion = "//div[@id='box']//tr";
$trs = $xpath->evaluate($expresion);
foreach ($trs as $tr)
{
$tdvals = array();
foreach($xpath->query('td', $tr) as $td) {
/* Skip the td with the empty text value */
if(trim($td->nodeValue) !== '') {
$tdvals []= $td->nodeValue;
}
}
echo implode(',', $tdvals) . PHP_EOL;
}
which outputs:
1,45,62
2,35,47
3,63,58
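Since the question asks for an array per tr, here is a minimal sketch of the same two-query idea that collects the values instead of echoing them. The sample markup is shortened and wrapped in a <table> (an assumption, so that the fragment is valid HTML), and the variable names $rows and $cells are only illustrative.

```php
<?php
// Hypothetical, shortened sample markup; the <table> wrapper is an assumption.
$html = <<<EOF
<div id="box"><table>
<tr align='center'><td>1</td><td> </td><td>45</td><td>62</td></tr>
<tr align='center'><td>2</td><td> </td><td>35</td><td>47</td></tr>
</table></div>
EOF;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = array();
foreach ($xpath->query("//div[@id='box']//tr") as $tr) {
    $cells = array();
    // 'td' is evaluated relative to the current <tr>
    foreach ($xpath->query('td', $tr) as $td) {
        $value = trim($td->nodeValue);
        if ($value !== '') {        // skip cells without text
            $cells[] = $value;
        }
    }
    $rows[] = $cells;
}
print_r($rows);
```

Each entry of $rows is then one tr, so $rows[0] holds the cell values of the first row.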
One more thing: in your example you are using file_get_contents() to load the HTML. Note that you can use DOMDocument::loadHTMLFile() to load (remote) files directly.
Related
$url = file_get_contents('test.html');
$DOM = new DOMDocument();
$DOM->loadHTML(mb_convert_encoding($url, 'HTML-ENTITIES', 'UTF-8'));
$trs = $DOM->getElementsByTagName('tr');
foreach ($trs as $tr) {
foreach ($tr->childNodes as $td){
echo ' ' .$td->nodeValue;
}
}
test.html
<html>
<body>
<table>
<tbody>
<tr>
<td style="background-color: #FFFF80;">1</td>
<td>test1</td>
</tr>
<tr>
<td style="background-color: #FFFF80;">2</td>
<td>test2</td>
</tr>
<tr>
<td style="background-color: #FFFF80;">3</td>
<td>test3</td>
</tr>
</tbody>
</table>
</body>
</html>
As a result I get:
1 test1 2 test2 3 test3
But how do I get the link from an a inside a td?
And how do I get the HTML of a td?
P.S.: I tried $td->find('a'); and $td->getElementsByTagName('a'); but they do not work...
I improved your code a little bit and this version works fine for me:
$DOM = new DOMDocument();
$DOM->loadHTML(mb_convert_encoding($url, 'HTML-ENTITIES', 'UTF-8'));
$trs = $DOM->getElementsByTagName('tr');
foreach ($trs as $tr) {
foreach ($tr->childNodes as $td){
if ($td->hasChildNodes()) { //check if <td> has childnodes
foreach($td->childNodes as $i) {
if ($i->hasAttributes()){ //check if childnode has attributes
echo $i->getAttribute("href") . "\n"; // get href="" attribute
}
}
}
}
}
Result:
test1.php
test2.php
test3.php
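One likely reason $td->getElementsByTagName('a') failed in the question is that $tr->childNodes also contains whitespace text nodes (DOMText), which do not have that method. A minimal sketch that iterates over the td elements directly and also answers the "HTML of a td" part is shown here. The markup is hypothetical: it assumes the second cell of each row contains a link, which the posted test.html does not show.

```php
<?php
// Hypothetical markup: the second cell of each row is assumed to contain a link.
$html = '<table><tr><td>1</td><td><a href="test1.php">test1</a></td></tr>'
      . '<tr><td>2</td><td><a href="test2.php">test2</a></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$links = array();
$cellHtml = array();
foreach ($dom->getElementsByTagName('td') as $td) {
    // getElementsByTagName() exists on DOMElement, so it can be scoped to one cell
    foreach ($td->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    // saveHTML($node) serializes a single node, including its own <td> tags
    $cellHtml[] = $dom->saveHTML($td);
}
print_r($links);
print_r($cellHtml);
```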
I'm trying to extract the img src and the text of the TDs inside the div id="Ajax", but I'm unable to extract the img with my code; it just ignores the img src. How can I also extract the img src and add it to the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
$contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.
You're using $t->nodeValue to obtain the content of a node. An <img> element has no text content, so nodeValue returns an empty string for it. The easiest way to get the src attribute is XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[@id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/@src";
$firstTdExpression = "./td[1]";
foreach($nodes as $node){ // loop over each row
// select the first td node
$tdNodes = $xpath->query($firstTdExpression ,$node);
$tdVal = null;
if($tdNodes->length > 0){
$tdVal = $tdNodes->item(0)->nodeValue;
}
// select the src attribute of the img node
$imgNodes = $xpath->query($imgSrcExpression,$node);
$imgVal = null;
if($imgNodes ->length > 0){
$imgVal = $imgNodes->item(0)->nodeValue;
}
}
(Caution: Code may contain typos)
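To also get the values into an array, as the question asks, the same row-relative queries could be collected like this. This is only a sketch on a shortened copy of the markup; the array keys 'time', 'text' and 'img' and the use of td[3] for the comment text are assumptions for illustration.

```php
<?php
// Hypothetical, shortened version of the markup from the question.
$html = <<<EOF
<div id="Ajax"><table><tbody>
<tr id="comment_1"><td>20:28</td><td class="color"></td><td class="last_comment">Text<br/></td></tr>
<tr id="comment_3"><td>20:24</td><td class="color"><img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/></td><td class="comment">Text 3<br/></td></tr>
</tbody></table></div>
EOF;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$contentArray = array();
foreach ($xpath->query("//div[@id='Ajax']//tr") as $row) {
    // all three queries are relative to the current row
    $time = $xpath->query('./td[1]', $row);
    $text = $xpath->query('./td[3]', $row);
    $src  = $xpath->query('.//img/@src', $row);
    $contentArray[] = array(
        'time' => $time->length > 0 ? trim($time->item(0)->nodeValue) : null,
        'text' => $text->length > 0 ? trim($text->item(0)->nodeValue) : null,
        'img'  => $src->length > 0 ? $src->item(0)->nodeValue : null, // null when the row has no img
    );
}
print_r($contentArray);
```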
I want to extract the multiple <a> tags from this html markup:
<table align="center" width="100%" cellpadding="0" cellspacing="0"><tr>
<td align="left" width="60%" valign="top">
<font size="5" color="#939390">title</font><br><font size="5" color="#939390">LINKS</font>
<a style="color:#000000;" title="title Link 1" href="/page/242808/1/44643.html"> <b style="background:#ff6633">Link 1</b></a>
<a style="color:#000000;" title="title Link 2" href="/page/242808/2/erewe.html"> <b style="background:#ff6633">Link 2</b></a>
</td>
</tr>
</table>
And here is my code:
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // load HTML you can add $html
$selector = new DOMXPath($doc);
$a = $selector->query('//table[1]//a')->item(0);
echo $doc->saveHTML($a);
This gets the first <a>, but what I want is to get all the <a> tags in the document.
To get more than one, you'll need to loop through the results instead of just printing the first one:
$a = $selector->query('//table[1]//a');
foreach($a as $current) {
echo $doc->saveHTML($current);
}
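If you need the link targets and labels rather than the serialized tags, the same loop can read them from each node. A small sketch on a trimmed copy of the markup (the $links array and its keys are just illustrative names):

```php
<?php
// Trimmed, hypothetical copy of the markup from the question.
$html = '<table><tr><td>'
      . '<a title="title Link 1" href="/page/242808/1/44643.html"> <b>Link 1</b></a>'
      . '<a title="title Link 2" href="/page/242808/2/erewe.html"> <b>Link 2</b></a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$links = array();
foreach ($selector->query('//table[1]//a') as $a) {
    $links[] = array(
        'href' => $a->getAttribute('href'),   // attribute value as written in the markup
        'text' => trim($a->textContent),      // text of the link, including nested <b>
    );
}
print_r($links);
```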
I am fetching the HTML source of a remote page, removing some tbody elements from it,
and then echoing the result in my page. This all works well, but the problem I am stuck on
is that I want to put two class names in the same h3 tag, as this is the only way the markup
displays properly in my page.
<?php
//Get the url
$url = "http://lsh.streamhunter.eu/static/section35.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('tbody');
$toRemove = array();
// gather a list of tbodys to remove
foreach($elements as $el)
if((strpos($el->nodeValue, 'desktop') !== false) && !in_array($el->parentNode, $toRemove, true))
$toRemove[] = $el->parentNode;
foreach($elements as $el)
if((strpos($el->nodeValue, 'Recommended') !== false) && !in_array($el->parentNode, $toRemove, true))
$toRemove[] = $el->parentNode;
// remove them
foreach($toRemove as $tbody)
$tbody->parentNode->removeChild($tbody);
echo $doc->saveHTML(); // save new HTML
?>
How can I make the two class names, eventtitle and lshjpane-toggler lshtitle,
appear in the same h3 tag instead of each in a separate tag?
Edit: to make it clear, look at this code:
<h3 class="lshjpane-toggler lshtitle eventtitle150909" onclick="getEvent(150909)"></h3><table cellpadding="0" cellspacing="0" border="0"><tr><td valign="middle" width="20px;" height="20px;" style="padding:0px 0px 0px 5px; background: url(/images/stories/kr.png) no-repeat scroll center;"></td>
<td valign="middle" style="padding:0px 0px 0px 5px;"><span class="lshstart_time">12:00</span></td>
<td valign="middle" style="padding:0px 0px 0px 5px;"><span class="lshevent">Daekyo Kangaroos Women - Incheon Red Angels Women</span></td>
</tr></table><h3 id="preloadevent150909" style="display:none" class="preload-lshjpane-toggler"></h3>
It will display properly only if the </h3> before <table is removed.
How can I remove this tag from the code retrieved from the remote page?
If I were in your place, I would store the whole page in a variable before displaying it:
$page = $doc->saveHTML();
Then replace </h3><table with <table:
$page = str_replace('</h3><table', '<table', $page);
Then echo the result:
echo $page;
But I'm afraid the main problem is that a <table> cannot be placed inside an <h3>, which is probably why you don't get the markup you want.
I want to get the text content of the date cell and of Team 1. But the Team 2 cell has the same align attribute as the date cell, so if I echo $date I get the date value together with Team 2... How should I write the conditions?
<table width="100%" cellpadding=2 cellspacing=0 id="tblFixture" border=0>
<tr class=row1 align=center side='home'>
<td align=left>21.09.1928</td>
<td> </td>
<td align='right'><span class='team'>Team 1</span></td>
<td align=left><a href='http://www.foo.com/bar' target='_blank'>Team 2</a></td>
</td>
</tr>
PHP Code:
$url = "http://www.bla.com/bla.html";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$nlig = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\']');
$i = 0;
foreach ($nlig AS $val)
{
$date = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\'][@class=\'row1\']/td[@align=\'left\']')->item($i)->textContent;
$first_team = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\']/td[@align=\'right\']/span[@class=\'team\']')->item($i)->textContent;
echo $date, $first_team, "<br />";
$i++;
}
You can use a regular expression to validate / find the date.
Something like:
preg_match('/<td align=left>([0-9]{2}\.[0-9]{2}\.[0-9]{4})<\/td>/', $html, $matches);
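An alternative that stays within XPath is to query relative to each row, so the date is simply the first td of that row and the second align=left cell never matches. This is a minimal sketch on an inline copy of the markup (attributes quoted for clarity); the $fixtures array and its keys are illustrative names, not part of the original code.

```php
<?php
// Inline, hypothetical copy of the markup from the question.
$html = <<<EOF
<table width="100%" id="tblFixture">
<tr class="row1" align="center" side="home">
<td align="left">21.09.1928</td>
<td> </td>
<td align="right"><span class="team">Team 1</span></td>
<td align="left"><a href="http://www.foo.com/bar" target="_blank">Team 2</a></td>
</tr>
</table>
EOF;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fixtures = array();
foreach ($xpath->query('//table[@id="tblFixture"]//tr[@side="home"]') as $row) {
    // td[1] is the first <td> child of this row only
    $date = $xpath->query('td[1]', $row)->item(0);
    $team = $xpath->query('td/span[@class="team"]', $row)->item(0);
    $fixtures[] = array(
        'date' => $date ? trim($date->textContent) : null,
        'team' => $team ? trim($team->textContent) : null,
    );
}
print_r($fixtures);
```

Passing the row as the second argument to query() also removes the need for the $i counter, because each query is scoped to the current row.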