Trouble scraping table with DOMXPath

I have a table I'm trying to scrape that looks like this:
<table id="thisTable">
<tr>
<td class="value1"></td>
<td class="value2"></td>
<td class="value3"></td>
<td class="value4"></td>
</tr>
<tr>
<td class="value5"></td>
<td class="value6"></td>
</tr>
</table>
and my DOMXPath that looks like this (so far):
$htmlDoc = new DomDocument();
@$htmlDoc->loadHTML($html);
$xpath = new DOMXPath($htmlDoc);
$nodelist = $xpath->query('//*[@id="thisTable"]');
foreach ($nodelist as $n){
echo $n->nodeValue."\n";
}
This works and I get the values of the table, but how do I select cells by their class? Ultimately, my goal is to build a new table, in a single row, from the contents of the value2, value4 and value5 cells.

$htmlDoc = new DomDocument();
$htmlDoc->loadHTML($html);
$xpath = new DOMXPath($htmlDoc);
$nodelist = $xpath->query('//td');
foreach ($nodelist as $n){
echo $n->getAttribute("class")."\n";
}
Note: Use the getAttribute() method to read the value of the class attribute.

Expand your XPath query (note the double quotes, so that PHP interpolates $class):
$class = "value1";
$nodelist = $xpath->query("//*[@id='thisTable']//td[@class='$class']");

I'm not sure I understand correctly, but if you want the text contents of value2, value4 and value5 in a single row, you can use this XPath:
(//td[@class='value2'] | //td[@class='value4'] | //td[@class='value5'])/text()
For example:
<table id="thisTable">
<tr>
<td class="value1"> 1111</td>
<td class="value2"> 222 </td>
<td class="value3">333 </td>
<td class="value4"> 444</td>
</tr>
<tr>
<td class="value5"> 555</td>
<td class="value6"> 666</td>
</tr>
</table>
The output will then be 222 444 555 (each text node keeps its surrounding whitespace).
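To get from that query to the single-row table the question asks for, something like the following might work. It is only a sketch: the sample markup is copied from this answer, and the variable names and the output format of the new table are assumptions.

```php
<?php
// Sample markup from the answer above; in practice $html would be the scraped page.
$html = <<<HTML
<table id="thisTable">
<tr>
<td class="value1"> 1111</td>
<td class="value2"> 222 </td>
<td class="value3">333 </td>
<td class="value4"> 444</td>
</tr>
<tr>
<td class="value5"> 555</td>
<td class="value6"> 666</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select the three wanted cells inside the table with id "thisTable".
$cells = $xpath->query(
    '//table[@id="thisTable"]//td[@class="value2" or @class="value4" or @class="value5"]'
);

// Collect the trimmed text of each cell, in document order.
$values = [];
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}

// Build the new one-row table (hypothetical output format).
$row = '';
foreach ($values as $value) {
    $row .= '<td>' . htmlspecialchars($value) . '</td>';
}
$newTable = "<table><tr>{$row}</tr></table>";

echo $newTable, "\n";
```

Trimming with trim() removes the leading and trailing spaces that the raw text nodes carry.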

Related

XPath PHP parsing HTML table <td> </td> tags

I am trying to parse an HTML table in order to get the content of the <td> ID HERE </td> tags using XPath and PHP.
Executing the following line
$doc->loadHTMLFile($file);
gives me warnings like this:
PHP Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : tr in...
That's why I am using the following block of code:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
Trying to parse this: (the entire page here)
<table class="object-table" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th width="8%">something here</th>
<th width="89%">something here</th>
<th width="3%">something here</th>
</tr>
<tr class="normal-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
<tr class="odd-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
</tbody>
</table>
with the following code:
$file = "http://www.sportsporudy.gov.ua/catalog/#c[1]=1";
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$query = '//tr[@class="odd-row"]';
$elements = $xpath->query($query);
printf("Size of array: %d\n", sizeof($elements));
printElements($elements);
and tried using different queries like
//table[@class="object-table"]/tbody/tr ...
but none of them seems to give me the td tags I need. Maybe that's because of the broken HTML.
Thanks for your advice.

Substantially, your code is fine.
The only error I've found is in how you print the length of $elements: it is a DOMNodeList, not an array, so to retrieve its length you have to use this syntax:
printf( "Size of array: %d\n", $elements->length );
The major problem with your page, however, is that the HTML contains only one table with one row. The remaining data is filled in by JavaScript, so you can't retrieve it directly through DOMXPath.
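For a page whose rows are actually present in the HTML source, the corrected length check and a per-row lookup could look roughly like this. The inline markup and the ID values are made up for illustration; they are not taken from the site in the question.

```php
<?php
// Hypothetical static markup standing in for a page whose rows exist in the HTML source.
$html = '<table class="object-table"><tbody>'
      . '<tr class="normal-row"><td>101</td><td>first</td></tr>'
      . '<tr class="odd-row"><td>102</td><td>second</td></tr>'
      . '<tr class="odd-row"><td>104</td><td>fourth</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$elements = $xpath->query('//tr[@class="odd-row"]');

// DOMNodeList exposes its size through the length property.
printf("Number of rows: %d\n", $elements->length);

$ids = [];
foreach ($elements as $tr) {
    // Relative query: the first <td> of the current row.
    $ids[] = trim($xpath->query('td[1]', $tr)->item(0)->textContent);
}
echo implode(', ', $ids), "\n";
```

Passing the row as the second argument to query() keeps the inner expression relative to that row.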

How to select the Last elements attribute and change its value on html with php

HTML file (Demo)
<table>
<tr>
<td rowspan="1"> 1 </td>
<td rowspan="1"> 2 </td>
<td rowspan="1"> 3 </td>
</tr>
</table>
When I run the PHP file, it should find the last rowspan attribute in the HTML file and change its value to 2.

$html = new DOMDocument();
$html->loadHTMLFile($file);
// find the last td by xpath and set value
$xpath = new DOMXpath($html);
$xpath->query("(//td)[last()]")->item(0)->setAttribute('rowspan', '2');
echo $html->saveHTML();
UPDATE
$xpath = new DOMXpath($html);
$tds = $xpath->query("//td[position()=last() or (position()=last()-1)]");
foreach($tds as $td)
$td->setAttribute('rowspan', 5);
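As a self-contained illustration of the first variant, here is a sketch that runs against the demo table from the question, supplied as an inline string instead of a file. The parentheses in (//td)[last()] matter: they pick the last td in the whole document, whereas //td[last()] would pick the last td of every row.

```php
<?php
// Demo markup from the question, inlined so the sketch needs no input file.
$markup = '<table><tr>'
        . '<td rowspan="1"> 1 </td>'
        . '<td rowspan="1"> 2 </td>'
        . '<td rowspan="1"> 3 </td>'
        . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Last <td> in document order; item(0) is null when nothing matches.
$last = $xpath->query('(//td)[last()]')->item(0);
if ($last instanceof DOMElement) {
    $last->setAttribute('rowspan', '2');
}

// Print only the table rather than the full document wrapper added by loadHTML().
echo $doc->saveHTML($doc->getElementsByTagName('table')->item(0)), "\n";
```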

PHP parsing won't find "span" tags

I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMNodeList object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?

XPath is not guilty at all.
The span tags are added dynamically. Don't look at the DOM structure in the browser inspector, which may already have been modified by JavaScript. Use "view-source:" instead, and you will see exactly the HTML that XPath gets to parse.
It would be a good idea to look at the table with the class tablelines; it probably contains everything you need.
You should skip the "maincolor" and "tableheader" rows and start processing at the rows with the "light" class.
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//tr[@class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMNodeList object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
I have probably found what you need, and in convenient JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux 
Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}
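A response in that shape can be read with json_decode() rather than DOMXPath. The sketch below uses a hard-coded excerpt of the JSON above, trimmed to two games and a subset of the fields, so that it runs without a network request. In real use the string would presumably come from file_get_contents() on that URL.

```php
<?php
// Excerpt of the response shown above, trimmed to two games and a subset of fields.
$json = '{"games_list":['
      . '{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4",'
      . '"awayteam":"Muskegon","awayscore":"2","gamedate":"15\/05"},'
      . '{"status":"FINAL","hometeam":"Muskegon","homescore":"1",'
      . '"awayteam":"Sioux Falls","awayscore":"6","gamedate":"10\/05"}'
      . ']}';

// Decode into associative arrays.
$data = json_decode($json, true);
if (!is_array($data) || !isset($data['games_list'])) {
    throw new RuntimeException('Unexpected JSON structure');
}

// One summary line per game: date, away team and score, home team and score, status.
$lines = [];
foreach ($data['games_list'] as $game) {
    $lines[] = sprintf(
        '%s: %s %s - %s %s (%s)',
        $game['gamedate'],
        $game['awayteam'],
        $game['awayscore'],
        $game['hometeam'],
        $game['homescore'],
        $game['status']
    );
}
echo implode("\n", $lines), "\n";
```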

remove HTML tag by content

I have this table in output from a program (string converted in a DomDocument in PHP):
<table>
<tr>
<td width="50">Â </td>
<td>My content</td>
<td width="50">Â </td>
</tr>
</table>
I need to remove the two <td width="50">Â </td> tags (I don't know why the program adds them, but they are there), so that the result looks like this:
<table>
<tr>
<td>My content</td>
</tr>
</table>
What's the best way to do it in PHP?
Edit:
the program is JasperReport Server. I call the report rendering function via web application:
//this is the call to server library for generate the report
$reportGen = $reportServer->runReport($myReport);
$domDoc = new \DomDocument();
$domDoc->loadHTML($reportGen);
return $domDoc->saveHTML($domDoc->getElementsByTagName('table')->item(0));
This returns the table shown above, which I need to fix.

Try this
<?php
$domDoc = new DomDocument();
$domDoc->loadHTML($reportGen);
$xpath = new DOMXpath($domDoc);
$tags = $xpath->query('//td');
foreach($tags as $tag) {
$value = $tag->nodeValue;
if(preg_match('/^(Â )/',$value))
$tag->parentNode->removeChild($tag);
}
?>

Regex and replace:
$var = '<table>
<tr>
<td width="50">Ã</td>
<td>My interesting content</td>
<td width="50">Ã</td>
</tr>
</table>';
$final = preg_replace('#(<td width="50".*?>).*?(</td>)#', '$1$2', $var);
$final = str_replace('<td width="50"></td>', '', $final);
echo $final;
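A DOM-based alternative that avoids matching on the garbled character is to drop every width="50" cell that is empty once whitespace is stripped. This sketch assumes the spacer cells really contain only a non-breaking space (which is often what shows up as "Â " when the encoding is misread); the sample string is a stand-in for the report output.

```php
<?php
// Stand-in for the report output; the spacer cells are assumed to hold only &nbsp;.
$reportGen = '<table><tr>'
           . '<td width="50">&nbsp;</td>'
           . '<td>My content</td>'
           . '<td width="50">&nbsp;</td>'
           . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($reportGen);
$xpath = new DOMXPath($doc);

// Copy the matches into an array first, then remove nodes while looping over the copy.
foreach (iterator_to_array($xpath->query('//td[@width="50"]')) as $td) {
    // Strip ordinary whitespace and non-breaking spaces (U+00A0) before testing for emptiness.
    $text = preg_replace('/[\s\x{00A0}]+/u', '', $td->textContent);
    if ($text === '') {
        $td->parentNode->removeChild($td);
    }
}

// Print only the cleaned table.
echo $doc->saveHTML($doc->getElementsByTagName('table')->item(0)), "\n";
```

Cells with width="50" that carry real content are left alone, because only cells that are empty after stripping are removed.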

php regex or html dom parsing

I use regex for HTML parsing but I need your help to parse the following table:
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
I want to get all domains with their volume (in an array or variable), for example:
http://xyz.com - 210,779,783
Should I use regex or an HTML DOM parser in this case? I don't know how to parse a large table. Can you please help? Thanks.

Here's an XPath example that parses the HTML from the question.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile("./input.html");
$xpath = new DOMXPath($dom);
$trs = $xpath->query("//table[@class='resultstable'][1]/tr");
foreach ($trs as $tr) {
$tdList = $xpath->query("td[2]/a", $tr);
if ($tdList->length == 0) continue;
$name = $tdList->item(0)->nodeValue;
$tdList = $xpath->query("td[3]", $tr);
$vol = $tdList->item(0)->childNodes->item(0)->nodeValue;
echo "name: {$name}, vol: {$vol}\n";
}
?>
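If the domain sits directly in the second cell as plain text, as in the markup shown in the question, a variant along these lines may be closer to what is needed. It collects the pairs into an array keyed by domain. The inline sample and the $volumes name are assumptions made for the sketch.

```php
<?php
// Table from the question, inlined so the sketch needs no input file.
$html = <<<HTML
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$volumes = [];
// tr[td] skips the header row, which only contains <th> cells.
foreach ($xpath->query("//table[@class='resultstable']//tr[td]") as $tr) {
    // Whole text of the second cell, trimmed of the surrounding newlines.
    $name = trim($xpath->evaluate('string(td[2])', $tr));
    // First text node of the third cell, i.e. the part before the <br>.
    $vol = trim($xpath->evaluate('string(td[3]/text()[1])', $tr));
    $volumes[$name] = $vol;
}

foreach ($volumes as $name => $vol) {
    echo "{$name} - {$vol}\n";
}
```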
