PHP DOMXPath - Can't target the right node

I know this is probably covered in other threads, but I've searched all over StackOverflow and tried many solutions without success, which is why I'm asking.
With this html:
<div class="someclass">
<table>
<tbody>
<tr>
<th class="state">Status</th>
<th class="name">Name</th>
<th class="type">Type</th>
<th class="length">Length</th>
<th class="height">Height</th>
</tr>
<tr>
<td class="state state2"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2000 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2250 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">3000 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state2"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2250 m</td>
<td class="height"></td>
</tr>
</tbody>
</table>
</div>
Now, this is the PHP code I have so far :
$dom = new DOMDocument();
$dom->loadHtmlFile('http://www.whatever.com');
$dom->preserveWhiteSpace = false;
$xp = new DOMXPath($dom);
$col = $xp->query('//td[contains(@class, "state1") and (contains(@class, "state"))]');
$length = 0;
foreach( $col as $n ) {
$parent = $n->parentNode;
$length += $parent->childNodes->item(3)->nodeValue;
}
echo 'Length: ' . $length;
I need to:
1.- Sum the 'length' values so I can echo the total, stripping the ' m' suffix from each value.
2.- Understand what I'm getting wrong with the 'parentNode', 'childNodes' and 'item()' parts. After many tries, all I get is 'Length: 0'.
I know this isn't the place to get a full detailed explanation, but it is really hard to find tutorials targeting these specific issues. It would be great if someone could give some advice on where I can get this information.
Thanks very much in advance.
Edited the 'Concat' part for simplicity.

Navigation through DOMDocument for a specified childNode value by using DOMXpath
function getInt($string)
{
preg_match("/[0-9]+/i", $string, $val);
$out = 0;
if (isset($val) && !empty($val))
{
$out = $val[0];
}
return intval($out);
}
$dom = new DOMDocument();
$dom->loadHtml($html);
$dom->preserveWhiteSpace = false;
$xp = new DOMXPath($dom);
$length = 0;
foreach($xp->query('//td[@class="state state1"]/following-sibling::*[3]') as $element)
{
$value = $element->nodeValue;
$length += getInt($value);
}
echo $length;
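On the second part of the question: childNodes may also include whitespace text nodes, so a positional lookup such as item(3) can land on a different cell than expected. A minimal sketch that avoids counting positions by selecting the length cell by class, relative to each matching row. The three-row table is a trimmed, hypothetical sample of the question's markup, and the (int) cast stands in for the getInt() helper above:

```php
<?php
// Hypothetical sample: three rows trimmed from the question's table.
$html = <<<HTML
<table>
<tr>
<td class="state state2"></td>
<td class="name"></td>
<td class="length">2000 m</td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="length">2250 m</td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="length">3000 m</td>
</tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Token-safe class test: matches "state1" as a whole class name,
// so a class such as "state10" would not match.
$rows = $xp->query('//tr[td[contains(concat(" ", normalize-space(@class), " "), " state1 ")]]');

$length = 0;
foreach ($rows as $row) {
    // Query relative to the row instead of indexing into childNodes.
    $cell = $xp->query('td[contains(concat(" ", normalize-space(@class), " "), " length ")]', $row)->item(0);
    if ($cell !== null) {
        // The cast keeps the leading digits: "2250 m" becomes 2250.
        $length += (int) $cell->nodeValue;
    }
}
echo 'Length: ' . $length, PHP_EOL;
```

With this sample, only the two state1 rows are summed.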

Related

PHP XPath to parse table

Firstly here is my table HTML:
<table class="xyz">
<caption>Outcomes</caption>
<thead>
<tr class="head">
<th title="a" class="left" nowrap="nowrap">A1</th>
<th title="a" class="left" nowrap="nowrap">A2</th>
<th title="result" class="left" nowrap="nowrap">Result</th>
<th title="margin" class="left" nowrap="nowrap">Margin</th>
<th title="area" class="left" nowrap="nowrap">Area</th>
<th title="date" nowrap="nowrap">Date</th>
<th title="link" nowrap="nowrap">Link</th>
</tr>
</thead>
<tbody>
<tr class="data1">
<td class="left" nowrap="nowrap">56546</td>
<td class="left" nowrap="nowrap">75666</td>
<td class="left" nowrap="nowrap">Lower</td>
<td class="left" nowrap="nowrap">High</td>
<td class="left">Area 3</td>
<td nowrap="nowrap">Jan 2 2016</td>
<td nowrap="nowrap">http://localhost/545436</td>
</tr>
<tr class="data1">
<td class="left" nowrap="nowrap">55546</td>
<td class="left" nowrap="nowrap">71666</td>
<td class="left" nowrap="nowrap">Lower</td>
<td class="left" nowrap="nowrap">High</td>
<td class="left">Area 4</td>
<td nowrap="nowrap">Jan 3 2016</td>
<td nowrap="nowrap">http://localhost/545437</td>
</tr>
...
And there are many more <tr> after that.
I am using this PHP code:
$html = file_get_contents('http://localhost/outcomes');
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('', 'http://www.w3.org/1999/xhtml');
$elements = $xpath->query("//table[@class='xyz']");
How can I, now that I have the table as the first element in $elements, get the values of each <td>?
Ideally I want to get arrays like:
array(56546, 75666, 'Lower', 'High', 'Area 3', 'Jan 2 2016', 'http://localhost/545436'),
array(55546, 71666, 'Lower', 'High', 'Area 4', 'Jan 3 2016', 'http://localhost/545437'),
...
But I'm not sure how I can dig that deeply into the table code.
Thank you for any advice.

First, get all the table rows in the <tbody>
$rows = $xpath->query('//table[@class="xyz"]/tbody/tr');
Then, you can iterate over that collection and query for each <td>
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
// alt $cells = $xpath->query('td', $row)
$cellData = [];
foreach ($cells as $cell) {
$cellData[] = $cell->nodeValue;
}
var_dump($cellData);
}
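To end up with one array per row, as the question asks, the same loop can collect the cells instead of dumping them. A self-contained sketch under stated assumptions: the two-row table is a shortened, hypothetical version of the question's markup, and $table is a name introduced here for illustration:

```php
<?php
// Hypothetical sample: a shortened version of the question's table.
$html = <<<HTML
<table class="xyz">
<tbody>
<tr class="data1"><td>56546</td><td>Lower</td><td>Area 3</td></tr>
<tr class="data1"><td>55546</td><td>Lower</td><td>Area 4</td></tr>
</tbody>
</table>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$table = [];
foreach ($xpath->query('//table[@class="xyz"]/tbody/tr') as $row) {
    $cells = [];
    // Relative query: only the <td> children of the current row.
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = trim($cell->nodeValue);
    }
    $table[] = $cells;
}
print_r($table);
```

Every value comes back as a string; cast the numeric columns afterwards if integers are needed.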

PHP parsing won't find "span" tags

I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?

XPath is not to blame here.
The span tags are added dynamically. Look at the page's source code rather than the DOM structure shown in the browser's inspector, which may already have been modified by JavaScript. Use "view-source:" and you will see exactly the HTML that XPath parses.
It would be a good idea to look at the table with class tablelines; it probably has everything you need.
Skip the "maincolor" and "tableheader" rows and start processing at the rows with the "light" class.
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//tr[@class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMNodeList object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
I may have found what you need, already in JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux 
Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}
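If that endpoint is used, no HTML parsing is needed at all. A rough sketch of reading the games out of such a response with json_decode. The $json string below is a hand-shortened copy of the first entry shown above, not a live response, and in practice it would come from file_get_contents on the URL:

```php
<?php
// Shortened sample of the response above; the real payload has more keys.
$json = '{"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4",'
      . '"awayteam":"Muskegon","awayscore":"2"}],"leagueshortname":"USHL"}';

// true = decode objects as associative arrays.
$data = json_decode($json, true);

$lines = [];
// games_list can be null in this API, so fall back to an empty array.
foreach ($data['games_list'] ?? [] as $game) {
    $lines[] = sprintf(
        '%s %s - %s %s',
        $game['awayteam'],
        $game['awayscore'],
        $game['homescore'],
        $game['hometeam']
    );
}
echo implode(PHP_EOL, $lines), PHP_EOL;
```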

remove HTML tag by content

I have this table in output from a program (string converted in a DomDocument in PHP):
<table>
<tr>
<td width="50">Â </td>
<td>My content</td>
<td width="50">Â </td>
</tr>
</table>
I need to remove the two <td width="50"> cells (I don't know why the program adds them, but there they are) so that the result looks like this:
<table>
<tr>
<td>My content</td>
</tr>
</table>
What's the best way to do it in PHP?
Edit:
The program is JasperReport Server. I call the report rendering function from a web application:
//this is the call to the server library that generates the report
$reportGen = $reportServer->runReport($myReport);
$domDoc = new \DomDocument();
$domDoc->loadHTML($reportGen);
return $domDoc->saveHTML($domDoc->getElementsByTagName('table')->item(0));
This returns the table above, which is the one I need to fix.

Try this
<?php
$domDoc = new DomDocument();
$domDoc->loadHTML($reportGen);
$xpath = new DOMXpath($domDoc);
$tags = $xpath->query('//td');
foreach($tags as $tag) {
$value = $tag->nodeValue;
if(preg_match('/^(Â )/',$value))
$tag->parentNode->removeChild($tag);
}
?>
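A variant of the same idea, assuming the "Â" in the output is a mis-decoded non-breaking space (U+00A0), which is a guess based on how that character usually shows up. The sketch removes the width="50" cells whose text is empty or only whitespace; the sample markup uses &nbsp; to stand in for the report output:

```php
<?php
// Hypothetical sample: &nbsp; stands in for the stray character.
$html = '<table><tr><td width="50">&nbsp;</td><td>My content</td>'
      . '<td width="50">&nbsp;</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Copy the matches into an array before modifying the tree.
$cells = iterator_to_array($xpath->query('//td[@width="50"]'));
foreach ($cells as $cell) {
    // The u flag makes the pattern UTF-8 aware; \x{00A0} is the non-breaking space.
    if (preg_match('/^[\s\x{00A0}]*$/u', $cell->nodeValue)) {
        $cell->parentNode->removeChild($cell);
    }
}

$result = $doc->saveHTML($doc->getElementsByTagName('table')->item(0));
echo $result, PHP_EOL;
```

Note that this pattern also removes width="50" cells that are completely empty, which seems to be what the question wants.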

Regex and replace:
$var = '<table>
<tr>
<td width="50">Ã</td>
<td>My interesting content</td>
<td width="50">Ã</td>
</tr>
<table>';
$final = preg_replace('#(<td width="50".*?>).*?(</td>)#', '$1$2', $var);
$final = str_replace('<td width="50"></td>', '', $final);
echo $final;

Get text between repeating <tr></tr> tags

I've got a headache trying to solve this problem. I have a structure like this:
<tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">17-Aug-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">5 PM</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="teams.asp?teamno=766&leagueNo=115">XYZ Club FC</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">vs</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/orange.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="teams.asp?teamno=632&leagueNo=115">ABC Football Club</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="pitches.asp?id=151" class=list><u>APSM Pitch </u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="matchpreview_frame.asp?matchno=20877"><img src="img/matchpreview_symbol.gif" border="0"></a></td>
</tr>
This format repeats many times with different text content, and sometimes the content is similar. I need to extract ONLY the FIRST row in this format that contains "ABC Football Club" (it can appear many more times later on). How do I do that and extract the text of each cell?
Thanks for the comments. I've edited the question to add the code I tried:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'url link');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$trs = $xpath->query('//tr/td[contains(.,'ABC Football Club')]');
$rows = array();
foreach($trs as $tr)
$rows[] = innerHTML($tr, true); // this function I don't include here
print_r($rows);
However, this doesn't work! :(

Find the first TR containing $needle
$needle = "ABC Football Club";
$doc = new DOMDocument();
$doc->loadHTML($html);
$trs = $doc->getElementsByTagName('tr');
foreach($trs as $current_tr)
{
$tr_content = $doc->saveXML($current_tr);
if(strpos($tr_content, $needle) !== FALSE)
{
break;
}
else
{
$tr_content= "";
}
}
echo $tr_content;
Find the first TR containing $needle and, if the tables are nested, the TR closest to the needle.
That can be solved by simply repeating the process:
$needle = "ABC Football Club";
$doc = new DOMDocument();
$doc->loadHTML($html);
$node = $doc;
do
{
$trs = $node->getElementsByTagName('tr');
$node = NULL;
foreach($trs as $current_tr)
{
$tr_content = $doc->saveXML($current_tr);
if(strpos($tr_content, $needle) !== FALSE)
{
$node = $current_tr;
$found_tr = $node;
$found_tr_content = $tr_content;
break;
}
}
} while($node);
echo $found_tr_content;
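The same lookup can also be written as a single XPath expression. This is an alternative sketch, not the approach used above: the three-row table is a hypothetical sample, and the parenthesised (...)[1] picks the first matching row in document order:

```php
<?php
// Hypothetical sample: the needle appears in the second and third rows.
$html = <<<HTML
<table>
<tr><td>10-Aug-2013</td><td><a href="#">Other Club</a></td><td><a href="#">XYZ Club FC</a></td></tr>
<tr><td>17-Aug-2013</td><td><a href="#">XYZ Club FC</a></td><td><a href="#">ABC Football Club</a></td></tr>
<tr><td>24-Aug-2013</td><td><a href="#">ABC Football Club</a></td><td><a href="#">Other Club</a></td></tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// First <tr> that has a <td> child whose text contains the needle.
$row = $xpath->query('(//tr[td[contains(., "ABC Football Club")]])[1]')->item(0);

$texts = [];
if ($row !== null) {
    foreach ($xpath->query('td', $row) as $td) {
        $texts[] = trim($td->textContent);
    }
}
print_r($texts);
```

With nested tables an outer row would match first, so the repeat-until-innermost loop above is still the safer choice for that case.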

In phpquery you would:
$dom = phpQuery::newDocument($html);
$dom->find('tr:has(> td:contains("ABC Football Club"))')->eq(0);

To get the TDs of the first TR, you can use:
$doc = new DOMDocument();
$doc->loadHTML($html);
$trs = $doc->getElementsByTagName('tr');
$td_of_the_first_tr = $trs->item(0)->getElementsByTagName('td');
foreach($td_of_the_first_tr as $current_td)
{
echo $doc->saveXML($current_td) . PHP_EOL;
}

php regex or html dom parsing

I use regex for HTML parsing but I need your help to parse the following table:
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
I want to get all domains with their volume(in array or var) for example
http://xyz.com - 210,779,783
Should I use regex or HTML DOM in this case? I don't know how to parse a large table. Can you please help? Thanks.

Here's an XPath example that happens to parse the HTML from the question.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile("./input.html");
$xpath = new DOMXPath($dom);
$trs = $xpath->query("//table[@class='resultstable'][1]/tr");
foreach ($trs as $tr) {
$tdList = $xpath->query("td[2]/a", $tr);
if ($tdList->length == 0) continue;
$name = $tdList->item(0)->nodeValue;
$tdList = $xpath->query("td[3]", $tr);
$vol = $tdList->item(0)->childNodes->item(0)->nodeValue;
echo "name: {$name}, vol: {$vol}\n";
}
?>
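A variation that collects the pairs into an array, as the question asks. It assumes the domain is plain text inside the second cell, as in the table shown in the question, so it reads td[2] directly rather than td[2]/a. The inline sample is abridged and $volumes is a name introduced here:

```php
<?php
// Abridged sample of the table from the question.
$html = <<<HTML
<table class="resultstable">
<tr><th>#</th><th></th><th>External Volume</th></tr>
<tr class="odd"><td>1</td><td>
http://xyz.com
</td><td>210,779,783<br />(939,265 / 499,584)</td></tr>
<tr class="even"><td>2</td><td>
http://abc.com
</td><td>57,450,834<br />(288,915 / 62,935)</td></tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$volumes = [];
// tr[td] skips the header row, which only has <th> cells.
foreach ($xpath->query('//table[@class="resultstable"]//tr[td]') as $tr) {
    $name = trim($xpath->evaluate('string(td[2])', $tr));
    // text()[1] is the part before the <br />, i.e. the volume alone.
    $vol = trim($xpath->evaluate('string(td[3]/text()[1])', $tr));
    $volumes[$name] = $vol;
}
print_r($volumes);
```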
