PHP DOM / XPath - php

Hopefully this is a simple question for someone who has done it before!
I have a set of old web documents in table format containing lots of contact details. So far I have managed to create a PHP script that parses the XHTML document and pulls out the old client contact details.
An example of the document format:
<tr>
<td bgcolor="#CCCCCC" valign="top">Indigo Blue 123</td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" align="top"><font class="details">123 Blue House</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" align="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">Hanley</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">ST13 4SN</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">Stoke on Trent</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">01875 322511</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top">www.indigoblue123.org.uk</td>
<td bgcolor="#CCCCCC"></td>
</tr>
What I need to do is parse all of these contact details into an array. The things I'm not sure how to do are keeping the empty blocks as empty array entries (i.e. Address 2 and Address 3 will be blank, but I need to know this) and grabbing the web address from the <a>..</a> block.
So far I have figured out that all populated data has class=details in some form. However, as I mentioned before, I'm not sure what the best way to achieve the overall result is. There are around 20-40 entries in the different files I have.
I have managed the basics with this so far:
<?php
print '<pre>';
$html = file_get_contents('old-contacts.xhtml');
// Create new DOM object:
$dom = new DomDocument();
// Load HTML code:
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$details = $xpath->query("//table/tbody/tr[td/font/@class = 'details']");
for ($i = 0; $i < $details->length; $i++) {
$data[$i]['data'] = $details->item($i)->nodeValue;
echo $data[$i]['data'];
}
print '</pre>';
?>
Any help would be great!
Thanks

I believe you are looking for something like this:
$nodes = $xpath->query('//table/tbody/tr/td[@align="top"] |
//table/tbody/tr/td[@valign="top"]');
$data = array();
foreach ($nodes as $node) {
$data[] = $node->textContent;
}
This would give you:
Array
(
[0] => Indigo Blue 123
[1] => 123 Blue House
[2] =>
[3] =>
[4] => Hanley
[5] =>
[6] => ST13 4SN
[7] => Stoke on Trent
[8] => 01875 322511
[9] =>
[10] => www.indigoblue123.org.uk
)
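The question also asked about keeping blank cells as empty entries and about reading the web address from the <a> block. Here is a minimal sketch of one way to do both, not a definitive implementation. The sample row is shortened, and the <a href> markup in it is an assumption, because the example document above does not show the anchor.

```php
<?php
// Hypothetical sample row: the <a href> markup is an assumption, since the
// question mentions an <a>..</a> block that the example document does not show.
$html = <<<HTML
<table><tbody><tr>
<td valign="top">Indigo Blue 123</td>
<td></td>
<td align="top"><font class="details">123 Blue House</font></td>
<td></td>
<td valign="top"></td>
<td></td>
<td valign="top"><a href="http://www.indigoblue123.org.uk">www.indigoblue123.org.uk</a></td>
</tr></tbody></table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$contacts = array();
foreach ($xpath->query('//tr') as $row) {
    $fields = array();
    // Data cells carry align="top" or valign="top"; spacer cells carry neither.
    foreach ($xpath->query('td[@align="top" or @valign="top"]', $row) as $cell) {
        $link = $xpath->query('.//a/@href', $cell)->item(0);
        // Prefer the link target when the cell holds an anchor; otherwise use
        // the trimmed text, which is '' for an empty cell.
        $fields[] = $link ? $link->nodeValue : trim($cell->textContent);
    }
    // One array per table row, i.e. one per contact.
    $contacts[] = $fields;
}
print_r($contacts);
```

Grouping by row keeps each contact in its own array, so a blank Address 2 or Address 3 stays at a predictable position.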

This is exactly what I was looking for, and it worked perfectly.
I created a function to extract the markup and save it as HTML:
function clean_web_source($web_source) {
$dom = new DOMDocument();
@$dom->loadHTML($web_source);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//table[@width="580"]');
$data = array();
foreach ($nodes as $node) {
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($node, true));
$data[] = trim($tmp_dom->saveHTML()); // Before using saveHTML() I used textContent and print_r($data) to identify the array position that interested me.
}
return $data[2]; // The markup at position 2 is what I want.
}
$url = "http://www.theurl.com/?param=1&lang=1";
$web_source = file_get_contents($url);
$target_source = clean_web_source($web_source); // What I was looking for.
Thanks.

Related

PHP DOMXPath - Can't target the right node

I know this is probably covered in other threads, but I've been searching all over StackOverflow and tried many solutions, this is why I'm asking.
With this html:
<div class="someclass">
<table>
<tbody>
<tr>
<th class="state">Status</th>
<th class="name">Name</th>
<th class="type">Type</th>
<th class="length">Length</th>
<th class="height">Height</th>
</tr>
<tr>
<td class="state state2"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2000 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2250 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">3000 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state2"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2250 m</td>
<td class="height"></td>
</tr>
</tbody>
</table>
</div>
Now, this is the PHP code I have so far :
$dom = new DOMDocument();
$dom->loadHtmlFile('http://www.whatever.com');
$dom->preserveWhiteSpace = false;
$xp = new DOMXPath($dom);
$col = $xp->query('//td[contains(@class, "state1") and (contains(@class, "state"))]');
$length = 0;
foreach( $col as $n ) {
$parent = $n->parentNode;
$length += $parent->childNodes->item(3)->nodeValue;
}
echo 'Length: ' . $length;
I need to:
1.- Sum the 'length' values so I can echo the total, getting rid of the ' m' substring in each value.
2.- Understand what I'm getting wrong with the 'parentNode', 'childNodes' and 'item()' parts. After many tries I only ever get 'Length: 0'.
I know this isn't the place to get a full detailed explanation, but it is really hard to find tutorials targeting these specific issues. It would be great if someone could give some advice on where I can get this information.
Thanks very much in advance.
Edited the 'Concat' part for simplicity.
Navigation through DOMDocument for a specified childNode value by using DOMXpath
function getInt($string)
{
preg_match("/[0-9]+/i", $string, $val);
$out = 0;
if (isset($val) && !empty($val))
{
$out = $val[0];
}
return intval($out);
}
$dom = new DOMDocument();
$dom->loadHtml($html);
$dom->preserveWhiteSpace = false;
$xp = new DOMXPath($dom);
$length = 0;
foreach($xp->query('//td[@class="state state1"]/following-sibling::*[3]') as $element)
{
$value = $element->nodeValue;
$length += getInt($value);
}
echo $length;
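On the second point of the question: childNodes is not limited to the td elements. It can also contain whitespace-only text nodes between them, so a fixed index such as item(3) is not guaranteed to land on the "length" cell. As a rough sketch of an alternative (the sample rows are a shortened copy of the HTML in the question, and the class-matching expression is just one common idiom), you can query the cell by class relative to its row:

```php
<?php
$html = <<<HTML
<table>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2250 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state2"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">2000 m</td>
<td class="height"></td>
</tr>
<tr>
<td class="state state1"></td>
<td class="name"></td>
<td class="type t18"></td>
<td class="length">3000 m</td>
<td class="height"></td>
</tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$total = 0;
// Match "state1" as a whole class token, so "state10" would not match.
$query = '//td[contains(concat(" ", normalize-space(@class), " "), " state1 ")]';
foreach ($xp->query($query) as $cell) {
    $row = $cell->parentNode;
    // Select the cell by class relative to the row instead of by position,
    // so whitespace text nodes between the cells do not matter.
    $lengthCell = $xp->query('td[@class="length"]', $row)->item(0);
    if ($lengthCell !== null) {
        // (int) reads the leading digits and ignores the trailing " m".
        $total += (int) $lengthCell->nodeValue;
    }
}
echo 'Length: ' . $total . "\n";
```

The second argument to query() is the context node, which is what makes the 'td[...]' expression relative to the current row.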

PHP parsing won't find "span" tags

I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMNodeList)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?
XPath is not at fault here.
The span tags are added dynamically. Look at the page's source code rather than the DOM structure shown in the browser, which may already have been modified by JavaScript: use "view-source:" and you will see exactly the HTML that XPath gets to parse.
It would be a good idea to look at the table with class tablelines; it probably contains everything you need.
You should skip the "maincolor" and "tableheader" rows and start processing at the rows with the "light" class.
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//tr[@class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMNodeList)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
Probably I have found what you need, and even in nice JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}
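If that JSON endpoint suits you, no HTML parsing is needed at all. Here is a small sketch of reading it with json_decode. The $json string is a shortened, two-game copy of the response above; the variable names and the output format are my own choices, and a real script would fetch the full body from the URL instead.

```php
<?php
// Shortened copy of the response above (two games only); a real script would
// fetch the full body, e.g. with file_get_contents($url).
$json = '{"games_list":[
  {"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","gamedate":"15\/05"},
  {"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","gamedate":"10\/05"}
],"leagueshortname":"USHL"}';

// true = decode objects into associative arrays.
$data = json_decode($json, true);
if (!is_array($data) || empty($data['games_list'])) {
    exit("No games found\n");
}

$lines = array();
foreach ($data['games_list'] as $game) {
    // Format: date: away team and score - home team and score (status)
    $lines[] = sprintf(
        '%s: %s %s - %s %s (%s)',
        $game['gamedate'],
        $game['awayteam'],
        $game['awayscore'],
        $game['hometeam'],
        $game['homescore'],
        $game['status']
    );
}
echo implode("\n", $lines), "\n";
```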

Looping with XPath in php to get tables values

I have this html table:
<tbody>
<tr>..</tr>
<tr>
<td class="tbl_black_n_1">1</td>
<td class="tbl_black_n_1" nowrap="" align="center">23/07/14 08:10</td>
<td class="tbl_black_n_1">
<img src="http://www.betonews.com/img/SportId389.gif" width="10" height="10" border="0" alt="">
</td>
<td class="tbl_black_n_1"></td>
<td class="tbl_black_n_1" nowrap="" align="center">BAK WS</td>
<td class="tbl_black_n_1" nowrap="" align="right">M. Eguchi</td>
<td class="tbl_black_n_1" align="center">-</td>
<td class="tbl_black_n_1" nowrap="">Radwanska U. </td>
<td class="tbl_black_n_1" align="center" title=" ">1,02</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" "> </td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" ">55,00</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="right">86%</td>
<td class="tbl_black_n_1" align="right">-</td>
<td class="tbl_black_n_1" align="right">14%</td>
<td class="tbl_black_n_1" align="center" title=" ">524.647</td>
<td class="tbl_black_n_1" nowrap="">
<img src="http://www.betonews.com//img/i_betfair.gif" width="12" height="10" border="0" alt="">
<img src="http://www.betonews.com//img/i_history.gif" width="12" height="10" border="0" alt="">
</td>
</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
...
</tbody>
There are more than one hundred <tr> elements structured the same way, each containing lots of <td> elements. How can I loop over them with XPath to store all the data in a database? I don't want the first <tr>: the query has to begin with the second <tr> (the one I have shown).
This is my PHP code, but I can't get any further. Help!
<?php
$url = 'http://www.betonews.com/table.asp?tp=2001&lang=en&dd=23&dm=7&dy=2014&df=1&dw=3';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
$document = new DOMDocument();
$document->loadHTML($response);
$xpath = new DOMXPath($document);
$expression = '/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
???
}
This is what I want to be the final result:
[0] => Array
(
[date] => 23/07/14 08:10
[image] => http://www.betonews.com/img/SportId389.gif
[team1] => M. Eguchi
[team2] => Radwanska U.
[1] => 1,02
[x] => 0
[2] => 55,00
[1%] => 86%
[x%] => 0
[2%] => 14%
[total] => 524.647
)
I would use a different XPath to select the table. First, absolute paths are always a problem with tables like this, because tbody elements are often added by the browser without actually being present in the document, i.e. they are not visible to the PHP code. Second, if anything in the source HTML changes in terms of styling, your code breaks. Here I select the first table with a cellpadding of 3. This is not optimal, but there wasn't any obvious unique identifier.
Apart from that, you can simply iterate over the DOMNodeList result and then get the correct child nodes. Note that the item indexes increase by two, because the whitespace-only text nodes between the elements are also nodes in the DOM.
$xpath = new DOMXPath($document);
$expression = '(//table[@cellpadding="3"])[1]/tr[position() > 1]';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
$td = $row->childNodes;
$result["date"] = $td->item(2)->nodeValue;
$result["image"] = $td->item(4)->firstChild->attributes->getNamedItem("src")->nodeValue;
$result["team1"] = $td->item(10)->nodeValue;
$result["team2"] = $td->item(12)->nodeValue;
$result["1"] = $td->item(14)->nodeValue;
$result["x"] = $td->item(16)->nodeValue;
$result["2"] = $td->item(18)->nodeValue;
$result["1%"] = $td->item(20)->nodeValue;
$result["x%"] = $td->item(22)->nodeValue;
$result["2%"] = $td->item(24)->nodeValue;
$result["total"] = $td->item(26)->nodeValue;
$results[] = $result;
}
For the image, you have to do some more processing, because you do not want the actual text but the src attribute of the <img> element instead.
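If counting around the whitespace nodes feels fragile, another option is to query the td elements relative to each row, so that the indexes are consecutive. This is only a sketch under assumptions: the HTML below is a cut-down, hypothetical row modelled on the table in the question, so the column positions are not those of the real page.

```php
<?php
// Cut-down, hypothetical row modelled on the table in the question.
$html = <<<HTML
<table cellpadding="3">
<tr><td>header</td></tr>
<tr>
<td>1</td>
<td>23/07/14 08:10</td>
<td><img src="http://www.betonews.com/img/SportId389.gif" alt=""></td>
<td>M. Eguchi</td>
<td>Radwanska U. </td>
</tr>
</table>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$results = array();
// //tr matches rows whether or not a tbody sits in between; position() > 1
// skips the first row of each group of sibling rows.
$rows = $xpath->query('(//table[@cellpadding="3"])[1]//tr[position() > 1]');
foreach ($rows as $row) {
    // Only child elements named td are returned, so the indexes are
    // consecutive regardless of whitespace between the cells.
    $td = $xpath->query('td', $row);
    $results[] = array(
        'date'  => trim($td->item(1)->nodeValue),
        // string(...) evaluates to '' when the cell has no <img>.
        'image' => $xpath->evaluate('string(.//img/@src)', $td->item(2)),
        'team1' => trim($td->item(3)->nodeValue),
        'team2' => trim($td->item(4)->nodeValue),
    );
}
print_r($results);
```

Note that //tr would also pick up rows of tables nested inside the selected one, so the expression may need tightening for the real page.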

Trouble scraping table with DOMXPath

I have a table I'm trying to scrape that looks like this:
<table id="thisTable">
<tr>
<td class="value1"></td>
<td class="value2"></td>
<td class="value3"></td>
<td class="value4"></td>
</tr>
<tr>
<td class="value5"></td>
<td class="value6"></td>
</tr>
</table>
and my DOMXPath that looks like this (so far):
$htmlDoc = new DomDocument();
@$htmlDoc->loadhtml($html);
$xpath = new DOMXPath($htmlDoc);
$nodelist = $xpath->query('//*[@id="thisTable"]');
foreach ($nodelist as $n){
echo $n->nodeValue."\n";
}
This works, I get the values of the table, but how do I specify the class of a nodeValue? Ultimately, my goal is to build a new table from the td's content of value2, value4 and value5 in a single row.
$htmlDoc = new DomDocument();
$htmlDoc->loadHTML($html);
$xpath = new DOMXPath($htmlDoc);
$nodelist = $xpath->query('//td');
foreach ($nodelist as $n){
echo $n->getAttribute("class")."\n";
}
Note: Use the getAttribute() method to get the value of the class attribute.
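Expand your XPath query so that it selects the td with the given class inside the table (double quotes are needed for $class to be interpolated):
$class = "value1";
$nodelist = $xpath->query("//*[@id='thisTable']//td[@class='$class']");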
I'm not sure I understand correctly, but if you want the text contents of value2, value4 and value5 in a single row, you can use this XPath:
(//td[@class='value2'] | //td[@class='value4'] | //td[@class='value5'])/text()
For example:
<table id="thisTable">
<tr>
<td class="value1"> 1111</td>
<td class="value2"> 222 </td>
<td class="value3">333 </td>
<td class="value4"> 444</td>
</tr>
<tr>
<td class="value5"> 555</td>
<td class="value6"> 666</td>
</tr>
</table>
output will then be: 222 444 555
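To get from there to the new single-row table the question asks for, something along these lines might do. It is a sketch, not a tested solution for the real page: the $wanted list and the output markup are my assumptions, and the input is the example table above.

```php
<?php
$html = <<<HTML
<table id="thisTable">
<tr>
<td class="value1"> 1111</td>
<td class="value2"> 222 </td>
<td class="value3">333 </td>
<td class="value4"> 444</td>
</tr>
<tr>
<td class="value5"> 555</td>
<td class="value6"> 666</td>
</tr>
</table>
HTML;

$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($html);
$xpath = new DOMXPath($htmlDoc);

// Classes to copy into the new row, in output order (an assumption).
$wanted = array('value2', 'value4', 'value5');
$cells = array();
foreach ($wanted as $class) {
    // Look up each class inside the table with id "thisTable".
    $node = $xpath->query("//*[@id='thisTable']//td[@class='$class']")->item(0);
    // A missing cell becomes an empty string.
    $cells[] = $node ? trim($node->nodeValue) : '';
}

// Assemble the new single-row table, escaping the cell text.
$row = '';
foreach ($cells as $text) {
    $row .= '<td>' . htmlspecialchars($text) . '</td>';
}
$table = '<table><tr>' . $row . '</tr></table>';
echo $table, "\n";
```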

php regex or html dom parsing

I use regex for HTML parsing but I need your help to parse the following table:
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
I want to get all domains with their volume(in array or var) for example
http://xyz.com - 210,779,783
Should I use regex or an HTML DOM parser in this case? I don't know how to parse a large table. Can you please help? Thanks.
Here's an XPath example that parses the HTML from the question.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile("./input.html");
$xpath = new DOMXPath($dom);
$trs = $xpath->query("//table[@class='resultstable'][1]/tr");
foreach ($trs as $tr) {
$tdList = $xpath->query("td[2]/a", $tr);
if ($tdList->length == 0) continue;
$name = $tdList->item(0)->nodeValue;
$tdList = $xpath->query("td[3]", $tr);
$vol = $tdList->item(0)->childNodes->item(0)->nodeValue;
echo "name: {$name}, vol: {$vol}\n";
}
?>
