I am trying to parse html table in order to get <td> ID HERE </td> tag content using Xpath and PHP.
Executing following line
$doc->loadHTMLFile($file);
gives me warnings like this:
PHP Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : tr in...
That's why I am using the following block of code:
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
Trying to parse this: (the entire page here)
<table class="object-table" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th width="8%">something here</th>
<th width="89%">something here</th>
<th width="3%">something here</th>
</tr>
<tr class="normal-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
<tr class="odd-row">
<td>ID number here</td>
<td>something here
</td>
<td align="center">
<img src="/design/img/hasnt_photo_icon.gif">
</td>
</tr>
</tbody>
</table>
with the following code:
$file = "http://www.sportsporudy.gov.ua/catalog/#c[1]=1";
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$query = '//tr[#class="odd-row"]';
$elements = $xpath->query($query);
printf("Size of array: %d\n", sizeof($elements));
printElements($elements);
and tried using different queries like
//table[#class="object-table"]/tbody/tr ...
but doesn't seem to give me the td tags I need. Maybe that's because of the broken HTML.
Thanks for your advice.
Substantially, your code is fine.
The only error that I've found is in the printing $elements length: $elements is not an array, to retrieve its length you have to use this syntax:
printf( "Size of array: %d\n", $elements->length );
But the major problem that you have with your page is that the HTML has only one table with one row: the remaining data are filled with javascript, so you can't retrieve it directly through DOMXPath.
I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?
XPath is not guilty at all.
Span tags are added dinamically. Just have a look at the source code of the page, not the DOM-Structure, which may be already modified by javascript, but use "view-source:" and you will see exactly the same html, as it is parsed by XPath.
It would be a good idea to have a look at the table with class tablelines? probably, you have there everything you may need.
You should skip "maincolor" and "tableheader", and start processing with "light" class.
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//tr[#class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
Probably I have found what you need, and even in nice JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}
I am trying to get the text of child elements using the PHP DOM.
Specifically, I am trying to get only the first <a> tag within every <tr>.
The HTML is like this...
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
My sad attempt at it involved using foreach() loops, but would only return Array() when doing a print_r() on the $aVal.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(returnURLData($url));
libxml_use_internal_errors(false);
$tables = $dom->getElementsByTagName('table');
$aVal = array();
foreach ($tables as $table) {
foreach ($table as $tr){
$trVal = $tr->getElementsByTagName('tr');
foreach ($trVal as $td){
$tdVal = $td->getElementsByTagName('td');
foreach($tdVal as $a){
$aVal[] = $a->getElementsByTagName('a')->nodeValue;
}
}
}
}
Am I on the right track or am I completely off?
Put this code in test.php
require 'simple_html_dom.php';
$html = file_get_html('test1.php');
foreach($html->find('table tr') as $element)
{
foreach($element->find('a',0) as $element)
{
echo $element->plaintext;
}
}
and put your html code in test1.php
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
I am pretty sure I am late, but better way should be to iterate through all "tr" with getElementByTagName and then while iterating through each node in nodelist recieved use getElementByTagName"a". Now no need to iterate through nodeList point out the first element recieved by item(0). That's it! Another way can be to use xPath.
I personally don't like SimpleHtmlDom because of the loads of extra added features it uses where a small functionality is required. In case of heavy scraping also memory management issue can hold you back, its better if you yourself do DOM Analysis rather than depending thrid party application.
Just My opinion. Even I used SHD initially but later realized this.
You're not setting $trVal and $tdVal yet you're looping them ?
The webpage in question is http://assignments.uspto.gov/assignments/q?db=pat&pub=20060030630
Now, let's just say I want to capture the Assignees in the first assignment. The relevant code there looks like
<div class="t3">Assignee:</div>
</td>
</tr>
</table>
</td><td>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody valign="top">
<tr>
<td>
<table>
<tr>
<td>
<div class="p1">
LEAR CORPORATION
</div>
</td>
</tr>
<tr>
<td><span class="p1">21557 TELEGRAPH ROAD</span></td>
</tr>
<tr>
<td><span class="p1">SOUTHFIELD, MICHIGAN 48034</span></td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
I could I suppose use xpath and grab everything out of spans with class p1, except that thing is used all throughout the page for basically everything, same for the div class that lear corporation is in.
So is there a way for me to just read "Assignees" and then grab just the information relevant to it?
I figure if I can understand how to do that, then I can extrapolate from that and figure out how to grab any specific data on the page that I want, i.e. grabbing the conveyance data on any particular assignment.
But if say, I were just to grab all the data on the page (reel/frame, conveyance, assignors, assignee, correspondent for every assignment, and the header information about the patent itself), might that be easier to do than trying to grab each individual piece of information?
There is no clear way to do it since we have no designation in the DOM where this information is.. It's very arbitrary.
I would recommend using some math to figure out the pattern of where in the DOM the Assignee resides.
For example, we know that for every class of p1, the assignee value is position 16, and a new Assignment occurs every 23rd position. Using a loop you could figure it out.
This should get you started at the very least.
$Site = file_get_contents('http://assignments.uspto.gov/assignments/q?db=pat&pub=20060030630');
$Dom = new DomDocument();
$Dom->loadHTML($Site);
$Finder = new DomXPath($Dom);
$Nodes = $Finder->query("//*[contains(concat(' ', normalize-space(#class), ' '), ' p1 ')]");
$position = 0;
foreach($Nodes as $node) {
if(($position % 16) == 0 && $position > 0) {
var_dump($node->nodeValue);
break;
}
$position++;
}
As I am well aware that PHPDom can solve half of my problem, I'm in need of a way (not necessarily regex) to be able to find a certain DOM element based on a given innerHTML.
say for example i got this code:
<tr>
<td class="ranking_rank" style="vertical-align:middle;">48697</td>
<td class="ranking_ign" style="vertical-align:middle;">kanineh</td>
<td class="ranking_img" style="vertical-align:middle;">
<img src="http://avatar.maplesea.com/Character/NKGEHGDLFNINKPMFLDCNNOHKHKBOHBKLGCBLABFLABHAGBPAEMDEFABJBLKJIHJAANGEKFJGELEPKMCNLKPCINEJDGAJFLKG.gif" onerror="this.src='/images/ranking/noimage.jpg'"/>
</td>
<td class="ranking_lvl" style="vertical-align:middle;">122</td>
<td class="ranking_world" style="vertical-align:middle;">
<img src="/images/ranking/Bootes.gif" onMouseover="ddrivetip('Bootes','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_job" style="vertical-align:middle;">
<img src="/images/ranking/Warrior.gif" onMouseover="ddrivetip('Warrior','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_fame" style="vertical-align:middle;">449</td>
</tr>
<tr>
<td class="ranking_rank" style="vertical-align:middle;">48698</td>
<td class="ranking_ign" style="vertical-align:middle;">WannaLogic</td>
<td class="ranking_img" style="vertical-align:middle;">
<img src="http://avatar.maplesea.com/Character/DOMELFGEGCGDBFCOLADBDOJLHADCIBNKEGKGINPNBEKPDDKOEEGBLMDLBGBDHGCNPGLAECAMLGKEMDKJGPODIDKCOJCMNNKN.gif" onerror="this.src='/images/ranking/noimage.jpg'"/>
</td>
<td class="ranking_lvl" style="vertical-align:middle;">122</td>
<td class="ranking_world" style="vertical-align:middle;">
<img src="/images/ranking/Aquila.gif" onMouseover="ddrivetip('Aquila','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_job" style="vertical-align:middle;">
<img src="/images/ranking/Magician.gif" onMouseover="ddrivetip('Magician','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_fame" style="vertical-align:middle;">56</td>
</tr>
I need to be able to get a hold of the whole row node with the td that has WannaLogic in it. that way, when I have this table row already, I can now easily traverse the nodes using PHP DOM. I'm a sucker for regular expression so I'd really much appreciate it if you can shed me some light on this.
Using regex on a DOM tree is a no-no and bound to fail when faced with malformed XML/HTML. Try this:
$xpath = new DOMXPath($doc);
$query = "//*[.='WannaLogic']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
// do whatever
}