PHP ganon dom parser get next element when match is found


I am parsing an HTML string with the ganon DOM parser and want to get the plain text of the next element when a match is found on the previous element. For example, my HTML looks like this:
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
I have used the following code for now
$html = str_get_dom('html string here');
foreach ($html('th.label') as $elem){
if($elem->getPlainText()=='SKU'){ //this is right
echo $elem->getSibling(1)->getPlainText(); // this is not working
}
}
If a th with class label and inner HTML SKU is found, I want to get the inner HTML of its next sibling, which is the SKU value.
Please help me sort this out.

This is probably a quirk in how "ganon" parses the HTML. Take your example HTML:
$html = '<table>
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
</table>';
$html = str_get_dom($html);
For some reason, because of the line break between the th and the td, "ganon" treats the next sibling as a text node, and only the sibling after that is the td you want. So you have to do this:
foreach ($html('th.label') as $elem){
if($elem->getPlainText()=='SKU'){
//elem -> text node -> td node
echo($elem->getSibling(1)->getSibling(1)->getPlainText());
}
}
If you organize your HTML like this (without the line break between the th and the td):
$html = '<table>
<tr class="last even">
<th class="label">SKU</th><td class="data last">some sku here i want to get </td>
</tr>
</table>';
Then your original code, $elem->getSibling(1)->getPlainText(), will work.
You might also consider the PHP Simple HTML DOM class. It is more intuitive, uses ordinary OOP methods, feels like jQuery, and doesn't rely on this awkward variable-as-function syntax :)
require('simple_html_dom.php');
$html = '<table>
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
</table>';
$dom = str_get_html($html);
foreach($dom->find('th.label') as $el){
if($el->plaintext == 'SKU'){
echo($el->next_sibling()->plaintext);
}
}
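If neither library is a hard requirement, the same lookup can be sketched with PHP's bundled DOM extension. This is an alternative technique, not part of ganon or Simple HTML DOM: the XPath following-sibling axis is restricted to td elements here, so the whitespace text node between the th and the td does not get in the way. The helper name findValueAfterLabel is made up for illustration, and the sketch assumes the label text contains no double quotes.

```php
<?php
// Sketch using PHP's built-in DOM extension (not ganon, not Simple HTML DOM).
// findValueAfterLabel() is a hypothetical helper name.
function findValueAfterLabel(string $html, string $label): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // th.label whose text equals $label, then the first td element after it.
    // Assumption: $label contains no double quotes.
    $query = sprintf(
        '//th[@class="label"][normalize-space(.)="%s"]/following-sibling::td[1]',
        $label
    );
    $cell = $xpath->query($query)->item(0);

    return $cell === null ? null : trim($cell->textContent);
}

$html = '<table>
<tr class="last even">
<th class="label">SKU</th>
<td class="data last">some sku here i want to get </td>
</tr>
</table>';

echo findValueAfterLabel($html, 'SKU'), "\n";
```

The function returns null when no matching label is found, so the caller can tell a missing row apart from an empty cell.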

Related

PHP parsing won't find "span" tags

I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?

XPath is not guilty at all.
The span tags are added dynamically. Look at the page's source code rather than the DOM structure, which JavaScript may already have modified. Use "view-source:" and you will see exactly the HTML that XPath parses.
It would be a good idea to look at the table with class tablelines; it probably has everything you need.
Skip the "maincolor" and "tableheader" rows and start processing at the rows with class "light".
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//tr[@class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMNodeList)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
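The query above returns every cell in one flat list. If you would rather keep the cells grouped by game, one possible variation is to query the rows first and then the cells relative to each row. The sketch below runs on a shortened inline copy of the table instead of the live page; rowsByClass is a hypothetical helper name.

```php
<?php
// Sketch: collect cell text per row instead of one flat list.
// rowsByClass() is a hypothetical helper name.
function rowsByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world pages are rarely valid HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    // Assumption: $class contains no single quotes.
    foreach ($xpath->query("//tr[@class='$class']") as $tr) {
        $cells = [];
        // The second argument makes the query relative to the current row.
        foreach ($xpath->query('td', $tr) as $td) {
            // Collapse line breaks and repeated spaces inside a cell.
            $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Shortened sample based on the table shown above.
$sample = '<table class="tablelines">
<tr class="tableheader"><td><b>AWAY</b></td><td><b>HOME</b></td></tr>
<tr class="light"><td>Sioux City
<b>1</b></td><td>Sioux Falls
<b>5</b></td></tr>
</table>';

echo json_encode(rowsByClass($sample, 'light')), "\n";
```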
I may also have found what you need, already in JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux 
Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}
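If that endpoint keeps returning this shape, no HTML parsing is needed at all. A minimal sketch, using a shortened copy of the response above (two games, a few fields) rather than a live request:

```php
<?php
// Shortened sample of the response shown above; the real one has more fields.
$json = '{"games_list":[
 {"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","gamedate":"15\/05"},
 {"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","gamedate":"10\/05"}
],"leagueshortname":"USHL"}';

$data = json_decode($json, true);   // true = decode objects as associative arrays

$lines = [];
// games_list can be null or missing, so fall back to an empty list.
foreach ($data['games_list'] ?? [] as $game) {
    $lines[] = sprintf(
        '%s: %s %s - %s %s',
        $game['gamedate'],
        $game['awayteam'],
        $game['awayscore'],
        $game['hometeam'],
        $game['homescore']
    );
}

echo implode("\n", $lines), "\n";
```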

How to use DOMDocument to get child elements?

I am trying to get the text of child elements using the PHP DOM.
Specifically, I am trying to get only the first <a> tag within every <tr>.
The HTML is like this...
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
My sad attempt at it involved using foreach() loops, but would only return Array() when doing a print_r() on the $aVal.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(returnURLData($url));
libxml_use_internal_errors(false);
$tables = $dom->getElementsByTagName('table');
$aVal = array();
foreach ($tables as $table) {
foreach ($table as $tr){
$trVal = $tr->getElementsByTagName('tr');
foreach ($trVal as $td){
$tdVal = $td->getElementsByTagName('td');
foreach($tdVal as $a){
$aVal[] = $a->getElementsByTagName('a')->nodeValue;
}
}
}
}
Am I on the right track or am I completely off?

Put this code in test.php
require 'simple_html_dom.php';
$html = file_get_html('test1.php');
foreach($html->find('table tr') as $row)
{
// find('a', 0) returns a single element (or null), not a list, so there is nothing to loop over
$link = $row->find('a', 0);
if ($link)
{
echo $link->plaintext;
}
}
and put your html code in test1.php
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>

I am pretty sure I am late, but a better way is to iterate over every "tr" with getElementsByTagName() and, for each node in the returned node list, call getElementsByTagName('a'). There is no need to iterate over that second node list; just take the first element with item(0). That's it! Another option is XPath.
I personally don't like SimpleHtmlDom because it brings in a lot of extra features when only a small piece of functionality is needed. With heavy scraping, memory management can also hold you back; it's better to do the DOM analysis yourself than to depend on a third-party library.
Just my opinion. I used SHD initially too, but later realized this.
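The DOMDocument approach described above could look roughly like this. The sample markup and its href values are made up for illustration, since the anchors are missing from the HTML in the question; firstLinkPerRow is a hypothetical helper name.

```php
<?php
// Sketch of the "first <a> in every <tr>" idea with the built-in DOM extension.
// firstLinkPerRow() is a hypothetical helper name.
function firstLinkPerRow(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        // item(0) is the first <a> below this row, or null if the row has none.
        $first = $tr->getElementsByTagName('a')->item(0);
        if ($first !== null) {
            $texts[] = trim($first->nodeValue);
        }
    }
    return $texts;
}

// Assumed markup: the href values are placeholders.
$html = '<table><tbody>
<tr><td><a href="#1">1st Link</a></td><td><a href="#2">2nd Link</a></td></tr>
<tr><td><a href="#1">1st Link</a></td><td><a href="#2">2nd Link</a></td></tr>
</tbody></table>';

print_r(firstLinkPerRow($html));
```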

You're not setting $trVal and $tdVal yet you're looping them ?

PHP DOM grabbing a specific subset of information

The webpage in question is http://assignments.uspto.gov/assignments/q?db=pat&pub=20060030630
Now, let's just say I want to capture the Assignees in the first assignment. The relevant code there looks like
<div class="t3">Assignee:</div>
</td>
</tr>
</table>
</td><td>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody valign="top">
<tr>
<td>
<table>
<tr>
<td>
<div class="p1">
LEAR CORPORATION
</div>
</td>
</tr>
<tr>
<td><span class="p1">21557 TELEGRAPH ROAD</span></td>
</tr>
<tr>
<td><span class="p1">SOUTHFIELD, MICHIGAN 48034</span></td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
I suppose I could use XPath and grab everything out of the spans with class p1, except that class is used throughout the page for basically everything, and the same goes for the div class that LEAR CORPORATION is in.
So is there a way for me to just read "Assignees" and then grab just the information relevant to it?
I figure if I can understand how to do that, then I can extrapolate from that and figure out how to grab any specific data on the page that I want, i.e. grabbing the conveyance data on any particular assignment.
But if say, I were just to grab all the data on the page (reel/frame, conveyance, assignors, assignee, correspondent for every assignment, and the header information about the patent itself), might that be easier to do than trying to grab each individual piece of information?

There is no clear way to do it, since nothing in the DOM designates where this information is. It's very arbitrary.
I would recommend using some arithmetic to work out the pattern of where the Assignee sits in the DOM.
For example, among the elements with class p1, the assignee value is at position 16, and a new assignment starts every 23rd position. Using a loop you could figure it out.
This should get you started at the very least.
$Site = file_get_contents('http://assignments.uspto.gov/assignments/q?db=pat&pub=20060030630');
$Dom = new DomDocument();
$Dom->loadHTML($Site);
$Finder = new DomXPath($Dom);
$Nodes = $Finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' p1 ')]");
$position = 0;
foreach($Nodes as $node) {
if(($position % 16) == 0 && $position > 0) {
var_dump($node->nodeValue);
break;
}
$position++;
}
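An alternative that may be worth trying is to anchor the query on the label text instead of counting positions. This is only a sketch and it assumes the nesting shown in the question's excerpt holds across the page: the label div sits in its own small table, and the cell wrapping that table is immediately followed by the cell holding the values. It runs on an inline sample rather than the live page, and valuesAfterLabel is a hypothetical helper name.

```php
<?php
// Sketch: find the values that belong to a given label cell.
// valuesAfterLabel() is a hypothetical helper name.
function valuesAfterLabel(string $html, string $label): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // ancestor::td[2] is the outer cell wrapping the label's own small table;
    // its next sibling cell holds the values. Assumption: $label has no single quotes.
    $query = "//div[@class='t3'][normalize-space(.)='$label']"
           . "/ancestor::td[2]/following-sibling::td[1]//*[@class='p1']";

    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = trim($node->textContent);
    }
    return $values;
}

// Inline sample mirroring the excerpt from the question.
$sample = '<table><tr>
<td><table><tr><td><div class="t3">Assignee:</div></td></tr></table></td><td>
<table><tbody valign="top"><tr><td><table>
<tr><td><div class="p1"> LEAR CORPORATION </div></td></tr>
<tr><td><span class="p1">21557 TELEGRAPH ROAD</span></td></tr>
<tr><td><span class="p1">SOUTHFIELD, MICHIGAN 48034</span></td></tr>
</table></td></tr></tbody></table>
</td></tr></table>';

print_r(valuesAfterLabel($sample, 'Assignee:'));
```

On a page with several assignments this query matches every "Assignee:" label, so the values of all assignments come back in document order; restricting it to one assignment would need an extra condition that depends on the real page structure.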

Simple HTML DOM: Notice->Trying to get property of non-object

I am getting a PHP notice when using Simple HTML DOM to scrape a website. Two notices are displayed, and everything rendered underneath looks perfect when I use the print_r function to display it.
The website table structure is as follows:
<table class=data schedTbl>
<thead>
<tr>
<th>DATA</th>
<th>DATA</th>
<th>DATA</th>
etc....
</tr>
</thead>
<tbody>
<tr>
<td>
<div class="class1">DATA</div>
<div class="class2">SAME DATA AS PREVIOUS DIV</div>
</td>
<td>DATA</td>
<td>DATA</td>
etc....
</tr>
<tr>
<td>
<div class="class1">DATA</div>
<div class="class2">SAME DATA AS PREVIOUS DIV</div>
</td>
<td>DATA</td>
<td>DATA</td>
etc....
</tr>
<tr>
<td>
<div class="class1">DATA</div>
<div class="class2">SAME DATA AS PREVIOUS DIV</div>
</td>
<td>DATA</td>
<td>DATA</td>
etc....
</tr>
etc....
</tbody>
</table>
The code below is used to find all tr in table[class=data schedTbl]. I have a tbody selector in there, but it seems to pay no attention to this selector as it still selects the tr in the thead.
include('simple_html_dom.php');
$articles = array();
getArticles('www.somesite.com');
function getArticles($page) {
global $articles;
$html = new simple_html_dom();
$html->load_file($page);
$items = $html->find('table[class=data schedTbl] tbody tr');
foreach($items as $post) {
$articles[] = array($post->children(0)->first_child(0)->plaintext,//0 -- GAME DATE
$post->children(1)->plaintext,//1 -- AWAY TEAM
$post->children(2)->plaintext);//2 -- HOME TEAM
}
}
So, I believe the notices come from the tr in the thead, because I am calling first_child on the first cell, which has only one record. There are two notices because there are actually two tables with the same data structure in the body.
Again, I believe there are 2 ways of solving this:
1) PROBABLY THE EASIEST (fix the find selector so the TBODY works and only selects the tds within the tbodies)
2) Figure out a way to not do the first_child filter when it is not needed?
Please let me know if you would like a snapshot of the print_r($articles) output I am receiving.
Thanks in advance for any help provided!
Sincerely,
Bill C.

Just comment out line 695 in simple_html_dom.php:
if ($m[1]==='tbody') continue;
Then it should read the tbody.

DOMDocument in php

I have just started reading documentation and examples about DOM, in order to crawl and parse the document.
For example I have part of document shown below:
<div id="showContent">
<table>
<tr>
<td>
Crap
</td>
</tr>
<tr>
<td width="172" valign="top"><img height="91" border="0" width="172" class="" src="img"></td>
<td width="10"> </td>
<td valign="top"><table cellspacing="0" cellpadding="0" border="0">
<tbody><tr>
<td height="30"><a class="px11" href="link">title</a><a><br>
<span class="px10"></span>
</a></td>
</tr>
<tr>
<td><img height="1" width="580" src="crap"></td>
</tr>
<tr>
<td align="right">
<img height="16" border="0" width="65" src="/buy">
</td>
</tr>
<tr>
<td valign="top" class="px10">
<p style="width: 500px;">description.</p>
</td>
</tr>
</tbody></table></td>
</tr>
<tr>
<td>
Crap
</td>
</tr>
<tr>
<td>
Crap
</td>
</tr>
</table>
</div>
I'm trying to use the following code to get all the tr tags and analyze whether there is crap or information inside them:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('.//div[@id="showContent"]');
foreach ($tags as $tag) {
$string="";
$string=trim($tag->nodeValue);
if(strlen($string)>3) {
echo $string;
echo '<br>';
}
}
However I'm getting just stripped string without the tags, for example:
Crap
Crap
Title
Description
But I would like to get:
<tr>
<td>Crap</td>
</tr>
<tr>
title
</tr>
How to keep html nodes (tags)?

If you want to work with DOM you have to understand the concept. Everything in a DOM Document, including the DOMDocument, is a Node.
The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. That root node can have child nodes and all these child nodes can have child nodes on their own. Basically everything in a DOMDocument is a node type of some sort, be it elements, attributes or text content.
                 HTML                     Legend:
                 /  \                     UPPERCASE = DOMElement
             HEAD    BODY                 lowercase = DOMAttr
             /          \                 "Quoted"  = DOMText
        TITLE            DIV - class - "header"
          |                \
    "The Title"             H1
                             |
                  "Welcome to Nodeville"
The diagram above shows a DOMDocument with some nodes. There is a root element (HTML) with two children (HEAD and BODY). The connecting lines are called axes. If you follow down the axis to the TITLE element, you will see that it has one DOMText leaf. This is important because it illustrates an often overlooked thing:
<title>The Title</title>
is not one, but two nodes. A DOMElement with a DOMText child. Likewise, this
<div class="header">
is really three nodes: the DOMElement with a DOMAttr holding a DOMText. Because all these inherit their properties and methods from DOMNode, it is essential to familiarize yourself with the DOMNode class.
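A quick way to see these node types for yourself, sketched with the markup from the diagram:

```php
<?php
// Sketch: inspect the node types behind <title> and <div class="header">.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>The Title</title></head>'
    . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>'
);

$title = $dom->getElementsByTagName('title')->item(0);
$div   = $dom->getElementsByTagName('div')->item(0);
$attr  = $div->getAttributeNode('class');

// The element and its text are two separate nodes.
var_dump($title instanceof DOMElement);
var_dump($title->firstChild instanceof DOMText);
var_dump($title->firstChild->nodeValue);

// The attribute is a node of its own, attached to the element.
var_dump($attr instanceof DOMAttr);
var_dump($attr->value);

// Every one of them is a DOMNode.
var_dump($title instanceof DOMNode, $attr instanceof DOMNode);
```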
In practise, this means the DIV you fetched is linked to all the other nodes in the document. You could go all the way to the root element or down to the leaves at any time. It's all there. You just have to query or traverse the document for the wanted information.
Whether you do that by iterating the childNodes of the DIV or use getElementsByTagName() or XPath is up to you. You just have to understand that you are not working with raw HTML, but with nodes representing that entire HTML document.
If you need help with extracting specific information from the document, you need to clarify what information you want to fetch from it. For instance, you could ask how to fetch all the links from the table and then we could answer something like:
$div = $dom->getElementById('showContent');
foreach ($div->getElementsByTagName('a') as $link)
{
echo $dom->saveXML($link);
}
But unless you are more specific, we can only guess which nodes might be relevant.
If you need more examples and code snippets on how to work with DOM, browse through my previous answers to related questions:
https://stackoverflow.com/search?q=user%3A208809+DOM
By now, there should be a snippet for every basic to medium use case you might have with DOM.
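For the question as asked (keeping the tags of each row), a minimal sketch could pass each row node to saveHTML(), which serializes a single node as markup when given one. The sample below is a cut-down version of the document from the question, not the full page.

```php
<?php
// Sketch: print each <tr> inside #showContent with its markup intact.
$html = '<div id="showContent"><table>'
      . '<tr><td>Crap</td></tr>'
      . '<tr><td><a class="px11" href="link">title</a></td></tr>'
      . '</table></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
foreach ($xpath->query('//div[@id="showContent"]//tr') as $tr) {
    // saveHTML($node) returns the markup of that node only; trim() drops
    // any whitespace the serializer may add around it.
    $rows[] = trim($dom->saveHTML($tr));
}

echo implode("\n", $rows), "\n";
```

Whether a row counts as "crap" or as information can then be decided on the node itself, for example by checking whether it contains an a element, before it is serialized.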

To create a parser you can use htmlDOM.
It is a simple, easy-to-use DOM parser written in PHP. With it you can easily fetch the contents of a div tag.
For example, to find all div tags whose id attribute has the value text:
$ret = $html->find('div[id=text]');
