I'm trying to work out how to traverse this specific table with "simple_html_dom.php". I've tried many different angles and just can't get it right. I can separate the table row by row, but I can't slice up the TD values into individual components.
What I'm trying to do is take the table from this site and move the TD values into specific (array of) variables I can reliably and predictably work with. The problem is partly compounded, I think, by the fact that the TR or TDs don't have any attributes that I can 'find'.
$dom = file_get_html('http://www.asx.com.au/asx/statistics/prevBusDayAnns.do');
$tds = $dom->find('table',0)->find('tr', 1)->find('td', 1);
foreach($tds as $td)
{
echo $td->plaintext . '<br />';
}
The code above finds the first TR but I would have expected $tds to have the value of TD cell 1. It does not though. It spits out the entire TR.
I've been over the documentation and had a good search around the net but no luck.
EDIT - Solution (something like this):
$rows = $dom->find('table',0)->find('tr');
foreach($rows as $key => $tr)
{
$td = $tr->find('td');
if (isset($td[0]))
{
echo $td[0]->plaintext . '<br />'; // First TD column
//echo $td[1]->plaintext;
//echo $td[2]->plaintext;
//echo $td[3]->plaintext;
//echo $td[4]->plaintext;
//echo $td[5]->plaintext;
}
}
Replace
$dom->find('table',0)->find('tr', 1)->find('td', 1);
with
$dom->find('table',0)->find('tr', 1)->find('td');
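When you pass the second parameter, find() returns only the single td at that zero-based index (here index 1, the second cell), so there is no list to loop over. Without it, find('td') returns an array of every td in the row. Note that this still only goes through one table row, the one selected by find('tr', 1).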
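For comparison, here is a minimal sketch of the same row-by-row extraction using PHP's built-in DOMDocument rather than simple_html_dom. The sample markup and the helper name tableToRows() are assumptions made for illustration; the real page will have a different layout.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<table>
  <tr><th>Code</th><th>Time</th><th>Headline</th></tr>
  <tr><td>ABC</td><td>09:01</td><td>Quarterly report</td></tr>
  <tr><td>XYZ</td><td>09:15</td><td>Change of director</td></tr>
</table>';

// Hypothetical helper: returns one array of cell texts per data row.
function tableToRows(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    $table = $dom->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return $rows;
    }
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) {            // skip rows that only contain <th>
            $rows[] = $cells;
        }
    }
    return $rows;
}

$rows = tableToRows($html);
```

Each entry of $rows is then a plain array, so $rows[0][2] is the third cell of the first data row.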
Related
I have a project where I am attempting to take a docx file, unzip it, and then run through the document.xml file using XPath in order to find all table elements. Within each table element I want to identify specific tables using the tblCaption (Table Caption, obviously) attribute, and then run through the table and find its table cells. I will then change the background color of cells by changing the w:fill value using a string replace. We're doing it like this because we want to manually enter tables into Word and then change them without having to dynamically generate tables using a library like PHPDocx or otherwise.
So far I have used SimpleXML with XPath to find all tables in the doc, loop through them and test for the existence and value of the tblCaption node. If there is a match I will then assign a background color to each cell, using the cell text to identify the cell node. I can find all the tables using XPath. I have attempted to find the child nodes of each table using both $tblNode->children() and XPath:
$xml = simplexml_load_file(APPPATH.TEMPLATE_UPLOAD_PATH.'xmltest/word/document.xml');
$namespaces = $xml->getDocNamespaces(true);
foreach ($namespaces as $prefix => $ns) {
$prefix = $prefix == '' ? 'default' : $prefix;
$xml->registerXPathNamespace($prefix, $ns);
}
$nodes = $xml->xpath("/w:document/w:body//w:tbl");
foreach($nodes as $node) {
$children = $node->xpath("/w:tblCaption");
echo count($children) . '<br />';
//$children = $node->children();
//echo count($children) . '<br />';
}
I would eventually like to use:
$children = $node->xpath("/w:tblCaption[@val='whatever']"); to return a tblCaption node only if it exists and has a specific value.
At the moment there are zero child nodes for each tbl node being returned.
Any ideas?
One problem I can identify, without having seen a sample of the XML, is that your code tries to get an element relative to another element by using an absolute XPath expression (one that starts with /, which references the document root). You should, at least, add a . at the beginning of your XPath or remove the / completely. Either change results in an XPath that looks for a child element of the given name, relative to the current $node:
foreach($nodes as $node) {
$children = $node->xpath("w:tblCaption");
echo count($children) . '<br />';
}
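The relative lookup can be sketched end to end on a small inline document. This is only an illustration under assumptions: the XML below is a heavily trimmed, hypothetical stand-in for word/document.xml, and it assumes the caption sits inside w:tblPr with its value in the w:val attribute, which is why the query uses .// (any descendant of the current table) rather than a direct child step.

```php
<?php
// Hypothetical, heavily trimmed stand-in for word/document.xml.
$source = <<<XML
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:tbl>
      <w:tblPr><w:tblCaption w:val="whatever"/></w:tblPr>
      <w:tr><w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc></w:tr>
    </w:tbl>
    <w:tbl>
      <w:tblPr><w:tblCaption w:val="other"/></w:tblPr>
    </w:tbl>
  </w:body>
</w:document>
XML;

$ns  = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
$xml = simplexml_load_string($source);
$xml->registerXPathNamespace('w', $ns);

$matches = [];
foreach ($xml->xpath('/w:document/w:body//w:tbl') as $tbl) {
    // Defensive step: register the prefix on the returned node as well.
    $tbl->registerXPathNamespace('w', $ns);
    // The leading "." keeps the search relative to this table.
    $captions = $tbl->xpath(".//w:tblCaption[@w:val='whatever']");
    if (count($captions) > 0) {
        $matches[] = $tbl;
    }
}
```

Here $matches ends up holding only the tables whose caption value is 'whatever'; with an absolute path such as /w:tblCaption the inner query would return nothing for every table.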
I've ditched this altogether and I'm using the OpenTBS plugin for changing table appearance in Word. It's much more powerful than trying to mess around with XML elements yourself. I haven't tried the above idea, but thanks anyway.
There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
What I want is to scrape not the full table but PARTS of this table.
For example, I only want to display the rows for sportingbet, stoiximan and mybet, and I don't need all columns, only the 1, X and 2 columns. Also, the numbers shown in red must be scraped as they are, with the red box, or just displayed with an asterisk next to them in the scrape. Can this be done, or do I need to scrape the whole table into a database first and then query the database?
What I have now is this code, which I borrowed from another similar question on this forum:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach($table->find('tr') as $row) {
// initialize array to store the cell data from each row
$flight = array();
foreach($row->find('td') as $cell) {
// push the cell's text to the array
$flight[] = $cell->plaintext;
}
$rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
echo '<tr>';
foreach ($tr as $td)
echo '<td>' . $td .'</td>';
echo '</tr>';
}
echo '</table>';
?>
which returns the full table. What I mainly want is to somehow detect the numbers highlighted with the red box (in the 1, X and 2 columns) and display an asterisk next to them in my scrape. Secondly, I want to know whether it is possible to scrape specific columns and rows rather than everything. Do I need to use XPath?
I beg someone to point me in the right direction. I have spent hours on this, and the manual doesn't explain much: http://simplehtmldom.sourceforge.net/manual.htm
The link is dead. However, you can do this with XPath and reference the cells that you want by their colour and position, among many other ways.
This snippet will give you the general gist; it is taken from a project I'm working on at the moment:
function __construct($URL)
{
// make new DOM for nodes
$this->dom = new DOMDocument();
// set error level
libxml_use_internal_errors(true);
// Grab and set HTML Source
$this->HTMLSource = file_get_contents($URL);
// Load HTML into the dom
$this->dom->loadHTML($this->HTMLSource);
// Make xPath queryable
$this->xpath = new DOMXPath($this->dom);
}
function xPathQuery($query){
return $this->xpath->query($query);
}
Then simply pass a query to your DOMXPath, like //tr[1]
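As a rough sketch of how such queries could pick out only some rows and columns: since the original page is gone, the markup below is invented, and the class name "red", the bookmaker list and the column positions are all assumptions for illustration.

```php
<?php
// Hypothetical markup: the class name "red" and the column layout
// are assumptions, not taken from the real page.
$html = '<table>
  <tr><td>sportingbet</td><td>1.30</td><td class="red">5.00</td><td>9.50</td><td>extra</td></tr>
  <tr><td>otherbook</td><td>1.28</td><td>5.10</td><td>9.00</td><td>extra</td></tr>
  <tr><td>mybet</td><td class="red">1.33</td><td>4.80</td><td>9.25</td><td>extra</td></tr>
</table>';

$wanted = ['sportingbet', 'stoiximan', 'mybet'];

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$odds = [];
foreach ($xpath->query('//tr[td]') as $tr) {
    // First cell holds the bookmaker name in this sample.
    $name = trim($xpath->evaluate('string(td[1])', $tr));
    if (!in_array($name, $wanted, true)) {
        continue;
    }
    $row = [];
    // Cells 2 to 4 hold the 1, X and 2 odds in this sample.
    foreach ($xpath->query('td[position() >= 2 and position() <= 4]', $tr) as $td) {
        $classes = preg_split('/\s+/', trim($td->getAttribute('class')));
        $isRed   = in_array('red', $classes, true);
        $row[]   = trim($td->textContent) . ($isRed ? '*' : '');
    }
    $odds[$name] = $row;
}
```

No database is needed for this kind of filtering; the row and column selection happens while walking the parsed table.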
I am using Windows software to organize a tourpool. This program creates (among other things) HTML pages with rankings of participants. But these HTML pages are quite hideous, so I am building a site around it.
To show the top 10 ranking I need to select the first 10 out of about 1000 participants of the generated HTML file and put it on my own site.
To do this, I used:
// get top 10 ranks of p_rnk.htm
$file_contents = file_get_contents('p_rnk.htm');
$start = strpos($file_contents, '<tr class="header">');
// get end
$i = 11;
while (strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start) === false){
$i++;
}
$end = strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start);
// substr() takes a length as its third argument, not an end offset
$code = substr($file_contents, $start, $end - $start);
echo $code;
This way I get it to work, except that the last 3 columns (previous position, up or down, and details) are useless information. So I want these columns deleted, or a way to select and display only the first 4.
How do I manage this?
EDIT
I adjusted my code and at the end I only echo the adjusted table.
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("p_rnk.htm");
$table = $DOM->getElementsByTagName('table')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 3;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML($table);
?>
I'd say that the best way to deal with such stuff is to parse the HTML document (see, for instance, the first answer here) and then manipulate the object that describes the DOM. This way, you can easily extract the table itself using various selectors, get your first 10 records in a simpler manner, and also remove unnecessary child (td) nodes from each line (using removeChild). When you're done modifying, dump the resulting HTML using saveHTML.
Update:
OK, here's tested code. I removed the need to hardcode the numbers of columns and rows and moved the desired numbers into a couple of variables (so that you can adjust them if needed). Give the code a closer look; you'll notice some details which were missing in your code. The index is 0..999, not 1..1000, which is why all those -1s and +1s appear. It's better to decrease the index instead of increasing it, because then you don't have to care about index shifts on removal. I've also used while instead of for so as not to handle the case $rows->item($row_index) == null separately:
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("./table.html");
$table = $DOM->getElementsByTagName('tbody')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 4;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML();
?>
Update 2:
If the page doesn't contain tbody, use the container which is present. For instance, if tr elements are inside a table element, use $DOM->getElementsByTagName('table') instead of $DOM->getElementsByTagName('tbody').
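One way to sidestep the tbody question entirely is to remove each row from its own parent node, whatever that parent is. The following is a sketch under that assumption; trimTable() is a hypothetical helper name and the sample table is made up.

```php
<?php
// Hypothetical helper: keeps the first $maxRows rows and $maxCols cells per row.
function trimTable(string $html, int $maxRows, int $maxCols): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $table = $dom->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return '';
    }
    // Copy the live node list into an array so removals cannot shift indexes.
    $rows = iterator_to_array($table->getElementsByTagName('tr'), false);
    foreach ($rows as $i => $row) {
        if ($i >= $maxRows) {
            // parentNode is <tbody> when present, otherwise <table>.
            $row->parentNode->removeChild($row);
            continue;
        }
        $cells = iterator_to_array($row->getElementsByTagName('td'), false);
        foreach ($cells as $j => $cell) {
            if ($j >= $maxCols) {
                $cell->parentNode->removeChild($cell);
            }
        }
    }
    return $dom->saveHTML($table);
}

// Made-up sample data.
$sample  = '<table>'
    . '<tr><td>1</td><td>Ann</td><td>x</td></tr>'
    . '<tr><td>2</td><td>Bob</td><td>y</td></tr>'
    . '<tr><td>3</td><td>Cy</td><td>z</td></tr>'
    . '</table>';
$trimmed = trimTable($sample, 2, 2);
```

Because the node lists are copied into arrays first, the loop can run forwards without the index bookkeeping needed when removing from a live list.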
I am parsing through an XML document and getting the values of nested tags using asXML(). This works fine, but I would like to move this data into a MySQL database whose columns match the tags of the file. So essentially how do I get the tags that asXML() is pulling text from?
This way I can eventually do something like: INSERT INTO db.table (TheXMLTag) VALUES ('XMLTagText');
This is my code as of now:
$xml = simplexml_load_file($target_file) or die ("Error: Cannot create object");
foreach ($xml->Message->SettlementReport->SettlementData as $main ){
$value = $main->asXML();
echo '<pre>'; echo $value; echo '</pre>';
}
foreach ($xml->Message->SettlementReport->Order as $main ){
$value = $main->asXML();
echo '<pre>'; echo $value; echo '</pre>';
}
This is what my file looks like to give you an idea (So essentially how do I get the tags within [SettlementData], [0], [Fulfillment], [Item], etc. ?):
I would like to move this data into a MySQL database whose columns match the tags of the file.
Your problem is twofold.
The first part of the problem is introspecting the database structure, that is, obtaining all table names and the column names of each. Most modern databases offer this functionality, and so does MySQL. In MySQL those are the INFORMATION_SCHEMA tables. You can query them as if they were normal database tables. I generally recommend PDO for that in PHP; mysqli naturally does the job just as well.
The second part is parsing the XML data and mapping its data onto the database tables (you use SimpleXMLElement for that in your question, so I refer to it specifically). For that you first of all need to decide how you would like to map the data from the XML onto the database. An XML file does not have a 2D structure like a relational database table; it has a tree structure.
For example (if I read your question right) you identify Message->SettlementReport->SettlementData as the first "table". For that specific example it is easy as the <SettlementData> only has child-elements that could represent a column name (the element name) and value (the text-content). For that it is easy:
header('Content-Type: text/plain; charset=utf-8');
$table = $xml->Message->SettlementReport->SettlementData;
foreach ($table->children() as $name => $value ) {
echo $name, ': ', $value, "\n";
}
As you can see, specifying the key assignment in the foreach clause will give you the element name with SimpleXMLElement. Alternatively, the SimpleXMLElement::getName() method does the same (just an example which does the same with slightly different code):
header('Content-Type: text/plain; charset=utf-8');
$table = $xml->Message->SettlementReport->SettlementData;
foreach ($table->children() as $value) {
$name = $value->getName();
echo $name, ': ', $value, "\n";
}
In this case you benefit from the fact that children() returns the child elements of the <SettlementData> element, so the foreach traverses them one by one. (Iterating $xml->...->SettlementData directly would instead step through the <SettlementData> elements themselves.)
A more generic concept would be Xpath here. So bear with me presenting you a third example which - again - does a similar output:
header('Content-Type: text/plain; charset=utf-8');
$rows = $xml->xpath('/*/Message/SettlementReport/SettlementData');
foreach ($rows as $row) {
foreach ($row as $column) {
$name = $column->getName();
$value = (string) $column;
echo $name, ': ', $value, "\n";
}
}
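To connect this back to the database goal, the name/value pairs can be turned into a parameterised INSERT. This is only a sketch: the sample XML, the table name settlement_data, the column whitelist and the helper buildInsert() are all hypothetical, and in practice the whitelist would come from the column names read from the database.

```php
<?php
// Hypothetical sample; element names are placeholders, not the real report.
$xml = simplexml_load_string(
    '<Envelope><Message><SettlementReport><SettlementData>'
    . '<SettlementID>123</SettlementID><TotalAmount>45.67</TotalAmount>'
    . '<Unmapped>ignored</Unmapped>'
    . '</SettlementData></SettlementReport></Message></Envelope>'
);

// Hypothetical helper: returns [SQL string, bound parameters].
function buildInsert(string $table, SimpleXMLElement $row, array $allowedColumns): array
{
    $columns = [];
    $params  = [];
    foreach ($row->children() as $name => $value) {
        // Only accept names known to be columns; never trust tag names blindly.
        if (!in_array($name, $allowedColumns, true)) {
            continue;
        }
        $columns[]          = '`' . $name . '`';
        $params[':' . $name] = (string) $value;
    }
    $sql = sprintf(
        'INSERT INTO `%s` (%s) VALUES (%s)',
        $table,
        implode(', ', $columns),
        implode(', ', array_keys($params))
    );
    return [$sql, $params];
}

[$sql, $params] = buildInsert(
    'settlement_data',
    $xml->Message->SettlementReport->SettlementData,
    ['SettlementID', 'TotalAmount']
);
// With a PDO connection in $pdo: $pdo->prepare($sql)->execute($params);
```

Binding the values as parameters keeps the XML text out of the SQL string, and the whitelist stops unexpected tags from becoming column names.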
However, as mentioned earlier, mapping a tree structure (N-depth) onto a 2D structure (a database table) might not always be that straightforward.
If you're looking for what an outcome could be (there will most often be data loss or data duplication), a more complex PHP example is given in these previous Q&As:
How excel reads XML file?
PHP XML to dynamic table
Please note: because such mappings can be complex in their own right, the questions and answers inherit that complexity. This first of all means they might not be easy to read but also, perhaps more prominently, might just not apply to your question. They are merely there to broaden your view and provide some examples for certain scenarios.
I hope this is helpful; please provide any feedback in the form of comments below. Your problem might or might not be less problematic, so this hopefully helps you decide how and where to go on.
I tried with SimpleXML but it skips text data. However, using the Document Object Model extension works.
This returns an array where each element is an array with 2 keys: tag and text, returned in the order in which the tree is walked.
<?php
// recursive, pass by reference (spare memory ? meh...)
// can skip non tag elements (removes lots of empty elements)
function tagData(&$node, $skipNonTag=false) {
// get function name, allows to rename function without too much work
$self = __FUNCTION__;
// init
$out = array();
$innerXML = '';
// get document
$doc = $node->nodeName == '#document'
? $node
: $node->ownerDocument;
// current tag
// we use a reference to innerXML to fill it later to keep the tree order
// without ref, this would go after the loop, children would appear first
// not really important but we never know
if(!(mb_substr($node->nodeName,0,1) == '#' && $skipNonTag)) {
$out[] = array(
'tag' => $node->nodeName,
'text' => &$innerXML,
);
}
// build current innerXML and process children
// check for children
if($node->hasChildNodes()) {
// process children
foreach($node->childNodes as $child) {
// build current innerXML
$innerXML .= $doc->saveXML($child);
// repeat process with children
$out = array_merge($out, $self($child, $skipNonTag));
}
}
// return current + children
return $out;
}
$xml = new DOMDocument();
$xml->load($target_file) or die ("Error: Cannot load xml");
$tags = tagData($xml, true);
//print_r($tags);
?>
I'm using the Simple HTML DOM Parser - http://simplehtmldom.sourceforge.net/manual.htm
I'm trying to scrape some data from a scoreboard page. The below example shows me pulling the HTML of the "Akron Rushing" table.
Inside $tr->find('td', 0), the first column, there is a hyperlink. How can I extract this hyperlink? Using $tr->find('td', 0)->find('a') does not seem to work.
Also: I can write conditions for each table (passing, rushing, receiving, etc), but is there a more efficient way to do this? I'm open to ideas on this one.
include('simple_html_dom.php');
$html = file_get_html('http://espn.go.com/ncf/boxscore?gameId=322432006');
$teamA['rushing'] = $html->find('table.mod-data',5);
foreach ($teamA as $type=>$data) {
switch ($type) {
# Rushing Table
case "rushing":
foreach ($data->find('tr') as $tr) {
echo $tr->find('td', 0); // First TD column (Player Name)
echo $tr->find('td', 1); // Second TD Column (Carries)
echo $tr->find('td', 2); // Third TD Column (Yards)
echo $tr->find('td', 3); // Fourth TD Column (AVG)
echo $tr->find('td', 4); // Fifth TD Column (TDs)
echo $tr->find('td', 5); // Sixth TD Column (LGs)
echo "<hr />";
}
}
}
In your case, find('tr') returns 10 elements instead of only the 7 rows expected.
Also, not all the names have links associated with them, and trying to retrieve a link when it doesn't exist may cause an error.
Therefore, here's a modified working version of your code:
$url = 'http://espn.go.com/ncf/boxscore?gameId=322432006';
$html = file_get_html($url);
$teamA['rushing'] = $html->find('table.mod-data',5);
foreach ($teamA as $type=>$data) {
switch ($type) {
# Rushing Table
case "rushing":
echo count($data->find('tr')) . " \$tr found !<br />";
foreach ($data->find('tr') as $key => $tr) {
$td = $tr->find('td');
if (isset($td[0])) {
echo "<br />";
echo $td[0]->plaintext . " | "; // First TD column (Player Name)
// If anchor exists
if($anchor = $td[0]->find('a', 0))
echo $anchor->href; // href
echo " | ";
echo $td[1]->plaintext . " | "; // Second TD Column (Carries)
echo $td[2]->plaintext . " | "; // Third TD Column (Yards)
echo $td[3]->plaintext . " | "; // Fourth TD Column (AVG)
echo $td[4]->plaintext . " | "; // Fifth TD Column (TDs)
echo $td[5]->plaintext; // Sixth TD Column (LGs)
echo "<hr />";
}
}
}
}
As you can see, an attribute can be reached using the format $tag->attributeName. In your case, attributeName is href.
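The same "link only if it exists" check can be sketched with PHP's built-in DOM extension instead of simple_html_dom. The markup here is a made-up stand-in for the box score table, with one linked and one unlinked name.

```php
<?php
// Hypothetical rows: one player cell with a link, one without.
$html = '<table>
  <tr><th>Player</th><th>Carries</th></tr>
  <tr><td><a href="/player/1">J. Doe</a></td><td>12</td></tr>
  <tr><td>Team</td><td>30</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$players = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $td = $tr->getElementsByTagName('td')->item(0);
    if ($td === null) {
        continue;                       // header or empty row
    }
    // item(0) is null when the cell has no anchor.
    $a = $td->getElementsByTagName('a')->item(0);
    $players[] = [
        'name' => trim($td->textContent),
        'href' => $a !== null ? $a->getAttribute('href') : null,
    ];
}
```

Rows without a link simply get a null href instead of triggering an error on a missing element.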
Notes:
It is a good idea to handle find() failures, knowing that it returns a falsy value (rather than an element) when nothing is found:
$td = $tr->find('td');
// Find succeeded
if ($td) {
// code here
}
else
echo "Find() failed in XXXXX";
PHP Simple HTML DOM Parser has known memory leak issues with PHP 5, so don't forget to free memory when DOM objects are no longer used:
$html = file_get_html(...);
// do something...
$html->clear();
unset($html);
Source: http://simplehtmldom.sourceforge.net/manual_faq.htm#memory_leak
According to the documentation you should be able to chain selectors for nested elements.
This is the example they give:
// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);
The only difference I can see is that they include the index in the second find. Try adding that in and see if it works for you.