I'm using the Simple HTML DOM Parser - http://simplehtmldom.sourceforge.net/manual.htm
I'm trying to scrape some data from a scoreboard page. The below example shows me pulling the HTML of the "Akron Rushing" table.
Inside $tr->find('td', 0), the first column, there is a hyperlink. How can I extract this hyperlink? Using $tr->find('td', 0)->find('a') does not seem to work.
Also: I can write conditions for each table (passing, rushing, receiving, etc), but is there a more efficient way to do this? I'm open to ideas on this one.
include('simple_html_dom.php');
$html = file_get_html('http://espn.go.com/ncf/boxscore?gameId=322432006');
$teamA['rushing'] = $html->find('table.mod-data',5);
foreach ($teamA as $type=>$data) {
switch ($type) {
# Rushing Table
case "rushing":
foreach ($data->find('tr') as $tr) {
echo $tr->find('td', 0); // First TD column (Player Name)
echo $tr->find('td', 1); // Second TD Column (Carries)
echo $tr->find('td', 2); // Third TD Column (Yards)
echo $tr->find('td', 3); // Fourth TD Column (AVG)
echo $tr->find('td', 4); // Fifth TD Column (TDs)
echo $tr->find('td', 5); // Sixth TD Column (LGs)
echo "<hr />";
}
}
}
In your case, find('tr') returns 10 elements rather than only the 7 rows you expect.
Also, not all of the names have links associated with them; trying to retrieve a link that doesn't exist may cause an error.
Therefore, here's a modified working version of your code:
$url = 'http://espn.go.com/ncf/boxscore?gameId=322432006';
$html = file_get_html($url);
$teamA['rushing'] = $html->find('table.mod-data',5);
foreach ($teamA as $type=>$data) {
switch ($type) {
# Rushing Table
case "rushing":
echo count($data->find('tr')) . " \$tr found !<br />";
foreach ($data->find('tr') as $key => $tr) {
$td = $tr->find('td');
if (isset($td[0])) {
echo "<br />";
echo $td[0]->plaintext . " | "; // First TD column (Player Name)
// If anchor exists
if($anchor = $td[0]->find('a', 0))
echo $anchor->href; // href
echo " | ";
echo $td[1]->plaintext . " | "; // Second TD Column (Carries)
echo $td[2]->plaintext . " | "; // Third TD Column (Yards)
echo $td[3]->plaintext . " | "; // Fourth TD Column (AVG)
echo $td[4]->plaintext . " | "; // Fifth TD Column (TDs)
echo $td[5]->plaintext; // Sixth TD Column (LGs)
echo "<hr />";
}
}
}
}
As you can see, an attribute can be reached using the format $tag->attributeName. In your case, attributeName is href.
Notes:
It would be a good idea to handle find's failures, knowing that it returns an empty array (or null when an index is passed) if nothing is found. Both are falsy, so a plain if check works:
$td = $tr->find('td');
// Find succeeded
if ($td) {
// code here
}
else
echo "Find() failed in XXXXX";
PHP Simple HTML DOM Parser has known memory leak issues with PHP 5, so don't forget to free up memory when DOM objects are no longer used:
$html = file_get_html(...);
// do something...
$html->clear();
unset($html);
Source: http://simplehtmldom.sourceforge.net/manual_faq.htm#memory_leak
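As a cross-check on the "optional anchor" logic above, here is a minimal sketch that does the same thing with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The sample markup, the player names and the extractRows() helper are all hypothetical; the real page may be structured differently.

```php
<?php
// Hypothetical stand-in for one stats table; the real page markup may differ.
$sampleHtml = '<table class="mod-data">
  <tr><th>Akron Rushing</th><th>CAR</th><th>YDS</th></tr>
  <tr><td><a href="/player/1">Player A</a></td><td>20</td><td>117</td></tr>
  <tr><td>Team</td><td>1</td><td>-2</td></tr>
</table>';

/**
 * Return one array per data row: the cell texts plus the first cell's
 * link, or null when that cell has no anchor. (Hypothetical helper.)
 */
function extractRows($html)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = array();
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = $xpath->query('td', $tr);
        if ($cells->length === 0) {
            continue; // header rows contain only <th>, so skip them
        }
        $row = array('link' => null, 'cells' => array());
        foreach ($cells as $td) {
            $row['cells'][] = trim($td->textContent);
        }
        // The anchor is optional, so check before reading its href.
        $anchor = $xpath->query('.//a', $cells->item(0))->item(0);
        if ($anchor !== null) {
            $row['link'] = $anchor->getAttribute('href');
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = extractRows($sampleHtml);
```

Because the function does not care which table it is given, the same call could be reused for the passing, rushing and receiving tables rather than writing one switch case per table.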
According to the documentation you should be able to chain selectors for nested elements.
This is the example they give:
// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);
The only difference I can see is that they include the index in the second find. Try adding that in and see if it works for you.
Related
There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
What I want is to scrape not the full table but PARTS of this table.
For example, I only want to display the rows for sportingbet, stoiximan and mybet, and I only need the 1, X and 2 columns rather than all of them. The numbers shown in red must also be scraped as they are, either with the red box or with an asterisk displayed next to them. Can this be done, or do I need to store the whole table in a database first and then query the database?
What I got now is this code I borrowed from another similar question on this forum which is:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach($table->find('tr') as $row) {
// initialize array to store the cell data from each row
$flight = array();
foreach($row->find('td') as $cell) {
// push the cell's text to the array
$flight[] = $cell->plaintext;
}
$rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
echo '<tr>';
foreach ($tr as $td)
echo '<td>' . $td .'</td>';
echo '</tr>';
}
echo '</table>';
?>
which returns the full table. Mainly, I want to somehow detect the numbers highlighted with the red box (in the 1, X and 2 columns) and display an asterisk next to them in my scrape. Secondly, I want to know whether it is possible to scrape specific columns and rows rather than everything. Do I need to use XPath?
I beg for someone to point me in the right direction I spent hours on this, the manual doesn't explain much http://simplehtmldom.sourceforge.net/manual.htm
The link is dead. However, you can do this with XPath and reference the cells you want by their colour and position, among many other ways.
This snippet will give you the general gist; taken from a project I'm working on atm:
function __construct($URL)
{
// make new DOM for nodes
$this->dom = new DOMDocument();
// set error level
libxml_use_internal_errors(true);
// Grab and set HTML Source
$this->HTMLSource = file_get_contents($URL);
// Load HTML into the dom
$this->dom->loadHTML($this->HTMLSource);
// Make xPath queryable
$this->xpath = new DOMXPath($this->dom);
}
function xPathQuery($query){
return $this->xpath->query($query);
}
Then simply pass a query to your DOMXPath, like //tr[1]
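To make that more concrete, here is a rough sketch of selecting only some rows and columns with DOMXPath and flagging highlighted cells with an asterisk. Since the original page is gone, the markup below is invented: the class name "red", the column order and the odds values are all assumptions.

```php
<?php
// Hypothetical markup: the class name "red" and the column order are assumptions.
$sampleHtml = '<table>
  <tr><th>Bookmaker</th><th>1</th><th>X</th><th>2</th><th>Over</th></tr>
  <tr><td>sportingbet</td><td class="red">1.30</td><td>5.00</td><td>11.00</td><td>1.90</td></tr>
  <tr><td>otherbook</td><td>1.28</td><td>5.25</td><td>12.00</td><td>1.85</td></tr>
  <tr><td>mybet</td><td>1.33</td><td class="red">5.10</td><td>10.00</td><td>1.95</td></tr>
</table>';

// Only these bookmakers should appear in the result.
$wanted = array('sportingbet', 'stoiximan', 'mybet');

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
$dom->loadHTML($sampleHtml);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$result = array();
// "//tr[td]" matches only rows that contain <td> cells, skipping the header row.
foreach ($xpath->query('//tr[td]') as $tr) {
    $name = trim($xpath->query('td[1]', $tr)->item(0)->textContent);
    if (!in_array($name, $wanted, true)) {
        continue;
    }
    $odds = array();
    // position() is 1-based: columns 2..4 hold the 1, X and 2 odds here.
    foreach ($xpath->query('td[position() >= 2 and position() <= 4]', $tr) as $td) {
        $isRed = strpos(' ' . $td->getAttribute('class') . ' ', ' red ') !== false;
        $odds[] = trim($td->textContent) . ($isRed ? '*' : '');
    }
    $result[$name] = $odds;
}
```

No database is needed for this kind of filtering; the row and column selection happens while walking the parsed document.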
I'm trying to work out how to traverse this specific table with "simple_html_dom.php". I've tried many different angles and just can't get it right. I can separate the table row by row, but I can't slice up the TD values into individual components.
What I'm trying to do is take the table from this site and move the TD values into specific (array of) variables I can reliably and predictably work with. The problem is partly compounded, I think, by the fact that the TR or TDs don't have any attributes that I can 'find'.
$dom = file_get_html('http://www.asx.com.au/asx/statistics/prevBusDayAnns.do');
$tds = $dom->find('table',0)->find('tr', 1)->find('td', 1);
foreach($tds as $td)
{
echo $td->plaintext . '<br />';
}
The code above finds the first TR but I would have expected $tds to have the value of TD cell 1. It does not though. It spits out the entire TR.
I've been over the documentation and had a good search around the net but no luck.
EDIT - Solution (something like this):
$trs = $dom->find('table',0)->find('tr'); // all rows of the first table
foreach($trs as $key => $tr)
{
$td = $tr->find('td');
if (isset($td[0]))
{
echo $td[0]->plaintext . '</br>'; // First TD column
//echo $td[1]->plaintext;
//echo $td[2]->plaintext;
//echo $td[3]->plaintext;
//echo $td[4]->plaintext;
//echo $td[5]->plaintext;
}
}
Replace
$dom->find('table',0)->find('tr', 1)->find('td', 1);
with
$dom->find('table',0)->find('tr', 1)->find('td');
Passing the second parameter makes find() return a single td element (the one at index 1) rather than an array of all of them. Note that this still only goes through a single table row, the one at index 1.
Here is my script, in which I am fetching three items: Medicine Name, Generic Name and Class Name. I can fetch the Medicine Name separately, but the Generic Name and Class Name come back as one string. If you run the script you will get a better idea of what I mean. I want to store the Generic Name and Class Name in separate columns of a table.
Script
<?php
error_reporting(0);
//simple html dom file
require('simple_html_dom.php');
//target url
$html = file_get_html('http://www.drugs.com/condition/atrial-flutter.html?rest=1');
//crawl td columns
foreach($html->find('td') as $element)
{
//get drug name
$drug_name = $element->find('b');
foreach($drug_name as $drug_name)
{
echo "Drug Name:-".$drug_name;
foreach($element->find('span[class=small] a',2) as $t)
{
//get the inner HTML
$data = $t->plaintext;
echo $data;
}
echo "<br/>";
}
}
?>
Thanks in advance
Your current code is a fair way from what you need, but you could use CSS selectors to get at those elements more easily.
Example:
$data = array();
$html = file_get_html('http://www.drugs.com/condition/atrial-flutter.html?rest=1');
foreach($html->find('tr td[1]') as $td) { // you do not need to loop each td!
// target the first td of the row
$drug_name = $td->find('a b', 0)->innertext; // get the drug name bold tag inside anchor
$other_info = $td->find('span.small[2]', 0); // get the other info
$generic_name = $other_info->find('a[1]', 0)->innertext; // get the first anchor, generic name
$children_count = count($other_info->children()); // count all of the children
$classes = array();
for($i = 1; $i < $children_count; $i++) { // since you already got the first, (in position zero) iterate all children starting from 1
$classes[] = $other_info->find('a', $i)->innertext; // push it inside another container
}
$data[] = array(
'drug_name' => $drug_name,
'generic_name' => $generic_name,
'classes' => $classes,
);
}
echo '<pre>';
print_r($data);
I recently (2 weeks ago) started coding in PHP, and today I ran into a problem. I am wondering if somebody can help or guide me.
I am getting XML data from a web service and want to render the data as shown in the image below.
The fetched XML looks like this
<pricesheets>
<pricesheet>
<buyinggroupname>China</buyinggroupname>
<categoryname>Category B</categoryname>
<currency>USD</currency>
<discamt>39330.00</discamt>
<productdesc>Product B description</productdesc>
<productId> Product B </productId>
</pricesheet>
<pricesheet>
<buyinggroupname>Asia</buyinggroupname>
<categoryname>Category A</categoryname>
<currency>USD</currency>
<discamt>39330.00</discamt>
<productdesc>Product A description</productdesc>
<productId> Product A </productId>
</pricesheet>
</pricesheets>
The issue I am having is what's the best way to parse above XML so that I can render products based on 'buyinggroupname' and 'categoryname'. I can easily accomplish the collapse and expand feature once I know how to render the data.
Below is what I have done to achieve what I want. But I know for sure that my code is NOT efficient and scalable.
$xmldata; // XML return by the webservice
$data = simplexml_load_string($xmldata);
$category_A_items = '';
$category_B_items = '';
foreach ($data as $object) {
if($object->categoryname == 'Category A') { // Bad Idea : Hard coded category
$category_A_items .= '<tr><td>'.$object->productId.'</td><td>'. $object->productdesc. '</td><td>'. $object->discamt. '</td></tr>';
}
elseif($object->categoryname == 'Category B') { // Bad Idea : Hard coded category
$category_B_items .='<tr><td>'.$object->productId. '</td><td>'. $object->productdesc. '</td><td>'. $object->discamt. '</td></tr>';
}
}
//Render Category A items in table
if(strlen($category_A_items) > 0) {
echo '<h3>CAD</h3>';
echo '<table><tr><th>Product Name</th><th>Description</th><th>Price</th></tr>';
echo $category_A_items;
echo '</table>'. PHP_EOL;
}
//Render Category B items in table
if(strlen($category_B_items) > 0) {
echo '<h3>Breast Biopsy</h3>';
echo '<table><tr><th>Product Name</th><th>Description</th><th>Price</th></tr>';
echo $category_B_items;
echo '</table>'. PHP_EOL;
}
The above code only renders the data based on categories (which are hard coded). What would be a better way of doing the same, so that I can render the data based on 'buyinggroupname' and 'categoryname' without hard coding either of these two values in the PHP code?
Thanks in advance!
Get an array of unique <categoryname> nodes with XPath, then loop through it and select all <pricesheet> nodes with that specific category, letting XPath do that job again:
$xml = simplexml_load_string($x);
$cat = array_unique($xml->xpath("//categoryname"));
foreach ($cat as $c) {
echo "$c<br />";
foreach ($xml->xpath("//pricesheet[categoryname='$c']") as $p) {
echo $p->productId."<br />";
}
}
See a live demo at http://codepad.viper-7.com/m9ruRU
Of course, you have to add code for creating tables...
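A possible way to add the table markup on top of the XPath approach is sketched below. It assumes the XML is well formed (the productId tags spelled consistently) and that category names contain no single quotes, since the name is interpolated into the XPath expression. The renderCategoryTables() function name and the sample data are illustrative only.

```php
<?php
// Sample data shaped like the question's XML, assuming consistent <productId> tags.
$x = '<pricesheets>
  <pricesheet>
    <buyinggroupname>China</buyinggroupname>
    <categoryname>Category B</categoryname>
    <discamt>39330.00</discamt>
    <productdesc>Product B description</productdesc>
    <productId> Product B </productId>
  </pricesheet>
  <pricesheet>
    <buyinggroupname>Asia</buyinggroupname>
    <categoryname>Category A</categoryname>
    <discamt>39330.00</discamt>
    <productdesc>Product A description</productdesc>
    <productId> Product A </productId>
  </pricesheet>
</pricesheets>';

// Hypothetical helper: one heading and one table per distinct category.
function renderCategoryTables(SimpleXMLElement $xml)
{
    // Convert the nodes to plain strings before de-duplicating and sorting.
    $categories = array();
    foreach ($xml->xpath('//categoryname') as $node) {
        $categories[] = (string) $node;
    }
    $categories = array_unique($categories);
    sort($categories);

    $out = '';
    foreach ($categories as $category) {
        $out .= '<h3>' . htmlspecialchars($category) . '</h3>' . PHP_EOL;
        $out .= '<table><tr><th>Product Name</th><th>Description</th><th>Price</th></tr>' . PHP_EOL;
        // Assumption: $category contains no single quote, or this expression breaks.
        foreach ($xml->xpath("//pricesheet[categoryname='$category']") as $p) {
            $out .= '<tr><td>' . htmlspecialchars(trim((string) $p->productId)) . '</td>'
                  . '<td>' . htmlspecialchars((string) $p->productdesc) . '</td>'
                  . '<td>' . htmlspecialchars((string) $p->discamt) . '</td></tr>' . PHP_EOL;
        }
        $out .= '</table>' . PHP_EOL;
    }
    return $out;
}

$tables = renderCategoryTables(simplexml_load_string($x));
```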
Putting the strings in arrays with variable keys lets PHP keep them organized for you, and you don't have to know any category names at all. I used multidimensional arrays so that each buying group has its categories inside it. Then you loop through the arrays to make each table. Let me know if you need more explanation. I misunderstood your image the first time I saw it.
$xmldata; // XML return by the webservice
$data = simplexml_load_string($xmldata);
$buyinggroups = array();
foreach ($data as $object) {
if(isset($object->buyinggroupname) || isset($object->BUYINGGROUPNAME)) {
if(isset($object->buyinggroupname)) {
$name = $object->buyinggroupname;
} else {
$name = $object->BUYINGGROUPNAME;
}
}
if(isset($object->categoryname) || isset($object->CATEGORYNAME)) {
if(isset($object->categoryname)) {
$category = $object->categoryname;
} else {
$category = $object->CATEGORYNAME;
}
}
if(isset($category) && isset($name)) { //just making sure this row is OK
if(!isset($buyinggroups[$name])) {
$buyinggroups[$name] = array(); //initialize the outer array
}
if(!isset($buyinggroups[$name][$category])) {
$buyinggroups[$name][$category] = ''; //this is like your previous $category_A_items
}
$buyinggroups[$name][$category] .= '<tr><td>'.$object->productId.'</td><td>'. $object->productdesc. '</td><td>'. $object->discamt. '</td></tr>';
}
}
//Render all categories in lots of tables
//I am guessing at what HTML you want here; I don't think it's necessarily correct
echo '<table>'. PHP_EOL;
foreach($buyinggroups as $name=>$set) {
echo '<tr><th colspan="2">'.$name.'</th></tr>'. PHP_EOL;
echo '<tr><th> </th><td>';
foreach($set as $category=>$rows) {
echo '<table>';
echo '<tr><th><h3>'.$category.'</h3></th>'. PHP_EOL;
echo '<td><table><tr><th>Product Name</th><th>Description</th><th>Price</th></tr>';
echo $rows;
echo '</table>'. PHP_EOL;
}
echo '</td></tr>';
}
echo '</table>';
EDIT:
This can't possibly be beyond your ability to debug. You are getting everything you need in order to debug. PHP tells you the line number and the error. You google the error and find out what it means. Then you go to that line number and see how what's there corresponds to the thing you googled. In this case, I can tell you that "illegal offset type" means that you have an array key that is not a string or integer. On those lines in the error messages, you have the array keys $name and $category. Try var_dump($name) and var_dump($category) to find out what they actually are, or even var_dump($object) to find out how to get name and category out of the object.
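For what it's worth, one likely cause of "illegal offset type" here, assuming $name and $category hold SimpleXMLElement objects rather than strings, is that objects cannot be used as array keys. Casting to string first avoids that. The snippet below is a sketch under that assumption, using made-up sample data in the question's shape; it stores plain values rather than HTML so the grouping is easy to inspect.

```php
<?php
// Made-up sample data in the shape of the question's XML.
$xmldata = '<pricesheets>
  <pricesheet>
    <buyinggroupname>China</buyinggroupname>
    <categoryname>Category B</categoryname>
    <discamt>39330.00</discamt>
    <productdesc>Product B description</productdesc>
    <productId> Product B </productId>
  </pricesheet>
  <pricesheet>
    <buyinggroupname>Asia</buyinggroupname>
    <categoryname>Category A</categoryname>
    <discamt>39330.00</discamt>
    <productdesc>Product A description</productdesc>
    <productId> Product A </productId>
  </pricesheet>
</pricesheets>';

$data = simplexml_load_string($xmldata);

$buyinggroups = array();
foreach ($data->pricesheet as $sheet) {
    // Cast to string: a SimpleXMLElement object is not a valid array key.
    $name = (string) $sheet->buyinggroupname;
    $category = (string) $sheet->categoryname;

    // PHP creates the nested arrays on first use, so no isset() checks are needed.
    $buyinggroups[$name][$category][] = array(
        'product'     => trim((string) $sheet->productId),
        'description' => (string) $sheet->productdesc,
        'price'       => (string) $sheet->discamt,
    );
}
```

Running var_dump($name) before and after adding the cast shows the difference between the object and the plain string.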
I have the following code:
SELECT q21 AS Text, q21coding AS Description FROM `tresults_acme` WHERE q21 IS NOT NULL AND q21 <> '' ORDER BY q21coding
It brings back the following (excerpt):
Text | Description
Lack of up to date equal pay cases&legislation - t... | Content needs updating
The intranet could contain more "up to date traini... | Content needs updating
Poorly set out. It is hard to find things. | Difficulty in navigating/finding content
Only use the intranet as a necessity. Will ask my ... | Difficulty in navigating/finding content
Now, I'd like to display this in a table on a PHP page but am having some problems because of the way I'd like it displayed, it needs to be as follows:
Content needs updating
----------------------
[List all the comments relating to this description]
Difficulty in navigating/finding content
----------------------------------------
[List all the comments relating to this description]
and so on.
Now I think it is a For Each loop in PHP but I am having terrible difficulty getting my head around this - any ideas and suggestions very very welcome!
Thanks,
Simple approach
1. Set prev_desc to NULL.
2. For each row, check whether its description differs from prev_desc. If it does, print the description as the heading of a new "section" and set prev_desc to that description.
3. Then print the row's text.
E.g.¹ (untested!):
$prev_desc = null;
while ($row = mysql_fetch_assoc(...)) {
if ($prev_desc != $row['description']) {
print '<h1>' . $row['description'] . '</h1>';
$prev_desc = $row['description'];
}
print $row['text'] . '<br />'; // Formatting needed
}
Note: You must keep the ORDER BY <description-column> in order to have rows "grouped". Otherwise this simple approach will not work.
Less presentation-specific approach
It could be considered "cleaner" to create some kind of 2D container to "categorize" the extracted data, e.g.,
$items = array(
'Content needs updating' => array(
'Lack of ...',
'The intra...'
),
...
);
You could then loop over these items like so¹:
foreach ($items as $desc => $texts) {
print '<h1>' . $desc . '</h1>';
foreach ($texts as $text) {
print $text . '<br />';
}
}
¹ As @bobince has noted, make sure that content going directly into the final HTML is properly escaped, see e.g. htmlspecialchars().
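How the 2D container gets filled is not spelled out above, so here is one possible sketch. It assumes each row is an associative array with 'Text' and 'Description' keys (the exact keys depend on your query and database library), and the rows, groupByDescription() and renderGroups() are made up for illustration. It also applies htmlspecialchars() on output, as the footnote recommends.

```php
<?php
// Hypothetical rows, standing in for whatever the database library returns.
// They are deliberately not sorted by description.
$rows = array(
    array('Text' => 'Comment 1 & more', 'Description' => 'Needs updating'),
    array('Text' => 'Comment 2',        'Description' => 'Hard to navigate'),
    array('Text' => 'Comment 3',        'Description' => 'Needs updating'),
);

// Group the comment texts under their description.
function groupByDescription(array $rows)
{
    $items = array();
    foreach ($rows as $row) {
        // The description becomes the key; PHP creates the inner array on first use.
        $items[$row['Description']][] = $row['Text'];
    }
    return $items;
}

// Render one heading per description, followed by its escaped comments.
function renderGroups(array $items)
{
    $out = '';
    foreach ($items as $desc => $texts) {
        $out .= '<h1>' . htmlspecialchars($desc) . '</h1>' . PHP_EOL;
        foreach ($texts as $text) {
            $out .= htmlspecialchars($text) . '<br />' . PHP_EOL;
        }
    }
    return $out;
}

$items = groupByDescription($rows);
$html = renderGroups($items);
```

Unlike the simple approach, this grouping does not rely on the rows arriving in ORDER BY order, at the cost of holding all rows in memory first.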
You just need to keep track of which heading you last displayed. I don't know which library you're using for database access, so the details of how you access columns/rows will be slightly different, but here it is in kind-of pseudocode:
$lastHeading = '';
foreach($rows as $row)
{
if ($lastHeading != $row['Description'])
{
if ($lastHeading != '')
echo '</ul>';
$lastHeading = $row['Description'];
echo "<h1>$lastHeading</h1>";
echo '<ul>';
}
echo '<li>'.$row['Text'].'</li>';
}
if ($lastHeading != '')
echo '</ul>';
This has the added feature of putting comments in a <ul>, not sure if that's required for you or not.
This works because you've sorted by the "description" column. That means you know that all of the rows with the same "description" will come together.
You can either run a separate query for each section, or loop over the data multiple times and filter on the description in PHP.
$descriptions = array('Content needs updating','Difficulty in navigating/finding content');
$rows = array(); // placeholder: fetch all rows from the query here
foreach($descriptions as $description)
{
echo '<h1>',$description,'</h1>';
foreach($rows as $row)
{
if ($row['description'] == $description)
{
echo $row['text'],'<br />';
}
}
}