Parsing info from a table without headers, using PHP, DOM and cURL

I need to parse data from a table that I scrape from a different website using PHP.
The table looks like this:
<table id="IWGRD" border="1" cellpadding="0" cellspacing="0" width="409" bordercolor="#FFFFFF" bordercolorlight="#FFFFFF" bordercolordark="#FFFFFF" class="IWGRDCSS" style="width:409;height:10;z-index:100;font-style:normal;font-size:10pt;text-decoration:none;">
<tbody>
<tr>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Dag </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Datum </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lesuur </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lokaal </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Docent(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Vak </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Groep(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Toelichting </b></font>
</td>
</tr>
<tr>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> Di </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 12-11-2013 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 5 - 6 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> B2.33 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> LKH02 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SWSP14SLB1V13_SWSP15PRA1V13 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> MAV1SP10 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SLB major 1 / praktijkleren </font>
</td>
</tr>
This table is generated by JavaScript.
In this table, the first tr holds the td cells with the column titles, while all the remaining rows hold the data that I need to parse.
I've been struggling with this for a while. I found an answer on this website that helped a little, but it reads the table using the ids of the td and th elements, while my table doesn't have ids on its rows or cells.
I'm using cURL to fetch the table HTML from another website and load it into DOM like this:
<?php
include_once('/simple_dom/simple_html_dom.php');
//step1
$cSession = curl_init();
//step2
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($cSession, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($cSession, CURLOPT_COOKIEFILE, $tmpfname);
curl_setopt($cSession,CURLOPT_URL,"http://anonymusurlbecauseofprivacyreasons?somegetters");
curl_setopt($cSession,CURLOPT_RETURNTRANSFER,true);
curl_setopt($cSession, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($cSession,CURLOPT_HEADER, false);
curl_setopt ($cSession, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($cSession, CURLOPT_CAINFO, dirname(__FILE__)."/cacert.pem");
curl_setopt($cSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$result=curl_exec($cSession);
if ($result === FALSE) {
echo "cURL Error: " . curl_error($cSession);
}
curl_close($cSession);
// create empty document
$dom = new DomDocument;
// load the fetched HTML; @ suppresses warnings about malformed markup
@$dom->loadHTML($result);
$xpath = new DomXPath($dom);
Okay, so far, so good.
But now comes the part of the code that I can't get working.
To read out the data I copied and edited the code from this thread (How to parse this table and extract data from it?), but I can't get it working.
// collect data
foreach ($xpath->query('//table[@id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
}
print_r($rowData);
Which gives me the following output:
Array ( [0] => [1] => [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek )
This is the correct output for the last row, but I need all the rows.
So the output I need is every row (except the header row at the top).
Something like:
array[1] = ([0] => Mon [1] => 11-11-2013 [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek)
Array[2] = ([0] => Mon [1] => 11-11-2013 [2] => 8 - 9 [3] => S0.20 [4] => name [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => randomresult)
That way I can put the data in variables and pass it on to an app.
Does anyone know how to do this? I've been working on this for hours because I have no experience with cURL or DOM whatsoever.
Any help is much appreciated! :)

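It seems like you're not collecting every row as you go along: $rowData is reset on each iteration and only printed after the loop, so only the last row survives. Append each row to an outer array instead: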
$tableData = array();
foreach ($xpath->query('//table[@id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
$tableData[] = $rowData;
}
print_r($tableData);
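To also drop the header row, one option is to filter it out in the XPath expression itself. The following is a minimal, self-contained sketch rather than a definitive implementation: the inline $html sample and the parseRows() helper name are assumptions made for illustration. It uses //tr so that it matches whether or not a tbody is present in the fetched markup.

```php
<?php
// Hypothetical helper: returns every data row of the table as an array of cell strings.
function parseRows(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tableData = [];
    // position() > 1 skips the first <tr>, which holds the column titles.
    foreach ($xpath->query('//table[@id="IWGRD"]//tr[position() > 1]') as $row) {
        $rowData = [];
        foreach ($xpath->query('td', $row) as $cell) {
            // Remove non-breaking spaces, then trim ordinary whitespace.
            $rowData[] = trim(str_replace("\xc2\xa0", '', $cell->textContent));
        }
        $tableData[] = $rowData;
    }
    return $tableData;
}

// Assumed sample input, shortened from the table in the question.
$html = '<table id="IWGRD"><tbody>'
      . '<tr><td><b> Dag </b></td><td><b> Datum </b></td></tr>'
      . '<tr><td><font> Di </font></td><td><font> 12-11-2013 </font></td></tr>'
      . '<tr><td><font> Wo </font></td><td><font> 13-11-2013 </font></td></tr>'
      . '</tbody></table>';

$rows = parseRows($html);
print_r($rows);
```

Each entry of $rows is then one table row, so $rows[0][1] would hold the date of the first data row.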

How to get href attribute from html page

I have this html code:
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
I'd like to collect all the href attributes in an array.
I'm trying to use this PHP code:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = trim($tag->innertext);
}
If I print the $olympiad array, I get something like:
Array
(
[0] => 1
[1] => <img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan
[2] => 1936
[3] => 2016
[4] => 103
[5] => 7
[6] =>
[7] =>
[8] => 2
[9] => 2
[10] =>
Why does it behave like this? I'd also like to get the text inside the link (in this case Afghanistan), possibly in another array.
I'm not a PHP expert, so I'm asking for your help.
You can load the HTML into a DOMDocument like this. It is an example you can adapt:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
$doc = new DOMDocument();
$doc->loadHTML( $html);
// example 1:
$elements = $doc->getElementsByTagName('*');
// example 2:
$elements = $doc->getElementsByTagName('html');
// example 3:
$elements = $doc->getElementsByTagName('body');
// example 4:
$elements = $doc->getElementsByTagName('table');
// example 5:
$elements = $doc->getElementsByTagName('div');
I hope it helps.
If you want to find all href attributes, I think you can add an a to the selector: $tagname_tr = 'td align="left" a';
Then you can loop over the result and get the href and the innertext.
As an example, the values are stored in two arrays and the HTML is loaded from a string:
include_once ('/share/Multimedia/simple_html_dom.php');
$source = <<<SOURCE
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
SOURCE;
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left" a';
$olympiad = array();
$elementText = array();
//$html = file_get_html($url,true);
$html = str_get_html($source);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = $tag->href;
$elementText[] = $tag->innertext;
}
echo "<pre>";
print_r($olympiad);
print_r($elementText);
Will result in:
Array
(
[0] => /olympics/countries/AFG/
)
Array
(
[0] => Afghanistan
)
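The same lookup can probably be done without the Simple HTML DOM library, using PHP's built-in DOMDocument and DOMXPath. This is a rough sketch under stated assumptions: the sample markup is shortened, the image file name is made up, and the a element is a guess at the live markup based on the href shown in the output above.

```php
<?php
// Assumed sample: the <a> element and "flag.png" are illustrative guesses,
// not copied from the real page.
$source = '<table><tbody><tr>'
        . '<td align="right">1</td>'
        . '<td align="left"><img src="flag.png" alt="AFG">'
        . ' <a href="/olympics/countries/AFG/">Afghanistan</a></td>'
        . '<td align="right">1936</td>'
        . '</tr></tbody></table>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$hrefs = [];
$names = [];
// Select only <a> elements with an href that sit inside a left-aligned cell.
foreach ($xpath->query('//td[@align="left"]//a[@href]') as $a) {
    $hrefs[] = $a->getAttribute('href');
    $names[] = trim($a->textContent);
}
print_r($hrefs);
print_r($names);
```

Keeping the two arrays index-aligned means $hrefs[$i] and $names[$i] always describe the same link.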

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

I have been trying to extract the table text, along with its link, from the table below (which is on site1.com) into my PHP page using a web crawler.
Unfortunately, because of an incorrect array index in the PHP code, the output is an error.
site1.com
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
The PHP web crawler is as follows:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
$first_step = explode( '<table class="Table2">' , $returned_content );
$second_step = explode('</table>', $first_step[0]);
$third_step = explode('<tr>', $second_step[1]);
// print_r($third_step);
foreach ($third_step as $key=>$element) {
$child_first = explode( '<td class="FootNotes2"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[0] );
$final = "<a href=".$child_fourth[0]."</a></br>";
?>
<li target="_blank" class="itemtitle">
<?php echo $final?>
</li>
<?php
if($key==10){
break;
}
}
?>
The array indexes in the PHP code above may be the culprit (I guess).
If so, can someone please explain how to make this work?
My final requirement for this code is:
to get the text of the second column above, with the link associated with it.
Any help is appreciated.
Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
return $node->text();
});
Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML
http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
$text = $link->nodeValue;
}
EDIT:
To get the links you want, the code assumes you have a $returned_content variable with the HTML you want to parse.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
$parentNode = $node->parentNode;
// checking if the <a> is under a <td> that has class="FootNotes2"
$isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
// checking if the <a> has class="Links2"
$isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
// as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
$links[] = [
'href' => $node->getAttribute('href'),
'text' => $parentNode->textContent,
];
}
}
print_r($links);
This will create an array similar to:
Array
(
[0] => Array
(
[href] => /files/forum/2017/1/837242.php
[text] => Q#Q Drill Time ① - cardio69
)
[1] => Array
(
[href] => /files/forum/2017/1/837356.php
[text] => study partner in Houston - lacy
)
[2] => Array
(
[href] => /files/forum/2017/1/837110.php
[text] => Serious dedicated study partner for U World - step12013
)
...
Using the Simple HTML DOM Parser library, you can use the following code:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
$element->href = "http://www.usmleforum.com" . $element->href; // you can also access only certain attributes of the elements (e.g. the url).
echo $element.'</br>'; // do something with the elements.
}
?>
I tried the same code on another site, and it works.
Please take a look at it:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
foreach ($third_step as $element) {
$child_first = explode( '<td class="alt1"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[1] );
echo $final = "<a href=".$child_fourth[0]."</a></br>";
}
?>
I know it's too much to ask, but could you please combine these two into code that makes the crawler work?
@jkmak
Chopping at HTML with string functions or regex is not a reliable method. DOMDocument and XPath do a nice job.
Code:
$dom=new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[@class = 'FootNotes2']/a") as $node) { // target a tags that have <td class="FootNotes2"> as parent
$result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue]; // extract/store the href and text values
if (sizeof($result) == 10) { break; } // set a limit of 10 rows of data
}
if (isset($result)) {
echo "<ul>\n";
foreach ($result as $data) {
echo "\t<li class=\"itemtitle\">{$data['text']}</li>\n";
}
echo "</ul>";
}
Sample Input:
$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">some text - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;
Output:
<ul>
<li class="itemtitle">Serious dedicated study partner for U World</li>
<li class="itemtitle">some text</li>
</ul>

Looping with XPath in php to get tables values

I have this html table:
<tbody>
<tr>..</tr>
<tr>
<td class="tbl_black_n_1">1</td>
<td class="tbl_black_n_1" nowrap="" align="center">23/07/14 08:10</td>
<td class="tbl_black_n_1">
<img src="http://www.betonews.com/img/SportId389.gif" width="10" height="10" border="0" alt="">
</td>
<td class="tbl_black_n_1"></td>
<td class="tbl_black_n_1" nowrap="" align="center">BAK WS</td>
<td class="tbl_black_n_1" nowrap="" align="right">M. Eguchi</td>
<td class="tbl_black_n_1" align="center">-</td>
<td class="tbl_black_n_1" nowrap="">Radwanska U. </td>
<td class="tbl_black_n_1" align="center" title=" ">1,02</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" "> </td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" ">55,00</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="right">86%</td>
<td class="tbl_black_n_1" align="right">-</td>
<td class="tbl_black_n_1" align="right">14%</td>
<td class="tbl_black_n_1" align="center" title=" ">524.647</td>
<td class="tbl_black_n_1" nowrap="">
<img src="http://www.betonews.com//img/i_betfair.gif" width="12" height="10" border="0" alt="">
<img src="http://www.betonews.com//img/i_history.gif" width="12" height="10" border="0" alt="">
</td>
</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
...
</tbody>
There are more than one hundred <tr> elements structured in the same way, each containing many <td> cells. How can I loop over them with XPath to store all the data in a database? I don't want the first <tr>: the query has to begin with the second <tr> (the one I have shown).
This is my PHP code, but I can't get any further. Help!
<?php
$url = 'http://www.betonews.com/table.asp?tp=2001&lang=en&dd=23&dm=7&dy=2014&df=1&dw=3';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
$document = new DOMDocument();
$document->loadHTML($response);
$xpath = new DOMXPath($document);
$expression = '/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
???
}
This is what I want to be the final result:
[0] => Array
(
[date] => 23/07/14 08:10
[image] => http://www.betonews.com/img/SportId389.gif
[team1] => M. Eguchi
[team2] => Radwanska U.
[1] => 1,02
[x] => 0
[2] => 55,00
[1%] => 86%
[x%] => 0
[2%] => 14%
[total] => 524.647
)
I would use a different XPath to select the table. First, absolute paths are always a problem with tables like this, because tbody elements are often added by the browser and are not actually present in the document, i.e. not visible to the PHP code. Second, if anything in the source HTML changes in terms of styling, your code breaks. Here I select the first table with a cellpadding of 3. This is not optimal, but there wasn't any obvious unique identifier.
Apart from that, you can simply iterate over the DOMNodeList result and then get the correct child nodes. Notice that the item indexes increase by two, because the whitespace-only text nodes between the cells are also nodes in the DOM.
$xpath = new DOMXPath($document);
$expression = '(//table[@cellpadding="3"])[1]/tr[position() > 1]';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
$td = $row->childNodes;
$result["date"] = $td->item(2)->nodeValue;
$result["image"] = $td->item(4)->firstChild->attributes->getNamedItem("src")->nodeValue;
$result["team1"] = $td->item(10)->nodeValue;
$result["team2"] = $td->item(12)->nodeValue;
$result["1"] = $td->item(14)->nodeValue;
$result["x"] = $td->item(16)->nodeValue;
$result["2"] = $td->item(18)->nodeValue;
$result["1%"] = $td->item(20)->nodeValue;
$result["x%"] = $td->item(22)->nodeValue;
$result["2%"] = $td->item(24)->nodeValue;
$result["total"] = $td->item(26)->nodeValue;
$results[] = $result;
}
For the image, you have to do some more processing, because you do not want the actual text, but the src attribute of the <img> element instead.
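If the doubled indexes feel fragile, a possible variation is to query the td elements of each row instead of walking childNodes, since an XPath query for td returns element nodes only. The sketch below assumes a shortened sample row, so the column positions used here (1, 3 and 4) are illustrative and would differ on the real page.

```php
<?php
// Assumed sample: a shortened row in the general shape shown in the question.
$html = '<table cellpadding="3">'
      . '<tr><td>header</td></tr>'
      . '<tr>'
      . '<td>1</td>'
      . '<td>23/07/14 08:10</td>'
      . '<td><img src="http://www.betonews.com/img/SportId389.gif"></td>'
      . '<td>M. Eguchi</td>'
      . '<td>Radwanska U. </td>'
      . '</tr></table>';

$document = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($document);

$results = [];
foreach ($xpath->query('(//table[@cellpadding="3"])[1]//tr[position() > 1]') as $row) {
    // Querying "td" returns element nodes only, so the indexes are not
    // shifted by whitespace text nodes between the cells.
    $cells = $xpath->query('td', $row);
    // The image lives in the third cell of this sample; it may be missing.
    $img = $xpath->query('td[3]/img', $row)->item(0);
    $results[] = [
        'date'  => trim($cells->item(1)->textContent),
        'image' => $img !== null ? $img->getAttribute('src') : null,
        'team1' => trim($cells->item(3)->textContent),
        'team2' => trim($cells->item(4)->textContent),
    ];
}
print_r($results);
```

The null check on $img avoids a fatal error on rows that have no image in that cell.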

PHP: Scrape all numbers from Brackets "(123)" [closed]

I have some HTML code. The structure is always the same, but I don't know how to extract all the numbers from the brackets.
Example code:
<table align="left" border="0" cellpadding="0" cellspacing="1">
<tbody><tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
5 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="73%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:73%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (96)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
4 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="11%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:11%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (15)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
3 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="7%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:7%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (10)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
2 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="3%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:3%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (4)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
1 Stern<span style="color:#FFFFFF">e</span>:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="4%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:4%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (6)</td>
</tr>
<tr><td> </td><td><div style="width:60px;"> </div></td><td> </td></tr>
</tbody></table>
In this case I need these numbers: 96, 15, 10, 4 and 6.
Please give me a tip on which function is suitable for this.
You can use a DOM parser such as the DOMDocument class to parse the HTML document. Since the structure is always the same, you can simply traverse the DOM using an XPath expression and grab the text from the third <td> node. Once you have the node value, you can use a simple preg_replace() to get the number:
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//table/tbody/tr/td[3]/text()') as $node) {
$number = preg_replace('~\D~', '', $node->nodeValue);
echo $number . '<br/>';
}
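Alternatively, if you only need the numbers, a regular expression over the raw HTML (here assumed to be in $content) is enough: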
preg_match_all('~\((\d+)\)~',$content,$numbers);
print_r($numbers); // example to print results
output:
Array
(
[0] => Array
(
[0] => (96)
[1] => (15)
[2] => (10)
[3] => (4)
[4] => (6)
)
[1] => Array
(
[0] => 96
[1] => 15
[2] => 10
[3] => 4
[4] => 6
)
)
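The two approaches above can also be combined when the label of each row matters. The following is only a sketch under assumptions: the sample rows are shortened from the question, and the $counts variable name is made up. It walks the rows with XPath and applies the bracket pattern to the third cell only.

```php
<?php
// Assumed sample: two shortened rating rows plus the empty spacer row.
$html = '<table><tbody>'
      . '<tr><td> 5 Sterne: </td><td class="tiny" title="73%"></td><td> (96)</td></tr>'
      . '<tr><td> 4 Sterne: </td><td class="tiny" title="11%"></td><td> (15)</td></tr>'
      . '<tr><td> </td><td></td><td> </td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

$counts = [];
foreach ($xpath->query('//tr') as $row) {
    $cells = $xpath->query('td', $row);
    if ($cells->length < 3) {
        continue;
    }
    // Keep only rows whose third cell looks like "(123)".
    if (preg_match('~\((\d+)\)~', $cells->item(2)->textContent, $m)) {
        // Use the first cell, without its trailing colon, as the key.
        $label = rtrim(trim($cells->item(0)->textContent), ':');
        $counts[$label] = (int) $m[1];
    }
}
print_r($counts);
```

The spacer row is skipped automatically because its third cell contains no bracketed number.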

Parsing Wikipedia Page tables issue

Hi, I'm trying to parse a Wikipedia page that contains a table with the class "infobox biota", structured as shown below. I'm trying to get the table data and classes for the following characteristics:
Kingdom:
Phylum:
Subphylum:
Class:
Order:
Family:
<table class="infobox biota" style="text-align: left; width: 200px; font-size: 100%">
<tbody><tr>
<th colspan="2" style="text-align: center; background-color: rgb(211,211,164)">Rabbit</th>
</tr>
<tr>
<td colspan="2" style="text-align: center"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Rabbit_in_montana.jpg/250px-Rabbit_in_montana.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Rabbit_in_montana.jpg/375px-Rabbit_in_montana.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Rabbit_in_montana.jpg/500px-Rabbit_in_montana.jpg 2x" height="222" width="250"></td>
</tr>
<tr>
<th colspan="2" style="text-align: center; background-color: rgb(211,211,164)">Scientific classification</th>
</tr>
<tr>
<td>Kingdom:</td>
<td><span class="kingdom" style="white-space:nowrap;">Animalia</span></td>
</tr>
<tr>
<td>Phylum:</td>
<td><span class="phylum" style="white-space:nowrap;">Chordata</span></td>
</tr>
<tr>
<td>Subphylum:</td>
<td><span class="subphylum" style="white-space:nowrap;">Vertebrata</span></td>
</tr>
<tr>
<td>Class:</td>
<td><span class="class" style="white-space:nowrap;">Mammalia</span></td>
</tr>
<tr>
<td>Order:</td>
<td><span class="order" style="white-space:nowrap;">Lagomorpha</span></td>
</tr>
<tr>
<td>Family:</td>
<td><span class="family" style="white-space:nowrap;">Leporidae<br>
<small>in part</small></span></td>
</tr>
<tr>
<th colspan="2" style="text-align: center; background-color: rgb(211,211,164)">Genera</th>
</tr>
<tr>
<td colspan="2" style="text-align: left">
<div>
<table style="background-color:transparent;table-layout:fixed;" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr valign="top">
<td>
<div style="margin-right:20px;">
<p><i>Pentalagus</i><br>
<i>Bunolagus</i><br>
<i>Nesolagus</i><br>
<i>Romerolagus</i></p>
</div>
</td>
<td>
<div style="margin-right: 20px;">
<p><i>Brachylagus</i><br>
<i>Sylvilagus</i><br>
<i>Oryctolagus</i><br>
<i>Poelagus</i></p>
</div>
</td>
</tr>
</tbody></table>
</div>
</td>
</tr>
</tbody></table>
Here is my attempt to parse the table and obtain the kingdom, phylum, subphylum, class, order and family of a rabbit. However, I get the following: Array ( [Kingdom:] => [Phylum:] => [Subphylum:] => [Class:] => [Order:] => [Family:] => [
Pentalagus
Bunolagus
Nesolagus
Romerolagus
] => )
It doesn't fill in the array with the data for the rabbit. It also gives me a parse error on the line marked below. What could be wrong?
<?php
//require"mydb.php";
header('Content-type: text/html; charset=utf-8'); // this just makes sure encoding is right
include('simple_html_dom.php'); // the parser library
$html = file_get_html('http://en.wikipedia.org/wiki/Rabbit');
$table = $html->find('table.infobox');
$data = array();
foreach($table[0]->find('tr') as $row)
{
$td = $row->find('> td');
if (count($td) == 2)
{
$name = $td[0]->innertext;
$text = $td[1]->find('a')[0]->innertext; // PARSE ERROR IS GIVEN HERE, after the find('a')[0]; removing the [0] takes away the error but gives me no results
$data[$name] = $text;
}
}
print_r($data);
?>
$text = $td[1]->find('a')[0]->innertext;
In this line you are dereferencing the array returned by a function call. This is only available in PHP 5.4 or later. Try this instead:
$links = $td[1]->find('a');
$text = $links[0]->innertext;
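Fixing the syntax does not by itself guarantee results: if a cell contains no a element, find('a') has nothing to return, and nested tables can add unwanted rows. One way around both, sketched here with PHP's built-in DOMDocument instead of Simple HTML DOM, is to read the text of the second cell directly and to accept only rows that belong to the infobox table itself. The sample markup is a shortened assumption, and the /wiki/Chordate link in it is hypothetical.

```php
<?php
// Assumed sample: a shortened infobox with one nested table, as in the question.
$html = '<table class="infobox biota"><tbody>'
      . '<tr><th colspan="2">Rabbit</th></tr>'
      . '<tr><td>Kingdom:</td><td><span class="kingdom">Animalia</span></td></tr>'
      . '<tr><td>Phylum:</td><td><span class="phylum">'
      . '<a href="/wiki/Chordate">Chordata</a></span></td></tr>'
      . '<tr><td colspan="2"><table><tr>'
      . '<td>Pentalagus</td><td>Sylvilagus</td>'
      . '</tr></table></td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

// Rows whose nearest enclosing table carries the "infobox" class,
// which excludes rows of tables nested inside the infobox.
$rowQuery = '//tr[ancestor::table[1]'
          . '[contains(concat(" ", normalize-space(@class), " "), " infobox ")]]';

$data = [];
foreach ($xpath->query($rowQuery) as $row) {
    $cells = $xpath->query('td', $row);
    if ($cells->length !== 2) {
        continue; // skip header, image and single-cell rows
    }
    $name = rtrim(trim($cells->item(0)->textContent), ':');
    // textContent works whether or not the value is wrapped in a link.
    $data[$name] = trim($cells->item(1)->textContent);
}
print_r($data);
```

Reading textContent rather than the first link means a value such as Animalia is captured even when it is plain text inside the span.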
