Looping with XPath in php to get tables values

Looping with XPath in php to get tables values - php

I have this html table:
<tbody>
<tr>..</tr>
<tr>
<td class="tbl_black_n_1">1</td>
<td class="tbl_black_n_1" nowrap="" align="center">23/07/14 08:10</td>
<td class="tbl_black_n_1">
<img src="http://www.betonews.com/img/SportId389.gif" width="10" height="10" border="0" alt="">
</td>
<td class="tbl_black_n_1"></td>
<td class="tbl_black_n_1" nowrap="" align="center">BAK WS</td>
<td class="tbl_black_n_1" nowrap="" align="right">M. Eguchi</td>
<td class="tbl_black_n_1" align="center">-</td>
<td class="tbl_black_n_1" nowrap="">Radwanska U. </td>
<td class="tbl_black_n_1" align="center" title=" ">1,02</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" "> </td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" ">55,00</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="right">86%</td>
<td class="tbl_black_n_1" align="right">-</td>
<td class="tbl_black_n_1" align="right">14%</td>
<td class="tbl_black_n_1" align="center" title=" ">524.647</td>
<td class="tbl_black_n_1" nowrap="">
<img src="http://www.betonews.com//img/i_betfair.gif" width="12" height="10" border="0" alt="">
<img src="http://www.betonews.com//img/i_history.gif" width="12" height="10" border="0" alt="">
</td>
</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
...
</tbody>
There are more than one hundred <tr> structured at the same way, which contain lots of <td>. How can I loop with xpath to store all data in a database? I don't want to get the first <tr>: the query has to begin with the second <tr> (that I have showed).
This is my php code, but I can not go on.. help!
<?php
$url = 'http://www.betonews.com/table.asp?tp=2001&lang=en&dd=23&dm=7&dy=2014&df=1&dw=3';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
$document = new DOMDocument();
$document->loadHTML($response);
$xpath = new DOMXPath($document);
$expression = '/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
???
}
This is what I want to be the final result:
[0] => Array
(
[date] => 23/07/14 08:10
[image] => http://www.betonews.com/img/SportId389.gif
[team1] => M. Eguchi
[team2] => Radwanska U.
[1] => 1,02
[x] => 0
[2] => 55,00
[1%] => 86%
[x%] => 0
[2%] => 14%
[total] => 524.647
)

I would use a different XPath to select the table. First, there is always a problem using absolute paths with tables like this, because often tbody elements are just added by the browser, but they are not actually present in the document, i.e. not visible to the PHP code. Also, because if anything in the source HTML changes in terms of styleing, your code breaks. Now I select the first table with a cellpadding of 3 - This is not optimal, but there wasn't any obvious unique identifier.
Apart from that, you can simply iterate over the DOMNodeList result and then get the correct child nodes. Notice, that the items are increased by two, because whitespace-only elements in between are also a node in XML.
$xpath = new DOMXPath($document);
$expression = '(//table[#cellpadding="3"])[1]/tr[position() > 1]';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
$td = $row->childNodes;
$result["date"] = $td->item(2)->nodeValue;
$result["image"] = $td->item(4)->firstChild->attributes->getNamedItem("src")->nodeValue;
$result["team1"] = $td->item(10)->nodeValue;
$result["team2"] = $td->item(12)->nodeValue;
$result["1"] = $td->item(14)->nodeValue;
$result["x"] = $td->item(16)->nodeValue;
$result["2"] = $td->item(18)->nodeValue;
$result["1%"] = $td->item(20)->nodeValue;
$result["x%"] = $td->item(22)->nodeValue;
$result["2%"] = $td->item(24)->nodeValue;
$result["total"] = $td->item(26)->nodeValue;
$results[] = $result;
}
For the image, you have to do same more proccesing, because you do not want the actual text, but the src attribute of the <img> element instead.

Related

How to get href attributefrom html page

I have this html code:
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
I'd like to get inside an array all the href attributes.
I'm trying to use this php code:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = trim($tag->innertext);
}
Indeed if I print olympiad array I get something like:
Array
(
[0] => 1
[1] => <img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan
[2] => 1936
[3] => 2016
[4] => 103
[5] => 7
[6] =>
[7] =>
[8] => 2
[9] => 2
[10] =>
Why this behaviour? I'd like to get also the text inside href attribute (in this case Afghanistan), possibly in another array.
I'm not an php code expert so I ask help to you.

You can load the html file like this,this is an exemple you can adapt it:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
$doc = new DOMDocument();
$doc->loadHTML( $html);
// example 1:
$elements = $doc->getElementsByTagName('*');
// example 2:
$elements = $doc->getElementsByTagName('html');
// example 3:
$elements = $doc->getElementsByTagName('body');
// example 4:
$elements = $doc->getElementsByTagName('table');
// example 5:
$elements = $doc->getElementsByTagName('div');
I hope it help.

If you want to find all href attributes, I think you can add an a to $tagname_tr = 'td align="left"';
Then you can loop the result, and get the href and the innertext.
As an example, the values are stored in 2 arrays and the html is loaded as a string:
include_once ('/share/Multimedia/simple_html_dom.php');
$source = <<<SOURCE
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
SOURCE;
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left" a';
$olympiad = array();
$elementText = array();
//$html = file_get_html($url,true);
$html = str_get_html($source);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = $tag->href;
$elementText[] = $tag->innertext;
}
echo "<pre>";
print_r($olympiad);
print_r($elementText);
Will result in:
Array
(
[0] => /olympics/countries/AFG/
)
Array
(
[0] => Afghanistan
)

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

I been trying to extract site table text along with its link from the given table to (which is in site1.com) to my php page using a web crawler.
But unfortunately, due to incorrect input of Array index in the php code, it came error as output.
site1.com
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
The php. web crawler as ::
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
$first_step = explode( '<table class="Table2">' , $returned_content );
$second_step = explode('</table>', $first_step[0]);
$third_step = explode('<tr>', $second_step[1]);
// print_r($third_step);
foreach ($third_step as $key=>$element) {
$child_first = explode( '<td class="FootNotes2"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[0] );
$final = "<a href=".$child_fourth[0]."</a></br>";
?>
<li target="_blank" class="itemtitle">
<?php echo $final?>
</li>
<?php
if($key==10){
break;
}
}
?>
Now the Array Index on the above php code can be the culprit. (i guess)
If so, can some one please explain me how to make this work.
But what my final requirement from this code is::
to get the above text in second with a link associated to it.
Any help is Appreciated..

Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html
$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
return $node->text();
});
Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML
http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
$text = $link->nodeValue;
}
EDIT:
To get the links you want, the code assumes you have a $returned_content variable with the HTML you want to parse.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** #var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
$parentNode = $node->parentNode;
// checking if the <a> is under a <td> that has class="FootNotes2"
$isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
// checking if the <a> has class="Links2"
$isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
// as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
$links[] = [
'href' => $node->getAttribute('href'),
'text' => $parentNode->textContent,
];
}
}
print_r($links);
This will create you an array similar to:
Array
(
[0] => Array
(
[href] => /files/forum/2017/1/837242.php
[text] => Q#Q Drill Time ① - cardio69
)
[1] => Array
(
[href] => /files/forum/2017/1/837356.php
[text] => study partner in Houston - lacy
)
[2] => Array
(
[href] => /files/forum/2017/1/837110.php
[text] => Serious dedicated study partner for U World - step12013
)
...

Using the Simple HTML DOM Parser library, you can use the following code:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
$element->href = "http://www.usmleforum.com" . $element->href; // you can also access only certain attributes of the elements (e.g. the url).
echo $element.'</br>'; // do something with the elements.
}
?>

I tried the same code for another site. and it works.
Please take a look at it:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
foreach ($third_step as $element) {
$child_first = explode( '<td class="alt1"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[1] );
echo $final = "<a href=".$child_fourth[0]."</a></br>";
}
?>
I know its too much to ask, but can you please make a code out of these two which make the crawler work.
#jkmak

Chopping at html with string functions or regex is not a reliable method. DomDocument and Xpath do a nice job.
Code: (Demo)
$dom=new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[#class = 'FootNotes2']/a") as $node) { // target a tags that have <td class="FootNotes2"> as parent
$result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue]; // extract/store the href and text values
if (sizeof($result) == 10) { break; } // set a limit of 10 rows of data
}
if (isset($result)) {
echo "<ul>\n";
foreach ($result as $data) {
echo "\t<li class=\"itemtitle\">{$data['text']}</li>\n";
}
echo "</ul>";
}
Sample Input:
$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">some text - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;
Output:
<ul>
<li class="itemtitle">Serious dedicated study partner for U World</li>
<li class="itemtitle">some text</li>
</ul>

PHP parsing won't find "span" tags

I'm trying to find the span tags on a website similar to this: http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225. The tags I need are these:
However, when I use code such as the following:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//span";
$result_rows = $xpath->query($my_xpath_query);
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
The only output I get is [].
If I replace $statsListings[] = $result_object->nodeValue; with $statsListings[] = $result_object->childNodes->item(0)->nodeValue;, I still get the same [] as output. When there are clearly span tags with values, why am I getting nothing?

XPath is not guilty at all.
Span tags are added dinamically. Just have a look at the source code of the page, not the DOM-Structure, which may be already modified by javascript, but use "view-source:" and you will see exactly the same html, as it is parsed by XPath.
It would be a good idea to have a look at the table with class tablelines? probably, you have there everything you may need.
You should skip "maincolor" and "tableheader", and start processing with "light" class.
<table width="98%" class="tablelines" cellpadding="2" border="0" cellspacing="1">
<tr class="maincolor">
<td colspan="8" align="right">All Times Local</td>
</tr>
<tr class="tableheader">
<td width="4%">
<b>GN</b>
</td>
<td nowrap width="21%">
<b>AWAY</b>
</td>
<td nowrap width="21%">
<b>HOME</b>
</td>
<td width="14%"><b>DATE</b></td>
<td width="11%"><b>TIME</b></td>
<td width="8%"><b>SCORE</b></td>
<td nowrap align="right" width="*"><b>BOXSCORE</b></td>
<td nowrap align="center" width="4%"><b>GS</b></td>
</tr>
<tr class="light">
<td></td>
<td>Sioux City
<b>1</b></td>
<td>Sioux Falls
<b>5</b></td>
<td>Tue, Apr 14</td>
<td> 7:05 PM</td>
<td> <b>1 - 5</b> </td>
<td align="right">
<img src="/images/gamelive_icon.gif" title="Click here for Game Live!" alt="Click here for Game Live" border="0">
Final</td>
<td align="center">
<img src="/images/playersection/prostats/gslink.gif" border="0">
</td>
</tr>
For example, try this:
$my_url = 'http://www.pointstreak.com/prostats/leagueschedule.html?leagueid=49&seasonid=14225';
$html = file_get_contents($my_url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
//Put your XPath Query here
$my_xpath_query = "//tr[#class='light']/td";
$result_rows = $xpath->query($my_xpath_query);
echo $result_rows->length;
// Create an array to hold the content of the nodes
$statsListings = array();
//here we loop through our results (a DOMDocument Object)
foreach ($result_rows as $result_object) {
$statsListings[] = $result_object->nodeValue;
}
echo json_encode($statsListings);
Probably I have found what you need, and even in nice JSON form:
http://www.pointstreak.com/ajax/trending_ajax.html?action=divisionscoreboard&divisionid=12299&seasonid=14225
{"trending_list":null,"lacrosse_list":null,"hockey_list":null,"soccer_list":null,"baseball_list":null,"softball_list":null,"basketball_list":null,"news_list":null,"news_hockey_list":null,"news_baseball_list":null,"news_baseball_list2":null,"news_softball_list":null,"news_basketball_list":null,"games_list":[{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Muskegon","awayscore":"2","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"15\/05","link":"..\/prostats\/boxscore.html?gameid=2672134"},{"status":"FINAL","hometeam":"Muskegon","homescore":"1","awayteam":"Sioux Falls","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"10\/05","link":"..\/prostats\/boxscore.html?gameid=2672133"},{"status":"FINAL","hometeam":"Muskegon","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"1st","schedtime":"7:15 pm","gamedate":"09\/05","link":"..\/prostats\/boxscore.html?gameid=2672132"},{"status":"FINAL","hometeam":"Dubuque","homescore":"3","awayteam":"Muskegon","awayscore":"4","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"05\/05","link":"..\/prostats\/boxscore.html?gameid=2662061"},{"status":"FINAL","hometeam":"Muskegon","homescore":"0","awayteam":"Dubuque","awayscore":"6","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662060"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"7","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"02\/05","link":"..\/prostats\/boxscore.html?gameid=2662055"},{"status":"FINAL","hometeam":"Muskegon","homescore":"3","awayteam":"Dubuque","awayscore":"1","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:15 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662059"},{"status":"FINAL","hometeam":"Sioux Falls","homescore":"4","awayteam":"Tri-City","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:04 pm","gamedate":"01\/05","link":"..\/prostats\/boxscore.html?gameid=2662054"},{"status":"FINAL","hometeam":"Tri-City","homescore":"2","awayteam":"Sioux Falls","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"29\/04","link":"..\/prostats\/boxscore.html?gameid=2664638"},{"status":"FINAL","hometeam":"Dubuque","homescore":"7","awayteam":"Muskegon","awayscore":"3","timeremaining":"0:00","currentperiod":"3rd","schedtime":"7:05 pm","gamedate":"25\/04","link":"..\/prostats\/boxscore.html?gameid=2662058"}],"division_list":null,"site_network_title":null,"leagueshortname":"USHL","includesportlink":null,"showleaguename":0}

Parsing info from a table without headers, using PHP, DOM and cUrl

I need to parse data from a table that i scrape from a different website using PHP.
The table looks like this:
<table id="IWGRD" border="1" cellpadding="0" cellspacing="0" width="409" bordercolor="#FFFFFF" bordercolorlight="#FFFFFF" bordercolordark="#FFFFFF" class="IWGRDCSS" style="width:409;height:10;z-index:100;font-style:normal;font-size:10pt;text-decoration:none;">
<tbody>
<tr>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Dag </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Datum </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lesuur </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lokaal </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Docent(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Vak </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Groep(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Toelichting </b></font>
</td>
</tr>
<tr>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> Di </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 12-11-2013 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 5 - 6 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> B2.33 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> LKH02 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SWSP14SLB1V13_SWSP15PRA1V13 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> MAV1SP10 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SLB major 1 / praktijkleren </font>
</td>
</tr>
This table is generated by javascript.
In this table the first tr holds all the td which holds the headers. While all the rest of the table rows hold the info that i need to parse.
Now I've been struggling with this for a while and i found an answer on this website which helped me out a little bit, but it reads the table by using the td and th id's while mine table doesn't have an id on it's table rows or td's.
I'm using cURL to get this table HTML from an other website and pass it through and load it into DOM like this:
<?php
include_once('/simple_dom/simple_html_dom.php');
//step1
$cSession = curl_init();
//step2
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($cSession, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($cSession, CURLOPT_COOKIEFILE, $tmpfname);
curl_setopt($cSession,CURLOPT_URL,"http://anonymusurlbecauseofprivacyreasons?somegetters");
curl_setopt($cSession,CURLOPT_RETURNTRANSFER,true);
curl_setopt($cSession, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($cSession,CURLOPT_HEADER, false);
curl_setopt ($cSession, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($cSession, CURLOPT_CAINFO, dirname(__FILE__)."/cacert.pem");
curl_setopt($cSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$result=curl_exec($cSession);
if ($result === FALSE) {
echo "cURL Error: " . curl_error($ch);
}
curl_close($cSession);
// create empty document
$dom = new DomDocument;
#$dom->loadHtml($result);
$xpath = new DomXPath($dom);
Okay so far, so good.
But now comes the part of code which i can't figure out how to get it working.
To read out the date I copied and edited the code from this thread: (How to parse this table and extract data from it?) but I can't get it working.
// collect data
foreach ($xpath->query('//table[#id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
}
print_r($rowData);
Which gives me the following output:
Array ( [0] => [1] => [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek )
Which is the correct output for the last row, but i need all the rows.
So the kind of output I would need is all of the rows (I only don't need the top rows)
So like
array[1] = ([0] => Mon [1] => 11-11-2013 [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek)
Array[2] = ([0] => Mon [1] => 11-11-2013 [2] => 8 - 9 [3] => S0.20 [4] => name [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => randomresult)
So i can use the info and put it in variables to pass it on to an app.
Anyone knows how to do this? I've been working on this for hours because i have none experience using cUrl or DOM whatsoever.
Any help is much appreciated! :)

It seems like you're not collecting every row as you go along...
$tableData = array();
foreach ($xpath->query('//table[#id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
$tableData[] = $rowData;
}
print_r($tableData);

PHP DOM / XPath

Hopefully should be a simple question for someone that has done it before!
I have a list of old web documents in table format with lots of contact details in it. What I have managed so far is to create a PHP script that parses the XHTML doc and pull out old client contact details.
An example of the document format:
<tr>
<td bgcolor="#CCCCCC" valign="top">Indigo Blue 123</td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" align="top"><font class="details">123 Blue House</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" align="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">Hanley</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">ST13 4SN</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">Stoke on Trent</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"><font class="details">01875 322511</font></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top"></td>
<td bgcolor="#CCCCCC"></td>
<td bgcolor="#CCCCCC" valign="top">www.indigoblue123.org.uk</td>
<td bgcolor="#CCCCCC"></td>
</tr>
What I need to do is parse all of these contact details into an array. The few things that I'm not sure on how to complete is grabbing the empty blocks to be empty array entries (i.e. Address 2 and Address 3 will be blank but I need to know this) as well as grabbing the web address from the <a>..</a> block.
So far I have figured all populated data has class=details in some form. However, as I mentioned before I'm not sure what the best way to accomplish the overall result is. There around 20-40 entries in the different files I have.
I have managed the basics with this so far:
<?php
print '<pre>';
$html = file_get_contents('old-contacts.xhtml');
// Create new DOM object:
$dom = new DomDocument();
// Load HTML code:
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$details = $xpath->query("//table/tbody/tr[td/font/#class = 'details']");
for ($i = 0; $i < $details->length; $i++) {
$data[$i]['data'] = $details->item($i)->nodeValue;
echo $data[$i]['data'];
}
print '</pre>';
?>
Any help would be great!
Thanks

I believed you are looking for something like this:
$nodes = $xpath->query('//table/tbody/tr/td[#align="top"] |
//table/tbody/tr/td[#valign="top"]');
$data = array();
foreach ($nodes as $node) {
$data[] = $node->textContent;
}
This would give you:
Array
(
[0] => Indigo Blue 123
[1] => 123 Blue House
[2] =>
[3] =>
[4] => Hanley
[5] =>
[6] => ST13 4SN
[7] => Stoke on Trent
[8] => 01875 322511
[9] =>
[10] => www.indigoblue123.org.uk
)

I was looking exactly for it, and worked perfect.
I created a function to extract and save it to HTML
function clean_web_source($web_source) {
$dom = new DOMDocument();
#$dom->loadHTML($web_source);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//table[#width="580"]');
$data = array();
foreach ($nodes as $node) {
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($node, true));
$data[] = trim($tmp_dom->saveHTML()); //Before use "saveHTML" I used textContent and print_r($data) to identify the array position that interested me.
}
return $data[2]; //The code in position 2 it's what I want.
}
$url = "http://www.theurl.com/?param=1&lang=1";
$web_source = file_get_contents($url);
$target_source = clean_web_source($web_source); //What I've look for.
Thanks.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Looping with XPath in php to get tables values - php

Related

How to get href attributefrom html page

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

PHP parsing won't find "span" tags

Parsing info from a table without headers, using PHP, DOM and cUrl

PHP DOM / XPath

Categories

Resources