How to get href attributefrom html page

How to get href attributefrom html page - php

I have this html code:
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
I'd like to get inside an array all the href attributes.
I'm trying to use this php code:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = trim($tag->innertext);
}
Indeed if I print olympiad array I get something like:
Array
(
[0] => 1
[1] => <img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan
[2] => 1936
[3] => 2016
[4] => 103
[5] => 7
[6] =>
[7] =>
[8] => 2
[9] => 2
[10] =>
Why this behaviour? I'd like to get also the text inside href attribute (in this case Afghanistan), possibly in another array.
I'm not an php code expert so I ask help to you.

You can load the html file like this,this is an exemple you can adapt it:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
$doc = new DOMDocument();
$doc->loadHTML( $html);
// example 1:
$elements = $doc->getElementsByTagName('*');
// example 2:
$elements = $doc->getElementsByTagName('html');
// example 3:
$elements = $doc->getElementsByTagName('body');
// example 4:
$elements = $doc->getElementsByTagName('table');
// example 5:
$elements = $doc->getElementsByTagName('div');
I hope it help.

If you want to find all href attributes, I think you can add an a to $tagname_tr = 'td align="left"';
Then you can loop the result, and get the href and the innertext.
As an example, the values are stored in 2 arrays and the html is loaded as a string:
include_once ('/share/Multimedia/simple_html_dom.php');
$source = <<<SOURCE
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
SOURCE;
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left" a';
$olympiad = array();
$elementText = array();
//$html = file_get_html($url,true);
$html = str_get_html($source);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = $tag->href;
$elementText[] = $tag->innertext;
}
echo "<pre>";
print_r($olympiad);
print_r($elementText);
Will result in:
Array
(
[0] => /olympics/countries/AFG/
)
Array
(
[0] => Afghanistan
)

Related

How to catch text from html page

I'd like to catch the word "Bronze" from this html page portion:
<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Men's Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>
I tried different code but I failed in my intent. One of many attempts is the following:
foreach($html4->find('td align="left" strong') as $tag4) {
echo $prova = $tag4->innertext . "\n";
}
where html4 is the entire html page I have to process.

With following Code you can get the classname "Bronze"
<?php
$html='<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Mens Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>';
$dom = new DOMDocument();
#$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $link) {
echo trim($link->getAttribute('class'),' ');
}
?>
Or, if you prefer the Node Value and not the class name and the csk attribut is always 1:
foreach($dom->getElementsByTagName('td') as $link) {
if ($link->getAttribute('csk')=="1"){
echo $link->nodeValue;
}
}

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

I been trying to extract site table text along with its link from the given table to (which is in site1.com) to my php page using a web crawler.
But unfortunately, due to incorrect input of Array index in the php code, it came error as output.
site1.com
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
The php. web crawler as ::
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
$first_step = explode( '<table class="Table2">' , $returned_content );
$second_step = explode('</table>', $first_step[0]);
$third_step = explode('<tr>', $second_step[1]);
// print_r($third_step);
foreach ($third_step as $key=>$element) {
$child_first = explode( '<td class="FootNotes2"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[0] );
$final = "<a href=".$child_fourth[0]."</a></br>";
?>
<li target="_blank" class="itemtitle">
<?php echo $final?>
</li>
<?php
if($key==10){
break;
}
}
?>
Now the Array Index on the above php code can be the culprit. (i guess)
If so, can some one please explain me how to make this work.
But what my final requirement from this code is::
to get the above text in second with a link associated to it.
Any help is Appreciated..

Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html
$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
return $node->text();
});
Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML
http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
$text = $link->nodeValue;
}
EDIT:
To get the links you want, the code assumes you have a $returned_content variable with the HTML you want to parse.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** #var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
$parentNode = $node->parentNode;
// checking if the <a> is under a <td> that has class="FootNotes2"
$isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
// checking if the <a> has class="Links2"
$isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
// as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
$links[] = [
'href' => $node->getAttribute('href'),
'text' => $parentNode->textContent,
];
}
}
print_r($links);
This will create you an array similar to:
Array
(
[0] => Array
(
[href] => /files/forum/2017/1/837242.php
[text] => Q#Q Drill Time ① - cardio69
)
[1] => Array
(
[href] => /files/forum/2017/1/837356.php
[text] => study partner in Houston - lacy
)
[2] => Array
(
[href] => /files/forum/2017/1/837110.php
[text] => Serious dedicated study partner for U World - step12013
)
...

Using the Simple HTML DOM Parser library, you can use the following code:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
$element->href = "http://www.usmleforum.com" . $element->href; // you can also access only certain attributes of the elements (e.g. the url).
echo $element.'</br>'; // do something with the elements.
}
?>

I tried the same code for another site. and it works.
Please take a look at it:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
foreach ($third_step as $element) {
$child_first = explode( '<td class="alt1"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[1] );
echo $final = "<a href=".$child_fourth[0]."</a></br>";
}
?>
I know its too much to ask, but can you please make a code out of these two which make the crawler work.
#jkmak

Chopping at html with string functions or regex is not a reliable method. DomDocument and Xpath do a nice job.
Code: (Demo)
$dom=new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[#class = 'FootNotes2']/a") as $node) { // target a tags that have <td class="FootNotes2"> as parent
$result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue]; // extract/store the href and text values
if (sizeof($result) == 10) { break; } // set a limit of 10 rows of data
}
if (isset($result)) {
echo "<ul>\n";
foreach ($result as $data) {
echo "\t<li class=\"itemtitle\">{$data['text']}</li>\n";
}
echo "</ul>";
}
Sample Input:
$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">some text - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;
Output:
<ul>
<li class="itemtitle">Serious dedicated study partner for U World</li>
<li class="itemtitle">some text</li>
</ul>

Looping with XPath in php to get tables values

I have this html table:
<tbody>
<tr>..</tr>
<tr>
<td class="tbl_black_n_1">1</td>
<td class="tbl_black_n_1" nowrap="" align="center">23/07/14 08:10</td>
<td class="tbl_black_n_1">
<img src="http://www.betonews.com/img/SportId389.gif" width="10" height="10" border="0" alt="">
</td>
<td class="tbl_black_n_1"></td>
<td class="tbl_black_n_1" nowrap="" align="center">BAK WS</td>
<td class="tbl_black_n_1" nowrap="" align="right">M. Eguchi</td>
<td class="tbl_black_n_1" align="center">-</td>
<td class="tbl_black_n_1" nowrap="">Radwanska U. </td>
<td class="tbl_black_n_1" align="center" title=" ">1,02</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" "> </td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="center" title=" ">55,00</td>
<td class="tbl_black_n_1" align="center">
<td class="tbl_black_n_1" align="right">86%</td>
<td class="tbl_black_n_1" align="right">-</td>
<td class="tbl_black_n_1" align="right">14%</td>
<td class="tbl_black_n_1" align="center" title=" ">524.647</td>
<td class="tbl_black_n_1" nowrap="">
<img src="http://www.betonews.com//img/i_betfair.gif" width="12" height="10" border="0" alt="">
<img src="http://www.betonews.com//img/i_history.gif" width="12" height="10" border="0" alt="">
</td>
</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
...
</tbody>
There are more than one hundred <tr> structured at the same way, which contain lots of <td>. How can I loop with xpath to store all data in a database? I don't want to get the first <tr>: the query has to begin with the second <tr> (that I have showed).
This is my php code, but I can not go on.. help!
<?php
$url = 'http://www.betonews.com/table.asp?tp=2001&lang=en&dd=23&dm=7&dy=2014&df=1&dw=3';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
$document = new DOMDocument();
$document->loadHTML($response);
$xpath = new DOMXPath($document);
$expression = '/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
???
}
This is what I want to be the final result:
[0] => Array
(
[date] => 23/07/14 08:10
[image] => http://www.betonews.com/img/SportId389.gif
[team1] => M. Eguchi
[team2] => Radwanska U.
[1] => 1,02
[x] => 0
[2] => 55,00
[1%] => 86%
[x%] => 0
[2%] => 14%
[total] => 524.647
)

I would use a different XPath to select the table. First, there is always a problem using absolute paths with tables like this, because often tbody elements are just added by the browser, but they are not actually present in the document, i.e. not visible to the PHP code. Also, because if anything in the source HTML changes in terms of styleing, your code breaks. Now I select the first table with a cellpadding of 3 - This is not optimal, but there wasn't any obvious unique identifier.
Apart from that, you can simply iterate over the DOMNodeList result and then get the correct child nodes. Notice, that the items are increased by two, because whitespace-only elements in between are also a node in XML.
$xpath = new DOMXPath($document);
$expression = '(//table[#cellpadding="3"])[1]/tr[position() > 1]';
$rows = $xpath->query($expression);
$results = array();
foreach ($rows as $row) {
$result = array();
$td = $row->childNodes;
$result["date"] = $td->item(2)->nodeValue;
$result["image"] = $td->item(4)->firstChild->attributes->getNamedItem("src")->nodeValue;
$result["team1"] = $td->item(10)->nodeValue;
$result["team2"] = $td->item(12)->nodeValue;
$result["1"] = $td->item(14)->nodeValue;
$result["x"] = $td->item(16)->nodeValue;
$result["2"] = $td->item(18)->nodeValue;
$result["1%"] = $td->item(20)->nodeValue;
$result["x%"] = $td->item(22)->nodeValue;
$result["2%"] = $td->item(24)->nodeValue;
$result["total"] = $td->item(26)->nodeValue;
$results[] = $result;
}
For the image, you have to do same more proccesing, because you do not want the actual text, but the src attribute of the <img> element instead.

Parsing info from a table without headers, using PHP, DOM and cUrl

I need to parse data from a table that i scrape from a different website using PHP.
The table looks like this:
<table id="IWGRD" border="1" cellpadding="0" cellspacing="0" width="409" bordercolor="#FFFFFF" bordercolorlight="#FFFFFF" bordercolordark="#FFFFFF" class="IWGRDCSS" style="width:409;height:10;z-index:100;font-style:normal;font-size:10pt;text-decoration:none;">
<tbody>
<tr>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Dag </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Datum </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lesuur </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lokaal </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Docent(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Vak </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Groep(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Toelichting </b></font>
</td>
</tr>
<tr>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> Di </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 12-11-2013 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 5 - 6 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> B2.33 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> LKH02 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SWSP14SLB1V13_SWSP15PRA1V13 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> MAV1SP10 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SLB major 1 / praktijkleren </font>
</td>
</tr>
This table is generated by javascript.
In this table the first tr holds all the td which holds the headers. While all the rest of the table rows hold the info that i need to parse.
Now I've been struggling with this for a while and i found an answer on this website which helped me out a little bit, but it reads the table by using the td and th id's while mine table doesn't have an id on it's table rows or td's.
I'm using cURL to get this table HTML from an other website and pass it through and load it into DOM like this:
<?php
include_once('/simple_dom/simple_html_dom.php');
//step1
$cSession = curl_init();
//step2
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($cSession, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($cSession, CURLOPT_COOKIEFILE, $tmpfname);
curl_setopt($cSession,CURLOPT_URL,"http://anonymusurlbecauseofprivacyreasons?somegetters");
curl_setopt($cSession,CURLOPT_RETURNTRANSFER,true);
curl_setopt($cSession, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($cSession,CURLOPT_HEADER, false);
curl_setopt ($cSession, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($cSession, CURLOPT_CAINFO, dirname(__FILE__)."/cacert.pem");
curl_setopt($cSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$result=curl_exec($cSession);
if ($result === FALSE) {
echo "cURL Error: " . curl_error($ch);
}
curl_close($cSession);
// create empty document
$dom = new DomDocument;
#$dom->loadHtml($result);
$xpath = new DomXPath($dom);
Okay so far, so good.
But now comes the part of code which i can't figure out how to get it working.
To read out the date I copied and edited the code from this thread: (How to parse this table and extract data from it?) but I can't get it working.
// collect data
foreach ($xpath->query('//table[#id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
}
print_r($rowData);
Which gives me the following output:
Array ( [0] => [1] => [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek )
Which is the correct output for the last row, but i need all the rows.
So the kind of output I would need is all of the rows (I only don't need the top rows)
So like
array[1] = ([0] => Mon [1] => 11-11-2013 [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek)
Array[2] = ([0] => Mon [1] => 11-11-2013 [2] => 8 - 9 [3] => S0.20 [4] => name [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => randomresult)
So i can use the info and put it in variables to pass it on to an app.
Anyone knows how to do this? I've been working on this for hours because i have none experience using cUrl or DOM whatsoever.
Any help is much appreciated! :)

It seems like you're not collecting every row as you go along...
$tableData = array();
foreach ($xpath->query('//table[#id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
$tableData[] = $rowData;
}
print_r($tableData);

Warning: DOMXPath::query(): Invalid expression

I'm an Xpath newbie. I want to loop through the result of a cURL query and print each element of the only table on the page.
I've used the Xpath plugin for Firefox to obtain my expression and my table is structured as follows:
<table>
<tr class="listItemOneBg">
<td valign="top">
SMITH
</td>
<td valign="top">
WILLIAM C C
</td>
<td valign="top">
Male
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
BLACKWOOD
</td>
<td valign="top">
61
</td>
<td valign="top">
1924
</td>
<td valign="top">
<a target="_blank" href='XXX'>
order</a>
</td>
</tr>
<tr class="listItemTwoBg">
<td valign="top">
SMITH
</td>
<td valign="top">
WILLIAM C PAGE-
</td>
<td valign="top">
Male
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
SWAN
</td>
<td valign="top">
9
</td>
<td valign="top">
1914
</td>
<td valign="top">
<a target="_blank" href='XXY'>
order</a>
</td>
</tr>
Here's the code I've tried so far. I get a message"Warning: Invalid argument supplied for foreach()". What am I doing wrong?
$page = curl_exec($ch);
curl_close($ch);
// Create new PHP DOM document
$dom = new DOMDocument;
// Load html from curl request into document model
#$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
$tableRows = $xpath->query("id('divResults')/table/tbody/tr");
foreach ($tableRows as $row) {
// fetch all 'tds' inside this 'tr'
$td = $xpath->query('td', $row);
echo $td->item(1)->textContent;
}

Assuming the table you're after is actually in a <div id="divResults">...
$tableRows = $xpath->query('//div[#id="divResults"]/table/tbody/tr');
foreach ($tableRows as $row) {
$cells = $row->getElementsByTagName('td');
}

That's a non-standard XPath expression. It cannot work in DOMXPath.(Downvoters, the expression has been edited since the question was posted. Cheers!)
This is where you learn XPath:
Microsoft XPath Syntax
Microsoft XPath by Example
PS: It's where I learnt it.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How to get href attributefrom html page - php

Related

How to catch text from html page

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

Looping with XPath in php to get tables values

Parsing info from a table without headers, using PHP, DOM and cUrl

Warning: DOMXPath::query(): Invalid expression

Categories

Resources