Selective extraction of data from external site using DOM PHP web crawler

Selective extraction of data from external site using DOM PHP web crawler - php

I have this PHP dom web crawler which works fine. it extracts mentioned tag along with its link from a (external) forum site to my page.
But recently i ran into a problem. Like
this is the HTML of the forum data::
<tbody>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Hispanic Study Partner - dreamer1984</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">nbme - monariyadh</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
Now if we consider the above code (table data) as the only statements available in that site. and if i tried to extract it with a web crawler like,
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach($html->find('td.FootNotes2') as $element) {
echo $element;
}
?>
It extracts al the data that is inside with a class name as "FootNote2"
Now what if i want to extract specific data in tag, for example names like, " dreamer1984" and "monariyadh" from the first tag/line.
and what if i wanted to extract data from 3rd (skipping the rest) which has same class names.
Please note that i can use "regex" like
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matchs);
foreach ($matchs['name'] as $k => $v){
var_dump('name: '. $v, 'relative date: '. $matchs['date'][$k]);
}
But i prefer to find solution for this in DOM parser...
Any help is appreciated..

As I said in my comment some text processing is unavoidable, however you can get the text element associated with the td like so :
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find("tr") as $row) {
$element = $row->find('td.FootNotes2',0);
if ($element == null) { continue; }
$textNode = array_filter($element->nodes, function ($n) {
return $n->nodetype == 3; //Text node type, like in jQuery
});
if (!empty($textNode)) {
$text = current($textNode);
echo $text;
}
}
This echoes:
- dreamer1984
- monariyadh
Do with that what you will.
Updated to only find the first td for each tr.

If you want to extract only text (not tags and its contain)
foreach ($html->find("td.FootNotes2") as $element) {
$children = $element->children; // get an array of children
foreach ($children AS $child) {
$child->outertext = ''; // This removes the element, but MAY NOT remove it from the original $myDiv
}
echo $element->innertext."<br>";
}
o/p:
- dreamer1984
02/28/17 01:42
0
200
- monariyadh
02/27/17 23:12
0
108

You have to use regex either way so no sense overcomplicating it:
foreach($html->find('tr') as $tr) {
echo preg_replace('/.* - /', '', $tr->find('td',1)->text()) . "\n";
echo $tr->find('td',3)->text() . "\n";
}
I really don't like apokryfos' approach to this, it's a lot of confusion with no benefit.

Related

How to catch text from html page

I'd like to catch the word "Bronze" from this html page portion:
<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Men's Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>
I tried different code but I failed in my intent. One of many attempts is the following:
foreach($html4->find('td align="left" strong') as $tag4) {
echo $prova = $tag4->innertext . "\n";
}
where html4 is the entire html page I have to process.

With following Code you can get the classname "Bronze"
<?php
$html='<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Mens Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>';
$dom = new DOMDocument();
#$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $link) {
echo trim($link->getAttribute('class'),' ');
}
?>
Or, if you prefer the Node Value and not the class name and the csk attribut is always 1:
foreach($dom->getElementsByTagName('td') as $link) {
if ($link->getAttribute('csk')=="1"){
echo $link->nodeValue;
}
}

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

I been trying to extract site table text along with its link from the given table to (which is in site1.com) to my php page using a web crawler.
But unfortunately, due to incorrect input of Array index in the php code, it came error as output.
site1.com
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
The php. web crawler as ::
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
$first_step = explode( '<table class="Table2">' , $returned_content );
$second_step = explode('</table>', $first_step[0]);
$third_step = explode('<tr>', $second_step[1]);
// print_r($third_step);
foreach ($third_step as $key=>$element) {
$child_first = explode( '<td class="FootNotes2"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[0] );
$final = "<a href=".$child_fourth[0]."</a></br>";
?>
<li target="_blank" class="itemtitle">
<?php echo $final?>
</li>
<?php
if($key==10){
break;
}
}
?>
Now the Array Index on the above php code can be the culprit. (i guess)
If so, can some one please explain me how to make this work.
But what my final requirement from this code is::
to get the above text in second with a link associated to it.
Any help is Appreciated..

Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html
$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
return $node->text();
});
Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML
http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
$text = $link->nodeValue;
}
EDIT:
To get the links you want, the code assumes you have a $returned_content variable with the HTML you want to parse.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** #var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
$parentNode = $node->parentNode;
// checking if the <a> is under a <td> that has class="FootNotes2"
$isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
// checking if the <a> has class="Links2"
$isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
// as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
$links[] = [
'href' => $node->getAttribute('href'),
'text' => $parentNode->textContent,
];
}
}
print_r($links);
This will create you an array similar to:
Array
(
[0] => Array
(
[href] => /files/forum/2017/1/837242.php
[text] => Q#Q Drill Time ① - cardio69
)
[1] => Array
(
[href] => /files/forum/2017/1/837356.php
[text] => study partner in Houston - lacy
)
[2] => Array
(
[href] => /files/forum/2017/1/837110.php
[text] => Serious dedicated study partner for U World - step12013
)
...

Using the Simple HTML DOM Parser library, you can use the following code:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
$element->href = "http://www.usmleforum.com" . $element->href; // you can also access only certain attributes of the elements (e.g. the url).
echo $element.'</br>'; // do something with the elements.
}
?>

I tried the same code for another site. and it works.
Please take a look at it:
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
$first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);
foreach ($third_step as $element) {
$child_first = explode( '<td class="alt1"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[1] );
echo $final = "<a href=".$child_fourth[0]."</a></br>";
}
?>
I know its too much to ask, but can you please make a code out of these two which make the crawler work.
#jkmak

Chopping at html with string functions or regex is not a reliable method. DomDocument and Xpath do a nice job.
Code: (Demo)
$dom=new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[#class = 'FootNotes2']/a") as $node) { // target a tags that have <td class="FootNotes2"> as parent
$result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue]; // extract/store the href and text values
if (sizeof($result) == 10) { break; } // set a limit of 10 rows of data
}
if (isset($result)) {
echo "<ul>\n";
foreach ($result as $data) {
echo "\t<li class=\"itemtitle\">{$data['text']}</li>\n";
}
echo "</ul>";
}
Sample Input:
$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Serious dedicated study partner for U World - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">some text - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;
Output:
<ul>
<li class="itemtitle">Serious dedicated study partner for U World</li>
<li class="itemtitle">some text</li>
</ul>

not able to retrieve direct child elements using Simple HTML DOM

I have an html table like this
<table>
<tbody>
<tr>
<td><table>
<tbody>
<tr class="prdLi">
<td rowspan="2" class="prdNo"><span>310.</span></td>
<td colspan="2" class="prdDe" rowspan="2"><span>Pepsi</span></td>
</tr>
<tr class="prdLi">
<td class="prdAc"><span> 1.5L</span></td>
<td><span> </span></td>
</tr>
</tbody>
</table></td>
</tr>
</tbody>
</table>
the table is saved as $html
I want to select the child elements of the class .prdLi
I tried like this:
foreach($html->find('tr.prdLi') as $foo){
echo $foo;
}
the output that i get is like this
<span>310.</span>
<span>Pepsi</span
.
.
.
but what i actually want to get is the code with the parent element td.like this:
<td rowspan="2" class="prdNo"><span>310.</span></td>
<td colspan="2" class="prdDe" rowspan="2"><span>Pepsi</span></td>
.
.
.
please help me

Since Simple HTML DOM Parser supports CSS like selectors, you can use 'tr.prdLi td' to specify all td elements which are children of tr with class prdLi. The following should provide what you are looking for:
$htmlstr = <<<EOD
<table>
<tbody>
<tr>
<td><table>
<tbody>
<tr class="prdLi">
<td rowspan="2" class="prdNo"><span>310.</span></td>
<td colspan="2" class="prdDe" rowspan="2"><span>Pepsi</span></td>
</tr>
<tr class="prdLi">
<td class="prdAc"><span> 1.5L</span></td>
<td><span> </span></td>
</tr>
</tbody>
</table></td>
</tr>
</tbody>
</table>
EOD;
$html = str_get_html($htmlstr);
foreach ($html->find('tr.prdLi td') as $foo) {
echo $foo . "\n";
}
Note that find() is called on the main simple_html_dom-element. In your example, the result was already limited by a previous find().

What andy says is correct, but the css for direct child is > *, therefore:
foreach($html->find('tr.prdLi > *') as $foo){
echo $foo . "\n";
}

Get data inside html tags using Simple HTML DOM Parser:

I want to get all the information inside the html tags and display them in a table. I'm using Simple HTML DOM Parser. I tried the following code but I'm only getting the LAST COLUMN (Column:Total). How do I get the data from the other columns?
foreach($html->find('tr[class="tblRowShade"]') as $div) {
$key = '';
$val = '';
foreach($div->find('*') as $node) {
if ($node->tag=='td'){
$key = $node->plaintext;
}
}
$ret[$key] = $val;
}
Here's my code for the table
<tr class="tblRowShade">
<td width="12%"><strong>Project</strong></td>
<td width="38%"> </td>
<td width="25%"><strong>Recipient</strong></td>
<td width="14%"><strong>Municipality/City</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Implementing Unit</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Release Date</strong></td>
<td align="right" width="11%" class="td_right"><strong>Total</strong></td>
</tr>
<tr class="tblRowShade">
<td colspan="2" >Livelihood Programs</td>
<td >Basic Espresso and Latte</td>
<td nowrap="nowrap"></td>
<td >DOLE - TESDA Regional Office IV-A</td>
<td nowrap="nowrap">2013-06-11</td>
<td align="right" nowrap="nowrap" class="td_right">1,500,000</td>
</tr>

Why do you have $div->find('*')? you can try $div->find('td') instead. This should produce correct result. Otherwise you can also try to iterate over children: foreach($div->children as $node)
Assuming you are trying to use the first row as $key and the rest for the data, you might want to alter your HTML code by simply add th in the first row, which is your header: <tr><th>…</th></tr>. This way you can get the keys by $div->find('th'). I suppose using the first row is okay as well.

As alamin.ahmed said, it would be better to search for td instead...
Here's a working example :
$text = ' <tr class="tblRowShade">
<td width="12%"><strong>Project</strong></td>
<td width="38%"> </td>
<td width="25%"><strong>Recipient</strong></td>
<td width="14%"><strong>Municipality/City</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Implementing Unit</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Release Date</strong></td>
<td align="right" width="11%" class="td_right"><strong>Total</strong></td>
</tr>
<tr class="tblRowShade">
<td colspan="2" >Livelihood Programs</td>
<td >Basic Espresso and Latte</td>
<td nowrap="nowrap"></td>
<td >DOLE - TESDA Regional Office IV-A</td>
<td nowrap="nowrap">2013-06-11</td>
<td align="right" nowrap="nowrap" class="td_right">1,500,000</td>
</tr>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($text);
// Find all elements
$rows = $html->find('tr[class="tblRowShade"]');
// Find succeeded
if ($rows) {
echo count($rows) . " \$rows found !<br />";
foreach ($rows as $key => $row) {
echo "<hr />";
$columns = $row->find('td');
// Find succeeded
if ($rows) {
echo count($columns) . " \$columns found in \$rows[$key]!<br />";
foreach ($columns as $col) {
echo $col->plaintext . " | ";
}
}
else
echo " /!\ Find() \$columns failed /!\ ";
}
}
else
echo " /!\ Find() \$rows failed /!\ ";
here's the output of the above code:
You must be aware that the two rows doesnt contain the same number of columns... then you must handle that in your program.

How To Format This Scraped Content

I'm grabbing the content from all the td's in this table with the class="job" using this.
$table01 = $salary->find('table.table01');
$rows = $table01[0]->find('td.job');
Then I'm using this to output it which works, but obviously only outputs it as plaintext, I need to do some more with it...
foreach($table01[0]->find('td.job') as $element) {
$jobs .= $element->plaintext . '<br />';
}
Ultimately I would like it outputted to this format. Notice the a href is using the job name and replacing spaces and / with a -.
<tr>
<td class="small"> Graphic Artist / Designer
$23,755 – $55,335 </td>
</tr>
<tr>
<td class="small"> Sales Associate<br />
$15,577 – $56,290 </td>
</tr>
<tr>
<td class="small"> Film / Video Editor<br />
$24,184 – $94,493 </td>
</tr>
Heres the table im scraping
<table cellpadding="0" cellspacing="0" border="0" class="table01">
<tr>
<td class="head">Test</td>
<td class="job">
Graphic Artist / Designer<br/>
$23,755 – $55,335
</td>
</tr>
<tr>
<td class="head">Test</td>
<td class="job">
Sales Associate<br/>
$15,577 – $56,290
</td>
</tr>
<tr>
<td class="head">Test</td>
<td class="job">
Film / Video Editor<br/>
$24,184 – $94,493
</td>
</tr>
</table>

may be better to use regexps
<?php
$html=file_get_contents('1.html');
$jobs='';
if(preg_match_all("/<tr>.*?<td.*?>.*?<\/td>.*?<td\sclass=\"job\">.*?<a.+?href=\"(.+?)\".+?>(.*?)<\/a>(.*?)<\/td>.*?<\/tr>/ims", $html, $res))
{
foreach($res[1] as $i=>$uri)
{
$uri=strtolower(urldecode($uri));
$uri=preg_replace("/_\/_/",'-',$uri);
$uri=preg_replace("/_/",'-',$uri);
$jobs.='<tr><td class="small"> '.$res[2][$i].''.$res[3][$i].'</td></tr>'."\n";
}
}
echo $jobs;

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Selective extraction of data from external site using DOM PHP web crawler - php

You have to use regex either way so no sense overcomplicating it: foreach($html->find('tr') as $tr) { echo preg_replace('/.* - /', '', $tr->find('td',1)->text()) . "\n"; echo $tr->find('td',3)->text() . "\n"; } I really don't like apokryfos' approach to this, it's a lot of confusion with no benefit.

Related

How to catch text from html page

Extracting Site data through Web Crawler outputs error due to mis-match of Array Index

not able to retrieve direct child elements using Simple HTML DOM

Get data inside html tags using Simple HTML DOM Parser:

How To Format This Scraped Content

Categories

Resources