How to catch text from html page

How to catch text from html page - php

I'd like to catch the word "Bronze" from this html page portion:
<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Men's Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>
I tried different code but I failed in my intent. One of many attempts is the following:
foreach($html4->find('td align="left" strong') as $tag4) {
echo $prova = $tag4->innertext . "\n";
}
where html4 is the entire html page I have to process.

With following Code you can get the classname "Bronze"
<?php
$html='<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Mens Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>';
$dom = new DOMDocument();
#$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $link) {
echo trim($link->getAttribute('class'),' ');
}
?>
Or, if you prefer the Node Value and not the class name and the csk attribut is always 1:
foreach($dom->getElementsByTagName('td') as $link) {
if ($link->getAttribute('csk')=="1"){
echo $link->nodeValue;
}
}

Related

Selective extraction of data from external site using DOM PHP web crawler

I have this PHP dom web crawler which works fine. it extracts mentioned tag along with its link from a (external) forum site to my page.
But recently i ran into a problem. Like
this is the HTML of the forum data::
<tbody>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Hispanic Study Partner - dreamer1984</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">nbme - monariyadh</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
Now if we consider the above code (table data) as the only statements available in that site. and if i tried to extract it with a web crawler like,
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach($html->find('td.FootNotes2') as $element) {
echo $element;
}
?>
It extracts al the data that is inside with a class name as "FootNote2"
Now what if i want to extract specific data in tag, for example names like, " dreamer1984" and "monariyadh" from the first tag/line.
and what if i wanted to extract data from 3rd (skipping the rest) which has same class names.
Please note that i can use "regex" like
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matchs);
foreach ($matchs['name'] as $k => $v){
var_dump('name: '. $v, 'relative date: '. $matchs['date'][$k]);
}
But i prefer to find solution for this in DOM parser...
Any help is appreciated..

As I said in my comment some text processing is unavoidable, however you can get the text element associated with the td like so :
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find("tr") as $row) {
$element = $row->find('td.FootNotes2',0);
if ($element == null) { continue; }
$textNode = array_filter($element->nodes, function ($n) {
return $n->nodetype == 3; //Text node type, like in jQuery
});
if (!empty($textNode)) {
$text = current($textNode);
echo $text;
}
}
This echoes:
- dreamer1984
- monariyadh
Do with that what you will.
Updated to only find the first td for each tr.

If you want to extract only text (not tags and its contain)
foreach ($html->find("td.FootNotes2") as $element) {
$children = $element->children; // get an array of children
foreach ($children AS $child) {
$child->outertext = ''; // This removes the element, but MAY NOT remove it from the original $myDiv
}
echo $element->innertext."<br>";
}
o/p:
- dreamer1984
02/28/17 01:42
0
200
- monariyadh
02/27/17 23:12
0
108

You have to use regex either way so no sense overcomplicating it:
foreach($html->find('tr') as $tr) {
echo preg_replace('/.* - /', '', $tr->find('td',1)->text()) . "\n";
echo $tr->find('td',3)->text() . "\n";
}
I really don't like apokryfos' approach to this, it's a lot of confusion with no benefit.

Duplicated Results on SQL Query

PHP newbie here! I ve been struggling with this for a few days now and i have decided i cant figure this out on my own.
Basically i have 2 database tables "projects_2016" and "attachment".
I want to show the data of "projects_2016" to show in the top table and then check for a matching id number (and if it exsits) in "attachment" it will list all the results under the "project_2016 data".
At the moment it works great but it duplicates the "projects_2016" data for every "attachment" entry.
Here is my code, any input is appreciated!
PS not too concereded about Sql injections. Still learning that!
<?php include '../../../connection_config.php';
$sql = "SELECT DISTINCT * FROM attachment JOIN projects_2016 ON attachment.attachment_ABE_project_number = projects_2016.id ORDER BY `attachment_ABE_project_number` DESC";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
while($row = $result->fetch_assoc()) {
?>
<table width="20" border="1" cellspacing="0" cellpadding="2">
<tr>
<th height="0" scope="col"><table width="990" border="0" align= "center" cellpadding="3" cellspacing="0">
<tr class="text_report">
<td width="107" height="30" align="left" valign="middle" nowrap="nowrap" bgcolor="#F5F5F5"><strong>PNo</strong></td>
<td width="871" align="left" valign="middle" nowrap="nowrap" bgcolor="#F5F5F5"><strong>Project Name</strong></td>
</tr>
<tr>
<td height="20" align="left" valign="middle" bgcolor="#FFFFFF" class="text_report"><strong><?php echo "<br>". $row["ID"]. "<br>";?></strong></td>
<td height="20" align="left" valign="middle" bgcolor= "#FFFFFF" class="text_report"><strong><?php echo "<br>". $row["project_name"]. "<br>";?></strong></td>
</tr>
</table>
<?php
$photo_id = $row["ID"];
$contacts = "SELECT DISTINCT * FROM attachment WHERE attachment_ABE_project_number = '$photo_id'" ;
$result_contacts = $conn->query($contacts);
if ($result_contacts->num_rows > 0) {
// output data of each row
while($row_contacts = $result_contacts->fetch_assoc()) {
?>
<table width="990" border="0" align="center" cellpadding= "3" cellspacing="0" class="text_report">
<tr>
<td height="0" colspan="4" align="left" valign="middle" nowrap="nowrap" bgcolor="#FFFFFF"> </td>
</tr>
<tr>
<td width="319" height="30" align="left" valign="middle" nowrap="nowrap" bgcolor="#F5F5F5"><strong>File Name</strong></td>
<td width="279" align="left" valign="middle" nowrap="nowrap" bgcolor="#F5F5F5"><strong>File Type</strong></td>
<td width="315" align="left" valign="middle" nowrap="nowrap" bgcolor="#F5F5F5"><strong>File Size (KB)</strong></td>
<td width="53" align="right" valign="middle" nowrap="nowrap" bgcolor="#F5F5F5"><strong>View File</strong></td>
</tr>
<tr>
<td height="20" align="left" valign="middle" bgcolor="#FFFFFF"><?php echo $row_contacts ['file'] ?></td>
<td height="20" align="left" valign="middle" bgcolor="#FFFFFF"><?php echo $row_contacts ['type'] ?></td>
<td height="20" align="left" valign="middle" bgcolor="#FFFFFF"><?php echo $row_contacts ['size'] ?></td>
<td align="right" valign="middle" bgcolor="#FFFFFF">view file</td>
</tr>
<tr>
<td height="0" colspan="4" align="left" valign="middle" bgcolor="#FFFFFF"> </td>
</tr>
<?php
}
?>
</table>
<?php
}
?></th>
</tr>
</table>
<table width="1000" border="0" cellspacing="0" cellpadding="0">
<tr>
<th height="26"> </th>
</tr>
</table>
<p>
<?php
}
}
?>
</p>
</table>
<?php $conn->close();
?>

$sql = "SELECT * FROM projects_2016
WHERE EXISTS (SELECT * FROM attachment WHERE projects_2016.id = attachment_ABE_project_number) ORDER BY id DESC ";

store part of the page in variable

i have a page contains a table and this table contains a lot of coding like this
<table class="table">
<tbody>
<tr >
<td width="20" class="tabletop">م</td>
<td class="tabletop" >name</td>
<td class="tabletop" style="width:120px">date</td>
<td class="tabletop" style="width:120px">note1</td>
<td class="tabletop" style="width:100px">note2</td>
<td class="tabletop" style="width:90px">note3</td>
</tr>
<? $res=mysql_query($sql);
while($resArr=mysql_fetch_array($res)){ ?>
<tr style="width:700px">
<td class="tabletext"><?= ++$serial;?></td>
<td class="tabletext" ><?= $resArr[stName];?></td>
<td class="tabletext"><?= $resArr['date'];?></td>
<td class="tabletext" ><?= $resArr[matName];?></td>
<td class="tabletext" ><? if($resArr[exam]==1) echo "work";else echo "final";?></td>
<td class="tabletext" ><? if($resArr[exam_type]==1) echo "prac";else echo "test";?></td>
</tr>
<? }?>
</tbody>
</table>
as you see the table has php coding
now i want to store the whole table in variable so i can send it to pdf printing library tcpdf

You can use heredoc syntax, but you need to move the conditionals outside:
<?php
if($resArr['exam']==1) $exam = "work"; else $exam = "final";
if($resArr["exam_type"]==1) $examtype = "prac";else $examtype = "test";
$var = <<<EOT
<table class="table">
<tbody>
<tr >
<td width="20" class="tabletop">م</td>
<td class="tabletop" >name</td>
<td class="tabletop" style="width:120px">date</td>
<td class="tabletop" style="width:120px">note1</td>
<td class="tabletop" style="width:100px">note2</td>
<td class="tabletop" style="width:90px">note3</td>
</tr>
$res=mysql_query($sql);
while($resArr=mysql_fetch_array($res)){
<tr style="width:700px">
<td class="tabletext">{++$serial}</td>
<td class="tabletext" >{$resArr["stName"]}</td>
<td class="tabletext">{$resArr["date"]}</td>
<td class="tabletext" >{$resArr["matName"]}</td>
<td class="tabletext" >{$exam}</td>
<td class="tabletext" >{$examtype}</td>
</tr>
<? }?>
</tbody>
</table>
EOT;
echo $var;

Another way, if you already have all of this code and want to save it:
Before the table:
<?php ob_start(); ?>
After the table:
<?php $output = ob_get_contents(); ?>
The table will still be displayed and you can use $output to send to the PDF.

PHP DOM get element which contains

Need help with parsing HTML code by PHP DOM.
This is simple part of huge HTML code:
<table width="100%" border="0" align="center" cellspacing="3" cellpadding="0" bgcolor='#ffffff'>
<tr>
<td align="left" valign="top" width="20%">
<span class="tl">Obchodne meno:</span>
</td>
<td align="left" width="80%">
<table width="100%" border="0">
<tr>
<td width="67%">
<span class='ra'>STORE BUSSINES</span>
</td>
<td width="33%" valign='top'>
<span class='ra'>(od: 02.10.2012)</span>
</td>
</tr>
</table>
</td>
</tr>
</table>
What I need is to get text "STORE BUSINESS". Unfortunately, the only thing I can catch is "Obchodne meno" as a content of first tag, so according to this content I need to get its parent->parent->first sibling->child->child->child->child->content. I have limited experience with parsing html in php so any help will be valuable. Thanks in advance!

Make use of DOMDocument Class and loop through the <span> tags and put them in array.
<?php
$html=<<<XCOE
<table width="100%" border="0" align="center" cellspacing="3" cellpadding="0" bgcolor='#ffffff'>
<tr>
<td align="left" valign="top" width="20%">
<span class="tl">Obchodne meno:</span>
</td>
<td align="left" width="80%">
<table width="100%" border="0">
<tr>
<td width="67%">
<span class='ra'>STORE BUSSINES</span>
</td>
<td width="33%" valign='top'>
<span class='ra'>(od: 02.10.2012)</span>
</td>
</tr>
</table>
</td>
</tr>
</table>
XCOE;
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('span') as $tag) {
$spanarr[]=$tag->nodeValue;
}
echo $spanarr[1]; //"prints" STORE BUSINESS

Get data inside html tags using Simple HTML DOM Parser:

I want to get all the information inside the html tags and display them in a table. I'm using Simple HTML DOM Parser. I tried the following code but I'm only getting the LAST COLUMN (Column:Total). How do I get the data from the other columns?
foreach($html->find('tr[class="tblRowShade"]') as $div) {
$key = '';
$val = '';
foreach($div->find('*') as $node) {
if ($node->tag=='td'){
$key = $node->plaintext;
}
}
$ret[$key] = $val;
}
Here's my code for the table
<tr class="tblRowShade">
<td width="12%"><strong>Project</strong></td>
<td width="38%"> </td>
<td width="25%"><strong>Recipient</strong></td>
<td width="14%"><strong>Municipality/City</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Implementing Unit</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Release Date</strong></td>
<td align="right" width="11%" class="td_right"><strong>Total</strong></td>
</tr>
<tr class="tblRowShade">
<td colspan="2" >Livelihood Programs</td>
<td >Basic Espresso and Latte</td>
<td nowrap="nowrap"></td>
<td >DOLE - TESDA Regional Office IV-A</td>
<td nowrap="nowrap">2013-06-11</td>
<td align="right" nowrap="nowrap" class="td_right">1,500,000</td>
</tr>

Why do you have $div->find('*')? you can try $div->find('td') instead. This should produce correct result. Otherwise you can also try to iterate over children: foreach($div->children as $node)
Assuming you are trying to use the first row as $key and the rest for the data, you might want to alter your HTML code by simply add th in the first row, which is your header: <tr><th>…</th></tr>. This way you can get the keys by $div->find('th'). I suppose using the first row is okay as well.

As alamin.ahmed said, it would be better to search for td instead...
Here's a working example :
$text = ' <tr class="tblRowShade">
<td width="12%"><strong>Project</strong></td>
<td width="38%"> </td>
<td width="25%"><strong>Recipient</strong></td>
<td width="14%"><strong>Municipality/City</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Implementing Unit</strong></td>
<td width="11%" nowrap="nowrap" class="td_right"><strong>Release Date</strong></td>
<td align="right" width="11%" class="td_right"><strong>Total</strong></td>
</tr>
<tr class="tblRowShade">
<td colspan="2" >Livelihood Programs</td>
<td >Basic Espresso and Latte</td>
<td nowrap="nowrap"></td>
<td >DOLE - TESDA Regional Office IV-A</td>
<td nowrap="nowrap">2013-06-11</td>
<td align="right" nowrap="nowrap" class="td_right">1,500,000</td>
</tr>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($text);
// Find all elements
$rows = $html->find('tr[class="tblRowShade"]');
// Find succeeded
if ($rows) {
echo count($rows) . " \$rows found !<br />";
foreach ($rows as $key => $row) {
echo "<hr />";
$columns = $row->find('td');
// Find succeeded
if ($rows) {
echo count($columns) . " \$columns found in \$rows[$key]!<br />";
foreach ($columns as $col) {
echo $col->plaintext . " | ";
}
}
else
echo " /!\ Find() \$columns failed /!\ ";
}
}
else
echo " /!\ Find() \$rows failed /!\ ";
here's the output of the above code:
You must be aware that the two rows doesnt contain the same number of columns... then you must handle that in your program.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How to catch text from html page - php

Related

Selective extraction of data from external site using DOM PHP web crawler

Duplicated Results on SQL Query

store part of the page in variable

PHP DOM get element which contains

Get data inside html tags using Simple HTML DOM Parser:

Categories

Resources