Extract table data from an HTML page in PHP

I have an HTML table with multiple rows, each with multiple columns. A sample row looks like this:
<table class ="classt">
<tbody>
<tr class="row">
<td height="20" valign="top" class="mosttext-new">data</td>
<td height="20" valign="top" class="mosttext-new"> data</td>
<td height="20" valign="top" class="mosttext-new">data</td>
</tr>
</tbody>
</table>
I am trying to extract all td elements in a PHP script like this:
foreach ($html->find('table.classt') as $e) {
    foreach ($e->find('tr.row') as $tr) {
        foreach ($tr->find('td') as $td) {
            $text = $td->innertext;
        }
    }
}
But $tr does not contain the row with its td tags. It just holds the entire row as plain text within double quotes, like this:
"data data data"
So my third loop cannot find any td, because $tr has no td tags.
Any ideas?

I think you have to add the class name after 'td', separated by a '.', like this:
foreach ($tr->find('td.mosttext-new') as $td)
Hopefully this solves your problem. All the best.
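If the Simple HTML DOM selectors keep returning plain text, one alternative worth trying is PHP's built-in DOM extension. The following is only a minimal sketch under that assumption: the function name extractCells() is made up for illustration, and the sample markup is a shortened copy of the row from the question.

```php
<?php
// Sketch using DOMDocument/DOMXPath instead of Simple HTML DOM.
// extractCells() is a hypothetical helper, not part of any library.
function extractCells(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    // @class="..." is an exact match; it will not match elements that
    // carry several classes in the same attribute.
    foreach ($xpath->query('//table[@class="classt"]//tr[@class="row"]') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$html = '<table class="classt"><tbody><tr class="row">'
      . '<td class="mosttext-new">data</td>'
      . '<td class="mosttext-new"> data</td>'
      . '<td class="mosttext-new">data</td>'
      . '</tr></tbody></table>';

print_r(extractCells($html));
```

Each inner array holds the trimmed text of one row's cells, so the td boundaries are preserved rather than collapsed into a single string.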

Related

PHP simplehtmldom find text in blank table structure

I'm having difficulty finding the DYNAMIC-TEXT value in a sea of HTML tables.
I have tried $html->find("th[plaintext*=Type") and, from there, I wanted to access the sibling, but it returns nothing. Here's the table structure:
<table>
<tbody>
</tbody>
<colgroup>
<col width="25%">
<col>
</colgroup>
<tbody>
<tr class="odd">
<th colspan="2">Name</th>
</tr>
<tr class="even">
<th width="30%">Type</th>
<td>DYNAMIC-TEXT</td>
</tr>
</tbody>
</table>
I expect the output to be the text DYNAMIC-TEXT, but the actual output is nothing.
Thanks
In your code $html->find("th[plaintext*=Type") you are using the attribute selector *=, but the th element has no attribute named plaintext.
There is, however, a width attribute with the value 30%. Since the *= value is applied as a pattern here, you might use ^[0-9]+%$ to check for one or more digits followed by a percent sign.
If that finds a result, you can call next_sibling() on it and read the plaintext of the sibling.
For example:
$html = str_get_html($str);
foreach ($html->find("th[width*=^[0-9]+%$]") as $value) {
    echo $value->next_sibling()->plaintext;
}
Result:
DYNAMIC-TEXT
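Matching on the width attribute only works as long as the markup keeps that attribute. If you would rather match on the header text itself, a sketch with the built-in DOMXPath could look like this. The helper name valueForHeader() and the shortened sample table are assumptions for illustration, and the label is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: return the text of the first <td> that follows
// a <th> whose trimmed text equals $label, or null if there is none.
function valueForHeader(string $html, string $label): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // ignore warnings from loose HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Assumes $label contains no double quotes.
    $query = sprintf('//th[normalize-space(.)="%s"]/following-sibling::td[1]', $label);
    $nodes = $xpath->query($query);

    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

$sample = '<table><tbody>'
        . '<tr class="odd"><th colspan="2">Name</th></tr>'
        . '<tr class="even"><th width="30%">Type</th><td>DYNAMIC-TEXT</td></tr>'
        . '</tbody></table>';

echo valueForHeader($sample, 'Type');
```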

php simple html dom <tr> bgcolor

OK, I've been reading up on PHP Simple HTML DOM and so far it works great.
I have a table which I'm trying to convert to a MySQL database.
I'm using this:
foreach($html->find('TR') as $row) {
etc.etc.etc.
}
my table:
<TR BGCOLOR="CCDDFF">
<TD valign="top">
</TD>
</TR>
But how do I get the bgcolor from the tr?
Did you try the $row->getAttribute('bgcolor') method?
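For comparison, the same idea with PHP's built-in DOMDocument might look like the sketch below. This swaps in a different parser than the one in the question, and the sample markup is wrapped in a table element as an assumption so that the fragment parses predictably.

```php
<?php
// Sketch with the built-in DOM extension (not Simple HTML DOM).
$html = '<table><TR BGCOLOR="CCDDFF"><TD valign="top">x</TD></TR></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // ignore warnings from loose HTML
$dom->loadHTML($html);
libxml_clear_errors();

$colors = [];
// The HTML parser lowercases tag and attribute names, so 'tr' and
// 'bgcolor' match the uppercase markup above.
foreach ($dom->getElementsByTagName('tr') as $row) {
    // getAttribute() returns an empty string when the attribute is absent.
    $colors[] = $row->getAttribute('bgcolor');
}

print_r($colors);
```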

Need help scraping webpage -- getting specific content...

I have a table whose number of columns can change depending on the configuration of the scraped page (I have no control over it). I want to get only the information from a specific column, identified by the column's heading.
Here is a simplified table:
<table>
<tbody>
<tr class='header'>
<td>Image</td>
<td>Name</td>
<td>Time</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 1</td>
<td>13:02</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 2</td>
<td>13:43</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 3</td>
<td>14:53</td>
</tr>
</tbody>
</table>
I only want to extract the names (column 2) from the table. However, as previously stated, the column order cannot be known. The Image column might not be there, for example, in which case the column I want would be the first one.
I was wondering if there's any way to do this with DomDocument/DomXPath. Perhaps search for the string "Name" in the first tr, find out which column index it is, and then use that to get the info. A less elegant solution would be to check whether the first column has an img tag, in which case the image column is first, so we can throw that away and use the next one.
I've been looking at it for about an hour and a half, but I'm not familiar with the DomDocument functions and manipulation. Having a lot of trouble with this one.
Simple HTML DOM Parser may be useful. You can check the manual. Basically you would use something like this:
$url = "file url";
$html = file_get_html($url);
$header = $html->find('tr.header td');
$num = null;
$i = 0;
foreach ($header as $element) {
    // The property is lowercase "innertext" in Simple HTML DOM
    if ($element->innertext == 'Image') { $num = $i; }
    $i++;
}
This finds which column ($num) is the Image column; it stays null if there is none. You can extend the code from there.
PS: An easy way to find all image sources:
$images = $html->find('tr td img');
foreach ($images as $image) {
    $imageUrl[] = $image->src;
}
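Since the question asked about DomDocument/DomXPath, here is a rough sketch of the "find the heading, then use its index" idea with those classes. The function name columnByHeading() and the shortened sample table are assumptions for illustration; the header row is assumed to carry class="header" as in the question.

```php
<?php
// Hypothetical helper: collect the values of the column whose header
// cell text equals $heading. Returns [] if the heading is not found.
function columnByHeading(string $html, string $heading): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // ignore warnings from loose HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Find the 1-based position of the heading in the header row
    // (XPath positions start at 1).
    $position = null;
    $i = 1;
    foreach ($xpath->query('//tr[@class="header"]/td') as $cell) {
        if (trim($cell->textContent) === $heading) {
            $position = $i;
            break;
        }
        $i++;
    }
    if ($position === null) {
        return [];
    }

    // Take the cell at that position from every non-header row.
    $values = [];
    $query = sprintf('//tr[not(@class="header")]/td[%d]', $position);
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

$sample = "<table><tbody>"
        . "<tr class='header'><td>Image</td><td>Name</td><td>Time</td></tr>"
        . "<tr><td><img src='someimage.png' /></td><td>Name 1</td><td>13:02</td></tr>"
        . "<tr><td><img src='someimage.png' /></td><td>Name 2</td><td>13:43</td></tr>"
        . "</tbody></table>";

print_r(columnByHeading($sample, 'Name'));
```

Because the position is looked up from the header row each time, the same call keeps working when the Image column is missing and Name moves to the first column.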

Displaying the text in the multiple lines when retrieving from database

Hi,
I have a table whose rows contain text that I retrieve from the database. The rows are narrow and the retrieved data is long, so the text exceeds the width of the row. I want to break the retrieved data into multiple lines inside the table row. How can I do it?
My code is here:
$list = $mfidao1->fetchMfi($_GET['id']);
//print_r($list);
//die;
if(!empty($list))
{
foreach($list as $menu)
{
?>
<tr style="border:none; background-color:#FBFBFB;" >
<td class="topv">Social Mission</td>
<td class="topm" ><div class="txt"><?php echo $menu->mfi_1_a;?></div></td>
</tr>
<tr bgcolor="#E8E8E8">
<td class="topv">Address</td>
<td class="topm"><?php echo $menu->mfi_ii_c;?></td>
</tr>
<tr bgcolor="#FBFBFB">
<td class="topv">Phone</td>
<td class="topm"><?php echo $menu->mfi_ii_e;?></td>
</tr>
<tr bgcolor="#E8E8E8">
<td class="topv">Email</td>
<td class="topm"><?php echo $menu->mfi_ii_d;?></td>
</tr>
<tr bgcolor="#FBFBFB">
<td class="topv">Year Established</td>
<td class="topm"><?php echo $menu->mfi_i_c;?></td>
</tr>
<tr bgcolor="#E8E8E8">
<td class="topv">Current Legal Status</td>
<td class="topm"><?php echo $menu->mfi_i_d;?></td>
</tr>
<tr bgcolor="#FBFBFB">
<td class="topv">Respondent</td>
<td class="topm"><?php echo $menu->mfi_ii_a;?></td>
</tr>
<?php
}
}
?>
</table>
Set a width on the <td>. I think this is a better approach than wordwrap().
In your CSS for the table, use "table-layout: fixed". This fixes the width of the td elements to what you specify.
"word-wrap: break-word;" breaks the text so that it doesn't go beyond the boundary of the box.
You need to wrap the text in your td tags.
You could use the function wordwrap().
It wraps a string to a given number of characters using a string break character.
You can either use the PHP function wordwrap() or style the td with CSS so that it uses the word-wrap property.
Not sure if this is what you want, but it sounds like you could use chunk_split().
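To illustrate the wordwrap() suggestion, here is a small sketch. The helper name wrapForCell() and the default width of 40 characters are assumptions; $text stands in for a database value such as $menu->mfi_1_a.

```php
<?php
// Hypothetical helper: wrap a long value for output inside a <td>.
function wrapForCell(string $text, int $width = 40): string
{
    // Wrap the raw text first so HTML entities do not distort the count;
    // the fourth argument (true) also cuts words longer than $width.
    $wrapped = wordwrap($text, $width, "\n", true);

    // Escape for HTML, then turn each newline into "<br />" + newline.
    return nl2br(htmlspecialchars($wrapped));
}

echo wrapForCell('The quick brown fox', 10);
```

In the table from the question this would be used as <?php echo wrapForCell($menu->mfi_1_a); ?>. The CSS approach avoids changing the PHP at all, so which one fits better depends on whether the cell width is fixed.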

Using php to parse html document

I am making a php app to parse HTML contents. I need to store a certain table column in php variables.
Here is my code:
$dom = new DOMDocument;
// Must be set before loading to have any effect
$dom->preserveWhiteSpace = false;
// @ suppresses warnings about malformed markup
@$dom->loadHTML($html);
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
$flag = 0;
foreach ($rows as $row)
{
    // Skip the first row
    if ($flag == 0) $flag = 1;
    else
    {
        $cols = $row->getElementsByTagName('td');
        foreach ($cols as $col)
        {
            echo $col->nodeValue; //NEED HELP HERE
        }
        echo '<hr />';
    }
}
In each row, the first column is the KEY and the second is the VALUE. How can I create key/value pairs from the table and store them as arrays in PHP?
I tried many things, but every time I just get DOMElement Object() as the value.
Any help is deeply appreciated...
HTML as requested:
<table align='center' border='0' cellpadding='0' cellspacing='0' style='border-collapse: collapse' width='780' height=100%>
<tr><td height=96% align=center><BR><BR>
<html>
<head>
</head>
<body style="background:url(uptu_logo1.gif); background-repeat:no-repeat; background-position:center">
<p align="center" style="font-size:18px"><span style='font-size:20px'>this text is unimportant gibberish that is not required by my app</span><br/><span style='font-size:16px'>this text is unimportant gibberish that is not required by my app</span><br/><u>B.Tech. Third Year Result 2009-10. this text is unimportant gibberish that is not required by my app</u></p>
<br/>
<table align="center" border="1" cellpadding="0" cellspacing="0" bordercolor="#E3DDD5" width="700" style="border-collapse: collapse; font-size: 11px">
<tr>
<td width="50%"><b>Name:</b></td>
<td width="50%">John Fernandes </td>
</tr>
<tr>
<td><b>Fathers Name:</b></td>
<td>Caith Fernandes </td>
</tr>
<tr>
<td><b>Roll No:</b></td>
<td>0702410099</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>REGULAR </td>
</tr>
<tr>
<td><b>Course/Branch:</b></td>
<td>B. Tech. </td>
</tr>
<tr>
<td><b>Institute Name</b></td>
<td>Imperial College of Science and Technology</td>
</tr>
</table>
My PHP code outputs:
Name:John Fernandes <hr />
Fathers Name:Caith Fernandes <hr />
Roll No:0702410099<hr />
Status:REGULAR <hr />
Course/Branch:B. Tech. Computer Science and Engineering (10)<hr />
Imperial College of Science and Technology<hr />
Also, how do I get rid of this silly Â? I saw in the original HTML so I tried to sanitize using the PHP function html_entity_decode(), but it's still there...
What is the HTML that you are loading? I am assuming that it's something simple like so:
<table>
<tr>
<td>heading</td>
<td>heading</td>
</tr>
<tr>
<td>key</td>
<td>value</td>
</tr>
</table>
It looks like the first tr (the headings) is skipped, and then you have just two columns that you want to pair up as KEY => VALUE:
$cols = $row->getElementsByTagName('td');
$key = $cols->item(0)->nodeValue; // string(3) "key"
$val = $cols->item(1)->nodeValue; // string(5) "value"
The above code will return the items you want.
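Putting that together, a sketch for building the key/value array could look like the following. It follows the simplified table assumed in this answer (first row is a heading row, two cells per data row); the function name tableToPairs() and the sample rows are assumptions for illustration. It also strips non-breaking spaces (U+00A0) from both ends of each cell, which is a guess at where the stray Â comes from.

```php
<?php
// Hypothetical helper: turn a two-column table into key => value pairs.
function tableToPairs(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // ignore warnings from loose HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $table = $dom->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return [];
    }

    // Trim ordinary whitespace and non-breaking spaces from both ends.
    $clean = static function (string $s): string {
        return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $s) ?? $s;
    };

    $pairs = [];
    $first = true;
    // Note: this also visits rows of nested tables, if there are any.
    foreach ($table->getElementsByTagName('tr') as $row) {
        if ($first) {          // skip the heading row
            $first = false;
            continue;
        }
        $cols = $row->getElementsByTagName('td');
        if ($cols->length < 2) {
            continue;          // not a key/value row
        }
        $key = $clean($cols->item(0)->nodeValue);
        $pairs[$key] = $clean($cols->item(1)->nodeValue);
    }
    return $pairs;
}

$sample = '<table>'
        . '<tr><td>heading</td><td>heading</td></tr>'
        . '<tr><td><b>Name:</b></td><td>John Fernandes&nbsp;</td></tr>'
        . '<tr><td><b>Roll No:</b></td><td>0702410099</td></tr>'
        . '</table>';

print_r(tableToPairs($sample));
```

nodeValue on a DOMElement already returns the text of the cell, so the array holds plain strings rather than DOMElement objects.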
