Using php to parse html document - php

I am building a PHP app that parses HTML. I need to store the values of a certain table column in PHP variables.
Here is my code:
$dom = new DOMDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
$flag = 0;
foreach ($rows as $row)
{
    if ($flag == 0) $flag = 1;
    else
    {
        $cols = $row->getElementsByTagName('td');
        foreach ($cols as $col)
        {
            echo $col->nodeValue; //NEED HELP HERE
        }
        echo '<hr />';
    }
}
In each row, the first column is the KEY and the second is the VALUE. How can I create key-value pairs from the table and store them in a PHP array?
I tried many things, but every time I just get DOMElement Object() as the value.
Any help is deeply appreciated.
HTML as requested:
<table align='center' border='0' cellpadding='0' cellspacing='0' style='border-collapse: collapse' width='780' height=100%>
<tr><td height=96% align=center><BR><BR>
<html>
<head>
</head>
<body style="background:url(uptu_logo1.gif); background-repeat:no-repeat; background-position:center">
<p align="center" style="font-size:18px"><span style='font-size:20px'>this text is unimportant gibberish that is not required by my app</span><br/><span style='font-size:16px'>this text is unimportant gibberish that is not required by my app</span><br/><u>B.Tech. Third Year Result 2009-10. this text is unimportant gibberish that is not required by my app</u></p>
<br/>
<table align="center" border="1" cellpadding="0" cellspacing="0" bordercolor="#E3DDD5" width="700" style="border-collapse: collapse; font-size: 11px">
<tr>
<td width="50%"><b>Name:</b></td>
<td width="50%">John Fernandes </td>
</tr>
<tr>
<td><b>Fathers Name:</b></td>
<td>Caith Fernandes </td>
</tr>
<tr>
<td><b>Roll No:</b></td>
<td>0702410099</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>REGULAR </td>
</tr>
<tr>
<td><b>Course/Branch:</b></td>
<td>B. Tech. </td>
</tr>
<tr>
<td><b>Institute Name</b></td>
<td>Imperial College of Science and Technology</td>
</tr>
</table>
My PHP code outputs:
Name:John Fernandes <hr />
Fathers Name:Caith Fernandes <hr />
Roll No:0702410099<hr />
Status:REGULAR <hr />
Course/Branch:B. Tech. Computer Science and Engineering (10)<hr />
Imperial College of Science and Technology<hr />
Also, how do I get rid of the stray character that shows up in the output? I saw it in the original HTML, so I tried to sanitize it with the PHP function html_entity_decode(), but it's still there.

What is the HTML that you are loading? I am assuming that it's something simple like so:
<table>
<tr>
<td>heading</td>
<td>heading</td>
</tr>
<tr>
<td>key</td>
<td>value</td>
</tr>
</table>
It looks like the first tr (the headings) is skipped, and then you have just two columns that you want to pair up as KEY => VALUE:
$cols = $row->getElementsByTagName('td');
$key = $cols->item(0)->nodeValue; // string(3) "key"
$val = $cols->item(1)->nodeValue; // string(5) "value"
The above code will return the items you want.
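Putting this together, a minimal sketch that collects every data row into a key => value array might look like the following. The $html sample, the $pairs name, and the stripping of the trailing colon are assumptions for illustration, not part of the original code.

```php
<?php
// Sample markup modelled on the question's table (an assumption for this sketch).
$html = <<<HTML
<table>
<tr><td>heading</td><td>heading</td></tr>
<tr><td><b>Name:</b></td><td>John Fernandes </td></tr>
<tr><td><b>Roll No:</b></td><td>0702410099</td></tr>
</table>
HTML;

$dom = new DOMDocument();
// Collect libxml warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$rows = $dom->getElementsByTagName('table')->item(0)->getElementsByTagName('tr');

$pairs = [];
// Start at 1 to skip the heading row.
for ($i = 1; $i < $rows->length; $i++) {
    $cols = $rows->item($i)->getElementsByTagName('td');
    if ($cols->length < 2) {
        continue; // not a key/value row
    }
    // nodeValue is already a plain string; trim whitespace and the trailing colon.
    $key = rtrim(trim($cols->item(0)->nodeValue), ':');
    $pairs[$key] = trim($cols->item(1)->nodeValue);
}

print_r($pairs);
```

Reading nodeValue (rather than echoing the DOMElement itself) is what avoids the "DOMElement Object()" output mentioned in the question.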

Related

PHP DOM GET HREF ATTRIBUTE BETWEEN TABLE

I'm trying to get multiple href's from a table like this
<table class="table table-bordered table-hover">
<thead>
<tr>
<th class="text-center">No</th>
<th>TITLE</th>
<th>DESCRIPTION</th>
<th class="text-center"><span class="glyphicon glyphicon-download-alt"></span></th>
</tr>
</thead>
<tbody>
<tr data-key="11e44c4ebff985d08ca5313231363233">
<td class="text-center" style="width: 50px;">181</td>
<td style="width:auto; white-space: normal;">Link 1</td>
<td style="width:auto; white-space: normal;">Lorem ipsum dolor 1</td>
<td class="text-center" style="width: 50px;"><img src="https://example.com/img/pdf.png" width="15" height="20" alt="myImage"></td>
</tr>
<tr data-key="11e44c4e4222d630bdd2313231323532">
<td class="text-center" style="width: 50px;">180</td>
<td style="width:auto; white-space: normal;">Link 2</td>
<td style="width:auto; white-space: normal;">Lorem ipsum dolor 2</td>
<td class="text-center" style="width: 50px;"><img src="https://example.com/img/pdf.png" width="15" height="20" alt="myImage"></td>
</tr>
</tbody>
</table>
I tried PHP DOM like this:
<?php
$html = file_get_contents('data2.html');
$htmlDom = new DOMDocument;
$htmlDom->preserveWhiteSpace = false;
$htmlDom->loadHTML($html);
$tables = $htmlDom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
    $cols = $row->getElementsByTagName('td');
    echo @$cols->item(0)->nodeValue.'<br />';
    echo @$cols->item(1)->nodeValue.'<br />';
    echo trim($cols->item(1)->getElementsByTagName('a')->item(0)->getAttribute('href')).'<br />';
    echo @$cols->item(2)->nodeValue.'<br />';
    echo trim($cols->item(3)->getElementsByTagName('a')->item(0)->getAttribute('href')).'<br />';
}
?>
I get this error:
Fatal error: Uncaught Error: Call to a member function getElementsByTagName() on null
The getAttribute line causes the error.
Could someone help me out here, please? Thanks.
Your $rows holds every <tr> within the <table>. That includes not only the rows in the table body but also the row in the table head, which has no <td> in it. So when reading that row, $cols->item(0) and $cols->item(1) both return NULL.
The missing ->nodeValue on those items was already a hint (which is why you added the @ sign to suppress the warning).
Try changing this:
$rows = $tables->item(0)->getElementsByTagName('tr');
into this:
$rows = $tables
->item(0)->getElementsByTagName('tbody')
->item(0)->getElementsByTagName('tr');
Now it searches only the <tr> elements within your <tbody>, which should fix the issue for this particular HTML.
For more robust code, check the variables before acting on them. A type check or a count check would do.
Because all the earlier accesses to $cols use @ to suppress errors, the getElementsByTagName() call is the first one that complains.
A simple fix is to skip the rest of the loop body when no <td> elements are found (as in the header row)...
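foreach ($rows as $row)
{
    $cols = $row->getElementsByTagName('td');
    // length works on every PHP version; count() on a DOMNodeList needs PHP 7.2+
    if ($cols->length == 0) {
        continue;
    }
    // ... rest of the loop body
}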
You could alternatively use XPath and only select <tr> tags which contain <td> tags.
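That XPath approach could look roughly like the sketch below. The sample markup is a trimmed-down stand-in for the question's table, and the href values are placeholders, since the real links are not shown in the question.

```php
<?php
// Trimmed-down table; the href values are placeholders, not from the original page.
$html = <<<HTML
<table>
<thead><tr><th>No</th><th>TITLE</th></tr></thead>
<tbody>
<tr><td>181</td><td><a href="https://example.com/1.pdf">Link 1</a></td></tr>
<tr><td>180</td><td><a href="https://example.com/2.pdf">Link 2</a></td></tr>
</tbody>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];

// tr[td] matches only rows with at least one <td> child,
// so the <th>-only header row is never visited.
foreach ($xpath->query('//table//tr[td]') as $row) {
    // Query relative to the current row for the link in the second cell.
    $anchor = $xpath->query('td[2]/a', $row)->item(0);
    if (!$anchor instanceof DOMElement) {
        continue; // no link in this row
    }
    $links[] = trim($anchor->getAttribute('href'));
}

print_r($links);
```

Checking the result of item(0) before calling getAttribute() is what prevents the "member function on null" error, whichever way the rows are selected.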

Extract specific values from HTML table using regex

I have a html file that contains this table row:
<tr>
<td class="color21 right" style="font-size:12px; line-height:1.2;"> Location</td>
<td class="color21" style="font-size:12px;">10</td>
<td class="color21" style="font-size:12px;"><img src="../../icons/9.gif" alt="Type" /> </td>
<td class="color21" style="font-size:12px;">3</td>
<td class="color21" style="font-size:12px;">7</td>
<td class="color21" style="font-size:12px;"><img src="../../icons/11.gif" alt="Type" /> </td>
<td class="color21" style="font-size:12px;">3</td>
<td class="color21" style="font-size:12px;">10</td>
<td class="color21" style="font-size:12px;"><img src="../../icons/9.gif" alt="Type" /> </td>
</tr>
I'm retrieving file contents using file_get_contents.
How can I extract all TD values using preg_match, preg_match_all?
Use a DOM parser to parse the HTML content; regular expressions are not reliable in cases like this.
$str = file_get_contents('read.txt');
$dom = new DOMDocument;
$dom->loadHTML($str);
$cells = $dom->getElementsByTagName('td');
foreach ($cells as $td)
{
    // empty() would also treat a cell containing "0" as empty, so compare against ''
    if (trim($td->nodeValue) !== '') {
        echo $td->nodeValue."\n";
    } else {
        // No text in the cell: print the attributes of any images instead
        $images = $td->getElementsByTagName('img');
        foreach ($images as $image) {
            echo $image->getAttribute('src')." ";
            echo $image->getAttribute('alt');
        }
    }
}
Think twice about whether you really want a regex to parse HTML.
If you do, you can use this:
<td.+?>(.+?)</td>
The first group will contain the values of <td>
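As a rough sketch of how that pattern could be applied with preg_match_all (the question asked about it), assuming the row is already in a string. The ~ delimiter, the s modifier, and the $values name are choices made for this example, and the usual caveat stands: this only holds up on simple, regular markup.

```php
<?php
// A shortened copy of the row from the question.
$html = <<<HTML
<tr>
<td class="color21 right" style="font-size:12px;"> Location</td>
<td class="color21" style="font-size:12px;">10</td>
<td class="color21" style="font-size:12px;"><img src="../../icons/9.gif" alt="Type" /> </td>
</tr>
HTML;

// ~ is the delimiter so the / in </td> needs no escaping;
// the s modifier lets . match newlines inside a cell.
preg_match_all('~<td.+?>(.+?)</td>~s', $html, $matches);

// $matches[1] holds the first capture group of every match.
$values = array_map('trim', $matches[1]);

print_r($values);
```

Cells that contain only an image come back as raw <img> markup, which would still need a second pass to pull out src or alt; the DOM approach handles that more directly.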

PHP mySQL - How do i print only different attributes of two things that share a common name?

I'm trying to print out a list of products on a page. This is cake so far..
However, some of my items share a name but have different attributes.
For example:
product:
keyboard
keyboard
size:
25 inches
23 inches
color:
red
blue
My table looks something like this:
id, product, size, color, so on...
My .php file, which runs the query and prints the results, looks something like this:
<div id="accordion">
<?php
$letter = $_GET['letter'];
//echo "$id";
include("database.php");
$result = mysql_query("SELECT UPPER(product) AS upperName, PRODUCTS.* FROM PRODUCTS WHERE product LIKE '$letter%' ORDER BY UPPER(product) ");
$prodName = "";
while($row = mysql_fetch_array($result))
{
if ($row['upperName'] != $prodName)
{
print('<div style="background-color:#666; padding-bottom:25px; margin-bottom:25px;">');
print ("<h1>" . "$row[product]" . "</h1>");
}
print ("$row[ndc]" . "<br />");
print ("$row[size]" . "<br />");
print ("$row[strength]" . "<br />");
print ("$row[imprint]" . "<br />");
print ("$row[form]" . "<br />");
print ("$row[color]" . "<br />");
print('</div>');
$prodName = $row['upperName'];
}
mysql_close($linkID);
?>
</div>
My problem comes from trying to style the attributes..
Do you see that /div tag?
I want to style that content within an accordion, but it repeats for each product that shares the same name. So if I put the closing div there, three products with the same name print three closing div tags, which breaks all my HTML.
Is there a way to use a loop or some conditional logic to print the top part once per shared product name, then loop over and print all the different attributes, and only then print my closing HTML?
So if i have 6 products and half are named "keyboard", and the other half are named "shoes"
could i get it to print out
TABLE
product name: KEYBOARD
table for all my attributes
end table for all my attributes
end TABLE
TABLE
product name: SHOES
table for all my attributes
end table for all my attributes
end TABLE
That way i can style all my attributes.
I'm really sorry if this isn't correctly explained I'm still learning.
Any help is appreciated!
Extra detail you may not need to figure out my problem: here is an example of a table I'm printing to, which shows why the way the attributes print is a problem. The values 1 through 9 are data for the attributes.
<table cellspacing="0" cellpadding="0" border="0" width="649">
<tr>
<td width="180" height="10" valign="top"></td><td width="36" height="10" valign="top"></td><td width="433" height="10" valign="top"></td>
</tr>
<tr>
<td width="180" valign="top"><img src="images/slide1.gif" width="180" height="200" border="0" /></td>
<td width="36" valign="top"></td>
<td width="433" valign="top"><table cellspacing="0" cellpadding="0" border="0" width="433">
<tr>
<td valign="top"><span class="in-table-head">Name:</span></td>
<td valign="top"><span class="in-table-name">$row[name]</span></td>
</tr>
<tr>
<td valign="top"><span class="in-table-head">Therapeutic Category:</span></td>
<td valign="top"><span class="in-table-name">$row[therapeutic]</span></td>
</tr>
<tr>
<td colspan="2" valign="top"><span class="in-table-providers">Information for Providers</span></td>
</tr>
</table></td>
</tr></table>
<table cellspacing="0" cellpadding="0" border="0" width="649">
<tr>
<td><span class="in-table-att">NDC#:</span></td>
<td><span class="in-table-att">Strength:</span></td>
<td><span class="in-table-att">Size:</span></td>
<td><span class="in-table-att">Imprint:</span></td>
<td><span class="in-table-att">Form:</span></td>
<td><span class="in-table-att">Color:</span></td>
<td><span class="in-table-att">Shape:</span></td>
<td><span class="in-table-att">Pack Size:</span></td>
<td><span class="in-table-att">Rating:</span></td>
</tr>
</table>
<table cellspacing="0" cellpadding="0" border="0" width="649">
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
</table>
Thank you again to anyone who looks at this problem!
Stop matching by name and start using SKUs as the selector. That way you're always spot on, no matter how many keyboards you have.
Set up your pages to pass the SKU instead of a generic name.
Otherwise you need to add more criteria to your WHERE clause, such as dimensions, weight, color, or price, to get a reliable result.
-- Update --
An example.
page.php?product_type=keyboard
<?php
// set up database
$db = mysqli_connect("localhost", "user", "pass", "database_name");
// Escape the input with the same mysqli connection; a prepared statement would be safer still
$producttype = $db->real_escape_string($_GET['product_type']);
$result = $db->query("SELECT * FROM `products` WHERE `products`.`type`='$producttype'");
// Start with one empty list per feature
$features = ["width" => [], "height" => [], "color" => [], "whatever" => []];
while ($row = $result->fetch_assoc())
{
    $features["width"][] = $row["width"];
    $features["height"][] = $row["height"];
    $features["color"][] = $row["color"];
    $features["whatever"][] = $row["whatever"];
}
// printing the features
echo "You have selected $producttype<BR>";
echo "The following widths can be selected<BR><UL>";
foreach($features["width"] as $width)
    echo "<LI>$width</LI>";
echo "</UL><P>The following heights can be selected<BR><UL>";
foreach($features["height"] as $height)
    echo "<LI>$height</LI>";
echo "</UL><P>The following colors can be selected<BR><UL>";
foreach($features["color"] as $color)
    echo "<LI>$color</LI>";
echo "</UL><P>The following whatevers can be selected<BR><UL>";
foreach($features["whatever"] as $whatever)
    echo "<LI>$whatever</LI>";
echo "</UL>";
echo "have a nice day";
?>
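For the question as originally asked (print the wrapper once per shared name, then every variant, then close the wrapper), one possible approach is to group the rows by name before printing anything. The sketch below uses a hard-coded $rows array in place of the query result; the "product" CSS class and the sample data are hypothetical.

```php
<?php
// Hypothetical rows standing in for the query result; column names follow the question.
$rows = [
    ['product' => 'keyboard', 'size' => '25 inches', 'color' => 'red'],
    ['product' => 'Keyboard', 'size' => '23 inches', 'color' => 'blue'],
    ['product' => 'shoes',    'size' => '10',        'color' => 'black'],
];

// First pass: group the rows under their upper-cased product name.
$groups = [];
foreach ($rows as $row) {
    $groups[strtoupper($row['product'])][] = $row;
}

// Second pass: one wrapper div per name, opened and closed exactly once.
$out = '';
foreach ($groups as $name => $items) {
    $out .= '<div class="product">';
    $out .= '<h1>' . htmlspecialchars($name) . '</h1>';
    foreach ($items as $item) {
        $out .= htmlspecialchars($item['size']) . '<br />';
        $out .= htmlspecialchars($item['color']) . '<br />';
    }
    $out .= '</div>';
}

echo $out;
```

Because the opening and closing tags sit in the same outer loop iteration, they cannot get out of balance however many variants share a name.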

PHP Dynamically returning multidimensional arrays

I have been doing a lot of work in learning how to return multidimensional arrays dynamically- but what I can't seem to figure out is how to nest them.
I have two tables, each has the identical format: ID, name.
Table one: SSC
- sscid
- sscname
Table two: SRV
- srvid
- srvname
What I am trying to do is print all of the items in table two under EACH item in the table one list.
The table one items are the headers, the table two items are returned as a checkbox (with the srvid as the value) and label(srvname).
I can get it all to print together, but it is one giant list of results, and it's in a
| checkbox | table 1: name | table 2: name | format.
Not pretty at all (although it is progress for me to get this far).
After I run my query and get the result, my code looks like this:
I've also had second thoughts about the database design, but everything I've read indicates that these really need to be separate tables that can be referenced by each other's keys (eventually ending up in a join table with user ID references). Because they are numerically indexed, I don't see why this would be an issue; however, I simply can't get it to work properly.
I should mention that when I alter the code to make the ssc_name span 2 columns and look more like a header, it returns a header row for each checkbox/srv row instead of one for all of the checkbox/srv rows.
if($result) {
echo '<table border="1" align="center" cellspacing="3" cellpadding="3" width="300">
<tr><th colspan="2"><h3>Options</h3></th></tr>
<tr><td></td><td align="left"><b>Services</b></td></tr>';
$numfields = mysql_num_fields($result);
$data = array();
$flist = array();
for($i=0;$i<$numfields;$i++)$flist[] = mysql_field_name($result,$i);
$data[0] = $flist;
while($row = mysql_fetch_assoc($result)) {
$data[] = $row;
echo '<tr><td colspan="2" align="center"><b>' . $row['ssc_name'] .'</b><td></tr>
<tr><td align="center"><input type="checkbox" value="'. $row['ssv_id'] .'" / </td>
<td align="left">' . $row['ssvname'] . '</td>
</tr>';
}
echo '</table>';
}
Can anyone help me figure this out, please?
You are missing a > on this line:
<tr><td align="center"><input type="checkbox" value="'. $row['ssv_id'] .'" / </td>
Should be
<tr><td align="center"><input type="checkbox" value="'. $row['ssv_id'] .'" /></td>
Something you might be able to pick up on with better formatting...
<?php
if ($result) {
echo <<<EOD
<table border="1" align="center" cellspacing="3" cellpadding="3" width="300">
<tr>
<th colspan="2"><h3>Options</h3></th>
</tr>
<tr>
<td></td>
<td align="left"><b>Services</b></td>
</tr>
EOD;
while ($row = mysql_fetch_assoc($result)) {
echo <<<EOD
<tr>
<td colspan="2" align="center"><b>{$row['ssc_name']}</b></td>
</tr>
<tr>
<td align="center"><input type="checkbox" value="{$row['ssv_id']}" /></td>
<td align="left">{$row['ssvname']}</td>
</tr>
EOD;
}
echo '</table>';
}
?>
And I'm not really sure what business any of those arrays have being in there.
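As for the repeated header row, one possible fix is to remember the last ssc_name printed and emit the header only when it changes. This sketch assumes the rows arrive ordered by ssc_name; the $rows array and its values are hypothetical stand-ins for the query result.

```php
<?php
// Hypothetical joined rows, assumed to be ordered by ssc_name.
$rows = [
    ['ssc_name' => 'Cleaning', 'ssv_id' => 1, 'ssvname' => 'Windows'],
    ['ssc_name' => 'Cleaning', 'ssv_id' => 2, 'ssvname' => 'Carpets'],
    ['ssc_name' => 'Repairs',  'ssv_id' => 3, 'ssvname' => 'Plumbing'],
];

$out = '<table>';
$current = null; // the header printed most recently
foreach ($rows as $row) {
    // Print the header row only when the category changes.
    if ($row['ssc_name'] !== $current) {
        $current = $row['ssc_name'];
        $out .= '<tr><th colspan="2">' . htmlspecialchars($current) . '</th></tr>';
    }
    $out .= '<tr><td><input type="checkbox" value="' . (int) $row['ssv_id'] . '" /></td>'
          . '<td>' . htmlspecialchars($row['ssvname']) . '</td></tr>';
}
$out .= '</table>';

echo $out;
```

If the query cannot guarantee that ordering, grouping the rows into an array keyed by ssc_name first would give the same result.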

php regex or html dom parsing

I use regex for HTML parsing but I need your help to parse the following table:
<table class="resultstable" width="100%" align="center">
<tr>
<th width="10">#</th>
<th width="10"></th>
<th width="100">External Volume</th>
</tr>
<tr class='odd'>
<td align="center">1</td>
<td align="left">
http://xyz.com
</td>
<td align="right">210,779,783<br />(939,265 / 499,584)</td>
</tr>
<tr class='even'>
<td align="center">2</td>
<td align="left">
http://abc.com
</td>
<td align="right">57,450,834<br />(288,915 / 62,935)</td>
</tr>
</table>
I want to get all domains with their volume(in array or var) for example
http://xyz.com - 210,779,783
Should I use regex or an HTML DOM parser in this case? I don't know how to parse a large table. Can you please help? Thanks.
Here's an XPath example that parses the HTML from the question.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile("./input.html");
$xpath = new DOMXPath($dom);
$trs = $xpath->query("//table[@class='resultstable'][1]/tr");
foreach ($trs as $tr) {
    $tdList = $xpath->query("td[2]/a", $tr);
    if ($tdList->length == 0) continue;
    $name = $tdList->item(0)->nodeValue;
    $tdList = $xpath->query("td[3]", $tr);
    $vol = $tdList->item(0)->childNodes->item(0)->nodeValue;
    echo "name: {$name}, vol: {$vol}\n";
}
?>
