Simple DOM html parser read html table - php

I am trying to read specific values from this HTML table with a PHP DOM parser. I want my code to read only the "td width" tags and output only those items from the table, so that the result looks like this:
" WAITLIST, 91630, ACCY 2001, 10, Intro Financial Accounting, 3.00, Zou, Y, Duques 251, 9:35AM-10:50AM, 01/13/14-04/28/14 "
Here is the HTML table:
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow1Font">
<td width="7%">WAITLIST</td>
<td width="5%">91630</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">10</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank" >DUQUES</a> 251</td>
<td width="13%">TR<br>09:35AM - 10:50AM</td>
<td width="14%">
01/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
Here is my php code which grabs the whole table, some elements of which I don't want in my output, and repeats the output multiple times:
// Retrieve the DOM from a given URL
$html = file_get_html('testdata.html');
foreach($html->find('table') as $e){
foreach($html->find('td') as $f){
echo $f->innertext . '<br>';
}
}
How can I change my code to only grab and output these elements:
"WAITLIST, 91630, ACCY 2001, 10, Intro Financial Accounting, 3.00, Zou, Y, Duques 251, 9:35AM-10:50AM, 01/13/14-04/28/14"

// Retrieve the DOM from a given URL
$html = file_get_html('testdata.html');
foreach($html->find('table') as $e){
foreach($e->find('td') as $f){
echo strip_tags($f->innertext) . '<br>';
}
}
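You were pretty close already: the inner loop has to call find() on $e rather than on $html, otherwise every td in the document is printed once per table.
I had forgotten about the <a> tag inside one of the cells. See if strip_tags works for you.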
http://us3.php.net/strip_tags

Related

Selective extraction of data from external site using DOM PHP web crawler

I have this PHP DOM web crawler which works fine. It extracts the mentioned tag, along with its link, from an (external) forum site to my page.
But recently I ran into a problem.
This is the HTML of the forum data:
<tbody>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">Hispanic Study Partner - dreamer1984</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">nbme - monariyadh</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">0</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
Now suppose the above code (table data) is the only content available on that site, and I try to extract it with a web crawler like this:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach($html->find('td.FootNotes2') as $element) {
echo $element;
}
?>
It extracts all the data inside the td tags that have the class name "FootNotes2".
Now what if I want to extract specific data from a tag, for example just the names "dreamer1984" and "monariyadh" from the first td of each row?
And what if I want to extract data from only the 3rd such td (skipping the rest), when they all share the same class name?
Please note that I can use a regex like
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matchs);
foreach ($matchs['name'] as $k => $v){
var_dump('name: '. $v, 'relative date: '. $matchs['date'][$k]);
}
But i prefer to find solution for this in DOM parser...
Any help is appreciated..
As I said in my comment some text processing is unavoidable, however you can get the text element associated with the td like so :
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find("tr") as $row) {
$element = $row->find('td.FootNotes2',0);
if ($element == null) { continue; }
$textNode = array_filter($element->nodes, function ($n) {
return $n->nodetype == 3; //Text node type, like in jQuery
});
if (!empty($textNode)) {
$text = current($textNode);
echo $text;
}
}
This echoes:
- dreamer1984
- monariyadh
Do with that what you will.
Updated to only find the first td for each tr.
If you want to extract only the text (not the child tags and their contents):
foreach ($html->find("td.FootNotes2") as $element) {
$children = $element->children; // get an array of children
foreach ($children AS $child) {
$child->outertext = ''; // This removes the element, but MAY NOT remove it from the original $myDiv
}
echo $element->innertext."<br>";
}
Output:
- dreamer1984
02/28/17 01:42
0
200
- monariyadh
02/27/17 23:12
0
108
You have to use regex either way so no sense overcomplicating it:
foreach($html->find('tr') as $tr) {
echo preg_replace('/.* - /', '', $tr->find('td',1)->text()) . "\n";
echo $tr->find('td',3)->text() . "\n";
}
I really don't like apokryfos' approach to this, it's a lot of confusion with no benefit.

How Can I get values into html tags using Simple HTML Dom?

I am scraping some values from a table, but I have a problem getting the values of the data-odd attribute on the td tags (the ones with the classes "odds" and "odds best-betrate").
I will post code:
<tr class="first-row">
<td class="first-cell tl">
Kortrijk - St. Truiden
</td>
<td class="result">
3:0
</td>
<td class="odds best-betrate" **data-odd="1.72"**></td>
<td class="odds" **data-odd="3.61"**></td>
<td class="odds" **data-odd="4.76"**></td>
<td class="last-cell nobr date">20.02.2016</td>
</tr>
<tr class="strong">
<td class="first-cell tl">
Lokeren - Genk
</td>
<td class="result">
0:0
</td>
<td class="odds" **data-odd="3.11"**></td>
<td class="odds best-betrate" **data-odd="3.31"**></td>
<td class="odds" **data-odd="2.25"**></td>
<td class="last-cell nobr date">20.02.2016</td>
</tr>
I know how to get the values between tags, and I got them using Simple HTML Dom, but I really don't know how to get the "data-odd" values. In my code you can see, in bold, the values that I want to get.
Thanks :)
EDIT: Now I get this result (it was shown in a screenshot).
I want those values together with the other values, for example:
21.02.2016
Waasland-Beveren - Anderlecht 1:0 5.96 4.20 1.51
21.02.2016
Waregem - KV Mechelen 2:3 1.83 3.71 3.98
Thanks Again!
EDIT2:
This is my code:
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://www.betexplorer.com/soccer/belgium/jupiler-league/results/');
foreach($html->find('td') as $e) {
echo $e->innertext . '<br>';
}
foreach( $html->find('td[data-odd]') as $td )
{
echo $td->attr['data-odd'].PHP_EOL;
}
?>
As mentioned in the comments, data-odd is an attribute of the node, so to retrieve its value with simple_html_dom you have to use this syntax:
foreach( $html->find('td[data-odd]') as $td )
{
echo $td->attr['data-odd'].PHP_EOL;
}

Convert HTML table to CSV via PHP

I am trying to pull each td element from the html table below and import each element into its own cell in a CSV file.
Here are the two html tables:
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow1Font">
<td width="7%">WAITLIST</td>
<td width="5%">91630</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">10</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank"
>DUQUES</a> 251</td>
<td width="13%">TR<br>09:35AM - 10:50AM</td>
<td width="14%">
01/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow2Font">
<td width="7%">WAITLIST</td>
<td width="5%">90003</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">11</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank"
>DUQUES</a> 254</td>
<td width="13%">TR<br>11:10AM - 12:25PM</td>
<td width="14%">
1/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
I have written code that goes through the tables and pulls the td elements:
foreach($html->find('tr[align=center] td') as $e)
$str .= strip_tags($e->innertext) . ', ';
echo $str;
So how can I extract these elements into a CSV file? In Excel I want it to look like this with each td element in its own cell, starting a new row for each html table:
WAITLIST 91630 ACCY 2001 10 Intro Financial Accounting 3.00 Zou, Y DUQUES 251 TR
WAITLIST 90003 ACCY 2001 11 Intro Financial Accounting 3.00 Zou, Y DUQUES 251 TR
A library exists for this. Go to http://phpexcel.codeplex.com/ and download the zip file; among the examples you will find 17html.php. Try that code. I hope this helps.
CSV means Comma Separated Values. Thus, as you echo out the data (after running it through your function to strip the <td> tags), put a comma between each piece of data (cell), and a new line where you want the next row to start. A value that itself contains a comma, such as "Zou, Y", has to be wrapped in double quotes so that it stays in one cell.
So to use your example above, it should look like this:
WAITLIST,91630,ACCY 2001,10,Intro Financial Accounting,3.00,"Zou, Y",DUQUES 251,TR
WAITLIST,90003,ACCY 2001,11,Intro Financial Accounting,3.00,"Zou, Y",DUQUES 251,TR
Keep in mind that when you echo this, you shouldn't have any other html tags or anything.
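Quoting by hand is easy to get wrong, so it is usually simpler to let a CSV writer do it (in PHP, fputcsv does this job). As a rough sketch of the idea, here is the same thing with Python's standard csv module. The rows are the cell texts from the two tables in the question, typed in by hand; the function name rows_to_csv is made up for this example.

```python
import csv
import io

# Cell text from the two tables in the question, one list per table row.
rows = [
    ["WAITLIST", "91630", "ACCY 2001", "10", "Intro Financial Accounting",
     "3.00", "Zou, Y", "DUQUES 251", "TR"],
    ["WAITLIST", "90003", "ACCY 2001", "11", "Intro Financial Accounting",
     "3.00", "Zou, Y", "DUQUES 254", "TR"],
]


def rows_to_csv(rows):
    """Return the rows as CSV text; the writer quotes fields that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Only the "Zou, Y" field ends up quoted; every other field is written as-is, and reading the text back with csv.reader gives the original cells.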

PHP mySQL - How do i print only different attributes of two things that share a common name?

I'm trying to print out a list of products on a page. This is cake so far.
However, some of my items share a name but have different attributes.
Example would be like...:
product:
keyboard
keyboard
size:
25 inches
23 inches
color:
red
blue
My table looks something like this:
id, product, size, color, so on...
The .php file where I run my query and print the results looks something like this:
<div id="accordion">
<?php
$letter = $_GET['letter'];
//echo "$id";
include("database.php");
$result = mysql_query("SELECT UPPER(product) AS upperName, PRODUCTS.* FROM PRODUCTS WHERE product LIKE '$letter%' ORDER BY UPPER(product) ");
$prodName = "";
while($row = mysql_fetch_array($result))
{
if ($row['upperName'] != $prodName)
{
print('<div style="background-color:#666; padding-bottom:25px; margin-bottom:25px;">');
print ("<h1>" . "$row[product]" . "</h1>");
}
print ("$row[ndc]" . "<br />");
print ("$row[size]" . "<br />");
print ("$row[strength]" . "<br />");
print ("$row[imprint]" . "<br />");
print ("$row[form]" . "<br />");
print ("$row[color]" . "<br />");
print('</div>');
$prodName = $row['upperName'];
}
mysql_close($linkID);
?>
</div>
My problem comes from trying to style the attributes..
Do you see that /div tag?
I want to style that stuff within an accordion, but it repeats for each product that shares the same name in that accordion. So if I include the closing div there and three products share a name, it prints three closing div tags (hence breaking all my html nooo!!)
Is there a way to loop, or to use some kind of conditional logic, so that the top part is printed once for the product name the products share, then all the different attributes are printed in the loop, and only after the last attribute the closing html is printed?
So if i have 6 products and half are named "keyboard", and the other half are named "shoes"
could i get it to print out
TABLE
product name: KEYBOARD
table for all my attributes
end table for all my attributes
end TABLE
TABLE
product name: SHOES
table for all my attributes
end table for all my attributes
end TABLE
That way i can style all my attributes.
I'm really sorry if this isn't correctly explained I'm still learning.
Any help is appreciated!
Extra stuff that you may not need to figure out my problem just an example of a table i'm printing to and why the way the attributes print is a problem. the 1-2-3-4-5-6-7-8-9 are data for the attributes.
<table cellspacing="0" cellpadding="0" border="0" width="649">
<tr>
<td width="180" height="10" valign="top"></td><td width="36" height="10" valign="top"></td><td width="433" height="10" valign="top"></td>
</tr>
<tr>
<td width="180" valign="top"><img src="images/slide1.gif" width="180" height="200" border="0" /></td>
<td width="36" valign="top"></td>
<td width="433" valign="top"><table cellspacing="0" cellpadding="0" border="0" width="433">
<tr>
<td valign="top"><span class="in-table-head">Name:</span></td>
<td valign="top"><span class="in-table-name">$row[name]</span></td>
</tr>
<tr>
<td valign="top"><span class="in-table-head">Therapeutic Category:</span></td>
<td valign="top"><span class="in-table-name">$row[therapeutic]</span></td>
</tr>
<tr>
<td colspan="2" valign="top"><span class="in-table-providers">Information for Providers</span></td>
</tr>
</table></td>
</tr></table>
<table cellspacing="0" cellpadding="0" border="0" width="649">
<tr>
<td><span class="in-table-att">NDC#:</span></td>
<td><span class="in-table-att">Strength:</span></td>
<td><span class="in-table-att">Size:</span></td>
<td><span class="in-table-att">Imprint:</span></td>
<td><span class="in-table-att">Form:</span></td>
<td><span class="in-table-att">Color:</span></td>
<td><span class="in-table-att">Shape:</span></td>
<td><span class="in-table-att">Pack Size:</span></td>
<td><span class="in-table-att">Rating:</span></td>
</tr>
</table>
<table cellspacing="0" cellpadding="0" border="0" width="649">
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
</table>
Thank you again to anyone who looks at this problem!
Stop matching by name and start using SKUs as the selector. That way you're always spot on, no matter how many keyboards you have.
Set up your pages to pass the SKU instead of the generic name.
Otherwise you need to add more criteria to your WHERE clause, like dimensions, weight, color, price etc., to get a reliable result.
-- Update --
An example.
page.php?product_type=keyboard
<?php
// set up database
$db = mysqli_connect("localhost", "user", "pass", "database_name");
// Escape the input through the mysqli connection; a prepared statement would be safer still
$producttype = $db->real_escape_string($_GET['product_type']);
$result = $db->query("SELECT * FROM `products` WHERE `products`.`type`='$producttype'");
// one list per feature, filled while looping over the result rows
$features = array("width" => array(), "height" => array(), "color" => array(), "whatever" => array());
while ($row = $result->fetch_assoc())
{
$features["width"][] = $row["width"];
$features["height"][] = $row["height"];
$features["color"][] = $row["color"];
$features["whatever"][] = $row["whatever"];
}
// printing the features
echo "You have selected $producttype<BR>";
echo "The following widths can be selected<BR><UL>";
foreach($features["width"] as $width)
echo "<LI>$width</LI>";
echo "</UL><P>The following heights can be selected<BR><UL>";
foreach($features["height"] as $height)
echo "<LI>$height</LI>";
echo "</UL><P>The following colors can be selected<BR><UL>";
foreach($features["color"] as $color)
echo "<LI>$color</LI>";
echo "</UL><P>The following whatevers can be selected<BR><UL>";
foreach($features["whatever"] as $whatever)
echo "<LI>$whatever</LI>";
echo "</UL>";
echo "have a nice day";
?>
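For the part of the question about opening and closing the wrapper only once per product name, the usual pattern is to group the sorted rows by name. The wrapper opens when a new name starts and closes when that group ends. In the PHP loop this would mean printing the closing </div> only when the name changes, and once more after the loop. A language-neutral sketch of that logic, written in Python with made-up rows, a made-up render function and no HTML escaping:

```python
from itertools import groupby

# Hypothetical rows, already sorted by name as ORDER BY UPPER(product) would return them.
rows = [
    {"product": "keyboard", "size": "25 inches", "color": "red"},
    {"product": "keyboard", "size": "23 inches", "color": "blue"},
    {"product": "shoes", "size": "10", "color": "black"},
]


def render(rows):
    """Emit one wrapper div per product name, with one line per variant inside it."""
    parts = []
    for name, variants in groupby(rows, key=lambda r: r["product"].upper()):
        parts.append('<div class="product">')  # opened once per name
        parts.append("<h1>%s</h1>" % name)
        for v in variants:
            parts.append("%s, %s<br />" % (v["size"], v["color"]))
        parts.append("</div>")  # closed once per name
    return "\n".join(parts)


html = render(rows)
print(html)
```

groupby only merges neighbouring rows, so this relies on the query's ORDER BY keeping products with the same name next to each other.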

Extracting table cell text contents with xpath in rows for consumption?

I have something along the following lines in terms of HTML. I would like to extract the various contents of the table cells, however I discovered that there are some embedded divs occasionally in the cells and perhaps other oddities that I'm not sure of yet:
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
<blockquote>
<p class="textstyle">
Text.
</p>
</blockquote>
My first impulse was to extract ALL element texts and just programmatically slice it up. I would watch for Title1, Title2, etc. to know when a row starts and then if a "----" is found meaning no value, just skip this row and move on. However, I realized that there is probably a better way of handling this with xpath directly.
How could this be solved with xpath so as to essentially give each cell's final child text content vs having to walk into each div if it exists? Or is there a more xpath like way to approach this?
Obviously I'm attempting to have the most flexible solution that will not be brittle if other unexpected elements crop up, even though they are unlikely.
The provided text isn't a well-formed XML document, therefore XPath isn't applicable to it as it stands.
If you correct it and convert it to a well-formed XML document like the one below, an expression like this might be useful:
/*/TABLE//TD//text()
or even:
//TABLE//TD//text()
Here is a well-formed XML document, constructed from the provided HTML:
<html>
<p align="center">
<img src="some_image.gif" alt="Some Title"/>
</p>
<TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD colspan="4" ALIGN="center">
<b>Title</b>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title</TD>
<TD ALIGN="center">date</TD>
<TD ALIGN="center">value</TD>
<TD ALIGN="center">value</TD>
</TR>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center">
<div class="redtext">----</div>
</TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center">
<div class="yellowtext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD ALIGN="center">value
<SUP>6</SUP>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title4</TD>
<TD ALIGN="center">
<div class="bluetext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD> </TD>
</TR>
</TABLE>
<blockquote>
<p class="textstyle"> Text. </p>
</blockquote>
</html>
So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:
import re
from io import StringIO
from lxml import etree
def getTable(html, table_xpath, rows_xpath, cells_xpath):
    """Get a table on a webpage"""
    parser = etree.HTMLParser()
    # Build document tree and get table
    root = etree.parse(StringIO(html), parser)
    table = root.find(table_xpath)
    if table is None:
        print('No table.')
        return []
    rows = table.findall(rows_xpath)
    document = []
    def cleanText(text):
        """Clean up text by replacing line breaks and tabs. """
        return re.sub(r'[\r\n\t]+', '', str(text).strip())
    # iterate over the table rows and collect text from each cell.
    for r in rows:
        cells = r.findall(cells_xpath)
        rowdata = []
        for c in cells:
            text = ''
            it = c.itertext()
            for i in it:
                text += cleanText(i) + ' '
            rowdata.append(text)
        document.append(rowdata)
    return document
html = """
<html><head><title></title></head><body>
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
</body>
</html>
"""
tp = "//table[@width='500']"
rt = "tr"
cp = "td[@align='center']"
doc = getTable(html, tp, rt, cp)
print(repr(doc))
I believe that your program is going to run into many problems as the input data is manipulated -- what if the case of 'title' changes, or there is a typo?
It's not really possible to make a rigorous solution to scraping someone else's website, as they can at no notice completely change everything. Better is normally to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr', then inside this loop, process the td elements:
import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")
stringify = lambda x: "".join(x.xpath(".//text()"))
for x in tree.xpath("//table/tr"):
    print("New row")
    for y in x.xpath("td"):
        print(stringify(y))
Output:
New row
test
New row
test2
The following code will, however, get the list you ask for:
print(list(map(stringify, tree.xpath("//table/tr/td"))))
Output:
['test', 'test2']
This will find all text elements which are at all descended from a td which is a direct descendant of a tr which is in turn a direct descendant of a table.
(Simply asking for all text() elements will create some funny bugs when run on HTML which contains "<td>Foo <b>bar</b></td>" or similar.)
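To make that last point concrete, here is a small sketch of the difference between joining the text per cell and collecting every text node in one flat list. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The fragment is a made-up example built around the "Foo <b>bar</b>" case.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment containing the problem case from the text.
doc = ET.fromstring(
    "<table>"
    "<tr><td>Foo <b>bar</b></td><td><div>test2</div></td></tr>"
    "</table>"
)

# Per cell: join every text fragment below each td into one string.
cells = ["".join(td.itertext()) for td in doc.iter("td")]

# Flat: every text fragment on its own, so the cell boundaries are lost.
fragments = [text for td in doc.iter("td") for text in td.itertext()]

print(cells)      # ['Foo bar', 'test2']
print(fragments)  # ['Foo ', 'bar', 'test2']
```

The flat list has three entries for two cells, which is exactly the kind of off-by-one that makes row slicing unreliable; joining per td keeps one string per cell.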
