Convert HTML table to CSV via PHP - php

I am trying to pull each td element from the html table below and import each element into its own cell in a CSV file.
Here are the two html tables:
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow1Font">
<td width="7%">WAITLIST</td>
<td width="5%">91630</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">10</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank"
>DUQUES</a> 251</td>
<td width="13%">TR<br>09:35AM - 10:50AM</td>
<td width="14%">
01/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow2Font">
<td width="7%">WAITLIST</td>
<td width="5%">90003</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">11</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank"
>DUQUES</a> 254</td>
<td width="13%">TR<br>11:10AM - 12:25PM</td>
<td width="14%">
1/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
I have written code that goes through the tables and pulls the td elements:
foreach($html->find('tr[align=center] td') as $e)
$str .= strip_tags($e->innertext) . ', ';
echo $str;
So how can I extract these elements into a CSV file? In Excel I want it to look like this with each td element in its own cell, starting a new row for each html table:
WAITLIST 91630 ACCY 2001 10 Intro Financial Accounting 3.00 Zou, Y DUQUES 251 TR
WAITLIST 90003 ACCY 2001 11 Intro Financial Accounting 3.00 Zou, Y DUQUES 251 TR

There is a library exist for this. Goto http://phpexcel.codeplex.com/. Download the zip file and in example you would find 17html.php try this code. I hope this will help.

CSV means Comma Separated Values. Thus, as you echo out the data (after running it through your function to strip the <td> tags), put commas in between each piece of data (cell), and a new line where you want the next line to start.
So to use your example above, it should look like this:
WAITLIST,91630,ACCY,2001,10,Intro Financial Accounting,3.00,Zou,Y,DUQUES,251,TR
WAITLIST,90003,ACCY,2001,11,Intro Financial Accounting,3.00,Zou,Y,DUQUES,2,
Keep in mind that when you echo this, you shouldn't have any other html tags or anything.

Related

Pull mysql data into html email with php

Hello so basically I have an invoice email that gets sent out every night containing order information from todays orders. I am in the process of trying to automate this email with php and the mysql data but I am not quite sure what to do. Just an FYI I am new to html. I am used to bash and or python.
Below is a sample of the email html file and where I am trying to put order information
<!-- INSERT HERE Product quantity for Kit 1 Blue Dot -->
1</td>
<td width="30" class="wz2">
</td>
</tr>
<tr>
<td colspan="3" height="20" style="font-size:0;line-height:1;" class="va2">
</td>
</tr>
</table>
</th>
<th width="139" class="stack3" data-border-left-color="borderColor" data-
border-bottom-color="borderColor" style="border-left:1px solid
#dde5f1;border-bottom:1px solid #dde5f1;margin:0; padding:0;">
<table width="139" align="center" cellpadding="0" cellspacing="0" border="0"
class="table60033">
<tr>
<td colspan="3" height="20" style="font-size:0;line-height:1;" class="va2">
</td>
</tr>
<tr>
<td width="30" class="wz2">
</td>
<td class="RegularText5TD" data-link-style="text-decoration:none;
color:#67bffd;" data-link-color="RegularLink" data-color="RegularTXT"
style="color: #425065;font-family: sans-serif;font-size: 14px;font-weight:
lighter;text-align: center;line-height: 23px;">
<a href="#" target="_blank" data-color="RegularLink" style="text-decoration:
none;color: #67bffd;">
</a>
<!--INSERT HERE Total for Kit 1 Blue Dot-->
$22.00</td>
<td width="30" class="wz2">
</td>
</tr>
<tr>
<td colspan="3" height="20" style="font-size:0;line-height:1;" class="va2">
</td>
</tr>
What I would like to do is via php run the mysql query to populate and get the quantity of the product such as 'blue dot' shown here. Then take that quantity and multiply it by the price to get the total cost for the product. I got my queries and know what to run via php and grab the data. I just do not know how to get the data into this dynamic email template. I use phpmailer to mail this template. Any help would be great!

Simple DOM html parser read html table

I am trying to read specific values of this HTML table via a php dom parser. I want my code to only read the "td width" tags and output only these items from the table and look like this:
" WAITLIST, 91630, ACCY 2001, 10, Intro Financial Accounting, 3.00, Zou, Y, Duques 251, 9:35AM-10:50AM, 01/13/14-04/28/14 "
Here is the HTML table:
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow1Font">
<td width="7%">WAITLIST</td>
<td width="5%">91630</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">10</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank" >DUQUES</a> 251</td>
<td width="13%">TR<br>09:35AM - 10:50AM</td>
<td width="14%">
01/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table
Here is my php code which grabs the whole table, some elements of which I don't want in my output, and repeats the output multiple times:
// Retrieve the DOM from a given URL
$html = file_get_html('testdata.html');
foreach($html->find('table') as $e){
foreach($html->find('td') as $f){
echo $f->innertext . '<br>';
}
}
How can I change my code to only grab and output these elements:
"WAITLIST, 91630, ACCY 2001, 10, Intro Financial Accounting, 3.00, Zou, Y, Duques 251, 9:35AM-10:50AM, 01/13/14-04/28/14"
// Retrieve the DOM from a given URL
$html = file_get_html('testdata.html');
foreach($html->find('table') as $e){
foreach($e->find('td') as $f){
echo strip_tags($f->innertext) . '<br>';
}
}
You were pretty close already...
Forgot about the tag. See if strip_tags works for you.
http://us3.php.net/strip_tags

Pull data from a webpage to use [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
To give some background, I sell Lego parts online. The order total when you place the order is based on the price of the parts you purchased, and the shipping costs.
Shipping costs vary depending on the weight of the order, and the country of shipment.
I am not a techie buff, and thats why I need some help. I know the basics, but not much else, though I'd love to learn and I've been trying around with this for days before coming here.
The source code of an order page, the only place where you can see the weight is this:
<FONT CLASS="fv">Estimated Weight of Order:</FONT></TD><TD ALIGN="RIGHT"><FONT CLASS="fv">2.17oz 61.44g</FONT>
It is the same for every single order.
So, I know where the data I want is.
What I need help with is, coding something that pulls the data out of this webpage (say it's inside a webpage called order.com/order.asp and the document contains a bunch of other data apart from the weight) and exporting a shipping price based on the weight it inputed. I don't know whether you can do this with PHP or Python, etc.
I would have on my server a... say a table with the shipping costs based on weight. Now, what I needed, would be to take that bit of data from the order.com website into my own server. (On my own server process the weight data that I took, match it with the shipping cost, pull out invoices, etc). The weight data is in the order page, always on a line like the one I posted on the question. I just read about web scraping. Maybe some PHP that looks into the order page till it finds the line with the weight, and pulls out the weight?
Many, many, many thanks for your help, and I apologize in advance if I sound too uninformed, which I am. I really need a detailed explanation.
Gerald
*TL;DR*Two webpages. One is in my server and one isn't. The one that isn't in my server (order.asp), has this line:
<FONT CLASS="fv">Estimated Weight of Order:</FONT></TD><TD ALIGN="RIGHT"><FONT CLASS="fv">XX.XXoz XX.XXg</FONT>
I need something that I can put in my server, queries the weight from the page that isn't on my server (order.asp page) and matches the weight with a shipping price that I would have on my page (as a table or maybe with ifs).
There will be different order pages (order1.asp order2.asp order3.asp) with different weights. The script or whatever should do that for ea. wpage.
Thanks.
This would be the source code of an example page that I would need to take the weight of. Removed some sensitive info.
<SCRIPT TYPE="text/javascript" LANGUAGE="JavaScript">
function killImage(imgName){
if (document.images){
document.images[imgName].src="/images/noImage.gif"
}
}
function killImageM(imgName){
if (document.images){
document.images[imgName].src="/images/noImageM.gif"
}
}
</SCRIPT>
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META HTTP-EQUIV="IMAGETOOLBAR" CONTENT="NO">
<LINK REL="STYLESHEET" TYPE="text/css" HREF="/stylesheet.css?13">
<STYLE TYPE="text/css">body { margin: 15 auto; }</STYLE>
<SCRIPT TYPE="text/javascript" LANGUAGE="javascript" SRC="/js/getAjax.js"></SCRIPT>
<SCRIPT TYPE="text/javascript" LANGUAGE="javascript" SRC="/lytebox/lytebox.js?10"></SCRIPT>
<LINK REL="STYLESHEET" HREF="/lytebox/lytebox.css?13" TYPE="text/css" MEDIA="screen" />
</HEAD>
<BODY BGCOLOR="#666666">
<CENTER>
<TABLE WIDTH="680" CELLPADDING="10" CELLSPACING="0"><TR><TD BGCOLOR="#FFFFFF">
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0"><TR>
<TD><IMG SRC="/images/logowhite.gif" WIDTH="200" HEIGHT="60" ALIGN="ABSMIDDLE" BORDER="0"> </TD>
<TD> <FONT SIZE="+3">Order #3953198</FONT></TD></TR></TABLE><P><FONT FACE="Tahoma,Arial" SIZE="2">
<HR NOSHADE SIZE="1" COLOR="#000000"><B>Order Summary</B><HR NOSHADE SIZE="1" COLOR="#000000">
<TABLE WIDTH="100%" CELLPADDING="5" CELLSPACING="0" BORDER="0" BGCOLOR="#EEEEEE"><TR><TD WIDTH="60%" VALIGN="TOP">
<TABLE WIDTH="100%" BORDER="0" CELLPADDING="1" CELLSPACING="0" CLASS="ta">
<TR>
<TD WIDTH="125">Order Date:</TD>
<TD>Nov 20, 2013 17:12</TD>
</TR>
<TR>
<TD>Payment By:</TD>
<TD>PayPal.com</TD>
</TR>
<TR>
<TD>Payment In:</TD>
<TD>Euro</TD>
</TR>
<TR VALIGN="TOP">
<TD>Order Status:</TD>
<TD>Shipped</TD>
</TR>
<TR>
<TD>Changed:</TD>
<TD>Nov 22, 2013 14:15</TD>
</TR>
<TR>
<TD NOWRAP>Total Items:</TD>
<TD>24</TD>
</TR>
<TR>
<TD NOWRAP>Unique Items (Lots):</TD>
<TD>2</TD>
</TR>
<TR>
<TD NOWRAP>Invoiced:</TD>
<TD>Nov 21, 2013 08:56</TD>
</TR>
<TR VALIGN="TOP">
<TD NOWRAP>Shipping Method:</TD>
<TD>Registered<BR><FONT CLASS="fv">By default, with tracking number and insured up to 30 euros only.</FONT></TD>
</TR>
</TABLE>
</TD><TD WIDTH="40%" VALIGN="TOP">
<TABLE WIDTH="100%" BORDER="0" CELLPADDING="1" CELLSPACING="0" CLASS="ta">
<TR>
<TD>Order Total:</TD>
<TD ALIGN="RIGHT">EUR 8.92</TD>
</TR>
<TR>
<TD>Shipping:</TD>
<TD ALIGN="RIGHT">EUR 4.85</TD>
</TR>
<TR>
<TD>Insurance:</TD>
<TD ALIGN="RIGHT">EUR 0.00</TD>
</TR>
<TR>
<TD>Additional Charges 1:</TD>
<TD ALIGN="RIGHT">EUR 0.00</TD>
</TR>
<TR>
<TD>Additional Charges 2:</TD>
<TD ALIGN="RIGHT">EUR 0.00</TD>
</TR>
<TR>
<TD>Credit:</TD>
<TD ALIGN="RIGHT">EUR 0.00</TD>
</TR>
<TR>
<TD>Grand Total:</TD>
<TD ALIGN="RIGHT"><B>EUR 13.77</TD>
</TR>
<TR>
<TD>Orders in this Store:</TD>
<TD ALIGN="RIGHT">1</TD>
</TR>
</TABLE>
</TD></TR>
</TABLE><HR NOSHADE SIZE="1" COLOR="#000000"><TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="100%" CLASS="ta"><TR><TD><B>Items in Order</B></TD></TR></TABLE><HR NOSHADE SIZE="1" COLOR="#000000"><TABLE WIDTH="100%" BORDER="0" CELLSPACING="1" CELLPADDING="3" CLASS="ta"><TR BGCOLOR="#C0C0C0"><TD><B>Image</B></TD><TD ALIGN="CENTER"><B>Condition</B></TD><TD><B>Item Description</B></TD><TD ALIGN="RIGHT"><B>Lots</B></TD><TD ALIGN="RIGHT"><B>Qty</B></TD><TD ALIGN="RIGHT"><B>Left</B></TD><TD ALIGN="RIGHT"><B>Price</B></TD><TD ALIGN="RIGHT"><B>Total</B></TD><TD ALIGN="RIGHT"><B>Weight</B></TD></TR><TR><TD COLSPAN="2" BGCOLOR="#C0C0C0"><B>Batch #1</B></TD><TD COLSPAN="7" BGCOLOR="#C0C0C0"><TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="100%"><TR><TD><FONT CLASS="fv">Submitted on Nov 20, 2013 17:12</TD><TD ALIGN="RIGHT"><IMG SRC="/images/printer16.png" WIDTH="16" HEIGHT="16" BORDER="0" ALT="Print Batch" TITLE="Print Batch"><IMG SRC="/images/dot.gif" WIDTH="5" HEIGHT="1"><IMG SRC="/images/invoice16YC.gif" WIDTH="16" HEIGHT="16" ALT="Batch Invoiced" TITLE="Batch Invoiced"></TD></TR></TABLE></TD></TR><TR BGCOLOR="FFFFFF"><TD HEIGHT="60"><CENTER><A ID='imgLink0' HREF='/catalogItemPic.asp?P=60208' REL='blcatimg'><IMG ALT="Lot ID: 48295541 Part No: 60208 Name: Wheel 31mm D. x 15mm Technic" TITLE="Lot ID: 48295541 Part No: 60208 Name: Wheel 31mm D. x 15mm Technic" BORDER='0' WIDTH='80' HEIGHT='60' SRC='http://img.bricklink.com/P/86/60208.gif' NAME='img0' ID='img0' onError="killImage('img0');"></A><BR><FONT FACE='Tahoma,Arial' SIZE='1'>*</FONT></TD><TD ALIGN="CENTER"><B>New</B></TD><TD><SPAN CLASS="u"><FONT COLOR="#000000">Light Bluish Gray Wheel 31mm D. x 15mm Technic </FONT></SPAN><BR><FONT CLASS="fv">AB4</FONT></TD><TD ALIGN="RIGHT"> </TD><TD ALIGN="RIGHT">12</TD><TD ALIGN="RIGHT">X</TD><TD ALIGN="RIGHT">EUR 0.11</TD><TD ALIGN="RIGHT">EUR 1.32</TD><TD ALIGN="RIGHT"><FONT CLASS="fv">38.16g</TD></TR><TR BGCOLOR="EEEEEE"><TD HEIGHT="60"><CENTER><A ID='imgLink1' HREF='/catalogItemPic.asp?P=6179' REL='blcatimg'><IMG ALT="Lot ID: 49014568 Part No: 6179 Name: Tile, Modified 4 x 4 with Studs on Edge" TITLE="Lot ID: 49014568 Part No: 6179 Name: Tile, Modified 4 x 4 with Studs on Edge" BORDER='0' WIDTH='80' HEIGHT='60' SRC='http://img.bricklink.com/P/86/6179.gif' NAME='img1' ID='img1' onError="killImage('img1');"></A><BR><FONT FACE='Tahoma,Arial' SIZE='1'>*</FONT></TD><TD ALIGN="CENTER"><B>New</B></TD><TD><SPAN CLASS="u"><FONT COLOR="#000000">Light Bluish Gray Tile, Modified 4 x 4 with Studs on Edge </FONT></SPAN><BR><FONT CLASS="fv">AJ2</FONT></TD><TD ALIGN="RIGHT"> </TD><TD ALIGN="RIGHT">12</TD><TD ALIGN="RIGHT">X</TD><TD ALIGN="RIGHT">EUR 0.633</TD><TD ALIGN="RIGHT">EUR 7.596</TD><TD ALIGN="RIGHT"><FONT CLASS="fv">23.28g</TD></TR><TR BGCOLOR="#DDDDDD"><TD COLSPAN="3"><B>Batch Total:</B></TD><TD ALIGN="RIGHT">2</TD><TD ALIGN="RIGHT">24</TD><TD></TD><TD> </TD><TD ALIGN="RIGHT">EUR 8.92</TD><TD ALIGN="RIGHT"><FONT CLASS="fv">61.44g</TD></TR><TR BGCOLOR="#C0C0C0"><TD COLSPAN="3"><B>Order Total:</B></TD><TD ALIGN="RIGHT">2</TD><TD ALIGN="RIGHT">24</TD><TD></TD><TD> </TD><TD ALIGN="RIGHT">EUR 8.92</TD><TD ALIGN="RIGHT"></TD></TR><TR><TD COLSPAN="10" ALIGN="RIGHT" BGCOLOR="#EEEEEE"><TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="100%"><TR><TD><FONT CLASS="fv">Estimated Weight of Order:</FONT></TD><TD ALIGN="RIGHT"><FONT CLASS="fv">2.17oz 61.44g</FONT></TD></TR></TABLE></TD></TR></TABLE><TABLE WIDTH="100%" BORDER="0" CELLPADDING="1" CELLSPACING="0" CLASS="ta"><TR><TD COLSPAN="2" CLASS="fv" ALIGN="RIGHT">Contact your buyer about this order<BR> </TD></TR></TABLE><HR NOSHADE SIZE="1" COLOR="#000000"><FONT CLASS="fv"><CENTER>This order will be purged from the BrickLink website on May 20, 2014.</CENTER></FONT></TABLE><FONT CLASS="fv"><P><CENTER><FONT COLOR="#FFFFFF">Back to Orders</FONT> | <FONT COLOR="#FFFFFF">Show Temporary Checkboxes</FONT> | <FONT COLOR="#FFFFFF">Show Categories</FONT> | <FONT COLOR="#FFFFFF">Consolidate Batches</FONT> | <FONT COLOR="#FFFFFF">My Settings</FONT><P><FONT COLOR="#FFFFFF">Hide Qty Left in My Inventory</FONT> | <FONT COLOR="#FFFFFF">Hide Item Weight</FONT> | <FONT COLOR="#FFFFFF">Show My Cost</FONT> | <FONT COLOR="#FFFFFF">Show Only Items in Order</FONT> | <FONT COLOR="#FFFFFF">Edit Order</FONT>
It's a little tough to write full-blown code without looking at the page you wish to scrape, but you should be able to use the following code to get what you want. The code below reads in a file called "html.txt", finds all orders in that text file, finds the total weight values in ozs and grams, and writes that data to an output file called foundWeights.txt. To run the code, just save your html in a text file called "html.txt", save the code below in a file called "findweights.py", and then put both of those files in the same folder. Then, open a shell or a terminal window, navigate to that folder, and type "python findweights.py" and momentarily a text file will appear in the same folder with your data in it.
html = open("html.txt").read()
out = open("foundWeights.txt", "w")
#split html on order number
legoOrders = html.split("Order #")
for order in legoOrders[1:]:
print order
orderNumber = order.split("<")[0]
weightString = order.split('Estimated Weight of Order:</FONT></TD><TD ALIGN="RIGHT"><FONT CLASS="fv">')[1]
splitWeightString = weightString.split(' ')
splitStringFinal = splitWeightString[1].split("<")
grams = splitStringFinal[0]
ozs = weightString.split('&nbsp')[0]
out.write(str(orderNumber) + "\t" + str(grams) + "\t" + str(ozs) + "\n"
Outfile is tab-separated (Order #, grams, ozs):
3953198 61.44g 2.17oz

How to extract data from a table With no Style/Class/ID information

i want to parse specific table for scrapping. the code of the table is given below..
<table class="NormalText" cellspacing="1" cellpadding="2" width="100%" border="0"
bgcolor="#eeeeee">
<tr>
<td width="108" align="center">
Stock No.
</td>
<td width="108" align="center">
<span id="invModule_grid_row18_lblMileage">Mileage</span>
</td>
<td width="108" align="center">
Color
</td>
<td width="76" align="center">
Interior
</td>
<td width="104" align="center">
Transmission
</td>
<td width="110" align="center">
Engine
</td>
</tr>
<tr>
<td width="108" align="center">
1204
</td>
<td width="108" align="center">
161,328
</td>
<td width="108" align="center">
Tan
</td>
<td width="76" align="center">
Leather
</td>
<td width="104" align="center">
Automatic
</td>
<td width="110" align="center">
3.5L V6 DOHC 16V
</td>
</tr>
<tr>
<td colspan="7" height="7">
</td>
</tr>
</table>
and the output i want is
1194 56,200 Blue Vinyl 5 Speed 6.8L V10 SOHC 30V
Questions
Which parsing Technique /Parser is best for this? PHPQuery, simplehtmlparse or xpath?
I am more familiar with domDocument, xpath and php, can it be done using xPath?
if yes, what will be xPath? (I am confused as my required data is in td and td tag has no id or class information attached. Also, on the uper row, which is basically a heading row, td are ther too)
Please guide me
XPath
The following example selects the text from all the td nodes in a table row in a table:
//table/tr[position()>1]/td/text()
You will have to know one of two things if there are other tables on the page:
Gets the last table:
//table[last()]/tr[position()>1]/td/text()
Gets the third table:
//table[2]/tr[position()>1]/td/text()
Gets a table based on an attribute, in this case, when class="NormalText":
//table[#class='NormalText']/tr[position()>1]/td/text()

Extracting table cell text contents with xpath in rows for consumption?

I have something along the following lines in terms of HTML. I would like to extract the various contents of the table cells, however I discovered that there are some embedded divs occasionally in the cells and perhaps other oddities that I'm not sure of yet:
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
<blockquote>
<p class="textstyle">
Text.
</p>
</blockquote>
My first impulse was to extract ALL element texts and just programmatically slice it up. I would watch for Title1, Title2, etc. to know when a row starts and then if a "----" is found meaning no value, just skip this row and move on. However, I realized that there is probably a better way of handling this with xpath directly.
How could this be solved with xpath so as to essentially give each cell's final child text content vs having to walk into each div if it exists? Or is there a more xpath like way to approach this?
Obviously I'm attempting to have the most flexible solution that will not be brittle if other unexpected elements crop up, even though they are unlikely.
The provided text isn't well-formed XML document, therefore XPath isn't applicable.
If you correct and covert it to a well-formed xml document as the one below, an expression like this might be useful:
/*/TABLE//TD//text()
or even:
//TABLE//TD//text()
Here is a wellformed XML document, constructed from the provided HTML:
<html>
<p align="center">
<img src="some_image.gif" alt="Some Title"/>
</p>
<TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD colspan="4" ALIGN="center">
<b>Title</b>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title</TD>
<TD ALIGN="center">date</TD>
<TD ALIGN="center">value</TD>
<TD ALIGN="center">value</TD>
</TR>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center">
<div class="redtext">----</div>
</TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center">
<div class="yellowtext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD ALIGN="center">value
<SUP>6</SUP>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title4</TD>
<TD ALIGN="center">
<div class="bluetext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD> </TD>
</TR>
</TABLE>
<blockquote>
<p class="textstyle"> Text. </p>
</blockquote>
</html>
So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:
import re
from cStringIO import StringIO
from lxml import etree
def getTable(html, table_xpath, rows_xpath, cells_xpath):
"""Get a table on a webpage"""
parser = etree.HTMLParser()
# Build document tree and get table
root = etree.parse(StringIO(html), parser)
table = root.find(table_xpath)
if table == None:
print 'No table.'
return []
rows = table.findall(rows_xpath)
document = []
def cleanText(text):
"""Clean up text by replacing line breaks and tabs. """
return re.sub(r'[\r\n\t]+','',str(text).strip())
# iterate over the table rows and collect text from each cell.
for r in rows:
cells = r.findall(cells_xpath)
rowdata = []
for c in cells:
text = ''
it = c.itertext()
for i in it:
text += cleanText(i) + ' '
rowdata.append(text)
document.append(rowdata)
return document
html = """
<html><head><title></title></head><body>
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
</body>
</html>
"""
tp = "//table[#width='500']"
rt = "tr"
cp = "td[#align='center']"
doc = getTable(html, tp, rt, cp)
print repr(doc)
I believe that your program is going to run into many problems as the input data is manipulated -- what if the case of 'title' changes, or there is a typo?
It's not really possible to make a rigorous solution to scraping someone else's website, as they can at no notice completely change everything. Better is normally to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr', then inside this loop, process the td elements:
import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")
stringify = lambda x : "".join(x.xpath(".//text()"))
for x in tree.xpath("//table/tr"):
print "New row"
for y in x.xpath("td"):
print stringify(y)
Output:
New row
test
New row
test2
The following code will, however, get the list you ask for:
print map(stringify, tree.xpath("//table/tr/td"))
Output:
['test', 'test2']
This will find all text elements which are at all descended from a td which is a direct descendant of a tr which is in turn a direct descendant of a table.
(Simply asking for all text() elements will create some funny bugs when run on HTML which contains "<td>Foo <b>bar</b></td>" or similar.)

Categories