Below is the markup I'm pulling from my database table. Basically, I want to replace the image
<img src="http://newvision.co.ug/IM/logo_white_big.gif" width="80" style="background-color:white;padding:1px">
with
<div style='background:url(http://newvision.co.ug/IM/logo_white_big.gif) center center no-repeat;width:40px;height:40px'></div>
I don't want to use regular expressions, just an HTML parser that ships with PHP.
<table>
<tbody>
<tr>
<td valign="top"><a href="http://newvision.co.ug/PA/8/13/748484" target=
"_blank"><img src="http://newvision.co.ug/IM/logo_white_big.gif" width="80"
style="background-color:white;padding:1px" /></a></td>
<td valign="top">
<table>
<tbody>
<tr>
<td></td>
</tr>
<tr>
<td valign="top"><b><a target="_blank" href=
"http://newvision.co.ug/PA/8/13/748484" style="font-size:9pt">The New
Vision Online : Holland withholds sh10b over CHOGM</a></b></td>
</tr>
<tr>
<td valign="top"><a href="http://newvision.co.ug/PA/8/13/748484" style=
"font-size:8pt;color:silver"
target="_blank">http://newvision.co.ug/PA/8/13/748484</a></td>
</tr>
<tr>
<td valign="top" style="font-size:8pt;font-weight:normal">The New Vision
is Uganda's leading daily newspaper.</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
PHP does ship with a parser, the DOM extension (DOMDocument::loadHTML), but phpQuery, which lets you manipulate the DOM in a jQuery-like manner, may be more convenient. It allows you to use selectors to easily swap out chunks of HTML.
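Whatever library is used, the swap itself is a small DOM operation: find each img node, build a div whose background is the image's src, and put the div in the img's place. The following is only a sketch of that idea in Python (not PHP), using the standard library's xml.etree.ElementTree on a well-formed fragment of the question's markup; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment taken from the question's table cell.
fragment = (
    '<td valign="top"><a href="http://newvision.co.ug/PA/8/13/748484" target="_blank">'
    '<img src="http://newvision.co.ug/IM/logo_white_big.gif" width="80" '
    'style="background-color:white;padding:1px" /></a></td>'
)
root = ET.fromstring(fragment)

# Snapshot the elements first so the tree can be modified safely while looping.
for parent in list(root.iter()):
    for index, child in enumerate(list(parent)):
        if child.tag != "img":
            continue
        div = ET.Element("div")
        div.set(
            "style",
            "background:url(%s) center center no-repeat;width:40px;height:40px"
            % child.get("src"),
        )
        div.tail = child.tail          # keep any text that followed the image
        parent.remove(child)
        parent.insert(index, div)      # the div takes the image's position

result = ET.tostring(root, encoding="unicode", method="html")
print(result)
```

ElementTree only accepts well-formed input, so real-world HTML would need a forgiving parser first.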
I can fetch a website's HTML with file_get_contents, but I want to extract certain values from it. The markup is always the same, but the values inside the tags change from time to time. This is the HTML:
<div class="cheapest-bins">
<h3>Cheapest Live Buy Now</h3>
<table>
<tbody><tr>
<th>Console</th>
<th>Buy Now Price</th>
</tr>
<tr class=" active">
<td class="xb1">XB1</td>
<td>1,480,000</td>
</tr>
<tr class="">
<td class="ps4">PS4</td>
<td>1,590,000</td>
</tr>
<tr class="">
<td class="x360">360</td>
<td>---</td>
</tr>
<tr class="">
<td class="ps3">PS3</td>
<td>2,800,000</td>
</tr>
</tbody></table>
</div>
How would I go about getting the: 1,480,000 .. 1,590,000 .. --- and 2,800,000?
Short answer:
Find a CSS selector library such as https://github.com/tj/php-selector
Then you could grab the innerHTML of all td:last-child elements.
For your specific example, you could just use:
preg_match_all('#<td>(.*?)</td>#', $html, $matches);
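The pattern works here only because the price cells are the ones written as a bare <td> with no attributes. A quick way to see that is this minimal Python sketch, where re.findall stands in for preg_match_all and the html string is a shortened copy of the question's rows:

```python
import re

# Shortened copy of the rows from the question.
html = """<tr class=" active">
<td class="xb1">XB1</td>
<td>1,480,000</td>
</tr>
<tr class="">
<td class="ps4">PS4</td>
<td>1,590,000</td>
</tr>
<tr class="">
<td class="x360">360</td>
<td>---</td>
</tr>
<tr class="">
<td class="ps3">PS3</td>
<td>2,800,000</td>
</tr>"""

# Cells such as <td class="xb1"> do not match the literal "<td>",
# so only the attribute-free price cells are captured.
prices = re.findall(r"<td>(.*?)</td>", html)
print(prices)
```

If the site ever adds an attribute to the price cells, the pattern stops matching, which is why a selector or DOM-based approach is the more durable option.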
I'm using mPDF to generate a PDF from a form. The form has an option to add new rows to a table. The problem appears when there are too many rows to fit on the generated PDF page: the table is shrunk to fit instead of continuing on the next page.
This is mpdf code:
$mpdf=new mPDF('UTF-8','A4','','',20,15,48,25,10,10);
$mpdf->WriteHTML(generatePDF());
$mpdf->Output();
exit;
This is html table code:
function getHTMLStyle(){
$html ='<table class="items" width="100%" style="font-size: 9pt; border-collapse: collapse;" cellpadding="8">
<tr>
<td width="5%">A</td>
<td width="95%"><b>'.$a.'</b><br /><br /> '.$_POST['title'].'</td>
</tr>
<tr>
<td >B</td>
<td ><b>'.$b.'</b><br /><br /> '.$_POST['organizationName'].'</td>
</tr>
<tr>
<td >C</td>
<td></td>
<table class="items2" width="100%" page-break-before="always" >
<tr>
<td ><b>'.$c.'</b></td>'.addTableC().'
</tr>
</table>
</tr>
This is an image of the correct view:
And this is an image of the wrong view:
How can I break the table and continue it on the next page?
Because you're incorrectly nesting tables -
<tr>
<td >C</td>
<td></td>
<table class="items2" width="100%" page-break-before="always" >
<tr>
<td ><b>'.$c.'</b></td>'.addTableC().'
</tr>
</table>
</tr>
The table should be inside the <td> tag, like so:
<tr>
<td>C</td>
<td>
<table class="items2" width="100%" page-break-before="always" >
<tr>
<td ><b>'.$c.'</b></td>'.addTableC().'
</tr>
</table>
</td>
</tr>
I want to parse a specific table for scraping. The code of the table is given below:
<table class="NormalText" cellspacing="1" cellpadding="2" width="100%" border="0"
bgcolor="#eeeeee">
<tr>
<td width="108" align="center">
Stock No.
</td>
<td width="108" align="center">
<span id="invModule_grid_row18_lblMileage">Mileage</span>
</td>
<td width="108" align="center">
Color
</td>
<td width="76" align="center">
Interior
</td>
<td width="104" align="center">
Transmission
</td>
<td width="110" align="center">
Engine
</td>
</tr>
<tr>
<td width="108" align="center">
1204
</td>
<td width="108" align="center">
161,328
</td>
<td width="108" align="center">
Tan
</td>
<td width="76" align="center">
Leather
</td>
<td width="104" align="center">
Automatic
</td>
<td width="110" align="center">
3.5L V6 DOHC 16V
</td>
</tr>
<tr>
<td colspan="7" height="7">
</td>
</tr>
</table>
And the output I want is:
1194 56,200 Blue Vinyl 5 Speed 6.8L V10 SOHC 30V
Questions
Which parsing technique or parser is best for this: PHPQuery, simplehtmlparse or XPath?
I am more familiar with DOMDocument, XPath and PHP. Can it be done using XPath?
If yes, what would the XPath be? (I am confused because the data I need is in td tags that have no id or class attached. Also, the upper row, which is basically a heading row, contains td tags too.)
Please guide me.
XPath
The following example selects the text of every td node in every row after the first (the heading row) of a table:
//table/tr[position()>1]/td/text()
If there are other tables on the page, you will have to narrow the selection in one of these ways:
Gets the last table:
//table[last()]/tr[position()>1]/td/text()
Gets the second table (XPath positions start at 1):
//table[2]/tr[position()>1]/td/text()
Gets a table based on an attribute, in this case, when class="NormalText":
//table[@class='NormalText']/tr[position()>1]/td/text()
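As a rough check of the attribute-based selection, here is a minimal Python sketch that uses only the standard library's xml.etree.ElementTree. ElementTree implements just a subset of XPath (no position()>1 and no text()), so the heading row is dropped with a slice instead. The trimmed sample markup and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table (assumed sample, fewer columns).
html = """
<html><body>
<table class="NormalText">
  <tr><td> Stock No. </td><td><span>Mileage</span></td><td> Color </td></tr>
  <tr><td> 1204 </td><td> 161,328 </td><td> Tan </td></tr>
  <tr><td colspan="7" height="7"> </td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(html)

# Pick the table by its class attribute, then take its rows.
rows = root.findall(".//table[@class='NormalText']/tr")

values = []
# rows[1:] plays the role of tr[position()>1]: it skips the heading row.
for row in rows[1:]:
    for cell in row.findall("td"):
        text = "".join(cell.itertext()).strip()
        if text:  # ignore the empty spacer cell
            values.append(text)

print(" ".join(values))
```

With a full XPath 1.0 engine, the expressions above can be used as written and no slicing is needed.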
I have something along the following lines in terms of HTML. I would like to extract the various contents of the table cells, however I discovered that there are some embedded divs occasionally in the cells and perhaps other oddities that I'm not sure of yet:
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
<blockquote>
<p class="textstyle">
Text.
</p>
</blockquote>
My first impulse was to extract ALL element texts and just programmatically slice it up. I would watch for Title1, Title2, etc. to know when a row starts and then if a "----" is found meaning no value, just skip this row and move on. However, I realized that there is probably a better way of handling this with xpath directly.
How could this be solved with xpath so as to essentially give each cell's final child text content vs having to walk into each div if it exists? Or is there a more xpath like way to approach this?
Obviously I'm attempting to have the most flexible solution that will not be brittle if other unexpected elements crop up, even though they are unlikely.
The provided text isn't a well-formed XML document, so XPath can't be applied to it directly.
If you correct it and convert it to a well-formed XML document such as the one below, an expression like this might be useful:
/*/TABLE//TD//text()
or even:
//TABLE//TD//text()
Here is a well-formed XML document, constructed from the provided HTML:
<html>
<p align="center">
<img src="some_image.gif" alt="Some Title"/>
</p>
<TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD colspan="4" ALIGN="center">
<b>Title</b>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title</TD>
<TD ALIGN="center">date</TD>
<TD ALIGN="center">value</TD>
<TD ALIGN="center">value</TD>
</TR>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center">
<div class="redtext">----</div>
</TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center">
<div class="yellowtext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD ALIGN="center">value
<SUP>6</SUP>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title4</TD>
<TD ALIGN="center">
<div class="bluetext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD> </TD>
</TR>
</TABLE>
<blockquote>
<p class="textstyle"> Text. </p>
</blockquote>
</html>
So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:
import re
from io import StringIO
from lxml import etree

def getTable(html, table_xpath, rows_xpath, cells_xpath):
    """Get a table on a webpage"""
    parser = etree.HTMLParser()
    # Build document tree and get table
    root = etree.parse(StringIO(html), parser)
    table = root.find(table_xpath)
    if table is None:
        print('No table.')
        return []
    rows = table.findall(rows_xpath)
    document = []

    def cleanText(text):
        """Clean up text by replacing line breaks and tabs. """
        return re.sub(r'[\r\n\t]+', '', str(text).strip())

    # iterate over the table rows and collect text from each cell.
    for r in rows:
        cells = r.findall(cells_xpath)
        rowdata = []
        for c in cells:
            text = ''
            it = c.itertext()
            for i in it:
                text += cleanText(i) + ' '
            rowdata.append(text)
        document.append(rowdata)
    return document

html = """
<html><head><title></title></head><body>
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
</body>
</html>
"""

# The HTML parser lower-cases tag and attribute names, hence the lower-case paths.
tp = ".//table[@width='500']"
rt = "tr"
cp = "td[@align='center']"
doc = getTable(html, tp, rt, cp)
print(repr(doc))
I believe that your program is going to run into many problems as the input data is manipulated -- what if the case of 'title' changes, or there is a typo?
It's not really possible to make a rigorous solution to scraping someone else's website, as they can at no notice completely change everything. Better is normally to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr', then inside this loop, process the td elements:
import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")
stringify = lambda x: "".join(x.xpath(".//text()"))
for x in tree.xpath("//table/tr"):
    print("New row")
    for y in x.xpath("td"):
        print(stringify(y))
Output:
New row
test
New row
test2
The following code will, however, get the list you ask for:
print(list(map(stringify, tree.xpath("//table/tr/td"))))
Output:
['test', 'test2']
This collects every text node descended from a td that is a direct child of a tr, which in turn is a direct child of a table.
(Simply asking for all text() elements will create some funny bugs when run on HTML which contains "<td>Foo <b>bar</b></td>" or similar.)
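The difference is easy to see on that very cell. This is a small sketch using the standard library's xml.etree.ElementTree rather than lxml (an assumption made to keep it dependency-free; lxml elements expose the same .text attribute and itertext() method):

```python
import xml.etree.ElementTree as ET

cell = ET.fromstring("<td>Foo <b>bar</b></td>")

# .text holds only the text that precedes the first child element.
direct_text = cell.text

# itertext() walks every descendant text node in document order,
# which is what the stringify helper above does with .//text().
full_text = "".join(cell.itertext())

print(repr(direct_text))
print(repr(full_text))
```

Joining the descendant text per cell keeps "Foo bar" together as one value instead of splitting it into two separate results.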
I'm learning regex but can't figure this out. I want to get the entire HTML from a div; how should I proceed?
I already tried this:
/\< td class=\"desc1\"\>(.+)/i
it returns;
Array
(
[0] => < td class="desc1">
[1] =>
)
the code that i'm matching is this;
<table id="profile" cellpadding="1" cellspacing="1">
<thead>
<tr>
<th colspan="2">Jogador TheInFEcT </th>
</tr>
<tr>
<td>Detalhes</td>
<td>Descrição:</td>
</tr>
</thead><tbody>
<tr>
<td class="empty"></td><td class="empty"></td>
</tr>
<tr>
<td class="details">
<table cellpadding="0" cellspacing="0">
<tbody><tr>
<th>Classificação</th>
<td>11056</td>
</tr>
<tr>
<th>Tribo:</th>
<td>Teutões</td>
</tr>
<tr>
<th>Aliança:</th>
<td>-</td>
</tr>
<tr>
<th>Aldeias:</th>
<td>1</td>
</tr>
<tr>
<th>População:</th>
<td>2</td>
</tr><tr>
<td colspan="2" class="empty"></td>
</tr>
<tr>
<td colspan="2"> » Alterar perfil</td>
</tr>
</tbody></table>
</td>
<td class="desc1">
<div>STATUS: OFNAaaaAA</div>
</td>
</tr>
</tbody>
</table>
I need to get the entire code inside the < td class="desc1">, like this:
<div >STATUS: OFNAaaaAA< /div>
</td>
</tr>
</tbody>
</table>
Could someone help me out?
Thanks in advance.
I usually use
$dom = new DOMDocument();
$dom->loadHTML($htmldata);
for converting HTML code to XML DOM. And then you can use
$node = $dom->getElementById($id);
/* or */
$nodes = $dom->getElementsByTagName($tag);
to get your HTML/XML node.
Now use
$node->textContent
to get the text inside the node, or $dom->saveHTML($node) to get its markup (PHP 5.3.6 and later).
try this, it does not cover all possible cases but it should work:
/<td\s+class=['"]\s*desc1\s*['"]\s*>((.|\n)*)<\/td>/i
tested with: http://www.pagecolumn.com/tool/pregtest.htm
edit: improved solution suggested by Alan Moore
/<td\s+class=['"]\s*desc1\s*['"]\s*>(.*?)<\/td>/s
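To see what the lazy (.*?) together with the s modifier buys, here is a minimal Python sketch of the same pattern, with re.DOTALL standing in for PCRE's /s. The html string is a shortened stand-in for the question's markup and the variable names are illustrative only.

```python
import re

# Shortened stand-in for the question's markup.
html = """<td class="details"><table><tr><td>11056</td></tr></table></td>
<td class="desc1">
<div>STATUS: OFNAaaaAA</div>
</td>
</tr>"""

# re.DOTALL lets "." match newlines, like the /s modifier in PCRE.
pattern = re.compile(
    r"""<td\s+class=['"]\s*desc1\s*['"]\s*>(.*?)</td>""",
    re.DOTALL,
)

match = pattern.search(html)
inner = match.group(1).strip()
print(inner)
```

Because the match stops at the first closing td tag, a td nested inside the desc1 cell would cut the result short; that is one of the cases the pattern does not cover.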