I am fetching specific data from a site using XPath, and I need to exclude a few elements, for which I have to use not(). But not() is not working in my code. Please explain what I have to do to make it work.
Here's the HTML code:
<tr><td colspan="2" valign="top" align="left"><span class="tl-document">
<left>some text here
</left>
</span></td></tr>
<tr><td colspan="2" valign="top" align="left">
<span class="text-id">some text here,<sup>a</sup><sup>b</sup></span>
<span class="text-id">some text here,<sup>a</sup></span>
</td></tr>
<tr><td colspan="2" valign="top" class="right">
<sup>a</sup>some text here<br>
</td></tr>
<tr><td colspan="2" valign="top" class="right">
<sup>b</sup>some text here<br>
</td></tr>
<td colspan="2" valign="top">
<br><div>
<span class="tl-default">Objective</span>
<p>some text here,</p>
</div>
<div>
<span class="tl-default">Methods</span>
<p>some text here,</p>
</div>
<div>
</td>
<td colspan="2" valign="top">
<br><div>
<span class="tl-default">Objective</span>
<p>some text here,</p>
</div>
</td>
I am trying to fetch only the td elements that do not have class and align attributes, and this is the XPath code I am using:
$getnew = "http://www.example.com/";
$html = new DOMDocument();
@$html->loadHtmlFile($getnew);
$xpath = new DOMXPath( $html );
$y = $xpath->query('//td[@colspan="2" and valign="top" and (not(@class and @align))]');
$ycnt = $y->length;
for ( $idf=6; $idf<$ycnt; $idf++)
{ if($idf==6){
    echo "<p class='artbox'>".$y->item($idf)->nodeValue."</p>";}
}
I am new to this, so any suggestions are welcome.
The problem with your logic is that no elements have both @class and @align, so the not() will always yield true.
Instead you should exclude elements that have either attribute (note that valign also needs its @ prefix):
//td[@colspan="2" and @valign="top" and not(@class or @align)]
Alternatively, to match elements that only have those two attributes, you can add a count() condition:
//td[@colspan="2" and @valign="top" and count(@*)=2]
Update
$query = '//td[@colspan="2" and @valign="top" and not(@class or @align)]';
foreach ($xpath->query($query) as $node) {
// do something with $node
}
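To see why the "and" version filters nothing, here is a minimal Python sketch (standard library only) that applies both predicates to a cut-down, well-formed stand-in for the rows in the question. Python's ElementTree does not support XPath's not(), so the predicate is written as a plain Python condition; the sample markup and the helper names base_match and select are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed stand-in for the rows in the question (assumed sample).
SAMPLE = """
<table>
  <tr><td colspan="2" valign="top" align="left">has align</td></tr>
  <tr><td colspan="2" valign="top" class="right">has class</td></tr>
  <tr><td colspan="2" valign="top">neither</td></tr>
</table>
"""

def base_match(td):
    # Equivalent of: @colspan="2" and @valign="top"
    return td.get("colspan") == "2" and td.get("valign") == "top"

def select(root, exclude):
    # Keep cells that pass the base test and are not excluded.
    return [td.text for td in root.iter("td") if base_match(td) and not exclude(td)]

root = ET.fromstring(SAMPLE)

# not(@class and @align): only excludes cells carrying BOTH attributes.
both = select(root, lambda td: "class" in td.attrib and "align" in td.attrib)

# not(@class or @align): excludes cells carrying EITHER attribute.
either = select(root, lambda td: "class" in td.attrib or "align" in td.attrib)

print(both)    # every cell survives, because none has both attributes
print(either)  # only the cell without class and align survives
```

Since no cell in the sample carries both attributes, the first filter removes nothing, which mirrors the behaviour described in the answer.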
Related
I'd like to catch the word "Bronze" from this html page portion:
<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Men's Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>
I tried different code but I failed in my intent. One of many attempts is the following:
foreach($html4->find('td align="left" strong') as $tag4) {
echo $prova = $tag4->innertext . "\n";
}
where html4 is the entire html page I have to process.
With the following code you can get the class name "Bronze":
<?php
$html='<tr class="">
<td align="left" csk="Nikpai,Rohullah">Rohullah Nikpai</td>
<td align="right" >25</td>
<td align="left" >Mens Featherweight</td>
<td align="right" csk="3">3T </td>
<td align="left" class=" Bronze" csk="1"><strong>Bronze</strong></td>
</tr>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach($dom->getElementsByTagName('td') as $link) {
echo trim($link->getAttribute('class'),' ');
}
?>
Or, if you prefer the node value rather than the class name, and the csk attribute is always 1:
foreach($dom->getElementsByTagName('td') as $link) {
if ($link->getAttribute('csk')=="1"){
echo $link->nodeValue;
}
}
I'm trying to get a better understanding of PHP Simple HTML DOM and am kinda stuck on the following.
I am trying to retrieve information from one of my user pages by using the following code :
$dom = file_get_html('http://127.0.0.1/comments/top-commenters/');
foreach($dom->find('tr[id*=commenter]') as $result) {
print_r($result->innertext);
}
Which produces for each commenter profile ($result->innertext) the following :
<td class="Position"># 3 </td>
<td class="img" align="center">
<a href="/images/users/814ocnqlN6.jpg">
<img src="/images/users/814ocnqlN6.jpg" info="Image" border="0"/></a>
<a uid="814ocnqlN6"></td>
<td> <b>User 3.</b>
<div class="tiny">Most recent comments</div>
</td>
<td class="NumCredits"> 471 </td>
<td class="NumComments"> 5.439 </td>
<td class="PercUpVotes"> 93% </td>
Now if I would like to access within each result (same foreach loop) for example :
<td class="Position"># 3 </td>
And
<td class="NumComments"> 5.439 </td>
What would be the best way to accomplish this ?
Try:
$dom = file_get_html('http://127.0.0.1/comments/top-commenters/');
foreach($dom->find('tr[id*=commenter]') as $result) {
print_r($result->find('td.Position'));
print_r($result->find('td.NumComments'));
}
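The idea in this answer is to run the second search on each row rather than on the whole document. As a rough illustration of that scoping in Python's standard library (not Simple HTML DOM), here is a sketch; the sample rows and the id value "commenter-3" are assumptions, since the question does not show the tr elements themselves.

```python
import xml.etree.ElementTree as ET

# Assumed sample: one row shaped like the output shown in the question.
ROWS = """
<table>
  <tr id="commenter-3">
    <td class="Position"># 3 </td>
    <td class="NumCredits"> 471 </td>
    <td class="NumComments"> 5.439 </td>
  </tr>
</table>
"""

results = []
for tr in ET.fromstring(ROWS).iter("tr"):
    # Rough equivalent of the tr[id*=commenter] selector.
    if "commenter" not in tr.get("id", ""):
        continue
    # Search relative to this row only, not the whole document.
    position = tr.find("td[@class='Position']").text.strip()
    comments = tr.find("td[@class='NumComments']").text.strip()
    results.append((position, comments))

print(results)
```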
I am using the strip_tags function to fetch only the required content, but it fetches all the data from the link.
See the example code below that I am using to fetch content from a link:
<?php
$a=fopen("http://example.com/","r");
$contents=stream_get_contents($a);
fclose($a);
$contents1=strtolower($contents);
$start='<div id="content">';
$start_pos=strpos($contents1,$start);
$first_trim=substr($contents1,$start_pos);
$stop='</div><!-- content -->';
$stop_pos=strpos($first_trim,$stop);
$second_trim=substr($first_trim,0,$stop_pos+6);
$second_trim = strip_tags($second_trim, '<div><table><tbody><tr><td><a><h2><h4>');
echo "<div>$second_trim</div>";
?>
here is the html code fetched in $second_trim:
<div><div id="content">
<div id="issuedescription"></div>
<h2 class="wsite-content-title" style="text-align:center;">download content<br /><font color="#f30519">table of content</font><br /> <font color="#f80117"> content </font></h2>
<h2>table of contents</h2>
<h4 class="tocsectiontitle">editorial</h4>
<h2 class="wsite-content-title" style="text-align:left;">technical note</h2>
<table class="tocarticle" width="100%">
<tr valign="top">
<td class="toctitle" width="95%" align="left">where are we at and where are we heading to? </td>
<td class="tocgalleys" width="5%" align="left">
pdf
</td>
</tr>
<tr>
<td class="tocauthors" width="95%" align="left">
sergio eduardo de paiva gonçalves </td>
<td class="tocpages" width="5%" align="left">1-2</td>
</tr>
</table>
<div class="separator"></div>
<h4 class="tocsectiontitle">some text here</h4>
<table class="tocarticle" width="100%">
<tr valign="top">
<td class="toctitle" width="95%" align="left">some text here</td>
<td class="tocgalleys" width="5%" align="left">
pdf
</td>
</tr>
<tr>
<td class="tocauthors" width="95%" align="left">
some text here, some text here, some text here, some text here, some text here, some text here </td>
<td class="tocpages" width="5%" align="left">3-10</td>
</tr>
</table>
<a target="_blank" rel="license" href="http://example.com/">
</a>
some text here<a rel="license" target="_blank" href="http://example.com/">example</a>.
</div></div>
Now my problem is that I want to fetch only a particular tag from the whole content, such as the 2nd of the two anchors given below, using the strip_tags function:
pdf
some text here
and the 2nd header tag of the two given below:
<h2 class="wsite-content-title" style="text-align:center;">download content<br /><font color="#f30519">table of content</font><br /> <font color="#f80117"> content </font></h2>
<h2>table of contents</h2>
But the strip_tags function fetches either all of them or none of them. How can I identify the tag I want, so that I fetch only that one instead of all the similar tags? If there is a better way to do this, please share your ideas here.
A regexp can do such a thing:
function handle_link($data) {
list($link, $attributes, $content) = $data;
$classes = preg_match('#class=[\'"]([^\'"]+)[\'"]#', $attributes, $match) ? preg_split('#\s+#', $match[1]) : array();
// If the link has the "file" class
if(in_array('file', $classes)) {
return $content; // only the internal content (like strip_tags would do)
// or you can return a new link:
// return '' . $content . '';
} else {
return $link; // any other link is returned unchanged
}
}
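// <a> must stay in the allowed list, otherwise strip_tags removes the links before the callback can see them
$second_trim = strip_tags($second_trim, '<div><table><tbody><tr><td><a><h2><h4>');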
$second_trim = preg_replace_callback('#<a([^>]*)>(.+)</a>#U', 'handle_link', $second_trim);
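If the goal is really "the second a" or "the second h2", selecting by position with a parser can be less fragile than a regex. The question uses PHP; what follows is only a sketch of the idea using Python's standard html.parser. The class name NthTagText, the function nth_tag_text and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class NthTagText(HTMLParser):
    """Collect the text inside the n-th opening occurrence (1-based) of a tag."""

    def __init__(self, tag, n):
        super().__init__()
        self.tag, self.n = tag, n
        self.seen = 0    # how many opening tags of this kind so far
        self.depth = 0   # > 0 while inside the wanted occurrence
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # same-name tag nested inside the wanted one
            return
        self.seen += 1
        if self.seen == self.n:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def nth_tag_text(html, tag, n):
    parser = NthTagText(tag, n)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()

# Assumed sample, loosely modelled on the markup in the question.
sample = (
    '<h2 class="wsite-content-title">download content</h2>'
    '<h2>table of contents</h2>'
    '<a href="http://example.com/">pdf</a>'
    '<a href="http://example.com/">example</a>'
)
second_h2 = nth_tag_text(sample, "h2", 2)
second_a = nth_tag_text(sample, "a", 2)
print(second_h2)
print(second_a)
```

In PHP the same positional selection could be done with DOMDocument and getElementsByTagName, which is already used elsewhere on this page.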
There is a html page, it contains a block:
<table class="tborder" cellpadding="6" cellspacing="1" border="0" width="100%" align="center">
<tr>
<td class="tcat" colspan="2">
Some regular text <span class="normal">the desired text 1</span>
</td>
</tr>
<tr>
<td class="alt1" colspan="2">
<span class="smallfont">link1, <i><b><font color="#006400">link2</font></b></i></span>
</td>
</tr>
</table>
Help me parse this with the Simple HTML DOM library or a regular expression, so that only the following is output:
the desired text 1 <span class="smallfont">link1, <i><b><font color="#006400">link2</font></b></i></span>
If I do this:
<?
include 'simple_html_dom.php';
$html = file_get_html('http://some-url.com/power.html');
foreach($html->find('td[class="tcat"]') as $element1)
echo $element1. '<br>';
foreach($html->find('span[class="smallfont"]') as $element2)
echo $element2. '<br>';
?>
This way, along with the data I need, other similar elements on the page (matching the same 'td class="tcat"' and 'class="smallfont"') are displayed too.
I need only this to be output:
the desired text 1 <span class="smallfont">link1, <i><b><font color="#006400">link2</font></b></i></span>
It's all about knowing CSS selectors:
echo $html->find('td.tcat span', 0)->text();
echo $html->find('span.smallfont', 0);
//the desired text 1 <span class="smallfont">link1, <i><b><font color="#006400">link2</font></b></i></span>
I have something along the following lines in terms of HTML. I would like to extract the various contents of the table cells, however I discovered that there are some embedded divs occasionally in the cells and perhaps other oddities that I'm not sure of yet:
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
<blockquote>
<p class="textstyle">
Text.
</p>
</blockquote>
My first impulse was to extract ALL element texts and just programmatically slice it up. I would watch for Title1, Title2, etc. to know when a row starts and then if a "----" is found meaning no value, just skip this row and move on. However, I realized that there is probably a better way of handling this with xpath directly.
How could this be solved with xpath so as to essentially give each cell's final child text content vs having to walk into each div if it exists? Or is there a more xpath like way to approach this?
Obviously I'm attempting to have the most flexible solution that will not be brittle if other unexpected elements crop up, even though they are unlikely.
The provided text isn't a well-formed XML document, so XPath can't be applied to it directly.
If you correct it and convert it to a well-formed XML document like the one below, an expression like this might be useful:
/*/TABLE//TD//text()
or even:
//TABLE//TD//text()
Here is a well-formed XML document, constructed from the provided HTML:
<html>
<p align="center">
<img src="some_image.gif" alt="Some Title"/>
</p>
<TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD colspan="4" ALIGN="center">
<b>Title</b>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title</TD>
<TD ALIGN="center">date</TD>
<TD ALIGN="center">value</TD>
<TD ALIGN="center">value</TD>
</TR>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center">
<div class="redtext">----</div>
</TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center">
<div class="yellowtext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD ALIGN="center">value
<SUP>6</SUP>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title4</TD>
<TD ALIGN="center">
<div class="bluetext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD> </TD>
</TR>
</TABLE>
<blockquote>
<p class="textstyle"> Text. </p>
</blockquote>
</html>
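As a rough illustration of the same idea outside a full XPath engine, here is a Python standard-library sketch that walks a shortened version of the well-formed document above, joins all descendant text of each TD (the effect of TD//text()), and skips rows whose value is "----", as the question describes. The shortened sample and the function name rows_as_text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the well-formed document (assumed sample).
XML = """
<html>
  <TABLE WIDTH="500">
    <TR><TD ALIGN="center">Title2</TD><TD ALIGN="center"></TD>
        <TD ALIGN="center"><div class="redtext">----</div></TD></TR>
    <TR><TD ALIGN="center">Title3</TD>
        <TD ALIGN="center"><div class="yellowtext">value</div></TD>
        <TD ALIGN="center">value<SUP>6</SUP></TD></TR>
  </TABLE>
</html>
"""

def rows_as_text(root, skip_marker="----"):
    rows = []
    for tr in root.iter("TR"):
        # itertext() yields every descendant text node, like TD//text().
        cells = ["".join(td.itertext()).strip() for td in tr.findall("TD")]
        if skip_marker in cells:
            continue  # row has no value, skip it
        rows.append(cells)
    return rows

table_rows = rows_as_text(ET.fromstring(XML))
print(table_rows)
```

Because the text is gathered per cell, it does not matter whether a value sits directly in the TD or inside a nested div.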
So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:
import re
from io import StringIO
from lxml import etree

def getTable(html, table_xpath, rows_xpath, cells_xpath):
    """Get a table on a webpage"""
    parser = etree.HTMLParser()
    # Build document tree and get table
    root = etree.parse(StringIO(html), parser)
    table = root.find(table_xpath)
    if table is None:
        print('No table.')
        return []
    rows = table.findall(rows_xpath)
    document = []

    def cleanText(text):
        """Clean up text by replacing line breaks and tabs. """
        return re.sub(r'[\r\n\t]+', '', str(text).strip())

    # iterate over the table rows and collect text from each cell.
    for r in rows:
        cells = r.findall(cells_xpath)
        rowdata = []
        for c in cells:
            text = ''
            it = c.itertext()
            for i in it:
                text += cleanText(i) + ' '
            rowdata.append(text)
        document.append(rowdata)
    return document

html = """
<html><head><title></title></head><body>
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
</body>
</html>
"""

# The HTML parser lowercases tag and attribute names, hence 'table' and '@width'.
tp = ".//table[@width='500']"
rt = "tr"
cp = "td[@align='center']"
doc = getTable(html, tp, rt, cp)
print(repr(doc))
I believe that your program is going to run into many problems as the input data is manipulated -- what if the case of 'title' changes, or there is a typo?
It's not really possible to make a rigorous solution to scraping someone else's website, as they can completely change everything without notice. It is usually better to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr', then inside this loop, process the td elements:
import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")
stringify = lambda x : "".join(x.xpath(".//text()"))
for x in tree.xpath("//table/tr"):
    print("New row")
    for y in x.xpath("td"):
        print(stringify(y))
Output:
New row
test
New row
test2
The following code will, however, get the list you ask for:
print(list(map(stringify, tree.xpath("//table/tr/td"))))
Output:
['test', 'test2']
This will find all text nodes descended, at any depth, from a td that is a child of a tr, which is in turn a child of a table.
(Simply asking for all text() elements will create some funny bugs when run on HTML which contains "<td>Foo <b>bar</b></td>" or similar.)
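The mixed-content case mentioned here can be seen in a small standard-library sketch. It uses xml.etree.ElementTree rather than lxml, which is an assumption made so that it runs without extra packages; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("<td>Foo <b>bar</b></td>")

# .text only holds the text before the first child element.
first_text_only = td.text

# itertext() walks every descendant text node in document order,
# so joining the pieces gives the cell's full text.
all_text = "".join(td.itertext())

print(repr(first_text_only))
print(repr(all_text))
```

Treating each text node as a separate result would split this cell into two values, which is why the text is joined per td.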