I've got the following HTML code:
<html><body><h1> <span class="mw-headline" id="Discussie_over_Titel">Discussie over Titel</span></h1>
<div class="comments">
<div class="comment new">
<div class="newcommenttext">Klik op de button om een nieuwe opmerking te maken over Titel</div><div class="buttons">Nieuwe opmerking</div>
<div class="clear"></div>
</div>
<div class="comments"><div>
</div><div class="commentBlock"><div class="comment" id="WikiSysop-"><div class="poster">Door <span class="usernamehighlight">WikiSysop</span> op <span class="extradata">Type: Suggestie</span>
</div>
<div class="buttons">Reageer op deze opmerking<span class="collapse">Collapse</span></div><div class="content">Wat is dit nou weer?
</div>
</div>
</div><div class="commentBlock"><div class="comment" id="WikiSysop-"><div class="poster">Door <span class="usernamehighlight">WikiSysop</span> op <span class="extradata">Type: Suggestie</span>
</div>
<div class="buttons">Reageer op deze opmerking<span class="collapse">Collapse</span></div><div class="content">Mega mooi!
</div>
</div></div></div></div> <div class="comments">
<div class="comment new"><div class="newcommenttext">
Klik op de button om een nieuwe opmerking te maken over Titel</div><div class="buttons">Nieuwe opmerking</div>
</div>
<p><br />
</p><p><br />
</p>
Klik hier om terug te keren naar Titel.</div>
</body></html>
To fetch all the comments I simply create a new dom parser:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$xpathResult = $xpath->query("//div[@class='comments']//div[@class='comments']");
But somehow the XPath query ALWAYS returns 0 nodes, even when I use //body. Does anybody know why?
The XPath query you are using only selects nodes with class comments that are nested (at any depth) inside another node with class comments.
With the HTML snippet you provided, it returns 1 node.
You have duplicated the selector. The right query is:
$xpath->query("//div[@class='comments']"); // returns 3
No need for the duplicate entry:
$xpathResult = $xpath->query("//div[@class='comments']");
# 3 elements found
See a working demo on ideone.com.
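The difference between the two queries can be checked with a small self-contained script. This is only a sketch: the markup below is a hypothetical, trimmed-down stand-in for the page above (one "comments" div nested in another, plus a sibling), not the original HTML. Note that XPath attribute tests are written with @, as in @class.

```php
<?php
// Hypothetical, trimmed-down markup: an outer "comments" div containing
// a nested one, followed by a sibling "comments" div.
$text = '<html><body>'
      . '<div class="comments"><div class="comments"><div class="comment">A</div></div></div>'
      . '<div class="comments"><div class="comment new">B</div></div>'
      . '</body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($text);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every div whose class attribute is exactly "comments".
$all = $xpath->query("//div[@class='comments']");

// Only "comments" divs that sit inside another "comments" div.
$nested = $xpath->query("//div[@class='comments']//div[@class='comments']");

echo $all->length, "\n";     // 3
echo $nested->length, "\n";  // 1
```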
Related
$divs = $xpathsuj->query("//div[@class='txt-msg text-enrichi-forum ']");
$div = $divs[$i];
With this XPath command I'm able to select the div with the class "txt-msg text-enrichi-forum ":
<div class="bloc-contenu">
<div class="txt-msg text-enrichi-forum ">
<p>Tu pourrais écrire en FRANCAIS si ce n'est pas trop demandé?
<img src="http://image.jeuxvideo.com/smileys_img/54.gif" alt=":coeur:" data-code=":coeur:" title=":coeur:" width="21" height="20" />
</p>
</div>
</div>
But not this one:
<div class="bloc-contenu">
<div class="txt-msg text-enrichi-forum ">
<p>
<img src="http://image.jeuxvideo.com/smileys_img/42.gif" alt=":salut:" data-code=":salut:" title=":salut:" width="46" height="41" />
</p>
</div>
<div class="signature-msg text-enrichi-forum ">
<p>break;</p>
</div>
</div>
What am I doing wrong?
I've tried it with both segments of markup and it seems to work with both, but there may be an issue with spacing.
In your XPath query, you're looking for an exact match of 'txt-msg text-enrichi-forum ', which has two spaces after txt-msg and one after the last part. If any of those spaces is missing in the document, the query will not find the element.
If you change it to...
$divs = $xpathsuj->query("//div[contains(@class,'txt-msg') and contains(@class,'text-enrichi-forum')]");
foreach ( $divs as $div ) {
echo $doc->saveXML($div).PHP_EOL;
}
It should be a bit more tolerant.
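One caveat with contains(@class, ...) is that it matches substrings, so 'txt-msg' would also match a class such as 'txt-msg-old'. A common XPath 1.0 idiom is to normalise the attribute and compare whole tokens. The sketch below assumes made-up sample markup, and classToken() is a hypothetical helper written for this example, not part of PHP's DOM API.

```php
<?php
// Hypothetical sample: irregular spacing in the class attribute, plus a
// second div that shares only one of the two class names.
$html = '<div class="bloc-contenu">'
      . '<div class="txt-msg  text-enrichi-forum "><p>one</p></div>'
      . '<div class="signature-msg text-enrichi-forum "><p>two</p></div>'
      . '</div>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpathsuj = new DOMXPath($doc);

// Hypothetical helper: builds an XPath test that is true when $name is one
// of the whitespace-separated tokens of @class, whatever the spacing.
function classToken(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$query = "//div[" . classToken('txt-msg') . " and " . classToken('text-enrichi-forum') . "]";
$divs  = $xpathsuj->query($query);

foreach ($divs as $div) {
    echo trim($div->textContent), PHP_EOL;   // prints "one"
}
```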
How can I prevent HTML Purifier from breaking this code:
<a target="_blank" href="http://www.example" class="link_normal opacity">
<blockquote url="http://www.example" class="big">
<div class="center">
<div class="table_cell"><img src="/images/imagenes_urls/4241.jpg" class="img_url"></div>
</div>
<p style="color:#3b5998" class="b">Desde los 80 hasta 2015, así ha sido la impresionante evolución de los móviles</p>
<p>¿Cómo olvidarse de aquellos enormes objetos a los que llamábamos teléfonos móviles? Muchos de vosotros los recordar...</p>
<span class="dominio">andro4all.com</span>
</blockquote></a>
When I run it through HTML Purifier, it turns into something like this:
<blockquote class="big">
<a class="link_normal opacity" href="http://www.example"></a>
<div class="center">
<a class="link_normal opacity" href="http://www.example"></a>
<div class="table_cell">
<a class="link_normal opacity" href="http://www.example">
<img alt="4241.jpg" class="img_url" src="/images/imagenes_urls/4241.jpg">
</a>
</div>
<a class="link_normal opacity" href="http://www.example"></a>
</div>
<a class="link_normal opacity" href="http://www.example"></a>
<p class="b" style="color:#3b5998;">
<a class="link_normal opacity" href="http://www.example">Desde los 80 hasta 2015, así ha sido la impresionante evolución de los móviles</a>
</p></blockquote>
You can use a trick: add JavaScript to your main tag, like this:
onclick="document.location='http://www.example'"
You can also set a pointer cursor so that it looks like a normal link:
style="cursor:pointer"
The result looks like this:
<blockquote url="http://www.example" class="big" onclick="document.location='http://www.example'" style="cursor:pointer">
<div class="center">
<div class="table_cell"><img src="/images/imagenes_urls/4241.jpg" class="img_url"></div>
</div>
<p style="color:#3b5998" class="b">Desde los 80 hasta 2015, así ha sido la impresionante evolución de los móviles</p>
<p>¿Cómo olvidarse de aquellos enormes objetos a los que llamábamos teléfonos móviles? Muchos de vosotros los recordar...</p>
<span class="dominio">andro4all.com</span>
</blockquote>
In HTML4 it is not valid to wrap a grouping (block-level) element such as blockquote in an inline a element.
You can also find good information about this element here: http://www.w3.org/html/wg/drafts/html/master/grouping-content.html#the-blockquote-element
After experimentation, I just discovered a slick answer to this specific question. You can customize HTMLPurifier's behavior, as described at http://htmlpurifier.org/docs/enduser-customize.html.
Specifically, you can overwrite the default behavior of the 'a' anchor element by defining the element again, like this:
include_once('HTMLPurifier.auto.php');
$config = HTMLPurifier_Config::createDefault();
$def = $config->getHTMLDefinition(true);
// here is the magic method that overwrites the default anchor definition
$def->addElement(
'a', // element name
'Inline', // the type of element: 'Block','Inline', or false, if it's a special case
'Flow', // what type of child elements are permitted: 'Empty', 'Inline', or 'Flow', which includes block elements like div
'Common' // permitted attributes
);
$purifier = new HTMLPurifier($config);
// $dirty_html is the html you want cleaned
echo $purifier->purify($dirty_html);
Now you can include block elements inside anchor tags, as allowed in the HTML5 specification (http://dev.w3.org/html5/markup/a.html).
For a more robust HTML5 solution, Christoffer Bubach offers a detailed HTMLPurifier configuration to allow newer HTML5 tags, though it doesn't redefine the anchor tag to permit block elements inside it. See his answer to this SO question: HTML filter that is HTML5 compliant
I've been using XPath to parse HTML with no problems until I came across the code below.
I usually read the "textContent" property once I have this td from an XPath query, BUT here I need to get only the text BEFORE the <img> tag.
<td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img onmouseover="caricaTool()" src="template/img/infoTip.png" width="17">
<div class="bottom" id='tooool'>
<div class="contenuto">
<div class="top">
<font class="testobold"><font class='testoblubold'>ZONA NON SERVITA QUOTIDIANAMENTE - PROSSIMA CONSEGNA </font><br>La località di destinazione non è tra quelle servite quotidianamente da SDA. La consegna avverrà al più presto possibile, compatibilmente con le operazioni logistiche.</font>
<p> <br><u>Chiudi</u>
</div>
</div>
</div>
</td>
You can probably use:
//td[@class="rowdispari"][img[@src="template/img/infoTip.png"]]/text()[1]
or:
//td[@class="rowdispari"]/text()[following-sibling::img[@src="template/img/infoTip.png"]][1]
Assuming that you already have an XPath expression that selects the outer <td> element, you can simply append /text()[1] to it to get the first text node that is a direct child of that <td> element:
path_to_td_here/text()[1]
A more concrete example:
//td[@class='rowdispari']/text()[1]
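To see the text()[1] step in context, here is a minimal sketch using PHP's built-in DOM extension. The markup is a shortened, hypothetical version of the cell from the question (the tooltip content is replaced by a placeholder), and the variable names are chosen only for this example.

```php
<?php
// Shortened, hypothetical version of the table cell from the question.
$html = '<table><tr><td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img src="template/img/infoTip.png" width="17">
<div class="bottom">tooltip text</div>
</td></tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// text()[1] is the first text node that is a direct child of the td,
// i.e. everything before the <img> element.
$node = $xpath->query("//td[@class='rowdispari']/text()[1]")->item(0);

// The raw node value still carries the surrounding line breaks.
$text = $node !== null ? trim($node->nodeValue) : '';

echo $text, PHP_EOL;
```

Unlike textContent on the td, this leaves out the text of the nested div.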
<div id="lyrics">
<img />
<span id="line_31" class="line line-s">En medio de este tropico mortal</span>
<br>
<span id="line_32" class="line line-s">Roots and creation, come again!</span>
<br>
<span id="line_33" class="line line-s">So mi guardian, mi guardian mi lift up di plan</span>
<span id="line_34" class="line line-s">Now everybody a go' do dis one</span>
<span id="line_35" class="line line-s">Like in down di Caribbean</span>
<span id="line_36" class="line line-s">San Andrés, Providence Island</span>
<br>
</div>
Here I have a div that contains multiple span tags, with br tags between them. I want to scrape the span text together with the br tags, exactly as they are. How can I do this with PHP Simple HTML DOM Parser?
Thanks for any help.
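Let's say the HTML above is saved in a file called "index.html".
include 'simple_html_dom.php'; // path to the Simple HTML DOM library; adjust as needed
$html = file_get_html("index.html");
// find() with an index returns a single element; without it, an array of elements
$element = $html->find('div#lyrics', 0);
$result = $element->innertext;
See the manual for details: http://simplehtmldom.sourceforge.net/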
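If you would rather not depend on Simple HTML DOM, a similar result can be sketched with PHP's built-in DOM extension. This is a different technique from the one above, shown under assumptions: the markup is a shortened, hypothetical copy of the lyrics div, and only span and br children are kept, so the img is skipped.

```php
<?php
// Shortened, hypothetical copy of the lyrics markup from the question.
$html = '<div id="lyrics"><img>'
      . '<span id="line_31" class="line line-s">En medio de este tropico mortal</span><br>'
      . '<span id="line_32" class="line line-s">Roots and creation, come again!</span><br>'
      . '</div>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div   = $xpath->query("//div[@id='lyrics']")->item(0);

// Serialise only the span and br children, in document order.
$result = '';
if ($div !== null) {
    foreach ($div->childNodes as $child) {
        if ($child instanceof DOMElement
            && in_array($child->nodeName, ['span', 'br'], true)) {
            $result .= $doc->saveHTML($child);
        }
    }
}

echo $result, PHP_EOL;
```

Note that loadHTML() assumes ISO-8859-1 unless the document declares its encoding, so text with accented characters may need an explicit charset declaration.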
In this variable, $this->item->text, I have this string:
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) starts here -->
<div class="itp-fshare-floating" id="itp-fshare" style="position:fixed; top:30px !important; left:50px !important;"></div><p>Lorem Ipsum è un testo segnaposto utilizzato nel settore della tipografia e della stampa. Lorem Ipsum è considerato il testo segnaposto standard sin dal sedicesimo secolo, quando un anonimo tipografo prese una cassetta di caratteri e li assemblòdei fogli di caratteri trasferibili “Letraset”,</p>
<p style="text-align: center;"><span class="easy_img_caption" style="display:inline-block;line-height:0.5;vertical-align:top;background-color:#F2F2F2;text-align:left;width:150px;float:left;margin:0px 10px;"><img src="/joomla/plugins/content/imagesresizecache/8428e9c26f1d8498ece730c0aa6aa023.jpeg" border="0" alt="1" title="1" style="width:150px; height:120px; ;margin:0;" /><span class="easy_img_caption_inner" style="display:inline-block;line-height:normal;color:#000000;font-size:8pt;font-weight:normal;font-style:normal;padding:4px 8px;margin:0px;">1</span></span></p>
<p>che contenevano passaggi del Lorem Ipsum, e più recentemente da software di impaginazione come Aldus PageMaker</p>
<!-- Disqus comments counter and anchor link -->
<a class="jwDisqusListingCounterLink" href="http://clouderize.it/joomla/index.php?option=com_content&view=article&id=8:recensione&catid=3:recensione-dei-servizi-di-cloud-computing&Itemid=4#disqus_thread" title="Add a comment">
Add a comment</a>
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) ends here -->
<div class="cp_tags">
<span class="cp_tag_label">Tags: </span><span class="cp_tag cp_tag_6">Recensioni
</span> </div>
So with this code I extract
<span class="easy_img_caption......></span>
Code (I am using this library called phpQuery http://goo.gl/rSu3k):
include_once('includes/phpQuery.php');
$doc = phpQuery::newDocument($this->item->text);
$extraction=pq('.easy_img_caption:eq(0)')->htmlOuter();
echo"<textarea>".$extraction."</textarea>";
So my question is:
How can I remove $extraction string from $this->item->text?
Thank you.
I'll assume phpQuery is a library that helps with DOM parsing in PHP.
Anyway, you don't strictly need an external library for this. It can be done with a regular expression replace:
$text = preg_replace('/<span.*?class="[^"]*?easy_img_caption[^"]*?".*?>.*?<\/span>/s', '', $this->item->text);
echo "<textarea>" . $text . "</textarea>";
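One thing to watch with the pattern above: the lazy .*? stops at the first </span>, so when the caption span contains another span (as it does in the question), the outer closing tag is left behind. A DOM-based removal avoids this. The following is only a sketch using PHP's built-in DOM extension, with a shortened, hypothetical version of the article text standing in for $this->item->text.

```php
<?php
// Shortened, hypothetical stand-in for $this->item->text.
$text = '<p>before</p>'
      . '<p><span class="easy_img_caption"><img src="a.jpeg" alt="1">'
      . '<span class="easy_img_caption_inner">1</span></span></p>'
      . '<p>after</p>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($text);              // the parser adds <html><body> around the fragment
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match "easy_img_caption" as a whole class token, so that
// "easy_img_caption_inner" is not selected by itself.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' easy_img_caption ')]";
$span  = $xpath->query($query)->item(0);

if ($span !== null) {
    // Removing the node also removes everything nested inside it.
    $span->parentNode->removeChild($span);
}

// Serialise the children of <body> to get the fragment back.
$body   = $doc->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo "<textarea>" . htmlspecialchars($result) . "</textarea>", PHP_EOL;
```

The output is escaped with htmlspecialchars() here so that the markup is displayed inside the textarea rather than interpreted.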