I've got the following HTML code:
<html><body><h1> <span class="mw-headline" id="Discussie_over_Titel">Discussie over Titel</span></h1>
<div class="comments">
<div class="comment new">
<div class="newcommenttext">Klik op de button om een nieuwe opmerking te maken over Titel</div><div class="buttons">Nieuwe opmerking</div>
<div class="clear"></div>
</div>
<div class="comments"><div>
</div><div class="commentBlock"><div class="comment" id="WikiSysop-"><div class="poster">Door <span class="usernamehighlight">WikiSysop</span> op <span class="extradata">Type: Suggestie</span>
</div>
<div class="buttons">Reageer op deze opmerking<span class="collapse">Collapse</span></div><div class="content">Wat is dit nou weer?
</div>
</div>
</div><div class="commentBlock"><div class="comment" id="WikiSysop-"><div class="poster">Door <span class="usernamehighlight">WikiSysop</span> op <span class="extradata">Type: Suggestie</span>
</div>
<div class="buttons">Reageer op deze opmerking<span class="collapse">Collapse</span></div><div class="content">Mega mooi!
</div>
</div></div></div></div> <div class="comments">
<div class="comment new"><div class="newcommenttext">
Klik op de button om een nieuwe opmerking te maken over Titel</div><div class="buttons">Nieuwe opmerking</div>
</div>
<p><br />
</p><p><br />
</p>
Klik hier om terug te keren naar Titel.</div>
</body></html>
To fetch all the comments I simply create a new dom parser:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$xpathResult = $xpath->query("//div[@class='comments']//div[@class='comments']");
But somehow the XPath query ALWAYS returns 0 results, even when I use //body. Does anybody know why?
The XPath query you are using only selects nodes with class comments that are located (at any depth) inside another node with class comments.
With the HTML snippet you provided, it returns 1 node.
You duplicated the selector. The correct usage is:
$xpath->query("//div[@class='comments']"); // returns 3
No need for the duplicate entry:
$xpathResult = $xpath->query("//div[@class='comments']");
# 3 elements found
See a working demo on ideone.com.
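To make the difference between the two queries concrete, here is a minimal sketch. The markup is a trimmed-down stand-in for the HTML in the question (an assumption for illustration), keeping only the three div elements with class comments, one of which is nested.

```php
<?php
// Trimmed-down stand-in for the markup in the question (illustrative only).
$text = '<html><body><div class="comments">'
      . '<div class="comment new">new</div>'
      . '<div class="comments"><div class="comment">first</div></div>'
      . '</div><div class="comments">footer</div></body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);

// Every div whose class attribute is exactly "comments".
$all = $xpath->query("//div[@class='comments']");

// Only the divs nested inside another div with the same class.
$nested = $xpath->query("//div[@class='comments']//div[@class='comments']");

echo $all->length, " ", $nested->length, "\n";
```

Note that @class='comments' compares the whole attribute value, so an element with class="comment new" is not matched by either query.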
For some strange reason 2 of the elements in my arrays are being ignored. They are...
ETIQUETAS XTRA MINI PARA OBJETOS and ETIQUETAS TERMOAHDESIVAS CLASICAS.
If the product page title is included in the array it renders an info box on the product page below the product image. This is true for all products in the array except the 2 mentioned above.
Below are the two array variables and, below that, the code that renders the info boxes. Any help is greatly appreciated. By the way, this is WooCommerce on WordPress 4.0.
<?php
//for every label product.
$infoBox1Array = array('Pack Duo','Etiquetas Grandes','Etiquetas Navideñas','Pack Trio','ETIQUETAS PARA CUMPLEAÑOS SUPER PERSONALIZADA','LETREROS PARA CASAS','ETIQUETAS PARA CUMPLEAÑOS','PACK ZAPATOS FORMAS DE PIE','Pack Zapatos','Etiquetas Para Alergias Personalizadas','PACK FULL COLOR','Membretes','Etiquetas Termoadhesivas Mini (fondo blanco)','ETIQUETAS TERMOADHESIVAS FULL COLOR','ETIQUETAS TERMOAHDESIVAS CLASICAS','Saca & Pega','Stickers para Carros','ETIQUETAS PARA ALERGIAS','Etiquetas Kosher','Etiquetas Cocina','Etiquetas Minis','Pack mix 3','Pack mix 2','Pack mix 1','Etiquetas Redondas','Etiquetas Medianas','ETIQUETAS XTRA MINI PARA OBJETOS','Pack Xpress','Pack Guarderia','Pack Regreso al Cole');
//for every product that can be personalized.
$infoBox2Array = array('Pack Duo','Etiquetas Grandes','Etiquetas Navideñas','Pack Trio','ETIQUETAS PARA CUMPLEAÑOS SUPER PERSONALIZADA','LETREROS PARA CASAS','ETIQUETAS PARA CUMPLEAÑOS','PACK ZAPATOS FORMAS DE PIE','Pack Zapatos','Etiquetas Para Alergias Personalizadas','PACK FULL COLOR','Membretes','Etiquetas Termoadhesivas Mini (fondo blanco)','ETIQUETAS TERMOADHESIVAS FULL COLOR','ETIQUETAS TERMOAHDESIVAS CLASICAS','Saca & Pega','Etiquetas Minis','Pack mix 3','Pack mix 2','Pack mix 1','Etiquetas Redondas','Etiquetas Medianas','ETIQUETAS XTRA MINI PARA OBJETOS','Pack Xpress','Pack Guarderia','Pack Regreso al Cole');
?>
<div class="row">
<div>
<?php
include(TEMPLATEPATH . '/info-box-arrays.php');
if(is_single( $infoBox1Array )) {
echo '
<div style="background-color: #ffffff; border: solid 1px #CCCCCC;padding: 7px"><p style="text-align: justify">
El tamaño de la etiqueta y de letra son aproximados.<br/>
Los colores pueden variar de acuerdo a la configuración de su pantalla.
</p></div>
<div style="clear: both"> </div>';
}
if(is_single( $infoBox2Array )) {
echo '
<div style="background-color: #ffffff; border: solid 1px #CCCCCC;padding: 7px"><p style="text-align: justify">
<strong><span style="color: #ff0000;">Nota Importante:</span></strong> Los datos que escribe son los que serán procesados en su pedido.<br/>
Considerar acentos y mayusculas.
</p></div>';
}
?>
</div>
</div>
I replicated your code and created a post titled ETIQUETAS TERMOAHDESIVAS CLASICAS. It works fine, so I suspect there is nothing wrong with your array.
I think is_single is returning false for another reason: either the post has a slightly different title, or you are trying to use it on a page.
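One way to check the "slightly different title" theory is to compare the real title against the array after normalising case and whitespace. The sketch below is plain PHP and only illustrative: normalizeTitle and nearMatches are hypothetical helper names, the double space in $actual is an assumed example of a mismatch, and mb_strtolower requires the mbstring extension. In WordPress the actual title could come from get_the_title().

```php
<?php
// Hypothetical helper: normalise a title so that case and repeated or
// surrounding whitespace do not affect the comparison.
function normalizeTitle(string $title): string
{
    $collapsed = preg_replace('/\s+/u', ' ', trim($title));
    return mb_strtolower($collapsed, 'UTF-8');
}

// Hypothetical helper: returns the array entries that equal $actualTitle only
// after normalisation, i.e. entries an exact comparison would miss.
function nearMatches(string $actualTitle, array $titles): array
{
    $wanted = normalizeTitle($actualTitle);
    return array_values(array_filter($titles, function ($t) use ($actualTitle, $wanted) {
        return $t !== $actualTitle && normalizeTitle($t) === $wanted;
    }));
}

$titles = array('Pack Duo', 'ETIQUETAS XTRA MINI PARA OBJETOS');
// Assumed example: the stored title contains a double space.
$actual = 'ETIQUETAS XTRA  MINI PARA OBJETOS';
print_r(nearMatches($actual, $titles));
```

If the helper returns an entry, the title in the array and the stored title differ only in case or spacing, which would explain a failed match.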
I've been using XPath to parse HTML with no problems until I found the code below.
I usually use the "textContent" property once I've got this td with an XPath query, but I need to get only the text BEFORE the <img> tag.
<td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img onmouseover="caricaTool()" src="template/img/infoTip.png" width="17">
<div class="bottom" id='tooool'>
<div class="contenuto">
<div class="top">
<font class="testobold"><font class='testoblubold'>ZONA NON SERVITA QUOTIDIANAMENTE - PROSSIMA CONSEGNA </font><br>La località di destinazione non è tra quelle servite quotidianamente da SDA. La consegna avverrà al più presto possibile, compatibilmente con le operazioni logistiche.</font>
<p> <br><u>Chiudi</u>
</div>
</div>
</div>
</td>
You can probably use:
//td[@class="rowdispari"][img[@src="template/img/infoTip.png"]]/text()[1]
or:
//td[@class="rowdispari"]/text()[following-sibling::img[@src="template/img/infoTip.png"]][1]
Assuming that you already have an XPath expression for the outer <td> element, you can simply append /text()[1] to it to get the first text node that is a direct child of that <td> element:
path_to_td_here/text()[1]
A more concrete example:
//td[@class='rowdispari']/text()[1]
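As a rough illustration of the text()[1] approach in PHP, the sketch below runs the query with DOMXPath. The markup is a simplified copy of the cell from the question wrapped in a minimal table (an assumption; the real page has more markup), and the variable names are illustrative.

```php
<?php
// Simplified copy of the cell from the question (illustrative only).
$html = '<table><tr><td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img src="template/img/infoTip.png" width="17">
<div class="bottom">tooltip text</div>
</td></tr></table>';

libxml_use_internal_errors(true);   // tolerate imperfect HTML
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// First text node that is a direct child of the cell, i.e. the text before <img>.
$nodes = $xpath->query("//td[@class='rowdispari']/text()[1]");
$label = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : '';

echo $label, "\n";
```

The trim() call removes the line breaks surrounding the text inside the cell.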
<div id="lyrics">
<img />
<span id="line_31" class="line line-s">En medio de este tropico mortal</span>
<br>
<span id="line_32" class="line line-s">Roots and creation, come again!</span>
<br>
<span id="line_33" class="line line-s">So mi guardian, mi guardian mi lift up di plan</span>
<span id="line_34" class="line line-s">Now everybody a go' do dis one</span>
<span id="line_35" class="line line-s">Like in down di Caribbean</span>
<span id="line_36" class="line line-s">San Andrés, Providence Island</span>
<br>
</div>
Here I have a div that contains multiple span tags with br tags between them. I want to scrape the span text and keep the br tags as they are. How can I do this with PHP Simple HTML DOM Parser?
Thanks for any help.
Let's say the HTML file you have above is called "index.html".
$html = file_get_html("index.html");
// find() with an index returns a single element; without it, an array of elements.
$element = $html->find('div#lyrics', 0);
$result = $element->innertext;
You may want to consult the manual: http://simplehtmldom.sourceforge.net/
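If only the span text and the br tags are wanted, without the img or the span markup, one alternative is PHP's built-in DOM extension instead of Simple HTML DOM. The following is a minimal sketch under that assumption; the markup is a shortened, ASCII-only excerpt of the lyrics div, since loadHTML needs an explicit encoding hint for accented characters.

```php
<?php
// Shortened, ASCII-only excerpt of the lyrics markup (illustrative only).
$html = '<div id="lyrics"><img />'
      . '<span id="line_32" class="line line-s">Roots and creation, come again!</span>'
      . '<br>'
      . '<span id="line_34" class="line line-s">Now everybody a go\' do dis one</span>'
      . '<br></div>';

libxml_use_internal_errors(true);   // tolerate imperfect HTML
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = '';
// Walk the span and br children of the lyrics div in document order.
foreach ($xpath->query("//div[@id='lyrics']/*[self::span or self::br]") as $node) {
    // Keep br as a tag, and keep only the text of each span.
    $result .= $node->nodeName === 'br' ? "<br>\n" : $node->textContent;
}

echo $result;
```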
In the variable $this->item->text I have this string:
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) starts here -->
<div class="itp-fshare-floating" id="itp-fshare" style="position:fixed; top:30px !important; left:50px !important;"></div><p>Lorem Ipsum è un testo segnaposto utilizzato nel settore della tipografia e della stampa. Lorem Ipsum è considerato il testo segnaposto standard sin dal sedicesimo secolo, quando un anonimo tipografo prese una cassetta di caratteri e li assemblòdei fogli di caratteri trasferibili “Letraset”,</p>
<p style="text-align: center;"><span class="easy_img_caption" style="display:inline-block;line-height:0.5;vertical-align:top;background-color:#F2F2F2;text-align:left;width:150px;float:left;margin:0px 10px;"><img src="/joomla/plugins/content/imagesresizecache/8428e9c26f1d8498ece730c0aa6aa023.jpeg" border="0" alt="1" title="1" style="width:150px; height:120px; ;margin:0;" /><span class="easy_img_caption_inner" style="display:inline-block;line-height:normal;color:#000000;font-size:8pt;font-weight:normal;font-style:normal;padding:4px 8px;margin:0px;">1</span></span></p>
<p>che contenevano passaggi del Lorem Ipsum, e più recentemente da software di impaginazione come Aldus PageMaker</p>
<!-- Disqus comments counter and anchor link -->
<a class="jwDisqusListingCounterLink" href="http://clouderize.it/joomla/index.php?option=com_content&view=article&id=8:recensione&catid=3:recensione-dei-servizi-di-cloud-computing&Itemid=4#disqus_thread" title="Add a comment">
Add a comment</a>
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) ends here -->
<div class="cp_tags">
<span class="cp_tag_label">Tags: </span><span class="cp_tag cp_tag_6">Recensioni
</span> </div>
So with this code I extract
<span class="easy_img_caption......></span>
Code (I am using this library called phpQuery http://goo.gl/rSu3k):
include_once('includes/phpQuery.php');
$doc = phpQuery::newDocument($this->item->text);
$extraction=pq('.easy_img_caption:eq(0)')->htmlOuter();
echo"<textarea>".$extraction."</textarea>";
So my question is:
How can I remove $extraction string from $this->item->text?
Thank you.
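I'll assume phpQuery is a library that helps with DOM parsing in PHP.
You don't strictly need an external library for this; it can be done with a regular expression replace. Note that the lazy pattern stops at the first closing </span>, so with a nested span like the one in your string a stray </span> is left behind: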
$text = preg_replace('/<span.*?class="[^"]*?easy_img_caption[^"]*?".*?>.*?<\/span>/s', '', $this->item->text);
echo "<textarea>" . $text . "</textarea>";
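Because the caption span contains a nested span, removing it through the DOM may be more robust than a regular expression. Below is a minimal sketch using PHP's built-in DOMDocument rather than phpQuery. The input is a shortened, ASCII-only stand-in for $this->item->text, and the wrapper div with id wrapper is a hypothetical helper element used only to read the fragment back after parsing.

```php
<?php
// Shortened stand-in for $this->item->text (illustrative, ASCII only).
$text = '<p>Intro</p>'
      . '<p><span class="easy_img_caption"><img src="a.jpeg" alt="1" />'
      . '<span class="easy_img_caption_inner">1</span></span></p>'
      . '<p>Outro</p>';

libxml_use_internal_errors(true);   // tolerate imperfect HTML
$dom = new DOMDocument;
// Wrap the fragment so its contents can be found again after parsing.
$dom->loadHTML('<div id="wrapper">' . $text . '</div>');
$xpath = new DOMXPath($dom);

// First element whose class list contains the exact token "easy_img_caption".
$query = "(//*[contains(concat(' ', normalize-space(@class), ' '), ' easy_img_caption ')])[1]";
$span = $xpath->query($query)->item(0);
if ($span !== null) {
    // Removes the span together with everything nested inside it.
    $span->parentNode->removeChild($span);
}

// Serialise the wrapper's children back into a string.
$clean = '';
foreach ($xpath->query("//div[@id='wrapper']")->item(0)->childNodes as $child) {
    $clean .= $dom->saveHTML($child);
}

echo $clean, "\n";
```

The empty paragraph that used to hold the caption is left in place; it could be removed as well if it matters for the layout.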