Remove HTML code from a string in PHP

The variable $this->item->text holds this string:
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) starts here -->
<div class="itp-fshare-floating" id="itp-fshare" style="position:fixed; top:30px !important; left:50px !important;"></div><p>Lorem Ipsum è un testo segnaposto utilizzato nel settore della tipografia e della stampa. Lorem Ipsum è considerato il testo segnaposto standard sin dal sedicesimo secolo, quando un anonimo tipografo prese una cassetta di caratteri e li assemblòdei fogli di caratteri trasferibili “Letraset”,</p>
<p style="text-align: center;"><span class="easy_img_caption" style="display:inline-block;line-height:0.5;vertical-align:top;background-color:#F2F2F2;text-align:left;width:150px;float:left;margin:0px 10px;"><img src="/joomla/plugins/content/imagesresizecache/8428e9c26f1d8498ece730c0aa6aa023.jpeg" border="0" alt="1" title="1" style="width:150px; height:120px; ;margin:0;" /><span class="easy_img_caption_inner" style="display:inline-block;line-height:normal;color:#000000;font-size:8pt;font-weight:normal;font-style:normal;padding:4px 8px;margin:0px;">1</span></span></p>
<p>che contenevano passaggi del Lorem Ipsum, e più recentemente da software di impaginazione come Aldus PageMaker</p>
<!-- Disqus comments counter and anchor link -->
<a class="jwDisqusListingCounterLink" href="http://clouderize.it/joomla/index.php?option=com_content&view=article&id=8:recensione&catid=3:recensione-dei-servizi-di-cloud-computing&Itemid=4#disqus_thread" title="Add a comment">
Add a comment</a>
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) ends here -->
<div class="cp_tags">
<span class="cp_tag_label">Tags: </span><span class="cp_tag cp_tag_6">Recensioni
</span> </div>
With the code below I extract this element:
<span class="easy_img_caption......></span>
Code (I am using the phpQuery library, http://goo.gl/rSu3k):
include_once('includes/phpQuery.php');
$doc = phpQuery::newDocument($this->item->text);
$extraction=pq('.easy_img_caption:eq(0)')->htmlOuter();
echo "<textarea>" . $extraction . "</textarea>";
So my question is:
How can I remove the $extraction string from $this->item->text?
Thank you.

I assume phpQuery is a library that helps with DOM parsing in PHP.
You don't strictly need an external library for this; a regular expression replace can do it:
$text = preg_replace('/<span.*?class="[^"]*?easy_img_caption[^"]*?".*?>.*?<\/span>/s', '', $this->item->text);
echo "<textarea>" . $text . "</textarea>";
Note that the lazy .*?<\/span> stops at the first closing </span>. The sample span contains a nested easy_img_caption_inner span, so the outer </span> would be left behind in that case.
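For nested markup, removing the node through a DOM parser may be more reliable than a pattern. The following is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of phpQuery. The helper name removeFirstByClass and the wrapper id gr-wrap are hypothetical, chosen only for illustration, and the serialized output may differ in whitespace or attribute quoting from the input.

```php
<?php
// Hypothetical helper (not part of phpQuery): removes the first element
// whose class attribute contains the given class name as a whole token.
function removeFirstByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world, non-strict HTML
    // The meta tag hints UTF-8 to the parser; the wrapper div (assumed id)
    // lets us serialize only the original fragment afterwards.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div id="gr-wrap">' . $html . '</div>'
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $match = $xpath->query(sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    ))->item(0);
    if ($match !== null) {
        // Removing the node also removes everything nested inside it.
        $match->parentNode->removeChild($match);
    }

    // Serialize the wrapper's children back into an HTML fragment.
    $out = '';
    $wrap = $xpath->query('//div[@id="gr-wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = removeFirstByClass(
    '<p>a<span class="easy_img_caption"><span class="easy_img_caption_inner">1</span></span>b</p>',
    'easy_img_caption'
);
echo $text;
```

In the question's setting the call would be removeFirstByClass($this->item->text, 'easy_img_caption'). The class test matches whole tokens, so easy_img_caption_inner alone would not be selected.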

Related

Regex with negative lookahead and dot matches newline modifier (/s)

I have a PHP script and I need to match the last occurrence of a specific string.
Let's say I have the following scenarios:
1
<p class="TPTexto" style="text-autospace: none; ">
<font face="Arial" size="2" color="#FF0000">Este texto não substitui o publicado no DOU de 28.9.2006.</font>
</p>
2
Este texto abc def
<p class="TPTexto" style="text-autospace: none; ">
<font face="Arial" size="2" color="#FF0000">Este texto não substitui o publicado no DOU de 28.9.2006.</font>
</p>
3
Este texto abc def
<p class="TPTexto" style="text-autospace: none; ">
<font face="Arial" size="2" color="#FF0000">Este
texto não substitui o publicado no DOU de 28.9.2006.</font>
</p>
4
Este texto abc def
<p class="TPTexto" style="text-autospace: none; ">
<font face="Arial" size="2" color="#FF0000">Este <font></font>
texto não substitui o publicado no DOU de 28.9.2006.</font>
</p>
5
Este texto abc def
<p class="TPTexto" style="text-autospace: none; ">
<font face="Arial" size="2" color="#FF0000">Este texto não substitui o publicado no DOU de 28.9.2006.</font>
</p>
I want to match Este texto não substitui o publicado in all cases, accepting some occasional garbage in between, like Este <font></font>\ntexto não substitui o publicado.
So I went with the following regex:
/Este(?:.(?!Este))+?texto.+?n.+?o.+?substitui.+?o.+?publicado/uis
The flags:
u to treat the pattern and subject as UTF-8
i for case-insensitive matching
s to make the dot (.) match newlines (so my negative lookahead works)
This way I would match the last Este and the following text, as I want, right? Nope! The s modifier kills it.
(I'm using this PHP tool to test it btw)
I don't know why the s modifier kills it in this case. Any help would be much appreciated.
I'm using PHP's preg_match_all on this project.
Edit:
Noticed it wasn't clear: I need the SECOND Este texto... not the first.
Your regular expression is OK. You could simply prepend this to it:
\A.*\K
\A asserts the beginning of the input string
.* (with the s flag) first consumes the entire input, then backtracks until the next part of the pattern, Este, can match; this finds the last occurrence
\K resets the start of the reported match at that point, so only the desired string is returned
I removed the lookahead and made your regex a bit simpler. Putting it all together, we have this:
\A.*\KEste.+?texto.+?n.+?o.+?substitui.+?o.+?publicado
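As a rough check, the pattern can be run with preg_match against input modeled on scenario 4. This is only a sketch: the delimiters, the uis flags carried over from the question, and the shortened sample string are assumptions made for the example.

```php
<?php
// Sample input modeled on scenario 4 from the question (attributes shortened).
$subject = "Este texto abc def\n"
    . "<p class=\"TPTexto\">\n"
    . "<font color=\"#FF0000\">Este <font></font>\n"
    . "texto não substitui o publicado no DOU de 28.9.2006.</font>\n"
    . "</p>";

// The s flag lets .* run to the end of the input before backtracking;
// \K then drops everything consumed so far from the reported match.
$pattern = '/\A.*\KEste.+?texto.+?n.+?o.+?substitui.+?o.+?publicado/uis';

// $found holds the matched text, or null when nothing matched.
$found = preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
var_dump($found);
```

Because the greedy .* is anchored at \A, a single input yields at most one match, so preg_match is enough here; preg_match_all would return that same single match.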

Block elements inside "a" tag (htmlpurifier)

How can I prevent HTMLPurifier from breaking this code?
<a target="_blank" href="http://www.example" class="link_normal opacity">
<blockquote url="http://www.example" class="big">
<div class="center">
<div class="table_cell"><img src="/images/imagenes_urls/4241.jpg" class="img_url"></div>
</div>
<p style="color:#3b5998" class="b">Desde los 80 hasta 2015, así ha sido la impresionante evolución de los móviles</p>
<p>¿Cómo olvidarse de aquellos enormes objetos a los que llamábamos teléfonos móviles? Muchos de vosotros los recordar...</p>
<span class="dominio">andro4all.com</span>
</blockquote></a>
When I run it through the purifier, it turns into something like this:
<blockquote class="big">
<a class="link_normal opacity" href="http://www.example"></a>
<div class="center">
<a class="link_normal opacity" href="http://www.example"></a>
<div class="table_cell">
<a class="link_normal opacity" href="http://www.example">
<img alt="4241.jpg" class="img_url" src="/images/imagenes_urls/4241.jpg">
</a>
</div>
<a class="link_normal opacity" href="http://www.example"></a>
</div>
<a class="link_normal opacity" href="http://www.example"></a>
<p class="b" style="color:#3b5998;">
<a class="link_normal opacity" href="http://www.example">Desde los 80 hasta 2015, así ha sido la impresionante evolución de los móviles</a>
</p></blockquote>
You can use a trick: add JavaScript to your main tag, like this (HTMLPurifier strips event-handler attributes such as onclick by default, so this has to be applied after purification):
onclick="document.location='http://www.example'"
You can also set a pointer cursor so that it looks like a normal link:
style="cursor:pointer"
The result looks like this:
<blockquote url="http://www.example" class="big" onclick="document.location='http://www.example'" style="cursor:pointer">
<div class="center">
<div class="table_cell"><img src="/images/imagenes_urls/4241.jpg" class="img_url"></div>
</div>
<p style="color:#3b5998" class="b">Desde los 80 hasta 2015, así ha sido la impresionante evolución de los móviles</p>
<p>¿Cómo olvidarse de aquellos enormes objetos a los que llamábamos teléfonos móviles? Muchos de vosotros los recordar...</p>
<span class="dominio">andro4all.com</span>
</blockquote>
Wrapping the block-level blockquote element in an inline a element is not valid in HTML4.
You can find more information about this element here: http://www.w3.org/html/wg/drafts/html/master/grouping-content.html#the-blockquote-element
After experimentation, I just discovered a slick answer to this specific question. You can customize HTMLPurifier's behavior, as described at http://htmlpurifier.org/docs/enduser-customize.html.
Specifically, you can overwrite the default behavior of the 'a' anchor element by defining the element again, like this:
include_once('HTMLPurifier.auto.php');
$config = HTMLPurifier_Config::createDefault();
$def = $config->getHTMLDefinition(true);
// here is the magic method that overwrites the default anchor definition
$def->addElement(
    'a',      // element name
    'Inline', // the type of element: 'Block', 'Inline', or false, if it's a special case
    'Flow',   // what type of child elements are permitted: 'Empty', 'Inline', or 'Flow', which includes block elements like div
    'Common'  // permitted attributes
);
$purifier = new HTMLPurifier($config);
// $dirty_html is the html you want cleaned
echo $purifier->purify($dirty_html);
Now you can include block elements inside anchor tags, as allowed in the HTML5 specification (http://dev.w3.org/html5/markup/a.html).
For a more robust HTML5 solution, Christoffer Bubach offers a detailed HTMLPurifier configuration to allow newer HTML5 tags, though it doesn't redefine the anchor tag to permit block elements inside it. See his answer to this SO question: HTML filter that is HTML5 compliant

Elements In Array Being Ignored

For some strange reason 2 of the elements in my arrays are being ignored. They are...
ETIQUETAS XTRA MINI PARA OBJETOS and ETIQUETAS TERMOAHDESIVAS CLASICAS.
If the product page title is included in the array it renders an info box on the product page below the product image. This is true for all products in the array except the 2 mentioned above.
Below are the two array variables, followed by the code that renders the info boxes. Any help is greatly appreciated. By the way, this is WooCommerce on WordPress 4.0.
<?php
//for every label product.
$infoBox1Array = array('Pack Duo','Etiquetas Grandes','Etiquetas Navideñas','Pack Trio','ETIQUETAS PARA CUMPLEAÑOS SUPER PERSONALIZADA','LETREROS PARA CASAS','ETIQUETAS PARA CUMPLEAÑOS','PACK ZAPATOS FORMAS DE PIE','Pack Zapatos','Etiquetas Para Alergias Personalizadas','PACK FULL COLOR','Membretes','Etiquetas Termoadhesivas Mini (fondo blanco)','ETIQUETAS TERMOADHESIVAS FULL COLOR','ETIQUETAS TERMOAHDESIVAS CLASICAS','Saca & Pega','Stickers para Carros','ETIQUETAS PARA ALERGIAS','Etiquetas Kosher','Etiquetas Cocina','Etiquetas Minis','Pack mix 3','Pack mix 2','Pack mix 1','Etiquetas Redondas','Etiquetas Medianas','ETIQUETAS XTRA MINI PARA OBJETOS','Pack Xpress','Pack Guarderia','Pack Regreso al Cole');
//for every product that can be personalized.
$infoBox2Array = array('Pack Duo','Etiquetas Grandes','Etiquetas Navideñas','Pack Trio','ETIQUETAS PARA CUMPLEAÑOS SUPER PERSONALIZADA','LETREROS PARA CASAS','ETIQUETAS PARA CUMPLEAÑOS','PACK ZAPATOS FORMAS DE PIE','Pack Zapatos','Etiquetas Para Alergias Personalizadas','PACK FULL COLOR','Membretes','Etiquetas Termoadhesivas Mini (fondo blanco)','ETIQUETAS TERMOADHESIVAS FULL COLOR','ETIQUETAS TERMOAHDESIVAS CLASICAS','Saca & Pega','Etiquetas Minis','Pack mix 3','Pack mix 2','Pack mix 1','Etiquetas Redondas','Etiquetas Medianas','ETIQUETAS XTRA MINI PARA OBJETOS','Pack Xpress','Pack Guarderia','Pack Regreso al Cole');
?>
<div class="row">
<div>
<?php
include(TEMPLATEPATH . '/info-box-arrays.php');
if(is_single( $infoBox1Array )) {
echo '
<div style="background-color: #ffffff; border: solid 1px #CCCCCC;padding: 7px"><p style="text-align: justify">
El tamaño de la etiqueta y de letra son aproximados.<br/>
Los colores pueden variar de acuerdo a la configuración de su pantalla.
</p></div>
<div style="clear: both"> </div>';
}
if(is_single( $infoBox2Array )) {
echo '
<div style="background-color: #ffffff; border: solid 1px #CCCCCC;padding: 7px"><p style="text-align: justify">
<strong><span style="color: #ff0000;">Nota Importante:</span></strong> Los datos que escribe son los que serán procesados en su pedido.<br/>
Considerar acentos y mayusculas.
</p></div>';
}
?>
</div>
</div>
I replicated your code and created a post titled ETIQUETAS TERMOAHDESIVAS CLASICAS. It works fine, so I suspect there is nothing wrong with your array.
I think is_single is returning false for another reason: either the post's title differs slightly from the array entry, or you are trying to use it on a page rather than a post.
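One way to check for a slightly different title is to compare the stored title against the array entries and look at the nearest one. The sketch below is a plain-PHP debugging aid, not WordPress API: closestTitle is a hypothetical helper, and the differently spelled title passed to it is an assumed example, not something taken from the site.

```php
<?php
// Hypothetical debugging helper: returns the array entry closest to $title
// together with its edit distance (0 means the strings are identical).
function closestTitle(string $title, array $candidates): array
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        // levenshtein() counts single-byte insertions, deletions and substitutions.
        $distance = levenshtein($title, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return [$best, $bestDistance];
}

// A few entries copied from the arrays in the question.
$titles = [
    'Pack Duo',
    'ETIQUETAS TERMOAHDESIVAS CLASICAS',
    'ETIQUETAS XTRA MINI PARA OBJETOS',
];

// Assumed example: the title stored for the post is spelled slightly differently.
[$entry, $distance] = closestTitle('ETIQUETAS TERMOADHESIVAS CLASICAS', $titles);
printf("%s (distance %d)\n", $entry, $distance);
```

A small non-zero distance for the post's real title would point to a spelling, spacing, or case difference between the title and the array entry.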

PHP XPATH parse HTML how to get inner text BEFORE another nested tag

I have been using XPath to parse HTML without problems until I came across the markup below.
I usually read the textContent property once I have the td from an XPath query, but here I need only the text BEFORE the <img> tag.
<td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img onmouseover="caricaTool()" src="template/img/infoTip.png" width="17">
<div class="bottom" id='tooool'>
<div class="contenuto">
<div class="top">
<font class="testobold"><font class='testoblubold'>ZONA NON SERVITA QUOTIDIANAMENTE - PROSSIMA CONSEGNA </font><br>La località di destinazione non è tra quelle servite quotidianamente da SDA. La consegna avverrà al più presto possibile, compatibilmente con le operazioni logistiche.</font>
<p> <br><u>Chiudi</u>
</div>
</div>
</div>
</td>
You can probably use:
//td[@class="rowdispari"][img[@src="template/img/infoTip.png"]]/text()[1]
or:
//td[@class="rowdispari"]/text()[following-sibling::img[@src="template/img/infoTip.png"]][1]
Assuming that you already have an XPath expression that selects the outer <td> element, you can simply append /text()[1] to it to get the first text node that is a direct child of that <td> element:
path_to_td_here/text()[1]
A more concrete example:
//td[@class='rowdispari']/text()[1]
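For reference, here is a minimal sketch of running such a query with PHP's DOMDocument and DOMXPath. The HTML is a shortened version of the snippet from the question, wrapped in an assumed table and tr so that it parses as a complete table.

```php
<?php
// Shortened sample based on the markup in the question.
$html = <<<'HTML'
<table><tr><td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img onmouseover="caricaTool()" src="template/img/infoTip.png" width="17">
<div class="bottom" id="tooool">details</div>
</td></tr></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings for non-strict HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// text()[1] selects the first text node that is a direct child of the td.
$nodes = $xpath->query('//td[@class="rowdispari"]/text()[1]');

// The text node still carries the surrounding newlines, so trim it.
$text = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
var_dump($text);
```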

How to scrape the external URL using PHP Simple HTML DOM Parser

<div id="lyrics">
<img />
<span id="line_31" class="line line-s">En medio de este tropico mortal</span>
<br>
<span id="line_32" class="line line-s">Roots and creation, come again!</span>
<br>
<span id="line_33" class="line line-s">So mi guardian, mi guardian mi lift up di plan</span>
<span id="line_34" class="line line-s">Now everybody a go' do dis one</span>
<span id="line_35" class="line line-s">Like in down di Caribbean</span>
<span id="line_36" class="line line-s">San Andrés, Providence Island</span>
<br>
</div>
Here I have a div containing multiple span tags with br tags between them. I want to scrape the span text and the br tags as they are. How can I do this with PHP Simple HTML DOM Parser?
Thanks for any help.
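Let's say the HTML above is saved in a file called "index.html".
include_once('simple_html_dom.php');
$html = file_get_html("index.html");
// find() returns an array unless an index is given; 0 selects the first match
$element = $html->find('div#lyrics', 0);
$result = $element->innertext;
For more, consult the manual: http://simplehtmldom.sourceforge.net/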
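If only the span text and the br tags are wanted, without the img or the span markup, the children can be walked one by one. The sketch below swaps in PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and uses a shortened copy of the sample markup; joining the pieces with newlines is an assumption about the desired output.

```php
<?php
// Shortened sample based on the markup in the question.
$html = <<<'HTML'
<div id="lyrics">
<img />
<span id="line_31" class="line line-s">En medio de este tropico mortal</span>
<br>
<span id="line_32" class="line line-s">Roots and creation, come again!</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings for non-strict HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$parts = [];
// Select only span and br children of the lyrics div, in document order.
foreach ($xpath->query('//div[@id="lyrics"]/*[self::span or self::br]') as $node) {
    // Keep br tags as markup and reduce spans to their text.
    $parts[] = $node->nodeName === 'br' ? '<br>' : trim($node->textContent);
}
$result = implode("\n", $parts);
echo $result;
```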
