I have a text string mixed with HTML and need to pick out only the words and wrap each one in a span.
String:
$string ='<div class="something">What </div> if it it is is same <div style="color:red;">same </div>';
desired output
<div class="something">
<span class="splits split1">
What
</span>
</div>
<span class="splits split2">
if
</span>
<span class="splits split3">
it
</span>
<span class="splits split4">
it
</span>
<span class="splits split5">
is
</span>
<span class="splits split6">
is
</span>
<span class="splits split7">
same
</span>
<div style="color:red;">
<span class="splits split8">
same
</span>
</div>
I tried everything I could think of: preg_replace with word boundaries, preg_match, str_replace, and combinations of explode, loops and replace. One way or another the output fails: it either breaks the HTML or adds the replacement where it should not be.
Any help is appreciated.
As @MarcB said, life is tastier with DOM here, but since the question is tagged regex this could be a workaround (not 100% guaranteed though):
</?\b[^<>]+>(*SKIP)(*F)|\w+
PHP:
echo preg_replace_callback('~</?\b[^<>]+>(*SKIP)(*F)|\w+~', function($matches) {
    static $counter = 0;
    return "<span class=\"splits split".(++$counter)."\">{$matches[0]}</span>";
}, $string);
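For comparison, here is a minimal sketch of the DOM route mentioned above, using PHP's built-in DOMDocument. The function name wrapWordsInSpans and the wrap-root container id are made up for illustration, and encoding handling is left out, so non-ASCII input would need an explicit charset hint. It walks the text nodes only, so tags and attributes are never touched.

```php
<?php
// Hypothetical helper (not from the original answer): wraps every word found
// in a text node in a numbered span, leaving tags and attributes untouched.
function wrapWordsInSpans(string $html): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be extracted again.
    $doc->loadHTML('<div id="wrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);
    $counter = 0;

    // Snapshot the text nodes first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $textNode) {
        $fragment = $doc->createDocumentFragment();
        // Split into alternating runs of word and non-word characters.
        $parts = preg_split(
            '/(\w+)/u',
            $textNode->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
        );
        foreach ($parts as $part) {
            if (preg_match('/^\w+$/u', $part)) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'splits split' . (++$counter));
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                // Whitespace and punctuation are copied through unchanged.
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the container, not the container itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling wrapWordsInSpans($string) with the string from the question should give eight numbered spans, without the line breaks shown in the desired output.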
The preg_replace in this code is from another answer on this site, and is supposed to replace all spans and their contents with nothing, i.e. remove all spans and their content. But why doesn't it work?
$string = <<<STR
<span>Span 1</span>
<div>DIV 1</div><div class="text">DIV 2</div>
<span> Span 2 </span>
<div class="apache">DIV 3</div>
<span>
Span 3
</span>
<span>
Span 4
</span>
STR;
$string = preg_replace("/<span[^>]+\>/is","",$string);
echo $string;
The end result I was hoping to get is:
<div>DIV 1</div><div class="text">DIV 2</div>
<div class="apache">DIV 3</div>
// no spans and their content
// everything else remains
[^>]+ matches one or more characters, so a bare <span> won't match. Use * there instead. Then, because you want to remove everything in between as well, match the content lazily with .*? followed by the </span> closing tag. The s modifier lets . match newlines, which the multi-line spans need. /<span[^>]*>.*?<\/span>/is becomes our final regex.
<?php
$string = <<<STR
<span>Span 1</span>
<div>DIV 1</div><div class="text">DIV 2</div>
<span> Span 2 </span>
<div class="apache">DIV 3</div>
<span>
Span 3
</span>
<span>
Span 4
</span>
STR;
$string = preg_replace("/<span[^>]*>.*?<\/span>/is","",$string);
echo $string;
You can see the output here: https://3v4l.org/X9C53
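The difference between + and * can be checked in isolation. The snippet below is a small sketch with a shortened stand-in for the heredoc from the question; the variable names are illustrative, and it assumes the spans are not nested, since a lazy match stops at the first closing tag it finds.

```php
<?php
// Shortened stand-in for the heredoc in the question.
$string = "<span>Span 1</span>\n<div>DIV 1</div>\n<span class=\"note\">\nSpan 2\n</span>";

// With +, at least one character must sit between "<span" and ">".
$plusMatches = preg_match('/<span[^>]+>/', '<span>');
// With *, zero characters are fine, so the bare tag matches.
$starMatches = preg_match('/<span[^>]*>/', '<span>');

// Lazy .*? plus the s modifier removes each span together with its content.
$cleaned = trim(preg_replace('/<span[^>]*>.*?<\/span>/is', '', $string));

var_dump($plusMatches, $starMatches, $cleaned);
```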
I have a specific string that is:
<span style="color: green;">
I want a function that finds each instance of this string and then checks whether the first 2 characters of what follows it are </ (that is, whether a closing tag comes next).
I have the idea of searching it character by character, but it would take too long. Is there any shorter solution to it?
Input:
ab <span style="color: green;"> </strong>
Output:
ab </strong> <span style="color: green;">
The strong tag is just an example; it could be </b>, </i>, </li> or any other closing tag.
You can use preg_replace for this, i.e.:
$myHtml = <<< LOL
ab <span style="color: green;"> </strong>
LOL;
$myHtml = preg_replace('%(<span style="color: green;">)(?:\s+)?(</.*?>)%i', '$2 $1', $myHtml);
echo $myHtml;
//ab </strong> <span style="color: green;">
It will work with any closing tag that comes directly after the span (optionally separated by whitespace).
DEMO:
http://sandbox.onlinephpfunctions.com/code/9dc934ece66856a92b041114140982dc822a6bec
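If the opening tag is not always the same green span, the same idea can be wrapped in a small function. This is only a sketch: moveClosingTagBefore is a hypothetical name, and it assumes the opening tag is known as an exact string, which preg_quote then escapes for use inside the pattern.

```php
<?php
// Hypothetical helper: moves a closing tag that directly follows $openingTag
// (optionally separated by whitespace) in front of it.
function moveClosingTagBefore(string $html, string $openingTag): string
{
    // preg_quote() escapes regex metacharacters in the literal tag;
    // "~" is passed as the delimiter so it gets escaped as well.
    $pattern = '~(' . preg_quote($openingTag, '~') . ')\s*(</[a-z][a-z0-9]*\s*>)~i';
    return preg_replace($pattern, '$2 $1', $html);
}

echo moveClosingTagBefore(
    'ab <span style="color: green;"> </strong>',
    '<span style="color: green;">'
), PHP_EOL;
```

Input where no closing tag follows the span is returned unchanged.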
I want to get all the text between the <p> and <h3> tags in the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.
It looks like you have div elements contained within your p element, which is not valid and is messing things up. If you use a var_dump in the loop you can see that it does actually pick up the node, but the nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("div[class=bodyText]") as $parent) {
    foreach ($parent->children() as $child) {
        if ($child->tag == 'p' || $child->tag == 'h3') {
            // remove the inner text of divs contained within this p or h3 element
            foreach ($child->find('div') as $e) {
                $e->innertext = '';
            }
            echo $child->plaintext . '<br>';
        }
    }
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a location path is not allowed in your processor, then you can use your syntax as before or, a little bit simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
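PHP's DOMXPath implements XPath 1.0, where neither a parenthesized union inside a step nor a comparison against a sequence is available. A sketch of an XPath 1.0 equivalent uses the self axis instead. The markup below is a simplified, well-formed stand-in for the page in the question (no div nested inside p), so it sidesteps the invalid-nesting problem discussed earlier.

```php
<?php
// Simplified, well-formed stand-in for the markup in the question.
$html = '<div class="bodyText">'
      . '<p><span class="caption">skip me</span>I want this TEXT</p>'
      . '<h3>I want this TEXT</h3>'
      . '<p>I want this TEXT<br></p>'
      . '<blockquote>not this</blockquote>'
      . '</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// self::p or self::h3 is valid XPath 1.0; text() picks only the direct text children.
$nodes = $xpath->query("//div[contains(@class, 'bodyText')]/*[self::p or self::h3]/text()");

$texts = [];
foreach ($nodes as $node) {
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $texts[] = $value;
    }
}
print_r($texts);
```

The text inside the nested span is skipped because text() only selects text nodes that are direct children of the p or h3.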
I need to access the 48.20 Lac(s) text, which sits directly inside the div without any tag of its own; that is why I'm not able to select it.
I need to find this in a PHP file. I've tried $html->find('div.priceDetail') followed by trim(strip_tags($result)), which gave me 48.20 Lac(s) plus unnecessary text.
Since I have to build a generic file, I can't depend on exploding and imploding for one particular fixed case.
<div class="priceDetail">
<b>Total Price :</b>
<img alt="" src="someimage">48.20 Lac(s)
<!-- Per Sq Ft Price -->
<span class="pricePerSqFt">(Price per sq.ft. : Rs. 3,679)</span>
<!-- Code for price Trends -->
<span class="priceGrowth">4 %
<img alt="" src="someimage"
align="absmiddle">
<span class="iconWhatisThis">
<img src="someimage"
class="whatIcon" align="absmiddle">
<span style="" id="StoolTip" class="price_main-c"></span>
</span>
</span>
<div class="tt_top-c">
<span class="priceGrowth"></span>
</div>
<div class="tt_mid-c">
<div class="tt_pointer-c"></div>
<div>
<span class="tt_txt-c">Per sq.ft. price for this property is
<b>higher than the average</b>property price in this locality as per MagicBricks.com
Price Trends.</span>
</div>
<span class="tt_txt-c">
<span class="tp_txt">To know more about this
<a href="#priceTrends" onclick="swithTab('priceTrends', tabbedDivArray);">Click
Here</a>
</span>
</span>
</div>
<div class="tt_bot-c"></div>
</div>
Do as much work as you can with a DOM Parser and then when left with your random load of text, pull the bit out you want with this RegEx:
([0-9]{1,5}?\.[0-9]{2} Lac\(s\))
Result
48.20 Lac(s)
(Change the 5 in the RegEx to the number of digits you want to allow before the decimal point)
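Applied in PHP, that could look like the sketch below. The $text value is a made-up stand-in for whatever is left after strip_tags(); only the pattern comes from the answer.

```php
<?php
// Stand-in for the text left over after strip_tags() on the priceDetail div.
$text = "Total Price : 48.20 Lac(s) (Price per sq.ft. : Rs. 3,679) 4 %";

$price = null;
if (preg_match('/([0-9]{1,5}?\.[0-9]{2} Lac\(s\))/', $text, $matches)) {
    // $matches[1] holds the first capture group, i.e. the price itself.
    $price = $matches[1];
}
var_dump($price);
```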
Here is a solution with DOMDocument, probably more robust than regex:
$DOM = new DOMDocument;
$DOM->loadHTML($str);
//Get all the image tags
$elem = $DOM->getElementsByTagName('img');
//Get the first Image
$first = $elem->item(0);
//Get the node after the image
$txt= $first->nextSibling;
//Get the text
echo $txt->nodeValue;
Of course it requires that the text is always located after the first image in the div.
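The same dependency on position can be expressed in a single XPath query, which also trims the surrounding whitespace. This is a sketch against a simplified stand-in for the markup in the question; it still assumes the price is the first text node after the first img inside div.priceDetail.

```php
<?php
// Simplified stand-in for the markup in the question.
$str = '<div class="priceDetail"><b>Total Price :</b> '
     . '<img alt="" src="someimage">48.20 Lac(s) <!-- Per Sq Ft Price --> '
     . '<span class="pricePerSqFt">(Price per sq.ft. : Rs. 3,679)</span></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($str);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// evaluate() returns a string here because normalize-space() yields one.
$price = $xpath->evaluate(
    'normalize-space(//div[@class="priceDetail"]/img[1]/following-sibling::text()[1])'
);
echo $price, PHP_EOL;
```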
Hi, I need to remove the link from a screen-scraped site. Here is the output source:
<div class="FourDayForecastContainerInner">
<span class="day">Friday</span>
<a href="forecastPublicExtended.asp#Period4" target="_blank">
<img src="./images/wimages/b_rain.gif" class="thumbnail">
</a>
<span class="hi">
<span style="width:24px;">Hi</span>
19 / 66
</span>
<span class="lo">
<span style="width:24px;">Lo</span>
16 / 60
</span>
<span class="description">
Sunny Breaks, showers
</span>
</div>
<div class="FourDayForecastContainerInner">
<span class="day">Saturday</span>
And here is my code. I'm using phpQuery:
$doc = phpQuery::newDocumentHTML( $e );
$containers = pq('.FourDayForecastContainerInner', $doc);
foreach( $containers as $container ) {
    $div = pq('span', $container);
    $img = pq('img', $container);
    $div->eq(0)
        ->removeAttr('style')
        ->addClass('day')
        ->html(
            pq( 'u', $div->eq(0) )
                ->html()
        );
    $img->eq(0)
        ->removeAttr('style')
        ->removeAttr('height')
        ->removeAttr('width')
        ->removeAttr('alt')
        ->addClass('thumbnail')
        ->html( pq( 'img', $img->eq(0)) );
    $div->eq(1)
        ->removeAttr('style')
        ->addClass('hi');
    $div->eq(3)
        ->removeAttr('style')
        ->addClass('lo');
    $div->eq(5)
        ->removeAttr('style')
        ->addClass('description');
}
print $doc;
I have managed to remove all the attributes (style, height, width, etc.), but I can't seem to remove the a href.
Thank you so much for your help.
I tried it with your sample code and it works. This is the output:
<div class='FourDayForecastContainerInner'>
<span class='day'>Friday</span>
<img src='./images/wimages/b_rain.gif' class='thumbnail'>
<span class='hi'>
<span style='width:24px;'>Hi</span>
19 / 66
</span>
<span class='lo'>
<span style='width:24px;'>Lo</span>
16 / 60
</span>
<span class='description'>
Sunny Breaks, showers
</span>
</div>
<div class='FourDayForecastContainerInner'>
<span class='day'>Saturday</span>
The way you are doing it is too long and tiresome. Use regular expressions to remove the link tags.
$html = 'Your HTML CODE HERE';
$exp = '~<a\b[^>]*>~i';
$html = preg_replace($exp, "", $html);
$exp = '~</a>~i';
$html = preg_replace($exp, "", $html);
echo $html;
This removes the <a> tags completely and keeps whatever they wrapped.
Does the following code do what you want? (Add it at the end of the foreach loop.)
$imghtml = pq('a', $container)->html();
pq($container)->prepend($imghtml);
pq('a', $container)->remove();
Note: phpQuery doesn't seem to support jQuery's detach().
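Where phpQuery is not available, a similar unwrap can be sketched with PHP's built-in DOM extension. The function name unwrapLinks and the unwrap-root container id are invented for this example. Unlike prepend(), it leaves the image where the link used to be.

```php
<?php
// Hypothetical helper: removes every <a> element but keeps its children in place.
function unwrapLinks(string $html): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be extracted again.
    $doc->loadHTML('<div id="unwrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="unwrap-root"]')->item(0);

    // Snapshot the links first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query('.//a', $root)) as $link) {
        // Move each child in front of the link, then drop the empty link.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize only the children of the container, not the container itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```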
I ran into the same question and I wanted to share my solution. My goal was to remove all tags from the title portion of some SoundCloud embed code. The HTML looked like this:
<object height="81" width="100%">
... a bunch of embed code ...
</object>
<span>
Mike Ink _ Silver
by
MINIMAL
</span>
At the end of the HTML above, you can see that the title has not only one but two links around it. My goal was to strip those out.
Assuming that HTML is assigned to the PHP variable $text, here's how I did it:
$doc = phpQuery::newDocument($text);
$soundcloud_title = strip_tags((string) $doc->find('span'));
print($soundcloud_title);
// outputs: Mike Ink _ Silver by MINIMAL
I know that this doesn't directly answer the question. In fact, I'm using strip_tags to remove the links instead of using phpquery, but I hoped it might help other coders who are looking for the same answers I was.
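The strip_tags step on its own can be tried without phpQuery. In this sketch the two links and their "#" targets are placeholders, since the original hrefs are not shown in the post.

```php
<?php
// Placeholder markup: the real link targets are not shown in the post.
$span = '<span><a href="#">Mike Ink _ Silver</a> by <a href="#">MINIMAL</a></span>';

// strip_tags() drops every tag and keeps only the text between them.
$title = strip_tags($span);
echo $title, PHP_EOL;
```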
Happy coding!