How replace all spaces inside HTML PRE elements with - php

Similar to How replace all spaces inside HTML elements with using preg_replace?
Except I only want to modify spaces found between PRE tags. For example:
<table atrr="zxzx"><tr>
<td>adfa a adfadfaf></td><td><br /> dfa dfa</td>
</tr></table>
<pre class="abc" id="abc">abc abc</pre>
<pre>123 123</pre>
would be converted to (note the pre tag may contain attributes, or may not):
<table atrr="zxzx"><tr>
<td>adfa a adfadfaf></td><td><br /> dfa dfa</td>
</tr></table>
<pre class="abc" id="abc">abc abc</pre>
<pre>123 123</pre>

$html = preg_replace(
'#(\<pre[^>]*>)(.*)(</pre>)#Umie'
, "'$1'.str_replace(' ', ' ', '$2').'$3'"
, $html);
Has been tested, works with the sample string you provided. It's ungreedy, you don't want to replace spaces between </pre> and <pre>. Also works if the <pre></pre> section spans several lines.
Note: this will fail if you have nested situations like <pre> <pre> </pre> </pre>. If you want to be able to parse that, you need to parse the (X)HTML using the Document Object Model.
Update:
I have done some benchmarking and it turns out the callback version is faster by about 1 second per 100,000 iterations, so I think I should also mention that option.
$html = preg_replace_callback(
'#(\<pre[^>]*>)(.*)(</pre>)#Uim'
, function($matches){
return $matches[1].str_replace(' ', ' ', $matches[2]).$matches[3];
}
, $html);
This requires PHP 5.3 or newer, earlier versions do not support anonymous functions.

do
$html = preg_replace('/(<pre.*>.*) (.*<\/pre>)/', '$1 $2', $html, 1, $count);
while($count);
echo $html;
I'm not sure if there's a better solution. I'm not very familiar with all the preg functions.

Related

How to wrap every word in spans with PHP?

I have a some html paragraphs and I want to wrap every word in . Now I have
$paragraph = "This is a paragraph.";
$contents = explode(' ', $paragraph);
$i = 0;
$span_content = '';
foreach ($contents as $c){
$span_content .= '<span>'.$c.'</span> ';
$i++;
}
$result = $span_content;
The above codes work just fine for normal cases, but sometimes the $paragraph would contains some html tags, for example
$paragraph = "This is an image: <img src='/img.jpeg' /> This is a <a href='/abc.htm'/>Link</a>'";
How can I not wrap "words" inside html tag so that the htmnl tags still works but have the other words wrapped in spans? Thanks a lot!
Some (*SKIP)(*FAIL) mechanism?
<?php
$content = "This is an image: <img src='/img.jpeg' /> ";
$content .= "This is a <a href='/abc.htm'/>Link</a>";
$regex = '~<[^>]+>(*SKIP)(*FAIL)|\b\w+\b~';
$wrapped_content = preg_replace($regex, "<span>\\0</span>", $content);
echo $wrapped_content;
See a demo on ideone.com as well as on regex101.com.
To leave out the Link as well, you could go for:
(?:<[^>]+> # same pattern as above
| # or
(?<=>)\w+(?=<) # lookarounds with a word
)
(*SKIP)(*FAIL) # all of these alternatives shall fail
|
(\b\w+\b)
See a demo for this on on regex101.com.
The short version is you really do not want to attempt this.
The longer version: If you are dealing with HTML then you need an HTML parser. You can't use regexes. But where it becomes even more messy is that you are not starting with HTML, but with an HTML fragment (which may, or may not be well-formed. It might work if Hence you need to use an HTML praser to identify the non-HTML extents, separate them out and feed them into a secondary parser (which might well use regexes) for translation, then replace the translted content back into the DOM before serializing the document.

how do I use preg_replace_callback on unknown values?

I got some great help today with starting to understand preg_replace_callback with known values. But now I want to tackle unknown values.
$string = '<p id="keepthis"> text</p><div id="foo">text</div><div id="bar">more text</div><a id="red"href="page6.php">Page 6</a><a id="green"href="page7.php">Page 7</a>';
With that as my string, how would I go about using preg_replace_callback to remove all id's from divs and a tags but keeping the id in place for the p tag?
so from my string
<p id="keepthis"> text</p>
<div id="foo">text</div>
<div id="bar">more text</div>
<a id="red"href="page6.php">Page 6</a>
<a id="green"href="page7.php">Page 7</a>
to
<p id="keepthis"> text</p>
<div>text</div>
<div>more text</div>
Page 6
Page 7
There's no need of a callback.
$string = preg_replace('/(?<=<div|<a)( *id="[^"]+")/', ' ', $string);
Live demo
However in the use of preg_replace_callback:
echo preg_replace_callback(
'/(?<=<div|<a)( *id="[^"]+")/',
function ($match)
{
return " ";
},
$string
);
Demo
For your example, the following should work:
$result = preg_replace('/(<(a|div)[^>]*\s+)id="[^"]*"\s*/', '\1', $string);
Though in general you'd better avoid parsing HTML with regular expressions and use a proper parser instead (for example load the HTML into a DOMDocument and use the removeAttribute method, like in this answer). That way you can handle variations in markup and malformed HTML much better.

strip out <p> tags which is inside another tag

I need to strip out <p> tags which is inside a pre tag, How can i do this in php? My code will be like this:
<pre class="brush:php;">
<p>Guna</p><p>Sekar</p>
</pre>
I need text inside <p> tags, need to remove only <p> </p> tag.
This could be done with a single regex, this was tested in powershell but should work for most regex which supports look arounds
$string = '<pre class="brush:php;"><p>Guna</p><p>Sekar</p></pre><pre class="brush:php;"><p>Point</p><p>Miner</p></pre>'
$String -replace '(?<=<pre.*?>[^>]*?)(?!</pre)(<p>|</p>)(?=.*?</pre)', ""
Yields
<pre class="brush:php;">GunaSekar</pre><pre class="brush:php;">PointMiner</pre>
Dissecting the regex:
the first lookahead validates there is a pre tag before the current match
the second lookaround validates there was no /pre tag between the pre tag and the match
test for both p and /p
look around to ensure there is a closing /pre tag
You could use basic Regexp.
<?php
$str = <<<STR
<pre class="brush:php;">
<p>Guna</p><p>Sekar</p>
</pre>
STR;
echo preg_replace("/<[ ]*p( [^>]*)?>|<\/[ ]*p[ ]*>/i", " ", $str);
You can try the following code. It runs 2 regex commands to list all the <p> tags inside <pre> tags.
preg_match('/<pre .*?>(.*?)<\/pre>/s', $string, $matches1);
preg_match_all('/<p>.*?<\/p>/', $matches1[1], $ptags);
The matching <p> tags will be available in $ptags array.
You could use preg_replace_callback() to match everything that's in a <pre> tag and then use strip_tags() to remove all html tags:
$html = '<pre class="brush:php;">
<p>Guna</p><p>Sekar</p>
</pre>
';
$removed_tags = preg_replace_callback('#(<pre[^>]*>)(.+?)(</pre>)#is', function($m){
return($m[1].strip_tags($m[2]).$m[3]);
}, $html);
var_dump($removed_tags);
Note this works only with PHP 5.3+
It looked like simple work, but it took hours to find a way. This is what i done:
Downloaded simple dom parser from source forge
Traversed each <pre> tag and strip out <p> tags
Rewrite the content into <pre> tag
Retrive modified content
Here is full code:
include_once 'simple_html_dom.php';
$text='<pre class="brush:php;"><p>Guna</p><p>Sekar</p></pre>';
$html = str_get_html($text);
$strip_chars=array('<p>','</p>');
foreach($html->find('pre') as $element){
$code = $element->getAttribute('innertext');
$code=str_replace($strip_chars,'',$code);
$element->setAttribute('innertext',$code);
}
echo $html->root->innertext();
This will output:
<pre class="brush:php;">GunaSekar</pre>
Thanks for all your suggestions.

Regex Replace with Backreference modified by functions

I want to replace the class with the div text like this :
This: <div class="grid-flags" >FOO</div>
Becomes: <div class="iconFoo" ></div>
So the class is changed to "icon". ucfirst(strtolower(FOO)) and the text is removed
Test HTML
<div class="grid-flags" >FOO</div>
Pattern
'/class=\"grid-flags\" \>(FOO|BAR|BAZ)/e'
Replacement
'class="icon'.ucfirst(strtolower($1).'"'
This is one example of a replacement string I've tried out of seemingly hundreds. I read that the /e modifier evaluates the PHP code but I don't understand how it works in my case because I need the double quotes around the class name so I'm lost as to which way to do this.
I tried variations on the backref eg. strtolower('$1'), strtolower('\1'), strtolower('{$1}')
I've tried single and double quotes and various escaping etc and nothing has worked yet.
I even tried preg_replace_callback() with no luck
function callback($matches){
return 'class="icon"'.ucfirst(strtolower($matches[0])).'"';
}
It was difficult for me to try to work out what you meant, but I think you want something like this:
preg_replace('/class="grid-flags" \>(FOO|BAR|BAZ)/e',
'\'class="icon\'.ucfirst(strtolower("$1")).\'">\'',
$text);
Output for your example input:
<div class="iconFoo"></div>
If this isn't what you want, could you please give us some example inputs and outputs?
And I have to agree that this would be easier with an HTML parser.
Instead of using the e(valuate) option you can use preg_replace_callback().
$text = '<div class="grid-flags" >FOO</div>';
$pattern = '/class="grid-flags" >(FOO|BAR|BAZ)/';
$myCB = function($cap) {
return 'class="icon'.ucfirst($cap[1]).'" >';
};
echo preg_replace_callback($pattern, $myCB, $text);
But instead of using regular expressions you might want to consider a more suitable parser for html like simple_html_dom or php's DOM extension.
This works for me
$html = '<div class="grid-flags" >FOO</div>';
echo preg_replace_callback(
'/class *= *\"grid-flags\" *\>(FOO|BAR|BAZ)/'
, create_function( '$matches', 'return \'class="icon\' . ucfirst(strtolower($matches[1])) .\'">\'.$matches[1];' )
, $html
);
Just be aware of the problems of parsing HTML with regex.

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
If you use preg_match, use the "U" flag to make it un-greedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, &$matches);
$matches[1] will then contain the first paragraph.
It would probably be easier and faster to use strpos() to find the position of the first
<p>
and first
</p>
then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
Using Regular Expressions for html parsing is never the right solution. You should be using XPATH for this particular case:
$string = <<<XML
<a>
<b>
<c>texto</c>
<c>cosas</c>
</b>
<d>
<c>código</c>
</d>
</a>
XML;
$xml = new SimpleXMLElement($string);
/* Busca <a><b><c> */
$resultado = $xml->xpath('//p[1]');

Categories