replacing next occurrence of tag [duplicate] - php

This question already has an answer here:
find and replace keywords by hyperlinks in an html fragment, via php dom
(1 answer)
Closed 8 years ago.
I've got a large string with some markup in it that I want to change so that it works with FPDF.
<span style="text-decoration: underline;">some text</span>
I need to replace the tags here with
<i>some text</i>
However, a simple str_replace() won't work, because there are other span tags that should not be replaced. I need something that finds <span style="text-decoration: underline;">
and then looks for the next occurrence of </span> and replaces only that one. I haven't got the slightest clue how to do this. I've looked at http://us.php.net/strpos but I'm not sure how to apply it, or whether it is the right solution. Can anyone give me some pointers?
Thanks.

This should do the trick:
<?php
$in = '<span>Invalid</span><span style="text-decoration: underline;">some text</span><span>Invalid</span>';
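// Match only the underline span; [^<]+ stops at the next tag, so the </span> that follows is the one closing it.
$out = preg_replace('#<span style="text-decoration: underline;">([^<]+)</span>#', '<i>\1</i>', $in);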
echo $out;
?>
You can also restrict which text you look for inside the tag, for example only alphanumerics and whitespace:
<?php
$in = '<span>Invalid</span><span style="text-decoration: underline;">some text</span><span>Invalid</span>';
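// Inside a character class "|" is a literal pipe, so it is dropped here: [\w\s] means word characters or whitespace.
$out = preg_replace('#<span style="text-decoration: underline;">([\w\s]+)</span>#', '<i>\1</i>', $in);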
echo $out;
?>
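The strpos() idea mentioned in the question can also be sketched without any regex. This is a minimal sketch, assuming the underline spans never contain nested span tags and that the opening tag is always written exactly as shown; the function name underlineSpansToItalics is hypothetical.

```php
<?php
// Hypothetical helper: turn each underline span into <i>...</i> using strpos() only.
function underlineSpansToItalics(string $html): string
{
    $open   = '<span style="text-decoration: underline;">';
    $close  = '</span>';
    $offset = 0;

    while (($start = strpos($html, $open, $offset)) !== false) {
        // Look for the next closing tag after the opening tag we just found.
        $end = strpos($html, $close, $start + strlen($open));
        if ($end === false) {
            break; // unbalanced markup: leave the remainder untouched
        }
        $inner       = substr($html, $start + strlen($open), $end - $start - strlen($open));
        $replacement = '<i>' . $inner . '</i>';
        $html        = substr_replace($html, $replacement, $start, $end + strlen($close) - $start);
        $offset      = $start + strlen($replacement);
    }

    return $html;
}

$in = '<span>Invalid</span><span style="text-decoration: underline;">some text</span><span>Invalid</span>';
echo underlineSpansToItalics($in);
```

Plain spans are never touched because the search string includes the full style attribute.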

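$dom = new DOMDocument;
$dom->loadHTML($html);
// Copy the live node list first: replacing nodes while iterating it directly would skip elements.
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $node) {
    // Only convert spans that carry the underline style.
    if (strpos($node->getAttribute('style'), 'text-decoration: underline') === false) {
        continue;
    }
    $i = $dom->createElement('i');
    // Move the span's children into the new <i> element.
    while ($node->firstChild) {
        $i->appendChild($node->firstChild);
    }
    $node->parentNode->replaceChild($i, $node);
}
// Strip the doctype and the html/body wrappers that loadHTML() adds.
$out = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace(array('<html>', '</html>', '<body>', '</body>'), '', $dom->saveHTML()));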

Related

Regular expression for finding html tags [duplicate]

This question already has answers here:
Regex select all text between tags
(23 answers)
Closed 2 years ago.
I'm trying to write a function that finds each substring of a string where the substring is some HTML tag, for example
<li>
But my regular expression doesn't work and I can't find my mistake.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$items = preg_match_all('/(<li>\w+<\/li>)', $str, $matches);
$items should end up as an array of the desired substrings.
Consider using DOMDocument to parse and manipulate HTML or XML tags. Do not reinvent the wheel with Regex.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$li = $dom->getElementsByTagName('li');
$value = $li->item(0)->nodeValue;
echo $value;
' hello'
Or if you want to iterate over all
foreach($li as $item)
echo $item->nodeValue, PHP_EOL;
' hello'
'how are you?'
Markus' answer is correct, but in case you just want the quick and dirty regex version, here it is:
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
preg_match_all('/(<li>.+<\/li>)/U', $str, $items);
The U modifier makes the quantifiers ungreedy, so each match stops at the first </li>.
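The pattern in the question fails for two reasons: it has no closing / delimiter, and \w+ does not match the spaces or the question mark inside the list items. A minimal sketch of a corrected call, using a lazy .*? instead of the U modifier (the $items assignment from $matches[0] is an illustration, not part of the original answers):

```php
<?php
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';

// Closing delimiter added; .*? is lazy, so each match ends at the nearest </li>.
preg_match_all('/<li>.*?<\/li>/s', $str, $matches);

// preg_match_all() returns the number of matches; the substrings are in $matches[0].
$items = $matches[0];
print_r($items);
```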

How to replace all specific strings between specific strings? [duplicate]

This question already has answers here:
replace all "foo" between ()
(3 answers)
Closed 7 years ago.
I would like to replace every \n inside <pre></pre> with a placeholder. This is what I came up with:
<?php
$html = "<div>\n<pre id=foo>Foo\n\nBar Bar\nFoo Foo</pre>\n\n</div>";
echo preg_replace("/(<pre[^>]*>[^<]*)(\n)([^<]*<\/pre)/", "$1{NEWLINE}$3", $html);
?>
As expected, it replaces only one \n. Do I need preg_replace_callback() and a separate function to replace the line breaks, or is it possible with a single regex?
EDIT: Any solution available for this, too?
$html2 = "<div>\n<pre id=foo><b>Foo\n\n</b>Bar Bar\nFoo Foo</pre>\n\n</div>";
You can do this using a callback as you suggested.
$html = preg_replace_callback('~<pre[^>]*>\K.*?(?=</pre>)~si',
function($m) {
return str_replace(array("\r\n", "\n", "\r"), '{NEWLINE}', $m[0]);
}, $html);
Although, I would recommend using DOM to perform this task.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML, suppressing parser warnings
$nodes = $doc->getElementsByTagName('pre');
$find = array("\r\n", "\n", "\r");
foreach ($nodes as $node) {
$node->nodeValue = str_replace($find, '{NEWLINE}', $node->nodeValue);
}
echo $doc->saveHTML();
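For the EDIT case in the question ($html2, where the pre block contains a nested <b> tag), the callback approach can be checked with a short sketch like the one below. It reuses the pattern from the callback answer; the variable name $out is only for illustration.

```php
<?php
$html2 = "<div>\n<pre id=foo><b>Foo\n\n</b>Bar Bar\nFoo Foo</pre>\n\n</div>";

// \K drops the opening <pre ...> tag from the match, so only the content
// up to (but not including) </pre> is handed to the callback.
$out = preg_replace_callback(
    '~<pre[^>]*>\K.*?(?=</pre>)~si',
    function ($m) {
        return str_replace(array("\r\n", "\n", "\r"), '{NEWLINE}', $m[0]);
    },
    $html2
);

echo $out;
```

Line breaks outside the pre block stay as they are, and the nested <b> tags are kept.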
My question is a duplicate of:
https://stackoverflow.com/a/5756032/318765
This is what I need:
<?php
echo preg_replace("/(\r\n|\n\r|\n|\r)(?=[^<>]*<\/pre)/", "{NEWLINE}", $html);
?>

PHP: Removing duplicate words from between quotes

How can I remove the duplicates from between class="" in the following string?
<li class="active active">Sample Page</li>
Please note that the classes shown can change and be in different positions.
You can use a DOM parser, then explode and array_unique:
$html = '<li class="active active">
Sample Page</li>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//li");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$tok = explode(' ', $node->getAttribute('class'));
$tok = array_unique($tok);
$node->setAttribute('class', implode(' ', $tok));
}
$html = $doc->saveHTML();
echo $html;
OUTPUT:
<html><body>
<li class="active">Sample Page</li>
</body></html>
With regex you could use a lookbehind and lookahead for finding duplicates:
$pattern = '/(?<=class=")(?:([-\w]+) (?=\1[ "]))+/i';
This would replace multiple instances of capture group 1 ([-\w]+) in a sequence.
$str = '<li class="active active">';
echo preg_replace($pattern, "", $str);
output:
<li class="active">
EDIT 08.04.2014
To remove duplicates that do not directly follow the lookbehind (?<=class=")...
The problem is that a lookbehind assertion must be of fixed length, so something like (?<=class="[^"]*?) is not possible. As an alternative, \K can be used, which resets the start of the reported match. A pattern could be:
$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';
You could imagine everything before \K as a virtual lookbehind of variable length.
This regex, like the first one, only replaces repeated instances of a single duplicate in a sequence.
EDIT 11.09.2014
Finally, I think a single regex that strips out all the different duplicates gets rather complex:
/(?>(?<=class=")|(?!^)\G)(?>\b([-\w]++)\b(?=[^"]*?\s\1[\s"])\s+|[-\w]+\s+\K)/
This one uses continuous matching (\G) as soon as class=" is found.
A simpler way using regex would be preg_replace_callback():
$html = '<li class="a1 a1 li li-home active li li active a1">';
$html = preg_replace_callback('/\sclass="\K[^"]+/', function ($m) {
return trim(implode(" ",array_unique(preg_split('~\s+~', $m[0]))));
}, $html);
Note that PHP versions before 5.3 don't support anonymous functions (in that case, use a named function instead).
Another way is to collect the class names into an array and filter out the duplicates:
<?php
preg_match_all('/class="([A-Za-z0-9 ]+)"/', $htmlString, $result);
// $result[1] holds the captured class lists; use the first match.
$classes = explode(" ", $result[1][0]);
$classes = array_unique($classes);
echo "<li class=\"".implode(" ",$classes)."\">Sample Page</li>";
?>

preg_replace - How to remove contents inside a tag?

Say I have this.
$string = "<div class=\"name\">anyting</div>1234<div class=\"name\">anyting</div>abcd";
$regex = "#([<]div)(.*)([<]/div[>])#";
echo preg_replace($regex,'',$string);
The output is
abcd
But I want
1234abcd
How do I do it?
Like this:
preg_replace('/(<div[^>]*>)(.*?)(<\/div>)/i', '$1$3', $string);
If you want to remove the divs too:
preg_replace('/<div[^>]*>.*?<\/div>/i', '', $string);
To replace only the content in the divs with class name and not other classes:
preg_replace('/(<div.*?class="name"[^>]*>)(.*?)(<\/div>)/i', '$1$3', $string);
$string = "<div class=\"name\">anything</div>1234<div class=\"name\">anything</div>abcd";
echo preg_replace('%<div.*?</div>%i', '', $string); // echoes 1234abcd
Live example:
http://codepad.org/1XEC33sc
Add ? after the quantifier to make it lazy, so it stops at the first occurrence of </div>:
preg_replace('~<div .*?>(.*?)</div>~','', $string);
http://sandbox.phpcode.eu/g/c201b/3
This might be a simple example, but if you have a more complex one, use an HTML/XML parser. For example with DOMDocument:
// loadHTML() is an instance method, so create the document first.
$doc = new DOMDocument();
$doc->loadHTML($string);
$xpath = new DOMXPath($doc);
$query = "//body/text()";
$nodes = $xpath->query($query);
$text = "";
foreach($nodes as $node) {
$text .= $node->wholeText;
}
Which query you need, or whether you have to process the DOM tree in some other way, depends on your particular content.

Need help with preg_replace

$text = '<p width="50px;" style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p><table style="text-align:center"></table>';
$text_2 = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $text);
OUTPUT (shown here formatted as HTML):
<p>
<strong>hello</strong>
</p>
<table></table>
My problem is that all attributes must be removed except those belonging to table. That is, I expect the output to be exactly as below (formatted as HTML):
<p>
<strong>hello</strong>
</p>
<table style="text-align:center"></table>
What do I need to change in the regular expression above to achieve this?
Any help would be appreciated.
Thanks in advance.
If you want to avoid regex, because you really shouldn't use regex to work on XML/HTML structures, try:
<?php
$text = '<p width="50px;" style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p><table style="text-align:center"></table>';
$dom = new DOMDocument;
$dom->formatOutput = true;
$dom->loadHtml($text);
$xpath = new DOMXpath($dom);
// Select every attribute (@*) of every element that is not a table.
foreach ($xpath->query('//*[not(name()="table")]/@*') as $attrNode) {
$attrNode->ownerElement->removeAttributeNode($attrNode);
}
$output = array();
foreach ($xpath->query('//body/*') as $childNode) {
$output[] = $dom->saveXml($childNode, LIBXML_NOEMPTYTAG);
}
echo implode("\n", $output);
Output:
<p>
<strong>hello</strong>
</p>
<table style="text-align:center"></table>
You are very close with your current regex. You need a negative lookahead so that table tags are skipped:
<(?!table)([a-z][a-z0-9]*)[^>]*?(\/?)>
The (?!table) at the start asserts that the tag name does not begin with 'table'; the rest is your original regex.
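Applied to the sample input from the question, the lookahead pattern can be tried with a sketch like the following. The \b after table is an addition not in the answer above, assumed here so that only the tag named exactly table is excluded.

```php
<?php
$text = '<p width="50px;" style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p><table style="text-align:center"></table>';

// (?!table\b) rejects opening <table ...> tags, so their attributes survive;
// every other opening tag is rewritten to just its name.
$text_2 = preg_replace('/<(?!table\b)([a-z][a-z0-9]*)[^>]*?(\/?)>/i', '<$1$2>', $text);

echo $text_2;
```

Closing tags are never matched, because the pattern requires a letter directly after the < character.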
This is a bit of a hacky solution, but it works.
Temporarily disguise the table tags in your markup, run your regex, and then restore them.
See: http://codepad.org/nevLWMq8
<?php
$text = '<p width="50px;" style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p><table style="text-align:center"></table>';
/* temporarily replace the table tags with something not occurring in your HTML */
$textTemp = str_replace(array("<table","/table>"),array('###','+++'),$text);
$text_2 = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $textTemp);
echo "\n\n";
/* restore back the table tags */
$finalText = str_replace(array("###","+++"),array("<table","/table>"),$text_2);
echo $finalText ;
?>
