preg_replace on text only and not inside href's - php

I have this code to do some ugly inline color styling on HTML content.
But it breaks anything inside tags, such as links and email addresses.
I've figured out halfway how to prevent it from formatting links, but it still doesn't prevent the replacement when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What would be a better approach to replace only the text and leave links untouched?

A better approach would be to use XML functions instead.
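A minimal sketch of that DOM-based approach, assuming PHP's bundled DOM extension is available. The helper names colorize_mytext and make_span are made up for illustration, and the lookarounds are carried over from the question and extended with #; treat it as a starting point rather than a finished solution. Only text nodes are touched, so attribute values such as href are never rewritten.

```php
<?php
// Hypothetical helper: builds <span style="color:...">text</span>.
function make_span(DOMDocument $doc, string $text, string $color): DOMElement
{
    $span = $doc->createElement('span');
    $span->setAttribute('style', 'color:' . $color);
    $span->appendChild($doc->createTextNode($text));
    return $span;
}

// Hypothetical helper: colours "mytext" in text nodes only.
// Assumption: $html is an ASCII (or entity-encoded) HTML fragment.
function colorize_mytext(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate sloppy markup
    $doc->loadHTML('<div>' . $html . '</div>');  // wrapper div gives one known root
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//body/div')->item(0);

    // Snapshot the candidate text nodes first, because the tree is modified below.
    $nodes = iterator_to_array($xpath->query('.//text()[contains(., "mytext")]', $root));

    foreach ($nodes as $node) {
        $fragment = $doc->createDocumentFragment();
        // With one capture group, odd indexes are the matched keyword, even indexes the text around it.
        $parts = preg_split('/(?<![#.])(mytext)(?!\/)/', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $fragment->appendChild(make_span($doc, 'my', '#DD1E32'));
                $fragment->appendChild(make_span($doc, 'text', '#002d6a'));
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialise only the children of the wrapper div.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling colorize_mytext($text) on a fragment should leave every href as it was, because attributes are never visited.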

Your lookbehind assertion only tests one character, so it's insufficient to assert that a match lies outside of HTML tags. This is a case where regular expressions aren't the best option. You can, however, get an approximation like:
$text = preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>", $text);
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag, so it works best if you first wrap the input, e.g. $text = "<div>$text</div>" or something similar. Because each match has to start at a >, it also replaces at most one occurrence per run of text between two tags.

[edited] Oh, I see you solved the href problem.
To solve your email problem, swap the address for a placeholder such as [email_safeguard] with str_replace before working on the text, and change it back when you're finished. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
// work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
That should do the trick :)
But as people have mentioned before, you'd better avoid mixing HTML and regex, or you will suffer the wrath of Cthulhu.
see this instead

Related

preg_replace issue with html elements

I'm having issues with my preg_replace solution.
I have the following HTML code:
<h1> Text marking test</h1><b> Chicago</b> - This is the text. Can this problem be solved by you?
I also have almost similar content:
Chicago - This is the text. Can this issue be solved by you?
All multiple spaces are gone and "problem" has turned into "issue".
I want to mark:
Chicago - This is the text. Can this
be solved by you?
So I get this:
<h1> Text marking test</h1><div class="marked"><b> Chicago</b> - This is the text. Can this</div> problem <div class="marked">be solved by you?</div>
I have the following regular expression pattern which works:
$string = preg_replace( "/(?im)(<b>)*Chicago([\s,.!?:;'\"]|<([^>]+)>)*-([\s,.!?:;'\"]|<([^>]+)>)*This([\s,.!?:;'\"]|<([^>]+)>)*is([\s,.!?:;'\"]|<([^>]+)>)*the([\s,.!?:;'\"]|<([^>]+)>)*text([\s,.!?:;'\"]|<([^>]+)>)*Can([\s,.!?:;'\"]|<([^>]+)>)*this([\s,.!?:;'\"]|<([^>]+)>)*/", '<div class="marked">' . '${0}' . '</div>', $string);
The problem is that the leading <b> tag could be any tag with any attributes, and it is also optional.
It may only be the tag directly in front of Chicago, not any earlier tag.
But somehow I constantly fail in my attempts.
Any help is greatly appreciated!
Maybe you can remove all the HTML tags before the text analysis by replacing every match of "<[^>]*>" with replace_all, and then write a simpler regex for the text analysis.
I suggest using several small regexes instead of one big one; that makes it easier to locate a bug or update your program.
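One way to keep the long pattern from the question manageable is to build it from a word list rather than typing the separator out by hand each time. This is only a sketch: build_phrase_pattern is a made-up helper name, the separator is copied from the question (without its capture groups), and it does not address the optional leading tag.

```php
<?php
// Hypothetical helper: joins the words with the "whitespace, punctuation or tag" separator.
function build_phrase_pattern(array $words): string
{
    // Same alternatives as in the question, written once and made non-capturing.
    $sep = '(?:[\s,.!?:;\'"]|<[^>]+>)*';
    $quoted = array_map(function ($word) {
        // Escape characters that would otherwise be read as regex syntax.
        return preg_quote($word, '/');
    }, $words);
    return '/' . implode($sep, $quoted) . '/i';
}

$html = '<h1> Text marking test</h1><b> Chicago</b> - This is the text. Can this problem be solved by you?';
$pattern = build_phrase_pattern(['Chicago', '-', 'This', 'is', 'the', 'text', 'Can', 'this']);
$marked = preg_replace($pattern, '<div class="marked">$0</div>', $html);
```

Changing the phrase then means editing the array, and a bug in the separator only has to be fixed in one place.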
Edit: I had misread your question and deleted my answer, but upon reading it again I think it might offer you some pointers on how to proceed. I don't completely understand the question, so please pardon the unsatisfactory answer.
You want to strip the text of HTML tags, as well as of multiple spaces. I would tackle these things separately:
function clean_text($text) {
$text = strip_tags($text);
$text = preg_replace('/\s{2,}/', ' ', $text);
return $text;
}
Use built-in functions where possible – no sense in re-inventing the wheel, especially as usually a lot of thought went into the functions. As for the second part, we match two or more whitespace characters and replace them with one space only.

process text in html and reinsert to html structure

I want to grab the text from some HTML, process and change it, and reinsert it into that HTML code with PHP.
<p>This is my sentence <span>and more</span> also <strong>important</strong> part.</p>
What's the best method? Using preg_*? How can I reinsert my text into the HTML structure?
For example, I want to remove all runs of two or more spaces between words.
preg_replace('/\s+/', ' ', $myText);
But I want this applied only to the text of my HTML, not to the tags, attributes, etc.
Have a look at DomDocument. It'll allow you to do some manipulation on your HTML.
http://www.php.net/manual/en/domdocument.loadhtml.php
EDIT
If you want to elaborate on exactly what you want to do with your HTML example, we might be able to provide a more specific answer :)
EDIT
To reflect the updated answer: the multiple spaces in HTML should collapse anyway, but if you want to remove them then you could try the following:
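$result = preg_replace_callback('/(?<=>)[\w\s]+(?=<)/', function ($match) {
    // preg_replace, not preg_filter: preg_filter returns null when nothing matches,
    // which would delete fragments that contain no whitespace at all
    return preg_replace('/\s+/', ' ', $match[0]);
}, $str);
I'm not a regex expert by any stretch, so I'm sure there's a more elegant way to do this, but this might work for you nonetheless: first do a preg_replace_callback and use lookarounds to grab any text fragments between the end of one tag and the start of the next. Then pass each fragment through preg_replace to collapse any run of whitespace into a single space. Note that [\w\s]+ only matches fragments made up entirely of word characters and whitespace, so a fragment containing punctuation is left as it is.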
Hope this helps/works :)
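The DomDocument suggestion above could be sketched roughly as follows, assuming the bundled DOM extension and an ASCII (or entity-encoded) fragment. collapse_text_whitespace is a made-up name. Every text node is rewritten while tags and attribute values are left alone.

```php
<?php
// Hypothetical helper: collapses whitespace runs in text nodes only.
function collapse_text_whitespace(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate sloppy markup
    $doc->loadHTML('<div>' . $html . '</div>');  // wrapper div gives one known root
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//body/div')->item(0);

    // text() selects text nodes, so attributes are never visited.
    foreach ($xpath->query('.//text()', $root) as $textNode) {
        $textNode->data = preg_replace('/\s+/', ' ', $textNode->data);
    }

    // Serialise only the children of the wrapper div.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

A real implementation would probably want to skip elements such as pre, where whitespace is significant.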

preg_replace script, link tag not working

I used the following code to remove script, link tags from my string,
$contents='<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';
$ss=preg_replace('#<script(.*?)>(.*?)</script>#is', '', $contents);
echo htmlspecialchars($ss);
it works fine. But can I use anything that similar to html parsing rather than preg_match for this?
Here are few things you can do
htmlspecialchars() renders those tags harmless by escaping them
strip_tags() removes all HTML tags (but keeps the text that was between them)
But the technique you are using is the correct one. However, here is an improved version of it:
echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $contents);
HTML Purifier is always a good choice. phpQuery has also come in handy a few times.
If you are sanitizing content, it's very easy to make mistakes with regular expressions... read this post. It just depends what you're trying to achieve.
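For the "something similar to HTML parsing" part of the question, a rough sketch with the bundled DOM extension might look like this. remove_elements is a made-up helper name, and the input is assumed to be an ASCII (or entity-encoded) fragment. This removes elements by tag name; it is not a substitute for a real sanitizer such as HTML Purifier.

```php
<?php
// Hypothetical helper: removes every element with one of the given tag names.
function remove_elements(string $html, array $tagNames): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate sloppy markup
    $doc->loadHTML('<div>' . $html . '</div>');  // wrapper div gives one known root
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//body/div')->item(0);

    foreach ($tagNames as $tagName) {
        // Copy the live node list to an array before removing nodes from the tree.
        $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialise only the children of the wrapper div.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$contents = '<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';
$ss = remove_elements($contents, ['script', 'link']);
```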

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck on this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within the href attribute, the result is a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
this keyword here
i end up with:
this <b>keyword</b> here
i tried all sorts of combinations but i still couldn't get the right pattern.
thanks!
You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML as it is unless you encounter a text node; in that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
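Another regex-only approximation is to skip the tags in the first place instead of cleaning up afterwards. The sketch below relies on PCRE's (*SKIP)(*FAIL) backtracking verbs: anything that looks like a tag is matched and then discarded, so the keyword alternative only ever runs on text between tags. The function name is made up, and like any regex approach it can be fooled by a > inside an attribute value, by comments, or by script blocks.

```php
<?php
// Hypothetical helper: bolds $keyword everywhere except inside <...> tags.
function bold_keyword_outside_tags(string $text, string $keyword): string
{
    // First alternative consumes a whole tag and forces a failure without retrying inside it;
    // second alternative is the actual keyword match.
    $pattern = '/<[^>]*>(*SKIP)(*FAIL)|\b' . preg_quote($keyword, '/') . '\b/i';
    return preg_replace($pattern, '<b>$0</b>', $text);
}

$result = bold_keyword_outside_tags('<a href="keyword.php">this keyword here</a>', 'keyword');
// $result is '<a href="keyword.php">this <b>keyword</b> here</a>'
```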
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example:
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    // preg_quote() keeps special characters in the keyword from being read as regex syntax
    '/(<a href=")(' . preg_quote($keyword, '/') . ')(\.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in $1, $2 ... $n, so I don't have to type things over again. The match can also be made more generic to cover different kinds of URL syntax if needed.
Seeing this solution, you might want to rethink the way you plan to match keywords in your code. :)
Output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

question regarding php function preg_replace

I want to dynamically remove specific tags and their content from an html file and thought of using preg_replace but can't get the syntax right. Basically it should, for example, do something like :
Replace everything between (and including) "" by nothing.
Could anybody help me out on this please ?
Easy, dude.
To make a regex ungreedy, use the U modifier.
And to let it match across multiple lines (so that . also matches newlines), use the s modifier.
Knowing that, to remove all paragraphs, use this pattern:
#<p[^>]*>(.*)?</p>#sU
Explanation:
I use # as the delimiter so that I don't have to escape the / characters (which makes the pattern more readable)
<p[^>]*> : the part that detects an opening paragraph tag (possibly with attributes, such as a style)
(.*)? : everything (in "ungreedy mode")
</p> : obviously, the closing paragraph tag
Hope that helps!
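Applied with preg_replace, the pattern above behaves like this on a small made-up input (the sample string is an assumption for illustration):

```php
<?php
// Made-up sample input with two paragraphs, one of them spanning a line break.
$html = "<div>keep</div><p class=\"intro\">drop\nme</p><p>also dropped</p>tail";

// s lets . match the newline, U makes the quantifiers stop at the first </p>.
$clean = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);
// $clean is '<div>keep</div>tail'
```

One caveat: <p[^>]*> also matches other tags whose names start with p, such as <pre>. Writing <p\b[^>]*> instead should avoid that.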
If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. This is easier to sanitize and prevent XSS attacks. There's a well known library called HTML Purifier that, although large and somewhat slow, has amazing results regarding purifying your data.
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like
Simple HTML DOM
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument
The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
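<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('blah.html');
$content = $doc->documentElement;
// Find the first <table> anywhere in the document
$table = $content->getElementsByTagName('table')->item(0);
if ($table !== null) {
    // removeChild() must be called on the node's direct parent, not on the root element
    $delfirstTable = $table->parentNode->removeChild($table);
}
echo $doc->saveHTML();
?>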
If you don't know what is between the tags, Phill's response won't work.
This will work if there's no other tags in between, and is definitely the easier case. You can replace the div with whatever tag you need, obviously.
preg_replace('#<div>[^<]+</div>#','',$html);
If there could be other tags in the middle, this should work, but could cause problems. You're probably better off going with the DOM solution above, if so
preg_replace('#<div>.+</div>#','',$html);
These aren't tested
PSEUDO CODE
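function replaceMe($html_you_want_to_replace, $html_dom) {
    // preg_quote() escapes the literal HTML so it is not read as regex syntax
    $pattern = '/^' . preg_quote($html_you_want_to_replace, '/') . '/';
    return preg_replace($pattern, '', $html_dom);
}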
HTML Before
<div>I'm Here</div><div>I'm next</div>
<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>
HTML After
<div>I'm next</div>
I know it's a hack job
