preg_replace() help in PHP - php

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurrence of awesome that doesn't appear within the title attribute of the anchor?
So far, this is what I've come up with (sadly, it doesn't work):
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this will help someone else some day! Thanks again for all your answers.

This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.
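For illustration, here is a minimal sketch of what that could look like with PHP's bundled DOM extension. It assumes the goal is to change awesome to "wonderful" in visible text only, and it closes the anchor from the question so the fragment is valid; the wrapper div and the variable names are my own, not from the question.

```php
<?php
$html = 'hello awesome <a href="" rel="external" title="so awesome is cool">stuff</a> stuff';

$dom = new DOMDocument();
// Wrap the fragment in a div so it has a single root element.
// The @ silences libxml warnings about sloppy markup.
@$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// //text() selects text nodes only, so attribute values such as title are never touched.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = preg_replace('/\bawesome\b/i', 'wonderful', $textNode->nodeValue);
}

// Serialise the children of the wrapper div back into a string.
$result = '';
foreach ($dom->documentElement->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

The regex still does the word matching, but the parser decides which parts of the document it is allowed to see.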

Assuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake--and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.
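A rough sketch of that single-pass idea, in case it helps. The keyword list, the URLs and the title text are made up for the example, and I have used a closure instead of a named function so the list can be passed in with `use`:

```php
<?php
// Hypothetical keyword list: word => link target (not from the question).
$keywords = array(
    'awesome' => '/glossary/awesome',
    'cool'    => '/glossary/cool',
);

// Called once per matched word. Returns a link for keywords and the
// unchanged word for everything else.
$the_callback = function ($match) use ($keywords) {
    $word = $match[0];
    $key  = strtolower($word);
    if (isset($keywords[$key])) {
        return '<a href="' . $keywords[$key] . '" title="so ' . $key . ' is cool">' . $word . '</a>';
    }
    return $word;
};

$text   = 'hello awesome stuff, cool stuff';
$result = preg_replace_callback('/\b[A-Za-z]+\b/', $the_callback, $text);
echo $result, "\n";
```

The generated title attributes contain keywords too, but they are never linked, because preg_replace_callback does not rescan the text it has just inserted.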

Sure, using a parsing library is the industrial-strength solution, but we all have times where we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try running your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.
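A quick sketch of that, using the string from the question (with the anchor closed, which is my assumption about the original markup):

```php
<?php
$html = 'hello awesome <a href="" rel="external" title="so awesome is cool">stuff</a> stuff';

// strip_tags() drops the tags together with their attributes,
// so the "awesome" inside title="..." disappears with the <a> tag.
$plain = strip_tags($html);

// Count whole-word matches in the visible text only.
$count = preg_match_all('/\bawesome\b/i', $plain);

echo $plain, "\n";
echo $count, "\n";
```

Keep in mind that the result has no markup left, so this suits searching or counting rather than writing the changed text back into the page.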

This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's a hack.

Related

Replace only images and tags with surgical precision that match an img list var

OK, my goal is to remove every image that I specify in an array or group. The entire image tag should be removed, whether or not it is contained in a link.
So far I have this working somewhat, but it is far from perfect: this version only removes images that are not inside a link, and I need it to work both ways.
So if we have <img src="test1.gif" width="235">, it must be removed even if the tag contains other attributes and even if it is surrounded by a link, as long as the image name matches.
Any image in the group must be completely removed, along with its tag and any link that wraps it.
This is what I have so far.
#<img[^>]+src=".*?(test1.gif|test2.png|test3.jpg)"[^>]+>?#i
Ultimately, what I am trying to do is not as simple as I had hoped, so I am hoping some regex gurus can help with this task. I can't find anything on here or on the net; most examples just replace all images on a page, not specific images. Note: my reason for needing a regex is that this must work in other code that's based around preg_replace, and yes, I know that's not the best way to do it.
UPDATED: added this as an example; sorry for any confusion.
This is all PHP based!
So this var will hold all the images that we need to replace with nothing.
$m_rimg = "imagewatever.gif|test.jpg|animage.png";
$html = preg_replace('#<img[^>]+src=".*?('.$m_rimg.')"[^>]+>?#i', '', $html); // $html: the markup to clean (the subject argument was missing)
This almost works, but not correctly: it must also remove images that are wrapped in a link, and remove the link along with the image. So basically I need what I have modified to work both with a bare <img src="whatever.gif" width=""> and with the same image wrapped in a link, but it must only remove the images that match the list in the var. Replacing all images is something I can already do; this is more complex.
I hope this better explains it.
UPDATED 04/25/15
OK, I tried the last answer that was added; the details are below.
I had to modify it with some \ escapes so that I did not get a parse error, which is worth noting for anyone looking to do something similar.
This worked great. I just modified what you gave me like this:
"#(?:<a\b[^>]*?>)?(<img[^>]+src=[\"'][^>]*?($m_rimg)['\"][^>]*>)(?:<\/a>)?#is"
I did not use preg_quote. I'm not sure why, but that did not work at all, whereas without preg_quote it works so far in the tests I just did.
I was told not to use |, but that is what seems to work. What else would you guys suggest?
As to this being flagged as a duplicate of another answered question, I do not think that's the case. I looked at what is said to be the answer to my question and it is not the same as far as I can see; it does not do the exact thing I need, which is to match what's in my var. While yes, it is regex related, it did not help. I tried to find something on here that worked for my needs long before posting.
I got a helpful answer to my problem from one user, who understood why I was doing it this way. I hope this is now acceptable to lift the dupe status, as my goal was not to offend those who don't think I should use a regex as part of an HTML parser script.
Try something like:
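$DOM = new DOMDocument();
// loadHTML() tolerates real-world HTML; loadXML() needs well-formed XML.
// Replace 'HTML_DOCUMENT' with your markup string.
$DOM->loadHTML('HTML_DOCUMENT');
// Copy the live node list into an array so removals don't disturb the loop
$list = iterator_to_array($DOM->getElementsByTagName('img'));
foreach ($list as $img) {
    $src = $img->getAttribute('src');
    // only match if src ends with one of the listed file names:
    if (stringEndsWith($src, 'test1.gif') ||
        stringEndsWith($src, 'test2.gif') ||
        stringEndsWith($src, 'test3.gif')) {
        // removeChild() belongs to the parent node, not to the node list
        $img->parentNode->removeChild($img);
    }
}
function stringEndsWith($haystack, $ending, $caseInsensitivity = false)
{
    // Compare the tail of $haystack (same length as $ending) with $ending
    $tail = substr($haystack, -strlen($ending));
    if ($caseInsensitivity)
        return strcasecmp($tail, $ending) === 0;
    else
        return $tail === $ending;
}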
Or as you state you still need a regex way to remove <img> tags based on the alternative list inside a $m_rimg variable, and any <a> tags wrapped around, so use this:
$re = "#(?:<a\b[^>]*?>)?(<img[^>]+src=[\"'][^>]*?(" . $m_rimg . ")['\"][^>]*>)(?:<\/a>)?#is";
$str = "<img\n att=\"value\"\n src=\"sometext3456..,gjyg&&&test1.gif\" />\n\n<img src=\"imagewatever.gif\">";
$result = preg_replace($re, "", $str);
Mind that all the items in your variable must be preg_quoted, but not the | symbols.
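To make that last point concrete, here is a small sketch that quotes each file name on its own and only then joins the names with |. Running preg_quote over the whole "a.gif|b.png" string would escape the | as well, which is a likely reason it "did not work at all". The $images array and the sample markup are made up for the example.

```php
<?php
// Hypothetical list of file names to remove.
$images = array('imagewatever.gif', 'test1.gif', 'animage.png');

// Quote each name separately (this escapes the "." and would escape the
// "#" delimiter), then join the quoted names with an unescaped "|".
$m_rimg = implode('|', array_map(function ($name) {
    return preg_quote($name, '#');
}, $images));

$re = "#(?:<a\\b[^>]*?>)?(<img[^>]+src=[\"'][^>]*?(?:" . $m_rimg . ")['\"][^>]*>)(?:</a>)?#is";

$str = '<p>keep</p><a href="x.html"><img src="img/test1.gif" width="235"></a>'
     . '<img src="other.gif"><img src=\'imagewatever.gif\'>';

$result = preg_replace($re, '', $str);
echo $result, "\n";
```

Note that the pattern matches any src that ends in one of the names, so a file called mytest1.gif would be removed as well.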

What is the best way to censor inappropriate words that might contain markup within them?

I run a large website that contains millions of user generated posts that contain HTML. Some of these posts contain sensitive words my advertisers don't want to advertise next to. Instead of deleting these posts, I'd rather censor out the "bad" words. I also need to preserve the markup because letting the users mark up their posts is a major feature of the site.
I am currently using a search and replace with str_ireplace(), but our authors have become clever and are doing things (below) that slip through my primitive filter. I can strip the tags and detect the inappropriate words, but am looking for a way of replacing the words while leaving the markup untouched.
Examples:
Successfully censored:
input: "<p>Mary is a bitch.</p>"
output: "<p>Mary is a *****.</p>"
Unsuccessfully censored:
input: "<p>Mary is a <strong>b</strong>itch.</p>"
failed output: "<p>Mary is a <strong>b</strong>itch.</p>"
desired output: "<p>Mary is a <strong>*</strong>****.</p>"
My advice would be to use other methods to stop this, as it is extremely hard.
From this amusing piece by Jeff Atwood about the 'clbuttic' problems that arise from trying to do so:
Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.
Just for fun here is a quick and dirty way:
$badWords = array('bitch', 'jerk');
$input = '<p>Mary is a <strong>b</strong>itch. </p>';
$arr = explode(' ', $input);
foreach($arr as $key => $word)
{
$word = str_replace('.', '', strip_tags($word));
if(in_array($word, $badWords))
{
$arr[$key] = '*****';
}
}
$output = implode(' ', $arr);
echo $output;
Output
<p>Mary is a ***** </p>
The above splits the text into words, and applies strip_tags() on each of the words, so that it doesn't affect the entire content.
There are still many ways around it though, as the comments point out. You'll never get a perfect solution that can handle everything they throw at it - you would need to create something close to artificial intelligence. I think the best real solution would be to strip_tags() on the whole post and search for the bad words, then if any found, flag the post for moderator attention. Or just simply have a report post system with active moderators.
You're going to have an extremely tough time accomplishing this in your way, but my recommendation would be to not change the words out with asterisks, but rather just reject the posting and let the user know why. Here's why:
Simplify your searching. If your algorithm only has to check whether some form of a bad word exists in the text, then you can strip_tags the text and search for your words. If you were to try to replace the words with asterisks, you can't strip_tags, since you need to leave the original text in its prior condition.
It's what people expect. What people don't expect is for their text to be modified with no notification to them. You'd likely be better off sending people back with a message that says "this post contains inappropriate words/text".
If you are insistent that you replace with asterisks instead of sending the user back, you'll need to write a basic character-by-character parser that ignores HTML tags and constructs words out of it.
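Here is a minimal sketch of such a character-by-character pass, assuming single-byte text and a plain substring match. The function name censorHtml is made up. It collects the visible characters, remembers where each one sits in the original HTML, and then overwrites only those positions:

```php
<?php
// Sketch: censor bad words even when tags split them, leaving the markup intact.
function censorHtml($html, array $badWords)
{
    $plain = '';      // visible text only
    $map   = array(); // $map[i] = offset in $html of the i-th visible character
    $inTag = false;
    $len   = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $inTag = true;
        } elseif ($ch === '>' && $inTag) {
            $inTag = false;
        } elseif (!$inTag) {
            $plain .= $ch;
            $map[]  = $i;
        }
    }
    foreach ($badWords as $word) {
        $pos = 0;
        // Case-insensitive search in the visible text
        while ($word !== '' && ($pos = stripos($plain, $word, $pos)) !== false) {
            for ($j = 0; $j < strlen($word); $j++) {
                // Overwrite the matching character in the original markup
                $html[$map[$pos + $j]] = '*';
            }
            $pos += strlen($word);
        }
    }
    return $html;
}

echo censorHtml('<p>Mary is a <strong>b</strong>itch.</p>', array('bitch', 'jerk')), "\n";
```

It works on bytes, so multibyte text needs more care, and because it matches substrings it will happily censor part of an innocent longer word (the Scunthorpe problem mentioned above).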
You could start from a "bad words" list and check the tag-clean string (that is, filtered via strip_tags()) against the "bad words".
Then you could iterate each bad word through a series of possible single-letter alterations, eg S=>5, 1=>L, 0=>O etc.

Replace smileys inside [code][/code] tags

So I started a post before but it got closed :( Since then I have managed to progress a little, realizing I need to somehow grab the content inside the [code][/code] tags and do a str_replace() on the smiley bbcode text within them. Here is what I have so far, but it's not working:
if (preg_match_all('~[code](.*?)[\/code]~i', $row['message'], $match)){
foreach($match[1] AS $key) {
$find = array(':)',':(',':P',':D',':O',';)','B)',':confused:',':mad:',':redface:',':rolleyes:',':unsure:');
$replace = array(':)',':(',':P',':D',':O',';)','B)',':confused:',':mad:',':redface:',':rolleyes:',':unsure:');
}
$message = str_replace($find, $replace, $key);
} else {
$message = $row['message'];
}
it just returns no message content at all.
if i change this line:
$message = str_replace($find, $replace, $key);
to this:
$message = str_replace($find, $replace, $row['message']);
it sort of works but replaces all smileys inside the whole message rather than just the content inside the [code][/code] tags, which I assume is being represented by $key?! ...any help please, it's causing my brain to overload!
I did find this question which is different but very relevant to mine but there was no real answer to it.
I think it might be easier for you to use an existing BBCode parser (e.g. NBBC), or at least to check how they did it. And they did it in a much more intelligent way: rather than using a regexp, they use a lexer, which splits the input into separate tags (shown below). Then they just don’t do anything to a [code] tag. If you still want to stay with your solution, I would, for one, split the text into arrays ($parts = explode('[code]', $bbcode), and then the same for [/code]). Every second list element should be left out of the other parsing. As stated before: there is no purpose in reinventing the wheel, so don’t do that.
This is how their solution works:
[b]Hello![/b]
[code]
#/usr/bin/python
import antigravity
[/code]
Python is much more awesome.
Becomes eg. an array with the following elements:
[b]Hello![/b]
[code]…[/code]
Python is much more awesome.
And then it applies the formatting it should apply. Much better and more human.
not a comment due to length.
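If you do stay with your own code, the splitting idea could be sketched roughly like this. The smiley map, the image markup and the function name renderSmileys are made up for the example; the important details are the escaped brackets (unescaped, [code] is a character class) and PREG_SPLIT_DELIM_CAPTURE, which keeps the [code] blocks in the result at the odd indexes.

```php
<?php
// Hypothetical smiley map; the image markup is an assumption.
$smileys = array(
    ':)' => '<img src="smile.gif" alt=":)">',
    ':(' => '<img src="sad.gif" alt=":(">',
);

function renderSmileys($message, array $smileys)
{
    // Split the message into alternating pieces:
    // even indexes = normal text, odd indexes = whole [code]...[/code] blocks.
    $parts = preg_split('~(\[code\].*?\[/code\])~is', $message, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // strtr() replaces in one pass and never rescans its own output
            $parts[$i] = strtr($part, $smileys);
        }
    }
    return implode('', $parts);
}

$out = renderSmileys("Hi :) [code]print(':)')[/code] bye :(", $smileys);
echo $out, "\n";
```

If you want the opposite, touching only the content of the [code] blocks, flip the test to `$i % 2 === 1`.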

Regex replace matched subexpression (and nothing else)?

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${3}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does contain what appear to be tag-like elements. I know better than to parse HTML and XML with RegEx. ;)
It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
See it working online: ideone
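A short sketch showing the lookaround version next to two variations that may also be useful: \K, which drops everything matched so far from the reported match, and a callback for the case where X differs per element. The sample data and the lowercasing are only for illustration.

```php
<?php
$data = '<DelayEvent>13A</DelayEvent><Other>13A</Other><DelayEvent>7B</DelayEvent>';

// Lookarounds: only the \w+ between the tags is part of the match.
$a = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);

// \K: the opening tag must match, but it is excluded from what gets replaced.
$b = preg_replace('|<DelayEvent>\K\w+(?=</DelayEvent>)|', 'X', $data);

// Callback: compute the replacement from the old value.
$c = preg_replace_callback('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', function ($m) {
    return strtolower($m[0]);
}, $data);

echo $a, "\n", $b, "\n", $c, "\n";
```

The lookbehind only works here because <DelayEvent> has a fixed length; \K has no such restriction on what comes before it.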

Regexp for cleaning the empty, unnecessary HTML tags

I'm using TinyMCE (WYSIWYG) as the default editor in one of my projects and sometimes it automatically adds <p> </p>, <p>&nbsp;</p> or divs.
I have been searching but I couldn't really find a good way of cleaning any empty tags with regex.
The code I've tried to use is:
$pattern = "/<[^\/>]*>([\s]?)*<\/[^>]*>/";
$str = preg_replace($pattern, '', $str);
Note: I also want to clear &nbsp; too :(
Try
/<(\w+)>(\s|&nbsp;)*<\/\1>/
instead. :)
That regexp is a little odd - but looks like it might work. You could try this instead:
$pattern = ':<[^/>]*>\s*</[^>]*>:';
$str = preg_replace($pattern, '', $str);
Very similar though.
I know it's not directly what you asked for, but after months of TinyMCE, coping with not only this but the hell that results from users posting directly from Word, I have made the switch to FCKeditor and couldn't be happier.
EDIT: Just in case it's not clear, what I'm saying is that FCKeditor doesn't insert arbitrary paras where it feels like it, plus copes with pasted Word crap out of the box. You may find my previous question to be of help.
You would want multiple regexes, to be sure you do not eliminate other wanted elements with one generic one.
As Ben said you may drop valid elements with one generic regex
<\s*[^>]*>\s*&nbsp;\s*<\s*[^>]*>
<\s*p\s*>\s*<\s*/p\s*>
<\s*div\s*>\s*<\s*/div\s*>
Try this:
<([\w]+)[^>]*?>(\s|&nbsp;)*<\/\1>
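For what it's worth, a sketch of how a pattern like this could be applied repeatedly, so that wrappers which only become empty after their contents are removed get cleaned up too. The function name removeEmptyTags and the sample input are made up, and I added a \b after the tag name so the backreference always holds the complete name.

```php
<?php
function removeEmptyTags($str)
{
    // (\w+) captures the tag name; \1 requires the same name in the closing tag.
    // Between the tags only whitespace and &nbsp; entities are allowed.
    $pattern = '~<(\w+)\b[^>]*>(?:\s|&nbsp;)*</\1>~i';
    do {
        // $count receives the number of replacements made in this pass
        $str = preg_replace($pattern, '', $str, -1, $count);
    } while ($count > 0);
    return $str;
}

echo removeEmptyTags('<p>Text</p><p>&nbsp;</p><div><p> </p></div><p>More</p>'), "\n";
```

As noted above, a generic pattern like this also removes elements that are empty on purpose, such as empty table cells or named anchors, so you may prefer to list the tag names explicitly.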
