Regex everything but strings + use of groups

Regex everything but strings + use of groups - php

I'm trying to make a whitelist of html tags, here's my code :
$string = "<x>-<x>";
$result = preg_match('#^<(?!white1|white2)>.*<(\1)>$#i', $string);
But it returns false, and I don't know why. I simplified the regex to avoid confusions, but this is still the same idea.
I want to match every correct tag but the ones I want to keep safe. This regex will go on a preg_replace to erase every matched tag and let the ones I allow.
Thanks for your help in advance !
EDIT: If I find a way to do this with regexs, I'll put the solution here. But for now, I'll do it with strip_tags().
EDIT2: The easiest way I thought to is to parse all the tags and then revert back the ones we allow.

If you want to eliminate unwanted tags, you can use strip_tags:
$allowedTags = '<p><a><img>';
$filteredContent = strip_tags($content, $allowedTags);

Related

Regex replace matched subexpression (and nothing else)?

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${2}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does what appear to be tag-like elements. I know better than parsing HTML and XML with RegEx. ;)

It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
See it working online: ideone

PHP Regular expression tag matching

Been beating my head against a wall trying to get this to work - help from any regex gurus would be greatly appreciated!
The text that has to be matched
[template option="whatever"]
<p>any amount of html would go here</p>
[/template]
I need to pull the 'option' value (i.e. 'whatever') and the html between the template tags.
So far I have:
> /\[template\s*option=["\']([^"\']+)["\']\]((?!\[\/template\]))/
Which gets me everything except the html between the template tags.
Any ideas?
Thanks, Chris

edit: [\s\S] will match anything that is space or not space.
you may have a problem when there are consecutive blocks in a large string. in that case you will need to make a more specific quantifier - either non greedy (+?) or specify range {1,200} or make the [\s\S] more specific
/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+)\[\/template\]/

Try this
/\[template\s*option=\"(.*)\"\](.*)\[\/template]/
basically instead of using complex regex to match every single thing just use (.*) which means all since you want everything in between its not like you want to verify the data in between

The assertion ?! method is unneeded. Just match with .*? to get the minimum giblets.
/\[template\s*option=\pP([\h\w]+)\pP\] (.*?) [\/template\]/x

Chris,
I see you've already accepted an answer. Great!
However, I don't think use of regular expressions is the right solution here. I think you can get the same effect by using string manipulations (substrings, etc)
Here is some code that may help you. If not now, maybe later in your coding endeavors.
<?php
$string = '[template option="whatever"]<p>any amount of html would go here</p>[/template]';
$extractoptionline = strstr($string, 'option=');
$chopoff = substr($extractoptionline,8);
$option = substr($chopoff, 0, strpos($chopoff, '"]'));
echo "option: $option<br \>\n";
$extracthtmlpart = strstr($string, '"]');
$chopoffneedle = substr($extracthtmlpart,2);
$html = substr($chopoffneedle, 0, strpos($chopoffneedle, '[/'));
echo "html: $html<br \>\n";
?>
Hope this helps anyone looking for a similar answer with a different flavor.

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.

You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.

I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);

Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

Regexp for cleaning the empty, unnecessary HTML tags

I'm using TinyMCE (WYSIWYG) as the default editor in one of my projects and sometimes it automatically adds <p> </p> , <p> </p> or divs.
I have been searching but I couldn't really find a good way of cleaning any empty tags with regex.
The code I've tried to used is,
$pattern = "/<[^\/>]*>([\s]?)*<\/[^>]*>/";
$str = preg_replace($pattern, '', $str);
Note: I also want to clear &nbsp too :(

Try
/<(\w+)>(\s| )*<\/\1>/
instead. :)

That regexp is a little odd - but looks like it might work. You could try this instead:
$pattern = ':<[^/>]*>\s*</[^>]*>:';
$str = preg_replace($pattern, '', $str);
Very similar though.

I know it's not directly what you asked for, but after months of TinyMCE, coping with not only this but the hell that results from users posting directly from Word, I have made the switch to FCKeditor and couldn't be happier.
EDIT: Just in case it's not clear, what I'm saying is that FCKeditor doesn't insert arbitrary paras where it feels like it, plus copes with pasted Word crap out of the box. You may find my previous question to be of help.

You would want multiple Regexes to be sure you do not eliminated other wanted elements with one generic one.
As Ben said you may drop valid elements with one generic regex
<\s*[^>]*>\s*` `\s*<\s*[^>]*>
<\s*p\s*>\s*<\s*/p\s*>
<\s*div\s*>\s*<\s*/div\s*>

Try this:
<([\w]+)[^>]*?>(\s| )*<\/\1>

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've came up with (it doesn't work sadly)
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this well help someone else some day! Thanks again for all your answers.

This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.

Asssuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake--and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.

Sure, using a parsing library is the industrial-strength solution, but we all have times were we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try just run your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.

This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's hack.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regex everything but strings + use of groups - php

If you want to eliminate unwanted tags, you can use strip_tags: $allowedTags = '<p><a><img>'; $filteredContent = strip_tags($content, $allowedTags);

Related

Regex replace matched subexpression (and nothing else)?

PHP Regular expression tag matching

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

Regexp for cleaning the empty, unnecessary HTML tags

preg_replace() help in PHP

Categories

Resources