How to stop BB Code manipulation (part two)? - php

I recently discovered an issue where people using BB Code to enter links are able to manipulate them.
They are meant to enter something like:
[LINK=http://www.domain.com]example text[/LINK]
However they can enter something like this to make the link color red:
[LINK=http://www.domain.com 'span style="color:red;"']example text[/LINK]
This is the code which converts it:
$text = preg_replace("/\[LINK\=(.*?)\](.*?)\[\/LINK\]/is", "<a href='$1' target='_blank'>$2</a>", $text);
Someone else was kind enough to provide a solution to a very similar problem but they want me to start a new question for this. Their solution just needs adapting. I have tried myself but I really can't get it to work. How to stop BB Code manipulation?

preg_replace_callback("/\\[LINK\=(.*?)\\\](.*?)\\[\/LINK\\]/is",
function (array $matches) {
if (filter_var($matches[1], FILTER_VALIDATE_URL))
return '<a href="'.
htmlspecialchars($matches[1], ENT_QUOTES).
'" target="_blank">'.
htmlspecialchars($matches[2])."</a>";
else
return "INVALID MARKUP";
}, $text);
Use a callback to validate the URL and don't forget htmlspecialchars.

I think the simplest and best solution might be to run the url through htmlspecialchars() to escape the weird characters. That way it wouldn't get put directly into the source, but would be escaped first, so it wouldn't be able to hack out of the href="...".

Instead of using a regex replace, use a regex match to extract the information that you want, in this case the link and link text.
Then write that information out in the right format. That should eliminate the opportunity to get weird data into the output.
You can even double-check the variables before using them to make sure they don't contain any HTML.

Related

Parsing link from javascript function

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
/http://.*\.mp4/ will give you all characters between http:// and .mp4, inclusive.
See it in action.
If you need the session id, use something like /http://.*\.mp4?sessionid=\d+/
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
The following captures any url in your html
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
if you want to do the same in javascript you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

php - safest way to ensure plain text

What is the most secure way to stop users adding html or javascript to a field. I am adding a youtube style 'description' where users can explain their work but I don't want anything other than plain text in there and preferable none of the htmlentities rubbish like '<' or '>'.
Could I do something like this:
$clean = htmlentities($_POST['description']);
if ($clean != $_POST['description']) ... then return the form with an error?
Have you seen strip_tags?
strip_tags() would probably be the best bet.
You don't need to check the cleaned code vs the original and throw an error. As long as it is cleaned, you should be able to display it. Just throw away the original comment. You can put a note under the textbox saying that no html is allowed if you want to make it more user friendly.
Use strip_tags() instead htmlentities().
And the method is ok.
htmlspecialchars(), if used properly (see comments), is the safest way to ensure plain text. There is no way to inject any HTML or JavaScript when the output has all the HTML special characters escaped. If you use strip_tags, you will prevent your users from using completely legitimate characters.
Also don't forget mysql_real_escape_string() if you are storing data in MySQL.

Regex to change spaces in images into entities

I'm having a lot of difficulty matching an image url with spaces.
I need to make this
http://site.com/site.com/files/images/img 2 (5).jpg
into a div like this:
.replace(/(http:\/\/([^\s]+\.(jpg|png|gif)))/ig, "<div style=\"background: url($1)\"></div>")
Here's the thread about that:
regex matching image url with spaces
Now I've decided to first make the spaces into entities so that the above regex will work.
But I'm really having a lot of difficulty doing so.
Something like this:
.replace(/http:\/\/(.*)\/([^\<\>?:;]*?) ([^\<\>?:;]*)(\.(jpe?g|png|gif))/ig, "http://$1/$2%20$3$4")
Replaces one space, but all the rest are still spaces.
I need to write a regex that says, make all spaces between http:// and an image extension (png|jpg|gif) into %20.
At this point, frankly not sure if it's even possible. Any help is appreciated, thanks.
Trying Paolo's escape:
.escape(/http:\/\/(.*)\/([^\<\>?:;]*?) ([^\<\>?:;]*)(\.(jpe?g|png|gif))/)
Another way I can do this is to escape serverside in PHP, and in PHP I can directly mess with the file name without having to match it in regex.
But as far as I know something like htmlentities do not apply to spaces. Any hints in this direction would be great as well.
Try the escape function:
>>> escape("test you");
test%20you
If you want to control the replacement character but don't want to use a regular expression, a simple...
$destName = str_replace(' ', '-', $sourceName);
..would probably be the more efficient solution.
Lets say you have the string variable urlWithSpaces which is set to a URL which contains spaces.
Simply go:
urlWithoutSpaces = escape(urlWithSpaces);
What about urlencode() - that may do what you want.
On the JS side you should be using encodeURI(), and escape() only as a fallback. The reason to use encodeURI() is that it uses UTF-8 for encoding, while escape() uses ISO Latin. Same problems applies for decoding.
encodeURI = encodeURI || escape;
alert(encodeURI('image name.png'));

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've came up with (it doesn't work sadly)
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this well help someone else some day! Thanks again for all your answers.
This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.
Asssuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake--and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.
Sure, using a parsing library is the industrial-strength solution, but we all have times were we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try just run your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.
This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's hack.

Categories