Parsing a Markdown style link safely


I have written some code to match and parse a Markdown link of this style:
[click to view a flower](http://www.yahoo.com/flower.html)
I have this code that is meant to extract the link text and then the URL itself, and put them into an <a href> link. I am worried, though, that I may be missing a way for someone to inject XSS, because I am leaving in a decent number of characters. Is this safe?
$pattern_square = '\[(.*?)\]';
$pattern_round = "\((.*?)\)";
$pattern = "/".$pattern_square.$pattern_round."/";
preg_match($pattern, $input, $matches);
$words = $matches[1];
$url = $matches[2];
// ereg_replace() was removed in PHP 7; preg_replace() is the equivalent
$words = preg_replace('/[^-_#0-9a-zA-Z.]/', '', $words);
$url = preg_replace('/[^-A-Za-z0-9+&@#\/%?=~_|!:.]/', '', $url);
$final = "<a href='$url'>$words</a>";
It seems to work okay, and it does exclude some stupid URLs that include semicolons and backslashes, but I don't care about those URLs.

If you have already passed the input through htmlspecialchars (which you are doing, right?) then it is already impossible for the links to contain any characters that could cause XSS.
If you have not already passed the input through htmlspecialchars, then it doesn't matter what filtering you do when parsing the links, because you're already screwed, because one can trivially include arbitrary HTML or XSS outside the links.
This function will safely parse Markdown links in text while applying htmlspecialchars on it:
function doMarkdownLinks($s) {
    return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
        // $matches[1] is the link text, $matches[2] is the URL
        return '<a href="' . $matches[2] . '">' . $matches[1] . '</a>';
    }, htmlspecialchars($s));
}
If you need to do anything more complicated than that, I advise you to use an existing parser, because it is too easy to make a mistake with this sort of thing.
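One case that escaping alone does not cover is the URL scheme: a target such as javascript:alert(1) contains no HTML special characters, so it passes through htmlspecialchars unchanged. Below is a minimal sketch of the same approach with a scheme check added. The function name renderMarkdownLinks and the choice to allow only http and https are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: render Markdown-style links, allowing only http/https targets.
function renderMarkdownLinks(string $s): string
{
    // Escape everything first (quotes included) so captured text is safe inside attributes.
    $escaped = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

    return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function (array $m): string {
        // Leave anything that is not an absolute http(s) URL as plain (escaped) text.
        if (!preg_match('#^https?://#i', $m[2])) {
            return $m[0];
        }
        return '<a href="' . $m[2] . '">' . $m[1] . '</a>';
    }, $escaped);
}
```

Links with other schemes are left in the output as escaped text rather than dropped, so the user can still see what was typed.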

Related

how to escape only <script> tag using htmlspecialchars() in php

I have a string in my SQL database that has come from a user.
$str ='<h2 contenteditable="true">I am a not a good user <script>alert("hacked") </script> </h2>';
If I echo it as it is, that is not good, so I use htmlspecialchars() to escape the special HTML characters:
echo htmlspecialchars($str);
This will save me from hacking, but I want to keep other tags (like <h2>) as they are; I don't want them to change. Is there a way to escape only specific tags using htmlspecialchars()?

I was about to propose something very basic with regular expressions but I found this here:
https://stackoverflow.com/a/7131156/6219628
After reading more of the docs, I didn't find anything to ignore specific tags with just htmlspecialchars(), which doesn't sound surprising.
EDIT: And since using regex to parse html seems to be evil, you may eventually appreciate reading this bulky answer :)
https://stackoverflow.com/a/1732454/6219628

I think strip_tags() is what you are looking for. You can pass the allowed tags as the second parameter.
Check out this function from the PHP docs:
$strippedinput = strip_tags_attributes($nonverifiedinput,"<p><br><h1><h2><h3><a><img>","class,style");
function strip_tags_attributes($string,$allowtags=NULL,$allowattributes=NULL){
$string = strip_tags($string,$allowtags);
if (!is_null($allowattributes)) {
if(!is_array($allowattributes)) $allowattributes = explode(",",$allowattributes);
if(is_array($allowattributes)) $allowattributes = implode(")(?<!",$allowattributes);
if (strlen($allowattributes) > 0) $allowattributes = "(?<!".$allowattributes.")";
// create_function() was removed in PHP 8; a closure does the same job
$string = preg_replace_callback("/<[^>]*>/i", function ($matches) use ($allowattributes) {
return preg_replace('/ [^ =]*' . $allowattributes . '=("[^"]*"|\'[^\']*\')/i', '', $matches[0]);
}, $string);
}
return $string;
}
As Gerrit0 pointed out, you shouldn't use regex to parse HTML.

Note that just removing the <script> tag isn't sufficient; there are many other ways that users can inject malicious content into your site.
If you want to restrict the HTML tags that users can input, use a tool like HTML Purifier which uses a whitelist of allowable tags and attributes.
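To illustrate why filtering tags alone falls short, here is a small sketch using only the built-in strip_tags(). The input string is a made-up example: an allowed <h2> tag that carries an event-handler attribute.

```php
<?php
// Hypothetical input: an allowed tag carrying an event-handler attribute.
$str = '<h2 onmouseover="alert(1)">Title <script>alert("hacked")</script></h2>';

// Keep <h2>, strip every other tag.
$kept = strip_tags($str, '<h2>');

// The <script> tags are gone, but the onmouseover attribute survives,
// and the text that was inside <script> remains as plain text.
echo $kept, "\n";
```

Because strip_tags() never inspects attributes, an allowed tag can still carry script, which is the gap a whitelist-based tool is meant to close.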

preg_match (I can't do regex)

The company I work for has asked me to give them the ability to place a modal box on the web page from the CMS, but they do not want to type HTML. As I cannot for the life of me understand regex, I can't get it to work.
The layout of the code they should type is this:
++modal++
Some paragraph text.
Another paragraph.
++endmodal++
The paragraphs are already converted by markdown into <p>paragraph</p>.
So really the match has to be ++modal++, then any number of A-Za-z0-9 or any symbol excluding +, then ++endmodal++, which is then replaced with HTML.
I'm not sure if preg_match or preg_replace should be used.
I got this far:
$string = '++modal++<p>Hello</p>++endmodal++';
$pattern = '/\+\+modal\+\+/';
preg_match($pattern, $string, $matches);
Thank you in advance.
EDIT: To be a bit more clear, I wish to replace the ++modal++ and ++endmodal++ with HTML and leave the middle bit as is.

I don't really think you need a regex here, as your delimiters always remain the same and always sit at the same position in the string. Regular expressions are also more expensive in resources and, as a third counterargument, you said you are not comfortable with them.
So why not use a simple replacement or string trimming if it comes to that.
$search = array('++modal++', '++endmodal++');
$replacement = array('<tag>', '</tag>');
$str = '++modal++<p>Hello</p>++endmodal++';
$result = str_replace($search, $replacement, $str);
Where, of course, '<tag>' and '</tag>' are just example placeholders for your replacement.
This is what the manual for str_replace() says:
If you don't need fancy replacing rules (like regular expressions),
you should always use this function instead of preg_replace().

I think you should get your desired content using:
preg_match('/\+\+modal\+\+([^\+]+)\+\+endmodal\+\+/', $string, $matches);
// $matches[1] is now '<p>Hello</p>'
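If the goal is to swap both markers for HTML in one step while leaving the middle untouched, preg_replace with a capture group can do it. This is only a sketch: the function name renderModals and the wrapper markup (a div with class "modal") are assumptions, since the question does not say what HTML the modal needs.

```php
<?php
// Hypothetical helper: wrap the content between the markers in modal markup.
function renderModals(string $html): string
{
    return preg_replace(
        // The s flag lets . match newlines; .*? stops at the first ++endmodal++.
        '/\+\+modal\+\+(.*?)\+\+endmodal\+\+/s',
        '<div class="modal">$1</div>',
        $html
    );
}

echo renderModals('++modal++<p>Hello</p>++endmodal++'), "\n";
```

Unlike two independent replacements, this only touches markers that come in matching pairs, so a stray ++modal++ without its end marker is left as typed.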

You're trying to re-invent the wheel here. You're trying to write a simple template system here, but there are dozens of templating tools for PHP that you could use, ranging from big and complex like Smarty and Twig to really simple ones that aren't much more than you're trying to write.
I haven't used them all, so rather than recommend one I'll point you to a list of template engines you could try. You'll probably find more with a quick bit of googling.
If you do insist on writing your own, it's important to consider security. If you're outputting anything that contains data entered by your users, you must make sure all your output is properly escaped and sanitised for display on a web page; there are numerous common hacks that can take advantage of an insecure templating system to completely compromise a site.

<?php
$string = '++modal++<p>Hello</p>++endmodal++';
$patterns = array();
$patterns[0] = "/\+\+modal\+\+/"; // put '\' just before +
$patterns[1] = "/\+\+endmodal\+\+/";
$replacements = array();
$replacements[0] = '<html>';
$replacements[1] = '</html>';
echo preg_replace($patterns, $replacements, $string);
?>
Very similar to this example

StackOverflow Style A Href Auto Linking in Regex

I am using the below function to search for text links and convert them to a hyperlink. First of all is it correct? It appears to work but do you know of a (perhaps malformed) url that would break this function?
My question is whether it is possible to get this to support port numbers as well, for example stackoverflow.com:80/index will not be converted as the port is not seen as a valid part of the url.
So in summary I am looking for Stackoverflow style url recognition, which I believe is a custom addition to Markdown.
/**
* Search for and create links from urls
*/
static public function autoLink($text) {
$pattern = "/(((http[s]?:\/\/)|(www\.))(([a-z][-a-z0-9]+\.)?[a-z][-a-z0-9]+\.[a-z]+(\.[a-z]{2,2})?)\/?[a-z0-9._\/~#&=;%+?-]+[a-z0-9\/#=?]{1,1})/is";
$text = preg_replace($pattern, " <a href='$1'>$1</a>", $text);
// fix URLs without protocols
$text = preg_replace("/href='www/", "href='http://www", $text);
return $text;
}
Thanks for your time,

You should also look at the answers to this question: How to mimic StackOverflow Auto-Link Behavior
I have ended up combining the answers I have got both at stack overflow and talking to colleagues. The below code is the best we could come up with.
/**
* Search for and create links from urls
*/
static public function autoLink($text) {
$pattern = "/\b((?P<protocol>(https?)|(ftp)):\/\/)?(?P<domain>[-A-Z0-9\\.]+)[.][A-Z]{2,7}(([:])?([0-9]+)?)(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,\\.;]*)?(?P<parameters>\?[A-Z0-9+&@#\/%=~_|!:,\\.;]*)?/is";
// the /e modifier was removed in PHP 7, so use a plain replacement string
$text = preg_replace($pattern, " <a href='$0'>$0</a>", $text);
// fix URLs without protocols
$text = preg_replace("#href='www#i", "href='http://www", $text);
$text = preg_replace("#href=['\"](?!(https?|ftp)://)#i", "href='http://", $text);
return $text;
}

Rather than writing your own autolinking routine, which is essentially the beginning of a custom markup engine, you might want to use an open source markup engine, as it is less likely to be vulnerable to cross-site scripting attacks. One example of an open source markup engine for PHP is PHP Markdown, which has the ability to autolink URLs and essentially uses the same Markdown syntax that is in use at Stack Overflow.
One note: you should always escape HTML special characters using htmlspecialchars() before sticking the text into attributes or in the inner text of elements.

$pattern = "/\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)(([:])?([0-9]+)?)(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?(?P<parameters>\?[A-Z0-9+&@#\/%=~_|!:,.;]*)?/i";
will match:
http://www.scroogle.org/index.html
http://www.scroogle.org:80/index.html?source=library
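As a rough illustration of how such a pattern exposes the port, here is a sketch that pulls the named parts out with preg_match. The helper name parseUrlParts and the extra named group "port" are assumptions added for readability; the pattern is a lightly adapted variant, not the exact one above.

```php
<?php
// Hypothetical helper: split the first URL found in $text into named parts.
function parseUrlParts(string $text): ?array
{
    $pattern = '/\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)'
        . '(?::(?P<port>[0-9]+))?'
        . '(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?'
        . '(?P<parameters>\?[-A-Z0-9+&@#\/%=~_|!:,.;]*)?/i';

    if (!preg_match($pattern, $text, $m)) {
        return null;
    }
    // Optional groups that took no part in the match may be missing or empty.
    return [
        'protocol'   => $m['protocol'],
        'domain'     => $m['domain'],
        'port'       => $m['port'] ?? '',
        'file'       => $m['file'] ?? '',
        'parameters' => $m['parameters'] ?? '',
    ];
}

print_r(parseUrlParts('http://www.scroogle.org:80/index.html?source=library'));
```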

PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
<a href="http://www.google.com">http://www.google.com</a>
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.

Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didn't, the end of the last sentence I typed and the beginning of this sentence would become a link to http://search.if. Even if you did, .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.

www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you end up with horrible hacks like looking for a leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I link to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems, and you'd probably need an HTML parser to determine which bits are markup so that you avoid adding more markup inside them. Similarly, if you're doing more processing after this and inserting more tags, those steps may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
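The trailing-punctuation handling described above can be sketched roughly as follows. This is one possible reading, not the answerer's code: the function name linkifyText, the set of characters treated as trailing punctuation, and the bracket-balancing rule are all assumptions. It splits the text into URL and non-URL pieces and escapes each piece on output, so no raw <, & or " reaches the HTML.

```php
<?php
// Hypothetical helper: link http(s) URLs in plain text, keeping trailing punctuation outside the link.
function linkifyText(string $text): string
{
    // Odd-numbered pieces are URLs, even-numbered pieces are the text between them.
    $parts = preg_split('#(\bhttps?://[^\s<>"]+)#i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $out .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
            continue;
        }
        $url = $part;
        $trail = '';
        // Peel trailing punctuation off one character at a time.
        while ($url !== '' && strpos('.,!?:;)', substr($url, -1)) !== false) {
            $last = substr($url, -1);
            // Keep a closing bracket that balances an opening one inside the URL.
            if ($last === ')' && substr_count($url, '(') >= substr_count($url, ')')) {
                break;
            }
            $trail = $last . $trail;
            $url = substr($url, 0, -1);
        }
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        $out .= '<a href="' . $safe . '">' . $safe . '</a>'
            . htmlspecialchars($trail, ENT_QUOTES, 'UTF-8');
    }
    return $out;
}

echo linkifyText('See http://example.com/a_(b). Next'), "\n";
```

Splitting first and escaping each piece separately avoids the problem of matching URLs inside text that has already been turned into entities.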

HTML Purifier has a built-in linkify function to save you all the headaches.
Its other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.

Not so fancy regexps that should work
/\b(https?:\/\/[^\s+\"\<\>]+)/i
/\b(www\.[^\s+\"\<\>]+)/i
Note that the last two forms in your list (google.com and docs.google.com) would be impossible to match correctly, as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';

From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
// split the text into words
$words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$text = "";
// iterate through the words
foreach($words as $word) {
// chopword = the portion of the word that will be replaced
$chopword = $word;
$chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
if ($chopword <> '') {
// linkword = the text that will replace chopword in the word
$linkword='';
// does it start with http://abc. ?
if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = '<a href="'.$chopword.'">'.$chopword.'</a>';
// does it equal abc.def.ghi ?
} else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = '<a href="http://'.$chopword.'">'.$chopword.'</a>';
// does it start with abc@def.ghi ?
} else if (preg_match('/^[a-zA-Z0-9_\.]+\@([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
$linkword = '<a href="mailto:'.$chopword.'">'.$chopword.'</a>';
}
// replace chopword with linkword in word (if linkword was set)
if ($linkword <> '') {
$word = str_replace($chopword, $linkword, $word);
}
}
// append the word
$text = $text.$word;
}
return $text;
}

I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess/up/text/wrapping.html
EOF;
function trimlong($match)
{
    $url = $match[0];
    $display = $url;
    if ( strlen($display) > 30 ) {
        $display = substr($display,0,30)."...";
    }
    // link to the full URL but show the shortened text
    return '<a href="'.$url.'">'.$display.'</a> <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
// pass the function name; $this is only valid inside a class method
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
    'trimlong', $input);
echo $output;

Extract URLs from text in PHP

I have this text:
$string = "this is my friend's website http://example.com I think it is coll";
How can I extract the link into another variable?
I know it should be done with a regular expression, probably preg_match(), but I don't know how.

Probably the safest way is using code snippets from WordPress. Download the latest one (currently 3.1.1) and see wp-includes/formatting.php. There's a function named make_clickable which takes plain text as its parameter and returns a formatted string. You can borrow its code for extracting URLs. It's pretty complex though.
This one line regex might be helpful.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);
But this regex still can't remove some malformed URLs (ex. http://google:ha.ckers.org ).
See also:
How to mimic StackOverflow Auto-Link Behavior

I tried to do as Nobu said and use the WordPress code, but it had too many dependencies on other WordPress functions. I instead opted to use Nobu's regular expression for preg_match_all() and turned it into a function using preg_replace_callback(), which now replaces all links in a text with clickable links. It uses anonymous functions, so you'll need PHP 5.3, or you may rewrite the code to use an ordinary function instead.
<?php
/**
* Make clickable links from URLs in text.
*/
function make_clickable($text) {
$regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
return preg_replace_callback($regex, function ($matches) {
// no backslashes before the single quotes: inside double quotes they would be output literally
return "<a href='{$matches[0]}'>{$matches[0]}</a>";
}, $text);
}

URLs have a quite complex definition — you must decide what you want to capture first. A simple example capturing anything starting with http:// and https:// could be:
preg_match_all('!https?://\S+!', $string, $matches);
$all_urls = $matches[0];
Note that this is very basic and could capture invalid URLs. I would recommend catching up on POSIX and PHP regular expressions for more complex things.
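One way to weed out some of the invalid captures is to run each match through PHP's built-in filter_var() with FILTER_VALIDATE_URL. The sketch below assumes that this level of validation is enough for your use; the function name extractValidUrls is made up for the example.

```php
<?php
// Hypothetical helper: capture http(s) candidates, then keep only those PHP considers valid URLs.
function extractValidUrls(string $text): array
{
    preg_match_all('!https?://\S+!', $text, $matches);

    // array_values() reindexes the result after filtering.
    return array_values(array_filter($matches[0], function (string $url): bool {
        return filter_var($url, FILTER_VALIDATE_URL) !== false;
    }));
}

print_r(extractValidUrls("this is my friend's website http://example.com I think it is coll"));
```

Note that FILTER_VALIDATE_URL checks syntax only; it says nothing about whether the URL is safe to link to or whether the host exists.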

The code that worked for me (especially if you have several links in your $string):
$string = "this is my friend's website https://www.example.com I think it is cool, but this one is cooler https://www.stackoverflow.com :)";
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $string, $matches);
$urls = $matches[0];
// go over all links
foreach($urls as $url)
{
echo $url.'<br />';
}
Hope that helps others as well.

If the text you extract the URLs from is user-submitted and you're going to display the result as links anywhere, you have to be very, VERY careful to avoid XSS vulnerabilities, most prominently "javascript:" protocol URLs, but also malformed URLs that might trick your regexp and/or the displaying browser into executing them as Javascript URLs. At the very least, you should accept only URLs that start with "http", "https" or "ftp".
There's also a blog entry by Jeff where he describes some other problems with extracting URLs.
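A scheme allowlist along those lines could look like the sketch below. It relies on the built-in parse_url(); the function name isAllowedUrl and the default list of schemes are assumptions taken from the advice above.

```php
<?php
// Hypothetical helper: accept a URL only if its scheme is on an explicit allowlist.
function isAllowedUrl(string $url, array $schemes = ['http', 'https', 'ftp']): bool
{
    // parse_url() returns null when there is no scheme and false for seriously malformed input.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme) && in_array(strtolower($scheme), $schemes, true);
}

var_dump(isAllowedUrl('http://example.com'));
var_dump(isAllowedUrl('javascript:alert(1)'));
```

This check is meant to run in addition to escaping the URL with htmlspecialchars() before it is placed in an href attribute, not instead of it.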

preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);
This is an easy way that'd work for a lot of cases, though not all. All the matches are put in $matches. Note that this does not cover links in anchor elements (<a href=""...), but that wasn't in your example either.

You could do like this..
<?php
$string = "this is my friend's website http://example.com I think it is coll";
echo explode(' ',strstr($string,'http://'))[0]; //"prints" http://example.com

preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
$var, $matches); // call-time pass-by-reference (&$matches) is a fatal error since PHP 5.4
$links = $matches[1]; // the captured href values
foreach($links as $link)
{
print($link."<br>");
}

You could try this to find the link and revise the link (add the href link).
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$text = "The text you want to filter goes here. http://example.com";
if(preg_match($reg_exUrl, $text, $url)) {
echo preg_replace($reg_exUrl, "<a href=\"{$url[0]}\">{$url[0]}</a> ", $text);
} else {
echo "No url in the text";
}
refer here: http://php.net/manual/en/function.preg-match.php

There are a lot of edge cases with URLs. For example, a URL could contain brackets or lack a protocol. That's why a regex is not enough.
I created a PHP library that could deal with lots of edge cases: Url highlight.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$urlHighlight->getUrls("this is my friend's website http://example.com I think it is coll");
// return: ['http://example.com']
For more details see readme. For covered url cases see test.

Here is a function I use. I can't remember where it came from, but it seems to do a pretty good job of finding links in the text and making them clickable.
You can change the function to suit your needs. I just wanted to share this as I was looking around and remembered I had this in one of my helper libraries.
function make_links($str){
$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
return '<a href="' . $url . '">' . $input . '</a>';
}, $str);
}
Use:
$subject = "this is a link http://google:ha.ckers.org maybe don't want to visit it?";
echo make_links($subject);
Output
this is a link http://google:ha.ckers.org maybe don't want to visit it?

<?php
preg_match_all('/(href|src)[\s]?=[\s\"\']?+(.*?)[\s\"\']+.*?/', $webpage_content, $link_extracted);

This regex works great for me, and I have checked it with all types of URLs:
<?php
$string = "Thisregexfindurlhttp://www.rubular.com/r/bFHobduQ3n mixedwithstring";
preg_match_all('/(https?|ssh|ftp):\/\/[^\s"]+/', $string, $url);
$all_url = $url[0]; // Returns Array Of all Found URL's
$one_url = $url[0][0]; // Gives the First URL in Array of URL's
?>
It was checked with lots of URLs, which you can find here: http://www.rubular.com/r/bFHobduQ3n

public function find_links($post_content){
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// Check if there is a url in the text
if(preg_match_all($reg_exUrl, $post_content, $urls)) {
// make the urls hyper links,
foreach($urls[0] as $url){
$post_content = str_replace($url, ' <a href="'.$url.'">LINK</a> ', $post_content);
}
//var_dump($post_content);die(); //uncomment to see result
//return text with hyper links
return $post_content;
} else {
// if no urls in the text just return the text
return $post_content;
}
}
