StackOverflow Style A Href Auto Linking in Regex - php

I am using the below function to search for text links and convert them to a hyperlink. First of all is it correct? It appears to work but do you know of a (perhaps malformed) url that would break this function?
My question is whether it is possible to get this to support port numbers as well, for example stackoverflow.com:80/index will not be converted as the port is not seen as a valid part of the url.
So in summary I am looking for Stackoverflow style url recognition, which I believe is a custom addition to Markdown.
/**
* Search for and create links from urls
*/
static public function autoLink($text) {
$pattern = "/(((http[s]?:\/\/)|(www\.))(([a-z][-a-z0-9]+\.)?[a-z][-a-z0-9]+\.[a-z]+(\.[a-z]{2,2})?)\/?[a-z0-9._\/~#&=;%+?-]+[a-z0-9\/#=?]{1,1})/is";
$text = preg_replace($pattern, " <a href='$1'>$1</a>", $text);
// fix URLs without protocols
$text = preg_replace("/href='www/", "href='http://www", $text);
return $text;
}
Thanks for your time,

You should also look at the answers to this question: How to mimic StackOverflow Auto-Link Behavior
I have ended up combining the answers I have got both at stack overflow and talking to colleagues. The below code is the best we could come up with.
/**
* Search for and create links from urls
*/
static public function autoLink($text) {
$pattern = "/\b((?P<protocol>(https?)|(ftp)):\/\/)?(?P<domain>[-A-Z0-9\\.]+)[.][A-Z]{2,7}(([:])?([0-9]+)?)(?P<file>\/[-A-Z0-9+&##\/%=~_|!:,\\.;]*)?(?P<parameters>\?[A-Z0-9+&##\/%=~_|!:,\\.;]*)?/ise";
$text = preg_replace($pattern, "' $0'", $text);
// fix URLs without protocols
$text = preg_replace("#href='www#i", "href='http://www", $text);
$text = preg_replace("#href=['\"](?!(https?|ftp)://)#i", "href='http://", $text);
return $text;
}

Rather than writing your own autolinking routine, which is essentially the beginning of a custom markup engine, you might want to use an open source markup engine, as it is less likely to be vulnerable to cross-site scripting attacks. One example of an open source markup engine for PHP is PHP Markdown, which has the ability to autolink URLs and essentially uses the same Markdown syntax that is in use at Stack Overflow.
One note: you should always escape HTML special characters using htmlspecialchars() before sticking the text into attributes or in the inner text of elements.

$pattern = "/\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)(([:])?([0-9]+)?)(?P<file>\/[-A-Z0-9+&##\/%=~_|!:,.;]*)?(?P<parameters>\?[A-Z0-9+&##\/%=~_|!:,.;]*)?/i";
will match:
http://www.scroogle.org/index.html
http://www.scroogle.org:80/index.html?source=library

Related

Parsing a Markdown style link safely

I have written some code to match and parse a Markdown link of this style:
[click to view a flower](http://www.yahoo.com/flower.html)
I have this code that is meant to extract the link text, then the url itself, then stick them in an A HREF link. I am worried though that maybe I am missing a way for someone to inject XSS, because I am leaving in a decent amount of characters. is this safe?
$pattern_square = '\[(.*?)\]';
$pattern_round = "\((.*?)\)";
$pattern = "/".$pattern_square.$pattern_round."/";
preg_match($pattern, $input, $matches);
$words = $matches[1];
$url = $matches[2];
$words = ereg_replace("[^-_#0-9a-zA-Z\.]", "", $words);
$url = ereg_replace("[^-A-Za-z0-9+&##/%?=~_|!:.]","",$url);
$final = "<a href='$url'>$words</a>";
It seems to work okay, and it does exclude some stupid URLs that include semicolons and backslashes, but I don't care about those URLs.
If you have already passed the input through htmlspecialchars (which you are doing, right?) then it is already impossible for the links to contain any characters that could cause XSS.
If you have not already passed the input through htmlspecialchars, then it doesn't matter what filtering you do when parsing the links, because you're already screwed, because one can trivially include arbitrary HTML or XSS outside the links.
This function will safely parse Markdown links in text while applying htmlspecialchars on it:
function doMarkdownLinks($s) {
return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
return '' . $matches[1] . '';
}, htmlspecialchars($s));
}
If you need to do anything more complicated than that, I advise you to use an existing parser, because it is too easy to make a mistake with this sort of thing.

preg_match (I can't do regex)

The company I work for have asked me to give them the ability to place a modal box on the web page from the CMS, but do not want to type HTML. As I cannot for the life of me understand regex I can't get it.
The layout of the code they should type is this:
++modal++
Some paragraph text.
Another paragraph.
++endmodal++
The paragraphs are already converted by markdown into <p>paragraph</p>.
So really the match has to be ++modal++ any number of A-Za-z0-9any symbol excluding + ++endmodal++ then replaced with HTML.
I'm not sure it preg_match or preg_replace should be used.
I got this far:
$string = '++modal++<p>Hello</p>++endmodal++';
$pattern = '/\+\+modal\+\+/';
preg_match($pattern, $string, $matches);
Thank you in advance.
EDIT: A to be a bit more clear, I wish to replace the ++modal++ and ++endmodal++ with HTML and leave the middle bit as is.
I don't really think you need a RegEx here as your delimiters remain always the same and always on the same position of the string. Regular expressions are also expensive on resources and as a third counter argument you said you're not fit with them.
So why not use a simple replacement or string trimming if it comes to that.
$search = array('++modal++', '++endmodal++');
$replacement = array('<tag>', '</tag>');
$str = '++modal++<p>Hello</p>++endmodal++';
$result = str_replace($search, $replacement, $str);
Where, of course, '<tag>' and '</tag>' are just example placeholders for your replacement.
This is what the manual for str_replace() says:
If you don't need fancy replacing rules (like regular expressions),
you should always use this function instead of preg_replace().
I think you should get your desired content using:
preg_match('/\+\+modal\+\+([^\+]+)\+\+endmodal\+\+/', $string, $matches)
$matches[1] = '<p>Hello</p>
You're trying to re-invent the wheel here. You're trying to write a simple template system here, but there are dozens of templating tools for PHP that you could use, ranging from big and complex like Smarty and Twig to really simple ones that aren't much more than you're trying to write.
I haven't used them all, so rather than recommend one I'll point you to a list of template engines you could try. You'll probably find more with a quick bit of googling.
If you do insist on writing your own, it's important to consider security. If you're outputting anything that contains data entered by your users, you must make sure all your output is properly escaped and sanitised for display on a web page; there a numerous common hacks that can take advantage of an insecure templating system to completely compromise a site.
<?php
$string = '++modal++<p>Hello</p>++endmodal++';
$patterns = array();
$patterns[0] = "/\+\+modal\+\+/"; // put '\' just before +
$patterns[1] = "/\+\+endmodal\+\+/";
$replacements = array();
$replacements[1] = '<html>';
$replacements[0] = '</html>';
echo preg_replace($patterns, $replacements, $string);
?>
Very similar to this example

URL detection problems

i'm currently having some problems with detecting urls and making them clickable.
Until now it always worked fine, probably because we always tested this with real urls, but now the website is live, we're having some problems.
This was the code we used to detect them before
$content = preg_replace('!(((f|ht)tp://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $content);
$content = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)', '\\1\\2', $content);
It was doing a great job for normal urls, but some urls are giving problems:
- hk.linkedin.com
- www.test.com
- test.com
Also notice that some urls don't have http in fron of them.
I'm really not that good with regex, so I would very much appreciate it if somebody could help me figure this out.
What exactly you wanted to get. In this example, I can see blatant lack of understanding for regular expressions... but then, I see this exact code used in few codes according to Google Code Search. But those were made to find URLs in middle of text (not always what looks like URL is URL, but if it contains http:// or www it's sure that's URL.
Not everything needs to be done only using regular expressions. Those are helpful, but sometimes they make additional problems.
One of problems in regular expressions is that they don't have conditionals on result. You can use multiple regular expressions, but there is chance that something will be done wrongly (like affecting what previous regular expression has done). Just look at this. It assigns additional function (you can use e modifier, but it may make code unreadable).
<?php
$content = preg_replace_callback('{\b(?:(https?|ftp)://)?(\S+[.]\S+)\b}i',
'addHTTP', $content);
function addHTTP($matches) {
if(empty($matches[1])) {
return 'http://' . $matches[2] . '';
}
else {
return '' . $matches[2] . '';
}
}
Or two regular expressions (little harder to understand)...
$content = preg_replace('{\b(?:(?:https?|ftp)://)\S+[.]\S+\b}i',
'$0', $content);
$content = preg_replace('{\b(?<!["\'=><.])[-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+[.][-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+(?!["\'=><.])\b}i',
'http://$0', $content);
Also, you should avoid using target="". Users don't expect that new window will appear when clicking the link. After user will click such link he might wonder why "Go left" button doesn't work (hint: new window caused it to disappear). If somebody really wants to open link in new window he will do it yourself (it's not hard...).
Note that usually such stuff is linked with other helpers like this. For example, Stack Overflow uses some kind of Markdown modification which does more intelligent renaming, like changing plain text lists to HTML lists... But that all depends on what you need. If you only need processing links, you can try using those regexpes, but well...

How to make this linking making script behave with markdown?

I'm using PHP markdown but I also need a script to convert plaintext links into clicakable ones. Both work independently, but when I try to run them together, if I run markdown first, the makelinks still processes on the html code and screws things up.. and.. vice versa. Any idea of how to stop it from doing that? I can't figure out regex to ignore the markdown style links
function makeLinks($text) {
$text = preg_replace('%(((f|ht){1}tp://)[-a-zA-^Z0-9#:\%_\+.~#?&//=]+)%i',
'\\1', $text);
$text = preg_replace('%([[:space:]()[{}])(www.[-a-zA-Z0-9#:\%_\+.~#?&//=]+)%i',
'\\1\\2', $text);
return $text;
}
sample text:
###[Title Section](http://domain/folder/page.html)
- Blah blah some text and then a link: www.webpage.org.
The double-linkify problem can be solved best with guesswork and workarounds. (We have some duplicate questions, but I can never find a good one..)
Since already converted http://-urls only occur right after href=" or an >, you can use those for negative assertions.
(?<!href="|>)
Should be written at the start of your first regex:
$text = preg_replace('%(?<!href="|>)(((f|ht){1}tp://)...
Your second regex uses the :space: as anchor, so should be fault tolerant already.

Extract URL from string

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URL's but not for others.
Thanks
John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Using preg_replace() as mentioned in the other answers, using the following regex should be one of the most accurate, if not the most accurate, method for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
$string = preg_replace('/https?:\/\/[^\s"<>]+/', '$0', $string);
It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '$0', $string);
There are a lot of edge cases with urls. Like url could contain brackets or not contain protocol etc. Thats why regex is not enough.
I created a PHP library that could deal with lots of edge cases: Url highlight.
You could extract urls from string or directly highlight them.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, http://example.com.'
For more details see readme. For covered url cases see test.
Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php. will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
It looks like stackoverflow's parser is better. Is is open source?
This code is worked for me.
function makeLink($string){
/*** make sure there is an http:// on all URLs ***/
$string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2",$string);
/*** make all URLs links ***/
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</a>",$string);
/*** make all emails hot links ***/
$string = preg_replace("/([\w-?&;#~=\.\/]+\#(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","$1",$string);
return $string;
}

Categories