Converting URLs to HTML hyperlinks while avoiding already-formatted hyperlinks - php

So I have this code:
$sURLRegExp = '/http\:\/\/([a-z0-9\-\.]+\.[a-z]{2,3}(\/\S*)?)/i';
$iURLMatches = preg_match($sURLRegExp, $sMessage, $aURLMatches);
if ($iURLMatches > 0) {
$sURL = $aURLMatches[1];
$sURL = str_replace('www.', '', $sURL);
$sMessage = preg_replace($sURLRegExp, '<a href="http://$1" target="_blank">' .
$sURL . '</a>', $sMessage);
}
It does a perfect job of converting all incoming messages so that plain URLs entered will turn into HTML hyperlinks that even remove the "http://" and "www." part, for brevity.
Thing is, administrators for the site on which this works are able to enter in HTML. If they do, it turns it into a horrid mess. Something like site.com">text</a>.
I tried altering the regular expression to make sure that there is no quotation mark after the given URL (which most likely indicates it's part of a hyperlink anchor tag) like so:
$sURLRegExp = '/http\:\/\/([a-z0-9\-\.]+\.[a-z]{2,3}(\/\S*)?([^"])/i';
...but it doesn't seem to work. I know about look-ahead assertions, but have no idea how to use them at all. Would that be the best thing to use in this case? How would I detect the presence of an anchor tag around this URL?
Note: I know I could just use strpos(...) !== false on the entire message, but that doesn't account for mixes of plain URLs and anchor tags in the same message.

Hmm, turns out I hadn't searched Stack Overflow thoroughly enough. All I had to do was add (?<![">]) to the beginning of my regular expression, like so:
$sURLRegExp = '/(?<![">])http\:\/\/([a-z0-9\-\.]+\.[a-z]{2,3}(\/\S*)?)([^"])/i';
...and it works perfectly. I'm keeping this for future reference for anybody else who happens upon this post.

Related

Slugs for SEO using PHP - Appending name to end of URL

Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
Please see it working by clicking here
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any trailing slashes
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things, it makes the string URL friendly and readable to the human eye. However, it I would like to see if it is possible to simplify or "tightened it up" a bit... as I feel my code is probably over complicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (Doesn't contain percentage signs and + symbols like simply using urlencode()
Thanks for your help!
Potential Solutions
I found out whilst writing this that article, that I am looking for what is known as a URL 'slug' and they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.
I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add those:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator
Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u

Regex/Php: Cannot match question mark in live site

I would like to remove the $_GET parameter of the first "page" item on a website.
The following works perfectly in a test script on my local server:
$urls = array(
'http://www.foo.com/bar.html?p=1', //should match
'http://www.foo.com/bar.html?p=23',
'http://www.foo.com/bar.html?p=120',
'http://www.foo.com/bar.html?baz=123&p=1' //should match
);
foreach ($urls as $url) {
echo $url . '<br>';
echo preg_replace('/([\?&]p=1)(?!\d)/', '', $url) . '<p>';
}
This produces:
http://www.foo.com/bar.html?p=1
http://www.foo.com/bar.html
http://www.foo.com/bar.html?p=23
http://www.foo.com/bar.html?p=23
http://www.foo.com/bar.html?p=120
http://www.foo.com/bar.html?p=120
http://www.foo.com/bar.html?baz=123&p=1
http://www.foo.com/bar.html?baz=123
However on the live site, it never matches.
To make matters worse,
str_replace('?p=1','',$url);
will not work as well. What am I missing? I can match a single question mark, but as soon as something follows it, I'm out of luck. This is the case for both str_replace and preg_replace. I feel like I'm missing something obvious, but I cannot figure it out. Thank you for your help.
Solution:
In my specific case, it turned out that the underlying Magento shop system was already giving out html_encoded characters. This, plus the fact the first parameter is always a session ID which is later removed from the URL string, made my task as easy as
$url = str_replace('&p=1', '', $url);
try \\\? instead of \? ; if that doesn't work, you might run a regex engine version which doesnt support negative lookahead.
In that case you could reform your preg_replace to
preg_replace('/([\?&]p=1)([^\d])/', '$2', $url) . '<p>';
which would consume the non-digit, but put it back in again. There might be edge cases where this differs from your regex, but I don't think you'd be able to encounter those with urls (and I can't think of any from the top of my head)
of course, there are other non-regex solutions to this, but as regex is a very powerful tool, it's always good to learn something about it ;)

Parsing link from javascript function

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
/http://.*\.mp4/ will give you all characters between http:// and .mp4, inclusive.
See it in action.
If you need the session id, use something like /http://.*\.mp4?sessionid=\d+/
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
The following captures any url in your html
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
if you want to do the same in javascript you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}

URL detection problems

i'm currently having some problems with detecting urls and making them clickable.
Until now it always worked fine, probably because we always tested this with real urls, but now the website is live, we're having some problems.
This was the code we used to detect them before
$content = preg_replace('!(((f|ht)tp://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $content);
$content = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)', '\\1\\2', $content);
It was doing a great job for normal urls, but some urls are giving problems:
- hk.linkedin.com
- www.test.com
- test.com
Also notice that some urls don't have http in fron of them.
I'm really not that good with regex, so I would very much appreciate it if somebody could help me figure this out.
What exactly you wanted to get. In this example, I can see blatant lack of understanding for regular expressions... but then, I see this exact code used in few codes according to Google Code Search. But those were made to find URLs in middle of text (not always what looks like URL is URL, but if it contains http:// or www it's sure that's URL.
Not everything needs to be done only using regular expressions. Those are helpful, but sometimes they make additional problems.
One of problems in regular expressions is that they don't have conditionals on result. You can use multiple regular expressions, but there is chance that something will be done wrongly (like affecting what previous regular expression has done). Just look at this. It assigns additional function (you can use e modifier, but it may make code unreadable).
<?php
$content = preg_replace_callback('{\b(?:(https?|ftp)://)?(\S+[.]\S+)\b}i',
'addHTTP', $content);
function addHTTP($matches) {
if(empty($matches[1])) {
return 'http://' . $matches[2] . '';
}
else {
return '' . $matches[2] . '';
}
}
Or two regular expressions (little harder to understand)...
$content = preg_replace('{\b(?:(?:https?|ftp)://)\S+[.]\S+\b}i',
'$0', $content);
$content = preg_replace('{\b(?<!["\'=><.])[-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+[.][-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+(?!["\'=><.])\b}i',
'http://$0', $content);
Also, you should avoid using target="". Users don't expect that new window will appear when clicking the link. After user will click such link he might wonder why "Go left" button doesn't work (hint: new window caused it to disappear). If somebody really wants to open link in new window he will do it yourself (it's not hard...).
Note that usually such stuff is linked with other helpers like this. For example, Stack Overflow uses some kind of Markdown modification which does more intelligent renaming, like changing plain text lists to HTML lists... But that all depends on what you need. If you only need processing links, you can try using those regexpes, but well...

PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
http://www.google.com
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.
Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didnt the end of the last sentence i typed and the beginning of this sentence would be a link to http://search.if. Even if you did .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.
www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you up with horrible hacks like looking for leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I like to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems and you'd probably need an HTML parser to determine which bits are markup to avoid adding more markup inside of. Similarly, if you're doing more processing after this, inserting more tags, those steps are may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
HTML Purifier has a built-in linkify function to save you all the headaches.
It's other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
Not so fancy regexps that should work
/\b(https?:\/\/[^\s+\"\<\>]+)/ig
/\b(www.[^\s+\"\<\>]+)/ig
Note that the last two would be impossible to do correctly as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';
From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
// split the text into words
$words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$text = "";
// iterate through the words
foreach($words as $word) {
// chopword = the portion of the word that will be replaced
$chopword = $word;
$chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
if ($chopword <> '') {
// linkword = the text that will replace chopword in the word
$linkword='';
// does it start with http://abc. ?
if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = ''.$chopword.'';
// does it equal abc.def.ghi ?
} else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = ''.$chopword.'';
// does it start with abc#def.ghi ?
} else if (preg_match('/^[a-zA-Z0-9_\.]+\#([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
$linkword = ''.$chopword.'';
}
// replace chopword with linkword in word (if linkword was set)
if ($linkword <> '') {
$word = str_replace($chopword, $linkword, $word);
}
}
// append the word
$text = $text.$word;
}
return $text;
}
I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess
/up/text/wrapping.html
EOF;
function trimlong($match)
{
$url = $match[0];
$display = $url;
if ( strlen($display) > 30 ) {
$display = substr($display,0,30)."...";
}
return ''.$display.' <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
array($this,'trimlong'),$input);
echo $output;

Categories