PHP Regular Expression Text URL to HTML Link - php

I'm working on a project where I need to replace text urls anywhere from domain.com to www.domain.com to http(s)://www.domain.com and e-mail addresses to the proper html <a> tag. I was using a great solution in the past, but it used the now depreciated eregi_replace function. On top of that, the regular expression used for such function does not work with preg_replace.
So basically, the user inputs a message in which may/may not contain a link/e-mail address and I need a regular expression that works with preg_replace to replace that link/email with a HTML link like link.
Please note that I have multiple other preg_replaces too. Below is my current code for the other replacements being made.
$patterns = array('~\[#([^\]]*)\]~','~\[([^\]]*)\]~','~{([^}]*)}~','~_([^_]*)_~','/\s{2}/');
$replacements = array('<b class="reply">#\\1</b>','<b>\\1</b>','<i>\\1</i>','<u>\\1</u>','<br />');
$msg = preg_replace($patterns, $replacements, $msg);
return stripslashes(utf8_encode($msg));

I have created a very basic set of Regular Expressions for this. Don't expect them to be 100% reliable, and you may need to tweak them as you go.
$pattern = array(
'/((?:[\w\d]+\:\/\/)?(?:[\w\-\d]+\.)+[\w\-\d]+(?:\/[\w\-\d]+)*(?:\/|\.[\w\-\d]+)?(?:\?[\w\-\d]+\=[\w\-\d]+\&?)?(?:\#[\w\-\d]*)?)/' , # URL
'/([\w\-\d]+\#[\w\-\d]+\.[\w\-\d]+)/' , # Email
'/\[#([^\]]*)\]/' , # Reply
'/\[([^\]]*)\]/' , # Bold
'/\{([^}]*)\}/' , # Italics
'/_([^_]*)_/' , # Underline
'/\s{2}/' , # Linebreak
);
$replace = array(
'$1' ,
'$1' ,
'<b class="reply">#$1</b>' ,
'<b>$1</b>' ,
'<i>$1</i>' ,
'<u>$1</u>' ,
'<br />'
);
$msg = preg_replace( $pattern , $replace , $msg );
return stripslashes( utf8_encode( $msg ) );

Related

How to find, make link and shorten url text in text block with PHP

So I currently have this...
<?php
$textblockwithformatedlinkstoecho = preg_replace('!(((f|ht)tp(s)?://)[-a-zA-
Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1',
$origtextwithlinks);
echo $textblockwithformatedlinkstoecho;
?>
But, I would like to also shorten the clickable link to around 15 chars in length...
Example input text
I recommend you visit http://www.example.com/folder1/folder2/page3.html?
longtext=ugsdfhsglshghsdghlsg8ysd87t8sdts8dtsdtygs9ysd908yfsd0fyu for more
information.
Required output text
I recommend you visit example.com/fol... for more information.
You can use preg_replace_callback() to manipulate the matches.
Example:
$text = "I recommend you visit http://www.example.com/folder1/folder2/page3.html?longtext=ugsdfhsglshghsdghlsg8ys\d87t8sdts8\dtsdtygs9ysd908yfsd0fyu for more information.";
$fixed = preg_replace_callback(
'!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i',
function($matches) {
// Get the fully matched url
$url = $matches[0];
// Do some magic for the link text, like only show the first 15 characters
$text = strlen($url) > 15
? substr($url, 0, 15) . '...'
: $url;
// Return the new html link
return '' . $text . '';
},
$text
);
echo $fixed;
You probably need to modify your regex though, since it doesn't match the \-characters you have in the query string in the url.

How to fix this preg_replace codes?

Having following code to turn an URL in a message into HTML links:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
"\\0", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
"\\1\\2", $message);
It works very good with almost all links, except in following cases:
1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide
Problem here is the # and the : within the link, because not the complete link is transformed.
2) If someone just writes "www" in a message
Example: www
So the question is about if there is any way to fix these two cases in the code above?
Since you want to include the hash (#) to the regex, you need to change the delimiters to characters that are not included in your regex, e.g. !. So, your regex should look like this:
$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"\\0", $message);
Does this help?
Though, if you would like to be more along the specification (RCF 1738) you might want to exclude % which is not allowed in URLs. There are also some more allowed characters which you didn't include:
$
_
. (dot)
+
!
*
'
(
)
If you would include these chars, you should then delimiter your regex with %.
Couple minor tweaks. Add \# and : to the first regex, then change the * to + in the second regex:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
"\\0", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
"\\1\\2", $message);
In my opinion, it is vain to tackle this problem. A good alternative is to find what could be an URL via regex (begin with the protocol: http, ftp, mail... or by www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not a waterproof way as the PHP manual says:
"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."
Example of code (not tested):
$message = preg_replace_callback(
'~(?(DEFINE)
(?<prot> (?>ht|f) tps?+ :// ) # you can add protocols here
)
(?>
<a\b (?> [^<]++ | < (?!/a>) )++ </a> # avoid links inside "a" tags
|
<[^>]++> # and tags attributes.
) (*SKIP)(?!) # makes fail the subpattern.
| # OR
\b(?>(\g<prot>)|www\.)(\S++) # something that begins with
# "http://" or "www."
~xi',
function ($match) {
if (filter_var($match[2], FILTER_VALIDATE_URL)) {
$url = (empty($match[1])) ? 'http://' : '';
$url .= $match[0];
return '<a href="away?to=' . $url . '"target="_blank">'
. $url . '</a>';
} else { return $match[0] }
},
$message);

PHP regular expression to replace links

I have this replace regex (it's taken from the phpbb source code).
$match = array(
'#<!\-\- ([mw]) \-\-><a (?:class="[\w-]+" )?href="(.*?)" target\=\"_blank\">.*?</a><!\-\- \1 \-\->#',
'#<!\-\- .*? \-\->#s',
'#<.*?>#s',
);
$replace = array( '\2', '', '');
$message = preg_replace($match, $replace, $message);
If I run it through a message like this
asdfafdsfdfdsfds
<!-- m --><a class="postlink" href="http://website.com/link-is-looooooong.txt">http://website.com/link ... oooong.txt</a><!-- m -->
asdfafdsfdfdsfds4324
It returns this
asdfafdsfdfdsfds
http://website.com/link ... oooong.txt
asdfafdsfdfdsfds4324
However I would like to make it into a replace function. So I can replace the link title in a block by providing the href.
I want to provide the url, new url and new title. So I can run a regex with these variables.
$url = 'http://website.com/link-is-looooooong.txt';
$new_title = 'hello';
$new_url = 'http://otherwebsite.com/';
And it would return the same raw message but with the link changed.
<!-- m --><a class="postlink" href="http://otherwebsite.com/">hello</a><!-- m -->
I've tried tweaking it into something like this but I can't get it right. I don't know how to build up the matched result so it has the same format after replacing.
$message = preg_replace('#<!\-\- ([mw]) \-\-><a (?:class="[\w-]+" )?href="'.preg_quote($url).'" target\=\"_blank\">(.*?)</a><!\-\- \1 \-\->#', $replace, $message);
You'll find that parsing HTML with regex can be a pain and get very complex. Your best bet is to use a DOM parser, like this one, and modify the links with that instead.
You need to catch the other parts in groups as well and then use them in the replacement. try something like this:
$replace = '\1http://otherwebsite.com/\3hello\4';
$reg = '#(<!-- ([mw]) --><a (?:class="[\w-]+" )?href=")'.preg_quote($url).'("(?: target="_blank")?>).*?(</a><!-- \2 -->)#';
$message = preg_replace($reg, $replace, $message);
See here.

PHP regex on weird situations

I'm trying to scrape a website using some regex. But the site isn't written in well formatted html. In fact, the html is horrible and not structured hardly at all. But I've managed to tackle most of it. The problem I'm encountering now is that in some emails, a span is wrapped around a random part of the email like so:
****.*******#g<span class="tournamenttext">mail.com</span>
************<span class="tournamenttext">#yahoo.com</span>
<span class="tournamenttext">**********#mail.com</span>
*******#gmail.com
Is there a way to retrieve the emails with all this inconsistency?
$string ='****.*******#g<span class="tournamenttext">mail.com</span>
************<span class="tournamenttext">#yahoo.com</span>
<span class="tournamenttext">**********#mail.com</span>
*******#gmail.com';
$pattern = "/<\/?span[^>]*>/";
$string = preg_replace($pattern, "", $string);
after that $string will be only mails
****.*******#gmail.com
************#yahoo.com
**********#mail.com
*******#gmail.com
Your code will be like this
$text[1]->innertext = "Where innertext contains something like: "<em>Local (Open)
Tournament.</em> ****.*******#g<span class="tournamenttext">mail.com</span>"
// Firstly clear spans
$pattern = "/<\/?span[^>]*>/";
$text[1]->innertext = preg_replace($pattern, "", $text[1]->innertext);
// Preg Match mail
$email_regex = "^[_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$"; // Just an example email match regex
preg_match($email_regex, $text[1]->innertext, $theMatch);
echo '<pre>' . print_r($theMatch, true) . '</pre>';
You could simply remove all span tags by replacing </?span[^>]*> with nothing and try your favourite email address finder on the result.

Removal of bad hyperlinks and the content inside of them

Ok, basically I have an array of bad urls and I would like to search through a string and strip them out. I want to strip everything from the opening tag to the closing tag, but only if the url in the hyperlink is in the array of bad urls. Here is how I would picture it working but I don't understand regular expressions well.
foreach($bad_urls as $bad_url){
$pattern = "/<a*$bad_url*</a>/";
$replacement = ' ';
preg_replace($pattern, $replacement, $content);
}
Thanks in advance.
Assuming that your 'bad urls' are properly formatted URLs, I would suggest doing something like this:
foreach($bad_urls as $bad_url){
$pattern = '/<[aA]\s.+[href|HREF]\=\"' . convert_to_pattern($bad_url) . '\".+<\/[aA]>/msU';
$replacement = ' ';
$content = preg_replace_all($pattern, $replacement, $content);
}
and separately
function convert_to_pattern($url)
{
searches = array('%', '&', '?', '.', '/', ';', ' ');
replaces = array('\%','\&','\?','\.','\/','\;','\ ');
return preg_replace_all($searches, $replaces, $url);
}
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, find all the <a> tags and check the href property. Much simpler and fool-proof.

Categories