I have the following code to turn URLs in a message into HTML links:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
"<a href=\"\\0\">\\0</a>", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
"\\1<a href=\"http://\\2\">\\2</a>", $message);
It works well with almost all links, except in the following cases:
1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide
The problem here is the # and the : within the link: only part of the link is transformed.
2) If someone just writes "www" in a message
Example: www
So the question is: is there any way to fix these two cases in the code above?
Since you want to include the hash (#) in the regex, you need to change the delimiters to a character that does not appear in your regex, e.g. !. So your regex should look like this:
$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"<a href=\"\\0\">\\0</a>", $message);
Does this help?
Though, if you would like to follow the specification (RFC 1738) more closely, you might want to exclude %, which is only allowed in URLs as the start of a percent-encoded escape. There are also some more allowed characters which you didn't include:
$
_
. (dot)
+
!
*
'
(
)
If you include these characters (and drop %), you should then delimit your regex with %.
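The delimiter change can be sketched as follows. This is a minimal example under stated assumptions, not a drop-in fix: linkify() is a hypothetical name, the <a> replacement strings are an assumption about what the original code emitted, and ~ is used as the delimiter simply because it does not appear in the character class. It also uses + rather than * after www so that a bare "www" is left alone.

```php
<?php
// Hypothetical helper; the <a> markup in the replacements is an assumption.
function linkify(string $message): string
{
    // "~" as delimiter lets "#" and ":" sit unescaped in the character class.
    $message = preg_replace(
        '~(?:https?|ftps?)://(?:[.]?[&;%=a-zA-Z0-9_/?#:-])*~',
        '<a href="$0">$0</a>',
        $message
    );

    // "+" instead of "*" requires at least one character after "www".
    return preg_replace(
        '~(^| |\n)(www(?:[.]?[&;%=a-zA-Z0-9_/?#:-])+)~',
        '$1<a href="http://$2">$2</a>',
        $message
    );
}

echo linkify('http://example.com/mediathek#/video/1976914/zoom:-World-Wide'), "\n";
echo linkify('www'), "\n"; // left untouched
```

With this pattern the whole first URL, including the part after #, ends up inside one link, and the lone "www" is returned unchanged.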
A couple of minor tweaks: add \# and : to the first regex, then change the * to + in the second regex:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
"<a href=\"\\0\">\\0</a>", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
"\\1<a href=\"http://\\2\">\\2</a>", $message);
In my opinion, it is futile to tackle this problem with a regex alone. A good alternative is to find what could be a URL via regex (beginning with a protocol such as http or ftp, or with www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not foolproof, as the PHP manual says:
"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."
Example of code (not tested):
$message = preg_replace_callback(
    '~(?(DEFINE)
        (?<prot> (?>ht|f) tps?+ :// )          # you can add protocols here
    )
    (?>
        <a\b (?> [^<]++ | < (?!/a>) )++ </a>   # avoid links inside "a" tags
      |
        <[^>]++>                               # and tag attributes
    ) (*SKIP)(?!)                              # makes this branch fail
    |                                          # OR
    \b (?> (?<scheme>\g<prot>) | www\. ) \S++  # something that begins with
                                               # "http://" or "www."
    ~xi',
    function ($match) {
        // Prepend a scheme when the match started with "www."
        $url = (empty($match['scheme']) ? 'http://' : '') . $match[0];
        // Validate the complete URL, since the filter requires a scheme
        if (filter_var($url, FILTER_VALIDATE_URL)) {
            return '<a href="away?to=' . urlencode($url) . '" target="_blank">'
                . htmlspecialchars($url) . '</a>';
        }
        return $match[0];
    },
    $message
);
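The validation step on its own can be sketched like this. validated_url() is a hypothetical helper name; the only library call is PHP's filter_var() with FILTER_VALIDATE_URL, which requires a scheme, so a leading "www." is given http:// first (that default is an assumption).

```php
<?php
// Hypothetical helper: returns a validated URL, or null for non-URLs.
function validated_url(string $candidate): ?string
{
    // FILTER_VALIDATE_URL rejects scheme-less input, so add one for "www." hosts.
    if (stripos($candidate, 'www.') === 0) {
        $candidate = 'http://' . $candidate;
    }

    return filter_var($candidate, FILTER_VALIDATE_URL) !== false ? $candidate : null;
}

var_dump(validated_url('www.example.com')); // string: http://www.example.com
var_dump(validated_url('www'));             // NULL
```

A bare "www" never reaches the link-building code this way, which covers the second case from the question.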
I know there are several similar questions, but I could not find any that covers this specific case.
I took some code and tweaked it to my needs, but I have now found a bug in it that I can't fix.
Code:
$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match);
static function get( $xml, $tag) { // http://stackoverflow.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
// bug case string(56) "<namespaces>
// <namespace key="-2">Media</namespace>"
$tag_ini = "<{$tag}[^\>]*?>"; $tag_end = "<\\/{$tag}>";
$tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';
preg_match_all($tag_regex,
$xml,
$matches,
PREG_OFFSET_CAPTURE);
return $matches;
}
As you can see, there is a bug if the tag is nested:
<namespaces> <namespace key="-2">Media</namespace>
It should return 'Media', or even the outer '<namespaces>' and then the inner ones.
I tried adding "<{$tag}[^\>|^\r\n ]*?>", ^\s+, changing the * to *?, and a few other things that at best recognized only the bugged case.
I also tried "<{$tag}[^{$tag}]*?>", which returns nothing; I suppose it cancels itself out.
I'm new to regex, but I can tell that the fix just needs to prevent a new tag of the same type from being opened.
Or I could even accept a hack for my use case that excludes matches whose inner text contains a line break.
Can anyone get the right syntax for this?
You can check an extract of the text here: http://pastebin.com/f2naN2S3
After the proposed change: $tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>"; it does work for the example case, but not for this one:
<namespace key="0" />
<namespace key="1">Talk</namespace>
As it results in:
<namespace key="1">Talk"
It's because numbers and " and letters are considered inside word boundary. How could I address that?
The main problem is that you did not use a word boundary after the tag name in the opening tag, so namespace in the pattern could also match a namespaces tag, and many others.
The subsequent issue is that the <${tag}\b[^>]*>(.*?)<\/${tag}> pattern would overfire if there is a self-closing namespace tag followed by a "normal" paired open/close namespace tag. So, you need to either use a negative lookbehind (?<!\/) before the >, or use a (?![^>]*\/>) negative lookahead after \b.
So, you can use
$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
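A minimal sketch of the word boundary plus lookbehind pattern in use. get_tag_contents() is a hypothetical helper (not the asker's Tags::get), and the sample XML is assembled from the fragments quoted in the question.

```php
<?php
// Hypothetical helper: returns the inner text of every paired <$tag>...</$tag>.
function get_tag_contents(string $xml, string $tag): array
{
    $name = preg_quote($tag, '/');
    // \b stops "namespace" from matching "<namespaces>";
    // (?<!\/) rejects self-closing tags such as <namespace key="0" />.
    $regex = '/<' . $name . '\b[^>]*(?<!\/)>(.*?)<\/' . $name . '>/si';
    preg_match_all($regex, $xml, $matches);

    return $matches[1];
}

$xml = "<namespaces>\n"
     . "<namespace key=\"-2\">Media</namespace>\n"
     . "<namespace key=\"0\" />\n"
     . "<namespace key=\"1\">Talk</namespace>\n"
     . "</namespaces>";

print_r(get_tag_contents($xml, 'namespace')); // Media, Talk
```

For anything beyond flat fragments like this, a real XML parser is likely to be more robust than a regex.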
This is probably not the ideal answer, but I was messing with a regex generator:
<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11
$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';
$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))'; # Word 1
if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
$word1=$matches[1][0];
print "($word1) \n";
}
#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>
This line is what I needed
$tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
Thank you very much, @Alison and @Wictor, for your help and directions.
I'm building a simple function to embed videos in Wordpress. I want to read the post content and replace [youku: xxAAAJFSK] with an iframe: <iframe src="http://player.youku.com/embed/xxAAAJFSK"></iframe>
I'm guessing I should use a regular expression to do the replacement but can't seem to find the correct one... I tried:
$pattern = '/youku\.com\/([^\/]*)/i';
if (preg_match($pattern, $content, $matches)){
$id_video = $matches[1];
return "<iframe src='http://player.youku.com/embed/" . $id_video . "></iframe>";
}
This just breaks my site though..
Extra points if you manage to let me set the width and height using something like [youku: xxAAAJFSK width:400 height:400]
Are you fixed to that syntax? If not, you'd be best looking at the Wordpress Shortcode API and following their style. That would take a lot of the hard work out of it for you as the system would handle the argument parsing. For example:
// [youku vid="xxAAAJFSK" width="400" height="400"]
function youku_func( $atts ) {
return "<iframe src='http://player.youku.com/embed/" . $atts['vid'] . "' width='" . $atts['width'] . "' height='" . $atts['height'] . "'></iframe>";
}
add_shortcode( 'youku', 'youku_func' );
You would probably want to expand this to include default values for width and height or remove them if they're not given as arguments.
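A sketch of that expansion, with defaults merged in plain PHP so it also runs outside WordPress. The function name youku_shortcode() and the default sizes are assumptions; inside WordPress, shortcode_atts() would be the usual way to merge defaults.

```php
<?php
// Sketch only: the default width and height below are arbitrary assumptions.
function youku_shortcode($atts): string
{
    // WordPress passes an empty string when the shortcode has no attributes.
    $atts = array_merge(
        ['vid' => '', 'width' => '480', 'height' => '400'],
        is_array($atts) ? $atts : []
    );

    if ($atts['vid'] === '') {
        return ''; // nothing to embed without a video id
    }

    return sprintf(
        '<iframe src="http://player.youku.com/embed/%s" width="%d" height="%d"></iframe>',
        htmlspecialchars($atts['vid'], ENT_QUOTES),
        (int) $atts['width'],
        (int) $atts['height']
    );
}

// Register only when running inside WordPress.
if (function_exists('add_shortcode')) {
    add_shortcode('youku', 'youku_shortcode');
}
```

Casting the sizes to integers and escaping the id keeps user-supplied attribute values from breaking out of the iframe markup.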
This is actually very easy to do ...
\[ : Match [
\s* : Match a whitespace 0 or more times
youku : Match youku
\s* : Match a whitespace 0 or more times
: : Match :
\s* : Match a whitespace 0 or more times
([^]]*) : Match anything except ] 0 or more times and group it
\] : Match ]
You may even use the i modifier for case-insensitive matching.
Regex: \[\s*youku\s*:\s*([^]]*)\]
Replace: <iframe src="http://player.youku.com/embed/$1"></iframe>
PHP code: $output = preg_replace('#\[\s*youku\s*:\s*([^]]*)\]#i', '<iframe src="http://player.youku.com/embed/$1"></iframe>', $input);
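For the optional width and height from the question, one possible extension is a callback that picks up the extra pairs. This is a sketch under assumptions: youku_embed() is a hypothetical name, and the video id is assumed to consist only of letters, digits, _, = and -.

```php
<?php
// Hypothetical helper: expands [youku: ID width:W height:H] into an iframe.
function youku_embed(string $input): string
{
    $pattern = '#\[\s*youku\s*:\s*([A-Za-z0-9_=-]+)((?:\s+(?:width|height)\s*:\s*\d+)*)\s*\]#i';

    return preg_replace_callback($pattern, function (array $m): string {
        $attrs = '';
        // Collect optional "width:400" / "height:400" pairs in the order given.
        if (preg_match_all('#(width|height)\s*:\s*(\d+)#i', $m[2], $pairs, PREG_SET_ORDER)) {
            foreach ($pairs as $pair) {
                $attrs .= ' ' . strtolower($pair[1]) . '="' . $pair[2] . '"';
            }
        }

        return '<iframe src="http://player.youku.com/embed/' . $m[1] . '"' . $attrs . '></iframe>';
    }, $input);
}

echo youku_embed('[youku: xxAAAJFSK width:400 height:400]');
```

Tags without any size still work and simply produce an iframe with no width or height attributes.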
Unless you're doing this for educational purposes, don't reinvent the wheel.
There are a lot of youku-enabled Wordpress plugins already.
Edit: If you want to roll your own, I'd suggest looking at one of the existing working plugins and tailoring their implementation to suit your needs.
I'm trying to write a code library for my own personal use and I'm trying to come up with a solution to linkify URLs and mail links. I was originally going to go with a regex statement to transform URLs and mail addresses to links but was worried about covering all the bases. So my current thinking is perhaps use some kind of tag system like this:
l:www.google.com becomes http://www.google.com and where m:john.doe@domain.com becomes john.doe@domain.com.
What do you think of this solution and can you assist with the expression? (REGEX is not my strong point). Any help would be appreciated.
Maybe a regex like this:
$content = "l:www.google.com some text m:john.doe@domain.com some text";
$pattern = '/([a-z])\:([^\s]+)/'; // One character followed by ':' and then everything up to the next whitespace
if (preg_match_all($pattern, $content, $results))
{
    foreach ($results[0] as $key => $result)
    {
        // $result is the whole matched expression, like 'l:www.google.com'
        $letter = $results[1][$key];
        $value = $results[2][$key]; // renamed so it does not overwrite $content
        echo $letter . ' ' . $value . '<br/>';
        // You can put str_replace here
    }
}
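The replacement step itself could be sketched with preg_replace_callback. expand_link_tags() is a hypothetical name, the generated <a> markup is an assumption about the desired output, and the m: tag is assumed to carry an ordinary e-mail address.

```php
<?php
// Hypothetical helper: turns l:host and m:address tags into HTML links.
function expand_link_tags(string $text): string
{
    // \b keeps words that merely end in "l:" or "m:" from being treated as tags.
    return preg_replace_callback('/\b([lm]):(\S+)/', function (array $m): string {
        $target = htmlspecialchars($m[2], ENT_QUOTES);
        if ($m[1] === 'l') {
            return '<a href="http://' . $target . '">' . $target . '</a>';
        }

        return '<a href="mailto:' . $target . '">' . $target . '</a>';
    }, $text);
}

echo expand_link_tags('l:www.google.com some text m:john.doe@domain.com');
```

Explicit tags like these avoid guessing where a URL ends, at the cost of asking authors to type the prefix.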
We're converting to Markdown. Previously we used an in-house system where the image link goes in curly braces and any data that belongs to it (e.g. alt text) follows in square brackets.
For example {IMAGE LINK}[OPTIONAL ALT WITH OTHER DATA]
Now that we are moving to Markdown (our data is stored as Markdown in the database), I need to convert everything:
So how can I turn all instances of {LINK}[OPTIONAL DATA] (the square brackets are not required, so some are just {LINK}) into the Markdown equivalent?
Basically,
{http://www.youtube.com/image.gif}[this
is optional alt] INTO
![alt](http://www.youtube.com/Image.gif)
I have the following so far, but how do I deal with the optional [ALT DATA] tag?
if (preg_match_all('/\[(.*?)\]/i', $string, $matches, PREG_SET_ORDER))
{
}
To deal with the optional alt attribute you should use preg_replace_callback. This allows you to test for the existence of the alt attr and add it if necessary.
$str = '
This is an image {http://www.youtube.com/image.gif}[this is optional alt]
This is an image with an alt attribute {http://www.youtube.com/image.gif}
';
echo preg_replace_callback(
    '~{(http://[^\s}]+)}(?:\[(.*?)\])?~',
    function ($m) {
        // $m[2] is only set when the optional [alt] part matched
        $alt = isset($m[2]) ? $m[2] : '';
        return sprintf('![%s](%s)', $alt, $m[1]);
    },
    $str
);
The simple case would be
{(.*?)}\[(.*?)\] <-- search pattern
![\2](\1) <-- replace pattern
but this will break on links that contain the escaped characters (\{, \}, \[, \]). Handling those would involve a lookahead that you'll have to hope someone else writes up for you. However, if this is just image URLs, you shouldn't have too many (if any) instances of that happening.
I would use preg_replace_callback for that purpose. There it's easier to probe for the optional alt tag and/or construct a replacement.
$source = preg_replace_callback('#
        \{ (http://[^}\s]+) \}
        (?:
            \[ ([^\]{}\n]+) \]
        )?
    #x',
    "cb_img_markdown",
    $source);
function cb_img_markdown($m) {
    // $m[2] is absent when the optional [alt] part did not match
    $link = $m[1];
    $alt = isset($m[2]) ? $m[2] : "";
    if (!strlen($alt)) {
        $alt = "image " . basename($link);
    }
    return "![$alt]($link)";
}
You could also make the link match stricter to avoid false positives. Here I just made it depend on http:// being present, but you could append e.g. (?:png|jpe?g|gif) to ensure it only matches image urls.
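A sketch of that stricter variant, which only converts links ending in a known image extension. braces_to_markdown() is a hypothetical name, and the extension list and the fallback alt text are assumptions carried over from the snippet above.

```php
<?php
// Hypothetical helper: converts {image-url}[optional alt] into Markdown images.
function braces_to_markdown(string $source): string
{
    return preg_replace_callback(
        '#\{(https?://[^}\s]+\.(?:png|jpe?g|gif))\}(?:\[([^\]{}\n]+)\])?#i',
        function (array $m): string {
            // Fall back to a generated alt text when no [alt] part was given.
            $alt = $m[2] ?? ('image ' . basename($m[1]));

            return '![' . $alt . '](' . $m[1] . ')';
        },
        $source
    );
}

echo braces_to_markdown('{http://www.youtube.com/image.gif}[this is optional alt]');
```

Anything in curly braces that is not an http(s) image URL is left untouched, which reduces the risk of rewriting unrelated text during the migration.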
Parsing tags by hand in PHP is tedious.
I would suggest using the PHP Simple HTML DOM Parser:
it makes parsing any kind of tag very easy, and you can also easily filter by attributes.
I'm working on a project where I need to replace text URLs (anything from domain.com to www.domain.com to http(s)://www.domain.com) and e-mail addresses with the proper HTML <a> tag. I was using a great solution in the past, but it used the now deprecated eregi_replace function. On top of that, the regular expression used with that function does not work with preg_replace.
So basically, the user inputs a message which may or may not contain a link or e-mail address, and I need a regular expression that works with preg_replace to replace that link or e-mail address with an HTML link such as <a href="link">link</a>.
Please note that I have multiple other preg_replaces too. Below is my current code for the other replacements being made.
$patterns = array('~\[#([^\]]*)\]~','~\[([^\]]*)\]~','~{([^}]*)}~','~_([^_]*)_~','/\s{2}/');
$replacements = array('<b class="reply">#\\1</b>','<b>\\1</b>','<i>\\1</i>','<u>\\1</u>','<br />');
$msg = preg_replace($patterns, $replacements, $msg);
return stripslashes(utf8_encode($msg));
I have created a very basic set of Regular Expressions for this. Don't expect them to be 100% reliable, and you may need to tweak them as you go.
$pattern = array(
'/((?:[\w\d]+\:\/\/)?(?:[\w\-\d]+\.)+[\w\-\d]+(?:\/[\w\-\d]+)*(?:\/|\.[\w\-\d]+)?(?:\?[\w\-\d]+\=[\w\-\d]+\&?)?(?:\#[\w\-\d]*)?)/' , # URL
'/([\w\-\d]+@[\w\-\d]+\.[\w\-\d]+)/' , # Email
'/\[#([^\]]*)\]/' , # Reply
'/\[([^\]]*)\]/' , # Bold
'/\{([^}]*)\}/' , # Italics
'/_([^_]*)_/' , # Underline
'/\s{2}/' , # Linebreak
);
$replace = array(
'<a href="$1">$1</a>' ,
'<a href="mailto:$1">$1</a>' ,
'<b class="reply">#$1</b>' ,
'<b>$1</b>' ,
'<i>$1</i>' ,
'<u>$1</u>' ,
'<br />'
);
$msg = preg_replace( $pattern , $replace , $msg );
return stripslashes( utf8_encode( $msg ) );