EDIT: I'm not parsing html like the 5 billion other questions that have been posted. This is raw unformatted text that I want to convert into some HTML.
I'm working on post-processing. I need to convert URLs with image endings (jpe?g|png|gif) into image tags, and all other URLs into href links. My image replacement is correct, but I'm stuck keeping the link replacement from overwriting it.
I need help with the expression: how do I get it to look for URLs that have not already been wrapped in tags by the image replacement, or for URLs that do not end in .jpe?g, .png or .gif?
public function smartConvertPost($post) {
/**
* Match image based urls
*/
$pattern = '!http://([a-z0-9\-\.\/\_]+\.(?:jpe?g|png|gif))!Ui';
$replace='<p><img src="http://$1"></p>';
$postImages = preg_replace($pattern,$replace,$post);
/**
* Match url based
*/
$pattern='/http://([a-z0-9\-\.\/\_]+(?:\S|$))/i';
$replace='<a href="http://$1">http://$1</a>';
$postUrl = preg_replace($pattern,$replace, $postImages);
return $postUrl;
}
Please note I am not talking about matching tags or HTML. I am matching a plain string like the one below and converting it to HTML.
If this was an example post with a Url to a page like http://www.some-website.com/some-page/anything.html and I also put a url to an image http://www.some-website.com/someimage.jpg you would need to regex the two to be a hyperlink and an image.
Thanks,
Brad Christie's preg_replace_callback() recommendation is a good one. Here is one possible implementation:
function smartConvertPost($post)
{ // Disclaimer: This "URL plucking" regex is far from ideal.
$pattern = '!http://[a-z0-9\-._~\!$&\'()*+,;=:/?#[\]@%]+!i';
$replace='_handle_URL_callback';
return preg_replace_callback($pattern,$replace, $post);
}
function _handle_URL_callback($matches)
{ // preg_replace_callback() is passed one parameter: $matches.
if (preg_match('/\.(?:jpe?g|png|gif)(?:$|[?#])/i', $matches[0]))
{ // This is an image if path ends in .GIF, .PNG, .JPG or .JPEG.
return '<p><img src="'. $matches[0] .'"></p>';
} // Otherwise handle as NOT an image.
return '<a href="'. $matches[0] .'">'. $matches[0] .'</a>';
}
Note that the regex used to pluck out a URL is not ideal. To do it right is tricky. See the following resources:
The Problem With URLs by Jeff Atwood.
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber.
URL Linkification (HTTP/FTP). by yours truly.
Edit: Added ability to recognize image URLs having a query or fragment.
Since it's the 215247th post on that kind of topic, let's say it again: HTML is too complicated to parse with regex. Use a parser.
See this. Regular expression for parsing links from a webpage?
PS: no offense =).
Edit:
I personally often use Symfony, and there's a really great parser for what you need: http://fabien.potencier.org/article/42/parsing-xml-documents-with-css-selectors
You can get all images using a simple CSS expression on your HTML. Give it a try.
What about using a marker?
public function smartConvertPost($post) {
    $MY_MARKER = "<MYMARKER>"; // Define the marker here
    $quotedMarker = preg_quote($MY_MARKER, '~'); // Escape it for use inside a pattern
    /**
    * Match image based urls
    */
    $pattern = '!http://([a-z0-9\-\.\/\_]+\.(?:jpe?g|png|gif))!Ui';
    // Concatenate the marker: variables are not expanded inside single quotes
    $replace = '<p><img src="' . $MY_MARKER . 'http://$1' . $MY_MARKER . '"></p>'; // Use it here...
    $postImages = preg_replace($pattern, $replace, $post);
    /**
    * Match url based
    */
    // "~" as the delimiter avoids clashing with the slashes in http:// and the "!" in the lookarounds
    $pattern = '~(?<!' . $quotedMarker . ')http://([a-z0-9\-\.\/\_]+(?:\S|$))(?!' . $quotedMarker . ')~i'; //...here
    $replace = '<a href="http://$1">http://$1</a>';
    $postUrl = preg_replace($pattern, $replace, $postImages);
    /**
    * Remove all markers
    */
    $postUrl = str_replace($MY_MARKER, '', $postUrl);
    return $postUrl;
}
Try to choose a marker that has no chance of appearing in the post.
HTH
Related
I have been programming in C and C#, using a third-party regular expression library to identify link patterns. Yesterday I was asked to use PHP instead. I am not familiar with PHP regular expressions; I tried, but didn't get the result I expected. I have to extract and replace the link in an image src of the form:
<img src="/a/b/c/d/binary/capture.php?id=main:slave:demo.jpg"/>
I only want the path in the src, but the quotation marks could be double or single, and the id can vary from case to case (here it is main:slave:demo.jpg).
I tried the following code:
$searchfor = '/src="(.*?)binary\/capture.php?id=(.+?)"/';
$matches = array();
while ( preg_match($searchfor, $stringtoreplace, $matches) == 1 ) {
// if matches are found, replace the source text and search again
$stringtoreplace= str_replace($matches, 'whatever', $stringtoreplace);
}
But it doesn't work. Is there anything I missed, or any mistake in the code above?
More specifically, let's say I have an image tag whose src is
<img src="ANY_THING/binary/capture.php?id=main:slave:demo.jpg"/>
Here ANY_THING could be anything, and "/binary/capture.php?id=" is fixed in all cases. The string after "id=" follows the pattern "main:slave:demo.jpg": the parts before each colon change from case to case, and the name of the JPEG varies too. I would expect it to be replaced with
<img src="/main/slave/demo.jpg"/>
Since I only have the right to modify the PHP script at specific, limited times, I want to debug my code before making any modification. Thanks.
First of all, as you may know, regex shouldn't be used to manipulate HTML.
However, try:
$stringtoreplace = '<img src="/a/b/c/d/binary/capture.php?id=main:slave:demo.jpg"/>';
$new_str = preg_replace_callback(
// The regex to match
'/<img(.*?)src="([^"]+)"(.*?)>/i',
function($matches) { // callback
parse_str(parse_url($matches[2], PHP_URL_QUERY), $queries); // convert query strings to array
$matches[2] = '/'.str_replace(':', '/', $queries['id']); // replace the url
return '<img'.$matches[1].'src="'.$matches[2].'"'.$matches[3].'>'; // return the replacement
},
$stringtoreplace // str to replace
);
var_dump($new_str);
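The question also asked for single-quoted src attributes, which the pattern above does not cover. The following is only a sketch of one way to accept both quote styles with a backreference. The name rewriteCaptureSrc is hypothetical, and the code assumes that id is the only query parameter in the attribute value.

```php
<?php
// Hypothetical helper (not from the thread): rewrites capture.php image
// sources, accepting either single or double quotes around the value.
function rewriteCaptureSrc(string $html): string
{
    // Group 1 = opening quote, group 2 = id value.
    // \1 requires the closing quote to be the same character as the opening one.
    $pattern = '/src=(["\'])[^"\']*?binary\/capture\.php\?id=([^"\'&]+)\1/i';

    return preg_replace_callback($pattern, function (array $m): string {
        // main:slave:demo.jpg becomes /main/slave/demo.jpg
        return 'src=' . $m[1] . '/' . str_replace(':', '/', $m[2]) . $m[1];
    }, $html);
}

echo rewriteCaptureSrc('<img src="/a/b/c/d/binary/capture.php?id=main:slave:demo.jpg"/>'), "\n";
// prints: <img src="/main/slave/demo.jpg"/>
echo rewriteCaptureSrc("<img src='x/binary/capture.php?id=a:b:pic.jpg'/>"), "\n";
// prints: <img src='/a/b/pic.jpg'/>
```

A src that does not contain binary/capture.php?id= is left untouched, because the pattern simply does not match it.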
I'm trying to scan text for links to some video-sharing sites so I can create an embedded player when videos are linked to.
This is what I've got so far:
function extract(&$text) {
// Scans text for links to YouTube, Vimeo, DailyMotion.
// *keep ~discard
// youtube.com/watch?v=[*alphanumeric]&[~whatever]
// youtube-nocookie.com/watch?v=[*alphanumeric]&[~whatever]
// vimeo.com/[*numeric]
// dailymotion.com/video/[*alphanumeric]_[~whatever]
$sites = 'youtube\.com|youtube-nocookie\.com|vimeo\.com|dailymotion\.com';
$regex = '/^(http|https):\/\/(www\.|)(' . $sites . ')\/.*/';
preg_match_all($regex, $text, $videos);
return $videos;
}
This is working oddly. It found no results on the following text:
And what about YouTube videos?
http://www.youtube.com/timminchin#p/a/u/2/zkGEbRrNNtE
http://www.youtube.com/timminchin#p/a/f/1/zU4iyjoVWQ
http://www.youtube.com/watch?v=XzU4iyjoVWQ
http://www.youtube-nocookie.com
It found one result on this text:
http://youtube.com/watch?v=XzU4iyjoVWQ
https://www.youtube.com/watch?v=XzU4iyjoVWQ
And works fine on texts which contain just a single link and nothing else.
I'm not nearly as au fait with regular expressions as I should be, and used http://www.strfriend.com to help me to construct this one.
All I want is an array of URLs.
Change your regular expression to the following:
/(http|https):\/\/(www\.|)(' . $sites . ')\/[^\s]*/
Differences:
The ^ at the beginning made the regular expression look only at the beginning of the text, instead of everywhere, so it has been removed.
The [^\s]* at the end (instead of .*) stops each match at the first whitespace, which makes sure you can find two links in one line of text.
The last URL won't be found because there is no trailing slash in the end of the URL. If you're trying to detect videos, this won't matter, though, because the video is always on a subpage.
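Put together, the corrected pattern could be used roughly as follows. This is a sketch rather than the answerer's exact code: extractVideoUrls is a hypothetical name (extract itself is a PHP built-in and cannot be redeclared as a plain function), and the groups are made non-capturing so that only the full matches are returned.

```php
<?php
// Hypothetical helper: returns every video-site URL found anywhere in $text.
function extractVideoUrls(string $text): array
{
    $sites = 'youtube\.com|youtube-nocookie\.com|vimeo\.com|dailymotion\.com';
    // No leading ^, and [^\s]* instead of .* so each match ends at whitespace.
    $regex = '/https?:\/\/(?:www\.)?(?:' . $sites . ')\/[^\s]*/i';
    preg_match_all($regex, $text, $videos);
    return $videos[0]; // index 0 holds the full matches
}

$text = "And what about YouTube videos?\n"
      . "http://www.youtube.com/watch?v=XzU4iyjoVWQ\n"
      . "http://www.youtube-nocookie.com\n"
      . "see https://vimeo.com/12345 too";

print_r(extractVideoUrls($text));
// Finds the youtube.com/watch link and the vimeo.com link; the bare
// youtube-nocookie.com host has no slash after it, so it is skipped.
```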
I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URL's but not for others.
Thanks
John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Used with preg_replace() as mentioned in the other answers, the following regex should be one of the most accurate methods, if not the most accurate, for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0">$0</a>', $string);
It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0">$0</a>', $string);
There are a lot of edge cases with URLs: a URL could contain brackets, or lack a protocol, and so on. That's why a regex alone is not enough.
I created a PHP library that can deal with lots of edge cases: Url highlight.
You can extract URLs from a string or highlight them directly.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'
For more details see readme. For covered url cases see test.
Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php. will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
It looks like Stack Overflow's parser is better. Is it open source?
This code worked for me.
function makeLink($string){
/*** make sure there is an http:// on all URLs ***/
$string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2",$string);
/*** make all URLs links ***/
$string = preg_replace("/([\w]+:\/\/[\w\-?&;#~=\.\/\@]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</a>",$string);
/*** make all emails hot links ***/
$string = preg_replace("/([\w\-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","<a href=\"mailto:$1\">$1</a>",$string);
return $string;
}
When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
<a href="http://www.google.com">http://www.google.com</a>
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.
Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didn't, the end of the last sentence I typed and the beginning of this sentence would be a link to http://search.if. Even if you did, .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.
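A minimal sketch of that "simple regex" might look like the following. The function name linkifyPrefixed is hypothetical, and to keep it short the sketch assumes the input contains no HTML special characters, so no escaping is done.

```php
<?php
// Hypothetical helper: links only text that starts with http://, https:// or www.
function linkifyPrefixed(string $text): string
{
    $pattern = '~\b(?:https?://|www\.)[^\s<>"]+~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        // Prepend a scheme for the href when the text started with "www."
        $href = stripos($url, 'http') === 0 ? $url : 'http://' . $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $text);
}

echo linkifyPrefixed('Try www.google.com but not search.if you skip the prefix'), "\n";
// prints: Try <a href="http://www.google.com">www.google.com</a> but not search.if you skip the prefix
```

Bare names such as google.com or search.if are deliberately left alone, which is the trade-off this answer argues for.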
www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you end up with horrible hacks like looking for leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I link to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems and you'd probably need an HTML parser to determine which bits are markup, to avoid adding more markup inside them. Similarly, if you're doing more processing after this, inserting more tags, those steps may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
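The two points above (escaping and trailing punctuation) can be sketched together as follows. This is only one possible arrangement, not the answerer's code: linkifyPlainText is a hypothetical name, and instead of escaping the whole string first it splits the text and escapes each piece as it is emitted, so that entities such as &quot; cannot be swallowed into a URL. The bracket rule is the rough heuristic described above.

```php
<?php
// Hypothetical helper: linkify plain text, escaping every piece on output.
function linkifyPlainText(string $text): string
{
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    };
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE, odd indexes hold the URLs.
    $parts = preg_split('~(\bhttps?://[^\s<>"]+)~i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $out .= $esc($part); // ordinary text between URLs
            continue;
        }
        $url = $part;
        $trail = '';
        // Peel trailing punctuation off one character at a time.
        while ($url !== '' && strpos('.,!?;:)', substr($url, -1)) !== false) {
            // Keep a ")" that closes a "(" inside the link (wiki-style URLs).
            if (substr($url, -1) === ')' && substr_count($url, '(') >= substr_count($url, ')')) {
                break;
            }
            $trail = substr($url, -1) . $trail;
            $url = substr($url, 0, -1);
        }
        $out .= '<a href="' . $esc($url) . '">' . $esc($url) . '</a>' . $esc($trail);
    }
    return $out;
}

echo linkifyPlainText('Go to http://example.com/a?x=1&y=2. <b>'), "\n";
// prints: Go to <a href="http://example.com/a?x=1&amp;y=2">http://example.com/a?x=1&amp;y=2</a>. &lt;b&gt;
```

The full stop ends up outside the link, the ampersand is escaped inside it, and the stray <b> from the input is rendered harmless.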
HTML Purifier has a built-in linkify function to save you all the headaches.
Its other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
Not so fancy regexps that should work
/\b(https?:\/\/[^\s+\"\<\>]+)/i
/\b(www\.[^\s+\"\<\>]+)/i
Note that the last two would be impossible to do correctly as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';
From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
// split the text into words
$words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$text = "";
// iterate through the words
foreach($words as $word) {
// chopword = the portion of the word that will be replaced
$chopword = $word;
$chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
if ($chopword <> '') {
// linkword = the text that will replace chopword in the word
$linkword='';
// does it start with http://abc. ?
if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = '<a href="'.$chopword.'">'.$chopword.'</a>';
// does it equal abc.def.ghi ?
} else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = '<a href="http://'.$chopword.'">'.$chopword.'</a>';
// does it start with abc@def.ghi ?
} else if (preg_match('/^[a-zA-Z0-9_\.]+\@([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
$linkword = '<a href="mailto:'.$chopword.'">'.$chopword.'</a>';
}
// replace chopword with linkword in word (if linkword was set)
if ($linkword <> '') {
$word = str_replace($chopword, $linkword, $word);
}
}
// append the word
$text = $text.$word;
}
return $text;
}
I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess
/up/text/wrapping.html
EOF;
function trimlong($match)
{
$url = $match[0];
$display = $url;
if ( strlen($display) > 30 ) {
$display = substr($display,0,30)."...";
}
return '<a href="'.$url.'">'.$display.'</a> <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
// trimlong() is a plain function here, so pass its name rather than array($this, ...)
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
'trimlong',$input);
echo $output;
I reposted this question because I didn't find a good answer.
I have a string which can contain text with URLs.
I want a function that strips all URLs from this string and leaves just the text.
For example, the string can look like this:
1) hey take a look here : http://xxx.xxx/545df5 this is nice!
2) hey take a look here : http://www.xxx.xxx/545df5 this is nice!
3) hey take a look here : xxx.xxx/545df5 this is nice!
4) hey take a look here : www.xxx.xxx/545df5 this is nice!
Thanks
Regular expression for URL and how to use regular expression with php should help you.
What you really need is a solid regex to find URLs in a string; you can then preg_replace that pattern with nothing. I can tell you, though, that tracking down a regex like that is not easy. Depending on the variations in the URLs you're looking for (e.g. http:// vs https:// vs ftp://), you could run into real trouble trying to account for all of them.
Here is a page that I found to be a good start though.
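As a rough illustration of the "preg_replace with nothing" idea, the sketch below covers the four forms listed in the question. The name stripUrls and the pattern are assumptions made for this example, not a vetted URL regex: the bare host form is only recognised when a slash follows it, and anything like 1.5/2 would be removed as well.

```php
<?php
// Hypothetical helper: removes URLs and tidies the whitespace left behind.
function stripUrls(string $text): string
{
    // Three alternatives: scheme://..., www...., or bare host.tld/path.
    // The bare form needs a slash so ordinary "word.word" text is left alone.
    $pattern = '~\b(?:(?:https?|ftp)://\S+|www\.\S+|[a-z0-9-]+(?:\.[a-z0-9-]+)+/\S*)~i';
    $stripped = preg_replace($pattern, '', $text);

    // Collapse the double spaces left where a URL used to be.
    return trim(preg_replace('~[ \t]{2,}~', ' ', $stripped));
}

echo stripUrls('hey take a look here : http://www.xxx.xxx/545df5 this is nice!'), "\n";
// prints: hey take a look here : this is nice!
echo stripUrls('hey take a look here : xxx.xxx/545df5 this is nice!'), "\n";
// prints: hey take a look here : this is nice!
```

Because \S+ runs to the next whitespace, punctuation typed directly after a URL is removed along with it.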
Regex is the way to go, as was discussed earlier. Finding one isn't that terribly hard (google: url regex pattern). One example returned is here:
http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
I would also recommend you test your regex using one of the many fine online regex testers. My favorite (for non-java) is
http://www.regextester.com/
This function should do it (assuming the words in your string are separated by a space " "):
function isValidURL($url) {
return preg_match('|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url);
}
function cleanUpUrls($urls) {
$urlArray = explode(' ',$urls);
$resultArray = array();
foreach ($urlArray as $url) {
if(!isValidURL($url)) {
$resultArray[] = $url;
}
}
return implode(' ',$resultArray);
}