I've the following script which hyperlinks any links posted on my site:
$text = trim($text);
while ($text != stripslashes($text)) { $text = stripslashes($text); }
$text = strip_tags($text,"<b><i><u>");
$text = preg_replace("/(?<!http:\/\/)www\./","http://www.",$text);
$text = preg_replace( "/((http|ftp)+(s)?:\/\/[^<>\s]+)/i", "\\0",$text);
However, for some reason if I add a https://www.test.com link it ends up displaying like this - https://http://www.test.com - what am I doing wrong? How can I make it work with https links as well? It works fine with http links. Thank you! :-)
The lookbehind that you have here, (?<!http:\/\/)www\. is only matching http://, but your test input (that's failing) is https://.
You can add a second lookbehind chained with the current one to specify the alternative https:// version too:
(?<!http:\/\/)(?<!https:\/\/)www\.
This would make your full line look like:
$text = preg_replace("/(?<!http:\/\/)(?<!https:\/\/)www\./","http://www.",$text);
The last I checked, PHP does not support variable-length lookbehinds, so things that may be familiar such as http[s]?:// wouldn't work here - hence the second pattern.
Related
I want to make like a proxy page (not for proxy at all) and as i knew i need to change all URLS SRC LINK and so on to others - for styles and images grab from right play, and urls goto throught my page going to $_GET["url"] and then to give me next page.
But iv tied to preg_replace() each element, also im not so good with it, and if on one website it works, on another i cant see CSS for example...
The first question is there are any PHP classes or just scripts to make it easy? (I was trying to google hours)
And if not help me with the following code :
<?php
$url = $_GET["url"];
$text = file_get_contents($url);
$data = parse_url($url);
$url=$data['scheme'].'://'.$data['host'];
$text = preg_replace('|<iframe [^>]*[^>]*|', '', $text);
$text = preg_replace('/<a(.*?)href="([^"]*)"(.*?)>/','<a $1 href="http://my.site/?url='.$url.'$2" $3>',$text);
$text = preg_replace('/<link(.*?)href="(?!http:\/\/)([^"]+)"(.*?)/', "<link $1 href=\"".$url."/\\2\"$3", $text);
$text = preg_replace('/src="(?!http:\/\/)([^"]+)"/', "src=\"".$url."/\\1\"", $text);
$text = preg_replace('/background:url\(([^"]*)\)/',"background:url(".$url."$1)", $text);
echo $text;
?>
Replacing with "src" №4 i need to denied replace when starts from double slash, because it could starts like 'src="//somethingdomain"' and not need to replace them.
Also i need to ignore replace №2 when href is going to the same domain, or it looks like need.site/news.need.site/324244
And is it possible to pass action in form throught my script? For example google search query.
And one small problem one web site is openning corrent some times before, but after iv open it hundreds times by this script in getting unknown symbols (without any divs body etc...) ��S�n�#�� i was trying to encode to UTF-8 ANSI but symbol just changing,
maybe they ban me ? oO
function link_replace($url,$myurl) {
$content = file_get_contents($url);
$content = preg_replace('#href="(http)(.*?)"#is', 'href="'.$myurl.'?url=$1$2"', $content);
$content = preg_replace('#href="([^http])(.*?)"#is', 'href="'.$myurl.'?url='.$url.'$1$2"', $content);
return $content;
}
echo link_replace($url,$myurl);
I'm not absolutely sure but I guess the result is just compressed e.g. with gzip try removing the accepted encoding headers while proxying the request.
Lets say that $content is the content of a textarea
/*Convert the http/https to link */
$content = preg_replace('!((https://|http://)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="$1">$1</a> ', nl2br($_POST['helpcontent'])." ");
/*Convert the www. to link prepending http://*/
$content = preg_replace('!((www\.)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="http://$1">$1</a> ', $content." ");
This was working ok for links, but realised that it was breaking the markup when an image is within the text...
I am trying like this now:
$content = preg_replace('!\s((https?://|http://)+[a-z0-9_./?=&-]+)!i', ' $1 ', nl2br($_POST['content'])." ");
$content = preg_replace('!((www\.)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="http://$1">$1</a> ', $content." ");
As is the images are respected, but the problem is that url's with http:// or https:// format won't be converted now..:
google.com -> Not converted (as expected)
www.google.com -> Well Converted
http://google.com -> Not converted (unexpected)
https://google.com -> Not converted (unexpected)
What am I missing?
-EDIT-
Current almost working solution:
$content = preg_replace('!(\s|^)((https?://)+[a-z0-9_./?=&-]+)!i', ' $2 ', nl2br($_POST['content'])." ");
$content = preg_replace('!(\s|^)((www\.)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="http://$2" target="_blank">$2</a> ', $content." ");
The thing here is that if this is the input:
www.funcook.com http://www.funcook.com https://www.funcook.com
funcook.com http://funcook.com https://funcook.com
All the urls I want (all, except name.domain) are converted as expected, but this is the output
www.funcook.com http://www.funcook.com https://www.funcook.com ;
funcook.com http://funcook.com https://funcook.com
Note an ; is inserted, any idea why?
try this:
preg_replace('!(\s|^)((https?://|www\.)+[a-z0-9_./?=&-]+)!i', ' $2 ',$text);
It will pick up links beginning with http:// or with www.
Example
You can't at 100%. Becuase there may be links such as stackoverflow.com which do not have www..
If you're only targeting those links:
!(www\.\S+)!i
Should work well enough for you.
EDIT: As for your newest question, as to why http links don't get converted but https do, Your first pattern only searches for https://, or http://. which isn't the case. Simplify it by replacing:
(https://|http://\.)
With
(https?://)
Which will make the s optional.
Another method to go about adding hyperlinks is that you could take the text that you want to parse for links, and explode it into an array. Then loop through it using foreach (very fast function - http://www.phpbench.com/) and change anything that starts with http://, or https://, or www., or ends with .com/.org/etc into a link.
I'm thinking maybe something like this:
$userTextArray = explode(" ",$userText);
foreach( $userTextArray as &$word){
//if statements to test if if it starts with www. or ends with .com or whatever else
//change $word so that it is a link
}
Your changes will be reflected in the array since you had the "&" before $userText in your foreach statement.
Now just implode the array back into a string and you're good to go.
This made sense in my head... But I'm not 100% sure that this is what you're looking for
I had similar problem. Here is function which helped me. Maybe it will fit your needs to:
function clHost($Address) {
$parseUrl = parse_url(trim($Address));
return str_replace ("www.","",trim(trim($parseUrl[host] ? $parseUrl[host].$parseUrl[path] : $parseUrl[path]),'/'));
}
This function will return domain without protocol and "www", so you can add them yourself later.
For example:
$url = "http://www.". clHost($link);
I did it like that, because I couldn't find good regexp.
\s((https?://|www.)+[a-z0-9_./?=&-]+)
The problem is that your starting \s is forcing the match to start with a space, so, if you don't have that starting space your match fails. The reg exp is fine (without the \s), but to avoid replacing the images you need to add something to avoid matching them.
If the images are pure html use this:
(?<!src=")((https?://|www.)+[a-z0-9_./?=&-]+)
That will look for src=" before the url, to ignore it.
If you use another mark up, tell me and I'll try to find another way to avoid the images.
following code is used to find url from a string with php. Here is the code:
$string = "Hello http://www.bytes.com world www.yahoo.com";
preg_match('/(http:\/\/[^\s]+)/', $string, $text);
$hypertext = "" . $text[0] . "";
$newString = preg_replace('/(http:\/\/[^\s]+)/', $hypertext, $string);
echo $newString;
Well, it shows a link but if i provide few link it doesn't work and also if i write without http:// then it doesn't show link. I want whatever link is provided it should be active, Like stackoverflow.com.
Any help please..
A working method for linking with http/https/ftp/ftps/scp/scps:
$newStr = preg_replace('!(http|ftp|scp)(s)?:\/\/[a-zA-Z0-9.?&_/]+!', "\\0",$str);
I strongly advise NOT linking when it only has a dot, because it will consider PHP 5.2, ASP.NET, etc. links, which is hardly acceptable.
Update: if you want www. strings as well, take a look at this.
If you want to detect something like stackoverflow.com, then you're going to have to check for all possible TLDs to rule out something like Web 2.0, which is quite a long list. Still, this is also going to match something as ASP.NET etc.
The regex would looks something like this:
$hypertext = preg_replace(
'{\b(?:http://)?(www\.)?([^\s]+)(\.com|\.org|\.net)\b}mi',
'$1$2$3',
$text
);
This only matches domains ending in .com, .org and .net... as previously stated, you would have to extend this list to match all TLDs
#axiomer your example wasn't work if link will be in format:
https://stackoverflow.com?val1=bla&val2blablabla%20bla%20bla.bl
correct solution:
preg_replace('!(http|ftp|scp)(s)?:\/\/[a-zA-Z0-9.?%=&_/]+!', "\\0", $content);
produces:
https://stackoverflow.com?val1=bla&val2blablabla%20bla%20bla.bl
function makeLinks($text) {
$text = preg_replace('%(?<!href=")(((f|ht){1}(tp://|tps://))[-a-zA-^Z0-9#:\%_\+.~#?&//=]+)%i',
'\\1', $text);
$text = preg_replace('%([:space:]()[{}])(www.[-a-zA-Z0-9#:\%_\+.~#?&//=]+)%i',
'\\1\\2', $text);
return $text;
}
It misses if I have something like this: - www.website.org (a hyphen then a space) at the beginning of a line. If I have - www.website.org - www.website.org it catches the second one.
Shouldn't that be covered by the space in the second preg_replace?
I also tried %(\s\n\r(){})
I am running it through markdown, but not till after (markdown(makeLinks($foo))) so I thought that shouldn't interfere, but when I take the markdown off and everything just echos out in one line, it does make links out of them. If i put makeLinks(markdown($foo)) it behaves the same as initially.. not making links out of the ones that begin with www at the beginning of list items.
Thats some pretty dodgy regex work there. Here is a regex I would recommend instead for URL dectection:
%(?<!href="?)(((f|ht)(tp://|tps://))?[a-zA-Z0-9-].[-a-zA-^Z0-9#:\%_\+.~#?&//=]+)%i
Should be a lot more reliable than the two you have now.
When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
http://www.google.com
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.
Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didnt the end of the last sentence i typed and the beginning of this sentence would be a link to http://search.if. Even if you did .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.
www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you up with horrible hacks like looking for leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I like to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems and you'd probably need an HTML parser to determine which bits are markup to avoid adding more markup inside of. Similarly, if you're doing more processing after this, inserting more tags, those steps are may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
HTML Purifier has a built-in linkify function to save you all the headaches.
It's other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
Not so fancy regexps that should work
/\b(https?:\/\/[^\s+\"\<\>]+)/ig
/\b(www.[^\s+\"\<\>]+)/ig
Note that the last two would be impossible to do correctly as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';
From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
// split the text into words
$words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$text = "";
// iterate through the words
foreach($words as $word) {
// chopword = the portion of the word that will be replaced
$chopword = $word;
$chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
if ($chopword <> '') {
// linkword = the text that will replace chopword in the word
$linkword='';
// does it start with http://abc. ?
if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = ''.$chopword.'';
// does it equal abc.def.ghi ?
} else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = ''.$chopword.'';
// does it start with abc#def.ghi ?
} else if (preg_match('/^[a-zA-Z0-9_\.]+\#([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
$linkword = ''.$chopword.'';
}
// replace chopword with linkword in word (if linkword was set)
if ($linkword <> '') {
$word = str_replace($chopword, $linkword, $word);
}
}
// append the word
$text = $text.$word;
}
return $text;
}
I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess
/up/text/wrapping.html
EOF;
function trimlong($match)
{
$url = $match[0];
$display = $url;
if ( strlen($display) > 30 ) {
$display = substr($display,0,30)."...";
}
return ''.$display.' <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
array($this,'trimlong'),$input);
echo $output;