Removing 'http://' from link via REGEX - php

What I would like to do is remove the "http://" part of these autogenerated links, below is an example of it.
http://google.com/search?gc...
Here are the regexes I am using in PHP to generate these links from a URL.
$patterns_sp[5] = '~([\S]+)~';
$replaces_sp[5] = '<a href=\1 target="_blank">\1<br/>';
$patterns_sp[6] = '~(?<=\>)([\S]{1,25})[^\s]+~';
$replaces_sp[6] = '\1...</a><br/>';
When these patterns are run on a URL like this:
http://www.google.com/search?gcx=c&ix=c1&sourceid=chrome&ie=UTF-8&q=regex
the REGEX gives me:
http://google.com/search?gc...
Where I am stuck:
There is no obvious reason why I cannot modify the fourth line of code to read like this:
$patterns_sp[6] = '~(?<=\>http\:\/\/)([\S]{1,25})[^\s]+~';
However, the REGEX still seems to capture the "http://" part of the address, thus making a long list of these very redundant looking. What I am left with is the same thing as in the first example.

Replace...
$patterns_sp[5] = '~([\S]+)~';
...with...
$patterns_sp[5] = '~^(?:https?|ftp):([\S]+)~';
Then you can access the protocol-less version with $1 and the whole link with $0.
Optionally, you can remove a leading protocol with something like...
preg_replace('/^(?:https?|ftp):/', '', $str);

I suggest not writing your own regex, instead have a look at http://php.net/manual/en/function.parse-url.php
Retrieve the components of the URL, then compose a new version that only contains the parts you want.

Related

Slugs for SEO using PHP - Appending name to end of URL

Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
Please see it working by clicking here
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any trailing slashes
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things, it makes the string URL friendly and readable to the human eye. However, it I would like to see if it is possible to simplify or "tightened it up" a bit... as I feel my code is probably over complicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (Doesn't contain percentage signs and + symbols like simply using urlencode()
Thanks for your help!
Potential Solutions
I found out whilst writing this that article, that I am looking for what is known as a URL 'slug' and they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.
I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add those:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator
Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u

about use regex to convert url to link

I need to convert the url in the article to the 3g domain.
for example, i need to convert
here is the link:http://www.mydomain.com/index thanks
to
here is the link:<a href='http://3g.mydomain.com$4' target='_self'>http://3g.$3.com$4</a> thanks
don't convert the other domain, just mydomain. here is the code:
$c = "/([^'\"=])?http:\/\/([^ ]+?)(mydomain)\.com([A-Za-z0-9&%\?=\/\-\._#]*)/";
$b=preg_replace($c, "$1<a href='http://3g.$3.com$4' target='_self'>http://3g.$3.com$4</a>",$b);
it works very well,but if the text like this:
a link
it will return the wrong result like this:
a link
but l need the result of
a link
how should i do?
You should do the following:
Strip target attributes from existing hyperlinks
Rewrite hyperlinks in href attributes
Rewrite any other hyperlinks
$plain = "http://([^ ]+?)(mydomain)\.com(/?[^'\"\s]*(?=['\"\s]))";
$plain_replace = "http://3g.$3.com$4";
$in_href = "href=(['\"])" + plain + "(['\"])";
$in_href_replace = "href='http://3g.$3.com$4' target='self'";
$strip_target = "target=['\"][^'\"]*['\"]";
...
So:
Replace $strip_target with ""
Replace $in_href with $in_href_replace
Replace $plain with $plain_replace
(The regexes are tested to work in C#, you might have to adjust the \ escaping to suit the php regex rules.)
Get rid of the first ? in your regular expression. That allows for the absence of a preceding character.
Or, perhaps more to your intention, if you want to allow URLs at the beginning, you can replace:
([^'\"=])?
with:
(^|[^'\"=])
...which will allow a link if at the very beginning, or if not preceded by a quote, etc., but not otherwise.

Find url from string with php

following code is used to find url from a string with php. Here is the code:
$string = "Hello http://www.bytes.com world www.yahoo.com";
preg_match('/(http:\/\/[^\s]+)/', $string, $text);
$hypertext = "" . $text[0] . "";
$newString = preg_replace('/(http:\/\/[^\s]+)/', $hypertext, $string);
echo $newString;
Well, it shows a link but if i provide few link it doesn't work and also if i write without http:// then it doesn't show link. I want whatever link is provided it should be active, Like stackoverflow.com.
Any help please..
A working method for linking with http/https/ftp/ftps/scp/scps:
$newStr = preg_replace('!(http|ftp|scp)(s)?:\/\/[a-zA-Z0-9.?&_/]+!', "\\0",$str);
I strongly advise NOT linking when it only has a dot, because it will consider PHP 5.2, ASP.NET, etc. links, which is hardly acceptable.
Update: if you want www. strings as well, take a look at this.
If you want to detect something like stackoverflow.com, then you're going to have to check for all possible TLDs to rule out something like Web 2.0, which is quite a long list. Still, this is also going to match something as ASP.NET etc.
The regex would looks something like this:
$hypertext = preg_replace(
'{\b(?:http://)?(www\.)?([^\s]+)(\.com|\.org|\.net)\b}mi',
'$1$2$3',
$text
);
This only matches domains ending in .com, .org and .net... as previously stated, you would have to extend this list to match all TLDs
#axiomer your example wasn't work if link will be in format:
https://stackoverflow.com?val1=bla&val2blablabla%20bla%20bla.bl
correct solution:
preg_replace('!(http|ftp|scp)(s)?:\/\/[a-zA-Z0-9.?%=&_/]+!', "\\0", $content);
produces:
https://stackoverflow.com?val1=bla&val2blablabla%20bla%20bla.bl

Preg_replace for url and links

Right now
I'm using
$content = preg_replace('#(https?://([-\w\.]+)+(:\d+)?((/[\w/_\.%\-+~]*)?(\?\S+)?)?)#', '$1', $content);
for replace url with links but it doesn't works with some symbols like # and so many other
and also i want that if the content appears like this
http://www.abc.com/
then the preg_replace skip this otherwise it will duplicate the same and produces wrong result.
The text helper class from Kohana has a function for this that would probably be a good starting point: https://github.com/kohana/core/blob/3.2/master/classes/kohana/text.php#L362
Why not just look for anything starting with http:// or https:// up until any whitespace character?
https?://[^\s]+
That is obviously pretty forgiving, the only problem is that you might get some false positives.

regex to get current page or directory name?

I am trying to get the page or last directory name from a url
for example if the url is: http://www.example.com/dir/ i want it to return dir or if the passed url is http://www.example.com/page.php I want it to return page Notice I do not want the trailing slash or file extension.
I tried this:
$regex = "/.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*/i";
$name = strtolower(preg_replace($regex,"$2",$url));
I ran this regex in PHP and it returned nothing. (however I tested the same regex in ActionScript and it worked!)
So what am I doing wrong here, how do I get what I want?
Thanks!!!
Don't use / as the regex delimiter if it also contains slashes. Try this:
$regex = "#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i";
You may try tho escape the "/" in the middle. That simply closes your regex. So this may work:
$regex = "/.*\.(com|gov|org|net|mil|edu)\/([a-z_\-]+).*/i";
You may also make the regex somewhat more general, but that's another problem.
You can use this
array_pop(explode('/', $url));
Then apply a simple regex to remove any file extension
Assuming you want to match the entire address after the domain portion:
$regex = "%://[^/]+/([^?#]+)%i";
The above assumes a URL of the format extension://domainpart/everythingelse.
Then again, it seems that the problem here isn't that your RegEx isn't powerful enough, just mistyped (closing delimiter in the middle of the string). I'll leave this up for posterity, but I strongly recommend you check out PHP's parse_url() method.
This should adequately deliver:
substr($s = basename($_SERVER['REQUEST_URI']), 0, strrpos($s,'.') ?: strlen($s))
But this is better:
preg_replace('/[#\.\?].*/','',basename($path));
Although, your example is short, so I cannot tell if you want to preserve the entire path or just the last element of it. The preceding example will only preserve the last piece, but this should save the whole path while being generic enough to work with just about anything that can be thrown at you:
preg_replace('~(?:/$|[#\.\?].*)~','',substr(parse_url($path, PHP_URL_PATH),1));
As much as I personally love using regular expressions, more 'crude' (for want of a better word) string functions might be a good alternative for you. The snippet below uses sscanf to parse the path part of the URL for the first bunch of letters.
$url = "http://www.example.com/page.php";
$path = parse_url($url, PHP_URL_PATH);
sscanf($path, '/%[a-z]', $part);
// $part = "page";
This expression:
(?<=^[^:]+://[^.]+(?:\.[^.]+)*/)[^/]*(?=\.[^.]+$|/$)
Gives the following results:
http://www.example.com/dir/ dir
http://www.example.com/foo/dir/ dir
http://www.example.com/page.php page
http://www.example.com/foo/page.php page
Apologies in advance if this is not valid PHP regex - I tested it using RegexBuddy.
Save yourself the regular expression and make PHP's other functions feel more loved.
$url = "http://www.example.com/page.php";
$filename = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
Warning: for PHP 5.2 and up.

Categories