function getHost($Address) {
    $parseUrl = parse_url(trim($Address));
    if (!empty($parseUrl['host'])) {
        return trim($parseUrl['host']);
    }
    // No host component (no scheme given): fall back to the first path segment
    $pathParts = explode('/', $parseUrl['path'], 2);
    return trim(array_shift($pathParts));
}
$httpreferer = getHost($_SERVER['HTTP_REFERER']);
$httpreferer = preg_replace('#^(http(s)?://)?w{3}\.#', '$1', $httpreferer);
echo $httpreferer;
I am using this to strip http://, www and subdomains so that only the host is returned, however it returns the following:
http://site.google.com ==> google.com
http://google.com ==> com
How do I get it to remove only the subdomain when one exists, instead of stripping down to the TLD when there isn't one?
Start with parse_url, specifically parse_url($url)['host']:
$arr = parse_url($url);
echo preg_replace('/^www\./', '', $arr['host'])."\n";
Output
site.google.com
google.com
Sandbox
The regex here just matches www. when it's at the start of the string; you could probably do this part a few different ways.
No subdomain
If you don't want any subdomain at all:
$host = parse_url($url)['host'];
echo preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+\..+)$/', '$1', $host)."\n";
Sandbox
No subdomain, no Country Code
$host = parse_url($url)['host'];
echo preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+)(\.[^.]+).*?$/', '$1$2', $host)."\n";
Sandbox
How it works:
Same as the previous one, but the domain is now captured separately from the TLD. Instead of just capturing everything after the domain, we capture everything except the . in the second group, and outside that group we match whatever is left with .*? (confusingly, the . means "any character" here). The *? means zero or more times, non-greedy, so it doesn't take characters away from the previous expressions.
Or to put it another way: match anything, zero or more times, without stealing characters from the previous matches. That way, if there is nothing left after the TLD, as in www.google.com, it matches zero characters. But if it's www.google.com.uk, it matches the .uk, which the replacement then drops.
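For instance, here is a quick sketch running that pattern over a few hosts (the sample hosts and the expected output are assumptions based on tracing the regex):
$pattern = '/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+)(\.[^.]+).*?$/';
foreach (['www.google.com', 'site.google.com', 'www.google.com.uk'] as $host) {
    echo preg_replace($pattern, '$1$2', $host), "\n";
}
// google.com
// google.com
// google.com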
Single Line Answer.
Some versions of PHP, namely 5.4 and newer, let you dereference the array returned by a function directly:
$host = parse_url($url)['host'];
So taking the last example we can compress that into one line and remove the variable assignment.
echo preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+)(\.[^.]+).*?$/', '$1$2',parse_url($url)['host'])."\n";
See it in action
That was just for fun!
Summary
Using parse_url is really the "correct" way to do it, or at least the proper way to start, since it removes a lot of the other "stuff" and gives you a good starting place. Anyway, this was fun for me, and I needed a break from coding my website, because it's tedious for me now (it was 8 years old, so I'm redoing it in WordPress, and I've done about a zillion WordPress sites).
Cheers, hope it helps!
Found the Answer
$testAdd = "https://testing.google.co.uk";
$parse = parse_url($testAdd);
$httpreferer = preg_replace("/^([a-zA-Z0-9].*\.)?([a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z.]{2,})$/", '$2', $parse['host']);
echo $httpreferer;
This will also deal with domains that have a country-code TLD.
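For example, a quick check reusing that pattern (the extra sample hosts and the expected output are assumptions from tracing the regex):
$pattern = "/^([a-zA-Z0-9].*\.)?([a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z.]{2,})$/";
foreach (['site.google.com', 'testing.google.co.uk', 'google.com'] as $host) {
    echo preg_replace($pattern, '$2', $host), "\n";
}
// google.com
// google.co.uk
// google.com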
Thanks for all your help.
Related
I have a series of string that could either be:
1: google.com
2: www.google.com
3: finance.google.com
What I need to do is basically add www. to any string that doesn't already have a subdomain attached to it.
So in this case, we should add www. in front of string #1 (google.com) but leave #2 and #3 alone.
What do you think would be the best way to accomplish this? Some form of REGEX?
You can't, because there are no fixed rules (how many subdomains? for instance you.can.have.lot.of.dots.com ...), and you can't know that www.yourdomain.com is equal to yourdomain.com.
You could detect whether the 4th character is a dot. Not always going to work, but it will detect the presence of either a 3-letter prefix or a URL for a 3-letter domain name (such as ksl.com).
if (substr($url, 3, 1) == ".")
You could also check whether the first 3 characters are www:
if (strtolower(substr($url,0,3))=="www")
Good Luck
$domain = "in.yahoo.com";
$pattern = '/\./';
preg_match_all($pattern, $domain, $matches, PREG_OFFSET_CAPTURE);
if(count($matches[0])==1){
$new_domain="www.".$domain;
} else {
$new_domain=$domain;
}
print $new_domain;
Regex, or you could count the dots in this case. If there is only one dot, www. can be added in front. That is assuming you don't use filenames, pages, or (sub)domains with extra dots, and so on.
<?php
if (substr_count($url, '.') < 2)
{
    $url = "www." . $url;
}
?>
I know I've seen this done a lot in places, but I need something a little different from the norm. Sadly, when I search for this anywhere it gets buried in posts about just turning the link into an HTML tag. I want the PHP function to strip out the "http://" and "https://" from the link, as well as anything after the domain, so basically what I am looking for is to turn A into B.
A: http://www.youtube.com/watch?v=spsnQWtsUFM
B: www.youtube.com
If it helps, here is my current PHP regex replace function.
ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]", "\\0", htmlspecialchars($body, ENT_QUOTES));
It would probably also be helpful to say that I have absolutely no understanding of regular expressions. Thanks!
EDIT: When I entered a comment like blahblah https://www.facebook.com/?sk=ff&ap=1 blah, I get HTML like this: <a class="bwl" href="blahblah https://www.facebook.com/?sk=ff&ap=1 blah">www.facebook.com</a>, which doesn't work at all, as it is taking the text around the link with it. It works great if someone only comments a link, however. This is when I changed the function to this:
preg_replace("#^(.*)//(.*)/(.*)$#",'<a class="bwl" href="\0">\2</a>', htmlspecialchars($body, ENT_QUOTES));
This is the simplest and cleanest way:
$str = 'http://www.youtube.com/watch?v=spsnQWtsUFM';
preg_match("#//(.+?)/#", $str, $matches);
$site_url = $matches[1];
EDIT: I assume that $str has already been checked to be a URL in the first place, so I left that out. I also assume that all the URLs will contain either 'http://' or 'https://'. In case the URL is formatted like www.youtube.com/watch?v=spsnQWtsUFM or even youtube.com/watch?v=spsnQWtsUFM, the above regexp won't work!
EDIT2: I'm sorry, I didn't realize that you were trying to replace all such strings in a whole text. In that case, this should work the way you want it:
$str = preg_replace('#(\A|[^=\]\'"a-zA-Z0-9])(http[s]?://(.+?)/[^()<>\s]+)#i', '\\1\\3', $str);
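To illustrate, here is that replacement run over the comment text from the question (the expected output is an assumption from tracing the pattern):
$str = 'blahblah https://www.facebook.com/?sk=ff&ap=1 blah';
echo preg_replace('#(\A|[^=\]\'"a-zA-Z0-9])(http[s]?://(.+?)/[^()<>\s]+)#i', '\\1\\3', $str);
// blahblah www.facebook.com blah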
I am not a regex whizz either,
^(.*)//(.*)/(.*)$
\2
was what worked for me when I tried to use it as find and replace in Programmer's Notepad.
^(.*)// should extract the protocol (referred to as \1 in the second line).
(.*)/ should extract everything from after the // up to the last / (referred to as \2 in the second line).
(.*)$ captures everything till the end of the string (referred to as \3 in the second line).
Added later
^(.*)( )(.*)//(.*)/(.*)( )(.*)$
\1\2\4 \7
This should be a bit better, but it will only replace one URL.
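The same find-and-replace expressed as PHP's preg_replace, for reference (the sample comment text is reused from the question above, and the output shown is an assumption from tracing the pattern):
$text = 'blahblah https://www.facebook.com/?sk=ff&ap=1 blah';
echo preg_replace('#^(.*)( )(.*)//(.*)/(.*)( )(.*)$#', '$1$2$4 $7', $text);
// blahblah www.facebook.com blah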
The \0 is replaced by the entire matched string, whereas \x (where x is a number starting at 1) is replaced by each subpart of the matched string, based on what you wrap in parentheses and the order in which those groups appear. Your solution is as follows:
ereg_replace("[[:alpha:]]+://([^<>[:space:]]+[:alnum:]*)[[:alnum:]/]", "\\1", htmlspecialchars($body, ENT_QUOTES));
I haven't been able to test this though so let me know if it works.
I think this should do it (I haven't tested it):
preg_match('/^http[s]?:\/\/(.+?)\/.*/i', $main_url, $matches);
$final_url = $matches[1];
I'm surprised no one remembers PHP's parse_url function:
$url = 'http://www.youtube.com/watch?v=spsnQWtsUFM';
echo parse_url($url, PHP_URL_HOST); // displays "www.youtube.com"
I think you know what to do from there.
$result = preg_replace('%(http[s]?://)(\S+)%', '\2', $subject);
The code with the regex does not work in all cases.
I made this code; it is much more comprehensive, but it works:
See the result here: http://cht.dk/data/php-scripts/inc_functions_links.php
See the source code here: http://cht.dk/data/php-scripts/inc_functions_links.txt
Is there a great way to extract the subdomain with PHP, without regex?
Why without regex?
There are a lot of topics about this; one of them is Find out subdomain using Regular Expression in PHP.
The internet says it consumes a lot of memory. If there is any consideration, or you think it's better to use regex (maybe we'd need to use a lot of functions to get this solution), please comment below too.
Example:
static.x.com = 'static'
helloworld.x.com = 'helloworld'
b.static.ak.x.com = 'b.static.ak'
x.com = ''
www.x.com = ''
Thanks for looking in.
Adam Ramadhan
http://php.net/explode ?
Just split them on the dot? And do some functions?
Or, if the last part (x.com) is the same every time, do a substring on the hostname, stripping off the last part.
The only exception you'll have to make in your handling is www.x.com (which technically is a subdomain).
$hostname = '....';
$baseHost = 'x.com';
// Strip the base host, plus the dot before it, off the end to get the subdomain part
$subdomain = rtrim(substr($hostname, 0, -strlen($baseHost)), '.');
if ($subdomain === 'www') {
    $subdomain = '';
}
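A quick check against the hostnames from the question (the loop and the expected output are assumptions):
$baseHost = 'x.com';
foreach (['static.x.com', 'b.static.ak.x.com', 'x.com', 'www.x.com'] as $hostname) {
    $subdomain = rtrim(substr($hostname, 0, -strlen($baseHost)), '.');
    if ($subdomain === 'www') {
        $subdomain = '';
    }
    echo $hostname, ' => "', $subdomain, '"', "\n";
}
// static.x.com => "static"
// b.static.ak.x.com => "b.static.ak"
// x.com => ""
// www.x.com => ""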
Whoever told you that regexes "consume a lot" was an idiot. Simple regexes are not very cpu/memory-consuming.
However, for your purpose a regex is clearly overkill. You can explode() the string and then take as many elements from the array as you need. However, your last example is really bad. www is a perfectly valid subdomain.
You can first use parse_url (http://www.php.net/manual/de/function.parse-url.php)
and then explode the host with . as the delimiter (http://www.php.net/manual/de/function.explode.php).
I would not say it is quicker (just test it), but maybe this solution is better.
function getSubdomain($host) {
return implode('.', explode('.', $host, -2));
}
explode splits the string on the dot and drops the last two elements. Then implode combines these pieces again using the dot as separator.
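For reference, here is a quick check against the hostnames from the question (the output shown is an assumption); note that www.x.com yields "www" here, so the www case would still need special handling:
foreach (['static.x.com', 'helloworld.x.com', 'b.static.ak.x.com', 'x.com', 'www.x.com'] as $host) {
    echo $host, ' => "', getSubdomain($host), '"', "\n";
}
// static.x.com => "static"
// helloworld.x.com => "helloworld"
// b.static.ak.x.com => "b.static.ak"
// x.com => ""
// www.x.com => "www"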
I am absolutely a newbie and have not ventured to this level yet, but I needed to be able to strip a domain down to only the hostname for a search function. I looked around and found the code below, which pretty much works except when the domain name has a - in it. So http://www.example.com strips down to example.com, as does www.example.com, but www.exa-mple.com becomes example.com.
$pattern = '/\w+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
$url = $myurl;
if (preg_match($pattern, $url, $matches) === 1) {
$mydom = $matches[0];
}
What would have to be changed in the expression so that it accepts the - in the domain names?
You'd be better off with the parse_url function:
parse_url($url)
Just prepend http:// if the URL doesn't start with it.
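A minimal sketch of that suggestion (the hyphenated host is taken from the question; the scheme check is an assumption):
$url = 'www.exa-mple.com';
if (!preg_match('#^https?://#i', $url)) {
    $url = 'http://' . $url;
}
echo parse_url($url, PHP_URL_HOST); // www.exa-mple.com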
Your regex currently allows the character _ and disallows the character -, which means it accepts some invalid hostnames and rejects some valid ones. You can correct this with the following character class:
$pattern = '/[a-z0-9-]+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
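For example, with the hyphenated host from the question (the expected output is an assumption from tracing the pattern):
$pattern = '/[a-z0-9-]+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
if (preg_match($pattern, 'www.exa-mple.com', $matches) === 1) {
    echo $matches[0]; // exa-mple.com
}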
Note that there are still issues with this. First, domain names are not allowed to start or end with a hyphen. Second, you are currently allowing any character in the TLD, whereas they only contain letters.
The best solution would be to use a proper URL parsing library and not to try to do this yourself.
$sites = array('mysite.com',
'www.mysite.com',
'http://www.mysite.com',
'www.my-site.com',
'sub.folder.2.example.com',
'http://www.mysite.com/argh/index.php');
$reg = '%^(?:http://)?(?:[^.]*\.)*([a-zA-Z0-9_-]+\.[a-zA-Z0-9]+)%m';
foreach($sites as $site)
{
    if(preg_match($reg, $site, $matches))
    {
        echo $matches[1], PHP_EOL;
    }
}
Output:
mysite.com
mysite.com
mysite.com
my-site.com
example.com
mysite.com
I am trying to implement a php script which will run on every call to my site, look for a certain pattern of URL, then explode the URL and perform a redirect.
Basically I want to run this on a new CMS to catch all incoming links from the old CMS and redirect based on a mapping, say from an article id stripped from the URL to the same article id imported into the new CMS's DB.
I can do the implementation, the redirect etc, but I am lost on the regex.
I need to catch any occurrences of:
domain.com/content/view/*/34/ or domain.com/content/view/*/30/ (where * is a wildcard), and capture the * and the 30 or 34 in variables which I will then use in a DB query.
If the following is encountered:
domain.com/content/view/*/34/1/*/
I need to capture the first * and the second *.
I'd be very grateful to anyone who can give me a hand with this.
I'm not sure regular expressions are the way to go. I think it would probably be easier to use explode ('/' , $url) and check by looping over that array.
Here are the steps I would follow:
$url = parse_url($url, PHP_URL_PATH);
$url = trim($url, '/');
$parts = explode ('/' , $url);
Then you can check if
($parts[0]=='content' && $parts[1]=='view' && $parts[3]=='34')
You can also easily get the information you want with $parts[2].
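Putting those steps together as a runnable sketch (the example article id 123 and the variable names are made up for illustration):
$url = 'http://domain.com/content/view/123/34/';
$path = trim(parse_url($url, PHP_URL_PATH), '/');   // "content/view/123/34"
$parts = explode('/', $path);
if ($parts[0] == 'content' && $parts[1] == 'view' && in_array($parts[3], array('34', '30'))) {
    $articleId = $parts[2]; // "123", used in the DB query for the redirect
}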
It's actually very simple: a more flexible and straightforward approach is to explode() the URL into an array called something like $segments, and then test on that. If you have a very small number of expected URLs, this kind of approach is probably easier to maintain and to read.
I wouldn't recommend doing this in the htaccess file because of the performance overhead.
First, I would use the PHP function parse_url() to get the path, devoid of any protocol or hostname.
Once you have that the following code should get you the info you need.
<?php
$url = 'http://domain.com/content/view/*/34/'; // first example
$url = 'http://domain.com/content/view/*/34/1/*/'; // second example
$url_array = parse_url($url);
$path = $url_array['path'];
// Match the URL against regular expressions
if (preg_match('/content\/view\/([^\/]+)\/([0-9]+)\//i', $path, $matches)){
print_r($matches);
}
if (preg_match('/content\/view\/([^\/]+)\/([0-9]+)\/([0-9]+)\/([^\/]+)/i', $path, $matches)){
print_r($matches);
}
?>
([^/]+) matches any sequence of characters except a forward slash
([0-9]+) matches any sequence of numbers
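For the first example URL, the first print_r() call would output something along these lines (the output shown is an assumption from tracing the pattern):
Array
(
    [0] => content/view/*/34/
    [1] => *
    [2] => 34
)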
Though you can probably write a single regular expression to match most URL variants, consider using multiple regular expressions to check for different types of URLs. Depending on how much traffic you get, the speed hit won't be all that terrible.
Also, I recommend reading Mastering Regular Expressions from O'Reilly. A good knowledge of regular expressions will come in handy quite often.
http://www.regular-expressions.info/php.html