regex to trim down subdomain in the url

regex to trim down subdomain in the url - php

I have a regexp that match to something like : wiseman.google.com.jp, me.co.uk, paradise.museum, abcd-abc.net, www.google.jp, 12345-daswe-23dswe-dswedsswe-54eddss.info, del.icio.us, jo.ggi.ng, all of this is from a textarea value.
used regexp (in preg_match_all($regex1, $str, $match)) to get the above values: /(?:[a-zA-Z0-9]{2,}\.)?[-a-zA-Z0-9]{2,}\.[a-zA-Z0-9]{2,7}(?:\.[-a-zA-Z0-9]{2,3})?/
Now, my question is : how can I make the regexp to trim down the "wiseman.google.com.jp" into "google.com.jp" and "www.google.jp" into "google.jp"?
I am thingking to make a second preg_match($regex2, $str, $match) function with each value coming from the preg_match_all function.
I have tried this regexp in $regex2 : ([-a-zA-Z0-9\x{0080}-\x{00FF}]{2,}+)\.[a-zA-Z0-9\x{0080}-\x{00FF}]{2,7}(?:\.[-a-zA-Z0-9\x{0080}-\x{00FF}]{2,3})? but it doesn't work.
Any inputs? TIA
here is my little solution :
preg_match_all($regex, $str, $matches, PREG_PATTERN_ORDER);
$arrlength=count($matches[0]);
for($x=0;$x<$arrlength;$x++){
$dom = $matches[0][$x];
$newstringcount = substr_count($dom, '.'); // this line is to count how many "." present in the string.
if($newstringcount == 3){ // if there are 3 '.' present in the string = true
$pos = strpos($dom, '.', 0); // this line is to find the first occurence of the '.' in the string
$find = substr($dom, $pos+1); //this line is to get the value after the first occurence of the '.' in the string
echo $find;
}else if($newstringcount == 2){
if ($pos = strpos($dom,'www.') !== false) {
$find = substr($dom, $pos+3);
echo $find;
}else{
echo $dom;
}
}else if($newstringcount == 1){
echo $dom;
}
echo "<br>";
}

(Caution: this answer will only fit your needs if you HAVE to use regex or you're somewhat... desperate...)
What you want to achieve isn't possible with general rules due to domains like .com.jp or .co.uk.
The only general rule one can find is:
When read from right to left there are one or two TLDs followed by one second level domain
Thus, we have to whitelist all available TLDs. I think i'll call the following the "domain-kraken".
Release the kraken!
([a-z0-9\-]{2,63}(?:\.(?:a(?:cademy|ero|rpa|sia|[cdefgilmnoqrstuwxz])|b(?:ike
|iz|uilders|uzz|[abdefghijlmnoqrstvwyz])|c(?:ab|amera|amp|areers|at|enter|eo
|lothing|odes|offee|om(?:pany|puter)?|onstruction|ontractors|oop|
[acdfghiklmnoruvwxyz])|d(?:iamonds|irectory|omains|[ejkmoz])|e(?:du(?:cation)?
|mail|nterprises|quipment|state|[ceghrstu])|f(?:arm|lorist|[ijkmor])|g(?:allery|
lass|raphics|uru|[abdefghlmnpqrstuwy])|h(?:ol(?:dings|iday)|ouse|[kmnrtu])|
i(?:mmobilien|n(?:fo|stitute|ternational)|[delmnoqrst])|j(?:obs|[emop])|
k(?:aufen|i(?:tchen|wi)|[eghimnprwxyz])|l(?:and|i(?:ghting|mo)|[abcikrstuvy])|
m(?:anagement|enu|il|obi|useum|[acdefghklmnopqrstuvwxyz])|n(?:ame|et|inja|
[acefgilopruz])|o(?:m|nl|rg)|p(?:hoto(?:graphy|s)|lumbing|ost|ro|[aefghklmnrstwy])|
r(?:e(?:cipes|pair)|uhr|[eosuw])|s(?:exy|hoes|ingles|ol(?:ar|utions)|upport|
ystems|[abcdeghijklmnorstuvxyz])|t(?:attoo|echnology|el|ips|oday|
[cdfghjklmnoprtvwz])|u(?:no|[agkmsyz])|v(?:entures|iajes|oyage|[aceginu])|
w(?:ang|ien|[fs])|xxx|y(?:[et])|z(?:[amw]))){1,2})$
Use it together with the i and m flags.
This supposes your data is on mutiple lines.
In case your data is seperated by a ,, change the last character in the regex ($) to ,? and use the g and i flags.
Demos are available on regex101 and debuggex.
(Both of the demos have an explanation: regex101 describes it with text while debuggex visualizes the beast)
A list of available TLDs can be found at iana.org, the used TLDs in the regex are as of January 2014.

Related

Find string pattern from end

I am trying to find if a string is of a format <initial_part_of_name>_0.pdf i.e.
Find if it ends with .pdf (could be eliminated using rtrim)
The initial part is followed by an _ underscore.
The underscore is followed by a whole number (0, 1, 2, ... , etc.)
What could be the optimum way to achieve this? I have tried combinations of the string functions strpos (to find the position. but could not get to do anything from the end of the string).
Any pointers would be appreciated!
Edit:
Sample strings:
public://Big_Data_Tutorial_part4_0.pdf
public://Big_Data_Tutorial_part4_1.pdf
public://Big_Data_Tutorial_part4_3.pdf
The reason why I need to check is to avoid duplicate files which are stored with the _<number> appended.

You can use preg_match() function for matching patterns
Check the function preg_match()
preg_match("/(.*)_(\d+)\.pdf$/", "<initial_part_of_name>_0.pdf",$arr);
In $arr[1], you will get the <initial_part_of_name>
in $arr[2], you will get the number after underscore

a non-array and regex way
$str1 = "public://Big_Data_Tutorial_part4_0a.pdf"; // no match because 0a
$str2 = "public://Big_Data_Tutorial_part4_1.pdf"; // match
$str3 = "public://Big_Data_Tutorial_part4_3.pdf"; // match
$last_part = strrchr($str1, "_");
if (trim(strstr($last_part, ".", true), "_0..9") == "" && strstr($last_part, ".") == ".pdf") {
echo "match";
}

$str = '<initial_part_of_name>_0.pdf';
$exploded = explode('.', $str);
echo $exploded[1];

This could be done using regex. Something like
^[0-9A-Za-z]+\_[\d]\.pdf$
Implementation:
$filename = '<initial_part_of_name>_0.pdf';
if(preg_match('/^[0-9A-Za-z]+\_[\d]\.pdf$/i', $filename)){
// name pattern Matched
}
SOLUTION 2
use pathinfo()
$filename = '<initial_part_of_name>_0.pdf';
$path_parts = pathinfo($filename);
if(strtolower($path_parts['extension']) == 'pdf') {
if(preg_match('/.*_[\d]$/', $path_parts['filename'])){
// name pattern Matched
}
} else {
// Not a PDF file
}

remove part of string after substring

This may be a dupe, but I cannot seem to find a thread which matches this issue. I want to remove all chars from a string after a given sub-string - but the chars and the number of chars after the sub-string is unknown. Most solutions I have found seem to only work for removing the given sub-string itself or a fixed length after a given sub-string.
I have
$str = preg_replace('(.gif*)','.gif$',$str);
Which locates 'blahblah.gif?12345' ok, but I cannot seem to remove the chars after the sub-string '.gif'. I read that $ denotes EOS so I thought this would work, but apparently not. I also tried
'.gif$/'
and simply
'.gif'

It can be done without regex:
echo substr('blahblah.gif?12345', strpos('blahblah.gif?12345', '.gif') + 4);
// returns ?12345 this is the length of the substring ^
So the code is:
$str = 'original string';
$match = 'matching string';
$output = substr($str, strpos($str, $match) + strlen($match));
Ok, now I'm not sure if you want to keep the first or the second part of the string. Anyway, here's the code for keeping the first part:
echo substr('blahblah.gif?12345', 0, strpos('blahblah.gif?12345', '.gif') + 4);
// returns blahblah.gif ^ this is the key
And the full code:
$str = 'original string';
$match = 'matching string';
$output = substr($str, 0, strpos($str, $match) + strlen($match));
See the both examples work here: http://ideone.com/Ge30rY

Assuming (from OP's comment) that you are working with actual URLs as your source string, I believe that the best course of action here would be to use PHP's built-in functionality for working with and parsing URLs. You do this by using the parse_url() function:
(PHP 4, PHP 5)
parse_url — Parse a URL and return its components
This function parses a URL and returns an associative array containing any of the various components of the URL that are present.
This function is not meant to validate the given URL, it only breaks it up into the above listed parts. Partial URLs are also accepted, parse_url() tries its best to parse them correctly.
From your example: www.page.com/image.gif?123 (or even just image.gif?123) using parse_url() will look something like this:
var_dump( parse_url( "www.page.com/image.gif?123" ) );
array(2) {
["path"]=>
string(22) "www.page.com/image.gif"
["query"]=>
string(3) "123"
}
As you can see, without the need for regular expressions or string manipulations we have broken up the URL into it's separate components. No need to re-invent the wheel. Nice and clean :)

You could do this:
$str = "somecontent.gif?anddata";
$pattern = ".gif";
echo strstr($str,$pattern,true).$pattern;

// Set up string to search through
$haystack = "blahblah.gif?12345";
// Determine substring and length of it
$needle = ".gif";
$length = strlen($needle);
// Find position of last substring
$location = strrpos($haystack, $needle);
// Use location of last occurence + it's length to get new string
$newtext = substr($haystack, 0, $location+$length);

How to handle a miss with regex / PHP / preg_match_all

I'm using the code at the bottom to grab parameters from a wordpress shortcode. The shortcode itself looks like this:
[FLOWPLAYER=http://www.tvovermind.com/wp-content/uploads/2013/01/pll-316-21.jpg|http://www.tvovermind.com/wp-content/uploads/2013/01/PLL316_fv2.h264HD-Clip2.flv,440,280]
Or
[FLOWPLAYER=http://www.tvovermind.com/wp-content/uploads/2013/01/pll-316-21.jpg|http://www.tvovermind.com/wp-content/uploads/2013/01/PLL316_fv2.h264HD-Clip2.flv,440,280,false]
What I would like to have happen is that if the extra parameter (false/true) is missing then that match becomes "false", however with the current code if the parameter is missing a match is never made. Any ideas?
function legacy_hook($content){
$regex = '/\[FLOWPLAYER=([a-z0-9\:\.\-\&\_\/\|]+)\,([0-9]+)\,([0-9]+)\,([a-z0-9\:\.\-\&\_\/\|]+)\]/i';
$matches = array();
preg_match_all($regex, $content, $matches);
if($matches[0][0] != '') {
foreach($matches[0] as $key => $data) {
$content = str_replace($matches[0][$key], flowplayer::build_player($matches[2][$key], $matches[3][$key], $matches[1][$key],$matches[4][$key]),$content);
}
}
return $content;
}

your regex is looking for the last comma to be there and one or more of the characters in the last set of brackets. Something like
/\[FLOWPLAYER=([a-z0-9\:\.\-\&\_\/\|]+)\,([0-9]+)\,([0-9]+)(\,[a-z]+)?\]/i
only issue is you'll get the comma in the match too.
might be what you're after, then you have to test for the last match being present. preg_match_all returns the number of matches so you might be able to use that, or you could do an inline if...
(count($matches) > 4 ? $matches[4][$key] : false)

You can add OR at the end of your expression
(,true|,false|$)
I didn't check does it work but you get the idea.

Determine User Input Contains URL

I have a input form field which collects mixed strings.
Determine if a posted string contains an URL (e.g. http://link.com, link.com, www.link.com, etc) so it can then be anchored properly as needed.
An example of this would be something as micro blogging functionality where processing script will anchor anything with a link. Other sample could be this same post where 'http://link.com' got anchored automatically.
I believe I should approach this on display and not on input. How could I go about it?

You can use regular expressions to call a function on every match in PHP. You can for example use something like this:
<?php
function makeLink($match) {
// Parse link.
$substr = substr($match, 0, 6);
if ($substr != 'http:/' && $substr != 'https:' && $substr != 'ftp://' && $substr != 'news:/' && $substr != 'file:/') {
$url = 'http://' . $match;
} else {
$url = $match;
}
return '' . $match . '';
}
function makeHyperlinks($text) {
// Find links and call the makeLink() function on them.
return preg_replace('/((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])/e', "makeLink('$1')", $text);
}
?>

You will want to use a regular expression to match common URL patterns. PHP offers a function called preg_match that allows you to do this.
The regular expression itself could take several forms, but here is something to get you started (also maybe just Google 'URL regex':
'/^(((http|https|ftp)://)?([[a-zA-Z0-9]-.])+(.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]/+=%&_.~?-]))$/'
So your code should look something this:
$matches = array(); // will hold the results of the regular expression match
$string = "http://www.astringwithaurl.com";
$regexUrl = '/^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$/';
preg_match($regexUrl, $string, $matches);
print_r($matches); // an array of matched patterns
From here, you just want to wrap those URL patterns in an anchor/href tag and you're done.

Just how accurate do you want to be? Given just how varied URLs can be, you're going to have to draw the line somewhere. For instance. www.ca is a perfectly valid hostname and does bring up a site, but it's not something you'd EXPECT to work.

You should investigate regular expressions for this.
You will build a pattern that will match the part of your string that looks like a URL and format it appropriately.
It will come out something like this (lifted this, haven't tested it);
$pattern = "((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)";
preg_match($pattern, $input_string, $url_matches, PREG_OFFSET_CAPTURE, 3);
$url_matches will contain an array of all of the parts of the input string that matched the url pattern.

You can use $_SERVER['HTTP_HOST'] to get the host information.
<?php
$host = $SERVER['HTTP_HOST'];
?>
Post

How to use str_replace() to remove text a certain number of times only in PHP?

I am trying to remove the word "John" a certain number of times from a string. I read on the php manual that str_replace excepts a 4th parameter called "count". So I figured that can be used to specify how many instances of the search should be removed. But that doesn't seem to be the case since the following:
$string = 'Hello John, how are you John. John are you happy with your life John?';
$numberOfInstances = 2;
echo str_replace('John', 'dude', $string, $numberOfInstances);
replaces all instances of the word "John" with "dude" instead of doing it just twice and leaving the other two Johns alone.
For my purposes it doesn't matter which order the replacement happens in, for example the first 2 instances can be replaced, or the last two or a combination, the order of the replacement doesn't matter.
So is there a way to use str_replace() in this way or is there another built in (non-regex) function that can achieve what I'm looking for?

As Artelius explains, the last parameter to str_replace() is set by the function. There's no parameter that allows you to limit the number of replacements.
Only preg_replace() features such a parameter:
echo preg_replace('/John/', 'dude', $string, $numberOfInstances);
That is as simple as it gets, and I suggest using it because its performance hit is way too tiny compared to the tedium of the following non-regex solution:
$len = strlen('John');
while ($numberOfInstances-- > 0 && ($pos = strpos($string, 'John')) !== false)
$string = substr_replace($string, 'dude', $pos, $len);
echo $string;
You can choose either solution though, both work as you intend.

You've misunderstood the wording of the manual.
If passed, this will be set to the number of replacements performed.
The parameter is passed by reference and its value is changed by the function to indicate how many times the string was found and replaced. Its initial value is discarded.

There are a few things you could do to achieve this, but I can't think of one specific php function that will easily let you do this.
One option is to create your own replace function and utilize strripos and substr to do the replaces.
Another thing you can do is use preg_replace_callback and count the number of replacements you have done in the callback.
There's probably more ways but that's all I can think of on the fly. If performance is an issue I suggest you give both a try and do some simple benchmarks.

The cleanest, most-direct, single function call is to use preg_replace(). Its replacement limiting parameter makes the task intuitive and readable.
$string = preg_replace('/John/', 'dude', $string, $numberOfInstances);
The function is also attractive because making the search case-insensitive is as simple as adding the i pattern modifier to the end of the pattern. I won't delve into the usefulness of word boundaries (\b).
If a search string might contain characters with special meaning to the regex engine, then preg_quote() will be necessary -- this diminishes the beauty of the technique but not prohibitively so.
$search = '$5.99';
$pattern = '/' . preg_quote($search, '/') . '/';
$string = preg_replace($pattern, 'free', $string, $numberOfInstances);
For anyone who has an unnatural bias against regex functions, this can be done without regex and without looping -- it will be case-sensitive though.
Limited Explode & Implode: (Demo)
$numberOfInstances = 2;
$string = 'Hello John, how are you John. John are you happy with your life John?';
// explode here -^^^^ and ---------^^^^ only to create the following array:
// 0 => 'Hello ',
// 1 => ', how are you ',
// 2 => '. John are you happy with your life John?'
echo implode('dude', explode('John', $string, $numberOfInstances + 1));
Output:
Hello dude, how are you dude. John are you happy with your life John?
Notice the explode's limiting parameter dictates how many elements are generated, not how many explosions are executed on the string.

function str_replace_occurrences($find, $replace, $string, $count = -1) {
// current occrurence
$current = 0;
// while any occurrence
while (($pos = strpos($string, $find)) != false) {
// update length of str (size of string is changing)
$len = strlen($find);
// found next one
$current++;
// check if we've reached our target
// -1 is used to replace all occurrence
if($current <= $count || $count == -1) {
// do replacement
$string = substr_replace($string, $replace, $pos, $len);
} else {
// we've reached our
break;
}
}
return $string;
}

Artelius has already described how the function works, ill just show you how to do this via the manual methods:
function str_replace_occurrences($find,$replace,$string,$count = 0)
{
if($count == 0)
{
return str_replace($find,$replace,$string);
}
$pos = 0;
$len = strlen($find);
while($pos < $count && false !== ($pos = strpos($string,$find,$pos)))
{
$string = substr_replace($string,$replace,$pos,$len);
}
return $string;
}
This is untested but should work.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

regex to trim down subdomain in the url - php

Related

Find string pattern from end

remove part of string after substring

How to handle a miss with regex / PHP / preg_match_all

Determine User Input Contains URL

How to use str_replace() to remove text a certain number of times only in PHP?

Categories

Resources