I want to match a subdomain in the PHP variable $_SERVER['SERVER_NAME'] and then do an internal redirect. Apache or nginx rewrites are not an option because those are external rewrites that are visible to the client/user.
My regular expression is (.*(?<!^.))subdomain\.example\.com. As you can see, I match a subdomain within a subdomain (multi-level subdomains). I would like to use the first capture group later.
This is my PHP code:
if(preg_match('#(.*(?<!^.))subdomain\.example\.com#', $_SERVER['SERVER_NAME'], $match1)) {
echo $match1[1] . 'anothersubdomain.example.com';
}
But this fails when the host is, for example, csssubdomain.example.com, which is a different subdomain that I don't want to match. I test the matches with the following PHP script:
$tests = array(
'subdomain.example.com' => 'anothersubdomain.example.com',
'css.subdomain.example.com' => 'css.anothersubdomain.example.com',
'csssubdomain.example.com' => 'csssubdomain.example.com',
'tsubdomain.example.com' => 'tsubdomain.example.com',
'multi.sub.subdomain.example.com' => 'multi.sub.anothersubdomain.example.com',
'.subdomain.example.com' => '.subdomain.example.com',
);
foreach( $tests as $test => $correct_answer) {
$result = preg_replace( '#(.*(?<!^.))subdomain\.example\.com#', '$1anothersubdomain.example.com', $test);
echo 'Input: ' . $test . "\n" .
'Expected: ' . $correct_answer . "\n" .
'Actual : ' .$result . "\n";
$passorfail = (strcmp( $result, $correct_answer) === 0 ? "PASS\n\n" : "FAIL\n\n");
echo $passorfail;
}
This produces the following output:
Input: subdomain.example.com
Expected: anothersubdomain.example.com
Actual : anothersubdomain.example.com
PASS
Input: css.subdomain.example.com
Expected: css.anothersubdomain.example.com
Actual : css.anothersubdomain.example.com
PASS
Input: csssubdomain.example.com
Expected: csssubdomain.example.com
Actual : cssanothersubdomain.example.com
FAIL
Input: tsubdomain.example.com
Expected: tsubdomain.example.com
Actual : tsubdomain.example.com
PASS
Input: multi.sub.subdomain.example.com
Expected: multi.sub.anothersubdomain.example.com
Actual : multi.sub.anothersubdomain.example.com
PASS
Input: .subdomain.example.com
Expected: .subdomain.example.com
Actual : .subdomain.example.com
PASS
The strange thing is that it does match csssubdomain.example.com but not tsubdomain.example.com.
Does anyone know which regular expression would work for this case? I've tried a few things with zero-width lookahead and lookbehind assertions, but none of them really worked.
You can try this pattern:
~^((?:\w+\.)*?)subdomain\.example\.com~
If you want to allow a leading dot, as in .toto.subdomain.example.com, just add \.? at the beginning of the repeated group:
~^((?:\.?\w+\.)*?)subdomain\.example\.com~
If you want to allow a hyphen character, just add it to the character class:
~^((?:\.?[\w-]+\.)*?)subdomain\.example\.com~
And if a label must not begin or end with a hyphen character:
~^((?:\.?\w+(?:[\w-]*?\w)?\.)*?)subdomain\.example\.com~
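As a rough check, the first pattern can be run against the hosts from the question. This is only a sketch: the function name rewriteHost is made up for illustration, and the trailing $ anchor is an assumption added here so that nothing may follow example.com.

```php
<?php
// Hypothetical helper; the name rewriteHost is an assumption for this sketch.
function rewriteHost(string $host): string
{
    // Group 1 holds zero or more complete "label." prefixes, so a partial
    // label such as "css" in "csssubdomain" can never be captured.
    return preg_replace(
        '~^((?:\w+\.)*?)subdomain\.example\.com$~',
        '$1anothersubdomain.example.com',
        $host
    );
}

$hosts = [
    'subdomain.example.com',
    'css.subdomain.example.com',
    'csssubdomain.example.com',
    '.subdomain.example.com',
];
foreach ($hosts as $host) {
    echo $host, ' => ', rewriteHost($host), "\n";
}
```

Hosts that do not match are returned unchanged, which is what the test table in the question expects for csssubdomain.example.com and tsubdomain.example.com.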
How can I write a regular expression that converts any absolute URL to a relative path? For example:
src="http://www.test.localhost/sites/"
would become
src="/sites/"
The domains are not static.
I can't use parse_url (as per this answer) because the URL is part of a larger string that contains non-URL data as well.
Solution
You can use the following regex:
/https?:\/{2}[^\/]+/
Which would match the following:
http://www.test.localhost/sites/
http://www.domain.localhost/sites/
http://domain.localhost/sites/
So it would be:
$domain = preg_replace('/https?:\/{2}[^\/]+/', '', $domain);
Explanation
http: Look for 'http'
s?: Look for an 's' after the 'http' if there's one
: : Look for the ':' character
\/{2}: Look for the '//'
[^\/]+: Go for anything that is not a slash (/)
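The pieces explained above can be tried on a larger string as follows. This is a sketch under assumptions: the function name stripOrigins is hypothetical, and the character class additionally excludes quotes and whitespace (not part of the regex above) so that a URL without any path does not swallow the closing quote of the attribute.

```php
<?php
// Hypothetical helper; the name stripOrigins is an assumption for this sketch.
function stripOrigins(string $html): string
{
    // Scheme, "//", then the host: everything up to the first slash,
    // quote or whitespace character.
    return preg_replace('~https?://[^/"\'\s]+~', '', $html);
}

$html = '<img src="http://www.test.localhost/sites/"> '
      . '<a href="https://domain.localhost/a/b">x</a>';
echo stripOrigins($html), "\n";
```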
This expression, or an improved version of it, might work to some extent:
^\s*src=["']\s*https?:\/\/(?:[^\/]+)([^"']+?)\s*["']$
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Test
$re = '/^\s*src=["\']\s*https?:\/\/(?:[^\/]+)([^"\']+?)\s*["\']$/m';
$str = 'src=" http://www.test.localhost/sites/ "
src=" https://www.test.localhost/sites/"
src=" http://test.localhost/sites/ "
src="https://test.localhost/sites/ "
src="https://localhost/sites/ "
src=\'https://localhost/ \'
src=\'http://www.test1.test2.test3.test4localhost/sites1/sites2/sites3/ \'';
$subst = 'src="$1"';
var_export(preg_replace($re, $subst, $str));
Output
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/"
src="/sites1/sites2/sites3/"
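Alternatively, parse the HTML with DOMDocument and rewrite each src attribute:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xp = new DOMXPath($dom);
// Select every src attribute in the document.
foreach($xp->query('//@src') as $attr) {
$url = parse_url($attr->nodeValue);
// Skip values that parse_url rejects or that have no http(s) scheme.
if ( $url === false || !isset($url['scheme']) || stripos($url['scheme'], 'http') !== 0 )
continue;
// Rebuild the URL without scheme and host; fall back to "/" when there is no path.
$src = ( isset($url['path']) ? $url['path'] : '/' )
. ( isset($url['query']) ? '?' . $url['query'] : '' )
. ( isset($url['fragment']) ? '#' . $url['fragment'] : '' );
$attr->parentNode->setAttribute('src', $src);
}
$result = $dom->saveHTML();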
I added an if condition to skip cases where it isn't possible to tell whether the beginning of the src attribute is a domain or the beginning of the path. Depending on what you are trying to do, you can remove this test.
If you are working with parts of an HTML document (i.e. not a full document), you have to replace $result = $dom->saveHTML() with something like:
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
I want to be able to extract the tag names and values of queries.
Given the following query:
title:(Harry Potter) abc def author:'John' rating:5 jhi cost:"2.20" lmnop qrs
I want to be able to extract the following information:
title => Harry Potter
author => John
rating => 5
cost => 2.20
rest => abc def jhi lmnop qrs
Note that a tag value can be enclosed in '...', "..." or (...). It doesn't matter which.
I currently handle this with the following:
$query = "..."; // User input
while (preg_match(
'#(?P<key>title|author|rating|cost):(?P<value>[^\'"(\s]+)#',
$query,
$matches
)) {
echo $matches['key'] . " => " . $matches['value'];
$query = trim(str_replace($matches[0], '', $query));
}
while (preg_match(
'#(?P<key>title|author|rating|cost):[\'"(](?P<value>[^\'")]+)[\'")]#',
$query,
$matches
)) {
echo $matches['key'] . " => " . $matches['value'];
$query = trim(str_replace($matches[0], '', $query));
}
Now this is okay for many cases. However, there are quite a few corner cases:
1) For example consider:
title:(John's) abc
should go to:
title => John's
rest => abc
but instead goes to
title => (John'
rest => s) abc
2) Also consider:
title: (foo (: bar)
should go to:
title => foo (: bar
goes to:
rest => (foo (bar)
How can I do this? Is regex even the best way to go? How else can I solve this issue?
UPDATE Fixed a mistake on one of the expected outputs
It's not possible to parse everything exactly with one regex the way you do, because you don't have the same rule for all your (key, value) pairs. A closing parenthesis, for instance, would be accepted in the middle of the author tag but not in the middle of title. A single quote would be accepted in the middle of title but not in the middle of author, and so on. So, even if your rule works in most cases, your second capture group cannot be properly defined.
One way to improve your solution would be to use a different regular expression for each tag. You could then do something like this:
$str = "title:(foo (: bar) abc def ".
"author:'John' " .
"rating:5 jhi " .
"cost:\"2.20\"" .
"lmnop qrs ";
$regex = array(
"title" => "/(?P<key>title):[[:space:]]*\((?P<value>[^\)]*)\)/" ,
"author" => "/(?P<key>author):[[:space:]]*'(?P<value>[^']*)'/" ,
"rating" => "/(?P<key>rating):[[:space:]]*(?P<value>[\d]+)/" ,
"cost" => "/(?P<key>cost):[[:space:]]*\"(?P<value>[\d]+\.[\d]{2})\"/"
);
foreach($regex as $k => $r)
{
if(preg_match($r, $str, $matches))
{
echo $matches['key'] . " => " . $matches['value'] . "\n";
}
else
{
echo "Nothing found for " . $k . "\n";
}
}
However, note that this solution is not bulletproof. For example, you'll have a problem if the title of a book contains the string author: 'JOHN'.
In my opinion, the best way to avoid such issues is to define a grammar for your input string and to reject all strings that don't match it. It also depends on your requirements and on your application, of course.
Edit
Note that a tag value can be enclosed in '...', "..." or (...). It doesn't matter which.
In that case, your problem is still that
[\'\"\(](?P<value>[^\'\"\)]+)[\'\"\)]
is incorrect, because it lets any opening delimiter pair with any closing one. Instead, you want each pair of delimiters to match. A branch-reset group, (?|...), lets every alternative capture into the same group:
(?|\'(?P<value>[^\']+)\'|\"(?P<value>[^\"]+)\"|\((?P<value>[^\)]+)\))
If you use \ as the escape character, the code becomes
$str = 'title:"foo \" bar" abc def '.
'author:(Joh\)n) ' .
'rating:\'5\\\'4\' jhi ' .
'cost:"2.20"' .
'lmnop qrs ';
$regex = "/(?P<key>title|author|rating|cost):[[:space:]]*" .
"(?|" .
"\"(?P<value>(?:(?:\\\\\")|[^\"])+)\"" . "|" . // matches "..."
"\'(?P<value>(?:(?:\\\\\')|[^\'])+)\'" . "|" . // matches '...'
"\((?P<value>(?:(?:\\\\\))|[^\)])+)\)" . // matches (...)
")/"; // close (?|...
while(preg_match($regex, $str, $matches))
{
echo $matches['key'] . " => " . $matches['value'] . "\n";
$str = str_replace($matches[0], '', $str);
}
Output
title => foo \" bar
author => Joh\)n
rating => 5\'4
cost => 2.20
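To also collect the untagged words as rest, the same branch-reset idea can be combined with preg_match_all and preg_replace. The following is a minimal sketch, not a full parser: the function name parseQuery and the returned array layout are assumptions, escaped delimiters are not handled, and unquoted values such as rating:5 are accepted as a fourth alternative.

```php
<?php
// Hypothetical helper; the name parseQuery and the array layout are assumptions.
function parseQuery(string $query): array
{
    // Group 1 is the key. The branch-reset group (?|...) makes every
    // alternative ("...", '...', (...) or a bare word) capture into group 2.
    $pattern = '~\b(title|author|rating|cost):\s*'
             . '(?|"([^"]*)"|\'([^\']*)\'|\(([^)]*)\)|([^\s"\'(]+))~';

    $tags = [];
    preg_match_all($pattern, $query, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $tags[$m[1]] = $m[2];
    }

    // Whatever remains after removing the tagged pairs is the free text.
    $rest = preg_replace($pattern, ' ', $query);
    $tags['rest'] = trim(preg_replace('~\s+~', ' ', $rest));

    return $tags;
}

print_r(parseQuery(
    'title:(Harry Potter) abc def author:\'John\' rating:5 jhi cost:"2.20" lmnop qrs'
));
```

Because each opening delimiter is tied to its own closing delimiter, title:(John's) abc yields the title John's and the rest abc.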
I have a huge online set of PHP files that is generated dynamically.
It contains links, including some invalid ones with quotes (made with FrontPage):
index2.php?page=xd
index2.php?page=xj asdfa
index2.php?page=xj%20aas
index2.php?page=xj#jumpword
index2.php?page=gj#jumpword with spaces that arenot%20
index2.php?page=afdsdj#jumpword%20with
index2.php?page=xj#jumpword with "quotes" iknow
$input_lines=preg_replace("/(index2.php?page\=.*)(#[a-zA-Z0-9_ \\"]*)(\"\>)/U", "$0 --> $2", $input_lines);
I want each of those links reduced to just the # part, without the index2.php?page=* part.
I could not get this to work after a whole evening of trying, so please help.
In some cases, you can use parse_url to get components of the URL (e.g. what comes after the #), like so:
$urls = array(
'index2.php?page=xd',
'index2.php?page=xj asdfa',
'index2.php?page=xj%20aas',
'index2.php?page=xj#jumpword',
'index2.php?page=gj#jumpword with spaces that arenot%20',
'index2.php?page=afdsdj#jumpword%20with',
'index2.php?page=xj#jumpword with "quotes" iknow',
);
foreach($urls as $url){
echo 'For "' . $url . '": ';
$parsed = parse_url($url);
echo isset($parsed['fragment']) ? $parsed['fragment'] : 'DID NOT WORK';
echo '<br>';
}
Output:
For "index2.php?page=xd": DID NOT WORK
For "index2.php?page=xj asdfa": DID NOT WORK
For "index2.php?page=xj%20aas": DID NOT WORK
For "index2.php?page=xj#jumpword": jumpword
For "index2.php?page=gj#jumpword with spaces that arenot%20": jumpword with spaces that arenot%20
For "index2.php?page=afdsdj#jumpword%20with": jumpword%20with
For "index2.php?page=xj#jumpword with "quotes" iknow": jumpword with "quotes" iknow
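If only the # part is needed, plain string functions are another option. The sketch below assumes that a link without a # should be left unchanged (the question does not say what should happen in that case); the function name fragmentOnly is hypothetical.

```php
<?php
// Hypothetical helper; the name fragmentOnly is an assumption for this sketch.
function fragmentOnly(string $url): string
{
    // Find the first "#"; everything from there on is the fragment part.
    $pos = strpos($url, '#');

    // Assumption: links without a fragment are returned unchanged.
    return $pos === false ? $url : substr($url, $pos);
}

echo fragmentOnly('index2.php?page=xj#jumpword with "quotes" iknow'), "\n";
echo fragmentOnly('index2.php?page=xd'), "\n";
```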
I have the following code to turn URLs in a message into HTML links:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
"\\0", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
"\\1\\2", $message);
It works very well with almost all links, except in the following cases:
1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide
The problem here is the # and the : within the link: only part of the link is transformed.
2) If someone just writes "www" in a message
Example: www
Is there any way to fix these two cases in the code above?
Since you want to include the hash (#) in the character class, you need to change the delimiter to a character that does not appear in your regex, e.g. !. Your regex would then look like this:
$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"\\0", $message);
Does this help?
However, if you would like to follow the specification (RFC 1738) more closely, note that % may only appear in a URL as the start of a percent-encoded escape, so you might want to drop it from the character class. There are also some more allowed characters which you didn't include:
$
_
. (dot)
+
!
*
'
(
)
If you include these characters, you should then use % as the regex delimiter.
A couple of minor tweaks: add \# and : to the first regex, then change the * to + in the second regex:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
"\\0", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
"\\1\\2", $message);
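Put together, the two tweaks might look like the sketch below. Several things here are assumptions rather than part of the code above: the function name linkify, the ~ delimiter (chosen so that # needs no escaping), the plain <a href> markup in the replacement strings, and the + quantifier and \b in the first pattern.

```php
<?php
// Hypothetical helper; the name linkify and the <a> markup are assumptions.
function linkify(string $message): string
{
    // "#" and ":" are now part of the character class, so the whole
    // mediathek link from the question is matched.
    $message = preg_replace(
        '~\b(?:https?|ftps?)://(?:[.]?[&;%=a-zA-Z0-9_/?#:-])+~',
        '<a href="$0">$0</a>',
        $message
    );

    // "+" instead of "*": a bare "www" with nothing after it is left alone.
    return preg_replace(
        '~(^| |\n)(www(?:[.]?[&;%=a-zA-Z0-9_/?-])+)~',
        '$1<a href="http://$2">$2</a>',
        $message
    );
}

echo linkify('see http://example.com/mediathek#/video/1976914/zoom:-World-Wide now'), "\n";
echo linkify('www'), "\n";
```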
In my opinion, trying to solve this with a regex alone is futile. A good alternative is to use a regex only to find what could be a URL (something that begins with a protocol such as http or ftp, or with www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not foolproof either, as the PHP manual says:
"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."
Example of code (not tested):
$message = preg_replace_callback(
'~(?(DEFINE)
(?<prot> (?>ht|f) tps?+ :// ) # you can add protocols here
)
(?>
<a\b (?> [^<]++ | < (?!/a>) )++ </a> # skip existing links
|
<[^>]++> # and other tags with their attributes
) (*SKIP)(?!) # force these alternatives to fail
| # OR
\b (?> (?<scheme> \g<prot> ) | www\. ) \S++ # something that begins with
# "http://" or "www."
~xi',
function ($match) {
// Prepend a scheme to bare "www." matches before validating the full URL.
$url = (empty($match['scheme']) ? 'http://' : '') . $match[0];
if (filter_var($url, FILTER_VALIDATE_URL)) {
return '<a href="away?to=' . $url . '" target="_blank">'
. $url . '</a>';
}
// Not a valid URL: leave the text untouched.
return $match[0];
},
$message);
I am using PHP to add and remove static pages. Once a page has been deleted, I want to remove its rule from the .htaccess file. I've tried the following, but it throws an error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '\' in ...
The code:
$page_name = $row['page_name']; // Example: help
preg_replace('/RewriteRule ' . preg_quote('^' . $page_name . '/?$ page.php?mode=') . '.*/i', '', $htaccess);
This is an example of what it should fully remove:
RewriteRule ^help/?$ page.php?mode=help
You have to escape the expression delimiter by passing it to preg_quote as the second argument, and assign the result of preg_replace back to the variable:
$htaccess = preg_replace('/RewriteRule ' . preg_quote('^' . $page_name . '/?$ page.php?mode=', '/') . '.*/i', '', $htaccess);
Otherwise your / won't be escaped, because it is not one of the characters preg_quote escapes by default. As stated in the documentation, "the special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -"
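Wrapped in a function, the fix might look like the sketch below. The function name removeRewriteRule is hypothetical, and the m modifier, the ^ anchor and the trailing \R? (which removes the line break together with the rule) are additions assumed here, not part of the answer above.

```php
<?php
// Hypothetical helper; the name removeRewriteRule is an assumption.
function removeRewriteRule(string $htaccess, string $pageName): string
{
    // Passing '/' as the second argument makes preg_quote escape the
    // delimiter along with the usual special characters.
    $rule = preg_quote('^' . $pageName . '/?$ page.php?mode=', '/');

    // m: ^ matches at the start of each line; \R? also removes the line break.
    return preg_replace('/^RewriteRule ' . $rule . '.*\R?/mi', '', $htaccess);
}

$htaccess = "RewriteEngine On\n"
          . 'RewriteRule ^help/?$ page.php?mode=help' . "\n"
          . 'RewriteRule ^about/?$ page.php?mode=about' . "\n";
echo removeRewriteRule($htaccess, 'help');
```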
Use a different delimiter, like this:
preg_replace("~pattern to replace~msi", '', $htaccess);
Also, it is good practice to process the file line by line rather than changing the whole text at once:
foreach (file('.htaccess') as $line)
{
// replace in each line
}
Then write out all the lines, and store a copy of the old .htaccess.
,Arsen
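The line-by-line idea could be sketched as follows. The function name dropRuleLines is hypothetical, and the file paths in the usage comment are assumptions; the function itself only works on an array of lines, so it can be tried without touching a real .htaccess.

```php
<?php
// Hypothetical helper; the name dropRuleLines is an assumption for this sketch.
function dropRuleLines(array $lines, string $pageName): array
{
    $prefix = 'RewriteRule ^' . $pageName . '/?$ page.php?mode=';
    $kept = [];
    foreach ($lines as $line) {
        // Keep every line that does not start with the rule for this page.
        if (stripos(ltrim($line), $prefix) !== 0) {
            $kept[] = $line;
        }
    }
    return $kept;
}

// Possible usage (the paths are assumptions):
// copy('.htaccess', '.htaccess.bak');                  // keep the old file
// $lines = dropRuleLines(file('.htaccess'), 'help');   // file() keeps line endings
// file_put_contents('.htaccess', implode('', $lines));
```

No regular expression is involved here, so there is no delimiter to escape.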