Regular Expression within URL - php

I've tried searching for the answer but it's a tough question to ask in the first place, here goes.
Consider the following URL(s):
hotels-london-kensington-5star
hotels-london-kensington-mayfair-5star
hotels-london-5star
the following would be returned using the correct regular expression:
london-kensington
london-kensington-mayfair
london
I'm trying to use regular expression to get the centre value(s) ONLY. I know the first and last words of the string in all cases i.e. 'hotels' (first) and '5star' (last).
UPDATE: I need to use regular expressions as the URL's are being routed through Codeigniters URI router. The centre part of the URI is dynamically built for search results.

Since you know the first and last words, you also know their lengths.
$out = substr($in,7,-6);
Or more generally:
$out = substr($in,strlen($begin),-strlen($end));
EDIT: If regex is required, just use /(?<=hotels-).*(?=-5star)/

For such simple cases you can use a explode()
$parts = explode('-', $string);
$wanted = array_slice($parts, 1, count($parts)-2);
Update: A regular expression. I still think it's easier to split the string into pieces manually.
~hotels-(.+)-5start~

Related

Capturing a pattern of unknown repitition in PCRE

This may be a quick question for experienced regular expressionists, but I'm having trouble getting my match to execute correctly.
Suppose I had a string that looked like this:
http://aaa-bbbb-cc-ddddd-eee-.sub.dom
I would like to go capture all of the "aaa", "bbbb", "cc", and "ddddd" substrings, but I'm not sure how many there will be (e.g., having all triplets up through "zzz").
This is the regular expression I'm trying to use right now:
/http:\/\/(\w*?\-)+\.sub\.dom/
I wrote it this way because:
I want to match substrings, but I want each to terminate when a - is parsed
I want to capture one or more of these substrings
But it seems to only be saving the last match that it makes (in the above case, it would only match "eee-".
Is there a good way to capture all of the matched substrings?
More information: I'm using PHP's PCRE function preg_replace_callback. Thanks!
No, it is not possible to match an unknown number of capture groups.
If you try to repeat a capture group, it will always contain the last value captured.
Could you explain a bit more broadly what you're trying to do? Perhaps there is another simple way to do it (possibly without regular expressions).
If you want the items in the subdomain, and then all matches between the dashes... This should work:
$string = "http://aaa-bbbb-cc-ddddd-eee-.sub.dom";
preg_match("/^http:\/\/([\w-]+?)\..*$/i", $string, $match);
$parts = explode('-', $match[1]);
print_r($parts);
Short of that you will probably have to build a small parsing script to parse the string yourself if that doesn't do it for you.

PHP Regex on URL - split into variables

I am trying to implement a php script which will run on every call to my site, look for a certain pattern of URL, then explode the URL and perform a redirect.
Basically I want to run this on a new CMS to catch all incoming links from the old CMS, and redirect, based on mapping, say an article id stripped form the URL to the same article ID imported into the new CMS's DB.
I can do the implementation, the redirect etc, but I am lost on the regex.
I need to catch any occurrences of:
domain.com/content/view/*/34/ or domain.com/content/view/*/30/ (where * is a wildcard) and capture * and the 30 or 34 in a variable which I will then use in a DB query.
If the following is encountered:
domain.com/content/view/*/34/1/*/
I need to capture the first * and the second *.
Be very grateful for anyone who can give me a hand on this.
I'm not sure regular expressions are the way to go. I think it would probably be easier to use explode ('/' , $url) and check by looping over that array.
Here are the steps I would follow:
$url = parse_url($url, PHP_URL_PATH);
$url = trim($url, '/');
$parts = explode ('/' , $url);
Then you can check if
($parts[0]=='content' && $parts[1]=='view' && $parts[3]=='34')
You can also easily get the information you want with $parts[2].
It's actually very simple, a more flexible and straightforward approach is to explode() the url into an array called something like $segments, and then test on there. If you have a very small number of expected URLs, then this kind of approach is probably easier to maintain and to read.
I wouldn't recommend doing this in the htaccess file because of the performance overhead.
First, I would use the PHP function parse_url() to get the path, devoid of any protocol or hostname.
Once you have that the following code should get you the info you need.
<?php
$url = 'http://domain.com/content/view/*/34/'; // first example
$url = 'http://domain.com/content/view/*/34/1/*/'; // second example
$url_array = parse_url($url);
$path = $url_array['path'];
// Match the URL against regular expressions
if (preg_match('/content\/view\/([^\/]+)\/([0-9]+)\//i', $path, $matches)){
print_r($matches);
}
if (preg_match('/content\/view\/([^\/]+)\/([0-9]+)\/([0-9]+)\/([^\/]+)/i', $path, $matches)){
print_r($matches);
}
?>
([^/]+) matches any sequence of characters except a forward slash
([0-9]+) matches any sequence of numbers
Though you can probably write a single regular expression to match most URL variants, consider using multiple regular expressions to check for different types of URLs. Depending on how much traffic you get, the speed hit won't be all that terrible.
Also, I recommend reading Mastering Regular Expressions by O'reilly. A good knowledge of regular expressions will come in handy quite often.
http://www.regular-expressions.info/php.html

Regular expression to extract from URI

I need a regular expression to extract from two types of URIs
http://example.com/path/to/page/?filter
http://example.com/path/to/?filter
Basically, in both cases I need to somehow isolate and return
/path/to
and
?filter
That is, both /path/to and filter is arbitrary. So I suppose I need 2 regular expressions for this? I am doing this in PHP but if someone could help me out with the regular expressions I can figure out the rest. Thanks for your time :)
EDIT: So just want to clearify, if for example
http://example.com/help/faq/?sort=latest
I want to get /help/faq and ?sort=latest
Another example
http://example.com/site/users/all/page/?filter=none&status=2
I want to get /site/users/all and ?filter=none&status=2. Note that I do not want to get the page!
Using parse_url might be easier and have fewer side-effects then regex:
$querystring = parse_url($url, PHP_URL_QUERY);
$path = parse_url($var, PHP_URL_PATH);
You could then use explode on the path to get the first two segments:
$segments = explode("/", $path);
Try this:
^http://[^/?#]+/([^/?#]+/[^/?#]+)[^?#]*\?([^#]*)
This will get you the first two URL path segments and query.
not tested but:
^https?://[^ /]+[^ ?]+.*
which should match http and https url with or without path, the second argument should match until the ? (from the ?filter for instance) and the .* any char except the \n.
Have you considered using explode() instead (http://nl2.php.net/manual/en/function.explode.php) ? The task seems simple enough for it. You would need 2 calls (one for the / and one for the ?) but it should be quite simple once you did that.

preg_match returning weird results

I am searching a string for urls...and my preg_match is giving me an incorrect amount of matches for my demo string.
String:
Hey there, come check out my site at www.example.com
Function:
preg_match("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#ise", $string, $links);
echo count($links);
The result comes out as 3.
Can anybody help me solve this? I'm new to REGEX.
$links is the array of sub matches:
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
The matches of the two groups plus the match of the full regular expression results in three array items.
Maybe you rather want all matches using preg_match_all.
If you use preg_match_pattern, (as Gumbo suggested), please note that if you run your regex against this string, it will both match the value of your anchor attribute "href" as well as the linked Text which in this case happens to comtain an url. This makes TWO matches.
It would be wise to run an array_unique on your resultset :)
In addition to the advice on how to use preg_match, I believe there is something seriously wrong with the regular expression you are using. You may want to trying something like this instead:
preg_match("_([a-zA-Z]+://)?([0-9a-zA-Z$-\_.+!*'(),]+\.)?([0-9a-zA-Z]+)+\.([a-zA-Z]+)_", $string, $links);
This should handle most cases (although it wouldn't work if there was a query string after the top-level domain). In the future, when writing regular expressions, I recommend the following web-sites to help: http://www.regular-expressions.info/ and especially http://regexpal.com/ for testing them as you're writing them.

Google Style Regular Expression Search

It's been several years since I have used regular expressions, and I was hoping I could get some help on something I'm working on. You know how google's search is quite powerful and will take stuff inside quotes as a literal phrase and things with a minus sign in front of them as not included.
Example: "this is literal" -donotfindme site:examplesite.com
This example would search for the phrase "this is literal" in sites that don't include the word donotfindme on the webiste examplesite.com.
Obviously I'm not looking for something as complex as Google I just wanted to reference where my project is heading.
Anyway, I first wanted to start with the basics which is the literal phrases inside quotes. With the help of another question on this site I was able to do the following:
(this is php)
$search = 'hello "this" is regular expressions';
$pattern = '/".*"/';
$regex = preg_match($pattern, $search, $matches);
print_r($matches);
But this outputs "this" instead of the desired this, and doesn't work at all for multiple phrases in quotes. Could someone lead me in the right direction?
I don't necessarily need code even a real nice place with tutorials would probably do the job.
Thanks!
Well, for this example at least, if you want to match only the text inside the quotes you'll need to use a capturing group. Write it like this:
$pattern = '/"(.*)"/';
and then $matches will be an array of length 2 that contains the text between the quotes in element 1. (It'll still contain the full text matched in element 0) In general, you can have more than one set of these parentheses; they're numbered from the left starting at 1, and there will be a corresponding element in $matches for the text that each group matched. Example:
$pattern = '/"([a-z]+) ([a-z]+) (.*)"/';
will select all quoted strings which have two lowercase words separated by a single space, followed by anything. Then $matches[1] will be the first word, $matches[2] the second word, and $matches[3] the "anything".
For finding multiple phrases, you'll need to pick out one at a time with preg_match(). There's an optional "offset" parameter you can pass, which indicates where in the string it should start searching, and to find multiple matches you should give the position right after the previous match as the offset. See the documentation for details.
You could also try searching Google for "regular expression tutorial" or something like that, there are plenty of good ones out there.
Sorry, but my php is a bit rusty, but this code will probably do what you request:
$search = 'hello "this" is regular expressions';
$pattern = '/"(.*)"/';
$regex = preg_match($pattern, $search, $matches);
print_r($matches[1]);
$matches1 will contain the 1st captured subexpression; $matches or $matches[0] contains the full matched patterns.
See preg_match in the PHP documentation for specifics about subexpressions.
I'm not quite sure what you mean by "multiple phrases in quotes", but if you're trying to match balanced quotes, it's a bit more involved and tricky to understand. I'd pick up a reference manual. I highly recommend Mastering Regular Expressions, by Jeffrey E. F. Friedl. It is, by far, the best aid to understanding and using regular expressions. It's also an excellent reference.
Here is the complete answer for all the sort of search terms (literal, minus, quotes,..) WITH replacements . (For google visitors at the least).
But maybe it should not be done with only regular expressions though.
Not only will it be hard for yourself or other developers to work and add functionality on what would be a huge and super complex regular expression otherwise
it might even be that it is faster with this approach.
It might still need a lot of improvement but at least here is a working complete solution in a class. There is a bit more in here than asked in the question, but it illustrates some reasons behind some choices.
class mySearchToSql extends mysqli {
protected function filter($what) {
if (isset(what) {
//echo '<pre>Search string: '.var_export($what,1).'</pre>';//debug
//Split into different desires
preg_match_all('/([^"\-\s]+)|(?:"([^"]+)")|-(\S+)/i',$what,$split);
//echo '<pre>'.var_export($split,1).'</pre>';//debug
//Surround with SQL
array_walk($split[1],'self::sur',array('`Field` LIKE "%','%"'));
array_walk($split[2],'self::sur',array('`Desc` REGEXP "[[:<:]]','[[:>:]]"'));
array_walk($split[3],'self::sur',array('`Desc` NOT LIKE "%','%"'));
//echo '<pre>'.var_export($split,1).'</pre>';//debug
//Add AND or OR
$this ->where($split[3])
->where(array_merge($split[1],$split[2]), true);
}
}
protected function sur(&$v,$k,$sur) {
if (!empty($v))
$v=$sur[0].$this->real_escape_string($v).$sur[1];
}
function where($s,$OR=false) {
if (empty($s)) return $this;
if (is_array($s)) {
$s=(array_filter($s));
if (empty($s)) return $this;
if($OR==true)
$this->W[]='('.implode(' OR ',$s).')';
else
$this->W[]='('.implode(' AND ',$s).')';
} else
$this->W[]=$s;
return $this;
}
function showSQL() {
echo $this->W? 'WHERE '. implode(L.' AND ',$this->W).L:'';
}
Thanks for all stackoverflow answers to get here!
You're in luck because I asked a similar question regarding string literals recently. You can find it here: Regex for managing escaped characters for items like string literals
I ended up using the following for searching for them and it worked perfectly:
(?<!\\)(?:\\\\)*(\"|')((?:\\.|(?!\1)[^\\])*)\1
This regex differs from the others as it properly handles escaped quotation marks inside the string.

Categories