preg_match returning weird results - php

I am searching a string for urls...and my preg_match is giving me an incorrect amount of matches for my demo string.
String:
Hey there, come check out my site at www.example.com
Function:
preg_match("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#ise", $string, $links);
echo count($links);
The result comes out as 3.
Can anybody help me solve this? I'm new to REGEX.

$links is the array of sub matches:
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
The matches of the two groups plus the match of the full regular expression results in three array items.
If you want every match in the string rather than just the first one, use preg_match_all instead.
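To make the counting concrete, here is a minimal sketch. The sample string and the simplified two-group pattern are assumptions for illustration, not the asker's original data.

```php
<?php
// Hypothetical sample text containing two URLs.
$string = "See http://www.example.com and http://example.org/page for details";
// Simplified pattern with two capture groups (scheme and the rest); an assumption.
$pattern = '#(\w+)://([^\s"<]+)#i';

// preg_match stops at the first match: one entry for the full match
// plus one entry per capture group.
preg_match($pattern, $string, $links);
echo count($links), "\n"; // 3

// preg_match_all returns the number of full matches; $all[0] lists them.
$count = preg_match_all($pattern, $string, $all);
echo $count, "\n"; // 2
print_r($all[0]);
```

So count($links) after preg_match reflects the number of groups in the pattern, not the number of URLs in the string.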

If you use preg_match_all (as Gumbo suggested), please note that if you run your regex against this string, it will match both the value of the anchor's "href" attribute and the link text, which in this case happens to contain a URL. This makes TWO matches.
It would be wise to run an array_unique on your resultset :)
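A rough sketch of that deduplication, assuming a made-up piece of markup in which the same URL is both the href value and the link text (the pattern is simplified as well):

```php
<?php
// Hypothetical markup: the URL appears in the href and again as link text.
$html = '<a href="http://www.example.com">http://www.example.com</a>';
// Simplified URL pattern; an assumption for illustration.
$pattern = '#\w+://[^\s"<]+#i';

preg_match_all($pattern, $html, $found);
echo count($found[0]), "\n"; // 2, the same URL twice

// array_unique drops the duplicate; array_values reindexes the result.
$unique = array_values(array_unique($found[0]));
print_r($unique);
```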

In addition to the advice on how to use preg_match, I believe there is something seriously wrong with the regular expression you are using. You may want to try something like this instead:
preg_match("_([a-zA-Z]+://)?([0-9a-zA-Z$-\_.+!*'(),]+\.)?([0-9a-zA-Z]+)+\.([a-zA-Z]+)_", $string, $links);
This should handle most cases (although it wouldn't work if there was a query string after the top-level domain). In the future, when writing regular expressions, I recommend the following web-sites to help: http://www.regular-expressions.info/ and especially http://regexpal.com/ for testing them as you're writing them.

Related

preg_replace with Regex - find number-sequence in URL

I'm a regex-noobie, so sorry for this "simple" question:
I've got a URL like the following:
http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx
What I'm trying to achieve is getting the number sequence (aka Job-ID) right before the ".aspx" with preg_replace.
I've already figured out that the regex for finding it could be
(?!.*-).*(?=\.)
Now preg_replace needs the opposite of that regular expression. How can I achieve that? Also worth mentioning:
The URL can have multiple numbers in it. I only need the sequence right before ".aspx". Also, there could be some php attributes behind the ".aspx" like "&mobile=true"
Thank you for your answers!
You can use:
$re = '/[^-.]+(?=\.aspx)/i';
preg_match($re, $input, $matches);
//=> 146370543
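This matches a run of characters that are neither hyphens nor dots and that is immediately followed by .aspx. The lookahead (?=\.aspx) asserts that the extension is present without including it in the match.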
You can just use preg_match (you don't need preg_replace, as you don't want to change the original string) and capture the number before the .aspx, which is always at the end. The simplest way I could think of is:
<?php
$string = "http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx";
$regex = '/([0-9]+)\.aspx$/';
preg_match($regex, $string, $results);
print $results[1];
?>
A short explanation:
$results contains an array of results. Its first element holds the text matched by the complete regex, which is 146370543.aspx in this example. The second element contains the group captured by the parentheses around [0-9]+.
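The question mentions that parameters such as "&mobile=true" may follow ".aspx", in which case a pattern anchored with $ no longer matches. The following is only a sketch of one possible variant; the appended parameter in the sample URL and the variable names are assumptions.

```php
<?php
// The URL from the question with a hypothetical parameter appended.
$url = "http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx&mobile=true";

// Digits directly before ".aspx". The lookahead requires the extension to be
// followed by the end of the string or by one of ?, & or #.
$regex = '/(\d+)\.aspx(?=$|[?&#])/i';

$jobId = preg_match($regex, $url, $m) === 1 ? $m[1] : null;
var_dump($jobId);
```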
You can get the opposite by using this regex:
(\D*)\d+(.*)
Working demo
MATCH 1
1. [0-100] `http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-`
2. [109-114] `.aspx`
Even if you just want the number for that url you can use this regex:
(\d+)

using preg_match_all to find patterns, don't include pattern delimiter in matches

I'm matching patterns with reg_ex as in
$Structure = 'C:N:X:A:V:T:J:N:G:T:N:N:C:J:N:C:A:J:N:.:';
preg_match_all('/(T:|G:|L:|D:).*?(G:|i:|X:|\.:)/', $Structure, $arr, PREG_SET_ORDER);
the results I get are
T:J:N:G: , T:N:N:C:J:N:C:A:J:N:.:
How can I modify the query so that the delimiter (G:|i:|X:|.:) of the match is not included in the result, but is still available to start the next match? In other words, make the result look as below:
T:J:N: , G:T:N:N:C:J:N:C:A:J:N:
instead?
Is this possible?
Thanks
Yes, instead of making your 2nd capturing group consume the input, turn it into a positive lookahead:
/(T:|G:|L:|D:).*?(?=(?:G:|i:|X:|\.:))/
Now, instead of matching (and consuming) the delimiter, this:
(?=(?:G:|i:|X:|\.:))
States that the regex must assert that the delimiter is present from the current point forward, i.e. a positive lookahead.
This results in:
"T:J:N:, G:T:N:N:C:J:N:C:A:J:N:"
It is possible by lookaheads, with the following syntax:
(?=G:|i:|X:|\.:)
That will not consume the piece that matches the regex.
On a side note, the delimiter means the slashes that you have enclosing your regex and not the capturing group you have.
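For reference, a minimal sketch that runs the lookahead version against the string from the question. The outer group is made non-capturing here and PREG_SET_ORDER is dropped so that $arr[0] is a flat list of full matches; both are choices made for this illustration.

```php
<?php
$Structure = 'C:N:X:A:V:T:J:N:G:T:N:N:C:J:N:C:A:J:N:.:';

// The terminator sits inside a lookahead, so it is checked but not consumed
// and can begin the next match.
$pattern = '/(?:T:|G:|L:|D:).*?(?=G:|i:|X:|\.:)/';

preg_match_all($pattern, $Structure, $arr);
print_r($arr[0]);
```

Because "G:" is never consumed by the first match, the second match can start with it.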

Regular Expression separation

I'm trying to make a regular expression that will select only the first string of two strings.
IE:
hello:howareyou
I want the regex to return only hello.
Similarly, I would want another one to return howareyou, but I should be able to figure that out once I understand the first part.
Thank you!
EDIT:
So far I have tried (?:[^"<:]|"[^"]*"|<[^>]*)* but that merely splits the two.
You could simply use explode(':', $str), but if you insist on using a regular expression, you can do that as well with preg_match('/(.+?):(.+)/', $str, $matches), which will store the first part in $matches[1] and the second part in $matches[2].
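A short sketch comparing both approaches on the sample input. The limit argument of 2 passed to explode is an addition here, so that only the first colon splits the string:

```php
<?php
$str = "hello:howareyou";

// explode with a limit of 2 splits at the first colon only.
list($first, $second) = explode(':', $str, 2);

// The regex alternative: the lazy (.+?) stops at the first colon.
preg_match('/(.+?):(.+)/', $str, $matches);

echo $first, " ", $second, "\n";
echo $matches[1], " ", $matches[2], "\n";
```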

Capturing a pattern of unknown repetition in PCRE

This may be a quick question for experienced regular expressionists, but I'm having trouble getting my match to execute correctly.
Suppose I had a string that looked like this:
http://aaa-bbbb-cc-ddddd-eee-.sub.dom
I would like to capture all of the "aaa", "bbbb", "cc", and "ddddd" substrings, but I'm not sure how many there will be (e.g., having all triplets up through "zzz").
This is the regular expression I'm trying to use right now:
/http:\/\/(\w*?\-)+\.sub\.dom/
I wrote it this way because:
I want to match substrings, but I want each to terminate when a - is parsed
I want to capture one or more of these substrings
But it seems to only be saving the last match that it makes (in the above case, it would only match "eee-").
Is there a good way to capture all of the matched substrings?
More information: I'm using PHP's PCRE function preg_replace_callback. Thanks!
No, it is not possible to match an unknown number of capture groups.
If you try to repeat a capture group, it will always contain the last value captured.
Could you explain a bit more broadly what you're trying to do? Perhaps there is another simple way to do it (possibly without regular expressions).
If you want the items in the subdomain, and then all matches between the dashes... This should work:
$string = "http://aaa-bbbb-cc-ddddd-eee-.sub.dom";
preg_match("/^http:\/\/([\w-]+?)\..*$/i", $string, $match);
$parts = explode('-', $match[1]);
print_r($parts);
Short of that you will probably have to build a small parsing script to parse the string yourself if that doesn't do it for you.
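Another option, offered only as a sketch, is to let preg_match_all do the repetition instead of repeating a capture group inside one match. The pattern below is an assumption: it collects every dash-terminated segment that is still followed by ".sub.dom".

```php
<?php
$string = "http://aaa-bbbb-cc-ddddd-eee-.sub.dom";

// Each match is one word followed by a dash. The lookahead checks that the
// rest of the host still leads to ".sub.dom" without consuming it.
$pattern = '/(\w+)-(?=[\w-]*\.sub\.dom)/';

preg_match_all($pattern, $string, $m);
print_r($m[1]);
```

Each segment becomes its own match, so $m[1] holds all of them rather than only the last one.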

How can I use a match in the same regex in php?

I have this string (that is a serialized variable in php):
s:12:"hello "world";
and I want to find "hello "world" using only a regex. I tried this, but it seems to be stupid :P
(s:(?P<num>[0-9]+):".{\k{num}}";)
I only want to know how I can use the "num" result within the same regex.
this regex is used in a big regex so I can't check for end of string.
Thanks in advance!
You can use your named capturing groups as backreference like this
Back references to the named subpatterns can be achieved by (?P=name)
or, since PHP 5.2.2, also by \k<name> or \k'name'. Additionally PHP
5.2.4 added support for \k{name} and \g{name}.
According to php.net
But I think this can be used only to match the found text again, not as a number in a quantifier. (At least I didn't get it to work.)
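Since a captured number cannot serve as a quantifier, one workaround is to do it in two steps: match the length prefix with a regex, then cut the value out with substr. This is only a sketch under that assumption. The variable names are made up, and it does not cover the case where the pattern is embedded in a larger regex.

```php
<?php
$text = 's:12:"hello "world";';
$value = null;

// Step 1: match the prefix s:<length>:" and record the offsets.
if (preg_match('/s:(\d+):"/', $text, $m, PREG_OFFSET_CAPTURE)) {
    $length = (int) $m[1][0];
    $start = $m[0][1] + strlen($m[0][0]);

    // Step 2: take exactly $length bytes and require the closing "; after them.
    if (substr($text, $start + $length, 2) === '";') {
        $value = substr($text, $start, $length);
    }
}

var_dump($value);
```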
You can use preg_match function, which will populate an array of matches:
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
More information about preg_match: PHP: preg_match
$text = 's:12:"hello "world";s:12:"good bue world";';
$pattern = "(.*:[0-9]+:\"(.*)\";.*)U";
preg_match_all($pattern,$text,$r);
