preg_match basics question - php

Got some trouble with my preg_match.
The code.
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";
$regex_string = '/(tel|Tel|TEL)[\s|:]+(.+)[\.|\n]/';
preg_match($regex_string , $text, $match);
And I get this result in $match[2]
"012 213 123. mobil: 023 123 123"
First question.
I want the regex to stop at the . (dot), but it doesn't.
Can someone explain why it doesn't?
Second question.
preg_match uses () to get their match.
Is it possible to skip the parentheses surrounding the different "Tel" and still get the same functionality?
Thnx all stackoverflow is great :D

This should do:
/tel(?:\s|:)+([^.]+)(?:\.|$)/i
+ is a greedy quantifier, which means it'll match as many characters as possible, so (.+) runs past the first dot for as long as the rest of the pattern can still match.
To your second question: in this particular case you just need a case-insensitive match (the i flag). More generally, you can use the non-capturing (?:...) syntax, as at the end of the pattern above. Square brackets are for character classes, not for grouping.
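A minimal sketch of that pattern applied to the $text from the question (the variable names are simply reused from the question):

```php
<?php
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";

// [^.]+ cannot cross a dot, so the capture ends before the first ".".
// The i flag replaces the (tel|Tel|TEL) alternation.
preg_match('/tel(?:\s|:)+([^.]+)(?:\.|$)/i', $text, $match);

echo $match[1], "\n"; // 012 213 123
```

Because both (?:...) groups are non-capturing, the number ends up in $match[1] rather than $match[2].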

If you're simply trying to extract a phone number out of that line, and it's guaranteed to be 11 digits, you could simply use this:
$text = 'tel: 012 213 123. mobil: 0303 11234';
$phone_number = substr(preg_replace('/[^\d]/', '', $text), 0, 11);
With your example, $phone_number would be 01221312303. (The sample number has only nine digits, so the last two digits come from the mobile number.)
How this works: every non-digit is replaced with an empty string, which removes it, and then the first 11 digits are returned.

No idea - your regex works for me (I get "012 213 123" in $match[2] with your code). The fact that the mobile phone differs between the two might indicate that it's not really the output of your code; check again.
Some other things - if you happen to have more dots in the line ("tel: xxx. phone: xxx. fax: xxx" for example), you will get bad results - use non-greedy operators ("get least chunk that matches" .*? instead of "get biggest chunk that matches" .*) or limit the repeated characters ("any number of non-periods" [^.]*). Also, you could spare yourself the trouble by making the regex case-insensitive (unless you really hate people typing "tEl").
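To make the difference concrete, here is a small sketch on a made-up line with several dots (the input string and variable names are hypothetical, not from the question):

```php
<?php
// Hypothetical input with several dots on one line.
$line = 'tel: 111. phone: 222. fax: 333.';

preg_match('/tel[\s:]+(.+)\./i', $line, $greedy);   // .+ runs to the last dot
preg_match('/tel[\s:]+(.+?)\./i', $line, $lazy);    // .+? stops at the first dot
preg_match('/tel[\s:]+([^.]*)/i', $line, $negated); // [^.]* cannot cross a dot

echo $greedy[1], "\n";  // 111. phone: 222. fax: 333
echo $lazy[1], "\n";    // 111
echo $negated[1], "\n"; // 111
```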
Your other question: (?:stuff) will match "stuff" just like (stuff), but will not capture it.
Useful link: http://www.regular-expressions.info/

Why do you have pipes in your character classes [\.|\n] and [\s|:]? Character classes (stuff in square brackets []) are by definition like an OR relationship, so you don't need the pipe... unless you really are trying to match pipe |.
As for question #1, I'm not sure what's causing your problem, but usually this has to do with greedy quantifiers. The + in (.+) is greedy, so it matches as much as it can while still letting the rest of the pattern match. Since a period . matches any character other than a newline, it can match a literal period too, and so it runs past the first one. To make a quantifier non-greedy, put a question mark ? after it.
For your second question: regex uses parentheses both to group things and to store them. If you want to group (tel|Tel|TEL) but not store it in $match, put ?: right after the opening parenthesis:
(?:tel|Tel|TEL)
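A short sketch of how the group type changes what lands in the match array (the sample text is a shortened version of the one in the question, and the variable names are illustrative):

```php
<?php
$text = 'Tel: 012 213 123. mobil: 0303 11234';

// Capturing group: the matched "Tel" takes slot 1, the number slot 2.
preg_match('/(tel|Tel|TEL)[\s:]+([^.]+)/', $text, $capturing);

// Non-capturing group: same alternation, but the number moves to slot 1.
preg_match('/(?:tel|Tel|TEL)[\s:]+([^.]+)/', $text, $nonCapturing);

print_r($capturing);    // [0] full match, [1] Tel, [2] 012 213 123
print_r($nonCapturing); // [0] full match, [1] 012 213 123
```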

Do you mean you want to match only the number, so you don't have to strip off the tel: and the dot? Try this:
/tel[:\s]+\K[^.]+/i
The i makes it case-insensitive.
[:\s] matches a colon or whitespace (the | doesn't mean "or" in a character class, it just matches a |).
[^.]+ matches one or more non-dots; it stops matching when it sees a dot or the end of the line, so you don't have to match the dot if you don't want it in the result.
Finally, \K means "forget about whatever you've matched so far and pretend the match really started here"--a little gem of a feature that's only available in Perl and PHP (that I know of).
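Here is a small sketch of the \K pattern on the $text from the question, plus a quick check of the point about | inside a character class (the two single-character checks are illustrative additions):

```php
<?php
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";

// \K drops "tel: " from the reported match, so no capture group is needed.
preg_match('/tel[:\s]+\K[^.]+/i', $text, $match);
echo $match[0], "\n"; // 012 213 123

// Inside a character class, | is just a literal pipe.
$withPipe = preg_match('/^[\s|:]$/', '|');    // 1: the class contains |
$withoutPipe = preg_match('/^[\s:]$/', '|');  // 0: the class does not
var_dump($withPipe, $withoutPipe);
```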

Related

PHP Regular Expression - Extract Data

I have a long string, and am trying to extract specific data that is delimited in that string by specific words.
For example, here is a subset of the string:
Current Owner 123 Capital Calculated
I am looking to extract
123 Capital
and as you can see it is surrounded by "Current Owner" (with a bunch of arbitrary spaces) to the left and "Calculated" (again with arbitrary spaces) to the right.
I tried this, but I'm a bit new at RegEx. Can anyone help me create a more effective RegEx?
preg_match("/Owner[.+]Calculated/",$inputString,$owner);
Thanks!
A character class defines a set of characters; it says "match one character from this set", so [.+] matches a single literal . or +. Place the dot . and the quantifier inside a capturing group instead, and enable the s modifier, which lets the dot match newlines as well.
preg_match('/Owner(.+?)Calculated/s', $inputString, $owner);
echo trim($owner[1]);
Note: + is a greedy operator, meaning it will match as much as it can and still allow the remainder of the regex to match. Use +? instead to prevent greediness meaning "one or more — preferably as few as possible".
You can use lookarounds as
(?<=Owner)\s*.*?(?=\s+Calculated)
Example usage
$str = "Current Owner 123 Capital Calculated ";
preg_match("/(?<=Owner)\s*.*?(?=\s+Calculated)/", $str, $matches);
print_r($matches);
Will give an output
Array ( [0] => 123 Capital )
Hope this helps, group index #1 is your target:
Owner\s+(\d+\s+\w+)\s+Calculated
You may also want to try a tool like RegExr to help you learn/tinker.
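A minimal usage sketch for the Owner\s+(\d+\s+\w+)\s+Calculated pattern, assuming the real input has runs of spaces around the value (the sample string below is made up to reflect that):

```php
<?php
// Assumed input: the question's sample with runs of spaces around the value.
$str = 'Current Owner     123 Capital      Calculated';

// \s+ absorbs the padding on both sides; group 1 holds the value itself.
if (preg_match('/Owner\s+(\d+\s+\w+)\s+Calculated/', $str, $m)) {
    echo $m[1], "\n"; // 123 Capital
}
```

Note that this pattern only fits values shaped like "number, whitespace, one word"; for free-form values the lazy (.+?) variant with trim() is the more forgiving choice.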

PHP regex replacement doesn't match

I'm using this regex to get the house number of a street address.
[a-zA-ZßäöüÄÖÜ .]*(?=[0-9])
Usually, the street is something like "Ohmstraße 2a". At regexpal.com my pattern matches, but I guess the regex engine behind preg_replace() isn't identical to the one used there.
$num = preg_replace("/[a-zA-ZßäöüÄÖÜ .]*(?=[0-9])/", "", $num);
Update:
It seems that my pattern matches, but I've got some encoding problems with the special chars like äöü
Update #2:
Turns out to be an encoding problem with mysqli.
First of all if you want to get the house number then you should not replace it. So instead of preg_replace use preg_match.
I modified your regex a little bit to match better:
$street = 'Öhmsträße 2a';
if (preg_match('/\s+(\d+[a-z]?)$/i', trim($street), $matches) !== 0) {
    var_dump($matches);
} else {
    echo 'no house number';
}
\s+ matches one or more space chars (blanks, tabs, etc.)
(...) defines a capture group which can be accessed in $matches
\d+ matches one or more digits (2, 23, 235, ...)
[a-z] matches one character from a to z
? means it's optional (not every house number has a letter in it)
$ means end of string, so it makes sure the house number is at the end of the string
Make sure you strip any spaces after the end of the house number with trim().
The u modifier can help sometimes for handling "extra" characters.
I feel this may be a character set or UTF-8 issue.
It would be a good idea to find out what version of PHP you're running too, although the u modifier itself is old: per the PHP manual it has been available since PHP 4.1.0 on Unix and 4.2.3 on Windows.
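A rough sketch of what the u modifier changes, assuming the script file itself is saved as UTF-8 (the ISO-8859-1 byte string is a made-up stand-in for data arriving in the wrong encoding, for example from a database connection):

```php
<?php
// Assumes this file is saved as UTF-8, so the literals below are UTF-8 too.
$pattern = '/[a-zA-ZßäöüÄÖÜ .]*(?=[0-9])/u';

// With u, pattern and subject are treated as UTF-8 characters, not bytes.
$num = preg_replace($pattern, '', 'Öhmsträße 2a');
echo $num, "\n"; // 2a

// The same street as ISO-8859-1 bytes is not valid UTF-8, so the call fails.
$latin1 = "\xD6hmstr\xE4\xDFe 2a";
$result = preg_replace($pattern, '', $latin1);
var_dump($result === null);                          // true
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // true
```

Checking preg_last_error() like this is one way to tell "the pattern did not match" apart from "the input was in the wrong encoding".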

how can i write this regex? ungreedy related

I'm sorry for the poor title, but it is a very generic question
I have to match this pattern
;AAAAAAA(BBBBBB,CCCCC,DDDDDD)
AAAAA = all characters starting from ";" to "(" (both ;( not included)
BBBBB = all characters starting from "(" to "," (both (, not included)
CCCCC = all characters starting from "," to "," (both ,, not included)
DDDDD = all characters starting from "," to ")" (both ,) not included)
The "all characters between x and y" is a problem that kills me every time
:(
I'm using PHP and I have to match all occurrences of this pattern (preg_match_all) that also, sadly, can be on multiple lines
Thank you in advance!
I would recommend you do not use an ungreedy quantifier, but instead make all repetitions mutually exclusive with their delimiters. What does this mean? It means, for instance, that A can be any character except (. Giving this regex:
;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]
Where the last [)] is not even necessary.
The PHP code would then look like this:
preg_match_all('/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $matches);
$fullMatches = $matches[0];
$arrayOfAs = $matches[1];
$arrayOfBs = $matches[2];
$arrayOfCs = $matches[3];
$arrayOfDs = $matches[4];
As the comments show, my escaping technique is a matter of taste. This regex is of course equal to:
;([^(]*)\(([^,]*),([^,]*),([^)]*)\)
But I think that looks a lot more mismatched/unbalanced than the other variant. Take your pick!
Finally, for the question why this approach would be better than using ungreedy (lazy) quantifiers. Here is some good, general reading. Basically, when you use ungreedy quantifiers, the engine still has to backtrack. It tries one repetition first, then notices that ( after that doesn't match. So it has to go back into the repetition and consume another character. But then the ( still doesn't match, so back to the repetition again. With this approach however, the engine will consume as much as possible, when going into the repetition for the first time. And when all non-( characters are consumed, then the engine will be able to match the following ( right away.
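As a quick check that the negated-class pattern copes with records spread over several lines, here is a sketch on a made-up input (the sample data is hypothetical):

```php
<?php
// Hypothetical input: the second record is split across two lines.
$input = ";foo(a,b,c)\n;bar(1,\n2,3)";

// Negated classes such as [^,]* also match newlines, so no s modifier is needed.
preg_match_all('/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $matches);

print_r($matches[1]); // the A parts: foo, bar
print_r($matches[4]); // the D parts: c, 3
```

The captured parts keep any line breaks they contain (the second C part here is "\n2"), so trim() may be needed afterwards.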
You could use something like this code:
preg_match_all('/;(.*?)\((.*?),(.*?),(.*?)\)/s',$text,$matches);
See it on ideone.com.
Basically, you can use .*? (question mark being ungreedy), make sure to escape the parentheses, and you may need the s modifier to have it work on multiple lines.
Variables would be in an array: $matches

Regex question mark

To match a string with pattern like:
-TEXT-someMore-String
To get -TEXT-, I came to know that this works:
/-(.+?)-/ // -TEXT-
As far as I know, ? makes the preceding token optional, as in:
colou?r matches both colour and color
I initially put in regex to get -TEXT- part like this:
/-(.+)-/
But it gave -TEXT-someMore-.
How does adding ? make the regex stop at -TEXT-? I thought it only made the preceding token optional, rather than making the match stop at a certain point as in the example above.
As you say, ? sometimes means "zero or one", but in your regex +? is a single unit meaning "one or more — and preferably as few as possible". (This is in contrast to bare +, which means "one or more — and preferably as many as possible".)
As the documentation puts it:
However, if a quantifier is followed by a question mark,
then it becomes lazy, and instead matches the minimum
number of times possible, so the pattern /\*.*?\*/
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in \d??\d which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.
Alternatively, you can use the U (ungreedy) modifier to make every quantifier in the pattern prefer the shortest possible match:
/-(.+)-/U
? after a token is shorthand for {0,1}, which means zero or one occurrence of that token.
But + is not a token, it is a quantifier: shorthand for {1,}, meaning one or more occurrences.
A ? after a quantifier puts it into non-greedy mode. In greedy mode it matches as much of the string as possible; in non-greedy mode it matches as little as possible.
Another, perhaps the underlying, error in your regex is that you match a run of arbitrary characters via .+?. What you really want is probably "any character except -", which you can get via [^-]+. In this case it doesn't matter whether the match is greedy or not: the repeated match will terminate as soon as it reaches the second "-" in your string.
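The variants discussed in this thread can be compared side by side; a small sketch on the string from the question (the variable names are illustrative):

```php
<?php
$subject = '-TEXT-someMore-String';

preg_match('/-(.+)-/', $subject, $greedy);     // as many characters as possible
preg_match('/-(.+?)-/', $subject, $lazy);      // as few characters as possible
preg_match('/-(.+)-/U', $subject, $ungreedy);  // U flips the default to lazy
preg_match('/-([^-]+)-/', $subject, $negated); // cannot cross a "-" at all

echo $greedy[0], "\n";   // -TEXT-someMore-
echo $lazy[0], "\n";     // -TEXT-
echo $ungreedy[0], "\n"; // -TEXT-
echo $negated[0], "\n";  // -TEXT-
```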

Matching ugly extra abbreviations and numbers in titles with PHP regex

I have to create regex to match ugly abbreviations and numbers. These can be one of following "formats":
1) [any alphabet char length of 1 char][0-9]
2) [double][whitespace][2-3 length of any alphabet char]
I tried to match double:
preg_match("/^-?(?:\d+|\d*\.\d+)$/", $source, $matches);
But I couldn't get it to match the following example: 1.1 AA My test title. What is wrong with my regex, and how can I add the other formats to it too?
In your regex you say: "start of string, followed by an optional -, followed by either at least one digit, or zero or more digits, a dot, and at least one digit, followed by the end of string".
So your regex could match, for example, 4.5, -.1, etc. This is exactly what you tell it to do.
Your test input string does not match, since other characters follow the number 1.1. And even if it somehow magically matched, your "double"-matching regex is wrong.
For a double without scientific notation you usually use this regex :
[-+]?\b[0-9]+(\.[0-9]+)?\b
Now that we have this out of our way we need a whitespace \s and
[2-3 length of alphabet]
Now I have no idea what [2-3 length of alphabet] means but by combining the above you get a regex like this :
[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]
You can also place anchors ^$ if you want the string to match entirely :
^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]$
Feel free to ask if you are stuck! :)
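One possible reading of the placeholder, sketched under the assumption that "[2-3 length of alphabet]" means two or three ASCII letters (that interpretation, the {2,3} quantifier and the closing \b are not part of the answer above):

```php
<?php
// Assumption: "[2-3 length of alphabet]" means two or three ASCII letters.
// The closing \b is an addition so a longer word such as "AAAA" is rejected.
// The $ anchor is left out because the title text follows the abbreviation.
$pattern = '/^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[A-Za-z]{2,3}\b/';

$source = '1.1 AA My test title';
if (preg_match($pattern, $source, $matches)) {
    echo $matches[0], "\n"; // 1.1 AA
}
```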
I see multiple issues with your regex:
You try to match the whole string (as a number) with the anchors: ^ at the beginning and $ at the end. If you don't want that, remove them.
The number group is non-capturing. It will be checked for matches, but those won't be added to $matches. That's because of the ?: at the start of (?:...). Remove ?: to make the group capturing.
You place the shorter digit pattern before the longer one. Alternatives are tried left to right, so if you swap the order, the regex engine tries the longer pattern first and, on success, prefers it over the shorter one.
Maybe this already solves your issue:
preg_match("/-?(\d*\.\d+|\d+)/", $source, $matches);
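The effect of the alternation order can be seen on the sample title from the question; a small sketch (the variable names are illustrative):

```php
<?php
$source = '1.1 AA My test title';

// Shorter alternative first: \d+ succeeds on "1" and the engine stops there.
preg_match('/-?(\d+|\d*\.\d+)/', $source, $shortFirst);

// Longer alternative first: \d*\.\d+ gets the chance to take "1.1".
preg_match('/-?(\d*\.\d+|\d+)/', $source, $longFirst);

echo $shortFirst[1], "\n"; // 1
echo $longFirst[1], "\n";  // 1.1
```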
