PHP Regular Expression - Extract Data - php

I have a long string and am trying to extract specific data that is delimited in that string by specific words.
For example, here is a subset of the string:
Current Owner 123 Capital Calculated
I am looking to extract
123 Capital
and as you can see it is surrounded by "Current Owner" (with a bunch of arbitrary spaces) to the left and "Calculated" (again with arbitrary spaces) to the right.
I tried this, but I'm a bit new at RegEx. Can anyone help me create a more effective RegEx?
preg_match("/Owner[.+]Calculated/",$inputString,$owner);
Thanks!

A character class defines a set of characters and means "match one character from this set", so [.+] matches a single literal dot or plus sign. Put the dot and its quantifier inside a capturing group instead, and enable the s modifier so that the dot also matches newlines.
preg_match('/Owner(.+?)Calculated/s', $inputString, $owner);
echo trim($owner[1]);
Note: + is a greedy quantifier, meaning it matches as much as it can while still allowing the remainder of the regex to match. Use +? instead to make it lazy: "one or more, preferably as few as possible".
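To make the greedy/lazy difference visible, here is a minimal sketch. The input string is made up for illustration (it contains "Calculated" twice, which the original sample does not), so treat it as an assumption rather than the asker's real data.

```php
<?php
// Hypothetical input with two "Calculated" markers to expose the difference.
$inputString = "Current Owner   123 Capital   Calculated Tax 45 Calculated";

// Greedy: .+ runs on to the LAST "Calculated" in the string.
preg_match('/Owner(.+)Calculated/s', $inputString, $greedy);

// Lazy: .+? stops at the FIRST "Calculated".
preg_match('/Owner(.+?)Calculated/s', $inputString, $lazy);

echo trim($greedy[1]), "\n"; // 123 Capital   Calculated Tax 45
echo trim($lazy[1]), "\n";   // 123 Capital
```

With only one "Calculated" in the input both versions return the same text, so the lazy form mainly protects you when the marker can repeat.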

You can use lookarounds as
(?<=Owner)\s*.*?(?=\s+Calculated)
Example usage
$str = "Current Owner 123 Capital Calculated ";
preg_match("/(?<=Owner)\s*.*?(?=\s+Calculated)/", $str, $matches);
print_r($matches);
Will give an output (the match keeps the whitespace that follows "Owner", so apply trim() if you need it removed)
Array ( [0] => 123 Capital )

Hope this helps, group index #1 is your target:
Owner\s+(\d+\s+\w+)\s+Calculated
You may also want to try a tool like RegExr to help you learn/tinker.
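For completeness, a small sketch of how that pattern could be used from PHP. The variable names and the extra spaces in the sample are my own assumptions. Note that the pattern only fits values shaped like "digits, whitespace, one word".

```php
<?php
// Sample string with arbitrary runs of spaces (assumed, based on the question).
$inputString = "Current Owner    123 Capital     Calculated";

// preg_match returns 1 when the pattern matches.
if (preg_match('/Owner\s+(\d+\s+\w+)\s+Calculated/', $inputString, $m) === 1) {
    echo $m[1], "\n"; // 123 Capital
} else {
    echo "no match\n";
}
```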

Related

Regex match section within string

I have a string foo-foo-AB1234-foo-AB12345678. The string can be in any format, is there a way of matching only the following pattern letter,letter,digits 3-5 ?
I have the following implementation:
preg_match_all('/[A-Za-z]{2}[0-9]{3,6}/', $string, $matches);
Unfortunately this finds a match on AB1234 AND AB12345678 which has more than 6 digits. I only wish to find a match on AB1234 in this instance.
I tried:
preg_match_all('/^[A-Za-z]{2}[0-9]{3,6}$/', $string, $matches);
You will notice ^ and $ to mark the beginning and end, but this only applies to the string, not the section, therefore no match is found.
I understand why the code is behaving like it is. It makes logical sense. I can't figure out the solution though.
You must be looking for word boundaries \b:
\b\p{L}{2}\p{N}{3,5}\b
Note that \p{L} matches a Unicode letter, and \p{N} matches a Unicode number.
You can also use your own regex with word boundaries added: \b[a-zA-Z]{2}[0-9]{3,5}\b. Note that anchors make your regex match only at the beginning of the string (with ^) and/or at the end of the string (with $).
In case you have underscored words (like foo-foo_AB1234_foo_AB12345678_string), you will need a slight modification:
(?<=\b|_)\p{L}{2}\p{N}{3,5}(?=\b|_)
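Alternatively, end your regular expression with a pattern for a non-digit. In Java this would be \D, and PHP supports the same escape. Note that \D consumes a character, so it will not match when the digits sit at the very end of the string; the negative lookahead (?!\d) avoids that.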
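A quick sketch comparing the original pattern with the word-boundary version, using the string from the question. The variable names $loose and $strict are just for illustration.

```php
<?php
$string = 'foo-foo-AB1234-foo-AB12345678';

// Without boundaries: also picks up the front part of AB12345678.
preg_match_all('/[A-Za-z]{2}[0-9]{3,6}/', $string, $loose);

// With \b on both sides the whole token must be two letters plus 3-5 digits.
preg_match_all('/\b[A-Za-z]{2}[0-9]{3,5}\b/', $string, $strict);

print_r($loose[0]);  // AB1234 and AB123456
print_r($strict[0]); // AB1234 only
```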

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.
A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should find out which characters are valid in a subreddit name or a Reddit username and modify the [a-zA-Z0-9_-] character class accordingly.
You are looking for a regular expression.
A basic pattern starts out as a fixed string: /u/ or /r/, which would match those exactly. This can be simplified to match one or the other with /(?:u|r)/, which matches the same as those two patterns. Next you would want to match everything from that point up to a space. You would use a negated character class, [^ ], which matches any character that is not a space, and apply a quantifier, *, to match as many such characters as possible: /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space so you're not matching part of a longer path, which results in (?<= )/(?:u|r)/[^ ]*. (A match at the very start of the string has no preceding space, so it would be skipped.) You wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This captures the contents within the parentheses for use in a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In PHP you would pass the matching pattern (wrapped in delimiters other than /, such as #, so the slashes need no escaping), the replacement pattern, and the subject string to the preg_replace function.
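Putting the pieces of this explanation together, a minimal sketch might look like the following. The # delimiter and the variable names are my choice, not something the answer prescribes.

```php
<?php
$string = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';

// # is used as the delimiter so the slashes in the pattern need no escaping.
$pattern     = '#((?<= )/(?:u|r)/[^ ]*)#';
$replacement = '[\1](http://reddit.com\1)';

// preg_replace handles every match in one call.
$result = preg_replace($pattern, $replacement, $string);
echo $result, "\n";
// Contact [/u/someone](http://reddit.com/u/someone) on reddit, or visit
// [/r/subreddit](http://reddit.com/r/subreddit) or [/r/subreddit2](http://reddit.com/r/subreddit2)
// (printed on a single line)
```

Because [^ ]* takes everything up to the next space, trailing punctuation such as a comma right after a name would end up inside the link; a narrower character class avoids that.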
In my opinion regex would be overkill for such a simple operation. If you just want to replace instances of the literal "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you can use str_replace, although preg_replace needs less code once the names vary.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace. You can also use the regular expression generator at http://www.jslab.dk/tools.regex.php if you have something complex to capture in the string.

PHP regex replacement doesn't match

I'm using this regex to get the house number of a street address.
[a-zA-ZßäöüÄÖÜ .]*(?=[0-9])
Usually, the street is something like "Ohmstraße 2a". At regexpal.com my pattern matches, but I guess preg_replace() isn't identical to its regex engine.
$num = preg_replace("/[a-zA-ZßäöüÄÖÜ .]*(?=[0-9])/", "", $num);
Update:
It seems that my pattern matches, but I've got some encoding problems with the special chars like äöü
Update #2:
Turns out to be an encoding problem with mysqli.
First of all if you want to get the house number then you should not replace it. So instead of preg_replace use preg_match.
I modified your regex a little bit to match better:
$street = 'Öhmsträße 2a';
// preg_match returns 1 on a match, 0 on no match and false on error
if (preg_match('/\s+(\d+[a-z]?)$/i', trim($street), $matches) === 1) {
    var_dump($matches);
} else {
    echo 'no house number';
}
\s+ matches one or more space chars (blanks, tabs, etc.)
(...) defines a capture group which can be accessed in $matches
\d+ matches one or more digits (2, 23, 235, ...)
[a-z] matches one character from a to z (upper or lower case, because of the i modifier)
? means it's optional (not every house number has a letter in it)
$ means end of string, so it makes sure the house number is at the end of the string
Make sure you strip any spaces after the end of the house number with trim().
The u modifier, which makes PCRE treat the pattern and the subject as UTF-8, can help with handling "extra" characters such as ä, ö, ü and ß.
I feel this may be a character set or UTF-8 issue.
It would be a good idea to find out what version of PHP you're running too. If I recall correctly, full Unicode support came in around 5.1.x
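As a rough illustration of the u modifier, here is a sketch that splits street and house number using \p{L} (any Unicode letter) instead of listing the umlauts by hand. It assumes the script file and the input are both UTF-8; the pattern itself is my own variation, not the asker's.

```php
<?php
// Assumes this source file is saved as UTF-8.
$street = 'Öhmsträße 2a';

// \p{L} = any Unicode letter; u = read pattern and subject as UTF-8.
if (preg_match('/^([\p{L} .]+?)\s+(\d+[a-z]?)$/iu', $street, $m) === 1) {
    echo $m[1], "\n"; // Öhmsträße
    echo $m[2], "\n"; // 2a
} else {
    echo "no house number\n";
}
```

If the data arrives in another encoding (for example Latin-1 from a database connection), the u modifier will reject it, so fixing the connection charset comes first.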

Get word from string - PHP

I am trying to extract a word that matches a specific pattern from various strings.
The strings vary in length and content.
For example:
I want to extract any word that begins with jac from the following strings and populate an array with the full words:
I bought a jacket yesterday.
Jack is going home.
I want to go to Jacksonville.
The resulting array should be [jacket,Jack,Jacksonville]
I have been trying to use preg_match() but for some reason it won't work. Any suggestions???
$q = "jac";
$str = "jacket";
preg_match($q,$str,$matches);
print $matches[1];
This returns null :S. I dunno what the problem is.
You can use preg_match as:
preg_match("/\b(jac.+?)\b/i", $string, $matches);
You've got to read the manual a few hundred times and it will eventually come to you.
Otherwise, what you're trying to capture can be expressed as "look for 'jac' followed by 0 or more letters* and make sure it's not preceded by a letter" which gives you: /(?<!\\w)(jac\\w*)/i
Here's an example with preg_match_all() so that you can capture all the occurrences of the pattern, not just the first:
$q = "/(?<!\\w)(jac\\w*)/i";
$str = "I bought a jacket yesterday.
Jack is going home.
I want to go to Jacksonville.";
preg_match_all($q,$str,$matches);
print_r($matches[1]);
Note: by "letter" I mean any "word character." Officially, it includes numbers and other "word characters." Depending on the exact circumstances, one may prefer \w (word character) or \b (word boundary.)
You can include extra characters by using a character class. For instance, in order to match any word character as well as single quotes, you can use [\w'] and your regexp becomes:
$q = "/(?<!\\w)(jac[\\w']*)/i";
Alternatively, you can add an optional 's to your existing pattern, so that you capture "jac" followed by any number of word characters optionally followed by "'s"
$q = "/(?<!\\w)(jac\\w*(?:'s)?)/i";
Here, the ?: inside the parentheses means that you don't actually need to capture their content (because they're already inside a pair of parentheses, it's unnecessary), and the ? after the parentheses means that the match is optional.
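A short sketch of that last variant in action. The sample sentence with "Jack's" is invented to show the optional 's being picked up.

```php
<?php
// Hypothetical sample containing a possessive form.
$str = "I bought a jacket yesterday.\nJack's car is in Jacksonville.";

// (?:'s)? optionally extends the match by a trailing 's without adding a group.
$q = "/(?<!\\w)(jac\\w*(?:'s)?)/i";

preg_match_all($q, $str, $matches);
print_r($matches[1]); // jacket, Jack's, Jacksonville
```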

preg_match basics question

Got some trouble with my preg_match.
The code.
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";
$regex_string = '/(tel|Tel|TEL)[\s|:]+(.+)[\.|\n]/';
preg_match($regex_string , $text, $match);
And I get this result in $match[2]
"012 213 123. mobil: 023 123 123"
First question.
I want the regex to stop at the . (dot) but it doesn't.
Can someone explain why it doesn't?
Second question.
preg_match uses () to get their match.
Is it possible to skip the parentheses surrounding the different "Tel" and still get the same functionality?
Thnx all stackoverflow is great :D
This should do:
/tel(?:\s|:)+([^.]+)(?:\.|$)/i
+ is a greedy quantifier, which means it'll match as many characters as possible.
To your second question: in this particular case you just need to use case-insensitive match (i flag). Generally, you could use (?:...) syntax, example of which you could see in the end match. Square brackets are used for character classes.
If you're simply trying to extract a phone number out of that line, and it's guaranteed to be 11 numbers, you could simply use this:
$text = 'tel: 012 213 123. mobil: 0303 11234';
$phone_number = substr(preg_replace('/[^\d]/', '', $text), 0, 11);
With your example, $phone_number would be 01221312303 (the sample number has only 9 digits, so the first two digits of the mobile number are included).
How this works is any non-digit is replaced with an empty string, removing it, and then the first 11 numbers are returned.
No idea - your regex works for me (I get "012 213 123" in $match[2] with your code). The fact that the mobile phone differs between the two might indicate that it's not really the output of your code; check again.
Some other things - if you happen to have more dots in the line ("tel: xxx. phone: xxx. fax: xxx" for example), you will get bad results - use non-greedy operators ("get least chunk that matches" .*? instead of "get biggest chunk that matches" .*) or limit the repeated characters ("any number of non-periods" [^.]*). Also, you could spare yourself the trouble by making the regex case-insensitive (unless you really hate people typing "tEl").
Your other question: (?:stuff) will match "stuff" just like (stuff), but will not capture it.
Useful link: http://www.regular-expressions.info/
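To illustrate the point about several dots in one line, here is a small comparison of the three options. The input line is hypothetical, modelled on the "tel: xxx. phone: xxx. fax: xxx" example above.

```php
<?php
// Hypothetical line with more than one dot.
$text = "tel: 012 213 123. mobil: 0303 11234. fax: 555 1234.";

preg_match('/tel[\s:]+(.+)\./i', $text, $greedy);     // runs to the last dot
preg_match('/tel[\s:]+(.+?)\./i', $text, $lazy);      // stops at the first dot
preg_match('/tel[\s:]+([^.]+)\./i', $text, $negated); // cannot cross a dot at all

echo $greedy[1], "\n";  // 012 213 123. mobil: 0303 11234. fax: 555 1234
echo $lazy[1], "\n";    // 012 213 123
echo $negated[1], "\n"; // 012 213 123
```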
Why do you have pipes in your character classes [\.|\n] and [\s|:]? Character classes (stuff in square brackets []) are by definition like an OR relationship, so you don't need the pipe... unless you really are trying to match pipe |.
As for question #1, I'm not sure what's causing your problem, but usually this has to do with greedy quantifiers. The + in (.+) is greedy, so it matches as much as it can while still allowing the entire pattern to match. A greedy quantifier does not stop at the first place where the rest of the pattern could match. Since a period . matches any character other than newline characters, it can match a literal period, and so it does. To make a quantifier non-greedy, add a question mark ? after it.
For your second question: regex uses parentheses to group things and to store them. If you want to group (tel|Tel|TEL) but not store it in $match, you can put ?: after the opening parenthesis:
(?:tel|Tel|TEL)
Do you mean you want to match only the number, so you don't have to strip off the tel: and the dot? Try this:
/tel[:\s]+\K[^.]+/i
The i makes it case-insensitive.
[:\s] matches a colon or whitespace (the | doesn't mean "or" in a character class, it just matches a |).
[^.]+ matches one or more non-dots; it stops matching when it sees a dot or the end of the line, so you don't have to match the dot if you don't want it in the result.
Finally, \K means "forget about whatever you've matched so far and pretend the match really started here"--a little gem of a feature that's only available in Perl and PHP (that I know of).
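Here is how that could look with the string from the question. With \K the number ends up in $match[0], so no capture group is needed.

```php
<?php
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";

// Everything before \K must match but is dropped from the reported match.
if (preg_match('/tel[:\s]+\K[^.]+/i', $text, $match) === 1) {
    echo $match[0], "\n"; // 012 213 123
}
```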
