PHP tricky regex to get quoted string up until certain words - php

I have a variant of strings that look like either of these
First rounder 'John Smith' had a good game.
Second rounder 'Jim O'Rielly' is on fire!
What I ultimately want is to get both names between quotes John Smith and Jim O'Rielly, however the tricky part is the names that include apostrophe like the second.
I was initially using '/\'([^\']*)\'/' to get the text inside the quotes, but it doesn't work for the second case: it only returns Jim O.
I then thought to use .+?(?=had) in order to get everything up to the word had, but it needs to be either had or is, and I don't want the words First rounder, etc.
I need to essentially combine these, so I can get only the text inside the quotes, but UP UNTIL either word had or is, and I just want the text without quotes.
Unless there is a trick to get the 2nd option to ignore the apostrophe in the name (I thought of addslashes(), but how do I know which apostrophes to add slashes to?), can anyone suggest a better solution? Bonus points for ignoring any special characters in the name that I haven't considered :)

You can alternate between matching non-'s, and matching 's which have word characters on either side. This way 's in the middle of a word will be matched, but 's at either end of a word won't.
'((?:[^']+|\b'\b)+)'
https://regex101.com/r/L9Em5l/1
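A minimal PHP sketch of this pattern applied to the two sample strings from the question. The variable names and the ~ delimiter are assumptions made for illustration; capture group 1 holds the name without the surrounding quotes.

```php
<?php
// Sample inputs taken from the question.
$lines = [
    "First rounder 'John Smith' had a good game.",
    "Second rounder 'Jim O'Rielly' is on fire!",
];

// [^']+ consumes non-quote text; \b'\b accepts a quote only when it has
// word characters on both sides (an apostrophe inside a name).
$pattern = "~'((?:[^']+|\\b'\\b)+)'~";

$names = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        $names[] = $m[1]; // group 1: text between the outer quotes
    }
}

print_r($names);
```

Because the closing quote is followed by a space rather than a word character, \b'\b cannot consume it, so the match ends at the right place without needing to look for "had" or "is".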

Another option could be matching any char except ' using a negated character class.
Then only accept matching a ' if followed by a word boundary and repeat that 0+ so it is optional and also matches a name without a single quote in it.
'([^']+(?:'\b[^']++)*)'
Explanation
'( Match starting ' and open capture group 1
[^']+ Match 1+ times any char except a '
(?: Non capture group
'\b[^']++ Match ' and word boundary, match 1+ times any char except ' using a possessive quantifier
)* Close group and repeat 0+ times so this will be optional
)' Close group 1 and match the closing '
If you don't want the negated character class to match newlines, you could use [^'\r\n]+ instead.
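As a rough illustration, the newline-excluding variant could be used with preg_match_all like this. The sample text and variable names are assumptions; the pattern is the one from this answer with [^'\r\n] substituted in both places.

```php
<?php
// Hypothetical input: both sample sentences in one multi-line string.
$text = "Second rounder 'Jim O'Rielly' is on fire!\n"
      . "First rounder 'John Smith' had a good game.";

// A quote is accepted inside the name only when a word boundary follows it;
// [^'\r\n] keeps a match from running across a line break.
$pattern = "~'([^'\\r\\n]+(?:'\\b[^'\\r\\n]++)*)'~";

preg_match_all($pattern, $text, $matches);
$names = $matches[1]; // group 1 of every match

print_r($names);
```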

Related

Ignoring apostrophe while capturing contents in single quotes REGEX

The issue for me here is to capture the content inside single quotes(like 'xyz').
But the apostrophe which is the same symbol as a single quote(') is coming in the way!
The regex I've written is : /(\w\'\w)(*SKIP)(*F)|(\'[^\']*\')/
The example I have used is: Hello ma'am 'This is Prashanth's book.'
What needs to be captured is: 'This is Prashanth's book.'.
But what's captured is: 'This is Prashanth'!
Here is the link of what i tried on online regex tester
Any help is greatly appreciated. Thank you!
You can't use [^\'] to capture text that itself contains a ', and in your example, This is Prashanth's book. contains a ' character. You need to modify your regex to use .*? instead of [^\']*, and can write it like this:
(\w'\w)(*SKIP)(*F)|('.*?'\B)
Also, you don't need to escape a single quote ' as that has no special meaning in regex.
From your example, it is not clear whether you want the captured match to include the surrounding ' characters or not. If you don't want them in the match, you can use a lookaround-based regex such as this:
(?<=\B').*?(?='\B)
Explanation of regex:
(?<=\B') - This positive lookbehind ensures the match is preceded by a single quote that is itself not preceded by a word character (enforced by \B)
.*? - Matches the text in a non-greedy manner
(?='\B) - Ensures the matched text is followed by a single quote; \B ensures it doesn't stop at a quote that is immediately followed by a word character, e.g. the quote in 's
For the string you have provided, you can use the regex:
\B'\K(?:(?!'\B).)+
Explanation:
\B - a non-word boundary
' - matches a '
\K - forget everything matched so far
(?:(?!'\B).)+ - matches 1+ occurrences of any character (except newline), stopping before a ' that is followed by a non-word boundary
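To compare the two quote-free approaches side by side, here is a small PHP sketch run against the example string from the question. The variable names are assumptions; both patterns are copied from the answers above.

```php
<?php
$text = "Hello ma'am 'This is Prashanth's book.'";

// Lookaround version: the quotes are asserted but never become part of the match.
preg_match("~(?<=\\B').*?(?='\\B)~", $text, $a);

// \K version: the opening quote is matched, then dropped from the reported match.
preg_match("~\\B'\\K(?:(?!'\\B).)+~", $text, $b);

echo $a[0], "\n";
echo $b[0], "\n";
```

In both cases the apostrophe in ma'am is skipped because it sits between two word characters, so \B fails there.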

Pattern for check single occurrency into preg_match_all

I'm writing a function that should retrieve all occurrences of the words I pass to it.
I'm Italian, so I think I can be clearer with an example.
I want to check whether my phrase contains some fruits.
OK, so let's look at my PHP code:
$pattern='<apple|orange|pear|lemon|Goji berry>i';
$phrase="I will buy an apple to do an applepie!";
preg_match_all($pattern,$phrase,$match);
The result will be an array with two matches: the word "apple" and the "apple" inside "applepie".
How can I match only exact occurrences?
Reading the manual I found:
http://php.net/manual/en/regexp.reference.anchors.php
I tried to use \A, \Z, ^ and $, but none of them seems to work correctly in my case!
Can someone help me?
EDIT: After @cris85's answer, I'll try to improve my question...
My real pattern contains over 200 alternatives and the phrase is over 10000 characters, so the real case is too large to include here.
After some trials I found an error on the occurrence "microsoft exchange"! Are there special characters that I must escape?
At the moment I escape "+" "-" "." "?" "$" and "*".
The anchors you tried to use are for the full string, not per word. You can use word boundaries to match individual words. This should allow you to find only complete fruit matches:
$pattern='<\b(?:apple|orange|pear|lemon|Goji berry)\b>i';
The ?: makes the group non-capturing, so you don't create an additional capture group.
Here's the definition from regex-expressions of what a boundary matches:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
PHP Demo: https://3v4l.org/h5GCf
Regex Demo: https://regex101.com/r/5aBaMO/1/
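For a long word list, one way to avoid escaping characters by hand is to let preg_quote() do it while building the alternation. The sketch below is only an illustration: the word list, the extended phrase and the / delimiter are assumptions, and \b behaves as described only for entries that start and end with a word character.

```php
<?php
// Hypothetical word list; preg_quote() escapes any regex metacharacters in it.
$words = ['apple', 'orange', 'pear', 'lemon', 'Goji berry', 'microsoft exchange'];
$alternatives = array_map(function ($w) {
    return preg_quote($w, '/'); // second argument: the delimiter to escape as well
}, $words);

// \b on both sides restricts matches to whole words; i makes it case-insensitive.
$pattern = '/\b(?:' . implode('|', $alternatives) . ')\b/i';

$phrase = 'I will buy an apple to do an applepie, then open Microsoft Exchange!';
preg_match_all($pattern, $phrase, $match);

print_r($match[0]);
```

Here "applepie" is not reported, because there is no word boundary between "apple" and "pie".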

PHP Regexp - if custom punctuation symbols are side-by-side, then regex doesn't match

my regexp
/^[\p{L}\p{N}][\p{L}\p{N} \.,;:\?!-“”‘’"']+$/u
aim of regexp
allow utf-8 characters, numbers, spaces AND custom punctuation to verify article title
The inputs below don't match, but I want them to match even when punctuation marks are side by side. Can you show me the correct form of my regexp? Note: the backslashes in front of the dot and question mark are an attempt at escaping; I also tried without them. I am not good at regexp; I can only find sub-parts and then try to combine them. Thanks. BR
inputs that don't match
"Selim"!'"':?-
"'
'"
?!
I also discovered that I cannot start a title with punctuation.
For example, "title" Day doesn't match.
Change it to:
/^[\p{L}\p{N}“”‘’"'][\p{L}\p{N} .,;:?!\-“”‘’"']*$/u
NB: - must be escaped if it isn't in the first or last position within the character class, but . and ? don't need escaping there.
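A quick way to check this pattern is to run it over a few of the inputs from the question. The sketch below assumes the source file is saved as UTF-8; the $titles list and $results array are illustrative names.

```php
<?php
// Pattern from the answer above; the u flag turns on UTF-8 mode for \p{..}
// and for the curly quotes inside the character classes.
$pattern = '/^[\p{L}\p{N}“”‘’"\'][\p{L}\p{N} .,;:?!\-“”‘’"\']*$/u';

$titles = [
    '"title" Day',       // starts with a quote, which the first class now allows
    '"Selim"!\'"\':?-',  // punctuation side by side
    '?!',                // starts with ?, which the first class does not allow
];

$results = [];
foreach ($titles as $title) {
    $results[$title] = preg_match($pattern, $title) === 1;
}

var_dump($results);
```

The first two titles are accepted. The last one is still rejected, because only letters, digits and quote characters may start a title; adding ? and ! to the first class would change that.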
Are the square brackets within the regex characters you accept? If so, they need to be escaped.
/^[\p{L}\p{N}\]\[\p{L}\p{N} \.,;:\?!-“”‘’"']+$/u
If not, then you need to include the punctuation you'll allow inside the first character class.

get initialized string regex

kNO = "Get this value now if you can";
How do I get Get this value now if you can from that string? It looks easy but I don't know where to start.
Start by reading PHP PCRE and see the examples. For your question:
$str = 'kNO = "Get this value now if you can";';
preg_match('/kNO\s+=\s+"([^"]+)"/', $str, $m);
echo $m[1]; // Get this value now if you can
Explanation:
kNO Matches the literal "kNO" in the input string
\s+ Followed by one or more whitespace characters
"([^"]+)" Captures the characters inside the double quotes
Depending on how you're getting that input, you could use parse_ini_file or parse_ini_string. Dead simple.
Use character classes to start extracting from one open quote to the next:
$str = 'kNO = "Get this value now if you can";';
preg_match('~"([^"]*)"~', $str, $matches);
print_r($matches[1]);
Explanation:
~ //php requires explicit regex bounds
" //match the first literal double quotation
( //begin the capturing group, we want to omit the actual quotes from the result so group the relevant results
[^"] //character class, matches any character that is NOT a double quote
* //matches the aforementioned character class zero or more times (empty string case)
) //end group
" //closing quote for the string.
~ //close the boundary.
EDIT, you may also want to account for escaped quotes, use the following regex instead:
'~"((?:[^\\\\"]+|\\\\.)*)"~'
This pattern is slightly more difficult to wrap your head around. Essentially it is broken into two possible matches (separated by the regex OR character |):
[^\\\\"]+ //match any character that is NOT a backslash and is NOT a double quote
| //or
\\\\. //match a backslash followed by any character.
The logic is pretty straightforward: the first alternative matches every character except a double quote or a backslash. If a quote or a backslash is found, the regex attempts to match the second alternative. If it is a backslash, it will of course match the pattern \\\\., which also advances the match by one more character, effectively skipping whatever escaped character followed the backslash. The only time this pattern stops matching is when a lone, unescaped double quote is encountered.
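A short PHP sketch of the escaped-quote pattern in use. The sample string and the stripslashes() step are assumptions added for illustration; the regex is the one given above.

```php
<?php
// Single-quoted PHP string, so each \" below is a literal backslash plus quote.
$str = 'kNO = "Get \"this\" value now";';

// \\\\ in PHP source reaches the regex engine as \\, i.e. one literal backslash.
preg_match('~"((?:[^\\\\"]+|\\\\.)*)"~', $str, $m);

$raw = $m[1];                    // still contains the backslashes
$unescaped = stripslashes($raw); // backslashes removed

echo $raw, "\n", $unescaped, "\n";
```

With the simpler "([^"]*)" pattern, the same input would stop at the first escaped quote and capture only Get followed by a backslash.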

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally (many languages need a g flag for that; in PHP, preg_replace replaces every match by default).
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
Greedy means it will go to the end of the string and try matching from there. If it can't find a match, it backs off one character from the end and tries again, and so on. Non-greedy means it will consume as few characters as possible while still meeting the criteria.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of line. You don't need it here.
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes (e.g. A "test \" quoted string" B will result in A  quoted string" B, not in A  B as you might expect).
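The two behaviours can be compared in a short PHP sketch. The sample strings and variable names are assumptions; the second pattern borrows the common "non-special characters or backslash plus any character" idiom and is offered as one possible way to handle escaped quotes, not as the only one.

```php
<?php
// Hypothetical input with a quoted span that crosses a line break.
$file = "<a href=\"page\n.html\">link</a> and \"more\" text";

// [^"] already matches newlines, so no modifier is required here.
$simple = preg_replace('/"[^"]*"/', '', $file);

// Variant that treats \" as part of the quoted text; s lets . match a newline
// that follows a backslash.
$input = 'A "test \" quoted string" B';
$escapedAware = preg_replace('~"(?:[^\\\\"]+|\\\\.)*"~s', '', $input);

echo $simple, "\n", $escapedAware, "\n";
```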
