preg_match_all to parse an xml-like attribute string

I have a string like so:
option_alpha="value" option_beta="some other value" option_gamma="X" ...etc.
I'm using this to parse them into name & value pairs:
preg_match_all("/([a-z0-9_]+)\s*=\s*[\"\'](.+?)[\"\']/is", $var_string, $matches)
Which works fine, unless it encounters an empty attribute value:
option_alpha="value" option_beta="" option_gamma="X"
What have I done wrong in my regex?

[\"\'](.+?)[\"\']
should be
[\"\'](.*?)[\"\']
That is, use * instead of +. The * quantifier allows zero or more occurrences of the preceding expression (so the value may be empty, which is what you need), whereas + requires at least one.

I think you want to change the very middle of your expression from (.+?) to (.*?). That makes it a non-greedy match on any character (including no characters), instead of a non-greedy match on at least one character.
preg_match_all("/([a-z0-9_]+)\s*=\s*[\"\'](.*?)[\"\']/is",$var_string,$matches);

The other answers here are right in that you need to change the middle of the expression, but I would change it to [^\"\']*, which means "any character that is not a quote, 0 or more times". This ensures the match cannot run past the closing quote and still allows for an empty "". Keep the parentheses around it so the value is still captured.
your expression becomes
"/([a-z0-9_]+)\s*=\s*[\"\']([^\"\']*)[\"\']/is"
note you can change [a-z0-9_] to \w, which also covers upper case characters (as the i modifier already does) and the underscore.
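As a rough sketch of the fixes suggested above, the snippet below runs both the lazy (.*?) form and the negated-class form against the sample string from the question. The variable names $lazy, $negated, $pairs and $pairsNegated are only illustrative, and the negated-class pattern assumes values never contain a quote character of either kind.

```php
<?php
// Sample input from the question, including an empty attribute value.
$var_string = 'option_alpha="value" option_beta="" option_gamma="X"';

// Lazy variant: (.*?) accepts zero or more characters between the quotes.
$lazy = '/([a-z0-9_]+)\s*=\s*["\'](.*?)["\']/is';

// Negated-class variant: the value may contain anything except a quote.
$negated = '/(\w+)\s*=\s*["\']([^"\']*)["\']/';

preg_match_all($lazy, $var_string, $matches);
// Group 1 holds the names, group 2 the values.
$pairs = array_combine($matches[1], $matches[2]);

preg_match_all($negated, $var_string, $matchesNegated);
$pairsNegated = array_combine($matchesNegated[1], $matchesNegated[2]);

print_r($pairs);
```

For this input both patterns yield the same three pairs, with option_beta mapped to an empty string.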

Related

Regex to get string between single or double quotes even if it's empty

Below is the REGEX which I am trying:
/((?<![\\\\])['"])((?:.(?!(?<![\\\\])\\1))*.?)\\1/
Here this is the text which I am giving
val1=""val2>"2022-11-16 10:19:20"
I need blank expressions like for val1 as well,
i.e. I need something like below in matches
""
2022-11-16 10:19:20
If I change the text to something like below, I am getting proper output
val2>"2022-11-16 10:19:20"val1=""
Can anyone please let me know where I am going wrong

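Use alternatives to match the two cases.
One alternative matches an empty pair of quotes; the other uses lookarounds to match the text between two quotes.
""|(?<=")[^"]+(?=")
Note that the lookarounds do not track which quote opens and which closes a value, so in the first sample this pattern also matches val2>, the text between the closing quote of val1 and the opening quote of val2.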

In your pattern, the part (?:.(?!(?<![\\])\1))* first matches any character and only then asserts that what follows is not an unescaped copy of the group 1 quote.
So in the string ""val2>" the character class ['"] matches the first " and the . then consumes the second ". From the position after it, the assertion holds (what follows is not an unescaped group 1 quote), so matching continues, and the result is ""val2>" instead of "".
Since the second example string already gives the expected output, you could swap the order in the repeated part so that the assertion runs before the dot, and drop the optional trailing .?
Note that the backslash does not have to be in square brackets.
(?<!\\)(['"])((?:(?!(?<!\\)\1).)*+)\1
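A minimal PHP sketch of the reordered pattern, run against the first sample string from the question. The variable names are illustrative; the pattern is written in a single-quoted PHP string, so each literal backslash in the regex is doubled twice.

```php
<?php
// First sample string from the question.
$text = 'val1=""val2>"2022-11-16 10:19:20"';

// Regex: (?<!\\)(['"])((?:(?!(?<!\\)\1).)*+)\1
// Group 1 is the opening quote, group 2 the text between the quotes.
$pattern = '/(?<!\\\\)([\'"])((?:(?!(?<!\\\\)\1).)*+)\1/';

preg_match_all($pattern, $text, $m);

// $m[0] holds the quoted strings, $m[2] only their contents.
var_export($m[2]);
```

Because the lookahead now runs before each character is consumed, the repetition stops immediately at the second quote of "", so the empty value is reported as its own match.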

preg_match() is evaluating my regex incorrectly

My regex validation is producing true when it should be false. I've tried this exact example using online regex validators, and it is always rejected except in my code. Am I doing something wrong?
$name = "1NTH";
preg_match("/[A-Z][A-Z][A-Z][A-Z]?/",$name);
This exact example is evaluating to true.

You're getting the correct behaviour, as you're asking for three capital letters optionally followed by a fourth one, anywhere in the string.
You probably want to use this regex:
/^[A-Z][A-Z][A-Z][A-Z]?$/
(note the ^, start of line, and the $, end of line) as it explicitly requires that the capital letters be the entire content of the text line.

This is because it is true: the string contains three consecutive [A-Z] characters (NTH).
You're missing the anchors that tie the match to the start and the end of the string.
^[A-Z][A-Z][A-Z][A-Z]?$

There's nothing wrong with your regex. It behaves according to the rule you specified.
Let's do it one step at a time:
[A-Z] means match exactly one uppercase letter.
[A-Z]? means match either zero or one uppercase letter.
See what's going on? If not, move on.
[A-Z][A-Z][A-Z] means match exactly three consecutive uppercase letters (one for each [A-Z]).
[A-Z][A-Z][A-Z][A-Z]? means three consecutive uppercase letters, optionally followed by a fourth.
In your example, 1NTH contains three consecutive uppercase letters (NTH), so it matches. You didn't put any restriction on what may come before or after those letters, digits included. And the final [A-Z]? Well, that's optional, right? (see rule #2)

The standard PHP regular expression functions check whether the string contains the pattern; they do not require the whole string to match. That differs from, for example, Java's String.matches(), which must match the entire string.
You should use ^ and $, which match the beginning and the end of the string respectively. Both are zero-length assertions.
$name = "1NTH";
preg_match("/^[A-Z]{3}[A-Z]?$/", $name);
PS: I have shortened your regular expression with the quantifier {3}, which matches three consecutive occurrences of the preceding character or group.
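To make the effect of the anchors concrete, here is a small sketch comparing the pattern without anchors, with only the end anchor, and with both. The variable names are illustrative.

```php
<?php
$name = "1NTH";

// No anchors: succeeds because "NTH" occurs somewhere in the string.
$unanchored = preg_match('/[A-Z]{3}[A-Z]?/', $name);

// Only the end anchor: still succeeds, since "NTH" sits at the end.
$endOnly = preg_match('/[A-Z]{3}[A-Z]?$/', $name);

// Both anchors: fails, because the string starts with a digit.
$anchored = preg_match('/^[A-Z]{3}[A-Z]?$/', $name);

// A valid name passes the anchored check.
$valid = preg_match('/^[A-Z]{3}[A-Z]?$/', 'NTH');

var_dump($unanchored, $endOnly, $anchored, $valid);
```

Only the version with both ^ and $ rejects "1NTH".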

According to the PHP Manual:
preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred.
Your pattern requires three capital letters and allows an optional fourth, and "1NTH" contains NTH, so the match is expected.

As stribizhev said, your regex matches because the three required capital letters (NTH) are found in $name. I assume you want to reject "1NTH" because it starts with a digit. That means you have to add an anchor saying "from the start" (\A).
Also, the three repeated [A-Z] can be condensed with a repeat count. Note that \A[A-Z]{3,} accepts three or more capitals and says nothing about what follows them; to keep the original rule of three or four capitals and nothing else, the whole pattern should be: \A[A-Z]{3,4}\z

You have this:
$name = "1NTH";
preg_match("/[A-Z][A-Z][A-Z][A-Z]?/",$name);
Change it to this:
$name = "1NTH";
preg_match("/^[A-Z][A-Z][A-Z][A-Z]?$/",$name);
The pattern is missing the '^' at the start and the '$' at the end.
Adding only the '$' is not enough: NTH at the end of "1NTH" would still match.

Explain the Regular Expression /^[a-zA-Z ]*/

I understand that the regex pattern must match a string which starts with the combination and the repetition of the following characters:
a-z
A-Z
a white-space character
And there is no limitation to how the string may end!
First Case
So a string such as uoiui897868 (any string that only starts with space, a-z or A-Z) matches the pattern... (Sure it does)
Second Case
But the problem is that a string like 76868678jugghjiuh (any string that starts with a character other than space, a-z or A-Z) matches too! This should not happen!
I have checked using the PHP function preg_match() too, which returns 1 (i.e. the pattern matches the string).
I have also used online tools like regex101 and regexr.com. The string does match the pattern.
Could anybody help me understand why the pattern matches the string described in the second case?

/^[a-zA-Z ]*/
Your regex will match strings that "begin with" any number (including zero) of letters or spaces.
^ means "start of string" and * means "zero or more".
Both uoiui897868 and 76868678jugghjiuh start with 0 or more letters/spaces, so they both match.
You probably want:
/^[a-zA-Z ]+/
The + means "one or more", so it won't match zero characters.
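A short sketch of the difference, using the two strings from the question. The variable names are illustrative, and the last pattern (with a trailing $) is an extra assumption about what might be wanted, namely that the whole string consists of letters and spaces.

```php
<?php
$lettersFirst = 'uoiui897868';
$digitsFirst  = '76868678jugghjiuh';

// With *, the pattern succeeds on the digit-first string by matching
// zero characters at the start; $m[0] is the empty string.
$starResult = preg_match('/^[a-zA-Z ]*/', $digitsFirst, $m);
$starMatch  = $m[0];

// With +, at least one letter or space is required at the start.
$plusOnDigits  = preg_match('/^[a-zA-Z ]+/', $digitsFirst);
$plusOnLetters = preg_match('/^[a-zA-Z ]+/', $lettersFirst);

// Assumption: adding $ requires the whole string to be letters/spaces.
$wholeString = preg_match('/^[a-zA-Z ]+$/', $lettersFirst);

var_dump($starResult, $starMatch, $plusOnDigits, $plusOnLetters, $wholeString);
```

The empty match in the first call is why the original pattern appears to accept every string.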

Your regex is completely useless: it trivially matches any string (empty, non-empty, with numbers, without, ...), regardless of its structure.
This is because:
with ^, you anchor the match at the beginning of the string, and every string has a beginning.
You use the character class [A-Za-z ] with the * operator, which means zero or more repetitions. So even if the string does not begin with a character from [A-Za-z ], the matcher simply matches zero characters at the start and reports success.
You need to use + instead of * to enforce "at least one character".

The '*' quantifier on the end means zero or more matches of the character class, so all strings will match. Perhaps you want to drop the quantifier, or change it to a '+' quantifier, and add a '$' on the end to test the whole string.

What you really want is to match one or more of the preceding characters.
For that you use +
/^[a-zA-Z ]+/

Why does this regex not validate in the same way in PHP?

when I try preg_match with the following expression: /.{0,5}/, it still matches string longer than 5 characters.
It does, however, work properly when trying in online regexp matcher

The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
Having explained all that, I would make one final observation: this specific pattern really doesn't need to be a regular expression; you could achieve the same thing with strlen().
In addition, the dot in regex may not work exactly as you expect. It matches almost any character, but some characters, including new line characters, are excluded by default. So if your string contains five characters and one of them is a new line (other than a trailing one), it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect multibyte Unicode characters, since strlen() counts bytes).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
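The points above can be sketched in PHP as follows. The sample strings and variable names are made up for illustration.

```php
<?php
$long        = 'abcdefgh';   // 8 characters
$withNewline = "ab\ncd";     // 5 characters, one of them a newline

// Without anchors the pattern matches inside any string.
$unanchored = preg_match('/.{0,5}/', $long);

// With anchors the whole string must be 0 to 5 characters long.
$anchored = preg_match('/^.{0,5}$/', $long);

// By default the dot does not match a newline, so this fails.
$dotDefault = preg_match('/^.{0,5}$/', $withNewline);

// The s modifier lets the dot match newlines as well.
$dotAll = preg_match('/^.{0,5}$/s', $withNewline);

// A plain length check avoids the regex entirely (strlen counts bytes).
$byLength = strlen($withNewline) <= 5;

var_dump($unanchored, $anchored, $dotDefault, $dotAll, $byLength);
```

For single-byte input like this, the length check and the /s version agree.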

You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/

How can I match occurrences of string not in another string using regular expressions?

I'm trying to match all occurrences of "string" in something like the following sequence, except those inside ##
as87dio u8u u7o #string# ou os8 string os u
i.e. the second occurrence should be matched but not the first
Can anyone give me a solution?

You can use negative lookahead and lookbehind:
(?<!#)string(?!#)
EDIT
NOTE: As per Mark's comments below, this would also not match a string with a # on only one side, such as #string or string#.
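A minimal PHP sketch of the lookaround pattern applied to the sample text from the question. PREG_OFFSET_CAPTURE is used only to show which occurrence was matched; the variable names are illustrative.

```php
<?php
$text = 'as87dio u8u u7o #string# ou os8 string os u';

// Match "string" only when it is neither preceded nor followed by #.
preg_match_all('/(?<!#)string(?!#)/', $text, $m, PREG_OFFSET_CAPTURE);

// Each entry of $m[0] is [matched text, byte offset in $text].
$count  = count($m[0]);
$offset = $m[0][0][1];

echo "matches: $count, first match at offset $offset\n";
```

Only the second, unwrapped occurrence is reported; the one inside #...# is skipped because both lookarounds reject it.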

You can try:
(?:[^#])string(?:[^#])

OK,
If you want to NOT match a character, you put it in a character class (square brackets) and start the class with the ^ character, which negates it. For example, [^a] means any character but a lowercase 'a'.
So if you want NOT a hash sign, followed by string, followed by another NOT a hash sign, you want
[^#]string[^#]
Now, the problem is that the character classes will each match a character, so in your example we'd get " string ", which includes the leading and trailing whitespace. There is a construct that groups without capturing: parens with ?: at the beginning, (?: ). Surrounding the ends with it keeps them out of the capture groups, but they are still part of the overall match, so put string in its own capturing group if you need it alone.
(?:[^#])string(?:[^#])
OK, but now it doesn't match at the start of the subject (which, confusingly, is the ^ character doing double duty outside a character class) or at the end of the subject, $. So we have to use the OR character | to say "give me a non-hash-sign OR start of string" and at the end "give me a non-hash-sign OR end of string", like this:
(?:[^#]|^)string(?:[^#]|$)
EDIT: Negative lookbehind and lookahead are a simpler (and clever) solution, but they are not available in all regular expression engines.
Now a follow-up question. If you had the word "astringent" would you still want to match the "string" inside? In other words, does "string" have to be a word by itself? (Despite my initial reaction, this can get pretty complicated :) )

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
