preg_match not reading results: Delimiter issue? - php

I have a part of a string. I need to check if it would equal another string when I add a few symbols on. However, my use of delimiters (I believe) is not allowing for the matches to take place.
My IF statement:
if (preg_match("{" . "$words[$counter_words]" . "[<]N}", "$corpus[$counter_corpus]"))
My corpus:
{3(-)D[<]AN}
{dog[<]N}
{4(-)H(')er[<]N}
{4(-)H[<]A}
{A battery[<]h}
My partial array is as follows
dog
cat
3-D
plant
My goal is to match "dog" with "{dog[<]N}" (the [] and {} are delimiters). To try to compensate for this, I glue delimiters to the start and end of the string. Preg_match accepts it, but does not match the two together.
What would be the solution to this? I cannot find or think of a solution. Your help is greatly appreciated.

if (preg_match("{" . $words[$counter_words] . "\\[<\\]N}", $corpus[$counter_corpus]))
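[ and ] have special meaning in regular expressions (they delimit a character class). If you don't want the special meaning, you need to escape them with a backslash, e.g. \[. Because the pattern sits inside a double-quoted PHP string, write \\ to be sure a literal \ character reaches the regex engine.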

I have a part of a string. I need to check if it would equal another string when I add a few symbols on.
To check if a string equals another one, we don't need preg_match(); this would do:
if ("{{$words[$counter_words]}[<]N}" == $corpus[$counter_corpus])
If you use preg_match(), you have to heed the PCRE regex syntax:
When using the PCRE functions, it is required that the pattern is enclosed
by delimiters. A delimiter can be any non-alphanumeric,
non-backslash, non-whitespace character.
Often used delimiters are forward slashes (/), …
The added symbols { and } at the start and end of your string were taken as pattern-enclosing delimiters, while you meant them to be part of the pattern. You have to add actual delimiters:
if (preg_match("/{{$words[$counter_words]}\[<]N}/", $corpus[$counter_corpus]))
Also of interest:
Double quoted
Variable parsing / Complex (curly) syntax
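As a rough sketch of the two approaches in this answer, the snippet below glues the symbols onto each word and then compares it either with == or with a properly delimited pattern. The $corpus and $words arrays are sample data shortened from the question; passing each word through preg_quote() is an addition that is not in the original answer, made on the assumption that the words should be matched literally.

```php
<?php
// Sample data taken from the question (shortened).
$corpus = ['{3(-)D[<]AN}', '{dog[<]N}', '{4(-)H[<]A}'];
$words  = ['dog', 'cat', '3-D'];

$byEquality = [];
$byRegex    = [];
foreach ($words as $word) {
    // Plain comparison: glue the symbols on and test for equality.
    $candidate = '{' . $word . '[<]N}';
    // Regex: '/' is the delimiter, so { } [ ] are escaped inside the pattern.
    $pattern = '/^\{' . preg_quote($word, '/') . '\[<\]N\}$/';
    foreach ($corpus as $entry) {
        if ($candidate == $entry) {
            $byEquality[] = $word;
        }
        if (preg_match($pattern, $entry) === 1) {
            $byRegex[] = $word;
        }
    }
}
print_r($byEquality);
print_r($byRegex);
```

Both arrays end up containing only dog; 3-D finds no partner because the corpus stores it as 3(-)D and with a different suffix.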

Related

How to properly escape a string for use in regular expression in PHP?

I am trying to escape a string for use in a regular expression in PHP. So far I tried:
preg_quote(addslashes($string));
I thought I need addslashes in order to properly account for any quotes that are in the string. Then preg_quote escapes the regular expression characters.
However, the problem is that quotes are escaped with backslash, e.g. \'. But then preg_quote escapes the backslash with another one, e.g. \\'. So this leaves the quote unescaped once again. Switching the two functions does not work either because that would leave an unescaped backslash which is then interpreted as a special regular expression character.
Is there a function in PHP to accomplish the task? Or how would one do it?
The proper way is to use preg_quote and specify the used pattern delimiter.
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax... characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Trying to use a backslash as delimiter is a bad idea. Usually you pick a character that's not used in the pattern. Commonly used are slash /pattern/, tilde ~pattern~, number sign #pattern# or percent sign %pattern%. It is also possible to use bracket style delimiters: (pattern)
Your regex, with the modifications mentioned in the comments by @CasimiretHippolyte and @anubhava:
$pattern = '/(?<![a-z])' . preg_quote($string, "/") . '/i';
Maybe you wanted to use the \b word boundary instead. No additional escaping is needed.
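A minimal sketch of preg_quote() with the delimiter argument, assuming / as the delimiter. The $string and $subject values are made up for illustration. Quotes are not regex metacharacters, so addslashes() plays no part here.

```php
<?php
// Hypothetical input containing a quote and several regex metacharacters.
$string  = "it's 50% off at http://example.com/?a=1";
$subject = "Today IT'S 50% OFF AT HTTP://EXAMPLE.COM/?A=1 only";

// The second argument tells preg_quote() to escape the delimiter as well.
$pattern = '/' . preg_quote($string, '/') . '/i';
$found   = preg_match($pattern, $subject);

var_dump(preg_quote('a/b.c', '/')); // the slash and the dot both get a backslash
var_dump($found);
```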

What are those characters in a regular expression?

I found this regex that works correctly but I didn't understand what is # (at the start) and at the end of the expression. Are not ^ and $ the start/end characters?
preg_match_all('#^/([^/]+)/([^/]+)/$#', $s, $matches);
Thanks
The matched pattern contains many /, thus the # is used as the regex delimiter. These are identical:
/^something$/
and
#^something$#
If you have multiple / in your pattern, the 2nd example is better suited because it avoids ugly masking with \/. This is how the RE would look using the standard // syntax:
/^\/([^\/]+)\/([^\/]+)\/$/
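To check that the two spellings really behave the same, here is a small sketch that runs both against one input. The value of $s is a hypothetical path with two segments, not taken from the question.

```php
<?php
$s = '/foo/bar/'; // hypothetical input with two path segments

// Same pattern twice: once with # delimiters, once with / delimiters and escaped slashes.
preg_match_all('#^/([^/]+)/([^/]+)/$#', $s, $withHash);
preg_match_all('/^\/([^\/]+)\/([^\/]+)\/$/', $s, $withSlash);

var_dump($withHash === $withSlash);
print_r($withHash);
```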
About #:
That's a delimiter of the regular expression itself. Its only meaning is to tell which delimiter is used for the expression. Commonly / is used, but others are possible. PCRE expressions need a delimiter with preg_match or preg_match_all.
About ^:
Inside character classes ([...]), the ^ has the meaning of not if it's the first character.
[abc] : matching a, b or c
[^abc] : NOT matching a, b or c, match every other character instead
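The difference between a plain and a negated character class can be tried out with a few calls. The sample strings are invented for illustration.

```php
<?php
// [abc] finds the first character that is a, b or c.
preg_match('/[abc]/', 'xbz', $inClass);

// [^abc] finds the first character that is none of a, b, c.
preg_match('/[^abc]/', 'abcd', $notInClass);

// A string made only of a, b, c contains nothing the negated class accepts.
$onlyOthers = preg_match('/^[^abc]+$/', 'abc');

print_r($inClass);
print_r($notInClass);
var_dump($onlyOthers);
```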
Also # at the start and the end here are custom regex delimiters. Instead of the usual /.../ you have #...#. Just like perl.
These are delimiters. You can use any delimiter you want, but they must appear at the start and end of the regular expression.
Please see this documentation for detailed insight into regular expressions:
http://www.php.net/manual/en/pcre.pattern.php
You can use pretty much anything as delimiters. The most common one is /.../, but if the pattern itself contains / and you don't want to escape any and all occurrences, you can use a different delimiter. My personal preference is (...) because it reminds me that $0 of the result is the entire pattern. But you can do anything, <...>, #...#, %...%, {...}... well, almost anything. I don't know exactly what the requirements are, but I think it's "any non-alphanumeric character".
Let me break it down:
# is the first character, so this is the character used as the delimiter of the regular expression - we know we've got to the end when we reach the next (unescaped) one of these
^ outside of a character class, this means the beginning of the string
/ is just a normal 'slash' character
([^/]+) This is a bracketed expression containing at least one (+) instance of any character that isn't a / (^ at the beginning of a character class inverts the character class - meaning it will only match characters that are not in this list)
/ again
([^/]+) again
/ again
$ this matches the end of the string
# this is the final delimiter, so we know that the regex is now finished.

what does # mean when used in preg_match?

This is from a class. There is a # sign in preg_match; what does it mean, or what is its purpose? Does it mean a space?
if (preg_match("#Property Information </td>#",simplexml_import_dom($cols->item(0))->asXML(),$ok))
{
$table_name = 'Property Information';
}
In that case, it is being used as a pattern delimiter. As that manual page says,
When using the PCRE functions, it is required that the pattern is enclosed by delimiters. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character.
Often used delimiters are forward slashes (/), hash signs (#) and tildes (~).
It is just a delimiter. It can be any other pair of characters. The following are all the same:
"#Property Information </td>#"
"+Property Information </td>+"
"|Property Information </td>|"
"#Property Information </td>#"
"[Property Information </td>]"
...
The purpose of the delimiter is to separate the regex pattern from its modifiers, e.g. if you need a case-insensitive match you put an i after the closing delimiter:
"#Property Information </td>#i"
"+Property Information </td>+i"
"|Property Information </td>|i"
"#Property Information </td>#i"
"[Property Information </td>]i"
...
See http://www.php.net/manual/en/regexp.reference.delimiters.php for detail.
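A short sketch of the point about modifiers, using an invented upper-case $subject: each delimiter variant matches once the i modifier follows the closing delimiter, and the same pattern without it does not.

```php
<?php
$subject = '<td>PROPERTY INFORMATION </td>'; // hypothetical input in upper case

$patterns = [
    '#Property Information </td>#i',
    '+Property Information </td>+i',
    '|Property Information </td>|i',
    '[Property Information </td>]i', // bracket pair as delimiters
];

$withModifier = [];
foreach ($patterns as $pattern) {
    $withModifier[] = preg_match($pattern, $subject);
}

// Without the i modifier the match is case-sensitive and fails here.
$withoutModifier = preg_match('#Property Information </td>#', $subject);

print_r($withModifier);
var_dump($withoutModifier);
```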
Almost any character, when appearing at the first position, can be used as a PCRE delimiter. In this case it's the # (another common one would be /, but when dealing with closing tags that one is not really good, as you'd have to escape every / in the text).
See http://www.php.net/manual/en/regexp.reference.delimiters.php for details.
However, you shouldn't use a Regex for this check at all - you are just testing if a plain string is in another string. Here's a proper solution:
$xml = simplexml_import_dom($cols->item(0))->asXML();
if(strpos($xml, 'Property Information </td>') !== false) { ... }
Actually, using string operators when dealing with html/xml is not really nice but if you are just doing simple "contains" checks it's usually the easiest way.
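A self-contained sketch of the strpos() check, with a hard-coded $xml string standing in for the asXML() output (an assumption, since the real markup is not shown). It also illustrates why the comparison has to be !== false: a hit at the very start returns 0.

```php
<?php
// Stand-in for simplexml_import_dom(...)->asXML(); the real markup is not shown in the question.
$xml = '<tr><td>Property Information </td></tr>';

// strpos() returns the offset of the first occurrence, or false if there is none.
$contains = strpos($xml, 'Property Information </td>') !== false;

// A match at offset 0 is still a match, so compare against false strictly.
$atStart = strpos('Property Information </td>', 'Property');

var_dump($contains);
var_dump($atStart);
```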
Every regular expression must start and end with a delimiter; unless a bracket pair is used, it is the same character at both ends. The author of the given regular expression has chosen the # sign.

Beginner PHP help needed

I have been learning PHP for some time now and I wanted one clarification.
I have seen the preg_match function called with different delimiting symbols like:
preg_match('/.../')
and
preg_match('#...#')
Today I also saw % being used.
My question is two part:
What all characters can be used?
And is there a standard ?
Any
non-alphanumeric
non-whitespace and
non-backslash ASCII character
can be used as delimiter.
Also, if you use one of the opening punctuation symbols as the opening delimiter:
( { [ <
then their corresponding closing punctuation symbols must be used as closing delimiter:
) } ] >
The most common delimiter is /. But sometimes it's advised to use a different delimiter if a / is part of the regex.
Example:
// check if a string is number/number format:
if (preg_match('/^\d+\/\d+$/', $str)) {
// match
}
Since the regex contains the delimiter, you must escape the delimiter found in the regex.
To avoid the escaping it is better to choose a different delimiter, one which is not present in the regex, that way your regex will be shorter and cleaner:
if (preg_match('#^\d+/\d+$#', $str))
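The number/number check can be run with several delimiter styles side by side. This is only a sketch: the $inputs values are invented, and the third pattern uses a { } bracket pair to illustrate the opening/closing rule mentioned above.

```php
<?php
$inputs  = ['3/4', '10/200', '3-4', '/4']; // hypothetical test values
$results = [];

foreach ($inputs as $input) {
    $results[$input] = [
        preg_match('/^\d+\/\d+$/', $input), // slash delimiter, inner slash escaped
        preg_match('#^\d+/\d+$#', $input),  // hash delimiter, no escaping needed
        preg_match('{^\d+/\d+$}', $input),  // bracket pair: { opens, } closes
    ];
}

print_r($results);
```

All three patterns agree on every input; only the amount of escaping differs.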

Can you use back references in the pattern part of a regular expression?

Is there a way to back reference in the regular expression pattern?
Example input string:
Here is "some quoted" text.
Say I want to pull out the quoted text, I could create the following expression:
"([^"]+)"
This regular expression would match some quoted.
Say I want it to also support single quotes, I could change the expression to:
["']([^"']+)["']
But what if the input string has a mixture of quotes say Here is 'some quoted" text. I would not want the regex to match. Currently the regex in the second example would still match.
What I would like to be able to do is if the first quote is a double quote then the closing quote must be a double. And if the start quote is single quote then the closing quote must be single.
Can I use a back reference to achieve this?
My other related question: Getting text between quotes using regular expression
You can make use of the regex:
(["'])[^"']+\1
() : used for grouping
[..] : a char class, so ["'] matches either " or ', equivalent to "|'
[^..] : a char class with negation; it matches any char not listed after the ^
+ : quantifier for one or more
\1 : backreference to the first group, which is (["'])
In PHP you'd use this as:
preg_match('#(["\'])[^"\']+\1#',$str)
preg_match('/(["\'])([^"\']+)\1/', 'Here is \'quoted text" some quoted text.');
Explanation: in (["'])([^"']+)\1 I placed the first quote in parentheses. Because this is the first grouping, its back reference number is 1. Then, where the closing quote would be, I placed \1, which means whichever character was matched in group 1.
/"\(.*?\)".*?\1/ should work, but it depends on the regular expression engine
This is old, but note that you need to provide the $matches variable: preg_match($pattern, $subject, $matches). It is passed by reference, so no & is needed at the call site.
Then you can inspect it with var_dump($matches)
see https://www.php.net/manual/en/function.preg-match
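Putting the back reference and the $matches argument together, a minimal sketch using the two sample sentences from the question. The variable names are chosen here for illustration.

```php
<?php
// Group 1 captures the opening quote; \1 requires the same character to close.
$pattern = '/(["\'])([^"\']+)\1/';

// Consistent quotes: the pattern matches and $matches is filled in.
$matched = preg_match($pattern, 'Here is "some quoted" text.', $matches);

// Mixed quotes: the closing quote differs from the opening one, so nothing matches.
$mixed = preg_match($pattern, 'Here is \'some quoted" text.', $mixedMatches);

print_r($matches);
var_dump($mixed);
```

Here $matches[0] holds the full match including the quotes, $matches[1] the quote character, and $matches[2] the text between the quotes.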
