How to use dot as in punctuation and not append in PHP - php

I'm putting a SQL query together in PHP. How do I declare a dot in punctuation?
Example code as requested:
$sql="SELECT COUNT(*) FROM Table1 WHERE LOWER(location2) REGEXP '.* .$location .*'";
See .* is a regexp and should not be interpreted by PHP as a concatenation.

This is nothing to do with PHP syntax. Your example contains a . inside a quoted string, which PHP interprets as a . inside a quoted string. Therefore nothing is wrong there.
What you're probably experiencing is MySQL treating the . as a wild-card operator in a regular expression. In regular expression syntax (whether in MySQL, PHP, Perl, wherever) . is a wild-card that matches any single character. If you want to include a literal . in your regex, you need to escape it, i.e. \..
Because you are using it inside a string inside a string, you also need to escape the \ character so that it makes it through to the regex correctly. Without testing I would say it needs escaping twice (once for PHP and once for MySQL), e.g. "'\\\\.'" in PHP becomes '\\.' in MySQL, becomes \. in the regular expression.
(Obviously, only escape the . characters that are meant to be treated literally - I would assume the .* is meant to match any character - these should not be changed.)
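The escaping layers described above can be checked from PHP itself. This is only a minimal sketch: the value of $location is made up for illustration, and it assumes MySQL's default mode, in which a backslash inside a quoted SQL string acts as an escape character.

```php
<?php
// Hypothetical search term; real input should be bound or escaped before use.
$location = 'st. louis';

// Four backslashes in PHP source become two in the runtime string.
$literalDot = "\\\\.";

// Replace each '.' in the search term with the doubly escaped form.
$escaped = str_replace('.', $literalDot, $location);

// The .* parts stay untouched: they are meant as wild-cards.
$sql = "SELECT COUNT(*) FROM Table1 WHERE LOWER(location2) REGEXP '.* " . $escaped . " .*'";

echo strlen($literalDot), "\n"; // 3 characters: backslash, backslash, dot
echo $sql, "\n";
```

MySQL then turns the two backslashes in the SQL string into one, so the regular expression engine finally sees \. and matches a literal dot.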

Related

RegExp Capture literals

I need a way to strip all literals from PHP files. My current regexp solution works fine when there is no nested quotes in the string. Tried updating it to handle escaped quotes as well, which did work in most cases, except when there are escaped escape characters in the string.
This is what it should be able to handle, if this should be done correctly
"text"
"\"text\""
"\\"
"\"\\\""
So as I see it, it needs to handle cases where there are an even amount of escape characters and cases where there are an uneven amount. But how do you get this into regexp?
Update
I want to clean up PHP files to make them easier to search through and index different parts, something for a small project that I am playing with. Since literals can contain mostly anything, they can also contain data similar to some of the searches. So I want to remove anything in the files that is wrapped in " or '.
"/\"[^\"]*\"/"
This will work unless there is a nested quote "\"data\"".
"/\"(\\\\\"|[^\"])*\"/"
This will work unless there is "\\"
This is what I need
$var = "...";
Becomes
$var = ;
You could use this regular expression based substitution:
Find: ((?<!\\)(?:\\.)*)(["'])(?:\\.|(?!\2).)*?\2
Replace: $1
Note that if you are going to use this regular expression in PHP (where you encode it as a string literal) you need to escape the backslashes and quote in that regular expression, so like this:
preg_replace("~((?<!\\\\)(?:\\\\.)*)([\"'])(?:\\\\.|(?!\\2).)*?\\2~s", "$1", $input);
As PHP string literals can span multiple lines, the s modifier is added so that . matches newline characters also.
See it run on eval.in
NB: You'll need to think about heredoc notation also...
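As a quick check, the substitution above can be run against the four sample literals from the question. This is a sketch only: the variable names and the nowdoc wrapper are made up for the test, and heredoc syntax is not handled.

```php
<?php
// The pattern from the answer, written as a double-quoted PHP string.
$pattern = "~((?<!\\\\)(?:\\\\.)*)([\"'])(?:\\\\.|(?!\\2).)*?\\2~s";

// Nowdoc keeps the sample source exactly as typed, backslashes included.
$input = <<<'SRC'
$a = "text";
$b = "\"text\"";
$c = "\\";
$d = "\"\\\"";
SRC;

// Every quoted literal is removed; the rest of each line is kept.
$stripped = preg_replace($pattern, '$1', $input);
echo $stripped, "\n";
```

Each of the four lines comes out with the literal removed, e.g. `$a = ;`.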

preg_match not reading results: Delimiter issue?

I have a part of a string. I need to check if it would equal another string when I add a few symbols on. However, my use of delimiters (I believe) is not allowing for the matches to take place.
My IF statement:
if (preg_match("{" . "$words[$counter_words]" . "[<]N}", "$corpus[$counter_corpus]"))
My corpus:
{3(-)D[<]AN}
{dog[<]N}
{4(-)H(')er[<]N}
{4(-)H[<]A}
{A battery[<]h}
My partial array is as follows
dog
cat
3-D
plant
My goal is to match "dog" with "{dog[<]N}" (the [] and {} are delimiters). To try to compensate for this, I glue delimiters to the start and end of the string. Preg_match accepts it, but does not match the two together.
What would be the solution to this? I cannot find or think of a solution. Your help is greatly appreciated.
if (preg_match("{" . $words[$counter_words] . "\\[<\\]N}", $corpus[$counter_corpus]))
[ and ] have special meaning in regular expressions. If you don't want the special meaning you need to escape them, \[. But because this is inside a PHP string, to get a \ character, you must enter \\.
I have a part of a string. I need to check if it would equal another string when I add a few symbols on.
To check if a string equals another one, we don't need preg_match(); this would do:
if ("{{$words[$counter_words]}[<]N}" == $corpus[$counter_corpus])
If you use preg_match(), you have to heed the PCRE regex syntax:
When using the PCRE functions, it is required that the pattern is enclosed
by delimiters. A delimiter can be any non-alphanumeric,
non-backslash, non-whitespace character.
Often used delimiters are forward slashes (/), …
The added symbols { and } at the start and end of your string were taken as pattern-enclosing delimiters, while you meant them to be part of the pattern. You have to add actual delimiters:
if (preg_match("/{{$words[$counter_words]}\[<]N}/", $corpus[$counter_corpus]))
Also of interest:
Double quoted
Variable parsing / Complex (curly) syntax
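A way to avoid hand-escaping the brackets is to let preg_quote() do it and to add explicit delimiters. The sketch below uses the corpus and word list from the question; the loop and the $matches array are assumptions made for illustration.

```php
<?php
$corpus = ['{3(-)D[<]AN}', '{dog[<]N}', '{4(-)H(\')er[<]N}', '{4(-)H[<]A}', '{A battery[<]h}'];
$words  = ['dog', 'cat', '3-D', 'plant'];

$matches = [];
foreach ($words as $word) {
    // preg_quote() escapes {, }, [, ], < and the '/' delimiter for us.
    $pattern = '/^' . preg_quote('{' . $word . '[<]N}', '/') . '$/';
    foreach ($corpus as $entry) {
        if (preg_match($pattern, $entry) === 1) {
            $matches[] = $word . ' => ' . $entry;
        }
    }
}

print_r($matches);
```

With this data only "dog" finds its corpus entry. Because the pattern is anchored at both ends, a plain string comparison would give the same answer here.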

why 3 backslash equal 4 backslash in php?

<?php
$a='/\\\/';
$b='/\\\\/';
var_dump($a);//string '/\\/' (length=4)
var_dump($b);//string '/\\/' (length=4)
var_dump($a===$b);//boolean true
?>
Why is the string with 3 backslashes equal to the string with 4 backslashes in PHP?
And can we use the 3-backslash version in regular expression?
The PHP reference says we must use 4 backslashes.
Note:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
$b='/\\\\/';
php parses the string literal (more or less) character by character. The first input symbol is the forward slash. The result is a forward slash in the result (of the parsing step) and the input symbol (one character, the /) is taken away from the input.
The next input symbol is a backslash. It's taken from the input and the next character/symbol is inspected. It's also a backslash. That's a valid combination, so the second symbol is also taken from the input and the result is a single backslash (for both input symbols).
The same with the third and fourth backslash.
The last input symbol (within the literal) is the forwardslash -> forwardslash in the result.
-> /\\/
Now for the string with three backslashes:
$a='/\\\/';
php "finds" the first backslash, the next character is a backslash - that's a valid combination resulting in one single backslash in the result and both characters in the input literal taken.
php then "finds" the third backslash, the next character is a forward slash, this is not a valid combination. So the result is a single backslash (because php loves and forgives you....) and only one character taken from the input.
The next input character is the forward-slash, resulting in a forwardslash in the result.
-> /\\/
=> both literals encode the same string.
It is explained in the documentation on the page about Strings:
Under the Single quoted section it says:
The simplest way to specify a string is to enclose it in single quotes (the character ').
To specify a literal single quote, escape it with a backslash (\). To specify a literal backslash, double it (\\). All other instances of backslash will be treated as a literal backslash.
Let's try to interpret your strings:
$a='/\\\/';
The forward slashes (/) have no special meaning in PHP strings, they represent themselves.
The first backslash (\) escapes the second backslash, as explained in the first sentence from the second paragraph quoted above.
The third backslash stands for itself, as explained in the last sentence of the above quote, because it is not followed by an apostrophe (') or a backslash (\).
As a result, the variable $a contains this string: /\\/.
On
$b='/\\\\/';
there are two backslashes (the second and the fourth) that are escaped by the first and the third backslash. The final (runtime) string is the same as for $a: /\\/.
Note
The discussion above is about the encoding of strings in PHP source. As you can see, there is always more than one (correct) way to encode the same string. Another option (besides string literals enclosed in single or double quotes, or heredoc or nowdoc syntax) is to use constants (for literal backslashes, for example) and build the strings from pieces.
For example:
define('BS', '\\'); // one literal backslash; a lone '\' would escape the closing quote and is a syntax error
$c = '/'.BS.BS.'/';
keeps the escaping in a single place. The constant BS contains a literal backslash and is used wherever a backslash is needed for its intrinsic value. Where a backslash is needed for escaping, a real backslash must be used (there is no way to use BS for that).
The escaping in regex is a different thing. First, the regex is parsed at the runtime and at runtime $a, $b and $c above contain /\\/, no matter how they were generated.
Then, in a regex, a backslash followed by a non-alphanumeric character makes that character literal whether or not it is special, so an unnecessary escape is harmless (compare with PHP above, where such a backslash is kept as a literal backslash).
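A short check of both points, as a sketch (the subject strings are made up): the two literals produce the same runtime string, and that string, used as a regex, matches one literal backslash.

```php
<?php
$a = '/\\\/';   // three backslashes in the source
$b = '/\\\\/';  // four backslashes in the source

// In single quotes \d is not an escape, so the backslash is kept.
$subject = 'C:\dir';

var_dump($a === $b);                           // bool(true)
var_dump(strlen($a));                          // int(4)
var_dump(preg_match($a, $subject));            // int(1)
var_dump(preg_match($b, 'no backslash here')); // int(0)
```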
Combining PHP & regex
There are endless possibilities to make things complicated. Let's try to keep them simple and set some guidelines for regex in PHP:
enclose the regex string in apostrophes ('), if possible; this way there are only two characters that need to be escaped for PHP: the apostrophe and the backslash;
when parsing URLs, paths or other strings that can contain forward slashes (/), use #, ~ or ! as the regex delimiter (whichever is not used in the regex itself); this way there is no need to escape the delimiter when it is used inside the regex;
don't escape characters in the regex when it's not needed; for example, the dash (-) has a special meaning only when it is used in character classes; outside them it's useless to escape it (and even in character classes it can be used unescaped without having any special meaning if it is placed as the very first or the very last character inside the [...] enclosure).
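The guidelines above could look like this in practice. The URL and the variable names are made up for the example.

```php
<?php
$url = 'https://example.com/docs/to-file.php';

// Single quotes and '#' as delimiter: the forward slashes need no escaping.
preg_match('#^https?://([^/]+)/#', $url, $hostMatch);

// '~' as delimiter; the dash is last in the class, so it is literal and unescaped.
$isSlug = preg_match('~^[a-z0-9-]+$~', 'to-file') === 1;

echo $hostMatch[1], "\n"; // example.com
var_dump($isSlug);        // bool(true)
```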

+ Symbol in Regular Expression PHP

I want to use preg_match and regular expression in PHP to check that a string starts with either "+44" or "0", but how can I do this without the + being read as matching the preceding character once or more? Would (+44|0) work?
Use ^ to anchor the match at the start of the string and a backslash \ to escape the + character. So you'll check for
^\+44|^0
Note that there are no spaces around the |, since a space in a pattern matches a literal space. In PHP the pattern also needs delimiters. To store it in a string you don't need to double the backslash, just use single quotes:
$regexp = '/^\+44|^0/';
In fact, double quotes work here too, because \+ is not a PHP escape sequence and is left untouched:
$regexp = "/^\+44|^0/";
The backslash is the default escape character for regular expressions. You may have to escape the backslash itself as well if it is used in a PHP string, so you'd use something like "(\\+44|0)" as string constant. The regular expression itself would then be (\+44|0).
You can do it several ways. Amongst those I know two:
One is escaping the + with escape character(i.e. back slash)
^(\+44|0)
Or placing the + inside the character class [] where it means the character as it's literal meaning.
^([+]44|0)
^ is the anchor character that means the start of the string/line based on your flag(modifier).
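Put together, the check might look like the sketch below. The function name and the sample numbers are made up for illustration.

```php
<?php
// Returns true when the number starts with "+44" or "0".
function startsWithUkPrefix(string $number): bool
{
    // Single quotes: the backslash reaches PCRE untouched.
    return preg_match('/^(\+44|0)/', $number) === 1;
}

var_dump(startsWithUkPrefix('+447911123456')); // bool(true)
var_dump(startsWithUkPrefix('07911123456'));   // bool(true)
var_dump(startsWithUkPrefix('447911123456'));  // bool(false)
```

The character class form '/^([+]44|0)/' behaves the same way.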

To double escape or not to double escape in PHP PCRE functions?

I was looking for a solid article on when double escaping is necessary and when it is not, but I was not able to find anything. Perhaps I didn't look hard enough, because I'm sure there is an explanation out there somewhere, but lets just make it easy to find for the next guy that has this question!
Take for example the following regex patterns:
/\n/
/domain\.com/
/myfeet \$ your feet/
Nothing ground breaking right? OK, lets use those examples within the context of PHP's preg_match function:
$foo = preg_match("/\n/", $bar);
$foo = preg_match("/domain\.com/", $bar);
$foo = preg_match("/myfeet \$ your feet/", $bar);
To my understanding, a backslash in the context of a quoted string value escapes the following character, and the expression is being given via a quoted string value.
Would the previous be like doing the following, and wouldn't this cause an error?:
$foo = preg_match("/n/", $bar);
$foo = preg_match("/domain.com/", $bar);
$foo = preg_match("/myfeet $ your feet/", $bar);
Which is not what I want right? those expressions are not the same as above.
Would I not have to write them double escaped like this?
$foo = preg_match("/\\n/", $bar);
$foo = preg_match("/domain\\.com/", $bar);
$foo = preg_match("/myfeet \\$ your feet/", $bar);
So that when PHP processes the string it escapes the backslash to a backslash which is then left in when its passed to the PCRE interpreter?
Or does PHP just magically know that I want to pass that backslash to the PCRE interpreter... i mean how does it know I'm not trying to \" escape a quote that I want to use in my expression? or are only double slashes required when using an escaped quote? And for that matter, would you need to TRIPLE escape a quote? \\\" You know, so that the quote is escaped and a double is left over?
Whats the rule of thumb with this?
I just did a test with PHP:
$bar = "asdfasdf a\"ONE\"sfda dsf adsf me & mine adsf asdf asfd ";
echo preg_match("/me \$ mine/", $bar);
echo "<br /><br />";
echo preg_match("/me \\$ mine/", $bar);
echo "<br /><br />";
echo preg_match("/a\"ONE\"/", $bar);
echo "<br /><br />";
echo preg_match("/a\\\"ONE\\\"/", $bar);
echo "<br /><br />";
Output:
0
1
1
1
So, it looks like somehow it doesn't really matter for quotes, but for the dollar sign, a double escape is required as I thought.
Double quoted strings
When it comes to escaping inside double quotes, the rule is that PHP will inspect the character(s) immediately following the backslash.
If the neighboring character is in the set ntrvef\$" or if a numeric value follows it (rules can be found here) it gets evaluated as the corresponding control character or ordinal (hexadecimal or octal) representation, respectively.
It's important to note that if an invalid escape sequence is given, the expression is not evaluated and both the backslash and character remain. This is different from some other languages where an invalid escape sequence would cause an error instead.
E.g. "domain\.com" will be left as is.
Note that variables get expanded inside double quotes as well, e.g. "$var" needs to be escaped as "\$var".
Single quotes strings
Since PHP 5.1.1, a backslash inside single quoted strings is kept as is unless it is followed by another backslash or a single quote (\\ and \' are the only escapes), and no variables get substituted either. This is by far the most convenient feature of single quoted strings.
Regular expressions
For escaping regular expressions, it's best to leave escaping to preg_quote():
$foo = preg_match('/' . preg_quote('mine & yours', '/') . '/', $bar);
This way you don't have to worry about which characters need to be escaped, so it works well for user input.
See also: preg_quote
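A small sketch of what preg_quote() produces for text containing regex metacharacters; the subject and needle strings are made up for the example.

```php
<?php
$bar = 'my feet $ your feet at domain.com';

// Text that should be matched literally, e.g. taken from user input.
$needle = 'feet $ your';

// The second argument also escapes the delimiter if it occurs in the text.
$pattern = '/' . preg_quote($needle, '/') . '/';

echo $pattern, "\n";                 // /feet \$ your/
var_dump(preg_match($pattern, $bar)); // int(1)

// The dot is escaped too, so "domainXcom" no longer matches.
$dotPattern = '/' . preg_quote('domain.com', '/') . '/';
var_dump(preg_match($dotPattern, 'domainXcom')); // int(0)
```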
Update
You added this test:
"/me \$ mine/"
This gets evaluated as "/me $ mine/"; but in PCRE the $ has a special meaning (it's an end-of-subject anchor).
"/me \\$ mine/"
This is evaluated as "/me \$ mine/": the backslash is escaped for PHP itself, whereas the $ is escaped for PCRE. This only works by accident btw, because the $ happens to be followed by a space rather than a variable name.
$var = 'something';
"/me \\$var mine/"
This gets evaluated as "/me \something mine/", so you need to escape the $ again.
"/me \\\$var mine/"
Use single quotes. They prevent escape sequences from occurring.
For example:
php > print "hi\n";
hi
php > print 'hi\n';
hi\nphp >
Whenever you have an invalid escape sequence, PHP actually leaves the characters literally in the string. From the documentation:
As in single quoted strings, escaping any other character will result in the backslash being printed too.
I.e. "\&" really is interpreted as "\&". There are not that many escape sequences, so in most cases you probably get away with a single backslash. But for consistency, escaping the backslash might be a better choice.
As always: Know what you are doing :)
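To see what PHP actually hands to PCRE, the double-quoted patterns discussed above can be compared with single-quoted spellings of the same runtime strings. This is a sketch; the $pairs table is made up for the comparison.

```php
<?php
$var = 'something';

// Each row: double-quoted source, then the same runtime string in single quotes.
$pairs = [
    ["/me \$ mine/",      '/me $ mine/'],
    ["/me \\$ mine/",     '/me \$ mine/'],
    ["/me \\$var mine/",  '/me \something mine/'],
    ["/me \\\$var mine/", '/me \$var mine/'],
    ["/domain\.com/",     '/domain\.com/'],
];

$allEqual = true;
foreach ($pairs as [$double, $single]) {
    echo $double, "\n"; // the string PCRE receives
    $allEqual = $allEqual && ($double === $single);
}

var_dump($allEqual); // bool(true)
```

In the single-quoted column what you type is what PCRE receives, which is the main argument for preferring single quotes.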
OK So I did some more testing and discovered the RULE OF THUMB when encapsulating a PCRE in DOUBLE QUOTES, the following holds true:
$ - Requires double escape because PHP will interpret that as the beginning of a variable if text is immediately following it. Left unescaped and it will indicate the end of your needle and will break.
\r\n\t\v - Special PHP string escapes, single escape required only.
[\^$.|?*+() - Special RegEx characters, require single escape only. Double escape does not seem to break expressions when used unnecessarily.
" - Quotes are obviously going to have to be escaped due to the encapsulation, but only need to be escaped once.
\ - Searching for a backslash? Using the double quote encapsulation of your expression, this will require four backslashes in total: \\\\ (PHP reduces them to \\, which PCRE reads as one literal backslash).
Anything I'm missing?
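The rules of thumb above can be tried out in one go. In this sketch the subject string is invented so that it contains a dollar sign, a dot, quotes and a backslash.

```php
<?php
// Single quotes: \t stays a backslash followed by t, $5 stays as typed.
$subject = 'C:\temp holds "a" file worth $5 from domain.com';

$results = [
    'dollar'    => preg_match("/\\$5/", $subject),       // PCRE sees \$5
    'dot'       => preg_match("/domain\.com/", $subject), // PCRE sees domain\.com
    'quote'     => preg_match("/\"a\"/", $subject),       // PCRE sees "a"
    'backslash' => preg_match("/C:\\\\temp/", $subject),  // PCRE sees C:\\temp
];

print_r($results);
```

All four patterns match, so every entry in $results is 1.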
I'll start saying that all I'll write below is not exactly what happens, but, for clarity, I'll simplify it.
Imagine that there are two evaluations happening when using regular expressions: the first being done by PHP and the second being done by PCRE, as if they were separate engines. And for our bad luck,
PHP AND PCRE EVALUATE THINGS IN DIFFERENT WAYS.
We have 3 "guys" here: 1) the USER; 2) the PHP and; 3) the PCRE.
The USER communicates with PHP by writing the CODE, which is exactly what you type in a code editor.
PHP then evaluates this CODE and sends another bit of information to PCRE. This bit of information is different from what you typed in your CODE.
PCRE then evaluates it and returns something to PHP, that evaluates this response and returns something to the USER.
I'll explain better in the example below. There I'm going to use the backslash ("\") to illustrate what's going on.
Assume this bit of CODE in a php file:
<?php
$sub = "A backslash \ in a string";
$pat1 = "#\#";
$pat2 = "#\\#";
$pat3 = "#\\\#";
$pat4 = "#\\\\#";
echo "sub: ".$sub;
echo "\n\n";
echo "pat1: ".$pat1;
echo "\n";
echo "pat2: ".$pat2;
echo "\n";
echo "pat3: ".$pat3;
echo "\n";
echo "pat4: ".$pat4;
?>
This will print:
sub: A backslash \ in a string
pat1: #\#
pat2: #\#
pat3: #\\#
pat4: #\\#
In this example, there is no regular expression involved, so there is only the PHP evaluation of the code happening.
PHP leaves a backslash as is if it doesn't precede any special character. That's why it prints the backslash correctly in $sub.
PHP evaluates $pat1 and $pat2 EXACTLY the same, because in $pat1 the backslash is left as is, and in $pat2 the first backslash escapes the second, resulting in a single backslash.
Now, in $pat3, the first backslash escapes the second, resulting in one backslash. Then PHP evaluates the third backslash and leaves it as is because it is not preceding anything special. The result is going to be the double backslash.
Now someone could say "but now we have two backslashes again! shouldn't the first one escape the second one again?!"
The answer is "No". After PHP evaluates the first two backslashes into a single one, it doesn't look back again, and keeps moving on evaluating what is next.
At this point you already know what's going on with $pat4: the first backslash escapes the second and the third escapes the fourth, leaving two in the end.
Now that it's clear what PHP is doing to these strings, let's add some more code after the previous one.
if (preg_match($pat1, $sub)) echo "test1: true"; else echo "test1: false";
echo "\n";
if (preg_match($pat2, $sub)) echo "test2: true"; else echo "test2: false";
echo "\n";
if (preg_match($pat3, $sub)) echo "test3: true"; else echo "test3: false";
echo "\n";
if (preg_match($pat4, $sub)) echo "test4: true"; else echo "test4: false";
And the result is:
test1: false
test2: false
test3: true
test4: true
So, what's going on here is that PHP is not sending "what you typed" in the CODE directly to PCRE. Instead, PHP is sending what it has evaluated previously (which are exactly what we saw above).
For test1 and test2, even though we have written different patterns in the CODE for each test, PHP is sending the same pattern #\# to PCRE. The same thing happens for test3 and test4: PHP is sending #\\#. So, the results for test1 and test2 are the same, as well as for test3 and test4.
Now, what's going on when PCRE evaluates these patterns? PCRE doesn't act like PHP.
In test1 and test2, the pattern that arrives is #\#, and it is not left as is. The failure happens even before the regex is compiled: when PHP looks for the closing delimiter it treats \# as an escaped delimiter, so it never finds the end of the pattern. preg_match() then emits a "No ending delimiter '#' found" warning and returns false, and that is what the rest of the CODE gets (in this example, the if () statement).
In test3 and test4, things go as we now expect: PCRE evaluates the first backslash as escaping the second, resulting in a single backslash. That of course matches the $sub string and returns a "successful message" to PHP, which evaluates it as "true".
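The two outcomes can be told apart by looking at the return value instead of only testing it in an if. A minimal sketch, assuming the same $sub as above; the @ operator is used here only to hide the warning that the broken pattern triggers.

```php
<?php
$sub = "A backslash \ in a string";

// "#\#" reaches preg_match() as #\# : the \# reads as an escaped delimiter,
// so no closing delimiter is found and the call fails.
$single = @preg_match("#\#", $sub);

// "#\\\\#" reaches preg_match() as #\\# : one escaped, literal backslash.
$double = preg_match("#\\\\#", $sub);

var_dump($single); // bool(false): an error, not "no match"
var_dump($double); // int(1): one match found
```

A return value of false signals an error, while 0 would mean the pattern was valid and simply did not match.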
ANSWERING QUESTIONS
Some characters are special to PHP (e.g. n for NEW LINE, t for TAB).
Some characters are special to PCRE (e.g. . (dot) to match any character, s to match whitespaces).
And some characters are special to both (e.g. $ to php is the beginning of the name of a variable and to PCRE it asserts the end of the subject).
That's why you need to escape newlines just once, like this \n. PHP will evaluate it as the REAL character NEW LINE and send that to PCRE.
For the dot, if you want to match that specific character, you should use \. and PHP will do nothing because the dot isn't a special character to PHP in a string. Instead, it will send them as is to PCRE. Now on PCRE, it will "see" a backslash preceding a dot and understand that it should match that specific character. If you use a double escape \\. the first backslash will escape the second, leaving you with the same result.
And if you want to match a dollar sign in a string, then you should use \\\$. In PHP, the first backslash will escape the second one, leaving a single backslash. Then the third backslash will escape the dollar sign. In the end, the result is \$. This is what PCRE will receive. PCRE will see that backslash and understand that the dollar sign is not asserting end of subject, but the literal character.
QUOTES
And now we've come to quotes. The problem with them is the fact that PHP evaluates a string in different ways, depending on the quotes used to surround it. Check it out: Strings
All I said until this point is valid for double quotes.
If you try this '\n' in single quotes, PHP will evaluate that backslash as a literal one.
But, if it is used in a regular expression, PCRE will get this string as is. And since n is also special to PCRE, it will interpret that as a newline character, and BOOM, it "magically" matches a newline in a string.
Check the escape sequences here: Escape Sequences
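Both routes to a newline match can be seen side by side in this sketch (the sample text is made up):

```php
<?php
$text = "line one\nline two";

$singleQuoted = '/\n/'; // PHP keeps backslash + n; PCRE reads \n as a newline
$doubleQuoted = "/\n/"; // PHP has already turned \n into a real newline character

var_dump(strlen($singleQuoted)); // int(4)
var_dump(strlen($doubleQuoted)); // int(3)
var_dump(preg_match($singleQuoted, $text)); // int(1)
var_dump(preg_match($doubleQuoted, $text)); // int(1)
```

The two pattern strings differ in length, yet both match, because the newline is resolved by PCRE in the first case and by PHP in the second.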
As I said in the beginning, things are not exactly as I tried to explain here, but I really hope it helps (and doesn't make it more confusing than it already is).
