Regex to count the instances of '?' that are in quotes - php

I'm trying to make a regex that will count every question mark that is inside quotes. The regex is being tested in JavaScript, but I intend to use it in PHP, if that matters. I have something that kind of works, but not well enough.
Here it is.
/(\"|\')(([^\"\'\\]|\\.)*)\?(([^\"\'\\]|\\.)*(\"|\'))/g
As you can probably see I also want to ignore escaped quotes.
Say I have the string "hello? \"world?\"". This will return 1 which is correct.
But as for this "hello? \"world??\"". This will also return 1, but what I want is 2. How can I accomplish this?
Also extra love if I can get a regex that is the exact opposite of this (counting question marks that are NOT in quotes).
Here's the whole function used for this test if it helps.
function countTest(str) {
    var regx = /(\"|\')(([^\"\'\\]|\\.)*)\?(([^\"\'\\]|\\.)*(\"|\'))/g;
    var test = str.match(regx);
    test = test ? test.length : 0;
    console.log(test);
}
EDIT:
Also! I noticed from my own typo in this question the string hello \"world?\'" will also return 1. That seems easy to fix though.

So I just tried it, and apparently the reason 1 is displayed is that test is treated not as a String, but as an array. So I would replace
test = test ? test.length : 0;
by
test = test ? new String(test).length - 3 * test.length + 1: 0;
In other words, you are subtracting the 3 characters "," from the String value of the array, but add the 1 because there is no comma character for the first array element.
Edit: For the counting question marks outside quotes, just simply subtract the number given above from the number of total question marks in the string.
Edit: for PHP, you would probably use strlen(implode(',', $test)) instead of new String(test).length, with the multiplication and addition constants adjusted as necessary.
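A different way to get both counts in PHP is to match each quoted segment as a whole and then count the question marks inside it. The following is only a sketch under assumptions: countQuestionMarks is a hypothetical helper, backslash escapes are assumed to occur only inside quoted segments, and a stray apostrophe in ordinary text (as in "don't") can still be mistaken for an opening quote.

```php
<?php
// Hypothetical helper (not from the question): returns [inside, outside] counts
// of '?' relative to single- or double-quoted segments.
function countQuestionMarks(string $str): array
{
    // A quote, then escape pairs or any character that is neither a backslash
    // nor the opening quote, then the same quote again.
    $pattern = '/(["\'])(?:\\\\.|(?!\1)[^\\\\])*\1/s';
    preg_match_all($pattern, $str, $matches);

    $inside = 0;
    foreach ($matches[0] as $segment) {
        $inside += substr_count($segment, '?');
    }
    // Everything that is not inside a quoted segment counts as outside.
    $outside = substr_count($str, '?') - $inside;

    return [$inside, $outside];
}

[$inside, $outside] = countQuestionMarks('hello? "world??"');
echo "$inside inside, $outside outside\n"; // 2 inside, 1 outside
```

Because the closing quote must be the same character as the opening one, a string such as hello "world?' yields no quoted segment at all, so its question mark is counted as outside.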


Unexpected string length when using strlen in php

I am stuck on this issue because the result is quite unexpected. I am calculating the length of a hashed keyword, and strlen is giving me an unexpected result.
echo strlen("$2a$08$MphfRBNtQMLuNro5HOtw3Ovu20cLgC0VKjt6w7zrKXfj1bv8tNnNa");
Output - 6
Let me know the reason for this and why it is outputting 6 as a result.
Codepad Link - http://codepad.org/pLARBx6F
You must use single quotes '. With the double quotes ("), due to the $ in your string, parts of it get interpreted as variables.
Generally, it's not a bad idea to get accustomed to using single quotes unless you specifically need doubles.
Look at the "variables" contained here. They would be $2a, $08, and $MphfRBNtQM......
The first two couldn't be variables as they start with a number, thus, the 6 characters. The third one indeed could be a proper variable, but since it isn't set, it's empty.
Use the below code to calculate the string length -
echo strlen('$2a$08$MphfRBNtQMLuNro5HOtw3Ovu20cLgC0VKjt6w7zrKXfj1bv8tNnNa');
You need to use single quotes: at the third occurrence of the $ symbol, a letter follows it, so everything after it is treated as a variable name. Only the 6 characters before that third $ remain, which is why you were getting a string length of 6.
Try following
<?php
echo strlen('$2a$08$MphfRBNtQMLuNro5HOtw3Ovu20cLgC0VKjt6w7zrKXfj1bv8tNnNa');
?>
If you change your string and remove all the '$' signs except the first one, it will work fine even in double quotes, because a $ followed by a letter has a special meaning in PHP.
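The effect described in these answers can be reproduced with a shortened string. This is a small sketch, not the original hash: the value '$2a$08$abc' and the variable $abc are made up for illustration, and $abc is set to an empty string so that it behaves like the unset variable without raising a warning.

```php
<?php
// $abc is deliberately defined as an empty string to stand in for the unset
// variable from the question without triggering an "undefined variable" warning.
$abc = '';

$double  = "$2a$08$abc";      // "$2" and "$0" stay literal, "$abc" is interpolated
$single  = '$2a$08$abc';      // no interpolation at all
$escaped = "\$2a\$08\$abc";   // escaping each "$" also prevents interpolation

echo strlen($double), "\n";   // 6
echo strlen($single), "\n";   // 10
echo strlen($escaped), "\n";  // 10
```

Escaping every $ with a backslash is an alternative to single quotes when the string must stay double-quoted for other reasons.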

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
    $regex = "magic regex I'm looking for here";
    $matchingLines = array();
    preg_match_all($regex, $subject, $matchingLines);
    return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, and performance is acceptable. What bothers me about it, though, is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)/ (with no trailing $, because the shorter alternatives do not consume all four characters), but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.F|..G)|.C(?:F|.G)|..FG)/ - but that's still the same solution, just in a more efficiently-processed form.)
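For comparison, the character-by-character approach mentioned above might look roughly like this. It is a sketch only: countSharedPositions and filterByShared are hypothetical names, and the lines are assumed to be already loaded into an array without trailing newlines.

```php
<?php
// Hypothetical helper: counts positions at which two lines hold the same character.
function countSharedPositions(string $a, string $b): int
{
    $shared = 0;
    $length = min(strlen($a), strlen($b));
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] === $b[$i]) {
            $shared++;
        }
    }
    return $shared;
}

// Keeps the lines that share at least $minimum positions with $line.
function filterByShared(string $line, array $lines, int $minimum = 2): array
{
    return array_values(array_filter(
        $lines,
        fn (string $candidate): bool => countSharedPositions($line, $candidate) >= $minimum
    ));
}

print_r(filterByShared('ACFG', ['BCEG', 'ADFG', 'BDEH', 'BCFG']));
// keeps BCEG, ADFG and BCFG; BDEH shares no position with ACFG
```

Changing the threshold or the line length needs no change to the code, which is the scaling concern raised in the question.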
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
    static $file = null;
    // FILE_IGNORE_NEW_LINES keeps the trailing "\n" out of the comparison
    if (!$file) $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES);
    $search = str_split($line);
    $matching = array();
    foreach ($file as $l) {
        $test = str_split($l);
        $matches = count(array_intersect($search, $test));
        if ($matches >= 2) // define number of matches required here - optionally make it an argument
            $matching[] = $l;
    }
    // empty array if no matches
    return $matching;
}
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
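Generating that alternation from the input can be automated by enumerating every pair of positions. The helper below is a hypothetical sketch (buildPairPattern is not from the question) and covers only the "at least two" case.

```php
<?php
// Hypothetical helper: builds the "any two positions match" pattern for a line.
function buildPairPattern(string $line): string
{
    $length = strlen($line);
    $alternatives = [];
    for ($i = 0; $i < $length; $i++) {
        for ($j = $i + 1; $j < $length; $j++) {
            // Start from all wildcards, then pin the two chosen positions.
            $parts = array_fill(0, $length, '.');
            $parts[$i] = preg_quote($line[$i], '/');
            $parts[$j] = preg_quote($line[$j], '/');
            $alternatives[] = implode('', $parts);
        }
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

$pattern = buildPairPattern('ACFG');
echo $pattern, "\n"; // /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m

preg_match_all($pattern, "BCEG\nADFG\nBDEH\nBCFG", $found);
print_r($found[0]); // BCEG, ADFG, BCFG
```

The number of alternatives grows with the number of position pairs (6 for 4 characters, 120 for 16), so this stays practical only for short lines.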
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
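function findMatchingLines($line, $subject) {
    // The reference line is appended as the last line; the lookahead captures its
    // four characters so that every earlier line can be compared against them.
    $regex = '/^(?=(?s:.*)\n([AB])([CD])([EF])([GH])\z)'
        . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
    return $matchingLines[0]; // the full-line matches
}
What this function does is append the line you want to match against to your input, then use a pattern whose lookahead skips ahead to that final line (that's the (?s:.*) working) and captures its 4 characters, so that each earlier line is compared back to them through \1 to \4. The line is appended rather than prepended because PCRE lookbehind cannot contain an unbounded quantifier, whereas lookahead can.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern with the appropriate character class (\1\2[EF][GH], etc.).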
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D anywhere in the line, OR E or | or F anywhere in the line, OR G or | or H in the last position.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternates between everything around it, so ^ applies only to the first alternative and $ only to the last. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C|.F|..G)|.C(?:F|.G)|..FG)/
as compared to:
/^(?:AC|A.F|A..G|.CF|.C.G|..FG)/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for FG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go on to test the 2nd position, without testing the first position two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
    // error checking here if necessary - $line and $match must be same length
    return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for at least two occurrences of one of the ACFG characters, surrounded by any other characters on the same line. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    $chars = preg_quote($line, '/');
    // sprintf avoids "{$num}" being consumed by PHP's string interpolation
    return sprintf('/^[^%1$s\n]*+(?:[%1$s][^%1$s\n]*+){%2$d,}$/m', $chars, $num);
}

RegEx and MySQL basics - replacing a varying part of a string

I have a field in my table that holds a string denoting some object levels, like so:
"<3<"
"<3<5<"
"<3<5<49<"
etc.
I have a function that is to remove a level from such a string, without knowing the position of the level in the string itself. Concretely, I would like to remove "3". The result should be:
"0"
"<5<"
"<5<49<"
If I would, however, want to remove 5, and not 3, the result should be this:
"<3<"
"<3<"
"<3<49<"
Lastly, if I chose to remove 49 instead of 3 or 5, I would like to get this:
"<3<"
"<3<5<"
"<3<5<"
As you can see, the position of the substring that is to be removed varies - sometimes it's the leftmost one, sometimes in the middle, sometimes the rightmost one. What is important after all this is:
If the number I am removing is the only value, enclosed in "less than" signs (as in "<3<" while removing 3), the new result must be 0.
If the number I am removing is not the only value, the only thing that matters is that the final notation stays the same - as in, the entire string must remain enclosed in "less than" symbols, and substrings of multiple "less than" symbols in a row must not happen (as in, "3<<5<" is not allowed).
Is there an easy regex way to handle this with php and mysql, or should I just make 3 manual checks?
P.S. While I may have posed it as such, this is not homework but an actual work issue.
For each line, do two replacements (for example, to remove "3"):
replace "^<3<$" -> "0";
replace "<3" -> "";
You can do it in 2 steps.
Suppose your input is this
"<3<"
"<3<5<"
"<3<5<49<"
and you want to remove number 3:
Step 1. Since the values always start with "<", you can try to replace "<3" with "". Then the input becomes
"<"
"<5<"
"<5<49<"
Step 2. Replace strings which EQUALS "<" with 0. Then you can get
"0"
"<5<"
"<5<49<"
It's the same if you want to remove 5 or 49.
I think you can easily use regex to do these steps.
In the first step:
replace "<3(?=<)"
It's important to use a lookahead, otherwise you could be replacing something like the "<3" in "<34<", and that's not what you want.
Second step:
replace "^<$" with "0"
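Put together on the PHP side, the two steps could be sketched as follows. removeLevel is a hypothetical helper name, and the level strings are assumed to always have the "<a<b<...<" shape shown in the question.

```php
<?php
// Hypothetical helper: removes one level from a "<a<b<c<" string.
function removeLevel(string $levels, int $level): string
{
    // Step 1: drop "<N" only when it is followed by "<", so that removing 3
    // does not touch "<34<" or "<30<".
    $result = preg_replace('/<' . $level . '(?=<)/', '', $levels);

    // Step 2: a lone "<" means the removed level was the only one.
    return $result === '<' ? '0' : $result;
}

echo removeLevel('<3<', 3), "\n";       // 0
echo removeLevel('<3<5<49<', 3), "\n";  // <5<49<
echo removeLevel('<3<5<49<', 5), "\n";  // <3<49<
echo removeLevel('<3<5<49<', 49), "\n"; // <3<5<
```

The integer parameter type keeps regex metacharacters out of the pattern, so no preg_quote call is needed here.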

PHP: get numeric value in the end of a given formatted string

I "inherited" a buggy PHP page. I'm not an expert of this language but I think I found the origin of the bug. Inside a loop, the page sends a formatted string to the server: the string I found in the HTML page is like this one:
2011-09-19__full_1
so, it seems we have three parts:
a date (0,10);
a string (10,6);
a final number (17,1);
The code that handles this situation is the following:
$datagrid[] = array(
    "date" => substr($post_array_keys[$i], 0, 10),
    "post_mode" => substr($post_array_keys[$i], 10, 6),
    "class_id" => substr($post_array_keys[$i], 17, 1),
    "value" => $_POST[$post_array_keys[$i]]
);
What happens: the final number can contain more than one character, so this piece:
"class_id"=>substr($post_array_keys[$i], 17, 1)
is not correct because it seems to retrieve only one character starting from the 17th (and this seems to cause strange behaviors to the website).
Since the number is the last part of the string, could I safely change this line as follows to get the entire number?
"class_id"=>substr($post_array_keys[$i], 17, strlen($post_array_keys[$i])-17);
If you change the code the way you suggest you would get the numbers at the end starting in position 17. The original code gets only the first digit. Your code would get all the digits.
And it seems you did your homework: the line
$datagrid[] = array("date"=>substr($post_array_keys[$i], 0, 10),"post_mode"=>substr($post_array_keys[$i], 10, 6),"class_id"=>substr($post_array_keys[$i], 17, 1),"value"=>$_POST[$post_array_keys[$i]]);
does give you a very good clue of what you should expect in the variable:
first 10 is the date
then you have 6 chars for post_mode
then you have 1 char for class_id
If you also confirmed that sometimes the class_id can be more than 1 char, your suggested change would give you the complete class_id at the end.
Good luck.
you could use
$array = explode("_", $string);
This function returns an array with the elements in the string delimited by "_".
I suggest this because the double underscore may hide another value that is empty in that particular case.
If it's only the last integer causing trouble, you can use strrchr to get the "tail" of the string, starting with the last '_'.
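The three suggestions can be compared side by side. This is a sketch with a made-up key, '2011-09-19__full_12', chosen so that the trailing number has two digits.

```php
<?php
// Illustrative key, not taken from the real page.
$key = '2011-09-19__full_12';

// substr without a length argument returns everything from offset 17 onward.
$viaSubstr = substr($key, 17);

// strrchr returns the tail starting at the last "_"; drop that underscore.
$viaStrrchr = substr(strrchr($key, '_'), 1);

// explode yields an empty element for the double underscore.
$parts = explode('_', $key);
$viaExplode = end($parts);

echo $viaSubstr, ' ', $viaStrrchr, ' ', $viaExplode, "\n"; // 12 12 12
print_r($parts); // "2011-09-19", "", "full", "12"
```

The strrchr and explode variants do not depend on the fixed offset 17, so they would keep working if the middle part ever changed length.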

How to split a string and find the occurence of one string in another?

I need to figure out how to do some C# code in PHP, and I'm not sure exactly how.
So first off I need the Split function. I'm going to have a string like
"identifier 82asdjka271akshjd18ajjd"
and I need to split the identifier word from the rest. In C#, I used string.Split(new char{' '}); or something like that (working off the top of my head) and got two strings: the first word, and then the second part. I understand that the PHP split function has been deprecated as of PHP 5.3.0, so that's not an option. What are the alternatives?
I'm also looking for an IndexOf function, so if I had the above string again as an example, I would need the location of 271 in the string, so I can generate a substring.
you can use explode for splitting and strpos for finding the index of one string inside another.
$a = "identifier 82asdjka271akshjd18ajjd";
$arr = explode(' ',$a); // split on space..to get an array of size 2.
$pos = strpos($arr[1],'271'); // search for '271' in the 2nd ele of array.
echo $pos; // prints 8
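A slightly more defensive variant of the same idea is sketched below. It assumes PHP 7.1 or later for the short list syntax; the limit argument of 2 keeps any further spaces inside the second part, and the strict comparison against false matters because strpos can legitimately return 0.

```php
<?php
$input = 'identifier 82asdjka271akshjd18ajjd';

// Split into at most two pieces: the first word and everything after it.
[$identifier, $rest] = explode(' ', $input, 2);

// strpos returns false when the needle is absent, and 0 when it is at the start.
$pos = strpos($rest, '271');
$tail = ($pos !== false) ? substr($rest, $pos) : null;

echo $identifier, "\n"; // identifier
echo $pos, "\n";        // 8
echo $tail, "\n";       // 271akshjd18ajjd
```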
