Suppose I have the following string:
insert into table values ('text1;'); insert into table values ('text2')
How do I break those queries (get each individual query) using regular expressions?
I've found a very similar problem: Use regex to find specific string not in html tag ...but its solution relies on a .NET-specific feature, variable-length lookbehind (PHP complains that the lookbehind is not fixed length).
I would be very grateful if someone could give me some hints on how to deal with this problem.
The trick is to count how many unescaped quote characters you've passed. Assuming that the SQL is syntactically correct, semicolons after an even number of unescaped quote characters will be the ones you want, and semicolons after an odd number of unescaped quote characters will be part of a string literal. (Remember, string literals can contain properly escaped quote characters.)
If you want 100% reliability, you'll need a real SQL parser, like this. (I just Googled "SQL parser in PHP". I don't know if it works or not.)
EDIT:
I don't think it's possible to find pairs of unescaped quote characters using nothing but regex. Maybe a regex guru will prove me wrong, but it just seems too damn difficult to distinguish between escaped and unescaped quote characters in so many possible combinations. I tried look-behind assertions and backrefs with no success.
The following is not a pure-regex solution, but I think it works:
preg_match_all("/(?:([^']*'){2})*[^']*;/U", str_replace("\\'", "\0\1\2", $input), $matches);
$output = array_map(function($str){ return str_replace("\0\1\2", "\\'", $str); }, $matches[0]);
Basically, we temporarily replace escaped quote characters with a string of bytes that is extremely unlikely to occur, in this case \0\1\2. After that, all the quote characters that remain are the unescaped ones. The regex picks out semicolons preceded by an even number of quote characters. Then we restore the escaped quote characters. (I used a closure there, so it requires PHP 5.3 or later.)
If you don't need to deal with quote characters inside string literals, yes, you can easily do it with pure regex.
Assuming proper SQL syntax it would probably be best to split on the semicolon.
The following regexp will work but only if all quotes come in pairs.
/.+?\'.+?\'.*?;|.+?;/
To avoid escaped single quotes:
/.+?[^\\\\]\'.+?[^\\\\]\'.*?;|.+?;/
To handle multiple pairs of single quotes:
/.+?(?:[^\\]\'.+?[^\\]\')+.*?;|.+?;/
Tested against the following data set:
insert into table values ('text1;\' ','2'); insert into table values ('text2'); insert into test3 value ('cookie\'','fly');
Returns:
insert into table values ('text1;\' ','2');
insert into table values ('text2');
insert into test3 value ('cookie\'','fly');
I have to admit this is a pretty dirty regexp. It would not handle any sort of SQL syntax errors at all. I enjoyed the challenge of coming up with a pure regex though.
How do you want to break them up?
You can use explode( ' ', $query ) to transform into an array.
Or if you want to get text1 and text2 values with regexp you can use preg_match( '/(\'([\w]+)\')/', $query, $matches ) where $matches[1] is your value.
preg_match_all( '/([\w ]+([\w \';]+))/', $queries, $matches ) will give you all matches with this pattern of query.
Regexes aren't always good at this type of thing. The following function should work though:
function splitQuery($query) {
    $open = false;
    $buffer = null;
    $parts = array();
    for ($i = 0, $l = strlen($query); $i < $l; $i++) {
        if ($query[$i] == ';' && !$open) {
            $parts[] = trim($buffer);
            $buffer = null;
            continue;
        }
        if ($query[$i] == "'") {
            $open = !$open;
        }
        $buffer .= $query[$i];
    }
    if ($buffer) $parts[] = trim($buffer);
    return $parts;
}
Usage:
$str = "insert into table values ('text1;'); insert into table values ('text2')";
$str = splitQuery($str);
print_r($str);
Outputs:
Array
(
[0] => insert into table values ('text1;')
[1] => insert into table values ('text2')
)
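The function above toggles its state on every single quote, so a backslash-escaped quote inside a literal would confuse it. A minimal sketch of a variant that skips escaped characters follows; the name splitQueryEscaped is hypothetical, and it assumes MySQL-style backslash escapes inside single-quoted literals and syntactically valid SQL.

```php
<?php
// Hypothetical variant of splitQuery(): assumes single-quoted literals with
// backslash escapes. Doubled quotes ('') also work, because the state simply
// toggles twice.
function splitQueryEscaped($query) {
    $inString = false;
    $buffer = '';
    $parts = array();
    for ($i = 0, $l = strlen($query); $i < $l; $i++) {
        $ch = $query[$i];
        if ($inString && $ch === '\\' && $i + 1 < $l) {
            // Copy the backslash and the escaped character verbatim.
            $buffer .= $ch . $query[++$i];
            continue;
        }
        if ($ch === "'") {
            $inString = !$inString;
        } elseif ($ch === ';' && !$inString) {
            // A semicolon outside a literal ends the current statement.
            if (trim($buffer) !== '') {
                $parts[] = trim($buffer);
            }
            $buffer = '';
            continue;
        }
        $buffer .= $ch;
    }
    if (trim($buffer) !== '') {
        $parts[] = trim($buffer);
    }
    return $parts;
}

$sql = "insert into t values ('text1;\\' ','2'); insert into t values ('text2');";
print_r(splitQueryEscaped($sql));
```

It still knows nothing about comments, double-quoted identifiers, or other dialect-specific syntax, so a real SQL parser remains the safer choice.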
I think my question should be easy to solve, but I can't figure it out.
I want to extract these words from my query string:
$string = "INSERT INTO table (a,b) VALUES ('foo', 'bar')";
Expected result:
array one = [a,b]
array two = [foo, bar]
There are many regex strategies you could use for this depending on how flexible you need it to be. Here is one very simple implementation which assumes that the string 'VALUES' is in all caps, and that there is exactly one space between 'VALUES' and each of the two sets of parentheses.
$string = "INSERT INTO table (a,b) VALUES ('foo', 'bar')";
$matches = array();
// we escape literal parentheses in the regex, and also add
// grouping parentheses to capture the sections we're interested in
preg_match('/\((.*)\) VALUES \((.*)\)/', $string, $matches);
// make sure the matches you expect to be found are populated
// before referencing their array indices. index 0 is the match of
// our entire regex, indices 1 and 2 are our two sets of parens
if (!empty($matches[1]) && !empty($matches[2])) {
$column_names = explode(',', $matches[1]); // array of db column names
$values = explode(',', $matches[2]); // array of values
// you still have some clean-up work to do here on the data in those arrays.
// for instance there may be empty spaces that need to be trimmed from
// beginning/ending of some of the strings, and any of the values that were
// quoted need the quotation marks removed.
}
This is only a starting point, be sure to test it on your data and revise the regex as needed!
I recommend using a regex tester to test your regex string against actual query strings you need it to work on. http://regexpal.com/ (There are many others)
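The clean-up work mentioned in the code comments (trimming whitespace and removing quotation marks) could look roughly like this. It is only a sketch: it assumes a single VALUES tuple and no commas or escaped quotes inside the quoted values, and it relaxes the spacing and case assumptions slightly with \s* and the i modifier.

```php
<?php
// Sketch only: assumes one INSERT with a single VALUES tuple and no commas
// or escaped quotes inside the quoted values.
$string = "INSERT INTO table (a,b) VALUES ('foo', 'bar')";

$column_names = array();
$values = array();
if (preg_match('/\((.*?)\)\s*VALUES\s*\((.*)\)/i', $string, $matches)) {
    // Trim whitespace from each column name.
    $column_names = array_map('trim', explode(',', $matches[1]));
    // Trim whitespace first, then strip the surrounding quote characters.
    $values = array_map(function ($v) {
        return trim(trim($v), "'\"");
    }, explode(',', $matches[2]));
}

print_r($column_names);
print_r($values);
```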
I need to retrieve table parent-child relationship from "WHERE" clause like this:
select ... large list of fields with aliases ...
from ... list of joined tables ...
where ((`db_name`.`catalog`.`group` = `db_name`.`catalog_group`.`iden`)
and (`db_name`.`catalog`.`iden` = `db_name`.`catalog_sub`.`parent`))
Is there a regex to get the identifiers from each condition? Say, in an array, element[0] is the table from the left side and element[1] is the table from the right. Identifier names may be anything, so only SQL operators like 'where', 'and' and '=' can serve as keys.
Any help would be greatly appreciated.
CLARIFY
I don't want to parse the WHERE clause condition by condition. I just want the references as such. As far as I can see, a regex could replace all sequences
`.`
to
.
and then match all backticked pairs by
` # ` = ` # `
Backticks around identifiers are always present in all of my queries by default. All string values are surrounded by double quotes by default. I thought it would not be a complex task for a regex guru. Thanks in advance.
PS: It's because the MyISAM engine does not support references (foreign keys), so I am forced to restore them manually.
ENDED UP with:
public function initRef($q) {
$s = strtolower($q);
// remove all string values within double quotes
$s = preg_replace('|"(\w+)"|', '', $s);
// split by 'where' clause
$arr = explode('where', $s);
if (isset($arr[1])) {
// remove all spaces and parentheses
$s = preg_replace('/\s|\(|\)/', '', $arr[1]);
// replace `.` with .
$s = preg_replace('/(`\.`)/', '.', $s);
// replace `=` with =
$s = preg_replace("/(`=`)/", "=", $s);
// match pairs within ticks
preg_match_all('/`.*?`/', $s, $matches);
// recreate arr
$arr = array();
foreach($matches[0] as &$match) {
$match = preg_replace('/`/', '', $match); // now remove all backticks
$match = str_replace($this->db . '.', '', $match); // remove db_name
$arr[] = explode('=', $match); // split by = sign
}
$this->pairs = $arr;
} else {
$this->pairs = 0;
}
}
Using a regular expression seems like it won't help you. What if there are subqueries? What if your query contains a string with the text "WHERE" in it? Hakre mentioned it in a comment above, but your best bet really is using something that can actually interpret your SQL so that you can find what really is a proper WHERE clause and what is not.
If you insist on doing this the "wrong" way instead of by using some context aware parser, you would have to find the WHERE clause, for instance like this:
$parts = explode('WHERE', $query);
Assuming there is only one WHERE clause in your query, $parts[1] will then contain everything from the WHERE onwards. After that you would have to detect all valid clauses like ORDER BY, GROUP BY, LIMIT, etc. that could follow, and break off your string there. Something like this:
$parts = preg_split("/(GROUP BY|ORDER BY|LIMIT)/", $parts[1]);
$where = $parts[0];
You would have to check the documentation for your flavor of SQL and the types of queries (SELECT, INSERT, UPDATE, etc.) you want to support for a full list of keywords that you want to split on.
After that, it would probably help to remove all brackets because precedence is not relevant for your problem and they make it harder to parse.
$where = preg_replace("/[()]/", "", $where);
From that point onward, you'd have to split again to find all the separate conditions:
$conditions = preg_split("/(AND|OR|XOR)/", $where);
And lastly, you'd have to split on operators to get the right and left values:
$idents = array();
foreach ($conditions as $c)
{
// list "IS NOT" before "IS" so the longer operator is tried first
$idents[] = preg_split("/(<>|=|>|<|IS NOT|IS)/", $c);
}
You would have to check that list of operators and add to it if needed. $idents now has all possible identifiers in it.
You may want to note that several of these steps (at the very least the last step) will also require trimming of the string to work properly.
Disclaimer: again, I think this is a very bad idea. This code only works if there is only one WHERE clause and even then it depends on a lot of assumptions. A complicated query will probably break this code. Please use a SQL parser/interpreter instead.
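For illustration, here is a sketch that chains the steps above into one function. The name whereIdentifiers is hypothetical, and the word boundaries (\b), the case-insensitive matching and the extra <= and >= operators are my own additions. It still assumes a single WHERE clause, no subqueries, and no string literals that contain keywords.

```php
<?php
// Sketch only: same assumptions as the step-by-step code (one WHERE clause,
// no subqueries, no keywords inside string literals).
function whereIdentifiers($query) {
    $parts = preg_split('/\bWHERE\b/i', $query, 2);
    if (count($parts) < 2) {
        return array();
    }
    // Cut off clauses that may follow the WHERE clause.
    $tail = preg_split('/\b(?:GROUP BY|ORDER BY|LIMIT)\b/i', $parts[1]);
    // Parentheses only express precedence, which does not matter here.
    $where = preg_replace('/[()]/', '', $tail[0]);

    $pairs = array();
    foreach (preg_split('/\b(?:AND|OR|XOR)\b/i', $where) as $condition) {
        // Longer operators come first so they are not split in the middle.
        $idents = preg_split('/<>|<=|>=|=|<|>|\bIS NOT\b|\bIS\b/i', $condition);
        // Drop backticks and surrounding whitespace from each side.
        $pairs[] = array_map(function ($s) {
            return trim(str_replace('`', '', $s));
        }, $idents);
    }
    return $pairs;
}

$q = "select a from t where ((`db`.`catalog`.`group` = `db`.`catalog_group`.`iden`) "
   . "and (`db`.`catalog`.`iden` = `db`.`catalog_sub`.`parent`)) order by a";
print_r(whereIdentifiers($q));
```

Each element of the result holds the left and right side of one condition, which matches the element[0] / element[1] layout asked for in the question.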
I want to search within PHP files for a specific function call. The reason is that I want to generate .MO files for the gettext extension, so I first need to create a .PO file that contains all the needed text strings.
I already find a lot of the texts, but there are some problems.
Here is my regex to find the first argument of a function call:
/\_\([\'|\"]{1}(.+?[^\\\])[\'|\"]{1}[,]{0,1}.*?\)+/si
I need to find function-calls with the following patterns:
_("text");
_("text %s", 3);
_('text');
The text could contain escaped quotes. My problem is actually that I need to know whether an apostrophe or a normal quote was used for the call.
If I have the call
_('"text"');
then I have the problem that I get the text
"text
without the ending quote.
Does anybody have an idea how I can get my regex to work?
I would use PHP's tokenizer for this kind of stuff, not regular expressions:
$funcName = '_';
$tokens = token_get_all(file_get_contents('path/to/your/script.php'));
$strings = array();
foreach($tokens as $index => $token){
if(!is_array($token))
continue;
if($token[0] === T_CONSTANT_ENCAPSED_STRING){
if(!isset($tokens[$index - 2]) || ($tokens[$index - 1] !== "("))
continue;
list($id, $text, $line) = $tokens[$index - 2];
// this is your string (substr drops quotes around it)
if(($id === T_STRING) && ($text === $funcName))
$strings[] = substr($token[1], 1, -1);
}
}
var_dump($strings);
Raw regex:
_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")
Delimited regex:
~_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")~
The result is in capturing group 1. I used the branch reset pattern (?|pattern) so that the capturing group number is reset for each alternating branch separated by |.
Inside the branch reset (?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)") are two patterns:
'((?:[^'\\]|\\.)*)': Match and capture content inside single quoted string, which consists of either non-quote-non-backslash or escaped sequence. Actually, I am a bit careless here, since (raw) new line character is considered part of the string. I don't think the specification would allow this, but if the input contains valid code, then there should be no problem.
"((?:[^"\\]|\\.)*)": Same as above, but for double quoted string.
Note that I don't consume the rest of the arguments to the function.
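A minimal sketch of how the delimited regex might be used with preg_match_all follows. The sample source text in $code is made up for illustration, and the s modifier is my addition so that an escaped newline is also accepted. Note that every backslash of the raw regex is doubled again inside the single-quoted PHP string.

```php
<?php
// Made-up sample source; in practice this would come from file_get_contents().
$code = <<<'PHP'
echo _("text");
echo _("text %s", 3);
echo _('"text"');
echo _('it\'s');
PHP;

// Same regex as above, written as a single-quoted PHP string literal.
$pattern = '~_\((?|\'((?:[^\'\\\\]|\\\\.)*)\'|"((?:[^"\\\\]|\\\\.)*)")~s';

preg_match_all($pattern, $code, $matches);

// Thanks to the branch reset, group 1 holds the text for both quote styles.
print_r($matches[1]);
```

The captured text is the raw content between the quotes, so escape sequences such as \' are still present and would need unescaping before being written to a .PO file.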
Hopefully, this is an easy one. I have an array with lines that contain output from a CSV file. What I need to do is simply remove any commas that appear between double-quotes.
I'm stumbling through regular expressions and having trouble. Here's my sad-looking code:
<?php
$csv_input = '"herp","derp","hey, get rid of these commas, man",1234';
$pattern = '(?<=\")/\,/(?=\")'; //this doesn't work
$revised_input = preg_replace ( $pattern , '' , $csv_input);
echo $revised_input;
//would like revised input to echo: "herp","derp","hey get rid of these commas man",1234
?>
Thanks VERY much, everyone.
Original Answer
You can use str_getcsv() for this, as it is designed specifically for processing CSV strings:
$out = array();
$array = str_getcsv($csv_input);
foreach($array as $item) {
$out[] = str_replace(',', '', $item);
}
$out is now an array of elements without any commas in them, which you can then just implode as the quotes will no longer be required once the commas are removed:
$revised_input = implode(',', $out);
Update for comments
If the quotes are important to you then you can just add them back in like so:
$revised_input = '"' . implode('","', $out) . '"';
Another option is to use one of the str_putcsv() (not a standard PHP function) implementations floating about out there on the web such as this one.
This is a very naive approach: it assumes that every 'valid' (separator) comma is directly adjacent to a double quote, and it removes every other comma.
<?php
$csv_input = '"herp","derp","hey, get rid of these commas, man",1234';
$pattern = '/([^"])\,([^"])/'; // matches a comma with no double quote on either side
$revised_input = preg_replace ( $pattern , "$1$2" , $csv_input);
echo $revised_input;
//output for this is: "herp","derp","hey get rid of these commas man",1234
It should definitely be tested more, but it works in this case.
It will not work when the fields are not quoted, because then every comma is removed:
one,two,three,four -> onetwothreefour
EDIT : Corrected the issues with deleting spaces and neighboring letters.
Well, I wasn't lazy and wrote a small function to do exactly what you need:
function clean_csv_commas($csv){
$len = strlen($csv);
$inside_block = FALSE;
$out='';
for($i=0;$i<$len;$i++){
if($csv[$i]=='"'){
if($inside_block){
$inside_block=FALSE;
}else{
$inside_block=TRUE;
}
}
if($csv[$i]==',' && $inside_block){
// do nothing
}else{
$out.=$csv[$i];
}
}
return $out;
}
You might be coming at this from the wrong angle.
Instead of removing the commas from the text (presumably so you can then split the string on the commas to get the separate elements), how about writing something that works on the quotes?
Once you've found an opening quote, you can check the rest of the string; anything before the next quote is part of this element. You can add some checking here to look for escaped quotes, too, so things like:
"this is a \"quote\""
will still be read properly.
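One way to sketch this quote-driven idea is to let a regex find each complete quoted field and edit only what is inside it. The function name stripCommasInQuotes is hypothetical, and the sketch assumes that a literal quote inside a field is written as \" (backslash-escaped), as in the example above, rather than doubled.

```php
<?php
// Sketch: assumes double-quoted fields and backslash-escaped inner quotes.
function stripCommasInQuotes($line) {
    return preg_replace_callback(
        // An opening quote, then any run of non-quote characters or
        // backslash escapes, then the closing quote.
        '/"(?:[^"\\\\]|\\\\.)*"/',
        function ($m) {
            // $m[0] is one complete quoted field, including its quotes.
            return str_replace(',', '', $m[0]);
        },
        $line
    );
}

echo stripCommasInQuotes('"herp","derp","hey, get rid of these commas, man",1234');
```

Commas outside the quoted fields are never touched, so the separators survive.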
Not exactly an answer you've been looking for - But I've used it for cleaning commas in numbers in CSV.
$csv = preg_replace('%\"([^\"]*)(,)([^\"]*)\"%i','$1$3',$csv);
"3,120", 123, 345, 567 ==> 3120, 123, 345, 567
I have strings like these:
"my value1" => my value1
"my Value2" => my Value2
myvalue3 => myvalue3
I need to get rid of " (double quotes) at the start and end of the string, if they exist, but if this character appears inside the string then it should be left there. Example:
"my " value1" => my " value1
How can I do this in PHP - is there function for this or do I have to code it myself?
The literal answer would be
trim($string,'"'); // double quotes
trim($string,'\'"'); // any combination of ' and "
It will remove all leading and trailing quotes from a string.
If you need to remove strictly the first and the last quote in case they exist, then it could be a regular expression like this
preg_replace('~^"?(.*?)"?$~', '$1', $string); // double quotes
preg_replace('~^[\'"]?(.*?)[\'"]?$~', '$1', $string); // either ' or " whichever is found
If you need to remove only in case the leading and trailing quote are strictly paired, then use the function from Steve Chambers' answer
However, if your goal is to read a value from a CSV file, fgetcsv is the only correct option. It will take care of all the edge cases, stripping the value enclosures as well.
I had a similar need and wrote a function that will remove leading and trailing single or double quotes from a string:
/**
* Remove the first and last quote from a quoted string of text
*
* @param mixed $text
*/
function stripQuotes($text) {
return preg_replace('/^(\'(.*)\'|"(.*)")$/', '$2$3', $text);
}
This will produce the outputs listed below:
Input text Output text
--------------------------------
No quotes => No quotes
"Double quoted" => Double quoted
'Single quoted' => Single quoted
"One of each' => "One of each'
"Multi""quotes" => Multi""quotes
'"'"#";'"*&^*'' => "'"#";'"*&^*'
Regex demo (showing what is being matched/captured): https://regex101.com/r/3otW7H/1
trim will remove all instances of the given characters from the start and end of the string, so:
$myValue = '"Hi"""""';
$myValue = trim($myValue, '"');
Will become:
$myValue === 'Hi'
Here's a way to only remove the first and last char if they match:
$output=stripslashes(trim($myValue));
// if the first char is a " then remove it
if(strpos($output,'"')===0)$output=substr($output,1,(strlen($output)-1));
// if the last char is a " then remove it
if(strrpos($output,'"')===(strlen($output)-1))$output=substr($output,0,-1);
As much as this thread should have been killed long ago, I couldn't help but respond with what I would call the simplest answer of all. I noticed this thread re-emerging on the 17th so I don't feel quite as bad about this. :)
Using samples as provided by Steve Chambers;
echo preg_replace('/(^[\"\']|[\"\']$)/', '', $input);
Output below;
Input text Output text
--------------------------------
No quotes => No quotes
"Double quoted" => Double quoted
'Single quoted' => Single quoted
"One of each' => One of each
"Multi""quotes" => Multi""quotes
'"'"#";'"*&^*'' => "'"#";'"*&^*'
This only ever removes the first and last quote, it doesn't repeat to remove extra content and doesn't care about matching ends.
This is an old post, but just to cater for multibyte strings, there are at least two possible routes one can follow. I am assuming that the quote stripping is being done because the quote is being considered like a program / INI variable and thus is EITHER "something" or 'somethingelse' but NOT "mixed quotes'. Also, ANYTHING between the matched quotes is to be retained intact.
Route 1 - using a Regex
function sq_re($i) {
return preg_replace( '#^(\'|")(.*)\1$#isu', '$2', $i );
}
This uses \1 to match the same type of quote that matched at the beginning. The u modifier makes it UTF-8 capable (okay, not fully multibyte supporting).
Route 2 - using mb_* functions
function sq($i) {
$f = mb_substr($i, 0, 1);
$l = mb_substr($i, -1);
if (($f == $l) && (($f == '"') || ($f == '\'')) ) $i = mb_substr($i, 1, mb_strlen( $i ) - 2);
return $i;
}
You can use regular expressions; look at:
http://php.net/manual/en/function.preg-replace.php
Or you could, in this instance, use substr to check if the first and then the last character of the string is a quote mark, if it is, truncate the string.
http://php.net/manual/en/function.substr.php
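A minimal sketch of the substr approach, assuming a single-byte quote character; the function name stripOuterQuotes is hypothetical. It removes one leading and one trailing quote independently, each only if present.

```php
<?php
// Hypothetical helper: strips one leading and one trailing quote character,
// each only if it is present. Assumes a single-byte quote character.
function stripOuterQuotes($s, $quote = '"') {
    if ($s !== '' && substr($s, 0, 1) === $quote) {
        // The cast guards against older PHP versions returning false.
        $s = (string) substr($s, 1);
    }
    if ($s !== '' && substr($s, -1) === $quote) {
        $s = (string) substr($s, 0, -1);
    }
    return $s;
}

echo stripOuterQuotes('"my " value1"'), "\n";
echo stripOuterQuotes('myvalue3'), "\n";
```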
How about regex
//$singleQuotedString="'Hello this 'someword' and \"somewrod\" stas's SO";
//$singleQuotedString="Hello this 'someword' and \"somewrod\" stas's SO'";
$singleQuotedString="'Hello this 'someword' and \"somewrod\" stas's SO'";
$quotesFreeString=preg_replace('/^\'?(.*?(?=\'?$))\'?$/','$1' ,$singleQuotedString);
Output
Hello this 'someword' and "somewrod" stas's SO
If you like performance over clarity this is the way:
// Remove double quotes at beginning and/or end of output
$len=strlen($output);
if($output[0]==='"') $iniidx=1; else $iniidx=0;
if($output[$len-1]==='"') $endidx=-1; else $endidx=$len-1;
if($iniidx==1 || $endidx==-1) $output=substr($output,$iniidx,$endidx);
The comment helps with clarity...
Using brackets on strings in an array-like way is possible and demands less processing effort than equivalent methods; too bad there isn't a length variable or a last-char index.
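For what it's worth, a sketch of the same idea on newer PHP: this assumes PHP 7.1 or later, where negative string offsets such as $output[-1] address characters from the end. The function name stripDoubleQuotesFast is hypothetical.

```php
<?php
// Assumes PHP 7.1+, where negative string offsets are supported.
function stripDoubleQuotesFast($output) {
    if ($output === '') {
        return $output;
    }
    // 1 if there is a quote to drop at that end, 0 otherwise.
    $start = ($output[0] === '"') ? 1 : 0;
    $end = ($output[-1] === '"') ? 1 : 0;
    // max() covers the one-character string '"', where both flags are set.
    $length = max(0, strlen($output) - $start - $end);
    return (string) substr($output, $start, $length);
}

echo stripDoubleQuotesFast('"Hi"'), "\n";
```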