\r\n \" printing out - php

I have begun using ADOdb and parameterized queries (e.g. $db->Execute("SELECT * FROM users WHERE user_name=?", array($get->id));) to prevent SQL injection. I have read that this is supposed to protect you on the MySQL injection side of things, though obviously not against XSS. Even so, I'm still a bit skeptical about it.
Nevertheless, I always filter my environment variables using a shotgun approach to safety at the beginning of my wrapper code (kernel.php). I notice that combining ADOdb with the following functions produces browser-visible escape sequences (\r\n \" \'), which is something I don't want (although I do want to store that information!). I also don't want to have to filter my output before display, since I already properly filter my input (aside from BBcode and that sort of thing). Below you will find the functions I'm referring to.
I have isolated this problem to the mysql_real_escape_string portion of the sanitize function. Note that my server is running PHP 5.2+, and that the issue does not exist when I use my own simplified db abstraction class. Also, the site runs mostly on my own code and is not built on the scaffold of some preexisting CMS. Considering these factors, my only guess is that some double-escaping is going on. However, when I looked at the adodb.inc.php file, I noticed that $rs->FetchNextObj() doesn't use mysql_real_escape_string. It appears the only function that does is qstr, which encapsulates the entire string. This leads me to worry that relying on parameterized queries may not be enough, but I don't know!
// Sanitize all possible user inputs
if(keyring_access("am")) // XSS and HTML stripping exemption for administrators editing HTML content
{
$_POST = sanitize($_POST,false,false);
$_GET = sanitize($_GET,false,false);
$_COOKIE = sanitize($_COOKIE,false,false);
$_SESSION = sanitize($_SESSION,false,false);
}
else
{
$_POST = sanitize($_POST);
$_GET = sanitize($_GET);
$_COOKIE = sanitize($_COOKIE);
$_SESSION = sanitize($_SESSION);
}
// Setup $form object shortcuts (merely convenience)
if($_POST)
{
foreach($_POST as $key => $value)
{
$form->$key = $value;
}
}
if($_GET)
{
foreach($_GET as $key => $value)
{
$get->$key = $value;
}
}
function sanitize($val, $strip = true, $xss = true, $charset = 'UTF-8')
{
if (is_array($val))
{
$output = array();
foreach ($val as $key => $data)
{
$output[$key] = sanitize($data, $strip, $xss, $charset);
}
return $output;
}
else
{
if ($xss)
{
// code by nicolaspar
$val = preg_replace('/([\x00-\x08][\x0b-\x0c][\x0e-\x20])/', '', $val);
$search = 'abcdefghijklmnopqrstuvwxyz';
$search .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$search .= '1234567890!##$%^&*()';
$search .= '~`";:?+/={}[]-_|\'\\';
for ($i = 0; $i < strlen($search); $i++)
{
$val = preg_replace('/(&#[x|X]0{0,8}'.dechex(ord($search[$i])).';?)/i', $search[$i], $val); // with a ;
$val = preg_replace('/(&#0{0,8}'.ord($search[$i]).';?)/', $search[$i], $val); // with a ;
}
$ra1 = Array('javascript', 'vbscript', 'expression', 'applet', 'meta', 'xml', 'blink', 'link', 'style', 'script', 'embed', 'object', 'iframe', 'frame', 'frameset', 'ilayer', 'layer', 'bgsound', 'title', 'base');
$ra2 = Array('onabort', 'onactivate', 'onafterprint', 'onafterupdate', 'onbeforeactivate', 'onbeforecopy', 'onbeforecut', 'onbeforedeactivate', 'onbeforeeditfocus', 'onbeforepaste', 'onbeforeprint', 'onbeforeunload', 'onbeforeupdate', 'onblur', 'onbounce', 'oncellchange', 'onchange', 'onclick', 'oncontextmenu', 'oncontrolselect', 'oncopy', 'oncut', 'ondataavailable', 'ondatasetchanged', 'ondatasetcomplete', 'ondblclick', 'ondeactivate', 'ondrag', 'ondragend', 'ondragenter', 'ondragleave', 'ondragover', 'ondragstart', 'ondrop', 'onerror', 'onerrorupdate', 'onfilterchange', 'onfinish', 'onfocus', 'onfocusin', 'onfocusout', 'onhelp', 'onkeydown', 'onkeypress', 'onkeyup', 'onlayoutcomplete', 'onload', 'onlosecapture', 'onmousedown', 'onmouseenter', 'onmouseleave', 'onmousemove', 'onmouseout', 'onmouseover', 'onmouseup', 'onmousewheel', 'onmove', 'onmoveend', 'onmovestart', 'onpaste', 'onpropertychange', 'onreadystatechange', 'onreset', 'onresize', 'onresizeend', 'onresizestart', 'onrowenter', 'onrowexit', 'onrowsdelete', 'onrowsinserted', 'onscroll', 'onselect', 'onselectionchange', 'onselectstart', 'onstart', 'onstop', 'onsubmit', 'onunload');
$ra = array_merge($ra1, $ra2);
$found = true;
while ($found == true)
{
$val_before = $val;
for ($i = 0; $i < sizeof($ra); $i++)
{
$pattern = '/';
for ($j = 0; $j < strlen($ra[$i]); $j++)
{
if ($j > 0)
{
$pattern .= '(';
$pattern .= '(&#[x|X]0{0,8}([9][a][b]);?)?';
$pattern .= '|(&#0{0,8}([9][10][13]);?)?';
$pattern .= ')?';
}
$pattern .= $ra[$i][$j];
}
$pattern .= '/i';
$replacement = substr($ra[$i], 0, 2).'<x>'.substr($ra[$i], 2);
$val = preg_replace($pattern, $replacement, $val);
if ($val_before == $val)
{
$found = false;
}
}
}
}
// Strip HTML tags
if ($strip)
{
$val = strip_tags($val);
// Encode special chars
$val = htmlentities($val, ENT_QUOTES, $charset);
}
// Cross your fingers that we don't get a MySQL injection with relying on ADOdb prepared statements alone… ? It works great otherwise by just returning $val... so it appears the code below is the culprit of the \r\n \" etc. escaping
//return $val;
if(function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc())
{
return mysql_real_escape_string(stripslashes($val));
}
else
{
return mysql_real_escape_string($val);
}
}
}
Thank you very much in advance for your help! If you need any further clarifications, please let me know.
Update: the backslash is still showing up in front of " and ', and yes, I removed the extra mysql_real_escape_string. Now I can only think this might be get_magic_quotes_gpc, or ADOdb adding them...
~elix

It turned out to be a side effect of qstr in ADOdb. I never call that function directly, so it must be called elsewhere in the class. The problem in my particular case was that magic quotes is enabled, so I set the default argument for the function to $magic_quotes=disabled. As for not needing any escaping: I found that ADOdb by itself DOES NOT run mysql_real_escape_string in a basic Execute() with binding alone! I recognized this because the characters " and ' threw errors (and hence didn't render on my server, where error_reporting is disabled). It appears that the combination of my functions plus that small fix to ADOdb leaves me both well protected and able to accept most/all input the way I want. Before the fix, the double-quote problem prevented any quotes from being entered as content into the database, which meant at the very least no HTML.
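The double-escaping suspected in the question can be reproduced without a database. The following is only a minimal sketch: addslashes() is a stand-in for mysql_real_escape_string(), the variable names are made up for illustration, and it assumes the value later reaches the database through a bound parameter that stores it verbatim.

```php
<?php
// Raw text as the user typed it.
$raw = 'He said "hi" & left';

// Escaping on input (addslashes stands in for mysql_real_escape_string)
// bakes backslashes into the data itself. If the value is then bound as a
// parameter, those backslashes are stored and later shown in the browser.
$escapedOnInput = addslashes($raw);

// Escaping for HTML at output time leaves the stored value untouched.
$safeForHtml = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $escapedOnInput, "\n"; // He said \"hi\" & left
echo $safeForHtml, "\n";    // He said &quot;hi&quot; &amp; left
```

The first line of output is the symptom from the question: backslashes that were meant for the SQL layer end up as part of the content.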
Nevertheless, I appreciate your suggestions, but also felt that my follow-up might help others.

Related

Mysql/PHP Querying a table value containing reserved words

I'm doing some maintenance on a client's website that uses the simple mysql_query() function for all of its database queries. On one of the pages, a query pulls user information based on the user's nickname. The file is extensive, and going through it to change every instance to pull by user ID instead of nickname is not really feasible.
They're running into problems with some nicknames, particularly "Link", "Echo", "Slayer". I see why link and echo could potentially cause issues with the query, but not so much Slayer. Is there anything I can do (aside from preventing creation of these names in the future) to help the query go through and pull the information I need?
Edit:
The whole function:
function userInfo($username){
global $username_array;
$username = prepare($username);
$username_array = mysql_fetch_array(mysql_query("SELECT * FROM `users` WHERE `name` = '$username' LIMIT 1"));
}
It should return an array $username_array back to the original script. With 99% of users, this works fine. For some reason, for the users above, this does not work.
function prepare($val,$type=0){
$val = XSS($val);
$val = sqlInjection($val);
return $val;
}
function XSS($val) {
// remove all non-printable characters. CR(0a) and LF(0b) and TAB(9) are allowed
$val = preg_replace('/([\x00-\x08][\x0b-\x0c][\x0e-\x20])/', '', $val);
// straight replacements, the user should never need these since they're normal characters
// this prevents like <IMG SRC=&#X40&#X61&#X76&#X61&#X73&#X63&#X72&#X69&#X70&#X74&#X3A&#X61&#X6C&#X65&#X72&#X74&#X28&#X27&#X58&#X53&#X53&#X27&#X29>
$search = 'abcdefghijklmnopqrstuvwxyz';
$search .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$search .= '1234567890!##$%^&*()';
$search .= '~`";:?+/={}[]-_|\'\\';
for ($i = 0; $i < strlen($search); $i++) {
// ;? matches the ;, which is optional
// 0{0,7} matches any padded zeros, which are optional and go up to 8 chars
// &#x0040 # search for the hex values
$val = preg_replace('/(&#[x|X]0{0,8}'.dechex(ord($search[$i])).';?)/i', $search[$i], $val); // with a ;
// &#00064 # 0{0,7} matches '0' zero to seven times
$val = preg_replace('/(&#0{0,8}'.ord($search[$i]).';?)/', $search[$i], $val); // with a ;
}
// now the only remaining whitespace attacks are \t, \n, and \r
$ra1 = Array('javascript', 'vbscript', 'expression', 'applet', 'blink', 'script', 'iframe', 'frameset', 'ilayer', 'bgsound');
$ra2 = Array('onabort', 'onactivate', 'onafterprint', 'onafterupdate', 'onbeforeactivate', 'onbeforecopy', 'onbeforecut', 'onbeforedeactivate', 'onbeforeeditfocus', 'onbeforepaste', 'onbeforeprint', 'onbeforeunload', 'onbeforeupdate', 'onblur', 'onbounce', 'oncellchange', 'onchange', 'onclick', 'oncontextmenu', 'oncontrolselect', 'oncopy', 'oncut', 'ondataavailable', 'ondatasetchanged', 'ondatasetcomplete', 'ondblclick', 'ondeactivate', 'ondrag', 'ondragend', 'ondragenter', 'ondragleave', 'ondragover', 'ondragstart', 'ondrop', 'onerror', 'onerrorupdate', 'onfilterchange', 'onfinish', 'onfocus', 'onfocusin', 'onfocusout', 'onhelp', 'onkeydown', 'onkeypress', 'onkeyup', 'onlayoutcomplete', 'onload', 'onlosecapture', 'onmousedown', 'onmouseenter', 'onmouseleave', 'onmousemove', 'onmouseout', 'onmouseover', 'onmouseup', 'onmousewheel', 'onmove', 'onmoveend', 'onmovestart', 'onpaste', 'onpropertychange', 'onreadystatechange', 'onreset', 'onresize', 'onresizeend', 'onresizestart', 'onrowenter', 'onrowexit', 'onrowsdelete', 'onrowsinserted', 'onscroll', 'onselect', 'onselectionchange', 'onselectstart', 'onstart', 'onstop', 'onsubmit', 'onunload');
$ra = array_merge($ra1, $ra2);
$found = true; // keep replacing as long as the previous round replaced something
while ($found == true) {
$val_before = $val;
for ($i = 0; $i < sizeof($ra); $i++) {
$pattern = '/';
for ($j = 0; $j < strlen($ra[$i]); $j++) {
if ($j > 0) {
$pattern .= '(';
$pattern .= '(&#[x|X]0{0,8}([9][a][b]);?)?';
$pattern .= '|(&#0{0,8}([9][10][13]);?)?';
$pattern .= ')?';
}
$pattern .= $ra[$i][$j];
}
$pattern .= '/i';
$replacement = substr($ra[$i], 0, 2).'<x>'.substr($ra[$i], 2); // add in <> to nerf the tag
$val = preg_replace($pattern, $replacement, $val); // filter out the hex tags
if ($val_before == $val) {
// no replacements were made, so exit the loop
$found = false;
}
}
}
return $val;
}
function sqlInjection($val){
if (get_magic_quotes_gpc()){
$val = stripslashes($val);
}
if(version_compare(phpversion(),"4.3.0") == "-1"){
return mysql_escape_string($val);
}else{
return mysql_real_escape_string($val);
}
}
There should be no problem since no user input should ever be directly executed.
Make sure you escape the strings properly, and consider using prepared statements in case someone has a nasty surprise waiting, such as this.
Using strings that are PHP reserved words is not the cause of your problem.
You can have strings that are reserved words: $test = 'function'; is valid. The issue lies elsewhere.
http://www.php.net/manual/en/reserved.keywords.php

preg_replace & mysql_real_escape_string problem cleaning SQL

Check out the method below. If the value entered in the text box is \, mysql_real_escape_string will return a double backslash, but preg_replace will return SQL with only one backslash. I'm not that good with regular expressions, so please help.
$sql = "INSERT INTO tbl SET val='?'";
$params = array('someval');
public function execute($sql, array $params){
$keys = array();
foreach ($params as $key => $value) {
$keys[] = '/[?]/';
if (get_magic_quotes_gpc()) {
$value = stripslashes($value);
}
$paramsEscaped[$key] = mysql_real_escape_string(trim($value));
}
$sql = preg_replace($keys, $paramsEscaped, $sql, 1, $count);
return $this->query($sql);
}
For me it basically looks like you're re-inventing the wheel and your concept has some serious flaws:
It assumes get_magic_quotes_gpc could be switched on. This feature is broken. You should not code against it. Instead make your application require that it is switched off.
mysql_real_escape_string needs a database link identifier to properly work. You are not providing any. This is a serious issue, you should change your concept.
You're actually not using prepared statements, but you mimic the syntax of those. This is fooling other developers who might think that it is safe to use the code while it is not. This is highly discouraged.
However, let's do it, just not with preg_replace. There are various reasons, but especially this one: every pattern is the same ?, so each replacement simply targets the first ? it finds in whatever the previous step produced. It is also inflexible when dealing with error cases such as too few or too many parameters/placeholders. And imagine that a string you insert contains a ? character as well: it would break. Instead, the already processed part, including the replacement, needs to be skipped.
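The lost backslash from the question comes from how preg_replace() treats its replacement argument: backslashes there are interpreted, so a doubled backslash collapses into one. A small sketch of the difference, assuming the escaped value is exactly the two backslashes that mysql_real_escape_string() would return for a single \ (the variable names are illustrative only):

```php
<?php
// Two backslash characters: the escaped form of a single backslash.
$escaped = '\\\\';
$sql = "INSERT INTO tbl SET val='?'";

// preg_replace() interprets backslashes in the replacement string,
// so the doubled backslash collapses back into a single one.
$viaPreg = preg_replace('/[?]/', $escaped, $sql, 1);

// substr_replace() copies the replacement verbatim.
$viaSubstr = substr_replace($sql, $escaped, strpos($sql, '?'), 1);

echo $viaPreg, "\n";   // INSERT INTO tbl SET val='\'
echo $viaSubstr, "\n"; // INSERT INTO tbl SET val='\\'
```

The first result is broken SQL (the backslash now escapes the closing quote), which is one more reason to do the substitution with plain string functions.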
For that you need to go through, take it apart and process it:
public function execute($sql, array $params)
{
$params = array_map(array($this, 'filter_value'), $params);
$sql = $this->expand_placeholders($sql, $params);
return $this->query($sql);
}
public function filter_value($value)
{
if (get_magic_quotes_gpc())
{
$value = stripslashes($value);
}
$value = trim($value);
$value = mysql_real_escape_string($value);
return $value;
}
public function expand_placeholders($sql, array $params)
{
$sql = (string) $sql;
$params = array_values($params);
$offset = 0;
foreach($params as $param)
{
$place = strpos($sql, '?', $offset);
if ($place === false)
{
throw new InvalidArgumentException('Parameter / Placeholder count mismatch. Not enough placeholders for all parameters.');
}
$sql = substr_replace($sql, $param, $place, 1);
$offset = $place + strlen($param);
}
$place = strpos($sql, '?', $offset);
if ($place !== false)
{
// A placeholder is still left after all parameters have been used
throw new InvalidArgumentException('Parameter / Placeholder count mismatch. Too many placeholders.');
}
return $sql;
}
The benefit of already existing prepared statements is that they actually work. You should really consider using those. Playing around like this is nice, but you will need to deal with many more cases in the end, and it's far easier to re-use an existing component tested by thousands of other users.
It's better to use prepared statements. For more information, see http://www.php.net/manual/en/pdo.prepared-statements.php

mysql_real_escape_string not being used with given regex

I am using a dataHandler library to handle all of my db inserts / updates, etc.
The library has the following functions:
function prepareValue($value, $connection){
$preparedValue = $value;
if(is_null($value)){
$preparedValue = 'NULL';
}
else{
$preparedValue = '\''.mysql_real_escape_string($value, $connection).'\'';
}
return $preparedValue;
}
function parseParams($params, $type, $connection){
$fields = "";
$values = "";
if ($type == "UPDATE"){
$return = "";
foreach ($params as $key => $value){
if ($return == ""){
if (preg_match("/\)$/", $value)){
$return = $key."=".$value;
}
else{
$return = $key."=".$this->prepareValue($value, $connection);
}
}
else{
if (preg_match("/\)$/", $value)){
$return = $return.", ".$key."=".$value;
}
else{
$return = $return.", ".$key."=".$this->prepareValue($value, $connection);
}
}
}
return $return;
}
/* rest of function contains similar code for "INSERT", etc. */
}
These functions are then used to build queries using sprintf, as in:
$query = sprintf("UPDATE table SET " .
$this->parseParams($params, "UPDATE", $conn) .
" WHERE fieldValue = %s;", $this->prepareValue($thesis_id, $conn));
$params is an associative array: array("db_field_name"=>$value, "db_field_name2"=>$value2, etc.)
I am now running into problems when I want to do an update or insert of a string that ends in ")" because the parseParams function does not put these values in quotes.
My question is this:
Why would this library NOT call prepareValue on strings that end in a closed parenthesis? Would calling mysql_real_escape_string() on this value cause any problems? I could easily modify the library, but I am assuming there is a reason the author handled this particular regex this way. I just can't figure out what that reason is! And I'm hesitant to make any modifications until I understand the reasoning behind what is here.
Thanks for your help!
Please note that prepareValue not only applies mysql_real_escape_string to the value but also wraps it in single quotes. With this in mind, we can suspect that the author assumed all strings ending with ) to be MySQL function calls, i.e.:
$params = array(
'field1' => "John Doe",
'field2' => "CONCAT('John',' ','Doe')",
'field3' => "NOW()"
);
That's the only reasonable answer that comes to mind.
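To see what that heuristic does to different values, here is a minimal sketch that applies the same regular expression parseParams() uses. The helper name endsWithParen and the sample values are made up for illustration; they are not part of the library.

```php
<?php
// Same test the library applies: a value ending in ")" is passed through
// unquoted, presumably on the assumption that it is a MySQL function call.
function endsWithParen($value)
{
    return preg_match('/\)$/', $value) === 1;
}

$samples = array('John Doe', 'NOW()', "CONCAT('John',' ','Doe')", 'Thesis (draft)');
foreach ($samples as $sample) {
    $treatment = endsWithParen($sample) ? 'inserted as raw SQL' : 'quoted and escaped';
    echo str_pad($sample, 28), $treatment, "\n";
}
```

The last sample is ordinary text that happens to end in a parenthesis, so it is treated as raw SQL, which is the problem described in the question.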

Backticking MySQL Entities

I've the following method which allows me to protect MySQL entities:
public function Tick($string)
{
$string = explode('.', str_replace('`', '', $string));
foreach ($string as $key => $value)
{
if ($value != '*')
{
$string[$key] = '`' . trim($value) . '`';
}
}
return implode('.', $string);
}
This works fairly well for the use that I make of it.
It protects database, table, field names and even the * operator, however now I also want it to protect function calls, ie:
AVG(database.employees.salary)
Should become:
AVG(`database`.`employees`.`salary`) and not `AVG(database`.`employees`.`salary)`
How should I go about this? Should I use regular expressions?
Also, how can I support more advanced stuff, from:
MAX(AVG(database.table.field1), MAX(database.table.field2))
To:
MAX(AVG(`database`.`table`.`field1`), MAX(`database`.`table`.`field2`))
Please keep in mind that I want to keep this method as simple/fast as possible, since it pretty much iterates over all the entity names in my database.
If this is quoting parts of an SQL statement, and they have only the complexity that you describe, a regex is a great approach. On the other hand, if you need to do this to full SQL statements, or simply to more complicated components of statements (such as "MAX(AVG(val),MAX(val2))"), you will need to tokenize or parse the string and have a more sophisticated understanding of it to do this quoting accurately.
Given the regular expression approach, you may find it easier to break the function name out as one step, and then use your current code to quote the database/table/column names. This can be done in one regex, but it will be trickier to get right.
Either way, I'd highly recommend writing a few unit test cases. In fact, this is an ideal situation for this approach: it's easy to write the tests, you have some existing cases that work (which you don't want to break), and you have just one more case to add.
Your test can start as simply as:
assert('`ticked`' == Tick('ticked'));
assert('`table`.`ticked`' == Tick('table.ticked'));
assert('`db`.`table`.`ticked`' == Tick('db.table.ticked'));
And then add:
assert('FN(`ticked`)' == Tick('FN(ticked)'));
etc.
Using the test cases ndp gave, I created a regex to do the hard work for you. The following regex matches every whole word that is not immediately followed by an opening parenthesis, so that each match can be wrapped in backticks.
\b(\w+)\b(?!\()
The Tick() functionality would then be implemented in PHP as follows:
function Tick($string)
{
return preg_replace( '/\b(\w+)\b(?!\()/', '`\1`', $string );
}
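A quick way to check the regex against the cases from the question is to run it directly. The sketch below repeats the one-line Tick() so that it runs on its own; the COUNT(1) case is an added illustration of a limitation, not something from the original posts.

```php
<?php
// The regex-based Tick(), repeated here so the snippet is self-contained.
function Tick($string)
{
    return preg_replace('/\b(\w+)\b(?!\()/', '`\1`', $string);
}

echo Tick('db.table.ticked'), "\n";
// `db`.`table`.`ticked`
echo Tick('table.*'), "\n";
// `table`.*
echo Tick('AVG(database.employees.salary)'), "\n";
// AVG(`database`.`employees`.`salary`)
echo Tick('MAX(AVG(database.table.field1), MAX(database.table.field2))'), "\n";
// MAX(AVG(`database`.`table`.`field1`), MAX(`database`.`table`.`field2`))
echo Tick('COUNT(1)'), "\n";
// COUNT(`1`)  <- numeric literals are ticked too
```

Because \w+ also matches digits, anything word-like that is not followed by ( gets backticks, so this works best when the input contains only identifiers and function names.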
It's generally a bad idea to pass the whole SQL to the function. That way, you'll always find a case when it doesn't work, unless you fully parse the SQL syntax.
Put the ticks to the names on some previous abstraction level, which makes up the SQL.
Before you explode your string on periods, check if the last character is a parenthesis. If so, this call is a function.
<?php
$string = str_replace('`', '', $string);
$function = "";
if (substr($string,-1) == ")") {
// Strip off function call first
$opening = strpos($string, "(");
$function = substr($string, 0, $opening+1);
$string = substr($string, $opening+1, -1);
}
// Do your existing parsing to $string
if ($function != "") {
// Put function back on string
$string = $function . $string . ")";
}
?>
If you need to cover more advanced situations, like using nested functions, or multiple functions in sequence in one "$string" variable, this would become a much more advanced function, and you'd best ask yourself why these elements aren't being properly ticked in the first place, and not need any further parsing.
EDIT: Updating for nested functions, as per original post edit
To have the above function deal with multiple nested functions, you likely need something that will 'unwrap' your nested functions. I haven't tested this, but the following function might get you on the right track.
<?php
function unwrap($str) {
$pos = strpos($str, "(");
if ($pos === false) return $str; // There's no function call here
$last_close = 0;
$cur_offset = 0; // Start at the beginning
while ($cur_offset <= strlen($str)) {
$first_close = strpos($str, ")", $cur_offset); // Find first deep function
if ($first_close === false) break; // No closing parenthesis left
$pos = strrpos($str, "(", $first_close-1); // Find associated opening
if ($pos > $last_close) {
// This function is entirely after the previous function
$ticked = Tick(substr($str, $pos+1, $first_close-$pos)); // Tick the string inside
$str = substr($str, 0, $pos)."{".$ticked."}".substr($str,$first_close); // Replace parenthesis by curly braces temporarily
$first_close += strlen($ticked)-($first_close-$pos); // Shift parenthesis location due to new ticks being added
} else {
// This function wraps other functions; don't tick it
$str = substr($str, 0, $pos)."{".substr($str,$pos+1, $first_close-$pos)."}".substr($str,$first_close);
}
$last_close = $first_close;
$cur_offset = $first_close+1;
}
// Replace the curly braces with parenthesis again
$str = str_replace(array("{","}"), array("(",")"), $str);
return $str;
}
If you are adding the function calls in your code, as opposed to passing them in through a string-only interface, you can replace the string parsing with type checking:
function Tick($value) {
if (is_object($value)) {
$result = $value->value;
} else {
$result = '`'.str_replace(array('`', '.'), array('', '`.`'), $value).'`';
}
return $result;
}
class SqlFunction {
public $value;
function __construct($function, $params) {
$sane = implode(', ', array_map('Tick', $params));
$this->value = "$function($sane)";
}
}
function Maximum($column) {
return new SqlFunction('MAX', array($column));
}
function Avg($column) {
return new SqlFunction('AVG', array($column));
}
function Greatest() {
$params = func_get_args();
return new SqlFunction('GREATEST', $params);
}
$cases = array(
"'simple'" => Tick('simple'),
"'table.field'" => Tick('table.field'),
"'table.*'" => Tick('table.*'),
"'evil`hack'" => Tick('evil`hack'),
"Avg('database.table.field')" => Tick(Avg('database.table.field')),
"Greatest(Avg('table.field1'), Maximum('table.field2'))" => Tick(Greatest(Avg('table.field1'), Maximum('table.field2'))),
);
echo "<table>";
foreach ($cases as $case => $result) {
echo "<tr><td>$case</td><td>$result</td></tr>";
}
echo "</table>";
This avoids any possible SQL injection while remaining legible to future readers of your code.
You could use preg_replace_callback() in conjunction with your Tick() method to skip at least one level of parens:
public function tick($str)
{
return preg_replace_callback('/[^()]+/', array($this, '_tick_replace_callback'), $str);
}
protected function _tick_replace_callback($matches) {
// The callback receives the array of matches; $matches[0] is the matched text
$string = explode('.', str_replace('`', '', $matches[0]));
foreach ($string as $key => $value)
{
if ($value != '*')
{
$string[$key] = '`' . trim($value) . '`';
}
}
return implode('.', $string);
}
Are you generating the SQL query, or is it being passed to you? If you are generating the query, I wouldn't pass the whole query string, just the params/values you want to wrap in backticks or whatever else you need.
EXAMPLE:
function addTick($var) {
return '`' . $var . '`';
}
$condition = addTick($condition);
$SQL = 'SELECT ' . $what . '
FROM ' . $table . '
WHERE ' . $condition . ' = ' . $constraint;
This is just a mock but you get the idea that you can pass or loop through your code and build the query string rather than parsing the query string and adding your backticks.

Regex to parse define() contents, possible?

I am very new to regex, and this is way too advanced for me. So I am asking the experts over here.
Problem
I would like to retrieve the constants / values from a php define()
DEFINE('TEXT', 'VALUE');
Basically I would like a regex to be able to return the name of constant, and the value of constant from the above line. Just TEXT and VALUE . Is this even possible?
Why I need it? I am dealing with language file and I want to get all couples (name, value) and put them in array. I managed to do it with str_replace() and trim() etc.. but this way is long and I am sure it could be made easier with single line of regex.
Note: The VALUE may contain escaped single quotes as well. example:
DEFINE('TEXT', 'J\'ai');
I hope I am not asking for something too complicated. :)
Regards
For any kind of grammar-based parsing, regular expressions are usually an awful solution. Even simple grammars (like arithmetic) have nesting, and it is on nesting in particular that regular expressions just fall over.
Fortunately PHP provides a far, far better solution for you by giving you access to the same lexical analyzer used by the PHP interpreter via the token_get_all() function. Give it a character stream of PHP code and it'll parse it into tokens ("lexemes"), which you can do a bit of simple parsing on with a pretty simple finite state machine.
Run this program (it's run as test.php so it tries it on itself). The file is deliberately formatted badly so you can see it handles that with ease.
<?php
define('CONST1', 'value' );
define (CONST2, 'value2');
define( 'CONST3', time());
define('define', 'define');
define("test", VALUE4);
define('const5', //
'weird declaration'
) ;
define('CONST7', 3.14);
define ( /* comment */ 'foo', 'bar');
$defn = 'blah';
define($defn, 'foo');
define( 'CONST4', define('CONST5', 6));
header('Content-Type: text/plain');
$defines = array();
$state = 0;
$key = '';
$value = '';
$file = file_get_contents('test.php');
$tokens = token_get_all($file);
$token = reset($tokens);
while ($token) {
// dump($state, $token);
if (is_array($token)) {
if ($token[0] == T_WHITESPACE || $token[0] == T_COMMENT || $token[0] == T_DOC_COMMENT) {
// do nothing
} else if ($token[0] == T_STRING && strtolower($token[1]) == 'define') {
$state = 1;
} else if ($state == 2 && is_constant($token[0])) {
$key = $token[1];
$state = 3;
} else if ($state == 4 && is_constant($token[0])) {
$value = $token[1];
$state = 5;
}
} else {
$symbol = trim($token);
if ($symbol == '(' && $state == 1) {
$state = 2;
} else if ($symbol == ',' && $state == 3) {
$state = 4;
} else if ($symbol == ')' && $state == 5) {
$defines[strip($key)] = strip($value);
$state = 0;
}
}
$token = next($tokens);
}
foreach ($defines as $k => $v) {
echo "'$k' => '$v'\n";
}
function is_constant($token) {
return $token == T_CONSTANT_ENCAPSED_STRING || $token == T_STRING ||
$token == T_LNUMBER || $token == T_DNUMBER;
}
function dump($state, $token) {
if (is_array($token)) {
echo "$state: " . token_name($token[0]) . " [$token[1]] on line $token[2]\n";
} else {
echo "$state: Symbol '$token'\n";
}
}
function strip($value) {
return preg_replace('!^([\'"])(.*)\1$!', '$2', $value);
}
?>
Output:
'CONST1' => 'value'
'CONST2' => 'value2'
'CONST3' => 'time'
'define' => 'define'
'test' => 'VALUE4'
'const5' => 'weird declaration'
'CONST7' => '3.14'
'foo' => 'bar'
'CONST5' => '6'
This is basically a finite state machine that looks for the pattern:
function name ('define')
open parenthesis
constant
comma
constant
close parenthesis
in the lexical stream of a PHP source file and treats the two constants as a (name,value) pair. In doing so it handles nested define() statements (as per the results) and ignores whitespace and comments as well as working across multiple lines.
Note: I've deliberately made it ignore the case where functions and variables are constant names or values, but you can extend it to that as you wish.
It's also worth pointing out that PHP is quite forgiving when it comes to strings. They can be declared with single quotes, double quotes or (in certain circumstances) with no quotes at all. As pointed out by Gumbo, an unquoted string can be an ambiguous reference to a constant, and you have no way of knowing which it is (no guaranteed way, anyway), giving you the choice of:
(1) Ignoring that style of strings (T_STRING);
(2) Seeing if a constant has already been declared with that name and replacing its value. There's no way you can know what other files have been called, though, nor can you process any defines that are conditionally created, so you can't say with any certainty whether anything is definitely a constant or what value it has; or
(3) Living with the possibility that these might be constants (which is unlikely) and just treating them as strings.
Personally I would go for (1) then (3).
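For the language-file case in the question, where both arguments are plain single-quoted strings, the state machine can be cut down considerably. The following is only a sketch under that assumption: the function names extractDefines and decodeSingleQuoted are made up here, and anything that is not a single-quoted literal (unquoted names, function calls, double-quoted strings) is skipped.

```php
<?php
// Decode a single-quoted PHP string literal: drop the surrounding quotes,
// then undo the only two escapes such strings recognise (\\ and \').
function decodeSingleQuoted($literal)
{
    return strtr(substr($literal, 1, -1), array('\\\\' => '\\', "\\'" => "'"));
}

// Collect define('NAME', 'value') pairs from PHP source code.
function extractDefines($source)
{
    $defines = array();
    $state = 0; // 0 idle, 1 saw define, 2 saw "(", 3 have name, 4 saw ",", 5 have value
    $name = $value = '';
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            list($id, $text) = $token;
            if ($id === T_WHITESPACE || $id === T_COMMENT || $id === T_DOC_COMMENT) {
                continue; // layout and comments never change the state
            }
            $isLiteral = $id === T_CONSTANT_ENCAPSED_STRING && $text[0] === "'";
            if ($id === T_STRING && strtolower($text) === 'define') {
                $state = 1;
            } elseif ($isLiteral && $state === 2) {
                $name = decodeSingleQuoted($text);
                $state = 3;
            } elseif ($isLiteral && $state === 4) {
                $value = decodeSingleQuoted($text);
                $state = 5;
            } else {
                $state = 0; // anything unexpected resets the machine
            }
        } elseif ($token === '(' && $state === 1) {
            $state = 2;
        } elseif ($token === ',' && $state === 3) {
            $state = 4;
        } elseif ($token === ')' && $state === 5) {
            $defines[$name] = $value;
            $state = 0;
        } else {
            $state = 0;
        }
    }
    return $defines;
}

$source = <<<'PHP'
<?php
define('TEXT', 'VALUE');
DEFINE( 'GREETING' , /* comment */ 'J\'ai' );
define('SKIPPED', time());
PHP;

print_r(extractDefines($source)); // TEXT => VALUE, GREETING => J'ai
```

Resetting the state on every unexpected token is a design choice that keeps the sketch strict: a define() whose value is a function call, such as the SKIPPED line, is dropped rather than recorded with a misleading value.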
This is possible, but I would rather use get_defined_constants(). But make sure all your translations have something in common (like all translations starting with T), so you can tell them apart from other constants.
Try this regular expression to find the define calls:
/\bdefine\(\s*("(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')\s*,\s*("(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')\s*\);/is
So:
$pattern = '/\\bdefine\\(\\s*("(?:[^"\\\\]+|\\\\(?:\\\\\\\\)*.)*"|\'(?:[^\'\\\\]+|\\\\(?:\\\\\\\\)*.)*\')\\s*,\\s*("(?:[^"\\\\]+|\\\\(?:\\\\\\\\)*.)*"|\'(?:[^\'\\\\]+|\\\\(?:\\\\\\\\)*.)*\')\\s*\\);/is';
$str = '<?php define(\'foo\', \'bar\'); define("define(\\\'foo\\\', \\\'bar\\\')", "define(\'foo\', \'bar\')"); ?>';
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);
var_dump($matches);
I know that eval is evil. But that’s the best way to evaluate the string expressions:
$constants = array();
foreach ($matches as $match) {
eval('$constants['.$match[1].'] = '.$match[2].';');
}
var_dump($constants);
You might not need to go overboard with the regex complexity - something like this will probably suffice
/DEFINE\('(.*?)',\s*'(.*)'\);/
Here's a PHP sample showing how you might use it
$lines=file("myconstants.php");
foreach($lines as $line) {
$matches=array();
if (preg_match('/DEFINE\(\'(.*?)\',\s*\'(.*)\'\);/i', $line, $matches)) {
$name=$matches[1];
$value=$matches[2];
echo "$name = $value\n";
}
}
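With the escaped-quote example from the question, this simple pattern captures the value exactly as it is written in the file, backslash included, so one extra step is needed to decode it. A small sketch, assuming single-quoted values only; the strtr() mapping is an addition for illustration and is not part of the answer above:

```php
<?php
// The file line from the question: DEFINE('TEXT', 'J\'ai');
$line = "DEFINE('TEXT', 'J\\'ai');";

$name = $rawValue = $value = null;
if (preg_match('/DEFINE\(\'(.*?)\',\s*\'(.*)\'\);/i', $line, $matches)) {
    $name = $matches[1];
    $rawValue = $matches[2]; // still contains the backslash: J\'ai
    // Undo the two escapes a single-quoted PHP string can contain.
    $value = strtr($rawValue, array('\\\\' => '\\', "\\'" => "'"));
}

echo "$name = $value\n"; // TEXT = J'ai
```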
Not every problem with text should be solved with a regexp, so I'd suggest you state what you want to achieve and not how.
So, instead of using php's parser which is not really useful, or instead of using a completely undebuggable regexp, why not write a simple parser?
<?php
$str = "define('nam\\'e', 'va\\\\\\'lue');\ndefine('na\\\\me2', 'value\\'2');\nDEFINE('a', 'b');";
function getDefined($str) {
$lines = array();
preg_match_all('#^define[(][ ]*(.*?)[ ]*[)];$#mi', $str, $lines);
$res = array();
foreach ($lines[1] as $cnt) {
$p = 0;
$key = parseString($cnt, $p);
// Skip comma
$p++;
// Skip space
while ($cnt[$p] == " ") {
$p++;
}
$value = parseString($cnt, $p);
$res[$key] = $value;
}
return $res;
}
function parseString($s, &$p) {
$quotechar = $s[$p];
if (! in_array($quotechar, array("'", '"'))) {
throw new Exception("Invalid quote character '" . $quotechar . "', input is " . var_export($s, true) . " # " . $p);
}
$len = strlen($s);
$quoted = false;
$res = "";
for ($p++;$p < $len;$p++) {
if ($quoted) {
$quoted = false;
$res .= $s[$p];
} else {
if ($s[$p] == "\\") {
$quoted = true;
continue;
}
if ($s[$p] == $quotechar) {
$p++;
return $res;
}
$res .= $s[$p];
}
}
throw new Exception("Premature end of line");
}
var_dump(getDefined($str));
Output:
array(3) {
["nam'e"]=>
string(7) "va\'lue"
["na\me2"]=>
string(7) "value'2"
["a"]=>
string(1) "b"
}
