The NO-BREAK SPACE and many other symbols need 2 bytes in their UTF-8 representation; so, in a context where strings are supposed to be UTF-8, an isolated non-ASCII byte (>127, not preceded by xC2) is an unrecognized character... OK, it is only a layout problem (!), but does it corrupt the whole string?
How can I avoid this "unexpected behaviour"? (It occurs in some functions and not in others.)
Example (generating unexpected behaviour with preg_match only):
header("Content-Type: text/plain; charset=utf-8"); // same if text/html
//PHP Version 5.5.4-1+debphp.org~precise+1
//using a .php file encoded as UTF8.
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
$m=str_word_count($s,1);
var_dump($m); // ok
$s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE"; // utf8-encoded nbsp
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // ok!
$m=str_word_count($s,1);
var_dump($m); // ok
This is not a complete answer because I do not say why some PHP functions "fail entirely on invalidly encoded strings" and others do not: see @deceze in the question's comments and @hakre's answer.
If you are looking for a PCRE replacement for str_word_count(), see my preg_word_count() below.
PS: about the "PHP5 built-in library behaviour uniformity" discussion, my conclusion is that PHP5 is not so bad, but we have to create a lot of user-defined wrapper (façade) functions (see the diversity of PHP frameworks!)... Or wait for PHP6 :-)
Thanks @pebbl! If I understand your link, there is a lack of error messages in PHP. So a possible workaround for the problem I illustrated is to add an error condition... I found the condition here (it ensures valid UTF-8!)... And thanks @deceze for reminding me that there is a built-in function to check this condition (I edited the code afterwards).
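For readers who only want the validity check itself, here is a minimal sketch. The helper name isValidUtf8() is made up for illustration; it relies on the fact that a pattern with the /u modifier makes preg_match() return false on input that is not valid UTF-8. If the mbstring extension is available, mb_check_encoding($s, 'UTF-8') should serve the same purpose.

```php
<?php
// Hypothetical helper (not part of PHP): true when $s is valid UTF-8.
function isValidUtf8(string $s): bool
{
    // The empty pattern matches any valid string; with /u, invalid
    // UTF-8 makes preg_match() return false instead of 1.
    return preg_match('//u', $s) === 1;
}

var_dump(isValidUtf8("NO-BREAK\xC2\xA0SPACE")); // bool(true)
var_dump(isValidUtf8("NO-BREAK\xA0SPACE"));     // bool(false)
```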
Putting the issues together, a solution translated to a function (EDITED, thanks to @hakre's comments!):
function my_word_count($s,$triggError=true) {
    if ( preg_match_all('/[-\'\p{L}]+/u',$s,$m) !== false )
        return count($m[0]);
    else {
        if ($triggError) trigger_error(
            // no need for mb_check_encoding($s,'UTF-8'), see hakre's answer;
            // so I was wrong, there is no 'mysterious error' with preg functions
            (preg_last_error()==PREG_BAD_UTF8_ERROR)?
                'non-UTF8 input!': 'other error',
            E_USER_NOTICE
        );
        return NULL;
    }
}
Now (edited after thinking about @hakre's answer), about uniform behaviour: we can develop a reasonable function with the PCRE library that mimics the str_word_count behaviour, accepting bad UTF-8. For this task I used the @bobince iconv tip:
/**
 * Like str_word_count() but showing how preg can do the same.
 * This function is more flexible but not faster than str_word_count.
 * @param $wRgx the "word regular expression" as defined by the user.
 * @param $triggError when true, a PCRE error triggers an E_USER_NOTICE.
 * @param $OnBadUtfTryAgain mimic the str_word_count behaviour.
 * @return 0 or positive integer as word-count, negative as PCRE error.
 */
function preg_word_count($s,$wRgx='/[-\'\p{L}]+/u', $triggError=true,
$OnBadUtfTryAgain=true) {
if ( preg_match_all($wRgx,$s,$m) !== false )
return count($m[0]);
else {
$lastError = preg_last_error();
$chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
if ($OnBadUtfTryAgain && $chkUtf8)
return preg_word_count(
iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
);
elseif ($triggError) trigger_error(
$chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
E_USER_NOTICE
);
return -$lastError;
}
}
Demonstrating (try other inputs!):
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
print "\n-- str_word_count=".str_word_count($s,0);
print "\n-- preg_word_count=".preg_word_count($s);
$s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE"; // utf8-encoded nbsp
print "\n-- str_word_count=".str_word_count($s,0);
print "\n-- preg_word_count=".preg_word_count($s);
Okay, I can somewhat feel your disappointment that things didn't work out easily when switching from str_word_count to preg_match_all. However, the way you ask the question is a bit imprecise; I'll try to answer it anyway. Imprecise, because you make a large number of wrong assumptions that you obviously take for granted (it happens to the best of us). I hope I can correct this a little:
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
This code is wrong. You blame PHP here for not giving a warning or something, but I must say the only one to blame here is "you". PHP does allow you to check for the error. Before you conclude that a warning has to be given, I have to remind you that there are different ways to deal with errors. One is to emit messages; another is to signal errors through return values. And if we visit the manual page of preg_match_all and look at the documentation of the return value, we find this:
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
The part at the end:
FALSE if an error occurred [Highlight by me]
is a common way in error handling to signal to the calling code that some error occurred. Let's review the code that you think does not work:
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
The only thing this code shows is that the person who typed it (I guess it was you) clearly decided not to do any error handling. That's fine, unless that person then also protests that the code doesn't work.
The sad thing about this is that it is a common user error: if you write fragile code (e.g. without error handling), don't expect it to work in a solid manner. That will never happen.
So what does this require when you program? First of all, you should know the functions you use. That normally requires knowledge of the input parameters and the return values. You normally find that information documented. Use the manual. Second, you actually need to care about return values and do the error handling on your own. The function alone does not know what it means if an error occurred. Is it an exception? Then you probably need to do the exception handling, as in the demo example:
<?php
/**
 * @link http://stackoverflow.com/q/19316127/367456
*/
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
$result = preg_match_all('/[-\'\p{L}]+/u',$s,$m);
if ($result === FALSE) {
switch (preg_last_error()) {
case PREG_BAD_UTF8_ERROR:
throw new InvalidArgumentException(
'UTF-8 encoded binary string expected.'
);
default:
throw new RuntimeException('preg error occurred.');
}
}
var_dump($m); // nothing at all corrupted...
In any case it means you need to look what you do, learn about it and write more code. No magic. No bug. Just a bit of work.
The other part you have in front of you is perhaps to understand what characters in software are, but that is largely independent of concrete programming languages like PHP. For example, you can take an introductory read here:
A tutorial on character code issues
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The first is a must-read, or perhaps a must-bookmark, because it is a lot to read, but it explains it all very well.
I am keeping a record of every request made to my website. I am very aware of the security measures that need to be taken before executing any MySQL query that contains data coming from query strings. I clean it of injections as much as possible, and so far all tests have been successful using:
htmlspecialchars, strip_tags, mysqli_real_escape_string.
But on the logs of pages visited I find query strings of failed hack attempts that contain a lot of php code:
?1=%40ini_set%28"display_errors"%2C"0"%29%3B%40set_time_limit%280%29%3B%40set_magic_quotes_runtime%280%29%3Becho%20%27->%7C%27%3Bfile_put_contents%28%24_SERVER%5B%27DOCUMENT_ROOT%27%5D.%27/webconfig.txt.php%27%2Cbase64_decode%28%27PD9waHAgZXZhb
In the previous example we can see:
display_errors, set_time_limit, set_magic_quotes_runtime, file_put_contents
Another example:
/?s=/index/%5Cthink%5Capp/invokefunction&function=call_user_func_array&vars[0]=file_put_contents&vars[1][]=ctlpy.php&vars[1][]=<?php @assert($_REQUEST["ysy"]);?>ysydjsjxbei37$
This one is worse: there is even some <?php and $_REQUEST["ysy"] stuff in there. Although I am able to sanitize it, strip tags, and encode < or >, when I decode the string I can see the type of requests that are being sent.
Is there any way to detect a string that contains php code like:
filter_var($var, FILTER_SANITIZE_PHP);
FYI: This is not a real function, I am trying to give an idea of what I am looking for.
or some sort of function:
function findCode($var){
return ($var contains PHP) ? true : false
}
Again, not real
No need to sanitize, that has been taken care of, just to detect PHP code in a string. I need this because I want to detect them and save them in other logs.
NOTE: NEVER EXECUTE OR EVAL CODE COMING FROM QUERY STRINGS
After reading lots of comments, @KIKO Software came up with an ingenious idea using the PHP tokenizer, but it ended up being extremely difficult because the string to be analyzed needed to have almost perfect syntax or it would fail.
So the best solution I came up with is a simple function that tries to find commonly used PHP statements, in my case especially in query strings with code injection. Another advantage of this solution is that we can modify the list and add as many PHP statements as we want. Keep in mind that making the list bigger will considerably slow down your script. This function uses strpos instead of preg_match (regex), as strpos generally performs faster for plain substring searches.
This will not find 100% of the PHP code inside a string, but you can customize it to find as much as is required. Never include terms that could be used in regular English, like 'echo' or 'if'.
function findInStr($string, $findarray){
$found=false;
for($i=0;$i<sizeof($findarray);$i++){
$res=strpos($string,$findarray[$i]);
if($res !== false){
$found=true;
break;
}
}
return $found;
}
Simply use:
$search_line=array(
'file_put_contents',
'<?=',
'<?php',
'?>',
'eval(',
'$_REQUEST',
'$_POST',
'$_GET',
'$_SESSION',
'$_SERVER',
'exec(',
'shell_exec(',
'invokefunction',
'call_user_func_array',
'display_errors',
'ini_set',
'set_time_limit',
'set_magic_quotes_runtime',
'DOCUMENT_ROOT',
'include(',
'include_once(',
'require(',
'require_once(',
'base64_decode',
'file_get_contents',
'sizeof',
'array('
);
if(findInStr("this has some <?php echo 'PHP CODE' ?>",$search_line)){
echo "PHP found";
}
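One caveat with the list above: the logged query strings shown earlier are percent-encoded (for example ini_set%28 rather than ini_set( ), so needles ending in "(" would miss them. A possible variant, sketched here under the assumption that the raw query string is what gets logged, decodes first and matches case-insensitively. The function name findPhpMarkers() is hypothetical; rawurldecode() and stripos() are standard PHP functions.

```php
<?php
// Hypothetical variant of findInStr(): returns every needle that occurs
// in the URL-decoded query string, ignoring letter case.
function findPhpMarkers(string $rawQuery, array $needles): array
{
    // "%40ini_set%28" becomes "@ini_set(" before searching
    $decoded = rawurldecode($rawQuery);
    $hits = [];
    foreach ($needles as $needle) {
        if (stripos($decoded, $needle) !== false) {
            $hits[] = $needle;
        }
    }
    return $hits;
}

$query = '?1=%40ini_set%28"display_errors"%2C"0"%29%3B%40set_time_limit%280%29';
print_r(findPhpMarkers($query, array('ini_set(', 'eval(', 'SET_TIME_LIMIT')));
```

Returning the list of matched needles rather than a boolean makes it easier to write a useful entry in the separate log.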
I have the following code as the only way I know to convert a float to a string with the fewest possible significant digits required to reproduce it (dtoa() with mode 4 in C).
$i = 14;
do {
$str = sprintf("%.{$i}e", $x);
$i++;
} while ($x != (float) $str);
The Hack typechecker reports an error because it expects the first parameter to sprintf() to be a literal string so it can check it against the arguments. Is there a way I can turn that off for this line?
Or is there another way I could achieve the same thing? Perhaps with the NumberFormatter class?
The typechecker has various methods of suppressing errors. The most appropriate in this case is probably HH_IGNORE_ERROR to suppress this particular error.
As written, your code will produce an error like Typing[4110] Invalid argument. Take the error code, in this case "4110", and use it to add the ignore annotation:
/* HH_IGNORE_ERROR[4110] Allow dynamic sprintf() explain explain etc */
$str = sprintf("%.{$i}e", $x);
I think your error code probably is exactly 4110, but I don't have the typechecker in front of me to verify for sure. Make sure to use the right code from your error message.
Note that for technical reasons the parser is pretty finicky about HH_IGNORE_ERROR -- it must be a block-style comment with no extra whitespace from what I've written above, until after the final ] at which point you can write as much as you like in the comment explaining.
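Separately from the typechecker question, the loop in the question has no exit condition for NAN or INF (NAN never compares equal to anything), so a bounded version may be safer. The sketch below is plain PHP, not Hack, and the function name shortestFloatString() is made up. It assumes IEEE 754 doubles, for which 17 significant digits are always enough to round-trip, and it starts from zero decimals rather than 14.

```php
<?php
// Hypothetical helper: shortest "%e" string that converts back to $x.
function shortestFloatString(float $x): string
{
    if (!is_finite($x)) {
        // sprintf() output for these values does not round-trip, so bail out
        return is_nan($x) ? 'NAN' : ($x > 0 ? 'INF' : '-INF');
    }
    // Try 1 to 16 significant digits (0 to 15 decimals in %e notation)
    for ($decimals = 0; $decimals < 16; $decimals++) {
        $str = sprintf('%.' . $decimals . 'e', $x);
        if ((float) $str === $x) {
            return $str;
        }
    }
    // 17 significant digits: assumed sufficient for any finite double
    return sprintf('%.16e', $x);
}

echo shortestFloatString(0.1), "\n";
echo shortestFloatString(1 / 3), "\n";
```

In Hack the dynamic format string would presumably still need the HH_IGNORE_ERROR annotation described above.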
In PHP, you can handle errors by calling or die to exit when you encounter certain errors, like this:
$handle = fopen($location, "r") or die("Couldn't get handle");
Using die() isn't a great way to handle errors. I'd rather return an error code so the parent function can decide what to do, instead of just ending the script ungracefully and displaying the error to the user.
However, PHP shows an error when I try to replace or die with or return, like this:
$handle = fopen($location, "r") or return 0;
Why does or die() work, but not or return 0?
I want to thank you for asking this question, since I had no idea that you couldn't perform an or return in PHP. I was as surprised as you when I tested it. This question gave me a good excuse to do some research and play around in PHP's internals, which was actually quite fun. However, I'm not an expert on PHP's internals, so the following is a layman's view of the PHP internals, although I think it's fairly accurate.
or return doesn't work because return isn't considered an "expression" by the language parser - simple as that.
The keyword or is defined in the PHP language as a token called T_LOGICAL_OR, and the only expression where it seems to be defined looks like this:
expr T_LOGICAL_OR { zend_do_boolean_or_begin(&$1, &$2 TSRMLS_CC); } expr { zend_do_boolean_or_end(&$$, &$1, &$4, &$2 TSRMLS_CC); }
Don't worry about the bits in the braces - that just defines how the actual "or" logic is handled. What you're left with is expr T_LOGICAL_OR expr, which just says that it's a valid expression to have an expression, followed by the T_LOGICAL_OR token, followed by another expression.
An expr is also defined by the parser, as you would expect. It can either be an r_variable, which just means that it's a variable that you're allowed to read, or an expr_without_variable, which is a fancy way of saying that an expression can be made of other expressions.
You can do or die() because the language construct die (not a function!) and its alias exit are both represented by the token T_EXIT, and T_EXIT is considered a valid expr_without_variable, whereas the return statement - token T_RETURN - is not.
Now, why is T_EXIT considered an expression but T_RETURN is not? Honestly, I have no clue. Maybe it was just a design choice made just to allow the or die() construct that you're asking about. The fact that it used to be so widely used - at least in things like tutorials, since I can't speak to a large volume of production code - seems to imply that this may have been an intentional choice. You would have to ask the language developers to know for sure.
With all of that said, this shouldn't matter. While the or die() construct seemed ubiquitous in tutorials (see above) a few years ago, it's not really recommended, since it's an example of "clever code". or die() isn't a construct of its own, but rather it's a trick which uses - some might say abuses - two side-effects of the or operator:
it is very low in the operator precedence list, which means practically every other expression will be evaluated before it is
it is a short-circuiting operator, which means that the second operand (the bit after the or) is not executed if the first operand evaluates to TRUE, since if one operand is TRUE in an or expression, then the whole expression is.
Some people consider this sort of trickery to be unfavourable, since it is harder for a programmer to read yet only saves a few characters of space in the source code. Since programmer time is expensive, and disk space is cheap, you can see why people don't like this.
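Both side effects can be seen in a few lines. This is only an illustrative sketch; the fallback() function and the variable names are invented for the demonstration.

```php
<?php
// Records each call so we can see whether the right-hand side ran.
function fallback(array &$log): bool
{
    $log[] = 'fallback ran';
    return true;
}

$log = [];

// Low precedence: parsed as ($viaOr = false) or fallback($log)
$viaOr = false or fallback($log);

// "||" binds tighter than "=": parsed as $viaPipes = (false || fallback($log))
$viaPipes = false || fallback($log);

// Short-circuit: the left side is already true, so fallback() is not called
$skipped = true or fallback($log);

var_dump($viaOr);      // bool(false)
var_dump($viaPipes);   // bool(true)
var_dump(count($log)); // int(2)
```

This is why $handle = fopen(...) or die(...) assigns the result of fopen() to $handle first and only then evaluates die().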
Instead, you should be explicit with your intent by expanding your code into a full-fledged if statement:
$handle = fopen($location, "r");
if ($handle) {
// process the file
} else {
return 0;
}
You can even do the variable assignment right in the if statement. Some people still find this unreadable, but most people (myself included) disagree:
if ($handle = fopen($location, "r")) {
// process the file
} else {
return 0;
}
One last thing: it is convention that returning 0 as a status code indicates success, so you would probably want to return a different value to indicate that you couldn't open the file.
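To make that concrete, here is a small sketch of a function that reports failure to its caller instead of ending the script. The name readFirstLine() and the choice of null as the failure value are assumptions for illustration, not something from the question.

```php
<?php
// Hypothetical helper: first line of a file, or null if it cannot be opened.
function readFirstLine(string $location): ?string
{
    // "@" only silences the warning; the failure is still reported via null
    $handle = @fopen($location, 'r');
    if ($handle === false) {
        return null; // the caller decides what to do
    }
    $line = fgets($handle);
    fclose($handle);
    // fgets() returns false on an empty file; treat that as an empty line
    return $line === false ? '' : rtrim($line, "\r\n");
}

$line = readFirstLine('/path/that/does/not/exist');
if ($line === null) {
    echo "Couldn't get handle\n";
}
```

Using null (or an exception) avoids the ambiguity of 0, which a caller could mistake for success.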
Return is fairly special - it cannot be anything like a function since it's a tool to exit functions. Imagine this:
if(1==1) return(); // say what??
If it was like this, return would have to be a function that does a "double exit", leaving not just its own scope but the caller's, too. Therefore return is nothing like an expression, it simply can't work that way.
Now in theory, return could be an expression that evaluates to (say) false and then quits the function; maybe a later php version will implement this.
The same thing applies to goto which would be a charm to work as a fallback; and yes, fallbacks are necessary and often make the code readable, so if someone complains about "clever code" (which certainly is a good point) maybe php should have some "official" way to do such a thing:
connectMyDB() fallback return false;
Something like try...catch, just more to the point. And personally, I'd be a lot happier with "or" doing this job since it's working well with English grammar: "connect or report failure".
TLDR: you're absolutely right: return, goto, break - none of them works. Easy to understand why but still annoying.
I've also stumbled upon that once. All I could find was this:
https://bugs.php.net/bug.php?id=40712
Look at the comment down below:
this is not a bug
I've searched in the documentation and I think it's due to the fact that return 0 is a statement whereas die() is essentially an expression. You can't run $handle = return 0; but $handle = fun(); is valid code.
Regarding error handling I would recommend custom codes or using custom handlers and triggers. The latter are described here for example.
This is the preg_match I am trying to use to find specific text in a text file.
if (preg_match($regexp,$textFile,$result) > 0) {
echo "Found ".$result[0];
} else {
echo "Not found";
}
However, the result is always Found and nothing more. The result array is empty. Now I read that preg_match can't work with long strings.
My text file is about 300KB, so that's 300000 characters, I guess.
I am 100% sure that the searched string is in the text file, and the fact that the preg_match function returns a value above 0 means it found it, but somehow it didn't place it into the result array.
So my question would be, how do I make it work?
regexp would be /[specific text]\{(\d*)\}/ so, of course, I want to be able to get the number in the parentheses.
You'll be glad I found this question. As of PHP 5.2, they introduced a limit on the PCRE functions (strictly a limit on backtracking, which in practice caps how much text some patterns can get through), which defaulted to 100k. That's not so bad. The bad part is that it silently fails if the limit is exceeded.
The solution? Up the limit. The initialization parameter is pcre.backtrack_limit.
No, don't up the PCRE limit. Don't do things without understanding them. This is a common bug with PHP's PCRE.
Read this awesome answer by @ridgerunner:
https://stackoverflow.com/a/7627962/1077650
this class of regex will repeatably (and silently) crash Apache/PHP with an unhandled segmentation fault due to a stack overflow!
PHP Bug 1: PHP sets: pcre.recursion_limit too large.
PHP Bug 2: preg_match() does not return FALSE on error.
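Whichever side of the limit argument you take, the silent part can be avoided by asking PCRE what happened after the call. The following is a rough sketch; checkedMatch() is a hypothetical wrapper, and the pattern in the usage lines only stands in for the asker's real one. It checks preg_last_error() regardless of the return value, so it does not depend on preg_match() returning FALSE.

```php
<?php
// Hypothetical wrapper: returns the match array, null for "no match",
// and throws when PCRE reports an error instead of failing silently.
function checkedMatch(string $pattern, string $subject): ?array
{
    $count = preg_match($pattern, $subject, $matches);
    $error = preg_last_error();
    if ($count === false || $error !== PREG_NO_ERROR) {
        $names = [
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        ];
        throw new RuntimeException($names[$error] ?? "PCRE error code $error");
    }
    return $count === 1 ? $matches : null;
}

$result = checkedMatch('/name\{(\d*)\}/', 'some text name{42} more text');
echo $result === null ? "Not found" : "Found " . $result[1];
```

With a check like this, hitting pcre.backtrack_limit or pcre.recursion_limit should surface as an exception naming the limit rather than as an empty result.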
At work today we were trying to come up with any reason you would use strspn.
I searched google code to see if it's ever been implemented in a useful way and came up blank. I just can't imagine a situation in which I would really need to know the length of the first segment of a string that contains only characters from another string. Any ideas?
Although you link to the PHP manual, the strspn() function comes from C libraries, along with strlen(), strcpy(), strcmp(), etc.
strspn() is a convenient alternative to picking through a string character by character, testing if the characters match one of a set of values. It's useful when writing tokenizers. The alternative to strspn() would be lots of repetitive and error-prone code like the following:
for (p = stringbuf; *p; p++) {
if (*p == 'a' || *p == 'b' || *p = 'c' ... || *p == 'z') {
/* still parsing current token */
}
}
Can you spot the error? :-)
Of course in a language with builtin support for regular expression matching, strspn() makes little sense. But when writing a rudimentary parser for a DSL in C, it's pretty nifty.
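The tokenizer use case carries over to PHP's strspn() and strcspn(), which accept a start offset as a third argument. As a small sketch (digitRuns() is an invented name, and splitting on digits is just an example rule), a string can be cut into alternating runs without any per-character comparisons:

```php
<?php
// Hypothetical tokenizer: alternating runs of digits and non-digits.
function digitRuns(string $input): array
{
    $digits = '0123456789';
    $tokens = [];
    $pos = 0;
    $length = strlen($input);
    while ($pos < $length) {
        // Length of the run of digits starting at $pos ...
        $run = strspn($input, $digits, $pos);
        if ($run === 0) {
            // ... or, if there is none, of the run of non-digits
            $run = strcspn($input, $digits, $pos);
        }
        $tokens[] = substr($input, $pos, $run);
        $pos += $run;
    }
    return $tokens;
}

print_r(digitRuns('abc123de4')); // abc, 123, de, 4
```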
It's based on the ANSI C function strspn(). It can be useful in low-level C parsing code, where there is no high-level string class. It's considerably less useful in PHP, which has lots of useful string parsing functions.
Well, by my understanding, it's the same thing as this regex:
^[set]*
Where set is the string containing the characters to be found.
You could use it to search for any number or text at the beginning of a string and split.
It seems it would be useful when porting code to php.
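The equivalence with ^[set]* can be checked directly. In this sketch leadingLength() is a hypothetical helper, and it assumes the set is a non-empty string of single-byte characters; preg_quote() escapes any of them that are special in a pattern.

```php
<?php
// Hypothetical helper: length of the match of ^[set]* against $subject.
function leadingLength(string $subject, string $set): int
{
    $pattern = '/^[' . preg_quote($set, '/') . ']*/';
    preg_match($pattern, $subject, $m);
    return strlen($m[0]);
}

$subject = '2024-05-17T10:00';
$set = '0123456789-';

var_dump(strspn($subject, $set));        // int(10)
var_dump(leadingLength($subject, $set)); // int(10)
```

Both calls measure the leading "2024-05-17", so substr($subject, 0, 10) splits off the date part.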
I think it's great for blacklisting and letting the user know where the error started, like MySQL returning the part of the query where the error occurred.
Please see this function, which lets the user know which part of their comment is not valid:
function blacklistChars($yourComment){
    $blacklistedChars = "!@#$%^&*()";
    // length of the leading segment that contains no blacklisted character
    $validLength = strcspn($yourComment, $blacklistedChars);
    if ($validLength !== strlen($yourComment))
    {
        // the invalid part starts at offset $validLength
        $error = "Your comment contains invalid chars starting from here: `" .
            substr($yourComment, $validLength) . "`";
        return $error;
    }
    return false;
}
$yourComment = "Hello, why can you not type and $ dollar sign in the text?";
$yourCommentError = blacklistChars($yourComment);
if ($yourCommentError <> false)
echo $yourCommentError;
It is useful specifically for functions like atoi - where you have a string you want to convert to a number, and you don't want to deal with anything that isn't in the set "-.0123456789"
But yes, it has limited use.
-Adam