preg_match and long strings

preg_match and long strings - php

This is the preg_match i am trying to use to find specific text in text file.
if (preg_match($regexp,$textFile,$result) > 0) {
echo "Found ".$result[0];
} else {
echo "Not found";
}
However, the result is always Found and nothing more. The result array is empty. Now i read that preg_match can't work with long strings.
My text file is about 300KB so thats 300000 characters i guess.
I am 100% sure that the searched string is in the text file, and the fact that preg_match function returns value above 0 means it found it, but it didn't place it into the result array somehow.
So my question would be, how do i make it work?
regexp would be /[specific text]\{(\d*)\}/ so, of course i want to be able to get the number in the parentheses.

You'll be glad I found this question. As of PHP 5.2, they introduced a limit on the size of text that the PCRE functions can be used on, which defaults to 100k. That's not so bad. The bad part is that it silently fails if greater than that.
The solution? Up the limit. The initialization parameter is pcre.backtrack_limit.

No, don't up the pcre limit. Don't do things without understand them. This is a common bug with php pcre
Read this awesome answer by #ridgerunner :
https://stackoverflow.com/a/7627962/1077650
this class of regex will repeatably (and silently) crash Apache/PHP with an unhandled segmentation fault due to a stack overflow!
PHP Bug 1: PHP sets: pcre.recursion_limit too large.
PHP Bug 2: preg_match() does not return FALSE on error.

Related

How to detect if a string contains PHP code? PHP

I am keeping record of every request made to my website. I am very aware of the security measurements that need to be taken before executing any MySQL query that contains data coming from query strings. I clean it as much as possible from injections and so far all tests have been successful using:
htmlspecialchars, strip_tags, mysqli_real_escape_string.
But on the logs of pages visited I find query strings of failed hack attempts that contain a lot of php code:
?1=%40ini_set%28"display_errors"%2C"0"%29%3B%40set_time_limit%280%29%3B%40set_magic_quotes_runtime%280%29%3Becho%20%27->%7C%27%3Bfile_put_contents%28%24_SERVER%5B%27DOCUMENT_ROOT%27%5D.%27/webconfig.txt.php%27%2Cbase64_decode%28%27PD9waHAgZXZhb
In the previous example we can see:
display_errors, set_time_limit, set_magic_quotes_runtime, file_put_contents
Another example:
/?s=/index/%5Cthink%5Capp/invokefunction&function=call_user_func_array&vars[0]=file_put_contents&vars[1][]=ctlpy.php&vars[1][]=<?php #assert($_REQUEST["ysy"]);?>ysydjsjxbei37$
This one is worst, there is even some <?php and $_REQUEST["ysy"] stuff in there. Although I am able to sanitize it, strip tags and encode < or > when I decode the string I can see the type of requests that are being sent.
Is there any way to detect a string that contains php code like:
filter_var($var, FILTER_SANITIZE_PHP);
FYI: This is not a real function, I am trying to give an idea of what I am looking for.
or some sort of function:
function findCode($var){
return ($var contains PHP) ? true : false
}
Again, not real
No need to sanitize, that has been taken care of, just to detect PHP code in a string. I need this because I want to detect them and save them in other logs.
NOTE: NEVER EXECUTE OR EVAL CODE COMING FROM QUERY STRINGS

After reading lots of comments #KIKO Software came up with an ingenious idea by using PHP tokenizer, but it ended up being extremely difficult because the string that is to be analyzed needed to have almost prefect syntax or it would fail.
So the best solution that I came up with is a simple function that tries to find commonly used PHP statements, In my case, especially on query strings with code injection. Another advantage of this solution is that we can modify and add to the list as many PHP statements as we want. Keep in mind that making the list bigger will considerably slow down your script. this functions uses strpos instead of preg_match (regex ) as its proven to perform faster.
This will not find 100% PHP code inside a string, but you can customize it to find as much as is required, never include terms that could be used in regular English, like 'echo' or 'if'
function findInStr($string, $findarray){
$found=false;
for($i=0;$i<sizeof($findarray);$i++){
$res=strpos($string,$findarray[$i]);
if($res !== false){
$found=true;
break;
}
}
return $found;
}
Simply use:
$search_line=array(
'file_put_contents',
'<?=',
'<?php',
'?>',
'eval(',
'$_REQUEST',
'$_POST',
'$_GET',
'$_SESSION',
'$_SERVER',
'exec(',
'shell_exec(',
'invokefunction',
'call_user_func_array',
'display_errors',
'ini_set',
'set_time_limit',
'set_magic_quotes_runtime',
'DOCUMENT_ROOT',
'include(',
'include_once(',
'require(',
'require_once(',
'base64_decode',
'file_get_contents',
'sizeof',
'array('
);
if(findInStr("this has some <?php echo 'PHP CODE' ?>",$search_line)){
echo "PHP found";
}

String exectuion without a function

One of my website was hacked, during the cleanup process I found a few files with some cryptic code. Doing a bit of research I found that a piece of string was getting executed without being passed to a function. I am a bit stumped as to how PHP is executing that string? Does PHP automatically eval strings?
Here's a piece of the string I found $VCAzeJzvMaKiJXvfeJZd='99V38=jVN4C;.Z<'^'ZK3RLX50;Z OG5R'; when I add it to a test script as below it produces create_function:
<?php
echo $VCAzeJzvMaKiJXvfeJZd='99V38=jVN4C;.Z<'^'ZK3RLX50;Z OG5R';
// output : create_function
P.S: in the malicious file I found, there's no echo, just a long cryptic string. I added echo in the code above for test purpose.

No, strings don't automagically get eval'ed, in this case it's actually 2 strings being bitwise Xor'd. Note the ^ between the 2 strings.

Simple php preg_match not working for me

I have Googled and looked at a lot of threads on this subject but none answers this issue for me. I have even copied several different filter strings that do the same thing into do my simple validation. They all work on www.functions-online.com/preg_match.htm but none of them are working in my if() statement.
PHP ver. 5.3.3
My code to allow for only Alpha-Numaric and spaces:
$myString = 'My Terrific Game';
if (!preg_match('/^[a-z0-9\s]+$/i', $myString)) { "$errorMessage }
I have even reduced the filter to '/^[a-z]+$/' and the test string to 'mytarrificgame' and it still returns zero. It's as if preg_match() isn't functioning at all.
Anyone have any suggestions?

Some things you need to correct:
You have unclosed quotes in "$errorMessage. It should be $errorMessage or "$errorMessage".
There is no command there. If you want to print the value, use echo $errorMessage.
And is that variable being set anywhere in the code?
This should work
if (!preg_match('/^[a-z0-9\s]+$/i', $myString)) { echo $errorMessage; }
Or, if you think about it, matching only 1 character outside of those allowed should be enough.
Code:
$myString = 'My Terrific Game';
$errorMessage = 'ERROR: Unallowed character';
if (preg_match('/[^a-z0-9\s]/i', $myString)) {
echo $errorMessage;
} else {
echo 'Valid input';
}
ideone Demo

I tried the filter string shown in the 2nd example of the first answer above and it didn't even work on the on-line test page but it did give me a new starting place to solve my dilemma. When I moved the carrot to before the opening bracket, it kind of works but flaky. After trying different things I finally got back to my original '/^[a-z0-9\s]+$\i' and it started working rock solid. I think the reason that it wouldn't work at all for me before is that I was expecting a boolean test on the preg-match() expression to accept a zero as false. This is probably a php version 5.3.3 thing. I tried the following;
if(preg-match('/^[a-z0-9\s]+$/i', $testString) === 0) { "Do Something" }
It is now working. By looking deep into the preg-match() description, I found that it only returns a boolean false if an error occurred. A valid no-match returns a zero.
Thanks to all who looked here and considered my problem.

String corrupted or preg_match bug?

The NO-BREAK SPACE and many other UTF-8 symbols need 2 bytes to its representation; so, in a supposed context of UTF8 strings, an isolated (not preceded by xC2) byte of non-ASCII (>127) is a non-recognized character... Ok, it is only a layout problem (!), but it corrupts the whole string?
How to avoid this "non-expected behaviour"? (it occurs in some functions and not in others).
Example (generating an non-expected behaviour with preg_match only):
header("Content-Type: text/plain; charset=utf-8"); // same if text/html
//PHP Version 5.5.4-1+debphp.org~precise+1
//using a .php file enconded as UTF8.
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
$m=str_word_count($s,1);
var_dump($m); // ok
$s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE"; // utf8-encoded nbsp
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // ok!
$m=str_word_count($s,1);
var_dump($m); // ok

This is not a complete answer because I not say why some PHP functions "fail entirely on invalidly encoded strings" and others not: see #deceze at question's comments and #hakre answer.
If you are looking for an PCRE-replacement for str_word_count(), see my preg_word_count() below.
PS: about "PHP5's build-in-library behaviour uniformity" discussion, my conclusion is that PHP5 is not so bad, but we have create a lot of user-defined wrap (façade) functions (see diversity of PHP-framworks!)... Or wait for PHP6 :-)
Thanks #pebbl! If I understand your link, there are a lack of error messagens on PHP. So a possible workaround of my illustred problem is to add an error condition... I find the condition here (it ensures valid utf8!)... And thanks #deceze for remember that exists a build-in function for check this condition (I edited the code after).
Putting the issues together, a solution translated to a function (EDITED, thanks to #hakre comments!),
function my_word_count($s,$triggError=true) {
if ( preg_match_all('/[-\'\p{L}]+/u',$s,$m) !== false )
return count($m[0]);
else {
if ($triggError) trigger_error(
// not need mb_check_encoding($s,'UTF-8'), see hakre's answer,
// so, I wrong, there are no 'misteious error' with preg functions
(preg_last_error()==PREG_BAD_UTF8_ERROR)?
'non-UTF8 input!': 'other error',
E_USER_NOTICE
);
return NULL;
}
}
Now (edited after thinking around #hakre answer), about uniform behaviour: we can develop a reasonable function with PCRE library that mimic the str_word_count behaviour, accepting bad UTF8. For this task I used the #bobince iconv tip:
/**
* Like str_word_count() but showing how preg can do the same.
* This function is most flexible but not faster than str_word_count.
* #param $wRgx the "word regular expression" as defined by user.
* #param $triggError changes behaviour causing error event.
* #param $OnBadUtfTryAgain mimic the str_word_count behaviour.
* #return 0 or positive integer as word-count, negative as PCRE error.
*/
function preg_word_count($s,$wRgx='/[-\'\p{L}]+/u', $triggError=true,
$OnBadUtfTryAgain=true) {
if ( preg_match_all($wRgx,$s,$m) !== false )
return count($m[0]);
else {
$lastError = preg_last_error();
$chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
if ($OnBadUtfTryAgain && $chkUtf8)
return preg_word_count(
iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
);
elseif ($triggError) trigger_error(
$chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
E_USER_NOTICE
);
return -$lastError;
}
}
Demonstrating (try other inputs!):
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
print "\n-- str_word_count=".str_word_count($s,0);
print "\n-- preg_word_count=".preg_word_count($s);
$s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE"; // utf8-encoded nbsp
print "\n-- str_word_count=".str_word_count($s,0);
print "\n-- preg_word_count=".preg_word_count($s);

Okay, I can somewhat feel your disappointment that things didn't worked easily out switching from str_word_count to preg_match_all. However the way you ask the question is a bit imprecise, I try to answer it anyway. Imprecise, because you have a high amount of wrong assumptions that you obviously take for granted (it happens to the best of us). I hope I can correct this a little:
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
This code is wrong. You blame PHP here for not giving a warning or something, but I must admit, the only one to blame here is "you". PHP does allow you to check for the error. Before you judge so early that a warning has to be given in error handling, I have to remind you that there are different ways how to deal with errors. Some dealing is with giving messages, another type of dealing with errors is by telling about them with return values. And if we visit the manual page of preg_match_all and look for the documentation of the return value, we can find this:
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
The part at the end:
FALSE if an error occurred [Highlight by me]
is some common way in error handling to signal the calling code that some error occured. Let's review your code of which you think it does not work:
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m); // empty! (corrupted)
The only thing this code shows is that the person who typed it (I guess it was you), clearly decided to not do any error handling. That's fine unless that person as well protests that the code won't work.
The sad thing about this is, that this is a common user-error, if you write fragile code (e.g. without error handling), don't expect it to work in a solid manner. That will never happen.
So what does this require when you program? First of all you should know about the functions you use. That normally requires knowledge about the input parameters and the return values. You find that information normally documented. Use the manual. Second you actually need to care about return values and do the error handling your own. The function alone does not know what it means if an error occured. Is it an exception? Then you need to do the exception handling probably as in the demo example:
<?php
/**
* #link http://stackoverflow.com/q/19316127/367456
*/
$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
$result = preg_match_all('/[-\'\p{L}]+/u',$s,$m);
if ($result === FALSE) {
switch (preg_last_error()) {
case PREG_BAD_UTF8_ERROR:
throw new InvalidArgumentException(
'UTF-8 encoded binary string expected.'
);
default:
throw new RuntimeException('preg error occured.');
}
}
var_dump($m); // nothing at all corrupted...
In any case it means you need to look what you do, learn about it and write more code. No magic. No bug. Just a bit of work.
The other part you've in front of you is perhaps to understand what characters in a software are, but that is more independent to concrete programming languages like PHP, for example you can take an introductory read here:
A tutorial on character code issues
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The first is a must read or perhaps must-bookmark, because it is a lot to read but it explains it all very good.

Find function and line number where variable gets modified

Let's say that at the beginning of a random function variable $variables['content'] is 1,000 characters long.
This random function is very long, with many nested functions within.
At the end of the function $variables['content'] is only 20 characters long.
How do you find which of nested functions modified this variable?

Not sure how you'd want to return it but you could use the magic constant __LINE__. That returns the line of the current document.
You could create a variable called $variables['line'] and assign __LINE__ as the value, where appropriate.

If it were me, first I'd consider breaking apart the 1000-line beast. There's probably no good reason for it to be so huge. Yes, it'll take longer than just trying to monkey-patch your current bug, but you'll probably find dozens more bugs in that function just trying to break it apart.
Lecture over, I'd do a search/replace for $variables['content'].*=([^;]*); to a method call like this: $variables['content'] = hello(\1, __LINE__);. This will fail if you are assigning strings with semicolons in them or something similar, so make sure you inspect every change carefully. Write a hello() function that takes two parameters: whatever it is you're assigning to $variables['content'] and the line number. In hello(), simply print your line number to the log or standard error or whatever is most convenient, and then return the first argument unchanged.
When you're done fixing it all up, you can either remove all those silly logging functions, or you can see if 'setting the $variables['content'] action' is important enough to have its own function that does something useful. Refactoring can start small. :)

I think this is a problem code tracing can help with.
In my case I had this variable that was being modified across many functions, and I didn't know where.
The problem is that at some point in the program the variable (a string) was around 40,000 characters, then at the end of the program something had cut it to 20 characters.
To find this information I stepped through the code with the Zend debugger. I found the information I wanted (what functions modified the variable), but it took me a while.
Apparently XDebug does tell you what line numbers, in what functions the variables are modified:
Example, tracing documentation, project home, tutorial article.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

preg_match and long strings - php

Related

How to detect if a string contains PHP code? PHP

String exectuion without a function

Simple php preg_match not working for me

String corrupted or preg_match bug?

Find function and line number where variable gets modified

Categories

Resources