Searching for hex string in a file in php? - php

I'm currently using the following two methods in my class to get the job done:
function xseek($h,$pos){
rewind($h);
if($pos>0)
fread($h,$pos);
}
function find($str){
return $this->startingindex($this->name,$str);
}
function startingindex($a,$b){
$lim = 1 + filesize($a) - strlen($b)/2;
$h = fopen($a,"rb");
rewind($h);
for($i=0;$i<$lim;$i++){
$this->xseek($h,$i);
if($b==strtoupper(bin2hex(fread($h,strlen($b)/2)))){
fclose($h);
return $i;
}
}
fclose($h);
return -1;
}
I realize this is quite inefficient, especially for PHP, but I'm not allowed any other language on my hosting plan.
I ran a couple tests, and when the hex string is towards the beginning of the file, it runs quickly and returns the offset. When the hex string isn't found, however, the page hangs for a while. This kills me inside because last time I tested with PHP and had hanging pages, my webhost shut my site down for 24 hours due to too much cpu time.
Is there a better way to accomplish this (finding a hex string's offset in a file)? Are there certain aspects of this that could be improved to speed up execution?
I would read the entire contents of the file into one hex string and use strrpos, but I was getting errors about maximum memory being exceeded. Would this be a better method if I chopped the file up and searched large pieces with strrpos?
edit:
To specify, I'm dealing with a settings file for a game. The settings and their values are in a block where there is a 32-bit int before the setting, then the setting, a 32-bit int before the value, and then the value. Both ints represent the lengths of the following strings. For example, if the setting was "test" and the value was "0", it would look like (in hex): 00000004746573740000000130. Now that you mention it, this does seem like a bad way to go about it. What would you recommend?
edit 2:
I tried a file that was below the maximum memory I'm allowed and tried strrpos, but it was very much slower than the way I've been trying.
edit 3: in reply to Charles:
What's unknown is the length of the settings block and where it starts. What I do know is what the first and last settings USUALLY are. I've been using these searching methods to find the location of the first and last setting and determine the length of the settings block. I also know where the parent block starts. The settings block is generally no more than 50 bytes into its parent, so I could start the search for the first setting there and limit how far it will search. The problem is that I also need to find the last setting. The length of the settings block is variable and could be any length. I could read the file the way I assume the game does, by reading the size of the setting, reading the setting, reading the size of the value, reading the value, etc. until I reached a byte with value -1, or FF in hex. Would a combination of limiting the search for the first setting and reading the settings properly make this much more efficient?

You have a lot of garbage code. For example, this code is doing nearly nothing:
function xseek($h,$pos){
rewind($h);
if($pos>0)
fread($h,$pos);
}
because it reads from the beginning of the file every time. Furthermore, why do you need to read something if you are not returning it? Maybe you are looking for fseek()?
If you need to find a hex string in a binary file, it may be better to use something like this: http://pastebin.com/fpDBdsvV (tell me if there are any bugs/problems).
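As a rough illustration of searching in larger pieces rather than byte by byte, here is a minimal sketch. It is not the pastebin code; the function name `find_hex_offset` and the `$chunkSize` parameter are made up for this example. It converts the hex needle to raw bytes once, reads the file in chunks, and carries a small overlap between chunks so a match that straddles a chunk boundary is not missed.

```php
<?php
// Hypothetical helper: find the byte offset of a hex-encoded needle in a file.
// Returns -1 when the needle is absent or cannot be decoded.
function find_hex_offset(string $path, string $hexNeedle, int $chunkSize = 65536): int
{
    $needle = hex2bin($hexNeedle);          // search raw bytes, not hex text
    if ($needle === false || $needle === '') {
        return -1;
    }
    $overlap = strlen($needle) - 1;         // bytes carried over between chunks
    $h = fopen($path, 'rb');
    if ($h === false) {
        return -1;
    }
    $offset = 0;                            // file offset of the first byte in $buffer
    $buffer = '';
    while (!feof($h)) {
        $chunk = fread($h, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer .= $chunk;
        $pos = strpos($buffer, $needle);
        if ($pos !== false) {
            fclose($h);
            return $offset + $pos;
        }
        // Keep only the tail that could still be the start of a match.
        $keep = min($overlap, strlen($buffer));
        $offset += strlen($buffer) - $keep;
        $buffer = $keep > 0 ? substr($buffer, -$keep) : '';
    }
    fclose($h);
    return -1;
}
```

Memory use stays at roughly one chunk plus the overlap, so the file size no longer matters for the memory limit.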
But if you are parsing a game's settings file, I'd advise you to use fseek(), fread() and unpack() to seek to the place where the setting is, read a portion of bytes, and unpack it into PHP's variable types.
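A minimal sketch of that approach, based only on the layout described in the question (a 32-bit length, the setting, a 32-bit length, the value). The function name `read_settings`, the `$maxLen` cut-off, and the assumption that the lengths are big-endian (as the example 00000004 suggests) are mine, not something confirmed about the game's format.

```php
<?php
// Hypothetical helper, not taken from the game: reads length-prefixed
// name/value pairs starting at byte offset $start.
// Assumptions: lengths are unsigned 32-bit big-endian, and any length above
// $maxLen (for example one starting with an FF byte) marks the end of the block.
function read_settings(string $path, int $start = 0, int $maxLen = 4096): array
{
    $h = fopen($path, 'rb');
    if ($h === false) {
        return [];
    }
    fseek($h, $start);                        // jump straight to the block

    // Reads one length-prefixed string, or returns null when the block ends.
    $readString = function () use ($h, $maxLen) {
        $raw = fread($h, 4);
        if ($raw === false || strlen($raw) < 4) {
            return null;                      // end of file or truncated length
        }
        $len = unpack('N', $raw)[1];          // 'N' = unsigned 32-bit big-endian
        if ($len > $maxLen) {
            return null;                      // implausible length: stop parsing
        }
        if ($len === 0) {
            return '';
        }
        $str = fread($h, $len);
        return ($str !== false && strlen($str) === $len) ? $str : null;
    };

    $settings = [];
    while (($name = $readString()) !== null && ($value = $readString()) !== null) {
        $settings[$name] = $value;
    }
    fclose($h);
    return $settings;
}
```

Reading the block this way visits each byte once, so there is no need to search for the last setting at all.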

Related

PHP Return Long String is not Usable as String

I am having a for loop which is iterating through a large array $result. Every array item $row is used to generate a new line of string $line. This string is concatenated into $str. To preserve memory, I am using unset($result[$i]) to clear every array item that has been processed in the original array. This looks like this:
$resultcount = count($result);
for($i=0; $i<$resultcount; ++$i){
$row = $result[$i];
$line = do_something($row);
$str.= '<tr><td>'.$line.'</td></tr>';
unset($result[$i]);
}
return $str;
This (more or less exotic piece of code) works unless the string exceeds a length of approx. 1'000'000 characters. In this case the return value is just empty. This is highly irritating because:
The calculation effort (in this example do_something()) is not a problem. By using echo count($result).' - '.strlen($str)."\n" I can see that the loop finishes properly. There is no web server or php error shown.
The memory_limit is also not a problem. I am working with more data on other parts of the application.
It appears that the problem lies in return $str itself. If I am using return substr($str, 0, 980000) then it just works fine. Further debugging shows that the string gets tainted as soon as it reaches the length of 999'775 bytes.
I can't put longer return value string into another string variable. But I am able to do a strlen() to get a proper result (1310307). So the return value string has a total length of 1'310'307 bytes. But I can't use them properly as a string.
Where does this limitation come from and how may I circumvent it?
You seem to describe return value in contradicting way:
In this case the return value is just empty.
but
But I am able to do a strlen() to get a proper result (1310307). So the return value string has a total length of 1'310'307 bytes.
Something can't be empty and have length at the same time?
I have tested returning a string that long and it works fine for me. Returning larger and larger strings works to a point memory limit is exceeded, which triggers explicit Fatal Error.
On a technical side strings can be much larger than million characters:
Note: As of PHP 7.0.0, there are no particular restrictions regarding the length of a string on 64-bit builds. On 32-bit builds and in earlier versions, a string can be as large as up to 2GB (2147483647 bytes maximum)
http://php.net/manual/en/language.types.string.php
I suspect that the failing operation has to do with how you use the result rather than return from the function.
Well, it took me two and a half years to figure this out ;(
Rarst was right: it was not a problem of the function compiling and delivering the string. Later in the process I was using preg_replace() to do some further replacements. With the default pcre.backtrack_limit, this function can fail on long strings and return NULL instead of a string. This could be fixed by using ini_set('pcre.backtrack_limit', 99999999999) beforehand, as mentioned in How to do preg_replace on a long string
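To make this kind of failure visible instead of silently ending up with an empty value, the result of preg_replace() can be checked. A small sketch follows; the wrapper name `checked_preg_replace` is hypothetical and only illustrates the check.

```php
<?php
// Hypothetical wrapper: fail loudly instead of silently passing null along.
function checked_preg_replace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_last_error() reports why PCRE gave up,
        // for example PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException('preg_replace failed, PCRE error code ' . preg_last_error());
    }
    return $result;
}
```

With a check like this, the problem would have surfaced at the preg_replace() call rather than looking like a broken return value.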

Dealing with binary data and mb_function overloading?

I have a piece of code here which I need either assurance, or "no no no!" about in regards to if I'm thinking about this in the right or entirely wrong way.
This has to deal with cutting a variable of binary data at a specific spot, and also dealing with multi-byte overloaded functions. For example substr is actually mb_substr and strlen is mb_strlen etc.
Our server is set to UTF-8 internal encoding, and so there's this weird little thing I do to circumvent it for this binary data manipulation:
// $binary_data is the incoming variable with binary
// $clip_size is generally 16, 32 or 64 etc
$curenc = mb_internal_encoding();// this should be "UTF-8"
mb_internal_encoding('ISO-8859-1');// change so mb_ overloading doesnt screw this up
if (strlen($binary_data) >= $clip_size) {
$first_hunk = substr($binary_data,0,$clip_size);
$rest_of_it = substr($binary_data,$clip_size);
} else {
// skip since its shorter than expected
}
mb_internal_encoding($curenc);// put this back now
I can't really show input and output results, since its binary data. But tests using the above appear to be working just fine and nothing is breaking...
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Notes:
The binary data coming in, is a concatenation of those two parts to begin with.
The first part's size is always known (but changes).
The second part's size is entirely unknown.
This is pretty darn close to encryption and stuffing the IV on front and ripping it off again (which oddly, I found some old code which does this same thing lol ugh).
So, I guess my question is:
Is this actually fine to be doing?
Or is there something super obvious I'm overlooking?
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Your brain is right, you shouldn't be doing that in PHP in the first place. :)
Is this actually fine to be doing?
It depends on the purpose of your code.
I can't see any reason off the top of my head to cut a binary like that. So my first instinct would be "no no no!" Use unpack() to properly parse the binary into usable variables.
That being said if you just need to split your binary because reasons, then I guess this is fine. As long as your tests confirm that the code is working for you, I can't see any problem.
As a side note, I don't use mbstring overloading exactly for this kind of use case - i.e. for whenever you need the default string functions.
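For illustration, the split could be expressed with unpack() roughly like this. It is a sketch under assumptions: the helper name `split_binary` is made up, and it presumes strlen() counts bytes, i.e. mbstring function overloading is off (the mbstring.func_overload setting was removed in PHP 8.0).

```php
<?php
// Hypothetical helper: split $data into the first $clipSize bytes and the remainder.
// Returns null when the data is shorter than $clipSize.
function split_binary(string $data, int $clipSize): ?array
{
    if ($clipSize < 1 || strlen($data) < $clipSize) {
        return null;
    }
    // 'a<n>' takes exactly n raw bytes; 'a*' takes everything that is left.
    $parts = unpack("a{$clipSize}head/a*rest", $data);
    return [$parts['head'], $parts['rest']];
}
```

Because the 'a' format works on raw bytes, the internal encoding does not need to be switched back and forth.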
MY SOLUTION TO THE WORRY
I dislike answering my own questions... but I wanted to share what I have decided on nonetheless.
Although what I had, "worked", I still wanted to change the hack-job-altering of the charset encoding. It was old code I admit, but for some reason, I never looked at hex2bin bin2hex for doing this. So I decided to change it to use those.
The resulting new code:
// $clip_size remains the same value for continuity later,
// only spot-adjusted here... which is why the *2.
$hex_data = bin2hex( $binary_data );
$first_hunk = hex2bin( substr($hex_data,0,($clip_size*2)) );
$rest_of_it = hex2bin( substr($hex_data,($clip_size*2)) );
if ( !empty($rest_of_it) ) { /* process the result for reasons */ }
Using the hex functions turns the mess into something mb will not screw with either way. A 1 million iteration bench loop showed the process wasn't anything to be worried about (and it's safer to run in parallel to itself than the mb_encoding mangle method).
So I'm going with this. It sits better in my mind, and resolves my question for now... until I revisit this old code again in a few years and go "what was I thinking ?!".

Overhead of large length in PHP stream_get_line()

I'm testing out a PHP library which relies upon another streams based library. My experience is with sockets which is low level compared to streams so I'm a little unsure if streams is going to be flexible enough.
Basically, using sockets, I would write a buffer loop which checked each chunk of data for an EOL character like \n like so...
$data = NULL;
while ($buffer = socket_read($connection,1024)) {
$data .= $buffer;
if (strpos($buffer, "\n") !== FALSE) {
$data = substr($data,0,-1);
break;
}
}
I'm looking to do something similar without having to rewrite their entire library. Here's the problem...
stream_get_line($handle,$length,$EOL) accepts a length value but will truncate everything longer than that. PHP docs state...
Reading ends when length bytes have been read, when the string specified by ending is found (which is not included in the return value), or on EOF (whichever comes first).
... and there's no offset param so I can't use it the same way to get the remainder. This means that if I don't know the length of the data, or if it's inconsistent, I need to set length high enough to deal with ANY possible length of data.
That isn't the big concern, since that seems to work. The question is: will setting the $length value to something like 512000 (500 KB) cause a lot of unnecessary overhead for shorter responses?
As far as the statement from the docs:
Reading ends when length bytes have been read, when the string
specified by ending is found (which is not included in the return
value), or on EOF (whichever comes first).
This doesn't mean that if you pass a length of 1,024 or even 200,000 and the line is longer than that the rest of the data is truncated or lost. This is just the maximum amount of data the call will return if it doesn't reach EOF/EOL before that.
So if you have a line that is 102,500 bytes long and you have the length parameter set to 1024, you will have to have called stream_get_line 101 times before the entire line of data is read but the entire line will be read, just not in one call to the function.
To directly answer your question, there won't be any extra overhead for short responses if you pass a large value. How it works under the hood really depends on what type of stream you are reading from. In the case of a network stream with a large length value, it may take a long time before the call returns any data in the event that it takes a long time for length data to be read from the network, where if you were reading in smaller chunks, you might start to get more data from the network before everything has been received.
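The point that nothing beyond the length is thrown away can be checked with a small sketch. It uses an in-memory stream and a deliberately tiny length of 4; both are choices made for this example only.

```php
<?php
// Read a stream in small pieces; data beyond the length stays in the stream
// and is returned by the following calls.
$h = fopen('php://memory', 'w+b');
fwrite($h, "abcdefghij\nsecond\n");
rewind($h);

$pieces = [];
while (($piece = stream_get_line($h, 4, "\n")) !== false) {
    $pieces[] = $piece;   // at most 4 bytes per call, delimiter not included
}
fclose($h);

echo implode('', $pieces), "\n";
```

Every byte except the delimiters shows up in one of the pieces. Note that the return value alone does not say whether a call stopped at the delimiter or at the length limit.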

Can I use file_get_contents() to compare two files?

I want to synchronize two directories. And I use
file_get_contents($source) === file_get_contents($dest)
to compare two files. Is there any problem to do this?
I would rather do something like this:
function files_are_equal($a, $b)
{
// Check if filesize is different
if(filesize($a) !== filesize($b))
return false;
// Check if content is different
$ah = fopen($a, 'rb');
$bh = fopen($b, 'rb');
$result = true;
while(!feof($ah))
{
if(fread($ah, 8192) !== fread($bh, 8192)) // strict: != would treat numeric strings like "1e1" and "010" as equal
{
$result = false;
break;
}
}
fclose($ah);
fclose($bh);
return $result;
}
This checks if the filesize is the same, and if it is it goes through the file step by step.
Checking the modified time can be a quick way in some cases, but it doesn't really tell you anything other than that the files have been modified at different times. They still might have the same content.
Using sha1 or md5 might be a good idea, but this requires going through the whole file to create that hash. If this hash is something that could be stored and used later, then it's a different story probably, but yeah...
Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing the hashes will be very fast.
You can also consider other methods like comparing filemtime or filesize, but this will give you guaranteed results even if there's just one bit that's changed.
Memory: e.g. you have a 32 MB memory limit, and the files are 20 MB each. Unrecoverable fatal error while trying to allocate memory. This can be solved by checking the files by smaller parts.
Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster. (If you want to be 110% sure, you can compare the files byte-by-byte when the hashes match, but the hash check alone already rules out the 99%+ of cases where the content, and therefore the hash, has changed.)
Efficiency: do some preliminary checks - e.g. there's no point comparing two files if their size differs.
This will work, but is inherently more inefficient than calculating a checksum for both files and comparing these. Good candidates for checksum algorithms are SHA1 and MD5.
http://php.net/sha1_file
http://php.net/md5_file
if (sha1_file($source) === sha1_file($dest)) { // strict comparison avoids numeric-looking hashes such as "0e..." comparing equal
/* ... */
}
Seems a bit heavy. This will load both files completely as strings and then compare.
I think you might be better off opening both files manually and ticking through them, perhaps just doing a filesize check first.
There isn't anything wrong with what you are doing here, except it is a little inefficient. When getting the contents of each file and comparing them, especially with larger files or binary data, you may run into problems.
I would take a look at filemtime() (last modified) and filesize(), and run some tests to see if that works for you. It should be all you need at a fraction of the computation power.
Check first for the obvious:
Compare size
Compare file type (mime-type).
Compare content.
(add comparison of date, file name and other metadata to this obvious list if those are also not supposed to be similar).
When comparing content, hashing does not sound very efficient, as @Oli says in his comment. If the files are different they most likely will be different already in the beginning. Calculating a hash of two 50 MB files and then comparing the hashes sounds like a waste of time if the second bit is already different...
Check this post on php.net. It looks very similar to that of @Svish, but it also compares the file mime-type. A smart addition if you ask me.
Something I noticed is there is a lack of the N! factor. In other words - to do the filesize() function you would first have to check every file against all of the other files. Why? What if the first file and the second file are different sizes but the third file is the same size.
So first, you need to get a list of all of the files you are going to work with. If you want to do the filesize type of thing, then use the complete path/name string as the key for an array and store the filesize() information as the value. Then you sort the array so all files which are the same size are lined up. THEN you can check file sizes. However, this does not mean they really are the same, only that they are the same size.
You need to do something like the sha1_file() command and, like above, make an array where the keys are the path/names and the values are the hashes returned. Sort those, and then just do a simple walk through the array, storing the sha1_file() value to test against. So is A==B? Yes. Do any additional tests, then get rid of the SECOND file and continue.
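The idea of lining up files by size and then by hash could be sketched roughly as follows. This is one possible reading of the description above, using grouping instead of sorting; the function name `find_duplicate_files` is hypothetical.

```php
<?php
// Hypothetical helper: return groups of paths whose contents are identical.
function find_duplicate_files(array $paths): array
{
    // Pass 1: bucket by size; a file with a unique size cannot have a duplicate.
    $bySize = [];
    foreach ($paths as $path) {
        $bySize[filesize($path)][] = $path;
    }

    // Pass 2: hash only the files that share their size with another file.
    $groups = [];
    foreach ($bySize as $sameSize) {
        if (count($sameSize) < 2) {
            continue;
        }
        $byHash = [];
        foreach ($sameSize as $path) {
            $byHash[sha1_file($path)][] = $path;
        }
        foreach ($byHash as $sameHash) {
            if (count($sameHash) > 1) {
                $groups[] = $sameHash;
            }
        }
    }
    return $groups;
}
```

Each file is hashed at most once, and files with a unique size are never read at all.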
Why am I commenting? I'm working on this same problem and I just found out my program did not work correctly. So now I'm going to go correct it using the sha1_file() function. :-)

How to determine the memory footprint (size) of a variable?

Is there a function in PHP (or a PHP extension) to find out how much memory a given variable uses? sizeof just tells me the number of elements/properties.
memory_get_usage helps in that it gives me the memory size used by the whole script. Is there a way to do this for a single variable?
Note that this is on a development machine, so loading extensions or debug tools is feasible.
There's no direct way to get the memory usage of a single variable, but as Gordon suggested, you can use memory_get_usage. That will return the total amount of memory allocated, so you can use a workaround and measure usage before and after to get the usage of a single variable. This is a bit hacky, but it should work.
$start_memory = memory_get_usage();
$foo = "Some variable";
echo memory_get_usage() - $start_memory;
Note that this is in no way a reliable method, you can't be sure that nothing else touches memory while assigning the variable, so this should only be used as an approximation.
You can actually turn that to an function by creating a copy of the variable inside the function and measuring the memory used. Haven't tested this, but in principle, I don't see anything wrong with it:
function sizeofvar($var) {
$start_memory = memory_get_usage();
$tmp = unserialize(serialize($var));
return memory_get_usage() - $start_memory;
}
You probably need a memory profiler. I have gathered information from SO and copied the most important parts here, which may help you too.
As you probably know, Xdebug dropped the memory profiling support since the 2.* version. Please search for the "removed functions" string here: http://www.xdebug.org/updates.php
Removed functions
Removed support for Memory profiling as that didn't work properly.
Other Profiler Options
php-memory-profiler
https://github.com/arnaud-lb/php-memory-profiler. This is what I've done on my Ubuntu server to enable it:
sudo apt-get install libjudy-dev libjudydebian1
sudo pecl install memprof
echo "extension=memprof.so" > /etc/php5/mods-available/memprof.ini
sudo php5enmod memprof
service apache2 restart
And then in my code:
<?php
memprof_enable();
// do your stuff
memprof_dump_callgrind(fopen("/tmp/callgrind.out", "w"));
Finally open the callgrind.out file with KCachegrind
Using Google gperftools (recommended!)
First of all install the Google gperftools by downloading the latest package here: https://code.google.com/p/gperftools/
Then as always:
sudo apt-get update
sudo apt-get install libunwind-dev -y
./configure
make
make install
Now in your code:
memprof_enable();
// do your magic
memprof_dump_pprof(fopen("/tmp/profile.heap", "w"));
Then open your terminal and launch:
pprof --web /tmp/profile.heap
pprof will open a new window in your existing browser session showing the profile.
Xhprof + Xhgui (the best in my opinion to profile both cpu and memory)
With Xhprof and Xhgui you can profile the cpu usage as well or just the memory usage if that's your issue at the moment.
It's a very complete solutions, it gives you full control and the logs can be written both on mongo or in the filesystem.
For more details see here.
Blackfire
Blackfire is a PHP profiler by SensioLabs, the Symfony2 guys https://blackfire.io/
If you use puphpet to set up your virtual machine you'll be happy to know it's supported ;-)
Xdebug and tracing memory usage
Xdebug 2 is an extension for PHP. Xdebug allows you to log all function calls, including parameters and return values, to a file in different formats. There are three output formats: one is meant as a human-readable trace, another is more suited for computer programs as it is easier to parse, and the last one uses HTML for formatting the trace. You can switch between the formats with the setting. An example would be available here
forp
forp is a simple, non-intrusive, production-oriented PHP profiler. Some of its features are:
measurement of time and allocated memory for each function
CPU usage
file and line number of the function call
output as Google's Trace Event format
caption of functions
grouping of functions
aliases of functions (useful for anonymous functions)
DBG
DBG is a full-featured PHP debugger, an interactive tool that helps you debug PHP scripts. It works on a production and/or development web server and allows you to debug your scripts locally or remotely, from an IDE or console. Its features are:
Remote and local debugging
Explicit and implicit activation
Call stack, including function calls, dynamic and static method calls, with their parameters
Navigation through the call stack with ability to evaluate variables in corresponding (nested) places
Step in/Step out/Step over/Run to cursor functionality
Conditional breakpoints
Global breakpoints
Logging for errors and warnings
Multiple simultaneous sessions for parallel debugging
Support for GUI and CLI front-ends
IPv6 and IPv4 networks supported
All data transferred by debugger can be optionally protected with SSL
No, there is not. But you can serialize($var) and check the strlen of the result for an approximation.
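As a sketch of that approximation (the function name `approx_size` is made up here), note that it measures the length of the serialized text, not the memory PHP actually allocates for the variable:

```php
<?php
// Rough size estimate: length of the serialized form, not actual memory use.
function approx_size($var): int
{
    return strlen(serialize($var));
}

echo approx_size("abc"), "\n";   // the serialized form is s:3:"abc";
```

This is mainly useful for comparing variables against each other, since the serialized length grows with the amount of data stored.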
In reply to Tatu Ulmanen's answer:
It should be noted, that $start_memory itself will take up memory (PHP_INT_SIZE * 8).
So the whole function should become:
function sizeofvar($var) {
$start_memory = memory_get_usage();
$var = unserialize(serialize($var));
return memory_get_usage() - $start_memory - PHP_INT_SIZE * 8;
}
Sorry to add this as an extra answer, but I can not yet comment on an answer.
Update: The *8 is not definite. It apparently can depend on the PHP version and possibly on 64/32 bit.
You can't retrospectively calculate the exact footprint of a variable as two variables can share the same allocated space in the memory
Let's try to share memory between two arrays, we see that allocating the second array costs half of the memory of the first one. When we unset the first one, nearly all the memory is still used by the second one.
echo memory_get_usage()."\n"; // <-- 433200
$c=range(1,100);
echo memory_get_usage()."\n"; // <-- 444348 (+11148)
$d=array_slice($c, 1);
echo memory_get_usage()."\n"; // <-- 451040 (+6692)
unset($c);
echo memory_get_usage()."\n"; // <-- 444232 (-6808)
unset($d);
echo memory_get_usage()."\n"; // <-- 433200 (-11032)
So we can't conclude that the second array uses half the memory, as that becomes false when we unset the first one.
For a full view about how the memory is allocated in PHP and for which use, I suggest you to read the following article: How big are PHP arrays (and values) really? (Hint: BIG!)
The Reference Counting Basics in the PHP documentation has also a lot of information about memory use, and references count to shared data segment.
The different solutions exposed here are good for approximations but none can handle the subtle management of PHP memory.
calculating newly allocated space
If you want the newly allocated space after an assignment, then you have to use memory_get_usage() before and after the allocation, as using it with a copy does give you an erroneous view of the reality.
// open output buffer
echo "Result: ";
// call every function once
range(1,1); memory_get_usage();
echo memory_get_usage()."\n";
$c=range(1,100);
echo memory_get_usage()."\n";
Remember that if you want to store the result of the first memory_get_usage(), the variable has to already exist before, and memory_get_usage() has to be called another previous time, and every other function also.
If you want to echo like in the above example, your output buffer has to be already opened to avoid accounting memory needed to open the output buffer.
calculating required space
If you want to rely on a function to calculate the required space to store a copy of a variable, the following code takes care of different optimizations:
<?php
function getMemorySize($value) {
// existing variable with integer value so that the next line
// does not add memory consumption when initiating $start variable
$start=1;
$start=memory_get_usage();
// json functions return less bytes consumptions than serialize
$tmp=json_decode(json_encode($value));
return memory_get_usage() - $start;
}
// open the output buffer, and calls the function one first time
echo ".\n";
getMemorySize(NULL);
// test inside a function so that the memory used by adding
// the variable name to the $GLOBALS array is not counted
function test() {
// call the function name once
range(1,1);
// we will compare the two values (see comment above about initialization of $start)
$start=1;
$start=memory_get_usage();
$c=range(1,100);
echo memory_get_usage()-$start."\n";
echo getMemorySize($c)."\n";
}
test();
// same result, this works fine.
// 11044
// 11044
Note that the size of the variable name matters in the memory allocated.
Check your code!!
A variable has a basic size defined by the inner C structure used in the PHP source code. This size does not fluctuate in the case of numbers. For strings, it would add the length of the string.
typedef union _zvalue_value {
long lval; /* long value */
double dval; /* double value */
struct {
char *val;
int len;
} str;
HashTable *ht; /* hash table value */
zend_object_value obj;
} zvalue_value;
If we do not take the initialization of the variable name into account, we already know how much a variable uses (in case of numbers and strings):
44 bytes in the case of numbers
+ 24 bytes in the case of strings
+ the length of the string (including the final NUL character)
(those numbers can change depending on the PHP version)
You have to round up to a multiple of 4 bytes due to memory alignment. If the variable is in the global space (not inside a function), it will also allocate 64 more bytes.
So if you want to use one of the code samples on this page, you have to check that the results for some simple test cases (strings or numbers) match those figures, taking into account every one of the points in this post ($GLOBALS array, first function call, output buffer, ...).
See:
memory_get_usage() — Returns the amount of memory allocated to PHP
memory_get_peak_usage() — Returns the peak of memory allocated by PHP
Note that this won't give you the memory usage of a specific variable though. But you can put calls to these function before and after assigning the variable and then compare the values. That should give you an idea of the memory used.
You could also have a look at the PECL extension Memtrack, though the documentation is a bit lacking, if not to say, virtually non-existent.
You could opt for calculating memory difference on a callback return value. It's a more elegant solution available in PHP 5.3+.
function calculateFootprint($callback) {
$startMemory = memory_get_usage();
$result = call_user_func($callback);
return memory_get_usage() - $startMemory;
}
$memoryFootprint = calculateFootprint(
function() {
return range(1, 1000000);
}
);
echo ($memoryFootprint / (1024 * 1024)) . ' MB' . PHP_EOL;
I had a similar problem, and the solution I used was to write the variable to a file then run filesize() on it. Roughly like this (untested code):
function getVariableSize ( $foo )
{
$tmpfile = "temp-" . microtime(true) . ".txt";
file_put_contents($tmpfile, $foo);
$size = filesize($tmpfile);
unlink($tmpfile);
return $size;
}
This solution isn't terribly fast because it involves disk IO, but it should give you something much more exact than the memory_get_usage tricks. It just depends upon how much precision you require.
The following script shows total memory usage of a single variable.
function getVariableUsage($var) {
$total_memory = memory_get_usage();
$tmp = unserialize(serialize($var));
return memory_get_usage() - $total_memory;
}
$var = "Hey, what's you doing?";
echo getVariableUsage($var);
Check this out
http://www.phpzag.com/how-much-memory-do-php-variables-use/
Never tried, but Xdebug traces with xdebug.collect_assignments may be enough.
