Overhead of large length in PHP stream_get_line() - php

I'm testing out a PHP library which relies upon another streams-based library. My experience is with sockets, which are low level compared to streams, so I'm a little unsure whether streams will be flexible enough.
Basically, using sockets, I would write a buffer loop which checked each chunk of data for an EOL character like \n, like so...
$data = NULL;
while ($buffer = socket_read($connection, 1024)) {
    $data .= $buffer;
    if (strpos($buffer, "\n") !== FALSE) {
        $data = substr($data, 0, -1); // strip the trailing "\n"
        break;
    }
}
I'm looking to do something similar without having to rewrite their entire library. Here's the problem...
stream_get_line($handle,$length,$EOL) accepts a length value but will truncate everything longer than that. PHP docs state...
Reading ends when length bytes have been read, when the string specified by ending is found (which is not included in the return value), or on EOF (whichever comes first).
... and there's no offset param so I can't use it the same way to get the remainder. This means that if I don't know the length of the data, or if it's inconsistent, I need to set length high enough to deal with ANY possible length of data.
That isn't the big concern; that part seems to work. The question is: will setting the $length value to something like 512000 (500 KB) cause a lot of unnecessary overhead for shorter responses?

As far as the statement from the docs:
Reading ends when length bytes have been read, when the string
specified by ending is found (which is not included in the return
value), or on EOF (whichever comes first).
This doesn't mean that if you pass a length of 1,024 (or even 200,000) and the line is longer than that, the rest of the data is truncated or lost. It is just the maximum amount of data the call will return if it doesn't reach EOF/EOL first.
So if you have a line that is 102,500 bytes long and you have the length parameter set to 1024, you will have to call stream_get_line 101 times before the entire line of data is read, but the entire line will be read - just not in one call to the function.
To directly answer your question: there won't be any extra overhead for short responses if you pass a large value. How it works under the hood really depends on what type of stream you are reading from. In the case of a network stream with a large length value, it may take a long time before the call returns any data if it takes a long time for length bytes to arrive from the network, whereas if you were reading in smaller chunks, you might start to get data before everything has been received.
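If you do want to cap $length and still recover arbitrarily long lines, you can wrap the call in a loop yourself. A minimal sketch (read_full_line is my name, not a built-in; it assumes $handle is an open stream resource and relies on the fact that a chunk shorter than $length means the call stopped at the delimiter or EOF rather than at the length limit):
function read_full_line($handle, $length = 1024, $eol = "\n")
{
    $line = '';
    while (($chunk = stream_get_line($handle, $length, $eol)) !== false) {
        $line .= $chunk;
        // A chunk shorter than $length means stream_get_line() stopped
        // at the delimiter (or EOF), not at the length limit.
        if (strlen($chunk) < $length) {
            break;
        }
    }
    return $line;
}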

Related

PHP Return Long String is not Usable as String

I have a for loop which iterates through a large array $result. Each array item $row is used to generate a new line of string $line. This string is concatenated into $str. To preserve memory, I am using unset($result[$i]) to clear every array item that has been processed in the original array. This looks like this:
$str = '';
$resultcount = count($result);
for ($i = 0; $i < $resultcount; ++$i) {
    $row = $result[$i];
    $line = do_something($row);
    $str .= '<tr><td>'.$line.'</td></tr>';
    unset($result[$i]); // free the processed row to preserve memory
}
return $str;
This (more or less exotic piece of code) works unless the string exceeds a length of approx. 1'000'000 characters. In this case the return value is just empty. This is highly irritating because:
The calculation effort (in this example do_something()) is not the problem. Using echo count($result).' - '.strlen($str)."\n" I can see that the loop finishes properly. No web server or PHP error is shown.
The memory_limit is also not a problem. I am working with more data on other parts of the application.
It appears that the problem lies in return $str itself. If I am using return substr($str, 0, 980000) then it just works fine. Further debugging shows that the string gets tainted as soon as it reaches the length of 999'775 bytes.
I can't assign the longer return value to another string variable. But I am able to call strlen() on it and get a proper result (1310307). So the return value string has a total length of 1'310'307 bytes, but I can't use it properly as a string.
Where does this limitation come from and how may I circumvent it?
You seem to describe the return value in a contradictory way:
In this case the return value is just empty.
but
But I am able to do a strlen() to get a proper result (1310307). So the return value string has a total length of 1'310'307 bytes.
Something can't be both empty and have a length at the same time, can it?
I have tested returning a string that long and it works fine for me. Returning larger and larger strings works up to the point the memory limit is exceeded, which triggers an explicit fatal error.
On the technical side, strings can be much larger than a million characters:
Note: As of PHP 7.0.0, there are no particular restrictions regarding the length of a string on 64-bit builds. On 32-bit builds and in earlier versions, a string can be as large as up to 2GB (2147483647 bytes maximum)
http://php.net/manual/en/language.types.string.php
I suspect that the failing operation has to do with how you use the result rather than with the return from the function.
Well, it took me two and a half years to figure this out ;(
Rarst was right: it was not a problem of the function compiling and delivering the string. Later in the process I was using preg_replace() to do some further replacements, and by default this function fails silently on such long strings. This could be fixed by using ini_set('pcre.backtrack_limit', 99999999999) beforehand, as mentioned in How to do preg_replace on a long string
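For anyone who hits the same wall: preg_replace() returns NULL on failure, and preg_last_error() reports the reason, so the failure doesn't have to stay silent. A minimal sketch (the pattern, replacement and limit value are placeholders):
$result = preg_replace($pattern, $replacement, $longString);
if ($result === null && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    // Backtrack limit exhausted on the long string: raise it and retry.
    ini_set('pcre.backtrack_limit', '100000000');
    $result = preg_replace($pattern, $replacement, $longString);
}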

Predict size needed to store data in shared memory

I'm working with the PHP shm (part of the semaphores extension, not to be confused with the shmop ones!) functions in a project. Basically, the shared memory serves as a kind of heap. I have only one array inside, in which I store keys (with meaningless values) as a hashed index, so I can just check "Ah, it's there already". Now my problem is: that array can get quite big at times, but not always. I don't want to reserve a huge amount of memory I usually don't need, but rather resize dynamically.
I have registered an error handler that converts errors into ErrorExceptions, so I can catch the error thrown by shm_put_var when the memory is too small to store the array - but unfortunately PHP clears the segment when data doesn't fit in there, so all other data is lost, too. This isn't an option, therefore.
Because of this, I need a way to predict the size I'll need to store the data. One of the comments on shm_attach at php.net states that PHP appends a header of (PHP_INT_SIZE * 4) + 8 bytes length, and one variable needs strlen(serialize($foo)) + 4 * PHP_INT_SIZE + 4 (I have simplified the expression given in the comment; it's equal to mine but was blown up unnecessarily).
While the header size seems to be correct (any segment smaller than 24 bytes results in an error at creation, so 24 bytes seems to be the size of the header PHP puts in there), the size of each variable entry doesn't seem to hold true anymore in recent versions of PHP:
- I could store "1" in a shared memory segment with a size of 24 + strlen(serialize("1")) + 3 * PHP_INT_SIZE + 4 bytes (note the 3 in there instead of 4),
- I could NOT store "999" in one sized 24 + strlen(serialize("999")) + 4 * PHP_INT_SIZE + 4.
Does anyone know a way to predict how much memory is needed to store any data in shared memory using the shm functions, or has some reference on how shm stores the variables? (I read the whole contents using the shmop functions and printed them, but since it's binary data, it isn't reverse-engineerable in reasonable time.)
(I will provide code samples as needed; I'm just not sure which parts are relevant. Ping me if you want to see working samples - I have tried a lot, so I have samples ready for most cases.)
[Update] My C is pretty bad, so I didn't get far looking at the source (sysvshm.c and php_sysvshm.h), but I already found one issue with the solution that was suggested at php.net: while I could simplify the complex formula there to what I have included here (which was basically taken from the C source code), this is NOT possible with the original one, as there are typecasts and no floating-point math. The formula divides by sizeof(long) and multiplies by it again - which is useless in PHP but does round to multiples of sizeof(long) in C. So I need to correct that in PHP first. Still, this is not everything, as tests showed that I could store some values in even less memory than returned by the formula (see above).
As a workaround for the problem of a variable being deleted when you try to update it and there is not enough free space in the segment for the new value, you can first check whether there is enough free space, and only then proceed with the update.
The following function uses the shmop_* API to obtain the used, free and total space in a segment created with shm_attach.
function getMemSegmentStats($segmentKey) {
    $segId = shmop_open($segmentKey, 'a', 0, 0);
    $wc = PHP_INT_SIZE / 4; // number of 32-bit words per integer
    $stats = unpack("I{$wc}used/I{$wc}free/I{$wc}total",
                    shmop_read($segId, 8 + PHP_INT_SIZE, 3 * PHP_INT_SIZE));
    shmop_close($segId);
    return combineUnpackLHwords($stats);
}
function combineUnpackLHwords($array) {
    foreach ($array as $key => &$val) {
        if (preg_match('/([^\d]+)(\d+)/', $key, $matches)) {
            $key2 = $matches[1].($matches[2] + 1);
            $array[$matches[1]] = $val | $array[$key2] << 4 * PHP_INT_SIZE;
            unset($array[$key], $array[$key2]);
        }
    }
    return $array;
}
The function combineUnpackLHwords is needed on 64-bit machines because the unpack function doesn't unpack 64-bit integers, so they have to be constructed from the low-order and high-order 32-bit words (on 32-bit machines the function has no effect).
Example:
$segmentKey = ftok('/path/to/a/file','A') ;
$segmentStats = getMemSegmentStats($segmentKey) ;
print_r($segmentStats) ;
Output:
Array
(
    [used] => 3296
    [free] => 96704
    [total] => 100000
)
OK, answering this myself, as I have since figured it out. I still have no sources other than my own research, so feel free to comment with any helpful links or to answer on your own.
Most important thing first: a working formula to calculate the size necessary to store data in shared memory using the shm_* functions is:
$header = 24; // actually 4*4 + 8
$dataLength = (ceil(strlen(serialize($data)) / 4) * 4) + 16; // actually that 16 is 4*4
The header with the size $header is stored only once, at the beginning of the memory segment, and is written when the segment is allocated (by the first shm_attach call with that System V resource key), even if no data is written. Therefore, you cannot ever create a memory segment smaller than 24 bytes.
If you only want to use this and don't care about the details, just one warning: this is correct as long as PHP is compiled on a system that uses 32 bits for longs in C. If PHP is compiled with 64-bit longs, it's most likely a header size of 4 * 8 + 8 = 40, and each data variable needs (ceil(strlen(serialize($data)) / 8) * 8) + 32. Details in the explanation below.
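Wrapped up as a function, this is just the formula above parameterized by sizeof(long) - a sketch of my own, not an official API:
// Bytes shm_put_var() needs for $data, per the formula above.
// $longSize is sizeof(long) on the system PHP was compiled on:
// 4 in the case analyzed here, 8 if PHP was built with 64-bit longs.
function shm_var_size($data, $longSize = 4)
{
    $payload = strlen(serialize($data));
    $aligned = (int) (ceil($payload / $longSize) * $longSize);
    return $aligned + 4 * $longSize; // aligned data + aligned chunk header
}
// The segment header (written once by shm_attach) adds 4 * $longSize + 8 on top.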
So, how did I get there?
I looked into the PHP source code. I don't know much C, so what I'm telling you here is only how I understood it; it may be nothing more than a lot of hot air...
The relevant files are already linked in the question - look there. The important parts are:
From php_sysvshm.h:
typedef struct {
    long key;
    long length;
    long next;
    char mem;
} sysvshm_chunk;

typedef struct {
    char magic[8];
    long start;
    long end;
    long free;
    long total;
} sysvshm_chunk_head;
And from sysvshm.c:
/* these are lines 166 - 173 in the source code of PHP 5.2.17 (the one I found first);
   line numbers may differ in recent versions */
/* check if shm is already initialized */
chunk_ptr = (sysvshm_chunk_head *) shm_ptr;
if (strcmp((char*) &(chunk_ptr->magic), "PHP_SM") != 0) {
    strcpy((char*) &(chunk_ptr->magic), "PHP_SM");
    chunk_ptr->start = sizeof(sysvshm_chunk_head);
    chunk_ptr->end = chunk_ptr->start;
    chunk_ptr->total = shm_size;
    chunk_ptr->free = shm_size - chunk_ptr->end;
}
/* these are lines 371 - 397, comments as above */
/* {{{ php_put_shm_data
 * inserts an ascii-string into shared memory */
static int php_put_shm_data(sysvshm_chunk_head *ptr, long key, char *data, long len)
{
    sysvshm_chunk *shm_var;
    long total_size;
    long shm_varpos;

    total_size = ((long) (len + sizeof(sysvshm_chunk) - 1) / sizeof(long)) * sizeof(long) + sizeof(long); /* long alligment */

    if ((shm_varpos = php_check_shm_data(ptr, key)) > 0) {
        php_remove_shm_data(ptr, shm_varpos);
    }
    if (ptr->free < total_size) {
        return -1; /* not enough memeory */
    }

    shm_var = (sysvshm_chunk *) ((char *) ptr + ptr->end);
    shm_var->key = key;
    shm_var->length = len;
    shm_var->next = total_size;
    memcpy(&(shm_var->mem), data, len);
    ptr->end += total_size;
    ptr->free -= total_size;
    return 0;
}
/* }}} */
So, lots of code; I'll try to break it down.
The parts from php_sysvshm.h tell us what size those structures have; we'll need that. I'm assuming each char has 8 bits (which is most likely valid on any system) and each long has 32 bits (which may differ on some systems that actually use 64 bits - you have to change the numbers then).
sysvshm_chunk has 3*sizeof(long) + sizeof(char), which makes 3*4 + 1 = 13 bytes.
sysvshm_chunk_head has 8*sizeof(char) + 4*sizeof(long), which makes 8*1 + 4*4 = 24 bytes.
Now the first part from sysvshm.c is part of the code that gets executed when we call shm_attach in PHP. It initializes the memory segment by writing a header structure - the one defined as sysvshm_chunk_head that we already talked about - if it's not there already. This needs the 24 bytes we calculated - the same 24 bytes I gave in the formula right at the beginning.
The second part is the function that actually inserts a variable into the shared memory. It gets called by another function, but I skipped that one, as it's not that useful here. Basically, it receives the shared memory header structure, which includes the addresses of the start and end of the data inside the memory segment. It then gets a long with the variable key you used to store the variable, a char* (well, similar to a string, but the C version) with the already-serialized data, and the length of that data (for whatever reason - it could calculate that on its own, but anyway).
For each variable, a header (the structure defined as sysvshm_chunk we looked at) plus the actual data is now written into the memory. It is aligned to long, however, for easier memory management (that means: its size is always rounded up to the next multiple of sizeof(long), which again is 4 bytes on most systems).
Now here it becomes a little strange. According to the C code we're looking at, (ceil((strlen(serialize($data)) + 13 - 1) / 4) * 4) should work (the 13 in there is sizeof(sysvshm_chunk)). But it doesn't: it always yields 4 bytes less than we actually need, and I couldn't find those four bytes. I assume that the length of the serialized data (len) is already aligned, but I didn't look into the source for that. The char comes last in the C structure definition, and char is aligned on full bytes and nothing more, so that shouldn't cause those 4 additional bytes either - but if I'm wrong about how C aligns these, that could be the reason, too. Anyway, I aligned the data and the header individually in my formula, and it worked (the aligned header always has 16 bytes - that's the 16 in my formula - and the data length gets aligned by that divide-round-multiply thingy). But, technically, the formula could also be
$dataLength = (ceil((strlen(serialize($data)) + 13 - 1) / 4) * 4) + 4;
It yields the same results, however, if I just missed those 4 bytes somewhere else. I have no system running a PHP version that was compiled with 64-bit longs, so I cannot verify which one is correct.
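In fact, for 4-byte longs the two variants are algebraically identical: 13 - 1 = 12 is a multiple of 4, so ceil((p + 12) / 4) = ceil(p / 4) + 3, and both expressions reduce to ceil(p / 4) * 4 + 16. A quick check:
// Both formula variants agree for every payload length $p.
for ($p = 0; $p <= 64; $p++) {
    $a = (int) (ceil($p / 4) * 4) + 16;           // header and data aligned separately
    $b = (int) (ceil(($p + 13 - 1) / 4) * 4) + 4; // combined C-style alignment, plus 4
    assert($a === $b);
}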
tl;dr: problem solved, comments welcome, if you got any additional questions, now is the time.

Which is preferable: sha1_file(f) or sha1(file_get_contents(f))?

I want to create a hash of a file whose size is at least 5 MB and can extend to 1-2 GB. Now a tough choice arises between these two methods, although they produce exactly the same result.
Method 1: sha1_file($file)
Method 2: sha1(file_get_contents($file))
I have tried it with 10 MB, but there is not much difference in performance.
What about at a larger data scale - which is the better way to go?
Use the highest-level form offered unless there is a compelling reason otherwise.
In this case, the correct choice is sha1_file, because it is a higher-level function that only works with files. This 'restriction' allows it to take advantage of the fact that the file/source can be processed as a stream [1]: only a small part of the file is ever read into memory at a time.
The second approach guarantees that 5 MB-2 GB of memory (the size of the file) is used, as file_get_contents reads everything into memory before the hash is generated. As the size of the files increases and/or system resources become limited, this can have a very detrimental effect on performance.
[1] The source for sha1_file can be found on GitHub. Here is an extract showing only the lines relevant to stream processing:
PHP_FUNCTION(sha1_file)
{
    stream = php_stream_open_wrapper(arg, "rb", REPORT_ERRORS, NULL);
    PHP_SHA1Init(&context);
    while ((n = php_stream_read(stream, buf, sizeof(buf))) > 0) {
        PHP_SHA1Update(&context, buf, n);
    }
    PHP_SHA1Final(digest, &context);
    php_stream_close(stream);
}
By using higher-level functions, the onus of a suitable implementation is placed on the developers of the library. In this case it allowed the use of a scaling stream implementation.
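If you ever need the same constant-memory behaviour for a hash operation that has no *_file variant, PHP's incremental hashing API gives it to you in userland. A minimal sketch of the same streaming idea (the userland equivalent, not the engine's actual implementation):
$ctx = hash_init('sha1');
$fh = fopen($file, 'rb');
while (!feof($fh)) {
    hash_update($ctx, fread($fh, 8192)); // hash one chunk at a time
}
fclose($fh);
$digest = hash_final($ctx); // same result as sha1_file($file)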

Searching for hex string in a file in php?

I'm currently using the following two methods in my class to get the job done:
function xseek($h, $pos) {
    rewind($h);
    if ($pos > 0)
        fread($h, $pos);
}

function find($str) {
    return $this->startingindex($this->name, $str);
}

function startingindex($a, $b) {
    $lim = 1 + filesize($a) - strlen($b) / 2;
    $h = fopen($a, "rb");
    rewind($h);
    for ($i = 0; $i < $lim; $i++) {
        $this->xseek($h, $i);
        if ($b == strtoupper(bin2hex(fread($h, strlen($b) / 2)))) {
            fclose($h);
            return $i;
        }
    }
    fclose($h);
    return -1;
}
I realize this is quite inefficient, especially for PHP, but I'm not allowed any other language on my hosting plan.
I ran a couple of tests, and when the hex string is towards the beginning of the file, it runs quickly and returns the offset. When the hex string isn't found, however, the page hangs for a while. This kills me inside, because the last time I tested with PHP and had hanging pages, my web host shut my site down for 24 hours due to too much CPU time.
Is there a better way to accomplish this (finding a hex string's offset in a file)? Are there certain aspects of it that could be improved to speed up execution?
I would read the entire contents of the file into one hex string and use strrpos, but I was getting errors about the maximum memory being exceeded. Would it be a better method if I chopped the file up and searched large pieces with strrpos?
edit:
To specify, I'm dealing with a settings file for a game. The settings and their values are in a block where there is a 32-bit int before the setting, then the setting, a 32-bit int before the value, and then the value. Both ints represent the lengths of the following strings. For example, if the setting was "test" and the value was "0", it would look like (in hex): 00000004746573740000000130. Now that you mention it, this does seem like a bad way to go about it. What would you recommend?
edit 2:
I tried a file that was below the maximum memory I'm allowed and tried strrpos, but it was very much slower than the way I've been trying.
edit 3: in reply to Charles:
What's unknown is the length of the settings block and where it starts. What I do know is what the first and last settings USUALLY are. I've been using these searching methods to find the location of the first and last setting and determine the length of the settings block. I also know where the parent block starts. The settings block is generally no more than 50 bytes into its parent, so I could start the search for the first setting there and limit how far it will search. The problem is that I also need to find the last setting. The length of the settings block is variable and could be any length. I could read the file the way I assume the game does, by reading the size of the setting, reading the setting, reading the size of the value, reading the value, etc. until I reached a byte with value -1, or FF in hex. Would a combination of limiting the search for the first setting and reading the settings properly make this much more efficient?
You have a lot of garbage code. For example, this function does nearly nothing:
function xseek($h, $pos) {
    rewind($h);
    if ($pos > 0)
        fread($h, $pos);
}
because it reads from the beginning of the file every time. Furthermore, why do you need to read something if you are not returning it? Maybe you were looking for fseek()?
If you need to find a hex string in a binary file, it may be better to use something like this: http://pastebin.com/fpDBdsvV (tell me if there are any bugs/problems).
But if you are parsing a game's settings file, I'd advise you to use fseek(), fread() and unpack() to seek to the place where the setting is, read a portion of bytes and unpack it into PHP's variable types.
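To illustrate that last suggestion, here is a hypothetical sketch of reading the length-prefixed block described in the question. The function name, the $blockOffset parameter and the big-endian 'N' format are my assumptions (the example 00000004... in the question suggests big-endian lengths), as is treating a single FF byte as the terminator:
function read_settings($path, $blockOffset)
{
    $h = fopen($path, 'rb');
    fseek($h, $blockOffset);
    $settings = [];
    while (($first = fread($h, 1)) !== false && $first !== '' && $first !== "\xFF") {
        // Rebuild the 32-bit big-endian length from the byte already consumed.
        $nameLen = unpack('N', $first . fread($h, 3))[1];
        $name    = fread($h, $nameLen);
        $valLen  = unpack('N', fread($h, 4))[1];
        $settings[$name] = fread($h, $valLen);
    }
    fclose($h);
    return $settings;
}
Reading the block the way the game presumably does means neither the first nor the last setting has to be searched for at all.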

Can I use file_get_contents() to compare two files?

I want to synchronize two directories. And I use
file_get_contents($source) === file_get_contents($dest)
to compare the two files. Is there any problem with doing this?
I would rather do something like this:
function files_are_equal($a, $b)
{
    // Check if filesize is different
    if (filesize($a) !== filesize($b))
        return false;

    // Check if content is different
    $ah = fopen($a, 'rb');
    $bh = fopen($b, 'rb');

    $result = true;
    while (!feof($ah)) {
        if (fread($ah, 8192) != fread($bh, 8192)) {
            $result = false;
            break;
        }
    }

    fclose($ah);
    fclose($bh);

    return $result;
}
This checks if the filesize is the same, and if it is it goes through the file step by step.
Checking the modified time can be quick in some cases, but it doesn't really tell you anything other than that the files have been modified at different times. They still might have the same content.
Using sha1 or md5 might be a good idea, but this requires going through the whole file to create that hash. If the hash is something that could be stored and reused later, then it's probably a different story, but yeah...
Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing hashes will be very fast.
You can also consider other methods like comparing filemtime or filesize, but sha1_file() will give you reliable results even if just one bit has changed.
Memory: e.g. you have a 32 MB memory limit and the files are 20 MB each - an unrecoverable fatal error while trying to allocate memory. This can be solved by comparing the files in smaller parts.
Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster (if you want to be 110% sure, you can compare the files byte-by-byte when the hashes match, but hashing already rules out all the cases where the content - and therefore the hash - differs, which is the vast majority of cases).
Efficiency: do some preliminary checks - e.g. there's no point comparing two files if their sizes differ.
This will work, but it is inherently less efficient than calculating a checksum for both files and comparing those. Good candidates for checksum algorithms are SHA1 and MD5.
http://php.net/sha1_file
http://php.net/md5_file
if (sha1_file($source) == sha1_file($dest)) {
    /* ... */
}
Seems a bit heavy. This will load both files completely as strings and then compare.
I think you might be better off opening both files manually and ticking through them, perhaps just doing a filesize check first.
There isn't anything wrong with what you are doing here, except that it is a little inefficient. By getting the full contents of each file and comparing them, especially with larger files or binary data, you may run into problems.
I would take a look at filemtime (last modified) and filesize, and run some tests to see if that works for you. It should be all you need at a fraction of the computational cost.
Check first for the obvious:
Compare size
Compare file type (mime-type).
Compare content.
(Add comparison of date, file name and other metadata to this obvious list if those are also supposed to match.)
When comparing content, hashing doesn't sound very efficient, as @Oli says in his comment. If the files are different, they will most likely already differ near the beginning. Calculating a hash of two 50 MB files and then comparing the hashes sounds like a waste of time if the second bit already differs...
Check this post on php.net. It looks very similar to @Svish's answer, but it also compares the file mime-type. A smart addition if you ask me.
Something I noticed is that there is a lack of the pairwise factor. In other words: to use the filesize() check, you would first have to check every file against all of the other files. Why? What if the first file and the second file are different sizes, but the third file is the same size as the first?
So first you need to get a list of all of the files you are going to work with. If you want to do the filesize type of thing, then use the complete path/filename as the key of an array and store the filesize() information as the value. Then you sort the array so all files of the same size are lined up, and THEN you can check file sizes. However, this does not mean they really are the same - only that they are the same size.
Then you need to do something like the sha1_file() call and, as above, make an array where the path/filenames are the keys and the returned hashes are the values. Sort those, and then just do a simple walk through the array, storing the sha1_file() value to test against. So is A == B? Yes. Do any additional tests, then get rid of the SECOND file and continue.
Why am I commenting? I'm working on this same problem and I just found out my program did not work correctly. So now I'm going to go correct it using the sha1_file() function. :-)
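A sketch of that two-pass idea (find_duplicates is my name for it; it groups by filesize() first so sha1_file() only runs on files that could possibly match):
function find_duplicates(array $paths)
{
    // Pass 1: group by size; a unique size means unique content.
    $bySize = [];
    foreach ($paths as $p) {
        $bySize[filesize($p)][] = $p;
    }

    // Pass 2: hash only the groups with more than one member.
    $byHash = [];
    foreach ($bySize as $group) {
        if (count($group) < 2) continue;
        foreach ($group as $p) {
            $byHash[sha1_file($p)][] = $p;
        }
    }
    return array_filter($byHash, function ($g) { return count($g) > 1; });
}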
