Randomly appearing gzip headers - php

I have a long running script in a shared hosting environment that outputs a bunch of XML
Sometimes (only sometimes) a random GZIP header will appear in my output, and the output will be terminated.
For instance
0000000: 3c44 4553 435f 4c4f 4e47 3e3c 215b 4344 <DESC_LONG><![CD
0000010: 4154 415b 1fc2 8b08 0000 0000 0000 03c3 ATA[............
0000020: b3c3 8b57 c388 c38c 2b28 2d51 48c3 8bc3 ...W....+(-QH...
0000030: 8c49 5528 2e48 4dc3 8e4c c38b 4c4d c391 .IU(.HM..L..LM..
0000040: c3a3 0200 c291 4464 c383 1900 0000 0d0a ......Dd........
or
0000000: 3c2f 5052 4f44 5543 543e 0d0a 1fc2 8b08 </PRODUCT>......
0000010: 0000 0000 0000 03c3 b3c3 8b57 c388 c38c ...........W....
0000020: 2b28 2d51 48c3 8bc3 8c49 5528 2e48 4dc3 +(-QH....IU(.HM.
0000030: 8e4c c38b 4c4d c391 c3a3 0200 c291 4464 .L..LM........Dd
0000040: c383 1900 0000 0d0a ........
or
0000000: 3c4d 4544 4941 5f55 524c 3e2f 696d 6167 <MEDIA_URL>/imag
0000010: 6573 2f69 6d70 6f72 7465 642f 7374 6f63 es/imported/stoc
0000020: 6b5f 7072 6f64 3235 3339 365f 696d 6167 k_prod25396_imag
0000030: 655f 3531 3737 3439 3436 302e 6a70 673c e_517749460.jpg<
0000040: 2f4d 4544 4941 5f55 1fc2 8b08 0000 0000 /MEDIA_U........
0000050: 0000 03c3 b3c3 8b57 c388 c38c 2b28 2d51 .......W....+(-Q
0000060: 48c3 8bc3 8c49 5528 2e48 4dc3 8e4c c38b H....IU(.HM..L..
0000070: 4c4d c391 c3a3 0200 c291 4464 c383 1900 LM........Dd....
0000080: 0000 0d0a ....
The switch to GZIP does not seem to hit at any particular time og byte count, it can be after 1MB of data or after 15MB
The compiled blade template at the corresponding lines are as follows
<DESC_LONG><![CDATA[<?php echo $product->display_name; ?>]]></DESC_LONG>
-
</PRICES>
</PRODUCT>
<?php foreach($product->models()->get() as $model): ?>
-
<MEDIA_URL>/images/imported/<?php echo $picture->local_name; ?></MEDIA_URL>
I am at my wits end, I have tried the following:
Disable gzip on the server.
Run while(ob_get_level()){ ob_end_clean(); } before running the script
In .htaccess i have tried SetEnv no-gzip 1, SetEnv no-gzip dont-vary and various permutations thereof.
When I visit other pages, no gzip encoding or headers appear, so I'm thinking this is something with the output size or output buffer.

Did you finally find out where these headers come from? I mean apache or php?
You can simulate xml generator scipt with something like:
echo file_get_contents('your_good_test.xml');
If you won't see any headers, I suggest to debug your xml generator. You can try to call header_remove(); before output.
If you see headers, you have to debug your web server. Try to disable gzip in apache by rewrite rule:
`RewriteRule . - [E=no-gzip:1]`
Whenever you have any proxy or balancer (nginx, squid, haproxy) you automaticly get one more firing line.

your gziping is not related to server output that returns your main xml body. Otherwise the whole xml would be compressed.
These methods return GZIP sometimes because the source where these take the items is set to support gzip and are not asked properly.
$product->display_name
$product->models()->get()
$picture->local_name
Look inside these.
- Check web calls for all places where headers are set.
- temporally disable compression for database connection if any.
Add CDATA tags for all places where binary data could be returned to avoid main xml body building termination. Wait for an xml with bin data, Save bin data, unzip it and look what is inside. :-)

This is more of a set of comments, but it is too long for the comment box.
First, this very likely NOT an output buffer issue. Even though <![CDATA[ and ]]> is not within PHP tags this doesn't mean that it doesn't pass through PHP's output buffer. To be clear, anything within a .php file will be placed in the PHP output buffer. The content within a .php file (including static content) is buffered outside of Apache and then passed back to Apache through this buffer when the script is finished. This means that your problem must lie within the code itself, which is a shot in the dark to solve without viewing the code.
My suggestions:
1) do a search within the script to find any instances of gz functions (gzcompress, gzdeflate, gzdecode, etc). I have seen scripts compress content if it was greater than a specific size and then decompress the content on the fly when taken from the DB. If that is the case you are likely dealing with a faulty comparison operation. In short, the logic within compression and decompression conditions is slightly off so it is failing to decompress SOME of the content.
2) do a search within the script to see how this data is fetched. Is it all from a database? Does any of it come from a stream? Is any of it fetched remotely? These questions might not directly lead to an answer but are vital. It can safely be assumed that these variables are being set with data already compressed when it shouldn't be. It requires knowing where/why/how the compression is taking place in order to answer why it is not being decompressed.
3) It matters greatly that it is working as expected on one system but not the other. The only times I have seen this happen was always due to differences in configuration. What operating system was your local machine using? What's the difference in local database (if any), what extensions might be missing/present on one or the other, possibly causing a function to fall back on different procedure on the two different machines.
EDIT:
Also, and this is a small chance, but are you dealing with data that originated from an SQL dump from a different server? You said it works on your local host but not on a different host, so we know your dealing with two machines. Was there a third at some point? If so, it might have been compressed using a mismatched version/form of compression, or might be an issue with encoding.

Related

Piecemeal bzcompression for large files in PHP

Creating bzip2 archived data in PHP is very easy thanks to its implementation in bzcompress. In my present application I cannot in all reason simply read the input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data will yield the same result as when compressing the whole file in one single swoop. I mean something along the lines of
$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);
I tried out a piecemeal bzcompression using the routines shown below
function makeBZFile($infile,$outfile)
{
$fp = fopen($infile,'r');
$bz = bzopen($outfile,'w');
while (!feof($fp))
{
$bytes = fread($fp,10240);
bzwrite($bz,$bytes);
}
bzclose($bz);
fclose($fp);
}
function unmakeBZFile($infile,$outfile)
{
$bz = bzopen($infile,'r');
while (!feof($bz))
{
$str = bzread($bz,10240);
file_put_contents($outfile,$str,FILE_APPEND);
}
}
set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd');
To test this code I did two things
I used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database - which is what I need to do eventually.
I created a 50Mb filled with random data dd if=/dev/urandom of='/tmp.test.rnd bs=50M count=1
In both cases I performed a diff original.file decompressed.file and found that the two were identical.
All very nice but it is not clear to me why this is working. The PHP docs state that bzread(bzpointer,length) reads a maximum length bytes of UNCOMPRESSED data. If my code below is woring it is because I am forcing the bzwite and bzread size to 10240 bytes.
What I cannot see is just how bzread knows how to fetch lenth bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see tht there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.
I suspect there is a gap in my understanding of how this works - or else the fact that my code below appears to perform a correct piecemeal compression is purely accidental.
I'd much appreciate a few explanations here.
To understand how the decompression get the length of bytes you have to understand first the compression. It seems that you don't know any thing about compression algorigthim.
BZIP2
Crucial algorithm of BZIP2 is the Burrows Wheeler transformation (BWT), that converts the original data into a suitable form for following coding. The current version applies a Huffman code. Compression algorithm processes the data in blocks totally independent from each block. Block sizes can be set in a range from 1-9 (100,000 - 900,000 bytes).
BZIP2 Data Structure
The first two character of compressed string start with letter 'BZ' and thereafter 1 byte for algorigthim used. Thereafter identification of the block size immediately follows, being valid for the entire file (h1, h2, h3 to h9). The parameter indicates the block size in units from 1-9 (100,000 - 900,000 bytes).
Actual original data are stored in blocks according to the selected size and will be protected individually with a CRC32 checksum. Additionally a 48 bit identifier introduces each block. This block structure allows a partial reconstruction of damaged files.
GZIP/BZIP
Gzip and bzip2 are functionally equivalent. One advantage of GZIP is that it can compress a stream, a sequence where you can't look behind. This makes it the official compressor of http streams. GZZIP DEFLATE RFC 1951 Compressed Data Format Specification and GUNZIP RFC 1952 File Format Specification are published documents.
GIP explained

File reading from PHP using python script

Okay, this is driving me crazy. I have a small file. Here is the dropbox link https://www.dropbox.com/s/74nde57f07jj0zj/transcript.txt?dl=0.
If I try to read the content of the file using python f.read(), I can easily read it. But, if I try to run the same python program using php shell_exec(), the file read fails. This is the error I get.
Traceback (most recent call last):
File "/var/www/python_code.py", line 2, in <module>
transcript = f.read()
File "/opt/anaconda/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 107: ordinal not in range(128)
I have checked all the permission issues and there is no problem with that.
Can anyone kindly shed some light?
Here is my python code.
f = open('./transcript/transcript.txt', 'r')
transcript = f.read()
print(transcript)
Here is my PHP code.
$output = shell_exec("/opt/anaconda/bin/python /var/www/python_code.py");
Thank you!
EDIT: I think the problem is in the file content. If I replace the content with simple 'I eat rice', then I can read the content from php. But the current content cannot be read. Still don't know why.
The problem appears is that your file contains non-ASCII characters, but you're trying to read it as ASCII text.
Either it is text, but is in some encoding or other that you haven't told us (probably UTF-8, Latin-1, or cp1252, but there are countless other possibilities), or it's not text at all, but rather arbitrary binary data.
When you open a text file without specifying an encoding, Python has to guess. When you're running from inside the terminal or whatever IDE you use, presumably, it's guessing the same encoding that you used in creating the file, and you're getting lucky. But when you're running from PHP, Python doesn't have as much information, so it's just guessing ASCII, which means it fails to read the file because the file has bytes that aren't valid as ASCII.
If you want to understand how Python guesses, see the docs for open, but briefly: it calls locale.getpreferredencoding(), which, at least on non-Windows platforms, reads it from the locale settings in the environment. On a typical linux system that's not new enough to be based on systemd but not too old, the user's shell will be set up for a UTF-8 locale, but services will be set up for C locale. If all of that makes sense to you, you may see a way to work around your problem. If it all sounds like gobbledegook, just ignore it.
If the file is meant to be text, then the right solution is to just pass the encoding to the open call. For example, if the file is UTF-8, do this:
f = open('./transcript/transcript.txt', 'r', encoding='utf-8')
Then Python doesn't have to guess.
If, on the other hand, the file is arbitrary binary data, then don't open it in text mode:
f = open('./transcript/transcript.txt', 'rb')
In this case, of course, you'll get bytes instead of str every time you read from it, and print is just going to print something ugly like b'aq\x9bz' that makes no sense; you'll have to figure out what you actually want to do with the bytes instead of printing them as a bytes.

Encode string in Android so PHP would be able to gzdecompress it?

How do I correctly compress a string, so PHP would be able to decompress?
I tried this:
public static byte[] compress(String string) throws IOException {
ByteArrayOutputStream os = new ByteArrayOutputStream(string.length());
DeflaterOutputStream gos = new DeflaterOutputStream(os);
// ALSO TRIED GZOutputStream, same results!
gos.write(string.getBytes());
gos.close();
byte[] compressed = os.toByteArray();
os.close();
return compressed;
}
But PHP does not recognize output as valid GZip compressed string...
The problem seems to be in some headers / footers being added by Android...
For example when I compress something word via PHP with gzcompress I got similar results as with Android, but not similar enough, so PHP could read it:
something (HEX DUMP):
Android: 1f8b08000000000000002bcecf4d2dc9c8cc4b0700fb31da0909000000
PHP: 789c2bcecf4d2dc9c8cc4b0700134703cf
The weirdest thing is that by changing GZOutputStream to DeflaterOutputStream it fixed the problem with something word, but the problem still appears with longer strings...
PS. Removing heading 10 characters from Android generated data does not help at all.
EDIT: I tried to decompress it in PHP with:
gzdecode() - this function does not exist in standard Debian PHP5
version.
gzdecompress() - does not work
And some functions to emulate gzdecode() from PHP site comments that don't really do much.
All above, with removing first 10 bytes and leaving them.
PS2. I tried every single solution from Stack Overflow, and other sources, and still nothing. It is not a duplicate.
EDIT2 (BINARY DUMP): Sample data generated with Android that can't be decomprssed by gzuncompress() or pseudo-gzdecode() functions from PHP.NET: data.compressed.
It supposed to be some JSON, after decompression.
The Android data that starts with 1f8b is a gzip stream. In php you use gzdecode() for that. gzencode() on php makes gzip streams.
The php data that starts with 789c is a zlib stream. You used gzcompress() to make that, and you would use gzuncompress() to decode it.
The compressed data contained within both of those streams, starting with 2bce is raw deflate data. You can use gzinflate() to decode that if you happened to make it somewhere, and you can use gzdeflate() to generate raw deflate.
Just to rant, gzencode(), gzcompress(), and gzdeflate() are some of the most misleading function names ever concocted, since only one of them is related to gzip yet all start with gz, and nothing in the name gzcompress() indicates zlib.
Update:
The "EDIT2" data is, for some reason, doubly compressed. It was compressed first to the zlib format, and then that zlib stream was compressed to the gzip format. (Though gzip couldn't compress the already compressed data, so it's a little bigger.)
You should repair the problem that made it doubly compressed. Or if you have no control over that, you can doubly decompress it, first stripping the gzip header using the RFC 1952 specification and then gzinflate() on the raw deflate data, and then using gzdecompress() on the result.

Files with same content differ in sizes

Here is the protocol:
1) I generate a text file online with PHP containing alphanumeric characters. Then I download it and note its size (from Properties menu).
2) I open the text file with Notepad++ and cut all the content in a new text file, then I save the new file (with the same name).
3) To my astonishment, even thought both files have the exact same text content, their size isn't the same!
--TEST 1--
Downloaded file: 1529 Ko
New copy file: 1594 Ko
--TEST 2--
Downloaded file: 52 Ko
New copy file: 54 Ko
So what? Why am I posting this here? Because the file in question is available to my users for download on my website, and they can use it to replace a file in a game's save. However, the game reacts to the new file by rejecting it, whilst the copied one (with the above protocol) works fine.
The only difference I see between both files is their size (slight difference as shown above) - but the content and the name is the same. Any idea why there is that size difference?
This will most likely be newlines that are converted between unix (1 byte) and windows (2 bytes).
As mentioned in the comments, it could also be encoding, but NotePad++ is pretty good at encoding. It's also unlikely to account for the difference.
You need to convert the "\r\n" to "\n" to get the smaller filesize. Here's a page I just found with a few options: http://darklaunch.com/2009/05/06/php-normalize-newlines-line-endings-crlf-cr-lf-unix-windows-mac
Another thingto watch for is a trailing "newline" which is not very obvious. Again, strip it out before doing your comparisson.
Is your client and server are different platforms ? Say linux and windows ? In that case there is a difference in the way new line characters are stored. This can cause size difference.
Another reason could be character encoding used but it is little less likely.

Converting a PDF to JPG with ImageMagick in PHP Gives Odd Letter Spacing

I am trying to convert a PDF to a JPG with a PHP exec() call, which looks like this:
convert page.pdf -resize 716x716 page.jpg
For some reason, the JPG comes out with janky text, despite the PDF looking just fine in Acrobat and Mac Preview. Here is the original PDF:
http://whit.info/dev/conversion/page.pdf
and here is the janktastic output:
http://whit.info/dev/conversion/page.jpg
The server is a LAMP stack with PHP 5 and ImageMagick 6.2.8.
Can you help this stumped Geek?
Thanks in advance,
Whit
ImageMagick is just going to call out to Ghostscript to convert this PDF to an image. If you run gs on the pdf, you get the same badly-spaced output.
I suspect Ghostscript isn't handling the PDF's embedded TrueType fonts very well. If you could change your output to either embed Type 1 fonts or use a "core" PostScript font, you'd get better results.
I suspect its an encoding/widths issue. Both are a tad off, though I can't put my finger on why.
Here are some suspects:
First
The text stream is defined in UTF-16 LE. charNULLcharNULL, using the normal string drawing command syntax:
(some text) Tj
There's a way to escape any old character value into a () string. You can also define strings in hex thusly:
<203245> Tj
Neither method are used, just the questionable inline nulls. That could cause an issue in GS if it's trying to work with pointers to char without lengths associated with them.
Second
The widths array is dumb. You can define widths in groups thusly:
[ 32 [450 525 500] 37 [600 250] 40 [0] ]
This defines
32: 450
33: 525
34: 500
37: 600
38: 250
40: 0
These fonts defines their consecutive widths in individual arrays. Not illegal, but definitely wasteful/stupid, and if GS were coded to EXPECT gaps between the arrays, it could induce a bug.
There's also some extremely fishy values in the array. 32 through 126 are defined consecutively, but then it starts jumping all over: ...126 [600] 8364 [500] 8216 [222] 402 [500] 8222 [389]. 8230 [1000] 8224 [444]... and then goes back to being consecutive from 160 to 255.
Just weird.
Third
I'm not even remotely sure, but the CIDToGIDMap stream contains an AWEFUL lot of nulls.
Bottom line
Those fonts are fishy. And I've never heard of "Bellflower Books" or "UFPDF 0.1"
That version number makes me cringe. It should make you cringe too.
Googleing for "UFPDF" I found this note from the author:
Note: I wrote UFPDF as an experiment, not as a finished product. If you have problems using it, don't bug me for support. Patches are welcome though, but I don't have much time to maintain this.
UFPDF is a PHP library that sits on top of FPDF. 0.1. Just run away.

Categories