How to determine if a string was compressed? - php

How can I determine whether a string was compressed with gzcompress (apart from comparing the sizes of the string before and after calling gzuncompress, or would that be the proper way of doing it)?

Preliminary note: if you send a request, you can immediately look into $http_response_header to see whether one of the items in the array is a variation of Content-Encoding: gzip. But this is not ideal!
There is a far better method.
Here is HOW TO...
Check if it's GZIP. Like a BOSS!
According to the GZIP RFC (RFC 1952), the header of gzip content looks like this:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
ID1 and ID2 identify the content as GZIP, and CM states the compression method. The value 8 means deflate (ZLIB_ENCODING_DEFLATE), which is what GZIP customarily uses on web servers.
Oh! And they have fixed values:
The value of ID1 is "\x1f"
The value of ID2 is "\x8b"
The value of CM is "\x08" (or just 8...)
Almost there:
`$is_gzip = 0 === mb_strpos($mystery_string , "\x1f" . "\x8b" . "\x08");`
Working example
<?php
/** @link https://gist.github.com/eladkarako/d8f3addf4e3be92bae96#file-checking_gzip_like_a_boss-php */
date_default_timezone_set("Asia/Jerusalem");
while (ob_get_level() > 0) ob_end_flush();
mb_language("uni");
#mb_internal_encoding('UTF-8');
setlocale(LC_ALL, 'en_US.UTF-8');
header('Time-Zone: Asia/Jerusalem');
header('Charset: UTF-8');
// Content-Encoding names a compression coding (such as gzip), not a charset, so it is not set to UTF-8 here.
header('Content-Type: text/plain; charset=UTF-8');
header('Access-Control-Allow-Origin: *');

function get($url, $cookie = '') {
    // @ suppresses the warning that file_get_contents() emits when the request fails.
    $html = @file_get_contents($url, false, stream_context_create([
        'http' => [
            'method' => "GET",
            'header' => implode("\r\n", [''
                , 'Pragma: no-cache'
                , 'Cache-Control: no-cache'
                , 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2310.0 Safari/537.36'
                , 'DNT: 1'
                , 'Accept-Language: en-US,en;q=0.8'
                , 'Accept: text/plain'
                , 'X-Forwarded-For: ' . implode(', ', array_unique(array_filter(array_map(function ($item) { return filter_input(INPUT_SERVER, $item, FILTER_SANITIZE_SPECIAL_CHARS); }, ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'HTTP_CLIENT_IP', 'SERVER_ADDR', 'REMOTE_ADDR']), function ($item) { return null !== $item; })))
                , 'Referer: http://eladkarako.com'
                , 'Connection: close'
                , 'Cookie: ' . $cookie
                , 'Accept-Encoding: gzip'
            ])
        ]]));
    $is_gzip = 0 === mb_strpos($html, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");
    // zlib_decode() detects the gzip header by itself; its second argument is a maximum output length, not an encoding.
    return $is_gzip ? zlib_decode($html) : $html;
}

$html = get('http://www.pogdesign.co.uk/cat/');
echo $html;
What do we see here that is worth mentioning?
Start by initializing the PHP engine to use UTF-8 (since we don't really know whether the web server will return GZIP content).
Providing the header Accept-Encoding: gzip tells the web server that it may output GZIP content.
Discovering GZIP content (you should use the multi-byte functions with ASCII encoding).
Finally, returning the plain output is easy using the ZLIB functions.
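The same check can also be done on raw bytes. The following is only a minimal sketch, assuming the zlib extension is loaded; the function names looks_like_gzip and maybe_gunzip are made up for illustration and are not part of the gist. It compares the first three bytes with strncmp() and decodes with gzdecode(), which handles the gzip container.

```php
<?php
// Hypothetical helper names; assumes the zlib extension is available.

// True when $data starts with the gzip magic bytes (ID1, ID2) followed by CM = 8 (deflate).
function looks_like_gzip(string $data): bool
{
    return strncmp($data, "\x1f\x8b\x08", 3) === 0;
}

// Return the decoded payload if $data is gzip, otherwise return $data unchanged.
function maybe_gunzip(string $data): string
{
    if (!looks_like_gzip($data)) {
        return $data;
    }
    $decoded = @gzdecode($data); // @ hides the warning emitted on corrupt input
    // The header check is only a hint: fall back to the raw bytes if decoding fails.
    return $decoded === false ? $data : $decoded;
}
```

Note that this only recognizes gzip streams (for example the output of gzencode()); the output of gzcompress() starts with a different header.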

A string and a compressed string are both simply sequences of bytes. You cannot really distinguish one sequence of bytes from another sequence of bytes. You should know whether a blob of bytes represents a compressed format or not from accompanying metadata.
If you really need to guess programmatically, you have several things you can try:
Try to uncompress the string and see if the uncompress operation succeeds. If it fails, the bytes probably did not represent a compressed string.
Try to check for obvious "weird" bytes like anything before 0x20. Those bytes aren't typically used in regular text. There's no real guarantee that they occur in a compressed string though.
Use mb_check_encoding to see whether a string is valid in the encoding you suspect it to be in. If it isn't, it's probably compressed (or you checked for the wrong encoding). With the caveat that virtually any byte sequence is valid in virtually every single-byte encoding, so this'll only work for multi-byte encodings.
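The first two guesses could be sketched roughly like this. It assumes the zlib extension is loaded, and both function names are hypothetical. Neither check is a guarantee, for the reasons given above.

```php
<?php
// Hypothetical helpers; assumes the zlib extension is available.

// Returns the uncompressed string if $bytes is a valid zlib stream
// (as produced by gzcompress()), or null if it is not.
function try_gzuncompress(string $bytes): ?string
{
    $result = @gzuncompress($bytes); // @ hides the warning emitted on invalid data
    return $result === false ? null : $result;
}

// Rough heuristic: does the string contain control bytes below 0x20
// other than tab, line feed and carriage return?
function has_control_bytes(string $bytes): bool
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $bytes) === 1;
}
```

Returning null rather than false from try_gzuncompress() is just a design choice here, so that an empty uncompressed string cannot be confused with a failure.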

This works fine for me:
if (@gzuncompress($_xml) !== false) {
    // gzipped string
}

You can simply try gzuncompress() on the data as noted by @DiDiegodaFonseca. If it fails, it was not made by gzcompress(), or it was not faithfully transmitted.
If you really want to, you can check the first two bytes for a zlib header (not a gzip header, as incorrectly suggested in the accepted answer). gzcompress() produces a zlib stream, not a gzip stream. gzencode() is what produces a gzip stream. gzdeflate() produces a raw deflate stream.
RFC 1950 describes the zlib header. It is two bytes, and those two bytes taken as a big-endian 16-bit unsigned integer must be a multiple of 31. In addition to checking that, you can check that the low four bits of the first byte are 8 (binary 1000), and that the high bit of the first byte is zero.
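As a minimal sketch of those three header checks (the function name looks_like_zlib is made up for illustration):

```php
<?php
// Hypothetical helper illustrating the RFC 1950 header checks described above.
function looks_like_zlib(string $data): bool
{
    if (strlen($data) < 2) {
        return false;
    }
    $cmf = ord($data[0]); // first byte: compression method and info
    $flg = ord($data[1]); // second byte: flags, including the check bits

    return (($cmf << 8) | $flg) % 31 === 0  // big-endian 16-bit value is a multiple of 31
        && ($cmf & 0x0F) === 8              // low four bits: method 8 (deflate)
        && ($cmf & 0x80) === 0;             // high bit of the first byte is zero
}
```

A passing check only says the first two bytes are a plausible zlib header; calling gzuncompress() is still the way to know whether the whole stream is valid.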

Related

File get contents retrieve question diamonds characters [duplicate]

I've created my crawler using file_get_contents function but when I crawl some sites I'm getting this character: � when I should get this: é. Some ideas of what is happening?
This is for a windows vps server running php.
I've already tried:
file_get_contents() Breaks Up UTF-8 Characters
How fix UTF-8 Characters in PHP file_get_contents()
How to get file content with a proper utf-8 encoding using file_get_contents?
But all these things didn't work.
PS: The file where I'm running this code is saved as UTF-8.
$url = "https://play.google.com/books/reader?id=4rqYDwAAQBAJ&hl=en_US";
$options = array('http'=>array('method'=>"GET", 'header'=>"Accept-language: en-US,en;q=0.8\r\n" ."Accept-Charset: UTF-8, *;q=0"));
$context = stream_context_create($options);
$profile = file_get_contents($url,false,$context);
echo $profile;
I'm expecting to get accented characters and not this diamond character �.
Google is ignoring your Accept-Charset header because you're not specifying a User-Agent, no idea why. It took me one hour to figure it out. Adjust your options as follows:
$options = [
    "http" => [
        "method" => "GET",
        "header" => "Accept-language: en-US,en;q=0.8\r\n" .
            "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0\r\n" .
            "Accept-Charset: UTF-8, *;q=0"
    ]
];
Adding the "User-Agent" header seems to do the trick. Google is probably returning a different encoding if not.

GuzzleHttp request sends garbled characters

I use GuzzleHTTP 6.0 to get data from the API server. For some reason the requests that the API server receives are not UTF-8 encoded: the characters ü, ö, ä, ß arrive garbled.
My default system and database are UTF-8 encoded.
I set debug to true in the RequestOptions; this is the output:
User-Agent: GuzzleHttp/6.2.1 curl/7.47.0 PHP/7.0.22-0ubunut0.16.04.1
Content-type: text/xml;charset="UTF-8"
Accept: text/xml"
Cache-Control: no-cache
Content-Length: 2175
* upload completely sent off: 2175 out of 2175 bytes
<HTTP/1.1 200 OK
<Server:Apache:Coyote/1.1
<Content-Type: text/xml; charset=utf-8
<Transfer-Encoding: chunked
<Date: Thu, 23 Nov 2017 9:34:12 GMT
<* Connection #5 to host www.abcdef.com left intact
I have explicitly set the charset in the headers to UTF-8:
$headers = array(
'Content-type' => 'text/xml;charset="utf-8"',
'Accept' => 'text/xml',
'Content-length' => strlen($requestBody),
);
I also tried testing with the mb_detect_encoding() function:
mb_detect_encoding($requestBody,'UTF-8',true); // returns UTF-8
Any further ideas on how to debug this issue?
Content-Length must contain the number of bytes, not the number of characters. That could be the reason if you use mbstring.func_overload, because strlen() then counts characters instead of bytes. Try omitting this header instead of setting it manually; Guzzle will then set it correctly for you.
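To illustrate the difference, here is a small sketch with a made-up request body (the XML snippet is just an example, and the mbstring extension is assumed to be loaded). Without func_overload, strlen() counts bytes, while mb_strlen() counts characters:

```php
<?php
// "ü" (U+00FC) is one character but two bytes in UTF-8.
$requestBody = "<name>M\u{00FC}ller</name>";

$bytes = strlen($requestBody);             // byte count, which is what Content-Length needs
$chars = mb_strlen($requestBody, 'UTF-8'); // character count

printf("bytes: %d, characters: %d\n", $bytes, $chars);
```

For this body the byte count is 20 and the character count is 19, so a Content-Length based on characters would cut the request off one byte early.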

How output data with gzip header and not deflate header

I'm testing compression of html files.
I have 2 HTML files:
Not compressed HTML file ( content will change )
Compressed HTML file .gz ( content won't change )
Using PHP, I'm trying to output the compressed file data, and this is where the trouble begins.
Test with the already compressed HTML file:
//header gzip
$data = getfile($name); // custom function wrapping fopen and fread
header('Content-Encoding: gzip'); // header works perfectly
echo $data; // output OK
//header deflate
$data = getfile($name); // custom function wrapping fopen and fread
header('Content-Encoding: deflate'); // file was gzip compressed so an error is normal
echo $data; // Firefox: Content Encoding Error
Test with the uncompressed HTML file:
//header gzip using gzcompress();
$data = gzcompress(getfile($name), 9);
header('Content-Encoding: gzip'); // somehow header is bad
echo $data; // Firefox: Content Encoding Error, but IE 9 output OK
But here we get magic:
//header deflate using gzcompress();
$data = gzcompress(getfile($name), 9);
header('Content-Encoding: deflate'); // header works perfectly
echo $data; // Firefox output OK, but IE output ERROR
How can I fix this crazy thing and send all data as gzip with a gzip header, not deflate? Maybe someone has an idea of what is wrong?
Thank you
The HTTP spec. (RFC2616) says:
gzip
An encoding format produced by the file compression program
"gzip" (GNU zip) as described in RFC 1952 [25].
compress
The encoding format produced by the common UNIX file compression
program "compress".
deflate
The "zlib" format defined in RFC 1950 [31] in combination with
the "deflate" compression mechanism described in RFC 1951 [29].
The PHP docs say:
gzcompress
For details on the ZLIB compression algorithm see the document
"ZLIB Compressed Data Format Specification version 3.3"
(RFC 1950).
gzdeflate
For details on the DEFLATE compression algorithm see the document
"DEFLATE Compressed Data Format Specification version 1.3"
(RFC 1951).
gzencode
For more information on the GZIP file format, see the document:
GZIP file format specification version 4.3 (RFC 1952).
From this, one can come to the conclusion that gzencode() must be used with gzip, and gzcompress() (with the DEFLATE encoding) must be used with deflate.
The first combination works for me. I haven't tried the second; don't know why it wouldn't work with IE. A URL might help to trouble-shoot that problem.
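The pairing can be sketched like this, assuming the zlib extension is loaded. The function name encode_for is hypothetical, and the sketch only covers the two encodings discussed here.

```php
<?php
// Hypothetical helper: compress $body in the format that matches an HTTP Content-Encoding token.
function encode_for(string $contentEncoding, string $body, int $level = 9): string
{
    switch ($contentEncoding) {
        case 'gzip':
            return gzencode($body, $level);   // gzip container, RFC 1952
        case 'deflate':
            return gzcompress($body, $level); // zlib container, RFC 1950
        default:
            throw new InvalidArgumentException("Unsupported encoding: $contentEncoding");
    }
}

// Usage in a script that sends a response (commented out here):
// header('Content-Encoding: gzip');
// echo encode_for('gzip', $html);
```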

php: gzdecode() insufficient memory

Sometimes when downloading the source of a webpage and trying to decode it, I get an error: gzdecode() insufficient memory. (The memory limit is 500M, and usage is far below that.)
I include headers with my curl output, those are separated from the content correctly, before decoding. Content encoding header of the pages is clearly gzip. I read on php.net that including a length argument could cause such a crash, but I do not use length argument with gzdecode.
So while seemingly everything should be fine, I still get the error. Last time I found it with this page: https://ahmia.fi/address/.
Is there perhaps something about HTTPS that I do not know about? My curl setting is \CURLOPT_SSL_VERIFYPEER => false.
Any help appreciated!
CURLOPT_ENCODING The contents of the "Accept-Encoding: " header. This enables decoding of the response. Supported encodings are "identity", "deflate", and "gzip". If an empty string, "", is set, a header containing all supported encoding types is sent.
Try this:
CURLOPT_ENCODING => ""
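In context, that option could be used roughly as follows. This is a sketch only: it assumes the curl extension is loaded, and the function name fetch_decoded is made up for illustration.

```php
<?php
// Hypothetical helper; assumes the curl extension is loaded.
function fetch_decoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        // Empty string: offer every encoding curl supports and let curl decode the body.
        CURLOPT_ENCODING       => '',
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}
```

With this option set, curl hands back the already decoded body, so it should not be passed to gzdecode() a second time.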

gzip and return as string

I want to compress some files with gzip in PHP.
It works as it should when the output is saved into a file. When the file is opened it looks like this
But not when the output is returned as a string. Then the opened file looks like this. Why is the tar file shown inside the gzip file?
public function compress(){
    if($this->stream){
        return gzencode($this->data, 9);
    }
    else{
        $gz = gzopen('test.tar.gz', 'w9');
        gzwrite($gz, $this->data);
        gzclose($gz);
    }
}
Headers sent with the string output to the browser:
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="'.$filename.'"');
This looks like a WinRAR file extension parsing issue.
In the first example your file is called .tar.gz and WinRAR knows how to handle both tar files and gz compression, so it is able to decompress the tar headers in memory and retrieve a list of files contained within.
In your second example the file is called .tar-19.gz, so WinRAR happily deals with the gz compression but has no idea what format tar-19 is supposed to be (it doesn't even try and guess from header heuristics)
I bet if you stream the file with a tar.gz extension it will open up just fine.
I am sure that you looked into http://php.net/manual/en/function.gzcompress.php
make sure that you have php version PHP 4 >= 4.0.1, PHP 5
string gzcompress ( string $data [, int $level = -1 [, int $encoding = ZLIB_ENCODING_DEFLATE ]] )
This function compresses the given string using the ZLIB data format.
For details on the ZLIB compression algorithm see the document "ZLIB Compressed Data Format Specification version 3.3" (RFC 1950).
<?php
$compressed = gzcompress('Compress me', 9);
echo $compressed;
?>
Note:
This is not the same as gzip compression, which includes some header
data. See gzencode() for gzip compression.
Parameters
data
The data to compress.
level
The level of compression. Can be given as 0 for no compression up to 9
for maximum compression.
If -1 is used, the default compression of the zlib library is used which is 6.
encoding
One of ZLIB_ENCODING_* constants.
gzencode — Create a gzip compressed string
<?php
$data = implode("", file("bigfile.txt"));
$gzdata = gzencode($data, 9);
$fp = fopen("bigfile.txt.gz", "w");
fwrite($fp, $gzdata);
fclose($fp);
?>
Have you tried passing "Content-type: application/x-gzip" headers when sending the file as a string?
It's possible Apache is re-running it through a gzip filter and that's causing issues.
