Reading the ZIP file central directory in PHP

I need to obtain the uncompressed filesize of a zip without storing the entire zip in memory (files range up to 30 GB+).
From research I understand this information can be obtained from the central directory that all zips have, and that it's stored at the end of the file.
In most scenarios I can find the central directory within the last 64 KiB of a file.
function findCentralDirectory($data) {
    // strpos() already scans forward from the given offset, so a single
    // call is enough; looping and incrementing the offset is redundant
    return strpos($data, "\x50\x4b\x01\x02");
}
function findEndCentralDirectory($data) {
    return strpos($data, "\x50\x4b\x05\x06");
}
$data = $this->extractLast64Kb($path);
$startOffset = $this->findCentralDirectory($data);
// relative to $startOffset, so it doubles as the length of the directory
$endOffset = $this->findEndCentralDirectory(substr($data, $startOffset));
$centralDirectory = substr($data, $startOffset, $endOffset);
This appears to work correctly as I'll get this in response:
PK?4>U
~��test.txtPK6*
(this zip in particular has one file, called test.txt, tested with multiple zips, all showing the correct list of files in this output)
I'm led to believe that the filesize is in the encoded part of this; however, being new to this kind of programming, I'm struggling to work out how I can "decode" this into an array I can sum.
Just looking to get this working for 32-bit zips before I worry about 64-bit.
Any help appreciated, thanks

I was surprised that there isn't a package out there that does this already.
The snippet below is the extract of the "central directory file header". Before it was copied/pasted here, it contained an entry for each file (and directory) within the zip, including information like the uncompressed filesize (everything you should need to know about a file).
What we've got
PK?4>U
~��test.txtPK6*
As shown in the question, this is the binary data extracted between the 0x504b0102 and 0x504b0506 signatures.
One thing to note: for ZIP64 archives, the "end of central directory" signature is 0x504b0606 (which translates to \x50\x4b\x06\x06), and you should check for this signature if the 32-bit end offset was not found.
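That fallback can be sketched with a reverse scan; the helper name here is made up, and a reverse search is used since the end-of-central-directory record sits near the end of the file:

```php
<?php
// Sketch of the signature fallback described above: search backwards for the
// 32-bit end-of-central-directory record first, then the ZIP64 variant.
function findEndOfCentralDirectory(string $data)
{
    $pos = strrpos($data, "\x50\x4b\x05\x06");     // 32-bit EOCD signature
    if ($pos === false) {
        $pos = strrpos($data, "\x50\x4b\x06\x06"); // ZIP64 EOCD signature
    }
    return $pos; // int offset, or false if neither signature is present
}

// Example: a buffer containing only a ZIP64 EOCD signature.
var_dump(findEndOfCentralDirectory("junk\x50\x4b\x06\x06more")); // int(4)
```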
What we do with it
So what we do now is feed that binary data into a buffer. For my implementation I used the nelexa/buffer package, and since we've extracted the exact binary data that we need to parse, we can start the buffer at position 0 to make things a bit easier.
$binaryData = '..';
$buffer = new \Nelexa\Buffer\StringBuffer($binaryData);
$bufferPosition = 0;
$entries = [];
// ensure our position does not exceed the size of our data
while ($bufferPosition < $buffer->size()) {
// NB: these three are really 2-byte little-endian fields; reading a single
// byte only works while names/extras/comments are shorter than 256 bytes
$nameLength = $buffer
    ->setPosition($bufferPosition + 28)
    ->getArrayBytes(1)[0];
$extraLength = $buffer
    ->setPosition($bufferPosition + 30)
    ->getArrayBytes(1)[0];
$commentLength = $buffer
    ->setPosition($bufferPosition + 32)
    ->getArrayBytes(1)[0];
$size = $buffer
->setPosition($bufferPosition + 24)
->getUnsignedInt();
$crc32 = $buffer
->setPosition($bufferPosition + 16)
->getUnsignedInt();
$entries[] = [
    'isDirectory' => $size === 0 && $crc32 === 0,
    'size' => $size,
    'crc32' => $crc32,
    'filename' => $buffer
        ->setPosition($bufferPosition + 46)
        ->getString($nameLength),
    'comment' => $buffer
        ->setPosition($bufferPosition + 46 + $nameLength + $extraLength)
        ->getString($commentLength),
];
// set the position of the next file entry
$bufferPosition += 46 + $nameLength + $extraLength + $commentLength;
}
var_dump($entries);
The code above should give you a working example and the general gist of how it works; you can find more information about the offsets/positions used here: Wikipedia - ZIP (file format).
Output for the binary data above should be similar to:
array:1 [
0 => array:5 [
"isDirectory" => false
"filename" => "test.txt"
"size" => 307
"crc32" => 1071205881
"comment" => ""
]
]
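As an alternative to the buffer package, PHP's native unpack() can read the same fields; its 'V' (32-bit) and 'v' (16-bit) codes are defined as little-endian regardless of the host. A self-contained sketch against a synthetic one-entry central directory (the helper name and sample values are made up for illustration):

```php
<?php
// Parse 32-bit central directory entries with unpack(); 'V'/'v' read
// little-endian integers, matching the ZIP on-disk byte order.
function parseCentralDirectory(string $data): array
{
    $entries = [];
    $pos = 0;
    while (($pos = strpos($data, "\x50\x4b\x01\x02", $pos)) !== false) {
        // bytes 16..33 of the header: crc32, compressed size, uncompressed
        // size (4 bytes each), then name/extra/comment lengths (2 bytes each)
        $h = unpack('Vcrc32/Vcsize/Vusize/vnamelen/vextralen/vcommentlen',
            substr($data, $pos + 16, 18));
        $entries[] = [
            'filename' => substr($data, $pos + 46, $h['namelen']),
            'size'     => $h['usize'],
            'crc32'    => $h['crc32'],
        ];
        $pos += 46 + $h['namelen'] + $h['extralen'] + $h['commentlen'];
    }
    return $entries;
}

// Synthetic single-entry central directory for demonstration.
$header = "\x50\x4b\x01\x02"           // signature
    . pack('v6', 20, 20, 0, 0, 0, 0)   // versions, flags, method, time, date
    . pack('V3', 1071205881, 120, 307) // crc32, compressed, uncompressed
    . pack('v3', 8, 0, 0)              // name/extra/comment lengths
    . pack('v2', 0, 0)                 // disk start, internal attrs
    . pack('V2', 0, 0)                 // external attrs, local header offset
    . 'test.txt';

$entries = parseCentralDirectory($header);
$total = array_sum(array_column($entries, 'size'));
echo $entries[0]['filename'] . ' ' . $total . PHP_EOL; // test.txt 307
```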
Gotchas
It is important to note that you may encounter endianness issues depending on the system in use (which is a fancy way of saying your bytes will be reversed). In such a scenario, the example above will output incorrect values for the file size and the CRC-32.
A good way to check if your system is little-endian or big-endian is:
function isLittleEndian(): bool
{
$testInteger = 0x00FF;
$packed = pack('S', $testInteger);
$unpacked = (array)unpack('v', $packed);
return $testInteger === current($unpacked);
}
If your system is little-endian, then you'll want to convert integers to big-endian instead, using this byte-swap function:
function convertToBigEndian(int $value): int
{
return (($value & 0x000000FF) << 24) | (($value & 0x0000FF00) << 8) | (($value & 0x00FF0000) >> 8) | (($value & 0xFF000000) >> 24);
}
and then we can make sure we have correct values for $size and $crc32 from our example above:
$size = ...
$crc32 = ...
// convert endianness
if (isLittleEndian()) {
$size = convertToBigEndian($size);
$crc32 = convertToBigEndian($crc32);
}
WARNING: It is still possible to parse the central directory from corrupt zips (e.g. zips that the native ZipArchive cannot open due to an error). Therefore the existence of the central directory should not be used as an indicator of whether the ZIP is compliant/valid/working.
A few bonus examples of fetching the last 100 KiB of a file using a few different methods
$range = "bytes=-".(100*1024);
Using AWS S3 client
$result = $s3client->getObject([
'Bucket' => 'your-bucket',
'Region' => 'your-region',
'Key' => 'path/to/file.zip',
'Range' => $range,
]);
$data = (string) $result['Body'];
Using cURL
$url = 'https://somewhere.com/path/to/file.zip';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RANGE, '-'.(100*1024)); // CURLOPT_RANGE takes the range without the "bytes=" prefix
$data = (string) curl_exec($ch);
curl_close($ch);
Using file_get_contents()
$url = 'https://somewhere.com/path/to/file.zip';
$opts = [
'http' => [
'method' => 'GET',
'header' => [
'Range: '.$range
],
]
];
$context = stream_context_create($opts);
$data = (string) file_get_contents($url, false, $context);
Using Guzzle
$url = 'https://somewhere.com/path/to/file.zip';
$client = new \GuzzleHttp\Client();
$res = $client->request('GET', $url, [
'headers' => [
'Range' => $range,
]
]);
$data = (string) $res->getBody();
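And for a file already on local disk no HTTP is needed at all; a seek from the end does it (readTail() is a hypothetical helper, not part of any package):

```php
<?php
// Read the last $bytes bytes of a local file without loading the whole file.
function readTail(string $path, int $bytes = 100 * 1024): string
{
    $fh = fopen($path, 'rb');
    fseek($fh, -min($bytes, filesize($path)), SEEK_END); // clamp for small files
    $data = stream_get_contents($fh);
    fclose($fh);
    return $data;
}

// Example: grab the last 3 bytes of a scratch file.
$tmp = tempnam(sys_get_temp_dir(), 'zip');
file_put_contents($tmp, 'some bytes...END');
echo readTail($tmp, 3) . PHP_EOL; // END
unlink($tmp);
```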
Hope this helps someone, there was very little information out there on this topic and even less for PHP.
Disclaimer: I am an amateur at this kind of thing, in fact this is my first time having any success at it. Please comment if I have made any mistakes

fgetcsv encoding issue (PHP)

I am being sent a csv file that is tab delimited. Here is a sample of what I see:
Invoice: Invoice Date Account: Name Bill To: First Name Bill To: Last Name Bill To: Work Email Rate Plan Charge: Name Subscription: Device Serial Number
2021-03-10 Test Company Wally Kolcz test@test.com Sample plan A0H1234567890A
I wrote a script to open, read and loop over the values but I get weird stuff after:
$line = 1;
if (($handle = fopen($user_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
if($line >1 && isset($data[1])){
$user = [
'EmailAddress' => $data[4],
'Name' => $data[2].' '.$data[3],
];
}
$line++;
}
fclose($handle);
}
Here is what I get when I dump the first line.
array:7 [▼
0 => b"ÿþI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00 \x00I\x00n\x00v\x00o\x00i\x00c\x00e\x00 \x00D\x00a\x00t\x00e\x00"
1 => "\x00A\x00c\x00c\x00o\x00u\x00n\x00t\x00:\x00 \x00N\x00a\x00m\x00e\x00"
2 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00F\x00i\x00r\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
3 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00L\x00a\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
4 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00W\x00o\x00r\x00k\x00 \x00E\x00m\x00a\x00i\x00l\x00"
5 => "\x00R\x00a\x00t\x00e\x00 \x00P\x00l\x00a\x00n\x00 \x00C\x00h\x00a\x00r\x00g\x00e\x00:\x00 \x00N\x00a\x00m\x00e\x00"
6 => "\x00S\x00u\x00b\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n\x00:\x00 \x00D\x00e\x00v\x00i\x00c\x00e\x00 \x00S\x00e\x00r\x00i\x00a\x00l\x00 \x00N\x00u\x00m\x00b\x00e\x00r\x00 ◀"
]
I tried adding:
header('Content-Type: text/html; charset=UTF-8');
$data = array_map("utf8_encode", $data);
setlocale(LC_ALL, 'en_US.UTF-8');
And when I dump mb_detect_encoding($data[2]), I get 'ASCII'...
Any way to fix this so I don't have to manually update the file each time I receive it? Thanks!
Looks like the file is in UTF-16 (every other byte is null).
You probably need to convert the whole file with something like mb_convert_encoding($data, "UTF-8", "UTF-16");
But you can't really use fgetcsv() directly on the raw file in that case…
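That "convert the whole file first" route can be sketched end-to-end like this; the sample data is made up, and php://temp stands in for the real file:

```php
<?php
// Convert the whole UTF-16LE payload to UTF-8, load it into an in-memory
// stream, and let fgetcsv() operate on the converted data.
$utf16 = mb_convert_encoding("name\tvalue\nfoo\t42\n", 'UTF-16LE', 'UTF-8');

$utf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');

$fh = fopen('php://temp', 'r+');
fwrite($fh, $utf8);
rewind($fh);

$rows = [];
while (($row = fgetcsv($fh, 1000, "\t")) !== false) {
    $rows[] = $row;
}
fclose($fh);

print_r($rows); // [['name', 'value'], ['foo', '42']]
```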
As @Andrea already mentioned, your data is encoded as UTF-16LE and you need to convert it to an encoding compatible with what you want to do. That said, it is possible to do the conversion in-flight with PHP stream filters.
abstract class TranslateCharset extends php_user_filter {
protected $in_charset, $out_charset;
private $buffer = '';
private $total_consumed = 0;
public function filter($in, $out, &$consumed, $closing) {
$output = '';
while ($bucket = stream_bucket_make_writeable($in)) {
$input = $this->buffer . $bucket->data;
for( $i=0, $p=0; ($c=mb_substr($input, $i, 1, $this->in_charset)) !== ""; ++$i, $p+=strlen($c) ) {
$output .= mb_convert_encoding($c, $this->out_charset, $this->in_charset);
}
$this->buffer = substr($input, $p);
$consumed += $p;
}
// this means that there's unconverted data at the end of the brigade.
if( $closing && strlen($this->buffer) > 0 ) {
$this->raise_error( sprintf(
"Likely encoding error at offset %d in input stream, subsequent data may be malformed or missing.",
$this->total_consumed += $consumed)
);
$consumed += strlen($this->buffer);
// give it the ol' college try
$output .= mb_convert_encoding($this->buffer, $this->out_charset, $this->in_charset);
}
$this->total_consumed += $consumed;
if ( ! isset($bucket) ) {
$bucket = stream_bucket_new($this->stream, $output);
} else {
$bucket->data = $output;
}
stream_bucket_append($out, $bucket);
return PSFS_PASS_ON;
}
protected function raise_error($message) {
user_error( sprintf(
"%s[%s]: %s",
__CLASS__, get_class($this), $message
), E_USER_WARNING);
}
}
class UTF16LEtoUTF8 extends TranslateCharset {
protected $in_charset = 'UTF-16LE';
protected $out_charset = 'UTF-8';
}
stream_filter_register('UTF16LEtoUTF8', 'UTF16LEtoUTF8');
// example input "Invoice:,a" encoded as UTF-16LE, preceded by its BOM
$in = "\xFF\xFEI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00,\x00a\x00";
// prep example pipe; in practice this would simply be your fopen() call.
$fh = fopen('php://memory', 'rwb+');
fwrite($fh, $in);
rewind($fh);
// skip BOM
fseek($fh, 2);
stream_filter_append($fh, 'UTF16LEtoUTF8', STREAM_FILTER_READ);
var_dump(fgetcsv($fh, 4096));
Output:
array(2) {
[0]=>
string(8) "Invoice:"
[1]=>
string(1) "a"
}
In practice there is no "magic bullet" to detect the encoding of an input file or string. In this case there is a Byte Order Mark (BOM) of 0xFF 0xFE denoting UTF-16LE, but the BOM is frequently omitted, may simply occur naturally at the beginning of an arbitrary string, and is simply not required or used by many encodings or by whoever encoded the data.
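When a BOM does happen to be present, checking for the two UTF-16 variants up front is straightforward (hypothetical helper; str_starts_with() requires PHP 8):

```php
<?php
// Returns the UTF-16 variant indicated by a byte order mark, or null if no
// BOM is present (in which case the encoding must be known out-of-band).
function detectUtf16Bom(string $data): ?string
{
    if (str_starts_with($data, "\xFF\xFE")) {
        return 'UTF-16LE';
    }
    if (str_starts_with($data, "\xFE\xFF")) {
        return 'UTF-16BE';
    }
    return null;
}

var_dump(detectUtf16Bom("\xFF\xFEI\x00")); // string(8) "UTF-16LE"
```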
That last bit is the exact reason why everyone should avoid the utf8_encode() and utf8_decode() functions like the plague, because they simply assume that you only ever want to go between UTF-8 and ISO-8859-1 [western european], and make no effort to avoid corrupting your data when used incorrectly because they can't possibly know any better.
TLDR: You must explicitly know the encoding of your input data, or you're going to have a bad time.
Edit: Since I've gone and put a proper spitshine on this I've put it up as a Composer package, in case anyone else needs something like this.
https://packagist.org/packages/wrossmann/costrenc
I ended up with this as working code:
$f = file_get_contents($user_file);
$f = mb_convert_encoding($f, 'UTF8', 'UTF-16LE');
$f = preg_split("/\R/", $f);
$f = array_map('str_getcsv', $f);
$line = 0;
foreach($f as $record){
    if($line !== 0 && isset($record[0])){
        $pieces = preg_split('/[\t]/', $record[0]);
        //My work here
    }
    $line++;
}
Thank you everyone for your examples and suggestions!

Microsoft Graph API file upload using PHP SDK for large file still fails

I am currently using microsoft-php-sdk and it has been pretty good. I have managed to upload small files from the server to OneDrive, but when I tried to upload a 38 MB PowerPoint file it failed. The Microsoft Graph API documentation suggests creating an upload session. I thought it would be as easy as updating the URI from /content to /createUploadSession, but that still failed.
$response = $graph->createRequest('POST', '/me/drive/root/children/'.basename($path).'/createUploadSession')
->setReturnType(Model\DriveItem::class)
->upload($path);
My code looks something like this. I have difficulty figuring out the PHP SDK documentation and there was no example for upload session. Has anyone used the PHP SDK for this scenario before?
I used a similar approach with https://github.com/microsoftgraph/msgraph-sdk-php/wiki and the Laravel framework. Here is the code that finally worked for me.
public function test()
{
//initialization
$viewData = $this->loadViewData();
// Get the access token from the cache
$tokenCache = new TokenCache();
$accessToken = $tokenCache->getAccessToken();
// Create a Graph client
$graph = new Graph();
$graph->setAccessToken($accessToken);
//upload larger files
// 1. create upload session
$fileLocation = 'S:\ebooks\myLargeEbook.pdf';
// note: no need to load the whole file into memory here; it is read in chunks below
$reqBody = array(
    "@microsoft.graph.conflictBehavior" => "rename", // or "fail" / "replace"
    "description" => "description",
    "fileSystemInfo" => ["@odata.type" => "microsoft.graph.fileSystemInfo"],
    "name" => "ebook.pdf",
);
$uploadsession=$graph->createRequest("POST", "/drive/root:/test/ebook.pdf:/createUploadSession")
->attachBody($reqBody)
->setReturnType(Model\UploadSession::class)
->execute();
//2. upload bytes
$fragSize = 320 * 1024; // fragment size; the Graph docs recommend a multiple of 320 KiB
$fileLocation = 'S:\ebooks\myLargeEbook.pdf';
// check if exists file if not make one
if (\File::exists($fileLocation)) {
$graph_url = $uploadsession->getUploadUrl();
$fileSize = filesize($fileLocation);
$numFragments = ceil($fileSize / $fragSize);
$bytesRemaining = $fileSize;
$i = 0;
while ($i < $numFragments) {
$chunkSize = $numBytes = $fragSize;
$start = $i * $fragSize;
$end = $i * $fragSize + $chunkSize - 1;
$offset = $i * $fragSize;
if ($bytesRemaining < $chunkSize) {
$chunkSize = $numBytes = $bytesRemaining;
$end = $fileSize - 1;
}
if ($stream = fopen($fileLocation, 'r')) {
// get contents using offset
$data = stream_get_contents($stream, $chunkSize, $offset);
fclose($stream);
}
$content_range = "bytes " . $start . "-" . $end . "/" . $fileSize;
$headers = array(
"Content-Length"=> $numBytes,
"Content-Range"=> $content_range
);
$uploadByte = $graph->createRequest("PUT", $graph_url)
->addHeaders($headers)
->attachBody($data)
->setReturnType(Model\UploadSession::class)
->setTimeout("1000")
->execute();
$bytesRemaining = $bytesRemaining - $chunkSize;
$i++;
}
}
dd($uploadByte);
}
}
I have developed a similar library for OneDrive based on the Microsoft Graph REST API. This problem is also solved there: tarask/oneDrive.
Look at the documentation in the "Upload large files" section.
I'm not familiar with PHP, but I am familiar with the upload API. Hopefully this will help.
The /content endpoint you were using before allows you to write binary contents to a file directly and returns a DriveItem as your code expects. The /createUploadSession method works differently. The Graph documentation for resumable upload details this, but I'll summarize here.
Instead of sending the binary contents in the CreateUploadSession request, you either send an empty body or you send a JSON payload with metadata like the filename or conflict-resolution behavior.
The response from CreateUploadSession is an UploadSession object, not a DriveItem. The object has an uploadUrl property that you use to send the binary data.
Upload your binary data over multiple requests using the HTTP Content-Range header to indicate which byte range you're uploading.
Once the server receives the last bytes of the file, the upload automatically finishes.
While this overview illustrates the basics, there are some more concepts you should code around. For example, if one of your byte ranges fails to upload, you need to ask the server which byte ranges it already has and where to resume. That and other things are detailed in the docs. https://developer.microsoft.com/en-us/graph/docs/api-reference/v1.0/api/driveitem_createuploadsession
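The resume step mentioned above can be sketched like this: a plain GET on the session's uploadUrl returns a JSON status object whose nextExpectedRanges property lists where the server wants you to continue. The parsing helper below is hypothetical; the property name and range format ("26214400-") come from the Graph documentation:

```php
<?php
// Extract the byte offset to resume from out of a decoded upload-session
// status response, e.g. {"nextExpectedRanges": ["26214400-"]}.
function nextExpectedOffset(array $status): int
{
    $range = $status['nextExpectedRanges'][0] ?? '0-';
    return (int) strtok($range, '-'); // take the part before the dash
}

// Fetching $status would be an unauthenticated GET on the uploadUrl (the
// URL itself is pre-authenticated), then json_decode($body, true).
var_dump(nextExpectedOffset(['nextExpectedRanges' => ['26214400-']])); // int(26214400)
```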
<?php
require __DIR__.'/path/to/vendor/autoload.php';
use Microsoft\Graph\Graph;
use Microsoft\Graph\Model;
$graph = new Graph();
$graph->setAccessToken('YOUR_TOKEN_HERE');
/** @var Model\UploadSession $uploadSession */
$uploadSession = $graph->createRequest("POST", "/me/drive/items/root:/doc-test2.docx:/createUploadSession")
->addHeaders(["Content-Type" => "application/json"])
->attachBody([
"item" => [
"#microsoft.graph.conflictBehavior" => "rename",
"description" => 'File description here'
]
])
->setReturnType(Model\UploadSession::class)
->execute();
$file = __DIR__.'/path/to/large-file.avi';
$handle = fopen($file, 'r');
$fileSize = filesize($file);
$chunkSize = 1024*1024*4;
$start = 0;
while (!feof($handle)) {
    $bytes = fread($handle, $chunkSize);
    $end = $start + strlen($bytes) - 1; // inclusive end byte of this chunk
    /* or use stream
    $stream = \GuzzleHttp\Psr7\stream_for($bytes);
    */
    $res = $graph->createRequest("PUT", $uploadSession->getUploadUrl())
        ->addHeaders([
            'Content-Length' => strlen($bytes),
            'Content-Range' => "bytes " . $start . "-" . $end . "/" . $fileSize
        ])
        ->setReturnType(Model\UploadSession::class)
        ->attachBody($bytes)
        ->execute();
    $start = $end + 1;
}
fclose($handle);
It works for me!

Download rapidshare file using rapidshare api in php

I am trying to download a rapidshare file using its "download" subroutine as a free user. The following is the code that I use to get response from the subroutine.
function rs_download($params)
{
$url = "http://api.rapidshare.com/cgi-bin/rsapi.cgi?sub=download&fileid=".$params['fileid']."&filename=".$params['filename'];
$reply = @file_get_contents($url);
if(!$reply)
{
return false;
}
$result_arr = array();
$result_keys = array(0=> 'hostname', 1=>'dlauth', 2=>'countdown_time', 3=>'md5hex');
if( preg_match("/DL:(.*)/", $reply, $reply_matches) )
{
$reply_altered = $reply_matches[1];
}
else
{
return false;
}
foreach( explode(',', $reply_altered) as $index => $value )
{
$result_arr[ $result_keys[$index] ] = $value;
}
return $result_arr;
}
For instance; trying to download this...
http://rapidshare.com/files/440817141/AutoRun__live-down.com_Champ.rar
I pass the fileid(440817141) and filename(AutoRun__live-down.com_Champ.rar) to rs_download(...) and I get a response just as rapidshare's api doc says.
The RapidShare API doc (see "sub=download") says to call the server hostname with the download authentication string, but I couldn't figure out what form the URL should take.
Any suggestions? I tried
$download_url = "http://{$hostname}/{$dlauth}/files/{$fileid}/{$filename}";
and a couple of other variations of the above; nothing worked.
I use curl to download the file, like the following;
$cr = curl_init();
$fp = fopen ("d:/downloaded_files/file1.rar", "w");
// set curl options
$curl_options = array(
CURLOPT_URL => $download_url
,CURLOPT_FILE => $fp
,CURLOPT_HEADER => false
,CURLOPT_CONNECTTIMEOUT => 0
,CURLOPT_FOLLOWLOCATION => true
);
curl_setopt_array($cr, $curl_options);
curl_exec($cr);
curl_close($cr);
fclose($fp);
The above curl code doesn't seem to work; nothing gets downloaded. Probably the download URL is incorrect.
Also tried this format for the download url:
"http://rs$serverid$shorthost.rapidshare.com/files/$fileid/$filename"
With this, curl writes a file entry but that is all it does (it writes a 0/1 KB file).
Here is the code that I use to get the serverid, shorthost, among a few other values from rapidshare.
function rs_checkfile($params)
{
$url = "http://api.rapidshare.com/cgi-bin/rsapi.cgi?sub=checkfiles_v1&files=".$params['fileids']."&filenames=".$params['filenames'];
// the response from rapidshare would be a string something like:
// 440817141,AutoRun__live-down.com_Champ.rar,47768,20,1,l3,0
$reply = @file_get_contents($url);
if(!$reply)
{
return false;
}
$result_arr = array();
$result_keys = array(0=> 'file_id', 1=>'file_name', 2=>'file_size', 3=>'server_id', 4=>'file_status', 5=>'short_host'
, 6=>'md5');
foreach( explode(',', $reply) as $index => $value )
{
$result_arr[ $result_keys[$index] ] = $value;
}
return $result_arr;
}
rs_checkfile(...) takes comma-separated fileids and filenames (no commas if calling for a single file).
Thanks in advance for any suggestions.
You start by requesting ?sub=download&fileid=X&filename=Y, which returns $hostname,$dlauth,$countdown,$md5hex. Since you're a free user, you have to delay for $countdown seconds and then call ?sub=download&fileid=X&filename=Y&dlauth=Z to perform the download.
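A hypothetical sketch of building that second, authenticated URL; the path shape is an assumption pieced together from the answer text (the RapidShare service and its API are long gone, so none of this can be verified against live docs):

```php
<?php
// Build the authenticated download URL for the second request described
// above. All parameter names mirror the first ?sub=download call, plus the
// dlauth token returned by it.
function buildAuthedDownloadUrl(string $hostname, string $fileid, string $filename, string $dlauth): string
{
    return "http://{$hostname}/cgi-bin/rsapi.cgi?sub=download"
        . "&fileid={$fileid}&filename={$filename}&dlauth={$dlauth}";
}

echo buildAuthedDownloadUrl('rs123.rapidshare.com', '440817141', 'file.rar', 'abc123') . PHP_EOL;
```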
There's a working implementation in Python here that would probably answer any of your other questions.

xml parse error: 'Invalid character'

I'm using the google weather api for a widget.
All is fine and dandy except that today I encountered a problem that I cannot solve.
When called with this location:
http://www.google.com/ig/api?weather=dunjkovec,medimurska,croatia&hl=en
I get this error:
XML parse error 9 'Invalid character' at line 1, column 169 (byte index 199)
I suspect that the problem is here: Nedelišće
The code block is this one:
$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);
if (!$ok) {
$errmsg = sprintf("XML parse error %d '%s' at line %d, column %d (byte index %d)",
xml_get_error_code($parser),
xml_error_string(xml_get_error_code($parser)),
xml_get_current_line_number($parser),
xml_get_current_column_number($parser),
xml_get_current_byte_index($parser));
}
$data is the content of the xml and $values is empty.
Can someone help me? Thank you very much!
EDIT----------------------------------
After reading Hussein's post I discovered that the problem is in the way the file gets retrieved.
I tried file_get_contents and cURL. Both return:
that is the line that creates problems. Or so I thought! I tried html_entity_decode($data, ENT_NOQUOTES, 'UTF-8') and it wasn't working, so I made a discovery: I can't echo the contents of the xml, I can only print_r them and see the results in the HTML source! With any other location in the world it works; only this one creates problems... I wanna cry :-(
EDIT 2--------------------------------
For anybody that cares, I fixed the problem with these lines of code after retrieving the xml file from the api:
$data = mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data, 'UTF-8, ISO-8859-1', true));
$data = html_entity_decode($data,ENT_NOQUOTES,'UTF-8');
then parse the xml, it works like a charm.
I marked Hussein's answer because it got me on the right track.
After reading your problem, I tried the same thing on my machine.
What I did:
1. Downloaded the xml file to my local machine from the URL you posted.
2. Used your xml parsing script to prepare the structure from the XML.
Amazingly, it worked perfectly on my machine, even though the XML has the Nedelišće keyword.
So I see the problem as being in the way the XML file is read.
It would be easier to debug if you could tell me how you are reading the xml from the google api.
Are you using cURL?
EDIT -----------------------------------------------
Hi 0plus1,
I have prepared a helper function to convert those special chars to HTML entities so the data can be parsed.
I am pasting the entire code here. Use the following script:
function utf8tohtml($utf8, $encodeTags)
{
$result = '';
for ($i = 0; $i < strlen($utf8); $i++)
{
$char = $utf8[$i];
$ascii = ord($char);
if ($ascii < 128)
{
// one-byte character
$result .= ($encodeTags) ? htmlentities($char , ENT_QUOTES, 'UTF-8') : $char;
} else if ($ascii < 192)
{
// non-utf8 character or not a start byte
} else if ($ascii < 224)
{
// two-byte character
$result .= htmlentities(substr($utf8, $i, 2), ENT_QUOTES, 'UTF-8');
$i++;
} else if ($ascii < 240)
{
// three-byte character
$ascii1 = ord($utf8[$i+1]);
$ascii2 = ord($utf8[$i+2]);
$unicode = (15 & $ascii) * 4096 +
(63 & $ascii1) * 64 +
(63 & $ascii2);
$result .= "&#$unicode;";
$i += 2;
} else if ($ascii < 248)
{
// four-byte character
$ascii1 = ord($utf8[$i+1]);
$ascii2 = ord($utf8[$i+2]);
$ascii3 = ord($utf8[$i+3]);
$unicode = (15 & $ascii) * 262144 +
(63 & $ascii1) * 4096 +
(63 & $ascii2) * 64 +
(63 & $ascii3);
$result .= "&#$unicode;";
$i += 3;
}
}
return $result;
}
$curlHandle = curl_init();
$serviceUrl = "http://www.google.com/ig/api?weather=dunjkovec,medimurska,croatia&hl=en";
// setup the basic options for the curl
curl_setopt($curlHandle , CURLOPT_URL, $serviceUrl);
curl_setopt($curlHandle , CURLOPT_HEADER , 0);
curl_setopt($curlHandle , CURLOPT_HTTPHEADER , array("Cache-Control: no-cache","Content-type: application/x-www-form-urlencoded;charset=UTF-8"));
curl_setopt($curlHandle , CURLOPT_FOLLOWLOCATION , true);
curl_setopt($curlHandle , CURLOPT_RETURNTRANSFER , true);
curl_setopt($curlHandle , CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
$data = curl_exec($curlHandle);
// echo $data;
$data = utf8tohtml($data , false);
echo $data;
$parser = xml_parser_create("UTF-8");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);
if (!$ok) {
$errmsg = sprintf("XML parse error %d '%s' at line %d, column %d (byte index %d)",
xml_get_error_code($parser),
xml_error_string(xml_get_error_code($parser)),
xml_get_current_line_number($parser),
xml_get_current_column_number($parser),
xml_get_current_byte_index($parser));
}
echo "<pre>";
print_r($values);
echo "</pre>";
Hope this will help.
Thanks!
Hussain.
The Content-Type header field in the response specifies the content to be encoded with ISO-8859-1 (see the response on Web-Sniffer.net), not UTF-8. So either specify ISO-8859-1 as the encoding, or omit that parameter so that xml_parser_create tries to identify the encoding.
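Either option works; normalizing to UTF-8 up front looks like this. The snippet is a self-contained sketch ("café" stands in for the problem characters, since š/ć do not exist in ISO-8859-1 anyway):

```php
<?php
// Normalize an ISO-8859-1 response body to UTF-8 before handing it to a
// parser created for UTF-8.
$latin1 = "<r><c>caf\xE9</c></r>"; // "café" encoded as ISO-8859-1

$data = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
$ok = xml_parse_into_struct($parser, $data, $values);
xml_parser_free($parser);

echo $values[1]['value'] . PHP_EOL; // café
```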
Again, which PHP version are you using? xml_parser_create takes encoding as a parameter, but in some versions only for output, not input. http://www.php.net/manual/en/function.xml-parser-create.php
You might want to consider creating an empty UTF-8 string and then filling it with the XML retrieved from Google, or explicitly converting the string to UTF-8.
string utf8_encode ( string $data )
Google is correctly informing us the data is UTF-8, but only in the header, not in the actual XML.

Problem with sprintf error when using S3->copyObject() and filenames with % in them

I am using the PHP S3.php class to manage files on Amazon S3. I use the copyObject() function to copy files in my S3 bucket. All works great until I meet filenames that need to be urlencoded (I urlencode everything anyway). When a filename ends up with % characters in it, the copyObject() function spits the dummy.
For example, the filename 63037_Copy%287%29ofDSC_1337.JPG throws the following error when passed to copyObject():
Warning: sprintf() [<a href='function.sprintf'>function.sprintf</a>]: Too few arguments in ..... S3.php on line 477
Here's the copyObject function; line 477 is marked below:
public static function copyObject($srcBucket, $srcUri, $bucket, $uri, $acl = self::ACL_PRIVATE, $metaHeaders = array(), $requestHeaders = array()) {
$rest = new S3Request('PUT', $bucket, $uri);
$rest->setHeader('Content-Length', 0);
foreach ($requestHeaders as $h => $v) $rest->setHeader($h, $v);
foreach ($metaHeaders as $h => $v) $rest->setAmzHeader('x-amz-meta-'.$h, $v);
$rest->setAmzHeader('x-amz-acl', $acl);
$rest->setAmzHeader('x-amz-copy-source', sprintf('/%s/%s', $srcBucket, $srcUri));
if (sizeof($requestHeaders) > 0 || sizeof($metaHeaders) > 0)
$rest->setAmzHeader('x-amz-metadata-directive', 'REPLACE');
$rest = $rest->getResponse();
if ($rest->error === false && $rest->code !== 200)
$rest->error = array('code' => $rest->code, 'message' => 'Unexpected HTTP status');
if ($rest->error !== false) {
// -------------------------------------------- LINE 477 ----------------------------
trigger_error(sprintf("S3::copyObject({$srcBucket}, {$srcUri}, {$bucket}, {$uri}): [%s] %s",
    $rest->error['code'], $rest->error['message']), E_USER_WARNING);
// -------------------------------------------- LINE 477 ----------------------------
return false;
}
return isset($rest->body->LastModified, $rest->body->ETag) ? array(
'time' => strtotime((string)$rest->body->LastModified),
'hash' => substr((string)$rest->body->ETag, 1, -1)
) : false;
}
Has anyone come across this before? There is absolutely no problem when using filenames which don't change when urlencoded. I've already tried removing all whitespace from filenames, but I won't be able to catch all characters, like the brackets which are the problem in this example. And I don't want to go down that road anyway, as I want to keep the filenames as close to the original as possible.
thanks guys
Redo the line this way:
trigger_error("S3::copyObject({$srcBucket}, {$srcUri}, {$bucket}, {$uri}): ". sprintf("[%s] %s",
$rest->error['code'], $rest->error['message']), E_USER_WARNING);
%'s in the first parameter to sprintf are interpreted as placeholders for values. Because your filenames are first interpolated into the string, and that string is then passed to sprintf(), sprintf() mistakenly interprets the %'s in the filenames as placeholders.
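The failure mode is easy to reproduce, and keeping user data out of the format string fixes it. The error code and message below are made up for illustration:

```php
<?php
$filename = '63037_Copy%287%29ofDSC_1337.JPG';

// Buggy pattern: interpolating data into the format string makes sprintf()
// treat '%28' and '%29' as conversion specifiers and demand more arguments.
// sprintf("S3::copyObject({$filename}): [%s] %s", $code, $message);

// Fix 1: pass the data as an argument instead of baking it into the format.
$msg = sprintf('S3::copyObject(%s): [%s] %s', $filename, '403', 'Access Denied');

// Fix 2: if it must live in the format string, double every literal '%'.
$format = 'S3::copyObject(' . str_replace('%', '%%', $filename) . '): [%s] %s';
$msg2 = sprintf($format, '403', 'Access Denied');

var_dump($msg === $msg2); // bool(true)
```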