Unicode is wrong when getting a file from a server - PHP
I want to download this link from Google, which produces a text file, using PHP.
When I open it in the browser, the Unicode is correct and everything is fine, but when I fetch it with cURL or file_get_contents it contains garbled characters.
What is the difference, and how can I solve it?
Downloaded by browser:
[[["سلام","hello","",""]],[["interjection",["سلام","هالو","الو"],[["سلام",["hello","hi","aloha","all hail"]],["هالو",["hallo","hello","halloo"]],["الو",["hello"]]]]],"en",,[["سلام",[5],0,0,1000,0,1,0]],[["hello",4,,,""],["hello",5,[["سلام",1000,0,0],["خوش",0,0,0],["میهمان گرامی",0,0,0],["خوش آمدید",0,0,0],["درود کاربر",0,0,0]],[[0,5]],"hello"]],,,[["en"]],65]
Downloaded by the following PHP script:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<?php
$t = file_get_contents("http://translate.google.com/translate_a/t?client=t&hl=en&sl=auto&tl=fa&multires=1&prev=btn&ssel=0&tsel=3&uptl=fa&alttl=en&sc=1&text=hello");
$f = fopen("t.txt", "w+");
fwrite($f, $t);
fclose($f);
?>
</body></html>
[[["ÓáÇã","hello","",""]],[["interjection",["ÓáÇã","åÇáæ","Çáæ"],[["ÓáÇã",["hello","hi","aloha","all hail"]],["åÇáæ",["hallo","hello","halloo"]],["Çáæ",["hello"]]]]],"en",,[["ÓáÇã",[5],0,0,1000,0,1,0]],[["hello",4,,,""],["hello",5,[["ÓáÇã",1000,0,0],["ÎæÔ",0,0,0],["ã\u06CCåãÇä ÑÇã\u06CC",0,0,0],["ÎæÔ ÂãÏ\u06CCÏ",0,0,0],["ÏÑæÏ ÇÑÈÑ",0,0,0]],[[0,5]],"hello"]],,,[["en"]],4]
The response headers are:
HTTP/1.1 200 OK
Pragma: no-cache
Date: Fri, 25 May 2012 22:29:12 GMT
Expires: Fri, 25 May 2012 22:29:12 GMT
Cache-Control: private, max-age=600
Content-Type: text/javascript; charset=UTF-8
Content-Language: fa
Set-Cookie: PREF=ID=b6c08a0545f50594:TM=1337984952:LM=1337984952:S=Sf1xcow2qPZrFeu0; expires=Sun, 25-May-2014 22:29:12 GMT; path=/; domain=.google.com
X-Content-Type-Options: nosniff
Content-Disposition: attachment
Server: HTTP server (unknown)
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked
Add the parameters ie=UTF-8 and oe=UTF-8 to the query string of the URL:
$t = file_get_contents("http://translate.google.com/translate_a/t?ie=UTF-8&oe=UTF-8&client=t&hl=en&sl=auto&tl=fa&multires=1&prev=btn&ssel=0&tsel=3&uptl=fa&alttl=en&sc=1&text=hello");
This worked for me once, just as I was about to throw lots of code in the garbage! Maybe it will help you too:
iconv('CP1252', 'UTF-8', $string);
Echoing what you get from file_get_contents into the PHP output should work fine, as you are going from a UTF-8 JSON response to a UTF-8 HTML response. It works for me with the given URL.
When you store it to a file, you then have to worry about what encoding the tools you use to read the file are working in. Just fwrite-ing it is fine as long as the text editor you view it in knows the output is UTF-8. On Windows, Notepad may instead try to read it in the locale-dependent default ('ANSI') code page, which won't be UTF-8. On a Western European install that would be code page 1252, and you'd get output like Ø³Ù„Ø§Ù… for سلام.
(One way around that is to put a UTF-8 fake-BOM at the front of the file with fwrite($f, "\xef\xbb\xbf");. This is a bit dodgy because UTF-8 doesn't need a Byte Order Mark (its byte order is fixed) and it breaks UTF-8's ASCII-compatibility, but Windows tools like fake-BOMs. The other way around it is to get a better text editor that allows you to default to handling files as UTF-8.)
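For concreteness, a minimal sketch of that workaround applied to the script above (same request URL as in the earlier answer; the extra write is only there to please BOM-hungry editors such as Notepad):

<?php
// Fetch the translation; ie/oe force UTF-8 for input and output, as suggested above.
$t = file_get_contents("http://translate.google.com/translate_a/t?ie=UTF-8&oe=UTF-8&client=t&hl=en&sl=auto&tl=fa&multires=1&prev=btn&ssel=0&tsel=3&uptl=fa&alttl=en&sc=1&text=hello");

$f = fopen("t.txt", "w");
fwrite($f, "\xEF\xBB\xBF"); // UTF-8 fake-BOM so Notepad guesses the encoding correctly
fwrite($f, $t);             // the response body itself is already UTF-8
fclose($f);
?>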
You've got something slightly different here, as ÓáÇã is what you get when you save سلام in the Windows default Arabic encoding (code page 1256) and then read it in the Windows default Western encoding (code page 1252). This implies there's some kind of extra store-and-load step involved in your testing, that's messing up the encoding.
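A minimal sketch that reproduces exactly that mojibake chain with iconv (purely illustrative; WINDOWS-1256 and WINDOWS-1252 are the Windows Arabic and Western code pages):

<?php
// Encode the UTF-8 word as Windows-1256 bytes, then misread those bytes as
// Windows-1252: the result is the ÓáÇã seen in the question.
$cp1256Bytes = iconv('UTF-8', 'WINDOWS-1256', 'سلام'); // bytes D3 E1 C7 E3
echo iconv('WINDOWS-1252', 'UTF-8', $cp1256Bytes);     // prints ÓáÇã
?>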
If it's anything to do with Windows command line tools you might as well give up, because the Command Prompt and MSVCRT apps don't really play well with Unicode at all.
Related
GuzzleHttp request sends garbled characters
I use GuzzleHTTP 6.0 to get data from an API server. For some reason, the requests that the API server receives are not UTF-8 encoded: the characters ü, ö, ä, ß arrive garbled. My system and database both default to UTF-8. I set debug to true in the RequestOptions; this is the output:

User-Agent: GuzzleHttp/6.2.1 curl/7.47.0 PHP/7.0.22-0ubunut0.16.04.1
Content-type: text/xml;charset="UTF-8"
Accept: text/xml"
Cache-Control: no-cache
Content-Length: 2175
* upload completely sent off: 2175 out of 2175 bytes
< HTTP/1.1 200 OK
< Server: Apache:Coyote/1.1
< Content-Type: text/xml; charset=utf-8
< Transfer-Encoding: chunked
< Date: Thu, 23 Nov 2017 9:34:12 GMT
< * Connection #5 to host www.abcdef.com left intact

I have explicitly set the header contents to UTF-8:

$headers = array(
    'Content-type'   => 'text/xml;charset="utf-8"',
    'Accept'         => 'text/xml',
    'Content-length' => strlen($requestBody),
);

I also tested using the mb_detect_encoding() method:

mb_detect_encoding($requestBody, 'UTF-8', true); // returns UTF-8

Any further ideas on how to debug this issue?
Content-Length must contain the number of bytes, not the number of characters. That could be the reason if you use mbstring.func_overload. Try omitting the manual setting of this header; Guzzle will then set it automatically and correctly for you.
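A minimal sketch of the byte-versus-character difference (the string is just an example):

<?php
// Content-Length must be the byte count of the body, not the character count.
$requestBody = "Grüße";                  // 5 characters, 7 bytes in UTF-8

echo strlen($requestBody);               // 7 (bytes) unless mbstring.func_overload rewires it
echo mb_strlen($requestBody, 'UTF-8');   // 5 (characters)
echo mb_strlen($requestBody, '8bit');    // 7, always the byte count

// Safest: don't set Content-Length yourself; Guzzle computes it from the body.
?>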
Prevent modifying HTTP response headers in Apache
I ran a security scan on my website and the scan report shows a security threat for the URL below, saying "HTTP header injection vulnerability in REST-style parameter to /catalog/product/view/id". The following URL adds the custom header XSaint: test/planting-a-stake/category/99 to the HTTP response (see the last line in the response headers). I tried different solutions but no luck! Can anyone suggest how to prevent the HTTP response headers from being modified?

URL:
/catalog/product/view/id/1256/x%0D%0AXSaint:%20test/planting-a-stake/category/99

Response headers:
Cache-Control: max-age=2592000
Content-Encoding: gzip
Content-Length: 253
Content-Type: text/html; charset=iso-8859-1
Date: Fri, 26 May 2017 11:27:12 GMT
Expires: Sun, 25 Jun 2017 11:27:12 GMT
Location: https://www.xxxxxx.com/catalog/product/view/id/1256/x
Server: Apache
Vary: Accept-Encoding
XSaint: test/planting-a-stake/category/99
HTTP header injection is a vulnerability where someone injects data into your application that can then be used to insert arbitrary headers (see https://www.owasp.org/index.php/HTTP_Response_Splitting). In this specific case, the scanner assumes the vulnerability might come from the URI put into the Location header:

Location: https://www.xxxxxx.com/catalog/product/view/id/1256/x

The need here is to ensure that the data put into this URI cannot embed the line-return characters; to quote the OWASP HTTP Response Splitting page:
CR (carriage return, also given by %0d or \r)
LF (line feed, also given by %0a or \n)
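A minimal, framework-agnostic sketch of that idea (the helper name and domain are placeholders, not Magento code): strip CR/LF, raw or percent-encoded, from anything that ends up in a response header.

<?php
// Hypothetical helper: sanitise a path before it is echoed into a Location header.
function safe_redirect($path)
{
    $clean = str_replace(array("\r", "\n", "%0D", "%0A", "%0d", "%0a"), '', $path);
    header('Location: https://www.example.com' . $clean);
    exit;
}

// The scanner's payload loses its %0D%0A and can no longer start a new header:
// safe_redirect('/catalog/product/view/id/1256/x%0D%0AXSaint:%20test');
?>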
WordPress: strange text over the content of the JSON API plugin
I use WordPress 4.1.1 and tried to install the JSON API plugin. Strange letters are displayed above the JSON content, and they change on each refresh of the page. I tried printing another letter from the plugin's code; it appeared below those figures, so is the problem in the WordPress system itself? Please help me understand and remove them, because I can't parse my JSON. On localhost it works fine with the same properties and data. The letters are: 7b00c, 78709, 6eb3d... and they change with updates.
The strange characters are probably chunk-sizes.

Content-Length
When a server-side process sends a response through an HTTP server, the data will typically be stored in a buffer before it is transmitted to the client (browser). If the entire response fits in the buffer in a timely manner, the server will declare the size in a Content-Length: header and send the response as-is to the client.

Chunked Transfer Coding
If the response does not fit in the buffer, or the server decides to vacate the buffer for other reasons before the full size is known, it will instead send the response in chunks. This is indicated by the Transfer-Encoding: chunked header. Each chunk is preceded by its length in hexadecimal (followed by a CRLF pair). The end of the response is indicated by a 0 chunk-size. The exact syntax is detailed below.

Solution
If you are parsing the HTTP response yourself, there are all sorts of intricacies that you need to consider, and chunked encoding is one of them. You need to check for the Transfer-Encoding: chunked header and assemble the response by parsing and stripping out the chunk-size parts. It's much easier to use a library such as cURL, which handles all the details for you. One hack to avoid chunks is to send the response using HTTP/1.0 rather than HTTP/1.1; in HTTP/1.0, the length is indicated either by the Content-Length: header or by closing the connection.

Syntax
This is the syntax for chunked bodies specified in RFC 7230 - "Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing" (ABNF notation):

4.1. Chunked Transfer Coding

chunked-body   = *chunk
                 last-chunk
                 trailer-part
                 CRLF
chunk          = chunk-size [ chunk-ext ] CRLF
                 chunk-data CRLF
chunk-size     = 1*HEXDIG
last-chunk     = 1*("0") [ chunk-ext ] CRLF
chunk-data     = 1*OCTET ; a sequence of chunk-size octets
trailer-part   = *( header-field CRLF )
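For illustration, a minimal sketch of stripping the chunk-sizes by hand (it assumes the headers have already been removed; in practice, let cURL do this):

<?php
// Decode a chunked HTTP body: each chunk is "<hex size>\r\n<data>\r\n",
// terminated by a chunk of size 0.
function decode_chunked($body)
{
    $decoded = '';
    $offset  = 0;
    while (($lineEnd = strpos($body, "\r\n", $offset)) !== false) {
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $size     = hexdec(trim(explode(';', $sizeLine)[0])); // ignore chunk extensions
        if ($size === 0) {
            break;                                            // last-chunk reached
        }
        $decoded .= substr($body, $lineEnd + 2, $size);
        $offset   = $lineEnd + 2 + $size + 2;                 // skip chunk-data and trailing CRLF
    }
    return $decoded;
}

echo decode_chunked("5\r\nhello\r\n0\r\n\r\n"); // prints "hello"
?>

The hexadecimal strings in the question (7b00c, 78709, 6eb3d) are exactly such chunk-size lines.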
Special characters (ë) in JSON response
My database stores some text which I have to fetch with AJAX. This goes well except when it contains special characters such as ë or ä. I found some articles about this topic which told me to change the charset of the AJAX request, but none of them worked for me. When I start Firebug it says this about the headers:

Antwoordheaders (Dutch for response headers)
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Content-Length: 94
Content-Type: text/html; charset=ISO-8859-15
Date: Wed, 26 Sep 2012 09:52:56 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Server: Apache
X-Powered-By: PleskLin

Verzoekheaders (Dutch for request headers)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: nl,en-us;q=0.7,en;q=0.3
Authorization: Basic c3BvdGlkczp6SkBVajRrcw==
Connection: keep-alive
Content-Type: text/html; charset=ISO-8859-15
Cookie: __utma=196329838.697518114.1346065716.1346065716.1346065716.1; __utmz=196329838.1346065716.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=2h4vu8gu9v8fe5l1t3ad5agp86
DNT: 1
Host: www.spotids.com
Referer: http://www.spotids.com/private/?p=16
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1

Both headers mention charset=ISO-8859-15, which should include characters like ë, but it doesn't work for me. The code I used for this (PHP):

$newresult = mysql_query($query2);
$result = array();
while ($row = mysql_fetch_array($newresult)) {
    array_push($result, $row);
}
$jsonText = json_encode($result);
echo $jsonText;
Make sure you set the headers to UTF-8:

header('Content-Type: application/json; charset=utf-8');

Make sure your connection to the database is made with UTF-8 encoding before any queries:

$query = mysql_query("SET NAMES 'UTF8'");

As far as I know, JSON encodes any characters that cannot be represented in pure ASCII, and you should decode that JSON in the response. Try to move to PDO, as the mysql_* functions are deprecated. Use this nice tutorial.
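A minimal PDO sketch of that advice (DSN, credentials and table name are placeholders): a UTF-8 database connection plus a UTF-8 JSON response.

<?php
// charset=utf8mb4 in the DSN replaces the separate "SET NAMES" query.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$rows = $pdo->query('SELECT * FROM texts')->fetchAll(PDO::FETCH_ASSOC);

header('Content-Type: application/json; charset=utf-8');
echo json_encode($rows, JSON_UNESCAPED_UNICODE); // keeps ë and ä readable instead of \u00eb
?>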
From JSON RFC 4627: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." Use mb_convert_encoding or iconv to change the string encoding, and send the correct header:

header('Content-Type: application/json;charset=utf-8');
echo json_encode($data);
Verify the Content-Type meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
base64 encoded string gets truncated through fgets call while parsing IMAP
I am parsing emails with Zend_Mail, and strangely some content gets truncated without an obvious reason, which malforms the email parts. For example,

Content-Disposition: attachment; filename="file.sdv"

DQogICAgICBTT05FO0xBTkRJTkdTREE7U0FMR1NEQVRPIDtOQVNKIDtSRURTS0FQICAgICAgICAg
ICAgIDsgRklTS0VTTEFHO1BSRVNFUlYgICA7ICBUSUxTVEFORDsgU1TYUlJFTFNFOyAgS1ZBTElU
RVQ7T01TVFlQRSAgO01JTlNURVBSSVM7ICAgICBWRVJESTsgICBLVkFOVFVNOyAgUlVORFZFS1Qg
IA0KLS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS07LS0tLS0tLS0tLS0tLS0t
LS0tLS07LS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS0tLS0tLTstLS0tLS0t
LS0tOy0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS0tLS0tLTstLS0tLS0tLS0t
ICANCiAgICAgICAgIDA7MjAxMC4wOS4wODsyMDEwLjA5LjA4O05vcnNrO0dhcm4gICAgICAgICAg
ICAgICAgOyAgICAgIDEwMjE7RkVSU0sgICAgIDsgICAgICAgMjEwOyAgIDQwMjA5OTk7ICAgICAg
ICAyMDtFZ2Vub3ZlcnQ7ICAgICAgICAgIDsgICAzMDcyLDE2OyAgICAgICAyMTE7ICAgICAyNTMs
MiAgDQogICAgICAgICAwOzIwMTAuMDkuMDg7MjAxMC4wOS4wODtOb3JzaztHYXJuICAgICAgICAg

gets truncated to

Content-Disposition: attachment; filename="file.sdv"

DQogICAgICBTT05FO0xBTkRJTkdTREE7U0FMR1NEQVRPIDtOQVNKIDtSRURTS0FQICAgICAgICAg
ICAgIDsgRklTS0VTTEFHO1BSRVNFUlYgICA7ICBUSUxTVEFORDsgU1TYUlJFTFNFOyAgS1ZBTElU
RVQ7T01TVFlQRSAgO01JTlNURVBSSVM7ICAgICBWRVJESTsgICBLVkFOVFVNOyAgUlVORFZFS1Qg
IA0KLS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS07LS0tLS0tLS0tLS0tLS0t
LS0tLS07LS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS0tLS0tLTstLS0tLS0t
LS

A var_dump on each line shows this:

string(78) "DQogICAgICBTT05FO0xBTkRJTkdTREE7U0FMR1NEQVRPIDtOQVNKIDtSRURTS0FQICAgICAgICAg "
string(78) "ICAgIDsgRklTS0VTTEFHO1BSRVNFUlYgICA7ICBUSUxTVEFORDsgU1TYUlJFTFNFOyAgS1ZBTElU "
string(78) "RVQ7T01TVFlQRSAgO01JTlNURVBSSVM7ICAgICBWRVJESTsgICBLVkFOVFVNOyAgUlVORFZFS1Qg "
string(78) "IA0KLS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS07LS0tLS0tLS0tLS0tLS0t "
string(78) "LS0tLS07LS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS0tLS0tLTstLS0tLS0t "
string(5) "LS) "
string(17) "TAG5 OK Success "

or, in another email, at

DQogICAgICBTT05FO0xBTkRJTkdTREE7U0FMR1NEQVRPIDtOQVNKIDtSRURTS0FQICAgICAgICAg
ICAgIDsgRklTS0VTTEFHO1BSRVNFUlYgICA7ICBUSUxTVEFORDsgU1TYUlJFTFNFOyAgS1ZBTElU
RVQ7T01TVFlQRSAgO01JTlNURVBSSVM7ICAgICBWRVJESTsgICBLVkFOVFVNOyAgUlVORFZFS1Qg
IA0KLS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS07LS0tLS0tLS0tLS0tLS0t
LS0tLS07LS0tLS0tLS0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS0tLS0tLTstLS0tLS0t
LS0tOy0tLS0tLS0tLTstLS0tLS0tLS0tO

I cannot figure out why it is stopping there. The transmission should have stopped only at the end of the line. This is the line that gets the string from the IMAP server:

$line = #fgets($this->_socket);

The encoded text contains a string like the one below, but again it is truncated at various points in different emails:

----------;----------;----------;-----;--------------------;----------;----------;--

I've tried to add a size to fgets(), but to no result. I also enabled/disabled the "auto_detect_line_endings" php.ini setting, again with no result. I've also opened a bug report with ZF, although the error does not seem to be in the library. Do you see anything strange with this encoded string?

UPDATE: New research shows that the emails get truncated after 584 chars. Still don't know why. I sent a question to Google as well. See here.
A bad email's headers:

Delivered-To: email#removed.com
Received: by 10.216.3.208 with SMTP id 58cs248812weh; Fri, 20 Nov 2009 05:14:14 -0800 (PST)
Received: by 10.204.153.217 with SMTP id l25mr1285471bkw.108.1258722853863; Fri, 20 Nov 2009 05:14:13 -0800 (PST)
Return-Path: <>
Received: from MTX4.mbn1.net (mtx4.mbn1.net [213.188.129.252]) by mx.google.com with SMTP id 2si1800716bwz.60.2009.11.20.05.14.12; Fri, 20 Nov 2009 05:14:13 -0800 (PST)
Received-SPF: pass (google.com: best guess record for domain of MTX4.mbn1.net designates 213.188.129.252 as permitted sender) client-ip=213.188.129.252;
Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of MTX4.mbn1.net designates 213.188.129.252 as permitted sender) smtp.mail=
Resent-From: <email#removed.com>
Content-Type: multipart/mixed; boundary="===============1703099044=="
MIME-Version: 1.0
From: <email#removed.com>
To: <email#removed.com>
CC:
Subject: some subject
Message-ID: <FLYNDRElQ080Gxw8Zw500000f46email#removed.com>
X-OriginalArrivalTime: 20 Nov 2009 13:14:08.0121 (UTC) FILETIME=[5792C690:01CA69E3]
Date: Fri, 20 Nov 2009 14:14:08 +0100
X-STA-Metric: 0 (engine=030)
X-STA-NotSpam: tlf: vedlagt skip:__ 40 fil cc:2**0
X-STA-Spam: header:MIME-Version: charset:us-ascii header:Subject:1 to:2**0 header:From:1
X-BTI-AntiSpam: score:0,sta:0/030,dnsbl:passed,sw:off,bsn:38/passed,spf:off,bsctr:passed/1,dk:off,pbmf:none,ipr:0/3,trusted:no,ts:no,bs:no,ubl:passed
X-Auto-Response-Suppress: DR, RN, NRN, OOF, AutoReply
Resent-Message-Id: <19740416124736.CF5804B33EF632B0email#removed.com>
Resent-Date: Fri, 20 Nov 2009 14:14:11 +0100 (CET)

--===============1703099044==
Content-Type: application/octet-stream
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="file.sdv"

DQpHUlVQUEVOQVZOICAgICAgICAgIDtLSthQRTtQUk9EQU5MO1BBS0tFTlI7TU9UVEFLTkFWTiAg
ICAgICAgICAgICAgICAgICAgO1NPTjtMQU5ESU5HU0RBO1NBTEdTREFUTyA7TkFTSiA7UkVEU0tB
UCAgIDtGSVNLRVNMQUcgO1BSRVNFUlYgICA7VElMU1RBTkQ7U1TYUlJFTFM7S1ZBTElURVQ7TUlO
U1RFUFJJUzsgICAgICAgIFZFUkRJOyAgICAgS1ZBTlRVTTsgICAgUlVORFZFS1QgICAgDQotLS0t
LS0tLS0tLS0tLS0tLS0tLTstLS0tLTstLS0tLS0tOy0tLS0tLS07LS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tOy0tLTstLS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS07LS0tLS0tLS0tLTst
LS0tLS0tLS0tOy0tLS0tLS0tLS07LS0tLS0tLS07LS0tLS0tLS07LS0tLS0tLS07LS0tLS0tLS0t
LTstLS0tLS0tLS0tLS0tOy0tLS0tLS0tLS0tLTstLS0tLS0tLS0tLS0gICAgDQpMb3JlbnR6ZW4g
....

For those interested in an answer and not in the (ex) bounty, more clues: Gmail is returning a short value in response to RFC822.SIZE, which can lead to truncated messages. (They are off by one byte for each header line, apparently not counting two characters for CR/LF.)
I think you're looking in the wrong place. The IMAP server gives you the mail message truncated, and then returns its status line TAG5 OK Success. I don't see how your (or PHP's) handling of the socket would make a few KB worth of stream disappear, only to magically fix the stream right before this status line. So either the message is truncated by itself (have you verified the message contents some other way?) or the IMAP server is just broken.
The first things I would do are: find a sufficiently quiet environment for your project, where you can strace -f -s 10240 -p <pid> Apache's process to verify the socket interaction (assuming a Linux/Apache environment), and/or use tcpdump, ethereal or an equivalent to check what's coming in on the line.
My guess is that you will see the exact same truncated strings coming in on the wire, meaning you can shift your focus to the IMAP server. Reassuring yourself that you're looking in the right place can save a lot of time.
1: Try removing the # for more verbosity.
2: Try using fread (http://www.php.net/manual/en/function.fread.php) instead of fgets.
This might have something to do with the IMAP server, because I see TAG5 OK Success as a response, even though it's not supposed to be there.
Have you tried issuing another fgets to see if you get the rest of the data? You may be retrieving a multi-part email, which would require multiple reads. But regardless, you are using functions designed for file access on a network. Usually this works fine, but depending on the network, issues can arise. For example, you can use file_get_contents to retrieve a web page, but if the server issues a redirect, it fails, whereas curl handles this much more successfully. If you truly want to read the network socket, you should try socket_read. That is designed with the network in mind, like curl.
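A minimal sketch of that idea (the socket setup is omitted and $socket is assumed to be a connected socket resource): keep reading until the expected number of bytes has arrived, rather than trusting a single read.

<?php
// Read exactly $expected bytes from a socket; a single read on a network
// stream may return less than requested, so loop until done.
function read_exact($socket, $expected)
{
    $data = '';
    while (strlen($data) < $expected) {
        $chunk = socket_read($socket, $expected - strlen($data));
        if ($chunk === false || $chunk === '') {
            break; // connection closed or error: return what we have so far
        }
        $data .= $chunk;
    }
    return $data;
}
?>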
I don't know Zend and have forgotten most of my PHP, but I have played with MIME and HTTP before (in C++). I suggest you start by looking for a way to add a Content-Length header entry; it gives the "message decoder/loader" a hint to expect a certain amount of content (message payload). (Not sure if IMAP does that.) In the code above I would try to convince fgets to read a specific amount of expected data from the network. It could be that the data is buffered or not yet sent over the network (async communication), and fgets only reads an internal buffer, thus stopping before the whole message has been read. To see if this is the case, send a small message that falls under your "584 breaking point". Do some network tracing to see if all the data actually flows (you would probably need to do some local setup). Is the code you are referring to here?
Most likely some of your server hardware is compromised, so you may want to replace it completely, or just swap the RAM modules or disk drives. I have some experience with web- and mail-based encoding, and I can assure you that a base64-encoded string is very secure. At least it uses a texture mapping algorithm.