decoding $str= imap_fetchbody: $str ==="" but print_r can print it - php

I have forwarded an HTML message with a PDF attachment from Thunderbird.
I receive multipart/mixed containing a multipart/alternative part (with text/html and text/plain) and the base64-encoded PDF. The multipart/alternative is 8bit, charset=UTF-8.
I have tried nearly all the proposals from the comments on the imap_fetchstructure/imap_fetchbody manual pages on php.net. They include decoding (at least for encoding = 1, 3, 4), applying imap_8bit, imap_qprint and imap_base64. Looking manually at the text/plain part shows encoding = 1, so imap_8bit is applied.
The example functions can't even decide whether the returned text is plain or HTML, because in all cases the returned $str is === "" (empty string).
Next, I accidentally tried print_r($str) (without imap_8bit being applied), and that shows the required email text.
I thought this might be multibyte without imap_8bit, and mb_detect_encoding returns UTF-8 (just as I can see in the raw email text).
Trying mb_convert_encoding($str, "ASCII") again returns an empty string.
quoted_printable_decode doesn't help either, neither before nor after imap_8bit.
The NetBeans PHP debugger (Xdebug) reports all these strings as empty, but says the variables are of type 'string'.
Does anybody have an idea how to get to the email text? print_r shows that it is there, but I have been banging my head against a wall for days now without any result.
I could manually search for the boundaries and decode the parts myself, which wouldn't be too difficult, but ... why reinvent the wheel?
Code: primarily, I used two versions from the php.net imap_fetchstructure page and other web resources. I can add them to this post but don't want to blow it up too much at this moment.
*getTxtBody which calls get_part
*getmesg which calls getpart
If I look at the plain text, I clearly see the (nested) boundaries for plain, html and pdf.
Any help is very much appreciated. Klaus
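One thing worth noting about the attempts above: imap_8bit() converts an 8-bit string into quoted-printable, so it is an encoder rather than a decoder. A part reported with encoding = 1 (8bit) normally needs no transfer decoding at all. The following is only a minimal sketch of decoding by the numeric encoding field; decode_part is a hypothetical helper name, and it does not explain why the string shows up as empty in the debugger.

```php
<?php
// Sketch: undo the Content-Transfer-Encoding of a raw body part.
// $encoding is the numeric "encoding" field reported by imap_fetchstructure().
function decode_part(string $raw, int $encoding): string
{
    switch ($encoding) {
        case 3: // ENCBASE64
            $decoded = base64_decode($raw);
            return $decoded === false ? '' : $decoded;
        case 4: // ENCQUOTEDPRINTABLE
            return quoted_printable_decode($raw);
        default: // 0 = 7bit, 1 = 8bit, 2 = binary: already plain bytes
            return $raw;
    }
}
```

It would be called with the result of imap_fetchbody() for the text part (for example section '1.1', passed as a string) and the encoding value of that same part.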

You can try using the Fetch library.
To decode headers you can use iconv_mime_decode.
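As a small illustration of the header decoding mentioned above, assuming the iconv extension is available. The header line is a made-up example, not taken from the message in the question.

```php
<?php
// A MIME "encoded-word" header as it might appear in a raw message (example data).
$raw = 'Subject: =?UTF-8?B?UHLDvGZ1bmcgUHLDvGZ1bmc=?=';

// Decode the encoded-word and return the result as UTF-8.
$decoded = iconv_mime_decode($raw, 0, 'UTF-8');

echo $decoded, PHP_EOL;
```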

Related

php filter a string so that json_encode does not error out

I'm grabbing a bunch of data from a database and putting it into a PHP array. I'm then looking to json_encode that array using $output = json_encode($out).
My issue is that from time to time, something in the array is not able to be read by json_encode and the whole thing fails. If I use print_r($out) to have a look, I can clearly see where it's failing, because the character that is screwing things up always appears as a question mark inside of a black diamond �.
First - what are these characters?
Second - Is there a function I can pass the elements through prior to adding them to the array that would strip these out, or replace 'them' with blanks?
I found the answer to this. Since the data coming FROM the database was stored with the "black diamond" character, I needed to get this out POST grabbing it from the database.
$x[4] = utf8_encode(odbc_result($query, 'B'));
By passing the result through utf8_encode, the string is converted from ISO-8859-1 to UTF-8, so the byte that json_encode choked on becomes a valid character.
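utf8_encode() is deprecated as of PHP 8.2, so on newer versions the same idea could look roughly like the sketch below. It assumes the mbstring extension is loaded and that any value that is not valid UTF-8 is ISO-8859-1; to_utf8 and the sample row are hypothetical.

```php
<?php
// Sketch: make every string valid UTF-8 before json_encode().
// Assumption: non-UTF-8 values are ISO-8859-1; adjust $from to the real source charset.
function to_utf8(string $value, string $from = 'ISO-8859-1'): string
{
    return mb_check_encoding($value, 'UTF-8')
        ? $value
        : mb_convert_encoding($value, 'UTF-8', $from);
}

$row = ['name' => "Caf\xE9", 'city' => 'Berlin']; // \xE9 is "é" in ISO-8859-1
$clean = array_map('to_utf8', $row);

echo json_encode($clean), PHP_EOL;
```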
Say echo json_encode($out);
This will solve your issue
Black diamonds are a browser issue: the browser draws � for bytes that are invalid in the charset it was told to use. The database itself would use plain question marks.
It seems you are already getting wrong data from the database, although it's quite tricky to end up with incorrect UTF-8 with your settings. You need to check everything:
whether your table is marked with the utf8 charset
whether your data is really encoded in UTF-8 (not just marked as such)
whether your server is sending the correct charset in the Content-Type header.
It is also useful to view the page with different charsets selected from your browser menu.
But first of all, you have to wipe every trace of the random things you tried, all these various encodes, decodes and so on. Use just the plain and direct output from the database. Otherwise you will never get to the root of the problem.

Extract link from IMAP message

I need to extract a URL from an IMAP message. So far I have been able to extract the message as plain text but not the link, and I could really use some help here. Here's what I have so far:
$section = empty( $attachments ) ? 1 : 1.2;
$text = imap_fetchbody($connection, $msgno, $section );
echo $text."<hr/>";
I tried changing the section number from 1 : 1.1 to 1 : 1.2, but it didn't help.
I need to extract the mail as HTML so that it contains the link. What do I need to change to get the link?
Some bugs in the code above:
Using 1.2 as a floating-point number instead of a string. The section is a string, and it's sheer luck that your 1.2 is not converted to something like 1.2000000001.
Hardcoding the part number even though mail can arrive with multiple body parts. Read the description in RFC 3501, p. 56 for details on what these body parts are, how they work and how to find out which ones are relevant for you.
Not decoding the Content-Transfer-Encoding or dealing with character set conversions. You are apparently interested in the text parts of a message; these can arrive in encodings such as quoted-printable or base64, which you have to decode. If you'd like to play it safe (you should, it's 2013 already and there are funny characters in URLs, not to mention IDNs), also convert them from their charset into Unicode and only then match the contents.
You should probably check the MIME types of at least all top-level body parts so that you do not try to detect "links" within attached images or the undecoded binary representation of zip files, etc.
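Once the right part has been fetched (with the section given as a string such as '1.2') and its transfer encoding and charset have been dealt with, the link extraction itself could be sketched as follows. extract_urls is a hypothetical helper, and the pattern is a deliberately simple approximation that only handles http and https URLs.

```php
<?php
// Sketch: pull http(s) URLs out of an already decoded text or HTML part.
function extract_urls(string $decodedBody): array
{
    // Stop at whitespace, quotes and angle brackets so HTML markup is not swallowed.
    preg_match_all('~https?://[^\s"\'<>]+~i', $decodedBody, $matches);
    return array_values(array_unique($matches[0]));
}

$html = '<p>Confirm: <a href="http://example.com/confirm?id=42">click</a> or visit https://example.org/help</p>';
print_r(extract_urls($html));
```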

PHP chinese character IMAP

I retrieve data from an email through IMAP, and I want to detect programmatically (via PHP) whether the body contains Chinese, Japanese, or Korean characters. I know how to convert the encoding, but not how to detect it.
$mbox = imap_open ("{localhost:995/pop3/ssl/novalidate-cert}", "info#***.com", "********");
$email=$_REQUEST['email'];
$num_mensaje = imap_search($mbox,"FROM $email");
// grab the body for the same message
$body = imap_fetchbody($mbox,$num_mensaje[0],"1");
//chinese for example
$str = mb_convert_encoding($body,"UTF-8","EUC-CN");
imap_close($mbox);
Any ideas?
Do you mean that you don't know which CJK encoding the incoming message is in?
The canonical place to find that information is the charset= parameter in the Content-Type: header.
Unfortunately extracting that is not as straightforward as you would hope. Really you'd think that the object returned by imap_header would contain the type information, but it doesn't. Instead, you have to use imap_fetchheader to grab the raw headers from the message, and parse them yourself.
Parsing RFC822 headers isn't completely straightforward. For simple cases you might be able to get away with matching each line against ^content-type:.*; *charset=([^;]+) (case-insensitively). But to do it really properly, you'd have to run the whole message, headers and body, through a proper RFC822-family parser like MailParse.
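The simple-case regex approach could be sketched like this. charset_from_headers is a hypothetical helper; it expects the raw header block (for example from imap_fetchheader), unfolds continuation lines first, and still ignores many corner cases that a real parser would handle.

```php
<?php
// Sketch: find the charset parameter of the Content-Type header in a raw header block.
function charset_from_headers(string $rawHeaders): ?string
{
    // Unfold continuation lines (a line break followed by space or tab).
    $unfolded = preg_replace('/\r?\n[ \t]+/', ' ', $rawHeaders);

    // Case-insensitive, line-anchored match; the value may or may not be quoted.
    if (preg_match('/^content-type:.*?;\s*charset="?([^";\s]+)"?/im', $unfolded, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset given: the caller has to guess
}
```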
And then you've still got the problem of messages that neglect to include charset information. For that case you would need to use mb_detect_encoding.
Or are you just worried about which language the correctly-decoded characters represent?
In this case the header you want to read, using the same method as above, is Content-Language. However it is very often not present in which case you have to fall back to guessing again. CJK Unification means that all languages may use many of the same characters, but there are a few heuristics you can use to guess:
The encoding that the message was in, from the above, e.g. if it was EUC-CN, chances are your language is going to be simplified Chinese.
The presence of any kana (U+3040–U+30FF -> Japanese) or Hangul (U+AC00–U+D7FF -> Korean) in the text.
The presence of simplified vs traditional Chinese characters. Although some characters can represent either, others (where there is a significant change to the strokes between the two variants) only fit one. The simple way to detect their presence is to attempt to encode the string to GBK and Big5 encodings and see if it fails. ie if you can't encode to GBK but you can to Big5, it'll be traditional Chinese.
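The kana and Hangul heuristic above can be sketched with Unicode script properties in PCRE. guess_cjk_language is a hypothetical helper, it expects text that has already been converted to UTF-8, and it does not attempt the simplified versus traditional distinction.

```php
<?php
// Sketch of the kana/Hangul heuristic; expects $text to be valid UTF-8.
function guess_cjk_language(string $text): ?string
{
    if (preg_match('/[\p{Hiragana}\p{Katakana}]/u', $text)) {
        return 'ja'; // any kana strongly suggests Japanese
    }
    if (preg_match('/\p{Hangul}/u', $text)) {
        return 'ko'; // Hangul suggests Korean
    }
    if (preg_match('/\p{Han}/u', $text)) {
        return 'zh'; // Han only: most likely Chinese, but could be kanji-only Japanese
    }
    return null; // no CJK characters found
}
```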

Invalid PHP JSON encoding

I'm working on a project in PHP (5.3.1) where I need to send a JSON string to a web service (written in Python), but the result I get from json_encode does not pass as valid JSON (I'm using JSLint to check validity).
I should add that the structure I'm trying to encode is fairly big (13K encoded) and consists partially of UTF-8 data. While json_encode does handle it, I get spaces in weird places in the result. For example, I could get {"hello":tru e} or {"hell o":true}, which results in an error from the web service since the JSON is invalid (or the data is wrong, as in the second example).
I've also tried to use Zend Framework for JSON encoding, but that didn't make much difference.
Is there a known issue with JSON in PHP? Has anyone encountered this behavior and found a solution?
You state that "the structure I'm trying to encode ... consists partially of UTF8 data." This implies that it is also partially of non-UTF8 data. The json_encode doc has a comment at the bottom, that
json_encode() expects strings to be encoded to be in UTF8 format, while by default PHP strings are ISO-8859-1 encoded.
This means that
json_encode(array('àü'));
will produce a json representation of an empty string, while
json_encode(array(utf8_encode('àü')));
will work.
Are the failing segments of the JSON due to non-UTF8 input?
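To check whether non-UTF-8 input is the cause, the failure can be made visible. On current PHP versions json_encode() returns false for invalid UTF-8 rather than an empty string, and json_last_error_msg() reports the reason. A minimal sketch, assuming the stray bytes are ISO-8859-1 and the mbstring extension is loaded:

```php
<?php
$latin1 = "\xE0\xFC"; // "àü" as ISO-8859-1 bytes, which is not valid UTF-8

$result = json_encode([$latin1]);
if ($result === false) {
    echo 'json_encode failed: ', json_last_error_msg(), PHP_EOL;
}

// Converting to UTF-8 first makes the same data encodable.
$fixed = json_encode([mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1')]);
echo $fixed, PHP_EOL;
```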
Object keys must be quoted strings. Unquoted values can only be booleans, numbers, null, objects or arrays; strings must always be quoted.
Also, I would recommend you to add correct header before your json output.
if(!headers_sent())
header('Content-Type: application/json; charset=utf-8', true,200);
Can you also post your array or object that you passing to json_encode?
I was handling some automatically generated emails the other day and noticed the same weird behavior (spaces were inserted to the email body), so I started to check the email post and found the culprit:
From the SMTP RFC 2821:
The maximum total length of a text line including the <CRLF> is 1000 characters (not counting the leading dot duplicated for transparency).
My email body was indeed in one line, so breaking it with \n's fixed the spaces issue.
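Inserting \n into JSON by hand can be awkward when a single string value is itself very long. One alternative, sketched here under the assumption that the mail is built manually, is to send the body base64-encoded in short lines. The payload is example data and the mail() call is left as a comment because the recipient and subject are hypothetical.

```php
<?php
$payload = json_encode(['data' => str_repeat('x', 3000)]); // one very long line

// base64 in 76-character lines keeps every line far below the 1000-character limit.
$body = chunk_split(base64_encode($payload), 76, "\r\n");

$headers = "Content-Type: application/json; charset=utf-8\r\n"
         . "Content-Transfer-Encoding: base64\r\n";

// mail($to, $subject, $body, $headers); // $to and $subject are placeholders
```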
After scratching my head for nearly a day, I've come to the conclusion that the problem was not in the json_encode function. It was with my post function.
Basically, the json_encode was preparing the data to be sent to another service. Before today, I've used stream_context_create and fopen to post data to the external service, but now I use fsockopen and fputs and it seems to be working.
Although I'm unsure as to the nature of the problem, I'm happy it works now :)
BTW: After this process, I mail myself the input and output (both in JSON) and this is how I saw there was a problem in the first place. This problem still persists but I guess that's related to the encoding of the mail or something of that sort.

Get non-UTF-8-form fields as UTF-8 in PHP?

I have a form served in non-UTF-8 (it’s actually in Windows-1251). People, of course, post there any characters they like to. The browser helpfully converts the unpresentable-in-Windows-1251 characters to HTML entities so I can still recognise them. For example, if a user types an →, I receive an &#8594;. That’s partially great, like, if I just echo it back, the browser will correctly display the → no matter what.
The problem is, I actually do a htmlspecialchars () on the text before displaying it (it’s a PHP function to convert special characters to HTML entities, e.g. & becomes &amp;). My users sometimes type things like &mdash; or &copy;, and I want to display them as the actual text &mdash; or &copy;, not as — and ©.
There’s no way for me to distinguish a typed &#8594; from →, because I get them both as &#8594;. And, since I htmlspecialchars () the text, and I also get a &#8594; for a → from the browser, I echo back an &amp;#8594; which gets displayed as &#8594; in a browser. So the user’s input gets corrupted.
Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?
Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.
Thanks!
The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities
Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.
I actually do a htmlspecialchars () on the text before displaying it
Yes. You must do that, or else you've got a security problem.
Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself
Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.
I know that the good idea is to switch the whole software to UTF-8,
Yup. Well, at least the encoding of the page containing the form should be UTF-8.
<form action="action.php" method="get" accept-charset="UTF-8">
<!-- some elements -->
</form>
All browsers should return the values in the encoding specified in accept-charset.
You check to see if the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, do whatever you want to with it. I would do this by looking at each character &, #, 8, 5, 9, 4, and parsing it into something you can apply something to.
Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like &copy; you will want to handle it differently than &#8594; because the second one has the # in it.
I think this answers your question.
The html_entity_decode function is probably what you want.
You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false to avoid the character references being encoded again.
Or you first decode those existing character references.
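A minimal sketch of the double_encode approach, with a made-up input string. Existing character references pass through untouched while bare special characters are still escaped. Note that a user who literally typed &#8594; as text will also see an arrow, so this hides the ambiguity rather than resolving it.

```php
<?php
// What the script might receive after the browser replaced an arrow with a reference.
$posted = 'a &#8594; b & c <b>';

// Fourth argument false: do not re-encode entities that are already there.
$safe = htmlspecialchars($posted, ENT_QUOTES, 'Windows-1251', false);

echo $safe, PHP_EOL;
```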
You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. Especially the mb_convert_encoding() to move it from windows-1251 to UTF-8, or where ever.
I don't quite understand your question though, because if someone enters & as a text string, when you do the htmlspecialchars() that should convert it to &amp; ... which when run back through html_entity_decode() would come out as the text string the user entered.
This is of course if you haven't used the double_encode option when running your string through the htmlspecialchars()
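The conversion route could be sketched as follows, assuming the mbstring extension is loaded. form_value_to_utf8 is a hypothetical helper, and it shares the limitation discussed in this thread: a reference the user typed by hand is decoded just like one the browser generated.

```php
<?php
// Sketch: turn a Windows-1251 form value with character references into plain UTF-8.
function form_value_to_utf8(string $raw): string
{
    // Step 1: re-encode the Windows-1251 bytes as UTF-8.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');

    // Step 2: resolve references such as &#8594; or &copy; into real characters.
    return html_entity_decode($utf8, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Escape again on output, this time declaring UTF-8:
// echo htmlspecialchars(form_value_to_utf8($value), ENT_QUOTES, 'UTF-8');
```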
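mbstring supports the "charset" HTML-ENTITIES (deprecated as of PHP 8.2, where html_entity_decode() is the suggested replacement):
// Assumed input, reconstructed from the byte dump below
$in = 'a &#8594; b & c&#8594;';
$out = mb_convert_encoding($in, 'UTF-8', 'HTML-ENTITIES');
for ($i = 0; $i < strlen($out); $i++) {
    printf('%02X ', ord($out[$i]));
}
This prints 61 20 E2 86 92 20 62 20 26 20 63 E2 86 92, where E2 86 92 is the byte sequence for → (RIGHTWARDS ARROW) in UTF-8.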
You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity because they look identical. The real solution is to give up on Windows 1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding and all these problems should just go away.
