Proper way to decode incoming email subject (utf 8)


I'm trying to pipe my incoming mail to a PHP script so I can store it in a database, among other things. I'm using the MIME E-mail message parser class (registration required), although I don't think that's important.
I have a problem with email subjects. It works fine when the subject is in English, but if the subject uses non-Latin characters I get something like
=?UTF-8?B?2KLYstmF2KfbjNi0?=
for a title like
یک دو سه
I decode the subject like this:
$subject = str_replace('=?UTF-8?B?' , '' , $subject);
$subject = str_replace('?=' , '' , $subject);
$subject = base64_decode($subject);
It works fine with short subjects of about 10-15 characters, but with a longer subject I get only half of the original, followed by something like ��� at the end.
If the subject is even longer, around 30 characters, I get nothing. Am I doing this right?

You can use the mb_decode_mimeheader() function to decode your string.
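A likely reason the manual approach breaks on longer subjects: RFC 2047 limits each encoded-word to 75 characters, so a long subject is sent as several =?...?= words separated by whitespace, and each word has to be decoded on its own. The sketch below shows mb_decode_mimeheader() handling such a subject. It assumes the mbstring extension is loaded, and the two-word header is an illustrative split of the sample subject from the question, not real mail data.

```php
<?php
// Minimal sketch, assuming the mbstring extension is available.
// mb_decode_mimeheader() converts the result to the internal encoding.
mb_internal_encoding('UTF-8');

// Illustrative header: the sample subject split into two encoded-words,
// the way a mail client would fold a long subject.
$raw = '=?UTF-8?B?2KLYstmF?= =?UTF-8?B?2KfbjNi0?=';

// Each encoded-word is decoded separately and the whitespace
// between adjacent encoded-words is dropped.
$subject = mb_decode_mimeheader($raw);

echo $subject, PHP_EOL;
```

The design point is that the function understands the =?charset?encoding?data?= framing, so no str_replace() calls on the markers are needed.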

Even though this is almost a year old, I found it while facing a similar problem.
I'm unsure why you're getting odd characters, but perhaps you are trying to display them somewhere your charset is unsupported.
Here's some code I wrote which should handle everything except the charset conversion, which is a large problem that many libraries (PHP's mbstring, for instance) handle much better.
class mail {
/**
* If you change one of these, please check the other for fixes as well
*
* @const Pattern to match RFC 2047 charset encodings in mail headers
*/
const rfc2047header = '/=\?([^ ?]+)\?([BQbq])\?([^ ?]+)\?=/';
const rfc2047header_spaces = '/(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)\s+(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)/';
/**
* http://www.rfc-archive.org/getrfc.php?rfc=2047
*
* =?<charset>?<encoding>?<data>?=
*
* @param string $header
*/
public static function is_encoded_header($header) {
// e.g. =?utf-8?q?Re=3a=20Support=3a=204D09EE9A=20=2d=20Re=3a=20Support=3a=204D078032=20=2d=20Wordpress=20Plugin?=
// e.g. =?utf-8?q?Wordpress=20Plugin?=
return preg_match(self::rfc2047header, $header) !== 0;
}
public static function header_charsets($header) {
$matches = null;
if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_PATTERN_ORDER)) {
return array();
}
return array_map('strtoupper', $matches[1]);
}
public static function decode_header($header) {
$matches = null;
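/* Repair instances where two encodings are together and separated by a space (strip the spaces).
   Repeat until nothing changes, so runs of three or more encoded words are joined too. */
do {
$header = preg_replace(self::rfc2047header_spaces, "$1$2", $header, -1, $joined);
} while ($joined > 0);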
/* Now see if any encodings exist and match them */
if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_SET_ORDER)) {
return $header;
}
foreach ($matches as $header_match) {
list($match, $charset, $encoding, $data) = $header_match;
$encoding = strtoupper($encoding);
switch ($encoding) {
case 'B':
$data = base64_decode($data);
break;
case 'Q':
$data = quoted_printable_decode(str_replace("_", " ", $data));
break;
default:
throw new Exception("preg_match_all is busted: didn't find B or Q in encoding $header");
}
// This part needs to handle every charset
switch (strtoupper($charset)) {
case "UTF-8":
break;
default:
/* Here's where you should handle other character sets! */
throw new Exception("Unknown charset in header - time to write some code.");
}
$header = str_replace($match, $data, $header);
}
return $header;
}
}
When run through a script and displayed in a browser using UTF-8, the result is:
آزمایش
You would run it like so:
$decoded = mail::decode_header("=?UTF-8?B?2KLYstmF2KfbjNi0?=");

Use PHP's native function:
<?php
$decoded = mb_decode_mimeheader($text);
?>
This function can handle UTF-8 as well as ISO-8859-1 strings.
I have tested it.

Use the PHP function:
<?php
imap_utf8($text);
?>

Just to add yet one more way to do this (or if you don't have the mbstring extension installed but do have iconv):
iconv_mime_decode($str, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8')
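As a rough sketch of how that call might be wrapped: iconv_mime_decode() returns false on failure, so the hypothetical helper decodeSubject() below (the name is not from any library) falls back to the raw header in that case. It assumes the iconv extension is available, and the sample header is made up for illustration.

```php
<?php
// Hypothetical helper, assuming the iconv extension is loaded.
function decodeSubject(string $raw): string
{
    // Decode every encoded-word and convert the result to UTF-8,
    // continuing past malformed parts instead of aborting.
    $decoded = iconv_mime_decode($raw, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');

    // Fall back to the undecoded header if decoding failed outright.
    return $decoded === false ? $raw : $decoded;
}

// Illustrative ISO-8859-1 subject using Q-encoding ("_" stands for a space).
$subject = decodeSubject('=?ISO-8859-1?Q?Acc=E8s_SQL?=');
echo $subject, PHP_EOL;
```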

Would the imap-mime-header-decode function help here?
Found myself in a similar situation today.
http://www.php.net/manual/en/function.imap-mime-header-decode.php
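For reference, imap_mime_header_decode() does not return a string: it returns an array of objects, each with a charset and a text property, and unencoded runs are reported with the charset "default". A minimal sketch of joining those parts into one UTF-8 string follows. joinMimeParts() is a hypothetical helper, it assumes mbstring is available for the conversion, and the sample parts are hand-built to mimic the return shape so the snippet runs without the IMAP extension.

```php
<?php
// Hypothetical helper: joins the parts returned by imap_mime_header_decode().
function joinMimeParts(array $parts, string $target = 'UTF-8'): string
{
    $out = '';
    foreach ($parts as $part) {
        // "default" marks text that was not encoded; treat it as ASCII.
        $charset = strtolower($part->charset) === 'default' ? 'ASCII' : $part->charset;
        $out .= mb_convert_encoding($part->text, $target, $charset);
    }
    return $out;
}

// Hand-built parts that mimic the function's return shape (illustrative data).
$parts = [
    (object) ['charset' => 'ISO-8859-1', 'text' => "Acc\xE8s"],
    (object) ['charset' => 'default', 'text' => ' SQL'],
];

echo joinMimeParts($parts), PHP_EOL;
// With the IMAP extension loaded: joinMimeParts(imap_mime_header_decode($rawSubject));
```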

Related

How to fix saving to database character è becomes `e

I am trying to save the subject of incoming emails to a database. The subject is not always encoded using the same encoding, so I made this code to convert it back to utf-8 when it's not.
private function convertSubjectEncoding($subject)
{
$encoding = mb_detect_encoding($subject);
if($encoding != 'UTF-8') {
return iconv_mime_decode($subject, 0, "UTF-8");
}
return $subject;
}
For the first message, the detected encoding is UTF-8 and the subject is Accès SQL. When it is saved to the database, it becomes "Acce`s SQL", which is wrong; it should be "Accès SQL".
For the second message, the detected encoding is ASCII and the original subject is "=?utf-8?Q?Acc=C3=A8s_?=SQL". After converting, and also after saving, it is 'Accès SQL', which is correct.
Why is it that a string that was originally UTF-8 and did not go through any encoding change suddenly becomes a different string when saved?
I am using Laravel 6.
Here is the full relevant code:
const SUBJECT_REPLY_FORWARD_REGEX = "/([\[\(] *)?\b(RE|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$/im";
private function createFetchedMail($message)
{
$toList = $message->getTo();
$fetchedMail = FetchedMail::create([
'OriginalSubject' => $this->convertSubjectEncoding($message->getSubject()),
'Subject' => $this->cropSubject($this->convertSubjectEncoding($message->getSubject())),
]);
}
/**
* Removes subject reply and forwarding indicators (Re:, FWD:, etc.) and trims the result
*/
private function cropSubject($subject)
{
return trim(preg_replace(static::SUBJECT_REPLY_FORWARD_REGEX, '', $subject));
}
private function convertSubjectEncoding($subject)
{
$encoding = mb_detect_encoding($subject);
if($encoding != 'UTF-8') {
return iconv_mime_decode($subject, 0, "UTF-8");
}
return $subject;
}
I have tried to save directly without calling convertSubjectEncoding() and cropSubject(), I get the same erroneous string saved in database.
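One possible cause, offered here as an assumption since the question does not confirm it: the first subject may arrive in decomposed Unicode form (NFD), where è is stored as a plain e followed by the combining grave accent U+0300. Some databases and fonts show that as "e`". If that is the case, normalizing to the composed form (NFC) before saving could help. The sketch below uses the Normalizer class from PHP's intl extension; normalizeSubject() is a hypothetical helper name.

```php
<?php
// Hypothetical helper. Assumption: the subject may be in decomposed form (NFD).
function normalizeSubject(string $subject): string
{
    // Without the intl extension there is nothing to do; return the input.
    if (!class_exists('Normalizer')) {
        return $subject;
    }
    // Compose base letters with their combining marks (NFC).
    $normalized = Normalizer::normalize($subject, Normalizer::FORM_C);
    return $normalized === false ? $subject : $normalized;
}

// "e" followed by U+0300 (combining grave accent), as an illustrative input.
$decomposed = "Acce\u{0300}s SQL";
$composed = normalizeSubject($decomposed);

// The composed string is one byte shorter when intl is available.
echo strlen($decomposed), ' -> ', strlen($composed), PHP_EOL;
```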

codeigniter url encrypt not working

<a href="<?php echo base_url().'daily_report/index/'.$this->encrypt->encode($this->session->userdata('employee_id')) ?>">
I have encrypted the employee ID in the URL above using the CodeIgniter Encrypt library.
I set the encryption key in the CodeIgniter config file:
$config['encryption_key'] = 'gIoueTFDwGzbL2Bje9Bx5B0rlsD0gKDV';
and I loaded the library in the autoload config:
$autoload['libraries'] = array('session','form_validation','encrypt','encryption','database');
When the link (href) is loaded, the URL looks like this:
http://localhost/hrms/daily_report/index/FVjGcz4qQztqAk0jaomJiAFBZ/vKVSBug1iGPQeKQCZ/K7+WUE4E/M9u1EjWh3uKTKeIhExjGKK1dJ2awL0+zQ==
But the value is not decoded, and I'm not getting the employee_id; it comes back empty.
public function index($employee_id) {
$save_employee_id = $employee_id;
// decoding the encrypted employee id
$get_employee_id = $this->encrypt->decode($save_employee_id);
echo $employee_id; // prints: FVjGcz4qQztqAk0jaomJiAFBZ
echo "<br>";
echo $get_employee_id; // prints nothing (null)
echo "<br>";
exit();
// get the employee daily report
$data['get_ind_report'] = $this->daily_report_model->get_ind_report($get_employee_id);
// daily report page
$data['header'] = "Daily Report";
$data['sub_header'] = "All";
$data['main_content'] = "daily_report/list";
$this->load->view('employeelayout/main',$data);
}
The complete value of URI segment 3 should be
FVjGcz4qQztqAk0jaomJiAFBZ/vKVSBug1iGPQeKQCZ/K7+WUE4E/M9u1EjWh3uKTKeIhExjGKK1dJ2awL0+zQ==
but it shows only
FVjGcz4qQztqAk0jaomJiAFBZ
I tried changing
$config['permitted_uri_chars'] = 'a-zA-Z 0-9~%.:_\-#=+';
by adding / to the permitted URI characters, but it throws an error.
So I need to encrypt the $id in the URL using the CodeIgniter Encrypt class and decrypt it on the server side to get the actual $id, so that I can fetch data from the DB. Any help would be appreciated.

You have to extend the Encrypt class and avoid the / to get it working. Place this class in your application/libraries folder and name the file MY_Encrypt.php.
class MY_Encrypt extends CI_Encrypt
{
/**
* Encodes a string.
*
* @param string $string The string to encrypt.
* @param string $key[optional] The key to encrypt with.
* @param bool $url_safe[optional] Specifies whether or not the
* returned string should be url-safe.
* @return string
*/
function encode($string, $key="", $url_safe=TRUE)
{
$ret = parent::encode($string, $key);
if ($url_safe)
{
$ret = strtr(
$ret,
array(
'+' => '.',
'=' => '-',
'/' => '~'
)
);
}
return $ret;
}
/**
* Decodes the given string.
*
* @access public
* @param string $string The encrypted string to decrypt.
* @param string $key[optional] The key to use for decryption.
* @return string
*/
function decode($string, $key="")
{
$string = strtr(
$string,
array(
'.' => '+',
'-' => '=',
'~' => '/'
)
);
return parent::decode($string, $key);
}
}

FVjGcz4qQztqAk0jaomJiAFBZ/vKVSBug1iGPQeKQCZ/K7+WUE4E/M9u1EjWh3uKTKeIhExjGKK1dJ2awL0+zQ==
Shows
FVjGcz4qQztqAk0jaomJiAFBZ
If you look at your URL closely, you can see that there is a '/' right after the part that was shown. Everything after it is treated as another URI segment, so the value cannot be decoded.
The Encrypt library will not work as-is in this case.
Either stop passing the value through the URL or use a different technique, such as base_encode().
Hope that helps

This happens because the character "/" is the URI path delimiter. You can work around it by keeping that character out of the URL, for example by passing your encryption output through rawurlencode() before attaching it to the URL.
Edit:
I tried rawurlencode(), but wasn't able to get the proper output.
Finally succeeded by using this code.
Define two functions:
function hex2str( $hex ) {
return pack('H*', $hex);
}
function str2hex( $str ) {
// bin2hex() gives the same result without passing unpack()'s return value to array_shift() by reference
return bin2hex( $str );
}
Then call str2hex() with the encrypted user ID to convert the encrypted string into hex.
Reverse the process to get the correct string so that you can decrypt it.
I was able to properly encode and decode:
"FVjGcz4qQztqAk0jaomJiAFBZ/vKVSBug1iGPQeKQCZ/K7+WUE4E/M9u1EjWh3uKTKeIhExjGKK1dJ2awL0+zQ=="
to:
"46566a47637a3471517a7471416b306a616f6d4a694146425a2f764b56534275673169475051654b51435a2f4b372b57554534452f4d397531456a576833754b544b65496845786a474b4b31644a3261774c302b7a513d3d"
The url would become rather long though.
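A shorter alternative to hex, sketched here under the assumption that the characters - and _ are allowed by permitted_uri_chars, is URL-safe base64 (RFC 4648, section 5). It grows the string by about a third instead of doubling it. base64url_encode() and base64url_decode() are hypothetical helper names, not CodeIgniter or PHP built-ins.

```php
<?php
// Hypothetical helpers (not part of CodeIgniter or PHP core).
function base64url_encode(string $data): string
{
    // Swap '+' and '/' for '-' and '_', then drop the '=' padding.
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

function base64url_decode(string $token)
{
    // Restore the standard alphabet and the padding, then decode strictly.
    $base64 = strtr($token, '-_', '+/');
    $padded = str_pad($base64, strlen($base64) + (4 - strlen($base64) % 4) % 4, '=');
    return base64_decode($padded, true); // false on invalid input
}

// The encrypted value from the question, used as sample input.
$cipher = 'FVjGcz4qQztqAk0jaomJiAFBZ/vKVSBug1iGPQeKQCZ/K7+WUE4E/M9u1EjWh3uKTKeIhExjGKK1dJ2awL0+zQ==';
$token = base64url_encode($cipher);

echo $token, PHP_EOL;
var_dump(base64url_decode($token) === $cipher);
```

The token can then be placed in the URL as a single segment and decoded before calling the Encrypt library's decode().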

PHP Simple Template Engine / Function

I'm required to create a simple template engine; I can't use Twig or Smarty, etc. because the designer on the project needs to be able to just copy/paste her HTML into the template with no configuration, muss/fuss, whatever. It's gotta be really easy.
So I created something that will allow her to do just that, by placing her content between {{ CONTENT }} {{ !CONTENT }} tags.
My only problem is that I want to make sure that if she uses multiple spaces in the tags - or NO spaces - it won't break; i.e. {{ CONTENT }} or {{CONTENT}}
What I have below accomplishes this, but I'm afraid it may be overkill. Anybody know a way to simplify this function?
function defineContent($tag, $string) {
$offset = strlen($tag) + 6;
// add a space to our tags if none exist
$string = str_replace('{{'.$tag, '{{ '.$tag, $string);
$string = str_replace($tag.'}}', $tag.' }}', $string);
// strip consecutive spaces
$string = preg_replace('/\s+/', ' ', $string);
// now that extra spaces have been stripped, we're left with this
// {{ CONTENT }} My content goes here {{ !CONTENT }}
// remove the template tags
$return = substr($string, strpos($string, '{{ '.$tag.' }}') + $offset);
$return = substr($return, 0, strpos($return, '{{ !'.$tag.' }}'));
return $return;
}
// here's the string
$string = '{{ CONTENT }} My content goes here {{ !CONTENT }}';
// run it through the function
$content = defineContent('CONTENT', $string);
echo $content;
// gives us this...
My content goes here
EDIT
Ended up creating a repo, for anyone interested.
https://github.com/timgavin/tinyTemplate

I would suggest taking a look at variable extraction into the template scope.
It's a bit easier to maintain and has less overhead than the replace approach, and it's often easier for the designer to use. In its basic form, it's just PHP variables and short tags.
How much the designer has to do depends on what you generate on your side, e.g. a table and its rows (or complete content blocks); it could be just <?=$table?> ;) That means less work for the designer and more work for you. Or just provide a few rendering examples and helpers, because copy/pasting examples should always work, even with an untrained designer.
Template
The template is just HTML mixed with <?=$variable?> - uncluttered.
src/Templates/Article.php
<html>
<body>
<h1><?=$title?></h1>
<div><?=$content?></div>
</body>
</html>
Usage
src/Controller/Article.php
...
// initialize
$view = new View;
// assign
$view->data['title'] = 'The title';
$view->data['content'] = 'The body';
// render
$view->render(dirname(__DIR__) . '/Templates/Article.php');
View / TemplateRenderer
The core function here is render(). The template file is included and the variable extraction happens in a closure to avoid any variable clashes/scope problems.
src/View.php
class View
{
/**
* Set data from controller: $view->data['variable'] = 'value';
* @var array
*/
public $data = [];
/**
* @param string $template Path to template file.
*/
function render($template)
{
if (!is_file($template)) {
throw new \RuntimeException('Template not found: ' . $template);
}
// define a closure with a scope for the variable extraction
$result = function($file, array $data = array()) {
ob_start();
extract($data, EXTR_SKIP);
try {
include $file;
} catch (\Exception $e) {
ob_end_clean();
throw $e;
}
return ob_get_clean();
};
// call the closure
echo $result($template, $this->data);
}
}

Answering specifically what you asked:
My only problem is that I want to make sure that if she uses multiple spaces in the tags - or NO spaces - it won't break
What I have below accomplishes this, but I'm afraid it may be overkill. Anybody know a way to simplify this function?
... the only "slow" part of your function is the preg_replace. Use trim instead, for a very slight increase in speed. Otherwise, don't worry about it. There's no magic PHP command to do what you're looking to do.
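One way to tolerate any amount of whitespace inside the tags, including none, is to let a single regular expression match both tags and capture what lies between them. The sketch below is an assumption-laden alternative, not the asker's function: extractTagContent() is a hypothetical name, it trims the captured content, and unlike the original it leaves whitespace inside the content untouched.

```php
<?php
// Hypothetical helper: returns the text between {{ TAG }} and {{ !TAG }},
// or null when the tag pair is not found.
function extractTagContent(string $tag, string $template): ?string
{
    $name = preg_quote($tag, '/');
    // \s* allows zero or more spaces around the tag name and after the "!".
    $pattern = '/\{\{\s*' . $name . '\s*\}\}(.*?)\{\{\s*!\s*' . $name . '\s*\}\}/s';

    if (preg_match($pattern, $template, $m) !== 1) {
        return null;
    }
    return trim($m[1]);
}

$content = extractTagContent('CONTENT', '{{CONTENT}} My content goes here {{   !CONTENT }}');
echo $content, PHP_EOL;
```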

BBcode parser, how to attach to main page

I am trying to use the simple BBcode parser shown below, but I am not sure how to actually make it work on my webpage. I previously used some lines that call functions which are not recognised, such as:
require_once('parser.php'); // path to Recruiting Parsers' file
$parser = new parser; // start up Recruiting Parsers
$parsed = $parser-> p($mytext); // p() is function which parses
The p() function is not recognised and hence nothing is parsed. I am using a text editor that outputs BBcode, which I am trying to convert back into HTML. Do you know what code I should use so that it parses? I am not a developer so this is all very strange.
Here is parser.php:
<?php
function bbcodeParser($bbcode){
/* bbCode Parser
*Syntax: bbcodeParser(bbcode)
*/
/* Matching codes */
$urlmatch = "([a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+)";
/* Basically remove HTML tag's functionality */
$bbcode = htmlspecialchars($bbcode);
/* Replace the "special character" with its Unicode equivalent */
$match["special"] = "/\�/s";
$replace["special"] = '&#65533;';
/* Bold text */
$match["b"] = "/\[b\](.*?)\[\/b\]/is";
$replace["b"] = "<b>$1</b>";
/*many other properties as before: italics, colours, fonts etc.*/
/* Parse */
$bbcode = preg_replace($match, $replace, $bbcode);
/* New line to <br> tag */
$bbcode=nl2br($bbcode);
/* Code blocks - Need to specially remove breaks */
/* A closure avoids redeclaring pre_special() when bbcodeParser() is called more than once */
$bbcode = preg_replace_callback("/\[code\](.*?)\[\/code\]/ism", function ($matches) {
$prep = preg_replace("/\<br \/\>/","",$matches[1]);
return "�<pre>$prep</pre>�";
}, $bbcode);
/* Remove <br> tags before quotes and code blocks */
$bbcode=str_replace("�<br />","",$bbcode);
$bbcode=str_replace("�","",$bbcode); //Clean up any special characters that got misplaced...
/* Return parsed contents */
return $bbcode;
}
?>

Have you tried replacing your p() function with bbcodeParser()? Looks like if you do this, it should work as expected:
require_once('parser.php'); // path to Recruiting Parsers' file
$parsed = bbcodeParser($mytext); // bbcodeParser() is function which parses

Advice for implementing simple regex (for bbcode/geshi parsing)

I made a personal note-taking application in PHP so I can store and organize my notes, and I wanted a nice simple format to write them in.
I had done it in Markdown but found it a little confusing and there was no simple syntax highlighting. I have done BBcode before, so I want to implement that.
Now for GeSHi (the syntax highlighter), which I really want to implement, the simplest usage looks like this:
$geshi = new GeSHi($sourcecode, $language);
$geshi->parse_code();
Now this is the easy part, but what I want to do is allow my BBcode to call it.
My current regular expression to match a made-up [syntax=cpp][/syntax] BBcode is the following:
preg_replace('#\[syntax=(.*?)\](.*?)\[/syntax\]#si' , 'geshi(\\2,\\1)????', $text);
You will notice I capture the language and the content. How on earth would I connect it to the GeSHi code?
preg_replace seems to be able to replace the match only with a string, not an 'expression', so I am not sure how to use those two lines of GeSHi code with the captured data.
I really am excited about this project and wish to overcome this.

I wrote this class a while back, the reason for the class was to allow easy customization / parsing. Maybe a little overkill, but works well and I needed it overkill for my application. The usage is pretty simple:
$geshiH = new Geshi_Helper();
$text = $geshiH->geshi($text); // this assumes that the text should be parsed (ie inline syntaxes)
---- OR ----
$geshiH = new Geshi_Helper();
$text = $geshiH->geshi($text, $lang); // assumes that you have the language, good for a snippets deal
I had to do some chopping from other custom items I had, but pending no syntax errors from the chopping it should work. Feel free to use it.
<?php
require_once 'Geshi/geshi.php';
class Geshi_Helper
{
/**
* @var array Array of matches from the code block.
*/
private $_codeMatches = array();
private $_token = "";
private $_count = 1;
public function __construct()
{
/* Generate a unique hash token for replacement) */
$this->_token = md5(time() . rand(9999,9999999));
}
/**
* Performs syntax highlights using geshi library to the content.
*
* @param string $content - The content to parse
* @return string Syntax Highlighted content
*/
public function geshi($content, $lang=null)
{
if (!is_null($lang)) {
/* Given the returned results 0 is not set, adding the "" should make this compatible */
$content = $this->_highlightSyntax(array("", strtolower($lang), $content));
}else {
/* Need to replace this prior to the code replace for nobbc (callbacks are used because the /e modifier was removed in PHP 7) */
$content = preg_replace_callback('~\[nobbc\](.+?)\[/nobbc\]~i', function ($m) {
return '[nobbc]' . strtr($m[1], array('[' => '&#91;', ']' => '&#93;', ':' => '&#58;', '@' => '&#64;')) . '[/nobbc]';
}, $content);
/* For multiple content we have to handle the br's, hence the replacement filters */
$content = $this->_preFilter($content);
/* Reverse the nobbc markup */
$content = preg_replace_callback('~\[nobbc\](.+?)\[/nobbc\]~i', function ($m) {
return strtr($m[1], array('&#91;' => '[', '&#93;' => ']', '&#58;' => ':', '&#64;' => '@'));
}, $content);
$content = $this->_postFilter($content);
}
return $content;
}
/**
* Performs syntax highlights using geshi library to the content.
* If it is unknown the number of blocks, use highlightContent
* instead.
*
* @param string $content - The code block to parse
* @param string $language - The language to highlight with
* @return string Syntax Highlighted content
* @todo Add any extra / customization styling here.
*/
private function _highlightSyntax($contentArray)
{
$codeCount = $contentArray[1];
/* If the count is 2 we are working with the filter */
if (count($contentArray) == 2) {
$contentArray = $this->_codeMatches[$contentArray[1]];
}
/* for default [syntax] */
if ($contentArray[1] == "")
$contentArray[1] = "php";
/* Grab the language */
$language = (isset($contentArray[1]))?$contentArray[1]:'text';
/* Remove leading spaces to avoid problems */
$content = ltrim($contentArray[2]);
/* Parse the code to be highlighted */
$geshi = new GeSHi($content, strtolower($language));
return $geshi->parse_code();
}
/**
* Substitute the code blocks for formatting to be done without
* messing up the code.
*
* @param array $match - Array of matched items to substitute
* @return string Substituted content
*/
private function _substitute($match)
{
$index = sprintf("%02d", $this->_count++);
$this->_codeMatches[$index] = $match;
return "----" . $this->_token . $index . "----";
}
/**
* Removes the code from the rest of the content to apply other filters.
*
* @param string $content - The content to filter out the code lines
* @return string Content with code removed.
*/
private function _preFilter($content)
{
return preg_replace_callback("#\s*\[syntax=(.*?)\](.*?)\[/syntax\]\s*#siU", array($this, "_substitute"), $content);
}
/**
* Replaces the code after the filters have been ran.
*
* @param string $content - The content to replace the code lines
* @return string Content with code re-applied.
*/
private function _postFilter($content)
{
/* using dashes to prevent the old filtered tag being escaped */
return preg_replace_callback("/----\s*" . $this->_token . "(\d{2})\s*----/si", array($this, "_highlightSyntax"), $content);
}
}
?>

It looks to me like you already got the regex right. Your problem lies in the invocation, so I suggest making a wrapper function:
function geshi($src, $l) {
$geshi = new GeSHi($src, $l);
// parse_code() returns the highlighted markup
return $geshi->parse_code();
}
Now this would normally suffice, but the source code is likely to contain single or double quotes itself. Therefore you cannot write preg_replace(".../e", "geshi('$2','$1')", ...) as you would need. (Note that '$1' and '$2' need quotes because preg_replace just substitutes the $1,$2 placeholders, but this needs to be valid PHP inline code.)
That's why you need to use preg_replace_callback to avoid escaping issues in the /e exec replacement code. (The /e modifier itself was removed in PHP 7.)
So for example:
preg_replace_callback('#\[syntax=(.*?)\](.*?)\[/syntax\]#si' , 'geshi_replace', $text);
And I'd make a second wrapper, but you can combine it with the original code:
function geshi_replace($uu) {
return geshi($uu[2], $uu[1]);
}
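The two wrappers can also be folded into one closure. The sketch below keeps the regex from the question and takes the highlighter as a callable so that it runs without GeSHi installed. replaceSyntaxTags() and the stand-in highlighter are hypothetical; in real use the callable would wrap the GeSHi calls shown above.

```php
<?php
// Hypothetical helper: replaces every [syntax=lang]...[/syntax] block
// with whatever the supplied highlighter returns for it.
function replaceSyntaxTags(string $text, callable $highlight): string
{
    return preg_replace_callback(
        '#\[syntax=(.*?)\](.*?)\[/syntax\]#si',
        function (array $m) use ($highlight) {
            // $m[1] is the language, $m[2] is the source code.
            return $highlight($m[2], $m[1]);
        },
        $text
    );
}

// Stand-in for GeSHi so the example is self-contained (an assumption,
// not GeSHi's real output format).
$fakeHighlighter = function (string $code, string $lang): string {
    return '<pre class="' . $lang . '">' . htmlspecialchars($code) . '</pre>';
};

$html = replaceSyntaxTags('See [syntax=cpp]int x = 1;[/syntax] here', $fakeHighlighter);
echo $html, PHP_EOL;
```

Because the captured code reaches the callback as a plain PHP string, quotes inside the source need no escaping.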

Use preg_match, passing a third argument to receive the captured groups:
if (preg_match('#\[syntax=(.*?)\](.*?)\[/syntax\]#si', $text, $match)) {
$geshi = new GeSHi($match[2], $match[1]);
}
