PHP UTF-8 does not work

I'm trying to put in header("Content-Type: text/html; charset=utf-8"); but it does not work:
<?php
header("Content-Type: text/html; charset=utf-8");
function randStr($rts=20){
    $act_chars = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ";
    $act_val = "";
    for($act=0; $act <$rts ; $act++)
    {
        mt_srand((double)microtime()*1000000);
        $act_val .= $act_chars[mt_rand(0,strlen($act_chars)-1)];
    }
    return $act_val;
}
$dene = randStr(16);
print "$dene";
?>
Output:
K��A�CÞZU����EJ

There are several mistakes. Don't use [] or strlen() on a multibyte string. Use mb_substr() and mb_strlen().
Also set the internal encoding of the multibyte extension to UTF-8:
mb_internal_encoding("UTF-8");
You could also set the encoding on a per-function basis. See the specific function signatures for further details.
Here is an improved version of your code:
header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding("UTF-8");

function randStr($rts = 20) {
    $act_chars = 'ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ';
    $act_val = '';
    $act_chars_last = mb_strlen($act_chars) - 1;
    for ($act = 0; $act < $rts; $act++) {
        $act_val .= mb_substr($act_chars, mt_rand(0, $act_chars_last), 1);
    }
    return $act_val;
}

$dene = randStr(16);
print $dene;
Replaced double quotes by single quotes (saves a tiny amount of time)
Removed mt_srand() calls because they are automatically done as of PHP 4.2.0 (thanks to feeela for mentioning that in a comment)
Saved the string length - 1 in a variable, so PHP doesn't need to recompute it in every loop step.
If you want more insights on that topic, check out the link from deceze in the answer below: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

The problem is that you're assembling a random gobbledygook of bytes and are telling the browser to interpret it as UTF-8. The standard PHP str functions assume one byte = one character. By randomly picking a multi-byte string apart with those you're not going to get whole characters, only bytes.
If you need encoding aware functions operating on a character level rather than a byte level, use the mb_* functions.
And read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
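A small sketch of that byte/character difference (the sample string is just an example):
<?php
mb_internal_encoding('UTF-8');

$s = 'Çağrı';               // 5 characters, 8 bytes in UTF-8

echo strlen($s);            // 8 – counts bytes
echo mb_strlen($s);         // 5 – counts characters
echo $s[0];                 // a single byte (half of 'Ç'), i.e. mojibake
echo mb_substr($s, 0, 1);   // 'Ç' – a whole character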

Instead of using the mb_* functions, you could alternatively set the following variables in php.ini if you have access to it:
# following mbstring-variables should be set via php.ini or vhost-configuration (httpd.conf);
# does not work per directory/via .htaccess
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.func_overload = 7
That way you could use your current function as it is and use strlen() and similar functions as if they were their mb_* counterparts.
See php.net – Function Overloading Feature for more details.
See also:
http://www.php.net/manual/en/mbstring.configuration.php#ini.mbstring.language
http://www.php.net/manual/en/mbstring.configuration.php#ini.mbstring.func-overload
http://www.php.net/manual/en/mbstring.configuration.php#ini.mbstring.internal-encoding
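As a rough illustration of the effect (a sketch; the settings must come from php.ini or the vhost configuration, not from the script, and note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0):
<?php
// With mbstring.func_overload = 7 and mbstring.internal_encoding = UTF-8,
// the byte-oriented string functions are transparently replaced by their
// mb_* counterparts:
$s = 'ÇĞİÖŞÜ';

echo strlen($s);        // 6 (characters) instead of 12 (bytes)
echo substr($s, 0, 1);  // 'Ç' instead of a broken half-character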

Related

PHP reads a line from a CSV file in the wrong charset

I've got a CSV file. If I set the charset to ISO-8859-2 (Eastern European) in LibreOffice Calc, it renders the characters correctly, but the server's locale is set to EN-UK and I cannot read the characters correctly. For example, it returns T�t instead of Tót.
I tried many things, like:
echo (mb_detect_encoding("T�t","ISO-8859-2","UTF-8"));
I know the character probably does not exist in UTF-8, but I tried.
I also tried to set the correct charset in the header:
header('Content-Type: text/html; charset=iso-8859-2');
echo "T�th";
but it returns TÄĹźËth instead of Tóth.
Please help me solve this, thanks in advance.
I advise against setting the header to charset=iso-8859-2. It is usual to work with UTF-8. If the data is available in a different encoding, it should be converted to UTF-8 and then processed as CSV. The example code below can stay this simple because the newline characters are the same in UTF-8 and ISO-8859-2.
$fileName = "yourpath/Iso8859_2.csv";
$fp = fopen($fileName, "r");
while ($row = fgets($fp)) {
    // convert each line to UTF-8 before parsing it as CSV
    $strUtf8 = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-2');
    $arr = str_getcsv($strUtf8);
    var_dump($arr);
}
fclose($fp);
The exact encoding of the CSV file must be known. mb_detect_encoding is not suitable for determining the encoding of a file.

Send SMS containing 'æøå' with AT commands

I'm making an SMS sending function for a project I'm working on. The code works just fine, but when I send the letters 'æ-ø-å-Æ-Ø-Å' they turn into 'f-x-e-F-X-E'.
How do I change the encoding so that I can send these letters?
This is my code:
<?php
include "php_serial.class.php";

$html = $_POST['msg'];

$serial = new phpSerial;
$serial->deviceSet("/dev/cu.HUAWEIMobile-Modem");
$serial->deviceOpen();

$serial->sendMessage("ATZ\n\r");
// Wait and read from the port
var_dump($serial->readPort());

$serial->sendMessage("ATE0\n\r");
// Wait and read from the port
var_dump($serial->readPort());

// To write into
$serial->sendMessage("AT+cmgf=1;+cnmi=2,1,0,1,0\n\r");
$serial->sendMessage("AT+cmgs=\"+45{$_POST['number']}\"\n\r");
$serial->sendMessage("{$html}\n\r");
$serial->sendMessage(chr(26));

// wait for modem to send message
sleep(3);
$read = $serial->readPort();
$serial->deviceClose();

$read = preg_replace('/\s+/', '', $read);
$read = substr($read, -2);
if ($read == "OK") {
    header("location: index.php?send=1");
} else {
    header("location: index.php?send=2");
}
?>
First of all, you must seriously redo your AT command handling to
Read and parse every single response line given back by the modem until you get a final result code for every single command line invocation, no exceptions whatsoever. See this answer for more details.
For AT+CMGS specifically you also MUST wait for the "\n\r> " response before sending data, see this answer for more details.
Now to answer your question about æøå turning into fxe: this is a classic stripping of the most significant bit of the ISO 8859-1 encoding (which I had almost forgotten about). It is probably caused by the default character encoding, but since you should always be explicit and set the character encoding you want to use in any case, investigating that further is of no value. The character encoding used for strings in AT commands is controlled by the AT+CSCS command (see this answer for more details); run AT+CSCS=? to get a list of options.
Based on your information you seem to be using ISO 8859-1, so running AT+CSCS="8859-1" will stop the zeroing of the MSB. You might be satisfied with just that, but I strongly recommend using UTF-8 instead; it is just so vastly superior to 8859-1.
Failing all of that, I am quite sure that at least one of the GSM or IRA encodings supports the æøå characters, but then you have to do some very custom translation, since those characters will have binary values very different from what is common in text elsewhere.
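As a minimal sketch of just the character-set step, reusing the phpSerial object from the question (it still omits the response parsing insisted on above, so treat it purely as an illustration of AT+CSCS):
// Tell the modem which character set the following text commands use.
// "8859-1" matches the ISO 8859-1 data the question appears to be sending;
// AT+CSCS=? lists what this particular modem actually supports.
$serial->sendMessage("AT+CSCS=\"8859-1\"\n\r");
var_dump($serial->readPort());   // expect a final OK before continuing

// Then the existing AT+CMGF / AT+CMGS sequence from the question follows.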

How to format incoming email text for HTML display

I've set up a script that processes incoming emails and creates blog entries on Blogger. I'm using PEAR's Mail_Mime libs (for now) to read the incoming message. The messages often have characters in them that cannot be read by browsers--this happens most often when people use Outlook or cut/paste from MS Word.
So the output at the other end is something like this:
Here is a test post with “quotes” and ‘apostrophes�for what it�s worth, it also has dashes�and other strange formatting cut and paste from MS Word.
You can also see the output in the wild.
It's not hard to fix any specific instance, but each client (hotmail, gmail, outlook, etc) seems to handle things just a bit differently. Mail_Mime only seems to munge the output and, if I turn off Mail_Mime's parsing and try to translate the encoded characters myself using mb_convert_encoding or some manual simulation of this, it's even worse.
Please note that this is not going to be solved by selecting the right encoding type and using decode/encode/convert functions. The incoming formats vary from Windows-1252 to UTF-8 to just about anything else mail clients can think of.
Has anyone scripted this before that could save me some time by offering up a sample or advice on the best approach? I've tried all the simple answers and done plenty of experimenting, so please don't bother responding unless you've dealt with a similar issue successfully or have a deep understanding of encoding issues.
The only way to do this is to do it by the spec: pull in the 'Content-Type' MIME header, pick up the charset (it will look like Content-Type: text/plain; charset="us-ascii"), then convert to UTF-8, and of course ensure your output on the web is sent as UTF-8 with the right headers.
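A rough sketch of that approach (the $rawHeaders and $body variables stand in for whatever your MIME parser hands you):
// Pick the charset out of the Content-Type header, then normalise to UTF-8.
$charset = 'ISO-8859-1';                       // fallback if none is declared
if (preg_match('/charset="?([A-Za-z0-9._-]+)"?/i', $rawHeaders, $m)) {
    $charset = $m[1];
}

if (strcasecmp($charset, 'UTF-8') !== 0) {
    $body = mb_convert_encoding($body, 'UTF-8', $charset);
}

// Serve the result explicitly as UTF-8.
header('Content-Type: text/html; charset=utf-8');
echo $body;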
To solve this problem, and get my message into valid UTF-8 that is readable from a browser, I found this PHP lib, ConvertCharset by Mikolaj Jedrzejak, which worked on almost everything. It still had issues with a specific symbol (=A0) when converting from Windows-1252 or iso-8859-1. So I converted this character manually before setting the code loose.
Here's what it looks like overall:
// decode using Mail_Mime
require 'Mail.php';
require 'Mail/mime.php';
require 'Mail/mimeDecode.php';

$params['include_bodies'] = true;
$params['decode_bodies']  = true; // this decodes it!
$params['decode_headers'] = true;

$decoder = new Mail_mimeDecode($input);
$mime = $decoder->decode($params);

// too much work to put in this example
$charset = ...; // do some magic with $mime->parts to get the character set
$text = ...;    // do some magic with $mime->parts to get the text

// fix the =A0 control character; it's already been decoded
// by Mail_Mime, so we need the actual byte now
// this has to be done before trying to convert to UTF-8
$char = chr(hexdec('A0'));
$text = str_replace($char, '', $text);

// convert to UTF-8 using ConvertCharset
require 'ConvertCharset.class.php';
if (strtolower($charset) != 'utf-8') {
    $converter = new ConvertCharset($charset, 'utf-8', false);
    $text = $converter->Convert($text);
}
Then everything is spiffy. It even does the infamous Iñtërnâtiônàlizætiøn conversion, as well as accepting French, Spanish, and pastes directly from MS Word :)

Truncate a UTF-8 string to fit a given byte count in PHP

Say we have a UTF-8 string $s and we need to shorten it so it can be stored in N bytes. Blindly truncating it to N bytes could mess it up. But decoding it to find the character boundaries is a drag. Is there a tidy way?
[Edit 20100414] In addition to S.Mark’s answer: mb_strcut(), I recently found another function to do the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.
Edit: S.Mark's answer is actually better than mine - PHP has a (badly documented) builtin function that solves exactly this problem.
Original "back to the bits" answer follows:
Truncate to the desired byte count
If the last remaining byte starts with 11 (binary), it is the lead byte of a cut-off sequence: drop it
Otherwise, if the last byte is a continuation byte (starts with 10) and the second-to-last byte starts with 111 (binary), drop the last 2 bytes
Otherwise, if the last two bytes are continuation bytes and the third-to-last byte starts with 1111 (binary), drop the last 3 bytes
This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
Unfortunately (as Andrew reminds me in the comments) there are also cases where two separately encoded Unicode code points form a single character (basically, diacritics such as accents can be represented as a separate code point modifying the preceding letter).
Handling this kind of thing requires advanced Unicode-fu which is not available in PHP's standard string functions and may not even be possible for all cases (there are some weird scripts out there!), but fortunately it's relatively rare, at least for Latin-based languages.
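A minimal sketch of that byte-level rule in PHP (the function name is mine, just for illustration):
function utf8_byte_truncate(string $s, int $maxBytes): string
{
    // Nothing to do if the string already fits.
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    $s = substr($s, 0, $maxBytes);       // blind byte truncation

    // Walk back over trailing continuation bytes (10xxxxxx) to find the
    // lead byte (11xxxxxx); drop the whole sequence if it is incomplete.
    $len = strlen($s);
    $i = $len - 1;
    while ($i >= 0 && (ord($s[$i]) & 0xC0) === 0x80) {
        $i--;                             // skip continuation bytes
    }
    if ($i >= 0 && (ord($s[$i]) & 0xC0) === 0xC0) {
        $lead = ord($s[$i]);
        // expected sequence length according to the lead byte
        $need = $lead >= 0xF0 ? 4 : ($lead >= 0xE0 ? 3 : 2);
        if ($len - $i < $need) {
            $s = substr($s, 0, $i);       // drop the incomplete sequence
        }
    }
    return $s;
}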
I think you don't need to reinvent the wheel, you could just use mb_strcut and make sure you set encoding to UTF-8 first.
mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from index 0, cut 3 bytes
It returns
\xc2\x80
because in \xc2\x80\xc2 the last byte is an incomplete character.
I coded up this simple function for this purpose; you need the mbstring extension though.
function str_truncate($string, $bytes = null)
{
    if (isset($bytes) === true) {
        // to speed things up: at most $bytes characters can fit into $bytes bytes
        $string = mb_substr($string, 0, $bytes, 'UTF-8');
        while (strlen($string) > $bytes) {
            $string = mb_substr($string, 0, -1, 'UTF-8');
        }
    }
    return $string;
}
While this code also works, S.Mark's answer is obviously the way to go.
Here's a test for mb_strcut(). It doesn't prove that it does just what we're looking for but I find it pretty convincing.
<?php
ini_set('default_charset', 'UTF-8' );
$strs = array(
'Iñtërnâtiônàlizætiøn',
'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
'ايران لا ترى تغييرا في الموقف الأمريكي',
'独・米で死傷者を出した銃の乱射事件',
'國會預算處公布驚人的赤字數據後',
'이며 세계 경제 회복에 걸림돌이 되고 있다',
'В дагестанском лесном массиве южнее села Какашура',
'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
'რუსეთი ასევე გეგმავს სამხედრო');
for ($i = 10; $i <= 30; $i += 5) {
    foreach ($strs as $s) {
        $t = mb_strcut($s, 0, $i, 'UTF-8');
        print(
            sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
            . (mb_check_encoding($t, 'UTF-8') ? ' OK ' : ' Bad ')
            . $t . "\n");
    }
}
?>
In addition to S.Mark’s answer which was mb_strcut(), I recently found another function to do a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.
The functionality is a bit different: the mb_strcut() documentation claims it cuts at the nearest UTF-8 character boundary, so it doesn't respect multi-character graphemes, while grapheme_extract(), on the other hand, does. So depending on what you need, grapheme_extract() might be better (e.g. to display a string) or mb_strcut() might be better (e.g. for indexing). Anyway, just thought I'd mention it.
(And since intl is an ICU wrapper, I have a lot of confidence in it.)
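A small sketch of the difference using a decomposed character (grapheme_extract() requires the intl extension; the outputs in the comments are what I'd expect from the documented behaviour):
$s = "noe\u{0308}l";   // "noël" with a decomposed ë (e + combining diaeresis), 6 bytes

echo mb_strcut($s, 0, 4, 'UTF-8');                     // "noe" – cuts between the e and its accent
echo grapheme_extract($s, 4, GRAPHEME_EXTR_MAXBYTES);  // "no"  – keeps the grapheme whole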
No. There is no way to do this other than decoding. The coding is pretty mechanical however. See the pretty table in the Wikipedia article.
Edit: Michael Borgwardt shows us how to do it without decoding the whole string. Clever.

Dealing with eacute and other special characters using Oracle, PHP and Oci8

Hi, I am trying to store names in an Oracle database and fetch them back using PHP and oci8.
However, if I insert the é directly into the Oracle database and use oci8 to fetch it back, I just receive an e.
Do I have to encode all special characters (including é) into HTML entities (i.e. &eacute;) before inserting them into the database ... or am I missing something?
Thx
UPDATE: Mar 1 at 18:40
found this function:
http://www.php.net/manual/en/function.utf8-decode.php#85034
function charset_decode_utf_8($string) {
    if (!ereg("[\200-\237]", $string) && !ereg("[\241-\377]", $string)) {
        return $string;
    }
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
        "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'", $string);
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
        "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'", $string);
    return $string;
}
It seems to work, although I'm not sure if it's the optimal solution.
UPDATE: Mar 8 at 15:45
Oracle's character set is ISO-8859-1.
in PHP I added:
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1");
to force the oci8 connection to use that character set.
Retrieving the é using oci8 from PHP now worked! (for VARCHARs, but not CLOBs; I had to do utf8_encode to extract those)
So then I tried saving the data from PHP to Oracle ... and it doesn't work. Somewhere along the way from PHP to Oracle the é becomes a ?.
UPDATE: Mar 9 at 14:47
So getting closer.
After adding the NLS_LANG variable, doing direct oci8 inserts with é works.
The problem is actually on the PHP side.
By using the ExtJS framework, when submitting a form it encodes the value using encodeURIComponent.
So é is sent as %C3%A9 and then decoded back into é.
However, its length is now 2 (strlen($my_sent_value) == 2) and not 1.
And if in PHP I try $my_sent_value == 'é', the result is FALSE.
I think if I am able to re-encode all these characters in PHP back into lengths of byte size 1 and then inserting them into Oracle, it should work.
Still no luck though
UPDATE: Mar 10 at 11:05
I keep thinking I am so close (yet so far away).
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9"); works very sporadicly.
I created a small php script to test:
header('Content-Type: text/plain; charset=ISO-8859-1');
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9");
$conn= oci_connect("user", "pass", "DB");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);
After running this once and logging into the Oracle database directly, I see that STRING_FIELD is set to |¿|. Obviously not what I had come to expect from my previous experience.
However, if I refresh that PHP page twice quickly.... it worked !!!
In Oracle I correctly saw |é|.
It seems like maybe the environment variable is not being correctly set or sent in time for the first execution of the script, but is available for the second execution.
My next experiment is to export the variable into PHP's environment, however, I need to reset Apache for that...so we'll see what happens, hopefully it works.
I presume you are aware of these facts:
There are many different character sets: you have to pick one and, of course, know which one you are using.
Oracle is perfectly capable of storing text without HTML entities (&eacute;). HTML entities are used in, well, HTML. Oracle is not a web browser ;-)
You must also know that HTML entities are not bound to a specific charset; on the contrary, they're used to represent characters in a charset-independent context.
You indistinctly talk about ISO-8859-1 and UTF-8. What charset do you want to use? ISO-8859-1 is easy to use but it can only store text in some Latin languages (such as Spanish) and it lacks some common chars like the € symbol. UTF-8 is trickier to use but it can store all characters defined by the Unicode consortium (which include everything you'll ever need).
Once you've taken the decision, you must configure Oracle to hold data in such charset and choose an appropriate column type. E.g., VARCHAR2 is fine for plain ASCII, NVARCHAR2 is good for UTF-8.
This is what I finally ended up doing to solve this problem:
Modified the profile of the daemon running PHP to have:
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
So that the oci8 connection uses ISO-8859-1.
Then in my PHP configuration set the default content-type to ISO-8859-1:
default_charset = "iso-8859-1"
When I am inserting into an Oracle Table via oci8 from PHP, I do:
utf8_decode($my_sent_value)
And when receiving data from Oracle, printing the variable just works:
echo $my_received_value
However when sending that data over ajax I have had to use:
utf8_encode($my_received_value)
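Put together, the PHP side of that setup looks roughly like this (table and column names are made up for the example):
// NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1 is set in the daemon's environment
// and default_charset = "iso-8859-1" in php.ini, as described above.
$conn = oci_connect('user', 'pass', 'DB');

// Inserting: the browser posts UTF-8, the database column is ISO-8859-1.
$name = utf8_decode($_POST['name']);
$stmt = oci_parse($conn, "INSERT INTO people (name) VALUES (:name)");
oci_bind_by_name($stmt, ':name', $name);
oci_execute($stmt);

// Reading back: plain output works as-is; JSON/AJAX responses need UTF-8.
$stmt = oci_parse($conn, "SELECT name FROM people");
oci_execute($stmt);
while ($row = oci_fetch_assoc($stmt)) {
    echo $row['NAME'];                        // ISO-8859-1 page output
    $json[] = utf8_encode($row['NAME']);      // UTF-8 for json_encode()
}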
If you really cannot change the character set that Oracle will use, then how about Base64 encoding your data before storing it in the database? That way, you can accept characters from any character set and store them as ISO-8859-1 (because Base64 outputs a subset of the ASCII character set which maps exactly onto ISO-8859-1). Base64 encoding will increase the length of the string by roughly a third.
If your data is only ever going to be displayed as HTML then you might as well store HTML entities as you suggested, but be aware that a single entity can be up to 10 characters per unencoded character, e.g. &thetasym; for ϑ.
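A minimal sketch of that workaround (column and variable names are just for illustration; $conn is an oci8 connection as elsewhere in this thread):
// Store: Base64 turns any byte sequence into plain ASCII.
$encoded = base64_encode($originalName);           // e.g. "Tóth" -> "VMOzdGg="
$stmt = oci_parse($conn, "INSERT INTO people (name_b64) VALUES (:v)");
oci_bind_by_name($stmt, ':v', $encoded);
oci_execute($stmt);

// When reading the row back, decode before displaying:
echo base64_decode($row['NAME_B64']);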
I had to face this problem: the Latin American special characters are stored as "?" or "¿" in my Oracle database ... I can't change the NLS_CHARACTER_SET because we're not the database owners.
So, I found a workaround :
1) ASP.NET code
Create a function that converts string to hexadecimal characters:
public string ConvertirStringAHex(String input)
{
    Encoding encoding = System.Text.Encoding.GetEncoding("ISO-8859-1");
    Byte[] stringBytes = encoding.GetBytes(input);
    StringBuilder sbBytes = new StringBuilder(stringBytes.Length);
    foreach (byte b in stringBytes)
    {
        sbBytes.AppendFormat("{0:X2}", b);
    }
    return sbBytes.ToString();
}
2) Apply the function above to the variable you want to encode, like this
myVariableHex = ConvertirStringAHex( myVariable );
In ORACLE, use the following:
PROCEDURE STORE_IN_TABLE( iTEXTO IN VARCHAR2 )
IS
BEGIN
    INSERT INTO myTable( SPECIAL_TEXT )
    VALUES ( UTL_RAW.CAST_TO_VARCHAR2( HEXTORAW( iTEXTO ) ) );
    COMMIT;
END;
Of course, iTEXTO is the Oracle parameter which receives the value of "myVariableHex" from ASP.NET code.
Hope it helps ... if there's something to improve pls don't hesitate to post your comments.
Sources:
http://www.nullskull.com/faq/834/convert-string-to-hex-and-hex-to-string-in-net.aspx
https://forums.oracle.com/thread/44799
If you have different charsets between the server-side code (PHP in this case) and the Oracle database, you should set the server-side code's charset in the Oracle connection; Oracle then does the conversion.
Example: Let's assume:
php charset utf-8 (default).
Oracle charset AMERICAN_AMERICA.WE8ISO8859P1
In the connection to Oracle made by PHP you should set UTF8 (the fourth parameter):
oci_pconnect("USER", "PASS", "URL", "UTF8");
Doing this, you write code in UTF-8 (not doing any conversion at all) and get UTF-8 from the database through this connection.
So you could write something like SELECT * FROM SOME_TABLE WHERE TEXT = 'SOME TEXT LIKE áéíóú Ñ' and also get UTF-8 text as a result.
According to the PHP documentation, by default the Oracle client (oci_pconnect) takes the NLS_LANG environment variable from the operating system. Some Debian-based systems have no NLS_LANG environment variable, so I think the Oracle client uses its own charset (AMERICAN_AMERICA.WE8ISO8859P1) if we don't specify the fourth parameter.
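A short usage sketch of that connection-level approach (credentials, connect string and table name are placeholders):
// Ask oci8 to hand strings to/from Oracle as UTF-8, regardless of the
// database character set; the client library converts in both directions.
$conn = oci_pconnect('USER', 'PASS', '//dbhost/ORCL', 'UTF8');

$stmt = oci_parse($conn, "SELECT text FROM some_table WHERE text LIKE :t");
$t = '%áéíóú%';                       // UTF-8 literal straight from the script
oci_bind_by_name($stmt, ':t', $t);
oci_execute($stmt);

while ($row = oci_fetch_assoc($stmt)) {
    echo $row['TEXT'], "\n";          // already UTF-8, no conversion needed
}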
