How can I remove special characters in a PHP string? - php

I am getting this output:
FBI believed he had a â€˜doomsday deviceâ€™
instead of:
FBI believed he had a ‘doomsday device’
When I use
iconv("UTF-8", "ISO-8859-1//IGNORE", $topic);
the output is:
FBI believed he had a âdoomsday deviceâ
I am not setting any header or charset in my file.
Update
I found out why this is happening: when the UTF-8 byte sequence is interpreted as if it were ISO-8859-1, the output is
â€™
Explanation
0xE28099 breaks down as 0xE2 (â), 0x80 (€) and 0x99 (™). What was one character in UTF-8 (’) gets mistakenly displayed as three (â€™) when misinterpreted as ISO-8859-1.
I still have no solution for converting it.
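The byte breakdown described above can be reproduced with a minimal PHP sketch, assuming the iconv extension is available. It reads the three UTF-8 bytes of ’ as Windows-1252 (the code page in which 0x80 and 0x99 are € and ™); the variable names are illustrative only.

```php
<?php
// UTF-8 encoding of U+2019 (right single quotation mark): three bytes.
$quote = "\u{2019}";
$hex = bin2hex($quote);

// Treat those same three bytes as Windows-1252 text and re-encode the
// result as UTF-8 so it can be printed on a UTF-8 terminal or page.
$misread = iconv('WINDOWS-1252', 'UTF-8', $quote);

echo $hex, "\n";     // e28099
echo $misread, "\n"; // three characters: â, €, ™
```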

The output page is being interpreted as Windows-1252, not ISO-8859-1 (0x80 and 0x99 only map to € and ™ in Windows-1252).
I recommend setting your charset header to UTF-8.
In the Apache config:
AddDefaultCharset utf-8
In php.ini:
default_charset = "utf-8"
Manually in PHP:
header("Content-Type: text/html; charset=utf-8");
If you cannot do any of the above for some reason, convert to Windows-1252 instead:
iconv("UTF-8", "Windows-1252//IGNORE", $topic);
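As a rough check of the fallback conversion, the sketch below converts a sample string to Windows-1252 and compares the bytes. The sample text is taken from the question; `//IGNORE` is left out here on the assumption that every character in the sample exists in Windows-1252.

```php
<?php
// Sample input: a UTF-8 string containing curly quotes.
$topic = "FBI believed he had a \u{2018}doomsday device\u{2019}";

// Convert to Windows-1252 for a page that is not served as UTF-8.
$converted = iconv('UTF-8', 'WINDOWS-1252', $topic);

// In Windows-1252 the curly quotes are the single bytes 0x91 and 0x92.
$expected = "FBI believed he had a \x91doomsday device\x92";
var_dump($converted === $expected);
```

If the page can be served as UTF-8 instead, no conversion is needed and the header approach above is the simpler option.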

Related

Error with chinese encoding with php

I have a file that contains Chinese characters like this:
合作伙伴
The problem is that the result looks like this:
ºÏ×÷»ï°é£º
Even if I try to print the content in the browser, I get the same encoding problem.
I'm sure it's an encoding problem, but I can't fix it.
Simplified Chinese text is often encoded as GB2312.
Try converting it from GB2312 to UTF-8:
$str = iconv('gb2312', 'utf-8', $str);
Make sure your file is saved as UTF-8 and served with this header:
Content-type: text/html; charset=utf-8
Convert the character encoding to UTF-8 and use only that:
$string = iconv('gb2312', 'utf-8', $string);
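A small sketch of that conversion, assuming the iconv extension and assuming the file really holds GB2312 bytes. The byte string below is a hand-written sample for 合作伙伴, not data from the original file. The sketch also shows that reading the same bytes as ISO-8859-1 gives the garbled text from the question.

```php
<?php
// GB2312 bytes for the four characters (hypothetical sample input).
$gb = "\xBA\xCF\xD7\xF7\xBB\xEF\xB0\xE9";

// Correct interpretation: GB2312 to UTF-8.
$utf8 = iconv('GB2312', 'UTF-8', $gb);

// Wrong interpretation: the same bytes read as ISO-8859-1.
$garbled = iconv('ISO-8859-1', 'UTF-8', $gb);

echo $utf8, "\n";
echo $garbled, "\n";
```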

Convert iso-8859-1 hex escape sequence to lowercase

I have a string like this:
$str = "\xC4";
According to Wikipedia, C4 is the ISO-8859-1 hex code for Ä. Now I want to lowercase this string to get ä (also in ISO-8859-1).
I tried various solutions using strtolower and mb_strtolower. None of them worked. The output was garbled every time.
You can specify the encoding in mb_strtolower(), so just specify it and it all works fine:
echo mb_strtolower($str, "ISO-8859-1");
//                        ^^^^^^^^^^
output:
ä
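strtolower("\xC4") can work too, but only where strtolower() is locale-aware and the current locale is an ISO-8859-1 one; since PHP 8.2 strtolower() converts ASCII letters only, so mb_strtolower() is needed there. Either way, you need to interpret the resulting byte (0xE4) using the ISO-8859-1 encoding, otherwise you'll see garbage. If you're doing this in a browser, set the appropriate header to tell the browser which encoding to expect: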
header('Content-Type: text/html; charset=iso-8859-1');
echo strtolower("\xC4");
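The point about bytes versus display can be checked with a short sketch, assuming the mbstring extension is loaded. It lowercases the ISO-8859-1 byte and then, only for display, converts the result to UTF-8.

```php
<?php
$str = "\xC4"; // Ä in ISO-8859-1

// Lowercase while telling mbstring which encoding the byte is in.
$lower = mb_strtolower($str, 'ISO-8859-1');
echo bin2hex($lower), "\n"; // e4, which is ä in ISO-8859-1

// For a UTF-8 page or terminal, convert the single byte for display.
$display = mb_convert_encoding($lower, 'UTF-8', 'ISO-8859-1');
echo $display, "\n";
```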

convert UTF-8 to ANSI (windows-1252)

I'm trying to save a Hebrew string to a file, with the file ANSI-encoded.
All attempts have failed, I'm afraid.
The PHP file itself is UTF-8.
So here's the code I'm trying:
$to_file = "בדיקה אם נרשם";
$to_file = mb_convert_encoding($to_file, "WINDOWS-1255", "UTF-8");
file_put_contents(dirname(__FILE__) ."/txt/TESTING.txt",$to_file);
This returns false for some reason.
Another attempt was:
$to_file = iconv("UTF-8", "windows-1252", $to_file);
This returns an empty string. While this did not work, changing the output charset to windows-1255 DID work. So the function itself works, but for some reason it does not convert to 1252.
I ran this function before and after the iconv call and printed the results:
mb_detect_encoding($to_file);
Before the iconv call the encoding is UTF-8.
After the iconv call the encoding is ASCII(??)
I'd really appreciate any help you can give.
Windows-1252 is a Latin encoding; you cannot encode Hebrew characters in Windows-1252. That's why it doesn't work.
Windows-1255 is an encoding for Hebrew, that's why it works.
The reason it doesn't work with mb_convert_encoding is that the mb_* functions (mbstring) don't support Windows-1255.
Reliably detecting encodings is impossible in general. Windows-1255 is a single-byte encoding, and it's virtually impossible to distinguish one single-byte encoding from another. The result is just as valid in Windows-1255 as it is in Windows-1252, ISO-8859 or any other single-byte encoding, so the detector's answer is only a guess.
See "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" for more information.
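To make the difference concrete, here is a minimal sketch assuming the iconv extension. It converts a short Hebrew sample to Windows-1255 and shows that the Windows-1252 conversion fails. The `@` only silences the notice iconv raises on failure; the variable names are illustrative.

```php
<?php
$utf8 = "בדיקה"; // the source file is assumed to be saved as UTF-8

// Windows-1255 contains the Hebrew letters: one byte per letter.
$hebrew = iconv('UTF-8', 'WINDOWS-1255', $utf8);
echo bin2hex($hebrew), "\n";

// Windows-1252 has no Hebrew letters, so this conversion fails
// (false, or an empty string as reported in the question).
$latin = @iconv('UTF-8', 'WINDOWS-1252', $utf8);
var_dump($latin === false || $latin === '');
```

The converted bytes can then be written with file_put_contents() as in the question.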
You can use this:
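<?php
// $heb must hold Windows-1255 bytes (e.g. read from an ANSI file), not UTF-8.
$heb = 'טקסט בעברית .. # ';
// The /e modifier was removed in PHP 7, so use preg_replace_callback().
// Each Hebrew byte 0xE0-0xFA becomes the UTF-8 pair 0xD7, (byte - 80).
$utf = preg_replace_callback('/[\xE0-\xFA]/', function ($m) {
    return chr(215) . chr(ord($m[0]) - 80);
}, $heb);
echo '<pre>';
print_r($heb);
echo '</pre>';
echo '------';
echo '<pre>';
print_r($utf);
echo '</pre>';
?>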
Output will be like this:
���� ������ .. # <-- $heb - what we get when we print hebrew ANSI Windows 1255
טקסט בעברית .. # <- $utf - The Converted ANSI Windows 1255 to now UTF ...:)
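As a cross-check, the manual byte mapping can be compared with iconv. This is a sketch assuming the iconv extension; the input is a hand-written Windows-1255 byte string for טקסט, not data from the answer.

```php
<?php
// "טקסט" as Windows-1255 bytes (hypothetical sample input).
$heb = "\xE8\xF7\xF1\xE8";

// Library conversion.
$viaIconv = iconv('WINDOWS-1255', 'UTF-8', $heb);

// Manual mapping: bytes 0xE0-0xFA become 0xD7 followed by (byte - 80).
$viaCallback = preg_replace_callback('/[\xE0-\xFA]/', function ($m) {
    return chr(215) . chr(ord($m[0]) - 80);
}, $heb);

var_dump($viaIconv === $viaCallback);
echo $viaIconv, "\n";
```

The manual mapping only covers the Hebrew letters; iconv also handles the rest of the code page, such as vowel points and punctuation.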

PHP, HTML and character encodings

I actually have a fairly simple question, but I'm unable to find an answer anywhere. The PHP function html_entity_decode is supposed to convert "all HTML entities to their applicable characters from string."
So, since &Omega; is the HTML entity for the Greek capital letter Omega, I'd expect that echo html_entity_decode('&Omega;', ENT_COMPAT, 'UTF-8'); would output Ω. But instead, it outputs some strange characters which my browser can't recognize. Why is this?
Thanks,
Martijn
When you convert entities into UTF-8 characters like your last parameter specifies, your output encoding must be UTF-8 as well. Otherwise, in a single-byte encoding like ISO-8859-1, you will see double-byte characters as two broken single ones.
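A short sketch of what that looks like at the byte level, assuming the iconv extension for the second step. The decoded Ω is two bytes in UTF-8, and reading those two bytes as ISO-8859-1 yields two unrelated characters.

```php
<?php
// Decode the entity into UTF-8.
$omega = html_entity_decode('&Omega;', ENT_COMPAT, 'UTF-8');
echo bin2hex($omega), "\n"; // cea9: two bytes

// What appears when those two bytes are read as ISO-8859-1
// (re-encoded to UTF-8 here only so the result can be printed).
$misread = iconv('ISO-8859-1', 'UTF-8', $omega);
echo $misread, "\n"; // Î©
```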
It works fine:
http://codepad.viper-7.com/tb2LaW
Make sure your web page encoding is UTF-8.
If your web page uses a different encoding, change this:
html_entity_decode('&Omega;', ENT_COMPAT, 'UTF-8');
                                           ^^^^^
header('Content-type: text/html;charset=utf-8');
mysql_set_charset("utf8", $conn);
Refer to this URL:
http://www.phpwact.org/php/i18n/charsets
php mysql character set: storing html of international content

What encoding is this?

Can anyone tell me what encoding has been applied to the Chinese characters, so that they are converted into this text and stored in the MySQL database:
中`国液化天然æ°â€Ã¨Â¿Â输(控股)有é™Âå…¬å¸控股`
The original Chinese characters, which are displayed on the web page:
中国液化天然气运输(控股)有限公司控股
On the web page, the header function is used to display standard Chinese characters, as follows:
header('Content-type: text/html; charset=utf-8');
Thanks...
When you decode
中国液化天然气运输(控股)有限公司控股
as CP-1252 instead of UTF-8 (that is, its UTF-8 bytes are read as CP-1252 characters), you get
中国液化天然气è¿è¾“(控股)有é™å…¬å¸æŽ§è‚¡
When that result is stored as UTF-8 and its bytes are read as CP-1252 once again, you get
中国液化天然æ°â€Ã¨Â¿ï¿½Ã¨Â¾â€œÃ¯Â¼Ë†Ã¦Å½Â§Ã¨â€šÂ¡Ã¯Â¼â€°Ã¦Å“‰é™�å…¬å�¸æŽ§è‚¡
That is what is happening here.
It is the Unicode character set (code points) encoded as UTF-8.
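The double misinterpretation can be simulated with a small sketch, assuming the iconv extension. The helper name `misread_as_cp1252` is made up for illustration, and a single character (国) is used because some UTF-8 bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) have no CP-1252 mapping and may make iconv fail on longer text.

```php
<?php
// One round of the mistake: UTF-8 bytes are read as CP-1252 characters,
// and the result is stored again as UTF-8.
function misread_as_cp1252(string $utf8): string
{
    return iconv('WINDOWS-1252', 'UTF-8', $utf8);
}

$original = "\u{56FD}"; // 国, bytes E5 9B BD in UTF-8
$once = misread_as_cp1252($original);
$twice = misread_as_cp1252($once);

echo $once, "\n";
echo $twice, "\n";

// Reversing both rounds restores the original character.
$restored = iconv('UTF-8', 'WINDOWS-1252', iconv('UTF-8', 'WINDOWS-1252', $twice));
var_dump($restored === $original);
```

In practice it is usually better to fix the connection and page charset so the text is never misread, rather than repairing stored data this way.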
