I need help decoding URL-encoded Scandinavian characters with PHP.
I have tried to decode the å character like this:
$string = "%e5";
echo rawurldecode($string);
But this prints a black diamond (�). I get the same result with the urldecode() function.
I am using <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
and <meta charset="utf-8"> in the head.
When using rawurldecode() function on English letters like %61 it works great.
See http://www.backbone.se/urlencodingUTF8.htm for all url encoded ASCII codes.
E5 is the ISO-8859-1 encoded representation of the character å.
Your problem is that you're outputting an ISO-8859-1 encoded string, yet are telling the browser to interpret it as UTF-8. Either change the encoding in your HTTP headers/meta tag, or convert the string from 8859 to UTF-8:
echo utf8_encode(rawurldecode('%e5'));
(There's almost never a good time for utf8_encode, and it is deprecated as of PHP 8.2, but in this case it succinctly performs the needed ISO-8859-1 to UTF-8 conversion. Usually you should prefer explicit charset conversions using iconv or mb_convert_encoding.)
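As a minimal sketch of the explicit conversion mentioned in this answer, assuming the mbstring extension is loaded and that the incoming byte really is ISO-8859-1 (the variable names are illustrative):

```php
<?php
// Decode the percent-encoding; the result is the single byte 0xE5.
$raw = rawurldecode('%e5');

// Assumption: the byte is ISO-8859-1. Convert it to UTF-8 explicitly.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

echo bin2hex($raw), "\n";   // e5
echo bin2hex($utf8), "\n";  // c3a5, the UTF-8 encoding of å
echo $utf8, "\n";
```

Naming both encodings makes the direction of the conversion visible, which is the main advantage over utf8_encode.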
The single byte E5 is not a valid UTF-8 sequence on its own.
Please check a UTF-8 table.
Try with a valid UTF-8 string, for example %C3%A5 for å.
rawurldecode does no character-set conversion, so it cannot handle "Scandinavian ASCII" for you.
Try one of the iconv functions, which support CP865 (my guess at the character set you want):
http://php.net/manual/ro/function.iconv-mime-decode.php
http://php.net/manual/ro/function.iconv-mime-decode-headers.php
I'm trying to convert some encoded text to display on a website; the specific example is converting the string "d83edd2a" to the 🤪 emoji.
Apparently the encoding is UTF-16, but PHP detects it as ASCII.
I've tried using hex2bin but this returns "Ø>Ý*" and PHP detects this as UTF-8, which makes sense to me.
I've tried a couple of different approaches:
$newCode = mb_convert_encoding($code, "ASCII", "UTF-16");
But this returns "????"
$newCode = iconv(mb_detect_encoding($code), 'ASCII', $hex);
But this also returns "????"
I'm sure I'm missing something simple but I've ended up tying myself up in knots!
If I understand correctly you want to convert the string d83edd2a to the corresponding emoji.
The most straightforward way is to simply:
echo hex2bin('d83edd2a');
However, this assumes the client interprets the output as UTF-16.
If the client uses a different charset, you need to convert the string first; otherwise you will just see garbage.
But you cannot use just any encoding (like ASCII), because emojis are specific to Unicode.
(ASCII simply doesn't "know" the concept of emojis.)
You need to use UTF-8, UTF-16 or UTF-32.
Since you mentioned a website, you want UTF-8: it is the de facto standard charset for modern websites.
You can convert from UTF-16 to UTF-8 like this:
// First convert the string to binary data
// We know this is encoded in UTF-16
$UTF16Str = hex2bin('d83edd2a');
// Then we convert from UTF-16 to something more common like UTF-8
$UTF8Str = mb_convert_encoding($UTF16Str, 'UTF-8', 'UTF-16');
echo $UTF8Str;
As a last step, make sure you communicate the charset to the client (you can do this in HTML or PHP):
<meta charset="UTF-8"> <!-- inside <head> -->
Or in PHP:
header('Content-Type: text/html; charset=UTF-8');
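To see why d83edd2a is a single character, it can help to decode the UTF-16 surrogate pair by hand. The following is only an illustrative sketch, assuming the hex string is big-endian UTF-16 and that mbstring (PHP 7.2 or later, for mb_chr) is available; the variable names are made up for the example:

```php
<?php
// Split the hex string into its two 16-bit code units.
$high = hexdec(substr('d83edd2a', 0, 4)); // 0xD83E, high surrogate
$low  = hexdec(substr('d83edd2a', 4, 4)); // 0xDD2A, low surrogate

// Standard UTF-16 surrogate-pair formula.
$codePoint = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);

printf("U+%X\n", $codePoint);            // U+1F92A
echo mb_chr($codePoint, 'UTF-8'), "\n";  // the emoji, encoded as UTF-8
```

In practice mb_convert_encoding does this for you; the manual version only shows what the conversion computes.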
I am trying to convert a string such as يونيکد (I don't know what kind of text this is, so I called it a Unicode character in the title). It is Arabic text, but when I use utf8_decode() I get �?�?�?�?کد. The same string can be converted perfectly by online tools such as http://www.forgani.com/top/service/.
I have tried many things, such as:
converted string to hex and back to string
used mb_convert_encoding
used htmlentities
used forceutf8 from https://github.com/neitanod/forceutf8
used setting header, like header('Content-type: text/plain; charset=utf-8');
already tried setting PDO charset to utf8mb4 and utf8
But I didn't get the desired result, which is يونيکد. How can I decode the given string to UTF-8, or to anything else that is readable by users, in PHP?
Set your document charset to UTF-8 as follows:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
or within your php script as follows:
header('Content-type: text/plain; charset=utf-8');
You can map each mis-decoded UTF-8 sequence back to its Arabic or Persian character.
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1664&names=-&utf8=char
This JavaScript code decodes the mis-decoded UTF-8 text:
function decode(content)
{
var decoded = content.replaceAll("ت", "ت").replaceAll("Ù‡", "ه").replaceAll("Ù‡", "ه").replaceAll("ا", "ا")
.replaceAll("س", "س").replaceAll("Ùˆ", "و").replaceAll("ÛŒ", "ی").replaceAll("Ù†", "ن")
.replaceAll("د", "د").replaceAll("ز", "ز").replaceAll("ب", "ب").replaceAll("Ù‚", "ق")
.replaceAll("ص", "ص").replaceAll("Ù¾", "پ").replaceAll("Ù¾", "پ").replaceAll("Ø´", "ش")
.replaceAll("ر", "ر").replaceAll("Ú˜", "ژ").replaceAll("Ù", "ف").replaceAll("Ø", "ح")
.replaceAll("Ø·", "ط").replaceAll("Ø«", "ث").replaceAll("ج", "ج").replaceAll("Ú†", "چ")
.replaceAll("Ø®", "خ").replaceAll("Ø°", "ذ").replaceAll("ض", "ض").replaceAll("ظ", "ظ")
.replaceAll("ع", "ع").replaceAll("غ", "غ").replaceAll("Ú©", "ک").replaceAll("Ú¯", "گ")
.replaceAll("Ù„", "ل").replaceAll("Ù…","م").replaceAll("Û°", "۰").replaceAll("Û±", "۱")
.replaceAll("Û²", "۲").replaceAll("Û³", "۳").replaceAll("Û´", "۴").replaceAll("Ûµ", "۵")
.replaceAll("Û¶", "۶").replaceAll("Û·", "۷").replaceAll("Û¸", "۸").replaceAll("Û¹", "۹")
.replaceAll("‌", "").replaceAll("Ø¡", "ء").replaceAll("Ø¢","آ").replaceAll("‎","")
.replaceAll("–", "–").replaceAll("ØŒ", "،").replaceAll("ØŒ", "،").replaceAll("Ëš", "˚");
console.log(decoded);
return decoded;
}
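Since the question asks for PHP, the same idea can be sketched without a lookup table. This assumes the garbled text is UTF-8 that was misread as Windows-1252, and that mbstring is loaded; fixMojibake is a hypothetical helper name, not a library function:

```php
<?php
// Hypothetical helper: undo "UTF-8 bytes read as Windows-1252".
function fixMojibake(string $garbled): string
{
    // Map each character back to the single byte it came from.
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

    // Only accept the result if those bytes form valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $garbled;
}

// "ÙŠ" written with escapes: U+00D9 (Ù) followed by U+0160 (Š).
$garbled = "\u{00D9}\u{0160}";
echo bin2hex(fixMojibake($garbled)), "\n"; // d98a, the UTF-8 bytes of ي (U+064A)
```

Windows-1252 leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so characters whose UTF-8 encoding contains one of those bytes may already have been lost before this step, and no conversion can recover them.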
Hi, have a look at this picture: http://ctrlv.in/175196. As you can see, the ÅÄÖ are replaced with �.
I have this at the very top of my PHP: <meta http-equiv="content-type" content="text/html; charset=utf-8"></meta>
and when I look at the source it is indeed utf-8, so why don't they display properly?
When you see the UNICODE REPLACEMENT CHARACTER �, it means your text is being interpreted as UTF-8 (or another Unicode encoding), but one of the byte sequences in the file was not valid in this encoding.
In other words, the file is not UTF-8 encoded.
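A quick way to test this diagnosis is to ask PHP whether the bytes are valid UTF-8. This is a sketch assuming mbstring is available; the two sample strings are illustrative stand-ins for the page content:

```php
<?php
$latin1 = "\xC5\xC4\xD6";             // ÅÄÖ as ISO-8859-1 / Windows-1252 bytes
$utf8   = "\u{00C5}\u{00C4}\u{00D6}"; // ÅÄÖ as UTF-8

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// Assumption: the invalid bytes are Windows-1252; convert them to UTF-8.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'Windows-1252');
var_dump($fixed === $utf8);                    // bool(true)
```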
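Try converting the string to UTF-8. For Swedish text the source encoding is most likely windows-1252 (windows-1250 maps the byte for Å to a different letter):
iconv('windows-1252', 'utf-8', $your_variable);
If the data comes from an SQL query, call set_charset('utf8') on the connection before running the query.
Using Notepad++, try saving the file as UTF-8 without BOM.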
I have looked around and can't seem to find a solution so here it is.
I have the following code:
$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;
This is the simple version, but even with this, any hyphens or apostrophes are turned into: ^a (euro sign) trademark sign.
Things I have tried are:
echo (string)$xmlstr->report_description; /* did not work */
echo addslashes($xmlstr->report_description); /* yes I know this doesn't work with hyphens, was mainly trying to see if I could escape the apostrophes */
echo addslashes((string)$xmlstr->report_description); /* did not work */
I also tried htmlspecialchars (again, I know it does not work with hyphens), htmlentities, and a few other tricks.
Now the situation is I am getting the XML files from a feed so I cannot change them, but they are pretty standard. The text with the hyphens etc are encapsulated in a cdata tag and encoding is UTF-8. If I check the source I am shown the hyphens and apostrophes in the source.
Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.
I am sure that in my rush to find the answer I have overlooked something simple, and since this is the first time I have ever used SimpleXML, I am probably missing a very simple solution. Just don't dock me for it; I really did try to find the answer on my own.
Thanks again.
"This is the simple version, but even trying this any hyphens apostrophes are turned into: ^a (euro sign) trademark sign."
This is caused by incorrect charset guessing (and possibly recoding).
If a text contains a "curly apostrophe" (right single quotation mark, U+2019), saving it in UTF-8 results in the bytes 0xE2 0x80 0x99. If the same file is then read assuming its charset is windows-1252, those bytes are interpreted as the characters ’ (small "a" with circumflex, euro sign, trademark sign). If this misinterpreted text is saved as UTF-8 again, the original character ends up as the byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
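The byte sequences in this explanation can be reproduced with a short sketch, assuming mbstring is loaded; it simulates the misreading by telling PHP that UTF-8 bytes are Windows-1252:

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK, encoded as UTF-8.
$apostrophe = "\u{2019}";
echo bin2hex($apostrophe), "\n"; // e28099

// Misread those three bytes as Windows-1252 and save them as UTF-8 again.
$misread = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');
echo bin2hex($misread), "\n";    // c3a2e282ace284a2
```

Running the conversion in the opposite direction (from 'UTF-8' to 'Windows-1252') restores the original three bytes, which is one way to repair text that has been double-encoded like this.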
Summary: Your original data is UTF-8 and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which is usually actually treated as windows-1252). A probable reason for this charset assumption is that default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1
PS. this is a very common problem. Just do a Google or Bing search with query doesn’t -doesn't and you'll see many pages with this same encoding error.
Do you know the document's character set?
You could call header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you haven't already.
Make sure you have set up SimpleXML to use UTF-8 too.
Be sure that all the entities are encoded using hex notation, not HTML entities.
Also maybe:
$string = html_entity_decode($string, ENT_QUOTES, "utf-8");
will help.
This is a symptom of declaring an incorrect character set in the <head> section of your page (or not declaring and using default character set without accents and special characters).
This does the trick for latin languages.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
For TOTAL NEWBIES, html pages for browsers have a basic layout, with a HEAD or HEADER which serves to tell the browser some basic stuff about the page, as well as preload some scripts that the page will use to achieve its functionality(ies).
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Hello world
</body>
</html>
If the <head> section is omitted, the browser falls back to defaults (it takes some things for granted, such as a North American character set that does NOT include many accented letters, which then show up as "weird characters").
I have a text in utf-8 and I want to decode it, using utf8_decode()
But when I do that I lose part of the text: utf8_decode() decodes the string only until it finds the character –
Any idea how to solve this problem?
Maybe iconv can help you.
†= E2 80 = 1110 0010 1000 0000
If that's literally what was in your UTF-8 text, then it might not be UTF-8. It would need to be followed by one more octet starting 10 to be valid.
That's because an octet starting 1110 introduces a three-octet sequence, with the following octets starting 10, to deliver a total of 16 bits of 'payload' that give the Unicode code point.
EDIT: You've provided the next char as 0x93 = 1001 0011 which would be valid. The UTF-8 sequence 0xE28093 = 0010 00 0000 01 0011 = 0x2013 which is an EN DASH. So, it looks like plausible UTF-8 after all!
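The bit arithmetic in this answer can be checked with a small sketch. decodeThreeByteUtf8 is a hypothetical helper written for this illustration and handles only the three-octet form described here:

```php
<?php
// Hypothetical helper: decode one three-octet UTF-8 sequence to a code point.
function decodeThreeByteUtf8(string $bytes): int
{
    if (strlen($bytes) !== 3) {
        throw new InvalidArgumentException('Expected exactly three bytes.');
    }
    $b0 = ord($bytes[0]);
    $b1 = ord($bytes[1]);
    $b2 = ord($bytes[2]);

    // Lead octet must start 1110, continuation octets must start 10.
    if (($b0 & 0xF0) !== 0xE0 || ($b1 & 0xC0) !== 0x80 || ($b2 & 0xC0) !== 0x80) {
        throw new InvalidArgumentException('Not a three-octet UTF-8 sequence.');
    }

    // 4 + 6 + 6 = 16 payload bits.
    return (($b0 & 0x0F) << 12) | (($b1 & 0x3F) << 6) | ($b2 & 0x3F);
}

printf("U+%04X\n", decodeThreeByteUtf8("\xE2\x80\x93")); // U+2013 (EN DASH)
```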
Perhaps – is not in ISO-8859-1? utf8_decode only handles UTF-8 characters that also exist in ISO-8859-1.
You'll probably want something similar to this:
$string = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
You can read more about iconv in the documentation. Depending on your use, IGNORE might be more useful than TRANSLIT.
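Because the behaviour of TRANSLIT and IGNORE depends on the iconv implementation installed, here is a hedged comparison using mb_convert_encoding instead (a different function from the one in this answer, assuming mbstring is loaded). It shows why the target charset matters for the en dash:

```php
<?php
// Use '?' for characters the target charset cannot represent.
mb_substitute_character(0x3F);

$enDash = "\u{2013}"; // EN DASH, bytes E2 80 93 in UTF-8

// Windows-1252 has an en dash at byte 0x96.
$cp1252 = mb_convert_encoding($enDash, 'Windows-1252', 'UTF-8');
echo bin2hex($cp1252), "\n"; // 96

// ISO-8859-1 has no en dash, so the substitute character is used.
$latin1 = mb_convert_encoding($enDash, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // 3f
```

If the page can be served as Windows-1252 rather than strict ISO-8859-1, the dash survives the conversion without any transliteration.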
Are you sure that EdoDodo's code is not working?
Try to force the browser to handle the output as ISO-8859-1. To do this, you need a UTF-8 encoded file with the string in it (you need this because text editors may add an invisible UTF-8 BOM, and the browser may then switch to UTF-8 despite the declared ISO-8859-1), and another file with the PHP code in ANSI encoding (I am using Notepad++ to be sure that the encoding is right: it detects the file's encoding, shows it in the lower right corner, and can convert between encodings too).
So create a file in utf-8 encoding called utf8.txt with just the string:
–
And create an ANSI encoded index.php file with this content:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<?php
$str = file_get_contents('utf8.txt');
echo "iconv(//IGNORE//TRANSLIT): " . iconv("UTF-8", "ISO-8859-1//IGNORE//TRANSLIT", $str) . "<br>\n";
?>
</body>
</html>
For webpages, I strongly recommend always using UTF-8 encoding, even if the content is in English.