I use cURL to fetch content from another site, but I don't know why it seems to be automatically converted from UTF-8 to ISO 8859-1, as shown below.
On the site (abc.com):
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
But when I fetch that page with cURL, I get the following:
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
So how can I convert it back to UTF-8?
I'd recommend using iconv.
iconv --list gives you a list of all known encodings, and you can then use iconv -f FROM_ENCODING -t TO_ENCODING to do your conversion. It can also read from stdin, so the output of curl can be piped into it.
But regarding the comment you got on your question: it seems the file author didn't care about using the correct encoding and decided to stick with (old-style?) &auml; and the like.
Put your string in a variable and use the following function.
$var = "";
echo utf8_encode($var);
Judging from the line you pasted, the problem appears to be with HTML entities, not with character encoding. The encoded chars look fine to me.
You need to translate those HTML entities to encoded chars. Which tool to use will depend on your environment or programming language. I don't think it can be done with cURL alone.
PHP has html_entity_decode() (htmlspecialchars_decode() only handles &, <, > and quotes). Python has unescape() in the html module (in the HTMLParser module on Python 2).
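As a rough illustration of that decoding step in Python 3, here is a minimal sketch using html.unescape() from the standard library. The sample string is made up for this example (it is not the actual page content), and it assumes the fetched text mixes numeric and named character references.

```python
import html

# Hypothetical sample: a fragment as it might arrive from the remote page,
# with a numeric reference (&#7917;) and a named one (&agrave;).
fetched = "C&#7917;a H&agrave;ng Chip Chip"

# html.unescape() replaces character references with the characters they name.
decoded = html.unescape(fetched)

# Encode explicitly when writing bytes out, so the stored result is UTF-8.
utf8_bytes = decoded.encode("utf-8")

print(utf8_bytes)
```

The decoded value is an ordinary Unicode string; the encoding only matters again at the point where it is written to a file or sent over the network.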
curl does not convert anything; it downloads things "as is".
What you see are character entities, which are valid HTML, and it is the browser that converts them to a readable form.
You can check this by opening the file saved by curl in a browser. It will look like the live page.
You can try this:
html_entity_decode($string)
See more here: html_entity_decode
Your files aren’t being converted to another encoding. They’re using HTML character entities. You need to convert those entities, such as &eacute;, to UTF-8 characters, such as é. This takes one extra line of code after you convert to UTF-8, if you even need to do that.
I'm using PHP for this web development project. Right now, I'm working on a user page, where the user can add words that he knows. Of course, I'm starting out crude, without adding any special features yet like Do you know this Character suggestion, etc.
I have tackled the challenge of setting the charset and collation to UTF-16 in my MySQL database (hosted online at http://freemysqlhosting.net) to support Chinese characters on my website. What I'm struggling with now is supporting automatic Pinyin generation for my Chinese characters.
I have found this after searching all over SO: https://github.com/reorx/pinyindep/blob/master/Uni2Pinyin. Each line begins with a Chinese character, in UTF-16 Code Units.
Take for example, 爱. In UTF-16, it is 7231. I convert this at https://r12a.github.io/apps/conversion/. When I do a lookup in the file, I get the pinyin associated. :D This is the functionality I need, though looking it up in GitHub is in JS, rather than PHP.
In the manual lookup, ai4 is returned, which is the correct intonation. Now, what I'm looking for is either a PHP built-in library or a code snippet to convert this string input, let's say “爱”, into a four-character UTF-16 code unit, here 7231.
So what's the question:
How should I convert a Chinese character, in form of a string, to UTF-16 code units? (Either through built-in library, or through a suggested PHP Code Snippet)
P.S. I don't really like third-party tools unless they are really popular worldwide, or there's no other option.
You need to use PHP's multibyte string module:
$c = "爱";
list(, $d) = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
echo dechex($d);
// => 7231
Change UTF-8 to UTF-16 if your string is coming from the database in that encoding.
mb_convert_encoding converts the string into a four-bytes-per-character encoding; unpack then converts those four bytes into an unsigned long; finally, dechex converts that number into a hexadecimal string.
If you are using PHP 7.2+ you can use mb_ord to simplify the conversion.
echo dechex(mb_ord("爱"));
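For comparison, the same lookup key can be sketched in Python 3. This is only an illustration and not part of the PHP answer; the helper names code_point_hex and utf16_code_units are made up here. It also shows one caveat: a character outside the Basic Multilingual Plane takes two UTF-16 code units (a surrogate pair), so its code point and its UTF-16 code units differ, and it would be worth checking which of the two the lookup file uses for such characters.

```python
from typing import List


def code_point_hex(ch: str) -> str:
    """Return the Unicode code point of a single character as lowercase hex."""
    return format(ord(ch), "x")


def utf16_code_units(ch: str) -> List[str]:
    """Return the UTF-16 code units of a character as 4-digit hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]


print(code_point_hex("\u7231"))        # U+7231 is the character from the question
print(utf16_code_units("\u7231"))
print(utf16_code_units("\U00020000"))  # a character outside the BMP
```

For U+7231 both helpers give 7231; for U+20000 the second helper returns the surrogate pair d840, dc00 while the code point is 20000.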
I know there are a number of posts about UTF-8 encoding issues, but I am failing to convert a string to UTF-8.
I have a string "beløp" in PHP.
When I print it on screen in an iframe, it is printed as "bel�p".
After that I tried utf8_encode("beløp"); and got the output "bel�p".
Then I tried iconv("UTF-8", "ISO-8859-1", "beløp"); and got the output "bel ".
And finally I tried utf8_encode(utf8_decode("beløp")); and got the output "bel?p".
Please let me know where I am going wrong and how I can fix it.
This
bel�p
is an indication that you are outputting a non-UTF-8 character in a UTF-8 context.
Make sure your file is encoded in UTF-8 (I don't know what editor you're using, but Notepad++ and Sublime Text have a "Save with encoding..." option) and that the top of your HTML page has
<meta charset="utf-8">
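To make the symptom concrete, here is a minimal Python 3 sketch, assuming the source file was saved as ISO-8859-1 while the page is declared as UTF-8. It is an illustration of the mechanism, not a reproduction of the asker's setup.

```python
# "beløp" stored as ISO-8859-1: the ø is the single byte 0xF8.
latin1_bytes = "bel\u00f8p".encode("iso-8859-1")

# Interpreting those bytes as UTF-8 fails at 0xF8; with errors="replace"
# the decoder substitutes U+FFFD, the replacement character.
shown = latin1_bytes.decode("utf-8", errors="replace")

# The same text stored as UTF-8 uses two bytes for ø and decodes cleanly.
utf8_bytes = "bel\u00f8p".encode("utf-8")

print(latin1_bytes)           # b'bel\xf8p'
print(utf8_bytes)             # b'bel\xc3\xb8p'
print(shown == "bel\ufffdp")  # True
```

Once the file itself is saved as UTF-8, the bytes sent to the browser match the declared charset and no replacement character appears.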
Hi, it's fixed. The problem was that my file was not encoded in UTF-8.
I fixed it by replacing "bel�p" with "beløp".
Your conversion does not work because the original "beløp" text was not in ISO-8859-1, and utf8_encode only converts from that encoding. What could work for this type of issue is to use the mb_detect_encoding function (http://php.net/manual/en/function.mb-detect-encoding.php) to find out which encoding the text is originally in, then use iconv to convert from the detected encoding to UTF-8. When this is done, make sure, as mentioned in earlier comments, that UTF-8 is declared as the encoding in the header.
Note that PHP's mb_detect_encoding is not very reliable and can guess the wrong encoding, especially if you do not have a large amount of text. To display all text correctly at all times, make sure that all processing is done in the same encoding. If you get the text from external sources or web services, always check the headers for the correct encoding before the text is processed.
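The advice about checking headers can be sketched in Python 3 as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the header value is a made-up example, and falling back to UTF-8 when no charset is declared is a choice made for the illustration, not a rule.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lower case, or the default if none is declared.
    return msg.get_content_charset(default)


def decode_body(body, header_value):
    """Decode raw response bytes using the charset the server declared."""
    return body.decode(charset_from_content_type(header_value))


# Hypothetical response: ISO-8859-1 bytes with a matching header.
text = decode_body(b"bel\xf8p", "text/plain; charset=ISO-8859-1")
print(text.encode("utf-8"))  # b'bel\xc3\xb8p'
```

Decoding with the declared charset first, and re-encoding to UTF-8 afterwards, avoids guessing the encoding from the bytes themselves.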
I have a Zope/Plone web service (WS) that calls some functions written in Python.
The WS is called by PHP pages (with UTF-8 declared in the header), but special characters aren't displayed correctly.
I've tried converting special chars to entities (in Python) where possible, and that works, but not all chars have corresponding HTML entities.
I've tried saving the original Python file in UTF-8 format, but I thought that wasn't the right way.
Can someone help?
Note: I go through some PHP includes, if that is any hint...
Edit: it's weird, because if I log all the "pieces" individually, the chars are encoded correctly. When I go up to the "main PHP page" (where I include all the pieces), everything gets messed up.
Obviously, the "main php page" has that:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
496e73e972657220646174652064926172726976e9652065742064652064e970617274
That string is encoded in ISO-8859-1, not UTF-8.
Somewhere you're converting your strings to ISO-8859-1, which means they're not interpreted correctly when trying to interpret them as UTF-8, and all non-European characters will be discarded since ISO-8859-1 can't encode anything but a handful of European characters.
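The hex dump above can be checked with a short Python 3 sketch. One assumption is labeled here: the byte 0x92 is a control character in strict ISO-8859-1 but a right single quotation mark in Windows-1252, so the sketch decodes with cp1252, which is a guess about the real source encoding.

```python
# Hex dump copied from the thread.
hex_dump = "496e73e972657220646174652064926172726976e9652065742064652064e970617274"
raw = bytes.fromhex(hex_dump)

# The bytes are not valid UTF-8: 0xE9 starts a multi-byte sequence
# but is followed by a plain ASCII byte.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: the single-byte encoding is Windows-1252.
text = raw.decode("cp1252")

# Re-encoding the decoded text gives the UTF-8 bytes the PHP page expects.
as_utf8 = text.encode("utf-8")

print(is_valid_utf8)  # False
print(as_utf8)
```

The decoded text is a French sentence with é as the single byte 0xE9, which is what a single-byte Western European encoding produces and what UTF-8 does not.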
I just edited Python's site.py file.
I followed this guide: click here, and everything is OK now.
Thank you all for your help.
If I have
<p id='test'>TEST&trade;</p>
and I use
document.getElementById('test').innerHTML;
to pass the HTML to a PHP function, which extracts all of the text nodes using DOMDocument and XPath.
When PHP gets the content, the &trade; has been converted to ™. I run it through XPath and the text node comes back as:
TESTâ„ ¢
I am not sure what is going wrong, or whether there is a way to fix it on the JavaScript side so that it passes &trade; rather than ™.
Any help is appreciated.
Your variable is being passed with the actual ™ character, not with &trade;; running it through htmlentities() in PHP should take care of it.
You could try using the HTML numeric (Unicode) form.
Example:
<p id='test'>&#8482;</p>
Read this page for more examples of the Unicode TM character:
http://www.fileformat.info/info/unicode/char/2122/index.htm
Hope this helps.
You need to be more precise than saying it "comes back as". The ™ appears to have been written somewhere in UTF-8 encoding, and the same bytes have then been read by something that doesn't realise they are in UTF-8 encoding, and is assuming they are Latin-1 or similar. To solve the problem you will need to look very carefully at the configuration of the software that wrote the character and the software that read it.
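The mismatch described above can be reproduced in a few lines of Python 3. This is a sketch of the mechanism only; it assumes the reading side used Windows-1252, which matches the "â„¢" pattern in the question but is not confirmed by it.

```python
trademark = "\u2122"  # the trade mark sign

# Written out as UTF-8, the character becomes three bytes.
utf8_bytes = trademark.encode("utf-8")

# Read back by something that assumes Windows-1252, each byte
# turns into a separate character.
misread = utf8_bytes.decode("cp1252")

# Undoing the wrong step recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")

print(utf8_bytes)             # b'\xe2\x84\xa2'
print(len(misread))           # 3
print(repaired == trademark)  # True
```

The repair line is only a diagnostic aid; the lasting fix is to make the writer and the reader agree on UTF-8.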
What Michael said is true; in addition you should be aware that XML processors are basically required to convert character entities (like &tm;) to their actual character values, and will (almost) always produce output with those characters encoded in some prevailing character set. It takes heroic measures to prevent this, and is usually not a "good idea". So you should abandon attempts to do that, and my guess is that you would be better served by making sure that the function you are passing the HTML to is told to interpret it as utf-8 not some other charset (which may just be the system default).
I have a form served in non-UTF-8 (it’s actually in Windows-1251). People, of course, post there any characters they like to. The browser helpfully converts the unpresentable-in-Windows-1251 characters to HTML entities so I can still recognise them. For example, if a user types an →, I receive an &#8594;. That’s partially great, like, if I just echo it back, the browser will correctly display the → no matter what.
The problem is, I actually do a htmlspecialchars() on the text before displaying it (it’s a PHP function to convert special characters to HTML entities, e.g. & becomes &amp;). My users sometimes type things like &mdash; or &copy;, and I want to display them as the literal text &mdash; or &copy;, not as — and ©.
There’s no way for me to distinguish an &#8594; from →, because I get them both as &#8594;. And, since I htmlspecialchars() the text, and I also get a &#8594; for a → from the browser, I echo back an &amp;#8594; which gets displayed as &#8594; in a browser. So the user’s input gets corrupted.
Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?
Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.
Thanks!
The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities
Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.
I actually do a htmlspecialchars () on the text before displaying it
Yes. You must do that, or else you've got a security problem.
Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself
Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.
I know that the good idea is to switch the whole software to UTF-8,
Yup. Well, at least the encoding of the page containing the form should be UTF-8.
<form action="action.php" method="get" accept-charset="UTF-8">
<!-- some elements -->
</form>
All browsers should return the values in the encoding specified in accept-charset.
You check to see if the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, do whatever you want to with it. I would do this by looking at each character &, #, 8, 5, 9, 4, and parsing it into something you can apply something to.
Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like &copy; you will want to handle it differently than &#8594; because the second one has the # in it.
I think this answers your question.
The html_entity_decode function is probably what you want.
You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false to avoid the character references being encoded again.
Or you first decode those existing character references.
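The idea of escaping without re-encoding existing references can be sketched in Python 3. Python's html.escape() has no equivalent of PHP's double_encode flag, so the helper below (escape_keep_references, a made-up name) approximates it with a regular expression; treat it as an illustration under that assumption rather than a drop-in equivalent.

```python
import html
import re

# Matches decimal, hexadecimal and named character references.
_REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def escape_keep_references(text):
    """Escape HTML special characters but leave existing references alone."""
    parts = []
    pos = 0
    for match in _REFERENCE.finditer(text):
        # Escape the plain text before the reference, then copy the reference.
        parts.append(html.escape(text[pos:match.start()], quote=True))
        parts.append(match.group(0))
        pos = match.end()
    parts.append(html.escape(text[pos:], quote=True))
    return "".join(parts)


print(html.escape("a &#8594; b & c"))             # a &amp;#8594; b &amp; c
print(escape_keep_references("a &#8594; b & c"))  # a &#8594; b &amp; c
```

Note that this cannot tell a reference generated by the browser from one the user typed as literal text, so text such as &copy; entered by a user will be displayed as the symbol.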
You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. Especially the mb_convert_encoding() to move it from windows-1251 to UTF-8, or where ever.
I don't quite understand your question though, because if someone enters & as a text string, htmlspecialchars() should convert it to &amp; ... which, when run back through html_entity_decode(), would come out as the text string the user entered.
This is of course if you haven't used the double_encode option when running your string through htmlspecialchars().
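The conversion from Windows-1251 to full Unicode text can be sketched in Python 3 as below. The function name form_value_to_text and the sample bytes are made up for illustration, and the sketch assumes the submitted value arrives as Windows-1251 bytes with out-of-range characters already turned into numeric references by the browser.

```python
import html


def form_value_to_text(raw, page_charset="cp1251"):
    """Decode a submitted form value and resolve its character references."""
    return html.unescape(raw.decode(page_charset))


# Hypothetical submission: a Russian word in Windows-1251 followed by the
# reference the browser substituted for a right arrow.
raw = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251") + b" &#8594;"
text = form_value_to_text(raw)

print(text.encode("utf-8"))
```

The same limitation applies as elsewhere in this thread: references the user typed as literal text are resolved too, because they are indistinguishable from the ones the browser inserted.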
mbstring supports the "charset" HTML-Entities
// $out: the submitted text after conversion from HTML-ENTITIES to UTF-8 (assumed)
for ($i = 0; $i < strlen($out); $i++) {
    printf('%02X ', ord($out[$i]));
}
prints
61 20 E2 86 92 20 62 20 26 20 63
E2 86 92 is the byte sequence for → (RIGHTWARDS ARROW) in UTF-8.
You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity because they look identical. The real solution is to give up on Windows 1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding and all these problems should just go away.