UTF-8 and HTML entities

UTF-8 and HTML entities - php

I try to eject text from Word .DOC file with PHP. All seems ok, but the only trouble is something like
СУДОВА БУХГАЛТЕРІЯ
instead of russian text. I've tried to use html_entity_decode and utf8_encode, but they didn't help. Is there any simple solution?

html_entity_decode should work with the proper parameters (unless you’re using PHP 5.3.3 or later):
html_entity_decode($str, ENT_QUOTES, 'UTF-8')
This will convert the character references into UTF-8. Before PHP 5.3.3, the charset parameter’s default value was ISO-8859-1. In that case the cyrillic characters can’t be converted as the ISO 8859-1 character set doesn’t contain them.

Related

Convert the Chinese Characters From ISO-8859-1 To UTF-8

I got a system which previously the html encoding type was set as ISO-8859-1 and it caused all the Chinese characters store in the format of "&\#36830;&\#34915;&\#35033;".
So my question is, how can I convert the format above into Chinese word back in UTF-8?
For your information, I had tried with utf8_decode, iconv, but none of them work. :(
Thank you very much.

The current text encoding of that string is rather insubstantial. What you have there are HTML entities; they have little to do with the underlying "physical" encoding like ISO-8859 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case to UTF-8. Therefore:
echo html_entity_decode('连衣裙', ENT_COMPAT, 'UTF-8');
// 连衣裙

You need to use:
utf8_encode($data);
and not decode,to convert your current ISO-8859-1 to UTF-8.
Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always function correctly with UTF-8 strings. Possible solutions: convert to latin first or add the following line to your code:
setlocale(LC_CTYPE, 'C');
Make sure not to save your PHP files using a BOM (Byte-Order Marker) UTF-8 file marker (your browser might show these BOM characters between PHP pages on your site).
Just for your reference:
ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish
UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian

There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.

PHP, HTML and character encodings

I actually have a fairly simple question but I'm unable to find an answer anywhere. The PHP function html_entity_decode is supposed to "converts all HTML entities to their applicable characters from string."
So, since Ω is the HTML encoding for the Greek captical letter Omega, I'd expect that echo html_entity_decode('Ω', ENT_COMPAT, 'UTF-8'); would output Ω. But instaid, it outputs some strange characters which my browser can't recongize. Why is this?
Thanks,
Martijn

When you convert entities into UTF-8 characters like your last parameter specifies, your output encoding must be UTF-8 as well. Otherwise, in a single-byte encoding like ISO-8859-1, you will see double-byte characters as two broken single ones.

It's works fine:
http://codepad.viper-7.com/tb2LaW
Make sure your webpage encoding is UTF-8
If you have different encoding on webpage change this:
html_entity_decode('Ω', ENT_COMPAT, 'UTF-8');
^^^^^

header('Content-type: text/html;charset=utf-8');
mysql_set_charset("utf8", $conn);
Refer this URL:-
http://www.phpwact.org/php/i18n/charsets
php mysql character set: storing html of international content

PHP: how do I convert foreign characters from simple_html_dom to UTF8?

I'm having some trouble with a string that comes from a webpage having foreign characters in it.
The string is generated by parsing the webpage using str_get_html(), followed by $htmldom->innertext; (simple_html_dom class library).
When I output the string using htmlentities() it is displayed fine; but using explode() on the string and printing the parts, I get a tilted block with a question mark in it for each foreign character.
I need to store the string in a utf8 MySQL database, so I need the right foreign characters.
My page has a header with utf8 character set.
I have already tried mb_split() and preg_split(), but those have the same problem.

I solved the issue with :
https://github.com/neitanod/forceutf8
It has a great function that just converts anything to utf-8, no matter what source it's from (as long as it comes in Latin1 (iso 8859-1), Windows-1252 or UTF8 already, or a mix of them).
Many thanks go to Sebastian Grignoli.

PHP and UTF-8 isn't a very good combination. Some functions work fine with UTF-8, others don't, and the worst are those that are documented to work, but in fact do not (such as DOMDocument ).
You can use mb_convert_encoding() to convert multibyte characters to HTML entities, which usually provides an acceptable workaround:
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');

How to use PHP htmlentities()?

In my project I currently use htmlentities() to filter data coming from the database:
echo htmlentities($variable_name);
I am in the USA and this works fine for me. My friend is in Brazil and for him some text characters don't show up correctly.
How can I use htmlentities() so it internationalizes properly?

The problem could be that the output is not encoded in UTF-8. According to the php docs for htmlentities, the function
takes an optional third argument
charset which defines character set
used in conversion. Presently, the
ISO-8859-1 character set is used as
the default.
So you can try calling
htmlentities($string, ENT_COMPAT, 'UTF-8');
instead, and that might fix the problem, since it's not the default character encoding.

While I suspect Keoki has it correct, another possible problem could be the font. If using a special character where your friend's font doesn't contain that character, they'll just see the missing character sign. In the webpage or whatever medium you are using to post the character, be sure that a font is set, as there's no guarentees on the default font working.
If neither of these be the case though, what is an example character that isn't showing up? Can you post the full code you are using?

You can also try iconv to Convert string to requested character encoding
http://www.php.net/manual/en/function.iconv.php

Convert foreign characters with accents

I'm trying to compare some text to the text in a database. In the database any text with an accent is encoded like in HTML (i.e. é) when I compare the database text to my string it doesn't match because my string just shows é. When I use the PHP function htmlentities to encode the string first the é turns into Ã© weird? Using htmlspecialchars doesn't encode the é at all.
How would you suggest I compare é to é as well as all the other accented characters?

You need to send in the correct charset to htmlentities. It looks like you're using UTF-8, but the default is ISO-8859-1. Change it like this:
$encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');
Another solution is to convert the text to ISO-8859-1 before encoding, but that may destroy information (ISO-8859-1 does not contain nearly as many characters as UTF-8). If you want to try that instead, do like this:
$encoded = htmlentities(utf8_decode($text));

I'm working on french site, and I also had same problem. This is the function that I use.
function convert_accent($string)
{
return htmlspecialchars_decode(htmlentities(utf8_decode($string)));
}
What it does it decodes your string to utf8, than converts everything HTML entities. even tags. But we want to convert tags back to normal, than htmlspecialchars_decode will convert them back. So in the end you will get a string with converted accents without touching tags.
You can use pass through this function your email content before sending it to recipent.
Another issue you might face is that, sometimes with this function the content from database converts to ? . In this case you should do this before running your query:
mysql_query("SET NAMES `utf8`");
But you might need to do it, it depends on encoding in your table. I hope it helps.

The comparing task is related to the charset and the collation you selected when you create the database or the tables. If you are saving strings with a lot of accents like spanish I sugget you to use charset uft8 and the collation could be the more accurate to the language(english, french or whatever) you're using.
The best thing of using the correct charset in the database is that you can save the string in natural way e.g: my name I can store it as is "Mario Juárez" and I have no need of doing some weird conversions.

Ran into similar issues recently. Followed Emil's answer and it worked fine locally but not on our dev/stage environments. I ended up using this and it worked all around:
$title = html_entity_decode(utf8_decode($item));
Thanks for leading me in the right direction!

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

UTF-8 and HTML entities - php

I try to eject text from Word .DOC file with PHP. All seems ok, but the only trouble is something like СУДОВА БУХГАЛТЕРІЯ instead of russian text. I've tried to use html_entity_decode and utf8_encode, but they didn't help. Is there any simple solution?

Related

Convert the Chinese Characters From ISO-8859-1 To UTF-8

PHP, HTML and character encodings

PHP: how do I convert foreign characters from simple_html_dom to UTF8?

How to use PHP htmlentities()?

Convert foreign characters with accents

Categories

Resources