PHP - Convert Unicode string to HTML entities

I need to convert a Unicode string to HTML entities.
Let's say the string is: ιи¢яє∂ιвℓє ѕкιℓℓѕ
I want to convert it to &#953;&#1080;&#162;&#1103;&#1108;&#8706;&#953;&#1074;&#8467;&#1108; &#1109;&#1082;&#953;&#8467;&#8467;&#1109;
so that I can store it in my SQL db and then output the desired text later.
This basically has the functionality I want, where you type in a string and it converts it: http://www.online-toolz.com/tools/unicode-html-entities-convertor.php
I'm not very familiar with UTF-8/ASCII/different types of special characters in general, but I'd like it to be able to handle all the Unicode characters here: https://unicode-table.com/en/
Sorry if this is a dumb question, but I tried all combinations of utf8_decode, htmlentities, htmlspecialchars and mb_convert_encoding and I can't get the desired result.
Thank you for looking!
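One possible approach, sketched here rather than taken from the thread: the mbstring function mb_encode_numericentity can turn every non-ASCII character of a UTF-8 string into a decimal numeric entity. The helper name to_numeric_entities and the conversion map covering U+0080 through U+10FFFF are assumptions made for this illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper (not from the thread): encode every non-ASCII character
// of a UTF-8 string as a decimal numeric entity such as &#953;.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$original = 'ιи¢яє∂ιвℓє ѕкιℓℓѕ';
$encoded  = to_numeric_entities($original);
echo $encoded, "\n";

// Decoding with an explicit UTF-8 target restores the original text.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $original);
```

The encoded form is plain ASCII, so it survives a database column or connection that is not set up for UTF-8. If the whole stack is UTF-8, storing the text unencoded is usually simpler.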

Related

PHP Converting UTF-8 to TeX/LaTeX Formatting?

I have some strings that include encoding that would work with TeX (for example, they look like "Pi/~na Colada" instead of Piña Colada). Is there a simple way to convert this to show properly without creating my own function to convert the characters?
No.
But:
The tex.stackexchange.com wiki has a decent list of TeX accents.
Then you just need to correlate them with their UTF-8 combining mark.
Make sure you move the combining mark to after the character you want it combined with.
e.g. "/~n" becomes "n" . $combining_mark
You might then want to run it through intl's Normalizer in NFC form to compose the character into a single codepoint, if it exists.
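The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a complete converter: the function name tex_accents_to_utf8 and the five-entry accent map are hypothetical, the "/~n" notation follows the example in the question (real TeX uses a backslash), and the Normalizer step only runs when the intl extension is available.

```php
<?php
// Hypothetical helper: replace accent sequences such as "/~n" with the base
// letter followed by a Unicode combining mark, then compose with NFC.
function tex_accents_to_utf8(string $text): string
{
    // Partial, illustrative map from accent symbol to combining mark.
    $marks = [
        '~' => "\u{0303}", // combining tilde
        "'" => "\u{0301}", // combining acute accent
        '`' => "\u{0300}", // combining grave accent
        '^' => "\u{0302}", // combining circumflex accent
        '"' => "\u{0308}", // combining diaeresis
    ];

    // Move the mark to after the letter it should combine with.
    $result = preg_replace_callback(
        '#/([~\'`^"])([A-Za-z])#',
        function (array $m) use ($marks): string {
            return $m[2] . $marks[$m[1]];
        },
        $text
    );

    // Compose letter + mark into a single code point where one exists.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($result, Normalizer::FORM_C);
        if ($normalized !== false) {
            $result = $normalized;
        }
    }
    return $result;
}

echo tex_accents_to_utf8('Pi/~na Colada'), "\n";
```

Without intl the output still renders as the accented letter in most environments, but it stays in decomposed form (letter plus combining mark).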

Converting Symbols (Copyright, Reg etc)

I am trying to sanitise database input and found a problem with the Ⓡ character.
Ⓡ converts to
&#9415;
even with html_entity_decode around the variable.
This is a problem because the field is only meant to allow 4 characters in the database.
® actually works, though, and is treated as a single character.
I have the same problem with Ⓒ vs ©.
As far as I know they are just HTML entities, so they should be decoded. However, they aren't even encoded with htmlspecialchars(). It just echoes out the code
&#9415;
Does PHP have any built-in functions to solve this? Thanks
Edit just to say what I am trying to do:
I have text fields to input and add to a database which displays in a table below.
When I enter any other character like < > &, it enters straight into the database as one character.
I am trying to make Ⓡ and Ⓒ always go in as one character as well (instead of 6).
I am only encoding on output in the table so certain characters don't break the website.
html_entity_decode probably fails to decode the entity because its target character set is still ISO-8859-1, which was the default before PHP 5.4. ISO-8859-1 cannot encode "Ⓡ" (CIRCLED LATIN CAPITAL LETTER R), but it can encode "®" (REGISTERED SIGN).
So, first, to decode it correctly:
html_entity_decode('&#9415;', ENT_COMPAT, 'UTF-8')
But secondly, "Ⓡ" and "®" are not the same character, and you probably don't want "Ⓡ".
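A small sketch of both points, assuming the stored value is the decimal entity &#9415; and that the mbstring extension is available for counting characters; the variable names are illustrative only.

```php
<?php
$entity = '&#9415;'; // numeric entity for U+24C7, the circled letter R

// With an explicit UTF-8 target, the entity decodes to a single character.
$circled = html_entity_decode($entity, ENT_COMPAT, 'UTF-8');
var_dump(mb_strlen($circled, 'UTF-8'));

// ISO-8859-1 has no slot for U+24C7, so the entity is left untouched.
$latin1 = html_entity_decode($entity, ENT_COMPAT, 'ISO-8859-1');
var_dump($latin1 === $entity);

// The registered sign is a different character (U+00AE).
$registered = html_entity_decode('&reg;', ENT_COMPAT, 'UTF-8');
var_dump($circled === $registered);
```

Note that a single character in UTF-8 can still occupy several bytes, so a length limit of four only behaves as expected if the database column counts characters rather than bytes.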

Cannot convert special (ASCII/UTF-8) characters

I am trying to take a group of characters that are ASCII and include things like ☺ ☻ ♥.
When I try to echo them out with PHP I get ? for each one. When I try to use htmlentities() it works for the heart but returns jumbled output for ☺.
What I am ultimately trying to do is convert a string of text that I have no control over, but which is in ASCII, display it with HTML, and store it with SQL.
I am sorry if this is a poorly formed question, but I am not sure how this whole area of conversion works.
Thanks.

decoding ISO characters

I have Chinese characters encoded in ISO-8859-1, for example &#20860; = 兼
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the Handlebars template engine to put the data on the page.
When I look at the AJAX page the characters are displayed correctly, although the source is still encoded.
But the final result displays the encoded characters.
I tried to decode them in JavaScript with unescape, but the template has no foreach that would let me decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode, but without success.
Both pages are encoded in ISO-8859-1. I can change them to UTF-8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters as HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters; ISO-8859-1 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
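As a rough sketch of how the PHP side of the AJAX endpoint might apply this: the $row array and its field names are hypothetical stand-ins for a database result, and the header call is shown as a comment because it only makes sense inside a web request.

```php
<?php
// Hypothetical row as it might come from the database, with entity-encoded text.
$row = ['id' => 1, 'title' => '&#20860;'];

// Decode every string field to real UTF-8 characters.
$decoded = array_map(function ($value) {
    return is_string($value)
        ? html_entity_decode($value, ENT_QUOTES, 'UTF-8')
        : $value;
}, $row);

// In a web request, declare the encoding before any output is sent:
// header('Content-Type: application/json; charset=utf-8');

// JSON_UNESCAPED_UNICODE keeps the character readable instead of \u517c.
$json = json_encode($decoded, JSON_UNESCAPED_UNICODE);
echo $json, "\n";
```

Either JSON form (escaped or unescaped) parses to the same string in the browser; the flag only affects how the payload looks on the wire.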
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}

How to parse unicode format (e.g. \u201c, \u2014) using PHP

I am pulling data from the Facebook graph which has characters encoded like so: \u201c and \u2014
Is there a function to convert those characters into HTML? i.e. \u2014 -> &mdash;
If you have some further reading on these character codes, or suggested reading about Unicode in general, I would appreciate it. This is so confusing to me. I don't know what to call these codes... I guess Unicode, but Unicode seems to mean a whole lot of things.
That's not entirely true, bobince.
How do you handle JSON containing Spanish accents?
There are two problems.
I make FB.api(url, function(response)
... var s=JSON.stringify(response);
and pass it to a php script via $.post
First, I get a truncated string, so I need escape(JSON.stringify(response)).
Then I get a full JSON-encoded string with Spanish accents.
As a test, I put it in a text file, load it with file_get_contents, apply PHP's json_decode, and get nothing.
You first need utf8_encode.
And then you get the long-awaited object of your desire.
After a full day of testing and googling without managing to decode Unicode properly, I found your post.
So many thanks to you.
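The "get nothing" symptom above is consistent with json_decode rejecting input that is not valid UTF-8. A minimal sketch, assuming the file was saved as ISO-8859-1: the sample string is made up for illustration, and mb_convert_encoding stands in for utf8_encode, which is deprecated as of PHP 8.2.

```php
<?php
// Sample JSON whose accented character is a single ISO-8859-1 byte (0xE9 = é).
$latin1Json = "{\"name\":\"Jos\xE9\"}";

// json_decode requires UTF-8 input, so this fails and returns null.
$failed = json_decode($latin1Json, true);
var_dump($failed);
var_dump(json_last_error() === JSON_ERROR_UTF8);

// Convert ISO-8859-1 to UTF-8 first, then decode.
$utf8Json = mb_convert_encoding($latin1Json, 'UTF-8', 'ISO-8859-1');
$data = json_decode($utf8Json, true);
var_dump($data['name']);
```

This conversion is only correct when the input really is ISO-8859-1; applying it to text that is already UTF-8 double-encodes the accents.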
Someone asked me to fix Arabic text from a Facebook JSON archive. Maybe this code helps anyone trying to read Arabic text from Facebook (or Instagram) JSON:
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';

// Turn each \uXXXX escape into the UTF-8 encoding of that code point.
function decode_encoded_utf8($string) {
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function ($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
// The archive escapes each UTF-8 byte as its own code point, so converting
// back to ISO-8859-1 recovers the original UTF-8 bytes of the Arabic text.
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));
Facebook Graph API returns JSON objects. Use json_decode() to read them into PHP and you do not have to worry about handling string literal escapes like \uNNNN. Don't try to decode JSON/JavaScript string literals by yourself, or extract chosen properties using regex.
Having read the string value, you'll have a UTF-8-encoded string. If your target HTML is also UTF-8-encoded, you don't need to replace — (U+2014) with any entity reference. Just use htmlspecialchars() on the string when outputting it, so that any < or & characters in the string are properly encoded.
If you do for some reason need to produce ASCII-safe HTML, use htmlentities() with the charset arg set to 'utf-8'.
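A short sketch of that advice, using a made-up JSON payload in place of a real Graph API response (the "message" key and its text are illustrative only):

```php
<?php
// Made-up payload; the \uNNNN escapes are part of the JSON text itself.
$json = '{"message":"\u201cHello\u201d \u2014 <b>world</b>"}';

// json_decode resolves the escapes and returns UTF-8 strings.
$data = json_decode($json, true);
$text = $data['message'];

// For a UTF-8 page: escape only the characters that are special in HTML.
$utf8Html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// For ASCII-safe HTML: also turn characters that have named entities
// (quotes, dashes, accented letters) into entity references.
$asciiHtml = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $utf8Html, "\n";
echo $asciiHtml, "\n";
```

htmlentities only covers characters that have a named entity in the selected HTML version, so text outside that set still passes through as raw UTF-8.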
