Converting HTML Entities in UTF-8 to SHIFT_JIS

I am working with a website that needs to target old Japanese mobile phones that are not Unicode-enabled. The problem is that the text for the site is saved in the database as HTML entities (e.g., &#1234;). This database absolutely cannot be changed, as it is used by several hundred websites.
What I need to do is convert these entities to actual characters and then convert the string encoding before sending it out, because the phones render the entities literally instead of converting them first.
I've tried both mb_convert_encoding and iconv, but all they do is convert the encoding of the entity text itself; they do not turn the entities into characters.
Thanks in advance
EDIT:
I have also tried html_entity_decode. It is producing the same results - an unconverted string.
Here is the sample data I am working with.
The desired result: シェラトン・ヌーサリゾート＆スパ
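The HTML Codes: &#12471;&#12455;&#12521;&#12488;&#12531;&#12539;&#12492;&#12540;&#12469;&#12522;&#12478;&#12540;&#12488;&#65286;&#12473;&#12497;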
The output of html_entity_decode([the string above],ENT_COMPAT,'SHIFT_JIS'); is identical to the input string.

Just take care you're creating the right codepoints out of the entities. If the original encoding is UTF-8 for example:
$originalEncoding = 'UTF-8'; // that's only assumed, you have not shared the info so far
$targetEncoding = 'SHIFT_JIS';
$string = '... whatever you have ... ';
// superfluous, but to get the picture:
$string = mb_convert_encoding($string, 'UTF-8', $originalEncoding);
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
$stringTarget = mb_convert_encoding($string, $targetEncoding, 'UTF-8');
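The same two steps can be sketched as a small self-contained function. This is only an illustration: the name entities_to_sjis is made up, the input is assumed to be UTF-8 (or plain ASCII) text containing numeric entities, and 'SJIS' is the mbstring name used here for Shift_JIS.

```php
<?php
// Hypothetical helper: decode entities to UTF-8 characters, then re-encode as Shift_JIS.
function entities_to_sjis(string $html): string
{
    // Step 1: turn entities such as &#12471; into real UTF-8 characters.
    $utf8 = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    // Step 2: convert the UTF-8 string to the target encoding.
    return mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
}

// A few katakana characters written as decimal entities.
$sjis = entities_to_sjis('&#12471;&#12455;&#12521;&#12488;&#12531;');

// Round-trip back to UTF-8 to check that nothing was lost on the way.
echo mb_convert_encoding($sjis, 'UTF-8', 'SJIS'), "\n";
```

Decoding with 'UTF-8' as the charset matters: html_entity_decode needs a target charset that can represent the characters, which Shift_JIS support in that function does not reliably provide, so the re-encoding is left to mbstring.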

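I found this function on php.net (updated here to use preg_replace_callback and mb_chr, since the /e modifier was removed in PHP 7 and chr() only produces single bytes); it works for me with your example:
function unhtmlentities($string) {
// replace numeric entities with the matching UTF-8 character (mb_chr needs PHP 7.2+)
$string = preg_replace_callback('~&#x([0-9a-f]+);~i', function ($m) { return mb_chr(hexdec($m[1]), 'UTF-8'); }, $string);
$string = preg_replace_callback('~&#([0-9]+);~', function ($m) { return mb_chr((int) $m[1], 'UTF-8'); }, $string);
// replace named entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}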

I think you just need html_entity_decode.
Edit: Based on your edit:
// create_function() was removed in PHP 8, so an anonymous function is used instead
$output = preg_replace_callback("/(&#[0-9]+;)/", function ($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $original_string);
Note that this is just your first step, to convert your entities to the actual characters.

Just to chip in, since I ran into a similar encoding bug while coding, I would suggest this snippet:
$string_to_encode=" your string ";
if(mb_detect_encoding($string_to_encode)!==FALSE){
$converted_string=mb_convert_encoding($string_to_encode,'UTF-8');
}
Maybe not the best for a large amount of data, but still works.

Related

Trouble decoding some special characters

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually, if I take this directly and use htmlentities to encode it, I see different entities to begin with:
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see that the characters I'm looking for are not listed. When I test a few characters from this table, they decode just fine. I guess the list in PHP is incomplete by default?

If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr((int) $m[1]));
}, $str);
echo $str;
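A slightly more defensive variant is sketched below. It only remaps entities in the 128-159 range (the control-character block discussed above) and leaves everything else to html_entity_decode. The function name decode_cp1252_entities is made up for illustration, and Windows-1252 is an assumption about the source; it agrees with cp1250 on the quotation-mark bytes used here.

```php
<?php
// Hypothetical helper: remap numeric entities in the 128-159 range, which are
// control characters in Unicode, through Windows-1252 (an assumption about the source).
function decode_cp1252_entities(string $text): string
{
    $text = preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $code = (int) $m[1];
        if ($code >= 128 && $code <= 159) {
            // Treat the number as a Windows-1252 byte and convert that byte to UTF-8.
            return mb_convert_encoding(chr($code), 'UTF-8', 'Windows-1252');
        }
        return $m[0]; // leave every other entity for html_entity_decode()
    }, $text);

    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

echo decode_cp1252_entities('a&#146;n e&#148;xa&#147;mple &#233;'), "\n";
```

Restricting the remapping to that range avoids passing larger code points through chr(), which only ever produces a single byte.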

mb_strtoupper displaying question mark

Hi I'm having a problem converting special characters to upper case.
With regular strtoupper I get something like DANIëL and when applying mb_strtoupper I get DANI?L.
Here's the code:
mb_strtoupper(rtrim($pieces[1], ","), 'UTF-8')
Mind you, I already have this running on the input:
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $tr->TD[0])
Could this be the reason? Or is there something else?

This is a typical issue of trying to uppercase a Latin-1 string when the function expects UTF-8.
Be sure to check the encoding of your string source. This sample works if your text editor saves the file in the Latin-1 code page rather than in UTF-8:
$str = "daniël"; //or your rtrim($pieces[1],",")
$str = mb_convert_encoding($str,'UTF-8','Latin1');
echo mb_strtoupper($str, 'UTF-8');
//will echo DANIËL
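To see why the question mark appears, here is a minimal sketch, assuming the mbstring extension is available. The variable names are illustrative only. It reproduces the situation from the question by converting a UTF-8 string to ISO-8859-1 first.

```php
<?php
$utf8 = "dani\u{00EB}l"; // "daniël" as UTF-8, where ë takes two bytes
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // ë becomes the single byte 0xEB

// The Latin-1 bytes are not valid UTF-8, so a UTF-8 aware function cannot interpret them.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Converting back to UTF-8 first gives mb_strtoupper something it can work with.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo mb_strtoupper($fixed, 'UTF-8'), "\n"; // DANIËL
```

In other words, the encoding passed to mb_strtoupper has to match the bytes actually in the string, whichever of the two encodings that is.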

How to convert accented chars to html in php?

I have foreign-language strings stored in UTF-8. When displaying them, I want them to be transformed so that accented chars become their HTML counterparts, e.g. é becomes &eacute;
However I phrase my search, all I can find is htmlentities() which does not do it.

Make sure you're properly specifying the encoding when you call htmlentities().
The documentation shows that it defaults to 'UTF-8', but if you read down a bit further, you'll see that that's new as of PHP 5.4. If you're using an older version, the default is actually 'ISO-8859-1', and you'll need to make sure you explicitly specify it as 'UTF-8' instead of relying on the default behaviour.
htmlentities($string, 0, 'UTF-8');
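A quick way to check this on your own setup is a sketch like the following, which assumes the input really is UTF-8 (the sample word is written with an escape so the file's own encoding does not matter):

```php
<?php
$string = "Montr\u{00E9}al"; // "Montréal" encoded as UTF-8
$encoded = htmlentities($string, ENT_QUOTES, 'UTF-8');
echo $encoded, "\n"; // Montr&eacute;al
```

If the accented character comes out as two odd entities instead of one, the string is most likely being read with the wrong charset.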

I found a very simple solution to this problem.
The first call tries to encode the text. If $output comes back as an empty string, the input was not valid UTF-8, so the value is converted to UTF-8 first and the HTML entities are then created.
$output = htmlentities($value, 0, "UTF-8");
if ($output == "") {
// utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2
$output = htmlentities(utf8_encode($value), 0, "UTF-8");
}
Example:
Montréal
Output:
Montr&eacute;al

This was posted as a comment on another answer, but was exactly what I needed so I'm posting it as an answer on its own:
There are cases when you want to generate entities for accented
characters (à, è, ì, ò, ù, ...) but want to preserve HTML code (so do
not escape "<" and ">") and avoid escaping already escaped entities. In
those cases, you can use this code:
$string = str_replace(array("&lt;", "&gt;"), array("<", ">"), htmlentities($string, ENT_NOQUOTES, 'UTF-8', FALSE));
This code is compatible with PHP >= 5.2.3
(source)

How to remove all ASCII codes from a string

My sentence includes ASCII character codes like
&#x0022;&#x0023;&#x0024;&#x0025;
How can I remove all ASCII codes?
I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.

You could run this if you don't want the entities in the result:
$text = preg_replace('/&#x[0-9a-f]{4};/i', '', $text);
But be warned: this is basically a nuke, and given the way HTML entities work I am sure it will interfere with other parts of your string. Personally, I would recommend leaving them in and converting them as @hakra shows.

Are you trying to remove entities that resolve to non-ascii characters? If that is what you want you can use this code:
$str = '&#x0022; &#x0023; &#x0024; &#x0025; &#x7414;'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);
Or
// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any entities that remain
$str = preg_replace('/&#(x[0-9a-f]+|\d+);/i', '', $str);
If that's not what you want you need to clarify the question.

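If you have the multibyte string extension at hand, this works:
$string = '&#x0022;&#x0023;&#x0024;&#x0025;';
// note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2
echo mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');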
Which does give:
"#$%
Loosely related is:
PHP DomDocument failing to handle utf-8 characters (☆)
With the DOM extension you could load it and convert it to a string which probably has the benefit to better deal with HTML elements and such:
$doc = new DOMDocument(); @$doc->loadHTML('&#x0022;&#x0023;&#x0024;&#x0025;');
echo simplexml_import_dom($doc)->xpath('//body/p')[0];
Which does output:
"#$%
If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:
DOMDocument : how to get inner HTML as Strings separated by line-breaks?

To remove Japanese characters from a string, you may use the following code:
// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);
Documentation:
Unicode character properties
Unicode scripts
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

What encoding is this... and how do you escape it in php?

I'm working on an IMDb data scraper for a site, and they seem to encode everything in a weird encoding I have never seen before.
Exploding Ship
A Bug&#x27;s Life
Is there a php function that will convert these to regular characters?

This is not an encoding; these are HTML entities written as hexadecimal character references.
Try:
$converted = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

Those are SGML character escapes. They can be either decimal (&#39;) or hexadecimal (&#xA0;) and refer directly to a Unicode code point.
html_entity_decode() should work in PHP 5. Though I can't test at the moment.
In the first comment on that reference page, the following code is given for older PHP versions:
// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
