What encoding is this... and how do you escape it in php?

I'm working on an IMDb data scraper for a site, and they seem to encode everything in a weird encoding I've never seen before.
Exploding Ship
A Bug&#x27;s Life
Is there a php function that will convert these to regular characters?

This is not an encoding; these are HTML entities written as hexadecimal character references.
Try:
$converted = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
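As a quick check of the call above, here is a minimal sketch. The sample title is an assumed input containing a hexadecimal apostrophe entity, not data taken from the scraper.

```php
<?php
// Assumed sample input: a title containing a hexadecimal numeric entity.
$string = 'A Bug&#x27;s Life';

// ENT_QUOTES tells the decoder to convert single-quote entities as well as double-quote ones.
$converted = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

echo $converted, PHP_EOL; // A Bug's Life
```

Without ENT_QUOTES the apostrophe entity would be left as-is, which is why the flag matters for titles like this one.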

Those are SGML character escapes. They can be either decimal (&#39;) or hexadecimal (&#xA0;) and refer directly to a Unicode code point.
html_entity_decode() should work in PHP 5, though I can't test it at the moment.
In the first comment on that reference page, the following code is given for older PHP versions:
// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
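Note that the /e modifier used in that snippet was removed in PHP 7.0, and chr() only produces single bytes. On a current PHP, roughly the same idea could be sketched with preg_replace_callback() and mb_chr(). This assumes PHP 7.2+ with the mbstring extension; the function name decode_numeric_entities is made up for illustration.

```php
<?php
// Hypothetical helper: decodes numeric entities only, producing UTF-8.
function decode_numeric_entities(string $string): string
{
    $toChar = static function (array $m): string {
        // $m[1] holds the hex digits (after "x"); $m[2] holds decimal digits.
        $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
        $char = mb_chr($codePoint, 'UTF-8');
        // mb_chr() returns false for invalid code points: keep the entity untouched.
        return $char === false ? $m[0] : $char;
    };

    $pattern = '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i';
    return preg_replace_callback($pattern, $toChar, $string) ?? $string;
}

echo decode_numeric_entities('A Bug&#x27;s Life &#169;'), PHP_EOL; // A Bug's Life ©
```

Named entities such as &amp;amp; are deliberately not handled here; html_entity_decode() covers those.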

Related

Trouble decoding some special characters ’ “ ”

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And if I take this text directly and encode it with htmlentities, I get different entities to begin with:
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non-standard? Do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see that the characters I'm looking for are not listed. When I test a few characters from this table, they decode just fine. I guess the list in PHP is incomplete by default?
If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;
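A slightly narrower variant of the same idea, as a sketch: only references in the range 128 to 159 (the C1 control block) are reinterpreted as single-byte Windows code page characters, and everything else is left for html_entity_decode(). Using CP1252 rather than cp1250 is an assumption here (both code pages map 145 to 148 to the same curly quotes), and the function name is hypothetical.

```php
<?php
// Hypothetical helper: reinterprets &#128; .. &#159; as Windows-1252 bytes.
function decode_c1_entities(string $string): string
{
    return preg_replace_callback('/&#(\d{1,3});/', static function (array $m): string {
        $code = (int) $m[1];
        if ($code < 128 || $code > 159) {
            return $m[0]; // outside the C1 range: leave it for html_entity_decode()
        }
        // @ hides the notice iconv raises for bytes the code page leaves undefined.
        $char = @iconv('CP1252', 'UTF-8', chr($code));
        return ($char === false || $char === '') ? $m[0] : $char;
    }, $string) ?? $string;
}

$str = decode_c1_entities('Thi&#146;s a&#148;n e&#147;xample');
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
echo $str, PHP_EOL; // Thi’s a”n e“xample
```

Limiting the remapping to that range keeps ordinary references such as &amp;#169; decoding as normal Unicode code points.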

How to remove all ASCII codes from a string

My sentence includes ASCII character codes like
"#$%
How can I remove all ASCII codes?
I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.
You could run this if you don't want the returning values:
$text = preg_replace('/&#x[0-9a-f]{4};/i', '', $text);
But be warned: this is basically a nuker, and given the way HTML entities work I am sure it will interfere with other parts of your string. Personally, I would recommend leaving them in and converting them as @hakra shows.
Are you trying to remove entities that resolve to non-ascii characters? If that is what you want you can use this code:
$str = '" # $ % 琔'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);
Or
// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any entities that remain
$str = preg_replace('/&#(x[0-9a-f]+|\d+);/i', '', $str);
If that's not what you want you need to clarify the question.
If you have the multibyte string extension at hand, this works:
$string = '"#$%';
mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
Which does give:
"#$%
Loosely related is:
PHP DomDocument failing to handle utf-8 characters (☆)
With the DOM extension you could load it and convert it to a string which probably has the benefit to better deal with HTML elements and such:
echo simplexml_import_dom(@DomDocument::loadHTML('"#$%'))->xpath('//body/p')[0];
Which does output:
"#$%
If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:
DOMDocument : how to get inner HTML as Strings separated by line-breaks?
To remove Japanese characters from a string, you may use the following code:
// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);
Documentation:
Unicode character properties
Unicode scripts
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

Converting HTML Entities in UTF-8 to SHIFT_JIS

I am working with a website that needs to target old Japanese mobile phones that are not Unicode-enabled. The problem is that the text for the site is saved in the database as HTML entities (i.e., &#1234;). This database absolutely cannot be changed, as it is used for several hundred websites.
What I need to do is convert these entities to actual characters, and then convert the string encoding before sending it out, as the phones render the entities without converting them first.
I've tried both mb_convert_encoding and iconv, but all they are doing is converting the encoding of the entities, but not creating the text.
Thanks in advance
EDIT:
I have also tried html_entity_decode. It is producing the same results - an unconverted string.
Here is the sample data I am working with.
The desired result: シェラトン・ヌーサリゾート&スパ
The HTML Codes: シェラトン・ヌーサリゾート&スパ
The output of html_entity_decode([the string above],ENT_COMPAT,'SHIFT_JIS'); is identical to the input string.
Just take care you're creating the right codepoints out of the entities. If the original encoding is UTF-8 for example:
$originalEncoding = 'UTF-8'; // that's only assumed, you have not shared the info so far
$targetEncoding = 'SHIFT_JIS';
$string = '... whatever you have ... ';
// superfluous, but to get the picture:
$string = mb_convert_encoding($string, 'UTF-8', $originalEncoding);
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
$stringTarget = mb_convert_encoding($string, $targetEncoding, 'UTF-8');
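The two steps above (decode the entities into UTF-8, then transcode) can be wrapped up as a small sketch. It assumes the mbstring extension and that the numeric entities denote Unicode code points; the function name entities_to_sjis and the sample input are made up for illustration.

```php
<?php
// Hypothetical helper: entity-encoded text in, Shift_JIS bytes out.
function entities_to_sjis(string $string): string
{
    // Step 1: turn the entities into real characters, using UTF-8 as the working encoding.
    $utf8 = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // Step 2: transcode the UTF-8 text to Shift_JIS for the handset.
    return mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
}

// Assumed sample: the decimal entities for the katakana word シェラトン.
$sjis = entities_to_sjis('&#12471;&#12455;&#12521;&#12488;&#12531;');

// Print the bytes as hex, since a UTF-8 terminal would not render raw Shift_JIS.
echo bin2hex($sjis), PHP_EOL;
```

Decoding with UTF-8 as the target and converting afterwards avoids asking html_entity_decode() to produce Shift_JIS directly, which is the call that returned the string unchanged in the question.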
I found this function on php.net, it works for me with your example:
function unhtmlentities($string) {
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
I think you just need html_entity_decode.
Edit: Based on your edit:
$output = preg_replace_callback("/(&#[0-9]+;)/", function ($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $original_string);
Note that this is just your first step, to convert your entities to the actual characters.
Just to chip in, since I ran into a similar encoding bug while coding, I would suggest this snippet:
$string_to_encode=" your string ";
if(mb_detect_encoding($string_to_encode)!==FALSE){
$converted_string=mb_convert_encoding($string_to_encode,'UTF-8');
}
Maybe not the best for a large amount of data, but still works.

How to convert HTML character NUMBERS to plain characters in PHP?

I have some HTML data (over which I have no control, can only read it) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" chars are stored as HTML character numbers (æ = &#230;). I need to convert these to the corresponding actual character in PHP (or JavaScript, but I guess PHP is better here...). It seems like html_entity_decode() only handles the "other" kind of entities, where æ = &aelig;. The only solution I've come up with so far is to make a conversion table and map each character number to a real character, but that's not really super smart...
So, any ideas? ;)
Cheers,
Christofer
&#NUMBER;
refers to the Unicode value of that character,
so you could use a regex like:
/&#(\d+);/g
to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its Unicode-equivalent character.
Then simply replace each regex match with the character.
Edit: Actually it looks like you can use this:
mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');
I think html_entity_decode() should work just fine. What happens when you try:
echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');
On the PHP manual page on html_entity_decode(), it gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
As someone noted in the comments, you should probably replace chr() with unichr() to deal with non-ASCII characters.
However, it looks like html_entity_decode() really should deal with numeric as well as literal entities. Are you specifying an appropriate charset (e.g., UTF-8)?
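A small check of that claim, as a sketch with an assumed sample string: the same letter written as a decimal, a hexadecimal, and a named entity all decode once UTF-8 is given as the target charset.

```php
<?php
// The same letter written as a decimal, a hexadecimal, and a named entity.
$input = '&#230; &#xE6; &aelig;';

$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

echo $decoded, PHP_EOL; // æ æ æ
```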
If you haven't got the luxury of having multibyte string functions installed, you can use something like this:
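<?php
$string = 'Here is a special char &#230;';
// create_function() was removed in PHP 8, so a closure is used instead.
$list = preg_replace_callback('/&#([0-9]+);/', function ($matches) {
    return decode((int) $matches[1]);
}, $string);
echo '<p>', $string, '</p>';
echo '<p>', $list, '</p>';
// Only valid for code points 0-255: the number is treated as a Latin-1 byte.
function decode(int $codePoint)
{
    // utf8_encode() converts ISO-8859-1 to UTF-8 (deprecated as of PHP 8.2).
    return utf8_encode(chr($codePoint));
}
?>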

Replace diacritic characters with "equivalent" ASCII in PHP?

Related questions:
How to replace characters in a java String?
How to replace special characters with their equivalent (such as " á " for " a") in C#?
As in the questions above, I'm looking for a reliable, robust way to reduce any unicode character to near-equivalent ASCII using PHP. I really want to avoid rolling my own look up table.
For example (stolen from 1st referenced question): Gračišće becomes Gracisce
The iconv module can do this, more specifically, the iconv() function:
$str = iconv('Windows-1252', 'ASCII//TRANSLIT//IGNORE', "Gracišce");
echo $str;
//outputs "Gracisce"
The main hassle with iconv is that you just have to watch your encodings, but it's definitely the right tool for the job (I used 'Windows-1252' for the example due to limitations of the text editor I was working with ;) The feature of iconv that you definitely want to use is the //TRANSLIT flag, which tells iconv to transliterate any characters that don't have an ASCII match into the closest approximation.
I found another solution, based on @zombat's answer.
The issue with his answer was that I was getting:
Notice: iconv() [function.iconv]: Wrong charset, conversion from `UTF-8' to `ASCII//TRANSLIT//IGNORE' is not allowed in D:\www\phpcommand.php(11) : eval()'d code on line 3
And after removing //IGNORE from the function, I got:
Gr'a'e~a~o^O"ucisce
So, the š character was translated correctly, but the other characters weren't.
The solution that worked for me is a mix between preg_replace (to remove everything but [a-zA-Z0-9] - including spaces) and @zombat's solution:
preg_replace('/[^a-zA-Z0-9.]/','',iconv('UTF-8', 'ASCII//TRANSLIT', "GráéãõÔücišce"));
Output:
GraeaoOucisce
My solution is to create two strings: the first with the unwanted letters and the second with the letters that replace them.
$from = 'čšć';
$to = 'csc';
$text = 'Gračišće';
$result = str_replace(mb_str_split($from), str_split($to), $text); // mb_str_split() (PHP 7.4+) keeps each multibyte letter whole
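An alternative sketch of the same lookup idea uses strtr() with a mapping array, which keeps each multibyte letter paired with its replacement regardless of how the strings would be split. The three-letter table is only an illustration and would need extending for real input.

```php
<?php
// Map each accented letter to its ASCII stand-in; extend the table as needed.
$map = ['č' => 'c', 'š' => 's', 'ć' => 'c'];

$result = strtr('Gračišće', $map);

echo $result, PHP_EOL; // Gracisce
```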
Try this:
function normal_chars($string)
{
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
$string = preg_replace(array('~[^0-9a-z]~i', '~-+~'), ' ', $string);
return trim($string);
}
Examples:
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA
Based on the selected answer in this thread: URL Friendly Username in PHP?
You should also try:
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', "ÀÖØöøįĴőŔžǍǰǴǵǸțȞȟȤȳɃɆɏ");
//Will output
aooooijorzajggnthhzybey
I found this from here:
https://www.php.net/manual/en/transliterator.transliterate.php#111939
