My sentence include ASCII character codes like
"#$%
How can I remove all ASCII codes?
I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.
You could run this if you don't want the returning values:
preg_replace('/(&#x[0-9]{4};)/', '', $text);
But be warned. This is basically a nuker and with the way HTML entities work I am sure this will interfer with other parts of your string. I would recommend leaving them in personally and encoding them as #hakra shows.
Are you trying to remove entities that resolve to non-ascii characters? If that is what you want you can use this code:
$str = '" # $ % 琔'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);
Or
// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any entities that remain
$str = preg_replace('/&#(x[0-9]{4}|\d+);/', '', $str);
If that's not what you want you need to clarify the question.
If you have the multibyte string extension at hand, this works:
$string = '"#$%';
mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
Which does give:
"#$%
Loosely related is:
PHP DomDocument failing to handle utf-8 characters (☆)
With the DOM extension you could load it and convert it to a string which probably has the benefit to better deal with HTML elements and such:
echo simplexml_import_dom(#DomDocument::loadHTML('"#$%'))->xpath('//body/p')[0];
Which does output:
"#$%
If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:
DOMDocument : how to get inner HTML as Strings separated by line-breaks?
To remove Japanese characters from a string, you may use the following code:
// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);
Documentation:
Unicode character properties
Unicode scripts
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.
Try the code here
Related
I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'This i"s an example';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi’s i"s a’n e”xa“mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?
If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'This i"s an example';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;
I have foreign language strings stored in utf8. When displaying them, I want them to be transform in such a way that accented chars become their html counterparts, e.g. é becomes é
However I phrase my search, all I can find is htmlentities() which does not do it.
Make sure you're properly specifying the encoding when you call htmlentities().
The documentation shows that it defaults to 'UTF-8', but if you read down a bit further, you'll see that that's new as of PHP 5.4. If you're using an older version, the default is actually 'ISO-8859-1', and you'll need to make sure you explicitly specify it as 'UTF-8' instead of relying on the default behaviour.
htmlentities($string, 0, 'UTF-8');
I found this is a very simple solution to this problem.
The first part tries to encode the text. If the $output returns a blank string, then the value is utf8 encoded and then the html entities are created.
$output = htmlentities($value, 0, "UTF-8");
if ($output == "") {
$output = htmlentities(utf8_encode($value), 0, "UTF-8");
}
Example:
Montréal
Output:
Montréal
This was posted as a comment on another answer, but was exactly what I needed so I'm posting it as an answer on its own:
There are cases when you want to generate entities for accented
characters (à ,è, ì, ò, ù, ...) but want to preserve HTML code (so do
not excape "<" and ">" and avoid escaping already escaped entities. In
those cases, you can use this code:
$string = str_replace(array("<", ">"), array("<", ">"), htmlentities($string, ENT_NOQUOTES, 'UTF-8', FALSE));
This code is compatible with PHP >= 5.2.3
(source)
Trying to figure out this decoding. I want to wind up with the most generic text possible. Elipsis to '...' Fancy quotes to single or double quotes, regular old '-' not the emdash. Is there another way other than str_replace with a table of fancy vs. regular strings?
$str = 'Hey,…I came back….ummm,…OK,…cool';
echo htmlspecialchars_decode($str, ENT_QUOTES) ;
// Hey,…I came back….ummm,…OK,…cool
echo html_entity_decode($str, ENT_QUOTES, 'ISO-8859-15') ;
// Hey,…I came back….ummm,…OK,…cool
echo html_entity_decode($str, ENT_QUOTES, 'UTF-8') ;
//this works, but changes to the elipse character
// Hey,…I came back….ummm,…OK,…cool
echo str_replace("…", "...", $str) ;
//Hey,...I came back....ummm,...OK,...cool
//desired result
I'm not sure of your specs but I have the impression you want something like this:
$str = 'Hey,…I came back….ummm,…OK,…cool';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
This basically makes any Unicode character fit into 7-bit ASCII. Unexpected results may arise.
Update: Examples of unexpected results:
$str = 'Álvaro España €£¥¢©®';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# 'Alvaro Espa~na EURlbyenc(c)(R)
$str = 'Test: உதாரண';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# Notice: iconv(): Detected an illegal character in input string
$str = 'Test: உதாரண End Test';
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# Test: End Test
You should note that HTML entities like … are just a trick to allow browsers to display characters that do not belong to the document encoding. They have nothing to do with databases! If you're getting them into your DB it's probably because your app is not using UTF-8 (UTF-8 allows to represent any character) but users are typing those characters anyway and the browser makes its best to fit them into the document. The easiest fix it to just switch to UTF-8, as explained in UTF-8 all the way through.
Fb doesnt like these &# characters and I would assume doesnt like the elipsis characters either
HTML entities are, well, HTML, not plain text. If Facebook expects plain text, HTML entities will displayed as-is rather than being decoded. As about «…», I really doubt that Facebook (which is using UTF-8) treats them exceptionally. You're probably sending them in the wrong encoding.
I am working with a website that needs to target old, Japanese mobile phones, that are not Unicode enabled. The problem is, the text for the site is saved in the database as HTML entities (ie, Ӓ). This database absolutely cannot be changed, as it is used for several hundred websites.
What I need to do is convert these entities to actual characters, and then convert the string encoding before sending it out, as the phones render the entities without converting them first.
I've tried both mb_convert_encoding and iconv, but all they are doing is converting the encoding of the entities, but not creating the text.
Thanks in advance
EDIT:
I have also tried html_entity_decode. It is producing the same results - an unconverted string.
Here is the sample data I am working with.
The desired result: シェラトン・ヌーサリゾート&スパ
The HTML Codes: シェラトン・ヌーサリゾート&スパ
The output of html_entity_decode([the string above],ENT_COMPAT,'SHIFT_JIS'); is identical to the input string.
Just take care you're creating the right codepoints out of the entities. If the original encoding is UTF-8 for example:
$originalEncoding = 'UTF-8'; // that's only assumed, you have not shared the info so far
$targetEncoding = 'SHIFT_JIS';
$string = '... whatever you have ... ';
// superfluous, but to get the picture:
$string = mb_convert_encoding($string, 'UTF-8', $originalEncoding);
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
$stringTarget = mb_convert_encoding($string, $targetEncoding, 'UTF-8');
I found this function on php.net, it works for me with your example:
function unhtmlentities($string) {
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
I think you just need html_entity_decode.
Edit: Based on your edit:
$output = preg_replace_callback("/(&#[0-9]+;)/", create_function('$m', 'return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); '), $original_string);
Note that this is just your first step, to convert your entities to the actual characters.
just to participate as I encountered some kind of encoding bug while coding, I would suggest this snippet :
$string_to_encode=" your string ";
if(mb_detect_encoding($string_to_encode)!==FALSE){
$converted_string=mb_convert_encoding($string_to_encode,'UTF-8');
}
Maybe not the best for a large amount of data, but still works.
I have an arial character giving me a headache. U+02DD turns into a question mark after I turn its document into a phpquery object. What is an efficient method for removing the character in php by referring to it as 'U+02DD'?
You can use iconv() to convert character sets and strip invalid characters.
<?PHP
/* This will convert ISO-8859-1 input to UTF-8 output and
* strip invalid characters
*/
$output = iconv("ISO-8859-1", "UTF-8//IGNORE", $input);
/* This will attempt to convert invalid characters to something
* that looks approximately correct.
*/
$output = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $input);
?>
See the iconv() documentation at http://php.net/manual/en/function.iconv.php
Use preg_replace and do it like this:
$str = "your text with that character";
echo preg_replace("#\x{02DD}#u", "", $str); //EDIT: inserted the u tag for unicode
To refer to large unicode ranges, you can use preg_replace and specify the unicode character with \x{abcd} pattern. The second parameter is an empty string that. This will make preg_replace to replace your character with nothing, effectively removing it.
[EDIT] Another way:
Did you try doing htmlentities on it. As it's html-entity is ˝, doing that OR replacing the character by ˝ may solve your issue too. Like this:
echo preg_replace("#\x{02DD}#u", "˝", $str);