PHP - Convert Non-ASCII Characters to hex Entities Without mbstring

I'd like to convert any Unicode string to hex HTML entities, except for ASCII Characters. So a string like:
Text goes here. Here's だ and here's ã.
gets converted to
Text goes here. Here's &#12384; and here's &#227;.
For reference, the question "How to convert all characters to their html entity equivalent using PHP" has a function that converts all characters to numerical entities, but it requires mbstring, which I cannot use (I also can't use any features past PHP 5.3.10).

THIS IS NOT MY CODE.
I did a simple Google check using "php convert unicode to html" and found this:
https://af-design.com/2010/08/17/escaping-unicode-characters-to-html-entities-in-php/
Which had this:
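function unicode_escape_sequences($str)
{
    // json_encode() escapes every non-ASCII character as \uXXXX
    $working = json_encode($str);
    // turn each \uXXXX escape (four hex digits) into an HTML hex entity
    $working = preg_replace('/\\\u([0-9a-f]{4})/', '&#x$1;', $working);
    return json_decode($working);
}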
That web page also had a lot of other examples on it but this one looked like what you were looking for.
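One caveat with the json_encode approach: characters outside the Basic Multilingual Plane are escaped as two \uXXXX surrogates, so each half would become its own (invalid) entity. As a rough sketch of an alternative that still avoids mbstring and stays within PHP 5.3 syntax, the code point can be computed from the UTF-8 bytes directly. The function name utf8_to_hex_entities is made up for this example, and the input is assumed to be valid UTF-8 (preg_replace_callback returns NULL otherwise).

```php
<?php
// Hypothetical helper: uses only PCRE and core functions, no mbstring.
function utf8_to_hex_entities($str)
{
    // With the /u modifier, the class matches one whole non-ASCII character.
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);
        // Strip the length-marker bits from the lead byte.
        if ($n === 2) {
            $cp = $bytes[0] & 0x1F;
        } elseif ($n === 3) {
            $cp = $bytes[0] & 0x0F;
        } else {
            $cp = $bytes[0] & 0x07;
        }
        // Each continuation byte contributes its low six bits.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#x' . dechex($cp) . ';';
    }, $str);
}

echo utf8_to_hex_entities("Here's だ and here's ã."), "\n";
// Here's &#x3060; and here's &#xe3;.
```

If decimal entities are preferred, as in the question's sample output, returning '&#' . $cp . ';' instead should be enough.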

Related

PHP string encoding is not recognized by strpos()?

I have a binary Word .doc that looks something like this in string format:
þÿÿÿÿÿÿÿppp„±¶g œÙ Text in word doc here I'm interested in [|`ñÿ|Standard1$S_HmHnHsHtHOJPJQJCJEH567>
When I echo that string, I can see all the text I'm interested in between unrecognized characters (I'm not worried about those, since I only want the text). The issue is that PHP does not seem to recognize it as a string, so I cannot search it: strpos(), strchr() and mb_strpos() all return nothing. No -1, no error in the PHP error log, just nothing.
However, when I call gettype() I get string. I suspect this is an encoding issue, but mb_detect_encoding returns UTF-8. I have tried converting it to several different encodings, to no avail.
How can I get PHP to search this string? I understand that parsing a Word .doc is a more complex issue, but for my purposes the plaintext I'm interested in is in the binary data. Does anyone have any experience with this?
Thank you :)
Since your string seems to be binary and you are only interested in the text, a quick solution would be to use filter_var to strip non-printable and non-ASCII characters. Try using this before searching:
$clean_string = filter_var($str, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
Notice the part "Standard1$". Inside a double-quoted string, PHP treats $ as the start of a variable name rather than as a literal character.
Check here:
<?php
$s = "þÿÿÿÿÿÿÿppp„±¶g œÙ Text in word doc here I'm interested in [|`ñÿ|Standard1$S_HmHnHsHtHOJPJQJCJEH567>";
$s2 = strpos($s, "interested");
echo $s2;
?>
You might want to put a backslash before that $ sign.
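Putting the cleaning step and the search together, as a rough sketch: the sample bytes below are made up, and preg_replace is swapped in for filter_var to keep only printable ASCII. Note that strpos() returns false (not -1) when nothing is found, and echo prints false as an empty string, which would look exactly like "just nothing". Hence the strict !== false comparison.

```php
<?php
// Made-up stand-in for the binary .doc contents; single quotes keep $ literal.
$raw = "\xFE\xFF\xFF\x00pp\x01" . 'Text in word doc here [|`|Standard1$S_Hm';

// Keep printable ASCII only (0x20-0x7E); every other byte is dropped.
$clean = preg_replace('/[^\x20-\x7E]/', '', $raw);

$pos = strpos($clean, 'word doc');
if ($pos !== false) {
    echo "Found at offset $pos\n";
} else {
    echo "Not found\n";
}
```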

How to convert a Chinese character to UTF-16 code units?

I'm using PHP for this web development project. Right now, I'm working on a user page, where the user can add words that he knows. Of course, I'm starting out crude, without adding any special features yet like Do you know this Character suggestion, etc.
I have tackled the challenges of adding UTF-16 collation and charset set to UTF-16 in my MySQL Database, in fact online at http://freemysqlhosting.net to support Chinese characters in my website. Now what I'm struggling with is to support automatic PinYin generation for my Chinese characters.
I have found this after searching all over SO: https://github.com/reorx/pinyindep/blob/master/Uni2Pinyin. Each line begins with a Chinese character, in UTF-16 Code Units.
Take for example, 爱. In UTF-16, it is 7231. I convert this at https://r12a.github.io/apps/conversion/. When I do a lookup in the file, I get the pinyin associated. :D This is the functionality I need, though looking it up in GitHub is in JS, rather than PHP.
In the manual lookup, ai4 is returned, which is the correct intonation. Now, what I'm looking for is either a PHP built-in library or a code snippet to convert a string input, let's say “爱”, into a four-character UTF-16 code unit, such as 7231 here.
So what's the question:
How should I convert a Chinese character, in form of a string, to UTF-16 code units? (Either through built-in library, or through a suggested PHP Code Snippet)
P.S. I don't really like third-party tools unless they are really popular worldwide, or there's no other option.
You need to use PHP's multibyte string module:
$c = "爱";
list(, $d) = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
echo dechex($d);
// => 7231
Change UTF-8 to UTF-16 if your string is coming from the database in that encoding.
mb_convert_encoding will change the string into a four-byte-per-character encoding; unpack then converts the four bytes into an unsigned long; finally, dechex converts that number to a hexadecimal string.
If you are using PHP 7.2+ you can use mb_ord to simplify the conversion.
echo dechex(mb_ord("爱"));
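The snippets above yield the Unicode code point, which equals the single UTF-16 code unit only for characters in the Basic Multilingual Plane (such as 爱). For characters beyond it, UTF-16 uses two code units. A minimal sketch that returns the actual code units, assuming UTF-8 input and the mbstring extension; utf16_code_units is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the UTF-16 code units of a UTF-8 string
// as uppercase four-digit hex strings.
function utf16_code_units($str)
{
    // UTF-16BE stores each code unit as two bytes, high byte first.
    $hex = bin2hex(mb_convert_encoding($str, 'UTF-16BE', 'UTF-8'));
    // Four hex digits correspond to one 16-bit code unit.
    return str_split(strtoupper($hex), 4);
}

echo implode(' ', utf16_code_units("爱")), "\n"; // 7231
```

A character outside the BMP comes back as two entries (a surrogate pair), so a lookup table keyed on single code units would need to account for that.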

PHP function to convert HEX char codes to display equivalent

I am working on some PHP code to identify HEX character codes in a string and convert them to their "as seen on screen" equivalent. Mainly, these HEX codes are for accented characters like é, ç and so on.
For example, I am receiving a string like this:
$str = "caf&#xe9;s";
The HEX part of the string is &#xe9; and I need to convert that to its "as seen on screen" equivalent, in this case "é". So the converted string would be "cafés".
The following PHP code works, but I have to write one for each HEX code, and there are scores of them.
$keywords = str_replace("&#xe9;","é",$keywords);
Can anyone suggest an existing PHP function that can scan any string for known HEX codes and convert it to the display equivalent?
I am working in UTF8 otherwise.
Thanks for your consideration, sorry if my terminology sounds amateur.
James
http://www.php.net/manual/en/function.html-entity-decode.php
This will convert HTML entities into their associated characters:
$keywords = html_entity_decode($keywords);
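A small sketch of the same call with the flags and charset spelled out. Before PHP 5.4 the default charset for html_entity_decode() was ISO-8859-1, so passing 'UTF-8' explicitly matters when the rest of the application works in UTF-8. The sample string is the one from the question with its semicolon in place, plus one named entity added for illustration.

```php
<?php
// Numeric (hex) and named entities are both handled.
$keywords = 'caf&#xe9;s and gar&ccedil;on';

// ENT_QUOTES also decodes &#039; and &quot;; 'UTF-8' sets the output encoding.
$decoded = html_entity_decode($keywords, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n"; // cafés and garçon
```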

Does json support arabic characters?

I want to ask a quick question: does JSON support Arabic characters? I mean when I search for something like the following:
$values = $database->get_by_name('معاً');
echo json_encode(array('returnedFromValue' => $value."<br/>"));
I'm also looking for Arabic results from the database; the returned values look like this:
{"returnedFromValue":"\u0627\u0644\u0645\u0639\u0627\u062f\u0649<br\/>"}{"returnedFromValue":"\u0627\u0644\u0645\u0639\u0627\u062f\u0649<br\/>"}
What am I missing here? Is it better to use XML in terms of supporting Arabic characters?
JSON is, just like XML, a data-interchange format. It's not tied to a particular charset, so Arabic characters should be fine if you use a charset that supports them (UTF-8, for example).
PHP 5.4.0 and later support an option for json_encode() called JSON_UNESCAPED_UNICODE. This stops the default behaviour of converting characters to their \uXXXX form.
$value = 'معاً';
echo json_encode($value, JSON_UNESCAPED_UNICODE);
// Outputs: "معاً"
These \u0627-numbers are the Unicode-codepoints for your arabic letters. PHP uses them rather than the raw UTF-8 serialization, but they are there. So yes, JSON does support it. If the result string was printed out client-side (using Javascript) you would see the letters again.
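To see that nothing is lost, a quick round trip can be sketched as follows (the sample value is the word from the question; JSON_UNESCAPED_UNICODE needs PHP 5.4 or later):

```php
<?php
$value = 'معاً';

// Default behaviour: ASCII-only JSON with \uXXXX escapes.
$escaped = json_encode($value);

// PHP 5.4+: the raw UTF-8 bytes are kept inside the JSON quotes.
$raw = json_encode($value, JSON_UNESCAPED_UNICODE);

// Both forms decode back to the identical UTF-8 string.
var_dump(json_decode($escaped) === $value); // bool(true)
var_dump(json_decode($raw) === $value);     // bool(true)
```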

How to parse unicode format (e.g. \u201c, \u2014) using PHP

I am pulling data from the Facebook graph which has characters encoded like so: \u201c and \u2014
Is there a function to convert those characters into HTML? i.e \u2014 -> —
If you have some further reading on these character codes, or suggested reading about Unicode in general, I would appreciate it. This is so confusing to me. I don't know what to call these codes... I guess Unicode, but Unicode seems to mean a whole lot of things.
That's not entirely true, bobince.
How do you handle JSON containing Spanish accents?
There are two problems.
I call FB.api(url, function(response)
... var s=JSON.stringify(response);
and pass it to a PHP script via $.post.
First, I get a truncated string, so I need escape(JSON.stringify(response)).
Then I get a full JSON-encoded string with Spanish accents.
As a test, I placed it in a text file, loaded it with file_get_contents, applied PHP's json_decode, and got nothing.
You first need utf8_encode.
Then you get the object you were waiting for.
After a full day of testing and googling without managing to decode Unicode properly, I found your post.
So many thanks to you.
Someone asked me to solve the problem of Arabic texts from the Facebook JSON archive, maybe this code helps someone who searches for reading Arabic texts from Facebook (or instagram) JSON:
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';
function decode_encoded_utf8($string){
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));
Facebook Graph API returns JSON objects. Use json_decode() to read them into PHP and you do not have to worry about handling string literal escapes like \uNNNN. Don't try to decode JSON/JavaScript string literals by yourself, or extract chosen properties using regex.
Having read the string value, you'll have a UTF-8-encoded string. If your target HTML is also UTF-8-encoded, you don't need to replace — (U+2014) with any entity reference. Just use htmlspecialchars() on the string when outputting it, so that any < or & characters in the string are properly encoded.
If you do for some reason need to produce ASCII-safe HTML, use htmlentities() with the charset arg set to 'utf-8'.
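As a sketch of that advice (the JSON payload below is invented for illustration and is not a real Graph API response):

```php
<?php
// Invented sample payload with \uNNNN escapes and an HTML-significant "<".
$json = '{"message":"Tom \u2014 \u201chello\u201d <3"}';

$data = json_decode($json, true);   // decode into an associative array
$text = $data['message'];           // UTF-8 string, escapes already resolved

// UTF-8 page: only the HTML-significant characters are escaped.
echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), "\n";

// ASCII-safe output: characters that have named entities are replaced too.
echo htmlentities($text, ENT_QUOTES, 'UTF-8'), "\n";
```

Passing the charset explicitly keeps the behaviour the same on older PHP versions, where the default was not UTF-8.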
