How to transform Japanese English characters to normal English characters? - php

I have a string of Japanese English characters.
These characters are not normal English characters.
Characters: Ｇａｍｅ
How do I transform these characters to normal English characters in PHP?

Subtract 65248 from the ordinal value of each character. In other words:
$str = "Ｇａｍｅ some other text by ヴィックサ";
$str = preg_replace_callback(
"/[\x{ff01}-\x{ff5e}]/u",
function($c) {
// convert UTF-8 sequence to ordinal value
$code = ((ord($c[0][0])&0xf)<<12)|((ord($c[0][1])&0x3f)<<6)|(ord($c[0][2])&0x3f);
return chr($code-0xfee0); // 0xfee0 == 65248
},
$str);
This will replace all of the "Fullwidth" characters with their normal width equivalents.

It would be easier to use mb_convert_kana:
$string = 'Characters: Ｇａｍｅ';
$newString = mb_convert_kana($string, 'a', 'UTF-8'); // 'a' converts fullwidth alphanumerics to halfwidth

I'm sure there is a much easier answer, but couldn't you make a dictionary object with the special character as the key and the character you want as the value,
then just do a simple find and replace?
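
The dictionary idea above could be sketched roughly as follows. The function names fullwidthMap and toHalfwidth are made up for this illustration, and the sketch assumes the input is UTF-8 and that only the fullwidth range U+FF01 to U+FF5E mentioned in the first answer needs converting. The keys are built by hand-encoding each code point as three UTF-8 bytes, so no extension is needed.

```php
<?php
// Build a lookup table from each fullwidth character (U+FF01..U+FF5E)
// to its ASCII counterpart (U+0021..U+007E). Hypothetical helper.
function fullwidthMap(): array
{
    $map = [];
    for ($code = 0xFF01; $code <= 0xFF5E; $code++) {
        // Encode the code point as a 3-byte UTF-8 sequence.
        $utf8 = chr(0xE0 | ($code >> 12))
              . chr(0x80 | (($code >> 6) & 0x3F))
              . chr(0x80 | ($code & 0x3F));
        // The ASCII equivalent sits exactly 0xFEE0 (65248) below.
        $map[$utf8] = chr($code - 0xFEE0);
    }
    return $map;
}

// Replace every fullwidth character found in the map; anything else
// (kana, kanji, plain ASCII) is left untouched.
function toHalfwidth(string $text): string
{
    return strtr($text, fullwidthMap());
}

echo toHalfwidth("\u{FF27}\u{FF41}\u{FF4D}\u{FF45}"), "\n"; // prints: Game
```

Passing an array to strtr does all replacements in a single pass, which avoids chaining many str_replace calls.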

Related

Detecting non english characters in a string?

I'm trying to remove cached profiles which have non English letters in their description. I'm fine with dashes, symbols, special characters, underscores all that I just don't want foreign characters in my string.
The issue is that my code below detects strings with á as ASCII even though it isn't an English character. Is matching against ASCII the right approach?
if (!mb_detect_encoding($this->removeEmojis(str_replace(" ", "", $cacheItem->description), 'ASCII', true)))
{
$cacheItem->delete(); // laravel
}
Value of $cacheItem->description
Welcome to my profile<br> Londrina-Paraná
The letter á is a non English character.
The description can also contain dots, symbols, special characters, but I want to detect foreign characters like Latin.
Descriptions can also contain emojis so I try to remove them with this function
private function removeEmojis($text){
// theres lots more inside the preg_replace I truncated it for readability
return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}]/u', ' ', $text);
}
You can detect any character that is not printable ASCII by using this regexp:
[^\x20-\x7E]
See ASCII table
Replace the matches with an empty string to get a purified string, then apply your emoji replacement.
You can use preg_match to check whether all the characters in the string are in the range <space> to ~, which is the printable ASCII character range:
$description = 'Welcome to my profile<br> Londrina-Paraná';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londriná-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londrina-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
Output:
int(0)
int(0)
int(1)
Demo on 3v4l.org
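
If it helps to see which characters trip the check, a small sketch along these lines could list them. The name nonAsciiChars is invented for this example, and it assumes the description is valid UTF-8. Tabs and newlines fall outside the printable range too, so they would be reported as well.

```php
<?php
// Return the distinct characters in $text that fall outside the
// printable ASCII range (space to tilde). Hypothetical helper.
function nonAsciiChars(string $text): array
{
    // The /u flag makes the class match whole UTF-8 characters, not bytes.
    $count = preg_match_all('/[^\x20-\x7E]/u', $text, $matches);
    if ($count === false || $count === 0) {
        return []; // no offenders, or the input was not valid UTF-8
    }
    return array_values(array_unique($matches[0]));
}

var_dump(nonAsciiChars("Welcome to my profile<br> Londrina-Paran\u{E1}"));
var_dump(nonAsciiChars('Welcome to my profile<br> Londrina-Parana'));
```

An empty array means the description contains only printable ASCII.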

preg_match replace ALL condition with ANY

Seems like
preg_match('/^[\p{Cyrillic}]+$/', $str)
returns 0 or 1 based on if $str contains ALL Cyrillic letters.
I need 0 or 1 based on if $str contains ANY Cyrillic letters.
Thank you.
You can use:
$ret = preg_match('/\p{Cyrillic}/u', $str);
to determine whether the input string contains any Cyrillic character. The /u flag is required to handle Unicode input strings.
Alternatively use mb_ereg function for multibyte regex match like this:
$str = 'БДКЯ'; // string with Cyrillic characters only
// check with Cyrillic string only
var_dump( mb_ereg('\p{Cyrillic}', $str) ); // int(1) (bool(true) as of PHP 8.0)
// check with mix of Cyrillic and ASCII characters
var_dump( mb_ereg('\p{Cyrillic}', $str . 'abc') ); // int(1) (bool(true) as of PHP 8.0)
// check with ASCII characters only
var_dump( mb_ereg('\p{Cyrillic}', 'abc') ); // bool(false)
The anchors ^ and $ force \p{Cyrillic} to match from the beginning to the end of the string, so remove them. Also, the character class [] and + are not needed because you are looking for any match. Keep the /u flag so the string is treated as UTF-8:
/\p{Cyrillic}/u
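
To make the ANY versus ALL difference concrete, here is a minimal sketch. The function names containsCyrillic and isAllCyrillic are made up for this illustration, and the input is assumed to be UTF-8.

```php
<?php
// ANY: true if at least one Cyrillic character appears anywhere.
function containsCyrillic(string $str): bool
{
    return preg_match('/\p{Cyrillic}/u', $str) === 1;
}

// ALL: true only if the string consists entirely of Cyrillic characters.
function isAllCyrillic(string $str): bool
{
    return preg_match('/^\p{Cyrillic}+$/u', $str) === 1;
}

$mixed = "abc\u{0411}"; // "abc" followed by the Cyrillic letter Be
var_dump(containsCyrillic($mixed)); // bool(true)
var_dump(isAllCyrillic($mixed));    // bool(false)
```

Comparing against 1 turns both a non-match (0) and an error (false) into false.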

unexpected output of ltrim in php

Can anybody explain this unusual output of ltrim
var_dump(ltrim('/btcapi/participation/set-user-event-participation','/btcapi'));
rticipation/set-user-event-participation //output
While expected output has
/participation/set-user-event-participation
Use str_replace if you are sure this is the only occurrence in your string.
$str = '/btcapi/participation/set-user-event-participation';
echo str_replace('/btcapi', '', $str); // prints: /participation/set-user-event-participation
Or use a regex if you need to remove only the occurrence at the beginning of the string.
$str = '/btcapi/participation/set-user-event-participation';
echo preg_replace('~^/btcapi~', '', $str);
The trim characters are read as individual characters, not as a string.
For example, it also strips the second / because it is one of those characters.
Just use str_replace or a custom loop.
RTM: http://php.net/ltrim
The second argument is a character MASK, i.e. the characters you want to strip. CHARACTERS, not STRING.
php > $foo = 'abc123';
php > echo ltrim($foo, 'abpq');
c123
php > echo ltrim($foo, 'a1');
bc123
^---not stripped, because 'bc' are not in the mask.
php >
PHP will strip all characters from the left of the string, based on the characters in the mask, until it encounters a character NOT in the mask.
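
If the goal is to drop a literal prefix rather than a set of characters, a helper along these lines is one option. removePrefix is a hypothetical name used only for this sketch; it compares the start of the string against the whole prefix and leaves the string alone when there is no match.

```php
<?php
// Remove $prefix from the start of $text only when $text really begins
// with that exact string. Hypothetical helper, not a built-in.
function removePrefix(string $text, string $prefix): string
{
    $length = strlen($prefix);
    if ($length > 0 && strncmp($text, $prefix, $length) === 0) {
        return substr($text, $length);
    }
    return $text;
}

$path = '/btcapi/participation/set-user-event-participation';
echo removePrefix($path, '/btcapi'), "\n"; // /participation/set-user-event-participation
echo ltrim($path, '/btcapi'), "\n";        // rticipation/set-user-event-participation
```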

Convert Unicode from JSON string with PHP

I've been reading up on a few solutions but have not managed to get anything to work as yet.
I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.
I'd like to use PHP to convert these into either £ or &pound;.
I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:
$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');
The output is Â£.
Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?
UPDATE
It seems that the JSON string from the API has 2 or 3 unescaped Unicode strings, e.g.:
That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a3 (pound symbol)
It is not UTF-16 encoding. It rather looks like bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really maps to the Â£ string.
What you should have is \u00a3, which is the Unicode code point for £.
{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.
If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of Unicode code points to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.
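function fixBadUnicode($str) {
// preg_replace_callback replaces the /e modifier, which was removed in PHP 7.0
return utf8_decode(preg_replace_callback('/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/', function ($m) { return chr(hexdec($m[1])) . chr(hexdec($m[2])); }, $str));
}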
Example here: http://phpfiddle.org/main/code/6sq-rkn
Edit:
If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:
function fixBadUnicodeForJson($str) {
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
$str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
return $str;
}
Edit 2: fixed the previous function to transform any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.
Be careful: some of these characters, which probably come from an editor such as Word, are not translatable to ISO-8859-1 and will therefore appear as '?' after utf8_decode.
The output is correct.
\u00c2 == Â
\u00a3 == £
So nothing is wrong here. And converting to HTML entities is easy:
htmlentities($title);
Here is an updated version of the function using preg_replace_callback instead of preg_replace, since the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0.
function fixBadUnicodeForJson($str) {
// Each callback receives the captured hex pairs in $matches[1], $matches[2], ...
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
$str
);
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
$str
);
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
$str
);
$str = preg_replace_callback(
'/\\\\u00([0-9a-f]{2})/',
function($matches) { return chr(hexdec($matches[1])); },
$str
);
return $str;
}
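
One possible variation, offered only as a sketch: handle a whole run of \u00XX escapes at once and keep the result only when the resulting bytes are valid UTF-8. The name repairByteEscapes is invented here. The sketch assumes the mis-encoded bytes are all in the range 0x80 to 0xFF, so ASCII escapes such as \u0022 are left alone and cannot break the surrounding JSON, and a genuine single code point such as \u00e9 is kept as-is because 0xE9 alone is not valid UTF-8.

```php
<?php
// Convert runs of \u00XX escapes (XX >= 0x80) back into raw bytes when,
// taken together, they form valid UTF-8. Hypothetical helper.
function repairByteEscapes(string $text): string
{
    $result = preg_replace_callback(
        '/(?:\\\\u00[89a-f][0-9a-f])+/i',
        function (array $run): string {
            // Turn each \u00XX escape in the run into the raw byte 0xXX.
            $bytes = preg_replace_callback(
                '/\\\\u00([0-9a-f]{2})/i',
                function (array $m): string {
                    return chr(hexdec($m[1]));
                },
                $run[0]
            );
            // An empty pattern with /u matches only valid UTF-8 input.
            return preg_match('//u', $bytes) === 1 ? $bytes : $run[0];
        },
        $text
    );
    return $result ?? $text; // fall back to the input on a regex error
}

echo repairByteEscapes('\u00c2\u00a3'), "\n";            // £
echo repairByteEscapes('That\u00e2\u0080\u0099s'), "\n"; // That’s
```

A run that mixes mis-encoded bytes with a genuine escape is rejected as a whole, so real data may need a more careful splitter.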

Decoding javascript escape sequences in PHP (\x27, \x22, etc...)

If I have this PHP string:
$string = '\\x27\\x22';
How would I decode it to '"?
A regex could help you here:
$out = preg_replace_callback(
"(\\\\x([0-9a-f]{2}))i",
function($a) {return chr(hexdec($a[1]));},
$string
);
You do not need to decode it. Just do str_replace('\\x27', "'", $str);. In case your '" was just an example, note that you have a repeating pattern \\xAA, where x indicates hexadecimal notation and AA is the hex value itself, so each \\xAA represents a single byte with AA ranging from 00 to FF. You can use a regexp, or walk over the string some other way, extract these AA values, convert each one with chr(hexdec($AA)) to the corresponding character, and append it to the result string.
$out = preg_replace_callback(
"(\\\\x([0-9a-f]{2}))i",
function($a) {return '\u00'.bin2hex(hex2bin($a[1]));},
$string
);
That worked after I converted the value from an ASCII \x escape to a Unicode \u escape.
