I want to replace the Unicode escape \u00a0 with the actual character in PHP.
Example : "<b>Sort By\u00a0-\u00a0</b>A to Z\u00a0|\u00a0Recently Added\u00a0|\u00a0Most Downloaded"
\u00a0 is the escape sequence of a NO-BREAK SPACE
To decode any escape sequence in PHP you can use this function:
function unicodeString($str, $encoding = null) {
    // Default to mbstring's internal encoding when none is given.
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    // create_function() was removed in PHP 8.0, so use a closure instead.
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function ($match) use ($encoding) {
        return mb_convert_encoding(pack("H*", $match[1]), $encoding, "UTF-16BE");
    }, $str);
}
echo unicodeString($str);
//<b>Sort By - </b>A to Z | Recently Added | Most Downloaded
DEMO:
https://ideone.com/9QtzMO
If you just need to replace a single escape sequence, use:
$str = str_replace('\u00a0', ' ', $str); // single quotes: match the literal text \u00a0
echo $str;
<b>Sort By - </b>A to Z | Recently Added | Most Downloaded
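If the goal is to end up with a real NO-BREAK SPACE rather than an ordinary space, a variation on the single-replacement approach could look like the sketch below. The helper name replaceNbspEscape is hypothetical, and the "\u{00a0}" syntax assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: swap the literal text \u00a0 for a real NO-BREAK SPACE.
function replaceNbspEscape($str) {
    // Single quotes keep '\u00a0' as six literal characters;
    // "\u{00a0}" (PHP 7+) is the actual character, encoded as UTF-8 (bytes C2 A0).
    return str_replace('\u00a0', "\u{00a0}", $str);
}

echo bin2hex(replaceNbspEscape('\u00a0')), "\n"; // c2a0
```

The bin2hex call is only there to make the otherwise invisible character visible.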
Related
I'm using this function to clean strings for elastic search:
function cleanString($string){
$string = mb_convert_encoding($string, "UTF-8");
$string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
$string = strip_tags($string);
$string = filter_var($string, FILTER_SANITIZE_STRING);
$string = str_ireplace(array("\t", "\n", "\r", " "," ",":"), ' ', $string);
$string = str_ireplace(array("","«","»","£"), '', $string);
return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}
It does all sorts of stuff, but the problem I am facing has to do with the trim function at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim away from the string: « and ». This caused problems with another special character:
When I pass the word België into the function, the ë gets corrupted and elastic throws an error.
Why does trim corrupt a completely different character?
How can I fix that, so that I parse out « and » and preserve ë?
trim is not encoding aware and just looks at individual bytes. If you tell it to trim '«»', and that's encoded in UTF-8, it will look for the bytes C2 AB C2 BB (where C2 is redundant, so AB BB C2 are the actual search terms). "ë" in UTF-8 is C3 AB, so half of it gets removed and the character is thereby broken.
You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:
preg_replace('/^[«»]+|[«»]+$/u', '', $str)
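To see the byte-level effect described above, here is a minimal sketch. The helper name trimGuillemets is hypothetical, and the "\u{...}" escapes assume PHP 7.0 or later; they are used only so the example does not depend on the file's own encoding.

```php
<?php
// Hypothetical helper: strip leading/trailing guillemets one code point at a time.
// Note: preg_replace() returns null if the input is not valid UTF-8.
function trimGuillemets($str) {
    // \x{AB} is « and \x{BB} is »; the u flag makes PCRE match whole code points.
    return preg_replace('/^[\x{AB}\x{BB}]+|[\x{AB}\x{BB}]+$/u', '', $str);
}

$word = "Belgi\u{00EB}";                     // "België", ends in bytes C3 AB
$broken = trim($word, "\u{00AB}\u{00BB}");   // mask bytes: C2 AB C2 BB

echo bin2hex(substr($word, -2)), "\n";       // c3ab
echo bin2hex(substr($broken, -1)), "\n";     // c3 (the AB byte was trimmed)
var_dump(trimGuillemets("\u{00AB}" . $word . "\u{00BB}") === $word); // bool(true)
```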
I have been trying to remove junk characters from a stream of HTML strings using PHP, but haven't been successful yet. Is there any special syntax or logic to remove special characters from the string?
I have tried this so far, but it isn't working:
$new_string = preg_replace("�", "", $HtmlText);
echo '<pre>'.$new_string.'</pre>';
\p{S}
You can use this. \p{S} matches math symbols, currency signs, dingbats, box-drawing characters, etc.
See demo.
https://www.regex101.com/r/rK5lU1/30
$re = "/\\p{S}/u"; // the u flag makes PCRE read the subject as UTF-8
$str = "asdas�sadsad";
$subst = "";
$result = preg_replace($re, $subst, $str);
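Because \p{S} also removes ordinary symbols such as + or $, it may strip more than intended. The sketch below contrasts it with a narrower pattern that targets only U+FFFD, the replacement character shown in the question. Both helper names are hypothetical, and the u flag is assumed to be needed so that the UTF-8 bytes are read as one code point.

```php
<?php
// Hypothetical helpers, not part of the original answer.

// Removes every character in Unicode category S (symbols), including U+FFFD.
function stripSymbols($str) {
    return preg_replace('/\p{S}/u', '', $str);
}

// Removes only the replacement character U+FFFD and leaves other symbols alone.
function stripReplacementChar($str) {
    return preg_replace('/\x{FFFD}/u', '', $str);
}

$str = "asdas\u{FFFD}sadsad 5+5";
echo stripSymbols($str), "\n";          // asdassadsad 55
echo stripReplacementChar($str), "\n";  // asdassadsad 5+5
```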
This is due to mismatch in Charset between database and front-end. Correcting this will fix the issue.
function clean($string) {
return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}
Suppose I have this function:
function f($string){
$string = preg_replace("`\[.*\]`U","",$string);
$string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i','-',$string);
$string = htmlentities($string, ENT_COMPAT, 'utf-8');
$string = preg_replace( "`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i","\\1", $string );
$string = preg_replace( array("`[^a-z0-9]`i","`[-]+`") , "-", $string);
return $string;
}
How can I reverse this function? That is, how should I write the function fReverse() so that we have the following:
$s = f("some string223---");
$reversed = fReverse($s);
echo $reversed;
and output: some string223---
f is lossy. It is impossible to find an exact reverse. For example, both "some string223---" and "some string223--------" give the same output (see http://ideone.com/DtGQZ).
Nevertheless, we could find a pre-image of f. The 5 replacements of f are:
1. Strip everything between [ and ].
2. Replace entities like &lt;, &#123; and encoded entities like &amp;lt; with a hyphen -.
3. Escape special HTML characters (< → &lt;, & → &amp; etc.)
4. Remove accents of accented characters (é (=&eacute;) → e, etc.)
5. Turn non-alphanumerics and consecutive hyphens into a single hyphen -.
Steps 1, 2, 4 and 5 can all act as identity transforms, so one possible pre-image comes from just reversing step 3:
function fReverse($string) {
return html_entity_decode($string, ENT_COMPAT, 'utf-8');
}
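A quick way to check both claims (that f is lossy, and that fReverse still yields a valid pre-image) is the sketch below. The two functions are copied from the posts above so the example runs on its own; the variable names are only illustrative.

```php
<?php
// f() and fReverse() copied from the posts above so the sketch is self-contained.
function f($string) {
    $string = preg_replace("`\[.*\]`U", "", $string);
    $string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i', '-', $string);
    $string = htmlentities($string, ENT_COMPAT, 'utf-8');
    $string = preg_replace("`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i", "\\1", $string);
    $string = preg_replace(array("`[^a-z0-9]`i", "`[-]+`"), "-", $string);
    return $string;
}

function fReverse($string) {
    return html_entity_decode($string, ENT_COMPAT, 'utf-8');
}

$a = f("some string223---");
$b = f("some string223--------");
echo $a, "\n";                          // some-string223-
var_dump($a === $b);                    // bool(true): two inputs, one output
var_dump(f(fReverse($a)) === $a);       // bool(true): fReverse($a) maps back to $a
```

The last line shows what "pre-image" means here: fReverse does not recover the original input, only some input that f maps to the same output.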
Does PHP have a function that searches for hex codes in a string and converts them into their char equivalents?
For example - I have a string that contains the following
"Hello World\x19s"
And I want to convert it to
"Hello World's"
Thanks in advance.
This code will convert "Hello World\x27s" into "Hello World's". It will convert "\x19" into the "end of medium" character, since that's what 0x19 represents in ASCII.
$str = preg_replace('/\\\\x([0-9a-f]{2})/e', 'chr(hexdec($1))', $str);
Correct me if I'm wrong, but I think you should change the callback like so:
$str = preg_replace('/\\\\x([0-9a-f]{2})/e', 'chr(hexdec(\'$1\'))', $str);
With the single quotes added, characters like '=' (\x3d) are converted correctly too.
The /e modifier raises an error in current PHP versions, which advise using preg_replace_callback instead. Try this:
preg_replace_callback('/\\\\x([0-9a-f]{2})/', function ($m) { return chr(hexdec($m[1])); }, $str );
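Wrapped in a function, the callback version can be tried out as in the sketch below. The name decodeHexEscapes is hypothetical, and the i flag is an addition not present in the line above, included on the assumption that upper-case hex digits should be accepted as well.

```php
<?php
// Hypothetical helper name; the regex and callback follow the answer above.
function decodeHexEscapes($str) {
    // The i flag (an addition) also accepts upper-case hex digits.
    return preg_replace_callback('/\\\\x([0-9a-f]{2})/i', function ($m) {
        return chr(hexdec($m[1]));
    }, $str);
}

echo decodeHexEscapes('Hello World\x27s'), "\n"; // Hello World's
echo decodeHexEscapes('a\x3Db'), "\n";           // a=b
```

The inputs are single-quoted on purpose, so that PHP leaves the backslash sequences as literal text for the regex to find.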
The /e modifier causes PHP errors: it was deprecated in PHP 5.5 and removed in PHP 7.0. If the hex codes are HTML entities such as &#x27; rather than \x escapes, the way to convert them into characters is:
$str = html_entity_decode($str, ENT_QUOTES | ENT_XML1, 'UTF-8');
This will turn &#x27; into ' and &amp; into &, etc.
I have a data file (an Apple plist, to be exact), that has Unicode codepoints like \U00e8 and \U2019. I need to turn these into valid hexadecimal HTML entities using PHP.
What I'm doing right now is a long string of:
$fileContents = str_replace("\U00e8", "&#xe8;", $fileContents);
$fileContents = str_replace("\U2019", "&#x2019;", $fileContents);
Which is clearly dreadful. I could use a regular expression to convert the \U and all trailing 0s to &#x, then stick on the trailing ;, but that also seems heavy-handed.
Is there a clean, simple way to take a string, and replace all the unicode codepoints to HTML entities?
Here's a correct answer, that deals with the fact that those are code units, not code points, and allows unencoding supplementary characters.
function unenc_utf16_code_units($string) {
    /* go for possible surrogate pairs first */
    $string = preg_replace_callback(
        '/\\\\U(D[89ab][0-9a-f]{2})\\\\U(D[c-f][0-9a-f]{2})/i',
        function ($matches) {
            $hi_surr = hexdec($matches[1]);
            $lo_surr = hexdec($matches[2]);
            $scalar = (0x10000 + (($hi_surr & 0x3FF) << 10) |
                ($lo_surr & 0x3FF));
            return "&#x" . dechex($scalar) . ";";
        }, $string);
    /* now the rest */
    $string = preg_replace_callback('/\\\\U([0-9a-f]{4})/i',
        function ($matches) {
            //just to remove leading zeros
            return "&#x" . dechex(hexdec($matches[1])) . ";";
        }, $string);
    return $string;
}
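The same idea can be sketched as a single pass with one alternation, which also makes it easy to test a surrogate pair next to a plain code unit. This is a hypothetical variant, not the answer's own code; the name unencCodeUnits and the sample input are made up for illustration.

```php
<?php
// Hypothetical single-pass variant of the two-step function above.
function unencCodeUnits($string) {
    // First alternative: high + low surrogate pair. Second: any single code unit.
    $pattern = '/\\\\U(d[89ab][0-9a-f]{2})\\\\U(d[c-f][0-9a-f]{2})|\\\\U([0-9a-f]{4})/i';
    return preg_replace_callback($pattern, function ($m) {
        if ($m[1] !== '') {
            // Surrogate pair: combine both 10-bit halves into one scalar value.
            $scalar = 0x10000 + ((hexdec($m[1]) & 0x3FF) << 10) + (hexdec($m[2]) & 0x3FF);
        } else {
            // Single code unit in the Basic Multilingual Plane.
            $scalar = hexdec($m[3]);
        }
        return '&#x' . dechex($scalar) . ';';
    }, $string);
}

echo unencCodeUnits('caf\U00e9 \Ud83d\Ude00'), "\n"; // caf&#xe9; &#x1f600;
```

Putting the surrogate-pair alternative first matters: otherwise each half of the pair would be matched on its own and emitted as two invalid entities.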
You can use preg_replace:
preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $fileContents);
Testing the RE:
PS> 'some \U00e8 string with \U2019 embedded Unicode' -replace '\\U0*([0-9a-f]{1,5})','&#x$1;'
some &#xe8; string with &#x2019; embedded Unicode