I have a variable that contains a string. I don't know what the string might be, but it could contain special characters.
I'd like to output that into a text file "as is".
So if there is for example a string "my string\n" I want the text file to show exactly that and not interpret the \n as a line feed / new line.
Then make sure it's "as is" in the string, e.g. "my string \\n" or 'my string \n'. PHP is not doing any transformation on the actual data - the transformation of "\n" to a newline happens when PHP parses the string literal in code.
Now, assuming that you want an actual newline character ("\n") in the data/string to be written as a sequence of two characters ('\n'), then it must be converted back, e.g.:
# \n is converted to a NL due to double-quoted literal ..
$strWithNl = "hello\n world";
# but given arbitrary data, we change it back ..
$strWithSlashN = str_replace("\n", '\n', $strWithNl);
There are likely better (read: existing) functions to "de-escape" a string per a given set of rules, but the above should hopefully show the concepts.
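One existing function that may fit is addcslashes(), which escapes a chosen set of characters in C style. The sketch below assumes you want every control character (0x00-0x1F) and the backslash itself written out as escape sequences; the sample data and the output file name are made up for illustration.

```php
<?php
// A value containing a real newline, a real tab and a backslash.
$data = "my string\n\tand a backslash: \\";

// Escape control characters (0x00-0x1F) and the backslash itself.
// addcslashes() writes \n, \t, \r etc. in C style and other control bytes as octal.
$escaped = addcslashes($data, "\0..\37\\");

echo $escaped, PHP_EOL; // my string\n\tand a backslash: \\

// file_put_contents('out.txt', $escaped); // hypothetical target file
```

stripcslashes() reverses the conversion, so the original data can be recovered from the file later.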
While everything above is true/valid (or should be corrected if not), I had a little extra time on my hands to create an escape_as_double_quoted_literal function.
Given an "ASCII encoded" string $str and $escaped = escape_as_double_quoted_literal($str), it should be the case that eval("return \"$escaped\";") === $str. I'm not exactly sure when this particular function will be useful (and please don't say for eval!), but since I did not find such a function after a quick search, this is my quick implementation. YMMV.
function escape_as_double_quoted_literal_matcher ($m) {
$ch = $m[0];
switch ($ch) {
case "\n": return '\n';
case "\r": return '\r';
case "\t": return '\t';
case "\v": return '\v';
case "\e": return '\e';
case "\f": return '\f';
case "\\": return '\\\\';
case "\$": return '\$';
case "\"": return '\\"';
case "\0": return '\000'; // three octal digits, so a digit that follows is not absorbed into the escape
default:
$h = dechex(ord($ch));
return '\x' . (strlen($h) > 1 ? $h : '0' . $h);
}
}
function escape_as_double_quoted_literal ($val) {
return preg_replace_callback(
"|[^\x20\x21\x23\x25-\x5b\x5e-\x7e]|",
"escape_as_double_quoted_literal_matcher",
$val);
}
And the usage of such:
$text = "\0\1\xff\"hello\\world\"\n\$";
echo escape_as_double_quoted_literal($text);
(Note that '\1' is encoded as \x01; both are equivalent in a PHP double-quoted string literal.)
The answer for "\n" is to replace any potential new line character with the literal characters.
str_replace("\n", '\n', $myString)
Not sure what the general case may be though for other potential special characters.
Related
Is there a PHP function that can highlight a word regardless of case or accents, such that the returned string is the original string with only the highlighting added?
For example:
Function highlight($string, $term_to_search){
// ...
}
echo highlight("my Striñg", "string");
// Result: "my <b>Striñg</b>"
Thanks in advance!
What I tried:
I tried to do a function that removed all accents & caps, then did a "str_replace" with the search term but found that the end result logically had no caps or special characters when I expected it to be just normal text but highlighted.
You can use the ICU library to normalize the strings. Then look for the position of the term inside the normalized string, and add the HTML tags at that position inside the original string.
function highlight($string, $term_to_search, Transliterator $tlr) {
$normalizedStr = $tlr->transliterate($string);
$normalizedTerm = $tlr->transliterate($term_to_search);
$termPos = mb_strpos($normalizedStr, $normalizedTerm);
// Actually, `mb_` prefix is useless since strings are normalized
if ($termPos === false) { //term not found
return $string;
}
$termLength = mb_strlen($term_to_search);
$termEndPos = $termPos + $termLength;
return
mb_substr($string, 0, $termPos)
. '<b>'
. mb_substr($string, $termPos, $termLength)
. '</b>'
. mb_substr($string, $termEndPos);
}
$tlr = Transliterator::create('Any-Latin; Latin-ASCII; Lower();');
echo highlight('Would you like a café, Mister Kàpêk?', 'kaPÉ', $tlr);
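You can try str_ireplace (note that it ignores case only, not accents, and that it inserts the search term as typed, so the original capitalization is not preserved):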
echo str_ireplace($term_to_search, '<b>'.$term_to_search.'</b>', $string);
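If keeping the original capitalization matters, a regex replacement with a back-reference may be closer to what was asked. This is only a sketch: highlight_ci is a hypothetical helper name, and it handles case differences but not accents (for those, the Transliterator approach is still needed).

```php
<?php
// Hypothetical helper: wraps case-insensitive matches in <b> tags,
// keeping the text exactly as it appears in the original string.
function highlight_ci(string $string, string $term): string
{
    // preg_quote() keeps regex metacharacters in the term literal.
    $pattern = '/' . preg_quote($term, '/') . '/iu';
    // $0 is the matched text from $string, not the search term.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to the input.
    return preg_replace($pattern, '<b>$0</b>', $string) ?? $string;
}

echo highlight_ci('my String', 'string'); // my <b>String</b>
```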
There are a number of questions on SO about removing whitespace, usually answered with a preg_replace('/[\s]{2,}/', '', $string) or similar answer that takes more than one whitespace character and removes them or replaces them with one of the characters.
This gets more complicated when certain whitespace duplication may be allowed (e.g. text blocks with two line breaks and one line break both allowed and relevant), moreso combining whitespace characters (\n, \r).
Here is some example text that, whilst messy, covers what I think you could end up with trying to present in a reasonable manner (e.g. user input that's previously been formatted with HTML and now stripped away)
$text = "\nDear Miss Test McTestFace,\r\n \n We have received your customer support request about:\n \tA bug on our website\n \t \n \n \n We will be in touch by : \n\r\tNext Wednesday. \n \r\n \n Thank you for your custom; \n \r \t \n If you have further questions please feel free to email us. \n \n\r\n \n Sincerely \n \n Customer service team \n \n";
If our target was to have it in the format:
Dear Miss Test McTestFace,
We have received your customer support request about: A bug on our website
We will be in touch by : Next Wednesday.
Thank you for your custom;
If you have further questions please feel free to email us.
Sincerely
Customer service team
How would we achieve this - simple regex, more complex iteration or are there already libraries that can do this?
Also are there ways we could make the test case more complex and thus giving a more robust overall algorithm?
For my own part I chose to attempt an iterative algorithm based on the idea that if we know the current context (are we in a paragraph, or in a series of line breaks/spaces?) we can make better decisions.
I chose to ignore the problem of tabs in this case and would be interested to see how they'd fit into the assumptions - in this case I simply stripped them out.
function strip_whitespace($string){
$string = trim($string);
$string = str_replace(["\r\n", "\n\r"], "\n", $string);
// These three could be done as one, but splitting out
// is easier to read and modify/play with
$string = str_replace("\r", "\n", $string);
$string = str_replace(" \n", "\n", $string);
$string = str_replace("\t", '', $string);
$string_arr = str_split($string);
$new_chars = [];
$prev_char_return = 0;
$prev_char_space = $had_space_recently = false;
foreach ($string_arr as $char){
switch ($char){
case ' ':
if ($prev_char_return || $prev_char_space){
continue 2;
}
$prev_char_space = true;
$prev_char_return = 0;
break;
case "\n":
case "\r":
if ($prev_char_return>1 || $had_space_recently){
continue 2;
}
if ($prev_char_space){
$had_space_recently = true;
}
$prev_char_return += 1;
$prev_char_space = false;
break;
default:
$prev_char_space = $had_space_recently = false;
$prev_char_return = 0;
}
$new_chars[] = $char;
}
$return = implode('', $new_chars);
// Shouldn't be necessary as we trimmed to start, but may as well
$return = trim($return);
return $return;
}
I'm still interested to see other ideas, and especially in any text whose obvious interpretation for a function of this type would be different from what this function produces.
Based on the example (and not looking at your code), it looks like the rule is:
a span of whitespace containing at least 2 LF characters
is a paragraph-separator (so convert it to a blank line);
any other span of whitespace is a word-separator
(so convert it to a single space).
If so, then one approach would be to:
Find the paragraph-separators and convert them to some string (not involving whitespace) that doesn't otherwise occur in the text.
Convert remaining whitespace to single-space.
Convert the paragraph-separator-indicators to \n\n.
E.g.:
$text = preg_replace(
array('/\s*\n\s*\n\s*/', '/\s+/', '/<PARAGRAPH-SEP>/'),
array('<PARAGRAPH-SEP>', ' ', "\n\n"),
trim($text)
);
If the rule is more complicated, then it might be better to use preg_replace_callback, e.g.:
$text = preg_replace_callback('/\s+/', 'handle_whitespace', trim($text));
function handle_whitespace($matches)
{
$whitespace = $matches[0];
if (substr_count($whitespace, "\n") >= 2)
{
// paragraph-separator: replace with blank line
return "\n\n";
}
else
{
// everything else: replace with single space character
return " ";
}
}
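To see the rule in action, here is a small self-contained run of the callback approach on a shortened, made-up sample. normalize_whitespace is a hypothetical wrapper name, not something from the question.

```php
<?php
// Hypothetical wrapper around the callback approach described above.
function normalize_whitespace(string $text): string
{
    return preg_replace_callback('/\s+/', function (array $m): string {
        // Two or more line feeds mark a paragraph break; anything else is a word gap.
        return substr_count($m[0], "\n") >= 2 ? "\n\n" : ' ';
    }, trim($text));
}

$sample = "\nDear Miss Test,\r\n \n We received your request about:\n \tA bug\n \t \n \n Thanks. \n";
echo normalize_whitespace($sample);
// Dear Miss Test,
//
// We received your request about: A bug
//
// Thanks.
```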
I know this question has been asked here many times, but those solutions are not useful for me. I have been struggling with this problem all day.
// Case 1
$str = 'Test \300'; // Single Quoted String
echo json_encode(utf8_encode($str)); // output: Test \\300
// Case 2
$str = "Test \300"; // Double Quoted String
echo json_encode(utf8_encode($str)); // output: Test \u00c0
I want case 2's output, but my $str variable holds the single-quoted form. The variable is filled by parsing an XML string, and that XML string is saved in a txt file.
(Here \300 is the encoding of the Latin character À, and I can't control it.)
Please don't give me a solution that only works for the static string above.
Thanks in advance
This'll do:
$string = '\300';
$string = preg_replace_callback('/\\\\([0-7]{1,3})/', function (array $match) {
return pack('C', octdec($match[1]));
}, $string);
It matches any sequence of a backslash followed by up to three octal digits and converts that number from octal notation to the corresponding byte, which is the same result "\300" produces in a double-quoted string.
Note that this will not work exactly the same for escaped escapes; i.e. "\\300" will result in a literal \300 while the above code will convert it.
If you want all the possible rules of double quoted strings followed without reimplementing them by hand, your best bet is to simply eval("return \"$string\";"), but that has a number of caveats too.
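If escaped backslashes matter for your data, one way to handle them without eval is to match the doubled backslash in the same pass. This is a sketch under that assumption; decode_octal_escapes is a hypothetical name, and it covers only octal escapes and \\, not the other double-quote rules.

```php
<?php
// Hypothetical helper: decodes \NNN octal escapes and turns \\ into a single backslash.
function decode_octal_escapes(string $s): string
{
    return preg_replace_callback('/\\\\(\\\\|[0-7]{1,3})/', function (array $m): string {
        if ($m[1] === '\\') {
            return '\\'; // escaped backslash: keep one literal backslash
        }
        // Values above 0377 wrap to one byte, as they do in a double-quoted literal.
        return chr(octdec($m[1]) & 0xFF);
    }, $s);
}

echo bin2hex(decode_octal_escapes('\300')), "\n";  // c0
echo decode_octal_escapes('a\\\\300'), "\n";       // a\300
```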
Maybe you are looking for this:
$str = 'Test \300'; // Single Quoted String
echo json_encode(stripslashes($str)); // output: "Test 300" (the backslash is removed, the digits are not decoded)
Sorry for the title, I really didn't know how to say this...
I often have a string that needs to be cut after X characters. My problem is that this string often contains special characters like &egrave;
So, I'm wondering, is there a way to know in PHP, without transforming my string, whether I am in the middle of a special char when I cut it?
Example
This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact
so right now my result with a sub string would be :
This is my string with a special char : &egra
but I want to have something like this :
This is my string with a special char : &egrave;
The best thing to do here is store your string as UTF-8 without any html entities, and use the mb_* family of functions with utf8 as the encoding.
But, if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mb_string library:
$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');
However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:
// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '
When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:
$strings_actual_encoding = 'utf8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);
The best solution would be to store your text as UTF-8 instead of storing it as HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of eight), then the following snippet should work:
<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_NOQUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_NOQUOTES, 'UTF-8')."<br><br>";
Note: If you use a different function to encode the text (e.g. htmlspecialchars()), then use that function instead of htmlentities(). If you use a custom encoding function, then use it instead of htmlentities(), and use a custom function that does the opposite instead of html_entity_decode().
The longest HTML 4 entity is 10 characters long, including the ampersand and semicolon (HTML5 defines longer named entities). If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.
However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.
$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
A slightly brute-force solution that I'm not really happy with would be a PCRE expression. Let's say that you want to keep 80 characters and the longest possible HTML entity is 7 chars long:
$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return a bit of shorter text
return preg_replace($regex, '$1', $text);
Just so you know:
.{73} - 73 characters
[^&]{7} - okay, we may fill it with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish
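A variation on the regex idea that avoids guessing the longest entity length is to count each entity as one unit. The sketch below assumes UTF-8 input and well-formed entities (named, decimal or hexadecimal); cut_with_entities is a hypothetical name.

```php
<?php
// Hypothetical helper: keeps at most $max units, where a unit is
// either one complete entity or one UTF-8 character.
function cut_with_entities(string $text, int $max): string
{
    $entity = '&(?:#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);';
    $pattern = '/^(?:' . $entity . '|.){0,' . $max . '}/su';
    // preg_match() fails on invalid UTF-8; in that case the text is returned unchanged.
    return preg_match($pattern, $text, $m) ? $m[0] : $text;
}

echo cut_with_entities('special char : &egrave; - and more', 16);
// special char : &egrave;
```

A bare ampersand that does not start an entity (as in "AT&T") simply counts as one character.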
Something that seems much better but requires bit of play with numbers is to:
// Nothing to cut if $text is at most $N bytes long :)
if (strlen($text) <= $N) {
    return $text;
}
// Get the last & within the first $N bytes
$pos = strrpos(substr($text, 0, $N), '&');
// We're not young anymore, we have to check this too (no entities at all) :)
if ($pos === false) {
    return substr($text, 0, $N);
}
// Find the first ; after that &
$end = strpos($text, ';', $pos);
if ($end === false) {
    // Not valid HTML, unclosed entity, do whatever you want
    return substr($text, 0, $N);
}
// Okay, entity closed before the cut point (; is before byte $N)
if ($end < $N) {
    return substr($text, 0, $N);
}
// The cut would fall inside the entity, so keep it whole
return substr($text, 0, $end + 1);
Check numbers, there may be +/-1 somewhere in indexes...
I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.
This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
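A quick way to convince yourself of what the escape produces is to compare it with the UTF-8 bytes written out by hand. This sketch assumes PHP 7.0 or later.

```php
<?php
// Requires PHP 7.0+ for the \u{...} escape.
$viaEscape = "\u{1000}";
$viaBytes = "\xE1\x80\x80"; // UTF-8 encoding of U+1000, written byte by byte

var_dump($viaEscape === $viaBytes); // bool(true)
echo bin2hex($viaEscape), "\n";     // e18080
```

The escape always yields UTF-8, regardless of the encoding of the source file.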
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need to alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
Before version 7.0, PHP did not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function (which was deprecated in PHP 7.2 and removed in PHP 8.0):
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = mb_internal_encoding();
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString('\u1000');
html_entity_decode('&#x30a8;', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = "&#8202;";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** @author Krinkle 2018 */
$output = '';
foreach (str_split($str) as $octet) {
$ordInt = ord($octet);
// Format each byte as two hex digits (zero-padded), for PHP \x syntax
$output .= sprintf('\x%02x', $ordInt);
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary('&#8202;') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
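If all you have is the code point number, the same escape sequence can also be computed directly with UTF-8 bit arithmetic, without going through JSON or HTML. This is a sketch only: codepoint_to_utf8_escape is a hypothetical name, and it does not reject surrogates or values above U+10FFFF.

```php
<?php
// Hypothetical helper: encodes a Unicode code point as UTF-8 and
// returns the bytes as a PHP "\x.." escape sequence.
function codepoint_to_utf8_escape(int $cp): string
{
    if ($cp < 0x80) {
        $bytes = [$cp];
    } elseif ($cp < 0x800) {
        $bytes = [0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)];
    } elseif ($cp < 0x10000) {
        $bytes = [0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F)];
    } else {
        $bytes = [
            0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F),
        ];
    }
    // Two zero-padded hex digits per byte.
    return implode('', array_map(function (int $b): string {
        return sprintf('\x%02x', $b);
    }, $bytes));
}

echo codepoint_to_utf8_escape(0x200A), "\n"; // \xe2\x80\x8a
echo codepoint_to_utf8_escape(0x221E), "\n"; // \xe2\x88\x9e
```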
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($msg);