Consider the following code:
$str = '';
for ($i=0x0; $i<=0x7f; $i++) {
$str .= chr($i);
}
echo json_encode($str);
The result is:
"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
It contains every ASCII character except the last one: 127 (0x7f).
Is there a way to show that character? (for instance: "\u007f")
Delete (DEL) is a control character in the ASCII character set with the code 0x7f, or 127 decimal. json_encode does not escape it; the character is written into the JSON string as the raw byte. It can be made visible by outputting the JSON string in hexadecimal format.
$jsonStr = json_encode(chr(0x7f));
echo bin2hex($jsonStr); //227f22
22 is the encoding for a double quotation mark ("). echo is not suitable for checking what is in a string. There are always misunderstandings. Control characters (including DEL) are only displayed as spaces in the browser. If you look closely at the result of your code example, you will see the space at the end.
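If the goal is to have DEL actually appear as "\u007f" in the output, one option is to post-process the result of json_encode. The following is a minimal sketch, not a built-in feature: the helper name json_encode_visible_del is made up for illustration. It relies on the fact that a raw 0x7f byte can only occur inside a JSON string, where \u007f is an equivalent escape.

```php
<?php
// Hypothetical helper: json_encode() leaves DEL (0x7f) as a raw byte,
// so replace it afterwards with the equivalent JSON escape sequence.
function json_encode_visible_del($value)
{
    $json = json_encode($value);
    if ($json === false) {
        return false; // encoding failed, nothing to post-process
    }
    return str_replace("\x7f", '\u007f', $json);
}

echo json_encode_visible_del("a" . chr(0x7f) . "b"), PHP_EOL; // "a\u007fb"
```

json_decode turns the escape back into the original byte, so the data itself is unchanged.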
I am not sure, but I think this is what you are searching for:
U+007F
Also see this answer: Why no symbols defined for ascii values from 127 to 159
I'm writing code that accepts the middle name of a person, converts it to upper case, then gets its first character.
For example, if the user inputs the name "winston", I'll get a capital "W".
I can use either of these two snippets to get the first character of the string, and both work fine.
mb_strtoupper(substr($name,0,1));
or
mb_strtoupper($name[0]);
I'm using mb_strtoupper() so that it can convert a character with diacritics like ñ.
My problem is when the name's first character has a diacritical mark.
Ñana
I'm testing a code,
$name_1 = 'Ñana';
echo strtoupper(substr($name_1,0,1));
echo '<br>';
echo strtoupper($name_1[0]);
The result
I tried to increase index of $name_1 and the parameter of substr().
$name_1 = 'Ñana';
echo strtoupper(substr($name_1,0,2));
echo '<br>';
echo strtoupper($name_1[1]);
The result
My code works if the first character of the string doesn't have a diacritic.
How should I handle this kind of situation?
You cannot use substr on a UTF-8 encoded string here, because it cuts a single byte out of a multibyte character. Use mb_substr from the Multibyte String Functions instead:
echo mb_strtoupper(mb_substr($name_1, 0, 1));
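As a small sketch of the same idea, the call can be wrapped in a helper that passes the encoding explicitly, so it does not depend on the mb_internal_encoding setting. The function name first_letter_upper is hypothetical, and the example assumes the mbstring extension is loaded and the input is UTF-8.

```php
<?php
// Hypothetical helper: upper-case first character of a UTF-8 string.
function first_letter_upper($name)
{
    return mb_strtoupper(mb_substr($name, 0, 1, 'UTF-8'), 'UTF-8');
}

$name = "\xC3\xB1ana"; // 'ñana' written as UTF-8 bytes

echo first_letter_upper($name), PHP_EOL;     // Ñ
echo first_letter_upper('winston'), PHP_EOL; // W
echo strlen("\xC3\x91"), PHP_EOL;            // 2: 'Ñ' is two bytes, so substr($name, 0, 1) returns half a character
```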
I am encoding the URL suffix of my application:
$url = 'subjects?_d=1';
echo base64_encode($url);
// Outputs
c3ViamVjdHM/X2Q9MQ==
Notice the slash before 'X2'.
Why is this happening? I thought base64 only outputted A-Z, 0-9 and '=' as padding?
No. The Base64 alphabet includes A-Z, a-z, 0-9 and + and /.
You can replace them if you don't care about portability towards other applications.
See: http://en.wikipedia.org/wiki/Base64#Variants_summary_table
You can use something like these to use your own symbols instead (replace - and _ by anything you want, as long as it is not in the base64 base alphabet, of course!).
The following example converts the normal base64 to base64url as specified in RFC 4648:
function base64url_encode($s) {
return str_replace(array('+', '/'), array('-', '_'), base64_encode($s));
}
function base64url_decode($s) {
return base64_decode(str_replace(array('-', '_'), array('+', '/'), $s));
}
In addition to the answers above pointing out that / is part of the standard Base64 alphabet, it should be noted why you saw a / in your encoded string: when Base64-encoding printable ASCII text, the only way to generate a / is to have a question mark at a position divisible by three (counting from 1). In 'subjects?_d=1' the question mark is the 9th character.
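A quick way to see this is to move the question mark through a three-character input. This is only an illustration; the input strings are made up.

```php
<?php
// Only the low six bits of the third byte in each 3-byte group map
// directly to one Base64 digit; '?' is 0x3f, six 1-bits = index 63 = '/'.
foreach (array('ab?', 'a?b', '?ab') as $input) {
    echo $input, ' => ', base64_encode($input), PHP_EOL;
}
// ab? => YWI/
// a?b => YT9i
// ?ab => P2Fi
```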
Sorry, you thought wrong. A-Za-z0-9 only gets you 62 characters. Base64 uses two additional characters, in PHP's case / and +.
There is nothing special in that.
The base 64 "alphabet" or "digits" are A-Z,a-z,0-9 plus two extra characters + (plus) and / (slash).
You can later encode / with %2f if you want.
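For instance, a minimal sketch using PHP's rawurlencode, which percent-encodes both the / and the = padding:

```php
<?php
$encoded = base64_encode('subjects?_d=1'); // c3ViamVjdHM/X2Q9MQ==
$safe = rawurlencode($encoded);

echo $safe, PHP_EOL;                              // c3ViamVjdHM%2FX2Q9MQ%3D%3D
echo base64_decode(rawurldecode($safe)), PHP_EOL; // subjects?_d=1
```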
Not directly related, and enough people above have answered and explained solutions quite well.
However, going a bit outside of the scope of things. If you want readable base text, try looking into Base58. It's worth considering if you want only alphanumeric characters.
For base64 the valid charset is:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
the = is used as filler for the last bytes
A-Z is 26 characters.
0-9 is 10 characters.
= is one character. That gives a total of 37 characters, which is some way short of 64.
/ is one of the 64 characters. You can see a complete list on the wikipedia page.
I have two strings with seemingly the same values. One is stored as a key in an array, the other a value in another different array. I compare the two using ==, ===, and strcmp. All treat them as different strings. I do a var_dump and this is what I get.
string(17) "Valentine’s Day"
string(15) "Valentine's Day"
Does anyone have any idea why the first string would be 17 characters and the second 15?
Update: This became slightly more obvious when I pasted this out of my editor, whose font made the two different apostrophes almost indistinguishable.
The first string contains a Unicode character for the apostrophe while the second string just has a regular ASCII ' character.
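The Unicode character takes up more space: encoded as UTF-8 it is three bytes instead of one, and the length reported by var_dump is a byte count, which accounts for 17 versus 15.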
If you run the PHP ord() function on each of those characters you'll see that you get different values for each:
echo ord("’"); // 226 (0xE2): ord() only looks at the first of the character's three UTF-8 bytes
echo ord("'"); // 39 (0x27)
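The difference in length can be checked directly. The sketch below writes the curly apostrophe (U+2019) as explicit UTF-8 bytes so it does not depend on how the source file is saved; it assumes the mbstring extension is available for mb_strlen.

```php
<?php
$curly = "Valentine\xE2\x80\x99s Day"; // U+2019 as UTF-8 bytes
$plain = "Valentine's Day";

echo strlen($curly), PHP_EOL;             // 17 (bytes)
echo mb_strlen($curly, 'UTF-8'), PHP_EOL; // 15 (characters)
echo strlen($plain), PHP_EOL;             // 15
echo bin2hex("\xE2\x80\x99"), PHP_EOL;    // e28099
```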
As a complement to @Mark's answer above, which is right (the ’ is a multi-byte character, most probably UTF-8, while ' is not): you can easily convert it to ASCII (or ISO-8859-1) using iconv, for example:
echo iconv('utf-8', 'ascii//TRANSLIT', $str);
Note: Not all characters can be transformed from multi-byte to ASCII or latin1. You can use //IGNORE to have them removed from the resulting string.
’ != '
mainly. if you want this not to be an issue, you could do something like this.
if (str_replace('’', '\'', "Valentine’s Day") == "Valentine's Day") {
I would like to write a (HTML) parser based on a state machine, but I have doubts about how to actually read/use the input. I decided to load the whole input into one string and then work with it as with an array, holding its index as the current parsing position.
There would be no problems with single-byte encoding, but in multi-byte encoding each value does not represent a character, but a byte of a character.
Example:
$mb_string = 'žščř'; //4 multi-byte characters in UTF-8
for($i=0; $i < 4; $i++)
{
echo $mb_string[$i], PHP_EOL;
}
Outputs:
Ĺ
ž
Ĺ
Ą
This means I cannot iterate through the string in a loop to check single characters, because I never know if I am in the middle of a character or not.
So the questions are:
How do I read a single character from a string in a multi-byte-safe, performance-friendly way?
Is it a good idea to work with the string as if it were an array in this case?
How would you read the input?
http://php.net/mb_string is the thing you're looking for
just mb_substr characters one by one
not until PHP6
what input exactly? The usual way in general
mb_internal_encoding("UTF-8");
$mb_string = 'žščř';
$l=mb_strlen($mb_string);
for($i=0;$i<$l;$i++){
print(mb_substr($mb_string,$i,1)."<br/>");
}
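Calling mb_substr inside a loop may have to scan the string from the start on every call, which can get slow for long inputs. One alternative, shown here only as a sketch, is to split the string into characters once with preg_split and the u modifier, then iterate over the resulting array. The sample string is written as UTF-8 bytes so the example does not depend on the file encoding.

```php
<?php
// Assumes $input is valid UTF-8; preg_split() returns false otherwise.
$input = "\xC5\xBE\xC5\xA1\xC4\x8D\xC5\x99"; // 'žščř' written as UTF-8 bytes
$chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $index => $char) {
    echo $index, ': ', $char, ' (', strlen($char), ' bytes)', PHP_EOL;
}
```

Each array element is one complete character, so a parser can keep an integer position into $chars the same way it would index a single-byte string.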
Without using the mb_* functions, you can still use the standard substring functions on a multi-byte encoded string, provided you read as many bytes as the character's encoding uses.
For example, each character in the following UTF-8 string is encoded as 2 bytes (UTF-8 is variable-width: a character takes 1 to 4 bytes), so to get the first character from the string
$string = 'žščř'; //4 multi-byte characters in UTF-8
you have to take both $string[0] and $string[1], that is, the 2-byte substring starting at index 0: substr($string, 0, 2).
Note that $string[0] or $string[N] references a single byte of the multi-byte string, not a character.
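Because UTF-8 characters vary in length, the number of bytes to read can be taken from the first byte of each character. The following is a rough sketch of that idea; the function name utf8_char_at is hypothetical, and it assumes the string is valid UTF-8 and the offset points at the first byte of a character.

```php
<?php
// Hypothetical helper: return the UTF-8 character starting at byte offset $i.
// Assumes $s is valid UTF-8 and $i points at the first byte of a character.
function utf8_char_at($s, $i)
{
    $lead = ord($s[$i]);
    if ($lead < 0x80) {
        $len = 1; // 0xxxxxxx: ASCII
    } elseif ($lead < 0xE0) {
        $len = 2; // 110xxxxx
    } elseif ($lead < 0xF0) {
        $len = 3; // 1110xxxx
    } else {
        $len = 4; // 11110xxx
    }
    return substr($s, $i, $len);
}

$s = "a\xC5\xBE\xE2\x80\x99"; // 'a', 'ž' (2 bytes), '’' (3 bytes)
$i = 0;
$n = strlen($s);
while ($i < $n) {
    $char = utf8_char_at($s, $i);
    echo $i, ': ', $char, PHP_EOL;
    $i += strlen($char); // advance by the byte length of the character
}
```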
regards,
I want to be able to detect (using regular expressions) whether a string contains Hebrew characters, in both UTF-8 and ISO-8859-8, in PHP. Thanks!
Here's map of the iso8859-8 character set. The range E0 - FA appears to be reserved for Hebrew. You could check for those characters in a character class:
[\xE0-\xFA]
For UTF-8, the Hebrew characters fall in the Unicode range U+0591 to U+05F4. PCRE does not support the \u escape, so write the code points as \x{...} and add the u modifier:
[\x{0591}-\x{05F4}]
Here's an example of a regex match in PHP:
echo preg_match('/[\x{0591}-\x{05F4}]/u', $string);
Well, if your PHP file is encoded in UTF-8, as it should be when it contains Hebrew, you should use the following regex:
$string="אבהג";
echo preg_match("/\p{Hebrew}/u", $string);
// output: 1
Here's a small function to check whether the first character in a string is in hebrew:
function IsStringStartsWithHebrew($string)
{
return (strlen($string) > 1 && //minimum of chars for hebrew encoding
ord($string[0]) == 215 && //first byte is 110-10111
ord($string[1]) >= 144 && ord($string[1]) <= 170 //hebrew range in the second byte.
);
}
good luck :)
First, such a string would be completely useless - a mix of two different character sets?
Both the Hebrew characters in iso8859-8 and each byte of a multibyte sequence in UTF-8 have a value ord($char) > 127. So what I would do is find all bytes with a value greater than 127, and then check whether they make sense as iso8859-8, or whether they would make more sense as a UTF-8 sequence...
function is_hebrew($string)
{
return preg_match("/\p{Hebrew}/u", $string);
}
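The approaches in this thread can be combined into a single check. This is only a heuristic sketch: the function name contains_hebrew is made up, and it assumes the input is either valid UTF-8 or ISO-8859-8. Text in another single-byte encoding (ISO-8859-1, for example) could be misreported as Hebrew.

```php
<?php
// Hypothetical helper: heuristic check, assuming $string is either valid
// UTF-8 or ISO-8859-8. Bytes 0xE0-0xFA are the Hebrew letters in ISO-8859-8.
function contains_hebrew($string)
{
    if (preg_match('//u', $string) === 1) {
        // Valid UTF-8: use the Unicode script property.
        return preg_match('/\p{Hebrew}/u', $string) === 1;
    }
    // Not valid UTF-8: treat the bytes as single-byte ISO-8859-8.
    return preg_match('/[\xE0-\xFA]/', $string) === 1;
}

var_dump(contains_hebrew("\xD7\x90\xD7\x91")); // UTF-8 alef, bet: bool(true)
var_dump(contains_hebrew("\xE0\xE1"));         // ISO-8859-8 alef, bet: bool(true)
var_dump(contains_hebrew("hello"));            // bool(false)
```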