PHP string encoding is not recognized by strpos()?

I have a binary Word .doc that looks something like this in string format:
þÿÿÿÿÿÿÿppp„±¶g œÙ Text in word doc here I'm interested in [|`ñÿ|Standard1$S_HmHnHsHtHOJPJQJCJEH567>
When I echo that string, I can see all the text I'm interested in between unrecognized characters (I'm not worried about those, since I only want the text). The issue is that PHP does not seem to recognize it as a string, so I cannot search it: strpos(), strchr(), and mb_strpos() all return nothing. No -1, no error in the PHP error log, just nothing.
However, when I call gettype() I get string. I suspect this is an encoding issue, but mb_detect_encoding returns UTF-8. I have tried converting it to several different encodings, to no avail.
How can I get PHP to search this string? I understand that parsing a Word .doc properly is a more complex problem, but for my purposes the plain text I'm interested in is right there in the binary data. Does anyone have any experience with this?
Thank you :)

Since your string seems to be binary and you are only interested in the text, a quick solution would be to use filter_var to strip the non-printable, non-ASCII characters from it. Note that the flags go in the third argument, combined with |, and the second argument is the filter itself. Try using this before searching:
$clean_string = filter_var($str, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
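A minimal sketch of that approach follows. The sample bytes are made up to stand in for the .doc contents, and the UTF-16LE case is only an assumption about how the text might be stored in the file. Also keep in mind that strpos() returns false when nothing matches, and echo prints false as an empty string, which would explain seeing "nothing".

```php
<?php
// Hypothetical sample bytes standing in for the binary .doc contents.
$raw = "\xFE\xFF\x00ppp\x84 Text in word doc \x00\xF1";

// Strip bytes below 32 (this includes tabs and newlines) and above 127.
$clean = filter_var($raw, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);

// Use var_dump rather than echo so that false is visible.
$pos = strpos($clean, "Text");
var_dump($pos);

// Assumption: if the text were stored as UTF-16LE, every ASCII letter would be
// followed by a NUL byte, so a plain search for "Text" would not match.
$utf16 = "T\x00e\x00x\x00t\x00";
$before = strpos($utf16, "Text");
$fromUtf16 = filter_var($utf16, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
$after = strpos($fromUtf16, "Text");
var_dump($before, $after);
```

Always compare the result of strpos() with === false, since a match at position 0 is also falsy.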

Notice the part "Standard1$". Inside a double-quoted string, PHP treats $ followed by a name as a variable to interpolate instead of as a literal character.
Check here:
<?php
$s = "þÿÿÿÿÿÿÿppp„±¶g œÙ Text in word doc here I'm interested in [|`ñÿ|Standard1$S_HmHnHsHtHOJPJQJCJEH567>";
$s2 = strpos($s, "interested");
echo $s2;
?>
You might want to put a backslash before that $ sign, or use single quotes.
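As a small illustration (the shortened string is just a stand-in for the original data), these two literals produce the same bytes, with the $ kept as a plain character:

```php
<?php
// Single quotes: no variable interpolation at all.
$single = 'Standard1$S_HmHn';

// Double quotes: the backslash keeps $ from starting a variable name.
$escaped = "Standard1\$S_HmHn";

var_dump($single === $escaped);
var_dump(strpos($single, '$'));
```

This only matters for string literals written in source code. A string read from a file at runtime is never interpolated.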

Related

How to convert binary string to normal string in php

Description of the problem
I am trying to import email content into a database table. Sometimes I get an SQL error while inserting a message. I found that it fails when the message content is a binary string instead of a normal string.
For example, I get this in the console if I print a message that is imported successfully (truncated):
However, I get this with a problematic import:
I found out that if I use the function utf8_encode, I can import it into SQL successfully. The problem is that it "breaks" the accented characters in imports that previously succeeded:
What I have tried
Detecting whether the string is binary with ctype_print: it returned false for both the normal and the binary string. Had it worked, I could have called utf8_encode only on the binary ones.
Using unpack: did not work.
Detecting the string encoding with mb_detect_encoding: it returns UTF-8 for both.
Using iconv: failed with iconv(): Detected an illegal character in input string.
Casting the content to string using (string) / settype($html, 'string').
Question
How can I transform the binary string in a normal string so I can then import it in my database without breaking accented characters in other imports?

This is pretty late, but for anyone else reading... Apparently the b prefix is meaningless in PHP, it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
What encodings did you pass to iconv()? This is the correct solution but you have to give it the correct first argument, which depends on your input. In my example I use "LATIN1" because that turned out to be the correct way to interpret my input but your use case may vary.
You can use mb_check_encoding() to check if it is valid UTF-8 or not. This returns a boolean.
Assuming the question is really something like "how to convert extended ascii string to valid utf-8 string in PHP" - Here is how I did it in my application:
if(!mb_check_encoding($string)) {
$string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}
The "TRANSLIT" part tells it to attempt transliteration, that's optional for you. The "IGNORE" will prevent it from throwing Detected an illegal character in input string if it does detect one; instead the character will just get ignored, meaning, removed. Your use case may not need either of these.
When you're debugging, I recommend just using "UTF-8" as the second argument so you can see what it's doing. It's useful to see if it throws an error. For me, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN-1") and it threw the illegal character error on an accented character. That error went away once I passed it the correct encoding.
By the way, mb_detect_encoding() was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of unpack("C*", $string) to see what exact bytes were in there. That's more debugging advice than solution but worth mentioning in case it helps.
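Here is a rough sketch of both steps: the byte dump for debugging and the conditional conversion. The helper name toUtf8 and the sample string are made up for illustration, and it uses mb_convert_encoding rather than iconv so the result does not depend on the local iconv implementation. The Latin-1 default is an assumption you have to verify against your own data.

```php
<?php
// Hypothetical input: "café" encoded as ISO-8859-1 (é is the single byte 0xE9).
$latin1 = "caf\xE9";

// Dump the raw byte values; note that unpack() returns a 1-indexed array.
$bytes = unpack("C*", $latin1);
print_r($bytes);

// Convert only when the input is not already valid UTF-8.
// $assumedSource is a guess about the real encoding, not something PHP can know.
function toUtf8(string $s, string $assumedSource = "ISO-8859-1"): string
{
    if (mb_check_encoding($s, "UTF-8")) {
        return $s; // already valid UTF-8, leave accented characters alone
    }
    return mb_convert_encoding($s, "UTF-8", $assumedSource);
}

echo bin2hex(toUtf8($latin1)), "\n";
```

Because valid UTF-8 input is returned unchanged, previously working imports with accented characters should not be double-encoded.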

PHP strpos says different croatian chars are the same: š č

I have the following code:
$text = 'Tomáš';
echo strpos($text, "č");
# result is 4
I believe they are different chars so why is PHP telling me they are the same?
What is going on and how can I correct this?

The encoding you chose to save your source code file in cannot encode the characters you're trying to save. Whatever characters PHP is seeing, it's not comparing the strings you think it is. Save your source code in an encoding that can encode all characters, preferably UTF-8.

You should try with mb_strpos function.
Performs a multi-byte safe strpos() operation based on number of characters. The first character's position is 0, the second character position is 1, and so on.

With a regular setup, it returns false for me.
However, if you have trouble with such special characters, using mb_strpos instead of strpos should help.
http://php.net/manual/en/function.mb-strpos.php
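To make the difference concrete, here is a small sketch. It writes the characters as \u{...} escapes (PHP 7.0+) so the result does not depend on the encoding the source file is saved in, and it assumes the strings are UTF-8:

```php
<?php
$text = "Tom\u{E1}\u{161}";   // "Tomáš" as UTF-8
$s    = "\u{161}";            // "š"
$c    = "\u{10D}";            // "č"

// In UTF-8 the two letters are different byte sequences, so č is not found.
var_dump(strpos($text, $c));

// strpos() counts bytes: á takes two bytes, so š starts at byte offset 5.
var_dump(strpos($text, $s));

// mb_strpos() counts characters: š is the fifth character, index 4.
var_dump(mb_strpos($text, $s, 0, "UTF-8"));
```

If strpos() reports a match for two different letters, that points to both having been turned into the same placeholder byte before PHP ever compared them.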

Understanding what \u0000 is in PHP / JSON and getting rid of it

I haven't a clue what is going on, but I have a string inside an array. It must be a string, as I have run this on it first:
$array[0] = (string)$array[0];
If I output $array[0] to the browser in plain text it shows this:
hellothere
But if I JSON encode $array I get this:
hello\u0000there
Also, I need to separate the 'there' part (the bit after the \u0000), but this doesn't work:
explode('\u0000', $array[0]);
I don't even know what \u0000 is or how to control it in PHP.
I did see this link: Trying to find and get rid of this \u0000 from my json
...which suggests str_replacing the JSON that is generated. I can't do that (and need to separate it as mentioned above first) so I then checked Google for 'php check for backslash \0 byte' but I still can't work out what to do.

\uXXXX is the JSON Unicode escape notation (each X is a hexadecimal digit).
In this case it means the ASCII character 0, also known as the NUL byte. To split on it, you can either do:
explode('\u0000', json_encode($array[0]));
Or better yet:
explode("\0", $array[0]); // PHP doesn't use the same notation as JSON
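A short sketch of what is going on, using a made-up value in place of $array[0]:

```php
<?php
// Hypothetical value: two words separated by a NUL byte.
$value = "hello\0there";

// json_encode makes the invisible NUL byte visible as \u0000.
$json = json_encode($value);
echo $json, "\n";

// Split on the real NUL byte; the double quotes matter here.
$parts = explode("\0", $value);
print_r($parts);

// Or remove it from the input before encoding.
$stripped = str_replace("\0", "", $value);
echo $stripped, "\n";
```

Working on the PHP string before it is encoded avoids having to patch the generated JSON afterwards.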

The string you have is "hello\0there", or "hello\x00there" if you prefer. If you echo it, the NUL character \0 won't be displayed, which is why you see hellothere. json_encode does detect it and escapes it, as it does any other special character, which is why it is replaced by a visible \u0000.
As I see it, JSON is encoding the string perfectly: the \u0000 is there to reproduce the input string faithfully in JSON. You don't have to touch its output. If you don't want that \u0000 there, fix the input instead.

You can simply call trim($str) without giving it a character list; by default it also strips NUL bytes, though only from the start and end of the string.

\uXXXX is the unicode symbol with code XXXX (hexadecimal).
For example: http://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx
If you really get 0000 - then it's just the char with code 0

I came across this issue today and I sorted it out by replacing \u0000 in my array with "" before sending it back to the client.
echo str_replace('\\u0000', "", json_encode($send));

In my case I found the symbol inside a serialized Laravel job's payload JSON, something like s:8:"\0*\0order"; (or s:8:"\u0000*\u0000order";), which meant that the serialized object's property order had protected visibility at the time of serialization.

Just in case anyone needs to apply it to the whole array:
$data = (array)json_decode(str_replace('\u0000*\u0000', '', json_encode($data)));

Try explode("\0", $array[0]); or, on PHP 7.0+, explode("\u{0000}", $array[0]);, making sure you use double quotes. With single quotes, or with "\u0000" written without the braces, PHP keeps the literal six-character value.
As others have mentioned, \u0000 is the Unicode NUL character.

Handling Multibyte characters in php

I am working on a PHP-based MIME parser. If the body contains a string like Iñtërnâtiônàlizætiøn, we see that it is getting converted into IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n. Can somebody suggest how (with which functions) to handle such strings?
We are doing the following:
Using the Zend library, connect to the IMAP server:
$mail = new Zend_Mail_Storage_Imap($params);
Read the message using
$message = $mail->getMessage($i);
in the loop.
When we print $message, we see the string Iñtërnâtiônàlizætiøn printed as IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n.
Is there some way to retain the original string? This is just one example and we may run into other multi-byte characters, so we want to know how to handle this generically.

There's no specific function for that, you simply need to treat the string in the encoding it's in. A string is just a blob of bytes, it gets turned into characters by whatever is interpreting those bytes as text. And that something needs to use the correct encoding for that, otherwise those bytes are not interpreted as the characters they were supposed to be. See Handling Unicode Front To Back In A Web App for a rundown of the common pitfalls.
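As an illustration of that point, the garbling in the question can be reproduced by taking UTF-8 bytes and interpreting them as Latin-1. This is only a sketch with a shortened sample string; whether Latin-1 is really the encoding being wrongly assumed in your pipeline is something you would need to check:

```php
<?php
// "Iñt" as UTF-8; ñ is the two bytes 0xC3 0xB1.
$utf8 = "I\u{F1}t";

// Wrongly treat those bytes as ISO-8859-1 and convert to UTF-8:
// each byte of ñ becomes its own character, giving "IÃ±t".
$garbled = mb_convert_encoding($utf8, "UTF-8", "ISO-8859-1");
echo $garbled, "\n";

// The bytes were never lost, so reversing the wrong step restores the text.
$restored = mb_convert_encoding($garbled, "ISO-8859-1", "UTF-8");
echo $restored, "\n";
```

The real fix is usually to declare the right charset wherever the bytes get interpreted (for example the HTTP Content-Type header or the database connection), not to convert back and forth.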

As mentioned in the comment, you can use PHP's mb_* functions to work with multibyte characters. Here is an example that detects the encoding of a string:
$s = "Iñtërnâtiônàlizætiøn";
echo mb_detect_encoding($s); // UTF-8
Then you can work with this: use utf8_decode($s) (deprecated as of PHP 8.2) or one of the mb_* functions such as mb_convert_encoding to convert the string to the encoding you want.

How to parse unicode format (e.g. \u201c, \u2014) using PHP

I am pulling data from the Facebook graph which has characters encoded like so: \u201c and \u2014
Is there a function to convert those characters into HTML? i.e. \u2014 -> —
If you have some further reading on these character codes, or suggested reading about Unicode in general, I would appreciate it. This is so confusing to me. I don't know what to call these codes... I guess Unicode, but Unicode seems to mean a whole lot of things.

That's not entirely true, bobince.
How do you handle JSON containing Spanish accents?
There are two problems.
I call FB.api(url, function(response)
... var s=JSON.stringify(response);
and pass it to a PHP script via $.post.
First, I get a truncated string. I need escape(JSON.stringify(response)).
Then I get a full JSON-encoded string with Spanish accents.
As a test, I placed it in a text file, loaded it with file_get_contents, applied PHP's json_decode, and got nothing.
You first need utf8_encode.
Then you get the object you were waiting for.
After a full day of testing and googling without finding how to decode Unicode properly, I found your post.
So many thanks to you.

Someone asked me to solve the problem of Arabic text in the Facebook JSON archive. Maybe this code helps someone who is trying to read Arabic text from Facebook (or Instagram) JSON:
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';
function decode_encoded_utf8($string) {
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function ($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));

Facebook Graph API returns JSON objects. Use json_decode() to read them into PHP and you do not have to worry about handling string literal escapes like \uNNNN. Don't try to decode JSON/JavaScript string literals by yourself, or extract chosen properties using regex.
Having read the string value, you'll have a UTF-8-encoded string. If your target HTML is also UTF-8-encoded, you don't need to replace — (U+2014) with any entity reference. Just use htmlspecialchars() on the string when outputting it, so that any < or & characters in the string are properly encoded.
If you do for some reason need to produce ASCII-safe HTML, use htmlentities() with the charset arg set to 'utf-8'.
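A minimal sketch of that flow, using a made-up JSON payload in place of a real Graph API response:

```php
<?php
// Hypothetical payload; single quotes keep the \uNNNN escapes literal for JSON.
$json = '{"message":"Caf\u00e9 \u2014 \u201cquoted\u201d <b>"}';

// json_decode turns the escapes into real UTF-8 characters.
$data = json_decode($json, true);
$message = $data['message'];

// For UTF-8 HTML output, only the markup characters need escaping.
$safe = htmlspecialchars($message, ENT_QUOTES, 'UTF-8');
echo $safe, "\n";

// For ASCII-safe HTML, htmlentities converts the non-ASCII characters to entities.
echo htmlentities($message, ENT_QUOTES, 'UTF-8'), "\n";
```

The dash and the curly quotes pass through htmlspecialchars untouched, so the page must be served as UTF-8 for them to display correctly.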
