PHP MySQL search character coding issues - php

I'm using PDO to connect to a MySQL database. In my connection string I have already added charset=utf8mb4, and all of my databases and tables use utf8mb4_unicode_ci, but I'm facing a problem.
To search the content table for entries by title, I'm using the query below:
SELECT * FROM content WHERE title LIKE '%سيگنالها%'
The keyword is a Persian word. The query above returns 1 result, which is correct and as expected.
But if I make a form in my PHP app and enter the SAME word, either on a macOS/Windows PC or on an Android phone, I get 0 results.
I tracked this issue down and it seems like even though the words entered by user look exactly the same as the one already in the database, they are in fact NOT the same.
According to this online tool, the decimal character code
for سيگنالها it's: 1587, 1610, 1711, 1606, 1575, 1604, 1607, 1575
While
for سیگنالها it's: 1587, 1740, 1711, 1606, 1575, 1604, 1607, 1575
Did you spot the difference? It's the second code point: 1610 versus 1740. If you copy both values and paste them into a character inspection tool, you will see the difference for yourself.
What can I do to solve this annoying problem? I'm using PHP 7 and MariaDB 10.1.

The "ي" in your first word, "سيگنالها", is a different character from the "ی" in your second word, "سیگنالها".
First ي: ARABIC LETTER YEH (U+064A)
Second ی: ARABIC LETTER FARSI YEH (U+06CC)
They have different Unicode code points, so they do not match.
Please see https://www.key-shortcut.com/en/writing-systems/%EF%BA%95%EF%BA%8F%D8%A2-arabic-alphabet/ for more information.
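
To check this without an online tool, a small PHP sketch can list the code points of each string. The function name codePoints is made up for illustration, and the sketch assumes the mbstring extension is loaded; the two words are written with \u{...} escapes (PHP 7.0+) so the difference cannot be lost to copy and paste.

```php
<?php
// Return the decimal Unicode code points of a UTF-8 string.
// Assumption: the mbstring extension is available.
function codePoints(string $utf8): array
{
    // UCS-4BE stores every code point as a 4-byte big-endian integer,
    // so unpack('N*') yields one integer per character.
    $ucs4 = mb_convert_encoding($utf8, 'UCS-4BE', 'UTF-8');
    return array_values(unpack('N*', $ucs4));
}

// The word as stored in the database (second letter is U+064A, ARABIC LETTER YEH).
$stored = "\u{0633}\u{064A}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}";
// The word as typed into the form (second letter is U+06CC, ARABIC LETTER FARSI YEH).
$typed  = "\u{0633}\u{06CC}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}";

echo implode(', ', codePoints($stored)), PHP_EOL; // 1587, 1610, 1711, 1606, 1575, 1604, 1607, 1575
echo implode(', ', codePoints($typed)), PHP_EOL;  // 1587, 1740, 1711, 1606, 1575, 1604, 1607, 1575
```

Running the same function over $_POST input and over a value fetched from the table should show which side carries which YEH.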

They are not the same character, even though they look the same when stringed together and might even have the same meaning.
The first string contains 1610, which is ARABIC LETTER YEH[2], while the other contains 1740, which is ARABIC LETTER FARSI YEH[1].
[1] https://en.wiktionary.org/wiki/%DB%8C
[2] https://en.wiktionary.org/wiki/%D9%8A
I also created a simple form for PHP and tested both strings to see if the value sent through $_POST is kept. Result: the value isn't converted.
So what's probably going on is that you're using an Arabic keyboard to produce Farsi text. The recommended solution is some kind of normalization of the input.
See these discussions:
1) https://groups.google.com/forum/embed/?place=forum/persian-computing#!topic/persian-computing/xS-G0qIGS8A
2) https://github.com/Samsung/KnowledgeSharingPlatform/blob/master/sameas/lib/lucene-analyzers-common-5.0.0/org/apache/lucene/analysis/fa/PersianNormalizer.java
3) can't search in farsi text with arabic keyboard on iphone
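
A minimal sketch of such a normalization in PHP, assuming you map the Arabic forms to the Persian ones. The function name normalizeYehKaf and the choice of direction are assumptions for illustration; mapping the other way works just as well, as long as the same mapping is applied both to the search input and to the text already stored in the table. KAF/KEHEH is included because it is the other pair commonly confused between Arabic and Persian keyboards.

```php
<?php
// Map Arabic letter variants to their Persian counterparts.
// Hypothetical helper; the direction of the mapping is a design choice.
function normalizeYehKaf(string $text): string
{
    return strtr($text, [
        "\u{064A}" => "\u{06CC}", // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
        "\u{0643}" => "\u{06A9}", // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    ]);
}

$stored = "\u{0633}\u{064A}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}"; // with U+064A
$typed  = "\u{0633}\u{06CC}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}"; // with U+06CC

var_dump($stored === $typed);                                   // bool(false)
var_dump(normalizeYehKaf($stored) === normalizeYehKaf($typed)); // bool(true)
```

Normalizing only the form input is not enough if the stored titles still contain U+064A; the existing rows would need the same treatment, for example through a one-off update or a separate normalized search column.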

Related

Compare and trim binary/unicoded string to normal string?

I am using below mysql query to check which records vary from the trimmed value
SELECT id, BINARY(username) as binary_username, TRIM(username) as trim_username FROM table.
Above query returns binary value and trimmed value as shown below.
Result of mysql query:
Highlighted values in above image show that binary value vary from trimmed value.
I tried the two things below:
1) comparing the lengths of the binary and trimmed columns with LENGTH(binary_username) != LENGTH(trim_username), but the lengths are the same;
2) comparing them directly with binary_username != trim_username.
Both return no records.
How can I fetch these highlighted entries using MySQL?
Edit 1: I have added HEX value in the query result
SELECT id, BINARY(username) as binary_username, TRIM(username) as trim_username, HEX(username) as hex_username FROM table
Thanks in advance...
To avoid storing, trimming, etc, the trailing zeros, use VARBINARY instead of BINARY. Why, pray tell, are you using BINARY for text strings??
Please do SELECT HEX(username) FROM ... so we can further diagnose the problem. That screenshot is suspect -- we don't know what the client did to "fix" the output.
Well, none of those are encoded in UTF-8, nor anything else that I recognize. The 'bad' characters (02, 04, 0c 17), are all "control codes" in virtually all encodings. ("Unicode" is not an encoding method, so it is not relevant.)
Would you like a REGEXP that tests for control codes?
In PHP, json_encode has an option, JSON_UNESCAPED_UNICODE. See https://www.php.net/manual/en/function.json-encode.php
Without that option, it generates \u1234 type text.
When storing binary data into MySQL, use the binding or escaping mechanism in PDO or mysqli.
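
As a rough PHP-side counterpart to the control-code test mentioned above, the following sketch flags values that contain ASCII control characters (bytes 00 to 1F, plus 7F). The function name hasControlChars and the sample usernames are hypothetical, not taken from the asker's table.

```php
<?php
// True if the value contains any ASCII control character.
// Hypothetical helper for illustration.
function hasControlChars(string $value): bool
{
    return preg_match('/[\x00-\x1F\x7F]/', $value) === 1;
}

// Hypothetical rows keyed by id; 02, 0C and 17 are control codes
// like the ones seen in the HEX output.
$usernames = [
    1 => "alice",
    2 => "bob\x02",
    3 => "carol\x0C\x17",
];

$suspect = array_filter($usernames, 'hasControlChars');
print_r(array_keys($suspect)); // ids 2 and 3
```

TRIM() only removes spaces by default, which would explain why the trimmed and untrimmed values compare as equal even though they contain such bytes.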

How to compare telegram/mysql?

I send russian alphabet with inline-keyboard, in callback_data I pass the letter that user selected. It looks like this:
But Telegram returns the letter to me in this form: \xd0\xb3.
I also store the word to compare against in a MySQL database. It comes back in this form: \u0438\\u043c\\u043f\\u0435\\u0440\\u0430\\u0442\\u0438\\u0432. The encoding in the database is utf8_general_ci.
As a result, I need to check whether the selected letter is in the word from the database. How can I do that?
MySQL never generates \u0438, a Unicode representation. It will generate the 2-byte character whose hex is D0B3 (which might show as \xd0\xb3), specifically a Cyrillic character. And you should provide that format when INSERTing into a MySQL table.
PHP's json_encode will generate the Unicode form instead of the other, depending on the absence or presence of JSON_UNESCAPED_UNICODE in the second argument.
To check the database, do something like:
SELECT col, HEX(col) ...
If "correct" you should get something like
г D0B3
(That's a Cyrillic GHE, not a latin r.)
Who knows what telegram is doing to the data. There are over a hundred packages that use MySQL under the covers; I don't know anything about this one.
Terminology: The encoding is utf8 (or could be utf8mb4). The collation, according to what you say, is utf8_general_ci. Encoding is relevant to the question; collation has to do with the ordering of strings in comparisons and sorting.
Another example: Cyrillic small letter I и = utf8 hex D0B8 = Unicode code point U+0438
HTML is quite happy with Unicode code points; it will show и when given a numeric character reference such as &#x0438;. Perhaps Telegram is converting to code points as it builds the web page?
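
To make the two representations comparable in PHP, one option is to decode the \uXXXX form back to UTF-8 and then search it. This is only a sketch under an assumption: that the word in the database was stored JSON-escaped (for example by json_encode without JSON_UNESCAPED_UNICODE). The variable names and the helper containsLetter are made up for illustration, and mbstring is assumed to be loaded.

```php
<?php
// The letter as received in callback_data: the raw UTF-8 bytes D0 B3 (Cyrillic GHE).
$letter = "\xd0\xb3";

// Assumption: the word came back from the database in JSON-escaped form.
$escaped = '\u0438\u043c\u043f\u0435\u0440\u0430\u0442\u0438\u0432';

// Wrapping it in double quotes makes it a JSON string that json_decode can turn into UTF-8.
$word = json_decode('"' . $escaped . '"');

// Hypothetical helper: is the UTF-8 letter present in the UTF-8 word?
function containsLetter(string $word, string $letter): bool
{
    return mb_strpos($word, $letter, 0, 'UTF-8') !== false;
}

var_dump(containsLetter($word, $letter));      // bool(false): GHE is not in the word
var_dump(containsLetter($word, "\u{0438}"));   // bool(true): Cyrillic small letter I is

// The same letter in both json_encode forms:
echo json_encode("\u{0438}"), PHP_EOL;                         // "\u0438"
echo json_encode("\u{0438}", JSON_UNESCAPED_UNICODE), PHP_EOL; // the letter itself, in quotes
```

Storing the word as plain UTF-8 in the first place (encoding with JSON_UNESCAPED_UNICODE, or not JSON-encoding it at all) would make the decode step unnecessary.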

php array phone number fields stored as int

I had a MySQL table with some user data, which I needed to correct and migrate to a new MySQL table. I exported the table using "Export to PHP Array plugin for PHPMyAdmin" from "Geoffray Warnants" and it returned a (PHP) array.
One of the fields contains a telephone number. Now some of the entries have been exported as string. However, some of the entries have the telephone number represented as an integer. When I try to read the number, it returns something like:
4.36991052022E+12
when it should be:
4369910520219
I suppose the integer value is too big, so that must be the problem. (that's the reason for the E+12)
I have close to 300 entries and there is no way I can start writing quotes in front and end of the number manually, since I also have a fax field.
Most recently, I tried (with the help of the Sublime Text 2 demo) to cast the number by writing (string) in front of it, but that doesn't work.
I'm kind of helpless now and ask for your help. What can I do?
Please take a look at this question, which should answer yours:
Convert a big integer to a full string in PHP
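
For values of this size no extra library is strictly needed. A minimal sketch, assuming the phone numbers stay below 2^53 so that a float still holds them exactly: format floats with sprintf('%.0f', ...) instead of casting. The helper name phoneToString is hypothetical.

```php
<?php
// Turn a phone value from the exported array into a plain digit string.
// Hypothetical helper; assumes numeric values are below 2^53.
function phoneToString($value): string
{
    if (is_float($value)) {
        // A (string) cast on a float of this size gives 4.36991052022E+12;
        // '%.0f' prints all the digits instead.
        return sprintf('%.0f', $value);
    }
    // Integers (on 64-bit PHP) and strings convert without loss.
    return trim((string) $value);
}

echo phoneToString(4369910520219.0), PHP_EOL;   // 4369910520219
echo phoneToString(' 43 664 1000383'), PHP_EOL; // 43 664 1000383
```

On a 32-bit PHP build the literal 4369910520219 is parsed as a float, which is where the E+12 notation comes from; on a 64-bit build it is an ordinary integer.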
Since I didn't have the time to go through the "complicated" process of installing the GMP library, I decided to do it old-skool and just put double quotes ("") around every phone number value, no matter whether it was a string or a "big integer", and then remove the (single) quotes from the final string.
Thanks to Sublime Text 2!
So I had:
array(..., 'phone'=>' 43 664 1000383', ...);
and
array(..., 'phone'=>4369910520219, ...);
Search for and find all 'phone'=> and add " after each one.
Then search for (in my case) ,'fax'=> and add " before each one.
Then, for every string:
$user["phone"] = preg_replace("/'/", "", $user["phone"]);
Thanks though for the library. I might actually use it someday. ;)
Greetings,
Joseph

Sphinx field-start and field-end extended2 search not working

I know 'not working' is never a good start when asking for help but I have been at this on and off for months and I've got virtually nowhere.
So far I have at least determined I CAN get the field-start/end operators working, but ONLY when I stick in a space character, like:
#gametitle "^diablo$ "
Strangely that returns JUST the game Diablo, however:
#gametitle "^diablo$"
Returns ALL games with Diablo in the name. Now that's great, I apparently can rely on the fact this extra space character will apply proper matching of the game titles (it seems to work with "^age of empires$ " too).
However when it comes to my OTHER field, the one I actually want to do this full field matching on (#console), I get no such luck. I simply get NO results (if I try and do "^PlayStation$ "), or else I get all the results with playstation in the console field (i.e. the PS1/2/3 and portable) when I do "^PlayStation$".
Now the only difference between the #gametitle and #console fields is that the console field contains some NULL entries. I tried to get around this by selecting the string 'NULL' with an IF statement in MySQL (that's my source), but no joy. In addition, both the console and game title fields are VARCHAR(255) in MySQL.
I'm hoping someone will have some a-ha moment with what I've mentioned with regards to the extra space making this thing work, but I'm not holding my breath! Anyway enough of my pessimism, looking forward to your thoughts.
I am using the PHP API provided by sphinx which I'm extending to make minor changes. I am querying a searchd instance, which is Sphinx v1.10-beta. Here are the query logs:
[...] 0.024 sec [ext2/1/attr- 7 (0,50)] [application] #gametitle "^age of empires$"
[...] 0.024 sec [ext2/1/attr- 1 (0,50)] [application] #gametitle "^age of empires$ "
There you can really see how the addition of the space knocks the record count down from 7 to 1, when really you should expect them both to return 1...
I'm almost certain this is a bug in Sphinx.
I've added it to the Issue Tracker
http://sphinxsearch.com/bugs/view.php?id=909
but so far it hasn't been acknowledged

PHP / MySQL varchar field entered as a number

I have a textbox named PO_Number in a form. The form submits the textbox value by POST to another page.
In the second page I read $_POST['PO_Number'] and insert it into MySQL.
The MySQL field is varchar(15). As long as the PO_Number string starts with a letter or a number, everything is OK.
The problem: sometimes the PO (Purchase Order) number starts with 00 or 000, and it is stored with a comma before the 00.
For example:
GH93737 - works
9087893 - works
0011132 - entered in database as ,0011132 (see the comma?)
The insert looks normal:
mysql_query("INSERT INTO table_name (PO_Number, ....) VALUES ('".$_POST['PO_Number']."',......)");
Many thanks for your suggestions and your help.
I'm wondering if this has something to do with your browser/server character encoding and how it's interpreting those specific numbers because all of those leading zeros and ones might be getting interpreted as a binary number?
Here's some brief info on that point:
A character encoding tells the computer how to interpret raw zeroes and ones into real characters. It usually does this by pairing numbers with characters.
http://htmlpurifier.org/docs/enduser-utf8.html
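
One way to narrow this down is to look at the exact bytes that arrive in the POST value and to insert through a prepared statement, so the value is bound as a string and never spliced into the SQL text. The sketch below is an assumption-laden illustration, not a diagnosis: it uses an in-memory SQLite database through PDO only so that it runs without a MySQL server, and the table name orders and the helper savePoNumber are made up. Note also that the mysql_query function used in the question was removed in PHP 7.

```php
<?php
// Assumption: pdo_sqlite is available; with MySQL only the DSN and credentials change.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE orders (PO_Number VARCHAR(15))');

// Hypothetical helper: insert the PO number as a bound string parameter.
function savePoNumber(PDO $pdo, string $poNumber)
{
    $stmt = $pdo->prepare('INSERT INTO orders (PO_Number) VALUES (:po)');
    $stmt->execute([':po' => $poNumber]);
}

$input = '0011132'; // stands in for $_POST['PO_Number']

// Dump the raw bytes: a comma that is already present in the input shows up as 2c.
echo bin2hex($input), PHP_EOL;

savePoNumber($pdo, $input);
$saved = $pdo->query('SELECT PO_Number FROM orders')->fetchColumn();
var_dump($saved); // string(7) "0011132"
```

If the hex dump of the real $_POST value already starts with 2c, the comma is being added before the value reaches PHP's database code (for example by JavaScript or a duplicated form field), not by MySQL.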
