Sphinx search doesn't understand special characters (accents)

I have a MySQL db in utf8_general_ci.
And my sphinx.conf is like this:
source jobs
{
type = mysql
sql_sock = /var/run/mysqld/mysqld.sock
sql_query_pre = SET NAMES utf8
...
}
When I query "système" I would like Sphinx to search for both "système" and "systeme" in the DB.
And when I query "systeme" I would like Sphinx to search for both "système" and "systeme" too.
What it does now is strip everything up to and including the accented character. So "système" becomes "me" and "dév" becomes "v"...
PS: I'm using sphinxapi.php, which I know shouldn't be preferred over SphinxQL, but it should still work with the API. I use the EXTENDED match mode.

You need to set up charset_table to be able to do this:
http://sphinxsearch.com/docs/current.html#charsets
Alas, there is no 'magic' config option that just works with text in every language; you need to set up charset_table for the language(s) you deal with.
Although this comes pretty close:
http://sphinxsearch.com/forum/view.html?id=9312
(i.e. it steals the hard work MySQL has done with collations and mimics it in charset_table)
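To illustrate the idea: a charset_table maps each character to the form it is indexed under, and characters missing from the table act as separators, which would explain "système" turning into "me". The sketch below is plain Python, not Sphinx configuration; `fold_accents` is a hypothetical helper that only shows the kind of folding the table has to express.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Lowercase the text and drop combining marks, so 'è' and 'e' compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Both spellings fold to the same indexed form.
print(fold_accents("syst\u00e8me") == fold_accents("systeme"))  # True
print(fold_accents("D\u00e9v"))  # dev
```

With the same folding applied at index time and at query time, either spelling finds both.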

Related

PHP Propel 1.6 and MySQL - save() using utf8 not working

I have installed Propel 1.6. I can create tables in MySQL with propel commands.
Below is my propel settings in file: runtime-config.xml
<propel>
<datasources default="myProject">
<datasource id="myProject">
<adapter>mysql</adapter>
<connection>
<dsn>mysql:host=localhost;dbname=myDBname</dsn>
<user>myUser</user>
<password>mypass</password>
<charset>utf8</charset>
<collate>utf8_unicode_ci</collate>
</connection>
</datasource>
</datasources>
</propel>
The MySQL database and the User table both have collation utf8_unicode_ci:
[Screenshot: MySQL collation settings]
I create a new Patient object to test that everything is OK, using the following code:
$pat = new Patient();
$pat->setEmail("tg@gmail.com");
$pat->setAddress("Η διεύθυνσή μου");
$pat->setAmka("555555555");
$pat->setBirthdate("1966-01-01");
$pat->setFirstname("Τοόνομάμου");
$pat->setLastname("τοεπώνυμόμου");
$pat->setPhone("2109999999");
$pat->setSex(1);
$pat->save();
I checked in NetBeans debug mode, and the $pat object contains the values in the correct, readable form.
After save(), the Greek values show up in MySQL like this:
[Screenshot: garbled values saved in MySQL]
I would like your help solving this issue.
Thank you in advance.

Τοόνομάμου, when "Mojibaked", becomes Î¤Î¿ÏŒÎ½Î¿Î¼Î¬Î¼Î¿Ï…. Notice the pattern often has Î and a second character, like your screenshot. Apparently, latin1 was involved at some point.
"Trouble with UTF-8 characters; what I see is not what I stored" discusses Mojibake and its causes.
It may be that you have "double encoding", which that link also discusses.
If you choose to fix the data rather than start over, see http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
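The Mojibake pattern above can be reproduced outside the database. This is a small Python sketch, assuming the misread charset is cp1252 (which is what MySQL calls latin1); it shows the mechanism only and does not touch any table.

```python
# "Τοόνομάμου", written with escapes so the exact code points are unambiguous.
original = "\u03a4\u03bf\u03cc\u03bd\u03bf\u03bc\u03ac\u03bc\u03bf\u03c5"

# Correct UTF-8 bytes, wrongly interpreted one byte at a time as cp1252.
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)  # every Greek letter becomes two characters, the first being Î or Ï

# Reversing the two steps recovers the text, as long as nothing was truncated.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```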

Finally, I found a solution.
In MySQL, I checked my settings using the following command:
show variables like 'char%';
I had to change the character_set settings from utf8 to utf8mb4.
Everything works perfectly now!
For more info, see https://mathiasbynens.be/notes/mysql-utf8mb4

You must specify the charset in the propel connection DSN - in your runtime-config.xml file like this:
<dsn>mysql:host=localhost;dbname=myDBname;charset=UTF8</dsn>
https://github.com/propelorm/sfPropelORMPlugin/issues/74#issuecomment-2011350

utf8 encoding breaks when upgrading from php5.6 to php7.0

I have a simple (custom) CMS that accepts markdown and displays it in a web page. It works fine in php5.6 (using the ondrej/php5 ppa on Ubuntu 15.10). The MySQL collation is set to utf8 everywhere.
When I upgrade the server to php7.0 (ondrej/php), it displays garbage characters. I tried migrating the relevant MySQL tables and fields to utf8mb4 / utf8mb4_unicode_ci with no luck.
When I downgrade to php5.6, it all works fine.
I have a hunch it is some strange PHP setting I don't know about. php.ini has default_collation=UTF-8. I couldn't find anything else that worked. phpMyAdmin shows garbage no matter what version of PHP or server settings I use, so it is not much help.
What could I try next?
Source text (copied from php5.6 rendered page)
아동 보호 정책에 대한 규정
This Code is part of the
Rendered output (from php7 and phpMyAdmin)
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì •
This Code is part of the

Use this to change a table to utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
However, if the table was already messed up, then this won't fix it. Do the following to verify:
SELECT col, HEX(col) FROM tbl WHERE ...
For example, 아동 보호 정책에 대한 규정 will show a hex of EC9584 EB8F99 EBB3B4 ED98B8 ECA095 ECB185 EC9790 EB8C80 ED959C EAB79C ECA095. (Please ignore the spaces.)
For Korean text, you should see (mostly) groups of 3 hex bytes of the form Ewxxyy, where w is A or B or C or D, as shown in the example above. Hex 20 (only 1 byte) represents a space.
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì • is the Mojibake for it. This implies that somewhere latin1 was erroneously involved, probably when you INSERTed the text. In that case, you will see something like C3AC E280A2 E2809E C3AB C28F E284A2 C3AB C2B3 C2B4 C3AD CB9C C2B8 ... -- mostly 2-byte Cwxx hex.
If you see that, an UPDATE of something like this will repair the data:
CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4) (Edit: removed call to UNHEX.)
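The two hex patterns and the repair can be checked offline. The following Python sketch imitates MySQL's latin1, which is cp1252 with the five bytes cp1252 leaves undefined mapped to the same-numbered control characters. The helper names `mysql_latin1_decode` and `mysql_latin1_encode` are made up for this illustration.

```python
def mysql_latin1_decode(raw: bytes) -> str:
    """Decode like MySQL's latin1: cp1252, with undefined bytes kept as U+0081 etc."""
    chars = []
    for b in raw:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(b))
    return "".join(chars)

def mysql_latin1_encode(text: str) -> bytes:
    """Inverse of mysql_latin1_decode (assumes every character fits in latin1)."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out.append(ord(ch))
    return bytes(out)

good = "\uc544\ub3d9"  # the first word of the sample text, 아동
stored_ok = good.encode("utf-8")
print(stored_ok.hex().upper())  # EC9584EB8F99

# Double encoding: UTF-8 bytes misread as latin1, then stored as UTF-8 again.
double = mysql_latin1_decode(stored_ok).encode("utf-8")
print(double.hex().upper())  # C3ACE280A2E2809EC3ABC28FE284A2

# The repair mirrors the CONVERT(...) expression: back to latin1 bytes, then read as UTF-8.
repaired = mysql_latin1_encode(double.decode("utf-8")).decode("utf-8")
print(repaired == good)  # True
```

Running a check like this on a copy of one value before issuing the UPDATE is a cheap way to confirm which of the two cases a table is in.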

Cannot Return Chinese Characters From MySql Using Phalcon

I've been trying to get started with Phalcon, but I've been stuck for a few days trying to get a query to work against my database. If I can't find a resolution to this issue I'm going to have to move on.
The target table uses InnoDB and is UTF-8 encoded. The table has two columns: one holds index values, the other holds individual (unique) East Asian characters. When I attempt to retrieve a record using a UTF-8 encoded Chinese character, Phalcon returns 0 records. In addition, when I retrieve records using the index value, the corresponding character value is returned as a question mark (a regular question mark, not the placeholder browsers use for unrecognized characters).
$characters = Characters::find("indCharacter = 乱");
Returns this error:
"PhalconException: Scanning error before '��' when parsing: SELECT [Characters].* FROM [Characters] WHERE indCharacter = 乱 (64)"
With single quotes around the actual character, 0 results are returned.
I've run the exact same queries using the command line, phpmyadmin, and workbench, all of which were in the same environment. All properly returned records.
I've also double checked that the original query value is utf-8 encoded, and that all data in the table is utf-8 encoded.
Phalcon: 1.2.6
Php: 5.4.11

When using Phalcon's Pdo Mysql adapter to create a db connection you have to explicitly set the character encoding. Here is an example similar to one found in Phalcon's tutorial, except that I've added "charset".
$di->set('db', function() use ($config) {
return new \Phalcon\Db\Adapter\Pdo\Mysql(array(
"host" => "myhost",
"username" => "myusername",
"password" => "mypassword",
"dbname" => "mydbname",
"charset" => "utf8"
));
});
I had previously looked in the documentation for the various adapter classes, but couldn't find anything relating to setting the encoding for the database connection. I (incorrectly) assumed that encoding detection was handled internally.
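A likely explanation for the plain question marks, assuming the connection had fallen back to latin1: that charset has no slot for the character, so a lossy conversion substitutes "?". The Python lines below only imitate that conversion and do not involve Phalcon or MySQL.

```python
char = "\u4e71"  # 乱

# latin1 cannot represent this character, so a lossy conversion yields "?".
as_latin1 = char.encode("latin-1", errors="replace")
print(as_latin1)  # b'?'

# Over a UTF-8 connection the character survives intact as three bytes.
as_utf8 = char.encode("utf-8")
print(as_utf8.hex())  # e4b9b1
print(as_utf8.decode("utf-8") == char)  # True
```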

EDIT: Just saw your note about single quotes. You could try a parameterised query as below.
You need to put quotes around the string you're matching. The syntax accepted by Model::find() with one parameter is similar to what would normally go in a MySQL WHERE clause.
$characters = Characters::find("indCharacter = '乱'");
The docs for finding records of a model show this.
Parameterised query
Note that you can skip adding quotes and escaping if you use a parameterised query:
$characters = Characters::find([
"indCharacter = :char:",
'bind' => [
'char' => '乱'
]
]);
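The same binding idea exists in most database layers. As a neutral illustration, here is a sketch using Python's built-in sqlite3 module instead of Phalcon and MySQL; the in-memory table is invented for the example and only mirrors the column name from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (id INTEGER PRIMARY KEY, indCharacter TEXT)")
conn.execute("INSERT INTO characters (indCharacter) VALUES (?)", ("\u4e71",))  # 乱

# The value travels separately from the SQL text, so no quoting or escaping is needed.
rows = conn.execute(
    "SELECT id, indCharacter FROM characters WHERE indCharacter = :char",
    {"char": "\u4e71"},
).fetchall()
print(rows)  # one row: id 1 and the character
```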
Plain MySQL query
Alternatively, you can run plain MySQL queries using a database adapter. This isn't really suitable if you want to pull model objects from the database, but it does allow you to work around issues with PHQL or skip the ORM when you want to. For example REGEXP causes the PHQL parser to error, but it's perfectly valid in MySQL.
/** @var Phalcon\Db\Adapter\Pdo\Mysql $connection */
$connection = $this->di->get('db');
$result = $connection->query("SELECT * FROM characters WHERE indCharacter = '乱'");

A small note: I haven't done anything with Phalcon, but I did have a similar error with MySQL/PHP, which I solved. Characters like ã and í didn't work.
What you can try is to use Unicode NCRs (numeric character references). For 乱, it's &#20081;. I found this website that converts Chinese characters to NCRs.
Again, this might work. My problem was that text typed into a form and added to a database came out wrong, which I got around by replacing the characters that didn't work with their NCRs. It might be a hassle to convert everything, but I think it's worth a try.
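For reference, converting to and from NCRs does not need a website. A short Python sketch using only the standard library:

```python
import html

char = "\u4e71"  # 乱

# Replace every non-ASCII character with a decimal numeric character reference.
ncr = char.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ncr)  # &#20081;

# A browser (or html.unescape) turns the reference back into the character.
print(html.unescape(ncr) == char)  # True
```

Note that NCRs stored in the database are no longer searchable as the original characters, so this works around the display problem rather than fixing the connection encoding.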

mysql_query returns nothing when called with special characters (The danish characters: æøå)

I generate a mysql query from a form with a free text search field.
Something like:
SELECT ... FROM ... WHERE 'something' LIKE '%SEARCH%'
All this works fine and returns the valid rows when the search does not contain any special characters, such as the Danish characters ÆØÅ.
When these letters ARE used, the query returns no results, although when I take the generated query string and plug it into phpMyAdmin I get exactly the result I want.
Thanks

Add this line of code to your connection file:
mysql_set_charset("utf8", $db);
It is also better to encode your data as UTF-8 before you pass it into the query.

I'm using Czech rather than Danish, but I think the situation is the same (at least as far as UTF-8 is concerned): you must keep the encoding consistent across the server script, the data tables themselves, and the database connection handler.

I think you have an encoding problem, maybe phpMyAdmin is using a different client encoding than your other client. SET NAMES 'encoding' should just do what you need, I think.

We can also convert the PHP variable before the select operation (supposing the database is ISO-8859-2).
Example:
// word with special characters
$search='kötészeti';
// conversion to ISO
$search=iconv("UTF-8","ISO-8859-2", $search);
// create search condition
$condition="SELECT ... FROM ... WHERE 'something' LIKE '%$search%'";
// apply query
mysql_query($condition);
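Why the conversion matters can be seen from the raw bytes. The Python sketch below encodes the same search word both ways; if the database holds ISO-8859-2 bytes and the query arrives as UTF-8 without conversion, the byte sequences differ and LIKE finds nothing.

```python
search = "k\u00f6t\u00e9szeti"  # kötészeti

utf8_bytes = search.encode("utf-8")
latin2_bytes = search.encode("iso-8859-2")

print(utf8_bytes.hex())    # ö and é take two bytes each
print(latin2_bytes.hex())  # ö is f6 and é is e9, one byte each
print(len(utf8_bytes), len(latin2_bytes))  # 11 9
print(utf8_bytes == latin2_bytes)  # False
```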

Cakephp Finnish language not coming out properly

In Cake, I have an issue with Finnish text not displaying properly. I have set the UTF encoding in config.php, the charset output in default.ctp, and also the config in core.php.
Is there a reason why it's not coming out properly?
To give you an idea the link is below:
http://www.likeslomakkeet.net/petitions/add

What if you re-import your data into the database after changing your database.php and the database collations? Try re-adding a commune with special characters, like "Hämeenkyrö", and see how it looks in the database.
Edit: You could also filter out all communes marked "(lakkautettu)", because they no longer exist.

Did you also set the database connection to UTF-8 in database.php?
For MySQL, that would be:
'encoding' => 'utf8' // no hyphen
