Best practices about parsing multi language feed - php

I'm having a problem parsing data from different feeds, some of them in English, others in Italian and others in Spanish. I'm parsing using a PHP script and saving the parsed data into my MySQL database.
The problem is that when I parse items that contains "non common" characters like: "Strage di Viareggio Più" when I look into my database the phrase is stored in this way: "Strage di Viareggio Più".
My database can use that kind character because when I input that manualy it works fine, in the original feed (rss file) the phrase is also fine, I think is my PHP server who is changing the letter. How can I solve this? Thanks!

Make sure that the database uses UTF-8 (as you say it does) and that the PHP script has its internal encoding set to UTF-8, which you can achieve with iconv_set_encoding. If you're reading data from an HTTP request that should be all you need, as long as the request tags its own encoding correctly.

Looks like input data is in UTF-8, but charset/collation of DB table - ASCII. I would suggest to have UTF-8 everywhere.

What you need to implement, before saving to MySQL is:
http://php.net/manual/en/function.htmlentities.php
Check these different threads for more information
Best practices in PHP and MySQL with international strings
htmlentities() makes Chinese characters unusable
What I find incredible is that this question has received -2 in the past 24 hours without any comments.
From the question posted:
I'm parsing using a PHP script and saving the parsed data into my MySQL database.
and
I think is my PHP server who is changing the letter. How can I solve this? Thanks!
The answers posted so far are related to the encoding and settings of MySQL. The person asking the question has clearly stated that he can insert special characters manually and is having no problems:
My database can use that kind character because when I input that manualy it works fine
My answer was to help him convert the characters into an html entity which will circumvent the problem he is having with the RSS feed and answering the question posted.

Related

How to enter accented letters into mysql database with php? [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 years ago.
Tried searching for this question but I think I don't know the jargon. I am entering my site content into a mysql database using php, but all of my accented letters or apostrophes (spanish) get transformed into some crazy encoding. Ex:
' becomes â€
á becomes á
and etc. First off I don't know what this means or is, but when displaying them on my site, they definitely do not revert back, nor would I expect them too because if I manually enter in UTF-8 letters they totally work on my site.
Is there a way to fix this without re-entering all of my text? I have a feeling I can extract them using php, decode them and then insert them back in but I do not know the functions that do this. The best solution, and if anyone knows how that would be amazing, would be to just do it within sql. By the way the columns say collation utf8_general_ci.
For some further information, I am not doing anything to the text that gets entered into the database (I know that is bad but I suck at this stuff!) Also I am not doing anything when it is being queried. My functions insert pure text and extract pure text to each page. In this way I can write html into my forms and it appears as html on the page and therefore the browser interprets it correctly. I also have a feeling this is really sloppy but like I said... Thanks for all the help!
-- Edit --
So thanks to people who pointed me to the other questions. However, the way I fixed it was not in that answer, it just gave me the write keywords to start a new search. For anyone who has this problem, the way I fixed it was using the function utf8_decode(). I'm not sure this is a great fix, but at least it is working for now and speed was my biggest priority. I am certain the core problem is in how I am entering the data into the database.
You can make sure the character encoding for INSERT queries are correct at runtime using $mysqli->set_charset("utf8")
It's also prudent to make sure that PHP is sending the correct HTTP response headers, and the Browser is doing the right thing:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
ini_set('default_charset', 'utf-8')
To alter your tables, do ALTER TABLE myTable CHARACTER SET utf8 COLLATE utf8_general_ci;

(?) marks in HTML. Encoding issue from content from the database?

Any idea why this is happening?
It looks to be happening mainly with apostrophes and hyphens. Any ideas if I can fix this? I pull the data from my database and print it to the page like:
<div class="block">
<?=$details['agenda'] ?>
</div>
As other commenters may have mentioned, this is a character encoding problem. If you're lucky, you can force your HTML page to render in UTF-8 and that will resolve it.
Unfortunately, if you're not lucky, you'll discover that the characters are stored in the database in the wrong encoding. Or maybe the database converts them. Or maybe the character encoding data has been destroyed along the path! There's no way of knowing in advance where those characters have been damaged.
The best way I know to fix problems like this is to force every step along your path to follow UTF-8 content encoding. For example, you probably go through steps like this:
Content author writes a document in Microsoft Word containing "SmartQuotes"
Content author copies-and-pastes into the edit box of a content management system.
Content management system saves to the database.
Database may or may not store data in Unicode internally - make sure you use nvarchar (or whatever unicode type your database supports).
Reading from the database may need to scan for characters.
However, it's very tricky to fix this! A long time ago, I used to have a habit of writing "detect-and-fix" routines like this:
$smartquotes = array("”", "“");
str_replace($smartquotes, '"', $mytext);
Of course you know what the problem is - I'd keep discovering new characters I had to fix. Microsoft Word likes to do tons of unusual characters - copyright, registration marks, apostrophes, hyphens, and so on. I'd keep adding to this function, over and over, until I went crazy. So nowadays I just go through my entire content delivery path and force everything to obey UTF-8 rules; that seems to resolve it in most cases.
Good luck!

How to get correct character-encoding between mysql and filemaker

I'm unsure if this is a php-, filemaker-, mysql- or an odbc driver issue.
For security reasons the input fields of my current php webform convert special characters into hex codes, (for example: # becomes ' ) This hex code is saved in the database and will also be shown in Filemaker11 as the hex code. This is not what i want.
How can I make sure the special character will be displayed as it should be?
The other way round (from filemaker to db), no conversion will be done on inserting the special characters.
How can I make sure everything will be consistent?
Kind regards,
Jeroen
FileMaker is just showing the data stored in MySQL. If you pull up the DB in a tool like PhpMyAdmin you should see that the varchar contains the encoding as well. Since FMP is looking at it simply as a text field, it shows the encoding that was stored. If you wanted to decode in FMP you could show a calc field of the varchar that has a custom function to decode the text. (but that won't allow for updating the data..) You could also try a trigger on record load to decode the data in the fields so that you can properly view/edit.
Solved it! It appeared that I had to add an extra line to my PHP script.
after setting up the connection, php needs to tell mysql what the encoding needs to be. This can be done with the following line:
$dbh->query("SET NAMES 'utf8'");
Thanks for the effort guys!
This: ' type of encoding is not done automatically by the browser. Something is doing it. Normally you do it only on output not on input.
You can use html_entity_decode() to undo it. But I strongly suggest you figure out why it's happening in the first place.

Help with multi-lingual text, php, and mysql

I have had no end of problems trying to do what I thought would be relatively simple:
I need to have a form which can accept user input text in a mix of English an other languages, some multi-byte (ie Japanese, Korean, etc), and this gets processed by php and is stored (safely, avoiding SQL injection) in a mysql database. It also needs to be accessed from the database, processed, and used on-screen.
I have it set up fine for Latin chars but when I add a mix of Latin andmulti-byte chars it turns garbled.
I have tried to do my homework but just am banging my head against a wall now.
Magic quotes is off, I have tried using utf8_encode/decode, htmlentities, addslashes/stripslashes, and (in mysql) both "utf8_general_ci" and "utf8_unicode_ci" for the field in the table.
Part of the problem is that there are so many places where I could be messing it up that I'm not sure where to begin solving the problem.
Thanks very much for any and all help with this. Ideally, if someone has working php code examples and/or knows the right mysql table format, that would be fantastic.
Here is a laundry list of things to check are in UTF8 mode:
MySQL table encoding. You seem to have already done this.
MySQL connection encoding. Do SHOW STATUS LIKE 'char%' and you will see what MySQL is using. You need character_set_client, character_set_connection and character_set_results set to utf8 which can easily set in your application by doing SET NAMES 'utf8' at the start of all connections. This is the one most people forget to check, IME.
If you use them, your CLI and terminal settings. In bash, this means LANG=(something).UTF-8.
Your source code (this is not usually a problem unless you have UTF8 constant text).
The page encoding. You seem to have this one right, too, but your browsers debug tools can help a lot.
Once you get all this right, all you will need in your app is mysql_real_escape_string().
Oh and it is (sadly) possible to successfully store correctly encoded UTf8 text in a column with the wrong encoding type or from a connection with the wrong encoding type. And it can come back "correctly", too. Until you fix all the bits that aren't UTF8, at which point it breaks.
I don't think you have any practical alternatives to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the data base from the mysql command line, or perhaps through phpmyadmin. Track down and eliminate problems at that level. Then move out one more level by simulating input to your php and examining the output, again dealing with any problems. Finally add browsers into the mix.
First you need to check if you can add multi-language text to your database directly. If its possible you can do it in your application
Are you serializing any data by chance? PHPs serialize function has some issue when serializing non-english characters.
Everything you do should be utf-8 encoded.
One thing you could try is to json_encode() the data when putting it into the database and json_decoding() it when it's retrieved.
The problem was caused by my not having the default char set in the php.ini file, and (possibly) not having set the char set in the mysql table (in PhpMyAdmin, via the Operations tab).
Setting the default char set to "utf-8" fixed it. Thanks for the help!!
Check your database connection settings. It also needs to support UTF-8.

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using domDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the stuff displayed on the page (which had been displaying correctly), not in the separate ICS file (which was writing incorrectly), is incorrect. As an example: it turns "Inđija" to "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
To help with your first problem, before this, I'd want to know what sort of "special" characters are being messed up in the original problem, and what way are they being messed up?

Categories