Our client has sent us a CSV file of data that I need to import into a specific table in our PostgreSQL 8.3.9 database. The database uses UTF-8 character encoding - our CMS supports multiple languages, such as French, which are entered into the database via the CMS. One particular facility lets the client upload images to the server and then enter "alt" tags for them in French. However, because a bulk update is required, we have been sent a CSV to feed into a particular table - the image alt tags, in French.
The CSV has some special characters such as "é" - e.g.
"Bottes Adaptées Amora Cuir Faux-Croco Fauve Photo d'Ensemble"
The images themselves are hosted in two places: a CDN, and a local database backup plus local (web) server file backup. I am using a PHP script to read the CSV file and update the "alt" tags in both places - our web database and the CDN.
However, when I read the CSV (using PHP), the character does not "come out" as expected.
The data comes out as "Bottes Adapt�es Amora Cuir Faux-Croco Fauve Photo d'Ensemble".
I don't think this has anything to do with the database; it has something to do with my PHP file reading the CSV data. Even if I just print the data as it is read, the special character above does not print correctly - it prints as if the character is not recognised. Other characters print fine.
Here is the code I'm using (note: some special custom functions are used here to interact with the database, but they can be ignored). The CSV file is made up of {column 1} for the image name and {column 2} for the ALT tag.
$handle = fopen($conn->getIncludePath() . "cronjobs/GIB_img_alt_tags_fr.csv", "r");
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
    // normally I run a query here to check if the data exists:
    // $result = $conn->query("SELECT imageid, image_fileref FROM table1 WHERE image_fileref = '" . $data[0] . "'");
    if ($conn->Numrows($result)) { // if rows were found
        $row = $conn->fetchArray($result);
        // printing the data from $row here
    }
}
fclose($handle);
You've still omitted key information - when asking for help with an UPDATE don't delete the UPDATE statement from the code - and your description of the problem is very confused, but there's some hint of what's going on.
Mismatched encodings
It's highly likely that your PHP connection has a client_encoding set to something other than UTF-8. If you're sending UTF-8 data down the connection without conversion, the connection's client_encoding must be UTF-8.
To confirm, run SHOW client_encoding as a SQL statement from PHP and print the result. Add SET client_encoding = 'UTF-8' to your code before importing the CSV and see if that helps. Assuming, of course, that the CSV file is really UTF-8 encoded. If it isn't, you need to either transcode it to UTF-8 or find out what encoding it is in and SET client_encoding to that.
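To make that last step concrete, here is a minimal sketch of checking and transcoding the file contents in PHP before the import. The function name is mine, and Windows-1252 is only an assumed fallback - verify what your client's export tool actually produces:

```php
<?php
// Check whether the raw CSV bytes are valid UTF-8; if not, assume the
// file is Windows-1252 (a common source of the "�" symptom) and transcode.
function ensureUtf8(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, pass through untouched
    }
    // Assumed fallback encoding; adjust once you know the real source encoding.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}

// "é" in Windows-1252 is the single byte 0xE9, which is invalid UTF-8 on its own.
echo ensureUtf8("Bottes Adapt\xE9es"), "\n"; // Bottes Adaptées
```

Running the whole file through such a check once, before parsing, avoids chasing the problem row by row.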
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and the PostgreSQL manual on character set support.
Better approach
The approach you're taking is unnecessarily slow and inefficient, anyway. You should be:
Opening a transaction
Creating a temporary table in the database with the same structure as the CSV file
Loading the CSV into the temp table, e.g. with a COPY ... FROM statement (with appropriate options to specify the CSV format) or by feeding parsed rows to pg_copy_from
Merging the contents of the temporary table into the destination table with an INSERT then an UPDATE, e.g.:
INSERT INTO table1 (image_fileref, ... other fields ...)
SELECT n.image_fileref, ... other fields ...
FROM the_temp_table n
WHERE NOT EXISTS (SELECT 1 from table1 o WHERE o.image_fileref = n.image_fileref);
UPDATE table1 o
SET .... data to update ....
FROM the_temp_table n
WHERE o.image_fileref = n.image_fileref;
Committing the transaction
The INSERT may be more efficiently written as a left outer join with an IS NULL filter to exclude matching rows. It depends on the data. Try it.
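For reference, a sketch of that left-outer-join formulation, using the same placeholder table and column names as above (untested, like the rest):

```sql
-- Insert only rows from the temp table that have no match in table1 yet.
INSERT INTO table1 (image_fileref /* , ... other fields ... */)
SELECT n.image_fileref /* , ... other fields ... */
FROM the_temp_table n
LEFT OUTER JOIN table1 o ON o.image_fileref = n.image_fileref
WHERE o.image_fileref IS NULL;
```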
I probably could've written a faster CTE-based version, but CTEs were only added in PostgreSQL 8.4, so your 8.3.9 server doesn't support them anyway.
Since you left out the UPDATE I can't be more specific about the UPDATE or INSERT statements. If you'd provided the schema for table1, or even just your INSERT or UPDATE, I could've said more. Without sample data I haven't been able to run the statements to check them, and I didn't feel like making up dummy data, so the above is untested. As it is, completing the code is left as a learning exercise; I will not be updating this answer with fully written-out statements.
Related
First, this is not the common UTF-8 problem. All parts of my application are set to UTF-8 and work fine.
I fetch a mail over IMAP with PHP and read its title. This title contains a special char: an ö from the German language. Now I search my DB for an entry with this title - I know there is one. The database uses the utf8mb4 character set (collation utf8mb4_general_ci) so it can store 4-byte UTF-8 encoded special chars.
Title from Mail:
Fw: Auflösungsvertrag
Entry in Database:
Fw: Auflösungsvertrag
I put the cursor behind the ö and tried to delete it. First the ö turned into an o, and after a second press of the delete key it was fully gone. If I now type an ö on my keyboard, MySQL finds the entry.
If I put both into Notepad++, you see
Fw: Auflösungsvertrag
FW: Auflösungsvertrag
If you turn the encoding to ASCII you get
Fw: Auflösungsvertrag
Fw: Auflösungsvertrag
So you can see that the two ö are encoded differently, even though both display correctly. That is why my MySQL SELECT doesn't find the DB entry.
Can someone explain this to me and give me a hint for a PHP command to turn the first encoded string into the second one?
A slightly longer description of how this problem occurs:
I am writing a ticketing system. Every mail I send out gets the ticket's ID added to the subject. When I send out a mail, I write it to an outgoing table in the DB. A cronjob then sends these mails out asynchronously. I use PHPMailer and send over SMTP.
I fetch incoming mails by IMAP with the PHP IMAP classes. If a mail comes in with a TID in the subject, I merge that mail into the ticket in the database. All ticket entries are grouped by the TID column.
The problem is that if you send a mail from the system to another mail address inside the same system, the mail has to be merged into the existing ticket.
That's why, for every incoming mail, I search the outgoing table for the From address, the To address, and the title. If I find the mail there, I know the system sent it out.
So when I send the mail out, it has the first encoding. When I get the same mail back in, it has the other encoding. Both seem to be valid UTF-8. Everywhere on the website I get the right character, and it also displays right in the DB. Only when I make an SQL query over PDO does MySQL treat them as two different characters.
Here's how I would solve this. In my opinion it has to be fixed once and for all with a one-shot statement on the database side, and not with a trick on the PHP side that you would have to repeat everywhere, each time you face the issue.
First I copied your two strings:
Auflösungsvertrag
Auflösungsvertrag
into Notepad++, in which I have the (very handy) HEX plugin.
When I turn the text into hex, I get these values:
4175666c6fcc8873756e677376657274726167
4175666cc3b673756e677376657274726167
If we split that, we can easily see the hex of the two ö that cause the issue:
4175666c 6fcc88 73756e677376657274726167
4175666c c3b6 73756e677376657274726167
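You can confirm those byte values from PHP itself. The first form is a plain "o" (0x6f) followed by U+0308 COMBINING DIAERESIS (UTF-8 bytes cc 88); the second is the precomposed ö, U+00F6 (UTF-8 bytes c3 b6). A small sketch:

```php
<?php
$decomposed  = "o\u{0308}";  // "o" + combining diaeresis: two code points
$precomposed = "\u{00F6}";   // precomposed ö: one code point

echo bin2hex($decomposed), "\n";   // 6fcc88
echo bin2hex($precomposed), "\n";  // c3b6
```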
The trick now is to tell MySQL to replace all characters having the first hex sequence with the second, i.e. 6fcc88 with c3b6.
You can do that with this statement, which uses the UNHEX() function:
UPDATE your_table
SET your_column=REPLACE(your_column, UNHEX('6fcc88'), UNHEX('c3b6'))
Example and reproduction below
Schema (MySQL v8.0)
/* Creating test data - Row 1 and 2 are identical */
create table test (id int, txt varchar(50), txthex varchar(100));
INSERT INTO test (id,txt,txthex) VALUES (1, 'Auflösungsvertrag', '4175666c6fcc8873756e677376657274726167');
INSERT INTO test (id,txt,txthex) VALUES (2, 'Auflösungsvertrag', '4175666c6fcc8873756e677376657274726167');
INSERT INTO test (id,txt,txthex) VALUES (3, 'Auflösungsvertrag','4175666cc3b673756e677376657274726167');
Applying fix
/* Running oneshot fix on row 2 only */
UPDATE test
SET txt=REPLACE(txt, UNHEX('6fcc88'), UNHEX('c3b6'))
WHERE id=2
Check Query
SELECT id, txt, txthex hex_original,
CAST(UNHEX(txthex) AS CHAR(30)) unexed_original ,
HEX(txt) hex_replaced
FROM test;
id | txt               | hex_original                           | unexed_original   | hex_replaced
1  | Auflösungsvertrag | 4175666c6fcc8873756e677376657274726167 | Auflösungsvertrag | 4175666C6FCC8873756E677376657274726167
2  | Auflösungsvertrag | 4175666c6fcc8873756e677376657274726167 | Auflösungsvertrag | 4175666CC3B673756E677376657274726167
3  | Auflösungsvertrag | 4175666cc3b673756e677376657274726167   | Auflösungsvertrag | 4175666CC3B673756E677376657274726167
I found the solution.
The topic is called Unicode equivalence, and there are methods that normalize such strings.
https://en.wikipedia.org/wiki/Unicode_equivalence
PHP also has a class for this.
https://www.php.net/manual/de/normalizer.normalize.php
I had to call
$myString = normalizer_normalize($myString, Normalizer::NFKC);
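A small illustration of what this does (requires the intl extension; the sample strings are my own). The decomposed mail title and the precomposed database entry compare unequal as byte strings, but become identical after normalization:

```php
<?php
// Requires the intl extension for the Normalizer class.
$fromMail = "Auflo\u{0308}sungsvertrag"; // decomposed: "o" + combining diaeresis
$fromDb   = "Aufl\u{00F6}sungsvertrag";  // precomposed: single ö code point

var_dump($fromMail === $fromDb);                                         // bool(false)
var_dump(normalizer_normalize($fromMail, Normalizer::NFKC) === $fromDb); // bool(true)
```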
In an existing database, we have discovered some text entries where characters with accents were badly encoded.
The following query:
SELECT
PR.Product_Ref__ AS ProductCode,
CONCAT(P.Name, IF (PR.Misc <> '', CONCAT(' ', PR.Misc), '')) AS Name
FROM
Product AS P
INNER JOIN
Product_Ref AS PR ON P.Product__ = PR.Product__
WHERE
Name like "%é%" AND
PR.Product_Ref__ IN ( 659491, 657274 )
returns two lines describing the issue:
Sometimes, e with accent has been inserted properly, sometimes not.
How can I detect and update such issue with an UPDATE ... WHERE ...? Or should I use a different statement/method?
P.S.: The corresponding application is developed in PHP.
P.S.2: The answers in the suggested possible duplicate question do not address the issue I am raising. It is not related to the collation of the database.
You can simply detect all the occurrences that you want to correct and then use a simple UPDATE clause to make the substitutions.
Example for the case that you describe:
UPDATE Product
SET Name = REPLACE(Name, 'é', 'é')
WHERE Name like '%é%'
You can run these updates directly against the MySQL database, using the command line or a specific MySQL user-interface client application. Or, if you like, using PHP functions that run SQL statements against the database.
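Since the application is in PHP, note that "é" is typically what "é" (UTF-8 bytes C3 A9) turns into after being decoded as Latin-1 and re-encoded as UTF-8. Rather than enumerating REPLACE pairs for every accented character, you could undo that layer in one go. A hedged sketch (function name is mine; check the result against your data before running any UPDATE):

```php
<?php
// Undo one layer of double encoding: re-interpret each code point of the
// mojibake string as a Latin-1 byte, yielding the original UTF-8 bytes.
function undoDoubleEncoding(string $s): string
{
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

echo undoDoubleEncoding("caf\u{00C3}\u{00A9}"), "\n"; // café
```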
I am exporting data into a pdf using TCPDF. Everything works fine until I add a certain column (long text format) to the table. Whenever I add it, the table doesn't show up. When I run the sql query all the data shows up fine.
Is it possible there's a character or characters in the data field itself that are causing the table to become corrupted?
Also, I can't for the life of me figure out how to show more than 256 characters in a cell.
If anyone can help I'd really appreciate it.
Well, until I find a better option, I am running this against each of my comment fields:
UPDATE TABLE_NAME set COLUMN_NAME = replace(COLUMN_NAME, '’', '`');
I have a tab delimited text file with the first row being label headings that are also tab delimited, for example:
Name ID Money
Tom 239482 $2093984
Barry 293984 $92938
The only problem is that there are 30-some columns instead of 3, so I'd rather not type out all of the (name VARCHAR(50), ...) if it's avoidable.
How would I go about writing a PHP function that creates the table from scratch from the text file, say taking $file_path and $table_name as arguments? Do I have to write out all the column names again, telling MySQL what type they are, and chop off the top row - or is there a more elegant solution when the names are already there?
You would somehow need to map the column types to the columns in your file. You could do this by adding that data to your text file. For instance:
Name|varchar(32) ID|int(8) Money|int(10)
Tom 239482 $2093984
Barry 293984 $92938
or something similar. Then write a function that gets the column names and column types from the first line, and the data to fill the table with from all the other rows. You might also want to add a way to name the given table, etc. However, this would probably be as much work (if not more) as creating SQL queries from your text file directly: add a CREATE TABLE statement at the top and an INSERT statement for each line. With search and replace this can be done very quickly.
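If you do want to script it but can live with throwaway types, here is a minimal sketch of generating the CREATE TABLE statement from the tab-delimited header line. The function name is mine, and every column becomes VARCHAR(255), to be tightened by hand afterwards:

```php
<?php
// Build a CREATE TABLE statement from a tab-delimited header line.
// All columns get a generic VARCHAR(255) type.
function createTableFromHeader(string $headerLine, string $tableName): string
{
    $columns = array_map('trim', explode("\t", $headerLine));
    $defs = array_map(
        fn(string $c): string => sprintf('`%s` VARCHAR(255)', $c),
        $columns
    );
    return sprintf('CREATE TABLE `%s` (%s)', $tableName, implode(', ', $defs));
}

echo createTableFromHeader("Name\tID\tMoney", 'people'), "\n";
// CREATE TABLE `people` (`Name` VARCHAR(255), `ID` VARCHAR(255), `Money` VARCHAR(255))
```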
Even if you could find a way to do this, how would you determine the column types? I guess there would be some way to infer them by checking for certain attributes (int, string, etc.). And then you'd need to handle weird columns like Money, which might be seen as a string because of the dollar sign, but should almost certainly be stored as an integer.
Unless you plan on using this function quite a bit, I wouldn't bother spending time cobbling it together. Just fat-finger the table creation. (Ctrl-C, Ctrl-V is your friend.)
I have a large file that I would like to read via php, and then insert various fields into MySQL.
Each file in the feed is in plain text format, separated into columns and rows. Each record has the same set of fields. The following are the delimiters for each field and record:
Field Separator (FS): SOH (ASCII character 1)
Record Separator (RS) : STX (ASCII character 2) + "\n"
If I look at the first few lines of the file they look like this:
#export_dateapplication_idlanguage_codetitledescriptionrelease_notescompany_urlsupport_urlscreenshot_url_1screenshot_url_2screenshot_url_3screenshot_url_4screenshot_width_height_1screenshot_width_height_2screenshot_width_height_3screenshot_width_height_4
#primaryKey:application_idlanguage_code
#dbTypes:BIGINTINTEGERVARCHAR(20)VARCHAR(1000)LONGTEXTLONGTEXTVARCHAR(1000)VARCHAR(1000)VARCHAR(1000)VARCHAR(1000)VARCHAR(1000)VARCHAR(1000)VARCHAR(20)VARCHAR(20)VARCHAR(20)VARCHAR(20)
#exportMode:FULL
I am struggling to know where to start in order to read this file into PHP. Can anyone help with the basic PHP to read each record and assign a variable to each field? I will then be able to write these into MySQL - I can handle the writing into SQL once I have the various fields set up.
Thanks in advance,
Greg
Files greater than 2 GB can't be read in PHP on 32-bit builds (a 32-bit limit).
For smaller files, the simple fopen() function will do.
Inserting into MySQL is then all a matter of matching patterns and building inserts.
If the structure of the table is the same for every row, it's easiest to write the INSERT statement manually once and then just execute it per row, extracting the values either by regex or with functions like explode() and split().
If every line has delimiters between the fields, you may look at fgetcsv().
When you use fgetcsv() on a line, it returns an array with the contents of that line. Since you have several lines, put the function inside a while() loop (look at example #1 in the manual).
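fgetcsv() accepts a custom single-character delimiter, so passing chr(1) should work there; alternatively, since the separators are plain control characters, the format can be parsed by hand. A minimal sketch for the feed described above (function name is mine):

```php
<?php
// Parse the feed format: fields separated by SOH (\x01),
// records terminated by STX (\x02) followed by a newline.
function parseRecords(string $raw): array
{
    $records = [];
    // Drop a trailing terminator, then split on the record separator.
    foreach (explode("\x02\n", rtrim($raw, "\x02\n")) as $line) {
        if ($line === '' || $line[0] === '#') {
            continue; // skip the #export_date / #primaryKey / #dbTypes header lines
        }
        $records[] = explode("\x01", $line);
    }
    return $records;
}

$raw = "#exportMode:FULL\x02\n" . "1\x01en\x01My App\x02\n" . "2\x01fr\x01Mon App\x02\n";
print_r(parseRecords($raw)); // two records of three fields each
```

From there, each inner array maps onto one row for your MySQL INSERT.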