MySQL import in phpMyAdmin (CSV) chokes on quotes - php

I am trying to import a .csv file into a MySQL table via phpMyAdmin.
The .csv file is separated by pipes, formatted like this:
data|d'ata|d'a"ta|dat"a|
data|"da"ta|data|da't'a|
dat'a|data|da"ta"|da'ta|
The data contains quotes. I have no control over the format in which I receive the data -- it is generated by a third party.
The problem comes when there is a | followed by a double quote. I always get an "invalid field count in CSV input on line N" error.
I am uploading the file from the import page, using Latin1, format CSV, with "fields terminated by" set to | and "fields enclosed by" set to ".
I would like to just change the "enclosed by" character, but I keep getting "Invalid parameter for CSV import: Fields enclosed by". I have tried various characters with no success.
How can I tell MySQL to accept this format in phpMyAdmin?
Setting up these tables is the first step in writing a program that will use uploaded gzipped .csv files to maintain the catalog of an e-commerce site.

I've been having a similar problem for the last several hours, and I've finally gotten an import to work, so I'll share my solution, even though it may not help the original poster.
Short version:
1.) If it's an Excel file, save it as ODS (OpenDocument Spreadsheet) format.
1a.) If the file is some kind of text format with delimiters (like the original poster has), then open Excel and, once inside Excel, use File/Open to open the file. There you will be able to select the appropriate delimiter to view the file. Make sure the file looks all right, THEN save it as ODS format (and close the file).
2.) Open the file in OpenOffice Calc (a free download from Oracle/Sun).
2a.) Press Ctrl-F to open the Find dialog box. Click More Options and make sure "Current Selection Only" is NOT checked.
2b.) Search for double quotes. If there are none in your file, you can skip steps 4 and 5.
3.) Save As -> Text CSV. Select options for UTF-8 format (press "u" 3 times to get there fast), select ";" (semicolon) as the separator, and select double quotes for text.
4.) If there were any double quotes found in your file in step 2b, continue, otherwise just import the file as CSV with phpMyAdmin (see step 6). It should work.
5a.) Open in Word or any other text editor where you can do Find -> Replace All.
5b.) Find all instances of three double quotes in a row by searching for """ (if you do find any, you might even want to search for 4, 5, 6 etc. in a row until you come up empty).
5c.) Replace the """ with a placeholder that is not found anywhere else in your csv. I replaced them with 'abcdefg'.
5d.) Find -> Replace all instances of "" (two double quotes in a row) with \" (backslash and double quote).
5e.) Find -> Replace all instances of abcdefg (or your chosen placeholder from step 5c) with \"" (backslash, then two double quotes). Step 5c and this step ensure that any quotes occurring at the end of a field, just before the text-delimiting quote, are properly 'escaped'.
5f.) Finally, save the file, keeping in UTF-8 (or whatever format you need for import).
6a.) In phpMyAdmin, click the "Import" tab, click the "Choose File" button, and select the file you just saved.
6b.) Under 'Format of imported file', CSV should be selected. If column names are in the first row, make sure that checkbox is checked. Most importantly, 'Fields terminated by' should be set to ; (semicolon), 'Fields enclosed by' should be set to " (double quote), and 'Fields escaped by' should be set to \ (backslash). You set that up in your file by following step 3 and, if necessary, steps 5a-5f.
7.) Click "Go" and pray you didn't just waste another hour.
Now that the short version has turned out this long, I'll skip the long version.
Suffice it to say, there seem to be two major problems with importing through phpMyAdmin.
1.) There's some kind of memory problem that prevents large Excel and ODS files (how large is large? not sure yet) being imported.
2.) Neither OpenOffice nor Excel seems to save its CSV files in a way that's compatible with phpMyAdmin. They want to escape double quotes with double quotes; phpMyAdmin wants double quotes escaped with something else, like a backslash.
The first problem will hopefully be fixed in an update of phpMyAdmin (and/or the Excel importing add-on 'PHPExcel').
The second one could be fixed if there were an easy way to change the escape character for Excel or ODS files saved as CSV, or if phpMyAdmin could be made compatible with their format (that should actually be pretty easy: simply have it perform the same find-and-replace actions we performed manually above to skirt the double-quote problem).
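For what it's worth, steps 5b-5e can be scripted instead of done by hand. Here is a minimal PHP sketch of that idea, assuming the file fits in memory and that the placeholder never occurs in your data (the filenames are made up for illustration):

$csv = file_get_contents('export.csv');
// Steps 5b/5c: hide runs of three quotes behind a placeholder
// (runs of four or more would need the same treatment).
$csv = str_replace('"""', 'abcdefg', $csv);
// Step 5d: escape embedded double quotes with a backslash instead of doubling.
$csv = str_replace('""', '\\"', $csv);
// Step 5e: restore the placeholder as an escaped quote plus the closing quote.
$csv = str_replace('abcdefg', '\\""', $csv);
file_put_contents('export-escaped.csv', $csv);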
I hope this helps somebody, as I spent 3-4 hours discovering this solution and another hour writing it here. I hope it's not too long, but I was hoping to help people at all levels of expertise from zero to wherever I am (probably around 0.1).

I found a hack that works -- I use the $ as the "enclosed by" character and all is well. Since this is for a European site, I know that they'll never use it in the table content.

You could modify the CSV files by adding a \ in front of every ', right?
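For example, a rough sketch of that idea in PHP (the filenames are made up, and this assumes the file fits in memory):

$csv = file_get_contents('feed.csv');
// Put a backslash in front of every single quote, as suggested above.
file_put_contents('feed-escaped.csv', str_replace("'", "\\'", $csv));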

Have you tried blanking the boxes that read "Fields enclosed by" and "Fields escaped by"? I have not used phpMyAdmin, but Google suggests others have had success with this method.

You might consider just writing your own LOAD DATA INFILE query; it seems like you'll need one anyway, since this process will be part of an application at some point.
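A rough sketch of what that could look like for the pipe-delimited feed above (the connection details, table and file names are placeholders, and LOCAL INFILE must be enabled on both client and server):

$mysqli = mysqli_init();
$mysqli->options(MYSQLI_OPT_LOCAL_INFILE, true);  // allow LOAD DATA LOCAL
$mysqli->real_connect('localhost', 'user', 'pass', 'shop');
// Omitting ENCLOSED BY means quotes are treated as ordinary data,
// which sidesteps the "| followed by a double quote" problem.
$sql = "LOAD DATA LOCAL INFILE '/path/to/feed.csv'
        INTO TABLE catalog_import
        CHARACTER SET latin1
        FIELDS TERMINATED BY '|'
        LINES TERMINATED BY '\\n'";
$mysqli->query($sql) or die($mysqli->error);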

Related

Correct way to handle CSV Files on PHP

Hi, I have the following brain-breaking thing going on. The thing is that I'm developing a Laravel application that imports and exports CSV files. Now, the data that the application imports/exports (I/E from now on) has fields of various data types: we have text and numbers. The text can contain commas (,), and using the default CSV separator (,) in PHP can lead to fields in the import being generated incorrectly. The client suggested that I I/E using ^ as a separator for the export and (,) again for the import of the data. Now, my question is: can I trust the data when I/E using the default separator? Can anyone suggest the best way to do the I/E process?
Edit
The client's main struggle is that he uses Excel on a Mac to edit the CSV files. On my Mac, I can easily edit the files without any issues regarding the separator (if the separator is a comma, of course), but if we use ^ as a separator then my Excel is a mess and he ends up omitting some fields.
Thanks in advance.
Don't re-invent the wheel. Re-use a well-written, well-tested package. One good one is CSV from The PHP League.
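A minimal sketch of reading and writing with that package (this assumes league/csv 9.x; the filenames and the ^ delimiter are taken from the question):

require 'vendor/autoload.php';

use League\Csv\Reader;
use League\Csv\Writer;

// Read the caret-separated export.
$reader = Reader::createFromPath('export.csv', 'r');
$reader->setDelimiter('^');
$reader->setHeaderOffset(0);  // first row holds the column names
foreach ($reader->getRecords() as $record) {
    // $record is an associative array keyed by the header row
}

// Write a comma-separated file; the library encloses fields that
// contain commas in quotes, so embedded commas are safe.
$writer = Writer::createFromPath('import.csv', 'w+');
$writer->insertOne(['name', 'price']);
$writer->insertAll([['Widget, large', '9.99'], ['Widget, small', '4.99']]);

The point for the question above: as long as both sides enclose fields properly, embedded commas are harmless, and the default comma separator can be trusted.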
(Historical note about delimiters: the most overlooked (for 50+ years) feature in computing is that the ASCII charset (and therefore UTF-8 too) assigned specific chars for delimiting fields (or units, as they called them) and records ... and even groups of records and entire files. See https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text. But instead folks didn't RTM and so used commas etc. to separate fields and newlines (\r, \n, \r\n) to separate records. D-oh!!! So, if you are able to select your own delimiters and want to be safe by using a char not used for any other purpose, use the ASCII delimiters.)
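In PHP, those field and record separators are just chr(31) and chr(30). A tiny illustrative sketch (the filename and data are made up):

// The ASCII control characters meant for exactly this job:
$rs = chr(30);  // record separator
$us = chr(31);  // unit (field) separator

$records = [
    ['Widget, large', '9.99'],
    ['Widget "small"', '4.99'],
];
$out = '';
foreach ($records as $record) {
    $out .= implode($us, $record) . $rs;  // fields by US, records by RS
}
file_put_contents('data.asv', $out);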
There is no such thing as a "CSV standard", so having a "default" comma is not exactly true. One can use basically whatever one likes, and the column and line separators, as well as the enclosures for values or complete lines, really depend on what you are planning to put in as data.
TL;DR: it is totally up to you and your client what you use as those characters.

Moving html data from mysql (wordpress) to sqlserver

I am upgrading a website to ASP.NET which has at least 100K posts in WordPress. I could not find any related topic about the move, so I wanted to share my experience.
Big data is not a problem; however, some of the WordPress tables have HTML data containing quotes (both single and double), &nbsp;'s, tab characters and so on. I tried many ways of exporting, but exporting to an SQL file did not work for me (at least, I was not able to work with it; it caused so many troubles).
The best way to move the data is to export each table separately to its own CSV file. While exporting, you have to:
Custom Export
Use different, unique strings for both "Columns separated with" and "Columns enclosed with" (I used ############ and #######, respectively)
Check "Remove carriage return/line feed characters within columns"
Check "Put columns names in the first row"
Download export file
After downloading, do the following replacements:
Tabs with spaces: \t -> space
Double quotes with single quotes: " -> '
############ -> \t
####### -> "
Finally, an encoding conversion is required. The phpMyAdmin output is an ANSI file, and it is corrupt; SQL Server might not be able to handle it. To resolve this, first convert it to UTF-8 and then convert it back to ANSI (Notepad++ has these options in its "Encoding" menu). A sketch of the whole replace-and-convert pass follows.
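The replacements above can be scripted. A rough PHP sketch, assuming the export fits in memory and that "ANSI" here means Windows-1252 (the filenames are illustrative):

$data = file_get_contents('wp-export.csv');
$data = str_replace("\t", ' ', $data);             // tabs within columns -> spaces
$data = str_replace('"', "'", $data);              // double quotes in data -> single
// Replace the longer placeholder first so the shorter one does not clobber it.
$data = str_replace('############', "\t", $data);  // column separator -> tab
$data = str_replace('#######', '"', $data);        // column enclosure -> "
// Assumption: phpMyAdmin's "ANSI" output is Windows-1252.
$data = mb_convert_encoding($data, 'UTF-8', 'Windows-1252');
file_put_contents('wp-import.txt', $data);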
While importing into SQL Server, you have to select a text file as the source. Select the related CSV files and, while importing, take care of the column lengths: SQL Server makes all your columns varchar(50) by default, and the exported data will have much larger columns. You have to adjust them manually in the import wizard. Use DT_TEXT (not DT_NTEXT) for string values.
I know this process will result in some data loss (tabs and double quotes); however, that is the fault of WordPress's HTML editor. HTML data should be stored encoded in the database for these purposes...
Go with a linked server created on the target system; this way the import process can be driven by SQL Server and will hopefully produce a result that is ready to use without requiring too many steps and checks.
There are many SO posts about interacting with MySQL from SQL Server:
Can't create linked server - sql server and mysql
SELECT * FROM Linked MySQL server
Do I have to use OpenQuery to query a MySQL Linked Server from SQL Server?

mysql/php import csv into table with enclosures inside delimiter

I have come across a weird issue when importing a CSV into MySQL, whether through SQL or through PHP with data manipulation.
I have a CSV from a third party (which I have no control over and am unable to change) that is delimited by commas and has enclosures of double quotes. Simple enough. However, in some of the cells there is data such as:
"first" value, secondvalue, thirdvalue, "fourth, value"
Now, when I import this into SQL, the first value is being split because of the enclosure. How can I get it to ignore such cells and just input them as first value, but still keep the enclosures so they work on "fourth, value"?
Is there a regex that I could run on each line as I import it into the table (I don't mind importing lines one by one by reading them through PHP and then using INSERT), or is there functionality in SQL to allow this?
I have tried the following statements, but they do not work:
load data local infile '../htdocs/invoice/upload/importthis.csv'
into table items_raw
fields terminated by ','
enclosed by '"' lines terminated by '\n'
(date, clid_nu, clid, dnid, dcontext_nu, channel_nu,
dstchannel_nu, lastapp_nu, lastdata_nu, duration, billsec_nu, disposition_nu,
amaflags_nu, accountcode_nu, uniqueid_nu, userfield_nu)
and have also tried using OPTIONALLY ENCLOSED BY '"'; however, this also does not work.
I have also tried using fgetcsv, but I am getting the same results from it.
Any ideas?
EDIT
So the regex "((.*),(.*))" seems to match the fourth value but not the first value. Is this the best way to go, or am I overcomplicating this?
This looks like malformed CSV to me. This line should be:
"""first"" value", secondvalue, thirdvalue, "fourth, value"
where " is commonly used as an escape character.
The problem with using regexps on CSV input is that CSV is not a regular language.
Try using fgetcsv and see if that function has the same behavior as your SQL importer. Count the number of items it finds on each row; you might be able to catch all the anomalies that way.
Is it good enough to detect anomalies, or do you also want to fix them automatically (for instance, if the number of anomalies is very high)?
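A minimal sketch of that detection pass, assuming a comma-delimited, double-quote-enclosed file and taking the first row's field count as the reference:

$fh = fopen('importthis.csv', 'r');
$expected = null;
$lineNo = 0;
while (($row = fgetcsv($fh, 0, ',', '"')) !== false) {
    $lineNo++;
    if ($expected === null) {
        $expected = count($row);  // the first row sets the expected count
        continue;
    }
    if (count($row) !== $expected) {
        echo "Line $lineNo: expected $expected fields, got " . count($row) . "\n";
    }
}
fclose($fh);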
Alternatively you could write your own CSV parser that can read this, and convert the file to proper CSV.
Writing a CSV parser is actually not that hard. I can give an outline if you want.

What is an easy way to clean an unparsable csv file

The CSV file was created correctly, but the name and address fields contain every piece of punctuation available. So when you try to import it into MySQL, you get parsing errors. For example, the name field could look like this: "john ""," doe". I have no control over the data I receive, so I'm unable to stop people from inputting garbage data. From the example above you can see that if you consider the outside quotes to be the enclosing quotes then it is right, but of course MySQL, Excel, LibreOffice, etc. each see a whole new field. Is there a way to fix this problem? Some fields I found even have a backslash before the last enclosing quote. I'm at a loss, as I have 17 million records to import.
I have windows os and linux so whatever solution you can think of please let me know.
This may not be a usable answer, but someone needs to say it: you shouldn't have to do this. CSV is a file format with an expected data encoding. If someone is supplying you a CSV file, then it should be delimited and escaped properly; otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it was exported from.
If you asked someone to send you a JPG and they sent what was a proper JPG file with every 5th byte omitted or junk bytes inserted, you wouldn't accept that and say "oh, I'll reconstruct it for you".
You don't say if you have control over the creation of the CSV file. I am assuming you do, because if not, the CSV file is corrupt and cannot be recovered without human intervention or some very clever algorithms to "guess" the correct delimiters vs. the user-entered ones.
Convert user-entered tabs (assuming there are some) to spaces and then export the data using a TAB separator (see the sketch below).
If the above is not possible, you need to implement an ESC sequence to ensure that user-entered data is not treated as a delimiter.
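A rough PHP sketch of that tab-separated export idea; $rows stands in for whatever your data source actually returns:

$rows = [
    ['john "," doe', "123 Main St\tApt 4"],
    ["o'brien", "line1\nline2"],
];
$out = fopen('export.tsv', 'w');
foreach ($rows as $row) {
    // Flatten user-entered tabs and newlines to spaces, then use a real
    // tab as the field separator so it can never appear inside a field.
    $clean = array_map(
        fn ($v) => str_replace(["\t", "\r", "\n"], ' ', (string) $v),
        $row
    );
    fwrite($out, implode("\t", $clean) . "\n");
}
fclose($out);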
Your title asks: What is an easy way to clean an unparsable csv file
If it is unparseable, that means that you can't correctly break it up into fields. So you can't clean it.
Your first sentence states: The csv file was created correctly but the name and address fields contain every piece of punctuation there is available.
If the csv file was created correctly, then you can split it into fields correctly. So you can clean it.
Only punctuation? You are lucky. Unvalidated text fields in databases commonly contain nasties like tab, carriage return, line feed, and even Ctrl-Z.
Who says it's "unparsable"? On what grounds? What is their definition of "parsable"?
Who says it was "created correctly"? On what grounds? What is their definition of "correct"?
Could you perhaps show us the relevant parts of say 5 or so lines that are causing you grief? Edit your question and format the examples as code, to make them easier to read. Make it obvious where previous/next fields stop/start e.g.
...,"john ""," doe",...
By the way, the above is NOT "right" under any interpretation; it can't be right, with an ODD number of quote characters none of which is escaped.
My definition of correct: here is how to emit a CSV field that can be parsed no matter what is in the database (caveat: Python's csv module barfs on '\x00'):
if '"' in field:
output = '"' + field.replace('"', '""') + '"'
elif any of comma, line feed, carriage return in field: # pseudocode
output = '"' + field + '"'
else:
output = field
That's a really tough issue. I don't know of any real way to solve it, but maybe you could try splitting on ",", cleaning up the items in the resulting array (unicorns :) ) and then re-joining the row?
MySQL import has many parameters, including escape characters. Given the example, I think the quotes are escaped by putting a quote in front, so an import with ESCAPED BY '"' would work.
First of all, find all the kinds of mistakes there are. Then just replace them with empty strings. Just do it! If you need this corrupted data, only you can recover it.

Find actual value of PHP variable

I am having a real headache with reading in a tab delimited text file and inserting it into a MySQL Database.
The tab-delimited text file was generated (I think) from an MS SQL database, and I have written a simple script to read in the file and insert it into an existing table in my MySQL database.
However, there seems to be some problem with the data in the txt file. When my PHP script parses the file and I output the INSERT statements, the values in each of the fields are longer than they should be. For example, the first field should be a simple two-character alphanumeric value. If I echo out the INSERT statements, using Firebug (in Firefox), between each of the characters there is a question mark in a black diamond. If I var_dump the values, I get the following:
string(5) "A1"
Now, this clearly shows a two-character string, but var_dump tells me it is five characters long!!
If I trim() the value, all I get is the first character (in this case "A").
How can I get at the other characters, even if it is only to remove them? Additionally, this appears to be forcing MySQL to insert the value as a BLOB, not as a varchar as it should.
Simon
UPDATE
If I do:
echo mb_detect_encoding($arr[0]);
I get a result of 'ASCII'. This isn't multibyte, is it??
Sounds like an encoding issue: var_dump reports the string's byte count, so a "two-character" value occupying five bytes suggests multi-byte (or mis-decoded) data, and the black diamonds are the Unicode replacement character.
Are you running any strings through PHP functions that are not multi-byte safe?
You may need to look at the multi-byte aware (mb_*) functions in PHP.
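A hedged sketch of how you might confirm and fix this, assuming the export is actually UTF-16 (which would explain two visible characters taking more bytes); the filename is illustrative:

$raw = file_get_contents('import.txt');
// Ask PHP for a best guess; mb_detect_encoding is not always reliable,
// so treat the result as a hint rather than proof.
var_dump(mb_detect_encoding($raw, ['UTF-8', 'UTF-16LE', 'ISO-8859-1'], true));
// Assumption: the file is UTF-16LE; convert the whole file once, up front.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
file_put_contents('import-utf8.txt', $utf8);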
OK, I solved all these issues by opening the TXT file in Notepad and saving it specifically as UTF-8.
I still don't know what encoding was used (maybe UTF-16, which Notepad calls "Unicode"?), but it's all sorted now.
