Hi, I have a brain-breaking problem. I'm developing a Laravel application that imports and exports CSV files. The data that the application imports/exports (I/E from now on) has fields of various data types, both text and numbers. The text can contain commas (,), so using the default CSV separator (,) in PHP can cause fields to be split incorrectly on import. The client suggested that I use ^ as the separator for the export and (,) again for the import. My question is: can I trust the default separator when importing and exporting data? Can anyone suggest a better way to do the I/E process?
Edit
The client's main struggle is that he uses Excel on a Mac to edit the CSV files. On my Mac, I can edit the files without any separator issues as long as the separator is a comma (,), but if we use ^ as the separator, the file is a mess in my Excel and it omits some fields.
Thanks in advance.
Don't re-invent the wheel. Re-use a well-written, well-tested package. One good one is CSV from The PHP League.
(Historical note about delimiters: the most overlooked (for 50+ years) feature in computing is that the ASCII charset (and therefore UTF-8 too) assigned specific chars for delimiting fields (or units, as they called them) and records ... and even groups of records and entire files. See https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text. But instead folks didn't RTM and so used commas, etc. to separate fields and newlines (\r, \n, \r\n) to separate records. D-oh!!! So, if you are able to select your own delimiters and want to be safe by using a char not used for any other purpose, use the ASCII delimiters.)
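To illustrate the ASCII-delimiter idea, here is a minimal sketch. It is written in Python purely for illustration (the thread itself is about PHP), the function names are hypothetical, and it assumes the data never contains the unit separator (0x1F) or record separator (0x1E) characters.

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # separates fields (units) within a record
RECORD_SEP = "\x1e"  # separates records


def encode_records(rows):
    """Join rows of string fields using the ASCII unit/record separators."""
    for row in rows:
        for field in row:
            # Assumption: real data never contains the separators themselves.
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a reserved separator")
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def decode_records(text):
    """Split text produced by encode_records back into rows of fields."""
    if text == "":
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes, carets and even newlines need no escaping here.
rows = [["Doe, John", 'said "hi"', "line1\nline2"], ["42", "a^b", ""]]
encoded = encode_records(rows)
decoded = decode_records(encoded)
```

The trade-off is that such a file is no longer something a person can comfortably edit in a spreadsheet or text editor, which may matter if the client edits the files by hand.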
There is no such thing as a "CSV standard". Therefore, having a "default" comma is not exactly true. One can basically use whatever one likes, and the column and line separators as well as the enclosures for values or complete lines really depend on what you are planning to put in as data.
TL;DR: It is totally up to you and your client, what you are using as those characters.
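As a rough illustration of why the choice of separator matters less than consistent quoting, the sketch below round-trips the same rows with both `,` and `^`. It uses Python's standard `csv` module rather than PHP, and the sample rows and the `round_trip` helper are made up for the example.

```python
import csv
import io

# Hypothetical sample data containing commas, quotes and a caret.
rows = [
    ["id", "name", "note"],
    ["1", "Doe, John", 'He said "hello"'],
    ["2", "Smith", "uses ^ in text"],
]


def round_trip(rows, delimiter):
    """Write rows with the given delimiter, then parse them back."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer, delimiter=delimiter))


# Fields containing the delimiter are quoted on output, so both survive.
comma_result = round_trip(rows, ",")
caret_result = round_trip(rows, "^")
```

Either delimiter works as long as the writer and the reader agree on it and on the quoting rules; problems usually come from splitting lines by hand instead of using a CSV parser.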
Related
I'm stuck on a crazy project that has me looking for a strange solution. I've got an XFA PDF document generated by an outside party. There are several checkmark characters '✓' in the PDFs that I need to simply change to 'X'. The reason for this is beyond my control. I'm just looking for a way to change the ✓'s into X's. Can anyone point me in the right direction? Is it possible?
Currently we use PHP and TCPDF for creating "our" server PDFs, but this particular PDF is generated outside of my control by a third party that doesn't want to alter their way of doing things. To make things worse, I don't know how many checkmarks there are or where they may be. It's just one very specific character that needs changing. Does anyone know a way of hacking the document to change the character?
Character 2713
http://www.fileformat.info/info/unicode/char/2713/index.htm
Yes, I think you can. To my (rather limited) knowledge of the PDF format, you can only reliably search and replace strings of one character in length, since they are created by placing strings of variable length at specific co-ordinates, in an arbitrary order. The string 'hello' could therefore be one string of five letters, or five strings of one letter each or some combination thereof, all placed in the correct position (and in whatever order the print driver decided upon).
I'm afraid I don't know of any libraries that will do this, but I'd be surprised if they don't exist. You'll need to read PDF objects in, do the replacement, and write them out to a new file. I'd start off researching around the answers to this question.
Edit: this looks like it might be useful.
The csv file was created correctly but the name and address fields contain every piece of punctuation there is available. So when you try to import into mysql you get parsing errors. For example the name field could look like this, "john ""," doe". I have no control over the data I receive so I'm unable to stop people from inputting garbage data. From the example above you can see that if you consider the outside quotes to be the enclosing quotes then it is right, but of course mysql, excel, libreoffice, etc. see a whole new field. Is there a way to fix this problem? Some fields I found even have a backslash before the last enclosing quote. I'm at a loss as I have 17 million records to import.
I have windows os and linux so whatever solution you can think of please let me know.
This may not be a usable answer but someone needs to say it. You shouldn't have to do this. CSV is a file format with an expected data encoding. If someone is supplying you a CSV file then it should be delimited and escaped properly, otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it was exported from.
If you asked someone to send you a JPG and they sent what was a proper JPG file with every 5th byte omitted or junk bytes inserted, you wouldn't accept that and say "oh, I'll reconstruct it for you".
You don't say if you have control over the creation of the CSV file. I am assuming you do, as if not, the CSV file is corrupt and cannot be recovered without human intervention, or some very clever algorithms to "guess" the correct delimiters vs the user-entered ones.
Convert user entered tabs (assuming there are some) to spaces and then export the data using TABS separator.
If the above is not possible, you need to implement an ESC sequence to ensure that user entered data is not treated as a delimiter.
Your title asks: What is an easy way to clean an unparsable csv file
If it is unparseable, that means that you can't correctly break it up into fields. So you can't clean it.
Your first sentence states: The csv file was created correctly but the name and address fields contain every piece of punctuation there is available.
If the csv file was created correctly, then you can split it into fields correctly. So you can clean it.
Only punctuation? You are lucky. Unvalidated text fields in databases commonly contain nasties like tab, carriage return, line feed, and even Ctrl-Z.
Who says it's "unparsable"? On what grounds? What is their definition of "parsable"?
Who says it was "created correctly"? On what grounds? What is their definition of "correct"?
Could you perhaps show us the relevant parts of say 5 or so lines that are causing you grief? Edit your question and format the examples as code, to make them easier to read. Make it obvious where previous/next fields stop/start e.g.
...,"john ""," doe",...
By the way, the above is NOT "right" under any interpretation; it can't be right, with an ODD number of quote characters none of which is escaped.
My definition of correct: Here is how to emit a CSV field that can be parsed no matter what is in the database [caveat: Python csv module barfs on '\x00']:
def emit_field(field):
    # Any embedded quote forces quoting, with each quote doubled.
    if '"' in field:
        return '"' + field.replace('"', '""') + '"'
    # Delimiters and line breaks only need the enclosing quotes.
    if any(ch in field for ch in ',\n\r'):
        return '"' + field + '"'
    return field
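As a quick sanity check of that quoting rule, the sketch below emits one row with it and parses the result back with Python's `csv` module. The `emit_row` helper and the sample values are invented for the example; the first value is the name from the question, read as `john "," doe`.

```python
import csv
import io


def emit_field(field):
    """Quote a field only when needed, doubling embedded quotes."""
    if '"' in field:
        return '"' + field.replace('"', '""') + '"'
    if any(ch in field for ch in ',\n\r'):
        return '"' + field + '"'
    return field


def emit_row(fields):
    """Join escaped fields into one CSV line (hypothetical helper)."""
    return ",".join(emit_field(f) for f in fields) + "\r\n"


original = ['john "," doe', "12 Main St, Apt 4", "plain", "two\nlines"]
line = emit_row(original)

# newline="" keeps the embedded line break intact for the parser.
parsed = next(csv.reader(io.StringIO(line, newline="")))
```

Written this way, the troublesome name comes out as `"john "","" doe"`, which has an even number of quote characters and parses back to the original value.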
That's a really tough issue. I don't know of any real way to solve it, but maybe you could try splitting on ",", cleaning up the items in the resulting array (unicorns :) ) and then re-joining the row?
MySQL import has many parameters including escape characters. Given the example, I think the quotes are escaped by putting a quote in front. So an import with escaped by '"' would work.
First of all, find every kind of mistake. Then just replace them with empty strings. Just do it! If you need this corrupted data, only you can recover it.
I want to import really clean .txt files into Mysql with PHP. I've read that this is easy if you know the delimiter. but I don't.
In my case, the .txt files look like tables - ie: they're still structured like tables, not like a standard, jumbled CSV file.
Does this mean I don't have a delimited file? If so, any advice on how I might approach importing?
Sometimes the delimiter is the column number, rather than a character.
I.e. each data column begins in a specified physical character column. Each column is a fixed width, and parsing is as simple as splitting the string on those character width boundaries, and trimming whitespace if needed.
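A minimal sketch of that fixed-width approach, in Python for illustration. The column widths and the sample line are hypothetical; in practice the widths would have to be read off the actual file.

```python
def parse_fixed_width(line, widths):
    """Slice a line into columns of the given widths and trim padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Hypothetical layout: 6-character id, 12-character name, 8-character amount.
widths = [6, 12, 8]
sample = "1001".ljust(6) + "Alice".ljust(12) + "12.50".rjust(8)

fields = parse_fixed_width(sample, widths)
print(fields)  # ['1001', 'Alice', '12.50']
```

Each resulting list can then be passed to a parameterized INSERT, one row at a time.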
Sorry about that. An example would obviously help.
Here's an idea of what it looks like - ie: it already looks like a table.
https://gist.github.com/9753ad04b0fab256e452
I'm generating a CSV file using PHP. Some columns contain a paragraph with commas, and when I open the file, every comma within the paragraph counts as a new column. Is it possible to escape these commas in some way?
Depends, what, your, CSV, reader, is, "but, quoting, should, work"
Many CSV readers will allow commas within a single column by surrounding the column with double quotes. In that case, double quotes can be represented by double double quotes:
column 1,"column 2, with comma","column 3 with ""quote chars"", and comma"
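To check how a quoting-aware reader treats that line, here is a short sketch using Python's standard `csv` module (chosen only for illustration; the question is about PHP). It parses the example line and then writes the fields back out.

```python
import csv
import io

line = 'column 1,"column 2, with comma","column 3 with ""quote chars"", and comma"\n'

# Parse: quoted commas stay inside the field, "" becomes a single quote.
fields = next(csv.reader(io.StringIO(line)))

# Write the same fields back; only fields that need quoting get quoted.
out = io.StringIO(newline="")
csv.writer(out, lineterminator="\n").writerow(fields)
rewritten = out.getvalue()
```

Here `fields` holds three values, and `rewritten` matches the original line, so the commas never need to be removed or replaced.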
That's the BIG problem with using a , in a CSV file. I would recommend using a different separator like | (it's less likely to appear in text) or using a different, more robust file format like XML for generating your file.
You're using a comma because it's a delimiter. That is, the comma has special meaning no matter where it's used. By that very definition, it becomes hard to treat it as context sensitive. It can be done though, considering symbols like '\n'.
You can try a new delimiter, such as ,\n, though that might not be an option for you.
Looks to me that the best solution would be to use a different persistence mechanism. Things will get sticky otherwise.
I am trying to import a .csv file into a MySQL table via phpMyAdmin.
The .csv file is separated by pipes, formatted like this:
data|d'ata|d'a"ta|dat"a|
data|"da"ta|data|da't'a|
dat'a|data|da"ta"|da'ta|
The data contains quotes. I have no control over the format in which I receive the data; it is generated by a third party.
The problem comes when there is a | followed by a double quote. I always get an "invalid field count in CSV input on line N" error.
I am uploading the file from the import page, using Latin1, CSV, terminated by |, separated by ".
I would like to just change the "enclosed by" character, but I keep getting "Invalid parameter for CSV import: Fields enclosed by". I have tried various characters with no success.
How can I tell MySQL to accept this format in phpMyAdmin?
Setting up these tables is the first step in writing a program that will use uploaded gzipped .csv files to maintain the catalog of an e-commerce site.
I've been having a similar problem for the last several hours and I've finally gotten an import to work so I'll share my solution, even though it may not help the original poster.
Short version:
1.) if an Excel file, save as ODS (open document spreadsheet) format.
1a.) If the file is some kind of text format with delimiters (like the original poster has), then open Excel, and once inside Excel use File/Open to open the file. There you will be able to select the appropriate delimiter to view the file. Make sure the file looks alright, THEN save as ODS format (and close the file).
2.) Open the file in OpenOffice Calc (free download from Oracle/Sun).
2a.) Press Ctrl-F to open the Find dialog box. Click More Options and make sure "Current Selection Only" is NOT checked.
2b.) Search for double quotes. If there are none in your file, you can skip steps 4 and 5.
3.) Save As -> Text CSV. Select options for UTF-8 format (press "u" 3 times to get there fast), select ";" (semi colon) as separator, and select double quotes for text.
4.) If there were any double quotes found in your file in step 2b, continue, otherwise just import the file as CSV with phpMyAdmin (see step 6). It should work.
5a.) Open in Word or any other text editor where you can do Find -> Replace All.
5b.) Find all instances of three double quotes in a row by searching for """ (if you do find any, you might even want to search for 4, 5, 6 etc. in a row until you come up empty).
5c.) Replace the """ with a placeholder that is not found anywhere else in your csv. I replaced them with 'abcdefg'.
5d.) Find -> Replace all instances of "" (two double quotes in a row) with \" (backslash and double quote).
5e.) Find -> Replace all instances of abcdefg (or your chosen placeholder from step 5c) with \"". Step 5c and this step ensure that any quotes occurring at the end of a field just before the text-delimiting quote are properly 'escaped'.
5f.) Finally, save the file, keeping in UTF-8 (or whatever format you need for import).
6a.) In phpMyAdmin, click the "import" tab, click the "choose file" button, and select the file you just saved.
6b.) Under 'Format of imported file', CSV should be selected. If column names are in the first row, make sure that checkbox is checked. Most importantly, 'Fields terminated by' should be set to ; (semicolon), 'Fields enclosed by' should be set to " (double quotes), and 'Fields escaped by' should be set to \ (backslash). You set that up in your file by following step 3, and if necessary by following steps 5a - 5f.
7.) Click "Go" and pray you didn't just waste another hour.
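The manual find-and-replace in steps 5a to 5f can also be scripted. The sketch below is one possible way to do it with Python's standard `csv` module, not part of the original procedure: it reads semicolon-separated CSV that escapes quotes by doubling them and rewrites it with backslash-escaped quotes. The function name and sample line are hypothetical, and it assumes the data contains no backslashes of its own.

```python
import csv
import io


def doubled_to_backslash(text, delimiter=";"):
    """Re-encode CSV text from doubled-quote escaping to backslash escaping."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(
        out,
        delimiter=delimiter,
        quotechar='"',
        escapechar="\\",
        doublequote=False,      # write \" instead of ""
        quoting=csv.QUOTE_ALL,  # enclose every field in double quotes
        lineterminator="\n",
    )
    writer.writerows(reader)
    return out.getvalue()


# Assumption: sample input has no backslashes in the data itself.
source = '"1";"She said ""hi""";"ends with quote"""\n'
converted = doubled_to_backslash(source)
print(converted)
```

Because the file is parsed rather than pattern-matched, runs of three or more quotes need no placeholder trick. How backslashes already present in the data are handled is worth checking separately before relying on this for a real import.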
Now that the short version has turned out this long, I'll skip the long version.
Suffice it to say, there seem to be 2 major problems with importing through phpmyadmin.
1.) There's some kind of memory problem that prevents large Excel and ODS files (how large is large? not sure yet) being imported.
2.) Neither OpenOffice nor Excel seems to save its CSV files in a way that's compatible with phpMyAdmin. They want to escape double quotes with double quotes. phpMyAdmin wants double quotes escaped with something else, like a backslash.
The first problem will hopefully be fixed in an update of phpmyadmin (and/or the Excel importing add-on 'PHPExcel').
The second one could be fixed if there was an easy way to change the escape character for Excel or ODS files saved as CSV, or if phpMyAdmin could be made compatible with their format (that should actually be pretty easy. Simply have it perform the same find-replace actions we performed manually above to skirt the double quote problem).
I hope this helps somebody, as I spent 3-4 hours discovering this solution and another hour writing it here. I hope it's not too long, but I was hoping to help people at all levels of expertise from zero to wherever I am (probably around 0.1).
I found a hack that works -- I use the $ as the "enclosed by" character and all is well. Since this is for a European site, I know that they'll never use it in the table content.
you could modify the csv files by adding a \ in front of every ' right?
Have you tried blanking the boxes that read "Fields enclosed by" and "Fields escaped by"? I have not used phpMyAdmin, but Google suggests others have had success with this method.
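The effect of turning off the "enclosed by" handling can be previewed outside phpMyAdmin. This sketch uses Python's standard `csv` module (an assumption of convenience, not something from the thread) to read the sample lines from the question with quote processing disabled.

```python
import csv
import io

# The three sample lines from the question, pipes as separators.
sample = (
    "data|d'ata|d'a\"ta|dat\"a|\n"
    "data|\"da\"ta|data|da't'a|\n"
    "dat'a|data|da\"ta\"|da'ta|\n"
)

# QUOTE_NONE makes the reader treat ' and " as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="|", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(len(row), row)
```

Every line yields five fields, the last one empty because each line ends with a trailing pipe. The import has to account for that extra empty column.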
You might consider just writing your own LOAD DATA INFILE query, seems like you'll need one anyway since this process will be part of an application at some point.
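A rough sketch of what such a statement could look like for the pipe-delimited file in the question, built as a string in Python. The file name `catalog.csv` and table name `products` are hypothetical, and the empty ENCLOSED BY and ESCAPED BY values are an assumption that quotes and backslashes should be loaded as literal data.

```python
def build_load_statement(path, table):
    """Return a MySQL LOAD DATA statement for pipe-delimited, unquoted input."""
    # Assumption: path and table are trusted, hard-coded values. They are
    # interpolated directly, so never pass user input to this function.
    return (
        f"LOAD DATA LOCAL INFILE '{path}'\n"
        f"INTO TABLE {table}\n"
        "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY ''\n"
        "LINES TERMINATED BY '\\n'"
    )


# Hypothetical file and table names, for illustration only.
statement = build_load_statement("catalog.csv", "products")
print(statement)
```

Whether LOCAL is allowed depends on the server and client configuration, so the statement may need adjusting for the actual environment.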