Importing txt files with Unknown Delimiter - php

I want to import really clean .txt files into MySQL with PHP. I've read that this is easy if you know the delimiter, but I don't.
In my case, the .txt files look like tables, i.e. they're still structured like tables, not like a standard, jumbled CSV file.
Does this mean I don't have a delimited file? If so, any advice on how I might approach importing?

Sometimes the delimiter is the column number, rather than a character.
I.e. each data column begins in a specified physical character column. Each column is a fixed width, and parsing is as simple as splitting the string on those character width boundaries, and trimming whitespace if needed.
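A minimal sketch of that idea, in Python for brevity (in PHP, substr() and trim() would do the same job). The column names and character offsets below are made up for illustration; the real ones have to be read off the actual file.

```python
# Hypothetical layout: (column name, start offset, end offset) per column.
COLUMNS = [("id", 0, 6), ("name", 6, 26), ("city", 26, 40)]

def parse_fixed_width(line, columns=COLUMNS):
    """Slice one line on fixed character boundaries and trim the padding."""
    return {name: line[start:end].strip() for name, start, end in columns}

# Build a sample line padded to the assumed widths (6, 20 and 14 characters).
sample = "1001".ljust(6) + "Alice Smith".ljust(20) + "Springfield".ljust(14)
row = parse_fixed_width(sample)
print(row)
```

Each parsed row can then be handed to a parameterised INSERT. If the widths vary between files, they would need to be detected first, for example from the header line.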

Sorry about that. An example would obviously help.
Here's an idea of what it looks like, i.e. it already looks like a table.
https://gist.github.com/9753ad04b0fab256e452

Related

Correct way to handle CSV Files on PHP

Hi, I have the following brain-breaking thing going on. I'm developing a Laravel application that imports and exports CSV files. The data that the application imports/exports (I/E from now on) has fields of various data types: text and numbers. The text can contain commas (,), and using the default CSV separator (,) in PHP can cause fields to be generated incorrectly on import. The client suggested that I I/E using ^ as a separator for the export and (,) again for the import of the data. Now, my question is: can I trust the default separator when importing and exporting data? Can anyone suggest a better way to do the I/E process?
Edit
The client's main struggle is that he uses Excel on a Mac to edit the CSV files. On my Mac, I can easily edit the files without any issues regarding the separator, as long as the separator is a comma (,). But if we use ^ as a separator, then my Excel is a mess and his omits some fields.
Thanks in advance.
Don't reinvent the wheel. Reuse a well-written, well-tested package. One good one is CSV from The PHP League.
(Historical note about delimiters: the most overlooked (for 50+ years) feature in computing is that the ASCII charset (and therefore UTF-8 too) assigned specific chars for delimiting fields (or units, as they called them) and records ... and even groups of records and entire files. See https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text. But instead folks didn't RTM and so used commas, etc. to separate fields and newlines (\r, \n, \r\n) to separate records. D'oh! So, if you are able to select your own delimiters and want to be safe by using a char not used for any other purpose, use the ASCII delimiters.)
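To make the ASCII-delimiter idea concrete, here is a small sketch in Python. It assumes the data itself never contains the control characters 0x1F (unit separator) or 0x1E (record separator); the function names dump and load are just illustrative.

```python
US = "\x1f"  # ASCII unit separator: sits between fields
RS = "\x1e"  # ASCII record separator: sits between records

def dump(rows):
    """Join fields with US and records with RS; no quoting is needed."""
    return RS.join(US.join(fields) for fields in rows)

def load(text):
    """Reverse of dump(): split on RS first, then on US."""
    return [record.split(US) for record in text.split(RS)]

# Commas, quotes and newlines inside the fields need no escaping at all.
rows = [["tilapia, fresh", "12.50"], ['say "hi"\nsecond line', "3"]]
encoded = dump(rows)
decoded = load(encoded)
print(decoded == rows)
```

The trade-off is that such a file is awkward to edit by hand or in Excel, which matters if the client edits the files in a spreadsheet.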
There is no such thing as a "CSV standard". Therefore, having a "default" comma is not exactly true. One can basically use whatever one likes, and the column and line separators as well as the enclosures for values or complete lines really depend on what you are planning to put in as data.
TL;DR: It is totally up to you and your client, what you are using as those characters.

Converting mysql chars

I have a database that seems to be on latin1_swedish. I need to add some more text to it. The new text contains some Brazilian words. Example:
tilápia
Cachaça
...
The old text that is in the db has these words too, but it's like this:
tilÃ¡pia
The PHP file is converting it to the real word, using the right accent.
How can I add these texts and keep the PHP conversion working? For example, add tilápia to my table and have MySQL keep it as tilÃ¡pia.
Thanks, hope it's not confusing.
While the collation should definitely be something more generic like utf8_general_ci, that won't change how things are displayed. MySQL will store whatever you throw at it and will return exactly the same thing when you ask for it. Hence, you just have to make sure to use the same encoding for reading and writing. In general it's a good idea to use utf8 through the whole application (including the db). For that you would need to convert the content in your db.
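A small sketch of what is probably going on, assuming (the question does not confirm this) that the old rows are UTF-8 bytes that were written and read back through a latin1 connection. Python is used here only to show the byte-level round trip.

```python
original = "tilápia"

# UTF-8 encodes "á" as two bytes (0xC3 0xA1) ...
utf8_bytes = original.encode("utf-8")

# ... and reading those two bytes as latin1 gives two characters instead of one.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Reversing the same mistake recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

So as long as new rows go through the same encode/decode path as the old ones, they will look garbled in the table but display correctly in the application. Converting everything to real UTF-8 is the cleaner long-term fix.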

Using Explode or Preg_Split to split filenames in a string

In my PHP script I pull from a database field a list of file names. The names in the field are separated by commas and can be various lengths containing various characters and / or spaces. The string could look something like this:
"fileone.wav, file two with spaces.mp3, another file but this one has commas, which is, of course, the problem.mp3, another_one.mp3"
I am using this to explode them into an array ($attachments contains the string from the db field):
$filenames = explode(", ", $attachments);
My dilemma is that sometimes the file names contain commas, so explode fails because it separates the names at every comma. It of course breaks the filename into separate array elements.
I'm wondering if maybe preg_split would be a better way to match and split filenames. I'm very inexperienced with regex but conceptually I imagine I'd split the names by matching the ".", the three characters that follow, whatever they are and the comma.
Is this a good way to do this? And how would I write that expression?
If your filenames can have commas in them (and have no escape character) it's impossible to decide how to split the filenames properly.
Maybe you have a file named one.mp3,two.mp3. Whoever decided to store the filenames like this made a terrible mistake. There are so many serializers available there is no excuse not to use any. Even something like (un)serialize($attachments) is sufficient.
You can do simple detection like find an extension (. followed by something) and then split at the first comma. You don't need a regular expression for that, just walk the string.
The data format as you have it is fundamentally flawed, as you've discovered.
Ideally, you need to fix the data. If you want to stick with the basic format you have (ie comma separated), you should make sure that it is saved in a valid CSV format -- ie with quotes around the values that contain commas, so your string would look like this:
fileone.wav, file two with spaces.mp3, "another file but this one has commas, which is, of course, the problem.mp3", another_one.mp3
With the data in this format, you could use PHP's built-in CSV handling function str_getcsv() to read the data instead of explode(). Problem solved.
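As an illustration of how a CSV parser treats that quoted string, here is a sketch using Python's csv module rather than PHP's str_getcsv(). The skipinitialspace option is needed here because the sample has a space after each comma; how str_getcsv() treats that leading space has not been checked for this sketch.

```python
import csv

data = (
    'fileone.wav, file two with spaces.mp3, '
    '"another file but this one has commas, which is, of course, the problem.mp3", '
    'another_one.mp3'
)

# csv.reader takes an iterable of lines; skipinitialspace drops the blank
# after each delimiter so the opening quote is recognised as a quote.
filenames = next(csv.reader([data], skipinitialspace=True))

for name in filenames:
    print(name)
```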
If you're happy to try other formats, you could also reformat the data into JSON or some other serialised format, which would also make things easier to manage.
The most technically correct answer remains to normalise the database so that the filenames have their own table and each one is in a separate record, but this may be overkill and/or too much upheaval for your purposes.
So yes, ideally you should fix the data, because it is in a very very badly designed format.
However if you really can't fix the data, then you will have to resort to some clever regex trickery to split the files.
Assuming all files end in ".mp3", it's relatively simple; you could do something like this:
preg_split('/\.mp3(?:,\s*|$)/', $data, -1, PREG_SPLIT_NO_EMPTY)
...which will give you the filenames without the .mp3 extension. If they're all mp3, then it's easy enough to add it back on again.
If your file names are mixed file types, then it gets more complex; you'd need to use regex look-around assertions to find the extensions but without removing them.
Your problem with all of this, however, is that it would be possible for a filename to contain .mp3, somewhere in the middle of the name. Not likely of course, but possible, especially if you allow your users to upload their own file names.
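For the mixed-extension case, one possible sketch (in Python; PCRE look-behinds in PHP work along the same lines) is to split only on a comma that directly follows something shaped like an extension. It assumes every extension is a dot plus exactly three word characters, which is a guess based on the sample string.

```python
import re

# Split on a comma (plus any following whitespace) only when it comes right
# after ".xxx". The look-behind checks the extension without consuming it.
EXT_THEN_COMMA = re.compile(r"(?<=\.\w{3}),\s*")

def split_filenames(attachments):
    return EXT_THEN_COMMA.split(attachments)

attachments = (
    "fileone.wav, file two with spaces.mp3, another file but this one has "
    "commas, which is, of course, the problem.mp3, another_one.mp3"
)
names = split_filenames(attachments)
for name in names:
    print(name)
```

The caveat above still applies: a single file actually named one.mp3,two.mp3 would be split into two names, so this is a heuristic and not a real fix for the data format.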

What is an easy way to clean an unparsable csv file

The CSV file was created correctly, but the name and address fields contain every piece of punctuation available. So when you try to import it into MySQL you get parsing errors. For example, the name field could look like this: "john ""," doe". I have no control over the data I receive, so I'm unable to stop people from inputting garbage data. From the example above you can see that if you consider the outside quotes to be the enclosing quotes then it is right, but of course MySQL, Excel, LibreOffice, etc. see a whole new field. Is there a way to fix this problem? Some fields I found even have a backslash before the last enclosing quote. I'm at a loss as I have 17 million records to import.
I have windows os and linux so whatever solution you can think of please let me know.
This may not be a usable answer but someone needs to say it. You shouldn't have to do this. CSV is a file format with an expected data encoding. If someone is supplying you a CSV file then it should be delimited and escaped properly, otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it was exported from.
If you asked someone to send you a JPG and they sent what was a proper JPG file with every 5th byte omitted or junk bytes inserted, you wouldn't accept that and say "oh, I'll reconstruct it for you".
You don't say if you have control over the creation of the CSV file. I am assuming you do, as if not, the CSV file is corrupt and cannot be recovered without human intervention, or some very clever algorithms to "guess" the correct delimiters vs the user-entered ones.
Convert user entered tabs (assuming there are some) to spaces and then export the data using TABS separator.
If the above is not possible, you need to implement an ESC sequence to ensure that user entered data is not treated as a delimiter.
Your title asks: What is an easy way to clean an unparsable csv file
If it is unparseable, that means that you can't correctly break it up into fields. So you can't clean it.
Your first sentence states: The csv file was created correctly but the name and address fields contain every piece of punctuation there is available.
If the csv file was created correctly, then you can split it into fields correctly. So you can clean it.
Only punctuation? You are lucky. Unvalidated text fields in databases commonly contain nasties like tab, carriage return, line feed, and even Ctrl-Z.
Who says it's "unparsable"? On what grounds? What is their definition of "parsable"?
Who says it was "created correctly"? On what grounds? What is their definition of "correct"?
Could you perhaps show us the relevant parts of say 5 or so lines that are causing you grief? Edit your question and format the examples as code, to make them easier to read. Make it obvious where previous/next fields stop/start e.g.
...,"john ""," doe",...
By the way, the above is NOT "right" under any interpretation; it can't be right, with an ODD number of quote characters none of which is escaped.
My definition of correct: Here is how to emit a CSV field that can be parsed no matter what is in the database [caveat: Python csv module barfs on '\x00']:
if '"' in field:
    # double every embedded quote, then wrap the field in quotes
    output = '"' + field.replace('"', '""') + '"'
elif any(ch in field for ch in ',\n\r'):
    # comma, line feed or carriage return: wrapping in quotes is enough
    output = '"' + field + '"'
else:
    output = field
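As a quick check of that rule, the sketch below wraps it in a function (emit_field is a name chosen here, not from the answer) and feeds the result back through Python's csv reader. The sample fields are invented.

```python
import csv
import io

def emit_field(field):
    """Quote a field only when needed, doubling any embedded quotes."""
    if '"' in field:
        return '"' + field.replace('"', '""') + '"'
    if any(ch in field for ch in ',\n\r'):
        return '"' + field + '"'
    return field

fields = ['john "," doe', 'plain', 'a,b', 'two\nlines', 'ends with \\']
line = ",".join(emit_field(f) for f in fields)
print(line)

# newline="" stops StringIO from touching the embedded line break.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed == fields)
```

Emitted this way, the name from the question comes out as "john "","" doe", with an even number of quote characters, which is what a parser expects.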
That's a really tough issue. I don't know of any real way to solve it, but maybe you could try splitting on ",", cleaning up the items in the resulting array (unicorns :) ) and then re-joining the row?
MySQL import has many parameters, including escape characters. Given the example, I think the quotes are escaped by putting a quote in front. So an import with ESCAPED BY '"' would work.
First of all, find all kinds of mistakes. And then just replace them with empty strings. Just do it! If you need this corrupted data, only you can recover it.

How to show a comma within a comma separated file

I'm generating a CSV file using PHP. Some columns contain a paragraph with commas, and when I open the file, every comma within the file counts as a new column. Is it possible to escape these commas in some way?
Depends, what, your, CSV, reader, is, "but, quoting, should, work"
Many CSV readers will allow commas within a single column by surrounding the column with double quotes. In that case, double quotes can be represented by double double quotes:
column 1,"column 2, with comma","column 3 with ""quote chars"", and comma"
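Rather than building that quoting by hand, a CSV writer can do it for you. The sketch below uses Python's csv module to produce exactly the line above; PHP's fputcsv() fills a similar role, though its output has not been checked for this sketch.

```python
import csv
import io

row = [
    "column 1",
    "column 2, with comma",
    'column 3 with "quote chars", and comma',
]

buffer = io.StringIO(newline="")
# The default quoting mode only quotes fields that need it and doubles
# any embedded double quotes.
csv.writer(buffer, lineterminator="\n").writerow(row)
line = buffer.getvalue()
print(line, end="")
```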
That's the BIG problem with using a , in a CSV file. I would recommend using a different separator like | (it's less likely to appear in text) or using a different, more robust file format like XML for generating your file.
You're using a comma because it's a delimiter. That is, the comma has special meaning no matter when it's used. By that very definition, it becomes hard to treat it as context sensitive. It can be done though, considering symbols like '\n'.
You can try a new delimiter, such as ,\n, though that might not be an option for you.
Looks to me that the best solution would be to use a different persistence mechanism. Things will get sticky otherwise.
