I am having a real headache reading in a tab-delimited text file and inserting it into a MySQL database.
The tab-delimited file was generated (I think) from an MS SQL database, and I have written a simple script to read the file in and insert it into an existing table in my MySQL database.
However, there seems to be some problem with the data in the txt file. When my PHP script parses the file and I output the INSERT statements, the values in each field are longer than they should be. For example, the first field should be a simple two-character alphanumeric value, but when I echo out the INSERT statements in Firebug (in Firefox), there is a question mark in a black diamond between each of the characters. If I var_dump the values, I get the following:
string(5) "A1"
Now, this clearly shows a two character string, but var_dump tells me it is five characters long!!
If I trim() the value, all I get is the first character (in this case "A").
How can I get at the other characters, even if it is only to remove them? Additionally, this appears to be forcing MySQL to insert the value as a BLOB, not as a varchar as it should.
Simon
UPDATE
If I do:
echo mb_detect_encoding($arr[0]);
I get a result of 'ASCII'. This isn't multibyte, is it??
Sounds like an encoding issue.
Are you running any strings through PHP functions which are not multibyte-safe?
You may need to look at the multibyte-aware functions in PHP.
OK, solved all these issues by opening the TXT file in Notepad and saving it specifically as UTF-8.
I still don't know what encoding was used (maybe UTF-16, which Notepad labels "Unicode"?), but it's all sorted now
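For anyone hitting the same thing: the symptoms (a two-character string reported as five characters, with � between the letters) are consistent with a UTF-16 export, which is common for MS SQL / Windows tools. A minimal Python sketch of what re-saving as UTF-8 effectively does (the sample string is made up):

```python
# "A1" exported as UTF-16 is 6 bytes (2-byte BOM + 2 bytes per character);
# read byte-by-byte it looks like extra characters between each letter.
raw = "A1".encode("utf-16")

# Detect the byte-order mark and decode accordingly; otherwise assume UTF-8.
if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
    text = raw.decode("utf-16")      # the BOM selects the byte order
else:
    text = raw.decode("utf-8", errors="replace")

print(len(raw), len(text))           # 6 bytes on disk, 2 real characters
```

The same BOM check can be run on the first bytes of the file before parsing, instead of round-tripping through Notepad.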
I'm using TinyMCE to save some HTML into an SQL table in phpMyAdmin. Inserting and retrieving the row from the table works fine.
I'm using a regex to translate some short codes in the retrieved text and this is where the problem arises.
This is my regular expression, which simply gets the text between two short codes with possible html tags and new lines:
/(<.+>)?[[]{$code}[]](<\/.+>)?((?:\n.+\n?)+)(<.+>)?[[]{$code}[]](<\/.+>)?/
When I retrieve the HTML from the DB and run the regex on it, preg_match_all() fails to match anything. But when I double-click on the row in the database and open the in-line editor, phpMyAdmin does...something and automatically performs an update on the row, setting the text to a new value. When I then run the regex on the newly updated value, preg_match_all() matches the correct values.
I was thinking it was some automatic text encoding conversion or something, but running mb_detect_encoding() on the HTML before I insert it confirms that the encoding is UTF-8, the same as the table's utf8_unicode_ci collation.
I then compared the text plus EOL characters before and after the update in Notepad++ and they're exactly the same, yet my regex doesn't work before phpMyAdmin updates it.
What is phpMyAdmin doing to fix the text, and how can I do it before it gets inserted into the database? Why is it automatically updating the row at all?
I added some more code to the regular expression to check for content after the short code on the same line, and now preg_match_all() matches correctly every time. I'm still not sure what's going on, as the content before and after the update is identical in every test I've tried (same text, same number of spaces and newline characters).
Regardless, I fixed it by adding the below regex after the check for the end HTML tag:
(?:.+)?
So the full expression is:
(<.+>)?[[]{$code}[]](<\/.+>)?(?:.+)?((?:\n.+\n?)+)(<.+>)?[[]{$code}[]](<\/.+>)?
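For what it's worth, the amended pattern behaves the same outside PHP. A rough Python translation (the shortcode name and sample HTML are invented for illustration):

```python
import re

code = "quote"                       # hypothetical shortcode name
# {$code} from PHP becomes an f-string substitution here.
pattern = (rf"(<.+>)?[[]{code}[]](</.+>)?(?:.+)?"
           rf"((?:\n.+\n?)+)"
           rf"(<.+>)?[[]{code}[]](</.+>)?")

html = "<p>[quote]</p>\nSome quoted text\nmore text\n<p>[quote]</p>"
matches = re.findall(pattern, html)
# The third group of each match holds the text between the two shortcodes.
print(matches[0][2].strip())
```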
I have been poring over Stack Overflow all night looking for a way to solve my issues, but I absolutely cannot get the browser to display my Unicode characters correctly when pulling them from my database. In particular, I am trying to use the "combining macron" character (U+0304), added after a character to put a macron over it. I want the user to have the option to turn the macrons on and off, and having one character to look for and ignore seems easier than making conversions between individual macroned letters and their non-macroned counterparts (e.g. Ā -> A).
It would be trivial to use the HTML entity (&#772;) to accomplish this, but if I ever use the MySQL database for something other than making a webpage, I want it to be easily transferable. I have tested with the HTML entity and can successfully add a macron to the previous character.
However, when using the Unicode character in my MySQL table, I absolutely cannot get it to print anything other than question marks (?) in the browser. In the table itself, the entry is a VARCHAR(64) and looks like 'word¯' with the macron appearing afterwards, but I assume that's just a limitation of the cmd environment that it doesn't put the macron over the d. The column Collation is latin1_swedish_ci, if that makes a difference. Here is what I have tried to get the entry to print correctly:
Changing my php.ini to have a default charset of utf-8
Making the top of my php file read:
<?php
header('Content-Type: text/html; charset=utf-8');
?>
And setting the first parameter of my database PDO to 'mysql:dbname=NAME;host=localhost;charset=utf8'
When I simply make the PHP file echo the character I want, it prints to the page correctly. So I'm thinking the problem isn't with the page encoding? Or maybe the encodings of the database and the server aren't the same, and that is creating the '?'.
EDIT:
I can get it to display correctly if I insert the value from phpMyAdmin, but not when I enter it through the cmd client. In both cases I am pasting the same word ending with the character U+0304. Is there a reason it works with phpMyAdmin and not through a direct query, and what can I do so it works with both?
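A quick way to see why a latin1 connection or column produces '?' is to try encoding the combining macron both ways. A small Python sketch (just an illustration, not the PHP code in question):

```python
# The combining macron U+0304 exists in UTF-8 but not in latin1, so a
# latin1 connection or column silently turns it into '?'.
word = "word" + "\u0304"             # "word" with a combining macron

print(word.encode("utf-8"))          # the macron becomes the bytes CC 84

try:
    word.encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 cannot represent U+0304")
```

So both the column character set and the client connection character set need to be utf8 for the character to survive the round trip.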
I'm unsure if this is a PHP, FileMaker, MySQL, or ODBC driver issue.
For security reasons, the input fields of my current PHP web form convert special characters into character references (for example, # becomes &#35;). This code is saved in the database and is also shown as the raw code in FileMaker 11. This is not what I want.
How can I make sure the special character will be displayed as it should be?
The other way round (from FileMaker to the DB), no conversion is done when inserting special characters.
How can I make sure everything will be consistent?
Kind regards,
Jeroen
FileMaker is just showing the data stored in MySQL. If you pull up the DB in a tool like phpMyAdmin, you should see that the varchar contains the encoded text as well. Since FMP is looking at it simply as a text field, it shows the encoding that was stored. If you wanted to decode it in FMP, you could show a calculated field of the varchar that uses a custom function to decode the text (but that won't allow updating the data). You could also try a trigger on record load to decode the data in the fields so that you can properly view/edit it.
Solved it! It appeared that I had to add an extra line to my PHP script.
After setting up the connection, PHP needs to tell MySQL what encoding to use. This can be done with the following line:
$dbh->query("SET NAMES 'utf8'");
Thanks for the effort guys!
This &#35; type of encoding is not done automatically by the browser. Something is doing it. Normally you do it only on output, not on input.
You can use html_entity_decode() to undo it, but I strongly suggest you figure out why it's happening in the first place.
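If you just need to undo the encoding on output, most languages have an equivalent of html_entity_decode(). A Python sketch with a made-up stored value:

```python
import html

# Hypothetical value as it might sit in the database after the form
# encoded it; html.unescape() is Python's html_entity_decode().
stored = "price is &#35;1 &amp; up"
print(html.unescape(stored))         # price is #1 & up
```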
The CSV file was created correctly, but the name and address fields contain every piece of punctuation available, so when you try to import it into MySQL you get parsing errors. For example, the name field could look like this: "john ""," doe". I have no control over the data I receive, so I'm unable to stop people from inputting garbage data. From the example above, you can see that if you consider the outside quotes to be the enclosing quotes then it looks right, but of course MySQL, Excel, LibreOffice, etc. see a whole new field. Is there a way to fix this problem? Some fields I found even have a backslash before the last enclosing quote. I'm at a loss, as I have 17 million records to import.
I have Windows and Linux, so whatever solution you can think of, please let me know.
This may not be a usable answer, but someone needs to say it. You shouldn't have to do this. CSV is a file format with an expected data encoding. If someone is supplying you a CSV file, then it should be delimited and escaped properly; otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it was exported from.
If you asked someone to send you a JPG and they sent what was a proper JPG file with every 5th byte omitted or junk bytes inserted, you wouldn't accept that and say "oh, I'll reconstruct it for you".
You don't say whether you have control over the creation of the CSV file. I am assuming you do, as if not, the CSV file is corrupt and cannot be recovered without human intervention, or some very clever algorithms to "guess" the correct delimiters vs. the user-entered ones.
Convert user-entered tabs (assuming there are some) to spaces and then export the data using a tab separator.
If the above is not possible, you need to implement an escape sequence to ensure that user-entered data is not treated as a delimiter.
Your title asks: What is an easy way to clean an unparsable csv file
If it is unparseable, that means that you can't correctly break it up into fields. So you can't clean it.
Your first sentence states: The csv file was created correctly but the name and address fields contain every piece of punctuation there is available.
If the csv file was created correctly, then you can split it into fields correctly. So you can clean it.
Only punctuation? You are lucky. Unvalidated text fields in databases commonly contain nasties like tab, carriage return, line feed, and even Ctrl-Z.
Who says it's "unparsable"? On what grounds? What is their definition of "parsable"?
Who says it was "created correctly"? On what grounds? What is their definition of "correct"?
Could you perhaps show us the relevant parts of say 5 or so lines that are causing you grief? Edit your question and format the examples as code, to make them easier to read. Make it obvious where previous/next fields stop/start e.g.
...,"john ""," doe",...
By the way, the above is NOT "right" under any interpretation; it can't be right, with an ODD number of quote characters none of which is escaped.
My definition of correct: here is how to emit a CSV field that can be parsed no matter what is in the database (caveat: Python's csv module barfs on '\x00'):
def emit_csv_field(field):
    # A field containing quotes is quoted, with inner quotes doubled.
    if '"' in field:
        return '"' + field.replace('"', '""') + '"'
    # A field containing a delimiter or line break just needs quoting.
    elif any(c in field for c in ',\n\r'):
        return '"' + field + '"'
    else:
        return field
That's a really tough issue. I don't know of any real way to solve it, but maybe you could try splitting on ",", cleaning up the items in the resulting array (unicorns :) ) and then re-joining the row?
MySQL's import has many parameters, including escape characters. Given the example, I think the quotes are escaped by putting a quote in front. So an import with ESCAPED BY '"' would work.
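The doubled-quote escaping described here is the convention most CSV parsers already understand. A small Python check (sample data invented):

```python
import csv, io

# A quote escaped by doubling it, as in the question's example; Python's
# csv module (doublequote=True by default) reads it back the same way
# MySQL does with FIELDS ENCLOSED BY '"' ESCAPED BY '"'.
data = '1,"john "" doe",NY\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows[0])                       # ['1', 'john " doe', 'NY']
```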
First of all, find all the kinds of mistakes. Then just replace them with empty strings. Just do it! If you need this corrupted data, only you can recover it.
I am using CodeIgniter in an app. There is a form. In the textarea element, I wrote something including
%Features%
However, when I echo this with $this->input->post(key), I get something like
�atures%
The '%Fe' has vanished.
In the main index.php file of CI, I tried var_dump($_POST) and the above word is fully OK. But when I fetch it with the Input library (XSS filtering is on), I get the problem.
When XSS filtering is off, it initially appears OK. However, if I store it in the database and display it next time, I see the same problem (even with XSS filtering off).
%Fe happens to look like a URL-encoded sequence %FE, representing character 254. It's being munched into the Unicode "I have no idea what that sequence means" glyph, �.
It's clear that the "XSS filter" is being over-zealous when decoding the field on submission.
It's also very likely that a URL-decode is being run again at some point later in the process, when you output the result from the database. Check the database to make sure that the actual string is being represented properly.
First: escape the variables before storing them in the DB. % has a special meaning in SQL.
Second: % also has a special meaning in URLs, e.g. %20 is a space. %FE will map to some character, which will be decoded by input().
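This is easy to reproduce outside CodeIgniter; Python's URL decoder does the same thing to the string (a sketch of the mechanism, not CI's actual filter code):

```python
from urllib.parse import unquote

# "%Fe" is a valid percent-escape (hex FE, case-insensitive); byte 0xFE is
# not valid UTF-8, so decoding substitutes the replacement character.
print(unquote("%Features%"))         # �atures%
```

The lone trailing % survives because it isn't followed by two hex digits, which matches the output shown in the question.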