I need to convert a fixed length text file into a MySQL Table.
My biggest problem is that each line contains multiple cells; this is how the file is sent to me, and it is the main reason I want to convert it.
The cells are all of a specific length; however, they are all included on the one line.
For example, the first 3 positions (1 - 3) of a line are the IRT, the next 3 positions (4 - 6) are the IFTC, the next 5 positions (7 - 11) are the FSC, etc.
As the file can contain up to 300 lines of records, I need an easy way to import it straight into the SQL Tables.
I have been searching the net for hours trying to find a solution, however without comma separation I haven't been able to find a working solution yet.
I would like to code this solution in PHP, if possible. I am willing to do the hard yards of working out how to use the required function if someone could give me its name; I don't expect people to write my code out for me.
File:
testfile.txt (4 rows)
AAA11111xx
BBB22222yy
CCC33333zz
DDD 444 aa
Table:
CREATE TABLE TestLoadDataInfile
( a VARCHAR(3)
, b INT(5)
, c CHAR(2)
) CHARSET = latin1;
Code:
LOAD DATA INFILE 'D:\\...\\testfile.txt'
INTO TABLE TestLoadDataInfile
FIELDS TERMINATED BY ''
LINES TERMINATED BY '\r\n' ;
Result:
mysql> SELECT * FROM TestLoadDataInfile ;
+-----+-------+----+
| a   | b     | c  |
+-----+-------+----+
| AAA | 11111 | xx |
| BBB | 22222 | yy |
| CCC | 33333 | zz |
| DDD |   444 | aa |
+-----+-------+----+
The LOAD DATA INFILE documentation is not very clear on this point (fixed-size fields). Here are the relevant parts:
If the FIELDS TERMINATED BY and FIELDS
ENCLOSED BY values are both empty
(''), a fixed-row (nondelimited)
format is used. With fixed-row format,
no delimiters are used between fields
(but you can still have a line
terminator). Instead, column values
are read and written using a field
width wide enough to hold all values
in the field. For TINYINT, SMALLINT,
MEDIUMINT, INT, and BIGINT, the field
widths are 4, 6, 8, 11, and 20,
respectively, no matter what the
declared display width is.
LINES TERMINATED BY is still used to
separate lines. If a line does not
contain all fields, the rest of the
columns are set to their default
values. If you do not have a line
terminator, you should set this to ''.
In this case, the text file must
contain all fields for each row.
Fixed-row format also affects handling
of NULL values, as described later.
Note that fixed-size format does not
work if you are using a multi-byte
character set.
NULL handling
With fixed-row format (which is used
when FIELDS TERMINATED BY and FIELDS
ENCLOSED BY are both empty), NULL is
written as an empty string. Note that
this causes both NULL values and empty
strings in the table to be
indistinguishable when written to the
file because both are written as empty
strings. If you need to be able to
tell the two apart when reading the
file back in, you should not use
fixed-row format.
Some cases are not supported by LOAD
DATA INFILE:
Fixed-size rows (FIELDS TERMINATED BY and FIELDS ENCLOSED BY
both empty) and BLOB or TEXT columns.
User variables cannot be used when
loading data with fixed-row format
because user variables do not have a
display width.
You probably won't like it very much, but there really isn't an easy way to do what you're after. A long time ago (circa 1991), I wrote a tool, DBLDFMT (for 'database load format') to deal with such fixed-length, non-delimited files. It is tuned to generating the load format preferred by Informix databases (so it uses a pipe symbol by default to separate the fields, but of course you can tune that with a command line option or an environment variable). It can, however, create delimited data which you can then process more normally, probably using the LOAD DATA INFILE command.
Contact me by email (see my profile) if you want the source code for DBLDFMT. (The current version, 3.17 from 2008, does not have direct support for CSV output. It would not be hard to add it. You can, more or less, achieve the required effect, but it should be a lot easier than it is.)
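The conversion from fixed-width lines to delimited data described above can be sketched roughly as follows. This is not DBLDFMT, just a minimal illustration in Python that assumes the field layout given in the question (IRT at positions 1 - 3, IFTC at 4 - 6, FSC at 7 - 11); the names split_fixed and to_delimited and the sample records are made up for the example. The same position-based slicing is available in PHP through substr().

```python
import csv
import io

# Field layout taken from the question: (name, start, end),
# using 1-based inclusive positions.
LAYOUT = [("IRT", 1, 3), ("IFTC", 4, 6), ("FSC", 7, 11)]


def split_fixed(line, layout=LAYOUT):
    """Slice one fixed-width line into a list of trimmed field values."""
    return [line[start - 1:end].strip() for _, start, end in layout]


def to_delimited(lines, delimiter="|"):
    """Convert fixed-width lines to delimited text, one record per line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")
        if line:  # skip blank lines
            writer.writerow(split_fixed(line))
    return out.getvalue()


# Hypothetical sample records, 11 characters each.
sample = ["ABC12300042\r\n", "XYZ456   7 \r\n"]
print(to_delimited(sample))
```

The delimited output can then be loaded with an ordinary FIELDS TERMINATED BY '|' clause, or the split values can be inserted row by row from application code.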
Related
I'm building my first app right now, but I'm new to mysql databases.
I want to store users personalized settings in database, and here are two scenarios to make that happen:
First one:
COLUMNS: "uid" | "app_settings"
ROWS: 1 | 0,1,0,1,ffff00,#ff0000
Which is storing them as an array, and breaking them up by PHP explode.
Second one:
COLUMNS: "uid" | "show_menu" | "show_toolbar" | "show_email" | "menu_color" | "toolbar_color"
ROWS: 1 | 0 | 1 | 1 | #ffff00 | #ff0000
Which is storing each in a separate column.
Both ways work fine, but I want to know if it's a bad practice to use the first method.
Is the extra processing needed to break apart each value (using PHP explode) a significant load on server resources at large scale? Or is selecting multiple columns roughly the same as exploding them in PHP in terms of processing speed?
It all depends on what you intend to use this data for.
The main purpose of using separate columns in a database is to be able to index the data.
If it is only a matter of storage, you can keep your format in one field, but it is much better to use a well-known format such as JSON (json_encode in PHP before storing in the db, and json_decode after reading).
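The encode-before-store, decode-after-read round trip looks roughly like this. The sketch is in Python rather than PHP; json.dumps and json.loads play the role of json_encode and json_decode, and the setting values are the ones from the question's second scenario.

```python
import json

# Settings from the question's second scenario, as a plain dictionary.
settings = {
    "show_menu": False,
    "show_toolbar": True,
    "show_email": True,
    "menu_color": "#ffff00",
    "toolbar_color": "#ff0000",
}

# Encode to one string before storing it in a single text column.
stored = json.dumps(settings)

# Decode after reading the column back.
restored = json.loads(stored)
print(restored["menu_color"])
```

Unlike a bare comma-separated string, each value keeps its name and type, so adding or reordering settings later does not break older rows.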
Also, if you really want to save space, then assuming that things such as "show_menu" / "show_toolbar" are simply boolean flags, you can store them in one number as bit fields. For example, a field named show_rights may have a value of 6, which translates to binary 110, so [1: show_menu][1: show_toolbar][0: show_email].
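A minimal sketch of that bit-field idea, in Python, assuming the bit order from the example above (show_menu is the highest of the three bits, show_email the lowest). The helper names pack_flags and unpack_flags are made up for illustration.

```python
# Bit positions follow the example: binary 110 means
# show_menu = 1, show_toolbar = 1, show_email = 0.
SHOW_MENU = 0b100
SHOW_TOOLBAR = 0b010
SHOW_EMAIL = 0b001


def pack_flags(show_menu, show_toolbar, show_email):
    """Combine three boolean settings into one integer."""
    value = 0
    if show_menu:
        value |= SHOW_MENU
    if show_toolbar:
        value |= SHOW_TOOLBAR
    if show_email:
        value |= SHOW_EMAIL
    return value


def unpack_flags(value):
    """Split the stored integer back into named boolean settings."""
    return {
        "show_menu": bool(value & SHOW_MENU),
        "show_toolbar": bool(value & SHOW_TOOLBAR),
        "show_email": bool(value & SHOW_EMAIL),
    }


show_rights = pack_flags(True, True, False)
print(show_rights, unpack_flags(show_rights))
```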
I have a table of ban reasons:
id | short_name | description
1 virus Virus detected in file
2 spam Spammy file
3 illegal Illegal content
When I ban a file for being a virus, in my code I do this:
$file -> banVirus();
Which inserts the file id and ban reason into a table:
"INSERT INTO `banned_files` VALUES (61234, 1)"
My question is: is it a problem that I have hard-coded the value 1 to indicate a virus?
Should I use defines in my config, like define('VIRUS', 1), so I can replace 1 with a define? Or does it not matter at all?
If the id is an auto incrementing field, then it is a very big problem! Since the ids are automatically generated, it's hard to guarantee their stability; i.e. they may change.
If the id is something you manually assigned, it's not such a big problem, but it's still bad practice, because magic numbers easily lead to confusion and mistakes. Who knows what "1" means when reading your code?
So either way, you'd be better off to assign a stable, readable id to each case.
I agree with @Tenner that it also hardly makes sense to have a table for this static, unchanging data to begin with. Your banned_files table should have a column like this:
reason ENUM('virus', 'spam', 'illegal') NOT NULL
You need nothing more in your database. When outputting this for the user, you can add a readable reason with a simple array through your PHP code.
Since you have a fixed (and small) number of parameters, I'd be tempted to make the IDs an enum in your code and not even include them as a separate database table at all.
Think about something like gender -- which has two (or more) options, both fixed. (We won't be adding multiple new genders anytime soon.) I guarantee most registration systems don't have a GENDER table with two entries in it.
So, table banned_files would be something like this:
id | reason
--------+------------
12345 | 1
67890 | 2
and your code would contain enums as necessary:
enum BanReason {
Virus = 1,
Spam = 2,
Illegal = 3
}
(please convert to PHP; I'm a C# developer!)
In PHP:
$aBanReason = array(
'Virus' => 1,
'Spam' => 2,
'Illegal' => 3
);
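For comparison, here is the same idea as a real enum type, sketched in Python. The values match the ban reasons table from the question. The helper ban_insert_params is hypothetical and only builds the statement and its parameters; the %s placeholder style is an assumption about the database driver in use.

```python
from enum import IntEnum


class BanReason(IntEnum):
    """Named ban reasons; the values match the ids in the question's table."""
    VIRUS = 1
    SPAM = 2
    ILLEGAL = 3


def ban_insert_params(file_id, reason):
    """Return (sql, params) for a parameterized insert; nothing is executed."""
    # The %s placeholders are an assumption about the driver's parameter style.
    sql = "INSERT INTO banned_files VALUES (%s, %s)"
    return sql, (file_id, int(reason))


sql, params = ban_insert_params(61234, BanReason.VIRUS)
print(sql, params)
print(BanReason(2).name)  # look up the name for a stored value
```

The call site now reads BanReason.VIRUS instead of a bare 1, which is the point of the answers above.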
I'm trying to import a large csv file into Mysql. Unfortunately, the data within the file is separated both by spaces and tabs.
As a result, whenever I load the data into my table, I end up with countless empty cells (because Mysql only recognizes one field separator). Modifying the data before importing it is not an option.
Here is an example of the data:
# 1574 1 1 1
$ 1587 6 6 2
$115 1878 8 9 23
(Where the second and third value of every row are separated by a tab)
Any ideas?
If my goal were just to import the file, I'd use sed -i 's/,/ /g' *.txt to leave just one delimiter to worry about.
I like CSVs, but perhaps there's a string encased in double quotes that contains a comma or space, in which case this isn't perfect. It'd still import, just would modify those strings.
In that case, another approach I've used in production is Stat/Transfer. There's a syntax language to create a shell script to convert the file and specify multiple delimiters.
MySQL import CSV file using regex delimiter
Assuming you're using LOAD DATA INFILE try this:
load data local infile 'c:/somefile.txt' into table tabspace
columns terminated by ' '
(col1, @col23, col4, col5)
set col2 = left(@col23, instr(@col23, char(9)) - 1),
    col3 = substr(@col23, instr(@col23, char(9)) + 1);
Note that the separator is a space, so the second field read from each line contains the col2/col3 data. This is assigned to a user variable @col23, which is then split at the tab character (char(9)) and the parts assigned to col2 and col3.
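If the rows can be read by application code instead of LOAD DATA INFILE, the mixed delimiters can be handled with a single regular expression. A minimal sketch in Python, assuming every run of spaces and/or tabs separates two fields and that no field contains whitespace itself; split_fields is a made-up helper name, and the sample line follows the question's example with a tab between the second and third values.

```python
import re

# One or more spaces or tabs count as a single separator.
SEPARATOR = re.compile(r"[ \t]+")


def split_fields(line):
    """Split a line on any run of spaces and/or tabs."""
    return SEPARATOR.split(line.strip())


# Sample row from the question; the tab sits between "1878" and "8".
row = split_fields("$115 1878\t8 9 23\n")
print(row)
```

This leaves the source file untouched; the resulting lists can be passed to parameterized INSERT statements.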
If I use INT(12) vs INT(10) or INT(8) what will this actually do in terms of me using in code?
(This is a spin off of a previous question) I read through the manuals and I think I understand what they're saying, but I don't actually know how it would apply to my php/mysql coding.
Can someone provide an example of where this would actually matter?
The argument to integer types in MySQL has no effect on the storage of data or the range of values supported by each data type.
The argument only applies to display width, which may be used by applications as Jonathan Fingland mentions. It also comes up when used in combination with the ZEROFILL option:
CREATE TABLE foo (
i INT(3) ZEROFILL,
j INT(6) ZEROFILL,
k INT(11) ZEROFILL
);
INSERT INTO foo (i, j, k) VALUES (123, 456, 789);
SELECT * FROM foo;
+------+--------+-------------+
| i    | j      | k           |
+------+--------+-------------+
|  123 | 000456 | 00000000789 |
+------+--------+-------------+
See how ZEROFILL makes sure the data is zero-padded to at least the number of digits equal to the integer type argument.
Without ZEROFILL, the data is space-padded, but since spaces are often trimmed anyway, it's harder to see that difference.
What effect does it have on your PHP code? None. If you need to output columnar data, or space-pad or zero-pad values, it's more flexible to use sprintf().
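To show what padding in application code looks like, here is a small sketch in Python; its format specifications do the same job as sprintf()-style padding. The helper names zero_pad and space_pad are made up. As with ZEROFILL, the width is a minimum, so longer values are not truncated.

```python
def zero_pad(value, width):
    """Left-pad an integer with zeros to at least `width` digits."""
    return f"{value:0{width}d}"


def space_pad(value, width):
    """Right-align an integer in a field at least `width` characters wide."""
    return f"{value:{width}d}"


print(zero_pad(456, 6))      # same padding as the j column above
print(zero_pad(789, 11))     # same padding as the k column above
print(repr(space_pad(123, 6)))
print(zero_pad(123456, 3))   # wider than 3 digits, printed in full
```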
Short answer: there's no difference.
The display width is passed back in the "meta data". It's up to the application to make use of it. Normally it's just ignored. I don't think you can get it using the mysql functions, but you might be able to with mysqli using mysqli_fetch_field_direct.
You shouldn't have to change anything as it has no impact on the data returned. The information can be used by applications that want to use it.
See http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
Another extension is supported by MySQL for optionally specifying the display width of integer data types in parentheses following the base keyword for the type (for example, INT(4)). This optional display width may be used by applications to display integer values having a width less than the width specified for the column by left-padding them with spaces. (That is, this width is present in the metadata returned with result sets. Whether it is used or not is up to the application.)
The display width does not constrain the range of values that can be stored in the column, nor the number of digits that are displayed for values having a width exceeding that specified for the column. For example, a column specified as SMALLINT(3) has the usual SMALLINT range of -32768 to 32767, and values outside the range allowed by three characters are displayed using more than three characters.
(Emphasis mine)
I am exporting data from a database using PHP and converting it into a CSV. I figured it'd be useful to give the first row a title (similar to the <th> element in HTML) so the end user would understand the columns' meanings. Example:
=============
| id | name |
=============
| 0 | tim |
| 1 | tom |
=============
Which would look like this as a CSV
id, name
0, tim
1, tom
Is there a way to mark up the first row's columns, or do anything differently, so that programs that often read CSVs (for example Microsoft Excel) will treat it accordingly? I.e. can I provide a semantic hook to inform the client (possibly Excel, but not restricted to it) that this is a column header?
Nope. And to make it even more fun, there's nothing that says that the header line has to be present at all. Good times, good times...
One key thing to avoid with CSVs is using 'ID' as the first characters in the file. The lowercase 'id' or double-quoted '"ID"' is acceptable, but if Excel comes across upper-case 'ID' it tries to open the file as a SYLK file and fails.
(edit: note that single quotes in the above should be ignored)
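One way to follow the advice above is to write the header as the first row and have the writer double-quote every text field, so the file can never start with a bare ID. A minimal sketch using Python's csv module; rows_to_csv is a made-up helper name and the data is the example from the question.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Write a header row followed by data rows, quoting all text fields."""
    out = io.StringIO()
    # QUOTE_NONNUMERIC double-quotes strings and leaves numbers bare.
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


text = rows_to_csv(["ID", "name"], [[0, "tim"], [1, "tom"]])
print(text)
```

The header is still just an ordinary first row; nothing in the CSV format itself marks it as one.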
The best practice I can think of myself is to make the headings the first row only. But this is obviously common sense.