How to parse tab-delimited file into mysql database

I was tasked with parsing a tab-delimited file and inserting the values into the database. Find a selection of the tab-delimited file below.
"030-36-2" 0 0 14 "P"
"030-38-2" 0 0 14 "S"
"030-40-2" 0 0 14 "S"
"031-2-2" 1 0 "O"
"031-3-2" 4 0 "O"
"032-36-26" 0 0 14 "S"
"032-38-26" 0 0 14 "S"
"032-40-26" 0 0 14 "S"
"070-140-161" 0 0 14 "S"
"070-140-162" 2 0 "D"
"070-83-161" 0 0 14 "S"
I'm using fgetcsv with my delimiter set to a tab (9) but upon executing the code I am only getting a small percentage of total values inserted into the database.
This is my code:
if(($handle = fopen("mytabdelimitedfile.txt","r"))!==FALSE){
fgetcsv($handle, 0,chr(9));
while(($data = fgetcsv($handle,1000,chr(9)))!==FALSE){
print_r($data[0]);
$result = mysql_query("INSERT INTO $table (col1,col2,col3,col4,col5) VALUES('$data[0]','$data[1]','$data[2]','$data[3]','$data[4]')");
}
}
The first 4 records are not inserted but it starts with "031-3-2", then skips down to "070-140-162". I fear the result may have to do with some values missing but I cannot seem to discern a pattern.
Does anyone have any insight regarding this? Does the issue have to do with some values missing? Is there any workaround? (I don't have any control over source data)
Also another note: when I use Excel => import data from text => tab-delimited, the results are perfect. But of course I cannot use Excel as the data is updated on an hourly basis. Please, any point in the right direction would be GREATLY appreciated.

Like VMai said, use LOAD DATA INFILE:
LOAD DATA LOCAL INFILE 'mytabdelimitedfile.txt'
INTO TABLE table_name
FIELDS
TERMINATED BY '\t'
OPTIONALLY ENCLOSED BY '"'
(col1,col2,col3,col4,col5)
Also, I really hope those aren't your actual column names.
And don't rely on Excel as an example for anything. It hasn't handled CSV in a sane manner since at least 2007.
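If the import has to stay in PHP rather than LOAD DATA, a minimal sketch along these lines may help spot the irregular rows. It assumes fields are separated by a single tab and optionally enclosed in double quotes. parseTabLine() and its five-column default are hypothetical names chosen for illustration, and the sample above does not show whether a missing value appears in the real file as an empty field or as a dropped column.

```php
<?php
// Hypothetical helper: split one tab-delimited line into exactly $columns fields.
function parseTabLine(string $line, int $columns = 5): array
{
    // Strip the line ending, then split on tabs, honouring optional double quotes.
    $fields = str_getcsv(rtrim($line, "\r\n"), "\t", '"', '\\');
    // Pad short rows with null so later code can rely on a fixed width.
    return array_pad($fields, $columns, null);
}

// An empty field between two tabs is preserved as an empty string.
$withEmpty = parseTabLine("\"031-2-2\"\t1\t0\t\t\"O\"\n");

// A row that simply has fewer columns is padded at the end.
$short = parseTabLine("\"031-2-2\"\t1\t0\t\"O\"\n");
```

Short rows are padded with null at the end, so every row has the same number of elements. Logging the lines whose original field count was not five would show which records are irregular before anything is inserted.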

Related

SQLite3::exec(): unrecognized token: "'Date"

I get the error message SQLite3::exec(): unrecognized token: "'Date" when I insert a command into a SQLite database.
If I have the SQL commands echoed and execute them via the console, this also works. The data for this comes from a dbase database.
If I enter a string for fieldname the insert commands will work.
$field=unpack( "a11fieldname/A1fieldtype/Voffset/Cfieldlen/Cfielddec", substr($buf,0,18));
$db = new SQLite3('databases/test.db');
$sqlCode .= "INSERT INTO HEADER (name) VALUES ('".$field['fieldname']."');";
$db-> exec($sqlCode);

It would be much easier, and safer against SQL injection, to use a prepared statement.
"If I enter a string for fieldname the insert commands will work"
Make sure that fieldname really is a text value.
Using a prepared statement with the SQLite3 class:
$sqlCode = "INSERT INTO header VALUES (:name)";
$query = $db->prepare($sqlCode);
$query->bindValue(':name', $field['fieldname'], SQLITE3_TEXT);
$result = $query->execute();
EDIT: The error says unrecognized token "'Date". So if fieldname is a date, you might need to change SQLITE3_TEXT to SQLITE3_BLOB or SQLITE3_INTEGER, although that is usually not needed.
But you need to insert a date into a date column, not into the name column.
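One possibility that may be worth ruling out (an assumption, not something confirmed in this thread): dBase field names are padded with NUL bytes, and PHP's unpack() keeps that padding for the "a" format code, so the value concatenated into the SQL string may contain invisible NUL characters. The sketch below uses a made-up 12-byte buffer to show the difference between "a" and "Z", which stops at the first NUL.

```php
<?php
// Made-up sample: an 11-byte NUL-padded field name followed by a 1-byte type.
$buf = "Date" . str_repeat("\0", 7) . "D";

// 'a' keeps the NUL padding (PHP 5.5 and later).
$raw = unpack('a11fieldname/A1fieldtype', $buf);

// 'Z' returns the string up to the first NUL byte.
$clean = unpack('Z11fieldname/A1fieldtype', $buf);

// Length of the raw name, including the padding bytes.
$rawLength = strlen($raw['fieldname']);
```

If the raw name turns out to be longer than the visible text, switching the format code to "Z11" or binding the value through a prepared statement avoids putting NUL bytes into the SQL text.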
EDIT 2 :
It's a bit complicated.
See here for a full description of the dbf file format.
So it would be best if you could use a library to read and write the dbf files.
If you really need to do this yourself, here are the most important parts:
Dbf is a binary file format, so you have to read and write it as binary. For example the number of records is stored in a 32 bit integer, which can contain zero bytes.
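You can't treat that binary data as text. C-style string functions stop at the first null byte, which that 32-bit integer may contain, and will return the wrong length. PHP's own strlen() counts bytes and is binary-safe, but any function that trims, re-encodes, or otherwise interprets the data as text will corrupt it.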
If you split the file (the records), you'll have to adjust the record count in the header.
When splitting the records keep in mind that each record is preceded by an extra byte, a space 0x20 if the record is not deleted, an asterisk 0x2A if the record is deleted. (for example, if you have 4 fields of 10 bytes, the length of each record will be 41) - that value is also available in the header: bytes 10-11 - 16-bit number - Number of bytes in the record. (Least significant byte first)
The file could end with the end-of-file marker 0x1A, so you'll have to check for that as well.
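As a rough illustration of the header fields mentioned above, here is a sketch that decodes the first 12 bytes of a DBF header with unpack(). It assumes the commonly documented dBase III layout (version byte, three date bytes, 32-bit record count, 16-bit header length, 16-bit record length, all little-endian). parseDbfHeader() and the sample bytes are hypothetical, and this is not a substitute for a DBF library.

```php
<?php
// Hypothetical helper: decode the first 12 bytes of a DBF header.
function parseDbfHeader(string $bytes): array
{
    if (strlen($bytes) < 12) {
        throw new InvalidArgumentException('Need at least 12 header bytes');
    }
    // C = unsigned byte, V = 32-bit little-endian, v = 16-bit little-endian.
    return unpack(
        'Cversion/Cyear/Cmonth/Cday/VrecordCount/vheaderLength/vrecordLength',
        substr($bytes, 0, 12)
    );
}

// Made-up header: 3 records, 4 fields of 10 bytes each (record length 41).
$sample = pack('CCCCVvv', 0x03, 112, 10, 10, 3, 161, 41);
$header = parseDbfHeader($sample);
```

With a real file, the bytes would come from fread() on a handle opened in binary mode ('rb'), and recordCount is the value that has to be rewritten when the records are split.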
See also the question: binary safe write on file with php to create a DBF file
Final word: you need a DBF library.
Data File Header Structure

PHP - Optimising preg_match of thousands of patterns

So I wrote a script to extract data from raw genome files. Here's what the raw genome file looks like:
# rsid chromosome position genotype
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
rs11240777 1 798959 AG
rs6681049 1 800007 CC
rs4970383 1 838555 AC
rs4475691 1 846808 CT
rs7537756 1 854250 AG
rs13302982 1 861808 GG
rs1110052 1 873558 TT
rs2272756 1 882033 GG
rs3748597 1 888659 CT
rs13303106 1 891945 AA
rs28415373 1 893981 CC
rs13303010 1 894573 GG
rs6696281 1 903104 CT
rs28391282 1 904165 GG
rs2340592 1 910935 GG
The raw text file has hundreds of thousands of these rows, but I only need specific ones, I need about 10,000 of them. I have a list of rsids. I just need the genotype from each line. So I loop through the rsid list and use preg_match to find the line I need:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
$searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
if (preg_match($searchPattern,$rawData,$matchedGene)) {
$genotype = $matchedGene[3];
// Do something with genotype
}
}
NOTE: I stripped out a lot of code to just show the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Here's the code with the database work included:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
$query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
$stmt = $ngdb->prepare($query);
$stmt->bind_param("iissi", $file_id,$snp_id,$genotype,$reputation,$zygosity);
$ngdb->query("START TRANSACTION");
while ($row = $rsids->fetch_assoc()) {
$searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
if (preg_match($searchPattern,$rawData,$matchedGene)) {
$genotype = $matchedGene[3];
$stmt->execute();
$insert++;
}
}
$stmt->close();
$ngdb->query("COMMIT");
$rsids->free();
$ngdb->close();
}
So unfortunately my script runs very slowly. Running 50 iterations takes 17 seconds. So you can imagine how long running 18,000 iterations is gonna take. I'm looking into ways to optimise this.
Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines, and use preg_grep(), would that be any faster?
Something I tried is combining all 18,000 rsids into a single expression, i.e. (rs123|rs124|rs125), like this:
$rsids = get_rsids();
$rsid_group = implode('|',$rsids);
$pattern = "~({$rsid_group})\t(.*?)\t(.*?)\t(.*?)\n~i";
preg_match($pattern,$rawData,$matches);
But unfortunately it gave me an error message about exceeding the PCRE expression limit. The needle was way too big. Another thing I tried was adding the S modifier to the expression. I read that this analyses the pattern in order to increase performance. It didn't speed things up at all. Maybe my pattern isn't compatible with it?
So then the second thing I need to try and optimise is the database inserts. I added a transaction hoping that would speed things up but it didn't speed it up at all. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once, rather than inserting them individually.
Then another idea is something I read about, using LOAD DATA INFILE to load rows from a text file. In that case, I just need to generate a text file first. Would it work out faster to generate a text file in this case I wonder.
EDIT: It seems like what's taking up the most time is the regular expressions. Running that part of the program by itself takes a really long time: 10 rows take 4 seconds.

This is slow because you're searching a vast array of data over and over again.
It looks like you have a text file, not a dbms table, containing lines like these:
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.
I think you want exact matches for the rsxxxx values, not prefix or partial matches.
I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.
So, here's what you do, in pseudocode. Don't load the whole raw data file into memory; process it line by line instead.
Read your rows of rsid values from the dbms, just once, and store them in an associative array.
for each file of raw data...
  for each line of data in the file...
    split the line of data to obtain the rsid. The file is tab-delimited, so in php, $array = explode("\t", $line, 2); will yield your rsid in $array[0], and do it fast.
    Look in your array of rsid values for this value. In php, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
    If the key does exist, you have a match.
      extract the last column from the raw text line ('GC' or whatever)
      write it to your dbms.
Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits php's associative array feature to do the matching. All that will be much faster than your method.
To speed the process of inserting tens of thousands of rows into a table, read this: "Optimizing InnoDB Insert Queries".
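One way to group the inserts, as the question suggests, is a multi-row INSERT with placeholders. The sketch below only builds the SQL text. buildBatchInsert() is a hypothetical helper, the table and column names are taken from the question, and the batch size that works best is something to measure rather than assume.

```php
<?php
// Hypothetical helper: build "INSERT ... VALUES (?,?,?),(?,?,?),..." for $rowCount rows.
function buildBatchInsert(string $table, array $columns, int $rowCount): string
{
    // One "(?,?,...)" group per row, with one placeholder per column.
    $row = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';
    return sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(',', $columns),
        implode(',', array_fill(0, $rowCount, $row))
    );
}

$sql = buildBatchInsert('wp_genomics_results', ['file_id', 'snp_id', 'genotype'], 2);
```

The resulting statement can be prepared once per batch size and executed with the flattened row values bound in order. The table and column names must come from trusted code, never from user input, because they are interpolated directly.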

+1 to @Ollie Jones' answer. He posted while I was working on my answer. So here's some code to get you started.
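// Build a lookup table keyed by rsid, e.g. 'rs4477212' => true.
$rsidHash = array();
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
    $key = 'rs' . $row['rsid'];
    $rsidHash[$key] = true;
}
// Stream the raw file one tab-delimited line at a time.
$rawDataFd = fopen('genome_file.txt', 'r');
while (($rawData = fgetcsv($rawDataFd, 80, "\t")) !== false) {
    // Skip comment and short lines, then look the rsid up in the hash.
    if (isset($rawData[3]) && array_key_exists($rawData[0], $rsidHash)) {
        $genotype = $rawData[3];
        // do something with genotype
    }
}
fclose($rawDataFd);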

I wanted to give the LOAD DATA INFILE approach a try to see how well it works, so I came up with what I thought was a nice, elegant approach. Here's the code:
$file = 'C:/wamp/www/nutri/wp-content/plugins/genomics/genome/test';
$data_query = "
LOAD DATA LOCAL INFILE '$file'
INTO TABLE wp_genomics_results
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 18 ROWS
(@rsid,@chromosome,@locus,@genotype)
SET file_id = '$file_id',
snp_id = (SELECT id FROM wp_pods_snp WHERE rsid = SUBSTR(@rsid,2)),
genotype = @genotype
";
$ngdb->query($data_query);
I put a foreign key constraint on the snp_id column (that's the ID for my table of RSIDs) so that it only enters genotypes for rsids that I need. Unfortunately this foreign key constraint caused some kind of error which locked the tables. Ah well. It might not have been a good approach anyhow, since there are on average 200,000 rows in each of these genome files. I'll go with Ollie Jones' approach since that seems to be the most effective and viable approach I've come across.

MySQL Query - replace or remove 0 (zero) in a column, but not remove the digit zero in a word or number?

I have a table where one column has 0 for a value. The problem is that my page that fetches this data shows the 0.
I'd like to remove the 0 value, but only if it's a single 0, and not remove the 0 if it's part of a word or a number like 10, 1990, 2006, etc.
Can you guys offer a SQL query that would do that?
I was thinking of using the following query, but I think it will remove any 0 within a word or number.
update phpbb_tree set member_born = replace(member_born, '0', '')
Hopefully you guys can suggest another method? Thanks in advance...

After the discussion in the comments, you said that you don't want to show 0 values when fetching the data. The solution is simple and looks like this.
Let's suppose that you have run your query and fetched the data into a $row variable.
if($row['born_year'] == '0'){
$born_year = "";
} else {
$born_year = $row['born_year'];
}
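One caveat with the loose == comparison: PHP compares numeric strings numerically, so values such as '00' or '0.0' would also be blanked. If only a lone 0 should be hidden, a strict string comparison is one option. displayBorn() below is a hypothetical helper sketched under that assumption.

```php
<?php
// Hypothetical helper: hide the value only when it is exactly the single character "0".
function displayBorn($value): string
{
    $text = trim((string) $value);
    // Strict comparison, so '10', '1990', and '00' are left untouched.
    return $text === '0' ? '' : $text;
}

$shown = array_map('displayBorn', ['0', 0, '10', '1990', '2006']);
```

Here the integer 0 is also hidden because it is cast to the string "0" first, which matters if the column is numeric rather than text.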
Another solution is to filter those rows out in the query from the beginning:
select * from table where born_year !='0';
Update:
If you want to remove all the 0 values from your table, you can do it this way. Consider making a backup first.
update table set column='' where column='0';
If the column is an int, change column='0' to column=0, and set it to NULL rather than '', since an integer column cannot store an empty string.

Leading zeros in an extracted Excel file from database using PEAR

In the admin section of my website there is a button that extracts an Excel file with data from my database. In reality, it is an Excel file that is created upon clicking the button using PEAR. I use an SQL query to get the information necessary from my SQL Server 2008 database.
One of the columns, named 'number1', contains a number ranging from 1 to 9999. I have been looking for a way to put zeros in front of the number when it doesn't already have 4 digits, but I've had no luck so far. For example, if the number in the database is 12, I would like it to show as 0012 in my Excel sheet.
Currently the code used is the following:
if ($j == 15){
$worksheet->write($variable1, $j , $variable2[$i][$j], $text222);
}
where $variable1 = 0; $variable2 = ("my sql query")
Your help is appreciated.
EDIT: ANSWER(S)
$number = str_pad($value, 4, '0', STR_PAD_LEFT);
OR
JAGAnalyst's answer, the one I actually used in my code.
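A quick check of the str_pad() call above against a few sample values. padNumber() is a hypothetical wrapper and the inputs are made up for illustration.

```php
<?php
// Hypothetical wrapper around the str_pad() call from the edit above.
function padNumber(int $value, int $width = 4): string
{
    // Left-pad with zeros; values already $width digits long are returned unchanged.
    return str_pad((string) $value, $width, '0', STR_PAD_LEFT);
}

$examples = array_map('padNumber', [7, 12, 345, 9999]);
```

The result is a string, so it has to be written to the worksheet as text for the leading zeros to survive.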

Alternatively, you can also edit your SQL query to return a four-character text string instead of a number, including the replacement function, i.e.
, CASE
WHEN Len(number1) = 1 THEN '000' + CAST(number1 AS VARCHAR(4))
WHEN Len(number1) = 2 THEN '00' + CAST(number1 AS VARCHAR(4))
WHEN Len(number1) = 3 THEN '0' + CAST(number1 AS VARCHAR(4))
ELSE CAST(number1 AS VARCHAR(4))
END AS NEW
This will actually alter the value that is extracted, rather than simply changing the format.

php odbc double executions bug

I am using Windows Server with MSSQL Server 2008, and an ODBC connection to WAMP on a separate Windows Server.
I use PHP to execute database queries, and when I run INSERT queries I get two identical rows (with the unique identifier as the only thing separating them).
I have checked multiple times that it's not double posting, so it must be executing the code twice, or sending it twice. The problem is not on the PHP side, that's for sure.
For example, I run this code:
$sql = "INSERT INTO mytable(aboutPerson, fromPerson, comment)
VALUES(1,5, 'hello')";
//.. create connection to db
odbc_exec($connection, $sql);
//.. close connection to db
Out of that I get two identical rows in the table. This occurs on several tables.
Why does this occur and how can I stop it from happening? Is it a known bug in connecting WAMP to Microsoft SQL Server with ODBC?
BTW, I don't have the option of switching systems.
Any help would be very appreciated, thanks!
Edit-
An actual example by request.
$SQL = "INSERT INTO billboard (toPersonId, toGroupClass, toOfficeId, fromOfficeId,
fromType, fromId, header, msg, aboutPersonId, created, startMessageId,
fromPersonId, status, emailNote, smsNote)
VALUES (1, 0, 1, 1,
2, 1, 'cccccccccccccc', 'cccccccccccccccc', '', GETDATE(), 0,
1, 0, 0, 0)";
$resource = odbc_connect(database server url, username, password);
$statement = odbc_exec($resource, $SQL);
return odbc_free_result($statement);
I execute this code and I should get only 1 single row inserted into the table. Instead I get this, two identical rows (except for the first column, which is the primary key and is identity)
111 1 0 1 1 2 1 cccccccccccccc cccccccccccccccc 0 2012-10-10 12:56:27.773 0 1 0 0 0
110 1 0 1 1 2 1 cccccccccccccc cccccccccccccccc 0 2012-10-10 12:56:27.773 0 1 0 0 0
