How To Import Movielens Data To Mysql

How To Import Movielens Data To Mysql - php

How can i import UTF-8 data form Movielens to MySql.
I get the data from http://grouplens.org/datasets/movielens/ and for my recommender system Thesis purpose, i just want the 100K and Tag Gnome data only.
I've been searching on google and in this forum and i don't find anything about importing these files to MySQl. Myself, currently using PhpMyAdmin for managing MySQL, so if anybody know how to easily import those files to MySQL.
I'm fine if you guys recommend me to iterate it one by one using php, but please explain to me the code.

You'll need to write some custom code to import all of their data into MySQL. Dumbest answer on Stack Overflow ever, right?
So they provide a set of flat files, each described in the README.
README
allbut.pl
mku.sh
u.data
u.genre
u.info
u.item
u.occupation
u.user
u1.base
u1.test
u2.base
u2.test
u3.base
u3.test
u4.base
u4.test
u5.base
u5.test
ua.base
ua.test
ub.base
ub.test
In a nutshell:
Make your own database and tables in MySQL.
Programatically open a file and parse each line to SQL.
Import the SQL into MySQL.
???
Profit!
Yeah, I know I still haven't really told you anything, let's do one and you can hopefully do the others.
I'll do u.genre, because I'm lazy and it is easy.
Make a new table, I'll assume you know how to make tables and such.
u.genre has two things: a genre and an id.
unknown|0
Action|1
...etc...
So your table should have two fields.
You'll use two data types: https://dev.mysql.com/doc/refman/5.7/en/data-types.html
id - unsigned TINYINT
TINYINT unsigned is 0 to 255
genre - VARCHAR(20)
VARCHAR 20 is up to 20 characters, their longest is "Documentary" so that'll give you a bit of extra room if they add a new one.
Open the file get the contents: https://secure.php.net/manual/en/function.file-get-contents.php
$filecontents = file_get_contents("u.genre");
Now let's split up the file by line: https://secure.php.net/manual/en/function.explode.php
$genres = explode("\n", $filecontents);
Now we'll loop through the $genres using foreach and explode again: https://secure.php.net/manual/en/control-structures.foreach.php
foreach ($genres as &$row) {
list($genre,$id) = explode("|",$row);
# more here later
}
Now let's just output SQL, skipping if either of the fields are empty.
if ($genre!="" && $id!=="") {
print "INSERT INTO genre (genre,id) VALUES ($genre,$id);\n";
}
Put it all together...
<?php
$filecontents = file_get_contents("u.genre");
$genres = explode("\n", $filecontents);
foreach ($genres as &$row) {
list($genre,$id) = explode("|",$row);
if ($genre!="" && $id!=="") {
$sql = "INSERT INTO genre (genre,id) VALUES ($genre,$id);\n";
print $sql;
# Insert each into your DB here.
}
}
?>
Save it and run it from the commandline or put it in a browser for no good reason.
There are too many resources out there showing how to insert data into MySQL, so I'll leave it at this. Everyone's database setup is a bit different, so writing it up for my particular setup won't help you.

Related

Which method is better to load MySQL database?

So I have a script that reads a text file, organizes it into an array then uses this code to loop through the data to input into the proper columns/rows inside MySQL server:
$size = sizeof($str)/14;
$x=0;
$a=0; $b=1; $c=2; $d=3; $e=4; $f=5; $g=6; $h=7; $i=8; $j=9; $k=10; $l=11; $m=12; $n=13;
mysql_query('TRUNCATE TABLE scores');
do {
$query = "INSERT INTO scores (serverid,resetid, rank,number,countryname,land,networth,tag,gov,gdi,protection,vacation,alive,deleted)
VALUES ('$str[$a]','$str[$b]','$str[$c]','$str[$d]','$str[$e]','$str[$f]','$str[$g]','$str[$h]',
'$str[$i]','$str[$j]','$str[$k]','$str[$l]','$str[$m]','$str[$n]')";
mysql_query($query,$conn);
$a=$a+14; $b=$b+14; $c=$c+14; $d=$d+14; $e=$e+14; $f=$f+14; $g=$g+14; $h=$h+14; $i=$i+14; $j=$j+14; $k=$k+14; $l=$l+14; $m=$m+14; $n=$n+14;
$x++;
} while ($x != $size);
mysql_close($conn);
This code figures out how large the file is loops through all 13 columns until it reaches the last row in the text file. Each time it is ran it clears the DB and loads the new data (as intended).
My question is: is this a good way of doing it? Or is there a faster more clean way to do the same thing as my code above?
Could I use the LOAD DATA LOCAL INFILE '$myFile'" . " INTO TABLE ranksfeed_temp FIELDS TERMINATED BY ',' to do the same job in a more efficient manner? What are your thoughts? I'm trying to make my code more efficient and fast.

LOAD DATA would be faster and more efficient to import a character separated file like csv. LOAD DATA is optimized for importing large files into you MySQL table, whereas you are running one query per row from your textfile, which ist incredibly slow in execution.
Please pay attention to the fact that the LOCAL option is only for files which are placed on the client side of your MySQL-Server-Client Connection. Try to load the file form the machine which acts as the MySQL directly.
Disabling possible keys on your table before inserting can give you extra speed while importing. Try it with disabled keys and without to benchmark the results.

Newbie trying to decipher merge command associated code

Someone retired in our group and I'm trying to figure out what his merge statement (and associated code) does so I can determine how to convert some (not all) values to integer before sending up. See comments below for questions. I am an absolute newbie with Microsoft SQL and took a class in php a few years ago, but don't have much experience. I've tried googling the merge command but I'm having trouble with a couple parts in it. See my questions below. (// ?)
I've looked at:
http://php.net/manual/en/pdo.query.php
http://stackoverflow.com/questions/4336573/merge-to-target-columns-using-source-rows
http://pic.dhe.ibm.com/infocenter/iseries/v7r1m0/index.jsp?topic=%2Fsqlp%2Frbafymerge.htm
I realize these are basic questions but I'm trying to figure it out and nobody around here knows.
function storeData ($form)
{
global $ms_conn, $QEDnamespace;
//I'm not sure what this is doing?? I thought this was where it was sending data up??
$qry = "MERGE INTO visEData AS Target
USING (VALUES (?,?,?,?,?,?,?,?,?,?))
AS Source (TestGUID,pqID, TestUnitID, TestUnitCountID,
ColorID, MeasurementID, ParameterValue,
Comments, EvaluatorID, EvaluationDate)
ON Target.pqID = Source.pqID
AND Target.MeasurementID=Source.MeasurementID //what is this doing?
AND Target.ColorID=Source.ColorID //what is target and source?
WHEN MATCHED THEN
UPDATE SET ParameterValue = Source.ParameterValue,
EvaluatorID = Source.EvaluatorID, //where is evaluatorID and source? My table or table we're send it to?
EvaluationDate = Source.EvaluationDate,
Comments = Source.Comments
WHEN NOT MATCHED BY TARGET THEN
INSERT (TestGUID,
pqID, TestUnitID, TestUnitCountID,
ColorID, MeasurementID,
ParameterValue, Comments,
EvaluatorID, EvaluationDate, TestIndex, TestNumber)
VALUES (Source.TestGUID, Source.pqID,
Source.TestUnitID,
Source.TestUnitCountID,
Source.ColorID, Source.MeasurementID, Source.ParameterValue,
Source.Comments, Source.EvaluatorID, Source.EvaluationDate,?,?);";
$pqID = coverSheetData($form);
$tid = getBaseTest($form['TextField6']);
$testGUID = getTestGUID($tid);
$testIndex = getTestIndex ($testGUID);
foreach ($form['visE']['parameters'] as $parameter=>$element)
{
foreach ($element as $key=>$data)
{
if ( mb_ereg_match('.+evaluation', $key) === true )
{
$testUnitData = getTestUnitData ($form, $key, $tid, $testGUID);
try
{
//I'm not sure if this is where it's sent up??
//Maybe I could add the integer conversion here??
$ms_conn->query ($qry, array(
$testGUID, $pqID,
$testUnitData[0], $testUnitData[1], $testUnitData[2],$element['parameterID'], $data, $element['comments'] $QEDnamespace->userid, date ('Y-m-d'), $testIndex, $tid));
}
catch (Zend_Db_Statement_Sqlsrv_Exception $e)
{
dataLog($e->getMessage());
returnStatus ("Failed at: " . $key);
}
}
}
}
}

This is a bit long for a comment. If you are using SQL Server, then look at the SQL Server documentation on merge. All the SQL Server documentation is on line, and it is very easy to find via Google (and perhaps even easier using Bing).
The purpose of the MERGE command is to do both inserts and updates in one step. Basically, you have a table that has new data ("source") and a table to be updated ("target"). When a record matches, then update the existing record in the target with matching record in source. When a record doesn't match, then insert it into target.
The main advantage of MERGE over two statements is not necessarily the elegant and intuitively obvious syntax. The main advantage is that all the operations occur in a single transaction, so either they all succeed or all fail as one.
The syntax actually isn't that bad. I would recommend that you set up a test database and try a few examples on your own, so you at least understand the syntax. Then, return to this code. When doing so, print out the resulting merge statement and put it in SQL Server Management Studio, where you will have nice color coded key words for the statement. Then go through it step by step, and you'll probably find that it makes lots of sense.

Editing Data in an XLS with PHP then importing into mySQL

I am trying to import an XLS file into PHP, where I can then edit the information and import it into mySQL. I have never done anything related to this, so I am having a hard time grasping how to approach it.
I have looked at a few open source projects:
PHP Excel Reader
ExcelRead
PHPExcel
None of these options perfectly fit what I want to do or maybe I just haven't gone deep enough into the documentation.
There are some things that needed to be taken into consideration. The XLS file cannot be converted into any other file format. This is being made for ease-of-access for nontechnical users. The XLS file is a report generated on another website that will have the same format (columns) every time.
For example, every XLS file with have the same amount of columns (this would be A1):
*ID |Email |First Name |Last Name |Paid |Active |State |Country|*
But, there are more columns in the XLS file than what is going to be imported into the DB.
For example, the rows that are being imported (this would be A1):
*ID |Email |First Name |Last Name |Country*
I know one of two ways to do edit the data would be A. Use something like PHPExcel to read in the data, edit it, then send it to the DB or B. Use something like PHPExcel to convert the XLS to CSV, do a raw import into a temp table, edit the data, and insert it into the old table.
I have read a lot of the PHPExcel documentation but, it doesn't have anything on importing into a database and I don't really even know where to start with editing the XLS before or after importing.
I have googled a lot of keywords and mostly found results on how to read/write/preview XLS. I am looking for advice on the best way of doing all of these things in the least and simplest steps.

See this article on using PHP-ExcelReader, in particular the short section titled "Turning the Tables".
Any solution you have will end up looking like this:
Read a row from the XLS (requires an XLS reader)
Modify the data from the row as needed for your database.
Insert modified data into the database.
You seem to have this fixation on "Editing the data". This is just PHP--you get a value from the XLS reader, modify it with PHP code, then insert into the database. There's no intermediate file, you don't modify the XLS--it's just PHP.
This is a super-simple, untested example of the inner loop of the program you need to write. This is just to illustrate the general pattern.
$colsYouWant = array(1,2,3,4,8);
$sql = 'INSERT INTO data (id, email, fname, lname, country) VALUES (?,?,?,?,?)';
$stmt = $pdo->prepare($sql);
$sheet = $excel->sheets[0];
// the excel reader seems to index by 1 instead of 0: be careful!
for ($rowindex=2; $rowindex <= $sheet['numRows']; $rowindex++) {
$xlsRow = $sheet['cells'][$rowindex];
$row = array();
foreach ($colsYouWant as $colindex) {
$row[] = $xlsRow[$colindex];
}
// now let's "edit the row"
// trim all strings
$row = array_map('trim', $row);
// convert id to an integer
$row[0] = (int) $row[0];
// capitalize first and last name
// (use mb_* functions if non-ascii--I don't know spreadsheet's charset)
$row[2] = ucfirst(strtolower($row[2]));
$row[3] = ucfirst(strtolower($row[3]));
// do whatever other normalization you want to $row
// Insert into db:
$stmt->execute($row);
}

Transfer data into optimized MySQL tables from "raw" tables in cron with php

I'm working with an MLS real estate listing provider (RETS). Every 48 hours we will be pulling data from their server in a cron job to an SQL database. I'm charged with the task of writing a php script that will be run after the data from the remote server is dumped into our "raw" tables. In these raw tables, all columns are VARCHAR(255), and we want to move the data into optimized tables. Before I send my script to the guy in charge of setting up the cron job, I wondered if there is a more efficient way to do it so I don't look foolish.
Here's what I'm doing:
There are 8 total tables, 4 raw and 4 optimized - all in the same database. The raw table column names are non descriptive, like c1,c2,c2,c4 etc. This is intentional because the data that goes in each column may change. The raw table column names are mapped to the correct optimized table columns with php, something like this:
$tables['optimized_table_name1']['raw_table'] = 'raw_table_name1';
$tables['optimized_table_name1']['data_map'] = array(
'c1' => array( // <--- "c1" is the raw table column name
'column_name' => 'id',
// I use other values for table creation,
// but they don't matter to the question.
// Just explaining why the array looks like this
//'type' => 'VARCHAR',
//'max_length' => 45,
//'primary_key' => FALSE,
// etc.
),
'c9' => array('column_name' => 'address'),
'c25' => array('column_name' => 'baths'),
'c2' => array('column_name' => 'bedrooms') //etc.
);
I'm doing the same thing for each of the 4 tables: SELECT * FROM the raw table, read the config array and create a huge SQL insert statement, TRUNCATE the optimized table, then run the INSERT query.
foreach ($tables as $table_name => $config):
$raw_table = $config['raw_table'];
$data_map = $config['data_map'];
$fields = array();
$values = array();
$count = 0;
// Get the raw data and create an array mapped to the optimized table columns.
$query = mysql_query("SELECT * FROM dbname.{$raw_table}");
while ($row = mysql_fetch_assoc($query))
{
// Reading column names from my config file on first pass
// Setting up the array, will only run once per table
if (empty($fields))
{
foreach ($row as $key => $val)
{// Produces an array with the column names
$fields[] = $data_map[$key]['column_name'];
}
}
foreach ($row as $key => $val)
{// Assigns data to an array to be imploded later
$values[$count][] = $val;
}
$count++;
}
// Create the INSERT statement string
$insert = array();
$sql = "\nINSERT INTO `{$table_name}` (`".implode('`,`', $fields)."`) VALUES\n";
foreach ($values as $key => $vals)
{
foreach ($vals as &$val)
{
// Escape the data
$val = mysql_real_escape_string($val);
}
// Using implode for simplicity, could avoid the nested foreach if I wanted to
$insert[] = "('".implode("','", $vals)."')";
}
$sql .= implode(",\n", $insert).";\n";
// TRUNCATE optimized table and run INSERT query here
endforeach;
Which produces something like this (only larger - about 15,000 records max per table, and one insert statement per table):
INSERT INTO `optimized_table_name1` (`id`,`beds`,`baths`,`town`) VALUES
('50300584','2','1','Fairfield'),
('87560584','3','2','New Haven'),
('76545584','2','1','Bristol');
Now I'll admit, I have been under the wing of an ORM for a long time and am not up on my vanilla mysql/php. This is a pretty simple task and I want to keep the code simple.
My questions:
Is the TRUNCATE/INSERT method a good way to do this?
Is there anything about my code that you can see being a problem? I know you see nested foreach loops and just shudder, but I want to keep the code as small clean as possible and avoid lots of messy string concatenation (to produce the insert query). Like I said, I also haven't used native php functions for SQL in a long time.
I feel like it really doesn't matter if the code is not optimized if it is run at 3AM every 2 days. Does it matter? Is this code going to preform OK?
Is there a better overall strategy to accomplish this task?
Do I need to be using transactions?
How can I be aware of errors that may occur in cron scripts?
Apologize if I don't use correct cron jargon, it's new to me.

Keep it simple. ORM would be swell for this task.
Answers:
Yes.
Your code is readable. At least I did not have any problems to read it.
We had a script that ran early in the morning. It was not optimized and consumed a lot of memory. After FOUR years it started to consume over 512 Mb. I've spent 2 hours to optimize it to, so now it consumes 7 Mb (pretty good optimization, huh? :) ). I personally think it is "ok" that your script is not optimized now. If this script will start failing, you'll figure what the problem is. Maybe it will exhaust memory, maybe your SQL queries will cause deadlocks... maybe you will later optimize it to READ from slave servers... I don't know, but it works fine now, that's okay.
I'd do something similar to your code. But I'd probably generate file first and load data into the server by running shell command mysql -u username --password=password < import_file.sql. So I'd have my file stored somewhere on a disk so I cal always take a look at it. And maybe even edit for one-time correction load. But you still can do it by writing your sql statement into file.
No. It is just one query. If you use InnoDB engine it is already a transaction.
First, use error_reporting(E_ALL & ~E_NOTICE). Second, use mysql_error PHP function to ensure your query performed correctly. Third, in your cronjob output errors stream into some file like so: 0 7 * * 0 /path/to/php -c /path/to/php.ini /path/to/script.php 2> /tmp/errors_file And thus you can create SECOND script runnin after first one to notify about errors in script.php by email or.... whatever way of notifying you prefer. I'd prefer to register_shutdown_functions that would check for error_file and if it is not empty, notify you and delete it afterwards.
Just my opinion, but I hope my answer helps though.

How to speed up processing a huge text file?

I have an 800mb text file with 18,990,870 lines in it (each line is a record) that I need to pick out certain records, and if there is a match write them into a database.
It is taking an age to work through them, so I wondered if there was a way to do it any quicker?
My PHP is reading a line at a time as follows:
$fp2 = fopen('download/pricing20100714/application_price','r');
if (!$fp2) {echo 'ERROR: Unable to open file.'; exit;}
while (!feof($fp2)) {
$line = stream_get_line($fp2,128,$eoldelimiter); //use 2048 if very long lines
if ($line[0] === '#') continue; //Skip lines that start with #
$field = explode ($delimiter, $line);
list($export_date, $application_id, $retail_price, $currency_code, $storefront_id ) = explode($delimiter, $line);
if ($currency_code == 'USD' and $storefront_id == '143441'){
// does application_id exist?
$application_id = mysql_real_escape_string($application_id);
$query = "SELECT * FROM jos_mt_links WHERE link_id='$application_id';";
$res = mysql_query($query);
if (mysql_num_rows($res) > 0 ) {
echo $application_id . "application id has price of " . $retail_price . "with currency of " . $currency_code. "\n";
} // end if exists in SQL
} else
{
// no, application_id doesn't exist
} // end check for currency and storefront
} // end while statement
fclose($fp2);

At a guess, the performance issue is because it issues a query for each application_id with USD and your storefront.
If space and IO aren't an issue, you might just blindly write all 19M records into a new staging DB table, add indices and then do the matching with a filter?

Don't try to invent the wheel, it's been done. Use a database to search through the file's content. You can looad that file into a staging table in your database and query your data using indexes for fast access if they add value. Most if not all databases have import/loading tools to get a file into the database relatively fast.

19M rows on DB will slow it down if DB was not designed properly. You can still use text files, if it is partitioned properly. Recreating multiple smaller files, based on certain parameters, storing in proper sorted way might work.
Anyway PHP is not the best language for file IO and processing, it is much slower than Java for this task, while plain old C would be one of the fastest for the job. PHP should be restricted to generated dynamic Web output, while core processing should be in Java/C. Ideally it should be Java/C service which generates output, and PHP using that feed to generate HTML output.

You are parsing the input line twice by doing two explodes in a row. I would start by removing the first line:
$field = explode ($delimiter, $line);
list($export_date, ...., $storefront_id ) = explode($delimiter, $line);
Also, if you are only using the query to test for a match based on your condition, don't use SELECT * use something like this:
"SELECT 1 FROM jos_mt_links WHERE link_id='$application_id';"
You could also, as Brandon Horsley suggested, buffer a set of application_id values in an array and modify your select statement to use the IN clause thereby reducing the number of queries you are performing.

Have you tried profiling the code to see where it's spending most of its time? That should always be your first step when trying to diagnose performance problems.

Preprocess with sed and/or awk ?

Databases are built and designed to cope with large amounts of data, PHP isn't. You need to re-evaluate how you are storing the data.
I would dump all the records into a database, then delete the records you don't need. Once you have done that, you can copy those records wherever you want.

As others have mentioned, the expense is likely in your database query. It might be faster to load a batch of records from the file (instead of one at a time) and perform one query to check multiple records.
For example, load 1000 records that match the USD currency and storefront at a time into an array and execute a query like:
'select link_id from jos_mt_links where link_id in (' . implode(',', application_id_array) . ')'
This will return a list of those records that are in the database. Alternatively, you could change the sql to be not in to get a list of those records that are not in the database.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.