parsing a very big file into mysql - php

I have a task where I need to parse an extremely big file and write the results into a MySQL database. "Extremely big" means we are talking about 1.4GB of sort-of-CSV data, totalling approx. 10 million lines of text.
The question is not "HOW" to do it, but how to do it FAST. My first approach was to just do it in PHP without any speed optimization and let it run for a few days until it's done. Unfortunately, it has been running for 48 hours straight now and has processed only 2% of the total file, so that's not an option.
the file format is as follows:
A:1,2
where the number of comma-separated numbers following the ":" can be 0-1000. The example dataset has to go into a table as follows:
| A | 1 |
| A | 2 |
so right now, i did it like this:
$fh = fopen("file.txt", "r");
$line = ""; // buffer for the data
$i = 0; // line counter
$start = time(); // benchmark
while($line = fgets($fh))
{
    $i++;
    echo "line " . $i . ": ";
    //echo $i . ": " . $line . "<br>\n";
    $line = explode(":", $line);
    if(count($line) != 2 || !is_numeric(trim($line[0])))
    {
        echo "error: source id [" . trim($line[0]) . "]<br>\n";
        continue;
    }
    $targets = explode(",", $line[1]);
    echo "node " . $line[0] . " has " . count($targets) . " links<br>\n";
    // insert links in link table
    foreach($targets as $target)
    {
        if(!is_numeric(trim($target)))
        {
            echo "line " . $i . " has malformed target [" . trim($target) . "]<br>\n";
            continue;
        }
        $sql = "INSERT INTO link (source_id, target_id) VALUES ('" . trim($line[0]) . "', '" . trim($target) . "')";
        mysql_query($sql) or die("insert failed for SQL: ". mysql_error());
    }
}
echo "<br>\n--<br>\n<br>\nseconds wasted: " . (time() - $start);
this is obviously not optimized for speed in ANY way. any hints for a fresh start? should i switch to another language?

The first optimization would be to insert within a transaction: commit every 100 or 1000 lines and begin a new transaction. Obviously you'd have to use a storage engine that supports transactions.
Then observe the CPU usage with the top command. If you have multiple cores and the mysql process does not do much while the PHP process does most of the work, rewrite the script to accept a parameter that skips n lines from the beginning and imports only 10000 lines or so. Then start multiple instances of the script, each with a different starting point.
The third solution would be to convert the file into a CSV with PHP (no INSERT at all, just writing to a file) and then use LOAD DATA INFILE, as m4t1t0 suggested.
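The CSV conversion step could look roughly like the sketch below. It assumes the real data uses numeric ids in the form `source:target,target,...`; the helper names `expandLine()` and `convertFile()` and the output file name are made up for illustration and are not part of the original script.

```php
<?php
// Sketch only: expandLine() and convertFile() are hypothetical helpers.

// Expands one "source:t1,t2,..." line into CSV rows of the form "source,target".
function expandLine(string $line): array
{
    $parts = explode(':', trim($line), 2);
    if (count($parts) !== 2) {
        return []; // no ":" separator: skip the line
    }
    $source = trim($parts[0]);
    if (!is_numeric($source)) {
        return []; // malformed source id: skip the line
    }
    $rows = [];
    foreach (explode(',', $parts[1]) as $target) {
        $target = trim($target);
        if (is_numeric($target)) { // silently drops empty or malformed targets
            $rows[] = $source . ',' . $target;
        }
    }
    return $rows;
}

// Streams the input file line by line and writes one CSV row per link.
function convertFile(string $inPath, string $outPath): void
{
    $in  = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    while (($line = fgets($in)) !== false) {
        foreach (expandLine($line) as $row) {
            fwrite($out, $row . "\n");
        }
    }
    fclose($in);
    fclose($out);
}
```

After something like `convertFile('file.txt', 'links.csv')`, the result could be loaded with a statement along the lines of `LOAD DATA INFILE 'links.csv' INTO TABLE link FIELDS TERMINATED BY ',' (source_id, target_id);`. The file has to be readable by the MySQL server for this to work (or use the LOCAL variant from the client side).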

As promised, here is the solution I went for. I benchmarked it, and it turned out to be 40 times (!) faster than the old one :)
Sure, there's still much room for optimization, but it's fast enough for me right now :)
$db = mysqli_connect(/*...*/) or die("could not connect to database");
$fh = fopen("data", "r");
$line = ""; // buffer for the data
$i = 0; // line counter
$start = time(); // benchmark timer
$node_ids = array(); // all (source) node ids
mysqli_autocommit($db, false); // group the inserts into explicit transactions
while($line = fgets($fh))
{
    $i++;
    echo "line " . $i . ": ";
    $line = explode(":", $line);
    $line[0] = trim($line[0]);
    if(count($line) != 2 || !is_numeric($line[0]))
    {
        echo "error: source node id [" . $line[0] . "] - skipping...\n";
        continue;
    }
    else
    {
        $node_ids[] = $line[0];
    }
    $targets = explode(",", $line[1]);
    echo "node " . $line[0] . " has " . count($targets) . " links\n";
    // insert links in link table
    foreach($targets as $target)
    {
        $target = trim($target); // the last target still carries the line break
        if(!is_numeric($target))
        {
            echo "line " . $i . " has malformed target [" . $target . "]\n";
            continue;
        }
        $sql = "INSERT INTO link (source_id, target_id) VALUES ('" . $line[0] . "', '" . $target . "')";
        mysqli_query($db, $sql) or die("insert failed for SQL: ". mysqli_error($db));
    }
    if($i%1000 == 0)
    {
        $node_ids = array_unique($node_ids);
        foreach($node_ids as $node)
        {
            $sql = "INSERT INTO node (node_id) VALUES ('" . $node . "')";
            mysqli_query($db, $sql);
        }
        $node_ids = array();
        mysqli_commit($db);
        echo "committed to database\n\n";
    }
}
// flush the nodes and links of the last, partial batch
foreach(array_unique($node_ids) as $node)
{
    mysqli_query($db, "INSERT INTO node (node_id) VALUES ('" . $node . "')");
}
mysqli_commit($db);
echo "<br>\n--<br>\n<br>\nseconds wasted: " . (time() - $start);

I find your description rather confusing - and it doesn't match up with the code you've provided.
if(count($line) != 2 || !is_numeric(trim($line[0])))
The trim here is mostly redundant: is_numeric already ignores leading whitespace (and, since PHP 8.0, trailing whitespace as well). But you've said elsewhere that the start of the line is a letter, so this check will always fail.
If you want to speed it up, switch to stream processing of the input rather than message processing (PHP arrays can be very slow), or use a different language, and aggregate the INSERT statements into multi-row inserts.
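Aggregating the inserts into one statement per batch might look like the sketch below. `buildLinkInsert()` is a hypothetical helper, and the cast to int assumes the ids are integers that fit into a PHP int; neither is confirmed by the question.

```php
<?php
// Sketch only: buildLinkInsert() is a hypothetical helper.
// It builds one multi-row INSERT for a batch of [source, target] pairs.
// Ids are cast to int (an assumption about the data), so no quoting is needed.
// The caller should skip empty batches, which would produce invalid SQL.
function buildLinkInsert(array $pairs): string
{
    $values = [];
    foreach ($pairs as [$source, $target]) {
        $values[] = '(' . (int) $source . ', ' . (int) $target . ')';
    }
    return 'INSERT INTO link (source_id, target_id) VALUES ' . implode(', ', $values);
}
```

The reading loop would collect pairs into an array and, every few thousand pairs, send `buildLinkInsert($batch)` with a single mysqli_query call and empty the array. The batch size is bounded by the server's max_allowed_packet setting, so it is worth keeping each statement well below that limit.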

I would first just use the script to create a SQL file. Then lock the table (see http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html) by placing the appropriate commands at the start/end of the SQL file (you could get your script to do this).
Then just use the command-line tool to load the SQL into the database (preferably on the machine where the database resides).

Related

PHP MySQL UPDATE query with array and IN statement

I have this query which works successfully in one of my PHP scripts:
$sql = "UPDATE Units SET move_end = $currentTime, map_ID = $mapID, attacking = $attackStartTime, unit_ID_affected = $enemy, updated = now() WHERE unit_ID IN ($attackingUnits);";
$attackingUnits is an imploded array of anywhere between 1 - 100 integers.
What I'd like to do is also add arrays with different values for $currentTime and $mapID which correspond with the values for $attackingUnits. Something like this:
$sql = "UPDATE Units SET move_end = " . $attackingUnits['move_end'] . ", map_ID = " . $attackingUnits['map_ID'] . ", attacking = $attackStartTime, unit_ID_affected = $enemy, updated = now() WHERE unit_ID IN ($attackingUnits);";
Obviously that won't work the way I want it to because $attackingUnits['move_end'] and $attackingUnits['map_ID'] are just single values, not an array, but I'm stumped as to how I can write this query. I know I can run one query for each element of $attackingUnits, but this is precisely what I'm trying to avoid, as I'd like to be able to use one UPDATE for as many elements as required.
How would I write this query?
The key parts of the PHP script are:
$attackStartTime = time(); // the time the units started attacking the enemy (i.e. the current time)
// create a proper multi-dimensional array as the client only sends a string of comma-delimited unitID values
$data = array();
// add the enemy unit ID to the start of the selectedUnits CSV and call it allUnits. we then run the same query for all units in the selectedUnits array. this avoids two separate queries for pretty much the same columns
$allUnits = $enemy . "," . $selectedUnits;
// get the current enemy and unit data from the database
$sql = "SELECT user_ID, unit_ID, type, map_ID, moving, move_end, destination, attacking, unit_ID_affected, current_health FROM Units WHERE unit_ID IN ($allUnits);";
$result = mysqli_query($conn, $sql);
// convert the CSV strings to arrays for processing in this script
$selectedUnits = explode(',', $selectedUnits);
$allUnits = explode(',', $allUnits);
while ($row = mysqli_fetch_assoc($result)) {
$data[] = $row;
}
$result -> close();
$increment = 0; // set an increment value outside of the foreach loop so that we can use the pointer value at each loop
// check each selected unit to see if it can validly attack the enemy unit, otherwise remove them from selected units and send an error back for that specific unit
foreach ($data as &$unit) {
// do a whole bunch of checking stuff here
}
// convert the attacking units (i.e. the unit ids from selected units which have passed the attacking tests) to a CSV string for processing on the database
$attackingUnits = implode(',', $selectedUnits);
// update each attacking unit with the start time of the attack and the unit id we are attacking, as well as any change in movement data
// HERE IS MY PROBLEMATIC QUERY
$sql = "UPDATE Units SET moving = " . $unit['moving'] . ", move_end = " . $unit['move_end'] . ", map_ID = " . $unit['map_ID'] . ", attacking = $attackStartTime, unit_ID_affected = $enemy, updated = now() WHERE unit_ID IN ($attackingUnits);";
$result = mysqli_query($conn, $sql);
// send back the full data array - should only be used for testing and not in production!
echo json_encode($data);
mysqli_close($conn);
OK, after some more web research I found a link that helped me out:
https://stuporglue.org/update-multiple-rows-at-once-with-different-values-in-mysql/
I updated his code to mysqli and after a lot of testing, it works! I can now successfully UPDATE hundreds of rows with one query, rather than sending hundreds of small updates via PHP. Here are the relevant parts of my code for anyone who's interested:
$updateValues = array(); // the array we are going to build out of all the unit values that need to be updated
// build up the query string
$updateValues[$unit["unit_ID"]] = array(
"moving" => $unit["new_start_time"],
"move_end" => $unit["new_end_time"],
"map_ID" => "`destination`",
"destination" => $unit["new_destination"],
"attacking" => 0,
"unit_ID_affected" => 0,
"updated" => "now()"
);
// start of the query
$updateQuery = "UPDATE Units SET ";
// columns we will be updating
$columns = array("moving" => "`moving` = CASE ",
"move_end" => "`move_end` = CASE ",
"map_ID" => "`map_ID` = CASE ",
"destination" => "`destination` = CASE ",
"attacking" => "`attacking` = CASE ",
"unit_ID_affected" => "`unit_ID_affected` = CASE ",
"updated" => "`updated` = CASE ");
// build up each column's CASE statement
foreach ($updateValues as $id => $values) {
    $safeId = mysqli_real_escape_string($conn, $id);
    // every column gets the same "WHEN unit_ID = <id> THEN <value>" clause
    foreach (array_keys($columns) as $columnName) {
        $columns[$columnName] .= "WHEN `unit_ID` = " . $safeId . " THEN " . mysqli_real_escape_string($conn, $values[$columnName]) . " ";
    }
}
// add a default case, here we are going to use whatever value was already in the field
foreach ($columns as $columnName => $queryPart) {
$columns[$columnName] .= " ELSE `$columnName` END ";
}
// build the WHERE part. since we keyed our updateValues off the database keys, this is pretty easy
// $where = " WHERE `unit_ID` = '" . implode("' OR `unit_ID` = '", array_keys($updateValues)) . "'";
$where = " WHERE unit_ID IN ($unitIDs);";
// join the statements with commas, then run the query
$updateQuery .= implode(', ', $columns) . $where;
$result = mysqli_query($conn, $updateQuery);
This will significantly reduce the load on my database as these events can happen every second (think of hundreds of players at once, attacking hundreds of enemy units with hundreds of their own units). I hope this helps someone out.

MySQL won't INSERT last item from for loop into database [duplicate]

I'm looping through some data to display on my page, and then insert each loop to a row in a database (personal learning exercise for PHP and MySQL).
The for loop runs 5 times (for example, sometimes it may loop more/less), and I am able to successfully insert the data for the first 4 loops, but am having difficulty figuring out why the last loop won't insert into the database.
All 5 loop iterations display on my page, I'm not quite sure why the last loop won't insert into the database.
Here is my for loop that includes the MySQL code:
// in this example: $artworksIterations = 1 and count($artworksTitle[$x]) = 5
for ($x = 0; $x < $artworksIterations; $x++) {
    for ($y = 0; $y < count($artworksTitle[$x]); $y++) {
        $savedartworksTitle = $artworksTitle[$x][$y];
        echo "TITLE: " . $savedartworksTitle . "<br>";
        $savedartworksArtist = $artworksArtist[$x][$y];
        echo "ARTIST: " . $savedartworksArtist . "<br>";
        $savedartworksYear = $artworksYear[$x][$y];
        echo "YEAR: " . $savedartworksYear . "<br>";
        $savedartworksMedium = $artworksMedium[$x][$y];
        echo "MEDIUM: " . $savedartworksMedium . "<br>";
        $implodeGene = implode(", ", $artworksGene[$x][$y]);
        echo "GENRES: " . $implodeGene;
        $savedartworksDisplay = $artworksDisplay[$x][$y];
        echo "<br><img src='" . $savedartworksDisplay . "'><br>";
        echo "<br>----<br>";
        $sql = "INSERT INTO Artworks (title, artist, year, medium, display, genres) VALUES ('$savedartworksTitle', '$savedartworksArtist', '$savedartworksYear', '$savedartworksMedium', '$savedartworksDisplay', '$implodeGene');";
        mysqli_query($conn, $sql);
    } // end of y
} // end of x
Any help would be deeply appreciated. Thank you :)
Thank you to everyone who helped dissect my issue!
It turns out that, by mere coincidence, the data from the last loop of each iteration contained an apostrophe ('), which was why those rows weren't inserted into the database (syntax error).
Thank you to Jacques for pointing out that I should be checking for errors; that is what informed me of the syntax error (also, thank you to jeff, who clued in on my apostrophe issue).
To fix this, I used the mysqli_real_escape_string() function to save the data in a safe format for the database (probably something I should have done in the first place, learning experience!).
Updated working code:
for ($x = 0; $x < $artworksIterations; $x++) {
    for ($y = 0; $y < count($artworksTitle[$x]); $y++) {
        // display the information on the web page
        echo "TITLE: " . $artworksTitle[$x][$y] . "<br>";
        echo "ARTIST: " . $artworksArtist[$x][$y] . "<br>";
        echo "YEAR: " . $artworksYear[$x][$y] . "<br>";
        echo "MEDIUM: " . $artworksMedium[$x][$y] . "<br>";
        echo "GENRES: " . implode(", ", $artworksGene[$x][$y]);
        echo "<br><img src='" . $artworksDisplay[$x][$y] . "'><br>";
        // save data to MySQL safe format
        $savedartworksTitle = mysqli_real_escape_string($conn, $artworksTitle[$x][$y]);
        $savedartworksArtist = mysqli_real_escape_string($conn, $artworksArtist[$x][$y]);
        $savedartworksYear = mysqli_real_escape_string($conn, $artworksYear[$x][$y]);
        $savedartworksMedium = mysqli_real_escape_string($conn, $artworksMedium[$x][$y]);
        $savedartworksDisplay = mysqli_real_escape_string($conn, $artworksDisplay[$x][$y]);
        $implodeGene = mysqli_real_escape_string($conn, implode(", ", $artworksGene[$x][$y]));
        // insert data into database
        $sql = "INSERT INTO Artworks (title, artist, year, medium, display, genres) VALUES ('$savedartworksTitle', '$savedartworksArtist', '$savedartworksYear', '$savedartworksMedium', '$savedartworksDisplay', '$implodeGene');";
        if (mysqli_query($conn, $sql) === FALSE) {
            printf("ERROR: %s\n", mysqli_error($conn));
        }
        echo "<br>----<br>";
    } // end of y
} // end of x
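An alternative to escaping each value by hand is a prepared statement, which keeps the data out of the SQL text entirely. The sketch below is only an illustration: `insertArtwork()` and the keys of `$row` are hypothetical, and it assumes `$conn` is an open mysqli connection and that all six columns accept strings.

```php
<?php
// Sketch only: insertArtwork() and the $row keys are hypothetical.
// Inserts one artwork using a prepared statement instead of string escaping.
function insertArtwork(mysqli $conn, array $row): bool
{
    $stmt = mysqli_prepare(
        $conn,
        'INSERT INTO Artworks (title, artist, year, medium, display, genres) VALUES (?, ?, ?, ?, ?, ?)'
    );
    if ($stmt === false) {
        return false; // the caller can inspect mysqli_error($conn)
    }
    // 'ssssss' binds all six parameters as strings (an assumption about the columns)
    mysqli_stmt_bind_param(
        $stmt,
        'ssssss',
        $row['title'],
        $row['artist'],
        $row['year'],
        $row['medium'],
        $row['display'],
        $row['genres']
    );
    $ok = mysqli_stmt_execute($stmt);
    mysqli_stmt_close($stmt);
    return $ok;
}
```

Inside the inner loop this would be called with an array built from `$artworksTitle[$x][$y]`, `$artworksArtist[$x][$y]` and so on, with the genres imploded into one string first. An apostrophe in the data then no longer affects the statement's syntax.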

How to add a string of text between every two lines in a foreach loop in PHP? (retrieving records from a database)

I am working on a "boxing records" database for a school project. The loop retrieves records from a SQL statement. I want to add a "VS" string of text between every other two lines in order to show records outputted somewhat like this.
Upcoming Fights
Sergey Kovalev (28-0-25)
VS
Jean Pascal (30-3-17)
Another Boxer (123-0-5)
VS
Some Boxer (123-3-1)
However, my current loop outputs like this
Sergey Kovalev (28-0-25)
VS
Jean Pascal (30-3-17)
VS
Another Boxer (123-3-1)
VS
Some Other Boxer (123-3-1)
VS
The loop I currently have is the following
foreach($records as $record) {
    $i = 0;
    echo $record['name'] . " (" . $record['wins'] . "-" . $record['losses'] . "-" . $record['kos'] . ")" . "<br>";
    $i=$i*2;
    if($i%2 == 0)
    {
        echo "VS <br/>";
    }
    else{
        echo "<br />";
    }
}
I know I could probably change the SQL in order to display two fighters in the same row, and then append "vs" on the echo, but I thought that just modifying the foreach loop would work by using a counter variable $i. I thought it would be pretty easy to make the "VS" appear between every two rows, but I'm missing something in my logic.
You need to increment the value of $i by one rather than multiplying it by 2, and initialize $i outside the foreach loop.
$i = 0; // Initialize counter here
foreach($records as $record) {
    echo $record['name'] . " (" . $record['wins'] . "-" . $record['losses'] . "-" . $record['kos'] . ")" . "<br>";
    if($i%2 == 0)
    {
        echo "VS <br/>";
    }
    else
    {
        echo "<br />";
    }
    $i++; // Increment counter here
}
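Another way to get the same pairing without a counter is to split the records into chunks of two with array_chunk. This is only a sketch under the assumption that `$records` is a plain array with the keys used in the question; `formatFights()` is a made-up name.

```php
<?php
// Sketch only: formatFights() is a hypothetical helper.
// Pairs consecutive records and joins each pair with "VS".
function formatFights(array $records): string
{
    $html = '';
    foreach (array_chunk($records, 2) as $pair) {
        $lines = [];
        foreach ($pair as $r) {
            $lines[] = $r['name'] . ' (' . $r['wins'] . '-' . $r['losses'] . '-' . $r['kos'] . ')';
        }
        // "VS" only appears between two fighters, never after a lone one
        $html .= implode('<br>VS<br>', $lines) . '<br><br>';
    }
    return $html;
}
```

Calling `echo formatFights($records);` prints each pair as a block. Unlike the counter version, an odd number of records does not leave a trailing "VS" after the last fighter.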

MySQL/PHP mysql_fetch_array() keeps missing first row

Good eve everyone!
For some reason Database::fetchArray() is skipping the first $row of the query result set.
It prints all rows properly, only keeps missing out the first one for some reason, I assume there's something wrong with my fetchArray() function?
I ran the query in phpMyAdmin and it returned 4 rows; when I tried it on my localhost with the PHP file (code below) it only printed 3 rows, using the same 'WHERE tunes.riddim' value of course. Most similar topics on Google show that a common mistake is to call mysql_fetch_array() before the while(), which moves the pointer ahead and causes the first row to be missed; unfortunately I only have one mysql_fetch_array() call (the one within the while() head).
<?php
$db->query("SELECT " .
"riddims.riddim AS riddim, " .
"riddims.image AS image, " .
"riddims.genre AS genre, " .
"tunes.label AS label, " .
"tunes.artist AS artist, " .
"tunes.tune AS tune, " .
"tunes.year AS year," .
"tunes.producer AS producer " .
"FROM tunes " .
"INNER JOIN riddims ON tunes.riddim = riddims.riddim " .
"WHERE tunes.riddim = '" . mysql_real_escape_string(String::plus2ws($_GET['riddim'])) . "'" .
"ORDER BY tunes.year ASC");
$ar = $db->fetchArray();
for($i = 0; $i < count($ar) - 1; $i++)
{
echo $ar[$i]['riddim'] . " - " . $ar[$i]['artist'] . " - " . $ar[$i]['tune'] . " - " . $ar[$i]['label'] . " - " . $ar[$i]['year'] . "<br>";
}
?>
Database::fetchArray() looks like:
public function fetchArray()
{
$ar = array();
while(($row = mysql_fetch_array($this->result)) != NULL)
$ar[] = $row;
return $ar;
}
Any suggestions appreciated!
You should remove -1 from the for loop
The problem's in your for loop:
for($i = 0; $i < count($ar) - 1; $i++)
The - 1 makes the loop stop one row early, so one row is never printed; if count($ar) is 1, because there's one entry, the loop body never runs at all. Try tweaking the check part:
for($i = 0; $i < count($ar); $i++)
You can also use a simple foreach:
foreach($db->fetchArray() as $row)
{
    echo $row['riddim']; // ...
}
It'll make your code more readable too.

Finding matches between two tables, each with 160k+ rows

I have two tables with 160k+ rows each. Some UUIDs are shared between the two. I'm using a foreach loop over the "new" table with an embedded foreach searching the "old" table. When a UUID match is found, the "old" table is updated with data from the "new" table.
Both tables have an index on the ID.
My problem is that this operation is extremely time-intensive; does anyone know a more efficient way to do said search for matching UUIDs? Sidenote: we are using the MySQLi extension for PHP 5.3
Exp code:
$oldCounter = 0;
$newCounter = 1;
//loop
foreach( $accounts as $accKey=>$accValue )
{
    echo( "New ID - " . $oldCounter++ . ": " . $accValue['id'] . "\n" );
    foreach( $accountContactData as $acdKey=>$acdValue )
    {
        echo( "Old ID - " . $newCounter++ . ": " . $acdValue['id'] . " \n" );
        if( $accValue['id'] == $acdValue['id'] && (
            $accValue['phone_office'] == "" || $accValue['phone_office'] == NULL || $accValue['phone_office'] == 0 )
        ){
            echo("ID match found\n");
            //when match found update accounts with accountsContact info
            $query = '
                UPDATE `accounts`
                SET
                `phone_fax` = "' . $acdValue['fax'] . '",
                `phone_office` = "' . $acdValue['telephone1'] . '",
                `phone_alternate` = "' . $acdValue['telephone2'] . '"
                WHERE
                `id` = "' . $acdValue['id'] . '"
            ';
            echo "" . $query . "\n\n";
            $DB->query($query);
            break 1;
        }
    }
}
unset($oldCounter);
unset($newCounter);
Thank you in advance.
Do this all in SQL.
There is nothing that I see in your code that requires PHP.
UPDATE allows JOIN. JOIN the new and old tables and have your WHERE conditions match that of your description. Should be pretty straightforward and significantly faster.
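A rough sketch of such a statement follows. The table name `accounts_contact` is an assumption (the question never names the table behind $accountContactData); the column names are taken from the array keys in the question, and the WHERE clause mirrors the intent of the PHP condition (phone_office empty, NULL or zero).

```php
<?php
// Sketch only: one UPDATE ... JOIN instead of two nested PHP loops.
// `accounts_contact` is an assumed table name, not taken from the question.
$query = <<<'SQL'
UPDATE `accounts` AS a
INNER JOIN `accounts_contact` AS c ON c.`id` = a.`id`
SET
    a.`phone_fax`       = c.`fax`,
    a.`phone_office`    = c.`telephone1`,
    a.`phone_alternate` = c.`telephone2`
WHERE
    a.`phone_office` IS NULL
    OR a.`phone_office` = ''
    OR a.`phone_office` = '0'
SQL;

// With the connection object from the question this would be run once:
// $DB->query($query);
```

Because both tables are indexed on the id, the server can match the rows itself, and nothing has to be fetched into PHP arrays or sent back row by row.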
I wrote a function some months ago.
You can modify it as you want.
It can improve the speed of your search.
public static function search($query)
{
    $result = array();
    $all = custom_query::getNumRows("bar");
    $quarter = floor(0.25 * $all) + 1;
    $all = 0;
    for($i = 0;$i<4;$i++)
    {
        custom_query::condition("", "limit $all, $quarter");
        $data = custom_query::FetchAll("bar");
        foreach ($data as $v)
            foreach($v as $_v)
                if (count(explode($query, $_v)) > 1)
                    $result[] = $v["bar_id"];
        $all += $quarter;
    }
    return $result;
}
It returns the IDs of the records that the search matched.
This method divides the table into 4 parts, and each iteration fetches only a quarter of it...
You can change this number to, for example, 10 or 20 for speed of...
Some methods are in the class, and you can easily write them...
