How can I do this in a faster way? - php

I have a script that imports CSV files. What ends up in my database is, among other things, a list of customers and a list of addresses. I have a table called customer and another called address, where address has a customer_id.
One thing that's important to me is not to have any duplicate rows. Therefore, each time I import an address, I do something like this:
$address = new Address();
$address->setLine_1($line_1);
$address->setZip($zip);
$address->setCountry($usa);
$address->setCity($city);
$address->setState($state);
$address = Doctrine::getTable('Address')->findOrCreate($address);
$address->save();
What findOrCreate() does, as you can probably guess, is find a matching address record if it exists, otherwise just return a new Address object. Here is the code:
public function findOrCreate($address)
{
    $q = Doctrine_Query::create()
        ->select('a.*')
        ->from('Address a')
        ->where('a.line_1 = ?', $address->getLine_1())
        ->andWhere('a.line_2 = ?', $address->getLine_2())
        ->andWhere('a.country_id = ?', $address->getCountryId())
        ->andWhere('a.city = ?', $address->getCity())
        ->andWhere('a.state_id = ?', $address->getStateId())
        ->andWhere('a.zip = ?', $address->getZip());

    $existing_address = $q->fetchOne();

    if ($existing_address)
    {
        return $existing_address;
    }
    else
    {
        return $address;
    }
}
The problem with doing this is that it's slow. To save each row in the CSV file (which translates into several INSERT statements on different tables), it takes about a quarter second. I'd like to get it as close to "instantaneous" as possible because I sometimes have over 50,000 rows in my CSV file. I've found that if I comment out the part of my import that saves addresses, it's much faster. Is there some faster way I could do this? I briefly considered putting an index on it but it seems like, since all the fields need to match, an index wouldn't help.

This certainly won't alleviate all of the time spent on tens of thousands of iterations, but why don't you manage your addresses outside of per-iteration DB queries? The general idea:
Get a list of all current addresses (store it in an array)
As you iterate, check membership in that array (a checksum or hash of the address fields makes a handy key); if the address isn't there, add it to the array and save it to the database.
Unless I'm misunderstanding the scenario, this way you're only making INSERT queries if you have to, and you don't need to perform any SELECT queries aside from the first one.
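For illustration, a minimal sketch of that idea, assuming the address fields from the question and a plain PDO connection for the one-time SELECT (the connection details, variable names and the md5 key are assumptions, not the asker's actual setup):
// One-time pass: load every existing address into a hash set keyed by its fields.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
]);
$seen = [];
foreach ($pdo->query('SELECT line_1, line_2, country_id, city, state_id, zip FROM address') as $row) {
    $seen[md5(implode('|', $row))] = true;
}

// Per CSV row: only touch the database when the address is genuinely new.
$key = md5(implode('|', [$line_1, $line_2, $country_id, $city, $state_id, $zip]));
if (!isset($seen[$key])) {
    $seen[$key] = true;
    $address = new Address();
    $address->setLine_1($line_1);
    $address->setZip($zip);
    // ... set the remaining fields as in the question ...
    $address->save();
}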

I recommend that you investigate loading the CSV files into MySQL using LOAD DATA INFILE:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
In order to update existing rows, you have a couple of options. LOAD DATA INFILE does not have upsert functionality (INSERT ... ON DUPLICATE KEY UPDATE), but it does have a REPLACE option that you could use to update existing rows. You need an appropriate unique index for that, and REPLACE is really just a DELETE followed by an INSERT, which is slower than an UPDATE.
Another option is to load the data from the CSV into a temporary table, then merge that table with the live table using INSERT...ON DUPLICATE KEY UPDATE. Again, make sure you have an appropriate unique index, but in this case you're doing an update instead of a delete so it should be faster.
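A rough sketch of the second option, with PDO and illustrative file, table and column names (the real unique index has to cover whatever columns define a duplicate for you):
// Load the CSV into a staging table, then merge it into the live table in one statement.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // required for LOAD DATA LOCAL INFILE
]);

$pdo->exec('CREATE TEMPORARY TABLE address_staging LIKE address');

// IGNORE skips rows that would violate the copied unique index (duplicates inside the CSV itself).
$pdo->exec("
    LOAD DATA LOCAL INFILE '/tmp/addresses.csv'
    IGNORE INTO TABLE address_staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (line_1, line_2, country_id, city, state_id, zip)
");

// Assumes a UNIQUE index on the columns that define a duplicate address.
$pdo->exec('
    INSERT INTO address (line_1, line_2, country_id, city, state_id, zip)
    SELECT line_1, line_2, country_id, city, state_id, zip FROM address_staging
    ON DUPLICATE KEY UPDATE city = VALUES(city), zip = VALUES(zip)
');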

It looks like your duplicate checking is what is slowing you down. To find out why, figure out what query Doctrine is creating and run EXPLAIN on it.
My guess would be that you will need to create some indexes. Searching through the entire table can be very slow, but adding an index on zip would let the query scan only the addresses with that zip code. The EXPLAIN output will be able to guide you to other optimizations.

What I ended up doing, which improved performance greatly, was to use INSERT ... ON DUPLICATE KEY UPDATE instead of findOrCreate().
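For the record, here is a hedged sketch of that pattern; the composite unique index, the PDO usage and the LAST_INSERT_ID(id) trick for getting back the existing row's id are assumptions layered on top of the fields findOrCreate() compares, not the asker's exact code:
// One-time schema change: the UNIQUE index is what lets MySQL detect the duplicate.
// ALTER TABLE address ADD UNIQUE uq_address (line_1, line_2, country_id, city, state_id, zip);

$stmt = $pdo->prepare('
    INSERT INTO address (line_1, line_2, country_id, city, state_id, zip)
    VALUES (:line_1, :line_2, :country_id, :city, :state_id, :zip)
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id)
');
$stmt->execute([
    ':line_1'     => $line_1,
    ':line_2'     => $line_2,
    ':country_id' => $country_id,
    ':city'       => $city,
    ':state_id'   => $state_id,
    ':zip'        => $zip,
]);
$addressId = (int) $pdo->lastInsertId(); // id of the new row, or of the existing duplicate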

Related

Optimal way to detect fields to delete in database comparing to an array of IDs

I am trying to do the following.
I am querying an external database through a web service. What the web service does is bring me all the products from an ERP system my client uses. As the server and the connection are not really fast, what I decided to do is basically synchronize the database onto my web server and handle most operations there, so that the website can run smoothly.
Everything works fine; I just need one last step to guarantee that the inventory on the website matches the one available in the ERP. The only issue comes when they (the client) delete something in the ERP system.
At the moment I am trying to work out the ideal strategy (the least resource- and time-consuming) for removing products from my Products table when I don't receive them in the web service result.
So I basically have the following process:
I query the web service for all the products, format them a little, and store them in an array. The final size is about 600 entries.
Then I run a foreach loop over that array, with the following sub-steps:
I query my database to check if product_id is present.
If the product is present, I just update it with the latest info, stock data.
If the product is not present, I just insert it.
So, I was thinking of doing the following, but I do not think it's the ideal way:
Do a SELECT * FROM Products and generate an array that has all the products.
Do a foreach over the resulting array and, in each iteration, scan the ERP array to check whether that specific product exists. If it doesn't, I delete it; if it does, I continue with the next product.
Considering that, after all the previous steps, this would involve a couple of nested foreach loops, I am a little worried that it might consume too much memory and also take longer to process.
I was thinking that maybe something like array_diff or array_map could solve the issue, but I am not really experienced with these functions, and the structure of the two arrays differs a lot, so I am not sure it would work that easily.
What would you guys recommend?
It's actually quite simple:
SELECT id FROM Products
Then you have an array of your product Ids, for example:
[123,5679,345]
Then as you go and do your updates or inserts, remove the id from the array.
As for the update step ("I query my database to check if product_id is present"): that check is now redundant.
There are a few ways to remove the value from the array when you do an update; this is how I would probably do it:
// Find the index of the incoming product_id in our list of IDs from the local DB.
// Note the !== comparison: array_search() can return 0 for the first index, so we
// must check specifically for boolean false.
if (false !== ($index = array_search($data['product_id'], $myids))) {
    // The incoming product_id is in the local list: remove it, then do the update.
    unset($myids[$index]);
} else {
    // Otherwise, do the insert.
}
As mentioned above, when doing your updates/inserts you no longer have to check whether the ID exists, because you already know it from the array of IDs pulled from the database. That alone saves you n queries (roughly 600).
Then it's very simple to deal with whatever IDs are left over.
// I wouldn't normally concatenate variables into SQL, but in this case it's a list of integer IDs
// that came straight from the database. You can of course build a prepared statement with
// placeholders if you wish; for the sake of simplicity, I'll leave that as an exercise for another day.
'DELETE FROM Products WHERE id IN ('.implode(',', $myids).')'
And because you unset each ID when updating, the only IDs left are products that no longer exist.
Conclusion:
You have no choice (other than doing an ON DUPLICATE KEY query, or ignoring exceptions) but to pull out the product IDs. You're already doing this on a row-by-row basis, so we can effectively kill two birds with one stone.
If you need more data than just the ID (for example, to check whether the product actually changed before doing an update), then pull that data out too; I would recommend using PDO and the FETCH_GROUP option. I won't go into the specifics other than to say it lets you easily build your array this way:
[{product_id} => [ {product_name}, {product_price} etc..]];
Basically, product_id is the key, with a nested array of the row data; this makes lookups easier.
This way you can look it up like this.
// Then, instead of array_search:
// if (false !== ($index = array_search($data['product_id'], $myids))) {
if (isset($myids[$data['product_id']])) {
    unset($myids[$data['product_id']]);
    // do your checks, then your update
} else {
    // do inserts
}
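Putting the pieces together, a rough sketch under the assumption of a Products(id, name, price) table and a PDO connection; the answer mentions FETCH_GROUP, and PDO::FETCH_UNIQUE is the variant that keeps exactly one row per id:
// Build a lookup of local products keyed by id: [id => ['name' => ..., 'price' => ...]]
$byId = $pdo->query('SELECT id, name, price FROM Products')
            ->fetchAll(PDO::FETCH_UNIQUE | PDO::FETCH_ASSOC);

foreach ($erpProducts as $data) {            // $erpProducts: the ~600-entry array from the web service
    if (isset($byId[$data['product_id']])) {
        unset($byId[$data['product_id']]);   // seen in the feed, so it survives the final delete
        // ... UPDATE Products SET ... WHERE id = :id
    } else {
        // ... INSERT INTO Products ...
    }
}

// Whatever is still in $byId was not in the ERP feed, so remove it.
if ($byId) {
    $in = implode(',', array_map('intval', array_keys($byId)));
    $pdo->exec("DELETE FROM Products WHERE id IN ($in)");
}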
References:
http://php.net/manual/en/function.array-search.php
array_search — Searches the array for a given value and returns the first corresponding key if successful
WARNING This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE. Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.
UPDATE
There is one other really good way to do this: add a field called sync_date, and whenever you do your insert or update, set sync_date to the current date.
This way, when you are done, the products with a sync date older than this run can be deleted. It's best to cache the time at the start of the run so you know the exact value.
$time = date('Y-m-d H:i:s'); // or time() if you prefer a timestamp
// use this same variable for the whole course of the script.
Then you can do
"DELETE FROM Products WHERE sync_date != '$time'"
This may actually be a bit better because it has more utility: now you also know when each row was last synced.
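A compact sketch of that variant; the sync_date column name and the use of prepared statements are assumptions on top of the answer:
// Cache one timestamp for the whole run so every touched row gets the same value.
$syncTime = date('Y-m-d H:i:s');

// ...inside the import loop, both the UPDATE and the INSERT also set sync_date = :sync...

// Afterwards, anything not touched in this run was missing from the ERP feed, so remove it.
$stmt = $pdo->prepare('DELETE FROM Products WHERE sync_date <> :sync');
$stmt->execute([':sync' => $syncTime]);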

Making my query more efficient

I'm currently doing a query that pulls a string from my DB, but it has to run for every new row I import, and a file of 100k rows takes almost 4 hours to import. That's way too long. I'm assuming that the SQL check for whether the record already exists is the thing slowing it down.
I've heard about indexing, but I have no clue what it is or how to use it.
This is the current code I'm using:
$sql2 = $pdo->prepare('SELECT * FROM prospects WHERE :ssn = ssn');
$sql2->execute(array(':ssn' => $ssn));
if($sql2->fetch(PDO::FETCH_NUM) > 0){
So every time the PHP script reads a new row, it does this check. The problem is that I can't handle it with ON DUPLICATE KEY in the SQL; it has to check before running any SQL, because if the result is empty, the script should continue doing its thing.
What could I do to make this more time-efficient? Also, if an index is the way to go, could someone enlighten me on how this is done, either by posting examples or linking a guide or php.net page, and explain how I would use that index to do what my code does now?
So you have 100k records and no index? Then start by creating one:
CREATE INDEX ssn_index ON prospects (ssn)
Now, each time you select something from the prospects table with a WHERE condition on the ssn column, MySQL will use the index to decide where to look for the records. If this column is highly selective (there are many different values), the query will run fast.
You can check your execution plan by running:
EXPLAIN SELECT * FROM prospects WHERE :ssn = ssn
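With the index in place, the per-row existence check itself can also be kept cheap. A hedged variant of the lookup, reusing the prospects/ssn names from the question:
// Fetch a single scalar instead of the whole row; the index on ssn makes this lookup fast.
$check = $pdo->prepare('SELECT 1 FROM prospects WHERE ssn = :ssn LIMIT 1');
$check->execute([':ssn' => $ssn]);
if ($check->fetchColumn() !== false) {
    // the ssn already exists: skip this row or update it
} else {
    // new ssn: continue importing this row
}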

Doctrine: Fill column with random values

I want to update one column for all rows in a table with a random number. As far as my research goes, there is no rand() in Doctrine by default. The options I see are: 1. add a custom DQL function, which would be MySQL-specific; 2. update every row with a PHP-generated value.
Both options seem like bad practice to me. Is there something I'm missing?
I would go with a native query. It is much simpler than creating a custom DQL function.
$em = getEntityManager();
$tableName = $em->getClassMetadata('Your:Entity')->getTableName();
$em->getConnection()->exec('UPDATE '.$tableName.' SET column=RAND()');
But if you prefer DQL, go with it.
Doing it in PHP, however, would be the worst option:
You would have to fetch all records first
You would have to update each row one by one
The database engine is not something you change every week, so don't be afraid of using vendor-specific functions.

How Can I Update an SQL database with a foreach loop?

I have a photo gallery and want to update multiple captions at once via form input. I've tried to research this but I think I'm in way over my head. This is what I have so far but it's not working..
The data is saved in an SQL table called "gallery". An example row might look like:
gallery_id(key) = some number
product_id = 500
photo = photo.jpg
caption = 'look at this picture'
My form inputs are generated like this:
$sql = mysql_query("SELECT * FROM gallery WHERE product_id = 500");
while ($row = mysql_fetch_array($sql)) {
    $photo   = $row['photo'];
    $caption = $row['caption'];
    echo '<img src="'.$photo.'"/>';
    echo '<input name="cap['.$caption.']" id="cap['.$caption.']" value="'.$caption.'" />';
}
So once I submit the form I start to access my inputs like this but I hit a wall..
if (isset($_POST['cap']) && is_array($_POST['cap'])) {
    foreach ($_POST['cap'] as $cap) {
        mysql_query("UPDATE gallery
                     SET caption=$caption
                     WHERE ???????");
    }
}
I don't know how to tell the database where to put these inputs and as far as I can tell you can't pass more than one variable in a foreach loop.
$_POST is an array; if I'm not wrong (I can't test it now and I'm new to PHP) you can do:
foreach ($_POST as $p) {
    $id  = $p['id'];
    $cap = $p['caption'];
    mysql_query("UPDATE gallery
                 SET caption='$cap'
                 WHERE photoid=$id");
}
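A variation on the same idea, sketched with PDO prepared statements and with the form input keyed by the row's primary key instead of the caption text (gallery_id is taken from the table description above; the $pdo connection is an assumption):
// When generating the form, key each input by gallery_id:
//   echo '<input name="cap['.$row['gallery_id'].']" value="'.htmlspecialchars($row['caption']).'" />';

if (isset($_POST['cap']) && is_array($_POST['cap'])) {
    $stmt = $pdo->prepare('UPDATE gallery SET caption = :caption WHERE gallery_id = :id');
    foreach ($_POST['cap'] as $id => $caption) {
        $stmt->execute([':caption' => $caption, ':id' => (int) $id]);
    }
}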
A few things to consider
Updating many rows is tricky. I see it's already answered, so I'm not trying to give a better answer, just some non-trivial notes for those who find this thread by searching for "update sql many rows" or similar. Updating several rows is a common scenario, and I've seen a lot of code using the "one UPDATE at a time" approach (described above), which can be quite slow. In many cases there are better ways. There's no one-size-fits-all solution here, so I've tried to split up the possible techniques by what you have and where you're going:
Multiple rows, same data: If you want to set the same caption for many rows, use UPDATE ... WHERE id IN (...). It's an edge case, but it gives you a noticeable performance boost.
Multiple rows, different data, at once: When updating several rows, consider using a CASE structure, as explained in this article: sql update multiple rows with case in php (see the sketch after this list).
Transactions: If you're using InnoDB tables (which is quite likely), you may want to do the whole thing in one transaction. It's a lot faster.
Index updates: If you're updating very many rows (thousands), it can make sense to disable keys beforehand and re-enable them afterwards. This way MySQL can skip a lot of index updates along the way and only rebuild the index for the final result.
Delete + reinsert: If you're updating many fields per record, and there are no triggers or other magic consequences of deleting your rows, it can be faster to delete them with WHERE id IN (...) and then do a multi-insert. Don't do this if you're only updating fixed-size integers like counters.
Then again:
If your table has many concurrent reads and writes, and you're updating only a few rows (up to a hundred), stick with the one-by-one approach, especially if it's MyISAM.
Try before you buy; these techniques all depend on the data itself too. But it's worth going the extra mile to find the best one.
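As referenced in the "multiple rows, different data" item above, a hedged sketch of the single-UPDATE-with-CASE technique, reusing the gallery example ($captions is assumed to be a gallery_id => caption array collected from the form, $pdo a PDO connection):
// Build one UPDATE that assigns a different caption per row, driven by a CASE on the key.
$ids    = array_map('intval', array_keys($captions));
$case   = 'CASE gallery_id';
$params = [];
foreach ($captions as $id => $caption) {
    $case    .= ' WHEN '.(int) $id.' THEN ?';
    $params[] = $caption;
}
$case .= ' END';

$sql  = 'UPDATE gallery SET caption = '.$case.' WHERE gallery_id IN ('.implode(',', $ids).')';
$stmt = $pdo->prepare($sql);
$stmt->execute($params);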

MySQL Remove/Combine Similar Rows

I've got a problem that I just can't seem to find the answer to. I've developed a very small CRM-like application in PHP that's driven by MySQL. Users of this application can import new data to the database via an uploaded CSV file. One of the issues we're working to solve right now is duplicate, or more importantly, near duplicate records. For example, if I have the following:
Record A: [1, Bob, Jones, Atlanta, GA, 30327, (404) 555-1234]
and
Record B: [2, Bobby, Jones, Atlanta, GA, 30327, Bob's Shoe Store, (404) 555-1234]
I need a way to see that these are both similar, take the record with more information (in this case record B) and remove record A.
But here's where it gets even more complicated. This needs to happen both when importing new data and through a function I can execute at any time to remove duplicates already in the database. I have been able to put something together in PHP that gets all duplicate rows from the MySQL table and matches them up by phone number, or by using implode() on all columns in the row and then using strlen() to decide which is the longest record.
There has got to be a better way of doing this, and one that is more accurate.
Do any of you have any brilliant suggestions that I may be able to implement or build on? It's obvious that when importing new data I'll need to open their CSV file into an array or temporary MySQL table, do the duplicate/similar search, then recompile the CSV file or add everything from the temporary table to the main table. I think. :)
I'm hoping that some of you can point out something that I may be missing that can scale somewhat decently and that's somewhat accurate. I'd rather present a list of duplicates we're 'unsure' about to a user that's 5 records long, not 5,000.
Thanks in advance!
Alex
If I were you, I'd put a UNIQUE key on name, surname, and phone number, since in theory, if all three are equal, the record is a duplicate; a phone number can have only one owner. In any case, you should find a combination of 2-3 (or maybe 4) columns and assign them a unique key. Once you have such a structure, run something like this:
-- assuming that you have defined something like the following in your CREATE TABLE:
UNIQUE(phone, name, surname)

-- then you should perform something like:
INSERT INTO your_table (phone, name, surname) VALUES ($val1, $val2, $val3)
ON DUPLICATE KEY UPDATE phone   = IFNULL($val1, phone),
                        name    = IFNULL($val2, name),
                        surname = IFNULL($val3, surname);
So basically, if the inserted value is a duplicate, this code will update the row rather than inserting a new one. The IFNULL function checks whether the first expression is null; if it is, it picks the second expression, which in this case is the column value that already exists in your table. Hence, it will update your row with as much information as possible.
I don't think there are brilliant solutions. You need to decide which data fields you can rely on for detecting similarity and prioritize them: for example the phone number, some kind of ID, or a uniform address or official name.
You can save some cleaned-up values (reduced to the same format, e.g. only digits for phone numbers, or a concatenated full address) alongside each row, and use them for the similarity search when adding records.
Then you need to decide on data completeness: in each case, either update the existing row with the more complete fields, or delete the old row and add the new one.
I don't know of any ready-made solutions for such a variable task, and I doubt they exist.
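As a rough illustration of the "cleaned up values" idea, a hypothetical fingerprint helper; the column choices and the exact normalization are assumptions, not a ready-made solution:
// Reduce a record to a comparable fingerprint: digits-only phone plus a lowercased name.
function recordFingerprint(array $row)
{
    $phone = preg_replace('/\D+/', '', $row['phone']);
    $name  = strtolower(trim($row['first_name'].' '.$row['last_name']));
    return $phone.'|'.$name;
}

// On import, compare fingerprints instead of raw rows; when two records collide,
// keep whichever one carries more non-empty fields, as the question describes.
$key = recordFingerprint($csvRow);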
