A vendor is feeding us a CSV file of their products. A particular column in the file (e.g. column 3) is the style number. The file has thousands of entries.
We have a database table of products with a column called manufacturer_num, which is the vendor's style number.
I need to find which of the vendor's products we do not currently have.
I know I can loop through each line in the CSV file, extract the style number, and check whether it is in our database. But then I am making a call to the database for each line, which would be thousands of calls. That seems inefficient.
I could also build a list of the style numbers (either as a string or an array) and make one DB call.
Something like: WHERE manufacturer_num IN (...). But won't PHP run out of memory if the list is too big? And that would actually give me the ones we do have, not the ones we don't.
What's an efficient way to do this?
Bulk load the CSV into a temporary table, do a LEFT JOIN, then get the records where the RHS of the join is NULL.
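A minimal sketch of that approach, assuming MySQL via PDO and a server that allows LOAD DATA LOCAL INFILE; vendor_styles and /path/to/vendor.csv are hypothetical names, while products.manufacturer_num is the column described above:

$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8', 'user', 'pass',
               array(PDO::MYSQL_ATTR_LOCAL_INFILE => true));

// 1. Temporary table that holds only the style numbers from the CSV.
$pdo->exec("CREATE TEMPORARY TABLE vendor_styles (
                style_number VARCHAR(100),
                INDEX (style_number)
            )");

// 2. Bulk load the file; only column 3 is kept, extra CSV columns are ignored.
$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/vendor.csv'
    INTO TABLE vendor_styles
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    (@c1, @c2, @c3)
    SET style_number = @c3
");

// 3. LEFT JOIN back to products: rows where the right-hand side is NULL
//    are the vendor styles we do not carry.
$missing = $pdo->query("
    SELECT v.style_number
    FROM vendor_styles v
    LEFT JOIN products p ON p.manufacturer_num = v.style_number
    WHERE p.manufacturer_num IS NULL
")->fetchAll(PDO::FETCH_COLUMN);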
I have a module in my application where a user uploads an Excel sheet with around 1000-2000 rows. I am using excel-reader to read the Excel file.
In the Excel file there are the following columns:
1) SKU_CODE
2) PRODUCT_NAME
3) OLD_INVENTORY
4) NEW_INVENTORY
5) STATUS
I have a MySQL table inventory which contains the data for the SKU codes:
1) SKU_CODE : VARCHAR(100), primary key
2) NEW_INVENTORY : INT
3) STATUS : 0/1 BOOLEAN
There are two options available with me:
Option 1: Process all the records in PHP, extract all the SKU codes, and run a single MySQL IN query:
Select * from inventory where SKU_CODE in ('xxx','www','zzz'.....so on ~ 1000-2000 values);
- Single query
Option 2: Process each record one by one, querying for the current SKU's data:
Select * from inventory where SKU_CODE = 'xxx';
..
...
around 1000-2000 queries
So can you please help me choose the best way of achieving this task, with an explanation, so that I can be sure of building a good module?
As you've probably realized, both options have their pros and cons. On a properly indexed table, both should perform fairly well.
Option 1 is most likely faster, and can be better if you're absolutely sure that the number of SKUs will always be fairly limited, and if users only need to work with the result after the entire file has been processed.
Option 2 has a very important advantage in that you can process each record in your Excel file separately. This opens up some interesting options: you can begin generating output for each row as you read it from the Excel file, instead of having to parse the entire file in one go and then run one big query.
A middle way is to pick a specific, optimal BATCH_SIZE and use that as the criterion for querying your database.
An example batch size could be 5000.
So if your Excel file contains 2000 rows, all the data is returned in a single query.
If it contains 19000 rows, you do four queries, i.e. SKU codes 1-5000, 5001-10000, and so on.
Tune BATCH_SIZE according to your own benchmarks.
It is always good to save on database queries.
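A rough sketch of that batching idea, assuming PDO and the inventory table above; $pdo, $skuCodes (all SKU codes read from the Excel file) and BATCH_SIZE are placeholders to tune against your own benchmarks:

define('BATCH_SIZE', 5000);

$found = array();
foreach (array_chunk($skuCodes, BATCH_SIZE) as $chunk) {
    // One placeholder per SKU in this batch.
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare("SELECT SKU_CODE, NEW_INVENTORY, STATUS
                           FROM inventory
                           WHERE SKU_CODE IN ($placeholders)");
    $stmt->execute($chunk);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $found[$row['SKU_CODE']] = $row;
    }
}
// $found now maps every known SKU_CODE to its inventory row;
// SKUs from the sheet that are missing from $found are not in the table.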
I have a large table of about 14 million rows; each row contains a block of text. I also have another table with about 6000 rows, where each row has a word and six numerical values for that word. I need to take each block of text from the first table, find the number of times each word from the second table appears in it, then calculate the mean of the six values for each block of text and store it.
I have a Debian machine with an i7 and 8 GB of memory, which should be able to handle it. At the moment I am using PHP's substr_count() function, but PHP just doesn't feel like the right solution for this problem. Other than working around time-out and memory-limit problems, does anyone have a better way of doing this? Is it possible to do it in SQL alone? If not, what would be the best way to run my PHP without overloading the server?
Do each record from the 'big' table one at a time. Load that single block of text into your program (PHP or whatever), do the searching and calculation, then save the appropriate values wherever you need them.
Do each record as its own transaction, in isolation from the rest. If you are interrupted, use the saved values to determine where to start again.
Once you are done with the existing records, you only need to do this when you insert or update a record, so it becomes much easier. You just have to take the big bite now to get the existing data processed.
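If you stay with PHP, a rough sketch of that loop could look like the following, assuming PDO/MySQL; the names text_blocks, body, words, v1..v6, text_scores, $dsn, $user and $pass are hypothetical stand-ins for your own schema and credentials, and the "mean" is computed as a count-weighted mean, so adjust it to whatever you actually need:

// Two connections: one streams the big table without buffering it in PHP, the
// other writes results (one connection cannot run new queries while an
// unbuffered result set is still open).
$read  = new PDO($dsn, $user, $pass, array(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => false));
$write = new PDO($dsn, $user, $pass);

$words = $write->query("SELECT word, v1, v2, v3, v4, v5, v6 FROM words")
               ->fetchAll(PDO::FETCH_ASSOC);       // ~6000 rows, fine to keep in memory

$save = $write->prepare("REPLACE INTO text_scores (block_id, m1, m2, m3, m4, m5, m6)
                         VALUES (?, ?, ?, ?, ?, ?, ?)");

foreach ($read->query("SELECT id, body FROM text_blocks") as $row) {
    $totals = array_fill(0, 6, 0.0);
    $hits   = 0;
    foreach ($words as $w) {
        $n = substr_count($row['body'], $w['word']);
        if ($n === 0) {
            continue;
        }
        $hits += $n;
        foreach (array('v1', 'v2', 'v3', 'v4', 'v5', 'v6') as $i => $col) {
            $totals[$i] += $n * $w[$col];
        }
    }
    if ($hits > 0) {
        foreach ($totals as $i => $t) {
            $totals[$i] = $t / $hits;              // count-weighted mean per value
        }
    }
    // Saving per record doubles as the progress marker: an interrupted run can
    // resume at the first block_id missing from text_scores.
    $save->execute(array_merge(array($row['id']), $totals));
}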
What are you trying to do exactly? If you are trying to create something like a search engine with a weighting function, you should perhaps drop that and instead use the MySQL fulltext search functions and indexes that already exist. If you still need this specific solution, you can of course do it completely in SQL, either in one query or with a trigger that runs each time a row is inserted or updated. You won't be able to get this done properly with PHP without jumping through a lot of hoops.
To give you a specific answer, we indeed would need more information about the queries, data structures and what you are trying to do.
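For the fulltext route mentioned above, a rough sketch, assuming MySQL (MyISAM, or InnoDB on 5.6+) and hypothetical names $pdo, text_blocks and body:

// One-time: add the fulltext index.
$pdo->exec("ALTER TABLE text_blocks ADD FULLTEXT INDEX ft_body (body)");

// Relevance-ranked search instead of manual substr_count() scoring.
$stmt = $pdo->prepare("SELECT id, MATCH(body) AGAINST (?) AS relevance
                       FROM text_blocks
                       WHERE MATCH(body) AGAINST (?)
                       ORDER BY relevance DESC");
$stmt->execute(array('example', 'example'));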
Redesign it.
If size on disk is not important, just join the tables into one.
Put the 6000-row table into memory (a MEMORY table) and back it up every hour:
INSERT IGNORE INTO back.table SELECT * FROM my.table;
Create your own index in the big table, e.g. add a column to the big table that holds the id of the matching word row.
More information about the query is needed to find a solution.
I have n csv files which I need to compare against each other and modify them afterwards.
The problem is that each CSV file has around 800,000 lines.
To read the CSV files I use fgetcsv and it works well. There are some memory spikes, but in the end it is fast enough. The problem is comparing the arrays against each other: that takes ages.
Another problem is that I have to use a foreach to read the CSV data with fgetcsv because of the n amount of files. I end up with one huge array and can't compare it with array_diff, so I have to compare it with nested foreach loops, and that takes ages.
a code snippet for better understanding:
foreach ($files as $value) {
    $data[] = $csv->read($value['path']);
}
my csv class uses fgetcsv to add the output to the array:
fgetcsv( $this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure )
All the data from all the CSV files is stored in the $data array. This is probably the first big mistake, using only one array, but I have no clue how to stay flexible with the files without using a foreach. I tried to use flexible variable names but I got stuck there as well :)
Now I have this big array. Normally, if I want to compare the values against each other and find out whether the data from file one exists in file two and so on, I use array_diff or array_intersect. But in this case I only have this one big array, and as I said, running a foreach over it takes ages.
Also, after only 3 files I have an array with 3 × 800,000 entries. I guess that by 10 files at the latest my memory will explode.
So is there any better way to use PHP to compare n amount of very large csv files?
Use SQL
Create a table with the same columns as your CSV files.
Insert the data from the first CSV file.
Add indexes to speed up queries.
Compare with other CSV files by reading a line and issuing a SELECT.
You did not describe how you compare the n files, and there are several ways to do so. If you just want to find the lines that are in A1 but not in A2, ..., An, then you'll just have to add a boolean column diff to your table. If you want to know in which files a line is repeated, you'll need a text column, or a new table if a line can be in several files.
Edit: a few words on performance if you're using MySQL (I do not know much about other RDBMSs).
Inserting lines one by one would be too slow. You probably can't use LOAD DATA unless you can put the CSV files directly onto the DB server's filesystem, so I guess the best solution is to read a few hundred lines from the CSV and then send a multi-row insert query: INSERT INTO mytable VALUES (..1..), (..2..).
You can't issue a SELECT for each line you read in your other files, so you'd better put them in another table. Then issue a multiple-table update to mark the rows that are identical in the tables t1 and t2: UPDATE t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b) SET t1.diff=1
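A rough sketch of that batched insert, assuming PDO/MySQL and a hypothetical table csv_lines(a, b, c, diff) that mirrors the CSV columns (csv_lines2 would be loaded the same way from the next file):

$batch = array();
$flush = function (array $batch) use ($pdo) {
    if (!$batch) {
        return;
    }
    $rowPlaceholders = '(?,?,?)';
    $sql = 'INSERT INTO csv_lines (a, b, c) VALUES '
         . implode(',', array_fill(0, count($batch), $rowPlaceholders));
    $pdo->prepare($sql)->execute(call_user_func_array('array_merge', $batch));
};

$fh = fopen('file1.csv', 'r');
while (($line = fgetcsv($fh)) !== false) {
    $batch[] = array_slice($line, 0, 3);
    if (count($batch) === 500) {        // a few hundred rows per INSERT
        $flush($batch);
        $batch = array();
    }
}
$flush($batch);
fclose($fh);

// After loading the second file into csv_lines2, mark the identical rows:
$pdo->exec("UPDATE csv_lines t1
            JOIN csv_lines2 t2 ON (t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c)
            SET t1.diff = 1");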
Maybe you could try using sqlite. There are no concurrency problems here, and it could be faster than the client/server model of MySQL. And you don't need to set up much to use sqlite.
I have a tab delimited text file with the first row being label headings that are also tab delimited, for example:
Name ID Money
Tom 239482 $2093984
Barry 293984 $92938
The only problem is that there are 30-some columns instead of 3, so I'd rather not have to type out all the (name VARCHAR(50), ...) if it's avoidable.
How would I go about writing a PHP function that creates the table from scratch from the text file, say taking $file_path and $table_name as arguments? Do I have to write out all the column names again, telling MySQL what type they are, and chop off the top row, or is there a more elegant solution when the names are already there?
You would somehow need to map the column types to the columns in your file. You could do this by adding that data to your text file. For instance:
Name|varchar(32) ID|int(8) Money|int(10)
Tom 239482 $2093984
Barry 293984 $92938
or something similar. Then write a function that gets the column names and column types from the first line and fills the table using all the other rows. You might also want to add a way to name the given table, etc. However, this would probably be as much work (if not more) as turning your text file into SQL queries directly: add a CREATE TABLE statement at the top and an INSERT statement for each line. With search and replace this could be done very quickly.
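A rough sketch of such a function, assuming PDO/MySQL; it accepts the Name|type header format suggested above and falls back to VARCHAR(255) when no type is given, which sidesteps type detection entirely:

function create_table_from_tsv(PDO $pdo, $file_path, $table_name)
{
    $fh = fopen($file_path, 'r');
    $headers = fgetcsv($fh, 0, "\t");             // first row: tab-delimited column names

    $columns = array_map(function ($header) {
        // "Name|varchar(32)" -> name + type; plain "Name" -> VARCHAR(255)
        $parts = explode('|', trim($header), 2);
        $name  = preg_replace('/[^A-Za-z0-9_]/', '_', $parts[0]);
        $type  = isset($parts[1]) ? $parts[1] : 'VARCHAR(255)';
        return "`$name` $type";
    }, $headers);

    $pdo->exec("CREATE TABLE `$table_name` (" . implode(', ', $columns) . ")");

    // Load the remaining rows.
    $placeholders = implode(',', array_fill(0, count($headers), '?'));
    $insert = $pdo->prepare("INSERT INTO `$table_name` VALUES ($placeholders)");
    while (($row = fgetcsv($fh, 0, "\t")) !== false) {
        $insert->execute($row);
    }
    fclose($fh);
}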
Even if you could find a way to do this, how would you determine the column type? I guess there would be some way to determine the type of the columns through checking for certain attributes (int, string, etc). And then you'd need to handle weird columns like Money, which might be seen as a string because of the dollar sign, but should almost certainly be stored as an integer.
Unless you plan on using this function quite a bit, I wouldn't bother spending time cobbling it together. Just fat finger the table creation. (Ctrl-C, Ctrl-V is your friend)
I have a field in a table recipes that was inserted using mysql_real_escape_string. I want to count the number of line breaks in that field and order the records by this number.
P.S. The field is called Ingredients.
Thanks everyone
This would do it:
SELECT *, LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', '')) as Count
FROM Recipes
ORDER BY Count DESC
The way I'm getting the number of line breaks is a bit of a hack, and I don't think there's a better way. If performance is a big issue, I would recommend keeping a column that stores the number of line breaks. For medium-sized data sets, though, the above should be fine.
If you wanted to have a cache column as described above, you would do:
UPDATE
Recipes
SET
IngredientAmount = LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', ''))
After that, whenever you insert or update a row, you can calculate the amount (probably with PHP) and fill in this column beforehand. Or, if you're into that sort of thing, try out triggers.
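If you do go the trigger route, a minimal sketch, assuming MySQL and the same IngredientAmount column as above ($pdo is an existing PDO connection; a matching BEFORE UPDATE trigger would keep the column correct on edits):

$sql = <<<'SQL'
CREATE TRIGGER recipes_count_breaks
BEFORE INSERT ON Recipes
FOR EACH ROW
SET NEW.IngredientAmount = LENGTH(NEW.Ingredients) - LENGTH(REPLACE(NEW.Ingredients, '\n', ''))
SQL;
$pdo->exec($sql);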
I'm assuming a lot here, but from what I'm reading in your post, you could change your database structure a little bit, and both solve this problem and open your dataset up to more interesting uses.
If you separate ingredients into their own table and use a linking table to record which ingredients occur in which recipes, it'll be much easier to be creative with data manipulation. It becomes easier to count ingredients per recipe, to find similarities between recipes, to search for recipes containing sets of ingredients, and so on. Your data would also be more normalized and smaller (storing one global list of all ingredients vs. storing a set for each recipe).
If you're using a single text field to enter the ingredients for a recipe now, you could break that input up by lines and use each line as an ingredient when saving to the database. You can use something like PHP's built-in levenshtein() or similar_text() functions to deal with misspelled ingredient names and keep the data as normalized as possible without having to hand-groom your users' data entry too much.
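A sketch of the normalized structure described above (all names are hypothetical):

$pdo->exec("CREATE TABLE ingredients (
                id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
                name VARCHAR(100) NOT NULL UNIQUE
            )");
$pdo->exec("CREATE TABLE recipe_ingredients (
                recipe_id     INT UNSIGNED NOT NULL,
                ingredient_id INT UNSIGNED NOT NULL,
                PRIMARY KEY (recipe_id, ingredient_id)
            )");
// Counting ingredients per recipe then becomes a simple GROUP BY:
// SELECT recipe_id, COUNT(*) AS cnt FROM recipe_ingredients GROUP BY recipe_id ORDER BY cnt DESC;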
This is just a suggestion, take it as you like.
You're going a bit beyond the capabilities and intent of SQL here. You could write a stored procedure to scan the string and return the number and then use this in your query.
However, I think you should revisit the design of whatever inserts the Ingredients, so that you avoid scanning the strings of every row each time you run this query. Add a num_linebreaks column, calculate the number of line breaks, and set this column when you're adding the Ingredients.
If you've no control over the app that's doing the insertion, then you could use a stored procedure to update num_linebreaks based on a trigger.
Got it, thanks. The PHP code looks like:
$check = explode("\r\n", $_POST['ingredients']);
$lines = count($check);
So how could I update all the existing records in the table, setting Ingred_count based on the Ingredients field, in one fell swoop?
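For the existing rows, a one-off UPDATE along the lines of the earlier answer should work, assuming the column is called Ingred_count and the breaks are stored as "\r\n" (as the explode() above suggests):

// Counts CRLF pairs; divide by 2 because each "\r\n" break is two characters long.
$pdo->exec("UPDATE Recipes
            SET Ingred_count = (LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\\r\\n', ''))) / 2");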