I want to compare two product databases based on title.
The first dataset has about 3 million records and the second has about 10 million; I am doing this to remove duplicate products.
I first tried a PHP program that runs a MySQL query checking the title (name = '$name'); if the query returns zero rows, the product is unique. But this is quite slow, about 2 seconds per result.
The second method I tried was storing the data in a text file and using regular expressions, but that was also slow.
What is the best way to compare such large datasets to find the unique products?
Table DDL:
CREATE TABLE main (
  id int(11) NOT NULL AUTO_INCREMENT,
  name text,
  image text,
  price int(11) DEFAULT NULL,
  store_link text,
  status int(11) NOT NULL,
  cat text NOT NULL,
  store_single text,
  brand text,
  imagestatus int(11) DEFAULT NULL,
  time text,
  PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=9250887 DEFAULT CHARSET=latin1;
Since you have to go over 10 million titles 3 million times, it's going to take some time. My approach would be to see if you can get all the titles from both lists into a PHP script and compare them there in memory. Have the script write DELETE statements to a text file, which you then execute on the database.
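A minimal PHP sketch of that idea, assuming the 3 million titles live in a table called small_products and the 10 million in big_products (both hypothetical names), the DSN/credentials are placeholders, and the server has enough RAM for PHP to hold 3 million titles as array keys:
<?php
// Sketch: load the 3M titles into a PHP array (hash lookup), stream the
// 10M titles past it, and write DELETE statements for the duplicates.
$pdo = new PDO('mysql:host=localhost;dbname=products;charset=latin1', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$titles = [];
foreach ($pdo->query('SELECT name FROM small_products') as $row) {
    $titles[strtolower(trim($row['name']))] = true;   // array keys give fast hash lookups
}

// Stream the big table instead of buffering all 10M rows client-side.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$out = fopen('deletes.sql', 'w');
foreach ($pdo->query('SELECT id, name FROM big_products') as $row) {
    if (isset($titles[strtolower(trim($row['name']))])) {
        // Duplicate title: write a DELETE for later batch execution.
        fwrite($out, 'DELETE FROM big_products WHERE id = ' . (int) $row['id'] . ";\n");
    }
}
fclose($out);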
Not in your question, but probably your next problem: different spellings. See
similar_text()
soundex()
levenshtein()
for some help with that; a rough sketch follows below.
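For example, a rough check (my own illustration, not from the question) that treats two titles as the same product when their levenshtein() distance is small relative to their length:
<?php
// Sketch: consider two titles duplicates if the edit distance is at most
// ~10% of the longer title's length. The 10% threshold is arbitrary.
function titles_match(string $a, string $b): bool
{
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    if ($a === $b) {
        return true;
    }
    $maxLen = max(strlen($a), strlen($b));
    // PHP's levenshtein() only accepts strings up to 255 characters.
    if ($maxLen === 0 || $maxLen > 255) {
        return false;
    }
    return levenshtein($a, $b) <= $maxLen * 0.1;
}

var_dump(titles_match('Apple iPhone 5S 16GB Gold', 'Apple iPhone 5s 16 GB Gold')); // bool(true)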
In my opinion, this is what databases are made for. I wouldn't reinvent the wheel in your shoes.
Once that is agreed, you should really check your database structure and indexing to speed up your operations.
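For example (my own sketch, not part of the original answer): name is a TEXT column, so an equality search on it cannot use an index unless you add a prefix index, something like:
<?php
// Sketch: a prefix index on the first 255 characters of name, so that
// WHERE name = '...' lookups no longer require a full table scan.
// (255 is an assumption; with latin1 it stays under InnoDB's key-length limit.)
$pdo = new PDO('mysql:host=localhost;dbname=products;charset=latin1', 'user', 'pass');
$pdo->exec('ALTER TABLE main ADD INDEX idx_name (name(255))');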
I have been using SQLyog to compare databases of around 1-2 million rows. It offers "One-way synchronization", "Two-way synchronization" and "Visually merge data" options to sync the databases.
The important part is that it can compare the data in chunks, and you can specify the chunk size yourself in order to avoid connection loss.
If your DB supports it, use a LEFT JOIN and filter the rows where the right side is not null. But first create indexes on your keys in both tables (the name column).
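A sketch of that query, assuming the two datasets are tables small_products and big_products (hypothetical names), each with an indexed name column; the rows of big_products whose title also appears in small_products are the duplicates:
<?php
// Sketch: rows of big_products whose name also exists in small_products
// (the right side is not null) are the duplicates to remove.
$pdo = new PDO('mysql:host=localhost;dbname=products;charset=latin1', 'user', 'pass');
$duplicateIds = $pdo->query(
    'SELECT b.id
       FROM big_products AS b
       LEFT JOIN small_products AS s ON s.name = b.name
      WHERE s.name IS NOT NULL'
)->fetchAll(PDO::FETCH_COLUMN);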
If your computer/server has enough memory to load the 3 million objects into a HashSet, create a HashSet keyed on the NAME, then read the other set (10 million objects) one by one and check whether each object exists in the HashSet. If it exists, it is a duplicate. (I would suggest dumping the data into text files and then reading the files to build the structure.)
If the previous strategies fail, then it is time to implement some kind of MapReduce. You can implement it by applying one of the previous approaches to a subset of your data, for example comparing all the products that start with the same letter.
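A rough PHP sketch of that letter-by-letter split (the table names small_products and big_products and the connection details are placeholders): each pass only loads the titles beginning with one character, compares them in memory, and moves on.
<?php
// Sketch: process the data one leading character at a time so that only a
// subset of titles has to be held in memory per pass.
$pdo = new PDO('mysql:host=localhost;dbname=products;charset=latin1', 'user', 'pass');
foreach (array_merge(range('a', 'z'), range('0', '9')) as $prefix) {
    $small = $pdo->prepare('SELECT name FROM small_products WHERE name LIKE ?');
    $small->execute([$prefix . '%']);
    $seen = array_fill_keys(array_map('strtolower', $small->fetchAll(PDO::FETCH_COLUMN)), true);

    $big = $pdo->prepare('SELECT id, name FROM big_products WHERE name LIKE ?');
    $big->execute([$prefix . '%']);
    foreach ($big as $row) {
        if (isset($seen[strtolower($row['name'])])) {
            // Duplicate: handle it (collect the id, write a DELETE, etc.)
        }
    }
}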
I tried a lot with MySQL queries but it was very slow. The only solution I found was using Sphinx: index the whole database, search for every product title in the Sphinx index, and at the same time remove the duplicate products using the ids returned by Sphinx.
Related
I migrated a custom-made web site to WordPress. First I have to migrate the data from the previous web site, and then, every day, I have to perform some data insertion using an API.
The data I'd like to insert comes with a unique ID representing a single football game.
In order to avoid inserting the same game multiple times, I made a DB table with the following structure:
CREATE TABLE `ss_highlight_ids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`highlight_id` int(10) unsigned zerofill NOT NULL DEFAULT '0000000000',
PRIMARY KEY (`id`),
UNIQUE KEY `highlight_id_UNIQUE` (`highlight_id`),
KEY `highlight_id_INDEX` (`highlight_id`) COMMENT 'Contains a list with all the highlight IDs. This is used as an index and disallows the creation of double records.'
) ENGINE=InnoDB AUTO_INCREMENT=2967 DEFAULT CHARSET=latin1
When I try to insert a new record into my WordPress DB, I first want to look up this table to see if the ID already exists.
The question now :)
What's preferable? To load all the IDs with a single SQL query and then use plain PHP to check if the current game ID exists, or is it better to query the DB for every single row I insert?
I know that MySQL queries are resource-expensive, but on the other hand I currently have about 3K records in this table, and it will grow past 30-40K in the next few years, so I don't know if it's good practice to load all of those records into PHP.
What is your opinion / suggestion ?
UPDATE #1
I just found that my table is 272 KiB with 2,966 rows. This means that in the near future it will probably be around 8,000 KiB or more, and keep growing.
UPDATE #2
Maybe I have not made it clear enough. For the first insertion I have to iterate over a CSV file with about 12K records, and after the CSV insertion I will insert about 100-200 records every day. All of those records require a lookup in the table with the IDs.
So the exact question is: is it better to run 12K MySQL queries during the CSV insertion and then about 100-200 MySQL queries every day, or to just load the IDs into server memory and use PHP for the lookup?
Your table has a column id which is AUTO_INCREMENT; that means there is no need to insert anything into that column, it fills itself in.
highlight_id is UNIQUE, so it may as well be the PRIMARY KEY; get rid of id.
A PRIMARY KEY is a UNIQUE key is an INDEX. So this is redundant:
KEY `highlight_id_INDEX` (`highlight_id`)
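With both changes applied, the table could be reduced to something like this (just a sketch, not a drop-in migration; the DSN and credentials are placeholders):
<?php
// Sketch of the slimmed-down table: highlight_id as the primary key,
// no surrogate id column, no redundant secondary index.
$pdo = new PDO('mysql:host=localhost;dbname=wordpress;charset=latin1', 'user', 'pass');
$pdo->exec("
    CREATE TABLE ss_highlight_ids (
        highlight_id INT(10) UNSIGNED ZEROFILL NOT NULL,
        PRIMARY KEY (highlight_id)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
");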
Back to your question... SQL is designed to do things in batches. Don't defeat that by doing things one row at a time.
How can the table be 272 KiB if it has only two columns and 2,966 rows? If there are more columns in the table, show them; they often give good clues about what you are doing and how to make it more efficient.
2966 rows is 'trivial'; you will have to look closely to see performance differences.
Loading from CSV...
If this is a replacement, use LOAD DATA, building a new table, then RENAME to put it into place. One CREATE, one LOAD, one RENAME, one DROP. Much more efficient than 100 queries of any kind.
If the CSV is updates/inserts, LOAD into a temp table, then do INSERT ... ON DUPLICATE KEY UPDATE ... to perform the updates/inserts into the real table. One CREATE, one LOAD, one IODKU. Much more efficient than 100 queries of any kind. (A sketch follows below.)
If the CSV is something else, please elaborate.
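For the 12K-row CSV, a sketch of that temp-table route (the CSV path, column layout and temp-table name are assumptions, and LOAD DATA LOCAL must be enabled on both the client and the server):
<?php
// Sketch: bulk-load the CSV into a temporary table, then upsert into the
// real table in a single statement instead of 12K individual queries.
$pdo = new PDO(
    'mysql:host=localhost;dbname=wordpress;charset=latin1',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]
);

$pdo->exec('CREATE TEMPORARY TABLE tmp_highlights (highlight_id INT UNSIGNED NOT NULL)');

$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/games.csv'
    INTO TABLE tmp_highlights
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
    (highlight_id)
");

// The ON DUPLICATE KEY UPDATE clause turns would-be duplicate-key errors
// into no-ops, because highlight_id is the table's unique key.
$pdo->exec('
    INSERT INTO ss_highlight_ids (highlight_id)
    SELECT highlight_id FROM tmp_highlights
    ON DUPLICATE KEY UPDATE highlight_id = ss_highlight_ids.highlight_id
');
The daily 100-200 rows can then go through a single multi-row INSERT ... ON DUPLICATE KEY UPDATE in the same spirit.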
I'm building a mobile app and I use PHP & MySQL to write the backend REST API.
I have to store around 50-60 Boolean values in a table called "Reports" (users have to check things in a form); in my mobile app I store the values (0/1) in a simple array. In my MySQL table, should I create a separate column for each Boolean value, or is it enough to simply use a string or an int to store them as a "number" like "110101110110111..."?
I get and put the data with JSON.
UPDATE 1: All I have to do is check whether everything is 1; if one of them is 0, then that's a "problem". In 2 years this table will have around 15,000-20,000 rows. It has to be very fast and as space-saving as possible.
UPDATE 2: In terms of speed, which solution is faster: separate columns, or storing everything in a string/binary type? What if I have to check which ones are the 0s? Is it a good solution to store it as a "number" in one column and, if it's not "111...111", send it to the mobile app as JSON, where I parse the value and analyse it on the user's device? Let's say I have to deal with 50K rows.
Thanks in advance.
A separate column per value is more flexible when it comes to searching.
A separate key/value table is more flexible if different rows have different collections of Boolean values.
And, if
your list of Boolean values is more-or-less static
all your rows have all those Boolean values
your performance-critical search is to find rows in which any of the values are false
then using text strings like '1001010010' etc is a good way to store them. You can search like this
WHERE flags <> '11111111'
to find the rows you need.
You could use a BINARY column with one bit per flag. But your table will be easier to use for casual queries and eyeball inspection if you use text. The space savings from using BINARY instead of CHAR won't be significant until you start storing many millions of rows.
edit It has to be said: every time I've built something like this with arrays of Boolean attributes, I've later been disappointed at how inflexible it turned out to be. For example, suppose it was a catalog of light bulbs. At the turn of the millennium, the Boolean flags might have been stuff like
screw base
halogen
mercury vapor
low voltage
Then, things change and I find myself needing more Boolean flags, like,
LED
CFL
dimmable
Energy Star
etc. All of a sudden my data types aren't big enough to hold what I need them to hold. When I wrote "your list of Boolean values is more-or-less static" I meant that you don't reasonably expect to have something like the light-bulb characteristics change during the lifetime of your application.
So, a separate table of attributes might be a better solution. It would have these columns:
item_id fk to item table -- pk
attribute_id attribute identifier -- pk
attribute_value
This is ultimately flexible. You can just add new flags. You can add them to existing items, or to new items, at any time in the lifetime of your application. And, every item doesn't need the same collection of flags. You can write the "what items have any false attributes?" query like this:
SELECT DISTINCT item_id FROM attribute_table WHERE attribute_value = 0
But, you have to be careful because the query "what items have missing attributes" is a lot harder to write.
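For completeness, a hedged sketch of that harder query, assuming separate items and attributes catalog tables exist (those table names are my own) so the expected (item, attribute) combinations can be generated and checked against what is actually stored:
<?php
// Sketch: items that are missing at least one attribute, i.e. some
// (item, attribute) combination has no row in attribute_table at all.
$pdo = new PDO('mysql:host=localhost;dbname=catalog;charset=utf8', 'user', 'pass');
$missing = $pdo->query('
    SELECT DISTINCT i.item_id
      FROM items AS i
     CROSS JOIN attributes AS a
      LEFT JOIN attribute_table AS t
             ON t.item_id = i.item_id
            AND t.attribute_id = a.attribute_id
     WHERE t.item_id IS NULL
')->fetchAll(PDO::FETCH_COLUMN);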
For your specific purpose, when any zero flag is a problem (an exception) and most entries (like 99%) will be "1111...1111", I don't see any reason to store them all. I would rather create a separate table that only stores the unchecked flags. The table could look like: unchecked_flags (user_id, flag_id). In another table you store your flag definitions: flags (flag_id, flag_name, flag_description).
Then your report is as simple as SELECT * FROM unchecked_flags.
Update - possible table definitions:
CREATE TABLE `flags` (
`flag_id` TINYINT(3) UNSIGNED NOT NULL AUTO_INCREMENT,
`flag_name` VARCHAR(63) NOT NULL,
`flag_description` TEXT NOT NULL,
PRIMARY KEY (`flag_id`),
UNIQUE INDEX `flag_name` (`flag_name`)
) ENGINE=InnoDB;
CREATE TABLE `unchecked_flags` (
`user_id` MEDIUMINT(8) UNSIGNED NOT NULL,
`flag_id` TINYINT(3) UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`, `flag_id`),
INDEX `flag_id` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_flags` FOREIGN KEY (`flag_id`) REFERENCES `flags` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_users` FOREIGN KEY (`user_id`) REFERENCES `users` (`user_id`)
) ENGINE=InnoDB;
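A possible usage sketch against those two tables (the ids and connection details are made up): record that user 42 left flag 7 unchecked, then build the report together with the flag names:
<?php
// Sketch: store one row per unchecked flag, then report them with names.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=latin1', 'user', 'pass');

$insert = $pdo->prepare('INSERT INTO unchecked_flags (user_id, flag_id) VALUES (?, ?)');
$insert->execute([42, 7]);   // user 42 reported a problem on flag 7

$report = $pdo->query('
    SELECT u.user_id, f.flag_name, f.flag_description
      FROM unchecked_flags AS u
      JOIN flags AS f ON f.flag_id = u.flag_id
')->fetchAll(PDO::FETCH_ASSOC);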
You may get a better search out of using dedicated columns for each Boolean, but the cardinality is poor, and even if you index each column it will involve a fair bit of traversal or scanning.
If you are just looking for HIGH-VALUES (0xFFF...), then definitely bitmap; this solves your cardinality problem (per the OP's update). It's not like you are checking parity... The index tree will, however, be heavily skewed towards HIGH-VALUES if that is the normal case, which can create a hot spot prone to node splitting on inserts.
Bit mapping with bitwise operator masks will save space, but it needs to be aligned to a byte, so there may be an unused "tip" (provisioning for future fields, perhaps), meaning the mask must be kept at a maintained length or the field padded with 1s.
It will also add complexity to your architecture, which may require bespoke coding and bespoke standards.
You need to analyse how important any searching is (you may not ordinarily expect to search all, or even any, of the discrete fields).
This is a very common strategy for denormalising data and also for tuning service requests for specific clients (where some responses are fatter than others for the same transaction).
Case 1: If "problems" are rare.
Have a table Problems with the item ids and a TINYINT identifying which of the 50-60 problems it is. With suitable indexes on that table you can look up whatever you need.
Case 2: Lots of items.
Use a BIGINT UNSIGNED to hold up to 64 0/1 values. Use an expression like 1 << n to build a mask for the nth (counting from 0) bit. If you know, for example, that there are exactly 55 bits, then the value of all 1s is (1<<55)-1. Then you can find the items with "problems" via WHERE bits <> (1<<55)-1 (and the fully-checked items via WHERE bits = (1<<55)-1).
Bit Operators and functions
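To the UPDATE 2 point about finding which flags are the 0s, here is a sketch built on that bitmask scheme; the reports table, the bits column and the 55-bit count are assumptions:
<?php
// Sketch: fetch rows whose 55-bit mask is not all ones, then work out
// which bit positions are 0 on the PHP side.
$pdo  = new PDO('mysql:host=localhost;dbname=app;charset=latin1', 'user', 'pass');
$full = (1 << 55) - 1;   // value meaning "all 55 flags checked"

$stmt = $pdo->prepare('SELECT id, bits FROM reports WHERE bits <> ?');
$stmt->execute([$full]);

foreach ($stmt as $row) {
    $unchecked = [];
    for ($n = 0; $n < 55; $n++) {
        if (((int) $row['bits'] & (1 << $n)) === 0) {
            $unchecked[] = $n;   // the nth flag (counting from 0) is a "problem"
        }
    }
    // e.g. send $unchecked back to the app as JSON
}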
Case 3: You have names for the problems.
SET ('broken', 'stolen', 'out of gas', 'wrong color', ...)
That will build a DATATYPE with (logically) a bit for each problem. See also the function FIND_IN_SET() as a way to check for one problem.
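A small sketch of the SET approach (the table name and the problem names are placeholders):
<?php
// Sketch: a SET column stores any combination of its named members very
// compactly (1-8 bytes, up to 64 members).
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=latin1', 'user', 'pass');
$pdo->exec("
    CREATE TABLE report (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT,
        problems SET('broken', 'stolen', 'out of gas', 'wrong color') NOT NULL DEFAULT '',
        PRIMARY KEY (id)
    ) ENGINE=InnoDB
");

// Rows that include one particular problem:
$stolen = $pdo->query(
    "SELECT id FROM report WHERE FIND_IN_SET('stolen', problems) > 0"
)->fetchAll(PDO::FETCH_COLUMN);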
Cases 2 and 3 will take about 8 bytes for the full set of problems -- very compact. Most SELECTs that you might perform would scan the entire table, but 20K rows won't take terribly long and will be a lot faster than having 60 columns or a row per problem.
I have a MySQL/PHP performance related question.
I need to store an index list associated with each record in a table. Each list contains 1000 indices. I need to be able to quickly access any index value in the list associated to a given record. I am not sure about the best way to go. I've thought of the following ways and would like your input on them:
Store the list in a string as a comma-separated value list or as JSON. Probably terrible performance, since I need to extract the whole list out of the DB into PHP only to retrieve a single value. Parsing the string won't exactly be fast either... I can keep a number of expanded lists in a Least Recently Used cache on the PHP side to reduce load.
Make a list table with 1001 columns that will store the list and its primary key. I'm not sure how costly this is in terms of storage. This also feels like abusing the system. And then, what if I need to store 100,000 indices?
Only store in SQL the name of a binary file containing my indices and perform an fopen(); fseek(); fread(); fclose() cycle for each access? Not sure how the filesystem cache will react to that. If it goes badly there are many solutions available to address the issues... but that sounds a bit overkill, no?
What do you think of that?
What about a good old one-to-many relationship?
records
-------
id int
record ...
indices
-------
record_id int
index varchar
Then:
SELECT *
FROM records
LEFT JOIN indices
ON records.id = indices.record_id
WHERE indices.index = 'foo'
The standard solution is to create another table, with one row per (record, index), and add a MySQL Index to allow fast search
CREATE TABLE IF NOT EXISTS `table_list` (
`IDrecord` int(11) NOT NULL,
`item` int(11) NOT NULL,
KEY `IDrecord` (`IDrecord`)
)
Change the item's type according to your needs - I used int in my example.
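A usage sketch (connection details are placeholders): fetching the whole index list of one record is a single index lookup thanks to the KEY on IDrecord:
<?php
// Sketch: pull the index list of one record; the KEY on IDrecord means
// MySQL only touches the rows belonging to that record.
$pdo  = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('SELECT item FROM table_list WHERE IDrecord = ?');
$stmt->execute([123]);                       // record id 123 is just an example
$items = $stmt->fetchAll(PDO::FETCH_COLUMN); // plain PHP array of the ~1000 items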
The most logical solution would be to put each value in its own tuple. Adding a MySQL index to that table will enable the DBMS to quickly locate the values, and should improve performance.
The reasons we're not going with your other options are as follows:
Option 1
Storing multiple values in one MySQL cell is a violation of the first normal form of database normalisation. You can read up on it here.
Option 3
This relies heavily on external files. You want to keep your data storage as localized as possible, to make it easier to maintain in the future.
Scenario 1
I have one table, let's say "member". In that table I have 7 fields (memid, login_name, password, age, city, phone, country) and about 10K records. I need to fetch one record, so I'm using a query like this:
mysql_query("select * from member where memid=999");
Scenario 2
I have the same table "member", but I split it into two tables: member and member_txt. In member_txt I have (memid, age, phone, city, country) and in member I have (memid, login_name, password).
Which scenario fetches the data more quickly: keeping a single table, or splitting the table into two with a reference?
Note: I need to fetch this data with PHP and MySQL. Please let me know which is the best method to follow.
we have 10K records
For your own health, use the single table approach.
As long as you are using a primary key for memid, things are going to be lightning fast. This is because PRIMARY KEY automatically creates an index, which basically tells MySQL the exact location of the data and eliminates the need to scan through rows it otherwise would.
From http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data. If
a table has 1,000 rows, this is at least 100 times faster than reading
sequentially. If you need to access most of the rows, it is faster to
read sequentially, because this minimizes disk seeks.
Your second approach only makes your system more complex, and provides no benefits.
Use scenario 1.
Make memid a primary/unique key; then having one table is faster than having two tables.
In general you should not see too much impact on performance with 10K rows, as long as you are accessing it by your primary key.
Also note that fetching data from one table is faster than fetching data from 2 tables.
If you want to optimize further, use the column names in the SELECT statement instead of the * operator.
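Putting those points together, a sketch of the scenario 1 lookup using PDO instead of the old mysql_* API (connection details are placeholders, and memid is assumed to already be the primary key):
<?php
// Sketch: single table, primary-key lookup, explicit column list.
$pdo  = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT memid, login_name, age, city, phone, country FROM member WHERE memid = ?'
);
$stmt->execute([999]);
$member = $stmt->fetch(PDO::FETCH_ASSOC);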
I'm currently designing a website using PHP and MySQL, and as the site grows I find myself adding more and more columns to the users table to store various variables.
Which got me thinking: is there a better way to store this information? Just to clarify, the information is global and can be affected by other users, so cookies won't work; besides, I'd lose the information if they cleared their cookies.
The second part of my question is: if it does turn out that storing it in a database is the best way, would it be less expensive to have a large number of columns, or to combine related columns into delimited varchar columns and then explode them in PHP?
Thanks!
In my experience, I'd rather get the database right than start adding comma separated fields holding multiple items. Having to sift through multiple comma separated fields is only going to hurt your program's efficiency and the readability of your code.
Also, if your table is growing too much, then perhaps you need to look into splitting it into multiple tables joined by foreign keys?
I'd create a user_meta table, with three columns: user_id, key, value.
I wouldn't go for the option of grouping columns together and exploding them. It's untidy work and very unmanageable. Instead, maybe try spreading those columns over a few tables and using InnoDB's transaction feature.
If you still dislike the idea of frequently updating the database, and if this method complies with what you're trying to achieve, you can use APC's caching function to store (cache) information "globally" on the server.
MongoDB (and its NoSQL cousins) are great for stuff like this.
The database is a perfectly fine place to store such data, as long as they're variables and not, say, huge image files. The database has all the optimizations and facilities for storing and retrieving large amounts of data. Anything you set up at the file-system level will always be beaten by what the database already offers in terms of speed and functionality.
would it be less expensive to have a large number of columns or rather to combine related columns into delimited varchar columns and then explode them in PHP?
It's not really a performance question so much as a maintenance question, IMO: it's not fun to manage hundreds of columns. Storing such data, perhaps as serialized objects, in a TEXT field is a viable option, as long as you are 100% sure you will never have to run queries against that data.
But why not use a normalized user_variables table like so:
id | user_id | variable_name | variable_value
?
It is a bit more complex to query, but provides for a very clean table structure all round. You can easily add arbitrary user variables that way.
If you are doing a lot of queries like SELECT ... FROM users WHERE variable257 = 'green', you may have to stick with specific columns.
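For reference, the equivalent lookup against the normalized user_variables table would look something like this (a sketch; column names follow the layout above, connection details are placeholders):
<?php
// Sketch: find users whose variable257 is set to 'green' in the
// normalized user_variables table.
$pdo  = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('
    SELECT user_id
      FROM user_variables
     WHERE variable_name = ?
       AND variable_value = ?
');
$stmt->execute(['variable257', 'green']);
$userIds = $stmt->fetchAll(PDO::FETCH_COLUMN);
An index on (variable_name, variable_value) would keep this kind of search fast.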
The database is definitely the best place to store the data. (I'm assuming you were thinking of storing it in flat files otherwise) You'd definitely get better performance and security from using a DB over storing in files.
With regards to storing your data in multiple columns or delimiting it, it's a personal choice, but you should consider a few things:
If you're going to delimit the items, you need to think about what to delimit them with (something that's not likely to crop up within the text you're delimiting).
I often find that it helps to try and visualise whether another programmer of your level would be able to understand what you've done with little help.
Yes, as Pekka said, if you want to perform queries on the stored data you should stick with separate columns.
You may also get a slight performance boost from not retrieving and parsing ALL your data every time, if you just want a couple of fields of information.
I'd suggest going with the separate columns, as it offers much greater flexibility in the future. And there's nothing worse than having to drastically change your data structure and migrate information down the track!
I would recommend setting up a memcached server (see http://memcached.org/). It has proven to be viable with lots of the big sites. PHP has two extensions that integrate a client into your runtime (see http://php.net/manual/en/book.memcached.php).
Give it a try, you won't regret it.
EDIT
Sure, this will only be an option for data that's frequently used and would otherwise have to be loaded from your database again and again. Keep in mind though that you will still have to save your data to some kind of persistent storage.
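A minimal sketch with the Memcached extension linked above (host, port, key name and the fallback query are placeholders), caching one user's variables and falling back to the database on a miss:
<?php
// Sketch: cache user variables in memcached; the database stays the
// persistent source of truth and is only hit on a cache miss.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$userId = 42;
$key    = 'user_vars_' . $userId;

$vars = $cache->get($key);
if ($vars === false && $cache->getResultCode() === Memcached::RES_NOTFOUND) {
    $pdo  = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT variable_name, variable_value FROM user_variables WHERE user_id = ?');
    $stmt->execute([$userId]);
    $vars = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
    $cache->set($key, $vars, 300);   // cache for 5 minutes
}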
A document-oriented database might be what you need.
If you want to stick to a relational database, don't take the naïve approach of just creating a table with oh so many fields:
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
PROPERTY_1 VARCHAR(50),
PROPERTY_2 VARCHAR(50),
PROPERTY_3 VARCHAR(50),
...
PROPERTY_915 VARCHAR(50),
PRIMARY KEY (ENTITY_ID)
);
Instead define an Attribute table:
CREATE TABLE Attribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
DESCRIPTION VARCHAR(30),
/* optionally */
DEFAULT_VALUE /* whatever type you want */,
/* end_optionally */
PRIMARY KEY (ATTRIBUTE_ID)
);
Then define your SomeEntity table, which only includes the essential attributes (for example, required fields in a registration form):
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
ESSENTIAL_1 VARCHAR(30),
ESSENTIAL_2 VARCHAR(30),
ESSENTIAL_3 VARCHAR(30),
PRIMARY KEY (ENTITY_ID)
);
And then define a table for those attributes that you might or might not want to store.
CREATE TABLE EntityAttribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
ENTITY_ID CHAR(10) NOT NULL,
ATTRIBUTE_VALUE /* the same type as SomeEntity.DEFAULT_VALUE;
if you didn't create that field, then any type */,
PRIMARY KEY (ATTRIBUTE_ID, ENTITY_ID)
);
Evidently, in your case, that SomeEntity is the user.
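A sketch of reading one entity's optional attributes back out of that structure (connection details and the entity id are made up):
<?php
// Sketch: all stored optional attributes, with descriptions, for one entity.
$pdo  = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('
    SELECT a.ATTRIBUTE_ID, a.DESCRIPTION, ea.ATTRIBUTE_VALUE
      FROM EntityAttribute AS ea
      JOIN Attribute       AS a ON a.ATTRIBUTE_ID = ea.ATTRIBUTE_ID
     WHERE ea.ENTITY_ID = ?
');
$stmt->execute(['USER000001']);   // a made-up CHAR(10) entity id
$attributes = $stmt->fetchAll(PDO::FETCH_ASSOC);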
Instead of MySQL you might consider using a triplestore, or a key-value store
That way you get the benefits of having all the multithreading, multiuser, performance and caching voodoo figured out, without all the trouble of trying to figure out ahead of time what kind of values you really want to store.
Downside: it's a bit more costly to figure out the average salary of all the people in Idaho who also own hats.
It depends on what kind of user info you are storing. If it's session-pertinent data, use PHP sessions in coordination with a custom session save handler to store your session data in a single data field in the DB.
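For example, a compact custom save handler (a sketch only, PHP 8 syntax; the sessions table layout and the PDO DSN are assumptions) that keeps each session's data in a single column:
<?php
// Sketch: store PHP session data in one row/column per session.
// Assumed table: sessions (id VARCHAR(128) PRIMARY KEY, data BLOB, updated_at INT).
class DbSessionHandler implements SessionHandlerInterface
{
    public function __construct(private PDO $pdo) {}

    public function open(string $path, string $name): bool { return true; }
    public function close(): bool                          { return true; }

    public function read(string $id): string|false
    {
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        $data = $stmt->fetchColumn();
        return $data === false ? '' : $data;   // '' means "no session data yet"
    }

    public function write(string $id, string $data): bool
    {
        $stmt = $this->pdo->prepare(
            'REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)'
        );
        return $stmt->execute([$id, $data, time()]);
    }

    public function destroy(string $id): bool
    {
        return (bool) $this->pdo->prepare('DELETE FROM sessions WHERE id = ?')->execute([$id]);
    }

    public function gc(int $max_lifetime): int|false
    {
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE updated_at < ?');
        $stmt->execute([time() - $max_lifetime]);
        return $stmt->rowCount();   // number of expired sessions removed
    }
}

$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
session_set_save_handler(new DbSessionHandler($pdo), true);
session_start();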