I have quite an interesting task, but I don't know what to call it in one word in order to search for related topics. Even this title might not reflect what I need, so if somebody has a better one - welcome.
I'll try to explain my problem.
I have about 100,000 rows in a MySQL table, and I need to "compare" entries from that table.
"compare" doesn't mean just equal. There is an algorithm for calculation comparison level. I have weight coefficient for each table column. Means that if entry#1's column1 equals to entry#2's column2 then I give, say, 5 point to this pair. And so on for each column.
The most straightforward way to do this is to apply the calculation rules to each pair of entries. Why am I afraid of this? 100,000 entries means about 5 billion "compare" operations. Sure, I could calculate this on demand and store the result somewhere in a cache, but I believe the most obvious way is not the most effective one.
So, my first question is: is there any better way to achieve my goal other than brute force?
My second question is about which tool is better for the calculations:
1. The application language is PHP, so load the whole table into memory and iterate over the data there.
2. Create a stored procedure in MySQL.
3. Use MongoDB's aggregation framework or MapReduce.
I like the first option least of all, and the last one most of all.
I'm looking for any suggestions or advice from people who have experience with this sort of case.
Since I don't even know what to ask Google, any links will be appreciated.
UPDATE:
The calculation rules are a bit more complicated than I described...
The table has a set of related columns which are to be used together as a group (not one by one).
Let's assume:
the table has fields, say, tag_1, tag_2, ..., tag_n;
row_1 and row_2 are entries in the table.
The rule (pseudo-code, PHP-style):
$row_2_tags = array($row_2['tag_1'], $row_2['tag_2'], /* ..., */ $row_2['tag_n']);
if ($row_1['tag_1'] == $row_2['tag_1'])
{
    $points += 10; // same tag in the same position
}
elseif (in_array($row_1['tag_1'], $row_2_tags))
{
    $points += 5; // tag present in row_2, but in a different position
}
// and so on for tag_2 .. tag_n
Basically, I need to find the intersection of the two tag arrays. If it is not empty, points are given. If the positions of the matching tags in the two rows also coincide, additional points are given.
I'm wondering how this can be accomplished in the stored procedure language, because it can be done pretty easily in any programming language.
If a stored procedure can do this, then it is my choice.
If you have a static table, then it doesn't make much difference which you choose, so long as you store the results somewhere (presumably back in the database).
If your data is changing, then you need to compare each new row to all existing rows, which is essentially a full-table scan. This is probably best done in the database.
If the data fits into memory (and 100,000 rows should fit into memory), then (2) will probably be faster than (3) on equivalent hardware. "Equivalent hardware" is a very important consideration.
In most cases, I would opt for (2). It sounds like the query is something like:
select t1.id, t2.id,
       ((case when t1.col1 = t2.col1 then 5 else 0 end) +
        (case when t1.col2 = t2.col2 then 7 else 0 end) +
        . . .
       ) as score
from t t1 cross join t t2
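The tag rules from the update fit the same pattern. A sketch of how they could slot into the CASE list, assuming just three tag columns (the column names and count are placeholders, not from your schema):
(case when t1.tag_1 = t2.tag_1 then 10
      when t1.tag_1 in (t2.tag_2, t2.tag_3) then 5
      else 0 end) +
(case when t1.tag_2 = t2.tag_2 then 10
      when t1.tag_2 in (t2.tag_1, t2.tag_3) then 5
      else 0 end) +
(case when t1.tag_3 = t2.tag_3 then 10
      when t1.tag_3 in (t2.tag_1, t2.tag_2) then 5
      else 0 end)
Each case awards 10 points for a tag matching in the same position, and 5 points for a tag that appears among the other row's tags in a different position, mirroring the pseudo-code in the question.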
If you are much more comfortable with map-reduce, then you might find it easier to code there. I know both languages and prefer SQL for something like this.
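Following the "store the results" advice above, the scores can be materialized once, and a join condition of t1.id < t2.id scores each unordered pair only once, halving the roughly 5 billion comparisons. A minimal sketch (the pair_scores table and the id primary key are assumptions):
create table pair_scores (
  id1 int not null,
  id2 int not null,
  score int not null,
  primary key (id1, id2)
);

insert into pair_scores (id1, id2, score)
select t1.id, t2.id,
       ((case when t1.col1 = t2.col1 then 5 else 0 end) +
        (case when t1.col2 = t2.col2 then 7 else 0 end))
from t t1
join t t2 on t1.id < t2.id;  -- each unordered pair exactly once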
Can't you do something like this:
UPDATE table SET points = points+5 WHERE column1 = column2
If you have to check for a specific value, you could try something like this:
UPDATE table SET points = points+5 WHERE column1 = 'somevalue' AND column2 = 'somevalue'
Related
I am trying to make my DB more optimized and am at the beginning of indexing it, but I'm not sure how to do it right.
I have this query:
$year = date("Y");
$thisYear = $year;
//$nextYear = $thisYear + 1;
$sql = mysql_query("SELECT SUM(points) as userpoints
FROM ".$prefix."_publicpoints
WHERE date BETWEEN '$thisYear" . "-01-01' AND '$thisYear" . "-12-31' AND fk_player_id = $playerid");
$row = mysql_fetch_assoc($sql);
$userPoints = $row['userpoints'];
$sql = mysql_query("SELECT
fk_player_id
FROM ".$prefix."_publicpoints
WHERE date BETWEEN '$thisYear" . "-01-01' AND '$thisYear" . "-12-31'
GROUP BY fk_player_id
HAVING SUM(points) > $userPoints");
$row = mysql_fetch_assoc($sql);
$userWrank = mysql_num_rows($sql)+1;
I am not sure how to index this. I have tried indexing fk_player_id, but it still looks through all the rows (287937).
I have indexed the date field which gives me this back in EXPLAIN:
id:            1
select_type:   SIMPLE
table:         nf_publicpoints
type:          range
possible_keys: IDXdate
key:           IDXdate
key_len:       3
ref:           NULL
rows:          143969
Extra:         Using where with pushed condition; Using temporary...
I also have 2 calls to the same table... Could that be done in one?
How do I index this and/or could it be done smarter?
You should definitely spend some time reading up on indexing, there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option: you scan the table, checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is M! For instance, you can perform a binary search! Whereas the table scan requires you to look at all N rows (where N is the number of rows), the binary search only requires that you look at log(N) index entries, in the very worst case. Wow, that's sure a lot easier!
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.
Hopefully that covers your first two questions (as others have answered -- you need to find the right balance).
Your third scenario is a little more complicated. If you're using LIKE, indexing engines will typically help with your read speed up to the first "%". In other words, if you're SELECTing WHERE column LIKE 'foo%bar%', the database will use the index to find all the rows where column starts with "foo", and then need to scan that intermediate rowset to find the subset that contains "bar". SELECT ... WHERE column LIKE '%bar%' can't use the index. I hope you can see why.
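To make that concrete, a quick sketch (table and column names are invented):

-- Can use an index on col: the fixed prefix 'foo' narrows the scan
-- to the index range starting with 'foo'.
SELECT * FROM t WHERE col LIKE 'foo%bar%';

-- Cannot use the index: with a leading wildcard, any row could match.
SELECT * FROM t WHERE col LIKE '%bar%';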
Finally, you need to start thinking about indexes on more than one column. The concept is the same, and it behaves similarly to the LIKE stuff -- essentially, if you have an index on (a,b,c), the engine will continue using the index from left to right as best it can. So a search on column a might use the (a,b,c) index, as would one on (a,b). However, the engine would need to do a full table scan if you were searching WHERE b=5 AND c=1.
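Again as a sketch, with hypothetical names:

CREATE INDEX idx_abc ON t (a, b, c);

SELECT * FROM t WHERE a = 1;            -- can use idx_abc (leftmost column)
SELECT * FROM t WHERE a = 1 AND b = 5;  -- can use idx_abc (leftmost prefix)
SELECT * FROM t WHERE b = 5 AND c = 1;  -- cannot: no condition on a, so full scan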
Hopefully this helps shed a little light, but I must reiterate that you're best off spending a few hours digging around for good articles that explain these things in depth. It's also a good idea to read your particular database server's documentation. The way indices are implemented and used by query planners can vary pretty widely.
For more information and examples, visit: http://blog.sqlauthority.com/category/sql-index/
Try creating an index on the date column; indexing fk_player_id alone will not help with this query. If that does not work, paste your EXPLAIN output...
For more information about indexes in MySQL, look here: http://hackmysql.com/case1
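Since the EXPLAIN in the question shows an index on date already being used (a range scan over 143969 rows), a composite index is the next thing to try. A sketch using the names from the question:

CREATE INDEX IDXplayer_date ON nf_publicpoints (fk_player_id, date);

For the first query (one player, one year), MySQL can then narrow by player first and scan only that player's date range.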
Why not index the date column, seeing how that's the main criterion that will be evaluated in the lookup?
I have recently written a survey application that has done its job, and all the data is gathered. Now I have to analyze the data, and I'm having some time issues.
I have to find out how many people selected what option and display it all.
I'm using this query, which does do its job:
SELECT COUNT(*)
FROM survey
WHERE users = ? AND `table` = ? AND col = ? AND `row` = ? AND selected = ?
GROUP BY users, `table`, col, `row`, selected
As is evident from the "?", I'm using MySQLi (in PHP) to fetch the data when needed, but I fear this is what's making it so slow.
The table consists of all the elements above (plus a unique ID), and all of them are integers.
To explain some of the fields:
Each survey was divided into 3 or 4 tables (sized from 2x3 to 5x5) with a 1-to-10 happiness grade to select from. (Questions are on the right and top of the table; you answer where the questions intersect.)
users - age groups
table, row, col - explained above
selected - the selected grade, explained above
Now, with the surveys complete and around 1 million entries in the table, the query is getting very slow. Sometimes it takes about 3 minutes; sometimes (I guess) the time limit expires and you get no data at all. I also don't have access to the full database, just my empty "testing" one, since the customer is kind of paranoid :S (and his server seems to be a bit slow).
Now (after the initial essay) my questions are: I left indexing out intentionally because, with a lot of data being written during the survey, it would have been a bad idea. But since no new data is coming in at this point, would it make sense to index all the fields of the table? How much sense does it make to index integers that never go above 10? (As you can guess, I haven't got a clue about indexes.) Do I need the primary unique ID in this table?
I read somewhere that indexing may help GROUP BY, but only if you group by the first columns in the table (and since my ID is first and, from my point of view, useless - can I remove it and gain anything by that?)
Is there another way to write my query that would basically do the same thing but in a shorter period of time?
Thanks for all your suggestions in advance!
Add an index on the columns that appear in your GROUP BY or WHERE clauses. In your case, that's ONE index covering users, table, col, row and selected.
Some quick rules:
Order the fields so that the WHERE columns come first and the GROUP BY columns come last.
If you have other queries that use only part of it (e.g. users, table, col and selected), then put the missing column (row, in this example) last.
Don't use too many indexes, as each one slows down updates to the table marginally - so on a really large system you need to balance queries against indexes.
Edit: do you need the GROUP BY users, table, col, row at all, since those columns are already fixed by the WHERE? If the WHERE has already filtered them down, you only need to group by selected.
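Either way, a sketch of the combined index described above, using the column names from the question (note that table is a reserved word in MySQL, and row is reserved in newer versions, so backticks are safest):

ALTER TABLE survey ADD INDEX idx_answers (users, `table`, col, `row`, selected);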
Working in Drupal 6, PHP 5.3, and MySQL, I'm building a query that looks roughly like this:
SELECT val from table [and some other tables joined in below]
where [a bunch of clauses, including getting all the tables joined up]
and ('foo' not in (select ...))
and (('bar' in (select...) and x = y)
or ('baz' in (select ...) and p = q))
That's not a great representation of what I'm trying to do, but hopefully it will be enough. The point is that, in the middle of the query there is an embedded SELECT that is used a number of times. It's always the same. It's not completely self-contained -- it relies on a value pulled from one of the tables at the top level of the query.
I'm feeling a little guilty/unclean for just repeating the query every time it's needed, but I don't see any other way to compute the value once and reuse it as needed. Since it refers to a value pulled from one of the tables at the top level of the query, I can't compute it once outside the query and just insert the value into the query, either through a MySQL variable or by monkeying around with the query string. Or so I think, anyway.
Is there anything I can do about this? Or maybe it's a non-issue from a performance perspective: the code might be nasty, but perhaps MySQL is smart enough to cache the value itself and avoid executing the subquery over and over again? Any advice? Thanks!
You should be able to alias the result by doing SELECT ... AS alias and then use the alias in the rest of the query, since a SELECT is really just a table.
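A sketch of that idea: pull the repeated subquery out into a derived table and join it on the correlating column, so it is written (and computed) only once. All names here are invented, since the full query isn't shown:

SELECT t.val
FROM t
JOIN (SELECT key_col,
             MAX(tag = 'foo') AS has_foo,
             MAX(tag = 'bar') AS has_bar,
             MAX(tag = 'baz') AS has_baz
      FROM tag_table
      GROUP BY key_col) tg ON tg.key_col = t.key_col
WHERE tg.has_foo = 0
  AND ((tg.has_bar = 1 AND t.x = t.y)
    OR (tg.has_baz = 1 AND t.p = t.q));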
I am indexing all the columns that I use in my WHERE / ORDER BY; is there anything else I can do to speed the queries up?
The queries are very simple, like:
SELECT COUNT(*)
FROM TABLE
WHERE user = id
AND other_column = 'something'
I am using PHP 5, MySQL client version: 4.1.22 and my tables are MyISAM.
Talk to your DBA. Run your local equivalent of showplan. For a query like your sample, I would suspect that a covering index on the columns id and other_column would greatly speed up performance. (I assume user is a variable or niladic function).
A good general rule is that the columns in the index should go from left to right in descending order of variance. That is, the column varying most rapidly in value should be the first column in the index, and the column varying least rapidly should be the last. It seems counterintuitive, but there you go: the query optimizer likes narrowing things down as fast as possible.
If all your queries include a user id then you can start with the assumption that userid should be included in each of your indexes, probably as the first field. (Can we assume that the user id is highly selective? i.e. that any single user doesn't have more than several thousand records?)
So your indexes might be:
user + otherfield1
user + otherfield2
etc.
If your user id is really selective, like several dozen records, then just the index on that field should be pretty effective (sub-second return).
What's nice about a "user + otherfield" index is that MySQL doesn't even need to look at the data records: the index has a pointer for each record, and it can just count the pointers.
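A sketch of such a covering index for the sample query (your_table stands in for the real table name):

CREATE INDEX idx_user_other ON your_table (user, other_column);

With that index in place, the COUNT(*) can be satisfied from the index alone, without touching the table rows.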
I'm building a webpage in PHP using MySQL as my database.
Which way is faster?
Two requests to MySQL with the following queries:
SELECT points FROM data;
SELECT sum(points) FROM data;
One request to MySQL; hold the result in a temporary array and calculate the sum in PHP.
$data = mysql_query("SELECT points FROM data");
EDIT -- the data is about 200-500 rows
It's really going to depend on a lot of different factors. I would recommend trying both methods and seeing which one is faster.
Since Phill and Kibbee have answered this pretty effectively, I'd like to point out that premature optimization is a Bad Thing (TM). Write what's simplest for you and profile, profile, profile.
How much data are we talking about? I'd say MySQL is probably faster at doing those kind of operations in the majority of cases.
Edit: with the kind of data that you're talking about, it probably won't make masses of difference. But databases tend to be optimised for those kind of queries, whereas PHP isn't. I think the second DB query is probably worth it.
If you want the total in the same result set, use a running total like this:
SET @total = 0;
SELECT points, @total := @total + points AS RunningTotal FROM data;
I wouldn't worry about it until I had an issue with performance.
If you go with two separate queries, you need to watch out for the possibility of the data changing between getting the rows & getting their sum. Until there's an observable performance problem, I'd stick to doing my own summation to keep the page consistent.
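If you do go with two separate queries and the table is InnoDB, one way to keep the two reads consistent is to wrap them in a single transaction (a sketch; under MySQL's default REPEATABLE READ isolation level, both SELECTs then see the same snapshot):

START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT points FROM data;
SELECT SUM(points) FROM data;
COMMIT;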
The general rule of thumb for efficiency with MySQL is to minimize the number of SQL requests. Every call to the database adds overhead and is "expensive" in terms of time required.
The optimization done by MySQL is quite good. It can take very complex requests with many joins, nestings and computations, and make them run efficiently.
But it can only optimize individual requests. It cannot check the relationship between two different SQL statements and optimize between them.
In your example 1, the two statements will make two requests to the database and the table will be scanned twice.
Your example 2 where you save the result and compute the sum yourself would be faster than 1. This would only be one database call, and looping through the data in PHP to get the sum is faster than a second call to the database.
Just for the fun of it.
SELECT SUM(points) AS points FROM `data`
UNION ALL
SELECT points FROM `data`
The first row will be the total; the rows after it will be the data. (UNION ALL is used so that duplicate point values don't get collapsed.)
NOTE: a union can be slow, but it's an option.
For more fun, and in a form that lets you sort the rows:
SELECT 'total' AS name, SUM(points) AS points FROM `data`
UNION ALL
SELECT 'points' AS name, points FROM `data`
Then read it in PHP:
while ($row = mysql_fetch_assoc($query))
{
    if ($row["name"] == "points")
    {
        echo $row["points"];
    }
    if ($row["name"] == "total")
    {
        echo "Total is: " . $row["points"];
    }
}
You can use a union like this:
(SELECT points, NULL AS total FROM data)
UNION ALL
(SELECT NULL, SUM(points) FROM data);
The result will look something like this:
points  total
2       NULL
5       NULL
...
NULL    7
You can figure out how to handle it from there.
Do it the MySQL way: let the database manager do its work. MySQL is optimized for such tasks.