Using PHP to optimize MySQL query

Using PHP to optimize MySQL query - php

Let's assume I have the following query:
SELECT address
FROM addresses a, names n
WHERE a.address_id = n.address_id
GROUP BY n.address_id
HAVING COUNT(*) >= 10
If the two tables were large enough (think if we had the whole US population in these two tables) then running an EXPLAIN on this SELECT would say that Using temporary; Using filesort which is usually not good.
If we have a DB with many concurrent INSERTs and SELECTs (like this) would delegating the GROUP BY a.address_id HAVING COUNT(*) >= 10 part to PHP be a good plan to minimise DB resources? What would the most efficient way (in terms of computing power) to code this?
EDIT: It seems the consensus is that offloading to PHP is the wrong move. How then, could I improve the query (let's assume indexes have been created properly)? More sepcifically how do I avoid the DB from creating a temporary table?

So your plan to minimize resources is by sucking all the data out of the database and having PHP process it, causing extreme memory usage?
Don't do client-side processing if at all possible - databases are DESIGNED for this sort of heavy work.

Offloading this to PHP is probably the opposite direction you want to go. If you must do this on a single machine then the database is likely the most efficient place to do it. If you have a bunch of PHP machines and only a single DB server, then offloading might make sense, but more likely you'll just clobber the IO capability of the DB. You'll probably get a bigger win by setting up a replica and doing your read queries there. Depending on your ratio of SELECT to INSERT queries, you might want to consider keeping a tally table (many more SELECTs than INSERTs). The more latency you can allow for your results, the more options you have. If you can allow 5 minutes latency, then you might start considering a distributed batch processing system like hadoop rather than a database.

Related

How to handle user's data in MySQL/PHP, for large number of users and data entries

Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has it's own table that is created when that Boss signed up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talks of multiple servers, and databases linked together... but again, let's focus on a single server here with a singly MySQL DB.

If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem, if they are not overwhelmed by queries, because they can keep most of the database on hard disk, and only keep what was accessed recently in cache (on memory).
The other important thing to prevent any single query from running slowly is to make sure you add the right index for each query you might perform to avoid linear searches. This is to allow the database to binary search for the record(s) required.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.

Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you do do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there's some straight-forward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There's many variables that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, either InnoDB or MyISAM.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.

i had database size problem as well in one of my networks so big that it use to slow the server down when i run query on that table..
in my opinion divide your database into dates decide what table size would be too big for you - let say 1 million entries then calculate how long it will take you to get to that amount. and then have a script every that period of time to either create a new table with the date and move all current data over or just back that table up and empty it.
like putting out dated material in archives.
if you chose the first option you'll be able to access that date easily by referring to that table.
Hope that idea helps

Just create a workers table, bosses table, a relationships table for the two, and then all of your other tables. With a relationship structure like this, it's very dynamic. Because, if it ever got large enough you could create another relationship table between the work orders to the bosses or to the workers.
You might want to look into bigints, but I doubt you'll need that. I know it that the relationships table will get massive, but thats good db design.
Of course bigint is for mySQL, which can go up to -9223372036854775808 to 9223372036854775807 normal. 0 to 18446744073709551615 UNSIGNED*

MySQL database design

I'm setting up a MySQL database and I'm not sure of the best method to structure it:
I am setting up a system (PHP/MySQL based) where a few hundred people will be executing SELECT/UPDATE/SET/DELETE queries to a database (probably about 50 simultaneously). I imagine there are going to be a few thousand rows if they're all using the same database and table. I could split the data across a number of tables but then I would have to make sure they're all uniform AND I, as the administrator, will be running some SELECT DISTINCT queries via cron to update an administrative interface.
What's the best way to approach this? Can I have everybody sharing one database? one table? Will there be a problem when there are a few thousand rows? I imagine there is going to be a huge performance issue over time.
Any tips or suggestions are welcome!

MySQL/php can easily handle this as long as your server is powerful enough. MySQL loves RAM and will use as much as it can (within the limits you provide).
If you're going to have a lot of concurrent users then I would suggest looking at using innodb tables instead of MyISAM (the default in MySQL versions <5.5). Innodb locks individual rows when doing INSERT/UPDATE/DELETE etc, rather than locking the whole table like MyISAM does.
We use php/MySQL and would have 1000+ users on our site at the same time (our master db server does about 4k queries per second).

MySQL optimization: Perform Maths operation inside or outside of a query?

I have a strong feeling that all mathematical operations unnecessary to the query itself ought to be preformed outside of the query. For example:
$result = mysql_query(SELECT a, a*b/c as score FROM table)
while ($row = mysql_fetch_assoc($result))
{
echo $row['a'].' score: '.$row['score'].<br>;
}
vs:
$result = mysql_query(SELECT a, b, c FROM table)
while ($row = mysql_fetch_assoc($result))
{
echo $row['a'].' score: '.$row['a']*$row['b']/$row['c'].<br>;
}
the second option would usually be better, especially with complex table joins & such. This is my suspicion, I only lack confirmation . . .

Faster depends on the machines involved, if you're talking about faster for one user. If you're talking about faster for a million users hitting a website, then it's more efficient to do these calculations in PHP.
The load of a webserver running PHP is very easily distributed over a large number of machines. These machines can run in parallel, handling requests from visitors and fetching necessary information from the database. The database, however, is not easy to run in parallel. Issues such as replication or sharding are complex and can require specialty software and properly organized data to function well. These are expensive solutions compared to adding another PHP installation to a server array.
Because of this, the value of a CPU cycle on the database machine is far more valuable than one on the webserver. So you should perform these math functions on the webserver where CPU cycles are cheaper and significantly more easy to parallelize.
This also assumes that the database isn't holding open any sort of data lock while performing the calculation. If so, then you're not just using precious CPU cycles, you're locking data from other users directly.

My feeling would be that doing the maths in the database would be slightly more efficient in the long run, given your query setup. With the select a,b,c version, PHP has to create 3 elements and populate them for each row fetched.
With the in-database version, only 2 elements are created, so you've cut creation time by 33%. Either way, the calculation has to be done, so there's not much in the way of savings there.
Now, if you actually needed the b and c values to be exposed to your code, then there'd be no point in doing the calculation in the database, you'd be adding more fields to the result set with their attendant creation/processing/populating overhead.
Regardless, though, you should benchmark both version. What works in one situation may be worse than useless in another, and only some testing will show which is better.

I'd agree in general. Pull data from source in your query, manipulate data in the calling/scripting environment.
I wouldn't worry too much about efficiency/speed unless your queries get really complex, but it still seems like the right thing to do.

Math in the query is generally not a problem, UNLESS it is in the WHERE clause. Example:
SELECT a, b, c FROM table WHERE a*b=c
This makes it rather impossible to use an index.
SELECT a*b/c FROM table
Is fine.

If there is any performance advantage of one way over the other it is likely going to be very negligible making it more a matter of preference than optimization.
I prefer it in the query, personally because I feel it encapsulates the calculation in the data tier.
Also, although it doesn't apply to your specific example, the more information you give the DB engine about what you are ultimately trying to do, the more information it has to feed the query optimizer. It seems theoretically possible that the query might actually run faster if you put the calculation in the SQL.

Do it in the database is better because you can run the application in one machine and the database in another, that said, I will balance your overall performance. Specially in cheap hosting services, they generally do that, application in one machine database in another.

I doubt it could be a bottleneck.
especially with complex table joins & such, where one filesort will outcome these maths by factor of 1000s
However, you can always perpend your query with BENCHMARK keyword and take some measurements
BENCHMARK 1000 SELECT a, a*b/c as score FROM table

What should I do to make mysql 100% optimal?

Recently I've been doing quite a big project with php + mysql. And now I'm concerned about my mysql. What should I do to make my mysql as optimal as possible? Tell everything you know, I'll be really very grateful.
Second question, I use one mysql query per page load which takes information from mysql. It's quite a big query, because I take information from a few tables with a join. Maybe I should do something else?
Thank you.

Some top tips from MySQL Performance tips forge
Specific Query Performance:
Use EXPLAIN to profile the query
execution plan
Use Slow Query Log (always have it
on!)
Don't use DISTINCT when you have or
could use GROUP BY Insert
performance
Batch INSERT and REPLACE
Use LOAD DATA instead of INSERT
LIMIT m,n may not be as fast as it
sounds
Don't use ORDER BY RAND() if you
have > ~2K records
Use SQL_NO_CACHE when you are
SELECTing frequently updated data or
large sets of data
Avoid wildcards at the start of LIKE
queries
Avoid correlated subqueries and in
select and where clause (try to
avoid in)
Scaling Performance Tips:
Use benchmarking
isolate workloads don't let administrative work interfere with customer performance. (ie backups)
Debugging sucks, testing rocks!
As your data grows, indexing may change (cardinality and selectivity change). Structuring may want to change. Make your schema as modular as your code. Make your code able to scale. Plan and embrace change, and get developers to do the same.
Network Performance Tips:
Minimize traffic by fetching only what you need.
1. Paging/chunked data retrieval to limit
2. Don't use SELECT *
3. Be wary of lots of small quick queries if a longer query can be more efficient
Use multi_query if appropriate to reduce round-trips
Use stored procedures to avoid bandwidth wastage
OS Performance Tips:
Use proper data partitions
1. For Cluster. Start thinking about Cluster before you need them
Keep the database host as clean as possible. Do you really need a windowing system on that server?
Utilize the strengths of the OS
pare down cron scripts
create a test environment

Learn to use the explain tool.

Three things:
Joins are not necessarily suboptimal. Oftentimes schemata that use joins will be faster than those that achieve the same but avoid table joins. The important thing is to know that your joins are optimal. EXPLAIN is very helpful but you also need to know how indexes work.
If you're grabbing data from the DB on every page hit, consider if a cacheing system would work for you. If so, check out PHP memcache and memcached. It's easy to use in PHP and very fast. It's popular for a reason.
Back to mysql: make sure you're key buffer is sized correctly. You can also think about using dedicated key buffers for critical indices that should remain in cache. Read about CACHE INDEX and LOAD INDEX INTO CACHE. See also here.

"...because I take information from a few tables with a join"
Joins, even "big" joins aren't bad. Just be sure that you have good indexes.
Also note that performance with a couple of records is a lot different than performance with hundreds of thousands of records, so test accordingly.

For performance, this book is good: High Perofmanace MYSQL. The associated blog is good too.

my 2cents: set your log_slow_queries to <2sec and use mysqlsla (get it from hackmysql.com) to analyse the 'slow' queries... Thisway you can just drilldown into the slower queries as they come along...
(the mysqlsla can also benefit from the log-queries-not-using-indexes option)
on mysqlhack.com there's a script called 'mysqlreport' that gives estimates on how your installation is runnig... (once it's running a while) and also gives pointers as to where to tune your setup more precisely...

Being perfect is a bit of a challenge and not the first target to set yourself.
Enable mysql logging of all queries, and write some code which parses the log files and removes any literal values from the SQL statements.
e.g. changes
SELECT * FROM atable WHERE something=5 AND other='splodgy';
and
SELECT * FROM atable WHERE something=1 AND other='zippy';
to something like:
SELECT * FROM atable WHERE something=:1 AND other=:2;
(Sorry, I've not got my code which does this to hand - but it's not rocket science)
Then shove the re-written log into a table so you can prioritize your performance fixes based on length and frequency of execution.

MySQL vs Web Server for processing data

I was wondering if it's faster to process data in MySQL or a server language like PHP or Python. I'm sure native functions like ORDER will be faster in MySQL due to indexing, caching, etc, but actually calculating the rank (including ties returning multiple entries as having the same rank):
Sample SQL
SELECT TORCH_ID,
distance AS thisscore,
(SELECT COUNT(distinct(distance))+1 FROM torch_info WHERE distance > thisscore) AS rank
FROM torch_info ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figure out rank in PHP on the web server.

Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. If you don't have that safety net built-in, so you have to be more careful to not make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.

In general, the answer to the "Should I process data in the database, or on the web server question" is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just SELECT * WHERE x = 3
As you pointed out, the database is optimized for operation on its data, using indexes, etc.

The speed of the count is going to depend on which DB storage engine you are using and the size of the table. Though I suspect that nearly every count and rank done in mySQL would be faster than pulling that same data into PHP memory and doing the same operation.

Ranking is based on count, order. So if you can do those functions faster, then rank will obviously be faster.

A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that mySQL is faster than server side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update that as each track record comes in. This will result in a lot of minor hits to your database, versus a number of "heavier" hits to your database.
So let's say you have 10,000 records, 1000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB doing 100 updates in which 10% of them hit every record (9,999) then have the ranking query get hit 1,000 times a day.
My two cents.

If your test is running individual queries instead of posting transactions then I would recommend using a JDBC driver over the ODBC dsn because youll get 2-3 times faster performance. (im assuming your using an odbc dsn here in your tests)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.