JS, PHP, and MySQL to get large data

I am using Ajax to send a query to a PHP server, which then runs the SQL query to get the data. The query involves three tables (two of them large), so JOINing the three tables is very slow.
So I split the SQL query into three queries. That improves efficiency for small datasets. For large datasets, however, the PHP program runs the three queries one by one and processes the result after each, so it hits the 30-second timeout (the default). I don't want to change this default setting.
To avoid the timeout, I am also considering running the three queries, returning the results to JS, and letting the client side do the processing.
Is there another way to do this?
Added:
Basically, I want three outputs (title, extviews, allviews) for each item WHERE extviews > somevalue. title comes from one small table; extviews and allviews are aggregated from two different large tables. I have all the fields indexed, but joining the two big tables still takes a long time.
So I first aggregate one table to get extviews for each item, along with a list of item ids. The results are organized as an array for JSON output to JS. Then, using the list of ids, I get the title for each item and aggregate the other table to get allviews. Then I update the array with the new results.

Unless your MySQL server is really overloaded, it's usually quicker to use joins. I guess you've already defined indexes on your tables (for the fields used in join conditions and WHERE clauses)?
Doing the processing on the client side might also be a problem, since you'll have to send a lot of data in order to do the join...
Edit:
If all the "easy" optimisation is done, then you have two choices. The first is the one you just described: doing it on the client side, if that's possible (what is the size, in bytes, of the JSON arrays you send to the client?).
Your other choice is to do the processing in the background (via cron) and cache the results somehow.
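A minimal sketch of the background-and-cache option, written in Python with the standard-library sqlite3 module as a stand-in for MySQL so that it runs on its own. The table and column names (items, ext_views, all_views, item_stats) are assumptions for illustration, not the asker's schema. Each large table is aggregated separately per item, which avoids joining the two large tables to each other.

```python
import sqlite3

def refresh_item_stats(conn):
    """Rebuild the cached per-item totals; meant to run from a scheduled job."""
    with conn:  # one transaction, so readers never see a half-built table
        conn.execute("DELETE FROM item_stats")
        conn.execute("""
            INSERT INTO item_stats (item_id, title, extviews, allviews)
            SELECT i.id, i.title,
                   (SELECT COALESCE(SUM(e.views), 0) FROM ext_views e WHERE e.item_id = i.id),
                   (SELECT COALESCE(SUM(a.views), 0) FROM all_views a WHERE a.item_id = i.id)
            FROM items i
        """)

def items_above(conn, min_extviews):
    """What the Ajax endpoint would run: a cheap read of the cached totals."""
    return conn.execute(
        "SELECT title, extviews, allviews FROM item_stats "
        "WHERE extviews > ? ORDER BY item_id", (min_extviews,)).fetchall()

def demo_connection():
    """Hypothetical schema and sample rows, only for trying the functions out."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE ext_views (item_id INTEGER, views INTEGER);
        CREATE TABLE all_views (item_id INTEGER, views INTEGER);
        CREATE TABLE item_stats (item_id INTEGER PRIMARY KEY, title TEXT,
                                 extviews INTEGER, allviews INTEGER);
        INSERT INTO items VALUES (1, 'alpha'), (2, 'beta');
        INSERT INTO ext_views VALUES (1, 5), (1, 7), (2, 1);
        INSERT INTO all_views VALUES (1, 20), (2, 3), (2, 4);
    """)
    return conn
```

With this split, the slow aggregation runs outside the web request, and the request itself only reads the small item_stats table, so the 30-second limit no longer applies to the heavy work.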

As already indicated by other people responding to your post, you should give us an idea of the structure of your three tables and the intent of each. Based upon that information, you may be able to get significant performance improvements by optimizing your database structure. To make it easier to understand, let's assume that someone had a website running off an intelligently designed database. I could easily make that application perform ten times worse solely by modifying the structure of the database.
Now, maybe there's some reason why you need to have three distinct tables, but I can't make that judgment without knowing what the fields in the database are, what you're aggregating, and what your web application is doing in the first place. Is it read heavy or write heavy? The solution may be as simple as denormalizing your database so that you don't need to use any joins.
I can say from a cursory glance at your description of what you're doing that this application can't possibly scale efficiently and that you really need to reconsider your design. The first warning sign for me is the fact that you stated that one of the joins is just to link the title to two other tables. To me, being forced to do a join just to get the title of an object seems indicative of over-normalization. Some data redundancy is not necessarily a bad thing, and in some situations it's absolutely mandatory. Also, you say that you have two large tables that you use aggregate functions on and then join everything together. I can tell you right now that you're going to run into some serious performance issues if every hit to your application involves a triple join and two aggregate functions (I'm assuming COUNT).
Ultimately, we'll be able to give you a better response once you provide more information as to what you're trying to accomplish, and the general structure of the database you set up for it.

Related

Suggestions on Structuring a Database with Large Amounts of Data

I'm doing an RIA with JavaScript, MySQL and PHP on a Windows server.
I have 5,000 identically structured data sets I want to put in a database. 5 tables is enough for the data, all of which will be reasonably small except for one table that will have 300,000+ records for a typical data set.
Additionally, 500 users will get read only access to statistics compiled from those data sets. Those statistics are provided by PHP (no direct access is allowed). What's more, their access to data varies. Some users can only use one data set, others some, a few, all.
The results users see are relatively small; most requests return well under 100 rows, and the largest requests will be about 700 rows. All requests are through a JavaScript RIA which uses Ajax to connect to PHP which in turn connects to the data, does its thing and outputs JSON in response, which JavaScript then presents accordingly.
In thinking about how to structure this, three options present themselves:
Put the data sets in the same tables. That could easily give me 1,500,000,000 records in the largest table.
Use separate tables for each data set. That would limit the largest table size, but could mean 25,000 tables.
Forget the database and stick with the proprietary format.
I'm leaning towards #2 for a few reasons.
I'm concerned about issues in using very large tables (eg: query speeds, implementation limits, etc...).
Separate tables seem safer; they limit the impact of errors and structure changes.
Separate tables allow me to use MySQL's table level security rather than implementing my own row level security. This means less work and better protection; for instance, if a query is accidentally sent without row level security, users can get unauthorized data. Not so with table level security, as the database will reject the query out of hand.
Those are my thoughts, but I'd like yours. Do you think this is the right choice? If not, why not? What considerations have I missed? Should I consider other platforms if scalability is an issue?
1) I'm concerned about issues in using very large tables (eg: query speeds, implementation limits, etc...).
Whether the DBMS has to...
search through the large index of one table,
or search for the right table and then search through the smaller index of that table
...probably doesn't make much of a difference performance-wise. If anything, the second case has an undocumented component (the performance of locating the right table), so I'd be reluctant to trust it fully.
If you want to physically partition the data, MySQL supports that directly since version 5.1, so you don't have to emulate it via separate tables.
2) Separate tables seem safer; they limit the impact of errors and structure changes.
That's what backups are for.
3) Separate tables allow me to use MySQL's table level security rather than implementing my own row level security.
True enough; however, a similar effect can be achieved through views or stored procedures.
All in all, my instinct is to go with a single table, unless you know in advance that these data-sets differ enough structurally to warrant separate tables. BTW, I doubt you'd be able to do better with a proprietary format compared to a well-optimized database.
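As a rough illustration of the single-table approach with access filtered through a view, here is a sketch in Python using sqlite3 as a stand-in for MySQL. The names readings, user_datasets, and visible_readings are hypothetical. SQLite has no GRANT, so the part where MySQL would restrict users to the view is not shown; the sketch only shows the row filtering living in one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- All data sets share one table, distinguished by dataset_id.
    CREATE TABLE readings (dataset_id INTEGER, recorded_on TEXT, value REAL);
    CREATE INDEX idx_readings_dataset ON readings (dataset_id, recorded_on);
    -- Which user may read which data set.
    CREATE TABLE user_datasets (user_id INTEGER, dataset_id INTEGER,
                                PRIMARY KEY (user_id, dataset_id));
    -- The statistics code queries only this view, never readings directly.
    CREATE VIEW visible_readings AS
        SELECT ud.user_id, r.dataset_id, r.recorded_on, r.value
        FROM readings r
        JOIN user_datasets ud ON ud.dataset_id = r.dataset_id;
    INSERT INTO readings VALUES (1, '2024-01-01', 10.0), (1, '2024-01-02', 12.0),
                                (2, '2024-01-01', 99.0);
    INSERT INTO user_datasets VALUES (100, 1), (200, 1), (200, 2);
""")

def dataset_totals(conn, user_id):
    """Per-data-set totals, limited to the data sets this user may read."""
    return conn.execute(
        "SELECT dataset_id, SUM(value) FROM visible_readings "
        "WHERE user_id = ? GROUP BY dataset_id ORDER BY dataset_id",
        (user_id,)).fetchall()
```

A user with no rows in user_datasets simply gets an empty result, because the join inside the view finds nothing to match.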

How to handle user's data in MySQL/PHP, for large number of users and data entries

Let's pretend with me here:
PHP/MySQL web-application. Assume a single server and a single MySQL DB.
I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.
Server issues aside, I see two main ways to handle the basic logic of the database here:
Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.
Each Boss has its own table that is created when that Boss signs up, i.e. work_bossID where bossID = the boss's unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.
Is there a third option that I'm overlooking?
Which method would be the better-functioning method?
How big is too big for number of entries in a table (let's assume a small number of columns: less than 10)? (this can include: it's time to get a second server when...)
How big is too big for number of tables in a database? (this can include: it's time to get a second server when...)
I know that at some point we have to bring in talk of multiple servers and databases linked together... but again, let's focus on a single server here with a single MySQL DB.
If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.
To manage large datasets, you can use multiple servers. In this case:
You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)
You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information
Which option you choose depends on how you will use your database.
Mirrors (master/slave)
Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.
However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.
Sharding
The advantage of sharding is that you don't have to store a copy of the record on every mirror server.
However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.
How to choose
In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.
Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.
Keeping queries fast
Large databases are not really a big problem if they are not overwhelmed by queries, because they can keep most of the database on hard disk and only keep what was accessed recently in cache (in memory).
The other important thing, to prevent any single query from running slowly, is to make sure you add the right index for each query you might perform, to avoid linear searches. An index lets the database find the required record(s) with a tree lookup, much like a binary search, instead of scanning every row.
If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.
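A counter cache can be sketched as follows, in Python with sqlite3 standing in for MySQL. The boss_counters table and the function names are assumptions for illustration. The counter is updated in the same transaction as the insert, so the two cannot drift apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workorders (id INTEGER PRIMARY KEY, boss_id INTEGER, note TEXT);
    CREATE TABLE boss_counters (boss_id INTEGER PRIMARY KEY,
                                workorder_count INTEGER NOT NULL);
""")

def add_workorder(conn, boss_id, note):
    """Insert a work order and bump that boss's cached count atomically."""
    with conn:  # both statements commit together or not at all
        conn.execute("INSERT INTO workorders (boss_id, note) VALUES (?, ?)",
                     (boss_id, note))
        updated = conn.execute(
            "UPDATE boss_counters SET workorder_count = workorder_count + 1 "
            "WHERE boss_id = ?", (boss_id,)).rowcount
        if updated == 0:  # first work order for this boss
            conn.execute("INSERT INTO boss_counters VALUES (?, 1)", (boss_id,))

def workorder_count(conn, boss_id):
    """Read the cached count instead of running COUNT(*) over workorders."""
    row = conn.execute(
        "SELECT workorder_count FROM boss_counters WHERE boss_id = ?",
        (boss_id,)).fetchone()
    return row[0] if row else 0
```

The trade-off is a slightly more expensive write in exchange for a count that is a single primary-key lookup. Deletes would need the matching decrement, which this sketch leaves out.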
When to get a new server
There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.
Scaling is often a case of experimentation, as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you to do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.
For collecting large numbers of records there are some straightforward rules:
Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
Don't over-index your table; add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There are many settings that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, whether InnoDB or MyISAM.
Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.
With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.
The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.
MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.
I had a database size problem as well on one of my networks: the table was so big that it used to slow the server down whenever I ran a query on it.
In my opinion, you should divide your data by date. Decide what table size would be too big for you (let's say 1 million entries), then calculate how long it will take you to reach that amount. Then have a script run once every such period to either create a new table named with the date and move all the current data over, or just back the table up and empty it.
It's like putting outdated material in archives.
If you choose the first option, you'll be able to access the data for a given date easily by referring to that table.
Hope that idea helps.
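The archiving idea can be sketched like this, in Python with sqlite3 as a stand-in for MySQL. The workorders_archive table, the created_on column, and the function name are assumptions for illustration. Dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

def archive_before(conn, cutoff_date):
    """Move work orders created before cutoff_date into the archive table.

    Returns the number of rows moved out of the working table.
    """
    with conn:  # copy and delete in one transaction so no row is lost or duplicated
        conn.execute(
            "INSERT INTO workorders_archive "
            "SELECT * FROM workorders WHERE created_on < ?", (cutoff_date,))
        moved = conn.execute(
            "DELETE FROM workorders WHERE created_on < ?", (cutoff_date,)).rowcount
    return moved

# Hypothetical schema and sample rows for trying it out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workorders (id INTEGER PRIMARY KEY, boss_id INTEGER, created_on TEXT);
    CREATE TABLE workorders_archive (id INTEGER PRIMARY KEY, boss_id INTEGER, created_on TEXT);
    INSERT INTO workorders VALUES (1, 7, '2023-11-30'), (2, 7, '2023-12-31'),
                                  (3, 8, '2024-01-02');
""")
```

Run from a scheduled script, for example once per quarter, this keeps the working table small while the older rows stay available for reporting.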
Just create a workers table, a bosses table, a relationships table for the two, and then all of your other tables. A relationship structure like this is very flexible: if it ever got large enough, you could create another relationship table linking the work orders to the bosses or to the workers.
You might want to look into BIGINTs for the ids, but I doubt you'll need them. I know the relationships table will get massive, but that's good DB design.
For reference, BIGINT in MySQL ranges from -9223372036854775808 to 9223372036854775807 when signed, or from 0 to 18446744073709551615 when UNSIGNED.

JOINS vs. while statements

In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, that I could just do a simple join to pull in the data such as....
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php
$query = mysql_query("SELECT sub_id FROM table_1");
while ($rs = mysql_fetch_assoc($query)) {
    // one extra query for every row of table_1
    $query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
    //blah blah blah more queries
}
?>
When I asked why they did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records in different tables, and some of the tables are a little wide (row-wise). They said that they wanted to avoid joins in case a poorly executed query locked up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own report, and if they go crazy and build a big report, it could cause some havoc.
I was confused so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while statement (one larger query to pull a lot of rows, followed by a lot of small tiny sub-queries if you will) or to do a join (pull a larger query one time to get all the data you need). As long as indexes are done properly, does it matter? One other thing to consider is that the current DB is in InnoDB format.
Thanks!
Update 8/28/14
So I thought I'd throw up an update to this one and what has worked more long term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive result numbers, but I thought I'd share what the result was.
I think I went a little overkill, because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all, join a value to a primary key, so they all run really, really fast. Before, if the report had, let's say, 30 columns of data to pull and it pulled 2000 records, every single field was running a query to fetch the data (because that piece of data could be on a different field). 30 x 2000 = 60000, and even at a sweet query time of 0.0003 seconds per query, that was still 18 seconds of just query time (which is pretty much what I remember it being). Now that I have rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loads in about 2-3 seconds, and most of that time is downloading the HTML. Each record that is returned runs between 0 and 4 extra queries depending on the data that's needed (it may not need any if the data can be fetched in the joins, which happens 75% of the time). So the same 2000 records would run an additional 0-8000 queries (much better than 60000).
I would say that the while statement is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case, joins were the better option, but in other areas of my site, a while statement is more useful. In one instance I have a report where a client can request several categories to pull by and only get data for those categories. What happened was that I had a category_id IN(...,...,..,.., etc etc etc) with 50-500 IDs, and the index would choke and die in my arms as I was holding it in its final moments. So I split the IDs into groups of 10 and ran the same query x / 10 times, and my results were fetched way faster than before, because the index likes dealing with 10 IDs, not 500. So in that case I saw a great improvement in my queries from using the while statement.
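The batching trick described above can be sketched as follows, in Python with sqlite3 as a stand-in for MySQL. The report_rows table and the function names are hypothetical. Whether small batches actually beat one large IN list depends on the database, its version, and the data, so this only shows the mechanics, not a guaranteed speed-up.

```python
import sqlite3

def chunks(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def rows_for_categories(conn, category_ids, batch_size=10):
    """Run the same IN (...) query once per batch and merge the results."""
    rows = []
    for batch in chunks(list(category_ids), batch_size):
        placeholders = ",".join("?" * len(batch))  # e.g. "?,?,?" for three ids
        rows.extend(conn.execute(
            "SELECT id, category_id FROM report_rows "
            f"WHERE category_id IN ({placeholders})", batch))
    return rows

# Hypothetical sample data: 200 rows spread over 50 categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report_rows (id INTEGER PRIMARY KEY, category_id INTEGER)")
conn.execute("CREATE INDEX idx_report_category ON report_rows (category_id)")
conn.executemany("INSERT INTO report_rows VALUES (?, ?)",
                 [(i, i % 50) for i in range(200)])
conn.commit()
```

Only the number of placeholders is built into the SQL text; the ids themselves are still passed as bound parameters.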
If the indexes are properly used, then it is almost always more efficient to use a JOIN. The emphasis is added because best efficiency does not always equal best performance.
There isn't really a one-size-fits all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, etc. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, make sure that these things have been considered, and build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly, and if all the optimizations and corrections still result in a slower join than using query fragments provides, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design where you should accept poor performance for the sake of adherence to arbitrary rules about what you should or should not do. The best-performing method is the best method.
It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, which then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, N rounds of I/O.
MySQL has optimizations which work on JOINs. Those optimizations cannot work if you do a while loop.
As stated in previous answers, use EXPLAIN to check whether something isn't using an index when you use the JOIN. Also, you should check the memory given to the InnoDB cache and the memory given to MySQL to parse a given query. Maybe it's because of those parameters that the database goes slower when doing the JOINs.
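To make the "1 query versus N queries" point concrete, here is a small sketch in Python with sqlite3 standing in for MySQL. It reuses the question's table_1 and table_2 names; the label column and the sample rows are assumptions. Both functions return the same data, but the loop issues one query per row of table_1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_2 (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE table_1 (id INTEGER PRIMARY KEY, sub_id INTEGER);
    INSERT INTO table_2 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table_1 VALUES (10, 1), (11, 2), (12, 1), (13, NULL);
""")

def fetch_with_loop(conn):
    """The 'while' approach: one outer query plus one lookup per row."""
    queries = 1
    result = []
    outer = conn.execute("SELECT id, sub_id FROM table_1 ORDER BY id").fetchall()
    for t1_id, sub_id in outer:
        row = conn.execute("SELECT label FROM table_2 WHERE id = ?",
                           (sub_id,)).fetchone()
        queries += 1
        result.append((t1_id, row[0] if row else None))
    return result, queries

def fetch_with_join(conn):
    """The JOIN approach: the database matches the rows in a single query."""
    result = conn.execute("""
        SELECT table_1.id, table_2.label
        FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
        ORDER BY table_1.id
    """).fetchall()
    return result, 1
```

With four rows the loop already needs five queries. Against a networked MySQL server, each of those extra queries also pays a round trip, which an in-memory SQLite database does not show.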
I would say the answer is, it depends. Normally, I'd say joins are the answer, and doing multiple queries in a loop is bad practise, however, it depends entirely on what is being done.
Is it the case for you? Without detailed table structures and info on indexes as well as use of foreign keys etc, we can't say for sure. Best idea if you want to check, is try it and see. Get their queries, EXPLAIN them, write your own, and do an EXPLAIN on that, see which is more efficient.
I'm not sure about huge databases, but in my projects I always try to keep the queries to a minimum. Queries use hard drive access and (if not on the same host) network access, both of which are slow. If there are many entries in that first query, you could be running thousands of queries per page, which is going to be slow.
Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join by the database will use more resources than setting up a new connection and performing the exact same operation (after all, you're still connecting the data in the same way as a join, even if it is done externally). If it did, the engine could simply be rewritten to use that external route to improve performance.
When joins use more resources (apart from indexing problems), it mostly comes from the downsides of retrieving the data per row, which means that information of the parent table will be duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice of course, but at first sight I don't think it will account for differences between those two scenarios, as the same indices (or lack of) would apply in both cases.
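The duplication of parent data in joined rows can be measured with a short sketch, in Python with sqlite3 as a stand-in for MySQL. The parents and children tables, the big_text column, and the character-count measure are all assumptions chosen to make the effect visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parents (id INTEGER PRIMARY KEY, big_text TEXT);
    CREATE TABLE children (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
""")
# One parent with a large field, and five small children.
conn.execute("INSERT INTO parents VALUES (1, ?)", ("x" * 1000,))
conn.executemany("INSERT INTO children (parent_id, name) VALUES (1, ?)",
                 [(f"c{i}",) for i in range(5)])
conn.commit()

def joined_payload_chars(conn, parent_id):
    """Characters returned by a join: big_text is repeated once per child row."""
    rows = conn.execute(
        "SELECT p.big_text, c.name FROM parents p "
        "JOIN children c ON c.parent_id = p.id WHERE p.id = ?",
        (parent_id,)).fetchall()
    return sum(len(big) + len(name) for big, name in rows)

def split_payload_chars(conn, parent_id):
    """Characters returned by two queries: big_text is fetched only once."""
    (big,) = conn.execute("SELECT big_text FROM parents WHERE id = ?",
                          (parent_id,)).fetchone()
    names = [name for (name,) in conn.execute(
        "SELECT name FROM children WHERE parent_id = ?", (parent_id,))]
    return len(big) + sum(len(name) for name in names)
```

Here the join returns the 1000-character field five times, while the split version returns it once at the cost of one extra query, which is the trade-off described above.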

Big joins or multiple fetches most efficient?

I understand that multiple variables are part of this equation, like the number of tables, number of columns, number of returned rows, indexes used, etc. But speaking generally:
Is it more efficient to run a query with multiple (say 5+) joins, where most of the tables contain rows with information corresponding to rows in the main table, and the returned result would be in the 20,000-row range? For the sake of argument, let's say the first table contains users with a creation date, and it's by this date that we decide which users to pick out. The other tables contain stuff such as session information, user notes, etc. All users should be picked out, but depending on the values of fields in the secondary tables we might ignore the session data for one user and do some work with the session data for another user when we go through the results. This way we would get all the needed data in one query, but might get some redundant data for some users at the same time.
Or would it be more efficient to pick the users by date and when iterating the results we fetch data from the other tables per user when it's necessary?
Let's say that the work on the returned rows is done within PHP5+.
I'll say, do a benchmark.
It will depend on the frequency of "when it's necessary". If you need the extra data for 10% of the users, the second approach will be better, I think. If you need it for 90%, it will be better to retrieve everything in one big query.
Big join.
I can cite absolutely no evidence to back that up. I do speak from some experience, though: in the system I work with, we do millions of tiny simple queries rather than a few big ones, and all the data-intensive work takes ages. For example, it takes an hour to load data that a direct SQL load can do in a couple of minutes. The per-query cost completely dominates the equation.
If your tables have the proper indexes (which will help a lot, when it comes to joins), one single SQL query, even a bit complex, will probably be faster than several queries, which will each imply an exchange between PHP and the MySQL server.
(But, of course, the only way to know for sure what works best in your specific situation is to test both solutions and benchmark them!)

How to deal with large data sets for analytics, and varying numbers of columns?

I'm building an analytics system for a mobile application and have had some difficulty deciding how to store and process large amounts of data.
Each row will represent a 'view' (like a web page) and store some fixed attributes, like user agent and date. Additionally, each view may have a varying number of extra attributes, which relate to actions performed or content identifiers.
I've looked at Amazon SimpleDb which handles the varying number of attributes well, but has no support for GROUP BY and doesn't seem to perform well when COUNTing rows either. Generating a monthly graph with 30 data points would require a query for each day per dataset.
MySQL handles the COUNT and GROUP modifiers much better but additional attributes require storage in a link table and a JOIN to retrieve views where attributes match a given value, which isn't very fast. 5.1's partitioning feature may help speed things up a bit.
What I have gathered from a lot of reading and profiling queries on the aforementioned systems is that ultimately all of the data needs to be aggregated and stored in tables for quick report generation.
Have I missed anything obvious in my research, and is there a better way to do this than using MySQL? It doesn't feel like the right tool for the job, but I can't find anything capable of both GROUP/COUNT queries and a flexible table structure.
This is a case where you want to store the data once and read it over and over. Further I think that you'd wish the queries to be preprocessed instead of needing to be calculated on every go.
My suggestion for you is to store your data in CouchDB for the following reasons:
Its tables are structureless
Its queries are pre-processed
Its support for map-reduce allows your queries to handle group by
It has a REST service access model which lets you connect from pretty much anything that handle HTTP requests
You may find this suggestion a little out there considering how new CouchDB is. However, I'd suggest you read about it, because personally I think running a CouchDB database is sweet and lightweight, more lightweight than MySQL.
Keeping it in MySQL: If the number of writes is limited and reads are more common, and the data is relatively simple (i.e. you can predict the possible characters), you could try to use a text/blob column in the main table, which is updated with comma-separated values or key/value pairs by an AFTER INSERT / UPDATE trigger on the join table. You keep the actual data in a separate table, so searching for MAX values or specific 'extra' attributes can still be done relatively fast, but retrieving the complete dataset for one of your 'views' would be a single row in the main table, which you can split into the separate values with the script / application you're using, relieving much of the stress on the database itself.
The downside of this is a tremendous increase in cost of updates / inserts in the join table: every alteration of data would require a query on all related data for a record, and a second insert into the 'normal' table, something like
UPDATE main_table
JOIN (
    SELECT main_id, GROUP_CONCAT(CONCAT(`key`, '=', `value`) SEPARATOR ';') AS cache
    FROM join_table
    WHERE main_id = 'foo'
    GROUP BY main_id
) AS agg ON agg.main_id = main_table.id
SET main_table.cache = agg.cache;
However, analytics data usually trails somewhat, so possibly not every update has to trigger an update of the cache; a daily cron script filling the cache with yesterday's data could do.
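The cache-column idea can be tried out with the sketch below, in Python with sqlite3 as a stand-in for MySQL. SQLite's group_concat(expr, separator) plays the role of MySQL's GROUP_CONCAT, and the column names attr_name and attr_value are hypothetical replacements for the key and value columns. The rebuild is written as a function that a daily job could call, rather than as a trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, cache TEXT);
    CREATE TABLE join_table (main_id INTEGER, attr_name TEXT, attr_value TEXT);
    INSERT INTO main_table (id) VALUES (1), (2);
    INSERT INTO join_table VALUES (1, 'screen', 'home'), (1, 'action', 'tap'),
                                  (2, 'screen', 'settings');
""")

def rebuild_cache(conn, main_id=None):
    """Refill main_table.cache from join_table, for one row or for all rows."""
    sql = """
        UPDATE main_table
        SET cache = (SELECT group_concat(attr_name || '=' || attr_value, ';')
                     FROM join_table
                     WHERE join_table.main_id = main_table.id)
    """
    params = ()
    if main_id is not None:
        sql += " WHERE id = ?"
        params = (main_id,)
    with conn:
        conn.execute(sql, params)

def read_attributes(conn, main_id):
    """Fetch one row and split the cached string back into key/value pairs."""
    (cache,) = conn.execute("SELECT cache FROM main_table WHERE id = ?",
                            (main_id,)).fetchone()
    if not cache:
        return {}
    return dict(pair.split("=", 1) for pair in cache.split(";"))
```

Reading a view's attributes then costs a single primary-key lookup and a string split in the application. As the answer notes, this only works if the values can never contain the separator characters.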
