I am writing a PHP/MySQL application (using CodeIgniter) that uses some jQuery functionality for dragging table rows. I have a table in which the user can drag rows to the desired order (kind of a queue for which I need to preserve the rank of each row). I've been trying to figure out how to (and whether I should) update the database each time the user drops a row, in order to simplify the UI and avoid a "Save" button.
I have the jQuery working and can send a serialized list back to the server onDrop, but is it good design practice to run an update query this often? The table will usually have 30-40 rows max, but if the user drags row 1 far down the list, then potentially all the rows would need to be updated to update the rank field.
I've been wondering whether to send a giant query to the server, to loop through the rows in PHP and update each row with its own Update query, to send a small serialized list to a stored procedure to let the server do all the work, or perhaps a better method I haven't considered. I've read that stored procedures in MySQL are not very efficient and use a separate process for each call. Any advice as to the right solution here? Thanks very much for your help!
Any question that includes "The table will usually have 30-40 rows max" ends with "Do whatever you want to it." I can't imagine an operation, however frequently it's performed, that would have any appreciable performance impact on a table that tiny.
The only real question is what the visitor will be doing while your request is going to and returning from the server. Will they be locked out of making other changes? If not, make sure you have a mechanism to ensure that the most recent change is the one that's really taken effect. (It's possible for requests to reach the server out of order, and you wouldn't want an outdated request to get saved as the final state.)
Related
This question has been asked a THOUSAND times... so it's not unfair if you decide to skip reading/answering it, but I still thought people would like to see and comment on my approach...
I'm building a site which requires an activity feed, like FourSquare.
But my site has this feature for the eye-candy's sake, and doesn't need the stuff to be saved forever.
So, I write the event_type and user_id to a MySQL table. Before writing new events to the table, I delete all the older, unnecessary rows (by counting the total number of rows, getting the event_id lesser than which everything is redundant, and deleting those rows). I prune the table, and write a new row every time an event happens. There's another user_text column which is NULL if there is no user-generated text...
In the front-end, I have jQuery that checks with a PHP file via GET every x seconds the user has the site open. The jQuery sends a request with the last update "id" it received. The <div> tags generated by my backend have the "id" attribute set as the MySQL row id. This way, I don't have to save the last_received_id in memory, though I guess there's absolutely no performance impact from storing one variable with a very small int value in memory...
I have a function that generates an "update text" depending on the event_type and user_id I pass it from the jQuery, and whether the user_text column is empty. The update text is passed back to jQuery, which appends the freshly received event <div> to the feed with some effects, while simultaneously getting rid of the "tail end" event <div> with an effect.
If I (more importantly, the client) want to, I can have an "event archive" table in my database (or a different one) that saves up all those redundant rows before deleting. This way, event information will be saved forever, while not impacting the performance of the live site...
I'm using CodeIgniter, so there's no question of repeated code anywhere. All the pertinent functions go into a LiveUpdates class in the library and model respectively.
I'm rather happy with the way I'm doing it because it solves the problem at hand while sticking to the KISS ideology... but still, can anyone please point me to some resources, that show a better way to do it? A Google search on this subject reveals too many articles/SO questions, and I would like to benefit from the experience any other developer that has already trawled through them and found out the best approach...
If you use proper indexes there's no reason you couldn't keep all the events in one table without affecting performance.
If you craft your polling correctly to return nothing when there is nothing new you can minimize the load each client has on the server. If you also look into push notification (the hybrid delayed-connection-closing method) this will further help you scale big successfully.
Finally, it is completely unnecessary to worry about variable storage in the client. This is premature optimization. The performance issues are going to be in the avalanche of connections to the web server from many users, and in the DB, tables without proper indexes.
About indexes: An index is "proper" when the most common query against a table can be performed with a seek and a minimal number of reads (like 1-5). In your case, this could be an incrementing id or a date (if it has enough precision). If you design it right, the operation to find the most recent update_id should be a single read. Then when your client submits its ajax request to see if there is updated content, first do a query to see if the value submitted (id or time) is less than the current value. If so, respond immediately with the new content via a second query. Keeping the "ping" action as lightweight as possible is your goal, even if this incurs a slightly greater cost for when there is new content.
Using a push would be far better, though, so please explore Comet.
If you don't know how many reads are going on with your queries then I encourage you to explore this aspect of the database so you can find it out and assess it properly.
Update: offering the idea of clients getting a "yes there's new content" answer and then actually requesting the content was perhaps not the best. Please see Why the Fat Pings Win for some very interesting related material.
Let's say you have a search form, with multiple select fields, let's say a user selects from a dropdown an option, but before he submits the data I need to display the count of the rows in the database .
So let's say the site has at least 300k(300.000) visitors a day, and a user selects options from the form at least 40 times a visit, that would mean 12M ajax requests + 12M count queries on the database, which seems a bit too much .
The question is how can one implement a fast count (using php(Zend Framework) and MySQL) so that the additional 12M queries on the database won't affect the load of the site .
One solution would be to have a table that stores all combinations of select fields and their respective counts (when a product is added or deleted from the products table the table storing the count would be updated). Although this is not such a good idea when for 8 filters (select options) out of 43 there would be +8M rows inserted that need to be managed.
Any other thoughts on how to achieve this?
p.s. I don't need code examples but the idea itself that would work in this scenario.
I would probably have an pre-calculated table - as you suggest yourself. Import is that you have an smart mechanism for 2 things:
Easily query which entries are affected by which change.
Have an unique lookup field for an entire form request.
The 8M entries wouldn't be very significant if you have solid keys, as you would only require an direct lookup.
I would go trough the trouble to write specific updates for this table on all places it is necessary. Even with the high amount of changes, this is still efficient. If correctly done you will know which rows you need to update or invalidate when inserting/updating/deleting the product.
Sidenote:
Based on your comment. If you need to add code on eight places to cover all spots can be deleted - it might be a good time to refactor and centralize some code.
there are few scenarios
mysql has the query cache, you dun have to bother the caching IF the update of table is not that frequently
99% user won't bother how many results that matched, he/she just need the top few records
use the explain - if you notice explain will return how many rows going to matched in the query, is not 100% precise, but should be good enough to act as rough row count
Not really what you asked for, but since you have a lot of options and want to count the items available based on the options you should take a look at Lucene and its faceted search. It was made to solve problems like this.
If you do not have the need to have up to date information from the search you can use a queue system to push updates and inserts to Lucene every now and then (so you don't have to bother Lucene with couple of thousand of updates and inserts every day).
You really only have three options, and no amount of searching is likely to reveal a fourth:
Count the results manually. O(n) with the total number of the results at query-time.
Store and maintain counts for every combination of filters. O(1) to retrieve the count, but requires O(2^n) storage and O(2^n) time to update all the counts when records change.
Cache counts, only calculating them (per #1) when they're not found in the cache. O(1) when data is in the cache, O(n) otherwise.
It's for this reason that systems that have to scale beyond the trivial - that is, most of them - either cap the number of results they'll count (eg, items in your GMail inbox or unread in Google Reader), estimate the count based on statistics (eg, Google search result counts), or both.
I suppose it's possible you might actually require an exact count for your users, with no limitation, but it's hard to envisage a scenario where that might actually be necessary.
I would suggest a separate table that caches the counts, combined with triggers.
In order for it to be fast you make it a memory table and you update it using triggers on the inserts, deletes and updates.
pseudo code:
CREATE TABLE counts (
id unsigned integer auto_increment primary key
option integer indexed using hash key
user_id integer indexed using hash key
rowcount unsigned integer
unique key user_option (user, option)
) engine = memory
DELIMITER $$
CREATE TRIGGER ai_tablex_each AFTER UPDATE ON tablex FOR EACH ROW
BEGIN
IF (old.option <> new.option) OR (old.user_id <> new.user_id) THEN BEGIN
UPDATE counts c SET c.rowcount = c.rowcount - 1
WHERE c.user_id = old.user_id and c.option = old.option;
INSERT INTO counts rowcount, user_id, option
VALUES (1, new.user_id, new.option)
ON DUPLICATE KEY SET c.rowcount = c.rowcount + 1;
END; END IF;
END $$
DELIMITER ;
Selection of the counts will be instant, and the updates in the trigger should not take very long either because you're using a memory table with hash indexes which have O(1) lookup time.
Links:
Memory engine: http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html
Triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html
A few things you can easily optimise:
Cache all you can allow yourself to cache. The options for your dropdowns, for example, do they need to be fetched by ajax calls? This page answered many of my questions when I implemented memcache, and of course memcached.org has great documentation available too.
Serve anything that can be served statically. Ie, options that don't change frequently could be stored in a flat file as array via cron every hour for example and included with script at runtime.
MySQL with default configuration settings is often sub-optimal for any serious application load and should be tweaked to fit the needs, of the task at hand. Maybe look into memory engine for high performance read-access.
You can have a look at these 3 great-but-very-technical posts on materialized views, as a matter of fact that whole blog is truly a goldmine of performance tips for mysql.
GOod-luck
Presumably you're using ajax to make the call to the back end that you're talking about. Use some kind of a chached flat file as an intermediate for the data. Set an expire time of 5 seconds or whatever is appropriate. Name the data file as the query key=value string. In the ajax request if the data file is older than your cooldown time, then refresh, if not, use the value stored in your data file.
Also, you might be underestimating the strength of the mysql query cache mechanism. If you're using mysql query cache, I doubt there would be any significant performance dip over doing it the way I just described. If the query was being query cached by mysql then virtually the only slowdown effect would be from the network layer between your application and mysql.
Consider what role replication can play in your architecture. If you need to scale out, you might consider replicating your tables from InnoDB to MyISAM. The MyISAM engine automatically maintains a table count if you are doing count(*) queries. If you are doing count(col) where queries, then you need to rely heavily on well designed indicies. In that case you your count queries might take shape like so:
alter table A add index ixA (a, b);
select count(a) using from A use index(ixA) where a=1 and b=2;
I feel crazy for suggesting this as it seems that no-one else has, but have you considered client-side caching? JavaScript isn't terrible at dealing with large lists, especially if they're relatively simple lists.
I know that your ideal is that you have a desire to make the numbers completely accurate, but heuristics are your friend here, especially since synchronization will never be 100% -- a slow connection or high latency due to server-side traffic will make the AJAX request out of date, especially if that data is not a constant. IF THE DATA CAN BE EDITED BY OTHER USERS, SYNCHRONICITY IS IMPOSSIBLE USING AJAX. IF IT CANNOT BE EDITED BY ANYONE ELSE, THEN CLIENT-SIDE CACHING WILL WORK AND IS LIKELY YOUR BEST OPTION. Oh, and if you're using some sort of port connection, then whatever is pushing to the server can simply update all of the other clients until a sync can be accomplished.
If you're willing to do that form of caching, you can also cache the results on the server too and simply refresh the query periodically.
As others have suggested, you really need some sort of caching mechanism on the server side. Whether it's a MySQL table or memcache, either would work. But to reduce the number of calls to the server, retrieve the full list of cached counts in one request and cache that locally in javascript. That's a pretty simple way to eliminate almost 12M server hits.
You could probably even store the count information in a cookie which expires in an hour, so subsequent page loads don't need to query again. That's if you don't need real time numbers.
Many of the latest browser also support local storage, which doesn't get passed to the server with every request like cookies do.
You can fit a lot of data into a 1-2K json data structure. So even if you have thousands of possible count options, that is still smaller than your typical image. Just keep in mind maximum cookie sizes if you use cookie caching.
Fairly simple concept, making an extremely basic message board system and I want users to have a post count. Now I was debating on whether or not to have a tally in their row that is added each time a post by them is created, or subtracted by one each time a post of theirs is deleted. However I'm sure that performing a count query when the post count is requested would be more accurate due to unforseen circumstances (say a thread gets deleted and it doesn't lower their tally properly), however this seems like it would be less efficient to run a query EVERY time their post count is loaded, especially in the case of them having 10 posts on the same page and it lists their post count each post.
Thoughts/Advice?
Thanks
post_count should definitely be a column in the user table. the little extra effort to get this right is minimal compared to the additional database load you produce with running a few count query on every thread view.
if you use some sort of orm or database abstraction, it should be quite simple to add the counting to their create / delete filters.
Just go for count each time. Unless your load is going to be astronomical, COUNT shouldn't be a problem, and reduces the amount of effort involved in saving and updating data.
Just make sure you put an index on your user_id column, so that you can filter the data with a WHERE clause efficiently.
If you get to the point where this doesn't do it for you, you can implement caching strategies, but given that it's a simple message board, you shouldn't encounter that problem for a while.
EDIT:
Just saw your second concern about the same query repeating 10 times on a page. Don't do that :) Just pull the data once and store it in a variable. No need to repeat the same query multiple times.
Just use COUNT. It will be more accurate and will avoid any possible missed cases.
The case you mention of displaying the post count multiple times on a page won't be a problem unless you have an extremely high traffic site.
In any other case, the query cache of your database server will execute the query, then keep a cache of the response until any of the tables that the query relies on change. In the course of a single page load, nothing else should change, so you will only be executing the query once.
If you really need to worry about it, you can just cache it yourself in a variable and just execute the query once.
Generally speaking, your database queries will always be extremely efficient compared to your app logic. As such, the time wasted on maintaining the post_count in the user table will most probably be far far less than is needed to run a query to update the user table whenever a comment is posted.
Also, it is usually considered bad DB structure to have a field such as you are describing.
There are arguments for both, so ultimately it depends on the volume of traffic you expect. If your code is solid and properly layered, you can confidently keep a row count in your users' record without worrying about losing accuracy, and over time, count() will potentially get heavy, but updating a row count also adds overhead.
For a small site, it makes next to no difference, so if (and only if) you're a stickler for efficiency, the only way to get a useful answer is to run some benchmarks and find out for yourself. One way or another, it's going to be 3/10ths of 2/8ths of diddley squat, so do whatever feels right :)
It's totally reasonable to store the post counts in a column in your Users table. Then, to ensure that your post counts don't become increasingly inaccurate over time, run a scheduled task (e.g. nightly) to update them based on your Posts table.
We have this PHP application which selects a row from the database, works on it (calls an external API which uses a webservice), and then inserts a new register based on the work done. There's an AJAX display which informs the user of how many registers have been processed.
The data is mostly text, so it's rather heavy data.
The process is made by thousands of registers a time. The user can choose how many registers to start working on. The data is obtained from one table, where they are marked as "done". No "WHERE" condition, except the optional "WHERE date BETWEEN date1 AND date2".
We had an argument over which approach is better:
Select one register, work on it, and insert the new data
Select all of the registers, work with them in memory and insert them in the database after all the work was done.
Which approach do you consider the most efficient one for a web environment with PHP and PostgreSQL? Why?
It really depends how much you care about your data (seriously):
Does reliability matter in this case? If the process dies, can you just re-process everything? Or can't you?
Typically when calling a remote web service, you don't want to be calling it twice for the same data item. Perhaps there are side effects (like credit card charges), or maybe it is not a free API...
Anyway, if you don't care about potential duplicate processing, then take the batch approach. It's easy, it's simple, and fast.
But if you do care about duplicate processing, then do this:
SELECT 1 record from the table FOR UPDATE (ie. lock it in a transaction)
UPDATE that record with a status of "Processing"
Commit that transaction
And then
Process the record
Update the record contents, AND
SET the status to "Complete", or "Error" in case of errors.
You can run this code concurrently without fear of it running over itself. You will be able to have confidence that the same record will not be processed twice.
You will also be able to see any records that "didn't make it", because their status will be "Processing", and any errors.
If the data is heavy and so is the load, considering the application is not real time dependant the best approach is most definately getting the needed data and working on all of it, then putting it back.
Efficiency speaking, regardless of language is that if you are opening single items, and working on them individually, you are probably closing the database connection. This means that if you have 1000's of items, you will open and close 1000's of connections. The overhead on this far outweighs the overhead of returning all of the items and working on them.
Is it possible to do a simple count(*) query in a PHP script while another PHP script is doing insert...select... query?
The situation is that I need to create a table with ~1M or more rows from another table, and while inserting, I do not want the user feel the page is freezing, so I am trying to keep update the counting, but by using a select count(\*) from table when background in inserting, I got only 0 until the insert is completed.
So is there any way to ask MySQL returns partial result first? Or is there a fast way to do a series of insert with data fetched from a previous select query while having about the same performance as insert...select... query?
The environment is php4.3 and MySQL4.1.
Without reducing performance? Not likely. With a little performance loss, maybe...
But why are you regularily creating tables and inserting millions of row? If you do this only very seldom, can't you just warn the admin (presumably the only one allowed to do such a thing) that this takes a long time. If you're doing this all the time, are you really sure you're not doing it wrong?
I agree with Stein's comment that this is a red flag if you're copying 1 million rows at a time during a PHP request.
I believe that in a majority of cases where people are trying to micro-optimize SQL, they could get much greater performance and throughput by approaching the problem in a different way. SQL shouldn't be your bottleneck.
If you're doing a single INSERT...SELECT, then no, you won't be able to get intermediate results. In fact this would be a Bad Thing, as users should never see a database in an intermediate state showing only a partial result of a statement or transaction. For more information, read up on ACID compliance.
That said, the MyISAM engine may play fast and loose with this. I'm pretty sure I've seen MyISAM commit some but not all of the rows from an INSERT...SELECT when I've aborted it part of the way through. You haven't said which engine your table is using, though.
The other users can't see the insertion until it's committed. That's normally a good thing, since it makes sure they can't see half-done data. However, if you want them to see intermediate data, you could throw in an occassional call to "commit" while you're inserting.
By the way - don't let anybody tell you to turn autocommit on. That a HUGE time waster. I have a "delete and re-insert" job on my database that takes 1/3rd as long when I turn off autocommit.
Just to be clear, MySQL 4 isn't configured by default to use transactions. It uses the MyISAM table type which locks the entire table for each insert, if I remember correctly.
Your best bet would be to use one of the MySQL bulk insertion functions, such as LOAD DATA INFILE, as these are dramatically faster at inserting large amounts of data. As for the counting, well, you could break the inserts into N groups of 1000 (or Y) then divide your progress meter into N sections and just update it on each group's request.
Edit: Another thing to consider is, if this is static data for a template, then you could use a "select into" to create a new table with the same data. Not sure what your application is, or the intended functionality, but that could work as well.
If you can get to the console, you can ask various status questions that will give you the information you are looking for. There's a command that goes something like "SHOW processlist".