I'm using the PHP client to work with a Bigtable instance. Is there a way to fetch the total number of rows in a table?
So far the solution I found is to use Filter::pass() and manually loop through all returned results. Wondering if there is an easier way.
Bigtable does not maintain a separate row count that you can read directly. Even the cbt tool, which supports count, reads all the rows to arrive at the number. Since you only need a count, use the strip-value filter, $filter = Filter::value()->strip(). It removes all cell values, which keeps memory consumption low.
For larger datasets where you want to avoid scans altogether, you could use increment with your writes to maintain the count yourself.
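The counter idea can be sketched independently of any client library. What follows is a minimal in-memory Python sketch, not Bigtable code: `CountedStore`, `put`, and `delete` are hypothetical names, and a plain dict stands in for the table. In a real setup the counter would presumably live in its own row and be bumped with the client's increment operation alongside each write.

```python
class CountedStore:
    """Toy key-value store that keeps a running row count (hypothetical names)."""

    def __init__(self):
        self._rows = {}      # stand-in for the table
        self.row_count = 0   # stand-in for a counter cell updated on each write

    def put(self, key, value):
        # Only a brand-new key changes the count; overwrites leave it alone.
        if key not in self._rows:
            self.row_count += 1
        self._rows[key] = value

    def delete(self, key):
        # Decrement only if the row actually existed.
        if key in self._rows:
            del self._rows[key]
            self.row_count -= 1


store = CountedStore()
store.put("row1", "a")
store.put("row2", "b")
store.put("row1", "c")   # overwrite, count stays the same
store.delete("row2")
print(store.row_count)
```

The awkward part in a distributed store is knowing whether a write created a new row or overwrote an existing one; the sketch sidesteps that by checking the dict first.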
If you are sticking with scans, for larger tables you may want to use a system like Dataflow, which can distribute the counting across many workers and return the result quickly. Here is the SourceRowCount example from Dataflow.
Related
I have a dilemma that I'm trying to solve right now. I have a table called "generic_pricing" that has over a million rows. It looks like this....
I have a list of 25000 parts that I need to get generic_pricing data for. Some parts have a CLEI, some have a partNumber, and some have both. For each of the 25000 parts, I need to search the generic_pricing table to find all rows that match either clei or partNumber.
Making matters more difficult is that I have to do matches based on substring searches. For example, one of my parts may have a CLEI of "IDX100AB01", but I need the results of a query like....
SELECT * FROM generic_pricing WHERE clei LIKE 'IDX100AB%';
Currently, my lengthy PHP code for finding these matches loops through the 25000 items. For each item, I run the query above on clei. If rows are found, I use them for my calculations. If not, I execute a similar query on partNumber to try to find matches.
As you can imagine, this is very time consuming. And this has to be done for about 10 other tables similar to generic_pricing to run all of the calculations. The system is now bogging down and timing out trying to crunch all of this data. So now I'm trying to find a better way.
One thought I have is to just query the database one time to get all rows, and then use loops to find matches. But for 25000 items each having to compare against over a million rows, that just seems like it would take even longer.
Another thought I have is to get 2 associative arrays of all of the generic_pricing data. i.e. one array of all rows indexed by clei, and another all indexed by partNumber. But since I am looking for substrings, that won't work.
I'm at a loss here for an efficient way to handle this task. Is there anything that I'm overlooking to simplify this?
Do not query the db for all rows and sort them in your app. That will cause a lot more headaches.
Here are a few suggestions:
Use parameterized queries. This allows your db engine to compile the query once and use it multiple times. Otherwise it will have to optimize and compile the query each time.
Figure out a way to make IN work. Instead of using LIKE, try ... left(clei,8) in ('IDX100AB','IDX100AC','IDX101AB'...)
Do the calculations/math on the db side. Build a stored proc which takes a list of part/clei numbers and outputs the same list with the computed prices. You'll have a lot more control of execution and a lot less network overhead. If not a stored proc, build a view.
Paginate. If this data is being displayed somewhere, switch to processing in batches of 100 or less.
Build a cheat sheet. If speed is an issue try precomputing prices into a separate table nightly, include some partial clei/part numbers if needed. Then use the precomputed lookup table.
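To make the parameterized-query and IN suggestions concrete, here is a rough sketch in Python with SQLite standing in for MySQL (SQLite has no LEFT(), so substr(clei, 1, 8) plays that role). The `price` column, the function name `find_by_clei_prefix`, the batch size, and the 8-character prefix length are all assumptions for illustration; the real table layout was not shown.

```python
import sqlite3

PREFIX_LEN = 8  # assumed prefix length, taken from the 'IDX100AB%' example


def find_by_clei_prefix(conn, cleis, batch_size=500):
    """Return {prefix: [rows]} for rows whose CLEI prefix matches any input CLEI."""
    prefixes = sorted({c[:PREFIX_LEN] for c in cleis if c})
    matches = {}
    # One query per batch of prefixes instead of one LIKE query per part.
    for start in range(0, len(prefixes), batch_size):
        batch = prefixes[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        sql = (
            "SELECT clei, partNumber, price FROM generic_pricing "
            f"WHERE substr(clei, 1, {PREFIX_LEN}) IN ({placeholders})"
        )
        for row in conn.execute(sql, batch):
            matches.setdefault(row[0][:PREFIX_LEN], []).append(row)
    return matches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE generic_pricing (clei TEXT, partNumber TEXT, price REAL)")
conn.executemany("INSERT INTO generic_pricing VALUES (?, ?, ?)", [
    ("IDX100AB01", "PN-1", 10.0),
    ("IDX100AB02", "PN-2", 12.5),
    ("IDX100AC01", "PN-3", 7.0),
    ("ZZZ999XX01", "PN-4", 1.0),
])
result = find_by_clei_prefix(conn, ["IDX100AB01", "IDX100AC77"])
print(sorted(result))
```

With 25000 parts this turns 25000 round trips into a few dozen. A function applied to the column can keep a plain index from being used, so storing the prefix in its own indexed column may be worth considering.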
The best way to find a match between a few columns in the database
I'd like to do something like this:
If a match for $a is found, display the ID of the row
I am debating between two ways:
Select the entire table once, keep it in an array, and then look for matches in the array and present them from there
Or query the table for a match each time I need to search
The problem is that each time I query the whole table (a very large table) I hit the memory limit
So I'm looking for a way that takes the least memory
If all the data is in a single table, be sure that the data you are querying is indexed. This will ensure an 'optimal' search for your table.
In terms of memory, if you have an extremely large result set and slam the entire dataset into an array, you may run out of memory. To deal with this, you should page the data e.g. load some limited number results into the array for display, then present more data as the user asks for it.
Generally, selecting limited results from the database is faster and less memory intensive than populating large arrays. For a large table, be sure you only select the data you require. You might be looking for something like
SELECT record_id FROM your_table WHERE your_table.your_column = '$a' LIMIT 1;
This will only return one record in your result set.
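As a sketch of the same lookup with a bound parameter rather than a value pasted into the SQL string, here is a small Python example using SQLite as a stand-in for the real database. The table and column names follow the placeholder query above; `find_first_id` and the index name are made up for illustration.

```python
import sqlite3


def find_first_id(conn, value):
    """Return the id of the first matching row, or None if nothing matches."""
    row = conn.execute(
        "SELECT record_id FROM your_table WHERE your_column = ? LIMIT 1",
        (value,),  # bound parameter, so the value is never spliced into the SQL
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (record_id INTEGER PRIMARY KEY, your_column TEXT)"
)
# The index is what lets the lookup avoid reading the whole table.
conn.execute("CREATE INDEX idx_your_column ON your_table (your_column)")
conn.executemany(
    "INSERT INTO your_table (your_column) VALUES (?)", [("x",), ("a",), ("y",)]
)

found = find_first_id(conn, "a")
missing = find_first_id(conn, "nope")
print(found, missing)
```

Only one row ever crosses into application memory, whatever the size of the table.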
I have a search engine on a shared host that uses MySQL. This search engine potentially has millions/trillions etc of records.
Each time a search is performed I return a count of the records that can then be used for pagination purposes.
The count tells you how many results there are in regard to the search performed. MySQL count is I believe considered quite slow.
Order of search queries:
Search executed and results returned
Count query executed
I don't perform a PHP count as this will be far slower in larger data sets.
Question is, do I need to worry about MySQL "count" and at what stage should I worry about it. How do the big search engines perform this task?
In almost all cases the answer is indexing. The larger your database gets the more important it is to have a well designed and optimized indexing strategy.
The importance of indexing on a large database can not be overstated.
You are absolutely right about not looping in code to count DB records. Your RDBMS is optimized for operations like that; your programming language is not. Wherever possible you want to do any sorting, grouping, counting, and filtering operations within the SQL language provided by your RDBMS.
As for efficiently getting the count on a "paginated" query that uses a LIMIT clause, check out SQL_CALC_FOUND_ROWS (note that it is deprecated as of MySQL 8.0.17).
SQL_CALC_FOUND_ROWS tells MySQL to calculate how many rows there would
be in the result set, disregarding any LIMIT clause. The number of
rows can then be retrieved with SELECT FOUND_ROWS(). See Section
11.13, “Information Functions”.
If a MySQL database reaches several million records, that's a sign you'll be forced to stop using a monolithic data store, meaning you'll have to split reads and writes and most likely use a different storage engine than the default one.
Once that happens, you'll stop using the actual count of the rows and you'll start using the estimate, cache the search results and so on in order to alleviate the work on the database. Even Google uses caching and displays an estimate of number of records.
Anyway, for now, you've got 2 options:
1 - Run 2 queries, one to retrieve the data and the other one where you use COUNT() to get the number of rows.
2 - Use SQL_CALC_FOUND_ROWS like @JohnFX suggested.
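Option 1 might look roughly like the following. This is a sketch in Python with SQLite standing in for MySQL; the `documents` table, its `title` column, and `search_page` are hypothetical names, and a LIKE search is only a placeholder for whatever the real search does.

```python
import sqlite3


def search_page(conn, term, page, per_page=10):
    """Return (rows_for_page, total_matches) using two separate queries."""
    pattern = f"%{term}%"
    # Query 1: total number of matches, used to build the pagination links.
    total = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE title LIKE ?", (pattern,)
    ).fetchone()[0]
    # Query 2: only the rows for the requested page.
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE title LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, per_page, (page - 1) * per_page),
    ).fetchall()
    return rows, total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents (title) VALUES (?)",
    [(f"item {i}",) for i in range(25)] + [(f"other {i}",) for i in range(5)],
)

rows, total = search_page(conn, "item", page=3)
print(len(rows), total)
```

The count query returns a single number, so it adds little network traffic; whether it is cheap on the server depends on the indexes available for the search condition.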
Percona has an article about which is faster, though it might be outdated now.
The biggest problem you're facing is the way MySQL uses LIMIT OFFSET, which means you probably won't like your users using large offset numbers.
In case you indeed get millions of records, I don't foresee a bright future for your monolithic MySQL storage on a shared server. However, good luck to you and your project.
If I understand what you are trying to do properly, you can execute the one query and call mysql_num_rows() on the result in PHP; that should be pretty zippy. (The mysql_* functions were removed in PHP 7; mysqli_num_rows() is the mysqli equivalent.)
http://php.net/manual/en/function.mysql-num-rows.php
Since you're using PHP, you could use the mysql_num_rows method to tell you the count after the query is done. See here: http://www.php.net/manual/en/function.mysql-num-rows.php
I have about 1 million rows, so it's going pretty slowly. Here's the query:
$sql = "SELECT `plays`,`year`,`month` FROM `game`";
I've looked up indexes but it only makes sense to me when there's a 'where' clause.
Any ideas?
Indexes can make a difference even without a WHERE clause, depending on what other columns you have in your table. If the 3 columns you are selecting make up only a small proportion of the table contents, a covering index on them could reduce the number of pages that need to be scanned.
Not moving as much data around though, either by adding a WHERE clause or doing the processing in the database would be better if possible.
If you don't need all 1 million records, you can pull n records:
$sql = "SELECT `plays`,`year`,`month` FROM `game` LIMIT 0, 1000";
Where the first number is the offset (where to start from) and the second number is the number of rows. You might want to use ORDER BY too, if only pulling a select number of records.
You won't be able to make that query much faster, short of fetching the data from a memory cache instead of the db. Fetching a million rows takes time. If you need more speed, figure out if you can have the DB do some of the work, e.g. sum/group things together.
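As an illustration of letting the database do the summing, here is a sketch in Python with SQLite standing in for MySQL. It assumes the end goal is total plays per month, which the question does not actually state; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game ("
    "id INTEGER PRIMARY KEY, plays INTEGER, year INTEGER, month INTEGER)"
)
conn.executemany(
    "INSERT INTO game (plays, year, month) VALUES (?, ?, ?)",
    [(5, 2010, 1), (7, 2010, 1), (3, 2010, 2), (4, 2011, 1)],
)

# The database collapses the rows into one row per (year, month), so the
# application receives a handful of totals instead of every record.
totals = conn.execute(
    "SELECT year, month, SUM(plays) FROM game "
    "GROUP BY year, month ORDER BY year, month"
).fetchall()
print(totals)
```

The table is still scanned, but the amount of data sent to the application shrinks from one row per game record to one row per month.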
If you're not using all the rows, you should use the LIMIT clause in your SQL to fetch only a certain range of those million rows.
If you really need all the 1 million rows to build your output, there's not much you can do from the database side.
However you may want to cache the result on the application side, so that the next time you'd want to serve the same output, you can return the processed output from your cache.
The realistic answer is no. With no restrictions (ie. a WHERE clause or a LIMIT) on your query, then you're almost guaranteed a full table scan every time.
The only way to decrease the scan time would be to have less data (or perhaps a faster disk). It's possible that you could re-work your data to make your rows more efficient (CHARS instead of VARCHARS in some cases, TINYINTS instead of INTS, etc.), but you're really not going to see much of a speed difference with that kind of micro-optimization. Indexes are where it's at.
Generally if you're stuck with a case like this where you can't use indexes, but you have large tables, then it's the business logic that requires some re-working. Do you always need to select every record? Can you do some application-side caching? Can you fragment the data into smaller sets or tables, perhaps organized by day or month? Etc.
As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
Means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible as that is the weapon of choice at my company.
Cheers.
First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll have the first 20 records, let's say their id's are: 5,8,9,...,55,64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, only 20 records read. It doesn't allow you to jump arbitrarily to any page, but most of the time people just browse the next/prev page. An index on "id" will improve the performance, even with big OFFSET values.
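A minimal sketch of this "seek" style of pagination, written in Python with SQLite standing in for MySQL. The `name` column and the `fetch_page` helper are hypothetical; the ids are inserted with gaps on purpose to show that consecutive values are not required.

```python
import sqlite3


def fetch_page(conn, after_id=0, page_size=20):
    """Return the next page of rows whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, name FROM my_table WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT)")
# Odd ids only (1, 3, ..., 99): 50 rows with gaps between them.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 101, 2)],
)

page1 = fetch_page(conn)
last_seen = page1[-1][0]  # this value goes into the "next page" link
page2 = fetch_page(conn, after_id=last_seen)
print(last_seen, page2[0][0], page2[-1][0])
```

Each request starts from the last id seen, so the cost of a page does not grow with the page number the way LIMIT with a large offset does.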
A solution might be to not use the limit clause, and use a join instead -- joining on a table used as some kind of sequence.
For more information, I found this question / answer on SO, which gives an example that might help you ;-)
There are basically 3 approaches to this, each of which have their own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT and grab only the records you need with each request, completely stateless. Benefit in that it only sends the records for the page currently requested, so requests are small, downsides in that a) it requires a server request for each page, and b) it's slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).
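The third approach could be sketched as below, in Python. `QueryCache`, `get_page`, and the `run_query` callback are hypothetical names, the cache lives in process memory only, and nothing here evicts expired entries, which is exactly the cleanup burden mentioned above.

```python
import hashlib
import time


class QueryCache:
    """Tiny in-memory cache of full query results with a time-to-live."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # cache key -> (expires_at, rows)

    @staticmethod
    def key_for(sql, params=()):
        # Hash of query text plus parameters identifies a result set.
        raw = repr((sql, tuple(params))).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_page(self, sql, params, run_query, page, per_page=20):
        key = self.key_for(sql, params)
        entry = self._entries.get(key)
        now = self.clock()
        if entry is None or entry[0] <= now:
            # Cache miss or expired entry: run the real query once.
            entry = (now + self.ttl, list(run_query(sql, params)))
            self._entries[key] = entry
        start = (page - 1) * per_page
        return entry[1][start:start + per_page]


calls = []


def fake_query(sql, params):
    """Stand-in for a database call; records how often it is invoked."""
    calls.append(sql)
    return [(i,) for i in range(1, 51)]


cache = QueryCache(ttl_seconds=60)
first = cache.get_page("SELECT id FROM t", (), fake_query, page=1)
third = cache.get_page("SELECT id FROM t", (), fake_query, page=3)
print(len(calls), first[0], third[0])
```

Both page requests are served from a single execution of the query. The price is holding the whole result set in memory for the lifetime of the entry.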
SELECT * FROM my_table LIMIT 10000, 20;
means show 20 records starting from record #10000 in the search. If you're using primary keys in the WHERE clause, there will not be a heavy load on MySQL.
Any other method of pagination, such as a join, will put a really heavy load on the server.
I'm not aware of the performance decrease that you've mentioned, and I don't know of any other solution for pagination. However, an ORDER BY clause might help you reduce the load time.
The best way is to define an index field in my_table and increment it for every newly inserted row. Then use WHERE YOUR_INDEX_FIELD BETWEEN 10000 AND 10020
It will be much faster.
Some other options:
Partition the table per page, so that LIMIT is not needed
Store the results into a session (a good idea would be to create a hash of that data using md5, then using that cache the session per multiple users)