The best way to find a match between a few columns in the database
I'd like to do something like this:
If you find a match for $a, display the ID of the row.
I am debating between two approaches:
Select the entire table once, keep the results in an array, and then look for matches and present them from the array
Or query the table for a match each time
The problem is that each time I query the whole table (a very large table) I run into the memory limit.
So I'm looking for the approach that uses the least memory.
If all the data is in a single table, be sure that the data you are querying is indexed. This will ensure an 'optimal' search for your table.
In terms of memory, if you have an extremely large result set and slam the entire dataset into an array, you may run out of memory. To deal with this, you should page the data e.g. load some limited number results into the array for display, then present more data as the user asks for it.
Generally, selecting limited results from the database is faster and less memory intensive than populating large arrays. For a large table, be sure you only select the data you require. You might be looking for something like
SELECT record_id FROM your_table WHERE your_table.your_column = '$a' LIMIT 1;
This will only return one record in your result set.
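As a minimal sketch (the connection details, 'your_table', 'your_column' and 'record_id' are placeholders, not from the original question), the same lookup can be done with a prepared statement, which avoids interpolating $a into the SQL and still fetches only one row:
// Minimal sketch: look up a single matching row by value.
// Connection details and table/column names are placeholders.
$mysqli = new mysqli('localhost', 'user', 'password', 'database');

$a = 'value to find';  // the search value from the question

$stmt = $mysqli->prepare(
    "SELECT record_id FROM your_table WHERE your_column = ? LIMIT 1"
);
$stmt->bind_param('s', $a);      // bind instead of interpolating $a into the SQL
$stmt->execute();

$row = $stmt->get_result()->fetch_assoc();  // get_result() needs the mysqlnd driver
if ($row !== null) {
    echo "Match found, ID: " . $row['record_id'];
} else {
    echo "No match";
}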
I have a table with currently ~1500 rows which is expected to grow over time (can't say by how much, but still). The website is read-only and lets users run complex queries through some forms; the search query is completely URL-encoded since it's a public database. It's important to know that users can choose which column the data must be sorted by.
I'm not concerned about adding some indexes and slowing down INSERTs and UPDATEs (performed only occasionally by admins) since the site is basically read-heavy, but I need to paginate results, as some popular queries can return 900+ results and that takes up too much space and RAM on the client side (results are further processed to create a quite rich <div> HTML element with an <img> for each result, btw).
I'm aware of OFFSET {$m} LIMIT {$n} but would like to avoid it.
I'm also aware of this approach:
Query
SELECT *
FROM table
WHERE {$filters} AND id > {$last_id}
ORDER BY id ASC
LIMIT {$results_per_page}
and that's what I'd like to use, but that requires rows to be sorted only by their ID!
I've come up with (what I think is) a very similar query to custom sort results and allow efficient pagination.
Query:
SELECT *
FROM table
WHERE {$filters} AND {$column_id} > {$last_column_id}
ORDER BY {$column} ASC
LIMIT {$results_per_page}
but that unfortunately requires a {$last_column_id} value to be passed between pages!
I know indexes (especially unique indexes) are basically automatically-updated integer-based columns that "rank" a table by values of a column (be it integer, varchar etc.), but I really don't know how to make MySQL return the needed $last_column_id for that query to work!
The only thing I can come up with is to put an additional "XYZ_id" integer column next to every "XYZ" column users can sort results by, then update values periodically through some scripts, but is it the only way to make it work? Please help.
(Too many comments to fit into a 'comment'.)
Is the query I/O bound? Or CPU bound? It seems like a mere 1500 rows would lead to being CPU-bound and fast enough.
What engine are you using? How much RAM? What are the settings of key_buffer_size and innodb_buffer_pool_size?
Let's see SHOW CREATE TABLE. If the table is full of big BLOBs or TEXT fields, we need to code the query to avoid fetching those bulky fields only to throw them away because of OFFSET. Hint: Fetch the LIMIT IDs, then reach back into the table to get the bulky columns.
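A rough sketch of that hint (table, column, and filter names here are illustrative, not taken from the question): first run a narrow query that only returns the ids for the requested page, then fetch the bulky columns for just those ids:
// Sketch of "fetch the LIMITed ids first, then reach back for the bulky columns".
// Table/column names and the filter are placeholders.
$mysqli = new mysqli('localhost', 'user', 'password', 'database');
$perPage = 20;
$offset  = 100;

// Step 1: narrow query, ideally satisfied from an index on (x, y).
$ids = [];
$res = $mysqli->query("SELECT id FROM t WHERE x = 'foo' ORDER BY y LIMIT $offset, $perPage");
while ($row = $res->fetch_assoc()) {
    $ids[] = (int) $row['id'];
}

// Step 2: fetch the bulky TEXT/BLOB columns only for the ids on this page.
if ($ids) {
    $idList = implode(',', $ids);
    $rows = $mysqli->query("SELECT id, big_text FROM t WHERE id IN ($idList)")
                   ->fetch_all(MYSQLI_ASSOC);
    // Note: IN() does not preserve order; re-sort in PHP if page order matters.
}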
The only way for this to be efficient:
SELECT ...
WHERE x = ...
ORDER BY y
LIMIT 100,20
is to have INDEX(x,y). But, even that, will still have to step over 100 cow paddies.
You have implied that there are many possible WHERE and ORDER BY clauses? That would imply that adding enough indexes to cover all cases is probably impractical?
"Remembering where you left off" is much better than using OFFSET, so try to do that. That avoids the already-discussed problem with OFFSET.
Do not use WHERE (a,b) > (x,y); that construct used not to be optimized well. (Perhaps 5.7 has fixed it, but I don't know.)
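As a sketch of "remembering where you left off" with a user-chosen sort column (the table name and column whitelist are placeholders), carry the last row's sort value and id to the next page and use the id as a tie-breaker, writing the predicate out explicitly rather than using the row-constructor form mentioned above:
// Keyset ("seek") pagination sketch on a user-selected sort column.
// Table name 't' and the column whitelist are placeholders.
$allowed = ['price', 'name', 'created_at'];
$col = in_array($_GET['sort'] ?? '', $allowed, true) ? $_GET['sort'] : 'id';

$lastVal = $_GET['last_val'] ?? null;      // sort-column value of the last row shown
$lastId  = (int) ($_GET['last_id'] ?? 0);  // id of the last row shown (tie-breaker)
$perPage = 20;

$mysqli = new mysqli('localhost', 'user', 'password', 'database');

if ($lastVal === null) {
    // First page: no "where you left off" yet.
    $stmt = $mysqli->prepare("SELECT * FROM t ORDER BY `$col` ASC, id ASC LIMIT ?");
    $stmt->bind_param('i', $perPage);
} else {
    // Subsequent pages: everything after (last_val, last_id), spelled out
    // instead of WHERE (`$col`, id) > (?, ?).
    $stmt = $mysqli->prepare(
        "SELECT * FROM t
         WHERE `$col` > ? OR (`$col` = ? AND id > ?)
         ORDER BY `$col` ASC, id ASC
         LIMIT ?"
    );
    $stmt->bind_param('ssii', $lastVal, $lastVal, $lastId, $perPage);
}
$stmt->execute();
$rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
// The last element of $rows supplies last_val/last_id for the "next page" link.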
My blog on OFFSET discusses your problem. (However, it may or may not help your specific case.)
In a site I maintain I have a need to query the same table (articles) twice, once for each category of article. AFAICT there are basically two ways of doing this (maybe someone can suggest a better, third way?):
Perform the db query twice, meaning the db server has to sort through the entire table twice. After each query, I iterate over the cursor to generate html for a list entry on the page.
Perform the query just once and pull out all the records, then sort them into two separate arrays. After this, I have to iterate over each array separately in order to generate the HTML.
So it's this:
$newsQuery = $mysqli->query("SELECT * FROM articles WHERE type='news' ");
while($newRow = $newsQuery->fetch_assoc()){
    // generate article summary in html
}
// repeat for informational articles
vs this:
$query = $mysqli->query("SELECT * FROM articles ");
$news = Array();
$info = Array();
while($row = $query->fetch_assoc()){
    if($row['type'] == "news"){
        $news[] = $row;
    }else{
        $info[] = $row;
    }
}
// iterate over each array separate to generate article summaries
The recordset is not very large, currently <200, and will probably grow to 1000-2000. Is there a significant difference in the times between the two approaches, and if so, which one is faster?
(I know this whole thing seems awfully inefficient, but it's a poorly coded site I inherited and have to take care of without a budget for refactoring the whole thing...)
I'm writing in PHP, no framework :( , on a MySql db.
Edit
I just realized I left out one major detail. On a given page in the site, we will display (and thus retrieve from the db) no more than 30 records at once - but here's the catch: 15 info articles, and 15 news articles. On each page we pull the next 15 of each kind.
You know you can sort in the DB right?
SELECT * FROM articles ORDER BY type
EDIT
Due to the change made to the question, I'm updating my answer to address the newly revealed requirement: 15 rows for 'news' and 15 rows for not-'news'.
The gist of the question is the same "which is faster... one query or two separate queries". The gist of the answer remains the same: each database roundtrip incurs overhead (extra time, especially over a network connection to a separate database server), so with all else being equal, reducing the number of database roundtrips can improve performance.
The new requirement really doesn't impact that. What the newly revealed requirement really impacts is the actual query to return the specified resultset.
For example:
( SELECT n.*
FROM articles n
WHERE n.type='news'
LIMIT 15
)
UNION ALL
( SELECT o.*
FROM articles o
WHERE NOT (o.type<=>'news')
LIMIT 15
)
Running that statement as a single query is going to require fewer database resources, and be faster than running two separate statements, and retrieving two disparate resultsets.
We weren't provided any indication of what the other values for type can be, so the statement offered here simply addresses two general categories of rows: rows that have type='news', and all other rows that have some other value for type.
That query assumes that type allows for NULL values, and we want to return rows that have a NULL for type. If that's not the case, we can adjust the predicate to be just
WHERE o.type <> 'news'
Or, if there are specific values for type we're interested in, we can specify that in the predicate instead
WHERE o.type IN ('alert','info','weather')
If "paging" is a requirement... "next 15", the typical pattern we see applied, LIMIT 30,15 can be inefficient. But this question isn't asking about improving efficiency of "paging" queries, it's asking whether running a single statement or running two separate statements is faster.
And the answer to that question is still the same.
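As a minimal sketch of the PHP side (assuming an existing $mysqli connection, as in the question's own code), the combined statement runs once and the rows are split for the two sections of the page:
// Run the UNION ALL statement once and split the rows for the two page sections.
$sql = "( SELECT n.* FROM articles n WHERE n.type='news' LIMIT 15 )
        UNION ALL
        ( SELECT o.* FROM articles o WHERE NOT (o.type<=>'news') LIMIT 15 )";
$result = $mysqli->query($sql);

$news = array();
$info = array();
while ($row = $result->fetch_assoc()) {
    if ($row['type'] === 'news') {
        $news[] = $row;
    } else {
        $info[] = $row;
    }
}
// render the 15 news summaries, then the 15 other article summaries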
ORIGINAL ANSWER below
There's overhead for every database roundtrip. In terms of database performance, for small sets (like you describe) you're better off with a single database query.
The downside is that you're fetching all of those rows and materializing an array. (But, that looks like that's the approach you're using in either case.)
Given the choice between the two options you've shown, go with the single query. That's going to be faster.
As far as a different approach, it really depends on what you are doing with those arrays.
You could actually have the database return the rows in a specified sequence, using an ORDER BY clause.
To get all of the 'news' rows first, followed by everything that isn't 'news', you could
ORDER BY type<=>'news' DESC
That's MySQL shorthand for the more ANSI-standards-compliant:
ORDER BY CASE WHEN t.type = 'news' THEN 1 ELSE 0 END DESC
Rather than fetch every single row and store it in an array, you could just fetch from the cursor as you output each row, e.g.
while($row = $query->fetch_assoc()) {
    echo "<br>Title: " . htmlspecialchars($row['title']);
    echo "<br>byline: " . htmlspecialchars($row['byline']);
    echo "<hr>";
}
The best way of dealing with a situation like this is to test it for yourself. It doesn't matter how many records you have at the moment; you can simulate whatever amount you'd like, that's never a problem. Also, 1000-2000 is really a small set of data.
I don't quite understand why you'd have to iterate over all the records twice. You should never retrieve all the records in a query either way, only the small subset you need to work with. In a typical site where you manage articles it's usually about 10 records per page MAX. No user will ever go through 2000 articles in a way that requires pulling all the records at once. Utilize paging and smart querying.
// iterate over each array separate to generate article summaries
Not really sure what you mean by this, but something tells me this data should be stored in the database as well. I really hope you're not generating article excerpts on the fly for every page hit.
It all sounds to me more like a bad architecture design than anything else...
PS: I believe sorting/ordering/filtering of database data should be done on the database server, not in the application itself. You may save some traffic by doing a single query, but it won't help much if you transfer too much data at once that you won't be using anyway.
I have multiple datasets, as below, and I want to handle them using PHP.
Dataset #1 (75 cols * 27,000 rows)
col #1 col #2 ...
record #1
record #2
...
Dataset #2 (32 cols * 7,500 rows)
....
Dataset #3 (44 cols * 17,500 rows)
....
Here, the number of records and columns differs between datasets, so it is hard to use a single database structure.
And note that each 'cell' of a dataset consists only of either a real number or N/A... and the datasets are completely fixed, i.e., there will be no changes.
So what I've done so far is turn them into file-based tables and record the starting offset of each record in the file.
This gave quite a nice access speedup, but it is still not satisfactory, because accessing each record requires parsing it into a PHP data structure.
What I ultimately want is to eliminate the parsing step. But serialization was not a good choice because it loads the entire dataset. Of course it is possible to serialize each record individually and keep their offsets as I've done without serialization, but that doesn't seem very elegant to me.
So here's the question: is there any method to load part of a dataset without a parsing step that is better than the partial serialization I suggested?
Many thanks in advance.
More information
Maybe I confused readers a little bit.
Each dataset is separate, and they exist as independent files.
The usual data access pattern is row-wise. Each row has a unique string ID, and an ID in one dataset may also exist in another dataset, but not necessarily. Above all, what I am concerned with is accelerating access speed when I have a query to fetch specific row(s) in a dataset. For example, suppose there is a dataset like the one below.
Dataset #1 (plain-text file)
obs1 obs2 obs3 ...
my1 3.72 5.28 10.22 ...
xu1 3.44 5.82 15.33 ...
...
qq7 8.24 10.22 47.54 ...
And there is a corresponding index file, serialized using PHP. The key of each item is the unique ID in the dataset, and its value is that record's offset in the file.
Index #1 (PHP-serialized one, not same as actual serialized one)
Array (
"my1" => 0,
"xu1" => 337,
...
"qq7" => 271104
)
So it is possible to know that record "xu1" starts 337 bytes from the beginning of the dataset file.
In order to access and fetch some rows using their unique IDs:
1) Load the serialized index file
2) Find the IDs matching the query
3) Seek to those positions, fetch the rows, and parse them into PHP arrays
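A minimal sketch of those three steps (the file names and the exact record format are assumptions for illustration): load the serialized offset index once, then fseek() directly to the requested record and parse only that line:
// Sketch: fetch single rows from a large plain-text dataset via an offset index.
// 'dataset1.txt' and 'dataset1.idx' are placeholder file names.
$index = unserialize(file_get_contents('dataset1.idx')); // e.g. ["xu1" => 337, ...]

function fetchRow(string $id, array $index, string $dataFile): ?array
{
    if (!isset($index[$id])) {
        return null;                     // unknown ID
    }
    $fh = fopen($dataFile, 'rb');
    fseek($fh, $index[$id]);             // jump straight to the record
    $line = fgets($fh);                  // read just that one line
    fclose($fh);

    // Parse "xu1 3.44 5.82 15.33 ..." into [id, value, value, ...]
    return preg_split('/\s+/', trim($line));
}

$row = fetchRow('xu1', $index, 'dataset1.txt');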
The problems I have are:
1) Since I am using exact matching, it is impossible to fetch multiple rows that partially match the query (for example, fetching the "xu1" row for the query "xu")
2) Even though I indexed the dataset, the fetch speed is not satisfactory (around 0.05 sec. for a single query)
3) When I tried to solve the above problem by serializing an entire dataset, (perhaps unsurprisingly) the loading speed became substantially slower.
The easiest way to solve the above problems would be to put them into a database, and I would do so,
but I hope to find a better way that keeps them as plain text or some text-like format (for example, serialized or JSON-encoded).
Many thanks for your interest in my problem!
I think I understand your question to some extent. You've got 3 sets of data, which may or may not be related, with different numbers of columns and rows.
This may not be the cleanest-looking solution, but I think it could serve the purpose. You can use MySQL to store the data to avoid parsing the files again and again. You can store the data in three tables, or put them in one table with all the columns (rows that don't need a given column can have NULL for the field value).
You can also use SQL unions, in case you want to run queries on all three datasets collectively, with tricks like
SELECT NULL AS col1, col2, col3 FROM table1 WHERE col2 = 'something'
UNION ALL
SELECT col1, NULL AS col2, NULL AS col3 FROM table2 WHERE col1 = 'something else'
ORDER BY col1
I am using WordPress with some custom post types (just to give a description of my DB structure - it's WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID postID city state
auto int varchar varchar
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forced me to add/update cities and states for every post, even though this will cause duplicates (in the city/state fields). This must be so because I must keep track of which states/cities actually exist (i.e., have a post associated with them). When a post is added or deleted, its record is added to or removed from the cities table along with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.
To answer your question about using SELECT distinct vs. array_unique, I would say that I would almost always prefer to limit the result set in the database assuming of course that you have an appropriate index on the field for which you are trying to get distinct values. This saves you time in transmitting extra data from DB to application and for the application reading that data into memory where you can work with it.
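For illustration (the index name is a placeholder; the table layout is taken from the question), a composite index on the two columns lets MySQL answer the DISTINCT from the index alone, so the deduplication never reaches PHP:
// Illustrative sketch: let the database deduplicate instead of array_unique() in PHP.
$mysqli = new mysqli('localhost', 'user', 'password', 'wordpress');

// One-time index so SELECT DISTINCT can be served from the index (name is a placeholder).
$mysqli->query("CREATE INDEX idx_city_state ON cities (city, state)");

$result = $mysqli->query("SELECT DISTINCT city, state FROM cities");
$combos = $result->fetch_all(MYSQLI_ASSOC);   // already unique city/state pairs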
As far as your separate table design, it is hard to speculate whether this is a good approach or not; this would largely depend on how you are actually performing your query (i.e. are you doing two separate queries - one for post info and one for city/state info - or querying across a join?).
There is really only one definitive way to determine which approach is fastest. That is to test both ways in your environment.
1) A fully normalized table (where it holds only integer values and the other tables have just an int plus a varchar) has an advantage when you are not doing full table joins often and are doing a lot of searching on the normalized fields. The downside is that it requires large join/sort buffers and results in more complex queries = much less chance the query will be auto-optimized by MySQL. So you have to optimize your queries yourself.
2) SELECT DISTINCT will be faster in almost all cases. The only case where it will be slower is if you have a small sort buffer configured in /etc/my.cnf and a much larger memory buffer for PHP.
A DISTINCT select can use indexes, while your code can't.
Also, sending a large amount of data to your app requires a lot of MySQL CPU time and wall-clock time.
I have about 1 million rows so it's going pretty slow. Here's the query:
$sql = "SELECT `plays`,`year`,`month` FROM `game`";
I've looked up indexes but it only makes sense to me when there's a 'where' clause.
Any ideas?
Indexes can make a difference even without a WHERE clause depending on what other columns you have in your table. If the 3 columns you are selecting only make up a small proportion of the table contents a covering index on them could reduce the amount of pages that need to be scanned.
Moving less data around, though, either by adding a WHERE clause or by doing the processing in the database, would be better if possible.
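A minimal sketch of that covering-index idea (the index name is a placeholder): with all three selected columns in one index, MySQL can scan the smaller index instead of the full table rows:
// Illustrative covering index: the SELECT below can then be answered from the index alone.
$mysqli = new mysqli('localhost', 'user', 'password', 'database');
$mysqli->query("CREATE INDEX idx_plays_year_month ON `game` (`plays`, `year`, `month`)");

// EXPLAIN should now show "Using index" in the Extra column for this query.
$result = $mysqli->query("SELECT `plays`,`year`,`month` FROM `game`");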
If you don't need all 1 million records, you can pull n records:
$sql = "SELECT `plays`,`year`,`month` FROM `game` LIMIT 0, 1000";
Where the first number is the offset (where to start from) and the second number is the number of rows. You might want to use ORDER BY too, if only pulling a select number of records.
You won't be able to make that query much faster, short of fetching the data from a memory cache instead of the db. Fetching a million rows takes time. If you need more speed, figure out if you can have the DB do some of the work, e.g. sum/group things together.
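For instance (a sketch only; whether per-month totals are what's actually needed is an assumption, not something stated in the question), pushing the aggregation into the database returns a handful of rows instead of a million:
// Sketch: let MySQL aggregate instead of summing a million rows in PHP.
$mysqli = new mysqli('localhost', 'user', 'password', 'database');
$sql = "SELECT `year`, `month`, SUM(`plays`) AS total_plays
        FROM `game`
        GROUP BY `year`, `month`";
$result = $mysqli->query($sql);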
If you're not using all the rows, you should use the LIMIT clause in your SQL to fetch only a certain range of those million rows.
If you really need all the 1 million rows to build your output, there's not much you can do from the database side.
However you may want to cache the result on the application side, so that the next time you'd want to serve the same output, you can return the processed output from your cache.
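A minimal sketch of that application-side caching (the cache file path, lifetime, and the buildReportFromDatabase() helper are all hypothetical placeholders): build the processed output once and reuse it until it expires:
// Sketch: cache the processed output to a file instead of rebuilding it on every request.
$cacheFile = '/tmp/game_report.cache';    // placeholder path
$ttl       = 300;                         // placeholder lifetime in seconds

if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
    echo file_get_contents($cacheFile);   // serve the cached output
} else {
    $output = buildReportFromDatabase();  // hypothetical function that runs the big query
    file_put_contents($cacheFile, $output);
    echo $output;
}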
The realistic answer is no. With no restrictions (ie. a WHERE clause or a LIMIT) on your query, then you're almost guaranteed a full table scan every time.
The only way to decrease the scan time would be to have less data (or perhaps a faster disk). It's possible that you could re-work your data to make your rows more efficient (CHARS instead of VARCHARS in some cases, TINYINTS instead of INTS, etc.), but you're really not going to see much of a speed difference with that kind of micro-optimization. Indexes are where it's at.
Generally if you're stuck with a case like this where you can't use indexes, but you have large tables, then it's the business logic that requires some re-working. Do you always need to select every record? Can you do some application-side caching? Can you fragment the data into smaller sets or tables, perhaps organized by day or month? Etc.