How to Filter a Large Data Set - PHP

I would assume that this should be handled on the database server instead of the Apache server in my specific case, but I'm wondering how I would filter a result set that could be a hundred thousand records or more.
On the interface side, a user would see the first page of results (say, 100 invoices), where they can filter the result set or sort by the columns. The way I've built this in the past was to generate a MySQL query comparing each of the visible columns against the search term wrapped in percent signs, using LIKE for the comparison. My only problem was that this seemed quite slow even on a database around the 300MB mark.
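For illustration, the kind of query I've been generating looks roughly like this (the table and column names here are just placeholders):

SELECT * FROM invoices
WHERE invoice_number LIKE '%searchterm%'
   OR customer_name  LIKE '%searchterm%'
   OR status         LIKE '%searchterm%'
ORDER BY invoice_date DESC
LIMIT 100 OFFSET 0;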
Since I'm relatively new to database performance and was unable to find any filtering strategies, how should I structure my queries to provide quick, filtered data?

Related

Can I leverage multi-threaded PHP for slow SQL queries?

I've been tasked with linking the IDs of two different APIs. The linking is done based on names, so the searches use wildcards and are a bit slow.
For example, one API uses the name Lionel Messi, while the other uses Lionel Andrés Messi. To handle this, queries are done like:
select id from players WHERE name LIKE '%Lionel%Messi%'
This proves effective but slow, with queries taking an average of 0.3 seconds; with 100k searches needed, this will take all day.
Since the slow bit is the query, would it be possible for my PHP program to be multi-threaded so that multiple queries could run at the same time?
Would it be as simple as splitting the list of 100k searches into 4 lists of 25k, and just running the script in 4 different web pages?
EDIT: BTW, the column "name" is indexed in the table "players"; however, that seems to have little to no impact on speed.
Yes, it sounds like this can be done multi-threaded, as each operation (linking a single pair of IDs) doesn't depend on the results of previous operations. To get the best performance, you would split the input (the table) into as many lists as you have processor cores. The split could be done multiple ways depending on your requirements, e.g. ID ranges, splitting into several different tables, etc. And yes, running the script in multiple browser windows should create the desired parallelisation, making use of all available CPU cores. It may depend on how your server (Apache, nginx, etc) is configured, but I think most servers in their default configuration will get this right.
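As a rough sketch of the splitting idea - assuming the unlinked names sit in a table I'll call api_names, with a numeric id and a player_id column to fill in (those names are made up) - you could run the same script four times with a different chunk argument, from the command line or from four browser tabs as you suggested:

<?php
// parallel_link.php - hypothetical sketch; run as: php parallel_link.php <chunk>  (chunk = 0..3)
$chunk  = (int) $argv[1];
$chunks = 4;

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Each process only touches the rows whose id falls in its chunk,
// so the four processes never overlap.
$names = $pdo->query("SELECT id, name FROM api_names WHERE id % $chunks = $chunk");

$find = $pdo->prepare("SELECT id FROM players WHERE name LIKE ? LIMIT 1");
$link = $pdo->prepare("UPDATE api_names SET player_id = ? WHERE id = ?");

foreach ($names as $row) {
    // Turn "Lionel Messi" into the '%Lionel%Messi%' pattern from the question.
    $pattern = '%' . str_replace(' ', '%', $row['name']) . '%';
    $find->execute([$pattern]);
    $playerId = $find->fetchColumn();
    if ($playerId !== false) {
        $link->execute([$playerId, $row['id']]);
    }
}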
To elaborate on why the index has no effect here: an index is just a data structure that reverses the basic row lookup in order to find rows where a column matches a particular value. Instead of the input being a row location (not an id, but an actual offset that locates the row in physical storage) and the output being a row, the input is a column value (e.g. a numeric ID or a string) and the output is a list of row locations that match that value. Various data structures are used, but the mechanism depends on the actual value (e.g. the ID) being stored on disk in that structure. So the reason wildcard searches aren't indexed is that every possible wildcard pattern matching each unique value would have to be stored on disk.
Edit: as detailed in the answers linked in a comment (Mysql Improve Search Performance with wildcards (%%)), MySQL can use an index with wildcards as long as the pattern doesn't start with a wildcard -- presumably because rows can be eliminated immediately based on the start of the string.
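For example (EXPLAIN will show you which of these actually uses the index on name):

-- Can use the index: the fixed prefix narrows the range that has to be scanned.
SELECT id FROM players WHERE name LIKE 'Lionel%';

-- Cannot use the index: the leading % means every name has to be checked.
SELECT id FROM players WHERE name LIKE '%Messi%';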

Which database for dealing with very large result-sets?

I am currently working on a PHP application (pre-release).
Background
We have a table in our MySQL database which is expected to grow extremely large - it would not be unusual for a single user to own 250,000 rows in this table. Each row in the table is given an amount and a date, among other things.
Furthermore, this particular table is read from (and written to) very frequently - on the majority of pages. Given that each row has a date, I'm using GROUP BY date to minimise the size of the result-set given by MySQL - rows contained in the same year can now be seen as just one total.
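For illustration, the grouped query looks something like this (simplified, with placeholder table and column names):

SELECT YEAR(date) AS yr, SUM(amount) AS total
FROM transactions
WHERE user_id = ?
GROUP BY YEAR(date);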
However, a typical page will still have a result-set between 1000-3000 results. There are also places where many SUM()'s are performed, totalling many tens - if not hundreds - of thousands of rows.
Trying MySQL
On a typical page, MySQL was usually taking around 600-900ms. Using LIMIT and offsets wasn't helping performance, and the data has already been heavily normalised, so it doesn't seem like further normalisation would help.
To make matters worse, there are parts of the application which require the retrieval of 10,000-15,000 rows from the database. The results are then used in a calculation by PHP and formatted accordingly. Given this, the performance of MySQL wasn't acceptable.
Trying MongoDB
I have converted the table to MongoDB, and its speed is better - it usually takes around 250ms to retrieve 2,000 documents. However, the $group command in the aggregation pipeline - needed to aggregate fields depending on the year they fall in - slows things down. Unfortunately, keeping a total and updating it whenever a document is removed/updated/inserted is also out of the question, because although we can use a yearly total for some parts of the app, in other parts the calculations require that each amount falls on a specific date.
I've also considered Redis, although I think the complexity of the data is beyond what Redis was designed for.
The Final Straw
On top of all of this, speed is important. So performance is up there in terms of priorities.
Questions:
What is the best way to store data which is frequently read/written and rapidly growing, with the knowledge that most queries will retrieve a very large result-set?
Is there another solution to the problem? I'm totally open to suggestions.
I'm a little stuck at the moment, I haven't been able to retrieve such a large result-set in an acceptable amount of time. It seems most datastores are great for small retrieval sizes - even on large amounts of data - but I haven't been able to find anything on retrieving large amounts of data from an even larger table/collection.
I only read the first two lines, but you are using aggregation (GROUP BY) and then expecting it to just work in real time?
I will say you are new to the internals of databases - not to undermine you, but to try and help you.
The group operator in both MySQL and MongoDB works in memory. In other words, it takes whatever data structure you provide, whether an index or a document (row), and goes through each row/document, taking the field and grouping it up.
This means that you can speed it up in both MySQL and MongoDB by making sure you are using an index for the grouping, but even that only goes so far, even with housing the index in your direct working set (memory) in MongoDB.
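To make "using an index for the grouping" concrete on the MySQL side - treat the table and column names here as placeholders:

-- A composite index that covers the whole query: the grouping and the SUM
-- can be answered from the index alone, without reading the full rows.
ALTER TABLE transactions ADD INDEX idx_user_date_amount (user_id, date, amount);

SELECT date, SUM(amount)
FROM transactions
WHERE user_id = ?
GROUP BY date;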
In fact, using LIMIT with an OFFSET as well is probably just slowing things down even further, frankly, since after writing out the set MySQL then needs to go over it again to get your answer.
Once done, it writes out the result: MySQL writes it out to a result set (memory and IO being used here), and MongoDB replies inline if you have not set $out, with the maximum size of the inline output being 16MB (the maximum size of a document).
The final point to take away here is: aggregation is horrible
There is no silver bullet that will save you here; some databases will boast about their speed etc., but the fact is most big aggregators use something called "pre-aggregated reports". You can find a quick introduction in the MongoDB documentation: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/
This means that you put the effort of aggregating and grouping onto some other process which can do it easily enough, allowing your reading thread, the one that needs to be real-time, to do its thing in real time.
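A minimal sketch of that idea in MySQL, with made-up names - a tiny summary table that the writing side keeps up to date, so the reading side never has to group the detail rows:

CREATE TABLE yearly_totals (
    user_id INT NOT NULL,
    yr      SMALLINT NOT NULL,
    total   DECIMAL(12,2) NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, yr)
);

-- Run alongside every insert into the detail table.
INSERT INTO yearly_totals (user_id, yr, total)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE total = total + VALUES(total);

-- The realtime page now reads a handful of rows instead of grouping thousands.
SELECT yr, total FROM yearly_totals WHERE user_id = ?;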

PHP array vs MySQL table

I have a program that creates logs, and these logs are used to calculate balances, trends, etc. for each individual client. Currently, I store everything in separate MySQL tables. I link all the logs to a specific client by joining the two tables. When I access a client, it pulls all the logs from the log_table and generates a report. The report varies depending on what filters are in place, mostly date and category specific.
My concern is the performance of my program as we accumulate more logs and clients. My intuition tells me to store the log information in the user_table in the form of a serialized array so only one query is used for the entire session. I can then take that log array and filter it using PHP, whereas before it was filtered in a MySQL query (using multiple methods, such as BETWEEN for dates and other comparisons).
My question is: do you think performance would be improved if I used serialized arrays to store the logs as opposed to using a MySQL table to store each individual log? We are estimating about 500-1000 logs per client, with around 50,000 clients (and growing).
It sounds like you don't understand what makes databases powerful. It's not about "storing data", it's about "storing data in a way that can be indexed, optimized, and filtered". You don't store serialized arrays, because the database can't do anything with that. All it sees is a single string without any structure that it can meaningfully work with. Using it that way voids the entire reason to even use a database.
Instead, figure out the schema for your array data, and then insert your data properly, with one field per dedicated table column so that you can actually use the database as a database, allowing it to optimize its storage, retrieval, and database algebra (selecting, joining and filtering).
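For example - guessing at your fields from the description, so adjust to taste - something along these lines:

CREATE TABLE logs (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    client_id  INT UNSIGNED  NOT NULL,
    category   VARCHAR(50)   NOT NULL,
    amount     DECIMAL(12,2) NOT NULL,
    logged_at  DATE          NOT NULL,
    INDEX idx_client_date (client_id, logged_at),
    INDEX idx_client_category (client_id, category)
);

-- A date- and category-filtered report for one client, done where it belongs.
SELECT category, SUM(amount) AS total
FROM logs
WHERE client_id = ?
  AND logged_at BETWEEN ? AND ?
GROUP BY category;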
Are serialized arrays in a DB faster than native PHP? No, of course not. You've forced the database to act as a flat file, with the extra DBMS overhead on top.
Is using the database properly faster than native PHP? Usually, yes, by a lot.
Plus, and this part is important, it means that your database can live "anywhere", including on a faster machine next to your webserver, so that your database can return results in 0.1s rather than PHP jacking 100% CPU to filter your data and preventing users of your website from getting page results because you blocked all the threads. In fact, for that very reason alone it makes absolutely no sense to keep this task in PHP - even if you're bad at implementing your schema and queries, forget to cache results and do subsequent searches inside those cached results, forget to index the tables on columns for extremely fast retrieval, and so on.
PHP is not for doing all the heavy lifting. It should ask other things for the data it needs, and act as the glue between "a request comes in", "response base data is obtained" and "response is sent back to the client". It should start up, make the calls, generate the result, and die as fast as it can again.
It really depends on how you need to use the data. You might want to look into storing it in Mongo if you don't need to search that data. If you do, leave it in individual rows and create your indexes in a way that makes lookups fast.
If you have 10 billion rows, and need to look up 100 of them to do a calculation, it should still be fast if you have your indexes done right.
Now if you have 10 billion rows and you want to do a sum on 10,000 of them, it would probably be more efficient to save that total somewhere. Whenever a new row is added, removed or updated that would affect that total, you can change that total as well. Consider a bank, where all items in the ledger are stored in a table, but the balance is stored on the user account and is not calculated based on all the transactions every time the user wants to check his balance.
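A rough sketch of that pattern (the table names are purely illustrative):

START TRANSACTION;

-- The detail row stays in the ledger...
INSERT INTO ledger (account_id, amount) VALUES (?, ?);

-- ...and the same transaction keeps the cached balance in step.
UPDATE accounts SET balance = balance + ? WHERE id = ?;

COMMIT;

-- Checking the balance is now a single-row read.
SELECT balance FROM accounts WHERE id = ?;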

JS, PHP, and MySQL to get large data

I am using Ajax to send a query to the PHP server, which then runs the SQL query to get the data. Because the query involves three tables (two large ones), joining the three tables is very slow.
So I split the SQL query into three queries. That improves efficiency (for small datasets). But for large datasets, because the PHP program runs the three queries one by one and processes the result after each, it hits the 30-second timeout (the default). I don't want to change this default setting.
To avoid the timeout, I am also considering running the three queries and returning the results to JS, letting the client side do the processing.
Is there another way to do this?
Added:
Basically, I want three outputs (title, extviews, allviews) for each item WHERE extviews > somevalue. title comes from one small table; extviews and allviews are aggregated from two different large tables. I have all the fields indexed, but joining the two big tables still takes a long time.
So I first aggregate one table to get extviews for each item, along with a list of item IDs. The results are organized as an array for JSON output to JS. Then, using the list of IDs, I get the title for each item and aggregate the other table to get allviews. Finally I update the array with the new results.
Unless your MySQL server is really overloaded, it's usually quicker to use joins. I assume you've already defined indexes on your tables (for the fields used in join conditions and WHERE clauses)?
Doing the processing on the client side might also be a problem, since you'll have to send a lot of data in order to do the join...
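If you do want it back down to a single round trip, one pattern worth trying - I'm guessing at your table names here - is to aggregate each big table in a derived table first, so the join only ever sees one row per item:

SELECT t.title, e.extviews, a.allviews
FROM titles t
JOIN (SELECT item_id, SUM(views) AS extviews
        FROM ext_views
       GROUP BY item_id
      HAVING SUM(views) > ?) e ON e.item_id = t.id
JOIN (SELECT item_id, SUM(views) AS allviews
        FROM all_views
       GROUP BY item_id) a ON a.item_id = t.id;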
Edit:
If all "easy" optimisation is done, then you have 2 choices... The one you just described (doing it on client size, if it's possible - what is the size (in bytes) of the json arrays you send to the client?)
Your other choice is to do the processing in the background (via cron) & cache somehow the results.
As already indicated by other people responding to your post, you should give us an idea of the structure of your three tables and the intent of each. Based upon that information, you may be able to get significant performance improvements by optimizing your database structure. To make it easier to understand, let's assume that someone had a website running off an intelligently designed database. I could easily make that application perform ten times worse solely by modifying the structure of the database.
Now, maybe there's some reason why you need to have three distinct tables, but I can't make that judgment without knowing what the fields in the database are, what you're aggregating, and what your web application is doing in the first place. Is it read heavy or write heavy? The solution may be as simple as denormalizing your database so that you don't need to use any joins.
I can say from a cursory glance at your description of what you're doing that this application can't possibly scale efficiently and that you really need to reconsider your design. The first warning sign for me is the fact that you stated that one of the joins is just to link the title to two other tables. To me, being forced to do a join just to get the title of an object seems indicative of over-normalization. Some data redundancy is not necessarily a bad thing, and in some situations it's absolutely mandatory. Also, you say that you have two large tables that you use aggregate functions on and then join everything together. I can tell you right now that you're going to run into some serious performance issues if every hit to your application involves a triple join and two aggregate functions (COUNT, I'm assuming).
Ultimately, we'll be able to give you a better response once you provide more information as to what you're trying to accomplish, and the general structure of the database you set up for it.

mysql and php: querying the db vs. reading in the whole thing

I'm struggling with a philosophical question on database programming in PHP. In particular, I'm trying to decide when it's best to read in an entire table into an object, vs. querying MySQL directly whenever I need data.
Is there ever a situation where you'd want to just read in the entire database into an object? Where do you draw the line?
For example, if I had a table full of names and phone numbers, and I need to get the phone number for one individual, that's a simple one-time mysql query. Reading in an entire table into an associative array just to get one phone number sounds ridiculous... But:
(1) what if I need to get the names and phone numbers of 50 individuals? 100? 1000?
(2) When is it more efficient (if ever) to read in the entire table into an object? Is performing 1000 mysql queries on 1000 names always going to be more efficient than reading in the entire table?
(2a) Obviously it would depend on the total number of records in the table. Would it be better to do 1000 queries for 1000 phone numbers, or read in a table of 2000 total records from a MySQL into an associative array? What if it was 5000 total records, and I needed 1000? What if it was 10k? Etc. etc.
(3) What if I need to do something a little more complex, like return all phone numbers in a certain area code? Obviously in that case I could use a regexp SQL query, but I'm sure I could come up with a more complex case where a simple query doesn't give me exactly what I want.
I guess what I'm getting at is, as a developer, you have several knobs you can turn to optimize your application. Obviously you want to think about the data you're using and optimize the database model to match the types of data requests you'll be doing. But sometimes you get into a mutually exclusive case where you're forced to pick optimizing your data model for one scenario, at the expense of another, competing scenario.
Any thoughts?
Databases are designed to be efficient at locating and returning exactly the data that you need to work with for a particular operation.
Transferring data over a network connection is orders of magnitude slower than processing it on the machine where it resides. Use databases for what they're good at... holding lots of information and allowing application code to query and work with exactly the subset of that data it needs to at a given point in time.
If you find that you need to frequently access the same data over and over, caching it at the application layer or in a dedicated caching solution like memcached does make sense, but I cannot imagine a scenario where it makes sense just to read in a whole table because my application logic needs to process a subset of the rows and/or columns in the table.
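For that "same data over and over" case, a minimal caching sketch with the PHP Memcached extension - assuming a PDO connection in $pdo and a people(id, phone) table, both of which are placeholders:

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$key   = 'phone:' . $personId;       // $personId comes from the request
$phone = $cache->get($key);

if ($phone === false) {              // cache miss: let the database find the one row it's good at finding
    $stmt = $pdo->prepare('SELECT phone FROM people WHERE id = ?');
    $stmt->execute([$personId]);
    $phone = $stmt->fetchColumn();

    $cache->set($key, $phone, 300);  // keep it for five minutes
}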
(3) but I'm sure I could come up with a more complex case where a simple query doesn't give me exactly what I want.
This is usually an indication that your database hasn't been properly normalized and/or has design flaws.
(2) When is it more efficient (if ever) to read in the entire table into an object? Is performing 1000 mysql queries on 1000 names always
Neither is a good choice. SQL is intended for set-based operations. You really need to use the system correctly for it to work well, but to do this you have to have properly designed your database. The best thing would be to write one query that returns exactly the records you want, no more and no less.
what if I need to get the names and phone numbers of 50 individuals
Maybe use something like SELECT * FROM people WHERE id IN (1, 2, 3, ..., 50). If you have a larger number of users, maybe create a temporary table with the list of users you want, and join on that. With a properly designed database there is usually a good way to retrieve a set of data with a single query.
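A sketch of the temporary-table variant (table names are placeholders):

CREATE TEMPORARY TABLE wanted_ids (id INT PRIMARY KEY);

INSERT INTO wanted_ids (id) VALUES (1), (2), (3) /* ... the rest of the ids ... */;

SELECT p.id, p.name, p.phone
FROM people p
JOIN wanted_ids w ON w.id = p.id;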
