I'm running Ubuntu 13.04 with an nginx webserver installed. I'm writing a mini-social network for the users on my website, but for some reason the scripts I use to load things like profiles and "walls" are sometimes slow. Not all of them are slow; the worst is the newsfeed script, which shows recent posts by friends.
I've added a bunch of microtime() checks throughout the script and it seems the query to get the recent posts is taking the most time. I tried to optimize it as much as possible but it still seems to be slow. I'm using MySQLi. Here is my query:
SELECT `id`,`posterName`, `posterUUID`, `message`, `postDate`, `likes`, `whoLiked`
FROM `wallposts`
WHERE `wallUUID` IN (' . implode(',', $friendStr) . ')
AND posterUUID = wallUUID
AND isDeleted=0
ORDER BY `postDate` DESC
Would it be faster to just use SELECT * since I'm pretty much selecting most of the columns anyway? I'm not sure what else to try, so that's why I came here.
Any help please as to what I could do/not do to keep it from taking 5+ seconds just for this query?
Several things:
Using * instead of a list of columns is usually a bad idea: the risk is that someone later adds a column you do not need, possibly one holding large amounts of binary data, and that makes your query slower. So it's certainly not something to try when you have speed problems.
You may have an operator-precedence problem between AND and OR.
Your query is:
And I'm pretty sure you mean:
But AND takes precedence, so what you have is:
OR (
When in doubt, use parentheses (I'm in doubt here, so I would use parentheses).
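A minimal sketch of the precedence issue, using Python's built-in sqlite3 as a stand-in for MySQL (both give AND higher precedence than OR). The `posts` table, its columns and the sample values are made up for the illustration and are not the asker's schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER, wall INTEGER, poster INTEGER, isDeleted INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (1, 42, 42, 0),  # on wall 42, not deleted
        (2, 42, 42, 1),  # on wall 42, deleted
        (3, 99, 7, 0),   # written by user 7, not deleted
    ],
)

# Without parentheses this parses as: wall = 42 OR (poster = 7 AND isDeleted = 0),
# so the deleted post 2 slips through.
unparenthesized = [row[0] for row in conn.execute(
    "SELECT id FROM posts WHERE wall = 42 OR poster = 7 AND isDeleted = 0 ORDER BY id"
)]

# With parentheses the isDeleted filter applies to both branches of the OR.
parenthesized = [row[0] for row in conn.execute(
    "SELECT id FROM posts WHERE (wall = 42 OR poster = 7) AND isDeleted = 0 ORDER BY id"
)]

print(unparenthesized)  # [1, 2, 3]
print(parenthesized)    # [1, 3]
```

The only difference between the two statements is the pair of parentheses, which is why writing them explicitly is the safer habit.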
Your first WHERE condition is quite strange:
wallUUID IN (42,43,44,45,46)
AND posterUUID = wallUUID
That means a filter on the friends' identifiers for the wall posts, I guess, and then a filter which says that each row must have the same id for the poster UUID and the wall UUID.
I'm pretty sure that's not what you wanted. Maybe you need a join query here. Or maybe not; without the structure of your tables it's hard to guess.
You will need pretty decent indexing to get an optimized result for friends' posts: an index which starts with the current user id and contains the right sort by date, the deletion flag, and certainly the friends' identifiers.
User-friend relationships are hard to manage, especially when volumes get bigger. Building a social website usually involves pub/sub systems (publish/subscribe channels). You should study some pub/sub database schemas.
I am working on a big eCommerce shopping website. I have around 40 databases. I want to create a search page which shows 18 results after searching by title in all databases.
(SELECT id_no,offers,image,title,mrp,store from db1.table1 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db3.table3 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
UNION ALL (SELECT id_no,offers,image,title,mrp,store from db2.table2 WHERE MATCH(title) AGAINST('$searchkey') AND title like '%$searchkey%')
Currently I am using the above query. It works fine for keywords of 4 or more characters, like laptop or nokia, but takes 10-15 seconds to process; for keywords of fewer than 3 characters it takes 30-40 seconds or I end up with a 500 Internal Server Error. Is there an optimized way of searching in multiple databases? I created two indexes: a primary key and a full-text index on title.
Currently my search page is in PHP. I am ready to code in Python or any other language if I get good speed.
You can use the Sphinx search engine: http://sphinxsearch.com/. It is a powerful search engine for databases. IMHO Sphinx is the best choice for search on your site.
FULLTEXT is not configured (by default) for searching for words less than three characters in length. You can configure that to handle shorter words by setting a ...min_token_size parameter. Read this. https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html You can only do this if you control the MySQL server. It won't be possible on shared hosting. Try this.
FULLTEXT is designed to produce more false-positive matches than false-negative matches. It's generally most useful for populating dropdown picklists like the ones under the location field of a browser. That is, it requires some human interaction to choose the correct record. To expect FULLTEXT to be able to do absolutely correct searches is probably a bad idea.
You simply cannot use AND column LIKE '%whatever%' if you want any reasonable performance at all. You must get rid of that. You might be able to rewrite your python program to do something different when the search term is one or two letters, and thereby avoid many, but not all, LIKE '%a%' and LIKE '%ab%' operations. If you go this route, create ordinary indexes on your title columns. Whatever you do, don't combine the FULLTEXT and LIKE searches in a single query.
If this were my project I'd consider using a special table with columns like this to hold all the short words from the title column in every row of each table.
id_pk INT autoincrement
id_no INT
word VARCHAR(3)
Then you can use a query like this to look up short words
SELECT a.id_no,offers,image,title,mrp,store
FROM db1.table1 a
JOIN db1.table1_shortwords s ON a.id_no = s.id_no
WHERE s.word = '$searchkey'
To do this, you will have to preprocess the title columns of your other tables to populate the shortwords tables, and put an index on the word column. This will be fast, but it will require a special-purpose program to do the preprocessing.
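A rough sketch of the preprocessing and lookup described above, again with sqlite3 standing in for MySQL. The sample titles, the helper names `populate_shortwords` and `search_short`, and the index name are assumptions for illustration; only the table layout follows the answer.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id_no INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE table1_shortwords (
        id_pk INTEGER PRIMARY KEY AUTOINCREMENT,
        id_no INTEGER,
        word  VARCHAR(3)
    );
    CREATE INDEX idx_shortwords_word ON table1_shortwords (word);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [
    (1, "LG 42 inch LED TV"),
    (2, "Nokia Lumia 520"),
    (3, "HP laptop bag"),
])

def populate_shortwords(conn, max_len=3):
    """Preprocessing step: store each distinct title word of at most max_len characters."""
    rows = conn.execute("SELECT id_no, title FROM table1").fetchall()
    for id_no, title in rows:
        words = {w.lower() for w in re.findall(r"\w+", title) if len(w) <= max_len}
        conn.executemany(
            "INSERT INTO table1_shortwords (id_no, word) VALUES (?, ?)",
            [(id_no, w) for w in sorted(words)],
        )

def search_short(conn, searchkey):
    """Indexed equality lookup on the short word instead of LIKE '%xx%'."""
    sql = """SELECT a.id_no, a.title
             FROM table1 a
             JOIN table1_shortwords s ON a.id_no = s.id_no
             WHERE s.word = ?
             ORDER BY a.id_no"""
    return conn.execute(sql, (searchkey.lower(),)).fetchall()

populate_shortwords(conn)
print(search_short(conn, "tv"))
print(search_short(conn, "hp"))
```

In a real system the preprocessing would also have to run whenever a title is inserted or changed, which is the "special-purpose program" cost the answer mentions.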
Having to search multiple tables with your UNION ALL operation is a performance problem. You will be able to improve performance dramatically by redesigning your schema so you need search only one table.
Having to search databases on different server machines is a performance problem. You may be able to rig up your python program to search them in parallel: that is, to somehow use separate tasks to search each one, then aggregate the results. Each of those separate search tasks requires its own connection to the data base, so this is not a cheap or simple solution.
If this system faces the public web, you will have to redesign it sooner or later, because it will never perform well enough as it is now. (Sorry to be the bearer of bad news.) Many system designers like to avoid redesigning systems after they become enormous. So, if I were you I would get the redesign done.
If your focus is on searching, then bend the schema to facilitate searching rather than the other way around.
Collect all the strings to search for in a single table. Whereas a UNION of 40 tables does work, it will be ~40 times as slow as having the strings collected together.
Use FULLTEXT when the words are long enough, use some other technique when they are not. (This addresses your 3-char problem; see also the Answer discussing innodb_ft_min_token_size. You are using InnoDB, correct?)
Use + and boolean mode to say that a word is mandatory: MATCH(col) AGAINST("+term" IN BOOLEAN MODE)
Do not add on a LIKE clause unless there is a good reason.
I'm working on a management system for a small library. I proposed that they replace the Excel spreadsheet they are using now with something more robust and professional like PhpMyBibli - https://en.wikipedia.org/wiki/PhpMyBibli - but they are scared by the number of fields to fill in, and the interfaces are not fully translated into Italian.
So I made a very trivial DB, with basically a table for the authors and a table for the books. The authors table exists because I'm tired of having to explain that "Gabriele D'Annunzio" != "Gabriele d'Annunzio" != "Dannunzio G." and so on.
My test tables are now populated with ~ 100k books and ~ 3k authors, both with plausible random text, to check the scripts under pressure.
For the public consultation I want to make an interface like that of Gallica, the website of the Bibliothèque nationale de France, which I find pretty useful. A sample can be seen here: http://gallica.bnf.fr/Search?ArianeWireIndex=index&p=1&lang=EN&f_typedoc=livre&q=Computer&x=0&y=0
The concept is pretty easy: for each menu, e.g. the author one, I generate a fancy <select> field with all the names retrieved from the DB, and this works smoothly.
The issue arises when I try to add beside every author name the number of books, as made by Gallica, in this way (warning - conceptual code, not actual PHP):
SELECT id, surname, name FROM authors
foreach row {
SELECT COUNT(*) as num FROM BOOKS WHERE id_auth=id
echo "<option>$surname, $name ($num)</option>";
}
With the code above one core of the CPU jumps to 100%, and no results are shown in the browser. Not surprising, since there are 3k queries on a 100k table in a very short time.
Just to try, I added a LIMIT 100 to the first query (on the authors table). The page then required 3 seconds to be generated, and 15 seconds when I raised the LIMIT to 500 (seems a linear increase). But of course I can't show library users a reduced list of authors.
I don't know which hardware/software is used by Gallica to achieve their results, but I bet their budget is far above that of a small village library using 2nd hand computers.
Do you think that adding a "number_of_books" field to the authors table, updated every time a new book is inserted, could be a practical solution, rather than scanning the whole list at every request?
BTW, a similar procedure must be done for the publication date, the language, the theme, and some other fields, so the query time will be hit again, even if the other tables are a lot smaller than the authors one.
Your query style is very inefficient - try using a join and group structure:
SELECT authors.id, authors.surname, authors.name,
COUNT(books.id) AS numbooks
FROM authors
INNER JOIN books ON books.id_auth=authors.id
GROUP BY authors.id
ORDER BY numbooks DESC
Just to clear up some points I did not state explicitly:
Of course you don't need a query in the PHP loop any longer, just the displaying portion
Indices on books.id_auth and authors.id (the latter primary or unique) are assumed
As @GordonLinoff pointed out, the IFNULL() is redundant in an inner join, so I removed it.
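To make the difference from the per-author loop concrete, here is a small sketch that builds the whole option list from one JOIN/GROUP BY query. sqlite3 is used as a stand-in for MySQL, and the sample authors, book titles and index name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, surname TEXT, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, id_auth INTEGER, title TEXT);
    CREATE INDEX idx_books_id_auth ON books (id_auth);
    INSERT INTO authors VALUES (1, 'D''Annunzio', 'Gabriele'), (2, 'Calvino', 'Italo');
    INSERT INTO books (id_auth, title) VALUES
        (1, 'Il piacere'), (2, 'Il barone rampante'), (2, 'Le cosmicomiche');
""")

# One round trip: the database counts the books per author.
rows = conn.execute("""
    SELECT authors.id, authors.surname, authors.name, COUNT(books.id) AS numbooks
    FROM authors
    INNER JOIN books ON books.id_auth = authors.id
    GROUP BY authors.id
    ORDER BY numbooks DESC
""").fetchall()

# The loop only formats the rows; it no longer issues any query.
options = [
    f"<option>{surname}, {name} ({num})</option>"
    for _author_id, surname, name, num in rows
]
print("\n".join(options))
```

With 3k authors this is still a single statement, instead of 3k separate COUNT(*) queries.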
To get all themes, even if there aren't any books in them, just use a left join (this time including the IFNULL(), if your provider's MySQL may be old):
SELECT themes.id, themes.main, themes.sub,
IFNULL(COUNT(books.theme),0) AS num
FROM themes
LEFT JOIN books ON books.theme=themes.id
GROUP BY themes.id
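A short comparison of the two join types, assuming a tiny hypothetical `themes`/`books` data set in sqlite3 (used here in place of MySQL). It shows that only the LEFT JOIN keeps themes that have no books, with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE themes (id INTEGER PRIMARY KEY, main TEXT, sub TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, theme INTEGER);
    INSERT INTO themes VALUES (1, 'Contemporary Poetry', 'Italian'),
                              (2, 'Contemporary Poetry', 'Etruscan');
    INSERT INTO books (theme) VALUES (1), (1);
""")

# Same query twice; only the join type changes.
query = """
    SELECT themes.main, themes.sub, COUNT(books.theme) AS num
    FROM themes
    {join} JOIN books ON books.theme = themes.id
    GROUP BY themes.id
    ORDER BY themes.main ASC, themes.sub ASC
"""
inner_rows = conn.execute(query.format(join="INNER")).fetchall()
left_rows = conn.execute(query.format(join="LEFT")).fetchall()

print(inner_rows)  # the theme without books is missing
print(left_rows)   # the theme without books appears with a count of 0
```

COUNT(books.theme) skips the NULLs produced by the LEFT JOIN, which is why the empty theme comes back as 0 rather than 1.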
Of course a stored value will give you the best performance, but this denormalization comes at a cost: your database now has the potential to become inconsistent in a user-visible way.
If you do go with this method, I strongly recommend you use triggers to auto-fill this field (and of course those triggers must sit on the books table).
Be prepared to see slower inserts. This might of course be okay, as I guess you will see a much higher rate of SELECTs than INSERTs.
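A sketch of what such triggers could look like, written for sqlite3 so it can be run as-is. MySQL's trigger syntax differs slightly (for example it requires FOR EACH ROW), and the trigger names and the `number_of_books` default are assumptions. An UPDATE that moves a book to another author would need a third trigger, which is omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (
        id INTEGER PRIMARY KEY,
        surname TEXT,
        number_of_books INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE books (id INTEGER PRIMARY KEY, id_auth INTEGER, title TEXT);

    -- Keep the stored count in step with inserts on the books table.
    CREATE TRIGGER books_after_insert AFTER INSERT ON books
    BEGIN
        UPDATE authors SET number_of_books = number_of_books + 1
        WHERE id = NEW.id_auth;
    END;

    -- And with deletes.
    CREATE TRIGGER books_after_delete AFTER DELETE ON books
    BEGIN
        UPDATE authors SET number_of_books = number_of_books - 1
        WHERE id = OLD.id_auth;
    END;
""")

conn.execute("INSERT INTO authors (id, surname) VALUES (1, 'Calvino')")
conn.execute("INSERT INTO books (id_auth, title) VALUES (1, 'Il barone rampante')")
conn.execute("INSERT INTO books (id_auth, title) VALUES (1, 'Le cosmicomiche')")
conn.execute("DELETE FROM books WHERE title = 'Le cosmicomiche'")

count = conn.execute("SELECT number_of_books FROM authors WHERE id = 1").fetchone()[0]
print(count)  # 1
```

Because the application code never touches `number_of_books` directly, the count cannot drift as long as every change to `books` goes through the database.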
After reading a lot about how the JOIN statement works, with the help of useful answer 1 and useful answer 2, I discovered I used it some 15 or 20 years ago, then forgot about it since I never needed it again.
I made a test using the options I had:
reply with the JOIN query with IFNULL(): 0,5 seconds
reply with the JOIN query without IFNULL(): 0,5 seconds
reply using a stored value: 0,4 seconds
That DB will run on some old single-core iron, so I think a 20% difference could be significant, and I decided to use stored values, updating the count every time a new book is inserted (i.e. not often).
Anyway thanks a lot for having refreshed my memory: JOIN queries will be useful somewhere else in my DB.
I used the JOIN method above to query the book themes, which are stored into a far smaller table, in this way:
SELECT themes.id, themes.main, themes.sub, COUNT(books.theme) as num FROM themes JOIN books ON books.theme = themes.id GROUP BY themes.id ORDER by themes.main ASC, themes.sub ASC
It works fine, but for themes which are not in the books table I obviously don't get a 0 response, so I don't have lines like Contemporary Poetry - Etruscan (0) to show as disabled options for the sake of list completeness.
Is there a way to have back my theme.main and theme.sub?
I have these two MySQL tables: TableA and TableB
TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2
In PHP, I wanted to have this multidimensional array format
$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1' => value,
        'ColumnA2' => value,
        'TableB' => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1' => value,
                'ColumnB2' => value
            )
        )
    )
);
so that I can loop it in this way
foreach($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1' . $TableA['ColumnA1'];
    echo 'ColumnA2' . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...']...
        echo $TableB['...']...
    }
}
My question is: what is the best or proper way of querying the MySQL database so that I can achieve this goal?
Solution1 --- The one I'm using
$array = array();
$rs = mysqli_query($con, "SELECT * FROM TableA");
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query($con, "SELECT * FROM TableB WHERE ColumnAId=" . $row['ColumnAId']);
    // $array2 = all rows of $rs2 as an array
    $row['TableB'] = $array2;
    $array[] = $row;
}
I have doubts about my code because it queries the database on every iteration.
Solution2
$rs = mysqli_query($con, "SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
    // Code
}
The second solution only queries once, but if I have thousands of rows in TableA and thousands of rows in TableB for each ColumnAId (1 TableA.ColumnAId = 1000 TableB.ColumnAId), will solution 2 take much more time than solution 1?
Neither of the two solutions proposed are probably optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!
One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:
How many rows there are.
How many tables you are querying.
The size of each row.
Because of this, there's unlikely to be a silver bullet solution for your problem. Your database is not the same as my database, you will need to benchmark different optimizations if you need the best performance available.
You will probably find that applying & building correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
There are some golden rules with queries which should rarely be broken:
Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
Avoid SELECT * unless needed. Selecting more columns will significantly increase overhead of your SQL operations.
Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.
Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
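As a rough sketch of how the single-query approach can still produce the nested structure the question asks for, the flat JOIN rows can be grouped in application code. sqlite3 stands in for MySQL, the sample rows are invented, and a LEFT JOIN is used (an assumption, the question uses a plain JOIN) so that TableA rows without children still appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE TableA (ColumnAId INTEGER PRIMARY KEY, ColumnA1 TEXT);
    CREATE TABLE TableB (ColumnBId INTEGER PRIMARY KEY, ColumnAId INTEGER, ColumnB1 TEXT);
    INSERT INTO TableA VALUES (1, 'a-one'), (2, 'a-two');
    INSERT INTO TableB VALUES (10, 1, 'b-ten'), (11, 1, 'b-eleven'), (12, 2, 'b-twelve');
""")

# One query, only the columns we need, with explicit aliases.
rows = conn.execute("""
    SELECT a.ColumnAId AS ColumnAId, a.ColumnA1 AS ColumnA1,
           b.ColumnBId AS ColumnBId, b.ColumnB1 AS ColumnB1
    FROM TableA a
    LEFT JOIN TableB b ON b.ColumnAId = a.ColumnAId
    ORDER BY a.ColumnAId, b.ColumnBId
""")

# Group the flat rows: one entry per TableA row, each with a list of TableB rows.
nested = {}
for row in rows:
    parent = nested.setdefault(row["ColumnAId"], {
        "ColumnAId": row["ColumnAId"],
        "ColumnA1": row["ColumnA1"],
        "TableB": [],
    })
    if row["ColumnBId"] is not None:
        parent["TableB"].append({
            "ColumnBId": row["ColumnBId"],
            "ColumnB1": row["ColumnB1"],
        })

result = list(nested.values())
print(result)
```

The same grouping loop translates directly to PHP arrays; the point is that the nesting happens in memory after a single round trip.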
However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:
PREDICTABILITY: why read-locks are a good thing
One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.
I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.
First of all, when we run a SELECT (read-only) query on a MyISAM table, what happens is that the table is read-locked. This prevents any WRITE operations from happening on the table until the read-lock is surrendered (either our SELECT query completes or fails). Other SELECT queries are not affected, so if you're running a multi-threaded application, they will continue to work. InnoDB (the default engine in current MySQL versions) does not lock the table for a plain SELECT; instead the statement reads from a consistent snapshot, which gives a single query the same guarantee: it sees the data as it was at one point in time.
This delay is a GOOD thing. Why, you may ask? Relational data integrity.
Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users on a game, so we do this join:
SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;
What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.
But, what happens if we're running it in a loop?
Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:
The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
The item could be shown in both inventories (scan user A -> item traded -> scan user B).
The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).
What this means is that we would be unable to predict the results of the query or to ensure relational integrity.
If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.
Locking is good because it increases predictability and protects the integrity of results.
Note: You could force a loop to lock with a transaction, but it will still be slower.
You should never have a statement that looks like this:
mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);
mysqli supports prepared statements. Read about them and use them; they will help you avoid something terrible happening to your database.
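The idea can be sketched in a few lines. This uses Python's sqlite3 placeholders as a stand-in; in PHP the equivalent would be mysqli's prepare() and bind_param(). The table contents and the `rows_for` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableB (ColumnBId INTEGER PRIMARY KEY, ColumnAId INTEGER, ColumnB1 TEXT);
    INSERT INTO TableB VALUES (1, 1, 'x'), (2, 1, 'y'), (3, 2, 'z');
""")

def rows_for(conn, column_a_id):
    # The value travels separately from the SQL text, so it is never parsed as SQL.
    return conn.execute(
        "SELECT ColumnBId FROM TableB WHERE ColumnAId = ? ORDER BY ColumnBId",
        (column_a_id,),
    ).fetchall()

safe = rows_for(conn, 1)
# A hostile input is compared as one plain value; in SQLite it matches no row.
hostile = rows_for(conn, "1 OR 1=1")

# For contrast: with string concatenation the same input rewrites the WHERE clause.
concatenated = conn.execute(
    "SELECT ColumnBId FROM TableB WHERE ColumnAId = " + "1 OR 1=1" + " ORDER BY ColumnBId"
).fetchall()

print(safe)          # [(1,), (2,)]
print(hostile)       # []
print(concatenated)  # [(1,), (2,), (3,)]
```

How a bound string is compared with an integer column depends on the database's type rules, but in every case it stays a value and cannot change the structure of the statement.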
Definitely the second way. A nested query is an ugly thing since you pay all the query overheads (execution, network, etc.) again for every nested query, while a single JOIN query is executed once, i.e. all the overheads are paid only once.
The simple rule is not to run queries in loops, in general. There can be exceptions: if a single query would be too complex, it may be worth splitting it for performance, but in any particular case that can be shown only by benchmarks and measurements.
If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask a relational database for computed data, which means that if you just use it to store data, and do the computations in your code, you're doing it wrong anyway.
In such a case you should prefer using a different storage/retrieving possibility like a key-value store (there are persistent ones out there, and newer versions of MySQL exposes a key-value interface as well for InnoDB, but it's still using a relational database for key-value storage, aka the wrong tool for the job).
If you STILL want to use your solution:
I've often found that issuing multiple queries can be faster than a single query, because MySQL has to parse less query, the optimizer has less work to do, and more often than not the MySQL optimizer just fails (that's the reason things like STRAIGHT JOIN and index hints exist). And even if it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity - this only applies with mixing in update queries though - neither MyISAM nor InnoDB lock the whole table for SELECT queries by default). Then again, you might even get different results with the two solutions if you don't use transactions, as data might change between queries if you use multiple queries versus a single one.
In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.
Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) you have many concurrently running threads (requests) accessing and updating your tables (lock granularity is better with split-up queries, as is cacheability of the queries); if your database is on a different network, i.e. network latency is an issue, you're probably better off with the second solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).
Situation may also change depending on how often the for loop is actually executed.
Again: Benchmark
Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of data you use to join, it could be worse in either of the two. E.g. if you use strings to join (you really shouldn't, but you don't say), performance in the more PHP-dependent solution also depends on the PHP hash map algorithm (arrays in PHP are effectively hash maps) and the likelihood of a collision, while MySQL string indexes are normally fixed length, and thus, depending on your data, might not be applicable.
For memory efficiency, the multi query version is certainly better, as you have the php array anyway (which is very inefficient in terms of memory!) in both solutions, but the join might use a temp table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN for your queries)
In some cases, you should use LIMIT for best performance.
If you want to show 1000 rows,
plus some master data that each row points to,
you should fetch the 1000 rows in batches, with a LIMIT between 10 and 100.
Then collect the foreign keys and fetch the master data with a single query using WHERE ... IN. Count the unique keys and use that count as the LIMIT on the master data query.
SELECT productID, date FROM transaction_product LIMIT 100
Collect all productID values and make them unique
SELECT price FROM master_product WHERE productID IN (1, 2, 3, 4) LIMIT 4 (the count of unique IDs)
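A possible reading of the steps above, sketched with sqlite3 in place of MySQL. The sample data, the `fetch_batch` helper and the ORDER BY on the transaction id are assumptions added to make the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_product (productID INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE transaction_product (id INTEGER PRIMARY KEY, productID INTEGER, date TEXT);
    INSERT INTO master_product VALUES (1, 9.5), (2, 20.0), (3, 5.0);
    INSERT INTO transaction_product (productID, date) VALUES
        (1, '2014-01-01'), (2, '2014-01-02'), (1, '2014-01-03'), (3, '2014-01-04');
""")

def fetch_batch(conn, batch_size=100, offset=0):
    # Step 1: one limited batch of detail rows.
    transactions = conn.execute(
        "SELECT productID, date FROM transaction_product ORDER BY id LIMIT ? OFFSET ?",
        (batch_size, offset),
    ).fetchall()

    # Step 2: the unique foreign keys found in that batch.
    unique_ids = sorted({product_id for product_id, _ in transactions})
    if not unique_ids:
        return []

    # Step 3: a single WHERE ... IN query for the master data,
    # limited to the number of unique keys.
    placeholders = ", ".join("?" for _ in unique_ids)
    prices = dict(conn.execute(
        f"SELECT productID, price FROM master_product "
        f"WHERE productID IN ({placeholders}) LIMIT ?",
        (*unique_ids, len(unique_ids)),
    ).fetchall())

    return [(product_id, date, prices.get(product_id)) for product_id, date in transactions]

batch = fetch_batch(conn, batch_size=3)
print(batch)
```

Each batch costs two queries regardless of how many rows it contains, instead of one master-data query per row.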
I'm making a micro-blogging website. The users can follow each other. I have to make a stream of posts (activity stream) for the current user ( $userid ) based on the users the current user is following, like on Twitter. I know two ways of implementing this. Which one is better?
Table: posts
Columns: PostID, AuthorID, TimeStamp, Content
Table: follow
Columns: poster, follower
The first way, by joining these two tables:
select `posts`.* from `posts`,`follow` where `follow`.`follower`='$userid' and
`posts`.`AuthorID`=`follow`.`poster` order by `posts`.`postid` desc
The second way is to make an array of the users that $userid is following (posters), then do a PHP implode on this array, and then use WHERE IN:
One thing I'd like to mention here is that I'm storing the number of users a user is following in the `following` field of the `users` table, so here I'll use this number as a limit when extracting the list of posters - the 'followingList':
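function followingList($userid){
    // Read the stored number of followed users and use it as the limit
    $result = mysql_query("select `following` from `users` where `userid`='$userid' limit 1");
    $row = mysql_fetch_row($result);
    $limit = (int) $row[0];
    $result = mysql_query("select `poster` from `follow` where `follower`='$userid' limit $limit");
    $listArray = array();
    while($data = mysql_fetch_row($result)){
        $listArray[] = $data[0];
    }
    // Join the IDs so they fit inside the quoted IN ('...') list used below
    $posters = implode("','", $listArray);
    return $posters;
}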
Now I have a comma-separated list of user IDs the current $userid is following. And now selecting the posts to make the activity stream:
$sql = "select * from `posts` where (`AuthorID` in ('$posters'))
order by `postid` desc";
Which of the two methods is better?
And can knowing the total number of following (number of users the current user is following), make things faster in the first method as it's doing in the second method?
Any other better method?
You should go all the way with the first option. Always try as much as possible to process the data on the mysql server instead of in your PHP code. PHP will not implicitly cache the results of the operations while MySQL will do it.
The most important thing is to make sure you index your data correctly. Try using "EXPLAIN" statements to make sure you have optimized your database as much as possible and use #1 to link your data together.
This will allow you later to compute statistics also, while the second method requires you to process a part of the statistics.
The first important point is that PHP is good at building pages but very bad at managing data: everything manipulated by PHP fills memory, and nothing special can be done in PHP to prevent it from using too much memory, except crashing.
On the other side, the database's job is to analyse the relations between the tables and the real numbers involved in the query (cardinality of indexes and statistics on rows and index usage, in fact), and the engine can choose among a lot of different mechanisms depending on the size of the data (merge joins, temporary tables, etc). That means you could have 256.278.242 posts and 145.268 users with 5.684 followers on average, and the database's job would be to find the fastest way to give you an answer. Well, when you hit really big numbers you'll see that not all databases are equal, but that's another problem.
On the PHP side, retrieving the list of users from the first query could become very long (with a big number of followed users, let's say 15.000). Simply building the query string with 15.000 identifiers inside would take quite a big amount of memory. Transferring this new query to the SQL server would also be slow. It's definitely the wrong way.
Now be careful of the way you build your SQL request. A request is something you should be able to read from the top to the end, explaining what you really want. This will help the SQL (good) engine in choosing the right solution.
select `posts`.*
from `posts`
INNER JOIN `follow` ON `posts`.`AuthorID`=`follow`.`poster`
where `follow`.`follower`='#userid'
order by `posts`.`postid` desc
LIMIT 15
Several remarks:
I have used an INNER JOIN. I want an INNER JOIN, so let's write it; it will be easier for me to read later and it should be the same for the query analyser.
If #userid is an int, do not use quotes. Please use ints for identifiers (they are much faster than strings). And on the PHP side cast to int, "SELECT ..." . (int) $user_id ." ORDER ..., or use a query with parameters (this is for security).
I have used a LIMIT 15; maybe an offset could be used as well, if you want to show some pagination control around the posts. Let's say this query would retrieve 15.263 documents from my 5.642 followed users: you do not want, and the user does not want, to show these 15.263 documents on a web page. And knowing with $limit that the number is 15.263 is a good thing, but certainly not for a request limit. You know this number, but the database may know it as well if it has a good query analyser and some good internal statistics.
The request limit has several goals:
1. Limit the size of data transferred from the database to your PHP script
2. Limit the memory usage of your PHP script (an array with 15.263 documents containing some HTML stuff... ouch)
3. Limit the size of the final user output (and get a faster response)
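A sketch of the join with a parameterized LIMIT/OFFSET, using sqlite3 as a stand-in for MySQL. The sample data, the index names and the `stream_page` helper are assumptions; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (PostID INTEGER PRIMARY KEY, AuthorID INTEGER, Content TEXT);
    CREATE TABLE follow (poster INTEGER, follower INTEGER);
    CREATE INDEX idx_follow_follower ON follow (follower, poster);
    CREATE INDEX idx_posts_author ON posts (AuthorID, PostID);
""")
# User 1 follows users 2 and 3; user 9 follows user 4.
conn.executemany("INSERT INTO follow VALUES (?, ?)", [(2, 1), (3, 1), (4, 9)])
# 40 posts spread over authors 2, 3 and 4.
conn.executemany(
    "INSERT INTO posts (AuthorID, Content) VALUES (?, ?)",
    [(2 + i % 3, f"post {i}") for i in range(1, 41)],
)

def stream_page(conn, user_id, page, per_page=15):
    """Return one page of the activity stream; only per_page rows leave the database."""
    offset = (page - 1) * per_page
    return conn.execute(
        """SELECT posts.PostID, posts.AuthorID, posts.Content
           FROM posts
           INNER JOIN follow ON posts.AuthorID = follow.poster
           WHERE follow.follower = ?
           ORDER BY posts.PostID DESC
           LIMIT ? OFFSET ?""",
        (int(user_id), per_page, offset),
    ).fetchall()

first_page = stream_page(conn, 1, page=1)
second_page = stream_page(conn, 1, page=2)
print(len(first_page), len(second_page))
```

The application never holds more than one page of posts in memory, whatever the total number of posts by followed users.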
I am building a fairly large statistics system, which needs to allow users to request statistics for a given set of filters (e.g. a date range).
For example, this is a simple query that returns 10 results, including the player_id and the number of kills each player has made:
SELECT player_id, SUM(kills) as kills
FROM `player_cache`
GROUP BY player_id
LIMIT 10 OFFSET 30
The above query will offset the results by 30 (i.e. the 4th 'page' of results, at 10 per page). When the user then selects the 'next' page, it will then use OFFSET 40 instead of 30.
My problem is that nothing is cached, even though the LIMIT/OFFSET pair are being used on the same dataset, it is performing the SUM() all over again, just to offset the results by 10 more.
The above example is a simplified version of a much bigger query which just returns more fields, and takes a very long time (20+ seconds, and will only get longer as the system grows).
So I am essentially looking for a solution to speed up the page load, by caching the state before the LIMIT/OFFSET is applied.
You can of course use caching, but I would recommend caching the result, not the query in MySQL.
But first things first, make sure that a) you have the proper indexing on your data, b) that it's being used.
If this does not work, as group by tends to be slow with large datasets, you need to put the summary data in a static table/file/database.
There are several techniques/libraries etc that help you perform server side caching of your data. PHP Caching to Speed up Dynamically Generated Sites offers a pretty simple but self explanatory example of this.
Have you considered periodically running your long query and storing all the results in a summary table? The summary table can be quickly queried because there are no JOINs and no GROUPings. The downside is that the summary table is not up-to-the-minute current.
I realize this doesn't address the LIMIT/OFFSET issue, but it does fix the issue of running a difficult query multiple times.
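One way the summary-table idea could look, sketched with sqlite3 in place of MySQL. The `player_summary` table, the two helper functions and the generated sample data are assumptions for illustration. Paging then reads only precomputed rows, so the SUM() is not repeated for each page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player_cache (player_id INTEGER, kills INTEGER);
    CREATE TABLE player_summary (player_id INTEGER PRIMARY KEY, kills INTEGER);
""")
# Sample data: player N has N rows of one kill each, so SUM(kills) = N.
conn.executemany(
    "INSERT INTO player_cache VALUES (?, ?)",
    [(player, 1) for player in range(1, 51) for _ in range(player)],
)

def refresh_summary(conn):
    """Run the expensive GROUP BY once, e.g. from a scheduled job."""
    with conn:  # a single transaction around the rebuild
        conn.execute("DELETE FROM player_summary")
        conn.execute("""
            INSERT INTO player_summary (player_id, kills)
            SELECT player_id, SUM(kills) FROM player_cache GROUP BY player_id
        """)

def summary_page(conn, limit=10, offset=0):
    """Paging only touches the precomputed rows: no JOIN, no GROUP BY."""
    return conn.execute(
        "SELECT player_id, kills FROM player_summary "
        "ORDER BY kills DESC, player_id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()

refresh_summary(conn)
page = summary_page(conn, limit=10, offset=30)
print(page)
```

The trade-off is the one noted above: the pages are only as fresh as the last call to `refresh_summary`.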
Depending on how often the data is updated, data-warehousing is a straightforward solution to this. Basically you:
Build a second database (the data warehouse) with a similar table structure
Optimise the data warehouse database for getting your data out in the shape you want it
Periodically (e.g. overnight each day) copy the data from your live database to the data warehouse
Make the page get its data from the data warehouse.
There are different optimisation techniques you can use, but it's worth looking into:
Removing fields which you don't need to report on
Adding extra indexes to existing tables
Adding new tables/views which summarise the data in the shape you need it.