I have a problem. I am working on highscores, and for those highscores you need to build a ranking based on skill experience and latest update time (to see who reached the highest score first in case the skill experience is the same).
The problem is that the query I wrote takes 28 (skills) x 0.7 seconds to build a personal highscore page showing a player's rank on the list. Requesting this in the browser is just not doable; the page takes far too long to load and I need a solution for my issue.
MySQL version: 5.5.47
The query I wrote:
SELECT rank FROM
(
SELECT hs.playerID, (@rowID := @rowID + 1) AS rank
FROM
(
SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
) highscore,
(SELECT @rowID := 0) r
) data
WHERE data.playerID = ?
As you can see, I first have to create a whole resultset that gives me the full ranking for that game mode and skill, and only then can I select the rank based on the playerID. The problem is that I cannot let the query stop as soon as it finds the result, because MySQL doesn't offer such a function; if I specified WHERE data.playerID = ? in the inner query, it would return a single row, so the rank would always be 1.
The highscores table has 550k rows
What I have tried: storing the resultset for each skillID/game mode combination JSON-encoded in a temp table, and storing it in files, but that ended up being quite slow as well, because the files are really huge and take time to process.
Highscores table:
CREATE TABLE `highscores` (
`playerID` INT(11) NOT NULL,
`skillID` INT(10) NOT NULL,
`skillLevel` INT(10) NOT NULL,
`skillExperience` INT(10) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
PRIMARY KEY (`playerID`, `skillID`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
The overall table has 351k rows
Overall table:
CREATE TABLE `overall` (
`playerID` INT(11) NOT NULL,
`playerName` VARCHAR(50) NOT NULL,
`totalLevel` INT(10) NOT NULL,
`totalExperience` BIGINT(20) NOT NULL,
`updateTime` BIGINT(20) NOT NULL,
`game_mode` ENUM('REGULAR','IRON_MAN','IRON_MAN_HARDCORE') NOT NULL DEFAULT 'REGULAR',
PRIMARY KEY (`playerID`, `playerName`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
EXPLAIN result from the query (screenshot not included):
Does anybody have a solution for me?
No useful index for WHERE
The last 2 lines of the EXPLAIN (#3 DERIVED):
WHERE hs.skillID = ?
AND o.game_mode = ?
Since neither table has a suitable index to use for the WHERE clause, the optimizer decided to do a table scan of one of them (overall), then reach into the other (highscores). Having one of these indexes would help, at least some:
highscores: INDEX(skillID)
overall: INDEX(game_mode, ...) -- note that an index only on a low-cardinality ENUM is rarely useful.
(More in a minute.)
No useful index for ORDER BY
The optimizer sometimes decides to use an index for the ORDER BY instead of for the WHERE. But
ORDER BY hs.skillExperience DESC,
hs.updateTime ASC
cannot use an index, even though both are in the same table. This is because DESC and ASC are different. Changing ASC to DESC would have an impact on the resultset, but would allow
INDEX(skillExperience, updateTime)
to be used. Still, this may not be optimal. (More in a minute.)
Covering index
Another form of optimization is to build a "covering index". That is an index that has all the columns that the SELECT needs. Then the query can be performed entirely in the index, without reaching over to the data. The SELECT in question is the innermost:
( SELECT hs.playerID
FROM highscores AS hs
INNER JOIN overall AS o ON hs.playerID = o.playerID
WHERE hs.skillID = ?
AND o.game_mode = ?
ORDER BY hs.skillExperience DESC, hs.updateTime ASC
) highscore,
For hs: INDEX(skillID, skillExperience, updateTime, playerID) is "covering" and has the most important item (skillID, from the WHERE) first.
For o: INDEX(game_mode, playerID) is "covering". Again, game_mode must be first.
If you change the ORDER BY to be DESC and DESC, then add another index for hs: INDEX(skillExperience, updateTime, skillID, playerID). Now the first 2 columns must be in that order.
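In concrete terms, the index suggestions above amount to something like this (the index names are my own; add them and keep whichever ones the optimizer actually uses):
ALTER TABLE highscores
  ADD INDEX idx_skill_exp_time_player (skillID, skillExperience, updateTime, playerID);
ALTER TABLE overall
  ADD INDEX idx_mode_player (game_mode, playerID);
-- Only worth adding if you change the ORDER BY to skillExperience DESC, updateTime DESC:
ALTER TABLE highscores
  ADD INDEX idx_exp_time_skill_player (skillExperience, updateTime, skillID, playerID);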
Conclusion
It is not obvious which of those indexes the optimizer would prefer. I suggest you add both and let it choose.
I believe that (1) the innermost query is consuming the bulk of time, and (2) there is nothing to optimize in the outer SELECTs. So, I leave that as my recommendation.
Much of this is covered in my Indexing Cookbook.
Important sub-answer: how frequently does the rank of all players actually change? Do you really need realtime statistics? You probably don't. Pick a time interval for updating the statistics, e.g. 10 minutes. In that case you can run a cron job that inserts fresh rank statistics into a separate table, like this:
/* lock */
TRUNCATE TABLE rank_stat; /* maybe update as unused/old for history) instead truncate */
INSERT INTO rank_stat (a, b, c, d) <your query here>;
/* unlock */
and users (browsers) will select read-only statistics from this table (which can be split into pages).
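A minimal sketch of what that could look like (the rank_stat table and its columns are my own naming; the inner SELECT is just the ranking query from the question, shown for one hard-coded skill/game mode combination):
CREATE TABLE rank_stat (
  game_mode ENUM('REGULAR','IRON_MAN','IRON_MAN_HARDCORE') NOT NULL,
  skillID INT NOT NULL,
  playerID INT NOT NULL,
  rank INT NOT NULL,
  PRIMARY KEY (game_mode, skillID, playerID)
) ENGINE=MyISAM;
-- Cron job: refresh one skillID / game_mode combination (repeat for each of the 28 skills x 3 modes).
SET @rowID := 0;
INSERT INTO rank_stat (game_mode, skillID, playerID, rank)
SELECT 'REGULAR', 1, ordered.playerID, (@rowID := @rowID + 1)
FROM (
  SELECT hs.playerID
  FROM highscores AS hs
  INNER JOIN overall AS o ON o.playerID = hs.playerID
  WHERE hs.skillID = 1
    AND o.game_mode = 'REGULAR'
  ORDER BY hs.skillExperience DESC, hs.updateTime ASC
) AS ordered;
A player's page then only needs a primary-key lookup: SELECT rank FROM rank_stat WHERE game_mode = ? AND skillID = ? AND playerID = ?.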
And if the rank statistics change infrequently, you can instead recalculate them only for the relevant game events and/or player actions/achievements.
These are recommendations only, because you have not explained the full environment, but I think you can find the right solution with them.
It doesn't look like you really need to rank everyone; you just want to find out how many people are ahead of the current player. You should be able to get a simple count of how many players have better scores and dates than the current player; that count plus one is the current player's rank.
SELECT COUNT(*) + 1 AS rank
FROM highscores
JOIN overall
  ON overall.playerID = highscores.playerID
JOIN highscores playerscore
  ON playerscore.skillID = highscores.skillID
WHERE highscores.skillID = ?
  AND overall.game_mode = ?
  AND playerscore.playerID = ?
  AND (highscores.skillExperience > playerscore.skillExperience
       OR (highscores.skillExperience = playerscore.skillExperience
           AND highscores.updateTime < playerscore.updateTime));
(I joined the table to itself and aliased the second instance as playerscore so it was slightly less confusing)
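For that self-join to stay fast on a 550k-row table, you would still want an index along the lines of the covering index recommended in the earlier answer (this note and the index name are my own, not part of this answer):
ALTER TABLE highscores
  ADD INDEX idx_skill_exp_time_player (skillID, skillExperience, updateTime, playerID);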
You could probably even simplify it to one query by grouping and parsing the results within your language of choice.
SELECT
  o.game_mode AS game_mode,
  highscores.skillID AS skillID,
  COUNT(*) + 1 AS rank
FROM highscores
JOIN overall o
  ON o.playerID = highscores.playerID
JOIN highscores playerscore
  ON playerscore.skillID = highscores.skillID
JOIN overall po
  ON po.playerID = playerscore.playerID
 AND po.game_mode = o.game_mode
WHERE playerscore.playerID = ?
  AND (highscores.skillExperience > playerscore.skillExperience
       OR (highscores.skillExperience = playerscore.skillExperience
           AND highscores.updateTime < playerscore.updateTime))
GROUP BY o.game_mode, highscores.skillID;
Not quite sure about the grouping bit though.
Related
I have an orders grid holding 1 million records. The page has pagination, sort and search options. If the sort order is set to customer name with a search key and the page number is 1, it works fine.
SELECT * FROM orders WHERE customer_name like '%Henry%' ORDER BY
customer_name desc limit 10 offset 0
It becomes a problem when the User clicks on the last page.
SELECT * FROM orders WHERE customer_name like '%Henry%' ORDER BY
customer_name desc limit 10 offset 100000
The above query takes forever to run. The index covers the order id, customer name and date of order columns.
I can use this solution https://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/ if I don't have a non-primary key sort option, but in my case sorting is user selected. It will change from Order id, customer name, date of order etc.
Any help would be appreciated. Thanks.
Problem 1:
LIKE "%..." -- The leading wildcard requires a full scan of the data, or at least until it finds the 100000+10 rows. Even
... WHERE ... LIKE '%qzx%' ... LIMIT 10
is problematic, since there are probably not 10 such names. So, a full scan of your million names.
... WHERE name LIKE 'James%' ...
will at least start in the middle of the table-- if there is an index starting with name. But still, the LIMIT and OFFSET might conspire to require reading the rest of the table.
Problem 2: (before you edited your Question!)
If you leave out the WHERE, do you really expect the user to page through a million names looking for something?
This is a UI problem.
If you have a million rows, and the output is ordered by Customer_name, that makes it easy to see the Aarons and the Zywickis, but not anyone else. How would you get to me (James)? Either you have 100K links and I am somewhere near the middle, or the poor user would have to press [Next] 'forever'.
My point is that the database is not the place to fix this; the UI needs rethinking.
In some other situations, it is meaningful to go to the [Next] (or [Prev]) page. In these situations, "remember where you left off", then use that to efficiently reach into the table. OFFSET is not efficient. More on Pagination
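A rough sketch of "remember where you left off" (keyset pagination), assuming the grid is ordered by customer_name DESC with id as a tie-breaker, and that :last_name / :last_id hold the values from the final row of the page the user just viewed (both placeholders are mine):
SELECT *
FROM orders
WHERE customer_name LIKE '%Henry%'
  AND (customer_name, id) < (:last_name, :last_id)   -- continue strictly after the last row already shown
ORDER BY customer_name DESC, id DESC
LIMIT 10;
This only supports Next/Previous style navigation rather than jumping to an arbitrary page number, which is exactly the UI trade-off described above.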
I use a special concept for this. First I have a table called pager. It contains a primary key pager_id and some values to identify a user (user_id, session_id), so that the pager data can't be stolen.
Then I have a second table called pager_filter. It consists of 3 ids:
pager_id int unsigned not NULL # id of table pager
order_id int unsigned not NULL # store the order here
reference_id int unsigned not NULL # reference into the data table
primary key(pager_id,order_id);
As the first operation I select all records matching the filter rules from the data table and insert them into pager_filter:
DELETE FROM pager_filter WHERE pager_id = $PAGER_ID;
INSERT INTO pager_filter (pager_id,order_id,reference_id)
SELECT $PAGER_ID pager_id, ROW_NUMBER() order_id, data_id reference_id
FROM data_table
WHERE $CONDITIONS
ORDER BY $ORDERING
After filling the filter table you can use an inner join for pagination:
SELECT d.*
FROM pager_filter f
INNER JOIN data_table d ON d.data_id = f.reference_id
WHERE f.pager_id = $PAGER_ID && f.order_id between 100000 and 100099
ORDER BY f.order_id
or
SELECT d.*
FROM pager_filter f
INNER JOIN data_table d ON d.data_id = f.reference_id
WHERE f.pager_id = $PAGER_ID
ORDER BY f.order_id
LIMIT 100 OFFSET 100000
Hint: all code above is untested pseudo code
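Since older MySQL versions have no ROW_NUMBER(), the fill step could be emulated with a user variable (a sketch of my own, just as untested as the rest; $PAGER_ID, $CONDITIONS and $ORDERING are the same placeholders as above):
SET @rn := 0;
INSERT INTO pager_filter (pager_id, order_id, reference_id)
SELECT $PAGER_ID, (@rn := @rn + 1), t.data_id
FROM (
  SELECT data_id
  FROM data_table
  WHERE $CONDITIONS
  ORDER BY $ORDERING
) AS t;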
Below is the format of the database of Autonomous System Numbers (downloaded and parsed from this site!).
range_start range_end number cc provider
----------- --------- ------ -- -------------------------------------
16778240 16778495 56203 AU AS56203 - BIGRED-NET-AU Big Red Group
16793600 16809983 18144 AS18144
745465 total rows
A normal query looks like this:
select * from table where 3232235520 BETWEEN range_start AND range_end
It works properly, but I query a huge number of IPs to check their AS information, which ends up taking too many calls and too much time.
Profiler snapshot: (Blackfire profiler screenshot not included)
I've two indexes:
id column
a combined index on the range_start and range_end columns, as together they make the row unique.
Questions:
Is there a way to query a huge number of IPs in a single query?
multiple where (IP between range_start and range_end) OR where (IP between range_start and range_end) OR ... works but I can't get the IP -> row mapping or which rows are retrieved for which IP.
Any suggestions to change the database structure to optimize the query speed and decrease the time?
Any help will be appreciated! Thanks!
It is possible to query more than one IP address. There are several approaches we could take. Assume range_start and range_end are defined as integer types.
For a reasonable number of ip addresses, we could use an inline view:
SELECT i.ip, a.*
FROM ( SELECT 3232235520 AS ip
UNION ALL SELECT 3232235521
UNION ALL SELECT 3232235522
UNION ALL SELECT 3232235523
UNION ALL SELECT 3232235524
UNION ALL SELECT 3232235525
) i
LEFT
JOIN ip_to_asn a
ON a.range_start <= i.ip
AND a.range_end >= i.ip
ORDER BY i.ip
This approach will work for a reasonable number of IP addresses. The inline view could be extended with more UNION ALL SELECT to add additional IP addresses. But that's not necessarily going to work for a "huge" number.
When we get to "huge", we're going to run into limitations in MySQL... the maximum size of a SQL statement is limited by max_allowed_packet, and there may be a limit on the number of SELECTs that can appear.
The inline view could be replaced with a temporary table, built first.
DROP TEMPORARY TABLE IF EXISTS _ip_list_;
CREATE TEMPORARY TABLE _ip_list_ (ip BIGINT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO _ip_list_ (ip) VALUES (3232235520),(3232235521),(3232235522),...;
...
INSERT INTO _ip_list_ (ip) VALUES (3232237989),(3232237990);
Then reference the temporary table in place of the inline view:
SELECT i.ip, a.*
FROM _ip_list_ i
LEFT
JOIN ip_to_asn a
ON a.range_start <= i.ip
AND a.range_end >= i.ip
ORDER BY i.ip ;
And then drop the temporary table:
DROP TEMPORARY TABLE IF EXISTS _ip_list_ ;
Some other notes:
Churning database connections is going to degrade performance. There's a significant amount of overhead in establishing and tearing down a connection. That overhead gets noticeable if the application is repeatedly connecting and disconnecting, or if it's doing that for every SQL statement being issued.
And running an individual SQL statement also has overhead... the statement has to be sent to the server, parsed for syntax, evaluated for semantics, an execution plan chosen, the plan executed, a resultset prepared and returned to the client. This is why it's more efficient to process set-wise rather than row-wise. Processing RBAR (row by agonizing row) can be very slow compared to sending a statement to the database and letting it process a set in one fell swoop.
But there's a tradeoff there. With ginormous sets, things can start to get slow again.
Even if you can process two IP addresses in each statement, that halves the number of statements that need to be executed. If you do 20 IP addresses in each statement, that cuts down the number of statements to 5% of the number that would be required a row at a time.
And the composite index already defined on (range_start,range_end) is appropriate for this query.
FOLLOWUP
As Rick James points out in a comment, the index I earlier said was "appropriate" is less than ideal.
We could write the query a little differently, that might make more effective use of that index.
If (range_start,range_end) is UNIQUE (or PRIMARY) KEY, then this will return one row per IP address, even when there are "overlapping" ranges. (The previous query would return all of the rows that had a range_start and range_end that overlapped with the IP address.)
SELECT t.ip, a.*
FROM ( SELECT s.ip
, s.range_start
, MIN(e.range_end) AS range_end
FROM ( SELECT i.ip
, MAX(r.range_start) AS range_start
FROM _ip_list_ i
LEFT
JOIN ip_to_asn r
ON r.range_start <= i.ip
GROUP BY i.ip
) s
LEFT
JOIN ip_to_asn e
ON e.range_start = s.range_start
AND e.range_end >= s.ip
GROUP BY s.ip, s.range_start
) t
LEFT
JOIN ip_to_asn a
ON a.range_start = t.range_start
AND a.range_end = t.range_end
ORDER BY t.ip ;
With this query, for the innermost inline view query s, the optimizer might be able to make effective use of an index with a leading column of range_start, to quickly identify the "highest" value of range_start (that is less than or equal to the IP address). But with that outer join, and with the GROUP BY on i.ip, I'd really need to look at the EXPLAIN output; it's only conjecture what the optimizer might do; what is important is what the optimizer actually does.
Then, for inline view query e, MySQL might be able to make more effective use of the composite index on (range_start,range_end), because of the equality predicate on the first column, and the inequality condition on MIN aggregate on the second column.
For the outermost query, MySQL will surely be able to make effective use of the composite index, due to the equality predicates on both columns.
A query of this form might show improved performance, or performance might go to hell in a handbasket. The output of EXPLAIN should give a good indication of what's going on. We'd like to see "Using index for group-by" in the Extra column, and we only want to see a "Using filesort" for the ORDER BY on the outermost query. (If we remove the ORDER BY clause, we want to not see "Using filesort" in the Extra column.)
Another approach is to make use of correlated subqueries in the SELECT list. The execution of correlated subqueries can get expensive when the resultset contains a large number of rows. But this approach can give satisfactory performance for some use cases.
This query depends on no overlapping ranges in the ip_to_asn table, and this query will not produce the expected results when overlapping ranges exist.
SELECT t.ip, a.*
FROM ( SELECT i.ip
, ( SELECT MAX(s.range_start)
FROM ip_to_asn s
WHERE s.range_start <= i.ip
) AS range_start
, ( SELECT MIN(e.range_end)
FROM ip_to_asn e
WHERE e.range_end >= i.ip
) AS range_end
FROM _ip_list_ i
) r
LEFT
JOIN ip_to_asn a
ON a.range_start = r.range_start
AND a.range_end = r.range_end
As a demonstration of why overlapping ranges will be a problem for this query, given a totally goofy, made up example
range_start range_end
----------- ---------
.101 .160
.128 .244
Given an IP address of .140, the MAX(range_start) subquery will find .128, the MIN(range_end) subquery will find .160, and then the outer query will attempt to find a matching row range_start=.128 AND range_end=.160. And that row just doesn't exist.
This is a duplicate of the question here however I'm not voting to close it, as the accepted answer in that question is not very helpful; the answer by Quassnoi is much better (but it only links to the solution).
A linear index is not going to help resolve a database of ranges. The solution is to use geospatial indexing (available in MySQL and other DBMS). An added complication is that MySQL geospatial indexing only works in 2 dimensions (while you have a 1-D dataset) so you need to map this to 2-dimensions.
Hence:
CREATE TABLE IF NOT EXISTS `inetnum` (
`from_ip` int(11) unsigned NOT NULL,
`to_ip` int(11) unsigned NOT NULL,
`netname` varchar(40) default NULL,
`ip_txt` varchar(60) default NULL,
`descr` varchar(60) default NULL,
`country` varchar(2) default NULL,
`rir` enum('APNIC','AFRINIC','ARIN','RIPE','LACNIC') NOT NULL default 'RIPE',
`netrange` linestring NOT NULL,
PRIMARY KEY (`from_ip`,`to_ip`),
SPATIAL KEY `rangelookup` (`netrange`)
) ENGINE=MyISAM DEFAULT CHARSET=ascii;
Which might be populated with....
INSERT INTO inetnum
(from_ip, to_ip
, netname, ip_txt, descr, country
, netrange)
VALUES
(INET_ATON('127.0.0.0'), INET_ATON('127.0.0.2')
, 'localhost','127.0.0.0-127.0.0.2', 'Local Machine', '.',
GEOMFROMWKB(POLYGON(LINESTRING(
POINT(INET_ATON('127.0.0.0'), -1),
POINT(INET_ATON('127.0.0.2'), -1),
POINT(INET_ATON('127.0.0.2'), 1),
POINT(INET_ATON('127.0.0.0'), 1),
POINT(INET_ATON('127.0.0.0'), -1))))
);
Then you might want to create a function to wrap the rather verbose SQL....
DROP FUNCTION `netname2`//
CREATE DEFINER=`root`@`localhost` FUNCTION `netname2`(p_ip VARCHAR(20) CHARACTER SET ascii) RETURNS varchar(80) CHARSET ascii
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE l_netname varchar(80);
SELECT CONCAT(country, '/',netname)
INTO l_netname
FROM inetnum
WHERE MBRCONTAINS(netrange, GEOMFROMTEXT(CONCAT('POINT(', INET_ATON(p_ip), ' 0)')))
ORDER BY (to_ip-from_ip)
LIMIT 0,1;
RETURN l_netname;
END
And therefore:
SELECT netname2('127.0.0.1');
./localhost
Which uses the index:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE inetnum range rangelookup rangelookup 34 NULL 1 Using where; Using filesort
(and takes around 10msec to find a record from the combined APNIC,AFRINIC,ARIN,RIPE and LACNIC datasets on the very low spec VM I'm using here)
You can compare IP ranges using MySQL. This question might contain an answer you're looking for: MySQL check if an IP-address is in range?
SELECT * FROM TABLE_NAME WHERE (INET_ATON("193.235.19.255") BETWEEN INET_ATON(ipStart) AND INET_ATON(ipEnd));
You will likely want to index your table. This reduces the time it takes to search, much like the index in the back of a textbook, but for databases:
ALTER TABLE `table` ADD INDEX `name` (`column_id`)
EDIT: Apparently the index cannot be used when the indexed columns are wrapped in INET_ATON() like this, so you would have to pick one of these!
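One workaround (a sketch of my own, assuming ipStart/ipEnd are stored as strings and that you can add columns) is to store the numeric form of the boundaries once and index that, so the WHERE clause no longer wraps the indexed columns in a function:
ALTER TABLE TABLE_NAME
  ADD COLUMN ipStartNum INT UNSIGNED NOT NULL,
  ADD COLUMN ipEndNum INT UNSIGNED NOT NULL,
  ADD INDEX idx_range (ipStartNum, ipEndNum);
UPDATE TABLE_NAME
   SET ipStartNum = INET_ATON(ipStart),
       ipEndNum = INET_ATON(ipEnd);
SELECT * FROM TABLE_NAME
WHERE INET_ATON('193.235.19.255') BETWEEN ipStartNum AND ipEndNum;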
I have a large table called offers in a database (over 300,000 rows).
When I execute the query below, it takes over 3 seconds.
$sql = "SELECT * FROM `offers` WHERE (`start_price` / `price` >= 2) ORDER BY RAND() LIMIT 1";
Table offers
`id` int(11) NOT NULL,
`title` text NOT NULL,
`description` text NOT NULL,
`image` text NOT NULL,
`price` float NOT NULL,
`start_price` float NOT NULL,
`brand` text NOT NULL
Is there any way to make it faster? I want to select one random row where (start_price / price >= 2).
I think your problem is that your query requires a full table scan for the WHERE clause. The order by does make things worse -- depending on the volume that pass the filter.
You might consider storing this number in the table and adding an index to it:
alter table offers add column start_to_price float;
update offers
set start_to_price = start_price / price;
create index idx_offers_s2p on offers(start_to_price);
Then, your query might be fast:
SELECT o.*
FROM `offers` o
WHERE start_to_price >= 2
ORDER BY RAND()
LIMIT 1;
If performance is still a problem, then I would be likely to use a where clause first:
SELECT o.*
FROM `offers` o CROSS JOIN
     (SELECT COUNT(*) AS cnt FROM offers WHERE start_to_price >= 2) oo
WHERE o.start_to_price >= 2
  AND RAND() <= 10 / oo.cnt
ORDER BY RAND()
LIMIT 1;
This pulls about 10 rows at random and then chooses one of them.
If these don't work, then there are other solutions that get progressively more complicated.
One option to make this faster is to ensure that you leverage indexing:
How does database indexing work?
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
So in this case ensure that you have an index for start_price together with price and in that exact order.
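If you go that route, the suggested index would look something like this (the index name is mine):
ALTER TABLE offers ADD INDEX idx_start_price_price (start_price, price);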
Another way is to optimise the collation that is in use for the database and tables, so choose utf8mb4 over utf8, and if sorting/localisation is not an issue for you and you want to be completely anal, then general_ci over unicode_ci:
What's the difference between utf8_general_ci and utf8_unicode_ci
Despite the MyISAM storage engine delivering faster read speeds (http://www.rackspace.com/knowledge_center/article/mysql-engines-myisam-vs-innodb) I have found that there are various tweaks available to the InnoDB storage engine that can speed things up more so than I was able to achieve using MyISAM:
https://dba.stackexchange.com/questions/5666/possible-to-make-mysql-use-more-than-one-core?lq=1
So something like the following would be another option:
[mysqld]  # Don't play here unless you have read and understood what is going on
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb_buffer_pool_size=2G
Yet another option is to take a look at alternate storage engines: https://www.percona.com/software/mysql-database/percona-server/benchmarks
You could also see the other answers for refactoring of your query :)
There are alternatives. The one I have used is described here:-
http://jan.kneschke.de/projects/mysql/order-by-rand/
Essentially you generate a random number that is between your min and max id, and then join that against your result set (using >=), with a limit of 1. So you get a result set starting from a random point in your full results and then just grab the first record.
The downside is that if your id values are not evenly distributed then it isn't truly random.
Quick example code, assuming your offers table has a unique key called id:-
SELECT offers.*
FROM offers
INNER JOIN
(
SELECT RAND( ) * ( MAX( Id ) - MIN( Id ) ) + MIN( Id ) AS Id
FROM offers
WHERE (`start_price` / `price` >= 2)
) AS r2
ON offers.Id >= r2.Id
WHERE (`start_price` / `price` >= 2)
ORDER BY offers.Id LIMIT 1
I have a table currently containing about 5 million rows. This is a live database where data is populated by a scraping script. The script is inserting data into the table continuously.
For example:
For example: the business listing site gives me a JSON response on an API call; this is parsed and inserted into the database. A duplication check also happens in between. In a later phase I use the data obtained to generate reports.
While trying to generate reports based on the stored information, the script takes too long to complete.
The scraping script is live and continues to update the table with records in the future.
Every month it's expected to get 0.7 - 1 million new records.
Following is the structure of my table,
CREATE TABLE IF NOT EXISTS `biz_listing` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lid` smallint(11) NOT NULL,
`name` varchar(300) NOT NULL,
`type` enum('cat1','cat2') NOT NULL,
`location` varchar(300) NOT NULL,
`businessID` varchar(300) NOT NULL,
`reviewcount` int(6) NOT NULL,
`city` varchar(300) NOT NULL,
`categories` varchar(300) NOT NULL,
`result_month` varchar(10) NOT NULL,
`updated_date` date NOT NULL,
PRIMARY KEY (`id`),
KEY `biz_date` (`businessID`,`updated_date`),
KEY `type_date` (`type`,`updated_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The records fall under two categories, 'cat1' and 'cat2' .
(I am planning to add a new category, say cat3.)
I need a same-station aggregate report section, which shows business IDs that appear in every month of a selected range of months.
Here it is chosen as June-July 2014.
Report on aggregate numbers # category
SELECT COUNT(t.`businessID`) AS bizcount, SUM(t.reviewcount) AS reviewcount, t.`type`
FROM `biz_listing` t
INNER JOIN
( SELECT `businessID`,count(*) c FROM `biz_listing` WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01') GROUP
BY `businessID`,`type` HAVING c = 2 ) t2
ON t2.`businessID` = t.`businessID`
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01') GROUP BY t.`type`
EXPLAIN (done on a backup table of 4 million rows; screenshot not included)
Report on aggregate numbers # based on cities
SELECT COUNT(t.`businessID`) AS bizcount, SUM(t.reviewcount) AS reviewcount, t.`type`, t.`location` as city
FROM `biz_listing` t
INNER JOIN
( SELECT `businessID`,count(*) c FROM `biz_listing` WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01') GROUP
BY `businessID`,`type` HAVING c = 2 ) t2
ON t2.`businessID` = t.`businessID`
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01') GROUP BY t.`location`, t.`result_month`
Here we are selecting a range of months (June-July), so it will list all the businessIDs common to both months:
1st query will output according to type of Business
2nd query will output according to location
The problem is that it takes a very long time to execute the query (600 seconds and more); also, sometimes the query dies before completion.
Please suggest me on optimizations for the query if you find so.
I think indexing is affecting insertion performance of the scraping script.
How can I modify the current script considering insertion and retrieval performance?
Thanx in advance.
EDIT
I tried the suggested covering indexes and it's taking much more time than usual :(
EXPLAIN is as follows:
This is a MyISAM table, which offers less contention between inserting queries and reporting queries than InnoDB. Therefore, let's focus first on the reporting queries. It is true that indexes slow down inserts. But queries slow down a LOT because of missing or incorrect indexes.
To troubleshoot this performance problem it's helpful for clarity to consider the various subqueries separately, I believe.
So let's start with one of them.
SELECT `businessID`,
count(*) c
FROM `biz_listing`
WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01')
GROUP BY `businessID`,`type`
HAVING c = 2
This subquery is straightforward, and basically well-constructed. It's capable of using an index to jump to the first record meeting the updated_date range criterion, then linearly scan that index looking for the last record. As it scans the index, if it finds the type column in it, it can collect the record counts it needs to satisfy the query as it scans the index. That's fast.
But, you don't have that index! So this subquery is doing a full table scan. As we say in New England, that's wicked slow.
If you took your compound index (type,updated_date) and exchanged the order of its two columns to give (updated_date,type), it would serve as a high-performance covering index for this query. As it stands, the column order in your compound index keeps it from helping this query.
Let's take a look at your first main query in the same light (omitting the subquery).
SELECT COUNT(t.`businessID`) AS bizcount,
SUM(t.reviewcount) AS reviewcount, t.`type`
FROM `biz_listing` t
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01')
GROUP BY t.`type`
(Something's not clear here. You say COUNT(t.businessID) here, but it's possible you want COUNT(DISTINCT t.businessID). What you have will give the same result as COUNT(*) because there are no NULL values of businessID. If you do this, you can put HAVING COUNT(DISTINCT businessID) > 2 in the query and get rid of your need for the subquery.)
This query works similarly to the previous one. It scans an index over the updated_date range, then by type, then picks up values of businessID and reviewcount. So a compound index in this order will allow this query to be satisfied by a pure index scan, which will be fast.
(updated_date, type, businessID,reviewcount)
Notice that any query that can be satisfied from the (updated_date, type) index can also be satisfied from this one, so you don't need them both.
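In DDL terms that would be something like the following (the index name is mine; type_date is dropped because the new index makes it redundant):
ALTER TABLE biz_listing
  DROP INDEX type_date,
  ADD INDEX date_type_biz_rev (updated_date, type, businessID, reviewcount);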
Go read about compound covering indexes, tight range scans, and loose range scans.
Your other query will probably be greatly improved by this same index. Give it a try.
You have a backup table it seems. You can experiment with various compound indexes in that table until you get good results.
I'm reluctant to give this sort of advice:
TL;DR: change your indexes from this to that
because then you may just come back to SO with the next question and be tempted to become a support leech. Can I avoid being a "leech" when I am a beginner in a topic and only ask questions?
You know... teach a person to fish, etc.
In my application (PHP/MySQL/JS), I have a search functionality built in. One of the search criteria contains checkboxes for various options, and as such, some results would be more relevant than others, should they contain more or less of each option.
i.e. Options are A and B, and if I search for both options A and B, Result 1 containing only option A is 50% relevent, while Result 2 containing both options A and B is 100% relevant.
Prior, I'd just be doing simple SQL queries based on form input, but this one's a little harder, since it's not as simple as data LIKE "%query%", but rather, some results are more valuable to some search queries, and some aren't.
I have absolutely no idea where to begin... does anybody have relevant (ha!) reading material to direct me to?
Edit: After mulling it over, I'm thinking something involving an SQL script to get the raw data, followed by many many rounds of parsing is something I'd have to do...
Nothing cacheable, though? :(
Have a look at the Lucene project.
it is available in many languages
this is the php port
http://framework.zend.com/manual/en/zend.search.lucene.html
It indexes the items to search and returns relevant, weighted search results, e.g. better than select x from y where name like '%pattern%' style searching.
What you need is a powerful search engine, like solr. While you could implement this on top of mysql, it's already provided out of the box with other tools.
Here's an idea: do the comparisons and sum the results. The higher the sum, the more criteria match.
How about a (stupid) table like this:
name
dob_year
dob_month
dob_day
Find the person who shares the most of the three date components with 3/15/1980:
SELECT (dob_year = 1980) + (dob_month = 3) + (dob_day = 15) as strength, name
from user
order by strength desc
limit 1
A good WHERE clause and index would be required to keep you from doing a table scan, but...
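For example (my own illustration, not from the answer), narrow the candidate rows with an indexable range before scoring them:
ALTER TABLE user ADD INDEX idx_dob_year (dob_year);
SELECT (dob_year = 1980) + (dob_month = 3) + (dob_day = 15) AS strength, name
FROM user
WHERE dob_year BETWEEN 1978 AND 1982   -- indexable predicate keeps it from scanning every row
ORDER BY strength DESC
LIMIT 1;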
You could even add a weight to a column, e.g.
SELECT ((dob_year = 1980)*2)
Good luck.
Given your answer to my comment, here's an example on how you might do it:
First the tables:
CREATE TABLE `items` (
`id` int(11) NOT NULL,
`name` varchar(80) NOT NULL
);
CREATE TABLE `criteria` (
`cid` int(11) NOT NULL,
`option` varchar(80) NOT NULL,
`value` int(1) NOT NULL
);
Then an example of some items and criteria:
INSERT INTO items (id, name) VALUES
(1,'Name1'),
(2,'Name2'),
(3,'Name3');
INSERT INTO criteria VALUES
(1,'option1',1) ,(1,'option2',1) ,(1,'option3',0),
(2,'option1',0) ,(2,'option2',1) ,(2,'option3',1),
(3,'option1',1) ,(3,'option2',0) ,(3,'option3',1);
This would create 3 items and 3 options and assign options to them.
Now there are multiple way you can order by a certain "strength". The simplest of which would be:
SELECT i . * , c1.value + c3.value AS strength
FROM items i
JOIN criteria c1 ON c1.cid = i.id AND c1.option = 'option1'
JOIN criteria c3 ON c3.cid = i.id AND c3.option = 'option3'
ORDER BY strength DESC
This would show you all the items that have option 1 or option 3, but those with both options would appear to be ranked "higher".
This works well if you're doing a search on 2 options. But let's assume you make a search on all 3 options. All the items now share the same strength, this is why it's important to assign "weights" to options.
You could make the value your strength, but that might not help you if your queries don't always assign the same weights to the same options everywhere. This can be easily achieved on a per-query basis with the following query:
SELECT i.* , IF(c1.value, 2, 0) + IF(c3.value, 1, 0) AS strength
FROM items i
JOIN criteria c1 ON c1.cid = i.id AND c1.option = 'option1'
JOIN criteria c3 ON c3.cid = i.id AND c3.option = 'option3'
ORDER BY strength DESC
Try the queries out and see if it's what you need.
I would also like to note that this is not the best solution in terms of processing power. I'd recommend you add indexes, make the option field an integer, cache results wherever possible.
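For instance (these statements are my suggestion, not part of the answer):
ALTER TABLE items ADD PRIMARY KEY (id);
ALTER TABLE criteria ADD INDEX idx_cid_option (cid, `option`);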
Leave a comment if you have any questions or anything to add.