MySQL rollup with Pearson's R - php

I'm using MySQL & PHP to calculate Pearson's R measuring the relationship over a number of years between political donations and tenders received by a particular business.
I've run into trouble with the MySQL query that is feeding values into the algorithm. The algorithm itself is fine and evaluates correctly. The problem is in the query used to get the data.
The formula I'm using for Pearson's R is at http://www.statisticshowto.com/how-to-compute-pearsons-correlation-coefficients/
Here is the basic MySQL query that spits out values for each year:
SELECT count( distinct year) as count,name,sum(donations), sum(tenders), sum(donations * tenders) as xy,(sum(donations)*sum(donations)) as x2, (sum(tenders)*sum(tenders)) as y2 from money_by_year where name='$name' group by name,year
Here is the query WITH ROLLUP to get only the final values:
SELECT count( distinct year) as count,name,sum(donations), sum(tenders), sum(donations * tenders) as xy,(sum(donations)*sum(donations)) as x2, (sum(tenders)*sum(tenders)) as y2 from money_by_year where name='$name' group by name with rollup LIMIT 1
The problem is that the totals from the second query are wrong in the sum xy, x2 & y2. This is being caused by the query itself, probably the ROLLUP and I'd like to know what is going on with it.
You can see working examples of the code with the values resulting from both the above queries and the algorithm at https://openaus.net.au/follow_the_money.php?name=KPMG
I have tried various changes to sum(donations * tenders) as xy for example implementing it as sum(donations) * sum(tenders) as in:
SELECT count( distinct year) as count,name,sum(donations), sum(tenders), sum(donations) * sum(tenders) as xy,(sum(donations)*sum(donations)) as x2, (sum(tenders)*sum(tenders)) as y2 from money_by_year where name='KPMG' group by name with rollup LIMIT 1
However, the ROLLUP totals are still incorrect: they are much bigger than they should be. The values I want may not be obtainable from a single MySQL query, but I would appreciate knowing why that is, and what ROLLUP is doing to the figures.
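One likely reason for the inflated x2 and y2, sketched in Python rather than SQL: once all years fall into a single group, sum(donations)*sum(donations) is the square of the total, whereas Pearson's R needs the sum of the per-year squares. In the per-year query each group holds a single year, so the two happen to coincide. The rows below are made-up figures, and the sketch assumes one row per year in money_by_year; if there are several rows per year, the yearly totals would have to be computed first (for example in a subquery) before any products are summed.

```python
from math import sqrt


def pearson_from_sums(n, sx, sy, sxy, sx2, sy2):
    """Pearson's r computed from the running sums over n (x, y) pairs."""
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return numerator / denominator


# Hypothetical per-year totals: (donations, tenders). Not real data.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0), (4.0, 8.0)]

n = len(rows)
sx = sum(x for x, _ in rows)
sy = sum(y for _, y in rows)
sxy = sum(x * y for x, y in rows)   # like SUM(donations * tenders)
sx2 = sum(x * x for x, _ in rows)   # like SUM(donations * donations): sum of squares
sy2 = sum(y * y for _, y in rows)   # like SUM(tenders * tenders)

# What the single-group query computes instead: the square of the sum.
sx2_square_of_sum = sx * sx         # like SUM(donations) * SUM(donations)

r = pearson_from_sums(n, sx, sy, sxy, sx2, sy2)
print(sx2, sx2_square_of_sum, round(r, 4))
```

Under that assumption, replacing the x2 and y2 expressions with sum(donations*donations) and sum(tenders*tenders) would give the sums the formula expects without needing ROLLUP at all.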

Related

How to Limit MySQL query based on total records count percentage [duplicate]

Let's say I have a list of values, like this:
id value
----------
A 53
B 23
C 12
D 72
E 21
F 16
..
I need the top 10 percent of this list - I tried:
SELECT id, value
FROM list
ORDER BY value DESC
LIMIT COUNT(*) / 10
But this doesn't work. The problem is that I don't know the number of records before I run the query. Any ideas?
Best answer I found:
SELECT *
FROM (
    SELECT list.*, @counter := @counter + 1 AS counter
    FROM (SELECT @counter := 0) AS initvar, list
    ORDER BY value DESC
) AS X
WHERE counter <= (10/100 * @counter)
ORDER BY value DESC;
Change the 10 to get a different percentage.
If you want a random 10 percent rather than the top 10 percent, I've started using the following style (it returns roughly 10 percent of the rows, not an exact count):
SELECT id, value FROM list HAVING RAND() > 0.9
If you need it to be random but controllable you can use a seed (example with PHP):
SELECT id, value FROM list HAVING RAND($seed) > 0.9
Lastly, if this is the sort of thing you need full control over, you can add a column that holds a random value whenever a row is inserted, and then query using that:
SELECT id, value FROM list HAVING `rand_column` BETWEEN 0.8 AND 0.9
Since this does not require sorting, or ORDER BY - it is O(n) rather than O(n lg n)
You can also try with that:
SET @amount = (SELECT COUNT(*) FROM page) DIV 10;
PREPARE STMT FROM 'SELECT * FROM page LIMIT ?';
EXECUTE STMT USING @amount;
This is a MySQL bug described here: http://bugs.mysql.com/bug.php?id=19795
Hope it'll help.
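The same count-first idea can also be done from application code: run COUNT(*), compute the row limit, and pass it as a bound parameter. Below is a minimal sketch in Python; it uses the standard-library sqlite3 module as a stand-in for a MySQL connection, and the table contents and the helper name top_percent are made up for illustration.

```python
import sqlite3


def top_percent(conn, percent):
    """Return the top `percent` of rows in `list`, highest value first."""
    (total,) = conn.execute("SELECT COUNT(*) FROM list").fetchone()
    limit = total * percent // 100  # whole number of rows, rounded down
    return conn.execute(
        "SELECT id, value FROM list ORDER BY value DESC LIMIT ?", (limit,)
    ).fetchall()


# Hypothetical data: 20 rows with ids A..T and arbitrary distinct values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list (id TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO list VALUES (?, ?)",
    [(chr(65 + i), (i * 37) % 101) for i in range(20)],
)

top = top_percent(conn, 10)
print(top)
```

This costs two round trips, and the count can be stale if rows are inserted between the two statements, which may or may not matter for your use case.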
I realize this is VERY old, but it still pops up as the top result when you google SQL limit by percent, so I'll try to save you some time. This is pretty simple to do in SQL Server (T-SQL), which is where the TOP ... PERCENT syntax comes from; MySQL does not support it. The following would give the OP the results they need there:
SELECT TOP 10 PERCENT
id,
value
FROM list
ORDER BY value DESC
To get a quick and dirty random 10 percent of your table, the following would suffice:
SELECT TOP 10 PERCENT
id,
value
FROM list
ORDER BY NEWID()
I have an alternative which hasn't been mentioned in the other answers: if you access the database from any language where you have full access to the MySQL API (i.e. not the MySQL CLI), you can launch the query, ask how many rows there will be, and then break out of the loop when you have enough.
E.g. in Python:
...
# Assumes a driver such as MySQLdb, whose execute() returns the row count.
maxnum = cursor.execute(query)
for num, row in enumerate(cursor):
    if num >= .1 * maxnum:  # Stop once 10% of the rows have been processed.
        break
    do_stuff...
This works only with mysql_store_result(), not with mysql_use_result(), as the latter requires that you always accept all needed rows.
OTOH, the traffic for my solution might be too high - all rows have to be transferred.

Get previous 10 row from specific WHERE condition

I'm currently working on a project that uses a MySQL database, and I'm having a hard time constructing the query I need.
I want to get the 10 rows that come before the row matched by a specific WHERE condition.
For example, if my WHERE clause is
date='December';
I want the previous 10 months as the result:
Feb, March, April, May, June, July, Aug, Sept, Oct, Nov, and so on.
Another example:
I have 17 strings stored in my database, and in my WHERE clause I specify WHERE strings='eyt' LIMIT 3.
Test
one
twi
thre
for
payb
six
seven
eyt
nayn
ten
eleven
twelve
tertin
fortin
fiftin
sixtin
the result must be
payb
six
seven
Thanks in advance for your suggestions or answers
If you are using PDO this is the right syntax:
$objStmt = $objDatabase->prepare('SELECT * FROM calendar ORDER BY id DESC LIMIT 10');
You can switch between DESC and ASC in order to get either the last or the first 10.
Here's a solution:
select t.*
from mytable t
inner join (select id from mytable where strings = 'eyt' order by id limit 1) x
on t.id < x.id
order by t.id desc
limit 3
Demo: http://sqlfiddle.com/#!9/7ffc4/2
It outputs the rows in descending order, but you can either live with that, or else put that query in a subquery and reverse the order.
Re your comment:
x in the above query is called a "correlation name" so we can refer to columns of the subquery as if they were columns of a table. It's required when you use a subquery as a table.
I chose the letter x arbitrarily. You can use anything you like as a correlation name, following the same rules you would use for any identifier.
You can also optionally define a correlation name for any simple table in the query (like mytable t above), so you can refer to columns of that table using a convenient abbreviated name. For example in t.id < x.id
Some people use the term "table alias" but the technical term is "correlation name".
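To get the rows back in their original (ascending) order, the query can be wrapped in a subquery and re-sorted, as suggested above. Here is a small sketch in Python using the standard-library sqlite3 module in place of MySQL; the table name mytable and the column names come from the answer, while the helper name rows_before and the auto-numbered ids are assumptions for the demo.

```python
import sqlite3


def rows_before(conn, needle, n):
    """Return the n rows just before the first row matching `needle`, oldest first."""
    sql = """
        SELECT id, strings FROM (
            SELECT t.id, t.strings
            FROM mytable t
            JOIN (SELECT MIN(id) AS id FROM mytable WHERE strings = ?) x
              ON t.id < x.id
            ORDER BY t.id DESC
            LIMIT ?
        ) AS prev
        ORDER BY id ASC
    """
    return conn.execute(sql, (needle, n)).fetchall()


# The 17 strings from the question, numbered 1..17 in insertion order.
words = ["Test", "one", "twi", "thre", "for", "payb", "six", "seven", "eyt",
         "nayn", "ten", "eleven", "twelve", "tertin", "fortin", "fiftin", "sixtin"]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, strings TEXT)")
conn.executemany("INSERT INTO mytable (strings) VALUES (?)", [(w,) for w in words])

result = [s for _, s in rows_before(conn, "eyt", 3)]
print(result)
```

The inner query picks the three closest earlier rows (descending), and the outer ORDER BY only flips them back. If the search string does not exist, the join finds nothing and the result is empty.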

SQL join or PHP loop? or what?

So I have two queries that give me the information I need, and I'm trying to figure out the best way to combine their results. I have a table of holdings, using:
SELECT symbol, sum(shares) AS shares, sum(shares * price) AS cost
FROM Transactions
WHERE (action <>5)
AND date <= '2010-10-30'
GROUP BY symbol HAVING sum(shares) > 0
which results in
AGNC 50.00 1390.0000
RSO 1517.00 9981.8600
T 265.00 6668.7500
I then have another query
SELECT close FROM History WHERE symbol = $symbol AND date = $date
which will return the closing price of that day.
T 24.40
Basically, for a given date I want to calculate the value, i.e. sum(shares * close) for each symbol, but I don't know how to do that without looping through each row in PHP. I was hoping there was a join or something I could do in SQL to make the data come back the way I want it.
I think you could do something similar to this (a SELECT query that pulls in a second SELECT): put your second query directly into your first one as a join. I'm not sure about the exact syntax, but something like:
SELECT
    t.symbol,
    SUM(t.shares) AS shares,
    SUM(t.shares * t.price) AS cost,
    SUM(t.shares * h.close) AS value
FROM Transactions t
INNER JOIN History h ON h.symbol = t.symbol AND h.date = '$date'
WHERE t.action <> 5
AND t.date <= '2010-10-30'
GROUP BY t.symbol HAVING SUM(t.shares) > 0
@popnoodles is right about your naming though. If you use date as a column name, you may need to quote it: with backticks (`date`) in MySQL, or as [date] in SQL Server.
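For what it's worth, here is a runnable sketch of that join, written in Python with the standard-library sqlite3 module standing in for MySQL. The table and column names follow the question; the rows, prices and the as_of date are invented for the demo, and the sketch assumes History has exactly one row per symbol and date (otherwise the join would duplicate transactions and inflate the sums).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Transactions (symbol TEXT, shares REAL, price REAL,
                               action INTEGER, date TEXT);
    CREATE TABLE History (symbol TEXT, date TEXT, close REAL);
    -- Hypothetical sample data, not the asker's real figures.
    INSERT INTO Transactions VALUES
        ('T',    100, 25.0, 1, '2010-01-04'),
        ('T',    165, 25.0, 1, '2010-06-01'),
        ('AGNC',  50, 27.8, 1, '2010-09-01');
    INSERT INTO History VALUES
        ('T',    '2010-10-29', 24.40),
        ('AGNC', '2010-10-29', 28.00);
""")

as_of = "2010-10-29"  # valuation date, bound as a parameter instead of interpolated
holdings = conn.execute(
    """
    SELECT t.symbol,
           SUM(t.shares)           AS shares,
           SUM(t.shares * t.price) AS cost,
           SUM(t.shares * h.close) AS value
    FROM Transactions t
    JOIN History h ON h.symbol = t.symbol AND h.date = ?
    WHERE t.action <> 5 AND t.date <= ?
    GROUP BY t.symbol
    HAVING SUM(t.shares) > 0
    ORDER BY t.symbol
    """,
    (as_of, as_of),
).fetchall()

for row in holdings:
    print(row)
```

Binding the date as a parameter also avoids interpolating $date straight into the SQL string, which is worth doing in the PHP version too.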

PHP/MySQL Debug + Optimization

I have about 100,000 merchants stored in my database, the table contains their Name, Lat, Lon
Right now my search function takes a lat/lon input and finds the stores within a 10 mile radius. However, a lot of the 100,000 merchants are franchises (i.e. groups of associated stores). For the franchises, I'd like to show only the closest store and hide the other stores of the same franchise (if there are multiple locations within the 10 mile radius).
I've tried doing:
SELECT SQL_CALC_FOUND_ROWS *, xxxxx as Distance
FROM table
WHERE isActive = 1
GROUP BY Name
HAVING Distance < 10
ORDER BY Distance ASC
xxxxx is the function that calculates distance based on input lat/lon:
((ACOS(SIN('$lat'*PI()/180)*SIN(`Latitude`*PI()/180) + COS('$lat'*PI()/180)*COS(`Latitude`*PI()/180)*COS(('$long'-`Longitude`)*PI()/180))*180/PI())*60*1.1515)
However, it's not returning the correct results. I'm getting significantly fewer results, for franchised and unfranchised stores alike, compared with the same query without the "GROUP BY" clause. I wonder what the problem is?
Also, the query is really slow. I have the Name column indexed, but I think the "GROUP BY Name" is the bottleneck, since MySQL must be doing a lot of string comparisons. Assuming the GROUP BY bug can be fixed, I'm wondering what my options are to make this faster. Is it worthwhile to set up a "Business_Group" column and pre-process the stores so that franchised stores are assigned a Business_Group ID? That way GROUP BY would be faster, since it would be comparing ints.
Create a view that exposes the calculated column xxxxx, then join the table to that view.
That should be faster.
Try grouping a derived table:
SELECT * FROM (
    SELECT *, xxxxx AS Distance FROM table WHERE isActive = 1
    HAVING Distance < 10 ORDER BY Distance ASC
) AS nearby
GROUP BY Name
In a query without GROUP BY, HAVING acts as an immediate filter on every row. With GROUP BY it is applied to whole groups: each group is first collapsed to one (arbitrarily chosen) row and only then filtered by HAVING, so a franchise whose chosen row is more than 10 miles away disappears even if it has a closer store.
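The intended logic, filter by distance first and then keep the closest store per name, can be sketched outside SQL as follows. This is only an illustration in Python: the distance function mirrors the spherical-law-of-cosines formula from the question, while the function names and the sample coordinates are made up.

```python
from math import acos, cos, degrees, radians, sin


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (spherical law of cosines)."""
    a = (sin(radians(lat1)) * sin(radians(lat2))
         + cos(radians(lat1)) * cos(radians(lat2)) * cos(radians(lon1 - lon2)))
    a = max(-1.0, min(1.0, a))  # guard against rounding just outside acos's domain
    return degrees(acos(a)) * 60 * 1.1515


def nearest_per_name(stores, lat, lon, radius=10.0):
    """Keep only stores within `radius` miles, then the closest one per name."""
    nearest = {}
    for name, slat, slon in stores:
        d = distance_miles(lat, lon, slat, slon)
        if d < radius and (name not in nearest or d < nearest[name]):
            nearest[name] = d
    return sorted((d, name) for name, d in nearest.items())


# Hypothetical stores: (name, latitude, longitude).
stores = [
    ("Acme", 40.10, -75.0),  # same franchise, farther away
    ("Acme", 40.01, -75.0),  # same franchise, closest
    ("Bolt", 40.05, -75.0),
    ("Far", 41.00, -75.0),   # outside the 10 mile radius
]
result = nearest_per_name(stores, 40.0, -75.0)
print(result)
```

In SQL the same effect is usually more dependable with an explicit aggregate, e.g. selecting Name and MIN(Distance) over the filtered rows, than by relying on which row GROUP BY happens to keep.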

Returning random rows from mysql database without using rand()

I would like to be able to pull back 15 or so records from a database. I've seen that using WHERE id = rand() can cause performance issues as my database gets larger. All solutions I've seen are geared towards selecting a single random record. I would like to get multiples.
Does anyone know of an efficient way to do this for large databases?
edit:
Further Edit and Testing:
I made a fairly simple table, on a new database using MyISAM. I gave this 3 fields: autokey (unsigned auto number key) bigdata (a large blob) and somemore (a medium int). I then applied random data to the table and ran a series of queries using Navicat. Here are the results:
Query 1: select * from test order by rand() limit 15
Query 2: select *
from
test
join
(select round(rand()*(select max(autokey) from test)) as val from test limit 15) as rnd
on
rnd.val=test.autokey;
(I tried both select and select distinct and it made no discernible difference)
and:
Query 3 (I only ran this on the second test):
SELECT *
FROM (
SELECT @cnt := COUNT(*) + 1,
@lim := 10
FROM test
) vars
STRAIGHT_JOIN
(
SELECT r.*,
@lim := @lim - 1
FROM test r
WHERE (@cnt := @cnt - 1)
AND RAND(20090301) < @lim / @cnt
) i
ROWS        QUERY 1   QUERY 2   QUERY 3
2,060,922   2.977s    0.002s    N/A
3,043,406   5.334s    0.001s    1.260s
I would like to do more rows so I can see how query 3 scales, but at the moment, it seems as though the clear winner is query 2.
Before I wrap up this testing and declare an answer, and while I have all this data and the test environment set up, can anyone recommend any further testing?
Try:
select * from table order by rand() limit 15
Another (and possibly more efficient) way would be to join against a set of random values. This should work if there's a contiguous integer key in the table. Here is how I would do it in Postgres (my MySQL is a bit rusty):
select * from table join
(select (random()*maxid)::integer as val from generate_series(1,15)) as rnd
on rnd.val=table.id;
where maxid is the highest id in table. If id has an index, this means only 15 index lookups, so it's very fast.
UPDATE:
Looks like there is no such thing as generate_series in MySQL. My fault. We don't actually need it:
select *
from
table
join
-- this just returns 15 random numbers.
-- I need `table` here only to produce rows for rand()
(select round(rand()*(select max(id) from table)) as val from table limit 15) as rnd
on
rnd.val=table.id;
P.S. If I don't want duplicates returned, I can use (select distinct [...]) in the random generator expression.
Update: Check out the accepted answer in this question. It's pure MySQL and even deals with even distribution.
The problem with id = rand() or anything comparable in PHP is that you can't be sure whether that particular ID still exists. Therefore, you need to work with LIMIT, and that can become slow for large amounts of data.
As an alternative to that, you could try using a loop in PHP.
What the loop does is:
1. Create a random integer using rand(), in the range between 0 and the number of records in the database
2. Query the database to check whether a record with that ID exists
3. If it exists, add the number to an array
4. If it doesn't, go back to step 1
5. End the loop when the array of random numbers contains the desired number of elements
This method could cause a lot of queries on a fragmented table, but they should be pretty fast to execute. It may be faster than LIMIT rand() in certain situations.
The LIMIT method, as outlined by @Luther, is certainly the simplest code-wise.
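The loop described above might look roughly like this. It is a sketch in Python rather than PHP, using the standard-library sqlite3 module in place of MySQL; the table name items, its columns, and the max_attempts safety cap are assumptions added for the demo. It also draws candidates between 1 and MAX(id) instead of using the record count, since ids can exceed the count once rows have been deleted.

```python
import random
import sqlite3


def random_rows(conn, wanted, max_attempts=1000, rng=random):
    """Pick `wanted` distinct rows by probing random ids until enough exist."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM items").fetchone()
    if max_id is None:  # empty table
        return []
    picked = {}
    attempts = 0
    while len(picked) < wanted and attempts < max_attempts:
        attempts += 1
        candidate = rng.randint(1, max_id)
        if candidate in picked:
            continue  # already have this row, draw again
        row = conn.execute(
            "SELECT id, payload FROM items WHERE id = ?", (candidate,)
        ).fetchone()
        if row is not None:  # id may be missing in a fragmented table
            picked[candidate] = row
    return list(picked.values())


# Hypothetical fragmented table: ids 1..100 with every even id deleted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, "row %d" % i) for i in range(1, 101)])
conn.execute("DELETE FROM items WHERE id % 2 = 0")

sample = random_rows(conn, 15, rng=random.Random(1))
print(len(sample))
```

Each probe is a primary-key lookup, so the cost depends mainly on how many ids are missing; the cap keeps the loop from spinning forever when fewer rows exist than were requested.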
You could do a query with all the results or however many limited, then use mysqli_fetch_all followed by:
shuffle($a);
$a = array_slice($a, 0, 15);
For a large dataset doing
select * from table order by rand() limit 15
can be quite time and memory consuming.
If your data records happen to be numbered, you can put an index on the numbering column and do a
select * from table where no >= rand() limit 15
Or, even better, do the random number generation in your application and do
select * from table where no >= $rand and no < $rand+15
If your data doesn't change too often, it might be worth adding such a numbering column to make the selection efficient.
Assuming operations on the primary key are fast, I'd try something like the following (the extra derived table is there because MySQL rejects LIMIT directly inside an IN subquery):
select * from table where id in (select id from (select id from table order by rand() limit 15) as picked)
