PHP/SQL: ORDER BY or sort($array)?

Which do you think is faster in a PHP script:
$query = "SELECT... FROM ... ORDER BY first_val";
or
while ($row = odbc_fetch_array($result)) {
    $arrayname[] = array(
        "first_key"  => $row['first_val'],
        "second_key" => $row['second_val'],
        // etc...
    );
}
sort($arrayname);

It depends on so many factors that I don't even know where to begin.
But as a rule, you perform sorting on the database side.
Indexes, collations and all of that help.

Which do you think is faster in a PHP script:
The ORDER BY doesn't execute in the PHP script -- it executes in the database, before the data is retrieved by the PHP script. Apologies if this seems pedantic; I just want to make sure you understand this.
Anyway, the reason I would use ORDER BY is that the database has access to indexes and cached pages. Sorting in PHP sorts the data set in memory, of course, but has no access to any index.
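For comparison, here is a minimal sketch of what the PHP-side alternative actually involves (assuming $result is the ODBC result resource from the question). Note that a plain sort() compares whole row arrays, so ordering by a particular column needs usort():

// Sketch only: sort the fetched rows by one column in PHP.
$rows = array();
while ($row = odbc_fetch_array($result)) {
    $rows[] = $row;
}

// sort($rows) would compare entire row arrays; usort() orders by a chosen column.
usort($rows, function ($a, $b) {
    return strcmp($a['first_val'], $b['first_val']);
});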

If the ordered field is indexed, I'd say probably the SQL query. If not, I'm not sure, but I can't imagine it will be overly noticeable either way unless you're dealing with an absurdly large number of rows.

ORDER BY will almost always be faster.

In my opinion, nothing beats actually timing the thing so you really, really know for sure:
$time_start = microtime(true);
// Try the ORDER BY and sort($array) variants here
$time_end = microtime(true);
$time = $time_end - $time_start;
echo "It took $time seconds";

If there's a LIMIT on the first query, and the set of rows the query would match without the LIMIT is much larger than the LIMIT, then ORDER BY on the query is DEFINITELY faster.
That is to say, if you need the top 50 rows from a 10,000 row table, it's much faster to have the database sort for you and return only those top 50 rows than it is to retrieve all 10,000 rows and sort them yourself in PHP. This is probably representative of the vast majority of what happens in real-world applications.
If there are any cases at all in which sorting in PHP is even comparable, they're few and far between.
Additionally, SQL sorting is much more powerful -- it's trivial to sort on multiple columns, subqueries, the return values of aggregate functions etc.
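For example (a sketch with made-up table and column names, using MySQL-style LIMIT syntax), this pushes both a two-column sort and the row limit onto the database, so PHP only ever sees the 50 rows it needs:

// Hypothetical query: the database sorts on two columns and returns only 50 rows.
$query = "SELECT first_val, second_val
          FROM some_table
          ORDER BY first_val ASC, second_val DESC
          LIMIT 50";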

Related

Efficient way of emulating LIMIT (FETCH), OFFSET in Progress OpenEdge 10.1B SQL using PHP

I want to be able to use the equivalent of MySQL's LIMIT, OFFSET in Progress OpenEdge 10.1b.
Whilst the FETCH/OFFSET commands are available as of Progress OpenEdge 11, version 10.1B unfortunately does not have them, so it is difficult to produce paged recordsets (e.g. records 1-10, 11-20, 21-30, etc.).
ROW_NUMBER is also not supported by 10.1B. The functionality seems to be pretty much the same as what was found in SQL Server 2000.
If searching always in the order of the primary key id (pkid), this could be achieved by using "SELECT TOP 10 * FROM table ORDER BY pkid ASC", then identifying the last pkid and finding the next set with "SELECT TOP 10 * FROM table WHERE pkid>last_pkid ORDER BY pkid ASC"; this, however only works when sorting by the pkid.
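As a rough sketch (assuming a $db wrapper with query()/fetch() methods like the one used in the function below, and positive pkid values), that keyset-style approach would look something like this:

// Sketch: fetch the next page by remembering the last pkid seen on the previous page.
function next_page($db, $last_pkid = 0, $page_size = 10)
{
    $sql = "SELECT TOP " . (int)$page_size . " * FROM table "
         . "WHERE pkid > " . (int)$last_pkid . " ORDER BY pkid ASC";
    $rows = array();
    $query = $db->query($sql);
    while ($row = $db->fetch($query)) {
        $rows[] = $row;
    }
    return $rows; // keep the last row's pkid to request the following page
}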
My solution to this was to write a PHP function where I could pass the limit and offset and then return only the results where the row number was between those defined values. I use TOP to return no more than the sum of the limit and offset.
function limit_query($sql, $limit = NULL, $offset = 0)
{
    global $db; // $db is my DB wrapper class

    $out = array();
    if ($limit != NULL) {
        // str_replace_first() is a helper that replaces only the first occurrence;
        // TOP caps the result at limit + offset rows
        $sql = str_replace_first("SELECT", "SELECT TOP " . ($limit + $offset), $sql);
    }
    $query = $db->query($sql);
    $i = 0;
    while ($row = $db->fetch($query)) {
        if ($i >= $offset) { // only add to the return array once past the offset
            $out[] = $row;
        }
        $i++;
    }
    $db->free_result($query);
    return $out;
}
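A hypothetical call for page 3 at 10 rows per page (limit 10, offset 20) would then be:

// Hypothetical usage: page 3, 10 rows per page
$rows = limit_query("SELECT * FROM table ORDER BY pkid ASC", 10, 20);
foreach ($rows as $row) {
    // render the row...
}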
This works well on small recordsets or on the first few pages of results, but if the total results run into the thousands and you want to see page 20, 100 or 300, it is very slow and inefficient (page one queries only the first 10 results and page 2 the first 20, but page 100 has to query the first 1,000).
Whilst in most cases the user will probably not venture past page 2 or 3, so the lack of efficiency perhaps isn't a major issue, I do wonder if there is a more efficient way of emulating this functionality.
Sadly, upgrading to a newer version of Progress, or a superior database such as MySQL is not an option, as the db is provided by third-party software.
Can anyone suggest alternative, more efficient methods?
I am not sure I fully understand the question, so here's an attempt to give you an answer:
You probably won't be able to do what you want with a single hit to the db. Just by sorting records or adding functions you probably won't achieve the paging functionality you are trying to get. As far as I know, Progress won't number the rows unless, as you said, you're sorting by some ascending pkid.
My suggestion would be a procedure in the back end that creates the query with a batch size equal to the page size (in your case 10) and loops to get the next batch until you reach the ones you need. Look into batching datasets or using an open query with MAX-ROWS.
Hope it helps, or at least gives you an idea. I actually like your PHP implementation; it seems like a good workaround and not ugly to keep.
You should be able to install an upgraded version of Progress, convert your database(s) and recompile the code against the new version. Normally your vendor's support would provide you with the latest version of Progress (OpenEdge), so it shouldn't be a huge issue. Going from version 10 to 11 shouldn't cause any compile issues and would give you all of the SQL benefits of the newer version.
Honestly, your comment about MySQL being superior is a little confusing, but that's a discussion for another day. ;D
Best regards!

mysql_num_rows() php - is it efficient?

I have several SELECT statements on a PHP page, and I used Dreamweaver to generate those.
After going through the code it generated, there seemed to be a lot of fluff which I could cut out under most circumstances, a mysql_num_rows() line for each statement being an example.
So I'm wondering if anyone can tell me whether or not cutting this out actually saves resources - considering the query is being run regardless, is there any actual overhead to the call itself?
UPDATE:
After following Chriszuma's suggestion about microtime, here are my results:
//time before running the query
1: 0.46837500 1316102620
//time after the query ran
2: 0.53913800 1316102620
//time before calling mysql_num_rows()
3: 0.53914200 1316102620
//time after mysql_num_rows()
4: 0.53914500 1316102620
So not much overhead at all, it seems
mysql_num_rows() counts rows after they have been fetched. It's like you fetched all rows and stored them in a PHP array, and then ran count($array). But mysql_num_rows() is implemented in C within the MySQL client library, so it should be a bit more efficient than the equivalent PHP code.
Note that in order for mysql_num_rows() to work, you do have to have the complete result of your query in PHP's memory space. So there is overhead in the sense that a query result set could be large, and take up a lot of memory.
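If all you actually need is the count and not the rows themselves, it is usually cheaper to let MySQL do the counting so the rows never reach PHP at all. A minimal sketch (placeholder table and condition, using the legacy mysql_* API from the question):

// Sketch: ask MySQL for the count instead of fetching every row.
$result = mysql_query("SELECT COUNT(*) FROM some_table WHERE some_condition = 1");
$row = mysql_fetch_row($result);
$count = (int) $row[0];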
I would expect that such a call would have an extremely minimal impact on performance. It is just counting the rows of its internally-stored query result. The SQL query itself is going to take the vast majority of processing time.
If you want to know for sure, you can execute microtime() before and after the call to see exactly how long it is taking.
$startTime = microtime(true);
mysql_num_rows($result); // $result is the resource returned by mysql_query()
$time = microtime(true) - $startTime;
echo("mysql_num_rows() execution: $time seconds\n");
My suspicion is that you will see something in the microseconds range.

Multiple Queries to a Large MySQL Table

I have a table with columns ID (int), Number (decimal), and Date (an integer timestamp). There are millions of rows. There are indexes on ID and Date.
On many of my pages I am querying this four or five times for a list of Numbers in a specified date range (the range being different each query).
Like:
select number, date from table where date < 111111111 and date > 111111100000
I'm querying these sets of data to be placed on several different charts. "Today vs Yesterday", "This Month vs Last Month", "This Year vs Last Year".
Would it be better to query the largest possible result set in one SQL statement and then filter it down in my programming language (via a sorted and spliced array), rather than waiting for each of these 0.3-second queries to finish?
Is there something else that can be done to speed this up?
It depends on the result set and the executing speed of your queries. There is no ultimate answer to this question.
You should benchmark and calculate the results if you really need to speed up things.
But keep in mind that premature optimization should be avoided, and that you would be re-implementing logic the database already provides in code of your own, which can contain bugs, etc.
While it may make the query itself quicker, you have to ask yourself about the potential impact on memory if you were to load the entire range of records and then aggregate it programmatically.
Chances are that MySQL's index-based optimizations will perform better than anything you could come up with anyway, so it sounds like a bad idea.
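One concrete thing worth trying (a sketch only; the table name is made up since the question doesn't give one): a composite index on (Date, Number) can let each range query be answered from the index alone, without touching the full rows.

// Hypothetical table name; Date and Number are the columns from the question.
// With this "covering" index, SELECT number, date ... WHERE date BETWEEN x AND y
// can typically be served from the index without reading the table rows.
mysql_query("ALTER TABLE readings ADD INDEX idx_date_number (Date, Number)");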

High performance PHP similarity checking on a large database

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).
In order to do this it will require doing 30,000^2 checks for each column.
I estimate I will be checking on average 4 columns.
This means I will have to do 3,600,000,000 checks.
What is the best (fastest, and most reliable) way to do this with PHP, bearing in mind request memory limits and time limits etc?
The server needs to still actively serve web pages at the same time as doing this.
PS. The server we are using is an 8 core Xeon 32 GB ram.
Edit:
The size of each column is normally less than 50 characters.
I guess you just need FULL TEXT search.
If that doesn't fit your needs, you have only one real option: cache the results.
That way you will not have to run billions of comparisons for each request.
Anyway, here is how you could do it:
$result = array();
$sql = "SELECT * FROM TABLE";
$query = mysql_query($sql); //> or whatever DB API you use to run the query
while ($row = mysql_fetch_assoc($query)) { //> mysql_* assumed here
    $result[] = $row; //> Append the current record
}
Now $result contains all the rows from your table.
At this point you said you want to similar_text() all columns with each other.
To do that and cache the results you need at least a table (as I said in the comment).
//> Start calculating the similarity
foreach ($result as $k => $v) {
    foreach ($result as $k2 => $v2) {
        //> At this point you have 2 rows, $v and $v2, containing your columns
        $similarity = 0;
        $similarity += levenshtein($v['column1'], $v2['column1']);
        $similarity += levenshtein($v['column2'], $v2['column2']);
        //> Whatever comparison you need here between columns
        //> Now you can finally store the result by inserting $similarity into a table
        mysql_query("INSERT DELAYED INTO similarity (value) VALUES ('$similarity')");
    }
}
Two things to notice:
I used levenshtein() because it's much faster than similar_text() (note that its value is the opposite of similar_text(): the greater the value levenshtein() returns, the less similar the strings are)
I used INSERT DELAYED to greatly lower the database cost
oy... similar_text() is O(n^3)!
Do you really need a percentage similarity for each comparison, or can you just do a quick compare of the first/middle/last X bytes of the strings to narrow the field?
If you're just looking for dups, say, you can probably narrow down the number of comparisons you need to do, and that will be the most effective tack imho.
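For instance, a cheap prefilter (a sketch only; the column names and thresholds are illustrative) can rule out most pairs before levenshtein() ever runs inside the double loop shown above:

// Sketch: skip the expensive comparison when a cheap check already rules it out.
function roughly_comparable($a, $b)
{
    // levenshtein() distance is at least the difference in length,
    // so wildly different lengths can never be close matches.
    if (abs(strlen($a) - strlen($b)) > 10) {
        return false;
    }
    // Differing leading bytes is a quick way to discard many pairs.
    return strncasecmp($a, $b, 3) === 0;
}

if (roughly_comparable($v['column1'], $v2['column1'])) {
    $similarity += levenshtein($v['column1'], $v2['column1']);
}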

Difference in efficiency of retrieving all rows in one query, or each row individually?

I have a table in my database that has about 200 rows of data that I need to retrieve. How significant, if at all, is the difference in efficiency when retrieving all of them at once in one query, versus each row individually in separate queries?
The queries are usually made via a socket, so executing 200 queries instead of 1 represents a lot of overhead, plus the RDBMS is optimized to fetch a lot of rows for one query.
200 queries instead of 1 will make the RDBMS initialize datasets, parse the query, fetch one row, populate the datasets, and send the results 200 times instead of 1 time.
It's a lot better to execute only one query.
I think the difference will be significant, because there will (I guess) be a lot of overhead in parsing and executing the query, packaging the data up to send back etc., which you are then doing for every row rather than once.
It is often useful to write a quick test which times various approaches, then you have meaningful statistics you can compare.
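A quick harness for that kind of test might look like this (a sketch only; the table, the ID range and the legacy mysql_* API are placeholders/assumptions):

// Time one query that returns all 200 rows...
$start = microtime(true);
$result = mysql_query("SELECT * FROM some_table WHERE id BETWEEN 1 AND 200");
while ($row = mysql_fetch_assoc($result)) { /* use the row */ }
echo "Single query: " . (microtime(true) - $start) . " s\n";

// ...versus 200 separate single-row queries.
$start = microtime(true);
for ($id = 1; $id <= 200; $id++) {
    $result = mysql_query("SELECT * FROM some_table WHERE id = " . (int) $id);
    $row = mysql_fetch_assoc($result);
}
echo "200 queries: " . (microtime(true) - $start) . " s\n";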
If you were talking about some constant number of queries k versus a slightly greater constant number k + k1, you might find that more queries is better. I don't know for sure, but SQL has all sorts of unusual quirks, so it wouldn't surprise me if someone could come up with a scenario like this.
However, if you're talking about some constant number of queries k versus some non-constant number of queries n, you should always pick the constant number of queries option.
In general, you want to minimize the number of calls to the database. You can already assume that MySQL is optimized to retrieve rows; however, you cannot be certain that your own calls are optimized, if at all.
Extremely significant. Usually getting all the rows at once will take about as much time as getting one row. So let's say that time is 1 second (very high, but good for illustration): getting all the rows will take 1 second, while getting each row individually will take 200 seconds (1 second for each row), a very dramatic difference. And this isn't counting how you get the list of 200 to begin with.
All that said, you've only got 200 rows, so in practice it won't matter much.
But still, get them all at once.
Exactly as the others have said. Your RDBMS will not break a sweat throwing 200+ rows at you all at once. Getting all the rows in one associative array will also not make much difference to your script, since you no doubt already have a loop for grabbing each individual row.
All you need to do is modify this loop to iterate through the array you are given [very minor tweak!], as the sketch below shows.
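In other words (a minimal sketch, assuming the legacy mysql_* API and a placeholder query), the only change is where the loop gets its rows from:

// Fetch everything once, then loop over the array exactly as before.
$rows = array();
$result = mysql_query("SELECT * FROM some_table"); // placeholder query
while ($row = mysql_fetch_assoc($result)) {
    $rows[] = $row;
}

foreach ($rows as $row) {
    // same per-row handling the existing loop already did
}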
The only time I have found it better to get fewer results from multiple queries instead of one big set is if there is lots of processing to be done on the results. I was able to cut out about 40,000 records from the result set (plus associated processing) by breaking the result set up. Anything you can build into the query that will allow the DB to do the processing and reduce result set size is a benefit, but if you truly need all the rows, just go get them.
