Sorting with PHP vs MongoDB

Suppose I have just calculated one million (1,000,000) values.
I want the highest 10 of those one million values.
I'm hesitating between sorting them in PHP and letting MongoDB (with an index) do the sorting.
I know that relying less on the DB can improve overall performance.
But I don't know which would be faster in this case; maybe MongoDB is so fast that even using it just for sorting beats sorting in PHP.
If PHP is the faster and better way to do it, which sorting algorithm should I choose?
Give me some suggestions.

MongoDB has a pretty nice set of index features; on the other hand, in PHP you can use functions such as sort() (which is an implementation of quicksort, by the way).
I wouldn't focus only on speed unless your concurrency is minimal. If you sort the result set in PHP every time you want to display it and you are serving X concurrent requests, the memory footprint will be roughly X * array size plus extra overhead until each request finishes.
MongoDB lets you choose the sort order of an index when you create it, which can be a good idea, since the data is added to a B-tree in the right order as it is indexed (on the other hand, this slows down inserts for the same reason).
So, bottom line: if the set were smaller I might opt for PHP sorting, but in this case (and as usual with this kind of question) I would recommend you benchmark and decide with real data.
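For a rough feel for both approaches, here is a minimal sketch; the mongodb/mongodb library, the collection name, and a numeric field called value are all assumptions, so benchmark against your real data:

// Top 10 via MongoDB: the server walks a descending index on "value"
// and only ten documents ever cross the wire.
$client = new MongoDB\Client('mongodb://localhost:27017');
$scores = $client->test->scores;
$scores->createIndex(['value' => -1]);
$top10Mongo = $scores->find([], ['sort' => ['value' => -1], 'limit' => 10])->toArray();

// Top 10 via PHP: sort everything, then slice off the first ten.
// rsort() is quicksort-based; if you stay in PHP, a bounded heap such as
// SplMinHeap would avoid sorting all one million values.
$values = range(1, 1000000);   // stand-in for the freshly calculated values
rsort($values);
$top10Php = array_slice($values, 0, 10);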

Related

Looping through a large array

I'm creating an application that will build a very, very large array and then search it.
I just want to know if there is a good PHP array search algorithm for that task?
Example: I have an array that contains over 2M keys and values; what is the best way to search it?
EDIT
I've created a flat-file DBMS based on arrays, so I want to find the best way to search it.
A couple of things:
Try it: benchmark several approaches and see which one is faster
Consider using objects
Do think about DBs at least... it could be a NoSQL key->value store like Redis.io (which is dead fast)
Search algorithms - sure, there are plenty of them around
But storing an assoc array of 2M keys in memory will mean you'll have plenty of hash collisions, which will slow you down anyway. Sort the array, chunk it, and apply a decent search algorithm (a rough sketch follows below) and you might get it to work reasonably fast, but to be brutally honest, I would say you're about to make a bad decision.
Also consider this: PHP is stateless by design, so each time your script runs the data has to be loaded into memory again (on every request, if it's a web application you're writing). It's not unlikely that this will be a bigger bottleneck than a brute-force search on a hash table will ever be.
The quickest way to find out is to run a test: once with APC (or an alternative) turned off, and then again with the array you want to search cached first. Measure the difference between the two and you'll get an idea of how much the actual construction of the array is costing you.
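A rough sketch of that sort-then-binary-search idea, on a plain sorted array of integers:

// Classic binary search over a sorted, integer-indexed array.
function binarySearch(array $sorted, $needle): int {
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($sorted[$mid] === $needle) {
            return $mid;              // found: return the index
        }
        if ($sorted[$mid] < $needle) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return -1;                        // not found
}

$haystack = range(0, 2000000, 2);     // ~1M even numbers, already sorted
var_dump(binarySearch($haystack, 123456));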
The best way to go would be to use array_search() (example below). PHP's built-in functions are heavily optimized C implementations.
If this is still too slow, you should switch to another 'programming' language (PHP isn't known for its speed).
There are also algorithms that use your graphics card to search for specific values in parallel.
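For reference, this is the array_search() call referred to above; note that it is still a linear scan over the values:

$haystack = ['a' => 10, 'b' => 20, 'c' => 30];
$key = array_search(20, $haystack, true);   // strict comparison, returns 'b'
var_dump($key);                             // string(1) "b"

If what you actually look up is the key rather than the value, isset($haystack['b']) is a hash-table lookup and will beat any scan, C-optimized or not.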

fast large scale key-value store for a php program

I'm working on a full text index system for a project of mine. As one part of the process of indexing pages it splits the data into a very, very large number of very small pieces.
I have gotten the size of the pieces down to a constant 20-30 bytes, and it could be less; it is basically two 8-byte integers and a float that make up the actual data.
Because of the scale I'm aiming for and the number of pieces this creates, I'm looking for an alternative to MySQL, which has shown significant issues at data volumes well below my goal.
My current thinking is that a key-value store would be the best option for this and I have adjusted my code accordingly.
I have tried a number of them, but for some reason they all seem to scale even worse than MySQL.
I'm looking to store on the order of hundreds of millions or billions of key-value pairs, or more, so I need something that doesn't degrade badly as it grows.
I have tried memcachedb, membase, and mongo, and while they were all easy enough to set up, none of them scaled well for me.
Membase had the most issues, due to the number of keys required and the limited memory available. Write speed is very important here, as this is close to an even read/write workload: I write a piece once, then read it back a few times and store it for eventual update.
I don't need much performance on deletes, and I would prefer something that clusters well, since I'm hoping eventually to scale this across machines, but it needs to work on a single machine for now.
I'm also hoping to make this project easy to deploy, so an easy setup would be much better. The project is written in PHP, so it needs to be easily accessible from PHP.
I don't need rows or other higher-level abstractions; they are mostly useless in this case, and in some of my other tests I have already reduced the code to a key-value store. That seems likely to be the fastest, since I only have two things to retrieve from a row keyed off a third, so there is little extra work in using a key-value store. Does anyone know any easy-to-use projects that can scale like this?
I am using this store to hold individual sets of three numbers (the sizes are based on how they were stored in MySQL; that may not hold elsewhere): two 8-byte integers, one for the ID of the document and one for the ID of the word, and a float representing the proportion of the document made up by that word (the number of times the word appears divided by the number of words in the document). The index for this data is the word ID plus the range the document ID falls into; every time I retrieve this data it is all of the results for a given word ID. Currently I turn the word ID, the range, and a counter for that word/range combination each into a binary representation and concatenate them to form the key, along with a two-digit suffix saying which value I am storing for that key: the document ID or the float value.
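A hedged sketch of that key layout using pack() for the fixed-width binary encoding; the function name, the field widths, and the two-digit suffixes are only illustrative:

// Illustrative only: fixed-width binary key "wordId | range | counter | field".
// 'J' packs an unsigned 64-bit integer big-endian (PHP >= 5.6.3), so keys for
// the same word id stay adjacent when the store orders keys lexicographically.
function makeKey(int $wordId, int $range, int $counter, string $field): string
{
    return pack('JJJ', $wordId, $range, $counter) . $field;
}

$docIdKey = makeKey(42, 3, 7, '01');   // value stored under this key: the document id
$floatKey = makeKey(42, 3, 7, '02');   // value stored under this key: the proportion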
Performance measurement was somewhat subjective: I watched the output of the processes putting data into or pulling data out of the store to see how fast they were processing documents, kept refreshing my statistics counters (which track more accurate figures for how fast the system is working), and compared the differences between each storage method.
You would need to provide some more data about what you really want to do...
Depending on how you define "fast" and "large scale", you have several options:
memcache
redis
voldemort
riak
and so on... the list gets pretty long.
Edit 1:
Based on the comments on this post, I would say take a look at Cassandra or Voldemort. Cassandra isn't a simple KV store per se, since you can store much more complex objects than just K -> V.
If you want to try Cassandra from PHP, take a look at phpcassa. But Redis is also a good option if you set up a replica.
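If you give Redis a try from PHP, a minimal sketch with the phpredis extension looks roughly like this (the key strings are made up):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Write once, read back a few times - the core of the workload described above.
$redis->set('word:42:range:3:cnt:7:01', 1234567);
$docId = $redis->get('word:42:range:3:cnt:7:01');

// For bulk loading, pipelining cuts network round trips dramatically.
$pipe = $redis->multi(Redis::PIPELINE);
$pipe->set('word:42:range:3:cnt:8:01', 1234568);
$pipe->set('word:42:range:3:cnt:8:02', 0.0175);
$pipe->exec();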
Here are a few products and ideas that weren't mentioned above:
OrientDB - this is a graph/document database, but you can use it to store very small "documents" - it is extremely fast, highly scalable, and optimized to handle vast amounts of records.
Berkeley DB - Berkeley DB is a key-value store used at the heart of a number of graph and document databases - supposedly has a SQLite-compatible API that works with PHP.
shmop - Shared memory operations might be one possible approach, if you're willing to do some dirty work. If your records are small and have a fixed size, this might work for you - using a fixed record size and padding with zeroes.
handlersocket - this has been in development for a long time, and I don't know how reliable it is. It basically lets you use MySQL at a "lower level", almost like a key/value store. Because you're bypassing the query parser etc., it's much faster than MySQL in general.
If you have a fixed record-size, few writes and lots of reads, you may even consider reading/writing to/from a flat file. Likely nowhere near as fast as reading/writing to shared memory, but it may be worth considering. I suggest you weigh all the pros/cons specifically for your project's requirements, not only for products, but for any approach you can think of. Your requirements aren't exactly "mainstream", and the solution may not be as obvious as picking the right product.
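To make the flat-file idea concrete, a rough sketch with a fixed 24-byte record (two 8-byte integers plus an 8-byte double; the file name and layout are assumptions):

const RECORD_SIZE = 24;

// Write one record (document id, word id, proportion) at the current position.
function writeRecord($handle, int $docId, int $wordId, float $prop): void {
    fwrite($handle, pack('JJ', $docId, $wordId) . pack('d', $prop));
}

// Random access: seek straight to record N, no scanning.
function readRecord($handle, int $index): array {
    fseek($handle, $index * RECORD_SIZE);
    return unpack('JdocId/JwordId/dprop', fread($handle, RECORD_SIZE));
}

$fp = fopen('index.dat', 'c+b');
writeRecord($fp, 1001, 42, 0.0175);
print_r(readRecord($fp, 0));   // docId => 1001, wordId => 42, prop => 0.0175
fclose($fp);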

Sorting small list (10 items) in PHP vs. MySQL and then returning?

I'm curious what the runtime cost of sorting a small set of elements returned by a MySQL query is in PHP, in contrast to sorting them in MySQL.
The reason I can't sort them in MySQL is the way the data has to be grouped when it's returned (it's actually 2 queries whose results are then combined on the PHP side).
Thanks!
There will be no difference for ten elements. But generally you should always sort in MySQL, not PHP. MySQL is optimized to make sorting fast and has the appropriate data structures and information.
PS: Depending on the nature of the data you want to combine: Look at JOINs, UNIONs and subqueries. They'll probably do it.
My opinion:
The overhead is negligible in either case. Optimize your productivity and do whichever is easier for you. The computer is there to help you, not vice versa.
If you really want to optimize the host, have you considered letting the client do it in javascript?
I'd probably do it in SQL too, but mainly to keep the logic in one place (increase cohesion, reduce coupling).
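If the easier route turns out to be PHP, merging the two result sets and sorting ten rows is short work; a small sketch (the "score" column is made up):

$rowsA = [['id' => 1, 'score' => 7], ['id' => 2, 'score' => 3]];
$rowsB = [['id' => 3, 'score' => 9]];

$rows = array_merge($rowsA, $rowsB);
usort($rows, function (array $a, array $b) {
    return $b['score'] <=> $a['score'];   // descending by "score"
});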
You could combine the queries with a UNION and then sort the result in MySQL:
Q1
UNION
Q2
ORDER BY X

Speed of calculations in SQL statement

I've got a database (MySQL) table with three fields: id, score, and percent.
Long story short, I need to do a calculation on each record that looks like this:
(Score * 10) / (1 - percent) = Value
And then I need to use that value both in my code and as the ORDER BY field. Writing the SQL isn't my issue - I'm just worried about the efficiency of this statement. Is doing that calculation in my SQL statement the most efficient use of resources, or would I be better off grabbing the data and then doing math via PHP?
If SQL is the best way to do it, are there any tips I can keep in mind for keeping my SQL pulls as speedy as possible?
Update 1: Just to clear some things up, because it seems like many of the answers are assuming differently: both the score and the percent will be changing constantly. Actually, just about every time a user interacts with the app those fields will change (those fields are actually linked to a user, btw).
As far as the number of records goes, right now it's very small, but I would like to scale to a target of about 2 million records (users). At any given time I will only need about 20 records, but I need them to be the top 20 sorted by this calculated value.
It sounds like this calculated value has inherent meaning in your business domain; if that is the case, I would calculate it once (e.g. at the time the record is created) and use it just like any normal field. This is by far the most efficient way to achieve what you want - the extra calculation on insert or update has minimal performance impact, and from then on you don't have to worry about who does the calculation where.
The drawback is that you have to update your "insert" and "update" logic to perform this calculation. I don't usually like triggers - they can be the source of impenetrable bugs - but this is a case where I'd consider them (http://dev.mysql.com/doc/refman/5.0/en/triggers.html).
If for some reason you can't do that, I'd suggest doing it on the database server. This should be pretty snappy, unless you are dealing with very large numbers of records; in that case the "order by" will be a real performance problem. It will be a far bigger performance problem if you execute the same logic on the PHP side, of course - but your database tends to be the bottleneck from a performance point of view, so the impact is larger.
If you're dealing with large numbers of records, you may just have to bite the bullet and go with my first suggestion.
If it weren't for the need to sort by the calculation, you could also do this on the PHP side; however, sorting an array in PHP is not something I'd want to do for large result sets, and it seems wasteful not to do sorting in the database (which is good at that kinda thing).
So, after all that, my actual advice boils down to:
do the simplest thing that could work
test whether it's fast enough within the constraints of your project
if not, iteratively refactor to a faster solution, re-test
once you reach "good enough", move on.
Based on edit 1:
You've answered your own question, I think - returning (eventually) 2 million rows to PHP, only to find the top 20 records (after calculating their "value" one by one) will be incredibly slow. So calculating in PHP is really not an option.
So, you're going to be calculating it on the server. My recommendation would be to create a view (http://dev.mysql.com/doc/refman/5.0/en/create-view.html) which has the SQL to perform the calculation; benchmark the performance of the view with 200, 200K and 2M records, and see if it's quick enough.
If it isn't quick enough at 2M users/records, you can always create a regular table, with an index on your "value" column, and relatively little needs to change in your client code; you could populate the new table through triggers, and the client code might never know what happened.
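Whichever variant you benchmark, the query itself stays small; a hedged PDO sketch (the DSN, credentials, and the table name "scores" are assumptions):

// The calculation and the sort happen server-side; only 20 rows come back.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
$stmt = $pdo->query(
    'SELECT id, (score * 10) / (1 - percent) AS value
       FROM scores
      ORDER BY value DESC
      LIMIT 20'
);
$top20 = $stmt->fetchAll(PDO::FETCH_ASSOC);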
Doing the math in the database will be more efficient, because shipping the data back and forth between the database and the client will cost more than that simple expression, no matter how fast the client is or how slow the database is.
Test it out and let us know the performance results. I think it is going to depend on the volume of data in your result set. For the SQL side, just make sure your WHERE clause is backed by a covering index.
Where you do the math shouldn't be too important. It's the same fundamental operation either way. Now, if MySQL is running on a different server than your PHP code, then you may care which CPU does the calculation. You may wish that the SQL server does more of the "hard work", or you may wish to leave the SQL server doing "only SQL", and move the math logic to PHP.
Another consideration might be bandwidth usage (if MySQL isn't running on the same machine as PHP)--you may wish to have MySQL return whichever form is shorter, to use less network bandwidth.
If they're both on the same physical hardware, though, it probably makes no noticeable difference, from a sheer CPU usage standpoint.
One tip I would offer is to do the ORDER BY on the raw value (percent) rather than on the calculated value--this way MySQL can use an index on the percent column--it can't use indexes on calculated values.
If you have a growing number of records, your script (and its memory) will reach its limits faster than MySQL would. Are you planning to fetch all records anyway?
MySQL would be quicker in general.
I don't see how you would use a value calculated in PHP in an ORDER BY afterwards. If you are planning to sort in PHP, it becomes even slower, but it all depends on the number of records you're dealing with.

Is searching PHP array faster than search/retrieve from MySQL

I was curious to know which is faster: if I have an array of 25,000 key-value pairs and a MySQL database with identical information, which would be faster to search through?
thanks a lot everyone!
The best way to answer this question is to perform a benchmark.
Although you should just try it out yourself, I'm going to assume that there's a proper index and conclude that the DB can do it faster than PHP, since that is exactly what it is built for.
However, it might come down to network latency, the cost of parsing SQL vs PHP, or DB load and memory usage.
My first thought would be that searching an array is faster. But, on the other hand, it really depends on several factors:
How your database table is designed (does it use indexes properly, etc.)
How your query is built
Databases are generally pretty optimized for such searches.
What type of search are you doing on the array? There are several types of searches you could do. The slowest is a straight search where you go through each row and check for a value; a faster approach is a binary search.
I presumed that you are comparing a SELECT statement executed directly against the database with an array search in PHP.
One thing to keep in mind: If your search is CPU intensive on the database it might be worth doing it in PHP even if it's not as fast. It's usually easier to add web servers than database servers when scaling.
Test it - profiling something as simple as this should be trivial.
Also, remember that databases are designed to handle exactly this sort of task, so they'll naturally be good at it. Even a naive binary search would only need about 15 comparisons for this, so 25k elements isn't a lot. The real problem is sorting, but that has been conquered to death over the past 60+ years.
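In the spirit of the "just benchmark it" answers above, a rough sketch of such a test (the table name, column names, and connection details are assumptions):

$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'secret');
$data = $pdo->query('SELECT id, name FROM items')->fetchAll(PDO::FETCH_KEY_PAIR);

// 25,000 lookups against the in-memory array (one hash-table lookup per call).
$start = microtime(true);
for ($i = 0; $i < 25000; $i++) {
    $hit = isset($data[12345]);
}
printf("array: %.4fs\n", microtime(true) - $start);

// 25,000 lookups against MySQL by primary key (one round trip per call).
$stmt  = $pdo->prepare('SELECT name FROM items WHERE id = ?');
$start = microtime(true);
for ($i = 0; $i < 25000; $i++) {
    $stmt->execute([12345]);
    $hit = $stmt->fetchColumn();
}
printf("mysql: %.4fs\n", microtime(true) - $start);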
It depends. In MySQL you can use indexing, which will increase speed, but with PHP you don't need to send information over the network (if the MySQL database is on another server).
MySQL is built to efficiently sort and search through large amounts of data, such as this. This is especially true when you are searching for a key, since MySQL indexes the primary keys in a table.
