I'm creating an application that will build a very, very large array and then search it.
I just want to know: is there a good PHP array search algorithm for that task?
Example: I have an array that contains over 2M keys and values; what is the best way to search it?
EDIT
I've created a flat-file DBMS that is based on arrays, so I want to find the best way to search it.
A couple of things:
* Try it: benchmark several approaches and see which one is fastest.
* Consider using objects.
* Do think about DBs, at least... it could be a NoSQL key->value store like Redis.io (which is dead-fast).
* Search algorithms - sure, there are plenty of them around.
But storing an assoc array of 2M keys in memory will mean you'll have tons of hash collisions, which will slow you down anyway. Sort the array, chunk it, and apply a decent search algorithm and you might get it to work reasonably fast, but to be brutally honest, I would say you're about to make a bad decision.
Also consider this: PHP is stateless by design, each time your script runs, the data has to be loaded into memory again (for each request if it's a web application you're writing). It's not unlikely that that will be a bigger bottleneck than a brute-force search on a HashTable will ever be.
The quickest way to find this out is to run a test: once with APC (or an alternative) turned off, and then again, but cache the array you want to search first. Measure the difference between the two and you'll get an idea of how much the actual construction of the array is costing you.
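A rough way to run that comparison (a sketch: it assumes the APCu extension - with the classic APC extension the calls are apc_store()/apc_fetch() - and a hypothetical build_big_array() that constructs the 2M-entry array):

    $start = microtime(true);

    $data = apcu_fetch('big_array', $hit);
    if (!$hit) {
        $data = build_big_array();           // hypothetical: loads/builds the 2M-entry array
        apcu_store('big_array', $data, 300); // cache it for 5 minutes
    }

    echo 'load: ', microtime(true) - $start, "s\n";

    $start = microtime(true);
    $found = isset($data['some_key']);       // the actual lookup
    echo 'search: ', microtime(true) - $start, "s\n";

Run it once with a cold cache and once warm; the difference is what building the array costs you per request.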
The best way to go would be to use array_search(). PHP's built-in functions are written in C and heavily optimized.
If this is still too slow, you should switch to another programming language (PHP isn't known for its speed).
There are also algorithms that use your graphics card to search for specific values in parallel.
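If you end up doing many lookups by value, it can also be worth indexing the array once and then using isset(), which is roughly constant time per lookup. A sketch, assuming the values are unique strings or integers:

    // array_search() scans linearly: O(n) per lookup
    $key = array_search($needle, $huge);   // false if not found

    // for repeated lookups, flip once and use isset(): O(1) per lookup
    // (costs extra memory for the flipped copy)
    $index = array_flip($huge);
    if (isset($index[$needle])) {
        $key = $index[$needle];
    }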
Related
Consider that I have newly calculated one million (1,000,000) values.
I want the highest 10 of those one million values.
I'm hesitating over whether to sort them in PHP or with MongoDB (indexed).
I know that relying less on the DB might increase overall performance.
But I don't know which one would be faster in this case; what if MongoDB is so incredibly fast that even using it just for sorting beats sorting in PHP?
If PHP is the faster and better way to do it, which sorting algorithm should be chosen?
Give me some suggestions.
MongoDB has a pretty nice set of index features; on the other hand, in PHP you can use functions such as sort() (which uses an implementation of quicksort, btw), etc.
I wouldn't focus only on speed unless your concurrency is minimal. Consider that if you are sorting the result set in PHP each time you want to display it, and you are handling X concurrent requests, the memory footprint will be about X * array size + extra overhead until each request/run finishes.
MongoDB lets you choose the sort order of an index when you create it, so that can be a good idea, since the data will be added to the B-tree in the right order (while, on the other hand, it will slow down inserts for the same reason).
So, bottom line: maybe if the set were smaller I would opt for PHP sorting, but in this case (and, as usual, this is how this kind of question ends) I would recommend you benchmark and decide with real data.
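For the specific "top 10 out of 1M" case, a full sort in PHP isn't even necessary; here is a sketch, assuming the values sit in a plain array $values:

    // full sort: simple, O(n log n)
    rsort($values);                       // numeric descending sort
    $top10 = array_slice($values, 0, 10);

    // min-heap of size 10: avoids sorting all 1M values
    $heap = new SplMinHeap();
    foreach ($values as $v) {
        if ($heap->count() < 10) {
            $heap->insert($v);
        } elseif ($v > $heap->top()) {    // beats the smallest of the current top 10
            $heap->extract();
            $heap->insert($v);
        }
    }
    $top10 = iterator_to_array($heap);    // ascending order; reverse if needed

Benchmark both against the Mongo index before deciding, as suggested above.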
Basically, for a plugin for a dynamic site (the site can be fairly large), I am caching the results of a kind of search (the results come from an external search); each result can be 400-1500 characters in length.
As the results come in an array, I use json_encode (faster than serialize) to store them in the database, but at ~1.5 KB per entry (and there may be 10,000 of them) that's about 15 MB, which seems a little large to me.
My questions are:
* Is this an acceptable size per entry (in your opinion)?
* Will running GZip (or similar) and storing the result in a binary field in MySQL be more efficient, or will it take too much CPU time in the end? Is anything similar normally used?
I'd prefer not to use memcached or the like, as this needs to be portable (but would that be better as well?). This is mostly a theory question for me; I just want some input before I implement anything solid.
There will always be a CPU cost for any kind of compression; it depends on whether you have the resources to handle it without any noticeable slowdown.
Space is cheap and abundant, so 15 MB is OK.
But if you really want to compress the field, check out MySQL's COMPRESS() and UNCOMPRESS() functions.
These could be dropped into your queries and would work without changing any PHP/logic.
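A minimal sketch of that, assuming a PDO connection and a made-up search_cache table with a BLOB payload column:

    $json = json_encode($results);

    // compress on the way in
    $stmt = $pdo->prepare('INSERT INTO search_cache (cache_key, payload) VALUES (?, COMPRESS(?))');
    $stmt->execute([$key, $json]);

    // decompress on the way out
    $stmt = $pdo->prepare('SELECT UNCOMPRESS(payload) FROM search_cache WHERE cache_key = ?');
    $stmt->execute([$key]);
    $results = json_decode($stmt->fetchColumn(), true);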
After getting some help on how to measure the physical/actual size of memcached objects to prevent them from growing too large, I have thought about the next step - implementing a sharding/splitting function that transparently splits large objects into smaller pieces when storing them and glues them back together as one big object when requesting them. Basically it should automatically do everything behind the scenes that needs to be done to keep memcached happy.
What's an appropriate way to handle splitting arrays, objects, or whatever kind of value?
I am using PHP in my web app, but for this case I would be quite happy with a general approach and some pseudo-code to point me in the right direction.
Thanks a lot!
In the other question, serialize is used to measure the stored length of the object. If you're hitting the default one meg limit on object size, and you need to split things up, you can simply use serialize, split the resulting string into appropriate chunks, and then store the chunks. Later you can join them back together again and unserialize the result.
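A minimal sketch of that approach, assuming the Memcached extension and its default 1 MB item limit ($mc is a connected Memcached instance; $key and $value are whatever you are storing):

    $chunkSize = 900 * 1024;                          // stay safely under 1 MB
    $chunks = str_split(serialize($value), $chunkSize);

    $mc->set($key . ':count', count($chunks));
    foreach ($chunks as $i => $chunk) {
        $mc->set($key . ':' . $i, $chunk);
    }

    // reading it back (if any get() returns false, treat the whole object as evicted)
    $count = $mc->get($key . ':count');
    $data  = '';
    for ($i = 0; $i < $count; $i++) {
        $data .= $mc->get($key . ':' . $i);
    }
    $value = unserialize($data);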
That being said... seriously, if your object, serialized, is a meg in size, you might want to reconsider how you're storing things. PHP's serialize can be a bit slow (compared to, say, json_encode), and throwing a meg or more of data at it is not likely to be the best or fastest way to do whatever it is you're doing.
If you're implementing memcached and sharding as a performance mechanism, I urge you to stop right now unless you've already used a tool like Xdebug to profile your code and have eliminated all other bottlenecks.
A few minutes ago, I asked whether it was better to perform many queries at once at log in and save the data in sessions, or to query as needed. I was surprised by the answer, (to query as needed). Are there other good rules of thumb to follow when building PHP/MySQL multi-user apps that speed up performance?
I'm looking for specific ways to create the most efficient application possible.
hashing
know your hashes (arrays/tables/ordered maps/whatever you call them). a hash lookup is very fast, and sometimes, if you have O(n^2) loops, you can reduce them to O(n) by organizing the data into an array (keyed by primary key) first and then processing it.
an example:
    foreach ($results as $result)
        if (in_array($result->id, $other_results))
            $found++;
is slow - in_array() loops through the whole of $other_results, resulting in O(n^2).
    foreach ($other_results as $other_result)
        $hash[$other_result->id] = true;

    foreach ($results as $result)
        if (isset($hash[$result->id]))
            $found++;
the second one is a lot faster (depending on the result sets - the bigger, the faster), because isset() is (almost) constant time. actually, this is not a very good example - you could do it even faster using built-in php functions (see the snippet below), but you get the idea.
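for completeness, the built-in-functions version hinted at above (just a sketch; it assumes php 7+ so array_column() works on objects, and that the ids in $results are unique):

    $hash  = array_flip(array_column($other_results, 'id'));
    $found = count(array_intersect_key(
        array_flip(array_column($results, 'id')),
        $hash
    ));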
optimizing (My)SQL
mysql.conf: i don't have any idea how much performance you can gain by optimizing your mysql configuration instead of leaving the default, but i've read you can ignore every postgresql benchmark that used the default configuration. afaik configuration matters less with mysql, but why ignore it? rule of thumb: try to fit the whole database into memory :)
explain [query]: an obvious one that a lot of people get wrong. learn about indices. there are rules you can follow, you can benchmark it, and you can make a huge difference. if you really want it all, learn about the different types of indices (btrees, hashes, ...) and when to use them.
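a quick way to check from php whether a query actually hits an index (just a sketch; the table and column names are invented and a PDO connection $pdo is assumed):

    $plan = $pdo->query('EXPLAIN SELECT * FROM users WHERE email = "a@example.com"')
                ->fetch(PDO::FETCH_ASSOC);

    if ($plan['key'] === null) {
        // "key" is NULL and "type" is ALL -> full table scan; an index would help:
        // ALTER TABLE users ADD INDEX idx_email (email)
    }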
caching
caching is hard, but if done right it makes the difference (not just a difference). in my opinion: if you can live without caching, don't do it. it often adds a lot of complexity and points of failure. google did a bit of proxy caching once (to make the intertubes faster), and some people saw private information of others.
in php, there are 4 different kinds of caching people regularly use:
query caching: almost always translates to memcached (sometimes to APC shared memory). store the result set of a certain query in a fast key/value (= hashing) storage engine. queries (now lookups) become very cheap. (see the sketch after this list)
output caching: store your generated html for later use (instead of regenerating it every time). this can result in the biggest speed-ups, but it somewhat works against PHP's dynamic nature.
browser caching: what about etags and http responses? if done right you may avoid most of the work right at the beginning! most php programmers ignore this option because they have no idea what HTTP is.
opcode caching: APC, zend optimizer and so on. makes php code load faster. can help with big applications. it has nothing to do with (slow) external data sources though, and the potential is somewhat limited.
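a minimal sketch of the query-caching item above (the Memcached extension and a PDO connection are assumed, names are made up):

    $cacheKey = 'articles:front_page:' . md5($sql);    // $sql is the query you would otherwise run
    $rows = $mc->get($cacheKey);

    if ($rows === false) {                             // cache miss (check Memcached::getResultCode() if you need to cache false too)
        $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
        $mc->set($cacheKey, $rows, 60);                // keep the result set for 60 seconds
    }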
sometimes it's not possible to live without caches, e.g. when it comes to thumbnails. image resizing is very expensive, but fortunately easy to control (most of the time).
profiler
xdebug shows you the bottlenecks of your application. if your app is too slow, it's helpful to know why.
queries in loops
there are (php-)experts who do not know what a join is (and for every one you educate, two new ones without that knowledge will surface - and they will write frameworks, see schnalle's law). sometimes those queries-in-loops are not that obvious, e.g. if they come with libraries. count the queries - if they grow with the number of results shown, something is wrong (a before/after sketch follows below).
inexperienced developers do have a primal, insatiable urge to write frameworks and content management systems
schnalle's law
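the before/after sketch mentioned above (a PDO connection is assumed; table and column names are invented):

    // the classic n+1 pattern: one extra query per row shown
    foreach ($orders as $order) {
        $stmt = $pdo->prepare('SELECT name FROM customers WHERE id = ?');
        $stmt->execute([$order['customer_id']]);
        $names[$order['id']] = $stmt->fetchColumn();
    }

    // the same data in a single query with a join
    $rows = $pdo->query(
        'SELECT o.*, c.name AS customer_name
         FROM orders o
         JOIN customers c ON c.id = o.customer_id'
    )->fetchAll(PDO::FETCH_ASSOC);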
Optimize your MySQL queries first, then the PHP that handles them, and lastly cache the results of large queries and searches. MySQL is, by far, the most frequent bottleneck in an application. A poorly designed query can take two to three times longer than a well designed query that selects only the information it needs.
Therefore, if your queries are optimized before you cache them, you have saved a good deal of processing time.
However, on some shared hosts caching is file-system-only thanks to the lack of Memcached. In that case it may be better to run smaller queries than to cache them, as the hard drive's seek time (and waiting for access because of other sites) can easily take longer than the query itself when your site is under load.
Cache.
Cache.
Speedy indexed queries.
I was curious to know which is faster: if I have an array of 25,000 key-value pairs and a MySQL database with identical information, which would be faster to search through?
thanks a lot everyone!
The best way to answer this question is to perform a benchmark.
Although you should just try it out yourself, I'm going to assume there's a proper index and conclude that the DB can do it faster than PHP, since that is exactly what it's built for.
However, it might come down to network latency, the cost of parsing SQL vs PHP, or DB load and memory usage.
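A crude way to benchmark it (a sketch: a PDO connection is assumed and the table/column names are made up; repeat each lookup in a loop for stable numbers):

    $needle = 'some value';

    $start = microtime(true);
    $key = array_search($needle, $array);                 // linear scan of the PHP array
    $phpTime = microtime(true) - $start;

    $start = microtime(true);
    $stmt = $pdo->prepare('SELECT id FROM items WHERE value = ?');
    $stmt->execute([$needle]);                            // indexed MySQL lookup
    $id = $stmt->fetchColumn();
    $sqlTime = microtime(true) - $start;

    printf("php: %.6fs  mysql: %.6fs\n", $phpTime, $sqlTime);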
My first thought would be that searching the array is faster. But on the other hand, it really depends on several factors:
* How your database table is designed (does it use indexes properly, etc.)
* How your query is built (databases are generally pretty well optimized for such searches)
* What type of search you are doing on the array: there are several kinds of searches; the slowest is a linear search, where you go through each row and check for a value, while a binary search on a sorted array is much faster.
I presumed that you are comparing a select statement executed directly against the database with an array search in PHP.
One thing to keep in mind: if your search is CPU-intensive on the database, it might be worth doing it in PHP even if it's not as fast. It's usually easier to add web servers than database servers when scaling.
Test it - profiling something as simple as this should be trivial.
Also, remember that databases are designed to handle exactly this sort of task, so they'll naturally be good at it. Even a naive binary search needs only about 15 comparisons for 25k elements (ceil(log2(25,000)) = 15), so that isn't a lot of data. The real problem is sorting, but that has been solved to death over the past 60+ years.
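For reference, a binary search over a sorted, numerically indexed PHP array is only a few lines (a sketch; it assumes the array is already sorted and the values are comparable scalars):

    function binary_search(array $sorted, $needle) {
        $lo = 0;
        $hi = count($sorted) - 1;
        while ($lo <= $hi) {
            $mid = ($lo + $hi) >> 1;
            if ($sorted[$mid] == $needle) {
                return $mid;             // found: return the index
            }
            if ($sorted[$mid] < $needle) {
                $lo = $mid + 1;
            } else {
                $hi = $mid - 1;
            }
        }
        return false;                    // not found
    }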
It depends. In MySQL you can use indexing, which will increase speed, but with a PHP array you don't need to send data over the network (if the MySQL database is on another server).
MySQL is built to sort and search through large amounts of data like this efficiently. That is especially true when you are searching by key, since MySQL indexes the primary key of a table.