processing in php vs c++ - php

I need to design a function that uses a hashtable: it inserts data into the table and then searches for items. Typically the function will take 15 seconds to 10 minutes to execute. Should I implement it in C++ and invoke it from PHP via a system call, or should I implement it in PHP using associative arrays? Which would be more efficient, and what are the advantages and disadvantages of each?
The key will be a string. The value will be a structure that contains two other structures: the first holds an array of integers, and the second holds an array of integer pairs.

Apparently, PHP arrays are implemented as a linked hash table. See How is the PHP array implemented on the C level?.
In any case, for 300 items there would probably be little speed difference in the type of container you used. I would stay in PHP if possible for simplicity.

PHP is well known for its fast associative array implementation, but in my experience C++ is still faster. A few months ago I needed to implement fast prefix matching: there were thousands of prefixes in a hash table and millions of strings to be matched. I made both PHP and C++ implementations, and as I remember, the C++ one was more than 10 times faster and consumed much less memory. But of course, it also depends heavily on your algorithm, not only on the hash table implementation.

Related

Looping through a large array

I'm creating an application that will build a very large array and then search it.
I just want to know whether there is a good PHP array search algorithm for that task.
Example: I have an array that contains over 2M keys and values; what is the best way to search it?
EDIT
I've created a flat-file DBMS based on arrays, so I want to find the best way to search it.
A couple of things:
Try it: benchmark several approaches and see which one is the fastest.
Consider using objects.
Do at least think about databases... it could be a NoSQL key->value store like Redis.io (which is dead fast).
Search algorithms: sure, there are plenty of them around.
But storing an assoc array of 2M keys in memory means you'll have tons of hash collisions, which will slow you down anyway. Sort the array, chunk it, and apply a decent search algorithm and you might get it to work reasonably fast, but to be brutally honest, I would say you're about to make a bad decision.
Also consider this: PHP is stateless by design, so each time your script runs, the data has to be loaded into memory again (on every request, if it's a web application you're writing). It's not unlikely that this will be a bigger bottleneck than a brute-force search on a hashtable will ever be.
The quickest way to find out is to run a test: once with APC (or an alternative) turned off, and then again with the array you want to search cached. Measure the difference between the two runs and you'll get an idea of how much the actual construction of the array is costing you.
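A minimal sketch of that measurement (sizes and key names are illustrative, not from the original post): time the construction of the array separately from the lookups against it.

```php
<?php
// Build a large associative array: this is the per-request construction
// cost the answer warns about.
$t0 = microtime(true);
$data = [];
for ($i = 0; $i < 200000; $i++) {
    $data['key' . $i] = $i;
}
$build = microtime(true) - $t0;

// Now time 10k hash lookups against the finished array.
$t0 = microtime(true);
$found = false;
for ($i = 0; $i < 10000; $i++) {
    $found = isset($data['key' . (($i * 17) % 200000)]);
}
$search = microtime(true) - $t0;

// On typical hardware, construction dwarfs the lookups.
printf("build: %.4fs  search: %.4fs\n", $build, $search);
```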
The best way to go would be to use array_search(). PHP's built-in functions are written in C and heavily optimized.
If this is still too slow, you should switch to another programming language (PHP isn't known for its speed).
There are algorithms available that use your graphics card to search specific values in parallel.

Are there alternative data structures to arrays in PHP where I can benefit from different indexing techniques?

Lately I had an issue with an array that contained a few hundred thousand values, and the only thing I wanted to do was check whether a value was already present.
In my case these were IPs from a web server log.
So basically something like:
in_array(ip2long($ip), $myarray) did the job.
However, the lookup time increased dramatically, and 10k lookups took around 17 seconds.
So in this case I didn't care whether I had duplicates or not; I just needed to check for existence, so I could store the IPs in the index like this:
isset($myarray[ip2long($ip)])
And boom, lookup times went down from 17 seconds (and more) to a constant 0.8 seconds for 10k lookups. As the value for each array entry I just used the int 1.
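The difference described above can be reproduced with a small sketch (sizes are illustrative): in_array() must scan the values linearly, while isset() hashes straight to the key.

```php
<?php
$ips  = []; // IPs stored as values
$seen = []; // IPs stored as keys

for ($i = 0; $i < 100000; $i++) {
    $ip = ip2long(sprintf('10.%d.%d.%d',
        ($i >> 16) & 255, ($i >> 8) & 255, $i & 255));
    $ips[]      = $ip; // value: in_array() does a linear scan
    $seen[$ip]  = 1;   // key: isset() is a single hash lookup
}

$probe = ip2long('10.0.0.42');
$slow  = in_array($probe, $ips, true); // O(n) per lookup
$fast  = isset($seen[$probe]);         // O(1) per lookup

var_dump($slow, $fast); // both true; only the cost differs
```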
I think searching the array values is probably based on some b-tree, which should have log(n) lookup time, while the key index is a hashmap.
In my case using the index worked fine, but are there any data structures where I can use hashmaps as a value index, where multiple values may also occur? (I realize this only makes sense if I don't have too many duplicates, and that I can't use range/search requests efficiently, which is the primary benefit of tree structures.)
There is a whole range of alternative data structures beyond simple arrays in the SPL library bundled with PHP, including linked lists, stacks, heaps, queues, etc.
However, I suspect you could make your logic a whole lot more efficient if you flipped your array, allowing you to do a lookup on the key (using the array_key_exists() function) rather than search for the value. The array index is a hash, rather than a btree, making for very fast direct access via the key.
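If the value array already exists, flipping it once turns every subsequent search into a key lookup, as suggested above (a sketch; the values are illustrative):

```php
<?php
$values = ['alpha', 'beta', 'gamma']; // original list of values
$index  = array_flip($values);        // value => original position

// array_key_exists() / isset() now replace a linear in_array() scan.
var_dump(array_key_exists('beta', $index)); // true
var_dump($index['beta']);                   // 1 (original index)
var_dump(isset($index['delta']));           // false
```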
However, if you're working with 10k entries in an array, you'd probably be better taking advantage of a database, where you can define your own indexes.
You also have the chdb (constant hash database) extension - which is perfect for this.
Arrays have a sequential order, and it's quick to access certain elements because you don't need to traverse a tree or work through a sequential list structure.
A set is of course faster here, because you only check unique elements and not all elements (in the array).
Trees are fine for sorted structures, for example. You could build a tree with IPs sorted by their ranges; then you could decide faster whether an IP exists or not.
I'm not sure whether PHP provides such customized tree structures. You'll probably need to implement this yourself, but it should only take about half an hour, and you'll find sample code for such tree structures on the web.
As already answered, you can use the brand-new classes provided by SPL: http://www.php.net/spl
BUT apparently they are not as fast as people think; they are probably not implemented the way we expect. It is my opinion that SplFixedArray, for example, is not a real array but a hashtable, like classic PHP arrays.
BUT you also have some alternative solutions.
First, you can store your results in a database. Queries are fast because DB indexes may be better optimized than a PHP data structure.
You can use http://www.php.net/sqlite3 and store the results in a temporary database (a file, or in memory).
I suggest a temporary file, because then you don't have to load everything into memory, and in addition you can add each row individually (using http://www.php.net/fgets, for example).
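A sketch of that suggestion using PHP's SQLite3 extension with an in-memory database (a temporary file path would work the same way; table and column names are made up for the example):

```php
<?php
// ':memory:' keeps the database in RAM; a file path would persist it.
$db = new SQLite3(':memory:');
$db->exec('CREATE TABLE items (k TEXT PRIMARY KEY, v TEXT)');

// Insert rows one at a time (e.g. as lines are read with fgets()).
$stmt = $db->prepare('INSERT INTO items (k, v) VALUES (:k, :v)');
foreach (['abc' => 'first', 'xyz' => 'second'] as $k => $v) {
    $stmt->bindValue(':k', $k, SQLITE3_TEXT);
    $stmt->bindValue(':v', $v, SQLITE3_TEXT);
    $stmt->execute();
    $stmt->reset();
}

// The PRIMARY KEY index makes this lookup fast even for millions of rows.
$q = $db->prepare('SELECT v FROM items WHERE k = :k');
$q->bindValue(':k', 'xyz', SQLITE3_TEXT);
$row = $q->execute()->fetchArray(SQLITE3_ASSOC);
echo $row['v'], "\n"; // second
```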
HTH!

string vs array processing speed in javascript and php, and can an array be passed to php without manipulation?

A few questions on strings vs arrays for text processing.
Is it possible to pass a JS array to PHP as an array, without converting it to a string or using JSON? Just pass it as an ordinary array, without manipulation.
In PHP and JS (or any other language), which format is generally faster for processing or searching text (for very large arrays or text strings)? Example:
string = "abc,def,dfa,afds,xyz,afds,xxx"
array = ["abc","xyz","xxx"]
Which is faster to use to search to see if xyz is present/match?
Which is faster to use to determine the index position of xyz?
Which has a smaller memory size/usage?
TIA.
Edit: the answer to point 1 is no, I guess, but for point 2, please understand that I am asking because I am dealing with a program that makes concurrent AJAX calls requiring the processing of very large arrays or text strings. The usability of the interface depends on the speed of the returned AJAX calls. I had to scrap the original code because of this problem.
As for question 1, can you pass a native Javascript array to PHP, the answer is no.
Not only are Javascript arrays and PHP arrays incompatible (they're data structures of two different languages), the only communication between (client-side) Javascript and PHP is through HTTP, which only knows strings. Not numbers, not booleans, not objects, not arrays, only strings.
As for question 2, speed, it depends on many things, including your search algorithm and the string/array length. If your data structure is an array, use it as an array, not as a string. Readability and maintainability first, speed optimizations only when necessary. And you have to push the limits quite a bit more before you'll get into performance problems, your short examples are plenty fast enough either way.
Here's a test case I created that may answer your question: http://jsperf.com/string-search-speed
It really depends on your goal though. Searching within a string means you need to rule out the possibility of just matching a substring, for which you pretty much need a RegEx. Unless that's of no concern. On the other hand, stuffing everything into an object is orders of magnitude faster, but won't allow you to store the same string twice. Whether that's of concern or not I don't know. Be sure to run these tests on a wide variety of browsers, as the speed of individual tests varies greatly among Javascript engines.
And the more I play around with this, the clearer it is that there's no single answer. All tests score almost equally well in Chrome (save for object lookup, which plays in a different league). Opera seems to have an enormously optimized str.search implementation which is on par with object lookups. In Safari all RegEx tests are terribly slow, but object lookups are the fastest of any browser. Firefox's str.indexOf is awesome, manual array looping not so much.
So again, there is no absolute answer (unless you use objects, which are always faster). Do what makes the most sense!
Why do you say "without using JSON"? JSON is exactly what you are looking for: you turn the array into a JSON string, pass it to PHP, then have PHP parse the JSON back into an array.
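On the PHP side this round trip is short. A sketch (the request payload is hard-coded here; in a real app it would arrive in something like $_POST, whose key name is hypothetical):

```php
<?php
// What the browser would send after JSON.stringify(["abc","xyz","xxx"]).
$raw = '["abc","xyz","xxx"]'; // e.g. $_POST['data'] in a real request

$items = json_decode($raw, true); // true => plain PHP array, not stdClass

var_dump(is_array($items));              // true
var_dump(in_array('xyz', $items, true)); // true
var_dump(array_search('xyz', $items));   // 1 (index position)

// Going back the other way for the response:
echo json_encode($items), "\n"; // ["abc","xyz","xxx"]
```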
Also, it sounds like you are doing premature optimization. Most of the time, making your code easy to maintain is more important than shaving milliseconds off your execution time. If you really want to know the speed of things, run some benchmarks.

Appropriate data structure for faster retrieval (data size: around 200,000 values, all strings)

I have a large data set of around 200,000 values, all of them strings. Which data structure should I use so that searching and retrieval are fast? Insertion is one-time, so even if insertion is slow it wouldn't matter much.
A hash map could be one solution, but what are the other choices?
Thanks
Edit:
some pointers
1. I am looking for exact matches, not partial ones.
2. I have to accomplish this in PHP.
3. Is there any way I can keep this amount of data cached as a tree, or in some other format?
You really should consider not using maps or hash dictionaries if all you need is a string lookup. When using those, your complexity guarantees for N items and a lookup string of size M are O(M x log(N)), or, best amortised for the hash, O(M) with a large constant multiplier. It is much more efficient to use an acyclic deterministic finite automaton (ADFA) for basic lookups, or a trie if there is a need to associate data. These walk the data structure one character at a time, giving O(M) with a very small multiplier.
Basically, you want a data structure that parses your string as it is consumed by the data structure, not one that must do full string compares at each node of the lookup. The common orders of complexity you see thrown around for red-black trees and such assume O(1) compares, which is not true for strings. Strings are O(M), and that propagates to all compares used.
Maybe a trie data structure.
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings
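A minimal trie sketch using nested PHP arrays (illustrative only; in userland PHP the per-character interpreter overhead can outweigh the algorithmic win the answers above describe):

```php
<?php
// Insert a word into the trie, one character per level.
function trie_insert(array &$trie, string $word): void {
    $node = &$trie;
    foreach (str_split($word) as $ch) {
        if (!isset($node[$ch])) {
            $node[$ch] = [];
        }
        $node = &$node[$ch];
    }
    $node['$'] = true; // end-of-word marker
}

// Exact-match lookup: O(M) in the length of the word.
function trie_contains(array $trie, string $word): bool {
    $node = $trie;
    foreach (str_split($word) as $ch) {
        if (!isset($node[$ch])) {
            return false;
        }
        $node = $node[$ch];
    }
    return isset($node['$']);
}

$trie = [];
trie_insert($trie, 'apple');
trie_insert($trie, 'app');
var_dump(trie_contains($trie, 'app'));    // true
var_dump(trie_contains($trie, 'apples')); // false
```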
Use a TreeMap in that case. Search and retrieval will be O(log n). With a HashMap, search can be O(n) in the worst case, but retrieval is O(1).
For 200000 values, it probably won't matter much though unless you are working with hardware constraints. I have used HashMaps with 2 million Strings and they were still fast enough. YMMV.
You can use B+ trees if you want to ensure your search is minimal, at the cost of insertion time.
You can also try bucket push and search.
Use a hashmap. Assuming an implementation similar to Java's and a normal collision rate, retrieval is O(m): the main cost is computing the hashcode and then one string compare. That's hard to beat.
For any tree/trie implementation, factor in the hard-to-quantify costs of the additional pipeline stalls caused by additional non-localized data fetches. The only reason to use one (a trie, in particular) would be to possibly save memory. Memory will be saved only with long strings. With short strings, the memory savings from reduced character storage are more than offset by all the additional pointers/indices.
Fine print: worse behavior can occur when there are lots of hashcode collisions due to an ill-chosen hashing function. Your mileage may vary. But it probably won't.
I don't do PHP - there may be language characteristics that skew the answer here.

Associative Array : PHP/C vs Flex/Flash

In PHP an Associative Array keeps its order.
// this will keep its order in PHP
$a['kiwis'] = 1;
$a['bananas'] = 2;
$a['potatoes'] = 3;
$a['peaches'] = 4;
However, in Flex it doesn't, with a perfectly valid explanation. I really can't remember how C treats this problem, but I'm inclined to believe it works like PHP, as the array has its space pre-reserved in memory and we can just walk the memory. Am I right?
The real question here is why. Why does the C/PHP interpretation of this vary from Flash/Flex, and what is the main reason Adobe made Flash work this way?
Thank you.
There isn't a single C implementation; you roll your own as needed, or choose from a pre-existing one. As such, a given C implementation may be ordered or unordered.
As to why: the reason is that the advantages are different. Ordering allows you (obviously enough) to depend on that ordering. However, it's wasteful when you don't need it.
Different people will consider the advantage of ordering more or less important than the advantage of not ordering.
The greatest flexibility comes from not ordering: if you also have some sort of ordered structure (a list, linked list, or vector would all do), you can easily create an ordered hashmap out of it (not the optimal solution, but it's easy, so you can't complain you weren't given one). This makes it the obvious choice for something intended, from early in its design, to be general-purpose.
On the other hand, the disadvantage of ordering is generally only in terms of performance, so it's the obvious choice for something intended to give relatively wide-ranging support with a small number of types for a new developer to learn.
The march of history sometimes makes these decisions optimal and sometimes sub-optimal, in ways that no developer can really plan for.
For PHP arrays: these beasts are unique constructs and somewhat complicated; an overview is given in a Slashdot response from Kendall Hopkins (scroll down to his answer):
Ken: The PHP array is a chained hash table (lookup of O(c), and O(n) on key collisions) that allows both int and string keys. It uses two different hashing algorithms to fit the two types into the same hash key space. Also, each value stored in the hash is linked to the value stored before it and the value stored after it (a linked list). It also has a temporary pointer which is used to hold the current item so the hash can be iterated.
In C/C++ there is, as has been said, no "associative array" in the core language. There is a map (ordered) in the STL, as well as in the new standard library (hash_map, unordered_map), and there was a gnu hash_map (unordered) on some implementations (which was very good, IMHO).
Furthermore, the "order" of elements in an "ordered" C/C++ map is usually not the insertion order (as in PHP); it's the key sort order, or string-hash-value sort order.
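The PHP half of that difference is easy to demonstrate: iteration follows insertion order until you explicitly re-sort by key, which is what an ordered C++ map gives you automatically (a sketch using the fruit keys from the question):

```php
<?php
$a = [];
$a['kiwis']    = 1;
$a['bananas']  = 2;
$a['potatoes'] = 3;
$a['peaches']  = 4;

// PHP: iteration follows insertion order.
$insertion = array_keys($a);
echo implode(',', $insertion), "\n"; // kiwis,bananas,potatoes,peaches

// Emulate a C++ std::map, which iterates in key sort order.
ksort($a);
echo implode(',', array_keys($a)), "\n"; // bananas,kiwis,peaches,potatoes
```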
To answer your question: your view of the equivalence of PHP and C/C++ associative arrays does not hold. In PHP, they made a design decision to provide maximum comfort under a single interface (and failed or succeeded, whatever). In C/C++, there are many different implementations (with different advantages and tradeoffs) available.
Regards
rbo
