alternative to in_array for unicode arrays - php

The in_array function is very slow for large arrays because it does a linear search.
A faster alternative is to store the values as the array's keys and test for the key.
Thus
if (isset($array[$val]))
is much faster than
if (in_array($val, $array))
for large arrays. However, using unicode strings as array keys will not work.
Is there an alternative way to do this for unicode without resorting to linear searches such as in_array or array_search, or generating hashes like md5?

You can use anything as a key that can be converted into a string.
Compare: Characters allowed in php array keys?
Nevertheless, some people apparently do have problems with special characters in their array keys. I bet this happens when the encoding used at the time you store the key differs from the one used when you search for it. For example, your keys come from a database using UTF-8, but the key you search for is hardcoded in an ISO-encoded PHP script. This is just one example; there are dozens of scenarios like this.
To ensure you always use the same encoding, I would run the keys through rawurlencode.
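A minimal sketch of that idea (the word list and lookup-table names are made up for illustration): every key is passed through rawurlencode before it is stored and before it is looked up, so the byte representation is identical regardless of where the string came from.

$words = ["héllo", "wörld", "日本語"];

// Build the lookup table once, normalising every key with rawurlencode().
$lookup = [];
foreach ($words as $word) {
    $lookup[rawurlencode($word)] = true;
}

// O(1) membership test instead of a linear in_array() scan.
$needle = "wörld";
if (isset($lookup[rawurlencode($needle)])) {
    echo "found\n";
}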

Related

create random short string [a-zA-Z0-9]+

What is the best way to create a string that is short (6 chars), random, and has a low collision probability? I need to create short links like bit.ly.
The problem with md5, sha1, uniqid etc. is that they don't generate uppercase characters, so I'm looking for case-sensitive output to get a wider range of possible values...
I like to use Hashids for this kind of thing:
Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.
It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.
Hashids has been ported to many languages, including PHP.
(Note that, despite the name, Hashids is not a true hashing system since it is designed to be reversible.)
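A minimal sketch using the hashids/hashids Composer package; the salt and minimum length below are illustrative values, and recent versions expose encode()/decode() (very old versions called them encrypt()/decrypt()):

// composer require hashids/hashids
require 'vendor/autoload.php';

use Hashids\Hashids;

$hashids = new Hashids('my project salt', 6); // 6 = minimum output length

$code = $hashids->encode(347);   // short, case-sensitive string such as "NkK9aD" (exact output depends on the salt)
$ids  = $hashids->decode($code); // [347]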

Quick way to find common words between two strings

I have a string which has the length of an average sentence, it can be made up of any random words. I also have a file (around 600kb) which contains some more random words.
I want to find out the common words between these two as efficiently as possible. Right now, I am going over two loops to match each word from the string against each word in the file but that seems a bit inefficient. Is there a better and more efficient way to get the common words?
Load one set into an array as keys (the values can be anything). Then loop over the other set and test whether the array has those keys. This way you don't have two nested loops, but two independent ones (a load loop and a test loop), and a key lookup is cheap and fast compared to a value lookup.
If you are testing multiple sentences against one file, loading the file into the array is clearly better. If your file is larger than your memory (which shouldn't really happen with 600kb), do it the other way around.
Alternatively, you can just make two arrays and use array_intersect or array_intersect_key. If PHP is smart, array_intersect_key will use the above procedure; in any case it should be fast because it is implemented in C. The downside is that you must load everything into memory (again, probably not an issue).
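A minimal sketch of the array_intersect_key variant; $sentence and words.txt are assumed inputs:

// Turn both word lists into key sets; array_flip() makes the words the keys.
$sentenceWords = array_flip(str_word_count($sentence, 1));
$fileWords     = array_flip(str_word_count(file_get_contents('words.txt'), 1));

// Keys present in both sets are the common words.
$common = array_keys(array_intersect_key($sentenceWords, $fileWords));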
Your current algorithm's complexity is O(N*M). To improve it, you can use a hashtable to store the words from the file. In PHP, associative arrays are implemented as hashtables, so your array will look like this:
$array = ['abc' => true, 'dfg' => true]; // and so on
Then use array_key_exists to check whether a word is in the array, which gives you O(1) per lookup. Finally, you iterate over the words in your sentence, which is O(N), where N is the number of words, so the overall complexity is O(N).
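A minimal sketch of that approach; the file name and variable names are illustrative:

// Build the hashtable of file words once (lowercased so the comparison is case-insensitive).
$fileWords = [];
foreach (str_word_count(file_get_contents('words.txt'), 1) as $word) {
    $fileWords[strtolower($word)] = true;
}

// One pass over the sentence; each array_key_exists() test is O(1).
$common = [];
foreach (str_word_count($sentence, 1) as $word) {
    if (array_key_exists(strtolower($word), $fileWords)) {
        $common[] = $word;
    }
}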

Determine data type from file_get_contents()

I'm writing a command line application in PHP that accepts a path to a local input file as an argument. The input file will contain one of the following things:
JSON encoded associative array
A serialized() version of the associative array
A base 64 encoded version of the serialized() associative array
Base 64 encoded JSON encoded associative array
A plain old PHP associative array
Rubbish
In short, there are several dissimilar programs that I have no control over that will be writing to this file, in a uniform way that I can understand, once I actually figure out the format. Once I figure out how to ingest the data, I can just run with it.
What I'm considering is:
If the first byte of the file is { , try json_decode(), see if it fails.
If the first byte of the file is < or $, try include(), see if it fails.
If the first three bytes of the file match a:[0-9], try unserialize(), see if it fails.
If none of the above, try base64_decode(), see if it fails. If it succeeds:
Check the first bytes of the decoded data again.
If all of that fails, it's rubbish.
That just seems quite expensive for quite a simple task. Could I be doing it in a better way? If so, how?
There isn't much to optimize here. The magic-bytes approach is already the way to go. But the actual deserialization functions can of course be avoided. It's feasible to use a verification regex for each format instead (which, despite the common meme, is often faster than having PHP actually unpack a nested array).
base64 is easy enough to probe for.
json can be checked with a regex. The one in "Fastest way to check if a string is JSON in PHP?" is the RFC version for securing it in JS. But it would also be feasible to write a complete recursive (?R) JSON match rule.
serialize is a bit more difficult without a proper unpack function. But with some heuristics you can already assert that it's a serialize() blob.
php array scripts can be probed a bit faster with token_get_all. Or, if the format and data are constrained enough, again with a regex.
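A rough sketch of such probes; the regexes are illustrative heuristics rather than watertight validators, and $path is an assumed input:

$data = file_get_contents($path);

// Each probe only says "looks like", not "definitely is".
$looksJson       = ($data !== '' && ($data[0] === '{' || $data[0] === '['));
$looksSerialized = (bool) preg_match('/^a:\d+:\{/', $data);
$looksPhp        = (strncmp($data, '<?php', 5) === 0 || ($data !== '' && $data[0] === '$'));
$looksBase64     = (bool) preg_match('%^[A-Za-z0-9+/\s]+={0,2}\s*$%', $data);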
The more important question here is, do you need reliability - or simplicity and speed?
For speed, you could use the file(1) utility and add "magic numbers" in /usr/share/file/magic. It should be faster than a pure PHP alternative.
You can try json_decode(), which returns NULL if it fails, and unserialize(), which returns false; if both fail, base64_decode() and run the two checks again. It's not fast, but it's far less error-prone than hand-parsing the formats...
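A minimal cascade along those lines; it assumes the payload is in $data, ignores the plain-PHP-array case, and treats false/NULL as "not this format" (a legitimately serialized false would need extra care):

function detect_payload($data)
{
    // 1. JSON-encoded associative array?
    $json = json_decode($data, true);
    if (is_array($json)) {
        return $json;
    }

    // 2. serialize()d array?
    $unser = @unserialize($data);
    if ($unser !== false) {
        return $unser;
    }

    // 3. Base64-wrapped JSON or serialize()? Decode and try again.
    $decoded = base64_decode($data, true);
    if ($decoded !== false && $decoded !== $data) {
        return detect_payload($decoded);
    }

    return null; // rubbish
}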
The issue here is that if you have no idea which format it can be, you will need to develop a detection algorithm. Conventions should be set with an extension (check the extension; if that fails, tell whoever put the file there to use the correct extension), otherwise you will need to check it yourself. Most algorithms that detect what type a file actually is (exe, jpg etc.) use heuristics to determine its contents, because such files generally have some sort of signature that identifies them. So if you have no idea what the content will be for certain, it's best to look for features that are specific to those contents. This does sometimes mean reading more than a couple of bytes.

Large regex patterns: PCRE won't do it

I have a long list of words that I want to search for in a large string. There are about 500 words and the string is usually around 500K in size.
PCRE throws an error saying preg_match_all: Compilation failed: regular expression is too large at offset 704416
Is there an alternative to this? I know I can recompile PCRE with a higher internal linkage size, but I want to avoid messing around with server packages.
Perhaps you might consider tokenizing your input string instead, and then simply iterating through each token and seeing if it's one of the words you're looking for?
Could you approach the problem from the other direction? (A sketch follows the steps below.)
1. Use a regex to clean up your 500K of HTML and pull out all the words into a big-ass array. Something like \b(\w+)\b.. (sorry, haven't tested that).
2. Build a hash table of the 500 words you want to check. Assuming case doesn't matter, you would lowercase (or uppercase) all the words. The hash table could store integers (or some more complex object) to keep track of matches.
3. Loop through each word from (1), lowercase it, and then match it against your hash table.
4. Increment the item in your hash table when it matches.
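A minimal sketch of steps 2-4; $html and $keywords are assumed inputs:

// Step 1: pull every word out of the input.
preg_match_all('/\b(\w+)\b/u', $html, $m);
$allWords = $m[1];

// Step 2: hash table of the ~500 keywords, lowercased, each with a match counter.
$counts = array_fill_keys(array_map('strtolower', $keywords), 0);

// Steps 3-4: one pass over the extracted words, O(1) lookup each.
foreach ($allWords as $word) {
    $word = strtolower($word);
    if (isset($counts[$word])) {
        $counts[$word]++;
    }
}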
You can try re2.
One of its strengths is that it uses automata theory to guarantee that the regex runs in time linear in the size of its input.
You can use str_word_count or explode the string on whitespace (or whatever delimiter makes sense in the context of your document), then filter the results against your keywords.
$allWordsArray = str_word_count($content, 1);
$matchedWords = array_filter($allWordsArray, function ($word) use ($keywordsArray) {
    return in_array($word, $keywordsArray);
});
This assumes PHP 5.3+ for the closure, but create_function can be substituted in earlier versions of PHP.

Two RC4 implementations generated different encryption results

Why might the same encryption algorithm give different results in AS3 and PHP?
In AS3 I use library from http://labs.boulevart.be/index.php/2007/05/23/encryption-in-as2-and-as3/.
And in PHP I use RC4 Cipher.
Could someone tell me what the problem is? Thanks.
How are you comparing the two results? You could be looking at one result displayed as a hex string and the other as ASCII, for example. Have you also tried comparing the results against online test vectors (such as those on Wikipedia) for some simple strings, to see whether you are getting the expected output?
Assuming the obvious like having the same key and initialisation values, you may want to look at the endianness assumptions of the two implementations.
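One way to run that check from PHP: a minimal, self-contained RC4 (note that plain RC4 is a stream cipher and has no IV) compared against the well-known test vector from the Wikipedia article, key "Key" and plaintext "Plaintext":

function rc4($key, $data)
{
    // Key-scheduling algorithm (KSA).
    $s = range(0, 255);
    $j = 0;
    $keyLen = strlen($key);
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) % 256;
        $tmp = $s[$i]; $s[$i] = $s[$j]; $s[$j] = $tmp;
    }

    // Pseudo-random generation algorithm (PRGA), XORed with the data.
    $i = $j = 0;
    $out = '';
    for ($k = 0, $len = strlen($data); $k < $len; $k++) {
        $i = ($i + 1) % 256;
        $j = ($j + $s[$i]) % 256;
        $tmp = $s[$i]; $s[$i] = $s[$j]; $s[$j] = $tmp;
        $out .= chr(ord($data[$k]) ^ $s[($s[$i] + $s[$j]) % 256]);
    }
    return $out;
}

// Expected ciphertext (hex): bbf316e8d940af0ad3
var_dump(bin2hex(rc4('Key', 'Plaintext')) === 'bbf316e8d940af0ad3');

If either the AS3 or the PHP library disagrees with that vector, that library (or the way it is being called) is the one to look at.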
If the initialisation vector (IV) used by the two encryption libraries is not the same (and it is unlikely to be the same, as it should be random), the encryption will not give you the same result.
If you want to check, encrypt with one and decrypt with the other, and vice versa.
