Matching a longest word using Regex only

Matching a longest word using Regex only - php

I need to match longest word of given string using regex:
for example given string
S = "hello night axe axbxbxx prom etc..."
character set 1 = [abcdexy]
character set 2 = [mnrpo]
I need to get only one word that match 2 constriants, all the word should contain characters from one set only and the chosen word should be the longest, I tried to solve this using php regex such as:
preg_match("/\b[abcdexy]+/",$s, $match1);
preg_match("/\b[mnrpo]+/",$s, $match2);
if(strlen($match1[0]) > strlen($match2[0]))
{
//output match1[0];
}
else
{
//output match2[0]
}
The expected output should be axbxbxx since it contain only characters from set 1 and it is the longest between words that belong to one of the two sets.
My question is, can I make this work using only regex without need for strlen() testing?

You can write a single regex expression that uses a pipe to match both character ranges, then sort the matched values by descending length and access the first element's value.
Code: (Demo)
$string='hello proxy night pom-pom-mop axe prom etc decayed';
if (preg_match_all('~\b(?:[a-exy]+|[m-pr]+)\b~', $string, $out)) {
usort($out[0], function($a, $b) {return strlen($b) - strlen($a);}); // or spaceship operator if you like
echo $out[0][0];
} else {
echo "no matches";
}
Output:
decayed
The above method is not "tie-aware" so if you have two values or more values that share the highest length, you will only get one value in the output. I think you need to build in some additional logic to handle these fringe cases like:
Output all highest length values or
Set a secondary criteria to break ties on length
I'll not bother coding up these solution extensions since I prefer not to go down rabbit holes.

Related

Regular expression to match an exact number of occurrence for a certain character

I'm trying to check if a string has a certain number of occurrence of a character.
Example:
$string = '123~456~789~000';
I want to verify if this string has exactly 3 instances of the character ~.
Is that possible using regular expressions?

Yes
/^[^~]*~[^~]*~[^~]*~[^~]*$/
Explanation:
^ ... $ means the whole string in many regex dialects
[^~]* a string of zero or more non-tilde characters
~ a tilde character
The string can have as many non-tilde characters as necessary, appearing anywhere in the string, but must have exactly three tildes, no more and no less.

As single character is technically a substring, and the task is to count the number of its occurences, I suppose the most efficient approach lies in using a special PHP function - substr_count:
$string = '123~456~789~000';
if (substr_count($string, '~') === 3) {
// string is valid
}
Obviously, this approach won't work if you need to count the number of pattern matches (for example, while you can count the number of '0' in your string with substr_count, you better use preg_match_all to count digits).
Yet for this specific question it should be faster overall, as substr_count is optimized for one specific goal - count substrings - when preg_match_all is more on the universal side. )

I believe this should work for a variable number of characters:
^(?:[^~]*~[^~]*){3}$
The advantage here is that you just replace 3 with however many you want to check.
To make it more efficient, it can be written as
^[^~]*(?:~[^~]*){3}$

This is what you are looking for:
EDIT based on comment below:
<?php
$string = '123~456~789~000';
$total = preg_match_all('/~/', $string);
echo $total; // Shows 3

Search a String for Alpha Numeric Characters in a Pattern

I have a string that contains 5 words. In the string one of the words is a Ham Radio Call Sign and can be anyone of the thousands of call signs in the US. In order to extract the Call Sign from the string I need to utilize the below pattern. The Call Sign I need to extract can be in any of the 5 positions in the string. The number is never the first character and the number is never the last character. The string is actually put together from an Array since it is originally read from a text file.
$string = $word[1] $word[2] $word[3] etc....
So the search can be either done on the whole string or each piece of the array.
Patterns:
1 Number and 3 Letters Example: AB4C A4BC
1 Number and 4 Letters Example: A4BCD
1 Number and 5 Letters Example: AB4CDE
I have tried everything I can think of and search till I cant search no more. I am sure I am over thinking this.

A two-step regular expression like this would do it:
$str = "hello A4AB there BC5AD";
$signs = array();
preg_match_all('/[A-Z][A-Z\d]{1,3}[A-Z]/', $str, $possible_signs);
foreach($possible_signs[0] as $possible_sign)
if (preg_match('/^\D+\d\D+$/', $possible_sign))
array_push($signs, $possible_sign);
print_r($signs); //Array ([0] => A4AB [1] => BC5AD)
Explanation
This is a regular expression approach, using two patterns. I don't think it could be done with one and still satisfy the exact requirements of the matching rules.
The first pattern enforces the following requirements:
substring starts and ends with a capital letter
substring contains only other capital letters or numbers between the first and last letter
substring is, overall, not more than 6 characters long
What I can't do in that same pattern, for complex REGEX reasons I won't go into (unless someone knows a way and can correct me), is enforce that only one number is contained.
#jeroen's answer does enforce this in a single pattern, but in turn does not enforce the correct length of the substring. Either way, we need a second pattern.
So after grabbing the initial matches, we loop over the results. We then apply each to a second pattern that enforces simply that there is only one number in the substring.
If so, we green-light the substring and it's added to the $signs array.
Hope this helps.

It depends on what the other words can contain, but you could use a regular expression like:
#\b[a-z]+\d[a-z]+\b#i
^ case insensitive
^^ a word boundary
^^^^^^ One or more letters
^^ One number
You can make it more restrictive by using {1,3} instead of + for the letters so that you have a sequence of 1 to 3 letters.
The complete expression would be something like:
$success = preg_match('#\b[a-z]+\d[a-z]+\b#i', $input_string, $matches);
where $matches[0] will contain the matched value, see the manual.

Read corresponding value in PHP and add to running sum

I would like to have each word in a string cross-referenced in a file.
So, if I was given the string: Jumping jacks wake me up in the morning.
I use some regex to strip out the period. Also, the entire string is made lowercase.
I then go on to have the words separated into an array by using PHP's nifty explode() function.
Now, what I'm left with, is an array with the words used in the string.
From there I need to look up each value in the array and get a value for it and add it to a running sum. for() loop it is. Okay, this is where I get stuck...
The list ($wordlist) is structured like so:
wake#4 waking#3 0.125
morning#2 -0.125
There are \ts in between the word and the number. There can be more than one word per value.
What I need the PHP to do now is look up the number to each word in the array then pull that corresponding number back to add it to a running sum. What's the best way for me to go about this?
The answer should be easy enough, just finding the location of the string in the wordlist and then finding the tab and from there reading the int... I just need some guidance.
Thanks in advance.
EDIT: to clarify -- I don't want the sum of the values of the wordlist, rather, I'd like to look up my individual values as they correspond to the words in the sentence and THEN look them up in the list and add just those values; not all of them.

Edited answer based on your comment and question edit. The running sum is stored in an array called $sum where the key value of the "word" will store the value of its running sum. e.g $sum['wake'] will store the running sum for the word wake and so on.
$sum = array();
foreach($wordlist as $word) //Loop through each word in wordlist
{
// Getting the value for the word by matching pattern.
//The number value for each word is stored in an array $word_values, where the key is the word and value is the value for that word.
// The word is got by matching upto '#'. The first parenthesis matches the word - (\w+)
//The word is followed by #, single digit(\d), multiple spaces(\s+), then the number value(\S+ matches the rest of the non-space characters)
//The second parenthesis matches the number value for the word
preg_match('/(\w+)#\d\s+(\S+)/', $word, $match);
$word_ref = $match[1];
$word_ref_number = $match[2];
$word_values["$word_ref"] = $word_ref_number;
}
//Assuming $sentence_array to store the array of words used in your string example {"Jumping", "jacks", "wake", "me", "up", "in", "the", "morning"}
foreach ($sentence_array as $word)
{
if (!array_key_exists("$word", $sum)) $sum["$word"] = 0;
$sum["$word"] += $word_values["$word"];
}
Am assuming you would take care of case sensitivities, since you mentioned that you make the entire string lowercase, so am not including that here.

$sentence = 'Jumping jacks wake me up in the morning';
$words=array();
foreach( explode(' ',$sentence) as $w ){
if( !array_key_exists($w,$words) ){
$words[$w]++;
} else {
$words[$w]=1;
}
}
explodeby space, check if that word is in the words array as key; if so increment it's count(val); if not, set it's val as 1. Loop this for each of your sentences without redeclaring the $words=array()

What is the best way to unscramble words with PHP?

I have a word list and I want to unscramble words using this word list, in PHP.
It seems to me that PHP doesn't have a built-in function that does this. So could someone please suggest a good algorithm to do this, or at least point me in the right direction?
EDIT: edited to add example
So basically, what I'm talking about is I have a list of words:
apple
banana
orange
Then, I'm given a bunch of jumbled letters.
pplea
nanaba
eroang

Given a dictionary of known words:
foreach ($list as $word)
{
if (count_chars($scrambled_word,1) == count_chars($word,1))
echo "$word\n";
}
Edit: A simple optimization would be to move the count_chars($scrambled_word,1)) outside the loop since it never changes:
$letters = count_chars($scrambled_word,1)
foreach ($list as $word)
{
if ($letters == count_chars($word,1))
echo "$word\n";
}

Warning: I rarely use PHP, so this is dealing only with a general algorithm that should work in almost any language, not anything specific to PHP.
Presumably you have a word in which the letters have been rearranged, and you want to find what word(s) could be made from those letters.
If that's correct, the general idea is fairly simple: take a copy of your word list, and sort the letters in each word into alphabetical order. Put the sorted and unsorted versions of each word side by side, and sort the whole thing by the sorted words (but keeping each unsorted word along with its sorted version). You may want to collapse duplicates together, so that (for example) instead of {abt : bat} and {abt : tab}, you have: {abt: bat, tab}
Then, to match up a scrambled word, sort its letters in alphabetical order. Look for matches in your dictionary (since it's sorted, you can use a binary search). When you find a match, the result is the word (or words) associated with that sorted letter group. Using the example above, if the scrambled word was "tba", you'd sort it to get "abt", then look up "abt" to get "bat" and "tab".
Edit: As #Moron pointed out in the comments, sorting and binary search aren't really crucial points in themselves. The basic points are to turn all equivalent inputs into identical keys, then use some sort of fast lookup by key to find the word(s) for that key.
Sorting the letters in each word is one easy way to turn equivalent inputs into identical keys. Sorting the list and doing a binary search is one easy way to do fast lookups by key.
In both cases, there are quite a few alternatives. I'm not at all sure the alternatives are likely to improve performance a lot, but they certainly could.
Just for example, instead of a pure binary search you could have a second level of index that told you where the keys starting with 'a' were, the keys starting with 'b', and so on. Given that a couple of extremely frequently-used letters are near the beginning of the alphabet ('e' and 'a', for example) you might be better off sorting the words so that relatively uncommon letters ('q', 'z', etc.) are toward the front of the key, and the most commonly used letters are at the end. This would give that first lookup based on the initial character the greatest discrimination.
On the sort/binary search side, there are probably more alternatives, and probably better arguments to be made in favor of using something else. Hash tables typically allow lookups in (nearly) constant time. Tries can reduce storage substantially, especially when many words share a common prefix. The only obvious disadvantage is that the code for either one is probably more work (though PHP's array type is hash-based, so you could probably use it quite nicely).

It is possible to unscramble in O(log p + n) where
p = size of dictionary
n = length of word to be unscrambled
Assume a constant, c, the most occurrences of some letter within any word plus 1.
Assume a constant, k, the number of letters in the alphabet.
Assume a constant, j, the most number of words that can share the same hash or letter-sorted version.
Initialization of O(p) space:
1. Using the dictionary, D, create an associated list of letter sorted words, L, which will be size at most p since each word has a one sorted version.
2. Associate another column to L with a numerical hash of integers which can range [0, c^k-1].
3. For each word in L, generate its hash with the following function:
hash(word) = 0 if word is empty or (c^i + hash(remaining substring of the word))
where i is the zero-based alphabet index of the first letter.
Algorithm:
1. In O(n), determine hash, h, of the letter sorted version of the word in question.
2. In O(log p), search for the hash in L.
3. In O(n), list j associated words of length n.

Try these
http://www.php.net/manual/en/function.similar-text.php
http://www.php.net/manual/en/function.soundex.php
http://www.php.net/manual/en/function.levenshtein.php

The slow option would be to generate all permutations of the letters in a scrambled word, then probe them via pspell_check().
If you however can use a raw dictionary text file, then the best option is to just use a regular expression to scan it:
$dict = file_get_contents("words.txt"); // one word per line
$n = strlen($word);
if (preg_match('/^[$word]{$n}$/im', $dict, $match)) {
print $match[0];
}
I'm quite certain PCRE is significantly faster at searching for permutations than PHP and the guessing approach.

Make use of PHP's array functions since they can solve this for you.
$words = array('hello', 'food', 'stuff', 'happy', 'fast');
$scrambled_word = 'oehll';
foreach ($words as $word)
{
// Same length?
if (strlen($scrambled_word) === strlen($word))
{
// Convert to an array and match
if( ! array_diff(str_split($word), str_split($scrambled_word)))
{
print "Your word is: $word";
}
}
}
Basically, you look for something the same length - then you ask PHP to see if all the letters are the same.

If you have a really large list of words and want this unscramble operation to be fast, I'd try putting the word list into a database. Next add a field to your word list table that is the sum of the ascii values of the word, and then add an index on this ascii sum.
Whenever you want to retrieve a list of possible matches just search the word table for ascii sums that match the sum of the scrambled letters. Keep in mind that you may have a few false matches so you'll have to compare all of the matched words to ensure they contain only the letters of your scrambled word (but the result set should be pretty small).
If you don't want to use a database you could implement the same basic idea using a file, just sort the list by the sum value for faster retrieval of all matches.
Example Data assumes all lowercase (a=97, b=98, c=99, ...)
bat => 311,
cat => 312, ...
Example php function to figure out the sum for a word
function asciiSum($word) {
$characters = str_split(strtolower($word));
$sum = 0;
foreach($characters as $character) {
$sum += ord($character);
}
return $sum;
}
Even faster: add another field to the database that represents the string length, then you can search for words based on an ascii sum and a string length which would further reduce the number of false matches you would need to check for.

Filter array of numeric PIN code strings which may be in the format "######" or "### ###"

I have a PHP array of strings. The strings are supposed to represent PIN codes which are of 6 digits like:
560095
Having a space after the first 3 digits is also considered valid e.g. 560 095.
Not all array elements are valid. I want to filter out all invalid PIN codes.

Yes you can make use of regex for this.
PHP has a function called preg_grep to which you pass your regular expression and it returns a new array with entries from the input array that match the pattern.
$new_array = preg_grep('/^\d{3} ?\d{3}$/',$array);
Explanation of the regex:
^ - Start anchor
\d{3} - 3 digits. Same as [0-9][0-9][0-9]
? - optional space (there is a space before ?)
If you want to allow any number of any whitespace between the groups
you can use \s* instead
\d{3} - 3 digits
$ - End anchor

Yes, you can use a regular expression to make sure there are 6 digits with or without a space.
A neat tool for playing with regular expressions is RegExr... here's what RegEx I came up with:
^[0-9]{3}\s?[0-9]{3}$
It matches the beginning of the string ^, then any three numbers [0-9]{3} followed by an optional space \s? followed by another three numbers [0-9]{3}, followed by the end of the string $.
Passing the array into the PHP function preg_grep along with the Regex will return a new array with only matching indeces.

If you just want to iterate over the valid responses (loop over them), you could always use a RegexIterator:
$regex = '/^\d{3}\s?\d{3}$/';
$it = new RegexIterator(new ArrayIterator($array), $regex);
foreach ($it as $valid) {
//Only matching items will be looped over, non-matching will be skipped
}
It has the benefit of not copying the entire array (it computes the next one when you want it). So it's much more memory efficient than doing something with preg_grep for large arrays. But it also will be slower if you iterate multiple times (but for a single iteration it should be faster due to the memory usage).

If you want to get an array of the valid PIN codes, use codaddict's answer.
You could also, at the same time as filtering only valid PINs, remove the optional space character so that all PINs become 6 digits by using preg_filter:
$new_array = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);

The best answer might depend on your situation, but if you wanted to do a simple and low cost check first...
$item = str_replace( " ", "", $var );
if ( strlen( $item ) !== 6 ){
echo 'fail early';
}
Following that, you could equally go on and do some type checking - as long as valid numbers did not start with a 0 in which case is might be more difficult.
If you don't fail early, then go on with the regex solutions already posted.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.