php regex find substring in substring - php

I am still experimenting with word matching for a project.
Let's assume that I have a given string, say maxmuster. I want to mark the parts of an arbitrary word, such as maxs, that occur in maxmuster in the same order as its letters.
I will give some examples and then explain what I have already tried. Let's keep the string maxmuster. In each word below, the part that also occurs in maxmuster (bold in the original post) is what the regex should match (ideally in PHP, but Python, bash, JavaScript, ... would also be fine):
maxs
Mymaxmuis
Lemu
muster
Of course, single letters such as m, u, ... will then match as well. I know that and will fix it later. I thought the solution should not be so difficult, so I tried to split the word into groups like this:
/(maxmuster)?|(maxmuste)?|(maxmust)?|(maxmus)?|(maxmu)?|(maxm)?|(max)?|(ma)?|(m)?/gui
But then, of course, I forgot the other combinations, like:
(axmuster), (xmus) and so on. Do I really have to list them all, or is there a simple regex trick that solves the problem as explained above?
Thank you very much

Sounds like you need string intersection. If you don't mind non regex idea, have a look in Wikibooks Algorithm Implementation/Strings/Longest common substring PHP section.
foreach(["maxs", "Mymaxmuis", "Lemu", "muster"] AS $str)
echo get_longest_common_subsequence($str, "maxmuster") . "\n";
max
maxmu
mu
muster
See this PHP demo at tio.run (caseless comparison).
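For comparison, the non-regex idea can be sketched with the classic dynamic-programming table for the longest common substring. This is only a minimal sketch, not the Wikibooks code: the function name longest_common_substring is made up for illustration, and the comparison is assumed to be caseless for single-byte strings, as in the demo.

```php
<?php
// Hypothetical helper (not from the answers above): dynamic-programming
// longest common substring, compared case-insensitively (single-byte strings).
function longest_common_substring(string $a, string $b): string
{
    $la = strtolower($a);
    $lb = strtolower($b);
    $n = strlen($la);
    $m = strlen($lb);
    $best = 0; // length of the longest common run found so far
    $end = 0;  // index in $a just past that run
    $prev = array_fill(0, $m + 1, 0);

    for ($i = 1; $i <= $n; $i++) {
        $curr = array_fill(0, $m + 1, 0);
        for ($j = 1; $j <= $m; $j++) {
            if ($la[$i - 1] === $lb[$j - 1]) {
                // Extend the run that ended at the previous pair of characters
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $best) {
                    $best = $curr[$j];
                    $end = $i;
                }
            }
        }
        $prev = $curr;
    }
    // Return the run using the original casing of $a
    return substr($a, $end - $best, $best);
}

foreach (["maxs", "Mymaxmuis", "Lemu", "muster"] as $str) {
    echo longest_common_substring($str, "maxmuster"), "\n";
}
```

Only two table rows are kept at a time, so memory stays proportional to the length of the second string.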
If you need a regex idea, I would join both strings with space and use a pattern like this demo.
(?=(\w+)(?=\w* \w*?\1))\w
At each position before a word character in the first string, it captures inside a lookahead the longest substring that also occurs in the second string. In PHP, the matches of the first group then need to be sorted by length, and the longest one is returned. See the PHP demo at tio.run.
function get_longest_common_subsequence($w1="", $w2="")
{
    // Join both words with a space; preg_quote() is meant for patterns, not for subject strings
    $test_str = $w1." ".$w2;
    if(preg_match_all('/(?=(\w+)(?=\w* \w*?\1))\w/i', $test_str, $out) > 0)
    {
        // Sort the captured candidates from longest to shortest
        usort($out[1], function($a, $b) { return strlen($b) - strlen($a); });
        return $out[1][0];
    }
    // No common substring found
    return "";
}

TL;DR
Using Regular Expressions:
longestSubstring(['Mymaxmuis', 'axmuis', 'muster'], buildRegexFrom('maxmuster'));
Full snippet
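Using the regex below you can match runs of letters in which every adjacent pair of letters also occurs in maxmuster; for the examples here these runs are exactly the substrings of maxmuster: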
(?|((?:
m(?=a)
|(?<=m)a
|a(?=x)
|(?<=a)x
|x(?=m)
|(?<=x)m
|m(?=u)
|(?<=m)u
|u(?=s)
|(?<=u)s
|s(?=t)
|(?<=s)t
|t(?=e)
|(?<=t)e
|e(?=r)
|(?<=e)r
)+)|([maxmuster]))
You have to cook such a regex from a word like maxmuster so you need a function to call it:
function buildRegexFrom(string $word): string {
    // Split word to letters
    $letters = str_split($word);
    $regex = [];
    // Creating all sides of the alternation in our regex
    foreach ($letters as $key => $letter)
        // Skip the last position by index; comparing letters instead would
        // also skip earlier occurrences of the final letter
        if ($key < count($letters) - 1)
            $regex[] = "$letter(?={$letters[$key + 1]})|(?<=$letter){$letters[$key + 1]}";
    // Return whole cooked pattern
    return "~(?|((?>".implode('|', $regex).")+)|([$word]))~i";
}
To return the longest match you need to sort the results by match length, from longest to shortest. That means writing another piece of code:
function longestSubstring(array $array, string $regex): array {
    $substrings = [];
    foreach ($array as $value) {
        preg_match_all($regex, $value, $matches);
        usort($matches[1], function($a, $b) {
            return strlen($b) <=> strlen($a);
        });
        // Store the longest match (first after sorting), or '' if nothing matched
        $substrings[] = $matches[1][0] ?? '';
    }
    return $substrings;
}
Putting all things together:
print_r(longestSubstring(['Mymaxmuis', 'axmuis', 'muster'], buildRegexFrom('maxmuster')));
Outputs:
Array
(
[0] => maxmu
[1] => axmu
[2] => muster
)

Here is my take on this problem using regex.
<?php
$subject="maxmuster";
$str="Lemu";
$comb=str_split($subject); // Split into single characters.
$len=strlen($subject);
// Collect every substring of length 2..$len
for ($i=2; $i<=$len; $i++){
    // Stop early enough that $start + $i never runs past the end of $subject
    for($start=0; $start<=$len-$i; $start++){
        array_push($comb,substr($subject,$start,$i));
    }
}
echo "Matches are:\n";
for($i=0; $i<sizeof($comb); $i++){
    // Escape the candidate so it is matched literally
    $pattern = "/".preg_quote($comb[$i],"/")."/";
    if(preg_match($pattern,$str, $matches)){
        print_r($matches);
    }
}
?>
And here is an Ideone Demo.

Related

Array not being updated outside of foreach loop

I have a piece of code that I'm struggling with. I'm still on my first steps so it's entirely possible that some silly mistake is causing this.
I want to turn each first character of each word into uppercase, but for some reason it is not working and I cannot get it figured out.
$split = explode(" ",$string);
foreach ($split as $word) {
if (ord($word[0]) >= 97 & ord($word[0]) <= 122){
$word[0] = chr(ord($word[0]) - 32);
}}
return $string;
}
You should handle this a little differently.
Let's create our split first:
$words = explode(' ', $words_string);
Now let's loop through these words and remember their index by using the $key param.
foreach($words as $index => $word) { //So we remember the key in the array using $k => $v
$words[$index] = ucfirst($word); //This will uppercase the first letter.
}
The reason why it's not working is explained in the question I've linked.
However, in your case the solution is much simpler. You can just use the ucwords() function, or mb_convert_case() with MB_CASE_TITLE if you work with multibyte strings.
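To make the cause concrete: foreach hands the loop body a copy of each element, so changing $word does not touch $split, and the question's code also returns the untouched $string instead of the rebuilt array. A minimal sketch of the by-reference variant follows; the function name capitalize_words and the sample input are assumptions made for illustration.

```php
<?php
// capitalize_words() is a hypothetical name used only for this sketch.
function capitalize_words(string $string): string
{
    $split = explode(' ', $string);
    foreach ($split as &$word) { // & makes $word an alias of the array element
        if ($word !== '') {
            $word = ucfirst($word);
        }
    }
    unset($word); // break the reference so later code cannot change the last element by accident
    return implode(' ', $split); // rebuild the string from the modified array
}

echo capitalize_words('hello big world'), "\n"; // Hello Big World
```

For plain ASCII input, ucwords($string) remains the shorter choice.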
PHP provides a built-in function that converts the first character of every word in a string to uppercase, without exploding the string and iterating over it.
ucwords( $string );
EDIT: Let us include a sample to help you out what would be the output:
echo ucwords("Hi this is just a simple test of converting each word's first charater to uppercase!");
will return
Hi This Is Just A Simple Test Of Converting Each Word's First Charater To Uppercase!

How to check characters alternatively and replace it with Y if it is X?

I have a string, something like this:
$str ="it is a test string.";
// for more clarification (_ stands for a space)
i  t  _  i  s  _  a  _  t  e  s  t  _  s  t  r  i  n  g  .
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Now I need to check all characters whose positions are multiples of 4 (plus the first character), like these:
1 => i
4 => i
8 => [space]
12 => t
16 => r
20 => .
Now I need to compare them with Y (Y is a variable holding a symbol, for example Y = 'r' here) and replace every Y found at those positions with X (X is a variable holding a symbol too, for example X = 'm' here).
So, I want this output:
it is a test stming.
Here is my solution: I can do that using some PHP function:
strlen($str): to count the number of characters (named $sum)
$sum / 4: To find characters that are multiples of 4
substr($str, 4,1): to select specific character (named $char) {the problem is here}
if ($char == 'r') {}: to compare
str_replace('r','m',$char): to replace
And then combining all $char to each other.
But my solution has two problems:
substr() does not count the [space] character (as I mentioned above)
combining the characters again is a bit complicated (it needs some wasteful processing)
Well, is there a better solution? I would like to do it with a regex. Is that possible?
Could just use a simple regex with callback (add u flag if utf-8, s for . to match newline).
$str = preg_replace_callback(['/^./', '/.{3}\K./'], function ($m) {
return $m[0] == "r" ? "m" : $m[0];
}, $str); echo $str;
See this demo at tio.run > it is a test stming.
1st pattern: ^. any first character
2nd pattern: \K resets after .{3} any three characters, only want to check the fourth .
For use with anonymous function PHP >= 5.3 is required. Here is the workaround (demo).
Update: @Mariano demonstrated in his very nice answer that it is even possible with a single regex replacement. Thank you for the benchmark, which reveals rather bad performance for the preg_replace_callback solution. Here is a more efficient variant without a callback (but still two patterns). The \G anchor keeps every match aligned to the 4-character grid; without it the second pattern could start at any offset.
$str = preg_replace(['/^r/', '/\G(?:...[^r])*...\Kr/'], 'm', $str);
I also included @revo's answer from 2017 in Mariano's benchmark and ran it on tio.run (100k loops). With newer PHP and PCRE2 the numbers seem to have changed slightly; "no regex" leads at tio.run.
In .NET or modern browser JS regex it also could be done like this by a variable length lookbehind.
If all characters in your string are in single byte, you can use something from PHP's official language reference:
$str ="it is a test string.";
$y="r";
$x="m";
$len=strlen($str);
if($str[0]==$y)
{
$str=substr_replace($str,$x,0,1);
}
if($len>=3)
{
for($i=3;$i<$len;$i+=4)
{
if($str[$i]==$y)
{
$str=substr_replace($str,$x,$i,1);
}
}
}
var_dump($str);
Outputs it is a test stming.
Edit:
As @Don'tPanic points out, a string can be modified in place with the [] operator, so instead of using
$str=substr_replace($str,$x,$i,1);
you can just use
$str[$i]=$x;
This is an alternative using preg_replace()
$y = 'r';
$y = preg_quote($y, '/');
$x = 'M'; // replacement text; preg_quote() is only for patterns, so it is not applied here
$subject = 'rrrrrr rrrrr rrrrrr rrrr rrrr.';
$regex = "/\\G(?:^|(?(?<!^.).)..(?:.{4})*?)\\K$y/s";
$result = preg_replace($regex, $x, $subject);
echo $result;
// => MrrMrr MrrrM rrMrrr rrrM rrMr.
Regex:
\G(?:^|(?(?<!^.).)..(?:.{4})*?)\Kr
\G asserts the position at the end of the last match (or the start of the string)
(?:^|(?(?<!^.).)..(?:.{4})*?) matches:
^ start of string, to check at position 1
(?(?<!^.).) is a conditional that consumes one extra character unless the previous match was at position 1, which yields:
..(?:.{4})*? 2 chars + a multiple of 4 if it has just replaced at position 1
...(?:.{4})*? 3 chars + a multiple of 4 for successive matches
\K resets the start of the reported match, so only the final character is replaced and no backreferences are needed
I must say though, regex is an overkill for this task. This code is counterintuitive and a typical regex that proves difficult to understand/debug/maintain.
EDIT. There was a later discussion about performance vs. code readability, so I did a benchmark to compare:
RegEx with a callback (@bobblebubble's answer).
RegEx with 2 replacements in an array (@bobblebubble's suggestion in comment).
No RegEx with substr_replace (@Passerby's answer).
Pure RegEx (this answer).
Result:
Code #1(with_callback): 0.548 secs/50k loops
Code #2(regex_array): 0.158 secs/50k loops
Code #3(no_regex): 0.120 secs/50k loops
Code #4(pure_regex): 0.118 secs/50k loops
Benchmark in ideone.com
Try this
$str ="it is a test string.";
$y="r";
$x="m";
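$split_array = str_split($str);
foreach ($split_array as $key => $val)
{
    // Keys are 0-based: key 0 is the first character, keys 3, 7, 11, ... are positions 4, 8, 12, ...
    if(($key == 0 || ($key + 1) % 4 == 0) && $val == $y)
    {
        $split_array[$key] = $x;
    }
}
$your_new_string = implode($split_array);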
This piece of code could help you on your way:
// Define variables
$string = "it is a test string.";
$y = 'r';
$x = 'm';
// Convert string to array (explode() does not accept an empty delimiter)
$chars = str_split($string);
// Loop through all characters
foreach ($chars as $key => $char) {
    // Array keys start at 0, so we add 1
    $keyCount = $key+1;
    // Check the first character and every position that is divisible by 4
    // (dividing the position by 4 leaves no remainder)
    if (($keyCount == 1 || $keyCount % 4 == 0) && $char == $y) {
        $chars[$key] = $x;
    }
}
// Convert back to string
$string = implode($chars);
Here is one other way to do this using string access and modification by character. (Consequently, it is only useful for single-byte encoded strings.)
// First character handled outside the loop because its index doesn't match the pattern
if ($str[0] == $y) $str[0] = $x;
// access every fourth character
for ($i=3; isset($str[$i]) ; $i+=4) {
// change it if it needs to be changed
if ($str[$i] == $y) $str[$i] = $x;
}
This modifies the original string rather than creating a new string, so if that shouldn't happen, it should be used on a copy.
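The byte-offset approaches above are limited to single-byte strings. A possible sketch for UTF-8 input splits the string into characters first; the function name replace_every_fourth is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// replace_every_fourth() is a hypothetical helper for illustration.
// Assumes valid UTF-8 input; positions are counted in characters, not bytes.
function replace_every_fourth(string $str, string $y, string $x): string
{
    // Split into UTF-8 characters; preg_split() returns false on invalid UTF-8
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str; // leave invalid input untouched
    }
    foreach ($chars as $i => $char) {
        $position = $i + 1; // 1-based position, as in the question
        if (($position === 1 || $position % 4 === 0) && $char === $y) {
            $chars[$i] = $x;
        }
    }
    return implode('', $chars);
}

echo replace_every_fourth('it is a test string.', 'r', 'm'), "\n"; // it is a test stming.
```

Unlike the in-place version, this builds a new string and leaves the original unchanged.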
Late to the party: putting aside the \G anchor, I'd go with the (*SKIP)(*F) method:
$str = "it is a test string.";
echo preg_replace(['~\Ar~', '~.{3}\K(?>r|.(*SKIP)(?!))~'], 'm', $str);
Short and clean.

Most efficient way to find matching words from paragraph

I have a Paragraph that I have to parse for different keywords. For example, Paragraph:
"I want to make a change in the world. Want to make it a better place to live. Peace, Love and Harmony. It is all life is all about. We can make our world a good place to live"
And my keywords are
"world", "earth", "place"
I should report whenever I have a match and how many times.
Output should be:
"world" 2 times and "place" 1 time
Currently, I am just converting the paragraph string to an array of characters and then matching each keyword against all of the array contents, which wastes resources.
Please guide me to an efficient way. (I am using PHP.)
As @CasimiretHippolyte commented, a regex is the better means because word boundaries can be used. Caseless matching is also possible with the i flag. Use the return value of preg_match_all:
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
The pattern for matching one word is: /\bword\b/i. Generate an array where the keys are the word values from search $words and values are the mapped word-count, that preg_match_all returns:
$words = array("earth", "world", "place", "foo");
$str = "at Earth Hour the world-lights go out and make every place on the world dark";
$res = array_combine($words, array_map(function($w) use (&$str) {
    return preg_match_all('/\b'.preg_quote($w,'/').'\b/i', $str);
}, $words));
print_r($res);
Test at eval.in; the output is:
Array
(
[earth] => 1
[world] => 2
[place] => 1
[foo] => 0
)
preg_quote is used for escaping the words; it is not necessary if you know they don't contain any special characters. Inline anonymous functions require PHP 5.3 or later.
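If the paragraph is long, the keywords can also be counted in a single pass by joining them into one alternation. The following is only a sketch of that idea; the function name count_keywords is hypothetical and counts are keyed by the lowercased keyword.

```php
<?php
// count_keywords() is a hypothetical helper name for this sketch.
function count_keywords(array $words, string $text): array
{
    if ($words === []) {
        return []; // nothing to search for
    }
    // Start every keyword at 0 so that missing words are still reported
    $counts = array_fill_keys(array_map('strtolower', $words), 0);
    // Escape each keyword and join them into one alternation
    $alternation = implode('|', array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words));
    // One scan over the text; \b keeps "world" from matching inside "worldwide"
    if (preg_match_all('/\b(?:' . $alternation . ')\b/i', $text, $m)) {
        foreach ($m[0] as $hit) {
            $counts[strtolower($hit)]++;
        }
    }
    return $counts;
}

$words = array("earth", "world", "place", "foo");
$str = "at Earth Hour the world-lights go out and make every place on the world dark";
print_r(count_keywords($words, $str));
```

Whether this is actually faster than one preg_match_all call per keyword depends on the input, so it is worth measuring before relying on it.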
<?php
Function woohoo($terms, $para) {
$result ="";
foreach ($terms as $keyword) {
$cnt = substr_count($para, $keyword);
if ($cnt) {
$result .= $keyword. " found ".$cnt." times<br>";
}
}
return $result;
}
$terms = array('world', 'earth', 'place');
$para = "I want to make a change in the world. Want to make it a better place to live.";
$r = woohoo($terms, $para);
echo($r);
?>
I will use preg_match_all(). Here is how it would look in your code. The actual function returns the count of items found, but the $matches array will hold the results:
<?php
$string = "world";
$paragraph = "I want to make a change in the world. Want to make it a better place to live. Peace, Love and Harmony. It is all life is all about. We can make our world a good place to live";
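// Build a delimited, case-insensitive whole-word pattern from the keyword
$pattern = '/\b' . preg_quote($string, '/') . '\b/i';
// $matches is filled by reference; no & is needed (or allowed) at the call site
if (preg_match_all($pattern, $paragraph, $matches)) {
echo $string . " " . count($matches[0]) . " times";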
}else {
echo "match NOT found";
}
?>

Is it right to use preg_match to search for words that contain specific letters?

I have a group of letters, for example :
$word='estroaroint';
that can be arranged to be words like :
- store
- train
- restoration
- ...etc
They can be found in my file list 'dictionary.txt'.
Each letter can only be used once.
How to write a php script able to perform that?
I would try to manage it with this function: strpbrk() http://php.net/manual/en/function.strpbrk.php
It isn't really possible to do that in one step with a regex. However, it is possible to do it in two steps:
the first step finds all the words in the dictionary that contain only the given letters.
the second step filters out the words that use a letter more often than it is available.
Example (only for ascii range):
$pattern = '~\b[' . $word . ']{1,' . strlen($word) . '}+\b~';
if (preg_match_all($pattern, $dictionary, $m)) {
$chars = count_chars ($word, 1);
$result = array_filter($m[0], function ($i) use ($chars) {
foreach (count_chars($i, 1) as $k=>$v) {
if ($v > $chars[$k]) return false;
}
return true;
});
print_r($result);
}
PHP links: array_filter - count_chars
Note: to extend this script to multibyte characters, you need to write your own function mb_count_chars (since this function doesn't exist) that splits a multibyte string (you can use for example mb_substr, mb_strlen and a loop, or preg_split with ~(?=.)~u and the PREG_SPLIT_NO_EMPTY option). You also need to add the u modifier to the regex pattern and to change strlen to its multibyte equivalent.
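The helper described in the note could look roughly like this. It is a sketch under assumptions: the names mb_count_chars_sketch and can_build are made up, the input is assumed to be valid UTF-8, and an empty split pattern with the u modifier is used instead of ~(?=.)~u.

```php
<?php
// mb_count_chars() does not exist in PHP; mb_count_chars_sketch() is a
// hypothetical stand-in that counts UTF-8 characters instead of bytes.
function mb_count_chars_sketch(string $s): array
{
    // Split into UTF-8 characters; fall back to an empty list on invalid UTF-8
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_count_values($chars); // character => number of occurrences
}

// Hypothetical second step: can $candidate be built from $letters,
// using each available letter at most once?
function can_build(string $candidate, string $letters): bool
{
    $available = mb_count_chars_sketch($letters);
    foreach (mb_count_chars_sketch($candidate) as $char => $needed) {
        if ($needed > ($available[$char] ?? 0)) {
            return false;
        }
    }
    return true;
}

var_dump(can_build('store', 'estroaroint'));       // bool(true)
var_dump(can_build('restoration', 'estroaroint')); // bool(true)
var_dump(can_build('trees', 'estroaroint'));       // bool(false)
```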

Get more backreferences from regexp than parenthesis

Ok this is really difficult to explain in English, so I'll just give an example.
I am going to have strings in the following format:
key-value;key1-value;key2-...
and I need to extract the data to be an array
array('key'=>'value','key1'=>'value1', ... )
I was planning to use regexp to achieve (most of) this functionality, and wrote this regular expression:
/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/
to work with preg_match and this code:
for ($l = count($matches),$i = 1;$i<$l;$i+=2) {
$parameters[$matches[$i]] = $matches[$i+1];
}
However the regexp obviously returns only 4 backreferences - first and last key-value pairs of the input string. Is there a way around this? I know I can use regex just to test the correctness of the string and use PHP's explode in loops with perfect results, but I'm really curious whether it's possible with regular expressions.
In short, I need to capture an arbitrary number of these key-value; pairs in a string by means of regular expressions.
You can use a lookahead to validate the input while you extract the matches:
/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/
(?=(?:\w++-[^;-]++;?)++$) is the validation part. If the input is invalid, matching will fail immediately, but the lookahead still gets evaluated every time the regex is applied. In order to keep it (along with the rest of the regex) in sync with the key-value pairs, I used \G to anchor each match to the spot where the previous match ended.
This way, if the lookahead succeeds the first time, it's guaranteed to succeed every subsequent time. Obviously it's not as efficient as it could be, but that probably won't be a problem--only your testing can tell for sure.
If the lookahead fails, preg_match_all() will return 0. If it succeeds, the matches will be returned in an array of arrays: one for the full key-value pairs, one for the keys, one for the values.
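A short usage sketch of this regex with preg_match_all() follows. The wrapper name parse_pairs and the choice to return null for invalid input are assumptions made for illustration.

```php
<?php
// parse_pairs() is a hypothetical wrapper around the regex shown above.
// Returns key => value pairs, or null if the input is not well-formed.
function parse_pairs(string $input): ?array
{
    $regex = '/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/';
    // 0 matches (or false on error) means the validating lookahead failed
    if (!preg_match_all($regex, $input, $m)) {
        return null;
    }
    // $m[1] holds the keys and $m[2] the values, in matching order
    return array_combine($m[1], $m[2]);
}

print_r(parse_pairs('key-value;key1-value1;key2-value2'));
var_dump(parse_pairs('key-value-value;key1-value')); // NULL
```

Repeated keys are not treated specially here: array_combine() keeps the last value for a key.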
Regex is a powerful tool, but sometimes it's not the best approach.
$string = "key-value;key1-value";
$s = explode(";",$string);
foreach($s as $k){
$e = explode("-",$k);
$array[$e[0]]=$e[1];
}
print_r($array);
Use preg_match_all() instead. Maybe something like:
$matches = $parameters = array();
$input = 'key-value;key1-value1;key2-value2;key123-value123;';
preg_match_all("/(\w+)-([^-;]+)/", $input, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$parameters[$match[1]] = $match[2];
}
print_r($parameters);
EDIT:
to first validate if the input string conforms to the pattern, then just use:
if (preg_match("/^((\w+)-([^-;]+);)+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
EDIT2: the final semicolon is optional
if (preg_match("/^(\w+-[^-;]+;)*\w+-[^-;]+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
No. Newer matches overwrite older matches. Perhaps the limit argument of explode() would be helpful when exploding.
what about this solution:
$samples = array(
"good" => "key-value;key1-value;key2-value;key5-value;key-value;",
"bad1" => "key-value-value;key1-value;key2-value;key5-value;key-value;",
"bad2" => "key;key1-value;key2-value;key5-value;key-value;",
"bad3" => "k%ey;key1-value;key2-value;key5-value;key-value;"
);
foreach($samples as $name => $value) {
if (preg_match("/^(\w+-\w+;)+$/", $value)) {
printf("'%s' matches\n", $name);
} else {
printf("'%s' not matches\n", $name);
}
}
I don't think you can do both validation and extraction of data with one single regexp, as you need anchors (^ and $) for validation and preg_match_all() for the data, but if you use anchors with preg_match_all() it will only return the last set matched.
