Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
e.g. qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exploding it on "_" anyway and then comparing each of the exploded string parts with the first string part until I get a semblance of a pattern:
function getPatternABC($string)
{
$count = 0;
$pattern ="";
$arrString = explode("_", $string);
foreach($arrString as $expString)
{
if(strcmp($expString,$arrString[0])!==0 || $count==0)
{
$pattern = $pattern ."_". $arrString[$count];
$count++;
}
else break;
}
return substr($pattern,1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?

Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
echo $matches[1];
}
See it in action.
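As a minimal sketch, the same pattern can be wrapped in a helper. The function name extractAbc is hypothetical, and the character class here additionally assumes that "." is allowed, as the question states:

```php
<?php
// Hypothetical helper (not from the original answer): returns ABC, or null
// when the string does not start with ABC_ABC_ABC+.
function extractAbc(string $input): ?string
{
    // Group 1 is ABC; each \1 requires exactly the same text to repeat.
    if (preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+/', $input, $matches)) {
        return $matches[1];
    }
    return null;
}

echo extractAbc('qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ'), "\n"; // qWe_rtY-asdf
echo extractAbc('pkl123_pkl123_pkl123+JKL_XYZ'), "\n";                   // pkl123
var_dump(extractAbc('no repetition here'));                              // NULL
```

Returning null instead of echoing lets the caller decide how to handle input that does not follow the expected layout.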

Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].

If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
<?php
$str = 'ABC_ABC_PQR_XYZ';
if(substr_count($str, '_') == 3)
$abc = explode('_', $str)[0]; // reset() needs a variable, not a function result
else
$abc = regexy_function($str);
?>

Related

How to get substrings on both sides of hyphen and trailing substring?

I am currently working on a web app which is using a specific string to call a function. Here is a sample string:
$string = "translate from-to word for translate"
First I need to validate the string, and it should be like the above $string. How should I validate the string?
Then I need to extract 3 substrings from $string.
The word that precedes the hyphen. (To be named: $source)
The word that follows the hyphen. (To be named: $target)
The text (not including the first space) that follows $source to the end of the string. (To be named: $translate)
This is my coding attempt to get the from and to:
$found = false;
$source ="";
$target = "";
$next = 3;
$prev = 1;
for($i=0;$i<strlen($string);$i++){
if($found== false){
if($string[$i] == "-"){
$found = true;
while($string[$i+$prev] != " "){
$target .= $string[$i+$prev];
$prev +=1;
}
/*$next -=1;
while($string[$i-$next] != " " && $next > 0){
$source .= $string[$i-$next];
$next -=1;
}*/
}
}
}
From that code, I can only get $target, which contains the to after the -. I don't know how to get $source.
Please show me the fastest way to get the from as $source and to as $target.
Then I need to get word for translate (all of the string after from-to).
So the result should be
$target = "to";
$source = "from";
$translate = "word for translate";
Finally, if the $string has more than one hyphen, like translate from-to from-to test-test word for translate, it should return false.
Note: to and from are arbitrary strings.

Consider the following possible input strings:
translate from-to word for translate (1 hyphen, no accents or non-English characters)
translate dari-ke dari-ke word for translate (2 hyphens)
translate clé-solution word for translate (1 hyphen, accented character used)
translate goodbye-さようなら word for translate (1 hyphen , Japanese characters used)
A case-insensitive pattern like: /^[a-z]+? ([a-z]+)-([a-z]+?) ([a-z ]+)$/i will perform as requested on the first two sample strings with high efficiency, but not the last two.
Using the "word character" (\w) to match the substrings (instead of case-insensitive [a-z]) will perform as intended with the first two samples, but it also allows 0-9 and _ as valid characters. This means a slight drop in pattern accuracy (this may be of no noticeable consequence to your project).
If you are translating strings that may go beyond English characters, it can be simpler / more forgiving to use a "negated character class" for matching. If you want to allow letters beyond a-z, like accented and other multibyte characters, then [^-] will offer a broad allowance of characters (at the expense of allowing many unwanted letters too). Here is a demo of this kind of pattern.
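A stricter middle ground, offered here as a sketch and not as part of the original answer, is the Unicode letter class \p{L} combined with the u modifier. This assumes the input is valid UTF-8; the helper name parseTranslate is hypothetical:

```php
<?php
// Hypothetical helper: parses "command source-target text to translate".
// Returns null when the string does not fit (for example, two hyphens).
function parseTranslate(string $string): ?array
{
    // \p{L} matches any Unicode letter; the u modifier enables UTF-8 mode.
    if (preg_match('/^\p{L}+ (\p{L}+)-(\p{L}+) ([\p{L} ]+)$/u', $string, $out)) {
        return ['source' => $out[1], 'target' => $out[2], 'translate' => $out[3]];
    }
    return null;
}

var_export(parseTranslate('translate goodbye-さようなら word for translate'));
echo "\n";
var_export(parseTranslate('translate dari-ke dari-ke word for translate')); // NULL
```

Unlike [^-], this rejects digits, punctuation and spaces inside the source and target words, while still accepting accented and non-Latin letters.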
It is important to only write "capture groups" for substrings that you want to subsequently use. For this reason, I do not capture the leading substring translate.
list() is a handy "language construct" to assign variable names to array values. Notice that the first element (the fullstring match) is not assigned to a variable. This is why list()'s parameters start with ,. If you don't wish to leverage the convenience of list(), then you can manually assign the three variable names over three lines like this:
$source=$out[1];
$target=$out[2];
$translate=$out[3];
Code: (Demo)
$strings=[
"translate from-to word for translate",
"translate dari-ke dari-ke word for translate",
"translate clé-solution word for translate",
"translate goodbye-さようなら word for translate"
];
foreach($strings as $string){
if(preg_match('/^[a-z]+? ([^-]+)-([^-]+?) ([a-z ]+)$/i',$string,$out)){
list(,$source,$target,$translate)=$out;
echo "source=$source; target=$target; translate=$translate";
}else{
var_export(false); // $found=false;
}
echo "<br>";
}
Output:
source=from; target=to; translate=word for translate
false
source=clé; target=solution; translate=word for translate
source=goodbye; target=さようなら; translate=word for translate
While regex provides a much more concise method with fewer function calls, this is a non-regex method:
if(substr_count($string,'-')!=1){
var_export(false); // $found=false;
}else{
$trimmed=substr($string,strlen('translate ')); // ltrim() takes a character mask, not a prefix, and could eat into $source
$array=explode(' ',$trimmed,2);
list($source,$target)=explode('-',$array[0]);
$translate=$array[1];
echo "source=$source; target=$target; translate=$translate";
}

If I understand your question correctly, this can be done with a regular expression:
<?php
$string = "translate from-to word for translate";
$result = preg_match("/^([\w ]+?) (\w+)-(\w+) ([\w ]+)$/", $string, $matches);
if ($result) {
print_r($matches);
$source = $matches[2];
$target = $matches[3];
$translate = $matches[4];
} else {
echo "No match";
}
Output:
Array
(
[0] => translate from-to word for translate
[1] => translate
[2] => from
[3] => to
[4] => word for translate
)
Here is an explanation of the regular expression.

preg replace would ignore non-letter characters when detecting words

I have an array of words and a string and want to add a hashtag to the words in the string that they have a match inside the array. I use this loop to find and replace the words:
foreach($testArray as $tag){
$str = preg_replace("~\b".$tag."~i","#\$0",$str);
}
Problem: let's say I have the words "is" and "isolated" in my array. I will get ##isolated in the output. This means that the word "isolated" is found once for "is" and once for "isolated". The pattern ignores the fact that "#isolated" no longer starts with "is"; it starts with "#".
I give an example below, BUT it is only an example; I don't want to solve just this one case but every other possibility:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
Output will be:
this #is ##isolated #is an example of this and that

You may build a regex with an alternation group enclosed with word boundaries on both ends and replace all the matches in one pass:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
echo preg_replace('~\b(?:' . implode('|', $testArray) . ')\b~i', '#$0', $str);
// => this #is #isolated #is an example of this and that
See the PHP demo.
The regex will look like
~\b(?:is|isolated|somethingElse)\b~i
See its online demo.
If you want to make your approach work, you might add a negative lookbehind after \b: "~\b(?<!#)".$tag."~i","#\$0". The lookbehind will fail all matches that are preceded with #. See this PHP demo.
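The lookbehind variant of the original loop could be sketched as follows. The preg_quote() call is an added precaution that is not in the original snippet:

```php
<?php
$str = "this is isolated is an example of this and that";
$testArray = array('is', 'isolated', 'somethingElse');

foreach ($testArray as $tag) {
    // preg_quote() (an addition) makes tags with regex metacharacters literal.
    // (?<!#) skips words that an earlier iteration has already tagged.
    $str = preg_replace('~\b(?<!#)' . preg_quote($tag, '~') . '~i', '#$0', $str);
}

echo $str; // this #is #isolated #is an example of this and that
```

Note that without a trailing \b the tag is still matches the start of isolated, so an unrelated word such as island would be tagged too; the single-pass alternation above avoids that.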

A way to do that is to split your string by words and to build a associative array with your original array of words (to avoid the use of in_array):
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
$hash = array_flip(array_map('strtolower', $testArray));
$parts = preg_split('~\b~', $str);
for ($i=1; $i<count($parts); $i+=2) {
$low = strtolower($parts[$i]);
if (isset($hash[$low])) $parts[$i-1] .= '#';
}
$result = implode('', $parts);
echo $result;
This way, your string is processed only once, whatever the number of words in your array.

Check string for defined format and get part of it

How can I check if a string has the format [group|any_title] and give me the title back?
[group|This is] -> This is
[group|just an] -> just an
[group|example] -> example
I would do that with explode and [group| as the delimiter and remove the last ]. If length (of explode) is > 0, then the string has the correct format.
But I don't think that is a very good way, is it?

So you want to check if a string matches a regex?
if(preg_match('/^\[group\|(.+)\]$/', $string, $m)) {
$title = $m[1];
}
If the group part is supposed to be dynamic as well:
if(preg_match('/^\[(.+)\|(.+)\]$/', $string, $m)) {
$group = $m[1];
$title = $m[2];
}
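A slightly tighter variant, given as a sketch: the helper name extractTitle is hypothetical, and the character class [^|\]]+ is an assumption that a title never contains "|" or "]":

```php
<?php
// Hypothetical helper: returns the title, or null when the format is wrong.
function extractTitle(string $string): ?string
{
    // [^|\]]+ : one or more characters that are neither "|" nor "]".
    if (preg_match('/^\[group\|([^|\]]+)\]$/', $string, $m)) {
        return $m[1];
    }
    return null;
}

var_dump(extractTitle('[group|This is]')); // string(7) "This is"
var_dump(extractTitle('group|This is'));   // NULL
```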

Use regular expression matching using PHP function preg_match.
You can use for example regexr.com to create and test a regular expression and when you're done, then implement it in your PHP script (replace the first parameter of preg_match with your regular expression):
$text = '[group|This is]';
// replace "pattern" with regular expression pattern
if (preg_match('/pattern/', $text, $matches)) {
// OK, you have parts of $text in $matches array
}
else {
// $text doesn't contain text in expected format
}
The specific pattern depends on how strictly you want to check the input string. It can be, for example, /^\[.+\|(.+)\]$/ or /\|([A-Za-z ]+)\]$/. The first checks that the string starts with [, ends with ], and contains at least one character on each side of a | in between. The second only checks that the string ends with | followed by one or more letters or spaces and a final ].
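To make the difference concrete, here is a small sketch that runs both patterns over a few inputs. The sample inputs and variable names are made up for illustration:

```php
<?php
$strict = '/^\[.+\|(.+)\]$/';     // whole string must look like [something|title]
$loose  = '/\|([A-Za-z ]+)\]$/';  // only the tail "|title]" is checked

$results = [];
foreach (['[group|This is]', 'junk|example]', '[group|R2-D2]'] as $text) {
    $results[$text] = [
        'strict' => preg_match($strict, $text, $m) ? $m[1] : null,
        'loose'  => preg_match($loose, $text, $m) ? $m[1] : null,
    ];
}
print_r($results);
// '[group|This is]' : both patterns capture "This is"
// 'junk|example]'   : only the loose pattern matches (no leading "[")
// '[group|R2-D2]'   : only the strict pattern matches (digits and "-" in title)
```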

Make two simple regex's into one

I am trying to make a regex that will look behind .txt and then behind the "-" and get the first digit .... in the example, it would be a 1.
$record_pattern = '/.txt.+/';
preg_match($record_pattern, $decklist, $record);
print_r($record);
.txt?n=chihoi%20%283-1%29
I want to write this as one expression but can only seem to do it as two. This is the first time working with regex's.

You can use this:
$record_pattern = '/\.txt.+-(\d)/';
Now, the first group contains what you want.
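A minimal usage sketch, reusing the variable names from the question and assuming $decklist holds the sample string:

```php
<?php
$decklist = '.txt?n=chihoi%20%283-1%29';
$record_pattern = '/\.txt.+-(\d)/';

if (preg_match($record_pattern, $decklist, $record)) {
    echo $record[1]; // 1
}
```

Because .+ is greedy, a string with several hyphen-digit pairs after .txt would yield the digit after the last such hyphen, not the first.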

Your regex would be,
\.txt[^-]*-\K\d
You don't need any capture groups. The pattern matches from .txt up to the literal -. The \K then discards everything matched so far (here, .txt?n=chihoi%20%283-) from the reported match, and \d matches the first digit right after the -.
DEMO
Your PHP code would be,
<?php
$mystring = ".txt?n=chihoi%20%283-1%29";
$regex = '~\.txt[^-]*-\K\d~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch; // => 1
}
?>

Php replace exact word

Here is my problem:
Using preg_replace('#\b(word)\b#','****',$text);
Where in text I have word\word and word, the preg_replace above replaces both word\word and word so my resulting string is ***\word and ***.
I want my string to look like : word\word and ***.
Is this possible? What am I doing wrong???
LATER EDIT
I have an array with urls, I foreach that array and preg_replace the text where url is found, but it's not working.
For instance, I have http://www.link.com and http://www.link.com/something
If I have http://www.link.com it also replaces http://www.link.com/something.

You are effectively specifying that you don't want certain characters to count as word boundary. Therefore you need to specify the "boundaries" yourself, something like this:
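preg_replace('#(^|[^\w\\\\])(word)([^\w\\\\]|$)#','$1***$3',$text);
What this does is search for the word surrounded by string boundaries or by non-word characters other than the backslash \. The backslash is written as \\\\ because the single-quoted PHP string and the regex engine each consume one level of escaping, and $1 and $3 put the surrounding characters back, since they are part of the match. Therefore it will match .word, but not .word\ and not \word. If you need to exclude other characters from matching, just add them inside the brackets.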
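An alternative sketch uses lookarounds, which check the neighbouring characters without making them part of the match. The second half applies the same idea to the URL case from the question's edit; treating a following word character or slash as "part of a longer URL" is an assumption, and the [link] replacement text is made up:

```php
<?php
$text = 'word\word and word';

// PHP reduces \\\\ to \\, which the regex engine reads as one literal
// backslash. The lookarounds reject a word character or backslash on either
// side, so the surrounding characters are never part of the replaced text.
$result = preg_replace('#(?<![\w\\\\])word(?![\w\\\\])#', '***', $text);
echo $result, "\n"; // word\word and ***

// URL case: replace the bare URL only, not a longer URL that starts with it.
$url  = 'http://www.link.com';
$html = 'See http://www.link.com and http://www.link.com/something';
$linked = preg_replace('#' . preg_quote($url, '#') . '(?![\w/])#', '[link]', $html);
echo $linked, "\n"; // See [link] and http://www.link.com/something
```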

You could just use str_replace("word\word", "word\word and"); I don't really see why you would need to use preg_replace in the case given above.

Here is a simple solution that doesn't use a regex. It will ONLY replace single occurrences of 'word' where it is a lone word.
<?php
$text = "word\word word cat dog";
$new_text = "";
$words = explode(" ",$text); // split the string into separate 'words'
$inc = 0; // loop counter
foreach($words as $word){
if($word == "word"){ // if the current word in the array of words matches the criteria, replace it
$words[$inc] = "***";
}
$new_text.= $words[$inc]." ";
$inc ++;
}
echo $new_text; // gives 'word\word *** cat dog'
?>
