preg_match to capture string part after a special character

I have a text file with strings, and for each string I need to split it and capture each of its parts.
The string is like:
Joao.Martins.G2R71.Pedro.Feliz.sno
The parts are: NAME of the 1st player (first name only, or first name + surname); G = game number (can be 2 or 02 or any other number below 99); R = result (in this example the home team wins 7x1); then NAME of the 2nd player; the last 3 characters are the game type (snooker in this example).
But the string can also be:
Joao Martins |2x71| Pedro Feliz.poo
I'm no Regex expert (sadly) and already searched lots of questions here without finding a solution or for that matter even getting help just by reading the answers to other questions (mainly because I never seem to understand this)
I already have this:
preg_match("/\[(|^|]+)\]/",$string,$result);
echo $result[1] . "<br />";
But this only gives me the whole part between the | | without separating it, and ignores everything else.
Can you guys help me with a solution for both cases? I'm as usual completely lost here!
Thanks in advance!

explode way:
You don't have to use complex regexp, you may use simple explode.
$parts = explode( '.', $string);
$parts now has either 2 or 6 elements, so you can do:
if( count( $parts) == 6){
list( $firstName1, $surName1, $score, $firstName2, $surName2, $gameType) = $parts;
} elseif( count( $parts) == 2) {
$gameType = $parts[1];
list( $firstName1, $surName1, $score, $firstName2, $surName2) = explode( ' ', $parts[0]);
} else {
echo "Cannot parse";
}
And now parsing $score (the G2R71 or |2x71| token) :)
if( preg_match( '~^\|(\d+)x(\d+)\|$~', $score, $m)){
$first = $m[1];
$second = $m[2];
} elseif( preg_match( '~^G(\d+)R(\d+)$~', $score, $m)){
$first = $m[1];
$second = $m[2];
} else {
echo "Cannot parse!";
}
preg_match way:
The second regexp is intentionally different, so you can see how to write a regexp that will "eat" the whole name no matter whether it has 2, 3 or 5 parts, and so you get used to *? (the greedy killer).
$match = array();
if( preg_match( '~^(\w+)\.(\w+)\.G(\d+)R(\d+)\.(\w+)\.(\w+)\.(\w+)$~', $text, $match)){
// First way
} elseif (preg_match( '~^([^\|]+)\|(\d+)x(\d+)\|(.*?)\.(\w+)$~', $text, $match)){
// Second way
} else {
// Failed to parse
}
Edit (more than 2 names)
And if player may have more than 2 names (like Armin Van Buuren) you should go with regexp like this:
~^([\w.]+)\.G(\d+)R(\d+)\.([\w.]+)\.(\w+)$~
This will match names like Albert.Einstein or Armin.Van.Buuren. Since \w also matches digits and the underscore, names like Gerold.The.3rd match as well.
There is no need for [\w\d.]; it is equivalent to [\w.]. The \.G(\d+)R(\d+)\. part is quite strict, so you would have to make up a really crazy name containing something like .G3R01. (say "3l1t33 kid Gerold") to get it parsed wrong.
Oh and one more thing, don't forget to $name = strtr( $name, '.', ' ') :)
RegExp explained
~~ - regexp delimiters; they start and finish the regexp: ~regexp~. The delimiter can be practically any non-alphanumeric character, e.g. /regexp/, or a bracket pair such as (regexp)
^ and $ - meta characters; ^ is the start of the string/line, $ is the end of the string/line
\w is the escape sequence for any word character, the same as [a-zA-Z0-9_]
([\w.]+) - captures a subpattern (match group) that contains characters from [a-zA-Z0-9_.] at least once. + is called a quantifier
+? - a ? after another quantifier is called the greedy killer (it makes the quantifier lazy) and means "take as little as possible". On the string ababa, (\w+)a would normally capture abab, (\w+?)a would capture ab and (\w*?)a would capture the empty string :)
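To tie the pieces above together, here is a minimal sketch that tries the dotted pattern first and falls back to the piped one. The function name parseGameLine and the array keys are made up for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper: parses one line in either format, or returns null.
function parseGameLine(string $line): ?array
{
    // Dotted format: Name.Parts.G<game>R<result>.Name.Parts.type
    $dotted = '~^([\w.]+)\.G(\d+)R(\d+)\.([\w.]+)\.(\w+)$~';
    // Piped format: Name Parts |<game>x<result>| Name Parts.type
    $piped = '~^([^|]+)\|(\d+)x(\d+)\|(.*?)\.(\w+)$~';

    if (!preg_match($dotted, $line, $m) && !preg_match($piped, $line, $m)) {
        return null; // neither format matched
    }

    return [
        // Dots inside names become spaces; trim the padding around the pipes.
        'player1' => trim(strtr($m[1], '.', ' ')),
        'game'    => (int) $m[2],
        'result'  => $m[3],
        'player2' => trim(strtr($m[4], '.', ' ')),
        'type'    => $m[5],
    ];
}

print_r(parseGameLine('Joao.Martins.G2R71.Pedro.Feliz.sno'));
print_r(parseGameLine('Joao Martins |2x71| Pedro Feliz.poo'));
```

Both sample lines should come back with the same player names, game number and result, differing only in the game type.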

I think this will do it for you.
/^(\w+)(?:\.| )(\w+)(?:\.| \|)G?(\d+)[xR](\d+)(?:\.|\| )(\w+)(?:\.| )(\w+)(?:\.| )(\w+)$/
$1 will be p1 first name
$2 will be p1 last name
$3 will be game number
$4 will be results
$5 will be p2 first name
$6 will be p2 last name
$7 will be game type
If the $n names don't make sense, just think of them as the elements of the array that preg_match fills in ($result in your code). The pattern could probably be simplified, but I don't have enough time to figure that out.
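As a quick check, the single pattern can be run against both sample strings from the question. This is only a sketch: the variable names are illustrative, and the character class is written as [xR] so that a literal | is not accepted between the two numbers.

```php
<?php
$pattern = '/^(\w+)(?:\.| )(\w+)(?:\.| \|)G?(\d+)[xR](\d+)(?:\.|\| )(\w+)(?:\.| )(\w+)(?:\.| )(\w+)$/';

$lines = [
    'Joao.Martins.G2R71.Pedro.Feliz.sno',
    'Joao Martins |2x71| Pedro Feliz.poo',
];

$parsed = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $result)) {
        // $result[0] is the whole match; keep only the seven captured parts.
        $parsed[] = array_slice($result, 1);
    }
}
print_r($parsed);
```

Note that this pattern expects exactly two name parts per player, so a three-part name will not match.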

You can do this:
//to get the string without the game type
$yourstring = substr($yourstring ,0 ,strlen($yourstring)-4);
//separating strings with "." as delimiter
$results = explode(".",$yourstring);
//checking whether "." was the delimiter
if(!strcmp($results[0],$yourstring)) {
//if "." was not the delimiter, then split the string with " "
//as the delimiter.
$results = explode(" ",$yourstring);
}
//storing them in separate variables. and removing "|" if exists.
if( count( $results) == 5){
$results[2] = trim($results[2],"|");
list( $var1, $var2, $var3, $var4, $var5) = $results;
}
elseif( count( $results) == 4){
$results[1] = trim($results[1],"|");
$results[2] = trim($results[2],"|");
list( $var1, $var2, $var3, $var4) = $results;
}
else {
$results[1] = trim($results[1],"|");
list( $var1, $var2, $var3) = $results;
}
All your string parts will be separated and stored in $results.
To get them to separate variable, you can use list function.

Related

PHP - Check for leading 0's in 2 comma-delimited integers

I have a user-input string with 2 comma-delimited integers.
Example (OK):
3,5
I want to reject any user input that contains leading 0's for either number.
Examples (Bad):
03,5
00005,3
05,003
Now what I could do is separate the two numbers into 2 separate string's and use ltrim on each one, then see if they have changed from before ltrim was executed:
$string = "03,5";
$string_arr = explode(",",$string);
$string_orig1 = $string_arr[0];
$string_orig2 = $string_arr[1];
$string_mod1 = ltrim($string_orig1, '0');
$string_mod2 = ltrim($string_orig2, '0');
if (($string_mod1 !== $string_orig1) || ($string_mod2 !== $string_orig2)){
// One of them had leading zeros!
}
..but this seems unnecessarily verbose. Is there a cleaner way to do this? Perhaps with preg_match?

You could shorten the code and check if the first character of each part is a zero:
$string = "03,5";
$string_arr = explode(",",$string);
if ($string_arr[0][0] === "0" || $string_arr[1][0] === "0") {
echo "not valid";
} else {
echo "valid";
}

Here is one approach using preg_match. We can try matching for the pattern:
\b0\d+
The \b word boundary matches at the start of a number, i.e. at the start of the string or right after a comma separator.
If we find such a match, it means that at least one number in the CSV list (or the single number, if only one is present) has a leading zero.
$input = "00005,3";
if (preg_match("/\b0\d+/", $input)) {
echo "not valid";
}

You can do a simple check that if the first character is 0 (using [0]) or that ,0 exists in the string
if ( $string[0] == "0" || strpos($string, ",0") !== false ) {
// One of them had leading zeros!
}

All the current answers fail if any of the values are simply 0.
You can just convert to integer and back and compare the result.
$arr = explode(',', $input);
foreach($arr as $item) {
if( (string)intval($item) !== $item ) {
oh_noes();
}
}
However I am more curious as to why this check matters at all.

One way would be with /^([1-9]\d*),([1-9]\d*)$/; a regex that checks that the string consists of two comma-separated numbers, each starting with a non-zero digit followed by any number of digits.
preg_match('/^([1-9]\d*),([1-9]\d*)$/', $input_line, $output_array);
This separates the digits into two groups and explicitly rejects leading zeros (it also rejects a bare 0).
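A minimal sketch that combines the ideas from the answers above: it flags a leading zero on either number but still accepts a bare 0. The function name hasLeadingZero is hypothetical, and the pattern requires a digit right after the 0 so that a lone 0 is not reported.

```php
<?php
// Hypothetical helper: true if any number in the list has a leading zero.
function hasLeadingZero(string $input): bool
{
    // \b puts us at the start of a number; "0" followed by another digit
    // means the number has a leading zero. A bare "0" does not match.
    return preg_match('/\b0\d/', $input) === 1;
}

foreach (['3,5', '03,5', '00005,3', '05,003', '0,5', '10,20'] as $input) {
    echo $input, ' => ', hasLeadingZero($input) ? 'not valid' : 'valid', "\n";
}
```

This only checks for leading zeros; it does not verify that the input really is two comma-separated integers.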

Get text contained within the parentheses at the end of string

Say I have this user list:
Michael (43)
Peter (1) (143)
Peter (2) (144)
Daniel (12)
The number in the furthest right set of parentheses is the user number.
I want to loop each user and get the highest user number in the list, which in this case would be 144.
How do I do this? I'm sure it can be done with some kind of regexp, but I have no idea how. My loop is simple:
$currentUserNO = 0;
foreach ($users as $user) {
$userNO = $user->NameUserNo; // NameUserNo is the string to be stripped! ex: "Peter (2) (144)" => 144
if ($userNO > $currentUserNO) {
$currentUserNO = $userNO;
}
}
echo "The next user will be added with the user number: " . ($currentUserNO + 1);

You could use a regex like:
/\((\d+)\)$/
          ^ glued to the end of the string
        ^^ closing parentheses
    ^^^ the number you want to capture
 ^^ opening parentheses
to capture the number in the last set of parentheses / at the end of the string.
But you could also use some basic array and string functions:
$parts = explode('(', trim($user->NameUserNo, ' )'));
$number = end($parts);
which breaks down to:
trim the closing parentheses and spaces from the end (strictly speaking from the beginning and end, you could also use rtrim());
explode on the opening parentheses;
get the last element of the resulting array.

If you are not comfortable with regular expressions you should not use them (and start to seriously learn them*, as they are very powerful but cryptic).
In the meantime you don't have to use a regex to solve your problem; just use (assuming that NameUserNo contains just one line of the list):
$parts = explode('(', $user->NameUserNo);
$userNO = substr(end($parts), 0, -1);
It should be easier to understand.
* Is there a good, online, interactive regex tutorial?

I think the regular expression you are looking for is:
.+\((\d+)\)$
Which should select all characters until it reaches the last number wrapped in parentheses.
The PHP code you can use to extract just the number is then:
$userNO = preg_replace('/.+\((\d+)\)$/', '$1', $user->NameUserNo);
I haven't tested this, but it should set $userNO to 43 for the user Michael and 143 for the user Peter and so on.

I guess this is basically what you are looking for:
<?php
$list = array();
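foreach ($users as $user) {
// capture the name in $tokens[1] and the trailing number in $tokens[2]
if (preg_match('/^([a-zA-Z]+).*\((\d+)\)$/', $user->NameUserNo, $tokens)) {
$list[(int) $tokens[2]] = $tokens[1];
}
}
ksort($list);
$keys = array_keys($list);
$highest = end($keys);
echo "The next user will be added with the user number: " . ($highest + 1);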

This is pretty easy to do with a regex.
foreach ($users as $user) {
# search for any characters then a number in brackets
# the match will be in $matches[1]
preg_match("/.+\((\d+)\)/", $user->NameUserNo, $matches);
$userNO = $matches[1];
if ($userNO > $currentUserNO) {
$currentUserNO = $userNO;
}
}
Because regexes use greedy matching, the .+, which means search for one or more characters, will grab as much of the input string as it can before it reaches the number in brackets.

I'm fairly new to PHP, but couldn't you do it with:
$exploded = explode(" ", $user->NameUserNo);
$userNo = substr(end($exploded), 1,-1);
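Putting the anchored regex and the loop from the question together, a minimal sketch could look like this. The function name highestUserNumber is made up for illustration, and it takes plain strings rather than the $user objects from the question.

```php
<?php
// Hypothetical helper: returns the highest number found in the last
// set of parentheses of each name, or 0 if none is found.
function highestUserNumber(array $names): int
{
    $highest = 0;
    foreach ($names as $name) {
        // Anchoring with $ makes sure only the right-most (...) is used.
        if (preg_match('/\((\d+)\)\s*$/', $name, $m)) {
            $highest = max($highest, (int) $m[1]);
        }
    }
    return $highest;
}

$names = ['Michael (43)', 'Peter (1) (143)', 'Peter (2) (144)', 'Daniel (12)'];
echo "The next user will be added with the user number: " . (highestUserNumber($names) + 1);
```

Casting to int matters here: without it the comparison would be done on strings.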

regex to trim down subdomain in the url

I have a regexp that matches strings like: wiseman.google.com.jp, me.co.uk, paradise.museum, abcd-abc.net, www.google.jp, 12345-daswe-23dswe-dswedsswe-54eddss.info, del.icio.us, jo.ggi.ng. All of these come from a textarea value.
used regexp (in preg_match_all($regex1, $str, $match)) to get the above values: /(?:[a-zA-Z0-9]{2,}\.)?[-a-zA-Z0-9]{2,}\.[a-zA-Z0-9]{2,7}(?:\.[-a-zA-Z0-9]{2,3})?/
Now, my question is : how can I make the regexp to trim down the "wiseman.google.com.jp" into "google.com.jp" and "www.google.jp" into "google.jp"?
I am thinking of making a second preg_match($regex2, $str, $match) call on each value coming from the preg_match_all function.
I have tried this regexp in $regex2 : ([-a-zA-Z0-9\x{0080}-\x{00FF}]{2,}+)\.[a-zA-Z0-9\x{0080}-\x{00FF}]{2,7}(?:\.[-a-zA-Z0-9\x{0080}-\x{00FF}]{2,3})? but it doesn't work.
Any inputs? TIA
here is my little solution :
preg_match_all($regex, $str, $matches, PREG_PATTERN_ORDER);
$arrlength=count($matches[0]);
for($x=0;$x<$arrlength;$x++){
$dom = $matches[0][$x];
$newstringcount = substr_count($dom, '.'); // this line is to count how many "." present in the string.
if($newstringcount == 3){ // if there are 3 '.' present in the string = true
$pos = strpos($dom, '.', 0); // this line is to find the first occurence of the '.' in the string
$find = substr($dom, $pos+1); //this line is to get the value after the first occurence of the '.' in the string
echo $find;
}else if($newstringcount == 2){
if (strpos($dom, 'www.') === 0) { // the string starts with "www."
$find = substr($dom, 4); // drop the leading "www."
echo $find;
}else{
echo $dom;
}
}else if($newstringcount == 1){
echo $dom;
}
echo "<br>";
}

(Caution: this answer will only fit your needs if you HAVE to use regex or you're somewhat... desperate...)
What you want to achieve isn't possible with general rules due to domains like .com.jp or .co.uk.
The only general rule one can find is:
When read from right to left there are one or two TLDs followed by one second level domain
Thus, we have to whitelist all available TLDs. I think I'll call the following the "domain-kraken".
Release the kraken!
([a-z0-9\-]{2,63}(?:\.(?:a(?:cademy|ero|rpa|sia|[cdefgilmnoqrstuwxz])|b(?:ike
|iz|uilders|uzz|[abdefghijlmnoqrstvwyz])|c(?:ab|amera|amp|areers|at|enter|eo
|lothing|odes|offee|om(?:pany|puter)?|onstruction|ontractors|oop|
[acdfghiklmnoruvwxyz])|d(?:iamonds|irectory|omains|[ejkmoz])|e(?:du(?:cation)?
|mail|nterprises|quipment|state|[ceghrstu])|f(?:arm|lorist|[ijkmor])|g(?:allery|
lass|raphics|uru|[abdefghlmnpqrstuwy])|h(?:ol(?:dings|iday)|ouse|[kmnrtu])|
i(?:mmobilien|n(?:fo|stitute|ternational)|[delmnoqrst])|j(?:obs|[emop])|
k(?:aufen|i(?:tchen|wi)|[eghimnprwxyz])|l(?:and|i(?:ghting|mo)|[abcikrstuvy])|
m(?:anagement|enu|il|obi|useum|[acdefghklmnopqrstuvwxyz])|n(?:ame|et|inja|
[acefgilopruz])|o(?:m|nl|rg)|p(?:hoto(?:graphy|s)|lumbing|ost|ro|[aefghklmnrstwy])|
r(?:e(?:cipes|pair)|uhr|[eosuw])|s(?:exy|hoes|ingles|ol(?:ar|utions)|upport|
ystems|[abcdeghijklmnorstuvxyz])|t(?:attoo|echnology|el|ips|oday|
[cdfghjklmnoprtvwz])|u(?:no|[agkmsyz])|v(?:entures|iajes|oyage|[aceginu])|
w(?:ang|ien|[fs])|xxx|y(?:[et])|z(?:[amw]))){1,2})$
Use it together with the i and m flags.
This supposes your data is on multiple lines.
In case your data is separated by a ,, change the last character in the regex ($) to ,? and use the g and i flags.
Demos are available on regex101 and debuggex.
(Both of the demos have an explanation: regex101 describes it with text while debuggex visualizes the beast)
A list of available TLDs can be found at iana.org, the used TLDs in the regex are as of January 2014.
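The same whitelist idea can also be sketched without a regex. The function stripSubdomain and the tiny $suffixes list below are hypothetical and only cover the examples from the question; a real list would have to be far longer and kept up to date.

```php
<?php
// Hypothetical helper: keeps the registered domain plus its suffix.
// $suffixes is an assumed whitelist of one- and two-label suffixes.
function stripSubdomain(string $host, array $suffixes): string
{
    $labels = explode('.', strtolower($host));
    // Try the longest suffix first (two labels), then a single label.
    foreach ([2, 1] as $n) {
        if (count($labels) > $n) {
            $suffix = implode('.', array_slice($labels, -$n));
            if (in_array($suffix, $suffixes, true)) {
                // Keep the suffix plus exactly one label in front of it.
                return implode('.', array_slice($labels, -($n + 1)));
            }
        }
    }
    return $host; // unknown suffix: leave the host untouched
}

$suffixes = ['com.jp', 'co.uk', 'jp', 'com', 'net'];
echo stripSubdomain('wiseman.google.com.jp', $suffixes), "\n";
echo stripSubdomain('www.google.jp', $suffixes), "\n";
```

Checking two-label suffixes before single-label ones is what keeps me.co.uk from being cut down to co.uk.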

How to use str_replace() to remove text a certain number of times only in PHP?

I am trying to remove the word "John" a certain number of times from a string. I read in the PHP manual that str_replace accepts a 4th parameter called "count". So I figured that it can be used to specify how many instances of the search should be removed. But that doesn't seem to be the case, since the following:
$string = 'Hello John, how are you John. John are you happy with your life John?';
$numberOfInstances = 2;
echo str_replace('John', 'dude', $string, $numberOfInstances);
replaces all instances of the word "John" with "dude" instead of doing it just twice and leaving the other two Johns alone.
For my purposes it doesn't matter which order the replacement happens in, for example the first 2 instances can be replaced, or the last two or a combination, the order of the replacement doesn't matter.
So is there a way to use str_replace() in this way or is there another built in (non-regex) function that can achieve what I'm looking for?

As Artelius explains, the last parameter to str_replace() is set by the function. There's no parameter that allows you to limit the number of replacements.
Only preg_replace() features such a parameter:
echo preg_replace('/John/', 'dude', $string, $numberOfInstances);
That is as simple as it gets, and I suggest using it because its performance hit is tiny compared to the tedium of the following non-regex solution:
$len = strlen('John');
while ($numberOfInstances-- > 0 && ($pos = strpos($string, 'John')) !== false)
$string = substr_replace($string, 'dude', $pos, $len);
echo $string;
You can choose either solution though, both work as you intend.

You've misunderstood the wording of the manual.
If passed, this will be set to the number of replacements performed.
The parameter is passed by reference and its value is changed by the function to indicate how many times the string was found and replaced. Its initial value is discarded.
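A short sketch of that behaviour, using the string from the question: the value put into the count variable beforehand is ignored, and the function overwrites it with the number of replacements it made.

```php
<?php
$string = 'Hello John, how are you John. John are you happy with your life John?';

// This initial value is discarded; it does not limit anything.
$count = 2;
$result = str_replace('John', 'dude', $string, $count);

echo $result, "\n";
// $count now holds the number of replacements that were performed.
echo $count, "\n";
```

All four occurrences are replaced, and $count ends up holding that total.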

There are a few things you could do to achieve this, but I can't think of one specific php function that will easily let you do this.
One option is to create your own replace function and utilize strripos and substr to do the replaces.
Another thing you can do is use preg_replace_callback and count the number of replacements you have done in the callback.
There's probably more ways but that's all I can think of on the fly. If performance is an issue I suggest you give both a try and do some simple benchmarks.

The cleanest, most-direct, single function call is to use preg_replace(). Its replacement limiting parameter makes the task intuitive and readable.
$string = preg_replace('/John/', 'dude', $string, $numberOfInstances);
The function is also attractive because making the search case-insensitive is as simple as adding the i pattern modifier to the end of the pattern. I won't delve into the usefulness of word boundaries (\b).
If a search string might contain characters with special meaning to the regex engine, then preg_quote() will be necessary -- this diminishes the beauty of the technique but not prohibitively so.
$search = '$5.99';
$pattern = '/' . preg_quote($search, '/') . '/';
$string = preg_replace($pattern, 'free', $string, $numberOfInstances);
For anyone who has an unnatural bias against regex functions, this can be done without regex and without looping -- it will be case-sensitive though.
Limited Explode & Implode: (Demo)
$numberOfInstances = 2;
$string = 'Hello John, how are you John. John are you happy with your life John?';
// explode here -^^^^ and ---------^^^^ only to create the following array:
// 0 => 'Hello ',
// 1 => ', how are you ',
// 2 => '. John are you happy with your life John?'
echo implode('dude', explode('John', $string, $numberOfInstances + 1));
Output:
Hello dude, how are you dude. John are you happy with your life John?
Notice the explode's limiting parameter dictates how many elements are generated, not how many explosions are executed on the string.

function str_replace_occurrences($find, $replace, $string, $count = -1) {
// nothing to search for
if ($find === '') {
return $string;
}
// number of replacements done so far
$current = 0;
// offset to resume searching from, so replaced text is never scanned again
$offset = 0;
$len = strlen($find);
// -1 is used to replace all occurrences
while (($count == -1 || $current < $count) && ($pos = strpos($string, $find, $offset)) !== false) {
// do replacement
$string = substr_replace($string, $replace, $pos, $len);
// continue right after the inserted replacement
$offset = $pos + strlen($replace);
// one more occurrence replaced
$current++;
}
return $string;
}

Artelius has already described how the function works, I'll just show you how to do this manually:
function str_replace_occurrences($find,$replace,$string,$count = 0)
{
if($count == 0)
{
return str_replace($find,$replace,$string);
}
$pos = 0;
$done = 0; // replacements performed so far
$len = strlen($find);
while($done < $count && false !== ($pos = strpos($string,$find,$pos)))
{
$string = substr_replace($string,$replace,$pos,$len);
$pos += strlen($replace); // continue after the replacement
$done++;
}
return $string;
}
This is untested but should work.

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the last option, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
this works for part1 of the task. but before that I think I should do the option split, my best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. may someone with some free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[            # Opening [[
(?:             # A non-capturing group (we don't want this bit)
    [^\[\]]     # Non-bracket characters
    *           # Zero or more of them
    \|          # A literal '|' character marking the end of the discarded options
)?              # This group is optional: if there is only one option, it won't be present
(               # The group we're actually interested in ($1)
    [^\[\]]     # All the non-bracket characters
    +           # Must be at least one
)               # End of $1
\]\]            # Closing ]]
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$regexp = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace($regexp, '$1', $oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
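For PHP specifically, the repeat-until-stable idea can be sketched with the optional fifth argument of preg_replace, which receives the number of replacements made. The function name resolveOptions is made up for illustration.

```php
<?php
// Hypothetical helper: strips [[...]] pairs from the inside out,
// keeping only the last option of any "a|b" split.
function resolveOptions(string $text): string
{
    $pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        // $count is set by preg_replace to the number of replacements made.
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$text = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
echo resolveOptions($text);
```

Each pass removes the innermost bracket pairs, so the loop runs once per nesting level plus a final pass that finds nothing.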

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:[[link]] test2:si[lv]er
test3:out1in[si]deout2 test4:this|not

Why try to do it all in one go? Remove the [[ ]] first and then deal with the options; do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match);
$match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only verbatim (non-escaped) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}
