PHP preg_split: split on the ' character except when preceded by ? - php

I am having trouble splitting a text on the ' character, except when the ' is preceded by ?.
I used this expression to split my text:
preg_split("/([^?]')/",$this->msg)
This expression splits in the right places, but it removes the last character from each of the split strings.
For example for this text:
ONEDAY'TWODAY?'AA'THREEDAY'
returns:
ONEDA
TWODA?0A
THREEDA

It works this way because preg_split() uses the expression it matches as a delimiter.
Your expression matches an apostrophe (') preceded by any character but ? (two characters in total.)
What you need is a lookbehind assertion.
A regex that does what you need is:
preg_split("/(?<!\?)'/", $this->msg);
Explanation
The part enclosed in (?<! and ) is a negative lookbehind assertion. It contains the question mark character (?), escaped because it has a special meaning in regex and we need it here to be interpreted as a literal question mark. A negative lookbehind succeeds only when the expression it encloses does not match the text immediately before the current position.
An assertion is compared against the subject string as usual but it is not included in the match; it is just context.
Alternative
Another regex that does the same thing is:
preg_split("/(?<=[^?])'/", $this->msg);
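It uses a positive lookbehind assertion (enclosed in (?<= and )) that matches any character but the question mark ([^?]). Unlike the negative version, it requires some character before the apostrophe, so it does not split on an apostrophe at the very start of the string.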
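As a rough check of the two lookbehind patterns, the sketch below runs both against the sample text from the question. The variable names and the second test string ($leading) are made up for illustration, and PREG_SPLIT_NO_EMPTY is added only to drop the empty piece after the trailing apostrophe.

```php
<?php
$msg = "ONEDAY'TWODAY?'AA'THREEDAY'";

// Negative lookbehind: split on ' unless the previous character is ?
$negative = preg_split("/(?<!\\?)'/", $msg, -1, PREG_SPLIT_NO_EMPTY);

// Positive lookbehind: split on ' only when a character other than ? precedes it
$positive = preg_split("/(?<=[^?])'/", $msg, -1, PREG_SPLIT_NO_EMPTY);

print_r($negative);                 // ONEDAY, TWODAY?'AA, THREEDAY
var_dump($negative === $positive);  // bool(true) for this input

// Hypothetical edge case: an apostrophe at the very start of the string.
$leading = "'AB'CD";
$leadingNegative = preg_split("/(?<!\\?)'/", $leading);   // splits on both apostrophes
$leadingPositive = preg_split("/(?<=[^?])'/", $leading);  // leaves the leading one alone
print_r($leadingNegative);
print_r($leadingPositive);
```

The two patterns agree on the question's input and differ only when an apostrophe is the first character of the string.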

$string = "ONEDAY'TWODAY?'AA'THREEDAY'";
$parts = preg_split('/\'/sim', $string , -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);
Output:
Array
(
[0] => ONEDAY
[1] => TWODAY?
[2] => AA
[3] => THREEDAY
)
Demo:
http://ideone.com/5WMZZQ

Related

preg_match_all for backslash [\] & [u002F]

I have a URL:
https:\u002F\u002Fsite.vid.com\u002F93836af7-f465-4d2c-9feb-9d8128827d85\u002F6njx6dp3gi.m3u8?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJjb3VudHJ5IjoiSU4iLCJkZXZpY2VfaWQiOiI1NjYxZTY3Zi0yYWE3LTQ1MjUtOGYwYy01ODkwNGQyMjc3ZmYiLCJleHAiOjE2MTA3MjgzNjEsInBsYXRmb3JtIjoiV0VCIiwidXNlcl9pZCI6MH0.c3Xhi58DnxBhy-_I5yC2XMGSWU3UUkz5YgeVL1buHYc","
And I want to match it using preg_match_all. My regex expression is:
preg_match_all('/(https:\/\/site\.vid\.com\/.*\",")/', $input_lines, $output_array);
But I am not able to match the special characters \ and u002F with the code above. I tried using an escaping function, but it still does not match. I know it may be a lame question, but if anyone could help me match \ and u002F, or escape them in preg_match_all, that would be helpful.
Question Edit:
I want to use only preg_match_all because I am trying to extract above URL from a html page.
You may use
preg_match_all('~https:(?://|(?:\\\\u002F){2})site\.vid\.com(?:/|\\\\u002F)[^"]*~', $string)
See the regex demo. Details:
https: - a literal string (if s is optional, use https?:)
(?://|(?:\\u002F){2}) - a non-capturing group matching either // or (|) two occurrences of \u002F
site\.vid\.com - a literal site.vid.com string (the dot is a metacharacter that matches any char but line break chars, so it must be escaped)
(?:/|\\u002F) - a non-capturing group matching / or \u002F text
[^"]* - a negated character class matching zero or more chars other than ".
See the PHP demo:
$re = '~https:(?://|(?:\\\\u002F){2})site\.vid\.com(?:/|\\\\u002F)[^"]*~';
$str = 'https:\\u002F\\u002Fsite.vid.com\\u002F93836af7-f465-4d2c-9feb-9d8128827d85\\u002F6njx6dp3gi.m3u8?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJjb3VudHJ5IjoiSU4iLCJkZXZpY2VfaWQiOiI1NjYxZTY3Zi0yYWE3LTQ1MjUtOGYwYy01ODkwNGQyMjc3ZmYiLCJleHAiOjE2MTA3MjgzNjEsInBsYXRmb3JtIjoiV0VCIiwidXNlcl9pZCI6MH0.c3Xhi58DnxBhy-_I5yC2XMGSWU3UUkz5YgeVL1buHYc","';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches[0]);
// => Array( [0] => https:\u002F\u002Fsite.vid.com\u002F93836af7-f465-4d2c-9feb-9d8128827d85\u002F6njx6dp3gi.m3u8?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJjb3VudHJ5IjoiSU4iLCJkZXZpY2VfaWQiOiI1NjYxZTY3Zi0yYWE3LTQ1MjUtOGYwYy01ODkwNGQyMjc3ZmYiLCJleHAiOjE2MTA3MjgzNjEsInBsYXRmb3JtIjoiV0VCIiwidXNlcl9pZCI6MH0.c3Xhi58DnxBhy-_I5yC2XMGSWU3UUkz5YgeVL1buHYc )
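To make the escaping layers concrete, here is a small sketch that applies the same pattern to a made-up HTML fragment ($html is an assumption, not the asker's page). In a single-quoted PHP string, '\\\\' becomes two backslashes in the pattern, which the regex engine reads as one literal backslash. The final str_replace step is an optional extra, not part of the answer above.

```php
<?php
// Same pattern as in the answer: matches both the plain and the \u002F-escaped form.
$pattern = '~https:(?://|(?:\\\\u002F){2})site\.vid\.com(?:/|\\\\u002F)[^"]*~';

// Hypothetical page fragment; inside single quotes, \u002F stays as backslash + u002F.
$html = 'var src = "https:\u002F\u002Fsite.vid.com\u002Fabc\u002Fplaylist.m3u8?token=xyz","next"; '
      . '<a href="https://site.vid.com/page">link</a>';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]);

// Optional: turn the escaped slashes back into real ones (str_replace accepts an array subject).
$urls = str_replace('\u002F', '/', $matches[0]);
print_r($urls);
```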

regex capture certain characters only

I'm currently dealing with a bit of a problem. This is my string: "all-days"
I need some assistance creating a regex to capture the first character, the dash and also the first character after the dash. I'm a bit of a newbie to regex, so forgive me.
Here is what I've got so far: (^.)
"capture the first character, the dash and also the first character after the dash"
With preg_match function:
$s = "all-days";
preg_match('/^(.)[^-]*(-)(.)/', $s, $m);
unset($m[0]);
print_r($m);
The output:
Array
(
[1] => a
[2] => -
[3] => d
)
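This is not a regex, but if you just want a working solution, the same result can be achieved another way, with explode, array_walk and implode:
$string = 'all-days-with-my-style';
$arr = explode("-", $string);
// array_walk changes $arr in place (the callback takes each element by reference);
// its return value is only a bool, so there is nothing useful to assign.
array_walk($arr, function (&$a) {
    $a = $a[0];
});
echo implode("-", $arr);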
Live demo : https://eval.in/882846
Output is : a-d-w-m-s
I assume your string only contains word characters and hyphens, and doesn't have consecutive hyphens.
To remove everything except the first character, the hyphens and the first character after each hyphen, remove all word characters that do not follow a word boundary:
$result = preg_replace('~\B\w+~', '', 'all-days');
If you only want to match these characters, just catch each character after a word boundary:
if ( preg_match_all('~\b.~', 'all-days', $matches) )
print_r($matches[0]);
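A minimal sketch of the two word-boundary approaches side by side, wrapped in helper functions whose names (initials_by_replace, initials_by_match) are invented here for illustration. It relies on the same assumption as the answer: only word characters and single hyphens in the input.

```php
<?php
// Remove every run of word characters that does not start at a word boundary.
function initials_by_replace(string $s): string
{
    return preg_replace('~\B\w+~', '', $s);
}

// Collect each single character that directly follows a word boundary.
function initials_by_match(string $s): array
{
    preg_match_all('~\b.~', $s, $matches);
    return $matches[0];
}

echo initials_by_replace('all-days'), PHP_EOL;               // a-d
echo initials_by_replace('all-days-with-my-style'), PHP_EOL; // a-d-w-m-s
print_r(initials_by_match('all-days'));                      // a, -, d
```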
Code
See code in use here
\b(\w|-\b)
For more precision, the following can be used (note that it uses Unicode groups, so it doesn't work in every language, but it does in PHP). This will only match letters, not numbers and underscores. It uses a negative lookbehind and positive lookahead, but you can understand it if you keep reading this article and break it apart one piece at a time.
(\b\p{L}|(?<=\p{L})-(?=\p{L}))
Explanation
\b Assert position at a word boundary
(\w|-\b) Capture the following into capture group 1
    \w Match any word character
    | Or
    - Match the - character literally
    \b Assert position at a word boundary
\b:
Asserts the position in the string matches 1 of the following:
    ^\w Assert position at the start of the string and match a word character
    \w$ Match a word character and assert its position as the last position in the string
    \W\w Match any non-word character, followed by a word character
    \w\W Match any word character, followed by a non-word character
\w:
Means a word character (usually defined by any character in the set a-zA-Z0-9_, however, some languages also accept Unicode characters that represent any letter, number, or underscore \p{L}\p{N}_).
For more precision (depending on the use-case), you can specify [a-zA-Z] (for ASCII letters), \p{L} for Unicode letters, or [a-z] with the i flag for ASCII characters with the case-insensitive flag enabled in regex.
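The difference between the \w-based pattern and the letter-only pattern shows up as soon as digits are involved. The sketch below assumes a small helper (match_all, a name made up here) and a second test string, 'r2-d2', that is not from the question.

```php
<?php
// Hypothetical helper: return all full matches of $pattern in $subject.
function match_all(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m);
    return $m[0];
}

$wordBased   = '~\b(\w|-\b)~';                        // letters, digits and underscore count
$letterBased = '~(\b\p{L}|(?<=\p{L})-(?=\p{L}))~u';   // only Unicode letters count

print_r(match_all($wordBased, 'all-days'));    // a, -, d
print_r(match_all($letterBased, 'all-days'));  // a, -, d

// With digits next to the hyphen, the letter-only version drops the hyphen.
print_r(match_all($wordBased, 'r2-d2'));       // r, -, d
print_r(match_all($letterBased, 'r2-d2'));     // r, d
```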

Regular expressions, allow specific format only. "John-doe"

I've researched a little, but I found nothing that relates exactly to what I need and whenever tried to create the expression it is always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either text can be alphabetical or numeric; the only symbol allowed is -, and it must sit between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%
You need to use anchors, and you need the - inside the character class so that A-Z, a-z and 0-9 are read as ranges rather than as the individual characters A, Z, a, z, 0 and 9.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d meta character.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
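A quick way to check the anchored pattern against the valid and invalid examples from the question is sketched below. The function name is_valid_pair is made up, and the D modifier is an addition of this sketch (not part of the answer): without it, $ would also accept a single trailing newline.

```php
<?php
// Hypothetical validator built on the anchored pattern from the answer.
function is_valid_pair(string $s): bool
{
    // D (PCRE_DOLLAR_ENDONLY): $ matches only at the very end of the subject.
    return preg_match('/^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$/D', $s) === 1;
}

$valid   = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk'];
$invalid = ['Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];

foreach (array_merge($valid, $invalid) as $candidate) {
    echo str_pad($candidate, 12), is_valid_pair($candidate) ? 'valid' : 'invalid', PHP_EOL;
}
```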
If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.
You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot had a good comment (matching accented characters, that is). To achieve this, simply use the u modifier.
See a demo on regex101.com and a complete code on ideone.com.
You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // matches "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string
Regex Demo
And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
Demo at regex101
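The sketch below illustrates two points about that short pattern: [^\W_] rejects the underscore that \w accepts, and the subroutine call (?1) reuses the pattern of group 1 rather than the text it matched (which is what a backreference \1 would do). The test strings are made up for illustration.

```php
<?php
$withWord       = '/^\w{3,}-\w{3,}$/';
$withSubroutine = '/^([^\W_]{3,8})-(?1)$/';
$withBackref    = '/^([^\W_]{3,8})-\1$/';   // for contrast only

// Underscore: accepted by \w, rejected by [^\W_]
var_dump(preg_match($withWord, 'abc_d-efg'));        // int(1)
var_dump(preg_match($withSubroutine, 'abc_d-efg'));  // int(0)

// (?1) re-runs the sub-pattern, so the two halves may differ...
var_dump(preg_match($withSubroutine, 'Abc-123'));    // int(1)
// ...whereas \1 demands the very same text on both sides.
var_dump(preg_match($withBackref, 'Abc-123'));       // int(0)
var_dump(preg_match($withBackref, 'Abc-Abc'));       // int(1)
```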
My vote goes to @chris85's answer, which is the most obvious and performant.
This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

how to use preg_split() in php?

Can anybody explain to me how to use preg_split() function?
I didn't understand the pattern parameter like this "/[\s,]+/".
for example:
I have this subject: is is. and I want the results to be:
array (
0 => 'is',
1 => 'is',
)
so it will ignore the space and the full-stop, how I can do that?
"preg" means "PCRE REGexp", which is kind of redundant, since "PCRE" already means "Perl Compatible Regular Expressions".
Regexps are a nightmare to the beginner. I still don’t fully understand them and I’ve been working with them for years.
Basically the example you have there, broken down is:
"/[\s,]+/"
/ = delimiter marking the start and end of the pattern
[ ... ] = character class: matches any one of the characters listed inside
+ = one or more of the preceding character or group
\s = any whitespace character (space, tab, newline)
, = the literal comma character
So you have a search pattern that is "split on any part of the string that is at least one whitespace character and/or one or more commas".
Other common characters are:
. = any single character
* = zero or more of the preceding character or group
^ (at start of pattern) = the start of the string
$ (at end of pattern) = the end of the string
^ (at the start of [...]) = "NOT" any of the characters that follow in the class
For PHP there is good information in the official documentation.
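To see the breakdown in action, here is a minimal sketch. The first input string is just an illustrative example; the second call swaps the comma in the class for a full stop (my adaptation, not part of the pattern explained above) so that it fits the "is is." subject from the question.

```php
<?php
// Split on any run of whitespace and/or commas.
$keywords = preg_split('/[\s,]+/', "hypertext language, programming");
print_r($keywords);   // hypertext, language, programming

// Same idea for the question's subject: split on whitespace and/or full stops.
// PREG_SPLIT_NO_EMPTY drops the empty piece left after the final ".".
$words = preg_split('/[\s.]+/', 'is is.', -1, PREG_SPLIT_NO_EMPTY);
print_r($words);      // is, is
```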
This should work:
$words = preg_split("/(?<=\w)\b\s*[!?.]*/", 'is is.', -1, PREG_SPLIT_NO_EMPTY);
echo '<pre>';
print_r($words);
echo '</pre>';
The output would be:
Array
(
[0] => is
[1] => is
)
Before I explain the regex, just an explanation on PREG_SPLIT_NO_EMPTY. That basically means only return the results of preg_split if the results are not empty. This assures you the data returned in the array $words truly has data in it and not just empty values which can happen when dealing with regex patterns and mixed data sources.
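The effect of PREG_SPLIT_NO_EMPTY is easiest to see by running the same split with and without the flag. This is only a sketch; the second input ("a,,b") is invented to show an empty piece in the middle of the result.

```php
<?php
$pattern = '/(?<=\w)\b\s*[!?.]*/';

// Without the flag, the text after the final "." is returned as an empty string.
$withEmpty = preg_split($pattern, 'is is.');
print_r($withEmpty);      // is, is, (empty string)

// With the flag, empty pieces are dropped.
$withoutEmpty = preg_split($pattern, 'is is.', -1, PREG_SPLIT_NO_EMPTY);
print_r($withoutEmpty);   // is, is

// Empty pieces can also appear between two adjacent delimiters.
$csvAll   = preg_split('/,/', 'a,,b');
$csvClean = preg_split('/,/', 'a,,b', -1, PREG_SPLIT_NO_EMPTY);
print_r($csvAll);
print_r($csvClean);
```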
And the explanation of that regex can be broken down like this using this tool:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
[!?.]* any character of: '!', '?', '.' (0 or more
times (matching the most amount possible))
A nicer explanation can be found by entering the full regex pattern of /(?<=\w)\b\s*[!?.]*/ in this other tool:
(?<=\w) Positive Lookbehind - Assert that the regex below can be matched
\w match any word character [a-zA-Z0-9_]
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
\s* match any white space character [\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[!?.]* match a single character in the list !?. literally (again between zero and unlimited times, greedy)
That last regex explanation can be boiled down by a human (also known as me) to the following:
Split at every word boundary that follows a word character, and swallow any whitespace and any of the punctuation marks !?. that come right after it.
PHP's str_word_count may be a better choice here.
str_word_count($string, 1) returns a plain list of all words in the string, including duplicates; str_word_count($string, 2) returns the same words keyed by their position in the string.
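For reference, a short sketch of what the different format values of str_word_count return for the question's subject. Format 1 gives the zero-indexed list the asker wants; format 2 uses each word's byte offset in the string as its key.

```php
<?php
$subject = 'is is.';

$count      = str_word_count($subject);      // format 0 (default): number of words
$list       = str_word_count($subject, 1);   // list of words
$byPosition = str_word_count($subject, 2);   // words keyed by their offset in the string

var_dump($count);
print_r($list);
print_r($byPosition);
```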
Documentation says:
The preg_split() function operates exactly like split(), except that
regular expressions are accepted as input parameters for pattern.
So, the following code...
<?php
$ip = "123 ,456 ,789 ,000";
$iparr = preg_split ("/[\s,]+/", $ip);
print "$iparr[0] <br />";
print "$iparr[1] <br />" ;
print "$iparr[2] <br />" ;
print "$iparr[3] <br />" ;
?>
This will produce the following result:
123
456
789
000
So, if you have this subject: is is and you want:
array (
0 => 'is',
1 => 'is',
)
you need to modify your regex to "/[\s]+/".
If you have is ,is instead, you need the regex you already have: "/[\s,]+/"

Can you explain/simplify this regular expression (PCRE) in PHP?

preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $contents, $matches);
I need to debug this one. I have a good idea of what it's doing but since I was never an expert at regular expressions I need your help.
Can you tell me what it does block by block (so I can learn)?
Can the syntax be simplified (I think there is no need to escape the dot with a backslash)?
The regexp...
'/.*MyString[ (\/]*([a-z0-9\.\-]*)/i'
.* matches any character zero or more times
MyString matches that string. But you are using case-insensitive matching, so the matched string will spell "mystring" in any capitalization.
EDIT: (Thanks to Alan Moore) [ (\/]*. This matches any of the characters space, ( or /, repeated zero or more times. As Alan points out, the escape before the final / stops the / from being treated as the regexp delimiter.
EDIT: The ( does not need escaping and neither does the . (thanks AlexV) because:
All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.
-- http://www.php.net/manual/en/regexp.reference.character-classes.php
The hyphen generally does need to be escaped, otherwise it will try to define a range. For example:
[A-Z] // matches all upper case letters of the alphabet
[A\-Z] // matches 'A', '-', and 'Z'
However, where the hyphen is at the end of the list you can get away with not escaping it (but it is always best to be in the habit of escaping it... I got caught out by this).
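The three ways of writing a hyphen inside a character class can be compared with a small sketch like the one below. The helper name in_class is made up for illustration; it simply anchors the class and tests a single character against it.

```php
<?php
// Hypothetical helper: does the single character $char match the class $class?
function in_class(string $class, string $char): bool
{
    return preg_match('/^' . $class . '$/', $char) === 1;
}

var_dump(in_class('[A-Z]', 'M'));    // true:  unescaped hyphen defines the range A..Z
var_dump(in_class('[A\-Z]', 'M'));   // false: escaped hyphen is a literal '-'
var_dump(in_class('[A\-Z]', '-'));   // true
var_dump(in_class('[AZ-]', '-'));    // true:  a trailing hyphen is also literal
var_dump(in_class('[AZ-]', 'M'));    // false
```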
([a-z0-9\.\-]*) matches any string containing the characters a through z (note again this is affected by the case-insensitive match), 0 through 9, a dot or a hyphen, repeated zero or more times. The surrounding brackets () tell preg_match to "capture" this string, which means that $matches[1] will contain the string matched by [a-z0-9\.\-]*.
e.g.
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => aslghklfjMyString(james321-james.org
[1] => james321-james.org
)
Note that because you use a case insensitive match...
$input = "aslghklfjmYsTrInG(james321898-james.org)blahblahblah";
will also match, and $matches[1] will again hold the captured address (here james321898-james.org).
Hope this helps....
Let's break this down step by step, removing the explained parts from the expression.
"/.*MyString[ (\/]*([a-z0-9\.\-]*)/i"
Let's first strip the regex delimiters (/i at the end means it's case-insensitive):
".*MyString[ (\/]*([a-z0-9\.\-]*)"
Then we've got a greedy wildcard, .* (match any character any number of times, then backtrack until the rest of the pattern matches).
"MyString[ (\/]*([a-z0-9\.\-]*)"
Then match 'MyString' literally, followed by any number (note the '*') of any of the following: ' ', '(', '/'. Inside a character class the '(' is literal and needs no escaping; the '/' must stay escaped only because it is also the pattern delimiter.
"([a-z0-9\.\-]*)"
Then we get a capture group for any number of any of the following: a-z literals, 0-9 digits, '.', or '-'.
That's pretty much all of it.
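One possible simplification, offered as a sketch rather than a drop-in replacement: use ~ as the delimiter so / needs no escape, drop the unnecessary escapes inside the character class, and drop the leading .*. The last change is not purely cosmetic: with the greedy .* the pattern finds the last "MyString" in the input, while without it the first one is found. The second test input is made up to show that difference.

```php
<?php
$original   = '/.*MyString[ (\/]*([a-z0-9\.\-]*)/i';
$simplified = '~MyString[ (/]*([a-z0-9.-]*)~i';

$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match($original, $input, $a);
preg_match($simplified, $input, $b);
echo $a[1], PHP_EOL;   // james321-james.org
echo $b[1], PHP_EOL;   // james321-james.org

// Hypothetical input with two occurrences: the patterns pick different ones.
$twice = 'MyString(first) MyString(second)';
preg_match($original, $twice, $c);
preg_match($simplified, $twice, $d);
echo $c[1], PHP_EOL;   // second (the greedy .* skips ahead to the last occurrence)
echo $d[1], PHP_EOL;   // first
```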
