Splitting a string using regex - php

I would like to split a string at every character that is a space or punctuation (excluding apostrophes). The following regex works as intended.
/[^a-z']/i
Words like I'll and Didn't are accepted, which is great.
The problem is with words like 'ere and 'im. I would like to remove the leading apostrophe so that I get the words ere and im.
Ideally I would like to handle this within the regex pattern itself, if possible.
Thanks in advance.

Try this
$str = "Words like I'll and Didn't are accepted, which is great.
The problem is with words like 'ere and 'im";
print_r(preg_split("/'?[^a-z']+'?/i", $str));
//Array ( [0] => Words [1] => like [2] => I'll [3] => and [4] => Didn't ...
// [16] => ere [17] => and [18] => im )
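One caveat: the split pattern above only strips an apostrophe that sits next to a delimiter, so a string that begins with 'ere would keep it. A minimal alternative sketch, assuming it is acceptable to match the words rather than split on the delimiters (a different technique from the answer above), lets an apostrophe count only when it has letters on both sides:

```php
<?php
// Sketch: match words instead of splitting. An apostrophe is kept only
// when it sits between letters, so leading apostrophes are dropped.
$str = "'ere we go: I'll and Didn't are accepted, but not 'im";

preg_match_all("/[a-z]+(?:'[a-z]+)*/i", $str, $m);
$words = $m[0];

print_r($words);
```

The trade-off is that trailing apostrophes (as in dogs') are dropped as well.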

Related

PHP preg_replace and explode function

I have some raw data like this:
\u002522\u00253A\u002522https\u00253A\u00255C\u00252F\u00255C\
My intention is to remove the backslash "\" and the first 7 characters after it from every segment. For a segment like \u002522https\ the output should be just https.
If a segment contains only those 7 characters, like \u002522\, the output should be empty.
Finally, I want to put every result into an array, which for the raw data above should look like this:
Array
(
[0] =>
[1] =>
[2] => https
[3] =>
[4] =>
[5] =>
[6] =>
)
I want this result for constructing a URL. I have tried preg_replace and explode to get the expected result, but I have failed.
$text = '\u002522\u00253A\u002522https\u00253A\u00255C\u00252F\u00255C\\';
$text = preg_replace("#(\\\\[a-z0-9]{7})#is",",",$text);
$text_array = explode(",",trim($text,'\\'));
print_r($text_array);
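The attempt above yields one empty element too many at the front, because the replaced string starts with a comma and explode() counts the empty piece before it. One possible approach, offered as a sketch rather than a definitive answer, is to capture whatever follows each 7-character escape with preg_match_all() instead of replacing and exploding. The variable name $parts is an assumption for illustration:

```php
<?php
// Same sample data as in the question; single quotes keep the backslashes literal.
$text = '\u002522\u00253A\u002522https\u00253A\u00255C\u00252F\u00255C\\';

// Match a backslash plus 7 alphanumerics, then capture everything
// up to the next backslash (possibly nothing).
preg_match_all('#\\\\[a-z0-9]{7}([^\\\\]*)#i', $text, $m);
$parts = $m[1];

print_r($parts);
```

For the sample data this gives seven elements with https at index 2, matching the array layout requested in the question.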

php regexp problems

I am following a simple tutorial that catches keywords automatically. The code is below:
$content = "#abc i love you #def #you , and you?";
preg_match_all("/[\n\r\t]*\#(.+?)\s/s",$content, $tag_matches);
print_r($tag_matches);
output:-
Array ( [0] => Array ( [0] => #abc [1] => #def [2] => #you ) [1] => Array ( [0] => abc [1] => def [2] => you ) )
Words prefixed with the '#' symbol are the keywords.
The output is correct, but if I put a punctuation symbol right after a keyword, e.g. #you, then the captured keyword becomes you, with the comma included. How do I filter out punctuation after keywords?
Besides this, if I write keywords together, like #def#you, the output is def#you, as a single keyword. Can anyone help me separate them?
Thanks all.
Try using a word boundary \b instead of whitespace \s. That will stop the match when it reaches anything other than a word character (i.e., [a-zA-Z0-9_]).
/[\n\r\t]*\#(.+?)\b/s
Conceptually, that's what you were trying to do anyway by putting whitespace there (i.e., denote end of word).
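You could try:
/[\n\r\t]*\#([\w]+)/s
The original .+? matches every character, punctuation included, until the first whitespace. Restricting the capture to word characters with [\w]+ stops it at punctuation or at the next #, so the trailing \s is no longer needed. If you have tags which are hyphenated you may want to add - inside of the brackets.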
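A minimal sketch of capturing only word characters after each #, using a sample string that combines the two problem cases from the question (trailing punctuation and keywords written together):

```php
<?php
$content = "#abc i love you #def#you, and you?";

// \w+ stops at the first non-word character, so the comma and the
// second # both end the current keyword.
preg_match_all('/#(\w+)/', $content, $tag_matches);
$tags = $tag_matches[1];

print_r($tags);
```

The leading [\n\r\t]* from the original pattern is left out here on the assumption that only the captured keyword matters.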

preg_match() behaving strange?

I want to test a URL against two regexes:
$reg1 = "/(^(((www\.))|(?!(www\.)))domain\.com\/paramsindex\/([a-z]+)\/([a-z]+)\/((([a-z0-9]+)(\-[a-z0-9]+){0,})(\/([a-z0-9]+)(\-[a-z0-9]+){0,}){0,})|()\/?$)/";
$reg2 = "/(^(((www\.))|(?!(www\.)))domain\.com\/paramsassoc\/([a-z]+)\/([a-z]+)\/((([a-z0-9]+)(\-[a-z0-9]+){0,})(\/([a-z0-9]+)(\-[a-z0-9]+){0,}){0,})|()\/?$)/";
$uri = "www.domain.com/paramsindex/cont/meth/par1/par2/par3/";
$r1 = preg_match($reg1, $uri);
echo "<p>First regex returned: {$r1}</p>";
$r2 = preg_match($reg2, $uri);
echo "<p>Second regex returned: {$r2}</p>";
Now these strings are not the same. The difference is this:
www.domain.com/paramsindex/cont/meth/par1/par2/par3/
vs.
www.domain.com/paramsassoc/cont/meth/par1/par2/par3/
And yet PHP preg_match returns 1 for both of them.
Now you will say this is a long regex and ask why I use it. I could build a shorter regex, but this one is built on the fly and... it just needs to be like that.
What bothers me is that in Rubular the regexes work as they should.
I was using Rubular when testing them, and now in PHP it won't work. I know Rubular is a Ruby regex editor, but I thought it should behave the same :(
Rubular testing: here
What is the problem here? How should I write that regex in PHP so preg_match can see the difference? The regex should stay as close as possible to the one I already wrote. Is there some simple fix to my problem? Something I'm overlooking?
That behavior is by design: preg_match returns 1 when a match is found (and 0 when there is none). If you want to capture matches, see the matches parameter at: http://php.net/manual/en/function.preg-match.php
Edit: For example
$matches = array();
$r2 = preg_match($reg2, $uri, $matches);
echo "<p>Second regex returned: ";
print_r($matches);
echo "</p>";
I'll leave the above to document my own stupidity for not answering the right question.
At the end of your regex you have |()\/?$)/, an alternative that matches just an optional slash at the end of the string, so the regex as a whole matches any input. Take it out and it looks like you're golden from my tests.
Always remember to group your operands!
I assume this one can be quite hard to spot, but it's all because of your use of the or-operator |. You are not grouping the operands correctly, which yields the result described in your post.
In the provided case, | separates the entire expression to its left from ()\/?$, so the regex matches either that full expression or just an optional slash at the end of the string.
To solve this issue you will need to put parentheses around the operands that should be ORed.
An easy way of seeing where everything goes wrong is to run the snippet below:
$reg1 = "/(^(((www\.))|(?!(www\.)))domain\.com\/paramsindex\/([a-z]+)\/([a-z]+)\/((([a-z0-9]+)(\-[a-z0-9]+){0,})(\/([a-z0-9]+)(\-[a-z0-9]+){0,}){0,})|()\/?$)/";
$reg2 = "/(^(((www\.))|(?!(www\.)))domain\.com\/paramsassoc\/([a-z]+)\/([a-z]+)\/((([a-z0-9]+)(\-[a-z0-9]+){0,})(\/([a-z0-9]+)(\-[a-z0-9]+){0,}){0,})|()\/?$)/";
$uri = "www.domain.com/paramsindex/cont/meth/par1/par2/par3/";
var_dump (preg_match($reg1, $uri, $match1));
var_dump (preg_match($reg2, $uri, $match2));
print_r ($match1);
print_r ($match2);
output
int(1)
int(1)
Array
(
[0] => www.domain.com/paramsindex/cont/meth/par1/par2/par3
[1] => www.domain.com/paramsindex/cont/meth/par1/par2/par3
[2] => www.
[3] => www.
[4] => www.
[5] =>
[6] => cont
[7] => meth
[8] => par1/par2/par3
[9] => par1
[10] => par1
[11] =>
[12] => /par3
[13] => par3
)
Array
(
[0] => /
[1] => /
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
)
As you can see, $reg2 matches only the trailing / in $uri and leaves every group empty, which is an indication of what I described earlier.
If you come up with a short description of what you are trying to do, I can provide you with a fully functional (and probably a bit neater than your current) regular expression.
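The grouping problem can be reproduced with a much shorter pattern. The two patterns below are hypothetical simplifications written for illustration, not the asker's generated regex; they only show how a top-level | differs from keeping the optional slash inside the anchored expression:

```php
<?php
// Hypothetical, simplified patterns (assumptions, not the original regex).
// Ungrouped: the | splits the pattern into "whole left side" OR "optional slash at end".
$ungrouped = '/(^domain\.com\/paramsassoc\/[a-z]+|()\/?$)/';
// Grouped: the optional slash is part of the single anchored expression.
$grouped = '/^domain\.com\/paramsassoc\/[a-z]+\/?$/';

$uri = 'domain.com/paramsindex/cont/';

$u = preg_match($ungrouped, $uri); // 1: the right-hand alternative matches the trailing "/"
$g = preg_match($grouped, $uri);   // 0: paramsindex is not paramsassoc

var_dump($u, $g);
```

With the slash folded into the anchored expression, the paramsindex URI no longer satisfies the paramsassoc pattern.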
Your RegEx is a mess and you will have to change it if you want it to work.
Check out the Rubular for your paramsindex: http://www.rubular.com/r/3ptjQ5aIrD
Now, for paramsassoc: http://www.rubular.com/r/o7GCbCsHyX
They both return a result. Sure, it's an array full of empty strings, but it is a result nonetheless.
That is why both are TRUE.

I need help with this regex pattern

Hi I have a problem with my regex pattern:
preg_match_all('/!!\d{3}/', '!!333!!333 !!333 test', $result);
I want this to match !!333 but not !!333!333. How can I modify this regex to match only a maximum length of 5 characters, that is, two ! and three digits?
/^!!\d{3}$/
You need the anchors: ^ matches the beginning of the string and $ the end. It's like saying: "It must begin at the start of the string and it must end at the end of it." If you omit one (or both), the pattern allows arbitrary characters at the beginning and/or the end.
Update
As I found out in the comments, the question was very misleading. I now suggest splitting the string before applying the pattern:
$string = '!!333!!333 !!333 test';
$result = array();
foreach (explode(' ', $string) as $index => $item) {
    if (preg_match('/^!!\d{3}$/', $item)) {
        $result[$index] = $item;
    }
}
This also preserves the index of each item. If you don't need it, remove the $index stuff or just ignore it ;)
It's much easier than trying to find a pattern that fulfills your request all at once.
^!!\d{3}$
You need to anchor your pattern.
If you want to match a string with !!333 in it, you may want something like:
(^|\s)!!\d{3}($|\s)
With further explanation we can have a further refinement:
(^|\s)!!\d{3}(?=$|\s)
Which will not capture the trailing space allowing multiple matches in the same line to match one after another.
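The difference between the last two patterns shows up when matches sit next to each other. A small sketch (the sample string and variable names are assumptions for illustration) counts the matches each pattern finds:

```php
<?php
$subject = '!!333 !!333 !!333';

// Consuming version: the trailing ($|\s) eats the space that the
// next match would need for its leading (^|\s).
$consumed = preg_match_all('/(^|\s)!!\d{3}($|\s)/', $subject, $a);

// Lookahead version: the trailing space is checked but not consumed.
$lookahead = preg_match_all('/(^|\s)!!\d{3}(?=$|\s)/', $subject, $b);

echo $consumed, ' ', $lookahead, PHP_EOL; // prints "2 3"
```

The consuming pattern skips the middle token, while the lookahead version finds all three.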
I find the easiest and most descriptive way to do this is with negative lookaheads and lookbehinds.
See:
preg_match_all('/(?<![^\s])!!\d{3}(?![^\s])/', '!!333 !!333!!333 !!333 test !!333', $result);
This says: match anything of the form !![0-9][0-9][0-9] which doesn't have anything other than a space in front or behind it. Note that these lookaheads/lookbehinds aren't matched themselves, they are "zero-width assertions", they are thrown away and so you only get "!!333" etc in your match, not " !!333" etc.
It returns
[0] => Array
(
[0] => !!333
[1] => !!333
[2] => !!333
)
)
Also
preg_match_all(
'/(?<![^\s])!!\d{3}(?![^\s])/',
'!!333 !!555 !!333 !!123 !!555 !!456 !!333 !!333 !!444 !!444 !!123 !!123 !!123!!123',
$result);
returns
[0] => Array
(
[0] => !!333
[1] => !!555
[2] => !!333
[3] => !!123
[4] => !!555
[5] => !!456
[6] => !!333
[7] => !!333
[8] => !!444
[9] => !!444
[10] => !!123
[11] => !!123
)
That is, all but the final !!123!!123, where the two tokens run together and are too long.
See Lookahead tutorial.

String has been split using punctuation as delimiters; how to reassemble and put the punctuation back in?

I'm implementing a profanity filter using a Trie data structure. Every swear word is added to the Trie. When I have a string to remove profanities from, I explode the string on punctuation and check every word against the Trie. If a word is found, I replace it with asterisks. Then I implode the string. The issue is, how do I keep track of the punctuation? In other words, how do I make sure the resulting string still has its punctuation?
If you are using preg_split() to split up your string, consider using the PREG_SPLIT_DELIM_CAPTURE flag to capture the punctuation with the matches.
Consider:
$str = "This. string/ has? punctuation!";
print_r(preg_split('/(\W+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE));
/*
Array
(
[0] => This
[1] => .
[2] => string
[3] => /
[4] => has
[5] => ?
[6] => punctuation
[7] => !
[8] =>
)
*/
See http://php.net/preg_split for more information.
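Putting the pieces together, the split, check and reassemble steps might look like the sketch below. The censor() function and the $banned array are hypothetical; the array is only a stand-in for the Trie lookup described in the question:

```php
<?php
// Hypothetical word list standing in for the Trie lookup.
$banned = ['darn' => true, 'heck' => true];

function censor(string $text, array $banned): string
{
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes hold the words and
    // odd indexes hold the captured punctuation/whitespace.
    $parts = preg_split('/(\W+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0 && isset($banned[strtolower($part)])) {
            $parts[$i] = str_repeat('*', strlen($part));
        }
    }
    // The delimiters are still in the array, so a plain implode restores them.
    return implode('', $parts);
}

$clean = censor('Well, darn. What the heck?', $banned);
echo $clean, PHP_EOL;
```

Because the delimiters stay in the array in their original positions, no separate bookkeeping for the punctuation is needed.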
