I need to get all matches in a string where a word begins with # and then contains only alphanumeric (0-9, a-z) characters. For example, from the string #ww#ee x##vx #ss #aa assadd #sfsd I need to get these pieces:
#ss
#aa
#sfsd
I am trying:
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";
preg_match_all("#(^|\s)\#([0-9a-z]+)(\s+|$)#ui", $str, $matches);
var_dump( $matches );
But this gives only #ss and #sfsd, and skips #aa.
What would be right pattern for this?
You can use the following regex
'~\B(?<!#)#([0-9a-z]+)(?:\s|$)~iu'
See the regex demo and here is an IDEONE demo:
$re = '~\B(?<!#)#([0-9a-z]+)(?:\s|$)~ui';
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";
preg_match_all($re, $str, $matches);
print_r($matches);
The regex explanation:
\B - matches a non-word boundary location (that is, everywhere except between ^ and \w, \w and $, \W and \w, or \w and \W)
(?<!#) - fail the match if there is a # before the current location
# - a # symbol (does not have to be escaped)
([0-9a-z]+) - Group 1 (since the (...) are not escaped, they capture a subpattern and store it in a special memory slot)
(?:\s|$) - a non-capturing group (only meant to group alternatives) matching a whitespace (\s) or $.
The u and i modifiers after the closing ~ delimiter allow proper handling of Unicode strings (u) and make the pattern case insensitive (i).
Note that \B requires a non-word character (or the start of the string) before #. But you also do not want to match when another # precedes the #wwww-like string, so the negative lookbehind (?<!#) restricts the matches even further.
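The pattern above consumes the whitespace that follows each tag, so the tag text is in Group 1 rather than in the full match. That consumption is also why the original attempt skipped #aa: its (\s+|$) ate the space that the next match needed for (^|\s). A minimal alternative sketch, assuming zero-width whitespace boundaries suit your input (this variant is an illustration, not the pattern from the answer above), makes the full match exactly the tag:

```php
<?php
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";

// (?<!\S) : no non-whitespace character before '#'
//           (so '#' is at the start of the string or follows whitespace)
// (?!\S)  : no non-whitespace character after the tag
//           (so the tag is followed by whitespace or the end of the string)
// Both checks are zero-width, so no whitespace is consumed between tags.
preg_match_all('~(?<!\S)#[0-9a-z]+(?!\S)~i', $str, $matches);

print_r($matches[0]); // #ss, #aa, #sfsd
```

Because nothing around the tag is consumed, adjacent tags separated by a single space are all found.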
Related
I have a string that I want to match with php regex.
$string = "~word1 ~word2 ~word3 word4";
I want to match all words that do not start with the ~ sign. In PHP I have tried this, but it does not work:
preg_match("/(?!~)(?<words>[a-zA-Z0-9\.\_])/i", $string, $matches);
var_dump($matches);
But it is not working.
For this purpose you may use this regex with a negative lookbehind:
/(?<!~)\b\w[\w.-]*/
RegEx Details:
(?<!~): Negative lookbehind to fail the match if ~ is at previous position
\b: Word boundary
\w: Match a word character
[\w.-]*: Match 0 or more of word character or . or -
You can set a whitespace boundary to the left, and only match the allowed characters in the character class.
Note that you have to repeat the character class with +, or else you will match only a single character.
(?<!\S)(?<words>[a-zA-Z0-9._]+)
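As a rough sketch, both suggested patterns can be tried side by side on the string from the question. preg_match_all is used here instead of preg_match because preg_match stops after the first match; the variable names $byLookbehind and $byBoundary are made up for this illustration.

```php
<?php
$string = "~word1 ~word2 ~word3 word4";

// Lookbehind variant. '/' is the delimiter because the pattern itself
// contains a literal '~'.
preg_match_all('/(?<!~)\b\w[\w.-]*/', $string, $byLookbehind);

// Whitespace-boundary variant, keeping the named group from the question.
preg_match_all('/(?<!\S)(?<words>[a-zA-Z0-9._]+)/', $string, $byBoundary);

print_r($byLookbehind[0]);     // word4
print_r($byBoundary['words']); // word4
```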
How can I use preg_match_all() to get 1a1a-1a1a and 2B2B2-B2 from the following string:
$string = 'Hello #1a1a-1a1a and #2B2B2-B2 too';
My aim is to capture every # followed by a uuid.
I tried:
preg_match_all("/#(.*)/", $string, $matches);
preg_match_all("/#.*?/U", $string, $matches);
preg_match_all("/#([^\"]+)/si", $a, $matches);
but I can't make it work.
Use the /(?<=#)[\w-]+/ pattern, which matches any run of word characters or hyphens that follows a #:
preg_match_all("/(?<=#)[\w-]+/", $string, $matches);
print_r($matches[0]);
Output
Array
(
[0] => 1a1a-1a1a
[1] => 2B2B2-B2
)
Check result in demo
The #(.*) regex matches a # and then greedily any 0 or more chars other than line break chars (i.e. the rest of the line). /#.*?/U is a synonymous pattern: the U flag inverts greediness, so it is equal to /#.*/, except that the text after # is not captured into a group. #([^\"]+) matches # and captures into Group 1 any one or more chars other than ", so it will match either up to the first " or up to the end of the string if there is no ".
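To see this behaviour concretely, here is a small sketch that runs the three attempts (plus the lazy pattern without the U flag, for contrast) against the string from the question. The result variable names are made up for the illustration, and $string replaces the $a used in the third attempt.

```php
<?php
$string = 'Hello #1a1a-1a1a and #2B2B2-B2 too';

preg_match_all('/#(.*)/', $string, $greedy);          // greedy: runs to end of line
preg_match_all('/#.*?/U', $string, $withUFlag);       // U turns .*? back into a greedy quantifier
preg_match_all('/#.*?/', $string, $lazy);             // lazy without U: stops right after '#'
preg_match_all('/#([^"]+)/si', $string, $notQuote);   // no '"' in the input, so runs to the end

print_r($greedy[1]);    // 1a1a-1a1a and #2B2B2-B2 too
print_r($withUFlag[0]); // #1a1a-1a1a and #2B2B2-B2 too
print_r($lazy[0]);      // #, #
print_r($notQuote[1]);  // 1a1a-1a1a and #2B2B2-B2 too
```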
I suggest using
preg_match_all('~#\K[\w-]+~', $s, $matches)
See the regex demo. #\K[\w-]+ will match # and \K will remove it from the match, and [\w-]+ will match 1 or more word or - chars that will be returned.
To make the pattern a bit more restrictive, say, to only match letters or digits after # that can be hyphen separated, you may use
'~#\K[A-Z0-9]+(?:-[A-Z0-9]+)*~i'
See this regex demo. Here, [A-Z0-9]+ matches 1 or more alphanumeric chars and (?:-[A-Z0-9]+)* will match 0 or more repetitions of a - followed with 1+ alphanumeric chars. i modifier will make the pattern case insensitive.
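A short sketch of the difference between the two \K patterns. The trailing #-- token is not in the original question; it is added here only to show an input that the loose pattern accepts and the stricter one rejects.

```php
<?php
// '#--' is a made-up extra token for illustration.
$s = 'Hello #1a1a-1a1a and #2B2B2-B2 too #--';

// Loose: any run of word characters or hyphens after '#'.
preg_match_all('~#\K[\w-]+~', $s, $loose);

// Stricter: alphanumeric chunks separated by single hyphens.
preg_match_all('~#\K[A-Z0-9]+(?:-[A-Z0-9]+)*~i', $s, $strict);

print_r($loose[0]);  // 1a1a-1a1a, 2B2B2-B2, --
print_r($strict[0]); // 1a1a-1a1a, 2B2B2-B2
```

Since \K drops the # from the reported match, the values are read from index 0 and no capturing group is needed.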
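Your regexes are matching:
#(.*) Matches # and captures in a group any character 0+ times, greedily, including spaces, so it matches everything after the first # in your example
#.*? Matches # followed by any character 0+ times, non-greedily, so on its own it matches only the # (with the U flag from your attempt the quantifier becomes greedy again and it matches the rest of the line)
#([^\"]+) Matches # and captures in a group one or more characters that are not a ", so it matches everything after the first # in your example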
To capture every # followed by a uuid, you could use a character class to list what you would allow to match and repeat that pattern preceded by a dash in a non capturing group 1+ times.
If you want to match the uuid only, you could capture the values in a capturing group.
#([a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)+)
$string = 'Hello #1a1a-1a1a and #2B2B2-B2 too';
preg_match_all("/#([a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)+)/", $string, $matches);
print_r($matches[1]);
Result
Array
(
[0] => 1a1a-1a1a
[1] => 2B2B2-B2
)
Try this; it captures the run of word characters and hyphens after each '#', no matter how many characters:
preg_match_all("/#([\w-]*)/", $string, $matches);
We are extracting text from PDF files, and a high proportion of the results contain malformed text. Specifically, spaces are added between the characters of a word, e.g. SEATTLE is returned as S E A T T L E.
Is there a regular expression for preg_replace that can remove the spaces in a run of n single-character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
See the regex demo.
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, and other kinds of whitespace. In that case, replace the literal space with \h and add the u modifier:
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of:
  a space (use \h instead to match any kind of horizontal whitespace)
  [A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
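Putting the pieces together, a minimal runnable sketch of the callback approach might look like this. The function name collapseSpacedLetters is made up for the example, the arrow function assumes PHP 7.4 or later, and the second call illustrates the false-positive caveat from the NOTE above with an invented input.

```php
<?php
// Hypothetical helper: joins runs of 3+ single letters separated by one
// horizontal whitespace character each.
function collapseSpacedLetters(string $text): string
{
    $result = preg_replace_callback(
        '~\b\p{L}(?:\h\p{L}){2,}\b~u',
        // Strip every horizontal whitespace character inside the matched run.
        static fn(array $m): string => preg_replace('~\h+~u', '', $m[0]),
        $text
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8 input);
    // fall back to the unchanged text in that case.
    return $result ?? $text;
}

echo collapseSpacedLetters('Welcome to the Greater S E A T T L E area'), PHP_EOL;
// Welcome to the Greater SEATTLE area

echo collapseSpacedLetters('Option A B C'), PHP_EOL;
// Option ABC  (genuine one-letter words get merged too)
```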
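You could do this in one go. Note that this pattern enforces no minimum run length, so it also joins any two single letters and glues a one-letter word onto the next word (I am becomes Iam):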
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
See live demo here
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have at least 4 single letters (three letter-plus-whitespace pairs and one more letter) separated by 1+ horizontal whitespace
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1 which is the letter left of whitespace we capture
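A quick sketch of this pattern applied to the sentence from the question. The second input, Plan A B C, is invented here to show that a run of only three letters stays untouched, because the lookahead needs four.

```php
<?php
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';

// Each match is "letter + whitespace"; replacing it with $1 keeps the letter only.
$joined = preg_replace($re, '$1', 'Welcome to the Greater S E A T T L E area');
echo $joined, PHP_EOL;   // Welcome to the Greater SEATTLE area

// Only three single letters: below the threshold, so nothing changes.
$untouched = preg_replace($re, '$1', 'Plan A B C');
echo $untouched, PHP_EOL; // Plan A B C
```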
So basically I have links like these:
https://dog.example.com/randomgenerated45443444444444
https://turtle.example.com/randomgenerated45443
https://mice.example.com/randomgenerated452
https://monkey.example.com/randomgenerated43232323
https://leopard.example.com/randomgenerated22222222222222222
I was wondering whether it is possible to detect the word between https:// and .example.com/, which is the random animal name, and replace it with "thumbnail". The lengths of the animal names and of the random-generated parts always vary.
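You can use positive lookaheads to get to the data you want. Note that the match also includes the leading https://, which is why the protocol is prepended again after the replacement: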
$string = 'https://leopard.example.com/randomgenerated22222222222222222';
$pattern = '/(?=.*\/\/)(.*?)(?=\.)/';
$replacement = 'thumbnail';
$foo = preg_replace($pattern, $replacement, $string);
$protocol = 'https://';
echo $protocol . $foo;
returns
https://thumbnail.example.com/randomgenerated22222222222222222
Explanation of the regex:
Positive Lookahead (?=.*\/\/)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\/ matches the character / literally (case sensitive)
\/ matches the character / literally (case sensitive)
1st Capturing Group (.*?)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\.)
Assert that the Regex below matches
\. matches the character . literally (case sensitive)
Assuming that https:// and example.com never change, then this is the simplest regex you can use for the purpose:
https://(.+)\.example\.com
Anything in the (.+) will be the words you are attempting to extract.
Edit on 2016.10.27:
While the / character has no special meaning in Regular Expressions, it will likely need to be escaped (\/) if you are also using it as your expression delimiter. So the above will look like:
https:\/\/(.+)\.example\.com
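As a sketch of the replacement itself, the pattern can be written with ~ as the delimiter so the slashes need no escaping. This variant swaps (.+) for [^./]+ so that the match cannot run past the first dot; that change, and the names $urls and $result, are assumptions made for the illustration.

```php
<?php
$urls = [
    'https://dog.example.com/randomgenerated45443444444444',
    'https://mice.example.com/randomgenerated452',
];

// Group 1 keeps the protocol, group 2 keeps the fixed domain part;
// only the subdomain between them is replaced.
$pattern = '~^(https://)[^./]+(\.example\.com/)~';

// preg_replace accepts an array of subjects and returns an array.
$result = preg_replace($pattern, '${1}thumbnail$2', $urls);

print_r($result);
// https://thumbnail.example.com/randomgenerated45443444444444
// https://thumbnail.example.com/randomgenerated452
```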
I have a string like this:
05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700
I need to select pt_Product2017.9.abc.swl.px64_kor from that (it starts with pt_ and ends with _kor).
$str = "05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700";
preg_match('/^pt_*_kor$/',$str, $matches);
But it doesn't work.
You need to remove the anchors, add a \b at the beginning so that pt_ is only matched at a word boundary (at the start of the string or after a non-word character), and use \S with * (\S is a shorthand character class that matches any character but whitespace):
preg_match('/\bpt_\S*_kor/',$str, $matches);
See regex demo
In your regex, ^ and $ force the regex engine to search for pt at the beginning and _kor at the end of the string, and _* matches 0 or more underscores. Note that regex patterns are not the same as wildcards.
In case there can be whitespace between pt_ and _kor, use .*:
preg_match('/\bpt_.*_kor/',$str, $matches);
I should also mention greediness: if you have pt_something_kor_more_kor, the .*/\S* will match the whole string, but .*?/\S*? will match just pt_something_kor. Please adjust according to your requirements.
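The greediness difference can be checked with a short sketch. The first call uses the string from the question; the second string is the pt_something_kor_more_kor example mentioned above, and the result variable names are made up.

```php
<?php
$str = "05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700";
preg_match('/\bpt_\S*_kor/', $str, $m);
echo $m[0], PHP_EOL;      // pt_Product2017.9.abc.swl.px64_kor

$twice = 'pt_something_kor_more_kor';
preg_match('/\bpt_\S*_kor/', $twice, $greedy);  // greedy: extends to the last _kor
preg_match('/\bpt_\S*?_kor/', $twice, $lazy);   // lazy: stops at the first _kor
echo $greedy[0], PHP_EOL; // pt_something_kor_more_kor
echo $lazy[0], PHP_EOL;   // pt_something_kor
```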
^ and $ are the start and end of the complete string, not only of the matched part. So simply use (pt_.+_kor) to match everything between pt_ and _kor: preg_match('/(pt_.+_kor)/',$str, $matches);
Here's a demo: https://regex101.com/r/qL4fW9/1
The ^ and $ that you have used in the regular expression mean that the string should start with pt AND end with kor. But it neither starts with pt nor ends with kor (in fact, it ends with kor_7700).
Try removing the ^ and $, and you'll get the match:
preg_match('/pt_.*_kor/',$str, $matches);