PHP regular expression start and end with given strings - php

I have a string like this:
05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700
I need to select pt_Product2017.9.abc.swl.px64_kor from it (start with pt_ and end with _kor):
$str = "05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700";
preg_match('/^pt_*_kor$/',$str, $matches);
But it doesn't work.

You need to remove the anchors, add a \b at the beginning so that pt_ is matched only when it is not preceded by a word character, and use \S with * (\S is a shorthand character class that matches any character but whitespace):
preg_match('/\bpt_\S*_kor/',$str, $matches);
In your regex, ^ and $ force the regex engine to search for pt at the beginning and _kor at the end of the string, and _* matches 0 or more underscores. Note that regex patterns are not the same as wildcards.
In case there can be whitespace between pt_ and _kor, use .*:
preg_match('/\bpt_.*_kor/',$str, $matches);
I should also mention greediness: if you have pt_something_kor_more_kor, the .*/\S* will match the whole string, but .*?/\S*? will match just pt_something_kor. Please adjust according to your requirements.
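A minimal sketch of the greedy versus lazy difference, using the sample string from the question plus a made-up string that contains _kor twice ($twice, $greedy and $lazy are illustrative names, not from the original):

```php
<?php
// Sample string from the question.
$str = "05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700";

// \S* is greedy, but this string contains only one "_kor",
// so the match ends right after it.
preg_match('/\bpt_\S*_kor/', $str, $m);
echo $m[0], "\n"; // pt_Product2017.9.abc.swl.px64_kor

// Hypothetical string with two "_kor" occurrences.
$twice = "pt_something_kor_more_kor";

preg_match('/\bpt_\S*_kor/', $twice, $greedy); // greedy: runs to the last _kor
preg_match('/\bpt_\S*?_kor/', $twice, $lazy);  // lazy: stops at the first _kor

echo $greedy[0], "\n"; // pt_something_kor_more_kor
echo $lazy[0], "\n";   // pt_something_kor
```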

^ and $ are the start and end of the complete string, not only of the matched part. So simply use (pt_.+_kor) to match everything between pt_ and _kor:
preg_match('/(pt_.+_kor)/',$str, $matches);
Here's a demo: https://regex101.com/r/qL4fW9/1

The ^ and $ that you have used in the regular expression mean that the string should start with pt AND end with kor. But it neither starts with pt nor ends with kor (in fact, it ends with kor_7700).
Try removing the ^ and $, and you'll get the match:
preg_match('/pt_.*_kor/',$str, $matches);
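To see why the anchored pattern fails, here is a small sketch. The second test string, pt___kor, is made up purely to show what the original pattern actually accepts; the variable names are illustrative.

```php
<?php
$str = "05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700";

// Pattern from the question: the WHOLE string must be "pt",
// then zero or more underscores, then "_kor".
$anchored = '/^pt_*_kor$/';

$onSample = preg_match($anchored, $str);        // 0: string does not start with pt
$onUnderscores = preg_match($anchored, 'pt___kor'); // 1: only underscores in between

var_dump($onSample, $onUnderscores);

// Without the anchors, and with .* instead of _*, the substring is found.
preg_match('/pt_.*_kor/', $str, $matches);
echo $matches[0], "\n"; // pt_Product2017.9.abc.swl.px64_kor
```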

Related

php preg_replace and newline characters [duplicate]

I use a regex pattern in the preg_match PHP function. Say the pattern is '/abc$/'. It matches both strings:
'abc'
and
'abc
'
The second one has the line break at its end. What would be the pattern that matches only this first string?
'abc'
The reason why /abc$/ matches both "abc\n" and "abc" is that $ matches the location at the end of the string, or (even without /m modifier) the position before the newline that is at the end of the string.
You need the following regex:
/abc\z/
where \z is the unambiguous very end of the string, or
/abc$/D
where the /D modifier will make $ behave the same way as \z. See PHP.NET:
The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time.
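A short sketch comparing the three patterns on both strings (the variable names are illustrative):

```php
<?php
$plain = "abc";
$withNewline = "abc\n";

// For each pattern, record [result on "abc", result on "abc\n"].
$results = [];
foreach (['/abc$/', '/abc\z/', '/abc$/D'] as $pattern) {
    $results[$pattern] = [
        preg_match($pattern, $plain),
        preg_match($pattern, $withNewline),
    ];
    printf("%-9s plain=%d newline=%d\n", $pattern, ...$results[$pattern]);
}
// /abc$/    plain=1 newline=1
// /abc\z/   plain=1 newline=0
// /abc$/D   plain=1 newline=0
```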

How to use preg_replace to remove excessive single spaces

We are extracting text from PDF files, and there is a high frequency of results that contain malformed text. Specifically, spaces are added between the characters of a word, e.g. SEATTLE is being returned as S E A T T L E.
Is there a RegEx expression for preg_replace that can remove any spaces in the case of n number of single character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters and is separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
    return str_replace(" ", "", $m[0]);
}, $s);
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, other kind of whitespace. Then, use
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
(Note the \h in place of the literal space, and the added u modifier.)
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of
    a space (\h matches any kind of horizontal whitespace)
    [A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
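A runnable sketch of the callback approach on the sentence from the question. The second input is made up to illustrate the one-letter-word caveat mentioned above; $collapse and $caveat are illustrative names.

```php
<?php
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';

// Remove the spaces inside each matched run of single letters.
$collapse = function (array $m): string {
    return str_replace(' ', '', $m[0]);
};

$s = "Welcome to the Greater S E A T T L E area";
$result = preg_replace_callback($regex, $collapse, $s);
echo $result, "\n"; // Welcome to the Greater SEATTLE area

// Caveat: a genuine one-letter word next to spaced letters gets merged too.
$caveat = preg_replace_callback($regex, $collapse, "grade a B C student");
echo $caveat, "\n"; // grade aBC student
```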
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
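A quick check of this pattern on the sentence from the question. The second input is a made-up example showing that, as written, the first alternative has no word boundary after its second letter, so a single-letter word followed by an ordinary word is merged as well. It is worth testing against your own data.

```php
<?php
$re = '~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~';

$str = "Welcome to the Greater S E A T T L E area";
$fixed = preg_replace($re, '$1$2$3', $str);
echo $fixed, "\n"; // Welcome to the Greater SEATTLE area

// Hypothetical input: "B" is a legitimate one-letter word here.
$merged = preg_replace($re, '$1$2$3', "Plan B works");
echo $merged, "\n"; // Plan Bworks
```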
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have 3+ single letters separated by 1+ spaces
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1 which is the letter left of whitespace we capture
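A small sketch running this pattern on the sentence from the question, plus a made-up input with only three spaced letters, which the {3} lookahead leaves alone (it needs four or more single letters):

```php
<?php
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';

$str = "Welcome to the Greater S E A T T L E area";
$repl = preg_replace($re, '$1', $str);
echo $repl, "\n"; // Welcome to the Greater SEATTLE area

// Hypothetical input: only three single letters, so nothing is joined.
$short = preg_replace($re, '$1', "the A B C list");
echo $short, "\n"; // the A B C list
```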

Get all pieces from string, when it begins with #

I need to get all matches in a string where a word begins with # and then contains only alphanumeric (0-9a-z) characters. For example, from this string #ww#ee x##vx #ss #aa assadd #sfsd I need to get these pieces:
#ss
#aa
#sfsd
I am trying:
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";
preg_match_all("#(^|\s)\#([0-9a-z]+)(\s+|$)#ui", $str, $matches);
var_dump( $matches );
But this gives only #ss and #sfsd, and skips #aa.
What would be the right pattern for this?
You can use the following regex
'~\B(?<!#)#([0-9a-z]+)(?:\s|$)~iu'
Here is a demo:
$re = '~\B(?<!#)#([0-9a-z]+)(?:\s|$)~ui';
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";
preg_match_all($re, $str, $matches);
print_r($matches);
The regex explanation:
\B - match a non-word boundary location (that is, everywhere except between ^ and \w, \w and $, \W and \w, \w and \W)
(?<!#) - fail the match if there is a # before the current location
# - a # symbol (does not have to be escaped)
([0-9a-z]+) - Group 1 (since the (...) are not escaped, they capture a subpattern and store it in a special memory slot)
(?:\s|$) - a non-capturing group (only meant to group alternatives) matching a whitespace (\s) or $.
The ~ui modifiers allow proper handling of Unicode strings (u) and make the pattern case insensitive (i).
Note that \B requires a non-word character (or the start of the string) before #. But you do not want to match if another # precedes the #wwww-like string, so the negative lookbehind (?<!#) restricts the matches even further.
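As a sketch of how the captured names can be turned back into the pieces the question asks for: Group 1 holds the text after #, so prefixing it again gives the tags ($tags is an illustrative name, not from the original).

```php
<?php
$re = '~\B(?<!#)#([0-9a-z]+)(?:\s|$)~ui';
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";

preg_match_all($re, $str, $matches);

// $matches[0] holds the full matches (including any trailing whitespace),
// $matches[1] holds only the alphanumeric part after "#".
$tags = array_map(function ($name) {
    return '#' . $name;
}, $matches[1]);

print_r($tags); // #ss, #aa, #sfsd
```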

preg_match start and end of string and replace

Could someone help with a preg_match expression? I need it to match the - (dash) character at the start and end of a string. This is for tags, e.g. -my-tag- should become my-tag. So it should only match at the start and end of the string and replace those characters with an empty string.
You can do that with this easy expression:
$string = "-my-tag-";
$tag = preg_replace("/^-(.*)-$/", "$1", $string);
^ and $ match the start and the end of the string, while (.*) captures everything in between.
You can read more about regular expressions in the official PHP Documentation.
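A short sketch of how this expression behaves. Note that it only replaces when a dash is present at both ends. The last pattern is a variation that is not from the answer above; it is an assumption about what you may want if a dash can appear at just one end.

```php
<?php
// Pattern from the answer: requires a dash at BOTH ends.
$stripBoth = function (string $s): string {
    return preg_replace('/^-(.*)-$/', '$1', $s);
};

$both = $stripBoth('-my-tag-');   // dashes at both ends are removed
$leading = $stripBoth('-my-tag'); // no trailing dash: left unchanged

echo $both, "\n";    // my-tag
echo $leading, "\n"; // -my-tag

// Variation (assumption): remove a dash at either end independently.
$either = preg_replace('/^-|-$/', '', '-my-tag');
echo $either, "\n";  // my-tag
```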

preg_match_all doesn't match when using a caret (^)

I'm using preg_match_all to find a URL in a HTML file. The URL always appears at the start of the line, with no leading space, like this:
<strong>Next</strong>
I used this to match it:
preg_match_all('|^<A HREF="(?<url>.*?)"><strong>Next</strong>|', $html, $url_matches);
It didn't work until I removed the caret (^) character. I thought that the caret matched the start of a line. Why is it causing my match to fail?
You have to add the m modifier:
preg_match_all('|^<A HREF="(?<url>.*?)"><strong>Next</strong>|m', $html, $url_matches);
Then ^ matches at the start of each line; otherwise it only matches at the start of the entire string.
More Info: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
^ matches start-of-string not start-of-line. Use the m ("multi-line") modifier: //m
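A minimal sketch of the difference. The $html content and the page2.html URL are made up for illustration; only the pattern comes from the question.

```php
<?php
// Hypothetical two-line HTML: the link starts the SECOND line.
$html = "<p>Some text</p>\n<A HREF=\"page2.html\"><strong>Next</strong></A>\n";

$pattern = '|^<A HREF="(?<url>.*?)"><strong>Next</strong>|';

// Without m, ^ only matches at the very start of $html.
$without = preg_match_all($pattern, $html, $noMatches);

// With m, ^ also matches right after each newline.
$with = preg_match_all($pattern . 'm', $html, $urlMatches);

var_dump($without);               // int(0)
var_dump($with);                  // int(1)
echo $urlMatches['url'][0], "\n"; // page2.html
```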
