preg match only one line in each match - php

I have this pattern. The matches must not span multiple lines (there must not be any newline character in a match), so I added the m modifier.
But sometimes there is still a \n in the matches. How can I prevent this?
preg_match_all('/(?<!\d|\d\D)(?:dk)?([\d\PL]{8,})/m', $input, $matches, PREG_PATTERN_ORDER);

The \PL pattern matches any character other than a Unicode letter, so it already matches digits and whitespace characters, line breaks included. That means [\d\PL] can be shortened to \PL. To subtract line breaks from it, switch to the opposite shorthand class (\pL), use it inside a negated bracket expression, [^\pL], and add \r and \n there:
'/(?<!\d|\d\D)(?:dk)?([^\pL\r\n]{8,})/u'
The m modifier is redundant since it only redefines the behavior of ^ and $ anchors. You might need the u modifier though, for the Unicode property class to work safely with Unicode strings in PHP/PCRE. Change \d to [0-9] and \D to [^0-9] if you only want to match ASCII digits.
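A minimal sketch of the difference between the two patterns. The input string is a made-up example, not data from the question; it only assumes two lines that each contain letters followed by digits and spaces.

```php
<?php
// Hypothetical input: two lines, each with letters followed by digits and spaces.
$input = "abc 1234 5678\ndef 8765 4321";

// Original pattern: \PL also matches "\n", so a match can swallow the line break.
preg_match_all('/(?<!\d|\d\D)(?:dk)?([\d\PL]{8,})/m', $input, $old, PREG_PATTERN_ORDER);

// Revised pattern: the negated class excludes letters, \r and \n.
preg_match_all('/(?<!\d|\d\D)(?:dk)?([^\pL\r\n]{8,})/u', $input, $new, PREG_PATTERN_ORDER);

var_dump($old[1]); // first capture ends with a newline
var_dump($new[1]); // no capture contains a newline
```

With this input, the first capture of the original pattern ends in "\n", while the revised pattern stops at the end of each line.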

Related

PHP regex match Latin words may contains symbols, digits and spaces

I have record names that mix Cyrillic and Latin words, symbols, spaces, digits, etc.
I need to preg_match (PHP) only the Latin part, with any symbols in any combination.
Test set:
БлаблаБла Uty-223
Блабла (бла.)Бла CAROP-C
Бла бла ST.MORITZ
Бла бла RAMIRO2-TED
LA PLYSGNE 1 H - 001
(Блабла) - the Cyrillic words don't matter.
So I tried this pattern:
/[-0-9a-zA-Z.]+/
But [Блабла (бла.)Бла CAROP-C] and [LA PLYSGNE 1 H - 001] are not matched as a whole string.
Next I tried to write a more flexible pattern:
/[-0-9a-zA-Z]+(?:.)?+(?:\s+)?+[-0-9a-zA-Z]+/
But there is still a problem with matching [LA PLYSGNE 1 H - 001].
Any idea how this can be solved?
Thanks.
If the . and - cannot occur at the beginning or end, you can start the match with one or more [0-9a-zA-Z] characters and then optionally repeat a group of one or more separator characters (. or - or horizontal whitespace) followed again by one or more [0-9a-zA-Z] characters:
\b[0-9a-zA-Z]+(?:[.\h-]+[0-9a-zA-Z]+)*\b
The \b is a word boundary preventing a partial word match
\h matches a horizontal whitespace character
See a regex101 demo.
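As a rough check, the first pattern can be run against the test set from the question. The u flag is an addition made here (it is not part of the answer's pattern) so that the Cyrillic text is read as UTF-8; the variable names are illustrative.

```php
<?php
// Test names from the question (the source file is assumed to be UTF-8).
$names = [
    'БлаблаБла Uty-223',
    'Блабла (бла.)Бла CAROP-C',
    'Бла бла ST.MORITZ',
    'Бла бла RAMIRO2-TED',
    'LA PLYSGNE 1 H - 001',
];

// Pattern from the answer; the u flag is an assumption added for UTF-8 input.
$pattern = '/\b[0-9a-zA-Z]+(?:[.\h-]+[0-9a-zA-Z]+)*\b/u';

$latin = [];
foreach ($names as $name) {
    // Keep the first match per name, or null when nothing matches.
    $latin[] = preg_match($pattern, $name, $m) ? $m[0] : null;
}

print_r($latin);
```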
Matching at least a single char [0-9a-zA-Z] with allowed chars . and - in the whole string, and asserting whitespace boundaries to the left and right
(?<!\S)[.-]*\b[0-9a-zA-Z](?:[0-9a-zA-Z.\h-]*[0-9a-zA-Z.-])?(?!\S)
(?<!\S) and (?!\S) are lookaround assertions that act as whitespace boundaries: they assert that there is no non-whitespace character to the left and to the right, respectively.
See a regex101 demo.
You can also use a script run starting with a Latin letter:
~(*sr:\p{Latin}.*\S)~u

Unexpected behavior of preg_replace() with regular expression containing \h on à [duplicate]

I want to remove all whitespace at the beginning of every line. I have two regex replacements:
$txt = preg_replace("/^ +/m", '', $txt);
$txt = preg_replace("/^[^\S\r\n]+/m", '', $txt);
Each of them matches a different kind of whitespace. However, both kinds may appear together and in any order, so I want to match all occurrences of them at the beginning of each line. How can I do that?
NOTE: The first regex matches an ideographic space, \u3000 char, which is only possible to check in the question raw body (SO rendering is not doing the right job here). The second regex matches only ASCII whitespace chars other than LF and CR. Here is a demo proving the second regex does not match what the first regex matches.
Since you want to remove any horizontal whitespace from a Unicode string, you need to use both of the following:
\h regex escape ("any horizontal whitespace character (since PHP 5.2.4)")
u modifier (see Pattern Modifiers)
Use
$txt = preg_replace("/^\h+/mu", '', $txt);
Details
^ - start of a line (m modifier makes ^ match all line start positions, not just string start position)
\h+ - one or more horizontal whitespaces
u modifier will make sure the Unicode text is treated as a sequence of Unicode code points, not just code units, and will make all regex escapes in the pattern Unicode aware.
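A short sketch of the replacement in action. The sample text is hypothetical: two lines indented with a mix of an ideographic space (U+3000), ordinary spaces and a tab.

```php
<?php
// Hypothetical sample: U+3000, space and tab before "foo"; space and U+3000 before "bar".
$txt = "\u{3000} \tfoo\n \u{3000}bar";

// m makes ^ match at every line start; u makes \h cover Unicode horizontal whitespace.
$clean = preg_replace('/^\h+/mu', '', $txt);

var_dump($clean); // both lines now start at their first visible character
```

Because \h never matches a line break, the newline between the two lines is left intact.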

Regex Matches white space but not tab (php)

How do I write a regex that matches whitespace but not tabs or newlines?
Thanks, everyone.
[[:blank:]]{2,} <-- This isn't good enough for me, because it matches spaces and tabs (though not newlines).
As per my original comment, you can use this.
Code
See regex in use here
Note: The link contains whitespace characters: tab, newline, and space. Only space is matched.
[^\S\t\n\r]
So your regex would be [^\S\t\n\r]{2,}
Explanation
[^\S\t\n\r] Match any character not present in the set.
\S Matches any non-whitespace character. Since it's a double negative it will actually match any whitespace character. Adding \t, \n, and \r to the negated set ensures we exclude those specific characters as well. Basically, this regex is saying:
Match any whitespace character except \t\n\r
This principle in regex is often used with word characters \w to negate the underscore _ character: [^\W_]
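The double-negative class can be tried on a small string. The input below is a made-up example that mixes runs of spaces, tabs and a newline.

```php
<?php
// Hypothetical input: two spaces, two tabs, a single space around a newline, three spaces.
$text = "a  b\t\tc \n d   e";

// Whitespace that is not a tab, LF or CR, repeated at least twice.
preg_match_all('/[^\S\t\n\r]{2,}/', $text, $m);

var_dump($m[0]); // only the runs of two and three spaces
```

The tab run and the single spaces next to the newline are skipped: tabs are excluded from the class, and the newline splits the surrounding spaces into runs that are too short.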
[ ]{2,} works normally (not sure about php)
or even / {2,}/

Regex match character and non-ascii characters

I am writing a script to clean up a file line-by-line with non-ascii characters, but I am having trouble with a regex pattern. I need a regex pattern that matches any line that starts with an asterisk, may have an equals, and will contain non-ascii characters and spaces. I know how to match a non-ascii character, but not in the same set as other positively defined characters.
Here is a sample line that I need to match:
* = Ìÿð ÿð
Here is the pattern I have so far:
/\*[^[:ascii:]]+[\r\n]/
This matches lines that start with an asterisk and contain non-ASCII characters, but not if the line also has spaces or an equals sign in it.
Try the following expression:
^\*\s*=?\s*[[:^ascii:]\s]+[\r\n]*$
This matches the start-of-line ^, then a literal asterisk \*, then zero or more whitespace characters \s*, followed by an optional equals sign =? and again zero or more whitespace characters \s*.
Next, [[:^ascii:]\s]+ matches one or more characters that are either non-ASCII or whitespace; check the docs for the syntax of character classes.
Finally, the expression matches any carriage returns and newlines that may end the line, followed by the end-of-line $.
Regex101 Demo
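A small sketch that runs the expression over a few lines. Only the first line comes from the question; the other three are hypothetical cases added for contrast, and the source file is assumed to be UTF-8.

```php
<?php
// Pattern from the answer, wrapped in PHP delimiters.
$pattern = '/^\*\s*=?\s*[[:^ascii:]\s]+[\r\n]*$/';

$lines = [
    "* = Ìÿð ÿð", // sample line from the question
    "* Ìÿð",      // hypothetical: no equals sign
    "* = abc",    // hypothetical: ASCII letters after the equals sign
    "Ìÿð",        // hypothetical: no leading asterisk
];

// Keep only the lines the pattern accepts.
$hits = array_values(array_filter($lines, function ($line) use ($pattern) {
    return preg_match($pattern, $line) === 1;
}));

print_r($hits);
```

Without the u modifier the class works on bytes, which is sufficient here because every byte of a multi-byte UTF-8 character lies outside the ASCII range.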
Maybe this - (edit: changed after reread )
# ^\*(?=.*[^\0-\177])
^
\*
(?= .* [^\0-\177] )

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
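The example above can be reproduced with a few lines of PHP (variable names are illustrative):

```php
<?php
$subject = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';

// Remove a lone lowercase letter together with one adjacent whitespace character.
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

echo $result, "\n";
```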
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
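As raised by Radu, because I've used \S this will match more than just a-zA-Z. It will also match 0-9 and _. Normally it would match a lot more than that, but because it is preceded by \b, it can only be a word character when it follows whitespace or the start of the string. Note that a punctuation character directly after a word character and followed by whitespace and a word (the "! " in "a! b") still satisfies the pattern.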
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a hello
but you could fix that by changing the expression to \b\S\s*\b
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
