I'm trying to validate a string in PHP using regex; it can only contain letters (including accented Latin letters such as 'á', 'õ', etc.) and spaces.
Using preg_replace('/\P{L}/u', '', $str); I remove everything except the letters, including the spaces. What do I need to change in the regex to keep the spaces as well?
You may use
preg_replace('/[^\p{L}\s]+/u', '', $str);
The [^\p{L}\s]+ pattern matches one or more occurrences of any char other than a Unicode letter or whitespace. Note that due to the u modifier, \s will recognize any Unicode whitespace char.
See the regex demo.
Details
[^ - start of a negated character class that matches any char but
\p{L} - any Unicode letter
\s - whitespace
]+ - 1 or more times.
If the string may contain combining diacritical marks (letters stored in decomposed form, such as 'a' followed by U+0301) and you want to keep them, add \p{M} to the negated character class: /[^\p{L}\p{M}\s]+/u.
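As a minimal sketch of the difference between the two patterns above (the sample strings and variable names are made up for illustration, and the "\u{...}" escapes assume PHP 7.0 or later):

```php
<?php
// 'á' as a single precomposed code point (U+00E1).
$composed = "Ol\u{00E1}, mundo!";
// 'a' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
$decomposed = "Ola\u{0301}, mundo!";

$lettersAndSpaces = '/[^\p{L}\s]+/u';       // pattern from the answer
$keepMarks        = '/[^\p{L}\p{M}\s]+/u';  // variant that also keeps combining marks

// Precomposed letters survive with either pattern.
$a = preg_replace($lettersAndSpaces, '', $composed);   // "Olá mundo"
// Without \p{M} the combining accent is removed and a bare 'a' is left.
$b = preg_replace($lettersAndSpaces, '', $decomposed); // "Ola mundo"
// With \p{M} the accent stays attached to its base letter.
$c = preg_replace($keepMarks, '', $decomposed);

var_dump($a === "Ol\u{00E1} mundo");   // bool(true)
var_dump($b === "Ola mundo");          // bool(true)
var_dump($c === "Ola\u{0301} mundo");  // bool(true)
```

If the input is not valid UTF-8, preg_replace with the u modifier returns null, so real code may want to check for that.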
I have record names in which Cyrillic and Latin words, symbols, spaces, digits, etc. are mixed.
I need to preg_match (PHP) only the Latin part, with any symbols in any combination.
Test set:
БлаблаБла Uty-223
Блабла (бла.)Бла CAROP-C
Бла бла ST.MORITZ
Бла бла RAMIRO2-TED
LA PLYSGNE 1 H - 001
(Блабла) stands for Cyrillic words, which don't matter.
So I tried this pattern:
/[-0-9a-zA-Z.]+/
But the Latin parts of [Блабла (бла.)Бла CAROP-C] and [LA PLYSGNE 1 H - 001] are not matched as a single string.
Next I tried to write a more flexible pattern:
/[-0-9a-zA-Z]+(?:.)?+(?:\s+)?+[-0-9a-zA-Z]+/
But there is still a problem with matching [LA PLYSGNE 1 H - 001].
Is there any idea how can this be solved?
Thanks.
If the . and - cannot occur at the beginning or end, you can start the match with [0-9a-zA-Z] and then optionally repeat one or more of the chars listed in the character class, followed again by [0-9a-zA-Z]:
\b[0-9a-zA-Z]+(?:[.\h-]+[0-9a-zA-Z]+)*\b
The \b is a word boundary preventing a partial word match
\h matches a horizontal whitespace character
See a regex101 demo.
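A short sketch of the first pattern applied to the test set from the question. The ~ delimiters and the u modifier are assumptions added here so that PHP treats the UTF-8 input as code points; the variable names are illustrative only.

```php
<?php
// Test set from the question; the Cyrillic parts are filler.
$records = [
    'БлаблаБла Uty-223',
    'Блабла (бла.)Бла CAROP-C',
    'Бла бла ST.MORITZ',
    'Бла бла RAMIRO2-TED',
    'LA PLYSGNE 1 H - 001',
];

// Assumption: ~ delimiters and the u flag are added around the pattern.
$pattern = '~\b[0-9a-zA-Z]+(?:[.\h-]+[0-9a-zA-Z]+)*\b~u';

$latinParts = [];
foreach ($records as $record) {
    // preg_match returns 1 when a match is found; $m[0] holds the matched text.
    if (preg_match($pattern, $record, $m) === 1) {
        $latinParts[] = $m[0];
    }
}

print_r($latinParts);
// Uty-223, CAROP-C, ST.MORITZ, RAMIRO2-TED, LA PLYSGNE 1 H - 001
```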
To match at least a single [0-9a-zA-Z] char, allow . and - anywhere in the match, and assert whitespace boundaries to the left and right:
(?<!\S)[.-]*\b[0-9a-zA-Z](?:[0-9a-zA-Z.\h-]*[0-9a-zA-Z.-])?(?!\S)
(?<!\S) and (?!\S) are lookaround assertions that act as whitespace boundaries: they assert that there is no non-whitespace char directly to the left and to the right, respectively.
See a regex101 demo.
You can also use a script run starting with a Latin letter (script runs require PCRE2 10.33 or later):
~(*sr:\p{Latin}.*\S)~u
I want to replace all empty spaces on the beginning of all new lines. I have two regex replacements:
$txt = preg_replace("/^ +/m", '', $txt);
$txt = preg_replace("/^[^\S\r\n]+/m", '', $txt);
Each of them matches a different kind of empty space. However, both kinds may occur together, and in any order, so I want to match all occurrences of them at the beginning of new lines. How can I do that?
NOTE: The first regex matches an ideographic space, \u3000 char, which is only possible to check in the question raw body (SO rendering is not doing the right job here). The second regex matches only ASCII whitespace chars other than LF and CR. Here is a demo proving the second regex does not match what the first regex matches.
Since you want to remove any horizontal whitespace from a Unicode string, you need to use the
\h regex escape ("any horizontal whitespace character (since PHP 5.2.4)")
u modifier (see Pattern Modifiers)
Use
$txt = preg_replace("/^\h+/mu", '', $txt);
Details
^ - start of a line (m modifier makes ^ match all line start positions, not just string start position)
\h+ - one or more horizontal whitespaces
u modifier will make sure the Unicode text is treated as a sequence of Unicode code points, not just code units, and will make all regex escapes in the pattern Unicode aware.
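A minimal sketch of the effect of the u modifier here, using a made-up sample string that mixes U+3000, tabs, and ASCII spaces at line starts (the "\u{...}" escapes assume PHP 7.0 or later):

```php
<?php
// Line 1 starts with U+3000 IDEOGRAPHIC SPACE plus a space,
// line 2 with a tab, a space, and U+3000; line 3 has no leading whitespace.
$txt = "\u{3000} first\n\t \u{3000}second\nthird";

// With m and u: every kind of horizontal whitespace at each line start is removed.
$clean = preg_replace('/^\h+/mu', '', $txt);

// Without u, the input is read as bytes and U+3000 (bytes E3 80 80) is not matched.
$bytesOnly = preg_replace('/^\h+/m', '', $txt);

var_dump($clean === "first\nsecond\nthird"); // bool(true)
var_dump($bytesOnly === $clean);             // bool(false)
```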
How can I remove non-alphanumeric characters from a string in PHP while keeping Russian characters like ч and г?
I tried to translate the string and then clean it with preg_replace, but this would remove the Russian characters.
You can do it with preg_replace. You just have to build a regular expression that matches what you desire.
If I understood your question correctly, this should work:
preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
Brief explanation:
[^...] matches any character that is not in the set.
\p{L} matches any letter (including the Cyrillic alphabet).
\p{N} matches any number.
\s matches any whitespace character.
/u adds Unicode support.
If you only want to keep letters from the Cyrillic alphabet, you may want to use \p{Cyrillic} instead of \p{L}.
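For illustration, a small sketch comparing \p{L} with \p{Cyrillic} on made-up sample strings (variable names are hypothetical):

```php
<?php
$string = "Привет, мир! ч-г #2024";

// Keep letters of any script, numbers, and whitespace.
$clean = preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
var_dump($clean === "Привет мир чг 2024"); // bool(true)

// Keep only Cyrillic letters (plus numbers and whitespace); Latin letters are dropped.
$mixed = "Привет-world42";
$cyrillicOnly = preg_replace('/[^\p{Cyrillic}\p{N}\s]/u', '', $mixed);
var_dump($cyrillicOnly === "Привет42");    // bool(true)
```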
I have the pattern below. The matches must not span multiple lines (there must not be any newline char in a match), so I added the m modifier.
But sometimes there is still a \n in the matches. How can I prevent this?
preg_match_all('/(?<!\d|\d\D)(?:dk)?([\d\PL]{8,})/m', $input, $matches, PREG_PATTERN_ORDER);
The \PL pattern matches any char other than a Unicode letter, so it already matches digits and whitespace chars, line breaks included. That means [\d\PL] can be shortened to \PL. Since you need to subtract line breaks from it, replace it with the opposite shorthand character class (\pL), use that inside a negated bracket expression, [^\pL], and add \r and \n there:
'/(?<!\d|\d\D)(?:dk)?([^\pL\r\n]{8,})/u'
The m modifier is redundant since it only redefines the behavior of ^ and $ anchors. You might need the u modifier though, for the Unicode property class to work safely with Unicode strings in PHP/PCRE. Change \d to [0-9] and \D to [^0-9] if you only want to match ASCII digits.
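A brief sketch contrasting the original and the corrected pattern on a made-up three-line input (the input and variable names are assumptions for illustration):

```php
<?php
$input = "dk1234 5678\n12345\n67890";

// Original pattern: \PL also matches \n, so the match runs across line breaks.
preg_match_all('/(?<!\d|\d\D)(?:dk)?([\d\PL]{8,})/m', $input, $old, PREG_PATTERN_ORDER);

// Corrected pattern: \r and \n are excluded, so a match stays on one line.
preg_match_all('/(?<!\d|\d\D)(?:dk)?([^\pL\r\n]{8,})/u', $input, $new, PREG_PATTERN_ORDER);

var_dump($old[1] === ["1234 5678\n12345\n67890"]); // bool(true)
var_dump($new[1] === ["1234 5678"]);               // bool(true)
```

The second and third lines produce no match with the corrected pattern: each has only five qualifying chars, and the lookbehind rejects start positions that follow a digit.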
How do I write a regex that matches whitespace but not tabs and newlines?
Thanks, everyone.
[[:blank:]]{2,} <-- This isn't good for me because it matches spaces and tabs (though not newlines).
As per my original comment, you can use this.
Code
See regex in use here
Note: The link contains whitespace characters: tab, newline, and space. Only space is matched.
[^\S\t\n\r]
So your regex would be [^\S\t\n\r]{2,}
Explanation
[^\S\t\n\r] Match any character not present in the set.
\S Matches any non-whitespace character. Since it's a double negative it will actually match any whitespace character. Adding \t, \n, and \r to the negated set ensures we exclude those specific characters as well. Basically, this regex is saying:
Match any whitespace character except \t\n\r
This principle in regex is often used with word characters \w to negate the underscore _ character: [^\W_]
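A quick sketch of both double-negative classes in PHP, on made-up sample strings. Note that in PCRE \s also covers form feed and vertical tab, so [^\S\t\n\r] would match those as well, not only the space.

```php
<?php
// Two spaces, two tabs, single spaces around a newline, then three spaces.
$text = "a  b\t\tc \n d   e";

// Runs of 2+ whitespace chars, excluding tab, LF, and CR.
preg_match_all('/[^\S\t\n\r]{2,}/', $text, $spaces);
var_dump($spaces[0] === ["  ", "   "]); // bool(true)

// Same principle with \W: word characters except the underscore.
preg_match_all('/[^\W_]+/', 'foo_bar9', $words);
var_dump($words[0] === ['foo', 'bar9']); // bool(true)
```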
[ ]{2,} works normally (not sure about php)
or even / {2,}/