I need to create a regular expression that matches word, whitespace, word. The string can't start with whitespace or have more than one whitespace character between the two words, and each word must allow letters, including accented ones. I'm using this pattern:
^([^\+\*\.\|\(\)\[\]\{\}\?\/\^\s\d\t\n\r<>ºª!#"·#~½%¬&=\'¿¡~´,;:_®¥§¹×£µ€¶«²¢³\$\-\\]+\s{0,1}?)*$/
Examples:
-Graça+whitespace+anotherWord -> match
-whitespace+Graça+whitespace+anotherWord -> don't match
-Graça+whitespace+whitespace+anotherword -> don't match
In general, it is a validation that allows firstname+whitespace+lastname, made of a-z characters and accented characters,
and I have to exclude all special characters like /*-+)(!/($=
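You can try this pattern: ^[\x{0041}-\x{02B3}]+\s[\x{0041}-\x{02B3}]+$ (in PHP, add the u modifier, which \x{...} needs for code points above FF).
Explanation: since you are using characters not matched by \w, you have to define your own range of word characters. \x{0041} is the character with Unicode code point U+0041 (A), and \x{02B3} is U+02B3. Note that this range also covers some non-letters, such as [ \ ] ^ _ ` and × ÷, so \p{L} is a stricter choice if those must be rejected.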
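A minimal sketch of the same idea in PHP, assuming UTF-8 input and swapping the explicit code point range for \p{L} (any Unicode letter) so that symbols inside the range are not accepted. The function name isValidFullName is made up for illustration. \z is used instead of $ so that a trailing newline is rejected, and a single literal space is assumed as the separator.

```php
<?php
// Hypothetical helper: true only for "letters, one space, letters".
function isValidFullName(string $name): bool
{
    // \p{L}+ : one or more Unicode letters (needs the u modifier)
    // ' '    : exactly one literal space between the two words
    // \A, \z : anchor to the very start and very end of the string
    return preg_match('/\A\p{L}+ \p{L}+\z/u', $name) === 1;
}

var_dump(isValidFullName('Graça anotherWord'));   // bool(true)
var_dump(isValidFullName(' Graça anotherWord'));  // bool(false)
var_dump(isValidFullName('Graça  anotherWord'));  // bool(false)
```

If tabs or other whitespace should also count as the separator, \s or \h could replace the literal space.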
For just spaces, use str_replace:
$string = str_replace(' ', '', $string);
For all whitespace, use preg_replace:
$string = preg_replace('/\s+/', '', $string);
We are extracting text from PDF files, and a high proportion of the results contain malformed text. Specifically, spaces are added between the characters of a word, e.g. SEATTLE is returned as S E A T T L E.
Is there a regular expression for preg_replace that can remove the spaces in a run of n single-character "words"? Specifically, remove the spaces from any occurrence of more than 3 single alpha characters separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, only those in a run of more than 3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full-text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
    return str_replace(" ", "", $m[0]);
}, $s);
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, or other kinds of whitespace. In that case, replace the literal space with \h and add the u modifier:
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of
    " " - a space (\h matches any kind of horizontal whitespace)
    [A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
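To try the Unicode-aware variant end to end, here is a small sketch. The function name collapseSpacedLetters and the sample strings are made up for illustration, and UTF-8 input is assumed. It uses the \p{L} / \h pattern from above and strips the whitespace inside each matched run.

```php
<?php
// Hypothetical helper: joins runs of 3+ single letters that are separated by
// one horizontal whitespace character each, e.g. "S E A T T L E" -> "SEATTLE".
function collapseSpacedLetters(string $text): string
{
    $regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
    return preg_replace_callback($regex, function (array $m): string {
        // Remove every horizontal whitespace character from the matched run.
        return preg_replace('~\h~u', '', $m[0]);
    }, $text);
}

echo collapseSpacedLetters('Welcome to the Greater S E A T T L E area'), "\n";
// Welcome to the Greater SEATTLE area
echo collapseSpacedLetters('a b stays as it is'), "\n";
// a b stays as it is
```

The second call shows that a run of only two single letters is left alone, since {2,} requires at least three letters in total.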
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
Explanation:
(?i:               # Start of non-capturing group with case-insensitive modifier on
    (?<!\S)        # Negative lookbehind to ensure there is no leading non-whitespace character
    ([a-z]) +      # Capture one letter and at least one space
    ((?1))         # Capture one letter in 2nd capturing group
    |              # Or
    \G(?!\A) +     # Start match from where previous match ends, with matching spaces
    ((?1))\b       # Match a letter at word boundary
)                  # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert that at least 4 single letters (i.e. more than 3) follow, each separated by 1+ horizontal whitespace
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1, which is the captured letter to the left of the whitespace.
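A quick way to see the threshold in action is the sketch below. The pattern is the one from this answer with the u modifier added, on the assumption that the input is UTF-8; the sample strings are made up.

```php
<?php
// Pattern from the answer above; the trailing u modifier is an added assumption.
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~u';

// Four or more single letters in a row are joined.
echo preg_replace($re, '$1', 'Welcome to the Greater S E A T T L E area'), "\n";
// Welcome to the Greater SEATTLE area

// Only three single letters: the lookahead fails, so nothing changes.
echo preg_replace($re, '$1', 'see A B C end'), "\n";
// see A B C end
```

Once a run has started, the \G branch keeps consuming letter-plus-whitespace pairs, which is why the tail of a run is joined even where the 4-letter lookahead no longer holds.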
I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. This regex, /[^a-z0-9_\s-]/, does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
You need a Unicode script property for the Cyrillic alphabet; fortunately, PHP's PCRE supports it as \p{Cyrillic}. You also have to set the u (Unicode) flag so the engine treats the pattern and the subject as UTF-8. You may also want the i flag so that a-z matches A-Z as well:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}
Since Cyrillic characters are not standard ASCII characters, you have to use u flag/modifier, so regex will recognize Unicode characters as needed.
Be sure to use mb_strtolower instead of strtolower, as you work with unicode characters.
Because you convert all characters to lowercase, you don't have to use i regex flag/modifier.
The following PHP code should work for you:
function seoUrl($string) {
    // Lower case everything (multibyte-safe; assumes UTF-8 input)
    $string = mb_strtolower($string, 'UTF-8');
    // Keep Cyrillic letters, a-z, digits, whitespace, underscore and dash; remove everything else
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Collapse runs of dashes or whitespace into a single space
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespace and underscores to dashes
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
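Furthermore, please note that \p{InCyrillic} and \p{InCyrillic_Supplementary} are Unicode block properties from other regex flavors (for example Java and Perl). PCRE, and therefore PHP's preg_* functions, supports script names such as \p{Cyrillic} but not In... block names. To target the blocks in PHP, use explicit ranges with the u flag: [\x{0400}-\x{04FF}] for the Cyrillic block and [\x{0500}-\x{052F}] for Cyrillic Supplement.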
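If the URLs may contain other scripts as well, the same approach can be generalized with \p{L} and \p{N}. The sketch below is a hypothetical variant (the name slugify is made up), assuming UTF-8 input and the mbstring extension; it is not the function from the question.

```php
<?php
// Hypothetical variant of seoUrl() that keeps letters and digits of any script.
// Requires the mbstring extension for mb_strtolower().
function slugify(string $string): string
{
    // Lower-case with an explicit encoding so non-ASCII letters are handled.
    $string = mb_strtolower($string, 'UTF-8');
    // Drop everything except letters, digits, whitespace, underscore and dash.
    $string = preg_replace('/[^\p{L}\p{N}\s_-]+/u', '', $string);
    // Collapse runs of whitespace and dashes into a single space.
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Turn the remaining whitespace and underscores into dashes.
    return preg_replace('/[\s_]/u', '-', $string);
}

echo slugify('Привет, мир!'), "\n";       // привет-мир
echo slugify('Hello   World_2024'), "\n"; // hello-world-2024
```

Whether to allow every script or only Cyrillic plus Latin depends on what the site should accept; \p{Cyrillic} is the narrower choice.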
I'm looking for a regular expression in PHP that could transform incoming strings like this:
abaisser_negation_pronominal_question => abaisser_n_p_q
abaisser_pronominal_question => abaisser_p_q
abaisser_negation_question => abaisser_n_q
abaisser_negation_pronominal => abaisser_n_p
abaisser_negation_voix_passive_pronominal => abaisser_n_v_p_p
abaisser => abaisser
The PHP code would be something close to:
$line=preg_replace("/<h3>/im", "", $line);
How would you do it?
You can use:
$input = preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $input);
Explanation:
This regex searches for (_[A-Za-z])[^_\n]*, which means an underscore followed by a single letter, and then everything up to the next underscore or newline.
It captures the first part, (_[A-Za-z]), in group $1.
The replacement is $1, which leaves the underscore and the first letter in the resulting string.
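Here is a small sketch that runs this replacement over the sample strings from the question. The wrapper name abbreviateParts is made up for illustration.

```php
<?php
// Hypothetical wrapper around the replacement shown above.
function abbreviateParts(string $input): string
{
    // Keep "_" plus the first letter ($1); drop the rest of each part.
    return preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $input);
}

echo abbreviateParts('abaisser_negation_pronominal_question'), "\n";     // abaisser_n_p_q
echo abbreviateParts('abaisser_negation_voix_passive_pronominal'), "\n"; // abaisser_n_v_p_p
echo abbreviateParts('abaisser'), "\n";                                  // abaisser
```

The first part (before the first underscore) is never touched, because the pattern only starts matching at an underscore.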
You could use \K or positive lookbehind.
$input = preg_replace('~_.\K[^_\n]*~', '', $input);
The _. part of the regex above matches an _ and the character that follows it. \K then discards what has been matched so far (the _ plus the following character), so these two characters are not part of the text that gets replaced. Next, [^_\n]* matches zero or more characters that are neither an _ nor a \n newline. In other words, everything after the first character following an underscore is matched, up to the next _ or \n. Removing those characters gives you the desired output.
$input = preg_replace('~(?<=_.)[^_\n]*~', '', $input);
The lookbehind only checks that an _ and one more character come immediately before the match; the pattern then matches all characters up to the next underscore or newline.
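To check that the \K form and the lookbehind form behave the same, here is a short sketch. The two-line sample input is made up from the strings in the question.

```php
<?php
$input = "abaisser_negation_question\nabaisser_pronominal";

// \K form: "_." is matched but excluded from the text that gets replaced.
$withK = preg_replace('~_.\K[^_\n]*~', '', $input);
// Lookbehind form: "_." must precede the match but is never consumed.
$withLookbehind = preg_replace('~(?<=_.)[^_\n]*~', '', $input);

echo $withK, "\n";
var_dump($withK === $withLookbehind); // bool(true)
```

Both forms leave the newline in place because [^_\n]* stops in front of it.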
You can use regex
$input = preg_replace('/_(.)[^\n_]+/', '_$1', $input);
It captures the character after _, matches up to the next \n or _, and replaces the whole match with _$1, which is _ plus the captured character.
$line = preg_replace("/_([a-z])([a-z]*)/i", "_$1", $line);
I need to format uppercase words as bold, but it doesn't work if the phrase contains two spaces.
Is there any way to make the regex match only words that end with a colon?
$str = "BAKA NO TEST: hey";
$str = preg_replace('~[A-Z]{4,}\s[A-Z]\s{2,}(?:\s[A-Z]{4,})?:?~', '<b>$0</b>', $str);
Expected output: <b>BAKA NO TEST:</b> hey
But it returns: <b>BAKA</b> NO TEST: hey
The original $str is a multiline text, so there are many lowercase and uppercase words, but I need to change only some of them.
You can do it like this:
$txt = preg_replace('~[A-Z]+(?:\s[A-Z]+)*:~', '<b>$0</b>', $txt);
Explanations:
[A-Z]+        # one or more uppercase letters
(?:           # open a non-capturing group
    \s        # a whitespace character (space, tab, newline, ...)
    [A-Z]+    # one or more uppercase letters
)*            # close the group and repeat it zero or more times
:             # a literal colon
If you want a more tolerant pattern, you can replace \s with \s+ to allow more than one space between words.
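A small sketch of this pattern applied to the string from the question. The function name boldLabels and the second sample string are made up for illustration.

```php
<?php
// Hypothetical helper: wraps a run of upper-case words ending in a colon in <b>.
function boldLabels(string $txt): string
{
    return preg_replace('~[A-Z]+(?:\s[A-Z]+)*:~', '<b>$0</b>', $txt);
}

echo boldLabels('BAKA NO TEST: hey'), "\n";      // <b>BAKA NO TEST:</b> hey
echo boldLabels('SHOUTING without colon'), "\n"; // SHOUTING without colon
```

Since \s also matches newlines, an uppercase word at the end of one line could be joined to a label on the next; \h might be a safer separator for multiline text.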
Unless you have some good reason to use that regexp, try something simpler, like:
/([A-Z\s]+):/
Also, just so you know, you can use an asterisk to match zero or more whitespace characters: \s*
Say I should keep the spaces between letters (a-z, case-insensitive) and remove the spaces between non-letters. How can I do that?
This should work:
$trimmed = preg_replace('~([a-z0-9])\s+([a-z0-9])~i', '\1\2', $your_text);
This will strip any whitespace that is between two non-alpha characters:
preg_replace('/(?<![a-z])\s+(?![a-z])/i', '', $text);
This will strip any whitespace that has a non-alpha character on either side (big difference):
preg_replace('/(?<![a-z])\s+|\s+(?![a-z])/i', '', $text);
By using negative look-ahead and negative look-behind assertions, the beginning and end of the string are treated as non-alpha as well.
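The difference between the two lookaround patterns is easier to see side by side. A short sketch, using a made-up sample string with single spaces between tokens:

```php
<?php
$text = 'ab cd 12 34 ef gh'; // made-up sample with single spaces

// Both neighbours must be non-letters: only the space in "12 34" goes.
echo preg_replace('/(?<![a-z])\s+(?![a-z])/i', '', $text), "\n";
// ab cd 1234 ef gh

// One non-letter neighbour is enough: only letter-to-letter spaces stay.
echo preg_replace('/(?<![a-z])\s+|\s+(?![a-z])/i', '', $text), "\n";
// ab cd1234ef gh
```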