Match whole words in utf - php

I want to replace all occurrences of a with 5. Here is the code that works well:
$content=preg_replace("/\ba\b/","5", $content);
It works unless I have words like zapłać, where a sits between non-ASCII characters, or zmarła, where a non-ASCII letter is followed by a at the end of the word. Is there an easy way to fix it?

The problem is that the predefined character class \w, on which \b is based, is ASCII-based by default, and on some PHP/PCRE builds that does not change when the u modifier is used. (See regular-expressions.info; preg is PCRE in the columns.)
You can use lookbehind and lookahead to do it:
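$content=preg_replace("/(?<!\p{L})a(?!\p{L})/u","5",$content);
This replaces "a" only when there is no letter immediately before it and no letter immediately after it. The u modifier is needed so that \p{L} is tested against whole UTF-8 characters rather than single bytes.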
\p{L}: any kind of letter from any language.
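As a quick check of the lookaround approach, here is a minimal sketch. The function name replaceStandaloneA is hypothetical and only there for illustration; the sample sentence is made up from the words in the question, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: replace the standalone word "a" with "5".
function replaceStandaloneA(string $content): string
{
    // (?<!\p{L}) : no letter immediately before the "a"
    // (?!\p{L})  : no letter immediately after the "a"
    // u          : treat pattern and subject as UTF-8
    return preg_replace('/(?<!\p{L})a(?!\p{L})/u', '5', $content);
}

echo replaceStandaloneA('a zapłać zmarła i a'), "\n"; // 5 zapłać zmarła i 5
```

The "a" inside zapłać and at the end of zmarła is left alone because its neighbour is a letter according to \p{L}.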

$content=preg_replace("/\ba\b/u","5",$content);
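A small sketch to try this variant. It assumes a PHP build in which the u modifier also makes \w and \b Unicode-aware, which to my knowledge is the case for current PHP releases; the function name replaceWordBoundaryA is hypothetical.

```php
<?php
// Hypothetical helper: same replacement, relying on \b plus the u modifier.
// Assumption: the u modifier makes \b treat letters such as "ł" as word characters.
function replaceWordBoundaryA(string $content): string
{
    return preg_replace('/\ba\b/u', '5', $content);
}

// Expected on such a build: 5, zapłać, zmarła, 5
echo replaceWordBoundaryA('a, zapłać, zmarła, a'), "\n";
```

If the output still shows a "5" inside the Polish words, the build treats \b as ASCII-only and the lookaround pattern with \p{L} is the safer choice.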

Related

PHP REGEX to find uppercase sentence in html tag

I am trying to create regex to find uppercase sentence in html tag. Here is an example:
<span style="font-family:Arial; font-size:11pt; font-weight:bold">RESSONÂNCIA MAGNÉTICA</span></p>
I got this regex: ^<span style="font-family:Arial; font-size:11pt; font-weight:bold">+[A-Z]+<\/span><\/p>
However it is not working properly. It is missing spaces and letters with accentuation.
You seem to have a very specific case in mind. @Mariano pointed out a sweet way to grab uppercase characters that is Unicode safe (nice work!) but maybe coming at this a little differently will help.
You mentioned wanting uppercase sentences... I assume that's more than uppercase letters, that it includes punctuation, and that all manner of other characters are okay. Maybe think about what isn't okay? If the only thing not allowed inside that tag is lowercase letters, maybe your match (inside the tag) is [^a-z]+, which will match anything that isn't a lowercase letter from a to z.
preg_replace("/^<span style=\"font-family:Arial; font-size:11pt; font-weight:bold\">([^a-z]+)<\/span><\/p>/u", '$1', $input_lines);
And if you want to grab the contents of any span, you could use something like this:
preg_replace("/^<span[^>]+>([^a-z]+)<\/span>/u", '$1', $input_lines);
Or to handle lowercase letters with accents:
preg_replace("/^<span[^>]+>([^\p{Ll}]+)<\/span>/u", '$1', $input_lines);
(The replacement is written as '$1' because "\1" inside a double-quoted PHP string is an octal escape, not a backreference.)
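To see the "anything but lowercase" idea in action, here is a minimal sketch that extracts the span text with preg_match instead of replacing it. The helper name extractUppercaseSpan is hypothetical, and the input is the sample line from the question.

```php
<?php
// Hypothetical helper: return the text of a leading <span> when it contains
// no lowercase letters (in any script), or null when it does not match.
function extractUppercaseSpan(string $html): ?string
{
    // [^>]+        : the span's attributes
    // [^\p{Ll}]+   : one or more characters that are not lowercase letters
    if (preg_match('/^<span[^>]+>([^\p{Ll}]+)<\/span>/u', $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

$html = '<span style="font-family:Arial; font-size:11pt; font-weight:bold">RESSONÂNCIA MAGNÉTICA</span></p>';
var_dump(extractUppercaseSpan($html));                                  // RESSONÂNCIA MAGNÉTICA
var_dump(extractUppercaseSpan('<span class="x">Ressonância</span>'));   // NULL
```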
You're using [A-Z], which only matches A to Z. This can be solved using Unicode categories:
Use \p{Lu} to match characters with the Uppercase_Letter Unicode property.
In order to use the above, set the /u (Unicode modifier) in your pattern.
Don't forget to include spaces (your example has 1).
This will match what you want: [\p{Lu} ]+
Code:
preg_replace("/^<span style=\"font-family:Arial; font-size:11pt; font-weight:bold\">([\p{Lu} ]+)<\/span><\/p>/u", '$1', $input_lines);
I suggested using \p{Lu} in a previous answer, but you're probably not interested in matching Greek, Cyrillic, German special chars or whatever else the Uppercase_Letter category matches.
Keep it simple:
Just add the special chars you want inside the character class. For example, and I'm guessing it's Portuguese you're matching:
[A-ZÁÂÃÀÇÉÊÍÓÔÕÚ ]+
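The difference between the two classes can be sketched as follows. Both function names are hypothetical, and anchoring with ^ and $ (plus the D modifier) is my own addition so that the whole string is tested. Note the u modifier on the explicit list as well: without it the accented letters would be treated as separate bytes.

```php
<?php
// Hypothetical helper: uppercase letters of any script, plus spaces.
function isUppercaseAnyScript(string $text): bool
{
    return preg_match('/^[\p{Lu} ]+$/Du', $text) === 1;
}

// Hypothetical helper: only the uppercase letters listed explicitly, plus spaces.
function isUppercasePortuguese(string $text): bool
{
    return preg_match('/^[A-ZÁÂÃÀÇÉÊÍÓÔÕÚ ]+$/Du', $text) === 1;
}

var_dump(isUppercaseAnyScript('RESSONÂNCIA MAGNÉTICA'));  // bool(true)
var_dump(isUppercasePortuguese('RESSONÂNCIA MAGNÉTICA')); // bool(true)
var_dump(isUppercaseAnyScript('ΑΘΗΝΑ'));                  // bool(true), Greek capitals are in Lu
var_dump(isUppercasePortuguese('ΑΘΗΝΑ'));                 // bool(false)
```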

php: strip everything except alphanumeric unicode and two characters

I am trying to strip all punctuation from a text, but since the text is in Spanish I can't use [A-Za-z0-9].
I have found this regex:
trim(preg_replace('#[^\p{L}\p{N}]+#u', ' ', $str))
which seems to do the job, but I would like to keep two special characters, # and @. How can I achieve that?
Extra question: How can I delete all strings that are just numbers? e.g. 123 would be deleted but not as5623.
Thanks in advance!
You can simply add those characters to your negated class to retain them. And be sure to change your pattern delimiters to something other than # as well.
~[^\p{L}\p{N}#@]+~u
To remove all strings that are numbers, you can place word boundaries \b around your pattern.
\b\d+\b
Note: A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not.
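Putting both steps together, a minimal sketch could look like this. It assumes the two characters to keep are # and @, and the function name cleanText as well as the final whitespace clean-up are my own additions for illustration.

```php
<?php
// Hypothetical helper combining both suggestions.
function cleanText(string $str): string
{
    // 1. Replace every run of characters that are not letters, numbers, # or @
    //    with a single space. "~" is the delimiter because "#" is in the class.
    $str = preg_replace('~[^\p{L}\p{N}#@]+~u', ' ', $str);

    // 2. Remove tokens that consist only of digits (123), but not as5623.
    $str = preg_replace('~\b\d+\b~u', '', $str);

    // 3. Collapse the leftover whitespace and trim the ends.
    return trim(preg_replace('~\s+~u', ' ', $str));
}

echo cleanText('¡Hola, #mundo! 123 as5623 @niño...'), "\n"; // Hola #mundo as5623 @niño
```

The digits in as5623 survive because there is no word boundary between "s" and "5": both are word characters.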
You can use POSIX character classes too.
/[^[:alnum:]#@]+/
For the two special characters, you just have to add them inside the character class.
To delete all words that contain only digits, the following regex would work.
/\b[[:digit:]]+\b/

Regular Expressions: How to Express \w Without Underscore

Is there a concise way to express:
\w but without _
That is, "all characters included in \w, except _"
I'm asking this because I'm looking for the most concise way to express domain name validation. A domain name may include lowercase and uppercase letters, numbers, period signs and dashes, but no underscores. \w includes all of the above, plus an underscore. So, is there any way to "remove" an underscore from \w via regex syntax?
Edited: I'm asking about regex as used in PHP.
Thanks in advance!
Use the following character class (in Perl):
[^\W_]
\W is the same as [^\w]
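In PHP, the double negation can be sketched like this. The function name looksLikeDomain is hypothetical, and this is only a character check under the question's stated rules (letters, digits, periods, dashes), not a full domain name validator.

```php
<?php
// Hypothetical helper: every character must be either a \w character other
// than "_" (written as [^\W_]) or a period or dash.
function looksLikeDomain(string $name): bool
{
    // The two classes are alternated because putting . and - inside the
    // negated class would exclude them instead of allowing them.
    return preg_match('/^(?:[^\W_]|[.-])+$/D', $name) === 1;
}

var_dump(looksLikeDomain('example-site.com')); // bool(true)
var_dump(looksLikeDomain('bad_name.com'));     // bool(false)
```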
You could use a negative lookahead: (?!_)\w
However, I think writing [a-zA-Z0-9.-] is more readable.
To be on the safe side, usually, we will use character class:
[a-zA-Z0-9.-]
The regex "fragment" above matches the English alphabet and digits, plus period . and dash -. It should work even with the most basic regex support.
Shorter may be better, but only if you know exactly what it represents.
I don't know what language you are using. In a lot of engines, \w is equivalent to [a-zA-Z0-9_] (some require "ASCII mode" for this). However, some engines have Unicode support for regex, and may extend \w to match Unicode characters.
If my understanding is right, \w means [A-Za-z0-9_]; period signs and dashes are not included.
info:
http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes
so I guess what you want is [a-zA-Z0-9.-]
Some regex flavours have a negative lookbehind syntax you might use:
\w(?<!_)
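The three spellings suggested in this thread, [^\W_], (?!_)\w and \w(?<!_), can be compared with a short sketch. The helper name wordRuns and the sample input are made up for illustration.

```php
<?php
// Hypothetical helper: return every run of characters matched by $pattern.
function wordRuns(string $pattern, string $input): array
{
    preg_match_all($pattern, $input, $m);
    return $m[0];
}

$input = 'foo_bar-baz.42';
$patterns = [
    '/[^\W_]+/',        // double negation
    '/(?:(?!_)\w)+/',   // negative lookahead before each \w
    '/(?:\w(?<!_))+/',  // negative lookbehind after each \w
];

foreach ($patterns as $pattern) {
    // Each line prints: foo,bar,baz,42
    echo implode(',', wordRuns($pattern, $input)), "\n";
}
```

All three split at the underscore; the character class version is usually the easiest to read.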
I would start with [^_], and then think about which other characters I need to deny. If you need to filter keyboard input, it's quite simple to enumerate all the unwanted characters.
You can write something like this:
/([^\w]|_)/u
If you use preg_filter with this pattern and an empty replacement, every character that is not in \w, as well as the underscore, is removed, so only the \w characters other than _ remain.

preg_replace in PHP - regular expression for NOT condition

I am trying to write a function in PHP using preg_replace where it will replace all those characters which are NOT found in list. Normally we replace where they are found but this one is different.
For example if I have the string:
$mystring = "ab2c4d";
I can write the following function which will replace all numbers with *:
preg_replace("/(\d+)/","*",$mystring);
But I want to replace those characters which are neither numbers nor letters from a to z. They could be anything like #$*();~!{}[]|\/.,<>?' etc.
So anything other than numbers and alphabets should be replaced by something else. How do I do that?
Thanks
You can use a negated character class (using ^ at the beginning of the class):
/[^\da-z]+/i
Update: I mean, you have to use a negated character class and you can use the one I provided but there are others as well ;)
Try
preg_replace("/([^a-zA-Z0-9]+)/","*",$mystring);
You want to use a negated "character class". The syntax for them is [^...]. In your case just [^\w] I think.
\W matches a non-alpha, non-digit character. The underscore _ is included in the list of alphanumerics, so it also won't match here.
preg_replace("/\W/", "something else", $mystring);
should do if you can live with the underscore not being replaced. If you can't, use
preg_replace("/[\W_]/", "something else", $mystring);
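The difference between the suggested patterns shows up once the input contains an underscore. A minimal sketch, with a made-up sample string:

```php
<?php
$mystring = 'ab2c4d#$_!';

// Negated class with + : the whole run "#$_!" becomes a single "*".
$negated = preg_replace('/[^a-z\d]+/i', '*', $mystring);

// \W replaces one character at a time and leaves the underscore alone.
$nonWord = preg_replace('/\W/', '*', $mystring);

// [\W_] also replaces the underscore.
$nonWordOrUnderscore = preg_replace('/[\W_]/', '*', $mystring);

echo $negated, "\n";             // ab2c4d*
echo $nonWord, "\n";             // ab2c4d**_*
echo $nonWordOrUnderscore, "\n"; // ab2c4d****
```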
The \d, \w and similar in regex all have negative versions, which are simply the upper-case version of the same letter.
So \w matches any word character (ie basically alpha-numerics), and therefore \W matches anything except a word character, so anything other than an alpha-numeric.
This sounds like what you're after.
For more info, I recommend regular-expressions.info.
Since PHP 5.1.0 you can use \p{L} (Unicode letters) and \p{N} (Unicode numbers), which are the Unicode counterparts of what \w and \d cover for Latin text:
preg_replace("/[^\p{L}\p{N}]/iu", $replacement_string, $original_string);
/iu modifiers at the end of pattern:
i (PCRE_CASELESS)
u (PCRE_UTF8)
see more at: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
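A short sketch of the Unicode-aware variant; the function name maskNonAlnum and the sample sentence are hypothetical. The i modifier is left out here because \p{L} already covers both cases.

```php
<?php
// Hypothetical helper: replace every character that is neither a Unicode
// letter nor a Unicode number with $replacement.
function maskNonAlnum(string $original, string $replacement = '*'): string
{
    return preg_replace('/[^\p{L}\p{N}]/u', $replacement, $original);
}

echo maskNonAlnum('año 2024: ¡olé!'), "\n"; // año*2024***olé*
```

Accented letters such as ñ and é are kept, while the space, the colon, ¡ and ! are each replaced by one "*".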

Check a variable using regex

I'm about to create a registration form for my website. I need to check the variable and accept it only if it contains letters, numbers, _ or -.
How can I do it with regex? I used to work with them through preg_replace(), but I think that does not fit this case. Also, I know that the "ereg" function is dead. Any solutions?
this regex is pretty common these days.
if(preg_match('/^[a-z0-9\-\_]+$/i',$username))
{
// Ok
}
Use preg_match:
preg_match('/^[\w-]+$/D', $str)
Here \w describes letters, digits and the _, so [\w-]+ matches one or more letters, digits, _, and -. ^ and $ are so-called anchors that denote the beginning and end of the string respectively. The D modifier ensures that $ matches only at the very end of the string, not before a trailing line break.
Note that the letters and digits that are matched by \w depend on the current locale and might include other letters or digits than just [a-zA-Z0-9]. So if you just want these, use them explicitly. And if you want to allow more than these, you could also try character classes that are described by Unicode character properties, like \p{L} for all Unicode letters.
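A minimal sketch of such a check with an explicit ASCII class, so the result does not depend on the locale. The function name isValidUsername is hypothetical.

```php
<?php
// Hypothetical helper: accept only ASCII letters, digits, "_" and "-".
function isValidUsername(string $username): bool
{
    // i : case-insensitive, so a-z also covers A-Z
    // D : "$" matches only at the very end, not before a trailing newline
    return preg_match('/^[a-z0-9_-]+$/iD', $username) === 1;
}

var_dump(isValidUsername('john_doe-1')); // bool(true)
var_dump(isValidUsername('john doe'));   // bool(false)
var_dump(isValidUsername("john\n"));     // bool(false); without D this would pass
var_dump(isValidUsername(''));           // bool(false)
```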
Try preg_match(). http://php.net/manual/en/function.preg-match.php
