php: strip everything except alphanumeric unicode and two characters - php

I am trying to strip all punctuation from a text, but since the text is in Spanish I can't use [A-Za-z0-9].
I have found this regex:
trim(preg_replace('#[^\p{L}\p{N}]+#u', ' ', $str))
which seems to do the job, but I would like to keep two special characters # and #, how can I achieve that?
Extra question: How can I delete all strings that are just numbers? e.g. 123 would be deleted but not as5623.
Thanks in advance!

You can simply add those characters to your negated class to retain them. And be sure to change your pattern delimiters to something other than # as well.
~[^\p{L}\p{N}##]+~u
To remove all strings that are numbers, you can place word boundaries \b around your pattern.
\b\d+\b
Note: A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not.
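Putting both steps together might look like the sketch below. The function name clean_text is hypothetical, and '@' and '#' are only assumed stand-ins for the two characters to keep; the input is assumed to be valid UTF-8.

```php
<?php
// Sketch only: '@' and '#' stand in for the two characters to keep (an assumption).
function clean_text(string $str): string
{
    // Replace each run of characters that are not letters, digits, '@' or '#' with a space.
    $str = preg_replace('~[^\p{L}\p{N}@#]+~u', ' ', $str) ?? $str;
    // Drop tokens made up only of digits; \b leaves mixed tokens like "as5623" intact.
    $str = preg_replace('~\b\d+\b~u', '', $str) ?? $str;
    // Collapse the doubled spaces left behind and trim both ends.
    return trim(preg_replace('~ {2,}~', ' ', $str) ?? $str);
}

echo clean_text('¡Hola, señor! 123 as5623 #tag'), PHP_EOL; // Hola señor as5623 #tag
```

The ~ delimiter is used so that a literal # can appear inside the character class without ending the pattern.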

You can use POSIX character classes too.
/[^[:alnum:]##]+/
To keep the two special characters, you just have to add them inside the character class.
To delete all words that contain only digits, the following regex would work.
/\b[[:digit:]]+\b/

Related

How to check if string contains specific special characters or starting with a space? [duplicate]

I have the following requirements for validating an input field:
It should contain only letters, with spaces allowed between the letters.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem with the lookahead in your expression, ^(?!\s*$), is that it fails only if there is nothing but whitespace up to the end of the string, so leading whitespace followed by letters is still accepted. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think lookarounds are not needed for this task.
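As a rough illustration, here is the lookaround-free pattern wrapped in a small PHP function. The question itself does not name a language; PHP is used only to match the rest of this page, and is_valid_name is a hypothetical name. The D modifier is an addition not present in the answer above.

```php
<?php
// Hypothetical validator built on the pattern ^\p{L}+(?: \p{L}+)*$ from the answer above.
function is_valid_name(string $input): bool
{
    // u: treat pattern and subject as UTF-8 so \p{L} matches non-ASCII letters.
    // D: make $ match only at the very end, not before a trailing newline.
    return preg_match('/^\p{L}+(?: \p{L}+)*$/uD', $input) === 1;
}

var_dump(is_valid_name('María José')); // bool(true)
var_dump(is_valid_name(' María'));     // bool(false)
var_dump(is_valid_name('María '));     // bool(false)
```

Because each space must be followed by at least one letter, two consecutive spaces are rejected as well.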
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
This includes space (32), horizontal tab (09), line feed (10), vertical tab (11), form feed (12), and carriage return (13).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is the lookahead (?!\s*$): it only rejects strings that consist entirely of whitespace, so a leading space is still allowed.
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
One letter, then an optional run of letters and whitespace that must always end with a letter.
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word after the first must have a space before it.
Hope I helped.

Test if a word is composed of alpha characters, white spaces and periods?

I need a regex that would test if a word is composed of letters (alpha characters), white spaces, and periods (.). I need this for validating names that are entered in my database.
This is what I currently use:
preg_match('/^[\pL\s]+$/u',$foo)
It works fine for checking alpha characters and whitespaces, but rejects names with periods as well. I hope you guys can help as I have no idea how to use regex.
Add a dot to the character class so that it would match a literal dot also.
preg_match('/^[\p{L}.\s]+$/u',$foo)
OR
preg_match('/^[\pL.\s]+$/u',$foo)
Explanation:
^ Asserts that we are at the start.
[\pL.\s]+ Matches any character in the list one or more times. \pL matches any kind of letter from any language.
$ Asserts that we are at the end.
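A minimal sketch of how this check could be wrapped and tried out; the function name is_name_with_periods and the sample names are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function is_name_with_periods(string $name): bool
{
    // Inside a character class the dot is literal, so no escaping is needed.
    return preg_match('/^[\p{L}.\s]+$/u', $name) === 1;
}

var_dump(is_name_with_periods('J. R. R. Tolkien')); // bool(true)
var_dump(is_name_with_periods('José Á. Núñez'));    // bool(true)
var_dump(is_name_with_periods('R2-D2'));            // bool(false)
```

Note that the pattern only restricts which characters may appear, so a string such as "..." also passes.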
The following regex should satisfy your condition:
preg_match('/^[a-zA-Z\s.]+$/',$foo)
In this link, you will find all the information you need to figure regex out with PHP. PHP Regex Cheat Sheet
Basically, if you want to allow the period, add \. to the character class:
preg_match('/^[\pL\s\.]+$/u',$foo)
Enjoy! :)

Match whole words in utf

I want to replace all occurrences of a with 5. Here is the code that works well:
$content=preg_replace("/\ba\b/","5", $content);
unless I have words like zapłać where a is between non standard characters, or zmarła where there is a Unicode (or non-ASCII) letter followed by a at the end of word. Is there any easy way to fix it?
The problem is that the predefined character class \w (and with it \b) is ASCII-based by default, and depending on the PHP/PCRE version this may not change when the u modifier is used. (See regular-expressions.info; preg is PCRE in the columns.)
You can use lookbehind and lookahead to do it:
$content=preg_replace("/(?<!\p{L})a(?!\p{L})/u","5",$content);
This will replace "a" if there is not a letter before and not a letter ahead.
\p{L}: any kind of letter from any language.
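A small sketch of the lookaround approach applied to the words from the question. The function name replace_standalone_a is hypothetical, and the u modifier is used so that \p{L} treats a multi-byte character such as "ł" as a single letter.

```php
<?php
// Hypothetical helper: replace "a" only when it is not next to any letter.
function replace_standalone_a(string $content): string
{
    // (?<!\p{L}) : no letter immediately before; (?!\p{L}) : no letter immediately after.
    // preg_replace returns null on invalid UTF-8, so fall back to the input in that case.
    return preg_replace('/(?<!\p{L})a(?!\p{L})/u', '5', $content) ?? $content;
}

echo replace_standalone_a('a zapłać zmarła a.'), PHP_EOL; // 5 zapłać zmarła 5.
```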
$content=preg_replace("/\ba\b/u","5",$content);

Strip trailing non-word character(s)

I need to strip any non-alphanumeric characters from the end of strings using PHP's preg_replace:
"Word One, Two, -", "Word One, Two,[space]", "Word One, Two," and "Word One, Two" should all become "Word One, Two".
I have tried preg_replace('/(.+)\\W+$/', '$1', 'Word One, Two, -'); but this only strips the last non-word character. I also tried '/(.+)\\W*$/' as I assumed this would make it work if 0 or 1 non-word characters are found (as I need) but it then doesn't match at all. I think I need to make the \W greedy but I'm not sure how. Any ideas? Also, please feel free to explain to me what I am doing wrong so I don't find myself haunting the SO regex tag ;-)
This is because (.+) eats up all other characters, including non-word characters. The regex engine starts matching the string and starts out with all characters in the capturing group. Only then does it notice that the \W at the end of the string won't fit, so it backs up, tentatively allowing a single character to be matched by the \W. But a single character is all that's needed to satisfy the \W+, so it stops there and strips just that single character. That's also the reason why (.+)\W*$ doesn't work at all: \W* is content with matching nothing at all.
Use
preg_replace('/\\W+$/', '', $foo);
instead. This avoids the problem by just replacing trailing non-word characters without even trying to match something else.
Another option would be
preg_replace('/(.+?)\\W+$/', '$1', $foo);
which would use a lazy quantifier (+?) for the capturing group. This quantifier tries satisfying the match while matching as little as possible (as opposed to + which tries to match as much as possible as we saw above). But generally I'd avoid replacing parts of the match by themselves if you can avoid it. To strip things from a string you certainly don't need to match more than you need to strip.
What your regex is doing is looking for the maximum possible amount of any character, while still keeping at least one non-word at the end.
What you need to do is just drop the (.+), and use:
preg_replace("/\W+$/","",$input);
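To see the difference between the three patterns discussed in this thread, here is a short side-by-side sketch using one of the sample strings from the question. The variable names are chosen only for illustration.

```php
<?php
$input = 'Word One, Two, -';

// Greedy (.+) gives back a single character, so only the final "-" is stripped.
$greedy = preg_replace('/(.+)\W+$/', '$1', $input);

// Lazy (.+?) stops as early as it can, leaving the whole tail for \W+.
$lazy = preg_replace('/(.+?)\W+$/', '$1', $input);

// Simplest: match only the trailing run of non-word characters.
$plain = preg_replace('/\W+$/', '', $input);

var_dump($greedy); // string(15) "Word One, Two, "
var_dump($lazy);   // string(13) "Word One, Two"
var_dump($plain);  // string(13) "Word One, Two"
```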

RegEx - extract words with prefix # or #

If I have a string
This is a #really nice#day.
On first pass I should get as an output/result words really and day (results should not contain dots or any other punctuation signs, also you should not just match A-Z,a-z and everything else ignore because string could contain international characters so keep that in mind).
On second pass I should get out everything except those two words and punctuation for ex.
This is a nice
RegEx is done via PHP.
EDIT: @hochl
The problem with ([##]\w+) is that it doesn't catch international characters like šđžćč so #dayš is recognized only as #day.
To catch the international characters, you could use the following:
[##]\p{L}+
You would need to use the Unicode modifier u for this to work in PHP.
Note:
\p{L} tells it to match Unicode "letters".
You don't need to wrap the whole thing in parentheses (), as the whole match is always available as group 0.
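Both passes from the question could be sketched roughly as follows. The two prefixes are assumed here to be '@' and '#', the sample sentence is adapted to that assumption, and both function names are hypothetical. A capture group is used in the first pass only because the question asks for the words without their prefix.

```php
<?php
// Assumption: the two prefixes are '@' and '#'.
function tagged_words(string $text): array
{
    // Group 1 captures the word without its prefix.
    preg_match_all('/[@#](\p{L}+)/u', $text, $matches);
    return $matches[1];
}

function without_tagged_words(string $text): string
{
    // Remove the prefixed words, then any remaining punctuation.
    $text = preg_replace('/[@#]\p{L}+/u', '', $text) ?? $text;
    $text = preg_replace('/[^\p{L}\p{N}\s]+/u', '', $text) ?? $text;
    // Collapse the whitespace runs left behind and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

print_r(tagged_words('This is a @really nice#day.'));               // really, day
echo without_tagged_words('This is a @really nice#day.'), PHP_EOL;  // This is a nice
```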
