pcre regex to match first two words, numbers - php

I need a regular expression to match only the first two words in a string (they may contain letters, numbers, commas and other punctuation, but not spaces, tabs or newlines).
My solution is ([^\s]+\s+){2}. It matches something like '123 word, ' in '123 word, hello', but it doesn't work on a string with just two words and no whitespace after them.
What is the right regex for this task?

You have it almost right:
(\S+\s+\S+)
Assuming you don't need stronger control on what characters to use.
If you need to match either two words or just one word, you can use one of these:
(\S+\s+\S+|\S+)
(\S+(?:\s+\S+)?)
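As a rough sketch of how the last pattern could be used from PHP: the helper name firstTwoWords is hypothetical, and the outer capturing group is dropped because the whole match is already available in $m[0].

```php
<?php
// Hypothetical helper: returns the first one or two whitespace-separated
// words of $text, or null if the string contains no non-whitespace at all.
function firstTwoWords(string $text): ?string
{
    // \S+ is a run of non-whitespace; the optional group adds a second word.
    if (preg_match('/\S+(?:\s+\S+)?/', $text, $m) === 1) {
        return $m[0];
    }
    return null;
}

var_dump(firstTwoWords('123 word, hello')); // string(9) "123 word,"
var_dump(firstTwoWords('single'));          // string(6) "single"
```

Because the pattern is not anchored, leading whitespace is skipped and the match starts at the first word.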

Instead of trying to match the words, you could split the string on whitespace with preg_split().
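A minimal sketch of the preg_split() approach, assuming you want the words back as an array. The function name firstTwoWordsBySplit is made up for illustration.

```php
<?php
// Hypothetical helper: splits on runs of whitespace and keeps at most
// the first two pieces.
function firstTwoWordsBySplit(string $text): array
{
    $trimmed = trim($text);
    if ($trimmed === '') {
        return []; // nothing but whitespace
    }
    // A limit of 3 yields at most two words plus the unsplit remainder.
    $parts = preg_split('/\s+/', $trimmed, 3);
    return array_slice($parts, 0, 2);
}

print_r(firstTwoWordsBySplit('123 word, hello world'));
```

The limit argument avoids splitting the rest of a long string when only the first two words are needed.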

If you really want to allow only numbers and letters, [^\s] is not restrictive enough. Use this:
/[a-z0-9]+(\s+[a-z0-9]+)?/i


How to check if string contains specific special characters or starting with a space? [duplicate]

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter, and this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem with ^(?!\s*$) in your expression is that the lookahead fails only if there is nothing but whitespace up to the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. This still allows the string to end with whitespace. To avoid that, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think lookarounds are not needed for this task.
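For readers working in PHP rather than Java, here is a hedged sketch of the same \p{L} pattern with PCRE. The function name is hypothetical, and the u modifier is an addition needed so that PHP treats the pattern and subject as UTF-8.

```php
<?php
// Hypothetical helper: true if $s consists of letter-only words separated
// by exactly one space, with no leading or trailing space.
function isLettersWithSingleSpaces(string $s): bool
{
    return preg_match('/^\p{L}+(?: \p{L}+)*$/u', $s) === 1;
}

var_dump(isLettersWithSingleSpaces('John Smith')); // bool(true)
var_dump(isLettersWithSingleSpaces(' John'));      // bool(false)
var_dump(isLettersWithSingleSpaces('John '));      // bool(false)
```

Since every space must be followed by at least one letter, leading, trailing and doubled spaces are all rejected without any lookaround.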
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes horizontal tab (09), line feed (10), vertical tab (11), form feed (12), carriage return (13), space (32).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know whether words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word besides the first must have a space before it.
Hope I helped.

Regexp for name string

I'm trying to find a regexp that covers a lot of outcomes, the one I'm using now would be enough if it weren't for a lot of international names having special letters in them as well as hyphens.
The one I'm using now looks like this:
/^[A-Za-zåäöÅÄÖ\s\-\ ]*$/
It allows for hyphens and whitespace but it also allows them at the start or end of the string which I don't want to allow.
I need to modify this to allow:
Special letters such as éýÿüåäö etc. (preferably without having to write them all manually)
Capital letter at the start of each new word
Whitespace between words
- hyphens between words, but not before or after the full string
It should not allow numbers, which it doesn't already. Since I haven't worked a whole lot with regex construction I'm in the dark on how to achieve this, I've found a lot of solutions that covers one or the other scenario, but not all of the ones I need. I would appreciate the assistance. The regex should work for PHP validation.
EDIT:
$fname = 'Scrooge Mc-Duck'; //Only example string
$fname = trim($fname);
if (!preg_match('/^\p{Lu}\p{Ll}+([ -]+\p{Lu}\p{Ll}+)*$/', $fname)) {
$fnameErr = 'Invalid first name';
}
This outputs the error when using @npinti's solution.
Assuming that your regular expression engine supports Unicode character properties, you can use \p{L} to match any letter. So, to match a name, you could use ^\p{Lu}\p{Ll}+([ -]+\p{Lu}\p{Ll}+)*$.
This matches an upper case letter followed by one or more lower case letters. That can in turn be followed, any number of times, by one or more spaces or hyphens and then another upper case letter with one or more lower case letters. The ^ and $ at the beginning and end make sure that the regular expression matches the entire string.
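A small sketch of how this pattern might be wrapped for PHP validation. The function name isValidName is hypothetical, and the u modifier is an addition so that non-ASCII letters in UTF-8 input are recognised by \p{Lu} and \p{Ll}.

```php
<?php
// Hypothetical helper: each word must be one upper case letter followed by
// one or more lower case letters; words are joined by spaces or hyphens.
function isValidName(string $name): bool
{
    return preg_match('/^\p{Lu}\p{Ll}+([ -]+\p{Lu}\p{Ll}+)*$/u', trim($name)) === 1;
}

var_dump(isValidName('Scrooge Mc-Duck')); // bool(true)
var_dump(isValidName('-Anna'));           // bool(false)
```

Note that this shape rejects names with an upper case letter in the middle of a word, such as 'McDuck', so it may be stricter than real-world names require.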

php: strip everything except alphanumeric unicode and two characters

I am trying to strip all punctuation from a text, but since the text is in Spanish I can't use [A-Za-z0-9].
I have found this regex:
trim(preg_replace('#[^\p{L}\p{N}]+#u', ' ', $str))
which seems to do the job, but I would like to keep two special characters, # and @. How can I achieve that?
Extra question: How can I delete all strings that are just numbers? e.g. 123 would be deleted but not as5623.
Thanks in advance!
You can simply add those characters to your negated class to retain them. And be sure to change your pattern delimiters to something other than # as well.
~[^\p{L}\p{N}#@]+~u
To remove all strings that are numbers, you can place word boundaries \b around your pattern.
\b\d+\b
Note: A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not.
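Putting the two steps together might look like the sketch below. It assumes the two characters to keep are # and @ (an assumption about the question), and the function name cleanText plus the final whitespace clean-up are additions for illustration.

```php
<?php
// Hypothetical helper combining both answers' patterns.
// Assumption: the characters to keep are '#' and '@'.
function cleanText(string $str): string
{
    // 1. Replace every run of characters that are not letters, digits,
    //    '#' or '@' with a single space.
    $str = preg_replace('~[^\p{L}\p{N}#@]+~u', ' ', $str);
    // 2. Remove tokens that consist of digits only.
    $str = preg_replace('~\b\d+\b~u', '', $str);
    // 3. Collapse the gaps left behind and trim the ends.
    return trim(preg_replace('~\s+~u', ' ', $str));
}

echo cleanText('Hola, ¿qué tal? #php @user 123 as5623');
// Hola qué tal #php @user as5623
```

The token as5623 survives because there is no word boundary between a letter and a digit, so \b\d+\b cannot match inside it.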
You can use POSIX character classes too.
/[^[:alnum:]#@]+/
For the two special characters, you just have to add them inside the character class.
To delete all the words that contain only numbers, the following regex would work.
/\b[[:digit:]]+\b/

Splitting large strings into words in php

I have a long string in php consisting of different paragraphs each of which with different sentences (it is pretty much a small document). I want to split the whole thing into words by removing any symbols/characters that are not relevant. For example remove commas, spaces, new lines, full stops, exclamation marks and anything that might be irrelevant so as to end up with only words.
Is there an easy way of doing this in one go, for example by using a regular expression and the preg_split function or do I have to use the explode function a number of times: eg first get all the sentences (by removing '.', '!' etc). Then get words by removing ',' and spaces etc etc.
I would not like to use the explode function on all the possible characters that are irrelevant since it is time consuming and I may accidentally omit some of all those possible characters.
I would like to find a more automatic way. I think a well-defined regular expression might do the work, but again I would need to specify all the possible characters, and I have no idea how to write regular expressions in PHP.
So what can you suggest?
Do you want to remove punctuation characters, etc and then split the words into an array? Or just strip it so there are only letters and spaces? Not exactly sure what you're trying to achieve, but the following might help:
<?php
$string = "This is a sentence! It has *lots* of #$#king random non-word characters. Wouldn't you like to strip them?";
$words = preg_replace("/[^\w\ _]+/", '', $string); // strip all punctuation characters, newlines, etc.
$words = preg_split("/\s+/", $words); // split by left over spaces
var_dump($words);
Either way, it gives you the general idea of using regular expressions to manipulate text as needed. My example has two parts, this way words like "wouldn't" aren't split into two words like other answers have suggested.
To be unicode compatible, you should use this one:
preg_split('/\PL+/u', $string, -1, PREG_SPLIT_NO_EMPTY);
which splits on characters that are not letters.
Have a look here to see the Unicode character properties.
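A short sketch of that call in use; the wrapper name splitIntoWords is hypothetical.

```php
<?php
// Hypothetical wrapper: splits on every run of non-letter characters
// (\PL is the negation of \pL) and drops empty pieces.
function splitIntoWords(string $text): array
{
    return preg_split('/\PL+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitIntoWords("Hola, ¿qué tal?\nAdiós!"));
// Array ( [0] => Hola [1] => qué [2] => tal [3] => Adiós )
```

Be aware that this treats apostrophes and digits as separators too, so "Wouldn't" becomes "Wouldn" and "t", and purely numeric tokens disappear.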
Just use preg_replace() and define a regular expression to match on the different characters you wish to replace and provide a replacement character to replace them with.
http://php.net/manual/en/function.preg-replace.php
For the characters you wish to search on you can define those in a PHP array as seen in the PHP manual.
Your answer is in the domain of regular expressions and would probably be very difficult to get right. You could get something that works well in almost all cases but there would be exceptions.
This might help:
http://www.regular-expressions.info/wordboundaries.html

PHP Simple PCRE regex to only allow 1 dot or none?

I'm trying to create a regex for alias validation:
And I'm allowing letters, numbers and 1 dot.
I have done the following:
/^[a-z0-9\\.]+$/i
However, it allows more than 1 dot?
This should do it:
/^(?:\.[a-z0-9]+|[a-z0-9]+(?:\.[a-z0-9]*)?)$/i
This allows the string to either:
start with one dot that is followed by at least one alphanumeric character, or
start with one or more alphanumeric characters that may be followed by one dot and zero or more alphanumeric characters.
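To see which inputs this pattern accepts, a quick check along these lines may help. The function name isValidAlias is made up for the example.

```php
<?php
// Hypothetical helper: letters and digits with at most one dot, which may
// appear at the start, in the middle or at the end.
function isValidAlias(string $alias): bool
{
    return preg_match('/^(?:\.[a-z0-9]+|[a-z0-9]+(?:\.[a-z0-9]*)?)$/i', $alias) === 1;
}

foreach (['john', 'john.doe', '.john', 'john.', 'jo.hn.doe', '.'] as $alias) {
    printf("%-10s %s\n", $alias, isValidAlias($alias) ? 'ok' : 'rejected');
}
```

In this sketch the first four aliases are accepted, while 'jo.hn.doe' (two dots) and '.' (no alphanumeric character) are rejected.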
I think it is not a good idea to allow a dot as a first or last character, in that case:
/^[a-z0-9]+\.?[a-z0-9]+$/i
try this:
^(?:[a-z0-9]+\.?[a-z0-9]*|[a-z0-9]*\.?[a-z0-9]+)$
places the dot in the center, then allows it to be surrounded on either side.
