preg_match_all alpha+accented, but not numeric - php

I would like to use PHP's preg_match_all to capture substrings which comprise:
A-Z, a-z, all accented chars;
space;
hyphen.
It must not capture strings with anything else in them, including numeric chars.
This example is close but also catches strings containing numeric chars:
preg_match_all("/([\w -]+)/u", $abigstring, $matches);

That's a job for Unicode properties:
preg_match_all("/([\p{L} -]+)/u", $abigstring, $matches);
\p{L} matches any character with the Unicode property "Letter".

This is also an option :
preg_replace("/[^A-Za-zÀ-ÿ -]+/u", "", "juana 123456 sfdf 423 999 _ -a- dsa & ç%& à à$¨à+", -1);
Here is a working example -> https://xrg.es/#17hyhfm

For those want to, here is a correction of the code above which doesn't work !
preg_match("/^([\p{L} -]+)$/u", $string)
Anchors (^ and $) were missing
EDIT : Much better. If hyphens/spaces are only allowed in the middle:
/^([\p{L}](?:[\p{L} -]+[\p{L}])?)$/u

Related

Regular expressions, allow specific format only. "John-doe"

I've researched a little, but I found nothing that relates exactly to what I need and whenever tried to create the expression it is always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either or the text can be alphabetical or numeric however the only symbol allowed is - and that is in between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%
You need to use anchors and use the - so the characters in the character class are read as a range, not the individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You also could simplify it a but with the i modifier and the \d meta character.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.
You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at last three word characters at the beginning of the line
- # a dash
\w{3,}$ # three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As #chris85 pointed out, \w will match an underscore as well. Trincot had a good comment (matching accented characters, that is). To achieve this, simply use the u modifier.
See a demo on regex101.com and a complete code on ideone.com.
You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // match "a" to "z", "A" to "Z" and 0 to 9 and requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string
Regex Demo
And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
Demo at regex101
My vote for #chris85 which is most obvious and performant.
This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

Extract hashtags from non ASCII strings using regex

How do i can extract hashtags from a non-ASCII string , using regex ?
For example :
$str = #Hello #سلام #hello-again #سلام_دوباره #hello_again
I wouldn't like bad characters like ! # $ % ^ ♫ ► that included in hashtag , be accepted.
I tried this but it accepts bad characters :
preg_match_all('/#([^\s]+)/', $str, $matches);
it accepts #►☻
You may use the following regex:
'/#([\w-]+)/u'
See regex demo. The /u modifier will allow handling of Unicode symbols, and \w will match Unicode letters.
The regex breakdown:
# - a # symbol
([\w-]+) - 1 or more characters that are either letters, numbers, underscores or hyphens.
See the IDEONE demo

Search for repeated arabic (hindi) numerals in a string

I am trying to determine whether a given strings contains more than 4 consecutive arabic (hindi) numerals. to be specific, arabic (hindi) numerals are :
١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
which are unicode 661 to 669
I tried :
if (preg_match("/\b(?:(?:١|٢|٣|٤|٥|٦|٧|٨|٩)\b\s*?){4}/", $str, $matches) > 0)
return true;
But it doesn't work at all (always returns false).
You can try the following regular expression. \p{N} matches any kind of numeric character in any script.
preg_match('~(?:\p{N}\s?){4,}~u', $str, $matches)
If you just want to match those specific characters, you could use the following instead.
preg_match('~(?:[\x{0660}-\x{0669}]\s?){4,}~u, $str, $matches)
Use a character class and quantify it. See this regex:
/[١٢٣٤٥٦٧٨٩]{4,}/
Your characters are not word characters, so \b would assert a word character in front of / behind your match, remove it.
Here is a regex demo.
As a note, if you are matching more than 4 characters, use {5,} instead.

preg_match_all alpha+accented character, but not numeric

I would like to use php's preg_match to capture substrings which comprise:
A-Z, a-z, all accented chars
space
hyphen
It must not capture strings with anything else in them, including numeric chars.
This example is close but also catches strings containing numeric chars:
preg_match("/([\p{L} -]+)/u", $string)
A similar question already had an answer (the one above) but it doesn't work...
If I understand your problem correctly (which I might not have), then you simply want to use the ^ and $ characters to specify that "the match HAS to start here and the match HAS to end here":
/^([\p{L} -]+)$/u
^ ^
Then preg_match would only return true if the string had nothing else in it.
DEMO
Edit:
If hyphens/spaces are only allowed in the middle:
/^([\p{L}](?:[\p{L} -]+[\p{L}])?)$/u
DEMO

PHP Regex Not Matching Desired Substrings

I've written the next regular expression
$pattern = "~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s-']+~";
in order to match substrings as 2.bon jovi - it's my life
the problem is the only part that is recognized is - bon jovi
none " - " or " ' " are recognized by this regular expression.
I'd prefer to know what is wrong with the regular expression that I've wrote rather than getting a new one.
Your regular expressions states that after the period character (can be changed to \.), you will have zero or more white space characters which should then be followed by 1 upper case letter. In your string, you do not have any upper case letters.
Secondly, the - should be placed last when you want to match it. So, changing your regex to this: ~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s'-]+~ will match something like so: 2.Bon jovi - it's my life.
On the other hand, you can change it to this: ~\d+[.][\s]*[A-Za-z0-9\s'-]+~ to match something like so: 2.bon jovi - it's my life.
EDIT: Ammended as per the comments of Marko D and aleation.
A better regular expression to handle that would be...
$pattern = "~\d+\.\s*[\pL\pP\s]+~";
CodePad.
This will match a number, followed by a ., followed by optional whitespace, followed by one or more Unicode letters, whitespace or punctuation marks.
$pattern = "~\d+\..*~";
$string = "2.bon jovi - it's my life";
preg_match($pattern, $string, $match);
print_r($match);
output: Array ( [0] => 2.bon jovi - it's my life )
So the way I understand this regular expression is:
\d+ // Match any digit, 1 or more times
[.] // Match a dot
[\s]* // Match 0 or more whitespace characters
[A-Z]{1} // Match characters between an UPPERCASE A-Z Range 1 time
[A-Za-z0-9\s-']+ // Match characters between A-Z, a-z, 0-9, whitespace, dashe and apostrophe
So straight away, your 'bon jovi' might not get matched as it's lower case and you're only looking for uppercase characters. 'bon jovi' also contains a space so perhaps changing that part of the regular expression to allow for lowercase characters and whitespace might help so you'd end up with:
$pattern = "~\d+[.][\s]*[A-Za-z\s]{1}[A-Za-z0-9\s-']+~";
Note: I quickly tested this on RegExr ( http://gskinner.com/RegExr/ ) and it appeared to match the string fine.
Your regrex is as follows.
~ // delimiter
\d+ // 1 or more numbers
[.] // a period
[\s]* // 0 or more whitespace characters
[A-Z]{1} // 1 upper case letter
[A-Za-z0-9\s-\']+ // 1 or more characters, from the character class
~ //delimiter
Comparing that to the string "2.bon jovi" You have:
~ //
\d+ // "2"
[.] // "."
[\s]* // ""
[A-Z]{1} // <- NO MATCH
[A-Za-z0-9\s-\']+ //
~ //
"bon" does not start with a captial letter, it therefore does not match [A-Z]{1}
Cleaner regex
There are a few simple things you can do to clean up your regex
don't use character-classes for one character
don't specify {1} it's the same as not being present
Applying the above to your existing regex you get:
$pattern = "~\d+\.\s*[A-Z][A-Za-z0-9\s-']+~";
Which is slightly easier to read.
Your [A-Z]{1} sub-pattern requires one capital letter, so "2.bon jovi - it's my life" will not match.
And you need to escape the - in the [A-Za-z0-9\s-'] character class, or put it at the start or end, otherwise it is specifying a range.
"~\d+\.[A-Za-z0-9\s'-]+~"
As pointed out in the comments, it is actually not necessary to escape the - in the character class in your regex. That is only because you happened to precede it with a metacharacter \s that cannot be part of a range. Normally, if you want to match a literal - and you have it in a character class, you must escape it or position it as described above.

Categories