Search for repeated arabic (hindi) numerals in a string

I am trying to determine whether a given string contains more than 4 consecutive Arabic (Hindi) numerals. To be specific, the Arabic (Hindi) numerals are:
١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
which are Unicode code points U+0661 to U+0669.
I tried:
if (preg_match("/\b(?:(?:١|٢|٣|٤|٥|٦|٧|٨|٩)\b\s*?){4}/", $str, $matches) > 0)
return true;
But it doesn't work at all (always returns false).

You can try the following regular expression. \p{N} matches any kind of numeric character in any script.
preg_match('~(?:\p{N}\s?){4,}~u', $str, $matches)
If you just want to match those specific characters, you could use the following instead.
preg_match('~(?:[\x{0660}-\x{0669}]\s?){4,}~u', $str, $matches)

Use a character class and quantify it. See this regex:
/[١٢٣٤٥٦٧٨٩]{4,}/
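Without the u modifier your characters are not word characters, so \b would require a word character directly before or after your match; remove it. Also append the u modifier to the pattern so that the character class is read as UTF-8 characters rather than as single bytes.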
Here is a regex demo.
As a note, if you are matching more than 4 characters, use {5,} instead.
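To tie the two answers together, here is a minimal sketch of the check as a function. The name hasArabicDigitRun and its $min parameter are made up for illustration, and the digits are written as \u{...} escapes (PHP 7+) only to keep the snippet readable.

```php
<?php
// Hypothetical helper: true when $str contains more than $min consecutive
// Arabic-Indic digits (U+0660 to U+0669, zero included).
function hasArabicDigitRun(string $str, int $min = 4): bool
{
    $count = $min + 1; // "more than 4" means at least 5
    // The u modifier makes PCRE read pattern and subject as UTF-8,
    // so each digit counts as one character instead of two bytes.
    $pattern = '/[\x{0660}-\x{0669}]{' . $count . ',}/u';
    return preg_match($pattern, $str) === 1;
}

$five = "\u{0661}\u{0662}\u{0663}\u{0664}\u{0665}"; // five digits in a row
$four = "\u{0661}\u{0662}\u{0663}\u{0664}";         // only four

var_dump(hasArabicDigitRun("abc $five def")); // bool(true)
var_dump(hasArabicDigitRun("abc $four def")); // bool(false)
```

If whitespace between the digits should be tolerated, the (?:...\s?){5,} form from the first answer can be substituted for the plain character class.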

Related

regex capture certain characters only

I'm currently dealing with a bit of a problem. This is my string: "all-days".
I need some help creating a regex that captures the first character, the dash, and the first character after the dash. I'm a bit of a newbie to regex, so forgive me.
Here is what I've got so far: (^.)

capture the first character, the dash and also the first character after the dash
With preg_match function:
$s = "all-days";
preg_match('/^(.)[^-]*(-)(.)/', $s, $m);
unset($m[0]);
print_r($m);
The output:
Array
(
[1] => a
[2] => -
[3] => d
)

It's not regex, but if you just want a solution by some other means, it can be achieved with explode, array_walk and implode:
$string = 'all-days-with-my-style';
$arr = explode("-",$string);
array_walk($arr,function(&$a){
$a = $a[0];
});
echo implode("-",$arr);
Live demo : https://eval.in/882846
Output is : a-d-w-m-s

I assume your string only contains word characters and hyphens, and doesn't have consecutive hyphens:
To remove everything that isn't the first character, a hyphen, or the first character after a hyphen, remove everything that isn't after a word boundary:
$result = preg_replace('~\B\w+~', '', 'all-days');
If you only want to match these characters, just catch each character after a word boundary:
if ( preg_match_all('~\b.~', 'all-days', $matches) )
print_r($matches[0]);

Code
See code in use here
\b(\w|-\b)
For more precision, the following can be used (note that it uses Unicode groups, so it doesn't work in every language, but it does in PHP). This will only match letters, not numbers and underscores. It uses a negative lookbehind and positive lookahead, but you can understand it if you keep reading this article and break it apart one piece at a time.
(\b\p{L}|(?<=\p{L})-(?=\p{L}))
Explanation
\b Assert position at a word boundary
(\w|-\b) Capture the following into capture group 1
\w Match any word character
| Or
- Match the - character literally
\b Assert position at a word boundary
\b:
Asserts the position in the string matches 1 of the following:
^\w Assert position at the start of the string and match a word character
\w$ Match a word character and assert its position as the last position in the string
\W\w Match any non-word character, followed by a word character
\w\W Match any word character, followed by a non-word character
\w:
Means a word character (usually defined by any character in the set a-zA-Z0-9_, however, some languages also accept Unicode characters that represent any letter, number, or underscore \p{L}\p{N}_).
For more precision (depending on the use-case), you can specify [a-zA-Z] (for ASCII letters), \p{L} for Unicode letters, or [a-z] with the i flag for ASCII characters with the case-insensitive flag enabled in regex.
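As a rough illustration of the letter-only variant, the sketch below collects the matches into a string. The function name initialsWithHyphens is invented here, and I swapped \b for a negative lookbehind (?<!\p{L}) so the result does not depend on how \b treats non-ASCII letters; that is a variation on the pattern above, not the same pattern.

```php
<?php
// Hypothetical helper: keeps the first letter of each hyphen-separated word
// together with the hyphens between them, e.g. "all-days" becomes "a-d".
function initialsWithHyphens(string $s): string
{
    // Either a letter that is not preceded by a letter,
    // or a hyphen that sits between two letters.
    preg_match_all('/(?<!\p{L})\p{L}|(?<=\p{L})-(?=\p{L})/u', $s, $m);
    return implode('', $m[0]);
}

echo initialsWithHyphens('all-days'), "\n";               // a-d
echo initialsWithHyphens('all-days-with-my-style'), "\n"; // a-d-w-m-s
```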

How can I detect letters in other languages (not English) in the string?

Here is my code:
function isValid($string) {
return strlen($string) >= 6 &&
strlen($string) <= 40 &&
preg_match("/\d/", $string) &&
preg_match("/[a-zA-Z]/", $string);
}
// Negative test cases
assert(!isValid("hello"));
// Positive test cases
assert(isValid("abcde2"));
As you see, my script validates a string based on 4 conditions. Now I'm trying to develop this one:
preg_match("/[a-zA-Z]/", $string)
This condition returns true only for English letters. How can I also accept other letters, such as ا ب ث چ?
Note: Those characters aren't Arabic, they are Persian.

To match either an English or Persian letter, you may use
preg_match('/[\x{0600}-\x{06FF}A-Z]/iu', $string)
The \x{0600}-\x{06FF} range covers the Unicode Arabic block, which is supposed to include all Persian letters. The A-Z range will match all ASCII letters (both upper- and lowercase, since the /i case-insensitive modifier is used). The /u modifier is necessary since you are working with Unicode characters.
Also, use mb_strlen rather than strlen when checking the length of a Unicode string; it counts Unicode code points rather than bytes.
As for "Your password should be containing at least a letter (that letter can be in any language)", you need to use
preg_match('/\p{L}/u', $string)
or
preg_match('/\p{L}\p{M}*+/u', $string)
^^^^^^^^^^^^
that will match any letter (even the one with a diacritic after it). \p{L} matches any base Unicode letter, and \p{M}*+ will possessively match 0+ diacritics after it. If the match value is not used, /\p{L}/u will suffice for the check.
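Putting these suggestions into the original function could look roughly like this. It is only a sketch: the name isValidPassword is made up, it assumes the mbstring extension is loaded and the input is UTF-8, and it keeps \d without the u modifier, so only ASCII digits satisfy the digit rule.

```php
<?php
// Hypothetical rewrite of isValid(): length is counted in code points,
// and "a letter" means a letter in any script.
function isValidPassword(string $string): bool
{
    $length = mb_strlen($string, 'UTF-8'); // code points, not bytes

    return $length >= 6
        && $length <= 40
        && preg_match('/\d/', $string) === 1       // at least one ASCII digit
        && preg_match('/\p{L}/u', $string) === 1;  // at least one letter, any script
}

// Negative test cases
var_dump(isValidPassword('hello'));  // bool(false): too short, no digit
var_dump(isValidPassword('123456')); // bool(false): no letter
// Positive test cases
var_dump(isValidPassword('abcde2')); // bool(true)
var_dump(isValidPassword("\u{0633}\u{0644}\u{0627}\u{0645}12")); // bool(true): four Persian letters plus "12"
```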

Php mb_ereg_match faulty match

I am trying to match some text with PHP's mb_ereg_match, and I am using this piece of regex to match all non-word chars:
/[^-\w.]|[_]/u
I want to be able to look up unicode chars that's why I am using mb_ereg.
With this input:
'γιωρ;γος.gr'
Which contains chars from the Greek alphabet.
I want to match the ';' and if it is matched to return -1 else return the input.
Whatever I do it doesn't match the ';' and returns the input.
I tried to use preg_match, but it doesn't work as I expect.
Any suggestions?
Edit 1
I did a test and I found that it matches correctly if I convert my input to:
';γος.gr'
Also works fine with latin chars.
Edit 2
If I get one of the following I want to print -1.
'γιωρ;γος.gr'
';γος.gr'
'γιωρ;.gr'
';.gr'
Else I want to get whatever the input is.
Edit 3
I did some more tests and it doesn't match any special char that is surrounded by UTF-8 chars.

You need to use \X with preg_match_all to match all Unicode chars:
\X - an extended Unicode sequence
Also, see this \X description from Regular-Expression.info:
Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, and Ruby 2.0: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
And you can use the following snippet then:
$re = '/\X/u';
$str = "γιωρ;γος.gr";
preg_match_all($re, $str, $matches);
if (in_array(";", $matches[0])) {
echo -1;
}
else {
print_r($matches[0]);
}
See IDEONE demo
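For the behaviour described in Edit 2, a plain preg_match with the u modifier may already be enough. The sketch below is an assumption-laden alternative rather than the approach above: the function name rejectInvalidChars is invented, and the class [^-.\p{L}\p{N}] is my substitute for the question's [^-\w.]|[_] (it rejects the underscore without a separate branch). As far as I can tell from the PHP manual, mb_ereg_match only matches at the beginning of the string and takes its pattern without delimiters, which would be consistent with Edit 1, but I have not verified that against the asker's setup.

```php
<?php
// Hypothetical helper: returns -1 when $input contains anything other than
// letters, digits, hyphens or dots; otherwise returns $input unchanged.
function rejectInvalidChars(string $input)
{
    // preg_match searches the whole string, so the offending character
    // is found wherever it occurs, not just at the start.
    if (preg_match('/[^-.\p{L}\p{N}]/u', $input) === 1) {
        return -1;
    }
    return $input;
}

var_dump(rejectInvalidChars('γιωρ;γος.gr')); // int(-1)
var_dump(rejectInvalidChars(';.gr'));        // int(-1)
var_dump(rejectInvalidChars('γιωργος.gr'));  // string "γιωργος.gr"
```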

preg_match_all alpha+accented, but not numeric

I would like to use PHP's preg_match_all to capture substrings which comprise:
A-Z, a-z, all accented chars;
space;
hyphen.
It must not capture strings with anything else in them, including numeric chars.
This example is close but also catches strings containing numeric chars:
preg_match_all("/([\w -]+)/u", $abigstring, $matches);

That's a job for Unicode properties:
preg_match_all("/([\p{L} -]+)/u", $abigstring, $matches);
\p{L} matches any character with the Unicode property "Letter".

This is also an option :
preg_replace("/[^A-Za-zÀ-ÿ -]+/u", "", "juana 123456 sfdf 423 999 _ -a- dsa & ç%& à à$¨à+", -1);
Here is a working example -> https://xrg.es/#17hyhfm

For those who need it, here is a correction of the code above, which doesn't work as posted:
preg_match("/^([\p{L} -]+)$/u", $string)
Anchors (^ and $) were missing
EDIT : Much better. If hyphens/spaces are only allowed in the middle:
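/^([\p{L}](?:[\p{L} -]*[\p{L}])?)$/u
(The inner quantifier is * rather than +, so that two-letter strings such as "ab" are also accepted.)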
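The difference between extracting runs and validating a whole string can be sketched as follows. Both function names are made up for this example, and the anchored pattern uses * for the middle part so that two-letter strings pass.

```php
<?php
// Hypothetical helper: every run of letters, spaces and hyphens in $text.
function extractLetterRuns(string $text): array
{
    preg_match_all('/[\p{L} -]+/u', $text, $m);
    return $m[0];
}

// Hypothetical helper: true when the whole string consists of letters,
// with spaces or hyphens allowed only between letters.
function isLettersSpacesHyphens(string $s): bool
{
    return preg_match('/^\p{L}(?:[\p{L} -]*\p{L})?$/u', $s) === 1;
}

print_r(extractLetterRuns('abc 123 dé-f'));          // two runs: "abc " and " dé-f"
var_dump(isLettersSpacesHyphens('Jean-Luc Picard')); // bool(true)
var_dump(isLettersSpacesHyphens('R2-D2'));           // bool(false)
```

Note that the unanchored version returns the letter runs found around the digits, which is why the anchors matter when the goal is to reject the string outright.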

regular expression to detect numbers written as words - UTF-8 input

Thanks for the answers to "regular expression to detect numbers written as words".
I now have this working, however I have the same requirement but the numbers as words are in Arabic (or any other UTF-8) and not English, so :
if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0)
return true;
Does not work - I've googled and there seems to be quite a few issues with preg_match and UTF-8 string but I couldn't get any of the suggestions found to work. Any help much appreciated.

Note that \b may not be working as you expect. \b specifies a word boundary, but what is considered a word character by PCRE depends on what locale the script is running in (take a look towards the bottom of the PCRE escape sequences manual page):
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).
Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a unicode "letter".
So all together it would look like:
/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/u
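One caveat with (?<=\P{L}) is that it requires some character before the match, so it cannot succeed at the very start of the string. A negative lookbehind, (?<!\p{L}), avoids that. The sketch below is a variation under that assumption, with invented function names, and it adds (?!\p{L}) after each word to stand in for the trailing \b of the original pattern.

```php
<?php
// Number words copied from the question.
function arabicNumberWords(): array
{
    return ['واحد', 'اثنان', 'ثلاثة', 'أربعة', 'خمسة', 'ستة', 'سبعة', 'ثمانية', 'تسعة', 'صفر', 'عشرة'];
}

// Hypothetical helper: true when $str contains four number words in a row.
function hasFourNumberWords(string $str): bool
{
    $alternation = implode('|', arabicNumberWords());
    // (?<!\p{L}) : the run must not start in the middle of a word
    // (?!\p{L})  : each number word must not be followed by another letter
    // u modifier : pattern and subject are treated as UTF-8
    $pattern = '/(?<!\p{L})(?:(?:' . $alternation . ')(?!\p{L})\s*){4}/u';
    return preg_match($pattern, $str) === 1;
}

$four  = implode(' ', array_slice(arabicNumberWords(), 0, 4));
$three = implode(' ', array_slice(arabicNumberWords(), 0, 3));

var_dump(hasFourNumberWords($four));  // bool(true), even at the start of the string
var_dump(hasFourNumberWords($three)); // bool(false)
```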

Convert both the pattern and $str to windows-1256, do the matching, then convert the $matches items back (if needed). This is the solution I came to after struggling for some time.
$pattern="/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/";
$pattern_windows1265 = iconv('utf-8', 'windows-1256', $pattern);
$str_windows1265 = iconv('utf-8', 'windows-1256', $str);
if (preg_match($pattern_windows1265, $str_windows1265, $matches) > 0)
return true;
Here's a test example to check if unicode conversion is allowing Arabic letters match in preg_match:
<?php
$pattern="/(واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)/";
$pattern_windows1265 = iconv('utf-8', 'windows-1256', $pattern);
$test_cases=array(
'لدي أربعة أولاد',
'قفز الثعلب فوق الشجرة',
'عندي خمسة أرانب',
);
foreach ($test_cases as $str) {
$str_windows1265 = iconv('utf-8', 'windows-1256', $str);
if (preg_match($pattern_windows1265, $str_windows1265, $matches) > 0) {
echo $str, '<br />';
}
}
When executed, it will output:
لدي أربعة أولاد
عندي خمسة أرانب
I removed some of the pattern to check if the plain check against Arabic works, which seems to be working.

You can use the u pattern modifier, which makes PCRE treat both the pattern and the subject string as UTF-8, so it works for any language that can be written in UTF-8.
if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/u", $str, $matches) > 0)
Resources :
Pattern modifiers
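One thing to keep in mind with the u modifier: if the subject is not valid UTF-8, preg_match returns false rather than 0, and preg_last_error() reports PREG_BAD_UTF8_ERROR. A small sketch of how that can be told apart from "no match" (the function name describeMatch is invented for this example):

```php
<?php
// Hypothetical helper: distinguishes a match, a non-match and a UTF-8 error.
function describeMatch(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        // false signals an error, not "no match"
        return preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'invalid UTF-8'
            : 'regex error';
    }
    return $result === 1 ? 'match' : 'no match';
}

echo describeMatch('/\p{L}/u', "\u{0623}"), "\n"; // match
echo describeMatch('/\p{L}/u', '123'), "\n";      // no match
echo describeMatch('/\p{L}/u', "\xFF"), "\n";     // invalid UTF-8
```

Checking for false explicitly matters here because a loose comparison such as "> 0" treats the error case the same as a string that simply does not match.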
