Converting a JavaScript regular expression to a preg_match()-compatible one


I have this regular expression from some JavaScript code:
/+\uFF0B0-9\uFF10-\uFF19\u0660-\u0669\u06F0-\u06F9u/
After reading about PHP and \u support, I converted it to use \x{...}:
/\+\x{FF0B}0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}/u
But I still can't get it to work in PHP:
$phoneNumber = '+911561110304';
$start = preg_match('/\+\x{FF0B}0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}/u', $phoneNumber,$matches);
$matches comes back empty!
How can I fix this?

It looks like you want to match an ASCII plus sign or its fullwidth equivalent (U+FF0B), followed by one or more digits from a few different writing systems: ASCII, fullwidth, Arabic-Indic and Extended Arabic-Indic. But, as @mario observed, you seem to be missing some square brackets. The JavaScript version probably should be:
/[+\uFF0B][0-9\uFF10-\uFF19\u0660-\u0669\u06F0-\u06F9]+/
(I couldn't see any reason for the u at the end, so I dropped it.) The PHP version would be:
'/[+\x{FF0B}][0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}]+/u'
Of course, this will allow a mix of ASCII, Arabic and fullwidth characters in the same number. If that's a problem, you'll need to break it up a bit. For example:
'/\+(?:[0-9]+|[\x{0660}-\x{0669}]+|[\x{06F0}-\x{06F9}]+)|\x{FF0B}[\x{FF10}-\x{FF19}]+/u'
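As a quick check of the two patterns above, here is a minimal sketch. It assumes PHP 7.0 or later (for the "\u{...}" string escapes) and UTF-8 input. The anchored variant in $strictAnchored is my own addition, not part of the answer: preg_match() is unanchored, so without ^ and $ the strict pattern would still match the leading part of a mixed number.

```php
<?php
// Pattern from the answer: a plus sign (ASCII or fullwidth) followed by digits.
$pattern = '/[+\x{FF0B}][0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}]+/u';

$phoneNumber = '+911561110304';
$found = preg_match($pattern, $phoneNumber, $matches);
// $found is 1 and $matches[0] is '+911561110304'

// Fullwidth plus sign followed by fullwidth digits 9 and 1.
$fullwidth = "\u{FF0B}\u{FF19}\u{FF11}";
$foundFullwidth = preg_match($pattern, $fullwidth, $fullwidthMatches);

// Hypothetical anchored version of the "no mixing" pattern (assumption:
// the whole string must be a single number).
$strictAnchored = '/^(?:\+(?:[0-9]+|[\x{0660}-\x{0669}]+|[\x{06F0}-\x{06F9}]+)'
                . '|\x{FF0B}[\x{FF10}-\x{FF19}]+)$/u';

// ASCII digits followed by an Arabic-Indic digit: rejected by the anchored pattern.
$mixed = "+91\u{0662}";
$mixedAccepted = preg_match($strictAnchored, $mixed);

var_dump($found, $foundFullwidth, $mixedAccepted);
```

Whether mixed-script numbers should be rejected at all depends on the application, so treat the anchored pattern as one option rather than a requirement.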

Related

How to get only integer value Started from Symbol from given string

I need the integer value that follows either £ or Â£. I tried to do this with a regex, but I only get the values that start with Â£.
Here is the regex I use:
if(preg_match('/(\£[0-9]+(\.[0-9]{2})?)/',$vals,$matches))
{
$main[]= str_replace('£','',$matches[0]);
}
I am not familiar with regex, so please share any solution. Any help would be highly appreciated. Thank you.

From your question I understand that you are having trouble with character encodings, so first of all I would suggest addressing that issue one step earlier: it is really important to resolve encoding issues at the earliest possible point.
Back to the question. To avoid falling deeper into charset encoding hell, I would recommend writing your regexp literal in hex, because otherwise the charset in which you save your PHP files affects the result. For example, if you do something like this:
preg_match('/(£|Â£)(\d+)/', ...)
It would match "£" and "Â£" (binary) if you save your source code in ISO-8859-1, but it would actually match "Â£" and "Ã‚Â£" (binary) if you chose to save your source code in UTF-8 (which might be a good idea in general). So be careful with this, and verify what your editor/IDE is doing!
My suggestion is therefore to write it this way, which behaves the same whether the file is saved as ISO-8859-1 or UTF-8:
preg_match('/(\xa3|\xc2\xa3)(\d+)/', ...) // match "£" and "Â£"
I also suggest using the sub-pattern capture feature of regular expressions, so you don't have to call str_replace() afterwards:
if (preg_match('/(?:\xa3|\xc2\xa3)([0-9]+(?:\.[0-9]{2})?)/', $data, $regp)) {
$main[] = $regp[1];
}
The "?:" right after the "(" means "this is a sub-pattern, but don't capture it".
Note that you can also replace preg_match with preg_match_all and you will find in $regp[1] the array of all matching numbers already prepared.
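The preg_match_all variant mentioned above could look like the following sketch. The sample string in $data is made up for illustration: it contains one pound sign stored as the two UTF-8 bytes C2 A3 and one stored as the single ISO-8859-1 byte A3.

```php
<?php
// Hypothetical input mixing both byte representations of the pound sign.
$data = "A: \xc2\xa310.50, B: \xa37";

// No /u modifier: the pattern is matched byte by byte, so \xa3 and \xc2\xa3
// are plain byte sequences and the file encoding of this script does not matter.
$count = preg_match_all('/(?:\xa3|\xc2\xa3)([0-9]+(?:\.[0-9]{2})?)/', $data, $regp);

// $regp[1] holds only the captured numbers, without the currency symbol.
print_r($regp[1]);
```

Because the currency symbol sits in a non-capturing group, $regp[1] can be used directly as the list of amounts.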

Try with this modified regex:
(?:£|Â£)([0-9]+(\.[0-9]{2})?)
It should do the trick. But it will also return decimal values, because of this part:
(\.[0-9]{2})?
You can remove it, and then it will return only the integer part after £|Â£

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash, so I get â€ as the last character when I view it in Notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to work around the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, when I do this, my log file shows â€ in Brackets, which is confusing too.
My question is: why is having the en-dash at the end of the string different from having it anywhere else (followed by other characters)?
Hopefully my question is clear; if not, I can try to explain further.
Thanks.

Pid's answer explains why this is happening; this answer just looks at what you can do about it.
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through it, as there are likely other functions that you will need elsewhere in your application.
You may need to install or enable this module if you get a "function not found" error. Instructions for this are platform dependent and out of scope for this question.
The function you want for the case in your question is mb_substr(). It is called the same way as substr(), but counts characters instead of bytes and takes an additional optional encoding argument.
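A minimal sketch of the difference, assuming the mbstring extension is enabled, the text is UTF-8, and PHP 7.0 or later (for the "\u{2013}" escape). The short sample string stands in for the 250-character text from the question.

```php
<?php
// "abc" + EN DASH + "def"; the en-dash occupies 3 bytes in UTF-8.
$originalText = "abc\u{2013}def";

// substr() counts bytes: this keeps "abc" plus only the first byte of the dash.
$byByte = substr($originalText, 0, 4);

// mb_substr() counts characters: this keeps "abc" plus the whole dash.
$byChar = mb_substr($originalText, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byByte, 'UTF-8')); // the truncated string is not valid UTF-8
var_dump(json_encode($byByte));                // json_encode() fails on invalid UTF-8
var_dump(json_encode($byChar));                // works on the character-safe cut
```

When json_encode() returns false, json_last_error_msg() reports the reason, which helps confirm that malformed UTF-8 is the cause.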

UTF-8 uses multi-byte sequences to extend the character set beyond ASCII and accommodate many more characters. (Surrogates are a UTF-16 mechanism, not a UTF-8 one.)
A single UTF-8 character may be encoded in one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
       ^
You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
The solution to this problem is here in Dezza's answer.
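To see the bytes involved, here is a small sketch using bin2hex(). The three-character sample string is made up; the en-dash (U+2013) is encoded in UTF-8 as the bytes E2 80 93.

```php
<?php
$dash = "\xE2\x80\x93";        // U+2013 EN DASH as raw UTF-8 bytes
$text = "ab" . $dash;          // 2 ASCII bytes followed by the 3-byte dash

$whole     = bin2hex($dash);                 // all three bytes of the dash
$truncated = bin2hex(substr($text, 0, 4));   // cut after the second byte of the dash
$complete  = bin2hex(substr($text, 0, 5));   // cut after the last byte of the dash

echo $whole, "\n", $truncated, "\n", $complete, "\n";
```

The truncated string ends in the bytes E2 80. Read as Windows-1252, those two bytes are "â" and "€", which matches the â€ seen in Notepad.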

How do I remove Unicode Characters from String?

I need a regex to remove emoji and symbols (basically any "special" Unicode character) while keeping Japanese, Korean, Chinese, Vietnamese, and any other languages that use Unicode characters. The regex is going to be used on a PHP and a Python server. I noticed that I'm having problems with iPhone users who use the emoji keyboard to create some weird names.
So far I've tried a few regexes, but I couldn't find a proper one.
Below is my own text string which I use for testing. Please note that I have no idea what the non-English characters mean. If any of it is a bad word, please change it.
abcdefghij
klmnopqrst
uvwxyz
1234567890
한국 韓國
‎Công Ty Cổ Phần Hùng Đức
南极星
おはようございます
============== Below characters should be detected by regex ========
™£¢£¢§¢∞§¶•§ª§¶
[]{}"';?><_+=-
()*&^%$##!~`,.
😊🐻🏢🐭4️⃣⌘
❤❣☁♫🗽🐯

All symbols match \p{S} regex. You just need to be sure your PHP is in UTF-8 mode (whatever that means, I don't do PHP) – see http://php.net//manual/pl/regexp.reference.unicode.php – and for Python, you need an alternative regex library: https://pypi.python.org/pypi/regex
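On the PHP side, "UTF-8 mode" means adding the /u modifier to the pattern. A minimal sketch, assuming UTF-8 input and PHP 7.0 or later; the sample name is made up from the test data in the question:

```php
<?php
// Korean, Vietnamese and Chinese text followed by a heart, an emoji and a
// trademark sign (U+2764, U+1F60A, U+2122), all in category S (Symbol).
$name = "한국 Công 南极星 \u{2764}\u{1F60A}\u{2122}";

// Remove every character in the Unicode category S; /u treats both the
// pattern and the subject as UTF-8.
$clean = preg_replace('/\p{S}+/u', '', $name);

// preg_replace() returns null if the subject is not valid UTF-8.
var_dump($clean);
```

Note that \p{S} covers symbols only. Punctuation such as "[]{}" is in \p{P}, and a keycap sequence like 4️⃣ is a digit followed by combining marks (category M), so those need their own handling if they should be removed as well.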

You may find that regular expressions aren't the hammer for all nails. In this case you simply want to exclude characters, so a regex probably isn't the right tool.
In Python 3 the string translate() method would be useful: if you map the code points you want excluded to None, they are dropped from the result.
The mapping is a dictionary keyed by Unicode code point, so it is not limited to ASCII. The 256-character mapping table applies only to byte strings (bytes.translate(), or str.translate() in Python 2); for those you could program a similar algorithm yourself in Python, but it's not going to be as efficient.
PS: There are no "bad words" in your text.

how to count the occurrences of a Unicode character in a string?

How do you count the occurrences of a Unicode character in a string with PHP?
Maybe this is a simple question, but I am a beginner in PHP.
I want to count how many times the Unicode character U+06CC occurs in a string.
The character 'yeh' in Farsi corresponds to two code points:
ی = u+06cc
ي = u+064a
U+064A is used as a substitute in Farsi.
The popular Arabic charset CP-1256 has no character mapped to U+06CC.
I want to count how many times U+06CC occurs in a string in order to detect whether the string is Arabic or Farsi.
when I use $count = substr_count($str, "ى"); or when I use
$count = substr_count($str, "\xDB\x8c");
it counts both "ی" and "ي" ,
any idea ?

I suppose you have a UTF-8 string, since UTF-8 is the most reasonable Unicode encoding.
$count = substr_count($str, "\xDB\x8C");
is what you want. You simply treat the string as a sequence of bytes. In UTF-8 the first byte of a multibyte character and its continuation bytes can never be mixed up (the first byte is always 11...... binary, while continuation bytes are always 10......). This ensures you cannot find something different from what you are looking for.
To find the UTF-8 encoding of U+06CC I used the fileformat.info website, which I think is the best for this purpose.
If you use UTF-8 in your IDE too, you can simply write "ى" instead of "\xDB\x8C" (internally they are exactly the same string in PHP), but that will make the readability of what you have written dependent on the IDE (often not good if you need to share your code).
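A small sketch of the byte-level count, assuming a UTF-8 string and PHP 7.0 or later (for the "\u{...}" escapes). The three-letter sample string is made up; in UTF-8, U+06CC is encoded as DB 8C and U+064A as D9 8A.

```php
<?php
// Farsi yeh, Arabic yeh, Farsi yeh.
$str = "\u{06CC}\u{064A}\u{06CC}";

// Count each letter by its UTF-8 byte sequence.
$farsiYeh  = substr_count($str, "\xDB\x8C"); // U+06CC
$arabicYeh = substr_count($str, "\xD9\x8A"); // U+064A

echo "Farsi yeh: $farsiYeh, Arabic yeh: $arabicYeh\n";
```

If the two counts come out the same for real input, checking bin2hex($str) shows which code points the string actually contains.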
Now that you have clarified your question, my answer above is no longer appropriate. I leave it there just as a reference for other passers-by.
Your problem could stem from the fact that, reading here, it seems that "ي" can lose its dots below if modified by the Unicode character U+0654 (the non-spacing mark "Arabic hamza above"). Since my browser does not remove the dots, and adds the hamza, I don't know whether the hamza is supposed to disappear too when the dots disappear. Anyway, it COULD be that "\xDB\x8C" has the same appearance as "\xD9\x8A\xD9\x94". I have not been able to find the reverse, i.e., the double dot below as a non-spacing modifier character, which would explain why substr_count($str, "\xDB\x8c") finds the Arabic yeh too, but maybe it exists.

I have tried this example, and it works fine:
$str="مىمى";
$count = substr_count($str, "ى");
echo $count;
I got the answer 2, which is correct.
If you want a more specific answer, you should provide more specific details in your question.

Accented characters are removed when using preg_match_all

I have the problem described in the title.
If I use
preg_match_all('/\pL+/u', $_POST['word'], $new_word);
and I type "hello à and ì", the returned $new_word is "hello" and "and"; the accented letters are missing.
Why?
Someone advised me to specify all the characters I want to keep, like this:
preg_match_all('/\pL+/u', $_POST['word'], 'aäeëioöuáéíóú');
but I want my application to work with all existing accents (for a multilanguage website).
Can you help me?
Thanks.
EDIT: To be specific, I use this regex to strip punctuation. It strips all punctuation correctly, but the accented characters are returned wrong; in fact they are not returned at all.
EDIT 2: I am sorry, I explained this very badly.
The problem is not in preg_match_all but in
str_word_count($my_key, 2, 'aäáàeëéèiíìoöóòuúù');
I had to specify the accented characters manually, but I think there are many others. Right?

\pL matches any Unicode letter (it does not match spaces), so accented characters are covered. Make sure that $_POST['word'] is a UTF-8 encoded string. If it is not, try utf8_encode() before matching (this assumes the input is ISO-8859-1), or check the encoding of your HTML form. In my tests, your example works like a charm.
You can use this together with count() to get the number of words, so you don't need to list the possible characters yourself; \pL does that for you. This should do the trick:
$string = "áll thât words wíth ìntérnâtiønal çhårs";
preg_match_all('/\pL+/u', $string, $words);
echo count($words[0]); // returns: 6
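If you relied on str_word_count($my_key, 2, ...) for the position => word array, something similar can be built from preg_match_all with PREG_OFFSET_CAPTURE. This is only a sketch: unicode_word_positions() is a hypothetical helper name, and it assumes UTF-8 input and PHP 7.1 or later. Like str_word_count, it reports byte offsets, not character offsets.

```php
<?php
// Hypothetical helper: returns [byte offset => word] for every run of
// Unicode letters in a UTF-8 string.
function unicode_word_positions(string $text): array
{
    $words = [];
    if (preg_match_all('/\pL+/u', $text, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as [$word, $offset]) {
            $words[$offset] = $word;
        }
    }
    return $words;
}

// "ciao à tutti": the accented letter takes two bytes in UTF-8.
$positions = unicode_word_positions("ciao \u{00E0} tutti");
print_r($positions);
```

Unlike str_word_count, this treats apostrophes and hyphens as word separators; add them to the pattern (for example [\pL'-]+) if they should count as part of a word.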

Try using mb_ereg_match() (instead of preg_match()) from Multibyte String PHP library. It is specially made for working with multibyte strings.
