preg_match with a regexp is losing the last character - php

I have a file (.txt) that I would like to have formatted. The lines look like this:
Name on Company
Street 7 CITY phone: 1234 - 56 78 91 Webpage: www.webpage.se
http://www.webpage.se
Name on Restaurant
Street 11 CITY CITY phone: 7023 - 51 83 83 Webpage:
http://
The problem I'm having is with my regexp when I try to match the city (which is in uppercase). So far I've come up with this:
preg_match('/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/', $info, $city);
As you can see, I'm working with Swedish cities, hence A-ZÅÄÖ. But this regexp doesn't work if the last character in the city's name is one of 'ÅÄÖ'; in those cases it only takes the characters before it.
Can anyone see the problem?
Thanks in advance.

Your problem is that \b is defined as matching the border between characters that are in \w and those that are not.
Your Swedish-specific characters are not in \w (which is typically equivalent to [a-zA-Z0-9_]).
You can instead replace \b with appropriate lookaround assertions (example).
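A minimal sketch of the lookaround approach, assuming the input is UTF-8 and the script file itself is saved as UTF-8. The sample lines and the slightly restructured pattern are illustrative, not the asker's exact code:

```php
<?php
// Hypothetical sample lines, assumed to be UTF-8 encoded.
$info  = 'Street 7 MALMÖ phone: 1234 - 56 78 91';
$info2 = 'Street 11 CITY CITY phone: 7023 - 51 83 83';

// Lookarounds replace \b: the match must not be preceded or followed by a letter.
// The u modifier makes PCRE read Å, Ä and Ö as single characters, not byte pairs.
$pattern = '/(?<!\p{L})[A-ZÅÄÖ]{2,}(?:[ \t][A-ZÅÄÖ]+)?(?!\p{L})/u';

if (preg_match($pattern, $info, $city)) {
    echo $city[0], PHP_EOL;
}
if (preg_match($pattern, $info2, $city2)) {
    echo $city2[0], PHP_EOL;
}
```

Because the boundaries are spelled out with \p{L}, the result no longer depends on what \w happens to contain.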

FWIW, this would seem to be a perfect place to use http://txt2re.com to develop and test your regex from examples.
That being said, there doesn't appear to be anything wrong with the regex that would cause it to skip a trailing Å, Ä or Ö character. Those are being treated no differently than the other alphabetic characters.
I suspect a Unicode problem. Perhaps the input data has a trailing Ä that is stored as an A followed by a separate diaeresis combining character. The solution for this is to normalize the Unicode string prior to applying the regex.
Also, as Amber points out, the problem may be with the \b definition of a word boundary. The docs say, A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w. So, you may get relief by changing your locale setting.
Alternatively, you can try setting the u pattern modifier in case the input is in UTF-8.
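As a rough sketch of the u modifier suggestion, assuming UTF-8 input and a UTF-8 source file: the asker's pattern is kept as-is and only the modifier is added. On reasonably recent PHP versions the u modifier also makes \w and \b Unicode-aware, which is what this relies on. The sample line is hypothetical:

```php
<?php
// Hypothetical sample line, assumed to be UTF-8 encoded.
$info = 'Street 7 MALMÖ phone: 1234 - 56 78 91';

// The asker's pattern, unchanged except for the trailing u modifier.
$pattern = '/\b[A-ZÅÄÖ]{2,}[ \t][A-ZÅÄÖ]+|[A-ZÅÄÖ]{2,}\b/u';

if (preg_match($pattern, $info, $city)) {
    echo $city[0], PHP_EOL;
}
```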

Related

Trying to add arabic numeric support to a working regex

I'm validating phone numbers with the following regex
^((\+\d{1,3}(-|.| )?\(?\d\)?(-| |.)?\d{1,5})|(\(?\d{2,6}\)?))(-|.| )?(\d{3,4})(-|.| )?(\d{4})(( x| ext)\d{1,5}){0,1}$
and it's working perfectly.
I need to add Arabic numbers support e.g. "٠١٢٣٤٥٦٧٨٩"
I already did some research and found out that the range \u0660 to \u0669 covers the Arabic digits 0 to 9, but I need this added to my working regex.
Thanks
Don't change the pattern. Just do:
$temp = str_replace(['٠','١','٢','٣','٤','٥','٦','٧','٨','٩'], range(0, 9), $input);
Then run the test on the temporary variable. Sorry, the first array looks back to front visually, but the byte order should be right.
You can change your pattern so that every occurrence of \d is replaced by [\d\x{0660}-\x{0669}]. \x{....} represents the character with the given hex code, and it can be used in ranges as well; in PHP, code points above FF written this way require the u modifier. The same can be done in JavaScript by using \u...., so your pattern would be [\d\u0660-\u0669].
Alternatively, with the u flag (Unicode) turned on you can leave \d as it is, because it will then match any Unicode digit (including Latin and Arabic, but not restricted to them). The flag also affects other tokens like \w and [[:alpha:]], but that should not be an issue here.
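Both suggestions can be sketched in a few lines. This assumes PHP 7 or later (for the "\u{....}" string escapes) and UTF-8 input; the sample number and the simplified digits-and-spaces check are illustrative stand-ins for the full phone pattern:

```php
<?php
// Arabic-Indic digits U+0660..U+0669, written as escapes to avoid
// right-to-left display confusion in the source file.
$arabic = ["\u{0660}", "\u{0661}", "\u{0662}", "\u{0663}", "\u{0664}",
           "\u{0665}", "\u{0666}", "\u{0667}", "\u{0668}", "\u{0669}"];
$latin  = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];

// Hypothetical input: "555 1234" written with Arabic-Indic digits.
$input = "\u{0665}\u{0665}\u{0665} \u{0661}\u{0662}\u{0663}\u{0664}";

// Option 1: normalise first, then validate with the unchanged pattern.
$normalized = str_replace($arabic, $latin, $input);
echo $normalized, PHP_EOL;

// Option 2: widen the digit class. The u modifier is needed for \x{0660}.
$widened = preg_match('/^[\d\x{0660}-\x{0669} ]+$/u', $input);

// Option 3: with the u modifier, \d on its own accepts Unicode digits.
$unicodeDigits = preg_match('/^[\d ]+$/u', $input);

// Without the u modifier, \d only accepts ASCII digits.
$asciiOnly = preg_match('/^[\d ]+$/', $input);
```

Normalising first keeps the existing pattern untouched, which is handy if the number is stored or compared later.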

PHP regex needs to filter other characters between words

I am new to PHP regex. Just started playing with it.
I want to accept only words (it could be 2 or 3 or 4 or 5 etc), but should accept only words, no special characters, no numbers in between these words.
for example a Name: "John williams lewis"
Regex should reject if a name has something like "John 123 williams -$ Lewis"
I tried using [\w\s]+. This regex accepts multiple words, but I don't understand how to filter out other characters between the words.
I am sorry if it is a dumb query.
^[a-zA-Z][a-zA-Z ]*$
You can try this. The anchors ^ and $ make sure you don't get partial matches.
You can use ^[a-zA-Z]+(?: [a-zA-Z]+)*$ if you want to match names separated by a single space.
\w accepts all word characters, including digits and the underscore as well as letters. Instead, you can use [a-zA-Z] if you just want to match letters, and to get the desired result you can add the whitespace matcher \s to your character class:
[a-zA-Z\s]+
You can also use a negated character class to refuse certain characters. For example, the following regex will match anything except digits:
[^\d]+
Try this:
^[a-zA-Z][a-zA-Z\s]+$
This will accept only letters (lowercase or uppercase) and whitespace.
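A short sketch of the anchored, letters-only approach, using the asker's two sample names. The helper name isPlainName is made up for illustration:

```php
<?php
// Hypothetical helper: returns true when $name matches the anchored pattern
// (words made of ASCII letters, separated by single spaces).
function isPlainName(string $name): bool
{
    return preg_match('/^[a-zA-Z]+(?: [a-zA-Z]+)*$/', $name) === 1;
}

var_dump(isPlainName('John williams lewis'));
var_dump(isPlainName('John 123 williams -$ Lewis'));
```

The anchors are what reject the second name: without them, preg_match would be satisfied by the leading "John" alone.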

Meaning of a dash between mixed characters in regex?

I'm just getting my feet wet with regexes and I came across this within a PHP program that someone else had written:
[ -\w]. Note that the dash is not the first character, there is a space preceding it.
I can't make heads or tails of what it means. I know that the dash between characters inside brackets normally indicates a range, i.e. [a-z] matches any lowercase character "a" through "z", but what does it match when the dash is between characters of different types?
My first thought was that it just matches any space or alphanumeric character, but then the dash wouldn't be necessary. My second thought was that it's matching spaces, alphanumerics, and the dash; but then I realized that the dash would probably be either escaped or moved to the front or back for that.
I've googled around and can't find anything about using a dash in a character class with mixed characters. Maybe I'm using the wrong search terms.
This might help : http://www.regular-expressions.info/charclass.html in the section "Metacharacters Inside Character Classes" it says :
Hyphens at other positions in character classes where they can't
form a range may be interpreted as literals or as errors. Regex
flavors are quite inconsistent about this.
My guess would be that it is being interpreted as a literal, so the regexp would match a space, hyphen or \w.
As a reference, it looks invalid in PCRE:
Debuggex Demo
In the PCRE reference §16. we find:
Perl, when in warning mode, gives warnings for character classes
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as literals.
PCRE has no warning features, so it gives an error in these cases
because they are almost certainly user mistakes.
[ -\w] produces a warning in Perl but not in PHP.
Your regex [ -\w] seems to be a mistake, as it will only match characters like these:
[ !"#$%&'()*+,./-]
Because the - appears in the middle, it acts as a range between the space character (32) and the first \w character (48).
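Since the answers disagree and the handling of [ -\w] appears to depend on the regex flavor and PCRE version, a safer route is to spell the class so the hyphen cannot be read as a range. A minimal sketch, assuming the intent was "space, word character, or hyphen":

```php
<?php
// Hyphen placed last in the class, so it can only be a literal.
$pattern = '/^[ \w-]+$/';

$accepted = preg_match($pattern, 'foo-bar baz_1');
$rejected = preg_match($pattern, 'foo+bar');

var_dump($accepted === 1, $rejected === 0);
```

Escaping the hyphen as \- in the middle of the class would work equally well.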

Replace all but letters - explanation

I would like to modify a string and remove all but English letters (a-z, A-Z). Note that white space should also be removed.
This post provides two answers: Remove everything except letters from PHP string
$new_string = preg_replace('/\PL/u', '', $old_string)
$new_string = preg_replace('/[^a-z]/i','',$old_string);
I understand the second answer, but not the first. The first had the highest votes.
Is the first the better answer? Please explain what it is doing.
\P introduces a negated Unicode character-property class. In this particular case, L means "letter", so \PL matches anything that is not a letter. PHP supports \p{xx} and \P{xx}, which is why /\PL/u will work.
Note, that L includes the following properties: Ll, Lm, Lo, Lt and Lu (check full list in documentation). That means, L will include:
Lower case letter (Ll)
Modifier letter (Lm)
Other letter (Lo)
Title case letter (Lt)
Upper case letter (Lu)
That means \PL fits the requirement "all except letters" better, but it will also keep non-English letters such as accented French letters (they belong to Ll or Lu), while [a-zA-Z] (same as [a-z] with the i modifier) is stricter and will leave only the letters specified in the class.
And, of course, \P{xx} only makes sense in terms of Unicode, so the /u modifier is mandatory there.
\pL is the unicode property for letters
\pN is the unicode property for numbers
[a-z] doesn't take care of éàçè....
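The difference between the two answers shows up as soon as the input contains a non-English letter. A small sketch, assuming PHP 7 or later and a made-up sample string:

```php
<?php
// Hypothetical input with two accented letters (U+00E9 and U+00F6).
$old = "H\u{00E9}llo W\u{00F6}rld 123!";

// Removes everything that is not a Unicode letter; accented letters survive.
$unicodeLetters = preg_replace('/\PL/u', '', $old);

// Removes everything outside a-z / A-Z; accented letters are removed too.
$asciiLetters = preg_replace('/[^a-z]/i', '', $old);

echo $unicodeLetters, PHP_EOL;
echo $asciiLetters, PHP_EOL;
```

For the asker's stated goal (English letters only), the second form is the one that matches the requirement.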
how can i use preg_match with alphanumeric and unicode acceptance?

regex unable to allow apostrophe

I am experiencing a strange problem with a regular expression I have already used before.
The goal is to allow the user to enter his name, with letters, hyphen, and apostrophes if needed in a php form.
My regex is:
"/^[\w\s'àáâãäåçèéêëìíîïðòóôõöùúûüýÿ-]+$/i"
But... everything is allowed except the apostrophe. Escaping it changes nothing. Why?
To deal with Unicode characters, you can do (note the u modifier, so the input is read as UTF-8):
/^[\pN\pL\pP\pZ]+$/u
where:
\pN stands for any number
\pL stands for any letter
\pP stands for any punctuation
\pZ stands for any space
It matches names like:
d'Alembert
d’Alembert (note the different apos from above)
Jean-François
O'Connors
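A quick check of the property-based pattern against the names listed above, assuming PHP 7 or later, UTF-8 input, and the u modifier on the pattern. The rejected sample 'John$' is an added illustration:

```php
<?php
// Numbers, letters, punctuation and separators, matched as Unicode characters.
$pattern = '/^[\pN\pL\pP\pZ]+$/u';

$names = [
    "d'Alembert",              // ASCII apostrophe
    "d\u{2019}Alembert",       // typographic apostrophe U+2019
    "Jean-Fran\u{00E7}ois",    // hyphen and c with cedilla
    "O'Connors",
];

$results = [];
foreach ($names as $name) {
    $results[] = preg_match($pattern, $name);
}

// A currency sign is a symbol, not punctuation, so this one is rejected.
$rejected = preg_match($pattern, 'John$');

var_dump($results, $rejected);
```

Note that \pP admits every punctuation character, so this is considerably more permissive than a hand-written list of allowed characters.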
