Regex not playing well with Russian text [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
regexp with russian lang
I have a regular expression that picks certain links out of a text and attaches a file icon based on the file type of the link. Like this:
$text = preg_replace('((<a href="[\w\./:]+getfile.php\?id='.$file.'"([a-zA-Z0-9_\- ,\.:;"=]*)>)([a-zA-Z0-9_,\.:;&\-\(\)\<\>\'/ ]+)</a>)','\\1'.fileicon($name).'</a> \\1\\3</a> ('.($pagecount?$pagecount." ".($pagecount>1?$pages:$page1).", ":"").readable_filesize($size,1).')',$text);
This worked well until I tried it with some Russian text. The input would be something like:
Русский
But it doesn't show the icon before the link or the file information after the link, which makes me suspect the regex doesn't play well with Russian text. What could be the cause here?

You should use the u modifier when working with Unicode strings:
preg_replace('/>([^<]+)</u', '', $string);
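To see what the u modifier changes, here is a minimal sketch (the sample string is only an illustration). Without u, PCRE treats the subject as bytes, so each Cyrillic letter counts as two units; with u, the subject is read as UTF-8 characters.

```php
<?php
// Sample UTF-8 string, chosen only for illustration: 6 Cyrillic letters, 12 bytes.
$word = 'Привет';

// Without /u the dot matches single bytes.
$byteCount = preg_match_all('/./', $word);

// With /u the dot matches whole UTF-8 characters.
$charCount = preg_match_all('/./u', $word);

echo "bytes: $byteCount, characters: $charCount\n";
```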

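Your character class only allows [a-zA-Z0-9_,\.:;&\-\(\)\<\>\'/ ]. There are no Russian characters in there.
You can fix this by adding the relevant characters to the class. If you only need to support Russian, \p{Cyrillic} should do it. If you want all Unicode letters, use \p{L}. Both need the u modifier. PHP's PCRE engine uses the script name \p{Cyrillic} rather than the block-style \p{InCyrillic}, and the short form \p{L} rather than \p{Letter}.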
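A small sketch of the difference, assuming the input is UTF-8. The character class is a shortened version of the one in the question, and the sample strings are made up for illustration.

```php
<?php
// Hypothetical link text, not taken from the question.
$linkText = 'Документ';

// An ASCII-only class (shortened from the question) rejects it.
$asciiOnly = preg_match('/^[a-zA-Z0-9_,.:; ]+$/', $linkText);

// Adding \p{Cyrillic} together with the u modifier accepts it.
$cyrillic = preg_match('/^[a-zA-Z0-9_,.:; \p{Cyrillic}]+$/u', $linkText);

// \p{L} accepts letters from any script, so mixed text passes too.
$anyLetter = preg_match('/^[\p{L}0-9_,.:; ]+$/u', 'Документ and file');

var_dump($asciiOnly, $cyrillic, $anyLetter);
```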

You can simplify your regexp down to something like
$re = "~
(<a\s+href=\".+?getfile\.php\?id=$file\".*?>)
(.+?)
</a>
~xui";
Because .+? matches any character, this should solve the Cyrillic problem automatically.
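Here is a rough, runnable version of that idea. The values of $file, $icon and $info are hypothetical stand-ins for the asker's $file, fileicon($name) and the page count / file size suffix; only the pattern comes from the answer above (written on one line, so the x modifier is dropped).

```php
<?php
// Hypothetical values for illustration only.
$file = 42;
$icon = '[pdf]';              // stand-in for fileicon($name)
$info = '(3 pages, 120 KB)';  // stand-in for the page count and size text

$text = '<p><a href="/dl/getfile.php?id=42" class="doc">Документ</a></p>';

// preg_quote guards against regex metacharacters in the id.
$id = preg_quote((string) $file, '~');
$re = '~(<a\s+href=".+?getfile\.php\?id=' . $id . '".*?>)(.+?)</a>~ui';

// ${1} is the opening tag, ${2} is the link text.
$result = preg_replace($re, '${1}' . $icon . '</a> ${1}${2}</a> ' . $info, $text);

echo $result, "\n";
```

The braces in ${1} and ${2} keep the group references unambiguous if the icon text ever starts with a digit.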

Cyrillic Unicode characters are within the range U+0400 to U+04FF. Add this range to your character class as \x{0400}-\x{04FF}, together with the u modifier.

Related

Exclude closed HTML tags from string with regex

I've got a problem with the preg_replace() in PHP.
I get some unescaped HTML strings from a database and escape all special characters with htmlentities(). That worked well, but it also replaces the < and > symbols, so I used str_replace() to restore all the < and > so that the tags are excluded from the replacement. All the tags I use are closed, but some of my content contains text wrapped in < and > itself; for example, the string <nome programma> is treated as a tag.
So I decided to use preg_replace() with this regex: <(\w+)>(.*)<\/(\w+)>
I have to escape those strings:
<sub>string</sub>
<code>start "<nome programma>":</code>
Il tipo <code>string</code> e il tipo <code>char</code>
That works well for the first two cases, but not for the last one.
I've sketched an example here.
Can someone help me figure this out?

I found the solution!
The working regex is this one: ((<)(\w+)(>))(.*?)((<)[\/+](\w+)(>)).
I hope this helps someone else!
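The change that matters there is the lazy quantifier: .* runs to the last closing tag on the line, while .*? stops at the first one. A minimal sketch with the third sample string; the \1 backreference and the htmlspecialchars callback are my own additions, not part of the original solution.

```php
<?php
$input = 'Il tipo <code>string</code> e il tipo <code>char</code>';

// Greedy: one match spanning from the first <code> to the last </code>.
preg_match_all('~<(\w+)>(.*)</(\w+)>~', $input, $greedy);

// Lazy, with \1 so the closing tag must have the same name as the opening tag.
preg_match_all('~<(\w+)>(.*?)</\1>~', $input, $lazy);

print_r($greedy[2]); // inner text seen by the greedy pattern
print_r($lazy[2]);   // inner text seen by the lazy pattern

// Escape only the text between matching tags, leaving the tags themselves alone.
$escaped = preg_replace_callback(
    '~<(\w+)>(.*?)</\1>~',
    function (array $m): string {
        return "<{$m[1]}>" . htmlspecialchars($m[2], ENT_NOQUOTES, 'UTF-8') . "</{$m[1]}>";
    },
    '<code>start "<nome programma>":</code>'
);
echo $escaped, "\n";
```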

preg_match accented characters

I have an issue using preg_match with PHP.
I want my users to fill the Name field with only valid characters.
Ex: no numbers or special chars.
My site will eventually be bilingual, but most of my visitors are French Canadians.
I prefer UTF-8 for my encoding.
So at the top of my document I have this tag:
<meta charset="utf-8" />
I need to accept accented characters in my form and I have tried this:
(preg_match('/^\p{L}+$/ui',$string))
But I can't get accents to be accepted this way.
Here is an example of what a name could contain:
jean-françois d'abiguäel
That's pretty much as bad as it could get.
Everyone seems to get (preg_match('/^\p{L}+$/ui',$string)) working except me.
I would need something like this:
/^\p{L}(\p{L}+[- ']?)*\p{L}$/ui
But I need to get it working.
My servers are IIS (GoDaddy).
PHP Version is 5.4
default timezone is set to America/Montreal
Thank you!

This pattern should work:
/^\pL+(?>[- ']\pL+)*$/u
But feel free to adapt it for more exotic names (for example, names with a trailing quote or apostrophe).

~^([\p{L}\s'-]+)$~ui
Matches the following names:
Jean-François d'Abiguäel
François Hollande
Père Noël
See a demo on regex 101.

Actually you can shorten @Casimir et Hippolyte's answer like so:
/^\pL+([- ']\pL+)*$/u
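A quick way to try the pattern, assuming the form data really arrives as UTF-8. The function name isValidName is hypothetical. One thing worth checking: with the u modifier, preg_match returns false (not 0) when the subject is not valid UTF-8, which can look like "accents are not accepted".

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function isValidName(string $name): bool
{
    // preg_match returns 1 on a match, 0 on no match, and false on error,
    // for example when $name is not valid UTF-8 under the u modifier.
    return preg_match('/^\pL+(?>[- \']\pL+)*$/u', $name) === 1;
}

var_dump(isValidName("jean-françois d'abiguäel")); // accented letters, hyphen, apostrophe
var_dump(isValidName('Jean3'));                    // contains a digit
var_dump(isValidName("\xE9"));                     // a Latin-1 byte, not valid UTF-8
```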

How to check in php if string contains special characters like { and } [duplicate]

This question already has answers here:
preg_match special characters
(7 answers)
Closed 5 years ago.
There is an input field where I want to prevent the user from entering the following special characters: {}[]$
Right now I have the following solution in my code, but the problem is that it doesn't allow ä, ö, ü or other characters like that.
if (preg_match('/^[-a-zA-Z0-9 .]+$/', $string) || empty($string)) {
    echo "Everything ok!";
} else {
    echo "Everything not ok!";
}
Because of that, I tried using preg_match('/^[\p{L}\p{N} .-]+$/', $string), because it was said to allow characters from any language, but that solution doesn't allow characters like # and *, which I think may be needed. Is there a solution that allows anything except the {}[]$ characters? Any help is much appreciated, since I can't figure out what to write to get this working.

This is how I do it:
if (preg_match('/[^a-zA-Z\d]/', $string)) {
//$string contains special characters, do something.
}
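If the goal is to allow everything except {}[]$, it may be simpler to test for just those five characters instead of listing what is allowed. A minimal sketch; the function name is hypothetical.

```php
<?php
// Hypothetical helper: true when $input contains none of { } [ ] $.
function hasNoForbiddenChars(string $input): bool
{
    // The class lists only the forbidden characters. Inside a class,
    // { } and $ are literal; [ and ] are escaped.
    return preg_match('/[{}\[\]$]/', $input) === 0;
}

echo hasNoForbiddenChars('äöü #*') ? "Everything ok!\n" : "Everything not ok!\n";
echo hasNoForbiddenChars('a{b') ? "Everything ok!\n" : "Everything not ok!\n";
```

An empty string passes as well, which matches the empty($string) branch in the question.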

Strip RTF strings with PHP - Regex [duplicate]

This question already has answers here:
Regular Expression for extracting text from an RTF string
(11 answers)
Closed 9 years ago.
A column in the database I work with contains RTF strings. I would like to strip the RTF markup using PHP, leaving just the sentence in between.
It is an MS SQL 2005 database, if I recall correctly.
An example of the kind of strings pulled from the database (need any more let me know, all the rest are similar):
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Tahoma;}}
\viewkind4\uc1\pard\lang1033\f0\fs17 ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.\lang2057\fs17\par
}
I would like this to be stripped to only return:
ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.
Now, I have successfully managed to strip the characters in ASP.NET for a previous project, however I would like to do so using PHP. Here is the regular expression I used in ASP.NET, which works flawlessly may I add:
"(\{.*\})|}|(\\\S+)"
However when I try to use the same expression in PHP with a preg_replace it does not strip half of the characters.
Any regex gurus out there?

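Use this code:
$string = preg_replace('/(\{.*\})|}|(\\\\\S+)/', '', $string);
Note that I added a '/' delimiter at the beginning and at the end of the regex, which preg_replace requires. The literal backslash also has to be written as \\\\ in the PHP string so that the regex engine receives \\ followed by \S+.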
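One thing to watch when porting this pattern from .NET: PHP string literals turn \\ into a single backslash before the regex engine sees the pattern, so "\\\S+" reaches PCRE as \\S+ (a backslash followed by the letter S), which would explain only part of the markup being stripped. A sketch using the sample string from the question; the variable names are mine.

```php
<?php
// Nowdoc: the RTF sample is taken literally, no escape processing.
$rtf = <<<'RTF'
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Tahoma;}}
\viewkind4\uc1\pard\lang1033\f0\fs17 ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.\lang2057\fs17\par
}
RTF;

// Four backslashes in the PHP literal become \\ in the regex (one literal backslash).
$pattern = '/(\{.*\})|}|(\\\\\S+)/';
$clean = trim(preg_replace($pattern, '', $rtf));

// Two backslashes in the PHP literal become \ + \S, i.e. the regex \\S+.
$tooFewBackslashes = "/(\{.*\})|}|(\\\S+)/";
$partial = trim(preg_replace($tooFewBackslashes, '', $rtf));

echo $clean, "\n";
echo $partial, "\n";
```

The trim() call removes the line breaks and the leading space left behind after the control words are stripped.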

HTML foreign language characters [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Regular Expression To Anglicize String Characters?
What would be the best way to convert foreign language characters to English ones? For example ü to u.

There are only a couple of reasons to do this (URL friendliness, mostly). You want strtr.
It basically works like this:
$addr = strtr($addr, "äåö", "aao");
The 2nd comment in the manual has a nice translation table for you.
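If your strings are UTF-8, the three-argument form of strtr can be a problem, because it translates byte by byte and each of these letters is two bytes long. Passing an array of replacement pairs avoids that. A minimal sketch; the map is only illustrative and far from complete.

```php
<?php
// Illustrative map only; extend it with whatever characters you need.
$map = ['ä' => 'a', 'å' => 'a', 'ö' => 'o', 'ü' => 'u'];

// The array form replaces whole substrings, so multi-byte characters are safe.
$addr = strtr('Müller, Västerås, Göteborg', $map);

echo $addr, "\n";
```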

str_replace works on UTF-8 strings as-is:
$text = str_replace('ü', 'u', $text);
To find all non-English characters, use preg_match_all with the i and u modifiers:
preg_match_all('#[^a-z0-9\-\.\,\:\;]#iu', $text, $characters);
