Codeigniter Form Validation - Custom callback with regex that allows accents

Before anyone flames me, I'd like to say I have tried all solutions here on SO as well as google but nothing works :(. I need some enlightenment :)
Here is my custom form_validation callback function.
Code Snippet
function _check_valid_text($text_string)
{
if(empty($text_string)) return TRUE;
if(!preg_match("/^[\pL\pN_ \w \-.,()\n]+$/i", $text_string))
{
// Set the error message:
$this->form_validation->set_message('_check_valid_text', 'Invalid Character or symbol');
return FALSE;
}
return TRUE;
}
WHAT I WANT:
Allow all alphanumeric characters [a-z, A-Z & 0-9]
Allow accents
Allow the following too: space, open brace, close brace, dash, period and comma
Strangely
für works, i.e. no error (YAY!).
trägen does not :(
Can someone please tell me where I am going wrong?
PS: Please don't provide solutions where your solution is to include the char that is not getting recognized into the preg_match string (e.g. add ä into the string). I am looking for a generic solution that works.
PPS I want to make a suggestion to the community. How about having a library of regex expressions for various cases? One can write a regex and once someone else verifies that it is working it can be accepted. Just a suggestion though. I assume there are more newcomers here like me who get stuck a lot in regex-hell :)

You must add the u modifier to tell the regex engine that your string must be treated as a Unicode (UTF-8) string. Note too that you can reduce your character class, for example:
$subjects = array('Wer nur den lieben Gott läßt walten',
'Sei gegrüßet Jesu gütig',
'Komm, Gott, Schöpfer, Heiliger Geist',
'←↑→↓');
$pattern = '~^[\pL\pN\s,.()_-]++$~u';
foreach ($subjects as $subject) {
if (preg_match($pattern, $subject, $match))
echo '<br/>True';
else
echo '<br/>false';
}
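Applied to the validation callback from the question, the same idea might look like the sketch below. The helper name is_valid_text is hypothetical, and the CodeIgniter wiring (set_message and the callback itself) is left out so that the snippet runs on its own:

```php
<?php
// Hypothetical helper; inside the CodeIgniter callback you would call
// set_message() and return FALSE whenever this returns false.
function is_valid_text(string $text): bool
{
    if ($text === '') {
        return true; // empty input is allowed, as in the question
    }
    // \pL = any letter, \pN = any number; /u makes PCRE read the input as UTF-8.
    return preg_match('/^[\pL\pN\s,.()_-]+$/u', $text) === 1;
}

var_dump(is_valid_text("tr\u{00e4}gen")); // bool(true)
var_dump(is_valid_text('<script>'));       // bool(false)
```

Note that \s also lets tabs and newlines through; replace it with a literal space if only spaces should be accepted.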

There is no generic solution to handle this because no predefined set of characters that meets your needs exists.
However, you can add the range of accented characters.
[ ,-\.0-9A-Za-z\{\}À-ž]
Also, see How to allow utf-8 charset in preg_match? for information on matching against a specific character set.

Related

How to escape special chars when sending mail via PHPmailer while keeping Polish characters?

I'm making a simple contact form for a site, and I've come across a slight problem: when sending mail via PHPmailer, I can pass HTML tags, curly braces and other special characters in my form, which probably isn't a good idea...
The problem is that I need to keep spaces and Polish characters (ąćęłńóśźż), and I'm an absolute newbie in regular expressions and PHP.
I resorted to using preg_replace doing basically this:
function clean($string) {
return preg_replace('/[^A-Za-z0-9\-]/', '', $string);
}
As you might expect, that leaves me with this kind of garbage:
Before preg_replace: https://imgur.com/a/DrJVBNT
After preg_replace: https://imgur.com/a/Q1xIhWI
All help is appreciated!
Problem solved: Summary
Ended up following Álvaro González's suggestion of using the zendEscaper component to escape HTML tags. Did so by doing this
$inputFieldName = $escaper->escapeHtml($_POST['inputFieldName']);
every time I need to make sure there can't be any HTML used, where inputFieldName is your <input name=""> attribute.
If for some other reason you really need to do what I asked to do in the first place, which is removing some characters entirely, but leaving English and Polish letters, numbers and spaces, then Toto's answer suits your needs:
function clean($string) {
return preg_replace('/[^A-Za-z0-9ąćęłńóśźżĄĆĘŁŃÓŚŹŻ\s-]/u', '', $string);
}
Again, thanks everyone for help!

Just add spaces and Polish characters inside the character class:
function clean($string) {
return preg_replace('/[^A-Za-z0-9ąćęłńóśźż\s-]/u', '', $string);
}
The /u flag is mandatory to deal with Unicode characters.
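If listing every accented letter by hand becomes tedious, a Unicode property class is one possible alternative. This is only a sketch: the name clean_text is hypothetical, and \p{L} accepts letters from every script, which is broader than English plus Polish:

```php
<?php
// Hypothetical variant of clean(): keep letters (any script), digits,
// whitespace and hyphens; drop everything else.
function clean_text(string $input): string
{
    // preg_replace() returns null if $input is not valid UTF-8.
    return preg_replace('/[^\p{L}\p{N}\s-]/u', '', $input) ?? '';
}

echo clean_text('Zażółć gęślą jaźń! {x}'), "\n"; // Zażółć gęślą jaźń x
```

Stripping characters this way is not the same as escaping them; for HTML output an escaper is still the safer choice.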

stripping altered URLs from strings with preg_replace()

As the title says, but the regex I am using has some glitches. I'm not too good with regex, as you can see.
I am trying to remove any web URLs that a user adds to a string.
However, as the user is "crafty", they try to alter the URL slightly so that it does not trigger my removal code, hence my regex below will match slightly modified URLs too (and hence I am not using a conventional URL regex). I know it will always be possible to trick my removal code, but I would like to make it as hard as possible.
The problem I am having is that if a user writes a sentence ending in a full stop but does not put a space before the next word, the regex below matches it. I would like to limit this as far as possible.
e.g all the below match:
this.matches (i dont want this to match).
mysite.co.xx (i want this to match).
http:// www.mysite.co.xx (i want this to match)
i am trying to limit the characters after the last "." to between 2 and 4 but am struggling to work out how to do this.
The code below is what i am using.
define('REG_URL', '#((https?://|https?://\s)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)#');
public function stripURLs($string){
try {
$replacement = "[** website removed **]";
$string = preg_replace(REG_URL, $replacement, $string);
return $string;
}
catch (Exception $e){
error_log('checksubmitted.class.php MLE_Check.stripURls - Exception caught: '.$e->getMessage());
return false;
}
}
If anyone could point me in the right direction for how to do what I want, I would be very grateful.
If anyone knows of any similar questions on here (I can't find any) or any other site that offers advice on removing "crafty" URLs, I would again be grateful if this could be pointed out to me.

This is my personal preference for validating urls:
_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS
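For the stripping use case in the question, where the last label should be limited to 2 to 4 letters, a looser pattern may be closer to what is wanted than a strict validator. The following is a minimal sketch, not a complete URL matcher; the constant STRIP_URL_PATTERN and the function strip_urls are hypothetical names:

```php
<?php
// Hypothetical pattern: optional scheme (tolerating one stray space after it),
// one or more dot-terminated labels, then a final label of 2-4 letters that
// must end at a word boundary, then an optional port and path.
const STRIP_URL_PATTERN = '#(?:https?://\s?)?(?:[-\w]+\.)+[a-z]{2,4}\b(?::\d+)?(?:/\S*)?#i';

function strip_urls(string $text): string
{
    return preg_replace(STRIP_URL_PATTERN, '[** website removed **]', $text);
}

echo strip_urls('this.matches is a sentence'), "\n"; // unchanged
echo strip_urls('visit mysite.co.xx now'), "\n";     // visit [** website removed **] now
echo strip_urls('http:// www.mysite.co.xx'), "\n";   // [** website removed **]
```

The word boundary after the 2 to 4 letters is what keeps "this.matches" from being caught. A badly spaced sentence such as "end.Then" would still match, because "Then" is four letters long, so the limit reduces false positives without eliminating them.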

PHP: preg_match; Not able to match the £ symbol

I've really been wracking my brains over this one, as for the life of me I can't figure out what the problem is.
I've got some data I want to run a regular expression on. For reference, the original document is encoded in iso-8859-15, if that makes any difference.
Here is a function using the regular expression;
if(preg_match("{£\d+\.\d+}", $handle)) //
{
echo 'Found a match';
}
else
{
echo 'No match found';
}
No matter what I try I can't seem to get it to match. I've tried just searching for the £ symbol. I've gone over my regular expression and there aren't any issues there. I've even pasted the source data directly into a regular expression tester and it finds a complete match for what I'm looking for. I just don't understand why my regular expression isn't working. I've looked at the raw data in my string that I'm searching for and the £ symbol is there as clear as day.
I get the feeling that there's some encoded character there that I just can't see, but no matter how I output the data all I can see is the £ symbol, but for whatever reason it's not being recognised.
Any ideas? Is there an absolute method to viewing raw data in a string? I've tried var_dump and var_export, but I do get the feeling that something isn't quite right, as var_export does display the data in a different language. How can I see what's "really" there in my variable?
I've even saved the content to a txt file. The £ is there. There should be no reason why I shouldn't be able to find it with my regular expression. I just don't get it. If I create a string and paste in the exact bit of text my regular expression should pick up, it finds the match without any problems.
Truly baffling.

You could always transform the letter:
$string = '£100.00';
if(preg_match("/\xa3/",$string)){
echo 'match found';
}else{
echo 'no matches';
}

You can include any character in your regular expression if you know the hexadecimal value. I think the value is 0A3H, so try this:
\xa3 // Updated with the correct hex value
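To see what is "really" in the variable, dumping the bytes as hex tends to be more reliable than var_dump. The sketch below assumes the data is ISO-8859-15 as stated in the question and that the mbstring extension is available; latin9_to_utf8 is a hypothetical helper name:

```php
<?php
// Hypothetical helper: convert ISO-8859-15 bytes to UTF-8 (needs mbstring).
function latin9_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-15');
}

$raw = "\xa3" . '100.00';   // "£100.00" as ISO-8859-15 bytes
echo bin2hex($raw), "\n";   // a33130302e3030

$utf8 = latin9_to_utf8($raw);
echo bin2hex($utf8), "\n";  // c2a33130302e3030

// \x{a3} is the code point U+00A3; /u makes PCRE read the subject as UTF-8.
var_dump(preg_match('/\x{a3}\d+\.\d+/u', $utf8)); // int(1)
```

A £ typed into a UTF-8 source file is the two bytes c2 a3, which never appear in ISO-8859-15 data. That mismatch would explain why the pasted test string matches while the original document does not.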

Regex to conditionally replace Twitter hashtags with hyperlinks

I'm writing a small PHP script to grab the latest half dozen Twitter status updates from a user feed and format them for display on a webpage. As part of this I need a regex replace to rewrite hashtags as hyperlinks to search.twitter.com. Initially I tried to use:
<?php
$strTweet = preg_replace('/(^|\s)#(\w+)/', '\1#\2', $strTweet);
?>
(taken from https://gist.github.com/445729)
In the course of testing I discovered that #test is converted into a link on the Twitter website, however #123 is not. After a bit of checking on the internet and playing around with various tags I came to the conclusion that a hashtag must contain alphabetic characters or an underscore in it somewhere to constitute a link; tags with only numeric characters are ignored (presumably to stop things like "Good presentation Bob, slide #3 was my favourite!" from being linked). This makes the above code incorrect, as it will happily convert #123 into a link.
I've not done much regex in a while, so in my rustyness I came up with the following PHP solution:
<?php
$test = 'This is a test tweet to see if #123 and #4 are not encoded but #test, #l33t and #8oo8s are.';
// Get all hashtags out into an array
if (preg_match_all('/(^|\s)(#\w+)/', $test, $arrHashtags) > 0) {
foreach ($arrHashtags[2] as $strHashtag) {
// Check each tag to see if there are letters or an underscore in there somewhere
if (preg_match('/#\d*[a-z_]+/i', $strHashtag)) {
$test = str_replace($strHashtag, ''.$strHashtag.'', $test);
}
}
}
echo $test;
?>
It works; but it seems fairly long-winded for what it does. My question is, is there a single preg_replace similar to the one I got from gist.github that will conditionally rewrite hashtags into hyperlinks ONLY if they DO NOT contain just numbers?

(^|\s)#(\w*[a-zA-Z_]+\w*)
PHP
$strTweet = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '\1#\2', $strTweet);
This regular expression says a # followed by 0 or more characters [a-zA-Z0-9_], followed by an alphabetic character or an underscore (1 or more), followed by 0 or more word characters.
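As a rough, self-contained illustration of that pattern producing links, here is a sketch. The function name link_hashtags and the search URL are placeholders chosen for the example, not the URL format from the original gist:

```php
<?php
// Hypothetical helper: link hashtags that contain at least one letter or
// underscore. The example.com URL is a placeholder, not a real search endpoint.
function link_hashtags(string $tweet): string
{
    return preg_replace(
        '/(^|\s)#(\w*[a-zA-Z_]\w*)/',
        '$1<a href="https://example.com/search?q=%23$2">#$2</a>',
        $tweet
    );
}

echo link_hashtags('slide #3 and #l33t'), "\n";
// slide #3 and <a href="https://example.com/search?q=%23l33t">#l33t</a>
```

Purely numeric tags such as #3 are left alone because the class [a-zA-Z_] has to match somewhere inside the tag.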
http://rubular.com/r/opNX6qC4sG <- test it here.

It's actually better to search for characters that aren't allowed in a hashtag, otherwise tags like "#Trentemøller" won't work.
The following works well for me...
preg_match('/([ ,.]+)/', $string, $matches);

I have devised this: /(^|\s)#([[:alnum:]])+/gi

I found Gazler's answer to work, although the regex added a blank space at the beginning of the hashtag, so I removed the first part:
(^|\s)
This works perfectly for me now:
#(\w*[a-zA-Z_0-9]+\w*)
Example here: http://rubular.com/r/dS2QYZP45n

preg_match function doesn't work correctly in certain PHP script

I am using preg_match function to filter unwanted characters from a textarea form in 2 PHP scripts I made, but in one of them seems not to work.
Here's the script with the problem:
<?php
//Database connection, etc......
mysql_select_db("etc", $con);
$errmsg = '';
$chido = $_POST['chido'];
$gacho = $_POST['gacho'];
$maestroid = $_POST['maestroid'];
$comentario = $_POST['comment'];
$voto = $_POST['voto'];
if($_POST['enviado']==1) {
if (preg_match ('/[^a-zA-Z áéíóúüñÁÉÍÓÚÜÑ]/i', $comentario))
$errmsg = 1;
if($errmsg == '') {
//here's some queries, etc
}
}
if($errmsg == 1)
echo "ERROR: You inserted invalid characters...";
?>
So as you can see, the preg_match just filters unwanted characters like !"#$%&/() etc.
But every time I type a special character like 'ñ' or 'á' it triggers the error code.
I have this very similar script that works perfectly with the same preg_match and filters just the unwanted characters:
//Database connection, etc..
mysql_select_db("etc", $con);
$errmsg = '';
if ($_POST['enviado']==1) {
$nombre = $_POST['nombre'];
$apodo = $_POST['apodo'];
$mat1 = $_POST['mat1'];
$mat2 = $_POST['mat2'];
$mat3 = $_POST['mat3'];
if (preg_match ('/[^a-zA-Z áéíóúüñÁÉÍÓÚÜÑ]/i', $nombre))
$errmsg = 1;
if($errmsg == '') {
//more queries after validation
}
}
if($errmsg == 1)
echo "ERROR: etc......."
?>
So the question is, what am I doing wrong in the first script??
I tried everything but always fails and shows the error.
Any suggestion?

try adding a u at the end along with your i to use unicode
/[^a-zA-Z áéíóúüñÁÉÍÓÚÜÑ]/iu

Hi, before I was using this match expression:
/^[a-z\d_]+$/i
because I was accepting letters from a to z, digits from 0 to 9 and the underscore '_', with the plus sign '+' to repeat through the whole string and '/i' for a case-insensitive match. But I needed to accept the letter 'ñ'.
So, what I tried and what worked for me was this regex:
/^[a-z\d_\w]+$/iu
I added '\w' to accept any word character and also added a 'u' after '/i' so that the pattern and subject are treated as UTF-8.

This might help: http://www.phpwact.org/php/i18n/charsets

I added this to the form.
<form accept-charset="utf-8">.
Now seems to work.

Why are you specifying /i yet enumerating all the upper‐ and lower‐case letters separately?
ALSO: This won’t work at all if you don’t normalize your input. Consider how ñ can be either the single character U+00F1 or the character U+006E followed by U+0303!
Unicode Normalization Form D will guarantee that both U+00F1 and U+006E,U+0303 turn into the canonically decomposed form U+006E,U+0303.
Unicode Normalization Form C will guarantee that both U+00F1 and U+006E,U+0303 turn into the composed form U+00F1, because it uses canonical decomposition followed by canonical composition.
Based on your pattern, it looks like you want the NFC form.
From PHP, you’ll need to use the Normalizer class (part of the intl extension) on these to get it working reliably.
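A minimal sketch of that normalization step, assuming the intl extension is loaded (the Normalizer class does not exist without it); to_nfc is a hypothetical helper name:

```php
<?php
// Hypothetical helper: bring text into NFC before validating it.
// Requires the intl extension, which provides the Normalizer class.
function to_nfc(string $text): string
{
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    return $normalized === false ? $text : $normalized;
}

if (class_exists('Normalizer')) {
    $decomposed = "an\u{0303}o";      // "año" written as n + combining tilde
    $pattern = '/^[a-z\x{f1}]+$/iu';  // \x{f1} is the precomposed ñ

    var_dump(preg_match($pattern, $decomposed));         // int(0)
    var_dump(preg_match($pattern, to_nfc($decomposed))); // int(1)
}
```

The decomposed input fails only because the combining tilde U+0303 is not in the character class; after NFC the same text is a single U+00F1 and passes.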

I don't know if this can help, but I had exactly the same problem with these kinds of special characters, and it drove me crazy for many days. In the end I understood that the problem was an htmlentities() call sanitizing the string before it was passed to preg_match(); moving the htmlentities() call after preg_match() made it work.
