How to use preg_match properly to accept characters including new lines? - php

I've tried to use a series of questions to construct a preg_match if statement to check a string and make sure it includes characters that are accepted to pass through the system.
I've got the following if statement;
if ( !preg_match("~[A-Za-z0-9-_=+,.:;/\!?%^&*()##\"\'£\$€ ]~", $data['text']) ) {}
I'm using ~ as a separator in the string and want to ensure that the above characters are accepted in whatever string is passed through.
I've had to escape " and ' quotes and the $ sign to ensure it doesn't break the statement.
It seems to work however the following doesn't work.
Hello, this is a single line. Don't you agree?
This is also another line, see?
After some trial and error, it seemed the comma was also causing the string check to fail but it's in the preg_match rule too.
How can I accept these characters A-Za-z0-9-_=+,.:;/\!?%^&*()##\"\'£\$€ as well as multiple lines (like blank lines, spaces, etc.)?
EDIT
Just an update as to what I enter in the textarea and what data is actually returned.
I entered the following in the textarea;
Testing 123
Testing 123
The following was returned using print_r;
Testing 123\r\n\r\nTesting 1231

I had another look at your question and there are actually two issues:
The regex has no way to accept every kind of whitespace. You should update it to ~[A-Za-z0-9-_=+,.:;/\!?%^&*()##\"\'£\$€\s]+~. Here I replaced your space character at the end with \s so that it supports any kind of whitespace (i.e. spaces, tabs, newlines, etc.).
You forgot to unescape the input string from $data['text'], which you can do using stripcslashes.
After fixing those you still need to validate your input. To do so, you can use preg_replace to strip every accepted character, which leaves a new string containing only the invalid characters, if there are any. From there you only need to check whether that string is empty; if so, the input is valid.
Here is what I used to test this:
<?php
$data = 'Testing 123\r\n\r\nTesting 1231';
$unescaped_data = stripcslashes($data);
$impureData = preg_replace("~[A-Za-z0-9-_=+,.:;/\!?%^&*()##\"\'£\$€\s]+~",
'', $unescaped_data);
if (0 == strlen($impureData)) {
echo 'TRUE';
}
else {
echo 'FALSE';
}
print_r("
====
$data
====
$impureData
====
");
And I get this result:
TRUE
====
Testing 123\r\n\r\nTesting 1231
====
====
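As an alternative sketch to the preg_replace check above: an unanchored character class only tells you that at least one accepted character exists somewhere in the string, whereas anchoring the class with ^ and $ makes preg_match test the whole string. The function name isAcceptedText is hypothetical, and the character set is an assumption adapted from the question (with a single # and no escaping backslashes); the u modifier assumes the input is UTF-8 so that £ and € are matched as whole characters.

```php
<?php
// Hypothetical helper: returns true only if EVERY character is in the accepted set.
function isAcceptedText(string $text): bool
{
    // \s covers spaces, tabs, \r and \n, so multi-line input passes.
    // The u modifier treats the pattern and the subject as UTF-8.
    $pattern = '~^[A-Za-z0-9\-_=+,.:;/!?%^&*()#"\'£$€\s]+$~u';

    // preg_match returns 1 on a match, 0 on no match and false on error.
    return preg_match($pattern, $text) === 1;
}

var_dump(isAcceptedText("Hello, this is a single line. Don't you agree?\r\n\r\nThis is also another line, see?"));
var_dump(isAcceptedText("Testing <b>123</b>"));
```

The first call should report true and the second false, because < and > are not in the accepted set. An empty string is rejected as well, since + requires at least one character.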

Related

check particular string is present or not without considering spaces or any newline characters

I want to check a particular string is present in a content without considering space differences or newline characters.
Case 1
input string: this is a sample test
checking word: asample test
Case 2
input string: sample test1 \n new line content
checking word : test1 new
You can use the following logic: remove all whitespace characters from both the pattern string and the string to analyze, then check whether the string contains the pattern.
PROTOTYPE:
// str_replace() treats '\s' as two literal characters, so preg_replace() strips the whitespace here
$input1 = 'this is a sample test';
$inputFiltered1 = preg_replace('/\s+/', '', $input1);
$pattern1 = 'asample test';
$patternFiltered1 = preg_replace('/\s+/', '', $pattern1);
$input2 = "sample test1 \n new line content"; // double quotes so that \n is a real newline
$inputFiltered2 = preg_replace('/\s+/', '', $input2);
$pattern2 = 'test1 new';
$patternFiltered2 = preg_replace('/\s+/', '', $pattern2);
if (strpos($inputFiltered1, $patternFiltered1) !== false) {
    echo 'true';
}
if (strpos($inputFiltered2, $patternFiltered2) !== false) {
    echo 'true';
}
Method #1: Pure regex, $input is unmodified
$pattern='~'.preg_replace('~\S(?!$)\K\s*~','\s*',$check).'~';
if(preg_match($pattern,$input)){
echo "Found: $check";
}else{
echo "Did Not Find: $check";
}
Method #2: Regex modification of $input and $check
if (strpos(preg_replace('~\s+~', '', $input), preg_replace('~\s+~', '', $check))!==false){
echo "Found: $check";
}else{
echo "Did Not Find: $check";
}
Method #3: Regex-free modification of $input and $check
$whitechars=[' ',"\t","\r","\n"]; // hardcode whitespace characters
if (strpos(str_replace($whitechars, '', $input), str_replace($whitechars, '', $check))!==false){
echo "Found: $check";
}else{
echo "Did Not Find: $check";
}
Now, you might be asking: "Which one should I choose for my project?"
Answering that properly will first depend on the size and content of your $input data, then the number of times you'll be running this snippet, then personal preference, and probably least importantly the size and content of $check.
If your $input data is relatively large, then you will want to avoid performing any modifications on the string because of the impact on speed. Assuming your $check string is usually going to be quite small, modifying this value alone is going to produce minimal "drag" on execution time. As a general answer, I would recommend Method #1; although it is using two preg_ calls, they are processing a very small string. I should explain that Method #1 only prepares the $check value by placing \s* between all visible characters. If there are any whitespace characters in the string, they are removed during pattern preparation (assuming there are no leading or trailing whitespace characters in $check -- otherwise, call trim() or refine pattern preparation). (Method #1 Pattern Demo)
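One thing Method #1 does not do is escape regex metacharacters in $check, so a . or ? in the needle would be treated as pattern syntax. A minimal sketch of a variant that quotes each character is shown here; buildLoosePattern is a hypothetical name, and the code assumes $check is valid UTF-8 and not empty.

```php
<?php
// Hypothetical helper: builds a pattern that matches $check while ignoring whitespace differences.
function buildLoosePattern(string $check): string
{
    // Remove all whitespace from the needle, then split it into single characters.
    $stripped = preg_replace('~\s+~u', '', $check);
    $chars = preg_split('//u', $stripped, -1, PREG_SPLIT_NO_EMPTY);

    // Quote each character so that regex metacharacters stay literal.
    $quoted = array_map(
        static function (string $char): string {
            return preg_quote($char, '~');
        },
        $chars
    );

    // Allow any amount of whitespace between consecutive characters.
    return '~' . implode('\s*', $quoted) . '~u';
}

$input = "sample test1 \n new line content";
$check = 'test1 new';

echo preg_match(buildLoosePattern($check), $input) ? "Found: $check" : "Did Not Find: $check";
```

As with Method #1, only the small $check string is processed; $input is left untouched.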
If you want to prepare both $input and $check by removing all occurrences of whitespace characters, then the most direct approach would be to call preg_replace() on one or more \s characters and replace with an empty string (Method #2). This will tell the regex engine to isolate substrings of whitespace(s) and remove them via a "single scan" of the given string. By comparison, if you want to avoid regex functions, you can use str_replace() to perform the same task. However, using Method #3 means that you will need to individually "inform" the function of all characters to be removed. str_replace() doesn't know what \s means. Unfortunately, by listing the whitespace character in an array, the function will be performing n "iterated scans" of the string AND it can only replace one character at a time.
To crystallize, if you have a string that contains a[space][newline][newline][tab][space]b, then preg_replace() will scan the string once and make a single replacement. If you call str_replace() on the same string, it will make four scans (remember: [' ',"\t","\r","\n"]) and make five separate replacements.
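The replacement counts described above can be checked with the optional count arguments that both functions accept. This is only a small sketch; the variable names are made up for the illustration.

```php
<?php
// a[space][newline][newline][tab][space]b
$string = "a \n\n\t b";
$whitechars = [' ', "\t", "\r", "\n"];

// The fifth argument of preg_replace receives the number of replacements made.
$viaRegex = preg_replace('~\s+~', '', $string, -1, $regexCount);

// The fourth argument of str_replace receives the number of replacements made.
$viaStrReplace = str_replace($whitechars, '', $string, $strCount);

echo "preg_replace: $viaRegex ($regexCount replacement)\n";   // 1 replacement
echo "str_replace:  $viaStrReplace ($strCount replacements)\n"; // 5 replacements
```

Both calls produce ab; only the amount of work differs.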
I suppose the message that I am driving home is: Regular Expressions sometimes get a bad rap, but this seems like a compelling case to select a regex method over a non-regex method (unless you benchmark your actual project and find that the regex methods are in fact negatively impacting performance).

php regex remove inline comment only

I have simple code that looks like this
function session(){
return 1; // this default value for session
}
I need a regex or code to remove the comment // this default value for session, and only this type of comment: one that starts with one or more spaces, then //, and ends with a newline after it.
All other types of comment and cases are ignored.
UPDATED (1)
And only remove this type of comment, which starts by a space or two or more, then //, then a newline after it
Try this one:
/\s+\/\/[^\n]+/m
\s+ starts by a space or two or more
\/\/ the escaped //
[^\n]+ anything except a new line
UPDATE: to make sure (more or less) that this is only applied to code lines, we can use a lookbehind (2) to check that there is a semicolon ; before the space(s) and the comment slashes //, so the regex becomes:
/(?<=;)\s+\/\/[^\n]+/m
where (?<=;) is the lookbehind which basically tells the engine to look behind and check if there's a ; before it then match.
-----------------------------------------------------------------------
(1) The preg_replace works globally, no need for the g flag
(2) Lookbehind was not supported in JavaScript before ES2018
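As a usage sketch, the lookbehind pattern can be passed to preg_replace with an empty replacement. The $source string here is just the snippet from the question; the variable names are illustrative.

```php
<?php
$source = "function session(){\n    return 1; // this default value for session\n}";

// Remove "whitespace + // + rest of line", but only directly after a semicolon.
$clean = preg_replace('/(?<=;)\s+\/\/[^\n]+/m', '', $source);

echo $clean;
```

The comment and the space before it are removed while the newline stays, so the middle line ends at return 1;.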
A purely regex solution would look something like this:
$result = preg_replace('#^(.*?)\s+//.*$#m', '\1', $source);
but that would still be wrong because you could get trapped by something like this:
$str = "This is a string // that has a comment inside";
A more robust solution would be to completely rewrite the php code using token_get_all() to actually parse the PHP code into tokens that you can then selectively remove when you re-emit the code:
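// $source must begin with <?php, otherwise it is not tokenized as PHP code
foreach (token_get_all($source) as $token)
{
    if (is_array($token))
    {
        // Emit every token except comments that start with //
        if ($token[0] != T_COMMENT || substr($token[1], 0, 2) != '//')
            echo $token[1];
    }
    else
        echo $token; // single-character tokens such as ; or { are plain strings
}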

Building a regex expression for PHP

I am stuck trying to create a regex that will allow for letters, numbers, and the following chars: _ - ! ? . ,
Here is what I have so far:
/^[-\'a-zA-Z0-9_!?,.\s]+$/ //not escaping the ?
and this version too:
/^[-\'a-zA-Z0-9_!\?,.\s]+$/ //attempting to escape the ?
Neither of these seem to be able to match the following:
"Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?"
Can somebody point out what I am doing wrong? I must point out that my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space.
Thanks!
UPDATE:
Thanks to Lix's advice, this is what I have so far:
/^[-\'a-zA-Z0-9_!\?,\.\s]+$/
However, it's still not working??
UPDATE2
Ok, based on input this is what's happening.
User inputs string, then I run the string through following functions:
$comment = preg_replace('/\s+/', '',
htmlspecialchars(strip_tags(trim($user_comment_orig))));
So in the end, user input is just a long string of chars without any spaces. Then that string of chars is run using:
preg_match("#^[-_!?.,a-zA-Z0-9]+$#",$comment)
What could possibly be causing trouble here?
FINAL UPDATE:
Ended up using this regex:
"#[-'A-Z0-9_?!,.]+#i"
Thanks all! lol, ya'll are going to kill me once you find out where my mistake was!
Ok, so I had this piece of code:
if(!preg_match($pattern,$comment) || strlen($comment) < 2 || strlen($comment) > 60){
GEEZ!!! I never bothered to look at the strlen part of the code. Of course it was going to fail every time...I only allowed 60 chars!!!!
When in doubt, it's always safe to escape non alphanumeric characters in a class for matching, so the following is fine:
/^[\-\'a-zA-Z0-9\_\!\?\,\.\s]+$/
When run through a regular expression tester, this finds a match with your target just fine, so I would suggest you may have a problem elsewhere if that doesn't take care of everything.
I assume you're not including the quotes you used around the target when actually trying for a match? Since you didn't build double quote matching in...
Can somebody point out what I am doing wrong? I must point out that my script takes the user input (the paragraph in quotes in this case) and strips all white space so actual input has no white space.
in which case you don't need the \s if it's working correctly.
I got the following code to work as expected (running PHP 5):
<?php
$pattern = "#[-'A-Z0-9_?!,.\s]+#i";
$string = "Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?";
$results = array();
preg_match($pattern, $string, $results);
echo '<pre>';
print_r($results);
echo '</pre>';
?>
The output from print_r($results) was as follows:
Array
(
[0] => Oh why, oh why is this regex not working! It's getting pretty frustrating? Frustrating - that is to say the least. Hey look, an underscore_ I wonder if it will match this time around?
)
Tested on http://writecodeonline.com/php/.
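It's not necessary to escape most characters inside []. Note that PCRE (which PHP's preg_ functions use) does interpret \s inside a character class, so /^[-\'a-zA-Z0-9_!?,.\s]+$/ works as written. If you would rather be explicit about which whitespace is allowed, you have two options: either manually expand (/^[-\'a-zA-Z0-9_!?,. \t\n\r]+$/) or use alternation (/^(?:[-\'a-zA-Z0-9_!?,.]|\s)+$/).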
Note that I left the \ before the ' because I'm assuming you're putting this in a PHP string and I wouldn't want to suggest a syntax error.
The only characters with a special meaning within a character class are:
the dash (since it can be used as a delimiter for ranges), except if it is used at the beginning (since in this case it is no part of any range),
the caret, but only as the first character (where it negates the class),
the closing bracket,
the backslash.
In "pure regex parlance", your character class can be written as:
[-_!?.,a-zA-Z0-9\s]
Now, you need to escape whatever needs to be escaped according to your language and how strings are written. Given that this is PHP, you can take the above sample as is. Note that \s is interpreted in character classes as well, so this will match anything which is matched by \s outside of a character class.
While some manuals recommend using escapes for safety, knowing the general regex rules for character classes and applying them leads to shorter and easier to read results ;)
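A quick way to confirm that \s is honoured inside a character class is to run the class above against input that contains newlines and tabs. This is only a sketch; the test strings are invented for the illustration.

```php
<?php
// The character class from above, anchored so that the whole string must match.
$pattern = '/^[-_!?.,a-zA-Z0-9\s]+$/';

// Contains spaces, a newline and a tab: all covered by \s inside the class.
$ok = preg_match($pattern, "Hey look, an underscore_\n\tnew line?");

// Contains < and >, which are not in the class.
$bad = preg_match($pattern, "no <tags> allowed");

var_dump($ok, $bad); // int(1) and int(0)
```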

regex with special characters?

I am looking for a regex that can contain special characters like / \ . ' "
in short i would like a regex that can match the following:
may contain lowercase
may contain uppercase
may contain a number
may contain space
may contain / \ . ' "
I am making a PHP script to check whether a certain string has the above or not, like a validation check.
The regular expression you are looking for is
^[a-z A-Z0-9\/\\.'"]+$
Remember if you are using PHP you need to use \ to escape the backslashes and the quotation mark you use to encapsulate the string.
In PHP using preg_match it should look like this:
preg_match("/^[a-z A-Z0-9\\/\\\\.'\"]+$/",$value);
This is a good place to find the regular expressions you might want to use.
http://regexpal.com/
You can always escape them by putting a \ in front of the special characters.
try this:
preg_match("/[A-Za-z0-9 \/\\\\.'\"]/", ...)
NikoRoberts is 100% correct.
I would only add the following suggestion: When creating a PHP regex pattern string, always use: single-quotes. There are far fewer chars which need to be escaped (i.e. only the single quote and the backslash itself needs to be escaped (and the backslash only needs to be escaped if it appears at the end of the string)).
When dealing with backslash soup, it helps to print out the (interpreted) regex string. This shows you exactly what is being presented to the regex engine.
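As a small illustration of that tip, here is the same pattern written once as a double-quoted and once as a single-quoted PHP string. Printing both shows that the regex engine receives identical text. The variable names are made up for the sketch.

```php
<?php
// Two spellings of the same pattern.
$double = "/^[a-z A-Z0-9\\/\\\\.'\"]+$/";
$single = '/^[a-z A-Z0-9\/\\\\.\'"]+$/';

// Printing shows exactly what is being presented to the regex engine.
echo $double, "\n";
echo $single, "\n";

// Both lines read: /^[a-z A-Z0-9\/\\.'"]+$/
var_dump($double === $single);
```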
Also, a "number" might have an optional sign? Yes? Here is my solution (in the form of a tested script):
<?php // test.php 20110311_1400
$data_good = 'abcdefghijklmnopqrstuvwxyzABCDE'.
'FGHIJKLMNOPQRSTUVWXYZ0123456789+- /\\.\'"';
$data_bad = 'abcABC012~!###$%^&*()';
$re = '%^[a-zA-Z0-9+\- /\\\\.\'"]*$%';
echo($re ."\n");
if (preg_match($re, $data_good)) {
echo("CORRECT: Good data matches.\n");
} else {
echo("ERROR! Good data does NOT match.\n");
}
if (preg_match($re, $data_bad)) {
echo("ERROR! Bad data matches.\n");
} else {
echo("CORRECT: Bad data does NOT match.\n");
}
?>
The following regex will match a single character that fits the description you gave:
[a-zA-Z0-9\ \\\/\.\'\"]
If your point is to insure that ONLY characters in this range of characters are used in your string, then you can use the negation of this which would be:
[^a-zA-Z0-9\ \\\/\.\'\"]
In the second case, you could use your regex to find the bad stuff (that you don't want to be included), and if it didn't find anything then your string pattern must be kosher, because I'm assuming that if you find one character that is not in the proper range, then your string is not valid.
so to put it in PHP syntax:
// PCRE patterns in PHP need delimiters; ~ is used here
$regex = "~[^a-zA-Z0-9\ \\\/\.\'\"]~";
if (preg_match($regex, $value)) {
    // handle the bad stuff
}
Edit 1:
I completely ignored the fact that backslashes are special in PHP double-quoted strings, so here is a correction to the above code:
$regex = "~[^a-zA-Z0-9\\ \\\\\\/\\.\\'\\\"]~";
If that doesn't work, it shouldn't take too much for someone to debug how many of the backslashes need to be escaped with a backslash, and what other characters also need to be escaped....

Remove all characters starting from last occurrence of specific sequence of characters

I am parsing out some emails. Mobile Mail, iPhone and I assume iPod touch append a signature as a separate boundary, making it simple to remove. Not all mail clients do, and just use '--' as a signature delimiter.
I need to chop off the '--' from a string, but only the last occurrence of it.
Sample copy
hello, this is some email copy-- check this out
--
Tom Foolery
I thought about splitting on '--' and removing the last part, but neither explode() nor split() seems to return a useful value for letting me know whether it did anything when there is no match.
I can not get preg_replace() to go across more than one line. I have standardized all line endings to \n.
What is the best suggestion to end up with hello, this is some email copy-- check this out, noting that there will be cases where there is no signature, and there are of course going to be cases that I can not cover?
Actually, the correct signature delimiter is "-- \n" (note the space before the newline), thus the delimiter regexp should be '^-- $'. You might consider using '^--\s*$' instead, so it will also work with OE, which gets it wrong.
Try this:
preg_replace('/--[\r\n]+.*/s', '', $body)
This will remove everything from the first occurrence of -- followed by one or more line break characters. If you just want to remove the last occurrence, use /(.*)--[\r\n]+.*/s with '$1' as the replacement instead, so that the greedy (.*) keeps everything before the final delimiter.
Instead of just chopping off everything after --, could you not cache the last few emails sent by that user or service and compare? The bit at the bottom that looks like the others can be safely removed, leaving the proper message intact.
I think in the interest of being more bulletproof, I will take the non regex route
echo substr($body, 0, strrpos($body, "\n--"));
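One caveat with the one-liner above: strrpos() returns false when "\n--" is not found, and substr() then treats that as a length of 0, so a message without a signature would come back empty. A minimal sketch with a guard follows; stripSignature is a hypothetical name.

```php
<?php
// Hypothetical helper: cuts the body at the last "\n--", or returns it unchanged.
function stripSignature(string $body): string
{
    $pos = strrpos($body, "\n--");

    // No delimiter found: leave the message as it is.
    if ($pos === false) {
        return $body;
    }

    return substr($body, 0, $pos);
}

echo stripSignature("hello, this is some email copy-- check this out\n--\nTom Foolery");
```

The -- inside the first line is not preceded by a newline, so only the trailing signature block is removed.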
This seems to give me the best result:
$body = preg_replace('/\s*(.+)\s*[\r\n]--\s+.*/s', '$1', $body);
It will match and trim the last "(newline)--(optional whitespace/newlines)(signature)"
Trim all remaining newlines before the signature
Trim beginning/ending whitespace from the body (remaining newlines before the signature, whitespace at the start of the body, etc)
Will only work if there's some text (non-whitespace) before the signature (otherwise it won't strip the signature and return it intact)
To cleanly remove all of the signature and its leading newline characters, perform greedy matching up to the last occurring --. Before matching the last -- followed by zero or more spaces then a system-agnostic newline character, restart the fullstring match using \K, then match all of the remaining string to be replaced.
Code:
$string = <<<BODY
hello, this is some email copy-- check this out
--
Tom Foolery
BODY;
var_export(preg_replace('~.*\K\R-- *\R.*~s', '', $string));
Output:
'hello, this is some email copy-- check this out'
