Remove all but valid characters - php

Valid characters include the alphabet (abcd..), numbers (0123456789), spaces, ' and ".
I need to strip any other characters than these from a string in PHP.
Thanks :)

You can do this:
$str = preg_replace('/[^a-z0-9 "\']/', '', $str);
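Here the character class [^a-z0-9 "'] matches any character except the listed ones (note the inverting ^ at the beginning of the character class); each match is then replaced by an empty string. As written, the class lists only lowercase letters, so uppercase letters are removed as well; add the i modifier if they should be kept.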
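A minimal sketch of that whitelist in use. The helper name keepValidChars and the sample string are made up for illustration, and the i modifier is an assumption that uppercase letters count as valid too:

```php
<?php
// Hypothetical helper: keep only letters, digits, spaces, ' and ".
function keepValidChars(string $str): string
{
    // The i modifier (an assumption here) lets a-z match uppercase letters too.
    return preg_replace('/[^a-z0-9 "\']/i', '', $str);
}

// The comma, exclamation mark and hash are stripped; quotes and spaces stay.
$clean = keepValidChars("Hello, World! It's \"fine\" #42");
```

Without the i modifier the same call would also drop the H, W and I.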

Gumbo's answer is correct for your given specification. But if your "specification" is only "symbolic", what you eventually need might be like the following:
$str = preg_replace('{ [^ \w \s \' " ] }x', '', $str );
[^ ]: negated character class (all except these inside)
\w: word characters (letters, digits and the underscore)
\s: white space
\': '
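To see where the two variants differ, here is a small sketch (the sample string is invented): the explicit whitelist drops underscores and tabs, while the \w and \s version keeps them.

```php
<?php
$input = "snake_case\tname's \"ok\"!";

// Explicit whitelist: the underscore, the tab and the ! are removed.
$strict = preg_replace('/[^a-z0-9 "\']/i', '', $input);

// \w and \s variant: only the ! is removed, because \w includes _
// and \s includes the tab.
$loose = preg_replace('{ [^ \w \s \' " ] }x', '', $input);
```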

Related

Regular expression to remove trailing chars

I'm looking for a regular expression in Php that could transform incoming strings like this:
abaisser_negation_pronominal_question => abaisser_n_p_q
abaisser_pronominal_question => abaisser_p_q
abaisser_negation_question => abaisser_n_q
abaisser_negation_pronominal => abaisser_n_p
abaisser_negation_voix_passive_pronominal => abaisser_n_v_p_p
abaisser => abaisser
With the Php code close to something like:
$line=preg_replace("/<h3>/im", "", $line);
How would you do?
You can use:
$input = preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $input);
Explanation:
This regex searches for (_[A-Za-z])[^_\n]*, which means an underscore followed by a single letter, then any characters up to the next underscore or newline.
It captures the first part, (_[A-Za-z]), in group 1.
The replacement is $1, which leaves the underscore and the first letter in the output string.
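As a quick check, the same pattern can be run over a few of the sample inputs from the question. This is only a sketch; the variable names are arbitrary, and it relies on preg_replace accepting an array of subjects:

```php
<?php
$samples = [
    'abaisser_negation_pronominal_question',
    'abaisser_negation_voix_passive_pronominal',
    'abaisser',
];

// With an array subject, preg_replace returns an array of results.
$short = preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $samples);
```

The last entry has no underscore, so nothing matches and it comes back unchanged.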
You could use \K or positive lookbehind.
$input = preg_replace('~_.\K[^_\n]*~', '', $input);
The pattern _. matches an underscore and the character that follows it. \K then drops what has been matched so far (the _ plus the following character) from the reported match, so those two characters are not replaced. Next, [^_\n]* matches zero or more characters that are neither _ nor a newline, that is, everything after the first letter up to the next _ or \n. Removing those characters gives the desired output.
$input = preg_replace('~(?<=_.)[^_\n]*~', '', $input);
The lookbehind only asserts that the current position is preceded by an _ and one more character; the pattern then matches all characters up to the next underscore or newline.
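A short sketch comparing the two variants on a two-line input (the input string is taken from the question's examples; the variable names are arbitrary). Both should produce the same result:

```php
<?php
$input = "abaisser_negation_pronominal_question\nabaisser_negation_question";

// \K variant: match "_x", forget it, then remove the rest of the word.
$withK = preg_replace('~_.\K[^_\n]*~', '', $input);

// Lookbehind variant: remove the rest of the word wherever "_x" precedes it.
$withLookbehind = preg_replace('~(?<=_.)[^_\n]*~', '', $input);
```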
You can use regex
$input = preg_replace('/_(.)[^\n_]+/', '_$1', $input);
It captures the character after the _, matches up to the next \n or _, and replaces the whole match with _$1, that is, _ plus the captured character.
$line = preg_replace("/_([a-z])([a-z]*)/i", "_$1", $line);

How to remove more than one whitespace

Hello guys I currently have a problem with my preg_replace :
preg_replace('#[^a-zA-z\s]#', '', $string)
It keeps all alphabetic letters and white spaces but I want more than one white space to be reduced to only one. Any idea how this can be done ?
$output = preg_replace('!\s+!', ' ', $input);
From Regular Expression Basic Syntax Reference
\d, \w and \s
Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.
The character type \s stands for five different characters: horizontal tab (9), line feed (10), form feed (12), carriage return (13) and ordinary space (32). The following code will find every substring of $string which is composed entirely of \s. Only the first \s in the substring will be preserved. For example, if line feed, horizontal tab and ordinary space occur immediately after one another in a substring, line feed alone will remain after the replacement is done.
$string = preg_replace('#(\s)\s+#', '\1', $string);
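A small sketch of that behaviour (the sample string is invented): the first whitespace character of each run is the one that survives, and a lone whitespace character is left alone because the pattern needs at least two.

```php
<?php
// "one", then line feed + tab + space, then "two", three spaces, "three".
$string = "one\n\t two   three";

// Each run of two or more whitespace characters collapses to its first one.
$collapsed = preg_replace('#(\s)\s+#', '\1', $string);
```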
preg_replace(array('#\s+#', '#[^a-zA-Z\s]#'), array(' ', ''), $string);
Note that this replaces every run of whitespace with a single space (the range is written A-Z here; A-z would also let [ \ ] ^ _ and ` through). If you want to collapse consecutive whitespace while keeping its kind (for example two newlines into one newline), you will have to work out the logic for that yourself, because \s+ will match "\n \n \n" (5 whitespace characters in a row).
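One possible variation, sketched under the assumption that only ASCII letters and single spaces should remain: strip the unwanted characters first and collapse whitespace second, so that a removed character sitting between two spaces cannot leave a double space behind. The sample string is invented.

```php
<?php
$string = "Hello,   world - it's\n\n[ok]";

// Patterns in an array are applied in order: strip first, collapse second.
$patterns     = ['#[^a-zA-Z\s]#', '#\s+#'];
$replacements = ['', ' '];

$result = preg_replace($patterns, $replacements, $string);
```

The brackets around "ok" are removed because the range is A-Z; with A-z they would have been kept.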
Try using trim instead (it strips whitespace from the beginning and end of the string only):
<?php
$something = " Error";
echo $something."\n";
echo "------"."\n";
echo trim($something);
?>
output
Error
------
Error
The question is old and is missing some details. Let's assume the OP wanted to reduce each run of consecutive horizontal whitespace to a single space.
Example:
"\t\t \t \t" => " "
"\t\t \t\t" => "\t \t"
One possible solution would be simply to use the generic character type \h, which stands for horizontal whitespace:
preg_replace('/\h+/', ' ', $text)
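A brief sketch of the difference from \s (the sample text is invented): \h matches tabs and spaces but not line breaks, so the line structure survives.

```php
<?php
// "a", a mixed run of tabs and spaces, "b", a newline, a tab, "c".
$text = "a\t\t \t b\n\tc";

// Each run of horizontal whitespace becomes one space; the newline is kept.
$result = preg_replace('/\h+/', ' ', $text);
```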

remove in php any character but not symbols and letters

How can I use str_ireplace or other functions to remove every character except letters, numbers and symbols that are commonly used in HTML, such as " ' ; : . - + = etc.? I also want to remove \n, white spaces, tabs and the like.
I need the text, which comes from ("textContent").innerHTML in IE10 and Chrome, to have the same length in a PHP variable regardless of which browser produced it. So I need the same encoding in both texts, with rare or differing characters removed.
I tried this, but it doesn't work for me:
$textForMatch=iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
$textoForMatc = str_replace(array('\s', "\n", "\t", "\r"), '', $textoForMatch);
$text contains the result of ("textContent").innerHTML; I want to delete characters such as �é³.
The easiest option is to simply use preg_replace with a whitelist. I.e. use a pattern listing the things you want to keep, and replace anything not in that list:
$input = 'The quick brown 123 fox said "�é³". Man was I surprised';
$stripped = preg_replace('/[^-\w:";:+=\.\']/', '', $input);
$output = 'Thequickbrown123foxsaid"".ManwasIsurprised';
regex explanation
/ - start regex
[^ - begin negated character class: matches any character NOT listed
- - literal character
\w - match word characters. Equivalent to A-Za-z0-9_
:";:+= - literal characters
\. - escaped period (because a dot has meaning in a regex)
\' - escaped quote (because the string is in single quotes)
] - end character class
/ - end of regex
This will therefore remove anything that isn't words, numbers or the specific characters listed in the regex.
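A runnable sketch of the same whitelist. As an assumption made for this example, \w is spelled out as the ASCII ranges A-Za-z0-9_ so the result does not depend on the locale or on the u modifier, and the two non-ASCII characters are written as \u{...} escapes:

```php
<?php
// \u{E9} is é and \u{B3} is ³, both stored as multi-byte UTF-8.
$input = "The quick brown 123 fox said \"\u{E9}\u{B3}\". Man was I surprised";

// Without the u modifier the pattern works byte by byte, so every byte of
// the non-ASCII characters falls outside the whitelist and is removed.
$stripped = preg_replace('/[^-A-Za-z0-9_:";+=.\']/', '', $input);
```

The digits stay because they are part of the whitelist; only the spaces and the non-ASCII bytes are dropped.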

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
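The example above can be reproduced with a few lines (variable names as in the one-liner):

```php
<?php
$subject = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';

// Remove a lone lowercase letter together with one adjacent whitespace character.
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
```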
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a    hello
but you could fix that by changing the expression to \b\S\s*\b
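A small sketch of that fix, assuming a subject with several spaces after the single character (the sample string is invented):

```php
<?php
$subject = "test a    hello";

// Original pattern: exactly one whitespace character must be followed by a
// word boundary, so the run of four spaces prevents a match.
$unchanged = preg_replace('/\b\S\s\b/', '', $subject);

// Adjusted pattern: \s* swallows the whole run of spaces after the letter.
$fixed = preg_replace('/\b\S\s*\b/', '', $subject);
```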
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);

Regex to strip out everything but words and numbers (and latin chars)

I'm trying to clean a POST string used in an AJAX request (sanitizing before a DB query) so that it allows only alphanumeric characters, spaces (one between words, not multiple), "-", and Latin characters like "ç" and "é", without success. Can anyone help or point me in the right direction?
This is the regex I'm using so far:
$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));
Thank you.
$regEx = '/^[^\w\p{L}-]+$/iu';
\w - matches word characters (letters, digits and the underscore)
\p{L} - matches a single Unicode Code Point in the 'Letters' category (see the Unicode Categories section here).
- at the end of the character class matches a single hyphen.
^ in the character classes negates the character class, so that the regex will match the opposite of the character class (anything you do not specify).
+ outside of the character class says match 1 or more characters
^ and $ outside of the character class cause the engine to accept only matches that start at the beginning of a line and run to the end of the line.
After the pattern, the i modifier says ignore case and the u modifier tells the pattern-matching engine that we're going to be sending UTF-8 data its way. The g modifier originally present has been removed since it's not necessary in PHP (global matching instead depends on which matching function is called).
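The effect of the anchors is easy to miss, so here is a sketch with an invented input. With ^ and $ the pattern only matches when the whole string consists of disallowed characters; dropping the anchors (an adjustment made for this example, not part of the answer above) removes them wherever they occur:

```php
<?php
// \u{E9} is é.
$input = "caf\u{E9} -- 100%  ok!";

// Anchored: the string starts with an allowed character, so nothing matches
// and the input comes back unchanged.
$anchored = preg_replace('/^[^\w\p{L}-]+$/iu', '', $input);

// Unanchored: every run of disallowed characters (spaces included) is removed.
$unanchored = preg_replace('/[^\w\p{L}-]+/iu', '', $input);
```

Note that the class does not list the space, so spaces are removed as well; add a space or \s inside the class if they should be kept.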
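$string = mb_strtolower(utf8_encode($_POST['q']));
// PHP has no g modifier (preg_replace already replaces every match);
// the u modifier is needed for the UTF-8 letters inside the class.
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
$string = preg_replace('/ +/', ' ', $string);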
Why not just use mysql_real_escape_string?
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );
should do the trick. Note that
the character class is negated by putting ^ inside the character class
you need the u flag when dealing with unicode strings either in the pattern or in the subject
it's better to specify the character set explicitly in mb_* functions because otherwise they will fall back on your system defaults, and that may not be UTF-8.
escaping the hyphen (\- instead of -) is optional when it is the last character in the class, where it is already literal, but it makes the intent explicit
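Putting the points above together, a minimal sketch might look like this. The function name sanitizeQuery is hypothetical, the input is assumed to be valid UTF-8 already, and the Unicode properties \p{L} and \p{N} are swapped in for the explicit list of accented letters, which also lets through letters from other scripts:

```php
<?php
// Hypothetical helper; assumes $q is already valid UTF-8.
function sanitizeQuery(string $q): string
{
    // Name the encoding explicitly instead of relying on the default.
    $q = mb_strtolower($q, 'UTF-8');

    // Keep letters and digits of any script, spaces and hyphens.
    $q = preg_replace('/[^\p{L}\p{N} -]/u', '', $q);

    // Collapse runs of spaces, then trim both ends.
    return trim(preg_replace('/ +/', ' ', $q));
}

// \u{E7} is ç and \u{C9} is É.
$clean = sanitizeQuery("  Fran\u{E7}ais --  CAF\u{C9}_123!! ");
```

This only normalizes the text; it is not a substitute for escaping or parameter binding when the value is used in a query.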
