I am no PHP expert. I am looking for the PHP equivalent of isLetter() in Java, but I can't find it. Does it exist?
I need to extract the letters from a given string and make them lower case. For example, "Ap.ér4i5T i6f;" should give "apéritif". So yes, there are accented characters in my strings.
ctype_alpha().
In addition to regex / preg_replace, you can also use strtoupper($string) and strtolower($string), if you need to universally upper-case a string. As Konrad mentioned, preg_replace is probably your best bet though.
http://php.net/manual/en/function.strtoupper.php
http://www.php.net/manual/en/function.strtolower.php
In PHP (and in Java) you wouldn’t use isLetter to implement it, you’d rather replace all characters that aren’t letters using a regular expression:
echo preg_replace('/\P{L}/u', '', $input);
Look up the documentation of preg_replace and the regex pattern syntax description, in particular the relevant Unicode character classes.
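To tie this back to the question, here is a minimal sketch that combines the \P{L} replacement with lower-casing. The helper name extract_letters is made up for illustration. It assumes the input is valid UTF-8 with precomposed accented letters, and that the mbstring extension is available.

```php
<?php
// Hypothetical helper (the name is illustrative, not a PHP built-in):
// keep only Unicode letters, then lower-case the result.
function extract_letters(string $input): string
{
    // \P{L} matches any code point that is NOT a letter.
    // The u flag makes PCRE treat pattern and subject as UTF-8.
    $letters = preg_replace('/\P{L}+/u', '', $input);
    if ($letters === null) {
        // preg_replace returns null on error, e.g. malformed UTF-8 input.
        return '';
    }
    // Assumption: mbstring is loaded; plain strtolower is not UTF-8 aware.
    return mb_strtolower($letters, 'UTF-8');
}

echo extract_letters('Ap.ér4i5T i6f;'), "\n"; // apéritif
```

Note that spaces are removed too, since they are not letters. If the accents were stored as separate combining marks, they would be stripped as well, because those belong to category M rather than L.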
You could probably use the php-slugs source code, with appropriate modifications.
How can I make the following regex ignore case sensitivity? It should match all the correct characters but ignore whether they are lower or uppercase.
G[a-b].*
Assuming you want the whole regex to ignore case, you should look for the i flag. Nearly all regex engines support it:
/G[a-b].*/i
string.match("G[a-b].*", "i")
Check the documentation for your language/platform/tool to find how the matching modes are specified.
If you want only part of the regex to be case insensitive (as my original answer presumed), then you have two options:
Use the (?i) and [optionally] (?-i) mode modifiers:
(?i)G[a-b](?-i).*
Put all the variations (i.e. lowercase and uppercase) in the regex - useful if mode modifiers are not supported:
[gG][a-bA-B].*
One last note: if you're dealing with Unicode characters besides ASCII, check whether or not your regex engine properly supports them.
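As a concrete illustration of both approaches, here is a small sketch using PHP's preg_match, which is PCRE-based and supports both the trailing i flag and the inline modifiers. The sample strings are made up for the example.

```php
<?php
// Whole pattern case-insensitive via the trailing i flag.
$whole = preg_match('/G[a-b].*/i', 'gB and more');

// No flag: the lower-case "g" does not match "G".
$strict = preg_match('/G[a-b].*/', 'gb');

// Only "G[a-b]" is case-insensitive; (?-i) switches the mode off again,
// so the trailing "x" must be lower case.
$partialHit  = preg_match('/(?i)G[a-b](?-i)x/', 'gAx');
$partialMiss = preg_match('/(?i)G[a-b](?-i)x/', 'gAX');

var_dump($whole, $strict, $partialHit, $partialMiss); // int(1) int(0) int(1) int(0)
```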
That depends on the implementation, but I would use:
(?i)G[a-b].
VARIATIONS:
(?i) case-insensitive mode ON
(?-i) case-insensitive mode OFF
Modern regex flavors allow you to apply modifiers to only part of the regular expression. If you insert the modifier (?im) in the middle of the regex then the modifier only applies to the part of the regex to the right of the modifier. With these flavors, you can turn off modes by preceding them with a minus sign (?-i).
Description is from the page:
https://www.regular-expressions.info/modifiers.html
A regular expression to validate 'abc', ignoring case:
(?i)(abc)
The i flag is normally used for case insensitivity. You don't give a language here, but it'll probably be something like /G[ab].*/i or /(?i)G[ab].*/.
Just for the sake of completeness I wanted to add the solution for regular expressions in C++ with Unicode:
std::tr1::wregex pattern(szPattern, std::tr1::regex_constants::icase);
if (std::tr1::regex_match(szString, pattern))
{
...
}
JavaScript
If you want to make it case insensitive just add i at the end of regex:
'Test'.match(/[A-Z]/gi) //Returns ["T", "e", "s", "t"]
Without i
'Test'.match(/[A-Z]/g) //Returns ["T"]
In JavaScript you should pass the i flag to the RegExp constructor as stated in MDN:
const regex = new RegExp('(abc)', 'i');
regex.test('ABc'); // true
As I discovered from this similar post (ignorecase in AWK), on old versions of awk (such as on vanilla Mac OS X), you may need to use 'tolower($0) ~ /pattern/'.
IGNORECASE or (?i) or /pattern/i will either generate an error or return true for every line.
C#
using System.Text.RegularExpressions;
...
Regex.Match(
input: "Check This String",
pattern: "Regex Pattern",
options: RegexOptions.IgnoreCase)
specifically: options: RegexOptions.IgnoreCase
[gG][aAbB].* is probably the simplest solution if the pattern is not too complicated or long.
Addition to the already-accepted answers:
Grep usage:
Note that for grepping it is simply a matter of adding the -i option. Ex: grep -rni regular_expression searches for 'regular_expression' 'r'ecursively, case 'i'nsensitively, showing line 'n'umbers in the result.
Also, here's a great tool for verifying regular expressions: https://regex101.com/
References:
man pages (man grep)
http://droptips.com/using-grep-and-ignoring-case-case-insensitive-grep
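In Kotlin (on the JVM), the Regex constructor has the form
Regex(pattern: String, option: RegexOption)
So to ignore case, use
option = RegexOption.IGNORE_CASE
In plain Java there is no Regex class; use Pattern.compile(pattern, Pattern.CASE_INSENSITIVE) instead.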
Kotlin:
"G[a-b].*".toRegex(RegexOption.IGNORE_CASE)
You can also convert the string you are going to check to lower case first, and then use only lower-case characters in your pattern.
You can practice Regex In Visual Studio and Visual Studio Code using find/replace.
You need to select both Match Case and Regular Expressions for case-sensitive regex searches. Otherwise the search ignores case and [A-Z] won't work as expected.
I have some strings containing characters such as \x{1f601} which I want to replace with some text.
When I do this using preg_replace, it would be something like:
preg_replace('/\x{1f601}/u', '######', $str)
However, this doesn't seem to work with str_replace:
str_replace("\x{1f601}", '######', $str)
How can I make such replacements work with str_replace?
preg_replace is a regex replacer built on a Perl-compatible regular expression engine, so it understands escapes such as \x{1f601}. str_replace is NOT a regex function; it replaces plain text literally.
The preg_replace pattern you have can be checked on regex101, which states that it:
matches the character 😁 with position 0x1f601 (128513 decimal or 373001 octal) in the character set
This can be carried over to a non-regex find and replace by copying and pasting the smiley symbol directly into the str_replace call:
$str = str_replace("😁", '######', $str);
Or, by reading deceze's comment which gives you a clean, small solution.
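If pasting the literal character into the source is undesirable, the sketch below shows two escape forms for double-quoted PHP strings. The \u{...} code point escape needs PHP 7.0 or later. The \xF0\x9F\x98\x81 form spells out the UTF-8 bytes of U+1F601 and assumes the subject string is UTF-8 encoded. The sample text is made up.

```php
<?php
$str = "Nice \u{1F601} day";

// PHP 7+: Unicode code point escape inside a double-quoted string.
$viaCodePoint = str_replace("\u{1F601}", '######', $str);

// Any PHP version: the four UTF-8 bytes of U+1F601 written as byte escapes.
$viaBytes = str_replace("\xF0\x9F\x98\x81", '######', $str);

echo $viaCodePoint, "\n"; // Nice ###### day
echo $viaBytes, "\n";     // Nice ###### day
```

The original attempt fails because "\x{1f601}" is not a code point escape in a PHP string. The \x{...} syntax only has that meaning inside a PCRE pattern.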
Additional:
You are using a character set that is non-standard, so it may be useful for you to explore Mb_Str_replace (GitHub), which is an accompaniment to (but not directly part of) the mb_string collection of PHP functions.
Finally:
Why do you need to do a string replace when you are already using the regex-based preg_replace? Also, please read the manual, which states all of this fairly clearly.
I am stuck on this and cannot find a solution.
I would like to transform a string to lower case using preg_replace.
I just cannot create the right regex.
The reason is that the normal strtolower does not support Unicode characters.
I know that I could use mb_strtolower, but this function seems to be quite slow and, besides, not everyone has MB support.
Any clue?
Regards,
Radek
EDIT: OK, thanks a lot for your help, guys. I think my approach was not quite correct.
I think it would be much better to use this: How do I detect non-ASCII characters in a string? and then use either strtolower or, if available, mb_strtolower accordingly.
A regex cannot change characters by itself; it can only reorder them, add characters, or delete some of them.
There is preg_replace_callback (and the /e flag, which was removed in PHP 7), but they can only call existing functions, so they cannot do better than strtolower.
If you can't rely on the existence of the mb_strtolower function, you will have to implement it yourself.
You shouldn't use preg_replace for this, because preg_replace is meant to match a certain pattern and replace it with something else. What you want is to replace every single uppercase character with a lowercase one, so there is no need to match a pattern.
mb_strtolower would be the way to go, and if you don't have the mb_ functions you'll have to write a function yourself using a lot of str_replace's...
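The approach settled on in the question's edit (detect non-ASCII input, then pick a function) could look roughly like the sketch below. Both function names are invented for the example. The fallback only lower-cases A-Z and leaves accented letters unchanged, so it is a degraded mode, not a replacement for mb_strtolower. UTF-8 input is assumed.

```php
<?php
// Hypothetical fallback: lower-case ASCII letters only.
// Safe on UTF-8 because bytes 0x41-0x5A never occur inside a multibyte sequence.
function ascii_lower(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Hypothetical wrapper: choose the cheapest function that is correct for the input.
function lower_utf8(string $s): string
{
    // Pure ASCII input: the fast built-in is enough.
    if (preg_match('/[^\x00-\x7F]/', $s) !== 1) {
        return strtolower($s);
    }
    // Non-ASCII input: use mbstring when the extension is loaded.
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($s, 'UTF-8');
    }
    // No mbstring: accented capitals stay as they are.
    return ascii_lower($s);
}

echo lower_utf8('Hello World'), "\n"; // hello world
echo ascii_lower('ÉCOLE'), "\n";      // École
```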
Is there an equivalent of the PHP function preg_split for JavaScript?
Any string in javascript can be split using the string.split function, e.g.
"foo:bar".split(/:/)
where split takes as an argument either a regular expression or a literal string.
You can use regular expressions with split.
The problem is the escape characters in the string: the (? opens a non-capturing group, but there is no corresponding ) to close it, so it identifies the string to look for as '
If you want support for all of the preg_split arguments see https://github.com/kvz/phpjs/blob/master/_workbench/pcre/preg_split.js (though not sure how well tested it is).
Just bear in mind that JavaScript's regex syntax is somewhat different from PHP's (mostly less expressive). We would like to integrate XRegExp at some point as that makes up for some of the missing features of PHP regexes (as well as fixes the many browser reliability problems with functions like String.split()).
What's the best way to filter non-alphanumeric "repeating" characters?
I would rather not build a list of characters to check for. Is there a good regex for this that I can use in PHP?
Examples:
...........
*****************
!!!!!!!!
###########
------------------
~~~~~~~~~~~~~
Special case patterns:
=*=*=*=*=*=
->->->->
Based on @sln's answer:
$str = preg_replace('~([^0-9a-zA-Z])\1+|(?:=[*])+|(?:->)+~', '', $str);
The pattern could be something like this : s/([\W_]|=\*|->)\1+//g
or, if you want to replace by just a single instance: s/([\W_]|=\*|->)\1+/$1/g
Edit: any special sequence should probably come first in the alternation; in case you need to make something like == special, it then won't be grabbed by [\W_].
So something like s/(==>|=\*|->|[\W_])\1+/$1/g where special cases are first.
preg_replace('~\W+~', '', $str);
sln's solution is pretty good, but the \W "non-word" class includes whitespace. I don't think you want to be removing sequences of tabs or spaces! Using a negated class (something like '[^A-Za-z0-9\s]') would work better.
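Putting the suggestions above together (special sequences first, whitespace excluded), here is one possible PHP sketch. The function name collapse_repeats is made up, and the use of \p{L} and \p{N} instead of A-Za-z0-9 is my own substitution so that accented letters count as letters. It collapses each run to a single instance; pass an empty replacement to remove the runs entirely.

```php
<?php
// Hypothetical helper: collapse runs of a repeated symbol (or of a listed
// special sequence) down to a single instance.
function collapse_repeats(string $str): string
{
    // Special multi-character sequences come first in the alternation, so
    // they are tried before the single-symbol class.
    // [^\p{L}\p{N}\s] = anything that is not a letter, digit, or whitespace.
    $result = preg_replace('~(==>|=\*|->|[^\p{L}\p{N}\s])\1+~u', '$1', $str);
    // preg_replace returns null on error (e.g. malformed UTF-8); keep the input then.
    return $result ?? $str;
}

echo collapse_repeats('Hello!!!!!! world......'), "\n"; // Hello! world.
echo collapse_repeats('->->->-> next'), "\n";           // -> next
echo collapse_repeats('café!!!  ok'), "\n";             // café!  ok
```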
This will filter out all symbols
[code]
$q = preg_replace('/[^A-Za-z0-9 ]/', '', $q); // ereg_replace was removed in PHP 7
[/code]
replace(/([^A-Za-z0-9\s]+)\1+/, "")
will remove repeated patterns of non-alphanumeric non-whitespace strings.
However, this is a bad practice because you'll also be removing all non-ASCII European and other international language characters in the Unicode base.
The only place where you really won't ever care about internationalization is in processing source code, but then you are not handling text quoted in strings and you may also accidentally de-comment a block.
You may want to be more restrictive in what you try to remove by giving a list of characters to replace instead of the catch-all.
Edit: I have done similar things before when trying to process early-version ShoutCAST radio names. At that time, stations tried to call attention to themselves by having obnoxious names like: <<!!!!--- GREAT MUSIC STATION ---!!!!>>. I used similar code to get rid of repeated symbols, but then learnt (the hard way) to be careful about what I eventually remove.
This works for me:
preg_replace('/(.)\1{3,}/i', '', $sourceStr);
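It removes any character that repeats four or more times in a row (the first occurrence plus at least three repeats). Use \1{2,} to catch runs of three or more.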