PHP: categorizing characters in a string

Consider the following string
hello, my name is 冰岛, nice to meet you
I need to scan this string and categorize each character as one of the following types:
1) Western text (alphabet and numbers only)
2) Chinese text (ideograms only, no punctuation)
3) Anything else (anything else, whether western or chinese or else)
Anyone can point me in the right direction? Thanks
Edit: since I suppose this has been downvoted due to being too generic..
for($i=0, $l = mb_strlen($string); $i<$l; $i++)
{
$char = mb_substr($string, $i, 1);
if(preg_match("/^[a-zA-Z]$/", $char)) $type = "alpha";
else
...
;
}
Regular expressions other than detecting alphabetic characters defy my knowledge, especially what is needed to include Han Ideograms only and leave all Han punctuation and special symbols out.

I can suggest that you should use a preg_replace_callback to grab the chunks of text you need with a regex that will capture different categories of texts into separate groups, and build the resulting array based on these captures:
$s = "hello, my name is 冰岛, nice to meet you";
$res = array();
preg_replace_callback('~\b(?<Chinese>\p{Han}+)\b|\b(?<Western>[a-zA-Z0-9]+)\b|(?<Other>[^\p{Han}A-Za-z0-9\s]+)~su',
function($m) use (&$res) {
if (!empty($m["Chinese"])) {
$t = array("type" => "Han", "value" => $m["Chinese"]);
array_push($res,$t);
}
else if (!empty($m["Western"])) {
$t = array("type" => "Western", "value" => $m["Western"]);
array_push($res, $t);
}
else if (!empty($m["Other"])) {
$t=array("type" => "Other", "value" => $m["Other"]);
array_push($res, $t);
}
},
$s);
print_r($res);
See the online PHP demo
Pattern:
\b(?<Chinese>\p{Han}+)\b - a whole Chinese word
| - or
\b(?<Western>[a-zA-Z0-9]+)\b - a whole word consisting of only ASCII letters and digits
| - or
(?<Other>[^\p{Han}A-Za-z0-9\s]+) - any 1+ symbols other than Chinese chars, ASCII letters, ASCII digits and whitespaces (\s).
The s modifier is redundant here because the pattern contains no dot; it only matters if you add a . and want it to match line breaks as well.
The u modifier is necessary here since you are dealing with Unicode strings.
Also, see more about Unicode properties in the Unicode Properties section at the regular-expressions.info (e.g. you might be interested in \p{P} and \p{S} properties).
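If you really need one label per character rather than per chunk, the same properties can be applied character by character. This is only a minimal sketch, not part of the answer above: the function name categorizeChars and the label strings are made up for illustration, and it assumes the input is valid UTF-8. How CJK punctuation is classified by \p{Han} may depend on the PCRE2 version bundled with your PHP build, so check a few punctuation characters against your own setup.

```php
<?php
// Hypothetical helper: label every character of a UTF-8 string.
function categorizeChars(string $s): array
{
    $result = [];
    // Splitting on an empty pattern with the u modifier yields one element per code point.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (preg_match('/^[A-Za-z0-9]$/', $char)) {
            $type = 'Western';   // ASCII letters and digits only
        } elseif (preg_match('/^\p{Han}$/u', $char)) {
            $type = 'Han';       // Han ideographs
        } else {
            $type = 'Other';     // whitespace, punctuation, everything else
        }
        $result[] = ['char' => $char, 'type' => $type];
    }
    return $result;
}

foreach (categorizeChars('is 冰岛, nice') as $entry) {
    echo $entry['char'], ' => ', $entry['type'], "\n";
}
```

Consecutive entries with the same type can then be merged if you want chunks instead of single characters.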

Related

php regex replace single capital with space capital [duplicate]

How would I go about splitting the word:
oneTwoThreeFour
into an array so that I can get:
one Two Three Four
with preg_match ?
I tried this, but it just gives the whole word:
$words = preg_match("/[a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/", $string, $matches);

You can use preg_split as:
$arr = preg_split('/(?=[A-Z])/',$str);
See it
I'm basically splitting the input string just before each uppercase letter. The regex used, (?=[A-Z]), matches the position just before an uppercase letter.
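A small sketch of what that split produces, using the same pattern. The wrapper name splitBeforeCaps is hypothetical, and the PREG_SPLIT_NO_EMPTY flag is an addition: it drops the empty first element you would otherwise get when the input starts with an uppercase letter.

```php
<?php
// Hypothetical wrapper around the lookahead split.
function splitBeforeCaps(string $str): array
{
    // (?=[A-Z]) matches the empty position before each uppercase letter,
    // so the split consumes no characters.
    return preg_split('/(?=[A-Z])/', $str, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitBeforeCaps('oneTwoThreeFour')); // one, Two, Three, Four
print_r(splitBeforeCaps('OneTwo'));          // One, Two
```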

You can also use preg_match_all as:
preg_match_all('/((?:^|[A-Z])[a-z]+)/',$str,$matches);
Explanation:
( - Start of capturing parenthesis.
(?: - Start of non-capturing parenthesis.
^ - Start anchor.
| - Alternation.
[A-Z] - Any one capital letter.
) - End of non-capturing parenthesis.
[a-z]+ - one or more lowercase letters.
) - End of capturing parenthesis.

I know that this is an old question with an accepted answer, but IMHO there is a better solution:
<?php // test.php Rev:20140412_0800
$ccWord = 'NewNASAModule';
$re = '/(?#! splitCamelCase Rev:20140412)
# Split camelCase "words". Two global alternatives. Either g1of2:
(?<=[a-z]) # Position is after a lowercase,
(?=[A-Z]) # and before an uppercase letter.
| (?<=[A-Z]) # Or g2of2; Position is after uppercase,
(?=[A-Z][a-z]) # and before upper-then-lower case.
/x';
$a = preg_split($re, $ccWord);
$count = count($a);
for ($i = 0; $i < $count; ++$i) {
printf("Word %d of %d = \"%s\"\n",
$i + 1, $count, $a[$i]);
}
?>
Note that this regex, (like codaddict's '/(?=[A-Z])/' solution - which works like a charm for well formed camelCase words), matches only a position within the string and consumes no text at all. This solution has the additional benefit that it also works correctly for not-so-well-formed pseudo-camelcase words such as: StartsWithCap and: hasConsecutiveCAPS.
Input:
oneTwoThreeFour
StartsWithCap
hasConsecutiveCAPS
NewNASAModule
Output:
Word 1 of 4 = "one"
Word 2 of 4 = "Two"
Word 3 of 4 = "Three"
Word 4 of 4 = "Four"
Word 1 of 3 = "Starts"
Word 2 of 3 = "With"
Word 3 of 3 = "Cap"
Word 1 of 3 = "has"
Word 2 of 3 = "Consecutive"
Word 3 of 3 = "CAPS"
Word 1 of 3 = "New"
Word 2 of 3 = "NASA"
Word 3 of 3 = "Module"
Edited: 2014-04-12: Modified regex, script and test data to correctly split: "NewNASAModule" case (in response to rr's comment).

While ridgerunner's answer works great, it seems not to work with all-caps substrings that appear in the middle of a string. I use the following, and it seems to deal with these just fine:
function splitCamelCase($input)
{
return preg_split(
'/(^[^A-Z]+|[A-Z][^A-Z]+)/',
$input,
-1, /* no limit for replacement count */
PREG_SPLIT_NO_EMPTY /*don't return empty elements*/
| PREG_SPLIT_DELIM_CAPTURE /*don't strip anything from output array*/
);
}
Some test cases:
assert(splitCamelCase('lowHigh') == ['low', 'High']);
assert(splitCamelCase('WarriorPrincess') == ['Warrior', 'Princess']);
assert(splitCamelCase('SupportSEELE') == ['Support', 'SEELE']);
assert(splitCamelCase('LaunchFLEIAModule') == ['Launch', 'FLEIA', 'Module']);
assert(splitCamelCase('anotherNASATrip') == ['another', 'NASA', 'Trip']);

A functionized version of @ridgerunner's answer.
/**
* Converts a camelCase string to words separated by spaces.
* @param $camelCaseString
* @return string
*/
function fromCamelCase($camelCaseString) {
$re = '/(?<=[a-z])(?=[A-Z])/x';
$a = preg_split($re, $camelCaseString);
// implode() takes the separator first; the reversed order is not accepted by PHP 8
return implode(' ', $a);
}

$string = preg_replace( '/([a-z0-9])([A-Z])/', "$1 $2", $string );
The pattern matches every lowercase letter or digit ($1) that is immediately followed by an uppercase letter ($2), and the replacement puts a space between the two.
For example, in helloWorld, $1 matches "o" and $2 matches "W", so you get "hello World"; HelloWorld becomes "Hello World" in the same way. Then you can lowercase the words, uppercase the first word, explode them on the space, or use a _ or some other character to keep them separate.
Short and simple.
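To make the behaviour concrete, here is a minimal sketch of that replacement wrapped in a function. The name spaceOutCamelCase is made up for this example. Note the last call: a run of capitals is not separated from the capitalized word that follows it.

```php
<?php
// Hypothetical helper wrapping the replacement shown above.
function spaceOutCamelCase(string $s): string
{
    // $1 is the lowercase letter or digit, $2 the uppercase letter right after it.
    return preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', $s);
}

echo spaceOutCamelCase('helloWorld'), "\n";    // hello World
echo spaceOutCamelCase('HelloWorld'), "\n";    // Hello World
echo spaceOutCamelCase('newNASAModule'), "\n"; // new NASAModule
```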

When determining the best pattern for your project, you will need to consider the following pattern factors:
Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
Readability -- the pattern should be kept as simple as possible
The above factors also happen to be in the hierarchical order that I strive to obey. In other words, it doesn't make much sense to me to prioritize 2, 3, or 4 when 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.
Capture Groups and Lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability which can be associated with pattern brevity.
Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:
preg_split() patterns:
/^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
/(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
/[^A-Z]+\K(?=[A-Z])/ (43 steps)
/(?=[A-Z])/ (50 steps)
/(?=[A-Z]+)/ (50 steps)
/([a-z]{1})[A-Z]{1}/ (53 steps)
/([a-z0-9])([A-Z])/ (68 steps)
/(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)
preg_match_all() patterns:
/[A-Z]?[a-z]+/ (14 steps)
/((?:^|[A-Z])[a-z]+)/ (35 steps)
I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array, in other words, all of the fullstring matches will be in the [0] subarray; if there is a capture group used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.
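The difference in output shape can be seen with a short sketch, using two of the patterns listed above on the OP's sample input:

```php
<?php
$input = 'oneTwoThreeFour';

// preg_match_all() fills a 2-dimensional array: [0] holds the full matches,
// [1] holds the contents of the first capture group.
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $input, $matches);

// preg_split() returns the flat list directly.
$parts = preg_split('/(?=[A-Z])/', $input);

var_export($matches[0]);
echo "\n";
var_export($parts);
```

Here $matches[0], $matches[1] and $parts all hold the same four words; only the nesting differs.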
Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.
New Extended Battery of Test Strings:
oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain
Suitable preg_split() patterns:
/[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)
Suitable preg_match_all() pattern:
/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
Finally, my recommendations based on my pattern principles / factor hierarchy. Also, I recommend preg_split() over preg_match_all() (despite the preg_match_all() patterns needing fewer steps) because it leads more directly to the desired output structure. (Of course, choose whatever you like.)
Code: (Demo)
$noAcronyms = 'oneTwoThreeFour';
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);
Code: (Demo)
$withAcronyms = 'newNASAModule';
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);

I took cool guy Ridgerunner's code (above) and made it into a function:
echo deliciousCamelcase('NewNASAModule');
function deliciousCamelcase($str)
{
$formattedStr = '';
$re = '/
(?<=[a-z])
(?=[A-Z])
| (?<=[A-Z])
(?=[A-Z][a-z])
/x';
$a = preg_split($re, $str);
$formattedStr = implode(' ', $a);
return $formattedStr;
}
This will return: New NASA Module

Another option is matching /[A-Z]?[a-z]+/ - if you know your input is in the right format, it should work nicely.
[A-Z]? would match an uppercase letter (or nothing). [a-z]+ would then match all following lowercase letters, until the next match.
Working example: https://regex101.com/r/kNZfEI/1
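A quick sketch of that pattern with preg_match_all(). The function name matchCamelWords is only for illustration. The second call shows what "the right format" means in practice: an acronym in the middle is silently dropped.

```php
<?php
// Hypothetical helper: collect the words matched by /[A-Z]?[a-z]+/.
function matchCamelWords(string $s): array
{
    preg_match_all('/[A-Z]?[a-z]+/', $s, $m);
    return $m[0]; // full matches only
}

var_export(matchCamelWords('oneTwoThreeFour')); // one, Two, Three, Four
echo "\n";
var_export(matchCamelWords('newNASAModule'));   // new, Module (NASA is lost)
```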

You can split on a "glide" from lowercase to uppercase thus:
$parts = preg_split('/([a-z]{1})[A-Z]{1}/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
//PREG_SPLIT_DELIM_CAPTURE to also return bracketed things
var_dump($parts);
Annoyingly you will then have to rebuild the words from each corresponding pair of items in $parts
Hope this helps

First of all codaddict thank you for your pattern, it helped a lot!
I needed a solution that also works when a single-letter word such as the article 'a' is present,
e.g. thisIsACamelcaseSentence.
I found the solution by doing a two-step preg_match_all and made a function with some options:
/*
* input: 'thisIsACamelCaseSentence' output: 'This Is A Camel Case Sentence'
* options $case: 'allUpperCase'[default] >> 'This Is A Camel Case Sentence'
* 'allLowerCase' >> 'this is a camel case sentence'
* 'firstUpperCase' >> 'This is a camel case sentence'
* @return: string
*/
function camelCaseToWords($string, $case = null){
$case = $case ?? 'allUpperCase';
// Find first occurrences of two capitals
preg_match_all('/((?:^|[A-Z])[A-Z]{1})/',$string, $twoCapitals);
// Split them with the 'zzzzzz' string. e.g. 'AZ' turns into 'AzzzzzzZ'
foreach($twoCapitals[0] as $match){
$firstCapital = $match[0];
$lastCapital = $match[1];
$temp = $firstCapital.'zzzzzz'.$lastCapital;
$string = str_replace($match, $temp, $string);
}
// Now split words
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $string, $words);
$output = "";
$i = 0;
foreach($words[0] as $word){
switch($case){
case 'allUpperCase':
$word = ucfirst($word);
break;
case 'allLowerCase':
$word = strtolower($word);
break;
case 'firstUpperCase':
($i == 0) ? $word = ucfirst($word) : $word = strtolower($word);
break;
}
// remove the 'zzzzzz' marker from the word if it is present
$word = str_replace('zzzzzz','', $word);
$output .= $word." ";
$i++;
}
return rtrim($output); // drop the trailing space added by the loop
}
Feel free to use it, and in case there is an 'easier' way to do this in one step please comment!

Full function based on @codaddict's answer:
function splitCamelCase($str) {
$splitCamelArray = preg_split('/(?=[A-Z])/', $str);
return ucwords(implode(' ', $splitCamelArray));
}

How to get substrings on both sides of hyphen and trailing substring?

I am currently working on a web app which is using a specific string to call a function. Here is a sample string:
$string = "translate from-to word for translate";
First I need to validate the string, and it should be like the above $string. How should I validate the string?
Then I need to extract 3 substrings from $string.
The word that precedes the hyphen. (To be named: $source)
The word that follows the hyphen. (To be named: $target)
The text (not including the first space) that follows $target to the end of the string. (To be named: $translate)
This is my coding attempt to get the from and to:
$found = false;
$source ="";
$target = "";
$next = 3;
$prev = 1;
for($i=0;$i<strlen($string);$i++){
if($found== false){
if($string[$i] == "-"){
$found = true;
while($string[$i+$prev] != " "){
$target .= $string[$i+$prev];
$prev +=1;
}
/*$next -=1;
while($string[$i-$next] != " " && $next > 0){
$source .= $string[$i-$next];
$next -=1;
}*/
}
}
}
From that code, I can only get $target, which contains the to after the hyphen. I don't know how to get $source.
Please show me the fastest way to get the from as $source and to as $target.
Then I need to get word for translate (all of the string after from-to).
So the result should be
$target = "to";
$source = "from";
$translate = "word for translate";
Finally, if the $string has more than one hyphen, like translate from-to from-to test-test word for translate, it should return false.
Note: to and from can be any strings.

Consider the following possible input strings:
translate from-to word for translate (1 hyphen, no accents or non-English characters)
translate dari-ke dari-ke word for translate (2 hyphens)
translate clé-solution word for translate (1 hyphen, accented character used)
translate goodbye-さようなら word for translate (1 hyphen , Japanese characters used)
A case-insensitive pattern like: /^[a-z]+? ([a-z]+)-([a-z]+?) ([a-z ]+)$/i will perform as requested on the first two sample strings with high efficiency, but not the last two.
Using the "word character" (\w) to match the substrings (instead of case-insensitive [a-z]) will perform as intended with the first two samples, but also allows 0-9 and _ as valid characters. This means a slight drop in pattern accuracy (this may be of no noticeable consequence to your project).
If you are translating strings that may go beyond English characters, it can be simpler / more forgiving to use a "negated character class" for matching. If you want to allow letters beyond a-z, like accented and other multibyte characters, then [^-] will offer a broad allowance of characters (at the expense of allowing many unwanted letters too). Here is a demo of this kind of pattern.
It is important to only write "capture groups" for substrings that you want to subsequently use. For this reason, I do not capture the leading substring translate.
list() is a handy "language construct" to assign variable names to array values. Notice that the first element (the fullstring match) is not assigned to a variable. This is why list()'s parameter list starts with a comma. If you don't wish to leverage the convenience of list(), then you can manually assign the three variable names over three lines like this:
$source=$out[1];
$target=$out[2];
$translate=$out[3];
Code: (Demo)
$strings=[
"translate from-to word for translate",
"translate dari-ke dari-ke word for translate",
"translate clé-solution word for translate",
"translate goodbye-さようなら word for translate"
];
foreach($strings as $string){
if(preg_match('/^[a-z]+? ([^-]+)-([^-]+?) ([a-z ]+)$/i',$string,$out)){
list(,$source,$target,$translate)=$out;
echo "source=$source; target=$target; translate=$translate";
}else{
var_export(false); // $found=false;
}
echo "<br>";
}
Output:
source=from; target=to; translate=word for translate
false
source=clé; target=solution; translate=word for translate
source=goodbye; target=さようなら; translate=word for translate
While regex provides a much more concise method with fewer function calls, this is a non-regex method:
if(substr_count($string,'-')!=1){
var_export(false); // $found=false;
}else{
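$trimmed=substr($string,strlen('translate ')); // assumes the string starts with "translate "; ltrim() would strip a set of characters, not a prefix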
$array=explode(' ',$trimmed,2);
list($source,$target)=explode('-',$array[0]);
$translate=$array[1];
echo "source=$source; target=$target; translate=$translate";
}
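If the text to translate may itself contain non-English characters, the [a-z ]+ in the last capture group will reject it. One possible variation, sketched below under that assumption, relaxes every part to "anything but a hyphen" and adds the u modifier. The function name parseTranslateCommand and the array return shape are hypothetical, not something from the question.

```php
<?php
// Hypothetical helper: returns [source, target, text], or null when the
// string does not have the "command source-target text" shape.
function parseTranslateCommand(string $string): ?array
{
    // \S+        the leading command word (not captured)
    // ([^-\s]+)  source: characters that are neither hyphen nor whitespace
    // ([^-\s]+)  target: same rule
    // ([^-]+)    the rest of the string, which must not contain another hyphen
    if (preg_match('/^\S+ ([^-\s]+)-([^-\s]+) ([^-]+)$/u', $string, $m)) {
        return [$m[1], $m[2], $m[3]];
    }
    return null;
}

var_export(parseTranslateCommand('translate from-to word for translate'));
echo "\n";
var_export(parseTranslateCommand('translate dari-ke dari-ke word for translate')); // NULL
```

The trade-off is the same as with [^-] above: the looser the classes, the more unwanted characters slip through.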

If I understand your question correctly, this can be done with a regular expression:
<?php
$string = "translate from-to word for translate";
$result = preg_match("/^([\w ]+?) (\w+)-(\w+) ([\w ]+)$/", $string, $matches);
if ($result) {
print_r($matches);
$source = $matches[2];
$target = $matches[3];
$translate = $matches[4];
} else {
echo "No match";
}
Output:
Array
(
[0] => translate from-to word for translate
[1] => translate
[2] => from
[3] => to
[4] => word for translate
)
Here is an explanation of the regular expression.

Why does this regular expression only capture one word?

I'm trying to learn Regular Expressions. I know the basics, and I'm not terrible at regex, I'm just no pro - hence I've got a question for you guys. If you know regex, I bet it'll be simple.
What I've got currently is this:
/(\w+)\s-{1}\s(\w+)\.{1}(\w{3,4})/
What I'm trying to do is create a little script for myself that tidies up my music collection by formatting all of the filenames. I know there's other stuff out there already but this is a learning experience for me. I already screwed up all the titles once by replacing things like "Hell Aint A Bad Place To Be" with "Hell Aint a Bad Place To Be". In my wisdom I somehow ended up with "Hell Aint a ad Place to be" (I was looking for an A followed by a space and an uppercase character). Obviously that was a nightmare to fix and it had to be done manually. Needless to say I'm testing samples first now.
Anyway, the above regex is sort of a stage 1 of many. Eventually I want to build it up, but for now I just need to get the simple bits working.
In the end I'd like to turn:
"arctic Monkeys- a fake tales of a san francisco"
into
"Arctic Monkeys - A Fake Tales of a San Francisco"
I know I'll need lookbehind assertions to grab when you're after a '-', because if the first word is 'a', 'of' etc. which I'd normally lowercase, I need to uppercase them (the above is a bad example for this use case I know).
Any way of fixing the existing regular expression would be great, and any tips on where to look on my cheatsheet to finish the rest off would be great too (I'm not looking for a fully-fledged answer, since I need to learn to do it myself; I just can't figure out why \w+ is only getting one word).

I believe there is a much simpler way of approaching this problem: split the string into words, based on a much simpler regex, and then apply whatever processing you want to those words. This will allow you to perform more complicated transformations on the text in a much cleaner way. Here's an example:
<?php
$song = "arctic Monkeys- a fake tales of a san francisco";
// Split on spaces or - (the - is still present
// because it's only a lookahead match)
$words = preg_split("/([\s]+|(?=-))/", $song);
/*
Output for print_r:
Array
(
[0] => arctic
[1] => Monkeys
[2] => -
[3] => a
[4] => fake
[5] => tales
[6] => of
[7] => a
[8] => san
[9] => francisco
)
*/
print_r($words);
$new_words = array();
foreach ($words as $k => $word) {
$new_words[] = processWord($word, $k, $words);
}
// This will output:
// Arctic Monkeys - A Fake Tales of a San Francisco
echo implode(' ', $new_words);
// You can add as many processing rules you want in here - in a very clean way
function processWord($word, $idx, $words) {
if ($idx > 0 && $words[$idx - 1] == '-') return ucfirst($word); // guard: index 0 has no previous word
return strlen($word) > 2 ? ucfirst($word) : $word;
}
Here's an example of this code running: http://codepad.org/t6pc8WpR

I'm a little confused about what you're doing, but maybe this will help. Remember that + is 1 or more characters, * is 0 or more. So you probably want to do something like ([\s]*) to match spaces. You don't need to specify the {1} next to a single character.
So maybe something like this:
([\w\s]+)([\s]*)-([\s]*)([\w\s]+)\.([\w]{3,4})
I haven't tested this code, but I think you get the idea.

\w does not match whitespace. A working regex might be:
/^(.+?)\s*-\s*(.+)$/
Explanation:
^ - must start at the beginning of the string
(.+?) - match any character, be ungreedy
\s* - match any amount of whitespace that might exist (including none)
- - match a literal hyphen
\s* - any whitespace again
(.+) - remaining characters
$ - end of string
The transcoding would then happen in another replacing regex.
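Applied to the sample from the question, that regex separates the two halves like this. It is a minimal sketch; the function name splitArtistTitle and the idea of returning the parts as an array are assumptions made for the example.

```php
<?php
// Hypothetical helper: split "artist - title" on the first hyphen.
function splitArtistTitle(string $s): ?array
{
    // (.+?) is lazy, so it stops at the first hyphen; \s* swallows
    // any whitespace on either side of it.
    if (preg_match('/^(.+?)\s*-\s*(.+)$/', $s, $m)) {
        return [$m[1], $m[2]];
    }
    return null; // no hyphen found
}

var_export(splitArtistTitle('arctic Monkeys- a fake tales of a san francisco'));
```

Each half can then be capitalized on its own before joining them back with " - ".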

For the first part, \w doesn't match words, it matches word characters. It's equivalent to [A-Za-z0-9_].
Instead, try ([A-Za-z0-9_ ]+) as your first bit (it has an extra space inside the square brackets, and the \s after it is removed).

Here's what I have:
<?php
/**
* Formats a string into a title:
* * Pads all dashes with spaces.
* * Uppercase all words with 3 letters or more.
* * Uppercase first word and first words after dashes.
*
* @param $str
*
* @return string
*/
function format_title($str) {
//Remove all spaces before and after dashes.
//(These will return in the final product)
$str = preg_replace("/\s?-\s?/", "-", $str);
//Explode by dash.
$string_split_by_dash = explode("-", $str);
//For each sentence (separated by dashes)
foreach ($string_split_by_dash as &$sentence) {
//Uppercase all words.
$sentence = ucwords($sentence);
//Explode into words (by space)
$words = explode(" ", $sentence);
//For each word
foreach ($words as &$word) {
//If its length is smaller than 3
if (strlen($word) < 3) {
//Lowercase it.
$word = strtolower($word);
}
}
//Implode back into a sentence.
$sentence = implode(" ", $words);
//Uppercase the first word, regardless of length.
$sentence = ucfirst($sentence);
}
//Implode all sentences back by space-padded dash.
$str = implode(" - ", $string_split_by_dash);
return $str;
}
$str = "arctic Monkeys- a fake tales of a san francisco";
var_dump(format_title($str));
I'd argue it's more readable (and more documentable) than a regex. Probably more efficient too, (didn't check).

Condensed function to strip double letters away from a string (PHP)

I need to remove every double-letter occurrence from a word. (E.g. "attached" has to become "aached".)
I wrote this function:
function strip_doubles($string, $positions) {
for ($i = 0; $i < strlen($string); $i++) {
$stripped_word[] = $string[$i];
}
foreach($positions['word'] as $position) {
unset($stripped_word[$position], $stripped_word[$position + 1]);
}
$returned_string= "";
foreach($stripped_word as $key => $value) {
$returned_string.= $value;
}
return $returned_string;
}
where $string is the word to be stripped and $positions is an array containing the positions of any first double letter.
It works perfectly, but how would a real programmer write the same function in a more condensed way? I have a feeling it should be possible to do the same thing without three loops and so much code.

Non-regex solution, tested:
$string = 'attached';
$stripped = '';
for ($i=0,$l=strlen($string);$i<$l;$i++) {
$matched = '';
// if current char is the same as the next, skip it
while (substr($string, $i, 1)==substr($string, $i+1, 1)) {
$matched = substr($string, $i, 1);
$i++;
}
// if current char is NOT the same as the matched char, append it
if (substr($string, $i, 1) != $matched) {
$stripped .= substr($string, $i, 1);
}
}
echo $stripped;

You should use a regular expression. It matches on certain characteristics and can replace the matched occurrences with some other string(s).
Something like
$result = preg_replace('#([a-zA-Z]{1})\1#i', '', $string);
Should work. It tells the regexp to match one letter from a-z or A-Z followed by a backreference (\1) to that same match, i.e. two identical characters in a row. The # characters mark the start and end of the regexp. If you want more characters than just a-z and A-Z, you could use other classes like [a-zA-Z0-9]{1}, or .{1} for any character, or, for Unicode letters only (including combining marks), \p{L}\p{M}* together with the u modifier.
The i flag after the last # means 'case insensitive' and will instruct the regexp to also match combinations with different cases, like 'tT'. If you want only combinations in the same case, so 'tt' and 'TT', then remove the 'i' from the flags.
The '' tells the regexp to replace the matched occurrences (the two identical characters) with an empty string.
See http://php.net/manual/en/function.preg-replace.php and http://www.regular-expressions.info/
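A minimal sketch of this approach as a function, assuming you want it to work for any Unicode letter and only for same-case pairs (so no i flag). The name stripDoubles is hypothetical.

```php
<?php
// Hypothetical helper: remove every pair of identical adjacent letters.
function stripDoubles(string $word): string
{
    // (\p{L}) captures one letter, \1 requires the very same letter again;
    // the u modifier makes the pattern work on UTF-8 characters, not bytes.
    return preg_replace('/(\p{L})\1/u', '', $word);
}

echo stripDoubles('attached'), "\n"; // aached
echo stripDoubles('letter'), "\n";   // leer
```

A run of three identical letters loses only one pair, so 'aaa' becomes 'a'.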

PHP preg_match - only allow alphanumeric strings and - _ characters

I need the regex to check if a string only contains numbers, letters, hyphens or underscore
$string1 = "This is a string*";
$string2 = "this_is-a-string";
if(preg_match('******', $string1)){
echo "String 1 not acceptable";
// String2 acceptable
}

Code:
if(preg_match('/[^a-z_\-0-9]/i', $string))
{
echo "not valid string";
}
Explanation:
[] => character class definition
^ => negate the class
a-z => chars from 'a' to 'z'
_ => underscore
- => hyphen '-' (You need to escape it)
0-9 => numbers (from zero to nine)
The 'i' modifier at the end of the regex makes it case-insensitive. Without it, you would need to add the uppercase characters to the class yourself by including A-Z.

if(!preg_match('/^[\w-]+$/', $string1)) {
echo "String 1 not acceptable";
// String2 acceptable
}

Here is one equivalent of the accepted answer for the UTF-8 world.
if (!preg_match('/^[\p{L}\p{N}_-]+$/u', $string)){
//Disallowed Character In $string
}
Explanation:
[] => character class definition
\p{L} => matches any kind of letter character from any language
\p{N} => matches any kind of numeric character
_- => matches underscore and hyphen
+ => quantifier: matches one or more times (greedy)
/u => Unicode modifier. Pattern and subject strings are treated as UTF-8. Also
causes escape sequences to match unicode characters
Note that if the hyphen is the last character in the class definition, it does not need to be escaped. If the hyphen appears elsewhere in the class definition, it needs to be escaped, as it will otherwise be seen as a range operator rather than a literal hyphen.
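Both variants can be combined in one small validator. This is only a sketch: the function name isValidIdentifier and the $unicode switch are made up for the example. It uses \z instead of $ because in PCRE a plain $ also matches just before a trailing newline.

```php
<?php
// Hypothetical helper: true when $s contains only letters, digits, _ and -.
function isValidIdentifier(string $s, bool $unicode = false): bool
{
    // \z anchors at the very end of the string, so "abc\n" is rejected.
    $pattern = $unicode
        ? '/^[\p{L}\p{N}_-]+\z/u'  // letters and digits from any script
        : '/^[A-Za-z0-9_-]+\z/';   // ASCII only
    return preg_match($pattern, $s) === 1;
}

var_dump(isValidIdentifier('this_is-a-string'));  // bool(true)
var_dump(isValidIdentifier('This is a string*')); // bool(false)
var_dump(isValidIdentifier('héllo_1', true));     // bool(true)
```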

\w\- is probably the best but here just another alternative
Use [:alnum:]
if(!preg_match("/[^[:alnum:]\-_]/",$str)) echo "valid";
demo1 | demo2

Why use regex? PHP has built-in functions that can do this:
<?php
$valid_symbols = array('-', '_');
$string1 = "This is a string*";
$string2 = "this_is-a-string";
if(preg_match('/\s/',$string1) || !ctype_alnum(str_replace($valid_symbols, '', $string1))) {
echo "String 1 not acceptable";
}
?>
preg_match('/\s/',$string1) will check for whitespace
!ctype_alnum(str_replace($valid_symbols, '', $string1)) will check for valid_symbols
