regex case insensitive and with/without whitespace

I am not very knowledgeable about regex patterns, and even after reading all the wikis and references I could find, I am having problems altering a pattern for word detection and highlighting.
I found a function in another Stack Overflow answer that did everything I needed, but I have now discovered that it misses a few things.
The function is:
function ParserGlossario($texto, $termos) {
    $padrao = '\1\2\3';
    if (empty($termos)) {
        return $texto;
    }
    if (is_array($termos)) {
        $substituir = array();
        $com = array();
        foreach ($termos as $key => $value) {
            $key = $value;
            $value = $padrao;
            // $key = '([\s])(' . $key . ')([\s\.\,\!\?\<])';
            $key = '([\s])(' . $key . ')([\s\.\,\!\?\<])';
            $substituir[] = '|' . $key . '|ix';
            $com[] = empty($value) ? $padrao : $value;
        }
        return preg_replace($substituir, $com, $texto);
    } else {
        $termos = '([\s])(' . $termos . ')([\s])';
        return preg_replace('|'.$termos.'|i', $padrao, $texto);
    }
}
Some words are not being highlighted (the ones marked with red arrows):
And I don't know if it helps, but here is the array of "terms" that is used to search the text:
EDIT. The string being searched is just plain text:
Abaxial Xxxxx acaule Acaule xxxxxx xxx; xxxxx xxx Abaxial esporos.
abaxial
EDIT. Added PHP code fiddle
http://phpfiddle.org/main/code/079ad24318f554d9f2ba
Any help? I really don't know much about regexes...

Try
$key = '(^|\b)(' . $key . ')\b';
instead of
$key = '([\s])(' . $key . ')([\s\.\,\!\?\<])';
That should help. Your matches will still be in the second group, but there is no longer a third group, and I think the first should not be touched, so I believe
$padrao = '\1\2\3';
is better written as
$padrao = '$2';
I also forgot (sorry): because the new pattern contains | as the alternation operator, | can no longer serve as the delimiter, so change
$substituir[] = '|' . $key . '|ix';
to
$substituir[] = '#' . $key . '#ix';
I would also use a string
$com = empty($value) ? $padrao : $value;
instead of an array, since an array is not needed in this case.

Let us look together at the value of $key for, say, the array element acaule.
([\s])(acaule)([\s\.\,\!\?\<])
There are 3 capturing groups, defined by the 3 pairs of (...).
The first capturing group matches any whitespace character. If there is no whitespace character before the word, as for Abaxial at the beginning of the string, the word is ignored.
Putting \s into a character class, i.e. within [...], is not really needed here because \s is itself a character class. ([\s]) and (\s) are equivalent.
The second capturing group matches just the word from the array.
The third capturing group matches
either any whitespace character,
or a period,
or a comma,
or an exclamation mark,
or a question mark, i.e. the standard punctuation marks,
or a left angle bracket (from an HTML or XML tag).
A semicolon or colon is not matched, and no other non-word character counts for a positive match either.
If none of those characters follows the word, as for abaxial at the end of the string, the search is negative.
By the way, ([\s.,!?<]) is equivalent to ([\s\.\,\!\?\<]): within a character class only \ and ] (always), ^ (when it comes first) and - (depending on position) must be escaped with a backslash to be interpreted as literal characters. [ should also be escaped with a backslash within [...] for easier reading.
So it is clear why Abaxial at the beginning of the string and abaxial at the end of the string are not matched.
But why is Acaule not matched?
To the left of this word is acaule, which has a space on its left and a space on its right, as required for a positive match. The space to the right of acaule was consumed by that match, so no whitespace character remains to the left of Acaule.
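The consumed-whitespace effect described above can be reproduced with a short sketch. The shortened sample text is an assumption for illustration, and the lookaround variant is only one possible alternative, not something taken from the question.

```php
<?php
// Shortened sample text (an assumption for illustration).
$text = 'Xxxxx acaule Acaule xxxxxx';

// Original style: the neighbouring characters are part of the match,
// so the space after "acaule" is consumed and cannot precede "Acaule".
$consuming = '#(\s)(acaule)([\s.,!?<])#i';

// Lookarounds inspect the neighbours without consuming them.
$lookaround = '#(?<=\s)(acaule)(?=[\s.,!?<])#i';

$consumingCount  = preg_match_all($consuming, $text, $m1);
$lookaroundCount = preg_match_all($lookaround, $text, $m2);

echo $consumingCount, "\n";  // 1
echo $lookaroundCount, "\n"; // 2
```

Because the lookarounds match no characters themselves, only the word ends up in the match, which may or may not suit the replacement string being used.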
There is also \b, which means a word boundary and does not consume any character. It can be used together with \W*? in place of ([\s]) and ([\s\.\,\!\?\<]) to avoid matching substrings within a word.
Something like this would be possible:
$key = '(\W*?)(\b' . $key . '\b)(\W*?)';
\W*? means any non-word character, 0 or more times, non-greedy.
\W? means any non-word character, 0 or 1 times, and could also be used in the first and third capturing groups if that is better for the replacement.
But the right search string depends on what you want as the result of the replacement.
I don't have a PHP interpreter installed, so I can't try out what your PHP code does on replace, or check what you would like to see after the replacement for the provided example string.
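As a rough sketch of the word-boundary approach, the following wraps every term in a tag. The function name highlightTerms and the <b> tag are assumptions for illustration; the question does not show which markup the original replacement used.

```php
<?php
// Hypothetical helper: wraps each glossary term in <b>...</b>.
function highlightTerms(string $text, array $terms): string
{
    if ($terms === []) {
        return $text;
    }
    // Escape each term so regex metacharacters in it are matched literally.
    $quoted = array_map(function ($term) {
        return preg_quote($term, '#');
    }, $terms);
    // \b matches at word boundaries without consuming characters,
    // so adjacent terms and terms at the string edges are all found.
    $pattern = '#\b(' . implode('|', $quoted) . ')\b#iu';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

$text = 'Abaxial Xxxxx acaule Acaule xxxxxx xxx; xxxxx xxx Abaxial esporos. abaxial';
$highlighted = highlightTerms($text, ['abaxial', 'acaule']);
echo $highlighted, "\n";
```

Combining all terms into one alternation means the text is scanned once, rather than once per term as with an array of patterns.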

Related

PHP: How to extract a substring from a specified index until the next whitespace or end of line

I have an input string:
$subject = "This punctuation! And this one. Does n't space that one.";
I also have an array containing exceptions to the replacement I wish to perform, currently with one member:
$exceptions = array(
    0 => "n't"
);
The reason for the complicated solution I would like to achieve is because this array will be extended in future and could potentially include hundreds of members.
I would like to insert whitespace at word boundaries (duplicate whitespace will be removed later). Certain boundaries should be ignored, though. For example, the exclamation mark and full stops in the above sentence should be surrounded with whitespace, but the apostrophe should not. Once duplicate whitespaces are removed from the final result with trim(preg_replace('/\s+/', ' ', $subject));, it should look like this:
"This punctuation ! And this one . Does n't space that one ."
I am working on a solution as follows:
Use preg_match('/\b/', $subject, $offsets, PREG_OFFSET_CAPTURE); to gather an array of indexes where whitespace may be inserted.
Iterate over the $offsets array.
Split $subject from the whitespace before the current offset until the next whitespace or end of line.
Check if the result of the split is contained within the $exceptions array.
If the result of the split is not contained within the $exceptions array, insert a whitespace character at the current offset.
So far I have the following code:
$subject = "This punctuation! And this one. Does n't space that one.";
$pattern = '/\b/';
preg_match($pattern, $subject, $offsets, PREG_OFFSET_CAPTURE);
if (count($offsets)) {
    $indexes = array();
    for ($i = 0; $i < count($offsets); $i++) {
        $offsets[$i];
        $substring = '?';
        // Replace $substring with substring from after whitespace prior to $offsets[$i] until next whitespace...
        if (!array_search($substring, $exceptions)) {
            $indexes[] = $offsets[$i];
        }
    }
    // Insert whitespace character at each offset stored in $indexes...
}
I can't find an appropriate way to create the $substring variable in order to complete the above example.

$res = preg_replace("/(?:n't|ALL EXCEPTIONS PIPE SEPARATED)(*SKIP)(*F)|(?!^)(?<!\h)\b(?!\h)/", " ", $subject);
echo $res;
Output:
This punctuation ! And this one . Doesn't space that one .
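The pattern above relies on two PCRE backtracking verbs: an exception that matches is discarded by (*F), and (*SKIP) stops the engine from retrying at any position inside it. The remaining alternative matches a word boundary that is not at the start of the string and has no horizontal whitespace (\h) on either side. A minimal sketch that builds the alternation from the $exceptions array, assuming the array is non-empty and using preg_quote to keep each entry literal:

```php
<?php
$subject = "This punctuation! And this one. Doesn't space that one.";
$exceptions = ["n't"]; // assumed to be non-empty

// Quote every exception so regex metacharacters are matched literally.
$alternatives = implode('|', array_map(function ($exception) {
    return preg_quote($exception, '/');
}, $exceptions));

// Exceptions are matched first and then skipped; what remains is a
// word boundary with no horizontal whitespace directly around it.
$pattern = '/(?:' . $alternatives . ')(*SKIP)(*F)|(?!^)(?<!\h)\b(?!\h)/';

$res = preg_replace($pattern, ' ', $subject);
echo $res, "\n"; // This punctuation ! And this one . Doesn't space that one .
```

With hundreds of exceptions the alternation grows long, so it may be worth building the pattern once and reusing it.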

One "easy" (but not necessarily fast, depending on how many exceptions you have) solution would be to first replace all the exceptions in the string with something unique that doesn't contain any punctuation, then perform your replacements, then convert back the unique replacement strings into their original versions.
Here's an example using md5 (but could be lots of other things):
$subject = "This punctuation! And this one. Doesn't space that one.";
$exceptions = ["n't"];
// Start from the subject and accumulate, so that every exception is protected
$result = $subject;
foreach ($exceptions as $exception) {
    $result = str_replace($exception, md5($exception), $result);
}
// Put a space before every character that is not a letter, digit or whitespace
$result = preg_replace('/[^a-z0-9\s]/i', ' \0', $result);
// Restore the original exceptions
foreach ($exceptions as $exception) {
    $result = str_replace(md5($exception), $exception, $result);
}
echo $result; // This punctuation ! And this one . Doesn't space that one .

regex \p{L} problems

I'm using this for my validation:
$space = "[:blank:]";
$number = "0-9";
$letter = "\p{L}";
$specialchar = "-_.:!='\/&*%+,()";
...
$default = "/^[".$space.$number.$letter.$specialchar."]*$/";
if (!preg_match($all, $input)){
    $error = true;
}
The problem I have is this:
Everything works except "ü". "Ü" is accepted but "ü" is not, and I don't know why.
\p{L} should accept all letters, including special letters, so I don't understand why it's not working :(
Does anyone have an idea what I can do?
The data I am trying to validate is a POST value from a registration form.
P.S. If I use \p{L}ü I get an error like this:
Compilation failed: range out of order in character class at offset 23 in...

Escape the hyphen:
$specialchar = "\-_.:!='\/&*%+,()";
// the hyphen at the start is now escaped: \-
Also add the /u modifier for Unicode matching:
$default = "/^[".$space.$number.$letter.$specialchar."]*$/u";
// the u modifier follows the closing delimiter
Test:
$space = "[:blank:]";
$number = "0-9";
$letter = "\p{L}";
$specialchar = "\-_.:!='\/&*%+,()";
$default = "/^[".$space.$number.$letter.$specialchar."]*$/u";
// Note: $specialchar is the variable defined above; your example used a wrong variable name.
$input = 'üÜ is but ';
if (!preg_match($default, $input)){
    echo "KO\n";
} else {
    echo "OK\n";
}
Output:
OK

The problem is the position of the hyphen. Within a character class you can place a hyphen as the first or last character and it is treated literally. If you place it anywhere else, you need to escape it in order to add it to your class.
$specialchar = "_.:!='\/&*%+,()-";
You also need to add the u (Unicode) modifier to your regular expression. This modifier turns on additional PCRE functionality, and the pattern and subject strings are treated as UTF-8. In addition, you have the wrong variable in the pattern.
$default = "/^[".$space.$number.$letter.$specialchar."]*$/u";

The - is a special character in a character class, used to specify a range of characters. When you construct the regex by string concatenation, it is recommended that you escape it as \-.
Beyond the fix above, there is another problem in "\-_.:!='\/&*%+,()". Do you want to include both \ and /, or only one of them?
If you want to include both, you should specify it as "\-_.:!='\\\\\/&*%+,()".
If you don't want to escape /, you can change the delimiter / in the construction of $default to something not used in your regex, for example ~. In that case, the list of special characters has one less \: "\-_.:!='\\\\/&*%+,()".
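Putting the suggestions from these answers together, a minimal sketch might look like the following. It assumes the source file and the input are UTF-8, uses ~ as the delimiter, places the hyphen last, and leaves the backslash out of the allowed set; the helper name isValidInput is made up for illustration.

```php
<?php
$space  = '[:blank:]';
$number = '0-9';
$letter = '\p{L}';
// Hyphen placed last so it is literal; "/" needs no escaping with "~" as delimiter.
$specialchar = "_.:!='/&*%+,()-";

// The u modifier makes PCRE treat pattern and subject as UTF-8,
// so \p{L} sees "ü" as one letter rather than two separate bytes.
$pattern = '~^[' . $space . $number . $letter . $specialchar . ']*$~u';

// Hypothetical helper around preg_match().
function isValidInput(string $input, string $pattern): bool
{
    return preg_match($pattern, $input) === 1;
}

var_dump(isValidInput('Müller-Lüdenscheidt 42', $pattern)); // bool(true)
var_dump(isValidInput('üÜ is but ', $pattern));             // bool(true)
var_dump(isValidInput('no <tags>', $pattern));              // bool(false)
```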

PHP Regular Expression to Match Function Name and Parameters with string like Needle(needle|needle)

I am filtering database results with a query string that looks like this:
attribute=operator(value|optional value)
I'll use
$_GET['attribute'];
to get the value.
I believe the right approach is using regex to get matches on the rest.
The preferred output would be
print_r($matches);
array(
1 => operator
2 => value
3 => optional value
)
The operator will always be one word and consist of letters: like(), between(), in().
The values can be many different things, including letters, numbers, spaces, commas, quotation marks, etc.
I was asked where my code was failing and didn't include much code because of how poorly it worked. Based on the accepted answer, I was able to whip up a regex that almost works.
EDIT 1
$pattern = "^([^\|(]+)\(([^\|()]+)(\|*)([^\|()]*)";
Edit 2
$pattern = "^([^\|(]+)\(([^\|()]+)(\|*)([^\|()]*)"; // I thought this would work.
Edit 3
$pattern = "^([^\|(]+)\(([^\|()]+)(\|+)?([^\|()]+)?"; // this does work!
Edit 4
$pattern = "^([^\|(]+)\(([^\|()]+)(?:\|)?([^\|()]+)?"; // this gets rid of the middle matching group.
The only remaining problem is when the 2nd optional parameter does not exist, there is still an empty $matches array.

This script, with the input "operator(value|optional value)", returns the array you expect:
<?php
$attribute = $_GET['attribute'];
$result = preg_match("/^([\w ]+)\(([\w ]+)\|([\w ]*)\)$/", $attribute, $matches);
print($matches[1] . "\n");
print($matches[2] . "\n");
print($matches[3] . "\n");
?>
This assumes that your values match the character class [\w ] (all word characters plus space), and that the | you specify is a literal |.
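For the case where the second value is missing, one possible sketch wraps the pipe and the second value in an optional non-capturing group. The function name parseFilter and the returned array keys are assumptions for illustration, and values are assumed not to contain |, ( or ).

```php
<?php
// Hypothetical parser for strings such as "between(1|10)" or "like(foo)".
function parseFilter(string $expr): ?array
{
    // 1: operator (letters only), 2: value, 3: optional second value.
    $pattern = '/^([a-z]+)\(([^|()]+)(?:\|([^|()]*))?\)$/i';
    if (preg_match($pattern, $expr, $m) !== 1) {
        return null;
    }
    return [
        'operator' => $m[1],
        'value'    => $m[2],
        // A trailing group that did not take part in the match is absent from $m.
        'optional' => $m[3] ?? null,
    ];
}

print_r(parseFilter('between(1|10)'));
print_r(parseFilter('like(foo bar)'));
```

Returning null for the missing part lets the caller distinguish "no second value" from an empty one such as in(a|).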

Is there a way by default to require ALL words in full text searching?

I'm trying to find a way to make it so when users do a search that by default ALL words are required.
This seemed easy enough at the beginning: just break the words up and add a + sign to the start of each word. However, when you start trying to implement the other operators, it gets complicated.
This is what I have so far..
function prepareKeywords($str) {
    // Remove any + signs since we add them ourselves
    // Also remove any operators we don't allow
    // We don't allow some operators as they would have no use in the search since we don't order our results by relevance
    $str = str_replace(array('+','~','<','>'), '', $str);
    // Collapse runs of whitespace into a single space
    $str = preg_replace('/\s{2,}/', ' ', $str);
    // Now break up words into parts, keeping double-quoted phrases together
    $ks = str_getcsv($str, ' ', '"');
    // Now prepend a + sign to the front of each word
    foreach ($ks as $key => $word) {
        // Only add the + sign if the word doesn't already start with an operator
        if ($word !== '' && !in_array($word[0], array('-','<','>','~'))) {
            $ks[$key] = '+' . $word;
        }
    }
    // Now put word back to string
    //$ks = implode(' ', $ks);
}
As you can see it only gets so far at the moment, respects quoted strings, but then I start thinking about not breaking up () as well, and then what if that contains nested double quotes or vice versa.... it starts getting very hairy.
So I'm trying to figure out if there is a way that I can do what I want without messing with the string and just making all words required by default UNLESS the user specifically negates them with a -.

Surely you can make use of preg_match_all() with \b in the pattern as the word boundary?
Your search terms can then be split with something like
preg_match_all('/\b(\w+)\b/', $str, $matches);
I might be on the wrong track as it's late here, but it might give you something to go on.
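One way to sketch the tokenising idea from this thread is to let a single regex pick out quoted phrases and bare words, then prefix everything that is not negated. This assumes MySQL-style boolean full-text operators, ignores parentheses entirely, and uses a made-up function name requireAllWords.

```php
<?php
// Hypothetical helper: makes every term required unless it starts with "-".
function requireAllWords(string $query): string
{
    // Drop operators that are added automatically or not supported.
    $query = str_replace(['+', '~', '<', '>'], '', $query);

    // A token is an optionally negated quoted phrase, or a run of non-space characters.
    preg_match_all('/-?"[^"]*"|\S+/', $query, $matches);

    $terms = [];
    foreach ($matches[0] as $token) {
        $terms[] = ($token[0] === '-') ? $token : '+' . $token;
    }
    return implode(' ', $terms);
}

echo requireAllWords('apple -banana "red wine"'), "\n"; // +apple -banana +"red wine"
```

Because the quotes stay part of each token, phrases survive the round trip back into a string, which is where the str_getcsv approach loses them.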

PHP - Regex for a string of special characters

Morning SO. I'm trying to determine whether or not a string contains a list of specific characters.
I know I should be using preg_match for this, but my regex knowledge is woeful and I have been unable to glean any information from other posts on this site, since most of them just want to limit strings to a-z, A-Z and 0-9. I do want some special characters to be allowed, for example ! # £ and others not in the string below.
Characters to be matched on: # $ % ^ & * ( ) + = - [ ] \ ' ; , . / { } | \ " : < > ? ~
private function containsIllegalChars($string)
{
return preg_match([REGEX_STRING_HERE], $string);
}
I originally wrote the matching in JavaScript, which just looped through each letter in the string and then looped through every character in another string until it found a match. Looking back, I can't believe I even attempted to use such an archaic method. With the advent of JSON (and a rewrite of the application!), I'm switching the match to PHP, to return an error message via JSON.
I was hoping a regex guru could assist with converting the above string to a regex string, but any feedback would be appreciated!

A regexp is not mandatory for a list of disallowed characters.
Have a look at strpbrk; it should do the job you need.
Here is a usage example:
$tests = array(
    "Hello I should be allowed",
    "Aw! I'm not allowed",
    "Geez [another] one",
    "=)",
    "<WH4T4NXSS474K>"
);
$illegal = "#$%^&*()+=-[]';,./{}|:<>?~";
foreach ($tests as $test) {
    echo $test;
    echo ' => ';
    echo (false === strpbrk($test, $illegal)) ? 'Allowed' : "Disallowed";
    echo PHP_EOL;
}
http://codepad.org/yaJJsOpT

return preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string);

$pattern = preg_quote('#$%^&*()+=-[]\';,./{}|\":<>?~', '#');
var_dump(preg_match("#[{$pattern}]#", 'hello world')); // int(0)
var_dump(preg_match("#[{$pattern}]#", 'he||o wor|d')); // int(1)
var_dump(preg_match("#[{$pattern}]#", '$uper duper')); // int(1)
You can likely cache the $pattern, depending on your implementation.
(Though looking outside of regular expressions, you're best off with strpbrk, as mentioned in another answer.)

I think what you're looking for can be greatly simplified by listing the characters that you want to allow, like so:
preg_match('/[^\w!#£]/u', $string)
Here's a quick breakdown of what's happening:
[^...] = any character not in the list
\w = letters, digits and the underscore
! # £ = the list of characters you would also like to allow
/u = treat the pattern and subject as UTF-8, so that the multibyte £ is handled as one character
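To compare the approaches in these answers, here is a small sketch of the blacklist check written both with strpbrk and with a character class built by preg_quote. The constant name ILLEGAL_CHARS and the second function name are assumptions for illustration; the character list is the one from the question.

```php
<?php
// Disallowed characters from the question. In a single-quoted PHP string
// only the backslash and the single quote need escaping.
const ILLEGAL_CHARS = '#$%^&*()+=-[]\\\';,./{}|":<>?~';

// Byte-wise scan; no regex involved.
function containsIllegalChars(string $string): bool
{
    return strpbrk($string, ILLEGAL_CHARS) !== false;
}

// Same check as a character class; preg_quote escapes the characters
// that are special in a pattern, including the "/" delimiter.
function containsIllegalCharsRegex(string $string): bool
{
    $pattern = '/[' . preg_quote(ILLEGAL_CHARS, '/') . ']/';
    return preg_match($pattern, $string) === 1;
}

foreach (['Hello world!', 'a<b', "it's", 'back\\slash', '5 £ only'] as $sample) {
    echo $sample, ' => ', containsIllegalChars($sample) ? 'Disallowed' : 'Allowed', PHP_EOL;
}
```

Both functions return a strict boolean, which avoids confusing the integer result of preg_match with true or false.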
