Repeat pattern using preg_match

I want to validate the strings below: data between backticks may repeat an unlimited number of times, as long as each item is followed by a comma; if it is not followed by a comma, it must be followed by a ")". Whitespace is allowed only outside the backticks, not inside them.
I am not experienced with regex, so I don't know how to allow a repeated pattern. Below is my pattern so far.
Thanks
UPDATED
// first 3 lines should match
$lines[] = "(`a-z0-9_-`,`a-z0-9_-`,`a-z0-9_-`,`a-z0-9_-`)";
$lines[] = "( `a-z0-9_-`, `a-z0-9_-` ,`a-z0-9_-` , `a-z0-9_-` )";
$lines[] = "(`a-z0-9_-`,
`a-z0-9_-`
,`a-z0-9_-` ,`a-z0-9_-`)";
// these lines below should not match
$lines[] = "(`a-z0-9_-``a-z0-9_-`,`a-z0-9_-`,`a-z0-9_-`)";
$lines[] = "(`a-z0-9_-``a-z0-9_-`,`a-z0-9_-`.`a-z0-9_-`";
$pattern = '/~^\(\s*(?:[a-z0-9_-]+\s*,?\s*)+\)$~/';
$result = array();
foreach($lines as $key => $line)
{
if (preg_match($pattern, $line))
{
$result[$key] = 'Found match.';
}
else
{
$result[$key] = 'Not found a match.';
}
}
print("<pre>" . print_r($result, true). "</pre>");

You're very close. It looks like you want this:
$pattern = "~^\(\s*`[a-z0-9_-]+`\s*(?:,\s*`[a-z0-9_-]+`\s*)*\)$~";
The two problems with your regex were:
You had two sets of delimiters (slashes and tildes) - pick just one and stick with it. My personal preference is parentheses because then you don't have to escape anything "just because delimiters", but also it helps me remember that the entire match is the first entry in the match array.
By making the comma optional, you were allowing things you didn't want. The solution does involve repeating yourself a little, but it is more accurate.
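As a quick check of the pattern above, here is a minimal sketch that runs it against a few shortened sample strings. The function name `isValidBacktickList` and the sample values are made up for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function isValidBacktickList(string $line): bool
{
    // "(" item ("," item)* ")" where each item is a backticked run of
    // [a-z0-9_-], with optional whitespace (including newlines) around it.
    $pattern = '~^\(\s*`[a-z0-9_-]+`\s*(?:,\s*`[a-z0-9_-]+`\s*)*\)$~';
    return preg_match($pattern, $line) === 1;
}

// Made-up sample lines, mirroring the shapes in the question.
$samples = [
    "(`a`,`b_1`,`c-2`)",      // valid
    "( `a`, `b_1` ,`c-2` )",  // valid: whitespace outside the backticks
    "(`a`,\n`b_1`\n,`c-2`)",  // valid: newlines count as whitespace
    "(`a``b_1`,`c-2`)",       // invalid: missing comma between items
    "(`a`,`b_1`.`c-2`",       // invalid: "." separator and no closing ")"
];
foreach ($samples as $sample) {
    echo isValidBacktickList($sample) ? "match\n" : "no match\n";
}
```

Because the comma is required inside the repeated group, two adjacent backticked items can no longer slip through.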

Well you weren't very clear about the matching rules for the data between the brackets, and you didn't really specify if you wanted to capture anything so...I took a best guess based on context of your code, hopefully this will suit your needs.
edit: fixed the code block so it shows the backticks in the pattern, and also changed the delimiter from ~ to / since OP was confused about that
$pattern = '/^\((\s*`[a-z0-9_-]+`\s*[,)])+$/';

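Here is a generic repeat pattern (note that [^...] is a character class, so it excludes single characters rather than a whole string):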
preg_match_all("/start_string([^repeat_string].*?)end_string/si", $input, $output);
var_dump($output);

Related

filtering words from text with exploits

I have a filter that filters bad words like 'ass', 'fuck', etc. Now I am trying to handle exploits like "f*ck" and "sh/t".
One thing I could do is match each word against a dictionary of bad words that includes such exploits. But this is pretty static and not a good approach.
Another thing I can do is use the Levenshtein distance: words with a Levenshtein distance of 1 should be blocked. But this approach is also prone to false positives.
if(!ctype_alpha($text)&& levenshtein('shit', $text)===1)
{
//match
}
I am looking for some way of using regex. Maybe I can combine the Levenshtein distance with regex, but I could not figure it out.
Any suggestion is highly appreciated.

Like stated in the comments, it is hard to get this right. This snippet, far from perfect, will check for matches where letters are substituted for the same number of other characters.
It may give you a general idea of how you could solve this, although much more logic is needed if you want to make it smarter. This filter, for instance will not filter 'fukk', 'f ck', 'f**ck', 'fck', '.fuck' (with leading dot) or 'fück', while it does probably filter out '++++' to replace it with 'beep'. But it also filters 'f*ck', 'f**k', 'f*cking' and 'sh1t', so it could do worse. :)
An easy way to make it better, is to split the string in a smarter way, so punctuation marks aren't glued to the word they are adjacent to. Another improvement could be to remove all non-alphabetic characters from each word, and check if the remaining letters are in the same order in a word. That way, 'f\/ck' would also match 'fuck'. Anyway, let your imagination run wild, but be careful for false positives. And trust me that 'they' will always find a way to express themselves in a way that bypasses your filter.
<?php
$badwords = array('shit', 'fuck');
$text = 'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)';
$words = explode(' ', $text);
// Loop through all words.
foreach ($words as $word)
{
$naughty = false;
// Match each bad word against each word.
foreach ($badwords as $badword)
{
// If the word is shorter than the bad word, it's okay.
// It may be bigger. I've done this mainly, because in the example given,
// 'f*ck,' will contain the trailing comma. This could be easily solved by
// splitting the string a bit smarter. But the added benefit, is that it also
// matches derivatives, like 'f*cking' or 'f*cker', although that could also
// result in more false positives.
if (strlen($word) >= strlen($badword))
{
$wordOk = false;
// Check each character in the string.
for ($i = 0; $i < strlen($badword); $i++)
{
// If the letters don't match, and the letter is an actual
// letter, this is not a bad word.
if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i]))
{
$wordOk = true;
break;
}
}
// If the word is not okay, break the loop.
if (!$wordOk)
{
$naughty = true;
break;
}
}
}
// Echo the censored word.
echo $naughty ? 'beep ' : ($word . ' ');
}
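The improvement suggested above (strip the non-alphabetic characters, then check that the remaining letters appear in the same order in the bad word) could be sketched roughly as follows. The function name `looksLikeBadWord` and the two extra guards (same first letter, at most one letter missing) are assumptions added here to limit false positives; they are not part of the original answer, and this rule will still misfire on some innocent words.

```php
<?php
// Hypothetical helper: does $word look like an obfuscated $badword?
function looksLikeBadWord(string $word, string $badword): bool
{
    // Keep only the letters, lower-cased: 'f\/ck' becomes 'fck'.
    $letters = strtolower(preg_replace('/[^a-z]/i', '', $word));

    // Assumed guards: same first letter, and at most one letter missing.
    if ($letters === ''
        || $letters[0] !== $badword[0]
        || strlen($letters) < strlen($badword) - 1) {
        return false;
    }

    // Subsequence check: every remaining letter must occur in $badword,
    // in order, after the position of the previous one.
    $pos = 0;
    foreach (str_split($letters) as $ch) {
        $found = strpos($badword, $ch, $pos);
        if ($found === false) {
            return false;
        }
        $pos = $found + 1;
    }
    return true;
}

var_dump(looksLikeBadWord('f\/ck', 'fuck')); // substituted characters
var_dump(looksLikeBadWord('shot', 'shit'));  // an ordinary word
```

Unlike the position-by-position loop above, this does not care how many characters were substituted for a letter, but it only catches the base word, not derivatives.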

Preg Match circumflex with ^ in php

I know I am going to get a lot of asinine comments, but I cannot figure this out no matter what I do. I have a function here
$filter = mysql_query("SELECT * FROM `filter`");
$fil = mysql_fetch_array($filter);
$bad = $fil['filter'];
$bword = explode(",", $bad);
function wordfilter($output,$bword){
$badWords = $bword;
$matchFound = preg_match_all("/(" . implode($badWords,"|") . ")/i",$output,$matches);
if ($matchFound) {
$words = array_unique($matches[0]);
foreach($words as $word) {
$output = preg_replace("/$word/","*****",$output);
}
}
return $output;
}
I know bad word filters are frowned upon, but my client has requested this.
Now I have a list in the database; here are a few entries.
^ass$,^asses$,^asshopper,^cock$,^coon,^cracker$,^cum$,^dick$,^fap$,^heeb$,^hell$,^homo$,^humping,^jap$,^mick$,^muff$,^paki$,^phap$,^poon$,^spic$,^tard$,^tit$,^tits$,^twat$,^vag$,ass-hat,ass-pirate,assbag
As you can see, I am using a circumflex and dollar signs for certain words.
The problem I am having is with the first three words beginning with ass: they block the word even if I write something like glasses or grasshoppers, but everything past the first three works fine. I have tried adding three entries before these in case that was the problem, but unfortunately it isn't.
Is there something wrong with how I have this written?

Extending from comment:
Try to use \b to detect words:
$matchFound = preg_match_all('/\b(' . implode('|', $badWords) . ')\b/i', $output, $matches);
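A minimal sketch of how the \b suggestion might look as a complete filter. It assumes the database list holds plain words without the ^ and $ anchors (the word boundaries take over that job), and it escapes each entry with preg_quote so that characters such as "-" are matched literally. Sorting the longest entries first is an added assumption, so that "ass-hat" is tried before "ass".

```php
<?php
// Sketch of a word filter using \b; assumes $badWords are plain words.
function wordfilter(string $output, array $badWords): string
{
    if ($badWords === []) {
        return $output; // nothing to filter
    }
    // Try longer entries first so "ass-hat" wins over "ass".
    usort($badWords, fn($a, $b) => strlen($b) <=> strlen($a));
    // Escape regex metacharacters in each entry.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $badWords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    return preg_replace($pattern, '*****', $output);
}

echo wordfilter('My glasses and an ass-hat', ['ass', 'asses', 'ass-hat']);
```

Replacing in a single preg_replace call also avoids the second pass over $matches, where an unanchored "/$word/" could hit the same letters inside longer words.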

If a variable only contains one word

I would like to know how I could find out in PHP if a variable only contains 1 word. It should be able to recognise: "foo" "1326" ";394aa", etc.
It would be something like this:
$txt = "oneword";
if($txt == 1 word){ do.this; }else{ do.that; }
Thanks.

I'm assuming a word is defined as any string delimited by a single space character:
$txt = "multiple words";
if(strpos(trim($txt), ' ') !== false)
{
// multiple words
}
else
{
// one word
}

What defines one word? Are spaces allowed (perhaps for names)? Are hyphens allowed? Punctuation? Your question is not very clearly defined.
Going under the assumption that you just want to determine whether or not your value contains spaces, try using regular expressions:
http://php.net/manual/en/function.preg-match.php
<?php
$txt = "oneword";
if (preg_match("/ /", $txt)) {
echo "Multiple words.";
} else {
echo "One word.";
}
?>
Edit
The benefit to using regular expressions is that if you can become proficient in using them, they will solve a lot of your problems and make changing requirements in the future a lot easier. I would strongly recommend using regular expressions over a simple check for the position of a space, both for the complexity of the problem today (as again, perhaps spaces aren't the only way to delimit words in your requirements), as well as for the flexibility of changing requirements in the future.
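To illustrate that flexibility, here is a small sketch that treats any whitespace (spaces, tabs, newlines) as a word delimiter rather than only the space character. The function name `isOneWord` is made up for this example, and treating an empty string as "not one word" is an assumption.

```php
<?php
// Hypothetical helper: true if $txt holds exactly one whitespace-free token.
function isOneWord(string $txt): bool
{
    $txt = trim($txt); // ignore leading and trailing whitespace
    // \s matches spaces, tabs and newlines; an empty string is not a word.
    return $txt !== '' && preg_match('/\s/', $txt) === 0;
}

foreach (['foo', '1326', ';394aa', 'two words', "tab\tseparated"] as $txt) {
    echo var_export($txt, true), ' => ', isOneWord($txt) ? 'one word' : 'not one word', "\n";
}
```

If the definition of a word changes later (for example, hyphens should also split words), only the character class in the pattern needs to change.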

Utilize the strpos function included within PHP.
Returns the position as an integer. If needle is not found, strpos()
will return boolean FALSE.

Besides strpos, an alternative would be explode and count:
$txt = trim("oneword secondword");
$words = explode( " ", $txt); // $words[0] = "oneword", $words[1] = "secondword"
if (count($words) == 1) {
    // do this for one word
} else {
    // do that for more than one word, assuming at least one word was entered
}

php, preg_match, regex, extract specific text

I have a very big .txt file with our clients' orders and I need to move it into a MySQL database. However, I don't know what kind of regex to use, as the records all look very similar.
-----------------------
4046904
KKKKKKKKKKK
Laura Meyer
MassMutual Life Insurance
153 Vadnais Street
Chicopee, MA 01020
US
413-744-5452
lmeyer#massmutual.co...
KKKKKKKKKKK
373074210772222 02/12 6213 NA
-----------------------
4046907
KKKKKKKKKKK
Venkat Talladivedula
6105 West 68th Street
Tulsa, OK 74131
US
9184472611
venkat.talladivedula...
KKKKKKKKKKK
373022121440000 06/11 9344 NA
-----------------------
I tried something but I couldn't even extract the name ... here is a sample of my effort with no success
$htmlContent = file_get_contents("orders.txt");
//print_r($htmlContent);
$pattern = "/KKKKKKKKKKK(.*)\n/s";
preg_match_all($pattern, $htmlContent, $matches);
print_r($matches);
$name = $matches[1][0];
echo $name;

You may want to avoid regexes for something like this. Since the data is clearly organized by line, you could repeatedly read lines with fgets() and parse the data that way.

You could read this file with regex, but it may be quite complicated to create a regex that could read all fields.
I recommend that you read this file line by line, and parse each one, detecting which kind of data it contains.
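A rough sketch of that line-oriented approach, assuming the layout shown in the question: records are separated by a line of dashes, and within each record the first non-empty line is the order id and the line after the first KKKKKKKKKKK marker is the name. The function name `parseOrders` and the array keys are made up for illustration; the real file may need different positions.

```php
<?php
// Hypothetical helper: split the raw file text into records, then lines.
function parseOrders(string $raw): array
{
    $orders = [];
    // Records are separated by lines consisting only of dashes.
    foreach (preg_split('/^-{5,}\s*$/m', $raw) as $block) {
        // \R matches any newline style; drop empty lines and stray spaces.
        $lines = array_values(array_filter(
            array_map('trim', preg_split('/\R/', $block)),
            'strlen'
        ));
        if (count($lines) < 3) {
            continue; // empty chunk before the first or after the last separator
        }
        // Assumed layout: [0] order id, [1] marker line, [2] name, ...
        $orders[] = ['id' => $lines[0], 'name' => $lines[2]];
    }
    return $orders;
}

// Made-up sample in the same shape as the question's data.
$sample = "-----------------------\n4046904\nKKKKKKKKKKK\nLaura Meyer\n153 Vadnais Street\n"
        . "-----------------------\n4046907\nKKKKKKKKKKK\nVenkat Talladivedula\n"
        . "-----------------------\n";
print_r(parseOrders($sample));
```

Dropping empty lines first makes the positions independent of whether the file uses \n or \r\n and of any blank lines between fields.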

As you know exactly where your data is (i.e. which line it's on), why not just get it that way?
i.e. something like
$htmlContent = file_get_contents("orders.txt");
$arrayofclients = explode("-----------------------",$htmlContent);
$newlinesep = "\r\n";
for($i = 0; $i < count($arrayofclients); $i++)
{
$temp = explode($newlinesep, $arrayofclients[$i]);
$idnum = $temp[0];
$name = $temp[4];
$houseandstreet = $temp[6];
//etc
}
or simply read the file line by line using fgets() - something like:
$i = 0;$j = 0;
$file = fopen("orders.txt","r");
$clients = [];
while (($line = fgets($file)) !== false)
{
if($line !== false)
{
$i++;
switch($i)
{
case 2:
$clients[$j]["idnum"] = $line;
break;
case 6:
$clients[$j]["name"] = $line;
break;
//add more cases here for each line up to:
case 18:
$j++;
$i = 0;
break;
//there are 18 lines per client if i counted right, so increment $j and reset $i.
}
}
}
fclose($file);
You could use regexes, but they are a bit awkward for this situation.
Nico

For the record, here is the regex that will capture the names for you. (Granted speed very well may be an issue.)
(?<=K{10}\s{2})\K[^\r\n]++(?!\s{2}-)
Explanation:
(?<=K{10}\s{2}) #Positive lookbehind for KKKKKKKKKK then 2 return/newline characters
\K[^\r\n]++ #Reset the match start, then possessively match 1 or more non-return/newline characters
(?!\s{2}-) #Negative lookahead for 2 return/newline characters then a dash
Here is a Regex Demo.
You will notice that my regex pattern changes slightly between the Regex Demo and my PHP Demo. Slight tweaking depending on environment may be required to match the return / newline characters.
Here is the php implementation (Demo):
if(preg_match_all("/(?<=K{10}\s{2})\K[^\r\n]++(?!\s{2}-)/",$htmlContent,$matches)){
var_export($matches[0]);
}else{
echo "no matches";
}
By using \K in my pattern I avoid actually having to capture with parentheses. This cuts down array size by 50% and is a useful trick for many projects. The \K basically says "start the fullstring match from this point", so the matches go in the first subarray (fullstrings, key=0) of $matches instead of generating a fullstring match in 0 and the capture in 1.
Output:
array (
0 => 'Laura Meyer',
1 => 'Venkat Talladivedula',
)
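To see the effect of \K on the shape of $matches in isolation, here is a tiny comparison on a made-up input string (the `id=` format is purely illustrative and has nothing to do with the order file).

```php
<?php
$input = 'id=12, id=34'; // made-up sample input

// With \K: everything matched before \K is dropped from the reported match,
// so the digits land directly in $withK[0].
preg_match_all('/id=\K\d+/', $input, $withK);

// With a capture group: full matches go in [0], the captures in [1].
preg_match_all('/id=(\d+)/', $input, $withGroup);

var_export($withK);
echo "\n";
var_export($withGroup);
```

The first call produces a single subarray, while the second produces two, which is the halving of the array size described above.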

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the last option, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
This works for part 1 of the task, but before that I think I should do the option split. My best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [ or ]
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace('/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/', '$1', $oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:link test2:si[lv]er
test3:out1in[si]deout2 test4:this|not
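The "keep looping while there are matches" step could be sketched like this, using the replacement count that preg_replace reports through its fifth argument. The function name `stripOptionBrackets` is hypothetical; the pattern is the one given above, written in a single-quoted PHP string so the backslashes do not need doubling.

```php
<?php
// Hypothetical helper: repeat the replacement until nothing matches anymore.
function stripOptionBrackets(string $text): string
{
    // Pattern from the answer above; group 1 ends up holding the last option.
    $pattern = '/\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]/';
    do {
        // $count receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

echo stripOptionBrackets(
    'test1:[[link]] test2:[[gold|si[lv]er]] test3:[[out1[[in[si]de]]out2]] test4:this|not'
);
```

Each pass resolves the innermost [[...]] groups, so nested input needs one pass per nesting level plus a final pass that finds nothing.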

Why try to do it all in one go. Remove the [[]] first and then deal with options, do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match);
$match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only verbatim (non-escaped) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}
