How to use preg_match()/preg_replace() on specific lengths and patterns - php

I'm having a lot of trouble understanding this; could someone explain it?
I'm extremely concerned about sanitizing data on my website, and I'd like to go the extra mile and strip everything that's not supposed to be there. Here's an example: I end up handling a lot of hex values, which contain only hyphens, numbers and the letters A through F, in a string like this:
7d43637d-780c-4703-8467-13525d590
How would I go about writing a preg_match/replace that would check for
8 characters '0-9' 'A-F' 'a-f', hyphen, 4 chars, hyphen, 4 chars, hyphen, 4 chars, hyphen, 9 chars?
Would it go a bit like this?
preg_replace("/([0-9a-fA-F]{8})\/([-])\/([0-9a-fA-F]{4})\/([-])\/([0-9a-fA-F]{4})\/([-])\/([0-9a-fA-F]{4})\/([-])\/([0-9a-fA-F]{9})\", "", $string);
I'm sorry it's so long and confusing; I am definitely open to alternative ideas to accomplish this.
I also need one that checks for 'A-Z', 'a-z', '0-9', '-' and '_' and is 32 characters long.
I really wish there was an online generator for this. Maybe if I can understand how it is done, I'll build one for other poor souls.
Edit
Using AbraCadaver's code I was able to get this to work using
if (!preg_match("/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{9}/i", $string)) {
    echo 'error';
} else {
    // continue code
}

You were close, but you have a lot more than needed. Also, the i modifier makes the pattern case-insensitive (A or a):
"/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{9}/i"
if (!preg_match("/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{9}/i", $string)) {
    echo "Invalid string";
}
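One caveat: the pattern above is not anchored, so preg_match() also succeeds when a valid-looking value is embedded in a longer string. For strict validation it is probably worth anchoring it. Below is a minimal sketch, assuming you want to accept the whole string or nothing; the helper names isHexId and isToken32 are made up for illustration, and the second one is a guess at the 32-character check asked about in the question.

```php
<?php
// Hypothetical helpers; the formats follow the description in the question.

// 8-4-4-4-9 hex groups separated by hyphens, e.g. 7d43637d-780c-4703-8467-13525d590.
// \A and \z anchor the match to the whole string, so extra text on either side is rejected.
function isHexId(string $value): bool
{
    return preg_match('/\A[0-9A-F]{8}(?:-[0-9A-F]{4}){3}-[0-9A-F]{9}\z/i', $value) === 1;
}

// Exactly 32 characters drawn from A-Z, a-z, 0-9, hyphen and underscore.
function isToken32(string $value): bool
{
    return preg_match('/\A[A-Za-z0-9_-]{32}\z/', $value) === 1;
}

var_dump(isHexId('7d43637d-780c-4703-8467-13525d590'));
var_dump(isHexId('xx7d43637d-780c-4703-8467-13525d590<script>'));
var_dump(isToken32(str_repeat('aB3-_xyz', 4)));
```

Validating and rejecting like this tends to be safer than trying to strip unwanted characters with preg_replace(), since a stripped value can still end up in the wrong shape.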

Related

Adding regex to detect a word that repeats 3 characters?

I have searched and searched, but I cannot find anything quite like what I need. I have this code:
$repeater = "pompom";
if (preg_match('/([a-zA-Z])\1{3}/', $repeater)) {
    echo "Yes, $repeater does repeat 3 characters.<br>";
} else {
    echo "No, $repeater does not repeat 3 characters.<br>";
}
(I can barely understand regex as it is... so just ignore my current regex.. it's just a mixture of randomness I began to type.)
Anyhow, I need the regex code to return
true for words like
pompom
grugru
mopmop
cancan
etc...
and return false for words like
coocoo
daadaa
allall
giigii
etc.
The regex must detect and return true for any word that has 3 different characters that repeat more than once in that word.
This must work for words whose repeated characters are not necessarily next to one another. I have already found solutions for words such as "cooo" or "pooool"; those are not what I need this regex for. Note: it must return true only for words that contain 3 or more different letters that are repeated more than once, such as pompom.
It should return false for words like coocoo, because there are only 2 different letters in the word.
Again, please ignore my current regex; it was just what I had when I decided to ask for some help. I've tried probably 200 different methods, all wrong of course :].
Any help would be nice. Maybe we can figure this out together; I just need some ideas to bounce off of.
The following regex will perform as requested:
^((.)(?!\2)(.)(?!\2)(?!\3).)\1$
https://regex101.com/r/eHKzWB/3
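To see how that pattern behaves in PHP, here is a small sketch run against the words from the question. The function name repeatsThreeDistinct is made up for illustration. Note that, as written, the pattern only matches six-character words made of one three-letter block repeated once.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
// Group 1 captures three characters, the negative lookaheads force all three
// to differ, and \1 requires the same three-character block to follow directly.
function repeatsThreeDistinct(string $word): bool
{
    return preg_match('/^((.)(?!\2)(.)(?!\2)(?!\3).)\1$/', $word) === 1;
}

$words = ['pompom', 'grugru', 'mopmop', 'cancan', 'coocoo', 'daadaa', 'allall', 'giigii'];
foreach ($words as $word) {
    echo $word, ': ', repeatsThreeDistinct($word) ? 'true' : 'false', "\n";
}
```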

PHP Regex for full names in a specific format

I'm trying to write a PHP function that validates names with a regex. A name may contain any number of spaces, ' and -. Only a capital letter may follow a space, but either a capital or a non-capital letter may follow - and '. The total length should be 50 characters and the name should end with a lowercase letter. Note that the uppercase letters are A to Z plus these characters:
ÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ
and the lowercase letters are a to z plus these characters:
éçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß
Each word (the text between one space, ' or - and the next) should contain at least 2 characters. The name should also start with an uppercase letter and finish with a lowercase letter, and within a word no uppercase letter is allowed except the one at the beginning.
Examples of acceptable names are:
Adam Klsld
Adam'odskdl
Adam'Ddlsl
Ùdam-ddkkdk
Addssd-Ddsdsd
I've been trying a lot, but here's my last attempt, which I still keep in my PHP file; the others I deleted in the chaos of unsuccessful attempts (I'm using the mb_ereg function to match, so this is a POSIX ERE):
([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+){1}((^[\'\-\s])[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+)*
(this does not necessarily mean it's the best attempt, but I thought it might help and give an idea of how much of a dork I am)
I wouldn't exactly suggest you use this... but I think this does what you want?
^([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+){1}((([\s])[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+)|((['\-])([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ]|[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß])[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+))*$
Here it is in a non-code block so you can see how insane it is... think it strips some characters here though:
^([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+){1}((([\s])[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+)|((['-])([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ]|[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß])[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+))*$
Is this Regex answering what you need to check ?
(You'll have to add the weird characters inside each brackets of course).
You can use this to avoid accented characters issue:
$pattern = "~^[\p{Lu}ß]\p{Ll}*+(?>(?> [\p{Lu}ß]|['-]\p{L})\p{Ll}*+)*$~u";
if(preg_match($pattern, $name)) { ...
Or for a more specific set of characters:
$pattern = "~(?(DEFINE)(?<Up>[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ]))
(?(DEFINE)(?<Lo>[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]))
^\g<Up>\g<Lo>*+(?>(?>\h\g<Up>|['-]\g<Up>?+\g<Lo>)\g<Lo>*+)*+$~ux";
if (preg_match($pattern, $name, $matches)) { ...
or the same in a shorter way:
$pattern = "~(?(DEFINE)(?<Up>[A-ZÀ-ÖØ-ݟߌ]))
(?(DEFINE)(?<Lo>[a-zà-öø-ýÿßœ]))
^\g<Up>\g<Lo>*+(?>(?>\h\g<Up>|['-]\g<Up>?+\g<Lo>)\g<Lo>*+)*+$~ux";
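For what it's worth, here is a quick sketch that runs the first \p{Lu}/\p{Ll} pattern above against the sample names from the question. The wrapper function isValidName is made up for illustration, and the source file is assumed to be saved as UTF-8. As far as I can tell, that pattern only checks the letter structure: it does not enforce the length limit or the two-character minimum per word (a lone "A" passes), so those would need separate checks.

```php
<?php
// Hypothetical wrapper around the Unicode-property pattern from the answer.
// The u modifier makes PCRE treat both pattern and subject as UTF-8.
function isValidName(string $name): bool
{
    $pattern = "~^[\p{Lu}ß]\p{Ll}*+(?>(?> [\p{Lu}ß]|['-]\p{L})\p{Ll}*+)*$~u";
    return preg_match($pattern, $name) === 1;
}

$samples = [
    "Adam Klsld", "Adam'odskdl", "Adam'Ddlsl", "Ùdam-ddkkdk", "Addssd-Ddsdsd", // from the question
    "adam",        // starts with a lowercase letter
    "Adam klsld",  // lowercase letter after a space
    "Adam  Klsld", // two consecutive spaces
];
foreach ($samples as $name) {
    echo $name, ': ', isValidName($name) ? 'ok' : 'rejected', "\n";
}
```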

Explode UTF8 string regarding to uppercase or numeric characters

As shown in this question, I can split strings that include uppercase letters like this:
function splitAtUpperCase($string){
    return preg_replace('/([a-z0-9])?([A-Z])/','$1 $2',$string);
}
$string = 'setIfUnmodifiedSince';
echo splitAtUpperCase($string);
Output is "set If Unmodified Since"
But I need some modification:
That code snippet doesn't handle strings that contain these characters: ÇÖĞŞÜİ. I don't want to transliterate the characters, because then the word loses its meaning, so I need to keep some UTF-8 characters. That code turns "HereÇonThen" into "HereÇon Then".
I also don't want to split uppercase abbreviations. If word is "IKnowYouWillComeASAPHere" I need it to be converted to "I Know You Will Come ASAP Here"
Don't explode if all letters are uppercase. Like "DONTCOMEHERE"
Explode also numeric values. "Before2013ends" to "Before 2013 ends"
Explode if first character is hash key (#).
cases and expected results
"comeHEREtomorrow" => "come HERE tomorrow"
"KissYouTODAY" => "kiss you TODAY"
"comeÜndeHere" => "come Ünde Here"
"NEVERSAYIT" => "NEVERSAYIT"
"2013willCome" => "2013 will Come"
"Before2013ends" => "Before 2013 ends"
"IKnowThat" => "I Know That"
"#whatiknow" => "# whatiknow"
For these cases I currently use successive str_replace operations. I'm looking for a short solution that doesn't need too many for loops to check the words. It would be better to have it as a preg_replace or similar, if possible.
Edit: Anyone can try his solution by changing convert function inside this PHP fiddle: http://ideone.com/9gajZ8
/([[:lower:][:digit:]])?([[:upper:]]+)/u should do it.
Here the u modifier enables Unicode (UTF-8) handling, and ([[:upper:]]+) matches a sequence of uppercase letters.
Note: the case of a letter depends on the character set you are using.
Some notes:
Use Unicode properties to search for uppercase and lowercase letters (and even title-case ones, e.g. Dž Lj Nj Dz).
comeHEREtomorrow and IKnowThat can't both be handled by one method unless you use a dictionary to find exact words.
If you want comeHEREtomorrow to become come HERE tomorrow, then IKnowThat becomes IK now That (or even IK now T hat).
And if you want IKnowThat to become I Know That, then comeHEREtomorrow becomes come H E R E tomorrow.
My solution: http://ideone.com/oALyTo (excludes non-letter and non-number characters)
Well, I matched all of your test cases, but I still don't think it's a good solution. (One of the few flaws in test driven design).
I took a slightly different approach. Instead of trying to write a regular expression for what the boundary between two words should look like, I wrote a regular expression that looks for everything that apparently is a word, and then imploded the matches.
function convert($keyword) {
    $wResult = preg_match_all('/(^I|[[:upper:]]{2,}|[[:upper:]][[:lower:]]*|[[:lower:]]+|\d+|#)/u', $keyword, $matches);
    return implode(' ',$matches[0]);
}
As you can see, this is what I decided qualified as a word:
^I: a capital I at the beginning of the string. Break point: Icons.
[[:upper:]]{2,}: consecutive capitals. Break point: WellIKnowThat.
[[:upper:]][[:lower:]]*: a single capital followed by some lowercase letters.
[[:lower:]]+: a string of lowercase letters.
\d+: a string of digits.
#: a literal #.
It's not perfect; there are still many break points. You can continue to refine these word definitions, but frankly, there's always going to be an edge case you can't catch. Then you wind up slowly expanding this regular expression until it's totally unmanageable. You could try using a dictionary, but that breaks down eventually, too. What do you do with "whirlwind"? Or "ITan"? Is that "IT an", or "I Tan"? Case in point? Here it is after I tried to catch some of my errors. It's getting so huge, and it's still trivial to come up with strings it breaks on. This function is all about degrees: how much time is it worth spending to teach your algorithm all the funny points of all the world's languages?
EDIT: After some work, and after deciding that I could be separated out as its own word if and only if it is followed immediately by one capital letter and one lowercase letter, I've updated my attempt at an answer.
function convert($keyword, $debug = false) {
    $wResult = preg_match_all('/I(?=[[:upper:]][[:lower:]])|[[:upper:]]{2,}|[[:upper:]][[:lower:]]*|[[:lower:]]+|\d+|#/u', $keyword, $matches);
    if($debug){
        var_dump($matches);
        var_dump($matches[0]);
        var_dump(implode(' ',$matches[0]));
    }
    return implode(' ',$matches[0]);
}
I also added some new test cases:
convert("Icons") == "Icons"
convert("WellIKnowThat") == "Well I Know That"
convert("ITan") == "I Tan"
convert("whirlwind") == "whirlwind"
I think this is about as good as it's going to get today. The final set of "Word Definitions" in order of preference, is:
Upper case I, provided it's followed by an upper case letter and a lower case letter: I(?=[[:upper:]][[:lower:]])
Two or more consecutive upper case letters: [[:upper:]]{2,}
A single uppercase Letter, followed by as many Lower case letters as possible: [[:upper:]][[:lower:]]*
one or more consecutive lower case letters: [[:lower:]]+
One or more consecutive digits: \d+
A literal pound symbol: #
I've added another word definition, a test case, and refined the testing fiddle. The new word definition matches the rule for I, but with A - the only other one letter word in the English Language.
You need a Unicode-aware regex:
\p{Lu} for uppercase and \p{Ll} for lowercase, together with the u modifier so that PCRE treats the string as UTF-8.
Hence, your usage will look like this:
/([\p{Ll}0-9])?([\p{Lu}])/u
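As a rough illustration of the Unicode-property approach, here is a sketch that inserts a space wherever a lowercase letter or digit is directly followed by an uppercase letter. The function name splitAtUpperCaseUnicode is made up, the lookarounds are my own variation rather than the pattern from the answer, and the file is assumed to be UTF-8. It only covers the lowercase-to-uppercase boundary; digits and abbreviations from the question are not handled.

```php
<?php
// Hypothetical helper. The lookbehind/lookahead pair matches the empty position
// between a lowercase letter (or decimal digit) and an uppercase letter, so the
// replacement only inserts a space and never consumes characters.
// The u modifier is needed for \p{Ll}, \p{Nd} and \p{Lu} to work on UTF-8 input.
function splitAtUpperCaseUnicode(string $text): string
{
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace('/(?<=[\p{Ll}\p{Nd}])(?=\p{Lu})/u', ' ', $text) ?? $text;
}

echo splitAtUpperCaseUnicode('HereÇonThen'), "\n";
echo splitAtUpperCaseUnicode('comeÜndeHere'), "\n";
echo splitAtUpperCaseUnicode('setIfUnmodifiedSince'), "\n";
```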

PHP preg_match with regex: only single hyphens and spaces between words continue

I was trying to write a regex that allows single hyphens and single spaces only within words, but not at the beginning or at the end of the words.
I thought I had this sorted with the answer I got yesterday, but I just realised there is a small error which I don't quite understand.
Why won't it accept inputs like these:
'forum-category-b forum-category-a'
'forum-category-b Counter-terrorism'
'forum-category-a Preventing'
'forum-category-a Preventing Violent'
'forum-category-a International-Research-and-Publications'
'International-Research-and-Publications forum-category-b forum-category-a'
but it accepts these:
'forum-category-b'
'Counter-terrorism forum-category-a'
'Preventing forum-category-a'
'Preventing Violent forum-category-a'
'International-Research-and-Publications forum-category-b'
Why is that? How can I fix it? Below is the regex with the initial test, but ideally it should accept all the input combinations above:
$aWords = array(
    'a',
    '---stack---over---flow---',
    ' stack over flow',
    'stack-over-flow',
    'stack over flow',
    'stacoverflow'
);
foreach($aWords as $sWord) {
    if (preg_match('/^(\w+([\s-]\w+)?)+$/', $sWord)) {
        echo 'pass: ' . $sWord . "\n";
    } else {
        echo 'fail: ' . $sWord . "\n";
    }
}
It should reject inputs like these:
---stack---over---flow---
stack-over-flow- stack-over-flow2
stack over flow
Thanks.
Your pattern does not do what you want. Let's break it apart:
^(\w+([\s-]\w+)?)+$
It matches strings that consist solely of one or more sequences of the pattern:
\w+([\s-]\w+)?
...which is a sequence of word characters, followed optionally by one other sequence of word characters, separated by one space or dash character.
In other words, your pattern searches for strings like:
xxx-xxxyyy-yyyzzz zzz
...but you intend to write a pattern that would find:
xxx-xxxxxx-xxxxxx yyy
In your examples, this one is matched:
Counter-terrorism forum-category-a
...but it is interpreted as the following sequence:
(Counter(-terroris)) (m( foru)) (m(-categor)) (y(-a))
As you can see, the pattern did not really find the words you are looking for.
This example is not matched:
forum-category-a Preventing Violent
...since the pattern cannot form groups of "word characters, space-or-dash, word-characters" when it encounters a single word character followed by space or dash:
(forum(-categor)) (y(-a)) <Mismatch: Found " " but expected "\w">
If you would add another character to "forum-category-a", say "forum-category-ax", it would match again, since it could split at the "ax":
(forum(-categor)) (y(-a)) (x( Preventin)) (g( Violent))
What you are actually interested in is a pattern like
^(\w+(-\w+)*)(\s\w+(-\w+)*)*$
...which would find a sequence of words that may contain dashes, separated by spaces:
(forum(-category)(-a)) ( Preventing) ( Violent)
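To check this against the inputs from the question, here is a small PHP sketch using the pattern suggested above. The function name isValidWordList is made up for illustration.

```php
<?php
// Hypothetical helper: one or more hyphen-joined words, each pair of words
// separated by exactly one whitespace character.
function isValidWordList(string $input): bool
{
    return preg_match('/^(\w+(-\w+)*)(\s\w+(-\w+)*)*$/', $input) === 1;
}

$inputs = [
    'forum-category-a Preventing Violent',
    'International-Research-and-Publications forum-category-b forum-category-a',
    'a',
    '---stack---over---flow---',
    ' stack over flow',
    'stack-over-flow- stack-over-flow2',
];
foreach ($inputs as $input) {
    echo (isValidWordList($input) ? 'pass: ' : 'fail: ') . $input . "\n";
}
```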
By the way, I tested this using a Python script, and while trying to match your pattern against the example string "International-Research-and-Publications forum-category-b forum-category-a", the regular expression engine seemed to run into an infinite loop...
import re
expr = re.compile(r'^(\w+([\s-]\w+)?)+$')
expr.match('International-Research-and-Publications forum-category-b forum-category-a')
The ([\s-]\w+)? part of your pattern is the issue. It only allows for one repetition (because of the trailing ?). Try changing the last ? to * and see if that helps.
Nope, I still believe that's the problem. The original pattern is looking for "word" or "word[space_hyphen]word" repeated 1+ times. Which is weird because the pattern should fall within another match. But switching the question mark worked for me.
There should be only one answer to this problem:
/^((?<=\w)[ -]\w|[^ -])+$/
There is only one rule as stated, \w[ -]\w, and that's it. It applies at per-character granularity and cannot be anything else. Add [^ -] to cover the rest.
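Here is a short sketch of how that character-by-character pattern behaves in PHP; the function name hasOnlySingleSeparators is made up for illustration. One thing to be aware of: [^ -] accepts any character other than a space or hyphen, so punctuation such as "a.b" passes as well.

```php
<?php
// Hypothetical helper around the pattern above. A space or hyphen is only
// accepted when the previous character is a word character (lookbehind) and
// the next character is a word character; anything else must not be a
// space or hyphen.
function hasOnlySingleSeparators(string $input): bool
{
    return preg_match('/^((?<=\w)[ -]\w|[^ -])+$/', $input) === 1;
}

foreach (['stack-over-flow', 'forum-category-a Preventing', '---stack', 'stack-', 'stack  over'] as $input) {
    echo (hasOnlySingleSeparators($input) ? 'pass: ' : 'fail: ') . $input . "\n";
}
```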

php regular expression to filter out junk

So I have an interesting problem: I have a string, and for the most part i know what to expect:
http://www.someurl.com/st=????????
Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ
Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.
The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward as many as I need to fill up the 8 character quota. Is there a regex way to do this or will i need to roll up my sleeves and go nested-loop style?
update:
To clear up some confusion, I get an input string that's like this:
[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????
except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I have been able to find patterns in neither). Usually the ?s are all together hence me just grabbing the last 8 chars, but sometimes they aren't which results in some missing data and returned garbage :-\
$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);
Hah, that was a joke. Here's a regex for you:
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
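For the strategy described in the question (take the trailing run of uppercase letters and digits, then fill any shortfall from the text after st=), a rough sketch could look like the following. The function name extractCode and its $length parameter are made up for illustration, and the fallback assumes the st= marker itself survives intact, which the question does not guarantee. It is a heuristic, not a reliable fix.

```php
<?php
// Hypothetical helper: recover the code at the end of a noisy URL string.
// Returns null when fewer than $length usable characters can be found.
function extractCode(string $input, int $length = 8): ?string
{
    // Step 1: the run of uppercase letters/digits at the very end (may be empty).
    preg_match('/[A-Z0-9]*\z/', $input, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // Step 2: fall back to the text between "st=" and that trailing run.
    $pos = strpos($input, 'st=');
    if ($pos === false) {
        return null;
    }
    $start  = $pos + 3;
    $middle = (string) substr($input, $start, strlen($input) - $start - strlen($tail));

    // Keep only uppercase letters and digits, then take just enough to fill the quota.
    $candidates = (string) preg_replace('/[^A-Z0-9]/', '', $middle);
    $code = substr($candidates, 0, $length - strlen($tail)) . $tail;

    return strlen($code) === $length ? $code : null;
}

var_dump(extractCode('http://www.someurl.com/st=AB12CD34'));
var_dump(extractCode('http://www.someurl.com/st=AB12þîCD34'));
var_dump(extractCode('no marker here'));
```

If the garbage between the pieces happens to contain uppercase letters or digits, this will pick them up too, so the result should still be treated as a best guess.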
As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":
__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__
What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().
You can use this regular expression :
if (preg_match('/[\'^£$%&*()}{##~?><>,|=_+¬-]/', $string) ==1)
