I have a parish website that I am maintaining. The site has a parish registration form on it. Yesterday someone submitted the form with spam. The submitter supplied an inappropriate web address in one of the fields.
I'm fairly confident this was not a bot submission, as I use reCAPTCHA and a honeypot to fend off bots.
What I'm trying to figure out is how, on the processing page, to look at all the text entry fields and scrub URLs.
Since the language is PHP:
function scrubURL(field){
if($_POST[field] contains **SomeURL**){
$field = str_replace(***SomeURL***, "", $_POST[field])
} else{
$field = $_POST[field];
}
return $field;
}
I'm just not sure how to check the field to see if it contains a URL.
I'm planning to scrub URLs by calling:
$first = scrubURL($first);
$last = scrubURL($last);
$address = scrubURL($address);
I will then use $first, $last & $address in the mail that gets sent to the parish office.
This function will recognize URLs and replace them with empty strings. Just be aware that lots of things, such as wx.yz, look like valid URLs.
function scrubURL($field)
{
//return preg_replace('#((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?(/([-\\w/_\\.]*(\\?\\S+)?)?)*)(?:[?&][^?$]+=[^?&]*)*#i', '', $_POST[$field]);
return preg_replace("#((https?://|ftp://|www\.|[^\s:=]+@www\.).*?[a-z_\/0-9\-\#=&])(?=(\.|,|;|\?|\!)?(\"|'|«|»|\[|\s|\r|\n|$))#iS", '', $_POST[$field]);
}
The parameter, $field, has to be a string, such as "email" corresponding to $_POST["email"]
<?php
$_POST = [
'email' => 'something www.badsite.com?site=21&action=redirect else',
];
function scrubURL($field)
{
return preg_replace('#((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?(/([-\\w/_\\.]*(\\?\\S+)?)?)*)(?:[?&]\S+=\S*)*#i', '', $_POST[$field]);
}
echo scrubURL('email');
Prints:
something else
Regex is an easy way to evaluate fields for possible URL markers. Something like the following would remove much of it (though, given how many shapes URLs can come in, not everything):
$_POST = [
'first' => 'actualname',
'last' => 'something http://url.com/visit-me',
'middle' => 'hello www.foobar.com spammer',
'other' => 'visit https://spammery.us/ham/spam spamming',
'more' => 'spam.tld',
];
// Iterates all $_POST fields, editing the $_POST array in place
foreach($_POST as $key => &$val) {
$val = scrubURL($val);
}
unset($val); // break the reference left over from the loop
function scrubURL($data)
{
/* Removes anything that:
- starts with http(s):
- starts with www.
- has a domain extension (2-5 characters)
... ending the match with the first space following the match.
*/
$data = preg_replace('#\b
(
https?:
|
www\.
|
[^\s]+\.[a-z]{2,5}
)
[^\s]*#x', '-', $data); // * rather than +, so a bare "spam.us" or "acc.no" is matched too
return $data;
}
print_r($_POST);
Be aware that the last condition, looking for any TLD (.abc), and there are lots of them, may result in some false positives.
"sentence.Poor punctuation" would be safe, because we're only matching lowercase [a-z]. However, spam.Com would also pass! Use [a-zA-Z] to match both cases, or add the "i" modifier to the regex.
"my acc.no is 12345" would be removed (potential spammer accountants from Norway?!)
The above process would give you the following filtered data:
Array (
[first] => actualname
[last] => something -
[middle] => hello - spammer
[other] => visit - spamming
[more] => -
)
The regex can definitely be further refined. ^_^
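One possible refinement is the case-insensitive variant mentioned above. The following is only a sketch: scrubUrlCaseInsensitive() is a hypothetical name, and the pattern is a compact form of the one shown earlier with the "i" modifier added. It also shows the trade-off, since badly punctuated sentences now get caught as well.

```php
<?php
// Hypothetical variant of scrubURL() with the "i" modifier (assumption, not the original code).
function scrubUrlCaseInsensitive(string $data): string
{
    // Same three alternatives as before: http(s):, www., or anything ending in .xx to .xxxxx
    return preg_replace('#\b(https?:|www\.|[^\s]+\.[a-z]{2,5})[^\s]*#i', '-', $data);
}

echo scrubUrlCaseInsensitive('visit spam.Com now'), "\n";        // visit - now
echo scrubUrlCaseInsensitive('sentence.Poor punctuation'), "\n"; // - punctuation (false positive)
```

Whether the extra false positives are acceptable depends on how often real submissions contain text like "sentence.Poor".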
N.B. You may also want to sanitize the incoming data with e.g. strip_tags and htmlspecialchars to ensure the website is sending reasonably safe data to your parish.
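A minimal sketch of that sanitizing step, assuming the mail is sent as HTML and the fields are UTF-8. cleanField() is a hypothetical helper name, not something from the original form code.

```php
<?php
// Hypothetical helper (assumption): strips tags, trims, then escapes for HTML output.
function cleanField(string $value): string
{
    // Remove HTML/PHP tags and surrounding whitespace first.
    $value = trim(strip_tags($value));
    // Escape the remaining special characters so they are safe inside an HTML mail.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

echo cleanField(' <b>John</b> & Sons '); // John &amp; Sons
```

If the mail is plain text, the htmlspecialchars step is probably unnecessary and strip_tags alone may be enough.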
Related
I wanted to know if it's possible to make a PHP mention system that supports usernames with spaces.
I tried this
preg_replace_callback('/#([a-zA-Z0-9]+)/', 'mentionUser', htmlspecialchars_decode($r['content']))
My function:
function mentionUser($matches) {
global $db;
$req = $db->prepare('SELECT id FROM members WHERE username = ?');
$req->execute(array($matches[1]));
if($req->rowCount() == 1) {
$idUser = $req->fetch()['id'];
return '<a class="mention" href="members/profile.php?id='.$idUser.'">'.$matches[0].'</a>';
}
return $matches[0];
}
It works, but not for usernames with spaces.
I tried adding \s. That works, but not well: preg_replace_callback then captures the username together with other parts of the message, so the mention doesn't appear.
Is there any solution?
Thanks!
I know you said that you just removed the ability to add a space, but I still wanted to post a solution. To be clear, I don't necessarily think you should use this code, because it probably is just easier to keep things simple, but I think it should work still.
Your major problem is that almost every mention will incur two lookups, because #bob johnson went to the store could refer to either bob or bob johnson, and there's no way to determine that without going to the database. Luckily, caching will greatly reduce this problem.
Below is some code that generally does what you are looking for. I made a fake database using just an array for clarity and reproducibility. The inline code comments should hopefully make sense.
function mentionUser($matches)
{
// This is our "database" of users
$users = [
'bob johnson',
'edward',
];
// First, grab the full match which might be 'name' or 'name name'
$fullMatch = $matches['username'];
// Create a search array where the key is the search term and the value is whether or not
// the search term is a subset of the value found in the regex
$names = [$fullMatch => false];
// Next split on the space. If there isn't one, we'll have an array with just a single item
$maybeTwoParts = explode(' ', $fullMatch);
// Basically, if the string contained a space, also search only for the first item before the space,
// and flag that we're using a subset
if (count($maybeTwoParts) > 1) {
$names[array_shift($maybeTwoParts)] = true;
}
foreach ($names as $name => $isSubset) {
// Search our "database"
if (in_array($name, $users, true)) {
// If it was found, wrap in HTML
$ret = '<span>#' . $name . '</span>';
// If we're in a subset, we need to append back on the remaining string, joined with a space
if ($isSubset) {
$ret .= ' ' . array_shift($maybeTwoParts);
}
return $ret;
}
}
// Nothing was found, return what was passed in
return '#' . $fullMatch;
}
// Our search pattern with an explicitly named capture
$pattern = '/#(?<username>\w+(?:\s\w+)?)/';
// Three tests
assert('hello <span>#bob johnson</span> test' === preg_replace_callback($pattern, 'mentionUser', 'hello #bob johnson test'));
assert('hello <span>#edward</span> test' === preg_replace_callback($pattern, 'mentionUser', 'hello #edward test'));
assert('hello #sally smith test' === preg_replace_callback($pattern, 'mentionUser', 'hello #sally smith test'));
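The caching mentioned earlier could look roughly like this. It is a sketch under assumptions: findUserId() and the $lookup callable are hypothetical names, and the closure with a counter stands in for a real database query.

```php
<?php
// Hypothetical per-request cache around a user lookup (assumption, not the asker's code).
function findUserId(string $name, callable $lookup): ?int
{
    static $cache = [];
    // array_key_exists() so that cached "not found" results (null) also count as hits.
    if (!array_key_exists($name, $cache)) {
        $cache[$name] = $lookup($name);
    }
    return $cache[$name];
}

$queries = 0;
$lookup = function (string $name) use (&$queries): ?int {
    $queries++; // stands in for one database query
    $users = ['bob johnson' => 7, 'edward' => 9];
    return $users[$name] ?? null;
};

var_dump(findUserId('bob johnson', $lookup)); // int(7)
var_dump(findUserId('bob johnson', $lookup)); // int(7), served from the cache
var_dump(findUserId('bob', $lookup));         // NULL
echo $queries, "\n";                          // 2
```

A static array only lives for one request; a shared cache would be needed to save lookups across page views.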
Try this RegEx:
/#[a-zA-Z0-9]+( *[a-zA-Z0-9]+)*/
It first matches the # sign, then one or more letters or numbers. After that it matches zero or more groups, each made of optional spaces followed by one or more letters or numbers. (PHP's preg functions have no g flag; use preg_match_all to find every mention.)
I am assuming the username only contains A-Za-z0-9 and space.
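A quick way to see what this pattern captures, written without the g flag and run through preg_match_all. The sample text is made up for illustration.

```php
<?php
// Made-up sample text (assumption); the pattern is the one suggested above.
$text = 'hello #edward test, bye';

preg_match_all('/#[a-zA-Z0-9]+( *[a-zA-Z0-9]+)*/', $text, $m);

print_r($m[0]); // one match: "#edward test"
```

Note that the match keeps going across following words until it hits a character outside the allowed set, so the captured text still has to be checked against the real usernames.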
Hello everyone, I'm trying to retrieve data from a form with a POST request.
This data is posted to another website.
On the website where the data is created I have a text field called website. The data filled into this field goes to the other website, where the data is collected. Now I want to exclude the 'www' part. For example, if the user enters www.hello.nl, I want to receive hello.nl only.
What i tried:
function website () {
$str = $_POST['billing_myfield12'];
echo chop($str,"www");
}
// end remove www
// prepare the sales payload
$sales_payload = array(
'organization_id' => $organization_id,
'contact_id' => $contact_id,
'status' => 'Open',
'subject' => $_product->post_title." ".website(), <----- here i call it
This is not working. Is there a way to do this?
You can use trim(), or more specifically ltrim(), to trim away the www. on the left side. Please don't forget the . after www.
echo ltrim($str, "www.");
Sample Code
echo ltrim("www.hello.nl", "www."); // hello.nl
Demo: http://ideone.com/bqMY7X
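Looks like there are side effects with the above code: ltrim() treats its second argument as a set of characters rather than a prefix, so www.world.nl would become orld.nl. Let's go with the traditional str_replace method: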
echo str_replace("www.", "", $str);
However, str_replace removes www. anywhere in the string, and we only want to replace it at the very start. So we need to use preg_replace instead, anchoring the match to the start of the string. (PHP has no g modifier, so it is left out.)
echo preg_replace("/^www\./", "", $str);
Verified the above code with: https://regex101.com/r/dv8N6d/1
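To make the difference between the three approaches concrete, here is a small comparison. stripWww() is a hypothetical wrapper name and the host names are made-up examples.

```php
<?php
// Hypothetical wrapper (assumption): removes a leading "www." only.
function stripWww(string $host): string
{
    return preg_replace('/^www\./', '', $host);
}

echo ltrim('www.world.nl', 'www.'), "\n";           // orld.nl (character set, eats the w of "world")
echo str_replace('www.', '', 'hello.www.nl'), "\n"; // hello.nl (removed from the middle)
echo stripWww('www.world.nl'), "\n";                // world.nl
echo stripWww('hello.www.nl'), "\n";                // hello.www.nl (left alone)
```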
I am currently dealing with a performance issue that I cannot find a way to fix. I want to search a text for usernames mentioned with the # sign in front. The list of usernames is available as a PHP array.
The problem is usernames may contain spaces or other special characters. There is no limitation for it. So I can't find a regex dealing with that.
Currently I am using a function which gets the whole line after the # and checks char by char which usernames could match for this mention, until there is just one username left which totally matches the mention. But for a long text with 5 mentions it takes several seconds (!!!) to finish. for more than 20 mentions the script runs endlessly.
I have some ideas, but I don't know if they may work.
Going through username list (could be >1.000 names or more) and search for all #Username without regex, just string search. I would say this would be far more inefficient.
Checking with JavaScript, while the username is being written, whether it contains a space or punctuation mark, and then surrounding it with quotation marks, like #"User Name". I don't like that idea; it looks dirty to the user.
Don't start with one character, but maybe 4. and if no match, go back. So same principle like on sorting algorithms. Divide and Conquer. Could be difficult to implement and will maybe lead to nothing.
How does Facebook or twitter and any other site do this? Are they parsing the text directly while typing and saving the mentioned usernames directly in the stored text of the message?
This is my current function:
$regular_expression_match = '~(?:^|\\s)#(.+?)(?:\n|$)~';
$matches = false;
$offset = 0;
while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
{
$line = $matches[1][0];
$search_string = substr($line, 0, 1);
$filtered_usernames = array_keys($user_list);
$matched_username = false;
// Loop, make the search string one by one char longer and see if we have still usernames matching
while (count($filtered_usernames) > 1)
{
$filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
$search_string = utf8_clean_string($search_string);
if (strlen($username_clean) == strlen($search_string))
{
if ($username_clean == $search_string)
{
$matched_username = $username_clean;
}
return false;
}
return (substr($username_clean, 0, strlen($search_string)) == $search_string);
});
if ($search_string == $line)
{
// We have reached the end of the line, so stop
break;
}
$search_string = substr($line, 0, strlen($search_string) + 1);
}
// If there is still one in filter, we check if it is matching
$first_username = reset($filtered_usernames);
if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
{
$matched_username = $first_username;
}
// We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
// So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
// This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
// and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
if ($matched_username)
{
$startpos = $matches[1][1];
// We need to get the endpos, cause the username is cleaned and the real string might be longer
$full_username = substr($post_text, $startpos, strlen($matched_username));
while (utf8_clean_string($full_username) != $matched_username)
{
$full_username = substr($post_text, $startpos, strlen($full_username) + 1);
}
$length = strlen($full_username);
$user_data = $user_list[$matched_username];
$mentioned[] = array_merge($user_data, array(
'type' => self::MENTION_AT,
'start' => $startpos,
'length' => $length,
));
}
$offset = $matches[0][1] + strlen($search_string);
}
Which way would you go? The problem is the text will be displayed often and parsing it every time will consume a lot of time, but I don't want to heavily modify what the user had entered as text.
I can't find out what's the best way, and even why my function is so time consuming.
A sample text would be:
Okay, #Firstname Lastname, I mention you!
Listen #[TEAM] John, you are a team member.
#Test is a normal name, but #Thât♥ should be tracked too.
And see #Wolfs garden! I just mean the Wolf.
Usernames in that text would be
Firstname Lastname
[TEAM] John
Test
Thât♥
Wolf
So yes, there is clearly nothing that tells me where a name may end. The only thing is the newline.
I think the main problem is that you can't distinguish usernames from the surrounding text. It's a bad idea to look up maybe thousands of usernames in a text, and it can also lead to further problems, such as John being part of [TEAM] John or JohnFoo.
The usernames need to be separated from the other text. Assuming that you're using UTF-8, you could put the usernames between an invisible zero-width space (U+200B, \xE2\x80\x8B) and a zero-width non-joiner (U+200C, \xE2\x80\x8C).
The usernames can then be extracted quickly and with little effort and, if needed, still be verified in the database.
$txt = "
Okay, #\xE2\x80\x8BFirstname Lastname\xE2\x80\x8C, I mention you!
Listen #\xE2\x80\x8B[TEAM] John\xE2\x80\x8C, you are a team member.
#\xE2\x80\x8BTest\xE2\x80\x8C is a normal name, but
#\xE2\x80\x8BThât♥\xE2\x80\x8C should be tracked too.
And see #\xE2\x80\x8BWolfs\xE2\x80\x8C garden! I just mean the Wolf.";
// extract usernames
if(preg_match_all('~#\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $txt, $out)){
print_r($out[0]);
}
Array
(
[0] => Firstname Lastname
[1] => [TEAM] John
[2] => Test
[3] => Thât♥
[4] => Wolfs
)
echo $txt;
Okay, #Firstname Lastname, I mention you!
Listen #[TEAM] John, you are a team member.
#Test is a normal name, but
#Thât♥ should be tracked too.
And see #Wolfs garden! I just mean the Wolf.
You could use any characters you like as separators, as long as they are unlikely to occur elsewhere in the text.
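How the markers get into the text is left open above. One option, sketched here under the assumption that the mentioned username is known when the post is saved (for example from the autocomplete popup), is to wrap it at write time. markMention() and the two constants are hypothetical names.

```php
<?php
const MENTION_OPEN  = "\xE2\x80\x8B"; // U+200B zero-width space
const MENTION_CLOSE = "\xE2\x80\x8C"; // U+200C zero-width non-joiner

// Hypothetical helper (assumption): wraps a known username in invisible markers.
function markMention(string $text, string $username): string
{
    $marked = '#' . MENTION_OPEN . $username . MENTION_CLOSE;
    return str_replace('#' . $username, $marked, $text);
}

$stored = markMention('Listen #[TEAM] John, you are a team member.', '[TEAM] John');

// Same extraction pattern as in the answer above.
preg_match_all('~#\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $stored, $out);
print_r($out[0]); // one entry: "[TEAM] John"
```

The stored text still displays the same way, since both marker characters are invisible.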
I am creating an OpenCart extension where the admin can change his email templates using the user interface in the admin panel.
I would like the user to have the option to add variables to his custom email templates. For example he could put in:
Hello $order['customer_firstname'], your order has been processed.
At this point $order would be undefined; the user is simply defining the message that is to be sent. This would be stored in the database and retrieved when the email is to be sent.
The problem is, how do I get "$order['customer_firstname']" to be stored as a literal string, and then be converted to a variable when necessary?
Thanks
Peter
If I understand your question correctly, you could do something like this:
The customer has a textarea or similar to input the template
Dear %NAME%, blah blah %SOMETHING%
Then you could have
$values = array('%SOMETHING%' => $order['something'], '%NAME%' => $order['name']);
$str = str_replace(array_keys($values), array_values($values), $str);
The user will be using around 40 variables. Is there a way I can set it to do that for each "%VARIABLE%"?
Yes, you can easily do so for each variable with the help of a callback function.
This allows you to process each match with a function of your choice, returning the desired replacement.
$processed = preg_replace_callback("/%(\S+)%/", function($matches) {
$name = $matches[1]; // between the % signs
$replacement = get_replacement_if_valid($name);
return $replacement;
},
$text_to_replace_in
);
From here, you can do anything you like, dot notation, for example:
function get_replacement_if_valid($name) {
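// Split "order.name" into variable and key; pad so a name without a dot raises no warning
list($var, $key) = array_pad(explode(".", $name, 2), 2, null);
if ($var === "order" && $key !== null) {
$order = init_order(); // symbolic: stands in for however the order is loaded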
if(array_key_exists($key, $order)) {
return $order[$key];
}
}
return "<invalid key: $name>";
}
This simplistic implementation allows you to process replacements such as %order.name%, substituting them with $order['name'].
You could define your own simple template engine:
function template($text, $context) {
$tags = preg_match_all('~%([a-zA-Z0-9]+)\.([a-zA-Z0-9]+)%~', $text, $matches);
for($i = 0; $i < count($matches[0]); $i++) {
$subject = $matches[0][$i];
$ctx = $matches[1][$i];
$key = $matches[2][$i]; // second capture group: the key after the dot
$value = $context[$ctx][$key];
$text = str_replace($subject, $value, $text);
}
return $text;
}
This allows you to transform a string like this:
$text = 'Hello %order.name%. You have %order.percent%% discount. Pay a total ammount of %payment.ammount% using %payment.type%.';
$templated = template($text, array(
'order' => array(
'name' => 'Alex',
'percent' => 20
),
'payment' => array(
'type' => 'VISA',
'ammount' => '$299.9'
)
));
echo $templated;
Into this:
Hello Alex. You have 20% discount. Pay a total ammount of $299.9 using VISA.
This allows you to have any number of variables defined.
If you want to keep the PHP-syntax, then a regex would be appropriate to filter them:
$text = preg_replace(
"/ [$] (\w+) \[ '? (\w+) \'? \] /exi",
"$$1['$2']", # basically a constrained eval
$text
);
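Note that it needs to be executed in the same scope in which $order is defined. Also note that the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions you have to use preg_replace_callback instead, which is preferable anyway for maximum flexibility.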
You could also allow another syntax this way. For example {order[customer]} or %order.customer% is more common and possibly easier to use than the PHP syntax.
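A sketch of the preg_replace_callback route for the PHP-style syntax, assuming only $order is exposed to the template. The sample $order array and message are made up for illustration.

```php
<?php
// Made-up sample data (assumption).
$order = ['customer_firstname' => 'Peter'];
$text  = 'Hello $order[\'customer_firstname\'], your order has been processed.';

$result = preg_replace_callback(
    '/\$order\[\'?(\w+)\'?\]/',           // matches $order['key'] or $order[key]
    function (array $m) use ($order) {
        // Unknown keys are left untouched instead of being guessed.
        return array_key_exists($m[1], $order) ? (string) $order[$m[1]] : $m[0];
    },
    $text
);

echo $result; // Hello Peter, your order has been processed.
```

Because the callback only ever reads from $order, nothing in the template is evaluated as code.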
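You can store it as Hello $order['customer_firstname'] and, when accessing it, make sure it is inside a double-quoted string "" so the variable is converted to its corresponding value. Inside double quotes, an array element with a quoted key has to be wrapped in curly braces. Note that this interpolation only happens for string literals in PHP source code, not for text loaded from the database at run time.
echo "Hello {$order['customer_firstname']}";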
Edit: As per the comments, a variation to Prash's answer,
str_replace('%CUSTOMERNAME%', $order['customer_name'], $str);
What you're looking for is:
eval("echo \"" . $input . "\";");
but please, PLEASE don't do that, because that lets the user run any code he wants.
A much better way would be a custom template-ish system, where you provide a list of available values for the user to drop in the code using something like %user_firstname%. Then, you can use str_replace and friends to swap those tags out with the actual values, but you can still scan for any sort of malicious code.
This is why Markdown and similar are popular; they give the user control over presentation of his content while still making it easy to scan for HTML/JS/PHP/SQL injection/anything else they might try to sneak in, because whitelisting is easier than blacklisting.
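A minimal sketch of such a whitelist-based template, assuming the mail is HTML. renderTemplate() and the tag names are hypothetical; only tags present in the whitelist are replaced, and anything else is left as typed.

```php
<?php
// Hypothetical helper (assumption): replaces only whitelisted %tags%.
function renderTemplate(string $template, array $allowed): string
{
    $pairs = [];
    foreach ($allowed as $tag => $value) {
        // Escape values so customer data cannot inject markup into the mail.
        $pairs['%' . $tag . '%'] = htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    }
    // strtr() replaces all tags in one pass and never rescans replaced text.
    return strtr($template, $pairs);
}

echo renderTemplate(
    'Hello %user_firstname%, order %order_id% has been processed. %unknown%',
    ['user_firstname' => 'Peter & Co', 'order_id' => 42]
);
// Hello Peter &amp; Co, order 42 has been processed. %unknown%
```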
Perhaps you can have a template like this:
$tpl = "Hello {$order['customer_firstname']}, your order has been processed.";
If $order and that specific key are not null, you can use echo $tpl directly and show the content of the 'customer_firstname' key in the text. The curly braces are the key here.
I am looking for a function that suits the following situation.
What the script does:
It retrieves the server.log from a minecraft server, and breaks it into sections from user chats, to server notifications etc.
I have an array containing the users information, it is set by another file called users.yml
and it is converted to an array like so
$userconfig = array(
'TruDan' => array(
'prefix' => '&3',
'suffix' => '&e',
),
'TruGaming' => array(
'prefix' => '&c',
'suffix' => '&f',
),
'PancakeMiner' => array(
'prefix' => '&c',
'suffix' => '&f',
),
'Teddybear952' => array(
'prefix' => '&b',
'suffix' => '&f',
),
);
What I want to do is search the $line from server.log (it loops through lines) for one of the usernames above (an array key) and return that user's array, so I can then parse $ret['prefix'] and $ret['suffix'].
mc.php (the file) http://pastebin.com/9geyfuup
server.log (partial, the actual thing is 12,000 lines long, so i took a few lines from it) http://pastebin.com/DKz8YfgK
If you're using preg_match() to search each line for a username, make sure you first sort the list of usernames in reverse order using rsort():
$users = array_map('preg_quote', array_keys($userconfig));
rsort($users);
$pattern = '/' . implode('|', $users) . '/';
if (preg_match($pattern, $line, $matches)) {
return $matches[0];
}
else {
return array();
}
If you search a line for the pattern "/TruDan|TruDan123/" the search will match a line containing "TruDan123" to the shorter version "TruDan" because it was specified first in the pattern. Sorting the user list in reverse order ensures that the pattern will be "/TruDan123|TruDan/" and so give preference to the longer match.
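Put together as a function that returns the user's config array, this could look like the sketch below. findUserConfig() is a hypothetical name, the log line format is an assumption, and TruDan123 is added to the sample data only to show the longest-first ordering.

```php
<?php
// Hypothetical helper (assumption): returns the config of the first username found in $line.
function findUserConfig(string $line, array $userconfig): array
{
    $users = array_keys($userconfig);
    rsort($users, SORT_STRING); // "TruDan123" now comes before "TruDan"
    $quoted = array_map(function ($user) {
        return preg_quote($user, '/'); // escape metacharacters and the delimiter
    }, $users);
    $pattern = '/' . implode('|', $quoted) . '/';

    if (preg_match($pattern, $line, $matches)) {
        return $userconfig[$matches[0]];
    }
    return [];
}

$userconfig = [
    'TruDan'    => ['prefix' => '&3', 'suffix' => '&e'],
    'TruDan123' => ['prefix' => '&c', 'suffix' => '&f'],
];

print_r(findUserConfig('[INFO] <TruDan123> hello', $userconfig)); // prefix &c, suffix &f
```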
I'm not sure I 100% understand the question, but based on what I gather you'd like to use the above array (keys) as the needle, and the $line as a haystack?
Could do it in multiple ways, but something such as:
// Regex Method
// this depends on how big this array is though, however, otherwise this
// regex pattern could potentially be HUGE.
$pattern = sprintf('/(%s)/', implode('|', array_map('preg_quote',array_keys($userconfig))));
if (preg_match($pattern,$line,$matches)){
// $matches[1] holds the first name found in the line
// $userconfig[$matches[1]]
}
}
Otherwise you could keep iterating over the keys:
// "brute force"
// this keeps checking for $name in the $line over and over, but no real
// optimizations going on here.
foreach (array_keys($userconfig) as $name)
{
if (strpos($line,$name) !== false)
{
// $name was found in $line
// $userconfig[$name]
}
}
So you want to take 'TruDan', 'TruGaming', 'PancakeMiner', and 'Teddybear952' and see if they show up in a particular log file at all?
$names = array_keys($userconfig); // get the user names
$name_regex = implode('|', $names); // produce a|b|c|d
// ... load your log file into $lines
foreach($lines as $line) {
if (preg_match("/($name_regex)/", $line, $matches)) { // search for /(a|b|c|d)/ in the string
echo "found $matches[1] in $line";
print_r($userconfig[$matches[1]]); // show the user's config data.
}
}
Of course, this is simplistic. If there are any regex metacharacters in a username, they'll cause a syntax error in the regex (or a wrong match), so you'd want to massage the names a bit before building the regex clause.
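One way to do that massaging is preg_quote with the delimiter passed as its second argument. A short sketch, using made-up names that contain metacharacters:

```php
<?php
// Made-up usernames containing regex metacharacters (assumption).
$names = ['[TEAM] John', 'a.b'];

$quoted = array_map(function ($name) {
    return preg_quote($name, '/'); // also escapes the "/" delimiter if present
}, $names);
$name_regex = implode('|', $quoted);

echo $name_regex, "\n"; // \[TEAM\] John|a\.b

var_dump(preg_match("/($name_regex)/", 'hello [TEAM] John', $m)); // int(1)
echo $m[1], "\n";                                                 // [TEAM] John
```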