Example code for a comment parser


Does anyone know of any sample PHP (ideally CodeIgniter) code for parsing user-submitted comments, to remove profanity, HTML tags, etc.?

Try strip_tags to get rid of any HTML submitted. If you only want to ensure that no HTML is rendered in the comments, you can use htmlspecialchars to escape the tags instead; as in Matchu's example, it has fewer unintended effects than strip_tags.
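To make the difference concrete, here is a minimal sketch that runs both functions on the same comment. The sample string is made up for illustration and is not from the question.

```php
<?php
// Hypothetical user comment containing markup.
$comment = 'Nice <b>post</b>! <script>alert(1)</script>';

// strip_tags() deletes the tags themselves; the text between them is kept.
$stripped = strip_tags($comment);

// htmlspecialchars() keeps every character but escapes the HTML-significant ones,
// so a browser shows the markup as text instead of interpreting it.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n"; // Nice post! alert(1)
echo $escaped, "\n";  // Nice &lt;b&gt;post&lt;/b&gt;! &lt;script&gt;alert(1)&lt;/script&gt;
```

Note that strip_tags leaves the script body behind as plain text, which is one of the unintended effects mentioned above.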
For a word filter, depending on how in-depth you want to go, there are many examples on the web, from simple to complex. Here's the code from Jake Olefsky's example (the simple one linked previously):
<?php
//This is totally free to use by anyone for any purpose.
// BadWordFilter
// This function does all the work. If $replace is 1 it will replace all bad words
// with the wildcard replacements. If $replace is 0 it will not replace anything.
// In either case, it will return 1 if it found bad words or 0 otherwise.
// Be sure to fill the $bads array with the bad words you want filtered.
function BadWordFilter(&$text, $replace)
{
//fill this array with the bad words you want to filter and their replacements
$bads = array (
array("butt","b***"),
array("poop","p***"),
array("crap","c***")
);
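if($replace==1) { //we are replacing
$remember = $text;
for($i=0;$i<sizeof($bads);$i++) { //go through each bad word
//eregi_replace() was removed in PHP 7; str_ireplace() is the case-insensitive equivalent for literal words
$text = str_ireplace($bads[$i][0],$bads[$i][1],$text); //replace it with its wildcard version
}
if($remember!=$text) return 1; //if there are any changes, return 1
} else { //we are just checking
for($i=0;$i<sizeof($bads);$i++) { //go through each bad word
//eregi() was removed in PHP 7; stripos() does a case-insensitive substring search
if(stripos($text,$bads[$i][0])!==false) return 1; //if we find any, return 1
}
}
return 0; //no bad words found
}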
//this will replace all bad words with their replacements. $any is 1 if it found any
$any = BadWordFilter($wordsToFilter,1);
//this will not replace any bad words. $any is 1 if it found any
$any = BadWordFilter($wordsToFilter,0);
?>
Many more examples of this can be found easily on the web.

Related

PHP mention system with usernames with space

I wanted to know if it's possible to make a PHP mention system with usernames with space ?
I tried this
preg_replace_callback('/#([a-zA-Z0-9]+)/', 'mentionUser', htmlspecialchars_decode($r['content']))
My function:
function mentionUser($matches) {
global $db;
$req = $db->prepare('SELECT id FROM members WHERE username = ?');
$req->execute(array($matches[1]));
if($req->rowCount() == 1) {
$idUser = $req->fetch()['id'];
return '<a class="mention" href="members/profile.php?id='.$idUser.'">'.$matches[0].'</a>';
}
return $matches[0];
}
It works, but not for the usernames with space...
I tried adding \s. It partly works, but preg_replace_callback then captures the username together with the following words of the message, so the mention doesn't appear...
Is there any solution ?
Thanks !

I know you said that you just removed the ability to add a space, but I still wanted to post a solution. To be clear, I don't necessarily think you should use this code, because it probably is just easier to keep things simple, but I think it should work still.
Your major problem is that almost every mention will incur two lookups, because in "#bob johnson went to the store" the username could be either bob or bob johnson, and there's no way to determine that without going to the database. Luckily, caching will greatly reduce this problem.
Below is some code that generally does what you are looking for. I made a fake database using just an array for clarity and reproducibility. The inline code comments should hopefully make sense.
function mentionUser($matches)
{
// This is our "database" of users
$users = [
'bob johnson',
'edward',
];
// First, grab the full match which might be 'name' or 'name name'
$fullMatch = $matches['username'];
// Create a search array where the key is the search term and the value is whether or not
// the search term is a subset of the value found in the regex
$names = [$fullMatch => false];
// Next split on the space. If there isn't one, we'll have an array with just a single item
$maybeTwoParts = explode(' ', $fullMatch);
// Basically, if the string contained a space, also search only for the first item before the space,
// and flag that we're using a subset
if (count($maybeTwoParts) > 1) {
$names[array_shift($maybeTwoParts)] = true;
}
foreach ($names as $name => $isSubset) {
// Search our "database"
if (in_array($name, $users, true)) {
// If it was found, wrap in HTML
$ret = '<span>#' . $name . '</span>';
// If we're in a subset, we need to append back on the remaining string, joined with a space
if ($isSubset) {
$ret .= ' ' . array_shift($maybeTwoParts);
}
return $ret;
}
}
// Nothing was found, return what was passed in
return '#' . $fullMatch;
}
// Our search pattern with an explicitly named capture.
// "/" is used as the delimiter because "#" appears inside the pattern itself.
$pattern = '/#(?<username>\w+(?:\s\w+)?)/';
// Three tests
assert('hello <span>#bob johnson</span> test' === preg_replace_callback($pattern, 'mentionUser', 'hello #bob johnson test'));
assert('hello <span>#edward</span> test' === preg_replace_callback($pattern, 'mentionUser', 'hello #edward test'));
assert('hello #sally smith test' === preg_replace_callback($pattern, 'mentionUser', 'hello #sally smith test'));
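The caching mentioned in this answer could be sketched roughly as follows. This is only an illustration: the array stands in for the members table, and makeCachedLookup and the query counter are hypothetical names, not part of the original code.

```php
<?php
// Hypothetical "database": username => id. A real version would run the prepared statement.
$fakeDb = ['bob johnson' => 1, 'edward' => 2];
$queries = 0;

// Returns a lookup function that remembers earlier answers, including misses.
function makeCachedLookup(array $fakeDb, int &$queries): callable
{
    $cache = [];
    return function (string $username) use ($fakeDb, &$queries, &$cache): ?int {
        // array_key_exists() is used so that cached misses (null) also count as cached.
        if (!array_key_exists($username, $cache)) {
            $queries++; // only real "database" hits are counted
            $cache[$username] = $fakeDb[$username] ?? null;
        }
        return $cache[$username];
    };
}

$lookup = makeCachedLookup($fakeDb, $queries);
$lookup('bob johnson');
$lookup('bob');
$lookup('bob johnson'); // served from the cache
$lookup('bob');         // cached miss, no new query
echo $queries, "\n";    // 2
```

Caching the misses matters here, because the "first word only" candidate is looked up for every two-word mention.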

Try this RegEx:
/#[a-zA-Z0-9]+( *[a-zA-Z0-9]+)*/g
It will find the leading # first, then one or more letters or digits. After that it matches zero or more groups, each consisting of optional spaces followed by one or more letters or digits.
I am assuming the username only contains A-Za-z0-9 and space.
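PHP's preg functions have no /g flag; preg_match_all() plays that role. Below is a minimal sketch of the pattern in PHP, assuming # is the mention marker as in the question. The helper name findMentionCandidates and the sample sentences are made up for illustration.

```php
<?php
// Same pattern as above, without the /g flag (PHP's preg functions do not accept it).
$pattern = '/#[a-zA-Z0-9]+( *[a-zA-Z0-9]+)*/';

// Hypothetical helper: returns every substring the pattern matches.
function findMentionCandidates(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$a = findMentionCandidates($pattern, 'hi #bob, how are you');
$b = findMentionCandidates($pattern, 'hello #bob johnson test');

print_r($a); // one entry: "#bob"
print_r($b); // one entry: "#bob johnson test"
```

The second result shows that the pattern is greedy: it swallows all following words until it hits punctuation, so each candidate still has to be checked against the user table.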

How can I str_replace partially in PHP in a dynamic string with unknown key content

Working in WordPress (PHP). I want to set strings to the database like below. The string is translatable, so it could be in any language keeping the template codes. For the possible variations, I presented 4 strings here:
<?php
$string = '%%AUTHOR%% changed status to %%STATUS_new%%';
$string = '%%AUTHOR%% changed status to %%STATUS_oldie%%';
$string = '%%AUTHOR%% changed priority to %%PRIORITY_high%%';
$string = '%%AUTHOR%% changed priority to %%PRIORITY_low%%';
To make the string human-readable, for the %%AUTHOR%% part I can change the string like below:
<?php
$username = 'Illigil Liosous'; // could be any unicode string
$content = str_replace('%%AUTHOR%%', $username, $string);
But for status and priority, I have different substrings of different lengths.
Question is:
How can I make those dynamic substring be replaced on-the-fly so that they could be human-readable like:
Illigil Liosous changed status to Newendotobulous;
Illigil Liosous changed status to Oldisticabulous;
Illigil Liosous changed priority to Highlistacolisticosso;
Illigil Liosous changed priority to Lowisdulousiannosso;
Those unsoundable words are to let you understand the nature of a translatable string, that could be anything other than known words.
I think I can proceed with something like below:
<?php
if( strpos($content, '%%STATUS_') !== false ) {
// proceed to push the translatable status string
}
if( strpos($content, '%%PRIORITY_') !== false ) {
// proceed to push the translatable priority string
}
But how can I fill inside those conditionals efficiently?
Edit
I may not have been fully clear in my question, hence this update. The issue is not related to using str_replace with arrays.
The issue is, the $string that I need to detect is not predefined. It would come like below:
if($status_changed) :
$string = "%%AUTHOR%% changed status to %%STATUS_{$status}%%";
elseif($priority_changed) : // "else if" is a parse error with the colon syntax
$string = "%%AUTHOR%% changed priority to %%PRIORITY_{$priority}%%";
endif;
Where they will be filled dynamically with values in the $status and $priority.
So when it comes to str_replace() I will actually use functions to get their appropriate labels:
<?php
function human_readable($codified_string, $user_id) {
if( strpos($codified_string, '%%STATUS_') !== false ) {
// need a way to get the $status extracted from the $codified_string
// $_got_status = ???? // I don't know how.
get_status_label($_got_status);
// the status label replacement would take place here, I don't know how.
}
if( strpos($codified_string, '%%PRIORITY_') !== false ) {
// need a way to get the $priority extracted from the $codified_string
// $_got_priority = ???? // I don't know how.
get_priority_label($_got_priority);
// the priority label replacement would take place here, I don't know how.
}
// Author name replacement takes place now
$username = get_the_username($user_id);
$human_readable_string = str_replace('%%AUTHOR%%', $username, $codified_string);
return $human_readable_string;
}
The function has some missing points where I currently am stuck. :(
Can you guide me a way out?

It sounds like you need to use RegEx for this solution.
You can use the following code snippet to get the effect you want to achieve:
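if (preg_match('/%%PRIORITY_(.*?)%%/', $codified_string, $matches)) {
// $matches[0] is the whole placeholder, $matches[1] is the captured key (e.g. "high")
$replace = get_priority_label($matches[1]);
$human_readable_string = str_replace($matches[0], $replace, $codified_string);
}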
Of course, the above code needs to be changed for STATUS and any other replacements that you require.
A short explanation of the RegEx:
/
The starting of any regular expression.
%%PRIORITY_
Is a literal match of those characters.
(
The opening of the capture group. Its content is stored in the array passed as the third parameter of preg_match.
.
This matches any character that isn't a new line.
*?
This matches between 0 and unlimited repetitions of the preceding token, in this case any character. The ? makes the match lazy, which is needed because the . would otherwise also match the %% characters.
Check out the RegEx in action: https://regex101.com/r/qztLue/1
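As a variation on the same idea, both placeholder families can be handled in one pass with preg_replace_callback. The sketch below is an assumption-laden illustration: demo_label and its label table are hypothetical stand-ins for the question's get_status_label() and get_priority_label(), and the labels are invented.

```php
<?php
// Hypothetical label lookup; stands in for get_status_label() / get_priority_label().
function demo_label(string $type, string $key): string
{
    $labels = [
        'STATUS'   => ['new' => 'New', 'oldie' => 'Old'],
        'PRIORITY' => ['high' => 'High', 'low' => 'Low'],
    ];
    // Fall back to the raw key when no label is known.
    return $labels[$type][$key] ?? $key;
}

function demo_human_readable(string $codified, string $username): string
{
    // Group 1 captures the placeholder type, group 2 the dynamic key.
    $out = preg_replace_callback(
        '/%%(STATUS|PRIORITY)_(.*?)%%/',
        function (array $m): string {
            return demo_label($m[1], $m[2]);
        },
        $codified
    );
    // The author name is filled in last, as in the question.
    return str_replace('%%AUTHOR%%', $username, $out);
}

echo demo_human_readable('%%AUTHOR%% changed priority to %%PRIORITY_high%%', 'Illigil Liosous'), "\n";
// Illigil Liosous changed priority to High
```

Adding another placeholder family then only means extending the alternation in the pattern and the label table.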

PHP performant search a text for given usernames

I am currently dealing with a performance issue where I cannot find a way to fix it. I want to search a text for usernames mentioned with the # sign in front. The list of usernames is available as PHP array.
The problem is usernames may contain spaces or other special characters. There is no limitation for it. So I can't find a regex dealing with that.
Currently I am using a function which takes the whole line after the # and checks character by character which usernames could match this mention, until just one username is left that fully matches the mention. But for a long text with 5 mentions it takes several seconds (!!!) to finish. For more than 20 mentions the script runs endlessly.
I have some ideas, but I don't know if they may work.
Going through username list (could be >1.000 names or more) and search for all #Username without regex, just string search. I would say this would be far more inefficient.
Checking with JavaScript, while the username is being written, whether it contains a space or punctuation mark, and then surrounding it with quotation marks, like #"User Name". I don't like that idea; it looks dirty to the user.
Don't start with one character, but maybe 4. and if no match, go back. So same principle like on sorting algorithms. Divide and Conquer. Could be difficult to implement and will maybe lead to nothing.
How does Facebook or twitter and any other site do this? Are they parsing the text directly while typing and saving the mentioned usernames directly in the stored text of the message?
This is my current function:
$regular_expression_match = '~(?:^|\\s)#(.+?)(?:\n|$)~'; // "~" as delimiter, since "#" is used inside the pattern
$matches = false;
$offset = 0;
while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
{
$line = $matches[1][0];
$search_string = substr($line, 0, 1);
$filtered_usernames = array_keys($user_list);
$matched_username = false;
// Loop, make the search string one by one char longer and see if we have still usernames matching
while (count($filtered_usernames) > 1)
{
$filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
$search_string = utf8_clean_string($search_string);
if (strlen($username_clean) == strlen($search_string))
{
if ($username_clean == $search_string)
{
$matched_username = $username_clean;
}
return false;
}
return (substr($username_clean, 0, strlen($search_string)) == $search_string);
});
if ($search_string == $line)
{
// We have reached the end of the line, so stop
break;
}
$search_string = substr($line, 0, strlen($search_string) + 1);
}
// If there is still one in filter, we check if it is matching
$first_username = reset($filtered_usernames);
if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
{
$matched_username = $first_username;
}
// We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
// So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
// This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
// and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
if ($matched_username)
{
$startpos = $matches[1][1];
// We need to get the endpos, cause the username is cleaned and the real string might be longer
$full_username = substr($post_text, $startpos, strlen($matched_username));
while (utf8_clean_string($full_username) != $matched_username)
{
$full_username = substr($post_text, $startpos, strlen($full_username) + 1);
}
$length = strlen($full_username);
$user_data = $user_list[$matched_username];
$mentioned[] = array_merge($user_data, array(
'type' => self::MENTION_AT,
'start' => $startpos,
'length' => $length,
));
}
$offset = $matches[0][1] + strlen($search_string);
}
Which way would you go? The problem is the text will be displayed often and parsing it every time will consume a lot of time, but I don't want to heavily modify what the user had entered as text.
I can't find out what's the best way, and even why my function is so time consuming.
A sample text would be:
Okay, #Firstname Lastname, I mention you!
Listen #[TEAM] John, you are a team member.
#Test is a normal name, but #Thât♥ should be tracked too.
And see #Wolfs garden! I just mean the Wolf.
Usernames in that text would be
Firstname Lastname
[TEAM] John
Test
Thât♥
Wolf
So yes, there is clearly nothing I know where a name may end. Only thing is the newline.

I think the main problem is that you can't distinguish usernames from text, and it's a bad idea to look up maybe thousands of usernames in a text. This can also lead to further problems, for example that John is part of [TEAM] John or JohnFoo...
The usernames need to be separated from the other text. Assuming that you're using UTF-8, you could put the usernames between an invisible zero-width space \xE2\x80\x8B and a zero-width non-joiner \xE2\x80\x8C.
The usernames can then be extracted quickly and with little effort and, if needed, still be verified in the database.
$txt = "
Okay, #\xE2\x80\x8BFirstname Lastname\xE2\x80\x8C, I mention you!
Listen #\xE2\x80\x8B[TEAM] John\xE2\x80\x8C, you are a team member.
#\xE2\x80\x8BTest\xE2\x80\x8C is a normal name, but
#\xE2\x80\x8BThât♥\xE2\x80\x8C should be tracked too.
And see #\xE2\x80\x8BWolfs\xE2\x80\x8C garden! I just mean the Wolf.";
// extract usernames
if(preg_match_all('~#\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $txt, $out)){
print_r($out[0]);
}
Array
(
[0] => Firstname Lastname
[1] => [TEAM] John
[2] => Test
[3] => Thât♥
[4] => Wolfs
)
echo $txt;
Okay, #Firstname Lastname, I mention you!
Listen #[TEAM] John‌, you are a team member.
#Test‌ is a normal name, but
#Thât♥‌ should be tracked too.
And see #Wolfs‌ garden! I just mean the Wolf.
You could use any characters you like for separation, as long as they are unlikely to occur elsewhere in the text.
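A rough sketch of how the delimiters might be written when a post is saved and turned into links when it is displayed. The helper names wrapMention and renderMentions, the $userIds map (a stand-in for a database lookup) and the profile URL are assumptions for illustration only.

```php
<?php
// Delimiters from the answer above: zero-width space and zero-width non-joiner (UTF-8 bytes).
const MENTION_OPEN  = "\xE2\x80\x8B";
const MENTION_CLOSE = "\xE2\x80\x8C";

// Hypothetical helper: wrap a confirmed username when the post is saved.
function wrapMention(string $username): string
{
    return '#' . MENTION_OPEN . $username . MENTION_CLOSE;
}

// Hypothetical helper: turn delimited mentions into links at display time.
// $userIds maps username => id and stands in for a database lookup.
function renderMentions(string $text, array $userIds): string
{
    $pattern = '~#' . MENTION_OPEN . '(.*?)' . MENTION_CLOSE . '~s';
    return preg_replace_callback($pattern, function (array $m) use ($userIds): string {
        $name = $m[1];
        if (!isset($userIds[$name])) {
            return '#' . htmlspecialchars($name); // unknown user: plain text
        }
        return '<a class="mention" href="members/profile.php?id=' . (int) $userIds[$name] . '">#'
            . htmlspecialchars($name) . '</a>';
    }, $text);
}

$post = 'Listen ' . wrapMention('[TEAM] John') . ', you are a team member.';
echo renderMentions($post, ['[TEAM] John' => 7]), "\n";
```

Only the mention itself is escaped in this sketch; the surrounding post text would still need its own escaping before output.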

php Swear code doesn't quite work

I have this code below which works ok ish.
$swearWords = file("blacklist.txt");
foreach ($swearWords as $naughty)
{
$post = str_ireplace(rtrim($naughty), "<b><i>(oops)</i></b>", $post);
}
The problem is with words that contain the swear words.
For instance, "Scunthorpe" has a bad word within it. This code changes it to S(oops)horpe.
Any ideas how i can fix this ? do I need to

You can replace your str_ireplace() with a preg_replace that ignores words that have leading and/or trailing letters, so a swear word is only replaced if it's standing alone:
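$post = "some Scunthorpe text";
$newpost = $post;
// FILE_IGNORE_NEW_LINES strips the trailing newline from every entry
$swearWords = file("blacklist.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($swearWords as $naughty)
{
$naughty = preg_quote($naughty, '/');
// lookarounds: match only when no letter directly precedes or follows the word
$newpost = preg_replace("/(?<![a-z]){$naughty}(?![a-z])/i", "<b><i>(oops)</i></b>", $newpost);
if ($newpost === null) break; // preg_replace() returns null on error
}
if ($newpost !== null) $post = $newpost;
else echo "an error occurred during regex replacement";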
Note that it still allows swear words like "aCUNT", "soFUCKINGstupid", ... I don't know how you could even handle that.

Swear and profanity filters are notoriously bad at catching "false positives".
The easiest way of dealing with these, in dictionary terms is to use a whitelist (in a similar way to your blacklist). A list of words that contain matches, but that are essentially allowed.
It's worth reading this: How do you implement a good profanity filter, which details the pros and cons.
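One possible shape for the whitelist idea, as a minimal sketch. The two word lists are invented examples (in practice they would come from files such as blacklist.txt), and filterWithWhitelist is a hypothetical name.

```php
<?php
// Hypothetical lists; in practice these would be loaded from files.
$blacklist = ['crap', 'poop'];
$whitelist = ['scrapbook', 'scrape'];

function filterWithWhitelist(string $post, array $blacklist, array $whitelist): string
{
    // Look at each run of letters; spaces and punctuation are left untouched.
    return preg_replace_callback('/[a-z]+/i', function (array $m) use ($blacklist, $whitelist): string {
        $word = $m[0];
        if (in_array(strtolower($word), $whitelist, true)) {
            return $word; // explicitly allowed, even though it contains a match
        }
        // Otherwise mask every blacklisted fragment inside the word.
        return str_ireplace($blacklist, '(oops)', $word);
    }, $post);
}

echo filterWithWhitelist('My scrapbook is crap, total crappola.', $blacklist, $whitelist), "\n";
// My scrapbook is (oops), total (oops)pola.
```

The whitelist has to be maintained by hand, which is the main cost of this approach.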

This oughta do it:
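$swearWords = file("blacklist.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // without these flags every entry ends in a newline and never matches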
$post_words = preg_split("/\s+/", $post);
foreach ($swearWords as $naughty)
{
foreach($post_words as &$word)
{
if(stripos($word, $naughty) !== false)
{
$word = "<b><i>(oops)</i></b>";
}
}
}
$post = implode(' ', $post_words);
So what's happening? It loads your swear words and loops through them. For each one it loops through all the words in the post and checks whether the current swear word occurs in the current word. If it does, it replaces the whole word with your 'oops'.
Note that this will remove any whitespace formatting, so check this suits your situation first (do you care about tab characters or multiple sequential spaces?)
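If the whitespace does matter, one option is to keep the separators when splitting. The sketch below assumes a small in-memory word list in place of blacklist.txt, and maskWordsKeepWhitespace is a hypothetical name.

```php
<?php
// Hypothetical word list standing in for the contents of blacklist.txt.
$swearWords = ['crap', 'poop'];

function maskWordsKeepWhitespace(string $post, array $swearWords): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the whitespace runs as their own array entries,
    // so joining the pieces again restores the original spacing exactly.
    $pieces = preg_split('/(\s+)/', $post, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        foreach ($swearWords as $naughty) {
            // Skip empty entries: an empty needle would match every piece.
            if ($naughty !== '' && stripos($piece, $naughty) !== false) {
                $pieces[$i] = '<b><i>(oops)</i></b>';
                break;
            }
        }
    }
    return implode('', $pieces);
}

echo maskWordsKeepWhitespace("what\ta  crap\nday", $swearWords);
```

The tab, the double space and the newline in the sample input all survive, because they are re-joined unchanged.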

preg_match and reg expression using alphanumeric, commas, periods, exclamations, etc

I am having one hell of a time coming up with a decent way to make this if statement search a file for these codes. I set up the text file to read from as such:
myfile.txt
r)
0Y7
1a6
q.
#g
#(
#a
!P
T[
V}
0,
Here is a brief of what I got going.
$subject = file_get_contents(fvManager_Path . 'myfile.txt');
if ( preg_match('/^[a-zA-Z0-9,]+$/',$result['fmbushels_itemCode'], $subject) ) {
Basically I am trying to search the text file line by line to see if the whole string exists. They are case sensitive as well.
$result['fmbushels_itemCode'] is from a sql query and always returns a code like the above in the text.
I'd appreciate any help on this. If someone knows a better way of doing this or a different command, I'd be willing to give that a shot as well :)
edit:
private function _fvShareBushels() {
$subject = file_get_contents(fvManager_Path . 'myfile.txt');
if (count($vShareArray) > 0) {
$vCntMoves = count($vShareArray);
for ($vI = 0;$vI < $vRunMainLoop;$vI++) {
sell $result['fmbushels_itemCode']);
}
}
}
This is a snippet of a big code. I had to rip most out because of post limitation. The area I could be working with is:
if (count($vShareArray) > 0) {
If I could make this something like:
if (count($vShareArray) > 0 && $result['fmbushels_itemCode'] **is not in** $subject) {

If you want to do line by line, use the file() function.
$f = file(fvManager_Path . 'myfile.txt');
foreach($f AS $line){
// $line is current line at file
}
I'm not too sure you completely understand how preg_match works. The first parameter is the regular expression pattern, the second is the string you want to match the pattern against, and the third is an array that receives the matches: index 0 holds the text that matched the whole pattern, and each capture group in the pattern adds a further index.
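A small sketch of the three parameters in action. The subject string and the pattern are made up for illustration; only the code 0Y7 is borrowed from the sample file.

```php
<?php
// Hypothetical input; the pattern has one capture group for illustration.
$subject = 'code: 0Y7';
$found = preg_match('/code: ([a-zA-Z0-9]+)/', $subject, $matches);

// preg_match() returns 1 on a match and 0 on no match (false on error).
// $matches[0] holds the text matched by the whole pattern,
// $matches[1] the text matched by the first parenthesised group.
var_dump($found);
echo $matches[0], "\n"; // code: 0Y7
echo $matches[1], "\n"; // 0Y7
```

This also shows why passing the file contents as the third argument does not work: whatever is in that variable gets overwritten with the match results.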
I'm not 100% on what you're trying to accomplish. Are you trying to see if the $result['fmbushels_itemCode'] exists in the file?
If the above is the correct case you simply just need to do something like:
$f = file('myfile.txt');
$f = array_map('trim', $f); // array_map() returns a new array, so the result must be assigned
if(in_array($result['fmbushels_itemCode'], $f)){
// success
}
