Sentence Comparison: Ignore Words

I need some help with sentence comparison.
$answer = "This is the (correct and) acceptable answer. Content inside the parenthesis are ignored if not present in the user's answer. If it is present, it should not count against them.";
$response = "This is the correct and acceptable answer. Content inside the parenthesis are ignored if not present in the user's answer. If it is present, it should not count against them.";
echo "<strong>Acceptable Answer:</strong>";
echo "<pre style='white-space:normal;'>$answer</pre><hr/>";
echo "<strong>User's Answer:</strong>";
echo "<pre>".$response."</pre>";
// strip content in brackets
$answer = preg_replace("/\([^)]*\)|[()]/", "", $answer);
// strip punctuation
$answer = preg_replace("/[^a-zA-Z 0-9]+/", " ", $answer);
$response = preg_replace("/[^a-zA-Z 0-9]+/", " ", $response);
$common = similar_text($answer, $response, $percent);
$orgcount = strlen($answer);
printf("The user's response has %d/$orgcount characters in common (%.2f%%).", $common, $percent);
Basically, what I want to do is ignore parenthesised words. For example, in the $answer string, "correct and" is in parentheses; because of this, I don't want these words to count against the user's response. So if the user has these words, it doesn't count against them, and if the user doesn't have these words, it doesn't count against them either.
Is this possible?

Thanks to the comments, I've written a solution. Since it's a "long" process, I thought I'd put it in a function.
EDIT: After debugging, it turned out that strpos() was causing trouble when the match was at position 0 (which is falsy), so the check now compares strictly against false:
$answer = "(This) is the (correct and) acceptable answer. (random this will not count) Content inside the parenthesis are ignored if not present in the user's answer. If it is present, it should not count against them.";
$response = "This is the correct and acceptable answer. Content inside the parenthesis are ignored if not present in the user's answer. If it is present, it should not count against them.";
echo 'The user\'s response has '.round(compare($answer, $response),2).'% characters in common'; // The user's response has 100% characters in common
function compare($answer, $response){
preg_match_all('/\((?P<parenthesis>[^\)]+)\)/', $answer, $parenthesis);
$catch = $parenthesis['parenthesis'];
foreach($catch as $words){
if(strpos($response, $words) !== false){ // if it does exist then remove brackets
$answer = str_replace('('.$words.')', $words, $answer);
}else{ //if it does not exist remove the brackets with the words
$answer = str_replace('('.$words.')', '', $answer);
}
}
/* To sanitize */
$answer = preg_replace(array('/[^a-zA-Z0-9]+/', '/ +/'), array(' ', ' '), $answer);
$response = preg_replace(array('/[^a-zA-Z 0-9]+/', '/ +/'), array(' ', ' '), $response);
$common = similar_text($answer, $response, $percent);
return($percent);
}
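A more compact variant of the same idea, as a sketch only: the helper name compare_optional() is hypothetical, and it assumes the parenthesised groups are never nested. Each optional group is kept only when the response contains it, then both strings are normalised before similar_text() runs.

```php
<?php
// Hypothetical helper: optional (parenthesised) words never count against the user.
function compare_optional(string $answer, string $response): float
{
    // Keep each optional group only if the response actually contains it.
    $answer = preg_replace_callback(
        '/\(([^)]+)\)/',
        function (array $m) use ($response): string {
            return stripos($response, $m[1]) !== false ? $m[1] : '';
        },
        $answer
    );

    // Normalise both strings: punctuation and repeated spaces become one space.
    $normalise = function (string $s): string {
        return trim(preg_replace('/[^a-zA-Z0-9]+/', ' ', $s));
    };

    similar_text($normalise($answer), $normalise($response), $percent);
    return $percent;
}

// Optional words missing from the response, then present in it.
echo round(compare_optional('This is the (correct and) acceptable answer.', 'This is the acceptable answer.'), 2), "\n";
echo round(compare_optional('This is the (correct and) acceptable answer.', 'This is the correct and acceptable answer.'), 2), "\n";
```

Both calls should report a full match, because the optional words are dropped from the answer in the first case and un-bracketed in the second.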

Related

Check if hashtag is at the start OR middle of string php

I'm currently grabbing all #hashtags found in a string (i.e. a tweet). Works as it should.
However, I want to find hashtags that are only at the start of the string OR in the MIDDLE of the string (or close enough to it). In other words, find all hashtags that aren't at the end of the string.
Bonus points, if you can also point me in the direction on how to see if a hashtag exists at the end of the string as well.
$tweet = 'This is an #example tweet';
preg_match_all('/(#\w+)/u', $tweet, $matches);
if (!empty($matches[0])) {
// found hashtag(s)
}

// Check if Hashtag is last word; the strpos and explode way:
$tweet = 'This is an #example #tweet';
$words = explode(" ", $tweet);
$last_word = end($words);
// count the number of times "#" occurs in the $tweet.
// if # occurs more than once, $exists_anywhere is TRUE
$exists_anywhere = substr_count($tweet,'#') > 1;
if(strpos($last_word,'#') !== FALSE ) {
// last word has #
}
From the preg_match() documentation:
Do not use preg_match() if you only want to check if one string is
contained in another string. Use strpos() or strstr() instead as they
will be faster.

preg_match_all('/(?!#\w+\W+$)(#\w+)\W/', $tweet, $result);
"This is a #tweet folks" will catch #tweet.
"#Second example of #tweet folks" will catch #Second and #tweet.
"#Another example of #tweet" will catch #Another but not #tweet (even if it ends in !, ., or any other non-word character).
"We're almost done, #yup!" won't catch anything.
"Last one #tweet! good night" will catch #tweet.
Of course, all your hashtags (captures) will be stored in $result[1].

To match at the beginning only:
/^(#\w+)/
To look for a specific #hashtag:
/^#tweet/
To match the first hashtag in the middle (some text before it, and at least one more word after it):
/^[^#]+(#\w+)(?=\W+\w)/
To look for a specific #hashtag:
/^[^#]+#tweet\b(?=\W+\w)/
To match at the end only:
/(#\w+)$/
To look for a specific #hashtag:
/#tweet$/
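The three cases above can also be handled in one pass. The following is a sketch only: classify_hashtags() is a hypothetical helper, it assumes ASCII hashtags, and it treats a hashtag as being at the end when only non-word characters (such as ! or .) follow it.

```php
<?php
// Hypothetical helper: label each hashtag as "start", "middle" or "end".
function classify_hashtags(string $tweet): array
{
    $result = array();
    $tweet = trim($tweet);
    // PREG_OFFSET_CAPTURE returns array(match, byte offset) pairs.
    preg_match_all('/#\w+/', $tweet, $matches, PREG_OFFSET_CAPTURE);

    foreach ($matches[0] as $match) {
        list($tag, $offset) = $match;
        // Text after the hashtag; if it holds no word characters, the tag is last.
        $rest = substr($tweet, $offset + strlen($tag));
        if ($offset === 0) {
            $position = 'start';
        } elseif (!preg_match('/\w/', $rest)) {
            $position = 'end';
        } else {
            $position = 'middle';
        }
        $result[] = array($tag, $position);
    }
    return $result;
}

print_r(classify_hashtags('#Second example of #tweet folks and #bye!'));
```

Filtering the result for "start" and "middle" gives the hashtags that are not at the end of the string.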

OK, personally I would turn the string into an array of words:
$words = explode(' ', $tweet);
Then run a check on the first word:
if (preg_match_all('/(#\w+)/u', $words[0], $matches)) {
//first word has a hashtag
}
Then you can simply walk through the rest of the array for hashtags in the middle.
And finally, to check the last word:
$sizeof = count($words) - 1;
if (preg_match_all('/(#\w+)/u', $words[$sizeof], $matches)) {
//last word has a hashtag
}

UPDATE / EDIT
$tweet = "Hello# It's #a# beaut#iful day. #tweet";
$tweet_arr = explode(" ", $tweet);
$arrCount = 0;
foreach ($tweet_arr as $word) {
$arrCount ++;
if (strpos($word, '#') !== false) {
$end = substr($word, -1);
$beginning = substr($word, 0, 1);
$middle_string = substr($word, 1, -1);
if ($beginning === "#") {
echo "hash is at the beginning on word " . $arrCount . "<br />";
}
if (strpos($middle_string, '#') !== false) {
$charNum = strpos($middle_string, '#') + 1;
echo "hash is in the middle at character number " . $charNum . " on word " . $arrCount . "<br />";
}
if ($end === "#") {
echo "hash is at the end on word " . $arrCount . "<br />";
}
}
}

Regex to disallow two characters in a row

I'm trying to modify this regex pattern so that it disallows two specified characters in a row or at the start/end -
/^[^\!\"\£\$\%\^\&\*\(\)\[\]\{\}\#\~\#\/\>\<\\\*]+$/
So at the moment it prevents these characters anywhere in the string, but I also want to stop the following from happening with these characters:
any spaces, apostrophes ', underscores _, hyphens - or dots . appearing at the start or end of the string
also prevent any two of these characters in a row, i.e. '' or _._ or ' -__- ' .
Any help would be hugely appreciated.
Thanks a lot

One way
/^(?=[^!"£$%^&*()[\]{}#~#\/><\\*]+$)(?!.*[ '_.-]{2})[^ '_.-].*[^ '_.-]$/
Note, only tested as javascript regex, i.e.
var rex = /^(?=[^!"£$%^&*()[\]{}#~#\/><\\*]+$)(?!.*[ '_.-]{2})[^ '_.-].*[^ '_.-]$/;
rex.test('okay'); // true
rex.test('_not okay'); // false
Or, to match on disallowed patterns
/^[ '_.-]|[ '_.-]$|[!"£$%^&*()[\]{}#~#\/><\\*]|[ '_.-]{2}/
The first regex will only match strings that contain no disallowed patterns.
The one above will match any disallowed patterns in a string.
Update
Now tested briefly using php. The only difference is that the " in the character set needed to be escaped.
<?php
$test = 'some string';
$regex = "/^[ '_.-]|[ '_.-]$|[!\"£$%^&*()[\]{}#~#\/><\\*]|[ '_.-]{2}/";
if ( preg_match( $regex, $test ) ) {
echo 'Disallowed!';
}

$tests[1] = "fail_.fail"; // doubles
$tests[] = "fail_-fail";
$tests[] = "fail_ fail";
$tests[] = "fail  fail";
$tests[] = "fail -fail";
$tests[] = "pas.s_1";
$tests[] = "pa.s-s_2"; // singles
$tests[] = "pas.s_3";
$tests[] = "p.a.s.s_4";
$tests[10] = "pa s-s_5";
$tests[] = "fail fail'"; // pre or post-pended
$tests[] = " fail fail";
$tests[] = " fail fail";
$tests[] = "fail fail_";
$tests[15] = "fail fail-";
// The list of disallowed characters. There is no need to escape.
// This will be done with the function preg_quote.
$exclude = array(" ","'", "_", ".", "-");
$pattern = "#[" . preg_quote(join("", $exclude)) . "]{2,}#s";
// run through the simple test cases
foreach($tests as $k=>$test){
if(
in_array(substr($test, 0, 1), $exclude)
|| in_array(substr(strrev($test), 0 , 1) , $exclude))
{
echo "$k thats a fail" . PHP_EOL;
continue;
}
$test = preg_match( $pattern, $test);
if($test === 1){
echo "$k - thats a fail". PHP_EOL ;
}else{
echo "$k - thats a pass $test ". PHP_EOL ;
}
}
Stealing hopelessly from other replies, I'd advocate using PHP's simple in_array to check the start and end of the string first and just fail early on discovering something bad.
If the test gets past that, then run a really simple regex.
Stick the lot into a function and return false on failure. That would remove quite a few of the verbose lines I added. You could even pass in the exclusion array as a variable, but it would be a rather specific function, so that may be YAGNI.
E.g.
if( badString($exclude_array, $input) ) // do stuff
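A possible shape for that function, as a sketch only: the name badString() comes from the call above, but its body, its boolean return value (true means the string is bad) and the decision to reject empty input are assumptions.

```php
<?php
// Hypothetical helper: returns true when $input starts or ends with an
// excluded character, or contains two or more excluded characters in a row.
function badString(array $exclude, string $input): bool
{
    if ($input === '') {
        return true; // assumption: an empty string is rejected too
    }
    // Fail early on a bad first or last character.
    if (in_array($input[0], $exclude, true) || in_array(substr($input, -1), $exclude, true)) {
        return true;
    }
    // Then look for two or more excluded characters in a row.
    $pattern = '#[' . preg_quote(implode('', $exclude), '#') . ']{2,}#';
    return preg_match($pattern, $input) === 1;
}

$exclude = array(' ', "'", '_', '.', '-');
var_dump(badString($exclude, 'fail_.fail')); // two excluded characters in a row
var_dump(badString($exclude, 'pa.s-s_2'));   // single separators only
var_dump(badString($exclude, 'fail fail_')); // excluded character at the end
```

Passing the exclusion list as a parameter keeps the check reusable, at the cost of rebuilding the pattern on every call.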

I'm not sure I understand the exact problem, but here's a suggestion:
<?php
$test = "__-Remove '' _._ or -__- but not foo bar '. _ \n";
$expected = 'Remove or but not foo bar';
// The list of disallowed characters. There is no need to escape.
// This will be done with the function preg_quote.
$excluded_of_bounds = "'_.-";
// Remove disallowed characters from start/end of the string.
// We add the space characters that should not be in the regexp.
$test = trim($test, $excluded_of_bounds . " \r\n");
// In two passes
$patterns = array(
// 1/ We remove every run of two or more disallowed characters,
// except for spaces
'#[' . preg_quote($excluded_of_bounds) . ']{2,}#',
// 2/ We replace the successive spaces by a unique space.
'#\s{2,}#',
);
$remplacements = array('', ' ');
// Go!
$test = preg_replace($patterns, $remplacements, $test);
// bool(true)
var_dump($expected === $test);

Get the current + the next word in a string

This is what I am trying to get. Given the text:
My longest text to test
when I search for e.g. My, I should get My longest.
I tried it with this function to get first the complete length of the input and then I search for the ' ' to cut it.
$length = strripos($text, $input) + strlen($input)+2;
$stringpos = strripos($text, ' ', $length);
$newstring = substr($text, 0, strpos($text, ' ', $length));
But this only works the first time; after that it cuts right after the current input, meaning that
My lon gives My longest and not My longest text.
How must I change this to get the right result, always including the next word? Maybe I need a break, but I cannot find the right solution.
UPDATE
Here is my workaround until I find a better solution. As I said, working with array functions does not work, since partial words should work too. So I extended my previous idea a bit. The basic idea is to distinguish between the first word and the following ones. I improved the code a bit.
function get_title($input, $text) {
$length = strripos($text, $input) + strlen($input);
$stringpos = stripos($text, ' ', $length);
// Find next ' '
$stringpos2 = stripos($text, ' ', $stringpos+1);
if (!$stringpos || !$stringpos2) {
// No further space: the following word is the last one, so keep everything
$newstring = $text;
} else {
$newstring = substr($text, 0, $stringpos2);
}
return $newstring;
}
Not pretty, but hey, it seems to work ^^. Anyway, maybe one of you has a better solution.

You can try using explode
$string = explode(" ", "My longest text to test");
$key = array_search("My", $string);
echo $string[$key] , " " , $string[$key + 1] ;
You can take it to the next level with a case-insensitive search using preg_match_all
$string = "My longest text to test in my school that is very close to mY village" ;
var_dump(__search("My",$string));
Output
array
0 => string 'My longest' (length=10)
1 => string 'my school' (length=9)
2 => string 'mY village' (length=10)
Function used
function __search($search,$string)
{
$result = array();
preg_match_all('/' . preg_quote($search) . '\s+\w+/i', $string, $result);
return $result[0];
}

There are simpler ways to do that. String functions are useful if you don't want to look for something specific, but cut out a pre-defined length of something. Else use a regular expression:
preg_match('/My\s+\w+/', $string, $result);
print $result[0];
Here the My looks for the literal first word. And \s+ for some spaces. While \w+ matches word characters.
This adds some new syntax to learn. But less brittle than workarounds and lengthier string function code to accomplish the same.
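Since the question also needs partially typed words to work, the same regular expression idea can be extended. This is a sketch only: current_and_next_word() is a hypothetical helper, and it assumes the result should start at the beginning of the text, as in the question's substr() call.

```php
<?php
// Hypothetical helper: returns the text up to and including the word that
// follows the (possibly partial) search input, or null if nothing matches.
function current_and_next_word(string $input, string $text): ?string
{
    // \w* completes a partially typed word, (?:\s+\w+)? adds the following word.
    $pattern = '/^.*?' . preg_quote($input, '/') . '\w*(?:\s+\w+)?/is';
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

echo current_and_next_word('My', 'My longest text to test'), "\n";     // My longest
echo current_and_next_word('My lon', 'My longest text to test'), "\n"; // My longest text
```

preg_quote() keeps user input such as "." or "?" from being interpreted as regex syntax.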

An easy method would be to split it on whitespace and grab the current array index plus the next one:
// Word to search for:
$findme = "text";
// Using preg_split() to split on any amount of whitespace
// lowercasing the words, to make the search case-insensitive
$words = preg_split('/\s+/', "My longest text to test");
// Find the word in the array with array_search()
// calling strtolower() with array_map() to search case-insensitively
$idx = array_search(strtolower($findme), array_map('strtolower', $words));
if ($idx !== FALSE) {
// If found, print the word and the following word from the array
// as long as the following one exists.
echo $words[$idx];
if (isset($words[$idx + 1])) {
echo " " . $words[$idx + 1];
}
}
// Prints:
// "text to"

Check stock tickers in string against array

Consider the following array which holds all US stock tickers, ordered by length:
$tickers = array('AAPL', 'AA', 'BRK.A', 'BRK.B', 'BAE', 'BA'); // etc...
I want to check a string for all possible matches. Tickers are written with or without a "$" concatenated to the front:
$string = 'Check out $AAPL and BRK.A, BA and BAE.B - all going up!';
All tickers are to be labeled like: {TICKER:XX}. The expected output would be:
Check out {TICKER:AAPL} and {TICKER:BRK.A}, {TICKER:BA} and BAE.B - all going up!
So tickers should be checked against the $tickers array and matched both if they are followed by a space or a comma. Until now, I have been using the following:
preg_replace('/\$([a-zA-Z.]+)/', ' {TICKER:$1} ', $string);
so I didn't have to check against the $tickers array. It was assumed that all tickers started with "$", but this only appears to be the convention in about 80% of the cases. Hence, the need for an updated filter.
My question being: is there a simple way to adjust the regex to comply with the new requirement or do I need to write a new function, as I was planning first:
function match_tickers($string) {
foreach ($tickers as $ticker) {
// preg_replace with $
// preg_replace without $
}
}
Or can this be done in one go?

Just make the leading dollar sign optional, using ? (zero or 1 matches). Then you can check for legal trailing characters using the same technique. A better way to go about it would be to explode your input string and check/replace each substring against the ticker collection, then reconstruct the input string.
function match_tickers($string, $tickers) {
$replacements = array();
foreach (explode(" ", $string) as $word) {
// split the word into an optional "$", a candidate symbol and trailing punctuation
if (preg_match('/^\$?([A-Za-z]+(?:\.[A-Za-z]+)?)(\W*)$/', $word, $m) && in_array($m[1], $tickers)) {
// known symbol, label it and keep the punctuation
$replacements[] = '{TICKER:' . $m[1] . '}' . $m[2];
}
else { // not a symbol, just output it normally
$replacements[] = $word;
}
}
return implode(" ", $replacements);
}

I think just a slight change to your regex should do the trick:
\$?([a-zA-Z.]+)
I added "?" after the "\$", which means that the dollar sign can appear 0 or 1 times.

You can use a single foreach loop on your array to replace the ticker items in your string.
$tickers = array('AAPL', 'AA', 'BRK.A', 'BRK.B', 'BAE', 'BA');
$string = 'Check out $AAPL and BRK.A, BA and BAE.B - all going up!';
foreach ($tickers as $ticker) {
$string = preg_replace('/(\$?)\b('.preg_quote($ticker, '/').')\b(?!\.[A-Z])/', '{TICKER:$2}', $string);
}
echo $string;
will output
Check out {TICKER:AAPL} and {TICKER:BRK.A}, {TICKER:BA} and BAE.B - all going up!

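Adding ? after the $ sign will also accept ordinary words, e.g. 'out'.
preg_replace accepts an array of patterns, so if you change your $tickers array to one pattern per ticker (each with a capture group, so that $1 is filled in, and word boundaries, so that AA does not match inside AAPL):
$tickers = array('/\$?\b(AAPL)\b(?!\.\w)/', '/\$?\b(AA)\b(?!\.\w)/', '/\$?\b(BRK\.A)\b(?!\.\w)/', '/\$?\b(BRK\.B)\b(?!\.\w)/', '/\$?\b(BAE)\b(?!\.\w)/', '/\$?\b(BA)\b(?!\.\w)/');
then this should do the trick:
preg_replace($tickers, '{TICKER:$1}', $string);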
This is according to http://php.net/manual/en/function.preg-replace.php
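To do this in one go while still checking against the plain ticker list, preg_replace_callback can match any ticker-shaped candidate and label it only if it is known. This is a sketch only: label_tickers() is a hypothetical helper, and it assumes tickers consist of letters with optional ".LETTERS" suffixes.

```php
<?php
// Hypothetical helper: label known tickers in a single pass.
function label_tickers(string $string, array $tickers): string
{
    // Flip the list so that membership checks are simple key lookups.
    $known = array_flip($tickers);

    // Candidate: optional "$", letters, then optional ".LETTERS" suffixes.
    return preg_replace_callback(
        '/\$?\b([A-Za-z]+(?:\.[A-Za-z]+)*)\b/',
        function (array $m) use ($known): string {
            // Only label candidates that are in the ticker list.
            return isset($known[$m[1]]) ? '{TICKER:' . $m[1] . '}' : $m[0];
        },
        $string
    );
}

$tickers = array('AAPL', 'AA', 'BRK.A', 'BRK.B', 'BAE', 'BA');
echo label_tickers('Check out $AAPL and BRK.A, BA and BAE.B - all going up!', $tickers);
// Check out {TICKER:AAPL} and {TICKER:BRK.A}, {TICKER:BA} and BAE.B - all going up!
```

Because each candidate is consumed as a whole, BAE.B is looked up as one symbol and left alone, and the order of the ticker list no longer matters.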

determine if a string contains one of a set of words in an array

I need a simple word filter that will kill a script if it detects a filtered word in a string.
say my words are as below
$showstopper = array('badword1', 'badword2', 'badword3', 'badword4');
$yourmouth = "im gonna badword3 you up";
if(something($yourmouth, $showstopper)){
//stop the show
}

You could implode the array of badwords into a regular expression, and see if it matches against the haystack. Or you could simply cycle through the array, and check each word individually.
From the comments:
$re = "/(" . implode("|", $showstopper) . ")/"; // '/(badword1|badword2)/'
if (preg_match($re, $yourmouth) > 0) { die("foulmouth"); }
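A slightly more defensive version of the regular expression approach, as a sketch only: first_bad_word() is a hypothetical helper, and the case-insensitive, whole-word matching is an assumption about what the filter should do.

```php
<?php
// Hypothetical helper: returns the first filtered word found in $text, or null.
function first_bad_word(string $text, array $badwords): ?string
{
    if ($badwords === array()) {
        return null; // an empty alternation would match everywhere
    }
    // Quote each word so characters such as "." or "/" are matched literally.
    $quoted = array_map(function (string $w): string {
        return preg_quote($w, '/');
    }, $badwords);

    // \b keeps "badword1" from matching inside a longer word.
    $re = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    return preg_match($re, $text, $m) ? $m[0] : null;
}

var_dump(first_bad_word('im gonna badword3 you up', array('badword1', 'badword3')));
```

Returning the matched word rather than a boolean makes it easy to log what triggered the filter before stopping the script.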

in_array() is your friend
$yourmouth_array = explode(' ',$yourmouth);
foreach($yourmouth_array as $key=>$w){
if (in_array($w,$showstopper)){
// stop the show, like, replace that element with '***'
$yourmouth_array[$key]= '***';
}
}
$yourmouth = implode(' ',$yourmouth_array);

You might want to benchmark this vs the foreach and preg_match approaches.
$showstopper = array('badword1', 'badword2', 'badword3', 'badword4');
$yourmouth = "im gonna badword3 you up";
$check = str_replace($showstopper, '****', $yourmouth, $count);
if($count > 0) {
//stop the show
}

A fast solution involves checking the key as this does not need to iterate over the array. It would require a modification of your bad words list, however.
$showstopper = array('badword1' => 1, 'badword2' => 1, 'badword3' => 1, 'badword4' => 1);
$yourmouth = "im gonna badword3 you up";
// split words on space
$words = explode(' ', $yourmouth);
foreach($words as $word) {
// filter extraneous characters out of the word
$word = preg_replace('/[^A-Za-z0-9]*/', '', $word);
// check for bad word match
if (isset($showstopper[$word])) {
die('game over');
}
}
The preg_replace ensures users don't abuse your filter by typing something like bad_word3. It also ensures the array key check doesn't bomb.
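If the bad words list has to stay a plain list, array_flip() can build the keyed version on the fly. This is a sketch only: contains_bad_word() is a hypothetical helper, and it assumes the list is stored in lowercase.

```php
<?php
// Hypothetical helper: true if any word of $text is in the plain $badwords list.
function contains_bad_word(string $text, array $badwords): bool
{
    // array_flip turns array('a', 'b') into array('a' => 0, 'b' => 1),
    // so each check below is a key lookup rather than a scan.
    $lookup = array_flip($badwords);

    foreach (preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Strip characters such as "_" or "!" before the lookup.
        $word = preg_replace('/[^a-z0-9]+/', '', $word);
        if (isset($lookup[$word])) {
            return true;
        }
    }
    return false;
}

$showstopper = array('badword1', 'badword2', 'badword3', 'badword4');
var_dump(contains_bad_word('im gonna bad_word3 you up', $showstopper));
var_dump(contains_bad_word('a perfectly polite sentence', $showstopper));
```

isset() works here even though the first flipped value is 0, because isset() only treats null as unset.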

Not sure why you would need to do this, but here's a way to check and get the bad words that were used:
$showstopper = array('badword1', 'badword2', 'badword3', 'badword4');
$yourmouth = "im gonna badword3 you up badword1";
function badWordCheck( $var ) {
global $yourmouth;
// compare strictly against false so a match at position 0 still counts
return strpos($yourmouth, $var) !== false;
}
print_r(array_filter($showstopper, 'badWordCheck'));
array_filter() returns an array of the bad words found, so if its count() is 0, nothing bad was said.
