Negating sentences using POS-tagging - php

I'm trying to find a way to negate sentences based on POS-tagging. Please consider:
include_once 'class.postagger.php';
function negate($sentence) {
$tagger = new PosTagger('includes/lexicon.txt');
$tags = $tagger->tag($sentence);
foreach ($tags as $t) {
$input[] = trim($t['token']) . "/" . trim($t['tag']) . " ";
}
$sentence = implode(" ", $input);
$postagged = $sentence;
// Prepend "not" to every JJ, MD, RB, VB, VBD or VBN token
// Todo: ignore negative words (not, never, neither)
$sentence = preg_replace("/(\w+)\/(JJ|MD|RB|VB|VBD|VBN)\b/", "not$1/$2", $sentence);
// Remove all POS tags
$sentence = preg_replace("/\/[A-Z$]+/", "", $sentence);
return "$postagged<br>$sentence";
}
BTW: In this example, I'm using the POS-tagging implementation and lexicon of Ian Barber. An example of this code running would be:
echo negate("I will never go to their place again");
I/NN will/MD never/RB go/VB to/TO their/PRP$ place/NN again/RB
I notwill notnever notgo to their place notagain
As you can see (and this issue is also noted in the code comments), the negating words themselves are being negated as well: never becomes notnever, which obviously shouldn't happen. Since my regex skills aren't great, is there a way to exclude these words from the regex used?
[edit] Also, I would very much welcome other comments / critiques you might have in this negating implementation, since I'm sure it's (still) quite flawed :-)

Give this a try:
$sentence = preg_replace("/(\s)(?:(?!never|neither|not)(\w*))\/(JJ|MD|RB|VB|VBD|VBN)\b/", "$1not$2", $sentence);
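One thing to watch with a lookahead like the one above: without a word boundary after the alternation it also skips any word that merely starts with "not" (for example "notable"), and the leading (\s) means the first token of the sentence is never matched. Here is a minimal sketch of a variation, assuming the input is already in the word/TAG format shown in the question. The function name negate_tagged is made up for illustration and is not part of Ian Barber's tagger.

```php
<?php
// Hypothetical helper (not part of the tagger library): expects a string
// that is already tagged as "word/TAG word/TAG ...".
function negate_tagged(string $tagged): string
{
    // The lookahead rejects tokens that ARE negations. The \b inside it
    // keeps words that only start with "not" (e.g. "notable") eligible.
    $pattern = '~\b(?!(?i:not|never|neither)\b)(\w+)/(JJ|MD|RB|VBD|VBN|VB)\b~';
    $negated = preg_replace($pattern, 'not$1/$2', $tagged);

    // Remove all POS tags, as in the original code.
    return preg_replace('~/[A-Z$]+~', '', $negated);
}

echo negate_tagged('I/NN will/MD never/RB go/VB to/TO their/PRP$ place/NN again/RB');
// prints: I notwill never notgo to their place notagain
```

The list of negation words is just the three from the question; a real implementation would probably need a longer list.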

Related

PHP: preg_replace() to get "parent" component of NameSpace

How can I use the preg_replace() replace function to only return the parent "component" of a PHP NameSpace?
Basically:
Input: \Base\Ent\User; Desired Output: Ent
I've been doing this using substr() but I want to convert it to regex.
Note: Can this be done without preg_match_all()?
Right now, I also have a code to get all parent components:
$s = '\\Base\\Ent\\User';
print preg_replace('~\\\\[^\\\\]*$~', '', $s);
//=> \Base\Ent
But I only want to return Ent.
Thank you!
As Rocket Hazmat says, explode is almost certainly going to be better here than a regex. I would be surprised if it's actually slower than a regex.
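For reference, a minimal sketch of what the explode approach could look like. The function name parent_component is invented for this example, and returning an empty string when there is no parent is an assumption about the desired behaviour.

```php
<?php
// Hypothetical helper: returns the second-to-last namespace component,
// or '' when the name has no parent component.
function parent_component(string $name): string
{
    // Drop leading/trailing backslashes so explode() yields no empty parts.
    $parts = explode('\\', trim($name, '\\'));
    $count = count($parts);

    return $count >= 2 ? $parts[$count - 2] : '';
}

echo parent_component('\\Base\\Ent\\User'); // prints: Ent
```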
But, since you asked, here's a regex solution:
$path = '\Base\Ent\User';
$search = preg_match('~([^\\\\]+)\\\\[^\\\\]+$~', $path, $matches);
if($search) {
$parent = $matches[1];
}
else {
$parent = ''; // handles the case where the path is just, e.g., "User"
}
echo $parent; // prints Ent
I think maybe preg_match might be a better choice for this.
$s = '\\Base\\Ent\\User';
$m = [];
preg_match('/([^\\\\]*)\\\\[^\\\\]*$/', $s, $m); // returns 1 on a match and fills $m
print $m[1];
If you read the regular expression backwards, from the $, it says to match many things that aren't backslashes, then a backslash, then many things that aren't backslashes, and save that match for later (in $m).
How about
$path = '\Base\Ent\User';
$parent = substr($path, 0, strrpos($path, "\\")); // \Base\Ent
$section = substr(strrchr($parent, "\\"), 1); // Ent
Or
$path = '\Base\Ent\User';
$section = strstr(substr($path, strpos($path, "\\", 1) + 1), "\\", true); // Ent (works for a three-level path only)

How can I make preg_replace only replace the first of *each* character matched?

For example, the following code wraps each matching character with an <i> tag.
echo preg_replace('/[aeiou]/', '<i>$0</i>', 'alphabet');
// result: <i>a</i>lph<i>a</i>b<i>e</i>t
But I'd like it to only replace each character once.
I'm looking for a result like <i>a</i>lphab<i>e</i>t, where the second a makes it through without a tag because the search string only has one a.
Can you help? Is this possible without iterating through each character in a foreach loop?
The answer should also allow for two or more of the same characters, each only to be used once. For example, if I were searching for aaeioo in the string alphabetsoupisgood, it should match both of the a's but only two of the three o's.
If you just want a letter to be replaced only once then following regex should work:
echo preg_replace('/([aeiou])(?!.+?\\1)/', '<i>$1</i>', 'alphabet');
OUTPUT:
alph<i>a</i>b<i>e</i>t
PS: Note that it replaces the last occurrence of each letter instead of the first one.
EDIT:
The following produces the output the OP expects (thanks to @AntonyHatchkins):
echo strrev(preg_replace
('/([aeiou])(?!.+?\\1)/', strrev('<i>1$</i>'), strrev('alphabet')))."\n";
EDIT 2:
Upon OP's comment:
Can you help me allow more than one a then? How can I match 2, but not 3 a's
I am posting this answer:
echo strrev(preg_replace('/([aeiou])(?!(.+?\\1){2})/',
strrev('<i>1$</i>'), strrev('alphabetax'))) . "\n";
EDIT 3:
Upon another of OP's comments:
that will allow duplicates for all characters in the string, not just 2 a's & 1 e
I am posting this answer:
echo strrev(preg_replace(array('/(a)(?!(.+?\\1){2})/',
'/(?<!>)([eiou])(?!.+?\\1)/'),
array(strrev('<i>1$</i>'), strrev('<i>1$</i>')), strrev('alphabetaxen')))."\n";
OUTPUT:
<i>a</i>lph<i>a</i>b<i>e</i>taxen
Note: I believe original problem has already changed so many times so please don't add further complexity in this problem. You're free to post another question if you have different queries.
Ugly but here's an approach.
echo
preg_replace('/a/','<i>$0</i>',
preg_replace('/e/','<i>$0</i>',
preg_replace('/i/','<i>$0</i>',
preg_replace('/o/','<i>$0</i>',
preg_replace('/u/','<i>$0</i>','alphabet',
1),1),1),1),1);
Use
echo preg_replace('/[aeiou]/', ' ', 'alphabet', 1);
Sorry, I answered before you edited your question.
How about this.
$word = "alphabet";
$replace = array('a','e','i','o','u');
$with = array('','','','','');
echo str_replace($replace,$with,$word); //lphbt
According to http://php.net/manual/es/function.preg-replace.php, you can specify an additional parameter that limits the amount of replaces done.
Edit: Sorry, I didn't originally get your question. In that case, it's not possible. Due to the nature of Regex, it's not aware of the amount of replacements it has made. I might be wrong though, but I doubt that a regular expression exists that only makes one replacement per character.
Your best bet would be to make 5 calls, one each per character. Something like this:
// Handle "i" first: the inserted <i> tags contain that letter themselves.
$res = preg_replace('/i/', '<i>$0</i>', 'alphabet', 1);
$res = preg_replace('/a/', '<i>$0</i>', $res, 1);
$res = preg_replace('/e/', '<i>$0</i>', $res, 1);
$res = preg_replace('/o/', '<i>$0</i>', $res, 1);
$res = preg_replace('/u/', '<i>$0</i>', $res, 1);
echo $res;
Cheers
Here's another approach using the preg_replace_callback() function so I'm posting this as a brand new answer.
function replace_first($matches) {
static $used = array();
$key = $matches[0];
if (in_array($key,$used)) return $key;
$used[] = $key;
return '<i>' . $key . '</i>';
}
$str = preg_replace_callback('/[aeiou]/','replace_first','alphabet');
echo $str;
will output <i>a</i>lphab<i>e</i>t
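The callback idea can probably be stretched to cover the "aaeioo" requirement from the question by giving each letter a budget instead of a used/unused flag. A rough sketch follows, assuming single-byte characters and a non-empty search string; wrap_limited is a made-up name.

```php
<?php
// Hypothetical helper: wraps the first occurrences of each letter in <i>,
// at most as many times as that letter appears in $letters.
function wrap_limited(string $word, string $letters): string
{
    // 'aaeioo' becomes ['a' => 2, 'e' => 1, 'i' => 1, 'o' => 2]
    $budget = array_count_values(str_split($letters));
    $class = preg_quote(implode('', array_keys($budget)), '/');

    return preg_replace_callback(
        '/[' . $class . ']/',
        function (array $m) use (&$budget): string {
            if ($budget[$m[0]] > 0) {
                $budget[$m[0]]--; // spend one use of this letter
                return '<i>' . $m[0] . '</i>';
            }
            return $m[0]; // budget used up, leave the letter alone
        },
        $word
    );
}

echo wrap_limited('alphabetsoupisgood', 'aaeioo');
// prints: <i>a</i>lph<i>a</i>b<i>e</i>ts<i>o</i>up<i>i</i>sg<i>o</i>od
```

Because the budget lives in a local variable rather than a static one, the function can be called more than once without carrying state over between calls.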
The best I could do is loop through each character. (I try to avoid loops whenever possible) I wish there was a cleaner way, but this will do for now.
$word = 'alphabetsoupisgood';
$match = 'aaeou';
$wordArr = str_split($word);
$matchArr = str_split($match);
foreach ($matchArr as $c) {
$key = array_search($c, $wordArr);
if ($key !== false) {
$wordArr[$key] = strtoupper($c);
}
}
foreach ($wordArr as $k => $c) {
if (ctype_upper($c)) {
$wordArr[$k] = '<i>' . strtolower($c) . '</i>';
}
}
$word = implode('',$wordArr);
echo $word;
Here it is in codepad but they're running some super old version of PHP so ctype_upper isn't enabled... http://codepad.org/UoLWcK2S
If anyone can provide a cleaner answer using preg_replace I'd gladly mark yours as the answer.

Remove all acronyms from PHP string

I have a line of text that contains acronyms, something like this:
$draft="The war between the CIA and NSA started in K2 when the FBI hired M";
I can't for the life of me figure out how to create a new string with all acronyms removed.
I need this output...
$newdraft="The war between the and started in when the hired";
The only php functions I can find only remove words that you statically declare like this!
$newdraft= str_replace("CIA", " ", $draft);
Anyone have any ideas, or an already created function?
OK, let's try to write something (although I can't see what it would be useful for).
<?php
function remove_acronyms($str)
{
$str_arr = explode(' ', $str);
if (empty($str_arr)) return false;
foreach ($str_arr as $index => $val)
{
if ($val==strtoupper($val)) unset($str_arr[$index]);
}
return implode(' ', $str_arr);
}
$draft = "The war between the CIA and NSA started in K2 when the FBI hired M";
print remove_acronyms($draft);
http://codepad.org/cIZSwwhV
Definition of an acronym: any word that is fully capitalized, and at least 2 chars long.
<?php
$draft="The war between the CIA and NSA started in K2 when the FBI hired M";
$words = explode(' ', $draft);
foreach($words as $i => $word)
{
if (!strcmp($word, strtoupper($word)) && strlen($word) >= 2)
{
unset($words[$i]);
}
}
$clean = implode(' ', $words);
echo $clean;
?>
Try to define an acronym. You'd have to cut some corners, but stating something like 'any single word that is shorter than 5 characters and in all capitals' should be correct for this sample, and you'd be able to write a regular expression for that.
Other than that, you could make a huge list of known acronyms and just replace those.
Regex to remove multiple caps and/or numbers appearing together:
$draft="The war between the CIA and NSA started in K2 when the FBI hired M";
$newdraft = preg_replace('/[A-Z0-9][A-Z0-9]+/', '', $draft);
echo $newdraft;
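Note that the pattern above leaves the single letter M in place and leaves double spaces behind. A possible variation that reproduces the output asked for in the question is sketched below. It assumes an "acronym" is any whole word that starts with a capital and contains only capitals and digits, which also removes ordinary one-letter words such as "I" or "A". The name strip_acronyms is hypothetical.

```php
<?php
// Hypothetical helper. Assumption: an acronym is a whole word that starts
// with a capital letter and contains only capitals and digits.
function strip_acronyms(string $text): string
{
    // \b on both sides so capitals inside ordinary words are left alone.
    $stripped = preg_replace('/\b[A-Z][A-Z0-9]*\b/', '', $text);

    // Collapse the gaps left behind and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $stripped));
}

echo strip_acronyms('The war between the CIA and NSA started in K2 when the FBI hired M');
// prints: The war between the and started in when the hired
```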

preg_replace suddenly stops making distinctions

Confounded. I've been using the preg_match check below to distinguish between entire words and words that are parts of other words. It has suddenly stopped working in this script and in every other script that depends on it.
The result is it finds parts of words, although you can see it is explicitly told to find only entire words.
$word = preg_replace("/[^a-zA-Z 0-9]+/", " ", $word);
if (preg_match('#\b'.$word.'\b#',$goodfile) && (trim($word) != "")) {
$fate = strpos($goodfile,$word);
print $word ." ";
print $fate ."<br>";
}
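One possible cause worth checking: preg_match only confirms that a whole-word match exists somewhere, while strpos then reports the first substring occurrence, which may sit inside a longer word. Below is a sketch of getting the position from the same whole-word match via PREG_OFFSET_CAPTURE. It is only a guess at what the script needs, and whole_word_pos is an invented name.

```php
<?php
// Hypothetical helper: byte offset of the first whole-word occurrence of
// $word in $text, or -1 when there is none.
function whole_word_pos(string $word, string $text): int
{
    // Same clean-up as in the question, plus trimming.
    $word = trim(preg_replace('/[^a-zA-Z 0-9]+/', ' ', $word));
    if ($word === '') {
        return -1;
    }

    $pattern = '#\b' . preg_quote($word, '#') . '\b#';
    if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1]; // offset of the whole-word match itself
    }
    return -1;
}

echo whole_word_pos('cat', 'concatenate the cat'); // prints: 16
echo strpos('concatenate the cat', 'cat');         // prints: 3 (inside "concatenate")
```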
If you only want to read the first word of a line of a text file, like your title suggests, try another method:
// Get the file as an array, each element being a line
$lines = file("/path/to/file");
// Break up the first line by spaces
$words = explode(" ", $lines[0]);
// Get the first word
$firstWord = $words[0];
This would be faster and cleaner than explode, and you won't be creating an extra array:
$first_word = strstr($lines[0], ' ', true);

Is there a way to select the first word/combo of characters in a string, separated by spaces?

For instance, I'm trying to select the first word in this string:
"chocolate muffin"
So I want "chocolate", but not the " " (space) and not the "muffin" text.
I imagine I could do $separate = explode(" ",$string), and just take $separate[0], but I was wondering if there was a more efficient way to do this?
Edit: This is in PHP.
This is more efficient, although a bit less readable in my opinion:
$mystring = substr($mystring, 0, strpos($mystring, " "));
This is because strpos stops searching at the first occurrence of the character, and substr then returns the part of the string before that position. Note that if the string contains no space, strpos returns false and the result is an empty string.
With explode, the search continues until the end of the string.
You can also write a slight variation:
list($res) = explode(' ',$string);
$firstword = strtok($string," ");
$first = strstr($string, ' ', true);
Please note that this only works in PHP 5.3 or later, because the third parameter of strstr() was added in that version.
PHP strstr()
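The one-liners above behave differently when the string has no space at all: strstr() with the third argument returns false, and the substr/strpos version returns an empty string. A small sketch of a wrapper that handles that case, built on strtok; the name first_word and the choice to return '' for blank input are assumptions.

```php
<?php
// Hypothetical helper: first whitespace-delimited word, or '' for blank input.
function first_word(string $text): string
{
    // strtok() returns false when there is nothing left to tokenize.
    $word = strtok(trim($text), " \t\n");

    return $word === false ? '' : $word;
}

echo first_word('chocolate muffin'); // prints: chocolate
```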
