I'm attempting to remove noise words from a string, and I have what I believe is a good algorithm for it, but I'm running into a snag. Before I do my preg_replace I remove all punctuation except apostrophe ('). The I put it through this preg_replace:
$content = preg_replace('/\b('.implode('|', self::$noiseWords).')\b/','',$content);
Which works great, except for words that do indeed have that ' character. preg_replace seems to be treating that as a boundary character. This is a problem for me.
Is there a way I can get around this? A different solution perhaps?
Thanks!
Here is the example I'm using:
$content = strtolower(strip_tags($content));
$content = preg_replace("/(?!['])\p{P}/u", "", $content);// remove punctuation
echo $content;// i've added striptags for editing as well should still workyep it doesnbsp
$content = preg_replace("/\b(?<')(".implode('|', self::$noiseWords).")(?!')\b/",'',$content);
$contentArray = explode(" ", $content);
print_r($contentArray);
On the 3rd line you'll see the comment of what $content is right before the preg_replace
And though I'm assuming you can guess what my noiseWords array looks like, here's just a small fraction of it:
$noiseWords = array("a", "able","about","above","abroad","according","accordingly","across",
"actually","adj","after","afterwards","again",......)
You can use a negative lookbehind and positive lookahead to make sure you're not "around" a quote character:
$regex = "/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/";
Now, your regex will not match anything that is preceded by or following with a single quote.
Related
Imagine if:
$string = "abcdabcdabcdabcdabcdabcdabcdabcd";
How do I remove the repeated sequence of characters (all characters, not just alphabets) in the string so that the new string would only have "abcd"? Perhaps running a function that returns a new string with removed repetitions.
$new_string = remove_repetitions($string);
The possible string before removing the repetition is always like above. I don’t know how else to explain since English is not my first language. Other examples are,
$string = “EqhabEqhabEqhabEqhabEqhab”;
$string = “o=98guo=98guo=98gu”;
Note that I want it to work with other sequence of characters as well. I tried using Regex but I couldn't figure out a way to accomplish it. I am still new to php and Regex.
For details : https://algorithms.tutorialhorizon.com/remove-duplicates-from-the-string/
In different programming have a different way to remove the same or duplicate character from a string.
Example: In PHP
<?php
$str = "Hello World!";
echo count_chars($str,3);
?>
OutPut : !HWdelor
https://www.w3schools.com/php/func_string_count_chars.asp
Here, if we wish to remove the repeating substrings, I can't think of a way other than knowing what we wish to collect since the patterns seem complicated.
In that case, we could simply use a capturing group and add our desired output in it the remove everything else:
(abcd|Eqhab|guo=98)
I'm guessing it should be simpler way to do this though.
Test
$re = '/.+?(abcd|Eqhab|guo=98)\1.+/m';
$str = 'abcdabcdabcdabcdabcdabcdabcdabcd
EqhabEqhabEqhabEqhabEqhab
o98guo=98guo=98guo=98guo=98guo=98guo=98guo98';
$subst = '$1';
$result = preg_replace($re, $subst, $str);
echo $result;
Demo
You did not tell what exactly to remove. A "sequnece of characters" can be as small as just 1 character.
So this simple regex should work
preg_replace ( '/(.)(?=.*?\1)/g','' 'abcdabcdabcdabcdabcdabcd');
$str="Your LaTeX document can \DIFaddbegin \DIFadd{test}\DIFaddend be easily
and the text can have multiple lines in it like this\DIFaddbegin \DIFadd{test2}
\DIFaddend"
I need to convert all \DIFaddbegin \DIFadd{test}\DIFaddend to \added{test}.
I tried
$o= preg_replace_callback('/\\DIFaddbegin\\s\DIFadd{(.*?)}\\DIFaddend/',
function($m) {return preg_replace('/$m[0]/','\added{$m[1]}',$m[0]);},$str);
But no luck. Which would be correct pattern for this? And also even if the string contains new line character the pattern should work.
You don't need a callback, using preg_replace() is fine for this task. To match a single backslash you need to double escape it meaning \\\\. To match possible whitespace between each substring, you can use \s* meaning whitespace "zero or more" times.
$str = preg_replace('~\\\\DIFaddbegin\s*\\\\DIFadd({[^}]*})\s*\\\\DIFaddend~', '\added$1', $str);
Try this:
$new_str = preg_replace("/\\\\DIFaddbegin \\\\DIFadd\{(.*)\}\\\\DIFaddend/s","\\added{\$1}",$str);
I have strings like this (some examples):
F7998FM3213/02F
J442554NM/05
K439459845/34D
I need to use PHP with preg_replace and regular expressions to delete all non-numeric characters in any string, after the forward-slash, '/'.
For example the codes above would look like this afterwards:
F7998FM3213/02
J442554NM/05
K439459845/34
If you're going for readability, something like this would be perfect:
$parts = explode("/",$line,2);
$parts[1] = preg_replace("/\D/","",$parts[1]);
$output = implode("/",$parts);
However, for conciseness and based entirely on the examples you have given, try this:
$output = preg_replace("/\D+$/","",$input);
This will strip any non-numeric characters from the end of the string, which seems to be what you're after based on your examples.
you can use this:
$subject = <<<LOD
F7998FM3213/02F
J442554NM/05
K439459845/34D
K439459845/34D34
LOD;
echo preg_replace('~^[^/]*+/\K|[^\d\n]++~m', '', $subject);
explanation:
The regex is an alternation between two things:
You match the begining until you encounter / included
the part after the / that is all that is not a digit or a new line one or more times
Since the begining of the string is checked at first, all non digit characters are removed after the /
To remove all \D anywhere after a / you could replace:
(?:/\K|\G(?!^))(\d*)\D+
with $1. Like:
preg_replace(',(?:/\K|\G(?!^))(\d*)\D+,', '$1', $str);
I am currently using what appears to be a horribly complex and unnecessary solution to form a required string.
The string could have any punctuation and will include slashes.
As an example, this string:
Test Ripple, it\'s a comic book one!
Using my current method:
str_replace(" ", "-", trim(preg_replace('/[^a-z0-9]+/i', ' ', str_replace("'", "", stripslashes($string)))))
Returns the correct result:
Test-Ripple-its-a-comic-book-one
Here is a breakdown of what my current (poor) solution is doing in order to achieve the desired output:-
Strip all slashes from the string
remove any apostrophes with str_replace
remove any remaining punctuation using preg_replace and replace it with whitespace
Trim off any extra whitespace from the beginning/end of string which may have been caused by punctuation.
Replace all whitespace with '-'
But there must be a better and more efficient way. Can anyone help?
Personally it looks fine to me however I would make one small change.
Change
preg_replace("/[^a-z0-9]+/i"
to the following
preg_replace("/[^a-zA-Z0-9\s]/"
Given a literal string such as:
Hello\n\n\n\n\n\n\n\n\n\n\n\nWorld
I would like to reduce the repeated \n's to a single \n.
I'm using PHP, and been playing around with a bunch of different regex patterns. So here's a simple example of the code:
$testRegex = '/(\\n){2,}/';
$test = 'Hello\n\n\n\n\n\n\n\n\nWorld';
$test2 = preg_replace($testRegex ,'\n',$test);
echo "<hr/>test regex<hr/>".$test2;
I'm new to PHP, not that new to regex, but it seems '\n' conforms to special rules. I'm still trying to nail those down.
Edit: I've placed the literal code I have in my php file here, if I do str_replace() I can get good things to happen, but that's not a complete solution obviously.
To match a literal \n with regex, your string literal needs four backslashes to produce a string with two backlashes that’s interpreted by the regex engine as an escape for one backslash.
$testRegex = '/(\\\\n){2,}/';
$test = 'Hello\n\n\n\n\n\n\n\n\n\n\n\nWorld';
$test2 = preg_replace($testRegex, '\n', $test);
Perhaps you need to double up the escape in the regular expression?
$pattern = "/\\n+/"
$awesome_string = preg_replace($pattern, "\n", $string);
Edit: Just read your comment on the accepted answer. Doesn't apply, but is still useful.
If you're intending on expanding this logic to include other forms of white-space too:
$output = echo preg_replace('%(\s)*%', '$1', $input);
Reduces all repeated white-space characters to single instances of the matched white-space character.
it indeed conforms to special rules, and you need to add the "multiline"-modifier, m. So your pattern would look like
$pattern = '/(\n)+/m'
which should provide you with the matches. See the doc for all modifiers and their detailed meaning.
Since you're trying to reduce all newlines to one, the pattern above should work with the rest of your code. Good luck!
Try this regular expression:
/[\n]*/