What regex pattern do I need for this? - php

I need a regex (to work in PHP) to replace American English words in HTML with British English words. So color would be replaced by colour, meters by metres and so on [I know that meters is also a British English word, but for the copy we'll be using it will always be referring to units of distance rather than measuring devices]. The pattern would need to work accurately in the following (slightly contrived) examples (although as I have no control over the actual input these could exist):
<span style="color:red">This is the color red</span>
[should not replace color in the HTML tag but should replace it in the sentence]
<p>Color: red</p>
[should replace word]
<p>Tony Brammeter lives 2000 meters from his sister</p>
[should replace meters for the word but not in the name]
I know there are edge cases where replacement wouldn't be useful (if his name was Tony Meter for example), but these are rare enough that we can deal with them when they come up.

Html/xml should not be processed with regular expressions, it is really hard to generate one that will match anything. But you can use the builtin dom extension and process your string recursively:
# Warning: untested code!
function process($node, $replaceRules) {
foreach ($node->children as $childNode) {
if ($childNode instanceof DOMTextNode) {
$text = pre_replace(
array_keys(replaceRules),
array_values($replaceRules),
$childNode->wholeText
);
$node->replaceChild($childNode, new DOMTextNode($text));
} else {
process($childNode, $replaceRules);
}
}
}
$replaceRules = array(
'/\bcolor\b/i' => 'colour',
'/\bmeter\b/i' => 'metre',
);
$doc = new DOMDocument();
$doc->loadHtml($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();

I think you'd rather need a dictionary and maybe even some grammatical analysis in order to get this working correctly, since you don't have control over the input. A pure regex solution is not really going to be able to process this kind of data correctly.
So I'd suggest to first come up with a list of words that need to be replaced, those are not only "color" and "meter". Wikipedia has some information on the topic.

You do not want a regular expression for this. Regular expressions are by their very nature stateless, and you need some measure of state to be able to tell the difference between 'in a html tag' and 'in data'.
You want to be using a HTML parser in combination with something like a str_replace, or even better, use a proper grammer dictionary and stuff as Lucero suggests.

The second problem is easier - you want to replace when there are word boundaries around the word: http://www.regular-expressions.info/wordboundaries.html -- this will make sure you don't replace the meter in Brammeter.
The first problem is much harder. You don't want to replace words inside HTML entities - nothing between <> characters. So, your match must make sure that you last saw > or nothing, but never just <. This is either hard, and requires some combination of lookahead/lookbehind assertions, or just plain impossible with regular expressions.
a script implementing a state machine would work much better here.

You don't need to use a regex explicitly. You can try the str_replace function, or if you need it to be case insensitive use the str_ireplace function.
Example:
$str = "<p>Color: red</p>";
$new_str = str_ireplace ('%color%', 'colour', $str);
You can pass an array with all the words that you want to search for, instead of the string.

Related

How can I parse out specific "tags" from a string in php

I like how StackOverflow allows you to search for tags by specifying [tagname] in the search field. How could I go about writing a parser that would help me separate out tags from normal text. I can think of the manual way which would be to use some combination of substring and/or regex to get the position of opening and closing square brackets, and then extract out those strings, but I'm curious if there's a better way (and my regex skill is subpar at best)
// example
$query = 'How to use [jQuery] [selector] selectors';
$tags = getTags($query); // $tags == 'jQuery, selector'
$text = getText($query); // $text == 'How to use selectors'
Regular Expressions are probably the way to go. The more you can specify about how the tags are set the easier it will be to capture the right ones (In the expression below I limit it to either letters \w or numbers \d. The function uses a capture group (enclosed in parens) to pull out the relevant tags.
function getTags($query) {
preg_match_all("/\[([\w\d]+)\]/", $query, $matches);
return $matches;
}
Regex would probably work best, just don't try to parse HTML.
https://www.debuggex.com/
Is a really good site for visually seeing what your regex string is doing. I would recommend reading up on the PHP regex functions, and learn some more, there is a cheatsheat at the bottom of the site.
.*[(tag)].*
Would work to get the tags, using a captured group. The preg_match_all function is really good for working with multiple results, just make sure to read the official documentation to get it working how you need it.
For parsing more complex, or irregular things (like html, which is extremely difficult to do reliably), it is better to do it manually. Regex has worked for all my non HTML parsing needs in the past.

How to remove offensive words from post by php?

Assume "xyza" is a bad word. I'm using following method to replace offensive words-
$text = str_replace("x***","(Offensive words detected & removed!)",$text);
This code will replace xyza into "(Offensive words detected & removed!)".
But problem is "Case" if someone type XYZA my code can't detect it. How to solve it?
No matter what you do, users will find ways to get around your filters. They will use unicode characters (аss, for example, uses a Cyrillic а and will not get captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, whatever you haven't managed to catch yet.
If family-friendliness is essential to your application, have a person review the content before it goes live. Otherwise, add a flag feature so other people can flag offensive content. Better yet, use some sort of machine learning or Bayesian filter to automatically flag potentially offensive posts and have humans check them out manually. People read human languages better than computers.
The problem with whitelists/blacklists is—as other users have pointed out—your users will make it their priority to find ways around your filter for satisfaction rather than using your website for what it was intended for, whatever that may be.
One approach would be to use Google’s undocumented profanity API it created for its “What Do You Love?” website. If you get a response of true then just give the user a message saying their post couldn’t be submitted due to detected profanity.
You could approach this as follows:
<?php
if (isset($_POST['submit'])) {
$result = json_decode(file_get_contents(sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']))));
if ($result->response == true) {
// profanity detected
}
else {
// save comments to database as normal
}
}
Other answers and comments say that programming is not the best solution to this problem. I agree with them. Those answers should be moved to Moderators - Stack Exchange or Webmasters - Stack Exchange.
Since this is stackoverflow, my answer is going to be based on computer programming.
If you want to use str_replace, do something like this.
For the sake of this post, since some people are offended by actual cusswords, let's pretend that these are bad words:
'fug', 'schnitt', 'dam'.
$text = str_ireplace(" fug ","(Offensive words detected & removed!)",$text);
Notice, it's str_ireplace not str_replace. The i is for "case insensitive".
But that will erroneously match "fuggedaboudit," for example.
If you want to do a more reliable job, you need to use regex.
$bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";
$hit_words = array("fug","schnitt","dam"); // these words are 'hits' that we need to replace. hit words...
array_walk($hit_words, function(&$value, $key) { // this prepares the regex, requires PHP 5.3+ I think.
$value = '~\b' . preg_quote( $value ,'~') . '\b~i'; // \b means word boundary, like space, line-break, period, dash, and many others. Prevends "refudgee" from being matched when searching for "fudge"
});
/*print_r($bad_words);*/
$good_words = array("fudge","shoot","dang");
$good_text = preg_replace($hit_words,$good_words,$bad_text); // does all search/replace actions at once
echo '<br />' . $good_text . '<br />';
That will do all your search/replacements at once. The two arrays should contain the same number of elements, matching up searches and replace terms. It will not match parts of words, only whole words. And of course, determined cussers will find ways of getting their swearing onto your website. But it will stop lazy cussers.
I've decided to add some links to sites that obviously use programming to do a first run through removing profanity. I'll add more as I come across them. Other than yahoo:
1.) Dell.com - replace matching words with <profanity deleted>.
http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx
2.) Watson, the supercomputer, apparently developed a cursing problem. How do you tell the difference between cursing and slang? Apparently, it's so hard that the researchers just decided to purge it all. But they could have just used a list of curse words ( exact matching is a subset of regex, I would say) and forbidden their use. That's kind of how it works in real life, anyway.
Watson develops a profanity problem
3.) Content Compliance section of Gmail custom settings in Apps for Business:
Add expressions that describe the content you want to search for in each message
The "Expresssions" used can be of several types, including "Advanced content match", which, among other things, allows you to choose "Match type" options very similar to what you'd have in an excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is Empty, all of which presumably use Regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. So, the mighty Google implements regex filtering options for its business users. Why would it do that, when regex is supposedly so ineffective? Because it actually is effective enough. It is a simple, fast, programming solution that will only fail when people are hell-bent on circumventing it.
Besides that list, I wonder if anyone else has noticed the similarity between weeding out profanity and filtering out spam. Clearly, regex has uses in both arenas but nitpickers who learned by rote that "all regex is bad" will always downvote any answer to any question if regex is even mentioned.
Try googling "how spam filters work". You'll get results like this one that covers spam assassin:
http://www.seas.upenn.edu/cets/answers/spamblock-filter.html
Another example where I'm sure regex is used is when communicating via Amazon.com's Amazon Marketplace. You receive emails at your usual email address. So, naturally, when responding to a seller, your email program will include all kinds of sender information, like your email address, cc email addresses, and any you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it's worth and is therefore effective to a degree. They also keep the emails for 2 years, presumably so that a human can go over them in case of any fraud claims.
SpamAssassin also looks at the subject and body of the message for the same sort of things that a person notices when a message "looks like spam". It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.
Regex is not mentioned, but I'm sure it's in use.
Use str_ireplace function that Case-insensitive version of str_replace()
$text = str_ireplace("flip","(Offensive words detected & removed!)", $text);
Use 'str_ireplace' to replace any case sensitive strings
Probable, this will help you
$text = 'contains offensive_word .... so on';
$array = array(
'offensive_word' => '****',
'offensive_word2' => '****',
'offensive_word3' => '****',
//.....
);
$text = str_ireplace(array_keys($array),array_values($array), $text);
echo $text;
You should use regex replacement and need to add the i flag to the end of your regex so it searches your text regardless of case. so..
$text = preg_replace("/xyza/i","(Offensive words detected & removed!)", $text);
str_ireplace can also be used if you don't need complex regex rules.
$text = str_ireplace("xyza","(Offensive words detected & removed!)", $text);
In fact, the latter is the preferred way as it's faster than regex manipulation. From PHP docs:
If you don't need fancy replacing rules, you should generally use this function instead of preg_replace() with the i modifier.
BUT, as the commenter pointed out, simple string/regex replacements can break your strings if the substring you're replacing appears as part of another non-offensive word. For this, you could either use word boundaries in your regexes or replace only those words that can't be part of other strings (e.g. the word xyza).

PHP preg_replace();

I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be sure i'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
You should only capture when you're planning on using the data. So most () are obsolete in that regexp pattern. Not a cause for failure but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE" following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code works also if the attributes changes order. Plus it's so much nicer to work with.
The main mistake was the use of funciton preg_replace, witch returns the subject - neither the matched pattern nor the replacement. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues

PHP Regex pattern to match magic search keywords

Ok regex experts. I'm having a ton of trouble trying to make a regex pattern for my needs.
The goal:
Take a search query such as "good food type:post format:gallery" and parse the type or format or both from the string.
This is what I wrote, but doesnt work unless both type and format are present and type comes before format. Ideally, either type or format could be present.
$query = "Great food type:post format:gallery";
preg_match('/(.*?(?<=\btype:)(?P<type>[a-z]*\w+))(.*?(?<=\bformat:)(?P<format>[a-z]*\w+))/', $query, $matches);
I image I need the returned $matches to be named as well right?
Thanks,
I don't think you'll want to use a regex for this. It'll be a pain to maintain and update when you add more operators like type: and format: Also the regex then depends on ordering of what's entered.
A simple approach might be like
$tokens=explode(" ",$searchString);
foreach($tokens as $token){
if(preg_match('~([^:]+:(.*)~',$token,$flagMatch)){
$flags[$flagMatch[1]]=$flagMatch[2];
}
$searchtokens[]=$token
}
Obvious caveat with that example is exploding straight on space so you wouldn't be able to handle "quoted terms" that should be treated as one.

How to match anything except a pattern between two tags

I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex that only returns me an array with two NULL indicies:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks but I've tried anything I could imagine there and it still either provides me with the NULL indicies or nearly the whole string (from the very first occurance of <dl to </dl></div>, which includes several other occurances of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest to use tidy instead. You can easily extra all the desired tags with their contents, even for broken HTML.
In general I would not recommend to write a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"

Categories