PHP Regex pattern to match magic search keywords - php

Ok regex experts. I'm having a ton of trouble trying to make a regex pattern for my needs.
The goal:
Take a search query such as "good food type:post format:gallery" and parse the type or format or both from the string.
This is what I wrote, but it doesn't work unless both type and format are present and type comes before format. Ideally, either type or format could be present.
$query = "Great food type:post format:gallery";
preg_match('/(.*?(?<=\btype:)(?P<type>[a-z]*\w+))(.*?(?<=\bformat:)(?P<format>[a-z]*\w+))/', $query, $matches);
I imagine I need the returned $matches to be named as well, right?
Thanks,

I don't think you'll want to use a regex for this. It'll be a pain to maintain and update when you add more operators like type: and format:. Also, the regex then depends on the order in which the operators are entered.
A simple approach might be like
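$tokens = explode(" ", $searchString);
$flags = $searchtokens = array();
foreach ($tokens as $token) {
    // A token such as "type:post" is stored as a flag; anything else is a plain search term.
    if (preg_match('~^([^:]+):(.*)$~', $token, $flagMatch)) {
        $flags[$flagMatch[1]] = $flagMatch[2];
    } else {
        $searchtokens[] = $token;
    }
}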
Obvious caveat with that example is exploding straight on space so you wouldn't be able to handle "quoted terms" that should be treated as one.
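One way around the quoted-terms caveat is to tokenize with a pattern that treats a quoted phrase as a single token instead of exploding on spaces. The following is only a minimal sketch: the function name parseSearchQuery, the $allowedFlags parameter and the shape of the returned array are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: splits a search query into flags (key:value) and free-text terms.
function parseSearchQuery(string $query, array $allowedFlags = ['type', 'format']): array
{
    $flags = [];
    $terms = [];

    // A token is key:"quoted value", a "quoted phrase", or a run of non-space characters.
    preg_match_all('/\w+:"[^"]*"|"[^"]*"|\S+/', $query, $m);

    foreach ($m[0] as $token) {
        if (preg_match('/^(\w+):(.+)$/', $token, $flagMatch)
            && in_array($flagMatch[1], $allowedFlags, true)) {
            // Known operator: store its value without surrounding quotes.
            $flags[$flagMatch[1]] = trim($flagMatch[2], '"');
        } else {
            // Everything else is an ordinary search term.
            $terms[] = trim($token, '"');
        }
    }

    return ['flags' => $flags, 'terms' => $terms];
}

// The order of the operators does not matter, and either one may be missing.
$result = parseSearchQuery('"good food" format:gallery type:post cheap');
```

Restricting flags to a known list means a term such as a URL containing a colon stays a normal search term rather than being mistaken for an operator.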

Related

How can I parse out specific "tags" from a string in php

I like how StackOverflow allows you to search for tags by specifying [tagname] in the search field. How could I go about writing a parser that would help me separate out tags from normal text. I can think of the manual way which would be to use some combination of substring and/or regex to get the position of opening and closing square brackets, and then extract out those strings, but I'm curious if there's a better way (and my regex skill is subpar at best)
// example
$query = 'How to use [jQuery] [selector] selectors';
$tags = getTags($query); // $tags == 'jQuery, selector'
$text = getText($query); // $text == 'How to use selectors'
Regular expressions are probably the way to go. The more you can specify about how the tags are formed, the easier it will be to capture the right ones. (In the expression below I limit tag names to word characters; \w already covers letters, digits and the underscore, so the extra \d is redundant but harmless.) The function uses a capture group (enclosed in parens) to pull out the relevant tags.
function getTags($query) {
    preg_match_all("/\[([\w\d]+)\]/", $query, $matches);
    // $matches[1] holds the captured tag names without the brackets.
    return $matches[1];
}
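The question also asks for the remaining text. A possible counterpart is sketched below; getText is the name used in the question, but this implementation is only an assumption about the desired behaviour (strip the bracketed tags, then tidy the leftover whitespace).

```php
<?php
// One possible implementation of getText(): returns the query with all [tag] markers stripped.
function getText(string $query): string
{
    // Remove each bracketed tag, then collapse the whitespace left behind.
    $withoutTags = preg_replace('/\[\w+\]/', '', $query);
    return trim(preg_replace('/\s+/', ' ', $withoutTags));
}

$text = getText('How to use [jQuery] [selector] selectors');
```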
Regex would probably work best, just don't try to parse HTML.
https://www.debuggex.com/
Is a really good site for visually seeing what your regex string is doing. I would recommend reading up on the PHP regex functions and learning some more; there is a cheat sheet at the bottom of the site.
\[(tag)\]
Would work to get the tags, using a captured group. Note that the square brackets must be escaped, otherwise they form a character class. The preg_match_all function is really good for working with multiple results, just make sure to read the official documentation to get it working how you need it.
For parsing more complex, or irregular things (like html, which is extremely difficult to do reliably), it is better to do it manually. Regex has worked for all my non HTML parsing needs in the past.

PHP preg_replace();

I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be safe, I'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
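For comparison, here is a minimal sketch of the DOM route. The sample markup and its value are made up for illustration; DOMDocument, DOMXPath and the libxml functions are standard PHP extensions.

```php
<?php
// Sample markup standing in for the fetched page (assumed for illustration).
$html = '<html><body><form><input id="__VIEWSTATE" type="hidden" '
      . 'value="dDwtMTIzNDU2" name="__VIEWSTATE"></form></body></html>';

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the input by its name attribute, regardless of attribute order.
$nodes = $xpath->query('//input[@name="__VIEWSTATE"]');

$viewstate = null;
if ($nodes !== false && $nodes->length > 0) {
    $viewstate = $nodes->item(0)->getAttribute('value');
}
```

Because the lookup goes through the parsed tree, it keeps working if the attributes are reordered or extra whitespace is added.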
You should only capture when you're planning on using the data, so most of the () in that regexp pattern are unnecessary. Not a cause for failure, but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE" following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. That makes certain your code also works if the attributes change order. Plus, it's so much nicer to work with.
The main mistake was the use of the function preg_replace, which returns the subject, not the matched pattern or the replacement. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
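The difference can be sketched as follows. The sample markup is made up for illustration. When a pattern does not match (which can easily happen with ^(.*) ... (.*)$ on a multi-line page, since . does not match newlines without the s modifier), preg_replace hands back the subject untouched, while preg_match reports the captured group.

```php
<?php
// Sample markup standing in for the fetched page (assumed for illustration).
$html = "<form>\n<input id=\"__VIEWSTATE\" type=\"hidden\" "
      . "value=\"dDwtMTIzNDU2\" name=\"__VIEWSTATE\">\n</form>";

// preg_replace returns the subject unchanged when nothing matches ...
$replaced = preg_replace('/^no-such-text$/', 'x', $html);

// ... whereas preg_match fills $matches with the captured group.
$pattern = '/<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="([^"]*)"/';
$viewstate = preg_match($pattern, $html, $matches) === 1 ? $matches[1] : null;
```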

Regex replace matched subexpression (and nothing else)?

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${3}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does have what appear to be tag-like elements. I know better than parsing HTML and XML with RegEx. ;)
It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
See it working online: ideone
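Since X might be different for each element, the same lookaround pattern can be combined with preg_replace_callback. This is only a sketch: the $map lookup table and its values are hypothetical stand-ins for however the replacement is actually determined.

```php
<?php
$data = '<DelayEvent>13A</DelayEvent> <DelayEvent>7B</DelayEvent>';

// Hypothetical lookup table mapping old values to new ones.
$map = ['13A' => 'X1', '7B' => 'X2'];

$new_data = preg_replace_callback(
    '|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|',
    function (array $m) use ($map) {
        // $m[0] is only the text between the tags; unknown values are left unchanged.
        return $map[$m[0]] ?? $m[0];
    },
    $data
);
```

The lookarounds assert the surrounding tags without consuming them, so the callback only ever sees and replaces the inner value.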

What regex pattern do I need for this?

I need a regex (to work in PHP) to replace American English words in HTML with British English words. So color would be replaced by colour, meters by metres and so on [I know that meters is also a British English word, but for the copy we'll be using it will always be referring to units of distance rather than measuring devices]. The pattern would need to work accurately in the following (slightly contrived) examples (although as I have no control over the actual input these could exist):
<span style="color:red">This is the color red</span>
[should not replace color in the HTML tag but should replace it in the sentence]
<p>Color: red</p>
[should replace word]
<p>Tony Brammeter lives 2000 meters from his sister</p>
[should replace meters for the word but not in the name]
I know there are edge cases where replacement wouldn't be useful (if his name was Tony Meter for example), but these are rare enough that we can deal with them when they come up.
HTML/XML should not be processed with regular expressions; it is really hard to write one that handles every case. Instead, you can use the built-in DOM extension and process the document recursively:
# Warning: untested code!
function process($node, $replaceRules) {
    foreach ($node->childNodes as $childNode) {
        if ($childNode instanceof DOMText) {
            // Only text nodes are rewritten; attributes such as style="color:red" are never touched.
            $childNode->data = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->data
            );
        } else {
            process($childNode, $replaceRules);
        }
    }
}
$replaceRules = array(
'/\bcolor\b/i' => 'colour',
'/\bmeter\b/i' => 'metre',
);
$doc = new DOMDocument();
$doc->loadHtml($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();
I think you'd rather need a dictionary and maybe even some grammatical analysis in order to get this working correctly, since you don't have control over the input. A pure regex solution is not really going to be able to process this kind of data correctly.
So I'd suggest to first come up with a list of words that need to be replaced, those are not only "color" and "meter". Wikipedia has some information on the topic.
You do not want a regular expression for this. Regular expressions are by their very nature stateless, and you need some measure of state to be able to tell the difference between 'in a html tag' and 'in data'.
You want to be using an HTML parser in combination with something like str_replace, or even better, use a proper grammar dictionary as Lucero suggests.
The second problem is easier - you want to replace when there are word boundaries around the word: http://www.regular-expressions.info/wordboundaries.html -- this will make sure you don't replace the meter in Brammeter.
The first problem is much harder. You don't want to replace words inside HTML tags, that is, anything between < and > characters. So, your match must make sure that you last saw > or nothing, but never just <. This is either hard, and requires some combination of lookahead/lookbehind assertions, or just plain impossible with regular expressions.
A script implementing a state machine would work much better here.
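A very small version of that idea is sketched below. It splits the input into alternating tag and text chunks and only rewrites the text chunks, so the "state" is simply whether the current chunk is a tag. The function name replaceOutsideTags and the rule set are assumptions for illustration, and the splitting is naive: a > inside an attribute value, a comment or a script block would confuse it, which is where a real HTML parser is the safer choice.

```php
<?php
// Hypothetical helper: applies word replacements to text outside of HTML tags only.
function replaceOutsideTags(string $html, array $rules): string
{
    // Split into alternating chunks; the capture group keeps the tags in the result.
    $chunks = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($chunks as $i => $chunk) {
        // State check: a chunk that starts with "<" is a tag and is left untouched.
        if ($chunk !== '' && $chunk[0] === '<') {
            continue;
        }
        $chunks[$i] = preg_replace(array_keys($rules), array_values($rules), $chunk);
    }

    return implode('', $chunks);
}

// Word boundaries (\b) keep "meters" from matching inside longer words.
$rules = [
    '/\bcolor\b/'  => 'colour',
    '/\bColor\b/'  => 'Colour',
    '/\bmeters\b/' => 'metres',
];

$out = replaceOutsideTags('<span style="color:red">This is the color red</span>', $rules);
```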
You don't need to use a regex explicitly. You can try the str_replace function, or if you need it to be case insensitive use the str_ireplace function.
Example:
$str = "<p>Color: red</p>";
$new_str = str_ireplace('color', 'colour', $str); // plain string search: no regex delimiters needed
You can pass an array with all the words that you want to search for, instead of the string.
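A short sketch of the array form follows; the sample sentence is made up. Two limitations are worth keeping in mind: the replacement does not preserve the original capitalisation, and because it is a plain substring search it would also rewrite matches inside tags, attributes or longer words.

```php
<?php
$str = '<p>Color: the color red, 2000 meters away</p>';

// Parallel arrays: each search word is replaced by the entry at the same index.
$search  = ['color', 'meters'];
$replace = ['colour', 'metres'];

$new_str = str_ireplace($search, $replace, $str);
```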

Whitelist in php

I have an input for users where they are supposed to enter their phone number. The problem is that some people write their phone number with hyphens and spaces in them. I want to put the input through a filter to remove such things and store only digits in my database.
I figured that I could do some str_replace() for the whitespaces and special chars.
However I think that a better approach would be to pick out just the digits instead of removing everything else. I think that I have heard the term "whitelisting" about this.
Could you please point me in the direction of solving this in PHP?
Example: I want the input "0333 452-123-4" to result in "03334521234"
Thanks!
This is a non-trivial problem because there are lots of colloquialisms and regional differences. Please refer to What is the best way for converting phone numbers into international format (E.164) using Java? It's Java but the same rules apply.
I would say that unless you need something more fully-featured, keep it simple. Create a list of valid regular expressions and check the input against each until you find a match.
If you want it really simple, simply remove non-digits:
$phone = preg_replace('![^\d]+!', '', $phone);
By the way, just picking out the digits is, by definition, the same as removing everything else. If you mean something different you may want to rephrase that.
$number = filter_var(str_replace(array("+","-"), '', $number), FILTER_SANITIZE_NUMBER_INT);
FILTER_SANITIZE_NUMBER_INT strips everything except digits, plus signs and minus signs; the str_replace call takes care of the plus and minus signs.
or you could use preg_replace
$number = preg_replace('/[^0-9]/', '', $number);
You could do it two ways. Iterate through each index in the string, and run is_numeric() on it, or you could use a regular expression on the string.
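Both approaches can be sketched side by side. The helper name digitsOnly is an assumption, and the loop uses ctype_digit rather than is_numeric, since ctype_digit is the stricter check for a single digit character.

```php
<?php
// Hypothetical helper: keeps only ASCII digits by checking each character in turn.
function digitsOnly(string $input): string
{
    $digits = '';
    foreach (str_split($input) as $char) {
        if (ctype_digit($char)) {
            $digits .= $char;
        }
    }
    return $digits;
}

// Approach 1: walk the string character by character.
$viaLoop = digitsOnly('0333 452-123-4');

// Approach 2: let a regular expression drop every run of non-digits.
$viaRegex = preg_replace('/\D+/', '', '0333 452-123-4');
```

Keeping the result as a string rather than casting it to an integer preserves the leading zero.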
On the client side I do recommend using some formatting that you design when creating the form. This is good for zip or telephone fields. Take a look at this jquery plugin for a reference. It will make things much easier later on the server side.
