PHP (preg) Regular Expression For Content Indexing/Update - php

I have the following code:
/* record 863.content.en */
UPDATE language_def
SET en='<html>blah blah markup</html>'
WHERE page_id=863,
AND string_id='content';
/* record_end 863.content.en */
I would like to create an expression to match that statement where:
the data in between the periods of 863.content.en are variable BUT SPECIFIC (there will be many of these statements in a row)
the data in between the two comments is variable but NOT specific
This is what I have so far:
'[/*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*/].*[/*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*/]'

There are a few problems with your regex.
First of all, as FrankeTheKneeMan pointed out, you need delimiters. # is a good choice for HTML matches (the standard choice is / but that interferes with tags too often):
'#[/*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*/].*[/*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*/]#'
Now while [.] is a nice way of escaping a single character, it doesn't work the same for [/*]. This is a character class, that matches either / or *. Same for [*/]. Use this instead:
'#/[*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/.*/[*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/#'
Now .* is the remaining problem. Actually there are two problems with it: one is critical, the other might not be. The first is that . does not match line breaks by default. You can change this by using the s (singleline) modifier. The second is that * is greedy. Should a section appear twice in the string, you would get everything from the first corresponding /* record to the last corresponding /* record_end, even if there is unrelated stuff in between. Since your records seem to be very specific, I suppose this is not the case. Still, it is generally good practice to make the quantifier ungreedy, so that it consumes as little as possible. Here is your final regex string:
'#/[*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/.*?/[*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/#s'
For your presented example, this is
'#/[*]\s*record\s*863[.]content[.]en\s*[*]/.*?/[*]\s*record_end\s*863[.]content[.]en\s*[*]/#s'
If you want to find all of these sections, then you can make 863, content and en variable, capture them (using parentheses) and use a backreference to make sure you get the corresponding record_end:
'#/[*]\s*record\s*(\d+)[.](\w+)[.](\w+)\s*[*]/.*?/[*]\s*record_end\s*\1[.]\2[.]\3\s*[*]/#s'
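To see the last two patterns in action, here is a minimal sketch. The two-record dump and the helper recordPattern() are made up for illustration (they are not from the question); preg_quote() is used so the dots in the key are matched literally.

```php
<?php
// Hypothetical sample input: two records in one SQL dump.
$sql = <<<'SQL'
/* record 863.content.en */
UPDATE language_def SET en='<html>first</html>' WHERE page_id=863 AND string_id='content';
/* record_end 863.content.en */
/* record 864.title.de */
UPDATE language_def SET de='<html>second</html>' WHERE page_id=864 AND string_id='title';
/* record_end 864.title.de */
SQL;

// Generic pattern: capture number, string id and language, then require the
// same three values (via \1, \2, \3) in the closing comment.
$pattern = '#/[*]\s*record\s*(\d+)[.](\w+)[.](\w+)\s*[*]/.*?/[*]\s*record_end\s*\1[.]\2[.]\3\s*[*]/#s';
preg_match_all($pattern, $sql, $matches, PREG_SET_ORDER);

$records = [];
foreach ($matches as $m) {
    $records[] = "{$m[1]}.{$m[2]}.{$m[3]}";
}
print_r($records);

// Hypothetical helper: build the pattern for one specific record key.
function recordPattern(string $key): string
{
    $k = preg_quote($key, '#'); // escapes the dots and the delimiter
    return '#/[*]\s*record\s*' . $k . '\s*[*]/.*?/[*]\s*record_end\s*' . $k . '\s*[*]/#s';
}

preg_match(recordPattern('864.title.de'), $sql, $one);
echo $one[0], PHP_EOL;
```

With PREG_SET_ORDER each element of $matches is one complete record, which tends to be easier to loop over than the default grouping by capture index.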

'#/\* record (\S+) \*/.*<html>(.*)</html>.*/\* record_end \1 \*/#is'
This regular expression will split your string up into individual records. You can feel free to replace any spaces with \s*, but I left it this way for readability. \S+ matches one or more non-whitespace characters, but you can replace it with your specific strings if you like. Otherwise, you can iterate over the matches returned by preg_match_all and use the first subcapture to get the specific record, and the second subcapture to get the content between the html tags. The #s are the delimiters PHP needs around the regular expression; the trailing i makes it case insensitive and s makes the . match newlines.
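A rough sketch of how that loop over the matches might look. The input string and the $byRecord array are invented for the example; the second record uses upper-case tags only to show the effect of the i modifier.

```php
<?php
// Hypothetical dump with two records (table and column names are made up).
$dump = "/* record 1.content.en */\nUPDATE t SET en='<html>Hello</html>' WHERE id=1;\n/* record_end 1.content.en */\n"
      . "/* record 2.content.en */\nUPDATE t SET en='<HTML>World</HTML>' WHERE id=2;\n/* record_end 2.content.en */\n";

$pattern = '#/\* record (\S+) \*/.*<html>(.*)</html>.*/\* record_end \1 \*/#is';
preg_match_all($pattern, $dump, $matches, PREG_SET_ORDER);

// First subcapture: the record key. Second subcapture: the markup.
$byRecord = [];
foreach ($matches as $m) {
    $byRecord[$m[1]] = $m[2];
}
print_r($byRecord);
```

Note that this relies on every record key being unique: the greedy .* is only stopped by the \1 backreference in the closing comment.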

Related

Matching between comment lines

I'm trying to match
Some HTML content
Using preg_match
\<\!\-\- FOR (\d+) \-\-\>(.*)\<\!\-\- END FOR \-\-\>
Doesn't work since they are on different lines.
First you need to learn that < ! - > are not special characters. Escaping them with backslashes makes you look a bit silly.
Then learn about the /x and /s flags. One of them is what you need. The other is me trying to trick you into learning something unrelated.
Then test your regular expression with some HTML content that contains two or more of those FOR/END FORs and see what happens.
Also, you need to look into how to make your capturing conditions "greedy" or "non greedy". By default, matches will be greedy. So a condition such as "A(.*)B" with the string "A1B A2B A3B" would find one match "1B A2B A3" - everything from the first "A" to the last "B". If you wanted to find all the values between each set of A/B, then you need to make the match non-greedy - "A(.*?)B"
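The greedy versus non-greedy difference is easy to check directly. The snippet below is only a sketch: the $html sample is an assumption about what the comment markers look like, inferred from the pattern in the question.

```php
<?php
$subject = 'A1B A2B A3B';

// Greedy: one match running from the first A to the last B.
preg_match_all('/A(.*)B/', $subject, $greedy);
// Non-greedy: one match per A...B pair.
preg_match_all('/A(.*?)B/', $subject, $lazy);

print_r($greedy[1]); // the single capture "1B A2B A3"
print_r($lazy[1]);   // the captures "1", "2", "3"

// Assumed input shape: content wrapped in FOR / END FOR comments on separate lines.
$html = "<!-- FOR 1 -->\nfirst\n<!-- END FOR -->\n<!-- FOR 2 -->\nsecond\n<!-- END FOR -->";

// No backslashes needed for < ! - >; the s flag lets . cross line breaks.
preg_match_all('/<!-- FOR (\d+) -->(.*?)<!-- END FOR -->/s', $html, $blocks, PREG_SET_ORDER);

foreach ($blocks as $b) {
    echo $b[1], ': ', trim($b[2]), PHP_EOL;
}
```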

Matching sets of tags in PHP with Regular Expression

I am currently working on protecting my AJAX Chat against exploits by checking all text in PHP before it is passed to the client. So far I have been successful with my mission except for one part where I need to match sets of image tags.
Overall I want it to pick up any newline character between a set of tags. I have sort of managed this, but my solution is greedy and also matches newline characters outside of tags if there are multiple sets of tags.
At the moment I have the following which works if I wanted to match just [img]{newline}[/img]
if(preg_match('/\[\bimg\].*\x0A.*\[\/\bimg\]/', $text)){ //code }
But if I wanted to do [img]image.jpg[/img]{newline}[img]image.jpg[/img], it only sees the very first and end tags which I do not want.
So now I ask, how do you make it match each set of tags properly?
Edit: For clarification. Any newline characters inside tags are bad, so I want to detect them. Any newline characters outside tags are good and I want to ignore them. The reason being, if the client processes a newline character inside of a tag, it crashes.
Just make it ungreedy by putting ? after the two .*
But note that your current solution will not match this:
[img]
look, two newlines!
[/img]
I'm not sure why you want to do this, but you can make . match newlines by adding the s modifier to your regex. Then it's just "(\[img\](.*?)\[/img\])is" to match it, and you can even capture that group and individually check it for newlines if you want.
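A minimal sketch of that capture-then-check idea. The function name hasNewlineInsideImgTag() is made up for this example, and it only looks for \n (the \x0A from the question).

```php
<?php
// Hypothetical helper: true if any [img]...[/img] pair contains a newline.
function hasNewlineInsideImgTag(string $text): bool
{
    // Lazy .*? keeps each match to a single pair of tags;
    // s lets . match newlines, i ignores case.
    preg_match_all('#\[img\](.*?)\[/img\]#is', $text, $matches);

    foreach ($matches[1] as $inner) {
        if (strpos($inner, "\n") !== false) {
            return true;
        }
    }
    return false;
}

// Newline between two tag pairs: fine.
var_dump(hasNewlineInsideImgTag("[img]image.jpg[/img]\n[img]image.jpg[/img]"));
// Newlines inside a tag pair: should be flagged.
var_dump(hasNewlineInsideImgTag("[img]\nlook, two newlines!\n[/img]"));
```

Checking each captured group separately avoids the problem of a single pattern stretching from one tag pair into the next.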
Try setting the s modifier, like this:
if (preg_match('/\[\bimg\].*\x0A.*\[\/\bimg\]/s', $text)) { code }
See also the PHP Documentation for Regex modifiers

Regular expression doesn't quite work

I have created a Regular Expression (using php) below; which must match ALL terms within the given string that contains only a-z0-9, ., _ and -.
My expression is: '~(?:\(|\s{0,},\s{0,})([a-z0-9._-]+)(?:\s{0,},\s{0,}|\))$~i'.
My target string is: ('word', word.2, a_word, another-word).
Expected terms in the results are: word.2, a_word, another-word.
I am currently getting: another-word.
My Goal
I am detecting a MySQL function from my target string, this works fine. I then want all of the fields from within that target string. It's for my own ORM.
I suppose there could be a situation whereby further parentheses are included inside this expression.
From what I can tell, you have a list of comma-separated terms and wish to find only the ones which satisfy [a-z0-9._\-]+. If so, this should be correct (it returns the correct results for your example at least):
'~(?<=[,(])\\s*([a-z0-9._-]+)\\s*(?=[,)])~i'
The main issues were:
$ at the end, which was anchoring the query to the end of the string
When matching all, you continue from the end of the previous match - this means that if you consume a comma/close parenthesis at the end of one match, it's no longer there to match at the beginning of the next one. I've solved this with a lookbehind ((?<=...)) and a lookahead ((?=...)), which check for those characters without consuming them
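Doubling the backslashes is a safe habit rather than a strict requirement here: in a single-quoted PHP string \s reaches the regex engine unchanged, but \\ and \' are interpreted by PHP first, so writing \\s makes the intent explicit.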
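For reference, a short sketch of the lookaround pattern run with preg_match_all against the target string from the question (the variable names are just for the example):

```php
<?php
$target = "('word', word.2, a_word, another-word)";

// Lookbehind/lookahead assert the surrounding comma or parenthesis
// without consuming it, so neighbouring terms can both match.
preg_match_all('~(?<=[,(])\s*([a-z0-9._-]+)\s*(?=[,)])~i', $target, $m);

print_r($m[1]); // word.2, a_word, another-word
```

The quoted 'word' is skipped because the quote character is not in the allowed class.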
EDIT: Since you said in a comment that some of the terms may be strings that contain commas you will first want to run your input through this:
$input = preg_replace('~(\'([^\']+|(?<=\\\\)\')+\'|"([^"]+|(?<=\\\\)")+")~', '"STRING"', $input);
which should replace all strings with '"STRING"', which will work fine for matching the other regex.
Maybe using a regex is overkill. For this kind of text you can just remove the parentheses and explode the string by comma.

Parse block with php regex

I'm trying to write a (I think) pretty simple RegEx with PHP but it's not working.
Basically I have a block defined like this:
%%%%blockname%%%%
stuff goes here
%%%%/blockname%%%%
I'm not any good at RegEx, but this is what I tried:
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/(.*?)%%%%$/i',$input,$matches);
It returns an array with 4 empty entries.
I guess it also, apart from actually working, needs some sort of pointer for the third match because it should be equal to the first one?
Please enlighten me :)
You need to allow the dot to match newlines, and to allow ^ and $ to match at the start and end of lines (not just the entire string):
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/(.*?)%%%%$/sm',$input,$matches);
The s (single-line) option makes the dot match any character including newlines.
The m (multi-line) option allows ^ and $ to match at the start and end of lines.
The i option is unnecessary in your regex since there are no case-sensitive characters in it.
Then, to answer the second part of your question: If blockname is the same in both cases, then you can make that explicit by using a backreference to the first capturing group:
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/\1%%%%$/sm',$input,$matches);
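As a quick sketch of the backreference version in use (the two block names and the $blocks array are invented for this example):

```php
<?php
// Hypothetical input with two named blocks.
$input = "%%%%header%%%%\nstuff goes here\n%%%%/header%%%%\n"
       . "%%%%footer%%%%\nmore stuff\n%%%%/footer%%%%";

// s: dot matches newlines; m: ^ and $ match at line boundaries;
// \1 forces the closing name to equal the opening name.
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/\1%%%%$/sm', $input, $matches, PREG_SET_ORDER);

$blocks = [];
foreach ($matches as $m) {
    $blocks[$m[1]] = trim($m[2]); // block name => block body
}
print_r($blocks);
```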
I'm pretty sure you can't since these operations would need to save a variable and you can't in regex. You should try to do this using PHP's built-in token parser. http://php.net/manual/en/function.token-get-all.php

Reg-Ex for filtering out, parsing and replacing a specific string? (php)

While developing a private CMS for a client, I've had an idea to implement a php-underlying, yet server-side and flexible "language".
I'm having trouble finding a regular expression that finds (filters out) the following string ( [..] is the code, which will be parsed after it has been filtered out ). I want to filter the string out including the line breaks.
<(
[..]
)>
I was looking for a solution all night, but I didn't find a solution.
First off: Listen to Dan Grossman's advice above.
From my current understanding of your question, you want to get the verbatim content between <( and )> - no exceptions, no comment handling.
If so, try this RegExp
'/<\(((?:.|\s)*?)\)>/'
which you can use like this
preg_match_all('/<\(((?:.|\s)*?)\)>/', $yourstring, $matches)
It doesn't need case insensitivity, and it does lazy matching (so you can apply it to a string with several instances of matches).
Explanation of the RegExp: Starting with <(, ending with )> (brackets escaped of course), in between is the capturing group. At its core, we take either regular characters . or whitespace \s (which solves your problem, since line breaks are whitespace too). We don't want to capture every single character, so the inner group is non capturing - just either whitespace or character: (?:.|\s). This is repeated any number of times (including zero), but only until the first match is complete: *? for lazy 0-n. That's about it, hope it helps.
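A small sketch of the pattern applied to a template with two code blocks. The $template content is made up; the second call shows an equivalent form that uses the s modifier instead of (?:.|\s), which is an alternative and not part of the answer above.

```php
<?php
// Hypothetical template: one multi-line block and one single-line block.
$template = "<p>Intro</p>\n<(\n  echo 'one';\n)>\n<p>Middle</p>\n<( echo 'two'; )>";

// Pattern from the answer: (?:.|\s) covers line breaks, *? keeps it lazy.
preg_match_all('/<\(((?:.|\s)*?)\)>/', $template, $matches);
$snippets = array_map('trim', $matches[1]);
print_r($snippets);

// Alternative: let . match newlines with the s modifier.
preg_match_all('/<\((.*?)\)>/s', $template, $alt);
var_dump($alt[1] === $matches[1]);
```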
