Update 5/26
I've fixed the behavior of the regular expressions that were previously contained in this question, but as others have mentioned, my syntax still wasn't correct. Apparently the fact that it compiles is due to PHP's preg_* family of functions overlooking my mistakes.
I'm definitely a PCRE novice so I'm trying to understand what mistakes are present so that I can go about fixing them. I'm also open to critique about design/approach, and as others have mentioned, I am also going to build in compatibility with JSON and YAML, but I'd like to go ahead and finish this home-brewed parser since I have it working and I just need to work on the expression syntax (I think).
Here are all of the preg_match_all references and the one preg_replace reference extracted from the whole page of code:
// matches the outside container of objects {: and :}
$regex = preg_match_all('/\s\{:([^\}]+):\}/i', $this->html, $HTMLObjects);
// double checks that the object container is removed
$markup = preg_replace('/[\{:]([^\}]+):\}/i', '$1', $markup);
// matches all dynamic attributes (those containing bracketed data)
$dynamicRegEx = preg_match_all('/[\n]+([a-z0-9_\-\s]+)\[([^\]]+)\]/', $markup, $dynamicMatches);
// matches all static attributes (simple colon-separated attributes)
$staticRegEx = preg_match_all('/([^:]+):([^\n]+)/', $staticMarkup, $staticMatches);
If you'd like to see the preg_match_all and preg_replace references in context so that you can comment/critique that as well, you can see the containing source file by following the link below.
Note: viewing the source code of the page makes everything much more readable
http://mdl.fm/codeshare.php?htmlobject
Like I said, I have it functioning as it stands, I'm just asking for supervision on my PCRE syntax so that it isn't illegal. However, if you have comments on the structure/design or anything else I'm open to all suggestions.
(Rewritten to reflect new question)
The first regex is correct, but you don't need to escape } within a character class. Also, I usually include both braces to avoid matching nested objects: in the string "{:foo {:bar:} baz:}" (ignoring the leading \s for a moment), your regex would match {:foo {:bar:}, whereas mine would only match {:bar:}. The /i mode modifier is useless since there is no cased text in your regex.
// matches the outside container of objects {: and :}
$regex = preg_match_all('/\s\{:([^{}]+):\}/', $this->html, $HTMLObjects);
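The effect of excluding both braces can be checked on a nested sample. This is only a sketch: the input string is made up, and it starts with a space so that the leading \s in both patterns has something to match.

```php
<?php
// Hypothetical input: one container nested inside another, preceded by a space.
$html = " {:foo {:bar:} baz:}";

// Original class [^\}] lets the capture run across the inner "{:".
preg_match_all('/\s\{:([^\}]+):\}/', $html, $loose);

// Excluding both braces confines the match to the innermost container.
preg_match_all('/\s\{:([^{}]+):\}/', $html, $strict);

var_dump($loose[1]);  // one capture: "foo {:bar"
var_dump($strict[1]); // one capture: "bar"
```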
In your second regex, there is an incorrect character class at the start that needs to be removed. Otherwise, it's the same.
// double checks that the object container is removed
$markup = preg_replace('/\{:([^{}]+):\}/', '$1', $markup);
Your third regex looks OK; there's another useless character class, though. Again, I've included both brackets in the negated character class. I'm not sure why you've made it case-sensitive - shouldn't there be an /i modifier here?
// matches all dynamic attributes (those containing bracketed data)
$dynamicRegEx = preg_match_all('/\n+([a-z0-9_\-\s]+)\[([^\[\]]+)\]/i', $markup, $dynamicMatches);
The last regex is OK, but it will always match from the very first character of the string until the first colon (and then on to the rest of the line). I think I would add a newline character to the first negated character class to make sure that can't happen:
// matches all static attributes (simple colon-separated attributes)
$staticRegEx = preg_match_all('/([^\n:]+):([^\n]+)/', $staticMarkup, $staticMatches);
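To illustrate why the newline belongs in the first negated class, here is a small comparison. The attribute names and values are invented for the example.

```php
<?php
// Hypothetical static markup: each attribute sits on its own line.
$staticMarkup = "\nid:main\nclass:wide";

// Without \n in the first class, each key swallows the preceding newline.
preg_match_all('/([^:]+):([^\n]+)/', $staticMarkup, $before);

// With \n excluded, a key can no longer start on the previous line.
preg_match_all('/([^\n:]+):([^\n]+)/', $staticMarkup, $after);

var_dump($before[1]); // "\nid", "\nclass"
var_dump($after[1]);  // "id", "class"
var_dump($after[2]);  // "main", "wide"
```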
Related
I want to run PHP's preg_replace_callback against all single- or double-quoted strings, for which I'm using the code seen on https://codereview.stackexchange.com/a/217356, which includes handling of backslashed single/double quotes.
const PATTERN = <<<'PATTERN'
~(?|(")(?:[^"\\]|\\(?s).)*"|(')(?:[^'\\]|\\(?s).)*'|(#|//).*|(/\*)(?s).*?\*/|(<!--)(?s).*?-->)~
PATTERN;
$result=preg_replace_callback(PATTERN, function($m) {
return $m[1]."XXXX".$m[1];
}, $test);
but this runs into a problem when scanning blocks like that seen in .replace() calls from javascript, e.g.
x=y.replace(/'/g, '"');
... which treats '/g, ' as a string, with the "');......." as the following string.
To work around this I figure it would be good to do the callback except when the quotes are inside the first argument of .replace() as these cause problems with quoting.
i.e. do the standard callbacks, but when .replace is involved I want to change the XXXX part of abc.replace(/\'/, "XXXX"); but I want to ignore the \' quote/part.
How can I do this?
See https://onlinephp.io/c/5df12 ** https://onlinephp.io/c/8a697 for a running example, showing some successes (in green), and some failures (in red).
(** Edit to correct missing slash)
Note, the XXXX is a placeholder for some more work later.
Also note that I have looked at "Javascript regex to match a regex", but that talks about matching regexes, and I'm talking about excluding them. If you plug their regex pattern into my code it does not work, so it should not be considered a valid answer.
You can use verbs (*SKIP)(*F) to skip something. For skipping the first argument e.g.:
\(\s*/.*?/\w*\h*,(*SKIP)(*F)|(?|(")[^"\\]*(?:\\.[^"\\]*)*"|(')[^'\\]*(?:\\.[^'\\]*)*')
See this demo at regex101 or your updated php demo
The pattern on the skipped side is very simple, you might want to further improve that.
Besides, I used a slightly more efficient pattern to match the quoted parts, explained here.
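For reference, a minimal sketch of that pattern plugged into the callback from the question. The sample JavaScript line and the extra z='abc' assignment are made up for the demonstration; XXXX is the question's placeholder.

```php
<?php
// Left alternative: a regex literal as first argument, e.g. "(/'/g," ; it is
// matched and then thrown away with (*SKIP)(*F). Right alternative: quoted strings.
const PATTERN = <<<'PATTERN'
~\(\s*/.*?/\w*\h*,(*SKIP)(*F)|(?|(")[^"\\]*(?:\\.[^"\\]*)*"|(')[^'\\]*(?:\\.[^'\\]*)*')~
PATTERN;

// Hypothetical test input.
$test = <<<'JS'
x=y.replace(/'/g, '"'); z='abc';
JS;

$result = preg_replace_callback(PATTERN, function ($m) {
    // $m[1] is the opening quote captured by the branch-reset group.
    return $m[1] . 'XXXX' . $m[1];
}, $test);

echo $result, "\n"; // x=y.replace(/'/g, 'XXXX'); z='XXXX';
```

The quote inside /'/g is left alone because the whole "(/'/g," prefix is consumed by the skipped alternative before the string alternative gets a chance to start there.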
I am looking for a way to replace all similar-looking strings in an entire page with their defined values.
Please do not recommend me other methods of including language constants.
Strings like this :
[_HOME]
[_NEWS]
All of them share the same [_*] form.
Now the big issue is how to scan an HTML page and replace them with the defined values.
One way to parse the HTML page is to use DOMDocument and then preg_replace() it,
but my main problem is writing a pattern for the replacement:
$pattern = "/[_i]/";
$replacement= custom_lang("/i/");
$doc = new DOMDocument();
$htmlPage = $doc->loadHTML($html);
preg_replace($pattern, $replacement, $htmlPage);
In a regex, [ and ] are metacharacters that delimit a character class, so to match them literally you need to escape them.
The other problem with your expression is _*, which matches zero or more underscores. You need to replace it with something meaningful, such as _.*, which matches _ followed by any other characters. So your full expression becomes:
/\[_.*?\]/
Why the ?, you might be tempted to ask: it makes the match non-greedy.
If [_foo] [_bar] is the input string, a greedy match returns a single match spanning the whole string, because the expression is valid for all of it, whereas a non-greedy match gets you two separate matches. (More information)
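The greedy versus non-greedy difference is easy to verify. A short sketch using the same sample string:

```php
<?php
$html = '[_foo] [_bar]';

// Greedy: .* runs to the last "]" in the string.
preg_match_all('/\[_.*\]/', $html, $greedy);

// Non-greedy: .*? stops at the first "]" it can.
preg_match_all('/\[_.*?\]/', $html, $lazy);

var_dump($greedy[0]); // one match: "[_foo] [_bar]"
var_dump($lazy[0]);   // two matches: "[_foo]" and "[_bar]"
```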
You might be better off being more restrictive, by requiring an _ followed by capital letters only. Like,
/\[_[A-Z]+\]/
Update: Using the matched strings and replacing them. To do so we use a concept called back-referencing.
Consider modifying the above expression, enclosing the string in parentheses, like, /\[_([A-Z]+)\]/
Now in the preg_replace arguments we can use the expression in parentheses by back-referencing it with $1. So what you can use is,
preg_replace("/\[_([A-Z]+)\]/e", "my_wonderful_replacer('$1')", $html);
Note: We needed the e modifier to treat the second parameter as PHP code. (More information) Be aware that the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0; on current versions preg_replace_callback() does the same job.
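On PHP 7 and later the e modifier no longer exists, so the same idea would be written with preg_replace_callback(). A minimal sketch follows; the body of my_wonderful_replacer() and the translation table are assumptions made up for illustration, since the answer only names the function.

```php
<?php
// Hypothetical lookup standing in for the answer's my_wonderful_replacer().
function my_wonderful_replacer(string $key): string
{
    $strings = ['HOME' => 'Home', 'NEWS' => 'News']; // assumed sample data
    // Unknown keys are put back unchanged.
    return $strings[$key] ?? '[_' . $key . ']';
}

$html = '<a href="/">[_HOME]</a> <a href="/news">[_NEWS]</a> [_MISSING]';

$out = preg_replace_callback(
    '/\[_([A-Z]+)\]/',
    function (array $m): string {
        return my_wonderful_replacer($m[1]); // $m[1] is the text inside [_...]
    },
    $html
);

echo $out, "\n"; // <a href="/">Home</a> <a href="/news">News</a> [_MISSING]
```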
If you know the full keyword you are trying to replace (e.g. [_HOME]), then you can just use str_replace() to replace all instances.
No need to make things like this more complex by introducing regex.
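When the full set of keywords is known up front, a plain str_replace() with arrays covers it. The translation table below is an assumed example, not something from the question.

```php
<?php
// Assumed keyword => value table.
$translations = [
    '[_HOME]' => 'Home',
    '[_NEWS]' => 'News',
];

$html = '<li>[_HOME]</li><li>[_NEWS]</li>';

// str_replace() accepts parallel arrays of search and replacement strings.
$out = str_replace(array_keys($translations), array_values($translations), $html);

echo $out, "\n"; // <li>Home</li><li>News</li>
```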
I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
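As a sketch, the empty strings can also be dropped by preg_split() itself through its PREG_SPLIT_NO_EMPTY flag. The u modifier is an addition here so that the Unicode properties apply to UTF-8 input; the two inputs are the ones from the question.

```php
<?php
// Split on any symbol or punctuation character repeated two or more times.
$pattern = '/([\pS\pP])\\1+/u';

$a = preg_split($pattern, '======hello((my first program)) world======', -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($pattern, '======hello(my first program)) world======', -1, PREG_SPLIT_NO_EMPTY);

var_dump($a); // "hello", "my first program", " world"
var_dump($b); // "hello(my first program", " world"
```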
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But you could just consume ([\pS\pP])\\1+ and discard it, or, if that doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record it, since your lexer is going to use more than one regex anyway?
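The "consume and discard, otherwise consume and record" idea could look roughly like the loop below. This is a sketch under assumptions: the lex() function, the \G anchoring with an offset, and the u modifier are mine, and the backreference is \1 rather than \3 because each pattern now stands alone.

```php
<?php
// Hypothetical mini-lexer: only preg_match() is used, one call per token attempt.
function lex(string $input): array
{
    $tokens = [];
    $pos = 0;
    $len = strlen($input);
    while ($pos < $len) {
        // A symbol/punctuation character repeated 2+ times: consume and discard.
        if (preg_match('/\G([\pS\pP])\\1+/u', $input, $m, 0, $pos)) {
            $pos += strlen($m[0]);
        // Otherwise: text plus single symbols (never followed by themselves): record.
        } elseif (preg_match('/\G(?:[^\pS\pP]|([\pS\pP])(?!\\1))+/u', $input, $m, 0, $pos)) {
            $tokens[] = $m[0];
            $pos += strlen($m[0]);
        } else {
            break; // nothing matched; stop rather than loop forever
        }
    }
    return $tokens;
}

var_dump(lex('======hello((my first program)) world======'));
var_dump(lex('======hello(my first program)) world======'));
```

The first call yields "hello", "my first program" and " world"; the second yields "hello(my first program" and " world".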
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
Here's a piece of code from the xss_clean method of the Input_Core class of the Kohana framework:
do
{
// Remove really unwanted tags
$old_data = $data;
$data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);
Is the do ... while loop necessary? I would think that the preg_replace call would do all the work in just one iteration.
Well, it's necessary if the replacement potentially creates new matches in the next iteration. It's not very wasteful because it's only an additional check at worst, though.
Going by the code it matches, it seems unlikely that it will create new matches by replacement, however: it's very strict about what it matches.
EDIT: To be more specific, it tries to match an opening angle bracket optionally followed by a slash followed by one of several keywords optionally followed by any number of symbols that are not a closing angle bracket and finally a closing angle bracket. If the input follows that syntax, it'll be swallowed whole. If it's malformed (e.g. multiple opening and closing angle brackets), it'll generate garbage until it can't find substrings matching the initial sequence anymore.
So, no. Unless you have code like <<iframe>iframe>, no repetition is necessary. But then you're dealing with a level of tag soup the regex isn't good enough for anyway (e.g. it will fail on < iframe> with the extra space).
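The <<iframe>iframe> case can be reproduced directly. A small sketch with Kohana's pattern copied as-is and a made-up input:

```php
<?php
$pattern = '#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i';

$data = '<<iframe>iframe>';

// A single pass removes the inner tag and thereby assembles a new one.
$onePass = preg_replace($pattern, '', $data);

// Repeating until nothing changes removes that one as well.
$passes = 0;
do {
    $old_data = $data;
    $data = preg_replace($pattern, '', $data);
    $passes++;
} while ($old_data !== $data);

var_dump($onePass); // "<iframe>"
var_dump($data);    // ""
var_dump($passes);  // 3 (two passes that change something, one that confirms)
```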
EDIT2: It's also a bit odd that the pattern matches zero or more slashes at the beginning of the tag (it should be zero or one). The final *+ is not a mistake, though: in PCRE it is a possessive quantifier, meaning zero or more repetitions with no backtracking into what has already been matched.
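On the *+ question: PCRE reads it as a possessive quantifier, that is, "zero or more, and never give anything back". A toy comparison (the patterns are illustrative only, not taken from Kohana):

```php
<?php
// Greedy: a* first takes "aaa", then gives one "a" back so the final "a" can match.
$greedy = preg_match('/^a*a$/', 'aaa');

// Possessive: a*+ takes "aaa" and keeps it, so the final "a" has nothing left to match.
$possessive = preg_match('/^a*+a$/', 'aaa');

var_dump($greedy);     // int(1)
var_dump($possessive); // int(0)
```

In [^>]*+> the possessive form cannot change the result, because [^>] can never consume the > that follows; it only spares the engine pointless backtracking.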
On a completely unrelated subject, I would like to add a word on optimisation here.
preg_replace() can tell you whether a replacement has been made or not (see the 5th argument, $count, which is passed by reference). It's far more efficient than comparing strings, especially if they are large.
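A sketch of the loop rewritten around that fifth argument. The pattern is shortened to two tag names purely to keep the example readable, and the input is made up.

```php
<?php
// Shortened, illustrative version of the Kohana pattern.
$pattern = '#</*(?:iframe|script)[^>]*+>#i';
$data = '<<iframe>iframe>text';

$passes = 0;
do {
    // 4th argument -1 means "no limit"; $count receives the number of replacements.
    $data = preg_replace($pattern, '', $data, -1, $count);
    $passes++;
} while ($count > 0);

var_dump($data);   // "text"
var_dump($passes); // 3
```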
I need to retrieve content of <p> tag with given class. Class could be simplecomment or comment ...
So I wrote the following code
preg_match("|(<p class=\"(simple)?comment(.*)?\">)(.*)<\/p>|ism", $fcon, $desc);
Unfortunately, it returns nothing. However, if I remove the tag-ending part (<\/p>) it works somehow, returning a string which is too long (from the tag start to the end of the document) ...
What is wrong with my regular expression?
Try using a dom parser like http://simplehtmldom.sourceforge.net/
If I read the example code on simplehtmldom's homepage correctly
you could do something like this:
$html->find('div.simplecomment', 0)->innertext = '';
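The quick fix here is the following (note the ~ delimiters: the pattern now contains an alternation, so | can no longer serve as the delimiter):
'~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is'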
Changes:
The construct (.*) will just blindly match everything, which stops your regular expression from working, so I've replaced those instances completely with more strict matches:
...comment(.*)?... – this will match all or nothing, basically. I replaced this with [^"]* since that will match zero or more non-" characters (basically, it will match up to the closing " character of the class attribute).
...>)(.*)<\/p>... – again, this will match too much. I've replaced it with an efficient pattern that will match all non-< characters, and once it hits a < it will check if it is followed by </p>. If it is, it will stop matching (since we're at the end of the <p> tag), otherwise it will continue.
I removed the m flag since it has no use in this regular expression.
But it won't be reliable (imagine <p class="comment">...<p>...</p></p>; it will match <p class="comment">...<p>...</p>).
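Both the fix and its nested-tag limitation can be checked with a short script. The sample HTML is invented, and the pattern uses ~ as delimiter because it contains a | alternation.

```php
<?php
$pattern = '~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is';

// Hypothetical well-behaved input: inner markup, no nested <p>.
$fcon = '<div><p class="simplecomment first">Hello <b>world</b></p><p class="other">x</p></div>';
preg_match($pattern, $fcon, $desc);
var_dump($desc[3]); // "Hello <b>world</b>"

// Nested <p>: the match ends at the first </p>, so the outer tag is cut short.
$nested = '<p class="comment">a<p>b</p></p>';
preg_match($pattern, $nested, $cut);
var_dump($cut[0]); // "<p class="comment">a<p>b</p>"
```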
To make it reliable, you'll need to use recursive regular expressions or (even better) an HTML parser (or XML if it's XHTML you're dealing with.) There are even libraries out there that can handle malformed HTML "properly" (like browsers do.)
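As one possible shape of the parser route, here is a sketch using PHP's bundled DOMDocument and DOMXPath. This is my choice of parser rather than one the answer names, and the sample markup is made up.

```php
<?php
// Hypothetical page fragment.
$fcon = '<html><body><p class="simplecomment">First <b>note</b></p>'
      . '<p class="intro">skip</p><p class="comment extra">Second</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($fcon);
libxml_clear_errors();

// Match <p> elements whose class list contains "comment" or "simplecomment" as a whole word.
$xpath = new DOMXPath($doc);
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " comment ")'
       . ' or contains(concat(" ", normalize-space(@class), " "), " simplecomment ")]';

$texts = [];
foreach ($xpath->query($query) as $p) {
    $texts[] = $p->textContent; // text only; $doc->saveHTML($p) would keep inner markup
}

var_dump($texts); // "First note", "Second"
```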