Python: unserialize PHP array with regex - php

I'm pretty new to regular expressions but decided to use them to unserialize PHP arrays. Here's some background info:
I rewrote a database-based website for companies in django which was written in PHP. There is an M2M relation with companies and industries. In the previous model it was solved by using serialized PHP arrays so I now have to sync everything correctly. My first attempt was some splitting and cutting and it was really ugly so I decided to dive into regular expressions. Here is what I got (it's working perfectly fine) now:
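def unserialize_array(serialized_array):
    import re
    # Capture everything between the first opening quote and the final `";}`.
    matched_sub = re.search(r'^a:\d+:\{i:\d+;s:\d+:"(.*?)";\}$', serialized_array).group(1)
    # Replace each `";i:N;s:N:"` separator with a marker, then split on it.
    industry_list = re.sub(r'";i:\d+;s:\d+:"', "? ", matched_sub).split("? ")
    new_dict = dict(enumerate(industry_list))
    return new_dict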
I was wondering however if I couldn't do all this with a single regular expression instead of two.

Update: the pattern now also handles escaped quotes \" (written \\" in a Python string literal) and any other escape sequence, such as an escaped quote following an escaped backslash, \\\" (written \\\\\\").
If I understood the structure of your input correctly, you can rewrite your function like this:
def unserialize_array(serialized_array):
    import re
    return dict(enumerate(re.findall(r'"((?:[^"\\]|\\.)*)"', serialized_array)))
Assumed input (it is not specified explicitly in your question):
a:3:{i:1;s:9:"industry1\\\\\\"A\\"";i:2;s:9:"\\"industry2\\"";i:3;s:9:"industry3"}
Output:
{0: 'industry1\\\"A\"', 1: '\"industry2\"', 2: 'industry3'} (as Python actually prints it: {0: 'industry1\\\\\\"A\\"', 1: '\\"industry2\\"', 2: 'industry3'})
How it works
There is no need to match the entire structure of the serialized array, because we are only interested in the string contents. The regex "((?:[^"\\]|\\.)*)" consumes one character at a time until it meets either a backslash \ (in which case it accepts the backslash plus the following character) or the closing double quote ".
The capturing group ensures that the surrounding double quotes are excluded from the result.
Finally, re.findall returns all the desired strings as a list in a single call.
This relies on a peculiarity of re.findall: if the pattern contains at least one capturing group, it returns the group contents instead of the entire match. In fact, the docs state:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
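To make that findall behaviour concrete, here is a minimal sketch. The sample string is an assumption (a simple two-element array without escaped quotes), not data from the question.

```python
import re

# Hypothetical sample input, shaped like the serialized arrays discussed above.
serialized = 'a:2:{i:0;s:9:"industry1";i:1;s:9:"industry2";}'

# With a capturing group, findall returns only the group contents.
with_group = re.findall(r'"((?:[^"\\]|\\.)*)"', serialized)

# With no capturing group (the inner group is non-capturing),
# findall returns the whole match, surrounding quotes included.
without_group = re.findall(r'"(?:[^"\\]|\\.)*"', serialized)

print(with_group)
print(without_group)
```

The only difference between the two patterns is the outer pair of parentheses, which is what strips the quotes from the result.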

Another suggestion: use re.sub with a callback that calls unserialize_array recursively.

Related

preg_replace_callback to run EXCEPT when inside first argument of .replace()

I want to run PHP's preg_replace_callback against all single- or double-quoted strings, for which I'm using the code seen on https://codereview.stackexchange.com/a/217356, which includes handling of backslash-escaped single/double quotes.
const PATTERN = <<<'PATTERN'
~(?|(")(?:[^"\\]|\\(?s).)*"|(')(?:[^'\\]|\\(?s).)*'|(#|//).*|(/\*)(?s).*?\*/|(<!--)(?s).*?-->)~
PATTERN;
$result = preg_replace_callback(PATTERN, function($m) {
    return $m[1]."XXXX".$m[1];
}, $test);
but this runs into a problem when scanning blocks like that seen in .replace() calls from javascript, e.g.
x=y.replace(/'/g, '"');
... which treats '/g, ' as a string, with the "');......." as the following string.
To work around this I figure it would be good to do the callback except when the quotes are inside the first argument of .replace() as these cause problems with quoting.
i.e. do the standard callbacks, but when .replace is involved I want to change the XXXX part of abc.replace(/\'/, "XXXX"); but I want to ignore the \' quote/part.
How can I do this?
See https://onlinephp.io/c/5df12 ** https://onlinephp.io/c/8a697 for a running example, showing some successes (in green), and some failures (in red).
(** Edit to correct missing slash)
Note, the XXXX is a placeholder for some more work later.
Also note that I have looked at "Javascript regex to match a regex", but that is about matching regexes, and I'm talking about excluding them. If you plug their regex pattern into my code it does not work, so it should not be considered a valid answer.
You can use the backtracking control verbs (*SKIP)(*F) to skip something. For example, to skip the first argument:
\(\s*/.*?/\w*\h*,(*SKIP)(*F)|(?|(")[^"\\]*(?:\\.[^"\\]*)*"|(')[^'\\]*(?:\\.[^'\\]*)*')
See this demo at regex101 or your updated php demo
The pattern on the skipped side is very simple; you might want to improve it further.
Besides, I used a slightly more efficient pattern to match the quoted parts, explained here.
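For comparison, Python's built-in re module has no (*SKIP)(*F) verbs, so the same idea has to be emulated there. A rough sketch, assuming the usual workaround: match the part to skip as the first alternative with no capture group, and have the callback return it unchanged. The names PATTERN and mask_strings are made up for this illustration, and the quoted-string part uses a simpler backreference form rather than the unrolled pattern above.

```python
import re

PATTERN = re.compile(
    r'''\(\s*/.*?/\w*[ \t]*,'''          # skipped side: "(" + regex literal + ","
    r'''|(["'])(?:\\.|(?!\1)[^\\])*\1''' # a quoted string; group 1 is the quote char
)

def mask_strings(source):
    """Replace the contents of quoted strings with XXXX, leaving a regex
    literal used as the first call argument untouched."""
    def repl(match):
        quote = match.group(1)
        if quote is None:
            # The skipped alternative matched: put it back unchanged.
            return match.group(0)
        return quote + "XXXX" + quote
    return PATTERN.sub(repl, source)

print(mask_strings(r'''x=y.replace(/'/g, '"');'''))
```

As with the PCRE version, the skipped side is deliberately naive and would need tightening for real JavaScript input.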

Regex: Split function call arguments

Running into this problem and I've been searching for days. I'm using PHP to parse formulas for a platform.
A formula could be something like:
object.Field
ADD(object.NumberOfTHings, object.NumberOfThings)
object.DoSomething(ADD(object.NumberOfTHings, object.NumberOfThings), 'words!')
The idea is, it can nest many levels. Users can include quotes (double and single) as well.
I'm working on a function that will return each parameter at the highest level. So
object.DoSomething(ADD(object.NumberOfTHings, object.NumberOfThings), 'words!')
Will need to return the following array:
ADD(object.NumberOfTHings, object.NumberOfThings)
'words!'
We then go back and parse each parameter appropriately (some are object calls, function calls, etc.). I'm open to parsing it all at once, but figured that would just be more complicated.
My current regex is as follows:
/(?'pullsinglequotes'\'.+?\')|(?'pulldoublequotes'\".+?\")|(?'pullfunctions'[^,]\(([^()]|(?R))*\))/
It MOSTLY works, but has two issues:
Won't return objects yet (ex. if I reference object.Field as a parameter).
Only includes the last letter of a function.
Here's a REGEXR with the issue:
https://regexr.com/41e20
I've tried many different variations of REGEX and each has its downsides.
My question is: Does anyone have enough regex knowledge to solve those two issues? If so, any help would be greatly appreciated.
Update
If anyone is interested, the following was my final regex.
/(?'pullsinglequotes'\'.+?\')|(?'pulldoublequotes'\".+?\")|(?'pullfunctions'\b[\w.]+\s*\(([^()]|(?R))*\))|(?'pullvars'\w+(?:\.\w+)?)/
Your pullfunctions group only matches a single character that is not a comma before the opening parenthesis. Allow it to repeat and precede it with a word boundary.
For the vars and objects, just use repeating word characters with an optional dot-separated part. You can change this to a character class to allow other characters such as - (\w already includes _).
Full regex:
(?'pullsinglequotes'\'.+?\')|(?'pulldoublequotes'\".+?\")|(?'pullfunctions'\b[\w]+\s*\(([^()]|(?R))*\))|(?'pullvars'\w+(?:\.\w+)?)
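If the regex gets unwieldy, the top-level arguments can also be pulled out without recursion by scanning the string and tracking parenthesis depth. This is a different technique from the (?R) pattern above, sketched in Python (whose re module has no recursion). The function name is hypothetical, and the sketch assumes quotes are never backslash-escaped inside strings.

```python
def split_top_level_args(call):
    """Return the top-level arguments of the outermost call in `call`."""
    inner = call[call.index("(") + 1 : call.rindex(")")]
    args, current = [], []
    depth = 0     # nesting depth of parentheses inside the argument list
    quote = None  # the quote character we are currently inside, if any
    for ch in inner:
        if quote:
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            current.append(ch)
        elif ch == "," and depth == 0:
            # A comma outside quotes and nested calls ends an argument.
            args.append("".join(current).strip())
            current = []
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args

formula = "object.DoSomething(ADD(object.NumberOfTHings, object.NumberOfThings), 'words!')"
print(split_top_level_args(formula))
```

Each returned argument can then be fed back through the same function if it is itself a call.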

Regex that will match each specific tag that contains ../

I'm trying to find a regex that will match each specific tag that contains ../.
I had it matching when each element was on its own line. But then there was an instance where my HTML rendered on one line causing the regex to match the whole line:
<body><img src="../../../img.png"><img src="../../img.png"><img src="../../img.png"><img src="..//../img.png"><img src="..../../img.png">
Here was the regex that I was using
<.*[\.]{2}[\/].*>
You need to make sure to match only one tag per match.
Using a negative character class like below will accomplish that.
<[^>]*\.\./[^>]*>
< = start of tag
[^>]* = any number of characters that aren't >, since > would end the tag
\.\./ = "../" with escapes for the . characters
[^>]* = same as above
> = end of tag
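A quick way to see the difference between the two patterns is to run both against the one-line HTML from the question. This is only a sketch in Python; the variable names are made up for the illustration.

```python
import re

html_line = (
    '<body><img src="../../../img.png"><img src="../../img.png">'
    '<img src="../../img.png"><img src="..//../img.png"><img src="..../../img.png">'
)

# Original greedy pattern: .* runs across tag boundaries.
greedy = re.findall(r'<.*[\.]{2}[\/].*>', html_line)

# Negated character class: a match can never extend past the closing >.
per_tag = re.findall(r'<[^>]*\.\./[^>]*>', html_line)

print(len(greedy), len(per_tag))
```

The greedy pattern yields a single match spanning the whole line, while the negated class yields one match per img tag and skips the body tag.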
It appears you might be doing this to prevent path parenting. You should know that for a URL attribute in an HTML tag, the following tags are considered "equivalent":
<img src="../foo.jpg">
<img src="%2e%2e%2ffoo.jpg">
<img src="&#46;&#46;&#47;foo.jpg">
That's because the src attribute goes through HTML entity un-escaping, and then URL un-escaping (in that order) before being used. As a result, there are 5,832 different ways to write '../' into an HTML tag's path attribute (18 ways to write each character times 3 characters).
Making a regex to match any of these encodings of ../ is more difficult, but still possible.
(\.|&#46;|(%|&#37;)(2|&#50;)([Ee]|&#69;|&#101;)){2}(/|&#47;|(%|&#37;)(2|&#50;)([Ff]|&#70;|&#102;))
For reference:
&#46; = . HTML escape sequence
&#47; = / HTML escape sequence
%2E or %2e = . URL escape sequence
%2F or %2f = / URL escape sequence
&#37; = % HTML escape sequence
&#50; = 2 HTML escape sequence
&#69; = E HTML escape sequence
&#101; = e HTML escape sequence
&#70; = F HTML escape sequence
&#102; = f HTML escape sequence
You can see why people usually say it's better to use a real HTML parser, instead of regex!
Anyway, assuming you need this and a full HTML parser isn't feasible, here's the version of <[^>]*[="'/]\.\./[^>]*> that also catches HTML and URL escaping:
<[^>]*[="'/](\.|&#46;|(%|&#37;)(2|&#50;)([Ee]|&#69;|&#101;)){2}(/|&#47;|(%|&#37;)(2|&#50;)([Ff]|&#70;|&#102;))[^>]*>
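If the attribute value has already been isolated, an alternative to enumerating encodings in the regex is to decode the value in the same order (HTML entities first, then URL escapes) and look for a plain ../ afterwards. A minimal Python sketch of that idea; has_parent_ref is a hypothetical helper name, and it assumes a single round of each decoding.

```python
import html
from urllib.parse import unquote

def has_parent_ref(attr_value):
    """Return True if the attribute value contains ../ after decoding."""
    # HTML entity un-escaping first, then URL un-escaping.
    decoded = unquote(html.unescape(attr_value))
    return "../" in decoded

for value in ["../foo.jpg", "%2e%2e%2ffoo.jpg", "&#46;&#46;&#47;foo.jpg", "img/foo.jpg"]:
    print(value, has_parent_ref(value))
```

This sidesteps the combinatorial pattern, at the cost of needing the attribute value extracted first.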
The regex matches the whole line because it is greedy. Try it this way, as @Avinash Raj commented.
SEE DEMO
To get the regexp you want I will try to follow a step by step approach:
First, we need a regex that matches the beginning and end of the tag. But we must be careful, as the tag-end character > is allowed inside single- and double-quoted strings. We first construct the regexp that matches these quoted strings: ([^"'>]|"[^"]*"|'[^']*')* (a sequence of: a character that is neither a quote (single or double) nor the tag-end character, or a single-quoted string, or a double-quoted string).
Now, modify it to match a single quoted string or a double quoted string that includes a ../: ([^"'>]|"[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')* (we can simplify it, eliminating the last * operator, as we will match the whole string with only one matching ../ inside, and we can eliminate the first option, as we will have the ../ seq inside quoted strings). We get to: ("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')
To get a string matching a sequence including at least one of the second strings, we concatenate the first regex at the beginning and at the end, and the other in the middle. We get to: ([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*
Now, we only need to surround this regexp with the needed sequences first <[iI][mM][gG][ \t\n], and after >, getting to:
<[iI][mM][gG][ \t\n]([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*>
This is the regexp we need. See demo. If we extract the content of the second group ($2, \2, etc.), we get the attribute value (quotes included) that contains the ../ string.
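As a rough check of the assembled pattern, here it is compiled in Python and applied to a made-up snippet where one attribute contains a > inside quotes. The sample HTML and the constant name are assumptions for the illustration.

```python
import re

IMG_WITH_PARENT_REF = re.compile(
    r'''<[iI][mM][gG][ \t\n]([^"'>]|"[^"]*"|'[^']*')*'''  # tag start + leading attributes
    r'''("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')'''        # group 2: quoted value with ../
    r'''([^"'>]|"[^"]*"|'[^']*')*>'''                     # trailing attributes + tag end
)

# Hypothetical input: the first tag has a > inside a quoted attribute value.
sample = '<p><img alt="a>b" src="../x.png"><img src=\'ok.png\'></p>'

values = [m.group(2) for m in IMG_WITH_PARENT_REF.finditer(sample)]
print(values)
```

Only the first img tag is reported, and the > inside the alt value does not end the match early.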
Don't try to simplify this further, as > characters are allowed inside single- and double-quoted strings, " is allowed in single-quoted strings, and ' is allowed in double-quoted strings. As someone explained in another answer to this question, you cannot be greedy (using .* inside, as it will eat as much input as possible before matching). This regexp will also need to match multiline tags, as these could be part of your input file. If you have a well-formed HTML file, you'll have no problem with this regexp.
A final remark: an HTML tag is defined by a grammar that is regular (it is a regular subset of the full HTML syntax), so it is perfectly parseable with a regex (the same is not true for the complete HTML language). A regex is by far more efficient and less resource-consuming than a full HTML parser. The caveats are that you have to write it (and write it well), whereas HTML parsers that save you that work are easily found with some googling; on the other hand, you only have to write it once. Regexp parsing is a one-pass process whose cost grows (for this example, at least) linearly with the length of the input text. You'll be advised against this by people who simply don't know how to write the right regexp or don't know how to determine whether a grammar is regular.
Note:
This regexp will match commented tags. If you don't want to match commented <img> tags, you'll have to extend your regexp a little, or do two passes: eliminate comments first, and then parse tags (the regexp that only recognizes uncommented tags is far more complicated than this one). Also, see below for more difficulties you may run into when eliminating parent directory references.
Note 2:
As I have read in your comments on some answers, the problem you want to solve (eliminating .. references in HTML/XML sources) is not regular. The reason is that you can have . and .. references embedded in the path strings. Normally, one first eliminates the /. or ./ components of the path, getting a path without . (current directory) references. Once you have this, you have to eliminate a/.. references, where a is anything other than ... This leads to eliminating occurrences of a/.., a/b/../.., etc. But the language a^i b^i is not regular (as the pumping lemma shows), so you'll need a context-free grammar.
Note 3:
If you limit the number of a/b/c/../../.. levels to some maximum bound, you can still find a regexp that matches this kind of string, but there will always be an example deeper than the bound that breaks your regexp. Remember, you first have to eliminate the single-dot . path components (as you can have something like a/b/./././c/./d/.././e/f/.././../..). Eliminating the single-dot components first leads to: a/b/c/d/../e/f/../../../... Then you proceed by pairs of <non ..>/.., going from a/b/c/[d/..]/e/f/../../../.. to a/b/c/e/[f/..]/../../.. -> a/b/c/[e/..]/../.. -> a/b/[c/..]/.. -> a/[b/..] -> a (to be precise, you ought to check that the first component of each pair actually exists before eliminating it). If you end up with an empty path, you have to change it to . for it to be usable.
I have code to do this process, but it's embedded in some bigger program. If you are interested, you can access this code. (look at the rel_path() routine here)
You cannot eliminate a .. element at the beginning of a path (or, more precisely, one that has no <non ..> counterpart), as it refers to something outside the tree, making the reference dependent on the external structure of the tree.
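The elimination steps described above can be sketched with a simple stack. This is not the rel_path() routine mentioned earlier, just a minimal Python illustration for relative paths; the function name is hypothetical, and it does not check that the components exist on disk.

```python
def collapse_parent_refs(path):
    """Drop '.' components and cancel each '<non ..>/..' pair in a relative path."""
    stack = []
    for part in path.split("/"):
        if part in ("", "."):
            # Empty and single-dot components carry no information.
            continue
        if part == ".." and stack and stack[-1] != "..":
            # Cancel the pair <non ..>/..
            stack.pop()
        else:
            # A '..' with no counterpart must be kept: it points outside the tree.
            stack.append(part)
    # An empty result means "current directory".
    return "/".join(stack) or "."

print(collapse_parent_refs("a/b/c/d/../e/f/../../../.."))
```

For relative paths like these, the standard library's posixpath.normpath performs the same purely textual normalization.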

How to match a string until the first instance of a character that does not follow another specific character

Related question: How can I use regex to match a character (') when not following a specific character (?)?
I'm parsing a log using regex (PHP PCRE library), and trying to extract a URL from it. The URL is encapsulated in double quotes ", but some of the requests also include a double quote ". For example:
"https://www.amh.net.au/online/dbSearch.php?t=all&q=\"Rosuvastatin\""
My first pattern was basically:
#\"([^\"]*)\"#
This worked well, until I reached one of the entries as above, and it truncated the match so all I got was:
https://www.amh.net.au/online/dbSearch.php?t=all&q=\
After digging around, and rediscovering the cheatsheets for regex at http://addedbytes.com and also some more useful information at http://www.regular-expressions.info/lookaround.html I have now tried the following look-behind:
#"([(?<!\\)"]*)"#
But now all I get is "" and then an empty string.
You placed your lookbehind inside a character class ([...]), so it is not interpreted as a lookbehind; the class just lists those individual characters.
Basically, I think you'd like something like this:
#"((?:[^"]|(?<=\\)")*)"#
Be aware, though, that this is fooled by an escaped backslash right before the closing quote, e.g. \\".
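A pattern that does not have the \\" problem consumes every backslash together with the character after it, so a quote only ends the string when it is not part of an escape pair. A small sketch in Python (the same pattern should carry over to PCRE unchanged); the surrounding GET and 200 fields are an assumed log layout, not taken from the question.

```python
import re

# Either a character that is neither a quote nor a backslash, or a backslash plus any character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Hypothetical log line; only the quoted URL comes from the question.
log_line = r'GET "https://www.amh.net.au/online/dbSearch.php?t=all&q=\"Rosuvastatin\"" 200'
url = QUOTED.search(log_line).group(1)
print(url)

# An escaped backslash right before the closing quote is handled as well.
tricky = r'"C:\\dir\\" next "x"'
print(QUOTED.findall(tricky))
```

The captured value still contains the backslashes, so un-escaping is a separate step.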
The URLs in the logs would be URL-encoded. As such, the following pattern should work:
#\"([^ ]*)\"#

Regular expression doesn't quite work

I have created a Regular Expression (using php) below; which must match ALL terms within the given string that contains only a-z0-9, ., _ and -.
My expression is: '~(?:\(|\s{0,},\s{0,})([a-z0-9._-]+)(?:\s{0,},\s{0,}|\))$~i'.
My target string is: ('word', word.2, a_word, another-word).
Expected terms in the results are: word.2, a_word, another-word.
I am currently getting: another-word.
My Goal
I am detecting a MySQL function from my target string, this works fine. I then want all of the fields from within that target string. It's for my own ORM.
I suppose there could be a situation whereby further parentheses are included inside this expression.
From what I can tell, you have a list of comma-separated terms and wish to find only the ones which satisfy [a-z0-9._\-]+. If so, this should be correct (it returns the correct results for your example at least):
'~(?<=[,(])\\s*([a-z0-9._-]+)\\s*(?=[,)])~i'
The main issues were:
$ at the end, which anchored the pattern to the end of the string.
When matching all occurrences, each search continues from the end of the previous match. This means that if one match consumes a comma or closing parenthesis at its end, that character is no longer available at the beginning of the next match. I've solved this with a lookbehind ((?<=...)) and a lookahead ((?=...)), which do not consume characters.
Your backslashes need to be doubled, since the first one may be stripped by PHP when parsing the string.
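The effect of the first two points is easy to reproduce outside PHP. Here is a small sketch using Python's re.findall in place of preg_match_all (an assumption about how the PHP pattern is being applied), run on the target string from the question.

```python
import re

target = "('word', word.2, a_word, another-word)"

# Original pattern: anchored with $ and consuming the separators.
original = re.findall(
    r'(?:\(|\s{0,},\s{0,})([a-z0-9._-]+)(?:\s{0,},\s{0,}|\))$', target, re.I
)

# Lookaround version: separators are asserted but not consumed.
fixed = re.findall(r'(?<=[,(])\s*([a-z0-9._-]+)\s*(?=[,)])', target, re.I)

print(original)
print(fixed)
```

The quoted 'word' is skipped by both patterns because ' is not in the allowed character class.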
EDIT: Since you said in a comment that some of the terms may be strings that contain commas you will first want to run your input through this:
$input = preg_replace('~(\'([^\']+|(?<=\\\\)\')+\'|"([^"]+|(?<=\\\\)")+")~', '"STRING"', $input);
which should replace all strings with '"STRING"', which will work fine for matching the other regex.
Using a regex may be overkill here. For this kind of text you can just remove the parentheses and explode the string by commas.
