Parsing Javascript-Array in PHP - php

I have a string containing the JavaScript code for an array:
parent.data[0].c = [[10,'TESTVALUE',]];
with a lot of nested arrays in it. What is the best way to parse it with PHP? JSON is not an option because the data is only available in the format above.
Thx!

If you remove the extra comma after 'TESTVALUE' and change 'TESTVALUE' to "TESTVALUE", the array literal is valid JSON.
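A minimal sketch of that idea, assuming the values never contain quotes or a comma directly before a closing bracket. The function name jsArrayLiteralToPhp is hypothetical; only str_replace(), preg_replace() and json_decode() are standard PHP.

```php
<?php
// Hypothetical helper: turn "name = [ ... ];" into a PHP array via JSON.
// Naive by design: it rewrites every single quote and every trailing comma,
// so it is only safe when string values contain neither.
function jsArrayLiteralToPhp(string $js): ?array
{
    // Drop everything up to and including the first '=' (the assignment).
    $eq = strpos($js, '=');
    $literal = $eq === false ? $js : substr($js, $eq + 1);

    // Drop surrounding whitespace and the trailing semicolon.
    $literal = rtrim(trim($literal), ';');

    // 'TESTVALUE' -> "TESTVALUE"
    $literal = str_replace("'", '"', $literal);

    // Remove commas that sit directly before a closing ] or }.
    $literal = preg_replace('/,\s*([\]}])/', '$1', $literal);

    $decoded = json_decode($literal, true);
    return is_array($decoded) ? $decoded : null;
}

$parsed = jsArrayLiteralToPhp("parent.data[0].c = [[10,'TESTVALUE',]];");
```

If the strings can contain quotes or commas followed by brackets, this rewriting is not reliable and a real tokenizer is the safer route.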

I would probably do this with a recursive function that uses regular expressions to find matching pairs of [ ] and builds the array piece by piece. It will probably take some trial and error to cover every edge case in the regular expressions, but thanks to the recursion it should not be too much work.
Good luck.
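As a rough sketch of the recursive approach, here is a variation that scans the string character by character instead of matching bracket pairs with a regular expression. The function name parseJsValue is hypothetical, and the sketch assumes only numbers, quoted strings without escaped quotes, and nested arrays occur in the input.

```php
<?php
// Hypothetical recursive parser for nested JavaScript array literals.
// $i is the current offset into $s and is advanced as input is consumed.
// Assumptions: no escaped quotes inside strings, no objects, no keywords.
function parseJsValue(string $s, int &$i)
{
    $len = strlen($s);
    $i += strspn($s, " \t\r\n", $i);            // skip whitespace
    if ($i >= $len) {
        throw new InvalidArgumentException('Unexpected end of input');
    }

    $ch = $s[$i];
    if ($ch === '[') {                          // nested array: recurse per element
        $i++;
        $items = [];
        while (true) {
            // Skip whitespace and commas; this also swallows trailing commas.
            $i += strspn($s, " \t\r\n,", $i);
            if ($i >= $len) {
                throw new InvalidArgumentException('Missing closing bracket');
            }
            if ($s[$i] === ']') {
                $i++;
                return $items;
            }
            $items[] = parseJsValue($s, $i);
        }
    }

    if ($ch === "'" || $ch === '"') {           // quoted string
        $end = strpos($s, $ch, $i + 1);
        if ($end === false) {
            throw new InvalidArgumentException('Unterminated string');
        }
        $value = substr($s, $i + 1, $end - $i - 1);
        $i = $end + 1;
        return $value;
    }

    // Number: \G anchors the match at the given offset.
    if (preg_match('/\G-?\d+(?:\.\d+)?/', $s, $m, 0, $i) === 1) {
        $i += strlen($m[0]);
        return $m[0] + 0;                       // numeric string to int or float
    }

    throw new InvalidArgumentException("Unexpected character at offset $i");
}

$js = "parent.data[0].c = [[10,'TESTVALUE',],[1.5,['a','b']]];";
$offset = strpos($js, '=') + 1;                 // start right after the assignment
$result = parseJsValue($js, $offset);
```

Because commas are simply skipped, an input such as [1,,2] yields two elements here, whereas JavaScript would keep an empty slot between them.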

Related

Don't know how to write this preg

I'm making a function call to a library that is returning a malformed json array. I can work around this if I can get a preg written to extract the part that I want.
The array is a jumbled mess, but buried deep inside it is a string that looks like this:
token=??????,
I need to write a preg to grab the characters represented by the question marks. I wrote this, but it's not getting the part of the text that I want:
$token = preg_match('#^(?:token=)?([^,]+)#i', $badJson, $matches);
Can anyone help me? Thanks.
You can try:
/token=([^,]+)/i
and then use the first sub-match to extract the token. Being more specific is usually a good idea with regex (e.g. does the token have a set length? does it only contain hex characters? etc.)
Side note: https://leaverou.github.io/regexplained/ is a great site for testing regular expressions.
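For illustration, a small sketch of how the sub-match is read in PHP. The sample value of $badJson is made up; note that preg_match() returns the match count (0 or 1), not the captured text, and that a leading ^ anchor would tie the match to the start of the string.

```php
<?php
// Hypothetical input: a malformed payload with the token buried inside.
$badJson = '{junk: [1, 2, token=abc123DEF, more junk';

// No ^ anchor, so the pattern may match anywhere in the string.
// preg_match() returns 1 on a match, 0 on no match, false on error;
// the captured text ends up in $matches[1].
if (preg_match('/token=([^,]+)/i', $badJson, $matches) === 1) {
    $token = $matches[1];
} else {
    $token = null;
}
```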

PHP Regex string returns two identical arrays

I've got a regex here to pull out all of the <tr> tags in a page. It looks like this:
preg_match_all('%<tr[^>]++>(.*?)</tr>%s', $pageText, $rows);
The problem is that, while it does find all of the tags on the page, it returns a multidimensional array in which each entry of the outer array contains an array of all of the matches. In other words, it hands me multiple copies of the one array I actually want.
Help please?
EDIT: Also relevant: I'm not allowed to use DOM for this application despite it being a significantly easier (and better) way of going about things.
What you're actually asking about is the $rows[0] list, which redundantly contains the <tr>...</tr> blob again. If you just care about the (.*?) inner data, then use \K to reset the full match.
preg_match_all('=<tr\b[^>]*+>(.*?)</tr>\K=s', $pageText, $rows);
It's not possible to get rid of $rows[0] completely. You'll have to ignore it, and use $rows[1] alone.
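To make the result layout concrete, here is a short sketch with a made-up $pageText. It uses ~ as the delimiter and shows both the default grouping and the PREG_SET_ORDER flag, which groups the result per match instead.

```php
<?php
// Hypothetical sample markup with two rows.
$pageText = "<table><tr class=\"a\"><td>1</td></tr>\n<tr><td>2</td></tr></table>";

// Default order (PREG_PATTERN_ORDER):
//   $rows[0] = full matches, $rows[1] = contents of capture group 1.
preg_match_all('~<tr\b[^>]*+>(.*?)</tr>~s', $pageText, $rows);
$inner = $rows[1];

// PREG_SET_ORDER: one entry per match, each holding [full match, group 1].
preg_match_all('~<tr\b[^>]*+>(.*?)</tr>~s', $pageText, $sets, PREG_SET_ORDER);
$firstInner = $sets[0][1];
```

The two inner arrays are therefore not identical copies: one holds the whole rows and the other holds only what the parentheses captured.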
Try this one:
preg_match_all('~<tr(?:\\s+[^>]*)?>(.*?)</tr>~si', $pageText, $rows);
var_dump($rows[1]);
Don't use % as the regex delimiter. It also introduces format specifiers in printf()-like functions, so a pattern that ends in %s or %i can be quite confusing.

Regular expression to extract json response in php

I'm new to php and am trying to write a regular expression using preg_match to extract the href value that I get from my http get.
The response looks like this:
{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}
I want to extract only the href value and pass it to my next api... i.e. /docs.
Can anyone please tell me how to extract this?
I've been using http://www.solmetra.com/scripts/regex/index.php to test my regex, and have had no luck for the past day :(
please any help would be appreciated.
Thanks,
DR
No need for a regex.
Use json_decode() and then access the href property.
For example:
$data = json_decode('{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}', true);
echo $data['_links']['http://a.b.co/documents']['href'];
Note: I'd encourage you to clean up your JSON if possible. Particularly the keys.
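If the link key is not known in advance, one possible variation is to loop over _links after decoding. This is only a sketch: the $response value repeats the sample from the question, and the $hrefs array is a name introduced here for illustration.

```php
<?php
// Sample response taken from the question.
$response = '{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}';

// Decode to associative arrays; json_decode() returns null on invalid JSON.
$data = json_decode($response, true);

// Collect every href, keyed by its link name.
$hrefs = [];
if (is_array($data) && isset($data['_links']) && is_array($data['_links'])) {
    foreach ($data['_links'] as $rel => $link) {
        if (is_array($link) && isset($link['href'])) {
            $hrefs[$rel] = $link['href'];
        }
    }
}
```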
Don't use regex, use json_decode(). JSON is an excellent example of a context-free grammar that you shouldn't even try to parse with regex.
Here's PHP.NET's reference on using json_decode() for just this sort of thing.
Just like HTML parsing, I would recommend not using a REGEX but rather a json parser then reading the value. Check out json_encode and json_decode functions in php.
That said if you just need the href value then here is a regex to do just that on the example you gave
preg_match('/"href":"([^"]+)"/',$string,$matches);
$matches[1];// this is the href
Regex is only the right tool when you know exactly what you want and exactly the format it will be in. Often json and HTML from other parties can't be exactly predicted. There are also examples of certain legal HTML and json which can't properly be parsed with regex so in general use a specialized parser for them.

PHP dealing with huge string

I have to replace xmlns with ns in my incoming XML in order to fix SimpleXMLElement's xpath() function. Most functions do not have a performance problem, but there always seems to be an overhead as the string grows.
E.g. preg_replace on a 2 MB string takes 50ms to process, even if I limit the replaces to 1 and the replace is done at the very beginning.
If I substr the first few characters and replace only in that part, it is slightly faster, but that is not really what I want.
Is there any PHP method that would perform better in my problem? And if there is no option, could a simple php extension help, that just does Replace => SimpleXMLElement in C?
If you know exactly where the offending "x", "m" and "l" are, you can just use something like $xml[$x_pos] = ' '; $xml[$m_pos] = ' '; $xml[$l_pos] = ' ' to transform them into spaces. Or transform them into ns___ (where _ = space).
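A small sketch of the overwrite idea, assuming the position of xmlns is found with strpos() and that the declaration appears only once. The sample document is made up, and whether this is measurably faster on a 2 MB string is something to benchmark rather than assume.

```php
<?php
// Hypothetical sample document.
$xml = '<root xmlns="urn:x"><a/></root>';

// Locate the attribute, then blank out "xml" so that only "ns" remains.
$pos = strpos($xml, 'xmlns=');
if ($pos !== false) {
    $xml[$pos]     = ' ';   // x
    $xml[$pos + 1] = ' ';   // m
    $xml[$pos + 2] = ' ';   // l
}
```

The string keeps its length, so nothing after the attribute has to move; the extra spaces inside the start tag are insignificant whitespace in XML.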
You're always going to get an overhead when trying to do this - you're dealing with a char array and trying to replace multiple matching elements of the array (i.e. words).
50ms is not much of an overhead, unless (as I suspect) you're trying to do this in a loop?
50ms sounds pretty reasonable to me, for something like this. The requirement itself smells of something being wrong.
Is there any particular reason that you're using regular expressions? Why do people keep jumping to the overkill regex solution?
There is a bog-standard string replace function called str_replace that may do what you want in a fraction of the time (though whether this is right for you depends on how complex your search/replace is).
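As a sketch of the non-regex route: str_replace() replaces every occurrence and has no limit argument (its fourth parameter only reports the count), so a single replacement can be done with strpos() and substr_replace() instead. The sample XML is made up, and any speed gain is an assumption to verify by measurement.

```php
<?php
// Hypothetical sample document.
$xml = '<root xmlns="http://example.com/ns"><item/></root>';

// Replace only the first occurrence of "xmlns=" with "ns=".
$needle = 'xmlns=';
$pos = strpos($xml, $needle);
$fixed = $pos === false
    ? $xml
    : substr_replace($xml, 'ns=', $pos, strlen($needle));

// Replace every occurrence instead (simpler when that is acceptable).
$allFixed = str_replace($needle, 'ns=', $xml);
```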
From the PHP source, as we can see, for example here:
http://svn.php.net/repository/php/php-src/branches/PHP_5_2/ext/standard/string.c
I don't see any copies there, but I'm no expert in C. On the other hand, there are many convert-to-string calls, which at first sight could copy values. If they do copy values, then we are in trouble here.
Only if we are in trouble:
Try to reinvent str_replace here with character-by-character string processing. For example, take the string $somestring = "somevalue". In PHP we can access its characters by index: echo $somestring[0] gives "s" and echo $somestring[2] gives "m" (the older $somestring{0} curly-brace form was deprecated in PHP 7.4 and removed in PHP 8.0). I'm not sure about this approach, but it is possible if the official implementations don't use references as they should.

Escaping -> and => when parsing HTML using regular expression

I need to parse and return the tagname and the attributes in our PHP code files:
<ct:tagname attr="attr1" attr="attr2">
For this purpose the following regular expression has been constructed:
(\<ct:([^\s\>]*)([^\>]*)\>)
This expression works as expected but it breaks when the following code is parsed
<ct:form/input type="attr1" value="$item->field">
The original regular expression breaks because of the > character in the $item->field. I would need to construct a regular expression that ignores the -> or => but not the single >.
I am open to any suggestions... Thanks for your help in advance.
Try this:
<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>
But if that’s XML, you had better use an XML parser.
You could try using a negative lookbehind like this:
(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)
Matches :
<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">
Not sure that it is the best-suited solution for your case, but it respects the constraints.
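A quick way to check the lookbehind pattern in PHP is sketched below. The sample tags come from the question; single-quoted PHP strings are used so that $item is not interpolated. A > inside an attribute value that is not preceded by - or = would still end the match early.

```php
<?php
// Group 1 = whole tag, group 2 = tag name, group 3 = raw attribute text.
$pattern = '/(<ct:([^\s>]*)(.*?)(?<!-|=)>)/';

$plain  = '<ct:tagname attr="attr1" attr="attr2">';
$arrow  = '<ct:form/input type="attr1" value="$item->field">';

preg_match($pattern, $plain, $m1);
preg_match($pattern, $arrow, $m2);

$plainName  = $m1[2];
$arrowName  = $m2[2];
$arrowAttrs = $m2[3];
```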
In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.
[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.
I think what you want to do is not recognize the -> and =>, but ignore everything between pairs of quotes.
I think it can be done by inserting
("[^"]*")*
at the appropriate place.
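One way this could look as a complete pattern is sketched below. It is an interpretation of the suggestion, not the answerer's exact expression: the attribute part accepts either a quoted string or any character that is neither a quote nor >, so a > inside quotes no longer ends the tag.

```php
<?php
// Group 1 = tag name, group 2 = raw attribute text.
// Inside the tag, consume quoted strings as a unit; outside quotes,
// accept any character except >, " and '.
$pattern = '~<ct:([^\s>]*)((?:[^>"\']|"[^"]*"|\'[^\']*\')*)>~';

$arrow = '<ct:form/input type="attr1" value="$item->field">';
$bare  = '<ct:if test="$a > 1" x=\'y\'>';

preg_match($pattern, $arrow, $m1);
preg_match($pattern, $bare, $m2);

$arrowAttrs = $m1[2];
$bareAttrs  = $m2[2];
```

Unlike a lookbehind for - or =, this also copes with a lone > inside a quoted value, at the cost of assuming that quotes inside the tag are always balanced.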
My suggestion is to match to the attributes in the same expression.
\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\>
edit: removed part about > not being valid xml in attribute values.
