Regular expression to extract JSON response in PHP

I'm new to PHP and am trying to write a regular expression using preg_match to extract the href value from the response to my HTTP GET request.
The response looks like this:
{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}
I want to extract only the href value (i.e. /docs) and pass it to my next API call.
Can anyone please tell me how to extract this?
I've been using http://www.solmetra.com/scripts/regex/index.php to test my regex, and have had no luck for the past day :(
Any help would be appreciated.
Thanks,
DR

No need for a regex.
Use json_decode() and then access the href property.
For example:
$data = json_decode('{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}', true);
echo $data['_links']['http://a.b.co/documents']['href'];
Note: I'd encourage you to clean up your JSON if possible. Particularly the keys.

Don't use regex, use json_decode(). JSON is an excellent example of a context-free grammar that you shouldn't even try to parse with regex.
Here's PHP.NET's reference on using json_decode() for just this sort of thing.
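A slightly more defensive version of the json_decode() approach might look like the sketch below. It assumes the response body is already in a string ($response is just an illustrative name) and that you'd rather not hard-code the relation key under _links:

```php
<?php
// Sample body copied from the question; $response is an illustrative variable name.
$response = '{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}';

// Decode into associative arrays and fail loudly on malformed input.
$data = json_decode($response, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

// Collect every href under _links, keyed by its relation name.
$hrefs = [];
foreach ($data['_links'] ?? [] as $rel => $link) {
    if (isset($link['href'])) {
        $hrefs[$rel] = $link['href'];
    }
}

print_r($hrefs);
```

For the sample response, $hrefs ends up with a single entry mapping http://a.b.co/documents to /docs.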

Just like with HTML parsing, I would recommend not using a regex but rather a JSON parser, and then reading the value. Check out the json_encode and json_decode functions in PHP.
That said, if you just need the href value, here is a regex that does just that for the example you gave:
preg_match('/"href":"([^"]+)"/', $string, $matches);
$href = $matches[1]; // this is the href
Regex is only the right tool when you know exactly what you want and exactly the format it will be in. Often JSON and HTML from other parties can't be exactly predicted. There are also examples of legal HTML and JSON that can't be parsed properly with regex, so in general use a specialized parser for them.

Related

PHP Escape a string if it hasn't already been escaped with entities

I'm using a 3rd party API that seems to return its data with the entity codes already in there, such as The Lion&#8217;s Pride.
If I print the string as-is from the API it renders just fine in the browser (in the example above it would put in an apostrophe). However, I can't trust that the API will always use the entities in the future, so I want to use something like htmlentities or htmlspecialchars myself before I print it. The problem with this is that it will encode the ampersand in the entity code again, and the end result will be The Lion&amp;#8217;s Pride in the HTML source, which doesn't render as anything user-friendly.
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?
No one seems to be answering your actual question, so I will:
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?
It's impossible. What if I'm making an educational post about HTML entities and I want to actually print this on the screen:
The Lion&#8217;s Pride
... it would need to be encoded as...
The Lion&amp;#8217;s Pride
But what if that was the actual string we wanted to print on the screen? ... and so on.
Bottom line is, you have to know what you've been given and work from there – which is where the advice from the other answers comes in – which is still just a workaround.
What if they give you double-encoded strings? What if they start wrapping the HTML-encoded strings in XML? And then wrap that in JSON? ... And then the JSON is converted to binary strings? The possibilities are endless.
It's not impossible for the API you depend on to suddenly switch the output type, but it's also a pretty big violation of the original contract with your users. To some extent, you have to put some trust in the API to do what it says it's going to do. Unit/Integration tests make up the rest of the trust.
And because you could never write a program that works for any possible change they could make, it's senseless to try to anticipate any change at all.
Decode the string with html_entity_decode(), then re-encode the entities:
$string = htmlspecialchars(html_entity_decode($string));
https://eval.in/662095
There is NO WAY to do what you ask for!
You must know what kind of data is the service giving back.
Anything else would be guessing.
Example:
What if the service returns &amp; but is not escaping?
You would guess that it IS escaping, so you would wrongly interpret the value as & when the correct value is &amp;
I think the best solution is to first decode all HTML entities/special chars in the original string, and then HTML-encode the string again.
That way you will end up with a correctly encoded string, no matter if the original string was encoded or not.
You also have the option of using htmlspecialchars_decode();
$string = htmlspecialchars_decode($string);
It's already in htmlentities:
php > echo htmlentities('Hi&amp;mom', ENT_HTML5, ini_get('default_charset'), false);
Hi&amp;mom
php > echo htmlentities('Hi&amp;mom', ENT_HTML5, ini_get('default_charset'), true);
Hi&amp;amp;mom
Just pass false as the optional 4th argument (double_encode) to NOT double-encode.
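To compare the two suggestions side by side (decode then re-encode, versus turning off double_encode), here is a small sketch. normalize_entities() is a made-up helper name, and the sample strings are illustrative rather than taken from the API in the question:

```php
<?php
// Hypothetical helper, not part of PHP: decode whatever entities are present, then encode once.
function normalize_entities(string $s): string
{
    $flags = ENT_QUOTES | ENT_HTML5;
    return htmlspecialchars(html_entity_decode($s, $flags, 'UTF-8'), $flags, 'UTF-8');
}

$plain   = 'Fish & Chips';      // not yet encoded
$encoded = 'Fish &amp; Chips';  // already encoded once

// Approach 1: decode, then re-encode.
echo normalize_entities($plain), "\n";
echo normalize_entities($encoded), "\n";

// Approach 2: encode, but leave existing entities alone (double_encode = false).
echo htmlspecialchars($plain, ENT_QUOTES | ENT_HTML5, 'UTF-8', false), "\n";
echo htmlspecialchars($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8', false), "\n";
```

All four lines print Fish &amp; Chips. Neither approach can tell apart text that is supposed to display a literal entity, which is the limitation pointed out in the "it's impossible" answer.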

Don't know how to write this preg

I'm making a function call to a library that is returning a malformed json array. I can work around this if I can get a preg written to extract the part that I want.
The array is a jumbled mess, but buried deep inside it is a string that looks like this:
token=??????,
I need to write a preg to grab the characters represented by the question marks. I wrote this, but it's not getting the part of the text that I want:
$token = preg_match('#^(?:token=)?([^,]+)#i', $badJson, $matches);
Can anyone help me? Thanks.
You can try:
/token=([^,]+)/i
and then use the first sub-match to extract the token. Being more specific is usually a good idea with regex (e.g. does the token have a set length? Does it only contain hex characters?).
Side note: https://leaverou.github.io/regexplained/ is a great site for testing regular expressions.
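Put together in PHP, that could look like the sketch below. The contents of $badJson are invented for illustration, since the real response isn't shown. Note that preg_match() returns 1, 0 or false rather than the matched text, so the token has to be read from $matches, and the pattern is not anchored with ^ because the token is buried in the middle of the string:

```php
<?php
// Invented sample input; the real malformed response is not shown in the question.
$badJson = '{foo:[bar, token=abc123, baz]}';

// preg_match() only reports whether a match was found; the captured text lands in $matches.
if (preg_match('/token=([^,]+)/i', $badJson, $matches) === 1) {
    $token = $matches[1];
} else {
    $token = null; // no token found
}

var_dump($token);
```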

Counterpart to PHP’s preg_match in Python

I am planning to move one of my scrapers to Python. I am comfortable using preg_match and preg_match_all in PHP. I am not finding a suitable function in Python similar to preg_match. Could anyone please help me in doing so?
For example, if I want to get the content between <a class="title" and </a>, I use the following function in PHP:
preg_match_all('/a class="title"(.*?)<\/a>/si',$input,$output);
Whereas in Python I am not able to figure out a similar function.
You're looking for Python's re module.
Take a look at re.findall and re.search.
And since you mentioned that you are trying to parse HTML, use an HTML parser for that. There are a couple of options available in Python, such as lxml or BeautifulSoup.
Take a look at this: Why you should not parse html with regex
I think you need something like this:
import re
output = re.search(r'a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)  # group(0) is the whole match; group(1) is the captured part
print(output)
You can add (?s) at the start of the regex to enable DOTALL mode, so that . also matches newlines (the equivalent of PHP's /s modifier):
output = re.search(r'(?s)a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)
print(output)
You might be interested in reading about Python Regular Expression Operations

Parsing Javascript-Array in PHP

If I have a String, containing the Javascript-Code for an array:
parent.data[0].c = [[10,'TESTVALUE',]];
with a lot of nested arrays in it, what is the best way to parse it with PHP? JSON is not an option because the data is only available in the format above.
Thx!
If you remove the extra comma after TESTVALUE, and change 'TESTVALUE' to "TESTVALUE", it's valid JSON.
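A rough sketch of that clean-up in PHP follows. It assumes the string values never contain quotes, backslashes, or the sequence ",]", so treat it as a starting point rather than a general JavaScript parser:

```php
<?php
$js = "parent.data[0].c = [[10,'TESTVALUE',]];";

// Keep only the array literal: everything after the first '=' without the trailing ';'.
$literal = trim(substr($js, strpos($js, '=') + 1));
$literal = rtrim($literal, ';');

// Swap single-quoted strings for double-quoted ones (assumes no quotes or backslashes inside).
$literal = preg_replace("/'([^'\"\\\\]*)'/", '"$1"', $literal);

// Drop trailing commas that sit directly before a closing bracket.
$literal = preg_replace('/,\s*\]/', ']', $literal);

// What is left should be valid JSON.
$data = json_decode($literal, true);

var_dump($data);
```

For the sample line, $data is a nested array containing 10 and 'TESTVALUE'. If json_decode() returns null, the input broke one of the assumptions above.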
I would probably do this with a recursive function that uses regular expression to find matching pairs of [ ] and build the array piece by piece. Probably requires a bit of trial and error not to miss any possible error sources when matching the regular expressions, but should not be too much work, due to recursivity.
Good luck.

escaping json string with a forward slash?

I am having a problem passing a json string back to a php script to process.
I have a json string that's been created by using dojo.toJson() that contains a / and looks like this:
[{"id":"2","company":"My Company / Corporation","jobrole":"Consultant","jobtitle":"System Integration Engineer"}]
When I pass the string back to the PHP script it gets chopped at the / and creates a malformed JSON string, which means I can't convert it into a PHP array.
What is the best way of escaping the / in this string? I was looking at regular expressions and doing a string.replace() however my regex isn't that strong, and I'm not sure if there are better ways of doing this?
Many thanks
You shouldn't need to do anything special to represent a / in JSON: a string can contain any character except a ", a \ (when not used to start an escape sequence), or an unescaped control character.
The problem is possibly therefore in:
the way you parse the JSON server side
the way you parse the HTTP data to get the JSON string
the way you encode the string before making the HTTP request
(I'd bet on it being the last of those options).
I would start by using a tool such as LiveHttpHeaders or Charles Proxy to see exactly what data is sent to the server.
(I'd also expand the question with the code you use to make the request, and the code you use to parse it at the other end).
Use \/. Take a look here. The documentation is really easy to read, concise and clear. But an unescaped / should still be valid in a JSON string, so maybe your bug is somewhere else?
Ok. Anyway.
When passing variables to PHP, don't use JSON - it's good for passing variables the other way.
Instead, you'd be better off using the http://api.dojotoolkit.org/jsdoc/1.3/dojo.objectToQuery method and parsing the standard PHP $_GET variables on the PHP side.
EDIT: Ok, I'm 'lost in the woods' here also, but here's a tip - check if you don't have some mod_rewrite rules in action here. Kind of seems like that.
Also, if you can send me the URL which gave you 404 (you can cut out domain part, i'm interested in script filename and all afterwards) maybe I can give you more detailed answer.
To be clear, whether you choose to send JSON to PHP or use regular form values is a matter of preference. It /should/ work either way. It sounds like you aren't URL-encoding the JSON on the client side, so the server side is treating / as a path delimiter. In that case it's borked before json_decode gets to it.
so, try encodeURIComponent( dojo.toJson(stuff) )
json_encode() escapes forward slashes by default, like this:
prompt> json_encode(json_decode('"A/B"'));
string(6) ""A\/B""
JSON_UNESCAPED_SLASHES was added in PHP 5.4 to suppress this behavior.
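A quick way to see the difference, and to confirm that both forms decode to the same value (the variable names are just for illustration):

```php
<?php
$value = 'A/B';

// Default behaviour: the slash is escaped as \/ in the output.
$escaped = json_encode($value);

// With the flag (PHP 5.4+), the slash is left as-is.
$unescaped = json_encode($value, JSON_UNESCAPED_SLASHES);

echo $escaped, "\n";
echo $unescaped, "\n";

// Both forms are valid JSON and decode back to the same string.
var_dump(json_decode($escaped) === json_decode($unescaped));
```

Since \/ and / are equivalent inside a JSON string, the flag only changes how the output looks, not what it means.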
