Extracting JSON from an HTML page with regex (PHP)

I have an HTML page which has a very large and very complex chunk of JSON in a script tag.
I want to extract the JSON so that I can decode it in a php script.
The JSON looks something like:
<script type="text/javascript">
var user_list_data_obj = (
({
... truncated ...
})
);
... some more js ...
</script>
The script tags can't be used in the pattern, because there's other JS between them, and there's nothing to make them unique anyway.
I believe I need to match against the variable name, and the first occurrence of '}));' but my attempts to match that have failed.
What I've got so far is:
$pattern = '/var user_list_data_obj = \(\s\(({.*})\)\s\);/';
Which returns nothing.
What am I doing wrong in that pattern? I know it's difficult to match anything that has opening and closing delimiters, like JSON, with a regex, but it should be possible in this case, no?
EDIT:
I'm trying to get the entire "user_list_data_obj" object parsed into my php script. But really, the bits I'm interested in are the several "columns :[] " arrays, so if it's easier to get those out separately, it might make sense to do that.
The columns[] arrays look something like
columns : [
{ display_value : '<input type="checkbox" name="user" value="username">'},
{ display_value : 'username', sort_value : 'username'},
{ display_value : 'username', sort_value : 'username'},
{ display_value : 'Enabled', sort_value : '1' },
{ display_value : '<img class="" src="/enabled.gif">', sort_value : '1' },
{ display_value : '<img class="" src="/enabled.gif">', sort_value : '1' },
{ display_value : '<img class="" src="/enabled.gif">', sort_value : '1' }
],

I was able to match the entire json object with the following
/user_list_data_obj\s*=\s*\(\s*\({(.*?)}\)\s*\);/
But in actuality, I ended up using preg_match_all to match each columns[] array in the json by using:
/columns\s*:\s*\[.*?\],/s
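As a rough sketch of that preg_match_all approach: the sample $html below is made up to mirror the columns layout from the question, and in practice it would be the fetched page source.

```php
<?php
// Hypothetical sample mirroring the question's layout; in practice $html is the fetched page.
$html = <<<'HTML'
<script type="text/javascript">
var user_list_data_obj = (
({
rows : [
{ columns : [
{ display_value : 'alice', sort_value : 'alice'},
{ display_value : 'Enabled', sort_value : '1' }
],
},
{ columns : [
{ display_value : 'bob', sort_value : 'bob'},
{ display_value : 'Disabled', sort_value : '0' }
],
}
]
})
);
</script>
HTML;

// Lazy match from "columns : [" to the first "]," (the s modifier lets . span newlines).
preg_match_all('/columns\s*:\s*\[.*?\],/s', $html, $matches);

foreach ($matches[0] as $i => $block) {
    echo "columns block #$i:\n$block\n\n";
}
```

Because the match stops at the first "]," it would cut a block short if a display_value ever contained that sequence, so this only suits data shaped like the sample.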

The closest I can get is
preg_match('/var user_list_data_obj = \(\s+\(({.*})\)\s+\);/s', $html, $matches);
The s modifier allows . to match newlines.
This is imperfect as it makes assumptions about the structure: namely that the JSON you need literally starts with
( /* some space */
({
and ends with
}) /* some space */
);
If you can't make those assumptions, a less specific regex will likely match other parts of the script. Also, if you have }) ); at some point in the script that you don't want to match, it will still be matched. Using {.*?} won't work because there can be many nested object literals in the string you want to capture.
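One way around the nesting problem is to skip the regex for the body and count braces instead. The following is only a sketch: extract_js_object is a hypothetical helper, the sample input is invented, and it ignores JavaScript comments and regex literals.

```php
<?php
// Hypothetical helper: returns the first balanced {...} literal assigned to $varName, or null.
function extract_js_object(string $source, string $varName): ?string
{
    // Locate the assignment; preg_quote keeps the variable name literal.
    if (!preg_match('/\b' . preg_quote($varName, '/') . '\s*=/', $source, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = strpos($source, '{', $m[0][1]);
    if ($start === false) {
        return null;
    }

    $depth = 0;
    $quote = null; // the quote character we are currently inside, if any
    $len = strlen($source);
    for ($i = $start; $i < $len; $i++) {
        $ch = $source[$i];
        if ($quote !== null) {
            if ($ch === '\\') {
                $i++; // skip the escaped character
            } elseif ($ch === $quote) {
                $quote = null; // string literal closed
            }
        } elseif ($ch === '"' || $ch === "'") {
            $quote = $ch;
        } elseif ($ch === '{') {
            $depth++;
        } elseif ($ch === '}') {
            if (--$depth === 0) {
                return substr($source, $start, $i - $start + 1);
            }
        }
    }
    return null; // braces never balanced
}

// Invented sample: note the "}) );" hidden inside a string literal.
$html = "<script>\nvar user_list_data_obj = (\n({ a : { b : '}) );' }, c : [1, 2] })\n);\nvar other = { x : 1 };\n</script>";
echo extract_js_object($html, 'user_list_data_obj'), "\n";
```

Note that what comes back is a JavaScript object literal. With unquoted keys and single-quoted strings like those in the question it is not strict JSON, so json_decode may still reject it without further clean-up.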

Related

How can I str_replace partially in PHP in a dynamic string with unknown key content

Working in WordPress (PHP). I want to set strings to the database like below. The string is translatable, so it could be in any language keeping the template codes. For the possible variations, I presented 4 strings here:
<?php
$string = '%%AUTHOR%% changed status to %%STATUS_new%%';
$string = '%%AUTHOR%% changed status to %%STATUS_oldie%%';
$string = '%%AUTHOR%% changed priority to %%PRIORITY_high%%';
$string = '%%AUTHOR%% changed priority to %%PRIORITY_low%%';
To make the string human-readable, for the %%AUTHOR%% part I can change the string like below:
<?php
$username = 'Illigil Liosous'; // could be any unicode string
$content = str_replace('%%AUTHOR%%', $username, $string);
But for status and priority, I have different substrings of different lengths.
Question is:
How can I make those dynamic substring be replaced on-the-fly so that they could be human-readable like:
Illigil Liosous changed status to Newendotobulous;
Illigil Liosous changed status to Oldisticabulous;
Illigil Liosous changed priority to Highlistacolisticosso;
Illigil Liosous changed priority to Lowisdulousiannosso;
Those unsoundable words are to let you understand the nature of a translatable string, that could be anything other than known words.
I think I can proceed with something like below:
<?php
if( strpos($_content, '%%STATUS_') !== false ) {
// proceed to push the translatable status string
}
if( strpos($_content, '%%PRIORITY_') !== false ) {
// proceed to push the translatable priority string
}
But how can I fill inside those conditionals efficiently?
Edit
I might not have been fully clear in my question, hence this update. The issue is not related to passing arrays to str_replace.
The issue is, the $string that I need to detect is not predefined. It would come like below:
if($status_changed) :
$string = "%%AUTHOR%% changed status to %%STATUS_{$status}%%";
elseif($priority_changed) : // "else if" is a parse error in the colon syntax
$string = "%%AUTHOR%% changed priority to %%PRIORITY_{$priority}%%";
endif;
Where they will be filled dynamically with values in the $status and $priority.
So when it comes to str_replace() I will actually use functions to get their appropriate labels:
<?php
function human_readable($codified_string, $user_id) {
if( strpos($codified_string, '%%STATUS_') !== false ) {
// need a way to get the $status extracted from the $codified_string
// $_got_status = ???? // I don't know how.
get_status_label($_got_status);
// the status label replacement would take place here, I don't know how.
}
if( strpos($codified_string, '%%PRIORITY_') !== false ) {
// need a way to get the $priority extracted from the $codified_string
// $_got_priority = ???? // I don't know how.
get_priority_label($_got_priority);
// the priority label replacement would take place here, I don't know how.
}
// Author name replacement takes place now
$username = get_the_username($user_id);
$human_readable_string = str_replace('%%AUTHOR%%', $username, $codified_string);
return $human_readable_string;
}
The function has some missing points where I currently am stuck. :(
Can you guide me a way out?

It sounds like you need to use RegEx for this solution.
You can use the following code snippet to get the effect you want to achieve:
if (preg_match('/%%PRIORITY_(.*?)%%/', $codified_string, $matches)) {
// $matches[0] is the whole placeholder, $matches[1] is the captured key
$human_readable_string = str_replace($matches[0], get_priority_label($matches[1]), $codified_string);
}
Of course, the above code needs to be changed for STATUS and any other replacements that you require.
A short explanation of the RegEx:
/
The opening delimiter of the regular expression.
%%PRIORITY_
A literal match of those characters.
(
Opens the capture group. Whatever it matches is stored in the array passed as the third parameter of preg_match.
.
Matches any character that isn't a newline.
*?
Matches between 0 and unlimited repetitions of the preceding token, in this case any character. The ? makes the match lazy, so it stops at the first %% instead of running on to the last one.
)%%/
Closes the capture group, matches the closing %% literally and ends the expression.
Check out the RegEx in action: https://regex101.com/r/qztLue/1
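To handle every placeholder type in one pass, the same kind of pattern can be handed to preg_replace_callback. This is a sketch only: the two label functions are hypothetical stand-ins for the question's get_status_label and get_priority_label, and the username is passed in directly rather than looked up from a user ID.

```php
<?php
// Hypothetical label lookups; the real ones would return translated strings.
function get_status_label(string $status): string
{
    $labels = ['new' => 'Newendotobulous', 'oldie' => 'Oldisticabulous'];
    return $labels[$status] ?? $status;
}

function get_priority_label(string $priority): string
{
    $labels = ['high' => 'Highlistacolisticosso', 'low' => 'Lowisdulousiannosso'];
    return $labels[$priority] ?? $priority;
}

function human_readable(string $codified_string, string $username): string
{
    // Group 1 is the placeholder type, group 2 (optional) the key after the underscore.
    return preg_replace_callback(
        '/%%(AUTHOR|STATUS|PRIORITY)(?:_(.*?))?%%/',
        function (array $m) use ($username): string {
            switch ($m[1]) {
                case 'STATUS':
                    return get_status_label($m[2] ?? '');
                case 'PRIORITY':
                    return get_priority_label($m[2] ?? '');
                default:
                    return $username; // AUTHOR
            }
        },
        $codified_string
    );
}

echo human_readable('%%AUTHOR%% changed status to %%STATUS_new%%', 'Illigil Liosous'), "\n";
// prints: Illigil Liosous changed status to Newendotobulous
```

Adding another placeholder type then means one more alternative in the pattern and one more case in the switch, with no extra strpos checks.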

read dotted string inside single quote in PHP

In PHP I'm trying to strip information and store them in $sysver variable from a string named $freply like this:
var id='E8ABFA19FDE2';
var sys_ver='17.37.2.49';
var app_ver='20.9.1.150';
using PHP sscanf with the following parameters:
sscanf($freply, "var sys_ver='%[^']'", $sysver);
However a blank result in $sysver is all I get.
UPDATE
Working on the first row with:
sscanf($freply, "var id=' %[^']'", $ea);
Gives me a correct result loaded as expected in $ea, that shows E8ABFA19FDE2.
Can someone tell me where the mistake is?
Although I'm using PHP, I guess this question also applies to JavaScript or any other C-like language.

What you're telling sscanf() is that your string begins with the literal characters var sys_ver... etc., but then you're passing it a string that starts with var id..., so it's NOPE'ing right out.
This works:
sscanf($freply, "var id='E8ABFA19FDE2';\nvar sys_ver='%[^']'", $sysver);
or this:
foreach (explode("\n", $freply) as $line) {
if (sscanf($line, "var sys_ver='%[^']'", $sysver)) break;
}
But really sscanf() is not quite the right tool for this job. Just use preg_match():
preg_match("/var sys_ver='([^']+)'/", $freply, $matches);
$sysver = $matches[1];
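If more than one of the values is needed, a single preg_match_all call can collect them all. This is a sketch that assumes every line has the var name='value'; shape shown in the question.

```php
<?php
// Sample reply taken from the question.
$freply = "var id='E8ABFA19FDE2';\nvar sys_ver='17.37.2.49';\nvar app_ver='20.9.1.150';";

// Capture every  var name='value'  pair; PREG_SET_ORDER yields one $set per assignment.
preg_match_all("/var\s+(\w+)\s*=\s*'([^']*)'/", $freply, $sets, PREG_SET_ORDER);

$vars = [];
foreach ($sets as $set) {
    $vars[$set[1]] = $set[2]; // name => value
}

echo $vars['sys_ver'], "\n"; // prints 17.37.2.49
```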

Remove double quotes in JSON result?

My code is as follows
foreach($location_total_n_4 as $u=> $v) {
$final_location_total_4 .= "[".$u.",".$v."],";
}
I'm sending these values as JSON.
echo json_encode(array("location"=>"$final_location_total_4" ));
Here's how my response object looks:
{
"location": "[1407110400000,6641],[1407196800000,1566],[1407283200000,3614],"
}
I'm creating a graph on success with ajax, so I need it like this:
{
"location": [1407110400000,6641],[1407196800000,1566],[1407283200000,3614],
}
Can anyone help me to solve this?

The problem is that your location value is not properly serialized. It's definitely appropriate to fix this on the server side (it looks like someone is trying to implement their own json_encode and failing), but it's possible to fix on the client side as well. One possible approach:
var location = JSON.parse('[' + response.location.slice(0,-1) + ']');
slice(0,-1) removes the trailing comma, then the contents are wrapped in brackets, turning them into proper JSON (at least for the given dataset).
As for server-side, turned out I was right: this code...
foreach($location_total_n_4 as $u=> $v) {
$final_location_total_4 .= "[".$u.",".$v."],";
}
echo json_encode(array('location' => "$final_location_total_4"));
... is wrong both tactically (always adding a trailing comma) and strategically (one shouldn't solve the task already solved by the language itself). One possible replacement:
$locations = array();
foreach ($location_total_n_4 as $u => $v) {
$locations[] = array($u, $v);
}
echo json_encode(array('location' => $locations));
The bottom line: never attempt to implement your own serialization protocol unless you really know what you're doing.

There is a trailing comma (,) at the end, so JSON.parse will throw an error; we need to remove it.
b = JSON.parse("["+ data[0].substr(0,data[0].length-1) +"]");
Then b becomes
[[1407110400000,6641],[1407196800000,1566],[1407283200000,3614],[1407369600000,3654],[1407456000000,2918],[1407715200000,3900]]
without the trailing comma.

Manipulate the content of HTML strings without changing the HTML

If I have a string of HTML, maybe like this...
<h2>Header</h2><p>all the <span class="bright">content</span> here</p>
And I want to manipulate the string so that all words are reversed for example...
<h2>redaeH</h2><p>lla eht <span class="bright">tnetnoc</span> ereh</p>
I know how to extract the string from the HTML and manipulate it by passing to a function and getting a modified result, but how would I do so whilst retaining the HTML?
I would prefer a non-language specific solution, but it would be useful to know php/javascript if it must be language specific.
Edit
I also want to be able to manipulate text that spans several DOM elements...
Quick<em>Draw</em>McGraw
warGcM<em>warD</em>kciuQ
Another Edit
Currently, I am thinking to somehow replace all HTML nodes with a unique token, whilst storing the originals in an array, then doing a manipulation which ignores the token, and then replacing the tokens with the values from the array.
This approach seems overly complicated, and I am not sure how to replace all the HTML without using REGEX which I have learned you can go to the stack overflow prison island for.
Yet Another Edit
I want to clarify an issue here. I want the text manipulation to happen over x number of DOM elements - so for example, if my formula randomly moves letters in the middle of a word, leaving the start and end the same, I want to be able to do this...
<em>going</em><i>home</i>
Converts to
<em>goonh</em><i>gmie</i>
So the HTML elements remain untouched, but the string content inside is manipulated (as a whole - so goinghome is passed to the manipulation formula in this example) in any way chosen by the manipulation formula.

If you want to achieve a similar visual effect without changing the text you could cheat with css, with
h2, p {
direction: rtl;
unicode-bidi: bidi-override;
}
this will reverse the text
example fiddle: http://jsfiddle.net/pn6Ga/

Hi, I ran into this situation a long time ago and used the following code. Here is a rough version:
<?php
// Carry the capitalisation of the first letter of $word over to $replace.
function keepcase($word, $replace) {
    $replace[0] = (ctype_upper($word[0]) ? strtoupper($replace[0]) : $replace[0]);
    return $replace;
}
// regex - match the contents grouping into HTMLTAG and non-HTMLTAG chunks
$re = '%(</?\w++[^<>]*+>) # grab HTML open or close TAG into group 1
| # or...
([^<]*+(?:(?!</?\w++[^<>]*+>)<[^<]*+)*+) # grab non-HTMLTAG text into group 2
%x';
$contents = '<h2>Header</h2><p>the <span class="bright">content</span> here</p>';
// walk through the content, chunk by chunk, replacing words in non-HTMLTAG chunks only
$contents = preg_replace_callback($re, 'callback_func', $contents);
function callback_func($matches) { // here's the callback function
    if ($matches[1] !== '') { // Case 1: this is a HTMLTAG
        return $matches[1]; // return HTMLTAG unmodified
    }
    elseif (isset($matches[2])) { // Case 2: a non-HTMLTAG chunk.
        // the /e modifier was removed in PHP 7, so reverse each word with a nested callback instead
        return preg_replace_callback('/\w+/', function ($m) {
            return keepcase($m[0], strrev($m[0]));
        }, $matches[2]);
    }
    exit("Error!"); // never get here
}
echo ($contents);
?>

Parse the HTML with something that will give you a DOM API to it.
Write a function that loops over the child nodes of an element.
If a node is a text node, get the data as a string, split it on words, reverse each one, then assign it back.
If a node is an element, recurse into your function.
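The steps above could be sketched in PHP with the bundled DOM extension. The function names here are made up for illustration, strrev is byte-based (so not safe for multibyte text), and words are only handled within a single text node, so text spanning several elements such as Quick<em>Draw</em>McGraw is not treated as one word.

```php
<?php
// Recursively visit child nodes: rewrite text nodes, descend into elements.
function reverse_words_in(DOMNode $node): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Reverse each run of word characters, leaving whitespace and punctuation alone.
            $child->nodeValue = preg_replace_callback('/\w+/', function ($m) {
                return strrev($m[0]);
            }, $child->nodeValue);
        } elseif ($child instanceof DOMElement) {
            reverse_words_in($child);
        }
    }
}

function reverse_words_in_html(string $html): string
{
    $doc = new DOMDocument();
    // Wrap in a container so a fragment loads cleanly and can be serialised without <html>/<body>.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $wrap = $doc->documentElement;
    reverse_words_in($wrap);

    // Serialise the container's children only, dropping the wrapper itself.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo reverse_words_in_html('<h2>Header</h2><p>all the <span class="bright">content</span> here</p>'), "\n";
```

The tags and attributes come back untouched because only text node values are ever modified; the serialiser may insert line breaks between block-level elements.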

You could use jQuery:
$('div *').each(function(){
text = $(this).text();
text = text.split('');
text = text.reverse();
text = text.join('');
$(this).text(text);
});
See here - http://jsfiddle.net/GCAvb/

I implemented a version that seems to work quite well - although I still use (rather general and shoddy) regex to extract the html tags from the text. Here it is now in commented javascript:
Method
/**
* Manipulate text inside HTML according to passed function
* @param html the html string to manipulate
* @param manipulator the function to manipulate with (will be passed single word)
* @returns manipulated string including unmodified HTML
*
* Currently limited in that manipulator operates on words determined by regex
* word boundaries, and must return same length manipulated word
*
*/
var manipulate = function(html, manipulator) {
var block, tag, words, i,
final = '', // used to prepare return value
tags = [], // used to store tags as they are stripped from the html string
x = 0; // used to track the number of characters the html string is reduced by during stripping
// remove tags from html string, and use callback to store them with their index
// then split by word boundaries to get plain words from original html
words = html.replace(/<.+?>/g, function(match, index) {
tags.unshift({
match: match,
index: index - x
});
x += match.length;
return '';
}).split(/\b/);
// loop through each word and build the final string
// appending the word, or manipulated word if not a boundary
for (i = 0; i < words.length; i++) {
final += i % 2 ? words[i] : manipulator(words[i]);
}
// loop through each stored tag, and insert into final string
for (i = 0; i < tags.length; i++) {
final = final.slice(0, tags[i].index) + tags[i].match + final.slice(tags[i].index);
}
// ready to go!
return final;
};
The function defined above accepts a string of HTML, and a manipulation function to act on words within the string regardless of if they are split by HTML elements or not.
It works by first removing all HTML tags, and storing the tag along with the index it was taken from, then manipulating the text, then adding the tags into their original position in reverse order.
Test
/**
* Test our function with various input
*/
var reverse, rutherford, shuffle, text, titleCase;
// set our test html string
text = "<h2>Header</h2><p>all the <span class=\"bright\">content</span> here</p>\nQuick<em>Draw</em>McGraw\n<em>going</em><i>home</i>";
// function used to reverse words
reverse = function(s) {
return s.split('').reverse().join('');
};
// function used by rutherford to return a shuffled array
shuffle = function(a) {
return a.sort(function() {
return Math.round(Math.random()) - 0.5;
});
};
// function used to shuffle the middle of words, leaving each end undisturbed
rutherford = function(inc) {
var m = inc.match(/^(.?)(.*?)(.)$/);
return m[1] + shuffle(m[2].split('')).join('') + m[3];
};
// function to make word Title Cased
titleCase = function(s) {
return s.replace(/./, function(w) {
return w.toUpperCase();
});
};
console.log(manipulate(text, reverse));
console.log(manipulate(text, rutherford));
console.log(manipulate(text, titleCase));
There are still a few quirks, like the heading and paragraph text not being recognized as separate words (because they are in separate block level tags rather than inline tags) but this is basically a proof of method of what I was trying to do.
I would also like it to be able to handle the string manipulation formula actually adding and removing text, rather than replacing/moving it (so variable string length after manipulation) but that opens up a whole new can of worms I am not yet ready for.
Now I have added some comments to the code, and put it up as a gist in javascript, I hope that someone will improve it - especially if someone could remove the regex part and replace with something better!
Gist: https://gist.github.com/3309906
Demo: http://jsfiddle.net/gh/gist/underscore/1/3309906/
(outputs to console)
And now finally using an HTML parser
(http://ejohn.org/files/htmlparser.js)
Demo: http://jsfiddle.net/EDJyU/

You can use a setInterval to change it at a fixed interval, for example:
const TITTLE = document.getElementById("Tittle") //Let's get the div
setInterval(()=> {
let TITTLE2 = document.getElementById("rotate") //we get the element at the moment of execution
let spanTittle = document.createElement("span"); // we create the new element "span"
spanTittle.setAttribute("id","rotate"); // attribute to new element
(TITTLE2.textContent == "TEXT1") // We compare which string is in the span
? spanTittle.appendChild(document.createTextNode(`TEXT2`))
: spanTittle.appendChild(document.createTextNode(`TEXT1`))
TITTLE.replaceChild(spanTittle,TITTLE2) //finally, replace the old span for a new
},2000)
<html>
<head></head>
<body>
<div id="Tittle">TEST YOUR <span id="rotate">TEXT1</span></div>
</body>
</html>

How to pass something with apostrophe to json

My Facebook app uses Facebook's ui method, to which you pass JSON. My PHP variables, however, might contain apostrophes. What is the best way to preserve them while properly passing them to the FB method?
FB.ui(
{
method: 'feed',
name: '<? echo $tname; ?>',
link: '<? echo $short; ?>',
caption: '<? echo $description; ?>'
},

use json_encode
$myJSON = array(
'method' => 'feed',
'name' => $tag_line.' at '.$owner_name,
'link' => $short,
'caption' => $description,
'description' => $short
);
$myJSON = json_encode($myJSON);
$myJSON now contains a string with your data in JSON format, ready to echo to the page.
{
"method" : "feed",
"name" : "tag_line at owner_name",
"link" : "short",
"caption" : "description",
"description" : "short"
}

You need to "escape" them, meaning add a backslash before them.
There are some convenience functions in PHP to help you accomplish this: json_encode (specifically for properly formatting JSON) or addslashes.
However, you should use json_encode rather than addslashes in this case, since addslashes may output incorrect formatting for JSON depending on the input.
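As a sketch of how json_encode can take care of the apostrophes when the result is printed into a script block: the variable values below are invented, and the FB.ui call is reduced to its options argument for brevity.

```php
<?php
// Hypothetical values standing in for the question's variables.
$tname = "John's <b>page</b>";
$short = 'http://example.com/s/abc';
$description = 'It\'s a "test"';

// The JSON_HEX_* flags emit \uXXXX escapes for ', ", <, > and &, so the encoded
// string contains none of those characters in raw form.
$flags = JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_TAG | JSON_HEX_AMP;
$options = json_encode(
    ['method' => 'feed', 'name' => $tname, 'link' => $short, 'caption' => $description],
    $flags
);

// The encoded object is printed as a JavaScript object literal, with no surrounding quotes.
echo "FB.ui($options);\n";
```

Because the whole object is encoded at once and printed unquoted, there is no hand-written '...' string literal left for an apostrophe to break out of.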

The OP needs to escape apostrophes, something that json_encode does not do by default. The problem that is often encountered is that a JSON string needs to be generated in PHP using json_encode and placed in JavaScript code (as in the OP's example). Since JavaScript does not have heredoc strings, you will need to escape either apostrophes or double quotes in order to prevent a parse error.
Something such as:
jQuery.parseJSON('<?=str_replace("'", "\\\\u0027", json_encode(array('something' => "John's")))?>');
But even this is not optimal, as it will sometimes stay as \u0027 on the JS side. So ideally we need to replace whatever code we used to translate our quote using JS again:
var code = '<?=parse(json_encode($data)) ?>';
code = code.replace(/\\\\u0027/g, "'");
jQuery.parseJSON(code);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
