Extract JSON from HTML using PHP

I'm reading source code of an online shop website, and on each product page I need to find a JSON string which shows product SKUs and their quantity.
Here are 2 samples:
'{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":10,"sku-SV023435_PU_M":11}'
The sample above shows 3 SKUs.
'{"sku-11430_B_S":"20","sku-11430_B_M":"17","sku-11430_B_L":"30","sku-11430_B_XS":"13","sku-11430_BL_S":"7","sku-11430_BL_M":"17","sku-11430_BL_L":"4","sku-11430_BL_XS":"16","sku-11430_O_S":"8","sku-11430_O_M":"6","sku-11430_O_L":"22","sku-11430_O_XS":"20","sku-11430_LBL_S":"27","sku-11430_LBL_M":"25","sku-11430_LBL_L":"22","sku-11430_LBL_XS":"10","sku-11430_Y_S":"24","sku-11430_Y_M":36,"sku-11430_Y_L":"20","sku-11430_Y_XS":"6","sku-11430_RR_S":"4","sku-11430_RR_M":"35","sku-11430_RR_L":"47","sku-11430_RR_XS":"6"}',
The sample above shows many more SKUs.
The number of SKUs in the JSON string can range from one to infinity.
Now, I need a regex pattern to extract this JSON string from each page. At that point, I can easily use json_decode().
Update:
I found another problem; sorry that my question was incomplete. There is another, similar JSON string whose keys also start with sku-. If you look at the source code of the link below, you will see that the only difference is that its values are alphanumeric, while the values in the one we need are numeric. Also note that our final goal is to extract the SKUs with their quantities, so maybe you have a more straightforward solution.
Source
@chris85
Second update:
Here is another strange issue, which is a bit off topic.
When I fetch the URL content using the code below, there is no JSON string in the source!
$html = file_get_contents("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");
But when I open the URL in my browser, the JSON is there! I'm really confused about this :(

Trying to extract specific data from JSON directly with a regexp is almost always a bad idea because of the way JSON is encoded. The best approach is to match the whole JSON string with a regexp and then decode it with the PHP function json_decode().
The missing data is caused by a missing required cookie. See my comments in the code below.
<?php
function getHtmlFromDresslinkUrl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // You must send the currency cookie to the website for it to return the JSON you want to scrape
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Cookie: currencies_code=USD;',
    ));
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
$html = getHtmlFromDresslinkUrl("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");
// Get the specific arguments for this JS function call only
preg_match("/DL\.items\_list\.initItemAttr\((.+)\)\;/", $html, $matches);
if (count($matches) > 0) {
    $arguments = $matches[1];
    // Split by argument separator.
    // I know, this isn't great but it seems to work.
    $args_array = explode(", ", $arguments);
    // You need the 5th argument (index 4)
    $fifth_arg = $args_array[4];
    // Strip quotes
    $fifth_arg = trim($fifth_arg, "'");
    // json_decode
    $qty_data = json_decode($fifth_arg, true);
    // Then you can work with the PHP array
    foreach ($qty_data as $name => $qtty) {
        echo "Found " . $qtty . " of " . $name . "<br />";
    }
}
?>
Special thanks to @chris85 for making me read the question again. Sorry but I couldn't undo my downvote.

You will want to use preg_match_all() to perform the regex matching operation.
The following should do it for you. It will match each substring beginning with "sku-" and running through the colon and any unquoted digits that follow it (quantities stored as quoted strings, such as "20", are not captured).
preg_match_all("/sku\-.+?:[0-9]*/", $input, $matches)
Alternatively, if you want to extract the entire string, you can use:
preg_match_all("/{.sku\-.*}/", $input, $matches)
This will grab everything between the opening brace and the last closing brace on the same line.
Please note that $input denotes the input string and $matches receives the results.

A simple /'(\{"[^\}]+\})'/ will match all these JSON strings. Demo: https://regex101.com/r/wD5bO4/2
The first capture group will contain the JSON string for json_decode:
preg_match_all ("/'(\{\"[^\}]+\})'/", $html, $matches);
$html is the HTML to be parsed; with the default flags the JSON strings will be in $matches[1][0], $matches[1][1], $matches[1][2], etc.
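The question's update mentions a second, similar JSON string whose values are alphanumeric rather than numeric. A minimal sketch of one way to tell them apart, assuming the JSON objects appear in the page as single-quoted literals like the samples: decode every candidate and keep the first one whose values are all numeric. The helper name extractSkuQuantities and the sample page fragment are hypothetical, made up for illustration.

```php
<?php
// Hypothetical helper: returns the first single-quoted JSON object whose
// keys all start with "sku-" and whose values are all numeric, or null.
function extractSkuQuantities(string $html): ?array
{
    // Capture every '{"...}' literal; group 1 holds the JSON without the quotes.
    if (!preg_match_all("/'(\{\"[^}]+\})'/", $html, $matches)) {
        return null;
    }
    foreach ($matches[1] as $json) {
        $data = json_decode($json, true);
        if (!is_array($data) || $data === []) {
            continue; // not valid JSON, or an empty object
        }
        $allSkuQty = true;
        foreach ($data as $key => $value) {
            if (strpos((string) $key, 'sku-') !== 0 || !is_numeric($value)) {
                $allSkuQty = false;
                break;
            }
        }
        if ($allSkuQty) {
            // Quantities may arrive as 7 or "7"; normalise them to integers.
            return array_map('intval', $data);
        }
    }
    return null;
}

// Made-up page fragment for illustration only.
$html = <<<'HTML'
init('{"sku-1_B_M":"img_a1.jpg","sku-1_BL_M":"img_b2.jpg"}', '{"sku-1_B_M":7,"sku-1_BL_M":"10"}');
HTML;

$quantities = extractSkuQuantities($html);
foreach ($quantities ?? [] as $sku => $qty) {
    echo "Found $qty of $sku\n";
}
```

Filtering after decoding keeps the regex simple and leaves the numeric check to PHP rather than to the pattern.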

Related

How to delete tracking code from links in PHP

Hi, I have a form in WordPress where users can submit a link to a product, but very often the links come with unnecessary baggage, like tracking codes. I would like to create a filter in WordPress that cleans the links so that they consist of just a working link. If possible, I would also like to confirm that the cleaned link still works, or use a method that guarantees it will.
The main things I want to get rid of are utm_source and its contents, utm_medium and its contents, etc.: everything but the clean working link.
So for example, a link like this:
https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055&pdp=true&source=detail&utm_source=affiliate&utm_medium=affiliate&utm_campaign=pjdatafeed&publisherId=20648&clickId=2669312134#fo_c=745&fo_k=c0ebaf8359ca7853df8343e535533280&fo_s=pepperjam
Will end up like this:
https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055
I'd really appreciate it if someone could point me in the right direction.
Thanks!

You can do what you want with explode, parse_str and http_build_query. This code uses an array of unwanted parameters to decide what to delete from the query string:
$unwanted_params = array('utm_source', 'utm_medium', 'utm_campaign', 'clickId', 'publisherId', 'source', 'pdp', 'details', 'fo_k', 'fo_s');
$url = 'https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055&pdp=true&source=detail&utm_source=affiliate&utm_medium=affiliate&utm_campaign=pjdatafeed&publisherId=20648&clickId=2669312134#fo_c=745&fo_k=c0ebaf8359ca7853df8343e535533280&fo_s=pepperjam';
list($path, $query_string) = explode('?', $url, 2);
// parse the query string
parse_str($query_string, $params);
// delete unwanted parameters
foreach ($unwanted_params as $p) unset($params[$p]);
// rebuild the query
$query_string = http_build_query($params);
// reassemble the URL
$url = $path . '?' . $query_string;
echo $url;
Output:
https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055
Demo on 3v4l.org

You can do this in PHP itself. The parse_url() function (https://secure.php.net/manual/en/function.parse-url.php) splits a URL into its components, including the query string, which parse_str() can then turn into an array of parameters. After parsing, you can filter the parameters and remove the unwanted ones. Finally, use http_build_query() (https://secure.php.net/manual/en/function.http-build-query.php) to build the query string for the cleaned URL :)
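A minimal sketch of that approach, using the URL from the question. The helper name stripTrackingParams is hypothetical, and the sketch assumes the fragment can be dropped entirely and that the URL carries no user name or password.

```php
<?php
// Hypothetical helper: drop the listed query parameters and the fragment.
function stripTrackingParams(string $url, array $unwanted): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything we cannot parse untouched
    }

    // parse_url() only splits the URL; parse_str() turns the query into an array.
    $params = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
    }
    $params = array_diff_key($params, array_flip($unwanted));

    $clean = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $clean .= ':' . $parts['port'];
    }
    $clean .= $parts['path'] ?? '';
    if ($params !== []) {
        $clean .= '?' . http_build_query($params);
    }
    return $clean; // the fragment is intentionally not re-appended
}

$url = 'https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055&pdp=true&source=detail&utm_source=affiliate&utm_medium=affiliate&utm_campaign=pjdatafeed&publisherId=20648&clickId=2669312134#fo_c=745&fo_k=c0ebaf8359ca7853df8343e535533280&fo_s=pepperjam';
$unwanted = ['utm_source', 'utm_medium', 'utm_campaign', 'clickId', 'publisherId', 'source', 'pdp'];

$cleanUrl = stripTrackingParams($url, $unwanted);
echo $cleanUrl, "\n";
```

Because parse_url() separates the fragment from the query string, the fo_* values after the # never reach parse_str() and do not need to be listed as unwanted parameters.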

Special Characters within URL Variable

I am currently fiddling around with the Deezer API and have run into a slight issue. I am trying to gather content for this artist and track:
XYLØ - Nothing Left To Say
https://api.deezer.com/search/track?q=XYLØ - Nothing Left To Say
The page above displays the content in JSON format. However, when I use the following code:
$id = 'XYLØ - Nothing Left To Say';
$h = str_replace(' ', '+', $id);
$json_string = 'https://api.deezer.com/search/track?q='.$h;
$jsondata = file_get_contents($json_string);
$obj = json_decode($jsondata,true);
I get an empty result for my image request:
$obj['data'][0]['album']['cover_medium']
Any ideas on how I can get this to work properly?

Use PHP's built-in function for encoding query arguments:
//changed $h to $id (see below)
$json_string = 'https://api.deezer.com/search/track?q='.urlencode($id);
http://php.net/manual/en/function.urlencode.php
This function is convenient when encoding a string to be used in a query part of a URL, as a convenient way to pass variables to the next page.
You can also remove this line:
$h = str_replace(' ', '+', $id);
because urlencode() already converts spaces to plus signs.
As a bonus, you can use http_build_query() to build the whole query string from an array, which may be useful to someone reading this:
http://php.net/manual/en/function.http-build-query.php
http_build_query: Generates a URL-encoded query string from the associative (or indexed) array provided.
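A short sketch of the http_build_query() variant for the search URL from the question. Only the q parameter from the question is used; the variable names are illustrative.

```php
<?php
// "\u{D8}" is the letter Ø written as a Unicode escape (PHP 7+), so the bytes
// are UTF-8 regardless of how this file is saved.
$title = "XYL\u{D8} - Nothing Left To Say";

// http_build_query() URL-encodes each value, so no manual str_replace() is needed.
$query = http_build_query(['q' => $title]);
$requestUrl = 'https://api.deezer.com/search/track?' . $query;

echo $requestUrl, "\n";
// https://api.deezer.com/search/track?q=XYL%C3%98+-+Nothing+Left+To+Say
```

The resulting URL can be passed to file_get_contents() as in the question; the encoded form is the same one urlencode() produces for the value.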

Grab, filter and show this data in PHP

Here is an example of what I mean.
I want to grab the result from this URL: web1.com/do.php?id=45944
Example output:
"pk":"bn564vc3b5yvct5byvc45bv","1b":129,"isvalid":true,"referrer":true,"mobile":true
Then I want to show the result on another site, web2.com/show.php.
But I only want to see the values of "pk", "1b" and "isvalid". I don't need the "referrer" and "mobile" data.
So, when I access web2.com/show.php, it should show just this:
bn564vc3b5yvct5byvc45bv 129 true
In short:
Fetch web1.com/do.php?id=45944 with file_get_contents
Grab this result: "pk":"bn564vc3b5yvct5byvc45bv","1b":129,"isvalid":true,"referrer":true,"mobile":true
Filter and show the values of "pk", "1b" and "isvalid" on web2.com/show.php
So, can you help me with a simple PHP script?
Sorry if this is hard to understand; my English is not very good.

Is this source providing JSON-formatted data? That is, with { }'s around it?
If so, use PHP's built-in json_decode function. (Man page at http://php.net/manual/en/function.json-decode.php) It will parse the JSON data for you and, when its second argument is true, return an associative array. For example:
$your_JSON_data = '{"pk":"bn564vc3b5yvct5byvc45bv","1b":129,"isvalid":true,"referrer":true,"mobile":true}';
$your_array = json_decode($your_JSON_data, true);
echo $your_array['pk'] . ' ';
echo $your_array['1b'] . ' ';
// a boolean true would echo as "1", so print it as text
echo $your_array['isvalid'] ? 'true' : 'false';
If, for some reason, there are no JSON-style curly braces around your data, you can add them, for example:
// concatenate the braces; "{$bad_json_data}" would only interpolate the variable
$proper_JSON = '{' . $bad_json_data . '}';
However, if that's the case, I'd wonder why it wasn't properly formatted in the first place. In that case it may be safer to use the PHP explode function (http://php.net/manual/en/function.explode.php).
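The filtering step itself can be sketched as follows, assuming the response is valid JSON once wrapped in braces. The sample string stands in for the body that file_get_contents() would return from web1.com; the variable names are illustrative.

```php
<?php
// Sample response body taken from the question, wrapped in braces.
$json = '{"pk":"bn564vc3b5yvct5byvc45bv","1b":129,"isvalid":true,"referrer":true,"mobile":true}';

$data = json_decode($json, true);
if (!is_array($data)) {
    exit('Could not decode response: ' . json_last_error_msg());
}

// Keep only the wanted keys; the order of the decoded data is preserved.
$wanted = ['pk', '1b', 'isvalid'];
$filtered = array_intersect_key($data, array_flip($wanted));

// Booleans echo as "1" or "", so convert them to readable text first.
$display = array_map(function ($value) {
    return is_bool($value) ? ($value ? 'true' : 'false') : (string) $value;
}, $filtered);

$line = implode(' ', $display);
echo $line, "\n"; // bn564vc3b5yvct5byvc45bv 129 true
```

Using array_intersect_key() means the list of wanted keys can change without touching the output code.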

PHP get request returning nothing

Using window.location.hash (used to pass in the ID for the page) returns something like the following:
http://website.com/NewPage.php#?name=1418019307305
[The string of numbers is actually epoch system time]
Also, for people asking why I used window.location.hash instead of window.location.href: window.location.href started looping infinitely for some reason, and .hash does not. I don't think this should be a big deal, but let me know if it is and whether I need to change it.
When I use PHP to try to retrieve this variable, it does not pick anything up, and no text ends up in the file it is supposed to write to.
<?php
$myfile = fopen("File1.txt","w");
echo $_GET['name'];
fwrite($myfile, $_GET['name']);
fclose($myfile);
?>

Try printing the $_SERVER variable; it will give you an array, and you can look for the value under the relevant key. That can help you find the variable in the request string.

If you want to get the value after the hash mark or anchor, that isn't possible with "standard" HTTP, as this value is never sent to the server. However, if you have the full URL as a string, you could parse it into its parts, including the fragment, using parse_url().
This should do the trick:
<?php
// everything after "#" ends up in the 'fragment' component, not in 'query'
$name_query = parse_url("http://website.com/NewPage.php#?name=1418019307305");
$get_name = substr($name_query['fragment'], strpos($name_query['fragment'], "=") + 1);
echo $get_name;
?>
Working example: http://codepad.org/8sHYUuCS
Then you can use $get_name to store "name" value in a text file.

The hash tag is a fragment that never gets processed by the server, but rather the user-agent, i.e. the browser, so JavaScript may certainly access it. (See https://www.rfc-editor.org/rfc/rfc3986#section-3.5). PHP does allow you to manipulate a url that contains a hash tag with parse_url(). Here's another way to get the info:
<?php
$parts = parse_url("http://website.com/NewPage.php#?name=1418019307305");
list(,$value) = explode("=",$parts['fragment']);
echo $value; // 1418019307305
The placement of the hash tag in this case wipes out the query string so $_SERVER['QUERY_STRING'] will display an empty string. If one were to rewrite the url following best practice, the query string would precede the hash tag and any info following that mark. In which case the script for parsing such a url could be a variation of the preceding, as follows:
<?php
$bestPracticeURL = "http://website.com/NewPage.php?name=1418019307305#more_data";
$parts = parse_url( $bestPracticeURL );
list(,$value) = explode("=", $parts['query']);
$hashData = $parts['fragment'];
echo "Value: $value, plus extra: $hashData";
// Value: 1418019307305, plus extra: more_data
Note how in this case parse_url was able to capture the query string as well as the hash tag data. Of course, if the query string had more than one key and value, then you might need to explode on the '&' into an array and then explode each array element to extract the value.
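For query strings with several keys, parse_str() avoids the manual explode() calls. A minimal sketch, assuming the full URL is available as a string; the helper name paramsFromUrl and the extra lang=en parameter are hypothetical, added only to show more than one key.

```php
<?php
// Hypothetical helper: parse a URL and return its query parameters. If the
// URL has no query string but its fragment looks like "?a=b", parse that instead.
function paramsFromUrl(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false) {
        return [];
    }
    $raw = $parts['query'] ?? ltrim($parts['fragment'] ?? '', '?');
    $params = [];
    parse_str($raw, $params); // splits on "&" and "=" and URL-decodes the values
    return $params;
}

$best = paramsFromUrl('http://website.com/NewPage.php?name=1418019307305&lang=en#more_data');
$hash = paramsFromUrl('http://website.com/NewPage.php#?name=1418019307305');

echo $best['name'], ' ', $best['lang'], "\n"; // 1418019307305 en
echo $hash['name'], "\n";                     // 1418019307305
```

This only helps when the URL string reaches the script by some other route, since the browser does not send the fragment with the request.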

PHP website data mining Preg_Match Undefined Offset

I'm working on a PHP project for school. The task is to build a website to grab and analyze data from another website. I have the framework set up, and I am able to grab certain data from the desired site, but I can't seem to get the syntax right for other data that I need to obtain.
For example, the site that I am currently analyzing is a page for a specific item returned from a search of Amazon.com (e.g. search amazon.com for "iPad" and pick the first result). I am able to grab the title of the product's page, but I need to grab the review count and the price, and therein lies the issue. I'm using preg_match to get the title (works fine), but I'm not able to get the reviews nor the price. I continue to get the Undefined Offset error, which I've discovered means that there is nothing being returned that matches the given criterion. Simply checking to see whether something has been returned will not help me, since I need to obtain these data for my analysis. The <span>s that I'm trying to mine are unique on the page, so there is only one instance of each.
The Page Source for my product page contains the following snippets of HTML that I need to grab. (The website can, and needs to be able to handle, anything, but for this example, I searched "iPad").
<span id="priceblock_ourprice" class="a-size-medium a-color-price">$397.74</span>
I need the 397.74.
<span id="acrCustomerReviewText" class="a-size-base">1,752 customer reviews</span>
I need the 1,752.
I've tried all combinations of escape characters, wildcards, etc., but I can't seem to get beyond the Undefined Offset error. An example of my code is as follows where $link is the URL, and $f is an empty array in which I want to store the result (Note: There is NOT a space after the '<' in "< span..." It just erased everything up to the "...(.*)..." when I typed it as "< span..." without the space):
preg_match("#\< span id\=\"priceblock\_ourprice\" class\=\"a\-size\-medium a\-color\-price\"\>(.*)\<\/span\>#", file_get_contents($link), $f);
$price=$f[1]; //Offset error occurs on this line
echo $price;
Please help. I've been beating my head against this for the past two days now. I'm hoping I'm just doing something stupid. This is my first experience with preg_match and data mining. Thank you very much in advance for your time and assistance.

Code
As stated by @cabellicar123, you shouldn't use regex with HTML.
I believe what you are looking for is strpos() and substr(). It should look something like this:
function get_content($string, $begintag, $endtag) {
    if (strpos($string, $begintag) !== false) {
        $location = strpos($string, $begintag) + strlen($begintag);
        $leftover = substr($string, $location);
        $contents = substr($leftover, 0, strpos($leftover, $endtag));
        return $contents;
    }
}
// Usage (Change the variables):
$str = file_get_contents('http://www.amazon.com/OLB3-Official-League-Recreational-Ball/dp/B004KOBRMC/');
$beg = '<b class="priceLarge">$';
$end = '</b>';
// print the extracted text (nothing is printed if $beg is not found)
echo get_content($str, $beg, $end);
I've provided a working example which would return the price of the object on the page, in this case, the price of a rawlings baseball.
Explanation
I'll go through the code, line by line, and explain every piece.
function get_content($string, $begintag, $endtag)
$string is the string being searched through (in this case an Amazon page), $begintag is the opening tag of the element being searched for, and $endtag is the closing tag of that element. NOTE: This will only use the first instance of the opening tag; any later instances will be ignored.
if (strpos($string, $begintag) !== False)
Checks if the beginning tag actually exists. Note the !== False; that's because strpos can return 0, which evaluates to False.
$location = strpos($string, $begintag) + strlen($begintag);
strpos() returns the position of the first occurrence of $begintag in $string, so the length of $begintag must be added to that position to get the location of the end of $begintag.
$leftover = substr($string, $location);
Now that we have the $location of the opening tag, we need to narrow the $string down by setting $leftover to the part of the $string after $location.
$contents = substr($leftover, 0, strpos($leftover, $endtag));
This gets the position of the $endtag in $leftover, and stores everything before that $endtag in $contents.
As for the last few lines of code, they are specific to this example and just need to be changed to fit the circumstances.
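Since the elements in the question have unique id attributes, a DOM parser is another option alongside the string functions. A minimal sketch using PHP's DOM extension (DOMDocument and DOMXPath), assuming that extension is enabled; the helper name textById is hypothetical, and the two spans from the question stand in for the downloaded page.

```php
<?php
// Snippets copied from the question, standing in for the downloaded page.
$html = '<html><body>'
    . '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$397.74</span>'
    . '<span id="acrCustomerReviewText" class="a-size-base">1,752 customer reviews</span>'
    . '</body></html>';

// Hypothetical helper: text of the first element with the given id, or null.
function textById(DOMXPath $xpath, string $id): ?string
{
    $nodes = $xpath->query(sprintf('//*[@id="%s"]', $id));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$priceText  = textById($xpath, 'priceblock_ourprice');   // full text of the price span
$reviewText = textById($xpath, 'acrCustomerReviewText'); // full text of the review span

// Strip the currency sign, and keep only the leading number of the review text.
$price   = $priceText === null ? null : ltrim($priceText, '$');
$reviews = $reviewText === null ? null : strtok($reviewText, ' ');

echo "Price: $price, reviews: $reviews\n";
```

Returning null for a missing element makes the "nothing matched" case explicit, which is the situation behind the Undefined Offset error in the question.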
