How to parse an HTML page using PHP?


Parsing HTML / JS code to get info using PHP.
www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626
Take a look at this page; it's a clothes shop for kids. This is one of their items, and I want to point out the size section. What we need to do here is get all the sizes for this item and check whether each size is available. Right now the sizes for this item are:
3-4 years
4-5 years
5-6 years
7-8 years
How can you tell whether the sizes are available or not?
Now take a look at this page and check the sizes again:
www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751
This item has the following sizes:
12 months
18 months - Not Available
24 months
As you can see, the 18 months size is not available; this is indicated by the "Not Available" text next to the size.
What we need to do is go to the page of an item, get the sizes, and check the availability of each size. How can I do this in PHP?
EDIT:
Added working code and a new problem to tackle.
Working code, but it needs more work:
<?php
function getProductVariations($url) {
    //Use CURL to get the raw HTML for the page
    $ch = curl_init();
    curl_setopt_array($ch,
        array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => false,
            CURLOPT_URL => $url
        )
    );
    $raw_html = curl_exec($ch);
    //If we get an invalid response back from the server fail
    if ($raw_html===false) {
        throw new Exception(curl_error($ch));
    }
    curl_close($ch);
    //Find the variation JS declarations and extract them
    $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);
    //We are done with the Raw HTML now
    unset($raw_html);
    //Check that we got some results back
    if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {
        //This is where the matches will go
        $matches = array();
        //Go through the results of the bracketed expression and convert them to a PHP assoc array
        foreach($raw_matches[1] as $match) {
            //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
            $proc=json_decode("[$match]");
            //Label the fields as best we can
            $proc2=array(
                "variation_id"=>$proc[0],
                "size_desc"=>$proc[1],
                "colour_desc"=>$proc[2],
                "available"=>(trim(strtolower($proc[3]))=="true"),
                "unknown_col1"=>$proc[4],
                "price"=>$proc[5],
                "unknown_col2"=>$proc[6], /*Always seems to be zero*/
                "currency"=>$proc[7],
                "unknown_col3"=>$proc[8],
                "unknown_col4"=>$proc[9], /*Negative price*/
                "unknown_col5"=>$proc[10], /*Always seems to be zero*/
                "unknown_col6"=>$proc[11] /*Always seems to be zero*/
            );
            //Push the processed variation onto the results array
            $matches[$proc[0]]=$proc2;
            //We are done with our proc2 array now (proc will be unset by the foreach loop)
            unset($proc2);
        }
        //Return the matches we have found
        return $matches;
    } else {
        throw new Exception("Unable to find any product variations");
    }
}
//EXAMPLE USAGE
try {
    $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");
    //Do something more useful here
    print_r($variations);
} catch(Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>
The above code works, but there's a problem when the product needs you to select a colour first before the sizes are displayed.
Like this one:
http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006
Any idea how to go about this?

SOLUTION:
function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    // Close the handle before returning the response
    curl_close($ch);
    return $result;
}
$html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');
preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is',$html,$bingo);
print_r($bingo);
Link: http://debconf11.com/stackoverflow.php
You are on your own now :)
EDIT2:
OK, we are close to a solution...
<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>
The sizes are not loaded via AJAX; instead, the array is stored in a JavaScript variable in the page source. You can parse this with PHP, and you can clearly see that 18 months is "False", which means it is not available.
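A minimal sketch of that parsing step, assuming the page still declares the array in the format shown above. The helper name parseSizeAvailability() and the embedded sample string are made up for illustration; in real use $html would be the page source fetched with cURL.

```php
<?php
// Hypothetical helper: map each size label to true (available) or false.
function parseSizeAvailability(string $html): array
{
    $sizes = [];
    // Capture the argument list of each "new Array(...)" assignment.
    $pattern = '/arrSzeCol_\w+\[\d+\]\s*=\s*new Array\((.*?)\);/';
    if (preg_match_all($pattern, $html, $matches)) {
        foreach ($matches[1] as $args) {
            // The arguments are a number followed by double-quoted strings,
            // which is valid JSON once wrapped in square brackets.
            $fields = json_decode('[' . $args . ']');
            if (!is_array($fields) || count($fields) < 4) {
                continue; // skip rows that do not match the expected layout
            }
            // Field 1 is the size label, field 3 is "True" or "False".
            $sizes[$fields[1]] = (strtolower($fields[3]) === 'true');
        }
    }
    return $sizes;
}

// Sample input copied from the script block above.
$html = <<<'HTML'
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
HTML;

$sizes = parseSizeAvailability($html);
foreach ($sizes as $size => $available) {
    echo $size . ': ' . ($available ? 'available' : 'not available') . PHP_EOL;
}
```

The field positions are taken from the sample rows only, so they may need adjusting if the site changes its markup.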
EDIT:
These sizes are loaded via JavaScript, so you cannot parse them from the fetched HTML: they are not there.
I can extract only this:
<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>
You can inspect the JavaScript to check whether the sizes can be loaded based on the product ID.
First you need: http://simplehtmldom.sourceforge.net/
Forget file_get_contents(); it is about 5 times slower than cURL.
You then parse this piece of code (the HTML element with id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize):
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>
You can then use preg_match(), explode(), str_replace() and similar functions to filter out the values you want. I could write it, but I don't have time right now :)
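As a rough sketch of that filtering step, the options can also be read with PHP's bundled DOM extension instead of Simple HTML DOM or string functions (a swapped-in technique, not the approach named above). It assumes the option elements are actually present in the HTML you fetched, and that unavailable sizes end with " - Not Available" as in the sample markup.

```php
<?php
// Markup copied from the <select> element shown above (attributes shortened).
$html = '<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize">'
      . '<option value="-1">Select Size</option>'
      . '<option value="1164">12 months</option>'
      . '<option value="1165">18 months - Not Available</option>'
      . '<option value="1167">24 months</option></select>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
$doc->loadHTML($html);
libxml_clear_errors();

$suffix = ' - Not Available';
$sizes = [];
foreach ($doc->getElementsByTagName('option') as $option) {
    if ($option->getAttribute('value') === '-1') {
        continue; // skip the "Select Size" placeholder
    }
    $label = trim($option->textContent);
    // A size is treated as unavailable when its label ends with the marker text.
    $unavailable = substr($label, -strlen($suffix)) === $suffix;
    $name = $unavailable ? substr($label, 0, -strlen($suffix)) : $label;
    $sizes[$name] = !$unavailable;
}
print_r($sizes);
```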

The simplest way to fetch the content of a URL is to rely on the fopen wrappers and just use file_get_contents() with the URL. You can use the tidy extension to parse the HTML and extract content. http://php.net/tidy

You can download the file using fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension might be a bit easier to use than Tidy.
I know for a fact that the DOM extension is enabled by default in PHP, but I am a bit unsure whether Tidy is (the manual page only says it's "bundled", so I suspect that it might not be enabled).
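For readers who know the JavaScript DOM, here is a small sketch of what the DOM extension looks like in use. The HTML string is a made-up stand-in for a downloaded page; in practice it would come from file_get_contents($url).

```php
<?php
// Hypothetical page content used only for illustration.
$html = '<html><head><title>Kids T-Shirt</title></head>'
      . '<body><a href="/a">First</a><a href="/b">Second</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Same method names as in the browser DOM.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}

echo $title . PHP_EOL;
print_r($links);
```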

Related

Not able to fetch the complete source code from the PHP script

<?php
$url = "http://www.justdial.com/Delhi-NCR/Pizza-Outlets-%3Cnear%3E-Okhla/";
$ptr = fopen("op.txt","w");
$data = file_get_contents($url);
print_r($data);
$result = htmlentities($data);
$doc = new DOMDocument();
@$doc->loadHTML($result);
$finder = new DOMXPath($doc);
$node = $finder->query("//h3[contains(@class, 'r')]");
?>
Above is the code I have written to fetch the source code of justdial. The only output I get is the first pizza outlet. How can I fetch all the results that are shown on the justdial website?
Thanks in advance.

All items are part of the '<div id="tab_block">' HTML element, whose content is built by JavaScript/AJAX calls. They therefore cannot appear in the HTML you load via file_get_contents(), since you only get the HTML definition without the JavaScript being executed.
However, it means you can access the items/database directly by code if you know the endpoints.
For example (URLs are shown as they were when I tested them):
This URL will return the first few items of the complete list. It will return something like this (in JSON format):
[{docid: "011PXX11.XX11.151106170721.W5H9",…}, {docid: "011PXX11.XX11.140302105210.Y9N8",…},…]
0: {docid: "011PXX11.XX11.151106170721.W5H9",…}
disp_pic: "http://images.jdmagicbox.com/delhi/h9/011pxx11.xx11.151106170721.w5h9/catalogue/6cf575ffbd1090f5a314d2cf40451c88.jpg"
docid: "011PXX11.XX11.151106170721.W5H9"
1: {docid: "011PXX11.XX11.140302105210.Y9N8",…}
disp_pic: "http://images.jdmagicbox.com/delhi/n8/011pxx11.xx11.140302105210.y9n8/catalogue/ecfd2106644df17013e98bb60f40c527.jpg"
docid: "011PXX11.XX11.140302105210.Y9N8"
video: "http://videos.jdmagicbox.com/delhi/n8/011pxx11.xx11.140302105210.y9n8/video/fc2a62242ae03c74c15436dbcc04c33a_m.jpg"
...
The docid can be used to run further queries on a particular item, while the disp_pic URL returns the image.
This URL will also return the image of the first item, but it uses some parameters.
In any case, I have only scratched the surface of the whole issue to demonstrate how to proceed. You would need to understand the site logic to read the complete dataset, but it would be easier to contact the webmaster and ask them to describe the API/endpoints you can use to access the data, and to ask for permission to use it even if the 'API' is not protected.
Once you know the endpoint, structure and data description, you can use a PHP library like mashape\unirest to do queries like this:
Unirest\Request::verifyPeer (false) ;
$response =Unirest\Request::get (
'http://www.justdial.com/functions/sortbyphotosnew.php?contractid=011PXX11.XX11.151106170721.W...,
array ( 'Accept' => 'application/json' ),
null
) ;
If $response->code == 200, then $response->body is your JSON object containing the document array.
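If you fetch such an endpoint without a client library that decodes JSON for you (for example with cURL or file_get_contents()), the raw body can be decoded with json_decode(). This is only a sketch: the $body string below is a hypothetical response, shaped like the listing shown earlier and reduced to the docid field.

```php
<?php
// Hypothetical raw response body; the real endpoint may return more fields.
$body = '[{"docid":"011PXX11.XX11.151106170721.W5H9"},'
      . '{"docid":"011PXX11.XX11.140302105210.Y9N8"}]';

$items = json_decode($body, true); // true => associative arrays
if (!is_array($items)) {
    throw new RuntimeException('Response was not valid JSON: ' . json_last_error_msg());
}

// Collect the docid of every item for follow-up queries.
$docIds = array_column($items, 'docid');
print_r($docIds);
```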

preg_match misses some ids while fetching data with cURL

For learning purposes, I'm trying to fetch data from the Steam Store, where if the image game_header_image_full exists, I've reached a game. Both alternatives sort of work, but there's a catch. One is really slow, and the other seems to miss some data and therefore does not write all the URLs to a text file.
For some reason, Simple HTML DOM managed to catch 9 URLs, whilst the second one (cURL) only caught 8 URLs with preg_match.
Question 1.
Is there markup that $html->find('img.game_header_image_full') would catch but my $reg pattern with preg_match would not? Or is the problem something else?
Question 2.
Am I doing things correctly here? Planning to go for the cURL alternative, but can I make it faster somehow?
Simple HTML DOM Parser (Time to search 100 ids: 1 min, 39s. Returned: 9 URLs.)
<?php
include('simple_html_dom.php');
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
    // Find target image
    $url = "http://store.steampowered.com/app/".$i;
    $html = file_get_html($url);
    $element = $html->find('img.game_header_image_full');
    if($i == $times_to_run) {
        echo "Success!";
    }
    foreach($element as $key => $value){
        // Check if image was found
        if (strpos($value,'img') == false) {
            // Do nothing, repeat loop with $i++;
        } else {
            // Add (don't overwrite) to file steam.txt
            file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
        }
    }
}
?>
vs. the cURL alternative (Time to search 100 ids: 34s. Returned: 8 URLs.)
<?php
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    $url = "http://store.steampowered.com/app/".$i;
    $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";
    if(preg_match($reg, $content)) {
        file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
    }
}
?>

Well, you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages, figure out which one is failing and why, correct the regex, and then hope and pray that nothing like that ever happens again. Spoiler alert: it will.
Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags
Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't rely on regex and are reliable (as long as the HTML is valid). You are already using one in the first example. Yes, it's slow, because it does more than just search for a string within a document. But it's reliable. You can also try other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php
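A hedged sketch of the native route, using DOMDocument with DOMXPath to test for the image class instead of a regex. The helper name hasHeaderImage() and the sample markup are invented for illustration; a real run would pass in the body returned by curl_exec().

```php
<?php
// Hypothetical helper: true if the HTML contains <img> with the target class.
function hasHeaderImage(string $html): bool
{
    if ($html === '') {
        return false; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // store pages are rarely valid HTML
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    if (!$loaded) {
        return false;
    }
    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, even when other classes are present.
    $query = "//img[contains(concat(' ', normalize-space(@class), ' '), "
           . "' game_header_image_full ')]";
    return $xpath->query($query)->length > 0;
}

// Made-up sample markup.
$content = '<div><img class="big game_header_image_full" src="header.jpg"></div>';
var_dump(hasHeaderImage($content));
```

Unlike the regex, this does not depend on attribute order or quoting style, which is one plausible reason the two approaches counted a different number of pages.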

Google + button counts showing "0" using the Sharrre library ( Json , Php )

So I checked via phpinfo(): Safe Mode on my server is off and cURL is enabled, so there is no reason for it not to work.
I also made sure Sharrre.php is in my root directory. I even included the Curlurl to the PHP file and tried both absolute and relative linking. The Google button with the counter shows up as soon as it is uploaded, but not as expected, because the counter shows 0 the entire time.
The culprit seems to be: $json = array('url'=>'','count'=>0);
After a few lines of other code we got this:
if(filter_var($_GET['url'], FILTER_VALIDATE_URL)){
if($type == 'googlePlus'){ //source http://www.helmutgranda.com/2011/11/01/get-a-url-google-count-via-php/
$contents = parse('https://clients6.google.com/rpc?key=AIzaSyCKSbrvQasunBoV16zDH9R33D88CeLr9gQurl=' . $url . '&count=true');
preg_match( '/window\.__SSR = {c: ([\d]+)/', $contents, $matches );
if(isset($matches[0])){
$json['count'] = (int)str_replace('window.__SSR = {c: ', '', $matches[0]);
}
}
So either the Google URL code is not valid anymore or... well, maybe there is something wrong with the suspected culprit, because
when it is changed to a value higher than 0: $json = array('url'=>'','count'=>15);
it shows a count of 15. I want it to be dynamic, though: it should get the counts I already have and update them on each click.
What can be done to solve this?

In my particular case the problem was in the assignment of the URL to the cURL object.
The original script sharrre.php sets the URL by assigning it to an element of the options array, but this does not work and causes the Google counter not to retrieve any count.
Instead, the URL must be assigned with the curl_setopt() function.
This resolved this problem in my case:
sharrre.php:
//...
$ch = curl_init();
//$options[CURLOPT_URL] = $encUrl; // <<<--- not working! comment this line.
curl_setopt_array($ch, $options);
curl_setopt($ch, CURLOPT_URL, $encUrl ); // <<<--- Yeeaa, working! Add this line.
//...
Hope this helps.

Can I retry file_get_contents() until it opens a stream?

I am using PHP to get the contents of an API. The problem is, sometimes that API just sends back a 502 Bad Gateway error and the PHP code can’t parse the JSON and set the variables correctly. Is there some way I can keep trying until it works?

This is not an easy question because PHP is a synchronous language by default.
You could do this:
$a = false;
$i = 0;
while($a == false && $i < 10)
{
    $a = file_get_contents($path);
    $i++;
    usleep(100000); // usleep() takes microseconds: 100000 = 0.1 seconds
}
$result = json_decode($a);
The usleep() pause keeps your server from being brought to its knees every time the API is unavailable. The loop also gives up after 10 attempts, which prevents it from hanging indefinitely during a long outage.

Since you didn't provide any code it's kind of hard to help you. But here is one way to do it.
$data = null;
while(!$data) {
    $json = file_get_contents($url);
    $data = json_decode($json); // Returns null if the JSON is not valid
}
// The while loop won't stop until the JSON is valid and $data contains an object
var_dump($data);
I suggest you add some sort of counter variable in there to stop trying after X attempts.
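One way the attempt limit could look, as a sketch rather than a finished implementation. The function name fetchJsonWithRetry() and its parameters are hypothetical; the fetch step is passed in as a callable so the same loop works with file_get_contents(), cURL, or a test stub.

```php
<?php
// Hypothetical helper: $fetch is any callable returning the raw body or false.
function fetchJsonWithRetry(callable $fetch, int $maxAttempts = 5, int $delayMicroseconds = 200000)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $json = $fetch();
        if ($json !== false) {
            $data = json_decode($json);
            if ($data !== null) {
                return $data; // valid JSON, stop retrying
            }
        }
        if ($attempt < $maxAttempts) {
            usleep($delayMicroseconds); // pause before the next attempt
        }
    }
    return null; // every attempt failed
}

// Usage with a real request (the URL is a placeholder):
// $data = fetchJsonWithRetry(function () {
//     return @file_get_contents('https://example.com/api');
// });
```

Note that a body consisting of the literal JSON value null is indistinguishable from a failure here, which is acceptable for an API that is expected to return an object or array.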

Based on your comment, here is what I would do:
You have a PHP script that makes the API call and, if successful, records the price and when that price was acquired
You put that script in a cronjob/scheduled task that runs every 10 minutes.
Your PHP view pulls the most recent price from the database and uses that for whatever display/calculations it needs. If pertinent, also show the date/time that price was captured
The other answers suggest doing a loop. A combo approach probably works best here: in your script, put in a few loops just in case the interface is down for a short blip. If it's not up after say a minute, use the old value until your next try.

A loop can solve this problem, but so can a recursive function like this one:
function file_get_contents_retry($url, $attemptsRemaining=3) {
    $content = file_get_contents($url);
    $attemptsRemaining--;
    if( empty($content) && $attemptsRemaining > 0 ) {
        return file_get_contents_retry($url, $attemptsRemaining);
    }
    return $content;
}
// Usage:
$retryAttempts = 6; // Default is 3.
echo file_get_contents_retry("http://google.com", $retryAttempts);

SimpleXML feed showing blank arrays - how do I get the content out?

I'm trying to get the image out of an RSS feed using SimpleXML, parsing the data out via an array and back into the foreach loop...
In the source code the array for [description] is shown as blank, though I've managed to pull it out using another loop. However, I can't for the life of me work out how to pull in the next array, and subsequently the image for each post!
Help?
you can view my progress here: http://dev.thebarnagency.co.uk/tfolphp.php
here's the original feed: feed://feeds.feedburner.com/TheFutureOfLuxury?format=xml
$xml_feed_url = 'http://feeds.feedburner.com/TheFutureOfLuxury?format=xml';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xml_feed_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);
function produce_XML_object_tree($raw_XML) {
    libxml_use_internal_errors(true);
    try {
        $xmlTree = new SimpleXMLElement($raw_XML);
    } catch (Exception $e) {
        // Something went wrong.
        $error_message = 'SimpleXMLElement threw an exception.';
        foreach(libxml_get_errors() as $error_line) {
            $error_message .= "\t" . $error_line->message;
        }
        trigger_error($error_message);
        return false;
    }
    return $xmlTree;
}
$feed = produce_XML_object_tree($xml);
print_r($feed);
foreach ($feed->channel->item as $item) {
    // $desc = $item->description;
    echo 'link<br>';
    foreach ($item->description as $desc) {
        echo $desc;
    }
}
thanks

You could use
wp_remote_get( $url, $args );
which I found here: http://dynamicweblab.com/2012/09/10-useful-wordpress-functions-to-reduce-your-development-time
You can find more details about this function at http://codex.wordpress.org/Function_API/wp_remote_get
Hope this helps.

I'm not entirely clear what your problem is here - the code you provided appears to work fine.
You mention "the image for each post", but I can't see any images specifically labelled in the XML. What I can see is that inside the HTML in the content node of the XML, there is often an <img> tag. As far as the XML document is concerned, this entire blob of HTML is just one string delimited with the special tokens <![CDATA[ and ]]>. If you get this string into a PHP variable (using (string)$item->content), you can then find a way of extracting the <img> tag from inside it, but note that the HTML is unlikely to be valid XML.
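One possible way to do that extraction is to hand the HTML string to DOMDocument, which tolerates markup that is not valid XML. This is a sketch under assumptions: the feed snippet is invented, and it stores the HTML in description for simplicity; the same steps apply to whichever node holds the HTML in the real feed.

```php
<?php
// Hypothetical feed item; the HTML inside the CDATA section is made up.
$xml = <<<'XML'
<rss><channel><item>
<description><![CDATA[<p>Post text <img src="http://example.com/pic.jpg" alt="pic"></p>]]></description>
</item></channel></rss>
XML;

$feed = simplexml_load_string($xml);
// Casting to string yields the CDATA content as plain text.
$htmlBlob = (string) $feed->channel->item->description;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the embedded HTML may be sloppy
$doc->loadHTML($htmlBlob);
libxml_clear_errors();

// Take the first <img>, if there is one, and read its src attribute.
$img = $doc->getElementsByTagName('img')->item(0);
$src = $img !== null ? $img->getAttribute('src') : null;
echo $src . PHP_EOL;
```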
The other thing to mention is that SimpleXML is not, as you repeatedly refer to it, an array - it is an object, and a particularly magic one at that. Everything you do to the SimpleXML object - foreach ( $nodeList as $node ), isset($node), count($nodeList), $node->childNode, $node['attribute'], etc - is actually a function call, often returning another SimpleXML object. It's designed for convenience, so in many cases writing what seems natural will be more helpful than inspecting the object.
For instance, since each item has only one description you don't need the inner foreach loop - the following will all have the same effect:
foreach ($item->description as $desc) { echo $desc; } (loop over all child elements with tag name description)
echo $item->description[0]; (access the first description child node specifically)
echo $item->description; (access the first/only description child node implicitly; this is why you can write $feed->channel->item and it would still work if there was a second channel element, it would just ignore it)
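The three forms listed above can be checked side by side with a tiny, made-up feed (the XML string here is an assumption for illustration, not the real feed):

```php
<?php
// Minimal hypothetical feed with a single item and description.
$xml = '<rss><channel><item><description>Hello</description></item></channel></rss>';
$feed = simplexml_load_string($xml);
$item = $feed->channel->item;

// 1. Loop over all <description> children.
$a = '';
foreach ($item->description as $desc) {
    $a .= $desc; // string concatenation casts the element to its text
}

// 2. Access the first <description> child explicitly.
$b = (string) $item->description[0];

// 3. Access the first/only <description> child implicitly.
$c = (string) $item->description;

var_dump($a === $b && $b === $c);
```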

I had an issue where simplexml_load_file was returning some array sections blank as well, even though they contained data when you view the source url directly.
Turns out the data was there, but it was CDATA so it was not properly being displayed.
Is this perhaps the same issue op was having?
Anyways my solution was this:
So initially I used this:
$feed = simplexml_load_file($rss_url);
And I got empty description back like this:
[description] => SimpleXMLElement Object
(
)
But then I found this solution in comments of PHP.net site, saying I needed to use LIBXML_NOCDATA:
https://www.php.net/manual/en/function.simplexml-load-file.php
$feed = simplexml_load_file($rss_url, "SimpleXMLElement", LIBXML_NOCDATA);
After making this change, I got description like this:
[description] => My description text!
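A small sketch of the difference, using simplexml_load_string() on an invented one-item feed instead of a real URL. Casting the node to a string returns the text either way; the flag mainly changes what dumps such as print_r() show, because CDATA sections are merged into ordinary text nodes.

```php
<?php
// Hypothetical feed with a CDATA description.
$xml = '<rss><channel><item>'
     . '<description><![CDATA[My description text!]]></description>'
     . '</item></channel></rss>';

$plain  = simplexml_load_string($xml);
$merged = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

// The string cast works with or without the flag.
$textPlain  = (string) $plain->channel->item->description;
$textMerged = (string) $merged->channel->item->description;

// With LIBXML_NOCDATA the text is also visible in the dump.
$dumpMerged = print_r($merged->channel->item, true);

echo $textPlain . PHP_EOL;
echo $dumpMerged;
```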
