loading external div with php - php

I was trying to load a page from H&M (for studying purposes), when I noticed that the content of one div isn't loaded, but if I save the page from the browser, the div is saved correctly.
Can anyone explain to me why this happens?
The div (and, most importantly, its contents) I'm looking for is:
body>div main>div content> div relatedInformationContainer
(inside there's lot of content: div relatedInformation>etc...)
This is the code I used:
<?php
$url = "http://www.hm.com/gb/product/05427";
libxml_use_internal_errors(true);
$html = file_get_contents($url);
$dom = new DomDocument();
$dom->loadHTML($html);
$xp = new domxpath($dom);
$contentDivs = $xp->query('//div[@id="content"]')->item(0);
$numContentDivs = $xp->evaluate('count(div)', $contentDivs);
// echo $numContentDivs; // output:3 (correct)
$relatedDiv = $xp->query('//div[@id="content"]/div[2]')->item(0)->getAttribute("id");
echo $relatedDiv; // output:relatedInformationContainer (correct)
$relatedDivContent = $xp->query('//div[@id="content"]/div[2]')->item(0);
$numRelatedDivContent = $xp->evaluate('count(div)', $relatedDivContent);
echo $numRelatedDivContent; // output:0 (incorrect!!! it should output 1)
?>
I also tried a simpler method, with the same result:
<?php
$url = "http://www.hm.com/gb/product/05427";
$doc = new DOMDocument();
$load = @$doc->loadHTMLFile($url);
echo $doc->saveHTML();
?>
I would appreciate it if anyone could explain to me why this happens, and whether there's a solution.
Thanks.

The div is loaded by JavaScript. You need to find out which request the JavaScript makes and replicate it in PHP.
Using Firefox with Firebug, I see that the page issues a call to
http://www.hm.com/gb/product/05427/05427-A/related
which returns the div with all its contents (I guess it replaces the placeholder div). You will have to fetch that URL yourself.
Also, some servers check who is asking for what and on whose behalf. So the request above might fail unless its Referer header (HTTP_REFERER) is set to the originating page and it carries the right User-Agent, session cookies, etc. (This is true in general; it does not appear to be the case here, though I may be wrong.)
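As a rough sketch of what replicating that call could look like in PHP: the helper names and the User-Agent string below are made up for illustration, and whether the server actually needs these headers (or cookies as well) is an assumption that has not been verified.

```php
<?php
// Hypothetical helper: builds stream-context options that send Referer and User-Agent headers.
function buildRelatedRequestOptions(string $referer, string $userAgent): array
{
    return [
        'http' => [
            'method'  => 'GET',
            'header'  => "Referer: $referer\r\n" . "User-Agent: $userAgent\r\n",
            'timeout' => 10,
        ],
    ];
}

// Hypothetical helper: fetches the fragment the page's JavaScript would normally request.
// Returns the response body as a string, or false on failure.
function fetchRelatedFragment(string $relatedUrl, string $referer)
{
    // The User-Agent value is an illustrative placeholder.
    $options = buildRelatedRequestOptions($referer, 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)');
    return file_get_contents($relatedUrl, false, stream_context_create($options));
}

// Usage (performs a real HTTP request, so it is left commented out):
// $fragment = fetchRelatedFragment(
//     'http://www.hm.com/gb/product/05427/05427-A/related',
//     'http://www.hm.com/gb/product/05427'
// );
```

The returned fragment can then be passed to DOMDocument::loadHTML() and queried with the same XPath expressions as in the question.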

Related

Sending URL parameters through file_get_contents returns nothing

I am trying to scrape a website to get the latitude and longitude of counties in the US (there are 3306, which is why I am doing it in code rather than manually).
I am using the code below:
function GetLatitude($countyName,$stateShortName){
//Create DOM from url
$page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?$countyName,$stateShortName");
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById("display_lat");
var_dump($doc);
}
GetLatitude("Guilford County","NC");
This returns nothing. If I change the URL to omit the parameters, as in "https://www.mapdevelopers.com/geocode_tool.php", then $doc has some information in it, but that is not useful because the value I need (latitude) depends on the parameters passed in the URL.
How do I solve this issue?
EDIT:
Based on the suggestion to encode the parameters, I changed my code to this. The document now contains information, but it appears to ignore the parameters:
<?php
function GetLatitude($countyName,$stateShortName){
$countyName = urlencode($countyName);
$stateShortName = urlencode($stateShortName);
//Create DOM from url
$page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?address=$countyName,$stateShortName");
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById("display_lat");
var_dump($doc);
}
GetLatitude("Clarke County","AL");
?>
Your issue is that the latitude information isn't present when the page loads; JavaScript puts it there afterwards.
You're going to have a hard time scraping a page that depends on JS from PHP without something in the middle. Maybe retry this project with something like puppet or phantomjs so you can run your script against a real browser.
Searching the page, there is an AJAX request to https://www.mapdevelopers.com/data.php
Sending a POST or GET request to that URL will give you the response you are looking for.
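Whichever endpoint is requested, the parameters have to be URL-encoded, since the raw space in "Guilford County" makes the original URL invalid. Here is a small sketch using http_build_query(); the helper name is hypothetical, and the `address` parameter name is taken from the edited question. The parameters that data.php actually expects are not known here and would have to be read from the browser's network tab.

```php
<?php
// Hypothetical helper: appends a URL-encoded "address" parameter to a base URL.
function buildGeocodeUrl(string $baseUrl, string $countyName, string $stateShortName): string
{
    // http_build_query() encodes the space as "+" and the comma as "%2C".
    return $baseUrl . '?' . http_build_query(['address' => "$countyName,$stateShortName"]);
}

echo buildGeocodeUrl('https://www.mapdevelopers.com/geocode_tool.php', 'Guilford County', 'NC'), "\n";
// prints https://www.mapdevelopers.com/geocode_tool.php?address=Guilford+County%2CNC
```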

Taking a div from other website with PHP DOM

Yesterday, I tried to pull a div from another website into my own site.
I want PHP to read the text inside that div and compare it with a string I supply.
Here is my code:
//Blah, blah, blah
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('http://www.habbo.'.$hotel.'/home/'.$habbo));
$xpath = new DomXpath($dom);
$motto = $xpath->query('//*[@class="profile-motto"]')->item(0)->textContent;
echo $motto;
if($code !== $motto){
$num_habbo = 2;
}
//Blah, blah, blah
An example of a page:
http://www.habbo.es/home/iEnriqueSP
The string I want to take is in "Mi perfil", between "Añadir amigo" and the avatar of the user.
When I try to show the string with echo $motto, PHP shows nothing.
I don't know if cURL is necessary with PHP DOM, but cURL appears as enabled in my hosting's PHP info.
Thanks for your attention
There's a nice library for tasks like this.
PHP Simple HTML DOM Parser
It lets you parse HTML files and fetch both inner and outer text. The documentation should also be quite simple.
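If you would rather stay with DOMDocument, a class lookup can be sketched as below. The sample markup is invented, not the real Habbo page, and it assumes the motto is present in the HTML the server returns; if the page inserts it with JavaScript, no parser will find it in the downloaded source.

```php
<?php
// Inline sample markup; the structure is an assumption, not the real profile page.
$html = '<div id="profile"><div class="box profile-motto"> Hello world </div></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Match any element whose class list contains the token "profile-motto",
// even when other classes are present on the same element.
$nodes = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " profile-motto ")]');

// item(0) is null when nothing matched, so check the length before reading textContent.
$motto = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
var_dump($motto);
```

Checking `$nodes->length` first avoids the error raised by calling `->textContent` on null when the element is missing.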

Having trouble targeting with DOM Xpath

I know there are many questions here about DOM traversal with XPATH. I have done a good amount of research before bringing my question here, but I am still having an issue. I'm trying to pull the number of downloads for a given app on the android market. So for instance if the app were the stack exchange app, I would want to pull the numbers: 50,000 - 100,000 from this page:
https://play.google.com/store/apps/details?id=com.stackexchange.marvin
I am attempting to target the div with an itemprop of "numDownloads" to little avail. I have no trouble targeting other items on page I have tried (various classes, etc) but this specific item never returns results. I have checked to make sure the value is, in fact, in the source and not being inserted by JS. Here is my code:
// Load up the document so we can parse the dom
$dom = new DomDocument();
$dom->loadHTML($this->html);
// XPath so we can do some specific searches
$finder = new DomXPath($dom);
// Find all the number of downloads item on page
$installs = $finder->query("//*[@itemprop='numDownloads']");
echo "<pre>"; var_dump($installs); echo "</pre>";
foreach($installs as $install) {
echo "<pre>"; var_dump($install->nodeValue); echo "</pre>";
}
Any suggestions would be greatly appreciated!
Actually you are already on the right track.
$url = 'https://play.google.com/store/apps/details?id=com.stackexchange.marvin';
$contents = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($contents);
$finder = new DomXPath($dom);
$installs = $finder->query("//div[@itemprop='numDownloads']");
// directly point it to a div since it is a div
foreach($installs as $install) {
echo $install->nodeValue; // 50,000 - 100,000
}

Get ID-specific div from a remote page, then edit its content

Say we have a div with id MyDiv in the remote page site.com/page1.html
We want to get only this div from the page, in a way that allows us to manipulate or edit its content later.
What is the best practice for this?
I've tried two ways: either through file_get_contents and then loading the content into DOMDocument, or through Simple HTML DOM Parser.
For the first method, I read about it but don't know how to get only MyDiv with file_get_contents.
For the second method, my current code is:
<?php
include_once('simple_html_dom.php');
$url = "site.com/page1.html";
$html = str_get_html($url);
$elem = $html->find('div[id=MyDiv]', 0);
echo $elem;
?>
but it's also not working and I don't know why.
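Use DOMDocument to load the HTML content, then query it with DOMXPath:
$html = file_get_contents('http://site.com/page1.html'); // assumes the page is reachable over HTTP
libxml_use_internal_errors(true); // keep warnings about imperfect markup out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// item(0) is the first matching div, or null if there is none
$divContent = $xpath->query('//div[@id="MyDiv"]')->item(0);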
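To cover the "edit its content later" part, here is a minimal sketch that uses an inline sample in place of the remote page. The markup, the replacement text, and the added class are all invented for illustration.

```php
<?php
// Inline sample standing in for the remote page; in practice $html would come from file_get_contents().
$html = '<html><body><div id="Other">skip</div><div id="MyDiv"><p>Old text</p></div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="MyDiv"]')->item(0);

$result = null;
if ($div !== null) {
    // Edit the content: replace the paragraph text and add an attribute.
    $div->getElementsByTagName('p')->item(0)->nodeValue = 'New text';
    $div->setAttribute('class', 'edited');

    // Passing a node to saveHTML() serializes only that node, not the whole document.
    $result = $dom->saveHTML($div);
    echo $result, "\n";
}
```

Because `$div` is a live DOMElement, any further changes made through it are reflected the next time `saveHTML()` is called.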

DOMDocument : access the next following tag in PHP

I have installed a JSON plugin and got the content of an HTML page. Now I want to parse it and find a particular table, which has only a class but no id. I parse it using the PHP class DOMDocument. My idea is to access the tag before the table and then somehow access the following tag (my table) using DOMDocument.
Example:
<a name="Telefonliste" id="Telefonliste"></a>
<table class="wikitable">
So, I first get the <a> and after that I get the <table>.
I have got all the tables using the following commands and especially getElementsByTagName(). After that I can access item(2) where my table is:
$dom = new DOMDocument();
//load html source
$html = $dom->loadHTML($myHtml);
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$table = $dom->getElementsByTagName('table');
$rows = $table->item(2)->getElementsByTagName('tr');
This works, but I want to make it more general. Right now I know that the table is at item(2), but its position can change, e.g. if a new table is added to the HTML page before mine. Then my table would be at item(3) instead of item(2). So I want to parse the page in a way that still reaches this table without changing my code. Can I do that using DOMDocument as a DOM parser?
You can use DOMXPath, and make the expression as general as you need it.
For example:
$dom = new DOMDocument();
//discard white space
$dom->preserveWhiteSpace = false;
//load html source
$dom->loadHTML($myHtml);
$domxpath = new DOMXPath($dom);
// XPath positions start at 1; the parentheses pick the first match in the whole document
$table = $domxpath->query('(//table[@class="wikitable" and not(@id)])[1]')->item(0);
$elementBeforeTable = $table->previousSibling;
$rows = $table->getElementsByTagName('tr');
I've started writing a simple extension of this for the purpose of web scraping. I'm not 100% on the direction I want to take with it yet, but you can see an example of how to get the original HTML back in the response of the search rather than just raw text.
https://github.com/WolfeDev/PageScraper
EDIT: I plan on implementing basic table parsing soon.
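For the anchor-then-table layout in the question, the lookup can also be tied to the named anchor instead of a position. The sketch below runs against invented sample markup and uses the XPath `following` axis, which selects nodes after the anchor in document order, so it does not require the table to be a direct sibling of the `<a>`.

```php
<?php
// Invented sample: an unrelated table first, then the anchor, then the wanted table.
$myHtml = '<html><body><table class="wikitable"><tr><td>other</td></tr></table>'
    . '<a name="Telefonliste" id="Telefonliste"></a>'
    . '<table class="wikitable"><tr><td>Alice</td></tr><tr><td>Bob</td></tr></table>'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($myHtml);
$xpath = new DOMXPath($dom);

// First <table> that appears after the anchor with id "Telefonliste" in document order.
$table = $xpath->query('//a[@id="Telefonliste"]/following::table[1]')->item(0);

// Count the rows of that table only; 0 if the anchor or table was not found.
$rowCount = $table !== null ? $table->getElementsByTagName('tr')->length : 0;
echo $rowCount, "\n";
```

Adding or removing tables before the anchor does not affect the result, which addresses the item(2) versus item(3) problem.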
