I know there are many questions here about DOM traversal with XPATH. I have done a good amount of research before bringing my question here, but I am still having an issue. I'm trying to pull the number of downloads for a given app on the android market. So for instance if the app were the stack exchange app, I would want to pull the numbers: 50,000 - 100,000 from this page:
https://play.google.com/store/apps/details?id=com.stackexchange.marvin
I am attempting to target the div with an itemprop of "numDownloads", to little avail. I have no trouble targeting the other items on the page that I have tried (various classes, etc.), but this specific item never returns results. I have checked that the value is, in fact, in the source and not being inserted by JS. Here is my code:
// Load up the document so we can parse the dom
$dom = new DomDocument();
$dom->loadHTML($this->html);
// XPath so we can do some specific searches
$finder = new DomXPath($dom);
// Find every element on the page with the numDownloads itemprop
$installs = $finder->query("//*[@itemprop='numDownloads']");
echo "<pre>"; var_dump($installs); echo "</pre>";
foreach($installs as $install) {
echo "<pre>"; var_dump($install->nodeValue); echo "</pre>";
}
Any suggestions would be greatly appreciated!
Actually you are already on the right track.
$url = 'https://play.google.com/store/apps/details?id=com.stackexchange.marvin';
$contents = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($contents);
$finder = new DomXPath($dom);
$installs = $finder->query("//div[@itemprop='numDownloads']");
// directly point it to a div since it is a div
foreach($installs as $install) {
echo $install->nodeValue; // 50,000 - 100,000
}
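A minimal, self-contained sketch of the same query, run against an inline HTML fragment rather than the live page. The markup below is an assumption made for illustration, not a copy of the Play Store source. It also swaps the `@` error suppression for `libxml_use_internal_errors()`, which is another way to keep parser warnings about real-world HTML out of the output.

```php
<?php
// Hypothetical fragment standing in for the fetched page.
$html = '<html><body>'
      . '<div class="meta"><div itemprop="numDownloads"> 50,000 - 100,000 </div></div>'
      . '</body></html>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
// In XPath, "@" selects an attribute.
$installs = $finder->query("//div[@itemprop='numDownloads']");

$downloads = [];
foreach ($installs as $install) {
    $downloads[] = trim($install->nodeValue);
}
print_r($downloads);
```

If the list comes back empty on a real page, dumping `$dom->saveHTML()` shows what the parser actually received.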
I am pulling HTML from Selenium, and then extracting data from the HTML using Xpaths.
This is the Xpath:
/html/body/div[2]/div[1]/div/div/div/div/ul/li/div[1]/h3/a
This is my code:
$data = $webdriver->getPageSource();
d($data, $urltemplate);
$doc = new DOMDocument();
$doc->loadHTML($data);
$xp = "/html/body/div[2]/div[1]/div/div/div/div/ul/li/div[1]/h3/a";
$xpatho = new DOMXpath($doc);
$elementsn = $xpatho->query($xp);
d(get_class($elementsn),$elementsn->count(),$xp,$name);
// d() is a custom function like var_dump().
I always get $elementsn->count() = 0.
This is $data:
https://pastebin.com/ahuvkJfN
I am trying to extract those strings like "NAD M10 BLUOS...", "NAD M12 DIRECT DIGITAL..." and so on...
I saved the HTML into a file, and opened it in my browser. I am attaching screenshot of what data I was looking to retrieve (highlighted in blue):
Basically, the HTML page is a product listing, and I am looking to extract all the product names. To confirm, I used Chrome Developer tools, and used the copy full Xpath function. I have the following Xpaths for some of the product names:
/html/body/div[2]/div[1]/div/div/div/div/ul/li[1]/div[1]/h3/a
/html/body/div[2]/div[1]/div/div/div/div/ul/li[3]/div[1]/h3/a
I would guess that this would generalise to:
/html/body/div[2]/div[1]/div/div/div/div/ul/li/div[1]/h3/a
However, I keep on getting a DOMNodeList with count = 0. Why is this so, and how can I check what the error is, if any?
P.S.: This is the original webpage: http://lenbrook.com.sg/3-shop-by-brand#/page-4/price-49-8667
Try changing your $xp
$xp = '//a[@class="product_link"]/text()';
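One likely reason the absolute path returns nothing is that the tree shown in the browser's developer tools can differ from the tree libxml builds from the raw source, so a relative path keyed on an attribute tends to be more robust. The sketch below runs such a query against hypothetical product-list markup (only the class name `product_link` comes from the answer) and prints any parser errors that libxml recorded, which is one way to check what went wrong.

```php
<?php
// Hypothetical markup; only the class name "product_link" is taken from the answer.
$html = '<html><body><ul>'
      . '<li><div><h3><a class="product_link" href="/p/1">NAD M10</a></h3></div></li>'
      . '<li><div><h3><a class="product_link" href="/p/2">NAD M12</a></h3></div></li>'
      . '</ul></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Show whatever the parser complained about, then reset the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//a[@class="product_link"]/text()') as $textNode) {
    $names[] = trim($textNode->nodeValue);
}
print_r($names);
```

Note that `@class="product_link"` only matches when the attribute is exactly that string; if the element carries several classes, `contains(@class, "product_link")` is a common alternative.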
I am a complete beginner with PHP. I understand the concepts but am struggling to find a tutorial I understand. My goal is this:
Use the xpath addons for Firefox to select which piece of text I would like to scrape from a site
Format the scraped text properly
Display the text on a website
Example)
// Get the HTML Source Code
$url='http://steamcommunity.com/profiles/76561197967713768';
$source = file_get_contents($url);
// DOM document Creation
$doc = new DOMDocument;
$doc->loadHTML($source);
// DOM XPath Creation
$xpath = new DOMXPath($doc);
// Get all events
$username = $xpath->query('//html/body/div[3]/div[1]/div/div/div/div[3]/div[1]');
echo $username;
?>
In this example, I would like to scrape the username (which at the time of writing is mopar410).
Thank you for your help - I am so lost :( Right now I managed to use xpath with importXML in Google doc spreadsheets and that works, but I would like to be able to do this on my own site with PHP to learn how.
This is code I found online and edited the URL and the variable - as I am not aware of how to write this myself.
They have a public API.
Simply use http://steamcommunity.com/profiles/STEAM_ID/?xml=1
<?php
$profile = simplexml_load_file('http://steamcommunity.com/profiles/76561197967713768/?xml=1', 'SimpleXMLElement', LIBXML_NOCDATA);
echo (string)$profile->steamID;
Outputs: mopar410 (at time of writing)
This also provides other information such as mostPlayedGame, hoursPlayed, etc (look for the xml node names).
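To try the parsing step without a network call, the same SimpleXML code can be pointed at an inline string. The XML below is a hypothetical, heavily trimmed stand-in for the real response; only the `steamID` node name comes from the answer above.

```php
<?php
// Hypothetical response body; the real feed contains many more nodes.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<profile><steamID><![CDATA[mopar410]]></steamID></profile>';

// LIBXML_NOCDATA merges CDATA sections into plain text nodes.
$profile = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
if ($profile === false) {
    throw new RuntimeException('Could not parse the profile XML');
}

// Cast to string to get the text content rather than a SimpleXMLElement object.
$username = (string) $profile->steamID;
echo $username, "\n";
```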
Say we have a div with id MyDiv in the remote page site.com/page1.html
We want to get only this div from the page, in a way that allows us to manipulate or edit its content later.
What is the best practice here?
I've tried two ways: either file_get_contents followed by loading the content into DOMDocument, or Simple HTML DOM Parser.
For the first method, I have read about it but don't know how to get only MyDiv from the result of file_get_contents.
For the second method, my current code is:
<?php
include_once('simple_html_dom.php');
$url = "site.com/page1.html";
$html = str_get_html($url);
$elem = $html->find('div[id=MyDiv]', 0);
echo $elem;
?>
but it's also not working and I don't know why.
Use DOMDocument to load the HTML content, then query it with XPath.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$divContent = $xpath->query('//div[@id="MyDiv"]');
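A slightly fuller sketch of the DOMDocument route, covering the "edit it later" part of the question. The HTML string is a hypothetical stand-in for what `file_get_contents($url)` would return, and the added class and paragraph are purely illustrative.

```php
<?php
// Hypothetical page content; in practice this would come from file_get_contents($url).
$html = '<html><body><div id="Other">skip</div>'
      . '<div id="MyDiv"><p>Hello</p></div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// query() returns a DOMNodeList; item(0) is the first match or null.
$div = $xpath->query('//div[@id="MyDiv"]')->item(0);

$fragment = null;
if ($div !== null) {
    // Edit the node in place, then serialize only that node.
    $div->setAttribute('class', 'imported');
    $div->appendChild($dom->createElement('p', 'Added later'));
    $fragment = $dom->saveHTML($div);
}
echo $fragment, "\n";
```

Passing the node to `saveHTML()` serializes just that subtree instead of the whole document.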
This question may seem very stupid, but I am not able to find much help on how to find the node value of the last child using PHP, even though it's a piece of cake with JS.
This is what my XML currently looks like:
<?xml version="1.0"?>
<files>
<file>.DS_Store</file>
<file>ID2PDF_log_1.xml</file>
<file>ID2PDF_log_12.xml</file>
<file>ID2PDF_log_15.xml</file>
</files>
Here's the php code:
$filename = 'files.xml'; //my xml file name
$dom = new DomDocument();
$dom->load($filename);
$elements = $dom->getElementsByTagName('file');
echo $elements->lastChild(); // This is obviously not working
/*I get an error that I am trying to access an undefined method in DOMNodeList. Now, I know
that lastChild is a property of DOMNode. But I can't figure out how I can change my code to
get this to work.*/
I am trying to echo out
ID2PDF_log_15.xml
Can anyone show me how to get this done?
P.S.: I don't want to change the xml file structure because I am creating it through a script and I am a lazy programmer. But, I did do my research to get this. Didn't help.
I did try getting the number of elements in the node 'file' and then using item(#), but that didn't seem to work either.
Thanks
SOLUTION
$filename = 'files.xml';
$dom = new DomDocument();
$dom->load($filename);
$elements = $dom->getElementsByTagName('file')->length;
echo 'Total elements in the xml file:'.$elements."\n";
$file = file_get_contents('files.xml');
$xml = simplexml_load_string($file);
$result = $xml->xpath('file');
echo "Last element".$result[$elements-1]."\n";
I'll make this neater a little later, but I thought I should share the answer anyway for any new users in the future.
This should work:
$elements->xpath('root/child[last()]');
Read up about xpath
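Applied to the XML from the question, the `last()` idea could look like the sketch below. It uses DOMXPath on the DOMDocument (an assumption about which API is meant, since DOMNodeList itself has no `xpath` method) and also shows the count-based alternative with `item()`.

```php
<?php
$xml = <<<XML
<?xml version="1.0"?>
<files>
<file>.DS_Store</file>
<file>ID2PDF_log_1.xml</file>
<file>ID2PDF_log_12.xml</file>
<file>ID2PDF_log_15.xml</file>
</files>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Option 1: let XPath pick the last <file> child of <files>.
$xpath = new DOMXPath($dom);
$last = $xpath->query('/files/file[last()]')->item(0);
$viaXpath = $last !== null ? $last->nodeValue : null;

// Option 2: index into the node list; items are zero-based, so the last is length - 1.
$files = $dom->getElementsByTagName('file');
$viaIndex = $files->length > 0 ? $files->item($files->length - 1)->nodeValue : null;

echo $viaXpath, "\n", $viaIndex, "\n";
```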
Alternatively I would suggest counting the number of elements, and then targeting the last element using that count:
$files = $dom->getElementsByTagName('file');
// Items are zero-based, so the last one sits at length - 1
echo $files->item($files->length - 1)->nodeValue;
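I did it this way:
// Take the <files> root, then its last child. Set $dom->preserveWhiteSpace = false
// before load(), otherwise lastChild may be a whitespace text node.
$elements = $dom->getElementsByTagName('files')->item(0);
echo $elements->lastChild->nodeValue;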
I was trying to load a page from H&M (for studying purposes), when I noticed that the content of one div isn't loaded, but if I save the page from the browser, the div is saved correctly.
Can anyone explain to me why this happens?
The div (and, most importantly, its contents) I'm looking for is:
body>div main>div content> div relatedInformationContainer
(inside there's lot of content: div relatedInformation>etc...)
This is the code I used:
<?php
$url = "http://www.hm.com/gb/product/05427";
libxml_use_internal_errors(true);
$html = file_get_contents($url);
$dom = new DomDocument();
$dom->loadHTML($html);
$xp = new domxpath($dom);
$contentDivs = $xp->query('//div[@id="content"]')->item(0);
$numContentDivs = $xp->evaluate('count(div)', $contentDivs);
// echo $numContentDivs; // output:3 (correct)
$relatedDiv = $xp->query('//div[@id="content"]/div[2]')->item(0)->getAttribute("id");
echo $relatedDiv; // output:relatedInformationContainer (correct)
$relatedDivContent = $xp->query('//div[@id="content"]/div[2]')->item(0);
$numRelatedDivContent = $xp->evaluate('count(div)', $relatedDivContent);
echo $numRelatedDivContent; // output:0 (incorrect!!! it should output 1)
?>
I also tried a simpler approach, with the same result:
<?php
$url = "http://www.hm.com/gb/product/05427";
$doc = new DOMDocument();
$load = @$doc->loadHTMLFile($url);
echo $doc->saveHTML();
?>
I would appreciate it if anyone could explain why this happens, and whether there's a solution.
Thanks.
The DIV is filled in by JavaScript after the page loads. You need to find out which request the JavaScript makes, and replicate that request in PHP.
Using Firefox with Firebug, I see that the page issues a call to
http://www.hm.com/gb/product/05427/05427-A/related
which returns the DIV with all its contents (I guess it replaces the DIV). You will have to capture that.
Also, some servers check who is asking what and on behalf of whom. So the query above might not work if its HTTP_REFERER field is not set to the correct originating page, with the right User-Agent and session cookies etc. (in general; it appears not to be the case here - even though I may be wrong).
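If the server does turn out to check those headers, one way to send them from PHP is a stream context passed to `file_get_contents()`. The sketch below is only an illustration: the helper name `buildContext`, the User-Agent string and the timeout are assumptions, and the request itself is left commented out so the snippet runs without network access.

```php
<?php
// Hypothetical helper: builds a stream context that sends Referer and User-Agent headers.
function buildContext(string $referer, string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'header'     => "Referer: $referer\r\n",
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice
        ],
    ]);
}

$context = buildContext(
    'http://www.hm.com/gb/product/05427',
    'Mozilla/5.0 (compatible; ExampleFetcher/1.0)' // placeholder value
);

// Inspect what would be sent.
$options = stream_context_get_options($context);
print_r($options['http']);

// The actual request, using the URL reported in the answer above:
// $related = file_get_contents('http://www.hm.com/gb/product/05427/05427-A/related', false, $context);
```

Session cookies, if required, could be added as a further `Cookie:` line in the `header` option.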