I want to get all organic search results from Google.
I need help defining the XPath to exclude the ads. The cite tag on the ads does not contain a class attribute, and the organic results have 2 different class values. My attempts at defining the XPath have failed. The Google results page looks something like this:
Ad
<cite>example.com</cite>
Organic Result 1
<cite class="_Rm">example.com/page1.html</cite>
Organic Result 2
<cite class="_Rm bc">example.com > Breadcrumbs > Page2</cite>
Here is my code:
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.google.com/search?q=mortgage&num=100');
$xpath = new DOMXPath($html);
$nodes = $xpath->query('//cite');
foreach ($nodes as $n){
echo $n->nodeValue.'<br />'; // Show all links
}
Please help
Try //cite[@class='_Rm' or @class='_Rm bc']. This will select cite nodes whose class is either _Rm or _Rm bc.
Assuming the portion of the HTML you want to get is not generated by client-side scripts (usually JavaScript), the following simple XPath will do the job:
$nodes = $xpath->query('//cite[#class]');
The XPath above selects all <cite> tags that have a class attribute, regardless of its value.
Otherwise, you need to find a way to run the client-side scripts, so that the HTML is generated completely before you apply the XPath query above to it.
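Combining that with the code from the question, a minimal sketch might look like this (the query string is the only real change; the URL is the one from the question):
// Sketch: keep only <cite> elements that carry a class attribute,
// which filters out the ad results described in the question.
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.google.com/search?q=mortgage&num=100');
$xpath = new DOMXPath($html);
$nodes = $xpath->query('//cite[@class]');
foreach ($nodes as $n) {
    echo $n->nodeValue . '<br />'; // organic results only
}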
Related
I know that similar issues have been discussed many times, but I'm really stuck. I am trying to access a part of an HTML DOM tree:
<div class="price widget-tooltip widget-tooltip-tl price-breakdown">
<strong class="current-price">10€</strong>
</div>
I want to get the value of the <strong> node, which would be 10€, so I use the XPath:
//div//strong[contains(@class, 'current-price')]
which works perfectly in Chrome Dev Tools, but not in my PHP script:
// Create DOM object
$dom = new DOMDocument();
@$dom->loadHTML($header['content']);
$xpath = new DOMXPath($dom);
$prices = $xpath->query("//div//strong[contains(@class, 'current-price')]");
I tried different "versions" as suggested here and here (and in many other cases) without success. The problem is that I'm not getting anything as a return value, just an empty result.
I isolated the issue and can confirm that it happens only when the class selector comes into play; if I use the path without the class selector it works fine (returning many elements).
What am I doing wrong?
How can I (with PHP) remove the style attributes from DIVS having a certain class?
Because of a 'Drag & Drop' process some DIV elements get polluted with unnecessary styles, which can lead to problems later on.
I know I can remove the style attributes with JavaScript after a 'Drag & Drop' process, but I only want to remove them when the HTML is being processed by the server (for sending the HTML as an e-mail).
This isn't a particularly difficult problem, so far as I can tell. You need to load the HTML into a DOMDocument structure, then use a simple XPath attribute selector to find the relevant elements and DOMElement::removeAttribute to remove the style attribute. Your code might look like this:
$dom = new DOMDocument;
$dom->loadHTML($dirtyHtml);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[@class="someclass"]');
foreach ($divs as $div) {
$div->removeAttribute('style');
}
$cleanHtml = $dom->saveHTML();
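One caveat (my addition, not part of the answer above): @class="someclass" is an exact string match, so it will miss elements like <div class="someclass other">. If the divs can carry several classes, the usual workaround is a contains() check on a space-padded class string:
// Sketch: match divs whose class attribute includes "someclass" among others.
$divs = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]"
);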
I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[#id='csrfToken-login']/#value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
Which I'm assuming means it was unable to find anything in the document that matches my selector? I've tried a number of variations, just to see if I can get something back:
/input/#value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible due to the parser choking on HTML5 (as the internal page is). Does anyone have any experience of this?
If your selector starts with a single slash (/), it means an absolute path from the root. You need to use a double slash (//), which selects all matching elements regardless of where they sit in the document.
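A tiny illustration of the difference (the markup is hypothetical, just to show the counts):
// '/input' asks for an <input> as the document's root element (never true here),
// while '//input' finds <input> elements anywhere in the tree.
$dom = new DOMDocument;
@$dom->loadHTML('<html><body><form><input id="a"></form></body></html>');
$xpath = new DOMXPath($dom);
echo $xpath->query('/input')->length;  // 0
echo $xpath->query('//input')->length; // 1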
print_r() won't show the contents of a DOMNodeList, so it looks empty. Everything was fine in your code except for actually reading the value.
Node list classes in PHP have a length property; check that instead to see how many matches you got.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[#id='csrfToken-login']/#value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
DOMXPath looks fine to me.
As for the XPath, use the descendant-or-self shortcut // to get to the input tag:
//input[@id='SomeId']/@value
I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
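A PHP-side possibility (my suggestion, not something from the original answer) is the tidy extension, which can repair the markup before DOMDocument parses it. A minimal sketch, assuming ext-tidy is installed:
// Repair the malformed markup before handing it to DOMDocument (assumes ext-tidy).
$raw = file_get_contents("https://www.linkedin.com/uas/login");
$clean = tidy_repair_string($raw, array('output-xhtml' => true), 'utf8');

$dom = new DOMDocument;
@$dom->loadHTML($clean);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");
if ($nodes->length > 0) {
    echo $nodes->item(0)->value;
}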
I hope this helps,
Zachary
First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I don't have control over how my users code their stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
$doc->loadHTML($file);
$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach($links as $l) {
if($l->getAttribute("rel") == "service") {
echo $l->getAttribute("href");
}
}
You should get the Base element, but know how it works and its scope.
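For illustration only (the helper below is hypothetical, not part of the answer above), resolving the href against a <base> element might look roughly like this:
// Hypothetical helper: prepend an optional <base href> to a relative link.
// Real URL resolution is more involved (scheme-relative URLs, ../ segments, etc.).
function resolveHref(DOMDocument $doc, $href) {
    $bases = $doc->getElementsByTagName('base');
    if ($bases->length > 0 && !preg_match('#^https?://#', $href)) {
        return rtrim($bases->item(0)->getAttribute('href'), '/') . '/' . ltrim($href, '/');
    }
    return $href;
}
// e.g. with <base href="http://example.com"> in the head:
// echo resolveHref($doc, "/service.txt"); // http://example.com/service.txt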
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery... and while that may sound like something of a dumb concept, it is awesome for document traversal... and doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
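For example, a rough sketch of how phpquery is typically used; the entry points below are from memory, so treat the exact method names and selector support as an assumption:
// Sketch: load HTML into phpquery and pull hrefs with a jQuery-style selector.
// Requires phpQuery.php from the project linked above.
require_once 'phpQuery.php';
$doc = phpQuery::newDocumentHTML(file_get_contents($url));
foreach (pq('link[rel=service]') as $link) {
    echo $link->getAttribute('href'); // pq() yields DOMElement nodes when iterated
}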
I'm working with Selenium under Java for Web-Application-Testing. It provides very nice features for document traversal using CSS-Selectors.
Have a look at How to use Selenium with PHP.
But this setup might be too complex for your needs if you only want to extract this one link.
I would like to make a simple but non trivial manipulation of DOM Elements with PHP but I am lost.
Assume a page like Wikipedia where you have paragraphs and titles (<p>, <h2>). They are siblings. I would like to take both elements, in sequential order.
I have tried GetElementbyName, but then you have no way to organize the information.
I have tried DOMXPath->query() but I found it really confusing.
Just parsing something like:
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
into:
Title1
Paragraph1
Paragraph2
Title2
Paragraph3
with a few bits of HTML code I do not need in between.
Thank you. I hope the question does not look like homework.
I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either an <h2> or a <p> on the same level (since you said they were siblings).
/html/body/*[name() = 'p' or name() = 'h2']
The nodes will be returned as a node list in the right order (document order). You can then construct a foreach loop over the result.
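A minimal sketch of how that might be wired up (variable names are just for illustration):
// Sketch: parse the sample markup and walk the <h2>/<p> siblings in document order.
$dom = new DOMDocument;
$dom->loadHTML($html); // $html holds the markup from the question
$xpath = new DOMXPath($dom);
foreach ($xpath->query("/html/body/*[name() = 'p' or name() = 'h2']") as $node) {
    echo $node->nodeValue . "\n"; // Title1, Paragraph1, Paragraph2, Title2, Paragraph3
}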
I have used simple html dom by S.C. Chen a few times.
It is a perfect class for accessing DOM elements.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Check it out here: simplehtmldom
It may help with future projects.
Try having a look at this library and corresponding project:
Simple HTML DOM
This allows you to open an online web page or an HTML page from the filesystem and access its items via class names, tag names and IDs. If you are familiar with jQuery and its syntax, you will need no time to get used to this library.