Web scraper with DOMDocument

I'm trying to scrape a web page for content, using file_get_contents to grab the HTML and then using a DOMDocument object. My problem is that I cannot get the appropriate information. I'm not sure if this is because I'm using DOMDocument's methods wrong, or if the (X)HTML in my source is just poor.
In the source, there is an element with an id of 'cards', which has two child divs. I want the first child, which has many child divs, each of which has an anchor child containing a div. I want the href from the anchor and the nodeValue from its child div.
The structure is like this:
<div id="cards">
<div class="grid">
<div class="card-wrap">
<a href="linkValue">
<img src="..."/>
<div>nameValue</div>
</a>
</div>
...
</div>
<div id="...">
</div>
</div>
I've started out with $cards = $dom->getElementById("cards"). I get a DOMText Object, a DOMElement Object, a DOMText Object, a DOMElement Object, and a DOMText Object. I then use $grid = $cards->childNodes->item(1) to get the first DOMElement Object, which is presumably the .grid element. However, when I then iterate through the $grid with:
foreach($grid->childNodes as $item){
if($item->nodeName == "div"){
echo $item->nodeName,' | ',$item->nodeValue,'<br>';
}
}
I end up with a page full of "div | nameValue" where nameValue is the embedded div's nodeValue, and I am unable to locate the anchors to get their href value.
Am I doing something obviously wrong with my DOMDocument, or perhaps there is something more going on here?

Well, in your example code the check if($item->nodeName == "div"){ is going to exclude every <a> tag. Additionally, childNodes only contains a node's direct children; it does not descend recursively.
Therefore, to access the nodes in question, you could use:
$children = $dom->getElementById("cards")->childNodes
->item(1)->childNodes->item(1)->childNodes;
Yet, as you can see this is very messy... Introducing XPath:
http://php.net/manual/en/class.domxpath.php
http://www.w3schools.com/xpath/xpath_syntax.asp
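As a minimal sketch of what the XPath approach could look like with DOMDocument itself (rather than the chained childNodes calls above), the following loads the sample markup from the question and collects each card's href and name. The variable names and the $results array are illustrative assumptions; on a real page the HTML would come from file_get_contents.

```php
<?php
// Sample markup copied from the question; a real page would be fetched instead.
$html = <<<HTML
<div id="cards">
<div class="grid">
<div class="card-wrap">
<a href="linkValue">
<img src="..."/>
<div>nameValue</div>
</a>
</div>
</div>
<div id="...">
</div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = [];

// Every anchor inside a child div of the first child div of #cards.
foreach ($xpath->query('//div[@id="cards"]/div[1]/div/a') as $anchor) {
    // Query relative to the anchor for its direct div child.
    $nameDiv = $xpath->query('div', $anchor)->item(0);
    $results[] = [
        'href' => $anchor->getAttribute('href'),
        'name' => $nameDiv ? trim($nameDiv->nodeValue) : '',
    ];
}

print_r($results);
```

Passing the anchor as the second argument to query() keeps the inner lookup scoped to that card, so whitespace text nodes never need to be skipped by hand.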

The XPath way:
$src = <<<EOS
<div id="cards">
<div class="grid">
<div class="card-wrap">
<a href="linkValue">
<img src="..."/>
<div>nameValue</div>
</a>
</div>
</div>
<div id="whatever">
</div>
</div>
EOS;
$xml = new SimpleXMLElement($src);
list ($anchor) = $xml->xpath('//div[@id="cards"]/div[1]/div[1]/a');
echo $anchor->div, ' => ', $anchor['href'], PHP_EOL;
"Get anchor of first child div of first child div of div with an id of 'cards'"
Output:
nameValue => linkValue

Related

PHP XMLNode, DOMnode Xpath selection predicate for a grandchild attribute value

I have some xml
<div> First Element
<div>
<level3 name="fred">
</level3>
</div>
</div>
<div> Second Element
<div>
<level3 name="dave">
</level3>
</div>
</div>
<div> Third Element
<div>
<level3 name="jim">
</level3>
</div>
</div>
<div> Fifth Element
<div>
<level3 name="mike">
</level3>
</div>
</div>
I want to extract the xml (as a string, including the xml tags) from a specific top level div element based on the name attribute of its grandchild at level3.
So to get the top div above the level3 node with the name of jim I have been looking at things like:
$sname="jim";
$spath = new DOMXPath($doc);
// Find a div with a child div with a level3 with a matching attribute name.
$spexp = "//div[./div/level3[contains(@name,\"$sname\")]]";
$story = $spath->evaluate("$spexp");
echo $story->item(0)->nodeValue . "\n";
I have tried various combinations - including 'exists' in the predicate which I am sure is basic xslt, but not in PHP(!).
I have googled loads... but predicates going down past the immediate level hasn't come up, and it seems PHP's xpath has its own flavour, so general XPath stuff isn't always useful.

The XPath was OK; this version just drops the leading ./ inside the first predicate, as it's not needed.
nodeValue only returns the text content, so to output the XML including all of the tags, pass the node you want to export to saveXML()...
$sname="jim";
$spath = new DOMXPath($doc);
// Find a div with a child div with a level3 with a matching attribute name.
$spexp = "//div[div/level3[contains(@name,\"$sname\")]]";
$story = $spath->evaluate("$spexp");
echo $doc->saveXML($story->item(0)). "\n";
Gives...
<div> Third Element
<div>
<level3 name="jim">
</level3>
</div>
</div>

how to get content of an element with HTML nodes?

I need to get the content of an element and place that content into another element. I use createTextNode to append that content as a child to the target element.
As I append it as a text node, < and > are converted into &lt; and &gt;. How can I append that content without conversion?
For example:
<li id="fn1">
<div>
<a>some text
</a>
</div>
</li>
Expected output:
<p>
<div>
<a>some text
</a>
</div>
</p>
But my output is like,
<p>
&lt;div&gt;
&lt;a&gt;some text&lt;/a&gt;
&lt;/div&gt;
</p>
My code:
$ch=$dom->createElement("p");
$li=$xp->query("//li[contains(@id, 'fn')]");
foreach($li as $liv) {
$linodes = $liv->childNodes;
$pvalue="";
foreach ($linodes as $lin) {
$pvalue.=$dom->saveXML($lin);
}
$ch->appendChild($dom->createTextNode($pvalue));
}
I have tried,
$ch->appendChild($dom->createTextNode(htmlspecialchars_decode($pvalue))); but same output

A text node always has its markup escaped, so work with the nodes themselves instead of their serialized string. If you want to
move a node within the same document: remove that node via DOMNode::removeChild and append the return value of that function via DOMNode::appendChild to its new parent node.
copy the node to a new location within the same document: make a deep clone of the node via DOMNode::cloneNode and append the clone.
transfer the node to another document: import that node into the new document via DOMDocument::importNode and then append it to its new parent.
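The copy option could be sketched roughly as follows. This assumes the li from the question lives in a small XML document with a hypothetical root wrapper, and that the new p is appended to that root; both are illustrative choices, not part of the original code.

```php
<?php
// Hypothetical wrapper document around the <li> from the question.
$dom = new DOMDocument();
$dom->loadXML('<root><li id="fn1"><div><a>some text</a></div></li></root>');
$xp = new DOMXPath($dom);

$p = $dom->createElement('p');

foreach ($xp->query('//li[contains(@id, "fn")]') as $li) {
    foreach ($li->childNodes as $child) {
        // cloneNode(true) makes a deep copy, so the <div> and <a> stay real
        // element nodes instead of being flattened into escaped text.
        $p->appendChild($child->cloneNode(true));
    }
}

$dom->documentElement->appendChild($p);

$out = $dom->saveXML($p);
echo $out, PHP_EOL;
```

Because the children are cloned rather than moved, the original li is left untouched; moving them instead would modify the live childNodes list while it is being iterated.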

How to get element whose parents child is element x? PHP Simple HTML DOM Parser

So for example I have a HTML tree like this:
<section class="product">
<div>
<div class="p-image">
<img alt="Product name" src="path/to/image.jpg">
</div>
<div class="p-content">
<h3>Product name</h3>
</div>
<div class="p-info">
<div class="new-price">
<span>400 €</span>
</div>
</div>
</div>
</section>
So I want to get the content of the span element whose parent (div) has a child element (img) with a specific alt attribute. I know how to select an element by its attributes, but I haven't found any solution for selecting an element by its parent's child.
I hope my explanation was understandable.
Thank you.

In jQuery you could use $(selector).parent() to get an element's parent and $('img[alt="x"]') to select the img tag whose alt attribute equals x.
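Since the question is about PHP rather than jQuery, here is a rough sketch of the same idea using PHP's built-in DOMDocument and DOMXPath. Note that this swaps in a different technique from the PHP Simple HTML DOM Parser the question names, and the XPath expression is one assumed way to express "the price span under the div that also contains the matching img".

```php
<?php
// Markup from the question, with the outer <div> closed.
$html = <<<HTML
<section class="product">
<div>
<div class="p-image">
<img alt="Product name" src="path/to/image.jpg">
</div>
<div class="p-content">
<h3>Product name</h3>
</div>
<div class="p-info">
<div class="new-price">
<span>400 €</span>
</div>
</div>
</div>
</section>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // <section> is unknown to the older HTML parser
// The XML declaration prefix is a common workaround to make loadHTML read UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Find the div that has a p-image child holding the wanted img,
// then descend from that same div to the price span.
$query = '//div[div[@class="p-image"]/img[@alt="Product name"]]'
       . '//div[@class="new-price"]/span';

$span = $xpath->query($query)->item(0);
$price = $span ? trim($span->nodeValue) : null;

echo $price, PHP_EOL;
```

The predicate in square brackets filters the outer div without changing the current node, which is what makes "select by a sibling branch" possible in a single expression.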

Retrieving (relating) two separate tags/attributes using a single XPath query?

I am running XPath queries against a DOMDocument I have. The general pattern of this document is as follows:
<h2> Title info </h2>
<div> .... </div>
<p> ...</p>
<div class = format_text>
<p>
<img src = "http://sourceofimageOnline.com">
</p>
</div>
<h2> 2nd title</h2>
<div> .... </div>
<p> ...</p>
<div class = format_text>
<p>
<img src = "http://sourceofimageOnline.com"></img>
<img src = "http://sourceofimageonline.com"></img>
</p>
</div>
The key is to return the titles and the src attribute for images that are hyperlinks.
Essentially, I render it as :
Title 1
Img URI 1
Title 2
Img URI 2
Img URI 3
...
..
Now the Titles can be easily retrieved using
DOMDocument->getElementsByTagName('h2')
And the img src are retrieved by an XPATH query:
//div[@class = "format_text"]/p/a/img/@src
This returns all the information I need. However, I am struggling to relate the img src values to the titles they fall under. Since they are retrieved independently, I cannot work out what kind of XPath query I need to execute to retrieve both such that the above constraint is satisfied.

Fetch the list of h2 nodes with the XPath expression /html/body//h2.
Iterate over this list, evaluating a second XPath expression for each h2.
Refer to the current h2 with . and refer to the link with the
./../div[@class='format_text']/p/a[$counter]/img
XPath expression, where $counter is the array index.
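One way the iterate-per-title idea might be sketched is below. It is a variation, not the exact expression above: it uses the following-sibling axis to pick the first format_text div after each h2, and it matches any img under that div because the sample markup shows no a elements. The example.com image URLs and the $byTitle array are hypothetical, and the sketch assumes each title is followed by exactly one format_text div.

```php
<?php
// Simplified version of the question's pattern with hypothetical image URLs.
$html = <<<HTML
<h2> Title info </h2>
<div> .... </div>
<p> ...</p>
<div class="format_text">
<p>
<img src="http://example.com/a.png">
</p>
</div>
<h2> 2nd title</h2>
<div> .... </div>
<p> ...</p>
<div class="format_text">
<p>
<img src="http://example.com/b.png">
<img src="http://example.com/c.png">
</p>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$byTitle = [];

foreach ($xpath->query('//h2') as $h2) {
    $title = trim($h2->nodeValue);
    $byTitle[$title] = [];

    // Relative query: the first format_text div that follows this h2.
    $srcs = $xpath->query(
        'following-sibling::div[@class="format_text"][1]//img/@src',
        $h2
    );
    foreach ($srcs as $src) {
        $byTitle[$title][] = $src->nodeValue;
    }
}

print_r($byTitle);
```

Running the second query with the h2 as its context node is what ties each group of images to its own title, instead of merging two independent result lists afterwards.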

Parse HTML with PHP's HTML DOMDocument

I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML, as I used to use regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)
I googled around for a while looking for some explanations but didn't find anything that helped (not with the class anyway).
So I want to capture "Capture this text 1" and "Capture this text 2" and so on.
It doesn't look too hard, but I can't figure it out :(
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>

If you want to get :
The text
that's inside a <div> tag with class="text"
that's, itself, inside a <div> with class="main"
I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).
Instead, I would use an XPath query on your document, using the DOMXpath class.
For example, something like this should do, to load the HTML string into a DOM object and instantiate the DOMXPath class :
$html = <<<HTML
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :
$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
var_dump(trim($tag->nodeValue));
}
And executing this gives me the following output :
string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)

You can use http://simplehtmldom.sourceforge.net/
It is a simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.
Something like this:
// Find all <div> which have attribute id=text
$ret = $html->find('div[id=text]');
See the documentation of it for more help.
