How do I parse HTML using PHP DOMDocument?


I have an HTML block here:
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table>
<tbody>
<tr>
<td class="price-status">
<div class="status">
<span class="available">Yes</span>
</div>
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
How do I parse the link (http://test.com/asus_rt-n53/p195257/), title (Asus RT-N53) and price (758)?
Curl code here:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$models = $xpath->query('//div[@class="title"]/a');
foreach ($models as $model) {
    echo $model->nodeValue;
    $prices = $xpath->query('//div[@class="uah"]');
    foreach ($prices as $price) {
        echo $price->nodeValue;
    }
}

One crude solution is to cast the price to an integer, which keeps only the leading digits:
echo (int) $price->nodeValue;
Alternatively, find the span inside the current price div and remove it before reading the value (inside the prices foreach):
$span = $xpath->query('./span', $price)->item(0);
$price->removeChild($span);
echo $price->nodeValue;
Edit:
To retrieve the link, call getAttribute() on the anchor and ask for its href:
echo $model->getAttribute('href');
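
Putting the pieces together, here is a minimal sketch that pulls all three values in one pass. It assumes the markup from the question, with the unclosed tags closed so the fragment is complete; the variable names are only illustrative. Selecting the text() node of the price div is a third way to skip the nested span, alongside the cast and the removeChild() call.

```php
<?php
// Markup from the question, with the unclosed tags closed for a complete fragment.
$content = <<<'HTML'
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table><tbody><tr><td class="price-status">
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
</td></tr></tbody></table>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The anchor carries both the link (href attribute) and the title (its text).
$anchor = $xpath->query('//div[@class="title"]/a')->item(0);
$link   = $anchor->getAttribute('href');
$title  = trim($anchor->textContent);

// text() selects only the div's own text node, so the nested <span> is skipped.
$price = (int) $xpath->evaluate('string(//div[@class="uah"]/text())');

echo "$link | $title | $price", PHP_EOL;
```

On a page with several products, the same queries would need to be made relative to each product's container node rather than run against the whole document.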

Related

simple html dom traversal confusion when looping

I'm trying to use the php script simplehtmldom to loop over divs on a web page while scraping.
Right now I have this:
$url = "https://test.com/";
$html = new simple_html_dom();
$html->load_file($url);
$item_list = $html->find('div.main div[id]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
This gives me many blocks like this (from the echo in the loop above):
<div id=1>
<div>
stuff here
</div>
<div>
<span class="title">name</span>
</div>
</div>
<div id=2>
<div>
stuff here
</div>
<div>
<span class="title">name 2</span>
</div>
</div>
What I'm trying to do is loop over the span with class=title, but no matter what I can't seem to quite get the right selector. Could someone help me out?

You can get the spans by adding span[class=title] to the selector:
$item_list = $html->find('div.main div[id] span[class=title]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
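
For comparison, the same selection can be sketched with the built-in DOMDocument and DOMXPath classes instead of simplehtmldom. This is only an illustration: the wrapping div.main container and the sample markup are assumed from the question's selector and output, not taken from the real page.

```php
<?php
// Hypothetical page: the question's output wrapped in an assumed div.main container.
$page = <<<'HTML'
<div class="main">
<div id="1"><div>stuff here</div><div><span class="title">name</span></div></div>
<div id="2"><div>stuff here</div><div><span class="title">name 2</span></div></div>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Match "main" as a whole class token, then any div with an id, then the title spans.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " main ")]'
       . '//div[@id]//span[@class="title"]';

$titles = [];
foreach ($xpath->query($query) as $span) {
    $titles[] = $span->textContent;
}
print_r($titles);
```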

How to return in php DOMXPath object?

The query() method is not found when I call it on the result of $xp->query(). Does query() return a string?
How can I make the following code work?
$xp = new \DOMXPath(@\DOMDocument::loadHTMLFile($url));
$list = $xp->query('//table[@class="table-list quality series"] tbody');
$link = $list->query('//tr[@class="item"]');
$arr_links = [];
foreach ($link as $link_in_cycle) {
    $link_quality = $link_in_cycle->query('//td[@class="column first video"]');
    $link_audio = $link_in_cycle->query('//td[@class="column audio"]');
    $link_size = $link_in_cycle->query('//td[@class="column size"]');
    $link_seed = $link_in_cycle->query('//td[@class="column seed-leech"] span[@class="seed"]');
    $link_download_url = $link_in_cycle->query('//td[@class="column last download"] a')->getAttribute("data-default");
HTML source, as requested by @nigel-ren.
This is the markup I need to grab the info from:
<tbody>
<tr class="item">
<td class="column first video">720x400</td>
<td class="column audio">mp3</td>
<td class="column size">5.70 Gb</td>
<td class="column seed-leech">
<span class="seed">15</span>
<span class="leech">26</span>
</td>
<td class="column updated">07.07.2017</td>
<td class="column consistence"></td>
<td class="column last download">
<a class="button middle rounded download zona-link"
data-type="download"
data-zona="0"
data-torrent=""
data-default="url_data"
data-not-installed=""
data-installed=""
data-metriks="{'eventType': 'click', 'data' : { 'type': 'show_download', 'id': '84358'}}"
title="text in title" href="javascript:void(0);" >Download</a> </td>

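I've made a few changes to help me debug the code. The main problem is that your XPath expressions were invalid. You can always check expressions against some sample source with a tool such as FreeFormatter. Note also that query() is a method of DOMXPath, not of the nodes it returns, so the node to search in is passed as the second argument instead.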
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);
$xp = new \DOMXPath($doc);
$list = $xp->query('//table[@class="table-list quality series"]//tr[@class="item"]');
$arr_links = [];
foreach ($list as $link_in_cycle) {
    // The leading dot keeps each query inside the current row;
    // a bare // would search the whole document again.
    $link_quality = $xp->query('.//td[@class="column first video"]/text()', $link_in_cycle)[0]->wholeText;
    $link_audio = $xp->query('.//td[@class="column audio"]/text()', $link_in_cycle)[0]->wholeText;
    $link_size = $xp->query('.//td[@class="column size"]/text()', $link_in_cycle)[0]->wholeText;
    $link_seed = $xp->query('.//td[@class="column seed-leech"]//span[@class="seed"]/text()', $link_in_cycle)[0]->wholeText;
    $link_download_url = $xp->query('.//td[@class="column last download"]//a/@data-default', $link_in_cycle)[0]->value;
    echo $link_quality.PHP_EOL;
    echo $link_audio.PHP_EOL;
    echo $link_size.PHP_EOL;
    echo $link_seed.PHP_EOL;
    echo $link_download_url.PHP_EOL;
}
Each XPath expression selects the text node inside an element. query() returns a list of the matching nodes, so [0] fetches the first one. The code assumes there isn't any whitespace around the actual content. wholeText is just the content of the DOMText node.
With the sample content you gave (plus the surrounding bits I had to invent) it gives...
720x400
mp3
5.70 Gb
15
url_data
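
When the table has more than one row, each query has to stay inside the current row. The following is a minimal sketch of that, under stated assumptions: the first row is trimmed from the question's markup, the second row is invented for illustration, and evaluate() with string() is used as an alternative to indexing the node list.

```php
<?php
// Two-row sample: the first row follows the question, the second is invented.
$html = <<<'HTML'
<table class="table-list quality series"><tbody>
<tr class="item">
<td class="column first video">720x400</td>
<td class="column size">5.70 Gb</td>
<td class="column seed-leech"><span class="seed">15</span><span class="leech">26</span></td>
<td class="column last download"><a data-default="url_data" href="#">Download</a></td>
</tr>
<tr class="item">
<td class="column first video">1280x720</td>
<td class="column size">9.10 Gb</td>
<td class="column seed-leech"><span class="seed">7</span><span class="leech">3</span></td>
<td class="column last download"><a data-default="url_data_2" href="#">Download</a></td>
</tr>
</tbody></table>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

$rows = [];
foreach ($xp->query('//table[@class="table-list quality series"]//tr[@class="item"]') as $tr) {
    // Every expression starts with "./" so it is resolved relative to $tr.
    $rows[] = [
        'quality' => $xp->evaluate('string(./td[@class="column first video"])', $tr),
        'size'    => $xp->evaluate('string(./td[@class="column size"])', $tr),
        'seed'    => (int) $xp->evaluate('string(./td[@class="column seed-leech"]/span[@class="seed"])', $tr),
        'url'     => $xp->evaluate('string(./td[@class="column last download"]/a/@data-default)', $tr),
    ];
}
print_r($rows);
```

string() returns an empty string when nothing matches, so a missing cell does not raise an error the way indexing an empty node list would.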

PHP - search for value in file and echo the whole <div>

I have an external file with a lot of information, e.g.
http://domain.com/thefile.html
Each record in the file is wrapped in a <div> element:
....
<div class="lineData">
<div class="lineLData">Playstation</div>
<div class="lineRData">awesome</div>
</div>
<div class="lineData">
<div class="lineLData">xbox one</div>
<div class="lineRData">not awesome</div>
</div>
<div class="lineData">
<div class="lineLData">wii u</div>
<div class="lineRData">mhhhh</div>
</div>
....
Now I want to search the whole file for the Keyword "Playstation" and echo the whole <div>:
<div class="lineData">
<div class="lineLData">Playstation</div>
<div class="lineRData">awesome</div>
</div>
Is this possible with PHP ?

If we assume the resource / URL is $url :
$result = array();
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents($url));
Find all <div>s with the class lineData using DOMXPath:
$xpath = new DOMXPath($dom);
$lineDatas = $xpath->query('//div[contains(@class,"lineData")]');
Add every lineData <div> containing "playstation" to the $result array:
foreach($lineDatas as $lineData) {
if (strpos(strtolower($lineData->nodeValue), 'playstation') !== false) {
$result[] = $lineData;
}
}
Example of outputting the result:
foreach($result as $lineData) {
echo $dom->saveHTML($lineData);
}
outputs
<div class="lineData">
<div class="lineLData">Playstation</div>
<div class="lineRData">awesome</div>
</div>
when tested on the example HTML in OP.

Use DOMDocument for this purpose.
$dom = new DOMDocument;
$dom->loadHTMLFile("file.html");
Now you can search for the divs:
$xpath = new DOMXPath($dom);
$res = $xpath->query("//*[contains(@class, 'lineData')]");
query() returns a DOMNodeList, so pick a DOMElement from it (here the first match) and serialize it:
$html = $dom->saveHTML($res->item(0));
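
The keyword filter can also be pushed into the XPath expression itself. The sketch below assumes the sample markup from the question held in a string, and a keyword without quote characters. XPath 1.0 has no lower-case function, so translate() is used to make the match case-insensitive for ASCII letters.

```php
<?php
// Sample markup from the question, held in a string for the sake of the example.
$source = <<<'HTML'
<div class="lineData"><div class="lineLData">Playstation</div><div class="lineRData">awesome</div></div>
<div class="lineData"><div class="lineLData">xbox one</div><div class="lineRData">not awesome</div></div>
<div class="lineData"><div class="lineLData">wii u</div><div class="lineRData">mhhhh</div></div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$keyword = 'playstation';   // assumed to be lower case and free of quote characters

// translate() lower-cases the div's text before contains() compares it.
$query = sprintf(
    '//div[@class="lineData"][contains(translate(., "%s", "%s"), "%s")]',
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    'abcdefghijklmnopqrstuvwxyz',
    $keyword
);

$matches = [];
foreach ($xpath->query($query) as $div) {
    $matches[] = $dom->saveHTML($div);   // serialize the whole matching <div>
}
echo implode(PHP_EOL, $matches), PHP_EOL;
```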

Retrieve elements with xpath and DOMDocument

I have a list of ads in the html code below.
What I need is a PHP loop to get the following elements for each ad:
ad URL (href attribute of <a> tag)
ad image URL (src attribute of <img> tag)
ad title (html content of <div class="title"> tag)
<div class="ads">
<a href="http://path/to/ad/1">
<div class="ad">
<div class="image">
<div class="wrapper">
<img src="http://path/to/ad/1/image.jpg">
</div>
</div>
<div class="detail">
<div class="title">Ad #1</div>
</div>
</div>
</a>
<a href="http://path/to/ad/2">
<div class="ad">
<div class="image">
<div class="wrapper">
<img src="http://path/to/ad/2/image.jpg">
</div>
</div>
<div class="detail">
<div class="title">Ad #2</div>
</div>
</div>
</a>
</div>
I managed to get the ad URL with the PHP code below.
$d = new DOMDocument();
$d->loadHTML($ads); // the variable $ads contains the HTML code above
$xpath = new DOMXPath($d);
$ls_ads = $xpath->query('//a');
foreach ($ls_ads as $ad) {
$ad_url = $ad->getAttribute('href');
print("AD URL : $ad_url");
}
But I didn't manage to get the 2 other elements (image url and title). Any idea?

I managed to get what I need with this code (based on Khue Vu's code) :
$d = new DOMDocument();
$d->loadHTML($ads); // the variable $ads contains the HTML code above
$xpath = new DOMXPath($d);
$ls_ads = $xpath->query('//a');
foreach ($ls_ads as $ad) {
// get ad url
$ad_url = $ad->getAttribute('href');
// set current ad object as new DOMDocument object so we can parse it
$ad_Doc = new DOMDocument();
$cloned = $ad->cloneNode(TRUE);
$ad_Doc->appendChild($ad_Doc->importNode($cloned, True));
$xpath = new DOMXPath($ad_Doc);
// get ad title
$ad_title_tag = $xpath->query("//div[@class='title']");
$ad_title = trim($ad_title_tag->item(0)->nodeValue);
// get ad image
$ad_image_tag = $xpath->query("//img/#src");
$ad_image = $ad_image_tag->item(0)->nodeValue;
}

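For the other elements, you do the same:
foreach ($ls_ads as $ad) {
    $ad_url = $ad->getAttribute('href');
    print("AD URL : $ad_url");
    // Copy the ad, including its children, into a document of its own.
    // A new DOMDocument has no documentElement yet, so append to the document itself.
    $ad_Doc = new DOMDocument();
    $ad_Doc->appendChild($ad_Doc->importNode($ad, true));
    $xpath = new DOMXPath($ad_Doc);
    $img_src = $xpath->query("//img[@src]");          // DOMNodeList of <img> elements
    $title = $xpath->query("//div[@class='title']");  // DOMNodeList of title divs
}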
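
Cloning each ad into its own document is not strictly necessary. As a minimal sketch, assuming the ad markup from the question, the same values can be read from the original document by passing each <a> node as the context of a relative query. The $result array is just an illustrative name.

```php
<?php
// Ad markup from the question, compacted onto fewer lines.
$ads = <<<'HTML'
<div class="ads">
<a href="http://path/to/ad/1"><div class="ad">
<div class="image"><div class="wrapper"><img src="http://path/to/ad/1/image.jpg"></div></div>
<div class="detail"><div class="title">Ad #1</div></div>
</div></a>
<a href="http://path/to/ad/2"><div class="ad">
<div class="image"><div class="wrapper"><img src="http://path/to/ad/2/image.jpg"></div></div>
<div class="detail"><div class="title">Ad #2</div></div>
</div></a>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$d = new DOMDocument();
$d->loadHTML($ads);
libxml_clear_errors();
$xpath = new DOMXPath($d);

$result = [];
foreach ($xpath->query('//div[@class="ads"]/a') as $ad) {
    $result[] = [
        'url'   => $ad->getAttribute('href'),
        // ".//" searches only below the current <a> element.
        'image' => $xpath->evaluate('string(.//img/@src)', $ad),
        'title' => trim($xpath->evaluate('string(.//div[@class="title"])', $ad)),
    ];
}
print_r($result);
```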

php xpath: query within a query result

I'm trying to parse an html file.
The idea is to fetch the spans with the title and desc classes from each div that has the attribute class='thebest'.
here is my code:
<?php
$example=<<<KFIR
<html>
<head>
<title>test</title>
</head>
<body>
<div class="a">moshe1
<div class="aa">haim</div>
</div>
<div class="a">moshe2</div>
<div class="b">moshe3</div>
<div class="thebest">
<span class="title">title1</span>
<span class="desc">desc1</span>
</div>
<div class="thebest">
<span class="title">title2</span>
<span class="desc">desc2</span>
</div>
</body>
</html>
KFIR;
$doc = new DOMDocument();
@$doc->loadHTML($example);
$xpath = new DOMXPath($doc);
$expression="//div[@class='thebest']";
$arts = $xpath->query($expression);
foreach ($arts as $art) {
    $arts2=$xpath->query("//span[@class='title']",$art);
    echo $arts2->item(0)->nodeValue;
    $arts2=$xpath->query("//span[@class='desc']",$art);
    echo $arts2->item(0)->nodeValue;
}
echo "done";
the expected results are:
title1desc1title2desc2done
the results that I'm receiving are:
title1desc1title1desc1done

Make the queries relative... start them with a dot (e.g. ".//…").
foreach ($arts as $art) {
    // Note: single slash (direct child)
    $titles = $xpath->query("./span[@class='title']", $art);
    if ($titles->length > 0) {
        $title = $titles->item(0)->nodeValue;
        echo $title;
    }
    $descs = $xpath->query("./span[@class='desc']", $art);
    if ($descs->length > 0) {
        $desc = $descs->item(0)->nodeValue;
        echo $desc;
    }
}
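
To see the difference side by side, here is a small sketch that runs both forms against a trimmed-down version of the question's markup. The variable names $absolute and $relative are only illustrative.

```php
<?php
// Trimmed-down version of the question's markup: two "thebest" divs.
$example = '<div class="thebest"><span class="title">title1</span><span class="desc">desc1</span></div>'
         . '<div class="thebest"><span class="title">title2</span><span class="desc">desc2</span></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($example);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$absolute = '';
$relative = '';
foreach ($xpath->query("//div[@class='thebest']") as $art) {
    // "//" starts at the document root even when a context node is passed,
    // so item(0) is always the first title in the whole document.
    $absolute .= $xpath->query("//span[@class='title']", $art)->item(0)->nodeValue;

    // ".//" starts at the context node, so each div yields its own title.
    $relative .= $xpath->query(".//span[@class='title']", $art)->item(0)->nodeValue;
}

echo $absolute, PHP_EOL;   // title1title1
echo $relative, PHP_EOL;   // title1title2
```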

Instead of doing the second query try textContent
foreach ($arts as $art) {
echo $art->textContent;
}
textContent returns the text content of this node and its descendants.
As an alternative, change the XPath to
$expression="//div[@class='thebest']/span[@class='title' or @class='desc']";
$arts = $xpath->query($expression);
foreach ($arts as $art) {
    echo $art->nodeValue;
}
That fetches the span children with a class of title or desc from the divs with the class thebest.
