I'm trying to parse an HTML file.
The idea is to find every div that has the attribute class='thebest' and, inside each one, fetch the contents of the spans with the title and desc classes.
Here is my code:
<?php
$example=<<<KFIR
<html>
<head>
<title>test</title>
</head>
<body>
<div class="a">moshe1
<div class="aa">haim</div>
</div>
<div class="a">moshe2</div>
<div class="b">moshe3</div>
<div class="thebest">
<span class="title">title1</span>
<span class="desc">desc1</span>
</div>
<div class="thebest">
<span class="title">title2</span>
<span class="desc">desc2</span>
</div>
</body>
</html>
KFIR;
$doc = new DOMDocument();
@$doc->loadHTML($example);
$xpath = new DOMXPath($doc);
$expression="//div[@class='thebest']";
$arts = $xpath->query($expression);
foreach ($arts as $art) {
$arts2=$xpath->query("//span[@class='title']",$art);
echo $arts2->item(0)->nodeValue;
$arts2=$xpath->query("//span[@class='desc']",$art);
echo $arts2->item(0)->nodeValue;
}
echo "done";
The expected results are:
title1desc1title2desc2done
The results that I'm receiving are:
title1desc1title1desc1done
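Make the queries relative: start them with a dot (e.g. ".//…"). A path that begins with // is evaluated from the document root even when you pass a context node, which is why every iteration returns the spans of the first div.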
foreach ($arts as $art) {
// Note: single slash (direct child)
$titles = $xpath->query("./span[@class='title']", $art);
if ($titles->length > 0) {
$title = $titles->item(0)->nodeValue;
echo $title;
}
$descs = $xpath->query("./span[@class='desc']", $art);
if ($descs->length > 0) {
$desc = $descs->item(0)->nodeValue;
echo $desc;
}
}
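To see the difference between the two forms in isolation, here is a minimal sketch. The two-div markup is a cut-down version of the HTML from the question, and the variable names $absolute and $relative are only illustrative.

```php
<?php
// Cut-down version of the question's markup (an assumption for brevity).
$html = '<div class="thebest"><span class="title">title1</span></div>'
      . '<div class="thebest"><span class="title">title2</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use the second div as the context node.
$second = $xpath->query("//div[@class='thebest']")->item(1);

// "//span" starts at the document root, so the context node has no effect.
$absolute = $xpath->query("//span[@class='title']", $second)->item(0)->nodeValue;

// ".//span" starts at the context node and only searches its descendants.
$relative = $xpath->query(".//span[@class='title']", $second)->item(0)->nodeValue;

echo $absolute, PHP_EOL; // title1
echo $relative, PHP_EOL; // title2
```

Both queries receive the same context node; only the leading dot decides whether it is used.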
Instead of running the second query, try textContent:
foreach ($arts as $art) {
echo $art->textContent;
}
textContent returns the text content of this node and its descendants.
As an alternative, change the XPath to
$expression="//div[@class='thebest']/span[@class='title' or @class='desc']";
$arts = $xpath->query($expression);
foreach ($arts as $art) {
echo $art->nodeValue;
}
That fetches the span children of every div with the class thebest, keeping only the spans whose class is title or desc.
I'm using Simple HTML DOM here to get data from a page's source code and then filter it to my needs.
<?php
//including script
include($config->root.'/script/vendor/simple_html_dom/simple_html_dom.php');
//getting all data
$url = "www.example.com";
$html = file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=false, $defaultBRText=DEFAULT_SPAN_TEXT , $defaultSpanText=DEFAULT_SPAN_TEXT);
//title
$title = $html->find('#question-header h1 a',0)->innertext;
//comments
foreach($html->find('#question .comment-body') as $element) {
$question_comments[] = $element->innertext;
}
// {running a lot of loops like the one above}
// {then I have the final result inside $output}
$output = ob_get_clean();
// {here I need some help adding an attribute to all anchor tags inside a particular div}
echo $output;
?>
For example, this is what I get in the $output variable:
$output = "
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<title>Example</title>
</head>
<body>
<div class='container'>
<a href='/link-1'></a>
<div class='data'>
<a href='/link-2'></a>
<a href='/link-3'></a>
<a href='/link-4'></a>
</div>
</div>
<div class='footer'>
<a href='/new-link'></a>
</div>
</body>
</html>
";
and I want to add the attribute rel='no-follow' to all anchor tags inside the container div.
Try this; I hope it helps.
Uncomment the marked lines if you want this solution to work with more deeply nested tags as well.
<?php
ini_set('display_errors', 1);
$string =<<<HTML
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<title>Example</title>
</head>
<body>
<div class='container'>
<a href='/link-1'></a>
<div class='data'>
<a href='/link-2'></a>
<a href='/link-3'></a>
<a href='/link-4'></a>
</div>
</div>
<div class='footer'>
<a href='/new-link'></a>
</div>
</body>
</html>
HTML;
$domDocument = new DOMDocument();
$domDocument->loadHTML($string);
$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query("//div[@class='container']/*");
foreach($results as $result)
{
if($result instanceof DOMElement && $result->tagName=="a")
{
$result->setAttribute("rel", "no-follow");
}
else
{
addAttribute($result);
}
}
function addAttribute($result)
{
global $domXPath;
$results = $domXPath->query("./a",$result);//change query to ./* to make it work with nested HTML.
foreach($results as $result)
{
if($result instanceof DOMElement && $result->tagName=="a")
{
$result->setAttribute("rel", "no-follow");
}
//Uncomment these lines to work with more nested HTML.
//else
//{
// addAttribute($result);
//}
}
}
echo $domDocument->saveHTML();
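If the recursion is not needed for anything else, the descendant step // can do the walking. This is only a sketch under the assumption that every anchor below the container div, at any depth, should get the attribute; the markup is a shortened copy of the example, and $tagged is a name made up for the check at the end.

```php
<?php
$html = <<<HTML
<div class='container'>
  <a href='/link-1'></a>
  <div class='data'>
    <a href='/link-2'></a>
  </div>
</div>
<div class='footer'>
  <a href='/new-link'></a>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "//a" after the div step selects anchors at any depth below the container.
$anchors = $xpath->query("//div[@class='container']//a");
foreach ($anchors as $a) {
    $a->setAttribute('rel', 'no-follow'); // value taken from the question
}

// Count the anchors that now carry the attribute.
$tagged = $xpath->query("//a[@rel='no-follow']")->length;
echo $tagged, PHP_EOL; // 2
echo $doc->saveHTML();
```

The anchor in the footer div is left untouched because it is not a descendant of the container.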
This code uses QueryPath (http://querypath.org) and DOMElement (http://php.net/manual/en/class.domelement.php).
What I would like to do is find every tag listed in $types within $html and save the matches to an array in the order in which they appear in the HTML,
e.g.
.imageBlock,h1,h2,p,h5,h1,h2,p,.imageBlock,h4
This is the order the final array must have. How would I do this?
$html = <<<html
<section>
<div class=" imageBlock" style="background: url('1.jpg');"></div>
<div>
<h1>H1 .1</h1>
<h2>H2. 1</h2>
<p>p .1</p>
</div>
<h5>Here is an H5</h5>
<div>
<h1>H1 .2</h1>
<h2>H2. 2</h2>
<p>p .2</p>
</div>
<div class="imageBlock" style="background: url('2.jpg');"></div>
<h4>Test</h4>
</section>
html;
$types = ['h1','h2','h3','h4','h5','h6','p', '.imageBlock'];
foreach($types as $type) {
//Create a variable for each tag
${"$type"} = htmlqp($html, $type);
}
foreach($types as $type) {
$types = (${"$type"}); //Easier to use variable
$nodes = []; //Array to hold elements in order of appearance; not sure if this is the correct place?
foreach($types as $tag) {
if($types->length !== 0) {
//Show all elements
for($i = 0 ; $i< $types->length; $i++) {
echo $type;
var_dump($types->get()[$i]);
}
}
}
}
?>
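One possible approach, sketched with DOMDocument and DOMXPath instead of QueryPath (a swapped-in technique, not what the code above uses): match all wanted elements with a single XPath expression, so the result comes back in document order. The markup is a shortened version of the sample, and $hasClass, $query and $order are names made up for this sketch.

```php
<?php
$html = <<<HTML
<section>
  <div class=" imageBlock" style="background: url('1.jpg');"></div>
  <div>
    <h1>H1 .1</h1>
    <h2>H2. 1</h2>
    <p>p .1</p>
  </div>
  <h5>Here is an H5</h5>
  <div class="imageBlock" style="background: url('2.jpg');"></div>
  <h4>Test</h4>
</section>
HTML;

// <section> can trigger warnings in libxml's HTML parser, so collect them quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Class test that tolerates extra whitespace and additional class names.
$hasClass = "contains(concat(' ', normalize-space(@class), ' '), ' imageBlock ')";

// One predicate covering every wanted tag plus the imageBlock class.
$query = "//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5"
       . " or self::h6 or self::p or {$hasClass}]";

$order = [];
foreach ($xpath->query($query) as $node) {
    $isImageBlock = $xpath->evaluate("boolean(self::*[{$hasClass}])", $node);
    $order[] = $isImageBlock ? '.imageBlock' : $node->nodeName;
}

echo implode(',', $order), PHP_EOL;
// .imageBlock,h1,h2,p,h5,.imageBlock,h4
```

Storing $node itself instead of a label would give an array of DOMElement objects in the same order.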
I have an external file with a lot of information, e.g.
http://domain.com/thefile.html
Each record in the file is wrapped in a <div> element:
....
<div class="lineData">
<div class="lineLData">Playstation</div>
<div class="lineRData">awesome</div>
</div>
<div class="lineData">
<div class="lineLData">xbox one</div>
<div class="lineRData">not awesome</div>
</div>
<div class="lineData">
<div class="lineLData">wii u</div>
<div class="lineRData">mhhhh</div>
</div>
....
Now I want to search the whole file for the keyword "Playstation" and echo the whole enclosing <div>:
<div class="lineData">
<div class="lineLData">Playstation</div>
<div class="lineRData">awesome</div>
</div>
Is this possible with PHP?
Assuming the resource URL is in $url:
$result = array();
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents($url));
Find all <div>s with the class lineData using DOMXPath:
$xpath = new DOMXPath($dom);
$lineDatas = $xpath->query('//div[contains(@class,"lineData")]');
Add all lineData <div>s containing "playstation" to the $result array:
foreach($lineDatas as $lineData) {
if (strpos(strtolower($lineData->nodeValue), 'playstation') !== false) {
$result[] = $lineData;
}
}
Example of outputting the result:
foreach($result as $lineData) {
echo $dom->saveHTML($lineData);
}
This outputs
<div class="lineData">
<div class="lineLData">Playstation</div>
<div class="lineRData">awesome</div>
</div>
when tested on the example HTML from the question.
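The keyword filter can also be pushed into the XPath expression itself. A rough sketch, assuming the keyword is plain ASCII, already lower case, and contains no double quotes; the inline markup stands in for the remote file, and $upper, $lower, $keyword and $matches are illustrative names.

```php
<?php
$html = <<<HTML
<div class="lineData">
  <div class="lineLData">Playstation</div>
  <div class="lineRData">awesome</div>
</div>
<div class="lineData">
  <div class="lineLData">xbox one</div>
  <div class="lineRData">not awesome</div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath 1.0 has no lower-case(); translate() maps A-Z to a-z instead.
$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$lower = 'abcdefghijklmnopqrstuvwxyz';
$keyword = 'playstation'; // assumed to be lower case and free of double quotes

$query = sprintf(
    '//div[contains(@class, "lineData")][contains(translate(., "%s", "%s"), "%s")]',
    $upper,
    $lower,
    $keyword
);

$matches = $xpath->query($query);
foreach ($matches as $div) {
    echo $dom->saveHTML($div), PHP_EOL;
}
```

This keeps the case-insensitive comparison of the loop version but needs no intermediate $result array.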
Use DOMDocument for this purpose.
$dom = new DOMDocument;
$dom->loadHTMLFile("file.html");
Now you can search for the div:
$xpath = new DOMXPath($dom);
$res = $xpath->query("//*[contains(@class, 'lineData')]");
Now you have the matching divs as a DOMNodeList of DOMElement objects. Saving one of them should be possible with these few lines:
$div = $res->item(0);
$html = $div->ownerDocument->saveHTML($div);
I'm using simple_html_dom to get elements from HTML pages.
I have some div elements like the one below. All I want is to get the "Fine Thanks" sentence in each div (the text that is not inside any sub-element).
How can I do it?
<div class="right">
<h2>
Hello
</h2>
<br/>
<span>How Are You?</span>
<span>How Are You?</span>
<span>How Are You?</span>
Fine Thanks
</div>
It should be simply $html->find('div.right > text'), but that won't work because Simple HTML DOM Parser doesn't seem to support direct descendant queries.
So you'd have to find all <div> elements first and search the child nodes for a text node. Unfortunately, the ->childNodes() method is mapped to ->children() and thus only returns elements.
A working solution is to call ->find('text') on each <div> element, after which you filter the results based on the parent node.
foreach ($doc->find('div.right') as $parent) {
foreach ($parent->find('text') as $node) {
if ($node->parent() === $parent && strlen($t = trim($node->plaintext))) {
echo $t, PHP_EOL;
}
}
}
Using DOMDocument, this XPath expression will do the same work without the pain:
$doc = new DOMDocument;
$doc->loadHTML($content);
$xp = new DOMXPath($doc);
foreach ($xp->query('//div/text()') as $node) {
if (strlen($t = trim($node->textContent))) {
echo $t, PHP_EOL;
}
}
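The expression can also be narrowed so that XPath does the filtering. A small sketch, assuming only div elements whose class attribute is exactly "right" are of interest; the markup is condensed from the question, with an extra div.left added to show the restriction.

```php
<?php
$content = '<div class="right"><h2>Hello</h2><br/>'
         . '<span>How Are You?</span> Fine Thanks </div>'
         . '<div class="left">Other text</div>';

$doc = new DOMDocument;
$doc->loadHTML($content);
$xp = new DOMXPath($doc);

// text()[normalize-space()] keeps only text nodes that are not pure whitespace.
$texts = [];
foreach ($xp->query('//div[@class="right"]/text()[normalize-space()]') as $node) {
    $texts[] = trim($node->textContent);
}

print_r($texts); // one entry: "Fine Thanks"
```

The strlen() check from the loop above is no longer needed, because whitespace-only nodes never reach PHP.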
There is no built-in method in simple_html_dom.php for reading only an element's own text,
but this should work:
include 'parser.php';
$html = str_get_html('<div class="right">
<h2>
Hello
</h2>
<br/>
<span>How Are You?</span>
<span>How Are You?</span>
<span>How Are You?</span>
Fine Thanks
</div>');
function readTextNode($element){
$local = $element;
$childs = count($element->childNodes());
for($i = 0; $i < $childs; $i++)
$local->childNodes($i)->outertext = '';
return $local->innertext;
}
echo readTextNode($html->find('div.right',0));
I would switch to phpQuery for this one. You still need to use the DOM, but it's not too painful:
require('phpQuery.php');
$html =<<<EOF
<div class="right">
<h2>
Hello
</h2>
<br/>
<span>How Are You?</span>
<span>How Are You?</span>
<span>How Are You?</span>
Fine Thanks
</div>
EOF;
$dom = phpQuery::newDocumentHTML($html);
foreach($dom->find("div.right > *:last") as $last_element){
echo $last_element->nextSibling->nodeValue;
}
Update
These days I'm recommending this simple replacement which does let you avoid the dom ugliness:
$doc = str_get_html($html);
foreach($doc->find('div.right > text:last') as $el){
echo $el->text;
}
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
Use this function to remove the h2 and span elements from the div, then read the div element's remaining content.
Reference URL : Simple HTML Dom: How to remove elements?
I have an HTML block here:
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table>
<tbody>
<tr>
<td class="price-status">
<div class="status">
<span class="available">Yes</span>
</div>
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
How do I parse the link (http://test.com/asus_rt-n53/p195257/), title (Asus RT-N53) and price (758)?
Curl code here:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$models = $xpath->query('//div[@class="title"]/a');
foreach ($models as $model) {
echo $model->nodeValue;
$prices = $xpath->query('//div[@class="uah"]');
foreach ($prices as $price) {
echo $price->nodeValue;
}
}
One ugly solution is to cast the price result to keep only numbers:
echo (int) $price->nodeValue;
Or, you can query for the span inside the current div and remove it from the price (inside the prices foreach):
$span = $xpath->query('./span', $price)->item(0);
$price->removeChild($span);
echo $price->nodeValue;
Edit:
To retrieve the link, simply use getAttribute() and get the href one:
$model->getAttribute('href')
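Putting the pieces together, here is one way the three values could be collected per product. It is only a sketch: it assumes each price block is the first div with class uah that follows its title link in document order, and the $products array is a name made up for this example.

```php
<?php
// Nowdoc, so the "$" in the markup is not treated as a variable.
$content = <<<'HTML'
<div class="title">
  <a href="http://test.com/asus_rt-n53/p195257/">
    Asus RT-N53
  </a>
</div>
<table><tbody><tr>
  <td class="price-status">
    <div class="price">
      <div class="uah">758<span> ua.</span></div>
      <div class="usd">$ 62</div>
    </div>
  </td>
</tr></tbody></table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$products = [];
foreach ($xpath->query('//div[@class="title"]/a') as $link) {
    // Assumption: the matching price is the first div.uah after this link.
    // text()[1] takes the div's own first text node and skips the <span>.
    $priceText = $xpath->evaluate(
        'string(following::div[@class="uah"][1]/text()[1])',
        $link
    );
    $products[] = [
        'link'  => $link->getAttribute('href'),
        'title' => trim($link->nodeValue),
        'price' => (int) $priceText,
    ];
}

print_r($products);
```

Because the price lookup is relative to each link, the same loop should keep titles and prices paired when the page lists several products in this layout.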