I'm using simple_html_dom.php.
For example, I'm using this code to get one span inside the div, but I want to get all three spans it contains without writing a separate foreach for each span.
for(){
foreach($html->find('span.street-address', $i) as $e){
$list[$i] = $e->plaintext;
echo $list[$i];
}
}
Another thing: the div I want to get the information from in the HTML file is:
<div class="address adr">
<span class="street-address">
<span class="no_ds">...</span>
<span class="postal-code">...</span>
<span class="locality"></span>
</span>
</div>
I want to get everything within that div.
There is also a phone div that is different:
<div class="phone tel">
<span class="no_ds"></span>
</div>
As you can see, the span class "no_ds" has the same name in both divs. Will that have any effect on my code? And how do I write the space in "address adr" and "phone tel" in the code? With a period?
getElementsByTagName is the answer here:
// gets all SPANs
$spans = $dochtml->getElementsByTagName('span');
// traverse the object with all SPANs
foreach($spans as $span) {
// gets and outputs the ID and content of each SPAN
$id = $span->getAttribute('id');
$cnt = $span->nodeValue;
echo $id. ' - '. $cnt. '<br/>';
}
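On the question about the space: class="address adr" means the div carries two separate classes, "address" and "adr". The following is only a sketch using PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom; the sample values (street, postal code, town, phone) are made up for illustration. It collects all three inner spans in one loop, and because the query is limited to the address div, the "no_ds" span in the phone div is not picked up.

```php
<?php
// Sample markup modelled on the question; the text values are hypothetical.
$html = <<<'HTML'
<div class="address adr">
  <span class="street-address">
    <span class="no_ds">12 Example Road</span>
    <span class="postal-code">00000</span>
    <span class="locality">Sampletown</span>
  </span>
</div>
<div class="phone tel">
  <span class="no_ds">555-0100</span>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match any div whose class list contains the token "address",
// then take every span below it that has no span children (the leaf spans).
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' address ')]//span[not(span)]";

$list = [];
foreach ($xpath->query($query) as $span) {
    // Key each value by the span's own class name.
    $list[$span->getAttribute('class')] = trim($span->nodeValue);
}

print_r($list);
```

Keying the result by class name is one possible design choice; a plain numeric array would work just as well if the order of the spans is fixed.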
I have a piece of HTML code that contains sub divs that have information I need to extract. I'm able to query for the parent div or specifically one of the child divs, but I can't seem to do it all at the same time.
For every instance of the <div class="box"> ... </div>
I need to extract the text from the second element ("test text" in this case)
I need to extract the text from the <div class="pBox"> ($230 in this case)
Note: the div id "adx_113_x_shorttext" is generally random for each instance but starts with "adx_".
Here is a sample of the HTML:
<div class="box">
<div id="adx_113_x_shorttext" class="shorttext">test text</div>
<div class="btnLst"><span id="vs_table_IRZ_x_MI52_x_c010_MI52" class="info">
<flag class="tooltip-show-condition"></flag></span></div>
<div class="pBox">
<div>$230</div>
</div>
</div>
I've tried the following PHP code but $acc_count isn't aligning well and I'm fairly certain this is not very efficient or the correct way:
$acc_count = 0;
$accessories = $xpath->query("//div[contains(@class,'shorttext')]");
$price = $xpath->query("//div[contains(@class,'pBox')]");
foreach ($accessories as $node) {
echo $node->nodeValue . " | " . $price[$acc_count]->nodeValue . "\n";
$acc_count++;
}
Can someone show me the correct way to query the div box class and its sub divs?
If I understand you correctly, this should get you there:
$accessories = $xpath->query("//div[contains(@class,'shorttext')]/text()");
$price = $xpath->query("//div[contains(@class,'pBox')]/div/text()");
$acc_count = 0;
foreach ($accessories as $node) {
echo $node->nodeValue . " | " . $price[$acc_count]->nodeValue . "\n";
$acc_count++;
};
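An alternative that avoids keeping two result lists aligned by a counter is to loop over each box and query inside it. This is a sketch under assumptions: the second box, its id "adx_999_x_shorttext" and its values are invented to show more than one instance, and the markup is trimmed to the parts that matter.

```php
<?php
// Two boxes; the second one is hypothetical sample data.
$html = <<<'HTML'
<div class="box">
  <div id="adx_113_x_shorttext" class="shorttext">test text</div>
  <div class="btnLst"><span class="info"></span></div>
  <div class="pBox">
    <div>$230</div>
  </div>
</div>
<div class="box">
  <div id="adx_999_x_shorttext" class="shorttext">other text</div>
  <div class="pBox">
    <div>$45</div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query("//div[@class='box']") as $box) {
    // Both expressions start with "./", so they are resolved relative to $box.
    $text  = $xpath->evaluate("string(./div[starts-with(@id, 'adx_')])", $box);
    $price = $xpath->evaluate("string(./div[@class='pBox']/div)", $box);
    $rows[] = trim($text) . ' | ' . trim($price);
}

echo implode("\n", $rows), "\n";
```

If a box has no price, the string() call yields an empty string for that box only, so the remaining rows stay paired correctly.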
I'm parsing an html page with php DOMXPath and I'm trying to get the nodeValue from class label corresponding to the nodeValue in class info.
<h3>
<div class="metadata">
<span class="label">Another Label</span>
<span class="info">
Link Name
</span>
</div>
</h3>
<h3>
<div class="metadata">
<span class="label">Some Label</span>
<span class="info">
Link Name,
Another Link Name,
Yet Another Link Name
</span>
</div>
</h3>
I'm accessing the content with:
$labels = $xpathLabel->query("//h3/div/span[@class='label']");
$infos = $xpathInfo->query("//h3/div/span[@class='info']/a");
and outputting it with:
foreach ($labels as $label) {
print "{$label->nodeValue}\n";
foreach($infos as $info){
print "\t{$info->nodeValue}\n";
}
}
Which outputs:
Another Label
Link Name
Link Name
Another Link Name
Yet Another Link Name
Some Label
Link Name
Link Name
Another Link Name
Yet Another Link Name
It makes sense why this is happening: the queries are independent, so one returns all content from class label and the other returns all content from class info.
Is there a better way to make the query or any better way to output the content that would solve the issue?
You need to use the outer metadata divs as the anchor for your loop, then list out the labels and info links within just that element:
$metadata = $xpathLabel->query("//h3/div[@class='metadata']");
foreach ($metadata as $group) {
$labels = $xpathLabel->query("./span[@class='label']", $group);
foreach ($labels as $label) {
print "{$label->nodeValue}\n";
}
$infos = $xpathLabel->query("./span[@class='info']/a", $group);
foreach($infos as $info){
print "\t{$info->nodeValue}\n";
}
}
The <div> elements are passed as the $contextnode argument to DOMXPath::query, so each query only searches the children of the current element.
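One detail worth illustrating is that the context node only takes effect when the expression is relative. A small sketch, with the markup cut down to two metadata groups: a query starting with "//" still searches the whole document even when a context node is supplied, while "./" or ".//" stays inside it.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="metadata"><span class="label">Another Label</span></div>' .
    '<div class="metadata"><span class="label">Some Label</span></div>'
);
$xpath = new DOMXPath($doc);

// Use the first metadata div as the context node.
$first = $xpath->query("//div[@class='metadata']")->item(0);

// Relative expression: only spans inside $first are matched.
$relative = $xpath->query(".//span", $first);

// Absolute expression: the context node is ignored, every span matches.
$absolute = $xpath->query("//span", $first);

echo $relative->length, " vs ", $absolute->length, "\n";
```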
See https://eval.in/955491 for a full example
Let's start with this html in my database table:
<section id="love">
<h2 class="h2Article">III. Love</h2>
<div class="divArticle">
This is what the display looks like after I run it through a DOM script:
<section id="love"><h2 class="h2Article" id="a3" data-toggle="collapse" data-target="#b3">III. Love</h2>
<div class="divArticle collapse in article" id="b3">
And this is what I would like it to look like:
<section id="love"><h2 class="h2Article" id="a3" data-toggle="collapse" data-target="#b3">
<span class="Article label label-primary">
<i class="only-collapsed fa fa-chevron-down"></i>
<i class="only-expanded fa fa-remove"></i> III. Love</span></h2>
<div class="divArticle collapse in article" id="b3">
In other words, DOM has given it the necessary function, correctly numbering each id sequentially. All that's missing is the styling:
<span class="Article"><span class="label label-primary"><i class="only-collapsed fa fa-chevron-down"></i><i class="only-expanded fa fa-remove"> </i> III. Love</span></span>
Can anyone tell me how to add that styling? The titles will change, of course (e.g. III. Love, IV. Hate, etc.). I posted my DOM script below:
$i = 1; // initialize counter
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup
$sections = $dom->getElementsByTagName('section'); // get all section tags
foreach($sections as $section) { // for each section tag
// get each h2 inside the section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'h2Article') { // if this h2 has class h2Article
$h2->setAttribute('id', 'a' . $i); // set id for the h2 tag
$h2->setAttribute('data-target', '#b' . $i); // point the h2 at its div
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'divArticle') { // if this div has class divArticle
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
if($div->getAttribute('class') == 'divClose') { // if this div has class divClose
$div->setAttribute('data-target', '#b' . $i); // set data-target for div tag
}
}
$i++; // increment counter
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
$Content = str_replace('data-target', 'data-toggle="collapse" data-target', $Content);
$Content = str_replace('<div class="divArticle', '<div class="divArticle collapse in article', $Content);
Since in this case a DOM Document object is being used, the createElement function can be used to add HTML.
See http://php.net/manual/en/domdocument.createelement.php
Borrowing from the documentation on the linked page:
<?php
$dom = new DOMDocument('1.0', 'utf-8');
$element = $dom->createElement('test', 'This is the root element!');
// We insert the new element as root (child of the document)
$dom->appendChild($element);
echo $dom->saveXML();
?>
will output
<?xml version="1.0" encoding="utf-8"?>
<test>This is the root element!</test>
Without the DOM object, you would normally add HTML from PHP in one of the following ways.
1.
echo "<div>this method is often used for shorter pieces of HTML</div>";
2.
?> <div> You can also escape out of HTML and then "turn" PHP back on like this </div> <?php
The first method uses the echo command to output a string of HTML. The second method uses the ?> closing tag to tell PHP to treat everything that follows as HTML until it reaches another opening <?php tag.
So normally in a PHP file you can add HTML like so.
?>
<span class="Article">
<span class="label label-primary">
<i class="only-collapsed fa fa-chevron-down"></i>
<i class="only-expanded fa fa-remove"></i>
III. Love
</span>
</span>
<?php
But since in this case we're trying to edit content coming from inside of the database we're not able to do this.
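Applied to this case, createElement could be used to build the span and the two icons and then move the existing heading text into the span. What follows is a rough sketch, not the asker's script: it reuses the $Content variable name and the class names from the question, the sample body text is a placeholder, and the numbering of ids is left out.

```php
<?php
// Sample content as it might come from the database (body text is a placeholder).
$Content = '<section id="love"><h2 class="h2Article">III. Love</h2>'
         . '<div class="divArticle">Body text</div></section>';

libxml_use_internal_errors(true); // <section> is unknown to the older HTML parser
$dom = new DOMDocument();
$dom->loadHTML($Content);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('h2') as $h2) {
    if ($h2->getAttribute('class') !== 'h2Article') {
        continue;
    }

    // Build <span class="Article label label-primary">.
    $span = $dom->createElement('span');
    $span->setAttribute('class', 'Article label label-primary');

    // Add the two icon elements first so they precede the title.
    foreach (['only-collapsed fa fa-chevron-down', 'only-expanded fa fa-remove'] as $iconClass) {
        $icon = $dom->createElement('i');
        $icon->setAttribute('class', $iconClass);
        $span->appendChild($icon);
    }

    // Move whatever the h2 currently contains (the title text) into the span.
    while ($h2->firstChild) {
        $span->appendChild($h2->firstChild);
    }

    // Put the span back as the h2's only child.
    $h2->appendChild($span);

    echo $dom->saveHTML($h2), "\n";
}
```

Because appendChild moves a node that is already in the document, the title does not need to be copied or hard-coded, so the same loop handles "III. Love", "IV. Hate" and so on.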
Well, I guess the obvious solution is to just wrap the title in something that can be modified with a simple str_replace...
<h2><span class="Answer">III. Love</span></h2>
Or even this...
<h2>[]III. Love[]</h2>
Kind of Mickey Mouse, but it gets the job done. I just dislike having to write out or paste all of that code into every heading in every article. I prefer to automate it as much as possible.
I am using Simple HTML DOM to scrape a website. The problem I have run into is that there is text positioned outside of any specific element. The only element it seems to be inside is <div id="content">.
<div id="content">
<div class="image-wrap"></div>
<div class="gallery-container"></div>
<h3 class="name">Here is the Heading</h3>
All the text I want is located here !!!
<p> </p>
<div class="snapshot"></div>
</div>
I guess the webmaster has messed up and the text should actually be inside the <p> tags.
I've tried using this code below, however it just won't retrieve the text:
$t = $scrape->find("div#content text",0);
if ($t != null){
$text = trim($t->plaintext);
}
I'm still a newbie and still learning. Can anyone help at all?
You're almost there... Use a test loop to display the content of your nodes and locate the index of the wanted text. For example:
// Find all texts
$texts = $html->find('div#content text');
foreach ($texts as $key => $txt) {
// Display text and the parent's tag name
echo "<br/>TEXT $key is ", $txt->plaintext, " -- in TAG ", $txt->parent()->tag ;
}
You'll find that you should use index 4 instead of 0:
$scrape->find("div#content text",4);
And if your text doesn't always have the same index, but you know for example that it follows the h3 heading, then you could use something like:
foreach ($texts as $key => $txt) {
// Locate the h3 heading
if ($txt->parent()->tag == 'h3') {
// Grab the next index content from $texts
echo $texts[$key+1]->plaintext;
// Stop
break;
}
}
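For comparison, here is a sketch of the same extraction with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, not what the answer above uses). The XPath step text() selects text nodes that are direct children of the div, and the normalize-space() predicate drops the whitespace-only ones, so no index has to be guessed.

```php
<?php
$html = <<<'HTML'
<div id="content">
  <div class="image-wrap"></div>
  <div class="gallery-container"></div>
  <h3 class="name">Here is the Heading</h3>
  All the text I want is located here !!!
  <p> </p>
  <div class="snapshot"></div>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text nodes directly under div#content that contain more than whitespace.
$nodes = $xpath->query("//div[@id='content']/text()[normalize-space()]");

// Guard against pages where no such text exists.
$text = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : '';

echo $text, "\n";
```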
I'm trying to pull some data from my website. It is pretty simple, but I can't find any good examples/docs, so I am having a tough time. I'm trying to make an API for my friends to use my blog, but it's a bit difficult. Let's assume I have a website at http://www.sample.com, and the html source for that website is:
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
I want to get all of .container's children, the first a child's href value, the text value of the class title, author, the img src for the child inside .thumb, and the text value for category.
I started with the a tag's href value, but I didn't even get that far. I thought $title would contain the href value of the first anchor tag inside the container, but it doesn't work.
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) {
$class = $div->getAttribute('class');
if(strpos($class, 'container') !== FALSE) {
// title doesn't retrieve the href value of the first anchor :(
$title = 'TITLE'.$div->getElementsByTagName('a')->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}
Can anyone explain why please?
The culprit is $div->getElementsByTagName('a')->getAttribute('href'). The first part, $div->getElementsByTagName('a') retrieves a list of elements, not a single element. So the following ->getAttribute('href') will not do the right thing.
To fix this, iterate just as you do with the div-tags:
foreach($div->getElementsByTagName('a') as $a) {
$href = $a->getAttribute('href');
if ($href) echo "TITLE$href<br>";
}
OK, so first,
$div->getElementsByTagName('a')
returns a DOMNodeList (http://php.net/manual/en/class.domnodelist.php) object. You need to get the first item from it before you can read the attribute.
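A minimal sketch of that first point, using DOMNodeList::item() to pick the first anchor. The markup is reduced to a container with two links, and the second link's href "/other/" is made up.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="container">' .
    '<a href="/mywebsiteblogpost/">first</a>' .
    '<a class="more" href="/other/">second</a>' .
    '</div>'
);

$div   = $doc->getElementsByTagName('div')->item(0);
$links = $div->getElementsByTagName('a'); // a DOMNodeList, not a single element

// item(0) returns the first node, or null when the list is empty.
$first = $links->item(0);
$href  = $first !== null ? $first->getAttribute('href') : null;

echo $href, "\n";
```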
Second
$div->textContent
Does that work as intended, i.e. show all the text content in the $div?
You may be better off looking at XPath queries (http://php.net/manual/en/class.domxpath.php) for this type of DOM searching.
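As a rough idea of what the XPath route could look like for the fields listed in the question, here is a sketch that parses the sample markup from a string instead of fetching http://www.sample.com. The $post array and its keys are illustrative names, not part of any API.

```php
<?php
$html = <<<'HTML'
<div class="container">
  <a href="/mywebsiteblogpost/">
    <h2 class="title">im the best</h2>
  </a>
  <span class="author">Josue Espinosa</span>
  <div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
    <span class="category">sports</span>
  </div>
  <p>preview text</p>
  <a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First div whose class list contains the token "container".
$container = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' container ')]"
)->item(0);

// Each expression is relative to $container; string() and normalize-space()
// return the value of the first matching node in document order.
$post = [
    'href'     => $xpath->evaluate("string(.//a/@href)", $container),
    'title'    => $xpath->evaluate("normalize-space(.//*[@class='title'])", $container),
    'author'   => $xpath->evaluate("normalize-space(.//span[@class='author'])", $container),
    'image'    => $xpath->evaluate("string(.//div[@class='thumb']//img/@src)", $container),
    'category' => $xpath->evaluate("normalize-space(.//span[@class='category'])", $container),
];

print_r($post);
```

For a page with several containers, the same expressions could be run inside a foreach over the query result instead of taking item(0).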
I made some corrections to the PHP code you posted that doesn't work; maybe it can help you keep going:
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div)
{
$class = $div->getAttribute('class');
// _($class);
if(strpos($class, 'container') !== FALSE)
{
// getElementsByTagName returns a list, so take its first <a> element
$a = $div->getElementsByTagName('a');
foreach ($a as $key => $value)
{
$A = $value;
break;
}
$title = 'TITLE'. $A->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}