Let's start with this HTML in my database table:
<section id="love">
<h2 class="h2Article">III. Love</h2>
<div class="divArticle">
This is what the display looks like after I run it through a DOM script:
<section id="love"><h2 class="h2Article" id="a3" data-toggle="collapse" data-target="#b3">III. Love</h2>
<div class="divArticle collapse in article" id="b3">
And this is what I would like it to look like:
<section id="love"><h2 class="h2Article" id="a3" data-toggle="collapse" data-target="#b3">
<span class="Article label label-primary">
<i class="only-collapsed fa fa-chevron-down"></i>
<i class="only-expanded fa fa-remove"></i> III. Love</span></h2>
<div class="divArticle collapse in article" id="b3">
In other words, the DOM script has given it the necessary function, correctly numbering each id sequentially. All that's missing is the styling:
<span class="Article"><span class="label label-primary"><i class="only-collapsed fa fa-chevron-down"></i><i class="only-expanded fa fa-remove"></i> III. Love</span></span>
Can anyone tell me how to add that styling? The titles will change, of course (e.g. III. Love, IV. Hate, etc.). I posted my DOM script below:
$i = 1; // initialize counter
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup
$sections = $dom->getElementsByTagName('section'); // get all section tags
foreach($sections as $section) { // for each section tag
// get each h2 inside the section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'h2Article') { // if this h2 has class h2Article
$h2->setAttribute('id', 'a' . $i); // set id for h2 tag
$h2->setAttribute('data-target', '#b' . $i);
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'divArticle') { // if this div has class divArticle
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
if($div->getAttribute('class') == 'divClose') { // if this div has class divClose
$div->setAttribute('data-target', '#b' . $i); // set data-target for div tag
}
}
$i++; // increment counter
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
$Content = str_replace('data-target', 'data-toggle="collapse" data-target', $Content);
$Content = str_replace('<div class="divArticle', '<div class="divArticle collapse in article', $Content);
Since in this case a DOM Document object is being used, the createElement function can be used to add HTML.
See http://php.net/manual/en/domdocument.createelement.php
Borrowing the example from that documentation page:
<?php
$dom = new DOMDocument('1.0', 'utf-8');
$element = $dom->createElement('test', 'This is the root element!');
// We insert the new element as root (child of the document)
$dom->appendChild($element);
echo $dom->saveXML();
?>
will output
<?xml version="1.0" encoding="utf-8"?>
<test>This is the root element!</test>
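Applied to the markup in the question, the same createElement approach might look like the sketch below. It is only an illustration: the sample $Content string and the $result variable are made up here, and the span and icon classes are copied from the desired output in the question.

```php
<?php
// Hypothetical sample input standing in for $Content from the database.
$Content = '<section id="love"><h2 class="h2Article">III. Love</h2>'
         . '<div class="divArticle">Text</div></section>';

$dom = new DOMDocument;
@$dom->loadHTML($Content); // @ suppresses parser warnings, e.g. about <section>

foreach ($dom->getElementsByTagName('h2') as $h2) {
    if ($h2->getAttribute('class') !== 'h2Article') {
        continue; // only touch the article headings
    }

    // Build <span class="Article label label-primary"> with the two icons.
    $span = $dom->createElement('span');
    $span->setAttribute('class', 'Article label label-primary');
    foreach (['only-collapsed fa fa-chevron-down', 'only-expanded fa fa-remove'] as $class) {
        $icon = $dom->createElement('i');
        $icon->setAttribute('class', $class);
        $span->appendChild($icon);
    }
    $span->appendChild($dom->createTextNode(' '));

    // Move the existing title nodes into the span, then put the span into the h2.
    while ($h2->firstChild) {
        $span->appendChild($h2->firstChild);
    }
    $h2->appendChild($span);
}

// Back to a string again, same as in the question's script.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```

Because the title text is moved rather than retyped, this should work for any heading (III. Love, IV. Hate, etc.) without touching the stored articles.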
Without the DOM object, you would normally add HTML from PHP in one of the following ways.
1.
echo "<div>this method is often used for shorter pieces of HTML</div>";
2.
?> <div> You can also escape out of PHP and then "turn" PHP back on like this </div> <?php
The first method uses the echo command to output a string of HTML. The second method uses the closing ?> tag to tell the PHP interpreter to treat everything that follows as HTML until it reaches another opening <?php tag.
So normally in a PHP file you can add HTML like so.
?>
<span class="Article">
<span class="label label-primary">
<i class="only-collapsed fa fa-chevron-down"></i>
<i class="only-expanded fa fa-remove"></i>
III. Love
</span>
</span>
<?php
But since in this case we're trying to edit content that comes out of the database, we're not able to do this.
Well, I guess the obvious solution is to just wrap the title in something that can be modified with a simple str_replace...
<h2><span class="Answer">III. Love</span></h2>
Or even this...
<h2>[]III. Love[]</h2>
Kind of Mickey Mouse, but it gets the job done. I just hate having to write out or paste all of that code into every heading in every article. I prefer to automate it as much as possible.
Related
I'm trying to learn how to curl/scrape and echo text with PHP. So far I've learned how to do it with tags and unique divs. For example, the code below successfully scrapes and echoes text using the div with class="market":
<?php
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
However, I'd like to get more precise. In the situation below, the a class and the various div tags are used many times throughout that website, and the only unique aspect of the markup is the title, which in this case is "10-year yield". Is it possible to adjust my existing PHP code to scrape using a title identifier? Otherwise, I'm not sure how to grab something like this without grabbing everything else that uses similar tags. Thank you for any thoughts! (In the case below I'm trying to echo the "2.20%".)
<!-- BEGIN: Quote -->
<li class="row">
<a class="quote" href="/data/bonds/index.html">
<span class="column quote-name" title="10-year yield">10-year
yield</span>
<span class="column quote-col"><span class="pre-currency-symbol">
</span><span stream="last_572094" class="quote-dollar" title="10-year
yield">2.20</span><span class="post-currency-symbol">%</span></span>
<span stream="changePct_572094" class="column quote-change"><span
class="posData">+0.00</span></span>
</a>
</li>
<!-- END: Quote -->
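One way to do this, as a rough sketch, is to put the title into the XPath predicate next to the class. The inline $html string below is a trimmed, hypothetical copy of the quote markup so the example runs without fetching the page, and $yield is a name made up for illustration.

```php
<?php
// Hypothetical inline copy of the markup from the question (no network access needed).
$html = <<<'HTML'
<ul>
<li class="row">
<a class="quote" href="/data/bonds/index.html">
<span class="column quote-name" title="10-year yield">10-year yield</span>
<span class="column quote-col"><span class="pre-currency-symbol"></span><span stream="last_572094" class="quote-dollar" title="10-year yield">2.20</span><span class="post-currency-symbol">%</span></span>
</a>
</li>
</ul>
HTML;

$doc = new DOMDocument;
@$doc->loadHTML($html); // @ hides parser warnings on sloppy markup
$xpath = new DOMXPath($doc);

// Select the value span by class AND title. normalize-space() tolerates
// line breaks or extra spaces inside the title attribute.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' quote-dollar ')"
       . " and normalize-space(@title) = '10-year yield']";
$node = $xpath->query($query)->item(0);

$yield = null;
if ($node !== null) {
    // The percent sign lives in the sibling span with class post-currency-symbol.
    $unit = $xpath->evaluate("string(following-sibling::span[@class='post-currency-symbol'])", $node);
    $yield = trim($node->textContent) . $unit;
    echo $yield, PHP_EOL;
}
```

Against the live page the same query would be used after loadHTMLFile(), assuming the markup there still matches the snippet in the question.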
I have a problem regarding HTML webscraping.
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
Risk Professionals </a>
</div>
I need to scrape the data-hovercard attribute of each anchor tag.
Below is the code I used:
include('simple_html_dom.php');
$html = file_get_html('http://sampleurl.com/taki.html');
foreach($html->find('div[class="mbs fwb"]') as $desc11)
foreach($desc11->find('a') as $desc12)
echo $desc12->data-hovercard . '<br>';
It is not working. The result I am getting:
0
0
I want a result like this:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
Use a regular expression with preg_match_all() and a pattern like: /data-hovercard="([^"]*)"/i
The first capture group of each match (index 1 of the matches array) will contain the values for that attribute. You might need to remove newlines from your source text, just for good housekeeping.
Hope this helps.
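For what it's worth, a minimal sketch of the regular expression route in PHP could look like this. The $source string is a shortened, hypothetical copy of the markup from the question.

```php
<?php
// Hypothetical inline copy of the markup from the question.
$source = '<div class="mbs fwb"><a href="/groups/291064327770896/" '
        . 'data-hovercard="/ajax/hovercard/group.php?id=291064327770896" id="js_2">NCR Business Startups </a></div>'
        . '<div class="mbs fwb"><a href="/groups/Analystamit/" '
        . 'data-hovercard="/ajax/hovercard/group.php?id=158649140871478" id="js_2">Risk Professionals </a></div>';

// preg_match_all() already finds every match, so no "g" flag is needed
// (PCRE in PHP does not have one).
preg_match_all('/data-hovercard="([^"]*)"/i', $source, $matches);

// $matches[1] holds the contents of the first capture group for each match.
foreach ($matches[1] as $value) {
    echo $value, PHP_EOL;
}
```

A regex like this is fine for a quick one-off, but it breaks as soon as the attribute is written with single quotes, so a real parser is usually the safer choice.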
You can do this using the built-in SimpleXMLElement class and an XPath query:
$xml = new SimpleXMLElement('http://foo.bar/baz.html', 0, true);
$anchors = $xml->xpath('//div[@class="mbs fwb"]/a');
foreach ($anchors as $a) {
echo $a['data-hovercard'], PHP_EOL;
}
Output, assuming baz.html is well-formed XML (for example XHTML) containing the divs
from the question:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
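If the page is not well-formed XML, DOMDocument::loadHTML() is more forgiving. The sketch below uses a hypothetical inline $html string instead of a URL. The 0 in the question's output most likely comes from PHP reading $desc12->data-hovercard as a subtraction, which is why the attribute name is passed as a string here.

```php
<?php
// Hypothetical inline copy of the markup from the question.
$html = '<div class="mbs fwb"><a href="/groups/291064327770896/" '
      . 'data-hovercard="/ajax/hovercard/group.php?id=291064327770896" id="js_2">NCR Business Startups </a></div>'
      . '<div class="mbs fwb"><a href="/groups/Analystamit/" '
      . 'data-hovercard="/ajax/hovercard/group.php?id=158649140871478" id="js_2">Risk Professionals </a></div>';

$doc = new DOMDocument;
@$doc->loadHTML($html); // @ hides warnings such as the duplicate id="js_2"
$xpath = new DOMXPath($doc);

$values = [];
foreach ($xpath->query('//div[@class="mbs fwb"]/a') as $a) {
    // getAttribute() takes the name as a string, so the hyphen
    // is never parsed as a minus sign.
    $values[] = $a->getAttribute('data-hovercard');
}
echo implode(PHP_EOL, $values), PHP_EOL;
```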
I have a layout like this:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>Wow</h4>
</div>
<div class="moon">
<h4>Great</h4>
</div>
</div>
</div>
</div>
First I get the nodes with an XPath query:
$a = $xpath->query("//div[@class='fly']"); // to get all elements with class fly
foreach ($a as $p) {
$t = $p->getElementsByTagName('img');
echo ($t->item(0)->getAttributes('data-original'));
}
When I run the code, it produces 0 results. After tracing, I found that <img class="badge"> is processed first. How can I get the data-original value from <img class="aye"> and also get the values "Wow" and "Great" from the <h4> tags?
Thank you,
Alternatively, you could add another XPath query, relative to each div, to your current code.
To get the attribute, use ->getAttribute():
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); // to get all elements with class fly
foreach($parent_div as $div) {
$aye = $xpath->query('./img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('./div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('./div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
Thank you for your code!
I tried the code but it failed, and I don't know why. So I changed your code a bit and now it works!
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); // to get all elements with class fly
foreach($parent_div as $div) {
$aye = $xpath->query('descendant::img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('descendant::div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('.//div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
I have no idea what the difference between ./ and descendant is, but my code works fine using descendant.
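On the difference: ./img only looks at direct children of the context node, while descendant::img (or the shorthand .//img) looks at any depth below it. The sketch below shows this with a hypothetical variation of the markup in which the image is wrapped in an extra <p>. Whether the real page has such a wrapper is only a guess, but it would explain why ./ failed there.

```php
<?php
// Hypothetical markup: the <img> sits inside an extra <p>,
// so it is NOT a direct child of the div.
$markup = '<div class="fly"><p><img class="aye" data-original="b.png"></p></div>';

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);
$div = $xpath->query("//div[@class='fly']")->item(0);

$child = $xpath->query('./img[@class="aye"]', $div);           // direct children only
$deep  = $xpath->query('descendant::img[@class="aye"]', $div); // any depth below $div
$short = $xpath->query('.//img[@class="aye"]', $div);          // shorthand, also any depth

echo $child->length, ' ', $deep->length, ' ', $short->length, PHP_EOL;
```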
Given the following XML:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>Wow</h4>
</div>
<div class="moon">
<h4>Great</h4>
</div>
</div>
</div>
</div>
you asked:
how can I get data-original value from <img class="aye">and also get the value "Wow" and "Great" from <h4> tag?
With XPath you can obtain the values as string directly:
string(//div[@class='fly']/img/@data-original)
This is the string value of the first data-original attribute of an img tag within all divs with class="fly".
string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])
string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])
These are the string values of the first and second <h4> tags that are not followed on their own level by an element containing another <h4>, within all divs with class="fly". The parentheses matter: without them, [1] and [2] count positions among the <h4> children of each parent element rather than across the whole result.
The leading //div[@class='fly'] part may look like it is getting in the way right now, but with iteration it is no longer needed, because the XPath expressions then become relative to each div:
//div[@class='fly']
string(./img/@data-original)
string((.//h4[not(following-sibling::*//h4)])[1])
string((.//h4[not(following-sibling::*//h4)])[2])
To use XPath string(...) expressions in PHP you must use DOMXPath::evaluate() instead of DOMXPath::query(). This would then look like the following:
$aye = $xpath->evaluate("string(//div[@class='fly']/img/@data-original)");
$h4_1 = $xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])");
$h4_2 = $xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])");
A full example with iteration and output:
// all <div> tags with class="fly"
$divs = $xpath->evaluate("//div[@class='fly']");
foreach ($divs as $div) {
// the first data-original attribute of an <img> inside $div
echo $xpath->evaluate("string(./img/@data-original)", $div), "<br/>\n";
// all <h4> tags inside $div that are not followed by an element containing another <h4>
$h4s = $xpath->evaluate('.//h4[not(following-sibling::*//h4)]', $div);
foreach ($h4s as $h4) {
echo $h4->nodeValue, "<br/>\n";
}
}
As the example shows, you can use evaluate for node lists as well. The values of the <h4> tags are no longer obtained with string() here, because I assume there could be more than just two.
A demo including special string output (just exemplary):
echo <<<HTML
{$xpath->evaluate("string(//div[@class='fly']/img/@data-original)")}<br/>
{$xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])")}<br/>
{$xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])")}<br/>
<hr/>
HTML;
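One detail worth checking with positional predicates such as [1] and [2]: in XPath 1.0 a position directly after a step is counted per parent element, while a position after a parenthesised expression is counted over the whole result. The sketch below runs both forms against the markup from the question. The variable names ($h4, $perParent, $overall) are made up for illustration.

```php
<?php
$markup = '<div class="fly"><img src="a.png" class="badge">'
        . '<img class="aye" data-original="b.png">'
        . '<div class="to"><h4>Fly To The Moon</h4><div class="clearfix">'
        . '<div class="the"><h4>Wow</h4></div><div class="moon"><h4>Great</h4></div>'
        . '</div></div></div>';

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);
$div = $xpath->query("//div[@class='fly']")->item(0);

// Every <h4> that is not followed by a sibling containing another <h4>.
$h4 = './/h4[not(following-sibling::*//h4)]';

// Position counted within each parent: no parent has a second matching <h4>.
$perParent = $xpath->evaluate('string(' . $h4 . '[2])', $div);

// Position counted over the whole node-set, in document order.
$overall = $xpath->evaluate('string((' . $h4 . ')[2])', $div);

var_dump($perParent, $overall);
```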
I'm using simple_html_dom.php.
For example, I'm using this code to get one span inside the div, but I want to get all three spans inside it instead of writing one foreach for each span.
for(){
foreach($html->find('span.street-address', $i) as $e){
$list[$i] = $e->plaintext;
echo $list[$i];
}
}
Another thing: the div in the HTML file that I want to get the information from is
<div class="address adr">
<span class="street-address">
<span class="no_ds">...</span>
<span class="postal-code">...</span>
<span class="locality"></span>
</span>
</div>
I want to get everything within that div.
There is also a phone div that is different.
<div class="phone tel">
<span class="no_ds"></span>
</div>
As you can see, the span class "no_ds" has the same name in both divs. Will that have any effect on my code? And how do I write the space between "address adr" and "phone tel" in the code? With a period?
getElementsByTagName is the answer here.
// $dochtml is assumed to be a DOMDocument with the page already loaded via loadHTML()
// gets all SPANs
$spans = $dochtml->getElementsByTagName('span');
// traverse the object with all SPANs
foreach($spans as $span) {
// gets and outputs the class and content of each SPAN
$class = $span->getAttribute('class');
$cnt = $span->nodeValue;
echo $class. ' - '. $cnt. '<br/>';
}
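To keep the two "no_ds" spans apart, the query can be limited to the address div. The sketch below uses DOMXPath rather than simple_html_dom. The span contents are hypothetical placeholders, since the question elides them, and $list is keyed by class name purely for illustration. The concat/normalize-space idiom matches one class token out of a space-separated list such as "address adr".

```php
<?php
// Hypothetical markup based on the question; the text values are placeholders.
$html = '<div class="address adr"><span class="street-address">'
      . '<span class="no_ds">12 Main St</span>'
      . '<span class="postal-code">12345</span>'
      . '<span class="locality">Springfield</span></span></div>'
      . '<div class="phone tel"><span class="no_ds">555-0100</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match the div whose class list contains the token "address",
// then take every span directly inside its street-address span.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' address ')]"
       . "/span[@class='street-address']/span";

$list = [];
foreach ($xpath->query($query) as $span) {
    $list[$span->getAttribute('class')] = trim($span->nodeValue);
}
print_r($list);
```

The "no_ds" span inside the phone div is never visited, because the query only descends from the address div.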
I would like to place a new node element, before a given element. I'm using insertBefore for that, without success!
Here's the code,
<DIV id="maindiv">
<!-- I would like to place the new element here -->
<DIV id="child1">
<IMG />
<SPAN />
</DIV>
<DIV id="child2">
<IMG />
<SPAN />
</DIV>
</DIV>
//$div is a new div node element,
//The code I'm trying, is the following:
$maindiv->item(0)->parentNode->insertBefore( $div, $maindiv->item(0) );
//Obs: This code actually places the new node before maindiv
//$maindiv object(DOMNodeList)[5], from getElementsByTagName( 'div' )
//echo $maindiv->item(0)->nodeName gives 'div'
//echo $maindiv->item(0)->nodeValue gives the correct data on that div 'some random text'
//this code actually places the new $div element before <DIV id="maindiv">
http://pastie.org/1070788
Any kind of help is appreciated, thanks!
If maindiv is from getElementsByTagName(), then $maindiv->item(0) is the div with id=maindiv. So your code is working correctly because you're asking it to place the new div before maindiv.
To make it work like you want, you need to get the children of maindiv:
$dom = new DOMDocument();
$dom->load($yoursrc);
$maindiv = $dom->getElementById('maindiv');
$items = $maindiv->getElementsByTagName('DIV');
$items->item(0)->parentNode->insertBefore($div, $items->item(0));
Note that if you don't have a DTD, PHP doesn't return anything with getElementById. For getElementById to work, you need to have a DTD or specify which attributes are IDs:
foreach ($dom->getElementsByTagName('DIV') as $node) {
$node->setIdAttribute('id', true);
}
From scratch, this seems to work too:
$str = '<DIV id="maindiv">Here is text<DIV id="child1"><IMG /><SPAN /></DIV><DIV id="child2"><IMG /><SPAN /></DIV></DIV>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName("div");
$divs->item(0)->appendChild($doc->createElement("div", "here is some content"));
print_r($divs->item(0)->nodeValue);
Found a solution:
$child = $maindiv->item(0);
$child->insertBefore( $div, $child->firstChild );
I don't know how much sense this makes, but well, it worked.
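It does make sense: insertBefore() is called on the parent and puts the new node in front of the reference node, so passing firstChild makes the new div the first child of maindiv. A small self-contained sketch, with ids and text made up for illustration:

```php
<?php
$str = '<div id="maindiv"><div id="child1"></div><div id="child2"></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($str);

// The first div in document order is maindiv itself.
$maindiv = $doc->getElementsByTagName('div')->item(0);

// Hypothetical new element to insert.
$div = $doc->createElement('div', 'new element');
$div->setAttribute('id', 'new');

// Called on the parent: place $div in front of its current first child.
$maindiv->insertBefore($div, $maindiv->firstChild);

// Collect the ids of maindiv's element children to check the order.
$ids = [];
foreach ($maindiv->childNodes as $node) {
    if ($node instanceof DOMElement) {
        $ids[] = $node->getAttribute('id');
    }
}
echo implode(', ', $ids), PHP_EOL;
```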