DOMDocument, get images AFTER first <h1>


I am trying to get all <img> tags after the first <h1> tag, but I can't quite figure out how.
Currently I am able to get all <img> tags from a page using this code:
$html = file_get_contents($this->url);
$this->doc = new DOMDocument();
@$this->doc->loadHTML($html);
$tags = $this->doc->getElementsByTagName('img');
foreach ($tags as $tag) {
    array_push($this->images, $tag->getAttribute('src'));
}
How can I make it do this after the first <h1> tag?

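For PHP, one option is a DOM parser such as PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/manual.htm#section_traverse
Find the first h1 tag, then traverse its following siblings, searching each one for img tags.
// Assumes $html is a simple_html_dom object, e.g. from file_get_html($url).
$h1 = $html->find('h1', 0); // first <h1>, or null if there is none
// next_sibling() returns a single element (or null), so step through one at a time.
for ($sibling = $h1 ? $h1->next_sibling() : null; $sibling; $sibling = $sibling->next_sibling())
{
    foreach ($sibling->find('img') as $img)
    {
        // do something...
    }
}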
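Since the question already uses DOMDocument, the same idea can also be sketched without a third-party parser by using DOMXPath and the XPath following axis. This is only a minimal sketch: the function name imagesAfterFirstH1 and the sample markup are made up for illustration and do not come from the original code.

```php
<?php
// Hypothetical helper; the name and the sample markup are illustrative only.
function imagesAfterFirstH1(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // (//h1)[1] is the first <h1> in document order; following::img matches
    // every <img> that starts after that <h1> has been closed.
    $sources = [];
    foreach ($xpath->query('(//h1)[1]/following::img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

$sample = '<html><body><img src="logo.png"><h1>Title</h1>'
        . '<p><img src="a.png"></p><div><img src="b.png"></div></body></html>';
print_r(imagesAfterFirstH1($sample));
```

For the sample above this yields a.png and b.png and skips logo.png, which comes before the heading. Unlike a sibling walk, the following axis also reaches images that are not under the same parent as the <h1>; images nested inside the <h1> itself are not matched.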

Related

Extract all the 'a' tags within which an 'img' tag resides, using PHP

Here is the code snippet being used:
$urlContent = file_get_contents('http://www.techeblog.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$domPath = new DOMXpath($dom);
$linkList = $domPath->evaluate("/html/body/a/img");
foreach ($linkList as $link)
{
    echo $link->getAttribute("src")."<br />";
}
Need to extract all the links in which the child node is an image tag.

Your XPath expression will only return image tags that are inside links that are direct children of the body tag. If you want all link tags that contain images anywhere in the document, use the expression //a[img]
That being said, you may want to be more specific about which images you pull. This expression will limit the results to links containing images that are inside the blog entries: //div[@class="entry"]//a[img]
Here is a great XPath cheat sheet.
<?php
$urlContent = file_get_contents('http://www.techeblog.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$domPath = new DOMXpath($dom);
$linkList = $domPath->evaluate('//div[@class="entry"]//a[img]');
foreach ($linkList as $link)
{
    echo $link->getAttribute("href").PHP_EOL;
}
Also, your echo is looking for an attribute called src, which will not be present in the links.

PHP DOM Get Links Inside DIV

I'm attempting to iterate thru DIVs and get all of the links from each DIV. I'd put this in an array, i.e.:
[Astronomy] // div @class=container
[link] http://www.nasa.gov
[link] http://www.seti.org
[Biology] // div @class=container
[link] http://www.biology.com
[Chemistry] // div @class=container
[link] http://www.chemistry.com
I can use DOM to get the text of the content inside the DIV's, but I can't figure out how to get the HREF Attribute of nodes inside the DIV. getAttribute isn't a method of Node. How can I iterate thru elements ('a') inside of an existing xpath?
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$dom_xpath = new DOMXpath($dom_document);
$elements = $dom_xpath->query("*/div[@class='container']");
foreach ($elements as $element) {
    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
        // ??? $links = $dom_xpath->query("//a");
    }
}

You should try and use $element->getElementsByTagName('a') instead of using $element->childNodes.
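As a rough sketch of that suggestion, the links can be grouped per container like this. The sample markup is invented: in particular, taking the section name from an <h2> inside each div is an assumption, since the question does not show where the names come from.

```php
<?php
// Sample markup is made up; reading the section name from an <h2> is an
// assumption that the question does not confirm.
$html = '<div class="container"><h2>Astronomy</h2>'
      . '<a href="http://www.nasa.gov">NASA</a>'
      . '<p><a href="http://www.seti.org">SETI</a></p></div>'
      . '<div class="container"><h2>Biology</h2>'
      . '<a href="http://www.biology.com">Biology</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$linksBySection = [];
foreach ($xpath->query("//div[@class='container']") as $div) {
    // Passing $div as the second argument makes './/h2' relative to this div.
    $heading = $xpath->query('.//h2', $div)->item(0);
    $title = $heading ? trim($heading->textContent) : 'untitled';
    // getElementsByTagName() returns DOMElement objects at any depth inside
    // the div, so getAttribute() is available on each of them.
    foreach ($div->getElementsByTagName('a') as $a) {
        $linksBySection[$title][] = $a->getAttribute('href');
    }
}
print_r($linksBySection);
```

The commented-out query in the question, "//a", would search the whole document on every pass. If XPath is preferred over getElementsByTagName, a relative expression with a context node, such as $xpath->query('.//a', $div), keeps the search inside the current div.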

DomDocument fetch h1 tag

I have a very, very big HTML page. I need to fetch the data under the h1 tag.
From the various examples I have seen, DOMDocument is basically used for parsing XML.
But if I have HTML data that is very messy, and I want to fetch the text under the <h1></h1> tag, what would the code be?
There may be a number of <h1> tags.
$doc = new DOMDocument();
@$doc->loadHTML($this->siteHtmlData);
$aElements = $doc->getElementsByTagName("h1");
Please help me.
Thanks

You could loop it to get the value:
foreach ($aElements as $node) {
echo $node->nodeValue, PHP_EOL;
}

How to pass data from DOMDocument to regexp?

Using the following code I get the "img" tags from some HTML and check whether they are wrapped in "a" tags. If the current "img" tag is not inside an "a" (hyperlink), I want to wrap it in an "a" tag, adding the hyperlink's start and end tags and setting the target. For this I need the whole HTML of the "img" tag to work with.
The question is how I can pass the "img" tag's HTML to the regexp. I need some PHP variable for the regexp to work on; the place is marked with ???.
$doc = new DOMDocument();
$doc->loadHTML($article_header);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
if ($img->parentNode->tagName != "a") {
preg_match_all("|<img(.*)\/>|U", ??? , $matches, PREG_PATTERN_ORDER);
}
}

You do not want to use regex for this. You already have a DOM, so use it:
foreach ($imgs as $img) {
$container = $img->parentNode;
if ($container->tagName != "a") {
$a = $doc->createElement("a");
$a->appendChild( $img->cloneNode(true) );
$container->replaceChild($a, $img);
}
}
see documentation on
DOMDocument::createElement
DOMNode::appendChild
DOMNode::cloneNode
DOMNode::replaceChild

Regex match HTML tag NOT containing another tag

I am writing a regex find/replace that will insert a <span> into every <a href> in a file where a <span> does not already exist. It will allow other tags to be in the <a href> like <img>, <b>, etc.
Currently I have this regex:
Find: (<a[^>]+?style=".*?color:#(\w{6}).*?".*?>)(.+?)(<\/a>)
Replace: '$1<span style="color:#$2;">$3</span>$4'
It works great except if i run it over the same file, it will insert a <span> inside of a <span> and it gets messy.
Target Example:
We want it to ignore this:
<span style="color:#bfbcba;">Howdy</span>
But not this:
Howdy
Or this:
<img src="myimg.gif" />Howdy
--EDIT--
Using the PHP DOM library as suggested in the comments, this is what I have so far:
$doc = new DOMDocument();
$doc->loadHTML($input);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
$spancount = $tag->getElementsByTagName("span")->length;
if($spancount == 0){
$element = $doc->createElement('span');
$tag->appendChild($element);
}
}
echo $doc->saveHTML();
Currently it will detect whether there is a span inside an anchor and, if there isn't, it will append an empty span to the inside of the anchor. However, I have yet to figure out how to get the original contents of the anchor inside the span.

Don't use regex for this, it's not ideal for HTML.
Use a DOM library and getElementsByTagName('a') then iterate through each anchor and see if it contains a sub span element with getElementsByTagName('span'), using the length property. If it doesn't, appendChild or assign the firstChild of the anchor node to your new span created with document.createElement('span').
EDIT: As for grabbing the inner html of the anchor, if there are lots of nodes inside, try using this:
<?php
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
$html = innerHTML( $anchorRef );
This may also help you out: Change innerHTML of a php DOMElement
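As an alternative to serializing the inner HTML, the anchor's existing child nodes can be moved into the new span directly. The following is only a sketch under that assumption: wrapAnchorContents is a made-up name, and copying the colour from the anchor's style attribute, as the original regex does, is left out.

```php
<?php
// Hypothetical helper: wraps the contents of every <a> that has no <span>
// descendant in a newly created <span>.
function wrapAnchorContents(DOMDocument $doc): void
{
    // Copy the live node list into an array before changing the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        if ($a->getElementsByTagName('span')->length > 0) {
            continue; // this anchor already contains a span
        }
        $span = $doc->createElement('span');
        // appendChild() moves a node that is already in the tree, so
        // $a->firstChild advances until the anchor is empty.
        while ($a->firstChild !== null) {
            $span->appendChild($a->firstChild);
        }
        $a->appendChild($span);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<p><a href="#"><img src="myimg.gif">Howdy</a> '
             . '<a href="#"><span>Hi</span></a></p>');
wrapAnchorContents($doc);
echo $doc->saveHTML();
```

Because anchors that already contain a span are skipped, running the function over the same document a second time should leave it unchanged, which is the behaviour the regex version could not provide.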
