Grabbing an attribute with DOMXPath

I know there are lots of ways to grab an attribute.
This is my HTML:
<li class="result">
<a class="block_container" href="**FIRST**">
<img alt="changeable text" src="**SOME LINK**" border="0">
</a>
</li>
<li class="result">
<a class="block_container" href="**SECOND**">
<img alt="changeable text" src="**SOME LINK**" border="0">
</a>
</li>
// ... and many more like this
I can grab the href, but there are many of these attributes!
I used a DOMXPath query so that I can pick the first or second href by item number:
$a = $xpath->query("//li[@class='block_container']/a");
echo $text = $a->item(**MY ITEM NUMBER**)->nodeValue;
But it doesn't work!
Can you help me grab the href and src by item number?

If you want the href of each a:
$hrefs = $xpath->query("//li/a[@class='block_container']/@href");
foreach ($hrefs as $href) {
    echo $href->nodeValue . "<br>\n";
}
And if you want the outerHTML of each img tag:
$imgs = $xpath->query("//li/a[@class='block_container']/img");
foreach ($imgs as $img) {
    echo $dom->saveHTML($img) . "<br>\n";
}
demo on eval.in
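To pick one result by item number, as the question asks, a minimal sketch follows. The helper name linkAt and the placeholder URLs are assumptions for illustration only. Note that in the question's markup the class block_container sits on the a element and the li has class result, which is why the original query matched nothing.

```php
<?php
// Sample markup modelled on the question; the URLs are placeholders.
$html = <<<HTML
<ul>
<li class="result">
<a class="block_container" href="first.html">
<img alt="changeable text" src="first.png" border="0">
</a>
</li>
<li class="result">
<a class="block_container" href="second.html">
<img alt="changeable text" src="second.png" border="0">
</a>
</li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: href and src of the n-th result (0-based), or null if out of range.
function linkAt(DOMXPath $xpath, int $index): ?array
{
    $anchors = $xpath->query("//li[@class='result']/a[@class='block_container']");
    $a = $anchors->item($index);
    if ($a === null) {
        return null;
    }
    // Query relative to the matched anchor to find its own image.
    $img = $xpath->query("img", $a)->item(0);
    return [
        'href' => $a->getAttribute('href'),
        'src'  => $img !== null ? $img->getAttribute('src') : null,
    ];
}

$second = linkAt($xpath, 1);
echo $second['href'], "\n"; // second.html
echo $second['src'], "\n";  // second.png
```

Reading the attribute with getAttribute() avoids nodeValue, which on an element returns its text content rather than an attribute.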

Related

How to select one href tag without nested tag

html part
<div class="headlines">
<img src="">
</div>
<div class="headlines">
<img src="">
</div>
php part
require_once "xpath.php";
$startUrl ="index.php";
$xpath = new XPATH($startUrl);
$linkHref = $xpath->query("//div[@class='headlines']/a/@href");
$imageSrc = $xpath->query("//div[@class='headlines']//img/@src");
$data = array();
for ($x = 0; $x < $imageSrc->length; $x++) {
    $data[$x]['imageSrclink'] = $imageSrc->item($x)->nodeValue;
    $data[$x]['dataLinks'] = $linkHref->item($x)->nodeValue;
}
echo "<pre>";
print_r($data);
I am trying to use the number of img tags to match up the href attributes, but there are more href attributes than img tags. What I mean is that I want two href values instead of four.

You can select only the first a element (a[1]) inside each div:
$linkHref = $xpath->query("//div[@class='headlines']/a[1]/@href");
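A related way to keep links and images aligned is to loop over each div and run both queries relative to it. The sketch below uses PHP's built-in DOMXPath rather than the custom XPATH class from the question, and the markup (two links and one image per block) is an assumption, since the original HTML did not survive intact.

```php
<?php
// Hypothetical markup: each headline block holds two links but only one image.
$html = <<<HTML
<div class="headlines">
<a href="story-1.html"><img src="story-1.jpg"></a>
<a href="story-1.html">Story one</a>
</div>
<div class="headlines">
<a href="story-2.html"><img src="story-2.jpg"></a>
<a href="story-2.html">Story two</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$data = [];
foreach ($xpath->query("//div[@class='headlines']") as $div) {
    // Both queries are relative to the current div (note the leading dot).
    $href = $xpath->query("./a[1]/@href", $div)->item(0);
    $src  = $xpath->query(".//img/@src", $div)->item(0);
    $data[] = [
        'dataLinks'    => $href !== null ? $href->nodeValue : null,
        'imageSrclink' => $src !== null ? $src->nodeValue : null,
    ];
}
print_r($data);
```

Because each row is built from a single div, a block with a missing link or image yields a null entry instead of shifting every later pair out of step.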

Regular Expression or other way to get string in right format

Please help me out. I have the following string:
<p>this is text before first image</p>
<p><img class="size-full wp-image-2178636" src="image1.jpg" alt="first" /> this is first caption</p>
<p>this is text before second image.</p>
<p><img src="image2.jpg" alt="second" class="size-full wp-image-2178838" /> this is second caption</p>
<p>there may be many more images</p>
and I need the above string formatted as follows:
<p>this is text before first image</p>
<a href="">
<figure>
<img class="size-full wp-image-2178636" src="image1.jpg" alt="first" />
<figcaption class="newcaption">
<h1>this is first caption</h1>
</figcaption>
</figure>
</a>
<p>this is text before second image.</p>
<a href="">
<figure>
<img class="size-full wp-image-2178636" src="image2.jpg" alt="first" />
<figcaption class="newcaption">
<h1>this is second caption</h1>
</figcaption>
</figure>
</a>
<p>there may be many more images</p>
Kindly help me. How can we do that, either with regular expressions or some other way? I am doing it in PHP.
Regards,
Sachin.

Although SO is not supposed to be a code-writing service, here is a quick and dirty solution that uses the DOMDocument approach:
$html = '...'; // your input data
$input = new DOMDocument();
$input->loadHTML($html);
$ps = $input->getElementsByTagName('p');
$output = new DOMDocument();
$counter = 0;
foreach ($ps as $p) {
    if ($counter % 2 === 0) {
        // text before image
        $p_before_image = $output->createElement("p", $p->nodeValue);
        $output->appendChild($p_before_image);
    }
    elseif ($p->hasChildNodes()) {
        // image output routine
        // shallow-copy the first <a> and <img> found inside this paragraph
        $as_input = $p->getElementsByTagName("a");
        $a_output = $output->importNode($as_input->item(0));
        $figure = $output->createElement("figure");
        $imgs_input = $p->getElementsByTagName("img");
        $img_output = $output->importNode($imgs_input->item(0));
        $figure->appendChild($img_output);
        // the paragraph text becomes the caption
        $figcaption = $output->createElement("figcaption");
        $figcaption->setAttribute("class", "newcaption");
        $h1 = $output->createElement("h1", $p->nodeValue);
        $figcaption->appendChild($h1);
        $figure->appendChild($figcaption);
        $a_output->appendChild($figure);
        $output->appendChild($a_output);
    }
    else {
        // Document malformed
    }
    $counter++;
}
print $output->saveHTML();
Note that saveHTML() will output plain old HTML. Thus, imgs won't be turned into self-closing tags. You may want to look into saveXML() if this is important to you.
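The difference between the two serializers can be seen with a small sketch. The element names and the src value here are only illustrative.

```php
<?php
$doc = new DOMDocument();
$figure = $doc->createElement('figure');
$img = $doc->createElement('img');
$img->setAttribute('src', 'image1.jpg');
$figure->appendChild($img);
$doc->appendChild($figure);

// Serialize the same node twice: once as HTML, once as XML.
$asHtml = $doc->saveHTML($figure);
$asXml  = $doc->saveXML($figure);

echo $asHtml, "\n"; // HTML serialization: img is written as a void element
echo $asXml, "\n";  // XML serialization: img is written as a self-closing tag
```

Passing the node to saveHTML() or saveXML() serializes just that subtree, which also avoids the doctype or XML declaration that a whole-document dump may add.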

Using simplehtmldom to find a URL without an id or class

First time poster on here, did about a couple hours of searching and trying but got stuck... so go easy on me :)
With a page containing this...
<li onclick="javascript:trackClick(14423, 'web'); document.location='http://www.mywebsite.com';">
<img class="listing-control" src="img/url-profile-listings.png" alt="Get Directions" width="51" height="51" style="padding:4px;">
<span id="web14423">Visit Website</span>
</li>
I am trying to get the url http://www.mywebsite.com in the document.location of the li tag.
The only unique and constant thing to key off is the "Visit Website" text in the span tag. Is there any way to find that and go up to the parent li tag to get the document.location value from the onclick event?
Any help would be greatly appreciated!!!
Thanks,
MrMo.

Of course: load it into a SimpleHTMLDOM object, then target the <li> tag and read its onclick attribute to get the value inside it.
Disclaimer: I'm not a regex expert in any way.
$html_string = <<<EOT
<li onclick="javascript:trackClick(14423, 'web'); document.location='http://www.mywebsite.com';">
<img class="listing-control" src="img/url-profile-listings.png" alt="Get Directions" width="51" height="51" style="padding:4px;">
<span id="web14423">Visit Website</span>
</li>
EOT;
$html = str_get_html($html_string);
// after loading the html with either str_get_html or file_get_html
foreach ($html->find('li') as $list) {
    $script = $list->onclick;
    // capture the quoted URL assigned to document.location
    preg_match('/document\.location\s*=\s*\'(.*?)\';/', $script, $match);
    if (!empty($match)) {
        $url = $match[1];
        echo $url;
    }
}
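The question also asks about starting from the "Visit Website" text and walking up to the parent li. A sketch of that idea follows, using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, not the answer's method). The wrapping ul is an assumption added to make the snippet self-contained.

```php
<?php
$html = <<<'HTML'
<ul>
<li onclick="javascript:trackClick(14423, 'web'); document.location='http://www.mywebsite.com';">
<img class="listing-control" src="img/url-profile-listings.png" alt="Get Directions" width="51" height="51" style="padding:4px;">
<span id="web14423">Visit Website</span>
</li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Find the span by its text, step up to the enclosing li, and select its onclick attribute.
$onclick = $xpath->query(
    "//span[normalize-space(.)='Visit Website']/parent::li/@onclick"
)->item(0);

$url = null;
if ($onclick !== null
    && preg_match("/document\\.location\\s*=\\s*'([^']*)'/", $onclick->nodeValue, $match)) {
    $url = $match[1];
}
echo $url, "\n"; // http://www.mywebsite.com
```

Keying on the span text means other li elements with unrelated onclick handlers are ignored.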

how do I get sets of data with xpath

The code below retrieves a series of images from a site's search results, along with the corresponding age data. It works fine; however, I get a list of images followed by a list of the age values:
img img img img age age age age and so on.
How do I combine these so that I can display them in sets: img age img age img age?
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHTMLFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
    $image = $tag->getAttribute('src');
    echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
    echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img src and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>

It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHTMLFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
    // look for the image inside this div only
    $tags = $node->getElementsByTagName('img');
    $image = $tags->item(0)->getAttribute('src');
    echo '<img src="'. $image .'" alt="image" ><br>';
    echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
$node->nextSibling
As a general point, trace through the HTML and think: how do I get from A to B? Go forwards? Backwards? Up to the parent, across to the next node and down again?
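Applied to the sample page in the question, where the image and the age div both sit inside the same search-result container, one possible sketch is to loop over the containers and query each one for its own image and age. The markup below is a trimmed copy of that sample; a real page would repeat the block once per result.

```php
<?php
// Trimmed copy of the sample page from the question.
$page = <<<HTML
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query("//div[@class='search-result']") as $result) {
    // Relative queries keep each image paired with the age div of the same result.
    $img = $xpath->query(".//img[@class='thumb']", $result)->item(0);
    $age = $xpath->query("./div[@class='age']", $result)->item(0);
    $results[] = [
        'src' => $img !== null ? $img->getAttribute('src') : null,
        'age' => $age !== null ? $age->getAttribute('title') : null,
    ];
}

foreach ($results as $row) {
    echo $row['src'], ' ', $row['age'], "\n";
}
```

This prints one "image age" pair per line, which is the img age img age ordering the question asks for.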

Remove part of string from array value in PHP

I have the following code snippet, which essentially parses my blog site and stores some information as variables:
global $articles;
$items = $html->find('div[class=blogpost]');
foreach ($items as $post) {
    $articles[] = array($post->children(0)->innertext,
                        $post->children(1)->first_child()->outertext);
}
foreach ($articles as $item) {
    echo $item[0];
    echo $item[1];
    echo "<br>";
}
The above code outputs as follows:
Title of blog post 1 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/cool_news" id="963" target="_blank" >Click here for news</a> <img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
Title of blog post 2 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/neato" id="963" target="_blank" >Click here for neato</a> <img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
Title of blog post 3 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/lame" id="963" target="_blank" >Click here for lame</a> <img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
with $item[0] containing "Title of blog post X" and $item[1] containing the rest.
What I want to do is parse $item[1] and retain only the URL contained within it as a separate variable. Perhaps I am not phrasing my question correctly, but I cannot find anything that can help me figure this out.
Can anyone help me?

If you were to parse $item[1] into whatever DOM crawler object you were using for $html, you could use the following XPath (XPath positions start at 1, so the first anchor is a[1]):
$item[1]->find('//a[1]/@href');
which will return
href="http://www.example.com/cool_news"
Then extract the URL however you want, with PHP or by refining the XPath query. I'm not sure what the XPath would be to get just the value; perhaps someone can expand on that.
EDIT: Seeing as you are using Simple DOM Parser, try the following:
$blogItemHtml = new simple_html_dom();
$blogItemHtml->load($item[1]);
$anchors = $blogItemHtml->find('a');
echo $anchors[0]->href; // "http://www.example.com/cool_news"
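If Simple DOM Parser is not a requirement, the same extraction can be sketched with PHP's built-in DOMDocument. This is an alternative technique, not the library used in the question. The fragment below is the first $item[1] value copied from the question's output.

```php
<?php
// Sample value of $item[1], copied from the output shown in the question.
$fragment = '<script type="text/javascript">execute_function(3,\'\')</script>'
    . '<a href="http://www.example.com/cool_news" id="963" target="_blank" >Click here for news</a> '
    . '<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">';

$dom = new DOMDocument();
// The fragment is incomplete HTML (unclosed span), so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($fragment);
libxml_clear_errors();

// Take the first anchor in the fragment and read its href attribute.
$anchor = $dom->getElementsByTagName('a')->item(0);
$url = $anchor !== null ? $anchor->getAttribute('href') : null;
echo $url, "\n"; // http://www.example.com/cool_news
```

getAttribute() returns the bare URL, so there is no href="..." wrapper to strip afterwards.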
