html dom parser to extract href from span sibling

My HTML file contains a date and a link in <span> tags within a table.
Can anyone help me find the link for a particular date?
<table>
<tbody>
<tr class="c0">
<td class="c11">
<td class="c8">
<ul class="c2 lst-kix_h6z8amo254ry-0 start">
<li class="c1">
<span>1st Apr 2014 - </span>
<span class="c6"><a class="c4" href="/link.html">View</a>
</span>
</li>
</ul>
</td>
</tr>
</td>
</table>
I want to retrieve the link for a particular date.
My code looks like this:
include('simple_html_dom.php');
$html = file_get_html('link.html');
//store the links in array
foreach($html->find('span') as $value)
{
//echo $value->plaintext . '<br />';
$date = $value->plaintext;
if (strpos($date,$compare_text)) {
//$linkeachday = $value->find('span[class=c1]')->href;
//$day_url[] = $value->href;
//$day_url = Array("text" => $value->plaintext);
$day_url = Array("text" => $date, "link" =>$linkeachday);
//echo $value->next_sibling (a);
}
}
or
$spans = $html->find('table',0)->find('li')->find('span');
echo $spans;
$num = null;
foreach($spans as $span){
if($span->plaintext == $compare_text){
$next_span = $span->next_sibling();
$num = $next_span->plaintext;
echo($num);
break;
}
}
echo($num);

You were on the right path with your last example.
I modified it a bit: the code below gets all the spans, tests whether each one contains the searched text, and if so displays the content of its next sibling, if there is one (see the comments in the code):
$input = <<<_DATA_
<table>
<tbody>
<tr class="c0">
<td class="c11">
<td class="c8">
<ul class="c2 lst-kix_h6z8amo254ry-0 start">
<li class="c1">
<span>1st Apr 2013 - </span>
<span>1st Apr 2014 - </span>
<span class="c6">
<a class="c4" href="/link.html">View</a>
</span>
<span>1st Apr 2015 - </span>
</li>
</ul>
</td>
</td>
</tr>
</tbody>
</table>
_DATA_;
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($input);
// Searched value
$searchDate = '1st Apr 2014';
// Find all the spans that are direct children of an li inside the table
$spans = $html->find('table li > span');
// Loop through all the spans
foreach ($spans as $span) {
// If the span starts with the searched text && has a following sibling
if ( strpos($span->plaintext, $searchDate) === 0 && $sibling = $span->next_sibling()) {
// Then print its text content (for the link itself, use $sibling->find('a', 0)->href)
echo $sibling->plaintext; // or ->innertext for raw content
// And stop (if only one result is needed)
break;
}
}
OUTPUT
View
For the string comparison, a regex may be the more robust choice, since it can ignore leading whitespace and letter case.
To build the pattern, add this to the code above:
$pattern = sprintf('~^\s*%s~i', preg_quote($searchDate, '~'));
And then use preg_match to test the match:
if ( preg_match($pattern, $span->plaintext) && $sibling = $span->next_sibling()) {
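
To see what that pattern accepts, here is a small standalone sketch that needs no parser library. The sample strings are made up for illustration; only $searchDate and $pattern come from the answer above.

```php
<?php
// Same pattern as above: optional leading whitespace, then the date, case-insensitive.
$searchDate = '1st Apr 2014';
$pattern = sprintf('~^\s*%s~i', preg_quote($searchDate, '~'));

// Hypothetical span texts, not taken from the real page.
$samples = ['1st Apr 2013 - ', '  1st apr 2014 - ', 'On 1st Apr 2014'];

// Keep only the strings that start with the searched date.
$matched = array_values(array_filter($samples, function ($text) use ($pattern) {
    return preg_match($pattern, $text) === 1;
}));

print_r($matched);
```

Only the second sample survives: the first has a different year and the third does not start with the date.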

I don't know about Simple HTML DOM, but the built-in PHP DOM library should suffice.
Say you have your date in a string like this...
$date = '1st Apr 2014';
You can easily find the corresponding link using an XPath expression. For example
$doc = new DOMDocument();
$doc->loadHTMLFile('link.html');
$xp = new DOMXpath($doc);
$query = sprintf('//span[starts-with(., "%s")]/following-sibling::span/a', $date);
$links = $xp->query($query);
if ($links->length) {
$href = $links->item(0)->getAttribute('href');
}
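
As a self-contained sketch of the same idea, the snippet below loads the markup from a string instead of link.html. The inline markup and the /old.html link are assumptions added for illustration; the [1] predicate and normalize-space() are also my additions, used to pick only the nearest following span and to tolerate leading whitespace.

```php
<?php
// Hypothetical inline markup standing in for link.html.
$markup = <<<HTML
<ul>
  <li>
    <span>1st Apr 2013 - </span>
    <span class="c6"><a class="c4" href="/old.html">View</a></span>
    <span>1st Apr 2014 - </span>
    <span class="c6"><a class="c4" href="/link.html">View</a></span>
  </li>
</ul>
HTML;

$date = '1st Apr 2014';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$xp = new DOMXPath($doc);
// following-sibling::span[1] restricts the match to the nearest following span.
$query = sprintf(
    '//span[starts-with(normalize-space(.), "%s")]/following-sibling::span[1]/a',
    $date
);
$links = $xp->query($query);
$href = $links->length ? $links->item(0)->getAttribute('href') : null;

echo $href, "\n";
```

Note that the date is interpolated into the XPath string as-is, so a value containing a double quote would break the expression.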

include('simple_html_dom.php');
$html = file_get_html('link.html');
$compare_text = "1st Apr 2013";
$tds = $html->find('table',1)->find('span');
$num = 0;
foreach($tds as $td){
if (strpos($td->plaintext, $compare_text) !== false){
$next_td = $td->next_sibling();
foreach($next_td->find('a') as $elm) {
$num = $elm->href;
}
//$day_url = array($day => array(daylink => $day, text => $td->plaintext, link => $num));
echo $td->plaintext. "<br />";
echo $num . "<br />";
}
}

Related

PHP string search and replace - possible use of DOM Needed

I can't seem to figure out how to achieve my goal.
I want to find a link by its class and replace its href, in markup generated from an RSS feed (I need to be able to replace it later, no matter what link is there).
Example HTML:
<a class="epclean1" href="#">
WHAT IT SHOULD LOOK LIKE:
<a class="epclean1" href="google.com">
This may need to use DOM element lookups, since the full PHP already creates a document. If so, I would need to know how to find the element by class and set the href URL that way.
FULL PHP:
<?php
$rss = new DOMDocument();
$feed = array();
$urlArray = array(array('url' => 'https://feeds.megaphone.fm')
);
foreach ($urlArray as $url) {
$rss->load($url['url']);
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue
);
array_push($feed, $item);
}
}
usort( $feed, function ( $a, $b ) {
return strcmp($a['title'], $b['title']);
});
$limit = sizeof($feed);
$previous = null;
$count_firstletters = 0;
for ($x = 0; $x < $limit; $x++) {
$firstLetter = substr($feed[$x]['title'], 0, 1); // Getting the first letter from the Title you're going to print
if($previous !== $firstLetter) { // If the first letter is different from the previous one then output the letter and start the UL
if($count_firstletters != 0) {
echo '</ul>'; // Closing the previously open UL only if it's not the first time
echo '</div>';
}
echo '<button class="glanvillecleancollapsible">'.$firstLetter.'</button>';
echo '<div class="glanvillecleancontent">';
echo '<ul style="list-style-type: none">';
$previous = $firstLetter;
$count_firstletters ++;
}
$title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
echo '<li>';
echo '<a class="epclean'.$i++.'" href="#" target="_blank">'.$title.'</a>';
echo '</li>';
}
echo '</ul>'; // Close the last UL
echo '</div>';
?>
</div>
</div>
The above fullphp shows on site like so (this is shortened as there is 200+):
<div class="modal-glanvillecleancontent">
<span class="glanvillecleanclose">×</span>
<p id="glanvillecleaninstruct">Select the first letter of the episode that you wish to get clean version for:</p>
<br>
<button class="glanvillecleancollapsible">8</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean1" href="#" target="_blank">80's Video Vixen Tawny Kitaen 044</a></li>
</ul>
</div>
<button class="glanvillecleancollapsible">A</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean2" href="#" target="_blank">Abby Stern</a></li>
<li><a class="epclean3" href="#" target="_blank">Actor Nick Hounslow 104</a></li>
<li><a class="epclean4" href="#" target="_blank">Adam Carolla</a></li>
<li><a class="epclean5" href="#" target="_blank">Adrienne Janic</a></li>
</ul>
</div>

You're not very clear about how your question relates to the code shown, but I don't see any attempt to replace the attribute within the DOM code. You'd want to look at XPath to find the desired elements:
function change_clean($content) {
$dom = new DomDocument;
$dom->loadXML($content);
$xpath = new DomXpath($dom);
$nodes = $xpath->query("//a[@class='epclean1']");
foreach ($nodes as $node) {
if ($node->getAttribute("href") === "#") {
$node->setAttribute("href", "https://google.com/");
}
}
return $dom->saveXML();
}
$xml = '<?xml version="1.0"?><foo><bar><a class="epclean1" href="#">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>';
echo change_clean($xml);
Output:
<foo><bar><a class="epclean1" href="https://google.com/">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>
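
Since the page in the question is HTML rather than XML, the same approach can be sketched with loadHTML(). The fragment and the $targets map below are hypothetical, made up to mirror the generated list; they are not part of the original code.

```php
<?php
// Hypothetical fragment modelled on the question's generated list.
$html = '<ul><li><a class="epclean1" href="#">A</a></li>'
      . '<li><a class="epclean2" href="#">B</a></li></ul>';

// Hypothetical map of class name => replacement URL.
$targets = ['epclean1' => 'https://google.com/'];

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($targets as $class => $url) {
    // Overwrite href on every link carrying exactly this class attribute.
    foreach ($xpath->query("//a[@class='$class']") as $a) {
        $a->setAttribute('href', $url);
    }
}

$first  = $xpath->query("//a[@class='epclean1']")->item(0)->getAttribute('href');
$second = $xpath->query("//a[@class='epclean2']")->item(0)->getAttribute('href');
echo $first, "\n", $second, "\n";
```

Here the href is replaced whatever its current value, which matches the "no matter what link is there" requirement; calling $dom->saveHTML() afterwards would serialize the modified document.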

Hmm. I think your pattern and replacement might be your problem.
What you have
$pattern = 'class="epclean1 href="(.*?)"';
$replacement = 'class="epclean1 href="google.com"';
Fix
$pattern = '/class="epclean1" href="[^"]*"/';
$replacement = 'class="epclean1" href="google.com"';

PHP - Get links from within an element after element has been found

I have the following code....
<div class="outer">
<div>
<h1>Christmas</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
<div class="outer">
<div>
<h1>Christmas2</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks2</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
I already know that I can find the DIV and then look inside the DIV for the elements etc by doing...
$doc->loadHTML($output); //$output being the text above
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]'); //Check outer
I know these three lines get the elements within the listed DIVs, but what I really want is to get the text of each [H1] and then display the [li] values next to it.
The output I'm looking for is:
Christmas - Holiday, Fun, Joy
4th July - Fireworks, Happy, Spectral
Christmas2 - Holiday, Fun, Joy
4th July2 - Fireworks, Happy, Spectral

Yes, you can keep using XPath: iterate over the headers and, for each one, take its following sibling, the list. Example:
$doc = new DOMDocument();
$doc->loadHTML($output);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]/div');
if($elements->length > 0) {
foreach($elements as $div) {
foreach ($xpath->query('./h1', $div) as $e) {
$header = $e->nodeValue;
$list = array();
foreach ($xpath->query('./following-sibling::ul[1]/li', $e) as $li) { // [1] = only the nearest following ul
$list[] = $li->nodeValue;
}
echo $header . ' - ' . implode(', ', $list) . '<br/>';
}
echo '<hr/>';
}
}

I've used phpQuery for this type of issue in the past:
// include phpquery
require('phpQuery/phpQuery.php');
// initialize
$doc = phpQuery::newDocumentHTML($markup);
// get the text from the various elements
$h1Value = $doc['h1:first']->text(); // Christmas
// ... etc.
(untested)

how to parse this by simple html dom parser in php

<div id="productDetails" class="tabContent active details">
<span>
<b>Case Size:</b>
</span>
44mm
<br>
<span>
<b>Case Thickness:</b>
</span>
13mm
<br>
<span>
<b>Water Resistant:</b>
</span>
5 ATM
<br>
<span>
<b>Brand:</b>
</span>
Fossil
<br>
<span>
<b>Warranty:</b>
</span>
11-year limited
<br>
<span>
<b>Origin:</b>
</span>
Imported
<br>
</div>
How can I get data like 44mm, Fossil, etc. with a DOM parser in PHP?
I can easily get the whole block with
$data=$html->find('div#productDetails',0)->innertext;
var_dump($data);
but I want to split it into meta_key and meta_value for my SQL table.
I can get the meta_key with
$meta_key=$html->find('div#productDetails span',0)->innertext;
but how do I get the meta value that belongs to it?

It's not that hard, really: a quick search turns up the documentation, which shows how to parse a DOM, which methods select the elements of interest, how to iterate over the DOM, how to get node contents, and so on.
$DOM = new DOMDocument();
$DOM->loadHTML($htmlString);
$spans = $DOM->getElementsByTagName('span');
foreach ($spans as $span)
{
// the label is the span's text; the value is the text node that follows the span
echo trim($span->textContent).' - '.trim($span->nextSibling->nodeValue)."\n";
}
That seems to be what you're after, if I'm not mistaken. This is just off the top of my head, but I think this should output something like:
Case Size: - 44mm
Case Thickness: - 13mm
UPDATE:
Here's a tested solution, that returns the desired result, if I'm not mistaken:
$str = "<div id='productDetails' class='tabContent active details'>
<span>
<b>Case Size:</b>
</span>
44mm
<br>
<span>
<b>Case Thickness:</b>
</span>
13mm
<br>
<span>
<b>Water Resistant:</b>
</span>
5 ATM
<br>
<span>
<b>Brand:</b>
</span>
Fossil
<br>
<span>
<b>Warranty:</b>
</span>
11-year limited
<br>
<span>
<b>Origin:</b>
</span>
Imported
<br>
</div>";
$DOM = new DOMDocument();
$DOM->loadHTML($str);
$txt = implode('',explode("\n",$DOM->textContent));
preg_match_all('/([a-z0-9].*?\:).*?([0-9a-z]+)/im',$txt,$matches);
//or if you don't want to include the colon in your match:
preg_match_all('/([a-z0-9][^:]*).*?([0-9a-z]+)/im',$txt,$matches);
for($i = 0, $j = count($matches[1]);$i<$j;$i++)
{
$matches[1][$i] = preg_replace('/\s+/',' ',$matches[1][$i]);
$matches[2][$i] = preg_replace('/\s+/',' ',$matches[2][$i]);
}
$result = array_combine($matches[1],$matches[2]);
var_dump($result);
//result:
array(6) {
["Case Size:"]=> "44mm"
["Case Thickness:"]=> "13mm"
["Water Resistant:"]=> "5"
["ATM Brand:"]=> "Fossil"
["Warranty:"]=> "11"
["year limited Origin:"]=> "Imported"
}
To insert this in your DB:
// prepare once, then execute for every key/value pair
$stmt = $pdo->prepare('INSERT INTO your_db.your_table (meta_key, meta_value) VALUES (:key, :value)');
foreach($result as $key => $value)
{
$stmt->execute(array('key' => $key, 'value' => $value));
}
Edit
To capture the "11-year limited" substring entirely, you'll need to edit the code above like so:
//replace $txt = implode('',explode("\n",$DOM->textContent));etc... by:
$txt = $DOM->textContent;//leave line-feeds
preg_match_all('/([a-z0-9][^:]*)[^a-z0-9]*([a-z0-9][^\n]+)/im',$txt,$matches);
for($i = 0, $j = count($matches[1]);$i<$j;$i++)
{
$matches[1][$i] = preg_replace('/\s+/',' ',$matches[1][$i]);
$matches[2][$i] = preg_replace('/\s+/',' ',$matches[2][$i]);
}
$matches[2] = array_map('trim',$matches[2]);//remove trailing spaces
$result = array_combine($matches[1],$matches[2]);
var_dump($result);
The output is:
array(6) {
["Case Size"]=> "44mm"
["Case Thickness"]=> "13mm"
["Water Resistant"]=> "5 ATM"
["Brand"]=> "Fossil"
["Warranty"]=> "11-year limited"
["Origin"]=> "Imported"
}
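
If regexes feel brittle here, an alternative is to walk the DOM directly. The following is a minimal sketch, assuming every value is the text node immediately after its span; the shortened markup is made up for illustration and this is not the answerer's tested code.

```php
<?php
// Hypothetical, shortened version of the product details markup.
$str = '<div id="productDetails"><span><b>Case Size:</b></span> 44mm <br>'
     . '<span><b>Warranty:</b></span> 11-year limited <br></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($str);
libxml_clear_errors();

$result = [];
foreach ($dom->getElementsByTagName('span') as $span) {
    // Label: the span's text without surrounding whitespace or the trailing colon.
    $key = rtrim(trim($span->textContent), ':');
    // Value: the text node that directly follows the span, if there is one.
    $next = $span->nextSibling;
    $value = ($next !== null && $next->nodeType === XML_TEXT_NODE)
        ? trim($next->nodeValue)
        : '';
    $result[$key] = $value;
}

print_r($result);
```

Because each value is read as a whole text node, multi-word values such as "11-year limited" come through intact without any pattern tuning.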

You can remove the span tags using the set_callback API.
Try this:
$url = "";
$html = new simple_html_dom();
$html->load_file($url);
$html->set_callback('my_callback');
$elem = $html->find('div[id=productDetails]');
$product_details = array();
$attrib = array( 1 => 'size', 2 => 'thickness', 3 => 'wr', 4 => 'brand', 5 => 'warranty', 6 => 'origin' );
$attrib_string = strip_tags($elem[0]->innertext);
$attrib_arr = explode(' ',$attrib_string); // hope this can help you for every product
// Remove Empty Values
$attrib_arr = array_filter($attrib_arr);
$i = 1;
foreach($attrib_arr as $temp)
{
$product_details[$attrib[$i]] = $temp;
$i++;
}
print_r($product_details);
// remove span tag inside div
function my_callback($element) {
if($element->tag == 'span'){ $element->outertext = ""; }
}

Php Xpath - How to get two information in a child node

How can I get the following information in xpath?
Text 01 - link_1.com
Text 02 - link_2.com
$page = '
<div class="news">
<div class="content">
<div>
<span class="title">Text 01</span>
<span class="link">link_1.com</span>
</div>
</div>
<div class="content">
<div>
<span class="title">Text 02</span>
<span class="link">link_2.com</span>
</div>
</div>
</div>';
@$this->dom->loadHTML($page);
$xpath = new DOMXPath($this->dom);
// perform step #1
$childElements = $xpath->query("//*[@class='content']");
$lista = '';
foreach ($childElements as $child) {
// perform step #2
$textChildren = $xpath->query("//*[@class='title']", $child);
foreach ($textChildren as $n) {
echo $n->nodeValue.'<br>';
}
$linkChildren = $xpath->query("//*[@class='link']", $child);
foreach ($linkChildren as $n) {
echo $n->nodeValue.'<br>';
}
}
My result is returning
Text 01
Text 02
link_1.com
link_2.com
Text 01
Text 02
link_1.com
link_2.com

Replace // with descendant:: in the second and third XPath expressions. A leading // tells XPath to search the whole document rather than the context node you pass in, and $child is NOT a separate XML document. descendant:: matches any descendant of the context node (children, grandchildren, and so on).
@$this->dom->loadHTML($page);
$xpath = new DOMXPath($this->dom);
// perform step #1
$childElements = $xpath->query("//*[@class='content']");
$lista = '';
foreach ($childElements as $child) {
// perform step #2
$textChildren = $xpath->query("descendant::*[@class='title']", $child);
foreach ($textChildren as $n) {
echo $n->nodeValue.'<br>';
}
$linkChildren = $xpath->query("descendant::*[@class='link']", $child);
foreach ($linkChildren as $n) {
echo $n->nodeValue.'<br>';
}
}
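
To get the "Text 01 - link_1.com" layout from the question in one pass, a possible variation (a sketch, not the answerer's code) uses DOMXPath::evaluate() with string(), which returns the text of the first match relative to the context node. The standalone $dom variable replaces $this->dom so that the snippet runs on its own.

```php
<?php
$page = '<div class="news">'
      . '<div class="content"><div><span class="title">Text 01</span>'
      . '<span class="link">link_1.com</span></div></div>'
      . '<div class="content"><div><span class="title">Text 02</span>'
      . '<span class="link">link_2.com</span></div></div>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query("//*[@class='content']") as $child) {
    // The leading dot makes each expression relative to $child.
    $title = $xpath->evaluate("string(.//*[@class='title'])", $child);
    $link  = $xpath->evaluate("string(.//*[@class='link'])", $child);
    $rows[] = "$title - $link";
}

echo implode("\n", $rows), "\n";
```

Here .// is the abbreviated form of a descendant search from the context node, so it behaves like descendant:: for this purpose.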

reading twitter's rss search feed with simple xml

Having some trouble selecting some nodes in the rss feed for twitter's search
the rss url is here
http://search.twitter.com/search.rss?q=twitfile
each item looks like this
<item>
<title>RT @TwittBoy: TwitFile - Comparte tus archivos en Twitter (hasta 200Mb) http://bit.ly/xYNsM</title>
<link>http://twitter.com/MarielaCelita/statuses/5990165590</link>
<description>RT <a href="http://twitter.com/TwittBoy">@TwittBoy</a>: <b>TwitFile</b> - Comparte tus archivos en Twitter (hasta 200Mb) <a href="http://bit.ly/xYNsM">http://bit.ly/xYNsM</a></description>
<pubDate>Mon, 23 Nov 2009 22:45:39 +0000</pubDate>
<guid>http://twitter.com/MarielaCelita/statuses/5990165590</guid>
<author>MarielaCelita@twitter.com (M.Celita Lijerón)</author>
<media:content type="image/jpg" width="48" height="48" url="http://a3.twimg.com/profile_images/537676869/orkut_normal.jpg"/>
<google:image_link>http://a3.twimg.com/profile_images/537676869/orkut_normal.jpg</google:image_link>
</item>
My php is below
foreach ($twitter_xml->channel->item as $key) {
$screenname = $key->{"author"};
$date = $key->{"pubDate"};
$profimg = $key->{"google:image_link"};
$link = $key->{"link"};
$title = $key->{"title"};
echo"
<li>
<h5><a href=$link>$author</a></h5>
<p class=info><a href=$link>$title</a></p>
</li>
";
The problem is that nothing from the RSS feed is being echoed: if there are 20 results, it loops 20 times, but with no data.

In the code, $screenname is assigned a value but you are echoing $author.
To get elements within namespaces, like google:image_link, you will have to do this:
$g = $key->children("http://base.google.com/ns/1.0");
$profimg = $g->{"image_link"};
If you are wondering where "http://base.google.com/ns/1.0" comes from, the namespace is declared in the second line of the RSS feed.
$url="http://search.twitter.com/search.rss?q=twitfile";
$twitter_xml = simplexml_load_file($url);
foreach ($twitter_xml->channel->item as $key) {
$author = $key->{"author"};
$date = $key->{"pubDate"};
$link = $key->{"link"};
$title = $key->{"title"};
$g = $key->children("http://base.google.com/ns/1.0");
$profimg = $g->{"image_link"};
echo"
<li>
<h5><a href=$link>$author</a></h5>
<p class=info><a href=$link>$title</a></p>
</li>
";
$xml = $twitter_xml;
}
This code works.

Set error_reporting(E_ALL); and you'll see that $author isn't defined.
You can't access <google:image_link/> this way, you'll have to use XPath or children()
$key->children("google", true)->image_link;
If you use SimpleDOM, there's a shortcut that returns the first element of an XPath result:
$key->firstOf("google:image_link");
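
Both forms of children() can be tried offline with a small sketch. The feed below is a made-up stand-in for the Twitter RSS (the item values and example.com URLs are assumptions); only the google namespace URI comes from the answers above.

```php
<?php
// Hypothetical miniature feed declaring the same google namespace.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:google="http://base.google.com/ns/1.0">
  <channel>
    <item>
      <title>Example item</title>
      <author>someone@example.com (Someone)</author>
      <google:image_link>http://example.com/avatar.jpg</google:image_link>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);

$images = [];
foreach ($xml->channel->item as $item) {
    // Select namespaced children by namespace URI ...
    $byUri = (string) $item->children('http://base.google.com/ns/1.0')->image_link;
    // ... or by prefix, passing true as the second argument.
    $byPrefix = (string) $item->children('google', true)->image_link;
    $images[] = [$byUri, $byPrefix];
}

print_r($images);
```

The URI form keeps working if a feed changes its prefix, whereas the prefix form depends on the feed continuing to use "google".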

if (!$xml = simplexml_load_file('http://search.twitter.com/search.atom?q='.urlencode ($terms)))
{
throw new RuntimeException('Unable to load or parse search results feed');
}
if (!count($entries = $xml->entry))
{
throw new RuntimeException('No entry found');
}
for($i=0;$i<count($entries);$i++)
{
$title[$i] = $entries[$i]->title;
//etc.. continue description,,,,,
}

I made this and it works :)) $sea_name is the keyword you're looking for...
<?php
function twitter_feed( $sea_name ){
$endpoint = 'http://search.twitter.com/search.rss?q='.urlencode($sea_name); // URL to call
$resp = simplexml_load_file($endpoint);
// Check to see if the response was loaded, else print an error
if ($resp) {
$results = '';
$counter=0;
// If the response was loaded, parse it and build links
foreach($resp->channel->item as $item) {
//var_dump($item);
preg_match("/\((.*?)\)/", $item->author, $blah);
$content = $item->children("http://search.yahoo.com/mrss/" );
$imageUrl = getXmlAttribute( $content, "url" );
echo '
<div class="twitter-item">
<img src="'.$imageUrl.'" />
<span class="twit">'.$blah[1].'</span><br />
<span class="twit-content">'.$item->title.'</span>
<br style="clear:both; line-height:0;margin:0;padding:0;">
</div>';
$counter++;
}
}
// If there was no response, print an error
else {
$results = "Oops! Must not have gotten the response!";
}
echo $results;
}
function getXmlAttribute( SimpleXMLElement $xmlElement, $attribute ) {
foreach( $xmlElement->attributes() as $name => $value ) {
if( $name == $attribute ) {
return (string)$value;
}
}
}
?>
The object will contain something like:
<!-- SimpleXMLElement Object
(
[title] => Before I go to bed, I just want to say I've just seen Peter Kay's CIN cartoon video for the 1st time... one word... WOW.
[link] => http://twitter.com/Alex_Segal/statuses/5993710015
[description] => Before I go to bed, I just want to say I&apos;ve just seen <b>Peter</b> <b>Kay</b>&apos;s CIN cartoon video for the 1st time... one word... WOW.
[pubDate] => Tue, 24 Nov 2009 01:00:00 +0000
[guid] => http://twitter.com/Alex_Segal/statuses/5993710015
[author] => Alex_Segal@twitter.com (Alex Segal)
)
-->
You can use any of these inside the foreach loop and echo them, such as $item->author, $item->link, etc. For any other attributes, you can use the getXmlAttribute function.
