I'm using Simple HTML DOM to scrape data and it has been working well. However, one of my sources doesn't have any unique attributes on the elements I need, so I'm trying to rename them with str_replace and then select the renamed elements with simple_html_dom.
However, it doesn't work. My code is:
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.url.com');
$html = str_replace('<strong>','',$html);
$html = str_replace('</strong>','',$html);
$html = str_replace('<span class="pound">£</span>','',$html);
$html = str_replace('<td>','<td class="myclass">',$html);
foreach($html->find('td.myclass') as $element)
$price = $element->innertext;
$price = preg_replace('/[^(\x20-\x7F)]*/','', $price);
echo $price;
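Try this. str_replace() returns a plain string, so after those calls $html is no longer a simple_html_dom object and find() cannot be called on it. Query the td elements directly and strip the pound sign from each cell's text instead: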
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html( 'http://www.url.com' );
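foreach( $html->find( 'td' ) as $element ) {
// Strip the pound sign, then any remaining characters outside printable ASCII
$price = trim( str_replace( "£", "", $element->plaintext ) );
$price = preg_replace('/[^\x20-\x7F]/','', $price);
// Echo inside the loop so every cell is printed, not just the last one
echo $price . "<br>";
}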
?>
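If you would rather avoid the third-party library, the same idea can be sketched with PHP's built-in DOMDocument. This is only a minimal sketch: the sample markup below is made up to stand in for the scraped page, and the regex assumes the prices contain nothing but digits and a decimal point.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$markup = '<table><tr>'
        . '<td><strong><span class="pound">&pound;</span>12.50</strong></td>'
        . '<td><strong><span class="pound">&pound;</span>3.99</strong></td>'
        . '</tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$dom->loadHTML($markup);
libxml_clear_errors();

$prices = array();
foreach ($dom->getElementsByTagName('td') as $cell) {
    // textContent drops the <strong>/<span> tags; the regex keeps digits and dots only.
    $prices[] = preg_replace('/[^0-9.]/', '', $cell->textContent);
}

print_r($prices);
```

Because the tags are never rewritten, there is no need for the str_replace calls at all.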
I want to grab the "What's New" text from WhatsApp's Play Store page. I am trying the code below, and it works:
<?php
$url = 'https://play.google.com/store/apps/details?id=com.whatsapp&hl=en';
$content = file_get_contents($url);
$first_step = explode( '<div class="recent-change">' , $content );
$second_step = explode("</div>" , $first_step[1] );
echo $second_step[0];
?>
The issue is that this code only shows the text from the first div with the recent-change class. The page has multiple divs with that class name. How can I get the content of all of them?
As already suggested in the comments, a DOM parser is the better tool for this. But if you want to display the text of every div with the recent-change class using your current approach, you can use a loop. Here is a solution in the same style as your code:
$url = 'https://play.google.com/store/apps/details?id=com.whatsapp&hl=en';
$content = file_get_contents($url);
$first_step = explode( '<div class="recent-change">' , $content );
foreach ($first_step as $key => $value) {
if($key > 0)
{
$second_step = explode("</div>" , $value );
echo $second_step[0];
echo "<br>";
}
}
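For comparison, the DOM-based approach mentioned in the comments could look roughly like this. It is a sketch using PHP's built-in DOMDocument and DOMXPath. The sample markup is invented so the snippet runs without a network request; in practice $content would come from file_get_contents($url) as above.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$content = '<div class="recent-change">Bug fixes</div>'
         . '<div class="other">Ignore me</div>'
         . '<div class="recent-change">New emoji</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real pages are rarely valid HTML
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match every div whose class list contains the token "recent-change".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " recent-change ")]';

$changes = array();
foreach ($xpath->query($query) as $div) {
    $changes[] = trim($div->textContent);
}

echo implode("<br>", $changes);
```

Unlike explode(), this keeps working if the div gains extra classes or contains nested tags.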
I've found a great tutorial on how to accomplish most of the work at:
https://www.developphp.com/video/PHP/simpleXML-Tutorial-Learn-to-Parse-XML-Files-and-RSS-Feeds
but I can't understand how to extract media:content images from the feeds. I've read as much info as I can find, but I'm still stuck.
For example, the question "How to get media:content with SimpleXML" suggests using:
foreach ($xml->channel->item as $news){
$ns_media = $news->children('http://search.yahoo.com/mrss/');
echo $ns_media->content; // displays "<media:content>"
}
but I can't get it to work.
Here's my script and the feed I'm trying to parse:
<?php
$html = "";
$url = "http://rssfeeds.webmd.com/rss/rss.aspx?RSSSource=RSS_PUBLIC";
$xml = simplexml_load_file($url);
for($i = 0; $i < 10; $i++){
$title = $xml->channel->item[$i]->title;
$link = $xml->channel->item[$i]->link;
$description = $xml->channel->item[$i]->description;
$pubDate = $xml->channel->item[$i]->pubDate;
$html .= "<a href='$link'><h3>$title</h3></a>";
$html .= "$description";
$html .= "<br />$pubDate<hr />";
}
echo $html;
?>
I don't know where to add this code in the script to make it work. Honestly, I've browsed for hours, but couldn't find a working script that parses media:content.
Can someone help with this?
========================
UPDATE:
Thanks to fusion3k, I got the final code working:
<?php
$html = "";
$url = "http://rssfeeds.webmd.com/rss/rss.aspx?RSSSource=RSS_PUBLIC";
$xml = simplexml_load_file($url);
for($i = 0; $i < 5; $i++){
$image = $xml->channel->item[$i]->children('media', True)->content->attributes();
$title = $xml->channel->item[$i]->title;
$link = $xml->channel->item[$i]->link;
$description = $xml->channel->item[$i]->description;
$pubDate = $xml->channel->item[$i]->pubDate;
$html .= "<img src='$image' alt='$title'>";
$html .= "<a href='$link'><h3>$title</h3></a>";
$html .= "$description";
$html .= "<br />$pubDate<hr />";
}
echo $html;
?>
Basically all I needed was this simple line:
$image = $xml->channel->item[$i]->children('media', True)->content->attributes();
Can't believe it was so hard for a non-techie to find this info online after reading dozens of posts and articles. Well, I hope this will serve other folks like me :)
To get the 'url' attribute, use the ->attributes() syntax:
$ns_media = $news->children('http://search.yahoo.com/mrss/');
/* Echoes 'url' attribute: */
echo $ns_media->content->attributes()['url'];
// in PHP < 5.4: $attr = $ns_media->content->attributes(); echo $attr['url'];
/* Stores the 'url' attribute as a string: */
$url = $ns_media->content->attributes()['url']->__toString();
// in PHP < 5.4: $attr = $ns_media->content->attributes(); $url = $attr['url']->__toString();
Namespaces explanation:
The argument to ->children() is not the URL of your XML; it is a Namespace URI.
XML namespaces are used to provide uniquely named elements and attributes in an XML document:
<xxx>       Standard XML tag
<yyy:zzz>   Namespaced tag
 └┬┘ └┬┘
  │   └──── Element Name
  └──────── Element Prefix (Namespace Identifier)
So, in your case, <media:content> is the “content” element of the namespace “media”. Namespaced elements must have an associated Namespace URI, declared as an attribute of a parent node or, most commonly, of the root element. This attribute has the form xmlns:yyy="NamespaceURI" (in your case, xmlns:media="http://search.yahoo.com/mrss/" as an attribute of the root node <rss>).
Ultimately, the above $news->children( 'http://search.yahoo.com/mrss/' ) means “retrieve all child elements with http://search.yahoo.com/mrss/ as Namespace URI”. An alternative, more readable syntax is $news->children( 'media', True ) (True means “treat the first argument as a prefix”).
Returning to the example code, the generic syntax to retrieve all of the first item's children with the prefix media is:
$xml = simplexml_load_file( 'http://rssfeeds.webmd.com/rss/rss.aspx?RSSSource=RSS_PUBLIC' );
$xml->channel->item[0]->children( 'http://search.yahoo.com/mrss/' );
or (identical result):
$xml = simplexml_load_file( 'http://rssfeeds.webmd.com/rss/rss.aspx?RSSSource=RSS_PUBLIC' );
$xml->channel->item[0]->children( 'media', True );
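To try both forms without hitting the live feed, here is a small self-contained sketch. The inline feed and its example.com image URLs are made up for illustration; only the namespace URI is taken from the real feed.

```php
<?php
// Hypothetical inline feed standing in for the real one.
$feed = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>First story</title>
      <media:content url="http://example.com/one.jpg" medium="image" />
    </item>
    <item>
      <title>Second story</title>
      <media:content url="http://example.com/two.jpg" medium="image" />
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed);

// Form 1: select the namespaced children by prefix.
$images = array();
foreach ($xml->channel->item as $item) {
    $media = $item->children('media', true);
    // Cast to string, otherwise a SimpleXMLElement object is stored.
    $images[] = (string) $media->content->attributes()['url'];
}

// Form 2: select the same children by Namespace URI.
$byUri = $xml->channel->item[0]->children('http://search.yahoo.com/mrss/');
$firstByUri = (string) $byUri->content->attributes()['url'];

print_r($images);
echo $firstByUri . "\n";
```

The URI form keeps working even if a feed declares the same namespace under a different prefix, which is why it is the safer of the two.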
Your new code:
If you want to show the <media:content url> thumbnail for each element in your page, modify the original code in this way:
(...)
$pubDate = $xml->channel->item[$i]->pubDate;
$image = $xml->channel->item[$i]->children( 'media', True )->content->attributes()['url'];
// in PHP < 5.4:
// $attr = $xml->channel->item[$i]->children( 'media', True )->content->attributes();
// $image = $attr['url'];
$html .= "<a href='$link'><h3>$title</h3></a>";
$html .= "<img src='$image' alt='$title'>";
(...)
Simple example for newbs like me:
$url = "https://www.youtube.com/feeds/videos.xml?channel_id=UCwNPPl_oX8oUtKVMLxL13jg";
$rss = simplexml_load_file($url);
foreach($rss->entry as $item) {
$time = $item->published;
$time = date('Y-m-d \ H:i', strtotime($time));
$media_group = $item->children( 'media', true );
$title = $media_group->group->title;
$description = $media_group->group->description;
$views = $media_group->group->community->statistics->attributes()['views'];
// Echo inside the loop so every entry is printed, not just the last one
echo $time . ' :: ' . $title . '<br>' . $description . '<br>' . $views . '<br>';
}
I have a theme that I edited, and it is causing very high load on my server.
First I used this code to get only the text from the content:
$response = get_the_content();
$content = $response;
$content = preg_replace("/(<)([img])(\w+)([^>]*>)/", "", $content);
$content = apply_filters('the_content', $content);
$content = str_replace(']]>', ']]&gt;', $content);
Then I wanted to fetch all images inside my post content, so I used DOMDocument:
$document = new DOMDocument();
libxml_use_internal_errors(true);
$document->loadHTML($response);
libxml_clear_errors();
$images = array();
$imgsq = $document->getElementsByTagName('img');
Every post contains a static part that appears on all photo pages, so I used this code to extract it:
function findit($mytext,$starttag,$endtag) {
$posLeft = stripos($mytext,$starttag)+strlen($starttag);
$posRight = stripos($mytext,$endtag,$posLeft+1);
return substr($mytext,$posLeft,$posRight-$posLeft);
}
$project = @findit($content , '-projectinfostart-' , '-projectinfoend-');
$check = str_replace('ializer-buttons clearfix">', '', $project);
if($project != $check) $project = '';
$replace = array('-projectinfostart-' , '-projectinfoend-' , $project , '<p> </p>');
$content = str_replace( $replace, '', $content);
Finally, I wanted to get all photos in thumbnail size, so I used this code:
foreach($imgsq as $key => $img) :
// Extract what we want
$image = array('src' => $img->getAttribute('src') );
if( ! $image['src'])
continue;
if($key == $page) :
echo '<center><img style="height: auto !important;max-height:450px;" class="responsiveMe" src=" ' . $image['src'] . '" /> ';
endif;
$srcs[$key] = array();
$srcs[$key]['src'] = wp_get_attachment_thumb_url(get_attachment_id_by_url($image['src']));
$srcs[$key]['full'] = $image['src'];
if(!empty($project)) $description = '<p>' . $project . '</p>';
else $description = '';
endforeach;
My server has 16 GB of RAM and can't cope with 1500 online users on a single post page! Any ideas about what is causing this high load?
Thanks.
I am currently using PHP's file_get_contents($url) to fetch content from a URL. After getting the contents I need to inspect the given HTML chunk, find a 'select' that has a given name attribute, and extract its options along with their values and text. I am not sure how to go about this. I can use the simple_html_dom class to parse the HTML, but how do I get a particular 'select' with the name 'union'?
<span class="d3-box">
<select name='union' class="blockInput" >
<option value="">Select a option</option> ..
The page can have multiple 'select' boxes, so I need to look it up specifically by its name attribute.
<?php
include_once("simple_html_dom.php");
$htmlContent = file_get_contents($url);
foreach($htmlContent->find(byname['union']) as $element)
echo 'option : value';
?>
Any sort of help is appreciated. Thank you in advance.
Try this PHP code:
<?php
require_once dirname(__FILE__) . "/simple_html_dom.php";
$url = "Your link here";
$htmlContent = str_get_html(file_get_contents($url));
foreach ($htmlContent->find("select[name='union'] option") as $element) {
$option = $element->plaintext;
$value = $element->getAttribute("value");
echo $option . ":" . $value . "<br>";
}
?>
How about this:
$htmlContent = file_get_html('your url');
$htmlContent->find('select[name="union"]');
Or, in an object-oriented way:
$html = new simple_html_dom();
// load_file() fills $html itself; its return value is not the DOM object
$html->load_file('your url');
$html->find('select[name="union"]');
From the DOMDocument documentation: http://www.php.net/manual/en/class.domdocument.php
$html = file_get_contents( $url );
$dom = new DOMDocument();
$dom->loadHTML( $html );
$selects = $dom->getElementsByTagName( 'select' );
$select = $selects->item(0);
// Assuming all children are options.
$children = $select->childNodes;
$options_values = array();
for ( $i = 0; $i < $children->length; $i++ )
{
$item = $children->item( $i );
$options_values[] = $item->nodeValue;
}
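Note that the snippet above takes the first select on the page, and childNodes can also include whitespace text nodes. A variation that targets the select by its name attribute could use DOMXPath. This is a sketch: the sample markup is invented, based on the fragment in the question, and the second select and the "u1" option are hypothetical.

```php
<?php
// Hypothetical sample markup based on the fragment in the question.
$html = '<span class="d3-box">'
      . '<select name="other"><option value="x">Other</option></select>'
      . '<select name="union" class="blockInput">'
      . '<option value="">Select a option</option>'
      . '<option value="u1">Union One</option>'
      . '</select></span>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Only <option> elements directly inside the select named "union" are matched.
$options = array();
foreach ($xpath->query('//select[@name="union"]/option') as $option) {
    $options[$option->getAttribute('value')] = trim($option->textContent);
}

print_r($options);
```

The result maps each option's value attribute to its visible text, and other select boxes on the page are ignored.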
I'm using simple_html_dom [ http://sourceforge.net/projects/simplehtmldom/ ] to parse through HTML.
I'm trying to get all of the <script> urls, grab the contents, and then replace it in the $html variable... I have this and it almost works like I want:
$html_elements = str_get_html( $html );
$current_src = array( );
$new_src = array( );
foreach($html_elements->find('script') as $element) {
if( $element->src != '' )
{
$script_url = $element->src;
$script_data = get_script( $script_url );
$current_src[] = $element->outertext;
$new_src[] = "<script>" . $element->innertext . "\n" . $script_data . "</script>";
}
}
$html = str_replace( $current_src, $new_src, $html );
function get_script( $url )
{
$data = file_get_contents( $url );
return $data;
}
The problem is that it seems to be turning the plus signs in the JavaScript files into spaces when it's all said and done. Why?
Please refer to the comment section above.
After further debugging, I found that I was passing the data through urldecode() one too many times later on in the code.
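The effect is easy to reproduce in isolation. The short sketch below uses a made-up line of JavaScript to show how a second urldecode() call turns a literal plus sign into a space:

```php
<?php
// Hypothetical line of JavaScript containing a plus sign.
$script = 'var total = a+b;';

// Encode once, as would happen when the data is passed in a URL.
$encoded = urlencode($script);

// Decoding once restores the original text, plus sign included.
$once = urldecode($encoded);

// Decoding a second time treats the literal "+" as an encoded space.
$twice = urldecode($once);

echo $once . "\n";
echo $twice . "\n";
```

If a value may already be decoded, rawurldecode() is a gentler choice, since it leaves plus signs untouched.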