Parsing RSS news not working - php

I would like to parse news titles and links from the following RSS page:
http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE
I have tried using this code (but it's not working):
<?php
$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
However the same code is working to get RSS titles and links from other RSS pages.. for example:
<?php
$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
Do you have any idea on how to make it work?
Thanks in advance!

Downloading Remote Documents
The problem is that you are trying to download remote document with DOMDocument::load. The method is capable of downloading remote files, but it doesn't set the User-Agent HTTP header, if it is not specified via user_agent INI setting. Some hosts are configured to reject HTTP requests, if the User-Agent header is absent. And the URL you pasted into the question returns 403 Forbidden, if the header is missing.
So you should either set user agent via INI settings:
ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);
or download the document manually with User-Agent header set, e.g.:
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadXML($xml);
Traversing the DOM
The next problem with your code is that you are fully relying on specific DOM structure:
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
The are many possible cases where you code will not work as expected: less than 5 items, missing elements, empty document, etc. Besides, the code is not very readable. You should always check if the node exists before going deeper into its structure, e.g.:
$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
// Print channel properties
foreach ($channel->childNodes as $child) {
if ($child->nodeType !== XML_ELEMENT_NODE) {
continue;
}
switch ($child->nodeName) {
case 'title':
echo "Title: ", $child->nodeValue, PHP_EOL;
break;
case 'description':
echo "Description: ", $child->nodeValue, PHP_EOL;
break;
}
}
}
You can parse the item elements in similar manner:
$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
// ...
}

They have security in place when no user agent is set so you'll have to use curl and fake an user agent to get the xml content eg:
$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$xml = curl_exec($ch);

Related

Error when trying to get Instagram Embed page HTML code

I'm trying to get the HTML Code of the Instagram's Embed pages for my API, but it returns me a strange error and I do not know what to do now, because I'm new to PHP. The code works on other websites.
I tried it already on other websites like apple.com and the strange thing is that when I call this function on the 'normal' post page it works, the error only appears when I call it on the '/embed' URL.
This is my PHP Code:
<?php
if (isset($_GET['url'])) {
$filename = $_GET['url'];
$file = file_get_contents($filename);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file);
libxml_use_internal_errors(false);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
for ($i = 0; $i < $body->children->length; $i++) {
$body->remove($body->children->item($i));
}
$stringbody = $dom->saveHTML($body);
echo $stringbody;
}
?>
I call the API like this:
https://api.com/get-website-body.php?url=http://instagr.am/p/BoLVWplBVFb/embed
My goal is to get the body of the website, like it is when I call this code on the https://apple.com URL for example.
You can use direct url to scrape the data if you use CURL and its faster than file_get_content. Here is the curl code for different urls and this will scrape the body data alone.
if (isset($_GET['url'])) {
// $website_url = 'https://www.instagram.com/instagram/?__a=1';
// $website_url = 'https://apple.com';
// $website_url = $_GET['url'];
$website_url = 'http://instagr.am/p/BoLVWplBVFb/embed';
$curl = curl_init();
//curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $website_url);
curl_setopt($curl, CURLOPT_REFERER, $website_url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0(Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/66.0');
$str = curl_exec($curl);
curl_close($curl);
$json = json_decode($str, true);
print_r($str); // Just taking tha page as it is
// Taking body part alone and play as your wish
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($str);
libxml_use_internal_errors(false);
$bodies = $dom->getElementsByTagName('body');
foreach ($bodies as $key => $value) {
print_r($value);// You will all content of body here
}
}
NOTE: Here you don't want to use https://api.com/get-website-body.php?url=....

How to get reviews from Google Business using CURL PHP

I'm trying to get reviews in Google Business. The goal is to get access via curl and then get value from pane.rating.moreReviews label jsaction.
How I can fix code below to get curl?
function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = curl("https://www.google.com/maps?cid=12909283986953620003");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'pane.rating.moreReviews';
$nodes = $finder->query("//*[contains(#jsaction, '$classname')]");
foreach ($nodes as $node) {
$check_reviews = $node->nodeValue;
$ses_key = preg_replace('/[^0-9]+/', '', $check_reviews);
}
// result should be: 166
echo $ses_key;
If I try do var_dump($html);, I'm getting:
string(348437) " "
And this number is changing on each page refresh.
Get Google-Reviews with PHP cURL & without API Key
How to find the CID - If you have the business open in Google Maps:
Do a search in Google Maps for the business name
Make sure it’s the only result that shows up.
Replace http:// with view-source: in the URL
Click CTRL+F and search the source code for “ludocid”
CID will be the numbers after “ludocid\u003d” and till the last number
or use this tool: https://ryanbradley.com/tools/google-cid-finder/
Example
ludocid\\u003d16726544242868601925\
HINT: Use the class ".quote" in you CSS to style the output
The PHP
<?php
/*
💬 Get Google-Reviews with PHP cURL & without API Key
=====================================================
How to find the CID - If you have the business open in Google Maps:
- Do a search in Google Maps for the business name
- Make sure it’s the only result that shows up.
- Replace http:// with view-source: in the URL
- Click CTRL+F and search the source code for “ludocid”
- CID will be the numbers after “ludocid\\u003d” and till the last number
or use this tool: https://pleper.com/index.php?do=tools&sdo=cid_converter
Example
-------
```TXT
ludocid\\u003d16726544242868601925\
```
> HINT: Use the class ".quote" in you CSS to style the output
###### Copyright 2019 Igor Gaffling
*/
$cid = '16726544242868601925'; // The CID you want to see the reviews for
$show_only_if_with_text = false; // true OR false
$show_only_if_greater_x = 0; // 0-4
$show_rule_after_review = false; // true OR false
/* ------------------------------------------------------------------------- */
$ch = curl_init('https://www.google.com/maps?cid='.$cid);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla / 5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko / 20070725 Firefox / 2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
$result = curl_exec($ch);
curl_close($ch);
$pattern = '/window\.APP_INITIALIZATION_STATE(.*);window\.APP_FLAGS=/ms';
if ( preg_match($pattern, $result, $match) ) {
$match[1] = trim($match[1], ' =;'); // fix json
$reviews = json_decode($match[1]);
$reviews = ltrim($reviews[3][6], ")]}'"); // fix json
$reviews = json_decode($reviews);
//$customer = $reviews[0][1][0][14][18];
//$reviews = $reviews[0][1][0][14][52][0];
$customer = $reviews[6][18]; // NEW IN 2020
$reviews = $reviews[6][52][0]; // NEW IN 2020
}
if (isset($reviews)) {
echo '<div class="quote"><strong>'.$customer.'</strong><br>';
foreach ($reviews as $review) {
if ($show_only_if_with_text == true and empty($review[3])) continue;
if ($review[4] <= $show_only_if_greater_x) continue;
for ($i=1; $i <= $review[4]; ++$i) echo '⭐'; // RATING
if ($show_blank_star_till_5 == true)
for ($i=1; $i <= 5-$review[4]; ++$i) echo '☆'; // RATING
echo '<p>'.$review[3].'<br>'; // TEXT
echo '<small>'.$review[0][1].'</small></p>'; // AUTHOR
if ($show_rule_after_review == true) echo '<hr size="1">';
}
echo '</div>';
}
Source: https://github.com/gaffling/PHP-Grab-Google-Reviews
Please try below code
$html = curl("https://maps.googleapis.com/maps/api/place/details/json?cid=12909283986953620003&key=<google_apis_key>", "Mozilla 5.0");
$datareview = json_decode($html);// get all data in array
Ex. : http://meetingwords.com/QiIN1vaIuY
It will work for you.
Create Google Key From google console developer
https://developers.google.com/maps/documentation/embed/get-api-key

DOM structure, get element by attribute name/value

I see a lot of answers on SO that pertain to the question but either there are slight differences that I couldn't overcome or maybe i just couldn't repeat the processes shown.
What I am trying to accomplish is to use CURL to get the HTML from a Google+ business page, iterate over the HTML and for each review of the business scrape the reviews HTML for display on that businesses non google+ webpage.
Every review shares this parent div structure:
<div class="ZWa nAa" guidedhelpid="userreviews"> .....
Thus i am trying to do a foreach loop based on finding and grabbing the div and innerhtml for each div with attribute: guidehelpid="userreviews"
I am succesfully getting the HTML back via CURL and can parse it when targeting a standard TAG name like "a" or if it had an ID, but iterating over the HTML using the PHP default parser when looking for a attribute name is problematic:
How can I take this successful code below and make it work like intended as shown in the second code which of course is wrong?
WORKING CODE (Finds,gets, echo's all "a" tags in $output)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
#$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->getAttribute('href');
echo "<br />";}
THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
#$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('div[guidehelpid=userreviews]') as $review) {
echo $review;
echo "<br />"; }
Any help i correcting this would be appreciated. I would prefer not to use "simple_html_dom" if I can accomplish this without it.
I suggest and you could use an DOMXpath in this case too. Example:
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$review = $xpath->query('//div[#guidedhelpid="userreviews"]');
if($review->length > 0) { // if it exists
echo $review->item(0)->nodeValue;
// echoes
// John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
}

XML PHP, filter out specific "type"

I'm extracting the required nodes from the api easily from the response here : https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels
However, i am trying to get the target of relation type discogs from the below with no luck, any help appreciated
<relation-list target-type="url">
<relation type-id="6578f0e9-1ace-4095-9de8-6e517ddb1ceb" type="wikipedia">
<target id="cba6f347-45b3-4bd5-8b7f-818fa76a63b2">
http://en.wikipedia.org/wiki/Insineratehymn
</target>
</relation>
<relation type-id="99e550f3-5ab4-3110-b5b9-fe01d970b126" type="discogs">
<target id="e9e6e840-ca1a-47fa-b835-47d0d86bcda8">
http://www.discogs.com/master/316106
</target>
</relation>
<relation type-id="b988d08c-5d86-4a57-9557-c83b399e3580" type="wikidata">
<target id="f9b00d41-6f25-4174-9c72-ba1c452cfa6d">
http://www.wikidata.org/wiki/Q1932481
</target>
</relation>
</relation-list>
Just made a simple code:
$xml = simplexml_load_file('https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels');
foreach($xml->{'release-group'}->{'relation-list'}->relation as $relation) {
if($relation['type'] == 'discogs') {
$link = $relation->target;
}
}
echo $link;
Output:
http://www.discogs.com/master/316106
You can use ->attributes() method to check whether that particular tag contains the desired attribute that you want. Example:
// without xpath
$xml = simplexml_load_file('https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels');
foreach($xml->{'release-group'}->{'relation-list'}->{'relation'} as $relation) {
if($relation->attributes()['type'] == 'discogs') { // dereference (5.4 or above)
echo (string) $relation->target; // http://www.discogs.com/master/316106
}
}
Or you can also use xpath and register the namespace inside the xml, and target it directly:
$xml->registerXPathNamespace('r', 'http://musicbrainz.org/ns/mmd-2.0#'); // registers xmlns (namespace)
$discogs = (string) $xml->xpath('//r:release-group/r:relation-list/r:relation[#type="discogs"]/r:target')[0];
echo $discogs; // http://www.discogs.com/master/316106
Check this
<?php
$url = 'https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$result= curl_exec ($ch);
curl_close ($ch);
$res = simplexml_load_string($result);
//echo "<pre>";
//print_r($res);
$listing = $res->{'release-group'}->{'relation-list'}->relation;
// full listing
foreach($listing as $list)
{
echo $list['type']."<br>"; // full listing
}
//Particular url
foreach($listing as $list)
{
if($list['type'] == 'discogs')
{
echo $list->target; // Particular url
}
}
exit;
Output
wikipedia
discogs
wikidata
http://www.discogs.com/master/316106
The XML uses a namespace. DOM+Xpath allows you to take this in account and to fetch the string value directly:
$dom = new DOMDocument();
$dom->load(
'https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels'
);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('mb', "http://musicbrainz.org/ns/mmd-2.0#");
$expression = 'string(
//mb:release-group
/mb:relation-list
/mb:relation[#type = "discogs"]
/mb:target
)';
var_dump($xpath->evaluate($expression));
Output:
string(36) "http://www.discogs.com/master/316106"

get information from html table using Curl

i need to get some information about some plants and put it into mysql table.
My knowledge on Curl and DOM is quite null, but i've come to this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
now, i'm really lost about how to move in from this point, can you please guide me?
Thanks!
This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one time extraction and not a rolling project, personally I'd use quick and dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = "/<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>/si";
You'll need to do some pre and post cleaning before it will get them all though.
I don't envy you having this task...
Your best bet will be to wrape this in php ;)
Yes, this is a ugly hack for a ugly html code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>
Use Simple Html Dom and you would be able to access any element/element's content you wish. Their api is very straightforward.
you can try somthing like this.
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows and rows which are not headers
$table_rows = $xpath->query('//table[#id="tbl-all-product-view"]/tr[#class!="rowH"]');
foreach($table_rows as $row => $tr) {
foreach($tr->childNodes as $td) {
$data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
}
$data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>

Categories