I would like to parse news titles and links from the following RSS page:
http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE
I have tried using this code (but it's not working):
<?php
$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
However the same code is working to get RSS titles and links from other RSS pages.. for example:
<?php
$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
Do you have any idea on how to make it work?
Thanks in advance!
Downloading Remote Documents
The problem is that you are trying to download a remote document with DOMDocument::load. The method can download remote files, but it doesn't send a User-Agent HTTP header unless one is specified via the user_agent INI setting. Some hosts are configured to reject HTTP requests that lack a User-Agent header, and the URL you pasted into the question returns 403 Forbidden when the header is missing.
So you should either set the user agent via the INI setting:
ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);
or download the document manually with the User-Agent header set, e.g.:
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadXML($xml);
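Note that curl_exec can return false, and a rejected request can return an HTML error page instead of XML, so it may be worth guarding the parse step. A minimal sketch, assuming a hypothetical helper named parseXmlOrNull (not part of DOMDocument) and made-up sample strings:

```php
<?php
// Hypothetical helper: parse an XML string and return null instead of
// emitting warnings when the input is empty or not well-formed XML.
function parseXmlOrNull(string $xml): ?DOMDocument
{
    if (trim($xml) === '') {
        return null; // e.g. the download failed and produced nothing
    }
    $previous = libxml_use_internal_errors(true); // collect errors silently
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting
    return $ok ? $doc : null;
}

// Sample inputs, invented for illustration.
$good = parseXmlOrNull('<rss><channel><title>News</title></channel></rss>');
$bad  = parseXmlOrNull('<html><body>403 Forbidden');
var_dump($good !== null, $bad === null);
```

With the curl code above you would call it as parseXmlOrNull(is_string($xml) ? $xml : '') and stop early when it returns null.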
Traversing the DOM
The next problem with your code is that you rely entirely on a specific DOM structure:
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
There are many cases where your code will not work as expected: fewer than six items (the loop runs from 0 to 5), missing elements, an empty document, etc. Besides, the code is not very readable. You should always check that a node exists before going deeper into its structure, e.g.:
$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
// Print channel properties
foreach ($channel->childNodes as $child) {
if ($child->nodeType !== XML_ELEMENT_NODE) {
continue;
}
switch ($child->nodeName) {
case 'title':
echo "Title: ", $child->nodeValue, PHP_EOL;
break;
case 'description':
echo "Description: ", $child->nodeValue, PHP_EOL;
break;
}
}
}
You can parse the item elements in a similar manner:
$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
// ...
}
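Putting the checks together for the titles and links asked about in the question, a rough sketch could look like this. The function name extractItems and the sample feed are made up for illustration:

```php
<?php
// Sketch: collect up to $limit title/link pairs, skipping incomplete items.
// extractItems() is a hypothetical helper, not a DOM API.
function extractItems(DOMDocument $doc, int $limit = 5): array
{
    $result = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        if (count($result) >= $limit) {
            break;
        }
        $title = $item->getElementsByTagName('title')->item(0);
        $link  = $item->getElementsByTagName('link')->item(0);
        if ($title === null || $link === null) {
            continue; // element missing: skip the item instead of failing
        }
        $result[] = [
            'title' => trim($title->textContent),
            'link'  => trim($link->textContent),
        ];
    }
    return $result;
}

// Invented sample feed: the second item has no <link>.
$doc = new DOMDocument();
$doc->loadXML(
    '<rss><channel>'
    . '<item><title>First</title><link>http://example.com/1</link></item>'
    . '<item><title>No link</title></item>'
    . '</channel></rss>'
);
$items = extractItems($doc);
print_r($items);
```

Because the loop is driven by the nodes that actually exist, a feed with fewer items than the limit simply yields a shorter array.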
They block requests that don't send a user agent, so you'll have to use curl and send a browser-like user agent to get the XML content, e.g.:
$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$xml = curl_exec($ch);
I'm trying to get the HTML Code of the Instagram's Embed pages for my API, but it returns me a strange error and I do not know what to do now, because I'm new to PHP. The code works on other websites.
I tried it already on other websites like apple.com and the strange thing is that when I call this function on the 'normal' post page it works, the error only appears when I call it on the '/embed' URL.
This is my PHP Code:
<?php
if (isset($_GET['url'])) {
$filename = $_GET['url'];
$file = file_get_contents($filename);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file);
libxml_use_internal_errors(false);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
for ($i = 0; $i < $body->children->length; $i++) {
$body->remove($body->children->item($i));
}
$stringbody = $dom->saveHTML($body);
echo $stringbody;
}
?>
I call the API like this:
https://api.com/get-website-body.php?url=http://instagr.am/p/BoLVWplBVFb/embed
My goal is to get the body of the website, like it is when I call this code on the https://apple.com URL for example.
You can fetch the URL directly with cURL, which is also faster than file_get_contents. Here is cURL code that works for different URLs and extracts only the body:
if (isset($_GET['url'])) {
// $website_url = 'https://www.instagram.com/instagram/?__a=1';
// $website_url = 'https://apple.com';
// $website_url = $_GET['url'];
$website_url = 'http://instagr.am/p/BoLVWplBVFb/embed';
$curl = curl_init();
//curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $website_url);
curl_setopt($curl, CURLOPT_REFERER, $website_url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0(Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/66.0');
$str = curl_exec($curl);
curl_close($curl);
$json = json_decode($str, true);
print_r($str); // Just printing the page as it is
// Taking body part alone and play as your wish
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($str);
libxml_use_internal_errors(false);
$bodies = $dom->getElementsByTagName('body');
foreach ($bodies as $key => $value) {
print_r($value); // You will get all the content of the body here
}
}
NOTE: Here you don't want to use https://api.com/get-website-body.php?url=....
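If the goal is the body markup as a string rather than a dump of the DOMElement object, one option is to serialize the children of the body node. A minimal sketch, assuming a hypothetical helper named bodyInnerHtml and a made-up sample page:

```php
<?php
// Sketch: return the markup inside <body> as a string.
// bodyInnerHtml() is a hypothetical helper name, not a DOM API.
function bodyInnerHtml(string $html): string
{
    if (trim($html) === '') {
        return ''; // nothing to parse (e.g. the download failed)
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child); // serialize each child node
    }
    return $inner;
}

// Invented sample page.
$sample = '<html><head><title>t</title></head><body><p>Hello</p></body></html>';
echo bodyInnerHtml($sample), PHP_EOL;
```

In the code above you would pass $str (the curl_exec result) instead of the sample string.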
I'm trying to get reviews from Google Business. The goal is to fetch the page via curl and then read the value from the element whose jsaction attribute contains pane.rating.moreReviews.
How can I fix the code below to get it working with curl?
function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = curl("https://www.google.com/maps?cid=12909283986953620003");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'pane.rating.moreReviews';
$nodes = $finder->query("//*[contains(@jsaction, '$classname')]");
foreach ($nodes as $node) {
$check_reviews = $node->nodeValue;
$ses_key = preg_replace('/[^0-9]+/', '', $check_reviews);
}
// result should be: 166
echo $ses_key;
If I try var_dump($html);, I get:
string(348437) " "
And this number is changing on each page refresh.
Get Google-Reviews with PHP cURL & without API Key
How to find the CID - If you have the business open in Google Maps:
Do a search in Google Maps for the business name
Make sure it’s the only result that shows up.
Replace http:// with view-source: in the URL
Click CTRL+F and search the source code for “ludocid”
CID will be the numbers after “ludocid\u003d” and till the last number
or use this tool: https://ryanbradley.com/tools/google-cid-finder/
Example
ludocid\\u003d16726544242868601925\
HINT: Use the class ".quote" in your CSS to style the output
The PHP
<?php
/*
💬 Get Google-Reviews with PHP cURL & without API Key
=====================================================
How to find the CID - If you have the business open in Google Maps:
- Do a search in Google Maps for the business name
- Make sure it’s the only result that shows up.
- Replace http:// with view-source: in the URL
- Click CTRL+F and search the source code for “ludocid”
- CID will be the numbers after “ludocid\\u003d” and till the last number
or use this tool: https://pleper.com/index.php?do=tools&sdo=cid_converter
Example
-------
```TXT
ludocid\\u003d16726544242868601925\
```
> HINT: Use the class ".quote" in your CSS to style the output
###### Copyright 2019 Igor Gaffling
*/
$cid = '16726544242868601925'; // The CID you want to see the reviews for
$show_only_if_with_text = false; // true OR false
$show_only_if_greater_x = 0; // 0-4
$show_rule_after_review = false; // true OR false
$show_blank_star_till_5 = false; // true OR false (used in the output loop below)
/* ------------------------------------------------------------------------- */
$ch = curl_init('https://www.google.com/maps?cid='.$cid);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla / 5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko / 20070725 Firefox / 2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
$result = curl_exec($ch);
curl_close($ch);
$pattern = '/window\.APP_INITIALIZATION_STATE(.*);window\.APP_FLAGS=/ms';
if ( preg_match($pattern, $result, $match) ) {
$match[1] = trim($match[1], ' =;'); // fix json
$reviews = json_decode($match[1]);
$reviews = ltrim($reviews[3][6], ")]}'"); // fix json
$reviews = json_decode($reviews);
//$customer = $reviews[0][1][0][14][18];
//$reviews = $reviews[0][1][0][14][52][0];
$customer = $reviews[6][18]; // NEW IN 2020
$reviews = $reviews[6][52][0]; // NEW IN 2020
}
if (isset($reviews)) {
echo '<div class="quote"><strong>'.$customer.'</strong><br>';
foreach ($reviews as $review) {
if ($show_only_if_with_text == true and empty($review[3])) continue;
if ($review[4] <= $show_only_if_greater_x) continue;
for ($i=1; $i <= $review[4]; ++$i) echo '⭐'; // RATING
if ($show_blank_star_till_5 == true)
for ($i=1; $i <= 5-$review[4]; ++$i) echo '☆'; // RATING
echo '<p>'.$review[3].'<br>'; // TEXT
echo '<small>'.$review[0][1].'</small></p>'; // AUTHOR
if ($show_rule_after_review == true) echo '<hr size="1">';
}
echo '</div>';
}
Source: https://github.com/gaffling/PHP-Grab-Google-Reviews
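The script above indexes deep into the decoded array, which breaks with warnings as soon as the page layout changes. A small sketch of the same extraction step with null-safe lookups follows. The sample page string and its JSON layout are invented for illustration and do not reflect Google's real structure:

```php
<?php
// Sketch of the extraction step only, on an invented sample string.
$page = 'x;window.APP_INITIALIZATION_STATE=[1,["a","b"]];window.APP_FLAGS=[];';

// Same pattern as in the script above.
$pattern = '/window\.APP_INITIALIZATION_STATE(.*);window\.APP_FLAGS=/ms';
$state = null;
if (preg_match($pattern, $page, $match)) {
    $json  = trim($match[1], ' =;');    // drop the leading "=" and stray ";"
    $state = json_decode($json, true);  // null when the payload is not JSON
}

// The ?? operator yields null instead of a warning when a level is missing.
$second  = $state[1][1] ?? null;
$missing = $state[6][52][0] ?? null;
var_dump($second, $missing);
```

Checking the result for null before the output loop lets the page degrade gracefully when the indexes move again.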
Please try the code below:
$html = curl("https://maps.googleapis.com/maps/api/place/details/json?cid=12909283986953620003&key=<google_apis_key>", "Mozilla 5.0");
$datareview = json_decode($html);// get all data in array
Ex. : http://meetingwords.com/QiIN1vaIuY
It will work for you.
Create a Google API key in the Google developer console:
https://developers.google.com/maps/documentation/embed/get-api-key
I see a lot of answers on SO that pertain to the question, but either there are slight differences that I couldn't overcome or maybe I just couldn't repeat the processes shown.
What I am trying to accomplish is to use CURL to get the HTML from a Google+ business page, iterate over the HTML and for each review of the business scrape the reviews HTML for display on that businesses non google+ webpage.
Every review shares this parent div structure:
<div class="ZWa nAa" guidedhelpid="userreviews"> .....
Thus I am trying to do a foreach loop that finds each div with the attribute guidedhelpid="userreviews" and grabs its inner HTML.
I am successfully getting the HTML back via CURL and can parse it when targeting a standard tag name like "a" or an element with an ID, but iterating over the HTML with PHP's default parser when looking for an attribute name is problematic:
How can I take this successful code below and make it work like intended as shown in the second code which of course is wrong?
WORKING CODE (Finds,gets, echo's all "a" tags in $output)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->getAttribute('href');
echo "<br />";}
THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('div[guidehelpid=userreviews]') as $review) {
echo $review;
echo "<br />"; }
Any help in correcting this would be appreciated. I would prefer not to use "simple_html_dom" if I can accomplish this without it.
I suggest using DOMXPath in this case. Example:
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$review = $xpath->query('//div[@guidedhelpid="userreviews"]');
if($review->length > 0) { // if it exists
echo $review->item(0)->nodeValue;
// echoes
// John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
}
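Since the question asks for every review rather than only the first, you can loop over the node list the same way. A short sketch on invented sample markup (only the guidedhelpid attribute name comes from the question):

```php
<?php
// Invented sample markup with two matching divs and one non-matching div.
$html = '<div>'
      . '<div class="ZWa nAa" guidedhelpid="userreviews"><span>Great</span> service</div>'
      . '<div guidedhelpid="userreviews">Would return</div>'
      . '<div guidedhelpid="other">Ignore me</div>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$reviews = [];
// @name selects an attribute in XPath; the predicate filters on its value.
foreach ($xpath->query('//div[@guidedhelpid="userreviews"]') as $node) {
    $reviews[] = trim($node->textContent);
}
print_r($reviews);
```

If you need the markup of each review instead of its text, $dom->saveHTML($node) returns the serialized node.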
I'm extracting the required nodes from the api easily from the response here : https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels
However, I am trying to get the target of the relation with type discogs from the XML below with no luck. Any help appreciated.
<relation-list target-type="url">
<relation type-id="6578f0e9-1ace-4095-9de8-6e517ddb1ceb" type="wikipedia">
<target id="cba6f347-45b3-4bd5-8b7f-818fa76a63b2">
http://en.wikipedia.org/wiki/Insineratehymn
</target>
</relation>
<relation type-id="99e550f3-5ab4-3110-b5b9-fe01d970b126" type="discogs">
<target id="e9e6e840-ca1a-47fa-b835-47d0d86bcda8">
http://www.discogs.com/master/316106
</target>
</relation>
<relation type-id="b988d08c-5d86-4a57-9557-c83b399e3580" type="wikidata">
<target id="f9b00d41-6f25-4174-9c72-ba1c452cfa6d">
http://www.wikidata.org/wiki/Q1932481
</target>
</relation>
</relation-list>
Just made a simple code:
$xml = simplexml_load_file('https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels');
foreach($xml->{'release-group'}->{'relation-list'}->relation as $relation) {
if($relation['type'] == 'discogs') {
$link = $relation->target;
}
}
echo $link;
Output:
http://www.discogs.com/master/316106
You can use ->attributes() method to check whether that particular tag contains the desired attribute that you want. Example:
// without xpath
$xml = simplexml_load_file('https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels');
foreach($xml->{'release-group'}->{'relation-list'}->{'relation'} as $relation) {
if($relation->attributes()['type'] == 'discogs') { // dereference (5.4 or above)
echo (string) $relation->target; // http://www.discogs.com/master/316106
}
}
Or you can also use xpath and register the namespace inside the xml, and target it directly:
$xml->registerXPathNamespace('r', 'http://musicbrainz.org/ns/mmd-2.0#'); // registers xmlns (namespace)
$discogs = (string) $xml->xpath('//r:release-group/r:relation-list/r:relation[@type="discogs"]/r:target')[0];
echo $discogs; // http://www.discogs.com/master/316106
Check this
<?php
$url = 'https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$result= curl_exec ($ch);
curl_close ($ch);
$res = simplexml_load_string($result);
//echo "<pre>";
//print_r($res);
$listing = $res->{'release-group'}->{'relation-list'}->relation;
// full listing
foreach($listing as $list)
{
echo $list['type']."<br>"; // full listing
}
//Particular url
foreach($listing as $list)
{
if($list['type'] == 'discogs')
{
echo $list->target; // Particular url
}
}
exit;
Output
wikipedia
discogs
wikidata
http://www.discogs.com/master/316106
The XML uses a namespace. DOM+XPath allows you to take this into account and to fetch the string value directly:
$dom = new DOMDocument();
$dom->load(
'https://musicbrainz.org/ws/2/release-group/fc02b10c-8a09-38d6-b612-22c47794d2c6?inc=releases+media+url-rels'
);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('mb', "http://musicbrainz.org/ns/mmd-2.0#");
$expression = 'string(
//mb:release-group
/mb:relation-list
/mb:relation[@type = "discogs"]
/mb:target
)';
var_dump($xpath->evaluate($expression));
Output:
string(36) "http://www.discogs.com/master/316106"
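To see why the namespace registration matters, here is an offline sketch on a hand-trimmed copy of the response (the inline XML is abbreviated by hand; only the namespace URI and the two URLs come from the thread):

```php
<?php
// Hand-trimmed sample document using the same default namespace.
$xml = <<<'XML'
<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <release-group>
    <relation-list target-type="url">
      <relation type="wikipedia"><target>http://en.wikipedia.org/wiki/Insineratehymn</target></relation>
      <relation type="discogs"><target>
        http://www.discogs.com/master/316106
      </target></relation>
    </relation-list>
  </release-group>
</metadata>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('mb', 'http://musicbrainz.org/ns/mmd-2.0#');

// Unprefixed names only match elements in no namespace, so this finds
// nothing: every element here lives in the default namespace.
$unprefixed = $xpath->evaluate('string(//relation[@type="discogs"]/target)');

// normalize-space() additionally trims whitespace around the URL.
$prefixed = $xpath->evaluate(
    'normalize-space(//mb:relation[@type="discogs"]/mb:target)'
);

var_dump($unprefixed, $prefixed);
```

The prefix mb is arbitrary; only the namespace URI has to match the one declared in the document.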
I need to get some information about some plants and put it into a MySQL table.
My knowledge of cURL and DOM is close to zero, but I've come up with this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
Now I'm really lost about how to move on from this point. Can you please guide me?
Thanks!
This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one time extraction and not a rolling project, personally I'd use quick and dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = '~<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>~si';
You'll need to do some pre and post cleaning before it will get them all though.
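As a rough sketch of how such a pattern could be applied, here it is on invented sample markup. The real page is messier, so treat the sample values as placeholders:

```php
<?php
// Invented sample markup in the spirit of the page.
$html = '<td><b>Family:</b><br>Rosaceae</td>'
      . '<td><b>Genus:</b> <br>Rosa<p>';

// "~" as the delimiter avoids escaping the slashes in closing tags.
$pattern = '~<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>~si';

$pairs = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Post-cleaning: strip leftover tags, whitespace and the trailing colon.
        $label = rtrim(trim(strip_tags($m[1])), ':');
        $pairs[$label] = trim(strip_tags($m[2]));
    }
}
print_r($pairs);
```

The resulting label/value array maps directly onto columns of a table row, which is what the question is after.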
I don't envy you having this task...
Your best bet will be to wrap this in PHP ;)
Yes, this is an ugly hack for ugly HTML code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>
Use Simple Html Dom and you would be able to access any element/element's content you wish. Their api is very straightforward.
You can try something like this:
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows and rows which are not headers
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach($table_rows as $row => $tr) {
foreach($tr->childNodes as $td) {
$data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
}
$data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>
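One thing to watch with the query above: @class!="rowH" only matches rows that have a class attribute, so rows without one are dropped. A sketch of a variant using not(...) and selecting td elements directly, on an invented sample table (only the table id and the rowH class come from the code above):

```php
<?php
// Invented sample table: one header row, one classed row, one unclassed row.
$page = '<table id="tbl-all-product-view">'
      . '<tr class="rowH"><th>Name</th><th>Price</th></tr>'
      . '<tr class="row"><td> Phone </td><td>100</td></tr>'
      . '<tr><td>Tablet</td><td>200</td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// not(@class="rowH") also keeps rows that have no class attribute at all.
// "//tr" tolerates an optional <tbody> between the table and its rows.
$rows = $xpath->query('//table[@id="tbl-all-product-view"]//tr[not(@class="rowH")]');

$data = [];
foreach ($rows as $tr) {
    $cells = [];
    foreach ($xpath->query('./td', $tr) as $td) { // element cells only
        $cells[] = trim($td->textContent);
    }
    $data[] = $cells;
}
print_r($data);
```

Selecting ./td instead of walking childNodes skips whitespace text nodes, so no array_filter pass is needed afterwards.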