I'm trying to build a web scraper for Amazon's product pages.
I decided to go with the Goutte library (based on this freeCodeCamp tutorial).
Here's what I've coded so far:
<?php
require 'vendor/autoload.php';
$link = readline('Enter the product you want to scrape: ');
$httpClient = new \Goutte\Client();
$response = $httpClient->request('GET', $link);
$title_ = $response->evaluate('//span[@id="productTitle"]');
$price_ = $response->evaluate('//span[@class="priceBlockStrikePriceString a-text-strike"]');
$offer_ = $response->evaluate('//span[@id="priceblock_ourprice"]');
foreach ($title_ as $key => $title) {
$str = $title->textContent . PHP_EOL;
$str = str_replace("\n","",$str);
echo $str , "\n";
}
foreach ($price_ as $key => $price) {
$str = $price->textContent . PHP_EOL;
$str = str_replace(array("\n", " "),"",$str);
echo $str , "\n";
}
foreach ($offer_ as $key => $offer) {
$str = $offer->textContent . PHP_EOL;
$str = str_replace(array("\n", " "),"",$str);
echo $str , "\n";
}
?>
As you can see, I'm trying to extract the product's title, listed price and Amazon's offer.
What really frustrates me is that the above code sometimes works and sometimes doesn't.
For example, I'm testing it by giving it this link.
Sometimes I get the desired result:
Amazon Basics Liquid Crystal Clear Soft TPU Smartphone Cover for iPhone 13 Pro Max
$9.99
$6.69
But then I try again, without changing anything, and I only get the product's title:
Amazon Basics Liquid Crystal Clear Soft TPU Smartphone Cover for iPhone 13 Pro Max
What's going on? I'm guessing issues with my code (I'm new to web scraping), can anyone help? Thanks in advance.
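One way to narrow down the intermittent result is to check, on each run, which queries matched nothing before printing. The sketch below is a hypothetical diagnostic, not a fix: it uses PHP's built-in DOMDocument and DOMXPath on a made-up HTML string instead of Goutte, and the helper name find_missing_fields is invented for illustration. It assumes the missing values come from the page being served with different markup on some requests, which is not confirmed here.

```php
<?php
// Hypothetical helper: run each named XPath query against the HTML
// and return the names of the queries that matched no nodes.
function find_missing_fields(string $html, array $queries): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $missing = [];
    foreach ($queries as $name => $query) {
        $nodes = $xpath->query($query);
        if ($nodes === false || $nodes->length === 0) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Made-up sample markup: it has a title but no price elements.
$html = '<html><body><span id="productTitle"> Sample product </span></body></html>';

$queries = [
    'title' => '//span[@id="productTitle"]',
    'price' => '//span[@class="priceBlockStrikePriceString a-text-strike"]',
    'offer' => '//span[@id="priceblock_ourprice"]',
];

echo 'Missing fields: ' . implode(', ', find_missing_fields($html, $queries)) . PHP_EOL;
```

Logging the raw response body whenever this list is non-empty would show whether the page itself differed on that request.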
Please help me with XPath.
The following script scrapes the main body of a URL using XPath:
<?php
//sentimen order
if (PHP_SAPI != 'cli') {
echo "<pre>";
}
require_once __DIR__ . '/../autoload.php';
$sentiment = new \PHPInsight\Sentiment();
require_once 'Xpath.php';
$startUrl = "http://news.sky.com/story/1445575/suspect-held-over-shooting-of-ferguson-police/";
$xpath = new XPATH($startUrl);
// We start from the root element
$query = '/html/body/div[2]/div[3]/article/div/div[2]/div[2]/p[3]';
$strQuery = $xpath->query($query);
$strNode = $strQuery->item(0)->nodeValue;
$result = array($strNode);
foreach ($result as $string) {
// calculations:
$scores = $sentiment->score($string);
$class = $sentiment->categorise($string);
// output:
echo "Strings $string \n";
echo "Dominant: $class, scores: ";
print_r($scores);
echo "\n";
}
The script above runs fine except for the array loop: XPath does not scrape all of the content, only the first line of the main body.
I think the problem lies in the array and the foreach loop.
Can anyone please help fix this loop?
You only fetch one paragraph. Additionally, you only put one string into the array.
You're perhaps looking for something more along these lines:
foreach ($xpath->query('
//header/h1
|//header/p
|//header//p[@class="last-updated__text"]
|//div[@class="story__content"]/p') as $p) {
echo string_normalize($p->textContent), "\n\n";
}
function string_normalize($string)
{
return preg_replace('~\s+~u', ' ', trim($string));
}
Output:
Shooting Of Ferguson Police: Suspect Charged
A prosecutor says the 20-year-old suspect claims he fired the shots in a dispute with other individuals and did not aim at police.
05:19, UK, Monday 16 March 2015
By Sky News US Team
A suspect has been charged in connection with the shooting and wounding last week of two police officers in Ferguson, Missouri.
St Louis County prosecutor Robert McCulloch told a news conference the accused was 20-year-old Jeffrey Williams.
He said the suspect, a local resident, was facing two counts of assault in the first degree.
Williams, who was arrested on Saturday night, is also charged with firing a handgun from a vehicle.
"He has acknowledged his participation in firing the shots," Mr McCulloch told reporters.
...
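The union query above can be tried offline. The following is a minimal sketch using PHP's built-in DOMDocument and DOMXPath in place of the Xpath.php wrapper from the question; the sample HTML, its class names and the helper names normalize_text and extract_texts are all made up for illustration.

```php
<?php
// Collapse runs of whitespace into single spaces.
function normalize_text(string $string): string
{
    return preg_replace('~\s+~u', ' ', trim($string));
}

// Hypothetical helper: return the normalized text of every node matched by $query.
function extract_texts(string $html, string $query): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = normalize_text($node->textContent);
    }
    return $texts;
}

// Made-up sample page with a headline, two story paragraphs and a footer.
$html = '<html><body>
<div class="story__header"><h1>Sample   headline</h1></div>
<div class="story__content"><p>First
 paragraph.</p><p>Second paragraph.</p></div>
<div class="footer"><p>Not part of the story.</p></div>
</body></html>';

// The | operator unions several paths; matches come back in document order.
$query = '//div[@class="story__header"]/h1 | //div[@class="story__content"]/p';

foreach (extract_texts($html, $query) as $text) {
    echo $text, "\n\n";
}
```

Because the query selects every matching p element instead of a single positional p[3], the loop sees one node per paragraph.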
I want to parse Google News rss with PHP. I managed to run this code:
<?
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
foreach($news->channel->item as $item) {
echo "<strong>" . $item->title . "</strong><br />";
echo strip_tags($item->description) ."<br /><br />";
}
?>
However, I'm unable to solve the following problems:
How can I get the hyperlink of the news title?
Each Google News item has many related news links in its footer (and my code above includes them). How can I remove those from the description?
How can I also get the image for each news item? (Google displays a thumbnail image for each one.)
Thanks.
There we go, just what you need for your particular situation:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
$feeds = array();
$i = 0;
foreach ($news->channel->item as $item)
{
preg_match('#src="([^"]+)"#', $item->description, $match);
$parts = explode('<font size="-1">', $item->description);
$feeds[$i]['title'] = (string) $item->title;
$feeds[$i]['link'] = (string) $item->link;
$feeds[$i]['image'] = isset($match[1]) ? $match[1] : null; // null when the item has no thumbnail
$feeds[$i]['site_title'] = strip_tags($parts[1]);
$feeds[$i]['story'] = strip_tags($parts[2]);
$i++;
}
echo '<pre>';
print_r($feeds);
echo '</pre>';
?>
And the output should look like this:
[2] => Array
(
[title] => Los Alamos Nuclear Lab Under Siege From Wildfire - ABC News
[link] => http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGxBe4YsZArH0kSwEjq_zDm_h-N4A&url=http://abcnews.go.com/Technology/wireStory?id%3D13951623
[image] => http://nt2.ggpht.com/news/tbn/OhH43xORRwiW1M/6.jpg
[site_title] => ABC News
[story] => A wildfire burning near the desert birthplace of the atomic bomb advanced on the Los Alamos laboratory and thousands of outdoor drums of plutonium-contaminated waste Tuesday as authorities stepped up ...
)
I'd recommend checking out SimplePie. I've used it for several different projects and it works great (and abstracts away all of the headache you're currently dealing with).
Now, if you're writing this code simply because you want to learn how to do it, you should probably ignore this answer. :)
To get the URL for a news item, use $item->link.
If there's a common delimiter for the related news links, you could use regex to cut off everything after it.
Google puts the thumbnail image HTML code inside the description field of the feed. You could regex out everything between the open and close brackets for the image declaration to get the HTML for it.
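The three suggestions above can be sketched together on a made-up feed. The sample XML, the delimiter `<div class="related">` and the helper name parse_news_item are assumptions for illustration, since the real feed's markup is not shown here. The cut at the delimiter uses strpos rather than a regex, which is enough when the delimiter is a fixed string.

```php
<?php
// Hypothetical helper: pull the link, thumbnail URL and trimmed story text from one feed item.
function parse_news_item(SimpleXMLElement $item, string $delimiter): array
{
    $description = (string) $item->description;

    // Thumbnail: first src="..." attribute in the description HTML, or null if there is none.
    $image = preg_match('#src="([^"]+)"#', $description, $match) ? $match[1] : null;

    // Related links: drop everything from the delimiter onwards.
    $pos = strpos($description, $delimiter);
    if ($pos !== false) {
        $description = substr($description, 0, $pos);
    }

    return [
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
        'image' => $image,
        'story' => trim(strip_tags($description)),
    ];
}

// Made-up feed with one item; the description carries HTML inside CDATA.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>Sample headline - Example News</title>
      <link>http://example.com/story</link>
      <description><![CDATA[<img src="http://example.com/thumb.jpg" /><p>Story text.</p><div class="related"><a href="http://example.com/other">Other story</a></div>]]></description>
    </item>
  </channel>
</rss>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$news = simplexml_load_string($rss, 'SimpleXMLElement', LIBXML_NOCDATA);

foreach ($news->channel->item as $item) {
    print_r(parse_news_item($item, '<div class="related">'));
}
```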
I'm using a basic LAMP server for my website and I'm looking to rewrite a script that's linked with Twitch's API. The issue I'm having is finding, for lack of a better word, the opposite of foreach.
For example, I have an array of names that are sent in a URL to the Twitch servers. If one of the names is currently streaming, "streams" will have lots of values, and if they aren't currently streaming, "streams" will return null.
This works wonderfully for checking people and showing that they are online, but I also need it to display the ones who are offline, and I can't figure out how to do that. Something like a foreachelse. This is the code I have below. Thanks.
<?php
$username = array("zeromi", "aerosgw", "krylen","tshirt_aion","snooky_aion","vashiro","papa456","vinley_aion","hanfkrokette","wtfast_siel","neckofthewood","paraproc","aionnae","uhiwi","mufflermankr","valorium","knighterrantry","soulune","relizex3","vinlockz","trevyn201","tiger529","xkegi","logsnsticks","meowform","uzuk3","kalzard01","squall_m","suyji","headpcgamer","sariett_siel");
$callAPI = implode(",",$username);
$data = json_decode(@file_get_contents('https://api.twitch.tv/kraken/streams?channel=' . $callAPI), true);
foreach ($data['streams'] as $streams){
$name = $streams['channel']['name'];
echo $name.'<br>';
}
?>
I don't see the problem, if you just use array_diff, you can filter out the usernames.
$online = array();
foreach ($data['streams'] as $streams)
$online[] = $streams['channel']['name'];
$offline = array_diff($username, $online);
echo 'Online users: ' . implode(', ', $online) . "\n<br>";
echo 'Offline users: '. implode(', ', $offline);
Output (at time of writing):
Online users: sariett_siel, mufflermankr
Offline users: zeromi, aerosgw, krylen, tshirt_aion, snooky_aion, vashiro, papa456, vinley_aion, hanfkrokette, wtfast_siel, neckofthewood, paraproc, aionnae, uhiwi, valorium, knighterrantry, soulune, relizex3, vinlockz, trevyn201, tiger529, xkegi, logsnsticks, meowform, uzuk3, kalzard01, squall_m, suyji, headpcgamer
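For trying the array_diff approach without calling the API, here is a small self-contained sketch. The JSON payload is made up to mimic the shape used above ("streams" containing "channel" and "name"), and the helper name split_by_status is hypothetical. It also treats a null or missing "streams" as "nobody online", which covers a failed request.

```php
<?php
// Hypothetical helper: split the watched names into online and offline lists.
function split_by_status(array $usernames, ?array $data): array
{
    // 'streams' may be missing or null when the request fails or nobody is live.
    $streams = isset($data['streams']) && is_array($data['streams']) ? $data['streams'] : [];

    $online = [];
    foreach ($streams as $stream) {
        $online[] = $stream['channel']['name'];
    }

    // array_diff keeps the original keys, so array_values reindexes the result.
    $offline = array_values(array_diff($usernames, $online));

    return ['online' => $online, 'offline' => $offline];
}

$usernames = ['zeromi', 'aerosgw', 'krylen'];

// Made-up response in which only one of the three names is streaming.
$json = '{"streams":[{"channel":{"name":"aerosgw"}}]}';
$status = split_by_status($usernames, json_decode($json, true));

echo 'Online users: ' . implode(', ', $status['online']) . "\n";
echo 'Offline users: ' . implode(', ', $status['offline']) . "\n";
```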
I have a PHP script on the site that pulls a Twitter feed and displays it. Strangely, most of the time it seems to work just fine, but sometimes (quite a lot, actually) it doesn't work at all and just displays the follow button.
The code is as follows; in the real script, USERNAME is replaced with the actual Twitter account username:
$widget = true;
$twitterid = "@USERNAME";
$doc = new DOMDocument();
# load the RSS document, edit this line to include your username or user id
if($doc->load('http://twitter.com/statuses/user_timeline/USERNAME.rss')) {
# specify the number of tweets to display, max is 20
$max_tweets = 4;
$i = 1;
foreach ($doc->getElementsByTagName('item') as $node) {
# fetch the title from the RSS feed.
# Note: 'pubDate' and 'link' are also useful (I use them in the sidebar of this blog)
$tweet = $node->getElementsByTagName('title')->item(0)->nodeValue;
# the title of each tweet starts with "username: " which I want to remove
$tweet = substr($tweet, stripos($tweet, ':') + 1);
# OPTIONAL: turn URLs into links
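$tweet = preg_replace('#(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#', '<a href="$1">$1</a>', $tweet);
# OPTIONAL: turn @replies into links (assumes profile URLs of the form http://twitter.com/NAME)
$tweet = preg_replace("/@([0-9a-zA-Z]+)/", "<a href=\"http://twitter.com/$1\">@$1</a>", $tweet);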
echo "<p> <p>".$tweet."</p></p><hr />\n";
if ($i++ >= $max_tweets)
break;
}
echo "</ul>\n";
}
// Here's the Twitter Follow Button Widget
if($widget){
echo "Follow @" .$twitterid. "<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=\"//platform.twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\");</script>";
}
Sadly, Twitter has removed the URL https://twitter.com/statuses/user_timeline/USERNAME.rss; as of Oct 12 2012 it returns "Sorry, that page does not exist". There is a JSON equivalent, but it may stop working as well after March 2013. Try https://api.twitter.com/1/statuses/user_timeline.json?screen_name=USERNAME&count=4 for the time being.
HTH
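Switching from the RSS feed to the JSON one mostly means replacing the DOMDocument loop with json_decode. The sketch below works on a made-up payload rather than a live request. It assumes the response is a JSON array of statuses that each carry a "text" field, and the helper name render_tweets is hypothetical.

```php
<?php
// Hypothetical helper: turn a JSON timeline into a list of HTML-safe tweet strings.
function render_tweets(string $json, int $max): array
{
    $statuses = json_decode($json, true);
    if (!is_array($statuses)) {
        return []; // request failed or the body was not valid JSON
    }

    $out = [];
    foreach (array_slice($statuses, 0, $max) as $status) {
        if (!isset($status['text'])) {
            continue; // not a status entry (for example, an error payload)
        }
        // Escape first so tweet text cannot inject markup.
        $text = htmlspecialchars($status['text'], ENT_QUOTES, 'UTF-8');
        // Turn URLs into links.
        $text = preg_replace('#(https?://[^\s<]+)#', '<a href="$1">$1</a>', $text);
        $out[] = $text;
    }
    return $out;
}

// Made-up payload standing in for the API response.
$json = '[{"text":"Hello http://example.com/a"},{"text":"Second tweet"},{"text":"Third tweet"}]';

foreach (render_tweets($json, 2) as $tweet) {
    echo '<p>' . $tweet . "</p><hr />\n";
}
```

Returning an empty list on a bad response makes the "only the follow button shows" case an explicit branch that can be logged or replaced with a fallback message.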
Twitter enforces rate limiting on unauthenticated calls (calls made to the API that haven't been authenticated using OAuth).
"Unauthenticated calls are permitted 150 requests per hour. Unauthenticated calls are measured against the public facing IP of the server or device making the request."
If you are using shared hosting, it makes it more likely for you to get rate-limited as someone else using the same IP on the host could also be querying the Twitter API (hence, counting towards the hourly limit for that IP).
You can read more on these restrictions on Twitter's Rate Limiting restrictions website as well as on the Rate Limiting FAQ website.
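A common way to stay under an hourly limit like this is to cache the response on disk and only call the API when the cached copy is older than a few minutes. The following is a minimal sketch of that idea, not something taken from Twitter's documentation: the helper name cached_fetch, the cache file location and the 300 second lifetime are all assumptions, and the fetch is a stub in place of a real HTTP call.

```php
<?php
// Hypothetical helper: return a cached body while it is fresh, otherwise fetch and store a new one.
function cached_fetch(string $cacheFile, int $ttl, callable $fetch): ?string
{
    clearstatcache(true, $cacheFile);
    $hasCache = is_file($cacheFile);

    // Serve the cached copy while it is younger than $ttl seconds.
    if ($hasCache && (time() - filemtime($cacheFile)) < $ttl) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache is missing or stale: ask upstream.
    $fresh = $fetch();
    if (is_string($fresh) && $fresh !== '') {
        file_put_contents($cacheFile, $fresh, LOCK_EX);
        return $fresh;
    }

    // Fetch failed (for example, rate limited): fall back to a stale copy if one exists.
    if ($hasCache) {
        $stale = file_get_contents($cacheFile);
        return $stale === false ? null : $stale;
    }
    return null;
}

// Demo with a stub; in a real script the closure would perform the HTTP request.
$cacheFile = sys_get_temp_dir() . '/timeline_cache_demo.json';
$body = cached_fetch($cacheFile, 300, function () {
    return '[{"text":"Sample tweet"}]';
});
echo $body . PHP_EOL;
```

With a 300 second lifetime, this script makes roughly 12 upstream requests per hour while fetches succeed, regardless of how many visitors load the page.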
<?php
$timeline="http://api.twitter.com/1/statuses/user_timeline.xml?screen_name=arvizard";
$xml= new SimpleXMLElement(file_get_contents($timeline));
$i=0;
print "<ul class=\"tweet_list\">";
foreach($xml ->children() as $tstatus)
{
$stat=$tstatus->text;
$split= preg_split('/\s/',$stat);
print "<li class=\"tweet\"><p class=\"tweet_text\">";
foreach ($split as $word)
{
if (preg_match('/^#/',$word)) {
print " "."".$word."";
}
else if (preg_match('/^http:\/\//',$word)){
print " "."".$word."";
}
else
{
print " ".$word;
}
}
print "</p>";
print "<span class=\"date\">".substr($tstatus->created_at,0,strlen($tstatus->created_at)-14)."</span>";
print "</li>";
$i++;
if ($i==5)
{
break;
}
}
print "</ul>";
?>
This may help you. Please check it.