Scraping Google Front Page Results with PHP

I can scrape the title and URL from Google search results with PHP code. Now, how do I get the descriptions?
$url = 'http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh';
$html = file_get_html($url);
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title: ' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
The above code gives the following output:
Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
Now I want the following output:
Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
Description: photo.com.bd is a website for creative photographers from Bangladesh, mainly for amateur ... Natural-Beauty-of-Bangladesh_Flower · fishing on ... BEAUTY-4.

include("simple_html_dom.php");
$in = "Beautiful Bangladesh";
$in = urlencode($in); // encodes spaces as + and escapes other special characters
$url = 'http://www.google.com/search?hl=en&tbo=d&site=&source=hp&q='.$in.'&oq='.$in.'';
print $url."<br>";
$html = file_get_html($url);
$i=0;
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
$descr = $html->find('span.st',$i); // the description is not a child element of the h3, so we look it up by index with a counter
$i++;
echo '<p>Title: ' . $title . '<br />';
echo 'Link: ' . $link . '<br />';
echo 'Description: ' . $descr . '</p>';
}
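The counter above can drift out of step when a result is skipped or has no snippet. A minimal sketch of an alternative follows, which reads the title and description from the same result block. It uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom. It assumes each result is wrapped in a div with class "g" that holds both the h3.r link and the span.st snippet; that wrapper, the function name extractResults and the sample markup are assumptions for illustration, not something confirmed by the question, and Google's real markup changes over time.

```php
<?php
// Sketch: pair each title with the description found in the SAME result block.
// ASSUMPTION: every result is wrapped in <div class="g"> containing
// <h3 class="r"><a> (title/link) and <span class="st"> (description).

function extractResults(string $html): array
{
    $doc = new DOMDocument();
    // @ suppresses warnings that real-world, imperfect markup triggers.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // XPath test for "class attribute contains this whole token".
    $hasClass = function (string $name): string {
        return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
    };

    $results = [];
    foreach ($xpath->query('//div[' . $hasClass('g') . ']') as $block) {
        $anchor = $xpath->query('.//h3[' . $hasClass('r') . ']//a', $block)->item(0);
        if ($anchor === null) {
            continue; // this block has no title link
        }
        $snippet = $xpath->query('.//span[' . $hasClass('st') . ']', $block)->item(0);
        $results[] = [
            'title'       => trim($anchor->textContent),
            'link'        => trim($anchor->getAttribute('href')),
            // A missing snippet yields an empty string instead of shifting later ones.
            'description' => $snippet !== null ? trim($snippet->textContent) : '',
        ];
    }
    return $results;
}

// Made-up sample markup; the second result deliberately has no description.
$sample = '<div class="g"><h3 class="r"><a href="http://example.com/a">First</a></h3><span class="st">About first</span></div>'
        . '<div class="g"><h3 class="r"><a href="http://example.com/b">Second</a></h3></div>'
        . '<div class="g"><h3 class="r"><a href="http://example.com/c">Third</a></h3><span class="st">About third</span></div>';

foreach (extractResults($sample) as $result) {
    echo $result['title'], ' | ', $result['link'], ' | ', $result['description'], "\n";
}
```

Because the lookup is scoped to one block at a time, a result without a snippet does not shift the descriptions of the results after it.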

Related

Php download image from html tag and return the downloaded image

Please help me find the error in the code below. I have two PHP functions. One copies an external image from any website when a string contains [img]here is the image link[/img]. The other returns the image downloaded to our server and sets it in place of the external link it was copied from.
Here are the two functions:
function convertContentToHTML($title, $content)
{
$regex = '#\[img( alt="(.+?)\")?( caption="(.+?)\")?](.+?)\[\/img\]#is';
$content = preg_replace_callback($regex, function ($matches) use ($title) {
$imageAlt = $matches[2];
$caption = $matches[4];
$imageUrl = $matches[5];
$caption = trim(preg_replace('/\s+/', ' ', $caption));
$imageUrlToLocalPath = copyimage($imageUrl);
return '<figure class="center"><img src="' . $imageUrlToLocalPath . '" alt="' . (empty($imageAlt) ? $title : $imageAlt) . '" title="' . $title . '">
' . (!empty($caption) ? '<br><figcaption><span class="help">Inset:</span> <strong>' . $caption . '</strong></figcaption>' : '') . '
</figure>';
}, $content);
return $content;
}
function copyimage($content) {
$content = preg_replace('#<img(.*?)src="(.*?)"(.*?)>#is', '[img]\\2[/img]', $content);
preg_match_all('#\[img( alt="(.+?)\")?( caption="(.+?)\")?](.+?)\[\/img\]#is', $content, $matches);
$images = $matches[5];
$i = 0;
$im = [];
foreach ($images as $image) {
array_push($im, $image);
$i++;
$encodeImageUrl = base64_encode($image);
$imageBasename = pathinfo($image, PATHINFO_BASENAME);
$imageLocalPath = "images/hsi/" . $encodeImageUrl . "/images";
if ( !is_dir( $imageLocalPath ) ) {
mkdir($imageLocalPath, 0755, true );
}
copy($image, $imageLocalPath.'/'.$imageBasename);
}
}
The problem I am having is that the function convertContentToHtml does not return the downloaded images and set them in the returned figure element.
For example, I have a string or content like this:
$content = 'HELLO WORLD <p> [img]http://www.stackoverflow/images/newimage.jpg[/img] testing this content [img]http://www.stackoverflow.com/images/newimage2.jpg[/img]';
The function convertContentToHtml should download the stackoverflow newimage.jpg and newimage2.jpg to my server and then replace stackoverflow/images/newimage.jpg and stackoverflow.com/images/newimage2.jpg with my links to the images downloaded using the copyimage function.
Please help. Thanks.
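One likely cause is visible in the code above: copyimage() is called with a single URL but parses its argument as if it were the whole content, and it never returns a path, so the src attribute ends up empty. The following is a minimal sketch of one way to restructure it, not a drop-in fix. The helper names localPathFor and copyImageToLocal, the md5-based directory name, and passing the downloader in as a callable are all assumptions made for illustration. The regex is simplified to plain [img]...[/img] tags without the alt and caption attributes.

```php
<?php
// Sketch: replace [img]URL[/img] with an <img> tag that points at a local copy.
// ASSUMPTIONS: localPathFor(), copyImageToLocal() and the $download callable
// are hypothetical helpers, not part of the original code.

function localPathFor(string $imageUrl, string $baseDir = 'images/hsi'): string
{
    // md5 gives a short, filesystem-safe directory name (base64 can contain "/").
    $basename = basename((string) parse_url($imageUrl, PHP_URL_PATH));
    return $baseDir . '/' . md5($imageUrl) . '/' . $basename;
}

function copyImageToLocal(string $imageUrl, callable $download): string
{
    $localPath = localPathFor($imageUrl);
    $download($imageUrl, $localPath);
    return $localPath; // the caller needs this path for the src attribute
}

function convertContentToHtml(string $title, string $content, callable $download): string
{
    $regex = '#\[img\](.+?)\[/img\]#is';
    return preg_replace_callback($regex, function (array $m) use ($title, $download) {
        $src = copyImageToLocal(trim($m[1]), $download);
        return '<img src="' . htmlspecialchars($src) . '" alt="' . htmlspecialchars($title) . '">';
    }, $content);
}

// A real downloader: create the target directory, then copy the remote file.
// copy() on a URL needs allow_url_fopen, as in the original code.
$realDownload = function (string $url, string $path): void {
    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0755, true);
    }
    copy($url, $path);
};

// Dry run with a stub downloader, so nothing is fetched or written.
$seen = [];
$stub = function (string $url, string $path) use (&$seen): void {
    $seen[] = $url;
};
$out = convertContentToHtml('Demo', 'Hi [img]http://example.com/a.jpg[/img] there', $stub);
echo $out, "\n";
```

Passing $realDownload instead of $stub performs the actual copy. Keeping the download step separate also makes the rewriting logic easy to check without network access.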

Scrape only 1 result from google search via PHP

Recently I've been having trouble getting my code to parse only 1 result from the Google search URL. Instead it parses 10 results, and I am not sure whether that can be changed.
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.google.com/search?q=raspberry&oq=raspberry&aqs=chrome.0.69i59j0l5.1144j0j7&sourceid=chrome&ie=UTF-8');
$linkObjs = $html->find('div[class=jfp3ef] a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
//if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title:' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
?>
All I need is 1 result scraped from a Google search. That's all. The code I've included is the web scraper written in PHP.
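One way to get a single result is to stop at the first link that passes the validity checks. The sketch below isolates that logic in a function that works on a plain array of href strings, so it runs without simple_html_dom or network access. The function name firstResultLink, the urldecode() call and the sample hrefs are assumptions for illustration; the unwrapping rule is the one from the question.

```php
<?php
// Sketch: return only the first usable result link from a list of hrefs.
// ASSUMPTION: firstResultLink() and the sample hrefs are illustrative only.

function firstResultLink(array $hrefs): ?string
{
    foreach ($hrefs as $href) {
        $link = trim($href);
        // Same rule as the question: unwrap /url?q=<target>&sa=... redirects.
        if (!preg_match('/^https?/', $link)
            && preg_match('/q=(.+)&sa=/U', $link, $matches)
            && preg_match('/^https?/', $matches[1])) {
            $link = urldecode($matches[1]);
        } elseif (!preg_match('/^https?/', $link)) {
            continue; // not a valid link, try the next one
        }
        return $link; // stop at the first valid link
    }
    return null; // nothing usable found
}

$hrefs = [
    '/search?q=raspberry&tbm=isch',
    '/url?q=https://www.raspberrypi.org/&sa=U&ved=abc',
    'https://example.com/second',
];
echo firstResultLink($hrefs), "\n";
```

In the original loop, the same effect comes from adding a break; statement right after the two echo lines, so the loop ends once the first valid result has been printed.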

Fetch internal and external links count from a webpage with PHP

Here is my code, which is partially based on a few different snippets that you can easily find in various places by googling. I'm trying to count the internal links, external links, all links, and (still to do) nofollow links on any webpage. This is what I have so far. Most of the results are correct, though some generic calls give me weird results, and I still need to handle nofollow and perhaps _blank as well. If you care to comment or add/change anything, with a bit of explanation of the logic, please do so; it will be much appreciated.
<?php
// transform to absolute path function...
function path_to_absolute($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL */
$abs = "$host$path/$rel";
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
// count zero begins
$intnumLinks = 0;
$extnumLinks = 0;
$nfnumLinks = 0;
$allnumLinks = 0;
// get url file
$url = $_REQUEST['url'];
// get contents of url file
$html = file_get_contents($url);
// http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php
// loading DOM document
$doc=new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from malformed markup
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$strings=$xml->xpath('//a');
foreach ($strings as $string) {
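$aa = path_to_absolute( (string) $string['href'], $url ); // quote the attribute name and pass a plain string
$a = parse_url($aa, PHP_URL_HOST);
$a = str_replace("www.", "", $a);
$b = parse_url($url, PHP_URL_HOST);
$b = str_replace("www.", "", $b); // normalize both hosts the same way before comparing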
if($a == $b){
echo 'call-host: ' . $b . '<br>';
echo 'type: int <br>';
echo 'title: ' . $string[0] . '<br>';
echo 'url: ' . $string['href'] . '<br>';
echo 'host: ' . $a . '<br><br>';
$intnumLinks++;
}else{
echo 'call-host: ' . $b . '<br>';
echo 'type: ext <br>';
echo 'title: ' . $string[0] . '<br>';
echo 'url: ' . $string['href'] . '<br>';
echo 'host: ' . $a . '<br><br>';
$extnumLinks++;
}
$allnumLinks++;
}
// count results
echo "<br>";
echo "Count int: $intnumLinks <br>";
echo "Count ext: $extnumLinks <br>";
echo "Count nf: $nfnumLinks <br>";
echo "Count all: $allnumLinks <br>";
?>
Consider this post as closed. At first I wanted to delete this post but then again someone might use this code for his work.
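For the remaining nofollow and _blank counts, a minimal sketch follows. It works on an HTML string with DOMDocument and treats rel as a space-separated token list. The function name countLinkAttributes and the sample markup are assumptions for illustration and are not part of the code above.

```php
<?php
// Sketch: count rel="nofollow" and target="_blank" links in an HTML string.
// ASSUMPTION: countLinkAttributes() and the sample markup are illustrative only.

function countLinkAttributes(string $html): array
{
    $doc = new DOMDocument();
    // @ suppresses warnings from malformed real-world markup.
    @$doc->loadHTML($html);

    $counts = ['all' => 0, 'nofollow' => 0, 'blank' => 0];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $counts['all']++;

        // rel may hold several space-separated tokens, e.g. "nofollow noopener".
        $relTokens = preg_split('/\s+/', strtolower(trim($a->getAttribute('rel'))), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array('nofollow', $relTokens, true)) {
            $counts['nofollow']++;
        }

        if (strtolower($a->getAttribute('target')) === '_blank') {
            $counts['blank']++;
        }
    }
    return $counts;
}

// Made-up sample: three links, two nofollow, one opening in a new tab.
$sample = '<a href="/a">A</a>'
        . '<a href="http://x.test/" rel="nofollow noopener" target="_blank">B</a>'
        . '<a href="/c" rel="NOFOLLOW">C</a>';

print_r(countLinkAttributes($sample));
```

The same checks could be placed inside the existing foreach loop, where $nfnumLinks is already declared but never incremented.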

How to grab news titles from http://alsat-m.tv/category/5/nga-vendi

I'm trying to grab titles from http://alsat-m.tv/category/5/nga-vendi but I can't. I have tried the code below. Can anyone help me, please? The code below pulls only the titles, as text with a link. It works only with http://www.programminghelp.com/ and not with other web pages; I don't know where the problem is.
<?php
$html = file_get_contents("http://alsat-m.tv/");
preg_match_all(
'/<h5><a href="(.*?)" rel="bookmark" title=".*?">(.*?)<\/a><\/h5>/s',
$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[1];
$title = $post[2];
echo "<a href='" . $link . "'>" . $title . "</a><br>";
}
echo "<p>" . count($posts) . " posts found</p>";
$html = file_get_contents("http://www.alsat-m.tv/");
preg_match_all(
'/<h5><a href="(.*?)" rel="bookmark" title=".*?">(.*?)<\/a><\/h5>/s',
$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[1];
$title = $post[2];
echo "<a href='" . $link . "'>" . $title . "</a><br>";
}
echo "<p>" . count($posts) . " posts found</p>";
?>
Here is a solution in Python that reads the site's RSS feed instead of the HTML:
import requests
from lxml import etree
xml = requests.get('http://alsat-m.tv/RssFeed')
tree = etree.fromstring(xml.content)
root = tree.find('channel')
titles = [x.find('title').text for x in root.findall('item')]
print(titles)
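Since the question is about PHP, the same RSS approach can be sketched with SimpleXML. The function name rssTitles and the sample feed are assumptions for illustration. The feed structure (channel, item, title, link) is standard RSS 2.0, but whether the live feed at http://alsat-m.tv/RssFeed follows it exactly has not been verified here.

```php
<?php
// Sketch: read item titles and links from RSS text with SimpleXML.
// ASSUMPTION: rssTitles() and the sample feed below are illustrative only.

function rssTitles(string $rssXml): array
{
    $feed = simplexml_load_string($rssXml);
    if ($feed === false) {
        return []; // the text was not well-formed XML
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];
    }
    return $items;
}

// Made-up feed so the example runs without network access.
$sampleRss = '<rss version="2.0"><channel><title>Demo</title>'
           . '<item><title>First story</title><link>http://example.com/1</link></item>'
           . '<item><title>Second story</title><link>http://example.com/2</link></item>'
           . '</channel></rss>';

foreach (rssTitles($sampleRss) as $post) {
    echo "<a href='" . $post['link'] . "'>" . $post['title'] . "</a><br>\n";
}

// For the live feed, something like:
// $posts = rssTitles(file_get_contents('http://alsat-m.tv/RssFeed'));
```

Reading the feed avoids the regex over HTML entirely, so the result does not depend on the exact h5 and rel="bookmark" markup that the pattern in the question expects.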

Retrieving RSS feed with tag <content:encoded>

I have the following snippet of code:
function getFeed($feed_url) {
$content = file_get_contents($feed_url);
$x = new SimpleXmlElement($content);
echo "<ul>";
foreach($x->channel->item as $entry) {
echo "<li><a href='$entry->link' title='$entry->title'>" . $entry->title . "</a></li>";
echo "<li>$entry->content</li>";
}
echo "</ul>";
}
It works EXCEPT for $entry->content.
That part doesn't register. In the actual feed the tag is listed as <content:encoded>, but I can't get it to show up. Any suggestions?
The Tag name here is "encoded".
Try this:
$url = 'put_your_feed_URL';
$rss = new DOMDocument();
$rss->load($url);
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
'pubDate' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
'description' => $node->getElementsByTagName('description')->item(0)->nodeValue,
'content' => $node->getElementsByTagName('encoded')->item(0)->nodeValue
);
array_push($feed, $item);
}
In <content:encoded>, content is the namespace and encoded is the tag name.
You have to use SimpleXMLElement::children. See the output of
var_dump($entry->children("content", true));
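A minimal, self-contained sketch of that approach follows. The function name encodedContent and the sample feed are assumptions for illustration. The namespace URI is the standard one for the RSS content module.

```php
<?php
// Sketch: read <content:encoded> through SimpleXMLElement::children().
// ASSUMPTION: encodedContent() and the sample feed are illustrative only.

function encodedContent(string $rssXml): array
{
    $feed = simplexml_load_string($rssXml);
    if ($feed === false) {
        return []; // the text was not well-formed XML
    }

    $out = [];
    foreach ($feed->channel->item as $entry) {
        // children('content', true) selects child elements by namespace PREFIX;
        // the cast to string extracts the text, including CDATA content.
        $out[] = (string) $entry->children('content', true)->encoded;
    }
    return $out;
}

// Made-up feed; the second item has no <content:encoded> element.
$sampleRss = '<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel>'
           . '<item><title>One</title><content:encoded><![CDATA[<p>Hello</p>]]></content:encoded></item>'
           . '<item><title>Two</title></item>'
           . '</channel></rss>';

foreach (encodedContent($sampleRss) as $html) {
    echo '[', $html, "]\n";
}
```

An item without the element simply yields an empty string, so the loop does not need a separate existence check.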
I suggest the following code:
function getFeed($feed_url) {
$feeds = file_get_contents($feed_url);
// rename the namespaced tag so SimpleXML exposes it as a plain child element
$feeds = str_replace("<content:encoded>","<contentEncoded>",$feeds);
$feeds = str_replace("</content:encoded>","</contentEncoded>",$feeds);
$x = simplexml_load_string($feeds);
echo "<ul>";
foreach($x->channel->item as $entry) {
echo "<li><a href='$entry->link' title='$entry->title'>" . $entry->title . "</a></li>";
echo "<li>$entry->contentEncoded</li>";
}
echo "</ul>";
}
Hope this works for you.
A PHP example:
<?php
// --------------------------------------------------------------------
$feed_url = 'http://www.tagesschau.de/xml/rss2';
$xml_data = simplexml_load_file($feed_url);
// --------------------------------------------------------------------
$i=0;
foreach($xml_data->channel->item as $ritem) {
// --------------------------------------
$e_title = (string)$ritem->title;
$e_link = (string)$ritem->link;
$e_pubDate = (string)$ritem->pubDate;
$e_description = (string)$ritem->description;
$e_guid = (string)$ritem->guid;
$e_content = $ritem->children("content", true);
$e_encoded = (string)$e_content->encoded;
$n = ($i+1);
// --------------------------------------
print '<p> ---------- '. $n .' ---------- </p>'."\n";
print "\n";
print '<div class="entry" style="margin:0 auto; padding:4px; text-align:left;">'."\n";
print '<p> Title: '. $e_title .'</p>'."\n";
print '<p> Link: '. $e_link .'</p>'."\n";
print '<p> Date: '. $e_pubDate .'</p>'."\n";
print '<p> Desc: '. $e_description .'</p>'."\n";
print '<p> Guid: '. $e_guid .'</p>'."\n";
print '<p> Content: </p>'."\n";
print '<p style="background:#DEDEDE">'. $e_encoded .'</p>'."\n";
print '</div>'."\n";
// --------------------------------------
print '<br />'."\n";
print '<br />'."\n";
$i++;
}
// --------------------------------------------------------------------
?>
If you want to see the content's HTML source code in your browser, use e.g.:
print '<pre style="background:#DEDEDE">'. htmlentities($e_encoded) .'</pre>'."\n";
:=)
The working answer for this is just:
$e_content = $entry->children("content", true);
$e_encoded = (string)$e_content->encoded;
Using SimpleXmlElement or loading via simplexml_load_file you could access it via $entry->children("http://purl.org/rss/1.0/modules/content/")->encoded, just remember to cast it to a string:
foreach($x->channel->item as $entry) {
echo "<li><a href='$entry->link' title='$entry->title'>" . $entry->title . "</a></li>";
echo "<li>" . (string)$entry->children("http://purl.org/rss/1.0/modules/content/")->encoded . "</li>";
}
