Recently i've been having issues with the code of not parsing only 1 result from the google search url. Instead it parses 10 results which, I am not sure if it is changable.
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.google.com/search?q=raspberry&oq=raspberry&aqs=chrome.0.69i59j0l5.1144j0j7&sourceid=chrome&ie=UTF-8');
$linkObjs = $html->find('div[class=jfp3ef] a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
//if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title:' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
?>
All, I need is 1 result being scraped off a google search. Thats all. The code i've included is the web scraper written in PHP.
Related
i want to create an array of variable $link to get all the links in array so that i can process them simultaneously outside curly braces
include("simple_html_dom.php");
$html = file_get_html($url);
$i=0;
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj)
{
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
//if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1]))
{
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
$descr = $html->find('span.st',$i); // description is not a child element of H3 thereforce we use a counter and recheck.
$i++;
}
Create an array and push matches.
include("simple_html_dom.php");
$html = file_get_html($url);
$links = array();
$i=0;
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj)
{
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1]))
{
array_push($links, $link);
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
$descr = $html->find('span.st',$i); // description is not a child element of H3 thereforce we use a counter and recheck.
$i++;
}
Just declare a array variable, and add it to use it later.
Before Loop,
$myLinks = [];
And, Just after this line,
$link = $matches[1];
$myLinks[] = $link;
Now, you can use the array $myLinks, Hope this was what you needed.
I want to find a specific url in a html page and get it's part.
the url is in this page :
http://site1.com/games/arcade/139173-angry-birds-friends-1-7-0.html`
and is like
http://download.site2.org/?server=2&apkid=com.rovio.angrybirdsfriends&ver=1.7.0
I want 3 parts of it:
2
com.rovio.angrybirdsfriends
1.7.0
My code:
$html = file_get_contents("http://site1.com/games/name/139173-angry-birds-friends-1-7-0.html");
preg_match("/download(.*)/", $html, $results)
echo = $results[0];
Is this what you are looking for?
$url = 'http://download.site2.org/?server=2&apkid=com.rovio.angrybirdsfriends&ver=1.7.0';
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);
echo $params['server'], PHP_EOL;
echo $params['apkid'], PHP_EOL;
echo $params['ver'], PHP_EOL;
Output:
2
com.rovio.angrybirdsfriends
1.7.0
Update
// Read HTML
$html = file_get_contents(
'http://getandroidapp.org/games/arcade/'
. '139173-angry-birds-friends-1-7-0.html'
);
// Turn HTML into a DOM document
$dom = new DOMDocument();
#$dom->loadHTML($html); // Mute warnings
// Find anchor ...
foreach ($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
// ... having a query part that starts with 'server='
if (preg_match('#\?server=#', $href)) {
$url = $href;
// Parse query string from href
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);
// Display values
echo $params['server'], PHP_EOL;
echo $params['apkid'], PHP_EOL;
echo $params['ver'], PHP_EOL;
// One is enough
break;
}
}
Output:
2
com.rovio.angrybirdsfriends
1.7.0
It is not totally fool-proof but maybe good enough in your case.
i can with php code Scraping title and url from google search results now how to get descriptions
$url = 'http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh';
$html = file_get_html($url);
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title: ' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
The above code gives the following output
Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
Now I want the following output
Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
description : photo.com.bd is a website for creative photographers from Bangladesh, mainly for amateur ... Natural-Beauty-of-Bangladesh_Flower ยท fishing on ... BEAUTY-4.
include("simple_html_dom.php");
$in = "Beautiful Bangladesh";
$in = str_replace(' ','+',$in); // space is a +
$url = 'http://www.google.com/search?hl=en&tbo=d&site=&source=hp&q='.$in.'&oq='.$in.'';
print $url."<br>";
$html = file_get_html($url);
$i=0;
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
$descr = $html->find('span.st',$i); // description is not a child element of H3 thereforce we use a counter and recheck.
$i++;
echo '<p>Title: ' . $title . '<br />';
echo 'Link: ' . $link . '<br />';
echo 'Description: ' . $descr . '</p>';
}
I have got a PHP code to get all images in a URL. It will display the URL of the image. But there are some problems with the URL of the image obtained.
1: If the images are hosted in the root directory, it will not show the domain. For example, if the given URL is www.google.com, the URL of the image obtained is
"/logos/doodles/2014/womens-day-2014-6253511574552576.3-hp.png"
and what I need is
"www.google.com/logos/doodles/2014/womens-day-2014-6253511574552576.3-hp.png"
2: The URL obtained is always between " ". How can I remove it??
Here the PHP code
<?php
$url_image = $_GET['url'];
$homepage = file_get_contents($url_image);
preg_match_all("{<img\\s*(.*?)src=('.*?'|\".*?\"|[^\\s]+)(.*?)\\s*/?>}ims", $homepage, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
echo $val[2];
echo "<br>";
}
?>
try this:
$url_image = $_GET['url'];
$homepage = file_get_contents($url_image);
preg_match_all("{<img\\s*(.*?)src=('.*?'|\".*?\"|[^\\s]+)(.*?)\\s*/?>}ims", $homepage, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
$pos = strpos($val[2],"/");
$link = substr($val[2],1,-1);
if($pos == 1)
echo "http://domain.com" . $link;
else
echo $link;
echo "<br>";
}
You can remove the "" using substr($url,1,-1);
I have an rss feed that I am reading into. I need to retrieve certain data from the field in this feed.
This is the example feed data :
<content:encoded><![CDATA[
<b>When:</b><br />
Weekly Event - Every Thursday: 1:30 PM to 3:30 PM (CT)<br /><br />
<b>Where:</b><br />
100 West Street<BR>2nd floor<BR>Gainesville<BR>
<br>.....
How do I pull out the data for When: and Where: respectively? I attempted to use regex but I am unsure if I am not accessing the data correctly or if my regex expression is wrong. I'm not set on using regex.
This is my code:
foreach ($x->channel->item as $event) {
$eventCounter++;
$rowColor = ($eventCounter % 2 == 0) ? '#FFFFFF' : '#F1F1F1';
$content = $event->children('http://purl.org/rss/1.0/modules/content/');
$contents = $content->encoded;
echo '<tr style="background-color:' . $rowColor . '">';
echo '<td>';
//echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>" . $event->title . "</a>";
echo "" . $event->title . "";
echo '</td>';
echo '<td>';
$re = '%when\:\s*</b>\s*(.|\s)<br \/><br \/>$/i';
if (preg_match($re, $contents, $matches)) {
$date = $matches;
}
echo $date;
echo '</td>';
echo '<td>';
$re = '/^When\:<\/b>()$/';
if (preg_match($re, $contents, $matches)) {
$location = $matches;
}
echo $location;
echo '</td>';
echo '<td>';
echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>Click Here To Register</a>";
echo '</td>';
echo '</tr>';
}
The two $res are just my attempt to get the data out using different regex expressions. Let me know where I am going wrong. Thanks
The following should sort of get you there. (I wrote this from the top of my head and it does not exactly following your XML syntax. But you get the idea.)
<?php
$str = "<root><b>When:</b> whenwhen <b>Where:</b> wherewhere</root>";
$doc = new DOMDocument();
$doc->loadXML($str);
$when = $where = "";
$target = null;
foreach ($doc->documentElement->childNodes as $node) {
if ($node->tagName == "b") {
if (++$i == 1) {
$target = &$when;
} else {
$target = &$where;
}
}
if ($target !== null && $node->nodeType === XML_TEXT_NODE) {
$target .= $node->nodeValue;
}
}
var_dump($when, $where);
I had a problem like this and I ended up using YQL. Take a good look at the page-scraping code given there, especially the select command. Then go the the console and put in your own select statement, specifying the feed url and the xpath to the nodes you're wanting. Select JSON format. Then go down to the bottom of the page, get the REST query url, and use it in a jquery jsonp request. MAGIC!
please, don't extract data from XML-documents via regex.
The long answer is e.g. here: https://stackoverflow.com/a/335446/313145
The short answer is: it is not easier to use regex and will break often.