Webscraping with Goutte and Guzzle - php

I have the following method from my controller that gets the data from the site:
$goutteClient = new Client();
$guzzleClient = new GuzzleClient([
'timeout' => 60,
]);
$goutteClient->setClient($guzzleClient);
$crawler = $goutteClient->request('GET', 'https://html.duckduckgo.com/html/?q=Laravel');
$crawler->filter('.result__title .result__a')->each(function ($node) {
dump($node->text());
});
The code above gives me the title of each search result. I also want the link for the corresponding result, which sits in the element with class result__extras__url.
How do I get the link and the title in the same pass? Or do I have to run a second filter for that?

Inspect the attributes of the nodes. The href attribute of each title link is a DuckDuckGo redirect, so parse it to get the real URL.
$crawler->filter('.result__title .result__a')->each(function ($node) {
    $parts = parse_url($node->attr('href'));
    // parse_str() already URL-decodes each value, so no separate urldecode() is needed.
    parse_str($parts['query'], $params);
    $url = $params['uddg']; // DDG links to its own redirect URL and carries the actual URL in the uddg query parameter.
    $title = $node->text();
});
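A minimal, self-contained sketch of the href parsing described above. The function name extractTargetUrl and the sample href are hypothetical, and the href shape (a redirect link carrying the target in a uddg parameter) is assumed from the answer rather than verified against the live site.

```php
<?php
/**
 * Extract the target URL from a DuckDuckGo-style redirect href.
 * Returns null when the href carries no uddg parameter.
 */
function extractTargetUrl(string $href): ?string
{
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string at all
    }

    // parse_str() URL-decodes each value exactly once.
    parse_str($query, $params);

    return isset($params['uddg']) && is_string($params['uddg'])
        ? $params['uddg']
        : null;
}

// Hypothetical href; the target URL itself contains an encoded query string.
$href = '//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs%3Fa%3D1%26b%3D2&rut=abc';
echo extractTargetUrl($href), "\n";
```

Decoding only once matters when the target URL has its own query string: decoding the whole href first would turn the encoded & into a real separator, and parse_str() would then cut the target URL short.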

For parsing, I usually load the response body into a DOMDocument:
$doc = new DOMDocument();
$doc->loadHTML((string) $response->getBody()); // $response is the Guzzle response; a Crawler has no getBody() method
From then on, you can navigate the document with the getElementsByTagName() method. For example:
$rows = $doc->getElementsByTagName('tr');
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    $value = trim($cols->item(0)->nodeValue);
}
You can find more information in the PHP manual:
https://www.php.net/manual/en/class.domdocument.php
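As a rough, runnable sketch of the DOMDocument approach, assuming a made-up table in place of the real response body. The header-row guard is an addition: a row that contains only th cells has no td at index 0.

```php
<?php
// Made-up sample markup; in practice this would be the HTTP response body.
$html = '<table>'
    . '<tr><th>Name</th><th>Price</th></tr>'
    . '<tr><td> Gold </td><td>1800</td></tr>'
    . '<tr><td> Silver </td><td>25</td></tr>'
    . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length === 0) {
        continue; // header rows contain th cells only
    }
    // First cell of each data row, with surrounding whitespace removed.
    $values[] = trim($cols->item(0)->nodeValue);
}

print_r($values);
```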

Related

Trying to get same result by xpath and css element

I'm trying to get the same result from a site using both DOM elements (CSS selectors) and XPath, so that I can make this crawler work for more sites: I would only have to fill in the URL and the selector type (XPath or DOM element).
$url = 'https://#/';
$xpath = "/html[1]/body[1]/div[3]/header[1]/div[1]/div[1]/div[2]/div[1]/ul[1]/li[2]/ul[1]/li[1]/span[1]";
$client = new Client();
$guzzleClient = new GuzzleClient(array(
'timeout' => 60,
));
$client->setClient($guzzleClient);
$crawler = $client->request('GET', $url);
$crawler->filter('.rate')->filter('.gold')->each(function ($node) {
print $node->text()."\n";
});
$result = $crawler->filterXPath($xpath);
var_dump($result);
The result should be the gold price, which is what this piece of the code outputs:
$crawler->filter('.rate')->filter('.gold')->each(function ($node) {
    print $node->text()."\n";
});
If anything is unclear please let me know!
Welcome to SO.
If you want to fetch the gold rate, you can use the selectors below.
XPath:
//ul[@class='rates-widget list-inline']//span[@class='rate gold']
CSS:
ul.rates-widget.list-inline span.rate.gold
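A small sketch of how the XPath form behaves, using PHP's built-in DOMXPath on made-up markup modelled on the class names above (the real page structure is an assumption). XPath tests attributes with @class, and an equality test compares the whole attribute string, whereas a CSS class selector matches individual class tokens.

```php
<?php
// Made-up markup modelled on the class names in the selectors above.
$html = '<ul class="rates-widget list-inline">'
    . '<li><span class="rate gold">1800.50</span></li>'
    . '<li><span class="rate silver">25.10</span></li>'
    . '</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Exact match: @class must be exactly the string 'rate gold'.
$exact = $xpath->query("//ul[@class='rates-widget list-inline']//span[@class='rate gold']");

// Token match: closer to what the CSS selector span.gold does,
// because it ignores class order and extra classes.
$token = $xpath->query("//span[contains(concat(' ', normalize-space(@class), ' '), ' gold ')]");

echo $exact->item(0)->nodeValue, "\n";
echo $token->item(0)->nodeValue, "\n";
```

The token form is usually the safer one for a crawler meant to work across sites, since the exact form stops matching as soon as a class is added or reordered.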

How to type a value outside the function

I want the value of $name outside of the function so I can name the zip file with it, but how do I do that?
This is my crawler code; I cannot use the value outside the function:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$client = new \GuzzleHttp\Client();
// go get the date from url
$url = 'https://hentaifox.com/gallery/58118/';
$resnm = $client->request('GET', $url);
$htmlnm = ''.$resnm->getBody();
$crawler = new Crawler($html);
$res_name = $client->request('GET', $url);
$html_name = ''.$res_name->getBody();
$crawler_name = new Crawler($html_name);
$nameValues_name = $crawler_name->filter('.info > h1')->reduce(function (Crawler $node, $i){
$name = $node->text();
return $maname;
});
print_r($name);
$res = $client->request('GET', $url);
$html = ''.$res->getBody();
$crawler = new Crawler($html);
$nodeValues = $crawler->filter('.gallery .preview_thumb')->each(function (Crawler $node, $i) {
$image = $node->filter('img')->attr('data-src');
$imagerep = str_replace(array('//i2.hentaifox.com' , '//i.hentaifox.com','t.jpg'),array('https://i2.hentaifox.com','https://i2.hentaifox.com','.jpg'),$image);
$zip = new ZipArchive();
$my_save_dir = $name.'.zip';
$zip->open($my_save_dir, ZipArchive::CREATE);
$imgdownload = file_get_contents($imagerep);
$zip->addFromString(basename($imagerep), $imgdownload);
$zip->close();
});
Please help me get the value of $name out of the function so I can use it to name the zip file in $my_save_dir.
If you want to use a variable from the enclosing scope inside your anonymous function, you need to import it with use, like this:
$nodeValues = $crawler->filter('.gallery .preview_thumb')
->each(function (Crawler $node, $i) use ($name) {
If you also want to be able to access any changes to the variable that have been made inside the anonymous function, you need to pass it by reference:
$nodeValues = $crawler->filter('.gallery .preview_thumb')
->each(function (Crawler $node, $i) use (&$name) {
You could also declare $name as a global variable, but this is generally considered bad practice because globals are hard to keep track of.
An anonymous function can inherit variables from its surrounding scope with the use clause (see example 3 on https://www.php.net/manual/en/functions.anonymous.php):
// $name has to be defined here.
$nodeValues = $crawler->filter('.gallery .preview_thumb')
->each(function (Crawler $node, $i) use ($name) { // <-- the use
//...
});
However, I have the feeling that you would be better off not using anonymous functions everywhere, since you are not really following map/reduce principles. A plain loop should be simpler and work just as well:
foreach ($crawler->filter('.gallery .preview_thumb') as $i => $node) {
    // it's the same scope now, so $name is visible here
    // note: $node is a DOMElement in this loop, not a Crawler
    // do whatever ...
}
Also, why not open the zip file once outside the loop and just add files inside the loop?
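The difference between importing by value and by reference can be seen in a short sketch. It uses array_map() and array_walk() on hypothetical stand-in data so that it runs without the crawler; the variable names are illustrative only.

```php
<?php
// Hypothetical stand-ins for the scraped values.
$name = 'gallery-title';
$images = ['a.jpg', 'b.jpg'];

// By value: the closure gets a copy of $name, taken when the closure is defined.
$paths = array_map(function (string $image) use ($name): string {
    return $name . '/' . $image;
}, $images);

// By reference: assignments made inside the closure are visible outside it.
$found = null;
$count = 0;
$titles = ['  First title  '];
array_walk($titles, function (string $title) use (&$found, &$count): void {
    $found = trim($title);
    $count++;
});

echo implode(', ', $paths), "\n";
echo $found, ' (', $count, ")\n";
```

In the question's code this means the title has to be captured by reference (or returned) from the first callback before the second callback can import it with use.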

PHP creating multiple DOMDocuments in a loop issue

I have a list of items to be appended to a base URL, and I am trying to retrieve the HTML from each of the generated URLs in a loop. However, I am encountering an error that I have been struggling to fix.
Current code ($items is just an array of strings):
$output = "";
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$output = $output . json_encode($dom->saveHTML());
}
echo $output;
Can anyone tell me why I can't load multiple HTML documents like this?
Annoyingly, I'm not getting any PHP error logs, and the AJAX XHR response text is not providing any useful info: it just returns a section of the first HTML page loaded as the 'error'. (It seems to be able to load the first item in the array but then fails.)
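You were almost there. json_encode() returns false when its input contains invalid UTF-8, which is the likely failure here, so pass an option that tells it how to handle such bytes. Note that JSON_ERROR_UTF8 is an error code reported by json_last_error(), not an option flag; JSON_INVALID_UTF8_SUBSTITUTE (PHP 7.2+) is the matching option. This way it should do the trick:
$output = "";
foreach ($items as $item) {
    $url = $baseUrl . $item;
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTMLFile($url);
    $output .= json_encode($dom->saveHTML(), JSON_INVALID_UTF8_SUBSTITUTE);
}
echo $output;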
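One possible variation, sketched under assumptions: collect the pages in an array and encode once, so the output is a single valid JSON document rather than several JSON strings glued together. The helper name encodePages and the sample strings are hypothetical; JSON_INVALID_UTF8_SUBSTITUTE requires PHP 7.2 or later.

```php
<?php
/**
 * Encode several HTML fragments as one JSON array.
 * $pages is a hypothetical stand-in for the strings returned by saveHTML().
 */
function encodePages(array $pages): string
{
    // Invalid UTF-8 bytes are replaced instead of making the whole call fail.
    $json = json_encode($pages, JSON_INVALID_UTF8_SUBSTITUTE);
    if ($json === false) {
        throw new RuntimeException('json_encode failed: ' . json_last_error_msg());
    }
    return $json;
}

// The second string contains a byte that is not valid UTF-8.
$pages = ["<p>ok</p>", "<p>bad byte: \xB1</p>"];

// Without the option, json_encode() returns false for this input.
var_dump(json_encode($pages));

echo encodePages($pages), "\n";
```

Checking the return value and json_last_error_msg() also addresses the "no error in the logs" symptom, because json_encode() fails silently by default.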

preg_match_all to show multiple results

I'm trying to get these messages (METAR) from a page and show them all in another PHP file, without the styles and extra info.
At the moment I'm using this code:
<?php
$options = array('http' => array(
'method' => 'GET',
));
$config= stream_context_create($options);
$config_final=file_get_contents('http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on',false, $config);
preg_match_all("|<td width=\"100%\">METAR (.*)</td>|sU", $config_final, $tiempo);
echo $tiempo[1][0];
?>
With that code I can get only the first METAR. What I need is to see all of them on separate lines, as separate results.
Any ideas?
Thanks in advance
I suggest you use PHP's built-in HTML parser for this, in particular DOMDocument, and use DOMXPath to search for the cells you need.
Example:
$url = 'http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// search for td elements whose text contains METAR
$metars = $xpath->query('//td[contains(text(), "METAR")]');
if($metars->length <= 0) {
echo 'no metars found';
exit;
}
$data = array();
foreach($metars as $metar) {
$data[] = $metar->nodeValue;
}
echo '<pre>';
print_r($data);
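If the regular expression is kept instead, the other matches are already there: preg_match_all() stores every match of the first capture group in $tiempo[1], and the question's code only prints index 0. A minimal sketch, using invented sample markup in place of the downloaded page:

```php
<?php
// Made-up sample markup standing in for the downloaded page; the METAR strings are invented.
$html = '<table>'
    . '<tr><td width="100%">METAR SABE 121800Z 09010KT 9999 FEW030 24/15 Q1015</td></tr>'
    . '<tr><td width="100%">METAR SAEZ 121800Z 11008KT CAVOK 25/14 Q1014</td></tr>'
    . '</table>';

// Same pattern as in the question: U makes the quantifier lazy, s lets . match newlines.
preg_match_all('|<td width="100%">METAR (.*)</td>|sU', $html, $tiempo);

// $tiempo[1] holds the first capture group of every match, not just the first one.
foreach ($tiempo[1] as $metar) {
    echo $metar, "\n";
}
```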

Generated Google Site map not validating

I have written the little class below, which generates an XML sitemap. However, when I try to add it to Google Webmaster Tools I get this error:
Sitemap URL: http://www.moto-trek.co.uk/sitemap/xml
Unsupported file format
Your Sitemap does not appear to be in a supported format. Please ensure it meets our Sitemap guidelines and resubmit.
<?php
class Frontend_Sitemap_Xml extends Cms_Controller {
/**
* Intercept special function actions and dispatch them.
*/
public function postDispatch() {
$db = Cms_Db_Connections::getInstance()->getConnection();
$oFront = $this->getFrontController();
$oUrl = Cms_Url::getInstance();
$oCore = Cms_Core::getInstance();
$absoDomPath = $oFront->getDomain() . $oFront->getHome();
$pDom = new DOMDocument();
$pXML = $pDom->createElement('xml');
$pXML->setAttribute('version', '1.0');
$pXML->setAttribute('encoding', 'UTF-8');
// Finally we append the attribute to the XML tree using appendChild
$pDom->appendChild($pXML);
$pUrlset = $pDom->createElement('urlset');
$pUrlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$pXML->appendChild($pUrlset);
// FETCH content and section items
$array = $this->getDataset("sitemap")->toArray();
foreach($array["sitemap"]['rows'] as $row) {
try {
$content_id = $row['id']['fvalue'];
$url = "http://".$absoDomPath.$oUrl->forContent($content_id);
$pUrl = $pDom->createElement('url');
$pLoc = $pDom->createElement('loc', $url);
$pLastmod = $pDom->createElement('lastmod', gmdate('Y-m-d\TH:i:s', strtotime($row['modified']['value'])));
$pChangefreq = $pDom->createElement('changefreq', ($row['changefreq']['fvalue'] != "")?$row['changefreq']['fvalue']:'monthly');
$pPriority = $pDom->createElement('priority', ($row['priority']['fvalue'])?$row['priority']['fvalue']:'0.5');
$pUrl->appendChild($pLoc);
$pUrl->appendChild($pLastmod);
$pUrl->appendChild($pChangefreq);
$pUrl->appendChild($pPriority);
$pUrlset->appendChild($pUrl);
} catch(Exception $e) {
throw($e);
}
}
// Set content type to XML, thus forcing the browser to render it as XML
header('Content-type: text/xml');
// Here we simply dump the XML tree to a string and output it to the browser
// We could use one of the other save methods to save the tree as a HTML string
// XML file or HTML file.
echo $pDom->saveXML();
}
}
?>
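urlset should be the root element, but in your case the root is an element named xml. The XML declaration is not an element: pass the version and encoding to the DOMDocument constructor and saveXML() writes the declaration for you. Appending urlset directly to the DOMDocument should solve your problem.
$pDom = new DOMDocument('1.0', 'UTF-8'); // version and encoding go here, not on an element
$pUrlset = $pDom->createElement('urlset');
$pUrlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
foreach ($array["sitemap"]['rows'] as $row) {
    // build $pUrl exactly as in your existing loop, then:
    $pUrlset->appendChild($pUrl);
}
$pDom->appendChild($pUrlset); // urlset is now the root element
echo $pDom->saveXML();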
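A self-contained sketch of the same idea, with two deliberate differences from the answer: it uses createElementNS() instead of setting xmlns as a plain attribute, and it adds values through createTextNode() so that characters such as & are escaped. The function name buildSitemap and the sample row are hypothetical.

```php
<?php
/**
 * Build a sitemap document.
 * $pages is a hypothetical list of ['loc' => ..., 'lastmod' => ...] rows.
 */
function buildSitemap(array $pages): DOMDocument
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->formatOutput = true;

    $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    // urlset is appended directly to the document, so it is the root element.
    $urlset = $dom->appendChild($dom->createElementNS($ns, 'urlset'));

    foreach ($pages as $page) {
        $url = $urlset->appendChild($dom->createElementNS($ns, 'url'));

        // Text nodes escape special characters such as & automatically.
        $loc = $url->appendChild($dom->createElementNS($ns, 'loc'));
        $loc->appendChild($dom->createTextNode($page['loc']));

        $lastmod = $url->appendChild($dom->createElementNS($ns, 'lastmod'));
        $lastmod->appendChild($dom->createTextNode($page['lastmod']));
    }

    return $dom;
}

$dom = buildSitemap([
    ['loc' => 'http://www.example.com/?a=1&b=2', 'lastmod' => '2020-01-01'],
]);
echo $dom->saveXML();
```

The text-node step is worth keeping for real URLs: the second argument of createElement() is not escaped, so a URL containing & would produce a warning and a truncated value.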
