I have a script to extract data from multiple pages of the same website. There are about 120 pages.
Here is the code I'm using for a single page:
$html = file_get_contents('https://www.example.com/product?page=1');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('div');
foreach ($links as $link) {
    file_put_contents('products.txt', $link->getAttribute('data-product-name') . PHP_EOL, FILE_APPEND);
}
How can I do this for multiple pages? The page URLs are incremental, so the next page is https://www.example.com/product?page=2 and so on. How can I do it without creating a different file for each link?
What about this:
function extractContent($page)
{
    $html = file_get_contents('https://www.example.com/product?page=' . $page);
    $dom = new DOMDocument;
    @$dom->loadHTML($html);
    $links = $dom->getElementsByTagName('div');
    foreach ($links as $link) {
        // skip empty attributes
        if (empty($link->getAttribute('data-product-name'))) {
            continue;
        }
        file_put_contents('products.txt', $link->getAttribute('data-product-name') . PHP_EOL, FILE_APPEND);
    }
}

for ($i = 1; $i <= 120; $i++) {
    extractContent($i);
}
Related
I'm trying to extract Open Graph properties from a link, the way Facebook does, but the page takes several seconds to load with just a couple of links. Is there any way to speed it up, for example by using a cache?
libxml_use_internal_errors(true);
$c = file_get_contents('https://link-here');
$d = new DOMDocument();
$d->loadHTML($c);
$xp = new DOMXPath($d);
foreach ($xp->query("//meta[@property='og:title']") as $el) {
    $title = $el->getAttribute("content");
}
foreach ($xp->query("//meta[@property='og:description']") as $el) {
    $content = $el->getAttribute("content");
}
foreach ($xp->query("//meta[@property='og:image']") as $el) {
    $image = $el->getAttribute("content");
}
I have to process a huge XML file. I used DOMDocument to process it, but the data returned is huge, so how can I choose a specific number of elements to display?
For example, I want to display 5 elements.
My code:
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->load('IPCCPC-epoxif-201905.xml'); //IPCCPC-epoxif-201905
$xpath = new DOMXPath($doc);

if (empty($_POST['search'])) {
    $txtSearch = 'A01B1/00';
} else {
    $txtSearch = $_POST['search'];
}

$titles = $xpath->query("Doc/Fld[@name='IC']/Prg/Sen[contains(text(),\"$txtSearch\")]");
foreach ($titles as $title)
{
    // I want to display 5 results here.
}
Add an index to the loop, and break out when it hits the limit.
$limit = 5;
foreach ($titles as $i => $title) {
    if ($i >= $limit) {
        break;
    }
    // rest of code
}
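Alternatively, the limit can be pushed into the XPath expression itself so that only the first five matches come back at all. A sketch reusing the query from the question (the parentheses make position() apply to the whole result set rather than to each parent):

$limit = 5;
$titles = $xpath->query("(Doc/Fld[@name='IC']/Prg/Sen[contains(text(), \"$txtSearch\")])[position() <= $limit]");

foreach ($titles as $title) {
    // at most $limit results arrive here
}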
I want to scrape a website that has 3 levels of scraping. First I get all the pages; then on each page I get the image, title, and URL that redirects me to a unique page containing more info like description, date, and so on. If I use foreach it gives me wrong results, and if I use for instead of foreach it returns just one object. How can I handle this?
<?php
$stackHref = array();
$eventDetail = array();

$sitecontent = file_get_contents('https://www.everfest.com/music/edm-festivals');
if ($sitecontent === FALSE) {
    $error_log .= 'Error on $sitecontent = file_get_contents(https://www.everfest.com/music/edm-festivals) ';
    //insert_error($error_log);
}
// echo $sitecontent;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sitecontent);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("(//ul[@class='pagination'])[1]/li/a/@href");
// $all_area_set= ' ';
//echo $sitecontent;
if (!isset($nodes)) {
    $error_log .= "Error on $nodes = $xpath->query((//ul[@class='pagination'])[1]/li/a/@href)";
    //insert_error($error_log);
    echo $error_log;
}

// get total pages
foreach ($nodes as $link) {
    $stackHref[] = 'https://www.everfest.com' . $link->nodeValue;
}

//loop through each pages in order to scrape
$j = 0;
for ($i = 0; $i < count($stackHref); $i++) {
    $sitecontent = file_get_contents($stackHref[$i]);
    if ($sitecontent === FALSE) {
        $error_log .= 'Error on $sitecontent = file_get_contents(https://www.everfest.com/music/edm-festivals) ';
        //insert_error($error_log);
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($sitecontent);
    libxml_use_internal_errors(FALSE);
    $innerXpath = new DOMXPath($dom);

    //get page link
    $pageLinks = $innerXpath->query('//div[@class="festival-card grow"]/a[1]/@href');
    for ($a = 0; $a < $pageLinks->length; $a++) {
        //get img src
        $eventDetail[$j]['pagelink'] = 'https://www.everfest.com' . $pageLinks[$a]->nodeValue;
        $images = $innerXpath->query("//div[contains(@class,'columns medium-6 large-4')]/div[contains(@class,'grow')]/a/img/@src");
        $eventDetail[$j]['img'] = $images[$a]->nodeValue;

        //get title
        $titles = $innerXpath->query("//div[contains(@class,'clearfix')]/a[1]/text()");
        $eventDetail[$j]['title'] = $titles[$a]->nodeValue;

        // go inside of each pages in order to get description, date, venue
        $sitecontent = file_get_contents($eventDetail[$j]['pagelink']);
        $dom = new DOMDocument();
        libxml_use_internal_errors(TRUE);
        $dom->loadHTML($sitecontent);
        libxml_use_internal_errors(FALSE);
        $deepxpath = new DOMXPath($dom);

        $descriptions = $deepxpath->query('//div[@class="columns"]/div[contains(@class,"card-white")]/p[contains(@class,"")]/span[1]/following-sibling::text()[1]');
        $eventDetail[$j]['description'] = $descriptions[$a]->nodeValue;

        //get date
        $dates = $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[1]');
        $eventDetail[$j]['Date'] = $dates[$a]->nodeValue;

        //get venue
        $venues = $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[2]');
        $eventDetail[$j++]['venue'] = $venues[$a]->nodeValue;
    }
}
?>
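A likely cause of the mismatched results is that every query inside the loop runs against the whole document and is then indexed with the same $a. One way around that is to query each festival card node first and run the remaining queries relative to it, via the second argument of DOMXPath::query. This is only a sketch, reusing $innerXpath, $eventDetail and $j from the code above and assuming the same class names:

$cards = $innerXpath->query("//div[contains(@class,'festival-card')]");

foreach ($cards as $card) {
    // every query below is evaluated relative to the current card node
    $linkNode  = $innerXpath->query(".//a[1]/@href", $card)->item(0);
    $imgNode   = $innerXpath->query(".//img/@src", $card)->item(0);
    $titleNode = $innerXpath->query(".//div[contains(@class,'clearfix')]/a[1]", $card)->item(0);

    if ($linkNode === null) {
        continue; // skip cards without a detail link
    }

    $eventDetail[$j]['pagelink'] = 'https://www.everfest.com' . $linkNode->nodeValue;
    $eventDetail[$j]['img']      = $imgNode ? $imgNode->nodeValue : null;
    $eventDetail[$j]['title']    = $titleNode ? trim($titleNode->nodeValue) : null;
    // the detail page can then be fetched from $eventDetail[$j]['pagelink'] as before
    $j++;
}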
I have a function which returns all img links of any web page, but I want to take the image that best represents the news article. I know it's a little hard, but every news article has some main image at the top, and I need to pick it out among all the other images. Sites like Facebook and Reddit can do that. Do you have any ideas? When members of my website share a link, there should be a picture for it. I can already take all image URLs on a web page; now I need to find the main image.
function get_links($url) {
    $xml = new DOMDocument();
    libxml_use_internal_errors(true);
    $html = file_get_contents($url);
    if (!$xml->loadHTML($html)) {
        $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message . "<br/>";
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }

    // Empty array to hold all links to return
    $links = array();

    //Loop through each <img> tag in the dom and add it to the link array
    foreach ($xml->getElementsByTagName('img') as $link) {
        $url = $link->getAttribute('src');
        if (!empty($url)) {
            $links[] = $link->getAttribute('src');
        }
    }

    //Return the links
    return $links;
}
You can improve your existing function, but if you want to give preference to the Open Graph data when it exists, add this before your getElementsByTagName('img') logic...
$xpath = new DOMXPath($xml);
$xpathNodeList = $xpath->query('//meta[@property="og:image" and @content]');
if ($xpathNodeList !== false && $xpathNodeList->length > 0) {
    return array($xpathNodeList->item(0)->getAttribute('content'));
}
or add it to your array...
// Empty array to hold all links to return
$links = array();

$xpath = new DOMXPath($xml);
$xpathNodeList = $xpath->query('//meta[@property="og:image" and @content]');
if ($xpathNodeList !== false && $xpathNodeList->length > 0) {
    $links[] = $xpathNodeList->item(0)->getAttribute('content');
}
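With the Open Graph check placed first, the first entry of the returned array is the best candidate for the article's main image. A minimal usage sketch (the URL is only a placeholder):

$links = get_links('https://www.example.com/some-news-article');
$mainImage = !empty($links) ? $links[0] : null;

if ($mainImage !== null) {
    echo '<img src="' . htmlspecialchars($mainImage) . '" alt="preview">';
}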
I am trying to extract a specific type of link from a webpage using PHP.
The links look like the following:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links in the above format:
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but that filter is not happening. How can I achieve this?
Any suggestions?
<?php
$html = file_get_contents('http://www.example.com');

//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);

$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions, then use it later in an XPath query:
function checkURL($url) {
    $parts = parse_url($url);
    unset($parts['scheme']);
    if ( count($parts) == 2 &&
         isset($parts['host']) &&
         isset($parts['path']) &&
         preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
        return true;
    }
    return false;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);

$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');

$links = $xp->query("//a[php:functionString('checkURL', @href)]");

foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
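Note that $filename above stands in for a local copy of the page. If the HTML is fetched over HTTP as in the question, a hedged equivalent is to load it with loadHTML() instead:

libxml_use_internal_errors(true);
$dom = new DOMDocument;
// load the remote page directly instead of a saved file
$dom->loadHTML(file_get_contents('http://www.example.com'));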
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }
}
You already use a parser, so you might go a step further and use an XPath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
    // do sth. with it here
    // after all, it is a DOMElement
}
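Since real href values usually start with a scheme (http:// or https://) rather than with the bare domain, a variant of the same idea is to match with contains() instead, which keeps the filtering inside the XPath query. A sketch, assuming the same $dom as above:

$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[contains(@href, 'maindomain.com/pages/')]");

foreach ($links as $link) {
    echo $link->getAttribute('href'), '<br>';
}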