How to get class value from website using DOMDocument PHP - php

I'm trying to get the text of an element with a specific class from a web page. I tried the code below, but loadHTML() has nothing to parse because the request returns a 503 response.
// <span class="_1n0q8zmp">AUD - $</span>
$url = 'https://www.airbnb.com/rooms/19844318';
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = file_get_contents_curl($url);
$dom = new DOMDocument();
libxml_use_internal_errors(1);
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$script = $dom->getElementsByTagName('span');
$script = $xpath->query("//*[contains(@class, '_1n0q8zmp')]");
echo $script;
// result should be: AUD - $
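A minimal sketch of the parsing step only, run against the sample span from the question instead of a live download. The inline $html string and the names $nodes and $currency are assumptions for illustration. It does not address the 503 response; it only shows that query() returns a DOMNodeList whose items must be read individually rather than echoed as a whole.

```php
<?php
// Sample markup copied from the question; a real page would come from cURL.
$html = '<html><body><span class="_1n0q8zmp">AUD - $</span></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// XPath addresses attributes with "@", so the class test is written as @class.
$nodes = $xpath->query("//span[contains(@class, '_1n0q8zmp')]");

// Read the text of the first match, or null when nothing matched.
$currency = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $currency, PHP_EOL;
```

With a live page, checking the status via curl_getinfo($ch, CURLINFO_HTTP_CODE) before parsing would show whether the body is the expected page or an error response.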

Related

PHP render sitemap with SimpleXMLElement

I am trying to build a function that reads the links in a sitemap and follows the links of any nested sitemaps. It works for most sitemaps, but for some of them (with the same syntax) it fails with an error.
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
'Content-type: application/xml'
]);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
function getAllLinks($sitemapUrl) {
$links = array();
$i=0;
// $context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36')));
// $xml = file_get_contents($sitemapUrl, false, $context);
$sitemap = $this->download_page($sitemapUrl);
// dd($sitemap);
// Load the sitemap XML file
$sitemapXml = new \SimpleXMLElement($sitemap);
// $sitemapXml = simplexml_load_file($sitemap);
// $sitemapXml = simplexml_load_string($sitemap);
// Loop through the <url> and <sitemap> elements
foreach($sitemapXml->children() as $child) {
if ($child->getName() === 'url') {
$i++;
$links[$i]['url'] = (string)$child->loc;
$links[$i]['lastmod'] = (string)$child->lastmod;
}
elseif ($child->getName() === 'sitemap') {
$links = array_merge($links, $this->getAllLinks((string)$child->loc));
}
}
return $links;
}
The commented-out lines show the other methods I tried.
Example of a working link: https://rulepingpong.com/sitemap_index.xml
Example of a link that does not work: https://majesticgaragedoorfl.com/sitemap_index.xml
It fails with the error "String could not be parsed as XML".
I am really lost.
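One way to narrow this down is to look at what libxml actually rejects. With CURLOPT_FAILONERROR set, curl_exec() returns false on an HTTP error, and a body that is not XML (for example an HTML error page) fails to parse as well. The sketch below assumes that is what happens here, which is not confirmed for the failing site. The helper name parse_sitemap and both sample strings are hypothetical.

```php
<?php
// Hypothetical helper: returns the parsed sitemap, or null with the reasons in $errors.
function parse_sitemap($body, array &$errors): ?SimpleXMLElement
{
    $errors = [];
    // curl_exec() returns false on failure, so guard before parsing.
    if (!is_string($body) || trim($body) === '') {
        $errors[] = 'empty or failed download';
        return null;
    }
    $previous = libxml_use_internal_errors(true);   // keep parse errors for inspection
    $xml = simplexml_load_string($body);
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? null : $xml;
}

// Made-up bodies: a valid sitemap and an HTML error page.
$good = '<urlset><url><loc>https://example.com/</loc></url></urlset>';
$bad  = '<html><body>403 Forbidden<br></body></html>';

$errors = [];
$ok = parse_sitemap($good, $errors);
$okErrors = $errors;
$failed = parse_sitemap($bad, $errors);

echo (string) $ok->url->loc, PHP_EOL;
print_r($errors);   // the reasons the second body was rejected
```

Logging the collected messages together with the first few hundred characters of the body usually shows whether the server sent XML at all.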

How to get property attribute from Facebook profile?

I'm trying to get the al:android:url property from the page https://www.facebook.com/tobiasz.mencfel.
With my current code, the string $link_id is empty.
Here is what I have so far:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("https://www.facebook.com/tobiasz.mencfel");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('property') == 'al:android:url') {
$link_id = $meta->getAttribute('content');
}
}
// output should be: fb://profile/100025596917906
echo $link_id;
What the meta tag looks like:
<meta property="al:android:url" content="fb://profile/100025596917906" />
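file_get_contents_curl() returns $data, which is never defined, so the caller always receives an empty value. In that function, change
return $data;
to
return $html;
Result: fb://profile/100025596917906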
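The lookup itself can be checked without a network request. The sketch below runs the same meta-tag loop against the sample tag shown in the question; the helper name find_meta_property and the inline $html are assumptions for illustration.

```php
<?php
// Hypothetical helper: return the content of the first <meta> with the given property.
function find_meta_property(string $html, string $property): ?string
{
    libxml_use_internal_errors(true);   // suppress warnings about imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === $property) {
            return $meta->getAttribute('content');
        }
    }
    return null;   // no matching tag
}

// Sample markup built around the meta tag from the question.
$html = '<html><head>'
      . '<meta property="al:android:url" content="fb://profile/100025596917906" />'
      . '</head><body></body></html>';

$linkId = find_meta_property($html, 'al:android:url');
echo $linkId, PHP_EOL;
```

Returning null when the tag is missing avoids echoing an undefined variable, which is what happens to $link_id in the original when no meta tag matches.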

Can't Return When Looping

Why can't I return inside a loop in a function? I only get one result, as if there were no loop. Here is my code:
function search($get){
$i=0;
//print_r($get);
foreach($get->itemlist as $song){
$i++;
$ch = curl_init('');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE, 'wmid=14997771; user_type=2; country=id; session_key=96870dd03ab9280c905566cad439c904;');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36');
$json = curl_exec($ch);
$json = str_replace('MusicInfoCallback(', '', $json);
$json = str_replace(')', '', $json);
$json = json_decode($json);
$songurl = $json->mp3Url;
//print_r($json);
return array($i => array("song" => $json->msong,
"singer" => $json->msinger,
"url" => $song->songid));
}
}
print_r(search("key"));
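Is there any alternative?
return ends the function immediately, so the loop stops on its first iteration. Collect the rows in an array and return it after the loop. Untested code: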
function search($get){
$results = []; // stays an empty array when itemlist has no entries
foreach($get->itemlist as $song){
$ch = curl_init('');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE, 'wmid=14997771; user_type=2; country=id; session_key=96870dd03ab9280c905566cad439c904;');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36');
$json = curl_exec($ch);
$json = json_decode(substr($json,18,-1),true);
$results[]=['songurl'=>$json['mp3Url'],
'song'=>$json['msong'],
'singer'=>$json['msinger'],
'url'=>$song->songid
];
}
return $results;
}
I don't have any sample data to verify my code with. I am assuming that MusicInfoCallback( and ) are the start and end of the cURL response string. I recommend packing all the data into an (automatically) indexed array.
In the original, $songurl was also "trapped" inside the function's scope: it was assigned but never returned.
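The collect-then-return pattern can be tried without the remote service. In the sketch below, the two response strings are made up to stand in for what curl_exec() would return, and the helper name decode_jsonp is hypothetical. It strips the callback wrapper by position rather than by replacing every ")", so a closing parenthesis inside a song title survives.

```php
<?php
// Hypothetical helper: strip a "callbackName(...)" wrapper and decode the JSON inside.
function decode_jsonp(string $body): ?array
{
    $start = strpos($body, '(');    // first "(" opens the wrapper
    $end   = strrpos($body, ')');   // last ")" closes it
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    $decoded = json_decode(substr($body, $start + 1, $end - $start - 1), true);
    return is_array($decoded) ? $decoded : null;
}

// Made-up responses, one per song.
$responses = [
    'MusicInfoCallback({"msong":"Song A","msinger":"Singer A","mp3Url":"http://example.com/a.mp3"})',
    'MusicInfoCallback({"msong":"Song B (Live)","msinger":"Singer B","mp3Url":"http://example.com/b.mp3"})',
];

$results = [];
foreach ($responses as $body) {
    $json = decode_jsonp($body);
    if ($json === null) {
        continue;   // skip responses that cannot be decoded
    }
    // Append one row per song; the return happens only after the loop.
    $results[] = [
        'songurl' => $json['mp3Url'],
        'song'    => $json['msong'],
        'singer'  => $json['msinger'],
    ];
}
print_r($results);
```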

Scraping a website for price data using PHP returns zero (==$0); maybe the website is blocking me. How do I overcome it?

This is the code that I have used:
$curl = curl_init("https://www.flipkart.com/curren-cu2-345656-analog-watch-boys-men/p/itmeax4wh4ujcfft?pid=WATEAX4WGYNYWVCM&srno=b_1_1&otracker=hp_omu_Deals%20of%20the%20Day_5_15c7e867-d35a-4431-a4a0-da39f043bc1f_0&lid=LSTWATEAX4WGYNYWVCMHVLY32");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tableRows = $xpath->query('//*[@id="container"]/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]');
echo $tableRows[0];
echo $tableRows[1];
echo $tableRows[2];
foreach ($tableRows as $row) {
echo $row . "<br>";
}
It shows zero, and when I open the source in the F12 developer tools it shows "==$0" next to the div. How do I overcome this?
Flipkart is served over HTTPS, so the request may be failing at SSL certificate verification. To work around this, add the following two lines to your cURL request:
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);

Web Scraping in Google Scholar

I'm trying to scrape Google Scholar profile pages. The idea is to retrieve the list of publications with an XPath query, but the page is not downloaded. Here's my code:
I tried with cURL:
function get_page($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
//I tried to change user agent as well
//curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
return $response;
}
And without curl :
function get_xpath($query_url) {
$dom = new DOMDocument();
@$dom->loadHTMLFile($query_url);
sleep(1);
return new DOMXpath($dom);
}
$query_url = "https://scholar.google.it/citations?user=p-POZjgAAAAJ&hl=it&cstart=0&pagesize=100";
To get it without curl
$xpath = get_xpath($query_url);
To get it with curl
$xpath = get_xpath(get_page($query_url));
And then
$autori=$xpath->query("//tr[1]/td[1]/div[1]");
But $autori keeps being empty, any idea?
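One thing worth checking in the cURL variant: get_xpath() calls loadHTMLFile(), which expects a path or URL, but it is handed the downloaded markup, and the @ operator hides the resulting failure. A minimal sketch of a string-based variant follows. The helper name xpath_from_html and the sample table are assumptions for illustration, and whether Google Scholar serves the page to a script at all is a separate question that this does not answer.

```php
<?php
// Hypothetical helper: build a DOMXPath from markup that has already been downloaded.
function xpath_from_html(string $html): DOMXPath
{
    libxml_use_internal_errors(true);   // collect warnings instead of hiding them with @
    $dom = new DOMDocument();
    $dom->loadHTML($html);              // loadHTML() takes markup; loadHTMLFile() takes a path or URL
    libxml_clear_errors();
    return new DOMXPath($dom);
}

// Made-up table standing in for a downloaded profile page.
$html = '<table>'
      . '<tr><td><div>First author</div></td></tr>'
      . '<tr><td><div>Second author</div></td></tr>'
      . '</table>';

$xpath = xpath_from_html($html);
$cells = $xpath->query('//tr[1]/td[1]/div[1]');

foreach ($cells as $cell) {
    echo trim($cell->textContent), PHP_EOL;
}
```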
