I'm trying to get the specific property al:android:url from the page https://www.facebook.com/tobiasz.mencfel.
With my current code, the string $link_id is empty when I echo it.
What I've done so far:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("https://www.facebook.com/tobiasz.mencfel");
$doc = new DOMDocument();
#$doc->loadHTML($html);
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('property') == 'al:android:url') {
$link_id = $meta->getAttribute('content');
}
}
// output should be: fb://profile/100025596917906
echo $link_id;
What the meta tag looks like:
<meta property="al:android:url" content="fb://profile/100025596917906" />
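In file_get_contents_curl(), the variable $data is never defined, so the function returns null. Change
return $data;
to
return $html;
The $doc->loadHTML($html) call also has to be active (not commented out with #), otherwise the document stays empty and no meta tags are found.
Result: fb://profile/100025596917906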
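The lookup itself can be tried without any network access. The following is only a minimal sketch: findMetaProperty() is a hypothetical helper name, and the sample markup is built from the meta tag quoted in the question, not fetched from Facebook.

```php
<?php
// Hypothetical helper: returns the content of the first <meta> tag whose
// property attribute equals $property, or null when there is none.
function findMetaProperty(string $html, string $property): ?string
{
    if ($html === '') {
        return null; // nothing to parse, e.g. the download failed
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === $property) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Sample markup (an assumption, modelled on the tag shown in the question).
$sample = '<html><head>'
    . '<meta property="al:android:url" content="fb://profile/100025596917906" />'
    . '</head><body></body></html>';

echo findMetaProperty($sample, 'al:android:url'), "\n";
```

Returning from inside the loop avoids the undefined-variable notice that the original echo would raise when no matching tag exists.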
Related
I am trying to build a function that reads the links in a sitemap and follows the links of any nested sitemaps. It works well for some sitemaps, but for others (with the same syntax) it fails with an error.
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
'Content-type: application/xml'
]);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
function getAllLinks($sitemapUrl) {
$links = array();
$i=0;
// $context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36')));
// $xml = file_get_contents($sitemapUrl, false, $context);
$sitemap = $this->download_page($sitemapUrl);
// dd($sitemap);
// Load the sitemap XML file
$sitemapXml = new \SimpleXMLElement($sitemap);
// $sitemapXml = simplexml_load_file($sitemap);
// $sitemapXml = simplexml_load_string($sitemap);
// Loop through the <url> and <sitemap> elements
foreach($sitemapXml->children() as $child) {
if ($child->getName() === 'url') {
$i++;
$links[$i]['url'] = (string)$child->loc;
$links[$i]['lastmod'] = (string)$child->lastmod;
}
elseif ($child->getName() === 'sitemap') {
$links = array_merge($links, $this->getAllLinks((string)$child->loc));
}
}
return $links;
}
The commented-out lines show the other methods I tried.
Example of a working link: https://rulepingpong.com/sitemap_index.xml
Example of a non-working link: https://majesticgaragedoorfl.com/sitemap_index.xml
For the second one I get the error "String could not be parsed as XML".
I am really lost.
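One way to narrow this down is to check what was actually downloaded before handing it to SimpleXML. With CURLOPT_FAILONERROR set, curl_exec() returns false for HTTP status codes of 400 and above, and a false or non-XML body (for example an HTML error page) produces exactly this kind of parse error. Whether that is what happens for the failing site is an assumption. The sketch below uses a hypothetical helper, parseSitemap(), that reports libxml's own messages instead of the generic exception; the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: parses a sitemap body and returns its <url> entries
// and the locations of nested <sitemap> elements. Throws with libxml's own
// error messages when the body is missing or is not well-formed XML.
function parseSitemap($body): array
{
    if (!is_string($body) || trim($body) === '') {
        throw new RuntimeException('Empty response: the download itself failed.');
    }

    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($body);
    if ($xml === false) {
        $messages = [];
        foreach (libxml_get_errors() as $error) {
            $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
        }
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        throw new RuntimeException('Not valid XML: ' . implode('; ', $messages));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = ['urls' => [], 'sitemaps' => []];
    foreach ($xml->children() as $child) {
        if ($child->getName() === 'url') {
            $result['urls'][] = [
                'url'     => (string) $child->loc,
                'lastmod' => (string) $child->lastmod,
            ];
        } elseif ($child->getName() === 'sitemap') {
            $result['sitemaps'][] = (string) $child->loc;
        }
    }
    return $result;
}

// Made-up sitemap index for illustration.
$index = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>'
    . '</sitemapindex>';
print_r(parseSitemap($index));

// A made-up HTML error page is not well-formed XML and is reported as such.
try {
    parseSitemap('<html><body>Service Unavailable<br></body></html>');
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Logging curl_error($ch) and curl_getinfo($ch, CURLINFO_HTTP_CODE) before curl_close() would show whether the failing site returns an error status in the first place.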
I'm trying to get the text of an element with a specific class from a website URL. I tried the code below, but loadHTML has nothing to parse because the request returns a 503 response.
// <span class="_1n0q8zmp">AUD - $</span>
$url = 'https://www.airbnb.com/rooms/19844318';
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = file_get_contents_curl($url);
$dom = new DOMDocument();
libxml_use_internal_errors(1);
$dom->loadHTML($html);
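$xpath = new DOMXPath($dom);
// Select every element whose class attribute contains the target class name
$nodes = $xpath->query("//*[contains(@class, '_1n0q8zmp')]");
// query() returns a DOMNodeList, so print the text of the first match, if any
echo $nodes->length > 0 ? $nodes->item(0)->textContent : '';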
// result should be: AUD - $
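Independently of the 503 response, the class lookup can be checked offline against the span quoted in the comment. This is a sketch only: textByClass() is a hypothetical helper, it assumes the class name contains no quote characters, and it matches the class as a whole space-separated token instead of a substring.

```php
<?php
// Hypothetical helper: returns the trimmed text of every element that
// carries $class as one of its space-separated class names.
function textByClass(string $html, string $class): array
{
    if ($html === '') {
        return []; // e.g. the request failed and there is nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Padding with spaces makes contains() match whole class tokens only.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Sample markup taken from the comment in the question.
$sample = '<div><span class="_1n0q8zmp">AUD - $</span></div>';
print_r(textByClass($sample, '_1n0q8zmp'));
```

If this works on the sample but not on the live page, the problem lies in the download step (status code, blocking, or content rendered by JavaScript) and not in the XPath query.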
I have this code that tries to get the pagination links using PHP, but the result is not quite right. Could anyone help me?
What I get back is just a recurring instance of the first link.
<?php
include_once('simple_html_dom.php');
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
$Next_Link = array();
foreach($dom->find('a[title=Next]') as $element){
$Next_Link[] = $element->href;
}
print_r($Next_Link);
$next_page_url = $Next_Link[0];
if($next_page_url !='') {
echo '<br>' . $next_page_url;
$dom->clear();
unset($dom);
//load the next page from the pagination to collect the next link
dlPage($next_page_url);
}
}
$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
//print_r($data)
?>
What I want to get is:
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=2
mySiteUrl/?facet_is_mpg_child=0&viewType=gridView&page=3
.
.
.
and so on, up to the last link in the pagination. Please help.
Here it is. Note that I run htmlspecialchars_decode on the link, because the href passed to cURL must not contain an escaped ampersand (&amp;) the way it does in the markup. As I understood it, dlPage should return the last link in the pagination.
<?php
include_once('simple_html_dom.php');
function dlPage($href, $already_loaded = array()) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$htmlPage = curl_exec($curl);
curl_close($curl);
echo "Loading From URL:" . $href . "<br/>\n";
$already_loaded[$href] = true;
// Create a DOM object (the page was already downloaded with cURL, so there is no need to fetch it again)
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($htmlPage);
$next_page_url = null;
$items = $dom->find('ul[class="osh-pagination"] li[class="item"] a[title="Next"]');
foreach ($items as $item) {
$link = htmlspecialchars_decode($item->href);
if (!isset($already_loaded[$link])) {
$next_page_url = $link;
break;
}
}
if ($next_page_url !== null) {
$dom->clear();
unset($dom);
//load the next page from the pagination to collect the next link
return dlPage($next_page_url, $already_loaded);
}
return $href;
}
$url = 'https://www.jumia.com.gh/phones/';
$data = dlPage($url);
echo "DATA:" . $data . "\n";
And the output is:
Loading From URL:https://www.jumia.com.gh/phones/<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=2<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=3<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=4<br/>
Loading From URL:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5<br/>
DATA:https://www.jumia.com.gh/phones/?facet_is_mpg_child=0&viewType=gridView&page=5
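The decoding step from the answer above can be checked in isolation. The snippet is a small sketch and assumes the href arrives with its ampersands escaped as &amp;, which is what the answer implies; the URL is the page-2 link from the output.

```php
<?php
// The href as it is assumed to appear in the page markup.
$rawHref = 'https://www.jumia.com.gh/phones/'
    . '?facet_is_mpg_child=0&amp;viewType=gridView&amp;page=2';

// Decoding turns each &amp; back into a plain & so the query string is valid.
$decoded = htmlspecialchars_decode($rawHref);
echo $decoded, "\n";

// parse_str() confirms that the parameters are now separated correctly.
parse_str(parse_url($decoded, PHP_URL_QUERY), $params);
print_r($params);
```

Without the decoding, the second and third parameter names would come out as amp;viewType and amp;page, so the server would not see a page parameter at all.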
I want to combine cURL and Simple HTML DOM.
Both work fine separately.
I want to fetch a site with cURL and then look into the inner data using the DOM,
following paginated page numbers.
I am using this code.
<?php
include 'simple_html_dom.php';
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
return $dom;
}
$url = 'http://example.com/';
$data = dlPage($url);
// echo $data;
#######################################################
$startpage = 1;
$endpage = 3;
for ($p=$startpage;$p<=$endpage;$p++) {
// Double quotes are required here so that $p is interpolated into the URL
$html = file_get_html("http://example.com/page/$p.html");
// connect to main page links
foreach ($html->find('div#link a') as $link) {
$linkHref = $link->href;
//loop through each link
$linkHtml = file_get_html($linkHref);
// parsing inner data
foreach($linkHtml->find('h1') as $title) {
echo $title;
}
foreach ($linkHtml->find('div#data') as $description) {
echo $description;
}
}
}
?>
How can I combine this to make it work as one single script?
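One possible shape for a combined script is to pass the download step in as a callable, so the same loop works with a cURL wrapper or with canned pages. The sketch below is not the Simple HTML DOM approach from the question: it swaps in PHP's built-in DOMDocument and DOMXPath so that it runs without extra libraries. scrapePages(), loadXPath() and the fake pages are hypothetical, and the fetcher is assumed to return the response body as a string. Unlike the original loops, it keeps only the first h1 and the first div#data of each inner page.

```php
<?php
// Loads an HTML string into a DOMXPath, collecting markup warnings silently.
function loadXPath(string $html): DOMXPath
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($dom);
}

// Hypothetical combined scraper. $fetch is any callable that takes a URL and
// returns the page HTML as a string, for example a cURL-based download.
function scrapePages(callable $fetch, string $baseUrl, int $startPage, int $endPage): array
{
    $results = [];
    for ($p = $startPage; $p <= $endPage; $p++) {
        $listing = loadXPath($fetch("{$baseUrl}/page/{$p}.html"));
        // Equivalent of the 'div#link a' selector.
        foreach ($listing->query("//div[@id='link']//a[@href]") as $link) {
            $inner = loadXPath($fetch($link->getAttribute('href')));
            $results[] = [
                'title' => trim($inner->evaluate('string(//h1)')),
                'data'  => trim($inner->evaluate("string(//div[@id='data'])")),
            ];
        }
    }
    return $results;
}

// Canned pages standing in for the network (made up for illustration).
$pages = [
    'http://example.com/page/1.html' =>
        '<div id="link"><a href="http://example.com/a.html">A</a></div>',
    'http://example.com/a.html' =>
        '<h1>Title A</h1><div id="data">Details A</div>',
];
$fakeFetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '<html><body></body></html>';
};

print_r(scrapePages($fakeFetch, 'http://example.com', 1, 1));
```

In a real run, $fakeFetch would be replaced by a function that performs the cURL request and returns the string from curl_exec().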
<?php
header("Content-type: text/xml");
$xml = new SimpleXMLElement("<noresult>1</noresult>");
$fn = urlencode($_REQUEST['fn']);
$ln = urlencode($_REQUEST['ln']);
$co = $_REQUEST['co'];
if (empty($fn) || empty($ln)):
echo $xml->asXML();
exit();
endif;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.linkedin.com/pub/dir/?first={$fn}&last={$ln}&search=Search");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
$res = curl_exec($ch);
preg_match("/<div id=\"content\".*?<\/div>\s*<\/div>/ms", $res, $match);
if (!empty($match)):
$dom = new DOMDocument();
$dom->loadHTML($match[0]);
$ol = $dom->getElementsByTagName('ol');
$vcard = $dom->getElementsByTagName('li');
$co_match_node = false;
for ($i = 0; $i < $vcard->length; $i++):
if (!empty($co) && stripos($vcard->item($i)->nodeValue, $co) !== false) $co_match_node = $vcard->item($i);
endfor;
if (!empty($co_match_node)):
echo $dom->saveXML($co_match_node);
// my idea is to put code here to save in the database.
else:
echo (string)$dom->saveXML($ol->item(0));
endif;
else:
echo $xml->asXML();
endif;
curl_close($ch);
exit();
I'm trying to save the XML into a MySQL database.
However, I don't know how to parse $dom or how to separate the individual "li" elements.
There are 5 fields needed in the database:
span.given-name
span.family-name
span.location
span.industry
dd.current-content span
These fields are available in the XML.
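A possible way to split the result into rows is to run one relative XPath query per field against each li. The following is a sketch under assumptions: extractVcards() is a hypothetical helper, the array keys are made-up column names, and the sample markup is invented to match the five selectors listed above, not taken from a real LinkedIn response.

```php
<?php
// Hypothetical helper: returns one associative array per <li>, holding the
// five fields named in the question.
function extractVcards(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);

    // Builds a relative XPath for "tag.class", matching whole class tokens.
    $byClass = function (string $tag, string $class): string {
        return sprintf(
            ".//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
            $tag,
            $class
        );
    };

    // Made-up column name => XPath relative to a single <li>.
    $fields = [
        'given_name'  => $byClass('span', 'given-name'),
        'family_name' => $byClass('span', 'family-name'),
        'location'    => $byClass('span', 'location'),
        'industry'    => $byClass('span', 'industry'),
        'current'     => $byClass('dd', 'current-content') . '//span',
    ];

    $rows = [];
    foreach ($xpath->query('//li') as $li) {
        $row = [];
        foreach ($fields as $column => $query) {
            // string() yields the text of the first match, or '' if none.
            $row[$column] = trim($xpath->evaluate("string($query)", $li));
        }
        $rows[] = $row;
    }
    return $rows;
}

// Invented sample markup for illustration.
$sample = '<ol><li class="vcard">'
    . '<span class="given-name">Jane</span> <span class="family-name">Doe</span>'
    . '<span class="location">Berlin</span><span class="industry">Software</span>'
    . '<dl><dd class="current-content"><span>Engineer at Example</span></dd></dl>'
    . '</li></ol>';

print_r(extractVcards($sample));
```

Each row could then be bound to a prepared INSERT statement (for example with PDO or mysqli) at the point marked by the comment in the script above.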