Parsing BlogId from Blogspot.com in PHP using Regex - php

How can I get the blog ID from a given blogspot.com URL?
I looked at the source code of a blogspot.com page, and it contains this:
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www.blogger.com/rsd.g?blogID=4899870735344410268" />
How can I parse this to get the number 4899870735344410268?

Use DOMDocument to parse the document and then use its methods to retrieve the wanted element.
I cannot stress this enough: never use regular expressions to parse an HTML document.
function getBlogId($url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$page = curl_exec ($ch);
curl_close($ch);
$doc = new DOMDocument();
@$doc->loadHTML($page);
$links = $doc->getElementsByTagName('link');
foreach($links as $link) {
$rel = $link->attributes->getNamedItem('rel');
if($rel && $rel->nodeValue == 'EditURI') {
$href = $link->attributes->getNamedItem('href')->nodeValue;
$query = parse_url($href, PHP_URL_QUERY);
if($query) {
$queryComp = array();
parse_str($query, $queryComp);
if(!empty($queryComp['blogID'])) {
return $queryComp['blogID'];
}
}
}
}
return false;
}
Example use:
$id = getBlogId('http://thehouseinmarrakesh.blogspot.com/');
echo $id; // 483911541311389592
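The DOM lookup can also be separated from the download step, which makes it easier to try out. A minimal sketch, assuming the HTML is already in a string; `extractBlogId` is a hypothetical helper name and the XPath query is a swapped-in alternative to the loop in the answer above, not part of it:

```php
<?php
// Hypothetical helper: pull the blogID query parameter out of the
// <link rel="EditURI"> element of an already-downloaded page.
function extractBlogId(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//link[@rel="EditURI"]/@href') as $href) {
        $query = parse_url($href->nodeValue, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string in this href
        }
        parse_str($query, $params);
        if (isset($params['blogID']) && is_string($params['blogID']) && $params['blogID'] !== '') {
            return $params['blogID'];
        }
    }
    return null;
}

// Sample markup taken from the question.
$sample = '<html><head><link rel="EditURI" type="application/rsd+xml" title="RSD" '
        . 'href="http://www.blogger.com/rsd.g?blogID=4899870735344410268" /></head><body></body></html>';
echo extractBlogId($sample), PHP_EOL;
```

The ID is returned as a string on purpose: it has 19 digits, and converting it to a number risks overflow or loss of precision on some platforms.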

$pageContents = file_get_contents('blogspot_url');
preg_match('~<link rel="EditURI" type="application/rsd\+xml" title="RSD" href="http://www.blogger.com/rsd.g\?blogID=([0-9]+)" />~', $pageContents, $matches);
echo $matches[1];

Related

Need help extracting meta title from a URL using curl and DOM

I need to extract the dollar amount, e.g. $594, from the meta title of a URL. I am getting the full meta title, but I only need the $594 from it, not the whole title. Here is my code. Thanks
<?php
// Web page URL
$url = 'https://www.cheapflights.com.au/flights-to-Delhi/Sydney/';
// Extract HTML using curl
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
// Load HTML to DOM object
$dom = new DOMDocument();
@$dom->loadHTML($data);
// Parse DOM to get Title data
$nodes = $dom->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
// Parse DOM to get meta data
$metas = $dom->getElementsByTagName('meta');
$description = $keywords = '';
for($i=0; $i<$metas->length; $i++){
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description'){
$description = $meta->getAttribute('content');
}
if($meta->getAttribute('name') == 'keywords'){
$keywords = $meta->getAttribute('content');
}
}
echo "$title". '<br/>';
?>
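One way to keep only the price is to apply a small regular expression to the title string after the DOM has extracted it. A sketch, assuming the amount is written as a dollar sign followed by digits; the sample title is invented for illustration and `extractDollarAmount` is a hypothetical helper:

```php
<?php
// Hypothetical helper: return the first dollar amount found in a string, or null.
function extractDollarAmount(string $title): ?string
{
    // \$ is a literal dollar sign, followed by digits with optional
    // thousands separators and optional cents.
    if (preg_match('/\$\d+(?:,\d{3})*(?:\.\d{2})?/', $title, $m)) {
        return $m[0];
    }
    return null;
}

$title = 'Cheap flights from Sydney to Delhi from $594'; // invented example title
echo extractDollarAmount($title), PHP_EOL;
```

Running the pattern on the text of the title, rather than on the raw HTML, keeps the regular expression independent of the page's markup.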

Foreach loop and problems with cURL requests

I am a beginner in PHP programming. I have a script in which I'm trying to get a string multiple times from an external website, each time with different "login" data. I am using PHP, cURL, DOM and XPath. My code seems to work only if I don't use a foreach construct to loop the entire operation, but I don't know how else I could repeat the operation while changing the data each time.
The situation is: I have just logged in, and now the site asks me to fill in two more fields that are necessary to proceed to the next page, where I can get the string that I need. The next portion of code is contained in an if block.
// A function to automatically select the form fields:
function form_fields($xpath, $query) {
$inputs = $xpath->query($query);
$fields = array();
foreach ($inputs as $input) {
$key = $input->attributes->getNamedItem('name')->nodeValue;
$type = $input->nodeName;
$value = $input->attributes->getNamedItem('value')->nodeValue;
$fields[$key] = $value;
}
return $fields;
}
// Executing the XPath queries to fill the fields:
$opzutenza = 'incarichi';
$action = $xpath->query("//form[@name='fm_$opzutenza']")->item(0)->attributes->getNamedItem('action')->nodeValue;
curl_setopt($ch, CURLOPT_URL, $action);
$fields = form_fields($xpath, "//form[@name='fm_$opzutenza']/input");
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
$html = curl_exec($ch);
$dom = new DomDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// The strings that I need to get depend on each value contained in this select element:
$options = $xpath->query("//select[@name='sceltaincarico']/option");
$partiteiva = array();
foreach($options as $option){
$partiteiva[] = $option->nodeValue;
unset($partiteiva[0]);
}
} // -----------> END OF 'IF' BLOCK
$queriesNA = array();
foreach ($partiteiva as $piv) {
$queryNA = ".//select[@name='sceltaincarico']/option[text()='$piv']";
$queriesNA[] = $queryNA;
}
// And this is the problematic loop:
foreach($queriesNA as $querypiv){
$form = $xpath->query("//form[@name='fm_scelta_tipo_incarico']")->item(0);
$action = $form->attributes->getNamedItem('action')->nodeValue;
@$option = $xpath->query($querypiv, $form);
curl_setopt($ch, CURLOPT_URL, $action);
$fields = [
'sceltaincarico' => $option->item(0)->attributes->getNamedItem('value')->nodeValue,
'tipoincaricante' => 'incDiretto'
];
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields)); // ----> Filling the last field
curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, 'https://website.com/dp/api');
curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, 'https://website.com/cons/cons-services/sc/tokenB2BCookie/get');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
$http = curl_exec($ch);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
function parse_headers($http) {
$headers = explode("\r\n", $http);
$hdrs = array();
foreach($headers as $h) {
@list($k, $v) = explode(':', $h);
$hdrs[trim($k)] = trim($v);
}
return $hdrs;
}
$hdrs = parse_headers($http);
$tokens = array(
"x-token: ".$hdrs['x-token'],
"x-b2bcookie: ".$hdrs['x-b2bcookie']
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $tokens);
curl_setopt($ch, CURLOPT_URL, "https://website.com/cons/cons-services/rs/disclaimer/accetta"); // Accepting the disclaimer...
curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, "https://website.com/portale/web/guest/home");
$html = curl_exec($ch); // Finally got to the page that I need
$dom = new DomDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Selecting the string:
$string = $xpath->query("//div[@class='informativa']/strong[2]");
$nomeazienda = array();
foreach ($string as $str) {
$nomeazienda[] = $str->childNodes->item(0)->nodeValue;
}
// Going back to the initial page so the loop can start again from the beginning:
$piva_page = 'https://website.com/portale/scelta-utenza-lavoro?....';
curl_setopt($ch, CURLOPT_URL, $piva_page);
$html = curl_exec($ch);
$dom = new DomDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
}
curl_close($ch);
These are the error messages:
Notice: Trying to get property 'attributes' of non-object...
Fatal error: Uncaught Error: Call to a member function getNamedItem() on null...
Error: Call to a member function getNamedItem() on null...
The function getNamedItem() is the first one just after the malfunctioning loop, and so are the 'attributes'.
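Separately from the reported error, two details of the loop are worth a look: `parse_headers()` is declared inside the `foreach` body, so PHP would try to declare it again on the second iteration, and `explode(':', $h)` cuts header values that themselves contain colons. A sketch of a header parser declared once, outside the loop; lower-casing the header names is an assumption made here so that lookups such as `$hdrs['x-token']` do not depend on how the server capitalises them, and the sample response is invented:

```php
<?php
// Declared once at top level, not inside the loop.
function parse_headers(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r\n|\n|\r/', $raw) as $line) {
        // Skip the status line ("HTTP/1.1 200 OK") and blank lines:
        // neither has a "name: value" shape.
        if (strpos($line, ':') === false) {
            continue;
        }
        // A limit of 2 keeps colons that belong to the value intact.
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return $headers;
}

$raw = "HTTP/1.1 200 OK\r\nX-Token: abc:123\r\nX-B2BCookie: xyz\r\n\r\n"; // invented response
$hdrs = parse_headers($raw);
echo $hdrs['x-token'], PHP_EOL;
```

With the line check in place there is no need for the `@` error suppression in front of `list()`.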

How to scrape a site using PHP

I'm getting the content of the site using this following code
function get_content($url){
$content = @file_get_contents($url);
if( empty($content) ){
$content = get_url_contents($url);
}
return $content;
}
function get_url_contents($url){
$crl = curl_init();
$timeout = 90;
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
$url = "http://www.site.com";
$html = get_content($url);
echo $html;
Everything is ok, but I need to get for example all my div elements or the title of the page or all my images.
How can I do that?
Thanks
Use an HTML parsing library. Many exist; I have personally used SimpleHTMLDom and had a good experience. It uses jQuery-style selectors, which makes it easy to learn.
Some code samples:
To get title of page:
$html = str_get_html($html);
$title = $html->find('title',0);
echo $title->plaintext;
For all div elements:
$html = str_get_html($html);
$divs = $html->find('div');
foreach($divs as $div) {
// do something;
}
You can use DOMDocument
eg:
$dom = new DOMDocument;
$dom->loadHTML($html);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
echo $div->nodeValue. PHP_EOL;
}
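The same DOMDocument approach covers the other two things the question asks for, the page title and the images. A minimal sketch, assuming the page is already in `$html`; the markup below is invented so the example runs without a network request:

```php
<?php
$html = '<html><head><title>Demo page</title></head><body>'
      . '<div>First</div><img src="a.png" alt="A"><div>Second</div><img src="b.jpg">'
      . '</body></html>'; // invented markup

$dom = new DOMDocument();
// Collect parser warnings about imperfect HTML instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Title: the first <title> element, if the page has one.
$titleNode = $dom->getElementsByTagName('title')->item(0);
$title = $titleNode ? trim($titleNode->textContent) : '';

// Images: the src attribute of every <img> element.
$images = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

echo $title, PHP_EOL;
print_r($images);
```

Checking that `item(0)` returned a node before reading it avoids an error on pages that have no title element.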

Getting url data by curl method giving unexpected results in symbols

I sometimes have a problem getting URL data with cURL, especially when the website's content is in another language such as Arabic.
My cURL function is:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
//checking mime types
if(strstr($info,'text/html')) {
curl_close($ch);
return $data;
} else {
return false;
}
}
And this is how I get the data:
$html = file_get_contents_curl($checkurl);
$grid ='';
if($html)
{
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
@$title = $nodes->item(0)->nodeValue;
@$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
}
I get all the data correctly from some Arabic websites, such as
http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873
but when I give it this YouTube URL
http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA
it shows symbols instead.
What setting do I need to change so that the title and description are shown exactly as on the page?
Introduction
Getting Arabic text right can be tricky, but there are some basic requirements you need to meet:
Your document must output UTF-8
Your DOMDocument must read the input as UTF-8
Problem
The YouTube page is already served as UTF-8, and the retrieval process applies an additional layer of UTF-8 encoding. I am not sure why this occurs, but a simple utf8_decode fixes the issue.
Example
header('Content-Type: text/html; charset=UTF-8');
echo displayMeta("http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873");
echo displayMeta("http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA");
Output
emaratalyoum.com
التقطت عدسات الكاميرا حارس مرمى ريال مدريد إيكر كاسياس في موقف محرج قبل لحظات من بداية مباراة النادي الملكي مع أبويل القبرصي في ذهاب دور الثمانية لدوري أبطال
youtube.com
أوروبا.ففي النفق المؤدي إلى الملعب، قام كاسياس بوضع إصبعه في أنفه، وبعدها قام بمسح يده في وجه أحدبنات سعوديات: أريد "شايب يدللني ولا شاب يعللني"
Function Used
displayMeta
function displayMeta($checkurl) {
$html = file_get_contents_curl($checkurl);
$grid = '';
if ($html) {
$doc = new DOMDocument("1.0","UTF-8");
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for($i = 0; $i < $metas->length; $i ++) {
$meta = $metas->item($i);
if ($meta->getAttribute('name') == 'description') {
$description = $meta->getAttribute('content');
if (stripos(parse_url($checkurl, PHP_URL_HOST), "youtube") !== false)
return utf8_decode($description);
else {
return $description;
}
}
}
}
}
file_get_contents_curl
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
// checking mime types
if (strstr($info, 'text/html')) {
curl_close($ch);
return $data;
} else {
return false;
}
}
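Note that `utf8_decode()` is deprecated as of PHP 8.2. A sketch of the same repair with `mb_convert_encoding()`, assuming the mbstring extension is available; the double-encoded input is simulated here rather than fetched from YouTube:

```php
<?php
$original = 'أوروبا'; // a correctly encoded UTF-8 string

// Simulate the problem: the UTF-8 bytes are treated as ISO-8859-1
// and encoded to UTF-8 a second time.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The repair: convert UTF-8 to ISO-8859-1, which is what utf8_decode() does.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($repaired === $original);
```

This only helps when the text really has been encoded twice. Applied to correctly encoded Arabic text, the same conversion destroys it, because Arabic characters have no ISO-8859-1 equivalent.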
I believe this will work... utf8_decode() your attribute..
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
//checking mime types
if(strstr($info,'text/html')) {
curl_close($ch);
return $data;
} else {
return false;
}
}
$html = file_get_contents_curl($checkurl);
$grid ='';
if($html)
{
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
@$title = $nodes->item(0)->nodeValue;
@$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = utf8_decode($meta->getAttribute('content'));
}
What happens here is that you're discarding the found Content-Type header that cURL returned in your file_get_contents_curl() function; DOMDocument needs that information to understand the character set that was used on the page.
A somewhat ugly hack, but most generic, is to prefix the returned page with a <meta> tag containing the returned character set from the response headers:
if (strstr($info, 'text/html')) {
curl_close($ch);
return '<meta http-equiv="Content-Type" content="' . $info . '" />' . $data;
}
DOMDocument will accept the misplaced meta tag and do the respective conversions automatically.
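The prefixing step can be tried in isolation. A sketch, assuming the content type string is whatever `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)` returned; `prependCharsetMeta` is a hypothetical helper name and the page below is invented:

```php
<?php
// Hypothetical helper: put the response's Content-Type in front of the page
// so the HTML parser sees the character set before any non-ASCII text.
function prependCharsetMeta(string $html, string $contentType): string
{
    return '<meta http-equiv="Content-Type" content="'
        . htmlspecialchars($contentType, ENT_QUOTES) . '" />' . $html;
}

$page = '<html><head><title>أوروبا</title></head><body></body></html>'; // invented UTF-8 page

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the misplaced <meta> triggers parser warnings
$doc->loadHTML(prependCharsetMeta($page, 'text/html; charset=UTF-8'));
libxml_clear_errors();

$titleNode = $doc->getElementsByTagName('title')->item(0);
echo $titleNode ? $titleNode->textContent : '(no title found)', PHP_EOL;
```

If the server sends a Content-Type without a charset parameter, the prefix carries no encoding information and the original problem remains.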

CURL alternative to the built-in "file_get_contents()" function

From my understanding this should be fairly simple: I should only need to change the original file_get_contents() code, and the rest of the script should still work. I have commented out the old file_get_contents() calls and added the cURL code below.
After changing from file_get_contents() to cURL, the code below does not output anything.
//$data = @file_get_contents("http://www.city-data.com/city/".$cityActualURL."-".$stateActualURL.".html");
//$data = file_get_contents("http://www.city-data.com/city/Geneva-Illinois.html");
//Initialize the Curl session
$ch = curl_init();
$url= "http://www.city-data.com/city/".$cityActualURL."-".$stateActualURL.".html";
//echo "$url<br>";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
//echo $data;
$details = str_replace("\n", "", $data);
$details = str_replace("\r", "", $details);
$detailsBlock = <<<HTML
~<div style='clear:both;'></div><br/><b>(.*?) on our <a href='http://www.city-data.com/top2/toplists2.html'>top lists</a>: </b><ul style='margin:10px;'>(.*?)<div style='bp_bindex'>~
HTML;
$detailsBlock2 = <<<HTML
~<br/><br/><b>(.*?) on our <a href='http://www.city-data.com/top2/toplists2.html'>top lists</a>: </b><ul style='margin:10px;'>(.*?)</ul></td>~
HTML;
$detailsBlock3 = <<<HTML
~<div style='clear:both;'></div><br/><b>(.*?) on our <a href='http://www.city-data.com/top2/toplists2.html'>top lists</a>: </b><ul style='margin:10px;'>(.*?)</ul></td>~
HTML;
preg_match($detailsBlock, $details, $matches);
preg_match($detailsBlock2, $details, $matches2);
preg_match($detailsBlock3, $details, $matches3);
if (isset($matches[2]))
{
$facts = "<ul style='margin:10px;'>".$matches[2];
}
elseif (isset($matches2[2]))
{
$facts = "<ul style='margin:10px;'>".$matches2[2];
}
elseif (isset($matches3[2]))
{
$facts = "<ul style='margin:10px;'>".$matches3[2];
}
else
{
$facts = "More Information to Come...";
}
If you have a problem with your script you need to debug it. For example:
$data = curl_exec($ch);
var_dump($data); die();
Then you will see what $data contains. Depending on the output, you can decide where to look next for the cause of the malfunction.
The following function works great, just pass it a URL.
function file_get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); //Set curl to return the data instead of printing it to the browser.
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
TIP: New lines and carriage returns can be replaced with one line of code.
$details = str_replace(array("\r\n","\r","\n"), '', $data);
