Getting title of a webpage issue - php

I want to get the title of a webpage with file_get_contents(),
I tried:
$get=file_get_meta_tags("http://example.com");
echo $get["title"];
but it doesn't match.
What is wrong with it?

Title tag is not part of match in get_meta_tags() function and it is also not a meta tag.
Try this:
$get=file_get_contents("http://example.com");
preg_match("#<title>(.*?)</title>#i,$get,$matches);
print_r($matches);
Regex #<title>(.*?)</title>#i matches the title string.

Use the Below Code snipet to get the webpage title.
<?php
function curl_file_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$targetUrl = "http://google.com/";
$html = curl_file_get_contents($targetUrl);
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$page_title = $nodes->item(0)->nodeValue;
echo "Title: $page_title". '<br/><br/>';
?>

Related

Xpath scraping further link from subpages?

I Finally managed to make a script in php for scraping basic elements from other websites. It is super simple. This example shows how to get title and url.
ini_set('display_errors', 1);
$url = 'http://test123cxqwq12.000webhostapp.com/mainpage.php';
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$title = $xpath->query('/html/body/a/h1');
$source = $xpath->query('/html/body/a/#href');
for ($i = 0; $i <= count($source)-1; $i++) {
$new = $source[$i]->nodeValue;
$text = $title[$i]->nodeValue;
echo ""."</br>";
}
Page with results: http://test123cxqwq12.000webhostapp.com/scrap.php
Page to scraping content: http://test123cxqwq12.000webhostapp.com/mainpage.php
Subpage: http://test123cxqwq12.000webhostapp.com/subpage.php
Now I would like to go a step further and take the data from the subpage. So instead of taking source from main page like is right now. I would like to go into this source and take another source (in this example google.com link) from subpage. I'm out of ideas. I would like to ask for some tips, is it possible to do it with xpath in similar way I was doing now?
I think a solution could be to store the URL in a database then apply your Curl and xpath functions to them
<?php
function curlGet($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch);
curl_close($ch);
return $results;
}
function returnXPathObject($item) {
$xmlPageDom = new DomDocument();
#$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
$allUrl = $cxn->query("SELECT * FROM yourDatabaseUrl");
$allUrl = $allUrl->fetchAll();
for ($i = 0; $i<count($allUrl); $i++){
$url = $allUrl[$i];
$getDom = curlGet($url);
$getDomXpath = returnXPathObject($getDom);
$title = $getDomXpath->query('/html/body/a/h1');
$source = $getDomXpath->query('/html/body/a/#href');
}
I'm not sure about this answer it's just a proposition

Scraping HTML page using XPath and PHP

I'm trying to scraping a HTML page using this PHP code
<?php
ini_set('display_errors', 1);
$url = 'http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
All works fine (no error and in my browser console I can see '200' as return code ...), but nothing is printed in my HTML page .... .
Probably the problem is about the xpath /html/body/div/div/div[4]/div[3]/section/p that refers to the first green line in the source HTML page, but this is my Firefox Firebug tells me for that page section ....
Suggestions / examples?
!!! UPDATE !!!!
As Santosh Sapkota suggest in his reply, the first problem is that the text inside that green box, is loaded from iFrame ... I've seen the url of the HTML page inside the IFrame ad so I've tried to use this one in my code that now is ...
<?php
ini_set('display_errors', 1);
$url = 'http://listeps.cittadellasalute.to.it/?id=01090101';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
but unfortunately nothing is still printed in my output HTML page ....
Other suggestions / examples?
Must be problem with you xpath. As well as check if there is content laded from iFrame or not.

Getting site title in unknown format using Php Curl and Dom-Document

I want to get site title using site url with most of the site it is working but it is getting some not readable text with japennese and chinnese site.
Here is my function
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Use
use--------
$html = $this->file_get_contents_curl($url);
Parsing
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
I am getting this ouput "ã¢ã¡ã¼ãIDç»é² ã¡ã¼ã«ã®ç¢ºèªï½Ameba(ã¢ã¡ã¼ã)"
Site URL : https://user.ameba.jp/regist/registerIntro.do?campaignId=0053&frmid=3051
Please help me out suggest some way to get exact site title in any language.
//example
/* MEthod----------4 */
function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$uurl="http://www.piaohua.com/html/xuannian/index.html";
$html = file_get_contents_curl($uurl);
//parsing begins here:
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
if(!empty($nodes->item(0)->nodeValue)){
$title = utf8_decode($nodes->item(0)->nodeValue);
}else{
$title =$uurl;
}
echo $title;
Make sure your script is using utf-8 encoding by adding following line to the begining of the file
mb_internal_encoding('UTF-8');
After doing so, remove utf8_decode function from your code. Everything should work fine without it
[DOMDocument::loadHtml]1 function gets encoding from html page meta tag. So you could have problems if page do not excplicitly specifies its encoding.
Simply add this line on top of your PHP Code.
header('Content-Type: text/html;charset=utf-8');
The code..
<?php
header('Content-Type: text/html;charset=utf-8');
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl('http://www.piaohua.com/html/lianxuju/2013/1108/27730.html');
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
echo $title = $nodes->item(0)->nodeValue;

Why does this regex not match the URLs in this Google results page?

I'm having trouble scraping the URLs out of the Google results. This code worked for me for a long time but seems like Google changed a few things this week and now I'm getting a ton of extra characters surrounded by the actual URL I want.
preg_match_all('#<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>#siU', $results, $matches[$key]);
EDIT
All links come out like this when scraped with the above code
/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($data);
foreach($dom->getElementsByTagName('a') as $link) {
echo $link->getAttribute('href');
echo "<br />";
}
?>

php dom not accepting url

I am trying to create a program that will open a text file with urls seperated by |. It will then take the first line of the text document, crawl that url and remove it from the text file. Each url is to be scraped by a basic crawler. I know the crawler part works because if I enter in one of the urls in quotations, rather than a variable from the text file, it will work. I am at the point where it will not return anything because the url simply will not be accepted.
this is a basic version of my code because I had to break it down alot to iscolate the problem.
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$dom = new DOMDocument('1.0');
#$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}`
The code below works like a champ tested with your example data:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
?>
ALMOST FORGOT: LETS NOW PUT IT IN A LOOP TO LOOP THROUGH ALL URLS:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
foreach($urlarray as $url) {
if(!empty($url)) {
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,trim($url));
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
echo '<hr />';
}
}
?>
So if you put in a URL manually $url = 'http://www.mywebsite.com'; every thing works as expected?
If so there is a problem here:
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc?
I would var dump
$contents = file_get_contents('urls.txt') before the explode statement to see if it is loading in.
If yes, then I would explode the into $urlarray, and var dump $urlarray[0]
if it looks right I would trim it before being sent to dom with trim($urlarray[0])
I may even go as far as using valid regex to make sure these URL's are in fact URL's before sending it to dom.
Let me know the results and I will try to help further, or post all sample code including URLS.txt
And I will run it locally

Categories