Getting url data by curl method giving unexpected results in symbols - php

I sometimes have a problem getting URL data with cURL, especially when the website's content is in another language such as Arabic.
My cURL function is:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
//checking mime types
if(strstr($info,'text/html')) {
curl_close($ch);
return $data;
} else {
return false;
}
}
And this is how I am getting the data:
$html = file_get_contents_curl($checkurl);
$grid ='';
if($html)
{
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
@$title = $nodes->item(0)->nodeValue;
@$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
}
I am getting all the data correctly from some Arabic websites, such as
http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873
but when I give it this YouTube URL
http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA
it shows symbols instead of the text.
What setting do I have to change to show exactly the same title and description?

Introduction
Getting Arabic text can be tricky, but there are some basic things you need to ensure:
Your document must output UTF-8
Your DOMDocument must read the input as UTF-8
Problem
When getting YouTube information, the data is already in UTF-8, and the retrieval process adds an additional layer of UTF-8 encoding. I'm not sure why this occurs, but a simple utf8_decode() fixes the issue.
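To illustrate the double encoding described above, here is a minimal sketch. It assumes the mbstring extension is available and uses a short sample string in place of the real page data. It shows how UTF-8 bytes that are misread as ISO-8859-1 turn into symbols, and how converting back to ISO-8859-1 (the same conversion utf8_decode() performs) restores them. Note that utf8_decode() is deprecated as of PHP 8.2, so mb_convert_encoding() is used here instead.

```php
<?php
// Assumes the mbstring extension is loaded.
$original = 'مرحبا'; // sample UTF-8 text standing in for the page data

// Misreading the UTF-8 bytes as ISO-8859-1 and converting them to UTF-8
// again produces the "symbols" (double encoding).
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Converting from UTF-8 back to ISO-8859-1 removes the extra layer; this is
// the same conversion that utf8_decode() performs.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

var_dump($restored === $original);
```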
Example
header('Content-Type: text/html; charset=UTF-8');
echo displayMeta("http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873");
echo displayMeta("http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA");
Output
emaratalyoum.com
التقطت عدسات الكاميرا حارس مرمى ريال مدريد إيكر كاسياس في موقف محرج قبل لحظات من بداية مباراة النادي الملكي مع أبويل القبرصي في ذهاب دور الثمانية لدوري أبطال
youtube.com
أوروبا.ففي النفق المؤدي إلى الملعب، قام كاسياس بوضع إصبعه في أنفه، وبعدها قام بمسح يده في وجه أحدبنات سعوديات: أريد "شايب يدللني ولا شاب يعللني"
Function Used
displayMeta
function displayMeta($checkurl) {
$html = file_get_contents_curl($checkurl);
$grid = '';
if ($html) {
$doc = new DOMDocument("1.0","UTF-8");
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for($i = 0; $i < $metas->length; $i ++) {
$meta = $metas->item($i);
if ($meta->getAttribute('name') == 'description') {
$description = $meta->getAttribute('content');
if (stripos(parse_url($checkurl, PHP_URL_HOST), "youtube") !== false)
return utf8_decode($description);
else {
return $description;
}
}
}
}
}
*file_get_contents_curl*
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
// checking mime types
if (strstr($info, 'text/html')) {
curl_close($ch);
return $data;
} else {
return false;
}
}

I believe this will work: apply utf8_decode() to the attribute value.
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
//checking mime types
if(strstr($info,'text/html')) {
curl_close($ch);
return $data;
} else {
return false;
}
}
$html = file_get_contents_curl($checkurl);
$grid ='';
if($html)
{
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
@$title = $nodes->item(0)->nodeValue;
@$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = utf8_decode($meta->getAttribute('content'));
}

What happens here is that you're discarding the found Content-Type header that cURL returned in your file_get_contents_curl() function; DOMDocument needs that information to understand the character set that was used on the page.
A somewhat ugly hack, but most generic, is to prefix the returned page with a <meta> tag containing the returned character set from the response headers:
if (strstr($info, 'text/html')) {
curl_close($ch);
return '<meta http-equiv="Content-Type" content="' . $info . '" />' . $data;
}
DOMDocument will accept the misplaced meta tag and do the respective conversions automatically.
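A minimal sketch of that approach, kept free of network calls so it can be run as is. The helper names html_with_charset_hint() and read_title() are hypothetical, and the sample $body stands in for what curl_exec() would return, with the Content-Type value taken from curl_getinfo().

```php
<?php
// Hypothetical helper: prefix the body with a <meta> tag that carries the
// Content-Type header value, so DOMDocument can pick up the charset.
function html_with_charset_hint(string $body, string $contentType): string
{
    $hint = '<meta http-equiv="Content-Type" content="'
          . htmlspecialchars($contentType, ENT_QUOTES) . '" />';
    return $hint . $body;
}

// Hypothetical helper: return the text of the first <title>, or null.
function read_title(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect markup warnings internally instead of silencing them with "@".
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

// Sample body and header value standing in for a real response.
$body = '<html><head><title>مرحبا</title></head><body></body></html>';
$title = read_title(html_with_charset_hint($body, 'text/html; charset=UTF-8'));
echo $title, "\n";
```

If the server sends a Content-Type without a charset, the hint carries no encoding information and DOMDocument falls back to its default handling.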

Related

Need help extracting meta title from an URL using curl and DOM

I need to extract the dollar amount (e.g. $594) from the meta title of a URL. I am getting the full meta title; however, I only need the $594 from it, not the whole title. Here is my code. Thanks.
<?php
// Web page URL
$url = 'https://www.cheapflights.com.au/flights-to-Delhi/Sydney/';
// Extract HTML using curl
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
// Load HTML to DOM object
$dom = new DOMDocument();
@$dom->loadHTML($data);
// Parse DOM to get Title data
$nodes = $dom->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
// Parse DOM to get meta data
$metas = $dom->getElementsByTagName('meta');
$description = $keywords = '';
for($i=0; $i<$metas->length; $i++){
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description'){
$description = $meta->getAttribute('content');
}
if($meta->getAttribute('name') == 'keywords'){
$keywords = $meta->getAttribute('content');
}
}
echo "$title". '<br/>';
?>
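One possible way to pull the amount out of the title is a regular expression. This is only a sketch: the function name first_dollar_amount() is hypothetical, and the sample title is made up because the real wording of the page title is not known here.

```php
<?php
// Hypothetical helper: return the first dollar amount in $text
// (e.g. "$594" or "$1,250.50"), or null when there is none.
function first_dollar_amount(string $text): ?string
{
    // "\$" is a literal dollar sign; thousands separators and a decimal
    // part are optional.
    if (preg_match('/\$\d+(?:,\d{3})*(?:\.\d+)?/', $text, $matches)) {
        return $matches[0];
    }
    return null;
}

// Made-up title; in the code above this would be $title.
$sampleTitle = 'Cheap flights from Sydney to Delhi from $594';
echo first_dollar_amount($sampleTitle) ?? 'no price found', "\n";
```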

Curl and array values in curlopt_url does not work

I have a very weird issue with cURL when the URLs are defined inside an array.
I have an array of URLs and I want to perform an HTTP GET on each of them with cURL:
for ($i = 0, $n = count($array_station) ; $i < $n ; $i++)
{
$station= curl_init();
curl_setopt($station, CURLOPT_VERBOSE, true);
curl_setopt($station, CURLOPT_URL, $array_station[$i]);
curl_setopt($station, CURLOPT_RETURNTRANSFER, true);
curl_setopt($station, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($station);
curl_close($station);
}
If I define $array_station as below,
$array_station=array("http://www.example.com","http://www.example2.com");
the code above works flawlessly. But when $array_station is built as shown below (I scan a directory for a specific filename, then clean up the path into a URL), cURL does not work: no error is shown and nothing happens.
$di = new RecursiveDirectoryIterator(__DIR__,RecursiveDirectoryIterator::SKIP_DOTS);
$it = new RecursiveIteratorIterator($di);
$array_station=array();
$i=0;
foreach($it as $file) {
if (pathinfo($file, PATHINFO_FILENAME ) == "db_insert") {
$string = str_replace('/web/htdocs/', 'http://', $file.PHP_EOL);
$string2 = str_replace('/home','', $string);
$array_station[$i]=$string2;
$i++;
}
}
Do you have any ideas? I'm giving up :-(
I'm on mobile right now so i cannot test it, but why are you adding a new line (PHP_EOL) to the url? Try to remove the new line or trim() the url at the end.
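A minimal sketch of the trim() suggestion, using a hypothetical helper path_to_url() and a made-up path. It applies the same two replacements as the question's code and then strips the trailing newline that PHP_EOL added.

```php
<?php
// Hypothetical helper: turn a filesystem path below /web/htdocs/ into a URL
// and strip surrounding whitespace, including a trailing newline.
function path_to_url(string $path): string
{
    $url = str_replace('/web/htdocs/', 'http://', $path);
    $url = str_replace('/home', '', $url);
    return trim($url);
}

// Made-up path with the trailing PHP_EOL that the original loop appends.
$withNewline = '/web/htdocs/www.example.com/home/station1/db_insert.php' . PHP_EOL;
var_dump(path_to_url($withNewline));
```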
Add the lines of code below.
If there is a curl error it will report the error number.
If the request is made, it will show the HTTP request and response headers. The request is in $info and response header is in $head
for ($i = 0, $n = count($array_station) ; $i < $n ; $i++)
{
$station= curl_init();
curl_setopt($station, CURLOPT_VERBOSE, true);
curl_setopt($station, CURLOPT_URL, $array_station[$i]);
curl_setopt($station, CURLOPT_RETURNTRANSFER, true);
curl_setopt($station, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($station, CURLOPT_HEADER, true);
curl_setopt($station, CURLINFO_HEADER_OUT, true);
$response = curl_exec($station);
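if (curl_errno($station)){
// Report the cURL error number and message; no headers are available
$head = $info = '';
echo 'Retrieve Base Page Error ' . curl_errno($station) . ': ' . curl_error($station);
}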
else {
$skip = intval(curl_getinfo($station, CURLINFO_HEADER_SIZE));
$head = substr($response ,0,$skip);
$response = substr($response ,$skip);
$info = var_export(curl_getinfo($station),true);
}
echo $head;
echo $info;
curl_close($station);
}

Get meta data from tumblr

I would like to store Open Graph meta tags (og:image, og:description) from a Tumblr blog in a database on my site. I tried to get the meta tags from Tumblr with this code:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://www.example.com");
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
I know this code won't get the Open Graph tags, but the problem is that it doesn't get any meta tags at all. The code works fine with, for example, www.google.com, but not with Tumblr.
How can I get the Tumblr meta tags?
Thank you!
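For the Open Graph part, the tags use a property attribute rather than name, so the loop has to read that attribute. The sketch below assumes the HTML has already been fetched and uses a made-up sample page; the function name open_graph_tags() is hypothetical. It does not explain why the Tumblr request returns no meta tags at all.

```php
<?php
// Hypothetical helper: collect Open Graph tags
// (<meta property="og:..." content="...">) from an HTML string.
function open_graph_tags(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            $tags[$property] = $meta->getAttribute('content');
        }
    }
    return $tags;
}

// Made-up page standing in for the fetched HTML.
$sampleHtml = '<html><head>'
    . '<meta property="og:image" content="http://www.example.com/a.png" />'
    . '<meta property="og:description" content="Sample description" />'
    . '</head><body></body></html>';
print_r(open_graph_tags($sampleHtml));
```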

Get Title and Meta description and image of External site in yii framework

How can I get the meta description, title, and image from an external site URL? I have achieved this in plain PHP, but I don't know how to use it in a Yii controller. My code is:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://www.example.com");
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'language')
$language = $meta->getAttribute('content');
}
echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
I'm new to Yii; any help is appreciated.
I used your code (with some minor edits) to create the following file. Save it in protected/components/HttpDetails.php (note: error handling not implemented - in case of http failure or other)
class HttpDetails {
private static function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
public static function getDetails($url) {
$html = self::file_get_contents_curl($url);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if ($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
return array(
'title'=>isset($title)?$title:'Not set',
'description'=> isset($description)?$description:'Not set',
'keywords'=> isset($keywords)?$keywords:'Not set',
);
}
}
Edit your import array in protected\config\main.php to include 'application.components.HttpDetails'
...
'import' => array(
...
'application.components.HttpDetails',
),
To read the details from a page do the following in any controller (or elsewhere in your application)
$url = "www.cnn.com";
$details = HttpDetails::getDetails($url);
$title = $details['title'];
$description = $details['description'];
$keywords = $details['keywords'];
The above exact code has been tested and works fine. If you are getting errors, you should check your php environment for DOM / libxml extensions where your Yii is hosted.

DOM Xpath wordpress Grab content

I have a plugin that I want to modify, but I'm stuck. Here is the PHP function:
function wpr_ezinemarkpost($keyword,$num,$start,$optional="",$comments="",$options,$template,$ua,$proxy,$proxytype,$proxyuser) {
global $wpdb,$wpr_table_templates;
$page = $start / 20;
$page = (string) $page;
$page = explode(".", $page);
$page=(int)$page[0];
$page++;
if($page == 0) {$page = 1;}
$prep = floor($start / 20);
$numb = $start - $prep * 20;
$search_url = "http://www.freewptube.com/demo4/";
// make the cURL request to $search_url
if ( function_exists('curl_init') ) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
if($proxy != "") {
//curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
if($proxyuser) {curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyuser);}
if($proxytype == "socks") {curl_setopt ($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);}
}
curl_setopt($ch, CURLOPT_URL,$search_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 45);
$html = curl_exec($ch);
if (!$html) {
$return["error"]["module"] = "Article";
$return["error"]["reason"] = "cURL Error";
$return["error"]["message"] = __("cURL Error Number $search_url","wprobot").curl_errno($ch).": ".curl_error($ch);
return $return;
}
curl_close($ch);
} else {
$html = @file_get_contents($search_url);
if (!$html) {
$return["error"]["module"] = "Article";
$return["error"]["reason"] = "cURL Error";
$return["error"]["message"] = __("cURL is not installed on this server!","wprobot");
return $return;
}
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// Grab Product Links
$xpath = new DOMXPath($dom);
$paras = $xpath->query("//div[@class='boxtitle']//h2/a");
$x = 0;
$end = $numb + $num;
if($paras->length == 0) {
$posts["error"]["module"] = "Article";
$posts["error"]["reason"] = "No content";
$posts["error"]["message"] = __("No (more) articles found. $search_url","wprobot");
return $posts;
}
if($end > $paras->length) { $end = $paras->length;}
for ($i = $numb; $i < $end; $i++ ) {
$para = $paras->item($i);
if(empty($para)) {
$posts["error"]["module"] = "Article";
$posts["error"]["reason"] = "No content";
$posts["error"]["message"] = __("No (more) articles found. $search_url","wprobot");
print_r($posts);
return $posts;
} else {
$target_url = $para->getAttribute('href');
// make the cURL request to $search_url
if ( function_exists('curl_init') ) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
if($proxy != "") {
//curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
if($proxyuser) {curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyuser);}
if($proxytype == "socks") {curl_setopt ($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);}
}
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 45);
$html = curl_exec($ch);
if (!$html) {
$return["error"]["module"] = "Article";
$return["error"]["reason"] = "cURL Error";
$return["error"]["message"] = __("cURL Error Number $search_url","wprobot").curl_errno($ch).": ".curl_error($ch);
return $return;
}
curl_close($ch);
} else {
$html = @file_get_contents($target_url);
if (!$html) {
$return["error"]["module"] = "Article";
$return["error"]["reason"] = "cURL Error";
$return["error"]["message"] = __("cURL is not installed on this server!","wprobot");
return $return;
}
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// Grab Article Title
$xpath1 = new DOMXPath($dom);
$paras1 = $xpath1->query("//div[@class='textsection']/h2");
$para1 = $paras1->item(0);
$title = $para1->textContent;
if (empty($title)) {
$return["error"]["module"] = "Article";
$return["error"]["reason"] = "IncNum";
$return["error"]["message"] = __("Video content skipped. ","wprobot");
return $return;
}
// Grab Article
$xpath2 = new DOMXPath($dom);
$paras2 = $xpath2->query("//div[@id='screen']/div[@class='videosection']");
$para2 = $paras2->item(0);
$string = $dom->saveXml($para2);
if ($options['wpr_eza_striplinks']=='yes') {$string = wpr_strip_selected_tags($string, array('a'));}
$articlebody .= $string. ' ';
// Grab Ressource Box
$xpath3 = new DOMXPath($dom);
$paras3 = $xpath3->query("//div[@id='extras']//h4/a");
$ressourcetext = "";
for ($y = 0; $y < $paras3->length; $y++ ) { //$paras->length
$para3 = $paras3->item($y);
$ressourcetext .= $dom->saveXml($para3);
}
$title = utf8_decode($title);
// Split into Pages
if($options['wpr_eza_split'] == "yes") {
$articlebody = wordwrap($articlebody, $options['wpr_eza_splitlength'], "<!--nextpage-->");
}
$post = $template;
$post = wpr_random_tags($post);
$post = str_replace("{article}", $articlebody, $post);
$post = str_replace("{authortext}", $ressourcetext, $post);
$noqkeyword = str_replace('"', '', $keyword2);
$post = str_replace("{keyword}", $noqkeyword, $post);
$post = str_replace("{Keyword}", ucwords($noqkeyword), $post);
$post = str_replace("{title}", $title, $post);
$post = str_replace("{url}", $target_url, $post);
if(function_exists("wpr_rewrite_partial")) {
$post = wpr_rewrite_partial($post,$options);
}
if(function_exists("wpr_translate_partial")) {
$post = wpr_translate_partial($post);
}
/* We are adding a call to this function to ensure that our keyword is used at least once */
$posts[$x]["unique"] = $target_url;
$posts[$x]["title"] = $title;
$posts[$x]["content"] = $post;
$x++;
}
}
return $posts;
}
I have already made it grab the title and the embedded video, but I also want to grab the thumbnails located on the homepage. How can I make the thumbnails appear above the embedded video code? By the way, this is a WordPress plugin that I am modifying for my own use.
Thanks
