Trouble getting meta description from some sites using PHP

I cannot fetch the meta description tag from some sites; one in particular is YouTube.
I have tried using get_meta_tags(), but it does not return the description. I have tried several regexes as well. The title returns fine.
Try getting the description from this link: http://www.youtube.com/watch?v=xci0-26M-bk
$url = 'http://www.youtube.com/watch?v=xci0-26M-bk';
if ($fp = @fopen($url, 'r')) {
    $file = file($url);
    $file = implode("", $file);
    $tags = get_meta_tags($url);
    $description = trim($tags['description']);
}
$description returns blank...

Why don't you pass $url directly to get_meta_tags()?
If I do:
print_r(get_meta_tags('http://www.youtube.com/watch?v=xci0-26M-bk'));
I get:
Array ( [title] => Catfish Blues - Jimi Hendrix Experience [description] => Couldn't ignore Jimi any longer. This is a track recorded for Radio one about 1967. Love the simplicity (seemingly)and the, understated, power. [keywords] => Jimi, Hendrix, sixties )
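If get_meta_tags() still comes back empty for a particular site, a fallback is to fetch the page yourself and read the tag out of the DOM. A minimal sketch (the fetch_meta_description() helper and the User-Agent string are my own, not from the question):
<?php
// Sketch of a fallback: fetch the page with an explicit User-Agent (some sites
// serve different markup to the default PHP agent) and read the meta
// description out of the DOM. Helper name and UA string are illustrative only.
function fetch_meta_description($url) {
    $context = stream_context_create(array(
        'http' => array('user_agent' => 'Mozilla/5.0 (compatible; MetaFetcher/1.0)')
    ));
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        return null;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            return trim($meta->getAttribute('content'));
        }
    }
    return null;
}

echo fetch_meta_description('http://www.youtube.com/watch?v=xci0-26M-bk');
?>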

Related

How to get thumbnail image from Google News RSS [duplicate]

I want to parse Google News rss with PHP. I managed to run this code:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
foreach ($news->channel->item as $item) {
    echo "<strong>" . $item->title . "</strong><br />";
    echo strip_tags($item->description) . "<br /><br />";
}
?>
However, I'm unable to solve the following problems:
How can I get the hyperlink of the news title?
Each Google News item has many related news links in its footer, and my code above includes them as well. How can I remove those from the description?
How can I also get the image of each news item? (Google displays a thumbnail image for each one.)
Thanks.
There we go, just what you need for your particular situation:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
$feeds = array();
$i = 0;
foreach ($news->channel->item as $item) {
    preg_match('#src="([^"]+)"#', $item->description, $match);
    $parts = explode('<font size="-1">', $item->description);
    $feeds[$i]['title'] = (string) $item->title;
    $feeds[$i]['link'] = (string) $item->link;
    $feeds[$i]['image'] = $match[1];
    $feeds[$i]['site_title'] = strip_tags($parts[1]);
    $feeds[$i]['story'] = strip_tags($parts[2]);
    $i++;
}
echo '<pre>';
print_r($feeds);
echo '</pre>';
?>
And the output should look like this:
[2] => Array
(
[title] => Los Alamos Nuclear Lab Under Siege From Wildfire - ABC News
[link] => http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGxBe4YsZArH0kSwEjq_zDm_h-N4A&url=http://abcnews.go.com/Technology/wireStory?id%3D13951623
[image] => http://nt2.ggpht.com/news/tbn/OhH43xORRwiW1M/6.jpg
[site_title] => ABC News
[story] => A wildfire burning near the desert birthplace of the atomic bomb advanced on the Los Alamos laboratory and thousands of outdoor drums of plutonium-contaminated waste Tuesday as authorities stepped up ...
)
I'd recommend checking out SimplePie. I've used it for several different projects and it works great (and abstracts away all of the headache you're currently dealing with).
Now, if you're writing this code simply because you want to learn how to do it, you should probably ignore this answer. :)
To get the URL for a news item, use $item->link.
If there's a common delimiter for the related news links, you could use regex to cut off everything after it.
Google puts the thumbnail image HTML code inside the description field of the feed. You could regex out everything between the open and close brackets for the image declaration to get the HTML for it.
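For reference, a minimal SimplePie sketch of reading the same feed (the require path is an assumption; adjust it to wherever SimplePie lives in your project):
<?php
// Rough sketch using SimplePie instead of hand-rolled parsing.
require_once 'SimplePie/autoloader.php'; // illustrative path

$feed = new SimplePie();
$feed->set_feed_url('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
$feed->init();
$feed->handle_content_type();

foreach ($feed->get_items() as $item) {
    echo '<strong>' . $item->get_title() . '</strong><br />';
    echo $item->get_permalink() . '<br />';
    echo strip_tags($item->get_description()) . '<br /><br />';
}
?>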

PHP Simple HTML DOM Parser - Link element in RSS

I just started using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) and have some problems parsing XML.
I can perfectly parse all the links from HTML documents, but parsing links from RSS feeds (XML format) doesn't work. For example, I want to parse all the links from http://www.bing.com/search?q=ipod&count=50&first=0&format=rss so I use this code:
$content = file_get_html('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
foreach ($content->find('item') as $entry) {
    $item['title'] = $entry->find('title', 0)->plaintext;
    $item['description'] = $entry->find('description', 0)->plaintext;
    $item['link'] = $entry->find('link', 0)->plaintext;
    $parsed_results_array[] = $item;
}
print_r($parsed_results_array);
The script parses title and description, but the link element is empty. Any ideas? My guess is that "link" is a reserved word or something, so how do I get the parser to work?
I suggest you use the right tool for this job: SimpleXML. Plus, it's built in. :)
$xml = simplexml_load_file('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$parsed_results_array = array();
foreach ($xml as $entry) {
    foreach ($entry->item as $item) {
        // $parsed_results_array[] = json_decode(json_encode($item), true);
        $items['title'] = (string) $item->title;
        $items['description'] = (string) $item->description;
        $items['link'] = (string) $item->link;
        $parsed_results_array[] = $items;
    }
}
echo '<pre>';
print_r($parsed_results_array);
Should yield something like:
Array
(
[0] => Array
(
[title] => Apple - iPod
[description] => Learn about iPod, Apple TV, and more. Download iTunes for free and purchase iTunes Gift Cards. Check out the most popular TV shows, movies, and music.
[link] => http://www.apple.com/ipod/
)
[1] => Array
(
[title] => iPod - Wikipedia, the free encyclopedia
[description] => The iPod is a line of portable media players designed and marketed by Apple Inc. The first line was released on October 23, 2001, about 8½ months after ...
[link] => http://en.wikipedia.org/wiki/IPod
)
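As a side note, the commented-out json_decode(json_encode()) line above is a shortcut that turns each SimpleXML item straight into an associative array. A quick sketch of that variant (same feed, equivalent iteration over channel items):
$xml = simplexml_load_file('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$parsed_results_array = array();
foreach ($xml->channel->item as $item) {
    // The encode/decode round trip flattens the SimpleXMLElement into a plain array
    $parsed_results_array[] = json_decode(json_encode($item), true);
}
print_r($parsed_results_array);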
If you are used to PHP Simple HTML DOM, you can keep using it!
Too many approaches just create confusion, and simplehtmldom is already easy and powerful.
Be sure you start like this:
require_once('lib/simple_html_dom.php');
$content = file_get_contents('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$xml = new simple_html_dom();
$xml->load($content);
Then you can go on with your queries!
Edit the simple_html_dom class: in the protected $self_closing_tags property, delete the "link" key.
BEFORE:
protected $self_closing_tags = array('img'=>1, 'br'=>1,'link'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
AFTER:
protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
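With that edit in place, link is no longer treated as a self-closing tag, so the original query from the question should return the link text as well. A quick check (a sketch, reusing the $xml object loaded above):
// Sketch only: assumes the $self_closing_tags edit above has been made.
foreach ($xml->find('item') as $entry) {
    echo $entry->find('link', 0)->plaintext . "<br />";
}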

How to get Google search results page into a DOMDocument object in PHP?

I'm not looking to scrape Google. This is just a one-time thing to get about 300 URLs a bit faster than doing it manually.
I can't seem to get a DOMDocument to be created though. It always ends up as an empty object.
search_list.txt contains my list of search terms. Right now I'm testing it with just 1 term, "legos".
The script correctly downloads the search results page. I viewed it in a web browser and it looked fine.
search_list.txt
legos
getresults.php
<?php
$search_list = 'search_list.txt'; // file containing search terms
$results = 'results.txt';
$handle = fopen($search_list, 'r');
while ($line = fgets($handle)) {
    $fp = fopen($results, 'w');
    $ch = curl_init('http://www.google.com/'
        . 'search?q=' . urlencode($line));
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    unset($ch, $fp);
}
fclose($handle);
$dom = DOMDocument::loadHTML(file_get_contents($results));
echo print_r($dom, true); // EMPTY
$search_div = $dom->getElementById('search');
if (is_null($search_div)) { // ALWAYS NULL
    echo 'Search_div is null';
} else {
    echo print_r($search_div, true);
}
?>
I made some changes:
Instead of fopen/fgets, I used file().
Instead of curl, I used simple_html_dom::load_file().
<?php
require_once('lib/simple_html_dom.php'); // path from the earlier answer; adjust as needed
$search_list = 'search_list.txt'; // file containing search terms
$result_list = 'results.txt';     // file to write results to
$searching_list = file($search_list);
$html = new simple_html_dom();
foreach ($searching_list as $key => $searching_word) {
    $html->load_file('http://www.google.com/' . 'search?q=' . urlencode(trim($searching_word)));
    $search_div = $html->find("div[id='search']");
    echo $search_div[0]; // see the content of the search div
    file_put_contents($result_list, $search_div[0]);
}
?>
You can see the results with echo $search_div[0];.
It shows you the whole content of the search div.
I searched for 'asd' =) ...
Based on my results, it starts like this:
<div id="search"><div id="ires"><ol><li class="g"><h3 class="r"><b>Atrial septal defect</b> - Wikipedia, the free encyclopedia</h3><div class="s"><div class="kv" style="margin-bottom:2px"><cite>en.wikipedia.org/wiki/<b>Atrial_septal_defect</b></cite><span class="flc"> - Cached - Similar</span></div><span class="st"><b>Atrial septal defect</b> (<b>ASD</b>)
and ends like this:
</span><br></div></li><li class="g"><h3 class="r"><b>Achievement School District</b></h3><div class="s"><div class="kv" style="margin-bottom:2px"><cite>achievementschooldistrict.org/</cite><span class="flc"> - Cached</span></div><span class="st"><b>Achievement School District</b> · The <b>ASD</b> · Driving Results · Campuses · Join Our <br> Team · Enroll A Student · <b>ASD</b> News · Contact Us <b>...</b></span><br></div></li></ol></div></div>
UPDATE
This part is based on a comment by Buttle Butk.
If the first result of the Google search is all you need, you can use this code to build a link that redirects straight to it (the btnI=1 parameter triggers Google's "I'm Feeling Lucky" redirect).
<?php
$search_list = 'search_list.txt'; // file containing search terms
$result_list = 'results.txt';     // file to write results to
$order_language = "en";
$searching_list = file($search_list);
foreach ($searching_list as $key => $searching_word) {
    // btnI=1 makes Google redirect straight to the first result
    $link = 'https://www.google.com.tr/search?hl=' . $order_language . '&q=' . urlencode(trim($searching_word)) . '&btnI=1';
    echo $link;
    file_put_contents($result_list, $link);
}
?>
I searched for 'asd' =) again ...
The result:
https://www.google.com.tr/search?hl=en&q=asd&btnI=1
When I copied and pasted it into Chrome, the link redirected me to the first result of the 'asd' search:
http://www.asd-europe.org/
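If you want to resolve that redirect in PHP rather than in the browser, one option (a sketch, not part of the original answer) is to let curl follow it and read back the final URL:
// Sketch: follow Google's btnI redirect server-side and capture the final URL.
$ch = curl_init('https://www.google.com.tr/search?hl=en&q=asd&btnI=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the Location header
curl_exec($ch);
$first_result_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo $first_result_url; // e.g. http://www.asd-europe.org/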
I hope this helps.
Have a good day.
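As an aside, the DOMDocument approach from the question can also be made to work. print_r() on a DOMDocument always looks empty because its internals are not exposed as plain properties, and Google's markup triggers parser warnings unless you suppress them; querying with DOMXPath is a robust way to grab the search div. A rough sketch (assuming results.txt already holds the downloaded HTML):
// Sketch: load the saved Google results page into DOMDocument and pull out div#search.
$html = file_get_contents('results.txt'); // assumes the curl step already wrote this file

$dom = new DOMDocument();
libxml_use_internal_errors(true); // Google's HTML is not valid XML; silence the warnings
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="search"]');

if ($nodes->length > 0) {
    // saveHTML() on the node gives you the div's markup as a string
    echo $dom->saveHTML($nodes->item(0));
} else {
    echo 'No div#search found';
}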

How to parse this OFX file?

This is an original OFX file as it comes from my bank (no worries, there's nothing sensitive; I cut out the middle part with all the transactions).
Open Financial Exchange (OFX) is a data-stream format for exchanging financial information that evolved from Microsoft's Open Financial Connectivity (OFC) and Intuit's Open Exchange file formats.
Now I need to parse this. I already saw that question, but this is not a dup because I am interested in how to do it.
I am sure I could figure out some clever regexps that would do the job, but that is ugly and error-prone (the format might change, some fields may be missing, the formatting/whitespace may differ, etc.).
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
<SIGNONMSGSRSV1>
<SONRS>
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<DTSERVER>20110420000000[+1:CET]
<LANGUAGE>ENG
</SONRS>
</SIGNONMSGSRSV1>
<BANKMSGSRSV1>
<STMTTRNRS>
<TRNUID>1
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<STMTRS>
<CURDEF>EUR
<BANKACCTFROM>
<BANKID>20404
<ACCTID>02608983629
<ACCTTYPE>CHECKING
</BANKACCTFROM>
<BANKTRANLIST>
<DTSTART>20110207
<DTEND>20110419
<STMTTRN>
<TRNTYPE>XFER
<DTPOSTED>20110205000000[+1:CET]
<TRNAMT>-6.12
<FITID>C74BD430D5FF2521
<NAME>unbekannt
<MEMO>BILLA DANKT 1265P K2 05.02.UM 17.49
</STMTTRN>
<STMTTRN>
<TRNTYPE>XFER
<DTPOSTED>20110207000000[+1:CET]
<TRNAMT>-10.00
<FITID>C74BE0F90A657901
<NAME>unbekannt
<MEMO>AUTOMAT 13177 KARTE2 07.02.UM 10:22
</STMTTRN>
............................. goes on like this ........................
<STMTTRN>
<TRNTYPE>XFER
<DTPOSTED>20110418000000[+1:CET]
<TRNAMT>-9.45
<FITID>C7A5071492D14D29
<NAME>unbekannt
<MEMO>HOFER DANKT 0408P K2 18.04.UM 18.47
</STMTTRN>
</BANKTRANLIST>
<LEDGERBAL>
<BALAMT>1992.29
<DTASOF>20110420000000[+1:CET]
</LEDGERBAL>
</STMTRS>
</STMTTRNRS>
</BANKMSGSRSV1>
</OFX>
I currently use this code, which gives me the desired result:
<?php
$files = array();
$files[] = '***_2011001.ofx';
$files[] = '***_2011002.ofx';
$files[] = '***_2011003.ofx';
system('touch file.csv && chmod 777 file.csv');
$fp = fopen('file.csv', 'w');
foreach ($files as $file) {
    echo $file . "...\n";
    $content = file_get_contents($file);
    $content = str_replace("\n", "", $content);
    $content = str_replace(" ", "", $content);
    $regex = '|<STMTTRN><TRNTYPE>(.+?)<DTPOSTED>(.+?)<TRNAMT>(.+?)<FITID>(.+?)<NAME>(.+?)<MEMO>(.+?)</STMTTRN>|';
    echo preg_match_all($regex, $content, $matches, PREG_SET_ORDER) . " matches... \n";
    foreach ($matches as $match) {
        echo ".";
        array_shift($match);
        fputcsv($fp, $match);
    }
    echo "\n";
}
echo "done.\n";
fclose($fp);
This is really ugly, and if this were a valid XML file I would hate myself for it. But how can I do it better?
Your code seems fine, considering that the file isn't XML or even SGML. The only thing you could do is try to make a more generic SAX-like parser. That is, you simply go through the input stream one block at a time (where a block can be anything, e.g. a line or a fixed number of characters), then call a callback function every time you encounter an <ELEMENT>. You can even go as far as building a parser class where you can register callback functions that listen for specific elements.
It will be more generic and less "ugly" (for some definition of "ugly") but it will be more code to maintain. Nice to do and nice to have if you need to parse this file format a lot (or in a lot of different variations). If your posted code is the only place you do this then just KISS.
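A minimal sketch of what such a callback-based parser could look like (class, method, and file names are made up for illustration):
// Rough sketch of a SAX-like OFX reader: read line by line and fire a callback
// whenever a registered <TAG>value line shows up. Names are illustrative only.
class OfxTagReader
{
    private $listeners = array();

    // Register a callback for a specific tag, e.g. 'TRNAMT'.
    public function on($tag, $callback)
    {
        $this->listeners[strtoupper($tag)][] = $callback;
    }

    public function parseFile($path)
    {
        $handle = fopen($path, 'r');
        while (($line = fgets($handle)) !== false) {
            // Match lines of the form <TAG>value (OFX 1.x leaves close tags out)
            if (preg_match('/^\s*<([A-Za-z0-9.]+)>(.*)$/', $line, $m)) {
                $tag = strtoupper($m[1]);
                $value = trim($m[2]);
                if (isset($this->listeners[$tag])) {
                    foreach ($this->listeners[$tag] as $callback) {
                        $callback($value);
                    }
                }
            }
        }
        fclose($handle);
    }
}

// Usage: collect all transaction amounts from a statement file.
$amounts = array();
$reader = new OfxTagReader();
$reader->on('TRNAMT', function ($value) use (&$amounts) {
    $amounts[] = $value;
});
$reader->parseFile('statement.ofx'); // placeholder filename
print_r($amounts);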
// Load data string ($fLoc holds the path to the OFX file)
$str = file_get_contents($fLoc);
$MArr = array(); // final assembled master array

// Fetch all transactions
preg_match_all("/<STMTTRN>(.*)<\/STMTTRN>/msU", $str, $m);
if (!empty($m[1])) {
    $recArr = $m[1];
    unset($str, $m);
    // Parse each transaction record
    foreach ($recArr as $i => $str) {
        $_arr = array();
        preg_match_all("/(^\s*<(?'key'.*)>(?'val'.*)\s*$)/m", $str, $m);
        foreach ($m["key"] as $j => $key) {
            $_arr[$key] = trim($m["val"][$j]); // reassemble as key => val
        }
        array_push($MArr, $_arr);
    }
}
print_r($MArr);
function close_tags($x)
{
    // Turn <TAG>value lines into <TAG>value</TAG> so the result is valid XML
    return preg_replace('/<([A-Za-z0-9.]+)>([^<\r\n]+)/', '<\1>\2</\1>', $x);
}

$ofx = file_get_contents('myfile.ofx');
$body = '<OFX>' . explode('<OFX>', $ofx)[1]; // strip the header
$xml = close_tags($body); // make valid XML
$reader = new SimpleXMLElement($xml);

foreach ($reader->xpath('//STMTTRN') as $txn): // find and loop through all STMTTRN tags, note the double forward slash
    // get the tag contents by casting as (string) to invoke the SimpleXMLElement::__toString() method
    $trntype  = (string) $txn->TRNTYPE;
    $dtposted = (string) $txn->DTPOSTED;
    $trnamt   = (string) $txn->TRNAMT;
    $name     = (string) $txn->NAME;
    $memo     = (string) $txn->MEMO;
endforeach;
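If you want the same CSV output as the original script, a small extension of the loop above could look like this (a sketch; the field order mirrors the asker's regex):
// Sketch: write each transaction out as a CSV row, like the original script did.
$fp = fopen('file.csv', 'w');
foreach ($reader->xpath('//STMTTRN') as $txn) {
    fputcsv($fp, array(
        (string) $txn->TRNTYPE,
        (string) $txn->DTPOSTED,
        (string) $txn->TRNAMT,
        (string) $txn->FITID,
        (string) $txn->NAME,
        (string) $txn->MEMO,
    ));
}
fclose($fp);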
