Function to scrape page keywords, description and title? - php

I wrote three simple functions to scrape the title, description and keywords of a simple HTML page.
This is the first function, which scrapes the title:
function getPageTitle ($url)
{
$content = $url;
if (eregi("<title>(.*)</title>", $content, $array)) {
$title = $array[1];
return $title;
}
}
It works fine.
These are the two functions to scrape the description and keywords, and they do not work:
function getPageKeywords($url)
{
$content = $url;
if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) {
$keywords = $array[1];
return $keywords;
}
}
function getPageDesc($url)
{
$content = $url;
if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+description[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) {
$desc = $array[1];
return $desc;
}
}
I know there may be something wrong with the preg_match line, but I really don't know what.
I have tried many things, but nothing works.

Why not use get_meta_tags? See the PHP documentation, which gives this example:
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
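NOTE: The parameter can be either a URL or the path to a local file. It is treated as a filename, so a raw HTML string has to be written to a file (or wrapped in a stream) first.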
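To try get_meta_tags without a network connection, the page can be a local file. The following is a minimal sketch under that assumption: the sample HTML and the temporary file are made up for illustration, and the isset() fallbacks cover pages that omit a tag.

```php
<?php
// Hypothetical sample page, written to a temporary file so the sketch runs offline.
$html = <<<HTML
<html>
<head>
<title>Sample</title>
<meta name="keywords" content="php, scraping">
<meta name="description" content="A sample page">
<meta name="geo.position" content="49.33;-86.59">
</head>
<body></body>
</html>
HTML;

$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

// get_meta_tags() reads from a path or URL and returns name => content pairs.
$tags = get_meta_tags($file);
unlink($file);

// A missing tag simply has no key, so fall back to an empty string.
$keywords    = isset($tags['keywords']) ? $tags['keywords'] : '';
$description = isset($tags['description']) ? $tags['description'] : '';

echo $keywords, "\n";
echo $description, "\n";
```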

It's better to use PHP's native DOMDocument to parse HTML than a regex. Keep in mind that these days a lot of sites no longer include the keywords and description tags, so you can't rely on them always being there. Here is how you can do it with DOMDocument:
<?php
$source = file_get_contents('http://php.net');
$dom = new DOMDocument("1.0","UTF-8");
$dom->preserveWhiteSpace = false;
// @ suppresses the warnings that malformed real-world HTML triggers
@$dom->loadHTML($source);
//Get Title
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
$description = '';
$keywords = '';
foreach($dom->getElementsByTagName('meta') as $metas) {
if($metas->getAttribute('name') =='description'){ $description = $metas->getAttribute('content'); }
if($metas->getAttribute('name') =='keywords'){ $keywords = $metas->getAttribute('content'); }
}
print_r($title);
print_r($description);
print_r($keywords);
?>
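Since the tags may be absent, it can help to wrap the lookup in a function that returns empty strings instead of failing. The following is only a sketch of that idea: `extractPageInfo` and the sample markup are hypothetical names introduced here, and it uses libxml_use_internal_errors() rather than the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: extract title, description and keywords from an HTML string.
function extractPageInfo($html)
{
    $info = array('title' => '', 'description' => '', 'keywords' => '');

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Only read the title if the page actually has one.
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // Compare names case-insensitively: name="Description" is common.
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $info[$name] = $meta->getAttribute('content');
        }
    }
    return $info;
}

// Sample page with no keywords tag at all.
$sample = '<html><head><title>Example</title>'
        . '<meta name="Description" content="A demo page"></head>'
        . '<body><p>Hello</p></body></html>';
print_r(extractPageInfo($sample));
```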

Related

PHP: Test for string of text inside html tags from file_get_contents string

I need to perform a series of tests on a URL. The first test is a word count; I have that working perfectly, and the code is below:
if (isset($_GET['article_url'])){
$title = 'This is an example title';
$str = @file_get_contents($_GET['article_url']);
$test1 = str_word_count(strip_tags(strtolower($str)));
if($test1 === FALSE) { $test1 = 0; }
if ($test1 > '550') {
echo '<div><i class="fa fa-check-square-o" style="color:green"></i> This article has '.$test1.' words.';
} else {
echo '<div><i class="fa fa-times-circle-o" style="color:red"></i> This article has '.$test1.' words. You are required to have a minimum of 500 words.';
}
}
Next I need to get all h1 and h2 tags from $str and test them to see if any contain the text $title, echoing yes if so and no if not. I am not really sure how to go about this.
I am looking for a pure PHP way of doing this, without installing PHP libraries or third-party functions.
Please try the code below.
if (isset($_GET['article_url'])){
$title = 'This is an example title';
$str = @file_get_contents($_GET['article_url']);
$document = new DOMDocument();
$document->loadHTML($str);
$tags = array ('h1', 'h2');
$texts = array ();
foreach($tags as $tag)
{
//Fetch all the tags with text from the dom matched with passed tags
$elementList = $document->getElementsByTagName($tag);
foreach($elementList as $element)
{
//Store text in array from dom for tags
$texts[] = strtolower($element->textContent);
}
}
//Check passed title is inside texts array or not using php
if(in_array(strtolower($title),$texts)){
echo "yes";
}else{
echo "no";
}
}
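Note that in_array only reports a heading whose whole text equals the title. If a heading merely needs to contain the title, a substring test is closer to what the question describes. A minimal sketch, assuming a hypothetical helper name `headingContainsTitle` and made-up sample markup:

```php
<?php
// Hypothetical helper: true if any <h1>/<h2> in $html contains $title (case-insensitive).
function headingContainsTitle($html, $title)
{
    $document = new DOMDocument();
    // Keep warnings about imperfect HTML out of the output.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (array('h1', 'h2') as $tag) {
        foreach ($document->getElementsByTagName($tag) as $element) {
            // stripos() finds the title anywhere inside the heading text.
            if (stripos($element->textContent, $title) !== false) {
                return true;
            }
        }
    }
    return false;
}

$sample = '<html><body><h1>Intro: This is an example title (2015)</h1>'
        . '<h2>Other</h2></body></html>';
echo headingContainsTitle($sample, 'This is an example title') ? 'yes' : 'no';
```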

Simple HTML Dom PHP RECURSION Error in return value

I am using Simple HTML Dom, trying to get strings from a website. When I print out $title[0] within the function it shows just one string, but when I save it in the return array and print out the return value, I receive a never-ending text with RECURSION.
I don't understand why it works with the second variable, $oTitle.
<?php
include 'scripts/simple_html_dom.php';
function getDetails($id) {
$url = "http://www.something.com";
$html = file_get_html ( $url );
$title = $html->find('span[itemprop=name]');
print_r($title[0] . PHP_EOL); //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title[0], "Original Title" => $oTitle);
return $details;
flush ();
}
$values = getDetails($number);
print_r($values); //code breaks here
?>
Take a look at this page: http://simplehtmldom.sourceforge.net/
As I can see, you're using this parser.
In order to get HTML content you should use something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
In order to dump the content, you should use something like this:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
Try this code:
<?php
include 'simple_html_dom.php';
function getDetails() {
$url = "http://www.godaddy.com";
$html = file_get_html ( $url );
$title = getTitle($url);
echo $title; //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title, "Original Title" => $oTitle);
return $details;
flush ();
}
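function getTitle($Url){
$str = file_get_contents($Url);
// Return the text between the <title> tags, or an empty string if there is none
if(strlen($str)>0 && preg_match("/<title>(.*?)<\/title>/is",$str,$title)){
return trim($title[1]);
}
return '';
}
$values = getDetails();
print_r($values); //prints a plain array of strings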
?>
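A plausible explanation for the RECURSION output, assuming $title[0] is a parser object that holds references back to its parent document: concatenating it with a string converts it to text, while print_r on the returned array walks the whole object graph. The sketch below reproduces that with a made-up `DemoNode` class, not the real Simple HTML DOM classes. With the real library the equivalent fix would presumably be to store a string such as `$title[0]->plaintext` in the array.

```php
<?php
// Hypothetical stand-in for a parser node: it keeps a reference to its
// parent, and the parent keeps a reference back to it.
class DemoNode
{
    public $parent;
    public $children = array();
    public $text;

    public function __construct($text, $parent = null)
    {
        $this->text = $text;
        $this->parent = $parent;
        if ($parent !== null) {
            $parent->children[] = $this;
        }
    }

    public function __toString()
    {
        return $this->text;
    }
}

$root  = new DemoNode('root');
$title = new DemoNode('My Title', $root);

// Concatenation calls __toString(), so this prints only the text.
echo $title . PHP_EOL;

// print_r() on the object walks every property and marks the cycle.
$dump = print_r(array('Title' => $title), true);
echo (strpos($dump, '*RECURSION*') !== false) ? "cycle found\n" : "no cycle\n";

// Casting to string before storing keeps the array small and printable.
$details = array('Title' => (string) $title, 'Original Title' => 'Something');
print_r($details);
```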

PHP find keywords in XML feed - preg_match_all

I'm loading an RSS feed as XML and then trying to search it for keywords. When a keyword is found, I need to email myself a notification with the keywords found and the matching content.
Here is an example of what I have:
$keywords = "iphone|ipad|ipod";
$rss = simplexml_load_file("http://myfeed.com/news?format=xml");
foreach ($rss->entry as $i)
{
//need to check the title tag for the keywords
//email myself a notification with the keywords found
}
UPDATE
OK, I have the following, which partly works:
function getPartsWithRegex($text, $regex){
preg_match_all($regex, $text, $array);
return $array[0];
}
$keywords = "LLC|socks";
$rss = simplexml_load_file("http://feeds.myfeed.com/iphoneapps-news?format=xml");
$text = "Downcast - Jamawkinaw Enterprises LLC";
$regex = "/$keywords/i";
foreach ($rss->entry as $i)
{
$title = (string)$i->title;
$productNumbers = getPartsWithRegex($text, $regex);
}
if(count($productNumbers) > 0){
foreach($productNumbers as $productNumber){
echo $productNumber . '<br />';
}
}
else {
echo "Nothing found.";
}
This works when I pass $text, as in $productNumbers = getPartsWithRegex($text, $regex); However, when I pass $title, which holds the value of $i->title, it doesn't work, even though echoing $title to the browser renders: Downcast - Jamawkinaw Enterprises LLC
So to clarify: when I manually enter and pass a string, it works correctly. When I pass the value read from the XML, it doesn't. I've tried converting it to a string, but that doesn't help.
Any ideas why?
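One thing visible in the code above is that $productNumbers is reassigned on every pass of the loop, so only the result for the last entry survives to the check after the loop. The sketch below keeps the matches per entry instead. It is only an illustration under assumptions: the inline feed is invented so the code runs offline, and the real feed's structure may differ.

```php
<?php
// Hypothetical inline feed standing in for the remote one.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry><title>Downcast - Jamawkinaw Enterprises LLC</title></entry>
  <entry><title>Weather Pro - Some Studio</title></entry>
  <entry><title>Warm Socks Tracker - Example LLC</title></entry>
</feed>
XML;

$keywords = "LLC|socks";
$regex = "/$keywords/i";
$rss = simplexml_load_string($xml);

$found = array();
foreach ($rss->entry as $entry) {
    $title = (string) $entry->title;
    // Test each entry's own title, and keep the result per entry
    // instead of overwriting one variable on every pass.
    if (preg_match_all($regex, $title, $matches) > 0) {
        $found[$title] = $matches[0];
    }
}

if (count($found) > 0) {
    foreach ($found as $title => $hits) {
        echo $title . ': ' . implode(', ', $hits) . "\n";
    }
} else {
    echo "Nothing found.\n";
}
```

The $found array could then be turned into the body of the notification email.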

Display markdown images as links

I'm using Michel Fortin's PHP Markdown for Markdown conversion, but I want to show images as links instead of inline, because anyone can insert a 5 MB JPG from Imgur and slow down the page.
How can I change images into links to the image?
An example of an override would look something like the following:
class CustomMarkdown extends Markdown
{
function _doImages_reference_callback($matches) {
$whole_match = $matches[1];
$alt_text = $matches[2];
$link_id = strtolower($matches[3]);
if ($link_id == "") {
$link_id = strtolower($alt_text); # for shortcut links like ![this][].
}
$alt_text = $this->encodeAttribute($alt_text);
if (isset($this->urls[$link_id])) {
$url = $this->encodeAttribute($this->urls[$link_id]);
$result = "<a href=\"$url\">$alt_text</a>";
} else {
# If there's no such link ID, leave intact:
$result = $whole_match;
}
return $result;
}
function _doImages_inline_callback($matches) {
$whole_match = $matches[1];
$alt_text = $matches[2];
$url = $matches[3] == '' ? $matches[4] : $matches[3];
$title =& $matches[7];
$alt_text = $this->encodeAttribute($alt_text);
$url = $this->encodeAttribute($url);
return "<a href=\"$url\">$alt_text</a>";
}
}
Demo: http://codepad.viper-7.com/VVa2hP
You should have a look at Parsedown. It is a more recent and, I believe, easier to extend implementation of Markdown.

Retrieving description and keywords meta tags in php

I was wondering: which is the fastest method or code to get meta tags?
I have this code, but using the get_meta_tags function slows down the process. Any ideas?
$tags = get_meta_tags('http://www.example.com/');
echo $tags['keywords']; // keywords
echo $tags['description']; //description
The likely reason is that the page has to be fetched and parsed before PHP can extract the meta tags. It is probably best to fetch the HTML yourself and use a regex to parse it.
function get_meta_data($page) {
$meta_data = array();
// Match <meta name="..." content="..."> tags (name must come before content)
preg_match_all(
'/<meta[^>]+name="([^"]*)"[^>]+content="([^"]*)"[^>]*>/i',
$page,
$result,
PREG_PATTERN_ORDER);
$total_found = count($result[1]);
for ($i = 0; $i < $total_found; $i++) {
$name = strtolower($result[1][$i]);
// Keep only the two tags we are interested in
if ($name == "keywords" || $name == "description") {
$meta_data[$name] = $result[2][$i];
}
}
return $meta_data;
}
Hope that helps.
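If download time is the bottleneck, another option is to read only the beginning of the page, since meta tags normally sit in the head. This is a sketch of that idea, not a benchmarked solution: `getMetaFromHead` and the 8192-byte limit are assumptions made for illustration, and the regex expects name before content, both in double quotes. The demonstration uses a temporary file in place of a real URL.

```php
<?php
// Hypothetical helper: read at most $maxBytes from a URL or file and pull
// name/content pairs out of whatever <meta> tags fall inside that chunk.
function getMetaFromHead($source, $maxBytes = 8192)
{
    // The fifth argument caps how many bytes file_get_contents() reads.
    $chunk = @file_get_contents($source, false, null, 0, $maxBytes);
    if ($chunk === false) {
        return array();
    }

    $meta = array();
    // Assumes name="..." appears before content="..." with double quotes.
    $pattern = '/<meta[^>]+name="([^"]*)"[^>]+content="([^"]*)"[^>]*>/i';
    if (preg_match_all($pattern, $chunk, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $meta[strtolower($match[1])] = $match[2];
        }
    }
    return $meta;
}

// Offline demonstration: a large page whose meta tags are near the top.
$file = tempnam(sys_get_temp_dir(), 'head');
file_put_contents(
    $file,
    '<html><head><meta name="Keywords" content="php, meta">'
    . '<meta name="description" content="Demo"></head>'
    . '<body>' . str_repeat('<p>filler</p>', 5000) . '</body></html>'
);
$meta = getMetaFromHead($file);
unlink($file);
print_r($meta);
```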
