Currently I'm using cURL to scrape a website. I want to reliably get the title, description and keywords.
// Parse for the title, description and keywords
if (strlen($link_html) > 0)
{
    $tags = get_meta_tags($link);
    $link_keywords = $tags['keywords'];
    $link_description = $tags['description'];
}
The only problem is that people are now using all kinds of meta tags, such as Open Graph's <meta property="og:title" content="The Rock" />. They also vary the tag case a lot: <title>, <Title>, <TITLE>, <tiTle>. It's very difficult to get these reliably.
I really need some code that will extract these values consistently: if any title, keywords or description are provided, it should find them. Right now it seems very hit and miss.
Perhaps a way to extract all titles into a titles array? Then the scraping web developer can choose the best one to record in their database. The same would apply to keywords and descriptions.
This is not a duplicate. I have searched through Stack Overflow and nowhere is there a solution that places all "title", "keywords" and "description" type tags into arrays.
Generally, get_meta_tags() should get you most of what you need; you just need to set up a series of cascading checks that sample the required field from each metadata system until one is found. For example, something like this:
function get_title($url) {
    $tags  = get_meta_tags($url);
    $props = get_meta_props($url); // not a built-in; see below
    return $tags['title'] ?: $props['og:title'] ?: /* ... */ null;
}
The above implementation is obviously not efficient (if you implement all the getters like this, you'd reload the URL for each getter), and I didn't implement get_meta_props(), which is problematic to implement correctly using pcre_* and tedious to implement using DOMDocument.
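To illustrate the tedious-but-doable DOMDocument route, here is a rough sketch of what such a get_meta_props() could look like (the function name and the array it returns are just the assumptions of the snippet above, not a standard PHP API):
function get_meta_props($url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        return array();
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world markup

    // Collect every <meta property="..." content="..."> pair, case-insensitively
    $props = array();
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = strtolower($meta->getAttribute('property'));
        if ($property !== '') {
            $props[$property] = $meta->getAttribute('content');
        }
    }
    return $props;
}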
Still, while a correct implementation is conceptually trivial, it is a lot of work, which is a classic scenario for letting an external library solve the problem. Fortunately, there is one for just that, simply called "Embed". You can find it on GitHub, or with Composer just run:
composer require embed/embed
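For reference, a minimal usage sketch; the property names below follow the current (v4-style) API of the library and may differ in older versions:
<?php
require 'vendor/autoload.php';

use Embed\Embed;

$embed = new Embed();
$info  = $embed->get('https://example.com/some-page');

$title       = $info->title;       // cascades through og:title, <title>, etc.
$description = $info->description; // og:description, meta description, ...
$keywords    = $info->keywords;    // array of keywords/tags, when present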
Related
I have a script that grabs Open Graph meta tags from a remote page. I'm using PHP to load the page into a DOM document and pulling out the tag content like this:
if($meta->getAttribute('property')=='og:image')
$og_image = $meta->getAttribute('content');
That works great for a typical page where the tags are (to my understanding) formatted correctly, like this:
<meta property="og:image" content="https://example.com/image.jpg" />
But I'm running into a site (pitchfork.com) where the OG tags are formatted this way:
<meta data-react-helmet="true" name="og:image" content="https://example.com/image.jpg"/>
Of course, my code misses these because it's looking for the property, not the name.
I don't know anything about React or react-helmet, but is this proper code? Do the tags in all react-helmet sites look this way? Should I be rewriting my code to account for this? If so, what's the best way to do it?
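For illustration, a sketch of one way to cover both attribute styles (assuming $meta comes from the same loop over the document's <meta> elements as above) would be to check the name attribute when property is empty:
$key = $meta->getAttribute('property');
if ($key === '') {
    $key = $meta->getAttribute('name'); // react-helmet style: name="og:image"
}

if ($key == 'og:image') {
    $og_image = $meta->getAttribute('content');
}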
I want my website to have a search bar; my site is built in core PHP.
Some sections of my site have static pages and some have dynamic pages.
My developer says that it is not advisable to build a customised search bar in PHP that works on the static pages, as it would be too slow, because it would do an in-folder search. According to him, a customised search bar would work well only on dynamic pages, and hence I should implement a Google API instead.
Why was I not interested in building a Google API search box? Because my developer was not able to customize its design, and the standard box did not blend with my site's design.
Any suggestions or advice on how to approach either point?
If you save the static pages in the database, the work will be easier.
It would be really slow; he is right. But with the right caching system it is possible to offset that.
The Google search bar is slightly customizable, like this: http://www.elitepvpers.com/forum/
but it really gives you some great features and a slight Google boost.
Like Jawlon Rodriguez says, you need to create a table in the database for the static content, save all the text and HTML, CSS and JS content in that table and later fetch it from there. It is better and faster.
Why don't clients trust their developers? He's absolutely right.
A dynamic search through static pages would indeed require a search through the file system, which is slow as hell (though that could be mitigated by caching).
A Google search bar is not really customizable, and that's not because "he was not able to do it", but because it's simply restricted by Google. You use their products? You get their branding. It's as simple as that.
PHP gives you the power to create dynamic websites; you just need to use it. If you store your content in a database you get the advantages of a full-text search, which is great.
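To make the database route concrete, here is a rough sketch of a MySQL full-text search from PHP. The pages table, its columns and the connection details are all hypothetical; it assumes a FULLTEXT index has been added with something like ALTER TABLE pages ADD FULLTEXT(title, content).
<?php
// Hypothetical connection details
$pdo = new PDO('mysql:host=localhost;dbname=mysite;charset=utf8mb4', 'user', 'pass');

$stmt = $pdo->prepare(
    'SELECT url, title
       FROM pages
      WHERE MATCH(title, content) AGAINST(:q IN NATURAL LANGUAGE MODE)'
);
$stmt->execute(array('q' => isset($_GET['q']) ? $_GET['q'] : ''));

// Print a simple list of matching pages
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo '<a href="' . htmlspecialchars($row['url']) . '">'
       . htmlspecialchars($row['title']) . '</a><br>';
}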
If you've got a dead simple website, consider setting up an XML file (or whatever format you prefer) with the static content, linking each entry to the URL of the actual page. That could look like this, for example:
XML File
<?xml version="1.0"?>
<pages>
<page>
<url>http://google.com</url>
<title>Some title</title>
<content>
The quick brown fox jumps over the lazy dog
</content>
</page>
<page>
<url>http://yahoo.com</url>
<title>Ohai Catz</title>
<content>
The quick Cat kitten cat cat
</content>
</page>
</pages>
Sample PHP
<?php
// String to search for
$searchString = 'Cat';

$xml = simplexml_load_file('staticpages.xml');
$pageMatches = [];

foreach ($xml->page as $page)
{
    // Check if the keyword exists in either the title or the content
    if (stristr($page->content, $searchString) !== false || stristr($page->title, $searchString) !== false)
    {
        $pageMatches[] = $page;
    }
}
?>
<?php echo $searchString ?> has been found on the following pages: <br>
<?php foreach ($pageMatches as $match): ?>
    <?php echo $match->url ?><br>
<?php endforeach ?>
Output
Cat has been found on the following pages:
http://yahoo.com
The advantages of that approach:
No database
With small tweaks you could let the contents of your site be displayed from the XML, so you would have a central source for your data
You could add additional information like tags or keywords to the XML to make the search more precise
Now the big disadvantage that instantly disqualifies this method for efficient searching:
You need to implement a search algorithm.
I simulated it with stristr(). If you want a search with usable results you have to put a lot of work and effort into it. Maybe you'll stumble upon some algorithms on the internet, but search algorithms are a science in their own right. Google isn't a multi-billion-dollar company for nothing. Keep that in mind.
I have an article formatted in HTML. It contains a whole lot of jargon words that perhaps some people wouldn't understand.
I also have a glossary of terms (a MySQL table) with definitions which would be helpful to these people.
I want to go through the HTML of my article and find instances of these glossary terms and replace them with some nice JavaScript which will show a 'tooltip' with a definition for the term.
I've nearly done this, but I'm still having some problems:
terms are being found within words (i.e. "APS" is matched inside "Perhaps")
I have to make sure that it doesn't do this to alt text, title attributes, linked text, etc., so only text that doesn't have any formatting applied, BUT it needs to work in tables and paragraphs.
Here is the code I have:
$query_glossary = "SELECT word FROM glossary_terms WHERE status = 1 ORDER BY LENGTH(word) DESC";
$result_glossary = mysql_query_run($query_glossary);
//reset mysql via seek so we don't have to do the query again
mysql_data_seek($result_glossary,0);
while ($glossary = mysql_fetch_array($result_glossary)) {
    // once done we can replace the words with a nice tip
    $glossary_word = $glossary['word'];
    $glossary_word = preg_quote($glossary_word, '/');
    $article['content'] = preg_replace_callback('/[\s]('.$glossary_word.')[\s](.*?>)/i', 'article_checkOpenTag', $article['content'], 10);
}
And here is the PHP function:
function article_checkOpenTag($matches) {
    if (strpos($matches[0], '<') === false) {
        return $matches[0];
    }
    else {
        $query_term = "SELECT word,glossary_term_id,info FROM glossary_terms WHERE word = '".escape($matches[1])."'";
        $result_term = mysql_query_run($query_term);
        $term = mysql_fetch_array($result_term);

        # CREATING A RELEVANT LINK
        $glossary_id = $term['glossary_term_id'];
        $glossary_link = SITEURL.'/glossary/term/'.string_to_url($term['word']).'-'.$term['glossary_term_id'];

        # SOME DESCRIPTION STUFF FOR THE TOOLTIP
        if (strlen($term['info']) > 400) {
            $glossary_info = substr(strip_tags($term['info']), 0, 350).' ...<br /> Read More';
        }
        else {
            $glossary_info = $term['info'];
        }

        return ' '.$term['word'].'',$glossary_info,400,1,0,1).'">'.$matches[1].'</a> '.$matches[2];
    }
}
Move the load from the server to the client. Assuming that your "dictionary of slang" changes infrequently and that you want to "add nice tooltips" to words across a lot of articles, you can export it into a .js file and add a corresponding <script> entry to your pages; it is just a static file, easily cacheable by a web browser.
Then write a client-side JS script that finds the DOM node where the "content with slang" is placed, parses out the occurrences of the words from your dictionary and wraps them with some HTML to show tooltips. Everything in JS, everything client-side.
If that method is not suitable and you're going to do the job in your PHP backend, at least consider some caching of the processed content.
I also see that you insert a description text for every "jargon word" found within the content. What if a word is very frequent across an article? You get overhead. Make those descriptions separate and put them into JS as an object. The server-side task is then only to find words which have a description and mark them with some short tag, for instance <em> (see the sketch below). Your JS script should find those <em>s, pick a description from the object (an associative array of descriptions keyed by word) and construct a tooltip dynamically on the "mouse over" event.
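A rough sketch of that server-side marking step in PHP, assuming $glossaryWords is an array of terms pulled from the database; the \b word boundaries keep "APS" from matching inside "Perhaps", and the lookahead is a crude guard against touching text inside tag attributes:
function mark_glossary_words($html, array $glossaryWords) {
    foreach ($glossaryWords as $word) {
        // \b = whole words only; (?![^<]*>) = skip matches that sit inside a tag
        $pattern = '/\b(' . preg_quote($word, '/') . ')\b(?![^<]*>)/i';
        $html = preg_replace($pattern, '<em class="glossary">$1</em>', $html);
    }
    return $html;
}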
Interestingly enough, I was NOT searching for a question like yours, but while reading it I realized that it is one I went through quite some time ago.
It was basically a system that parses a dictionary and spits out augmented HTML.
My suggestion would instead be:
Use a database if you want, but a cached, generated CSV file could be faster to use as the dictionary
Use a hook in your rendering system to run the actual content through this dictionary
Caching of the page could be useful too
I elaborated a solution on my blog (in French, sorry for that), but it basically outlines something you can actually use to do this.
I called it "ContentAbbrGenerator" and made it a MODx plugin, but the core of the plugin can be applied outside of that structure.
Anyway, you can download the zip file, get the regexes and find your own way around them.
My objective
Use one file that is read to determine the kind of HTML decoration.
Generate HTML from author-entered content whose author doesn't know about accessibility and tags (dfn and/or abbr).
Make it re-usable.
Make it i18n-izable. That is, in French we use the English definition, but assistive technology reads the English word with French pronunciation and it sounds weird. So we had to use the lang="" attribute to make it clear.
What I did
Basically, the text you provide gets more semantic.
Imagine the following dictionary:
en;abbr;HTML;Hyper Text Markup Language;es
en;abbr;abbr;Abbreviation
Then, the content entered in the CMS could be text like this:
<p>Have you ever wanted to do not hassle with HTML abbr tags but was too lazy to hand-code them all!? That is my solution :)</p>
That gets translated into:
<p>Have you ever wanted to do not hassle with <abbr title="Hyper Text Markup Language" lang="es">HTML</abbr> <abbr title="Abbreviation">abbr</abbr> tags but was too lazy to hand-code them all!? That is my solution :)</p>
All depends from one CSV file that you can generate from your database.
The conventions I used
The file /abbreviations.txt, publicly available on the server (it could be generated), is a dictionary with one definition per acronym
An implementation only has to read the file and apply it to the content BEFORE sending it to the client (see the sketch below)
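A rough sketch of such an implementation, assuming the semicolon-separated format shown above (lang;tag;term;definition, with an optional definition language as a fifth field):
function decorate_abbreviations($content, $dictionaryFile = 'abbreviations.txt') {
    $lines = file($dictionaryFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        $parts = explode(';', $line);
        if (count($parts) < 4) {
            continue; // skip malformed entries
        }
        // $lang (first field) is the language of the source content; unused here
        list($lang, $tag, $term, $definition) = $parts;
        $defLang = isset($parts[4]) ? ' lang="' . $parts[4] . '"' : '';

        // Wrap whole-word occurrences of the term in the configured tag
        $replacement = '<' . $tag . ' title="' . htmlspecialchars($definition) . '"' . $defLang . '>$0</' . $tag . '>';
        $content = preg_replace('/\b' . preg_quote($term, '/') . '\b/', $replacement, $content);
    }
    return $content;
}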
The tooltips
I strongly recommend you use the tooltip tool that even Twitter Bootstrap implements. It basically reads the title attribute of any marked-up tags you want.
Have a look there: Bootstrap from Twitter with Tooltip helper.
PS: I'm very sold on the patterns Twitter put forward with this Bootstrap project; it's worth a look!
Have you guys ever seen how FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from a page link, or a video thumb from a video-related link (like YouTube)?
Any ideas how one would copy this functionality? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does XHR requests and parses the content based on regexes or something similar... any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)
Thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
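If you go that route, here is a rough sketch of such a filter in PHP, assuming you already have a list of absolute image URLs and an arbitrary minimum size of 100x100 pixels (getimagesize() can fetch remote images when allow_url_fopen is enabled):
$thumbCandidates = array();
foreach ($imageUrls as $imgUrl) {
    $size = @getimagesize($imgUrl);
    // $size[0] is the width, $size[1] the height
    if ($size !== false && $size[0] >= 100 && $size[1] >= 100) {
        $thumbCandidates[] = $imgUrl; // skip spacers, buttons and tracking pixels
    }
}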
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and it looks like the scraping is done at server side.
class ScrapedInfo
{
    public $url;
    public $title;
    public $description;
    public $imageUrls;
}

function scrapeUrl($url)
{
    $info = new ScrapedInfo();
    $info->url = $url;
    $html = file_get_html($info->url);

    // Grab the page title
    $info->title = trim($html->find('title', 0)->plaintext);

    // Grab the page description
    foreach ($html->find('meta') as $meta)
        if ($meta->name == "description")
            $info->description = trim($meta->content);

    // Grab the image URLs
    $imgArr = array();
    foreach ($html->find('img') as $element)
    {
        $rawUrl = $element->src;

        // Turn any relative URLs into absolutes
        if (substr($rawUrl, 0, 4) != "http")
            $imgArr[] = $url.$rawUrl;
        else
            $imgArr[] = $rawUrl;
    }
    $info->imageUrls = $imgArr;

    return $info;
}
Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
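A short sketch of checking for that tag, reusing the simple HTML DOM library from the answer above; if the tag is missing, $thumbnail stays null and you can fall back to the scraped <img> list or a thumbnail service:
$html = file_get_html($url);
// <link rel="image_src" href="..."> is the author's preferred preview image
$preferred = $html->find('link[rel=image_src]', 0);
$thumbnail = $preferred ? $preferred->href : null;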
As I am developing a project like that, I can say it is not as easy as it seems: encoding issues, rendering content with JavaScript and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behaviour in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behaviour as on FB.
I would like to integrate my Tumblr feed into my website. It seems that Tumblr has an API for this, but I'm not quite sure how to use it. From what I understand, I request the page, and Tumblr returns an XML file with the contents of my blog. But how do I then turn this XML into meaningful HTML? Must I parse it with PHP, turning the relevant tags into headers and so on? I tell myself it cannot be that painful. Anyone have any insights?
There's a javascript include that does this now, available from Tumblr (you have to login to see it): http://www.tumblr.com/developers
It winds up being something like this:
<script type="text/javascript" src="http://{username}.tumblr.com/js"></script>
You can use PHPTumblr, an API wrapper written in PHP which makes retrieving posts a breeze.
If you go to http://yourblog.tumblr.com/api/read, where "yourblog" should be replaced with the name of your blog (be careful: if you host your Tumblr blog on a custom domain, like I do, use that), you'll see the XML version of your blog. It comes up really messy for me on Firefox for some reason, so I use Chrome; try a couple of different browsers, it helps to see the XML file well formed, indented and so on.
Once you're looking at the XML version of your blog, notice that each post has a bunch of data in an attribute="value" format. Here's an example from my blog:
<post id="11576453174" url="http://wamoyo.com/post/11576453174" url-with-slug="http://wamoyo.com/post/11576453174/100-year-old-marathoner-finishes-race" type="link" date-gmt="2011-10-17 18:01:27 GMT" date="Mon, 17 Oct 2011 14:01:27" unix-timestamp="1318874487" format="html" reblog-key="E2Eype7F" slug="100-year-old-marathoner-finishes-race" bookmarklet="true">
So, there are lots of ways to do this. I'll show you the one I used and drop my code at the bottom of this post so you can tailor it to your needs. Notice the type="link" part? Or the id="11576453174"? These are the values you're going to use to pull data into your PHP script.
Here's the example:
<!-- The Latest Text Post -->
<?php
echo "";
$request_url = "http://wamoyo.com/api/read?type=regular"; // get xml file
$xml = simplexml_load_file($request_url);                 // load it
$title = $xml->posts->post->{'regular-title'};            // load post title into $title
$post = $xml->posts->post->{'regular-body'};              // load post body into $post
$link = $xml->posts->post['url'];                         // load url of blog post into $link
$small_post = substr($post, 0, 350);                      // shorten post body to 350 characters

echo // spit that baby out with some stylish html
'<div class="panel" style="width:220px;margin:0 auto;text-align:left;">
    <h1 class="med georgia bold italic black">'.$title.'</h1>'
    . '<br />'
    . '<span>'.$small_post.'</span>' . '...'
    . '<br /></br><div style="text-align:right;"><a class="bold italic blu georgia" href="'.$link.'">Read More...</a></div>
</div>
<img style="position:relative;top:-6px;" src="pic/shadow.png" alt="" />
';
?>
So, this is actually fairly simple. The PHP script here places data (like the post title and post text) from the XML file into PHP variables, and then echoes out those variables along with some HTML to create a div which features a snippet from a blog post. This one features the most recent text post. Feel free to use it; just go in and change that first URL to your own blog, and then choose whatever values you want from your XML file.
For example, let's say you want not the most recent but the second most recent "photo" post. You have to change the request URL to this:
$request_url = "http://wamoyo.com/api/read?type=photo&start=1";
Or let's say you want the most recent post with a specific tag:
$request_url = "http://wamoyo.com/api/read?tagged=events";
Or let's say you want a specific post; just use the id:
$request_url = "http://wamoyo.com/api/read?id=11576453174";
So all you have to do is tack on the ? with whatever parameter and use an & if you have multiple parameters.
If you want to do something fancier, you'll need the tumblr api docs here: http://www.tumblr.com/docs/en/api/v2
Hope this was helpful!
There are two main ways to do this. First, you can parse the XML, pulling out the content from the tags you need (there are a few ways to do this, depending on whether you use a SAX or DOM parser). This is the quick and dirty solution.
You can also use an XSLT transformation to convert the XML source directly to the HTML you want. This is more involved, since you have to learn the syntax for XSLT templates, which is a bit verbose.
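A minimal sketch of the XSLT route in PHP, assuming you have written a tumblr.xsl stylesheet (the file name is just an example) that turns the feed's <post> elements into the HTML you want:
<?php
$xml = new DOMDocument();
$xml->load('http://yourblog.tumblr.com/api/read'); // requires allow_url_fopen

$xsl = new DOMDocument();
$xsl->load('tumblr.xsl');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

echo $proc->transformToXML($xml); // emits the HTML produced by the stylesheet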