Extracting site title from URL - php

I'm trying to find a way to extract a site title from a URL entered into a field in PHP. For example, if the user were to enter the URL http://www.nytimes.com/2009/11/05/sports/baseball/05series.html, I would want "New York Times" or "NY Times" or something along those lines.
I know it's fairly easy to extract the title of the WINDOW... for example, the URL I linked would have the title "Yankees 7, Phillies 3 - Back on Top....", but this is exactly what I don't want.
For clarification, this is for adding sources to a quote. I want to be able to add a source to quotes without a huge page URL and not just a link that says "Source".
Can anyone help me with this? Thanks in advance.

$source = parse_url('http://www.nytimes.com/....', PHP_URL_HOST); // www.nytimes.com

There is no such thing as a "site title". You can get:
- the domain name (and then the owner name)
- the page's title
That page does have a meta tag "cre" with the value "The New York Times", but you won't find that everywhere.
One thing you can do: extract the domain name from the URL, and then take the title of the site's front page.
"http://www.nytimes.com/" will give you "The New York Times - Breaking News, World News & Multimedia"

Build a list of URL prefixes to site names, and check for each prefix in turn from longest to shortest.
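A small sketch of that idea; the map entries here are invented examples you would maintain yourself:
<?php
// Hand-maintained map of URL prefixes to display names (example entries).
$prefixes = array(
    'http://www.nytimes.com/pages/sports' => 'NY Times Sports',
    'http://www.nytimes.com'              => 'NY Times',
);

// Sort the keys longest-first so the most specific prefix wins.
uksort($prefixes, function ($a, $b) { return strlen($b) - strlen($a); });

function source_name($url, array $prefixes) {
    foreach ($prefixes as $prefix => $name) {
        if (strpos($url, $prefix) === 0) {
            return $name;
        }
    }
    return parse_url($url, PHP_URL_HOST); // fallback: just the host
}

echo source_name('http://www.nytimes.com/2009/11/05/sports/baseball/05series.html', $prefixes);
// "NY Times"
?>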

You'd surely need a lookup table mapping domains (nytimes.com) to your titles ("NY Times"), in which case it would be easy to do.
If you want a method that will work on any link from any domain, it is a bit harder, as PHP by itself has no way to work out a uniform title; it will vary from site to site.
You can explode the URL easily enough, but how would you then dissect "nytimes" into "NY" and "Times"?
You may be able to find a web service that allows you to feed in a domain and get back a site title, but I do not know of one.
You are best off simply quoting the domain, trimmed, like "NYTIMES.COM" or "NYTIMES", as the source.
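For the trimmed-domain option, something like this would do (the helper name is my own):
<?php
// Derive a plain-text source label from any URL: take the host,
// drop a leading "www.", and uppercase the rest.
function source_label($url) {
    $host = parse_url($url, PHP_URL_HOST);
    $host = preg_replace('/^www\./i', '', $host);
    return strtoupper($host);
}

echo source_label('http://www.nytimes.com/2009/11/05/sports/baseball/05series.html');
// "NYTIMES.COM"
?>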

You would want to use file_get_contents(), then run a match on the text between the <title></title> tags; that is the title you would display.
Using parse_url wouldn't return the actual page title.
Something like:
<?php
$x = file_get_contents("http://google.com");
// "si" flags: case-insensitive, and let "." match across newlines
if (preg_match("/<title>(.+?)<\/title>/si", $x, $match)) {
    echo $match[1];
}
?>

Use the Simple HTML DOM Parser. Here is an example:
require "simple_html_dom.php";
$url = "http://www.google.com";
$html = file_get_html( $url );
list( $title ) = $html->find( 'title' );
echo strip_tags( $title ); // Output: "Google"

Related

Dealing with online newspaper headline link in PHP

I have seen on most online newspaper websites that when I click on a headline link, e.g. "two thieves caught red handed", it normally opens a URL like this: www.example.co.uk/news/two-thieves-caught-red-handed.
How do I deal with this URL in PHP code, so that I pick out only the last part of the URL, e.g. two-thieves-caught-red-handed? After that I want to work with this string.
I know how to deal with GET parameters like "www.example.co.uk/news/headline=two thieves caught red handed".
But I do not want to do it that way. Could you show me another way?
You can use a combination of the explode and end functions for that.
For example:
<?php
$url = "www.example.co.uk/news/two-thieves-caught-red-handed";
$parts = explode('/', $url); // split the URL on "/"
$end = end($parts);          // take the last segment
echo $end;
?>
The code will output
two-thieves-caught-red-handed
You have several options in PHP to get the current URL; for a detailed overview look here.
One would be to use $_SERVER['REQUEST_URI'] and then use a string manipulation function to extract the part you need, as in the sketch below.
Maybe this thread will help you too.
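For example, a short sketch combining parse_url() and basename() (no validation shown):
<?php
// e.g. for a request to /news/two-thieves-caught-red-handed
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$slug = basename($path); // "two-thieves-caught-red-handed"
?>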

Preg_Replace Change URL

I am trying to grab content from another one of my sites, which is working fine, apart from the fact that all the links come out incorrect.
include_once('../simple_html_dom.php');
$page = file_get_html('http://www.website.com');
$ret = $page->find('div[id=header]');
echo $ret[0];
Is there any way, instead of the links showing as relative, to have the full link, using preg_replace?
$ret[0] = preg_replace('#(http://([\w-.]+)+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?)#',
'http://fullwebsitellink.com$1', $ret[0]);
I guess it would be something like the above, but I don't understand it.
Thanks
Your question doesn't really explain what is "incorrect" about the links, but I'm guessing you have something like this:
<div id="header">Home | Sitemap</div>
and you want to embed it in another site, where those links need to be fully-qualified with a domain name, like this:
<div id="header">Home | Sitemap</div>
Assuming this is the case, the replacement you want is so simple you don't even need a regex: find all href attributes beginning "/", and add the domain part (I'll use "http://example.com") to their beginning to make them absolute:
$scraped_html = str_replace('href="/', 'href="http://example.com/', $scraped_html);

Replace Specific Full Links Between href=" " Using PHP

I have tried searching through related answers but can't quite find something that is suitable for my specific needs. I have quite a few affiliate links within thousands of articles on one of my WordPress sites, all of which start with the same URL format and subdomain structure:
http://affiliateprogram.affiliates.com/
However, after that initial URL, the appended query string changes for each individual link in order to send visitors to specific pages on the destination site.
I am looking for something that will scan a string of HTML code (the article body) for all href links that include the specific domain above and then replace THE WHOLE LINK (whatever the appended query string) with another standard link of my choice.
href="http://affiliateprogram.affiliates.com/?random=query_string&page=destination"
gets replaced with
href="http://www.mylink.com"
I would ideally like to do this via php as I have a basic grasp, but if you have any other suggestions I would appreciate all input.
Thanks in advance.
<?php
$html = 'href="http://affiliateprogram.affiliates.com/?random=query_string&page=destination"';
echo preg_replace('#http://affiliateprogram\.affiliates\.com/[^"]+#i', 'http://www.mylink.com', $html);
?>
http://ideone.com/qaEEM
Use a regular expression such as:
href="(https?:\/\/affiliateprogram\.affiliates\.com\/[^"]*)"
$data = <<<EOT
bar
foo
<a name="zz" href="http://affiliateprogram.affiliates.com/?query=random&page=destination&string">baz</a>
EOT;
echo preg_replace(
    '#href="(https?://affiliateprogram\.affiliates\.com/[^"]*)"#i',
    'href="http://www.mylink.com"',
    $data
);
output
bar
foo
<a name="zz" href="http://www.mylink.com">baz</a>
$a = '<a class="***" href="http://affiliateprogram.affiliates.com/?random=query_string&page=destination" attr="***">';
$b = preg_replace("/<a([^>]*)href=\"http:\/\/affiliateprogram\.affiliates\.com\/[^\"]*\"([^>]*)>/", "<a\\1href=\"http://www.mylink.com/\"\\2>", $a);
var_dump($b); // <a class="***" href="http://www.mylink.com/" attr="***">
That's quite simple, as you only need a single placeholder for the query string. .*? would normally do, but you can make it more specific by matching anything that's not a double quote:
$html =
preg_replace('~ href="http://affiliateprogram\.affiliates\.com/[^"]*"~i',
' href="http://www.mylink.com"', $html);
People will probably come around and recommend a long-winded DOMDocument approach, but that's likely overkill for such a task.

Get URL of specific link in PHP

I have a file that contains a bunch of links:
<a href="http://site1.com">site 1</a>
<a href="http://site2.com">site 2</a>
<a href="http://site3.com">site 3</a>
I want to get the URL to a link with specific text. For example, search for "site 2" and get back "http://site2.com"
I tried this:
preg_match("/.*?[Hh][Rr][Ee][Ff]=\"(.*?)\">site 2<\/[Aa]>.*/", $contents, $match)
(I know the HREF= will be the last part of the anchor)
But it returns
http://site1.com">site 1</a><a href="http://site2.com
Is there a way to do a search backwards, or something? I know I can do preg_match_all and loop over everything, but I'm trying to avoid that.
Try this:
preg_match("(<a.*?href=[\"']([^\"']+)[\"'][^>]?>site 2</a>)i",$contents,$match);
$result = $match[1];
Hope this helps!
Or you can try using phpQuery.
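phpQuery isn't shown here, but the same lookup can also be done with PHP's built-in DOMDocument and XPath; a sketch (warnings suppressed, since real-world HTML is rarely valid):
<?php
$doc = new DOMDocument();
@$doc->loadHTML($contents); // @ hides warnings from sloppy markup
$xpath = new DOMXPath($doc);
// First <a> whose text is exactly "site 2"; read its href.
$links = $xpath->query('//a[normalize-space(.)="site 2"]');
if ($links->length > 0) {
    echo $links->item(0)->getAttribute('href'); // "http://site2.com"
}
?>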

PHP Summarize any URL

How can I, in PHP, get a summary of any URL? By summary, I mean something similar to the URL descriptions in Google web search results.
Is this possible? Is there already some kind of tool I can plug in to so I don't have to generate my own summaries?
I don't want to use metadata descriptions if possible.
-Dylan
What displays in Google is (generally) the META description tag. If you don't want to use that, you could use the page title instead though.
If you don't want to use metadata descriptions (btw, this is exactly what they are for), you have a lot of research and work to do. Essentially, you have to guess which part of the page is content and which is just navigation/fluff. Google does exactly that; note, however, that extracting valuable information from useless fluff is their #1 competency, and they've been researching and improving it for a decade.
You can, of course, make an educated guess (e.g. "look for an element with ID or class maincontent" and get the first paragraph from it), and maybe it will be OK. The real question is: how good do you want the results to be? (Facebook has something similar for linking to websites; sometimes its summary insists that an ad is the main content.)
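A rough sketch of that educated guess; the "maincontent" id is an assumption about the target page, so expect to adjust it per site:
<?php
$html = file_get_contents('http://www.example.com/');
$doc  = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from invalid HTML
$xpath = new DOMXPath($doc);
// Guess: the first paragraph inside an element with id "maincontent".
$nodes = $xpath->query('//*[@id="maincontent"]//p');
if ($nodes->length > 0) {
    $summary = trim($nodes->item(0)->textContent);
    echo substr($summary, 0, 200) . '...'; // crude truncation
}
?>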
The following will allow you to parse the contents of a page's title tag. Note: PHP must be configured to allow file_get_contents() to retrieve URLs; otherwise you'll have to use cURL to retrieve the page HTML.
$title_open = '<title>';
$title_close = '</title>';
$page = file_get_contents( 'http://www.domain.com' );
$n = stripos( $page, $title_open ) + strlen( $title_open );
$m = stripos( $page, $title_close );
$title = substr( $page, $n, $m - $n );
While I hate promoting a service, I have found this:
embed.ly
It has an API that returns JSON with all the data you need.
But I am still searching for a free/open-source library to do the same thing.
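For reference, a call might look roughly like this; the endpoint and parameters are my assumption based on embed.ly's oEmbed-style API, so verify against their current docs before relying on it:
<?php
// Assumed endpoint/params -- check embed.ly's documentation before use.
$target = urlencode('http://www.nytimes.com/2009/11/05/sports/baseball/05series.html');
$json   = file_get_contents("http://api.embed.ly/1/oembed?url={$target}&key=YOUR_API_KEY");
$data   = json_decode($json, true);
echo $data['title'];
?>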
