Getting actual links from Google News RSS with PHP

I want to parse Google News RSS with PHP, to get actual links of the content.
Google News RSS item link looks like this:
http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGkF58EwDE7aA742GfVP9aE8azmhg&url=http://www.reuters.com/article/2012/01/15/us-obama-mlk-idUSTRE80E0PD20120115
I need just the actual link, everything after &url= :
http://www.reuters.com/article/2012/01/15/us-obama-mlk-idUSTRE80E0PD20120115
And how would one go about eliminating the "non-essential" part of the URL, in essence targeting everything starting with http://news.google.com and ending with &url= ?
http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGkF58EwDE7aA742GfVP9aE8azmhg&url=
I do a little regex, but this is out of my reach...
Thanks, fellas!

Regex is not necessarily the best approach here.
$query = parse_url($google_url, PHP_URL_QUERY);
parse_str($query, $parts);
$url = $parts['url'];
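A slightly fuller sketch of the same idea, in case the url parameter might be missing. The helper name extract_target_url and the null fallback are assumptions made for illustration, not part of the answer above:

```php
<?php
// Hypothetical helper: returns the value of the "url" query parameter,
// or null when the link has no query string or no such parameter.
function extract_target_url(string $google_url): ?string
{
    $query = parse_url($google_url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or the URL could not be parsed
    }
    parse_str($query, $parts);
    return isset($parts['url']) && is_string($parts['url']) ? $parts['url'] : null;
}

$google_url = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGkF58EwDE7aA742GfVP9aE8azmhg&url=http://www.reuters.com/article/2012/01/15/us-obama-mlk-idUSTRE80E0PD20120115';
echo extract_target_url($google_url), "\n";
```

Note that parse_str splits on &, so if the target link carried its own unencoded & parameters, everything after the first & would be cut off. The regex answer has the same limitation.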

Here ya go:
$google_url = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGkF58EwDE7aA742GfVP9aE8azmhg&url=http://www.reuters.com/article/2012/01/15/us-obama-mlk-idUSTRE80E0PD20120115';
preg_match('/&url=([^&]+)/', $google_url, $matches);
$url = $matches[1];
echo $url;

Related

How to delete tracking code from links in PHP

Hi I have a form in WordPress where users can submit a link to a product, but very often the links come with unnecessary baggage, like tracking codes. I would like to create a filter in WordPress and clean the links so they consist of just a working link. I would like to if possible confirm that the link still works or a method that will guarantee that the link will still work.
The main things I want to get rid of in links are utm_source and its contents, utm_medium and its contents, etc. Everything but the clean working link.
So for example, a link like this:
https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055&pdp=true&source=detail&utm_source=affiliate&utm_medium=affiliate&utm_campaign=pjdatafeed&publisherId=20648&clickId=2669312134#fo_c=745&fo_k=c0ebaf8359ca7853df8343e535533280&fo_s=pepperjam
Will end up like this:
https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055
I'd really appreciate if someone can lead me in the right direction.
Thanks!
You can do what you want with explode, parse_str and http_build_query. This code uses an array of unwanted parameters to decide what to delete from the query string:
$unwanted_params = array('utm_source', 'utm_medium', 'utm_campaign', 'clickId', 'publisherId', 'source', 'pdp', 'details', 'fo_k', 'fo_s');
$url = 'https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055&pdp=true&source=detail&utm_source=affiliate&utm_medium=affiliate&utm_campaign=pjdatafeed&publisherId=20648&clickId=2669312134#fo_c=745&fo_k=c0ebaf8359ca7853df8343e535533280&fo_s=pepperjam';
list($path, $query_string) = explode('?', $url, 2);
// parse the query string
parse_str($query_string, $params);
// delete unwanted parameters
foreach ($unwanted_params as $p) unset($params[$p]);
// rebuild the query
$query_string = http_build_query($params);
// reassemble the URL
$url = $path . '?' . $query_string;
echo $url;
Output:
https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055
You can do this in PHP itself. There is a function called parse_url() (https://secure.php.net/manual/en/function.parse-url.php) which splits a URL into its components (scheme, host, path, query, fragment). Pass the query component to parse_str() to get the parameters as an array, then filter out the unwanted ones. Finally, use http_build_query() (https://secure.php.net/manual/en/function.http-build-query.php) to build the query string again and reassemble the URL :)
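A minimal sketch of that parse_url() route, assuming the fragment (everything after #) should be dropped along with the tracking parameters. The helper name strip_tracking_params is hypothetical, and the sketch ignores any user:password part of the URL:

```php
<?php
// Hypothetical helper: removes the listed query parameters and drops the fragment.
function strip_tracking_params(string $url, array $unwanted): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything we cannot parse untouched
    }

    // Turn the query string into an array and delete the unwanted keys.
    parse_str($parts['query'] ?? '', $params);
    foreach ($unwanted as $name) {
        unset($params[$name]);
    }

    // Reassemble scheme, host, optional port and path.
    $clean = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $clean .= ':' . $parts['port'];
    }
    $clean .= $parts['path'] ?? '';

    // Only append "?" when some parameters are left.
    $query = http_build_query($params);
    if ($query !== '') {
        $clean .= '?' . $query;
    }
    return $clean;
}

$unwanted = ['utm_source', 'utm_medium', 'utm_campaign', 'clickId', 'publisherId', 'source', 'pdp'];
$url = 'https://www.serenaandlily.com/variationproduct?dwvar_m10055_size=Twin&dwvar_m10055_color=Chambray&pid=m10055&pdp=true&source=detail&utm_source=affiliate&utm_medium=affiliate&utm_campaign=pjdatafeed&publisherId=20648&clickId=2669312134#fo_c=745&fo_k=c0ebaf8359ca7853df8343e535533280&fo_s=pepperjam';
echo strip_tracking_params($url, $unwanted), "\n";
```

Because parse_url() separates the fragment from the query, the fo_* values never reach parse_str() and do not need to be listed as unwanted parameters.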

Search through links and identify RSS source with Regex, PHP or Javascript

I'm building a news / blog aggregator that's focusing on the Syrian conflict, and I would like to be able to identify the source. It's a simple site, and the aggregator is an external javascript that pulls RSS from my Yahoo Pipes. My problem is that I cannot find a way to identify the source (i.e. CNN, BBC, etc)
So I figured if I scan the document and identify the href source, I would be able to do something.
Let's say that we have <a href="http://foxnews.com/blahblahblah.php">. I would like to do something like IF href == http://foxnews.com { logo(fox); }.
I'm not sure if I'm even "thinking right", but I'd really like to find my way around this problem. Any suggestions? Or is there author info in my RSS pipe that I'm missing?
http://pipes.yahoo.com/pipes/pipe.run?_id=e9fdf79f13be013e7c3a2e4a7d0f2900&_render=rss
RSS feeds are just XML, so the first thing you would do is find an XML parser for the language that you are wanting to use.
PHP has SimpleXML built in and it's fast and easy to use.
You'd use that to pull out all the links like this.
foreach ($xml->channel->item as $key => $item) {
$link = $item->link;
}
That's simple to understand: the root <rss> element (which $xml represents) contains a <channel>, and inside that we have all of the news <item> tags. So we loop through those and pull out each child <link> element.
Once I'd got that far, I realised it wouldn't take much more to do the whole thing for you. I strip the links down to just the domain by replacing http:// with an empty string and then exploding the string using / as the delimiter. This splits the string into the chunks between the slashes, so the first chunk is our domain.
<?php
$url = 'http://pipes.yahoo.com/pipes/pipe.run?_id=e9fdf79f13be013e7c3a2e4a7d0f2900&_render=rss';
$xml = simplexml_load_file($url);
foreach ($xml->channel->item as $key => $item) {
$link = $item->link;
$link = str_replace("http://", "", $link);
$parts = explode('/', $link);
$domain = $parts[0];
print($domain . "<br/>");
}
?>
This code gives me an output of:
www.ft.com
www.dailystar.com.lb
www.ft.com
www.ft.com
www.ft.com
www.ft.com
www.dailystar.com.lb
www.bbc.co.uk
....
Then it's a case of PHP switch statements to get the desired outcome for each link. Like so:
switch($domain) {
case "www.bbc.co.uk":
// Do BBC stuff
break;
case "www.dailystar.com.lb":
// Do daily star stuff
break;
default:
// Do something for domains that aren't covered above
break;
}
Good luck!
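As an alternative sketch, parse_url() with PHP_URL_HOST returns the host for both http:// and https:// links, and an array lookup can stand in for the switch. The helper name source_for_link, the labels in the table, and the sample link are all made up for illustration:

```php
<?php
// Hypothetical lookup table: host name => source label (example entries only).
$sources = [
    'www.bbc.co.uk'        => 'BBC',
    'www.dailystar.com.lb' => 'The Daily Star',
    'www.ft.com'           => 'Financial Times',
];

// Hypothetical helper: maps a link to a source label, or 'unknown'.
function source_for_link(string $link, array $sources): string
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host)) {
        return 'unknown'; // no host found in the link
    }
    // parse_url() keeps the original case, so normalise before the lookup.
    $host = strtolower($host);
    return $sources[$host] ?? 'unknown';
}

// Made-up sample link.
echo source_for_link('https://www.bbc.co.uk/news/some-article', $sources), "\n";
```

Inside the feed loop, cast the SimpleXML node to a string first, for example source_for_link((string) $item->link, $sources).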

How do I extract a link from a longer link using PHP?

I was wondering if someone knows what the best method would be to extract a link from another link. Here's an example:
If I have links in the following format:
http://www.youtube.com/watch?v=35HBFeB4jYg OR
http://it.answers.yahoo.com/question/index?qid=20080520042405AApM2Rv OR
https://www.google.it/search?q=rap+tedesco&aq=f&oq=rap+tedesco&aqs=chrome.0.57j62l2.2287&sourceid=chrome&ie=UTF-8#hl=en&sclient=psy-ab&q=migliori+programatori&oq=migliori+programatori&gs_l=serp.3..0i19j0i13i30i19l3.9986.13880.0.14127.14.10.0.4.4.0.165.931.6j4.10.0...0.0...1c.1.7.psy-ab.tPmiWRyUVXA&pbx=1&bav=on.2,or.r_cp.r_qf.&fp=ffc0e9337f73a744&biw=1280&bih=699
How would I go about extracting only the web pages like so:
http://www.youtube.com
http://it.answers.yahoo.com
https://www.google.it
I was wondering if and what regular expression I could use with PHP to achieve this, also are regular expressions the way to go?
There is a PHP function for parsing URLs: parse_url
$url = 'http://it.answers.yahoo.com/question/index?qid=20080520042405AApM2Rv';
$p = parse_url($url);
echo $p["scheme"] . "://" . $p["host"];
Use function parse_url.
$link = "https://www.google.it/search?q=rap+tedesco";
$parseUrl = parse_url($link);
$siteName = $parseUrl['scheme']."://". $parseUrl['host'];
Using Regexp.
preg_match('#http(s?)://([\w]+\.){1}([\w]+\.?)+#',$link,$matches);
echo $matches[0];
You just want the domain of the page. PHP has a function called parse_url that can help.
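Put together, the parse_url approach from these answers could look like the sketch below. The helper name site_root is hypothetical, and the third sample link is a shortened version of the one in the question:

```php
<?php
// Hypothetical helper: returns "scheme://host" for a URL, or null if either part is missing.
function site_root(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    return $parts['scheme'] . '://' . $parts['host'];
}

$links = [
    'http://www.youtube.com/watch?v=35HBFeB4jYg',
    'http://it.answers.yahoo.com/question/index?qid=20080520042405AApM2Rv',
    'https://www.google.it/search?q=rap+tedesco&aq=f#hl=en', // shortened from the question
];
foreach ($links as $link) {
    echo site_root($link), "\n";
}
```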

Return id of video from metacafe using preg_match or something

I'm trying to write a plugin that will fetch video details from various video websites simply by copying and pasting the URL from the browser.
But I'm having problems with Metacafe videos.
To fetch details for, for example, this video URL
http://www.metacafe.com/watch/cb-V9EJSvnKvTcH/confronting_fear_in_virtual_reality/
I need to request http://www.metacafe.com/api/item/cb-V9EJSvnKvTcH/ and parse the XML for this video ID.
I thought it would be easy, but the problem is that Metacafe uses only the video ID for the XML data, while the URL also contains the video name after the ID.
To embed only the video I use
$video_grab_url = 'http://www.metacafe.com/watch/cb-V9EJSvnKvTcH/confronting_fear_in_virtual_reality/';
$embed_vid_url = parse_url( $video_grab_url );
$metacafe = mb_substr( $embed_vid_url['path'], 7, -1 );
With this I get the video ID with the name, for embedding.
But as I said, for the other details I need to parse XML data from a URL, and to build that URL I need only the video ID without the name.
I'm not a PHP pro, so I got a little lost using preg_match.
How do I use preg_match to get only this "cb-V9EJSvnKvTcH" so I can pull the XML data and parse info like duration, thumb, tags, etc.?
Try this one. Maybe this can help you.
(?<=watch/).*?(?=/)
sample PHP code,
<?php
$subject = "theLINKhere";
$pattern = "#(?<=watch/).*?(?=/)#"; // edit: added the # delimiters
preg_match($pattern, $subject, $matches);
print_r($matches); // $matches[0] holds the video id
?>
Or you can also use something like this:
$url = 'http://www.metacafe.com/watch/cb-V9EJSvnKvTcH/confronting_fear_in_virtual_reality/';
$path = parse_url($url, PHP_URL_PATH);
$pieces = explode('/', $path);
$video_id = $pieces[2];
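A minimal sketch that combines both suggestions and builds the API address from the question. The helper name metacafe_video_id is hypothetical, and the API URL format is taken from the question rather than verified:

```php
<?php
// Hypothetical helper: returns the path segment that follows "/watch/", or null.
function metacafe_video_id(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path, or the URL could not be parsed
    }
    // Capture everything between "/watch/" and the next slash.
    if (preg_match('#^/watch/([^/]+)#', $path, $m) !== 1) {
        return null;
    }
    return $m[1];
}

$url = 'http://www.metacafe.com/watch/cb-V9EJSvnKvTcH/confronting_fear_in_virtual_reality/';
$id = metacafe_video_id($url);
if ($id !== null) {
    // API URL format as given in the question.
    $api_url = 'http://www.metacafe.com/api/item/' . $id . '/';
    echo $api_url, "\n";
}
```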

Extracting specific information out of string

I have a string that contains an address to a youtube video, I want to use this to display the video in a pop-up lightbox. In the current form the link will not work in the lightbox:
http://www.youtube.com/v/CD2LRROpph0?f=videos&c=TEST&app=youtube_gdata&version=3
I had an idea to extract the video id 'CD2LRROpph0' and just append that to a regular youtube url, for example
http://www.youtube.com/watch?v=CD2LRROpph0
which I know works in the lightbox.
Any ideas on how to extract this code from the string?
This one will handle different protocols and different YouTube URLs (in case YouTube comes out with country-specific TLDs, for example).
$urlTokens = parse_url($url);
$newUrl = $urlTokens['scheme'] . '://' . $urlTokens['host'] . '/watch?v=' . preg_replace('~^/v/~', '', $urlTokens['path']);
$newUrl = preg_replace('#http://www\.youtube\.com/v/([a-z0-9_\-]{11}).*$#i',
'http://www.youtube.com/watch?v=$1', $string);
Or try this, you get an array back and can use the query you want
http://www.php.net/manual/en/function.parse-url.php
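For completeness, a sketch that wraps the parse_url idea in a function and returns null when the link is not in the /v/ form. The helper name youtube_watch_url is hypothetical, and the character class used for the video id is an assumption based on the example id:

```php
<?php
// Hypothetical helper: converts ".../v/<id>?..." into ".../watch?v=<id>", or returns null.
function youtube_watch_url(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'], $parts['path'])) {
        return null;
    }
    // Assumed id alphabet: letters, digits, underscore and hyphen.
    if (preg_match('#^/v/([A-Za-z0-9_-]+)#', $parts['path'], $m) !== 1) {
        return null;
    }
    return $parts['scheme'] . '://' . $parts['host'] . '/watch?v=' . $m[1];
}

$string = 'http://www.youtube.com/v/CD2LRROpph0?f=videos&c=TEST&app=youtube_gdata&version=3';
echo youtube_watch_url($string), "\n";
```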
