It seems Google's URLs are structured differently these days. So it is harder to extract the referring keyword from them. Here is an example:
http://www.google.co.uk/search?q=jquery+post+output+46&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a#pq=jquery+post+output+46&hl=en&cp=30&gs_id=1v&xhr=t&q=jquery+post+output+php+not+running&pf=p&sclient=psy-ab&client=firefox-a&hs=8N5&rls=org.mozilla:en-US%3Aofficial&source=hp&pbx=1&oq=jquery+post+output+php+not+run&aq=0w&aqi=q-w1&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=bdeb326aa44b07c5&biw=1280&bih=875
The search I performed was actually "jquery post output php not running", so the first 'q=' does not contain the full search. The second one does. I'd like to write a script that always extracts the last 'q=', but I'm not sure if Google's URLs always have the full search last. Has anyone had any experience with this?
You can accomplish this using parse_url() and parse_str(), where $str is the referrer string. parse_str() already URL-decodes the values, so an extra urldecode() call is not needed (it would mangle a literal + or % in the search):
$fragment = parse_url($str, PHP_URL_FRAGMENT);
parse_str($fragment, $arr);
$query = $arr['q']; // jquery post output php not running
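If the referrer does not always carry a fragment, a small wrapper can fall back to the query string. This is only a sketch: extract_search_phrase() is a made-up helper name, and it assumes the fragment's q (when present) is the fuller search, which is what the example URL suggests but which Google does not guarantee.

```php
<?php
// Hypothetical helper (not part of PHP): returns the search phrase from a
// referrer, preferring the fragment (after '#') over the query string.
function extract_search_phrase(string $referrer): ?string
{
    foreach ([PHP_URL_FRAGMENT, PHP_URL_QUERY] as $component) {
        $part = parse_url($referrer, $component);
        if (!is_string($part) || $part === '') {
            continue; // this component is missing, try the next one
        }
        parse_str($part, $params); // parse_str() URL-decodes the values
        if (isset($params['q']) && is_string($params['q']) && $params['q'] !== '') {
            return $params['q'];
        }
    }
    return null; // no q parameter found anywhere
}
```

Calling it with $_SERVER['HTTP_REFERER'] would be the typical use; it returns null when no q parameter is found.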
Here is the format of the affiliate URL I have: http://tracking.vcommission.com/aff_c?offer_id=2119&&url=http%3A%2F%2Fwww.netmeds.com%2F%3Fsource_attribution%3DVC-CPS-Emails%26utm_source%3DVC-CPS-Emails%26utm_medium%3DCPS-Emails%26utm_campaign%3DEmails
As you can see, it contains two URLs:
The first URL is for vcommission.com, and
the second URL is for netmeds.com.
I have a CSV file with a lot of rows. Each row may have a different second URL, and I want to get the second URL for each row. The first URL is not static either; it differs from one CSV file to another.
How can I get the second URL?
Some basic string parsing like this should give you an idea.
$url='http://tracking.vcommission.com/aff_c?offer_id=2119&&url=http%3A%2F%2Fwww.netmeds.com%2F%3Fsource_attribution%3DVC-CPS-Emails%26utm_source%3DVC-CPS-Emails%26utm_medium%3DCPS-Emails%26utm_campaign%3DEmails';
list($u,$q)=explode('url=',urldecode($url));
$o=(object)parse_url($q);
echo $o->host;
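A variation on the same idea, as a sketch: instead of splitting on the literal text 'url=', read the url parameter out of the query string with parse_url() and parse_str(). The helper name extract_target_url() and the shortened example URL are made up for illustration, and it assumes the second URL always arrives in a parameter called url.

```php
<?php
// Hypothetical helper: pulls the nested (second) URL out of a query parameter.
function extract_target_url(string $affiliateUrl, string $param = 'url'): ?string
{
    $query = parse_url($affiliateUrl, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string at all
    }
    parse_str($query, $params); // values come back URL-decoded
    if (isset($params[$param]) && is_string($params[$param])) {
        return $params[$param];
    }
    return null;
}

// Shortened version of the URL from the question.
$url = 'http://tracking.vcommission.com/aff_c?offer_id=2119&&url=http%3A%2F%2Fwww.netmeds.com%2F%3Fsource_attribution%3DVC-CPS-Emails';
$target = extract_target_url($url);
echo $target, PHP_EOL;                          // the decoded second URL
echo parse_url($target, PHP_URL_HOST), PHP_EOL; // its host
```

For the CSV case you would call extract_target_url() once per row; rows without a url parameter give null.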
A good way to find the domain for a URL is with parse_url().
Unfortunately, because of the way your data is stored, that is not really an option on its own. However, you may be able to use a regex to find the web addresses contained in the query string:
<?php
$url = "http://tracking.vcommission.com/aff_c?offer_id=2119&&url=http%3A%2F%2Fwww.netmeds.com%2F%3Fsource_attribution%3DVC-CPS-Emails%26utm_source%3DVC-CPS-Emails%26utm_medium%3DCPS-Emails%26utm_campaign%3DEmails";
$p = parse_url($url);
$pattern = "/www[^%]*/";
preg_match($pattern, $p['query'], $result);
var_dump($result);
You may need to adjust the regex pattern based on how the other data presents itself.
I have seen on most online newspaper websites that when I click on a headline link, e.g. "two thieves caught red handed", it normally opens a URL like this: www.example.co.uk/news/two-thieves-caught-red-handed.
How do I deal with this URL in PHP code so that I can pick out only the last part of it, e.g. two-thieves-caught-red-handed? After that I want to work with this string.
I know how to deal with GET parameters like "www.example.co.uk/news/headline=two thieves caught red handed".
But I do not want to do it that way. Could you show me another way.
You can use a combination of the explode() and end() functions for that.
For example:
<?php
$url = "www.example.co.uk/news/two-thieves-caught-red-handed";
$url = explode('/', $url);
$end = end($url);
echo "$end";
?>
The code will output:
two-thieves-caught-red-handed
You have several options in php to get the current url. For a detailed overview look here.
One would be to use $_SERVER['REQUEST_URI'] and then use a string manipulation function to extract the parts you need.
Maybe this thread will help you too.
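To make the $_SERVER['REQUEST_URI'] route concrete, here is a rough sketch. last_path_segment() is a made-up helper name; it drops any query string with parse_url() and then takes the last path segment with basename().

```php
<?php
// Hypothetical helper: returns the last path segment of a request URI,
// ignoring the query string and any trailing slash.
function last_path_segment(string $requestUri): string
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return ''; // malformed URI or no path component
    }
    return basename(rtrim($path, '/'));
}

// In a web request you would pass $_SERVER['REQUEST_URI'];
// a literal value is used here so the snippet also runs on the command line.
echo last_path_segment('/news/two-thieves-caught-red-handed?page=2'), PHP_EOL;
```

Note that this only helps once the web server routes /news/... requests to your PHP script, which is normally done with a rewrite rule and is outside this snippet.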
I am using a script to check links on a given page. I am using simple html DOM to parse the information into an array. I have to check the href of all the a tags to find if they contain a file or something like # or JS.
I tried the following without success.
if(preg_match("|^(.*)|iU", $href)){
save_link();
}
I don't know if my pattern is wrong or if there is a better method to accomplish this.
I want to be able to detect whether $href contains a file extension such as .com or .php. This way it will filter out items like #, "function()" and other non-link values used in the href attribute.
EDIT:
parse_url will not work, so please stop posting it. The value # comes back as a valid URL. As I stated above, I am trying to find any string followed by a dot and then no more than 4 characters.
I believe that the function you're looking for is parse_url().
This function will take a URL string, and return an array of components, which will allow you to work out what kind of URL it is.
However, note that it has issues with incomplete URLs in PHP versions prior to 5.4.7, so you need PHP 5.4.7 or later to get the best out of it.
Hope that helps.
See http://php.net/manual/en/function.parse-url.php
I'm assuming you don't want to match fragments (#) because you are not concerned with following internal anchors.
parse_url breaks up the different parts of the url into an array. You can see the path component of the URL in this array and run your check against that.
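As a sketch of running the check against the path component: the function below (looks_like_link() is a made-up name) rejects anchors and javascript: values, accepts absolute http/https URLs, and otherwise requires the path to end in an extension of 1 to 4 letters or digits. The 4-character limit comes from the question; the rest is an assumption about what should count as a link.

```php
<?php
// Hypothetical filter: true when $href looks like a link worth checking.
function looks_like_link(string $href): bool
{
    $href = trim($href);
    // Skip empty values, in-page anchors and javascript: pseudo-links.
    if ($href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0) {
        return false;
    }
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    // Absolute URLs: accept only http/https with a host.
    if (isset($parts['scheme'])) {
        return in_array(strtolower($parts['scheme']), ['http', 'https'], true)
            && isset($parts['host']);
    }
    // Relative values: require a short extension on the path (page.php, example.com).
    $extension = pathinfo($parts['path'] ?? '', PATHINFO_EXTENSION);
    return preg_match('/^[a-z0-9]{1,4}$/i', $extension) === 1;
}
```

With this, "#" and "function()" are rejected while "page.php?id=1" and "http://example.com" pass.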
You can use parse_url(), like this:
$res = parse_url($href);
// values such as "#" have no scheme, so check that the key exists first
if (isset($res['scheme']) && in_array($res['scheme'], array('http', 'https'))) {
//valid url
save_link();
}
UPDATE:
I've added code to filter only http and https urls, thanks to Baba for spotting this.
I've been working with the Sphider search engine for an internal website. We need to be able to quickly search for contact details in exported .htm(l) files.
$fulltxt = ereg_replace("[_A-Za-z0-9-]+(\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,3})", "\\0", $fulltxt);
I am replacing e-mail addresses with a convenient mailto: link so users can open Outlook straight from the search results.
However,
while (preg_match("/[^\>](".$change.")[^\<]/i", " ".$fulltxt." ", $regs)) {
$fulltxt = preg_replace("/".$regs[1]."/i", "<b>".$regs[1]."</b>", $fulltxt);
}
It replaces all matches in the search results with bold tags, which results in the tags being included in Outlook's 'To...' field. It looks something like this in HTML (thanks Yuriy):
<b>name</b>.surname#domain
I have tried adding a value to the 'limit' parameter:
while (preg_match("/[^\>](".$change.")[^\<]/i", " ".$fulltxt." ", $regs)) {
$fulltxt = preg_replace("/".$regs[1]."/i", "<b>".$regs[1]."</b>", $fulltxt, 1);
}
Supposedly this should solve my problem by replacing only the first occurrence (which is the name, as the pattern is name, phone number, e-mail and we always search by name). Instead, it only makes the search incredibly slow, to the point where I get a timeout message from the server. I've been trying various solutions but have been out of luck.
Any ideas? Am I doing something wrong?
Thanks.
(*Original heavily edited).
Did I understand you right that something like this happens?
<b>email#domain</b>
Why don't you put the bold tags into the search results first, and only then apply the "mailto:" anchors to the e-mail addresses? The added <b> tags would be easy to filter out in the pattern in that second step.
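Roughly what I mean, as an untested-in-Sphider sketch: highlight_then_link() is a made-up function, the search term is assumed to be plain text, and the e-mail pattern assumes ordinary name@domain addresses and is much simpler than the one in your script. The pattern tolerates <b> and </b> inside an address, and strip_tags() removes them from the mailto: target.

```php
<?php
// Hypothetical two-step version: bold the term first, then add mailto: links.
function highlight_then_link(string $text, string $term): string
{
    // Step 1: wrap every occurrence of the search term in <b> tags.
    $bolded = preg_replace('/' . preg_quote($term, '/') . '/i', '<b>$0</b>', $text);

    // Step 2: match e-mail addresses, allowing <b> or </b> inside them.
    $email = '/(?:<\/?b>|[\w.-])+@(?:<\/?b>|[\w.-])+\.[A-Za-z]{2,}(?:<\/b>)?/';

    return preg_replace_callback($email, function (array $m): string {
        $address = strip_tags($m[0]); // clean address for the href
        return '<a href="mailto:' . $address . '">' . $m[0] . '</a>';
    }, $bolded);
}

echo highlight_then_link('John Smith john.smith@example.com', 'john'), PHP_EOL;
```

The visible link text keeps the bold highlighting, while the href passed to Outlook contains only the bare address.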
I am trying to extract urls from a large number of google search results. Getting them from the source code is proving to be quite challenging as the delimiters are not clear and not all of the urls are in the code. Is there a tool that can extract urls from a certain area of an image? If so that may be a better solution.
Any help would be much appreciated.
Try using the JSON/Atom Custom Search API instead: http://code.google.com/apis/customsearch/v1/overview.html. It gives you 100 API calls per day, which you can increase to 10000 per day if you pay.
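A rough sketch of how that could look from PHP. Treat the details as assumptions to verify against the API documentation: the endpoint and the key, cx and q parameters, and the items/link fields in the JSON response are from memory, and both function names are made up. The API key and search engine ID are placeholders you have to supply yourself.

```php
<?php
// Builds the request URL for the Custom Search JSON API.
// $apiKey and $engineId are placeholders for your own credentials.
function build_search_url(string $apiKey, string $engineId, string $query): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,
        'cx'  => $engineId,
        'q'   => $query,
    ], '', '&');
}

// Collects the "link" field of every entry in the response's "items" array.
function extract_result_links(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['items']) || !is_array($data['items'])) {
        return []; // invalid JSON or no results
    }
    $links = [];
    foreach ($data['items'] as $item) {
        if (isset($item['link']) && is_string($item['link'])) {
            $links[] = $item['link'];
        }
    }
    return $links;
}
```

Fetching would then be something like file_get_contents(build_search_url($key, $cx, 'my search')) passed to extract_result_links(), provided allow_url_fopen is enabled on your server.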
Use this excellent lib: http://simplehtmldom.sourceforge.net/manual.htm
// Grab the source code
$html = file_get_html('http://www.google.com/');
// Find all anchors; returns an array of element objects
$ret = $html->find('a');
// Get an attribute of the first anchor (for a non-value attribute such as checked or selected, this returns true or false)
$value = $ret[0]->href;
Edit:
All "natural" search URLs seem to be in the #res div. With simplehtmldom, find the first #res element, then all the links inside it. I don't remember the exact syntax, but it should be something like this:
$ret = $html->find('div[id=res]', 0)->find('a');
or maybe
$html->find('div[id=res] a');
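If you would rather not add a library, the same "all links inside #res" idea can be sketched with PHP's built-in DOMDocument and DOMXPath instead of simplehtmldom. links_inside() is a made-up helper, the container id is assumed to be a simple value like res, and whether Google's markup still uses that id is something you would have to check.

```php
<?php
// Hypothetical helper: returns the href of every <a> inside the element
// with the given id, using the built-in DOM extension.
function links_inside(string $html, string $containerId): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Any element with the id, then every descendant anchor that has an href.
    foreach ($xpath->query('//*[@id="' . $containerId . '"]//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}
```

You would pass it the downloaded page source and 'res'; links outside that container are ignored.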