Problems getting a page's title with PHP - php

I wrote this function in PHP to get a page's title. I know it might look a bit messy, but that's because I'm a beginner in PHP. I previously used preg_match("/<title>(.+)<\/title>/i", $returned_content, $m) inside the if, and it didn't work as I expected.
function get_page_title($url) {
    $returned_content = get_url_contents($url);
    $returned_content = str_replace("\n", "", $returned_content);
    $returned_content = str_replace("\r", "", $returned_content);
    $lower_rc = strtolower($returned_content);
    $pos1 = strpos($lower_rc, "<title>") + strlen("<title>");
    $pos2 = strpos($lower_rc, "</title>");
    if ($pos2 > $pos1)
        return substr($returned_content, $pos1, $pos2 - $pos1);
    else
        return $url;
}
This is what I get when I try to get the titles of the following pages using the function above:
http://www.google.com -> "302 Moved"
http://www.facebook.com -> "http://www.facebook.com"
http://www.revistabula.com/posts/listas/100-links-para-clicar-antes-de-morrer -> "http://www.revistabula.com/posts/listas/100-links-para-clicar-antes-de-morrer"
(When I add a / to the end of the link, I can get the title successfully: "100 links para clicar antes de morrer | Revista Bula")
My questions are:
- I know Google redirects to my country's mirror when I try to access google.com, but how can I get the title of the page it redirects to?
- What is wrong with my function that it gets the title of some pages, but not others?

HTTP clients should follow redirects. That 302 status code means that the content you tried to get isn't at that location, and the client should follow the Location: header to figure out where it is.
You have two problems here. The first is not following redirects. If you use cURL, you can get it to follow redirects by setting this:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
See this question for a full solution:
Make curl follow redirects?
The second problem is that you are parsing HTML with RegEx. Don't do that. See this question for better alternatives:
How do you parse and process HTML/XML in PHP?
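For illustration, here is a minimal sketch that combines both fixes, assuming cURL and the DOM extension are available (the function name get_page_title_dom is made up for this example):
function get_page_title_dom($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // avoid endless redirect loops
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return $url; // keep the original fallback behaviour: return the URL itself
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings caused by real-world, invalid HTML
    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->nodeValue) : $url;
}
echo get_page_title_dom('http://www.google.com');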

Why not try something like this? It works very well.
function get_page_title($url)
{
    $source = file_get_contents($url);
    $results = preg_match("/<title>(.*)<\/title>/", $source, $title_matches);
    if (!$results)
        return null;
    // get the first match; this is the title
    $title = $title_matches[1];
    return $title;
}
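A quick usage sketch (the URL is just a placeholder; note that file_get_contents with the default HTTP context already follows redirects, which helps with pages like the Google one above):
$title = get_page_title('http://www.example.org/');
echo $title === null ? 'No title found' : $title;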

Related

Parsing closed brackets in a URL: http_build_query inserts numbers into the closed brackets

I may not have explained this properly, but here we go.
I have a URL that looks like http://www.test.co.uk/?page=2&area[]=thing&area[]=thing2
Multiple "area"s can be added or removed from the URL via links on the site. on each addition of n "area" I wanted to remove the "page" part of the URL. so it can be reset to page1. I used parse_url to take that bit out.
Then I built an http query so it could generate the URL properly without "page"
this resulted in "area%5B0%5D=" "area%5B1%5D=" instead of "area[]="
When I use urldecode, now it shows "area[0]=" and "area[1]="
I need it to be "[]" because when using a link to remove an area, it checks for the "[]=" - when it's [0] it doesn't recognise it. How do I keep it as "[]="?
See code below.
$currentURL = currentURL();
$parts = parse_url($currentURL);
parse_str($parts['query'], $query);
unset($query['page']);
$currenturlfinal = http_build_query($query);
urldecode($currenturlfinal);
$currentURL = "?" . urldecode($currenturlfinal);
This is what I've done so far. It fixes the visual part of the URL, but I don't think I've actually solved anything: I've realised that what represents 'area' and 'thing' is no longer recognised as $key or $val, which I think is a result of parsing or re-encoding the URL in the code below. So I still can't remove 'area's using the links.
$currentURL_with_QS2 = currentURL();
$parts = parse_url($currentURL_with_QS2);
parse_str($parts['query'], $query);
unset($query['page']);
$currenturlfinal = http_build_query($query);
$currenturlfinal = preg_replace('/%5B[0-9]+%5D/simU', '[]', $currenturlfinal);
urldecode($currenturlfinal);
$currentURL_with_QS = "?" . $currenturlfinal;
$numQueries = count(explode('&', $_SERVER['QUERY_STRING']));
$get = $_GET;
if (activeCat($val)) { // if this category is already set
$searchString = $key . '[]= ' . $val; // we build the query string to remove
I'm using WordPress as well, I should add; maybe there's a way to reset the pagination through WordPress. Of course, even then, when I go to page 2 on any page it still changes the "[]" to "%5B0%5D" etc.
EDIT: This is all part of a function that refers to $key (the area/category) and $val (the name of the area or category), which is echoed in the link itself.
EDIT2: It works now!
I don't know why, but I had to take the original code and make the same adjustments again, and now it works exactly how I want it to! Yet I couldn't see any visible differences between the two versions afterwards. Strange...
As far as I know, there is no built-in way to do this.
You could try with:
$currenturlfinal = http_build_query($query);
where $query is the query-string part without the area parameters, and then:
foreach ($areas as $area) {
    $currenturlfinal .= '&area[]=' . $area;
}
UPD:
You could try with:
$currenturlfinal = preg_replace('/%5B[0-9]+%5D/simU', '%5B%5D', $currenturlfinal);
Just place it right after the http_build_query call.
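Putting the pieces together, a minimal sketch of the whole step based on the question's code (the example URL is the one from the question):
$parts = parse_url('http://www.test.co.uk/?page=2&area[]=thing&area[]=thing2');
parse_str($parts['query'], $query);       // ['page' => '2', 'area' => ['thing', 'thing2']]
unset($query['page']);                    // reset pagination
$queryString = http_build_query($query);  // area%5B0%5D=thing&area%5B1%5D=thing2
$queryString = preg_replace('/%5B[0-9]+%5D/', '%5B%5D', $queryString);
echo '?' . $queryString;                  // ?area%5B%5D=thing&area%5B%5D=thing2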

Google+ button count showing "0" using the Sharrre library (JSON, PHP)

I checked via phpinfo(): Safe Mode on my server is off, cURL is enabled, and there is no obvious reason for it not to work.
I also made sure sharrre.php is in my root directory, and even set the curlurl option to point to the PHP file, trying both absolute and relative paths. The Google button with the counter shows up as soon as it is uploaded, but not as expected, because the counter shows 0 the entire time.
The culprit seems to be: $json = array('url'=>'','count'=>0);
After a few lines of other code we got this:
if (filter_var($_GET['url'], FILTER_VALIDATE_URL)) {
    if ($type == 'googlePlus') { // source: http://www.helmutgranda.com/2011/11/01/get-a-url-google-count-via-php/
        $contents = parse('https://clients6.google.com/rpc?key=AIzaSyCKSbrvQasunBoV16zDH9R33D88CeLr9gQ&url=' . $url . '&count=true');
        preg_match('/window\.__SSR = {c: ([\d]+)/', $contents, $matches);
        if (isset($matches[0])) {
            $json['count'] = (int)str_replace('window.__SSR = {c: ', '', $matches[0]);
        }
    }
So either the Google URL code is not valid anymore, or maybe there is something wrong with the suspected culprit, because:
when I change it to a value higher than 0, e.g. $json = array('url'=>'','count'=>15);
it shows a count of 15. I want it to be dynamic though, getting the counts I already have and updating them per click.
What can be done to solve this?
In my particular case the problem was in the assignment of the URL to the cURL handle.
The original sharrre.php script sets the URL by assigning it to an element of the cURL options array, but this does not work and causes the Google counter to retrieve no count at all.
Instead, the URL must be assigned with the curl_setopt() function.
This resolved the problem in my case:
sharrre.php:
//...
$ch = curl_init();
//$options[CURLOPT_URL] = $encUrl; // <<<--- not working! comment this line.
curl_setopt_array($ch, $options);
curl_setopt($ch, CURLOPT_URL, $encUrl ); // <<<--- Yeeaa, working! Add this line.
//...
Hope this helps.
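In context, a rough sketch of the fixed fetch ($encUrl comes from sharrre.php; the other options here are just typical defaults, not necessarily the ones the script uses):
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true, // return the response body
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_TIMEOUT        => 10,
));
curl_setopt($ch, CURLOPT_URL, $encUrl); // set the URL explicitly instead of via the options array
$contents = curl_exec($ch);
curl_close($ch);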

How to create a bitly shortened URL from a user's inputted text?

Beginner here, people. Could anybody suggest any kind of solution? I have user-inputted text.
First of all I check if the text has any URLs:
$post = preg_replace('/https?:\/\/[\w\-\.!~?&+\*\'"(),\/]+/',
    '<a class="post_link" href="$0">$0</a>', $post);
And after that I need to retrieve that URL and pass it as a variable ($url) to this function:
$short=make_bitly_url('$url','o_6sgltp5sq4as','R_f5212f1asdads1cee780eed00d2f1bd2fd794f','xml');
And finally, echo both the URL and the user's text. Thanks in advance for ideas and critiques.
I've tried something like this:
$post = preg_replace('/https?:\/\/[\w\-\.!~?&+\*\'"(),\/]+/e',$url,$post){
$shorten = make_bitly_url($url,'o_6sgltpmm5sq4','R_f5212f11cee780ekked00d2f1bd2fd794f','json');
return '<a class="post_link" href="$shorten">$shorten</a>';
};
But even to me it looks like nonsense.
Bitly does have an API available for use. You should check out the API documentation.
Here's how to use the bit.ly API from PHP:
/* make a URL small */
function make_bitly_url($url, $login, $appkey, $format = 'xml', $version = '2.0.1')
{
    // create the API URL
    $bitly = 'http://api.bit.ly/shorten?version=' . $version . '&longUrl=' . urlencode($url) . '&login=' . $login . '&apiKey=' . $appkey . '&format=' . $format;
    // get the response
    // could also use cURL here
    $response = file_get_contents($bitly);
    // parse depending on desired format
    if (strtolower($format) == 'json')
    {
        $json = @json_decode($response, true);
        return $json['results'][$url]['shortUrl'];
    }
    else // xml
    {
        $xml = simplexml_load_string($response);
        return 'http://bit.ly/' . $xml->results->nodeKeyVal->hash;
    }
}
/* usage */
$short = make_bitly_url('http://davidwalsh.name','davidwalshblog','R_96acc320c5c423e4f5192e006ff24980','json');
echo 'The short URL is: '.$short;
// returns: http://bit.ly/11Owun
Source: David Walsh article
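To tie this back to the question (shortening every URL inside user text), a hedged sketch is to use preg_replace_callback instead of preg_replace with the /e modifier; the login and API key below are placeholders:
$post = preg_replace_callback(
    '/https?:\/\/[\w\-\.!~?&+\*\'"(),\/]+/', // same URL pattern as in the question
    function ($m) {
        // shorten each matched URL, then wrap it in the link markup
        $short = make_bitly_url($m[0], 'your_login', 'your_api_key', 'json');
        return '<a class="post_link" href="' . $short . '">' . $short . '</a>';
    },
    $post
);
echo $post;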
HOWEVER, if you wanted to create your own URL shortening system (similar to bit.ly -- and surprisingly easy to do), here is an 8-part tutorial from PHPacademy on how to do that:
Difficulty level: beginner / intermediate
Each video is approx ten minutes.
Part 1
Part 2
Part 3
Part 4
Part 5
Part 6
Part 7
Part 8

How to use python/PHP to remove redundancy in URL link?

Many websites add tags to URL links for tracking purposes, such as
http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost
If we remove the appended "?wprss=linkset&tid=sm_twitter_washingtonpost", it still goes to the same page.
Is there any general approach that could remove those redundant elements? Any comment would be helpful.
Thanks!
To remove query, fragment parts from URL
In Python using urlparse:
import urlparse
url = urlparse.urlsplit(URL) # parse url
print urlparse.urlunsplit(url[:3]+('','')) # remove query, fragment parts
Or a more lightweight approach but it might be less universal:
print URL.partition('?')[0]
According to RFC 3986, a URI can be parsed using the regular expression:
/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/
Therefore, if there is no fragment identifier (the last part in the above regex) or the query component is present (the 2nd-to-last part), then URL.partition('?')[0] should work; otherwise, answers that split a URL on '?' would fail, e.g. on
http://example.com/path#here-?-ereh
but the urlparse answer still works.
To check whether you can access page via URL
In Python:
import urllib2
try:
    resp = urllib2.urlopen(URL)
except IOError, e:
    print "error: can't open %s, reason: %s" % (URL, e)
else:
    print "success, status code: %s, info:\n%s" % (resp.code, resp.info()),
resp.read() could be used to read the contents of the page.
To remove the query string from a URL:
<?php
$url = 'http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost';
$url = explode('?',$url);
$url = $url[0];
//check output
echo $url;
?>
To check whether a URL is valid or not:
You can use PHP function get_headers($url). Example:
<?php
//$url_o = 'http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost';
$url_o = 'http://mobile.nytimes.com/article?a=893626&f=21';
$url = explode('?',$url_o);
$url = $url[0];
$header = get_headers($url);
if(strpos($header[0],'Not Found'))
{
$url = $url_o;
}
//check output
echo $url;
?>
You can use a regular expression:
$yourUrl = preg_replace("/[?].*/","",$yourUrl);
Which means: "replace the question mark and everything after it with an empty string".
You can make a URL parser that will cut everything from the "?" onwards:
<?php
$pos = strpos($yourUrl, '?'); // First, find the index of "?"
// Then, keep only the chars before the "?" in a new URL string:
$newUrl = substr($yourUrl, 0, -1 * (strlen($yourUrl) - ((int)$pos)));
echo ($newUrl);
?>
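Another option in PHP, as a rough sketch, is to mirror the urlparse approach above with parse_url, which drops a trailing #fragment as well as the query string (the helper name strip_url_extras is made up here):
<?php
// Rebuild the URL from its scheme, host and path only,
// so both the ?query and the #fragment are dropped.
function strip_url_extras($url) {
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return $url; // not an absolute URL we can rebuild; leave it alone
    }
    return $p['scheme'] . '://' . $p['host']
        . (isset($p['port']) ? ':' . $p['port'] : '')
        . (isset($p['path']) ? $p['path'] : '/');
}
echo strip_url_extras('http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost');
?>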

How to write a PHP script to find the number of indexed pages in Google?

I need to find the number of indexed pages in Google for a specific domain name. How do we do that through a PHP script?
So,
foreach ($allresponseresults as $responseresult)
{
    $result[] = array(
        'url' => $responseresult['url'],
        'title' => $responseresult['title'],
        'abstract' => $responseresult['content'],
    );
}
What do I add for the estimated number of results, and how do I do that?
I know it is estimatedResultCount, but how do I add that? I access the title, for example, like this: $result['title']. So how do I get the number, and how do I print it?
Thank you :)
I think it would be nicer to Google to use their RESTful Search API. See this URL for an example call:
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:stackoverflow.com&filter=0
(You're interested in the estimatedResultCount value)
In PHP you can use file_get_contents to get the data and json_decode to parse it.
You can find documentation here:
http://code.google.com/apis/ajaxsearch/documentation/#fonje
Example
Warning: The following code does not have any kind of error checking on the response!
function getGoogleCount($domain) {
    $content = file_get_contents('http://ajax.googleapis.com/ajax/services/' .
        'search/web?v=1.0&filter=0&q=site:' . urlencode($domain));
    $data = json_decode($content);
    return intval($data->responseData->cursor->estimatedResultCount);
}
echo getGoogleCount('stackoverflow.com');
You'd load http://www.google.com/search?q=domaingoeshere.com with cURL and then parse the file, looking for the <p id="resultStats"> bit in the results.
You'd have the resulting html stored in a variable $html and then say something like
$arr = explode('<p id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</p>', $bottom);
Please note that this is untested and a very rough example. You'd be better off parsing the html with a dedicated parser or matching the line with regular expressions.
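As a hedged sketch of the dedicated-parser route (assuming $html holds the fetched page; the resultStats id is the one from the snippet above, and Google's markup may change):
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from Google's non-strict HTML
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p[@id="resultStats"]');
if ($nodes->length > 0) {
    echo trim($nodes->item(0)->nodeValue); // e.g. "About 830,000 results"
}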
The Google AJAX API's estimatedResultCount doesn't give the right value.
And trying to parse the HTML result is not a good way either, because Google blocks you after several searches.
Count the number of results for site:yourdomainhere.com - stackoverflow.com has about 830k
// This will give you the count you see on the search results web page;
// this code fetches the HTML content with file_get_contents.
header('Content-Type: text/plain');
$url = "https://www.google.com/search?q=your url";
$html = file_get_contents($url);
if (FALSE === $html) {
    throw new Exception(sprintf('Failed to open HTTP URL "%s".', $url));
}
$arr = explode('<div class="sd" id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</div>', $bottom);
echo $middle[0];
Output:
About 8,130 results
//vKj
Case 2: you can also use the Google API, but its count is different:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=ursitename&callback=processResults
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:google.com
cursor":{"resultCount":"111,000,000","
"estimatedResultCount":"111000000",
