Scrape FULL image src with PHP


I am trying to scrape img src values with PHP. I can get the src fine, but if it does not include the full path I can't really reuse it. Is there a way to get the full URL of the image using PHP? (Browsers can produce it from the right-click menu.)
For example, how do I get a full URL, including the domain, for either of the following two examples?
src="../foo/logo.png"
src="/images/logo.png"
Thanks,
Allan

You don't need a regex, just some patience. I don't really want to write the code for you, but check whether the src starts with http:// (or https://); if not, there are three cases.
If it begins with a /, prepend http://domain.com.
If it begins with .., split the page URL and strip one trailing path segment for each .., then append the rest of the src.
Otherwise (it begins with a letter), take the full page URL, strip it back to the last slash, then append the src.
Or... be lazy and steal this script:
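$url = "http://www.goat.com/money/dave.html";
$rel = "../images/cheese.jpg";
$com = InternetCombineUrl($url, $rel);
// Returns http://www.goat.com/images/cheese.jpg
function InternetCombineUrl($absolute, $relative) {
    // A reference that already has a scheme is absolute: return it unchanged
    $p = parse_url($relative);
    if (!empty($p["scheme"])) return $relative;
    // Read the base parts explicitly instead of extract(), which left
    // $user, $pass etc. undefined when the base URL did not contain them
    $base = parse_url($absolute);
    $basePath = isset($base["path"]) ? $base["path"] : "/";
    if ($relative !== "" && $relative[0] == '/') {
        // Root-relative: ignore the base path ($relative{0} no longer parses in PHP 8)
        $parts = explode("/", $relative);
    } else {
        // Document-relative: start from the directory of the base document
        $dir = substr($basePath, 0, strrpos($basePath, "/") + 1);
        $parts = explode("/", $dir . $relative);
    }
    // Resolve "." and ".." with a stack so that repeated "../" works too
    $cparts = array();
    foreach ($parts as $part) {
        if ($part === '' || $part === '.') continue;
        if ($part === '..') {
            array_pop($cparts);
        } else {
            $cparts[] = $part;
        }
    }
    $path = implode("/", $cparts);
    $url = "";
    if (!empty($base["scheme"])) {
        $url = $base["scheme"] . "://";
    }
    if (!empty($base["user"])) {
        $url .= $base["user"];
        if (!empty($base["pass"])) {
            $url .= ":" . $base["pass"];
        }
        $url .= "@";
    }
    if (!empty($base["host"])) {
        $url .= $base["host"];
        if (!empty($base["port"])) {
            $url .= ":" . $base["port"];
        }
        $url .= "/";
    }
    $url .= $path;
    return $url;
}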
From http://www.web-max.ca/PHP/misc_24.php

Unless you have the site URL you're starting with (in which case you can prepend it to the value of the src attribute) it seems like all you're left with there is a string.
I'm assuming you don't have access to any additional information of course. If you're parsing HTML, I'd assume you must be able to access an absolute URL to at least the HTML page, but perhaps not.
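As a rough illustration of the scraping side, here is a minimal sketch that collects the raw src values with DOMDocument. The function name extractImageSources is made up for this example, and it assumes PHP 7+ with the DOM extension enabled. The values come back exactly as written in the HTML, so the URL of the page they were scraped from still has to be kept alongside them to resolve relative ones.

```php
<?php
// Hypothetical helper: returns the raw src attribute of every <img> in $html.
// In a real scraper $html would typically come from file_get_contents($pageUrl).
function extractImageSources(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects empty input in PHP 8
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src; // still relative if the page wrote it that way
        }
    }
    return $sources;
}

$html = '<html><body><img src="../foo/logo.png"><img src="/images/logo.png"><img alt="no src"></body></html>';
print_r(extractImageSources($html)); // the two src values from the question
```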

Related

Internal Server Error when trying to check url

I'm trying to check the string after the last trailing slash in my URL.
My code is as follows:
$url = "http://$_SERVER[HTTP_HOST]$_SERVER[REQUEST_URI]";
$data = substr($url, strrpos($url, '/') + 1);
if($data == "dashboard") {
require_once VIEW_ROOT . '/cp/dashboard_view.php';
} else {
echo $data;
}
When I go to http://MYURL/dashboard/in, it should show "in" as $data. Instead it gives me a 500 error.

You can simply use the explode() function to break the string apart. Alternatively, $_SERVER['REQUEST_URI'] on its own gives you everything after the host name.
For the data after the last '/', explode() works best.
This will work:
$url = "http://$_SERVER[HTTP_HOST]$_SERVER[REQUEST_URI]";
$x = explode('/',$url);
$data = $x[sizeof($x)-1];
echo $data;

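You should try:
$url = "http://" . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
Join the "http://" string with $_SERVER['HTTP_HOST'] and then $_SERVER['REQUEST_URI'] using the . (dot) operator, and quote the array keys. Outside a double-quoted string, an unquoted key such as HTTP_HOST is read as a constant, which is an error in PHP 8.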
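For comparison, here is a small sketch that takes the last path segment from the request URI with parse_url() instead of cutting the full URL string. The helper name lastPathSegment is invented for this example. Unlike the substr() version in the question, it ignores the query string and a trailing slash; whether that is the desired behaviour is an assumption.

```php
<?php
// Hypothetical helper: last non-empty path segment of a request URI.
function lastPathSegment(string $requestUri): string
{
    // PHP_URL_PATH drops the query string and fragment
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    $path = rtrim($path, '/');           // "/dashboard/" and "/dashboard" behave the same
    $pos  = strrpos($path, '/');
    return $pos === false ? $path : substr($path, $pos + 1);
}

// In a request this would be lastPathSegment($_SERVER['REQUEST_URI'])
echo lastPathSegment('/dashboard/in'), "\n";      // in
echo lastPathSegment('/dashboard/?tab=1'), "\n";  // dashboard
```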

PHP replace URL segment with str_replace();

I have "/foo/bar/url/" coming straight after my domain name.
I want to find the penultimate slash in the string and replace it with a slash followed by a hash sign, i.e. turn / into /#. (The problem is not how to get the URL, but how to handle it.)
How can this be achieved? What is the best practice for this kind of thing?
At the moment I'm fairly sure that I should use str_replace().
UPD. I think preg_replace() would suit my case. But then there is another problem: what should the regexp look like to solve this?
P.S. In case it matters, I'm using the SilverStripe framework (v3.1.12).

$url = '/foo/bar/url/';
if (false !== $last = strrpos($url, '/')) {
if (false !== $penultimate = strrpos($url, '/', $last - strlen($url) - 1)) {
$url = substr_replace($url, '/#', $penultimate, 1);
}
}
echo $url;
This will output
/foo/bar/#url/
If you want to strip the last /:
echo rtrim($url, '/'); // print /foo/bar/#url

Here is a method that would function. There are probably cleaner ways.
// Let's assume you already have $url_string populated
$url_string = "http://whatever.com/foo/bar/url/";
$url_explode = explode("/", $url_string); // split on forward slashes, not backslashes
$portion_count = count($url_explode);
// Minus two: array indexes start at 0, and the trailing slash leaves an empty last element
$affected_portion = $portion_count - 2;
// Prefix the last real segment with "#", then put the slashes back
$url_explode[$affected_portion] = "#" . $url_explode[$affected_portion];
$new_url = implode("/", $url_explode); // http://whatever.com/foo/bar/#url/

Assuming you now have
$url = $this->Link(); // e.g. /foo/bar/my-urlsegment
you can combine it like this (PHP concatenates strings with ., not +):
$handledUrl = $this->ParentID
? $this->Parent()->Link() . '#' . $this->URLSegment
: $this->Link();
where $this->Parent()->Link() is e.g. /foo/bar and $this->URLSegment is my-urlsegment.
Checking $this->ParentID also tells us whether we have a parent page or are on the top level of the SiteTree.

I might be too late to answer this question, but I thought this might help. You can simply use preg_replace like this:
$url = '/foo/bar/url/';
echo preg_replace('~(\/)(\w+)\/$~',"$1#$2",$url);
Output:
/foo/bar/#url

In my case this solved my problem:
$url = $this->Link();
$url = rtrim($url, '/');
$url = substr_replace($url, '#', strrpos($url, '/') + 1, 0);

Adding to url with link

My url contains many variables that I want untouched (don't worry they aren't important).
Let's say it contained...
../index.php?id=5
How would I make a url that just adds
&current=1
rather than replacing it entirely?
I'd like...
../index.php?id=5&current=1
rather than..
../index.php?current=1
I know it's a simple question but that's why I can't figure it out.
Thanks.

To append a parameter to a URL you can do this:
function addParam( $url, $param ){
if( strrpos( $url, '?' ) === false){
$url .= '?' . $param;
} else {
$url .= '&' . $param;
}
return $url;
}
$url = "../index.php?id=5";
$url = addParam( $url, "current=1");

You should just create your link to 'add' that parameter
The Link
and then obviously in the index.php somewhere you'll look for the current variable and do what you need to:
<?php
if (isset($_GET['current']) && !empty($_GET['current'])) {
// Do stuff here for the 'current' variable
$current = trim($_GET['current']);
}
?>

On the links that need the current variable, you could just put it in the href attribute. In the index.php file, do something like this:
if(isset($_GET['current']))
{
$current = $_GET['current'];
//Do the rest of what you need to do with this variable
}

Try this one:
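$givenVar = "";
foreach($_GET as $key=>$val){
// Separator goes after each pair so the query string does not start with "&";
// urlencode() assumes scalar values (no array parameters)
$givenVar .= urlencode($key)."=".urlencode($val)."&";
}
$var = "num=1";
$link = "?".$givenVar.$var;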
echo $link;

You can just add the variable to the href.
When you click it while the address is
../index.php?id=5
you then go to
../index.php?id=5&current=1
BUT if you click that link again, then you'll go to
../index.php?id=5&current=1&current=1
I think it's tricky and bad practice to just append the variable.
I suggest you do it like this:
<?php
// Merge into the existing parameters so that "current" appears only once
$query = http_build_query(array_merge($_GET, array('current' => 1)));
?>
A Label
Take a look at http://us.php.net/manual/en/function.http-build-query.php
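The duplicate-parameter problem can also be avoided when working on an arbitrary URL string rather than on $_GET. The sketch below uses parse_str() and http_build_query(); the function name setQueryParam is made up for this example, and it assumes a flat key and a string value.

```php
<?php
// Hypothetical helper: set (add or overwrite) one query parameter on a URL string.
function setQueryParam(string $url, string $key, string $value): string
{
    // Keep a #fragment aside so the parameter lands in the query, not after the fragment
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    // Split off and decode the existing query string, if any
    $params = [];
    $queryPos = strpos($url, '?');
    if ($queryPos !== false) {
        parse_str(substr($url, $queryPos + 1), $params);
        $url = substr($url, 0, $queryPos);
    }

    $params[$key] = $value; // overwriting prevents "&current=1&current=1"
    return $url . '?' . http_build_query($params, '', '&') . $fragment;
}

$url = setQueryParam('../index.php?id=5', 'current', '1');
echo $url, "\n";                                  // ../index.php?id=5&current=1
echo setQueryParam($url, 'current', '1'), "\n";   // unchanged on a second call
```

Because the parameters are decoded into an array and rebuilt, calling the function repeatedly leaves a single current=1 in the URL.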

I don't know why on earth you would need this, but here we are. This should do the trick:
$appendString = "&current=1";
$pageURL = $_SERVER["REQUEST_URI"].$appendString;
$_SERVER["REQUEST_URI"] returns the requested path together with its query string (any GET variables), so appending with & only works when the URL already has a query string. The other string should be clear enough!

How to fetch rss feed url of a website using php?

I need to find the rss feed url of a website programmatically.
[Either using php or jquery]

The general process has already been answered (Quentin, DOOManiac), so some code (Demo):
<?php
$location = 'http://hakre.wordpress.com/';
$html = file_get_contents($location);
echo getRSSLocation($html, $location); # http://hakre.wordpress.com/feed/
/**
* @link http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
*/
function getRSSLocation($html, $location){
if(!$html or !$location){
return false;
}else{
#search through the HTML, save all <link> tags
# and store each link's attributes in an associative array
preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
$links = $matches[1];
$final_links = array();
$link_count = count($links);
for($n=0; $n<$link_count; $n++){
$final_link = array(); #reset per tag so attributes don't leak from the previous <link>
$attributes = preg_split('/\s+/s', $links[$n]);
foreach($attributes as $attribute){
$att = preg_split('/\s*=\s*/s', $attribute, 2);
if(isset($att[1])){
$att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
$final_link[strtolower($att[0])] = $att[1];
}
}
$final_links[$n] = $final_link;
}
#now figure out which one points to the RSS file
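for($n=0; $n<$link_count; $n++){
#missing attributes default to '' so no undefined-index notices are raised
$rel = isset($final_links[$n]['rel']) ? strtolower($final_links[$n]['rel']) : '';
$type = isset($final_links[$n]['type']) ? strtolower($final_links[$n]['type']) : '';
$href = null;
if($rel == 'alternate'){
if($type == 'application/rss+xml' && isset($final_links[$n]['href'])){
$href = $final_links[$n]['href'];
}
if(!$href && $type == 'text/xml' && isset($final_links[$n]['href'])){
#kludge to make the first version of this still work
$href = $final_links[$n]['href'];
}
if($href){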
if(strstr($href, "http://") !== false){ #if it's absolute
$full_url = $href;
}else{ #otherwise, 'absolutize' it
$url_parts = parse_url($location);
#only made it work for http:// links. Any problem with this?
$full_url = "http://$url_parts[host]";
if(isset($url_parts['port'])){
$full_url .= ":$url_parts[port]";
}
if($href[0] != '/'){ #it's a relative link on the domain
$full_url .= dirname($url_parts['path']);
if(substr($full_url, -1) != '/'){
#if the last character isn't a '/', add it
$full_url .= '/';
}
}
$full_url .= $href;
}
return $full_url;
}
}
}
return false;
}
}
See: RSS auto-discovery with PHP (archived copy).

This is more involved than just pasting some code here, but I can point you in the right direction.
First, fetch the page.
Then parse the string you get back, looking for the RSS autodiscovery <link> tag. You can either load the whole document into a DOM parser and traverse it, or use a regular expression, which is what I would do.
Finally, extract the href attribute of the tag and you have the URL of the RSS feed.

The rules for making RSS discoverable are fairly well documented. You just need to parse the HTML and look for the elements described.
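As an illustration of the DOM route mentioned in this thread, here is a minimal sketch that looks for autodiscovery link elements with DOMDocument. The function name findFeedLinks is hypothetical, and it assumes PHP 7+ with the DOM extension. It returns the href values as written in the page, so relative ones would still need resolving against the page URL.

```php
<?php
// Hypothetical helper: href values of RSS/Atom autodiscovery <link> elements in $html.
function findFeedLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects empty input in PHP 8
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = ['application/rss+xml', 'application/atom+xml'];
    $feeds = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens, in any letter case
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $type = strtolower(trim($link->getAttribute('type')));
        $href = trim($link->getAttribute('href'));
        if (in_array('alternate', $rels, true) && in_array($type, $feedTypes, true) && $href !== '') {
            $feeds[] = $href;
        }
    }
    return $feeds;
}

$html = '<html><head>'
      . '<link rel="stylesheet" href="/s.css">'
      . '<link rel="alternate" type="application/rss+xml" title="RSS" href="/feed/">'
      . '</head><body></body></html>';
print_r(findFeedLinks($html)); // contains "/feed/"
```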

A slightly smaller function that will grab the first available feed, whether it is rss or atom (most blogs have two options - this grabs the first preference).
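public function getFeedUrl($url){
    // Fetch the page once; @ suppresses the warning if the request fails
    $html = @file_get_contents($url);
    if($html !== false){
        preg_match_all('/<link\srel\=\"alternate\"\stype\=\"application\/(?:rss|atom)\+xml\"\stitle\=\".*?href\=\"([^"]*)\"\s\/\>/', $html, $matches);
        // Only return a match if the pattern actually found a feed link
        if(!empty($matches[1])){
            return $matches[1][0];
        }
    }
    return false;
}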

Make a local link global /test.php -> example.com/test.php

I've been working on a spider algorithm and have been having some issues with the links.
example of how it works:
got content from -> example.com/bob/index.php?page=funny+faces
content is :
<html>
link 1
link 2
link 3
</html>
pass content through get links function
links function returned
[0] = ../jack/index.php
[1] = /bob_more_info
[2] = http://www.youtube.com
Now I need to turn these links into full URLs based on the page I got them from (example.com/bob/index.php?page=funny+faces)
so
[0] -> ../jack/index.php into example.com/jack/index.php
[1] -> /bob_more_info into example.com/bob/bob_more_info
[2] -> http://www.youtube.com
What I am asking for is a function that can do the conversion. This is mine, but it's not always working and is becoming a pain. If you could edit it or write me a function it would be much appreciated. Thanks in advance.
Here is my function currently:
//example:
//$newURL = URLfix("example.com/bob/index.php?page=funny+faces", "../jack/index.php");
function URLfix ($url, $ext)
{
if(is_valid_url($url."/"))
{
$url .= "/";
}
$ar1 = explode("/", $url);
if(count($ar1) == 1)
{
return $url."/".$ext;
}
$target = $ar1[count($ar1) - 1];
if($target == "")
{
return $url.$ext;
}
if(strpos(" ".$target, "."))
{
$cur = "";
for($i = 0; $i < count($ar1) - 1; $i ++)
{
$cur .= $ar1[$i];
$cur .= "/";
}
return $cur.$ext;
}
return $url."/".$ext;
}

Use explode() to split $url into an array delimited by /; $bits[0], for example, would then contain example.com.

Since
example.com/jack/index.php
is equivalent to:
example.com/bob/../jack/index.php
I wouldn't worry about that part. For the URL, I would remove the query string first, then pop off the last segment to get the base URL:
// array_pad() keeps list() from failing when the URL has no query string
list($url, $query_string) = array_pad(explode("?", $url, 2), 2, "");
$segments = explode("/", $url);
array_pop($segments);
$base_url = implode("/", $segments);
Do be sure to add some error checks.

A specification exists that explains step by step how to resolve a relative URI against its base URI: RFC 3986.
What you call a "global link" is just a URI reference.
What you call a "local link" is called a relative reference.
Every relative reference has a base reference it refers to. The base reference is a URI reference. You can resolve a new URI reference from any base URI reference and the relative reference. This process is called relative resolution.
PHP code that does this is available in the Net_URL2 PEAR package; it has an example of how to use it, look for ->resolve().
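For readers who want to see the idea without installing a package, below is a simplified sketch in the spirit of RFC 3986 section 5.2. It is not a complete implementation: the function names resolveUrl and removeDotSegments are made up here, the base URL is assumed to include a scheme and host, and user info in the base is ignored.

```php
<?php
// Collapse "." and ".." segments of an absolute path (simplified RFC 3986 5.2.4).
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = end($segments);
    $output = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;
        }
        if ($segment === '..') {
            // Never pop the leading empty segment, which stands for the root "/"
            if (count($output) > 1) {
                array_pop($output);
            }
            continue;
        }
        $output[] = $segment;
    }
    if ($last === '.' || $last === '..') {
        $output[] = ''; // a path ending in a dot segment ends in "/"
    }
    return implode('/', $output);
}

// Resolve $relative against $base. Assumes $base has a scheme and a host.
function resolveUrl(string $base, string $relative): string
{
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        throw new InvalidArgumentException('Base URL needs a scheme and a host');
    }
    // A reference with its own scheme is already absolute
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $relative)) {
        return $relative;
    }
    // Network-path reference (//host/path): borrow only the scheme
    if (strncmp($relative, '//', 2) === 0) {
        return $b['scheme'] . ':' . $relative;
    }
    $authority = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Split the reference into its path and the rest (?query and/or #fragment)
    $cut = strcspn($relative, '?#');
    $refPath = substr($relative, 0, $cut);
    $suffix = substr($relative, $cut);

    if ($refPath === '') {
        // Empty path: keep the base path, and the base query unless the reference has one
        $path = $basePath;
        if (($suffix === '' || $suffix[0] === '#') && isset($b['query'])) {
            $suffix = '?' . $b['query'] . $suffix;
        }
    } elseif ($refPath[0] === '/') {
        $path = $refPath; // root-relative: replaces the whole base path
    } else {
        // Merge: replace everything after the last "/" of the base path
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $refPath;
    }
    return $authority . removeDotSegments($path) . $suffix;
}

$page = 'http://example.com/bob/index.php?page=funny+faces';
echo resolveUrl($page, '../jack/index.php'), "\n";      // http://example.com/jack/index.php
echo resolveUrl($page, '/bob_more_info'), "\n";         // http://example.com/bob_more_info
echo resolveUrl($page, 'http://www.youtube.com'), "\n"; // http://www.youtube.com
```

Note that under these rules a reference starting with / is resolved against the site root, so /bob_more_info becomes example.com/bob_more_info rather than example.com/bob/bob_more_info.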
