Change relative URLs to absolute URLs after Curl - php

I'm trying to find a regular expression that can change all URLs in a curl'ed document from relative to absolute.
One approach I found is the post here, but it works only for the first URL and not for all of them.
This is the code I'm using:
$url="http://www.example.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_DNS_USE_GLOBAL_CACHE, 0);
curl_setopt($ch, CURLOPT_DNS_CACHE_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result=curl_exec($ch);
curl_close($ch);
$result = preg_replace('~(href|src)=(["\'])(?!#)(?!http://)([^\2]*)\2~i','$1="http://www.example.com$3"', $result);
echo $result;
Where am I going wrong?
EDIT
Just to explain better: I don't have an array of URLs. I have an entire document fetched with curl, so I need a preg_replace approach.

I'm not exactly sure why it replaces only once (maybe it has something to do with the backreference), but when you wrap it in a while loop, it should work.
$pattern = '~(href|src)=(["\'])(?!#|//|http)([^\2]*)\2~i';
while (preg_match($pattern, $result)) {
$result = preg_replace($pattern,'$1="http://www.example.com$3"', $result);
}
(I also changed the pattern slightly.)
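A likely reason only one replacement happens: inside a character class, \2 is not a backreference (PCRE reads it as an octal character code), so [^\2]* greedily runs across quotes up to the last matching quote in the document. The sketch below is one way to avoid the loop by using a lazy (.*?) instead. The function name absolutizeUrls and the $base parameter are hypothetical, and the code assumes every relative path should be resolved against the site root, which is a simplification.

```php
<?php
// Hypothetical helper; $base is assumed to have no trailing slash.
// Assumption: relative paths are resolved against the site root, not the current directory.
function absolutizeUrls(string $html, string $base): string
{
    // Skip fragments (#...), protocol-relative URLs (//...) and anything with a scheme (http:, mailto:, ...).
    // (.*?)\2 lazily matches up to the first closing quote of the same type as the opening one.
    $pattern = '~\b(href|src)=(["\'])(?!#|//|[a-z][a-z0-9+.\-]*:)(.*?)\2~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($base): string {
        $path = '/' . ltrim($m[3], '/');            // ensure exactly one leading slash
        return $m[1] . '=' . $m[2] . $base . $path . $m[2];
    }, $html);

    return $result ?? $html;                        // fall back to the input on a regex error
}

echo absolutizeUrls('<a href="/about">About</a> <img src="logo.png">', 'http://www.example.com'), "\n";
```

Because preg_replace already replaces every match in one pass, no while loop should be needed with this pattern. For anything beyond simple pages, a DOM parser is usually more robust than a regular expression.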

Related

php web spider. how to identify url with hash as same page?

I have a function:
public function getHeaders($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
$x = curl_exec($ch);
curl_close($ch);
return (array) HTTP::parse_header_string($x) ;
}
When $url = 'http://www.google.com', I get the header Location: http://www.google.de/?gfe_rd=cr&ei=SOMEHASHGOESHERE.
If I load it again, I get the same result, except that 'SOMEHASHGOESHERE' is different.
My task is to develop a web crawler. I know how to do the basic logic, but there are a few nuances. One of them is: what should my spider do if the requested URL sends a 'Location' header and tries to redirect? What behavior should my spider follow so that it cannot fall into an infinite redirect loop?
(How can I identify similar URLs like http://www.google.de/?gfe_rd=cr&ei=SOMEHASHGOESHERE, which are typically used in redirect loops, so that my spider knows to ignore such links?)
If you just want to process the target of all redirections, you can get curl to follow URLs without returning the redirection page.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
If you are only interested in the base URL without URL parameters, you can get it easily with explode:
$urlParts = explode("?",$url);
$baseUrl = $urlParts[0];
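As a rough sketch of how the base-URL idea could be used by a spider: normalize each URL by dropping the query string and fragment, and remember which normalized URLs were already seen. The names normalizeUrl and markVisited are hypothetical. Note the trade-off: dropping the query string also treats genuinely different pages (for example ?page=2) as the same URL. Separately, curl's CURLOPT_MAXREDIRS option can cap how many redirects a single request follows.

```php
<?php
// Hypothetical helper: reduce a URL to scheme://host[:port]/path so that
// variants differing only in query string or fragment compare equal.
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL; leave it untouched
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path = $parts['path'] ?? '/';
    return strtolower($parts['scheme']) . '://' . strtolower($parts['host']) . $port . $path;
}

// Hypothetical visited set keyed by normalized URL.
// Returns true the first time a URL is seen, false on repeats.
function markVisited(array &$visited, string $url): bool
{
    $key = normalizeUrl($url);
    if (isset($visited[$key])) {
        return false;
    }
    $visited[$key] = true;
    return true;
}

$visited = [];
var_dump(markVisited($visited, 'http://www.google.de/?gfe_rd=cr&ei=AAA')); // first visit
var_dump(markVisited($visited, 'http://www.google.de/?gfe_rd=cr&ei=BBB')); // same page, different token
```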

Using curl to bring search results from external site

I have 2 sites, one main, one external. On the main site, I am using Lucene to search through it. The problem is, I am trying to also search through the external site.
The Form action for the external site:
<form action="https://secure.bcchf.ca/SuperheroPages/searchResults.cfm?Event=WOT" method="post" name="search_tribute" >
I've tried to use curl, but it only brings up the search form without actually doing the search (the field is empty as well).
<?php
$ch = curl_init("https://secure.bcchf.ca/SuperheroPages/searchResults.cfm?Event=WOT");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, tname='hello');
$output = curl_exec($ch);
echo $output;
curl_close($ch);
?>
Any tips?
I don't have access to the form action since it's on an external site. All I have is a form that links to it when I submit it.
<?php
$ch = curl_init("https://secure.bcchf.ca/SuperheroPages/searchResults.cfm?Event=WOT");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array("teamName" => "hello", "searchType" => "team"));
$output = curl_exec($ch);
echo $output;
curl_close($ch);
?>
Can you try this?
I'm pretty sure it's supposed to be teamName instead of tName
Most search engines use GET and not POST, so you can try:
// assumption
$_POST['search'] = "hello";
// Return Google search result
echo curlGoogle($_POST['search']);
function curlGoogle($keyword) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/search?hl=en&q=' . urlencode($keyword) . '&btnG=Google+Search&meta=');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_FILETIME, true);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Or, if you want to use POST:
curl_setopt($ch, CURLOPT_POSTFIELDS, array("search"=>"hello"));
Your PHP code is not valid syntax; it does not compile.
So if this is really what you have, your problem is that your file generates a fatal error.
That being said, this question is hard to answer since we don't know the site you want to grab your search results from.
Try modifying your line like this:
curl_setopt($ch, CURLOPT_POSTFIELDS, "search=hello");
or alternatively
curl_setopt($ch, CURLOPT_POSTFIELDS, array("search" => "hello"));
Maybe it will work; however, more POST data may be required, or the element name may not be correct.
You have to look at the form, or try making a request and inspect it with Chrome's developer tools or Firebug.
Also, there are a number of ways for external sites to prevent what you are doing, although everything can be worked around somehow.
Assuming that is not the case, I hope I could help you.
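The two forms above are not quite equivalent: when CURLOPT_POSTFIELDS is given a string, curl sends it as application/x-www-form-urlencoded, whereas an array is sent as multipart/form-data. If the target form expects the former (as ordinary HTML forms do by default), encoding the fields into a string may be the safer choice. A minimal sketch, where buildPostBody and postForm are hypothetical helper names and the field names teamName and searchType are taken from this thread:

```php
<?php
// Hypothetical helper: encode fields as an application/x-www-form-urlencoded body.
function buildPostBody(array $fields): string
{
    return http_build_query($fields, '', '&');
}

// Hypothetical wrapper; requires the curl extension.
// Returns the response body, or null if the request failed.
function postForm(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildPostBody($fields)); // string => urlencoded body
    $output = curl_exec($ch);
    return $output === false ? null : $output;
}

// Show the body that would be sent, without making a request.
echo buildPostBody(['teamName' => 'hello', 'searchType' => 'team']), "\n";
```

Calling postForm() with the form's action URL and the same array would then submit the search. Whether the site accepts it still depends on the real field names and any extra hidden fields in the form.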
Try just putting it into an array, as that will be the variable that $_POST checks on the other side.
I just checked your link, and the field name is teamName:
$fields = array("teamName"=>"julia");
Then:
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
So your complete code is...
<?php
$ch = curl_init("https://secure.bcchf.ca/SuperheroPages/searchResults.cfm?Event=WOT");
$fields = array("teamName"=>"julia");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
$output = curl_exec($ch);
var_dump($output);
curl_close($ch);
?>

CURL not working when used inside a function

I'm trying to execute curl using the following code.
mainFunction{
.
.
$url = strtolower($request->get('url', NULL));
$html_output= $this->startURLCheck($url);
.
.
}
function startURLCheck($url)
{
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$html_output = curl_exec($ch);
}
When I give the URL string directly, this works fine. But when I pass the string through a function, curl does not execute. curl_error shows no errors either. I tried many encoding and decoding methods for the string, with the same result. Am I doing something wrong? I am working with an XAMPP server on Windows.
I'm passing the URL to this function after getting it from an HTML POST request in another function.
The problem is that your function startURLCheck does not actually return a value for the main program to use. Change the last line:
function startURLCheck($url)
{
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
return curl_exec($ch);
}
In your calling code, take out the "$this->"
$html_output = startURLCheck($url);
$html_output now contains results of the curl call.
I have assumed that you copied and pasted this code from somewhere since your "mainFunction" declaration is syntactically incorrect, and you used "$this->" without specifying that startURLCheck was a method of an object.
If in fact you intend startURLCheck to be an object method and you want it to set $html_output on the object, do this:
<?php
class Example {
private $html_output;
function mainFunction()
{
$url='http://www.ebay.com/itm/Apple-iPhone-5-16GB-Black-Slate-Cricket-intl-UNLOCKED-pleeze-read-ad-/251252227033';
$this->startURLCheck($url);
echo "HTML output: " . $this->html_output;
}
function startURLCheck($url)
{
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$this->html_output = curl_exec($ch);
}
}
$example = new Example();
$example->mainFunction();
I have tested this code on the command line (not in a web page). If you copy and paste this into a file and run it with php -f, you will see the results. (Note that I didn't include a closing ?> tag. The closing tag is optional when the file contains only PHP code and no HTML. In fact it is recommended that the closing tag be omitted in such cases. See http://php.net/manual/en/language.basic-syntax.instruction-separation.php)
Please also note in your question code for mainFunction you have illegal spaces before "pleeze" in your URL and you are missing the semicolon at the end of the $url assignment.
Hope this helps. Good luck.
This works fine.
<?php
function excURL()
{
$ch = curl_init();
$url = "http://www.google.com";
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$html_output = curl_exec($ch);
echo $html_output;
}
excURL();
?>
Hey guys, I have finally found the problem.
When I set CURLOPT_FOLLOWLOCATION for curl, this works fine:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
But it is still not clear why it worked when I hardcoded the URL inside the function and did not work when I passed the URL as a variable into the function, without setting CURLOPT_FOLLOWLOCATION. With this option set, it works both ways.
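One way to spot this kind of problem earlier might be to inspect the array returned by curl_getinfo($ch) after curl_exec. A 3xx response with CURLOPT_FOLLOWLOCATION off typically has little or no body and produces no curl error, which would match the symptoms described. The helper name describeCurlResult below is hypothetical; http_code and redirect_url are keys of the curl_getinfo array.

```php
<?php
// Hypothetical helper: summarize the array returned by curl_getinfo($ch).
function describeCurlResult(array $info): string
{
    $code = (int) ($info['http_code'] ?? 0);

    if ($code >= 300 && $code < 400) {
        // The server answered with a redirect that curl did not follow.
        $target = (string) ($info['redirect_url'] ?? '');
        return "redirect $code to " . ($target !== '' ? $target : '(unknown)');
    }
    if ($code >= 200 && $code < 300) {
        return "ok $code";
    }
    return "failed with status $code";
}

// Typical use after a request (not executed here):
//   $html = curl_exec($ch);
//   echo describeCurlResult(curl_getinfo($ch));
echo describeCurlResult(['http_code' => 301, 'redirect_url' => 'http://www.example.com/new']), "\n";
```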

How to get the URL of a download link

I am trying to parse a page which contains some links. These links, if followed, will redirect to some files to download.
For example, a Download link which redirects to http://example.com/1.pdf.
I don't want to download the file, I just want to get the file link (in this case http://example.com/1.pdf).
I am trying this:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, FALSE); // Return in string
curl_setopt($ch, CURLOPT_URL, $url);
curl_exec($ch);
var_dump(curl_getinfo($ch));
But, it gives me the file contents.
Does anyone have any idea how to do this?
==EDIT==
Thank you guys. I solved it like this:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLINFO_HEADER_OUT, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);
curl_exec($ch);
$info = curl_getinfo($ch);
Now, $info contains the header and I can get the link from it.
The reason the output is being sent to the screen is because you're telling cURL to do so. If you want to store the response in a variable the following line:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, FALSE);
should read:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
Then, actually retrieve the returned output from curl_exec like so:
$output = curl_exec($ch);
Once you have the returned HTML content from the remote page in the $output variable, you can use DOMDocument or regex (but preferably DOM) to parse out any information you want.
UPDATE
I can't tell because the question is vaguely worded: is there actually a Location header redirect happening? If so, you'll want to do as @heiko suggests to prevent cURL from following the redirect and retrieve the headers. Then you can easily parse the contents of the Location header:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
curl_setopt($ch, CURLOPT_HEADER, TRUE); // add header output
# make sure to not follow Location: Header
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
# add Response Header to Output, so that you can find the Location-Header in there!
curl_setopt($ch, CURLOPT_HEADER, TRUE);
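Assuming the raw response headers were captured as a string (for example with CURLOPT_HEADER and CURLOPT_RETURNTRANSFER both enabled), pulling out the Location value could look roughly like this. The function name extractLocation is hypothetical, and the sample header block is made up for illustration.

```php
<?php
// Hypothetical helper: return the value of the first Location header, or null if absent.
function extractLocation(string $rawHeaders): ?string
{
    // m: ^ and $ match at line boundaries; i: header names are case-insensitive.
    // The trailing [ \t\r]* drops the carriage return of CRLF line endings.
    if (preg_match('/^Location:[ \t]*(.*?)[ \t\r]*$/mi', $rawHeaders, $m) === 1 && $m[1] !== '') {
        return $m[1];
    }
    return null;
}

// Illustrative header block, not captured from a real server.
$headers = "HTTP/1.1 302 Found\r\nLocation: http://example.com/1.pdf\r\nContent-Length: 0\r\n\r\n";
var_dump(extractLocation($headers));
```

A relative Location value would still need to be resolved against the request URL before use.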
Set CURLOPT_RETURNTRANSFER to 1, and use htmlentities() if you want to display the HTML source on your page; otherwise just echo the variable (to display the page [redirects to google]).
<?php
$url = "http://www.google.co.in";
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return in string
curl_setopt($ch, CURLOPT_URL, $url);
$varx = curl_exec($ch);
echo htmlentities($varx);
?>
With the $varx variable, use regular expressions to match the data you want.

How to resolve url's to final destination in php

How do I resolve URLs like the one below:
http://www.google.co.in/url?sa=t&source=newssearch&cd=1&ved=0CC4QqQIwAA&url=http%3A%2F%2Fwww.usatoday.com%2Fnews%2Fworld%2Fstory%2F2011-09-18%2Findia-earthquake-fatalities%2F50456078%2F1&ei=JkF2TriYPImGrAeHxdCFDQ&usg=AFQjCNEshh4QAZQlM_tVPoT_l7rJ0ag21Q
to its final URL
http://www.usatoday.com/news/world/story/2011-09-18/india-earthquake-fatalities/50456078/1
I've tried curl but it's resolving it to http://www.google.co.in/http
http://sandbox.phpcode.eu/g/fc7c1/1
$ch = curl_init('http://www.google.co.in/url?sa=t&source=newssearch&cd=1&ved=0CC4QqQIwAA&url=http%3A%2F%2Fwww.usatoday.com%2Fnews%2Fworld%2Fstory%2F2011-09-18%2Findia-earthquake-fatalities%2F50456078%2F1&ei=JkF2TriYPImGrAeHxdCFDQ&usg=AFQjCNEshh4QAZQlM_tVPoT_l7rJ0ag21Q');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
$response = curl_exec($ch);
$info = curl_getinfo($ch);
echo $info['url'];
All you are after is the value of the url parameter. You can preg_split the initial URL by /[&?]/, then take the element starting with url=, split it on the = sign, and use urldecode on the final value.
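An alternative to splitting by hand is to let PHP's built-in parse_url and parse_str do the work; parse_str also URL-decodes the values. This is only a sketch: the function name extractTargetUrl is hypothetical, and it assumes the target is always carried in a query parameter named url, as in the example from the question.

```php
<?php
// Hypothetical helper: return the decoded value of the "url" query parameter, or null.
function extractTargetUrl(string $redirectUrl): ?string
{
    $query = parse_url($redirectUrl, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string at all
    }
    parse_str($query, $params); // decodes percent-encoding for us
    return isset($params['url']) && is_string($params['url']) ? $params['url'] : null;
}

$link = 'http://www.google.co.in/url?sa=t&source=newssearch&cd=1&ved=0CC4QqQIwAA&url=http%3A%2F%2Fwww.usatoday.com%2Fnews%2Fworld%2Fstory%2F2011-09-18%2Findia-earthquake-fatalities%2F50456078%2F1&ei=JkF2TriYPImGrAeHxdCFDQ&usg=AFQjCNEshh4QAZQlM_tVPoT_l7rJ0ag21Q';
echo extractTargetUrl($link), "\n";
```

This reads the target straight from the link without making any HTTP request, so it will not notice if the target itself redirects further.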
