I have tried many methods to get the contents of a URL where the hash affects the output. I'm really not sure how to explain it, but here's an example...
Doing:
echo file_get_contents('http://test.com/whatever.php?t1=1&t2=2#this');
Will return the same results as:
echo file_get_contents('http://test.com/whatever.php?t1=1&t2=2');
Even though, if I navigate to them in my web browser, the hash makes a difference. Of course the URLs above aren't the actual ones I am using, but I hope you get the point.
Things I have tried:
CURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // the URL to fetch
curl_setopt($ch, CURLOPT_HEADER, 0); // don't include response headers in the output
curl_exec($ch);                      // prints the response body directly
curl_close($ch);
file_get_contents:
file_get_contents($url);
fread:
//don't know where I put the code.
Those are all the things I have tried, and I don't really know where to go next besides here. I'm really not sure if this is even possible, but I hope it is.
Thanks for any help, sh042067.
Hashes are not sent to the server; the hash part is only processed by the browser. You're probably seeing some kind of AJAX retrieval in action, so you'll need to find out the actual URL that is being called and use that instead. You could use Firebug for this.
Found the reference: http://en.wikipedia.org/wiki/Fragment_identifier
Check these too
Why the hash part of the URL is not in the server side?
Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
Your response will be the same, since hashes are processed on the client side, not on the server side.
Quoted from Wikipedia:
The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the server — of course the server typically helps to determine the MIME type, and the MIME type determines the processing of fragments. When an agent (such as a Web browser) requests a resource from a Web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.
The hash part (called the 'fragment') is never sent to a server. There's no valid way in the HTTP protocol to do it. The fragment is only for navigation within the page, but the browser requests and receives the same data regardless.
Some sites manage the fragment with JavaScript to issue asynchronous requests to load new data and alter the page dynamically, which blurs the line somewhat.
In your case you'll have to find the HTML element on the page with id="this", because that's where the fragment points and where the browser would scroll to. Or, if the fragment looks like query variables, you'll have to find the real, non-JavaScript URL of what you're trying to request.
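You can see this from PHP itself: parse_url() will happily split out the fragment, but nothing at the HTTP level ever transmits it. A minimal sketch, using the hypothetical URL from the question:

$url = 'http://test.com/whatever.php?t1=1&t2=2#this';

// parse_url() knows about the fragment...
$parts = parse_url($url);
print_r($parts);
// [scheme] => http, [host] => test.com, [path] => /whatever.php,
// [query] => t1=1&t2=2, [fragment] => this

// ...but what actually goes over the wire is the URL without it,
// which is why both requests in the question return the same thing.
$sent = $parts['scheme'] . '://' . $parts['host'] . $parts['path'] . '?' . $parts['query'];
echo $sent; // http://test.com/whatever.php?t1=1&t2=2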
Related
I need to store XML data sent over HTTP POST to my server. In the log files I see that the data is successfully sent to my server. But I have no idea how to get the data.
I tried to catch them with the php://input stream, as in the code below. The problem I see is that php://input is only read when the file containing the code is called.
$xml = file_get_contents("php://input");          // raw request body
$var_str = var_export($xml, true);                // readable string representation
file_put_contents('api-test/test.txt', $var_str); // note: overwrites on every request
Is there any way to set some kind of listener/watcher to the php://input stream? Maybe PHP is the wrong technology to realize this. Is there some other way like AJAX?
The problem I see is that php://input is just read when the file containing the code is called.
Yes.
That's how PHP (in a server-side programming context) works.
The client makes an HTTP request to a URL
The server receives the HTTP request and determines that that URL is handled by a particular PHP program (typically by matching the path component of the URL to a directory and file name unless the Front Controller Pattern is being used)
The PHP program is executed and has access to data from the request
The server sends the output of the PHP program back
Is there any way to set some kind of listener/watcher to the php://input stream?
You get a new stream every time a request is made. So the typical way to watch it is to put a PHP script at the URL that the request is being made to.
Then make sure each request is made to the same URL.
(If you need to support requests being made to different URLs, then look into the Front Controller Pattern).
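As a rough sketch of that, assuming every sender POSTs to the same URL, the script sitting at that URL can append each request body to a log as it arrives (the file names are just examples):

// receiver.php -- executed once per incoming request.
// Each request gets a fresh php://input stream, so reading it here
// effectively "watches" every request made to this URL.
$xml = file_get_contents('php://input');

if ($xml !== '') {
    // Append rather than overwrite, so earlier requests are kept.
    $entry = date('c') . "\n" . $xml . "\n---\n";
    file_put_contents('api-test/test.txt', $entry, FILE_APPEND | LOCK_EX);
}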
Maybe PHP is the wrong technology to realize this.
It's a perfectly acceptable technology for handling HTTP requests.
Is there some other way like AJAX?
Ajax is a buzzword meaning "Make an HTTP request with JavaScript". Since you are receiving the requests and not making them, Ajax isn't helpful.
I am concerned about the safety of fetching content from an unknown URL in PHP.
We will basically use cURL to fetch HTML content from a user-provided URL and look for Open Graph meta tags, to show the links as content cards.
Because the url is provided by the user, I am worried about the possibility of getting malicious code in the process.
I have another question: does curl_exec actually download the full file to the server? If yes then is it possible that viruses or malware be downloaded when using curl?
Using cURL is similar to using fopen() and fread() to fetch content from a file.
Safe or not, depends on what you're doing with the fetched content.
From your description, your server works as some kind of intermediary that extracts specific subcontent from a fetched HTML content.
Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server.
Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say), everything else that is not what you're looking for in the fetched content is ignored, which means your users are automatically protected.
Thus, in my opinion, there is no need to worry.
Of course, this relies on the assumption that the content extraction process is sound.
Someone should take a look at it and confirm it.
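For illustration, a sound extraction process along these lines would use a real HTML parser and treat the response purely as data; a minimal sketch with DOMDocument, assuming $html holds the fetched markup:

// The fetched markup is only ever treated as data here, never executed.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate broken real-world HTML
$doc->loadHTML($html);

// Pull out only the Open Graph meta tags; everything else is ignored.
$xpath = new DOMXPath($doc);
$og = array();
foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
    $og[$meta->getAttribute('property')] = $meta->getAttribute('content');
}

print_r($og); // e.g. [og:title] => ..., [og:image] => ...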
does curl_exec actually download the full file to the server?
It depends on what you mean by "full file".
If you mean "the entire HTML content", then yes.
If you mean "including all the CSS and JS files that the feched HTML content may refer to", then no.
is it possible that viruses or malware be downloaded when using curl?
The answer is yes.
The fetched HTML content may contain malicious code, however, if you don't execute it, no harm will come to you.
Again, I'm assuming that your content extraction process is sound.
Short answer: file_get_contents is safe for retrieving data, and so is cURL. It is up to you what you do with that data.
A few guidelines:
1. Never run eval on that data.
2. Don't save it to a database without filtering.
3. You don't even need file_get_contents or cURL here.
Use: get_meta_tags
array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');
You will have all the meta tags parsed and filtered into an array. (One caveat: get_meta_tags() only reads <meta> tags that have a name attribute, so Open Graph tags, which use property, may not show up.)
You can use httpclient.class instead of file_get_contents or cURL, because it connects to the page through a socket. After downloading the data you can extract the meta data using preg_match.
Expanding on the answer made by Ray Radin.
Tips on precautionary measures
He is correct that if you use a sound process to search the fetched resource, there should be no problem in fetching whatever URL is provided. Some precautions are:
Don't store the file in a public-facing directory on your webserver; that would expose you to it being executed.
Don't store it in a database without sanitization; this might lead to a second-order SQL injection attack.
In general, don't store anything from the resource you are requesting; if you have to, use a specific whitelist of what you are searching for.
Check the header information
Even though there is no foolproof way of validating what a specific URL points to, there are ways to make your life easier and prevent some potential issues.
For example a url might point to a large binary, large image file or something similar.
Make a HEAD request first to get the header information, then look at the Content-Type and Content-Length headers to see whether the content is a plain-text HTML file.
You should not trust these completely, however, since they can be spoofed; but doing this will make sure that even non-malicious content won't crash your script. Requesting large image files is presumably something you don't want users to do.
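A minimal cURL sketch of that pre-check (the one-megabyte limit is an arbitrary example):

// HEAD request only: CURLOPT_NOBODY means no body is transferred.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

$type   = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if not reported
curl_close($ch);

// These headers can be spoofed, so treat this as a sanity check,
// not a guarantee.
if (strpos((string) $type, 'text/html') !== 0 || $length > 1048576) {
    exit('Refusing to fetch: not HTML, or too large');
}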
Guzzle
I recommend using Guzzle for your requests since, in my opinion, it provides some functionality that should make this easier.
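For example, here is the same HEAD pre-check as a sketch in Guzzle (assuming it is installed via Composer; the size limit is again an arbitrary example):

require 'vendor/autoload.php';

$client = new GuzzleHttp\Client(array('timeout' => 5));

// HEAD request: headers only, no body is downloaded.
$response = $client->head($url);
$type     = $response->getHeaderLine('Content-Type');
$length   = (int) $response->getHeaderLine('Content-Length');

if (strpos($type, 'text/html') === 0 && $length < 1048576) {
    // Looks sane; now fetch the actual page body.
    $html = (string) $client->get($url)->getBody();
}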
It is safe but you will need to do a proper data check before using it. As you should with any data input anyway.
I'm performing a cURL POST with PHP and trying to reduce the amount of bandwidth I am using. I don't need anything back from the remote site I am posting to; since I control the remote site, all my tracking to make sure the POST was successful is done on the receiving end.
My questions is...
When you set CURLOPT_NOBODY to TRUE:
Does it still download the body and simply not return it to you?
OR
Does it ignore the body and not download it at all?
From the PHP manual on curl_setopt (emphasis mine):
CURLOPT_NOBODY: TRUE to exclude the body from the output. Request method is then set to HEAD. Changing this to FALSE does not change it to GET.
So, the answer is no: it won't download the body, because the request becomes an HTTP HEAD request:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
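In code, the difference is just one option; a minimal sketch against a placeholder URL:

$ch = curl_init('http://example.org/endpoint');  // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);          // request method becomes HEAD
curl_setopt($ch, CURLOPT_HEADER, true);          // include response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return instead of printing
$headers = curl_exec($ch);                       // headers only; no body is transferred
curl_close($ch);

echo $headers;

Note that, per the manual quote above, this also changes the request method to HEAD, so if your remote script needs to receive an actual POST, this option alone isn't what you want; it saves bandwidth precisely by telling the server to skip the body.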
I now want to know what happens behind the scenes of an HTTP POST method.
I.e. the browser sends an HTTP POST request to a server-side script in PHP (for example).
How does PHP's $_POST variable get the values from the client?
Could someone explain in detail or point to a guide?
The HTTP protocol(*) specifies how the browser should send the request.
HTTP basically consists of a set of headers in plain text, separated by line feeds, followed by the data being transmitted. Inside the HTTP request, POST data is actually formatted pretty much the same as GET data; it's just carried in the body of the request rather than in the URL.
You can use tools like Firebug or Fiddler to see exactly how the headers and data are formatted for incoming and outgoing HTTP requests. It's actually all quite simple to read, so you should be able to work it out just by looking.
Once it gets to the server, the PHP interpreter is responsible for translating the raw HTTP request data into its standard $_GET, $_POST, etc variables. This is something that PHP does for you.
Other languages (eg Perl) do not have this functionality built in, so a Perl programmer would have to have code in their program to parse the incoming request data into useful variables. Fortunately, even Perl has a standard library which can be included that does the job, so even Perl programmers don't generally have to write the code themselves any more.
The way PHP, and any other language, does it is simply string manipulation. As I said, the HTTP data is plain text and is received in simple string format, so it's just a case of breaking it down by splitting it on ampersand and equals-sign characters.
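PHP exposes essentially that same string manipulation through parse_str(), so you can watch it happen on a raw urlencoded body:

// A raw urlencoded body, exactly as it arrives in the request.
$raw = 'name=John+Doe&age=42&tags%5B%5D=a&tags%5B%5D=b';

// parse_str() splits on '&' and '=', urldecodes each piece, and
// builds the same kind of structure PHP puts into $_POST.
parse_str($raw, $result);
print_r($result);
// [name] => John Doe, [age] => 42, [tags] => Array ( [0] => a, [1] => b )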
As PHP does it all behind the scenes, you probably don't need to worry about the exact mechanisms it uses, but the PHP source code is available if you really want to find out.
I said it's all in plain text. HTTPS, of course, is encrypted. However by the time PHP gets hold of it, the Apache server has already done the decryption, so as far as PHP is concerned it's still plain text.
(*) Before anyone pulls me up on it, yes, I know that saying "HTTP protocol" is a redundancy, like "ATM machine" or "PIN number".
The browser encodes the data according to the content-type of the form, then transmits it as the body of a POST request. PHP then picks it up and populates $_POST with the names and values (performing special handling when the name includes the characters [ and ] or .).
I'd suggest to get a capturing proxy (e.g. Fiddler) or a network capture tool (e.g. Wireshark) and watch your own browsing traffic for a while; it will give you a nice view of the issue.
Other than that, POST is rather similar to GET, except that the data is sent in the body of the request instead of the URL, and there are two ways to encode it (multipart/form-data in addition to the application/x-www-form-urlencoded encoding that's shared with GET).
Well, let's illustrate step by step, starting with a page containing a <form action="foo.php" method="post">.
Once you click submit (or hit Enter), the browser triggers an event named "submit". This event can be caught internally for processing with JavaScript/DOM, and this is what most sites do for validation or Ajax routines.
If those routines do not stop the flow with a return false, the browser continues to process the POST request (this process is the same as making a POST with the XMLHttpRequest object).
The browser first checks the method, action, and content encoding, then parses the input values to determine the size of the data it will send, and encodes it.
Finally it sends something like this (raw values):
POST /foo.php HTTP/1.1
Host: example.org
Content-Type: application/x-www-form-urlencoded
Content-Length: 7
foo=bar
This is a POST request. But note that the browser can also send the variables in chunks; browser and server both know this can happen (this is part of the POST method's purpose). When a server receives a POST request, it keeps listening to the browser until the content received matches the declared Content-Length.
Now the other side. The server receives the request, reads the content, parses it (foo = bar; xxx = baz), and makes it available in its environment for that specific request, so you can catch it with PHP, Python, Java...
That's it. Oh, and note that you can pass both GET and POST variables in the same request!
Using a <form action="foo.php?someVar=123&anotherVar=TRUE" method="post">
will make the browser send the request as:
POST /foo.php?someVar=123&anotherVar=TRUE HTTP/1.1
Host: example.org
Content-Type: application/x-www-form-urlencoded
Content-Length: 7
foo=bar
And the server, when parsing this request, will make the following variables available:
GET[someVar] = 123
GET[anotherVar] = TRUE
POST[foo] = bar
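On the PHP side, foo.php from this example would see them like this:

// foo.php -- the server-side counterpart of the request above.
echo $_GET['someVar'];    // 123
echo $_GET['anotherVar']; // TRUE
echo $_POST['foo'];       // bar

// $_REQUEST merges both (order controlled by request_order in php.ini).
print_r($_REQUEST);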
I ran into this problem when scraping sites that make heavy use of JavaScript to obfuscate their data.
For example,
"a href="javascript:void(0)" onClick="grabData(23)"> VIEW DETAILS
This href attribute reveals no information about the actual URL; you'd have to manually examine the grabData() JavaScript function to get a clue.
OR
The old-school way is manually opening up the Live HTTP Headers add-on for Firefox and monitoring the POST parameters, which reveals the actual URL being POSTed to.
So I'm wondering: is there a way to capture the POST parameters in a server-side script or in JavaScript, as Live HTTP Headers does, for the outgoing and incoming requests? This would make even the most JavaScript-obfuscated web pages easily scrapable.
thanks.
I'm not sure I understand the question but...
In PHP, incoming POST parameters are stored in the $_POST array; you can display them with print_r($_POST);.
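So, as a minimal sketch in the spirit of what Live HTTP Headers shows you, a server-side script you control could simply log every request it receives (the log file name is an arbitrary example):

// capture.php -- point the form/AJAX call at this script (or proxy
// through it) and every incoming POST is appended to a log file.
$entry = date('c') . ' ' . $_SERVER['REQUEST_METHOD'] . ' '
       . $_SERVER['REQUEST_URI'] . "\n"
       . print_r($_POST, true) . "\n";

file_put_contents('post-log.txt', $entry, FILE_APPEND | LOCK_EX);

Note this only captures requests that actually reach your own server; you can't observe another site's POST traffic server-side, which is exactly why browser-side tools like Live HTTP Headers exist.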