Sometimes when using multiple concurrent connections and scraping with cURL in my PHP script, incomplete webpages are returned. Is there some value in curl_getinfo() that will let me know if a webpage was 100% fetched vs. only 90% fetched?
Would the content-size header of a returned page be the actual size of what was returned, or would it reflect the entire page? If so, I could check the content-size against the actual size of the response.
Thanks!
Assuming your question is whether that header comes from the other side or is calculated on your side: it is generated on the other side from the content the server actually intends to send, so yes, you can use it to check whether you've received the full response. A few things, though:
It's Content-Length, not Content-Size;
you can use it only as long as you trust the other party to set it correctly;
it may not be present at all, because while it SHOULD be sent, it is not strictly required.
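A minimal sketch of that comparison with curl_getinfo(), assuming the server's Content-Length can be trusted (the URL is a placeholder):

// Sketch: compare the advertised Content-Length with the bytes actually received.
$ch = curl_init('http://www.example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
$expected = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if the header was absent
$received = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);           // bytes actually transferred
curl_close($ch);
if ($expected >= 0 && $received < $expected) {
    // The transfer stopped short: retry, or flag this page as incomplete.
}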
I am concerned about the safety of fetching content from an unknown URL in PHP.
We will basically use cURL to fetch HTML content from a user-provided URL and look for Open Graph meta tags, to show the links as content cards.
Because the URL is provided by the user, I am worried about the possibility of getting malicious code in the process.
I have another question: does curl_exec actually download the full file to the server? If so, could viruses or malware be downloaded when using cURL?
Using cURL is similar to using fopen() and fread() to fetch content from a file.
Whether it is safe or not depends on what you're doing with the fetched content.
From your description, your server works as a kind of intermediary that extracts specific subcontent from fetched HTML content. Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server. Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say), everything else in the fetched content is ignored, which means your users are automatically protected. Thus, in my opinion, there is no need to worry. Of course, this relies on the assumption that the content extraction process is sound; someone should take a look at it and confirm it.
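For reference, a minimal sketch of such an extraction step, using DOMDocument so nothing from the page is ever executed (the URL is a placeholder):

// Sketch: pull Open Graph meta tags out of fetched HTML without executing it.
$ch = curl_init('http://www.example.com/'); // placeholder for the user-provided URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from real-world malformed HTML
$og = [];
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strpos($meta->getAttribute('property'), 'og:') === 0) {
        $og[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
}
// Escape before output so extracted values cannot inject markup into your cards.
echo htmlspecialchars($og['og:title'] ?? '', ENT_QUOTES);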
does curl_exec actually download the full file to the server?
It depends on what you mean by "full file".
If you mean "the entire HTML content", then yes.
If you mean "including all the CSS and JS files that the feched HTML content may refer to", then no.
could viruses or malware be downloaded when using cURL?
The answer is yes.
The fetched HTML content may contain malicious code; however, if you don't execute it, no harm will come to you.
Again, I'm assuming that your content extraction process is sound.
Short answer: file_get_contents is safe for retrieving data, and so is cURL. It is up to you what you do with that data.
A few guidelines:
1. Never run eval() on that data.
2. Don't save it to the database without filtering.
3. For this task, don't even use file_get_contents or cURL.
Instead, use get_meta_tags():
array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');
You will get all the meta tags parsed and filtered into an array.
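For example, printing one of the parsed tags; escaping the value before output is still advisable, since it comes from an untrusted page:

// $tags is keyed by lowercased meta tag name, e.g. 'description', 'keywords'.
if (isset($tags['description'])) {
    echo htmlspecialchars($tags['description'], ENT_QUOTES);
}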
You can use httpclient.class instead of file_get_contents or cURL, because it connects to the page through a socket. After downloading the data you can extract the meta data using preg_match().
Expanding on the answer made by Ray Radin.
Tips on precautionary measures
He is correct that if you use a sound process to search the fetched resource, there should be no problem in fetching whatever URL is provided. Some precautions you can take:
Don't store the file in a public-facing directory on your webserver, where it could end up being executed.
Don't store it in a database without filtering; this might lead to a second-order SQL injection attack.
In general, don't store anything from the resource you are requesting; if you have to, use a specific whitelist of what you are searching for.
Check the header information
Even though there is no foolproof way of validating what a given URL points to, there are ways to make your life easier and prevent some potential issues.
For example, a URL might point to a large binary, a large image file, or something similar.
Make a HEAD request first to get the header information. Then look at the Content-Type and Content-Length headers to see if the content is a plain-text HTML file, as sketched below.
You should not blindly trust these headers, since they can be spoofed. Checking them will, however, make sure that even non-malicious content won't crash your script; requesting image files is presumably something you don't want users to do.
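A minimal sketch of that pre-check with cURL (the URL and the 1 MB cap are placeholders):

// Sketch: HEAD request to inspect the headers before fetching the body.
$ch = curl_init('http://www.example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);         // issue a HEAD request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$type   = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);            // e.g. "text/html; charset=utf-8"
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if not sent
curl_close($ch);
// Only proceed to a full GET if the headers look like a reasonably sized HTML page.
if (is_string($type) && strpos($type, 'text/html') === 0 && $length >= 0 && $length < 1048576) {
    // ... do the real GET here
}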
Guzzle
I recommend using Guzzle to do your requests since, in my opinion, it provides some functionality that should make this easier.
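For instance, the same HEAD pre-check with Guzzle might look roughly like this (assuming Guzzle is installed via Composer; the URL is a placeholder):

require 'vendor/autoload.php';
// Sketch: HEAD request via Guzzle with a timeout, then inspect the headers.
$client = new GuzzleHttp\Client(['timeout' => 5]);
$response = $client->head('http://www.example.com/'); // placeholder URL
if (strpos($response->getHeaderLine('Content-Type'), 'text/html') === 0) {
    // safe-ish to follow up with a GET for the actual content
}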
It is safe, but you will need to do a proper data check before using it, as you should with any data input anyway.
I have a simple cURL call that retrieves an HTML page from the server, then a preg_replace() that inserts something into the page, and then the result of that is echoed back to the browser.
What I noticed is that if the HTTP server that cURL is fetching the HTML page from uses the header 'Transfer-Encoding: chunked', the HTML output will be somehow encoded (I noticed a few strange signs); the preg_replace() call will do its job, but the browser will just get ERR_INVALID_CHUNKED_ENCODING and won't load the page. There must be a way to replace part of the page without messing up the chunked encoding?
Chunked transfer-encoding is an HTTP 1.1 feature used when the server doesn't know the size of the resource when it starts to send the data, so it sends a series of "chunks" to the client, each chunk preceded by its size (in bytes, written in hexadecimal).
Alas, if you insert data into a chunk, you must also change the size of the chunk when you send it to the browser. Alternatively, of course, you can get the full thing, do your replacement, and send out the entire response as one single chunk (or even without chunks).
A proper HTTP 1.1 client should be able to decode the chunks and a proper HTTP 1.1 server should send a legitimate series of chunks (a somewhat common server-side error is to leave out the final zero-sized chunk).
See here for the spec: https://www.rfc-editor.org/rfc/rfc7230#section-4.1
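A minimal sketch of that second approach, letting cURL decode the chunks and then returning the whole edited body so PHP and the web server handle the framing themselves (the URL and the replacement are placeholders):

// Sketch: cURL de-chunks the response; edit the full body, then echo it
// without forwarding the upstream Transfer-Encoding header.
$ch = curl_init('http://www.example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // body comes back already decoded
curl_setopt($ch, CURLOPT_HEADER, false);        // keep upstream headers out of the body
$html = curl_exec($ch);
curl_close($ch);
$html = preg_replace('/<\/body>/i', '<!-- injected --></body>', $html); // placeholder edit
header('Content-Type: text/html; charset=utf-8');
echo $html; // PHP/Apache frame this response on their own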
I'm performing a cURL POST with PHP and trying to reduce the amount of bandwidth I am using. I don't need anything back from the remote site I am posting to; since I control the remote site, all my tracking to make sure the post was successful is done on the receiving end.
My question is...
When you set CURLOPT_NOBODY to TRUE:
Does it still download the body and simply not return it to you?
OR
Does it ignore the body and not download it at all?
From the PHP manual on curl_setopt (emphasis mine):
CURLOPT_NOBODY: TRUE to exclude the body from the output. Request method is then set to HEAD. Changing this to FALSE does not change it to GET.
So the answer is no: it won't download the body at all, because the request becomes an HTTP HEAD request:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
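A minimal sketch of the effect (placeholder URL); note that because the method becomes HEAD, this is not suitable if the remote script still needs to receive your POST data:

// Sketch: with CURLOPT_NOBODY set, cURL issues a HEAD request and skips the body.
$ch = curl_init('http://www.example.com/receiver.php'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
var_dump(curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD)); // 0 bytes of body transferred
curl_close($ch);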
How, if at all, can I strip all the headers from a PHP response through Apache, in order to just stream the text response? I've tried adding a custom .htaccess, but to no avail. I have limited control of the hosting server. The stream is read by an embedded device which doesn't need any headers.
It gets to a point where certain headers are NEEDED by the browser so it can interpret and render the output. If the reason you want to remove the headers is a chat-like feature, think about using a persistent keep-alive connection.
Tips on reducing bandwidth
Use AJAX: keep the response from PHP in JSON format and update DOM elements.
Use gzip compression.
Just don't worry about headers -- typically an HTTP OK response will only take up < 200 bytes, hardly anything in comparison to the actual page content. Focus on where it really matters.
Edit:
To suit your case, look into using sockets (UDP would be a good option if you want to cut back on a lot of bandwidth): socket_listen() for non-UDP, or socket_bind(), which is capable of UDP.
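A minimal sketch of a headerless UDP receiver with PHP's socket extension (address and port are placeholders):

// Sketch: receive raw UDP datagrams -- no HTTP, so no headers at all.
$sock = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
socket_bind($sock, '0.0.0.0', 9999); // placeholder address/port
while (true) {
    // Block until a datagram arrives; $buf holds only the raw payload.
    socket_recvfrom($sock, $buf, 1024, 0, $fromIp, $fromPort);
    echo "received from $fromIp:$fromPort: $buf\n";
}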
That's impossible.
You are using the HTTP protocol, and an HTTP response always contains headers.
Either do not use HTTP, or teach your device to strip the headers; it's not that hard.
Anyway, PHP has very little to do with removing headers: there is also a web server in between, which actually interacts with your device and is taught to send the proper headers.
There is a PHP function called header_remove(). I have never used it before, but you can try whether it works for you. Note that this function has been available since PHP 5.3.0.
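A minimal sketch of that approach; note it can only remove headers PHP itself would add, not ones Apache appends on its own, and the HTTP status line itself always remains:

// Sketch: drop the headers PHP sets by default before streaming the text.
header_remove('X-Powered-By');
header_remove('Content-Type');
// Apache may still add headers of its own (Date, Server, ...).
echo "sensor-data\n"; // placeholder for the raw payload the device reads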
I know it's possible, but I can't seem to figure it out.
I have a MySQL query that has a couple hundred thousand results. I want to be able to send the results, but it seems the response requires a Content-Length header to start downloading.
In phpMyAdmin, if you go to export a database, it starts the download right away; Firefox just says unknown file size, but it works. I looked at that code, but couldn't figure it out.
Help?
Thanks.
This document (RFC 2616), in section 14.13, states that in HTTP, it "SHOULD be sent whenever the message's length can be determined prior to being transferred, unless this is prohibited by the rules in section 4.4". This means it DOESN'T have to be sent if you can't determine the size in advance.
Just don't send it.
If you want to send parts of the data to the browser before all of it is available, do you flush your output buffer? Maybe that's the problem, not the lack of a header?
The way you use flush() is like this:
generate some output, which should add it to the buffer
flush() it, which should send the current buffer to the client
goto 1
So, if your query returns a lot of results, you could just generate the output for, let's say, 100 or 1000 of them, then flush, and so on.
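A minimal sketch of that loop with mysqli (credentials, table, and column names are placeholders):

// Sketch: stream a large result set in batches, flushing after each batch.
$db = new mysqli('localhost', 'user', 'pass', 'dbname'); // placeholder credentials
$result = $db->query('SELECT col FROM big_table', MYSQLI_USE_RESULT); // unbuffered read
$i = 0;
while ($row = $result->fetch_assoc()) {
    echo $row['col'], "\n";
    if (++$i % 1000 === 0) {
        flush(); // push the current buffer to the client (call ob_flush() first if output buffering is on)
    }
}
flush(); // send whatever is left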
Also, to tell the client browser to save the response as a file instead of displaying it in the window, you can try the Content-Disposition: attachment header as well. See section 19.5.1 of the specification.
You can use chunked transfer-encoding. Basically, you send a "Transfer-Encoding: chunked" header, and then the data is sent in chunked mode, meaning that you send the length of a chunk followed by the chunk. You keep repeating this until the end of the data, at which point you send a zero-length chunk.
Details are in RFC 2616.
It's possible that you are using gzip, which waits for all content to be generated. Check your .htaccess for directives regarding this.
You don't need to set a Content-Length header field. This header field just tells the client how much data to expect. In fact, a "wrong" value can cause the client to discard some data.
If you use the LiveHTTPHeaders plugin in Firefox, you can see the difference between the headers being sent by phpMyAdmin and the headers being sent by your application. I don't know specifically what you're missing, but this should give you a hint. One header that I see running a quick test is "Transfer-Encoding: chunked", and I also see that they're not sending Content-Length. Here's a link to the Firefox plugin if you don't already have it:
LiveHTTPHeaders
HTTP Content-Length is a SHOULD field, so you can drop it...
but you have to set the transfer encoding then.
Have a look at Why "Content-Length: 0" in POST requests?