I'm attempting to retrieve images from a web page, and it has been working well so far, except one of the sites I am looking at is serving images as Content-Type: text/html, causing my script to reject it as not a real image.
This is the code snippet I am using to determine content-type:
$accepted_mime = array('image/gif', 'image/jpeg', 'image/jpg', 'image/png');
$headers = get_headers($image);
// Find the Content-Type header
$num_headers = sizeOf($headers);
for ($x = 0; $x < $num_headers; $x++) {
    preg_match('/^Content-Type: (.+)$/', $headers[$x], $mime_type);
    if (isset($mime_type[1]) && in_array($mime_type[1], $accepted_mime)) {
        return true;
    }
}
For sites I've tried, they return properly (results such as image/gif, image/png, etc), but mpaa.org seems to serve their images with type text/html. Is this normal?
I added a print_r to see the header array returned by get_headers:
Array
(
[0] => http://www.mpaa.org/templates/images/header_mpaa_logo.gif
[1] => Array
(
[0] => HTTP/1.1 200 OK
[1] => Server: nginx/1.2.0
[2] => Date: Sat, 17 Nov 2012 17:19:06 GMT
[3] => Content-Type: text/html
[4] => Connection: close
[5] => P3P: CP="NON DSP COR ADMa OUR IND UNI COM NAV INT"
[6] => Cache-Control: no-cache, no-store, must-revalidate
[7] => Pragma: no-cache
)
)
I could easily add text/html to my list of accepted content-types, but that's definitely not the ideal solution ;) Does anyone know why mpaa.org serves their images with this Content-Type? Is it regular practice to do so (perhaps with legacy websites/servers)?
Thanks :)
The wonderful MPAA is using user-agent sniffing or checking cookies to determine if your browser supports JavaScript. Since you are not specifying a user-agent string or sending cookies, they assume you don't have JavaScript and return a page saying that, instead of the original image.
If you load this with a browser, you'll note that you do get image/gif, and the image you are after: http://www.mpaa.org/templates/images/header_mpaa_logo.gif
If you make that same request with cURL or Fiddler, or with some other oddball user-agent string, you get:
This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.
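If you want to keep using get_headers(), one workaround (a minimal sketch, not tested against mpaa.org specifically) is to send a browser-like User-Agent via the default stream context, which get_headers() uses; the UA string below is just an example value:
<?php
// Minimal sketch: make get_headers() send a browser-like User-Agent.
stream_context_set_default(array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20100101 Firefox/17.0',
    ),
));

$headers = get_headers('http://www.mpaa.org/templates/images/header_mpaa_logo.gif');
print_r($headers); // with luck, Content-Type now comes back as image/gif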
Don't rely on headers. They can be changed easily and, as you've just found, are not reliable.
I would do it like this:
Download the image
Check whether the file really is an image (using getimagesize or something like that; see the sketch below)
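A minimal sketch of that approach, assuming $image holds the remote URL and allow_url_fopen is enabled (the temp-file handling is illustrative, not the only way to do it):
// Download the image to a temporary local file, then validate it.
$tmp = tempnam(sys_get_temp_dir(), 'img');
$is_image = false;
if ($tmp !== false && @copy($image, $tmp)) {
    $info = @getimagesize($tmp); // false if the file is not a real image
    $accepted = array(IMAGETYPE_GIF, IMAGETYPE_JPEG, IMAGETYPE_PNG);
    $is_image = ($info !== false) && in_array($info[2], $accepted, true);
}
if ($tmp !== false) {
    unlink($tmp);
}
// $is_image is true only for real GIF/JPEG/PNG files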
Related
I want to post a file to a server with a relative path supplied as the file's filename within the Content-Disposition header (using PHP 7.0 on Ubuntu with curl 7.47):
curl server/index.php -F "file=@somefile.txt;filename=a/b/c.txt"
Applying the --trace-ascii /dev/stdout option shows:
0000: POST /index.php HTTP/1.1
0031: Host: server
004a: User-Agent: curl/7.47.0
0063: Accept: */*
0070: Content-Length: 111511
0088: Expect: 100-continue
009e: Content-Type: multipart/form-data; boundary=--------------------
00de: ----e656f77ee2b4759a
00f4:
...
0000: --------------------------e656f77ee2b4759a
002c: Content-Disposition: form-data; name="file"; filename="a/b/c.txt
006c: "
006f: Content-Type: application/octet-stream
0097:
...
Now, my simple test script <?php print_r($_FILES["file"]); ?> outputs:
Array
(
[name] => c.txt
[type] => application/octet-stream
[tmp_name] => /tmp/phpNaikad
[error] => 0
[size] => 111310
)
However, I expected [name] => a/b/c.txt. Where is the flaw in my logic?
According to https://stackoverflow.com/a/3393822/1647737, the filename can contain a relative path.
The PHP manual also implies this and suggests sanitizing with basename().
As we can see from the PHP interpreter sources, the _basename() filter is invoked for security reasons and/or to work around quirks in particular browsers.
File: php-src/main/rfc1867.c
Lines ~1151 and below:
/* The \ check should technically be needed for win32 systems only where
* it is a valid path separator. However, IE in all it's wisdom always sends
* the full path of the file on the user's filesystem, which means that unless
* the user does basename() they get a bogus file name. Until IE's user base drops
* to nill or problem is fixed this code must remain enabled for all systems. */
s = _basename(internal_encoding, filename);
if (!s) {
s = filename;
}
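So on the PHP side you will only ever see c.txt in [name]. If you still want to follow the manual's advice and sanitize defensively, a minimal sketch (the uploads directory is a hypothetical target):
// PHP has already reduced the name to its basename, but sanitizing again
// costs nothing and guards against client-supplied path tricks.
$clientName = $_FILES['file']['name'];   // "c.txt" in the trace above
$safeName   = basename($clientName);
move_uploaded_file($_FILES['file']['tmp_name'], __DIR__ . '/uploads/' . $safeName);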
Use case: a user clicks a link on a webpage and - boom! - a load of files is sitting in his folder.
I tried to pack the files into a multipart/mixed message, but it seems to work only in Firefox.
This is what my response looks like:
HTTP/1.0 200 OK
Connection: close
Date: Wed, 24 Jun 2009 23:41:40 GMT
Content-Type: multipart/mixed;boundary=AMZ90RFX875LKMFasdf09DDFF3
Client-Date: Wed, 24 Jun 2009 23:41:40 GMT
Client-Peer: 127.0.0.1:3000
Client-Response-Num: 1
MIME-Version: 1.0
Status: 200
--AMZ90RFX875LKMFasdf09DDFF3
Content-type: image/jpeg
Content-transfer-encoding: binary
Content-disposition: attachment; filename="001.jpg"
<< here goes binary data >>--AMZ90RFX875LKMFasdf09DDFF3
Content-type: image/jpeg
Content-transfer-encoding: binary
Content-disposition: attachment; filename="002.jpg"
<< here goes binary data >>--AMZ90RFX875LKMFasdf09DDFF3
--AMZ90RFX875LKMFasdf09DDFF3--
Thank you
P.S. No, zipping files is not an option
Zipping is the only option that will have consistent result on all browsers. If it's not an option because you don't know zips can be generated dynamically, well, they can. If it's not an option because you have a grudge against zip files, well..
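For the record, a minimal sketch of generating such a zip on the fly with PHP's ZipArchive ($paths is an assumed list of server-side files):
<?php
// Build a zip in a temp file and stream it to the browser.
$paths = array('/path/to/001.jpg', '/path/to/002.jpg'); // assumed server-side paths

$tmp = tempnam(sys_get_temp_dir(), 'zip');
$zip = new ZipArchive();
$zip->open($tmp, ZipArchive::CREATE | ZipArchive::OVERWRITE);
foreach ($paths as $path) {
    $zip->addFile($path, basename($path)); // second argument = name inside the archive
}
$zip->close();

header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="images.zip"');
header('Content-Length: ' . filesize($tmp));
readfile($tmp);
unlink($tmp);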
MIME/multipart is for email messages and/or POST transmission to the HTTP server. It was never intended to be received and parsed on the client side of a HTTP transaction. Some browsers do implement it, some others don't.
As another alternative, you could have a JavaScript script opening windows downloading the individual files. Or a Java Applet (requires Java Runtimes on the machines, if it's an enterprise application, that shouldn't be a problem [as the NetAdmin can deploy it on the workstations]) that would download the files in a directory of the user's choice.
I remember doing this >10 years ago in the Netscape 4 days. It used boundaries like what you're doing and didn't work at all with other browsers at that time.
While it does not answer your question, HTTP/1.1 supports request pipelining, so at least the same TCP connection can be reused to download multiple images.
You can use base64 encoding to embed a (very small) image into an HTML document; however, from a browser/server standpoint, you're technically still sending only one document. Maybe this is what you intend to do?
Embed Images into HTML using Base64
EDIT: I just realized that most methods I found in my Google search only support Firefox, and not IE.
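For what it's worth, a data URL like that can be produced server-side; a minimal PHP sketch (the path and MIME type are assumptions, you could detect the type with finfo instead):
// Turn a server-side image into a data URL string.
$path    = '/path/to/stamp.png';   // assumed location
$mime    = 'image/png';            // assumed type
$dataUrl = 'data:' . $mime . ';base64,' . base64_encode(file_get_contents($path));
echo '<img src="' . $dataUrl . '" alt="embedded image">';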
You could make a JSON document with multiple data URLs.
E.g.:
{
"stamp.png": "data:image/png;base64,...",
"document.pdf": "data:application/pdf;base64,..."
}
(extending trinalbadger587's answer)
You could return an html with multiple clickable, downloadable, inplace data links:
<html>
<body>
<a download="yourCoolFilename.png" href="data:image/png;base64,...">PNG</a>
<a download="theFileGetsSavedWithThisName.pdf" href="data:application/pdf;base64,...">PDF</a>
</body>
</html>
I've been researching this all morning and have decided that as a last-ditch effort, maybe someone on Stack Overflow has a "been-there, done-that" type of answer for me.
Background

Recently, I implemented compression on our (intranet-oriented) Apache (2.2) server using filters so that all text-based files are compressed (css, js, txt, html, etc.) via mod_deflate, mentioning nothing about PHP scripts. After plenty of research on how best to compress PHP output, I decided to use the gzcompress() flavor because the PHP documentation suggests that using the zlib library and gzip (using the deflate algorithm, blah blah blah) is preferred over ob_gzipwhatever().
So I copied someone else's method like so:
<?php # start each page by enabling output buffering and disabling automatic flushes
ob_start(); ob_implicit_flush(0);

(program logic)

print_gzipped_page();

function print_gzipped_page() {
    if (headers_sent())
        $encoding = false;
    elseif (strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'x-gzip') !== false)
        $encoding = 'x-gzip';
    elseif (strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false)
        $encoding = 'gzip';
    else
        $encoding = false;

    if ($encoding) {
        $contents = ob_get_contents(); # get contents of buffer
        ob_end_clean();                # turn off OB and flush buffer
        $size = strlen($contents);
        if ($size < 512) {             # too small to be worth a compression
            echo $contents;
            exit();
        } else {
            header("Content-Encoding: $encoding");
            header('Vary: Accept-Encoding');
            # 8-byte file header: g-zip file (1f 8b) compression type deflate (08), next 5 bytes are padding
            echo "\x1f\x8b\x08\x00\x00\x00\x00\x00";
            $contents = gzcompress($contents, 9);
            $contents = substr($contents, 0, $size); # faster than not using a substr, oddly
            echo $contents;
            exit();
        }
    } else {
        ob_end_flush();
        exit();
    }
}
Pretty standard stuff, right?
Problem

Between 10% and 33% of all our PHP page requests sent via Firefox go out fine and come back gzipped, only Firefox displays the compressed ASCII in lieu of decompressing it. AND, the weirdest part is that the content size sent back is always 30 or 31 bytes larger than the size of the page correctly rendered. As in, when the script is displayed properly, Firebug shows a content size of 1044; when Firefox shows a huge screen of binary gibberish, Firebug shows a content size of 1074.
This happened to some of our users on legacy 32-bit Fedora 12s running Firefox 3.3s... Then it happened to a user with FF5, one with FF6, and some with the new 7.1! I've been meaning to upgrade them all to FF7.1, anyway, so I've been updating them as they have issues, but FF7.1 is still exhibiting the same behavior, just less frequently.
Diagnostics

I've been installing Firebug on a variety of computers to watch the headers, and that's where I'm getting confused:
Normal, functioning page response headers:
HTTP/1.1 200 OK
Date: Fri, 21 Oct 2011 18:40:15 GMT
Server: Apache/2.2.15 (Fedora)
X-Powered-By: PHP/5.3.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 1045
Keep-Alive: timeout=10, max=75
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
(Notice that content-length is generated automatically)
Same page when broken:
HTTP/1.1 200 OK
(everything else identical)
Content-Length: 1075
The sent headers always include Accept-Encoding: gzip, deflate
Things I've tried to fix the behavior:
Explicitly declare content length with uncompressed and compressed lengths
Not use the substr() of $contents
Remove checksum at the end of $contents
I don't really want to use gzencode because my testing showed it to be significantly slower (9%) than gzcompress, presumably because it's generating extra checksums and whatnot that (I assumed) the web browsers don't need or use.
I cannot duplicate the behavior on my 64-bit Fedora 14 box running Firefox 7.1. Not once in my testing before rolling the compression code live did this happen to me, neither in Chrome nor Firefox. (Edit: Immediately after posting this, one of the windows I'd left open that sends meta refreshes every 30 seconds finally broke after ~60 refreshes in Firefox) Our handful of Windows XP boxes are behaving the same as the Fedora 12s. Searching through Firefox's Bugzilla kicked up one or two bug requests that were somewhat similar to this situation, but that was for versions pre-dating 3.3 and was with all gzipped content, whereas our Apache gzipped css and js files are being downloaded and displayed without error each time.
The fact that the content-length is coming back 30/31 bytes larger each time leads me to think that something is breaking inside my script/gzcompress() that is mangling something in the response that Firefox chokes on. Naturally, if you play with altering the echo'd gzip header, Firefox throws a "Content Encoding Error," so I'm really leaning towards the problem being internal to gzcompress().
Am I doomed? Do I have to scrap this implementation and use the not-preferred ob_start("ob_gzhandler") method?
I guess my "applies to more than one situation" question would be: Are there known bugs in the zlib compression library in PHP that does something funky when receiving very specific input?
Edit: Nuts. I readgzfile()'d one of the broken, non-compressed pages that Firefox downloaded and, lo and behold!, it echo'd everything back perfectly. =( That means this must be... Nope, I've got nothing.
Okay, first of all, you don't seem to be setting the Content-Length header, which will cause issues. Instead, you are making the gzip content longer so that it matches the content length size that you were receiving in the first place. This is going to turn ugly. My suggestion is that you replace the lines
# 8-byte file header: g-zip file (1f 8b) compression type deflate (08), next 5 bytes are padding
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00";
$contents = gzcompress($contents, 9);
$contents = substr($contents, 0,$size); # faster than not using a substr, oddly
echo $contents;
with
$compressed = gzcompress($contents, 9);
$compressed_length = strlen($compressed); /* contains no nulls i believe */
header("Content-length: $compressed_length");
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00", $compressed;
and see if it helps the situation.
Ding! Ding! Ding! After mulling over this problem all weekend, I finally stumbled across the answer after re-reading the PHP man pages for the umpteenth time... From the zlib PHP documentation, "Whether to transparently compress pages." Transparently! As in, nothing else is required to get PHP to compress its output once zlib.output_compression is set to "On". Yeah, embarrassing.
For reasons unknown, the code being called, explicitly, from the PHP script was compressing the already-compressed contents and the browser was simply unwrapping the one layer of compression and displaying the results. Curiously, the strlen() of the content didn't vary when output_compression was on or off, so the transparent compression must occur after the explicit compression, but it occasionally decided not to compress what was already compressed?
Regardless, everything is resolved by simply leaving PHP to its own devices. zlib doesn't need output buffering or anything to compress the output.
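For anyone else who lands here, a minimal sketch of "leaving PHP to its own devices" (the same setting normally lives in php.ini; ini_set() is shown only as a per-script alternative and must run before any output):
<?php
// Equivalent php.ini directive: zlib.output_compression = On
ini_set('zlib.output_compression', 'On');

// No ob_start(), no hand-rolled gzip header, no gzcompress():
// PHP checks Accept-Encoding and compresses the output transparently.
echo '<html><body>Hello, compressed world.</body></html>';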
Hope this helps others struggling with the wonderful world of HTTP compression.
I am using Thickbox on Ubercart/Drupal 6 on Ubuntu. The problem is I moved the site from a Windows machine to Ubuntu. All problems with paths and permissions are sorted and the site is working well.
The only problem I'm having now is when I click on a product image, thickbox is supposed to show a preview pop up. Instead, it shows weird characters in the pop up window. A copy/paste of those characters:
�����JFIF�,,�����Exif��MM����� ���������������������������������������(�������1��������2����������������i����������4NIKON CORPORATION�NIKON D70s���,�����,���Adobe Photoshop 7.0�2008:08:21 17:13:50���%�������������������"�������������0221���������������������������֒� �����ޒ������������������������ �������� ����������,�������90��������90��������90��������0100��������������������������������������������������������"���������������������������������������E������������������������������ ��������� ����������������������� ��X������� 2008:08:19 15:40:17�2008:08:19 15:40:17�����������������+��� ������ ASCII��� ���������������������������������(�������������������� W�������H������H��������JFIF��H�H�����Adobe_CM����Adobe�d��������� ������7"�������?���������� ��������� �3�!1AQa . . . . . . and a lot more similar chars
The images are uploaded properly and I can see them under sites/default/files/. Even the thumbnails are generated. These thumbnails appear on the site as well. Also, right-clicking a thumbnail and opening it in a new tab shows me the whole image properly.
Also, Thickbox sends an ajax GET request for the image to a URL that looks something like this:
http://127.0.0.1/elegancia/?q=system/files/imagecache/product_full/image_1.jpg&random=1299550719133
Copy-pasting the same request URL from Firebug into a new browser tab opens the image successfully.
From Firebug, these are the request and response headers for the ajax request:
Response Headers
Date Tue, 08 Mar 2011 02:18:39 GMT
Server Apache/2.2.16 (Ubuntu)
X-Powered-By PHP/5.3.3-1ubuntu9.3
Expires Tue, 22 Mar 2011 02:18:39 GMT
Last-Modified Tue, 08 Mar 2011 01:21:47 GMT
Cache-Control max-age=1209600, private, must-revalidate
Content-Length 111831
Etag "4dfe0f3d345781ac89aae5c2a10361ad"
Keep-Alive timeout=15, max=92
Connection Keep-Alive
Content-Type image/jpeg
Request Headers
Host 127.0.0.1
User-Agent Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.10 (maverick) Firefox/3.6.15
Accept text/html, */*
Accept-Language en-gb,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 115
Connection keep-alive
X-Requested-With XMLHttpRequest
Referer http://127.0.0.1/elegancia/
Cookie SESS7a3e11dd748683d65ee6f3c6a918aa02=bijhrr4tl66t42majfs3702a06; has_js=1
Looks like it was a Thickbox (JavaScript) issue. PHP and Apache work fine when it comes to recognizing the image's MIME type.
If there are arguments in the image URL, e.g.
http://127.0.0.1/elegancia/?q=system/files/imagecache/product_full/image_1.jpg&random=1299550719133
Thickbox shows nonsense characters instead, because of its image recognition algorithm: a URL that does not end with an image extension makes the Thickbox JavaScript treat the image as some other MIME type that is not an image.
To work around this, modify line 53 of /modules/thickbox/thickbox.js by adding " || urlType == '/preview' " to the list of choices, in order to make thickbox.js believe in its heart that a Drupal-encoded image link is in fact an image and not an imposter.
Assuming your image size is "preview," change line 53 from:
if(urlType == '.jpg' || urlType == '.jpeg' || urlType == '.png' || urlType == '.gif' || urlType == '.bmp' ){//code to show images
to this:
if(urlType == '.jpg' || urlType == '.jpeg' || urlType == '.png' || urlType == '.gif' || urlType == '.bmp' || urlType == '/preview'){//code to show images
Also, modify line 50 to this:
var urlString = /\.jpg|\.jpeg|\.png|\.gif|\.bmp|\/preview/g;
(substitute "/preview" for "/thumbnail," "/quarter," or whatever you configured your image module to create (and name) various sizes.
Another solution I've found is to add a path_info component to the URL to specify the image type. For example, my URL previously was:
/image.php?foo=bar
I changed it to:
/image.php/image.gif?foo=bar
Note that if you're using a webserver such as Apache, which by default restricts the use of path_info, you may need to turn it on with the AcceptPathInfo directive for the affected path.
I prefer this solution to altering the Thickbox source, because altering modules which may get replaced with updated versions means a possible loss of fixes, whereas altering the path_info should continue to function with any upgrades.
The browser is rendering the file as text, when it should treat it as a JPEG image. You need to send the 'Content-Type: image/jpeg' header to tell the browser how to render the content. Check your web server configuration.
For Apache, your httpd.conf file should have lines like this:
LoadModule mime_magic_module modules/mod_mime_magic.so
LoadModule mime_module modules/mod_mime.so
...
TypesConfig /etc/mime.types
And then, in /etc/mime.types:
image/jpeg jpeg jpg jpe
This all applies to files which are served by the web server directly. If you can enter the URL in a browser and see the image, then none of this is a problem.
If the files are served by a script, then you need to make sure the header is sent by the script. In PHP:
header('Content-type: image/jpeg');
echo file_get_contents($image_path);
Working on a prebuilt system that grabs remote images and saves them to a server.
Currently there is no checking on the image as to whether it indeed exists at that remote location, or whether it is of a certain file type (jpg, jpeg, gif), and I'm tasked with checking both.
I thought this was quite trivial as I'd simply use a simple regex and getimagesize($image):
$remoteImageURL = 'http://www.exampledomain.com/images/image.jpg';
if(#getimagesize($remoteImageURL) && preg_match("/.(jpg|gif|jpeg)$/", $remoteImageURL) )
{
// insert the image yadda yadda.
}
The problem occurs when I don't have any control over the url that I'm grabbing the image from, for example:
http://www.exampledomain.com/images/2?num=1
so when it comes to this, both the regex and getimagesize() will fail. Is there a better way of doing this?
You could do
$headers = get_headers( 'http://www.exampledomain.com/images/2?num=1' );
The $headers variable will then contain something like
Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Thu, 14 Oct 2010 09:46:18 GMT
[2] => Server: Apache/2.2
[3] => Last-Modified: Sat, 07 Feb 2009 16:31:04 GMT
[4] => ETag: "340011c-3614-46256a9e66200"
[5] => Accept-Ranges: bytes
[6] => Content-Length: 13844
[7] => Vary: User-Agent
[8] => Expires: Thu, 15 Apr 2020 20:00:00 GMT
[9] => Connection: close
[10] => Content-Type: image/png
)
This will tell you whether the resource exists and what content-type (which is not necessarily also the image type) it is.
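A minimal sketch of that check, using get_headers() with its second argument so header names become array keys (note that the key casing follows whatever the server sends, and Content-Type becomes an array if the URL redirects):
$url     = 'http://www.exampledomain.com/images/2?num=1';
$headers = @get_headers($url, 1);          // 1 => associative array of headers

$exists = $headers !== false && strpos($headers[0], '200') !== false;
$type   = isset($headers['Content-Type']) ? $headers['Content-Type'] : null;
if (is_array($type)) {                      // redirects produce one entry per response
    $type = end($type);
}
$is_image = $exists && $type !== null && strpos(strtolower($type), 'image/') === 0;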
EDIT: as per Pekka's comment, you still might want to determine its MIME type after downloading. See
How Can I Check If File Is MP3 Or Image File.
Some of the given approaches work on remote files too, so you can probably skip the get_headers pre-check altogether. Decide for yourself which one suits your needs.
Store the remote file on your server then run getimagesize on the local one.