Converting a multi-page PDF into single-page PDFs and extracting image - php

I have a multi-page PDF file that has information I need to parse. Each page holds its own information and picture. I need to extract the text and image from every page of the PDF.
I'm using CentOS and PHP.
My attempt:
I originally tried using a combination of pdftotext and imagemagick. I converted the PDF into an image and that actually separated the pages into their own images. Unfortunately the quality of the image on the page came out very poor.
My goal:
I need to split the PDF into multiple PDFs, one per page. Then, I need to extract the image from that page with the best quality possible.
Thanks.

ImageMagick is not the right tool for this task.
When you need to extract images from a PDF at their original size (which is the best you can get, since any other resolution means resampling the image down or up), use
pdfimages
http://www.foolabs.com/xpdf/download.html
(static binaries are available if you cannot compile from source)
syntax:
pdfimages file.pdf image-root
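The resulting images will have the extension .ppm (or .pbm for monochrome images) unless you add the -j switch, which writes images stored in DCT (JPEG) format as .jpg files; all other images are still written as .ppm/.pbm.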
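Since each page in the question carries its own picture, the extraction can also be limited to one page at a time with the -f (first page) and -l (last page) options, without splitting the PDF first. The following is a minimal sketch, not a definitive script: extract_page_images is a hypothetical helper name, and the output layout (one image root per page) is an assumption.

```shell
# Sketch: extract the images of a single page at their native resolution.
# extract_page_images is a hypothetical helper, not part of xpdf or poppler.
extract_page_images() {
    pdf=$1      # input PDF file
    page=$2     # 1-based page number
    outdir=$3   # directory that receives the extracted images

    mkdir -p "$outdir" || return 1
    # -f and -l restrict extraction to the given page range.
    # -j writes DCT (JPEG) images as .jpg; other images stay .ppm/.pbm.
    pdfimages -j -f "$page" -l "$page" "$pdf" "$outdir/page-$page"
}
```

For example, extract_page_images file.pdf 3 ./images would write the images found on page 3 using the image root ./images/page-3.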

Use pdfseparate to split multi-page.pdf into 1.pdf, 2.pdf, … and then convert 1.pdf to 1.png, …
pdfseparate (part of poppler) splits multi-page.pdf into 1.pdf, 2.pdf, …
pdfseparate multi-page.pdf ./single-pages/%d.pdf
This extracts all pages from multi-page.pdf
and saves them as single-page PDFs (%d is replaced by the page number).
mogrify (part of ImageMagick) batch-converts all single-page PDFs to PNGs at your desired resolution (in DPI). Put -density before the file names so that it applies when the PDFs are rasterized:
mogrify -density 300 -format png ./single-pages/*.pdf

Related

How do CMYK/RGB color spaces work in pdfs and images, and how does it affect their inter convertibility?

I have a task where I need to take PDFs that are mock ups of printing products, and check their resolution, size and colour-space. I need to use Imagick with PHP to complete this task.
The printing shop that will print these PDFs only has CMYK printers, so the uploaded PDFs need to use CMYK colours. But I am not clear on how colour-spaces (CMYK/RGB) work in PDFs, or in jpeg/png images. So, I have a few questions that will hopefully help me understand this better and complete the task:
From what I understand, we can draw objects or add images to the pdf that can have their colours defined as RGB or CMYK, but how does this affect the colour-space of the entire PDF?
Is it possible to check the colour-space of a PDF in php, without converting it jpeg/png?
If I have images in a PDF defined in either CMYK or RGB colour-space and convert the PDF to jpeg/png with Imagick, does the colour-space remain the same in the converted image unless specifically mentioned by Imagick::transformImageColorspace()?
A short background information on how colour-spaces work, how they are defined and detected and how they are affected when the file is converted from one mime-type to another.
P.S.: I am converting the PDFs to jpeg/png and checking the colour of the converted file as below, but it always gives false, no matter what pdf I use.
$img = new imagick(self::$_imgArray[0]);
if($img->getimagecolorspace() == imagick::COLORSPACE_CMYK)
echo "Image is in CMYK";
I have a task where I need to take PDFs that are mock ups of printing products, and check their resolution, size and colour-space.
A PDF page does not have a resolution (though images on the page do). It does have a "physical" dimension, given by its media box (Letter size is a common choice). PDF units are by default 1/72 inch. If a PDF page contains pure vector data, then it looks great at any resolution.
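Because the default unit is 1/72 inch, the pixel size of a rendered page follows directly from the page size in points and the chosen DPI: pixels = points × DPI / 72. A small sketch of that arithmetic follows; pt_to_px is a hypothetical helper name, and it assumes whole-number point sizes because shell arithmetic is integer-only.

```shell
# Sketch: convert a length in PDF points (1/72 inch) to pixels at a given DPI.
# pt_to_px is a hypothetical helper; it rounds to the nearest pixel.
pt_to_px() {
    points=$1
    dpi=$2
    echo $(( (points * dpi + 36) / 72 ))
}

# A Letter page is 612 x 792 points (8.5 x 11 inches).
pt_to_px 612 300   # prints 2550
pt_to_px 792 300   # prints 3300
```

The same arithmetic can be run in reverse to check whether an embedded image has enough pixels for the area it covers on the page.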
See below for more detail, but a single PDF page/document can contain one or more of Gray, RGB, CMYK, LAB, and more, color spaces.
but how does this affect the colour-space of the entire PDF?
It doesn't; the PDF itself does not have an overall color space. Typically a PDF processor converts all graphics to a target color space, e.g. Chrome would at some point have everything in RGB since it is drawing to a screen.
Is it possible to check the colour-space of a PDF in php, without converting it jpeg/png?
Sure, though a single PDF could contain greyscale, rgb, cmyk, lab, separation colors, etc. Again, there is no one color space in a PDF file.
If I have images in a PDF defined in either CMYK or RGB colour-space and convert the PDF to jpeg/png with Imagick, does the
colour-space remain the same in the converted image unless
specifically mentioned by Imagick::transformImageColorspace()?
It would depend on the software doing the conversion. Since PNG does not support CMYK, at the very least any CMYK content would be converted when writing PNG. Exactly what happens depends on the software, the settings, the target output format, and what it supports.
A short background information on how colour-spaces work, how they are defined and detected and how they are affected when the file is
converted from one mime-type to another.
See section 8.6 here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Here is another good link
https://www.color-management-guide.com/color-spaces.html

ImageMagick: rotate a particular page in PHP

I have a PDF where I need to rotate particular page in PDF.
I have used below command but it rotates all pages in that pdf
convert -density 300x300 -rotate 90 input.pdf result.pdf
Is there any way to rotate particular page ?
Thanks in advance
Using ImageMagick, you can rotate a particular page using the [pagenum] syntax after the filename like this:
convert document.pdf[3] -rotate 90 page3rotated.pdf
However, that will only extract that single page, rotate it and save it (and it alone) in the output file. Note that ImageMagick's page index is zero-based, so [3] selects the fourth page of the document. Of course, you could extract ALL the pages, rotate the one you want and re-assemble your PDF, but I can't help thinking that there must be a better way with a different tool.
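One candidate for such a different tool is pdftk, whose cat operation accepts a rotation suffix on a page range and does not rasterise the pages. The sketch below only builds the page-range arguments; rotate_ranges is a hypothetical helper name, and the east keyword (rotate 90 degrees clockwise) is an assumption about the installed pdftk version, since older releases, as far as I know, used single letters such as E instead.

```shell
# Sketch: build pdftk "cat" page ranges that rotate one page by 90 degrees
# clockwise and keep every other page unchanged.
# rotate_ranges is a hypothetical helper; pages are 1-based, as in pdftk.
rotate_ranges() {
    page=$1    # page to rotate
    total=$2   # total number of pages in the document
    ranges=""
    if [ "$page" -gt 1 ]; then
        ranges="1-$((page - 1)) "          # pages before the rotated one
    fi
    ranges="${ranges}${page}east"          # the rotated page
    if [ "$page" -lt "$total" ]; then
        ranges="$ranges $((page + 1))-end" # pages after the rotated one
    fi
    echo "$ranges"
}

rotate_ranges 3 10   # prints 1-2 3east 4-end
```

The result would then be passed to pdftk along the lines of: pdftk input.pdf cat $(rotate_ranges 3 10) output result.pdf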
In general, you should be aware that ImageMagick will rasterise your PDF (so it is basically a picture) and that words/text will no longer be selectable as words/text when ImageMagick puts your PDF back together.
If you are using Linux, you may use PDFMod. This allows independent operations on individual pages of the PDF file.

pdf to png or jpg conversion and back again with php - losing fonts

I need to accept hundreds of pdfs at a time via PHP. I am storing these files on S3, so, file size will become a concern - not only for storing, but general handling. I'm finding the best way to reduce file size is conversion from PDF to PNG and back to PDF. A 15M file drops to 700kb. The problem is I'm losing certain fonts. Is there a way to ensure this doesn't happen? How do I ensure the process I use maintains the fonts in the original document? Is there some massive font library I can install?
from the command line I've tried...
Imagemagick
Ghostscript
pdftk
inkscape (real nice output)
They all work with varying levels of success, but each of them loses certain fonts, and not always the same ones.
Nope!
The .PDF format is an encapsulation of "graphics commands," such as "render the following text at position (X,Y) in the workspace using font Z."
When you "convert" such a file to any(!) "image file" format, you are in fact asking the PDF-engine to "carry out those graphics commands," producing a bitmap (a rectangular grid of pixels ...) as its only output.
Well, once you have done that, "you can never go back." The PDF-engine rendered its rectangular grid of pixels as best it could, and now, both it and the PDF-file that it consumed are gone, leaving you only with a rectangular grid of (output) pixels.

Linux PDF to HTML + Image Map + jpeg image files

What I want is a way to parse a PDF file into HTML with the image map (the hyperlinks) and the images must be in jpg format.
I have a Magazine Reader and I need the images and the position, href and size of each hyperlink.
The solution needs to be to run into a linux server.
Any suggestions?
Many thanks!
You should take a look at the pdf2html project or pdf2htmlEX.
It needs some tweaks to convert the PNG output to JPG as well.
That is as simple as:
convert foo.png foo.jpg
with ImageMagick tools.
See the README.

How to convert specific no. of pages from a .pdf file to .png image using imagemagick

I am using ImageMagick to convert my .pdf file to .png images, but when I run the command
$ convert sample.pdf image.png
it converts every page of sample.pdf to a .png image. What I want is to convert only specific pages (e.g. the first 10 pages, or page 22, or page 12).
Please suggest a way to solve this.
One more question:
When we view .pdf files in the Google Docs PDF viewer, the pages are also shown as images, but we can still select the text on them and copy it to the clipboard (simply select the text and press Ctrl+C).
How can I implement this so that the users of my website can select the text from my images?
(There are already some discussions about this on Stack Overflow, but they are not very clear.)
for i in {0..9} 11 21
do
convert "sample.pdf[$i]" "image_$i".png
done
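The loop above uses zero-based indexes, so 11 and 21 stand for pages 12 and 22. ImageMagick also accepts a comma-separated list of indexes inside the brackets, which lets one call handle several pages. A minimal sketch that turns 1-based page numbers into such a selector follows; im_selector is a hypothetical helper name.

```shell
# Sketch: turn 1-based page numbers into an ImageMagick frame selector.
# im_selector is a hypothetical helper; ImageMagick indexes are zero-based.
im_selector() {
    sel=""
    for p in "$@"; do
        # Append a comma only when the list already has an entry.
        sel="${sel:+$sel,}$((p - 1))"
    done
    echo "[$sel]"
}

im_selector 12 22   # prints [11,21]
```

It would be used as: convert "sample.pdf$(im_selector 12 22)" image.png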
Benoit's answer is what you were looking for to slice a PDF and convert it into images.
Alternatively you can use pdftk with the cat operation. This would get you the first 10 pages and generate a new sliced PDF for example.
pdftk YOUR.PDF cat 1-10 output SLICED.PDF
Regarding your second question about converting an image PDF to a PDF with text data, the only way is to use an OCR tool such as Tesseract.
The only problem is that OCR tools are not always exact. In other words, they will sometimes fail to output what you can read in the image.
