I'm looking for a solution / API (e.g. PDFLib) that can extract (and remove) a drawn path from a graphic PDF. For example, a path that outlines a picture or logo drawn in Illustrator or InDesign (not a JPG clipping path) and that is set to a specific spot color (e.g. "CutContour"). I need to get the data that makes up that path so I can use it in a cutting system.
While PDFLib can extract text, it cannot extract graphic elements. I'm even open to solutions outside of PHP!
Thanks in advance!
I wasn't able to find any PHP PDF parsers, but...
If you aren't opposed to using another language, I found a Ruby gem that will parse a PDF file. From the docs, it looks like you can grab a hash of the objects in a file.
http://rubygems.org/gems/pdf-reader
If you're looking for a purely programmatic solution, that might work, but it seems like it would be difficult.
Otherwise, I know that you can open PDF files in Adobe Illustrator and extract graphics that way. You might even be able to write some JavaScript to automate the process. This solution obviously won't work on Linux, though.
Related
I hope you are doing well.
I need a PHP library that converts a PDF file (including its images) to an HTML file and meets the following requirements:
The HTML file needs to be HTML 3.2 compatible
The images in the PDF file need to be saved with a .jpg extension
The correct fonts from the PDF need to be used in the HTML file
The images and the HTML file need to end up together in one result folder
I have tried most of the PHP libraries, but none of them does everything I need.
Please let me know about a library that meets all 4 of the requirements above (image attached for reference).
Waiting for your kind responses.
Thanks
I am not very sure, but here is a PHP library I found.
Here
Try this:
http://www.pdfaid.com/pdf-to-html.aspx
Or this:
http://webdesign.about.com/od/pdf/tp/tools-for-converting-pdf-to-html.htm
Or this...
http://www.pdfconvertonline.com/pdf-to-html-online.html
There are plenty of options available to you; the secret is to use a newfangled thing called a Search Engine, such as a Bing or a Google.
You will also do well to research on Stack Overflow before asking your question:
1) HTML 3.2 was superseded in 1997, which is very nearly twenty years ago. Why on earth do you still need a comparatively ancient technology when there are far better alternatives available, such as XHTML, HTML 4.01 and HTML5?
2) Please read How can I extract embedded fonts from a PDF as valid font files?
3) Also to extract images you can use:
http://www.makeuseof.com/tag/extract-images-pdf-files-save-windows/
but again, there are several options available to you if you care to look for them.
Your question suggests a fundamental misunderstanding about HTML; there are several different ways of getting any desired result with HTML. You have a PDF file and you want it to look a certain way, but that look depends on the browser it is viewed in. For example, if you use a PDF to HTML converter as linked above, you will very probably find that the output looks different in Internet Explorer 7, in Firefox and in Internet Explorer 10. There is no one way of writing output in HTML or with CSS.
If you want a custom built library to do your specific task then you will need to employ a professional to do it, or you will need to code it yourself. This obviously should be charged to the client for requiring a technology that is extremely outdated. You can probably search github for a similar library (the one linked by CK Khan looks like what you're after) and then fork it and make your own variation for your needs. I very much doubt anyone is going to put time into developing a system to output HTML 3.2 from a PDF, and even less likely to develop this system for free and to your exact specifications.
It also appears that you cannot directly specify font families in the <font> tag in HTML 3.2; you can only set the size and colour of fonts. You can use the CSS1 font-family property to set font families. See here.
I want to give users a preview of certain files on my site and will be using the Scribd API. Does anyone know how I can access the full file on my server and save part of it under a different name, which I will then show to users? I can't think of a way to do this with PHP for .docx and image files. Help is much appreciated.
For "splitting" images, use an image processing library like gd to crop the image (lots of examples to be found on how to do that all over the place). For Word documents, use a library like PHPWord (or one of the other myriad such libraries) to open the document, remove/extract as much text as you need, then save that into a new Word file.
For other file types, find the appropriate method that allows you to manipulate that format, then do whatever you need to do with it.
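For the image case, a minimal sketch of a GD-based crop might look like the following. The helper names `clampCropRect` and `savePngPreview` are hypothetical, the sketch assumes the GD extension is installed and the source is a PNG, and it simply keeps the top-left corner of the image as the preview.

```php
<?php
// Hypothetical helper: clamp a requested crop rectangle to the image bounds
// so that the rectangle passed to GD never reaches outside the image.
function clampCropRect(int $imgW, int $imgH, int $x, int $y, int $w, int $h): array
{
    $x = max(0, min($x, $imgW - 1));
    $y = max(0, min($y, $imgH - 1));
    $w = max(1, min($w, $imgW - $x));
    $h = max(1, min($h, $imgH - $y));

    return ['x' => $x, 'y' => $y, 'width' => $w, 'height' => $h];
}

// Hypothetical helper: save the top-left $w x $h region of a PNG as a preview.
// Requires the GD extension (imagecreatefrompng, imagecrop, imagepng).
function savePngPreview(string $srcPath, string $dstPath, int $w, int $h): bool
{
    $src = imagecreatefrompng($srcPath);
    if ($src === false) {
        return false; // not a readable PNG
    }

    $rect = clampCropRect(imagesx($src), imagesy($src), 0, 0, $w, $h);
    $preview = imagecrop($src, $rect);
    if ($preview === false) {
        return false;
    }

    return imagepng($preview, $dstPath);
}
```

Calling `savePngPreview('full.png', 'preview.png', 300, 200)` would write the preview next to the original; the file names here are placeholders.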
I want to generate PDF from a PHP file that includes HTML controls like textbox, and textarea. I attached CSS in the same. I tried FPDF, DOMPDF and TCPDF, but still I don't get exactly what I want. How do I pass HTML controls with PHP variables and CSS to these libraries?
mPDF is another option that you could try.
EDIT :
Found another solution for it: TCPDF is a FLOSS PHP class for generating PDF documents. It looks like the more dominant library.
"Prince XML" is a good library (but no longer completely free).
Others:
If you mean creating a PDF file from PHP, pdflib will help you (as some others suggested).
Otherwise, if you want to convert an HTML page to PDF via PHP, you'll run into a bit of trouble. For three years I have been trying to do it as best as I can.
So, the options I know are:
HTML2PS: the same as DOMPDF, but this one converts first to .ps (Ghostscript) and then to whatever format you need (PDF, JPEG, PNG). For me it is a little better than DOMPDF, but I have the same speed problem. Oh, and it has better compatibility with CSS.
Those two are PHP classes, but if you can install some software on the
server, and access it through passthru() or system(), have a look at
these too:
wkhtmltopdf: based on WebKit (the rendering engine behind Safari), it is really fast and powerful. It seems to be the best one (at the moment) for converting HTML pages to PDF on the fly, taking only two seconds for a three-page XHTML document with CSS 2. It is a recent project, and the Google Code page is often updated.
htmldoc: this one is a tank; it really never stops or crashes. The project seems to have died in 2007, but if you don't need CSS compatibility it can be a nice fit.
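As a rough sketch of calling one of these command-line tools from PHP, the snippet below shells out to wkhtmltopdf. It assumes the `wkhtmltopdf` binary is on the server's PATH; the helper names are hypothetical, and it uses `exec()` rather than `passthru()` or `system()` only so that the exit code and output can be inspected.

```php
<?php
// Hypothetical helper: build the wkhtmltopdf command line.
// $htmlSource may be a local file path or a URL; both arguments are shell-escaped.
function buildWkhtmltopdfCommand(string $htmlSource, string $pdfPath): string
{
    return 'wkhtmltopdf ' . escapeshellarg($htmlSource) . ' ' . escapeshellarg($pdfPath);
}

// Hypothetical helper: run the conversion and report whether a PDF was produced.
// Assumes the wkhtmltopdf binary is installed and on the PATH.
function htmlToPdf(string $htmlSource, string $pdfPath): bool
{
    exec(buildWkhtmltopdfCommand($htmlSource, $pdfPath) . ' 2>&1', $output, $exitCode);

    return $exitCode === 0 && is_file($pdfPath);
}
```

Escaping every argument matters here because the HTML source is often a URL or a path that comes from user input.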
** Thumbs Up For Strae.
If I understand your needs correctly I don't think any PHP-PDF class would do that.
Mostly you could insert only text and images to a PDF file, so if you would want something that looks like an HTML element you would need to insert it as an image.
Usually just putting HTML doesn't mean all your elements would stay intact in the PDF . (Different world, after all)
http://www.fpdf.org/ is the site with a great HTML-to-PDF class which works well. I am using it, but you have to study its functionality first and then start.
How do I convert a PDF file to HTML in PHP? Is there any lib or web service? I mean free, thanks!
Google pdf2html; pdftohtml looks to be the only viable one, and it's based on a command-line program, not PHP, so it may not be useful to you. Google is capable of converting, so there may be a way to do it with GDocs as well, though I'm not sure of that. At any rate, I hope this gets you on the proper path at least.
I've tried Poppler's pdftohtml command to convert PDF files to HTML files. The HTML output of Poppler is lighter, but it is not very accurate.
If you want accurate output, you should use pdf2htmlEX. I've converted complicated PDF files with it and got the best HTML output.
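Both tools mentioned above are command-line programs, so from PHP they would have to be run through the shell. The sketch below assumes the chosen binary is installed and on the PATH and that it accepts the input PDF followed by the output HTML file as positional arguments; the helper name is hypothetical, and how each tool treats the output path is worth checking against its own help output.

```php
<?php
// Hypothetical helper: build a command line for one of the two converters
// named above. Only these two tool names are accepted, so the tool name
// itself never comes from unchecked input.
function buildPdfToHtmlCommand(string $tool, string $pdfPath, string $htmlPath): string
{
    $allowedTools = ['pdftohtml', 'pdf2htmlEX'];
    if (!in_array($tool, $allowedTools, true)) {
        throw new InvalidArgumentException("Unsupported converter: $tool");
    }

    // Assumption: both tools take "<input.pdf> <output.html>" as positional arguments.
    return $tool . ' ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($htmlPath);
}

// Usage sketch (runs only if the tool is actually installed):
// exec(buildPdfToHtmlCommand('pdf2htmlEX', 'in.pdf', 'out.html'), $output, $exitCode);
```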
You can't.
PDFs are complex documents containing embedded fonts, vector graphics and layout information that cannot be represented in HTML in an automated way. You may be able to extract the TEXT of the document, but that's about it.
How can I rotate a PDF document using PHP on Linux?
Rotate an Entire PDF Document's Pages by 180 Degrees
$command = "pdftk in.pdf cat 1-endS output out.pdf";
system($command);
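A slightly more defensive variant of the same idea is sketched below, assuming pdftk is installed and on the PATH. The helper name is hypothetical. The rotation suffix is passed in explicitly because, as far as I know, older pdftk releases use single letters such as the S above while newer ones document keywords such as south, so check which form your installed version accepts.

```php
<?php
// Hypothetical helper: build a pdftk command that rotates every page.
// $rotation is appended to the "1-end" page range, e.g. "S" or "south" for 180 degrees.
function buildPdftkRotateCommand(string $inPath, string $outPath, string $rotation): string
{
    // Assumption: these are the suffixes pdftk versions accept (letters in older
    // releases, keywords in newer ones). Anything else is rejected up front.
    $allowed = ['N', 'E', 'S', 'W', 'L', 'R', 'D',
                'north', 'east', 'south', 'west', 'left', 'right', 'down'];
    if (!in_array($rotation, $allowed, true)) {
        throw new InvalidArgumentException("Unsupported rotation: $rotation");
    }

    return 'pdftk ' . escapeshellarg($inPath)
        . ' cat 1-end' . $rotation
        . ' output ' . escapeshellarg($outPath);
}

// Usage sketch: check the exit code instead of assuming success.
// exec(buildPdftkRotateCommand('in.pdf', 'out.pdf', 'S'), $output, $exitCode);
// if ($exitCode !== 0) { /* pdftk failed or is not installed */ }
```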
You could use pdf90 from PDFjam.
To address some of the other suggestions:
I would be wary of adjusting the Rotate attribute directly, as this attribute is stored as text, and '90' or '270' obviously uses a different number of bytes to '0'. I believe inserting the required bytes can make a mess of the index tables that appear at the end of a PDF file. After that, you're reliant on a viewer being able to interpret the damaged file.
Rendering the PDF to an image and rotating that is going to rasterize any text or vector graphics, leading to either a much larger file size, or much lower quality.
You would have to use an external library like this to extract the info and generate an image, then put it back into the PDF (or a new one).
EDIT:
If you're going to get a logo or a diagram, this is a good choice. If it's a big document with text and lots of images, it's going to be pretty hard. Could you edit the OP with more info on what you need?
You will have to access the PDF as a binary file then find and adjust the "Rotate" attribute for each page (and possibly the "MediaBox" attribute). I am not aware of any PDF libraries for PHP that allow for this sort of direct manipulation of existing files. This method will not require changing anything about the content of the pages, it just changes the orientation the pages are displayed in by viewers (similar to the EXIF Orientation information in JPEG images).
This snippet of Perl should help illustrate what parts of the file you are looking for.
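For readers working in PHP, a read-only sketch of locating those entries could look like this. The helper name is hypothetical, and the approach assumes the page dictionaries are stored as plain text in the file, so it will not see entries inside compressed object streams. It only reports where the values are; rewriting them in place changes byte offsets in the file, which is the risk raised in the answer above.

```php
<?php
// Hypothetical helper: list every "/Rotate <number>" entry found in the raw
// bytes of a PDF, with the byte offset of the number and its value in degrees.
function findRotateEntries(string $pdfBytes): array
{
    preg_match_all('/\/Rotate\s+(-?\d+)/', $pdfBytes, $matches, PREG_OFFSET_CAPTURE);

    $entries = [];
    // With PREG_OFFSET_CAPTURE each capture is a [text, byteOffset] pair.
    foreach ($matches[1] as [$value, $offset]) {
        $entries[] = ['offset' => $offset, 'degrees' => (int) $value];
    }

    return $entries;
}

// Usage sketch: $entries = findRotateEntries(file_get_contents('in.pdf'));
```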
There are a few libraries for handling PDFs with PHP.
Here's a good code example using such a library. I found it, just by Googling "PHP PDF":
http://www.fpdf.org/en/script/script2.php