Convert scanned pdf files to text-searchable pdf files - php

I want to convert scanned PDF files to text-searchable PDF files.
Given a scanned PDF as input, my expected output is a searchable PDF.
There are a few tools that extract the text from a scanned PDF file, but I want a text-searchable PDF file as output, not just the text.
I searched and found one solution here, but my production server runs CentOS on Amazon, and that tool installs only on Ubuntu.
I am ready to pay for it if required. Please point me to any open source or paid web API service, or any tool, that can convert a scanned PDF to a text-searchable PDF file.
I am using PHP in my web application.

There are several commercial web API services that will convert scanned PDFs (or scanned images generally) to searchable PDF. Of these, I would recommend trying ABBYY's Cloud OCR SDK. They've been in the OCR space for decades and use their own OCR engine, which, from my own observations and what I've heard from others, tends to give better OCR results than APIs built on other engines (e.g. Tesseract).
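As a rough illustration of what calling such a service from PHP could look like, here is a minimal sketch. Everything specific in it is an assumption to check against ABBYY's current documentation: the `processImage` endpoint, the `exportFormat=pdfSearchable` parameter, HTTP Basic authentication with an application ID and password, and sending the file as the raw request body. The function names are hypothetical.

```php
<?php
// Sketch only: the endpoint, parameter names, and auth scheme are assumptions
// to verify against the OCR service's current documentation.

/** Build a request URL that asks for a searchable PDF as the export format. */
function buildOcrRequestUrl(string $baseUrl, string $language = 'English'): string
{
    $query = http_build_query([
        'language'     => $language,
        'exportFormat' => 'pdfSearchable', // assumed name of the searchable-PDF format
    ]);
    return rtrim($baseUrl, '/') . '/processImage?' . $query;
}

/** Submit a scanned PDF; returns the raw response describing the created task. */
function submitScannedPdf(string $url, string $appId, string $password, string $pdfPath): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_USERPWD        => $appId . ':' . $password,      // HTTP Basic auth (assumed)
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => file_get_contents($pdfPath),   // file sent as the request body (assumed)
        CURLOPT_HTTPHEADER     => ['Content-Type: application/octet-stream'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('OCR request failed: ' . curl_error($ch));
    }
    return $response;
}
```

Services of this kind are usually asynchronous: the response identifies a task, and you poll for its status until a download URL for the finished searchable PDF is available.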

Related

Conversion of complex DOCXs to PDFs in a PHP web environment

I am working on a PHP web application which programmatically generates some DOCX files.
I want these files to be converted to PDF, but their layout is so complex that no PHP PDF generator library (domPDF, TCPDF, etc.) handles them well. Each one produces a poorly formatted PDF.
In this situation, I have decided to let Google Drive do the conversion. For this, I have to:
Upload the DOCX files to GDrive
And then export them in PDF...
I have read all of the GDrive API documentation, but it is very poor. I only want to execute one single PHP script which:
Uploads the file to GDrive
Downloads its exported PDF version
Lets the PDF be downloaded when the script is finished...
I am searching for the optimal way to achieve this behaviour (with or without GDrive; the LibreOffice/OpenOffice CLI command is not an option because I am on a web host and can't install any software).
Have you considered using a file conversion service to do this for you?
For complete transparency, I work for Zamzar (an online file conversion website). We have recently released a developer API - https://developers.zamzar.com/ - that would allow you to convert your DOCX files to PDF with little or no loss of formatting.
This would remove the need to convert your file(s) using the Google Drive API. Check out our fairly extensive docs here - https://developers.zamzar.com/docs.

Converting images to pdfs server side

I'm currently working on a project involving converting form results to a downloadable PDF, which is simple enough. I was recently asked, however, to add attachment functionality. I'm using dompdf to convert the form results to PDF, but is there a way to convert the attachments separately (can be jpg, png, doc/x, or pdf) to a PDF file and then append the attachment file to the dompdf output?
I can handle the implementation; are there any free libraries that will support anything like this? I found FPDF, which supports images, but it does not support Word files.
First of all, you will need to find a library for every kind of conversion you need (you mentioned jpg, png, and doc/x, but you didn't say if that was all of them.)
For common office formats, you can launch a headless instance of OpenOffice or LibreOffice (headless meaning it can run on a server without a graphical display). Then you can interact with it from various programming languages, or you can use a ready-made command-line tool such as pyodconverter to ask it to convert between file formats. This is the best way to convert doc and docx files to pdf, by the way, short of spending money on Microsoft software.
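A minimal sketch of the headless approach from PHP. It swaps pyodconverter for LibreOffice's own `soffice --headless --convert-to pdf` command line, and assumes `soffice` is installed and on the PATH. The function names are hypothetical.

```php
<?php
/** Build the shell command that converts one office document to PDF. */
function buildOfficeToPdfCommand(string $inputPath, string $outputDir, string $binary = 'soffice'): string
{
    return sprintf(
        '%s --headless --convert-to pdf --outdir %s %s',
        escapeshellcmd($binary),
        escapeshellarg($outputDir),
        escapeshellarg($inputPath)
    );
}

/** Run the conversion and return the path of the PDF that should have been written. */
function convertOfficeToPdf(string $inputPath, string $outputDir): string
{
    exec(buildOfficeToPdfCommand($inputPath, $outputDir) . ' 2>&1', $output, $exitCode);

    // LibreOffice names the result after the input file, with a .pdf extension.
    $pdfPath = rtrim($outputDir, '/') . '/' . pathinfo($inputPath, PATHINFO_FILENAME) . '.pdf';
    if ($exitCode !== 0 || !is_file($pdfPath)) {
        throw new RuntimeException("Conversion failed:\n" . implode("\n", $output));
    }
    return $pdfPath;
}
```

Calling `convertOfficeToPdf('/tmp/uploads/report.docx', '/tmp/pdf')` would then be expected to produce `/tmp/pdf/report.pdf`. The paths are placeholders.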
As for "appending the attachment file", by which I take it you mean concatenating a bunch of PDF files together, you can use the free tool pdftk.
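For the concatenation step, a small sketch that builds a `pdftk ... cat output ...` command from PHP. It assumes pdftk is installed on the server. The helper name and file names are hypothetical.

```php
<?php
/** Build a pdftk command that concatenates the given PDFs, in order, into one file. */
function buildPdftkCatCommand(array $inputPdfs, string $outputPdf): string
{
    if (count($inputPdfs) === 0) {
        throw new InvalidArgumentException('At least one input PDF is required.');
    }
    $args = array_map('escapeshellarg', $inputPdfs);
    return 'pdftk ' . implode(' ', $args) . ' cat output ' . escapeshellarg($outputPdf);
}

// Usage (not executed here): dompdf output first, converted attachments after it.
// exec(buildPdftkCatCommand(['form.pdf', 'attachment1.pdf'], 'combined.pdf'), $out, $exitCode);
```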

How to convert pptx and ppt to pdf and swf

I am trying to convert pptx and ppt files to swf on a Linux server.
I can convert from pdf to swf, so I have settled on looking for a pptx and ppt to pdf converter.
I have looked at Open Office, but it seems to require an x server for its full version, and OdfConverter does not seem to work right, even to odf.
Is there an API that can do this for me, or does anyone have experience doing this sort of thing well? I found an API that would do this, but instead of charging for the service, they want us to put a link that is unacceptable for our site.
You can run OpenOffice without an X server (in headless mode) and use the UNO interface from PHP with the PUNO module.

Best way to convert files into pdf files using php

In your opinion, what is the best way to convert uploaded files of any kind (.doc, .docx, ...) into a PDF file using nothing but PHP? Is it even possible?
I looked at FPDF, but it creates PDF files from text.
Another solution given previously was to use the PDFlib library on the server, but unfortunately my server doesn't support this library.
What is the best way to convert the files my users upload to my site into PDF files?
A simpler approach would be to restrict uploads to PDF programmatically and require your users to upload only .pdf files. Provide a link on the upload page to a free PDF printer (e.g. CutePDF) that the user can install to create .pdf documents from any file that can be printed.
Trying to do it through PHP will be problematic because the uploads could be generated from many different programs that would be impossible to cater for in their entirety. e.g. How would it handle Scribus or ABC Flowcharter or any other 'non-standard' application someone used to create a document?
Much better to filter the upload upfront.
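One way to enforce that filter is to check the file's leading bytes rather than trusting its extension, since a PDF normally starts with `%PDF-`. A minimal sketch follows. The function name is hypothetical, and the check is deliberately strict: it rejects files with junk before the header, which some PDF readers would still open.

```php
<?php
/** Return true if the file at $path starts with the PDF header marker. */
function looksLikePdf(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false; // unreadable or missing file
    }
    $header = fread($handle, 5);
    fclose($handle);
    return $header === '%PDF-';
}
```

In an upload handler you would call it on the temporary file, e.g. `looksLikePdf($_FILES['document']['tmp_name'])`, where `document` is whatever name your form field uses.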
The best server-side PDF generator of those I have tried so far is wkhtmltopdf, a WebKit-based, self-contained invisible browser that can render any HTML+CSS and generate a PDF from it. It is reasonably fast and fairly reliable, and has some useful PDF options, such as page size and orientation.
The second part of the job in your case is to convert documents to HTML prior to feeding them to wkhtmltopdf. If possible, have your users upload the docs in HTML (Word and Co. can export (crappy) HTML). If this is not an option, you will have to find a tool just for that, which, in my opinion, is much easier than finding a tool that converts Word docs directly into PDF.
Another good thing about wkhtmltopdf is that you can feed it the output of your PHP script, captured with the ob_xxx() output buffering functions.
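A minimal sketch of that idea: capture the script's HTML with output buffering, then pipe it to wkhtmltopdf on standard input (the `-` argument). It assumes the `wkhtmltopdf` binary is on the PATH. The function names are hypothetical.

```php
<?php
/** Run $render and return everything it echoed, using output buffering. */
function captureOutput(callable $render): string
{
    ob_start();
    try {
        $render();
    } finally {
        $html = ob_get_clean(); // always close the buffer, even if $render throws
    }
    return $html;
}

/** Build a command that reads HTML from stdin ("-") and writes $outputPdf. */
function buildWkhtmltopdfCommand(string $outputPdf, string $pageSize = 'A4', string $orientation = 'Portrait'): string
{
    return sprintf(
        'wkhtmltopdf --quiet --page-size %s --orientation %s - %s',
        escapeshellarg($pageSize),
        escapeshellarg($orientation),
        escapeshellarg($outputPdf)
    );
}

/** Pipe the HTML into wkhtmltopdf and fail loudly on a non-zero exit code. */
function htmlToPdf(string $html, string $outputPdf): void
{
    $spec = [0 => ['pipe', 'r'], 2 => ['pipe', 'w']]; // stdin in, stderr out
    $process = proc_open(buildWkhtmltopdfCommand($outputPdf), $spec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start wkhtmltopdf.');
    }
    fwrite($pipes[0], $html);
    fclose($pipes[0]);
    $errors = stream_get_contents($pipes[2]);
    fclose($pipes[2]);
    if (proc_close($process) !== 0) {
        throw new RuntimeException('wkhtmltopdf failed: ' . $errors);
    }
}
```

Typical use would be `htmlToPdf(captureOutput(function () { include 'report.php'; }), '/tmp/report.pdf');`, where `report.php` stands for whichever script prints the HTML.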
PHPExcel is the best simple way to create doc, docx, xls, xlsx, and pdf files with PHP. It's a lot easier to use and has clear documentation.
Use Microsoft Office to render Microsoft Office documents, if you care about accuracy at all. This is easily done by invoking Office over COM.
Get access to your server, and install what you need. Doing so would be far easier than monkeying around with sub-par solutions.
Well... I can think of one way of doing it quite easily, but it doesn't involve using PHP.
Upload your documents to a folder on your server that is browsable by your users.
E.g.: http://mysite.com/docs/
Then get your users to install a virtual printer driver such as PrimoPDF:
http://www.primopdf.com/index.aspx
They can then load the document into their browser and print it to PDF for offline browsing.
If this is not an option, and you're dealing with Office documents that conform to the Open XML standard, you could attempt to parse the XML doc into a PHP page for display in the browser, then use JavaScript to trigger a print.
Unfortunately, that still depends on your user having a PDF printer installed.
Alternatively, you could just load the docs natively and print them to your own PDF printer, then upload the PDFs to the web server for download.
I can't think of any easy way of doing this otherwise, without installing all sorts of different document parser tool-kits and doing a huge amount of behind the scenes work.
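For the Open XML parsing idea mentioned above, here is a rough sketch of pulling paragraph text out of a .docx and emitting bare HTML. It assumes PHP's zip and dom extensions are available, and it ignores all formatting, tables, and images, so it only shows the starting point. The function names are hypothetical.

```php
<?php
/** Convert the contents of word/document.xml into one <p> per Word paragraph. */
function docxXmlToHtml(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $html = '';
    foreach ($xpath->query('//w:p') as $paragraph) {
        $text = '';
        // A paragraph's text is split across runs; join every w:t inside it.
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $text .= $textNode->textContent;
        }
        $html .= '<p>' . htmlspecialchars($text) . "</p>\n";
    }
    return $html;
}

/** Read word/document.xml out of the .docx (a zip archive) and convert it. */
function docxToHtml(string $docxPath): string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException('Cannot open ' . $docxPath);
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('No word/document.xml found in ' . $docxPath);
    }
    return docxXmlToHtml($xml);
}
```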

PDF parsing specific text

Hi, I'm working on an app that parses out PDF data for viewing on mobile devices. I'm looking for a way to scan through a PDF file for specific text and get the x and y coordinates of that text block. Is that even possible? I'm working on a Linux server with PHP, but I'm flexible about using whatever means get this working. Thanks.
Commercial options:
TET (Text Extraction Toolkit) SDK from http://www.pdflib.com; Acrobat plug-in available for testing the mechanism
pdfToolbox SDK from http://www.callassoftware.com; interactive desktop version available for testing
If you are ready to do some more of the coding yourself: Adobe PDF Library SDK, available through Datalogics
All are pretty mature. TET is very specific to text extraction. pdfToolbox is a general purpose SDK for analyzing and manipulating PDFs (but has a specific feature to do text extraction, with coordinates of text on the page). Adobe PDF Library is more of a general purpose development tool (it offers a lot of low-level features, but you would have to write the code that finds text/words/characters and pulls out the coordinates).
Disclaimer: I work for callas software, my view on pdfToolbox may be biased.
