PDF parsing specific text - php

Hi, I'm working on an app that parses PDF data for viewing on mobile devices. I'm looking for a way to scan a PDF file for specific text and get the x and y coordinates of that text block. Is that even possible? I'm working on a Linux server with PHP, but I'm flexible about using whatever it takes to get this working. Thanks.

Commercial options:
TET (Text Extraction Toolkit) SDK from http://www.pdflib.com; Acrobat plug-in available for testing the mechanism
pdfToolbox SDK from http://www.callassoftware.com; interactive desktop version available for testing
If you are ready to do more of the coding yourself: Adobe PDF Library SDK, available through Datalogics
All are pretty mature. TET is specific to text extraction; pdfToolbox is a general-purpose SDK for analyzing and manipulating PDFs (with a specific feature for text extraction that returns the coordinates of text on the page); and Adobe PDF Library is a general-purpose development tool (it offers a lot of low-level features, but you would have to write the code that finds text, words or characters and pulls out their coordinates).
Disclaimer: I work for callas software, my view on pdfToolbox may be biased.

Related

Convert scanned pdf files to text-searchable pdf files

I want to convert scanned PDF files to text-searchable PDF files.
The input is a scanned PDF and the expected output is a searchable PDF.
There are a few tools that produce plain text from a scanned PDF, but I want a text-searchable PDF as output, not just the text.
I searched and found one solution here, but my production server runs Amazon CentOS, and that tool only installs on Ubuntu, not on Amazon CentOS.
I am ready to pay for it if required. Please point me to any open-source web API, paid web API service, or tool that can convert to a text-searchable PDF.
I am using PHP in my web application.
There are several commercial web API services that convert scanned PDFs (or scanned images generally) to searchable PDF. Of these, I would recommend trying ABBYY's Cloud OCR SDK. They've been in the OCR space for decades and use their own OCR engine, which, from my own observations and what I've heard from others, tends to give better results than APIs built on other engines (e.g. Tesseract).

reports in PDF with tables in PHP

I have to generate a few reports in PDF format with some inventory stats (no graphs, only tables). Additionally, I have to generate PDF labels for placed orders and units in a nice tabular format (handling landscape orientation and line wrapping) for the web platform. Which PHP API or library would be best suited to this? I am using Zend Framework, but Zend's PDF API is not rich enough for the job.
One option I am considering is to use LaTeX for generating the PDFs.
Advice? Suggestions?
There are several PDF generation libraries and executables.
I've used:
TCPDF
DOMPDF
html2pdf, as @redreggae suggested
wkhtmltopdf
Many other alive & dead solutions
They all render HTML to PDF. The problem with all of them (except wkhtmltopdf) is that each uses its own non-standard rendering engine, so results often differ between them and are unsatisfying. wkhtmltopdf uses WebKit to interpret the HTML and create a PDF file. I personally prefer wkhtmltopdf after trying or using (in production) all of the others listed. It has one drawback: it is an executable, so it must be called with exec(). This should not be a big issue as long as you code it properly to prevent command injection (for example, by escaping any user-supplied arguments).
If you want something higher level than HTML-to-PDF converters, you can try PHPJasperXML, a renderer for JasperReports written in pure PHP.

Displaying Word documents, Excel sheets and PowerPoint in the browser

Is there any way to display Word documents, Excel sheets and PowerPoint files in the browser without downloading them?
I assume that you are going to use PHP for this, so you can try checking out some libraries, such as PHPWord.
If you only wish to display the document content, that can be done with a scripting language such as PHP. Basically, Office 2007+ formats are zipped XML documents with a changed extension. Create a simple Word 2007+ document, save it and change the extension from .docx to .zip; then you can extract it and see what it's made of. You can find a lot of details here. Displaying the content may be a little tricky, though. As mentioned, there are libraries out there to handle this, but I am not really sure how well they handle the documents. Most of them are abandoned, and PHPWord has been in beta since 2011.
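The "zipped XML" structure described above can be sketched in a few lines. This is only an illustration under some assumptions: it is written in Python (in PHP the same steps would presumably go through ZipArchive and an XML parser), the function name docx_paragraphs is made up for this example, and it reads only the main body part, word/document.xml, ignoring styles, images, headers and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper name, for illustration only.)
    """
    # A .docx is an ordinary zip archive; the body text lives in this part.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every <w:t> inside it.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("report.docx") (the file name is just an example) would give a list of strings that can be escaped and wrapped in HTML paragraph tags. All formatting is lost, which is the "tricky" part mentioned above.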
There are some indications that Apache is working on a cloud version of OpenOffice, but there is no release date yet. Once it is done, you will have a full-featured office suite as a web app.
If you feel really creative, you could use a cron job (or a scheduled task if you like Windows) to open a document, take a screenshot and produce a .jpg or .png version of it (this works fine with short documents; longer ones may be problematic), then display that in the browser without much complication. It is also possible to schedule an export to .pdf, which most browsers can display natively or through a PDF plugin.
To sum up, using PHP to parse simple documents should be fine, but getting complex documents to display properly may be a much more difficult task and possibly not worth your time. I would go for a cron export to PDF, to preserve most if not all of the document's structure.

Exporting SVG to PDF in an offline TideSDK webapp

I have an offline HTML5/CSS/JS app built with TideSDK in which a bar chart is drawn with Highcharts as an SVG "tag" using data entered by the user. I need to export this chart to a PDF document, which will also contain text and tables.
As it is an offline app, I can't use the export module included in Highcharts (except the getSVG() method) or other solutions like DocRaptor.
I'm open to using another JS plugin for drawing the chart, but I really love the look and feel and the features of Highcharts graphs.
As you may know, with TideSDK I can embed Python, PHP or Perl scripts/modules in my app (I'd prefer to avoid Perl as I've never used it).
The other limitation is that I cannot ask end users to install any software other than mine, so I can't use wkhtmltopdf with PHP, unless I manage to install it from within my app transparently (not sure that is particularly easy to do).
After searching for several days, my current idea is to use the CairoSVG Python module to export the graph to a first PDF. Then I will find a JS (jsPDF) or Python tool to include this PDF in the final PDF containing the text and tables.
I will start testing this solution soon and let you know whether it works. Nevertheless, if any of you have already dealt with a similar problem, I would be very happy to hear your solutions.
The app will initially run on Windows and should be adapted to MacOS, Linux, Android and iOS in later phases.
There is a simple command-line tool, wkhtmltopdf, that converts HTML pages to PDF files. It's not strictly a Python/JS solution, but you can call os.system() or something similar from Python to run it. It helped me with my Python program and it is very simple to use. The command is "wkhtmltopdf.exe inputname outputname" and you get what you want. And it's free software. Site: https://code.google.com/p/wkhtmltopdf/
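As a minimal sketch of the "call it from Python" step: the standard subprocess module can run the tool with an argument list, which avoids assembling a shell command string by hand. The helper names (build_wkhtmltopdf_command, html_to_pdf) and the default executable name are assumptions made for illustration; only the "executable inputname outputname" form comes from the answer above.

```python
import subprocess


def build_wkhtmltopdf_command(input_name, output_name, executable="wkhtmltopdf"):
    """Build the argument list for: <executable> inputname outputname."""
    for name in (input_name, output_name):
        # A name starting with "-" could be read as an option, not a file.
        if not name or name.startswith("-"):
            raise ValueError(f"unsafe file name: {name!r}")
    return [executable, input_name, output_name]


def html_to_pdf(input_name, output_name, executable="wkhtmltopdf"):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    command = build_wkhtmltopdf_command(input_name, output_name, executable)
    # Passing a list (and no shell=True) means no shell interprets the names.
    subprocess.run(command, check=True)
```

On Windows this might be called as html_to_pdf("chart.html", "chart.pdf", executable="wkhtmltopdf.exe"), with both file names being placeholders. The executable still has to be shipped with or installed alongside the app.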

turn web page into an image on the fly?

I was wondering if there was any way of turning an entire HTML page into a png (or other kind of image?) I'm trying to create PDFs on the fly, but it's pulling across my styles as text, but I want the styles to stay the same as the page (cufon and all). Any help would be appreciated! :)
This doesn't look straightforward. The backend (PHP etc.) doesn't do rendering or layout. It merely generates content.
The layout and visual aspects of the website are handled by your client (the browser), and the backend has no way of accessing them.
However, given an HTML file, there are tools such as Prince XML that seem capable of rendering it into a PDF.
The only way to generate an image identical, or even near, what a visitor sees in their browser when viewing your site is to launch a browser and take a screenshot. You need the browser's rendering engine to render the page. All the libraries you find to do it without a browser create something much different than what the visitor sees, and won't render cufon or other fancy things at all.
Companies that offer screenshot previews of a webpage now run many servers, each running many virtual PCs, each running a full operating system and real web browser. They have all those systems pulling jobs, opening the webpages in real browsers, taking screenshots and saving images. You won't replicate that with a little PHP script.
http://ipinfo.info/html/rendering_services.php
Turning web pages into images and PDFs is a royal pain using PHP. Solutions often require OS-level scripting, fake printer drivers, or screen capturing, which can make for a rather fragile setup. I ran into the same issue a few years ago and started working on a native PHP extension that leveraged the Gecko engine to render HTML to PDF, but never finished it.
The best answer I've seen doesn't quite turn a full web page into a PDF, but instead does XML to PDF. XEP by RenderX is the commercial tool Apple uses to produce developer documentation in many formats, including HTML and beautifully rendered PDFs, from an XML source. The great thing about using the XEP tool in conjunction with PHP is that PHP deals with XML very well, so you can pass generated XML to the XEP binary, let it do the conversion to PDF, then deal with the resulting PDF file in PHP.
Consider building a regular PDF file that resembles your web page:
PHP::PDF - constructing using php.
PDF Reference - file structure.
