How do I extract the text layer and background layer from a PDF?

In my project I have to build a PDF viewer in HTML5/CSS3, and the application has to allow users to add comments and annotations. In fact, I have to build something very similar to crocodoc.com.
At the beginning I was thinking of creating images from the PDF and letting users mark an area and post comments associated with that area. Unfortunately, the client also wants to navigate within the PDF and add comments only on allowed sections (for example, paragraphs or selected text).
So now I'm facing one problem: how to get the text, and what the best way to do it is. If anybody has some clues about how I can achieve this, I would appreciate it.
I tried pdftohtml, but the output doesn't look like the original document, which is really complex (example of document). Even this one doesn't really reflect the output, but it is much better than pdftohtml.
I'm open to any solution, with a preference for command-line tools under Linux.

I've been down the same road as you, with even more complex tasks.
After trying out everything, I ended up using C# under Mono (so it runs on Linux) with iTextSharp.
Even with a very complete library such as iTextSharp, some tasks required a lot of trial and error :)
Extracting the text from a page is easy (check the snippet below); however, if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.
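    using iTextSharp.text.pdf;

    int pdf_page = 5;
    string page_text = "";
    PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
    // Tokenise the raw content stream of the chosen page
    PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
    while (token.NextToken())
    {
        if (token.TokenType == PRTokeniser.TokType.STRING)
        {
            // String operands carry the visible text
            page_text += token.StringValue;
        }
        else if (token.StringValue == "Tj")
        {
            // "Tj" is the show-text operator; treat it as a break between strings
            page_text += " ";
        }
    }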
Do a Console.WriteLine(token.StringValue) on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.
Addition:
Given the task you are required to do, I have a suggestion for you:
Extract the text with coordinates, font families and sizes: all the information about each paragraph. Then do a PDF-to-images conversion, and in your online viewer, apply invisible selectable text over the paragraphs on the image where needed.
This way your users can select a part of the text where needed, without the need of reconstructing the whole PDF in html :)
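To make the overlay idea a bit more concrete, here is a minimal sketch of the coordinate conversion it needs. It assumes the extraction step gives you each paragraph's box in PDF points with the origin at the bottom-left of the page (the default PDF coordinate system), and that the page image was rendered at a known DPI. The function name and the field names (x, y, width, height, fontSize) are made up for illustration; they are not from iTextSharp or any viewer library.

```javascript
// Convert a text box given in PDF points (origin at the bottom-left of the
// page) into CSS pixel values for an absolutely positioned overlay element.
// Field names are illustrative assumptions, not a library format.
function toOverlayStyle(item, pageHeightPt, renderDpi) {
  const scale = renderDpi / 72; // there are 72 PDF points per inch
  return {
    left: item.x * scale,
    // Flip the vertical axis: CSS measures from the top edge downwards.
    top: (pageHeightPt - item.y - item.height) * scale,
    width: item.width * scale,
    height: item.height * scale,
    fontSize: item.fontSize * scale,
  };
}

// A US Letter page is 792 points tall; the image here is rendered at 144 DPI.
const paragraph = { x: 72, y: 700, width: 300, height: 12, fontSize: 12 };
console.log(toOverlayStyle(paragraph, 792, 144));
```

In the viewer, each result would become the left/top/width/height of a transparent, selectable element placed over the page image.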

I recently researched and discovered a native PHP solution to achieve this using FOSS. The FPDI PHP class can be used to import a PDF document for use with either the TCPDF or FPDF PHP classes, both of which provide functionality for creating, reading, updating and writing PDF documents. Personally, I prefer TCPDF, as it provides a larger feature set, a richer API, more usage examples and a more active community forum than FPDF.
Choose one of the aforementioned classes, or another, to programmatically handle PDF documents. Focusing on both current and possible future deliverables, as well as the desired user experience, decide where (e.g. server - PHP, client - JavaScript, both) and to what extent (feature driven) your interactive logic should be implemented.
Personally, I would use a TCPDF instance obtained by importing a PDF document via FPDI to iteratively inspect, translate to a common format (XML, JSON, etc.) and store the resulting representation in relational tables designed to persist data pertinent to the desired level of document hierarchy and detail. The necessary level of detail is often dictated by a specifications document and its mention of both current and possible future deliverables.
Note: In this case, I strongly advise translating documents and storing them in a common format to create a layer of abstraction and transparency. For example, a possible and unforeseen future deliverable might be to provide the same application functionality for users uploading Microsoft Word documents. If the uploaded Microsoft Word document was not translated and stored in a common format then updates to the Web service API and dependent business logic would almost certainly be necessary. This ultimately results in storing bloated, sub-optimal data and inefficient use of development resources in designing, developing and supporting multiple translators. It would also be an inefficient use of server resources to translate outbound data for every request, as opposed to translating inbound data to an optimal format only once.
I would then extend the base document tables by designing and relating additional tables for persisting functionality-specific document asset data such as:
- Versioned Additions / Edits / Deletions
  - What
    - Header / Footer
    - Text
      - Original Value
      - New Value
    - Image
      - Page(s) (one, many or all)
      - Location (relative - textual anchor, absolute - x/y coordinates)
      - File (relative or absolute directory or url)
    - Brush (drawing)
      - Page(s) (one, many or all)
      - Location (relative - textual anchor, absolute - x/y coordinates)
      - Shape (x/y coordinates to redraw line, square, circle, user defined, etc.)
      - Type (pen, pencil, marker, etc.)
      - Weight (1px, 3px, 5px, etc.)
      - Color
    - Annotation
      - Page
      - Location (relative - textual anchor, absolute - x/y coordinates)
      - Shape (line, square, circle, user defined, etc.)
      - Value (annotation text)
    - Comment
      - Target (page, another text/image/brush/annotation asset, parent comment - threading)
      - Value (comment text)
  - When
    - Date
    - Time
  - Who
    - User
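As a rough illustration of the comment threading mentioned in the list above, the sketch below groups flat comment rows into nested threads on the client. The row shape (id, parentId, value) is an assumption made for this example; the real columns would follow whatever tables you design.

```javascript
// Group flat comment rows into threads. Each row is assumed to carry an `id`
// and a `parentId` (null for a comment attached directly to a page or asset).
function buildThreads(rows) {
  const byId = new Map();
  for (const row of rows) {
    byId.set(row.id, { ...row, replies: [] });
  }
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parentId === null ? undefined : byId.get(node.parentId);
    if (parent) {
      parent.replies.push(node);
    } else {
      roots.push(node); // top-level comment, or the parent row is missing
    }
  }
  return roots;
}

const threads = buildThreads([
  { id: 1, parentId: null, value: "Is this clause final?" },
  { id: 2, parentId: 1, value: "Not yet." },
  { id: 3, parentId: 2, value: "Thanks." },
]);
console.log(JSON.stringify(threads, null, 2));
```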
Once some, all or more, of the document and its asset data has a place to persist I would design, document and develop a PHP Web service API to expose CRUD and PDF document upload functionality to the UI consumer, while enforcing core business rules. At this point, the remaining work now lies on the Client-side. Currently, I have relational tables persisting both a document and its asset data, as well as an API exposing sufficient functionality to the consumer, in this case the Client-side JavaScript.
I can now design and develop a Client-side application using the latest Web technologies such as HTML5, JavaScript and CSS3. I can upload and request PDF documents using the Web service API and easily render the returned common format out to the browser however I decide (probably HTML in this case). I can then use 100% native JavaScript and/or 3rd party libraries for DOM helper functionality, for creating vector graphics to provide drawing and annotation features, and for accessing and controlling functional and stylistic attributes of the currently selected document text and/or images. I can provide a real-time collaborative experience by employing WebSockets (in which case the aforementioned Web service API does not apply), or a semi-delayed, but still fairly seamless, experience using XMLHttpRequest.
From this point forward the sky is the limit and the ball is in your court!

It's a hard task you're trying to accomplish.
To read text from a PDF, have a look at PEAR's PDF_Reader proposal code.

There's also very extensive documentation around Zend_PDF(), which also allows the loading and parsing of a PDF document. The various elements of the PDF can be iterated over and thus also be transformed to HTML5 or whatever you like. You may even embed the annotations from your website into the PDFs and vice versa.
Still, you have been given no easy task. Good Luck.

pdftk is a very good tool for doing things like that (I don't know if it can do exactly this task).
http://www.pdflabs.com/docs/pdftk-cli-examples/

Related

Loading several thousands of images efficiently

I am developing a website that uses Google Maps.
The map has 4000-5000 markers.
When the client enters the website, the server determines all "active" markers and sends a JSON document telling the client information about each marker and which marker icon to use (a URL to an image on the server, e.g. icon: '/icon/xxx.png').
The website loads instantly, but it takes about 5 seconds until all markers are shown, since the client has to fetch those ~5000 images.
The images can change, so the server only knows exactly which image each marker uses at the moment the client asks for it.
How can I speed up this process?
Can I dynamically create a spritesheet of some sort, or pack all those files and let the client unpack them for faster loading?
The server backend is PHP for this part.

You can indeed create a spritesheet, or if they are relatively simple in style, you can also create them dynamically as SVGs within your page code, and when you declare each marker, you give it a path defined in SVG within your JS (as served by your main page, or a static file containing functions for many of them).
SVG info here:
https://developer.mozilla.org/en-US/docs/Web/SVG/Tutorial/Paths
Some good examples are in this prior answer: How to use SVG markers in Google Maps API v3
A spritesheet is a good way to do this if the total number is relatively unchanging, and each client is likely to use a large subset of the sprites.
The downside of a spritesheet in my experience is maintenance - combining a code and an art workflow can be a bit of a pain!
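For the spritesheet route, here is a small sketch of the bookkeeping only. It assumes all icons are squares of the same size and are packed left to right in a fixed number of columns; the function names and the marker shape ({ icon: '/icon/xxx.png' }) are illustrative. Compositing the actual image on the server is not shown.

```javascript
// Many markers usually share a few icons, so only distinct icons are packed.
function distinctIcons(markers) {
  return [...new Set(markers.map((m) => m.icon))];
}

// Compute where each icon sits inside a grid-shaped spritesheet.
// iconNames: distinct icon identifiers in the order they were packed.
function spriteLayout(iconNames, iconSize, columns) {
  const layout = {};
  iconNames.forEach((name, index) => {
    layout[name] = {
      x: (index % columns) * iconSize,
      y: Math.floor(index / columns) * iconSize,
      width: iconSize,
      height: iconSize,
    };
  });
  return layout;
}

const markers = [
  { icon: "/icon/a.png" },
  { icon: "/icon/b.png" },
  { icon: "/icon/a.png" },
  { icon: "/icon/c.png" },
];
console.log(spriteLayout(distinctIcons(markers), 32, 2));
```

With a layout like this the client fetches one image instead of thousands, and each marker only needs the offset of its icon within that image.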

Save html file as PDF

I'm using a PHP output buffer to build the HTML of a dynamic 'Data Review' page, and I save this output as an HTML file on the server. I would like to create a PDF from this stored HTML file automatically, but every solution I've looked at requires you to put the HTML code into a variable, and I can't seem to find one that works from the .html file itself.
The overall idea here is to supply the user a 'copy' of the data review via email, so I assumed a PDF would be best, but if there are any other suggestions, I would happily consider something else.
Any help would be greatly appreciated.
Thank you!

I've looked heavily into generating PDFs in PHP and so here is what I've found over a few years...
PDF Conversion tools
FPDF
This option is really good if you want to generate a PDF file using the PDF method (I will coin it this because you literally generate the PDF piece by piece).
Features include:
Choice of measure unit, page format and margins
Page header and footer management
Automatic page break
Automatic line break and text justification
Image support (JPEG, PNG and GIF)
Colors
Links
TrueType, Type1 and encoding support
Page compression
Notes
Performance: Fast
Cost: Free
Ease of use: Difficult
Difficult to use unless you play a lot with it.
Good documentation.
Other:
Duplication of files (need to have HTML version of a page and an FPDF version of a page if you need to generate PDFs)
MPDF
This option is really good if you want to generate a PDF file from HTML and CSS and still have additional and extensive PDF customization.
Features include:
PDF generation from UTF-8 encoded HTML
It is based on FPDF and HTML2FPDF with a number of enhancements
Notes
Performance: Mediocre
Not the fastest but does the job
Cost: Free
Ease of use: Easy
(Hardest part is knowing what is and is not valid HTML and CSS for MPDF)
Great documentation.
Not all CSS is supported and some CSS is extended causing some confusion
PrinceXML
This option is probably the best if you want high performance and high reliability.
Features include:
Powerful Layout
Headers and footers
Page numbers, duplex printing
Tables, lists, columns, floats
Footnotes, cross-references
Web Standards
HTML, XHTML, XML, SVG
Cascading Style Sheets (CSS)
JavaScript/ECMAScript
JPEG, PNG, GIF, TIFF
PDF Output
Bookmarks, links, metadata
Encryption and Document Security
Font embedding and subsetting
PDF attachments
Easy Integration
PHP and Ruby on Rails
Java class for servlets
.NET for C# and ASP
ActiveX/COM for VB6
Fonts & Unicode
OpenType fonts, TrueType and CFF
Kerning, Ligatures, Small Caps
Chinese, Japanese, Korean, Arabic, Hebrew, Hindi and others
Friendly Support
Prompt email support
Web forum, user guide
Regular upgrades
Notes
Performance: Fast
Pricing: $$$
Server License
1 license - $3,800
2 license - $3,420
3 license - $3,040
4 license - $2,850
5+ license - $2,800
OEM (with minimum commitment of 2 years, can be run on any number of servers; so you can create a server farm if you really need)
20,000 documents/month at $5,000
100,000 documents/month at $7,500
500,000 documents/month at $10,000
They also have an academic discount of 50% at $1,900 and a Desktop License for $495 as well as other plans (see here for full list)
Ease of use: Easy
I have not used PrinceXML directly (pricey), but we are currently looking into this as an option for our business.
DocRaptor
This option is really good if you want a high quality API. This is a cloud-hosted option for creating PDF and XLS files. Uses PrinceXML in the backend.
Features include:
You just send HTML, JS, and CSS
Uptime guaranteed
Unlimited document size
Expert support, including document debugging
Pretty much offers everything that PrinceXML does, but double check with their support or documentation for anything specific you may require.
API-based: Works with PHP, NodeJS, Ruby, Python, Java, C#
Notes
Performance: Fast
Depends on internet connection, so if your internet goes down, so does this part of your code.
Pricing: $ - $$$
Currently, their pricing plans are as follows (taken from their website):
Basic - 125 docs/mo - $15/mo
Professional - 325 docs/mo - $29/mo
Premium - 1,250 docs/mo - $75/mo
Max - 5,000 docs/mo - $149/mo
Bronze - 15,000 docs/mo - $399/mo
Silver - 40,000 docs/mo - $1,000/mo
Gold - 100,000 docs/mo - $2,250/mo
Enterprise - ∞ docs/mo - unlisted (contact them)
Ease of use: Very easy
Probably the easiest because you don't actually deal with the document or setup, etc. You just send your files and get a PDF back.
Great documentation
I contacted their support in the past and it was actually very helpful.
They use a proprietary JavaScript engine that allows you to use delayed or asynchronous JavaScript
wkhtmltopdf
This option is really good if you want the next best thing behind the purchased options above (PrinceXML and DocRaptor).
Features include:
[Uses] the Qt WebKit rendering engine
Create your HTML document that you want to turn into a PDF (or image). Run your HTML document through the tool.
Notes
Performance: Fast
Cost: Free
Ease of use: Easy
Uses command line unless you use a library such as the one created by MikeHaertl
We currently use this option and find it performs very well and has great support for HTML tags and CSS properties.
If you need to send variables to the PDF pages that need to be generated, you cannot use $_SESSION variables, as this is run through the command line and uses a separate browser. You need to pass all your variables through $_GET variables.
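A minimal sketch of that last point: the page state goes into the query string, and the resulting URL plus an output path are handed to wkhtmltopdf, whose basic invocation is `wkhtmltopdf <input> <output>`. The page URL, the parameter names and the output path below are hypothetical.

```javascript
// Build the argument list for a wkhtmltopdf run, passing page state through
// the query string because the renderer does not share the user's session.
function wkhtmltopdfArgs(pageUrl, params, outputPath) {
  const url = new URL(pageUrl);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return [url.toString(), outputPath];
}

const args = wkhtmltopdfArgs(
  "https://example.com/review.php", // hypothetical page
  { review_id: 42, token: "one-time-token" }, // hypothetical parameters
  "/tmp/review-42.pdf"
);
console.log(["wkhtmltopdf", ...args].join(" "));
```

Keeping the arguments as a list, rather than one concatenated string, makes it easier to pass them to a process launcher without shell-quoting problems. In PHP, each argument would typically go through escapeshellarg() before being handed to the shell.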
Other options: Many taken from this question
Cloud-based
HTM2PDF
PDFmyURL
PDFCrowd
PDFLayer
RotativaHQ
Client-side
jsPDF
Server-side
TCPDF - Many people recommended this option
ZendPDF - Part of Zend Framework
flying-saucer - Java library usable via system()
CutyCapt
PhantomJS
Snappy
DOMPDF
HTML2PDF
PDFReactor
HTML2PS - No solid links for this project, so I linked to Google search for it
Apache FOP
PHP - PHP has its native library for creating PDFs, I assume this is probably one of the most difficult ways to go about doing this, but if you're really adventurous, why not?
PDFLib - Many other libraries are based off this one
ReportLab - Python-based
iText - Java-based: Source
ActivePDF
WeasyPrint - Python-based. This is apparently really good?
xHTML2PDF - Python-based
Other options
We deal with many vendors. Some vendors send us PDFs for their invoices or other documents while others send us HTML emails (with all our invoice information in it), and some others even send us links to the invoices.
The easiest option is to create the document in HTML and send users a link to that document (secured obviously). This would allow users to view the invoice whenever they want (and from any device with a browser) and would also allow them to print from the browser if needed. This method also generates traffic to your website which is usually also beneficial to the business.
What we've done in the past is create a link to the file on the website (secured) so that they can view it in the browser, and then have a button to download the invoice (which just downloads a PDF version of that webpage generated with one of the PDF Conversion tools listed above - currently wkhtmltopdf).
In my opinion, the best method would be to combine all delivery approaches into one. Send an email with the file information in the email's HTML content and attach a PDF of that file. Inside the header portion of the email content (at the top of the email), send a link giving the recipient direct access to the webpage containing all the information (located within their account in your secure portal). This allows them to view it in the browser just in case they can't view it properly in their email and in case they don't have a PDF viewer (I know it's rare nowadays, but you'd be surprised just how many people out there have outdated systems - we still need to send faxes to some clients because they still don't have emails; yes still now in 2017, sigh...). On your website, also provide them with a download link for the PDF document (which would again just take the page they are currently on and convert it into a PDF and automatically download it through the browser).
I hope this helps!

I would like to add another option to the list of possible solutions. Aspose.PDF Cloud API also offers features to convert HTML to PDF. It provides SDKs for all popular programming languages.
PHP sample code for HTML to PDF conversion:
// HTML file packaged together with its resource files.
// $pdfApi is assumed to be an already configured Aspose.PDF Cloud API client.
$name = "HtmlWithImage.zip";
$html_file_name = "HtmlWithImage.html";
$height = 650;
$width = 250;
$src_path = $name;
$response = $pdfApi->getHtmlInStorageToPdf($src_path, $html_file_name, $height, $width);
print_r($response);
echo "Completed!!!!";
I work with Aspose as a developer evangelist.

Table data extraction from image or scanned documents (Not pdf)

I want to extract the table data from images or scanned documents and map the header fields to their particular values, mostly in insurance documents. I have tried extracting the text line by line and then mapping it using its position on the page. I gave the table boundary by defining a table start and end pivot, but it doesn't give me proper results, since headers sometimes span multiple lines (I had implemented this in PHP). I also want to know whether I can use machine learning to achieve the same.
For pdf documents I have used tabula-java which worked pretty well for me. Is there a similar type of implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, MODEL = FZ-S, CC = 153, etc.
Thanks.

I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDF, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick the fields of interest.
I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries, they offer very good performance.
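As a sketch of the "regex after OCR" step, assuming the OCR text keeps each label next to its value on the same line. The labels and the sample string are modelled on the values named in the question (Make = YAMAHA, MODEL = FZ-S, CC = 153); they are not real OCR output, and the patterns would need adjusting per provider.

```javascript
// Pick labelled fields out of OCR output with one regex per field.
// Labels and patterns are assumptions based on the question's example.
const FIELD_PATTERNS = {
  make: /\bMake\b\s*[:=]?\s*([A-Z][A-Z0-9-]*)/i,
  model: /\bModel\b\s*[:=]?\s*([A-Z0-9][A-Z0-9-]*)/i,
  cc: /\bCC\b\s*[:=]?\s*(\d+)/i,
};

function extractFields(ocrText) {
  const result = {};
  for (const [name, pattern] of Object.entries(FIELD_PATTERNS)) {
    const match = ocrText.match(pattern);
    result[name] = match ? match[1] : null; // null when the label is absent
  }
  return result;
}

const sample = "Make: YAMAHA   Model: FZ-S\nCC 153  Year 2015";
console.log(extractFields(sample));
```

If the table prints all headers on one row and all values on the next, label-adjacent patterns like these will not match, and the fields would have to be paired up by position instead.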

The easiest and most reliable way to do it without much knowledge in OCR would be this:
- Take an empty template for reference and mark the coordinates of the boxes that you need to extract data from. Label them and save them for future use. This is done only once for each template.
- Now, when reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if it doesn't match already).
- You already have every box's coordinates and know what data it should contain (because you labeled and saved them in the first step).
Which means that now you can just analyze the pixels contained in each box to know what is written there.
This means that given a list of labeled boxes (that you defined in the first step), you should be able to get the data in each one of these boxes. If this data is typed and not handwritten, the extracted data will be easier to analyze, or to do whatever you want with, using simple OCR libraries.
Or, if the data is always the same size and font like in your example template above, then you could just build your own small database of letters of that font and size. Or maybe full words? It depends on each box's possible answers.
Anyway this is not the best approach by far but it would definitely get the work done with minimal effort and knowledge in OCR.
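A small sketch of the box bookkeeping. Note that it does the inverse of the resize step described above: instead of resizing the scan to the reference template, it scales the saved boxes to the scan's size, which gives the same crop regions without resampling the image. The box shape (label, x, y, width, height) and the sizes are assumptions for illustration.

```javascript
// Scale labelled boxes from the reference template to a scan of another size.
function scaleBoxes(boxes, refSize, scanSize) {
  const sx = scanSize.width / refSize.width;
  const sy = scanSize.height / refSize.height;
  return boxes.map((box) => ({
    label: box.label,
    x: Math.round(box.x * sx),
    y: Math.round(box.y * sy),
    width: Math.round(box.width * sx),
    height: Math.round(box.height * sy),
  }));
}

// Boxes marked once on a 1000x1400 reference image (hypothetical values).
const templateBoxes = [
  { label: "make", x: 100, y: 200, width: 50, height: 20 },
  { label: "model", x: 300, y: 200, width: 80, height: 20 },
];
console.log(scaleBoxes(templateBoxes, { width: 1000, height: 1400 }, { width: 2000, height: 2800 }));
```

Each scaled box is then the region to crop and hand to the OCR step.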

Creating printable content with Php/JavaScript/Html/CSS

I work for a care centre that would like a feature on their website where friends and family can choose from a selection of care cards to deliver to someone they know. They will be able to choose a title, an image and type in some text on the card that we assemble and deliver. They need me to make an application for them that assembles the cards in a printer-friendly fashion (placing text and images in the right areas) that they will print and fold before delivery.
Image of what I am trying to create: http://i.imgur.com/f8GnD.png
Reading about how to do this I realize that I have two issues:
Size of card on-screen can't be fixed due to printer DPI
Should I use html/CSS to make a table with 4 cells to create this card? Php image library? JavaScript?
Any help would be great.

I have the best luck, in terms of printing, with PDFs. The document format is nice, too, because it is portable and the user may choose to print somewhere other than where they accessed your site.
The best PDF-generating library I've used for PHP is fPDF: http://www.fpdf.org/
PDFs are great for printing full-page documents. All but the most ancient operating systems provide users the ability to open and print PDFs, and because PDF is a document format the printed output is fairly consistent between systems and printers.
The other route you suggest is certainly possible - you can build it up using HTML and CSS. There are serious drawbacks to this, however. Foremost, each user is going to have different printer settings in their browser, and the browser is not configured by default to suit full-page printing. Most user agents add page numbers, margins, the date & time, the URL... In short, printing from the browser is going to rely on the user tinkering with their browser print settings. There is nothing you can do to influence these settings from your end.

There are third-party utilities that generate PDFs on the server, based on your HTML. PDFs have solved many print-related issues internally so you don't have to worry about them yourself.

What's the best way to capture a signature online?

I'm building a website application in PHP that requires a signature from the end user.
For this part of the website it will be exclusively viewed on Windows based tablets.
So, my question is this:
What's the best way to capture a signature online?
I've looked at flash or HTML5 canvas/excanvas, but I am seeking a more experienced answer.
Thanks.

From: http://willowsystems.github.io/jSignature
jSignature is a JavaScript widget (a jQuery plugin) that simplifies creation of a signature capture field in a browser window, allowing a user to draw a signature using mouse, pen, or finger.
- Works in all mainstream browsers that support Canvas or Flash
- Captures signatures as smooth vector images. (Yes, SVG is supported!)
- Ingenious, super-efficient (i.e. not lagging) real-time curve smoothing.
- Allows manipulation of signature strokes, like "Undo last stroke"
- Automatically adapts to your page's layout and colors.
- Free and Open Source.
The documentation is very clear and the demo shows you how it works.

I needed to capture signatures for a current project and found Signature Pad to be very useful. It uses HTML5 canvas, JSON and jQuery.
https://thomasjbradley.github.io/signature-pad/examples/accept-signature.html

Flash would be great if you need to support older tablets running systems that are not HTML5-capable. Some things to keep in mind:
Try to transfer the data as a common image format. GIF or PNG would be ideal. This will make it far easier to keep track of and to parse through at a later date. Future-proofing, since a custom or uncommon format may fall out of favor faster.
Transfer the data over a secure connection. Always.
Remember that the legality of this is dubious. Both for use as a binding contract, and also for the transfer of the signature itself. Consult a lawyer if you haven't yet. Ideally one who deals with digital contracts.
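On the "common image format" point: an HTML5 canvas can export its drawing as a base64 data URL via canvas.toDataURL(), which produces PNG by default. Below is a minimal sketch of checking and decoding such a string on the receiving side. It is written for Node.js purely for illustration (the question's backend is PHP), and the function name is made up.

```javascript
// Decode a "data:image/png;base64,..." string (the format produced by
// canvas.toDataURL) into raw bytes, rejecting anything that is not PNG or GIF.
function decodeSignature(dataUrl) {
  const match = /^data:(image\/(?:png|gif));base64,([A-Za-z0-9+/]+=*)$/.exec(dataUrl);
  if (!match) {
    throw new Error("Expected a base64 PNG or GIF data URL");
  }
  return { mimeType: match[1], bytes: Buffer.from(match[2], "base64") };
}

// The eight-byte PNG file signature, used here as a stand-in payload.
const pngHeader = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const decoded = decodeSignature("data:image/png;base64," + pngHeader.toString("base64"));
console.log(decoded.mimeType, decoded.bytes.length);
```

Validating the prefix before storing anything keeps unexpected content types out of the signature store; the decoded bytes can then be written to disk as a regular image file.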

Try the Signature class from Objective.js. Very simple to use. The test program shows how to add it in PHP to a document. All you need is a canvas, create an instance of a Signature and pass the canvas to the object. The code is 115 lines.
http://www.objectivejs.org/en/manual/objective-js/signature
Objective.js - Programming object oriented interfaces and animations in JavaScript
