Table data extraction from image or scanned documents (Not pdf)

I want to extract table data from images or scanned documents and map the header fields to their values, mostly in insurance documents. I have tried extracting the text line by line and then mapping it using its position on the page. I defined the table boundary with a start pivot and an end pivot, but this does not give proper results because headers sometimes span multiple lines (I implemented this in PHP). I also want to know whether I can use machine learning to achieve the same thing.
For PDF documents I have used tabula-java, which worked pretty well for me. Is there a similar implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, Model = FZ-S, CC = 153, etc.
Thanks.

I would definitely give Tesseract a go; it is a very good OCR engine. I have used it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick out the fields of interest.
I don't think machine learning would be particularly useful for you unless you plan to build your own OCR engine. I'd start with existing libraries, as they offer very good performance.
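As a rough illustration of the regex step, here is a minimal sketch in Python. It assumes the OCR engine has already returned plain text in which a header row is followed by a value row and the columns are separated by at least two spaces; the sample text and the function name `map_header_to_values` are made up for the example, and real OCR output will usually need more cleanup than this.

```python
import re

# Hypothetical OCR output for one table: a header row followed by a value row.
ocr_text = """MAKE    MODEL   CC
YAMAHA  FZ-S    153"""

def map_header_to_values(text):
    """Pair each header cell with the value in the same column position."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Split on runs of two or more spaces so single spaces inside a cell survive.
    headers = re.split(r"\s{2,}", lines[0].strip())
    values = re.split(r"\s{2,}", lines[1].strip())
    return dict(zip(headers, values))

fields = map_header_to_values(ocr_text)
print(fields)
```

Headers that wrap onto several lines would break this simple row pairing, so in that case the header lines would have to be merged first.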

The easiest and most reliable way to do it without much OCR knowledge would be this:
- Take an empty template as a reference and mark the coordinates of the boxes you need to extract data from. Label them and save them for future use. This is done only once per template.
- When reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if it does not match already).
- You already have every box's coordinates and know what data it should contain (because you labeled and saved them in the first step).
This means you can now analyze just the pixels inside each box to find out what is written there.
In other words, given the list of labeled boxes from the first step, you should be able to get the data in each of them. If the data is typed rather than handwritten, it will be easier to read with simple OCR libraries and to process further.
If the data always uses the same font and size, as in your example template, you could even build your own small database of letters in that font and size, or perhaps of full words, depending on the possible answers for each box.
This is by no means the best approach, but it would get the work done with minimal effort and OCR knowledge.
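The box bookkeeping described above can be sketched as follows. This is only an illustration in Python: the labels, coordinates and template size are invented, and instead of resizing the scan to the reference dimensions it scales the saved boxes to the scan, which amounts to the same mapping. Cropping each box and passing it to an OCR library is left out.

```python
# Labeled boxes marked once on the reference template: (left, top, right, bottom) in pixels.
# The labels, coordinates and template size are made-up values for illustration.
REFERENCE_SIZE = (1000, 1400)
REFERENCE_BOXES = {
    "make": (100, 200, 300, 240),
    "model": (320, 200, 520, 240),
}

def scale_boxes(boxes, reference_size, scan_size):
    """Map reference-template boxes onto a scan with different dimensions."""
    sx = scan_size[0] / reference_size[0]
    sy = scan_size[1] / reference_size[1]
    return {
        label: (round(l * sx), round(t * sy), round(r * sx), round(b * sy))
        for label, (l, t, r, b) in boxes.items()
    }

# A scan at half the reference resolution.
scaled = scale_boxes(REFERENCE_BOXES, REFERENCE_SIZE, (500, 700))
print(scaled["make"])
```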

Related

Is it possible to make a placeholder inside an image and replace it later?

I've been asked to write PHP software where the admin/user can import a certificate of attendance template and then import a list of names; the software should then generate certificate images with the names from the list.
My preliminary idea was that the certificate should be an image containing a placeholder to be replaced by the name, for example:
Here we can replace the <Type Person's Name Here> placeholder with the names from the list.
Unfortunately this is all theoretical: I can't just search for the placeholder in a .jpg or .png file and replace it.
After some research I found that some sites provide certificate templates as .ai and .eps files. Both can be opened in Adobe Illustrator, where all the text is editable and you can simply replace the name placeholder, example.
The problem is that both of these file types are proprietary and I doubt I can find a PHP library to edit them easily.
This is very similar to the situation in 20253386, except that the certificate template will need to change. If I had only one certificate template, I would only need to know the coordinates of the name placeholder and use Imagick::annotateImage, but with importable templates I have no idea where the name should be placed for a given certificate design.
So what other options do I have? Are there any open standards for image templates? Is there another way to tackle this problem?

I would push the problem of the coordinates onto the user and devise some simple encoding system for template images, perhaps using the file name. For example, if you had two files:
CertificateOfAttendance2020 name#200,200 description#200,300 date#100,500.png
CertificateOfEnrolment2020 name#200,200 date#100,500.png
You can parse their names, e.g. split on space, take the first part as the template name, then split each of the others on # and take the first bit as the field name and the second as the x,y position where it should be placed.
Show the user the template name part in a picker. When they pick e.g. Certificate of Attendance, say "this certificate requires three extra bits of info" and put the fields dynamically in your UI:
Name:
Description:
Date: defaulted to today
Or when they pick the enrolment one, just show them two fields.
Let the user type what they want, then use some tool to put those things at the coordinates. It is the person who designs the image who should provide the coordinates. If you're writing a tool for that too, then the tool can name the file. Maybe you can have some way of encoding left/right/center alignment, font size and face. Maybe eventually you'll just put a block of CSS in the file name, then throw the whole thing out and replace it with an HTML file describing the most elaborate certificates in the world, with placeholders inside the HTML and the image as just a background :)
Generating the image could be done with an image library like ImageMagick, or you could build it as HTML divs with a big image and absolute positioning and then use e.g. wkhtmltopdf to turn it into a PDF. There are loads of options, and we aren't here to design it for you (even though it might look like that's what I've done). The reason we don't "design your system for you" is that what is best is fairly subjective and there is no right answer; the method I've described above has potentially many flaws that may make it unsuitable for your context. My point is purely that you need to find some way to store/associate the data you need to ask for (name, description) with the image that provides the background and with the info that drives where those things are placed. I've tried to keep this "answer" full of "could do" and "perhaps" rather than "should do", and as such I regard it as just an extended comment on a mostly off-topic question. There's only one fact here that really makes it an answer (and it's in bold above).
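The file-name encoding suggested in this answer could be parsed along these lines. This is a minimal sketch in Python rather than PHP, and the function name `parse_template_name` is hypothetical; it assumes the template name itself contains no spaces and that every field follows the `field#x,y` pattern from the two example files.

```python
import os

def parse_template_name(filename):
    """Split 'Title field#x,y field#x,y.png' into a title and field coordinates."""
    stem, _ext = os.path.splitext(filename)
    title, *field_specs = stem.split(" ")
    fields = {}
    for spec in field_specs:
        name, coords = spec.split("#")
        x, y = (int(part) for part in coords.split(","))
        fields[name] = (x, y)
    return title, fields

title, fields = parse_template_name(
    "CertificateOfAttendance2020 name#200,200 description#200,300 date#100,500.png"
)
print(title, fields)
```

The keys of `fields` tell the UI which inputs to show, and the coordinates tell the drawing step where to place each value.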

How to get the color palette of a website using PHP?

I need to find the common colors used on a particular website. In most cases these will be the body background, header background, etc. The problem is that some classes or IDs override others, so we cannot get the exact colors from the stylesheets alone. Is there any way to find the exact colors of a website as the browser renders it?

As Havelock pointed out, the idea that comes to mind is rendering the page to an image and then extracting the color palette from that. It does, however, have a few problems:
- There is no guarantee that what the library renders is what the user sees in a particular browser, let alone in all of them.
- The processing could be implemented far more easily in languages other than PHP. I don't mean that it can't be done, just that PHP is not well suited to this task.
If you do continue along this path, I would recommend using an API to get your screenshots and then just using some PHP to parse them. An example of such a service: http://browsershots.org/xmlrpc/
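Once a screenshot has been decoded into pixels, the palette step itself is small. The following is a minimal sketch in Python that assumes the pixels are already available as a list of (r, g, b) tuples; the sample data, the function name `dominant_colors` and the bucket size are all illustrative choices, and decoding the image is left to whatever image library is used.

```python
from collections import Counter

def dominant_colors(pixels, top_n=3, step=32):
    """Return the most common colors as hex strings after coarse quantisation."""
    def quantise(channel):
        # Snap each channel to the lower edge of its bucket so near-identical shades merge.
        return (channel // step) * step

    counts = Counter(tuple(quantise(c) for c in rgb) for rgb in pixels)
    return ["#%02x%02x%02x" % rgb for rgb, _ in counts.most_common(top_n)]

# Hypothetical pixel data standing in for a decoded screenshot.
pixels = [(255, 255, 255)] * 6 + [(250, 250, 250)] * 2 + [(20, 30, 200)] * 3 + [(0, 0, 0)]
palette = dominant_colors(pixels)
print(palette)
```

Quantising groups similar shades together at the cost of reporting the bucket's lower edge rather than the exact color; a smaller `step` keeps colors closer to the originals.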

There are several online services to extract colors from websites. Including image colors:
http://www.colorcombos.com/grabcolors.html
http://www.hextractor.com/
more...
A PHP class to extract colors from images can be found here. See also How do I get the Hex Code of a color on my webpage
A Firefox plugin also exists.

How should I generate an isometric image of a Minecraft skin in PHP?

I'm trying to generate 3D isometric views of players' heads, but I'm not sure what kind of support PHP has for this type of operation, or of any external libraries that may be better suited.
Basically I need to take a net like this (here is a diagram showing what each portion is mapped to) and make a 3D head from it. I also need to include the 'head accessory' portions, which should be slightly larger/offset from the actual head.
Does anyone know how I should go about this?

First of all, this will be a complex job in my view.
The http://www.minecraftwiki.net/images/0/01/Skinzones.png file you mentioned is flat, but you have to convert it to an isometric 3D look, so you have to distort the images.
For example, look at the images below.
You can see that the 3D box image is created from pieces of other images. The logic is to add perspective to the flat images and join them, but since the result is still 2D we will call it image distortion.
Unfortunately the GD library that comes bundled with PHP is not advanced enough to let you do such things.
You have to use another library such as ImageMagick; this link is a tutorial on its distort functions: http://www.imagemagick.org/Usage/distorts/
The second big issue is the processing of the images. You can process them live, but that will consume a lot of server resources, so I suggest using pre-processed images rather than processing them on every request.
To generate the isometric image you have to write the code yourself, and it may need adjusting for each character image depending on its size. Once the code is written, though, the rest will be easy.
My suggestion is to write your own code once, adjust it for each character, save the processed images in a sprite, and use them when you add the play functionality.
Check out this link as well:
http://www.fmwconcepts.com/imagemagick/index.php
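To make the distortion idea more concrete, the sketch below computes where the corners of a cube's top face land on screen under a standard isometric projection. It is written in Python purely as an illustration; the axis convention, the `size` parameter and the function names are assumptions, and the resulting points would still have to be fed to an image library as destination points for the actual distortion.

```python
import math

COS30 = math.cos(math.radians(30))
SIN30 = math.sin(math.radians(30))

def iso_project(x, y, z, size=1.0):
    """Project a 3D point onto the 2D screen using a standard isometric projection.

    x and z run along the ground plane, y points up; screen v grows downward.
    """
    u = (x - z) * COS30 * size
    v = ((x + z) * SIN30 - y) * size
    return (u, v)

def top_face_corners(size):
    """Screen positions of the four corners of the cube's top face (y = 1)."""
    corners_3d = [(0, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]
    return [iso_project(x, y, z, size) for x, y, z in corners_3d]

# An 8-pixel face, matching the size of a head face in the skin file.
for corner in top_face_corners(8):
    print(tuple(round(c, 2) for c in corner))
```

The same projection applied to the two visible side faces gives the remaining corner sets; a slightly larger `size` could be used for the head accessory layer.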

How do I extract the text layer and background layer from a PDF?

In my project I have to build a PDF viewer in HTML5/CSS3, and the application has to allow users to add comments and annotations. In fact, I have to build something very similar to crocodoc.com.
At first I was thinking of creating images from the PDF and letting users define areas and post comments associated with those areas. Unfortunately, the client also wants to navigate within the PDF and add comments only on allowed sections (for example, paragraphs or selected text).
So now I am facing the problem of how to get the text, and of the best way to do it. If anybody has some clues about how to achieve this, I would appreciate it.
I tried pdftohtml, but the output doesn't look like the original document, which is really complex (example of document). Even this one doesn't really match the original, but it is much better than pdftohtml.
I'm open to any solution, with a preference for the command line under Linux.

I've been down the same road as you, with even more complex tasks.
After trying everything, I ended up using C# under Mono (so it runs on Linux) with iTextSharp.
Even with a very complete library such as iTextSharp, some tasks required a lot of trial and error :)
Extracting the text from a page is easy (see the snippet below), but if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.
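    using iTextSharp.text.pdf;

    int pdf_page = 5;
    string page_text = "";
    PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
    // Tokenise the raw content stream of the chosen page.
    PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
    while (token.NextToken())
    {
        if (token.TokenType == PRTokeniser.TokType.STRING)
        {
            // String operands carry the text that is drawn on the page.
            page_text += token.StringValue;
        }
        else if (token.StringValue == "Tj")
        {
            // "Tj" is the show-text operator; separate consecutive runs with a space.
            page_text += " ";
        }
    }
    reader.Close();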
Do a Console.WriteLine(token.StringValue) on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.
Addition:
Given the task you are required to do, I have a suggestion for you:
Extract the text with coordinates, font families and sizes, i.e. all the information about each paragraph. Then convert the PDF pages to images, and in your online viewer, lay invisible selectable text over the paragraphs on the image where needed.
This way your users can select part of the text where needed, without you having to reconstruct the whole PDF in HTML :)
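The overlay suggestion could look roughly like this on the output side. The sketch is in Python and only builds the HTML string; it assumes the text runs and their pixel coordinates have already been extracted, and the dictionary keys (`x`, `y`, `size`, `text`), the file name and the function name `overlay_html` are hypothetical.

```python
from html import escape

def overlay_html(image_url, width, height, text_boxes):
    """Build a page container: background image plus transparent, selectable text spans."""
    spans = []
    for box in text_boxes:
        style = (
            "position:absolute;left:{x}px;top:{y}px;"
            "font-size:{size}px;color:transparent;"
        ).format(**box)
        spans.append('<span style="%s">%s</span>' % (style, escape(box["text"])))
    container_style = (
        "position:relative;width:%dpx;height:%dpx;background:url('%s') no-repeat;"
        % (width, height, escape(image_url))
    )
    return '<div style="%s">%s</div>' % (container_style, "".join(spans))

page = overlay_html(
    "page-5.png", 800, 1100,
    [{"x": 72, "y": 96, "size": 12, "text": "Terms & conditions"}],
)
print(page)
```

Because the spans are real text, the browser's normal selection works on them while the user only sees the page image underneath.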

I recently researched and discovered a native PHP solution for this using FOSS. The FPDI PHP class can be used to import a PDF document for use with either the TCPDF or the FPDF PHP class, both of which provide functionality for creating, reading, updating and writing PDF documents. Personally, I prefer TCPDF as it provides a larger feature set, a richer API, more usage examples and a more active community forum than FPDF.
Choose one of the aforementioned classes, or another, to handle PDF documents programmatically. Focusing on both current and possible future deliverables, as well as the desired user experience, decide where (e.g. server with PHP, client with JavaScript, or both) and to what extent (feature driven) your interactive logic should be implemented.
Personally, I would use a TCPDF instance obtained by importing a PDF document via FPDI to iteratively inspect, translate to a common format (XML, JSON, etc.) and store the resulting representation in relational tables designed to persist data pertinent to the desired level of document hierarchy and detail. The necessary level of detail is often dictated by a specifications document and its mention of both current and possible future deliverables.
Note: In this case, I strongly advise translating documents and storing them in a common format to create a layer of abstraction and transparency. For example, a possible and unforeseen future deliverable might be to provide the same application functionality for users uploading Microsoft Word documents. If the uploaded Microsoft Word document was not translated and stored in a common format then updates to the Web service API and dependent business logic would almost certainly be necessary. This ultimately results in storing bloated, sub-optimal data and inefficient use of development resources in designing, developing and supporting multiple translators. It would also be an inefficient use of server resources to translate outbound data for every request, as opposed to translating inbound data to an optimal format only once.
I would then extend the base document tables by designing and relating additional tables for persisting functionality specific document asset data such as:
- Versioned Additions / Edits / Deletions
  - What
    - Header / Footer
    - Text
      - Original Value
      - New Value
    - Image
      - Page(s) (one, many or all)
      - Location (relative - textual anchor, absolute - x/y coordinates)
      - File (relative or absolute directory or url)
    - Brush (drawing)
      - Page(s) (one, many or all)
      - Location (relative - textual anchor, absolute - x/y coordinates)
      - Shape (x/y coordinates to redraw line, square, circle, user defined, etc.)
      - Type (pen, pencil, marker, etc.)
      - Weight (1px, 3px, 5px, etc.)
      - Color
    - Annotation
      - Page
      - Location (relative - textual anchor, absolute - x/y coordinates)
      - Shape (line, square, circle, user defined, etc.)
      - Value (annotation text)
    - Comment
      - Target (page, another text/image/brush/annotation asset, parent comment - threading)
      - Value (comment text)
  - When
    - Date
    - Time
  - Who
    - User
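As one possible reading of the asset list above, an annotation change could be held in a small record and serialised to a common format such as JSON. This is a minimal sketch in Python, not the PHP/relational design the answer has in mind; the class and field names are illustrative, and only the Annotation branch of the list is modelled.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Location:
    anchor_text: Optional[str] = None   # relative: textual anchor
    x: Optional[float] = None           # absolute: x/y coordinates
    y: Optional[float] = None

@dataclass
class Annotation:
    page: int
    location: Location
    shape: str    # line, square, circle, user defined, ...
    value: str    # annotation text

@dataclass
class Change:
    what: Annotation
    when: str     # date and time, here as an ISO 8601 string
    who: str      # user

change = Change(
    what=Annotation(page=3, location=Location(x=120.0, y=340.5),
                    shape="square", value="Check this clause"),
    when="2020-01-15T10:30:00",
    who="alice",
)
record = json.dumps(asdict(change), sort_keys=True)
print(record)
```

Each nested record maps naturally onto one related table, with the JSON form serving as the format exchanged with the client.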
Once some, all or more, of the document and its asset data has a place to persist I would design, document and develop a PHP Web service API to expose CRUD and PDF document upload functionality to the UI consumer, while enforcing core business rules. At this point, the remaining work now lies on the Client-side. Currently, I have relational tables persisting both a document and its asset data, as well as an API exposing sufficient functionality to the consumer, in this case the Client-side JavaScript.
I can now design and develop a Client-side application using the latest Web technologies such as HTML5, JavaScript and CSS3. I can upload and request PDF documents using the Web service API and easily render the returned common format out to the browser however I decide (probably HTML in this case). I can then use 100% native JavaScript and/or 3rd party libraries for DOM helper functionality, for creating vector graphics to provide drawing and annotation features, and for accessing and controlling functional and stylistic attributes of the currently selected document text and/or images. I can provide a real-time collaborative experience by employing WebSockets (in which case the aforementioned Web service API does not apply), or a semi-delayed but still fairly seamless experience using XMLHttpRequest.
From this point forward the sky is the limit and the ball is in your court!

It's a hard task you're trying to accomplish.
To read text from a PDF, have a look at PEAR's PDF_Reader proposal code.

There is also very extensive documentation for Zend_PDF(), which likewise allows loading and parsing a PDF document. The various elements of the PDF can be iterated over and thus transformed to HTML5 or whatever you like. You may even embed the annotations from your website into the PDFs and vice versa.
Still, you have been given no easy task. Good Luck.

pdftk is a very good tool for things like that (I don't know whether it can do exactly this task).
http://www.pdflabs.com/docs/pdftk-cli-examples/

Compare and merge two word documents using php

I am working on a project for online medical transcription training. For that, we are not allowed to show the original documents to the users.
The user must type everything he hears and upload the document to the server. The original document is then compared or merged with his typed document, and the resulting file is offered to him as a download so he can verify his work.
I need to do this in PHP. Is it possible?
I have heard about the COM object in PHP, but I didn't find any good example.

By searching "word com php", you should find a lot of sample code via Google.
However, there are other solutions (e.g. convert .doc to html or text) which should be faster and less platform dependent.

Keep in mind that what Word shows as the content of a doc is not always its complete content, but the result of more or less editing. Two docs may show and print the very same text, yet one may contain just that plain text while the other also contains large portions of deleted/edited/changed text and is therefore much bigger. So your only choice is, IMHO, to use calls to Word to compare two or more different documents.

I referred to this link:
http://pear.php.net/manual/en/package.text.text-diff.intro.php
It has the features I described: it compares two documents and can also merge them.
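To show what such a comparison produces, here is a minimal sketch that uses Python's standard `difflib` module as a stand-in for the PEAR package linked above. It assumes both documents have already been converted to plain text; the sample transcripts and the function name `compare_transcripts` are invented for the example.

```python
import difflib

def compare_transcripts(original, typed):
    """Return a unified diff of two plain-text transcripts, line by line."""
    diff = difflib.unified_diff(
        original.splitlines(),
        typed.splitlines(),
        fromfile="original",
        tofile="typed",
        lineterm="",
    )
    return list(diff)

original = "Patient reports chest pain.\nBlood pressure 120/80."
typed = "Patient reports chest pain.\nBlood pressure 120/90."
for line in compare_transcripts(original, typed):
    print(line)
```

Lines starting with `-` come from the original only and lines starting with `+` from the trainee's version, which is enough to show the trainee where the transcript differs without handing over the original file itself.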
