Compare and merge two Word documents using PHP

I am working on a project for online medical transcription training. We are not allowed to give the original documents to the users.
The user must type everything he hears and upload his document to the server. The original document is then compared or merged with his edited document, and the resulting file is offered to him as a download so he can verify it.
I need to do this in PHP. Is it possible?
I have heard about the COM object in PHP, but I didn't find any good example.

By searching Google for "word com php", you should find a lot of sample code.
However, there are other solutions (e.g. converting .doc to HTML or text) which should be faster and less platform-dependent.

Keep in mind that what Word shows as the content of a document is not always its complete content, but the result of more or less editing. Two documents may show and print the very same text, yet one may contain just that plain text while the other also holds large portions of deleted, edited, or changed text and is therefore much bigger. So IMHO your only choice is to call Word itself to compare two or more documents.

I referred to this link:
http://pear.php.net/manual/en/package.text.text-diff.intro.php
It has the features I described: it compares two documents and can also merge them.
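To illustrate the comparison step on its own, here is a minimal self-contained sketch of a word-level diff. It is not the Text_Diff API; it assumes both documents have already been converted to plain text, and the function name wordDiff is hypothetical. It uses a longest-common-subsequence table to mark words as kept (=), missing from the typed version (-), or extra in the typed version (+).

```php
<?php
// Hypothetical helper (not part of PEAR Text_Diff): word-level diff of two
// plain-text strings based on a longest-common-subsequence (LCS) table.
function wordDiff(string $original, string $typed): array
{
    $a = preg_split('/\s+/', trim($original), -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('/\s+/', trim($typed), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] = length of the LCS of $a[$i..] and $b[$j..]
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table and emit one operation per word.
    $ops = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $ops[] = ['=', $a[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $ops[] = ['-', $a[$i]];   // word in the original, not typed
            $i++;
        } else {
            $ops[] = ['+', $b[$j]];   // word typed, not in the original
            $j++;
        }
    }
    while ($i < $n) {
        $ops[] = ['-', $a[$i++]];
    }
    while ($j < $m) {
        $ops[] = ['+', $b[$j++]];
    }
    return $ops;
}
```

The table needs memory proportional to the product of the two word counts, so for long transcripts it may be better to diff paragraph by paragraph.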

Related

Table data extraction from image or scanned documents (Not pdf)

I want to extract table data from images or scanned documents and map the header fields to their values, mostly in insurance documents. I have tried extracting the text line by line and then mapping it using its position on the page. I defined the table boundary with a table start and end pivot, but that doesn't give me proper results, since headers sometimes span multiple lines (I implemented this in PHP). I also want to know whether I can use machine learning to achieve the same.
For pdf documents I have used tabula-java which worked pretty well for me. Is there a similar type of implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, MODEL= FZ-S, CC= 153 etc.
Thanks.
I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick the fields of interest.
I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries, they offer very good performance.
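As a rough sketch of the regex step, assuming the OCR engine has already produced plain text in which each label is followed by ":" or "=" and its value: the function name extractFields and the sample string are hypothetical, and real OCR output will usually be noisier than this.

```php
<?php
// Hypothetical helper: pull "label = value" pairs out of OCR text.
// Assumes each value ends at a comma or at the end of the line.
function extractFields(string $ocrText, array $labels): array
{
    $fields = [];
    foreach ($labels as $label) {
        // Label, optional spaces, ':' or '=', optional spaces, then the value.
        $pattern = '/\b' . preg_quote($label, '/') . '\s*[:=]\s*([^,\r\n]+)/i';
        if (preg_match($pattern, $ocrText, $m)) {
            $fields[$label] = trim($m[1]);
        }
    }
    return $fields;
}

// Sample text modelled on the fields named in the question.
$text = "Make = YAMAHA, MODEL= FZ-S, CC= 153";
$fields = extractFields($text, ['Make', 'Model', 'CC']);
```

Matching is case-insensitive, so 'Model' also finds "MODEL"; labels that are missing from the text are simply left out of the result.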
The easiest and most reliable way to do it without much knowledge in OCR would be this:
- Take an empty template for reference and mark the boxes coordinates that you need to extract the data from. Label them and save them for future use. This will be done only once for each template.
- Now when reading the same template, resize it to match the reference template's dimensions (if it doesn't match already).
- You have already every box's coordinates and know what data it should contain (because you labeled them and saved them on the first step).
Which means that now you can just analyze the pixels contained in each box to know what is written there.
This means that given a list of labeled boxes (that you extracted in the first step), you should be able to get the data in each one of these boxes. If this data is typed and not hand written the extracted data would be easier to analyze or do whatever you want with it using simple OCR libraries.
Or, if the data is always the same size and font as in your example template above, you could just build your own small database of letters in that font and size, or maybe of full words, depending on each box's possible answers.
Anyway, this is far from the best approach, but it would definitely get the work done with minimal effort and little knowledge of OCR.
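The coordinate bookkeeping for the steps above can be sketched as follows. Note the swap: instead of resizing the scan to the reference template, this version scales the saved boxes to the scan, which gives the same crop regions. The function name scaleBoxes and the box layout [x, y, width, height] are assumptions for illustration; cropping and OCR of each box are not shown.

```php
<?php
// Hypothetical helper: map labelled boxes measured on the reference template
// onto a scan of a different size. Each box is [x, y, width, height] in pixels.
function scaleBoxes(array $boxes, int $refW, int $refH, int $imgW, int $imgH): array
{
    $sx = $imgW / $refW;   // horizontal scale factor
    $sy = $imgH / $refH;   // vertical scale factor
    $scaled = [];
    foreach ($boxes as $label => [$x, $y, $w, $h]) {
        $scaled[$label] = [
            (int) round($x * $sx),
            (int) round($y * $sy),
            (int) round($w * $sx),
            (int) round($h * $sy),
        ];
    }
    return $scaled;
}
```

Each scaled box can then be cropped out of the scan and passed to whatever OCR library is in use.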

Compare between two images in php

Hello, I'm trying to develop a new website for a special purpose. I have a list of images uploaded to a server. I need to upload an image from my PC, search the list of images on the server, and return the images most similar to the uploaded one, based on image colour rather than faces, all using PHP.
This link describes my problem but has no code. Thanks.
This is a very complex task you are trying to do (and especially hard, because you want to do it in PHP).
What I can think of (in general) to achieve this contains the following sub-tasks:
Recognize colours
Recognize shapes
Recognize the connections of the upper two
In PHP the last two are nearly impossible (and make little sense, as PHP is not an image processing library; it only has basic image functions). But you can do the first one using this library:
https://github.com/thephpleague/color-extractor
You can make the comparison as fine as you want. Get the most used colours (e.g. the top 1000) and compare them as arrays. Obviously you won't get an exact match, but if you compare the first 1000 and find 500 matches, then that picture is somewhat similar to the other. However, you could get completely false results, so this is a programmatic solution rather than a logical one.
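A minimal sketch of the array comparison, assuming each image's most used colours are already available as arrays of 0xRRGGBB integers (how they are extracted is not shown here). The function name paletteSimilarity and the quantisation step are my own additions: each channel is rounded down to a bucket so that near-identical shades count as a match.

```php
<?php
// Hypothetical helper: share of the colours in $a that also occur in $b.
// Colours are 0xRRGGBB integers; $step controls how coarse the buckets are.
function paletteSimilarity(array $a, array $b, int $step = 32): float
{
    $quantize = function (int $rgb) use ($step): int {
        $r = intdiv(($rgb >> 16) & 0xFF, $step);
        $g = intdiv(($rgb >> 8) & 0xFF, $step);
        $bl = intdiv($rgb & 0xFF, $step);
        return ($r << 16) | ($g << 8) | $bl;
    };
    $qa = array_unique(array_map($quantize, $a));
    $qb = array_unique(array_map($quantize, $b));
    if (count($qa) === 0) {
        return 0.0;   // nothing to compare
    }
    return count(array_intersect($qa, $qb)) / count($qa);
}
```

A score of 1.0 means every colour bucket of the uploaded image also appears in the stored image; sorting the stored images by this score gives a crude "most similar first" list.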

get data from documents

I want to make a web app that can get the values from a commonly used file type (such as xls or ppt) so that I can convert it into a custom format (like Google Drive does). With an xls (Excel document) file, for example, I want to be able to get the value of each cell. I would be fine with getting HTML for a file (like the HTML code that would display a Word document), because values can be extracted from that. I would like to do this on the client side, but I am okay with doing it on the server side with PHP.
Another approach would be to import the file as XML. PHP has great support for XML and could make short work of this. If you can get the files uploaded as Open Doc Format you can parse just about any of the types you listed (XLS, PPT, DOC, etc).
A pretty easy way to get data out of an excel sheet online is to use a Google Apps Script. The process would be a lot to explain here, but with a bit of google searching, you can find all your answers.
As for a PPT, I can't think of an easy way.
As for documents (i.e. pdf, doc, docx), you can use Google Apps Script as well.
Although, if you're making your own tool for this, you may want to just research how the data is stored in the file and work from there.

Get number of words in a document

I'm about to build a translation site (in PHP) where people can order a translator to translate their documents. People upload their file on the site, and it is then assigned to a translator who is a member of the site. The problem is how to build an application that calculates the price from the document.
The most common way to price a translation is per word, so I need to know how many words are in the document a customer has uploaded. I thought it must be possible to count the words in a text file such as a Word document. However, I couldn't find any way to get the exact count for an MS Word 2003 document (.doc). I've found a way to count .docx, but not .doc. And there will be more file types, such as PDF or RTF.
I've seen another method that only looks at the file size, but I don't think it would give the same result for different document formats. Or would it?
The simplest way I can think of is to ask visitors to copy and paste their text into a textarea, but I don't think this is the best way.
Could someone advise me on how to solve this?
If you're running your site on a *nix server, you might want to try the following:
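// Read from stdin so wc prints only the number; escape the filename for the shell.
$word_count = (int) shell_exec("wc -w < " . escapeshellarg($filename));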
And, yes, I've been led to believe that it works with .doc and .docx documents. PDFs are a whole other story. I'll have to research that one.
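Since wc -w counts whitespace-separated tokens in the raw bytes of a file, it may be safer to extract the text first and count in PHP. The sketch below assumes the text has already been extracted from the upload (the extraction itself is not shown); the function names countWords and quotePrice and the per-word rate are hypothetical.

```php
<?php
// Hypothetical helper: count words in already-extracted plain text (UTF-8).
// A "word" is a run of letters/digits, optionally joined by ' or -.
function countWords(string $text): int
{
    return (int) preg_match_all("/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/u", $text);
}

// Hypothetical pricing rule: a flat rate per word, rounded to two decimals.
function quotePrice(string $text, float $ratePerWord): float
{
    return round(countWords($text) * $ratePerWord, 2);
}
```

Counting the same extracted text for every format keeps the price consistent whether the customer uploads .doc, .docx, PDF or RTF.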

How can i convert a php page into .doc file with php

I recently worked on a project in which I need to convert a page into a Microsoft Word document (.doc file) and offer the document for download, all using PHP. But I can't solve this problem.
Please help me. Thank you very much, Arif
This is not easy to solve.
First off, if you want to write real Word documents, you will have to do it on Windows. You can use COM to talk to Word, and this is how you get good results. I've tried all the Unix/Linux-based solutions and the results were not so great.
Otherwise, I'd suggest you write RTF, which is just as good. In the end, you can give the .rtf file a .doc extension and no one will notice. RTF has a couple of limitations (formatting), but on the flip side, it's all ASCII and the RTF standard is pretty comprehensive and well documented.
There's a class which does it pretty nicely -- phpLiveDocx (this is a great introduction). And this class also claims to write PDF and DOC -- but I haven't tried those yet. I use another solution for PDF.
I would recommend using the RTF format instead of the .doc - it's much simpler to write to, and all text editors understand it. Similar recommendation for .csv when you want to output an Excel file.
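To show how little is needed for RTF, here is a minimal sketch that wraps plain text in an RTF document. It assumes ASCII input (non-ASCII characters would need RTF Unicode escapes, which are not handled here); the function name textToRtf and the font choice are hypothetical.

```php
<?php
// Hypothetical helper: wrap ASCII plain text in a minimal RTF document.
function textToRtf(string $text): string
{
    // Escape the three RTF control characters; backslash must come first.
    $escaped = str_replace(['\\', '{', '}'], ['\\\\', '\\{', '\\}'], $text);
    // Normalise line endings, then turn each newline into a paragraph break.
    $normalized = str_replace("\r\n", "\n", $escaped);
    $body = str_replace("\n", "\\par\n", $normalized);
    // Header: RTF version 1, ANSI charset, one font, 12pt (\fs24 = half-points).
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\n\\f0\\fs24 " . $body . "\n}";
}

// To offer the result as a download from a web page, one could send:
//   header('Content-Type: application/rtf');
//   header('Content-Disposition: attachment; filename="page.rtf"');
//   echo textToRtf($pageText);   // $pageText: the page content as plain text
```

HTML formatting is lost with this approach; only the text and its paragraph breaks survive.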
Perhaps not the answer you seek, but still interesting to note: there is an open-source word processor called AbiWord that has a CLI (command-line interface). You can use it to easily convert between document formats. I know that at least one website uses it to convert text files into various formats.
It is actively developed and could easily be used as a third-party black-box solution for converting documents server-side.
Here is a blog from one of the developers on how to integrate it with PHP
Server-Side AbiWord
abiword home page
