Get number of words in a document

I'm about to build a translation site (in PHP) where people can order a translator to translate their documents. Users upload a file, and it is then assigned to a translator who is a member of the site. The problem is how to build something that calculates the price from the document.
The most common way to price a translation is per word, so I need to know how many words are in the document a customer has uploaded. I thought it must be possible to count the words in a file such as a Word document. However, I couldn't find any way to get the exact word count of an MS Word 2003 document (.doc). I've found a way to count words in .docx, but not in .doc. There will also be other formats, such as PDF or RTF.
I've seen another method that only looks at the file size, but I don't think it would give the same result for different document formats. Or would it?
The simplest way I can think of is to ask visitors to copy and paste their text into a textarea, but I don't think this is the best way.
Could someone advise me on how to solve this?

If you're running your site on a *nix server, you might want to try the following:
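// shell_exec() returns the command's output; escapeshellarg() guards the filename
$word_count = (int) shell_exec("wc -w < " . escapeshellarg($filename));
And, yes, I've been led to believe that it works with .doc and .docx documents, though wc only counts whitespace-separated tokens in the raw bytes, so treat the number as a rough estimate for binary formats. PDFs are a whole other story. I'll have to research that one.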
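
For .docx, which the question says is already solvable, one possible approach is to read the text out of the archive and count whitespace-separated words. This is a minimal sketch, assuming PHP's zip extension is enabled; countWordsInText and countWordsInDocx are hypothetical helper names, and the result will not exactly match Word's own count (footnotes and text in other parts of the archive are ignored).

```php
<?php
/** Count whitespace-separated tokens in a plain-text string. */
function countWordsInText(string $text): int
{
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? 0 : count($tokens);
}

/** Count words in the main body of a .docx file (assumption: zip extension loaded). */
function countWordsInDocx(string $path): int
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    // The main document body lives in word/document.xml inside the archive.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No word/document.xml found in $path");
    }
    // Add a space at each paragraph end so adjacent paragraphs do not fuse.
    $xml = str_replace('</w:p>', ' </w:p>', $xml);
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return countWordsInText($text);
}
```

The same countWordsInText helper could be reused for any other format once its text has been extracted by some other tool.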

Issue converting to .pdf a merged .docx file that opens fine in Word

So, I have the following scenario.
I am working on a system for academic papers. I have several inputs for things like author name, coauthors, title, type of paper, introduction, objectives and so on, and I store all that information in a database. The user has a Preview button which, when clicked, generates a Word file asynchronously and sends the file location back; that file is then shown to the user in an iframe using Google Doc Viewer.
There's a specific use case where the user/author of the paper can attach a .docx file with a table, or a .jpeg file for a figure. That table/figure has to be included inside the final .docx file.
For the .docx generation process I am using PHPWord.
So up until this point everything works fine, but my issues start when I try to mix everything and put together the .docx file.
Approach Number One
My first approach was to do everything with PHPWord. I create the file, add the texts where required and, in the case of the image, just insert the image followed by the figure caption below it.
Things get tricky, though, when I try doing the same thing with the .docx table file. My only option was to get the table XML using this. It did the trick, but when I opened the resulting Word file, the table was there but had lost all of its styling and had transparent borders. Because of those transparent borders, the borders were ignored when converting to PDF and the table content came out as scrambled text.
Approach Number Two (current one)
After fighting with Approach Number One and only complicating things further, I decided to do something different. Since I had already generated one docx file with the main paper information and needed to add another docx file, I decided to use the DocX Merge Library.
So what I basically did was generate three Word files: one for the main paper information, one for the table and one for the table caption (that last one is mainly to avoid overcomplicating the order of information; also, that data is not in the table .docx file).
Then I run this:
$dm->merge( [
'paper-info.docx',
'attached-table.docx',
'attached-table-caption.docx'
], 'complete-file.docx');
So, afterwards, I check and the Word file is generated just as I need it with the table maintaining its original styles and dimensions.
If I open it in LibreOffice though, I get this error message:
Then if I continue and open the file, the file opens correctly with all the data with the only exception that it no longer respects the fonts of the file as they appear in Word.
The problem comes in the next step. I need to present a preview of the file using Google Doc Viewer, with this syntax:
<iframe src="https://docs.google.com/gview?embedded=true&hl=es_LA&url=https://usersite.net/complete-file.docx?pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="100%" height="600" style="border: none;"></iframe>
The document loads fine, but when I review it, it only shows the content of the first file, paper-info.docx, and ends right where the table and table caption should appear. If I open the exact same file in Word, it shows the table and caption.
The other issue is when I try to convert the file to PDF.
If I use PHPWord's conversion method in combination with DomPDF, I get the exact same issue as with Google Doc Viewer: I only get the content of the first file. This is the code:
$phpWordPDF = \PhpOffice\PhpWord\IOFactory::load('complete-file.docx');
$xmlWriterPDF = \PhpOffice\PhpWord\IOFactory::createWriter($phpWordPDF, 'PDF');
$xmlWriterPDF->save('complete-file.pdf');
So my only other viable route was to use LibreOffice's command line using this command:
soffice --headless --convert-to pdf complete-file.docx
This converts the file correctly, but has the same issue I mentioned when opening the .docx file in LibreOffice: the font styles are lost.
The other weird part is that if I try to run this in my PHP script:
shell_exec('soffice --headless --convert-to pdf complete-file.docx');
Nothing happens.
I am running Apache 2.4.25, PHP 7.4.11 on Windows 10 x64.
Conclusion
So far my best result has come from merging the files, but that also caused this issue, so maybe the issue comes from the merging process I am using. Ideally I would just insert the table, with its styles and everything, using PHPWord, but I haven't been able to and haven't found any examples of how to do that.
Another option I've seen is this library, but the merge feature is only available in the license that costs $599 USD, and although I am pretty close to solving this, I am not sure it would solve my issue. If it does, I'd invest in it since I need to get this done ASAP, but I wanted to check with you guys what your recommendations would be for this case. Maybe another merging library, or doing everything via PHPWord.
Help is appreciated!

After a lot of attempts to fix it, I wasn't able to achieve what I wanted with PHPWord and the merging library I mentioned.
Since I needed to fix this I decided to invest in the paid library I mentioned in my question. It was an expensive purchase, but for those who are interested, it does exactly what was required and it does it perfectly.
The two main functions I required were document merging and importing of content to a .docx file.
So I had to purchase the Premium package. Once there, the library literally does everything for you.
Example code for merging docx files:
require_once 'classes/MultiMerge.php';
$merge = new MultiMerge();
$merge->mergeDocx('document.docx', array('second.docx', 'other.docx'), 'output.docx', array());
Example of how to import a table from another docx file:
require_once 'classes/CreateDocx.php';
$docx = new CreateDocxFromTemplate('document.docx');
// import tables
$referenceNode = array(
'type' => 'table',
);
$docx->importContents('document_1.docx', $referenceNode);
$docx->createDocx('output');
As you can see it is pretty easy. This answer is by no means an ad for this library, but for those that have the same problem as me, this is a life saver.

Table data extraction from image or scanned documents (Not pdf)

I want to extract table data from images or scanned documents and map the header fields to their values, mostly in insurance documents. I have tried extracting the text line by line and then mapping the values using their position on the page. I marked the table boundary by defining start and end pivots for the table, but this doesn't give me proper results, since headers sometimes span multiple lines (I implemented this in PHP). I also want to know whether I can use machine learning to achieve the same.
For pdf documents I have used tabula-java which worked pretty well for me. Is there a similar type of implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, Model = FZ-S, CC = 153, etc.
Thanks.

I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick the fields of interest.
I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries, they offer very good performance.
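
The regex step could look something like the following. It is only a sketch, assuming the OCR text has already been produced (for example by running `tesseract scan.png stdout`) and that each label and its value end up on the same line; extractFields and the label list are hypothetical, and real OCR output from a table layout will likely need looser patterns.

```php
<?php
/**
 * Pick labelled values out of OCR text.
 * $labels maps output keys to the label text expected on the page.
 */
function extractFields(string $ocrText, array $labels): array
{
    $fields = [];
    foreach ($labels as $key => $label) {
        // Label, optional separator (: = -), then the value up to the end of the line.
        $pattern = '/\b' . preg_quote($label, '/') . '\b[ \t]*[:=\-]?[ \t]*([^\r\n]+)/i';
        if (preg_match($pattern, $ocrText, $m)) {
            $fields[$key] = trim($m[1]);
        }
    }
    return $fields;
}

// Sample text standing in for real OCR output.
$ocrText = "Make: YAMAHA\nModel: FZ-S\nCC: 153\n";
$fields = extractFields($ocrText, ['make' => 'Make', 'model' => 'Model', 'cc' => 'CC']);
```

Labels that are missing from the text are simply left out of the result, so the caller can tell which fields OCR failed to find.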

The easiest and most reliable way to do it without much knowledge in OCR would be this:
- Take an empty template for reference and mark the coordinates of the boxes that you need to extract data from. Label them and save them for future use. This is done only once for each template.
- Now, when reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if it doesn't match already).
- You already have every box's coordinates and know what data it should contain (because you labeled and saved them in the first step).
Which means that now you can just analyze the pixels contained in each box to know what is written there.
In other words, given the list of labeled boxes from the first step, you should be able to get the data in each of those boxes. If the data is typed rather than handwritten, it will be easier to read with simple OCR libraries.
Or, if the data is always in the same size and font, as in your example template above, you could just build your own small database of letters in that font and size. Or maybe of full words? It depends on the possible answers for each box.
Anyway this is not the best approach by far but it would definitely get the work done with minimal effort and knowledge in OCR.
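
As a rough illustration of the coordinate bookkeeping, the sketch below scales the saved boxes to the size of the scan instead of resizing the scan to the template; the two are equivalent for this purpose. scaleBoxes, the box format and the sample numbers are all hypothetical.

```php
<?php
/**
 * Scale labelled boxes measured on a reference template to a scan of another size.
 * Each box is [x, y, width, height] in pixels of the reference template.
 */
function scaleBoxes(array $boxes, int $refW, int $refH, int $scanW, int $scanH): array
{
    $sx = $scanW / $refW; // horizontal scale factor
    $sy = $scanH / $refH; // vertical scale factor
    $scaled = [];
    foreach ($boxes as $label => [$x, $y, $w, $h]) {
        $scaled[$label] = [
            (int) round($x * $sx),
            (int) round($y * $sy),
            (int) round($w * $sx),
            (int) round($h * $sy),
        ];
    }
    return $scaled;
}

// Boxes marked once on a 1000x1400 reference template, applied to a 2000x2800 scan.
$boxes = ['make' => [100, 200, 300, 40], 'model' => [420, 200, 300, 40]];
$scaled = scaleBoxes($boxes, 1000, 1400, 2000, 2800);
```

Each scaled box can then be cropped out of the scan and passed to whatever OCR step is used.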

Determine if PDF file has searchable text in PHP

We have hundreds of PDF files on a server. Some of them contain searchable text and others do not.
I was asked to find out which are searchable and which are not.
Does anybody know of a way to read in a bunch of PDFs and determine whether each one contains text that is searchable/selectable, or only contains scanned images of text that would need to be OCR'd?
I don't even need to actually read the text; I just need to be able to detect, possibly by tags or keywords, something in the raw data that suggests there are fonts or something like that.
Are there tags in a searchable PDF that make it easy to detect?
Thanks

You could modify this code (pdf2text) to suit your purposes, I believe. Or this answer might get you to the right spot as well.
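
As a very rough first pass along the lines the question suggests, the raw bytes can be scanned for a font resource marker. This is a heuristic sketch, not a reliable test: pdfMentionsFonts is a hypothetical helper, PDFs that store their objects in compressed streams may hide the marker, and a scanned PDF that has already been OCR'd will also contain fonts.

```php
<?php
/**
 * Rough check: does the raw PDF data mention a font resource?
 * Image-only scans usually have none; PDFs with real text normally do.
 */
function pdfMentionsFonts(string $pdfBytes): bool
{
    // Matches "/Font" as a whole name, not "/FontDescriptor" or "/FontFile".
    return preg_match('~/Font\b~', $pdfBytes) === 1;
}
```

For a batch run, each file could be read with file_get_contents() and passed to the function, with anything reported as font-free queued for OCR and the rest spot-checked by hand.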

Embed code in video file

I'm sorry if the question is ambiguous, I'll try to explain.
I'm working on an existing PHP download script for videos and some parts of it are broken. There's code in there that's supposed to place a specific member code inside the video file before download, but it doesn't work. Here's the code:
//embed user's code in video file
$fpTarget = fopen($filename, "a");
fwrite($fpTarget, $member_code);
fclose($fpTarget);
$member_code is a random 6-character code.
Now, this would make sense to me if it were a text file, but since it's a video file, how could this possibly work and what is it supposed to do? If the member code is somehow added to the video, how can I see it after downloading it? I have no experience with video files, so any help is appreciated (a modification of the available code or new code would be equally welcome).
I'm sorry I can't give a more precise description of what the code is supposed to do, I'm trying to figure that out myself.

It may work, depending on the format/type of the video. MPG files are fairly tolerant of "noise" in a file and players would skip over your code because it doesn't look like valid video frame data.
Other formats/players may puke, because the format requires certain data be at specific offsets relative to the end of the file, which you've now shifted by 6 characters.
Your best bet is to see if whatever format you're serving up has provisions for metadata in its specification, e.g. there might be support for a comment field somewhere that you can simply slap the code into.
However, if you're doing all this for 'security' or tracking unauthorized sharing of the video, then simply writing the number into a header is fairly easy to bypass. A better bet would be to watermark the video somehow so that the code is embedded in the actual video data, so that "This video belongs to member XYZ only" is displayed while playing.
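
To check what the script in the question actually does to a file, the appended code can be read back from the last bytes. This is a minimal sketch with hypothetical helper names (appendMemberCode, readTrailingCode); it only confirms the bytes are there and says nothing about whether a given player still accepts the file.

```php
<?php
/** Append the code to the end of the file, as the script in the question does. */
function appendMemberCode(string $filename, string $memberCode): void
{
    $fp = fopen($filename, 'ab'); // 'b' keeps the write binary-safe on Windows
    if ($fp === false) {
        throw new RuntimeException("Cannot open $filename for appending");
    }
    fwrite($fp, $memberCode);
    fclose($fp);
}

/** Read back the last $length bytes, where the appended code should be. */
function readTrailingCode(string $filename, int $length = 6): string
{
    $fp = fopen($filename, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $filename for reading");
    }
    fseek($fp, -$length, SEEK_END); // position $length bytes before the end
    $code = fread($fp, $length);
    fclose($fp);
    return $code === false ? '' : $code;
}
```

Running readTrailingCode() on a downloaded copy shows whether the 6-character code survived the download.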

You don't write to the content of the file directly, not like you would with a text file. As you've noticed, this effectively corrupts the video and you have no way of reasonably reading the information.
For audio/video files, you write to meta-data that's packaged with the file. How this is packaged and what you can do with it generally depends heavily on the container format used for the file. (Remember that container and codec are two different things. The codec is the format used to encode the audio/video, the container is the file format in which that data stream is stored.)
A library like getID3 might be a good place to start. I've never used it, but it seems to be what you're looking for. What you would essentially do is write a value to the meta-data in the container (either a pre-defined value for that container or maybe a custom key/value pair, etc.) which would be part of the file. Then, when reading the file, you can get that data. (Now, that last part depends heavily on what's reading the file. The data is there, but not every player cares about it. You'll want to match up what you're writing to with what you usually see/read from the file's internal meta-data.)

Compare and merge two word documents using php

I am doing a project on online medical transcription training. For this, we are not allowed to give the original documents to the users.
The user must type everything he hears and upload the document to the server. The original document is then compared or merged with his typed document, and the resulting file is offered to him as a download so he can verify it.
I need to do this in PHP. Is it possible?
I heard about COM objects in PHP, but I didn't find any good example.

By searching "word com php", you should find a lot of sample code via Google.
However, there are other solutions (e.g. convert .doc to html or text) which should be faster and less platform dependent.
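
Once both documents have been converted to plain text, the comparison itself can be done in plain PHP. The sketch below is a word-level diff built on a longest-common-subsequence table; diffWords is a hypothetical helper, not part of any library, and the quadratic table makes it suitable only for reasonably short documents.

```php
<?php
/**
 * Word-level diff of two plain-text strings.
 * Returns [op, word] pairs: '=' unchanged, '-' only in the original, '+' only in the typed text.
 */
function diffWords(string $original, string $typed): array
{
    $a = preg_split('/\s+/', trim($original), -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('/\s+/', trim($typed), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] = length of the longest common subsequence of $a[$i..] and $b[$j..]
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table from the start, emitting one operation per word.
    $ops = [];
    $i = $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $ops[] = ['=', $a[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $ops[] = ['-', $a[$i++]];
        } else {
            $ops[] = ['+', $b[$j++]];
        }
    }
    while ($i < $n) { $ops[] = ['-', $a[$i++]]; }
    while ($j < $m) { $ops[] = ['+', $b[$j++]]; }
    return $ops;
}
```

The list of operations can then be rendered however the trainee should see it, for example with missing words struck through and extra words highlighted.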

Keep in mind that what Word shows as the content of a doc is not always its complete content, but the result of more or less editing. Two docs may show and print the very same text, yet one may contain just that plain text while the other also holds large portions of deleted/edited/changed text and so be much, much bigger. So your only choice is, IMHO, to use calls to Word to compare two or more different documents.

I referred to this link:
http://pear.php.net/manual/en/package.text.text-diff.intro.php
It has the feature I described. It compares two documents and can also merge them.
