Convert doc to txt

Convert doc to txt - php

I'm on a Linux server and I need to convert MS Word 97-2003 .doc files to plain-text .txt files using PHP.
I have already tried these solutions:
How to extract text from word file .doc,docx,.xlsx,.pptx php
Extract text from doc and docx
But both work properly only for the .docx format.
When I convert .doc files, I get garbage characters at the end of the text.
The number of unwanted characters varies with the length of the file.
Also, longer files sometimes get truncated.
Is there a simple way to do this conversion?

I finally settled on the following solution, which launches Antiword:
private function doc() {
    $file = escapeshellarg($this->filename);
    $text = `/usr/sbin/antiword -w 0 $file`;
    return html_entity_decode(utf8_encode(trim($text)));
}
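A possible variant of the snippet above, offered as a sketch rather than the answerer's method: Antiword accepts a character-mapping file through its -m option, and the UTF-8.txt mapping that ships with it makes the utf8_encode() call unnecessary. The function names buildAntiwordCommand and docToText are hypothetical, and the binary path is taken from the answer and assumed to match your server.

```php
<?php
// Hypothetical helper: builds the Antiword command line.
// -w 0 disables line wrapping; -m UTF-8.txt asks Antiword for UTF-8 output.
function buildAntiwordCommand(string $path, string $binary = '/usr/sbin/antiword'): string
{
    return escapeshellcmd($binary) . ' -m UTF-8.txt -w 0 ' . escapeshellarg($path);
}

// Hypothetical wrapper: runs Antiword and returns the extracted text.
function docToText(string $path): string
{
    $output = shell_exec(buildAntiwordCommand($path));
    if (!is_string($output)) {
        // shell_exec() yields null or false when the command fails or prints nothing
        throw new RuntimeException("Antiword produced no output for $path");
    }
    return trim($output);
}
```

Keeping the command construction in its own function makes the quoting easy to inspect before anything is executed.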

Related

Accents issue doc to txt

First I wanted to convert a PDF file to HTML, but the API can't do that.
So I tried converting PDF to txt, but I had a lot of problems with repeated spaces and line breaks...
So I tried (again) converting PDF to Word. The Word output is perfect.
Unfortunately, ConvertApi can't convert Word to HTML... and I can't find a free library to convert Word to HTML.
So I tried (again and again) converting Word to txt.
Now I have problems with accents in the txt file:
régime becomes r‚gime
matière becomes matiŠres
contrôle becomes contr“le
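One reading of the three examples above, stated as an assumption: the byte values for é, è and ô in the DOS code page CP850 (0x82, 0x8A, 0x93) are displayed as ‚, Š and “ when interpreted as Windows-1252, which matches the symptom exactly. If that is what happened, converting the file from CP850 to UTF-8 should restore the accents. The helper name cp850ToUtf8 is hypothetical.

```php
<?php
// Assumption: the .txt file is encoded in CP850 (DOS Latin-1),
// not in Windows-1252 or UTF-8.
function cp850ToUtf8(string $bytes): string
{
    $utf8 = iconv('CP850', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Input could not be converted from CP850');
    }
    return $utf8;
}

// 0x82 is "é" in CP850 but "‚" in Windows-1252.
echo cp850ToUtf8("r\x82gime"), "\n";
```

In practice the input would be the result of file_get_contents() on the converted .txt file.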

file_get_contents returns bizarre characters from raw text file

This is very bizarre. I have a .txt file on my Windows server. I'm using file_get_contents to retrieve it, but the first several characters show up as a diamond with a question mark inside it. I've tried recreating the file from scratch and the result is the same. What's really bizarre is that other files don't have this issue.
Also, if I put a * at the start of the file in an editor, that seems to fix it, but if I open the file and do the same with PHP it's still messed up.
The file in question begins with: Trinity Cannon - that's a direct copy and paste from the text file. I've tried retyping it, and the first few characters are always that diamond with a question mark.
$myfile='C:\\inetpub\\wwwroot\\fastpitchscores\\data\\2020.txt';
$fh = file_get_contents($myfile);
echo $fh; // Trinity Cannon
echo $fh[0]; // �

It sounds like whatever editor you used to originally create the file added a UTF byte order mark (BOM) at the beginning of the file.
You typically can't edit the BOM from within an editor. If your editor has an encoding conversion function, try converting to ASCII. For example, in Notepad++ use Encoding->Encode in ANSI.
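If re-saving the file is not an option, the BOM can also be dropped after reading it. The following is a minimal sketch that assumes the file carries a UTF-8 BOM (the bytes EF BB BF); a UTF-16 file would need a real conversion instead. The helper name stripUtf8Bom is hypothetical.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 byte order mark, if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    // Compare only the first three bytes; leave everything else untouched.
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}
```

With the question's code this would be used as $fh = stripUtf8Bom(file_get_contents($myfile));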

DOCX Encoding issues

I have a PHP script that reads information in from a MySQL Database and puts it into a DOCX file, using a template. In the template, there are placeholders called <<<variable_name>>> where variable_name is the name of the MySQL field.
DOCX files are Zip archives, so my PHP script uses the ZipArchive library to open up the DOCX and edit the document.xml file, replacing the placeholders with the correct data.
This worked fine until today, when I ran into some encoding issues. Non-ANSI characters are not encoded properly and make the output DOCX corrupt. MS Word gives the error message "Illegal XML character."
When I unzip the document and open document.xml in Notepad++, I can see the problematic characters. By going to the Encoding menu and selecting "Encode in ANSI", I can see the characters normally: they are pound (£) symbols. When N++ is set to "Encode in UTF-8", they appear as a hexadecimal value.
By selecting the N++ option to "Convert to UTF-8", the characters appear OK in UTF-8 and MS Word opens the document perfectly. But I don't want to manually unzip my DOCX archive every time I create something - The whole point of the script is to make generating the document quick and easy.
Obviously I need the PHP script to output the file in UTF-8 to make the '£' characters appear properly.
My code (Partially copied from another question on SO):
if (!copy($source, $target)) // make a duplicate so we don't overwrite the template
    print "Could not duplicate template.\n";
$zip = new ZipArchive();
if ($zip->open($target, ZipArchive::CHECKCONS) !== TRUE)
    print "Source is not a docx.\n";
$content_file = substr($source, -4) == '.odt' ? 'content.xml' : 'word/document.xml';
$file_contents = $zip->getFromName($content_file);
// Code here to process the file, get list of substitutions to make
foreach ($matches[0] as $x => $variable)
{
    $find[$x] = '/' . $matches[0][$x] . '/';
    // variable variable: the PHP variable named after the placeholder
    $replace[$x] = ${$matches[1][$x]};
}
$file_contents = preg_replace($find, $replace, $file_contents, -1, $count);
$zip->deleteName($content_file);
$zip->addFromString($content_file, $file_contents);
$zip->close();
chmod($target, 0777);
I have tried:
$file_contents = iconv("Windows-1252", "UTF-8", $file_contents);
And:
$file_contents_utf8 = utf8_encode($file_contents_utf8);
To try to get the PHP script to encode the file in UTF-8.
How can I make the PHP script encode the file into UTF-8 when saving, using the ZipArchive library?

Don't use any conversion functions; simply use UTF-8 everywhere.
Let's check that you really have UTF-8: in PHP, apply the bin2hex() function to the string that supposedly contains £. You should see c2a3 (bin2hex() prints lowercase hex), which is the UTF-8 encoding of £.
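The check described above can be scripted. This is a sketch, not a definitive fix: it assumes that any value which is not already valid UTF-8 came from the database as Windows-1252 (the "ANSI" code page mentioned in the question). The helper names isValidUtf8 and toUtf8 are hypothetical.

```php
<?php
// True when $s is well-formed UTF-8; the /u modifier makes PCRE validate the subject.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Assumption: input that is not valid UTF-8 is Windows-1252.
function toUtf8(string $s): string
{
    if (isValidUtf8($s)) {
        return $s; // already UTF-8, so do not convert a second time
    }
    $converted = iconv('Windows-1252', 'UTF-8', $s);
    if ($converted === false) {
        throw new RuntimeException('Input is neither UTF-8 nor Windows-1252');
    }
    return $converted;
}

// A lone 0xA3 byte is "£" in Windows-1252.
echo bin2hex(toUtf8("\xA3")), "\n"; // c2a3
```

Applying such a function to each replacement value before preg_replace() avoids double-encoding text that is already UTF-8, which is the usual risk of calling utf8_encode() unconditionally.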

LibreOffice writer truncate html .doc files at 65533 characters?

I generate an HTML-formatted .doc file from a PHP script. Everything works fine and my file is generated correctly, but if I try to open it with LibreOffice (v4.2.8.2), the file is silently truncated at the 65533rd character when displayed.
Is there a workaround? Is it a bug? Do you have any information about this?

I found the problem. All my text was directly inside the <body> tag. I split the text into several <div> elements and it worked (it won't work with HTML5 tags such as <article>).
I think that LibreOffice can't handle more than 65533 characters per tag.
In addition, I noticed the "same" problem in LibreOffice Calc: if you open an HTML-formatted .xls file, it will not display more than 65533 non-empty cells (here I didn't find, or search for, a workaround).
I think it's a BIG bug in this software (I didn't test with others such as OOo or MS Office). At the very least, a warning message should be displayed.
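The <div> workaround described above could look like the following on the PHP side. This is only a sketch: it assumes the document body starts out as plain text with blank lines between paragraphs, and that no single paragraph comes near the 65533-character limit the answer observed. The function name wrapParagraphsInDivs is hypothetical.

```php
<?php
// Hypothetical helper: puts each blank-line-separated paragraph in its own <div>,
// so that no single element holds the whole document text.
function wrapParagraphsInDivs(string $text): string
{
    $paragraphs = preg_split('/\R{2,}/', trim($text));
    $html = '';
    foreach ($paragraphs as $paragraph) {
        // Escape HTML special characters and keep single line breaks as <br />
        $html .= '<div>' . nl2br(htmlspecialchars($paragraph)) . "</div>\n";
    }
    return $html;
}
```

The result would be placed between <body> and </body> in the generated file.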

How to decode some special fonts in a pdf using xpdf?

I am using xpdf to convert PDF files to text.
Below is the code used for it.
$content = shell_exec('pdftotext ' . escapeshellarg($filename) . ' -');
Xpdf is not able to convert a few special fonts in a PDF to text.
For example, a bizarre font cannot be converted to text using xpdf.
Is there any alternative software, usable from PHP, that can convert all kinds of fonts in a PDF to text?

Maybe you should try the Poppler version of pdftotext if the XPDF version fails for your files....
However, take note of this fact, please: Not even Acrobat Reader can extract all cases of well rendered text on a PDF page to a text file...
