I am trying to parse a Word document. I upload the file using PHP and then try to read its contents with file_get_contents(), but when the result is displayed on the front end it contains a lot of garbage, like this:
Æ�Ѐ¤d�¤d�[$\$gd®l±����„h¤d�¤d�[$\$^„hgd®l±���
&�F�¤d�¤d�[$\$gd3¡���gd3¡����„,¤d�¤d�[$\$^„,gd(E����¤d�¤d�[$\$gdÿ/��<��C��D��I��Å������O��P��‚��¡��¢��¬����®��Ù��ã��ó��ô�����
So my question is how can I clean up this text?
Maybe give this a shot? http://www.phpclasses.org/package/3553-PHP-Edit-Microsoft-Word-documents-using-COM-objects.html
Word documents (.doc and .docx) are not plain text files. They are structured file formats that store formatting, fonts, and other data alongside the text, so the text does not simply start at byte 0. A .docx file is actually an archive (a .zip file) that contains a number of XML files holding the content and styles.
Your best bet is to use a text input form, or find code online that extracts just the text. Alternatively, download the .doc files to your own computer and open them with your own copy of MS Word.
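To make the point about file formats concrete, here is a minimal sketch (in Python, purely for illustration; the same check can be written in PHP) that looks at the first bytes of an upload to guess whether it is a legacy .doc or a .docx. The function name `sniff_word_format` is hypothetical. It assumes that a .docx is a ZIP archive containing `word/document.xml` and that a .doc starts with the OLE2 compound-file signature.

```python
import io
import zipfile

# Signature at the start of every OLE2 compound file, the binary
# container used by legacy .doc (and also by old .xls and .ppt files).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(data: bytes) -> str:
    """Guess the container format of an uploaded file from its raw bytes."""
    if data.startswith(OLE2_MAGIC):
        return "doc"
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            # Assumption: a .docx keeps its main body in word/document.xml.
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "zip"
    return "unknown"
```

Either result explains the garbage on screen: in both cases the raw bytes are a container, not the text itself.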
Related
I have an invoice.doc file and want to extract a customer email address, remove it from the doc file, add a company logo at the top right, and save the file in its original format using PHP.
MS Word does not save its files as plain text: .docx is a compressed (ZIP) format and the older .doc is a binary format, so you won't be able to see or edit the contents without decoding the file first. If you open one in a regular text editor you will see what I mean.
Your best shot would probably be to use PHPWord.
Take a look at it here: http://phpword.codeplex.com/
For old .DOC documents, you could use Antiword to extract the e-mail address. Altering the document is a different story, though. You could perhaps use ActiveX if you are on Windows with MS Office installed.
For the new .DOCX format you have more options, because the document is basically just a zipped set of XML files.
Is it possible to read a text file with PHP with styles?
My client doesn't want to write any code (not even [b][/b]), and he has to send those text files to translators to have them translated into 4 languages.
Then I have to post them on a site. The texts are very large, and I was wondering how I can keep the formatting without having to read and format all of them with BBCode or HTML by hand (they are updated very often).
I see 2 possible answers:
1. Strip all tags, send the texts to the translators, and re-add the style formatting tags afterwards (see strip_tags()).
2. Write a script that converts the texts into an editable file format like .docx or .odt, and reverse the process when the texts come back (there are some PHP libraries that can do that).
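As a rough illustration of the tag-stripping step in the first option, here is a minimal sketch in Python (used only for illustration; in PHP you would simply call strip_tags()). The helper below is a hypothetical stand-in with the same name. It handles HTML-style tags only, not BBCode, and it does not record where the tags were, so re-adding them afterwards is still a manual step.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup: str) -> str:
    """Return the text of an HTML fragment with all tags removed."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)
```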
I am trying to read a .doc file and find tokens like {name}, {phone}, {address}, etc. I then want to display each token with a text box so the user can enter the real data, and have the tokens in the .doc file replaced with that data. How can I do this using PHP? The colors, fonts, and styles of the .doc should not be changed.
thanks....
This will be very tricky if you are using old-style Word documents. The new Word documents are saved as a Zip archive and are therefore much easier to edit.
You can extract these files and, with some knowledge of their contents and of Word's XML format (WordprocessingML), edit the contents of the file.
It is much easier to use the PHPDocX library. We are using it in a project and it works like a charm. The only disadvantage is that it works only with .docx files.
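To show what editing the archive by hand looks like, here is a minimal sketch in Python (for illustration only; PHP's ZipArchive can do the equivalent). It copies a .docx and replaces {token} placeholders in `word/document.xml`. The function name `fill_docx_tokens` is hypothetical, and it assumes the XML is UTF-8 encoded. One important caveat: Word often splits a visible token such as {name} across several XML runs, in which case this simple string replace will not find it.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx_tokens(src_path, dst_path, values):
    """Copy a .docx, replacing {token} placeholders in word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")  # assumption: UTF-8 encoded XML
                for token, value in values.items():
                    # Escape the value so characters like & stay valid XML.
                    xml = xml.replace("{" + token + "}", escape(value))
                data = xml.encode("utf-8")
            # All other parts (styles, fonts, images) are copied unchanged.
            dst.writestr(item, data)
```

Because only the text inside the existing runs changes, the surrounding formatting markup is left alone, for example `fill_docx_tokens("template.docx", "filled.docx", {"name": "Jane"})`.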
I need a Windows program to convert a Word file (.doc) into text. Something like "antiword" for Windows.
I need it because I have to convert Word files into text and use Lucene to index them, and I am in a Windows environment :(
Thanks for all your help!!!
Yes. That program is called MS Word.
Open the file in Word via COM, and save it as text programmatically. On the other hand, is Lucene not able to read Word documents natively?
If you really need a program, here's one. I have not tried it, but you can give it a shot. Otherwise, you can just use COM / VBScript.
Using POI (http://poi.apache.org/) you should be able to index the old binary DOC formats. Relevant code snippets can be found on http://kalanir.blogspot.com/2008/08/how-to-index-microsoft-format-documents.html.
And for DOCX, since that's basically a ZIP file which contains a bunch of XML and resource files, it should be relatively easy to find the XML file containing the actual text (I think it's word/document.xml) and indexing the text contained in it (after stripping off all XML data)...
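The DOCX approach described above can be sketched as follows (in Python, for illustration only; the same idea applies in Java before handing the text to Lucene). The function name `docx_to_text` is hypothetical. It assumes the body lives in `word/document.xml` and uses the common (transitional) WordprocessingML namespace; documents saved in the "Strict" variant use a different namespace and would need adjusting.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the transitional WordprocessingML namespace used by most .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the plain text of a .docx, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Text is stored in w:t elements; join the runs of each paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This ignores headers, footers, and footnotes, which live in separate XML parts of the archive.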
You can use the OpenXML SDK to easily strip the text out of DOCX files. It does not work with .doc, though; you probably need to use MS Word and COM for that.
I am looking for some way to code a function (I'm open to any language or library at this point) that takes an existing PDF file as input and returns a modified PDF file in which certain words link to different URLs. I know PHP and ColdFusion both have good tools for dealing with PDFs, but I haven't been able to find anything that works.
I've been doing this by going through Acrobat and linking the text by hand and was wondering if there was any way to automate the procedure.
Thanks!
With ColdFusion you can extract the text with DDX (see Extracting text from a PDF document on the page), modify it using search/replace, and generate a new document.
If I understand what you're trying to do, you should be able to use CFPDF (http://livedocs.adobe.com/coldfusion/8/htmldocs/Tags_p-q_02.html#2922772) to read the PDF file into a ColdFusion variable, replace whatever content you want in that variable, then save the content back to PDF.