Importing/Copying and Pasting Word Document to HTML - php

We need to import, or copy and paste, Word documents and convert them to HTML-ready data.
Here are my thoughts:
collect the text with file_get_contents
apply the function nl2br
However, this does not account for bold and other text formatting.
Also, there are several Microsoft-specific characters that we do not want in the output.
What is a good strategy for importing Word documents into beautiful HTML?

I wouldn't try to tackle all of this on your own. word2cleanhtml.com looks like it will suit your needs and may have an API offering soon.
However, it appears that you can use Word itself from the command line to convert your document for you. This will, of course, require that MS Word is installed on your PHP server.
// Quote both paths: the path to WINWORD.EXE contains spaces.
shell_exec('"C:/Program Files/Microsoft Office/Office12/WINWORD.EXE" /msaveashtml "C:/path/to/your.doc"');
The above code uses the macro defined in this answer to a similar question. You will need to copy the saveashtml macro from that answer and add it to Word.

Related

PHP str_replace into microsoft word template

I have been given the task of building a contract letter generator for our HRMS.
I am already using CKEditor, but the result looks very different, since CKEditor was not built for the same purpose as Microsoft Word or Google Docs.
So my idea is to create the template in Microsoft Word first and then use the PHP function str_replace to insert the data into the Word template.
My questions are:
1. Is this approach possible?
2. If it is, can you show me a sample?
Many Thanks,
Hendra
There are several Classes that can do at least part of what you are trying to do:
wrklst/docxmustache
openTBS – Tiny But Strong
PHPWord
docxtemplater pro (basic opensource / free version / MIT license available as of writing; image replacing is a commercial plugin)
docxpresso (commercial)
phpdocx (commercial)
The first 4 of these are at least partially open source, and investigating the code will help you understand the process, which is not trivial with Word. In addition, you can check out http://officeopenxml.com for the format details.
The main problem I see is proper HTML to OpenXML conversion, meaning converting the styling from CKEditor (which might be HTML) into the proper XML styling. The two work quite differently, and a direct translation is not trivial. Check out https://github.com/wrklst/docxmustache/blob/master/src/WrkLst/DocxMustache/HtmlConversion.php to see some basic HTML conversion on singular runs of bold, italic and underlined text.
To my knowledge there is no maintained open source package that delivers proper HTML to OpenXML conversion. If you need this and cannot write it yourself, you will probably have to go for one of the paid solutions.
Good luck.
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
source : https://docxtemplater.readthedocs.io/en/latest/goals.html
You could use the library I created to solve this problem: https://github.com/open-xml-templating/docxtemplater. It works with JS in the browser or with node.js.
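For the simple case, the str_replace idea from the question can be sketched in PHP roughly as follows. This is only a sketch under assumptions: the function names replacePlaceholders and fillDocxTemplate are made up for illustration, the template is assumed to keep each {tag} inside a single <w:t> run, and it uses strtr (a single-pass variant of str_replace) together with PHP's ZipArchive extension. It does not solve the split-run problem described above.

```php
<?php
// Hypothetical helper: replaces {tag} placeholders in a chunk of document XML.
function replacePlaceholders(string $xml, array $values): string
{
    $map = [];
    foreach ($values as $tag => $value) {
        // Escape &, < and > so the inserted value keeps the XML well-formed.
        $map['{' . $tag . '}'] = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }
    // strtr replaces all keys in one pass, so inserted values are not re-scanned.
    return strtr($xml, $map);
}

// Hypothetical helper: copies the template and rewrites word/document.xml in the copy.
function fillDocxTemplate(string $template, string $output, array $values): bool
{
    if (!copy($template, $output)) {
        return false;
    }
    $zip = new ZipArchive();
    if ($zip->open($output) !== true) {
        return false;
    }
    // A .docx file is a zip archive; the main body lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    if ($xml === false) {
        $zip->close();
        return false;
    }
    $zip->addFromString('word/document.xml', replacePlaceholders($xml, $values));
    return $zip->close();
}
```

A call such as fillDocxTemplate('contract-template.docx', 'contract-hendra.docx', ['name' => 'Hendra']) would then produce a filled copy (both file names are placeholders). If Word has split a placeholder into several runs, as in <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>, this sketch leaves it untouched, which is the point where the libraries listed above become worthwhile.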

Is there an embedded visual (PHP) parser for Microsoft Word?

I am writing documentation about code that I have written, mainly in PHP. I will also write about other languages, but I am wondering what the easiest way is to display code within a Word document. I could just import a screenshot of a Notepad++ document, but I would like an easy way to include code in Microsoft Word without having to take a new screenshot every time I want to make a change. I am looking for something that will allow me to edit the code within Word, although it obviously does not need to be functional. I would like some sort of visual parsing (syntax highlighting) so that, as in Notepad++, the code is more readable.
You don't need to print screen in Notepad++. You can export/copy the text as RTF and preserve syntax highlighting and formatting.
I'm not on my PC at the moment, but the option is either under the TextFX menu, or the Plugins menu.
Works very nicely.
Edit:
In that menu, choose 'Copy RTF to Clipboard', and you can then paste into Word.
Instead of using Word to document your code, you could check out a document markup language called LaTeX.
It allows for easy documentation of code (and math) and is therefore a really good tool for creating scientific reports.
http://www.latex-project.org/
Here is a basic video tutorial on how it works:
http://www.youtube.com/watch?v=SoDv0qhyysQ

Fill in a Microsoft Word Form with PHP

I have a form created in Microsoft Word that I need to fill in via PHP. I have looked at PHPWord, but it looks like you can only create Word documents with it. I considered exporting the form to XML and editing it that way, but the formatting gets screwy from the export. Is there another way?
At the time of writing this, there is no direct solution.
You might want to have a look at the best answer for Create Word Document using PHP in Linux to get a hint (it uses OpenOffice documents, which you can change since they are XML+ZIP, and converts OpenDocument to .doc on the command line).
Another alternative, if you run your script on a Windows server, is to use the COM interface to talk to Word. See http://drewd.com/2007/01/25/reading-from-a-word-document-with-com-in-php for an example that reads a file. By digging through the Word COM API, you can also change existing documents.

Where can I find any text formatting scripts like the one used in Stack Overflow?

I want to give users the option to format text: bold, italic, add an image, etc.
I want to offer a list of options like the one shown here on Stack Overflow when asking a question.
Where can I find any predefined scripts for that? I searched on Google, but I think I did not use the right search terms, and I could not find anything relevant!
Actually:
Markdown is used by SO
Prettify is the code colorizer that StackOverflow uses.
TinyMCE/ WMD Editor (used by SO)
Markdown for PHP is located at
http://michelf.com/projects/php-markdown/
Alternatives to Markdown can be found at
http://en.wikipedia.org/wiki/Lightweight_markup_language
Somewhat related is this blog entry about what StackOverflow was built with:
https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
You'll find many more answers about SO on https://meta.stackoverflow.com/
SO uses the WMD editor, which you can find here. It also uses MarkdownSharp to generate the HTML shown on the page. You'd need to replace this with a PHP version of Markdown; @Gordon's answer contains a link.
What you are after is something to 'parse' the text.
This will be a special function that looks at a string such as **my text** and notices the pair of * before and after the string my text. It then converts the first pair into a <b> and the second pair into a </b>.
You can do it either in JavaScript or in server-side code, either before you store the text in the database or after you read it back.
There are lots of libraries that other people have mentioned. But if you want to do it yourself, that is the basic principle.
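As a rough illustration of that principle in PHP, a single regular expression can turn paired ** markers into bold tags. This is a minimal sketch, not a Markdown implementation: the function name boldMarkersToHtml is made up for this example, and escaping the input with htmlspecialchars first is an added assumption so that user-typed HTML is not passed through.

```php
<?php
// Hypothetical helper: converts **text** into <b>text</b>.
function boldMarkersToHtml(string $text): string
{
    // Escape any HTML the user typed, so only our own <b> tags end up in the output.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Non-greedy match: the first ** opens the bold run, the next ** closes it.
    return preg_replace('/\*\*(.+?)\*\*/s', '<b>$1</b>', $escaped);
}

echo boldMarkersToHtml('This is **my text** in a sentence.');
```

A real parser has to handle nesting, escaping of literal asterisks, and many other rules, which is why the Markdown libraries mentioned in the other answers are usually the better choice.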
These are all editors. So pick any one of them and customize its input options.

Is there a way to decode html e-mails?

I am writing support software and I figured for highlighting stuff it would be great to have HTML support.
Looking at Outlook's "HTML", I want to crawl up into the fetal position and cry!
Is there a PHP class to unscramble HTML emails down to basic HTML? I don't want to display the e-mails in a frame, because I want to work with the data and analyse it. I also don't want to support stupid things like changing the font: since it's a web app, I want my web app to say what the font is and not have some hippie who sends the support team e-mails in Comic Sans and yellow color. I want to support bold, italic, underlined, stretched out and lists (http://dl.getdropbox.com/u/5910/Jing/2009-02-23_2100.png).
I also don't quite know the difference between rich text and HTML. I always thought rich text only allowed the functions I wanted, but I seem to be able to do everything in rich text that I can do in HTML.
I should also add that I am using the Zend Framework because of the fabulous Zend_Mail.
You can pipe it through htmltidy and then further filter it with something like HtmlPurifier, but of course you may strip out something that is essential to understanding the contents. That's the problem with a visual format, like html.
You can use PHP's strip_tags() function and its optional "allowable_tags" parameter. This will allow you to strip out all the tags that are not <em> <b> <strong> <u> etc.
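A minimal sketch of that approach is shown below. The function name keepBasicFormatting and the exact whitelist are assumptions chosen to match the formatting the question asks for. Note that strip_tags() leaves the attributes of allowed tags in place, so it is not a full sanitizer on its own.

```php
<?php
// Hypothetical helper: keeps only basic formatting and list tags.
function keepBasicFormatting(string $html): string
{
    // Whitelist assumed from the question: bold, italic, underline and lists.
    $allowed = '<b><strong><i><em><u><ul><ol><li>';

    // Every tag not in the whitelist is removed; its text content stays.
    return strip_tags($html, $allowed);
}

$mail = '<p><font face="Comic Sans MS" color="yellow">Hello <b>world</b> and <i>all</i></font></p>';
echo keepBasicFormatting($mail);
```

Because attributes such as style or onclick survive on the allowed tags, output that is shown to other users should still go through a filter like HtmlPurifier, as suggested in the other answer.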
About RTF vs. HTML, my understanding is that when Outlook and Exchange communicate with non-RTF compliant systems they convert RTF to HTML. I'm not sure this is always true, or how consistent that function is, but that might explain why messages sent RTF appear to be HTML.
I'm pretty sure you'll have to write your own class... there is no real class like that in the PHP documents I've seen..
Or you could use the plain-text variant attached to the e-mail. If there is no plain-text variant you could use a stripped version of the html. I think using these steps you would have a nice result:
Remove newlines
Turn </p> and <br/> into newline
Strip all html tags
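The three steps above can be sketched in PHP as follows. The function name htmlMailToPlainText is made up for this example, and replacing the original newlines with a space (rather than deleting them outright) is an assumption made so that words on adjacent source lines do not run together.

```php
<?php
// Hypothetical helper: reduces an HTML mail body to plain text with line breaks.
function htmlMailToPlainText(string $html): string
{
    // Step 1: remove the newlines of the HTML source (replaced by a space here).
    $text = str_replace(["\r\n", "\r", "\n"], ' ', $html);

    // Step 2: turn </p> and <br>, <br/> or <br /> into real newlines.
    $text = preg_replace('#</p\s*>|<br\s*/?>#i', "\n", $text);

    // Step 3: strip all remaining HTML tags.
    $text = strip_tags($text);

    return trim($text);
}

echo htmlMailToPlainText("<p>Hello\nworld</p><p>Second<br/>line</p>");
```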
Pulling out the HTML from an Outlook mail may seem scary at first, but it's only HTML tags - just a whole lot of them!
So if you just locate a "<" and then find the next ">", you have a tag. If it is not something you want to have, like "</strong>", just throw it away and repeat. Simple as that.
(I have done exactly this in a spelling and grammar checker which not only pulls out plain text from Outlook and checks it - it can then push all the user's changes back into the HTML without destroying any tags. The latter was not easy, though! ;-)
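A hand-rolled version of that scan might look like the sketch below. It is only an illustration of the idea under assumptions: the function name filterTags and the keep-list parameter are made up, and the scanner does not handle comments, CDATA, or a ">" inside an attribute value, all of which real Outlook HTML may contain.

```php
<?php
// Hypothetical helper: walks the string tag by tag and keeps only listed tags.
function filterTags(string $html, array $keep): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            // No more tags: the rest is plain text.
            $out .= substr($html, $pos);
            break;
        }
        // Copy the text in front of the tag.
        $out .= substr($html, $pos, $open - $pos);

        $close = strpos($html, '>', $open);
        if ($close === false) {
            // No closing ">": keep the rest as plain text.
            $out .= substr($html, $open);
            break;
        }

        $tag = substr($html, $open, $close - $open + 1);
        // Extract the tag name from an opening or closing tag and check the keep-list.
        if (preg_match('#^</?\s*([a-zA-Z][a-zA-Z0-9]*)#', $tag, $m)
            && in_array(strtolower($m[1]), $keep, true)) {
            $out .= $tag;
        }
        $pos = $close + 1;
    }

    return $out;
}

echo filterTags('<div class="x">A <strong>b</strong> c</div>', ['strong']);
```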
