Convert HTML code to doc using PHP and PHPWord

I am using PHPWord to load a docx template and replace tags like {test}. This is working perfectly fine.
But I want to replace a value with HTML code. Directly replacing it into the template is not possible. There is no way to do this using PHPWord, as far as I know.
I looked at htmltodocx, but it seems it will not work either. Is it possible to transform a piece of code like <p>Test<b>test</b><br>test</p> into working doc markup? I only need the basic markup, no styling, but line breaks have to work.

Html-Docx-js (available on GitHub) works fine, and a demo is also available.
Another option is HTMLtoOpenXML:
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);

The other answers propose H2OXML, which, according to its docs, only supports:
Bold, italic and underlined text
Bulleted lists
Its last update was in 2012.
I did some research and found a pretty nice solution:
$var = 'Some text';
$xml = "<w:p><w:r><w:rPr><w:strike/></w:rPr><w:t>". $var."</w:t></w:r></w:p>";
$templateProcessor->setValue('param_1', $xml);
The example above produces struck-through text. Instead of "w:strike" you can use "w:i" for italic or "w:b" for bold, and so on. I'm not sure whether it works for all tags.
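To cover the markup from the question (bold, italic, line breaks), the same idea can be extended into a small converter. The following is only a sketch: htmlToRuns() is a hypothetical helper, not part of PHPWord, and it handles just <b>/<strong>, <i>/<em> and <br>, ignoring every other tag.

```php
<?php
// Hypothetical helper (not part of PHPWord): converts a tiny HTML subset
// into WordprocessingML runs. Unknown tags such as <p> are ignored.
function htmlToRuns(string $html): string
{
    $bold = false;
    $italic = false;
    $xml = '';
    // Split into tags and text, keeping the tags as separate parts.
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        if ($part[0] !== '<') {
            // Text: emit one run carrying the current formatting.
            $props = ($bold ? '<w:b/>' : '') . ($italic ? '<w:i/>' : '');
            $rPr = $props === '' ? '' : "<w:rPr>$props</w:rPr>";
            $text = htmlspecialchars(html_entity_decode($part), ENT_QUOTES | ENT_XML1);
            $xml .= "<w:r>$rPr<w:t xml:space=\"preserve\">$text</w:t></w:r>";
            continue;
        }
        $isOpening = $part[1] !== '/';
        switch (strtolower(trim($part, '<>/ '))) {
            case 'b':
            case 'strong':
                $bold = $isOpening;
                break;
            case 'i':
            case 'em':
                $italic = $isOpening;
                break;
            case 'br':
                $xml .= '<w:r><w:br/></w:r>';
                break;
        }
    }
    return $xml;
}

$runs = htmlToRuns('<p>Test<b>test</b><br>test</p>');
```

The result could be wrapped in <w:p>...</w:p> and passed to setValue() as in the example above. Whether the document then opens cleanly depends on where the placeholder sits in the template, so treat that step as untested.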

Thanks for your answer, Varun.
The simple PHP library H2OXML works for me https://h2openxml.codeplex.com/
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);
I can now convert HTML code and insert it using PHPWord.

$content = '<p>Test<b>test</b><br>test</p>';
// Call addHtml() before IOFactory::createWriter().
// addHtml() appears to parse its input as XML; if <br> is rejected, try <br/>.
\PhpOffice\PhpWord\Shared\Html::addHtml($section, $content);

Related

PHP pdf form parse regex

I have two PDF forms whose fields I'd like to fill in using PHP. There don't seem to be any open source solutions. The only solution seems to be Setasign, which costs over $400. So instead I'm trying to dump the data as a string, parse it using a regex and then save it. This is what I have so far:
$pdf = file_get_contents("../forms/mypdf.pdf");
$decode = utf8_decode($pdf);
$re = "/(\d+)\s(?:0 obj <>\/AP<>\/)(.*)(?:>> endobj)/U";
preg_match_all($re, $decode, $matches);
print_r($matches);
However, my print_r is empty even after testing here. The matches on the right are first a numerical identifier for the field (I think) and then V(XX1) where "XX1" is the text I've manually entered into the form and saved (as a test to find how and where that data is stored). I'm assuming (but haven't tested) that N<>>>/AS/Off is a checkbox.
Is there something I need to change in my regex to find matches like (2811 0 obj <>/AP<>/V(XX2)>> endobj) where the first find will be a key and the second find is the value?
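For the sample object shown above, one possible approach is to match in two steps: first split the dump into objects, then look for a /V(...) entry inside each one. This is only a sketch; extractFieldValues() is a hypothetical helper, and it assumes the objects are stored uncompressed and that values are literal strings without nested or escaped parentheses.

```php
<?php
// Hypothetical helper: maps object numbers to their /V(...) values.
function extractFieldValues(string $pdf): array
{
    $values = [];
    // Step 1: capture each "N G obj ... endobj" block (number and body).
    preg_match_all('/(\d+)\s+\d+\s+obj\b(.*?)endobj/s', $pdf, $objects, PREG_SET_ORDER);
    foreach ($objects as $object) {
        // Step 2: look for a literal-string value inside this object only.
        if (preg_match('/\/V\s*\(([^)]*)\)/', $object[2], $m)) {
            $values[$object[1]] = $m[1];
        }
    }
    return $values;
}

$dump = "2810 0 obj <>/AP<>/N<>>>/AS/Off>> endobj\n2811 0 obj <>/AP<>/V(XX2)>> endobj";
$values = extractFieldValues($dump);
```

Matching per object keeps a value from being attributed to an earlier object that has no /V entry of its own.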

Part 1 - Extract text from PDF
Download class.pdf2text.php from http://pastebin.com/dvwySU1a (updated on 5 April 2014) or http://www.phpclasses.org/browse/file/31030.html (registration required)
Usage:
include('class.pdf2text.php');
$a = new PDF2Text();
$a->setFilename('test.pdf');
$a->decodePDF();
echo $a->output();
The class doesn't work with all the PDFs I've tested, but give it a try and you may get lucky :)
Part 2 - Write to PDF
To write the pdf contents use tcpdf which is an enhanced and maintained version of fpdf.

Thanks to those who've looked into this. I decided to convert the PDFs (since I'm not doing this as a batch) into SVG files. This online converter kept the form fields, and with some small edits I've made them printable. Now I'll be able to populate the values and have a visual representation of the PDF. I may try tcpdf in the event I want to make it an actual PDF again, though I'm assuming it won't keep the form fields.

Can I "fix" misplaced HTML tags using PHP?

Is there a quick and easy way to fix HTML tags that are misplaced, in a web document? Such as:
<strong><span style="border:1px;">Text</strong></span>
(the two closing tags are in the wrong order)
So that it looks like:
<strong><span style="border:1px;">Text</span></strong>
Edit: You are suggesting HTML fixers, but what I'm looking for is a function-type solution. Would it help to think of this as BBCode? [b][u]Text[/b][/u]

I think the best solution is HTML Purifier; it works pretty well:
Demo: http://htmlpurifier.org/demo.php
It handles your input perfectly.

How should a computer know whether you meant for the span to be inside the strong, or the other way around?
The "quick and easy way" is to run your document through an HTML validator, then fix the issues that it identifies using your noggin and keyboard.

You can use tidy::repairFile() or tidy::repairString(), but repairing is not straightforward, so you can never be sure the result will be what you expect. Example from the documentation:
<?php
$file = 'file.html';
$tidy = new tidy();
$repaired = $tidy->repairfile($file);
rename($file, $file . '.bak');
file_put_contents($file, $repaired);
?>
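For the function-type solution the question asks for, one possible sketch uses a stack of open tags. fixNesting() is a hypothetical helper, and its policy is an assumption: tags are closed in the reverse of the order in which they were opened, and stray closing tags are dropped. It is regex-based and does not handle comments, scripts or other edge cases a real parser covers.

```php
<?php
// Hypothetical helper: re-nests closing tags using a stack of open tags.
function fixNesting(string $html): string
{
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $open = [];   // names of currently open tags, innermost last
    $out = '';
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        if (preg_match('/^<\/(\w+)\s*>$/', $part, $m)) {
            $name = strtolower($m[1]);
            if (!in_array($name, $open, true)) {
                continue; // stray closing tag: drop it
            }
            // Close everything opened after the matching tag, innermost first.
            do {
                $top = array_pop($open);
                $out .= "</$top>";
            } while ($top !== $name);
        } elseif (preg_match('/^<(\w+)\b[^>]*>$/', $part, $m)) {
            $name = strtolower($m[1]);
            // Void and self-closing tags never go on the stack.
            if (!in_array($name, $void, true) && substr($part, -2) !== '/>') {
                $open[] = $name;
            }
            $out .= $part;
        } else {
            $out .= $part; // plain text
        }
    }
    // Close anything that is still open at the end.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

$fixed = fixNesting('<strong><span style="border:1px;">Text</strong></span>');
```

The same stack logic would apply to BBCode if the two patterns were changed to match square brackets instead of angle brackets.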

How to insert wysiwyg edited text into database

The Idea
I have a forum in my project and I'm trying to implement a WYSIWYG editor for posting questions and answers. The project is actually already complete, and now I'm trying to add the NicEdit WYSIWYG editor to all the required textareas. I'm using the following lines of code prior to inserting the posted data into the database:
$post_body = $_POST['post_body'];
$post_body = nl2br(htmlspecialchars($post_body));
$post_body = mysqli_real_escape_string($db_conx,$post_body);
Problem
When I insert the posted data I'm getting the following output :
Conclusion
I want to know how to insert the WYSIWYG-edited content into the database. Apart from this, please suggest some nice WYSIWYG plugins which are easy to embed in my project. Currently, I'm using NicEdit, which is pretty easy to embed but has limited functionality.

Why don't you decode it with: htmlspecialchars_decode() before echoing?
That may solve your problem.
Edit:
You could use strip_tags() and allow only a few markup tags.
E.g.:
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
That way you could strip risky tags like <script>.

Don't do this:
$post_body = nl2br(htmlspecialchars($post_body));
That is how you create an HTML representation of plain text. You are asking people to submit HTML, not plain text.
(Do, however, implement a whitelist based, HTML aware XSS filter such as HTML Purifier)
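A small sketch of what the functions discussed in this thread actually do to editor output may make the point clearer. The input strings are made up for illustration.

```php
<?php
// Made-up editor output for illustration.
$posted = "<b>Hello</b>\nworld";

// What the question's code stores: the markup is escaped into visible text,
// so the browser later shows literal "<b>" instead of bold text.
$escaped = nl2br(htmlspecialchars($posted));

// strip_tags() removes disallowed tags but keeps their inner text, and it
// keeps attributes on allowed tags, so it is not a complete XSS filter.
$stripped = strip_tags('<p onclick="x()">Hi</p><script>alert(1)</script>', '<p><a>');
```

Here $escaped no longer contains any real tags apart from the <br /> added by nl2br(), while $stripped still carries the onclick attribute. That is why an allowlist-based filter such as HTML Purifier is suggested above.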

How to get Wikipedia "clean" content?

I'm using the MediaWiki API to get content from Wikipedia pages.
I've written code which generates the following query (for example):
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=hawaii
This retrieves only the lead section of the Wikipedia page about Hawaii.
The problem is that, as you might notice, there are a lot of irrelevant substrings such as:
"[[Molokai|Moloka{{okina}}i]], [[Lanai|Lāna{{okina}}i]], [[Kahoolawe|Kaho{{okina}}olawe]], [[Maui]] and the [[Hawaii (island)|".
All those brackets [[ ]] are not relevant, and I wonder whether there is an elegant method to pull only 'clean' content from such pages.
Thanks in advance.

You can get clean HTML text from Wikipedia with this query:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii
If you want just plain text, without HTML, try this:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii&explaintext
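From PHP, the query could be built and its response unpacked roughly as follows. This is a sketch: buildExtractUrl() and firstExtract() are hypothetical helpers, format=json and exintro (lead section only) are additions not present in the URLs above, and the sample JSON is a hand-written stand-in that only shows the expected shape of a response.

```php
<?php
// Builds the extracts query; exintro limits the result to the lead section.
function buildExtractUrl(string $title): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts',
        'titles'      => $title,
        'explaintext' => 1,
        'exintro'     => 1,
        'format'      => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Pulls the first page's "extract" out of a JSON API response.
function firstExtract(string $json): ?string
{
    $data = json_decode($json, true);
    foreach ($data['query']['pages'] ?? [] as $page) {
        return $page['extract'] ?? null;
    }
    return null;
}

// Hand-written stand-in for a response, not real API output.
$sample = '{"query":{"pages":{"1":{"title":"Hawaii","extract":"Sample text."}}}}';
$url = buildExtractUrl('hawaii');
$text = firstExtract($sample);
```

In real use the JSON would come from an HTTP request to $url, for example via file_get_contents() or cURL.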

Please try this (the square brackets must be escaped, otherwise they are read as a character class):
$relevant = preg_replace('/\[\[.*?\]\]/', '', $string);
EDIT: just found this - hope it is helpful
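Removing each [[...]] entirely also removes the visible link text. A variant that keeps the label might look like the sketch below; stripWikiMarkup() is a hypothetical helper, and it only handles plain links and simple, non-nested templates.

```php
<?php
// Hypothetical helper: strips the most common wikitext markup with regexes.
function stripWikiMarkup(string $wikitext): string
{
    // [[Target|Label]] -> Label, [[Target]] -> Target
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/u', '$1', $wikitext);
    // Drop simple, non-nested {{templates}}
    return preg_replace('/\{\{[^{}]*\}\}/u', '', $text);
}

$clean = stripWikiMarkup('[[Lanai|Lāna{{okina}}i]], [[Maui]] and the rest');
```

Note that dropping {{okina}} also drops the character that template would have rendered, so the output is an approximation of the displayed text.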

Read the content of a PDF with PHP?

I need to read certain parts of a complex PDF. I searched the net and some say FPDF is good, but it can't read PDFs, it can only write them. Is there a library out there which can extract certain content from a given PDF?
If not, what's a good way to read certain parts of a given PDF?
Thanks!

I see two solutions here:
converting your PDF file into something else first: text or HTML.
using a library to do so. The bad news here is that most of them are written in Java.
https://whatisprymas.wordpress.com/2010/04/28/lucene-how-to-index-pdf-files/

What about this?
http://www.phpclasses.org/package/702-PHP-Searches-pdf-documents-for-text.html
P.S. I haven't tested this class, I've only read the description.

$result = pdf2text ('sample.pdf');
echo "<pre>$result</pre>";
How to get "clean" text (source code of pdf2text):
http://webcheatsheet.com/php/reading_clean_text_from_pdf.php
