Convert HTML to DOC in PHP [duplicate] - php

I am converting HTML to DOC by sending headers from PHP. The converted file is saved with a .doc extension. But when I edit and save it, Word creates a folder with the same name as the file, containing three files (themedata.thmx, filelist.xml, colorschememapping.xml).
I am using the following code to generate the file:
header("Cache-Control: ");
header("Pragma: ");
header('Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
header('Content-Disposition: attachment; filename="'.$filename.'.doc"');
I want to produce a pure DOC file. Please help me fix this problem.
I have tried several other tools to convert HTML to DOC, for example:
1. Pandoc: it converts HTML to docx but does not pick up the styles and images from the HTML file.
2. PHPWord: I didn't find any option in PHPWord to convert fully styled HTML into docx. It builds the docx through function calls (addTable, addCell), which I do not want.
3. htmltodocx.codeplex.com: this plugin needs specific styles. It does not support all CSS.
4. unoconv: I could not get it working.
5. OpenOffice: I could not find a working command.

You cannot change a file format simply by changing the file extension. Do you think you can convert a PDF to a movie by changing the file name from .pdf to .mp4? I hope not, because it doesn't make any sense.
What's happening is that you're telling the browser to save the data of an HTML file with a .doc extension. When you double-click that file to open it, Word opens (because it's associated with the .doc extension). Word is forgiving enough to recognise that the file does not actually contain Word DOC data, but HTML, and it converts it for you on the fly without telling you.
When you then save this file, it creates an actual DOC/DOCX file for it; but apparently that doesn't happen cleanly and the container is breaking apart.
What you're seeing is a misbehaviour in Microsoft Word (on several levels).
What you should be doing to begin with is create an actual Word document, e.g. using https://github.com/PHPOffice/PHPWord.
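To see that the extension says nothing about the content, you can look at the first bytes of the file. The sketch below is only an illustration: `sniffDocumentFormat()` is a hypothetical helper name, and it relies on the facts that DOCX files are ZIP archives (starting with `PK`), that legacy binary DOC files start with the OLE2 compound-file signature, and that an HTML page starts with markup.

```php
<?php
// Guess the real format from the leading bytes, ignoring the file extension.
// sniffDocumentFormat() is a hypothetical helper, not part of any library.
function sniffDocumentFormat(string $bytes): string
{
    // DOCX is a ZIP container; ZIP local file headers start with "PK\x03\x04".
    if (strncmp($bytes, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    // Legacy binary DOC uses the OLE2 compound file signature.
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole2';
    }
    // HTML that was merely given a .doc name still starts with markup.
    $head = strtolower(ltrim(substr($bytes, 0, 256)));
    if (strpos($head, '<!doctype html') === 0 || strpos($head, '<html') === 0) {
        return 'html';
    }
    return 'unknown';
}
```

For a file on disk, something like `sniffDocumentFormat(file_get_contents($path, false, null, 0, 256))` would report `html` for the download produced by the headers in the question.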

Related

PHP - Check if pdf contains given text - TcpdfFpdi / pdftk / fpdi

I have a PDF document and I want to check whether a specific text occurs in it (tags that I put in while generating the PDF). However, using these libraries (TcpdfFpdi, pdftk or fpdi), I couldn't figure out whether it's possible or how to do it.
$str = "{hello}";
$pdf = new TcpdfFpdi();
$pdf->setSourceFile($filePath);
$pdf->searchForText($str); // something like this which returns boolean
If I try without any library, using dd(file_get_contents($filePath)), it returns a very long output that doesn't seem to contain the text I'm looking for, so I think it's better to use one of those libraries.
Just an idea…
It's not an actual PHP solution, but you could use a tool like pdftotext, which I know from this post (where a PDF file is converted into a string to count its words): https://superuser.com/a/221367/535203
You can install it, play around with the command, and call it from within your PHP application.
As far as I remember (it's a long time since I used pdftotext), the output text is not exactly the PDF's content, but for finding a few tags in it, it's at least worth a try.
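Calling pdftotext from PHP could look roughly like the sketch below. It assumes the `pdftotext` binary (from poppler-utils or xpdf) is installed and on the PATH; passing `-` as the output file makes it write the text to stdout. The function names are hypothetical.

```php
<?php
// Build the shell command; "-" tells pdftotext to print the text to stdout.
function buildPdftotextCommand(string $pdfPath): string
{
    // escapeshellarg() protects against spaces and shell metacharacters in the path.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Return true if the extracted text of the PDF contains $needle.
// Assumes pdftotext is installed; returns false if the command fails.
function pdfContainsText(string $pdfPath, string $needle): bool
{
    $text = shell_exec(buildPdftotextCommand($pdfPath));
    if (!is_string($text)) {
        return false;
    }
    return strpos($text, $needle) !== false;
}
```

With this in place, `pdfContainsText($filePath, '{hello}')` gives the boolean the question asks for, subject to the caveat above that the extracted text may differ slightly from what the PDF displays.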

How should I set the download name of a pdf with fpdf?

I am trying to set a name for a pdf file I generated with FPDF. However for some reason the browser changes some characters.
I am sending this:
$pdfTitle = 'Overview: 2017/2018';
$pdf->Output( 'D', $pdfTitle, true );
Yet when I save my PDF, some characters are changed and the download name becomes: 'Overview_ 2017_2018'.
I am using UTF-8 encoding on my php file.
FPDF-documentation: http://fpdf.org/en/doc/output.htm
I have two questions:
How can I make sure the download name is the same as the one I set in my php file?
What is the underlying issue that changes the name?
PS: In the real project the string will come from a database, so I can only access the string programmatically and cannot make direct changes to it.
Your file name contains the special characters : and /. Because of this, the file name is being filtered.
For example, in
Overview: 2017/2018
the characters : and / are not supported in file names on Windows and some other operating systems.
Tip:
Add .pdf to the name if the file is not being saved as a PDF file.
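Since the title comes from a database, one option is to replace the problematic characters yourself before passing the name on, so that you control the result. This is a minimal sketch: `safeDownloadName()` is a hypothetical helper, and the character list is the set of characters Windows rejects in file names.

```php
<?php
// Replace characters that are not allowed in Windows file names
// (< > : " / \ | ? *) and append the .pdf extension.
// safeDownloadName() is a hypothetical helper name.
function safeDownloadName(string $title, string $replacement = '-'): string
{
    $clean = preg_replace('/[<>:"\/\\\\|?*]/', $replacement, $title);
    return $clean . '.pdf';
}
```

The call from the question would then become `$pdf->Output('D', safeDownloadName($pdfTitle), true);`, which yields the name `Overview- 2017-2018.pdf` for the example title.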

Export Microsoft word xml file into docx

I am trying to create a Microsoft Word document without using any third-party libraries. What I am trying to do is:
Create a template document in Microsoft Word
Save it as an XML File
Read this XML file and populate the data in PHP
I am able to do this so far. I would like to export the result in *.docx format. However, when I do that, Word reports an error when I try to open the file.
Error Message : File is corrupt and cannot be opened
However, when I save it as *.doc, I am able to open the Word document.
Any idea what could be wrong? Do I need to use a library to export it to a docx file?
Thanks
Docx is not backwards-compatible with doc. Docx is a zipped format (see the docx tag info).
I would recommend you to create another template for the docx format, because the formats are so different.
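To illustrate what "zipped format" means in practice: a .docx file is a ZIP archive that contains several XML parts, so a single XML file renamed to .docx cannot open. The sketch below builds the three parts a minimal package needs and writes them with PHP's bundled `ZipArchive` class (this assumes the zip extension is enabled). The helper names are hypothetical, and a document saved from Word contains more parts than this.

```php
<?php
// Sketch: the three parts of a minimal .docx package, keyed by their path in the ZIP.
// minimalDocxParts() and writeDocx() are hypothetical helper names.
function minimalDocxParts(string $text): array
{
    // Escape &, < and > so the text is safe inside XML.
    $safe = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $decl = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';

    return [
        '[Content_Types].xml' => $decl
            . '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
            . '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
            . '<Default Extension="xml" ContentType="application/xml"/>'
            . '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
            . '</Types>',
        '_rels/.rels' => $decl
            . '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
            . '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
            . '</Relationships>',
        'word/document.xml' => $decl
            . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
            . '<w:body><w:p><w:r><w:t xml:space="preserve">' . $safe . '</w:t></w:r></w:p></w:body>'
            . '</w:document>',
    ];
}

// Write the parts into a ZIP container. Requires the zip extension.
function writeDocx(string $path, array $parts): bool
{
    $zip = new ZipArchive();
    if ($zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }
    foreach ($parts as $name => $xml) {
        $zip->addFromString($name, $xml);
    }
    return $zip->close();
}
```

A call such as `writeDocx('hello.docx', minimalDocxParts('Hello'))` would produce a small package of this shape. For a real template, the populated XML would have to go into `word/document.xml` in WordprocessingML form, not in the single-file XML format that Word's "Save as XML" produces.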
Also, check that your code reads and writes with the correct encoding. Before I set the correct encoding, I was getting odd characters that broke the file when I converted it to .docx. To fix this, I specified the encoding on the input stream:
InputStreamReader isr = new InputStreamReader(template.getInputStream(entry), "UTF-8");
BufferedReader fileContents = new BufferedReader(isr);
I used this while enumerating the entries; the "UTF-8" argument makes the reader decode the content correctly and eliminates the odd characters. I was also getting "null" written out at the end of some of the XML files, so I removed it (I had read the contents of each file into a string so I could manipulate it anyway):
String ending = "null";
while (sb.indexOf(ending) != -1) {
    sb.delete(sb.indexOf(ending), sb.indexOf(ending) + ending.length());
}
sb is the StringBuilder I put the contents into. The UTF-8 change may have solved this problem as well, but I fixed it before I added the encoding, so I figured I'd include it in case it turns out to be a problem. I hope this helps.

Edit Word content, Replace Keyword with other text with PHP [duplicate]

I want to edit a file whose extension is .doc.
This file contains some keywords such as:
<<Customer>>
<<DateStart>>
etc.
Now I want to read the content of the file, edit it and then write it to a new Word file.
I tried it this way:
header ("Content-Type: application/msword; ");
header ("Content-Disposition: attachment; filename=new.doc");
$filename='a.doc';
$key = "«Customer»";
$fc=file($filename);
$f=fopen("'C:\Users\Ciro\Desktop\new.doc","w");
foreach($fc as $line){
if (strpos($line,$key))
$line=str_replace($key,"some new text",$line);
file_put_contents($f,$line);
}
fclose($f);
fclose($fc);
How can I fix it?
The following line was wrong: it had a misplaced ' in the path. Use single quotes as well, because inside double quotes the \n in \new.doc is read as a newline:
$f = fopen('C:\Users\Ciro\Desktop\new.doc', 'w');
Also, DOC files aren't simple TXT files: their content is binary-encoded. If possible, convert the file to TXT first.
PHP only sees the raw bytes of a DOC file, not the text you see in Word. You won't be able to achieve what you want unless you convert the DOC file to a plain-text format such as TXT.
UPDATE:
Try catdoc, which converts any .doc file into plain text. See the catdoc homepage
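Once the content is available as plain text (for example from catdoc's output), replacing the keywords is straightforward. The sketch below uses `strtr()` and assumes the `<<Name>>` placeholder style from the question; `fillPlaceholders()` is a hypothetical helper name. Note that the result is plain text, not a Word document.

```php
<?php
// Replace <<Name>> placeholders in plain text with the given values.
// fillPlaceholders() is a hypothetical helper name.
function fillPlaceholders(string $text, array $values): string
{
    if ($values === []) {
        return $text;
    }
    $map = [];
    foreach ($values as $name => $value) {
        $map['<<' . $name . '>>'] = (string) $value;
    }
    // strtr() replaces all keys in one pass, so replaced text is never re-scanned.
    return strtr($text, $map);
}
```

This also avoids the `if (strpos($line, $key))` check from the question, which is false when the keyword sits at position 0 of a line.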

Concatenate RTF files with PHP without header

I have some RTF files generated by users with Microsoft Word. I need to be able to concatenate these files, and the resulting file should still be readable by LibreOffice. I'm using LibreOffice to convert the resulting file into a PDF file.
To concatenate two files, my application removes the last character of the first file and the first character of the other file. The file headers are not removed (I'm not talking about page headers).
For some reason, LibreOffice does not like the headers inserted by Microsoft Word. But it works fine if I open these files with WordPad and save them.
Another way to remove these headers is to convert the files to RTF with LibreOffice before I concatenate them. This way I can convert to PDF, but LibreOffice makes a serious mess of my tabs when I convert my files to RTF.
So how can I remove the headers with PHP without messing up the tabs? Or do you have another way to get the same result?
Edit :
In a nutshell, I must be able to concatenate these files so that LibreOffice can open the result. And my tabs must still display nicely in Microsoft Word.
As you can guess, the users don't want to use WordPad, and my customer's IT department has to comply with that wish (office politics).
UPDATE :
I have to do the merging first, because of business rules. The files are merged, then my users can modify the result using Word (no problems here). Then they ask their boss to validate it. If the boss agrees to validate it, the RTF file becomes a PDF file.
UPDATE 2 :
I have the beginning of a solution. If the RTF file starts with plain text or a picture, you have to remove everything up to \pard. But this does not work if your file starts with a tab.
UPDATE 3 :
If you want to support tabs too, you have to remove everything up to \pard or \trowd. I'm going to post the complete solution once I have working code. This will work fine as long as you don't need colours and all your files use the same font (because we don't remove the RTF headers of the first file).
If the limitations with the 'pure RTF' approach come back to bite you, you could use LibreOffice to convert your RTF files to docx, then use a tool to merge the docx files.
There are such tools for .NET and Java (such as our MergeDocx product); I'm not sure what you'll find for PHP.
I managed to write reliable code that makes it possible to manipulate RTF files created with Microsoft Word. It works as long as you only need text, pictures and tabs, and don't need fancy things such as color. Color works for text, but beyond that ...
// Hypothetical wrapper name; the original snippet was the body of a function.
function stripRtfHeader($RTFstring) {
    // stristr() returns the haystack from the first occurrence of the needle to the end, or false.
    $tmp_pard = stristr($RTFstring, '\pard');
    // Single quotes are required here: in double quotes "\t" is a tab, so "\trowd" is never found.
    $tmp_tab = stristr($RTFstring, '\trowd');
    if ($tmp_pard === false && $tmp_tab === false) {
        // Neither marker found: return the input unchanged.
        return $RTFstring;
    }
    // The longer remainder starts at whichever of \pard or \trowd occurs first.
    // "{" is added so the concatenation code still works; only the header is removed.
    if (strlen((string) $tmp_pard) > strlen((string) $tmp_tab)) {
        return "{" . $tmp_pard;
    }
    return "{" . $tmp_tab;
}
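The concatenation step described in the question (drop the last character of the first file and the first character of the second) can be sketched as follows. `concatRtf()` is a hypothetical helper name; it assumes the second string has already had its header stripped and starts with `{`, as the code above produces.

```php
<?php
// Join two RTF strings into one group by removing the closing brace of the
// first and the opening brace of the second. concatRtf() is a hypothetical helper.
function concatRtf(string $first, string $second): string
{
    // Trailing or leading whitespace would hide the braces we want to remove.
    $first = rtrim($first);
    $second = ltrim($second);

    if (substr($first, -1) === '}') {
        $first = substr($first, 0, -1);
    }
    if (substr($second, 0, 1) === '{') {
        $second = substr($second, 1);
    }
    return $first . $second;
}
```

Checking for the braces before removing them makes the helper a little more forgiving than blindly cutting one character from each side, for instance when a file ends with a newline.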
