Document to Markdown using PHP - php

I was wondering whether there is a way to convert a document (doc or docx) that contains images and text into Markdown.
Example: the document contains an image and a description of that image.
I am trying to convert that document into Markdown such as the following:
<img src="document_name/media/image1.png" width="624" height="505" />
followed by the description in Markdown.
When I searched, I only found Markdown parsers and converters that turn text into HTML.

Doc and Docx are complex formats (DOC is a proprietary binary format, DOCX is a zipped XML package), and few tools target Markdown directly, so converting one to the other in a single step will be difficult. It's better to use HTML as an intermediate step.
There are many PHP solutions for reading MS Word documents but, as you have perhaps found, they are all slightly flawed: some read the file but don't convert it to HTML, others don't include images, and so on.
As an alternative, you could try an online API, such as:
http://apiv2.online-convert.com/
I haven't tested this, but it could be a good solution: it converts to HTML, and you have to write and maintain very little code yourself.
The conversion from HTML to Markdown is relatively easy, and libraries for it are available online:
https://github.com/thephpleague/html-to-markdown
https://github.com/Elephant418/Markdownify
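To give a feel for that second step, here is a minimal sketch that uses only PHP's bundled DOM extension rather than either library above. It handles just paragraphs, headings, bold, italics and images, and it assumes ASCII input. The names htmlToMarkdownSketch() and renderNode() are made up for this illustration; for real documents one of the libraries linked above is the better choice.

```php
<?php
// Hypothetical helper: render one DOM node (and its children) as inline Markdown.
function renderNode(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse runs of whitespace, as a browser would when rendering.
        return preg_replace('/\s+/', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // comments and other node types are dropped
    }
    $name = strtolower($node->nodeName);
    if ($name === 'img') {
        // Markdown has no width/height syntax, so keep the tag as raw HTML.
        return $node->ownerDocument->saveHTML($node);
    }
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= renderNode($child);
    }
    switch ($name) {
        case 'strong':
        case 'b':
            return '**' . $inner . '**';
        case 'em':
        case 'i':
            return '*' . $inner . '*';
        default:
            return $inner; // unknown tags: keep their text only
    }
}

// Hypothetical helper: convert an HTML fragment to Markdown blocks.
function htmlToMarkdownSketch(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy converter output
    $doc->loadHTML($html);            // libxml wraps the fragment in <html><body>
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $blocks = [];
    foreach ($body->childNodes as $node) {
        $text = trim(renderNode($node));
        if ($text === '') {
            continue;
        }
        if ($node instanceof DOMElement && preg_match('/^h([1-6])$/i', $node->nodeName, $m)) {
            $text = str_repeat('#', (int) $m[1]) . ' ' . $text;
        }
        $blocks[] = $text;
    }
    return implode("\n\n", $blocks); // blank line between Markdown blocks
}

$html = '<p><img src="document_name/media/image1.png" width="624" height="505"></p>'
      . '<p>A <strong>short</strong> description.</p>';
echo htmlToMarkdownSketch($html), "\n";
```

The image is passed through as raw HTML, which matches the output asked for in the question, and the description becomes a Markdown paragraph after a blank line.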

Related

Create Word Document from PHP Documentation

To document my code, I thought it would be best practice to use phpDoc syntax, because there are several parsers out there and some IDEs build IntelliSense from it.
Now I need to put the documentation (API) into a Word file, but I don't know which parser is able to output .doc or something similar.
I tried DoxyGen, which outputs .rtf, and phpDocumentor2, which as far as I can tell can only export to .html and .xml.
Is there a way to generate a .doc(x) file from phpDoc? Or a simple way to get a document that can be imported into Word?
I would appreciate not having to change the phpDoc syntax, because my documentation is very long.
Edit: The preferred parser would be phpDocumentor2, because it supports PHP 5.3 features and is faster than DoxyGen; however, phpDocumentor2 supports fewer output formats than the original phpDocumentor, which is no longer maintained.
Edit: I tried to copy content from the .rtf file into the .docx file, but when I select 'Use Destination Styles', both Word instances hang and stop responding.
Presumably you want one large Word doc that contains all the info for your project in a single file. In that case, just opening the phpDoc2 HTML output in Word to convert it to docx will not meet your need, since that would give one docx per phpDoc2 HTML page.
You might try searching instead for a tool that can spider a given HTML page, recursively follow its whole hierarchy of linked pages, and convert it all into a single docx. You might have more luck finding a tool that does this but produces a PDF; you could then use Word to convert the PDF into docx.

Reading text file with style (bold, italics...)

Is it possible to read a text file with PHP with styles?
My client doesn't want to write any code (not even [b][/b]), and he has to send those text files to translators to have them translated into 4 languages.
Then I have to post them on a site. They are very large texts, and I was wondering how I can keep the formatting without having to read and format all of them with BBCode or HTML by hand (they are updated very often).
I see 2 possible answers:
Strip all tags, send the texts to the translator and re-add the style formatting tags afterwards (see strip_tags())
Write a script that converts the texts into an editable file format like .docx or .odt, and reverse the process when the texts come back (there are some PHP libraries that can do that)
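A small sketch of the first option, assuming the styled text is stored as HTML. The sample string is made up, and keeping a whitelist of inline tags is only one possible variation: it avoids having to re-add the tags by hand, but only works if the translators leave them in place.

```php
<?php
// Made-up sample text with block and inline markup.
$styled = '<p>Welcome to <b>our site</b>, <i>enjoy</i> your stay.</p>';

// Drop every tag: the translator sees plain text only.
$plain = strip_tags($styled);

// Variation: keep a few inline tags so the emphasis survives translation.
$light = strip_tags($styled, '<b><i>');

echo $plain, "\n"; // Welcome to our site, enjoy your stay.
echo $light, "\n"; // Welcome to <b>our site</b>, <i>enjoy</i> your stay.
```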

PHP to edit PDF

I have some random PDF that I need to edit. By edit, I mean replacing an image and some text.
All of the PHP PDF libraries that I have seen create a PDF from scratch.
Is there a way to edit a page of the PDF by replacing images and text?
There was another recent discussion on this: PHP PDF template library with PDF output? - There is no ready-made library for that.
While it's technically doable (PDF is actually a fairly simple text-based registry of objects; I looked through the specification once), the internal structure and encoding of text make it awfully difficult to locate and replace text. If you hardcode the object IDs and just append a new revision of an object, for example 25 1 obj, then a simple programmatic update might work. But neither FPDF nor TCPDF can do that AFAIK. (Look into FPDI import, however.) And if you say you have some "random PDF", it's even less likely to work.
Try one of the format conversion methods instead (OpenOffice to PDF). You could probably convert the PDF to an OpenOffice Draw document manually, and convert it back after PHP-based editing. I'm very unsure whether that brings usable results, though.

How do I convert a PDF file to HTML in PHP?

How do I convert a PDF file to HTML in PHP? Is there any lib or web service? I mean free, thanks!
Google pdf2html; pdftohtml looks to be the only viable one, and it's a command-line program, not PHP, so it may not be useful to you. Google is capable of converting PDFs, so there may be a way to do it with GDocs as well, though I'm not sure of that. At any rate, I hope this gets you on the proper path at least.
I've tried Poppler's pdftohtml command to convert PDF files to HTML files. The HTML output of Poppler is lighter, but it is not very accurate.
If you want accurate output, you should use pdf2htmlEX; I've converted complicated PDF files with it and got the best HTML output.
You can't.
PDFs are complex documents containing embedded fonts, vector graphics and layout information that cannot be represented in HTML in an automated way. You may be able to extract the TEXT of the document, but that's about it.

extracting content from pdf using PHP

Could you please tell me how to extract content from a PDF document using PHP? Formatting is the main problem I'm facing here, so let me know if there is a way to extract the content with its formatting intact and display it in an online text editor.
Thanks
Have a look at XPDF
I suppose you could do
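// "-" sends the text to stdout instead of a .txt file; escapeshellarg() guards the path
$text = shell_exec('pdftotext ' . escapeshellarg($pdffile) . ' -');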
As for displaying it in an editor? Well, which editor?
To retain some formatting information, and assuming that by web editor you mean an HTML editor, you can convert the PDF to HTML. Perhaps there are other tools available, but since I use xpdf I came across pdftohtml, a converter that is based on xpdf.
Basic usage
pdftohtml -noframes -c test.pdf test.html
To get it into your favorite editor
echo file_get_contents('test.html');
You may need to wrap things inside PHP functions/classes. And you may want to add security measures and whatnot.
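As a rough sketch of what such a wrapper might look like, assuming pdftohtml is on the PATH and writes the output file named on the command line, as in the basic usage above. The function names buildPdfToHtmlCommand() and convertPdfToHtml() are made up for this example.

```php
<?php
// Hypothetical helper: build the command line with every path safely quoted.
function buildPdfToHtmlCommand(string $pdfPath, string $htmlPath): string
{
    // escapeshellarg() quotes each path so that spaces or shell
    // metacharacters in a file name cannot inject extra commands.
    return 'pdftohtml -noframes -c '
        . escapeshellarg($pdfPath) . ' '
        . escapeshellarg($htmlPath);
}

// Hypothetical helper: run the conversion and return the generated HTML.
function convertPdfToHtml(string $pdfPath, string $htmlPath): string
{
    if (!is_file($pdfPath)) {
        throw new InvalidArgumentException("PDF not found: $pdfPath");
    }
    exec(buildPdfToHtmlCommand($pdfPath, $htmlPath), $output, $exitCode);
    if ($exitCode !== 0 || !is_file($htmlPath)) {
        throw new RuntimeException(
            "pdftohtml failed (exit code $exitCode) or did not create $htmlPath"
        );
    }
    return file_get_contents($htmlPath);
}

// Show the command that would be run for a file name containing a space.
echo buildPdfToHtmlCommand('test.pdf', 'my file.html'), "\n";
```

If the file names come from user uploads, it is also worth generating your own temporary names instead of trusting the submitted ones.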
As far as I can see, it is not possible to convert a PDF to editable HTML on the fly using PHP while preserving formatting. There are a number of desktop apps around that all try to extract data from PDFs, with sometimes more, sometimes less reliable results. I would say this is not realistically possible at the moment, and all you can do is extract plain text using XPDF or other command-line tools.
It may be different with that new XML-Based PDF format but I don't really know anything about that yet.
Feel free to prove me wrong, of course - I'd be very interested myself if there were a solution.
