At the moment I manage my user guide in Microsoft Word 2003 and convert it to a PDF file that can be downloaded from my website and is also included by the product installer.
I would like to move to a mechanism that achieves the following:
Generates PDF file with clickable TOC and front page
Generates HTML5 compliant output per chapter/section but without HTML skeleton
Generates JSON TOC for user guide (chapter/section outline)
I would like to package the PDF file with the distributed product.
I would like to create some simple PHP scripts that generate HTML pages with a context sensitive TOC (showing sections of current chapter) plus showing the relevant documentation.
I have no issues with developing the PHP scripts to achieve this, but I would like to know how I can generate the above outputs. I would preferably like to type documentation using an off-the-shelf GUI. I am happy to write XSLT2 stylesheets to perform any necessary conversions.
To give people an idea of what I am after:
Current PDF manual: http://rotorz.com/tilesystem/user-guide.pdf
API documentation which is generated using custom XSLT2 stylesheets into a bunch of "incomplete" HTML files, with a JSON TOC which is then brought together by PHP: http://rotorz.com/tilesystem/api
As you navigate through my API documentation you will notice that the TOC on the left is context sensitive. I would like my user guide to work in a similar way.
Is there a free alternative to Prince: http://www.princexml.com/ for paged media CSS?
After spending quite some time reading into lots of variations I have come across a potential solution...
Create a very simple "static" CMS using PHP and http://aloha-editor.org for my WYSIWYG editor. Possibly using https://github.com/chillitom/CefSharp to embed the editor straight into a more relevant GUI.
Convert the HTML5 pages into PDF using "wkhtmltoxdoc" with custom cover, header and footer .html files. It also generates a TOC page automatically.
"wkhtmltoxdoc" also generates an XML TOC which can easily be converted to JSON.
I am still experimenting with "wkhtmltoxdoc" but it seems pretty good! Unless of course there is an even easier solution...
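As a rough sketch of that conversion step, the following Python helper assembles a command line. It assumes the tool accepts the same arguments as the wkhtmltopdf CLI (the `cover` and `toc` objects, `--header-html`, `--footer-html` and `--dump-outline`); the function name `build_pdf_command` and all file names are hypothetical, and the argument order is worth checking against the version you have installed.

```python
def build_pdf_command(pages, output, cover=None, header=None,
                      footer=None, outline_xml=None):
    """Assemble an argument list for a wkhtmltopdf-style converter.

    Assumption: options placed before the first object apply to every page.
    """
    cmd = ["wkhtmltopdf"]
    if outline_xml:
        # Ask the converter to write the document outline (TOC) as XML
        cmd += ["--dump-outline", outline_xml]
    if header:
        cmd += ["--header-html", header]
    if footer:
        cmd += ["--footer-html", footer]
    if cover:
        # Cover page object comes first in the output
        cmd += ["cover", cover]
    # Automatically generated table-of-contents page
    cmd.append("toc")
    # One page object per chapter/section file
    cmd += list(pages)
    cmd.append(output)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting problems with file names that contain spaces.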
ADDED:
It seems that my TOC file will need to be a mixture of manually written and automatically generated. Something along the lines of the Eclipse TOC schema will suffice where a simple XSLT stylesheet can automatically fill in the blanks by grabbing H1-6 tags plus adding unique identifiers for hash links.
This TOC can thus be consumed by XSLT2 stylesheets and then finally converted to JSON for consumption by PHP scripts.
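To illustrate the "fill in the blanks" step, here is a minimal sketch in Python (used only for illustration; the same logic could live in an XSLT stylesheet). It collects H1-H6 headings from a page and gives each one a unique identifier suitable for a hash link. The names `slugify`, `HeadingCollector` and `collect_headings` are hypothetical.

```python
import re
from html.parser import HTMLParser


def slugify(text, used):
    """Turn heading text into a unique, URL-safe identifier."""
    base = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-") or "section"
    slug, n = base, 2
    while slug in used:
        # Disambiguate repeated headings: "foo", "foo-2", "foo-3", ...
        slug = f"{base}-{n}"
        n += 1
    used.add(slug)
    return slug


class HeadingCollector(HTMLParser):
    """Collects (level, text) for every h1..h6 element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def collect_headings(html):
    """Return a list of {level, label, id} dicts in document order."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    used = set()
    return [
        {"level": level, "label": text, "id": slugify(text, used)}
        for level, text in parser.headings
    ]
```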
Mock-up extract for my existing documentation:
<?xml version="1.0" encoding="UTF-8"?>
<toc>
  <topic label="Introduction" href="introduction.html"/>
  <topic label="Getting Started">
    <topic label="Installation" href="getting-started/installation.html"/>
    <topic label="User Interface" href="getting-started/ui/index.html">
      <topic label="Menu Commands" href="getting-started/ui/menu-commands.html"/>
      <topic label="Tile System Panel" href="getting-started/ui/tile-system-panel.html"/>
      <topic label="Brush Designer" href="getting-started/ui/brush-designer.html"/>
    </topic>
    <topic label="User Preferences" href="getting-started/user-preferences.html"/>
  </topic>
  <topic label="Creating a Tile System" href="creating-a-tile-system">
    <!-- ... -->
  </topic>
</toc>
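The final XML-to-JSON step for a TOC of this shape could look roughly like the sketch below. It uses Python's standard library purely for illustration; the JSON key names (`label`, `href`, `topics`) and the function names are assumptions, not part of the Eclipse schema.

```python
import json
import xml.etree.ElementTree as ET


def topic_to_dict(topic):
    """Convert one <topic> element (and its nested topics) to a dict."""
    node = {"label": topic.get("label")}
    if topic.get("href") is not None:
        node["href"] = topic.get("href")
    children = [topic_to_dict(child) for child in topic.findall("topic")]
    if children:
        node["topics"] = children
    return node


def toc_xml_to_json(xml_text):
    """Convert a <toc> document into a JSON array of nested topics."""
    root = ET.fromstring(xml_text)
    return json.dumps([topic_to_dict(t) for t in root.findall("topic")], indent=2)
```

Topics without an `href` (pure grouping nodes such as "Getting Started") simply omit that key, so the consuming PHP script can tell them apart from linkable topics.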
Reference to Eclipse documentation:
http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.platform.doc.isv%2Freference%2Fextension-points%2Forg_eclipse_help_toc.html
After a lot of research and experimentation I have decided to use DITA (Darwin Information Typing Architecture). For me the nicest thing about DITA is that it is topic based which makes the documentation modular and reusable.
The DITA schema is relatively simple and good XML editors provide useful insight into the available elements and attributes.
DITA documents can be combined for a given purpose using DITA maps (.ditamap files). For example one might choose to distribute a "Quick Start Guide" which encompasses a minimal amount of information whilst a full blown "User Guide" will contain far more detail. The beauty is that the same information can be reused for both documents; plus the documents can be output to a number of delivery formats:
XHTML (single file or chunked)
PDF
Docbook
The process of transforming the output into the delivery format is easily handled using the DITA Open Toolkit (aka DITA-OT). This toolkit is available from: http://dita-ot.sourceforge.net which is installed simply by extracting the provided archive. The toolkit can be accessed easily by running startcmd.bat (on Windows) or startcmd.sh (Unix-like systems).
Customising and branding PDF output is not an easy task. Customizing XHTML output is significantly easier but still requires knowledge of XSL transforms. Customisations can be made by creating a plugin and placing it within the plugins folder of DITA-OT. One thing that I would like to emphasise is that once customisations have been made you must invoke ant -f integrator.xml before changes will become apparent. Lack of this knowledge caused me a lot of confusion!
The generated XHTML files are very simple (which is great!) because this makes them easy to customize. Adding the HTML5 DOCTYPE is not so easy though; but for my purposes this really doesn't matter, seeing as my PHP scripts only care about what's inside <body>.
At first I couldn't find any good WYSIWYG editors, but XML Mind seems to be a really good WYSIWYG editor that is also really easy to use. I suspect that it wouldn't be too hard to create a basic web-based solution using something like the Aloha Editor (http://aloha-editor.org).
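A minimal sketch of that idea, in Python for illustration only (the author's scripts are PHP): pull out whatever sits inside the body element and, if a full page is needed, wrap it in an HTML5 skeleton at serve time. The function names are hypothetical, and the regular expression assumes a single, well-formed body element as in generated output.

```python
import html
import re

# Matches the content between <body ...> and </body>, across lines
_BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(document):
    """Return the markup inside <body>, or the input if no body is found."""
    match = _BODY_RE.search(document)
    return match.group(1).strip() if match else document


def wrap_html5(title, body_fragment):
    """Wrap a body fragment in a minimal HTML5 page."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body_fragment}\n</body>\n"
        "</html>\n"
    )
```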
Whilst it seems rather difficult to customise the PDF output, it seems quite easy to generate all documentation into a single XHTML page which can then be formatted using CSS, and then finally converted using wkhtmltopdf. I haven't decided on my solution yet, but at least this is a viable option for those who are unable (or don't have the time to) customise the XSL:FO stylesheets of DITA-OT.
ADDED: After some searching I found that there is another open source alternative to DITA-OT called "Ditac" which seems a lot easier to use and produces far nicer output. The tool is created by the creators of "XML Mind". Whilst the tool is command line based, those who use "XML Mind" can benefit from a feature rich GUI:
http://www.xmlmind.com/ditac/
Note: I left my previous answer because it may be of use to others.
I hope you are doing well.
I need a PHP library that converts a PDF file containing images into an HTML file and meets the following requirements:
The HTML file needs to be HTML 3.2 compatible
The images in the PDF file need to be saved with a .jpg extension
The correct fonts from the PDF need to be used in the HTML file
The result is a single folder that contains both the images and the HTML file
I have tried most of the PHP libraries, but none of them does everything I need.
Please let me know of a library that meets all 4 of the above requirements (image attached for reference).
Waiting for your kind responses.
Thanks
I am not very sure, but here is a PHP library I found.
Here
Try this:
http://www.pdfaid.com/pdf-to-html.aspx
Or this:
http://webdesign.about.com/od/pdf/tp/tools-for-converting-pdf-to-html.htm
Or this...
http://www.pdfconvertonline.com/pdf-to-html-online.html
There are plenty of options available to you; the secret is to use a newfangled thing called a Search Engine, such as a Bing or a Google.
You would also do well to research on Stack Overflow before asking your question:
1) HTML 3.2 was superseded in 1997, very nearly twenty years ago. Why on earth do you still need a comparatively ancient technology when far better alternatives are available, such as XHTML, HTML 4.01 and HTML5?
2) Please read How can I extract embedded fonts from a PDF as valid font files?
3) Also to extract images you can use:
http://www.makeuseof.com/tag/extract-images-pdf-files-save-windows/
but again, there are several options available to you if you care to look for them.
You seem to imply a fundamental misunderstanding about HTML; there are several different ways of getting any desired result with HTML. You have a PDF file and you want it to look a certain way, this look depends on the browser you are looking at it on. For example if you use a PDF to HTML converter as linked above you will very probably find that the output will look different on Internet Explorer 7 versus on Firefox versus Internet Explorer 10. There is no one way of writing output on HTML or with CSS.
If you want a custom built library to do your specific task then you will need to employ a professional to do it, or you will need to code it yourself. This obviously should be charged to the client for requiring a technology that is extremely outdated. You can probably search github for a similar library (the one linked by CK Khan looks like what you're after) and then fork it and make your own variation for your needs. I very much doubt anyone is going to put time into developing a system to output HTML 3.2 from a PDF, and even less likely to develop this system for free and to your exact specifications.
It also appears that you can not directly incorporate font families into the <font> tag in HTML 3.2, only being able to edit size and colour of fonts. You can use CSS1 font-family to show font families. See here.
I've used a couple of days to think of a best practice to generate a PDF, which end users can customize the layout for themselves. The PDF output needs to be saved on the server or sent back to the PHP file so the PHP file can save it, and the PHP file needs to know that it went OK.
I thought the best way to do this was to use XML, XSLT and Apache Cocoon. But I'm not sure if this is possible or if it's a good idea since I can't find any information of people doing anything similar. It cannot be an uncommon problem.
The idea came when I read about Cocoon converting XML through XSLT to PDF:
http://cocoon.apache.org/2.1/howto/howto-html-pdf-publishing.html
and being able to take in variables:
http://old.nabble.com/how-to-access-post-parameters-from-sitemap-td31478752.html
This is what I had in mind:
A php file gets called by a user, the php file generates a source XML file with a specific name
The php file then makes a request to Cocoon (on the same web server) to apply the user defined XSLT on the XML file. A parameter will be needed here to know which XSLT to apply.
The request is handled by the PHP file and then saved as a PDF on the server, and can later be mailed away.
Will this work at all? Is there a better way to handle this?
The core problem is that the users need to be able to customize the layout on the PDFs themselves, and I need the server to save the PDF and to mail it later on. The users will use it for order confirmations, invoices, etc. And I wouldn't like to hard code the layout for each user.
I've had some good results in the past by setting up JasperReports Server and creating reports using iReport Designer. They're both available in F/OSS ("community") editions, though you can pay for support and value-adds if you need those things.
This was a good solution for us, since we could access it via the Java API for our Java system, and via SOAP for our PHP system. The GUI designer made tweaking reports very easy for non-technical business staff too.
I use webkithtml2pdf to generate my PDFs. Just create a document with HTML and CSS for printing like you would usually do, then run it through the converter.
It works great for generating things like invoices. You can use SVG for logos and illustrations, and they will look great in print since they are vector based. Even rounded corners with dotted outlines works perfectly.
A minor gotcha is that the input HTML file must have the .htm or .html file name suffix, so you can't use the default tempfile functions.
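One way around the suffix gotcha is to create the temporary file with an explicit suffix. The sketch below shows the idea in Python, assuming the converter only inspects the file name; the helper name `write_temp_html` is hypothetical, and a PHP script would need an equivalent such as renaming the temporary file.

```python
import os
import tempfile


def write_temp_html(markup):
    """Write markup to a temporary file whose name ends in .html."""
    # mkstemp lets us choose the suffix while still creating the file safely
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(markup)
    return path  # caller is responsible for deleting the file afterwards
```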
I'm considering creating all the reports of a series of desktop business apps directly to html. Most of the reports are tables (maybe compound reports), headers, footers, etc. (no images, vector graphics, etc.).
After a search on SO, I've read lots of posts regarding problems with page breaks and things like that (I don't need pixel positioning at all, but I do need control over page breaks).
For example, let's say I have a big table with currency values and I need the last row of the table on each page to display the running totals at that point. Is that feasible to do easily, or will I run into lots of trouble?
What technologies can help me here?
HTML5
Javascript
CSS
PHP libraries
JQuery
Some notes:
The HTML will be displayed with an embedded Chrome or Firefox engine, so the differences between browsers are not a problem for me.
I can have the PHP preprocessor embedded if that helps to generate the reports more easily; I'm just looking for the best technology at hand to do the job well.
I'm tired of report generators with "WYSIWYG" designers (Crystal Report, FastReport, ReportBuilder, etc.)
Thanks!
We made the exact move you're thinking about almost a year ago and haven't looked back. Most communication with our client is over the web, so it's been a perfect fit. They can view html outputs easily on our website, and can generate pdf's of the page (server side) whenever necessary. The program we use for pdf conversion is a free, easy-to-use, open-source project called wkhtmltopdf.
Where we are is great, but getting here was difficult.
Deciding which pdf engine to use was a long, painful process. The short of it is that HTML is for viewing pages on the internet, not for viewing pages on paper. Page-breaks will be the bane of your existence in this game -- you literally have to measure each page and create your own clean-looking breaks for every single report (otherwise, all html-to-pdf converters out there will just keep rendering the document onto the next page as if it encountered no page-break at all). Further complicating the matter is that every html-to-pdf engine out there handles this sh*t differently and you'll have to write a tailored solution to test each one to see if it meets your individual needs.
Now, the good news:
You can save yourself a lot of trouble by heeding my advice and going with wkhtmltopdf for your finalized reporting outputs. This little program is simply amazing -- it uses a webkit engine, renders CSS/javascript accurately, has header/footer control, optionally creates a table-of-contents page, and (most importantly) consistently produces excellent looking pdf's without having to customize your code base. It also has a variety of great command line switches, and it is very, very fast. I say again: it is very, very fast.
Best of all, it's a command line tool that can be used in batch processing. And did I mention that it's really, really fast?
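For the running-totals case raised in the question, one way to "create your own breaks" is to split the rows into pages yourself and append a totals row to each page before conversion. The Python sketch below assumes a fixed number of rows fits on a page (the measuring step described above) and that the converter honours the CSS `page-break-after` property; the function names and the `running-total` class are hypothetical.

```python
from decimal import Decimal
from html import escape


def paginate_with_running_totals(rows, rows_per_page):
    """Split (label, amount) rows into pages, each carrying the running total."""
    pages, total = [], Decimal("0")
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        total += sum((amount for _, amount in chunk), Decimal("0"))
        pages.append({"rows": chunk, "running_total": total})
    return pages


def render_pages(pages):
    """Render one table per page, forcing a page break after each table."""
    parts = []
    for page in pages:
        body = "".join(
            f"<tr><td>{escape(label)}</td><td>{amount}</td></tr>"
            for label, amount in page["rows"]
        )
        parts.append(
            '<table style="page-break-after: always">'
            + body
            + '<tr class="running-total"><td>Running total</td>'
            + f'<td>{page["running_total"]}</td></tr></table>'
        )
    return "\n".join(parts)
```

Using `Decimal` rather than floats keeps currency sums exact, which matters once totals are carried across many pages.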
Browser support for printing is generally terrible. However, there are other tools, notably Prince (which is not free) and Flying Saucer (which is free) that can generate PDF output from XML/HTML plus CSS. Prince even supports JavaScript though I don't have any experience with it.
I've got a Java back end in my current application, so for me Flying Saucer works fine for simple reports. I pre-process an HTML template with FreeMarker and then run the result through Flying Saucer. It's got a surprisingly smart rendering engine.
The CSS3 Paged Media spec (well, proposed spec) has all sorts of cool stuff in it but they're almost totally unimplemented in the browsers. Even the CSS2 paged media stuff is only supported half-heartedly.
Speaking of Prince, you might look into DocRaptor. DocRaptor is another HTML to PDF conversion application. It uses Prince XML, and handles CSS better than comparable programs.
It isn't free, but offers a free 30 day trial for all accounts, so there's no harm in trying it out, at least.
DocRaptor
In my project I have to build a PDF viewer in HTML5/CSS3, and the application has to allow users to add comments and annotations. In fact, I have to build something very similar to crocodoc.com.
At first I was thinking of creating images from the PDF and letting users create areas and post comments associated with those areas. Unfortunately, the client also wants to navigate within the PDF and add comments only on allowed sections (for example, paragraphs or selected text).
So now my problem is how to get the text, and what the best way to do that is. If anybody has any clues about how I can achieve this, I would appreciate it.
I tried pdftohtml, but the output doesn't look like the original document, which is really complex (example of document). Even this one doesn't really reflect the output, but is much better than pdftohtml.
I'm open to any solution, with a preference for command-line tools under Linux.
I've been down the same road as you, with even much more complex tasks.
After trying out everything I ended up using C# under Mono (so it runs on linux) with iTextSharp.
Even with a very complete library such as iTextSharp, some tasks required a lot of trial-and-error :)
Extracting the text from a page is easy (check the snippet below); however, if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.
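using iTextSharp.text.pdf; // PdfReader and PRTokeniser live in this namespace

int pdf_page = 5;
string page_text = "";
PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
// Tokenise the raw content stream of the chosen page
PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
while (token.NextToken())
{
    if (token.TokenType == PRTokeniser.TokType.STRING)
    {
        // String operands carry the visible text
        page_text += token.StringValue;
    }
    else if (token.StringValue == "Tj")
    {
        // "Tj" is the show-text operator; treat it as a word separator
        page_text += " ";
    }
}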
Do a Console.WriteLine(token.StringValue) on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.
Addition:
Given the task you are required to do, I have a suggestion for you:
Extract the text with coordinates, font families and sizes - all the information about each paragraph. Then do a PDF-to-images conversion, and in your online viewer, apply invisible selectable text over the paragraphs on the image where needed.
This way your users can select a part of the text where needed, without the need of reconstructing the whole PDF in html :)
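The overlay idea can be sketched as follows, in Python for illustration. It assumes each extracted text run comes with PDF coordinates in points (origin at the bottom-left, `y` being the baseline) and a font size; the dictionary keys, the function name `overlay_html` and the baseline-to-top approximation are all assumptions, not output of any particular library.

```python
from html import escape


def overlay_html(image_url, page_width_pt, page_height_pt, runs, scale=1.0):
    """Build a page image with transparent, selectable text positioned on top.

    runs: dicts with "text", "x", "y" (PDF points, origin bottom-left), "size".
    scale: CSS pixels per PDF point used when the page image was rendered.
    """
    spans = []
    for run in runs:
        left = run["x"] * scale
        # Flip the y axis; subtracting the font size approximates the text top
        top = (page_height_pt - run["y"] - run["size"]) * scale
        spans.append(
            f'<span style="position:absolute;left:{left:.1f}px;top:{top:.1f}px;'
            f'font-size:{run["size"] * scale:.1f}px;color:transparent">'
            f'{escape(run["text"])}</span>'
        )
    width = page_width_pt * scale
    height = page_height_pt * scale
    return (
        f'<div style="position:relative;width:{width:.1f}px;height:{height:.1f}px">'
        f'<img src="{escape(image_url)}" style="width:100%;height:100%" alt="">'
        + "".join(spans)
        + "</div>"
    )
```

Because the text is transparent rather than hidden, the browser still lets users select it, and the selection can then be tied to a comment.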
I recently researched and discovered a native PHP solution to achieve this using FOSS. The FPDI PHP class can be used to import a PDF document for use with either the TCPDF or FPDF PHP classes, both of which provide functionality for creating, reading, updating and writing PDF documents. Personally, I prefer TCPDF as it provides a larger feature set (TCPDF vs. FPDF), a richer API (TCPDF vs. FPDF), more usage examples (TCPDF vs. FPDF) and a more active community forum (TCPDF vs. FPDF).
Choose one of the aforementioned classes, or another, to programmatically handle PDF documents. Focusing on both current and possible future deliverables, as well as the desired user experience, decide where (e.g. server - PHP, client - JavaScript, both) and to what extent (feature driven) your interactive logic should be implemented.
Personally, I would use a TCPDF instance obtained by importing a PDF document via FPDI to iteratively inspect, translate to a common format (XML, JSON, etc.) and store the resulting representation in relational tables designed to persist data pertinent to the desired level of document hierarchy and detail. The necessary level of detail is often dictated by a specifications document and its mention of both current and possible future deliverables.
Note: In this case, I strongly advise translating documents and storing them in a common format to create a layer of abstraction and transparency. For example, a possible and unforeseen future deliverable might be to provide the same application functionality for users uploading Microsoft Word documents. If the uploaded Microsoft Word document was not translated and stored in a common format then updates to the Web service API and dependent business logic would almost certainly be necessary. This ultimately results in storing bloated, sub-optimal data and inefficient use of development resources in designing, developing and supporting multiple translators. It would also be an inefficient use of server resources to translate outbound data for every request, as opposed to translating inbound data to an optimal format only once.
I would then extend the base document tables by designing and relating additional tables for persisting functionality specific document asset data such as:
Versioned Additions / Edits / Deletions
  What
    Header / Footer
    Text
      Original Value
      New Value
    Image
      Page(s) (one, many or all)
      Location (relative - textual anchor, absolute - x/y coordinates)
      File (relative or absolute directory or url)
    Brush (drawing)
      Page(s) (one, many or all)
      Location (relative - textual anchor, absolute - x/y coordinates)
      Shape (x/y coordinates to redraw line, square, circle, user defined, etc.)
      Type (pen, pencil, marker, etc.)
      Weight (1px, 3px, 5px, etc.)
      Color
    Annotation
      Page
      Location (relative - textual anchor, absolute - x/y coordinates)
      Shape (line, square, circle, user defined, etc.)
      Value (annotation text)
    Comment
      Target (page, another text/image/brush/annotation asset, parent comment - threading)
      Value (comment text)
  When
    Date
    Time
  Who
    User
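To make the asset model a little more concrete, here is a small sketch of two of the asset types and of comment threading, written as Python dataclasses rather than relational tables. Every class and field name is an assumption chosen to mirror the outline above; a real schema would add the remaining asset types and versioning columns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Location:
    page: int
    anchor_text: Optional[str] = None  # relative: textual anchor
    x: Optional[float] = None          # absolute: x/y coordinates
    y: Optional[float] = None


@dataclass
class Annotation:
    id: int
    location: Location
    shape: str        # line, square, circle, user defined, ...
    value: str        # annotation text
    user: str         # who
    created: datetime  # when


@dataclass
class Comment:
    id: int
    value: str        # comment text
    user: str
    created: datetime
    target_type: str  # "page", "annotation", "comment", ...
    target_id: int


def thread(comments, root_type, root_id):
    """Return the comments on a target, each with its nested replies.

    Assumes no comment targets itself (directly or indirectly).
    """
    return [
        {"comment": c, "replies": thread(comments, "comment", c.id)}
        for c in comments
        if c.target_type == root_type and c.target_id == root_id
    ]
```

Pointing a comment at a generic target (type plus id) is what allows comments on pages, on other assets and on parent comments to share one table.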
Once some, all or more, of the document and its asset data has a place to persist I would design, document and develop a PHP Web service API to expose CRUD and PDF document upload functionality to the UI consumer, while enforcing core business rules. At this point, the remaining work now lies on the Client-side. Currently, I have relational tables persisting both a document and its asset data, as well as an API exposing sufficient functionality to the consumer, in this case the Client-side JavaScript.
I can now design and develop a Client-side application using the latest Web technologies such as HTML5, JavaScript and CSS3. I can upload and request PDF documents using the Web service API and easily render the returned common format out to the browser however I decide (probably HTML in this case). I can then use 100% native JavaScript and/or 3rd party libraries for DOM helper functionality, creating vector graphics to provide drawing and annotation features, as well as access and control functional and stylistic attributes of currently selected document text and/or images. I can provide a real-time collaborative experience by employing WebSockets (before mentioned WebService API does not apply), or a semi-delayed, but still fairly seamless experience using XMLHttpRequest.
From this point forward the sky is the limit and the ball is in your court!
It's a hard task you're trying to accomplish.
To read text from a PDF, have a look at PEAR's PDF_Reader proposal code.
There's also very extensive documentation around Zend_PDF(), which also allows the loading and parsing of a PDF document. The various elements of the PDF can be iterated over and thus transformed to HTML5 or whatever you like. You may even embed the annotations from your website into the PDFs and vice versa.
Still, you have been given no easy task. Good Luck.
pdftk is a very good tool for doing things like that (I don't know if it can do exactly this task).
http://www.pdflabs.com/docs/pdftk-cli-examples/
Is there a Views plugin that I can use to generate an XML file? I would like something that lets me choose which fields appear in the XML and how they appear (as a tag or as an attribute of the parent tag).
For example: I have a content type Picture that has three fields: title, size and dimensions. I would like to create a view that could generate something like this:
<pictures>
  <picture size="1000" dimensions="10x10">
    <title>
      title
    </title>
  </picture>
  <picture size="1000" dimensions="10x10">
    <title>
      title
    </title>
  </picture>
  ...
</pictures>
If nothing is already implemented, what should I implement? I thought about implementing a display plugin, a style plugin, a row plugin and a field handler. Am I wrong?
I would rather not do it with templates, because I can't think of a way to make it reusable with templates.
A custom style plugin is definitely capable of doing this; I whipped one up to output Atom feeds instead of RSS. You might find a bit of luck starting with the Views Bonus Pack or Views Datasource. Both attempt to provide XML and other output formats for Views data, though the latter was a Google Summer of Code project and hasn't been updated recently. Definitely a potential starting point, though.
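Whatever form the plugin takes, the core mapping it has to perform is small. The sketch below shows that logic in Python purely as a language-neutral illustration (a real Views style plugin would be written in PHP against the Views API, which is not shown here); the function name `rows_to_xml` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag, row_tag, attribute_fields, element_fields):
    """Serialise rows, emitting each field as an attribute or a child element."""
    root = ET.Element(root_tag)
    for row in rows:
        # Fields configured as attributes go on the row element itself
        node = ET.SubElement(
            root, row_tag, {name: str(row[name]) for name in attribute_fields}
        )
        # Fields configured as tags become child elements
        for name in element_fields:
            ET.SubElement(node, name).text = str(row[name])
    return ET.tostring(root, encoding="unicode")
```

For the Picture content type from the question, `attribute_fields` would be `["size", "dimensions"]` and `element_fields` would be `["title"]`.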
You might want to look at implementing another theme for XML or using the Services module. Some details about it (from its project page):
A standardized solution for building API's so that external clients can communicate with Drupal. Out of the box it aims to support anything Drupal Core supports and provides a code level API for other modules to expose their features and functionality. It provide Drupal plugins that allow others to create their own authentication mechanisms, request formats, and response formats.
Also see:
http://cmsproducer.com/generate-how-to-drupal-node-XML-XHTML
In Drupal 8 the Services module is now part of core (RESTful Web Services). This will allow you to provide any entity in xml or json. Also with views.
Read more here: https://drupalize.me/blog/201401/introduction-restful-web-services-drupal-8
There is a somewhat old description of this process on the Drupal forums. It references Drupal 4.7 and 5.x. I suspect the steps for Drupal 6 would use the same technique as for 5.x, if not the same code.
If you use Drupal 7 or a later version, you can use the Views Data Export module to export as XML, XLS, etc.
https://www.drupal.org/project/views_data_export