PDF to HTML via Google? - php

I have been trying for a long time to get the IIHF PDFs (example here: http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf) into a parseable form.
I've finally managed it, because Google's cache stores an HTML version of the file (http://webcache.googleusercontent.com/search?q=cache:http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf) that can be parsed easily.
The only problem is that Google doesn't cache every PDF it indexes, and even when it does cache a file, it can take days to appear there.
Is there any way to get those HTML versions via any API or even manually?
Edit: I forgot to mention that these PDFs have corrupted character maps, so normal PDF-to-HTML converters can't convert them.

Related

get data from documents

I want to make a web app that can read the values from a commonly used file type (such as xls or ppt) so that I can convert it into a custom format (as Google Drive does). With an xls (Excel) file, for example, I want to be able to get the value of each cell. I would be fine with getting HTML for a file (such as the HTML that would display a Word document), because values can be extracted from that. I would like to do this on the client side, but I am okay with doing it on the server side with PHP.
Another approach would be to import the file as XML. PHP has great support for XML and could make short work of this. If you can get the files uploaded in OpenDocument Format, you can parse the equivalent of just about any of the types you listed (XLS, PPT, DOC, etc.).
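A minimal sketch of the XML approach, assuming the upload is an OpenDocument spreadsheet whose content.xml has already been extracted. The sample XML and the function name odsCells are made up for illustration; real files also use attributes such as repeated-column counts that this sketch ignores.

```php
<?php
// Hypothetical, trimmed-down stand-in for the content.xml inside an .ods archive.
$contentXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:spreadsheet>
      <table:table table:name="Sheet1">
        <table:table-row>
          <table:table-cell><text:p>Player</text:p></table:table-cell>
          <table:table-cell><text:p>Goals</text:p></table:table-cell>
        </table:table-row>
        <table:table-row>
          <table:table-cell><text:p>Smith</text:p></table:table-cell>
          <table:table-cell><text:p>3</text:p></table:table-cell>
        </table:table-row>
      </table:table>
    </office:spreadsheet>
  </office:body>
</office:document-content>
XML;

// Return the spreadsheet as an array of rows, each row an array of cell strings.
function odsCells(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('table', 'urn:oasis:names:tc:opendocument:xmlns:table:1.0');

    $rows = [];
    foreach ($xpath->query('//table:table-row') as $row) {
        $cells = [];
        // Query relative to the current row so cells stay grouped per row.
        foreach ($xpath->query('table:table-cell', $row) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$rows = odsCells($contentXml);
```

With a real upload, content.xml would first have to be read out of the .ods archive (it is a zip file), for example with PHP's ZipArchive class.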
A pretty easy way to get data out of an Excel sheet online is to use a Google Apps Script. The process would take a lot to explain here, but with a bit of searching you can find the details.
As for a PPT, I can't think of an easy way.
As for documents (i.e. pdf, doc, docx), you can use Google Apps Script as well.
Although, if you're making your own tool for this, you may want to just research how the data is stored in the file and work from there.

Get PDF output from XML generated by a PHP file and translated with an XSLT

I've spent a couple of days looking for a good way to generate PDFs whose layout end users can customize themselves. The PDF output needs to be saved on the server, or sent back to the PHP file so that the PHP file can save it, and the PHP file needs to know whether the generation succeeded.
I thought the best way to do this was to use XML, XSLT and Apache Cocoon. But I'm not sure whether this is possible or a good idea, since I can't find any information about people doing anything similar. It cannot be an uncommon problem.
The idea came when I read about Cocoon converting XML through XSLT to PDF:
http://cocoon.apache.org/2.1/howto/howto-html-pdf-publishing.html
and being able to take in variables:
http://old.nabble.com/how-to-access-post-parameters-from-sitemap-td31478752.html
This is what I had in mind:
1. A PHP file gets called by a user and generates a source XML file with a specific name.
2. The PHP file then makes a request to Cocoon (on the same web server) to apply the user-defined XSLT to the XML file. A parameter will be needed here to tell Cocoon which XSLT to apply.
3. The response is handled by the PHP file and saved as a PDF on the server, so that it can later be mailed.
Will this work at all? Is there a better way to handle this?
The core problem is that the users need to be able to customize the layout on the PDFs themselves, and I need the server to save the PDF and to mail it later on. The users will use it for order confirmations, invoices, etc. And I wouldn't like to hard code the layout for each user.
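The flow described in the question can be sketched roughly as follows. This is only an illustration under assumptions: the function name renderAndSavePdf, the query parameter names source and stylesheet, and the URL in the usage comment are all hypothetical and depend on how the Cocoon sitemap is set up. The fetcher is passed in as a callable so the logic can be tried without a running server.

```php
<?php
/**
 * Ask a rendering service for a PDF and save it, reporting success or failure.
 * $fetch is any callable that takes a URL and returns the response body, or false on error.
 */
function renderAndSavePdf(
    string $baseUrl,
    string $xmlName,
    string $stylesheet,
    string $target,
    callable $fetch
): bool {
    // Parameter names are assumptions; they must match what the pipeline expects.
    $url = $baseUrl . '?' . http_build_query([
        'source'     => $xmlName,
        'stylesheet' => $stylesheet,
    ]);

    $body = $fetch($url);

    // A PDF file starts with "%PDF-"; anything else is treated as an error page.
    if (!is_string($body) || strncmp($body, '%PDF-', 5) !== 0) {
        return false;
    }

    // Report whether the bytes actually reached the disk.
    return file_put_contents($target, $body) !== false;
}

// Usage (hypothetical URL and file names):
// $ok = renderAndSavePdf('http://localhost:8888/render', 'order-1001.xml',
//                        'customer-42.xsl', '/var/pdf/order-1001.pdf', 'file_get_contents');
```

Checking the first bytes of the response is what lets the PHP file know whether the conversion went OK before it stores or mails anything.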
I've had some good results in the past by setting up JasperReports Server and creating reports using iReport Designer. They're both available in F/OSS ("community") editions, though you can pay for support and value-adds if you need those things.
This was a good solution for us, since we could access it via the Java API for our Java system, and via SOAP for our PHP system. The GUI designer made tweaking reports very easy for non-technical business staff too.
I use webkithtml2pdf to generate my PDFs. Just create a document with HTML and CSS for printing as you usually would, then run it through the converter.
It works great for generating things like invoices. You can use SVG for logos and illustrations, and they will look great in print since they are vector based. Even rounded corners with dotted outlines work perfectly.
A minor gotcha is that the input HTML file must have the .htm or .html file name suffix, so you can't use the default temp-file functions as they are.
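One way around the suffix gotcha, as a sketch: create the temporary file with tempnam() and rename it so that it ends in .html. The function names are made up, and the converter binary is passed in as a parameter because the exact tool name and its command-line form (binary, input file, output file) are assumptions here.

```php
<?php
// Write $html to a uniquely named temporary file that ends in ".html".
function makeTempHtml(string $html): string
{
    // tempnam() creates a unique file but cannot add a suffix.
    $base = tempnam(sys_get_temp_dir(), 'pdf');
    if ($base === false) {
        throw new RuntimeException('Could not create temporary file');
    }

    // Renaming the reserved file adds the suffix the converter insists on.
    $path = $base . '.html';
    if (!rename($base, $path)) {
        throw new RuntimeException('Could not rename temporary file');
    }

    file_put_contents($path, $html);
    return $path;
}

// Build a shell command of the assumed form "<binary> <input.html> <output.pdf>".
function buildConvertCommand(string $binary, string $htmlPath, string $pdfPath): string
{
    return escapeshellcmd($binary) . ' '
        . escapeshellarg($htmlPath) . ' '
        . escapeshellarg($pdfPath);
}

// Usage: $cmd = buildConvertCommand($converterBinary, makeTempHtml($html), $pdfPath);
//        exec($cmd, $output, $exitCode);  // remember to unlink() the temp file afterwards
```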

PHP - Workaround for reading user-selected text from PDF?

I am working on a project that allows a user to upload text or content from an HTML page in Japanese and then use their cursor to select words in the text/content to translate into English. However, I would like to be able to expand this functionality to PDF files. Essentially, I'd like the user to be able to submit a PDF file and have the browser render that PDF file in such a way that when the user selects/highlights words in the PDF, the browser can somehow relay what the text of the highlighted section is, such as via javascript, to be then relayed to a PHP variable.
I know there are a lot of posts on stackoverflow asking similar questions (I've spent hours upon hours trying to sort through them all!), but I can't seem to find a definitive answer on whether this is possible. It seems there are lots of options for converting PDF to HTML or extracting text from PDF, but to be quite honest, I'm confused if any of those options are relevant to what I am trying to accomplish. And I know there's a javascript API for Adobe, but I'm under the impression the javascript needs to be embedded in the PDF already, which will not be true if the user is uploading their own PDF files to render. Even if that is possible, it seems there's no native text selection support in the Adobe javascript API....
Is there a straightforward workaround (oxymoron?) to doing this? Again, I want to be able to pass text selected in a PDF to a variable -- the effect is the user highlights words they don't know so those words can be added to a word bank for retrieval in a dictionary.
Let me know if I can be clearer on anything. Thank you!
I think your best bet is to convert the PDF to HTML (see these answers). After that you are all set, since you have already implemented everything for regular HTML.

Insert data into PDF, filled in an online form, on submit action

First of all my apologies to all the people who think this question is a repeated one or they find a similar question to this.
I am working on a project in which I have an online form and some PDFs stored on the server.
Functionality
On the submit action I have to get the data from the form, fill it to the copy of PDF and finally download it.
Approach
I followed these steps to achieve this functionality:
1. Converted the PDFs to HTML with this online tool: http://www.pdfdownload.org/free-pdf-to-html.aspx
2. Embedded the form variables in the HTML and regenerated the PDFs with the dompdf library.
Problem
The approach is a brute-force one, as the generated HTML is far from the original PDFs, so a lot of effort is wasted adjusting the HTML.
The process is slow and unreliable; most of the time I get a memory error or some other issue.
I need to automate this process. From what I have found through searching, I should create an FDF file that contains my variables, pass it to the PDF using some library, and then download the result.
I am able to create the FDF file, but I am missing a PHP library (I found one in Java) that I can use to create the PDF and download it. One tool I found is pdf tool kit, but it is a command-line tool, and I am not able to use it on the server at run time to produce and download the PDF file.
If anybody has done this before, please help.
(Sorry for this long post)
Thanks,
Madhup
Check out FPDI. It allows you to load an existing PDF, draw on it programmatically, and output a new PDF. Which, if I read your question right, is what you're trying to do.
There's some example code here.
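As an alternative to FPDI, the FDF route mentioned in the question can be sketched as below. It assumes the template PDF contains real form fields whose names match the array keys, that the pdftk command-line tool is installed, and that the host allows running it. The function names are hypothetical, and the escaping only covers plain ASCII values; other characters would need extra encoding.

```php
<?php
// Escape a value for use inside a PDF literal string "( ... )".
function fdfEscape(string $value): string
{
    // Backslash and parentheses are the special characters in literal strings.
    return strtr($value, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
}

// Build a minimal FDF document from a field-name => value map.
function buildFdf(array $fields): string
{
    $entries = '';
    foreach ($fields as $name => $value) {
        $entries .= '<< /T (' . fdfEscape((string) $name) . ')'
            . ' /V (' . fdfEscape((string) $value) . ") >>\n";
    }

    return "%FDF-1.2\n1 0 obj\n<< /FDF << /Fields [\n"
        . $entries
        . "] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n";
}

// Build the pdftk call that merges the FDF data into the template and flattens it.
function buildFillCommand(string $templatePdf, string $fdfPath, string $outputPdf): string
{
    return 'pdftk ' . escapeshellarg($templatePdf)
        . ' fill_form ' . escapeshellarg($fdfPath)
        . ' output ' . escapeshellarg($outputPdf)
        . ' flatten';
}

// Usage (hypothetical file names):
// file_put_contents('/tmp/data.fdf', buildFdf($_POST));
// exec(buildFillCommand('form.pdf', '/tmp/data.fdf', '/tmp/filled.pdf'), $out, $exitCode);
```

If shell access is not available on the server, this route does not apply and a pure-PHP library such as FPDI remains the more practical option.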

Spinning PDF's by passing a URL and $_GET parameters?

I have spent a lot of time trying to get dompdf (http://www.digitaljunkies.ca/dompdf/) to work but I keep running into problems. I am trying to generate a PDF from a PHP script which generates a fairly complex, filled out web form. The script accepts a $_GET parameter (record number) and fills out the form accordingly with data from the database. I have no problem getting this data into the script as a string or any type of value really. What I am wondering is what the best approach would be for converting this type of data to a PDF?
The flow is as follows: the user completes the form and is taken to a confirmation page, to which I would like to add a "Save as PDF" button. At this point one of two things could happen: the page currently displayed in the browser could be spun directly to a PDF, or a call to the script itself (scriptname.php?id=xyz) could be made using something like PHP's http_get() function, storing the HTML as a string. From there I am having issues producing an accurate representation as a PDF.
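For the second option, a possible alternative to an HTTP call back to the same script is to capture the page with PHP's output buffering. This is a swapped-in technique, not what the question describes, and renderConfirmation below is a hypothetical stand-in for the real form-rendering code.

```php
<?php
// Run a rendering callable and return everything it echoed as a string.
function renderToString(callable $render, array $params): string
{
    ob_start();
    try {
        $render($params);
        return (string) ob_get_contents();
    } finally {
        // Always discard the buffer, even if the renderer throws.
        ob_end_clean();
    }
}

// Hypothetical stand-in for the script that fills out the form from a record id.
function renderConfirmation(array $params): void
{
    $id = htmlspecialchars((string) ($params['id'] ?? ''), ENT_QUOTES, 'UTF-8');
    echo "<html><body><h1>Record {$id}</h1></body></html>";
}

$html = renderToString('renderConfirmation', ['id' => 'xyz']);
```

The resulting string can then be handed to whichever HTML-to-PDF library is in use, without a second HTTP round trip.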
I have heard some talk about fpdf but their examples don't really lead me to believe you can use dynamic data as the source, but please correct me if I am wrong about this.
Any input would be appreciated.
-- Nicholas
Well, I didn't know dompdf. Strange that it uses either a commercial library (PDFlib) or an outdated (?) one (CPDF, not updated for 3 years). But well, as long as it works (concept is interesting).
I don't understand what you mean by "use dynamic data as the source" (or rather, I see no point in generating a static PDF!), but FPDF is used to generate various dynamic documents, like invoices in e-commerce products. I have seen people use forks (like TCPDF) to handle Unicode data, though.
You can't transform an HTML page into a PDF with FPDF, but you get quite precise control over the layout, using the concept of cells that hold data.
You can see such kind of code there: http://svn.prestashop.com/trunk/classes/PDF.php
In the past when faced with this issue, I have used FPDF for the placement of data on a PDF template. Then, by setting the appropriate HTTP header, force the browser to pop open the Download / Save As box for the user to save said PDF.
In a class that extends FPDF/FPDI appropriately, use something like the following to generate a PDF from a template PDF you've already created (http://www.setasign.de/products/pdf-php-solutions/fpdi/):
$this->setSourceFile('pdf_template.pdf');
$template_page = $this->importPage(1, '/MediaBox');
$this->useTemplate($template_page, 0, 0);
Then, have FPDF generate the PDF for output using:
$this->Output();
You can also extend FPDF to accept (limited) HTML for formatting using the script found here: http://www.fpdf.org/en/script/script41.php
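The "force the browser to open the Save As box" step mentioned above comes down to a few HTTP headers. A minimal sketch, assuming the finished PDF is already available as a string; the function names are made up for illustration.

```php
<?php
// Build the headers that make a browser offer the response as a PDF download.
function pdfDownloadHeaders(string $filename, int $length): array
{
    // Keep only a conservative set of characters in the suggested file name.
    $safe = preg_replace('/[^A-Za-z0-9._-]/', '_', basename($filename));

    return [
        'Content-Type: application/pdf',
        'Content-Disposition: attachment; filename="' . $safe . '"',
        'Content-Length: ' . $length,
    ];
}

// Send the headers followed by the PDF bytes; must run before any other output.
function sendPdfDownload(string $pdfBytes, string $filename): void
{
    foreach (pdfDownloadHeaders($filename, strlen($pdfBytes)) as $headerLine) {
        header($headerLine);
    }
    echo $pdfBytes;
}
```

The "attachment" disposition is what triggers the download dialog; with "inline" most browsers would try to display the PDF instead.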
