I have been pondering writing this question for quite some time.
I work for a small-sized news corporation in Vietnam.
The server I run for documents is on the latest version of Ubuntu (with PHP and Apache), which means that, as far as I know, formats such as .doc and .docx cannot be opened natively.
However, when reporters upload documents, half the time they use some sort of Microsoft format. My Linux machine then cannot open them and pick out keywords, which is extremely frustrating, because tools like pdf2txt.py do not work on these files.
Is there a way to get around this problem without inconveniencing the reporters too much? I understand that since I am running a Linux server, I may have to run some sort of third-party application to do the work for me. That could work in the short run, but it could pose some security risks.
Summary: How can I have a Linux server automatically convert any format such as .doc and .docx to PDF for further manipulation?
For old-school .doc files, take a look at catdoc and wv.
For an all-around solution, there is unoconv, which can convert anything that OpenOffice can open to anything that OpenOffice can save.
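As a rough sketch of how unoconv might be wired into an upload pipeline, the snippet below builds and runs `unoconv -f pdf` through Python's subprocess module. The helper names (`build_unoconv_command`, `convert_to_pdf`), the timeout, and the output directory are illustrative assumptions, and it presumes unoconv and an OpenOffice or LibreOffice install are present on the server.

```python
import subprocess
from pathlib import Path


def build_unoconv_command(source, output_dir):
    """Return the argument list for converting `source` to PDF with unoconv."""
    # -f selects the output format, -o the output location.
    return ["unoconv", "-f", "pdf", "-o", str(output_dir), str(source)]


def convert_to_pdf(source, output_dir, timeout=120):
    """Run unoconv and return the path where the PDF is expected to appear."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # An argument list (no shell) keeps uploaded file names from being
    # interpreted as shell syntax.
    subprocess.run(
        build_unoconv_command(source, output_dir),
        check=True,
        timeout=timeout,
    )
    return Path(output_dir) / (Path(source).stem + ".pdf")
```

Running the converter under a dedicated low-privilege user and with a timeout is one way to limit the security exposure of opening untrusted uploads.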
Related
I am developing an API in PHP, hosted on a Linux server, that requires me to make JPEG previews of a .pptx PowerPoint presentation.
I first convert the file to PDF and then convert the PDF to JPEGs.
The second step is easy with ghostscript; it's the first part that's proving difficult.
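For reference, the PDF-to-JPEG step could look roughly like the sketch below, which calls the `gs` executable with its `jpeg` output device. The function names, the 150 dpi default, and the `page-%03d.jpg` output pattern are assumptions made for illustration, not details from the question.

```python
import subprocess


def build_gs_jpeg_command(pdf_path, output_pattern="page-%03d.jpg", dpi=150):
    """Return the ghostscript argument list that renders one JPEG per page."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # non-interactive, restricted file access
        "-sDEVICE=jpeg",                      # render pages as JPEG images
        f"-r{dpi}",                           # resolution in dots per inch
        f"-sOutputFile={output_pattern}",     # %03d is replaced by the page number
        str(pdf_path),
    ]


def pdf_to_jpegs(pdf_path, output_pattern="page-%03d.jpg", dpi=150):
    """Run ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_gs_jpeg_command(pdf_path, output_pattern, dpi), check=True)
```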
I have tried using the libreoffice executable, but pptx isn't completely compatible. Certain backgrounds become invisible.
I have the same problem with many third-party APIs (which I suspect also use LibreOffice); the ones that do work are ridiculously expensive.
Installing office on a Linux server and using COM functions seems impossible, or very tedious at best.
I have looked at Aspose.Slides, which also seems rather expensive, and their documentation is filled with errors.
I could use suggestions on how to tackle this problem.
I have tried to find the underlying problem of why LibreOffice and online conversion tools have a problem with the backgrounds of the presentations I need to convert.
The background is a .emf file, which has bad support.
My solution
I've unzipped the presentation, converted the .emf files to PNG (using ghostscript), changed all mentions of .emf to .png in the XML, and rezipped the altered presentation.
When I now use LibreOffice in headless mode to convert to PDF, the background shows up.
It might be a bit hacky, but it works for the intent of my program.
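The unzip, rename, and rezip part of this workaround might be sketched as follows. The function name and the `emf_to_png` callback are hypothetical: the callback stands in for whatever tool does the actual image conversion, which this sketch does not attempt. It also only rewrites literal `.emf` mentions, so if the package's `[Content_Types].xml` has no entry for the png extension, one may need to be added separately.

```python
import zipfile


def swap_emf_for_png(src_pptx, dst_pptx, emf_to_png):
    """Copy a .pptx, replacing .emf parts with PNG data and fixing references.

    `emf_to_png` is a caller-supplied function (bytes -> bytes); the image
    conversion itself is outside the scope of this sketch.
    """
    with zipfile.ZipFile(src_pptx) as src, \
            zipfile.ZipFile(dst_pptx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            name = item.filename
            data = src.read(name)
            if name.lower().endswith(".emf"):
                # Replace the image part itself and give it a .png name.
                data = emf_to_png(data)
                name = name[:-4] + ".png"
            elif name.endswith((".xml", ".rels")):
                # Point every XML reference at the renamed image.
                data = data.replace(b".emf", b".png")
            dst.writestr(name, data)
```

The rewritten file can then be passed to the headless LibreOffice conversion as before.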
ps. I see that my question has gathered a few downvotes. In my opinion it was a valid question, and listed the various solutions that had worked for others, but not for me. If anyone has insights or ways to improve it, feel free to comment.
I'm doing a job for a client where I'm supposed to make a web application that can convert office documents and pdfs into a series of images, one per page/sheet. It's really easy with pdfs but much harder with the office files. So a solution that will convert the office files into pdfs is also good, I'll do the conversion to images later.
The environment will most likely be a Windows server with the Office products already installed (weird, I know).
Ideas I tried so far:
Using office COM objects - I ran into multiple problems and it seems inefficient and unreliable.
OpenOffice and unoconv - ran into problems using them on Windows. I may try Linux, but my client says OpenOffice didn't work for them, and I think I read somewhere that this solution is not recommended.
Aspose - tried it before realizing my client won't pay for it. It seems like the ideal solution but it's too expensive, so I need a solution without paid software.
Other options I thought of and didn't try:
Office interop dll - sounds ideal but people talk about performance issues and memory leaks.
Power Tools for Open XML (https://powertools.codeplex.com/) or DocX (http://docx.codeplex.com/) - libraries that sound useful but if I understand correctly they'll only work for docx files.
Any suggestions?
Are you aware that Adobe offers this as a service?
Is there any way to display Word documents, Excel sheets, and PowerPoint presentations in the browser without downloading them?
I assume that you are going to use PHP for this, so you can try checking some libraries, such as PHPWord.
If you only wish to display the document content, that is possible with a scripting language such as PHP. Office 2007+ formats are basically zipped XML documents with a changed extension. Make a simple Word 2007+ document, save it, and change the extension from .docx to .zip; then you can extract it and see what it's made of. You can find a lot of details here. Displaying the content may be a little tricky, though. As mentioned, there are libraries out there to handle this, but I am not really sure how well they handle the documents. Most of them are abandoned; PHPWord has been in beta since 2011.
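To illustrate the "zipped XML" point, here is a minimal sketch that pulls the plain paragraph text out of a .docx. It is written in Python for brevity (PHP's ZipArchive could be used the same way), the function name is an assumption, and it ignores tables, headers, footnotes, and all formatting.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each paragraph in the main document part."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

The returned strings could then be escaped and wrapped in paragraph tags for display, which is enough for simple documents but loses layout.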
There are some indications that Apache is working on a cloud version of OpenOffice, but there is no release date yet. Once it is done, you will have a full-featured office suite web app.
If you feel really creative, you could use a cron job (or a scheduled task, if you like Windows) to open a document, take a screenshot, and basically make a .jpg or .png version of the document, which can be displayed in a browser without much complication. This works fine with short documents; longer ones may be problematic. It is also possible to schedule an export to .pdf, since all browsers have Adobe PDF plugins.
To sum up, using PHP for parsing simple documents should be fine, but getting complex documents to display properly may be a much more difficult task and possibly not worth your time. I would go for a cron export to PDF, to preserve most if not all of the document's structure.
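A scheduled export of this kind could be sketched as below: a script, run from cron, that finds office files whose PDF is missing or stale and converts them with LibreOffice in headless mode. The function names and the list of extensions are assumptions for illustration, and the script presumes the `soffice` executable is on the PATH.

```python
import subprocess
from pathlib import Path

# Extensions treated as office documents in this sketch.
OFFICE_SUFFIXES = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def pending_conversions(folder):
    """Return office files whose PDF is missing or older than the source."""
    pending = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in OFFICE_SUFFIXES:
            continue
        pdf = path.with_suffix(".pdf")
        if not pdf.exists() or pdf.stat().st_mtime < path.stat().st_mtime:
            pending.append(path)
    return pending


def export_folder(folder):
    """Convert every pending file to PDF next to its source."""
    for path in pending_conversions(folder):
        subprocess.run(
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", str(folder), str(path)],
            check=True,
        )
```

Skipping files that already have an up-to-date PDF keeps each scheduled run short.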
Does anybody know a purely PHP based way to alter the frequency of an MP3 file?
I am on shared hosting with this, so installing ffmpeg or something similar is out of the question.
If this requires actually altering the audio data, then I guess it is not possible nor feasible to do with PHP, but I was thinking maybe this is just a header setting. I don't know.
Background:
A client's website is utilizing a Flash based MP3 player to play some audio.
The client is producing the audio herself.
The trouble is that the tools she produces it with, and is familiar with, automatically produce MP3 files with a sampling frequency of 48000 Hz, while some versions of Flash have trouble playing anything with a frequency other than 44100 Hz. (See my related question here).
I would like to avoid adding yet another program to the already complex audio production process, and solve this on the web server end if possible.
I was thinking maybe this is just a header setting.
No. That is, you can probably change it in the header, if you don't mind your MP3s being played too slow or too fast with a shifted pitch.
If you want it to sound the same, you will need to re-encode. Decoding to WAV (or raw samples), resampling, then re-encoding is a possibility, and probably your only one.
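To make the difference concrete: if only the header were changed, 48000 Hz data played at 44100 Hz would run at 44100 / 48000 ≈ 0.92 of its original speed, with the pitch lowered accordingly. The resampling step, on already-decoded samples, might look like the sketch below. It uses plain linear interpolation, which is an assumption chosen for brevity; real resamplers also apply a low-pass filter, and the MP3 decoding and re-encoding steps are not shown.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a list of PCM samples from src_rate to dst_rate.

    Linear interpolation only; no anti-aliasing filter is applied.
    """
    if not samples:
        return []
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

One second of 48000 Hz audio comes out as 44100 samples, so the duration and pitch are preserved when it is played back at 44100 Hz.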
Maybe the way MP3 works allows for a shortcut (like JPEG allowing for lossless rotation), but I am unaware of any such methods.
I've been taken onboard to work on a PHP-based web application. One part of the application generates thumbnail images for MS Office documents on demand, and it uses MS Office + the VeryPDF docprint utility to do this. Because of this one requirement, the system is running on Windows Server 2003 + IIS.
I would prefer to have the system running on a Linux server, rather than MS, as I have far more experience in administering Linux systems than Windows and we have no other in-house technical staff.
Does anyone know a way to handle the document conversion using native Linux software? I would love something PHP native, but am willing to look outside that if necessary.
I have never done anything like this, so I'm just throwing an idea off the top of my head.
Have you thought about utilizing OpenOffice's capabilities to create thumbnail images? I know OpenOffice saves a thumbnail image within each document it creates, so all you need to do is extract the image to display it. (This is demonstrated on the Ubuntu forums.) You could always do something sort of "hackish" where you run a file through OpenOffice and extract the image to display a small thumbnail.
Again, I have no idea how well this will work, but it may be worth a shot.
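The extraction step could be sketched like this, assuming the document has already been saved in an OpenDocument format (.odt, .ods, .odp). Such files are zip archives that normally carry a preview at `Thumbnails/thumbnail.png`; the function name here is illustrative.

```python
import zipfile

# Location of the embedded preview image inside an OpenDocument package.
THUMBNAIL_MEMBER = "Thumbnails/thumbnail.png"


def extract_odf_thumbnail(odf_file):
    """Return the embedded thumbnail PNG bytes, or None if the file has none."""
    with zipfile.ZipFile(odf_file) as archive:
        if THUMBNAIL_MEMBER not in archive.namelist():
            return None
        return archive.read(THUMBNAIL_MEMBER)
```

The embedded image is small and shows only the first page, so it suits thumbnails rather than full previews.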
To anyone else who comes across this: I ended up going with the newer version of jodconverter. The sample code includes a basic web page that can be POSTed to using something like PEAR's HTTP_Request2. A sample class (by yours truly) which uses this is mentioned in the comments in jodconverter's group on Google Code.