Is there any way to extract hyperlinks from .doc. I got bunch of hyperlinks in doc that I need to import in my database.
I have tried converting doc to HTML, but hyperlinks are not transferred.
Regardz,
Mladen
We had a similar issue and ended up using a third party component called Aspose.Words.
You can find it here: http://www.aspose.com
It's available for .NET and Java.
You could try importing the file into OpenOffice and see whether hyperlinks are transferred. OpenDocument is just a ZIP file with XML inside, very easy to parse once you've got the hang of it.
I have done the following thing. I have opened the .doc file with officeXP, then published it as a blog and after that I have saved that blog in the form of filtered web page. That gives you nice HTML which you can parse with ease.
I realise this is some months after your initial question, however, You can also extract hyperlinks in a .doc file through through Word Automation. There are hyperlink objects in the API that you can easily extract.
Related
I have uploaded the doc files in my server and now i have to show the content of doc files in web page. Now i am not getting any way to show that content as doc may contain images, tables, and other type of texts.
I tried to work with file_get_contents and then using headers but not possible in all ways.
Firstly, .doc/.docx file is not a text file. I don't think with file_get_contents you shall get any normal text character as output.
You can do two things:
Convert the doc/docx to pdf and show in your web page [Try: OfficeToPdf]
Convert the doc/docx to html compatible format using library [Try: PhpWord]
Links
PhpWord: http://phpword.codeplex.com/
OfficeToPdf: http://officetopdf.codeplex.com/
The libraries I mentioned are for example, there are a plenty of libraries available for free. Just Google to get them. Good luck :)
Docx is a zipped file format, so you can't get any information with file_get_contents.
Doc is not zipped, so you will directly get all information in XML format when opening with file_get_contents.
If you want to convert the files for a webviewing, you could use one of the libraries mentioned by #programmer
PhpWord: http://phpword.codeplex.com/
OfficeToPdf: http://officetopdf.codeplex.com/
I want to make a web app that can get the values from a commonly used file type (such as xsl or ppt) to allow me to convert it into a custom format (like Google Drive). With an xsl (excel document) file, for example, I want to be able to get the value for each cell. I would be fine getting html for a file (like getting the html code that would display a word document) because values can be extracted out of that. I would like to be able to do it on the client side, but I am okay with using it on the server side with PHP.
Another approach would be to import the file as XML. PHP has great support for XML and could make short work of this. If you can get the files uploaded as Open Doc Format you can parse just about any of the types you listed (XLS, PPT, DOC, etc).
A pretty easy way to get data out of an excel sheet online is to use a Google Apps Script. The process would be a lot to explain here, but with a bit of google searching, you can find all your answers.
As for a PPT, I can't think of an easy way.
As for documents (i.e. pdf, doc, docx), you can use Google Apps Script as well.
Although, if you're making your own tool for this, you may want to just research how the data is stored in the file and work from there.
I am trying to read a .doc file and find tokens like {name}, {phone}, {address} etc. now display tokens with text box and allow user to replace by inserting original data. so that .doc file will replace with actual data.how to do this using php? the color, fonts, and style of .doc should not be changed.
thanks....
This will be very tricky if you are using the old style Word documents. The new Word documents are saved in a some sort of Zip archive and therefore are much easier to edit.
You can extract this files and with some knowledge of the contents and Word WSDL you can edit the contents of the file.
Much easier is to make use of the PHPDocX Library. We are using it in a project and works like a charm. Only disadvantage is that it only works with .docx files.
I need an option from within PHP to Manipulate .docx (Microsoft Office 2007) document.
I need to:
Read the internal text
Convert to .html
To view them inside a browser.
To replace text.
I know I can use Word Automation, creating a COM object of Microsoft Word, but it's too slow, unstable and I have to have it installed on the server.
Is there any library or code that can do it from PHP?
There is PHPWord for that by the authors of PHPExcel.
Docx is just a ZIP file containing multiple XML files and embedded media files like images. Because of this, you can read and edit the document with ease. Just unzip it, open word/document.xml, do reading & writing, and repack the files.
Convet to HTML may be difficult. But you'll find a thumbnail of the first page in docProps/thumbnail.jpeg.
Note that you'll have to familiarize yourself with the XML structure to do any complex edits. There's a summary XML docProps/app.xml which has some metadata for the file so don't forget to update it. Read more from Wikipedia: http://en.wikipedia.org/wiki/Office_Open_XML
You may have a look at PHPDocX I believe it does all you are asking for.
You may replace variables in a template or just plain text from a prexisting Word document.
It offers quite a few conversion options.
You can also extract the text.
You can work with the internal format directly.
DOCX is just a zip file, and inside that there's word/document.xml containing the actual document.
It's quite trivial to unzip the file, read document.xml, str_replace() what you're looking for, save it and re-zip the directory, and it makes for a lightweight, quick and easy mail merge capability for word documents. This also works for other office formats.
Here's the official docs on the internal structure for more information.
There is also a PHP class for merging new content into an existing .docx file. It is available here: http://www.tinybutstrong.com/ . The documentation is pretty good as well as having many examples and it is all free and open source. It does require familiarity with the .docx concepts, though.
I am looking for some way to code a function (I'm open to any language or library at this point) to take an already existing PDF file as input and return a modified PDF file that links certain words to different URLs. I know PHP and ColdFusion both have good tools for dealing with PDF's, but I haven't been able to find anything that works.
I've been doing this by going through Acrobat and linking the text by hand and was wondering if there was any way to automate the procedure.
Thanks!
With ColdFusion you can extract the text with DDX (see Extracting text from a PDF document on the page), modify it using search/replace and generate new document.
If I understand what you're trying to do, you should be able to use CFPDF (http://livedocs.adobe.com/coldfusion/8/htmldocs/Tags_p-q_02.html#2922772) to read the pdf file into a ColdFusion variable, replace whatever content you want in that variable, then save the content back to pdf.