How to split a PDF file at a given keyword using PHP?

I'd like to parse a PDF file using PHP: specifically, search a large PDF file for certain keywords and split it into multiple PDF files at the points where those keywords occur.
I did some research and found many ways to write PDF files using PHP, but very few to parse and split one.
Are there any websites I could consult or libraries I could use?

Related

Search inside 70 GB of PDF files

I have 70 GB of PDF files, and I want to search inside them with PHP and some Ajax.
The code must search all the PDF files and extract the data into a table.
For example: 1547AD
When I hit Enter, the code should search all the PDF files and return every file that contains "1547AD".
My problem: putting this data into MySQL would of course be easier on the server and more robust, but imagine extracting all the tables from 70 GB of PDF files! The PDF files are also updated daily, and the page gets a lot of traffic.
My question: is PHP the right way to build this, or should I use another language and/or another method for this kind of heavy data?

How to show the content of a doc file containing text and images in a web page using PHP

I have uploaded doc files to my server and now I have to show their content in a web page. I have not found a way to do this, since a doc file may contain images, tables, and other kinds of content.
I tried file_get_contents and then sending headers, but that did not work in all cases.
First, a .doc/.docx file is not a text file, so I don't think file_get_contents will give you normal text characters as output.
You can do one of two things:
Convert the doc/docx to PDF and show it in your web page [Try: OfficeToPdf]
Convert the doc/docx to an HTML-compatible format using a library [Try: PhpWord]
Links
PhpWord: http://phpword.codeplex.com/
OfficeToPdf: http://officetopdf.codeplex.com/
The libraries I mentioned are just examples; there are plenty of free libraries available. Just search for them. Good luck :)
Docx is a zipped file format, so file_get_contents returns compressed data rather than readable text.
Doc is not zipped, but it is a binary format rather than XML, so file_get_contents returns raw binary data mixed with fragments of text.
If you want to convert the files for web viewing, you could use one of the libraries mentioned by @programmer
PhpWord: http://phpword.codeplex.com/
OfficeToPdf: http://officetopdf.codeplex.com/

Word document {tokens} replacement using PHP

I am trying to read a .doc file and find tokens like {name}, {phone}, {address}, etc. I then want to display each token with a text box so the user can enter the real data, and have the tokens in the .doc file replaced with that data. How can I do this using PHP? The colors, fonts, and styles of the .doc should not change.
Thanks.
This will be very tricky if you are using old-style Word documents. The new Word documents (.docx) are saved as a Zip archive and are therefore much easier to edit.
You can extract the files and, with some knowledge of their contents and of WordprocessingML (the XML format Word uses), edit the contents of the document.
It is much easier to use the PHPDocX library. We are using it in a project and it works like a charm. The only disadvantage is that it only works with .docx files.
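The unzip-and-edit approach mentioned above can be sketched roughly as follows. This is only an illustration using PHP's built-in ZipArchive class, not the PHPDocX API; the function names replaceTokens and fillDocxTemplate are made up for this example. It assumes each token appears as one unbroken string in word/document.xml. Word often splits text such as {name} across several runs, in which case a plain string replacement will not find it.

```php
<?php
// Hypothetical helper: replace {token} placeholders in a WordprocessingML string.
function replaceTokens(string $xml, array $values): string
{
    $map = [];
    foreach ($values as $name => $value) {
        // Escape the value so characters such as & or < keep the XML valid.
        $map['{' . $name . '}'] = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }
    return strtr($xml, $map);
}

// Hypothetical helper: copy a .docx template and fill in its tokens.
function fillDocxTemplate(string $templatePath, string $outputPath, array $values): void
{
    if (!copy($templatePath, $outputPath)) {
        throw new RuntimeException("Cannot copy $templatePath");
    }
    $zip = new ZipArchive();
    if ($zip->open($outputPath) !== true) {
        throw new RuntimeException("Cannot open $outputPath as a zip archive");
    }
    // The main body of a .docx lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    if ($xml === false) {
        $zip->close();
        throw new RuntimeException('word/document.xml not found');
    }
    // Adding an entry under an existing name replaces that entry.
    $zip->addFromString('word/document.xml', replaceTokens($xml, $values));
    $zip->close();
}
```

Because only the text inside the existing runs changes, the fonts, colors, and styles defined around the tokens stay as they were. Headers and footers live in separate XML parts and would need the same treatment.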

How to clean up garbage text from string using PHP?

I am trying to parse a Word document file. I upload the file using PHP and then try to get its contents with file_get_contents(), but when the result is displayed on the front end it contains a lot of garbage like this:
Æ�Ѐ¤d�¤d�[$\$gd®l±����„h¤d�¤d�[$\$^„hgd®l±���
&�F�¤d�¤d�[$\$gd3¡���gd3¡����„,¤d�¤d�[$\$^„,gd(E����¤d�¤d�[$\$gdÿ/��<��C��D��I��Å������O��P��‚��¡��¢��¬��­��®��Ù��ã��ó��ô�����
So my question is how can I clean up this text?
Maybe give this a shot? http://www.phpclasses.org/package/3553-PHP-Edit-Microsoft-Word-documents-using-COM-objects.html
Word documents (.docx and .doc) are not plain text files. They use dedicated file formats in which the text does not simply start at byte 0; this is how they store fancy formatting and fonts. .docx files are actually archives (.zip files) that contain a collection of XML and style files.
Your best bet is to use a text input form, or find code online that extracts just the text. Alternatively, download the doc files to your own computer and open them with your own copy of MS Word.
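For .docx files, extracting just the text could look something like the sketch below. It relies on PHP's ZipArchive and DOM extensions; the function names wordXmlToText and extractDocxText are invented for this example, and the code ignores tables, headers, footnotes, and nested text boxes. It does not work for the older binary .doc format.

```php
<?php
// Hypothetical helper: turn the XML from word/document.xml into plain text,
// one line per paragraph.
function wordXmlToText(string $xml): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        throw new InvalidArgumentException('Not well-formed XML');
    }

    $xpath = new DOMXPath($doc);
    // Namespace used by WordprocessingML elements such as w:p and w:t.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $text = '';
        // A paragraph's text is spread over one or more w:t elements.
        foreach ($xpath->query('.//w:t', $paragraph) as $run) {
            $text .= $run->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Hypothetical helper: read the main document part out of a .docx archive.
function extractDocxText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found');
    }
    return wordXmlToText($xml);
}
```

Calling extractDocxText() on an uploaded file and passing the result through htmlspecialchars() before output should give readable text instead of the raw bytes shown in the question.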

PHP to PDF - Use text from XML document to create PDF?

I have some XML files that contain text, which is displayed on my website. I want to extract the text from these XML files and convert it to a PDF document that users can download.
How can I extract this text from the XML documents? (Libraries, etc.?)
How can I use this text to create a PDF document?
I am working in a PHP environment; however, if this is not a suitable language, I could change.
There are many ways to parse an XML file, and many ways to output a PDF file.
I suggest you start with the XML functions within PHP: http://php.net/manual/en/book.xml.php
There are also various classes for writing PDF files; try searching for them.
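As one possible starting point for the extraction step, the sketch below collects every non-empty piece of text from an XML string. It uses PHP's DOM extension rather than the XML Parser functions linked above, and the function name xmlToLines is made up for this example; the real structure of the XML files is unknown, so the code simply takes all text nodes in document order.

```php
<?php
// Hypothetical helper: return every non-blank text node of an XML document,
// trimmed, in document order.
function xmlToLines(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        throw new InvalidArgumentException('Not well-formed XML');
    }

    $xpath = new DOMXPath($doc);
    $lines = [];
    // normalize-space() filters out text nodes that are only whitespace.
    foreach ($xpath->query('//text()[normalize-space()]') as $node) {
        $lines[] = trim($node->nodeValue);
    }
    return $lines;
}
```

The resulting array of lines can then be handed to whichever PDF-writing class you choose (FPDF and TCPDF are two commonly used ones), writing one line or paragraph per array entry.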
