I am trying to replace strings in a Word document by reading the file into a variable $content and then using str_ireplace() to change the string. I can read the content from the file, but str_ireplace() does not seem to be able to replace the string. I assumed it would, because the function is 'binary safe' according to the PHP documentation. Sorry, I am a beginner with PHP file manipulation, so all this is quite new to me.
This is what I have written.
copy('jack.doc' , 'newFile.doc');
$handle = fopen('newFile.doc','rb');
$content = '';
while (!feof($handle))
{
$content .= fread($handle, 1);
}
fclose($handle);
$handle = fopen('newFile.doc','wb');
$content = str_ireplace('USING_ICT_BOX', 'YOUR ICT CONTENT', $content);
fwrite($handle, $content);
fclose($handle);
When I download the new file, it opens as it should in MS Word but it shows the old string and not the one that should be replaced.
Can I fix this issue? Is there any better tool I can use for replacing strings in MS Word through PHP?
I had the same requirement, editing a .doc or .docx file using PHP, and I found a solution for it.
I have written a post about it: http://www.onlinecode.org/update-docx-file-using-php/
// This approach works only for .docx files, which are ZIP archives;
// a binary .doc file cannot be opened with ZipArchive.
copy('jack.docx', 'newFile.docx');
$full_path = 'newFile.docx';
$zip_val = new ZipArchive();
// open() returns true on success and an integer error code on failure, so compare strictly.
if ($zip_val->open($full_path) === true)
{
    // In the Open XML word-processing format, the content is stored
    // in the document.xml file located in the word directory.
    $key_file_name = 'word/document.xml';
    $message = $zip_val->getFromName($key_file_name);
    $timestamp = date('d-M-Y H:i:s');
    // Replace the placeholders with actual values.
    $message = str_replace("client_full_name", "onlinecode org", $message);
    $message = str_replace("client_email_address", "ingo@onlinecode.org", $message);
    $message = str_replace("date_today", $timestamp, $message);
    $message = str_replace("client_website", "www.onlinecode.org", $message);
    $message = str_replace("client_mobile_number", "+1999999999", $message);
    // Replace the content with the new content created above.
    $zip_val->addFromString($key_file_name, $message);
    $zip_val->close();
}
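As for why the str_ireplace() call in the question finds nothing: one likely cause, stated here as an assumption rather than something verified against jack.doc, is that the binary .doc format often stores text as UTF-16LE, so every ASCII character is followed by a NUL byte and a plain ASCII needle never matches. A minimal sketch with a made-up byte string (the helper ascii_to_utf16le() is hypothetical and handles ASCII input only):

```php
<?php
// Hypothetical helper: encode an ASCII string as UTF-16LE
// by appending a NUL byte to each character.
function ascii_to_utf16le(string $ascii): string
{
    $out = '';
    for ($i = 0, $n = strlen($ascii); $i < $n; $i++) {
        $out .= $ascii[$i] . "\0";
    }
    return $out;
}

// Stand-in for the bytes of a .doc file (assumption: text stored as UTF-16LE).
$content = "\x01\x02" . ascii_to_utf16le('Title: USING_ICT_BOX') . "\x03";

// The plain ASCII needle is not found, so nothing is replaced.
$unchanged = str_ireplace('USING_ICT_BOX', 'YOUR_ICT_BOX!', $content);

// Encoding both the needle and the replacement the same way does match.
$changed = str_replace(
    ascii_to_utf16le('USING_ICT_BOX'),
    ascii_to_utf16le('YOUR_ICT_BOX!'),
    $content
);
```

The replacement in this sketch has the same length as the placeholder on purpose: a .doc file records positions internally, so changing the length of the text may leave a file that Word reports as damaged.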
Maybe this will point you in the right direction: http://davidwalsh.name/read-pdf-doc-file-php
Solutions I've found so far (not tested though):
Docvert - works for Doc, free, but not directly usable
PHPWordLib - works for Doc, not free
PHPDocX - DocX only, needs Zend.
I am going to opt for PHPWord (www.phpword.codeplex.com), as I believe teachers are going to get Office 2007 next year. I will also try to find some way to convert between .docx and .doc through PHP to support them in the meantime.
If you can reach a web service, look at Docmosis Cloud services, since it can mail-merge a doc file with your data and give you back a doc/pdf/other. You can HTTPS POST to the service to make the request, so it is pretty straightforward from PHP.
There are many ways to handle Word document files on Linux:
antiword - not very effective, as it only converts to plain text.
pyODconvert
OpenOffice or LibreOffice - through UNO
unoconv utility - requires permission to install software on the server
There is a Python script that is very usable for file conversion, but you need to run the conversion from the command line.
There is no specific, satisfactory solution for handling Word files using only PHP code.
I searched for a long time before arriving at this suggestion.
I have tried to extract the user email addresses from my server. The problem is that most files are .txt files, but some are CSV files with a .txt extension. When I try to read and extract, I am not able to read the CSV files that have a .txt extension. Here is my code:
<?php
$handle = fopen('2.txt', "r");
while(!feof($handle)) {
$string = fgets($handle);
$pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/i';
preg_match_all($pattern, $string, $matches);
foreach($matches[0] as $match)
{
echo $match;
echo '<br><br>';
}
}
?>
I have tried to use this code for that. The program reads the CSV files as a whole, and the text files line by line. There are thousands of files, so it is difficult to identify which is which.
Kindly suggest what I should do to resolve my problem. If there is a solution that can read any format, that would be awesome.
Well, your files are different, so you will have to take a different approach for each kind. In more general terms this is called adapting, and it is usually implemented with the Adapter design pattern.
If you use the Adapter design pattern, you would have code that inspects the extension of the file to be opened and a switch on either txt or csv. Based on the value, you would retrieve a TxtParser or a CsvParser respectively.
However, before diving deep into this territory, you might want to have a look at the files first. I cannot say this for sure without seeing the structures, but you can. If the contents of the text and CSV files are structured the same way, a very simple approach is to change the extension of all files to either txt or csv and then process them using the same logic, knowing that files with the same extension will now be processed in the same manner.
But from what I understood, the file structures actually differ. So, to keep your code concise, use the adapter pattern: two separate classes/functions for parsing, and another one on top for choosing the right parsing function (this top function would actually be a form of strategy) and running it.
Either way, I very much doubt there is a ready-made solution for the problem you are facing, as your file structure is yours and yours alone.
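A minimal sketch of that layout, using plain functions instead of classes. All function names here are hypothetical, and the selection rule (the file extension) is the one described above. Since the CSV files in the question carry a .txt extension, the rule would probably have to inspect the contents instead; that part is left as an assumption.

```php
<?php
// Hypothetical parser for plain-text files: scan the raw text with a regex.
function parse_txt_emails(string $contents): array
{
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $contents, $m);
    return $m[0];
}

// Hypothetical parser for CSV files: split rows into fields and keep valid addresses.
function parse_csv_emails(string $contents): array
{
    $emails = [];
    foreach (preg_split('/\r\n|\r|\n/', $contents) as $row) {
        foreach (str_getcsv($row, ',', '"', '\\') as $field) {
            $field = trim((string) $field);
            if (filter_var($field, FILTER_VALIDATE_EMAIL) !== false) {
                $emails[] = $field;
            }
        }
    }
    return $emails;
}

// The "top" function: choose the parser based on the file extension.
function select_parser(string $filename): callable
{
    switch (strtolower(pathinfo($filename, PATHINFO_EXTENSION))) {
        case 'csv':
            return 'parse_csv_emails';
        case 'txt':
            return 'parse_txt_emails';
        default:
            throw new InvalidArgumentException("No parser for $filename");
    }
}

// Run the selected parser on the given contents.
function extract_emails(string $filename, string $contents): array
{
    $parser = select_parser($filename);
    return $parser($contents);
}
```

Adding support for another format then means writing one more parser function and one more case in select_parser().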
OK, so the problem occurs when a CSV file contains a very long line. Given this restriction, I suggest using the example from php.net. Here it is:
$handle = @fopen("/tmp/inputfile.txt", "r");
if ($handle) {
while (($buffer = fgets($handle, 4096)) !== false) {
echo $buffer;
// do your operation for searching here
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
}
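To connect this loop to the e-mail search from the question, here is a rough sketch. The helper name emails_from_file() is hypothetical, and the regular expression is a simplified address pattern rather than a complete validator. It calls fgets() without a length argument, so each read returns a whole line and a long CSV row cannot split an address across two reads.

```php
<?php
// Hypothetical helper: collect e-mail addresses from a file, one line at a time.
function emails_from_file(string $path): array
{
    $emails = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $emails;
    }
    // fgets() without a length reads up to the end of the current line.
    while (($line = fgets($handle)) !== false) {
        if (preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $line, $m)) {
            $emails = array_merge($emails, $m[0]);
        }
    }
    fclose($handle);
    return $emails;
}
```

Because the pattern is applied to raw lines, it behaves the same for plain text and for comma-separated rows.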
I am using urldecode() on data before writing it to a text file, but in Windows Notepad all the content appears run together (not aligned as expected). In Windows WordPad it displays correctly, and when I open the file in Ubuntu the content also displays correctly. (My content contains line breaks, spaces, and some special characters.)
$attachment_file = fopen(Yii::app()->basePath.'/../uploads/attachment'.$user_id.'.txt', "a+") or die("Unable to open file!");
$content = urldecode($note_data["note_data"]);
fwrite($attachment_file,$content);
fclose($attachment_file);
As a quick fix I did
$content = str_replace("\n","\r\n",$content);
but I want to know whether there are other ways to do it.
If you are creating the file on Linux, you have to add the carriage return manually. If you are on Windows, you can try str_replace("\n", PHP_EOL, $content) instead.
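One caveat with a plain str_replace("\n", "\r\n", ...) is that content which already contains "\r\n" ends up with "\r\r\n". A small sketch that normalizes every kind of line ending in one pass (the function name to_crlf() is made up for this example):

```php
<?php
// Convert LF, CR and CRLF line endings to CRLF without doubling existing CRs.
function to_crlf(string $text): string
{
    // "\r\n" is listed first so an existing CRLF is matched as one unit.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}
```

Usage would be $content = to_crlf(urldecode($note_data["note_data"])); before the fwrite() call.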
I don't understand why you are using urldecode. Maybe you should use something like utf8_decode if your data is in UTF-8 format.
I'm trying to create a dynamic PDF using PHP.
I can load the PDF once the form is submitted, but as soon as I try to edit the PDF it fails to load.
I shortened the code to keep it straightforward.
PDF Structure
Your Name is : <<NAME>>
PHP Dynamic Script
set_time_limit(180);
//Create Variable Names
$name = $_POST['name'];
function pdf_replace($pattern, $replacement, $string) {
$len = strlen($pattern);
$regexp = '';
for($i = 0; $i <$len; $i++) {
$regexp .= $pattern[$i];
if($i < $len - 1) {
$regexp .= "(\)\-{0,1}[0-9]*\(){0,1}";
}
}
return ereg_replace ($regexp, $replacement, $string);
}
header('Content-Disposition: filename=cert.pdf');
header('Content-type: application/pdf');
//$date = date('F d, Y');
$filename = 'testform.pdf';
$fp = fopen($filename, 'r');
$output = fread($fp, filesize($filename));
fclose($fp);
//Replace the holders
$output = pdf_replace('<<NAME>>', $name, $output);
echo $output;
If I comment out the output it loads the form fine, but as soon as I try to run the function to replace the placeholder it fails to load. Has anyone done something like this before?
I've tried your code, and I assume one of the following causes the problem:
When you call pdf_replace(), PHP emits a Deprecated notice for the ereg_replace() function. The notice text ends up in the output, which breaks the PDF structure and causes the PDF to fail to load. The function has been deprecated since PHP 5.3.0 and was removed in PHP 7.0. The simple solution is to use preg_replace() instead.
function pdf_replace($pattern, $replacement, $string) {
return preg_replace('/'.preg_quote($pattern,'/').'/', $replacement, $string);
}
If you can't do that, you can either edit the php.ini file and change the error_reporting parameter, adding "^ E_DEPRECATED" to your current value to disable Deprecated notices, or call error_reporting() at the beginning of your script with an appropriate value.
I cannot see the PDF you use, but some PDF generators encode the PDF source, which makes it hard to find text in it. For example, I tried the "Print to PDF" feature on a Mac and was not able to find plain text in the resulting source. So either fopen or ereg_replace can complain about a wrong file format. In this case you should use a library that can work with PDF in a smarter way. I prefer FPDF, but there are plenty of such libraries.
I'm a web designer and I use ezPDF to create PDF files for reports that can be viewed in any browser. Here's the tutorial page I found: http://www.weberdev.com/get_example.php3?ExampleID=4804
I hope this would be helpful :)
Your code seems to have two problems.
First of all, ereg_replace() is a deprecated function. Because of this, the PHP script emits a notice, and the message text breaks the PDF structure. Change the function to preg_replace() and it should work.
Secondly, I tried your code with a sample PDF form. It seems that ereg_replace() cannot process some characters. In the PDF form that I used, the function truncates the string (the PDF form data) when it meets a specific character, namely §. That is why the code will not work even if you suppress the error using error_reporting(E_ALL ^ E_DEPRECATED);.
So you had better go for preg_replace():
<?php
set_time_limit(180);
//Create Variable Names
$name = $_POST['name'];
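function pdf_replace($pattern, $replacement, $string) {
$len = strlen($pattern);
$regexp = '';
for($i = 0; $i <$len; $i++) {
// Escape each character so PCRE matches it literally.
$regexp .= preg_quote($pattern[$i], '/');
if($i < $len - 1) {
// Allow an optional break such as ")-12(" between two characters.
$regexp .= '(\)\-{0,1}[0-9]*\(){0,1}';
}
}
// Unlike ereg_replace(), preg_replace() needs delimiters around the pattern.
return preg_replace('/' . $regexp . '/', $replacement, $string);
}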
header('Content-Disposition: filename=cert.pdf');
header('Content-type: application/pdf');
//$date = date('F d, Y');
$filename = 'testform.pdf';
$fp = fopen($filename, 'r');
$output = fread($fp, filesize($filename));
fclose($fp);
//Replace the holders
$output = pdf_replace('<<NAME>>', $name, $output);
echo $output;
?>
It is not possible to replace a text string in a PDF unless the original PDF document meets very specific requirements.
Some comments on such a project:
A PDF document uses byte offsets to allow fast access to specific objects within the document. Changing a string invalidates these offsets, and the PDF document may then be treated as damaged.
Most content streams in a PDF document are compressed, so the string you are searching for may be inside one of these streams and not "visible".
A string you "see" after a PDF is rendered does not have to be the same as in the PDF source.
The replaced/new string may use characters that are not available in the font used, or that are mapped to other characters by a separate encoding.
And there is much more to consider...
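The point about compressed streams can be illustrated with a small sketch. It assumes the zlib extension is available, and the page content is made up for the example; it is not taken from testform.pdf. gzcompress() produces zlib data, which is the format that FlateDecode streams in a PDF use.

```php
<?php
// Made-up page content; in a real PDF this would sit inside a FlateDecode stream.
$plain = str_repeat("BT (Your Name is : <<NAME>>) Tj ET\n", 20);

// Compress it the way a PDF producer typically would.
$compressed = gzcompress($plain);

// Searching the compressed bytes does not find the placeholder...
$foundInCompressed = strpos($compressed, '<<NAME>>') !== false;

// ...but it is there again once the stream has been inflated.
$foundAfterInflate = strpos(gzuncompress($compressed), '<<NAME>>') !== false;
```

A str_replace() or preg_replace() over the raw file therefore only has a chance when the PDF was written with uncompressed streams.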
Why are you not using str_replace?
Please check whether this code works for you (assuming you can see the words you want to replace in a simple text editor):
// hide all errors so that no error output breaks your PDF file
error_reporting(0);
ini_set("display_errors", 0);
//Create Variable Names
$name = $_POST['name'];
$filename = 'testform.pdf';
$fp = fopen($filename, 'r');
$output = fread($fp, filesize($filename));
fclose($fp);
//Replace the holders
$output = str_replace('<<NAME>>', $name, $output);
// I added the time just to avoid browser caching of the PDF file
header('Content-Disposition: filename=cert'.time().'.pdf');
header('Content-type: application/pdf');
echo $output;
There are many PHP projects for generating dynamic PDF files, so why not use one of them instead of this regular expression? You will probably have to add design elements such as tables and logos to the PDF in the future, and the easiest way to do that is to convert HTML to PDF.
If you are looking for performance and output that looks almost 100% like your HTML and CSS, you should use wkhtmltopdf. It is free and easy to use, and it runs on your server, but it is not PHP: it is an executable file that you have to run from your PHP code.
http://wkhtmltopdf.org/
Another alternative is TCPDF, which is pure PHP, unlike the previous option, which is an executable:
http://www.tcpdf.org/
I have used both and prefer wkhtmltopdf because it is faster and more precise with HTML and CSS.
I have tried many things, like "How to extract text from word file .doc,docx,.xlsx,.pptx php".
But this isn't a solution.
My server is Linux based, so enabling extension=php_com_dotnet.dll is not the solution.
Another solution was installing LibreOffice on the server, converting the .doc file to .txt on the fly, and then counting the words in that file. This is a very tedious and time-consuming job.
I just need a simple PHP script that removes the special characters from the .doc file and counts the number of words.
You can try with this PHP class that claims to be able to convert both .doc and .docx files in textual format.
http://www.phpclasses.org/package/7934-PHP-Convert-MS-Word-Docx-files-to-text.html
According to the example given, that's how you can use it:
require("doc2txt.class.php");
$docObj = new Doc2Txt("test.docx");
//$docObj = new Doc2Txt("test.doc");
$txt = $docObj->convertToText();
echo $txt;
As you pointed out, the core function of this library, as of many others, is something like this:
<?php
function read_doc($filename)
{
$fileHandle = fopen($filename, "r");
$line = @fread($fileHandle, filesize($filename));
$lines = explode(chr(0x0D) , $line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE) || (strlen($thisline) == 0))
{
}
else
{
$outtext.= $thisline . " ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/_()]/", "", $outtext);
return $outtext;
}
echo read_doc("sample.doc");
?>
I've tested this function with a .doc file and it seems to work quite well. It needs some fixes for the last part of the document (some random text is still generated at the end of the output), but with some fine-tuning it works reasonably well.
EDIT:
You are right, this function works correctly only with .docx documents (the document I tested was probably made using the same mechanism). With a file saved with the .doc extension, this function doesn't work!
The only help I'm able to give you right now is the .doc binary specifications link (here is an even more complete file), where you can see how the binary structure is laid out and extract the information from there. I can't do it now, so I hope that somebody else can help you with this!
In the end I had to use LibreOffice, but it is very efficient to use. It solved all my problems.
So my advice would be to install the headless package of LibreOffice on the server and use the command-line conversion.
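A rough sketch of what that could look like from PHP. It assumes the soffice binary is on the PATH of the web server user and that exec() is enabled; the function names are made up for this example.

```php
<?php
// Build the shell command for a headless LibreOffice conversion to plain text.
// Assumption: "soffice" is installed and reachable on the PATH.
function build_convert_command(string $docPath, string $outDir): string
{
    return 'soffice --headless --convert-to txt:Text --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Count the words in the converted text.
function count_words(string $text): int
{
    return str_word_count($text);
}

// Hypothetical wrapper: convert, then count. Returns null if the conversion failed.
function count_words_in_doc(string $docPath, string $outDir): ?int
{
    exec(build_convert_command($docPath, $outDir), $output, $exitCode);
    // LibreOffice names the output after the input file, with the new extension.
    $txtPath = $outDir . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
    if ($exitCode !== 0 || !is_file($txtPath)) {
        return null;
    }
    return count_words(file_get_contents($txtPath));
}
```

Note that str_word_count() depends on the current locale for non-ASCII letters, so counts for such documents may need a different word-splitting rule.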
I've built a tool that incorporates various methods found around the web and on Stack Overflow that provides word, line and page counts for doc, docx, pdf and txt files. I hope it's of use to people. If anyone can get rtf working with it I'd love a pull request! https://github.com/joeblurton/doccounter
I want to have PHP read an (uploaded) powerpoint presentation, and minimally extract the text from each slide (grabbing more info like images and layouts would even be better, but I would settle for just the text at this point).
I know that google apps does it in its presentation app, so I am guessing there is some way to translate the powerpoint binary, but I can't seem to find any info on how to do it.
Any ideas on what to try?
Thanks -
Depending on the version, you can take a look at the Zend Framework, as Zend_Search_Lucene is able to index PowerPoint 2007 files. Just take a look at the corresponding class file; I think it's something like Zend_Search_Lucene_Document_Pptx.
Yes of course it's possible.
[Here's a start.](http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/PowerPoint97-2007BinaryFileFormat(ppt)Specification.pdf) I wouldn't say it's very well documented/formatted, but it's not that hard once you get started. Start by focusing only on the elements you need (slides, text, etc).
A less detailed and simpler approach would be to open the .ppt file in a hex editor and look for the information you are interested in (you should be able to see text within the binary data) and what surrounds it. Then, based on what surrounds that information, you could write a parser which extracts it.
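If no hex editor is at hand, the same inspection can be done from PHP. The following is a small sketch; the helper name hex_context() is hypothetical. It prints the bytes around the first occurrence of a string you know is on a slide.

```php
<?php
// Hypothetical helper: hex dump of the bytes around the first occurrence of $needle.
function hex_context(string $data, string $needle, int $radius = 8): ?string
{
    $pos = strpos($data, $needle);
    if ($pos === false) {
        return null;
    }
    $start = max(0, $pos - $radius);
    $length = ($pos - $start) + strlen($needle) + $radius;
    // Format as space-separated hex pairs, e.g. "0f 00 00 00".
    return implode(' ', str_split(bin2hex(substr($data, $start, $length)), 2));
}
```

For example, echo hex_context(file_get_contents('slides.ppt'), 'Some slide title'); would show which marker bytes precede and follow that title (the file name and title here are placeholders).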
Here's a sample function I created from a similar one that extracts text from Word documents. I tested it with Microsoft PowerPoint files, but it won't decode OpenOffice Impress files saved as .ppt.
For .pptx files you might want to take a look at Zend Lucene.
function parsePPT($filename) {
// This approach looks for chr(0x0f), one more byte, then chr(0x00).chr(0x00).chr(0x00)
// to find the start of a text string, which is terminated by another NUL, chr(0x00).
$fileHandle = fopen($filename, "r");
$line = @fread($fileHandle, filesize($filename));
fclose($fileHandle);
$lines = explode(chr(0x0f),$line);
$outtext = '';
foreach($lines as $thisline) {
if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
$text_line = substr($thisline, 4);
$end_pos = strpos($text_line, chr(0x00));
$text_line = substr($text_line, 0, $end_pos);
$text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$text_line);
if (strlen($text_line) > 1) {
$outtext.= substr($text_line, 0, $end_pos)."\n";
}
}
}
return $outtext;
}
I wanted to post my resolution to this.
Unfortunately, I was unable to get PHP to reliably read the binary data.
My solution was to write a small vb6 app that does the work by automating PowerPoint.
Not what I was looking for, but, solves the issue for now.
That being said, the Zend option looks like it may be viable at some point, so I will watch that.
Thanks.