Can PHP read text from a PowerPoint file?

I want PHP to read an (uploaded) PowerPoint presentation and, at minimum, extract the text from each slide (grabbing more info like images and layouts would be even better, but I would settle for just the text at this point).
I know that Google Apps does it in its presentation app, so I am guessing there is some way to translate the PowerPoint binary, but I can't seem to find any info on how to do it.
Any ideas on what to try?
Thanks -

Depending on the version, you can take a look at the Zend Framework, as Zend_Search_Lucene is able to index PowerPoint 2007 files. Just take a look at the corresponding class file; I think it's something like Zend_Search_Lucene_Document_Pptx.
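For example, a minimal sketch using Zend Framework 1's Lucene component; the static loader method and the 'body' field name are from memory, so verify them against your ZF version:
require_once 'Zend/Search/Lucene/Document/Pptx.php';

// Load the .pptx and pull out the text content that would be indexed.
$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile('presentation.pptx');
echo $doc->getFieldValue('body'); // the extracted slide text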

Yes, of course it's possible.
[Here's a start.](http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/PowerPoint97-2007BinaryFileFormat(ppt)Specification.pdf) I wouldn't say it's very well documented/formatted, but it's not that hard once you get started. Start by focusing only on the elements you need (slides, text, etc.).
A less detailed and simpler approach would be to open a .ppt file in a hex editor and look for the information you are interested in (you should be able to see text within the binary data) and what surrounds it. Then, based on what surrounds that information, you could write a parser which extracts it.

Here's a sample function I created from a similar one that extracts text from Word documents. I tested it with Microsoft PowerPoint files, but it won't decode OpenOffice Impress files saved as .ppt.
For .pptx files you might want to take a look at Zend Lucene.
function parsePPT($filename) {
    // This approach detects the byte sequence
    // chr(0x0f) . <hex value> . chr(0x00) . chr(0x00) . chr(0x00)
    // that precedes text strings, which are then terminated by another
    // NUL chr(0x00): split on the delimiter, then take the text between
    // the markers.
    $fileHandle = fopen($filename, "r");
    $line = @fread($fileHandle, filesize($filename));
    $lines = explode(chr(0x0f), $line);
    $outtext = '';
    foreach ($lines as $thisline) {
        // After the split, offset 0 holds the hex value byte and the
        // three NULs sit at offsets 1-3, so the text starts at offset 4.
        if (strpos($thisline, chr(0x00) . chr(0x00) . chr(0x00)) === 1) {
            $text_line = substr($thisline, 4);
            $end_pos = strpos($text_line, chr(0x00));
            $text_line = substr($text_line, 0, $end_pos);
            $text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/", "", $text_line);
            if (strlen($text_line) > 1) {
                $outtext .= $text_line . "\n";
            }
        }
    }
    return $outtext;
}
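A quick usage sketch (the file path is just an example):
echo parsePPT('uploads/presentation.ppt'); // one extracted string per line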

I wanted to post my resolution to this.
Unfortunately, I was unable to get PHP to reliably read the binary data.
My solution was to write a small VB6 app that does the work by automating PowerPoint.
Not what I was looking for, but it solves the issue for now.
That being said, the Zend option looks like it may be viable at some point, so I will watch that.
Thanks.

Related

Should I use "fgetcsv" instead of "array_map" and how to do it?

I've made this script to extract data from a CSV file.
$url = 'https://flux.netaffiliation.com/feed.php?maff=3E9867FCP3CB0566CA125F7935102835L51118FV4';
$data = array_map(function($line) { return str_getcsv($line, '|'); }, file($url));
It's working exactly as I want, but I've just been told that it's not the proper way to do it and that I really should use fgetcsv instead.
Is that right? I've tried many ways to do it with fgetcsv but didn't manage to get anything close.
Here is an example of what I would like to get as output:
$data[4298][0] = 889698467841
$data[4298][1] = Figurine Funko Pop! - N° 790 - Disney : Mighty Ducks - Coach Bombay
$data[4298][2] = 108740
$data[4298][3] = 14.99
First of all, there is no single "proper" way to do things in programming. It is up to you and depends on your use case.
I just downloaded the CSV file, and it is ca. 20 MB. Your solution downloads the whole file at once. If you have no memory restrictions and you do not need to give fast feedback to the caller (i.e. the delay for downloading the whole file does not matter), your solution is the better one when you want to guarantee that the whole content is processed: you read all the content at once, and the further processing no longer depends on external factors like your Internet connection.
If you use fgetcsv, you read from the URL line by line, sequentially. Your connection has to stay open until each line has been processed. You do not need a big memory allocation, but it takes longer until the whole content has been processed.
Both methods have their pros and cons. You should know what your goal is and how often this script will run; consider your use case and decide which method is best for you.
Here is the same result without array_map():
$url = 'https://flux.netaffiliation.com/feed.php?maff=3E9867FCP3CB0566CA125F7935102835L51118FV4';
$lines = file($url);
$data = [];
foreach ($lines as $line) {
    $data[] = str_getcsv(trim($line), '|');
    // optionally:
    // $data[] = explode('|', trim($line));
}
$lines = null;
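And here is a minimal sketch of the fgetcsv approach the question asks about, reading the remote stream line by line instead of loading it all into memory first:
$url = 'https://flux.netaffiliation.com/feed.php?maff=3E9867FCP3CB0566CA125F7935102835L51118FV4';
$data = [];
if (($handle = fopen($url, 'r')) !== false) {
    // fgetcsv parses one '|'-delimited line per call, so memory use stays flat.
    while (($row = fgetcsv($handle, 0, '|')) !== false) {
        $data[] = $row;
    }
    fclose($handle);
}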

How to read any file type using PHP 7?

I have tried to extract the user email addresses from my server. The problem is that most files are .txt, but some are CSV files with a .txt extension. When I try to read and extract, I am not able to read the CSV files that have the .txt extension. Here is my code:
<?php
$handle = fopen('2.txt', "r");
while (!feof($handle)) {
    $string = fgets($handle);
    // Simple email pattern: local part, @, domain, 2-4 letter TLD.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/i';
    preg_match_all($pattern, $string, $matches);
    foreach ($matches[0] as $match) {
        echo $match;
        echo '<br><br>';
    }
}
?>
I have tried to use this code for that. It reads the files that are really CSV as one complete block, but it reads the text files line by line. There are thousands of files, so it is difficult to identify which is which.
Kindly suggest what I should do to resolve my problem. If there is any solution which can read any format, that would be awesome.
Well, your files are different, so you will have to take a different approach for each of them. In more general terms this is usually called adapting, and it is mostly implemented using the Adapter design pattern.
Should you use the Adapter design pattern, you would have code inspecting the extension of the file to be opened and a switch over either txt or csv. Based on the value you would retrieve a TxtParser or a CsvParser respectively.
However, before diving deep into this territory you might want to have a look at the files first. I cannot say this for sure without seeing the structures, but you can. If the contents of the text and CSV files have the same structure, then a very simple approach is to change the extension of all files to either txt or csv and then process them with the same logic, knowing files with the same extension will now be processed in the same manner.
But from what I understood, the file structures actually differ. So, to keep your code concise, use the adapter pattern: have two separate classes/functions for parsing, and another one on top of that (effectively a form of strategy) for choosing the right parsing function and running it.
Either way, I very much doubt there is a ready-made solution for the problem you are facing, as a file structure is mostly your own. A sketch of the dispatch idea follows.
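A minimal sketch of that extension-based dispatch; the parser closures and function name here are illustrative, not a library API:
function getParser(string $filename): callable
{
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if ($ext === 'csv') {
        // CSV adapter: parse each row, then flatten it back to plain text.
        return function (string $path): string {
            $rows = array_map('str_getcsv', file($path));
            $flat = array_map(function ($row) { return implode(' ', $row); }, $rows);
            return implode("\n", $flat);
        };
    }
    // Default adapter: treat the file as plain text.
    return function (string $path): string {
        return file_get_contents($path);
    };
}

$parser = getParser('2.txt');
$text   = $parser('2.txt');
// ...then run your preg_match_all() over $text as before.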
OK, so the problem arises when the CSV file has very long lines. Given that restriction, I suggest you use the example from php.net. Here it is:
$handle = @fopen("/tmp/inputfile.txt", "r");
if ($handle) {
    while (($buffer = fgets($handle, 4096)) !== false) {
        echo $buffer;
        // do your searching here
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);
}

How to count words from a .doc file using a PHP script?

I have tried many things, like the answers to "How to extract text from word file .doc,docx,.xlsx,.pptx php", but they aren't a solution.
My server is Linux-based, so enabling extension=php_com_dotnet.dll is not an option.
Another solution was installing LibreOffice on the server, converting the .doc file to .txt on the fly, and then counting the words from that file. This is a very tedious and time-consuming job.
I just need a simple PHP script that removes the special characters from the .doc file and counts the number of words.
You can try this PHP class, which claims to be able to convert both .doc and .docx files to text:
http://www.phpclasses.org/package/7934-PHP-Convert-MS-Word-Docx-files-to-text.html
According to the example given, this is how you use it:
require("doc2txt.class.php");
$docObj = new Doc2Txt("test.docx");
//$docObj = new Doc2Txt("test.doc");
$txt = $docObj->convertToText();
echo $txt;
As you pointed out, the core function of this library, as of many others, is something like this:
<?php
function read_doc($filename)
{
    $fileHandle = fopen($filename, "r");
    $line = @fread($fileHandle, filesize($filename));
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty chunks that contain no NUL bytes.
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/_()]/", "", $outtext);
    return $outtext;
}
echo read_doc("sample.doc");
?>
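Since the question asks for a word count, you can feed the extracted text straight into str_word_count():
echo str_word_count(read_doc("sample.doc")); // number of words found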
I've tested this function with a .doc file and it seems to work quite well. It needs some fixes for the last part of the document (some random text is still generated at the end of the output), but with some fine-tuning it works reasonably well.
EDIT:
You are right, this function works correctly only with .docx documents (the document I tested was probably made using the same mechanism). With a file saved in the older .doc format, this function doesn't work!
The only help I can give you right now is the .doc binary specification link (here is an even more complete file), where you can actually see how the binary structure is made and extract the information from there. I can't do it now, so I hope that somebody else may help you through this!
In the end I had to use LibreOffice, but it is very efficient and it solved all my problems.
So my advice would be to install the 'headless' package of LibreOffice on the server and use the command-line conversion.
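A sketch of that route; the file paths are just examples, and it assumes the libreoffice binary is on the server's PATH:
// Convert the .doc to plain text, then count words on the result.
shell_exec('libreoffice --headless --convert-to txt --outdir /tmp ' . escapeshellarg('report.doc'));
echo str_word_count(file_get_contents('/tmp/report.txt'));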
I've built a tool, incorporating various methods found around the web and on Stack Overflow, that provides word, line and page counts for doc, docx, pdf and txt files. I hope it's of use to people. If anyone can get rtf working with it I'd love a pull request! https://github.com/joeblurton/doccounter

Searching through very large files with PHP to extract a block very efficiently

I've been having a major headache lately parsing metadata from video files, and found that part of the problem is a disregard of various standards (or at least differences in interpretation) by video-production software vendors (among other reasons).
As a result I need to be able to scan through very large video (and image) files, of various formats, containers and codecs, and dig out the metadata. I've already got FFmpeg, ExifTool, Imagick and Exiv2, each handling different types of metadata in various file types, and have been through various other options to fill some other gaps (please don't suggest libraries or other tools, I've tried them all :)).
Now I'm down to scanning the large files (up to 2 GB each) for an XMP block (which is commonly written to movie files by the Adobe suite and some other software). I've written a function to do it, but I'm concerned it could be improved.
function extractBlockReverse($file, $searchStart, $searchEnd)
{
    $handle = fopen($file, "r");
    if ($handle) {
        $startLen = strlen($searchStart);
        $endLen = strlen($searchEnd);
        // Walk backwards from the end of the file one character at a time:
        // first find $searchEnd, then keep prepending until $searchStart appears.
        for ($pos = 0, $output = '', $length = 0, $finished = false, $target = '';
             $length < 10000 && !$finished && fseek($handle, $pos, SEEK_END) !== -1;
             $pos--) {
            $currChar = fgetc($handle);
            if (!empty($output)) {
                $output = $currChar . $output;
                $length++;
                $target = $currChar . substr($target, 0, $startLen - 1);
                $finished = ($target == $searchStart);
            } else {
                $target = $currChar . substr($target, 0, $endLen - 1);
                if ($target == $searchEnd) {
                    $output = $target;
                    $length = $length + $endLen;
                    $target = '';
                }
            }
        }
        fclose($handle);
        return $output;
    } else {
        throw new Exception('file not found');
    }
}
echo extractBlockReverse("very_large_video_file.mov",
'<x:xmpmeta',
'</x:xmpmeta>');
At the moment it's OK, but I'd really like to get the most out of PHP here without crippling my server. I'm wondering if there is a better way to do this (or tweaks to the code which would improve it), as this approach seems a bit over the top for something as simple as finding a couple of strings and pulling out whatever sits between them.
You can use one of the fast string-searching algorithms, like Knuth-Morris-Pratt or Boyer-Moore, to find the positions of the start and end tags, and then read all the data between them.
You should measure their performance though: with such short search patterns it may turn out that the constant factor of the chosen algorithm is not good enough for it to be worth it.
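In PHP you rarely need to hand-roll those algorithms, since strpos() is already a tuned C implementation. Here is a sketch of a buffered forward scan built on it; the chunk size and function name are my own choices, and the carried-over tail catches needles that straddle chunk boundaries:
function findMarkerPos($file, $needle, $chunkSize = 65536)
{
    $handle = fopen($file, 'r');
    $carry = '';
    $offset = 0;
    $keep = max(strlen($needle) - 1, 0);
    while (($chunk = fread($handle, $chunkSize)) !== false && $chunk !== '') {
        $buffer = $carry . $chunk;
        if (($pos = strpos($buffer, $needle)) !== false) {
            fclose($handle);
            return $offset - strlen($carry) + $pos; // absolute byte offset
        }
        $carry = $keep > 0 ? substr($buffer, -$keep) : '';
        $offset += strlen($chunk);
    }
    fclose($handle);
    return false;
}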
With files this big, I think that the most important optimization would be to NOT search the whole file. I don't believe a video or image will ever have an XMP block smack in the middle; or if it does, it will likely be garbage.
Okay, it IS possible: TIFF can do this, and JPEG too, and PNG; so why not video formats? But in real-world applications, loose-format metadata such as XMP is usually stored last, and more rarely near the beginning of the file.
Also, I think that most XMP blocks will not be too large (even if Adobe routinely pads them in order to be able to "almost always" update them in place quickly).
So my first attempt would be to extract, say, the first 100 KB and the last 100 KB of the file, then scan these two blocks for "<x:xmpmeta".
If the search does not succeed, you can still fall back to the exhaustive search; but if it succeeds, it returns in one ten-thousandth of the time. Conversely, even if this trick only succeeded one time in a thousand, it would still be worthwhile.
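A minimal sketch of that head/tail strategy (the window size and function name are my own choices):
function findXmpFast($file, $window = 102400)
{
    $size = filesize($file);
    $handle = fopen($file, 'r');
    $head = fread($handle, min($window, $size));
    $tail = '';
    if ($size > $window) {
        fseek($handle, -$window, SEEK_END);
        $tail = fread($handle, $window);
    }
    fclose($handle);
    // Check the tail first, since XMP usually lives near the end of the file.
    foreach ([$tail, $head] as $chunk) {
        $start = strpos($chunk, '<x:xmpmeta');
        if ($start !== false) {
            $end = strpos($chunk, '</x:xmpmeta>', $start);
            if ($end !== false) {
                return substr($chunk, $start, $end - $start + strlen('</x:xmpmeta>'));
            }
        }
    }
    return false; // fall back to the exhaustive scan
}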

Search Text In Files Using PHP

How do I search for text in files like PDF, .doc, .docx or .txt using PHP?
I want something similar to the full-text search in MySQL, but this time searching directly through files, not a database.
The search should cover many files located in a folder.
Any suggestions, tips or solutions for this problem?
I also noticed that Google manages to search through such files.
For searching PDFs you'll need a program like pdftotext, which extracts the text content from a PDF. For Word documents a similar tool should be available (because of all the styling and encryption in Word files).
Here is an example of searching through PDFs, copied from one of my scripts. It's a snippet, not the entire code, but it should give you some understanding: I extract keywords and store matches in an array of PDF results.
foreach ($keywords as $keyword) {
    $keyword = strtolower($keyword);
    $file = ABSOLUTE_PATH_SITE."_uploaded/files/Transcripties/".$pdfFiles[$i];
    $content = addslashes(shell_exec('/usr/bin/pdftotext \''.$file.'\' -'));
    $result = substr_count(strtolower($content), $keyword);
    if ($result > 0) {
        if (!in_array($pdfFiles[$i], $matchesOnPDF)) {
            array_push($matchesOnPDF, array(
                "matches" => $result,
                "type"    => "PDF",
                "pdfFile" => $pdfFiles[$i]
            ));
        }
    }
}
Depending on the file type, you should convert the file to text and then search through it, e.g. with file_get_contents() and strpos(). To convert files to text, you have, among others, the following tools available (see the sketch after this list):
catdoc for word files
xlhtml for excel files
ppthtml for powerpoint files
unrtf for RTF files
pdftotext for pdf files
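For instance, a sketch of the convert-then-search route using catdoc for a .doc file; swap in the matching tool from the list for other types, and note the file name and search term here are just examples:
$text = shell_exec('catdoc ' . escapeshellarg('report.doc'));
if ($text !== null && strpos($text, 'invoice') !== false) {
    echo "Match found in report.doc";
}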
If you are on a Linux server you may use
grep -R "text to be searched for" .    # searches everything under the current directory
called from PHP using exec():
$cmd = 'grep -R "text to be searched for" .';
exec($cmd, $result);
print_r($result);
In 2021 I came across this and found something, so I figured I would link to it...
Note: docx, PDFs and others are not regular text files, and each type requires extra scripting and/or a different library to read and/or edit, unless you can find an all-in-one library. That means you either script out each file type you want to search through (including normal text files), or install the library you need for each file type you want to read, and you still have to write the code that calls each library's functions.
I found the basic answer here on the stack.
