How can I search for text in files such as PDF, DOC, DOCX or TXT using PHP?
I want something similar to Full Text Search in MySQL,
but this time I'm searching directly through files, not a database.
The search has to cover many files located in one folder.
Any suggestions, tips or solutions for this problem?
I also noticed that Google searches through files like these.
For searching PDFs you'll need a program like pdftotext, which converts the content of a PDF to text. For Word documents a similar tool should be available (one is needed because of all the styling and encryption in Word files).
An example of searching through PDFs, copied from one of my scripts. It's a snippet, not the entire code, but it should give you some understanding: I extract keywords and store the matches in a PDF results array.
foreach ($keywords as $keyword)
{
    $keyword = strtolower($keyword);
    $file = ABSOLUTE_PATH_SITE."_uploaded/files/Transcripties/".$pdfFiles[$i];
    // escapeshellarg() quotes the path safely; the trailing "-" makes pdftotext write to stdout
    $content = (string) shell_exec('/usr/bin/pdftotext '.escapeshellarg($file).' -');
    $result = substr_count(strtolower($content), $keyword);
    if ($result > 0)
    {
        // $matchesOnPDF holds arrays, so compare against its "pdfFile" column
        if (!in_array($pdfFiles[$i], array_column($matchesOnPDF, "pdfFile")))
        {
            array_push($matchesOnPDF, array(
                "matches" => $result,
                "type" => "PDF",
                "pdfFile" => $pdfFiles[$i]));
        }
    }
}
Depending on the file type, you should convert the file to text and then search through it using e.g. file_get_contents() and strpos(). To convert files to text, you have, among others, the following tools available:
catdoc for word files
xlhtml for excel files
ppthtml for powerpoint files
unrtf for RTF files
pdftotext for pdf files
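As a rough sketch of the convert-then-search idea, the tools listed above could be wired together as follows. The function names converterCommand() and fileContainsText() are made up for illustration, and the flags passed to each tool are assumptions that may need adjusting for the versions installed on your server (some of these tools print HTML rather than plain text).

```php
<?php
// Hypothetical helper: maps a file extension to a shell command that prints
// the file's text on stdout. The flags are assumptions, check each tool's manual.
function converterCommand(string $path): ?string
{
    $tools = [
        'doc' => 'catdoc %s',
        'xls' => 'xlhtml %s',
        'ppt' => 'ppthtml %s',
        'rtf' => 'unrtf --text %s',
        'pdf' => 'pdftotext %s -',
    ];
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($tools[$ext])) {
        return null; // no converter known: the caller reads the file as plain text
    }
    // escapeshellarg() protects against spaces and quotes in file names
    return sprintf($tools[$ext], escapeshellarg($path));
}

// Hypothetical helper: true if $needle occurs (case-insensitively) in the file's text.
function fileContainsText(string $path, string $needle): bool
{
    $command = converterCommand($path);
    $text = $command === null ? file_get_contents($path) : shell_exec($command);
    // stripos() can return 0 for a match at the start, so compare with !== false
    return is_string($text) && stripos($text, $needle) !== false;
}
```

Looping over a folder then comes down to calling fileContainsText() for each path and collecting the ones that return true.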
If you are on a Linux server you may use
grep -R "text to be searched for" ./   # searches everything under the current directory
called from PHP using exec, resulting in
$cmd = 'grep -R "text to be searched for" ./';
exec($cmd, $result); // $result receives every output line; exec()'s return value is only the last one
print_r($result);
In 2021 I came across this question and found something, so I figured I would link to it...
Note: DOCX, PDF and similar formats are not regular text files, so reading or editing each type requires extra scripting and/or a different library, unless you can find an all-in-one library. This means you have to write handling code for every file type you want to search through, including plain text files. If you don't want to write the readers yourself, you have to install a library for each file type instead, and you still need code that calls each library's functions.
I found the basic answer here on the stack.
I have a PDF document and I want to check whether a specific text occurs in it (the texts are tags that I put in while generating the PDF). However, using these libraries (tcpdfFpdi, pdftk or fdpi) I couldn't figure out whether it's possible or how to do it.
$str = "{hello}";
$pdf = new TcpdfFpdi();
$pdf->setSourceFile($filePath);
$pdf->searchForText($str); // something like this which returns boolean
If I try dd(file_get_contents($filePath)) without any library, it returns a very long output that doesn't seem to contain the text I want, so I think it's better to use one of those libraries.
Just an idea…
It's not an actual PHP solution, but you could use a tool like pdftotext, which I know from this post (where a PDF file is converted into a string to count its words): https://superuser.com/a/221367/535203
You can install it, play around with that command and call it from within your PHP application.
As far as I remember (it's a long time since I used pdftotext), the output text is not exactly the PDF's content, but for finding a few tags in it, it's at least worth a try.
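A minimal sketch of that idea, assuming pdftotext is installed and on the PATH and that the tags look like {hello} (letters, digits and underscores between braces). Both function names are hypothetical.

```php
<?php
// Hypothetical helper: runs pdftotext and returns the extracted text,
// or an empty string if the command produced no output.
function pdfToText(string $pdfPath): string
{
    // The trailing "-" asks pdftotext to write to stdout instead of a file
    $output = shell_exec('pdftotext ' . escapeshellarg($pdfPath) . ' -');
    return is_string($output) ? $output : '';
}

// Hypothetical helper: returns each distinct {tag} found in the text, in order of first appearance.
function findTags(string $text): array
{
    preg_match_all('/\{[A-Za-z0-9_]+\}/', $text, $matches);
    return array_values(array_unique($matches[0]));
}
```

Checking for one tag is then in_array('{hello}', findTags(pdfToText($filePath)), true). Whether the tags survive extraction intact depends on how the PDF was generated.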
I am trying to extract user email addresses from files on my server. The problem is that most of the files are plain .txt files, but some are CSV files with a .txt extension. When I try to read and extract the addresses, I am not able to read the CSV files that have the .txt extension. Here is my code:
<?php
$handle = fopen('2.txt', "r");
while(!feof($handle)) {
    $string = fgets($handle);
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/i';
    preg_match_all($pattern, $string, $matches);
    foreach($matches[0] as $match)
    {
        echo $match;
        echo '<br><br>';
    }
}
?>
This is the code I have tried. The program reads the CSV files as one complete block, while it reads the text files line by line. There are thousands of files, so it is difficult to tell them apart by hand.
What should I do to resolve this problem? A solution that can read any format would be awesome.
Well, your files are different. Because of that you will have to take a different approach for each of them. In more general terms this is usually called adapting, and it is mostly done using the Adapter design pattern.
Should you use the adapter design pattern, you would have code that inspects the extension of the file to be opened and a switch on either txt or csv. Based on the value you would retrieve a TxtParser or a CsvParser respectively.
However, before diving deep into this territory you might want to have a look at the files first. I cannot say this for sure without seeing the structures, but you can. If the contents of both the text and CSV files are the same, then a very simple approach is to change the extension to either txt or csv for all files and then process them using the same logic, knowing that files with the same extension will now be processed in the same manner.
But from what I understood, the file structures actually differ. So, to keep your code concise, use the adapter pattern: have two separate classes/functions for parsing, and another one on top of them that chooses the right parsing function (this top function would actually be a form of strategy) and runs it.
Either way, I very much doubt there is a ready-made solution for the problem you are facing, as a file structure is mostly yours and yours alone.
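To make the parser-per-format idea a bit more concrete, here is a minimal sketch. Every name in it (EmailSource, TxtParser, CsvParser, parserFor, extractEmails) is hypothetical, and how you decide which format a file really has (its extension, or a look at its contents) is left to the caller.

```php
<?php
// Common interface: each parser turns a file into text fragments to scan.
interface EmailSource
{
    public function fragments(string $path): array;
}

class TxtParser implements EmailSource
{
    public function fragments(string $path): array
    {
        // One fragment per non-empty line
        $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        return $lines === false ? [] : $lines;
    }
}

class CsvParser implements EmailSource
{
    public function fragments(string $path): array
    {
        $cells = [];
        $handle = fopen($path, 'r');
        if ($handle === false) {
            return [];
        }
        // One fragment per cell; blank lines yield [null], which is filtered out
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            $cells = array_merge($cells, array_filter($row, 'is_string'));
        }
        fclose($handle);
        return $cells;
    }
}

// The "function on top" that picks the parser for a given format name.
function parserFor(string $format): EmailSource
{
    return strtolower($format) === 'csv' ? new CsvParser() : new TxtParser();
}

// Collects the distinct addresses found in the fragments.
function extractEmails(array $fragments): array
{
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/';
    $found = [];
    foreach ($fragments as $fragment) {
        if (preg_match_all($pattern, $fragment, $matches)) {
            $found = array_merge($found, $matches[0]);
        }
    }
    return array_values(array_unique($found));
}
```

Usage would look like extractEmails(parserFor('csv')->fragments('2.txt')). Supporting a new format then means adding one more class without touching the extraction code.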
OK, so the problem occurs when a CSV file has a very long line. Given this restriction, I suggest you use the example from php.net. Here it is:
$handle = @fopen("/tmp/inputfile.txt", "r");
if ($handle) {
while (($buffer = fgets($handle, 4096)) !== false) {
echo $buffer;
// do your operation for searching here
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
}
I have some RTF files generated by users with Microsoft Word. I need to be able to concatenate these files, and the resulting file should still be readable by LibreOffice. I'm using LibreOffice to convert the resulting file into a PDF file.
In order to concatenate two files, my application removes the last character of the first file and the first character of the other file. The file headers are not removed (I'm not talking about page headers).
For some reason, LibreOffice does not like the headers inserted by Microsoft Word. But it works fine if I open these files with Wordpad and save them.
Another way to remove these headers is to convert these files into RTF before I concatenate them. This way I can convert to PDF, but LibreOffice makes a serious mess of my tabs when I convert my files to RTF.
So how can I remove the headers through PHP without messing up the tabs? Or do you have another way to get the same result?
Edit :
In a nutshell, I must be able to concatenate these files so that LibreOffice can open the result. And my tabs must still display nicely in Microsoft Word.
As you can guess, users don't want to use Wordpad, and my customer's IT department has to comply with that wish (office politics).
UPDATE :
I have to do the merging first, because of business rules. The files are merged, then my users can modify the result using Word (no problems here). Then they ask their boss to validate it. If the boss agrees to validate it, the RTF file becomes a PDF file.
UPDATE 2 :
I have the beginning of a solution. If the RTF file starts with plain text or a picture, you have to remove everything until you reach \pard. But this does not work if your file starts with a tab.
UPDATE 3 :
If you want to support tabs too, you have to remove everything until you reach \pard or \trowd. I'm going to post the complete solution once I have working code. This will work fine as long as you don't need colours and all your files use the same font (because we don't remove the RTF headers of the first file).
If the limitations with the 'pure RTF' approach come back to bite you, you could use LibreOffice to convert your RTF files to docx, then use a tool to merge the docx files.
There are such tools for .NET and Java (such as our MergeDocx product); I'm not sure what you'll find for PHP.
I succeeded in building reliable code that makes it possible to manipulate the RTF files created with Microsoft Word. It works as long as you only need text, pictures and tabs, and don't need fancy things such as color. Color works for text, but beyond that ...
$content = "";
// stristr() returns the haystack from the first occurrence of the needle to the end.
$tmp_pard = stristr($RTFstring, '\pard');
// Single quotes matter here: in a double-quoted string "\t" is a tab, so "\trowd" would never match.
$tmp_tab = stristr($RTFstring, '\trowd');
if ($tmp_pard != "" || $tmp_tab != "") {
    // Keep the longer remainder, because it starts at the first occurrence of \pard or \trowd.
    if (strlen($tmp_pard) > strlen($tmp_tab))
        // "{" is added so the concatenation code still works. We only remove the headers.
        $content = "{" . $tmp_pard;
    else
        $content = "{" . $tmp_tab;
} else {
    $content = $RTFstring;
}
return $content;
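Putting the header stripping together with the concatenation rule from the question (drop the last character of the first file and the first character of the next one) might look like the sketch below. The function names are hypothetical, and the same limitations apply: the first file's header, and therefore its font and colour tables, is the only one kept.

```php
<?php
// Hypothetical helper: removes everything before the first \pard or \trowd
// and re-adds the opening brace. Returns the input unchanged if neither is found.
function stripRtfHeader(string $rtf): string
{
    $positions = [];
    foreach (['\pard', '\trowd'] as $marker) {
        $pos = stripos($rtf, $marker);
        if ($pos !== false) {
            $positions[] = $pos;
        }
    }
    if ($positions === []) {
        return $rtf;
    }
    return '{' . substr($rtf, min($positions));
}

// Hypothetical helper: appends the body of $second to $first.
function concatRtf(string $first, string $second): string
{
    // rtrim() guards against a trailing newline after the closing brace
    $head = substr(rtrim($first), 0, -1);          // drop the final "}"
    $tail = substr(stripRtfHeader($second), 1);    // drop the leading "{"
    return $head . $tail;
}
```

For more than two files, the result of one concatRtf() call can be passed as $first to the next call.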
I have a slightly unconventional task I am trying to accomplish with .ZIP archives in PHP. I have a zip archive used for an automation task (it's a startup package for Amazon EC2 instances) which contains a number of text and XML files. What I need to do is find/replace a few pieces of text within those files and output a Base64-encoded string (not write a new .zip file) using PHP on the fly.
I have no problem with getting the file contents and base64 encoding them with file_get_contents() and base64_encode(), or with the find/replace. It's the unzipping and zipping to and from strings that I can't seem to figure out.
I would like to avoid unzipping the archive, copying the files, editing the files, writing a new .zip to disk, and then getting the contents and encoding that. I was hoping there might be a solution that looks more like this:
Get the contents of the zip file into a string.
$originalZipFile = file_get_contents('Path/To/ZipFile');
"Unzip" the data in that string, to a new string to expose the bits of text I want to find/replace.
$unzippedFile = someFunction($originalZipFile);
Find and replace bits of text.
$processedString = str_replace($find, $replace, $unzippedFile);
"Rezip" the processed string into a new string.
$rezippedFile = someOtherFunction($processedString);
Base64 encode the "rezipped" string.
$desiredOutputString = base64_encode($rezippedFile);
I have looked at the PHP ZipArchive class, but it doesn't seem to have the functions I'm looking for.
Any insights are greatly appreciated!
-Oliver
Well, I believe I found a pretty good solution to this. For any others looking for similar solutions, I would recommend looking at the ZipStream-PHP class by Paul Duncan.
With this class, you are able to dynamically write files, contents, and directories to a zip file, which is then streamed without writing a file to disk.
Pretty automagical.
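The ZipStream-PHP calls are not shown in this thread, so the sketch below does not use that library. It is a fallback built on PHP's bundled ZipArchive class instead, and unlike the streaming approach it does go through two short-lived temporary files. The function name rezipWithReplacements() is made up, and the sketch assumes every entry in the archive is text that is safe to run replacements on.

```php
<?php
// Hypothetical helper: applies find => replace pairs to every file in a zip
// (given as raw bytes) and returns the new archive, Base64 encoded.
function rezipWithReplacements(string $zipBytes, array $replacements): string
{
    $sourcePath = tempnam(sys_get_temp_dir(), 'zin');
    $targetPath = tempnam(sys_get_temp_dir(), 'zout');
    file_put_contents($sourcePath, $zipBytes);

    $source = new ZipArchive();
    $target = new ZipArchive();
    if ($source->open($sourcePath) !== true
        || $target->open($targetPath, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException('Could not open the archives');
    }

    for ($i = 0; $i < $source->numFiles; $i++) {
        $name = $source->getNameIndex($i);
        if (substr($name, -1) === '/') {
            $target->addEmptyDir($name); // directory entry, nothing to replace
            continue;
        }
        // strtr() applies all find => replace pairs in a single pass
        $target->addFromString($name, strtr($source->getFromIndex($i), $replacements));
    }

    $source->close();
    $target->close(); // the new archive is written to disk on close()

    $encoded = base64_encode(file_get_contents($targetPath));
    unlink($sourcePath);
    unlink($targetPath);
    return $encoded;
}
```

A call would look like rezipWithReplacements(file_get_contents('Path/To/ZipFile'), [$find => $replace]).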
I have about 50 files in a folder.
I want to get all the files that contain 'Name : ENAD'.
For example:
the PHP IMAP library
searches through thousands of message text files for a word.
I have written a naive function that opens each file and searches inside it:
$dir = 'dir';
$matched_files = array();
foreach (scandir($dir) as $file) {
    // scandir() returns bare names (including "." and ".."), so build the full path and skip directories
    $path = $dir . '/' . $file;
    if (!is_file($path)) continue;
    $content = file_get_contents($path);
    if (strpos($content, 'Name : ENAD') !== false)
        $matched_files[] = $file;
}
But what if there are thousands of files?
Should I open all of them?
Is it possible to search for something inside a file without opening it?
If not,
what is the best and fastest way to do that?
"Is it possible to search for something inside a file without opening it?"
Of course not. Where is your common sense? Can you search a refrigerator without opening it?
"I have about 50 files in a folder"
No problem, it will be fast enough. Opening a file is not as heavy an operation as you imagine.
"But what if there are thousands of files?"
First have a thousand, then come and ask.
"What is the best and fastest way to do that?"
Store your data in a database, not in files.
Is there a reason you need to use PHP functions? This is exactly what grep was designed to do...
You could always just use PHP's exec() to run an appropriate grep command, such as:
grep -lr 'Name : ENAD' dir
But you might also want to consider (if you're the person creating thousands of files in the first place) whether that is the best way of storing your data - if you usually need the ability to search quickly, you might want to either use a database instead of plain files (e.g. MySQL, PostgreSQL, SQLite, et cetera), or keep a search index (using e.g. Sphinx, Solr, or Lucene).
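Calling the grep command above from PHP could be wrapped roughly like this. The function name grepFiles() is made up, and the -F flag (treat the search text as a fixed string rather than a regular expression) is an addition that was not part of the original command.

```php
<?php
// Hypothetical helper: returns the paths of all files under $dir that contain $needle.
function grepFiles(string $needle, string $dir): array
{
    // -l: print file names only, -r: recurse, -F: fixed string, --: end of options
    $command = 'grep -lrF -e ' . escapeshellarg($needle) . ' -- ' . escapeshellarg($dir);
    exec($command, $files, $exitCode);
    // grep exits with 0 when something matched, 1 when nothing matched, 2 on errors
    if ($exitCode > 1) {
        throw new RuntimeException('grep failed with exit code ' . $exitCode);
    }
    return $files;
}
```

grepFiles('Name : ENAD', 'dir') then returns one path per matching file, or an empty array when nothing matches.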