Search Within Text Files (.doc, .docx, .pdf etc) in mysql database - php

I want to build a module that searches within files (file types: .doc, .docx, .pdf). Using "file_get_contents()" I can search the files, but I have to specify the folder that contains them. In my case the files are spread across many folders (like this: C:\xampp\htdocs\cats1\attachments\site_1\0xxx..); another application always stores them under the "0xxx" folder. I just want to specify one path so that, no matter how many subfolders the "0xxx" folder contains, the search covers all of them. I am quite new to PHP, so please help. My code for this application is below.
<?php
$matched_files = array();
if (isset($_POST['submit']))
{
    $skills = $_POST['skills'];
    $experience = $_POST['experience'];
    $location = $_POST['location'];
    $path = 'C:\Docs';
    $dir = dir($path);
    // Get next file/dir name in directory
    while (false !== ($file = $dir->read()))
    {
        if ($file != '.' && $file != '..')
        {
            // Is this entry a file or directory?
            if (is_file($path . '/' . $file))
            {
                // It's a file, yay! Let's get the file's contents
                $data = file_get_contents($path . '/' . $file);
                // Is the str in the data (case-insensitive search)
                if (stripos($data, $skills) !== false and (stripos($data, $experience) !== false and (stripos($data, $location) !== false)))
                {
                    $matched_files[] = $file;
                }
            }
        }
    }
    $dir->close();
    $matched_files_unique = array_unique($matched_files);
}
?>

The files that you're mentioning are not text files. Additionally, it is not a good idea to store these files' contents in a database. Here's the approach I would take:
1. Store the files on the filesystem, using their hash (generated from something like sha1()) as the file name.
2. Create a table to store the metadata (file name, date uploaded, hash name) of the files.
3. Within the above-mentioned table, create a text column to store the extracted text from the files. Each file type will require a different tool. For instance, for PDFs, you can use something like pdftotext.
4. Do your searches in the database by selecting the filename (hash) from the table where the keywords are contained within the text column (or whatever search criteria you want).
5. Open the file named by the returned hash and return that file to the user.
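As for the directory part of the question (files nested at an arbitrary depth under one root folder), a recursive iterator avoids listing every subfolder by hand. The following is only a minimal sketch: it assumes the files being scanned are already plain text (for example, the output of a text-extraction step), since raw .doc/.docx/.pdf bytes will not match reliably. The function name find_files_containing and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: returns the paths of all files below $root whose
// contents contain every non-empty term in $terms (case-insensitive).
function find_files_containing(string $root, array $terms): array
{
    $matches = [];
    // Walks $root and all of its subfolders, however deep they go
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $fileInfo) {
        if (!$fileInfo->isFile()) {
            continue;
        }
        $data = file_get_contents($fileInfo->getPathname());
        if ($data === false) {
            continue; // unreadable file, skip it
        }
        $allFound = true;
        foreach ($terms as $term) {
            if ($term === '') {
                continue; // ignore empty search fields
            }
            if (stripos($data, $term) === false) {
                $allFound = false;
                break;
            }
        }
        if ($allFound) {
            $matches[] = $fileInfo->getPathname();
        }
    }
    return $matches;
}
```

With the form from the question, a call might look like find_files_containing($path, array($skills, $experience, $location)), where $path points at the "0xxx" folder.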

Related

PHP Check if dynamic named folder exists

I'm having problems checking whether a dynamically named folder exists using PHP. I know I can use
file_exists()
to check if a folder or file exists, but the problem is that the names of the folders I'm checking may vary.
I have folders where the first part of the name is fixed, and the part after "_" can vary. As an example:
folder_0,
where every folder will start with "folder_" but after the "_" it can be anything.
Is there any way I can check if a folder with this property exists?
Thanks in advance.
SR
You can loop through all the files/folders in the parent folder:
$folder = '/path-to-your-folder-having-subfolders';
$handle = opendir($folder) or die('Could not open folder.');
while (false !== ($file = readdir($handle))) {
    if (preg_match("|^folder_(.*)$|", $file, $match)) {
        $curr_foldername = $match[0];
        // If you get here, you know that such a folder exists and its full name is in the above variable
    }
}
closedir($handle);
Alternatively, use glob() and keep only the directories:
function find_wildcard_dirs($prefix)
{
    return array_filter(glob($prefix), 'is_dir');
}
e.g.
print_r(find_wildcard_dirs('/tmp/santos_*'));
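If only a yes/no answer is needed, glob() can also do the directory filtering itself through its GLOB_ONLYDIR flag. A minimal sketch along those lines; the helper name dir_with_prefix_exists is invented for this example, and it assumes neither the parent path nor the prefix contains glob metacharacters such as * or [.

```php
<?php
// Hypothetical helper: true if $parent contains at least one directory
// whose name starts with $prefix (e.g. "folder_").
function dir_with_prefix_exists(string $parent, string $prefix): bool
{
    $pattern = rtrim($parent, '/\\') . DIRECTORY_SEPARATOR . $prefix . '*';
    // GLOB_ONLYDIR restricts the result to directories
    $dirs = glob($pattern, GLOB_ONLYDIR);
    return $dirs !== false && count($dirs) > 0;
}
```

A plain file that happens to be called folder_something does not count as a match here, because only directories are returned.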

Make folders accessible to select users only

My website has a form where users can upload documents. They are stored in the www.mysite.com/uploads folder. Right now anyone who types that path in the browser can view those files. I want to make it so only people with access can view them. How would I do that? Thanks.
You can use .htaccess to require a login and password from every user.
More information at this link: http://www.elated.com/articles/password-protecting-your-pages-with-htaccess/
Step 1: do not upload your files to a folder inside docroot. That is, if your document root is /var/www/html, make the upload location something like /var/www/uploads.
Step 2: Create a PHP file accessfile.php that authenticates the admin and takes the file name as a $_GET parameter, e.g. http://site.com/accessfile.php?file=myfile.pdf
Inside accessfile.php, you may want to write a small program as below:
$file = basename($_GET['file']); // strip any directory components
header('Content-Disposition: attachment; filename="' . $file . '"');
readfile("/var/www/uploads/{$file}"); // send the file contents to the client
Step 3: If admin needs to browse, create a quick browse option:
function list_directory($dirpath) {
    if (!is_dir($dirpath) || !is_readable($dirpath)) {
        error_log(__FUNCTION__ . ": Argument should be a path to valid, readable directory (" . var_export($dirpath, true) . " provided)");
        return null;
    }
    $paths = array();
    $dir = realpath($dirpath);
    $dh = opendir($dir);
    while (false !== ($f = readdir($dh))) {
        if (strpos("$f", '.') !== 0) { // Ignore ones starting with '.'
            $paths["$f"] = "$dir/$f";
        }
    }
    closedir($dh);
    return $paths;
}
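For Step 2, the file name arriving in $_GET has to be confined to the upload directory before anything is read from disk. One possible sketch, assuming an upload folder outside the docroot as in Step 1; resolve_upload_path is a hypothetical helper name, not an existing function.

```php
<?php
// Hypothetical helper: maps a client-supplied file name to a real path
// inside $baseDir. Returns null if the file does not exist or would
// resolve to somewhere outside the upload directory.
function resolve_upload_path(string $baseDir, string $requested): ?string
{
    $base = realpath($baseDir);
    if ($base === false) {
        return null;
    }
    // basename() drops directory components such as "../"
    $candidate = realpath($base . DIRECTORY_SEPARATOR . basename($requested));
    if ($candidate === false || !is_file($candidate)) {
        return null;
    }
    // Defence in depth: the resolved path must still live under $base
    // (this also rejects symlinks that point outside the folder)
    if (strpos($candidate, $base . DIRECTORY_SEPARATOR) !== 0) {
        return null;
    }
    return $candidate;
}
```

In accessfile.php this could be called after the admin check, answering with a 404 when it returns null and passing the returned path to readfile() otherwise.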

Selecting file to be edited

I have an application that is used to edit .txt files. The application is made up of 3 parts:
1. Displaying the contents of a folder with the files to be edited (each file is a link; when clicked, it opens in edit mode).
2. Writing to a file.
3. Saving to the file.
I have completed parts 2 and 3 using the fopen and fwrite functions; that wasn't too hard. The part I need help with is part 1. Currently I open the file by entering its location and file name, like so, in the PHP file where I have the display function and save function:
$relPath = 'file_to_edit.txt';
$fileHandle = fopen($relPath, 'r') or die("Failed to open file $relPath ! ");
But what I want is for the file to open in edit mode when clicked, instead of typing in the file's name every time.
$directory = 'folder_name';
if ($handle = opendir($directory . '/')) {
    echo 'Looking inside \'' . $directory . '\'<br><br>';
    while (false !== ($file = readdir($handle))) {
        if ($file != '.' && $file != '..') {
            echo '<a href="' . $directory . '/' . $file . '">' . $file . '</a><br>';
        }
    }
    closedir($handle);
}
This is the code that I use to display the list of files in a specified folder.
Can anyone give me some pointers on how I can achieve this? Any help will be greatly appreciated.
To get the content of a file, use file_get_contents().
To write the content of a file, use file_put_contents(); it overwrites the file by default, so pass the FILE_APPEND flag only if you want to append.
To get the list of files in a directory you can use DirectoryIterator.
Example:
foreach (new DirectoryIterator('PATH/') as $fileInfo) {
    if ($fileInfo->isDot()) continue;
    echo $fileInfo->getFilename() . "<br>\n";
}
If you don't want to expose file names in the links, you can read the files once, store them in a database, assign IDs to them, and use links with an id parameter. Another solution is to store the files in a session array and assign keys to them. When you want to get a file, you just need to provide the key instead of the whole file name and path.
Example with $_SESSION
$file_arr = array();
foreach (new DirectoryIterator('PATH/') as $fileInfo) {
    if ($fileInfo->isDot()) continue;
    $file_arr[] = array('path' => $fileInfo->getPathname(), 'name' => $fileInfo->getFilename());
}
$_SESSION['files'] = $file_arr;
Then in the view you can use
foreach ($_SESSION['files'] as $k => $file)
{
    echo "<a href='edit.php?f=" . $k . "'>" . $file['name'] . "</a>";
}
and in edit.php
$file = (int)$_GET['f'];
if (array_key_exists($file, $_SESSION['files']))
{
    $fileInfo = $_SESSION['files'][$file];
    // $fileInfo now holds $fileInfo['path'] and $fileInfo['name']
}

PHP: How can I grab a single file from a directory without scanning entire directory?

I have a directory with 1.3 Million files that I need to move into a database. I just need to grab a single filename from the directory WITHOUT scanning the whole directory. It does not matter which file I grab as I will delete it when I am done with it and then move on to the next. Is this possible? All the examples I can find seem to scan the whole directory listing into an array. I only need to grab one at a time for processing... not 1.3 Million every time.
This should do it:
<?php
$h = opendir('./'); // Open the current directory
while (false !== ($entry = readdir($h))) {
    if ($entry != '.' && $entry != '..') { // Skips over . and ..
        echo $entry; // Do whatever you need to do with the file
        break; // Exit the loop so no more files are read
    }
}
?>
readdir
Returns the name of the next entry in the directory. The entries are returned in the order in which they are stored by the filesystem.
Just obtain the directory's iterator and look for the first entry that is a file:
foreach (new DirectoryIterator('.') as $file)
{
    if ($file->isFile()) {
        echo $file, "\n";
        break;
    }
}
This also keeps your code working if the file system behaves differently from what you expect.
See DirectoryIterator and SplFileInfo.
readdir will do the trick. Check the example on that page, but instead of doing the readdir call in a loop, just do it once. You'll get the first entry in the directory.
Note: you might get ".", "..", and other similar responses depending on the server, so you might want to at least loop until you get a valid file.
Do you want to return the first directory or the first file? For either, use this:
Create a function "pickfirst" with 2 arguments (the address, and a mode flag: directory or file).
function pickfirst($address, $file) { // $file=false >> pick first dir, $file=true >> pick first file
    $address = rtrim($address, '/\\') . '/'; // make sure the path ends with a slash
    $found = false; // returned when nothing matches
    $h = opendir($address);
    while (false !== ($entry = readdir($h))) {
        if ($entry != '.' && $entry != '..' && ( ($file == false && !is_file($address . $entry)) || ($file == true && is_file($address . $entry)) ) ) {
            $found = $entry;
            break;
        }
    } // end while
    closedir($h);
    return $found;
} // end function
If you want to pick the first directory in your address, set $file to false; if you want to pick the first file, set $file to true.
Good luck :)

Should we sanitize $_FILES['filename']['name']?

After the user uploads an image to the server, should we sanitize $_FILES['filename']['name']?
I do check file size/file type etc. But I don't check other things. Is there a potential security hole?
Thank you
Absolutely! As @Bob has already mentioned, it's too easy for common file names to be overwritten.
There are also some issues that you might want to cover, for instance not all the allowed chars in Windows are allowed in *nix, and vice versa. A filename may also contain a relative path and could potentially overwrite other non-uploaded files.
Here is the Upload() method I wrote for the phunction PHP framework:
function Upload($source, $destination, $chmod = null)
{
    $result = array();
    $destination = self::Path($destination);
    if ((is_dir($destination) === true) && (array_key_exists($source, $_FILES) === true))
    {
        if (count($_FILES[$source], COUNT_RECURSIVE) == 5)
        {
            foreach ($_FILES[$source] as $key => $value)
            {
                $_FILES[$source][$key] = array($value);
            }
        }
        foreach (array_map('basename', $_FILES[$source]['name']) as $key => $value)
        {
            $result[$value] = false;
            if ($_FILES[$source]['error'][$key] == UPLOAD_ERR_OK)
            {
                $file = ph()->Text->Slug($value, '_', '.');
                if (file_exists($destination . $file) === true)
                {
                    $file = substr_replace($file, '_' . md5_file($_FILES[$source]['tmp_name'][$key]), strrpos($value, '.'), 0);
                }
                if (move_uploaded_file($_FILES[$source]['tmp_name'][$key], $destination . $file) === true)
                {
                    if (self::Chmod($destination . $file, $chmod) === true)
                    {
                        $result[$value] = $destination . $file;
                    }
                }
            }
        }
    }
    return $result;
}
The important parts are:
array_map('basename', ...), this makes sure that the file doesn't contain any relative paths.
ph()->Text->Slug(), this makes sure only .0-9a-zA-Z are allowed in the filename, all the other chars are replaced by underscores (_)
md5_file(), this is added to the filename iff another file with the same name already exists
I prefer to use the user supplied name since search engines can use that to deliver better results, but if that is not important to you a simple microtime(true) or md5_file() could simplify things a bit.
Hope this helps! =)
The filename is an arbitrary user supplied string. As a general rule, never trust arbitrary user supplied values.
You should never use the user supplied filename as the name to save the file under on the server, always create your own filename. The only thing you may want to do with it is to save it as metadata for informational purposes. When outputting that metadata, take the usual precautions like sanitation and escaping.
You also need to check for duplicate names. It's too easy for multiple people to upload an image called 'mycat.jpg', which, if uploaded to the same folder, would overwrite a previously uploaded file with the same name. You can avoid this by putting a unique id in the file name (as Prix suggests). Also verify that the file doesn't just end with an image extension but is an actual image; you don't want your server acting as a blind host for random files.
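The two points above (generate your own file name, and check that the upload really is an image) could be sketched roughly as follows. Both helper names, storage_name_for and looks_like_image, are hypothetical, and the extension whitelist is whatever the application decides to allow; this is an illustration of the idea, not a complete upload handler.

```php
<?php
// Hypothetical helper: builds a server-side file name from random bytes,
// keeping only a whitelisted extension from the client-supplied name.
// Returns null when the extension is not in $allowedExt.
function storage_name_for(string $clientName, array $allowedExt): ?string
{
    $ext = strtolower(pathinfo(basename($clientName), PATHINFO_EXTENSION));
    if (!in_array($ext, $allowedExt, true)) {
        return null;
    }
    // 16 random bytes give a 32-character hex name, so collisions
    // between uploads are not a practical concern
    return bin2hex(random_bytes(16)) . '.' . $ext;
}

// Hypothetical helper: true only if the file's contents parse as an image,
// regardless of what its name claims.
function looks_like_image(string $path): bool
{
    return is_file($path) && getimagesize($path) !== false;
}
```

The original name from $_FILES can still be stored next to the generated one as metadata, escaped whenever it is displayed.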
