Extract only text from ePub - php

I'm trying to do some text analysis on ebooks, so I need to extract the plain text from epub files. Below is example code from php.net which unzips the epub and then echoes all its contents.
My problem is that it also tries to echo images, so I get lots of this: ��̹,{ϥ㓦,�[k�رO?��� being echoed. Ideally it would just give me super-basic plain text. Any ideas on how to skip echoing a $zip_entry which is an image (or anything non-text)?
Thanks!
$zip = zip_open("book.epub");
if ($zip) {
    while ($zip_entry = zip_read($zip)) {
        echo "Name: " . zip_entry_name($zip_entry) . "\n";
        echo "Actual Filesize: " . zip_entry_filesize($zip_entry) . "\n";
        echo "Compressed Size: " . zip_entry_compressedsize($zip_entry) . "\n";
        echo "Compression Method: " . zip_entry_compressionmethod($zip_entry) . "\n";
        if (zip_entry_open($zip, $zip_entry, "r")) {
            echo "File Contents:\n";
            $buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            echo "$buf\n";
            zip_entry_close($zip_entry);
        }
        echo "\n";
    }
    zip_close($zip);
}

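Is there a content.opf file in the unzipped epub? It is often in the root or in a sub-folder such as OEBPS; META-INF/container.xml gives its exact path. If so, examine its content. You should see something like:
<item id="chapter19" href="zzzzzzz.xhtml" media-type="application/xhtml+xml" />
<item id="image1" href="images/yyyyy.jpg" media-type="image/jpeg" />
The media-type attribute should give you a good idea of how to avoid images: only read the entries whose media-type is application/xhtml+xml.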
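A minimal sketch of that idea, assuming the book's chapters are stored as .xhtml, .html or .htm entries. It filters by file extension rather than parsing content.opf, and the helper names isTextEntry, markupToPlainText and extractEpubText are hypothetical. It uses the ZipArchive class instead of the older zip_* functions.

```php
<?php
// Hypothetical helper: guess from the entry name whether it holds markup text.
function isTextEntry(string $entryName): bool
{
    $ext = strtolower(pathinfo($entryName, PATHINFO_EXTENSION));
    return in_array($ext, ['xhtml', 'html', 'htm'], true);
}

// Hypothetical helper: reduce (X)HTML markup to plain text.
function markupToPlainText(string $markup): string
{
    // Drop <head>, <style> and <script> blocks; their contents are not body text.
    $markup = preg_replace('#<(head|style|script)\b[^>]*>.*?</\1>#is', '', $markup);
    // Put a space before each tag so adjacent paragraphs do not run together.
    $markup = str_replace('<', ' <', $markup);
    $text = html_entity_decode(strip_tags($markup), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Hypothetical helper: concatenate the text of all markup entries in an epub.
function extractEpubText(string $epubPath): string
{
    $zip = new ZipArchive();
    if ($zip->open($epubPath) !== true) {
        throw new RuntimeException("Cannot open $epubPath");
    }
    $chunks = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name !== false && isTextEntry($name)) {
            $chunks[] = markupToPlainText($zip->getFromIndex($i));
        }
    }
    $zip->close();
    return implode("\n\n", $chunks);
}
```

Calling extractEpubText("book.epub") would return the text in archive order, which is not necessarily reading order; the spine element in content.opf defines the reading order if that matters for the analysis.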

Related

PHP: sort ZIP in alphabetical order

The following PHP file creates a ZIP File and works as it should.
<?php
$zip = new ZipArchive();
$ZIP_name = "./path/Prefix_" . $date . ".zip";
// Open the archive once; open() returns TRUE on success or an error code
if ($zip->open($ZIP_name, ZipArchive::CREATE) !== TRUE) {
    exit("There is a ZIP Error");
}
echo "ZIP File can be created" . "<br>";
foreach ($list as $element) {
    $path_and_filename = "../path_to_somewhere/product_"
        . $element
        . ".csv";
    $zip->addFile($path_and_filename, basename($path_and_filename));
}
echo "numfiles: " . $zip->numFiles . "\n"; // number of element files
echo "status:" . $zip->status . "\n"; // Status "0" = okay
$zip->close();
?>
There is only a small blemish:
The above foreach-loop retrieves elements from an array where all elements are sorted in alphabetical order. After the ZIP-File creation, the files within the ZIP are in different order, maybe caused by different file size.
Is there a way to sort the csv files within the ZIP with PHP later on? I'm new to ZIP creation with PHP and I have not found anything helpful in the documentation.
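You can't do that: ZipArchive has no function to reorder the entries of an existing archive. Better to just sort the file list in your program when you read the archive, not in the file system (-;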
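Sorting on the reading side could be sketched like this. The function names sortEntryNames and zipEntryNamesSorted are hypothetical, and natural, case-insensitive ordering is an assumption about what "alphabetical" should mean for names such as product_2.csv and product_10.csv.

```php
<?php
// Hypothetical helper: natural, case-insensitive sort of entry names.
function sortEntryNames(array $names): array
{
    sort($names, SORT_NATURAL | SORT_FLAG_CASE);
    return $names;
}

// Hypothetical helper: read all entry names from a ZIP file and return them sorted.
function zipEntryNamesSorted(string $zipPath): array
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open $zipPath");
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return sortEntryNames($names);
}
```

The archive itself stays untouched; only the list your program works with is ordered.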

looping in file_get_content php and output the content of each file with filename

I'm using this code to output the content of the files in a directory, but I have two problems.
First, this directory contains sub-directories, and this code doesn't output the content of the files in them.
Second, I want the output to look like
Filename:"name of the file"
Content :"content of the file"
so that I can parse it.
$dir = new DirectoryIterator('./Chemistry');
foreach ($dir as $file)
{
    if (!$file->isDot() && $file->isFile() && strpos($file->getFilename(), '.md') !== false)
    {
        $content = file_get_contents($file->getPathname());
        echo $content;
    }
}
If I understand correctly, you want to output:
echo 'Filename: ' . $file->getFilename() . ' Content: ' . $content;
If this is not your intention, I will need a more thorough explanation.
* edit *
For directory iteration use http://php.net/manual/en/function.scandir.php
Off the top of my head it goes a little something like this (hit it)...
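function loopDir($path) {
    $allFilesAndFoldersInCurrentFolder = scandir($path);
    foreach ($allFilesAndFoldersInCurrentFolder as $item) {
        // scandir() also returns "." and ".."; skip them to avoid endless recursion
        if ($item === '.' || $item === '..') {
            continue;
        }
        // is_dir() needs the full path, not just the entry name
        $fullPath = $path . '/' . $item;
        if (is_dir($fullPath)) {
            loopDir($fullPath);
        }
        else {
            echo 'Filename: ' . $item . ' Content: ' . file_get_contents($fullPath);
        }
    }
}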
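As an alternative to hand-written recursion, PHP's SPL iterators can walk sub-directories for you. A minimal sketch, assuming only files with the .md extension are wanted; the function name readMarkdownFiles is hypothetical.

```php
<?php
// Hypothetical helper: return [path => content] for every .md file below $root.
function readMarkdownFiles(string $root): array
{
    $result = [];
    // SKIP_DOTS leaves out "." and ".."; the outer iterator descends into sub-directories.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'md') {
            $result[$file->getPathname()] = file_get_contents($file->getPathname());
        }
    }
    ksort($result); // stable, path-ordered output
    return $result;
}
```

Looping over readMarkdownFiles('./Chemistry') and echoing 'Filename: ' . $path and 'Content: ' . $content for each pair would give the output format asked for in the question.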

3 dots in html source code when listing folders in PHP

This is the code I wrote, but I keep seeing these 3 dots before the folder names showing up:
<?php
$dir=new DirectoryIterator("wallpapers");
while ($dir->valid())
{
$file=$dir->current();
echo $file->getFilename();
echo "<br>";
$dir->next();
}
?>
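The dots are the special "." and ".." entries (current and parent directory) that DirectoryIterator returns along with the real folders. A glob() pattern of * does not return them, so you should be good to go with this: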
$dir = "/etc/php5/*";
foreach(glob($dir) as $file)
{
echo "filename: $file : filetype: " . filetype($file) . "<br />";
}
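If you would rather keep DirectoryIterator, a minimal sketch that skips the dot entries with isDot() could look like this. The function name listEntries is hypothetical.

```php
<?php
// Hypothetical helper: names of all entries in $dir except "." and "..".
function listEntries(string $dir): array
{
    $names = [];
    foreach (new DirectoryIterator($dir) as $entry) {
        if ($entry->isDot()) {
            continue; // skip "." and ".."
        }
        $names[] = $entry->getFilename();
    }
    sort($names);
    return $names;
}
```

Echoing each element of listEntries("wallpapers") followed by "<br>" would reproduce the original listing without the dots.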

File flushed when attempting to update contents

I have written a very simple page counter and a logging script that increments a counter stored in a file and logs information about the client's operating system and which browser they use. It's a simple spare time project I've been working on, and as such it is extremely rudimentary, writing the counter and the logged information in a designated folder for each page on the site, in a new file for each day.
The thing is, I recently used blitz.io to test my site, and when I ran a "Rush" of 250 requests per second, the counters and the logs were completely flushed, except for the very last query.
I'm not perfectly sure what happened, but I suspect something along the lines of PHP not properly finishing up the previous query before taking on the next one.
I use file_get_contents()/file_put_contents() for both of them, instead of file(). Would changing to file() solve the problem?
Here's the counter:
$filename = '.' . $_SERVER['PHP_SELF'];
$counterpath = '/Websites/inc/logs/counters/total/' . getCurrentFileName() . '-counter.txt';
$globalcounter = '/Websites/inc/logs/counters/total/global-counter.txt';
if (file_exists($counterpath)) {
$hit_count = file_get_contents($counterpath);
$hit_count++;
file_put_contents($counterpath,$hit_count);
}
else {
$hit_count = "1";
file_put_contents($counterpath, $hit_count);
}
And here's the logger:
$logdatefolder = '/Websites/inc/logs/ip/' . date('Y-m-d',$_SERVER['REQUEST_TIME_FLOAT']);
$logfile = $logdatefolder . "/" . getCurrentFileName() . '-iplog.html';
$ua = getbrowser();
if (false == (file_exists($logdatefolder))) {
mkdir($logdatefolder);
}
function checkRef() {
if (!isset($_SERVER['HTTP_REFERER'])) {
//If not isset -> set with dummy value
$_SERVER['HTTP_REFERER'] = 'N/A';
}
return $_SERVER['HTTP_REFERER'];
}
/* Main logger */
$logheader = "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en-US\"><head><title>" . getCurrentFileName() . " log</title><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /></head><body>";
$logentry = date("Y-m-d, H:i:s, O T") . ":" .
"<br />- Requesting: http://giphtbase.org" . $_SERVER['REQUEST_URI'] .
"<br />- Arriving from: " . checkRef() .
"<br />- Browser: " . $ua['browser'] .
"<br />- Full browser name: " . $ua['name'] .
"<br />- Operating system: " . $ua['platform'] .
"<br />- Full user agent: " . $ua['userAgent'] .
"<br />";
$logfooter = "<!-- Bottom --></body></html>";
if (file_exists($logfile)) {
$logPage = file_get_contents($logfile);
$logContents = str_replace("<!-- Bottom --></body></html>","",$logPage);
file_put_contents($logfile, $logContents . $logentry . $logfooter);
}
elseif (false == (file_exists($logfile))) {
file_put_contents($logfile, $logheader . $logentry . $logfooter);
}
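Your logger reads the whole file and writes it back on every request. Under concurrent requests, one request can read the file while another is in the middle of rewriting it, get empty contents, and then overwrite everything with just its own entry. You should use the FILE_APPEND flag in your file_put_contents(), otherwise you may only ever see the last entry. Append only the new entry, not the previously read contents (the closing footer then has to be dropped from the file or added when the log is displayed):
file_put_contents($logfile, $logentry, FILE_APPEND | LOCK_EX);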
As for the counter, it looks like the file is being read and written by many concurrent requests at once, so increments get lost or the file is read while it is empty. You should either use a database, lock the file with flock(), or create temporary files and run a cron job to do the math.
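The file-locking option could be sketched as follows, assuming the counter file contains nothing but the number. The function name incrementCounter is hypothetical; the lock is held for the whole read-modify-write cycle so that concurrent requests wait their turn.

```php
<?php
// Hypothetical helper: atomically increment the number stored in $path.
function incrementCounter(string $path): int
{
    // 'c+' opens for reading and writing, creates the file if missing, and does not truncate.
    $handle = fopen($path, 'c+');
    if ($handle === false) {
        throw new RuntimeException("Cannot open counter file: $path");
    }
    try {
        // Block until this request holds the exclusive lock.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Cannot lock counter file: $path");
        }
        // An empty (new) file casts to 0, so the first hit yields 1.
        $count = (int) stream_get_contents($handle) + 1;
        rewind($handle);
        ftruncate($handle, 0);
        fwrite($handle, (string) $count);
        fflush($handle); // make sure the data is written before releasing the lock
        flock($handle, LOCK_UN);
        return $count;
    } finally {
        fclose($handle);
    }
}
```

In the question's code this would replace the file_exists()/file_get_contents()/file_put_contents() sequence with a single call such as $hit_count = incrementCounter($counterpath);.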

How to get each file's extension in a given folder in PHP on Windows XP?

E.g:
folder name:
myFolder
files in myFolder
myFolder.01.mkv
myFolder.02.mkv
myFolder.03.avi
myFolder.04.mts
...
// each file's extension may be different.
So, how can I extract the extension of each file?
Thank you very much!!
[update]
My own solution; I want to know whether it is fast enough:
foreach (glob("d:\\myFolder\\*.*") as $filename) {
//echo "$filename size " . filesize($filename) . "\n";
$path_parts = pathinfo($filename);
echo $path_parts['dirname'], "\n";
echo $path_parts['basename'], "\n";
echo $path_parts['extension'], "\n";
echo $path_parts['filename'], "\n"; // since PHP 5.2.0
}
<?php
foreach (new DirectoryIterator('../moodle') as $fileInfo) {
if($fileInfo->isDot()) continue;
echo $fileInfo->getFilename() . "<br>\n";
}
?>
Then use http://www.php.net/manual/en/function.pathinfo.php on each filename to get its extension.
This is a quick and easy solution on Windows.
exec("dir d:\\directory_name /b", $output); // on Linux, use ls instead of dir
foreach ($output as $file_name) {
    // pathinfo() takes the part after the last dot, so myFolder.01.mkv gives "mkv"
    echo "File Name : " . pathinfo($file_name, PATHINFO_FILENAME) . "\n";
    echo "File Extension : " . pathinfo($file_name, PATHINFO_EXTENSION) . "\n\n";
}
Enjoy..!!
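Since the example file names contain more than one dot, here is a small sketch of how pathinfo() handles them. The function name extensionsOf is hypothetical, and lower-casing the result is an assumption made for easier comparison.

```php
<?php
// Hypothetical helper: map each file name to its lower-cased extension.
function extensionsOf(array $fileNames): array
{
    $result = [];
    foreach ($fileNames as $name) {
        // PATHINFO_EXTENSION returns the part after the last dot, or "" if there is none.
        $result[$name] = strtolower(pathinfo($name, PATHINFO_EXTENSION));
    }
    return $result;
}
```

The input list could come from glob() or a DirectoryIterator loop like the ones shown in the answers.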
