Search inside 70 GB of PDF files - PHP

I have 70 GB of PDF files, and I want to search inside them with PHP and some Ajax.
The code must search all the PDF files and output the matches as a table.
For example: 1547AD
When I hit Enter, the code should search all the PDF files and list every file that contains "1547AD".
My problem is: of course putting this data into MySQL would be better for the server and more robust, but imagine extracting all the tables from 70 GB of PDF files! These PDF files are also updated daily, and there is a lot of traffic on this page.
My question is: is it right to build this in PHP, or should I use another language and/or another method for this kind of heavy data?
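
One common way to handle this kind of workload, whatever the language, is to extract the text from each PDF once (and again only when a file changes) and then search an index rather than the files themselves. The following is a minimal sketch of that idea in Python, assuming the text has already been extracted by some PDF tool. The file names, sample text, and function names are illustrative only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of file names that contain it.

    `documents` is a dict of {file_name: extracted_text}; the PDF-to-text
    step is assumed to have happened elsewhere.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, term):
    """Return the sorted file names whose text contains `term` as a token."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample data standing in for text extracted from PDFs.
documents = {
    "invoice_a.pdf": "Order 1547AD shipped on Monday",
    "invoice_b.pdf": "Order 2290XK pending",
    "report_c.pdf": "Refers to 1547AD and 2290XK",
}
index = build_index(documents)
print(search(index, "1547AD"))
```

With daily updates, only the files that changed would need to be re-extracted and re-indexed, so each search request never touches the 70 GB directly.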

Related

How to search files by the text inside them in Laravel/S3?

I'm using Laravel 7 with AWS S3 as file storage.
The files are PDF only, and I want to add a file search feature, i.e. if a user searches for some text, I want to list all PDF files that contain the matching text.
Is this possible using Laravel alone, or using AWS S3?
I know I can extract all the text of a file on upload and store it in the database, and when a user searches for some text, I can look it up with a LIKE '%...%' query, but this will take a huge amount of database space.
I'm looking for something better.
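
For reference, the extract-on-upload approach described above can be sketched as follows. This is only an illustration, written in Python with an in-memory SQLite database standing in for the real one. The table name `pdf_text`, its columns, and the sample rows are assumptions, and the text extraction itself is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pdf_text (s3_key TEXT PRIMARY KEY, body TEXT NOT NULL)"
)


def store_extracted_text(conn, s3_key, body):
    """Save (or refresh) the text extracted from one uploaded PDF."""
    conn.execute(
        "INSERT OR REPLACE INTO pdf_text (s3_key, body) VALUES (?, ?)",
        (s3_key, body),
    )


def find_files(conn, needle):
    """Return the keys of all files whose text contains `needle`."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = (
        needle.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT s3_key FROM pdf_text WHERE body LIKE ? ESCAPE '\\' "
        "ORDER BY s3_key",
        (f"%{escaped}%",),
    )
    return [row[0] for row in rows]


# Hypothetical sample rows standing in for real uploads.
store_extracted_text(conn, "contracts/2020.pdf", "Annual maintenance contract")
store_extracted_text(conn, "manuals/pump.pdf", "Pump maintenance manual")
store_extracted_text(conn, "notes/misc.pdf", "Nothing relevant here")
print(find_files(conn, "maintenance"))
```

A leading-wildcard LIKE has to scan every row, so besides the storage cost it tends to slow down as the table grows, which is part of why a dedicated full-text index is usually suggested instead.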

Storing PDF files on a MySQL server Intelligently

I've been tasked with creating a search system that will help users navigate through multiple PDF files of 1000+ pages each. However, these files will first have to be put in a MySQL DB. The issue I'm currently having is how to store these PDF files in the DB and assign the relevant PDF headers to the DB.
Example:
Adding each Part/Header/Section/Subsection individually on the DB in different tables.
Would this all have to be entered manually? Bear in mind we are talking about 100,000+ pages of PDF.
Thanks
You would be better off storing some metadata in the database, along with the location of the PDF file.
For example, a table called 'documents' may have the following fields:
id, path, keywords, category
The path would be: /some/location/to/my/pdf/file.pdf
The keywords could be: 'pdf1, what is a pdf, some search text'
This lets you keep the PDF files themselves on disk while still being able to search for them.
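
A minimal sketch of such a 'documents' table, using Python with an in-memory SQLite database as a stand-in for MySQL. The column types, the second row, and the category values are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        path TEXT NOT NULL,
        keywords TEXT NOT NULL,
        category TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO documents (path, keywords, category) VALUES (?, ?, ?)",
    [
        # First row uses the path and keywords from the answer above.
        ("/some/location/to/my/pdf/file.pdf",
         "pdf1, what is a pdf, some search text", "guides"),
        # Second row is hypothetical.
        ("/some/location/to/my/pdf/other.pdf",
         "pdf2, section headers", "reference"),
    ],
)


def find_paths(conn, keyword):
    """Return the paths of documents whose keywords contain `keyword`."""
    rows = conn.execute(
        "SELECT path FROM documents WHERE keywords LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


print(find_paths(conn, "search text"))
```

Only the small metadata rows live in the database, so the table stays compact even when the PDFs themselves run to hundreds of thousands of pages.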
Alternatively you could use something like Google - they allow you to use their search technology. It used to be in the form of a 'google yellow box' but I believe it's now part of their cloud stuff!
HTH

Auto-generate multiple PHP files from Template and excel files

I am looking for suggestions on how to efficiently auto-generate multiple PHP (and HTML) files from a template document, pulling the fields from an Excel file.
What I want to do is:
1. Populate the fields in the template file from an Excel file; each Excel row will generate a new file
2. Save it as a PHP file
3. Name each generated file based on a specified field in the corresponding row
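
The three steps above can be sketched roughly as follows. This is a Python illustration that assumes the spreadsheet has been exported to CSV; the column names `slug`, `title`, and `body` and the template markup are hypothetical.

```python
import csv
import io
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template; $title and $body are placeholder field names.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1><p>$body</p></body></html>\n"
)


def generate_pages(csv_file, out_dir, name_field="slug", extension=".php"):
    """Write one file per CSV row and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for row in csv.DictReader(csv_file):
        # Step 1: fill the template fields from the row.
        content = PAGE_TEMPLATE.substitute(row)
        # Steps 2 and 3: save the file under a name taken from the row.
        target = out_dir / (row[name_field] + extension)
        target.write_text(content, encoding="utf-8")
        created.append(target)
    return created


# Stand-in for a spreadsheet exported as CSV.
sample = io.StringIO(
    "slug,title,body\n"
    "about,About us,Who we are\n"
    "contact,Contact,How to reach us\n"
)
out_dir = tempfile.mkdtemp()
pages = generate_pages(sample, out_dir)
print([page.name for page in pages])
```

Because `string.Template` treats `$` as its placeholder marker, any literal PHP variable in the template would have to be written with a doubled `$$`.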
I am intentionally trying to create multiple, separate static HTML and PHP pages. I have been using mail merge with Word and Excel, but it takes too long to resave each Word file as PHP, rename it, etc. I'm not sure how to do this programmatically, and my skill set is limited.
Open to different approaches to handle this, appreciate any help and thoughts.
Thanks!

Approach to generate a PDF file from an HTML page?

I have studied the wkhtmltopdf and TCPDF mechanisms for generating PDF files. With wkhtmltopdf you pass an .html file directly and it gives you the PDF, whereas with TCPDF you need to code the entire PDF.
In my case I have a PDF form template, which I've converted into HTML so the user can fill in the form. After I fill the template with the user-entered values, I'll give the user an option to download the filled-in HTML page as a PDF document, so the template will show the user-entered data next to the labels.
So the flow is:
PDF template >> convert to .HTML page >> process with PHP echoing >> convert it back, with user input, to a PDF file.
I'm confused about which approach I should use here.
Install wkhtmltopdf on the server and pass it the .html page.
Problem: every time, I need to save the .html page on the server and then pass it to wkhtmltopdf.
Using TCPDF, I need to write lots of code to create a PDF exactly the same as the template PDF docs I have,
and then echo the user-entered values with PHP.
Which approach should I use if I expect 1000+ users to be saving the page as a PDF at the same time, and which will be easier and more scalable in the future?
First of all - I think you should go with the HTML form to PDF approach, so that's either wkhtmltopdf or a tool that already does this for you like PDFmyFORM.
In case you're expecting to go to 1000 saves concurrently then you definitely want to roll your own solution instead of going with an external service though.
There are patches in the wkhtmltopdf issue list that suggest caching (see this one) and you may also want to think about whether all these forms have to be generated as PDF again. You could use APC cache to somehow cache PDFs based on the same values being filled in. That could save you a bunch of time.
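
The caching idea can be sketched like this: derive a key from the submitted values and only render when that key has not been seen before. This is a Python illustration, not the patch referred to above; the dictionary is a stand-in for a shared cache such as APC, and `fake_render` is a placeholder for the real HTML-to-PDF step.

```python
import hashlib
import json


def cache_key(form_values):
    """Build a stable key from the submitted values (order-independent)."""
    canonical = json.dumps(form_values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PdfCache:
    """In-memory stand-in for a shared cache."""

    def __init__(self, render):
        self._render = render  # function: dict -> bytes (the expensive step)
        self._store = {}
        self.render_calls = 0

    def get_pdf(self, form_values):
        key = cache_key(form_values)
        if key not in self._store:
            # Cache miss: do the expensive conversion once and remember it.
            self.render_calls += 1
            self._store[key] = self._render(form_values)
        return self._store[key]


def fake_render(form_values):
    """Placeholder for the real HTML-to-PDF conversion."""
    payload = "PDF for " + json.dumps(form_values, sort_keys=True)
    return payload.encode("utf-8")


cache = PdfCache(fake_render)
first = cache.get_pdf({"name": "Ann", "city": "Oslo"})
second = cache.get_pdf({"city": "Oslo", "name": "Ann"})  # same values
print(cache.render_calls)
```

How much this saves depends on how often different users submit identical values; for forms with free-text fields the hit rate may be low.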
Other solutions you may want to look into include PhantomJS, which is also a headless WebKit browser but is scripted in JavaScript, so it may reduce your server load altogether...

Using PHP to read text from a PDF stored in MySQL

I am trying to use PHP to read the text from a PDF file that is stored in a MySQL database. I tried using class.pdf2text.php, which works with an actual file. I tried passing the MYSQL_RESULT variable holding the PDF file contents to that class, but it doesn't work. I've got to be missing something really easy, I just know it.
Basically, this is what I'm trying to do:
I have a database with PDF files. I need to convert a PDF from that database to text and then search on that text for certain data. Is there a way to do this without creating external files in PHP?
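
The general idea of avoiding external files is to wrap the bytes read from the database in an in-memory stream and hand that stream to whatever parser is used. The sketch below shows it in Python with SQLite as a stand-in for MySQL; the table name `pdfs` and the placeholder bytes are assumptions, and the text extraction itself is not shown. In PHP, the `php://memory` stream wrapper plays a similar role, though whether a given PDF class accepts a stream or string instead of a file name depends on that class.

```python
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdfs (id INTEGER PRIMARY KEY, content BLOB NOT NULL)")

# Placeholder bytes for illustration; this is not a valid PDF document.
fake_pdf = b"%PDF-1.4\n% placeholder bytes only\n"
conn.execute("INSERT INTO pdfs (id, content) VALUES (?, ?)", (1, fake_pdf))


def open_pdf_stream(conn, pdf_id):
    """Return the stored PDF as a file-like object held entirely in memory."""
    row = conn.execute(
        "SELECT content FROM pdfs WHERE id = ?", (pdf_id,)
    ).fetchone()
    if row is None:
        raise KeyError(pdf_id)
    return io.BytesIO(row[0])


stream = open_pdf_stream(conn, 1)
print(stream.read(5))
```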
