Search for text in PDFs referenced in a database - PHP

I'm trying to build a search page that finds every PDF referenced in my database that contains the keyword I'm searching for.
The problem is that I can't store both my PDFs and their raw text in my database (I am extremely short on space...).
What I do right now: when a user searches for something, a PHP script goes through all my PDFs one by one, converts each PDF into raw text, then searches that text for the keywords. But it takes a really long time to get a result, and I fear that when there are many users my server won't cope. (That is only an assumption; I've never run a server before and I don't really know what is good or bad.)
Would it be worth adding space on my server so I can also store the raw text of all my PDFs in the database and search it with a MySQL query? Or is there a smarter way to do it that I didn't think of?
(I don't have the PDFs inside the database, just their paths, so I can't save space there.)
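One way to picture the "store the raw text once" idea from the question is a small on-disk cache: each PDF is converted only the first time it is searched (or when it changes), and later searches read the stored text. This is only a sketch under assumptions: `cachedPdfText`, `searchPdfs`, the cache directory and the `$extract` callable (a stand-in for the existing PDF-to-text script) are all hypothetical names, and the same text could just as well go into a MySQL column instead of files.

```php
<?php
declare(strict_types=1);

/**
 * Return the plain text of a PDF, converting it only when no fresh cached copy exists.
 * $extract is a hypothetical callable (path => text) standing in for the existing converter.
 */
function cachedPdfText(string $pdfPath, string $cacheDir, callable $extract): string
{
    $cacheFile = $cacheDir . '/' . sha1($pdfPath) . '.txt';

    // Reuse the cached text if it is at least as new as the PDF itself.
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($pdfPath)) {
        return (string) file_get_contents($cacheFile);
    }

    $text = $extract($pdfPath);
    file_put_contents($cacheFile, $text);
    return $text;
}

/** Return the PDF paths whose text contains $keyword (case-insensitive). */
function searchPdfs(array $pdfPaths, string $keyword, string $cacheDir, callable $extract): array
{
    $hits = [];
    foreach ($pdfPaths as $path) {
        if (stripos(cachedPdfText($path, $cacheDir, $extract), $keyword) !== false) {
            $hits[] = $path;
        }
    }
    return $hits;
}
```

The slow conversion then happens once per PDF rather than once per search; whether the extra storage is affordable depends on how much text the PDFs actually contain.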

Related

Heatmap in PHP, using mysql database

I've currently got a database with just short of 2000 client locations in Australia. What I am trying to do is to display this data on a heatmap, to be embedded into an existing website.
I've done a heap of looking around, and can't seem to find exactly what I'm after.
http://www.heatmapapi.com/sample_googlev3.aspx
http://www.heatmaptool.com/documentation.php
These are along the right lines of what I want to achieve, however I cannot see them working with data from a MySQL database (they require the data to be hard-coded, or uploaded through CSV files).
Has anyone come across this sort of thing before, or managed to achieve it?
Both of the examples you provide would potentially work.
With the first you would need to use the data you have to dynamically generate the JavaScript, or at least the values that go into the JavaScript.
The second is probably the better option. You would provide a path to a script that dynamically generates the CSV file.
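A script that generates the CSV on request could look roughly like the sketch below. The column names `lat` and `lng`, the table name `locations` and the function name `writeLocationsCsv` are assumptions for illustration; the column layout the heatmap tool actually expects has to be taken from its documentation.

```php
<?php
declare(strict_types=1);

/**
 * Write one CSV line per location to an open stream.
 * Each row is assumed to carry 'lat' and 'lng' keys (hypothetical column names).
 */
function writeLocationsCsv($stream, iterable $rows): void
{
    foreach ($rows as $row) {
        // Delimiter, enclosure and escape character are passed explicitly.
        fputcsv($stream, [$row['lat'], $row['lng']], ',', '"', '\\');
    }
}

// In the real script the rows would come from MySQL, for example:
//   $pdo  = new PDO('mysql:host=localhost;dbname=clients', $user, $pass);
//   $rows = $pdo->query('SELECT lat, lng FROM locations', PDO::FETCH_ASSOC);
//   header('Content-Type: text/csv');
//   writeLocationsCsv(fopen('php://output', 'w'), $rows);
```

The heatmap tool would then be pointed at the URL of this script instead of a static CSV file.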

Determine if PDF file has searchable text in PHP

We have hundreds of PDF files on a server. Some of them contain searchable text and others do not.
I was asked to find out which are searchable and which are not.
Does anybody know of a way to read in a bunch of PDFs and determine whether each document contains text that is searchable/selectable, or whether it only contains non-selectable, non-searchable text that needs to be OCR'd?
I don't even need to actually read in the text; I just need to be able to detect possibly by tags or keywords, something that suggests that there are fonts or something like that in the raw data.
Are there tags in a searchable PDF that make it easy to detect?
Thanks
You could modify this code (pdf2text) to suit your purposes, I believe. Or this answer might get you to the right spot as well.
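As a rough sketch of the "extract and see whether anything comes out" approach: the example below does not use the pdf2text code mentioned above but swaps in the `pdftotext` command-line tool from Poppler, which is assumed to be installed on the server. The function names and the 20-character threshold are hypothetical choices, and a scanned PDF with a stray text layer could still fool the check.

```php
<?php
declare(strict_types=1);

/** Treat a PDF as searchable when its extracted text has at least $minChars non-whitespace characters. */
function looksSearchable(string $extractedText, int $minChars = 20): bool
{
    // pdftotext separates pages with form feeds, which \s also matches.
    $visible = preg_replace('/\s+/', '', $extractedText);
    return strlen((string) $visible) >= $minChars;
}

/** Extract text with the pdftotext CLI (assumption: Poppler is installed and on the PATH). */
function pdfToText(string $pdfPath): string
{
    // The trailing "-" sends the text to standard output instead of a file.
    $output = shell_exec('pdftotext ' . escapeshellarg($pdfPath) . ' - 2>/dev/null');
    return is_string($output) ? $output : '';
}
```

Looping over the files and calling `looksSearchable(pdfToText($path))` would then split them into the two groups.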

Displaying text content on a PHP image, using text from another site

I have a dynamic image coded in PHP, which is supposed to serve as a display for statistics, using an external site. Each statistic will be displayed using the following general code:
imagettftext($template,10,0,5,35,$black,$font,$statVar1);
...where $statVar1 is the name of the first statistic that will be shown (I have a total of 99 to display).
The only way I know how to do this is to use PHP with JSON, to save the data to a file, and then load them, whenever the image is displayed, but I don't need that sort of data saving in this case, so all I am looking for is just to display whatever the content is on the source page at the same time that the image was loaded.
My JSON solution involved getElementsByTagName, and I would select out the HTML tags and select the right one with item(), but is there a pure PHP solution I could use instead? For instance, if I wanted to visit the DC Vault site, and set my $statVar1 to the Overall Position value of the first team (ie, display '1' on my stat image), what would I do?
If I understand correctly, you want to extract data from another page/site in your PHP code.
The simplest way is to call file_get_contents() with a URL as the parameter instead of a regular file path. Learn about PHP URL wrappers.
https://www.php.net/manual/en/wrappers.php
You can then parse the returned data to extract what you need.
If the respective site does not provide an export in a structured format like XML or JSON, you will need to parse the HTML document.
Regular expressions would be a way to go. See the PHP function preg_match.
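A minimal sketch of that approach follows. The markup in `$sample` and the pattern are made up for illustration; the real page has to be inspected to write a pattern that matches it, and `extractStat` and `fetchStat` are hypothetical helper names.

```php
<?php
declare(strict_types=1);

/**
 * Return the first capture group of $pattern found in $html, or null if nothing matches.
 * The pattern must contain at least one capture group.
 */
function extractStat(string $html, string $pattern): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? trim($m[1]) : null;
}

/**
 * Download a page through the http(s) URL wrapper and extract one value from it.
 * Requires allow_url_fopen to be enabled.
 */
function fetchStat(string $url, string $pattern): ?string
{
    $html = @file_get_contents($url);
    return $html === false ? null : extractStat($html, $pattern);
}

// Hypothetical markup standing in for the downloaded page.
$sample = '<tr><td class="rank"> 1 </td><td>Team A</td></tr>';
$statVar1 = extractStat($sample, '~<td class="rank">\s*(\d+)\s*</td>~');
```

The resulting `$statVar1` can be passed straight to the imagettftext() call from the question; when the value is null, the image can fall back to a placeholder string.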

PHP - Workaround for reading user-selected text from PDF?

I am working on a project that allows a user to upload text or content from an HTML page in Japanese and then use their cursor to select words in the text/content to translate into English. However, I would like to be able to expand this functionality to PDF files. Essentially, I'd like the user to be able to submit a PDF file and have the browser render that PDF file in such a way that when the user selects/highlights words in the PDF, the browser can somehow relay what the text of the highlighted section is, such as via javascript, to be then relayed to a PHP variable.
I know there are a lot of posts on stackoverflow asking similar questions (I've spent hours upon hours trying to sort through them all!), but I can't seem to find a definitive answer on whether this is possible. It seems there are lots of options for converting PDF to HTML or extracting text from PDF, but to be quite honest, I'm confused if any of those options are relevant to what I am trying to accomplish. And I know there's a javascript API for Adobe, but I'm under the impression the javascript needs to be embedded in the PDF already, which will not be true if the user is uploading their own PDF files to render. Even if that is possible, it seems there's no native text selection support in the Adobe javascript API....
Is there a straightforward workaround (oxymoron?) to doing this? Again, I want to be able to pass text selected in a PDF to a variable -- the effect is the user highlights words they don't know so those words can be added to a word bank for retrieval in a dictionary.
Let me know if I can be clearer on anything. Thank you!
I think your best bet is to convert the PDF to HTML (see these answers), and then you are all set, since you have already implemented everything for regular HTML.

Convert HTML & CSS to DOC(X)?

Is there some utility that could be called via command line to produce a doc(x) file? The source file would be HTML and CSS.
I am trying to generate Word documents on the fly with PHP. I am only aware of phpdocx library, which is very low level and not much use for me (I already have one poor implementation of Word document generation).
What I need from a document:
TOC
Images
Footers/Headers (they could be manually made on each HTML page)
Table
Lists
Page break (able to decide what goes to which page, eg one HTML file per page, join multiple HTML files to produce the entire document.)
Paragraphs
Basic bold/etc styles
I didn't find PHPDOCX very useful either. An alternative could be PHPWord; I think it covers what you need. According to the website it can do these things:
Insert and format document sections
Insert and format Text elements
Insert Text breaks
Insert Page breaks
Insert and format Images and binary OLE-Objects
Insert and format watermarks (new)
Insert Header / Footer
Insert and format Tables
Insert native Titles and Table-of-contents
Insert and format List elements
Insert and format hyperlinks
Very simple template system (new)
In your case that isn't enough, but there is a plugin available to convert (basic) HTML to DOCX, and it works very well in my opinion. http://htmltodocx.codeplex.com/
I have been using this for a year or two now and am happy with it, although I have to add that the HTML can't be too complex.
The way I usually do these is to have a word document template file with the parts I want to replace using keywords (usually something like "{FIRSTNAME}").
This allows you to read the file via PHP then simply do str_replace on all the parts you want to replace, then write that to another file.
Dynamic tables using this method are a bit more tricky, as you need a sub template for a row, which you can then include inside the main template as many times as required.
I'm not sure if this is the best solution, it's always seemed very fiddly to me and every time I'm asked to do this I get frustrated with it, but I guess it works. So if anyone knows a better solution I'd love to hear it too!
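The keyword-replacement idea above might be sketched as follows, assuming the template is available to PHP as plain text (for instance RTF or Word XML; a .docx file is a ZIP archive, so its XML would have to be unpacked first). `fillTemplate` and `fillRows` are hypothetical helper names.

```php
<?php
declare(strict_types=1);

/** Replace every {KEY} placeholder in $template with the matching value. */
function fillTemplate(string $template, array $values): string
{
    $search = [];
    foreach (array_keys($values) as $key) {
        $search[] = '{' . $key . '}';
    }
    return str_replace($search, array_values($values), $template);
}

/** Render a row sub-template once per data row and join the results. */
function fillRows(string $rowTemplate, array $rows): string
{
    $out = '';
    foreach ($rows as $row) {
        $out .= fillTemplate($rowTemplate, $row);
    }
    return $out;
}
```

For a dynamic table, the output of `fillRows()` would itself be passed as a value (say, under a `ROWS` key) when filling the main template.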
