Searching text in PDF using PHP

I have a big database with roughly 5 lakh (500K) entries, and every entry has at least one PDF document associated with it. I need a robust method to search for a particular piece of text in those PDF files and, if it is found, return the respective 'id'.
Kindly share some fast and optimized ways to search for text in a PDF using PHP. Any idea will be appreciated.
Note: converting the PDF to text and then searching is not what I am looking for; obviously that would take longer.
In one line: I need the best way to search for text in a PDF using PHP.

If this is a one-time task, there is probably no 'fast' solution.
If this is a recurring task:
Extract the text via some tool. (Sorry, I don't know of a tool.)
Store that text in a database table.
Apply a FULLTEXT index to that table.
Now the searching will be fast. A rough sketch of that setup is below.
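For example, here is a hedged sketch of that pipeline. It assumes the pdftotext command-line tool (from poppler-utils) for the extraction step, and the table, column, and connection details are made up; adapt them to your schema.
<?php
// One-time setup (run once in MySQL):
//   CREATE TABLE pdf_documents (
//       id INT PRIMARY KEY,
//       content MEDIUMTEXT,
//       FULLTEXT INDEX ft_content (content)
//   ) ENGINE=InnoDB;

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');

// Indexing: extract each PDF's text once and store it alongside its id.
function indexPdf(PDO $pdo, int $id, string $pdfPath): void
{
    // pdftotext writes the extracted text to stdout when the output file is "-"
    $text = shell_exec('pdftotext ' . escapeshellarg($pdfPath) . ' -');
    $stmt = $pdo->prepare('REPLACE INTO pdf_documents (id, content) VALUES (?, ?)');
    $stmt->execute([$id, $text ?? '']);
}

// Searching: uses the FULLTEXT index, so it stays fast even with 500K rows.
function searchPdfs(PDO $pdo, string $term): array
{
    $stmt = $pdo->prepare(
        'SELECT id FROM pdf_documents WHERE MATCH(content) AGAINST (? IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute([$term]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
With the index in place, searchPdfs() returns the matching ids directly, which is what the question asks for.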

I myself wrote a website in ReactJS to search for info in PDF files (indexed books), which I indexed using the Apache Solr search engine.
What I did in React is, in essence:
// join the individual search terms with OR, e.g. (chocolate OR cake)
queryValue = "(" + queryValueTerms.join(" OR ") + ")"
let query = "http://localhost:8983/solr/richText/select?q="
let queryElements = []
if (searchValue) {
    queryElements.push("text:" + queryValue)
}
...
// run the query and push the response into component state
fetch(query)
    .then(res => res.json())
    .then((result) => {
        setSearchResults(prepareResults(result.response.docs, result.highlighting))
        setTotal(result.response.numFound)
        setHasContent(result.response.numFound > 0)
    })
This results in an HTTP call:
http://localhost:8983/solr/richText/select?q=text:(chocolate%20OR%20cake)
Since this is ReactJS and just parts of the code, it is of little value to you in terms of PHP, but I just wanted to demonstrate what the approach was. I guess you'd be using cURL or whatever.
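For illustration, a minimal PHP sketch of the same call using cURL; it assumes the same local Solr core (richText) and field (text) as above.
<?php
// Build the same kind of query against the local Solr core
$terms = ['chocolate', 'cake'];
$q = 'text:(' . implode(' OR ', $terms) . ')';
$url = 'http://localhost:8983/solr/richText/select?q=' . urlencode($q) . '&wt=json';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

$result = json_decode($response, true);
echo 'Found ' . $result['response']['numFound'] . " documents\n";
foreach ($result['response']['docs'] as $doc) {
    // each $doc is one indexed PDF with whatever fields your Solr schema defines
    print_r($doc);
}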
Indexing itself I did in a separate service, using SolrJ, i.e. I wrote a rather small Java program that utilizes SOLR's own SolrJ library to add PDF files to SOLR index.
If you opt for indexing using Java and SolrJ (was the easiest option for me, and I didn't do Java in years previously), here are some useful resources and examples, which I collected following extensive search for my own purposes:
https://solr.apache.org/guide/8_5/using-solrj.html#using-solrj
I basically copied what's here:
https://lucidworks.com/post/indexing-with-solrj/
and tweaked it for my needs.
Tip: Since I was very rusty with Java, instead of setting classpaths etc., the quick solution for me was to just copy ALL libraries from Solr's solrj folder into my Java project. And possibly some other libraries. It may be ugly, but it did the job for me.

Related

Datatables large data json with "smart searching", or server side with regex, or better approach?

I'm building this Bible app with Datatables.
My problem is, the data is big. Total is approximately 500 MB.
It's only about 32,000 lines, but they are paragraphs of text with heavy html/css markup.
Searching needs to be "smart searching" (partial word match).
It looks like there are 2 possible options to store the data:
I can have the data live in a MySQL table. I know how to implement server-side processing, but I don't know how to implement regex searching. It's been done successfully in a couple of ways here: https://datatables.net/forums/discussion/3343/server-side-processing-and-regex-search-filter/p1 (I don't know enough PHP to understand how).
I can have the data live in one or perhaps even multiple JSON files, then have the user download it all once into local storage and perform Datatables' smart searching normally. I'm not sure the searching will be good enough, though. I tried this offline, loading only 50 MB, and the searching was already quite slow. (Again, my programming knowledge is very limited.)
Please have a look and feel free to guide me in the right direction :)
http://torah.byethost14.com/AdminLTE-master/pages/tables/_talmidimEdition.html
"Heavy html/css markup"? Will you be searching for html tags? Probably not. So...
Have another column that is the pure text -- no markup, no html, no css, not even verse numbers. (Book and verse number should probably be in separate columns.)
Then add a FULLTEXT index to that pure text column. Be aware of the limitations of fulltext.
REGEXP is slow, and will always scan all rows in the table.
There are also a few fulltext search engines that can be added onto MySQL. (I do not have advice on whether they are applicable for your app.)
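To make that concrete, here is a hedged sketch assuming a table named verses with the marked-up column (html_text) and the new pure-text column (plain_text); all of these names are made up. The trailing wildcard in BOOLEAN MODE gives a form of partial word match, but only on word prefixes.
<?php
$pdo = new PDO('mysql:host=localhost;dbname=bible;charset=utf8mb4', 'user', 'pass');

// One-time schema change:
//   ALTER TABLE verses ADD COLUMN plain_text MEDIUMTEXT,
//                      ADD FULLTEXT INDEX ft_plain (plain_text);

// Populate the pure-text column by stripping the markup once.
$rows = $pdo->query('SELECT id, html_text FROM verses')->fetchAll(PDO::FETCH_ASSOC);
$update = $pdo->prepare('UPDATE verses SET plain_text = ? WHERE id = ?');
foreach ($rows as $row) {
    $update->execute([strip_tags($row['html_text']), $row['id']]);
}

// Search: 'shep*' matches shepherd, shepherds, ... (prefix matches only)
$stmt = $pdo->prepare(
    'SELECT id FROM verses WHERE MATCH(plain_text) AGAINST (? IN BOOLEAN MODE)'
);
$stmt->execute(['shep*']);
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);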

Storing a shapefile into postgresql using PHP

I'm trying to develop a PHP script that lets users upload shapefiles to import to a postGIS database.
First of all, for the conversion part, AFAIK we can use shp2pgsql to convert the shapefile to a postgresql table; I was wondering if there is another way of doing the conversion, as I would prefer not to use the exec() command.
I would also appreciate any ideas on storing the data in a way that does not require dozens of uniquely named tables.
There seems to be no other way than using PostgreSQL's own binary to convert the shapefile. Although it is not really a bad choice, I would rather not use exec() if there were a native PHP function or an Apache module to do it!
However, it sounds like exec is the only sane option available. So I'm going to use it.
No hard feelings! :)
About the last part, it's a different question and should be asked separately. Although, I'm afraid there is no other way of doing it.
UPDATE: example added
// run shp2pgsql to generate the SQL statements for this shapefile
$queries = shell_exec("shp2pgsql -s ".SRID." -c $shpfilpath $tblname")
    or respond(false, "Error parsing the shapefile.");
// execute the generated SQL against the PostGIS database
pg_query($queries) or respond(false, "Query failed!");
SRID is a constant containing the SRID
$shpfilpath is the path to the desired shapefile
$tblname is the desired name for the table
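If $shpfilpath or $tblname ever come from user input (e.g. an upload form), it would be safer to escape them before building the shell command; a small variation on the example above:
// Same call, with the shell arguments escaped to avoid command injection
$cmd = sprintf(
    'shp2pgsql -s %d -c %s %s',
    SRID,
    escapeshellarg($shpfilpath),
    escapeshellarg($tblname)
);
$queries = shell_exec($cmd) or respond(false, "Error parsing the shapefile.");
pg_query($queries) or respond(false, "Query failed!");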
See this blog post about loading shapefiles using the PHP shapefile reader package from here: http://www.phpclasses.org/package/1741-PHP-Read-vectorial-data-from-geographic-shape-files.html. The blog post focuses on using PHP on the backend to load data for a Flash app, but you should be able to ignore the Flash part and use the PHP portion for your needs.
Once you have the data loaded from the shapefile, you could convert the geometry to a WKT string and use ST_GeomFromText or other PostGIS functions to store it in the database.
Regarding the unique columns for a shapefile, I've found that to be the most straightforward way to store ad-hoc shapefile attributes and then retrieve that data. However, you could use a "tuple" system, and convert the attributes to strings, then store them in arbitrarily named columns (col1, col2, col3, etc.) if you don't care about attribute names or types.
If you cared about names and types, you could go one step further and store them as a shapefile "schema" in another table.
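For the geometry step mentioned above, here is a minimal sketch of the insert. It assumes you already have the WKT string and an attribute value in PHP variables (however your shapefile reader exposes them) and a pre-created table named features; the SRID 4326 is just an example.
<?php
$conn = pg_connect('host=localhost dbname=gis user=postgres password=secret');

// In practice these would come from your shapefile reader for each record
$name = 'Some feature';
$wkt  = 'POINT(30.5 50.4)';

// Parameterised insert: ST_GeomFromText turns the WKT into PostGIS geometry
$sql = 'INSERT INTO features (name, geom) VALUES ($1, ST_GeomFromText($2, 4326))';
pg_query_params($conn, $sql, [$name, $wkt]);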
Write your shp2pgsql command and define its parameters using a text editor, i.e. Sublime, Notepad, etc.
Copy, paste, and change the shapefile name for each layer.
Save it as a batch file (.bat).
Pull up a command window.
Navigate to the directory where your .bat file is saved.
Hit enter and it will run the commands for all your shapefiles, and they will be uploaded to the database you defined when writing your code.
Use QGIS, go to the PostGIS window and hit connect.
You are good to go; your shapefiles are now ready and can be added as layers to your map. Make sure the spatial reference matches what it was prior to running it. Does that make sense? I hope that helped; it's the quickest way.
Adding this answer just for the benefit of anyone who is looking for the same as the OP and does not want to rely on exec() or external tools.
As of August 2019, you can use PHP Shapefile, a free and open source PHP library I have been developing and maintaining for a few years, which can read and write any ESRI shapefile and convert it natively from/to WKT and GeoJSON, without any third-party dependency.
Using my library, which provides WKT to use with the PostGIS ST_GeomFromText() function and an array containing all the data needed to perform a simple INSERT, makes this task trivial, fast, and secure, without the need for the evil exec().

Slow searching in php

I'm new to PHP and MySQL. The problem I'm facing is that I need to search data in a large database, but it takes more than 3 minutes to search for a word, and sometimes the browser shows a timeout. I am using the FULLTEXT technique to do the searching, so is there any solution to decrease the search time?
Create an index on the table field you search on most often. Even though it takes some memory/disk space, the query should then return results in much less time.
This doesn't answer your question directly but is a suggestion:
I had the same problem with full text search so I switched to SOLR:
http://lucene.apache.org/solr/
It's a search server based on the Lucene library written in Java. It's used by some of the largest scale websites:
http://wiki.apache.org/solr/PublicServers
So speed and scalability aren't an issue. You don't need to know Java to implement it, however. It offers a REST interface that you can query, and it can even return the search results in a PHP array format.
Here's the official tutorial:
https://builds.apache.org/job/Solr-trunk/javadoc/doc-files/tutorial.html
SOLR searches through indexed files so you need to get your database contents into xml or json files. You can use the Data Import Handler extension for that:
http://wiki.apache.org/solr/DataImportHandler
To query the REST interface you can simply use the file_get_contents() PHP function or cURL. Or the PHP SDK for Solr:
http://wiki.apache.org/solr/SolPHP
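For example, a bare-bones query using file_get_contents; the core name, field, and search term here are placeholders.
<?php
// Ask Solr for JSON output and decode it into a PHP array
$url = 'http://localhost:8983/solr/mycore/select?q=' . urlencode('body:searchword') . '&wt=json';
$json = file_get_contents($url);
$result = json_decode($json, true);

foreach ($result['response']['docs'] as $doc) {
    // each $doc is one matching document from the index
    print_r($doc);
}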
Depends on how big your database is. Adding an index for the field you are searching is the first thing to do.
I have been into the same problem and adding an index for the field worked great.

How to do search with out using the scripting language and by using jquery?

I have an HTML site. In that site around 100 HTML files are available. I want to develop a search engine. If the user types any word and hits search, I want to display the related content containing that keyword. Is it possible to do this without using any server-side scripting? And is it possible to implement it using jQuery or JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
Three approaches:
A series of Ajax indexing requests: very slow, not recommended.
Use a DB to store key terms/page references and perform a fulltext search.
Utilise off-the-shelf functionality, such as that offered by Google.
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
    $.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
    "index" : "Some content for the index",
    "second_page" : "Some content for the second page"
    // etc...
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents, and then get them to do the searching via some JS function. And that's ridiculously inefficient, such that there is no excuse for getting them to do so when it's easier, lighter-weight and more efficient to simply maintain a search database on the server (likely through some nicely-packaged third party library) and use that.
Some useful resources:
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site allows search engine indexing, then fcalderan's approach is definitely the simplest.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be only rudimentarily successful, but it is possible. You could use something like the keywording in Toby Segaran's book to build a JSON text file. Then use jQuery to load the text file, find the instances of the keywords, de-duplicate the resultant filenames, and display the results.

How do I design a web interface for browsing text man pages?

I would like to design a web app that allows me to sort, browse, and display various attributes (e.g. title, tag, description) for a collection of man pages.
Specifically, these are R documentation files within an R package that houses a collection of data sets, maintained by several people in an SVN repository. The format of these files is .Rd, which is LaTeX-like, but different.
R has functions for converting these man pages to html or pdf, but I'd like to be able to have a web interface that allows users to click on a particular keyword, and bring up a list (and brief excerpts) for those man pages that have that keyword within the \keyword{} tag.
Also, the generated html is somewhat ugly and I'd like to be able to provide my own CSS.
One obvious option is to load all the metadata I desire into a database like MySQL and design my site to run queries and fetch the appropriate data.
I'd like to avoid that to minimize upkeep for future maintainers. The number of files is small (<500) and the amount of data is small (only a couple of hundred lines per file).
My current leaning is to have a script that pulls the desired metadata from each file into a summary JSON file, then load this summary.json file in PHP, decode it, and loop through the array looking for the items whose attributes match the current query (e.g. all docs with keyword1 AND keyword2).
I was starting in that direction with the following...
$contents = file_get_contents("summary.json");
$c = json_decode($contents, true);
foreach ($c as $ind => $val) {
    // ... check whether this entry's attributes match the current query, etc.
}
Another idea was to write a script that would convert these .Rd files to XML. In that case, are there any lightweight frameworks that make it easy to sort and search a small collection of XML files?
I'm not sure if XQuery is overkill or if I have time to dig into it...
I think I'm suffering from too-many-options-syndrome with all the AJAX temptations. Any help is greatly appreciated.
I'm looking for a super simple solution. How might some of you out there approach this?
My approach would be to parse the keywords (from your description I assume they have a special notation to distinguish them from normal words/text) out of the files and store this data as a search index somewhere. It does not have to be MySQL; SQLite would surely be enough for your project.
A search would then be very simple.
Parsing the files could be automated as a post-commit hook on your Subversion repository.
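A rough sketch of that parsing step, assuming the keywords sit in \keyword{...} tags as described in the question and that an SQLite file serves as the index; the file, table, and column names are made up.
<?php
$db = new PDO('sqlite:rd_index.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS rd_keywords (file TEXT, keyword TEXT)');
$insert = $db->prepare('INSERT INTO rd_keywords (file, keyword) VALUES (?, ?)');

foreach (glob('man/*.Rd') as $file) {
    $contents = file_get_contents($file);
    // pull every \keyword{...} tag out of the .Rd source
    if (preg_match_all('/\\\\keyword\{([^}]*)\}/', $contents, $matches)) {
        foreach ($matches[1] as $keyword) {
            $insert->execute([basename($file), trim($keyword)]);
        }
    }
}

// Later: all files tagged with both keyword1 AND keyword2
$stmt = $db->prepare(
    'SELECT file FROM rd_keywords WHERE keyword = ?
     INTERSECT
     SELECT file FROM rd_keywords WHERE keyword = ?'
);
$stmt->execute(['keyword1', 'keyword2']);
$files = $stmt->fetchAll(PDO::FETCH_COLUMN);
Re-running the script after each commit is exactly what the post-commit hook would do, so the index stays in sync with the repository.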
Why don't you create a table SUMMARIES with a column for each of the summary's fields?
Then you could index it with a full-text index, assigning a different weight to each field.
You don't need MySQL; you can use SQLite, which has Google's full-text indexing (FTS3) built in.
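A minimal sketch of that approach through PDO's SQLite driver; whether the FTS3 module is available depends on how your SQLite was built, and the table and field names here are made up.
<?php
$db = new PDO('sqlite:summaries.sqlite');

// One row per man page; an FTS virtual table indexes every column for MATCH queries
$db->exec('CREATE VIRTUAL TABLE IF NOT EXISTS summaries USING fts3(title, keywords, description)');

$insert = $db->prepare('INSERT INTO summaries (title, keywords, description) VALUES (?, ?, ?)');
$insert->execute(['mydata', 'survey regression', 'A small example data set.']);

// Terms separated by spaces are ANDed together by default
$stmt = $db->prepare('SELECT title FROM summaries WHERE summaries MATCH ?');
$stmt->execute(['survey regression']);
$titles = $stmt->fetchAll(PDO::FETCH_COLUMN);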
