I'm new in php and mysql. Now i facing a problem is i need search data in a large database, but it take more than 3 minute to search a word, sometime the browser show timeout. I using technique FULLTEXT to do a searching, so any solution to decrease the searching time?
create index for the table field which you will prefer subsequently, even it take some memory space query result should return best results within less time.
This doesn't answer your question directly but is a suggestion:
I had the same problem with full text search so I switched to SOLR:
http://lucene.apache.org/solr/
It's a search server based on the Lucene library written in Java. It's used by some of the largest scale websites:
http://wiki.apache.org/solr/PublicServers
So speed and scalability isn't an issue. You don't need to know Java to implement it however. It offers a REST interface that you can query and even gives the option to return the search results in PHP array format.
Here's the official tutorial:
https://builds.apache.org/job/Solr-trunk/javadoc/doc-files/tutorial.html
SOLR searches through indexed files so you need to get your database contents into xml or json files. You can use the Data Import Handler extension for that:
http://wiki.apache.org/solr/DataImportHandler
To query the REST interface you can simply use get_file_contents() php function or CURL. Or the PHP sdk for SOLR:
http://wiki.apache.org/solr/SolPHP
Depends on how big your database is. Adding an index for the field you are searching is the first thing to do.
I have been into the same problem and adding an index for the field worked great.
Related
I am having a big database roughly it has 5 lakh (500K) entries now all those entries also have some document associated with them (i.e. every id has at least pdf file). Now I need a robust method to search for a particular text in those pdf files and if I find it, it should return the respective 'id'
kindly share some fast and optimized ways to search text in a pdf using PHP. Any idea will be appreciated.
note: Changing the pdf to text and then searching is not what I am looking for obviously, it will take a longer time.
In one line I need the best way to search for text in pdf using PHP
If this is a one-time task, there is probably no 'fast' solution.
If this is a recurring task,
Extract the text via some tool. (Sorry, I don't know of a tool.)
Store that text in a database table.
Apply a FULLTEXT index to that table.
Now the searching will be fast.
I myself wrote a website in ReactJS to search for info in PDF files (indexed books), which I indexed using Apache SOLR search engine.
What I did in React is, in essence:
queryValue = "(" + queryValueTerms.join(" OR ") + ")"
let query = "http://localhost:8983/solr/richText/select?q="
let queryElements = []
if(searchValue){
queryElements.push("text:" + queryValue)
}
...
fetch(query)
.then(res => res.json())
.then((result) =>{
setSearchResults(prepareResults(result.response.docs, result.highlighting))
setTotal(result.response.numFound)
setHasContent(result.response.numFound > 0)
})
Which results in a HTTP call:
http://localhost:8983/solr/richText/select?q=text:(chocolate%20OR%20cake)
Since this is ReactJS and just parts of code, it is of little value to you in terms of PHP, but I just wanted to demonstrate what the approach was. I guess you'd be using Curl or whatever.
Indexing itself I did in a separate service, using SolrJ, i.e. I wrote a rather small Java program that utilizes SOLR's own SolrJ library to add PDF files to SOLR index.
If you opt for indexing using Java and SolrJ (was the easiest option for me, and I didn't do Java in years previously), here are some useful resources and examples, which I collected following extensive search for my own purposes:
https://solr.apache.org/guide/8_5/using-solrj.html#using-solrj
I basically copied what's here:
https://lucidworks.com/post/indexing-with-solrj/
and tweaked it for my needs.
Tip: Since I was very rusty with Java, instead of setting classpaths etc, quick solution for me was to just copy ALL libraries from SOLR's solrj folder, to my Java project. And possibly some other libraries. May be ugly, but did the job for me.
I'm trying to develop a PHP script that lets users upload shapefiles to import to a postGIS database.
First of all, for the conversion part, AFAIK we can use shp2pgsql to convert the shapefile to a postgresql table; I was wondering if there is another way of doing the conversion, as I would prefer not to use the exec() command.
I would also appretiate any idea on storing the data in a way that does not require dozens of uniquenamed tables.
There seems to be no other way than using the postgresql's binary to convert the shapefile. Although it is not really a bad choice, I would rather not use exec() if there is a PHP native function, or apache module to do it!
However, it sounds like exec is the only sane option available. So I'm going to use it.
No hard feelings! :)
About the last part, it's a different question and should be asked separately. Although, I'm afraid there is no other way of doing it.
UPDATE example added
$queries = shell_exec("shp2pgsql -s ".SRID." -c $shpfilpath $tblname")
or respond(false, "Error parsing the shapfile.");
pg_query($queries) or respond(false, "Query failed!");
SRID is a constant containing the "SRID"!
$shpfilpath is a path to the desired ShapeFile
$tblname is the desired name for the table
See this blog post about loading shapefiles using the PHP shapefile reader plugin from here. http://www.phpclasses.org/package/1741-PHP-Read-vectorial-data-from-geographic-shape-files.html. The blog post focuses on using PHP on the backend to load data for a Flash app, but you should be able to ignore the flash part and use the PHP portion for your needs.
Once you have the data loaded from the shapefile, you could convert the geometry to a WKT string and use ST_GeomFromText or other PostGIS functions to store in the database.
Regarding the unique columns for a shapefile, I've found that to be the most straightforward way to store ad-hoc shapefile attributes and then retrieve that data. However, you could use a "tuple" system, and convert the attributes to strings, then store them in arbitrarily named columns (col1, col2, col3, etc.) if you don't care about attribute names or types.
If you cared about names and types, you could go one step further and store them as a shapefile "schema" in another table.
Write your shp2pgsql and define its parameters using text editor ie
sublime notepad etc.
Copy, paste and change shapefile name for each
layer.
Save as a batch file .bat.
Pull up command window.
Pull up directory with where yu .bat file is saved.
Hit enter and itll run the code for all your shapefiles and they will be uploaded to the
database you defined in writing your code.
Use qgis, go to postgis window and hit connect.
You are good to go your shapefiles are now ready to go and can be added as layers to your map. Make sure the spatial reference matches what they were prior to running it. Does that make sense? I hope that helped its the quickest way.
Adding this answer just for the benefit of anyone who is looking for the same as the OP and does not want to rely on eval() nor external tools.
As of August 2019, you could use PHP Shapefile, a free and open source PHP library I have been developing and maintaining for a few years that can read and write any ESRI Shapefile and convert it natively from/to WKT and GeoJSON, without any third party dependency.
Using my library, which provides WKT to use with PostGIS ST_GeomFromText() function and an array containing all the data to perform a simple INSERT, makes this task trivial, fast and secure, without the need of evil eval().
I have files I need to convert into a database. These files (I have over 100k) are from an old system (generated from a COBOL script). I am now part of the team that migrate data from this system to the new system.
Now, because we have a lot of files to parse (each files is from 50mb to 100mb) I want to make sure I use the right methods in order to convert them to sql statement.
Most of the files have these following format:
#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
the address2 is optional and can be empty
or
#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
these are the 2 most common lines (I'll say around 50%), other than these, all the line looks the same but with different information.
Now, my question is what should I do to open them to be as efficient as possible and parse them correctly?
Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can output into SQL commands which you can also insert via a command line.
If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Output is an array which you can parse into your insert. One of the first user-provided examples takes CSV and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine.
If the id# in the first column is unique in your input data, then you should definitely insert this into a primary key in mysql, to save you from duplicating data if you have to restart your batch.
When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, sql), we had a big gain in performance using the function preg_match_all(less than 10% of the time required using explode / trims / formatting).
Huge files (>100Mb) are parsed in 2 or 3 minutes in a core 2 duo (the drawback is that memory consumption is very high since it creates a giant array with all the information ready to be synthesized).
Regular expressions allow you to identify the content of line if you have variations within the same file.
But if your files are simple, try ghoti suggestion (fgetscv), will work fine.
If you're already familiar with PHP then using it is a perfectly fine tool.
If records do not span multiple lines, the best way to do this to guarantee that you won't run out of memory will be to process one line at a time.
I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.
If you can use the CSV features and you use the SPL, make sure to set your options correctly for the tab characters.
You can use trim to remove the # from the first and last fields easily enough after the call to fgetcsv
Just sit and parse.
It's one-time operation and looking for the most efficient way makes no sense.
Just more or less sane way would be enough.
As a matter of fact, most likely you'll waste more overall time looking for the super-extra-best solution. Say, your code will run for a hour. You will spend another hour to find a solution that runs 30% faster. You'll spend 1,7 hours vs. 1.
I have a html site. In that site around 100 html files are available. i want to develop the search engine . If the user typing any word and enter search then i want to display the related contents with the keyword. Is't possible to do without using any server side scripting? And it's possible to implement by using jquery or javascript?? Please let me know if you have any ideas!!!
Advance thanks.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
3 means:
Series of Ajax indexing requests: very slow, not recommended
Use a DB to store key terms/page refernces and perform a fulltext search
Utilise off the shelf functionality, such as that offered by google
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
$.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
"index" : "Some content for the index",
"second_page" : "Some content for the second page",
etc...
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents, and then get them to do the searching via some JS function. And that's ridiculously inefficient, such that there is no excuse for getting them to do so when it's easier, lighter-weight and more efficient to simply maintain a search database on the server (likely through some nicely-packaged third party library) and use that.
some useful resources
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site is allowing search engine indexing, then fcalderan's approach is definitely the simplest approach.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be rudimentarily successful, but it is possible. You could use something like the keywording in Toby Segaran's book to build a JSON text file. Then, use jQuery to load up the text file and find the instances of the keywords, unique the resultant filenames and display the results.
Here's the setup, I have a Lucene Index and it works well with the 2,000 documents I have indexed. I have been using Luke (Lucene Index Toolbox, v.0.9.2) to debug queries, and am using ZF 1.9.
The layout for my Lucene Index is as follows:
I = Indexed
T = Tokenized
S = Stored
Fields:
author - ITS
category - ITS
publication - ITS
publicationdate - IS
summary - ITS
title - ITS
Basically I have a form that is searchable by the above fields, letting you mix and match any of the above information, and will parse it into a zend luceue query. That is not the problem, the problem is when I start combining terms, the "optimize" method that fires within the find causes the query to just disappear.
Here is an example search I am running right now:
Form Version:
Title: test title
Publication: publication name
Lucene Query Parse:
+(title:test title) +(publication:publication name)
Now if I take this query string, and slap it into LUKE, and hit "Search", it returns the results just fine. When I use the Query Find method, it bombs out. So I did a little research into how it functions and found a problem (I believe)
First off, heres the actual lines of code that does the searching:
$searchQuery = "+(title:test title) +(publication:publication name)";
$hits = new ArrayObject($this->index->find($searchQuery));
It's a simplified version of the actual code, but thats what it generates.
Now heres what I've noticed after some debugging, the "optimize" method just destroys the query itself. I created the following code:
$rewrite = $searchQuery->rewrite($this->index);
$optimize = $searchQuery->rewrite($this->index)->optimize($this->index);
echo "======<br/>";
echo "Original: ".$searchQuery."<br/>";
echo "Rewrite: ".$rewrite."<br/>";
echo "Optimized + Rewrite: ".$optimize."<br/>";
echo "======<br/>";
Which outputs the following text:
======
Original: +(title:test title) +(publication:publication name)
Rewrite: +(title:test title) +(publication:publication name)
Optimized + Rewrite:
======
Notice how the 3rd output is completely empty. It appears that the Rewrite & Optimize on the query is causing the query string to just empty itself.
Does anyone have any idea why the optimize method seems to just be removing my query all together? Am I missing a filter or some sort of interface that might need to be parsed? All of the queries work perfectly when I paste them into LUKE and run them against the index by hand, but something silly is going on with the way Zend is parsing the query to do the search.
Any help is appreciated.
I will be quite frank, Zend_Search_Lucene (ZSL) is buggy and not maintained since a long time now.
It is also conceptually wrong. Let me explain why:
Search engines are there to reply fast to search queries, the problem with ZSL is that it is implemented in pure PHP. It means that at every query, all indexes files are read and reloaded again, continuously. It can't be fast.
There is nothing wrong with Lucene itself, there is even a very good alternative named Solr which is based on Lucene: it is a search server implemented in Java which can index and reply to all your Lucene queries. Because of the server nature of Solr, you don't suffer of poor performance by reloading all the Lucene files again and again.
This is somewhat different that what you asked, I waited two years for my ZSL bugs to be solved, it's now the case using Solr :)