search big database

I have a database that holds URLs in a table (along with many other details about each URL). Another table stores strings that I'm going to use to search every link. The database will be big: I'm expecting at least 5 million entries in the links table.
The application that communicates with the user is written in PHP. I need suggestions on how to search all the links with all the patterns (n x m searches) without causing a high load on the server or losing speed. I want it to run fast and use few resources. Any hints or suggestions in pseudo-code are welcome.
Right now I don't know whether to perform these searches with SQL commands, with some help from PHP, or to do it entirely in PHP.

First, I'd suggest that you rethink the layout. It seems unnecessary to run this query for every user. Instead, create a result table and fill it with the results of a query that runs once and then again every time the patterns change.
Otherwise, make sure you have indexes (full text) set on the fields you need. For the query itself you could join the tables:
SELECT
yourFieldsHere
FROM
theUrlTable AS tu
JOIN
thePatternTable AS tp ON tu.link LIKE CONCAT('%', tp.pattern, '%');
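
The result-table idea from this answer could be sketched roughly as follows. This is only an illustration under assumptions: Python's built-in sqlite3 module stands in for PHP and MySQL so the snippet runs on its own, the table and column names (url_table, pattern_table, match_result) are made up, and SQLite's || operator replaces MySQL's CONCAT().

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url_table (id INTEGER PRIMARY KEY, link TEXT);
    CREATE TABLE pattern_table (id INTEGER PRIMARY KEY, pattern TEXT);
    -- Hypothetical result table, refreshed only when links or patterns change.
    CREATE TABLE match_result (url_id INTEGER, pattern_id INTEGER);
""")
conn.executemany("INSERT INTO url_table (link) VALUES (?)",
                 [("http://example.com/php/manual",),
                  ("http://example.com/news",),
                  ("http://example.org/php/news",)])
conn.executemany("INSERT INTO pattern_table (pattern) VALUES (?)",
                 [("php",), ("news",)])


def rebuild_matches(db):
    """Run the expensive n x m search once and store the matching pairs."""
    db.execute("DELETE FROM match_result")
    db.execute("""
        INSERT INTO match_result (url_id, pattern_id)
        SELECT tu.id, tp.id
        FROM url_table AS tu
        JOIN pattern_table AS tp
          ON tu.link LIKE '%' || tp.pattern || '%'
    """)
    db.commit()


rebuild_matches(conn)
# Per-user requests now read the small precomputed table instead.
matches = conn.execute(
    "SELECT url_id, pattern_id FROM match_result ORDER BY url_id, pattern_id"
).fetchall()
print(matches)
```

The expensive join runs only inside rebuild_matches(), which would be triggered when the patterns (or links) change rather than on every page view.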

I would say that you pretty definitely want to do that in the SQL code, not the PHP code. Also, searching the URL strings is going to be a long operation, so perhaps some form of hashing would be good. I have seen someone use a variant of a Zobrist hash for this before (Google will bring back a load of results).
Hope this helps,
Dan.

Do as much searching as you practically can within the database. If you're ending up with an n x m result set, and start with at least 5 million hits, that's a LOT of data to be repeatedly slurping across the wire (or socket, however you're connecting to the DB) just to end up throwing away most (a lot?) of it each time. Even if the DB's native search capabilities ('like' matches, regexp, full-text, etc.) aren't up to the task, culling unwanted rows BEFORE they get sent to the client (your code) will still be useful.

You must optimize your tables in the DB. Use an MD5 hash: add a new column holding the MD5 of the text and index it, so exact matches are found faster.
But it doesn't help if you use LIKE '%text%'.
For that, you can use Sphinx or Lucene.
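
A minimal sketch of the hash-column idea, again with Python's sqlite3 as a stand-in for PHP/MySQL; the table, column and index names (links, link_md5, idx_links_md5) are hypothetical. It also asks SQLite for its query plan to show that the exact-match lookup goes through the index while the LIKE '%text%' search does not.

```python
import hashlib
import sqlite3


def md5_hex(text):
    """Return the hex MD5 digest of a string (a lookup key, not a security measure)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, link TEXT, link_md5 TEXT)")
conn.execute("CREATE INDEX idx_links_md5 ON links (link_md5)")

urls = ["http://example.com/a", "http://example.com/b", "http://example.org/c"]
conn.executemany("INSERT INTO links (link, link_md5) VALUES (?, ?)",
                 [(u, md5_hex(u)) for u in urls])


def find_exact(db, url):
    """Exact-match lookup through the indexed hash column."""
    row = db.execute("SELECT id FROM links WHERE link_md5 = ?",
                     (md5_hex(url),)).fetchone()
    return row[0] if row else None


def plan(db, sql, params=()):
    """Return the query-plan text SQLite reports for a statement."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(r[-1] for r in rows)


exact_plan = plan(conn, "SELECT id FROM links WHERE link_md5 = ?", (md5_hex(urls[1]),))
like_plan = plan(conn, "SELECT id FROM links WHERE link LIKE ?", ("%example.org%",))
print(find_exact(conn, urls[1]))
print(exact_plan)
print(like_plan)
```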

Related

Autocomplete concept

I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience in adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google has, without having so many users to contribute to the creation of the data (most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?

For now, use the static data that you have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is submitted by another user (a kind of rank). Build an n-gram index over the queries (so that you can also autocomplete something like "Manchester United" when a person just types "United", i.e. not only from the start of the string) and simply return the top N after sorting by count.
This table will gradually improve as your user base grows.
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge lies in returning the data to be displayed within a fraction of a second. So when your query store grows, you can use a search engine like Solr or Sphinx to do the searching, which will return results quickly enough to render.
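
The query-count table described in this answer might look something like the sketch below. It uses Python's sqlite3 as a stand-in for PHP/MySQL; the names search_queries and hit_count are made up (hit_count plays the role of count), and a plain substring LIKE replaces the n-gram index, which is fine for a small table but not a substitute for Solr or Sphinx at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema from the answer: <queryID, query, count>.
conn.execute("""
    CREATE TABLE search_queries (
        query_id  INTEGER PRIMARY KEY,
        query     TEXT UNIQUE,
        hit_count INTEGER NOT NULL DEFAULT 1
    )
""")


def record_query(db, text):
    """Increment the counter for a query, inserting it the first time it is seen."""
    normalized = text.strip().lower()
    cur = db.execute(
        "UPDATE search_queries SET hit_count = hit_count + 1 WHERE query = ?",
        (normalized,))
    if cur.rowcount == 0:
        db.execute("INSERT INTO search_queries (query) VALUES (?)", (normalized,))


def suggest(db, typed, limit=5):
    """Return the most frequent stored queries that contain the typed text."""
    # Note: % and _ typed by the user are not escaped in this sketch.
    rows = db.execute(
        "SELECT query FROM search_queries WHERE query LIKE ? "
        "ORDER BY hit_count DESC, query LIMIT ?",
        ("%" + typed.strip().lower() + "%", limit))
    return [r[0] for r in rows]


for q in ["Manchester United", "manchester united", "United Airlines",
          "Manchester City", "manchester united"]:
    record_query(conn, q)

print(suggest(conn, "united"))
```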

You can use the Lucene search engine for this functionality, or you may also take a look at Lucene/Solr autocomplete.

Google has thousands of entries (and keeps adding more), arranged by day, time, geolocation, language and so on, and the set grows with what users enter. Whenever a user types a word, the system checks the table of words most used for that location, day and time and, if there is no answer, falls back to general words. So you should categorize every word entered by users, or build a general word-relation table in your database that points to the most suitable search result.

Yesterday I stumbled on something that answered my question. Google exposes its autocomplete suggestions as XML at the URL below, so it is worth using if you have too few users to build your own keyword database:
http://google.com/complete/search?q=[keyword]&output=toolbar
Replacing [keyword] with some word returns suggestions for that word; the task is then just to parse the returned XML and format the output to suit your needs.

Performance for multiple searches ordered by random

I'm in the design phase of a website and I have a solution for a feature, but I don't know whether it will still be a good one when the site, hopefully, grows. I want users to be able to search for other users, and the results must be ordered: first the "spotlighted" users, then all the rest. Within each of those two groups the order must be random, and the results must be paginated.
One of the solutions I have in mind is to store the query results in a session variable on the server side. For performance, this variable is destroyed when the user leaves the search.
What will happen when the site has thousands of users and thousands of searches are performed every day? Will my solution be viable, or will the server be overloaded?
I have other solutions in mind, like an intermediate table into which users are dumped in the mentioned order n times a day. This way there is no need to create a big array in the user's session, and pagination is done via multiple queries against the database.
Although I appreciate any suggestions, I'm especially interested in hearing opinions from developers experienced with high-traffic sites.
(The technology employed is LAMP, with InnoDB tables.)

Premature optimization is bad, but you should plan ahead. You don't need to implement it yet, just be prepared.
If thousands of users run this search every day, then caching the query result in the session is not a good idea, because the same result would be cached for some users while others still have to execute the query. For such a case I'd recommend saving the search result in a user-independent data structure (file, memory, etc.).
For each search query, save the result, creation date and last access date on disk or in memory.
If any user runs the same search, show the result from the cache.
Run a cron job that invalidates the cache after some time.
This way frequent searches will be available promptly most of the time. It also reduces the load on your database.
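
The steps above could be sketched as a small user-independent cache. This is a Python illustration rather than PHP, and everything in it (the SearchCache class, the 60-second lifetime, the injectable clock) is an assumption made for the example, not something prescribed by the answer.

```python
import time


class SearchCache:
    """User-independent cache keyed by the search parameters (in-memory sketch)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be exercised in tests
        self.entries = {}       # key -> (result, created_at, last_access)

    def get(self, key, compute):
        """Return the cached result for key, calling compute() on a miss."""
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            result, created, _ = entry
            self.entries[key] = (result, created, now)
            return result
        result = compute()      # cache miss: this is where the DB would be hit
        self.entries[key] = (result, now, now)
        return result

    def purge_expired(self):
        """What the periodic cron job would do: drop entries older than the TTL."""
        now = self.clock()
        stale = [k for k, (_, created, _) in self.entries.items()
                 if now - created >= self.ttl]
        for k in stale:
            del self.entries[k]
        return len(stale)


fake_now = [1000.0]
cache = SearchCache(ttl_seconds=60, clock=lambda: fake_now[0])
db_calls = []


def run_query():
    db_calls.append(1)          # count how often the "database" is queried
    return ["alice", "bob"]


first = cache.get("city=paris", run_query)
second = cache.get("city=paris", run_query)   # served from the cache
fake_now[0] += 61
purged = cache.purge_expired()
print(len(db_calls), purged)
```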

This is definitely not the answer you are looking for, but I have to say it.
Premature Optimization is the root of all evil.
Get that site up with a simple implementation of that query and come back and ask if that turns out to be your worst bottleneck.

I'm assuming you want to decrease the hitting on the DB by caching search results so other users searching for the same set of factors don't have to hit the DB again--especially on very loose query strings on non-indexed fields. If so, you can't store it in a session--that's only available to the single user.
I'd use a caching layer like Cache_Lite and cache the result set from the db query based on the query string (not the sql query, but the search parameters from your site). That way identical searches will be cached. Handle the sorting and pagination of the array in PHP, not in the DB.
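
Sorting and paginating the cached array in application code, while keeping "spotlighted first, random within each group" stable across pages, could look roughly like this. It is a Python sketch rather than PHP, and the per-search seed is my own assumption (it is not part of the answer): storing one seed with the cached search keeps the random order identical from page to page.

```python
import random


def ordered_results(users, seed):
    """Spotlighted users first, each group shuffled with a per-search seed."""
    rng = random.Random(seed)
    spotlighted = [u for u in users if u["spotlighted"]]
    others = [u for u in users if not u["spotlighted"]]
    rng.shuffle(spotlighted)
    rng.shuffle(others)
    return spotlighted + others


def page(results, page_number, per_page):
    """Return one 1-based page of an already ordered result list."""
    start = (page_number - 1) * per_page
    return results[start:start + per_page]


users = [{"id": i, "spotlighted": i % 4 == 0} for i in range(1, 21)]
seed = 12345  # hypothetical: generated once per search and stored with the cache entry
results = ordered_results(users, seed)
page_1 = page(results, 1, 8)
page_2 = page(results, 2, 8)
print([u["id"] for u in page_1])
```

Because the same seed reproduces the same order, only the seed and the raw result set need to be cached, not one shuffled array per user.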

mysql and php: querying the db vs. reading in the whole thing

I'm struggling with a philosophical question on database programming in PHP. In particular, I'm trying to decide when it's best to read in an entire table into an object, vs. querying MySQL directly whenever I need data.
Is there ever a situation where you'd want to just read in the entire database into an object? Where do you draw the line?
For example, if I had a table full of names and phone numbers, and I need to get the phone number for one individual, that's a simple one-time mysql query. Reading in an entire table into an associative array just to get one phone number sounds ridiculous... But:
(1) what if I need to get the names and phone numbers of 50 individuals? 100? 1000?
(2) When is it more efficient (if ever) to read in the entire table into an object? Is performing 1000 mysql queries on 1000 names always going to be more efficient than reading in the entire table?
(2a) Obviously it would depend on the total number of records in the table. Would it be better to do 1000 queries for 1000 phone numbers, or read in a table of 2000 total records from a MySQL into an associative array? What if it was 5000 total records, and I needed 1000? What if it was 10k? Etc. etc.
(3) What if I need to do something a little more complex, like return all phone numbers in a certain area code? Obviously in that case I could use a regexp SQL query, but I'm sure I could come up with a more complex case where a simple query doesn't give me exactly what I want.
I guess what I'm getting at is, as a developer, you have several knobs you can turn to optimize your application. Obviously you want to think about the data you're using and optimize the database model to match the types of data requests you'll be doing. But sometimes you get into a mutually exclusive case where you're forced to pick optimizing your data model for one scenario, at the expense of another, competing scenario.
Any thoughts?

Databases are designed to be efficient at locating and returning exactly the data that you need to work with for a particular operation.
Transferring data over a network connection is orders of magnitude slower than processing it on the machine where it resides. Use databases for what they're good at... holding lots of information and allowing application code to query and work with exactly the subset of that data it needs to at a given point in time.
If you find that you need to frequently access the same data over and over, caching it at the application layer or in a dedicated caching solution like memcached does make sense, but I cannot imagine a scenario where it makes sense just to read in a whole table because my application logic needs to process a subset of the rows and/or columns in the table.

(3) but I'm sure I could come up with a more complex case where a simple query doesn't give me exactly what I want.
This is usually an indication that your database hasn't been properly normalized and/or has design flaws.
(2) When is it more efficient (if ever) to read in the entire table into an object? Is performing 1000 mysql queries on 1000 names always
Neither is a good choice. SQL is intended for set-based operations. You really need to use the system correctly for it to work well, but to do this you have to have properly designed your database. The best thing would be to write one query that returns exactly the records you want, no more and no less.
what if I need to get the names and phone numbers of 50 individuals
Maybe use something like SELECT * FROM yourTable WHERE id IN (1,2,3,...,50). If you have a larger number of users, you could create a temporary table with the list of users you want and join on that. With a properly designed database there is usually a good way to retrieve a set of data with a single query.
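
As a rough illustration of the single-query approach, here is a sketch that fetches 50 phone numbers from a 2000-row table in one round trip. Python's sqlite3 stands in for PHP/MySQL, and the people table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
conn.executemany("INSERT INTO people (id, name, phone) VALUES (?, ?, ?)",
                 [(i, "person%d" % i, "555-%04d" % i) for i in range(1, 2001)])


def phones_for(db, ids):
    """Fetch all requested rows in a single query using IN (...)."""
    ids = list(ids)
    if not ids:
        return {}
    # One bound placeholder per id keeps the query parameterized.
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        "SELECT name, phone FROM people WHERE id IN (%s)" % placeholders, ids)
    return dict(rows)


wanted = phones_for(conn, range(1, 51))
print(len(wanted), wanted["person7"])
```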

performance issue on displaying records

I have a table with just 3,000 records.
I render these 3000 records in the home page without pagination, my client is not interested in pagination...
So to show page completely it takes around 1 min, 15 sec. What can be done to make the page load more quickly?
My table structure:
customer table:
  customer id
  customer name
  guider id
  and a few other columns
guider table:
  guider id
  guider name
  and a few other columns

Where's the slow down? The query or the serving?
If the former, see the comments above. If the latter:
Enable gzip on the server. Otherwise capture the [HTML?] output to a file, compress it (zip), then serve it as a download. Same for any other format if you think something else can render it better than a browser (CSV and Open Office).
If you're outputting the data into an HTML table then you may have an issue where the browser is waiting for the end of the table before rendering it. You can either break this into multiple table chunks, say every 500 records/rows, or try the CSS rule "table-layout: fixed;".
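
The chunking suggestion could be sketched like this (in Python rather than PHP; the function name and the two-column row shape are assumptions for the example). Each group of 500 rows gets its own table with a fixed layout, so the browser can render the first chunk before the rest arrives.

```python
from html import escape


def render_chunked_tables(rows, chunk_size=500):
    """Render rows as several small HTML tables instead of one huge one."""
    parts = []
    for start in range(0, len(rows), chunk_size):
        parts.append('<table style="table-layout: fixed; width: 100%">')
        for name, guider in rows[start:start + chunk_size]:
            parts.append("<tr><td>%s</td><td>%s</td></tr>"
                         % (escape(name), escape(guider)))
        parts.append("</table>")
    # Join once at the end rather than concatenating strings inside the loop.
    return "\n".join(parts)


rows = [("customer %d" % i, "guider %d" % (i % 10)) for i in range(3000)]
html_output = render_chunked_tables(rows)
print(html_output.count("<table"))
```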

Check the following to-dos:
SQL connection: don't open the connection inside a loop; a single connection should serve all queries.
Check and analyse your queries to see whether complex logic can be replaced with something simpler.
Use a standard class for the SQL connection and queries, such as ezSQL.
Follow SQL query best practices.

While you could implement a cache to do this, you don't necessarily need to, and introducing unnecessary cache structures can often cause problems of its own. Depending on where the bottleneck is, it may not even help you much, or at all.
You need to look in two places for your analysis:
1) The query you're using to get your data. Take a look at its plan, or if you're not comfortable doing that, run it in your favorite query tool and see how long it takes to come back. If it doesn't take too long, you've got a pretty good idea that your bottleneck isn't the query. If the query itself takes a long time, that's where you should focus your efforts.
2) How your page is rendering. What is the size of your page, in bytes? It may be too big. Can you cut the size down by formatting? Can you more effectively use CSS to eliminate duplicate styling on the page? Are you using a fixed or dynamic table layout? Dynamic is generally going to be quite a bit slower, especially for large tables. Try to avoid nesting tables. Do everything you can to make the page as small as possible, and keep testing!

"While displaying records I want to display the guider name, so I wrote a function that returns the guider name."
Sounds like you need to use a JOIN. Here's a simple example:
SELECT * FROM customer JOIN guider ON guider.id=customer.guider_id
This will change your page from using N + 1 (3001) queries to just one.
Make sure both guider.id and customer.guider_id are indexed and of appropriate data types (such as integers).
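
To make the difference concrete, here is a small sketch of the JOIN returning all 3,000 customers with their guider names in one query. Python's sqlite3 stands in for PHP/MySQL, and the sample data and index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE guider (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, guider_id INTEGER);
    CREATE INDEX idx_customer_guider ON customer (guider_id);
""")
conn.executemany("INSERT INTO guider (id, name) VALUES (?, ?)",
                 [(1, "Ann"), (2, "Ben")])
conn.executemany("INSERT INTO customer (id, name, guider_id) VALUES (?, ?, ?)",
                 [(i, "customer %d" % i, 1 + i % 2) for i in range(1, 3001)])

# One query returns every customer together with its guider name,
# instead of one extra lookup per customer row.
rows = conn.execute("""
    SELECT customer.id, customer.name, guider.name
    FROM customer
    JOIN guider ON guider.id = customer.guider_id
    ORDER BY customer.id
""").fetchall()
print(len(rows), rows[0])
```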

This is a short list of things to think about for improving performance. The importance of each point is relative, so the first is not necessarily the most important for you; that depends on the details of your project.
Check your database structure. If there are just these two tables, there might be little you can do. But keep in mind that there are things like indices, and that with an increasing number of records a second, denormalized table structure will improve the speed of retrieving results.
Use one query to select your data rather than iterating through IDs and running selects repeatedly.
Run a separate query for the guiders; I assume there are only a few of them. Save all guiders in a data structure, e.g. a dictionary, first and use the foreign key to apply the correct one to the current record. This might save a lot of data that would otherwise have to be transmitted from the database to your web server.
Get your result set with something like mysqli_result::fetch_all(), which returns a two-dimensional array with all results. This should be faster than iterating through each row with fetch_row().
Clean up your HTML output and use (external) CSS. This will save a lot of output space if you currently format your content with style=" ... a lot of formatting code ..." attributes in each line. If you use one large table, split it up into multiple tables (some browsers wait for the complete table to load before rendering it).
Very important in a lot of languages: use a string builder for concatenating your results into the output string!
Caching: think about generating the output once a day or once an hour. Write it to a cache file that is served instead of querying the database and building the same output on every request. Maybe you want to offer this generated file as a download rather than displaying it as a plain HTML page.
Last but not least, check the connections to the web server and database, the server load and the number of requests. If your servers are running under heavy load, everything else here might help reduce the load, or you may just have to upgrade the hardware.

LOL
Everyone is talking about big boys' toys, like database structure, caching and stuff,
while the problem most likely lies in mere HTML and browsers.
Just splitting the whole HTML table into chunks will let the first chunk show up immediately, while the others will eventually follow.
Only those who said to profile the whole thing first were right.
Trying to answer without profiling results is shooting in the dark.

How bad is using SELECT MAX(id) in MySQL instead of mysql_insert_id() in PHP?

Background: I'm working on a system where the developers seem to be using a function that executes a MySQL query like "SELECT MAX(id) AS id FROM TABLE" whenever they need to get the id of the LAST inserted row (the table having an auto_increment column).
I know this is a horrible practice (because concurrent requests will mix up the records), and I'm trying to communicate that to the non-tech / management team, to which their response is...
"Oh okay, we'll only face this problem when we have
(a) a lot of users, or
(b) it'll only happen when two people try doing something
at _exactly_ the same time"
I don't disagree with either point, and I think we'll run into this problem much sooner than we plan. However, I'm trying to calculate (or find a way to calculate) how many users would have to be using the system before we start seeing messed-up links.
Any mathematical insights into that? Again, I KNOW it's a horrible practice, I just want to understand the variables in this situation...
Update: Thanks for the comments folks - we're moving in the right direction and getting the code fixed!

The point is not if potential bad situations are likely. The point is if they are possible. As long as there's a non-trivial probability of the issue occurring, if it's known it should be avoided.
It's not like we're talking about changing a one line function call into a 5000 line monster to deal with a remotely possible edge case. We're talking about actually shortening the call to a more readable, and more correct usage.
I kind of agree with @Mark Baker that there is some performance consideration, but since id is a primary key, the MAX query will be very quick. Sure, the LAST_INSERT_ID() will be faster (since it's just reading from a session variable), but only by a trivial amount.
And you don't need a lot of users for this to occur. All you need is a lot of concurrent requests (not even that many). If the time between the start of the insert and the start of the select is 50 milliseconds (assuming a transaction safe DB engine), then you only need 20 requests per second to start hitting an issue with this consistently. The point is that the window for error is non-trivial. If you say 20 requests per second (which in reality is not a lot), and assuming that the average person visits one page per minute, you're only talking 1200 users. And that's for it to happen regularly. It could happen once with only 2 users.
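
The back-of-the-envelope numbers in this answer can be reproduced with a couple of lines. This is just the arithmetic, written in Python; the function names are made up, and the 50 ms window and one page per user per minute are the answer's own assumptions.

```python
def requests_per_second_for_overlap(window_ms):
    """Request rate at which, on average, one new request arrives per window."""
    return 1000 / window_ms


def users_for_rate(requests_per_second, pages_per_user_per_minute=1):
    """Number of active users needed to produce a given request rate."""
    return requests_per_second * 60 / pages_per_user_per_minute


rate = requests_per_second_for_overlap(50)   # 50 ms between insert and select
users = users_for_rate(rate)                 # one page view per user per minute
print(rate, users)
```

With a 50 ms window that gives 20 requests per second, which at one page per user per minute corresponds to 1200 active users.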
And right from the MySQL documentation on the subject:
You can generate sequences without calling LAST_INSERT_ID(), but the utility of
using the function this way is that the ID value is maintained in the server as
the last automatically generated value. It is multi-user safe because multiple
clients can issue the UPDATE statement and get their own sequence value with the
SELECT statement (or mysql_insert_id()), without affecting or being affected by
other clients that generate their own sequence values.

Instead of using SELECT MAX(id) you should do as the documentation says:
Instead, use the internal MySQL SQL function LAST_INSERT_ID() in an SQL query
Even so, neither SELECT MAX(id) nor mysql_insert_id() is "thread-safe", and you could still have a race condition. The best option you have is to lock the tables before and after your requests. Or, even better, use transactions.

I don't have the math for it, but I would point out that response (a) is a little silly. Doesn't the company want a lot of users? Isn't that a goal? That response implies that they'd rather solve the problem twice, possibly at great expense the second time, instead of solve it once correctly the first time.

This will happen when someone has added something to the table between one insert and that query running. So to answer your question, two people using the system has the potential for things to go wrong.
At least using the LAST_INSERT_ID() will get the last ID for a particular resource so it won't matter how many new entries have been added in between.
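
The failure described here can be reproduced with just two clients. The sketch below is an analogy, not MySQL itself: it uses Python's sqlite3 with two connections to a temporary file, and the cursor's lastrowid plays the role that LAST_INSERT_ID() / mysql_insert_id() play in MySQL.

```python
import os
import sqlite3
import tempfile

# A file-backed SQLite database so that two separate connections can share it.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
client_a = sqlite3.connect(path)
client_b = sqlite3.connect(path)
client_a.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, owner TEXT)")
client_a.commit()

# Client A inserts a row ...
cur_a = client_a.execute("INSERT INTO items (owner) VALUES ('A')")
client_a.commit()
# ... and before A asks for "its" id, client B inserts a row too.
client_b.execute("INSERT INTO items (owner) VALUES ('B')")
client_b.commit()

max_id = client_a.execute("SELECT MAX(id) FROM items").fetchone()[0]
own_id = cur_a.lastrowid   # remembered per cursor, unaffected by client B

print(max_id, own_id)      # MAX(id) now refers to B's row, lastrowid to A's
client_a.close()
client_b.close()
```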

In addition to the risk of getting the wrong ID value returned, there's also the additional database query overhead of SELECT MAX(id), and it's more PHP code to actually execute than a simple mysql_insert_id(). Why deliberately code something to be slow?
