I was wondering how I could quickly search a data string of up to 1 billion bytes. The data is all numeric. Currently we have the data split into 250k files, and the search runs strpos (the fastest built-in function we found) on each file until it finds something.
Is there a way I can index to make it go faster? Any suggestions?
Eventually I would like to find multiple occurrences, which, as of now, would be done with the offset parameter on strpos.
Any help would surely lead to recognition where needed.
Thanks!
- James Hartig
Well, your tags indicate what you should do (the tag I am referring to is "indexing").
Basically, you should have separate files that hold the indexes for the data. Each index would contain the data strings you are looking for, along with the file and byte positions where they occur.
You would then access the index, look up your value and then find the location(s) in the original file(s) for the data string, and process from there.
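For illustration, here is a very rough sketch of the index-file idea in PHP. The data/ and index/ directories, the file extensions, and the key length $K are all assumptions; for a full billion bytes you would build the index incrementally (or in a database) rather than all in memory as shown here:

    // Pass 1: for every K-digit substring, record "file:offset" in a small index file.
    $K = 6;            // index key length; queries are assumed to be at least K digits
    $index = [];       // prefix => list of "filename:offset"

    foreach (glob('data/*.dat') as $file) {
        $data = file_get_contents($file);
        $last = strlen($data) - $K;
        for ($i = 0; $i <= $last; $i++) {
            $index[substr($data, $i, $K)][] = basename($file) . ':' . $i;
        }
    }
    foreach ($index as $key => $positions) {
        file_put_contents("index/$key.idx", implode("\n", $positions));
    }

    // Lookup: read the one small index file for the query's first K digits,
    // then verify the full query at each candidate offset.
    function lookup(string $query, int $K): array {
        $hits = [];
        $idxFile = 'index/' . substr($query, 0, $K) . '.idx';
        if (!is_file($idxFile)) {
            return $hits;
        }
        foreach (file($idxFile, FILE_IGNORE_NEW_LINES) as $line) {
            [$file, $offset] = explode(':', $line);
            $chunk = file_get_contents("data/$file", false, null, (int)$offset, strlen($query));
            if ($chunk === $query) {
                $hits[] = "$file:$offset";
            }
        }
        return $hits;
    }

The point is that a search then touches one tiny index file plus a handful of byte-range reads, instead of scanning 250k files with strpos.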
A good answer may require that you get a little more specific.
How long is the search query? 1 digit? 10 digits? Arbitrary length?
How "fast" is fast enough? 1 second? 10 seconds? 1 minute?
How many total queries per second / minute / hour do you expect?
How frequently does the data change? Every day? Hour? Continuously?
When you say "multiple occurrences", it sounds like you mean overlapping matches. Is that right?
What is the "value" of the answer and to how many people?
A billion ain't what it used to be, so you could just index the crap out of the whole thing and have an index that is 10 or even 100 times the size of the original data. But if the data is changing by the minute, that would mean you were burning more cycles creating the index than searching it.
The amount of time and money you put into a solution is a function of the value of that solution.
You should definitely get a girlfriend. Besides helping you spend your time better it can grow fat without bursting. Oh, and the same goes for databases.
All of Peter Rowell's questions pertain. If you absolutely must have an out-of-the-box answer, then try grep. You can even exec it from PHP if you like. It is orders of magnitude faster than strpos. We've actually used it quite well as a solution for something that couldn't deal with indexing.
But again, Peter's questions still all apply. I'd answer them before diving into a solution.
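If you do go the grep route, a minimal sketch of shelling out from PHP might look like this (the data/ path is an assumption; -F treats the needle as a fixed string, -o prints only the match, -b prints its byte offset, -H prefixes the file name):

    $needle = '123456789';
    $cmd = sprintf('grep -FobH -- %s data/*.dat', escapeshellarg($needle));
    exec($cmd, $lines);

    foreach ($lines as $line) {
        // each line looks like "data/chunk_0001.dat:10482:123456789"
        [$file, $offset] = explode(':', $line, 3);
        echo "found in $file at byte $offset\n";
    }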
Would a hash function/table work? Or a Suffix Array/Tree?
I have a heavy script that we run a lot. Below is the algorithm it uses:
Load 4500 rows from database and store them as an array. (A)
Load 60,000 rows from the database and store them as an array. (B)
For each element in (A), look for a match in (B).
Go to the next element in (A).
So the maximum number of iterations of this script is 4500 * 60,000, which is 270,000,000, so you understand that this can be a bit sweaty for PHP.
Can I make this process more efficient somehow?
Reading the rows from the database is not really an issue; it is the array iterations that bring the heavy cost.
It does work pretty fast now, but one factor (the 60,000) will increase greatly in the years to come.
So any ideas?
Here are a few different answers. My guess is that the first one is the right one, the easy one, and sufficient, but it's very hard to be sure.
Possible Answer 1: Use SQL
As the comments indicate, it sounds an awful lot like a join. In addition, your post seems to indicate that you only take an action when a match is found, and that not every element in A has a match. This means your SQL statement should only return the matching rows, not all of them. It doesn't matter that you can't do everything in SQL, as long as you can let it organise your data for you.
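Something like the following untested sketch is what I mean; the table and column names, and processMatch(), are invented for illustration:

    // Let MySQL do the matching and hand PHP only the rows that actually match.
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    $sql = "SELECT a.id, a.value, b.id AS b_id
            FROM small_table a
            INNER JOIN big_table b ON b.match_column = a.match_column";

    foreach ($pdo->query($sql) as $row) {
        // only matched pairs arrive here, instead of hundreds of millions of PHP iterations
        processMatch($row);   // hypothetical handler for your own logic
    }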
Possible Answer 2: Sort the arrays
Maybe you can sort the arrays (again, preferably letting your database do this). Possibly you can sort B so that the search for a match is quicker. Or put the search value in the key of the array so that lookups are very quick. Or, if you are lucky, you might be able to sort both arrays in a way that puts the A's and B's in the same order, i.e. for any A you pick, you know that the matching B either does not exist or comes later in the B array.
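As an untested sketch of the "search value as key" idea (the column name and handleMatch() are placeholders):

    // Key array B by the value you match on, so each lookup is O(1) via isset()
    // instead of scanning tens of thousands of elements per A row.
    $byKey = [];
    foreach ($B as $rowB) {
        $byKey[$rowB['match_column']] = $rowB;
    }

    foreach ($A as $rowA) {
        $key = $rowA['match_column'];
        if (isset($byKey[$key])) {
            handleMatch($rowA, $byKey[$key]);   // hypothetical handler
        }
    }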
Possible Answer 3: Explain more about the problem
You have only given us your current algorithm, not what you are actually trying to do. Most likely iterating over everything is not the best idea, but no one can say unless they know more about your data and what you want to do in the end.
It depends on your data, of course...
Some general aspects:
This really sounds like a use case for database queries, not the PHP script. Looking for matches in data sets is what databases are good at; no tricks will make PHP scripts play in the same league.
If you really have to do it with the PHP scripting functions, try to:
Not hit your allowed memory limits. Your PHP server will just exit with an error, but if your SQL-side result set becomes too big, your SQL server may begin to write temp data to disk, which will slow down the whole execution. If possible, fetch and process the data in chunks (OFFSET, LIMIT).
If you match whole words, build your matching array in such a way that the search criterion is a key, not a value, so that you can use isset($potentialMatches[$searchTerm]), which is way faster than in_array($searchTerm, $potentialMatches) for larger arrays. Mockup:
    // key the lookup array by the column you match on, not by a numeric index
    while ($row = $resultSet->fetch_assoc()) {
        $potentialMatches[$row['search_column']] = $row;
    }
But it can't be stressed enough: the usual course to handle this would be (a rough sketch follows the list):
1. Do the matching DB-side.
2. Process the matches in your script.
3. If necessary, do a new query for the non-matches.
4. If you did 3, process those results in your script.
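An untested sketch of that course (table/column names and processMatch() are placeholders, and the chunk size is arbitrary):

    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    $chunk = 10000;
    for ($offset = 0; ; $offset += $chunk) {
        // 1. do the matching DB-side, fetched in chunks to stay inside memory limits
        $rows = $pdo->query(
            "SELECT a.*, b.id AS b_id
             FROM small_table a
             JOIN big_table b ON b.match_column = a.match_column
             ORDER BY a.id
             LIMIT $chunk OFFSET $offset"
        )->fetchAll(PDO::FETCH_ASSOC);

        if (!$rows) {
            break;
        }

        // 2. process the matches in the script
        foreach ($rows as $row) {
            processMatch($row);   // hypothetical handler
        }
    }

    // 3./4. if you also need the rows without a match, run a second query
    //       (e.g. LEFT JOIN ... WHERE b.id IS NULL) and process those the same way.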
I have a very common problem, but cannot seem to find a good answer for it.
I need to get a page's worth of rows from a table, as well as enough info to paginate this data. So I need a very rough estimate of the total number of rows; all I really need to know is ceil(count()/50).
So count() is really overkill, and I already have SELECT * FROM table LIMIT 0, 50 running, so if it can be appended to that query, all the better.
I have heard about SQL_CALC_FOUND_ROWS. But I also heard that it is not particularly more efficient than just doing the count yourself. "Unfortunately, using SQL_CALC_FOUND_ROWS has the nasty consequence of blowing away any LIMIT optimization that might happen".
So, all in all, I kind of think using MySQL's row estimate is the way to go. But I do not know how to do that, or how far off this estimate might be.
Note 1: In my situation, most of the tables I am working with are only updated a few times a day, not constantly.
Note 2: I am using PDO with PHP.
Another interesting idea I found:
A better design is to convert the pager to a “next” link. Assuming there are 20 results per page, the query should then use a LIMIT of 21 rows and display only 20. If the 21st row exists in the results, there’s a next page, and you can render the “next” link.
If you don't need the total count of the table, this is indeed the fastest solution.
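A minimal sketch of that trick with PDO (my_table is a placeholder and $pdo is your existing connection):

    $perPage = 50;
    $page    = max(1, (int)($_GET['page'] ?? 1));
    $offset  = ($page - 1) * $perPage;

    // ask for one extra row to learn whether a next page exists
    $sql  = sprintf('SELECT * FROM my_table LIMIT %d, %d', $offset, $perPage + 1);
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    $hasNext = count($rows) > $perPage;          // a 51st row means there is a next page
    $rows    = array_slice($rows, 0, $perPage);  // display only the first 50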
It is an old topic that has been beaten to death. Many times. COUNT is the fastest way to get the number of rows in a typical table.
But if you never delete anything from the table (a strong assumption, but it holds in some cases), then you could simply take the ID of the last row (which may be faster, but not necessarily). This would also fit your need for an estimate, since it won't be exactly right anyway.
But then again, if you are using, for example, MyISAM, then nothing beats COUNT (which is true for most cases).
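If you do want MySQL's own estimate rather than an exact COUNT, one way (my_table is a placeholder) is to read the Rows value from SHOW TABLE STATUS; it is exact for MyISAM and only an approximation for InnoDB:

    $status = $pdo->query("SHOW TABLE STATUS LIKE 'my_table'")->fetch(PDO::FETCH_ASSOC);
    $approxRows = (int)$status['Rows'];
    $pages = (int)ceil($approxRows / 50);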
I recently bought myself a domain for personal URL shortening.
And I created a function to generate alphanumeric strings of 4 characters as reference.
BUT
How do I check whether they are already used or not? Surely I can't check the database for every URL I generate, or is this just the way it works and I have to do it?
If so, what if I already have 13,000,000 URLs generated (out of 14,776,336)? Do I need to keep generating strings until I find one that is not in the DB yet?
This just doesn't seem like the right way to do it. Can anyone give me some advice?
One memory-efficient and faster way I can think of is the following. This problem can be solved without using a database at all. The idea is that instead of storing used URLs in a database, you store them in memory. And since storing the URLs themselves would take a lot of memory, we use a bit set (an array of bits), needing only one bit per URL.
For each random string you generate, compute a hash code for it that lies between 0 and some maximum number K.
Create a bit set (basically a bit array). Whenever you use some URL, set the corresponding hash-code bit in the bit set to 1.
Whenever you generate a new URL, check whether its hash-code bit is set. If yes, discard that URL and generate a new one. Repeat until you get an unused one.
This way you avoid the DB entirely, your lookups are extremely fast, and very little memory is needed. (A sketch follows below.)
I borrowed the idea from this place
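Since the codes here are only 4 characters from a known alphabet, you don't even need a hash: each code maps to a unique index, so there are no collisions to worry about. An untested sketch (generateCode() stands in for your existing generator, and in practice you would persist $bits to a file between requests):

    const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

    // Map a 4-character [0-9A-Za-z] code to a number between 0 and 62^4 - 1 (14,776,336 values).
    function codeToIndex(string $code): int {
        $index = 0;
        for ($i = 0; $i < strlen($code); $i++) {
            $index = $index * 62 + strpos(ALPHABET, $code[$i]);
        }
        return $index;
    }

    // One bit per possible code: roughly 1.8 MB in total.
    $bits = str_repeat("\0", intdiv(62 ** 4, 8) + 1);

    function isUsed(string $bits, int $index): bool {
        return (bool)((ord($bits[intdiv($index, 8)]) >> ($index % 8)) & 1);
    }

    function markUsed(string &$bits, int $index): void {
        $byte = intdiv($index, 8);
        $bits[$byte] = chr(ord($bits[$byte]) | (1 << ($index % 8)));
    }

    // Keep generating until a free code turns up, then mark it as used.
    do {
        $code = generateCode();        // hypothetical: your 4-character generator
        $idx  = codeToIndex($code);
    } while (isUsed($bits, $idx));
    markUsed($bits, $idx);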
A compromise solution is to generate a random id, and if it is already in the database, find the first empty id that is bigger than it. (Wrapping around if you can't find any empty space in the range above.)
If you don't need the ids to be unguessable (and with only 4 characters you probably can't have that anyway), this approach works fine and is quick.
One algorithm is to try a few times to find a free URL of N characters; if one still hasn't been found, increase N. Start with N=4.
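An untested sketch of that loop, where randomCode() and codeExists() stand in for your generator and for whatever uniqueness check you use (database lookup, bit set, ...):

    function newCode(int $n = 4, int $tries = 10): string {
        while (true) {
            for ($i = 0; $i < $tries; $i++) {
                $code = randomCode($n);        // hypothetical: $n random alphanumeric characters
                if (!codeExists($code)) {      // hypothetical uniqueness check
                    return $code;
                }
            }
            $n++;   // this length is getting crowded, so widen the space
        }
    }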
I am looking to generate a random number for every user contribution, to serve as the title of the contribution.
I could simply check the database with a query each time and generate a number that does not equal any existing entry. But I imagine this is inefficient and could become slow if the database gets big. Also, I'd have to hold all the numbers from the database somewhere to manage the "not equal to" check, in an array or something similar, and that could end up huge.
Excuse the layman's speech I am new to this.
Any suggestions how this can be solved efficiently without straining the resources too much? You can explain it linguistically and do not have to provide me any scripts, I will figure it out.
You can use uniqid(). I'm not sure how portable it is.
Example:
printf("uniqid(): %s\r\n", uniqid());
Will output something like:
uniqid(): 4b3403665fea6
uniqid() is based on the current time in microseconds, so it is not truly random and can technically repeat.
Maybe you can apply a simple algorithm on an auto-increment field? n(n+1)/2 or something?
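For example (an untested sketch, assuming the contribution row is inserted first and $pdo is your connection); the result is unique and strictly increasing, but easy to guess:

    $n = (int)$pdo->lastInsertId();   // auto-increment id of the new contribution
    $title = $n * ($n + 1) / 2;       // n(n+1)/2: never repeats for distinct n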
This question may seem too basic to some, but please bear with me; it's been a while since I dealt with serious database programming.
I have an algorithm that I need to implement in PHP/MySQL for a website. It performs some computations iteratively on an array of objects (it ranks the objects based on their properties). In each iteration the algorithm runs through the whole collection a couple of times, accessing various data from different places in the collection. The algorithm needs several hundred iterations to complete. The array comes from a database.
The straightforward solution that I see is to take the results of a database query and create an object for each row of the query, put the objects to an array and pass the array to my algorithm.
However, I'm concerned about the efficiency of such a solution when I have to work with an array of several thousand items, because what I do is essentially mirror the results of a query in memory.
On the other hand, making database queries a couple of times on each iteration of the algorithm also seems wrong.
So, my question is - what is the correct architectural solution for a problem like this? Is it OK to mirror the query results to memory? If not, which is the best way to work with query results in such an algorithm?
Thanks!
UPDATE: The closest problem that I can think of is ranking of search results by a search engine - I need to do something similar to that. Each result is represented as a row of a database and all results of the set are regarded when the rank is computed.
Don't forget, premature optimization is the root of all evil. Give it a shot copying everything to memory. If that uses too much mem, then optimize for memory.
Memory seems like the best way to go - iff you can scale up to meet it. Otherwise you'll have to revise your algorithm to maybe use a divide and conquer type of approach - do something like a merge sort.
It really depends on the situation at hand. It's probably rarely required to do such a thing, but it's very difficult to tell based off of the information you've given.
Try to isolate the data as much as possible. For instance, if you need to perform some independent action on the data that doesn't have data dependencies amongst iterations of the loop, you can write a query to update the affected rows rather than loading them all into memory, only to write them back.
In short, it is probably avoidable but it's hard to tell until you give us more information :)
If you are doing a query against the database, when the results come back they are already "mirrored to memory". When you get your results using mysql_fetch_assoc (or equivalent), you have your copy. Just use that as the cache.
Is the computation of one object dependent on another, or are they all independent? If they are independent, you could load just a small number of rows from the database, converting them to objects as you describe. Then run your hundreds of iterations on these, and then output the result for that block. You then proceed to the next block of items.
This keeps memory usage down, since you are only dealing with a small number of items rather than the whole data set, and avoids running multiple queries on the database.
The SQL keywords LIMIT and OFFSET can help you step through the data block by block.
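A rough sketch of that block-by-block loop, assuming the objects really can be ranked independently; the table name and the rankBlock()/saveRanks() helpers are placeholders:

    $blockSize = 1000;
    for ($offset = 0; ; $offset += $blockSize) {
        $rows = $pdo->query("SELECT * FROM items ORDER BY id LIMIT $blockSize OFFSET $offset")
                    ->fetchAll(PDO::FETCH_ASSOC);
        if (!$rows) {
            break;
        }
        $ranked = rankBlock($rows);   // hypothetical: your hundreds of iterations, on this block only
        saveRanks($ranked);           // hypothetical: write the results back before moving on
    }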
Writing ranking queries with MySQL is possible as well; you just need to play with user-defined variables a bit. If you provide some input data and the result you are trying to achieve, the replies will be more detailed.
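For instance, something along these lines (untested; the items table and score column are made up), using a user-defined variable to number the rows in score order:

    $pdo->exec("SET @rank := 0");
    $rows = $pdo->query(
        "SELECT id, score, (@rank := @rank + 1) AS ranking
         FROM items
         ORDER BY score DESC"
    )->fetchAll(PDO::FETCH_ASSOC);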
Can you use a cron job to do your ranking, say once per day, per hour, or whatever you need, and then save each item's ranking to a field in its row?
That way, when you fetch your rows, you could just order them by the ranking field.