How to search many millions of strings really fast? [closed] - php

So here is the situation:
We have 250,000 radio stations.
Each radio station comes with 2 strings.
These 2 strings can each be a song name, an album name or an artist name.
We don't know which is which, but one of them is definitely the song name.
Usually the other one is the artist. (I mention this only for worst-case planning; we don't want to create a worst-case scenario by assuming it's the album.)
Now we have a database of 4.5 million artists, 7 million albums and 150 million songs (plus some other data that doesn't matter here). Artists, albums and songs live in three separate tables; those are the tables we will search and match against. We can sort them alphabetically, or however else suits us, to speed up the process.
These tables are interrelated. A song always has an artist and an album associated with it (in their respective tables), an album always has one or more artists and songs associated with it... you get the idea.
From the 2 strings that come with each radio station, I have to recognize 3 things:
Song Name
Album Name
Artist Name
Now I am assuming the best-case scenario is to match the first string of each station against the artist names table. If we get a match, we can easily check whether the other string matches a song name (or album name) associated with the matched artist. (Let's assume for simplicity that an album name can never be the same as an artist name or a song name, or vice versa.)
If we don't get an artist match for the first string, we try the second string, and if that also fails we repeat the same process against albums. In SQL terms, the lookup is sketched below.
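A minimal sketch of that two-step lookup (table and column names are my own assumptions; both name columns would need indexes so the exact matches are index lookups):
-- hypothetical tables: artists(artist_id, name), songs(song_id, artist_id, name)
SELECT artist_id
FROM artists
WHERE name = 'string one';           -- step 1: is string one an artist?

SELECT s.song_id
FROM songs s
JOIN artists a ON a.artist_id = s.artist_id
WHERE a.name = 'string one'
  AND s.name = 'string two';         -- step 2: is string two one of that artist's songs?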
What should be the algorithm for getting the fastest results?
I have a server with 56 GB of RAM (some already in use), but I want to reserve 20 GB for other purposes. (If you can provide a really great solution by dipping into the reserve, don't hesitate to suggest it.)
We also have SSD storage. Do you think this can be done for all the radio stations within a minute? Preferably 30 seconds?
Please let me know how to proceed.

Well, all of these are strings, and it is an interesting search problem; building a separate, purpose-built search index (a Trie-like structure) would be good. For your problem, the best data structure to index the data is a Finite State Transducer (FST). It is much more compact than a Trie: real-world strings and text share a lot of suffixes, and an FST lets you share suffixes as well as prefixes (think graphs), whereas a Trie only shares prefixes. Also, since your keys carry values, you need a transducer (think sorted map), which emits a value for a given key, rather than a Finite State Acceptor, which is more like a sorted set than a map.
Lucene has a great implementation, and I believe features like suggestions and edit-distance matching are built on it. They have also decoupled it from their main inverted index.
More information on Lucene Finite State Transducers:
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
Index 1,600,000,000 Keys with Automata and Rust: http://blog.burntsushi.net/transducers/

Related

Database number of columns and separate table? [closed]

I'm using a MySQL database and am wondering about table design.
I have seen sets of two kinds of tables designed by an experienced PHP developer. Basically, one table contains dynamic stats and the other more static data, with each record sharing the same row_id. For example, a users table where info like name and pass is stored, and a user_stats table holding a summary of actions, such as the total money spent, etc.
Is there any real advantage in doing this, besides adding clarity to the function of each table? How many columns would you say is optimal in a table? Is it bad to mix the more static info and the dynamic stuff together into, say, 20 columns of the same table?
From a design standpoint, you might consider that the User table describes things, namely "users." There might be hundreds of thousands of them, and rows might need to remain in that table ... even for Users who've been pushing-up daisies in the local graveyard for many years now ... because this table associates the user_id values that are scattered throughout the database with individual properties of that "thing."
Meanwhile, we also collect User_Stats. But, we don't keep these stats nearly so long, and we don't keep them about every User that we know of. (Certainly not about the ones who live in the graveyard.) And, when we want to run reports about those statistics, we don't want to pore through all of those hundreds-of-thousands of User records, looking for the ones that actually have statistics.
User_Stats, then, is an entirely separate collection of records. Yes, it is related to Users, e.g. in the referential-integrity sense that "any user_id (foreign key ...) in User_Stats must correspond to a user_id in Users." But it is, nevertheless, "an entirely separate collection of records."
Another important reason for keeping these as a separate table is that there might naturally be a one-to-many relationship between a User and his User_Stats. (For instance, maybe you keep aggregate statistics by day, week, or month ...)
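A minimal sketch of that separation (column names and the per-period granularity are illustrative assumptions):
CREATE TABLE Users (
  user_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  pass_hash CHAR(60) NOT NULL
);

CREATE TABLE User_Stats (
  user_id INT UNSIGNED NOT NULL,
  period_start DATE NOT NULL,                     -- one stats row per user per period
  money_spent DECIMAL(10,2) NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, period_start),
  FOREIGN KEY (user_id) REFERENCES Users (user_id)
);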
If you have nothing better to do with your afternoon than to read database textbooks ... ;-) ... the formal name of this general topic is: "normal forms." ("First," "Second," "Third," and so on.)

Many to many vs one row [duplicate]

This question already has answers here:
Many database rows vs one comma separated values row
I'm interested in how and why a many-to-many relationship is better than storing the information in one row.
Example: I have two tables, Users and Movies (very big data). I need to establish a "viewed" relationship.
I have two ideas:
Make another column in the Users table called "views", where I store the ids of the movies this user has viewed in a string, for example: "2,5,7...". Then I process this information in PHP.
Make a new table users_movies (many to many), with columns user_id and movie_id. A row with user_id=5 and movie_id=7 means that user 5 has viewed movie 7.
I'm interested in which of these methods is better and WHY. Please consider that the data is quite big.
The second method is better in just about every way. Not only will you utilize your DB's indexes to find records faster, it will also make modification far, far easier.
Approach 1) could answer the question "Which movies has user X viewed?" with SQL like "...FIND_IN_SET(movie_id, user_movielist)...". But the other way round ("Which users have viewed movie X?") won't work in SQL.
That's why I would always go for approach 2): a clear, normalized structure where both directions are simple joins, as sketched below.
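A minimal sketch of both lookups with the join table (table and column names follow the question; ids are example values):
SELECT m.*
FROM movies m
JOIN users_movies um ON um.movie_id = m.id
WHERE um.user_id = 5;     -- which movies has user 5 viewed?

SELECT u.*
FROM users u
JOIN users_movies um ON um.user_id = u.id
WHERE um.movie_id = 7;    -- which users have viewed movie 7?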
It's just about your needs. If you need performance then you may have to accept redundancy of information and add the column. If your main goal is to respect the normalization paradigm then you should not have redundancy at all.
When I have to make this kind of choice, I try to weigh the space lost to redundancy against the frequency of the query of interest and the performance it needs.
A few more thoughts.
In your first design, if you look up a particular user you can easily get the list of ids of the films they have seen, but you then need a separate query to get details such as the titles of those movies. This might be one query using IN with the list of ids, or one query per film id. Either way it is inefficient and clunky.
With MySQL there is a possible fudge to join in this situation using the FIND_IN_SET() function (though a downside is that you are straying into non-standard SQL). You could join your table of films to the users using ON FIND_IN_SET(film.id, users.film_id) > 0. However, this will not use an index for the join, and it involves a function call per row (which, while quick for what it does, is slow when performed on thousands of rows).
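A sketch of that fudge (assuming, as above, that users.film_id holds the comma-separated list):
SELECT u.id, f.title
FROM users u
JOIN film f ON FIND_IN_SET(f.id, u.film_id) > 0
WHERE u.id = 5;    -- every film row is scanned; the join cannot use an index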
If you wanted to find all the users who had viewed any film a particular user had viewed, it is a bit more difficult. You can't just use FIND_IN_SET, as it requires a single string and a comma-separated list. As a single query you would need to join the particular user to the film table to get a lot of intermediate rows, and then join that back against the users again (using FIND_IN_SET) to find the other users.
There are ways in SQL to split up a comma separated list of values, but they are messy and anyone who has to maintain such code will hate it!
These are all fudges. With the 2nd solution these queries are easy to write, and the resulting joins can use indexes (possibly the whole query can be served from indexes without touching the actual data).
A further issue with the first solution is data integrity. You would have to manually check that a film doesn't appear twice for a user (with the 2nd solution this is easily enforced with a unique key). You also cannot add a foreign key to ensure that every film id stored for a user actually exists. Further, you would have to manually ensure that nothing inserts a stray character string into your delimited list of ids.
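A minimal sketch of the constraints the 2nd solution gives you for free (table and column names assumed):
CREATE TABLE users_movies (
  user_id INT UNSIGNED NOT NULL,
  movie_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, movie_id),                 -- a film cannot appear twice for a user
  FOREIGN KEY (user_id) REFERENCES users (id),     -- every user id must exist
  FOREIGN KEY (movie_id) REFERENCES movies (id)    -- every film id must exist
);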

Best implementation of a large number of bit flags in php and Mysql [closed]

Can someone offer a solution for implementing flags in PHP and MySQL? I have a large number of flags that represent owning items. There are over 200 different items, but over time this will grow to as many as 500-600.
My initial thought was to store this information in a blob and update it in a trigger in MySQL. But it appears that MySQL's bit operations are limited to 64 bits.
The basic operation is to grant an item by id (say item 156), which would set the 156th bit in the blob.
If you store 200 "items" as bit flags, that will occupy 25 bytes per user, regardless of how many items each user actually owns.
If instead you have a UserItems table with two columns, user_id and item_id, that is 8 bytes per pair (two 4-byte ids). If users have, on average, 3 items or fewer, then the normalized approach is actually smaller than the bit-packing approach.
It also offers several advantages. The normalized approach would naturally have an items table with descriptive information about the items. This could be easily joined in, so you would know which items are red, or in German, or size 16, or take diesel fuel -- whatever the appropriate attributes are for your items. And these could have item hierarchies with important category information as well.
In addition, the basic UserItems table might be too small. Perhaps you want other information about the acquisition of an item -- such as when it was acquired, or the quantity. Well, you can add columns to the UserItem table. The bit-packing approach is a bit less flexible.
The advice is to use a standard database approach. This has worked on many different applications, some bigger than the one you are contemplating. If you really do understand the problem and understand the performance implications of different approaches, there are some circumstances where bit-packing could be the right solution. But it is not the way to start the design.
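A minimal sketch of that normalized layout (table names, and the extra acquisition columns mentioned above, are illustrative assumptions):
CREATE TABLE UserItems (
  user_id INT UNSIGNED NOT NULL,
  item_id INT UNSIGNED NOT NULL,
  acquired_at DATETIME NOT NULL,                -- example extra column: when the item was acquired
  quantity INT UNSIGNED NOT NULL DEFAULT 1,     -- example extra column: how many
  PRIMARY KEY (user_id, item_id),
  FOREIGN KEY (item_id) REFERENCES Items (item_id)
);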
As I see it:
you have a list of objects and a list of potential owners
you want to implement an "owner owns object" relation
The simplest solution would be to have a table that associates an owner and an object.
A record in this table would be equivalent to a cell in the matrix representing the ownership relation.
If this matrix is populated sparsely enough, then the table is a good enough solution.
If you know that an average owner will own 50 objects or so, then you might want to organize your objects into groups and use a slightly more complex relation, "owner owns objects in object group", where you store the owner id, the group id and a bitmap specifying which objects within the group are owned.
Provided your owners don't pick items at random (i.e. you can guess reasonably well which objects are likely to be owned together), this approach is more flexible than one huge bitmap per owner, which you would have to grow each time you add a new set of objects to your database, and it is probably more size-efficient.
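A minimal sketch of that grouped layout (table names and the 64-objects-per-group size are my assumptions):
CREATE TABLE owner_object_groups (
  owner_id INT UNSIGNED NOT NULL,
  group_id INT UNSIGNED NOT NULL,
  owned_bitmap BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- bit n set => object n of this group is owned
  PRIMARY KEY (owner_id, group_id)
);

-- marking object 5 of group 3 as owned by owner 42, creating the row if needed:
INSERT INTO owner_object_groups (owner_id, group_id, owned_bitmap)
VALUES (42, 3, 1 << 5)
ON DUPLICATE KEY UPDATE owned_bitmap = owned_bitmap | (1 << 5);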
There are several answers here that should be considered before using this solution. In my particular case I analyzed the use case, and it was necessary to use binary data and bit-twiddle the results. So, on to the solution...
First, the MySQL data type was VARBINARY(maxLen), where maxLen is the number of bytes I would probably need. In my case I chose 64, which gives me 512 bits. I chose VARBINARY because it doesn't pad out the data; I wanted to store the smallest amount of data possible, and it wasn't until later that users would begin needing the higher bits flipped.
Second, the PHP... (WHAT CRAP, but it works). I select from the DB, convert, bit-twiddle, and update. But there is a bit more to it than that:
$q = $sqli->query("SELECT binaryFlags FROM Table WHERE id = $id");
$r = $q->fetch_assoc();
$c = $r['binaryFlags'];       // holds the raw binary data
$c = bin2hex($c);             // convert binary to a hex string (no direct binary-to-decimal conversion)
$c = hexdec($c);              // hex string to integer (only exact up to PHP's 64-bit integer range)
$c |= 1 << ($flag - 1);       // OR in the flag; $flag is the 1-based bit to flip
$c = '0x' . dechex($c);       // back to hex, with '0x' so SQL treats it as a hex literal
$q = $sqli->query("UPDATE Table SET binaryFlags = $c WHERE id = $id"); // restrict the update to this row
Hope this can help someone else, AND if anyone has a better way, or reasons why I had to jump through hoops, please leave a comment.

Store options as descriptive strings or numbers? [closed]

What's the best way of storing options in MySQL: as descriptive strings, or as integers associated with each string?
Let's say I have this question in my UI:
What's your favorite ice cream flavor?
Vanilla
Chocolate
Strawberry
Is it better to store those in the DB as 1, 2, 3 in an INT(1) field, or as the strings vanilla, chocolate, or strawberry in a CHAR field? I know the INT field will be faster, but probably not drastically so unless there are tens of thousands of rows.
If they're stored as strings then I wouldn't need any extra PHP code, whereas if they're stored as numbers I'd have to define that 1 = vanilla, etc.
What's the general consensus on this?
The usual approach with relational databases is to make a new table called icecream_flavor or whatever. Then you can add new flavours at a later date, and your program can offer all the current flavour choices when it asks. You store choices by table id (i.e. an integer).
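A minimal sketch of that lookup-table approach (table and column names assumed):
CREATE TABLE icecream_flavor (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE user_choice (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  flavor_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (flavor_id) REFERENCES icecream_flavor (id)
);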
If paddy's answer isn't an option then you should store the values as an ENUM, which is stored internally much like a TINYINT(1).
ENUM is only the answer if the values the user can choose from are fixed in advance; otherwise you would have to alter the table. But if you use ENUM, MySQL's engine is optimized for inserting into and selecting from ENUM columns. It's the obvious choice for, say, (Male/Female).
Otherwise the answer to your question is TINYINT(1), which is faster than both CHAR and INT(1).
If you have a constant set of values, and you're not going to relate those values to other data in your database (e.g. information about each type of ice cream), you can use MySQL's special field type called ENUM.
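A minimal sketch of the ENUM approach (table and column names assumed; note that adding an option later means altering the table):
CREATE TABLE user_preference (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  flavour ENUM('vanilla', 'chocolate', 'strawberry') NOT NULL
);

-- adding a new option later requires a schema change:
ALTER TABLE user_preference
  MODIFY flavour ENUM('vanilla', 'chocolate', 'strawberry', 'mint') NOT NULL;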
If each user is identifiable by some unique key (say an email address) then you may find you do not need a numerical id.
Keeping the flavour as a flavour when YOU control the options (so you don't get Vanilla, vanilla, vanila and so on) suggests to me that you store the real value (Vanilla).
This avoids a join and makes your database data meaningful when you browse it.
You can add an index to the flavour column of the database, so you can "show all users who prefer Vanilla" very cheaply.
(If you potentially have infinite options to store, then maybe you should be investigating using a nosql databases)
I would suggest that you keep the primary index as an int. The simplest reason for doing this is to allow for what you don't currently know, and to let your data model evolve.
Assuming your example: if you go with the actual word, and in six months you decide to also store a size (single scoop, double scoop, etc.), you suddenly have a problem, as the primary key you thought was unique now has multiple entries per flavour. If, on the other hand, you use a numeric key, you can easily have as many variants of Vanilla as you like and everything is happy.
In this case I would also suggest keeping the primary key as a key and nothing more: add another column that holds the flavour, or create a lookup table that stores the actual properties of each item. If you keep a numeric format but give ID 1 a fixed meaning of Vanilla, you basically run into the same problem as before; keep the ID as the ID and your details separate.
To keep data about your items, you can use something like this:
Master Table
ID  Name
1   Ice Cream

Properties Table
ID  Type
1   Flavour
2   Size

Property Table
ID  PropID  Detail
1   1       Vanilla
2   1       Strawberry
3   2       Single Scoop
4   2       Double Scoop

MasterToProperty
MasterID  Property  PropID
1         1         1
1         2         4
This basically gives you an unlimited number of options, and adding anything you want (an entry for choc chips, for example) is simply a matter of adding a few rows of data, not table changes.
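A minimal sketch of reading that structure back (SQL-friendly names are assumed for the tables above):
SELECT m.Name, pt.Type, p.Detail
FROM MasterToProperty mtp
JOIN Master m      ON m.ID = mtp.MasterID
JOIN Properties pt ON pt.ID = mtp.Property
JOIN Property p    ON p.ID = mtp.PropID
WHERE mtp.MasterID = 1;   -- Ice Cream: Flavour = Vanilla, Size = Double Scoop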
Set up your database like this:
Questions
id, question_text
ex. (1, "What is your favorite ice cream?")
Options
id, question_id, option_text
ex. (1, 1, "Vanilla")
Responses
id, user_id, question_id, option_id
ex. (1, 421, 1, 1)
You should also throw some created and modified fields into each table for keeping track of changes.
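As DDL, that layout might look like this (a minimal sketch; the created/modified audit columns are shown only on questions for brevity):
CREATE TABLE questions (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  question_text TEXT NOT NULL,
  created DATETIME NOT NULL,
  modified DATETIME NOT NULL
);

CREATE TABLE options (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  question_id INT UNSIGNED NOT NULL,
  option_text VARCHAR(255) NOT NULL,
  FOREIGN KEY (question_id) REFERENCES questions (id)
);

CREATE TABLE responses (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id INT UNSIGNED NOT NULL,
  question_id INT UNSIGNED NOT NULL,
  option_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (question_id) REFERENCES questions (id),
  FOREIGN KEY (option_id) REFERENCES options (id)
);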

Looking for a starting point for a tagging system [closed]

Basically I want to set up a tagging system like Stack Overflow has for entries, and I'm trying to plan out how a relevance-based search would work. I want an option to pull up similarly tagged entries for a "related entries" section. Right now I am using two tables for tags: a table for each unique tag, and a join table. I am trying to work out whether that will be enough to generate a list of entries that share similar tags.
If anyone has any ideas, or links to articles I could read on it to get my brain heading in the right direction that would be amazing. Thank you!
Add one more field to the entries table: tags, holding a string of comma-separated tags, to avoid 2 more joins when selecting a list of entries.
Perhaps you could have a separate table to store related entries.
EntryId RelatedEntryId
Then you could have a cron job recompute the relationships periodically and update the table. That would be less expensive than trying to compute these relationships on the fly.
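A minimal sketch of that precomputed table and the kind of query the cron job could fill it with (the entry_tags join table and the shared_tag_count column are my assumptions):
CREATE TABLE related_entries (
  entry_id INT UNSIGNED NOT NULL,
  related_entry_id INT UNSIGNED NOT NULL,
  shared_tag_count INT UNSIGNED NOT NULL,   -- how many tags the two entries share
  PRIMARY KEY (entry_id, related_entry_id)
);

-- rebuilt periodically by the cron job (after clearing the old rows):
INSERT INTO related_entries (entry_id, related_entry_id, shared_tag_count)
SELECT a.entry_id, b.entry_id, COUNT(*)
FROM entry_tags a
JOIN entry_tags b ON b.tag_id = a.tag_id AND b.entry_id <> a.entry_id
GROUP BY a.entry_id, b.entry_id;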
You'll need to keep track of how often one tag is linked to another. Say "php" and "mysql" share 50 articles (or whatever the main content being tagged is), while "php" and "sql-server" share 3, and "php" and "apache" share 25. Given "php", you'd want to return "mysql" and "apache", in that order (possibly letting "sql-server" fall by the wayside).
No way is this ideal, just thinking out loud (and kind of expanding on stephenc's answer, now that I see it):
CREATE TABLE tag_relations (
tag_id int unsigned not null,
related_tag_id int unsigned not null,
relation_count smallint unsigned not null,
PRIMARY KEY (tag_id, related_tag_id),
KEY relation_count (relation_count)
);
Then for each unique tag tied to an article, loop through all the other tags and INSERT / UPDATE, incrementing relation_count by 1 (see the sketch below). That means ("php", "mysql") and ("mysql", "php") are two completely different relations to be maintained, but without digging through search concepts I've probably forgotten, it'll still function. If something has 10+ tags, updates will be very slow (maybe pass that to cron like stephenc suggested), but it'll be easier to search this way.
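A minimal sketch of that insert-or-increment step (MySQL's INSERT ... ON DUPLICATE KEY UPDATE does it in one statement; the ids are example values):
INSERT INTO tag_relations (tag_id, related_tag_id, relation_count)
VALUES (1, 2, 1)
ON DUPLICATE KEY UPDATE relation_count = relation_count + 1;

Searching is then nice and straightforward, like so: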
SELECT related_tag_id, COUNT(relation_count) AS total_relations
FROM tag_relations
WHERE tag_id IN ([list,of,tag,IDs,to,compare])
-- probably also: AND related_tag_id NOT IN ([list,of,tag,IDs,to,compare])
GROUP BY related_tag_id
ORDER BY total_relations DESC
Easier than having to check against both tag_id and related_tag_id and sum them up through a mess of subqueries, at least. JOIN on your tags table to get the actual tag names and you're set.
So if you're looking up "php" and "mysql", and "apache" often relates to both, it'll be near the top, since each common relation is counted and weighted. It won't strictly limit results to common links though, so add HAVING total_relations >= x (x being an arbitrary cutoff) and/or just a regular LIMIT x to keep things relevant.
(note: research the heck out of this before thinking this is even slightly useful - I'm sure there's some known algorithm out there that's 100x smarter and I'm just not remembering it.)
PHPro.org has a good writeup too, using a similar idea.
