Do you think it's a good idea?
Like storing keywords from the real database inside an SQLite database, along with object IDs. So when you search, you do it with SQLite to get the IDs of the objects you found, and then query the real database using those IDs.
Example object from the MySQL db:
ID   slug      title     content
_____________________________________________________________________________
5    bla-bla   Bla Bla   I know what you did last summer
This would get indexed in SQLite like:
ID   keywords
_____________________________________________________________________________
5    know, summer, last, what
or maybe
keyword   objects
_____________________
know      5, 6
summer    5
last      5, 7, 10
...
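Roughly, the SQLite side would look something like this (just a sketch of the idea; table and column names are made up):

CREATE TABLE keyword_index (
    keyword   TEXT    NOT NULL,
    object_id INTEGER NOT NULL,
    PRIMARY KEY (keyword, object_id)
);

-- Step 1: find matching object IDs in SQLite:
SELECT object_id FROM keyword_index WHERE keyword = 'summer';

-- Step 2: fetch the full rows from MySQL using those IDs:
-- SELECT * FROM articles WHERE id IN (5);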
But you would get a huge database, probably with ~15,000 entries, considering the English vocabulary.
15,000 records is a piece of cake for MySQL and most other RDBMSs. What you should do is set up your text in MyISAM tables so you can take advantage of full-text indexing and searching.
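For example, a sketch based on the example row in the question (the articles table name and column sizes are guesses):

CREATE TABLE articles (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    slug    VARCHAR(255) NOT NULL,
    title   VARCHAR(255) NOT NULL,
    content TEXT         NOT NULL,
    FULLTEXT KEY ft_title_content (title, content)
) ENGINE=MyISAM;

-- Full-text search replaces the hand-rolled keyword table entirely:
SELECT id, title
FROM articles
WHERE MATCH(title, content) AGAINST('summer' IN NATURAL LANGUAGE MODE);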
The whole idea of a database is that it can do query operations very fast and efficiently.
SQLite, on the other hand, is a great tool for development purposes since you do not have to set up a database instance. However, it comes with certain downsides, such as not being able to handle many concurrent connections efficiently, or at all.
Therefore, in my opinion, the suggested approach is not the best, since SQLite would not be able to handle many queries, which defeats the whole purpose of the database.
It might be a lot better just to maintain one high-performance database that can handle all the queries. And there are usually plenty of ways to optimize a database such as MySQL, PostgreSQL, etc.
EDIT
Just a thought: maybe breaking a string into words and treating them as keywords is not the best way. The problem is that the search will only tell you whether a certain keyword was used somewhere in the system; it will not consider the context or the weight of where the keyword came from. I don't know much about search, but some sort of ranking system would seem to be beneficial.
Related
I have an activity-records table named revisions (shown in the following image), built for a big learning management system, which mainly keeps a record of CRUD operations on tables (e.g. who did what on which object, and when).
This table may contain up to 3M records. I want to build search functionality for it on the front end with PHP/Laravel.
Now my question is: what should I consider when building high-performance search for tables with millions of records? What can be done at the code level and the database level, and is there third-party stuff to support these kinds of issues?
I am experienced with building systems with PHP/Laravel, Python/Django, Ruby, etc., but I have never encountered a case like this, dealing with millions of records of data. So please keep my knowledge/experience level in mind; I have NO experience at this level.
Note: the search will be an advanced search, letting users search with different criteria and parameters: which object was changed, who changed it, when it was changed, etc.
Let me know if my question still isn't clear.
I would recommend taking a look at Elasticsearch (https://www.elastic.co/products/elasticsearch) and saving your activity records to its storage whenever you save to the main database. Then you can easily search any field. Elasticsearch can store schema-free JSON documents; if you prefer a more SQL-like way, there is another search engine, Sphinx (http://sphinxsearch.com/).
There is no problem inserting a zillion rows into a table. Performance problems come when you try to do non-trivial SELECTs on the table. You mentioned "search"; you will have to limit what the 'users' can search for. But at least make a stab at what they might want to search for.
You mentioned "searching for an object", but I don't see a column called object. How many rows might there be for a given object? Do you need all the rows? Or selected ones? (An INDEX on object is likely to make the query efficient, regardless of table size.)
Third-party software sometimes gets in the way of dealing with really large tables. Beware.
I'm building a database of IT candidates for a friend who owns a recruitment company. He has a database of thousands of candidates, currently in an Excel spreadsheet, and I'm converting it into a MySQL database.
Each candidate has a skills field with their skills listed as a string, e.g. "javascript, php, nodejs...", etc.
My friend will have employees under him who will also search the database. However, for security reasons we want to limit them to search results for candidates with specific skills, depending on which vacancy they are working on (so they can't steal large sections of the database and go and set up their own recruitment company with the data).
So if an employee is working on a JavaScript role, they will be limited to search results where the candidate has the word "javascript" in their skills field. So if they searched for all candidates named "Michael", it would only return "Michaels" with JavaScript skills, for instance.
My concern is that searches might take too long, since every search must scan the skills field, which can sometimes be a long string.
Is my concern justified? If so is there a way to optimize this?
If the number of records is in the thousands, you probably won't have any speed issues (just make sure you're not querying more often than you should).
You've tagged this question with a 'mysql' tag, so I'm assuming that's the database you're using. Make sure you add a FULLTEXT index to speed up the search. Please note, however, that this type of index is only available for InnoDB tables starting with MySQL 5.6.
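A minimal sketch of what that could look like (assuming a candidates table with name and skills columns):

ALTER TABLE candidates ADD FULLTEXT INDEX ft_skills (skills);

-- Restrict the search to candidates with the required skill:
SELECT id, name
FROM candidates
WHERE MATCH(skills) AGAINST('+javascript' IN BOOLEAN MODE)
  AND name LIKE 'Michael%';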
Try the built-in search first, but if you find it to be too slow, or not accurate enough in its results, you can look at external full-text search engines. I've personally had very good experience with the Sphinx search server, which easily indexed millions of text records and returned good results.
Your queries will require a full table scan (unless you use a full text index). I highly recommend that you change the data structure in the database by introducing two more tables: Skills and CandidateSkills.
The first would be a list of available skills, containing rows such as:
SkillId   SkillName
1         javascript
2         php
3         nodejs
The second would say which skills each person has:
CandidateId   SkillId
1             1
2             1
2             2
This will speed up the searches, but that is not the primary reason. The primary reason is to fix problems and enable functionality such as:
Preventing spelling errors in the list of searches.
Providing a basis for enabling synonym searches.
Making sure thought goes into adding new skills (because they need to be added to the Skills table).
Allowing the database to scale.
If you attempt to do what you want using a full-text index, you will learn a few things. For instance, the default minimum word length is 4, which would be a problem if your skills include "C" or "C++". MySQL doesn't support synonyms, so you'd have to muck around to get that functionality. And you might get unexpected results if you have skills that are multiple words.
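To make the two-table suggestion concrete, here is a rough sketch of the schema and a search query (the Candidates table and its Name column are assumptions):

CREATE TABLE Skills (
    SkillId   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    SkillName VARCHAR(64)  NOT NULL UNIQUE
);

CREATE TABLE CandidateSkills (
    CandidateId INT UNSIGNED NOT NULL,
    SkillId     INT UNSIGNED NOT NULL,
    PRIMARY KEY (CandidateId, SkillId),
    KEY idx_skill (SkillId)
);

-- All candidates named Michael who have the javascript skill:
SELECT c.CandidateId, c.Name
FROM Candidates c
JOIN CandidateSkills cs ON cs.CandidateId = c.CandidateId
JOIN Skills s ON s.SkillId = cs.SkillId
WHERE s.SkillName = 'javascript'
  AND c.Name LIKE 'Michael%';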
Simplified scenario:
I have a table with about 100,000 rows.
I will need to pick about 300-400 rows, based on certain criteria, to display them on a web page.
Considering the above scenario, which one of the below approaches will you recommend?
Approach 1: Use just one database query to select the entire table into one big array of 100,000 rows. Using loops, pick the required 300-400 rows from the array and pass them on to the front-end. Minimum load on the database server, as it's just one query; more load on PHP, as it has to store and search through an array of 100,000 rows.
Approach 2: Using a loop, PHP will generate a new query for each row of required data. Collecting all the data will require 300-400 independent queries. More load on the database server; compared to approach 1, less load on PHP.
Opinions / thoughts will be appreciated!
100,000 rows is a small amount for a MySQL RDBMS.
You would be better off fine-tuning the database server.
So I recommend neither 1 nor 2.
Just:
SELECT * FROM `your_table` WHERE `any_field` = 'YOUR CRITERIA' LIMIT 300;
When your data grows past 1,000,000 rows you should think about serious index optimization, and maybe you'll have to create a stored procedure for a complicated SELECT. I assure you it's not PHP's job in any case.
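For example, an index on the filter column is usually all it takes (assuming any_field from the query above is the column you filter on):

ALTER TABLE your_table ADD INDEX idx_any_field (any_field);

The SELECT above then becomes an index lookup instead of a full table scan.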
Since your question asks from a performance perspective: both approaches consume resources. I would still go for approach 1 in this case, as it doesn't query the database again and again the way generating a query per row (i.e. 300-400 queries) would. When it comes to designing huge projects, the database is always the bottleneck.
To be honest, neither approach is good. It's good practice to have a good database design and query selection; what you are trying to achieve could be done by a suitable query.
Using PHP to loop through the data is really a bad idea. After all, a database is designed to perform queries; PHP would need to loop through all the records and can't use an index to speed things up, which is roughly equivalent to a 'table scan' in the database.
In order to get the most performance out of your database, it's important to have a good design and (for example) create indexes on the right columns.
Also, if you haven't decided yet what RDBMS you're going to use, depending on your usage, some databases have more advanced options that can assist in better performance (e.g. PostgreSQL has support for geographical information)
Please provide some actual data (what kind of data will be stored, what kind of fields) and samples of the kinds of queries/filters that will need to be performed, so that people will be able to give you an actual answer, not a hypothetical one.
I'm building a very large website; currently it uses around 13 tables, and by the time it's done it should be about 20.
I came up with the idea of changing the preferences table to use ID, Key, Value columns instead of many columns; however, I have recently thought I could also store other data inside that table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL Cluster will be used when the site is launched; for now I am testing on a development VPS, but everything will be moved to a dedicated server before launch. I know barely anything about NDB, so this should be fun :)
This model is called EAV (entity-attribute-value).
It is usable for some scenarios; however, it's less efficient due to larger records, a larger number of joins, and the impossibility of creating composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields, etc.
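For illustration, an EAV-style preferences table might look like this (all names are made up), and the second query shows why multi-attribute filtering gets awkward:

CREATE TABLE preferences (
    user_id    INT UNSIGNED NOT NULL,
    pref_key   VARCHAR(64)  NOT NULL,
    pref_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, pref_key)
);

-- Reading one attribute is cheap:
SELECT pref_value FROM preferences WHERE user_id = 42 AND pref_key = 'theme';

-- But filtering on several attributes needs one self-join per attribute,
-- and no composite index can cover them all:
SELECT a.user_id
FROM preferences a
JOIN preferences b ON b.user_id = a.user_id
WHERE a.pref_key = 'theme'    AND a.pref_value = 'dark'
  AND b.pref_key = 'language' AND b.pref_value = 'en';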
Granted, I don't know too much about large database designs, but from what I've seen, even extremely large applications store their data in a fairly small number of tables (20GB per table).
For me, I would rather have more info in one table, as it means the data is not littered everywhere and I don't have to perform operations on multiple tables. Though one table can also get messy (usually, for me, each object has its own table, where an object is something in your application logic, like a User class or a BlogPost class).
I guess what I'm trying to say is: do whatever makes sense. Don't put information about the same thing in two different tables, and don't put information about two things in one table. Stick with one table describing only one kind of object (this is very difficult to explain, but if you do object-oriented programming, you should understand).
Nope. Preferences should be stored as they are (in the users table).
For example, private messages can't be stored in the users table ...
And you don't have to think about joining different tables ...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give), the key-value model is not as efficient speed-wise, though it can be more efficient space-wise.
I would definitely not do this. Basically, if you have a large set of data stored in a single table, you will see performance issues pretty fast when constantly querying that same table. Then think about the joins and the complexity of the queries you're going to need (depending on your site)... not a task I would personally like to undertake.
Using multiple tables splits the data into smaller sets, so the resources required for a query are lower, and as an extra bonus it's easier to program!
There are some applications for doing this, but they are rare; more or less when you have a large table with a ton of columns and most of them aren't going to have a value.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV, but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships, etc., and you should be OK :)
The simple answer is that 20 tables won't make it a big DB, and MySQL won't need any optimization for that. So focus on clean DB structure and normalization instead.
I am building a web app, and I am thinking about how I should build the database.
The app will be fed keywords; it will then retrieve info for those keywords and save it into the database with a datestamp. The info will come from different sources, like the number of results from Yahoo, Diggs from the last month that contain that keyword, etc.
So I was thinking a simple way to do it would be to have a table with id and keyword columns where the keywords are stored, and another table for ALL the data, with id (the same as the keyword's), datestamp, data_name, data_content.
Is this a good way to use MySQL, or could this in some way make queries slower or something? Should I build tables for each type of data I want to use? I am mostly looking for good performance in the application.
Another reason I would like to use only one table for the data is that I can easily add more data_name(s) without touching the db.
In the second table, i.e. the table which contains the various information about the keywords, the id column can be used as a foreign key referencing the id column of the first table.
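Something along these lines (a sketch only; the column types are assumptions, and keyword_id is just a clearer name for the id-that-matches-the-keyword column from the question):

CREATE TABLE keyword (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    keyword VARCHAR(255) NOT NULL UNIQUE
) ENGINE=InnoDB;

CREATE TABLE keyword_data (
    keyword_id   INT UNSIGNED NOT NULL,
    datestamp    DATETIME     NOT NULL,
    data_name    VARCHAR(64)  NOT NULL,   -- e.g. 'yahoo_results', 'diggs_last_month'
    data_content TEXT         NOT NULL,
    KEY idx_keyword_date (keyword_id, datestamp),
    CONSTRAINT fk_keyword_data_keyword
        FOREIGN KEY (keyword_id) REFERENCES keyword (id)
) ENGINE=InnoDB;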
Is this a good way to use MySQL, or could this in some way make queries slower or something? Should I build tables for each type of data I want to use? I am mostly looking for good performance in the application.
A lot of big MySQL players (like, for example, Flickr) use MySQL as a simple KV (key-value) store.
Furthermore, if you are concerned with performance, you should cache your data in memcached/Redis (nothing beats memory).
I have another recommendation: index your content. If you plan on storing content and being able to search it using keywords, then use MySQL to store the details of your documents, like author, text and some other info, and use Lucene to create an index. Lucene is originally for Java, but has ports to many languages, and PHP is no exception.
You can use the Zend Framework to manage Lucene indexes with very little effort; browse through the documentation or look for a tutorial online. The point of this recommendation is simple:
- You'll improve your search time drastically
- Your keyword matching will be more flexible; Lucene gives you real search power
I hope I can help!
Best of luck!