I'm building a database of IT candidates for a friend who owns a recruitment company. He currently has thousands of candidates in an Excel spreadsheet, and I'm converting it into a MySQL database.
Each candidate has a skills field with their skills listed as a string, e.g. "javascript, php, nodejs, ...".
My friend will have employees under him who will also search the database. However, we want to limit them to search results with candidates who have specific skills, depending on what vacancy they are working on, for security reasons (so they can't steal large sections of the database and set up their own recruitment company with the data).
So if an employee is working on a JavaScript role, they will be limited to search results where the candidate has the word "javascript" in their skills field. If they searched for all candidates named "Michael", for instance, it would only return Michaels with JavaScript skills.
My concern is that the searches might take too long, since every search must scan the skills field, which can sometimes be a long string.
Is my concern justified? If so is there a way to optimize this?
If the number of records is in the thousands, you probably won't have any speed issues (just make sure you're not querying more often than you should).
You've tagged this question with 'mysql', so I'm assuming that's the database you're using. Make sure you add a FULLTEXT index to speed up the search. Please note, however, that this type of index is only available for InnoDB tables starting with MySQL 5.6.
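A minimal sketch of what that could look like, assuming a candidates table with id, name and skills columns (only skills comes from the question; the other names are illustrative):

-- FULLTEXT index on the skills string (InnoDB, MySQL 5.6+):
ALTER TABLE candidates ADD FULLTEXT INDEX ft_skills (skills);

-- The employee's vacancy skill becomes a required term in boolean mode:
SELECT id, name
FROM candidates
WHERE MATCH(skills) AGAINST('+javascript' IN BOOLEAN MODE)
  AND name LIKE 'Michael%';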
Try the built-in search first, but if you find it too slow, or not accurate enough in its results, you can look at external full-text search engines. I've personally had very good experience with the Sphinx search server, which easily indexed millions of text records and returned good results.
Your queries will require a full table scan (unless you use a full text index). I highly recommend that you change the data structure in the database by introducing two more tables: Skills and CandidateSkills.
The first would be a list of available skills, containing rows such as:
SkillId SkillName
1 javascript
2 php
3 nodejs
The second would say which skills each person has:
CandidateId SkillId
1 1
2 1
2 2
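In SQL, that structure might look like the following sketch (key types and constraints are illustrative):

CREATE TABLE Skills (
  SkillId   INT PRIMARY KEY,
  SkillName VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE CandidateSkills (
  CandidateId INT NOT NULL,
  SkillId     INT NOT NULL,
  PRIMARY KEY (CandidateId, SkillId),
  FOREIGN KEY (SkillId) REFERENCES Skills(SkillId)
);

-- Find every candidate with the javascript skill via the index:
SELECT cs.CandidateId
FROM CandidateSkills cs
JOIN Skills s ON s.SkillId = cs.SkillId
WHERE s.SkillName = 'javascript';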
This will speed up the searches, but that is not the primary reason. The primary reason is to fix problems and enable functionality such as:
Preventing spelling errors in the list of skills.
Providing a basis for enabling synonym searches.
Making sure thought goes into adding new skills (because they need to be added to the Skills table).
Allowing the database to scale.
If you attempt to do what you want using a full text index, you will learn a few things. For instance, the default minimum word length is 4, which would be a problem if your skills include "C" or "C++". MySQL doesn't support synonyms, so you'd have to muck around to get that functionality. And you might get unexpected results if you have skills that are multiple words.
I have an activity records table named revisions (shown in the following image), built for a big learning management system, which mainly keeps a record of CRUD operations on tables (e.g. who has done what to which object at what time).
This table may contain up to 3M records. I want to build search functionality for it on the front end with PHP/Laravel.
My question is: what should I consider when building high-performance search functionality for tables with millions of records? What can be done at the code level and the database level, and are there third-party tools that help with this kind of problem?
I am experienced with building systems in PHP/Laravel, Python/Django, Ruby, etc., but I have never dealt with a case like this, with millions of records. So please keep my knowledge/experience level in mind: I have NO experience at this scale.
Note: the search will be an advanced search, letting users search with different criteria and parameters: which object was changed, who changed it, when it was changed, etc.
Let me know if my question still isn't clear.
I would recommend taking a look at Elasticsearch (https://www.elastic.co/products/elasticsearch) and saving your activity records to its storage whenever you save to the main database. Then you can easily search any field. Elasticsearch stores schema-free JSON documents; if you prefer a more SQL-like approach, there is another search engine, Sphinx (http://sphinxsearch.com/).
There is no problem inserting a zillion rows into a table. Performance problems come when you try to do non-trivial SELECTs on the table. You mentioned "search"; you will have to limit what the 'users' can search for. But at least make a stab at what they might want to search for.
You mentioned "searching for an object", but I don't see a column called object. How many rows might there be for a given object? Do you need all the rows? Or selected ones? (An INDEX on object is likely to make the query efficient, regardless of table size.)
Third-party software sometimes gets in the way of dealing with really large tables. Beware.
I am working on a project with a database that contains a summary field, which is filled in via a web form by visitors to the site.
When the user finishes entering the summary field, I want to look up similar records in the database using the words the user entered, i.e. records containing the same keywords they've filled in on the page.
I was thinking I could split the submitted summary string, then loop through the array and build up a query, so the query would end up something like:
SELECT *
FROM my_table
WHERE summary LIKE '%keyword1%'
OR summary LIKE '%keyword2%'
OR summary LIKE '%keyword3%';
However, this seems massively inefficient, and as the database grows it could become quite a slow query to run.
I then found the MySQL IN clause, but this only seems to work with multiple values where a field can contain only one of those values per row.
Is there a way I can use the IN function, or is there a better MySQL function that I can use to do what I want, or is my first idea the only way round it?
An example of what I am trying to achieve is a bit like on Stack Overflow. When you lose focus of the title field, it pops up similar questions based on the title you've provided.
I would recommend reading the manual page on InnoDB FULLTEXT Indexes and the one on Full-Text Restrictions. New full-text functionality has been incorporated in recent releases of MySQL, extending its use to InnoDB tables.
Concerning the inability to upgrade a MySQL version: there is no reason why one cannot mix and match MyISAM and InnoDB tables in the same database. As such, one would keep textual information in MyISAM (where FTS index power has historically been available) and join to InnoDB tables when needed. This avoids the "must upgrade to version 5.6" argument.
Legend: FTS=Full Text Search
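To make that concrete, here is a minimal sketch using my_table and summary from the question (the keyword string is a placeholder for the user's summary text):

ALTER TABLE my_table ADD FULLTEXT INDEX ft_summary (summary);

-- Natural language mode ranks rows by relevance, which suits a
-- "similar records" feature like Stack Overflow's:
SELECT *,
       MATCH(summary) AGAINST('keyword1 keyword2 keyword3') AS relevance
FROM my_table
WHERE MATCH(summary) AGAINST('keyword1 keyword2 keyword3')
ORDER BY relevance DESC
LIMIT 10;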
There's a very similar question: Modeling products with vastly different sets of needed-to-know information and linking them to lineitems? But I can't find an answer there that helps me.
Someone in the above Q&A points to "Designing a database to hold different metadata information", which has a fantastic accepted answer, but since a search function is explicitly needed in my program, I don't want performance to be compromised.
I'm a "technician" that uses PHP + Oracle to keep track of the selling progress of our company and generate reports. Our workflow generally looks like this:
Marketing guys provides prepared data-set to my system;
Frontline staff (sales) mark progress in my system;
Anyone can search results in the system;
I generate reports back to marketing guys.
The problem:
Many columns of the data-sets are the same (or can be considered the same), like these:
account|customer_name|gender|location|program_segment|...
But the marketing dept. likes coming up with new ideas (and abandoning existing ones), so each "sales program" (campaign) may have its own data, e.g.
For program 1, they may contain:
...|prev_coupon_code|last_usage_amount|...
For program 2, however, they may contain:
...|is_in_plan_1|is_in_plan_2|...
You get the idea.
Unsuccessful attempts:
In order to hold all the data, I used to use a "long enough" table with all possible properties (columns), leaving unneeded properties NULL.
But now I feel it will never be "long enough", as there are too many "properties" and even more "sales focusing points": I drafted a 41-column table for a new version of the system, and suddenly they proposed a new program whose information can't fit.
Someone suggested that I create "dummy columns" in the table and "remember" their different meanings in the frontend. This can work for some datatypes, like NUMBER(1) for Y/N, DATE, etc., but when it comes to VARCHAR2, I'm not sure how many would be enough... plus this makes the table look "dirty".
Question:
Frustrated, I'm now seriously considering using different tables for different programs, and using a UNION clause to generate the big report when asked "how are we selling this month/season/year?"
Technically, is this a good practice? Should I implement it?
Edit #1:
To clarify, one "sales program" will generally run for a few months before it is abandoned, and there will be at least one data-set per month for each running program.
And there can be more than one program running at the same time.
Edit #2:
Those "program-specified" columns are of various number: one program may need 10, while another may only need 1.
This is one of those situations where there is no right answer, just a choice of kludges.
I would plump for using an XMLType to hold the transient data structures. XML gives us the ability to have defined schemas for each plan, while using an XMLType obviates the need to change the database itself, and we can index XPath queries so performance can still be good.
The one problem is that writing queries against XML is a bit of a faff, but I think awkward queries will be an issue with whichever approach you take.
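A rough sketch of what that could look like (the table and element names are illustrative, not from the question; prev_coupon_code is borrowed from the program 1 example):

-- Common columns stay relational; program-specific data goes in XMLType.
CREATE TABLE program_data (
  account       VARCHAR2(20),
  customer_name VARCHAR2(100),
  program_id    NUMBER,
  extra_data    XMLTYPE
);

-- XPath predicate against the per-program payload:
SELECT account, customer_name
FROM   program_data
WHERE  XMLEXISTS('/program[prev_coupon_code="ABC123"]' PASSING extra_data);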
You may or may not be aware that it is possible to index the contents of a character LOB in Oracle. Look up Oracle Intermedia / Multimedia (the name depends on your version) and talk to your DBAs to see if it is available to you.
This would make it possible to create a common structure for common data items - e.g. campaign, start_date, end_date, &c - and then to dump your spreadsheet/XML/CSV data into a CLOB field.
The plain-text indexing is not as hard as it first sounds and it is very cute indeed.
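As a hedged sketch of the idea (table and column names are made up; campaign, start_date and end_date echo the common items above, and Oracle Text must be installed):

CREATE TABLE campaign_dump (
  campaign   VARCHAR2(50),
  start_date DATE,
  end_date   DATE,
  raw_data   CLOB
);

-- Oracle Text index over the dumped spreadsheet/CSV contents:
CREATE INDEX campaign_dump_txt ON campaign_dump (raw_data)
  INDEXTYPE IS CTXSYS.CONTEXT;

SELECT campaign
FROM   campaign_dump
WHERE  CONTAINS(raw_data, 'coupon') > 0;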
If you go down the different-tables path you will forever be changing code to keep up with the changing columns.
One option would be to have two additional columns, campaign_name and campaign_value, and to put the column name they send you in the name column and its value in the value column.
So,
account|customer_name|.....|campaign_name|campaign_value
'ACC001'|'Frank Burns'|........|'prev_coupon_code'|[value of prev_coupon_code]
and then in your 2nd example:
account|customer_name|.....|campaign_name|campaign_value
'ACC001'|'Frank Burns'|........|'is_in_plan_1'|[value of is_in_plan_1]
Update - yes, this would involve changing the grain of the table, so you would add a set of rows for each of the campaigns.
The import would be a little different in that you'd UNION the records for each of the column names that appear in the file, and the reporting would need to take the grain change into account.
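A sketch of such an import (staging_program1 and the target table name are hypothetical; the column names come from the program 1 example above):

-- Each wide column from the marketing file becomes one name/value row:
INSERT INTO sales_data (account, customer_name, campaign_name, campaign_value)
SELECT account, customer_name, 'prev_coupon_code', prev_coupon_code
FROM   staging_program1
UNION ALL
SELECT account, customer_name, 'last_usage_amount', TO_CHAR(last_usage_amount)
FROM   staging_program1;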
It sounds like a complete waste of space, but if these are Excel sheets then performance shouldn't matter. If it did, you would need to split the data into three tables: campaigns, accounts, and accounts_campaigns.
In my current job, I have successfully used the following system for two years.
You have one main table, let's say 'report', that consists of the columns common to all kinds of reports.
id - primary, auto_increment.
name - name of the report.
Then, for each specific report, you have another table, called something like "report_marketing". There you have a report_id column, which is a foreign key to the main table, and there you add all the columns specific to this report.
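A minimal sketch of that layout (the type and amount columns are taken from the example query below; everything else is illustrative):

CREATE TABLE report (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  type CHAR(1) NOT NULL
);

CREATE TABLE report_marketing (
  report_id INT NOT NULL,
  amount    DECIMAL(12,2),
  FOREIGN KEY (report_id) REFERENCES report(id)
);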
To get results, you simply use LEFT JOIN.
If some reports share columns from two or more of these tables, you can always join more than one of them.
Here is an example of a query you might have:
SELECT report.name, report_marketing.amount
FROM report
LEFT JOIN report_marketing ON report_marketing.report_id = report.id
WHERE report.type = 'M';
I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience adding autocomplete based on existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google has, without having enough users to generate that data (the most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?
For now, use the static data that you have for autocomplete.
Create another table in your database to store actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is submitted by another user (a kind of rank). Build an n-gram index over the queries (so that you can also autocomplete "Manchester United" when a person types just "United", i.e. not only on the starting string) and simply return the top N after sorting by count.
The above table will gradually keep improving as your user base grows.
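A minimal sketch of that table and its upkeep (prefix matching is shown for brevity; a real n-gram index, as described above, would also match mid-string):

CREATE TABLE search_queries (
  queryID INT AUTO_INCREMENT PRIMARY KEY,
  query   VARCHAR(255) NOT NULL UNIQUE,
  `count` INT NOT NULL DEFAULT 1
);

-- Log a search, bumping the counter when the same query repeats:
INSERT INTO search_queries (query) VALUES ('manchester united')
  ON DUPLICATE KEY UPDATE `count` = `count` + 1;

-- Return the top suggestions for what the user has typed so far:
SELECT query FROM search_queries
WHERE query LIKE 'manch%'
ORDER BY `count` DESC
LIMIT 10;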
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge lies in returning the data to be displayed in a fraction of a second. So when your query store grows, you can use a search engine like Solr or Sphinx, which will be very fast at returning the results to be rendered.
You can use the Lucene search engine for this functionality, or you may also have a look at Lucene/Solr autocomplete...
Google has (and keeps growing) thousands of entries, arranged according to day, time, geolocation, language, etc., and the set increases with users' entries. Whenever a user types a word, the system checks the table of "most-used words for that location + day + time" and, if there is no answer, falls back to "general words". So you should categorize every word entered by users, or build a general word-relation table in your database, where the most suitable search answer is referenced.
Yesterday I stumbled on something that answered my question. Google draws autocomplete suggestions from this XML endpoint, so it is wise to use it if you have too few users to build your own keyword database:
http://google.com/complete/search?q=[keyword]&output=toolbar
Just replace [keyword] with some word to get suggestions about that word; then the task is just to parse the returned XML and format the output to suit your needs.
I have a location-search website for a city. We started out by collecting data for all possible categories in the city, like schools, colleges, departmental stores, etc., and stored each category's information in a separate table, as each entry had different details apart from name, address and phone number.
We had to integrate search into the website so people could find information, so we built an index table storing each category, related keywords for that category, and the table that must be queried when that category is searched for. Later on we added searching on name and address as well, by adding another master table gathering those fields from all the tables into one place. Now my doubts are the following:
The application design is improper: we have written queries like select * from master where name like "%$input%" all over the place. Since our stack is MySQL with PHP on the server side, are there any suggestions for improving the design of the system?
People want more features, like splitting the keywords and ranking results by relevance. Is there any ready framework available that runs search on a database?
I tried using Full Text Search in MySQL and it seems effective to me; is that enough?
Correct me if I am wrong: I had a look at Lucene and Google Custom Search, but don't they work by crawling existing webpages and building their own index? I have a collection of tables in a MySQL database on which I have to apply searching. What options do I have?
To address your points:
Using %input% is very bad. It causes a full table scan on every query. Under any real load, or on even a remotely large dataset, your DB server will choke.
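To make the cost concrete, a quick sketch against the master table from the question ('station' is just a placeholder input):

-- Leading wildcard: no B-tree index can be used, so every row is scanned.
SELECT * FROM master WHERE name LIKE '%station%';

-- A prefix-only match, by contrast, can use an ordinary index:
CREATE INDEX idx_master_name ON master (name);
SELECT * FROM master WHERE name LIKE 'station%';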
An RDBMS alone is not a good solution for this. You are looking in the right place by seeking a separate solution for search. Something which can communicate well with your RDBMS is good; something that runs inside an RDBMS won't do what you need.
Full Text Search in MySQL is workable for very basic keyword searches, nothing more. Its scope of usefulness is extremely limited: you need a highly predictable usage model to leverage the built-in searching. It is called "search", but it's not search the way most people think of it after Google and Bing. In that sense of the word, it's Notepad vs. Word - both are things you type into, but that's about it.
As far as separate systems for handling search go, Lucene is very good. Lucene works however you want it to work, essentially; you can interact with it programmatically to insert indexable documents. Likewise, a Google Search Appliance (not Google Custom Search) can be given direct meta feeds exposing whatever you want indexed, such as data straight from a database.
Take a look at sphinx: http://www.sphinxsearch.com/
Per their site:
How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
It's quite popular with a lot of people in the rails community right now, and they all rave about how awesome it is :)