Can anyone give me an idea of how I can do partial keyword searching with a PHP/MySQL search engine?
For example, if a person searches for "just can't get enough", I want it to return results containing the keywords "just can't get enough by black eyed peas" or the keywords "black eyed peas just can't get enough".
Another example: if I enter "orange juice", I want it to return the result with the keywords "orange juice taste good".
It's pretty much like Google and YouTube search.
The code I'm using is: http://tinypaste.com/eac6cf
The search method you've used is the standard way of searching small numbers of records; for around a thousand records it would be fine.
But if you have to search millions of records, this method will be terribly slow and shouldn't be used.
Instead, you have two options:
Explode your search field and build your own index containing single words and a reference to the record position, then search only your index and fetch the corresponding record from the main table (see the sketch after these options).
Use MySQL's full-text search feature. This is easier to implement but has its own restrictions; this way you don't have to build the index yourself.
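A minimal sketch of the first option (building your own word index), with table and column names of my own choosing:

-- one row per (word, record) pair, built by exploding each title in PHP
CREATE TABLE word_index (
  word      VARCHAR(64) NOT NULL,
  record_id INT         NOT NULL,
  PRIMARY KEY (word, record_id),
  KEY (record_id)
);

-- records whose title contains all of the searched words
SELECT record_id
FROM word_index
WHERE word IN ('orange', 'juice')
GROUP BY record_id
HAVING COUNT(DISTINCT word) = 2;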
MySQL full-text search would help here, but it historically only worked with MyISAM tables (InnoDB supports FULLTEXT indexes as of MySQL 5.6), and performance tends to go down the drain when your data set gets quite large.
At the company I work for, we push our search queries to Sphinx. Sites like Craigslist, The Pirate Bay, and Slashdot all use it, so it's pretty much proven for production use.
In MySQL, you can use a MyISAM table and simply define a text field (CHAR, VARCHAR, or TEXT) and then create a FULLTEXT index on it. Just keep in mind the size of the text field: the more allowed characters, the larger the index and the slower it is to update.
Other large data-set options include something like Solr, but unless you already know you're going to have a ton of data, you can certainly start with MySQL and see how it goes.
Most MySQL editors, including phpMyAdmin, provide a GUI for adding indexes. If you're doing it by hand, the code would look something like:
CREATE TABLE IF NOT EXISTS `test2` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` text CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
FULLTEXT KEY `ft_name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
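A query against that FULLTEXT index would then look something like this (natural-language mode is the default):

-- partial phrase search against the ft_name index defined above;
-- note that MyISAM ignores words shorter than ft_min_word_len (default 4)
SELECT id, name
FROM test2
WHERE MATCH(name) AGAINST('just can''t get enough' IN NATURAL LANGUAGE MODE);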
I'm building a mobile app and I use PHP & MySQL to write a backend REST API.
I have to store around 50-60 Boolean values in a table called "Reports" (users have to check things in a form); in my mobile app I store the values (0/1) in a simple array. In my MySQL table, should I create a separate column for each Boolean value, or is it enough to use a string or an int to store them as a "number" like "110101110110111..."?
I get and put the data with JSON.
UPDATE 1: All I have to do is check whether everything is 1; if any of them is 0, that's a "problem". In 2 years this table will have around 15,000-20,000 rows, and it has to be very fast and as space-saving as possible.
UPDATE 2: In terms of speed, which solution is faster: separate columns, or storing it as a string/binary value? What if I have to check which ones are the 0s? Is it a good solution to store it as a "number" in one column and, if it isn't "111...111", send it to the mobile app as JSON, where I parse the value and analyse it on the user's device? Let's say I have to deal with 50K rows.
Thanks in advance.
A separate column per value is more flexible when it comes to searching.
A separate key/value table is more flexible if different rows have different collections of Boolean values.
And, if
your list of Boolean values is more-or-less static
all your rows have all those Boolean values
your performance-critical search is to find rows in which any of the values are false
then using text strings like '1001010010' etc. is a good way to store them. You can search like this:
WHERE flags <> '11111111'
to find the rows you need.
You could use a BINARY column, packing one bit per flag. But your table will be easier to use for casual queries and eyeball inspection if you use text. The space savings from using BINARY instead of CHAR won't be significant until you start storing many millions of rows.
edit It has to be said: every time I've built something like this with arrays of Boolean attributes, I've later been disappointed at how inflexible it turned out to be. For example, suppose it was a catalog of light bulbs. At the turn of the millennium, the Boolean flags might have been stuff like
screw base
halogen
mercury vapor
low voltage
Then, things change and I find myself needing more Boolean flags, like,
LED
CFL
dimmable
Energy Star
etc. All of a sudden my data types aren't big enough to hold what I need them to hold. When I wrote "your list of Boolean values is more-or-less static" I meant that you don't reasonably expect to have something like the light-bulb characteristics change during the lifetime of your application.
So, a separate table of attributes might be a better solution. It would have these columns:
item_id fk to item table -- pk
attribute_id attribute identifier -- pk
attribute_value
This is ultimately flexible. You can just add new flags. You can add them to existing items, or to new items, at any time in the lifetime of your application. And, every item doesn't need the same collection of flags. You can write the "what items have any false attributes?" query like this:
SELECT DISTINCT item_id FROM attribute_table WHERE attribute_value = 0
But, you have to be careful because the query "what items have missing attributes" is a lot harder to write.
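One hedged way to write it, assuming an item table and an attribute table that lists every attribute_id:

-- pair every item with every known attribute, then keep the pairs
-- that have no matching row in attribute_table
SELECT i.item_id, a.attribute_id
FROM item i
CROSS JOIN attribute a
LEFT JOIN attribute_table t
       ON t.item_id = i.item_id AND t.attribute_id = a.attribute_id
WHERE t.item_id IS NULL;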
For your specific purpose, when any zero flag is a problem (an exception) and most entries (like 99%) will be "1111...1111", I don't see any reason to store them all. I would rather create a separate table that only stores the unchecked flags. The table could look like: unchecked_flags (user_id, flag_id). In another table you store your flag definitions: flags (flag_id, flag_name, flag_description).
Then your report is as simple as SELECT * FROM unchecked_flags.
Update - possible table definitions:
CREATE TABLE `flags` (
`flag_id` TINYINT(3) UNSIGNED NOT NULL AUTO_INCREMENT,
`flag_name` VARCHAR(63) NOT NULL,
`flag_description` TEXT NOT NULL,
PRIMARY KEY (`flag_id`),
UNIQUE INDEX `flag_name` (`flag_name`)
) ENGINE=InnoDB;
CREATE TABLE `unchecked_flags` (
`user_id` MEDIUMINT(8) UNSIGNED NOT NULL,
`flag_id` TINYINT(3) UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`, `flag_id`),
INDEX `flag_id` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_flags` FOREIGN KEY (`flag_id`) REFERENCES `flags` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_users` FOREIGN KEY (`user_id`) REFERENCES `users` (`user_id`)
) ENGINE=InnoDB;
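A report that includes the flag names is then a simple join:

-- every unchecked flag per user, with its name and description
SELECT u.user_id, f.flag_name, f.flag_description
FROM unchecked_flags u
JOIN flags f ON f.flag_id = u.flag_id
ORDER BY u.user_id;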
You may get a better search out of using dedicated columns for each Boolean, but the cardinality is poor, and even if you index each column it will involve a fair bit of traversal or scanning.
If you are just looking for HIGH-VALUES (0xFFF...), then definitely bitmap; this solves your cardinality problem (per the OP's update). It's not like you are checking parity... The index tree will, however, be heavily skewed towards HIGH-VALUES if that is the norm, which can create a hot spot prone to node splitting on inserts.
Bit mapping with bitwise operator masks will save space, but the value needs to be aligned to a byte, so there may be an unused "tip" (provisioning for future fields, perhaps); the mask must therefore be of a maintained length, or the field padded with 1s.
It will also add complexity to your architecture that may require bespoke coding and bespoke standards.
You need to analyse how important any searching is (you may not ordinarily expect to be searching all, or even any, of the discrete fields).
This is a very common strategy for denormalising data and also for tuning service requests for specific clients (where some responses are fatter than others for the same transaction).
Case 1: If "problems" are rare.
Have a Problems table with the item's id and a TINYINT holding which of the 50-60 problems it is. With suitable indexes on that table you can look up whatever you need.
Case 2: Lots of items.
Use a BIGINT UNSIGNED to hold up to 64 0/1 values. Use an expression like 1 << n to build a mask for the nth bit (counting from 0). If you know, for example, that there are exactly 55 bits, then the value with all 1s is (1<<55)-1, so you can find the items with "problems" via WHERE bits <> (1<<55)-1.
Bit Operators and functions
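A rough sketch of Case 2, with table and column names of my own choosing:

-- 1 = checked/OK, 0 = problem; 55 flags packed into one BIGINT UNSIGNED
CREATE TABLE reports (
  report_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  bits      BIGINT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (report_id)
);

-- set flag 5 (counting from 0) to 1
UPDATE reports SET bits = bits | (1 << 5) WHERE report_id = 42;

-- rows where at least one of the 55 flags is 0, i.e. a "problem"
SELECT report_id FROM reports WHERE bits <> (1 << 55) - 1;

-- read an individual flag
SELECT report_id, (bits >> 5) & 1 AS flag_5 FROM reports;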
Case 3: You have names for the problems.
SET ('broken', 'stolen', 'out of gas', 'wrong color', ...)
That builds a data type with (logically) one bit for each problem. See also the FIND_IN_SET() function as a way to check for a single problem.
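A sketch of Case 3 (the table name and the sample rows are illustrative):

CREATE TABLE reports_set (
  report_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  problems  SET('broken', 'stolen', 'out of gas', 'wrong color') NOT NULL DEFAULT '',
  PRIMARY KEY (report_id)
);

INSERT INTO reports_set (problems) VALUES ('broken,wrong color'), ('');

-- rows that have the 'stolen' problem
SELECT report_id FROM reports_set WHERE FIND_IN_SET('stolen', problems) > 0;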
Cases 2 and 3 take about 8 bytes for the full set of problems -- very compact. Most SELECTs you might perform would scan the entire table, but 20K rows won't take terribly long and will be a lot faster than having 60 columns or a row per problem.
I want to compare two product databases based on title.
The first data set is about 3 million records and the second is 10 million. I am doing this to remove duplicate products.
I have tried writing a MySQL query in a PHP program that checks the title (name = '$name'); if the query returns zero rows, the product is unique. But it is quite slow, around 2 seconds per result.
The second method I tried was storing the data in a text file and using regular expressions, but that is also slow.
What is the best way to compare such large data sets to find the unique products?
Table DDL:
CREATE TABLE main (
  id int(11) NOT NULL AUTO_INCREMENT,
  name text,
  image text,
  price int(11) DEFAULT NULL,
  store_link text,
  status int(11) NOT NULL,
  cat text NOT NULL,
  store_single text,
  brand text,
  imagestatus int(11) DEFAULT NULL,
  time text,
  PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=9250887 DEFAULT CHARSET=latin1;
Since you have to go over 10 million titles 3 million times, it's going to take some time. My approach would be to see if you can get all titles from both lists into a PHP script and compare them there in memory. Have the script write DELETE statements to a text file, which you then execute on the DB.
Not in your question, but probably your next problem: different spellings. See
similar_text()
soundex()
levenshtein()
for some help with that.
In my opinion this is what databases are made for. I wouldn't reinvent the wheel in your shoes.
Once that's agreed, you should really check your database structure and indexing to speed up your operations.
I have been using SQLyog to compare databases of around 1-2 million rows. It offers "One-way synchronization", "Two-way synchronization" and also "Visually merge data" options to sync the databases.
The important part is that it can compare the data in chunks, and the chunk limit can be specified to avoid connection loss.
If your DB supports it, use a LEFT JOIN and filter on whether the right side is NULL (unique) or NOT NULL (duplicate). But first create indexes on the join key (the name column) in both tables.
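For instance (the second table is assumed to be called `other` with the same `name` column; since `name` is TEXT in the DDL above, the index needs a prefix length):

ALTER TABLE main  ADD INDEX idx_main_name (name(191));
ALTER TABLE other ADD INDEX idx_other_name (name(191));

-- titles from `main` that also appear in `other` (the duplicates);
-- change IS NOT NULL to IS NULL to get the unique ones instead
SELECT m.id, m.name
FROM main m
LEFT JOIN other o ON o.name = m.name
WHERE o.id IS NOT NULL;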
If your computer/server memory can hold the 3 million objects in a HashSet, then create a HashSet using the NAME as the key, read the other set (10 million objects) one by one, and check whether each object exists in the HashSet. If it exists, it is a duplicate. (I would suggest dumping the data into text files and then reading the files to build the structure.)
If the previous strategies fail, then it is time to implement some kind of MapReduce. You can apply one of the previous approaches to a subset of your data, for example all the products whose titles start with a given letter.
I tried a lot with MySQL queries, but it was very slow. The only solution I found was to use Sphinx: index the whole database, search the Sphinx index for every product string, and remove duplicate products at the same time using the ids returned by Sphinx.
I have a table where I store almost any English word. This table is for a Scrabble-type word game I am currently working on. Here is the syntax:
create table words(
`word` varchar(50),
primary key (`word`)
)
This table will be very big, and I have to check whether the given word exists every time a gamer makes a move.
I am using MySQL. Currently I have stored the ENABLE word list there. My question is: when I start adding more words and gamers start to play, won't performance suffer? If so, is there any way I can optimize it? Does NoSQL have anything to do with this scenario?
You should have no performance problems, but if you are worried about performance you can keep this in mind:
Using LIKE instead of = will cause slower queries if you have a lot of rows (but you need an extremely large number of rows for a noticeable difference).
Also, you might do some testing to see which performs better on large tables: SELECT COUNT(*), SELECT *, or SELECT word.
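For the exact-match check itself, something like this is enough (the word is a hypothetical example); the primary key on `word` makes it a single index lookup:

SELECT COUNT(*) AS found
FROM words
WHERE word = 'quixotic';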
I have a table with 1 million unique keywords in all languages, stored in utf8_unicode format. Lately I have been having problems with selects, with each select taking up to 1 second. This is really slowing down the queries.
The structure of the keyword table is (keyword_id, keyword, dirty). keyword_id is the primary key, keyword has a unique index, and dirty has a simple index. keyword is a VARCHAR of at most 20 characters; dirty is a boolean.
The problem occurs when selecting with "keyword" in the WHERE clause. How can I speed this table up?
I am using MySQL with PHP.
SAMPLE QUERY
SELECT keyword_id
FROM table
WHERE keyword = 'movies'
Have you considered using a MEMORY table instead of MyISAM? In my experience it goes about 10 times faster than MyISAM. You'll just need another table to rebuild from if the server crashes. Also, instead of VARCHAR use CHAR(20); this gives the table a fixed row format and MySQL will be able to find its results much faster.
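Something along these lines (a sketch; the column names come from the question, and the persistent table it is rebuilt from is assumed to be called `keywords`):

-- fixed-width CHAR and the MEMORY engine; the contents must be reloaded
-- from a persistent copy after a server restart
CREATE TABLE keywords_mem (
  keyword_id INT UNSIGNED NOT NULL,
  keyword    CHAR(20) NOT NULL,
  dirty      TINYINT(1) NOT NULL DEFAULT 0,
  PRIMARY KEY (keyword_id),
  UNIQUE KEY idx_keyword (keyword)
) ENGINE=MEMORY;

INSERT INTO keywords_mem SELECT keyword_id, keyword, dirty FROM keywords;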
If you have unique keywords and you aren't doing any similarity/LIKE queries, then you can create a hash index. That would guarantee a single-row lookup.
The minor disadvantage is that a hash index may take up more space than a regular (B-tree based) index.
References:
https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
https://dev.mysql.com/doc/refman/8.0/en/create-index.html
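One caveat: MySQL only honors USING HASH for the MEMORY (and NDB) storage engines; InnoDB silently creates a B-tree instead. A sketch using the MEMORY engine (the table name is mine):

CREATE TABLE keywords_hash (
  keyword_id INT UNSIGNED NOT NULL,
  keyword    VARCHAR(20) NOT NULL,
  PRIMARY KEY (keyword_id),
  UNIQUE KEY idx_keyword (keyword) USING HASH
) ENGINE=MEMORY;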
Using MySQL, I need to store a (possibly long) list of x,y coordinates. How should I go about this?
Apologies for the amazing amount of vagueness in this question! I didn't want to explain my entire project, but I suppose some more explanation is in order for this to make any sense as a question.
OK, I'm doing a map/direction web application for a client (no, I've looked into the Google Maps API, but I need to map their buildings/campus, so I don't think that applies well). My current plan is to create some PHP scripts that will run Dijkstra's algorithm (I'm purposely dumbing this down quite a bit because, again, I don't want to explain the whole project), but since that algorithm is based on a graph, I was going to have an Edge table containing various coords so that I know how to draw my lines on the image. Does this make any more sense to you guys now? Again I apologize, I should've gone a little more into my issue originally.
Making a lot of assumptions since your question is vague...
Use two tables with a foreign key; this is the standard approach to model a one-to-many relationship:
create table table1 (
id int primary key
-- more columns, presumably
);

create table coordinates (
id int primary key,
table_id int, -- foreign key to table1
x int,
y int,
foreign key (table_id) references table1 (id)
);
MySQL is a database, which stores data. You can create a table with XCoord and YCoord fields, which can handle millions of rows with ease.
CREATE TABLE Coordinates (
  id int(11) NOT NULL AUTO_INCREMENT,
  X double NOT NULL,
  Y double NOT NULL,
  PRIMARY KEY (id)
);