MySQL Optimization for data tables and Query optimization - php

There are a LOT of details here, but my main objective is to do this as fast as possible.
I am calling an API which returns a large json encoded string.
I am storing the quoted encoded string into MySQL (InnoDB) with 3 fields: (tid (key), json, tags) in a table called store.
Up to 3+ months later, I will pull information from this database using:
WHERE tags LIKE "%something%" AND tags LIKE "%somethingelse%"
Tags are + delimited. (Which makes them too long to be efficiently keyed.)
'anime+pikachu+shingeki no kyojin+pokemon+eren+attack on titan+'
I do not wish to repeat API calls at ANYTIME. If you are going to include an API call use:
API(tag, time);
All of the JSON data is needed.
This table is an active archive.
One Idea I had was to put the tags into their own 2 column table (pid, tag (key)). pid points to tid in the store table.
Are there any MySQL configurations I can change to make this faster?
Are there any table structure changes I can do to make this faster?
Is there anything else I can do to make this faster?
QUOTED JSON Example (Messy, to see another clean example see TUMBLR APIv2):
'{\"blog_name\":\"roxannemariegonzalez\",\"id\":62108559921,\"post_url\":\"http:\\/\\/\\/post\\/62108559921\",\"slug\":\"\",\"type\":\"photo\",\"date\":\"2013-09-24 00:36:56 GMT\",\"timestamp\":1379983016,\"state\":\"published\",\"format\":\"html\",\"reblog_key\":\"uLdTaScb\",\"tags\":[\"anime\",\"pikachu\",\"shingeki no kyojin\",\"pokemon\",\"eren\",\"attack on titan\"],\"short_url\":\"http:\\/\\/\\/ZxlLExvrzMen\",\"highlighted\":[],\"bookmarklet\":true,\"note_count\":19,\"source_url\":\"http:\\/\\/\\/entry\\/78231354\\/via\\/roxannegonzalez?page=2\",\"source_title\":\"\",\"caption\":\"\",\"link_url\":\"http:\\/\\/\\/entry\\/78231354\\/via\\/roxannegonzalez\",\"image_permalink\":\"http:\\/\\/\\/image\\/62108559921\",\"photos\":[{\"caption\":\"\",\"alt_sizes\":[{\"width\":500,\"height\":444,\"url\":\"http:\\/\\/\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_500.png\"},{\"width\":400,\"height\":355,\"url\":\"http:\\/\\/\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_400.png\"},{\"width\":250,\"height\":222,\"url\":\"http:\\/\\/\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_250.png\"},{\"width\":100,\"height\":89,\"url\":\"http:\\/\\/\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_100.png\"},{\"width\":75,\"height\":75,\"url\":\"http:\\/\\/\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_75sq.png\"}],\"original_size\":{\"width\":500,\"height\":444,\"url\":\"http:\\/\\/\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_500.png\"}}]}'

Look into MySQL's MATCH()/AGAINST() functions and the FULLTEXT index feature; this is probably what you are looking for. Make sure a FULLTEXT index will operate reasonably on a JSON document.
What kind of data sizes are we talking about? Large amounts of memory are cheap these days, so having the entire MySQL dataset buffered in memory, where you can do full text scans, isn't unreasonable.
Breaking out some of the json field values and putting them into their own columns would allow you to search quickly for those ... but that doesn't help you for the general case.

This option you suggested is the correct design:
One Idea I had was to put the tags into their own 2 column table (pid,
tag (key)). pid points to tid in the store table.
But if you're searching LIKE '%something%' then the leading '%' means the index can only be used to reduce disk reads: you will still need to scan the entire index.
If you can drop the leading % (because each row now holds one entire tag) then this is certainly the way to go. The trailing '%' is not as important.
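The two-table design discussed above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for MySQL/InnoDB, the helper names `save_post` and `find_with_all_tags` are hypothetical, and the index layout (tag first, then pid) is one reasonable choice rather than the only one. The point is that each tag becomes a short exact-match key, so no leading '%' is needed.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (tid INTEGER PRIMARY KEY, json TEXT NOT NULL);
    CREATE TABLE tags  (pid INTEGER NOT NULL REFERENCES store(tid),
                        tag TEXT NOT NULL,
                        PRIMARY KEY (tag, pid));  -- index leads with tag
""")

def save_post(tid, json_text, tag_list):
    """Store the raw JSON once, plus one row per tag (hypothetical helper)."""
    conn.execute("INSERT INTO store (tid, json) VALUES (?, ?)", (tid, json_text))
    conn.executemany("INSERT INTO tags (pid, tag) VALUES (?, ?)",
                     [(tid, t) for t in tag_list])

def find_with_all_tags(wanted):
    """Return tids of posts that carry every tag in `wanted` (exact matches)."""
    marks = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"SELECT pid FROM tags WHERE tag IN ({marks}) "
        "GROUP BY pid HAVING COUNT(DISTINCT tag) = ? ORDER BY pid",
        (*wanted, len(wanted)))
    return [r[0] for r in rows]

save_post(1, '{"id": 1}', ["anime", "pikachu", "pokemon"])
save_post(2, '{"id": 2}', ["anime", "eren", "attack on titan"])
matches = find_with_all_tags(["anime", "pokemon"])
```

The JSON itself is then fetched from store by tid, which is a primary-key lookup.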


Speed of SELECT Distinct vs array unique

I am using WordPress with some custom post types (just to give a description of my DB structure: it's WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID postID city state
auto int varchar varchar
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forced me to add/update cities and states for every post, even though this will cause duplicates (in the city/state fields). This must be so because I must keep track of which states/cities exist (actually have a post associated with them). When a post is added or deleted, it takes its record to or from the cities table with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.
To answer your question about using SELECT DISTINCT vs. array_unique: I would almost always prefer to limit the result set in the database, assuming of course that you have an appropriate index on the field for which you are trying to get distinct values. This saves the time spent transmitting extra data from the DB to the application and the time the application spends reading that data into memory before it can work with it.
As for your separate table design, it is hard to say whether this is a good approach or not. It would largely depend on how you are actually performing your query (i.e. are you doing two separate queries, one for post info and one for city/state info, or querying across a join?).
There is really only one definitive way to determine which approach is fastest: test both ways in your environment.
1) A fully normalized table (one that holds only integer values, while the other tables hold only one int + varchar each) has an advantage when you don't do full table joins often and do a lot of searches on the normalized fields. The downside is that it requires large join/sort buffers and results in more complex queries, which means much less chance that MySQL will optimize the query automatically. So you have to optimize your queries yourself.
2) SELECT DISTINCT will be faster in almost every case. The only case where it will be slower is when you have a small sort buffer in /etc/my.cnf and a much larger memory buffer for PHP.
A DISTINCT select can use indexes, while your code can't.
Also, sending a large amount of data to your app requires a lot of MySQL CPU time and wall-clock time.
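To make the comparison above concrete, here is a small sketch of both options. It assumes Python's sqlite3 module in place of MySQL and PHP, and the index name `idx_city_state` and the sample rows are made up for the illustration. Both routes yield the same list; the difference is how many rows cross the database boundary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, post_id INTEGER, "
             "city TEXT, state TEXT)")
# Hypothetical index so the database can deduplicate from the index alone.
conn.execute("CREATE INDEX idx_city_state ON cities (state, city)")
conn.executemany(
    "INSERT INTO cities (post_id, city, state) VALUES (?, ?, ?)",
    [(1, "Austin", "TX"), (2, "Austin", "TX"), (3, "Dallas", "TX"), (4, "Boise", "ID")])

# Option A: let the database remove duplicates; only unique rows are transferred.
distinct_rows = conn.execute(
    "SELECT DISTINCT city, state FROM cities ORDER BY state, city").fetchall()

# Option B: transfer every row, then deduplicate in the application
# (the counterpart of SELECT * followed by PHP's array_unique).
all_rows = conn.execute("SELECT city, state FROM cities").fetchall()
deduped = sorted(set(all_rows), key=lambda r: (r[1], r[0]))
```

With heavy duplication, Option B moves many more rows than it keeps, which is the transfer cost the answers describe. Timing both against real data is still the only reliable comparison.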

Database/datasource optimized for string matching?

I want to store large amount (~thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
(each line has additionnal data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters on the left of the wildcard, and in my case it can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution, that could be used from PHP. And I am open: it can be a separate server, a PHP library using files with regex, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
You might want to use a multicore approach to run that search in a fraction of the time. For search and matching I would recommend FPGAs, but that's probably the hardest way to do it. Consider THIS ARTICLE using CUDA, which reports searches about 16x faster than usual. On multicore CPU systems you can use POSIX threads, or a cluster of computers (MPI, for example) to do the job, or you can call a Gearman service to run the searches using advanced algorithms.
Were it me, I'd store the key field two times: once forward and once reversed (see MySQL's REVERSE function). You can then search the index with left(main_field) and left(reversed_field). It won't help you when you have a wildcard in the middle of the string AND at the beginning (e.g. "*Folder1*Folder2"), but it will when you have a wildcard at the beginning or the end.
E.g. if you want to search */Folder1 then search where left(reverse_field, 8) = '1redloF/';
for Folder1/*/FolderX search where left(reverse_field, 8) = 'XredloF/' and left(main_field, 8) = 'Folder1/'
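A rough sketch of the forward-plus-reversed idea, assuming Python's sqlite3 in place of MySQL. The table name `paths`, the helper `search`, and the sample keys are made up. The comparison is written as LIKE 'prefix%' rather than LEFT(), since a prefix LIKE is the form a B-tree index can generally serve. The sample data is assumed to contain no literal % or _ characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paths (main_field TEXT PRIMARY KEY, reversed_field TEXT)")
conn.execute("CREATE INDEX idx_reversed ON paths (reversed_field)")

keys = ["Folder1/Folder2/FolderX", "Folder1/Folder3",
        "Folder2/Folder1", "Folder9/FolderX"]
# Store every key twice: as-is and reversed (REVERSE() would do this in MySQL).
conn.executemany("INSERT INTO paths VALUES (?, ?)", [(k, k[::-1]) for k in keys])

def search(prefix="", suffix=""):
    """Match keys of the form prefix*suffix using two prefix comparisons."""
    rows = conn.execute(
        "SELECT main_field FROM paths "
        "WHERE main_field LIKE ? AND reversed_field LIKE ? "
        "AND length(main_field) >= ? ORDER BY main_field",
        (prefix + "%", suffix[::-1] + "%", len(prefix) + len(suffix)))
    return [r[0] for r in rows]

ends_with_folder1 = search(suffix="/Folder1")            # pattern */Folder1
between = search(prefix="Folder1/", suffix="/FolderX")   # pattern Folder1/*/FolderX
```

The length check keeps the prefix and suffix from overlapping on the same characters, so a key like "Folder1/FolderX" would not count as a match for Folder1/*/FolderX.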
If your strings represent some kind of hierarchical structure (as it looks like in your sample content), actually not "real" files, but you say you are open to alternative solutions - why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob - thanks to the hierarchical file structure a glob search should be much faster than searching up all your database entries.
If needed you can match the results to your MySQL data - thanks to your MySQL index on the key this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only compete on a huge data set (but not too huge, as @Kyle mentioned) with a hierarchical structure that is deep rather than wide.
Sorry this would only work if the wildcards are in your search terms not in the stored strings itself.
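The file-based index described above could look roughly like this. It is a sketch in Python rather than PHP, using a temporary directory as the myindex location. The helper names `build_index` and `find` are hypothetical. It assumes no key is also a parent folder of another key, since a path cannot be both a file and a directory.

```python
import glob
import os
import tempfile

def build_index(root, keys):
    """Create one empty file per key, using the key as a relative path."""
    for key in keys:
        path = os.path.join(root, *key.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

def find(root, pattern):
    """Return the stored keys matching a glob pattern such as '*/Folder3'."""
    hits = glob.glob(os.path.join(root, *pattern.split("/")))
    # Keep only files: directories are intermediate levels, not stored keys.
    return sorted(os.path.relpath(h, root).replace(os.sep, "/")
                  for h in hits if os.path.isfile(h))

with tempfile.TemporaryDirectory() as myindex:
    build_index(myindex, ["Folder1/Folder2/File1", "Folder1/Folder3", "Folder4/Folder3"])
    ending_in_folder3 = find(myindex, "*/Folder3")
    under_folder1 = find(myindex, "Folder1/*")
```

Note that a glob `*` matches within a single path level only, so this behaves differently from a SQL `%` that can span several folders.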
As the wildcards (*) are in your data and not in your queries I think you should start with breaking up your data into pieces. You should create an index-table having columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*" store a value of "Folder1/" to "wildcardEnd" and again assign the id of the value in the main table to the "dataGroup" field in above Table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume mysql is clever enough not to do the truncate for every row but to start with the first character and match it against every row (using B-Tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart" but reversed. To cite Missy Elliott: "Is it worth it, let me work it, I put my thing down, flip it and reverse it". So store a value of "4redloF/" in "wildcardStart". Then a WHERE like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
of course you could do the "REVERSE" already in your application logic.
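The exact/prefix/suffix lookups above can be tried out with a small sketch. It assumes Python's sqlite3 in place of MySQL, so `substr(x, 1, n)` replaces `LEFT(x, n)` and the reversal is done in application code. The function name `matching_groups` and the sample rows are made up, and only single-record patterns are covered (not the split "*/Fo*4" case).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE indexTable (
    dataGroup INTEGER, exactString TEXT, wildcardEnd TEXT, wildcardStart TEXT)""")
conn.executemany("INSERT INTO indexTable VALUES (?, ?, ?, ?)", [
    (1, "Folder1/Folder2", None, None),   # stored value: Folder1/Folder2
    (2, None, "Folder1/", None),          # stored value: Folder1/*
    (3, None, None, "4redloF/"),          # stored value: */Folder4, kept reversed
])

def matching_groups(search):
    """Return the dataGroup ids whose stored pattern matches `search`."""
    rows = conn.execute(
        "SELECT dataGroup FROM indexTable "
        "WHERE exactString = :s "
        "   OR wildcardEnd = substr(:s, 1, length(wildcardEnd)) "
        "   OR wildcardStart = substr(:r, 1, length(wildcardStart)) "
        "ORDER BY dataGroup",
        {"s": search, "r": search[::-1]})  # reversal done in the application
    return [r[0] for r in rows]

long_path = matching_groups("Folder1/WhatAmILookingFor/Folder4")
short_path = matching_groups("Folder1/Folder2")
```

Rows with a NULL in a wildcard column drop out of that comparison automatically, because a comparison with NULL is never true.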
Now about the tricky part. Something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now if you match something you have to take care that every index-record of a dataGroup gets returned for a complete match and that no overlapping occurs. This could also get solved in SQL but is beyond this question.
A database isn't the right tool for these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searches in memory. Load all the strings from the database (a few thousand strings is really no biggie), cache them and run your search/match algorithm on them.
You will probably have to code the algorithm yourself, because the standard tools will be overkill for what you are trying to achieve and there is no guarantee that they will be able to do exactly what you need.
I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work until you get the regex right, but it will be the fastest way to go.
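A minimal sketch of that regex approach, in Python rather than PHP. It assumes `*` is the only wildcard and that it may match any run of characters, including "/". The function names and sample patterns are illustrative only.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a stored pattern such as '*/Fo*4' into a compiled regex."""
    # Escape the literal pieces, then let each '*' match any run of characters.
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(parts))

# In practice these would be loaded from the database once and cached.
stored = ["Folder1/*", "*/Folder4", "*/Fo*4", "Folder1/Folder2"]
compiled = [(p, wildcard_to_regex(p)) for p in stored]

def matches(text):
    """Return every stored pattern that matches the whole input string."""
    return [p for p, rx in compiled if rx.fullmatch(text)]

hits = matches("Folder1/Sub/Folder4")
```

With a few thousand patterns, compiling once and looping over them per lookup is usually cheap. If `*` must not cross a "/", replacing `.*` with `[^/]*` would be the corresponding change.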
I suggest reading the keys and their associated payload into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped" then you can avoid the (slight additional) overhead building of a balanced tree. You also can avoid any tree maintenance code as, if I understand your problem correctly, the data will be changing frequently and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straight-forward and much more efficient than just running a regex against a bunch of strings. You may even find while working it through that your wild cards in the tree will lead to some shortcuts to prune the search space. A quick search show lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, count(*) FROM your_sample_table GROUP BY folder_col, do you get duplicate folder_col values (i.e. count(*) greater than 1)?
If not, that means you can produce an SQL query that would generate a valid Sphinx index.
I wouldn't recommend to do text search on large collection of data in MySQL. You need a database to store the data but that would be it. For searching use a search engine like:
Solr
Elasticsearch
Sphinx
Those services will allow you doing all sort of funky text search (including Wildcards) in a blink of an eye ;-)

Large mysql query in PHP

I have a large table of about 14 million rows. Each row contains a block of text. I also have another table with about 6000 rows, and each row has a word and six numerical values for that word. I need to take each block of text from the first table, find the number of times each word in the second table appears, then calculate the mean of the six values for each block of text and store it.
I have a Debian machine with an i7 and 8GB of memory, which should be able to handle it. At the moment I am using the PHP substr_count() function. However, PHP just doesn't feel like the right solution for this problem. Other than working around time-out and memory limit problems, does anyone have a better way of doing this? Is it possible to use just SQL? If not, what would be the best way to execute my PHP without overloading the server?
Process the records from the 'big' table one at a time. Load that single 'block' of text into your program (PHP or whatever), do the searching and calculation, then save the resulting values wherever you need them.
Do each record as its own transaction, in isolation from the rest. If you are interrupted, use the saved values to determine where to start again.
Once you are done with the existing records, you only need to do this in the future when you enter or update a record, so it's much easier. You just need to take your big bite right now to get the data updated.
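The record-at-a-time, resumable loop described above might be sketched like this. Several things are assumptions made for the illustration: Python and sqlite3 stand in for PHP and MySQL, the table and column names are made up, only two value columns are shown instead of six, and the "mean" is interpreted as an average weighted by how often each word occurs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blocks  (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE words   (word TEXT PRIMARY KEY, v1 REAL, v2 REAL);
    CREATE TABLE results (block_id INTEGER PRIMARY KEY, m1 REAL, m2 REAL);
    INSERT INTO blocks VALUES (1, 'good good bad'), (2, 'nothing here'), (3, 'bad day');
    INSERT INTO words  VALUES ('good', 1.0, 4.0), ('bad', 3.0, 0.0);
""")

# The word table is small (about 6000 rows in the question), so load it once.
lexicon = conn.execute("SELECT word, v1, v2 FROM words").fetchall()

def score(text):
    """Occurrence-weighted mean of each value column for one block of text."""
    total, sums = 0, [0.0, 0.0]
    for word, *values in lexicon:
        n = text.count(word)          # substring count, like PHP's substr_count()
        total += n
        for i, v in enumerate(values):
            sums[i] += n * v
    return [s / total for s in sums] if total else [None, None]

def process(batch_size=2):
    """Work through unprocessed blocks; safe to re-run after an interruption."""
    while True:
        # The saved results tell us where to resume.
        last = conn.execute("SELECT COALESCE(MAX(block_id), 0) FROM results").fetchone()[0]
        batch = conn.execute("SELECT id, body FROM blocks WHERE id > ? ORDER BY id LIMIT ?",
                             (last, batch_size)).fetchall()
        if not batch:
            break
        for block_id, body in batch:
            with conn:                # one transaction per record
                conn.execute("INSERT INTO results VALUES (?, ?, ?)",
                             (block_id, *score(body)))

process()
results = conn.execute("SELECT * FROM results ORDER BY block_id").fetchall()
```

Fetching in small batches keeps memory use flat, and because progress is read back from the results table, a time-out only costs the record that was in flight.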
What are you trying to do exactly? If you are trying to create something like a search engine with a weighting function, you should maybe drop that and instead use the MySQL fulltext search functions and indices that are already there. If you still need this specific solution, you can of course do it completely in SQL. You can do this in one query or with a trigger that runs each time a row is inserted or updated. You won't be able to get this done properly with PHP without jumping through a lot of hoops.
To give you a specific answer, we indeed would need more information about the queries, data structures and what you are trying to do.
Redesign it:
If size on disk is not important, just join the tables into one.
Put the 6000-row table into memory (a MEMORY table) and make a backup every hour:
INSERT IGNORE into back.table SELECT * FROM my.table;
Create your "own" index in the big table, e.g.
add a column "name index" to the big table holding the id of the row.
More info about the query is needed to find a solution.

Process optimization for large sets of data

I currently have a project where we are dealing with 30million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times where we need to remove certain keywords from the list. The process includes various match-type policies to determine if the keywords should be removed:
EXACT: WHERE keyword = '{term}'
CONTAINS: WHERE keyword LIKE '%{term}%'
TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %'
OR keyword LIKE '% {term}'
Now, when a list is processed, it can only use one of the match-types listed above. But, all 30mil+ keywords must be scanned for matches, returning the results for the matches. Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Do you have any suggestions on how to optimize the process so this will run much faster?
Here is an example query to search for Holiday Inn:
SELECT * FROM keyword_list WHERE
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
Here is the pastebin for the output of EXPLAIN:
Some additional information that may be useful. A keyword can consist of multiple words like:
this is a sample keyword
i like my keywords
keywords are great
Never use a LIKE match starting with "%" on large sets of data: it cannot use the table index on that field and will do a table scan. This is your source of slowness.
The only matches that can use the index are the ones starting with a hardcoded string (e.g. keyword LIKE '{term} %').
To work around this problem, create a new indexing table (not to be confused with the database's table index) mapping individual terms to the keyword strings containing those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword and index_table.term="{term}".
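A small sketch of such an indexing table, assuming Python's sqlite3 in place of Oracle. The helper names `add_keyword` and `keywords_with_term` are hypothetical. It covers single-word terms only; a multi-word term such as "holiday inn" would need an extra step, for example intersecting the per-word results and then verifying the phrase.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keyword_list (keyword TEXT PRIMARY KEY);
    CREATE TABLE index_table  (term TEXT NOT NULL, keyword TEXT NOT NULL,
                               PRIMARY KEY (term, keyword));
""")

def add_keyword(keyword):
    """Insert a keyword and one index row per distinct word it contains."""
    conn.execute("INSERT INTO keyword_list VALUES (?)", (keyword,))
    conn.executemany("INSERT OR IGNORE INTO index_table VALUES (?, ?)",
                     [(term, keyword) for term in keyword.lower().split()])

for kw in ["holiday inn", "cheap holiday deals", "holidays abroad", "inn keeper"]:
    add_keyword(kw)

def keywords_with_term(term):
    """TOKEN match: an indexed equality lookup replaces LIKE '% term %'."""
    rows = conn.execute(
        "SELECT t1.keyword FROM keyword_list t1 "
        "JOIN index_table ON t1.keyword = index_table.keyword "
        "WHERE index_table.term = ? ORDER BY t1.keyword", (term.lower(),))
    return [r[0] for r in rows]

token_matches = keywords_with_term("holiday")
```

The index table has to be maintained whenever keywords are added or removed, which is the price paid for avoiding the scan.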
I know that my approach may look like heresy to RDBMS people, but I have verified it many times in practice and there is no magic. One just needs to know a little about achievable IO and processing rates and do some simple calculations. In short, an RDBMS is not the right tool for this sort of processing.
In my experience Perl can do a regexp scan at roughly millions of rows per second. I don't know how fast you can dump the data from the database (MySQL can manage up to 200k rows/s, so you could dump all your keywords in 2.5 min; I know that Oracle is much worse here, but I hope it is not more than ten times worse, i.e. 25 min). If your keywords average 20 chars, the dump will be 600MB; at 100 chars it is 3GB. That means that with a slow 100MB/s HD your IO will take from 6s to 30s. (All the IO involved is sequential!) That is almost nothing in comparison with the time for the dump and the processing in Perl. Your scan can slow down to 100k/s depending on the number of keywords you would like to remove (I have seen this speed with a regexp of 500 branching patterns), so you can process the resulting data in roughly 5 minutes. If the resulting cardinality is not huge (in the tens of hundreds), output IO should not be a problem. Either way, your processing should take minutes, not hours. If you generate the whole keyword values for deletion, you can use the index in the delete operation, so you would generate a series of DELETE FROM <table> WHERE keyword IN (...) statements, each stuffed with as many keywords to remove as the maximum SQL statement length allows. You can also try a variant where you upload this data to a temporary table and then use a join. I don't know which would be faster in Oracle. It would take about 10 minutes in MySQL. You are unlucky that you have to deal with Oracle, but you should be able to remove hundreds of {term}s in less than an hour.
P.S.: I would recommend you to use something with better regular expressions like (included in V8 aka node.js) or new binary module in Erlang R14A but weak regexp engine in perl would not be weak point in this task, it would be RDBMS.
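The dump-and-scan approach can be sketched as below, in Python rather than Perl. The function names are made up and the dump is assumed to be an in-memory list of keyword strings. All terms are combined into one alternation so the data is scanned once, using the TOKEN rule (term bounded by spaces or by the start/end of the string). The generated DELETE statements use `?` placeholders as a generic stand-in; the real placeholder style depends on the database driver.

```python
import re

def build_token_matcher(terms):
    """One regex for all terms: each must be bounded by spaces or string ends."""
    # Longer terms first, so 'holiday inn' is tried before 'holiday'.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.compile(r"(?:^| )(?:%s)(?: |$)" % alternation)

def scan(keywords, terms):
    """Single pass over the dumped keywords; returns those to remove."""
    matcher = build_token_matcher([t.lower() for t in terms])
    return [kw for kw in keywords if matcher.search(kw.lower())]

def delete_statements(table, keywords, batch_size=1000):
    """Yield (sql, params) pairs that delete by exact value, so the index is usable."""
    for i in range(0, len(keywords), batch_size):
        chunk = keywords[i:i + batch_size]
        marks = ", ".join(["?"] * len(chunk))
        yield f"DELETE FROM {table} WHERE keyword IN ({marks})", chunk

dump = ["holiday inn", "holiday inn express", "cheap holiday inn deals",
        "holiday innkeeper", "this is a sample keyword"]
to_remove = scan(dump, ["holiday inn", "sample"])
statements = list(delete_statements("keyword_list", to_remove, batch_size=3))
```

Because every term is tested in the same pass, adding more terms grows the regex rather than the number of table scans.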
I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as
Because of this you're forced to use LIKE to do your searches, and that's probably the source of the slowness.
Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords that relate to a given base table record (I'll refer to the base table as PPC since I don't know what it's really called). Assuming that your current base table looks like this:
<other fields>...);
What you could do would be to rebuild the tables as follows:
<other fields>...);
KEYWORD VARCHAR2(25), -- or whatever is appropriate for a single keyword
You'd populate the NEW_PPC_KEYWORD table by pulling out the individual keywords from the old PPC.KEYWORD field, putting them into the NEW_PPC_KEYWORD table. With only one keyword in each record in NEW_PPC_KEYWORD you could now use a simple join to pull all the records in NEW_PPC which had a keyword by doing something like
WHERE K.KEYWORD = '<whatever>';
Share and enjoy.
The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:
Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.
Have you tried to cache the results of the queries in a table? Keyed by the input keyword?
Because I do not believe that the whole data set, all the keywords, can change overnight. And since they do not change very often, it makes sense to simply keep the results precomputed in an extra table, so that future queries for a keyword can be resolved via the cache instead of going over the 30 million entries again. Obviously, some sort of periodic maintenance has to be done on the cache table: when keywords are modified or deleted, and when the lists are modified, the cache entries have to be updated or recomputed. To simplify the update, one would also keep in the cache table the IDs of the original rows in the keyword_list table which contributed the results.
To the UPDATE: insert data into the keyword_list table already lower-cased. Use an extra column if the original case is needed later.
In the past I have participated in the design of one ad system. I do not remember all the details but the most striking difference is that we were tokenizing everything and giving every unique word an id. And keywords were not free form - they were also in DB table, were also tokenized. So we never actually matched the keywords as strings: queries were like:
from DICT, AD
DICT.word = :input_word and
DICT.word_id = AD.word_id
DICT is a table with words and AD (analogue of your keyword_list) with the words from ads.
Essentially, one can summarize the problem you are experiencing as a "full table scan". This is a pretty common issue, often highlighting poor design of the data layout. Search the net for more information on what can be done. SO has many entries too.
Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.
Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.
You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.
My advice is to raise the cache size to hundreds of GB. Throw hardware at it. If you can't, build a Beowulf cluster or build a binary space search engine.

PHP and MySQL: optimize database

I have a database with over 10,000,000 rows. Querying it right now can take a few seconds just to find some basic information. This isn't ideal. I know that the best way to optimize is to minimize the number of rows, which is possible, but right now I don't have the time to do this.
What's the easiest way to optimize a MySQL database so that when querying it, the time taken is short?
I don't mind about the size of the database, that doesn't really matter so any optimizations that increase the size are fine. I'm not very good with optimization, right now I have indexes set up, but I'm not sure how much better I can get from there.
I'll eventually trim down the database properly, but is there a quick temporary solution?
Besides indexing which has already been suggested, you may want to also look into partitioning tables if they are large.
Partitioning in MySQL
It's tough to be specific here, because we have very limited information, but proper indexing along with partitioning can go a very long way. Indexing properly can be a long subject, but in a very general sense you'll want to index columns you query against.
For example, say you have a table of employees, and you have your usual columns of SSN, FNAME, LNAME. In addition to those columns, we'll say that you have an additional 10 columns in the table as well.
Now you have this query:
Ignoring the fact that the SSN could likely be the primary key here and may already have a unique index on it, you would likely see a performance benefit from creating another composite index containing the columns (SSN, FNAME, LNAME). This is beneficial because the database can satisfy the query by looking only at the composite index, which contains all the values needed in a sorted and compact space (that is, less I/O). Even though an index on SSN alone is a better access method than a full table scan, the database still has to read the data blocks for the index (I/O), find the value(s) containing pointers to the records needed to satisfy the query, and then read different data blocks (read: more random I/O) in order to retrieve the actual values for fname and lname.
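The effect of such a composite ("covering") index can be observed with a small sketch. It assumes Python's sqlite3 in place of MySQL, so the plan output is SQLite's own wording, not MySQL's EXPLAIN format. The table layout, the index name `idx_ssn_name`, and the query selecting fname and lname by ssn are assumptions chosen to fit the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY, ssn TEXT, fname TEXT, lname TEXT,
    address TEXT, notes TEXT)""")
conn.executemany(
    "INSERT INTO employees (ssn, fname, lname, address, notes) VALUES (?, ?, ?, ?, ?)",
    [("%09d" % i, "First%d" % i, "Last%d" % i, "addr", "notes") for i in range(1000)])

# Assumed example query: look up two name columns by ssn.
QUERY = "SELECT fname, lname FROM employees WHERE ssn = ?"

def plan():
    """Return SQLite's query plan text for QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("000000042",)).fetchall()
    return " ".join(row[3] for row in rows)

plan_before = plan()   # no usable index yet: a full scan
conn.execute("CREATE INDEX idx_ssn_name ON employees (ssn, fname, lname)")
plan_after = plan()    # answered from the index alone
result = conn.execute(QUERY, ("000000042",)).fetchall()
```

Printing `plan_before` and `plan_after` shows the switch from a table scan to an index-only lookup. In MySQL the comparable signal is "Using index" in the Extra column of EXPLAIN.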
This is obviously very simplified, but using indexes in this way can drastically reduce I/O and increase performance of your database.
Some other links here you may find helpful:
MySQL indexes - how many are enough?
When should I use a composite index?
MySQL Query Optimization (Particularly the section on "Choosing Indexes")
As I can see you request 40k rows from the database, this load of data needs time just to be transferred.
Also, never ask "how to improve in general". There is no way of "general" optimization. Optimization is always result of profiling and research of your particular case.
Use indexes on columns you search on very often.
In your example, 'WHERE x=y', if y is a column name, create an index on y as well.
The key with an index is that the number of rows returned by your select query should be around 3% ~ 5% of the entire table for it to be faster.
Archiving the table also helps. I do not know how to do this; it is mostly a DBA task.
For a DBA it is a simple task if they have done it before.
If you're doing ordering or complex queries you may need to use multi-column indexes. For example, if you're searching where name = 'y' OR phone = 'z' it might be worth putting an index on name,phone. Simplified example, but if you need to do this you'll need to research it further anyway :)
Are your queries using your indexes? What does running an EXPLAIN on your select queries tell you?
The first (and easiest) step will be making sure your queries are optimized.
