Locking MySql table or data to prevent getting same data - php

I have a database with 1 million entries and 3 programs that process the data. The 3 programs get the data over an API call, for example 100 entries per request. What is the best way to prevent the programs from getting the same 100 entries?
I tried updating an id per program in the database, but that doesn't solve the problem: if the 3 programs request data while the update from one of them is still running, another program can still get the same data.
I have also tried LOCK TABLES, but this is the main table in my database, so all the other PHP processes slow down extremely because the table is completely locked every few minutes.

How about an app_id column.
App 1 does this...
UPDATE table SET app_id=1 where app_id IS NULL LIMIT 100
SELECT * FROM table WHERE app_id=1 LIMIT 100
----process and when done----
UPDATE table SET app_id=NULL where app_id=1 LIMIT 100
App 2 does this...
UPDATE table SET app_id=2 where app_id IS NULL LIMIT 100
SELECT * FROM table WHERE app_id=2 LIMIT 100
----process and when done----
UPDATE table SET app_id=NULL where app_id=2 LIMIT 100
You can have unlimited apps and they should only get their own records. This kinda hits your DB harder. Maybe you could use a combo of memcache/stored procedures to limit db load depending on your architecture.
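If you would rather not hard-code one app_id per worker, the same idea can be generalized by claiming each batch with a unique token. This is only a sketch (the table name work_items and the columns claim_token/processed are assumptions, not from the question); because the claiming UPDATE is a single atomic statement, two workers can never grab the same rows:
-- Each worker generates its own token and claims up to 100 unclaimed rows.
SET @token = UUID();
UPDATE work_items
SET claim_token = @token
WHERE claim_token IS NULL
ORDER BY id
LIMIT 100;
-- Read back exactly the rows this worker claimed.
SELECT * FROM work_items WHERE claim_token = @token;
-- When processing is done, mark the rows so they are never handed out again
-- (or set claim_token back to NULL if they should become claimable again).
UPDATE work_items SET processed = 1 WHERE claim_token = @token;
An index on claim_token (or on (claim_token, id)) keeps the claiming UPDATE cheap even on a million-row table.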
Another option might be to handle this in the API. You could create a system-wide global variable that stores which app has which records. Any time the API is called, it would look at this system-wide variable to know whether it can hand out data, and from which record to start. That might be easier on the DB.

Related

Cache Dynamic Query PHP + BigQuery

I have a database in Google BigQuery with millions of rows (more than 2 million new rows every day) containing my user activities.
I created a PHP program to get insights from this database, with many queries, to show things like statistics per day, per hour, and much more.
I have two cases with two problems:
I try to find the user activity data between 2016-11-01 and 2016-11-10, and then I need to break the data down to 2016-11-05 only (the data is basically a subset of the first query's result). The data needs to be classified per day, per hour, per user type, and so on. Right now I use many queries against the database to group this data and do the data manipulation. For example "SELECT * FROM user_activities WHERE date>='2016-11-01' AND date<='2016-11-10' GROUP BY date,hour", and then when I need to break down the data to 2016-11-05 only, I re-run the query: "SELECT * FROM user_activities WHERE date='2016-11-05' GROUP BY date,hour".
Or sometimes I need to query the data with a different parameter, for example the user activities between 2016-11-01 and 2016-11-10 that contain activity "A", and then I need to switch to activity "B". I have a column that identifies the type of activity the user performed. Right now I run a query like "SELECT * FROM user_activities WHERE activities LIKE 'A'", and then when the activity type changes I run a new query: "SELECT * FROM user_activities WHERE activities LIKE 'B'".
So my question is:
Because the data in my database is so big, and because my PHP program issues these insight queries so frequently, the cost of data management and processing becomes very high. For cases like 1 and 2, is there an alternative solution, like PHP caching, to reduce the number of database requests?
In just 1-2 days my BigQuery requests can scan terabytes of data. I'm afraid this is not efficient in terms of database cost.
So far I have tried these solutions:
I take the raw data from my database, cache it in PHP, and run the data manipulation manually. For example I run "SELECT * FROM user_activities WHERE date>='2016-11-01' AND date<='2016-11-10'" and then I try to do the manipulation, like grouping by hour, by user type, or by user activity, manually and sequentially in a PHP function. But because my data contains millions of rows, the process becomes very long and inefficient.
I take the raw data from my database, insert it into a temporary table, and then manipulate the data with queries against the temporary table. But this process is not efficient either, because inserting millions of rows takes so long.
Do you have any suggestions on how I can optimize this?
Implement partitioned tables, as has already been recommended to you.
If you have one single big table with 5 TB of data and no partitioning, your costs are high.
With partitioned tables, a query touches only the storage for the days it needs, not the whole table; just a fraction of it, like 10 GB or less, and you pay only for that.
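As a rough illustration only (BigQuery standard SQL; the dataset name mydataset and the column name activity_date are assumptions, and it presumes the date column has the DATE type), you can copy the existing table into a date-partitioned one and then filter on the partitioning column so that only the touched days are scanned and billed:
-- One-time copy of the data into a table partitioned by day.
CREATE TABLE mydataset.user_activities_partitioned
PARTITION BY activity_date AS
SELECT * FROM mydataset.user_activities;
-- Scans (and bills) only the partitions for 2016-11-01 .. 2016-11-10.
SELECT activity_date, COUNT(*) AS events
FROM mydataset.user_activities_partitioned
WHERE activity_date BETWEEN DATE '2016-11-01' AND DATE '2016-11-10'
GROUP BY activity_date;
The same CREATE TABLE ... AS SELECT pattern also covers the next point: you can persist a query result into a smaller table and aggregate against that instead of the full dataset.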
You can save a query result directly into a table instead of re-importing it as you describe, and run the further aggregation only against that smaller table.
Try not to use SELECT *; select only the columns you actually need in your output.
If the data is small enough and you run lots of small queries on it, you may want to export it from BigQuery into Elasticsearch or MySQL and run the queries there.

Import big file into mysql, on a Heroku app

I need some help.
I have a PHP app on Heroku. In this app, there's a form that uploads a CSV file to be imported into MySQL (ClearDB).
The problem is that the file is large (and will always be large), and the import function takes too long to finish (about 90 seconds). The request timeout on Heroku is 30 seconds, and there's no way to change that.
I tried to use Heroku Scheduler (like cron), but the minimum frequency is 10 minutes, and a script that needs 90 seconds would, with this scheduler, take 30 minutes, because, as I said, the Heroku timeout is 30 seconds.
Well, what can I do? Is there an alternative scheduler?
Example of the import:
CSV
name,productName,points,categoryName,coordName,date
MYSQL
[users]
userID
userName
categoryID
coordID
[products]
productID
productName
[coords]
coordID
coordName
[categories]
categoryID
categoryName
[points]
pointID
productID
userID
value
In all tables, I need to do a SELECT to see if the category, coord, etc. already exists. If it exists, return its id; if not, insert a new row.
I don't think there's a way to decrease the execution time. I'm trying to find a way to decrease the schedule interval to 2 minutes, 3 minutes, etc., so that in about 10 minutes all lines would be imported.
thanks!
This is what I would start with (because it's relatively simple/quick to implement and should give you a reference point and some wiggle room for further tests in a short period of time):
Import all the data as-is into a temporary table (if the server's RAM allows, you can also try the MEMORY engine).
Then, after the data has been imported, create the indices needed for the following queries (and check via EXPLAIN or any other tool that shows you if and how the indices are used):
query all the categories that are in the temporary table but not in your live data tables
create those categories in the live tables.
query all coords that are in the temporary table but not in your live data tables.
create those coords in the live tables.
you get the idea ...repeat for all necessary data.
then just import the data from the temp table into the live tables via INSERT...SELECT queries. Think about what kind of transaction/locking you will need for this. It might be that the order of queries will make a difference. But if you're only adding data, I assume that a rather low isolation level should do... not sure though. But maybe that's not your concern right now?
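A rough sketch of what those steps could look like in MySQL (the staging table name import_tmp, the column types, and the file path are assumptions based on the CSV layout in the question; whether LOAD DATA LOCAL INFILE is permitted depends on the ClearDB plan and connection settings):
-- 1. Stage the CSV as-is.
CREATE TABLE import_tmp (
  name         VARCHAR(255),
  productName  VARCHAR(255),
  points       INT,
  categoryName VARCHAR(255),
  coordName    VARCHAR(255),
  date         DATE
);
LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE import_tmp
FIELDS TERMINATED BY ','
IGNORE 1 LINES;
-- 2. Index the lookup columns so the anti-joins below stay cheap.
CREATE INDEX idx_tmp_cat   ON import_tmp (categoryName);
CREATE INDEX idx_tmp_coord ON import_tmp (coordName);
-- 3. Insert only the categories that do not exist yet
--    (repeat the same pattern for coords, products and users).
INSERT INTO categories (categoryName)
SELECT DISTINCT t.categoryName
FROM import_tmp t
LEFT JOIN categories c ON c.categoryName = t.categoryName
WHERE c.categoryID IS NULL;
-- 4. Resolve the foreign keys with joins instead of one SELECT per row.
INSERT INTO points (productID, userID, value)
SELECT p.productID, u.userID, t.points
FROM import_tmp t
JOIN products p ON p.productName = t.productName
JOIN users    u ON u.userName    = t.name;
The heavy lifting then happens in a handful of set-based statements instead of thousands of per-row round trips, which is usually what brings a 90-second import down to something manageable.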

PHP / MySQL Performance Suggestion

I have table(1) that holds a total-records value for table(2). I do this so that I can quickly show users the total value without having to run a SELECT COUNT every time a page is brought up.
My Question:
I am debating on whether or not to update that total records value in table(1) as new records come in or to have a script run every 5 minutes to update the total records value in table(1).
The problem is that we plan on having many records created during the day, which would result in an additional update for each one.
However, if we go with a script, it will need to run for every record in table(1), and that update query will have a subquery counting records from table(2). The script would need to run every 5 to 10 minutes to keep things in sync.
table(1) will not grow fast; at peak it might reach around 5,000 records. table(2) has the potential to get massive, over 1 million records in a short period of time.
Would love to hear some suggestions.
This is where a trigger on table 2 might be useful, automatically updating table 1 as part of the same transaction, rather than using a second query initiated by PHP. It's still a slight overhead, but handled by the database itself rather than a larger overhead in your PHP code, and maintains the accuracy of the table 1 counts ACIDly (assuming you use transactions)
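A minimal sketch of such triggers (table1, table2, total_records and table1_id are assumed names, since the real schema isn't shown):
-- Keep the counter in table1 in sync with inserts and deletes on table2.
CREATE TRIGGER table2_after_insert
AFTER INSERT ON table2
FOR EACH ROW
UPDATE table1 SET total_records = total_records + 1 WHERE table1.id = NEW.table1_id;

CREATE TRIGGER table2_after_delete
AFTER DELETE ON table2
FOR EACH ROW
UPDATE table1 SET total_records = total_records - 1 WHERE table1.id = OLD.table1_id;
Each trigger body is a single statement, so no DELIMITER change is needed.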
There is a difference between the MyISAM and InnoDB engines. If you need to count the total number of rows in the table (SELECT COUNT(*) FROM table) and you are using MyISAM, you will get this number blazingly fast no matter the size of the table (MyISAM tables already store the row count, so it is simply read back).
InnoDB does not store this information, but if an approximate row count is sufficient, SHOW TABLE STATUS can be used.
If you need to count based on a condition (SELECT COUNT(*) FROM table WHERE ...), then there are two different options (the first is sketched below):
either put an index on that something, and count will be fast
use triggers/application logic to automatically update field in the other table
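For illustration, a hedged sketch of the first option together with the approximate count mentioned above (the column name user_id and the value 42 are made up; the trigger route is sketched in the previous answer):
-- An index on the filtered column makes the conditional count fast.
CREATE INDEX idx_table2_user ON table2 (user_id);
SELECT COUNT(*) FROM table2 WHERE user_id = 42;
-- Approximate total row count on InnoDB without scanning the table.
SHOW TABLE STATUS LIKE 'table2';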

Porting SQL results into memcached

I have a few tables which are accessed frequently by users. The same kinds of queries run again and again, which causes extra load on the server.
The records are not inserted/updated frequently. I was thinking of caching the IDs in memcached and then fetching the rows from the database by ID; this would reduce the burden of searching/sorting, etc.
Here is an example
SELECT P.product_id FROM products P, product_category C WHERE P.cat_id=C.cat_id AND C.cat_name='Fashion' AND P.is_product_active=true AND C.is_cat_active=true ORDER BY P.product_date DESC
The above query returns all the product ids of a particular category; these will be stored in memcached, and then the rest of the process (i.e., paging) will be simulated the same way we do it with MySQL result sets.
The insert process will either expire the cache or insert the new product id at the first position of the cached array.
My question: is this a practical approach? And how do people deal with searches, say if a person searches for a product and gets 10,000 results (which in practice may not happen), do they hit the tables for every search? Is there any good example of memcached and MySQL that shows how these tasks can be done?
you may ask yourself if you really need to invalidate the cache upon insert/update of a product.
Usually a 5 minutes cache can be acceptable for a product list.
If your invalidation scheme is time-based only (new entries will only appear after 5 min), there is a quick & dirty trick you can use with memcache: simply use an md5 of your SQL query string as the memcache key, and tell memcache to keep the result of the SELECT for 5 minutes.
I hope this will help you

Pagination Strategies for Complex (slow) Datasets

What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don't want to hit the DB for every page view. Currently there are ~45k rows returned by this query.
Here are some of the approaches I've considered:
Cache the row count and update it every X minutes
Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
Get the row count once and store it in the user's session
Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:
Multi-column indexes. I should have done this first, before attempting a materialized view.
Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table, and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write-contended mytable down to a minimum, and the pagination queries could hammer away on the atbat materialized view. I've simplified the above, leaving out the actual manipulation, which is unimportant.
Memcache I then created a wrapper around my database connection to cache these paginated results into memcache. This was a huge performance win. However, it was still not good enough.
Batch generation I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all the pages, from the oldest changed record to the most recent, onto the webserver's filesystem. With a bit of mod_rewrite, I could check whether the page existed on disk and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE count(*): When using MyISAM tables, COUNT didn't create a problem when I was able to reduce the amount of read-write contention on the table. If I were doing InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
RE page-pickers (thumbwheels) When I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch generating the pages, I was using temporary tables, so computing the thumbwheel was no problem. A lot of the thumbwheel calculation was simplified because it became a predictable filesystem pattern that actually only needed the largest page number. The smallest page number was always 1.
Windowed thumbwheel The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, as long as you know your maximum number of pages.
My suggestion is ask MySQL for 1 row more than you need in each query, and decide based on the number of rows in the result set whether or not to show the next page-link.
MySQL has a specific mechanism to compute the count of a result set without the LIMIT clause: FOUND_ROWS() (used together with SQL_CALC_FOUND_ROWS).
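For example (a sketch only; the table and column names mirror the example further down, the page size of 40 is an assumption, and note that SQL_CALC_FOUND_ROWS/FOUND_ROWS() are deprecated as of MySQL 8.0.17 in favor of a separate COUNT(*) query):
SELECT SQL_CALC_FOUND_ROWS *
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
LIMIT 40 OFFSET 0;
-- Number of rows the SELECT above would have returned without the LIMIT:
SELECT FOUND_ROWS();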
MySQL is quite good in optimizing LIMIT queries.
That means it picks appropriate join buffer, filesort buffer etc just enough to satisfy LIMIT clause.
Also note that with 45k rows you probably don't need exact count. Approximate counts can be figured out using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) *
(
SELECT COUNT(*)
FROM mytable
) / 1000
FROM (
SELECT 1
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
LIMIT 1000
) AS sample
This is much more efficient on MyISAM (note that the derived table needs an alias, sample here, or MySQL will reject the query).
If you give an example of your complex query, probably I can say something more definite on how to improve its pagination.
I'm by no means a MySQL expert, but perhaps give up COUNT(*) and go with COUNT(id)?
