I have a scraper which periodically scrapes articles from news sites and stores them in a MySQL database.
The scraping works oldest-first: the oldest articles are scraped before I move on to more recent ones.
For example, an article written on the 1st of Jan is scraped first and given ID 1, and an article written on the 2nd of Jan gets ID 2.
So recent articles have a higher ID than older articles.
There are multiple scrapers running at the same time.
Now I need an endpoint which I can query based on the timestamp of the articles, and I also have a limit of 10 articles per fetch.
The problem arises when, for example, there are 20 articles posted with a timestamp of 1499241705. When I query the endpoint with a timestamp of 1499241705, a check is made to give me all articles with timestamp >= 1499241705, in which case I always get the same 10 articles each time. Changing the condition to > would mean I skip articles 11-20. Adding another WHERE clause on the id is also unsuccessful, because articles may not always be inserted in date order, since the scrapers run concurrently.
Is there a way I can query this endpoint so that I always get consistent data from it, with the latest articles coming first and then the older ones?
EDIT:
+----+----------------+
| id | unix_timestamp |
+----+----------------+
|  1 |           1000 |
|  2 |           1001 |
|  3 |           1002 |
|  4 |           1003 |
| 11 |           1000 |
| 12 |           1001 |
| 13 |           1002 |
| 14 |           1003 |
+----+----------------+
The last timestamp and ID are being sent through the WHERE clause.
E.g.
$this->db->where('unix_timestamp <=', $timestamp);
$this->db->where('id <', $offset);
$this->db->order_by('unix_timestamp', 'DESC');
$this->db->order_by('id', 'DESC');
On querying with a timestamp of 1003, ids 14 and 4 are fetched. But during the next call, id 4 would be the offset, thereby not fetching id 13 and only fetching id 3 the next time around. So data would be missing.
Two parts: timestamp and id.
WHERE timestamp <= $ts_leftoff
AND ( timestamp < $ts_leftoff
OR id <= $id_leftoff )
ORDER BY timestamp DESC, id DESC
So, assuming id is unique, it won't matter if lots of rows have the same timestamp, the order is fully deterministic.
There is a syntax for this, but unfortunately it is not well optimized:
WHERE (timestamp, id) <= ($ts_leftoff, $id_leftoff)
So, I advise against using it.
More on the concept of "left off": http://mysql.rjweb.org/doc.php/pagination
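Applied to the table above, a concrete sketch (assuming the table is named articles; per the "left off" convention, $ts_leftoff and $id_leftoff are the values of the first row of the next page, so you fetch LIMIT 11, display 10, and remember row 11):
SELECT id, unix_timestamp
FROM articles
WHERE unix_timestamp <= $ts_leftoff
  AND (unix_timestamp < $ts_leftoff OR id <= $id_leftoff)
ORDER BY unix_timestamp DESC, id DESC
LIMIT 11;
A composite index on (unix_timestamp, id) lets MySQL walk straight to the left-off row instead of scanning.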
Related
I am making a social application which needs paging for the posts.
Here is the database:
id | post     | time  |
---|----------|-------|
 1 | "oldest" | 9:00  |
 2 | "old"    | 10:00 |
 3 | "new"    | 11:00 |
 4 | "newest" | 12:00 |
In my app:
Newest posts are on top and I only load 2 posts at a time.
Let's say the first 2 posts are loaded into the app:
4 (12:00) newest
3 (11:00) new
User scrolls down, the app detects that the last post was reached, so it asks the PHP file to download 2 more, in the following order:
2 (10:00) old
1 (9:00) oldest
It works fine. The following is my code:
$qry = $db->prepare('SELECT id, post
FROM posts
WHERE id < :lastLoadedId
ORDER BY time DESC LIMIT 0, 2');
The problem / question:
My server automatically deletes really old posts (in order to save space).
Let's assume that after a while the MySQL table reaches its limit (the last available id, which is 2,147,483,647).
Then I need to start handing out ids from 1 again:
here comes the problem.
id            | post     | time  |
--------------|----------|-------|
1             | "new"    | 11:00 |
2             | "newest" | 12:00 |
2,147,483,646 | "oldest" | 9:00  |
2,147,483,647 | "old"    | 10:00 |
The first 2 posts are loaded again into my app.
2 (12:00) newest
1 (11:00) new
When it tries to load more, it searches for ids smaller than 2, but since 2,147,483,646 and 2,147,483,647 are bigger, it would not return the "oldest" and "old" posts.
Should I worry about this?
How do big companies handle that much data? Do they start a new table after a while?
According to the MySQL documentation, an unsigned BIGINT can go up to 18,446,744,073,709,551,615. If you insert 1 million records per second, 24x7, it will take about 584,542 years to reach the limit. So I don't think you should worry too much.
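You can sanity-check that arithmetic in MySQL itself:
SELECT 18446744073709551615 / 1000000 / 60 / 60 / 24 / 365.25 AS years;
-- roughly 584,542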
Here is an example:
CREATE TABLE foo (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
);
Note that the 20 stands for the number of digits to be displayed and has nothing to do with storage.
Why not add another column?
ALTER TABLE `files` ADD `real_id` BIGINT NOT NULL AFTER `id`;
So before adding an article, search for the last (biggest) real_id and then increment it.
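A sketch of that insert pattern (the files table and real_id come from the ALTER above; the post and time columns are made up for illustration). Note that without a lock or a serialized transaction, two concurrent inserts can read the same MAX():
INSERT INTO files (real_id, post, time)
SELECT COALESCE(MAX(real_id), 0) + 1, 'newest post', NOW()
FROM files;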
You have a problem with your design.
Your id is not a real id. The id should be the primary key and auto-increment; there should never be a case of reusing ids. It's confusing.
Your id's data type, INT, is not enough to support real-life data volumes. As suggested by other developers, change it to BIGINT.
MySQL internally stores a DATETIME or TIMESTAMP as an integer, so the size impact is small. But you should use the id only (ORDER BY id DESC instead of ordering by time) and make sure it's the primary key. This will make your query very fast because it works directly on the clustered index.
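In practice that advice boils down to something like this (a sketch, assuming the posts table from the question):
-- Widen the key so wraparound is a non-issue:
ALTER TABLE posts MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
-- Then page on the primary key instead of time:
SELECT id, post
FROM posts
WHERE id < :lastLoadedId
ORDER BY id DESC
LIMIT 0, 2;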
I have 2 tables named collections and posts. I need to display the collection name and 3 posts from each collection collected by users. Some of the collections have fewer than 3 posts, and some have no posts at all. Also, I need to count the posts (not the total number of posts, but the posts produced by the query).
MySQL Tables
Collections
| collection_id | collection_name | uid |
| 1 | My collection 01 | 1 |
| 2 | My collection 02 | 1 |
| 3 | My collection 03 | 1 |
| 4 | My collection 04 | 2 |
| 5 | My collection 05 | 2 |
| 6 | My collection 06 | 1 |
Posts
| posts_id | post_title | cid |
| 1 | post title 1 | 1 |
| 2 | post title 2 | 1 |
| 3 | post title 3 | 1 |
| 4 | post title 4 | 3 |
| 5 | post title 5 | 2 |
| 6 | post title 6 | 3 |
cid is the collection id and uid is the user id. I want the results to be displayed like this:
3 posts from My collection 01
post title 1
post title 2
post title 3
1 post from My collection 02
post title 5
2 posts from My collection 03
post title 4
post title 6
(I just made this example up from the dummy data in the tables above.)
I tried with a left join, with no luck:
SELECT * FROM collections LEFT JOIN posts ON posts.cid = collections.collection_id WHERE collections.uid = 1 ORDER BY collections.collection_id DESC LIMIT 0, 16
With this query I can get the collection name and 1 post.
But it works if I run two queries (one inside the other):
SELECT * FROM collections WHERE uid=1 ORDER BY collection_id DESC LIMIT 0, 16
Then I take each collection id and run another query inside the while loop of the query above:
SELECT * FROM posts WHERE cid=$collection_id ORDER BY post_id DESC LIMIT 0, 3
I would really love to do it with a single query. Your help is greatly appreciated.
There is no easy way to do that. Maybe with a very complex query, but it would be difficult to maintain and might even be less efficient than several simpler queries.
The solution you described costs 1 + (number of collections) queries, not two, of course. You could UNION them easily (see the sketch below), and then you would have two queries and fewer round trips to the database, but a similar load on the db compared to your solution.
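The UNION version would look roughly like this (a sketch: the literal collection ids 1, 2, 3 stand in for whatever ids your first query returned, and posts_id follows the table listing above):
(SELECT * FROM posts WHERE cid = 1 ORDER BY posts_id DESC LIMIT 3)
UNION ALL
(SELECT * FROM posts WHERE cid = 2 ORDER BY posts_id DESC LIMIT 3)
UNION ALL
(SELECT * FROM posts WHERE cid = 3 ORDER BY posts_id DESC LIMIT 3);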
Even if you assume there is a way to fetch everything with a single query, the db has to do almost the same work (fetch the 3 newest posts from every collection). So 2 queries vs. 1 hypothetical query is not a big penalty in terms of performance. Moreover, I can imagine the DB engine having trouble finding the optimal execution plan, especially if you add functions etc.
And the last solution: there is a way to fetch up to 3 posts from each collection, but it requires modifying the schema and some application-side work. You can add a boolean column most_recent and keep it true for (at most) 3 posts per collection and false for the rest. You would have to update it every time you add or delete posts; that is achievable with db triggers as well. Then your problem becomes trivial, but only because you have done some precomputation.
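For completeness, the "very complex query" route is usually the MySQL user-variable ranking trick; a sketch only, since it relies on evaluation-order behavior of older MySQL versions (on MySQL 8+ you would use ROW_NUMBER() OVER (PARTITION BY ...) instead):
SELECT collection_id, collection_name, posts_id, post_title
FROM (
  SELECT c.collection_id, c.collection_name, p.posts_id, p.post_title,
         @rn := IF(@prev = c.collection_id, @rn + 1, 1) AS rn,
         @prev := c.collection_id AS prev_id
  FROM collections c
  LEFT JOIN posts p ON p.cid = c.collection_id
  CROSS JOIN (SELECT @rn := 0, @prev := NULL) vars
  WHERE c.uid = 1
  ORDER BY c.collection_id DESC, p.posts_id DESC
) ranked
WHERE rn <= 3;
The LEFT JOIN keeps collections with no posts (they come through as a single row with NULL post columns and rn = 1).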
Any idea how I can identify whether a new client was added to my database?
I was thinking of identifying it through the date_added field.
id client_name date_added
---------------------------------
1 ABC 2013-01-02
2 XYZ 2013-01-03
3 EFG 2013-01-02
4 HIJ 2013-01-05
As you can see, a new client, HIJ, was added on 2013-01-05.
I am looking for this kind of result:
Client List
Total No: 4
New Client
Total No: 1
Client Name: HIJ
Add a field `new` to the table and default it to 1. On page load, use that field for the SELECT, then set it to 0 to indicate the client is no longer new.
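A sketch of that flag approach (assuming the table is called clients; `new` is backticked since it reads like a keyword):
ALTER TABLE clients ADD `new` TINYINT(1) NOT NULL DEFAULT 1;
-- on page load: show the new clients...
SELECT id, client_name FROM clients WHERE `new` = 1;
-- ...then mark them as seen
UPDATE clients SET `new` = 0 WHERE `new` = 1;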
It's hard to tell, but based on your comment (...my reference date is 1 month interval...) you might be looking for something like this:
SELECT id, client_name, new_count, total_count
FROM
(
SELECT id, client_name
FROM clients
WHERE date_added BETWEEN CURDATE() - INTERVAL 1 MONTH AND CURDATE()
) c CROSS JOIN
(
SELECT
(
SELECT COUNT(*) new_count
FROM clients
WHERE date_added BETWEEN CURDATE() - INTERVAL 1 MONTH AND CURDATE()
) new_count,
(
SELECT COUNT(*) total_count
FROM clients
) total_count
) t
Obviously you can easily replace CURDATE() with any other reference date in the past in this query, and you will get the results for that date.
Let's assume that you have the following sample data
+------+-------------+------------+
| id | client_name | date_added |
+------+-------------+------------+
| 1 | ABC | 2013-05-13 |
| 2 | XYZ | 2013-06-13 |
| 3 | EFG | 2013-06-13 |
| 4 | HIJ | 2013-08-11 |
+------+-------------+------------+
and today is 2013-08-13; then the output from the query will be
+------+-------------+-----------+-------------+
| id | client_name | new_count | total_count |
+------+-------------+-----------+-------------+
| 4 | HIJ | 1 | 4 |
+------+-------------+-----------+-------------+
You could remember, in your webpage or PHP script, the highest ID value previously seen. Or the highest timestamp (better than a date) previously seen.
I prefer ID or Version numbers for concurrency-related stuff (locking, finding the latest etc) -- since they should be defined to be ascending, can't suffer "same millisecond" collisions, and are more efficient.
I assume you're going to hold the "state" of your application (as to what the user has seen) in hidden fields in the form, or somesuch. This would then track the "last seen" and allow you to identify "newly added" since the last pageview.
If you expect to identify newly added when coming from a different page or logging onto the application, you'll need to store the "state" in the database instead.
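A sketch of the highest-ID variant (:last_seen_id is whatever value you stashed in the hidden field on the previous pageview):
SELECT id, client_name, date_added
FROM clients
WHERE id > :last_seen_id
ORDER BY id;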
That depends on what you consider NEW. You have to define what you're going to compare the records against (reference date). Once you define it, you could use a query like the following:
SELECT * FROM client WHERE date_added >= '$date'
where $date is the reference date.
Suppose I have a MySQL table that looks like the following, where I keep track of when (Date) a user (User.id) read an article on my website (Article.id):
------------------------------------------
Article_Impressions
------------------------------------------
date | user_id | article_id
--------------------+---------+-----------
2013-04-02 15:33:23 | 815 | 2342
2013-04-02 15:38:21 | 815 | 108
2013-04-02 15:39:33 | 161 | 4815
...
I'm trying to determine how many sessions I had, as well as the average session duration per user, on a given day. A session ends when no article is read within 30 minutes of the previous one.
Question
How can I efficiently determine how many sessions I had on a given day? I'm using PHP and MySQL.
My first idea is to query all that data for a given day, sorted by user. Then I iterate through each user, check whether an impression was within 30 minutes of the previous impression, and tally up the total count of sessions each user had that day.
Since we have around 2 million impressions a day on our site, I'm trying to optimize this report generator.
Try this query
Query 1:
select
  @sessionId := if(@prevUser = user_id AND diff <= 1800, @sessionId, @sessionId + 1) as sessionId,
  @prevUser := user_id AS user_id,
  article_id,
  date,
  diff
from
  (select @sessionId := 0, @prevUser := 0) b
join
  (select
     TIME_TO_SEC(if(@prevU = user_id, TIMEDIFF(date, @prevD), '00:00')) as diff,
     @prevU := user_id as user_id,
     @prevD := date as date,
     article_id
   from
     tbl
   join
     (select @prevD := 0, @prevU := 0) init
   order by
     user_id,
     date) a
[Results]:
| SESSIONID | USER_ID | ARTICLE_ID | DATE | DIFF |
-----------------------------------------------------------------
| 1 | 161 | 4815 | 2013-04-02 15:39:33 | 0 |
| 2 | 815 | 2342 | 2013-04-02 15:33:23 | 0 |
| 2 | 815 | 108 | 2013-04-02 15:38:21 | 298 |
| 3 | 815 | 108 | 2013-04-02 16:38:21 | 3600 |
This query assigns a new session id for every new user, and also for the same user when the next article is read more than 30 minutes later, as per the requirement in your question. The diff column holds the difference in seconds between two consecutive articles by the same user, which is what drives the sessionId counter. Using this result it is now easy to compute the average time per user and the total time per session.
Hope this helps you...
SQL Fiddle
If the concept of a user "session" is important to your analytics, then I would start logging data in your table that makes querying session-related data less painful. A simple approach would be to log your PHP session ID. If your PHP session is set to the same 30-minute expiry, and you log the PHP session ID to this table, then you would have exactly what you are looking for.
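The schema change for that is small; a sketch (VARCHAR(32) covers common PHP session id lengths, and the index matches the queries further down):
ALTER TABLE Article_Impressions
  ADD session_id VARCHAR(32) NULL,
  ADD INDEX idx_session (session_id);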
Of course that won't help you with your existing records. I would probably go ahead and create the session field and then back-populate it with randomly generated "session" id's. I wouldn't look for a fully SQL solution for this, as it may not do what you want in terms of handling edge cases (sessions spanning across days, etc.). I would write a script to perform this backfill, which would contain all the logic you need.
My general approach would be to SELECT all the records like this:
SELECT user_id, date /* plus any other fields like unique id that you would need for insert */
FROM Article_Impressions
WHERE session_id IS NULL
ORDER BY user_id ASC, date ASC
Note: make sure you have index on both user_id and date fields.
I would then loop through the result set, building a temp array per user_id, and walk that array's date values, assigning a randomly generated session id that changes whenever the gap between consecutive dates is greater than 30 minutes. Once the user value increments, I would run the updates for the previous user to set its session_id values, then reset the temp array to empty and continue the process with the next user.
Note that it is probably important to keep a relatively small temp/working array like this: with the number of records you are talking about, you are likely not going to be able to read the entire result set into an array in memory.
Once your data is populated, the query becomes trivial:
Unique sessions for each day:
SELECT DATE(date) as `day`, COUNT(DISTINCT session_id) AS `unique_sessions`
FROM Article_Impressions
GROUP BY `day`
ORDER BY `day` DESC /* or ASC depending on how you want to view it */
Average sessions per day:
SELECT AVG(sessions_per_day.`unique_sessions`) AS `average_sessions_per_day`
FROM
(
    SELECT DATE(date) as `day`, COUNT(DISTINCT session_id) AS `unique_sessions`
    FROM Article_Impressions
    GROUP BY `day`
) AS sessions_per_day
Note: you need an index on the new session_id field.
Thanks for reading.
This is not a coding question so much as a logic one, but if my current logic is wrong, some coding help would be appreciated.
I have made a table in my database which is a log of everything that happens on my site.
When a user registers, it's saved. When he logs in, it's saved again. And so on. Each action is represented by a number.
The data looks like this
----------------------------
| id | action | timestamp |
----------------------------
| 1 | 1 | 1299132900 |
| 2 | 2 | 1346876672 |
| 3 | 14 | 1351983948 |
| 4 | 1 | 1359063373 |
----------------------------
id and action are of type INT(11), and timestamp is a TIMESTAMP.
I'm using this query to retrieve all records from the last 30 days:
SELECT id, action, timestamp FROM log WHERE timestamp >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
It works, and gives me all the correct values.
I need to arrange this data to make a graph in flot.
As I see it, there are 2 steps:
Group the results by action number.
Then, inside each group, separate the values by date, so the X axis of the graph is the date and the Y axis is the count.
With those arrays I could build the different JavaScript data arrays to pass to flot.
Am I on the right track?
Should there be several mysql queries, or a GROUP BY clause?
I'm kind of lost here and would appreciate any help.
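For what it's worth, a single GROUP BY along these lines should give one row per action per day, which maps directly onto flot series (a sketch against the log table above; cnt becomes the Y value and day the X value for each action's series):
SELECT action,
       DATE(timestamp) AS day,
       COUNT(*) AS cnt
FROM log
WHERE timestamp >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
GROUP BY action, DATE(timestamp)
ORDER BY action, day;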