I have a database with almost 100,000 comments and I would like to detect the most used words (using stop words to avoid common words).
I want to do this only once, and then use a few of the most popular words to tag the comments that contain them.
Can you help me with the query and PHP code to do this?
Thanks!
The easiest approach I guess would be:
Create two new tables: keywords (id, word) and keywords_comments (keyword_id, comment_id, count)
keywords stores a unique id and each keyword you found in a text
keywords_comments stores one row for each connection between a comment and a keyword it contains. In count you will save the number of times this keyword occurred in the comment. The two columns keyword_id + comment_id together form a unique key, or directly the primary key.
Retrieve all comments from the database
Parse all comments and split them on non-word characters (or other boundaries)
Write these entries to your tables
Example
You have the following two comments:
Hello, how are you?!
Wow, hello. My name is Stefan.
Now you would iterate over both of them and split them on non-word characters. This would result in the following lowercase words for each text:
- First text: hello, how, are, you
- Second text: wow, hello, my, name, is, stefan
As soon as you have parsed one of these texts, you can already insert it into the database again. I guess you do not want to load 100,000 comments into RAM.
So it would go like this:
Parse the first text and get the keywords above
Write each keyword into the table keywords if it is not there yet
Set a reference from the keyword to the comment (keywords_comments) and set the count correctly (in our example each word occurs only once in each text; in general you have to count that).
Parse second text
…
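A rough PHP sketch of that loop (assuming PDO, the two tables above, and a unique index on keywords.word so that INSERT IGNORE works; all names here are mine):

// Sketch only: tokenize one comment and store its keyword counts.
function index_comment(PDO $db, int $commentId, string $text): void
{
    // split on non-word characters, lowercased, empty pieces dropped
    $words = preg_split('/[^a-z0-9]+/', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach (array_count_values($words) as $word => $count) {
        // write the keyword into the table keywords if it is not there yet
        $db->prepare('INSERT IGNORE INTO keywords (word) VALUES (?)')->execute([$word]);
        $select = $db->prepare('SELECT id FROM keywords WHERE word = ?');
        $select->execute([$word]);
        $keywordId = (int) $select->fetchColumn();

        // reference keyword -> comment with the correct count
        $db->prepare('INSERT INTO keywords_comments (keyword_id, comment_id, count)
                      VALUES (?, ?, ?)')->execute([$keywordId, $commentId, $count]);
    }
}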
Minor improvement
A very easy improvement you will probably need for 100,000 comments is to use a counting variable, or to add a new field has_been_analyzed to each comment. Then you can read them comment by comment from the database.
I usually use counting variables when I read data chunkwise and know that the data cannot change from the direction I am starting from (i.e. it will stay consistent up to the point I currently am at). Then I do something like:
SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
…
Consider that this only works if we know for sure that no rows will be added at a position we think we have already read. E.g. using DESC would not work, as data could be inserted there. Then the whole offset would break and we would read one article twice and never read the new one.
If you cannot make sure that the outside counting variable stays consistent, you can add a new field analyzed which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which have not. An SQL query would then look like this:
SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
This works as long as you do not parallelize the workload (with multiple clients or threads). Otherwise you would have to make sure that reading + setting true is atomic (synchronized).
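As a sketch, the analyzed-flag loop could look like this (again assuming PDO, plus a parsing helper like the index_comment() sketched earlier):

// Process unanalyzed comments in chunks of 100 until none are left.
while (true) {
    $rows = $db->query('SELECT id, text FROM comments WHERE analyzed = 0 LIMIT 100')
               ->fetchAll(PDO::FETCH_ASSOC);
    if ($rows === []) {
        break; // everything has been analyzed
    }
    foreach ($rows as $row) {
        index_comment($db, (int) $row['id'], $row['text']);
        $db->prepare('UPDATE comments SET analyzed = 1 WHERE id = ?')
           ->execute([$row['id']]);
    }
}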
Related
My database has a table messages with 3 fields: messageid, messagetext (varchar), dateposted (datetime).
I want to store a bunch of messages in the field messagetext along with their respective date of posting in the field dateposted. A lot of these messages will have hashtags in them.
Then, using PHP and MySQL I want to find out which hashtags are the top 5 most frequently mentioned hashtags in messages posted in the past week.
How can I do this? I'd really appreciate any help. Many thanks in advance.
Do not take this the wrong way, but you have set yourself up for a world of hurt. The best way to proceed would be to follow lonesomeday's advice and parse the hashtags at insert time. This also greatly reduces processing time and makes it more deterministic (the workload is "spread" across the inserts).
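For illustration, insert-time parsing can be as small as this sketch (the message_hashtags table and the $db PDO connection are my assumptions, not part of your schema):

// Extract hashtags when a message is saved, one row per distinct tag.
preg_match_all('/#(\w+)/u', $messagetext, $matches);
$insert = $db->prepare('INSERT INTO message_hashtags (messageid, hashtag) VALUES (?, ?)');
foreach (array_unique($matches[1]) as $tag) {
    $insert->execute([$messageid, strtolower($tag)]);
}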
If you want to proceed anyway, you need to tackle several problems.
1) Recognize the tags.
2) Multiple-select the tags. If you have a message saying that "#MySQL splitting is #cool", you want to get two rows from that one message, one saying 'MySQL', the other 'cool'.
3) Selecting the appropriate messages.
4) Performance.
You can approach this in at least two ways. You can use a stored function, which you can find here on SO (actually from this site); you will have to modify it, though.
This syntax will get you the first occurrence of #hashtag in value plus all the text following it:
select substring(value, LENGTH(substring_index(value, '#', 1))+1);
You will then need to decide where, for each #hashtag, it #stops (and it could be #parenthesized). At this point you need a regexp, or to search for a sequence of at least one alphanumeric character - in regexp parlance, [a-zA-Z0-9]+ - either by specifying all possible characters or by using a loop, i.e. "#" is OK, "#t" is OK, "#ta" is OK, "#tag" is OK, "#tag," is not, and so your hashtag is '#tag' (or 'tag').
Another, more promising approach is to use a user-defined function to capture the hashtags; you can use PREG_CAPTURE.
You will probably have to merge both approaches, modifying the stored function's setup and inner loop to read:
DECLARE cur1 CURSOR FOR SELECT messages.messagetext
FROM messages
WHERE messages.messagetext LIKE '%#%';
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
DROP TEMPORARY TABLE IF EXISTS table2;
CREATE TEMPORARY TABLE table2 (
    `hashtag` VARCHAR(255) NOT NULL
) ENGINE=Memory;
...
SET occurrence = LENGTH(msgtext)
                 - LENGTH(REPLACE(msgtext, '#', '')); -- number of '#' marks
SET i=1;
WHILE i <= occurrence DO
INSERT INTO table2 SELECT PREG_CAPTURE('/#([a-z0-9]+)/i', msgtext, i);
SET i = i + 1;
END WHILE;
...
This will return a list of message ids and hashtags. You then need to GROUP them BY hashtag, COUNT them, and ORDER BY that count DESC, finally adding LIMIT 5 to get only the five most popular.
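For comparison, if the tags do get their own table at insert time (say message_hashtags(messageid, hashtag), an assumed schema), the whole job reduces to one query:

SELECT mh.hashtag, COUNT(*) AS mentions
FROM message_hashtags mh
JOIN messages m ON m.messageid = mh.messageid
WHERE m.dateposted >= NOW() - INTERVAL 7 DAY
GROUP BY mh.hashtag
ORDER BY mentions DESC
LIMIT 5;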
I have a table in MySQL with several different fields, one of which contains a description that could be a couple of paragraphs long.
I am trying to figure out a way to have PHP automatically go through these description fields and create a list of the top keywords used. I am looking for the top keywords for the entire table, not each post individually.
I know this is a bit of a resource-heavy operation, and it wouldn't be run very often anyway.
But I'd like to get a list like this:
some x 121
most x 110
frequent x 90
words x 50
So that I could see what the top used words are in the description field. Any idea at all where to start?
You can run your query,
loop through the records and append the descriptions together into 1 big happy string.
Then you can explode it by ' ' into an array,
get an array of counts using array_count_values(),
and re-sort it in descending order with arsort().
Update
Sample code too:
$string = '';
foreach ($your_result_set as $one_row)
{
    $string .= ' ' . $one_row['text']; // space so words from adjacent rows don't merge
}
$data = explode(' ', $string);
$data = array_count_values($data);
arsort($data);
If you have control over the database, one way would be to add triggers to this table that maintain another table of all keywords.
The insert trigger would go through new.description and increment all keywords found.
The delete trigger would do the same for old.description, but decrement the keywords.
The update trigger would do the same as delete and insert combined, i.e. decrement everything found in old.description and increment everything in new.description.
Once you have written and tested these triggers, dump all the data and re-import it to have the triggers do the work on all existing data.
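As a sketch, the insert trigger could look roughly like this; the names posts and keyword_counts (with a unique key on word) are my assumptions, and the delete/update triggers would mirror it with cnt = cnt - 1:

DELIMITER //
CREATE TRIGGER posts_after_insert AFTER INSERT ON posts
FOR EACH ROW
BEGIN
  DECLARE remaining TEXT;
  DECLARE w VARCHAR(255);
  SET remaining = LOWER(NEW.description);
  WHILE LENGTH(remaining) > 0 DO
    -- take the next space-separated word, then cut it off the front
    SET w = SUBSTRING_INDEX(remaining, ' ', 1);
    SET remaining = TRIM(SUBSTRING(remaining, LENGTH(w) + 2));
    IF LENGTH(w) > 0 THEN
      INSERT INTO keyword_counts (word, cnt) VALUES (w, 1)
        ON DUPLICATE KEY UPDATE cnt = cnt + 1;
    END IF;
  END WHILE;
END//
DELIMITER ;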
There are a few ways you can do this. I'm guessing you don't want to count every word, as words like and, if, it etc. will all be meaningless.
Also, how many rows are we talking about?
A simple solution is to create an array called words and loop through each row.
Explode the paragraph using " ", which gives you each word. You may also wish to do a strtolower() first if case is an issue.
Loop through the words and use array_key_exists() to see if there is a key; if not, create it
and give it a value of one, otherwise increment the value by one.
This will give you counts of each word.
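Something like this sketch, where the $rows result set and the stop-word list are placeholders:

$stopWords = ['and', 'if', 'it', 'or', 'a', 'i']; // extend as needed
$words = [];
foreach ($rows as $row) {
    $pieces = explode(' ', strtolower($row['description']));
    foreach ($pieces as $word) {
        if ($word === '' || in_array($word, $stopWords, true)) {
            continue; // skip empties and stop words
        }
        if (!array_key_exists($word, $words)) {
            $words[$word] = 0; // create the key
        }
        $words[$word]++; // increment the count
    }
}
arsort($words); // most frequent first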
If this is for searching a large database, it would be worthwhile adding keywords to a separate table on insert.
One good way to do that is to add the 5 most frequently used words, excluding those in an exclude list (and, it, or, a, i etc.), plus any word that already appears in the keyword table.
There are issues with this, though: this very response doesn't mention php, sql or query, which are what the post relates to. So maybe it would also be worth having tags/keywords added manually on insert.
I am looking for a way to display rankings/statistics of tags/keywords. I have tried but had no success. I know PHP very well, but I am confused about how to get statistics for a keyword using PHP/MySQL.
Like this: http://bit.ly/gHLXXo (aka http://www.torrentpond.com/stats/keywords).
Please help me solve this problem.
EDIT: I just created a keywords table with (ID, Keywords, time, view) columns and used some queries to get results, but no luck; I don't know how to manage it. Do I need to add 30 columns, one for each day, or do I need to serialize the data into the database as an array? If there is a solution, please give it to me...
EDIT 2: I don't need a chart or graph for this; I just need the keyword trends.
You need an SQL query like this:
SELECT COUNT( * ) AS `Rows` , `keyword_id`
FROM `keywords_table`
GROUP BY `keyword_id`
ORDER BY `Rows` DESC
LIMIT 0 , 30
Where "Rows" = count of this keyword and "keyword-id" = keyword id or name.
This return you top 30 keywords with the number of times the keyword appears.
Your first step should be to consider what data is needed to allow the statistics to be generated.
You will need to keep track of the individual keywords (or sets of keywords). Then, for each time one of the keywords is used, you'll insert a record into a statistical table which identifies the keyword and the date/time when it was used. When the search keyword is new, you create a new keyword entry in the list of keywords, as well as an entry in the 'use of keywords' table.
Your aggregate processing then needs to compute how often each keyword was used in the relevant period. You can do this daily; you won't be retrospectively adding new records. Given the aggregated data stored over time, you can compute positions (rankings) and changes in position. You can aggregate over days, weeks, or months if need be. The aggregate data will be stored in separate tables from the operational data. Once the basic unit of time (probably the day, maybe an hour) is past, you can consider whether to remove the original raw data, after you've done the first-step aggregations.
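A sketch of that daily aggregation step, assuming an operational table keyword_uses(keyword_id, used_at) and an aggregate table keyword_daily(keyword_id, day, uses), both invented names:

-- roll yesterday's raw usage records up into one row per keyword per day
INSERT INTO keyword_daily (keyword_id, day, uses)
SELECT keyword_id, DATE(used_at), COUNT(*)
FROM keyword_uses
WHERE used_at >= CURDATE() - INTERVAL 1 DAY
  AND used_at < CURDATE()
GROUP BY keyword_id, DATE(used_at);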
This is kind of a weird question so my title is just as weird.
This is a voting app, so I have a table called ballots that has two important fields: username and ballot. The field ballot is a VARCHAR, but it basically stores a list of 25 ids (numbers from 1-200) as CSV. For example it might be:
22,12,1,3,4,5,6,7,...
And another one might have
3,4,12,1,4,5,...
And so on.
So given an id (let's say 12) I want to find which row (or username) has that id in the leading spot. So in our example above it would be the first user because he has 12 in the second position whereas the second user has it in the third position. It's possible that multiple people may have 12 in the leading position (say if user #3 and #4 have it in spot #1) and it's possible that no one may have ranked 12.
I also need to do the reverse (who has it in the worst spot) but I figure if I can figure out one problem the rest is easy.
I would like to do this using a minimal number of queries and statements but for the life of me I cannot see how.
The simplest solution I thought of is to traverse all of the users in the table and keep track of who has the id in the leading spot. This will work fine for me now, but the number of users could grow very large.
The other idea I had was to do a query like this:
select `username` from `ballots` where `ballot` like '12,%'
and if that returns results I'm done because position 1 is the leading spot. But if that returned 0 results I'd do:
select `username` from `ballots` where `ballot` like '*,12,%'
where * is a wildcard character that will match one number and one number only (unlike the %). But I don't know if this can actually be done.
Anyway does anyone have any suggestions on how best to do this?
Thanks
I'm not sure I understood correctly what you want to do - to get a list of users who have a given number in the ballot field, ordered by its position in that field?
If so, you should be able to use the MySQL FIND_IN_SET() function:
SELECT username, FIND_IN_SET(12, ballot) as position
FROM ballots
WHERE FIND_IN_SET(12, ballot) > 0
ORDER BY position
This will return all rows that have your number (e.g. 12) somewhere in ballot, sorted by position. You can apply LIMIT to reduce the number of rows returned.
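To answer the "leading spot" part directly, ties included, you can compare each row against the minimum position; swapping MIN for MAX gives the worst spot. A sketch:

SELECT username
FROM ballots
WHERE FIND_IN_SET(12, ballot) =
      (SELECT MIN(FIND_IN_SET(12, ballot))
       FROM ballots
       WHERE FIND_IN_SET(12, ballot) > 0);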
Suppose it is a long article (say 100,000 words), and I need to write a PHP file to display page 1, 2, or page 38 of the article, by
display.php?page=38
but the number of words on each page can change over time (for example, right now it is 500 words per page, but next month we could easily change it to 300 words per page). What is a good way to divide the long article and store it in the database?
P.S. The design may be further complicated if we want to display 500 words but include whole paragraphs. That is, if we are already showing word 480 but the paragraph has 100 more words remaining, then show those 100 words anyway even though that exceeds the 500-word limit (and then the next page shouldn't show those 100 words again).
I would do it by splitting articles into chunks when saving them. The save script would split the article using whatever rules you design and save each chunk into a table like this:
CREATE TABLE article_chunks (
article_id int not null,
chunk_no int not null,
body text
)
Then, when you load a page of an article:
$sql = "select body from article_chunks where article_id = "
.$article_id." and chunk_no=".$page;
Whenever you want to change the logic of splitting articles into pages, you run a script that pulls all the chunks together and re-splits them.
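A sketch of such a save script, assuming PDO and splitting on blank-line paragraph boundaries; a chunk only closes at a paragraph end, which also covers the "whole paragraphs" requirement from the question:

// Split an article into roughly $wordsPerPage-word chunks, closing a
// chunk only at a paragraph boundary, and store them in article_chunks.
function save_chunks(PDO $db, int $articleId, string $body, int $wordsPerPage = 500): void
{
    $db->prepare('DELETE FROM article_chunks WHERE article_id = ?')
       ->execute([$articleId]);
    $insert = $db->prepare(
        'INSERT INTO article_chunks (article_id, chunk_no, body) VALUES (?, ?, ?)'
    );

    $chunk = '';
    $words = 0;
    $chunkNo = 1;
    foreach (preg_split("/\n\s*\n/", $body) as $paragraph) {
        $chunk .= $paragraph . "\n\n";
        $words += str_word_count($paragraph);
        if ($words >= $wordsPerPage) { // the page is full at this paragraph end
            $insert->execute([$articleId, $chunkNo++, trim($chunk)]);
            $chunk = '';
            $words = 0;
        }
    }
    if (trim($chunk) !== '') { // whatever remains becomes the last page
        $insert->execute([$articleId, $chunkNo, trim($chunk)]);
    }
}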
UPDATE: In giving this advice I am supposing your application is more read-intensive than write-intensive, meaning that articles are read more often than they are written.
You could of course output exactly 500 words per page, but the better way would be to put some kind of break markers into your article (end of sentence, end of paragraph), placed where a break would be good. This way your pages won't each have exactly X words, but about or up to X, and it won't tear sentences or paragraphs apart.
Of course, when displaying the pages, don't display these break markers.
You might want to start by breaking the article up into an array of paragraphs by splitting on newlines (note that PHP's old split() function is deprecated; explode() does the job here):
http://www.php.net/explode
$array = explode("\n", $articleText);
It's better to cut the text manually, because it's not a good idea to let a program decide where to cut. Sometimes it would cut just after an h2 tag and continue with the text on the next page.
This is a simple database structure for that:
article(id, title, time, ...)
article_body(id, article_id, page, body, ...)
The SQL query:
SELECT a.*, ab.body, ab.page
FROM article a
INNER JOIN article_body ab
ON ab.article_id = a.id
WHERE a.id = $article_id AND ab.page = $page
LIMIT 1;
In the application you can use jQuery to simply add a new textarea for another page...
Your table could be something like
CREATE TABLE ArticleText (
  artId INTEGER,
  wordNum INTEGER,
  wordId INTEGER,
  PRIMARY KEY (artId, wordNum),
  FOREIGN KEY (artId) REFERENCES Articles (artId),
  FOREIGN KEY (wordId) REFERENCES Words (wordId)
)
This of course may be very space-expensive, or slow, etc., but you'll need some measurements to determine that (so much depends on your DB engine). BTW, I hope it's clear that the Articles table is simply a table with metadata on articles, keyed by artId, and the Words table a table of all the words in every article, keyed by wordId (trying to save some space there by identifying already-known words when an article is entered, if that's feasible...). One special word must be the "end of paragraph" marker, easily identifiable as such and distinct from every real word.
If you do structure your data like this you gain lots of flexibility in retrieving by page, and page length can be changed in a snap, even query by query if you wish. To get a page:
SELECT wordText
FROM Articles
JOIN ArticleText USING (artID)
JOIN Words USING (wordID)
WHERE wordNum BETWEEN (#pagenum-1)*#pagelength AND #pagenum * #pagelength + #extras
AND Articles.artID = #articleid
parameters #pagenum, #pagelength, #extras, #articleid are to be inserted in the prepared query at query time (use whatever syntax your DB and language like, such as :extras or numbered parameters or whatever).
So we get #extras words beyond expected end-of-page and then on the client side we check those extra words to make sure one of them is the end-paragraph marker - otherwise we'll do another query (with different BETWEEN values) to get yet more.
Far from ideal, but, given all the issues you've highlighted, worth considering. If you can count on the page length always being e.g. a multiple of 100, you can adopt a slight variation of this based on 100-word chunks (and no Words table, just text stored directly per row).
Let the author divide the article into parts themselves.
Writers know how to make an article interesting and readable by dividing it into logical parts, like "Part 1—Installation", "Part 2—Configuration" etc. Having an algorithm do it is a bad decision, imho.
Chopping an article in the wrong place just makes the reader annoyed. Don't do it.
my 2¢
/0