Determine top keywords in MySQL table field using PHP

I have a table in MySQL with several different fields; one of them contains a description that can be a couple of paragraphs long.
I am trying to figure out a way to have PHP automatically go through these description fields and create a list of the top keywords used. I am looking for the top keywords for the entire table, not for each post individually.
I know this is a fairly resource-heavy operation, but it wouldn't be run very often anyway.
But I'd like to get a list like this:
some x 121
most x 110
frequent x 90
words x 50
So that I could see what the top used words are in the description field. Any idea at all where to start?

You can run your query, loop through the records, and append the descriptions together into one big happy string.
Then you can explode it by ' ' into an array.
Get an array of counts per word using array_count_values().
Re-sort in descending order with arsort().
Update
Sample code too:
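$string = '';
foreach ($result_set as $one_row) // $result_set: the rows fetched by your query
{
    // the trailing space keeps the last word of one row from fusing with the first word of the next
    $string .= $one_row['text'] . ' ';
}
$data = explode(' ', trim($string)); // one array element per word
$data = array_count_values($data);   // word => number of occurrences
arsort($data);                       // most frequent first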

If you have control over the database, one way would be to add triggers to this table that maintain another table with all keywords.
The insert trigger would go through new.description and increment the count of every keyword found.
The delete trigger would do the same for old.description, decrementing the counts.
The update trigger would do the work of both delete and insert, i.e. decrement everything found in old.description and increment everything found in new.description.
Once you have written and tested these triggers, dump all the data and re-import it so that the triggers do the work on all existing data.
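The bookkeeping those three triggers would perform can be sketched in PHP. This is only an illustration of the increment/decrement logic, not actual trigger code: wordCounts() and keywordDelta() are hypothetical helper names, and splitting on non-alphanumeric ASCII characters is an assumption about what counts as a keyword.

```php
<?php
// Count the words in one description (lowercased, ASCII letters and digits only).
function wordCounts(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Net change per keyword when a description goes from $old to $new.
// Insert: keywordDelta('', $new). Delete: keywordDelta($old, '').
function keywordDelta(string $oldDescription, string $newDescription): array
{
    $delta = wordCounts($newDescription);          // increments for the new text
    foreach (wordCounts($oldDescription) as $word => $n) {
        $delta[$word] = ($delta[$word] ?? 0) - $n; // decrements for the old text
    }
    // Drop keywords whose count did not change.
    return array_filter($delta, fn($change) => $change !== 0);
}

print_r(keywordDelta('red car', 'red fast car car'));
```

Each entry in the returned array is the amount by which that keyword's row in the keyword table would be adjusted.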

There are a few ways you can do this. I'm guessing you don't want to count every word, as words like and, if, it etc. will all be meaningless.
Also, how many rows are we talking?
A simple solution is to create an array called words and loop through each row.
Explode the paragraph using " ", which gives you each word. You may also wish to run strtolower() first if case is an issue.
Loop through the words and use array_key_exists() to see whether there is already a key for the word. If not, create it
with a value of one; otherwise increment the value by one.
This will give you counts of each word.
If this is for a search of a large database, it would be worthwhile adding keywords to a separate table on insert.
One way I think this would work well is to add the 5 most frequently used words, excluding those in the exclude list (and, it, or, a, i etc.), and also add any word that already appears in the keyword table.
There are issues with this: this very response doesn't mention php, sql or query, which are what the post relates to. Maybe it would be worth having tags/keywords added on insert.
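A minimal sketch of the counting loop with an exclude list might look like the following. The function name, the tiny stop-word list, and the choice to split on anything that is not an ASCII letter, digit or apostrophe are all assumptions for illustration; text with multi-byte characters would need mb_strtolower() and a Unicode-aware pattern instead.

```php
<?php
// Count word frequencies across many descriptions, skipping stop words.
function countKeywords(array $descriptions, array $stopWords): array
{
    $skip = array_flip($stopWords); // word => index, for fast isset() lookups
    $counts = [];
    foreach ($descriptions as $text) {
        // Lowercase so "Word" and "word" are counted together, then split into words.
        $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (isset($skip[$word])) {
                continue; // excluded word
            }
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // highest count first
    return $counts;
}

$top = countKeywords(
    ['The cat and the hat.', 'A cat sat; the CAT ran.'],
    ['the', 'and', 'a']
);
print_r(array_slice($top, 0, 5, true)); // the five most frequent words
```

The null-coalescing increment does the same job as the array_key_exists() check described above, in one line.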

Related

Loading specific field from row using array_filter and array_slice, want to empty field afterwards

I am having a pretty complicated issue here. Let me first try to explain it in words. I am inserting 100 keywords and 100 URL's in a database table, using fields named url_01, url_02, url_03, anchor_01, anchor_02, anchor_03 etc.
Now I want to pick the first URL/anchor combination and I manage to do that using a combination of array_filter and array_slice like this:
$anchor=current(array_filter(array_slice($gig,5)));
$url=current(array_filter(array_slice($gig,105)));
The reason I slice it at position 5 is because the other fields should be ignored, and anchor_01 ranges till anchor_00 at the end, so at position 105 my url fields start. This works fine and it provides me with the url/anchor combo while skipping empty fields.
The reason I use current(array_filter) is because I want to erase the anchor/url combo from the database after it has been used; this allows me to take the first entry while ignoring empty spots, while the slice makes sure I start after my other keys.
However, when I made this I only thought about how to pull it from the database, not how to erase it.
So my question: Is there a way to identify which two fields those values came from, so that I erase those and not any of the others?
Or is there a much better way where I can first insert a list of 100 URLs and 100 anchors, and make sure I always get a unique value from the database, as each entry may only be used once, and preferably I see it erased after that.
I found a solution, I am not able to erase the previous value but I am able to select the right field by adding a counter like this:
$counting=$gig['count'];
$anchor=current(array_filter(array_slice($gig,6+$counting)));
$url=current(array_filter(array_slice($gig,106+$counting)));
if($server_date>$post_date) {
$add_count=$counting+1;
$sql_date="UPDATE `wp_backlinks` SET count='".$add_count."' WHERE id=".$client_id;
$result = $wpdb->query($sql_date);
} else { echo 'WAIT'; }
Easy fix, still I tend to think there are better ways to accomplish this in php?

Searching a Database Table For a Large List of Keywords

I am trying to search a table for multiple keywords. However, I am not looking for one keyword or even 10. It is around one thousand keywords. These keywords are also in a table and can be controlled. I would rather not hard-code these keywords into my SQL command...
The target table I am searching contains a lot of text, and a cell could contain an entire sentence or paragraph... so doing something like a 'full text' search in MySQL seems like a good start.
Very similar to this question ("mysql FULLTEXT search multiple words"), but again, when I speak of multiple keywords, I mean hundreds to thousands.
Can I dump my keyword table into an array and run a FULLTEXT search? Can this even be approached with MySQL, or are there limits I'm not considering? I'm open to other technology suggestions too. Sorry that I don't have code or errors to post. I am first trying to understand conceptually how to approach this. -tia
Recently I had to make a similar decision. I decided to go with Lucene. I store the indexable fields in Lucene, and return an id for the MySQL row.
Another choice is Sphinx; a full tutorial can be found here.
See a related post here. And here.
SELECT * FROM articles WHERE body LIKE '%$keyword%';
You just need to use a loop to build your MySQL query from all the keywords.
It's clear that you separated your keywords with a comma or -
so you need to explode the keywords by comma and put them in a variable.
For example:
$keywords = "key1,key2,key3,..."; // values come from keywords column from db.
now you just need to explode $keywords.
$keys=explode(',',$keywords);
and finally in your query you need to use a for loop:
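$query = "SELECT * FROM targettable";
$conditions = array();
foreach ($keys as $key)
{
    // escape $key first (or use a prepared statement) if it can contain quotes
    $conditions[] = "keywords LIKE '%" . $key . "%'";
}
// a single WHERE, with the conditions joined by OR (any keyword matches); use AND to require all
if ($conditions)
{
    $query .= " WHERE " . implode(" OR ", $conditions);
}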
I named your keywords column "keywords".
You can also easily add your other conditions to $query.
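Because the loop above concatenates keywords straight into the SQL string, a placeholder-based variant is worth sketching. The table and column names are the ones assumed in the answer; buildKeywordQuery() is a hypothetical helper, and joining with OR assumes a row should match when any keyword occurs.

```php
<?php
// Build one WHERE clause with a "?" placeholder per keyword.
// Returns [sql, params] ready for a prepared statement.
function buildKeywordQuery(array $keys): array
{
    $conditions = [];
    $params = [];
    foreach ($keys as $key) {
        $key = trim($key);
        if ($key === '') {
            continue; // ignore empty entries such as a trailing comma
        }
        $conditions[] = 'keywords LIKE ?';
        $params[] = '%' . $key . '%';
    }
    $sql = 'SELECT * FROM targettable';
    if ($conditions) {
        $sql .= ' WHERE ' . implode(' OR ', $conditions);
    }
    return [$sql, $params];
}

[$sql, $params] = buildKeywordQuery(explode(',', 'key1,key2,key3'));
echo $sql, PHP_EOL;
```

With PDO, the pair could then be used as $stmt = $pdo->prepare($sql); $stmt->execute($params);. Note that % and _ inside a keyword still act as LIKE wildcards, and that a thousand LIKE conditions will scan the table, so this is only reasonable for modest keyword lists.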
Watch out with MySQL fulltext searches: in natural-language mode on MyISAM tables, a word that appears in 50% or more of the rows is treated as a stopword and ignored, so a search for it returns nothing.
This sounds likely to occur if you have a hit list of 1000s of possible words.
I'd say you would be better off investigating the extraction of keywords from text (lets imagine they are articles) as they are stored.
Then store these keywords in their own table.
For best results though, you'd likely want to investigate Natural Language Processing and extract meaning from articles rather than just words.

How to handle an array in a SQL field?

I have a feed that comes from the State of Florida in a CSV that I need to load daily into MySQL. It is a listing of all homes for sale in my area. One field has a list of codes, separated by commas. Here's one such sample:
C02,C11,U01,U02,D02,D32,D45,D67
These codes all mean something (pool, fenced in area, etc) and I have the meanings in a separate table. My question is, how should I handle loading these? Should I put them in their own field as they are in the CSV? Should I create a separate table that holds them?
If I do leave them as they are in a field (called feature_codes), how could I get the descriptions out of a table that has the descriptions? That table is simply feature_code, feature_code_description. I don't know how to break them apart in my first query to do the join to bring the description in.
Thank you
As a general rule, CSV data should never be stored in a field, especially if you actually need to consider individual bits of the CSV data, instead of just the CSV string as a whole.
You SHOULD normalize the design and split each of those sub "fields" into their own table.
That being said, MySQL does have FIND_IN_SET(), which allows you to sort-of search those CSV strings and treat each value as its own distinct datum. It's not particularly efficient to use this, but it does put a bandaid on the design.
You should keep the information about feature codes in a separate table, where each row is a pair of house identifier and feature identifier:
HouseID  FeatureID
1        C07
1        D67
2        D02
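Producing those pairs from one CSV field could be sketched as follows. The function name featureRows() and the house id used in the example are assumptions; the code string is the sample from the question.

```php
<?php
// Turn one CSV field of feature codes into (house id, feature code) rows,
// one row per code, ready to be inserted into the link table.
function featureRows(int $houseId, string $featureCodes): array
{
    $rows = [];
    foreach (explode(',', $featureCodes) as $code) {
        $code = trim($code);
        if ($code !== '') {           // skip blanks from empty fields or stray commas
            $rows[] = [$houseId, $code];
        }
    }
    return $rows;
}

$rows = featureRows(1, 'C02,C11,U01,U02,D02,D32,D45,D67');
foreach ($rows as [$houseId, $featureId]) {
    echo $houseId, ' ', $featureId, PHP_EOL;
}
```

Running this once per CSV record during the daily load would fill the link table, after which descriptions come from an ordinary JOIN on the feature code.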
You can use explode() to separate your CSV string : http://php.net/manual/en/function.explode.php
$string = 'C02,C11,U01,U02,D02,D32,D45,D67';
$array = explode(',', $string);
Then with your list of feature_codes you can easily retrieve your feature_code_description but you need to do another query to get an array with all your feature_codes and feature_code_description.
Or split your field and put it in another table with the home_id.
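If the codes stay in one field, the "other query" mentioned above could be built along these lines. The column names come from the question; the table name features and the helper name are assumptions.

```php
<?php
// Build an IN (...) lookup for the codes found in one CSV field.
// Returns [sql, params]; sql is null when there are no codes to look up.
function buildDescriptionQuery(string $featureCodes): array
{
    // Split, trim, and drop empty entries.
    $codes = array_values(array_filter(array_map('trim', explode(',', $featureCodes)), 'strlen'));
    if (!$codes) {
        return [null, []];
    }
    $placeholders = implode(', ', array_fill(0, count($codes), '?'));
    $sql = "SELECT feature_code, feature_code_description FROM features "
         . "WHERE feature_code IN ($placeholders)";
    return [$sql, $codes];
}

[$sql, $codes] = buildDescriptionQuery('C02,C11,U01');
echo $sql, PHP_EOL;
```

The returned pair is meant for a prepared statement, with $codes bound to the placeholders in order.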
You can save it in your DB as is and when you read it out you can run the php function explode. Go check that function out. It will build an array for you out of a string separating the values by whatever you want . In your case you can use:
$array_of_codes = explode(",", $db_return_string);
This will make an array out of each code separating them by the commas between them. Good luck.

Find out most popular words in MySQL / PHP

I have a database with almost 100,000 comments and I would like to detect the most used words (using stop words to avoid common words).
I want to do this only one time, and then use a few of the most popular words to Tag the comments that contains them.
Can you help me with the Query and PHP code to do this?
Thanks!
The easiest approach I guess would be:
Create two new tables: keywords (id, word) and keywords_comments (keyword_id, comment_id, count)
keywords stores a unique id and the keyword you found in a text
keywords_comments stores one row for each pair of keyword and comment that contains it. In count you will save the number of times this keyword occurred in the comment. The two columns keyword_id + comment_id together form a unique key, or directly the primary key.
Retrieve all comments from the database
Parse through all comments and split them on non-word characters (or other boundaries)
Write these entries to your tables
Example
You have the following two comments:
Hello, how are you?!
Wow, hello. My name is Stefan.
Now you would iterate over both of them and split them on non-word characters. This would result in the following lowercase words for each text:
- First text: hello, how, are, you
- Second text: wow, hello, my, name, is, stefan
As soon as you have parsed one of these texts, you can already insert the result into the database. I guess you do not want to load 100,000 comments into RAM.
So it would go like this:
Parse the first text and get the keywords above
Write each keyword into the table keywords if it is not there yet
Set a reference from the keyword to the comment (keywords_comments) and set the count correctly (in our example each word occurs only once in each text, but in general you have to count the occurrences).
Parse second text
…
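The parsing step above can be sketched for a single comment like this. keywordRows() is a hypothetical helper; it returns (word, comment id, count) triples and leaves the lookup of keyword_id and the actual INSERT statements out. Splitting on non-alphanumeric ASCII characters is an assumption about what "non-word characters" means here.

```php
<?php
// Split one comment into lowercase words and count each word's occurrences.
// Each returned row holds: the word, the comment id, the count in this comment.
function keywordRows(int $commentId, string $comment): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($comment), -1, PREG_SPLIT_NO_EMPTY);
    $rows = [];
    foreach (array_count_values($words) as $word => $count) {
        // Purely numeric words become integer array keys, so cast back to string.
        $rows[] = [(string) $word, $commentId, $count];
    }
    return $rows;
}

foreach (keywordRows(1, 'Hello, how are you?!') as [$word, $commentId, $count]) {
    echo "$word (comment $commentId) x $count", PHP_EOL;
}
```

Because each comment is handled on its own, only one comment's words are in memory at a time; stop words could be filtered out of $words before counting.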
Minor improvement
A very easy improvement, which you will probably need for 100,000 comments, is to use a counting variable or to add a new field has_been_analyzed to each comment. Then you can read the comments from the database chunk by chunk.
I usually use counting variables when I read data chunkwise and know that the data cannot change in the range I have already read (i.e. it will stay consistent up to the point I currently am at). Then I do something like:
SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
…
Note that this only works if we know for sure that no rows will be added at a position we think we have already read. E.g. using DESC would not work, as new data could be inserted at the front. Then the whole offset would shift, and we would read one row twice and never read the new one.
If you cannot make sure that the outside counting variable stays consistent, you can add a new field analyzed which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which not. An SQL query would then look like this:
SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
This works as long as you do not parallelize the workload (with multiple clients or threads). Otherwise you would have to make sure that reading and setting analyzed to true is atomic (synchronized).
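The offset-based loop behind the LIMIT queries above can be sketched as follows. Both callables are assumptions: $fetchChunk stands in for running a "LIMIT offset, size" query and returning its rows, and $handleRow for the per-comment analysis. The example feeds it from an in-memory array so it runs without a database.

```php
<?php
// Process rows in fixed-size chunks until the source returns no more rows.
// Returns the total number of rows handled.
function processInChunks(callable $fetchChunk, callable $handleRow, int $chunkSize = 100): int
{
    $offset = 0;
    while (true) {
        $rows = $fetchChunk($offset, $chunkSize);
        if (!$rows) {
            break; // an empty chunk means we have read everything
        }
        foreach ($rows as $row) {
            $handleRow($row);
        }
        $offset += count($rows); // the counting variable described above
    }
    return $offset;
}

$data = range(1, 250); // stand-in for 250 comments
$seen = [];
$total = processInChunks(
    fn(int $offset, int $size) => array_slice($data, $offset, $size),
    function ($row) use (&$seen) { $seen[] = $row; }
);
echo "processed $total rows", PHP_EOL;
```

The same caveat applies as for the raw queries: the offset is only reliable if no rows are inserted or removed before the current position while the loop runs.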

Count line breaks in a field and order by

I have a field in a table recipes that has been inserted using mysql_real_escape_string, I want to count the number of line breaks in that field and order the records using this number.
p.s. the field is called Ingredients.
Thanks everyone
This would do it:
SELECT *, LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', '')) as Count
FROM Recipes
ORDER BY Count DESC
The way I am getting the amount of linebreaks is a bit of a hack, however, and I don't think there's a better way. I would recommend keeping a column that has the amount of linebreaks if performance is a huge issue. For medium-sized data sets, though, I think the above should be fine.
If you wanted to have a cache column as described above, you would do:
UPDATE
Recipes
SET
IngredientAmount = LENGTH(Ingredients) - LENGTH(REPLACE(Ingredients, '\n', ''))
After that, whenever you are updating/inserting a new row, you could calculate the amounts (probably with PHP) and fill in this column before-hand. Or, if you're into that sort of thing, try out triggers.
I'm assuming a lot here, but from what I'm reading in your post, you could change your database structure a little bit, and both solve this problem and open your dataset up to more interesting uses.
If you separate ingredients into its own table, and use a linking table to index which ingredients occur in which recipes, it'll be much easier to be creative with data manipulation. It becomes easier to count ingredients per recipe, to find similarities in recipes, to search for recipes containing sets of ingredients, etc. also your data would be more normalized and smaller. (storing one global list of all ingredients vs. storing a set for each recipe)
If you're using a single text entry field to enter ingredients for a recipe now, you could do something like break up that input by lines and use each line as an ingredient when saving to the database. You can use something like PHP's built-in levenshtein() or similar_text() functions to deal with misspelled ingredient names and keep the data as normalized as possible without having to hand-groom your [users'] data entry too much.
This is just a suggestion, take it as you like.
You're going a bit beyond the capabilities and intent of SQL here. You could write a stored procedure to scan the string and return the number and then use this in your query.
However, I think you should revisit the design of whatever is inserting the Ingredients so that you avoid searching the strings of every row whenever you do this query. Add a 'num_linebreaks' column, calculate the number of line breaks and set this column when you're adding the Ingredients.
If you've no control over the app that's doing the insertion, then you could use a stored procedure to update num_linebreaks based on a trigger.
Got it thanks, the php code looks like:
$check = explode("\r\n", $_POST['ingredients']);
$lines = count($check);
So how could I update all the information in the table, so that Ingred_count is set based on the field Ingredients in one fell swoop for previous records?
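For the PHP side, a slightly more defensive count might look like the sketch below; countLineBreaks() is a hypothetical helper. It treats \r\n, a lone \r and a lone \n each as one line break, which matters when some rows were submitted with Windows line endings and others without.

```php
<?php
// Count line breaks regardless of whether the text uses \r\n, \r or \n.
function countLineBreaks(string $text): int
{
    // Normalize: "\r\n" is replaced first, so it is not counted as two breaks.
    $normalized = str_replace(["\r\n", "\r"], "\n", $text);
    return substr_count($normalized, "\n");
}

$breaks = countLineBreaks("2 eggs\r\n1 cup flour\r\nsalt");
echo $breaks, PHP_EOL;
```

Note that count(explode("\r\n", ...)) gives the number of lines, which is one more than the number of line breaks. For records that already exist, a single UPDATE using the LENGTH(...) - LENGTH(REPLACE(...)) expression from the earlier answer would fill the count column without going through PHP at all.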
