php scanning content for specific keywords - php

As part of a CMS admin, I would like to scan new articles for specific keyphrases/tags that are stored in a MySQL database.
I am proficient enough to pull the list of keywords out, loop through them, and use stripos and substr_count to build an array of the found keywords. However, the average article is about 700 words and there are 16,000 tags and growing, so the loop currently takes about 0.5s. That is longer than I had hoped, and it will only get longer.
Is there a better way of doing this? Even knowing whether this type of procedure has a special name would help.
I have PHP 5.3 on Fedora, running on dedicated servers, so I don't have any shared-host issues.
EDIT - I am such a scatterbrain, I swore blind that I copied and pasted some code! Clearly not.
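$found = array();
while ($row = $pointer->fetch_assoc())
{
    // stripos() returns 0 for a match at the very start, so compare against false explicitly
    if (stripos($haystack, $row["Name"]) !== false)
    {
        $found[$row["Name"]] = substr_count($haystack, $row["Name"]);
    }
}
arsort($found);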
I think I explained myself badly. I want to run this on new articles, which are not in the database yet, so I was going to send the text via $_POST in an AJAX request rather than saving the article to the DB first.

MySQL full-text search (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) is exactly what you are looking for if you don't want to use a search engine such as Sphinx or Solr.

It sounds like your code looks something like this:
foreach ($keywords as $keyword) {
    // strpos() takes the haystack first and returns false (not -1) when nothing is found
    if (strpos($articleText, $keyword) !== false) {
        $foundKeywords[] = $keyword;
    }
}
Since the keywords array is so large and will continue to grow, you may want to switch your processing to loop through the words in the text instead of the keywords array. Something like this:
$textWords = explode(" ", $articleText);
$foundKeywords = array();
foreach ($textWords as $word) {
    // in_array() returns a boolean; array_search() returns the key, which is 0 (falsy) for the first element
    if (in_array($word, $keywords) && !in_array($word, $foundKeywords)) {
        $foundKeywords[] = $word;
    }
}
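The word-by-word idea can be taken a step further with a hash lookup, so that each membership test no longer scans the whole tag list. The following is a minimal sketch, not the answerer's code. It assumes PHP 7 or later (the question mentions 5.3), that tags are matched case-insensitively on whole words, and that no tag is longer than three words. The function name findTags and the $maxWords parameter are made up for illustration.

```php
<?php
// Hypothetical helper: find which tags occur in an article and how often.
// $tags is assumed to be the full list of tag names, loaded once from the database.
function findTags(string $text, array $tags, int $maxWords = 3): array
{
    // Build a lowercase lookup table so each membership test is a hash lookup.
    $lookup = [];
    foreach ($tags as $tag) {
        $lookup[strtolower($tag)] = $tag;
    }

    // Split on anything that is not a letter, digit or apostrophe.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $found = [];
    $count = count($words);
    for ($i = 0; $i < $count; $i++) {
        // Try every phrase of 1..$maxWords words starting at position $i.
        $phrase = '';
        for ($n = 0; $n < $maxWords && $i + $n < $count; $n++) {
            $phrase = $n === 0 ? $words[$i] : $phrase . ' ' . $words[$i + $n];
            if (isset($lookup[$phrase])) {
                $name = $lookup[$phrase];
                $found[$name] = ($found[$name] ?? 0) + 1;
            }
        }
    }
    arsort($found); // most frequent tags first
    return $found;
}

$tags = ['PHP', 'MySQL', 'full text search'];
$article = 'PHP talks to MySQL. Full text search in MySQL is one option for PHP sites.';
print_r(findTags($article, $tags));
```

With this layout the work grows with the length of the article (about 700 words times $maxWords lookups) rather than with the number of tags, which is the property the answer above is after. Phrases that span punctuation are treated as adjacent words here, which may or may not be what you want.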

Related

Long running console command slows down

Initial text
I created a console command in a Laravel project. The command takes HTML data from one table and checks two pattern matches for each record with preg_match. If a match is found, an update is made to the record in another table that has the same attribute as the record currently in focus in the foreach loop.
The number of records is about 3,500.
After about 150 iterations the command slows down dramatically, and it takes a whole day to finish.
I read all the similar questions on this forum, but they didn't help, not even the answer about forcing garbage collection.
The code looks like this:
$ras = RecordsA::all();
$pattern = '/===this is the pattern===/';
foreach ($ras as $ra) {
    $html = $ra->html;
    $rb = RecordB::where("url", $ra->url)->first();
    $rb->phone = preg_match($pattern, $html, $matches) ? $matches[1] : $rb->phone;
    $rb->save();
}
I searched for possible preg_match performance issues, but without success.
Has anybody run into this problem?
Update for MMMTroy
I forgot to say that I also tried custom code similar to yours:
$counter = DB::select("select count(*) as count from records_a")->first();
//Pattern for Wiktor Stribiżew :)
$pattern = '/Telefon:([^<])+</';
for ($i = 0; $i < $counter->count; $i += 150) {
    $ras = RecordsA::limit(150)->offset($i);
    foreach ($ras as $ra) {
        $html = $ra->html;
        $rb = RecordB::where("url", $ra->url)->first();
        $rb->phone = preg_match($pattern, $html, $matches) ? $matches[1] : $rb->phone;
        $rb->save();
    }
}
"Pagination via OFFSET" is Order(N*N). You would be better off with Order(N), so "remember where you left off".
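To illustrate "remember where you left off", here is a small sketch of keyset pagination. It is an assumption-laden stand-in rather than Laravel code: the table is a plain PHP array, and fetchAfter is a hypothetical function that mimics the result of a query such as SELECT * FROM records_a WHERE id > :lastId ORDER BY id LIMIT :limit. A real database would use the primary-key index to jump straight to the next row instead of scanning, which is where the saving comes from.

```php
<?php
// Stand-in for a table: rows sorted by primary key, as an indexed query would return them.
$table = [];
for ($n = 1; $n <= 10; $n++) {
    $table[] = ['id' => $n * 3, 'url' => "http://example.com/$n"];
}

// Hypothetical fetch function. In SQL this would be roughly:
//   SELECT * FROM records_a WHERE id > :lastId ORDER BY id LIMIT :limit
function fetchAfter(array $table, int $lastId, int $limit): array
{
    $page = [];
    foreach ($table as $row) {
        if ($row['id'] > $lastId) {
            $page[] = $row;
            if (count($page) === $limit) {
                break;
            }
        }
    }
    return $page;
}

$lastId = 0;
$seen = [];
// An empty page is falsy, which ends the loop.
while ($page = fetchAfter($table, $lastId, 4)) {
    foreach ($page as $row) {
        $seen[] = $row['id']; // process the row here
    }
    // Remember where we left off instead of counting an offset.
    $lastId = $page[count($page) - 1]['id'];
}
echo implode(',', $seen), "\n";
```

Laravel's query builder has a chunkById method that follows the same idea, if the installed version provides it.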
There is a good chance you are running out of memory. Laravel has a handy method to "chunk" results, which dramatically reduces memory use by limiting the number of items you loop over at once. Try something like this:
$pattern = '/===this is the pattern===/';
RecordsA::chunk(100, function ($ras) use ($pattern) {
    foreach ($ras as $ra) {
        $html = $ra->html;
        $rb = RecordB::where("url", $ra->url)->first();
        $rb->phone = preg_match($pattern, $html, $matches) ? $matches[1] : $rb->phone;
        $rb->save();
    }
});
What this is doing is grabbing 100 records at a time, and then looping through those. Once done, it creates an offset and grabs the next records in the database. This will prevent the entire loop from being stored in memory.
Does your database grow while you loop through it? And what happens if RecordB is not found and first() returns null? It feels to me like your RecordB table is growing, causing the search query to slow down.
I recently had similar problems and hit memory limits. In my experience, the number one cause of slowdowns and memory leaks in long-running commands is the query log.
Disable DB::$queryLog with DB::disableQueryLog();. Every time a query is run, the query string is stored in a variable.
Perhaps one of those things is causing it; otherwise the code looks fine to me.

Ordering and Selecting frequently used tags

I have looked on Stack Overflow for a solution to this, but couldn't find a good answer that covered the issues I was having. Essentially, what I'm trying to achieve is to list the 15 most frequently used tags across all my users' subjects.
This is how I currently select the data:
$sql = mysql_query("SELECT subject FROM `users`");
$row = mysql_fetch_array($sql);
I do apologise for the code looking nothing like what I'm trying to achieve. I really don't have any clue where to begin, and came here for a possible solution. This query works fine and I'd be able to list the subjects, but my problem is that the subjects contain ordinary words along with the hashtags. An example room subject would look like: hey my name is example #follow me. How would I grab only the #follow and, once I've grabbed all the hashtags from all of the subjects, echo the 15 most frequent?
Again, I apologise for the code looking nothing like what I'm trying to achieve, and I appreciate anyone's help. This was the closest post I found to solving my issue, but it was not useful.
Example
Here are three room subjects:
`Hello welcome to my room #awesome #wishlist`
`Hey hows everyone doing? #friday #awesome`
`Check out my #wishlist looking #awesome`
This is what I'm trying to view them as
[3] #awesome [2] #wishlist [1] #friday
What you want to achieve here is pretty complex for an SQL query, and you are likely to run into efficiency problems if you parse every subject each time you run this code.
The best solution is probably to have a table that associates tags with users. You can update this table every time a user changes their subject. Getting the number of times each tag is used then becomes trivial with COUNT(*) and GROUP BY tag.
One way would be to parse the result set in PHP. Once you query your subject line from the database, let's say you have them in the array $results, then you can build a frequency distribution of words like this:
$freqDist = [];
foreach ($results as $row)
{
    $words = explode(" ", $row);
    foreach ($words as $w)
    {
        if (array_key_exists($w, $freqDist))
            $freqDist[$w]++;
        else
            $freqDist[$w] = 1;
    }
}
You can then sort in descending order and display the distribution of words like this:
arsort($freqDist);
foreach ($freqDist as $word => $count)
{
    if (strpos($word, '#') !== FALSE)
        echo "$word: $count\n";
    else
        echo "$word: does not contain hashtag, DROPPED\n";
}
You could also use preg_match() to do fancier matching if you want but I've used a naive approach with strpos() to assume that if the word has '#' (anywhere) it's a hashtag.
Other functions of possible use to you:
str_word_count(): Return information about words used in a string.
array_count_values(): Counts all the values of an array.
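As a rough sketch of the "fancier matching" route, the hashtags can be pulled out with preg_match_all() and tallied with array_count_values(). The sample subjects are copied from the question; the \w+ pattern and the lowercasing of tags are assumptions about what should count as the same hashtag.

```php
<?php
// Sample subjects copied from the question's example.
$subjects = [
    'Hello welcome to my room #awesome #wishlist',
    'Hey hows everyone doing? #friday #awesome',
    'Check out my #wishlist looking #awesome',
];

$tags = [];
foreach ($subjects as $subject) {
    // Capture a '#' followed by one or more word characters.
    if (preg_match_all('/#\w+/', $subject, $matches)) {
        foreach ($matches[0] as $tag) {
            $tags[] = strtolower($tag); // assumption: #Awesome and #awesome are the same tag
        }
    }
}

$counts = array_count_values($tags); // tag => number of occurrences
arsort($counts);                     // most frequent first
$top = array_slice($counts, 0, 15, true);

foreach ($top as $tag => $count) {
    echo "[$count] $tag\n";
}
// Prints:
// [3] #awesome
// [2] #wishlist
// [1] #friday
```

Unlike the explode() approach, this never sees the non-hashtag words, so nothing has to be dropped afterwards.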

Want to display few words before and end of search keyword

For example I have a string in database table
"This is a test and I want to test everything on my website using
testing tool. I need help of my people to search the new things on
portal but I can not find any source of help"
This data is in my MySQL table.
Now, when I search for a word or phrase in this content, I want to display the 20 words before and the 20 words after the search string.
For example, if I search for my people,
then I should get the following result:
website using testing tool. I need help of my people to search the new things
Or if I search for portal, then it should give the following result:
I need help of my people to search the new things on portal but I can not find any source of help
I tried using a MySQL LIKE query, but it shows the full content.
You can use a MySQL LIKE query to find the matching rows, then use PHP's explode function to split the content into sentences. Run strpos on each piece to find the sentence that contains the keyword.
$keyword = "portal";
$result_str = "This is a test and i want to test everything on my website using testing tool. I need help of my people to search the new things on portal but i can not find any source of help";
$suggestions = explode(".", $result_str);
$match = "";
foreach ($suggestions as $key => $value) {
    if (strpos($value, $keyword) !== false) {
        $match = $value;
        break; // stop at the first sentence that contains the keyword
    }
}
echo $match;
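The sentence-based approach returns a whole sentence rather than a fixed number of words. If a word window is what is needed, something along these lines might work. It is only a sketch: the function name excerpt and its $radius parameter are invented here, words are split on whitespace, and trailing punctuation is ignored when comparing.

```php
<?php
// Hypothetical helper: return up to $radius words on each side of the first match,
// or null when the phrase does not occur.
function excerpt(string $text, string $phrase, int $radius = 20): ?string
{
    $words  = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $needle = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($needle);
    if ($n === 0) {
        return null;
    }

    for ($i = 0; $i + $n <= count($words); $i++) {
        // Compare case-insensitively, ignoring trailing punctuation on each word.
        $match = true;
        foreach (array_slice($words, $i, $n) as $k => $w) {
            if (strcasecmp(rtrim($w, '.,;:!?'), $needle[$k]) !== 0) {
                $match = false;
                break;
            }
        }
        if ($match) {
            $start  = max(0, $i - $radius);
            $length = ($i - $start) + $n + $radius;
            return implode(' ', array_slice($words, $start, $length));
        }
    }
    return null;
}

$text = 'This is a test and I want to test everything on my website using testing tool. '
      . 'I need help of my people to search the new things on portal but I can not find any source of help';
echo excerpt($text, 'my people', 5), "\n";
// Prints: tool. I need help of my people to search the new things
```

Calling excerpt($text, 'my people') with the default radius of 20 gives the 20-word window described in the question; the radius of 5 is used above only to keep the output short.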

Need to pull a list of image url's from a Mysql database based on a subdomain, and store them as a list

I have a MySQL forum database in which I need to search all the posts for a specific set of images at a particular URL.
The URLs point to images hosted on a subdomain that we don't have access to, like this: "http://images.website.com/images/randomnumberhere.jpg".
I need a MySQL query to pull these out and process them into a list, which we can later loop through to grab them all and move them. (I've got that part handled.)
I'm a PHP/MySQL programmer, but this feels like a regex problem and I'm not so great with that yet.
The issue is that we don't have a list of the images, and each name is a long random number (as far as I can see). So what I need is to match a string like "images.website.com/images/(randomnumbers).jpg" and then put the matches into a list.
You could get all fancy pants-like and use regular expressions,
but you could also try a simple
SELECT * FROM image_table WHERE image_source LIKE '%images.website.com/images/%'
Is this what you are looking for?
If you're looking to pull all of the text from the database, and then use PHP to create a list of the images, try something like this:
$image_list = array();
while ($row = $sql->fetch_array())
{
    $text = $row['text'];
    /* preg_match_all finds every image URL in the post, not just the first */
    if (preg_match_all("/http:\/\/images\.website\.com\/images\/[0-9]+\.(jpg|jpeg|png|gif)/i", $text, $matches))
    {
        /* $matches[0] holds the full URLs; merge them so $image_list stays a flat list */
        $image_list = array_merge($image_list, $matches[0]);
    }
}
Nothing fancy, and I didn't test it, but it should work. That's a hardcoded regex that matches the URL you're looking for specifically. You may want to modify it so that it can match multiple domains from an array, or something, but it should get you started.
EDIT: Should have mentioned that you could then loop through the $image_list array to display the images, or whatever you're going to do with them.
$rand = rand(); // or however you generate the random number; since you didn't give a range, it could also come from an array
$string = "images.website.com/images/$rand.jpg";
Or even a simple loop:
for ($i = 1; $i < 100; $i++) {
    $string = "images.website.com/images/$i.jpg";
    echo $string, "\n";
}
It's up to you how you use it with your database.

Searching for a link in a website and displaying it PHP

Hello, I'm a newbie in PHP. I am trying to make a search function using PHP that works only inside the website, without any database.
Basically, if I search for a string such as "Health", it should display the lines
The Joys of Health
Healthy Diets
This snippet is the only thing I could find that, if properly coded, would output the "lines" I want:
$myPage = array("directory.php","pages.php");
$lines = file($myPage[n]);
echo $lines[n];
I haven't tried it yet to see if it works, but before I do I want to ask whether there is a better way to do this.
If my files have too many lines, won't it stress the server?
The file() function will return an array. You should use file_get_contents() instead, as it returns a string.
Then, use regular expressions to find specific text within a link.
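A minimal sketch of that suggestion follows. The function name findLinks is made up, the sample markup is invented to match the question's example, and the regular expression only copes with simple, flat anchor tags. Note also that file_get_contents() on a local .php file returns its source rather than its rendered output, so in practice the HTML may need to come from elsewhere.

```php
<?php
// Hypothetical helper: return the text of every <a> element whose text contains $term.
function findLinks(string $html, string $term): array
{
    $hits = [];
    // Non-greedy match of anchor contents; adequate for simple markup only.
    if (preg_match_all('/<a\b[^>]*>(.*?)<\/a>/is', $html, $matches)) {
        foreach ($matches[1] as $inner) {
            $text = trim(strip_tags($inner));
            // stripos() is case-insensitive; compare against false explicitly.
            if ($text !== '' && stripos($text, $term) !== false) {
                $hits[] = $text;
            }
        }
    }
    return $hits;
}

// In practice $html would come from file_get_contents() on one of your pages.
$html = '<ul><li><a href="joys.php">The Joys of Health</a></li>'
      . '<li><a href="diets.php">Healthy Diets</a></li>'
      . '<li><a href="contact.php">Contact</a></li></ul>';
print_r(findLinks($html, 'Health'));
```

For the sample markup this returns the two link texts from the question, "The Joys of Health" and "Healthy Diets". For anything beyond simple markup, an HTML parser is usually a safer choice than a regular expression.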
Your goal is fine, but the method you're thinking about is not. The file() function reads a file line by line into an array. This assumes the HTML is well-structured in a human-readable fashion, which is not always the case. However, if you're the one providing the HTML and you make sure the structure is perfectly defined, OK. Here is the example you gave us, completed (bear in mind it's the 'wrong' way of solving your problem, but if you want to follow that pattern, it works):
function pagesearch($pages, $string) {
    $tags = [];
    if (!empty($pages) && !empty($string)) {
        foreach ($pages as $page) {
            if ($lines = file($page)) {
                foreach ($lines as $line) {
                    if (!empty($line)) {
                        // Compare against false explicitly: a match at position 0 is still a match
                        if (mb_strpos($line, $string) !== false) {
                            $tags[$page][] = $line;
                        }
                    }
                }
            }
        }
    }
    // Always return an array, even when the input is empty
    return $tags;
}
This returns an array of the pages you passed in, each with every line that contains the word you are looking for. As I said, it's not the way you want to solve this, but it's a way.
Hope that helps
Because you do not want to use any database, and because the term database is very broad and includes the file system, you are asking to search a database without having a database.
That makes no sense. In your case the file system is at least one database. If you can accept that you want to search a database (here, your HTML files) but do not want to use a database to store anything related to the search (e.g. an index or cached results), then what you suggest is basically how it works: a real-time, text-based, line-by-line file search.
Sure, it is very rudimentary, but since your constraint is "no database", you have already found the only possible way. And yes, it will stress your server when used, because real-time search is expensive.
Otherwise Lucene/Solr is normally used for the job, but that is a database, and a server as well.
