TL;DR: I'm trying to get some 'fuzzy' results from a search query that fails to come up with things that are actually there.
The problem:
This specific setup is in WordPress land, but that's not decisively relevant to the issue. There's a longstanding problem with WordPress search: it uses AND operators, not OR. This leads to the following problem:
Some people (not necessarily many, but often key customers) will search for john+doe+jane and find nothing. Had they searched for just john+doe, they'd have found a ton of results. Or maybe they simply misspell a word or, worse, it was misspelled in the article, etc. I need this fixed somehow.
I tried all the plugins, but they either fare worse than the default search or they just won't work (likely because I've also made other customizations to the search, which I can't go back on). So eventually I tried to work my own way out.
I know next to nothing about databases (and not much about WordPress hooks either), so I tried to move the OR issue to PHP.
My thinking: first of all, are there quite a few search results? If so, trust WordPress search to have done its job, as it does for 99+ percent of cases. But are there very few, or none at all? Then there may be a problem. Step in, try and fix it!
So, for when there are fewer than 10 results, I made this fallback script:
The solution:
// [Wordpress search loop ran before this]
// find out how many results below the target
$missing = 10 - count($posts);
// or `$missing = 10 - $wp_query->found_posts` – same thing – but I've already constructed this `$posts` array, which I'll need later anyway.
// only do it if you actually need it (less than 10 search results from WP)
if ($missing > 0) {
// give each search term its own life
$terms = explode(" ", $string);
// where `$string = get_query_var('s')` // or parse the terms out of `$_SERVER['REQUEST_URI']` in a more generic environment.
// ...
$matches = array(); // the biscuit counter filled below (was `$weight`, which was never used)
$results = array();
// start some nasty nested foreaches to search for each term
foreach ($terms as $term) {
// (drop mostly useless one-or-two-letter words before even querying)
if (strlen($term) < 3) continue;
// first, get this huge array of posts // wordpress stuff
$results = get_posts(array('s' => $term, 'numberposts' => 10));
// then iterate through each of them
foreach ($results as $result) {
// reward each occurrence with a biscuit, they'll prove useful later*
$matches[$result->ID] = isset($matches[$result->ID]) ? $matches[$result->ID] + 1 : 1;
// don't let any term become a neverending story (stop inner foreach)
if (count($matches) >= 10) break;
}
// stop the game altogether – by now we should have just enough (stop outer foreach)
if (count($matches) > 100) break;
}
}
// sort the results nicely by biscuits awarded
// *(so if one term came up 3/4 times it goes before one that came up 2/4)
arsort($matches);
// to make arrays compatible it's ok to forget the exact biscuit numbers (they don't really matter anymore)
$matches = array_keys($matches);
// if WordPress found you first we don't need you, you're already up there in the original loop
// (this assumes `$posts` holds plain post IDs; if it holds WP_Post objects, diff against `wp_list_pluck($posts, 'ID')` instead)
$fallbacks = array_diff($matches, $posts);
// [standard wordpress loop]
$fallback = new WP_Query( array(
'post__in' => $fallbacks,
'orderby' => 'post__in',
// add just enough to get to ten
'posts_per_page' => $missing
) );
if ( $fallback->have_posts() ) {
while ( $fallback->have_posts() ) {
$fallback->the_post();
// [magic goes here]
}
}
}
Does the job. However...
The issues I'm aware of:
Performance: It takes long, but not too long. The page (openresty/nginx + redis + varnish) would load in 0.8s instead of the usual 0.3s cached or 0.4-0.5s busted*.
Bad bots doing bad stuff: There's merciless sanitizing in WP and decent rate-limiting on the webserver, throwing nice 400s and 429s**.
The issues I'm not aware of:
*Not live yet, so I don't know how it would scale. Too many at once – what could happen? Can that sort of nested foreach kill a database?
**I'd hate if someone still manages to find a weak spot before I do. Any in sight?
Related
I want to search and replace links based on a correspondence array.
I wrote this solution, but I find it a bit simplistic and maybe not efficient enough to handle 2000 pages and 15000 links. What do you think? Would DOMDocument or a regex be more effective? Thank you for your answers.
$correspondences = array(
"old/exercise-2017.aspx" => "/new/exercise2017.aspx",
"old/exercise-2016.aspx" => "/new/exercise2016.aspx",
"old/Pages/index.aspx" => "/new/en/previous-exercises/index.aspx"
);
$html = '<ul><li>Appraisal exercise 2017</li><li>Appraisal exercise 2016</li><li> Previous appraisal exercises</li></ul>';
foreach ($correspondences as $key => $value) {
if (strpos($html, $key) !== false) {
$html = str_replace($key, $value, $html);
}
}
// (the strpos() guard is optional – str_replace() is already a no-op when $key is absent)
echo $html;
This approach is not the most efficient, but it should be fine as long as you do it only once and store the result. Given that you have already implemented it this way, you should just go with it unless you run into an actual performance problem.
If you are trying to do this at runtime (i.e. modify the page every single time it is served) then, yes, this is likely to be problematic. 15000 string searches per page is likely to be slow.
In that case, the most obvious change would be the one implied by this answer: do it once and save the result, instead of calculating it at run time.
If you must do it at runtime, then the optimal solution would use DOMDocument to get the URL. You could then replace it based on a set of rules if possible (e.g. if /old/Pages/ always gets translated to /new/en/previous-exercises, implement logic for that). Or you could use a dictionary keyed on the old URL to get the new URL, if you must individually code each path.
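A minimal sketch of that dictionary-plus-DOMDocument route. The array keys follow the question's $correspondences; the sample markup is invented, since the question's list items carry no hrefs:

```php
$correspondences = array(
    "old/exercise-2017.aspx" => "/new/exercise2017.aspx",
    "old/exercise-2016.aspx" => "/new/exercise2016.aspx",
);

// invented sample markup for illustration
$html = '<ul><li><a href="old/exercise-2017.aspx">Appraisal exercise 2017</a></li></ul>';

$doc = new DOMDocument();
// suppress fragment warnings; keep the fragment as-is (no <html><body> wrapper)
@$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // one O(1) dictionary lookup per link, instead of 15000 str_replace passes per page
    if (isset($correspondences[$href])) {
        $link->setAttribute('href', $correspondences[$href]);
    }
}

$html = $doc->saveHTML();
```

The cost then scales with the number of links on the page, not with the size of the correspondence table.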
I've made a script that loads a huge array of objects from a MySQL database, and then loads a huge (but smaller) list of objects from the same MySQL database.
I want to iterate over each list to check for irregular behaviour, using PHP. But every time I run the script it takes forever to execute (so far I haven't seen it complete). Are there any optimizations I can make so it doesn't take this long? There are roughly 64,150 entries in the first list and about 1,748 entries in the second list.
This is what the code generally looks like in pseudo code.
// an array of size 64000 containing objects in the form of {"id": 1, "unique_id": "kqiweyu21a)_"}
$items_list = [];
// an array of size 5000 containing objects in the form of {"inventory": "a long string that might have the unique_id", "name": "SomeName", "id": 1};
$user_list = [];
Up until this point the results are instant... But when I do this it takes forever to execute, seems like it never ends...
foreach($items_list as $item)
{
foreach($user_list as $user)
{
if(strpos($user["inventory"], $item["unique_id"]) !== false)
{
echo("Found a version of the item");
}
}
}
Note that the echo should rarely happen... The issue isn't with MySQL, as the $items_list and $user_list arrays populate almost instantly. It only starts to take forever when I try to iterate over the lists...
With roughly 112 million iterations (64,150 × 1,748), adding a break will still help somewhat, even though the match rarely happens...
foreach($items_list as $item)
{
foreach($user_list as $user)
{
if(strpos($user["inventory"], $item["unique_id"]) !== false){
echo("Found a version of the item");
break;
}
}
}
Alternate solution 1, with PHP 5.6: you could also use pthreads and split your big array into chunks to pool them into threads... combined with the break, this will certainly improve things.
Alternate solution 2: use PHP 7 – the performance improvements for array manipulation and loops are big.
Also try to sort your arrays before the loop. It depends on what you're looking for, but very often sorting the arrays first will cut the loop time considerably when the condition is found early.
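If the unique_ids happen to appear in the inventory as whole, delimiter-separated tokens (an assumption – the question doesn't show the inventory format), the nested loop can be avoided entirely: hash the ~64k ids once, then probe with isset() per token. A sketch with invented sample rows shaped like the question's:

```php
// sample rows standing in for the database results (shapes taken from the question)
$items_list = array(
    array("id" => 1, "unique_id" => "kqiweyu21a"),
    array("id" => 2, "unique_id" => "zzxcvbn99"),
);
$user_list = array(
    array("id" => 1, "name" => "SomeName", "inventory" => "foo kqiweyu21a bar"),
);

// build a hash set of the unique_ids once: O(n)
$id_set = array();
foreach ($items_list as $item) {
    $id_set[$item["unique_id"]] = true;
}

// probe the set per inventory token instead of one strpos() per (item, user) pair
$found = array();
foreach ($user_list as $user) {
    foreach (preg_split('/\s+/', $user["inventory"]) as $token) {
        if (isset($id_set[$token])) {
            $found[] = $token;
            break; // one hit per user is enough here
        }
    }
}
```

That turns O(items × users) substring scans into a single pass over the inventory tokens.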
Your example is almost impossible to reproduce. You need to provide an example that can be replicated; the two loops as given, accessing only arrays, will complete extremely quickly (1-2 seconds). This means that either the strings you're searching are kilobytes or larger (not stated in the question) or something else is happening – a database access or something similar while the loops are running.
You can let SQL do the searching for you. Since you don't share the columns you need I'll only pull the ones I see.
SELECT i.unique_id, u.inventory
FROM items i, users u
WHERE LOCATE(i.unique_id, u.inventory)
I need help finding a workaround for hitting memory_limit. My limit is 128 MB; from the database I'm getting about 80k rows, and the script stops at 66k. Thanks for the help.
Code:
$possibilities = [];
foreach ($result as $item) {
$domainWord = str_replace("." . $item->tld, "", $item->address);
for ($i = 0; $i + 2 < strlen($domainWord); $i++) {
$tri = $domainWord[$i] . $domainWord[$i + 1] . $domainWord[$i + 2];
if (array_key_exists($tri, $possibilities)) {
$possibilities[$tri] += 1;
} else {
$possibilities[$tri] = 1;
}
}
}
Your bottleneck, given your algorithm, is most probably not the database query but the $possibilities array you're building.
If I read your code correctly, you get a list of domain names from the database. From each of the domain names you strip off the top-level-domain at the end first.
Then you walk character-by-character from left to right of the resulting string and collect triplets of the characters from that string, like this:
example.com => ['exa', 'xam', 'amp', 'mpl', 'ple']
You store those triplets in the keys of the array, which is a nice idea, and you also count them, which doesn't add much to the memory consumption. However, my guess is that the sheer number of possible triplets (for 26 letters and 10 digits that's 36^3 = 46,656 possibilities, each taking at least 3 bytes just for the key inside the array, plus however much bookkeeping PHP keeps around each element) eats quite a lot of your memory limit.
Probably someone can tell you how PHP uses memory with its database cursors (I don't know), but there's one trick you can use to profile your memory consumption.
Put calls to memory_get_usage():
before and after each iteration, so you'll know how much memory each cursor advancement costs,
before and after each addition to $possibilities.
And print them right away, so you can run your code and see in real time what is using your memory and how seriously.
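A minimal sketch of that profiling idea, using the trigram loop from the question with a two-row array standing in for the database cursor (memory_get_usage() is a standard PHP built-in):

```php
$possibilities = array();
$before = memory_get_usage();

// two sample rows standing in for the database cursor
foreach (array("example.com", "domain.net") as $address) {
    $domainWord = preg_replace('/\.[a-z]+$/', '', $address); // strip the TLD
    for ($i = 0; $i + 2 < strlen($domainWord); $i++) {
        $tri = substr($domainWord, $i, 3);
        $possibilities[$tri] = isset($possibilities[$tri]) ? $possibilities[$tri] + 1 : 1;
    }
    // print the running delta so you can watch where the memory goes
    echo $address . ": +" . (memory_get_usage() - $before) . " bytes\n";
}
```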
Also, try to unset the $item after each iteration. It may actually help.
Knowledge of specific database access library you are using to obtain $result iterator will help immensely.
Given the tiny (pretty useless) code snippet you've provided, I want to give you a MySQL answer, but I'm not certain you're using MySQL.
But
- Optimise your table.
Use EXPLAIN to optimise your query. Rewrite your query to put as much of the logic in the query rather than in the PHP code.
edit: if you're using MySQL then prepend EXPLAIN before your SELECT keyword and the result will show you an explanation of actually how the query you give MySQL turns into results.
Don't call PHP's strlen() on every loop iteration; instead, treat the string as a set of array values and test the offsets directly:
for ($i = 0; isset($domainWord[$i + 2]); $i++) {
In your MySQL (if that's what you're using), add a LIMIT clause that breaks the query into 3 or 4 chunks – say 25k rows per chunk – which will fit comfortably under the ~66k rows where you currently run out of memory. Burki had this good idea.
At the end of each chunk, clean out all the strings and restart, wrapped in a loop:
$z = 0;
while ($z < 4) {
    // grab a chunk of data from the database; preserve only your output
    $z++;
}
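A sketch of that chunked approach; fetch_chunk() here is a hypothetical stand-in for the real LIMIT/OFFSET query (the table and column names would be whatever your schema uses), wired to the question's trigram loop:

```php
// Hypothetical stand-in for the real chunked query, e.g.:
//   SELECT address, tld FROM domains LIMIT 25000 OFFSET :offset
function fetch_chunk($offset, $size) {
    $all = array(array("address" => "example.com", "tld" => "com")); // dummy data
    return array_slice($all, $offset, $size);
}

$possibilities = array();
$chunk_size = 25000;

for ($offset = 0; ; $offset += $chunk_size) {
    $rows = fetch_chunk($offset, $chunk_size);
    if (empty($rows)) {
        break; // no more rows to process
    }
    foreach ($rows as $item) {
        $domainWord = str_replace("." . $item["tld"], "", $item["address"]);
        for ($i = 0; $i + 2 < strlen($domainWord); $i++) {
            $tri = substr($domainWord, $i, 3);
            $possibilities[$tri] = isset($possibilities[$tri]) ? $possibilities[$tri] + 1 : 1;
        }
    }
    // only the trigram counts survive between chunks; the row strings are freed here
    unset($rows);
}
```

Only the (bounded) trigram counter stays resident, so peak memory no longer grows with the row count.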
But probably more important than any of these: provide enough detail in your question!
- What is the data you want to get?
- What are you storing your data in?
- What are the criteria for finding the data?
These answers will help people far more knowledgeable than me show you how to properly optimise your database.
Background:
I am creating a method to pull information from one table, check whether it exists in a second table, and insert the information if it doesn't. While comparing the data, the check must first see if the name matches, and then check whether the account number matches the corresponding name. The code is heavily commented so a step-by-step process can be followed. I am greatly appreciative of all the help the masters of Stack Exchange bring to the table, so after scratching my head for a while, I bring my issues to the table.
Apparently my code breaks when it goes to check the account number, even though a matching name exists. To the best of my skills, all the code should work perfectly.
As you will be able to see through the print_r of all_validation_names, as well as all_validation_accounts, the information of the first passing statement is contained in the array with a check.
My question is as follows: why can't a varchar variable, stored in an array with a corresponding key ($check_account_number), be passed into in_array()? The function always falls through to the else branch, where although the variable is "1", it isn't found using the corresponding key in $all_validation_accounts.
Please help!
Code For The Processing Of The Script
if (in_array($check_agency_name, $all_validation_names))
{
    //Set a marker for the key index on which row the name was found
    $array_key = array_search($check_agency_name, $all_validation_names);
    //Now check if the account corresponds to the right account name
    if (in_array($check_account_number, $all_validation_accounts["$array_key"]))
    //NOTE: I've even tried strict comparison using the third parameter `true`
    {
        //Bounce them if the account matches with a name so there's no duplicate entry (works)
    }
    //If the account doesn't correspond to the right account name, enter the information
    else
    //ALSO DOESN'T WORK: if (!(in_array($check_account_number, $all_validation_accounts["$array_key"])))
    {
        //The account doesn't exist for the name, so enter the information
        //CODE TO ENTER INFORMATION THEN BOUNCE (works)
        break;
    }
}
The Output:
Passing Name to match: a And Account:1
There are accounts existing, so heres all the validation arrays:
all_validation_names: Array (
[0] => a
[1] => a
[2] => b
)
all validation accounts: Array
(
[0] => 1
[1] => 2
[2] => 2
)
the system found check_agency_name in the array all_validation_names,
so lets check the account number with this key: 0
the account name existed, but the system didn't find the account
number match, so it must enter it as if the account never existed
heres the information that should be entered: used_selector: 0
agency_name:A address: city: state: zip: account_number: 1
now the system should enter the info, and bounce the client out
This is not an answer. Your code is full of redundancies and is extremely difficult to read. You need to do a lot more work on your own to narrow down the issue before you come to SO for assistance. It's unlikely anyone will have the patience to step through your code, especially since they also can't run it without having a copy of your invisible database.
However, I felt obligated to tell you that your __trim() function is poorly named and far too complicated for what it does. It is not a "trimmer", since unlike the real trim() it removes characters from the "inside" of strings as well as the ends. And your implementation is far too complex.
It is the same as this one-line function:
function __trim($str, $charlist=" \t\r\n\0\x0B") {
return str_replace(str_split($charlist), '', $str);
}
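A quick demonstration of the difference (repeating the one-liner so the snippet stands alone): unlike trim(), this strips matching characters everywhere in the string, not just at the ends.

```php
function __trim($str, $charlist = " \t\r\n\0\x0B") {
    return str_replace(str_split($charlist), '', $str);
}

echo __trim("  foo bar\tbaz  "), "\n"; // "foobarbaz" - inner whitespace removed too
echo trim("  foo bar\tbaz  "), "\n";   // "foo bar\tbaz" - only the ends are touched
```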
Once I saw that forty-line __trim function I lost the will to read any more code in the question.
Update
You've narrowed down your question, good. I'll attempt to infer the state of the program at this point. (This is something you should have done for us....)
Slimming your code to very minimum:
if (in_array($check_agency_name, $all_validation_names)) {
$array_key = array_search($check_agency_name, $all_validation_names);
if (in_array($check_account_number, $all_validation_accounts["$array_key"])) {
error_log('IF-BRANCH');
} else {
error_log('ELSE-BRANCH');
// break; // Removed because it makes no sense here.
}
}
Let's see what the values of these vars are:
$check_agency_name = 'a'
$all_validation_accounts = array(1, 2, 2);
$all_validation_names = array('a', 'a', 'b');
$array_key = "0";
$check_account_number = ???
You haven't told us what $check_account_number is. However we still know enough to see at least one thing that is wrong. The following lines are all the same (just replacing variable names with values):
in_array($check_account_number, $all_validation_accounts["$array_key"])
in_array($check_account_number, $all_validation_accounts["0"])
in_array($check_account_number, $all_validation_accounts[0])
in_array($check_account_number, 1)
Here we see that regardless of the value of $check_account_number, in_array() will return false because its second argument is not an array!
(By the way, don't quote key lookups: $all_validation_accounts[$array_key] is more straightforward--the quoting accomplishes nothing except to obscure your intent.)
This code would also produce an error Warning: in_array() expects parameter 2 to be array, integer given in /file/name.php on line ##. This points you to the problem right away, and you would have seen it if you had checked your error logs. Set your PHP error reporting very high (E_ALL | E_STRICT), watch your error logs, and make sure your program does not produce any error output while running normally. If you see anything, fix your code.
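For completeness, a minimal sketch of turning that reporting on during development (these are standard PHP functions, nothing project-specific):

```php
// report everything, including strict notices (E_STRICT is folded into E_ALL since PHP 5.4)
error_reporting(E_ALL | E_STRICT);
// show errors on screen while developing; in production, log them instead
ini_set('display_errors', '1');
ini_set('log_errors', '1');
```

With this in place, the in_array() warning above would have appeared the first time the page ran.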
Warning: this is not the answer to the question
Let's start with your trim on steroids.
Better to optimize it with built-in PHP functions, as follows:
function __trim($str, $charlist = '') {
    $result = '';
    /* list of forbidden chars to be trimmed */
    $forbidden_list = array(" ", "\t", "\r", "\n", "\0", "\x0B");
    if (empty($charlist)) {
        for ($i = 0; $i < strlen($str); $i++) {
            if (!in_array($str[$i], $forbidden_list)) $result .= $str[$i];
        }
    } else { // a custom charlist was given
        for ($i = 0; $i < strlen($str); $i++) {
            if (strpos($charlist, $str[$i]) === false) $result .= $str[$i];
        }
    }
    return $result;
}
Ritual Lecture
Stop using the mysql_* extension. It is deprecated. Using it means inviting trouble.
I have answered my own question by rethinking the logic behind the checks.
Thank you for all your responses.
Here is a link to the solution. I have posted it online to share as well as optimize, because all the great minds of SO deserve it!
https://stackoverflow.com/questions/15775444/entry-of-items-checking-sharing-optimization-from-post-form
I've tagged this post as WordPress, but I'm not entirely sure it's WordPress-specific, so I'm posting it on StackOverflow rather than WPSE. The solution doesn't have to be WordPress-specific, simply PHP.
The Scenario
I run a fishkeeping website with a number of tropical fish Species Profiles and Glossary entries.
Our website is oriented around our profiles. They are, as you may term it, the bread and butter of the website.
What I'm hoping to achieve is that, in every species profile which mentions another species or a glossary entry, I can replace those words with a link - such as you'll see here. Ideally, I would also like this to occur in news, articles and blog posts too.
We have nearly 1400 species profiles and 1700 glossary entries. Our species profiles are often lengthy and at last count our species profiles alone numbered more than 1.7 million words of information.
What I'm Currently Attempting
Currently, I have a filter.php with a function that - I believe - does what I need it to do. The code is quite lengthy, and can be found in full here.
In addition, in my WordPress theme's functions.php, I have the following:
# ==============================================================================================
# [Filter]
#
# Every hour, using WP_Cron, `my_updated_posts` is checked. If there are new Post IDs in there,
# it will run a filter on all of the post's content. The filter will search for Glossary terms
# and scientific species names. If found, it will replace those names with links including a
# pop-up.
include "filter.php";
# ==============================================================================================
# When saving a post (new or edited), check to make sure it isn't a revision then add its ID
# to `my_updated_posts`.
add_action( 'save_post', 'my_set_content_filter' );
function my_set_content_filter( $post_id ) {
	if ( !wp_is_post_revision( $post_id ) ) {
		$post_type = get_post_type( $post_id );
		if ( $post_type == "species" || ( $post_type == "post" && in_category( "articles", $post_id ) ) || ( $post_type == "post" && in_category( "blogs", $post_id ) ) ) {
			//get the previous value
			$ids = get_option( 'my_updated_posts' );
			//the option may not exist yet, so make sure we're appending to an array
			if ( !is_array( $ids ) ) $ids = array();
			//add new value if necessary
			if ( !in_array( $post_id, $ids ) ) {
				$ids[] = $post_id;
				update_option( 'my_updated_posts', $ids );
			}
		}
	}
}
# ==============================================================================================
# Add the filter to WP_Cron.
add_action( 'my_filter_posts_content', 'my_filter_content' );
if( !wp_next_scheduled( 'my_filter_posts_content' ) ) {
wp_schedule_event( time(), 'hourly', 'my_filter_posts_content' );
}
# ==============================================================================================
# Run the filter.
function my_filter_content() {
	//check to see if posts need to be parsed
	if ( !get_option( 'my_updated_posts' ) )
		return false;
	//parse posts
	$ids = get_option( 'my_updated_posts' );
	update_option( 'error_check', $ids );
	foreach ( $ids as $v ) {
		if ( get_post_status( $v ) == 'publish' )
			run_filter( $v );
		update_option( 'error_check', "filter has run at least once" );
	}
	//make sure no values have been added while loop was running
	$id_recheck = get_option( 'my_updated_posts' );
	my_close_out_filter( $ids, $id_recheck );
	//once all options, including any added during the running of what could be a long cronjob are done, remove the value and close out
	delete_option( 'my_updated_posts' );
	update_option( 'error_check', 'working m8' );
	return true;
}
# ==============================================================================================
# A "difference" function to make sure no new posts have been added to `my_updated_posts` whilst
# the potentially time-consuming filter was running.
function my_close_out_filter( $beginning_array, $end_array ) {
	//posts added while the filter was running are in $end_array but not in $beginning_array
	$diff = array_diff( $end_array, $beginning_array );
	if ( !empty( $diff ) ) {
		foreach ( $diff as $v ) {
			run_filter( $v );
		}
		//recurse only while new posts keep appearing; an unconditional call here would never terminate
		my_close_out_filter( $end_array, get_option( 'my_updated_posts' ) );
	}
}
The way this works, as (hopefully) described by the code's comments, is that each hour WordPress operates a cron job (which is like a false cron - works upon user hits, but that doesn't really matter as the timing isn't important) which runs the filter found above.
The rationale behind running it on an hourly basis was that if we tried to run it when each post was saved, it would be to the detriment of the author. Once we get guest authors involved, that is obviously not an acceptable way of going about it.
The Problem...
For months now I've been having problems getting this filter running reliably. I don't believe that the problem lies with the filter itself, but with one of the functions that enables the filter - i.e. the cron job, or the function that chooses which posts are filtered, or the function which prepares the wordlists etc. for the filter.
Unfortunately, diagnosing the problem is quite difficult (that I can see), thanks to it running in the background and only on an hourly basis. I've been trying to use WordPress' update_option function (which basically writes a simple database value) to error-check, but I haven't had much luck - and to be honest, I'm quite confused as to where the problem lies.
We ended up putting the website live without this filter working correctly. Sometimes it seems to work, sometimes it doesn't. As a result, we now have quite a few species profiles which aren't correctly filtered.
What I'd Like...
I'm basically seeking advice on the best way to go about running this filter.
Is a Cron Job the answer? I can set up a .php file which runs every day, that wouldn't be a problem. How would it determine which posts need to be filtered? What impact would it have on the server at the time it ran?
Alternatively, is a WordPress admin page the answer? If I knew how to do it, something along the lines of a page - utilising AJAX - which allowed me to select the posts to run the filter on would be perfect. There's a plugin called AJAX Regenerate Thumbnails which works like this, maybe that would be the most effective?
Considerations
The size of the database/information being affected/read/written
Which posts are filtered
The impact the filter has on the server, especially considering I don't seem to be able to increase the WordPress memory limit past 32 MB.
Is the actual filter itself efficient, effective and reliable?
This is quite a complex question and I've inevitably (as I was distracted roughly 18 times by colleagues in the process) left out some details. Please feel free to probe me for further information.
Thanks in advance,
Do it when the profile is created.
Try reversing the whole process. Rather than scanning the content for every term in your database, reduce the content to a list of words and check those words against your term tables:
Break the content post on entry into words (on space)
Eliminate duplicates, ones under the smallest size of a word in the database, ones over the largest size, and ones in a 'common words' list that you keep.
Check against each table, if some of your tables include phrases with spaces, do a %text% search, otherwise do a straight match (much faster) or even build a hash table if it really is that big a problem. (I would do this as a PHP array and cache the result somehow, no sense reinventing the wheel)
Create your links with the now dramatically smaller lists.
You should be able to easily keep this under 1 second even as you move out to even 100,000 words you are checking against. I've done exactly this, without caching the word lists, for a Bayesian Filter before.
With the smaller list, even if it is greedy and gathers words that don't quite match ("clown" will also catch "clown loach"), the resulting list should be only a few to a few dozen words with links, which will take no time at all to find-and-replace over a chunk of text.
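The first three steps above can be sketched as follows. The term list and stop-word list here are illustrative stand-ins; in the real site they'd come from the species and glossary tables, cached as a PHP array:

```php
// illustrative stand-ins: in the real site these would come from the
// species and glossary tables, cached as a PHP array
$term_list = array(
    "loach" => "/species/clown-loach",
    "fin"   => "/glossary/fin",
);
$stop_words = array("the", "and", "with", "for");

$content = "The clown loach is a peaceful fish with a damaged fin.";

// 1. break the content into words
$words = preg_split('/[^a-z\-]+/', strtolower($content), -1, PREG_SPLIT_NO_EMPTY);

// 2. drop duplicates, too-short/too-long words, and common words
$min = 3;  // length of the shortest term in the database
$max = 20; // length of the longest
$words = array_filter(array_unique($words), function ($w) use ($stop_words, $min, $max) {
    return strlen($w) >= $min && strlen($w) <= $max && !in_array($w, $stop_words);
});

// 3. straight match against the term list (hash lookup, not a %text% search)
$to_link = array_intersect_key($term_list, array_flip($words));
```

Step 4 is then a find-and-replace over the handful of entries left in $to_link.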
The above doesn't really address your concern about the older profiles. You don't say exactly how many there are, just that there is a lot of text spread over 1,400 to 3,100 items (both types put together). You could process this older content by popularity, if you have the info, or by date entered, newest first. Regardless, the best way is to write a script that suspends PHP's time limit and batch-runs a load/process/save over all the posts. If each one takes about 1 second (probably much less, but worst case) you're talking 3,100 seconds, which is a little less than an hour.