PHP Solr Client - Request Entity Too Large

I am using the SOLR PHP client to query data from a Solr service. My code looks similar to the snippet below, in that I'm passing in a query that searches on multiple IDs (in my case, a lot of them). The problem arises when I search for too many IDs at once: I get a 'Request Entity Too Large' error. Is there a way around this? From the examples I've seen, the syntax for searching on multiple values is 'id:1 OR id:2 OR id:3 etc.'. Is there a different syntax that would reduce the size of the request being passed to the service? For example, in SQL we can say 'id IN (1,2,3,...)'. Can we do something similar in the Solr query?
<?php
require_once( 'SolrPHPClient/Apache/Solr/Service.php' );

$solr   = new Apache_Solr_Service( 'localhost', '8983', '/solr' );
$offset = 0;
$limit  = 10;

$queries = array(
    'id:1 OR id:2 OR id:3 OR id:4 OR id:5 OR id:6 OR id:7 OR id:8' // in my case, this list keeps growing and growing
);

foreach ( $queries as $query ) {
    $response = $solr->search( $query, $offset, $limit );

    if ( $response->getHttpStatus() == 200 ) {
        print_r( $response->getRawResponse() );
    }
    else {
        echo $response->getHttpStatusMessage();
    }
}
?>

Solr supports searching through POST HTTP requests instead of GET requests, which allows a much larger query. I'm not sure off-hand how to enable this in the PHP client you're using.
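If your copy of Apache/Solr/Service.php defines a METHOD_POST constant and search() accepts a request-method argument (later SolrPhpClient releases do, but check your version first - this is an assumption, not a guarantee), the call would look roughly like this:
// Hedged sketch: relies on a SolrPhpClient version whose search() accepts a
// fifth $method argument and defines Apache_Solr_Service::METHOD_POST.
$query    = 'id:1 OR id:2 OR id:3'; // can now be far longer than a GET URL allows
$response = $solr->search( $query, $offset, $limit, array(),
                           Apache_Solr_Service::METHOD_POST );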
However, this is only a band-aid. The real problem is that you seem to be misusing Solr. You will likely run into other limits and performance issues, simply because Solr wasn't designed to do what you're trying to do. You wouldn't use PHP to write an operating system, right? It's the same with this.
I recommend creating a new question describing the real issue that led you to run this kind of query.

Solr supports range queries, e.g. id:[1 TO 10], or id:[* TO *] to match everything. Since it looks like many of your ORs cover sequential IDs, this should shrink the query considerably.
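Applied to the code in the question, collapsing runs of consecutive IDs into range clauses keeps the request small (the range below simply mirrors the IDs from the question):
// One range clause replaces eight OR'ed id clauses; gaps can be expressed as
// extra ranges or single terms, e.g. 'id:[1 TO 8] OR id:42'.
$query    = 'id:[1 TO 8]';
$response = $solr->search( $query, $offset, $limit );

if ( $response->getHttpStatus() == 200 ) {
    print_r( $response->getRawResponse() );
}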

Related

Searching mysql database with multiple PHP queries instead of OR operator

TLDR; I'm trying to get some 'fuzzy' results from a search query that fails to come up with things that are actually there.
The problem:
This specific setup is in WordPress land, but that's not decisive for the issue. There's a longstanding problem with WordPress search – it uses AND operators, not OR. This leads to the following problem:
Some people (not necessarily many, but often key customers) will search for john+doe+jane and find nothing. Had they searched for just john+doe they'd have found a ton of results. Or maybe they simply misspell a word or, worse, it was misspelled in the article, etc. I need this fixed somehow.
I tried all the plugins, but they either fare worse than the default search or they just won't work (likely because I also made other customizations to the search, but I can't go back on those). So eventually I tried to work my own way out.
I know next to nothing about databases (and not much about Wordpress hooks either), so I tried to move the OR issue to php.
My thinking: First of all, are there quite a few search results? If true, trust Wordpress search that it has done its job for 99+ percent of cases. But are there very few or none at all? Then there may be a problem. Step in, try and fix it!
So, for when there are fewer than 10 results, I made this fallback script:
The solution:
// [WordPress search loop ran before this]

// find out how many results we are below the target
$missing = 10 - count($posts);
// or `$missing = 10 - $wp_query->found_posts` - all the same – but I've already constructed this `$posts` array which I'll need later anyway.

// only do it if you actually need it (fewer than 10 search results from WP)
if ($missing > 0) {

    // give each search term its own life
    $terms = explode(" ", $string);
    // where `$string = get_query_var('s')` // or `$string = explode("+",$_SERVER['REQUEST_URI'])` in a more generic environment.
    // ...

    $matches = array();

    // start some nasty nested foreaches to search for each term
    foreach ($terms as $term) {

        // (drop mostly useless one-or-two-letter words before querying)
        if (strlen($term) < 3) continue;

        // first, get this huge array of posts // WordPress stuff
        $results = get_posts(array('s' => $term, 'numberposts' => 10));

        // then iterate through each of them
        foreach ($results as $result) {

            // reward each occurrence with a biscuit, they'll prove useful later*
            if (!isset($matches[$result->ID])) $matches[$result->ID] = 0;
            $matches[$result->ID] += 1;

            // don't let any term become a neverending story (stop inner foreach)
            if (count($matches) >= 10) break;
        }

        // stop the game altogether – by now we should have just enough (stop outer foreach)
        if (count($matches) > 100) break;
    }

    // sort the results nicely by biscuits awarded
    // *(so if one term came up 3/4 times it goes before one that came up 2/4)
    arsort($matches);

    // to make arrays compatible it's ok to forget the exact biscuit numbers (they don't really matter anymore)
    $matches = array_keys($matches);

    // if WordPress found you first we don't need you, you're already up there in the original loop
    $fallbacks = array_diff($matches, $posts);

    // [standard WordPress loop]
    $fallback = new WP_Query(array(
        'post__in'       => $fallbacks,
        'orderby'        => 'post__in',
        // add just enough to get to ten
        'posts_per_page' => $missing
    ));

    if ($fallback->have_posts()) {
        while ($fallback->have_posts()) {
            $fallback->the_post();
            // [magic goes here]
        }
    }
}
Does the job. However...
The issues I'm aware of:
Performance: It takes longer, but not too long. The page (openresty/nginx + redis + varnish) would load in .8s instead of the usual .3s cached or .4–.5s busted*.
Bad bots doing bad stuff: There's merciless sanitizing in WP and there's decent rate-limiting on the webserver, throwing nice 400's and 429's**.
The issues I'm not aware of:
*Not live yet, so I don't know how it would scale. Too many requests at once – what could happen? Can that sort of nested foreach kill a database?
**I'd hate if someone still manages to find a weak spot before I do. Any in sight?

Caching SQL Lookups & Retrieving Data

I'm trying to set up caching of postcode lookups, which adds the resulting lookup to a text file using the following:
file_put_contents($cache_file, $postcode."\t".$result."\n", FILE_APPEND);
I'd like to be able to check this file before running a query, which I have done using this:
if( strpos(file_get_contents($cache_file),$postcode) !== false) {
// Run function
}
What I'd like to do is search for the $postcode within the text file (as above) and return the data one tab over ($result).
Firstly, is this possible?
Secondly, is this is even a good way to cache SQL lookups?
1) Yes, it's possible - the easiest way would be to store the lookup data in an array and write/read it to/from a file with serialize / unserialize:
$lookup_codes = array(
    '10101' => 'data postcode 1 ...',
    '10102' => 'data postcode 2 ...',
    // ...
);
file_put_contents($cache_file, serialize($lookup_codes));

$lookup_codes = unserialize(file_get_contents($cache_file));
$postcode = '10101';
if (array_key_exists($postcode, $lookup_codes)) {
    // ... is available
}
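On a cache miss you would extend the array and rewrite the whole serialized file (instead of appending a tab-separated line as in the original code); $result here stands in for whatever your SQL lookup returns:
// add the freshly looked-up postcode and persist the updated cache
if (!array_key_exists($postcode, $lookup_codes)) {
    $lookup_codes[$postcode] = $result; // $result comes from your existing DB query
    file_put_contents($cache_file, serialize($lookup_codes));
}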
2) is the far more interesting question. It really depends on your data: the structure, the amount, etc.
In my opinion, caching adds more complexity to your application, so avoid it if possible :-)
You could try to:
Optimize your SQL query or database structure to speed up requests for postcode data.
Databases are normally quite fast - and made for exactly such use cases.
I'm not sure which DB you are running, but for MySQL look into SELECT optimization, or search for INDEX as a keyword - an index can speed up queries considerably.
file_get_contents is really fast, but if you change the file often, look into other ways of caching, like Memcached, to keep the data in memory (see the sketch below).
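A minimal sketch of the Memcached route, assuming the PHP memcached extension is installed and a memcached server is listening on localhost:11211 (both assumptions, and run_postcode_query() is a placeholder for your existing SQL lookup):
$mc = new Memcached();
$mc->addServer('localhost', 11211);

function lookup_postcode($postcode, Memcached $mc) {
    $result = $mc->get('postcode_' . $postcode);
    if ($result === false && $mc->getResultCode() !== Memcached::RES_SUCCESS) {
        // cache miss: run the real SQL lookup, then store the result for an hour
        $result = run_postcode_query($postcode);
        $mc->set('postcode_' . $postcode, $result, 3600);
    }
    return $result;
}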

php include array vs mysql query: good idea?

I have a 2D array with a few sub-arrays (about 30, each with 10 elements).
I need to get basic data from the array quite frequently, so I have a function that returns its contents (in full or in part) all around my scripts. The function looks like:
<?php
function get_my_data($index = false){
    $sub0 = array(
        'something' => 'something',
        'something else' => 'else',
        ...
    );
    $sub1 = array(
        'something' => 'something different',
        'something else' => 'else different',
        ...
    );
    ...
    $sub30 = array(
        'something' => 'something 30 times different',
        'something else' => 'else 30 times different',
        ...
    );
    $data = array($sub0, $sub1, $sub2, ..., $sub30);
    if ($index !== false)
        return $data[$index];
    else
        return $data;
}
?>
And then I call it, using include:
<?php
include 'my_data.php';
$id = $_GET['id'];
$mydata = get_my_data($id);
...
?>
I've done this because, when I started this project, I didn't imagine I would ever have more than 10 sub-arrays, nor that I would need the data to be dynamic. In fact, now I have to add a dynamic column (an index into the sub-arrays), and an array declaration in code is not a great fit for that. I immediately thought of using a database; transferring the data would not be difficult, but then I'd need to change get_my_data and put a query in it. Since it's called many times (pretty much every script on my website uses it), I'd end up with a lot of queries, and I think performance would be worse (MySQL is already used heavily on my site). The dynamic data would not change very often (the client does that).
The ideas I have to solve this problem are:
save all the data in the database and get it through MySQL queries,
leave it on the PHP side and use files to manage the dynamic data,
leave the static part on the PHP side, add a logical connector (such as an 'id' index in the sub-arrays) and an id column in the MySQL database, and get the dynamic data from MySQL.
I don't want to lose much performance; do you have any advice or suggestions?
Putting data like this in code is the worst possible plan. Not only do you create a whole bunch of junk and then throw out almost all of it, but if any of this changes it's a nightmare to maintain. Editing source code, checking it into version control, and deploying it is a lot of work to make a simple change to some data.
At the very least store this in a data format like JSON, YAML or XML so you can read it in on-demand and change a data-only file as necessary.
Ideally you put this in a database and query against it when necessary. Databases are designed to store, update, and preserve data like this.
You can also store JSON in the database; MySQL 5.7 even has a native column type for it, which makes this sort of thing even easier.
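A minimal sketch of the JSON route (the file name my_data.json and its layout are made up for illustration): the data lives in a flat file and get_my_data() keeps its existing signature, so callers don't change.
// my_data.json holds the same 30 sub-arrays as the old hard-coded version.
function get_my_data($index = false) {
    static $data = null; // decode the file at most once per request

    if ($data === null) {
        $data = json_decode(file_get_contents(__DIR__ . '/my_data.json'), true);
    }

    if ($index !== false) {
        return $data[$index];
    }
    return $data;
}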

using MatchAll in elasticsearch and elastica

I am having a hard time trying to use MatchAll in Elasticsearch using Elastica; currently I have the following query string:
$pictureQuery = new \Elastica\Query\QueryString();
$pictureQuery->setParam('query', $searchquery);
$pictureQuery->setParam('fields', array(
    'caption'
));
$items = $itemFinder->find($pictureQuery);
the issue with this query is that it only returns 10 results. I want to return all results, hence MatchAll. However, I'm having trouble figuring out how to get all matching results - how do I do so?
Elasticsearch returns the top 10 results by default (the most relevant ones).
That's the expected behavior.
Elasticsearch lets you change the page size (size) and the starting offset (from). Have a look at the From/Size API.
In Elastica, I guess it's here: http://elastica.io/api/classes/Elastica.Query.html#method_setSize
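For example, a sketch based on the query in the question - wrap the query string in an Elastica\Query so the paging can be set explicitly (whether your finder accepts an Elastica\Query object, and what size you actually need, depend on your setup):
$pictureQuery = new \Elastica\Query\QueryString();
$pictureQuery->setParam('query', $searchquery);
$pictureQuery->setParam('fields', array('caption'));

// Wrap the query string so paging options can be set on it.
$query = \Elastica\Query::create($pictureQuery);
$query->setFrom(0);    // offset of the first hit to return
$query->setSize(100);  // raise the page size from the default of 10

$items = $itemFinder->find($query);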

Recursive MySQL function call eats up too much memory and dies

I have the following recursive function which works... up to a point. Then the script asks for more memory once the number of queries exceeds about 100, and when I add more memory, the script typically just dies (I end up with a white screen in my browser).
public function returnPArray($parent = 0, $depth = 0, $orderBy = 'showOrder ASC'){
    $query = mysql_query("SELECT *, UNIX_TIMESTAMP(lastDate) AS whenTime
                          FROM these_pages
                          WHERE parent = '".$parent."' AND deleted = 'N' ORDER BY ".$orderBy."");
    $rows = mysql_num_rows($query);
    while($row = mysql_fetch_assoc($query)){
        // This uses my class and places the content in an array.
        MyClass::$_navArray[] = array(
            'id' => $row['id'],
            'parent' => $row['parent']
        );
        MyClass::returnPArray($row['id'], ($depth + 1));
    }
    $i++;
}
Can anyone help me make this query less resource intensive? Or find a way to free up memory between calls... somehow.
The white screen is likely because of a stack overflow. Do you have a row where the parent_id is its own id? Try adding AND id != '".(int)$parent."' to the WHERE clause to prevent that kind of bug from creeping in...
EDIT: To account for circular references, try modifying the assignment to something like:
while($row = mysql_fetch_assoc($query)){
    if (isset(MyClass::$_navArray[$row['id']])) continue;
    MyClass::$_navArray[$row['id']] = array(
        'id' => $row['id'],
        'parent' => $row['parent']
    );
    MyClass::returnPArray($row['id'], ($depth + 1));
}
Shouldn't you stop the recursion at some point (I guess you need to return from the method when the number of rows is 0)? From the code you posted I see endless recursive calls to returnPArray.
Let me ask you this... are you just trying to build out a tree of pages? If so, is there some point along the hierarchy that you can call an ultimate parent? I've found that when storing trees in a db, storing the ultimate parent id in addition to the immediate parent makes it much faster to read back, as you don't need any recursion or iteration against the db.
It is a bit of denormalization, but just a small bit, and it's better to denormalize than to recurse or iterate against the db.
If your needs are more complex, it may be better to retrieve more of the tree than you need and use application code to iterate through it and get just the nodes/rows you need. Most application code is far better at iteration/recursion than any DB.
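A sketch of the idea, assuming a hypothetical root_id column that stores the ultimate parent of every row (this column is not in the original schema - you would have to add it and keep it filled in):
// Fetch an entire page tree in one query by filtering on root_id
// instead of recursing on the immediate parent.
$rootId = 1;
$query  = mysql_query("SELECT *, UNIX_TIMESTAMP(lastDate) AS whenTime
                       FROM these_pages
                       WHERE root_id = '".(int)$rootId."' AND deleted = 'N'
                       ORDER BY parent ASC, showOrder ASC");
while ($row = mysql_fetch_assoc($query)) {
    MyClass::$_navArray[] = array(
        'id'     => $row['id'],
        'parent' => $row['parent']
    );
}
mysql_free_result($query);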
Most likely you're overloading on active query result sets. If, as you say, you're getting about 100 iterations deep into the recursion, that means you've got 100 queries/result sets open at once. Even if each query only returns one row, the whole result set is kept open until the second fetch call (the one that would return false). You never get back to any particular level to make that second call, so you just keep firing off new queries and opening new result sets.
If you're going for a simple breadcrumb trail, with a single result needed per tree level, then I'd suggest not doing a while() loop over the result set. Fetch the record for each particular level, then close the result set with mysql_free_result(), THEN do the recursive call.
Otherwise, try switching to a breadth-first query method, and again, free the result set after building each tree level.
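A minimal sketch of the breadcrumb variant described above (one row per level, result set freed before recursing); the names follow the question's code, but treat it as an outline rather than a drop-in replacement:
public function returnPArray($parent = 0, $depth = 0, $orderBy = 'showOrder ASC'){
    $query = mysql_query("SELECT *, UNIX_TIMESTAMP(lastDate) AS whenTime
                          FROM these_pages
                          WHERE parent = '".(int)$parent."' AND deleted = 'N'
                          ORDER BY ".$orderBy." LIMIT 1");
    $row = mysql_fetch_assoc($query);
    mysql_free_result($query); // close the result set BEFORE recursing

    if ($row === false) {
        return; // no child at this level: the recursion stops here
    }

    MyClass::$_navArray[] = array(
        'id'     => $row['id'],
        'parent' => $row['parent']
    );
    MyClass::returnPArray($row['id'], $depth + 1);
}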
Why are you using a recursive function? Looking at the code, it seems you're simply building a list containing the child and parent IDs of all records. If that's what you want, you don't even need recursion. A simple SELECT, not filtering on parent (but probably ordering on it), will do, and you only iterate over it once.
The following will probably return the same results as your current recursive function:
public function returnPArray($orderBy = 'showOrder ASC'){
    $query = mysql_query("SELECT *, UNIX_TIMESTAMP(lastDate) AS whenTime
                          FROM these_pages
                          WHERE deleted = 'N' ORDER BY parent ASC, ".$orderBy."");
    $rows = mysql_num_rows($query);
    while($row = mysql_fetch_assoc($query)){
        // This uses my class and places the content in an array.
        MyClass::$_navArray[] = array(
            'id' => $row['id'],
            'parent' => $row['parent']
        );
    }
}
I'd suggest getting all rows in one query and building up the tree structure using pure PHP:
$nodeList = array();
$tree     = array();

$query = mysql_query("SELECT *, UNIX_TIMESTAMP(lastDate) AS whenTime
                      FROM these_pages WHERE deleted = 'N' ORDER BY ".$orderBy."");
while ($row = mysql_fetch_assoc($query)) {
    $nodeList[$row['id']] = array_merge($row, array('children' => array()));
}
mysql_free_result($query);

foreach ($nodeList as $nodeId => &$node) {
    if (!$node['parent'] || !array_key_exists($node['parent'], $nodeList)) {
        $tree[] = &$node;
    } else {
        $nodeList[$node['parent']]['children'][] = &$node;
    }
}
unset($node);
unset($nodeList);
Adjust as needed.
There are a few problems.
You already noticed the memory problem. You can lift the memory limit with ini_set('memory_limit', -1).
The reason you get a white screen is probably that the script exceeds the max execution time and you either have display_errors turned off or error_reporting set to suppress everything. You can lift the execution time limit with set_time_limit(0).
Even with "unlimited" memory and "unlimited" time, you are still obviously constrained by the limits of your server and your own precious time. The algorithm and data model you have chosen will not scale well, and if this is meant for a production website, then you have already blown your time and memory budget.
The solution to #3 is to use a better data model that supports a more efficient algorithm.
Your function is named poorly, but I'm guessing it means to "return an array of all parents of a particular page".
If that's what you want to do, then check out Modified Pre-order Tree Traversal as a strategy for more efficient querying. This behavior is already built into some frameworks, such as Doctrine ORM, which makes it particularly easy to use.
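For illustration, a minimal nested-set sketch: the lft/rgt columns below are what Modified Pre-order Tree Traversal adds to each row (they are not in the original schema), and with them a whole subtree comes back from a single, non-recursive query.
// Every descendant of a node has lft/rgt values strictly between
// the parent's lft and rgt, so one BETWEEN clause fetches the subtree.
$parentId = 1;
$parent   = mysql_fetch_assoc(mysql_query(
    "SELECT lft, rgt FROM these_pages WHERE id = '".(int)$parentId."'"
));

$query = mysql_query("SELECT id, parent
                      FROM these_pages
                      WHERE lft BETWEEN ".(int)$parent['lft']." AND ".(int)$parent['rgt']."
                        AND deleted = 'N'
                      ORDER BY lft");
while ($row = mysql_fetch_assoc($query)) {
    MyClass::$_navArray[] = array('id' => $row['id'], 'parent' => $row['parent']);
}
mysql_free_result($query);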
