Zend Lucene exhausts memory when indexing - php

An oldish site I'm maintaining uses Zend Lucene (ZF 1.7.2) as its search engine. I recently added two new tables to be indexed, together containing about 2000 rows of text data ranging between 31 bytes and 63 kB.
The indexing worked fine a few times, but after the third run or so it started terminating with a fatal error due to exhausting its allocated memory. The PHP memory limit was originally set to 16M, which was enough to index all the other content: 200 rows of text at a few kilobytes each. I gradually increased the memory limit to 160M, but it still isn't enough and I can't increase it any higher.
When indexing, I first need to clear the previously indexed results, because the path scheme contains numbers which Lucene seems to treat as stopwords, returning every entry when I run this search:
$this->index->find('url:/tablename/12345');
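Clearing means deleting every previously indexed document, roughly like this (using the standard Zend_Search_Lucene maxDoc()/isDeleted()/delete() API):

for ($id = 0; $id < $this->index->maxDoc(); $id++) {
    if (!$this->index->isDeleted($id)) {
        // delete() only marks the document as deleted; the space is reclaimed on optimize()
        $this->index->delete($id);
    }
}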
After clearing all of the results I reinsert them one by one:
foreach ($urls as $v) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $v['data']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $v['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('description', $v['description']));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $v['path']));
    $this->index->addDocument($doc);
}
After about a thousand iterations the indexer runs out of memory and crashes. Strangely, doubling the memory limit only buys a few dozen more rows.
I've already tried adjusting the MergeFactor and MaxMergeDocs parameters (to 5 and 100 respectively) and calling $this->index->optimize() every 100 rows, but neither provides consistent help.
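In code, that tuning looks roughly like this (using the standard Zend_Search_Lucene setters; $i is just an illustrative counter):

$this->index->setMergeFactor(5);
$this->index->setMaxMergeDocs(100);

$i = 0;
foreach ($urls as $v) {
    // ... build and add the document as above ...
    if (++$i % 100 === 0) {
        $this->index->optimize();
    }
}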
Clearing the whole search index and rebuilding it seems to result in a successful indexing most of the time, but I'd prefer a more elegant and less CPU intensive solution. Is there something I'm doing wrong? Is it normal for the indexing to hog so much memory?

I had a similar problem with a site I had to maintain that had at least three different languages and had to re-index the same 10,000+ (and growing) localized documents for each locale separately (each using its own localized search engine). Suffice it to say that it usually failed during the second pass.
We ended up implementing an Ajax-based re-indexing process that called the script a first time to initialize and start re-indexing. That script stopped after a predefined number of processed documents and returned a JSON value indicating whether it was complete, along with other progress information. We then called the same script again with the progress variables until it returned a completed state.
This also allowed us to show a progress bar for the process in the admin area.
For the cron job, we simply made a bash script doing the same task but with exit codes.
This was about 3 years ago and nothing has failed since then.
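A minimal sketch of the idea (the paths and the fetch_rows_to_index() helper are hypothetical, not our actual code): each request processes one fixed-size batch and reports its offset back to the caller, which keeps re-calling until done.

// reindex.php -- hypothetical chunked re-indexing endpoint
$offset    = isset($_GET['offset']) ? (int)$_GET['offset'] : 0;
$batchSize = 200; // tune so one batch stays well under the memory limit

// open the existing index (use Zend_Search_Lucene::create() on the very first batch to rebuild from scratch)
$index = Zend_Search_Lucene::open('/path/to/index');
$rows  = fetch_rows_to_index($offset, $batchSize); // hypothetical data-access helper

foreach ($rows as $v) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $v['data']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $v['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $v['path']));
    $index->addDocument($doc);
}
$index->commit();

echo json_encode(array(
    'done'   => count($rows) < $batchSize, // no more rows to process
    'offset' => $offset + count($rows),
));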

Related

LAMP / Laravel - Report generation maxing out single CPU

So I have developed a report generation system in Laravel. We are using PHP 7 (opcache enabled) / Apache / MySQL on a CentOS 7 box. With one report, grabbing all the information takes about 15 seconds, but then I have to loop through and do a bunch of filtering on Collections, etc. I have optimized this from top to bottom for about a week and have got the entire report generation down to about 45 seconds (dealing with multiple tables with more than 1 million entries). This maxes out my CPU until it's done, of course.
My issue is that when we pushed it live to the client, their CPU is not up to the task. They have 4 CPUs @ 8 cores each @ 2.2 GHz. However, since PHP is a single process it only runs on one CPU and maxes it out, and since it's so slow it takes closer to 10 minutes to run the report.
Is there any way to get Apache / PHP / Linux... whatever... to use all 4 CPUs for a single PHP process? The only other option is to tell the client they need a better server... not an option. Please help.
So I stopped trying to find a way to have the server handle my code better and found a few ways to optimize my code.
First off, I used the collection's groupBy() method to group my collection so that I had a bunch of sub-arrays with the id as the key. When I looped through, I just grabbed the relevant sub-array instead of using the collection's filter() method, which is REALLY slow when dealing with this many items. That saved me a LOT of processing power.
Secondly, every time I used a sub-array I removed it from the main array, so the array became smaller and smaller with each pass through the foreach.
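Roughly, the pattern was something like this (a sketch with made-up names, not the actual report code):

// Group once, instead of calling filter() inside the loop (a full scan per lookup)
$grouped = $lineItems->groupBy('order_id')->toArray(); // $lineItems: hypothetical Collection

foreach ($orders as $order) {
    $items = $grouped[$order->id] ?? array();

    // ... build the report row from $items ...

    // Drop the sub-array we just consumed so the working set keeps shrinking
    unset($grouped[$order->id]);
}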
These optimizations ended up saving me a LOT of processing power and now my reports run fine. After days of searching for a way to get PHP to handle parallel processing, I have come to the conclusion that it's simply not possible.
Hope this helps.

PHP While Loop Slowdown over time

I have a seemingly harmless while loop that goes through the result set of a MySQL query and compares the id returned from MySQL to one in a very large multidimensional array:
// mysqli query here, producing a $result set
while ($row = $result->fetch_assoc())
{
    if (!in_array($row['id'], $multiDArray['dimensionOne']))
    {
        // do something
    }
}
When the script first executes, it runs through the results at about 2-5k rows per second; sometimes more, rarely less. The result set brings back 7 million rows, and the script peaks at 2.8 GB of memory.
In terms of big data, this is not a lot.
The problem is that around the 600k mark the loop starts to slow down, and by 800k it is processing only a few records a second.
In terms of server load and memory use, there are no issues.
This is behaviour I have noticed before in other scripts dealing with large data sets.
Is array seek time progressively slower as the internal pointer moves deeper?
That really depends on what happens inside the loop. I know you are convinced it's not a memory issue, but it looks like one. Programs usually get very slow when the system tries to get extra RAM by swapping. Using the hard drive is obviously very slow, and that's what you might be experiencing. It's very easy to benchmark.
In one terminal run
vmstat 3 100
Run your script and observe vmstat. Look at the IO and swap columns. If that is really not the case, then profile the execution with Xdebug. It might be tricky because you do many iterations, and profiling will also cause major IO.

How to Prevent PHP Allowed Memory Size Fatal Error from query result?

I have a (Cake)PHP function designed to update entries in a MySQL table: first it runs a query to get all the new items since a $lastImportDate, then it loops through, performing various actions on each item, saving the item, and saving related table information.
Unfortunately this seems to be collapsing under the excessive weight of one particular request. Here there are 5917 (and probably counting!) entries to update. The function loops through for a while, but eventually dies with an "Allowed memory size of [lots of bytes] exhausted" error.
Without forcing anyone to wade through extensive code - what strategies can I reasonably adopt to try to stop this memory leak from rendering this table un-updateable?
Sounds like you are saving all the records in a variable. Try getting only the IDs, then looping through them. You get each record with $this->Model->findById($id) and do your actions on that particular record. Then you save that record and unset the data; this way you'll never have much data stored in variables. It doesn't surprise me that you run out of memory if you put everything in one variable.
Increasing the memory limit is just a bad temporary solution; imagine having to update 100,000 or more records. Eventually you'll run out.
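A rough sketch of that approach in CakePHP terms (the model and field names are just placeholders):

// Fetch only the IDs of the new items -- a small, flat array
$ids = $this->Item->find('list', array(
    'fields'     => array('Item.id'),
    'conditions' => array('Item.modified >' => $lastImportDate),
));

foreach ($ids as $id) {
    $item = $this->Item->findById($id);   // load one full record at a time

    // ... perform the various actions on $item ...

    $this->Item->save($item);
    unset($item);                          // free the record before the next iteration
}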
You can always try setting:
ini_set('memory_limit', '-1');
[docs]
Or a more appropriate value
Alternatively, farm more out to SQL (using CASE updates, or nested select updates), or set up a cron job to launch the script in smaller intervals until the update is complete...
I don't know exactly the nature of the task you need to accomplish. If you need to update entries at a regular interval: make a Cake Shell and put it on a cron schedule (run it every hour if you have to).
If you need to perform an update every time you import into the DB: try loop { get and process 100 records; repeat until done }, as sketched below. Try to maximize the number of records you can process before hitting the memory limit.
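For example, something along these lines (a sketch; the batch size and model name are arbitrary):

$batchSize = 100;
$page = 1;

do {
    $items = $this->Item->find('all', array(
        'conditions' => array('Item.modified >' => $lastImportDate),
        'limit'      => $batchSize,
        'page'       => $page++,
    ));

    foreach ($items as $item) {
        // ... process and save $item as usual ...
    }

    $fetched = count($items);
    unset($items); // release the batch before fetching the next one
} while ($fetched === $batchSize);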

Suggestion for using php array?

Let's say I have 100k records in a table; after fetching those records from the table I push them to an array with some calculations, and then send them to the server for further processing.
I have tested the scenario with 1k records and it works perfectly, but I'm worried there may be a performance issue, because the page that does the calculations and fetches the records from the DB runs every 2 minutes.
My question is: can I use an array for more than 2 million records?
There's no limit on how much data an array can hold; the limit is server memory / the PHP memory limit.
Why would you push 100k records into an array? Databases have sorting and limiting for exactly that reason!
My question is: can I use an array for more than 2 million records?
Yes you can; 2 million array entries is not a limit in PHP. The array size is only limited by the memory available to PHP.
ini_set('memory_limit', '320M');
$moreThan2Million = 2000001;
$array = range(0, $moreThan2Million);
echo count($array); # 2000002 -- range() includes both endpoints, so the count is $moreThan2Million + 1
You wrote:
The page is scheduled and runs every 2 minutes, so I am worried about the performance issue.
And:
But I need to fetch all of them, not 100 at a time, and send them to the server for further processing.
Performance of array operations depends on processing power. With a fast enough computer, you should not run into any problems. However, keep in mind that PHP is an interpreted language and therefore considerably slower than compiled binaries.
If you need to run the same script every 2 minutes but the runtime of the script is longer than two minutes, you can distribute script execution over multiple computers, so one process is not eating the CPU and memory resources of another and can finish its work while another process runs on an additional box.
Edit
Good answer, but can you write your consideration about how much time the script will need to complete, if there is no issue with the server processor and RAM?
That depends on the size of the array, the amount of processing each entry needs (relative to the overall size of the array), and naturally the processor power and the amount of RAM. All of these are unspecified in your question, so I can only say that I consider it unspecified. You'll need to test this on your own and build metrics for your application by profiling it.
I have 10GB RAM and More than 8 Squad processor.
For example you could do a rough metric for 1, 10, 100, 1000, 10000, 100000 and 1 million entries to see how your (unspecified) script scales on that computer.
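A trivial timing harness for such a metric might look like this (a sketch; process_record() stands in for whatever your script does per entry):

foreach (array(1, 10, 100, 1000, 10000, 100000, 1000000) as $n) {
    $data = range(1, $n);           // stand-in for $n fetched records

    $start = microtime(true);
    foreach ($data as $record) {
        process_record($record);    // hypothetical per-entry work
    }
    $elapsed = microtime(true) - $start;

    printf("%8d entries: %.3f s, peak memory %.1f MB\n",
        $n, $elapsed, memory_get_peak_usage(true) / 1048576);
}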
I am sending this array to another page for further processing.
Also measure the amount of data you send between computers and how much bandwidth you have available for inter-process communication over the wire.
Let's say I have 100k records in a table; after fetching those records from the table I push them to an array with some filters.
Filters? Can't you just write a query that implements those filters instead? A database (depending on vendor) isn't just a data store; it can do calculations, and most of the time it's much quicker than transferring the data to PHP and doing the calculations there. If you have a database like, say, PostgreSQL, you can do pretty much everything you've ever wanted with PL/pgSQL.
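For instance, instead of filtering and aggregating in PHP, push the work into the query (a sketch assuming a PDO connection in $pdo and made-up table/column names):

$stmt = $pdo->prepare(
    'SELECT customer_id, SUM(amount) AS total
       FROM orders
      WHERE status = :status AND created_at > :since
   GROUP BY customer_id
   ORDER BY total DESC
      LIMIT 100'
);
$stmt->execute(array(':status' => 'paid', ':since' => $since));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC); // a few hundred rows instead of 100k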

crawling scraping and threading? with php

I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels down into the page, so: the main page www.url.com and the links on that page, www.url.com/post1, www.url.com/post2.
My problem is that as I build up a larger collection of blogs, each is only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just allow the scripts to process more than one or a limited number of links because of script execution times, memory limits, timeouts, etc.
Also, I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way to speed this process up?
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape and send them out to curl_multi, wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
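A bare-bones curl_multi sketch (the URL list and the parse_page() helper are placeholders):

$urls = array('http://blog-one.example/', 'http://blog-two.example/'); // placeholder list

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all requests in parallel
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // avoid busy-waiting
} while ($running > 0);

// Serially process the results
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // parse_page($url, $html);  // hypothetical: extract the MP3/post links, insert into the DB
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);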
A sketch for running parallel scanners (assuming a scan_targets table with id, url, being_scanned and scanned_at columns):
function start_a_scan(mysqli $db) {
    // Start a MySQL transaction (needs InnoDB afaik)
    $db->query("BEGIN");

    // Get the first entry that has timed out and is not being scanned by someone
    // (and acquire an exclusive lock on the affected row)
    $result = $db->query(
        "SELECT * FROM scan_targets
          WHERE being_scanned = FALSE
            AND scanned_at < NOW() - INTERVAL 60 SECOND
          ORDER BY scanned_at ASC
          LIMIT 1 FOR UPDATE"
    );
    $row = $result->fetch_assoc();
    if (!$row) {
        $db->query("COMMIT"); // nothing to scan right now
        return;
    }

    // Let everyone know we're scanning this one, so they'll keep out
    $db->query("UPDATE scan_targets SET being_scanned = TRUE WHERE id = " . (int)$row['id']);

    // Commit the transaction (this releases the row lock)
    $db->query("COMMIT");

    // Scan
    scan_target($row['url']);

    // Update the entry's state to allow it to be scanned again in the future
    $db->query("UPDATE scan_targets SET being_scanned = FALSE, scanned_at = NOW() WHERE id = " . (int)$row['id']);
}
You'd probably also need a 'cleaner' that periodically checks whether any aborted scans are hanging around and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc... I use it myself.
Due to how PHP works, it seems I cannot just allow the scripts to process more than one or a limited number of links because of script execution times, memory limits, timeouts, etc.
The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
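For reference, the CLI SAPI already defaults to no execution time limit, and you can disable it explicitly:

// max_execution_time defaults to 0 (unlimited) for CLI,
// but this makes it explicit in case the setting has been overridden:
set_time_limit(0);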
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't override each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl.
CLI scripts are not limited by max execution time. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)
