Indexing large DBs with Lucene/PHP

Afternoon chaps,
I'm trying to index a 1.7 million row table with the Zend port of Lucene (Zend_Search_Lucene). On small tests of a few thousand rows it works perfectly, but as soon as I push it up to a few tens of thousands of rows, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how long 1.7 million would take.
I've also tried making the script index a few thousand rows, refresh, and then run the next few thousand, but doing this clears the index each time.
Any ideas guys?
Thanks :)

I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.
Use Apache Solr to create indexes instead. In my tests, Solr creates indexes more than 300x faster than Zend_Search_Lucene.
You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.
Of course you could also use the PHP PECL Solr extension, which I would recommend.
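For reference, here is a minimal sketch of what indexing through the PECL Solr extension can look like. The hostname, port, core path and field names are placeholders of mine, not anything from the question:
// Hypothetical example of adding a document via the PECL Solr extension.
$client = new SolrClient(array(
    'hostname' => 'localhost',        // placeholder
    'port'     => 8983,               // default Solr port
    'path'     => 'solr/businesses',  // placeholder core path
));

$doc = new SolrInputDocument();
$doc->addField('id', 123);
$doc->addField('company', 'Example Ltd');
$doc->addField('psearch', 'example searchable text');

$client->addDocument($doc);
$client->commit(); // make the new document visible to queries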

Try speeding it up by selecting only the fields you require from that table.
If this is something to run as a cronjob or a worker, then it should be running from the CLI, and in that case I don't see why raising (or removing) the timeout would be a bad thing. You only have to build the index once; after that, new records or changes to existing ones are only small incremental updates to your Lucene index. A rough sketch of that approach follows.
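As an illustration of the CLI route (my sketch, reusing the table and field names from the question; the index path and chunk size are made up), note that the index should be opened with Zend_Search_Lucene::open() on later runs. Calling create() every time is what wipes it between batches:
// CLI batch indexer sketch; $db, the index path and the chunk size are assumptions.
set_time_limit(0); // no execution limit when run from the command line

$indexPath = '/path/to/index';
$index = is_dir($indexPath)
    ? Zend_Search_Lucene::open($indexPath)    // keep previously indexed documents
    : Zend_Search_Lucene::create($indexPath); // first run only

$chunk  = 5000;
$offset = 0;
do {
    $result = $db->query("SELECT id, company, psearch FROM businesses LIMIT $chunk OFFSET $offset");
    $rows = 0;
    while ($record = $result->fetch_assoc()) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $record['id']));
        $doc->addField(Zend_Search_Lucene_Field::Text('company', $record['company']));
        $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $record['psearch']));
        $index->addDocument($doc);
        $rows++;
    }
    $offset += $chunk;
} while ($rows === $chunk);

$index->commit();
$index->optimize(); // optional: merge index segments once at the end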

Some info for you all - posting as an answer so I can use the code styles.
$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL
$feeds = array();
$x = 0;
while ( $record = $result->fetch_assoc() ) {
$feeds[$x]['id'] = $record['id'];
$feeds[$x]['company'] = $record['company'];
$feeds[$x]['psearch'] = $record['psearch'];
$x++;
}
//grab each feed
foreach($feeds as $feed) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id',
$feed["id"]));
$doc->addField(Zend_Search_Lucene_Field::Text('company',
$feed["company"]));
$doc->addField(Zend_Search_Lucene_Field::Text('psearch',
$feed["psearch"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('link',
'http://www.google.com'));
//echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";
$index->addDocument($doc);
}
$index->commit();
(I've used google.com as a temp link)
The server it's running on is a local install of Ubuntu 8.10 with 3 GB RAM and a dual 3.2 GHz Pentium.

Related

PHP Script Internal Server Error when lots of data

Summary
This is a script (CakePHP 2.10.18, LAMP dedicated server with PHP 5.3) that loads information from 2 MySQL tables and then processes the data to output it to Excel.
Table 1 has users, and Table 2 has info about those users (one record per user). The goal of the script is to grab each user's record from Table 1, grab its related info from Table 2, and put it in an Excel row (using the PHPExcel_IOFactory library for this).
The information extracted from those tables is around 8000 records from each; the tables themselves have 100K and 300K total records respectively. All the fields in those tables are ints and small varchars, with the exception of one field in the second table (datos_progreso, seen in the code below), which is a text field containing serialized data, but nothing big.
The issue is that if I run the script for the full 16000 records I get an Internal Server Error (without any real explanation in the logs); if I run it for 1000 records it all works fine, so this seems to point to a resources issue.
I've tried (among other things that I will explain at the end) increasing memory_limit from 128M to 8GB (yes, you read that right), max_execution_time from 90 to 300 seconds, and max_input_vars from 1000 to 10000, and none of that solves the problem.
My thoughts are that the amount of data isn't huge enough to exhaust those resources, but I've tried optimizing the script in several ways and can't get it to work. The only time it works is when I run it on a small portion of the records, as mentioned above.
I would like to know if there's something script-wise or PHP-configuration-wise I can do to fix this. I can't change the database tables with the information, by the way.
Code
These are just the relevant bits of code that I think matter; the script is longer:
$this->Usuario->bindModel(
    array('hasMany' => array(
        'UsuarioProgreso' => array(
            'className'  => 'UsuarioProgreso',
            'foreignKey' => 'id_usuario',
            'conditions' => array('UsuarioProgreso.id_campania' => $id_campania)
        )
    ))
);

$usuarios = $this->Usuario->find('all', array(
    'conditions' => array('Usuario.id_campania' => $id_campania, 'Usuario.fecha_registro >' => '2020-05-28'),
    'fields'     => array('Usuario.id_usuario', 'Usuario.login', 'Usuario.nombre', 'Usuario.apellido', 'Usuario.provincia', 'Usuario.telefono', 'Usuario.codigo_promocion'),
    'order'      => array('Usuario.login ASC')
));

$usuario = null;
$progreso_usuario = null;
$datos_progreso = null;

$i = 2;
foreach ($usuarios as $usuario) {
    if (isset($usuario['UsuarioProgreso']['datos_progreso'])) {
        $datos_progreso = unserialize($usuario['UsuarioProgreso']['datos_progreso']);

        $unit = 1;
        $column = 'G';
        while ($unit <= 60) {
            if (isset($datos_progreso[$unit]['punt'])) {
                $puntuacion = $datos_progreso[$unit]['punt'];
            } else {
                $puntuacion = ' ';
            }
            $objSheet->getCell($column.$i)->setValue($puntuacion);
            $column++;
            $unit++;
        }

        $nivel = 1;
        $unidad_nivel = array(1 => 64, 2 => 68, 3 => 72, 4 => 76, 5 => 80, 6 => 84);
        while ($nivel <= 6) {
            $unidad = $unidad_nivel[$nivel];
            if (isset($datos_progreso[$unidad]['punt'])) {
                $puntuacion = $datos_progreso[$unidad]['punt'];
            } else {
                $puntuacion = ' ';
            }
            $objSheet->getCell($column.$i)->setValue($puntuacion);
            $column++;
            $nivel++;
        }
    }
    // Free the variables
    $usuario = null;
    $progreso_usuario = null;
    $datos_progreso = null;
    $i++;
}
What I have tried
I have tried not using bindModel and instead loading the information from both tables separately: loading all the user info first, looping through it, and on each iteration grabbing that specific user's info from Table 2.
I have also tried something similar to the above, but instead of loading all the user info from Table 1 at once, loading just their IDs first and then looping through those IDs to grab the info from Table 1 and Table 2. I figured this way I would use less memory.
I have also tried not using CakePHP's find() and instead using fetchAll() with "manual" queries, since after some research it seemed like it would be more memory-efficient (it didn't seem to make a difference). A sketch of this kind of batched retrieval is shown below.
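For illustration only: a minimal sketch of fetching and processing the users in fixed-size batches with CakePHP 2.x's 'limit' and 'page' find options. The batch size of 500 and the trimmed field list are assumptions, not the actual code from this question:
// Hypothetical batched version of the export loop (CakePHP 2.x).
$page = 1;
$i = 2;
do {
    $usuarios = $this->Usuario->find('all', array(
        'conditions' => array('Usuario.id_campania' => $id_campania),
        'fields'     => array('Usuario.id_usuario', 'Usuario.login'), // trimmed placeholder list
        'order'      => array('Usuario.login ASC'),
        'limit'      => 500,   // batch size is a guess; tune it
        'page'       => $page,
    ));
    foreach ($usuarios as $usuario) {
        // ... build the spreadsheet row exactly as in the original loop ...
        $i++;
    }
    $page++;
} while (!empty($usuarios));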
If there's any other info I can provide that can help understand better what's going on please let me know :)
EDIT:
Following the suggestions in the comments I've implemented this in a shell script and it works fine (takes a while but it completes without issue).
With that said, I would still like to make this work from a web interface. In order to figure out what's going on, and since the error_logs aren't really showing anything relevant, I've decided to do some performance testing myself.
After that testing, these are my findings:
It's not a memory issue since the script is using at most around 300 MB and I've given it a memory_limit of 8GB
The memory usage is very similar whether it's via web call or shell script
It's not a timeout issue since I've given the script 20 minutes limit and it crashes way before that
What other setting could be limiting this/running out that doesn't fail when it's a shell script?
The way I solved this was with a shell script, following the advice from the comments. I've understood that my originally intended approach was not the correct one, and while I have not been able to figure out exactly what was causing the error, it's clear that running this as a web script was the root of the problem. (A rough outline of the shell setup is shown below.)
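For anyone taking the same route, here is a minimal sketch of a CakePHP 2.x console shell that could host this kind of export. The class name and the models listed are placeholders of mine, not the actual solution code:
// app/Console/Command/ExportShell.php (hypothetical wrapper for the export)
App::uses('AppShell', 'Console/Command');

class ExportShell extends AppShell {
    public $uses = array('Usuario', 'UsuarioProgreso');

    public function main() {
        // no web-server limits apply here; PHP limits can still be raised if needed
        set_time_limit(0);

        // ... run the same export logic as the controller action,
        //     writing the PHPExcel output to disk instead of the HTTP response ...
        $this->out('Export finished.');
    }
}
// Run with: Console/cake export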

Postgres using too much CPU

I've been trying to debug an issue with Postgres where it uses far too much CPU on the server. I figured it might have to do with unoptimized queries and the like; however, I was unable to find any solution there. I also tried different Postgres settings, tweaking the config. I have finally settled on this configuration:
max_connections = 1000
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 4194kB
maintenance_work_mem = 1GB
checkpoint_segments = 32
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
checkpoint_timeout = 15min
random_page_cost = 0.5
seq_page_cost = 0.2
The server can easily provide these resources. I still couldn't get the overall CPU usage to fall (it hits 40%+ with a single user, 80%+ with 2, and after 3 it begins to crawl).
Finally, I wrote the following function as a test:
public function testLoad() {
    define("DBCONFIG", "host=hostname port=5432 dbname=db user=user password=pwd");
    pg_connect(DBCONFIG) or die('Failed');
    pg_query("select 1");
    echo 'hi';
    die;
}
When I hit this function, it produces exactly the same result, i.e. 40% CPU usage while the call is active. Clearly the issue is not with the queries being fired, but with the connection PHP makes to the database. Every user creates a new HTTP request, and every new request creates a new connection to the database, which then becomes the problem.
I plan on having a userbase with around 100 parallel connections at any time, so obviously the current setup will not work for me. Any advice on where I'm going wrong? Some configuration I may be missing?
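To illustrate the connection-per-request cost being described (this is my sketch, not an answer from the thread): PHP's persistent connections let a worker process reuse an already-open backend connection instead of forking a new one on every request, and for many parallel web workers an external pooler such as PgBouncer is the usual next step. The connection string below is a placeholder:
// Sketch: reuse a connection within the same PHP worker via pg_pconnect()
// instead of paying the fork/authentication cost of pg_connect() per request.
$conn = pg_pconnect("host=hostname port=5432 dbname=db user=user password=pwd")
    or die('Failed to connect');

$result = pg_query($conn, "SELECT 1");
echo pg_fetch_result($result, 0, 0); // prints 1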

PHP DB caching, without including files

I've been searching for a suitable PHP caching method for MSSQL results.
Most of the examples I can find suggest storing the results in an array, which would then get included in the page. This seems great unless a request for the content is made at the same time as it is being updated/rebuilt.
I was hoping to find something similar to ASP's application-level variables, but as far as I'm aware, PHP doesn't offer this functionality?
The problem I'm facing is that I need to perform 6 queries per page to populate dropdown boxes, and this happens on the vast majority of pages. It's also not an option to combine the queries. The cached data will also need to be rebuilt sporadically, when the system changes; this could be once a day, once a week or once a month. Any advice will be greatly appreciated, thanks!
You can use a Redis server and the phpredis PHP extension to cache results fetched from the database:
$redis = new Redis();
$redis->connect('/tmp/redis.sock');

$sql = "SELECT something FROM sometable WHERE condition";
$sql_hash = md5($sql);
$redis_key = "dbcache:{$sql_hash}";
$ttl = 3600; // values expire in 1 hour

if ($result = $redis->get($redis_key)) {
    $result = json_decode($result, true);
} else {
    $result = Db::fetchArray($sql);
    $redis->setex($redis_key, $ttl, json_encode($result));
}
(Error checks skipped for clarity)
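Since the question mentions rebuilding the cache only when the system changes, one optional addition (my suggestion, not part of the original answer) is to delete the cached key explicitly whenever the underlying data is updated, rather than relying on the TTL alone:
// Hypothetical helper: call it from whatever code updates the source table.
function invalidate_query_cache(Redis $redis, $sql)
{
    $redis->del("dbcache:" . md5($sql)); // next read repopulates the cache
}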

Prevent PHP from sending multiple emails when running parallel instances

This is more of a logic question than a language question, though the approach might vary depending on the language. In this instance I'm using ActionScript and PHP.
I have a Flash graphic that gets data stored in a MySQL database, served from a PHP script. This part is working fine; it cycles through database entries every time it is fired.
The graphic is not on a website, but is being used at 5 locations, set to load and run at regular intervals (all 5 locations fire at the same time, or at least within <500ms of each other). This is real-time info, so time is of the essence; currently the script loads and parses at all 5 locations in 30ms-300ms (depending on the distance from the server).
I was originally having a pagination problem, where each of the 5 locations would pull a different database entry since I was moving to the next entry every time the script ran. I solved this by making the script only move to the next entry after a certain amount of time has passed.
However, I also need the script to send an email every time it displays a new entry, and I only want one email to be sent. I've attempted to solve this by adding a "has been emailed" boolean to the database, but since all the scripts run at the same time, this rarely works (it does sometimes); most of the time I get 5 emails. The timeliness of this email doesn't have to match how fast the graphic gets its info; a 5-10 second delay is fine.
I've been trying to come up with a solution for this. Currently I'm thinking of spawning a Python script from PHP with a random delay (between 2 and 5 seconds), hopefully alleviating the problem. However, I'm not quite sure how to run an exec() command from PHP without the script waiting for the command to finish. Or is there a better way to accomplish this?
UPDATE: here is my current logic (relevant code only):
// get the top "unread" entry from the database
$query = "SELECT * FROM database WHERE Read = '0' ORDER BY Entry ASC LIMIT 1";
$result = $mysqli->query($query);  // (query execution/fetch implied in the original snippet)
$row = $result->fetch_assoc();

// DATA
$emailed = $row["emailed"];
$Entry = $row["databaseEntryID"];

if ($emailed == 0)
{
    // **CODE TO SEND EMAIL**
    $EmailSent = "UPDATE database SET emailed = '1' WHERE databaseEntryID = '$Entry'";
    $mysqli->query($EmailSent);
}
Thanks!
You need to use some kind of locking, e.g. database locking:
function send_email_sync($message)
{
    sql_query("UPDATE email_table SET email_sent=1 WHERE email_sent=0");
    $result = FALSE;
    if (number_of_affected_rows() == 1) {
        send_email_now($message);
        $result = TRUE;
    }
    return $result;
}
The functions sql_query and number_of_affected_rows need to be adapted to your particular database.
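As one possible adaptation (my sketch, using the table and column names from the question), with mysqli the atomic claim-then-check could look like this:
// Sketch: atomically "claim" the email using the questioner's emailed flag.
// Only the one instance whose UPDATE actually flips the flag sends the email.
function send_email_once(mysqli $mysqli, $entryId, $message)
{
    $entryId = (int) $entryId;
    $mysqli->query("UPDATE database SET emailed = '1' WHERE databaseEntryID = '$entryId' AND emailed = '0'");

    if ($mysqli->affected_rows === 1) {
        // this instance won the race; send the email here
        return true;
    }
    return false; // another instance already claimed it
}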
Old answer:
Use file-based locking (this only works if the script runs on a single server):
function send_email_sync($message)
{
    $fd = fopen(__FILE__, "r");
    if (!$fd) {
        die("something bad happened in ".__FILE__.":".__LINE__);
    }
    $result = FALSE;
    if (flock($fd, LOCK_EX | LOCK_NB)) {
        if (!email_has_already_been_sent()) {
            actually_send_email($message);
            mark_email_as_sent();
            $result = TRUE; // email has been sent
        }
        flock($fd, LOCK_UN);
    }
    fclose($fd);
    return $result;
}
You will need to lock the row in your database by using a transaction.
Pseudo code:
start transaction
select row ... for update
update row
commit
if (mysqli_affected_rows($connection) == 1)
    send_email();
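To make that pseudo code concrete, here is a rough mysqli sketch of the SELECT ... FOR UPDATE approach. It assumes the questioner's table and column names and an InnoDB table; mysqli::begin_transaction() needs PHP 5.5+:
// Sketch: row-level lock so only one of the parallel instances sends the email.
$mysqli->begin_transaction();

// FOR UPDATE blocks the other instances on this row until we commit
$res = $mysqli->query("SELECT emailed FROM database WHERE databaseEntryID = '$Entry' FOR UPDATE");
$row = $res->fetch_assoc();

if ($row && $row['emailed'] == 0) {
    $mysqli->query("UPDATE database SET emailed = '1' WHERE databaseEntryID = '$Entry'");
    $mysqli->commit();
    // **CODE TO SEND EMAIL** (placeholder from the question)
} else {
    $mysqli->commit(); // another instance already handled it
}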

How to "release" memory in a loop?

I have a script running in a shared hosting environment where I can't change the amount of memory available to PHP. The script consumes a web service via SOAP. I can't get all my data at once or it runs out of memory, so I've had some success caching the data locally in a MySQL database so that subsequent queries are faster.
Basically, instead of querying the web service for 5 months of data at once, I query it 1 month at a time, store that in the MySQL table, then retrieve the next month, and so on. This usually works, but I sometimes still run out of memory.
My basic code logic is like this:
1. connect to the web service using SOAP
2. connect to the MySQL database
3. query the web service and store the result in the variable $results
4. dump $results into the MySQL table
5. repeat steps 3 and 4 for each month of data
The same variables are used in each iteration, so I would assume that each batch of results from the web service would overwrite the previous one in memory? I tried using unset($results) between iterations but that didn't do anything. I'm outputting the memory used with memory_get_usage(true) each time, and with every iteration the memory used increases.
Any ideas how I can fix this memory leak? If I wasn't clear enough leave a comment and I can provide more details. Thanks!
EDIT:
Here is some code (I am using NuSOAP, not the PHP 5 native SOAP client, if that makes a difference):
$startingDate = strtotime("3/1/2011");
$endingDate = strtotime("7/31/2011");

// connect to database
mysql_connect("dbhost.com", "dbusername", "dbpassword");
mysql_select_db("dbname");

// configure nusoap
$serverpath = 'http://path.to/wsdl';
$client = new nusoap_client($serverpath);

// cache soap results locally, one month at a time
while ($startingDate <= $endingDate) {
    $sql = "SELECT * FROM table WHERE date >= '".date('Y-m-d', $startingDate)."' AND date <= '".date('Y-m-d', strtotime('+1 month', $startingDate))."'";
    $soapResult = $client->call('SelectData', $sql);

    foreach ($soapResult['SelectDateResult']['Result']['Row'] as $row) {
        foreach ($row as &$data) {
            $data = mysql_real_escape_string($data);
        }
        $sql = "INSERT INTO table VALUES('".$row['dataOne']."', '".$row['dataTwo']."', '".$row['dataThree']."')";
        $mysqlResults = mysql_query($sql);
    }

    $startingDate = strtotime('+1 month', $startingDate);
    echo memory_get_usage(true); // MEMORY INCREASES EACH ITERATION
}
Solved it, at least partially. There is a memory leak when using NuSOAP: it writes a debug log to a $GLOBALS variable. Altering this line in nusoap.php freed up a lot of memory.
change
$GLOBALS['_transient']['static']['nusoap_base']->globalDebugLevel = 9;
to
$GLOBALS['_transient']['static']['nusoap_base']->globalDebugLevel = 0;
I'd prefer to just use PHP 5's native SOAP client, but I'm getting strange results that I believe are specific to the web service I am trying to consume. If anyone is familiar with using PHP 5's SOAP client with www.mindbodyonline.com's SOAP API, let me know.
Have you tried unset() on $startingDate and mysql_free_result() for $mysqlResults?
Also SELECT * is frowned upon even if that's not the problem here.
EDIT: Perhaps also free the SOAP result. Some simple things to try first, to see if they help.
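To show where that freeing would sit (my sketch, based on the loop in the question), the end of each iteration can release what the next month's call no longer needs. Note that mysql_free_result() only applies to SELECT result resources, so it doesn't help with the INSERT results here:
// At the end of each iteration of the while loop from the question:
unset($soapResult, $row);
unset($data);          // also clears the leftover &$data reference from the inner foreach
gc_collect_cycles();   // PHP 5.3+: collect any circular references NuSOAP may have left

$startingDate = strtotime('+1 month', $startingDate);
echo memory_get_usage(true);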
