Indexing text files in PHP

I have been set a challenge to create an indexer that takes all words 4 characters or more, and stores them in a database along with how many times the word was used.
I have to run this indexer on 4,000 txt files. Currently, it takes about 12-15 minutes - and I'm wondering if anyone has a suggestion for speeding things up?
Currently I'm placing the words in an array as follows:
// ==============================================================
// === Create an index of all the words in the document
// ==============================================================
function index(){
$this->index = Array();
$this->index_frequency = Array();
$this->original_file = str_replace("\r", " ", $this->original_file);
$this->index = explode(" ", $this->original_file);
// Build new frequency array
foreach($this->index as $key=>$value){
// remove everything except letters
$value = clean_string($value);
if($value == '' || strlen($value) < MIN_CHARS){
continue;
}
if(array_key_exists($value, $this->index_frequency)){
$this->index_frequency[$value] = $this->index_frequency[$value] + 1;
} else{
$this->index_frequency[$value] = 1;
}
}
return $this->index_frequency;
}
I think the biggest bottleneck at the moment is the script that stores the words in the database. It needs to add the document to the essays table and then, if the word already exists in the index table, append essayid(frequency of the word) to the field; if the word doesn't exist, it adds it...
// ==============================================================
// === Store the word frequencies in the db
// ==============================================================
private function store(){
$index = $this->index();
mysql_query("INSERT INTO essays (checksum, title, total_words) VALUES ('{$this->checksum}', '{$this->original_filename}', '{$this->get_total_words()}')") or die(mysql_error());
$essay_id = mysql_insert_id();
foreach($this->index_frequency as $key=>$value){
$check_word = mysql_result(mysql_query("SELECT COUNT(word) FROM `index` WHERE word = '$key' LIMIT 1"), 0);
$eid_frequency = $essay_id . "(" . $value . ")";
if($check_word == 0){
$save = mysql_query("INSERT INTO `index` (word, essays) VALUES ('$key', '$eid_frequency')");
} else {
$eid_frequency = "," . $eid_frequency;
$save = mysql_query("UPDATE `index` SET essays = CONCAT(essays, '$eid_frequency') WHERE word = '$key' LIMIT 1");
}
}
}

You might consider profiling your app to find out exactly where your bottlenecks are. This should give you a better understanding of what can be improved.
Regarding DB optimisation: check that you have an index on the word column, then try to reduce the number of times you access the DB. INSERT ... ON DUPLICATE KEY UPDATE ..., maybe?
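As a rough sketch of that last suggestion, the per-word SELECT plus INSERT/UPDATE could be collapsed into one statement per essay. This assumes a UNIQUE (or PRIMARY) key on `index`.word, which the question does not confirm; the function name build_index_upsert is hypothetical, and the placeholders are meant for a prepared-statement API such as PDO rather than the mysql_* calls used in the question.

```php
<?php
// Hypothetical helper: builds ONE multi-row upsert for all words of an essay.
// Assumes `index`.word has a UNIQUE key so ON DUPLICATE KEY UPDATE can fire.
function build_index_upsert(array $frequencies, int $essayId): array
{
    if (count($frequencies) === 0) {
        return ['', []]; // nothing to store
    }
    $placeholders = [];
    $params = [];
    foreach ($frequencies as $word => $count) {
        $placeholders[] = '(?, ?)';
        $params[] = (string) $word;
        // same "essayid(frequency)" format the question uses
        $params[] = $essayId . '(' . $count . ')';
    }
    $sql = 'INSERT INTO `index` (word, essays) VALUES '
         . implode(', ', $placeholders)
         . " ON DUPLICATE KEY UPDATE essays = CONCAT(essays, ',', VALUES(essays))";
    return [$sql, $params];
}
```

With a PDO connection in a variable such as $pdo (an assumption), this would run as `$pdo->prepare($sql)->execute($params)`, i.e. one round trip per essay instead of two per word.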

Related

PHP prevent double clean url (improvements?)

For a client at work we have built a website. The website has an offering page which can contain variants of the same type/build, so they ran into problems with duplicate clean URLs.
Just now I wrote a function to prevent that from happening by appending a number to the URL. If that clean URL also exists, it counts up.
E.g.
domain.nl/product/machine
domain.nl/product/machine-1
domain.nl/product/machine-2
Updated! return $clean_url; on recursion and on return
The function I wrote works fine, but I was wondering if I have taken the right approach and if it maybe could be improved. Here's the code:
public function prevent_double_cleanurl($cleanurl)
{
// makes sure it doesnt check against itself
if($this->ID!=NULL) $and = " AND product_ID <> ".$this->ID;
$sql = "SELECT product_ID, titel_url FROM " . $this->_table . " WHERE titel_url='".$cleanurl."' " . $and. " LIMIT 1";
$result = $this->query($sql);
// if a matching url is found
if(!empty($result))
{
$url_parts = explode("-", $result[0]['titel_url']);
$last_part = end($url_parts);
// maximum of 2 digits
if((int)$last_part && strlen($last_part)<3)
{
// if a 1 or 2 digit number is found - add to it
array_pop($url_parts);
$cleanurl = implode("-", $url_parts);
(int)$last_part++;
}
else
{
// add a suffix starting at 1
$last_part='1';
}
// recursive check
$cleanurl = $this->prevent_double_cleanurl($cleanurl.'-'.$last_part);
}
return $cleanurl;
}
Depending on the likelihood of a "clean-url" being used multiple times, your approach may not be the best to roll with. Say "foo" through "foo-10" already exist: you'd be calling the database 10 times.
You also don't seem to sanitize the data you shove into your SQL queries. Are you using mysql_real_escape_string (or its mysqli, PDO, whatever brother)?
Revised code:
public function prevent_double_cleanurl($cleanurl) {
$cleanurl_pattern = '#^(?<base>.*?)(-(?<num>\d+))?$#S';
if (preg_match($cleanurl_pattern, $cleanurl, $matches)) {
$base = $matches['base'];
// the "num" group is absent when the url has no numeric suffix
$num = !empty($matches['num']) ? (int) $matches['num'] : 0;
} else {
$base = $cleanurl;
$num = 0;
}
// makes sure it doesnt check against itself
if ($this->ID != null) {
$and = " AND product_ID <> " . $this->ID;
}
// no LIMIT here: every numbered variant is needed to find the highest suffix
$sql = "SELECT product_ID, titel_url FROM " . $this->_table . " WHERE titel_url LIKE '" . $base . "-%'";
$result = $this->query($sql);
foreach ($result as $row) {
if ($this->ID && $row['product_ID'] == $this->ID) {
// the given cleanurl already has an ID,
// so we better not touch it
return $cleanurl;
}
if (preg_match($cleanurl_pattern, $row['titel_url'], $matches)) {
$_base = $matches['base'];
$_num = !empty($matches['num']) ? (int) $matches['num'] : 0;
} else {
$_base = $row['titel_url'];
$_num = 0;
}
if ($base != $_base) {
// make sure we're not accidentally comparing "foo-123" and "foo-bar-123"
continue;
}
if ($_num > $num) {
$num = $_num;
}
}
// next free number
$num++;
return $base . '-' . $num;
}
I don't know about the possible values for your clean-urls. Last time I did something like this, my base could look like some-article-revision-5, with that 5 being part of the actual bullet, not the duplication-index. To distinguish them (and allow the LIKE to filter out false positives), I made the clean-urls look like $base--$num. The double dash could only occur between the base and the duplication-index, making things a bit simpler…
I have no way to test this, so it's on you, but here's how I'd do it. I put a ton of comments in there explaining my reasoning and the flow of the code.
Basically, the recursion is unnecessary and will result in more database queries than you need.
<?
public function prevent_double_cleanurl($cleanurl)
{
$sql = sprintf("SELECT product_ID, titel_url FROM %s WHERE titel_url LIKE '%s%%'",
$this->_table, $cleanurl);
if($this->ID != NULL){ $sql.= sprintf(" AND product_ID <> %d", $this->ID); }
$results = $this->query($sql);
$suffix = 0;
$baseurl = true;
foreach($results as $row)
{
// Consider the case when we get to the "first" row added to the db:
// For example: $row['titel_url'] == $cleanurl == 'domain.nl/product/machine'
if($row['titel_url'] == $cleanurl)
{
$baseurl = false; // The $cleanurl is already in the db, "this" is not a base URL
continue; // Continue with the next iteration of the foreach loop
}
// This could be done using regex, but if this works its fine.
// Make sure to test for the case when you have both of the following pages in your db:
//
// some-hyphenated-page
// some-hyphenated-page-name
//
// You don't want the counters to get mixed up
$url_parts = explode("-", $row['titel_url']);
$last_part = array_pop($url_parts);
$cleanrow = implode("-", $url_parts);
// To get into this block, three things need to be true
// 1. $last_part must be a numeric string (PHP Duck Typing bleh)
// 2. When represented as a string, $last_part must not be longer than 2 digits
// 3. The string passed to this function must match the string resulting from the (n-1)
// leading parts of the result of exploding the table row
if((is_numeric($last_part)) && (strlen($last_part)<=2) && ($cleanrow == $cleanurl))
{
$baseurl = false; // If there are records in the database, the
// passed $cleanurl isn't the first, so it
// will need a suffix
$suffix = max($suffix, (int)$last_part); // After this foreach loop is done, $suffix
// will contain the highest suffix in the
// database we'll need to add 1 to this to
// get the result url
}
}
// If $baseurl is still true, then we never got into the 3-condition block above, so we never
// found a matching record in the database -> return the cleanurl that was passed here, no need
// to add a suffix
if($baseurl)
{
return $cleanurl;
}
// At least one database record exists, so we need to add a suffix. The suffix we add will be
// the highest we found in the database plus 1.
else
{
return sprintf("%s-%d", $cleanurl, ($suffix + 1));
}
}
My solution takes advantage of SQL wildcards (%) to reduce the number of queries from n down to 1.
Make sure the problematic case I described in the code comments (some-hyphenated-page vs. some-hyphenated-page-name) works as expected. Hyphens in the machine name (or whatever it is) could do unexpected things.
I also used sprintf to format the query. Make sure you sanitize any value that is interpolated into the query as a string (e.g. $cleanurl).
As @rodneyrehm points out, PHP is very flexible with what it considers a numeric string. You might consider switching out is_numeric() for ctype_digit() and see how that works.
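To make the suffix logic easier to test away from the database, it could be pulled into a pure function that receives the titel_url values returned by the single LIKE query. This is only a sketch: the name next_free_slug is hypothetical, and it uses ctype_digit() as suggested above, without the two-digit limit from the original code.

```php
<?php
// Hypothetical helper: given the wanted slug and the slugs already stored,
// return the slug itself if it is free, otherwise "<slug>-<highest suffix + 1>".
function next_free_slug(string $slug, array $existing): string
{
    $prefix = $slug . '-';
    $taken = false;
    $max = 0;
    foreach ($existing as $candidate) {
        if ($candidate === $slug) {
            $taken = true; // the bare slug is already in use
            continue;
        }
        if (strncmp($candidate, $prefix, strlen($prefix)) !== 0) {
            continue; // unrelated slug
        }
        $suffix = (string) substr($candidate, strlen($prefix));
        // only pure digit suffixes count; "machine-name" is a different page
        if ($suffix !== '' && ctype_digit($suffix)) {
            $taken = true;
            $max = max($max, (int) $suffix);
        }
    }
    return $taken ? $prefix . ($max + 1) : $slug;
}
```

Keeping the database call outside the function means the hyphen edge cases can be checked with plain arrays before touching real data.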

php sql find and insert in empty slot

I have a game script thing set up, and when it creates a new character I want it to find an empty address for that player's house.
The two relevant table fields it inserts are 'city' and 'number'. The 'city' is a random number out of 10, and the 'number' can be 1-250.
What it needs to do though is make sure there's not already an entry with the 2 random numbers it finds in the 'HOUSES' table, and if there is, then change the numbers. Repeat until it finds an 'address' not in use, then insert it.
I have a method set up to do this, but I know it's shoddy- there's probably some more logical and easier way. Any ideas?
UPDATE
Here's my current code:
$found = 0;
while ($found == 0) {
$num = (rand()%250)+1; $city = (rand()%10)+1;
$sql_result2 = mysql_query("SELECT * FROM houses WHERE city='$city' AND number='$num'", $db);
if (mysql_num_rows($sql_result2) == 0) { $found = 1; }
}
You can either do this in PHP as you do or by using a MySQL trigger.
If you stick to the PHP way, then instead of generating a number every time, do something like this
$found = 0;
$cityarr = array();
$numberarr = array();
//create the cityarr
for($i=1; $i<=10;$i++)
$cityarr[] = $i;
//create the numberarr
for($i=1; $i<=250;$i++)
$numberarr[] = $i;
//shuffle the arrays
shuffle($cityarr);
shuffle($numberarr);
//iterate until you find an unused one
foreach($cityarr as $city) {
foreach($numberarr as $num) {
$sql_result2 = mysql_query("SELECT * FROM houses
WHERE city='$city' AND number='$num'", $db);
if (mysql_num_rows($sql_result2) == 0) {
$found = 1;
break;
}
}
if($found) break;
}
This way you don't check the same value more than once, and you still check randomly.
But you should really consider fetching all your records before the loops, so you only have one query. That would also improve performance a lot.
Like this:
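$found = 0;
$cityarr = array();
$numberarr = array();
$taken = array();
for($i=1; $i<=10;$i++)
$taken[$i] = array();
//one query: remember which numbers are taken in each city
$records = mysql_query("SELECT city, number FROM houses", $db);
while($rec = mysql_fetch_assoc($records)) {
$taken[$rec['city']][] = $rec['number'];
}
for($i=1; $i<=10;$i++)
$cityarr[] = $i;
for($i=1; $i<=250;$i++)
$numberarr[] = $i;
//shuffle so the free address is still picked at random
shuffle($cityarr);
shuffle($numberarr);
foreach($cityarr as $city) {
foreach($numberarr as $num) {
//an address is free when its number is not in this city's taken list
if(!in_array($num, $taken[$city])) {
$cityNotTaken = $city;
$numberNotTaken = $num;
$found = 1;
break;
}
}
if($found) break;
}
if($found) echo 'City ' . $cityNotTaken . ' number ' . $numberNotTaken . ' is not taken!';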
I would go with this method :-)
Doing it the way you describe can cause problems when there are only a couple of free houses (or even just 1) left. It could take ages for the script to find an empty house.
What I recommend is inserting all 2500 records into the database (every combination of city 1-10 with number 1-250) and marking each one as empty or not (or creating a combo table linking user <> house) and matching on that.
With MySQL you can then select a random empty entry from the database in no time!
Because it's only 2500 records, you can do ORDER BY RAND() LIMIT 1 to get a random row. I don't recommend this when you have many more records.
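The same idea (pick at random from the free addresses only, instead of guessing and retrying) can also be sketched in memory. This is an illustration, not the SQL approach itself: the function name pick_free_address is hypothetical, and $taken is assumed to be the list of rows fetched from the houses table, each with 'city' and 'number' keys.

```php
<?php
// Hypothetical helper: returns a random free address, or null when the
// whole grid of cities x numbers is occupied.
function pick_free_address(array $taken, int $cities = 10, int $numbers = 250): ?array
{
    // index the occupied addresses for O(1) lookups
    $used = [];
    foreach ($taken as $row) {
        $used[$row['city'] . ':' . $row['number']] = true;
    }
    // collect every combination that is not occupied
    $free = [];
    for ($city = 1; $city <= $cities; $city++) {
        for ($num = 1; $num <= $numbers; $num++) {
            if (!isset($used[$city . ':' . $num])) {
                $free[] = ['city' => $city, 'number' => $num];
            }
        }
    }
    if (count($free) === 0) {
        return null;
    }
    return $free[array_rand($free)];
}
```

The run time no longer depends on how full the table is, and the "no address left" case is explicit instead of an endless loop. Two players created at the same moment could still receive the same address, so a unique key on (city, number) remains advisable.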

MySQL INSERT - Using a for() loop with query & do non-set values insert as "" by default?

I have two arrays with anywhere from 1 to 5 set values. I want to insert these values into a table with two columns.
Here's my current query, given to me in another SO question:
INSERT INTO table_name (country, redirect)
VALUES ('$country[1]', '$redirect[1]'),
('$country[2]', '$redirect[2]'),
('$country[3]', '$redirect[3]'),
('$country[4]', '$redirect[4]'),
('$country[5]', '$redirect[5]')
ON DUPLICATE KEY UPDATE redirect=VALUES(redirect)
I'm a little concerned however with what happens if some of these array values aren't set, as I believe the above assumes there's 5 sets of values (10 values in total), which definitely isn't certain. If a value is null/0 does it automatically skip it?
Would something like this work better, would it be a lot more taxing on resources?
for($i = 0, $size = sizeof($country); $i <= $size; $i++) {
$query = "INSERT INTO table_name (country, redirect) VALUES ('$country[$i]', '$redirect[$i]') ON DUPLICATE KEY UPDATE redirect='$redirect[$i]'";
$result = mysql_query($query);
}
Questions highlighted in bold ;). Any answers would be very much appreciated :) :)!!
Do something like this:
$vals = array();
foreach($country as $key => $country_val) {
if (empty($country_val) || empty($redirect[$key])) {
continue;
}
$vals[] = "('" . mysql_real_escape_string($country_val) . "','" . mysql_real_escape_string($redirect[$key]) . "')";
}
$val_string = implode(',', $vals);
$sql = "INSERT INTO .... VALUES $val_string";
That'll build up the values section dynamically, skipping any that aren't set. Note, however, that there is a length limit on MySQL query strings, set by the max_allowed_packet setting. If you're building a "huge" query, you'll have to split it into multiple smaller ones if it exceeds this limit.
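One way to stay under that limit is to cap the number of rows per statement. The sketch below is an assumption-laden illustration: build_insert_batches is a hypothetical name, the default cap of 500 rows is arbitrary (the real limit is in bytes, not rows), and it emits ? placeholders for a prepared-statement API instead of escaping by hand.

```php
<?php
// Hypothetical helper: pairs up country/redirect values, skips unset or empty
// ones, and splits the result into several smaller INSERT statements.
function build_insert_batches(array $country, array $redirect, int $rowsPerQuery = 500): array
{
    $rows = [];
    foreach ($country as $key => $value) {
        if (empty($value) || empty($redirect[$key])) {
            continue; // skip incomplete pairs
        }
        $rows[] = [$value, $redirect[$key]];
    }
    $batches = [];
    foreach (array_chunk($rows, $rowsPerQuery) as $chunk) {
        $sql = 'INSERT INTO table_name (country, redirect) VALUES '
             . implode(', ', array_fill(0, count($chunk), '(?, ?)'))
             . ' ON DUPLICATE KEY UPDATE redirect = VALUES(redirect)';
        // flatten [[c1, r1], [c2, r2]] into [c1, r1, c2, r2]
        $batches[] = [$sql, array_merge(...$chunk)];
    }
    return $batches;
}
```

Each returned pair holds a statement and its parameter list, so the caller can prepare and execute them one after another.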
If you are asking whether PHP will automatically skip inserting your values into the query if they are null or 0, the answer is no. Why don't you loop through the countries and, if a country has a matching redirect, include that portion of the insert statement? Something like this (not tested, just showing an example). It's one query, all values. You can also incorporate some checking or default to null if they do not exist.
$rows = array();
for($i = 0, $size = sizeof($country); $i <= $size; $i++) {
// only include pairs where both values exist
if(array_key_exists($i, $country) && array_key_exists($i, $redirect)) {
$rows[] = "('".$country[$i]."', '".$redirect[$i]."')";
}
}
if(count($rows) > 0) {
// joining with implode avoids a dangling comma when a pair is skipped
$query = "INSERT INTO table_name (country, redirect) VALUES " . implode(", ", $rows);
$query .= " ON DUPLICATE KEY UPDATE redirect=VALUES(redirect)";
$result = mysql_query($query);
}

How to remove htmlentities() values from the database?

Long before I knew anything - not that I know much even now - I designed a web app in PHP which inserted data into my MySQL database after running the values through htmlentities(). I eventually came to my senses, removed this step, stuck it in the output rather than the input, and went on my merry way.
However I've since had to revisit some of this old data and unfortunately I have an issue, when it's displayed on the screen I'm getting values displayed which are effectively htmlentitied twice.
So, is there a mysql or phpmyadmin way of changing all the older, affected rows back into their relevant characters or will I have to write a script to read each row, decode and update all 17 million rows in 12 tables?
EDIT:
Thanks for the help everyone, I wrote my own answer down below with some code in, it's not pretty but it worked on the test data earlier so barring someone pointing out a glaring error in my code while I'm in bed I'll be running it on a backup DB tomorrow and then on the live one if that works out alright.
I ended up using this, not pretty, but I'm tired, it's 2am and it did its job! (Edit: on test data)
$tables = array('users', 'users_more', 'users_extra', 'forum_posts', 'posts_edits', 'forum_threads', 'orders', 'product_comments', 'products', 'favourites', 'blocked', 'notes');
foreach($tables as $table)
{
$sql = "SELECT * FROM {$table} WHERE data_date_ts < '{$encode_cutoff}'";
$rows = $database->query($sql);
while($row = mysql_fetch_assoc($rows))
{
$new = array();
foreach($row as $key => $data)
{
$new[$key] = $database->escape_value(html_entity_decode($data, ENT_QUOTES, 'UTF-8'));
}
array_shift($new);
$new_string = "";
$i = 0;
foreach($new as $new_key => $new_data)
{
if($i > 0) { $new_string.= ", "; }
$new_string.= $new_key . "='" . $new_data . "'";
$i++;
}
$sql = "UPDATE {$table} SET " . $new_string . " WHERE id='" . $row['id'] . "'";
$database->query($sql);
// plus some code to check that all out
}
}
Since PHP was the method of encoding, you'll want to use it to decode. You can use html_entity_decode to convert them back to their original characters. Gotta loop!
Just be careful not to decode rows that don't need it. Not sure how you'll determine that.
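One rough heuristic for that check is to decode the value and see whether anything changed. This is only a sketch: the function name needs_decoding is hypothetical, and the test cannot tell an old, pre-encoded row from a newer row in which a user literally typed "&amp;", so it is best combined with a date cutoff such as the one in the question.

```php
<?php
// Hypothetical helper: true when the stored value still contains HTML
// entities that html_entity_decode() would turn back into characters.
function needs_decoding(string $value): bool
{
    return html_entity_decode($value, ENT_QUOTES, 'UTF-8') !== $value;
}
```

Skipping rows where this returns false also avoids issuing UPDATE statements that would change nothing.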
I think writing a php script is good thing to do in this situation. You can use, as Dave said, the html_entity_decode() function to convert your texts back.
Try your script on a table with few entries first. This will save you a lot of testing time. Of course, remember to back up your table(s) before running the PHP script.
I'm afraid there is no shorter possibility. The computation for millions of rows remains quite expensive, no matter how you convert the datasets back. So go for a php script... it's the easiest way
This is my bulletproof version. It iterates over all tables and string columns in a database, determines the primary key(s) and performs the updates.
It is intended to be run as a PHP file from the command line, so that you get progress information.
<?php
$DBC = new mysqli("localhost", "user", "dbpass", "dbname");
$DBC->set_charset("utf8");
$tables = $DBC->query("SHOW FULL TABLES WHERE Table_type='BASE TABLE'");
while($table = $tables->fetch_array()) {
$table = $table[0];
$columns = $DBC->query("DESCRIBE `{$table}`");
$textFields = array();
$primaryKeys = array();
while($column = $columns->fetch_assoc()) {
// check for char, varchar, text, mediumtext and so on
if ($column["Key"] == "PRI") {
$primaryKeys[] = $column['Field'];
} else if (strpos( $column["Type"], "char") !== false || strpos($column["Type"], "text") !== false ) {
$textFields[] = $column['Field'];
}
}
if (!count($primaryKeys)) {
echo "Cannot convert table without primary key: '$table'\n";
continue;
}
foreach ($textFields as $textField) {
$sql = "SELECT `".implode("`,`", $primaryKeys)."`,`$textField` from `$table` WHERE `$textField` like '%&%'";
$candidates = $DBC->query($sql);
$tmp = $DBC->query("SELECT FOUND_ROWS()");
$rowCount = $tmp->fetch_array()[0];
$tmp->free();
echo "Updating $rowCount in $table.$textField\n";
$count=0;
while($candidate = $candidates->fetch_assoc()) {
$oldValue = $candidate[$textField];
$newValue = html_entity_decode($candidate[$textField], ENT_QUOTES | ENT_XML1, 'UTF-8');
if ($oldValue != $newValue) {
$sql = "UPDATE `$table` SET `$textField` = '"
. $DBC->real_escape_string($newValue)
. "' WHERE ";
foreach ($primaryKeys as $pk) {
$sql .= "`$pk` = '" . $DBC->real_escape_string($candidate[$pk]) . "' AND ";
}
$sql .= "1";
$DBC->query($sql);
}
$count++;
echo "$count / $rowCount\r";
}
}
}
?>
cheers
Roland
It's a bit kludgy but I think the mass update is the only way to go...
$Query = "SELECT row_id, html_entitied_column FROM table";
$result = mysql_query($Query, $connection);
while($row = mysql_fetch_array($result)){
// escape the decoded value: it may now contain quotes
$updatedValue = mysql_real_escape_string(html_entity_decode($row['html_entitied_column']), $connection);
$Query = "UPDATE table SET html_entitied_column = '" . $updatedValue . "' ";
$Query .= "WHERE row_id = " . $row['row_id'];
mysql_query($Query, $connection);
}
This is simplified, no error handling etc.
Not sure what the processing time would be on millions of rows so you might need to break it up into chunks to avoid script timeouts.
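Breaking the work into chunks could look roughly like this, assuming the table has a numeric row_id as in the snippet above. The helper name id_ranges and the chunk size are hypothetical; each returned pair would feed a "WHERE row_id BETWEEN start AND end" clause in the SELECT.

```php
<?php
// Hypothetical helper: splits [minId, maxId] into consecutive inclusive
// ranges of at most $chunkSize ids each.
function id_ranges(int $minId, int $maxId, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunk size must be at least 1');
    }
    $ranges = [];
    for ($start = $minId; $start <= $maxId; $start += $chunkSize) {
        $ranges[] = [$start, min($start + $chunkSize - 1, $maxId)];
    }
    return $ranges;
}
```

Processing one range per script run (or per loop iteration) keeps memory use flat and makes it possible to resume after a timeout.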
I had the exact same problem. Since I had multiple clients running the application in production, I wanted to avoid running a PHP script to clean the database for every one of them.
I came up with a solution that is far from perfect, but does the job painlessly.
Track all the spots in your code where you use htmlentities() before inserting data, and remove that.
Change your "display data as HTML" method to something like this :
return html_entity_decode(htmlentities($chaine, ENT_NOQUOTES), ENT_NOQUOTES);
The undo-redo process is kind of ridiculous, but it does the job. And your database will slowly clean itself every time users update the incorrect data.
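One possible reading of the undo-redo idea, sketched under the assumption that the undo (decode) runs first and the redo (encode) second, so that old pre-encoded rows and new raw rows render identically. Note that this nests the two calls the other way round from the line above; the function name display_as_html and the ENT_QUOTES / UTF-8 arguments are assumptions, not the poster's code.

```php
<?php
// Hypothetical display helper: normalise the stored value to raw text,
// then encode it exactly once for HTML output.
function display_as_html(string $stored): string
{
    $raw = html_entity_decode($stored, ENT_QUOTES, 'UTF-8'); // undo old encoding, if any
    return htmlentities($raw, ENT_QUOTES, 'UTF-8');          // redo it once
}
```

The trade-off is that text a user deliberately typed as an entity (for example a literal "&amp;") is treated like old encoded data.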

Codeigniter: Get affected fields in update

Is there a way to find out which fields were modified by an update query?
I want to keep track of which fields user XXX modified... is there any way to do this using Active Record?
I needed this exact functionality so I wrote this code. It returns the number of fields that were affected.
FUNCTION STARTS:
function mysql_affected_fields($sql)
{
// Parse SQL update statement
$piece1 = explode( "UPDATE ", $sql);
$piece2 = explode( "SET", $piece1[1]);
$sql_parts['table'] = trim($piece2[0]);
$piece1 = explode( "SET ", $sql);
$piece2 = explode( "WHERE", $piece1[1]);
$sql_parts['set'] = trim($piece2[0]);
$fields = explode (",",$sql_parts['set']);
foreach($fields as $field)
{
$field_parts = explode("=",$field);
$field_name = trim($field_parts[0]) ;
$field_value = trim($field_parts[1]) ;
$field_value =str_replace("'","",$field_value);
$sql_parts['field'][$field_name] = $field_value;
}
$piece1 = explode( "WHERE ", $sql);
$piece2 = explode( ";", $piece1[1]);
$sql_parts['where'] = trim($piece2[0]);
// Get original field values
$select = "SELECT * FROM ".$sql_parts['table']." WHERE ".$sql_parts['where'];
$result_latest = mysql_query($select) or trigger_error(mysql_error());
$different = 0;
while($row = mysql_fetch_array($result_latest,MYSQL_ASSOC))
{
// compare only the columns named in the SET clause
foreach($sql_parts['field'] as $k=>$v)
{
if (!array_key_exists($k, $row) || $row[$k] != $v)
{
$different++;
}
}
}
return $different;
}
There is no way using active record to get this easily, but if you are only supporting one specific database type (let's say MySQL) you could always use Triggers?
Or, Adam is about right. If you have a WHERE criteria for your UPDATE you can SELECT it before you do the UPDATE then loop through the old and new versions comparing.
This is exactly the sort of work Triggers were created for, but of course that puts too much reliance on the DB which makes this less portable yada yada yada.
Solution, step by step:
1. SELECT the row that the user wants to modify.
2. UPDATE it.
3. Compute the differences between the selected row and the updated values.
4. Store the differences somewhere (or mail them, show them, whatever).
Simple.
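The "compute differences" step can be sketched as a plain array comparison. The function name changed_fields and the shape of the return value are assumptions; $before would be the row fetched by the SELECT and $update the column => value array passed to the update call.

```php
<?php
// Hypothetical helper: returns the fields whose value differs between the
// row as selected and the values about to be written.
function changed_fields(array $before, array $update): array
{
    $changes = [];
    foreach ($update as $field => $newValue) {
        $oldValue = array_key_exists($field, $before) ? $before[$field] : null;
        // database drivers usually return strings, so compare as strings
        if ((string) $oldValue !== (string) $newValue) {
            $changes[$field] = ['old' => $oldValue, 'new' => $newValue];
        }
    }
    return $changes;
}
```

The result can then be serialised into an audit table or a log entry, together with the id of the user who made the change.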
