Optimization of foreach for thousands of items - PHP

I'm running the code below over a set of 25,000 results. I need to optimize it because I'm hitting the memory limit.
$oldproducts = Oldproduct::model()->findAll(); /*(here i have 25,000 results)*/
foreach($oldproducts as $oldproduct) :
$criteria = new CDbCriteria;
$criteria->compare('`someid`', $oldproduct->someid);
$finds = Newproduct::model()->findAll($criteria);
if (empty($finds)) {
$new = new Newproduct;
$new->someid = $oldproduct->someid;
$new->save();
} else {
foreach($finds as $find) :
if ($find->price != $oldproduct->price) {
$find->attributes=array('price' => $oldproduct->price);
$find->save();
}
endforeach;
}
endforeach;
The code compares rows of two tables by someid. If it finds a match, it updates the price column; if not, it creates a new record.

Use CDataProviderIterator which:
... allows iteration over large data sets without holding the entire set in memory.
You first have to pass a CDataProvider instance to it:
$dataProvider = new CActiveDataProvider("Oldproduct");
$iterator = new CDataProviderIterator($dataProvider);
foreach($iterator as $item) {
// do stuff
}
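Applied to the loop from the question, a minimal sketch could look like the following (it assumes the same Oldproduct/Newproduct models and someid/price columns; if several Newproduct rows can share one someid, keep findAll() as in the original):
$dataProvider = new CActiveDataProvider('Oldproduct');
$iterator = new CDataProviderIterator($dataProvider);
foreach ($iterator as $oldproduct) {
    $criteria = new CDbCriteria;
    $criteria->compare('`someid`', $oldproduct->someid);
    // find() loads a single matching row instead of every match at once
    $find = Newproduct::model()->find($criteria);
    if ($find === null) {
        $new = new Newproduct;
        $new->someid = $oldproduct->someid;
        $new->save();
    } elseif ($find->price != $oldproduct->price) {
        $find->price = $oldproduct->price;
        $find->save();
    }
}
The iterator fetches the old products page by page, so only one page is held in memory at a time.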

You could process the rows in chunks of ~5000 instead of getting all the rows in 1 go!
$cnt = 5000;
$offset = 0;
do {
$oldproducts = Oldproduct::model()->limit($cnt)->offset($offset)->findAll(); /* fetch the next chunk of up to 5,000 rows */
foreach($oldproducts as $oldproduct) {
// your code
}
$offset += $cnt;
} while(count($oldproducts) >= $cnt);
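If memory still climbs from chunk to chunk, you could also release each batch explicitly before fetching the next one. A sketch of that idea (the unset()/gc_collect_cycles() calls may or may not be necessary, depending on your PHP version and on what the loop body keeps references to):
$cnt = 5000;
$offset = 0;
do {
    $oldproducts = Oldproduct::model()->limit($cnt)->offset($offset)->findAll();
    foreach ($oldproducts as $oldproduct) {
        // your compare/insert logic here
    }
    $offset += $cnt;
    $batchSize = count($oldproducts);
    unset($oldproducts);   // drop the references to the previous batch
    gc_collect_cycles();   // ask PHP to collect any leftover reference cycles
} while ($batchSize == $cnt);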

PHP create unique value compared to values from object

I need to create a unique value for $total, one that is different from all the other values in a received object. It should compare $total with order_amount from the object, and if they are the same, increase $total by 0.00000001 and then check through the object again to see whether it matches another order_amount. The end result should be a unique value with a minimal increase compared to the starting $total value. All values are set to have 8 decimal places.
I have tried the following but it won't get me the result I need. What am I doing wrong?
function unique_amount($amount, $rate) {
$total = round($amount / $rate, 8);
$other_amounts = some object...;
foreach($other_amounts as $amount) {
if ($amount->order_amount == $total) {
$total = $total + 0.00000001;
}
}
return $total;
}
<?php
define('EPSILON',0.00000001);
$total = 4.00000000;
$other_amounts = [4.00000001,4.00000000,4.00000002];
sort($other_amounts);
foreach($other_amounts as $each_amount){
if($total === $each_amount){ // $total === $each_amount->order_amount, in case of objects
$total += EPSILON;
}
}
var_dump($total);
OUTPUT
float(4.00000003)
You may add an additional break if $total < $each_amount to make it a bit more efficient.
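For example, since the array is sorted ascending, the loop can stop as soon as the current amount is already larger than $total. A small sketch of that idea:
foreach ($other_amounts as $each_amount) {
    if ($total < $each_amount) {
        break;               // everything after this point is larger, no collision possible
    }
    if ($total === $each_amount) {
        $total += EPSILON;
    }
}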
UPDATE
To sort objects in $other_amounts based on amount, you can use usort.
usort($other_amounts,function($o1,$o2){
if($o1->order_amount < $o2->order_amount ) return -1;
else if($o1->order_amount > $o2->order_amount ) return 1;
return 0;
});
Ok, here's the solution I came up with. First I created a function to deliver random objects with random totals so I could have something to work with; this is unnecessary for you, but useful for the sake of this test:
function generate_objects()
{
$outputObjects = [];
for ($i=0; $i < 100; $i++) {
$object = new \stdClass();
$mainValue = random_int(1,9);
$decimalValue = random_int(1,9);
$object->order_amount = "{$mainValue}.0000000{$decimalValue}";
$outputObjects[] = $object;
}
return $outputObjects;
}
And now for the solution part, first the code, then the explanation:
function unique_amount($amount, $rate) {
$total = number_format(round($amount / $rate, 8), 4);
$searchTotal = $total;
if (strpos((string) $searchTotal, '.') !== false) {
$searchTotal = str_replace('.', '\.', $searchTotal);
}
$other_amounts = generate_objects();
$similarTotals = [];
foreach($other_amounts as $amount) {
if (preg_match("/^$searchTotal/", $amount->order_amount)) {
$similarTotals[] = $amount->order_amount;
}
}
if (!empty($similarTotals)) {
rsort($similarTotals);
$total = ($similarTotals[0] + 0.00000001);
}
// DEBUG
//echo '<pre>';
//$vars = get_defined_vars();
//unset($vars['other_amounts']);
//print_r($vars);
//die;
return $total;
}
$test = unique_amount(8,1);
echo $test;
I decided to use RegEx to find the amounts that start with the one I provided. Since in the exercise I provided only integers with 1-9 in the last decimal place, I tracked them and added them to one array, $similarTotals.
Then, if that array is not empty, I sorted it descending by value, took the first item and incremented it by 0.00000001.
So, finally, the function returns either the original $total (if nothing was found) or the incremented first item in the array.
PS. I did not expect this code to be this big but, well...
You can see the test here: https://3v4l.org/WGThI

Continue on PHP

Hi there, I have a question: how do I continue writing the data into a new file once the count reaches 500, and add a random number to the sitemap name? This is my script:
$rand = rand(1,9);
$open1 = fopen("sitemap-$rand.txt", 'w');
$web = 'http://'.$_SERVER['SERVER_NAME'].'/';
$ws = $_SERVER['SERVER_NAME'];
$i = 1;
foreach ($data as $key => $value) {
$hasil = home_base_url().strtolower($value).'.html'."\n";
fwrite($open1, $hasil);
if (++$i == 500) {
break;
}
}
Thanks for your help
You mean that there are 500 lines, per random numbered sitemap?
if (++$i == 500) {
    fclose($open1);                           // finish the current sitemap file
    $rand = rand(1, 9);                       // pick a new random number for the next file
    $open1 = fopen("sitemap-$rand.txt", 'w'); // start writing to the new file
    $i = 0;                                   // reset the counter
}
N.B.: The loop might pick the same random number twice. Either pick from a larger range to decrease the odds, or create a small function that keeps track of which names are already used, so a new name is always picked.
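A minimal sketch of such a helper (the function name is made up, and it assumes the range is large enough that an unused number always exists):
function next_sitemap_number(array &$used, $min = 1, $max = 999) {
    do {
        $n = rand($min, $max);            // try a random number
    } while (in_array($n, $used, true));  // retry until it has not been used yet
    $used[] = $n;
    return $n;
}
// usage
$used = array();
$open1 = fopen('sitemap-' . next_sitemap_number($used) . '.txt', 'w');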

Trouble reading huge CSV file with php fgetcsv - understanding memory consumption

Good morning,
I'm currently going through some hard lessons while trying to handle huge CSV files of up to 4 GB.
The goal is to search for some items in a CSV file (an Amazon data feed) by a given browse node and also by some given item IDs (ASINs), to get a mix of existing items (already in my database) plus some additional new items, since from time to time items disappear from the marketplace. I also filter on the item titles because many items use the same title.
I have been reading lots of tips here and finally decided to use PHP's fgetcsv(), thinking this function would not exhaust memory since it reads the file line by line.
But no matter what I try, I'm always running out of memory.
I cannot understand why my code uses so much memory.
I set the memory limit to 4096 MB and the time limit to 0. The server has 64 GB of RAM and two SSD hard disks.
Could someone please check out my piece of code and explain how it is possible that I'm running out of memory and, more importantly, how the memory is used?
private function performSearchByASINs()
{
$found = 0;
$needed = 0;
$minimum = 84;
if(is_array($this->searchASINs) && !empty($this->searchASINs))
{
$needed = count($this->searchASINs);
}
if($this->searchFeed == NULL || $this->searchFeed == '')
{
return false;
}
$csv = fopen($this->searchFeed, 'r');
if($csv)
{
$l = 0;
$title_array = array();
while(($line = fgetcsv($csv, 0, ',', '"')) !== false)
{
$header = array();
if(trim($line[6]) != '')
{
if($l == 0)
{
$header = $line;
}
else
{
$asin = $line[0];
$title = $this->prepTitleDesc($line[6]);
if(is_array($this->searchASINs)
&& !empty($this->searchASINs)
&& in_array($asin, $this->searchASINs)) //search for existing items to get them updated
{
$add = true;
if(in_array($title, $title_array))
{
$add = false;
}
if($add === true)
{
$this->itemsByASIN[$asin] = new stdClass();
foreach($header as $k => $key)
{
if(isset($line[$k]))
{
$this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
}
}
$title_array[] = $title;
$found++;
}
}
if(($line[20] == $this->bnid || $line[21] == $this->bnid)
&& count($this->itemsByKey) < $minimum
&& !isset($this->itemsByASIN[$asin])) // searching for new items
{
$add = true;
if(in_array($title, $title_array))
{
$add = false;
}
if($add === true)
{
$this->itemsByKey[$asin] = new stdClass();
foreach($header as $k => $key)
{
if(isset($line[$k]))
{
$this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
}
}
$title_array[] = $title;
$found++;
}
}
}
$l++;
if($l > 200000 || $found == $minimum)
{
break;
}
}
}
fclose($csv);
}
}
I know my answer is a bit late, but I had a similar problem with fgets() and things based on fgets(), like SplFileObject->current(). In my case it was on a Windows system when trying to read an 800+ MB file. I think fgets() doesn't free the memory of the previous line in a loop, so every line that was read stayed in memory and led to a fatal out-of-memory error. I fixed it using fread($lineLength) instead, but it is a bit trickier since you must supply the length.
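To show what "trickier" means, here is a rough sketch of chunked reading with fread(): you have to reassemble the lines yourself, and str_getcsv() as used here will not handle quoted fields that span multiple lines:
$handle = fopen('data.csv', 'r');
$buffer = '';
while (!feof($handle)) {
    $buffer .= fread($handle, 8192);               // read a fixed-size chunk
    while (($pos = strpos($buffer, "\n")) !== false) {
        $line = substr($buffer, 0, $pos);          // take one complete line
        $buffer = substr($buffer, $pos + 1);       // keep the rest for the next round
        $row = str_getcsv($line);                  // parse the line as CSV
        // ... process $row ...
    }
}
if ($buffer !== '') {
    $row = str_getcsv($buffer);                    // last line without a trailing newline
    // ... process $row ...
}
fclose($handle);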
It is very hard to manage large data sets using arrays without running into timeout issues. Instead, why not parse this data feed into a database table and do the heavy lifting from there?
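As a sketch of that idea (the table and column names here are made up, and it assumes a MySQL server with LOCAL INFILE enabled), the feed could be bulk-loaded once and then filtered with plain SQL:
$pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass', array(
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,   // allow LOAD DATA LOCAL INFILE
));
$pdo->exec("LOAD DATA LOCAL INFILE '/path/to/datafeed.csv'
            INTO TABLE datafeed
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
            IGNORE 1 LINES");
// after the import, filtering by ASIN or browse node becomes an indexed SQL query
$stmt = $pdo->prepare('SELECT * FROM datafeed WHERE browsenode_id = ?');
$stmt->execute(array($bnid));               // $bnid: the browse node you are searching for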
Have you tried this? SplFileObject::fgetcsv
<?php
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
    $row = $file->fgetcsv();   // reads and parses one CSV line per call
    // your code here
}
?>
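Alternatively, a sketch of the same idea: put the object into CSV mode once and iterate over it directly:
<?php
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV | SplFileObject::SKIP_EMPTY | SplFileObject::READ_AHEAD);
foreach ($file as $row) {
    // $row is already an array of CSV fields
}
?>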
You are running out of memory because you keep everything in variables and never call unset(), and you use too many nested foreach loops. You could also split that code into smaller functions.
A solution would be to use a real database instead.

Find Unique Tags in an XML file

I have an XML file which contains the tag <image_file_name>. This tag repeats, and occasionally its value is duplicated. I am trying to find the total number of unique values of <image_file_name>.
$simpleXML = simplexml_load_file("stock_availability.xml");
$uniqueProducts = array();
foreach ($simpleXML->product as $product) {
$image_file_name = $product->image_file_name;
if(in_array($image_file_name, $uniqueProducts)) {
echo 1;
} else {
$uniqueProducts[] = $image_file_name;
echo 2;
}
$image_file_name = null;
}
echo count($uniqueProducts);
The count() returns the total number of instances of image_file_name, not the number of unique values.
2 is also echoed continuously and 1 is never echoed.
I've stared at your code for a few minutes and cannot see a problem. I take it you've used a print_r($uniqueProducts) and have found duplicates in the output.
Have you checked for whitespace or case differences in the otherwise duplicate entries?
Try some standardising, e.g. by using strtoupper(trim()) and quoting the value as a string.
e.g.:
$simpleXML = simplexml_load_file("stock_availability.xml");
$uniqueProducts = array();
foreach ($simpleXML->product as $product) {
$image_file_name = $product->image_file_name;
$dup_check=strtoupper(trim($image_file_name));
if(in_array("$dup_check", $uniqueProducts)) {
echo 1;
} else {
$uniqueProducts[] = "$dup_check";
echo 2;
}
$image_file_name = null;
}
print_r($uniqueProducts);
echo "<br>".count($uniqueProducts);
See if this helps any.
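Another possible cause: $product->image_file_name is a SimpleXMLElement object, not a plain string, which can make the in_array() comparison behave unexpectedly. A small alternative sketch that casts each value to a string and lets array_unique() do the de-duplication:
$simpleXML = simplexml_load_file("stock_availability.xml");
$names = array();
foreach ($simpleXML->product as $product) {
    // cast the SimpleXMLElement to a plain, normalised string
    $names[] = strtoupper(trim((string) $product->image_file_name));
}
$uniqueProducts = array_unique($names);
echo count($uniqueProducts);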

How to improve performance iterating a DOMDocument?

I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.
The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables, where one or more type B follow a type A.
I've profiled my script using microtime(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
First, I include two files. One handles some parsing, and the other defines two "data structure" classes.
// Imports
include('./course.php');
include('./utils.php');
Includes are inconsequential as far as I know, and so let's proceed to the cURL import.
// Execute cURL
$response = curl_exec($curl_handle);
I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.
// Run about 25 str_replace calls here, to clean up
// then run tidy.
$html = $response;
//
// Prepare some config for tidy
//
$config = array(
'indent' => true,
'output-xhtml' => true,
'wrap' => 200);
//
// Tidy up the HTML
//
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = $tidy;
Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)
// Get all of the tables in the page
$tables = $dom->getElementsByTagName('table');
// Create a buffer for the courses
$courses = array();
// Iterate
$numberOfTables = $tables->length;
for ($i=1; $i <$numberOfTables ; $i++) {
$sectionTable = $tables->item($i);
$courseTable = $tables->item($i-1);
// We've found a course table, parse it.
if (elementIsACourseSectionTable($sectionTable)) {
$course = courseFromTable($courseTable);
$course = addSectionsToCourseUsingTable($course, $sectionTable);
$courses[] = $course;
}
}
For reference, here are the utility functions that I call:
//
// Tell us if a given element is
// a course section table.
//
function elementIsACourseSectionTable(DOMElement $element){
$tableHasClass = $element->hasAttribute('class');
$tableIsCourseTable = $element->getAttribute("class") == "coursetable";
return $tableHasClass && $tableIsCourseTable;
}
//
// Takes a table and parses it into an
// instance of the Course class.
//
function courseFromTable(DOMElement $table){
$secondRow = $table->getElementsByTagName('tr')->item(1);
$cells = $secondRow->getElementsByTagName('td');
$course = new Course;
$course->startDate = valueForElementInList(0, $cells);
$course->endDate = valueForElementInList(1, $cells);
$course->name = valueForElementInList(2, $cells);
$course->description = valueForElementInList(3, $cells);
$course->credits = valueForElementInList(4, $cells);
$course->hours = valueForElementInList(5, $cells);
$course->division = valueForElementInList(6, $cells);
$course->subject = valueForElementInList(7, $cells);
return $course;
}
//
// Takes a table and parses it into an
// instance of the Section class.
//
function sectionFromRow(DOMElement $row){
$cells = $row->getElementsByTagName('td');
//
// Skip any row with a single cell
//
if ($cells->length == 1) {
# code...
return NULL;
}
//
// Skip header rows
//
if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
return NULL;
}
$section = new Section;
$section->section = valueForElementInList(0, $cells);
$section->code = valueForElementInList(1, $cells);
$section->openSeats = valueForElementInList(2, $cells);
$section->dayAndTime = valueForElementInList(3, $cells);
$section->instructor = valueForElementInList(4, $cells);
$section->buildingAndRoom = valueForElementInList(5, $cells);
$section->isOnline = valueForElementInList(6, $cells);
return $section;
}
//
// Take a table containing course sections
// and parse it put the results into a
// give course object.
//
function addSectionsToCourseUsingTable(Course $course, DOMElement $table){
$rows = $table->getElementsByTagName('tr');
$numRows = $rows->length;
for ($i=0; $i < $numRows; $i++) {
$section = sectionFromRow($rows->item($i));
// Make sure we have an array to put sections into
if (is_null($course->sections)) {
$course->sections = array();
}
// Skip "meta" rows, since they're not really sections
if (is_null($section)) {
continue;
}
$course->addSection($section);
}
return $course;
}
//
// Returns the text from a cell
// with a
//
function valueForElementInList($index, $list){
$value = $list->item($index)->nodeValue;
$value = trim($value);
return $value;
}
This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!
I've been advised to split up the workload of my main work loop, but considering the homogeneous nature of my data, I'm not entirely sure how. Any suggestions on improving this code are greatly appreciated.
What can I do to improve my code execution time?
It turns out that my loop is terribly inefficient.
Using a foreach cut time in half to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here's what we found:
DOMNodeList's item() accessor is linear in the index, which makes a loop over the whole list quadratic overall. So, removing the first element after each iteration makes the loop faster: we then always access the first element of the list. This brought me down to 8 seconds.
After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:
$table = $tables->item(0);
while ($table != NULL) {
$table = $tables->item(0);
if ($table === NULL) {
break;
}
//
// We've found a section table, parse it.
//
if (elementIsACourseSectionTable($table)) {
$course = addSectionsToCourseUsingTable($course, $table);
}
//
// Skip the last table if it's not a course section
//
else if(elementIsCourseHeaderTable($table)){
$course = courseFromTable($table);
$courses[] = $course;
}
//
// Remove the first item from the list
//
$first = $tables->item(0);
$first->parentNode->removeChild($first);
//
// Get the next table to parse
//
$table = $tables->item(0);
}
Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.
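A non-destructive alternative (a sketch along the lines of the original for loop; whether it is faster depends on the PHP version) is to snapshot the live DOMNodeList into a plain array first, so no item() calls and no node removals are needed while iterating:
// copy the live node list into a plain array once
$tables = iterator_to_array($dom->getElementsByTagName('table'));
$courses = array();
foreach ($tables as $i => $sectionTable) {
    if ($i === 0) {
        continue;                                    // a section table needs a preceding course table
    }
    if (elementIsACourseSectionTable($sectionTable)) {
        $course = courseFromTable($tables[$i - 1]);  // the course table just before it
        $course = addSectionsToCourseUsingTable($course, $sectionTable);
        $courses[] = $course;
    }
}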
