My script is a spider that checks if a page is a "links page" or is a "information page".
if the page is a "links page" then it continue in a recursive manner (or a tree if you will)
until it finds the "information page".
I tried to make the script recursive and it was easy but i kept getting the error:
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to
allocate 39 bytes) in /srv/www/loko/simple_html_dom.php on line 1316
I was told i would have to use the for loop method because no matter if i use the unset() function the script won't free memory and i only have three levels i need to loop through so it makes sense. But after i changed the script the error occurs again, but maybe i can free
memory now?
Something needs to die here, please help me destruct someone!
set_time_limit(0);
ini_set('memory_limit', '256M');
require("simple_html_dom.php");
$thelink = "http://www.somelink.com";
$html1 = file_get_html($thelink);
$ret1 = $html1->find('#idTabResults2');
// first inception level, we know page has only links
if (!$ret1){
$es1 = $html1->find('table.litab a');
//unset($html1);
$countlinks1 = 0;
foreach ($es1 as $aa1) {
$links1[$countlinks1] = $aa1->href;
$countlinks1++;
}
//unset($es1);
//for every link in array do the same
for ($i = 0; $i < $countlinks1; $i++) {
$html2 = file_get_html($links1[$i]);
$ret2 = $html2->find('#idTabResults2');
// if got information then send to DB
if ($ret2){
pullInfo($html2);
//unset($html2);
} else {
// continue inception
$es2 = $html2->find('table.litab a');
$html2 = null;
$countlinks2 = 0;
foreach ($es2 as $aa2) {
$links2[$countlinks2] = $aa2->href;
$countlinks2++;
}
//unset($es2);
for ($j = 0; $j < $countlinks2; $j++) {
$html3 = file_get_html($links2[$j]);
$ret3 = $html3->find('#idTabResults2');
// if got information then send to DB
if ($ret3){
pullInfo($html3);
} else {
// inception level three
$es3 = $html3->find('table.litab a');
$html3 = null;
$countlinks3 = 0;
foreach ($es3 as $aa3) {
$links3[$countlinks3] = $aa3->href;
$countlinks3++;
}
for ($k = 0; $k < $countlinks3; $k++) {
echo memory_get_usage() ;
echo "\n";
$html4 = file_get_html($links3[$k]);
$ret4 = $html4->find('#idTabResults2');
// if got information then send to DB
if ($ret4){
pullInfo($html4);
}
unset($html4);
}
unset($html3);
}
}
}
}
}
function pullInfo($html)
{
$tds = $html->find('td');
$count =0;
foreach ($tds as $td) {
$count++;
if ($count==1){
$name = html_entity_decode($td->innertext);
}
if ($count==2){
$address = addslashes(html_entity_decode($td->innertext));
}
if ($count==3){
$number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
}
}
unset($tds, $td);
$name = mysql_real_escape_string($name);
$address = mysql_real_escape_string($address);
$number = mysql_real_escape_string($number);
$inAlready=mysql_query("SELECT * FROM people WHERE phone=$number");
while($e=mysql_fetch_assoc($inAlready))
$output[]=$e;
if (json_encode($output) != "null"){
//print(json_encode($output));
} else {
mysql_query("INSERT INTO people (name, area, phone)
VALUES ('$name', '$address', '$number')");
}
}
And here is a picture of the growth in memory size:
I modified the code a little bit to free as much memory as I see could be freed.
I've added a comment above each modification. The added comments start with "#" so you could find them easier.
This is not related to this question, but worth mentioning that your database insertion code is vulnerable to SQL injection.
<?php
require("simple_html_dom.php");
$thelink = "http://www.somelink.co.uk";
# do not keep raw contents of the file on memory
#$data1 = file_get_contents($thelink);
#$html1 = str_get_html($data1);
$html1 = str_get_html(file_get_contents($thelink));
$ret1 = $html1->find('#idResults2');
// first inception level, we know page has only links
if (!$ret1){
$es1 = $html1->find('table.litab a');
# free $html1, not used anymore
unset($html1);
$countlinks1 = 0;
foreach ($es1 as $aa1) {
$links1[$countlinks1] = $aa1->href;
$countlinks1++;
// echo (addslashes($aa->href));
}
# free memroy used by the $es1 value, not used anymore
unset($es1);
//for every link in array do the same
for ($i = 0; $i <= $countlinks1; $i++) {
# do not keep raw contents of the file on memory
#$data2 = file_get_contents($links1[$i]);
#$html2 = str_get_html($data2);
$html2 = str_get_html(file_get_contents($links1[$i]));
$ret2 = $html2->find('#idResults2');
// if got information then send to DB
if ($ret2){
pullInfo($html2);
} else {
// continue inception
$es2 = $html2->find('table.litab a');
# free memory used by $html2, not used anymore.
# we would unset it at the end of the loop.
$html2 = null;
$countlinks2 = 0;
foreach ($es2 as $aa2) {
$links2[$countlinks2] = $aa2->href;
$countlinks2++;
}
# free memory used by $es2
unest($es2);
for ($j = 0; $j <= $countlinks2; $j++) {
# do not keep raw contents of the file on memory
#$data3 = file_get_contents($links2[$j]);
#$html3 = str_get_html($data3);
$html3 = str_get_html(file_get_contents($links2[$j]));
$ret3 = $html3->find('#idResults2');
// if got information then send to DB
if ($ret3){
pullInfo($html3);
}
# free memory used by $html3 or on last iteration the memeory would net get free
unset($html3);
}
}
# free memory used by $html2 or on last iteration the memeory would net get free
unset($html2);
}
}
function pullInfo($html)
{
$tds = $html->find('td');
$count =0;
foreach ($tds as $td) {
$count++;
if ($count==1){
$name = addslashes($td->innertext);
}
if ($count==2){
$address = addslashes($td->innertext);
}
if ($count==3){
$number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
}
}
# check for available data:
if ($count) {
# free $tds and $td
unset($tds, $td);
mysql_query("INSERT INTO people (name, area, phone)
VALUES ('$name', '$address', '$number')");
}
}
Update:
You could trace your memory usage to see how much memory is being used in each section of your code. this could be done by using the memory_get_usage() calls, and saving the result to some file. like placing this below code in the end of each of your loops, or before creating objects, calling heavy methods:
file_put_contents('memory.log', 'memory used in line ' . __LINE__ . ' is: ' . memory_get_usage() . PHP_EOL, FILE_APPEND);
So you could trace the memory usage of each part of your code.
In the end remember all this tracing and optimization might not be enough, since your application might really need more memory than 32 MB. I'v developed a system that analyzes several data sources and detects spammers, and then blocks their SMTP connections and since sometimes the number of connected users are over 30000, after a lot of code optimization, I had to increase the PHP memory limit to 768 MB on the server, Which is not a common thing to do.
If your operation requires memory and your server has more memory available, you can call ini_set('memory_limit', '128M'); or something similar (depending your memory requirement) to increase the amount of memory available to the script.
This does not mean you should not optimise and refactor your code :-) this is just one part.
The solution was to use the clear method such as:
$html4->clear(); a simple_html_dom method to clear memory When you are finished with the DOM element.
If you want to learn more, enter this website.
Firstly, let's turn this into a truly recursive function, should make it easier to modify the whole chain of events that way:
function findInfo($thelink)
{
$data = file_get_contents($thelink); //Might want to make sure that it's a valid link, i.e. that file get contents actually returned stuff, before trying to run further with it.
$html = str_get_html($data);
unset($data); //Finished using it, no reason to keep it around.
$ret = $html->find('#idResults2');
if($ret)
{
pullInfo($html);
return true; //Should stop once it finds it right?
}
else
{
$es = $html->find('table.litab a'); //Might want a little error checking here to make sure it actually found links.
unset($html); //Finished using it, no reason to keep it around
$countlinks = 0;
foreach($es as $aa)
{
$links[$countlinks] = $aa->href;
$countlinks++;
}
unset($es); //Finished using it, no reason to keep it around.
for($i = 0; $i <= $countlinks; $i++)
{
$result = findInfo($links[$i]);
if($result === true)
{
return true; //To break out of above recursive functions if lower functions return true
}
else
{
unset($links[$i]); //Finished using it, no reason to keep it around.
continue;
}
}
}
return false; //Will return false if all else failed, should hit a return true before this point if it successfully finds an info page.
}
See if that helps at all with the cleanups. Probably still run out of memory, but you shouldn't be holding onto the full html of each webpage scanned and what not with this.
Oh, and if you only want it to go only so deep, change the function declaration to something like:
function findInfo($thelink, $depth = 1, $maxdepth = 3)
Then when calling the function within the function, call it like so:
findInfo($html, $depth + 1, $maxdepth); //you include maxdepth so you can override it in the initial function call, like findInfo($thelink,,4)
and then do a check on depth vs. maxdepth at the start of the function and have it return false if depth is > than maxdepth.
If memory usage is your primary concern, you may want to consider using a SAX-based parser. Coding using a SAX parser can be a bit more complicated, but it's not necessary to keep the entire document in memory.
Related
I'm having an issue with a memory leak in this code. What I'm attempting to do is to temporarily upload a rather large CSV file (at least 12k records), and check each record for a partial duplication against other records in the CSV file. The reason why I say "partial duplication" is because basically if most of the record matches (at least 30 fields), it is going to be a duplicate record. The code I've written should, in theory, work as intended, but of course, it's a rather large loop and is exhausting memory. This is happening on the line that contains "array_intersect".
This is not for something I'm getting paid to do, but it is with the purpose of helping make life at work easier. I'm a data entry employee, and we are having to look at duplicate entries manually right now, which is asinine, so I'm trying to help out by making a small program for this.
Thank you so much in advance!
if (isset($_POST["submit"])) {
if (isset($_FILES["sheetupload"])) {
$fh = fopen(basename($_FILES["sheetupload"]["name"]), "r+");
$lines = array();
$records = array();
$counter = 0;
while(($row = fgetcsv($fh, 8192)) !== FALSE ) {
$lines[] = $row;
}
foreach ($lines as $line) {
if(!in_array($line, $records)){
if (count($records) > 0) {
//check array against records for dupes
foreach ($records as $record) {
if (count(array_intersect($line, $record)) > 30) {
$dupes[] = $line;
$counter++;
}
else {
$records[] = $line;
}
}
}
else {
$records[] = $line;
}
}
else {
$counter++;
}
}
if ($counter < 1) {
echo $counter." duplicate records found. New file not created.";
}
else {
echo $counter." duplicate records found. New file created as NEWSHEET.csv.";
$fp = fopen('NEWSHEET.csv', 'w');
foreach ($records as $line) {
fputcsv($fp, $line);
}
}
}
}
A couple of possibilities, assuming the script is reaching the memory limit or timing out. If you can access the php.ini file, try increasing the memory_limit and the max_execution_time.
If you can't access the server settings, try adding these to the top of your script:
ini_set('memory_limit','256M'); // change this number as necessary
set_time_limit(0); // so script does not time out
If altering these settings in the script is not possible, you might try using unset() in a few spots to free up memory:
// after the first while loop
unset($fh, $row);
and
//at end of each foreach loop
unset($line);
Good morning,
I´m actually going through some hard lessons while trying to handle huge csv files up to 4GB.
Goal is to search some items in a csv file (Amazon datafeed) by a given browsenode and also by some given item id´s (ASIN). To get a mix of existing items (in my database) plus some additional new itmes since from time to time items disapear on the marketplace. I also filter the title of the items because there are many items using the same.
I have been reading here lots af tips and finally decided to use php´s fgetcsv() and thought this function will not exhaust memory, since it reads the file line by line.
But no matter what I try I´m always running out of memory.
I can not understand why my code uses so much memory.
I set the memory limit to 4096MB, time limit is 0. Server has 64 GB Ram and two SSD hardisks.
May someone please check out my piece of code and explain how it is possible that im running out of memory and more important how memory is used?
private function performSearchByASINs()
{
$found = 0;
$needed = 0;
$minimum = 84;
if(is_array($this->searchASINs) && !empty($this->searchASINs))
{
$needed = count($this->searchASINs);
}
if($this->searchFeed == NULL || $this->searchFeed == '')
{
return false;
}
$csv = fopen($this->searchFeed, 'r');
if($csv)
{
$l = 0;
$title_array = array();
while(($line = fgetcsv($csv, 0, ',', '"')) !== false)
{
$header = array();
if(trim($line[6]) != '')
{
if($l == 0)
{
$header = $line;
}
else
{
$asin = $line[0];
$title = $this->prepTitleDesc($line[6]);
if(is_array($this->searchASINs)
&& !empty($this->searchASINs)
&& in_array($asin, $this->searchASINs)) //search for existing items to get them updated
{
$add = true;
if(in_array($title, $title_array))
{
$add = false;
}
if($add === true)
{
$this->itemsByASIN[$asin] = new stdClass();
foreach($header as $k => $key)
{
if(isset($line[$k]))
{
$this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
}
}
$title_array[] = $title;
$found++;
}
}
if(($line[20] == $this->bnid || $line[21] == $this->bnid)
&& count($this->itemsByKey) < $minimum
&& !isset($this->itemsByASIN[$asin])) // searching for new items
{
$add = true;
if(in_array($title, $title_array))
{
$add = false;
}
if($add === true)
{
$this->itemsByKey[$asin] = new stdClass();
foreach($header as $k => $key)
{
if(isset($line[$k]))
{
$this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
}
}
$title_array[] = $title;
$found++;
}
}
}
$l++;
if($l > 200000 || $found == $minimum)
{
break;
}
}
}
fclose($csv);
}
}
I know my answer is a bit late but I had a similar problem with fgets() and things based on fgets() like SplFileObject->current() function. In my case it was on a windows system when trying to read a +800MB file. I think fgets() doesn't free the memory of the previous line in a loop. So every line that was read stayed in memory and let to a fatal out of memory error. I fixed it using fread($lineLength) instead but it is a bit trickier since you must supply the length.
It is very hard to manage large data using array without encountering timeout issue. Instead why not parse this datafeed to a database table and do the heavy lifting from there.
Have you tried this? SplFileObject::fgetcsv
<?php
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
//your code here
}
?>
You are running out of memory because you use variables, and you are never doing an unset(); and use too many nested foreach. You could shrink that code in more functions
A solution should be, use a real Database instead.
I want to realize algorithm of Aho-Corasick. I've made trie and it works, I've done deleting from trie, insert and search. But when I've tried to insert my test data (~56k words), PHP has thrown an error:
php reached memory limit on one script (128 mb).
So, I think the problem is in my incorrect using references. Here is part of my code, where I get an error:
public function insert($key, $value) {
$current_node = &$this->root; // setting current node as root node
for ($i = 0; $i < mb_strlen($key, 'UTF-8'); $i++) {
$char = mb_substr($key, $i, 1, 'UTF-8');
$parent = &$current_node; // setting parent node
if (isset($current_node['children'][(string)$char])) {
$current_node = &$current_node['children'][(string)$char];
if (isset($current_node['isLeaf']))
unset($current_node['isLeaf']);
} else {
$current_node['children'][(string)$char] = [];
$current_node = &$current_node['children'][(string)$char];
}
$current_node['parent'] = &$parent;
if ($i == (mb_strlen($key, 'UTF-8') - 1)) {
$current_node['value'] = $value;
if (!isset($current_node['children'])) {
$current_node['isLeaf'] = true;
}
}
}
}
I think the problem is that I try to store all massive by reference, not just address in C/C++ style. So, can I solve this problem in C/C++ style? Has php instrument for it? What if I will write php-extension in C and then just add it to my php-interpretator?
Could anyone help me.
I need to return multiple img's, but with this code, only one of two is returning.
What is the solution.
Thank you in advance.
$test = "/claim/img/box.png, /claim/img/box.png";
function test($test)
{
$photo = explode(',', $test);
for ($i = 0; $i < count($photo); $i++)
{
$returnas = "<img src=".$photo[$i].">";
return $returnas;
}
}
This might be a good opportunity to learn about array_map.
function test($test) {
return implode("",array_map(function($img) {
return "<img src='".trim($img)."' />";
},explode(",",$test)));
}
Many functions make writing code a lot simpler, and it's also faster because it uses lower-level code.
While we're on the subject of learning things, PHP 5.5 gives us generators. You could potentially use one here. For example:
function test($test) {
$pieces = explode(",",$test);
foreach($pieces as $img) {
yield "<img src='".trim($img)."' />";
}
}
That yield is where the magic happens. This makes your function behave like a generator. You can then do this:
$images = test($test);
foreach($images as $image) echo $image;
Personally, I think this generator solution is a lot cleaner than the array_map one I gave earlier, which in turn is tidier than manually iterating.
Modify your code that way
function test($test)
{
$returnas = '';
$photo = explode(',', $test);
for ($i = 0; $i < count($photo); $i++)
{
$returnas .= "<img src=".$photo[$i].">";
}
return $returnas;
}
Your code didn't work since you were returning inside the loop immediatly. Every programming language support "only a return for call". In my solution you're appendig a string that has an img tag each time you enter the loop and return it after every photo is "passed" into the loop
You could even use the foreach() construct, of course
Bonus answer
If you don't know the difference between ...
for ($i = 0; $i < count($photo); $i++)
and
for ($i = 0, $count = count($photo); $i < $<; $i++)
Well, in first case you'll evaluate count($photo) every single time the for is called whereas the second time, it is evaluated only once.
This could be used for optimization porpuses (even if php, internally, stores the length of an array so it is accesible in O(1))
The function breaks after the first return statement. You need to save what you want to return in some structure, an array eg, and return this.
function test($test)
{
$result = array();
$photo = explode(',', $test);
for ($i = 0; $i < count($photo); $i++)
{
$returnas = "<img src=".$photo[$i].">";
$result[] = $returnas;
}
return $result;
}
I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.
The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables, where one or more type B follow a type A.
I've profiled my script using microtome(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
First, I include two files. One handles some parsing, and the other defines two "data structure" classes.
// Imports
include('./course.php');
include('./utils.php');
Includes are inconsequential as far as I know, and so let's proceed to the cURL import.
// Execute cURL
$response = curl_exec($curl_handle);
I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.
// Run about 25 str_replace calls here, to clean up
// then run tidy.
$html = $response;
//
// Prepare some config for tidy
//
$config = array(
'indent' => true,
'output-xhtml' => true,
'wrap' => 200);
//
// Tidy up the HTML
//
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = $tidy;
Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)
// Get all of the tables in the page
$tables = $dom->getElementsByTagName('table');
// Create a buffer for the courses
$courses = array();
// Iterate
$numberOfTables = $tables->length;
for ($i=1; $i <$numberOfTables ; $i++) {
$sectionTable = $tables->item($i);
$courseTable = $tables->item($i-1);
// We've found a course table, parse it.
if (elementIsACourseSectionTable($sectionTable)) {
$course = courseFromTable($courseTable);
$course = addSectionsToCourseUsingTable($course, $sectionTable);
$courses[] = $course;
}
}
For reference, here's the utility functions that I call:
//
// Tell us if a given element is
// a course section table.
//
function elementIsACourseSectionTable(DOMElement $element){
$tableHasClass = $element->hasAttribute('class');
$tableIsCourseTable = $element->getAttribute("class") == "coursetable";
return $tableHasClass && $tableIsCourseTable;
}
//
// Takes a table and parses it into an
// instance of the Course class.
//
function courseFromTable(DOMElement $table){
$secondRow = $table->getElementsByTagName('tr')->item(1);
$cells = $secondRow->getElementsByTagName('td');
$course = new Course;
$course->startDate = valueForElementInList(0, $cells);
$course->endDate = valueForElementInList(1, $cells);
$course->name = valueForElementInList(2, $cells);
$course->description = valueForElementInList(3, $cells);
$course->credits = valueForElementInList(4, $cells);
$course->hours = valueForElementInList(5, $cells);
$course->division = valueForElementInList(6, $cells);
$course->subject = valueForElementInList(7, $cells);
return $course;
}
//
// Takes a table and parses it into an
// instance of the Section class.
//
function sectionFromRow(DOMElement $row){
$cells = $row->getElementsByTagName('td');
//
// Skip any row with a single cell
//
if ($cells->length == 1) {
# code...
return NULL;
}
//
// Skip header rows
//
if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
return NULL;
}
$section = new Section;
$section->section = valueForElementInList(0, $cells);
$section->code = valueForElementInList(1, $cells);
$section->openSeats = valueForElementInList(2, $cells);
$section->dayAndTime = valueForElementInList(3, $cells);
$section->instructor = valueForElementInList(4, $cells);
$section->buildingAndRoom = valueForElementInList(5, $cells);
$section->isOnline = valueForElementInList(6, $cells);
return $section;
}
//
// Take a table containing course sections
// and parse it put the results into a
// give course object.
//
function addSectionsToCourseUsingTable(Course $course, DOMElement $table){
$rows = $table->getElementsByTagName('tr');
$numRows = $rows->length;
for ($i=0; $i < $numRows; $i++) {
$section = sectionFromRow($rows->item($i));
// Make sure we have an array to put sections into
if (is_null($course->sections)) {
$course->sections = array();
}
// Skip "meta" rows, since they're not really sections
if (is_null($section)) {
continue;
}
$course->addSection($section);
}
return $course;
}
//
// Returns the text from a cell
// with a
//
function valueForElementInList($index, $list){
$value = $list->item($index)->nodeValue;
$value = trim($value);
return $value;
}
This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!
I've been advised to split up the workload of my main work loop, but considering the homogenous nature of my data, I'm not entirely sure how. Any suggestions on improving this code are greatly appreciated.
What can I do to improve my code execution time?
It turns out that my loop is terribly inefficient.
Using a foreach cut time in half to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here's what we found:
Using DOMNodeList's item() accessor is linear, producing exponentially slow processing times in loops. So, removing the first element after each iteration makes the loop faster. Now, we always access the first element of the list. This brought me down to 8 seconds.
After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:
$table = $tables->item(0);
while ($table != NULL) {
$table = $tables->item(0);
if ($table === NULL) {
break;
}
//
// We've found a section table, parse it.
//
if (elementIsACourseSectionTable($table)) {
$course = addSectionsToCourseUsingTable($course, $table);
}
//
// Skip the last table if it's not a course section
//
else if(elementIsCourseHeaderTable($table)){
$course = courseFromTable($table);
$courses[] = $course;
}
//
// Remove the first item from the list
//
$first = $tables->item(0);
$first->parentNode->removeChild($first);
//
// Get the next table to parse
//
$table = $tables->item(0);
}
Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.