I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.
The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables, where one or more tables of type B follow a table of type A.
I've profiled my script using microtime(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
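Each stage was bracketed something like this (a simplified sketch, not my exact code):
$start = microtime(true);
// ... one stage of the script, e.g. the Tidy pass ...
$elapsed = microtime(true) - $start;
echo "Stage took " . round($elapsed, 3) . "s\n";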
First, I include two files. One handles some parsing, and the other defines two "data structure" classes.
// Imports
include('./course.php');
include('./utils.php');
The includes are inconsequential as far as I know, so let's proceed to the cURL import.
// Execute cURL
$response = curl_exec($curl_handle);
I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.
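The setup looks roughly like this (a sketch; the URL and POST fields below are placeholders, not my real values):
// Sketch of the cURL configuration; the URL and POST data are placeholders
$curl_handle = curl_init('http://example.com/schedule');
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true); // return the page as a string
curl_setopt($curl_handle, CURLOPT_TIMEOUT, 0);           // never time out
curl_setopt($curl_handle, CURLOPT_POST, true);
curl_setopt($curl_handle, CURLOPT_POSTFIELDS, $postFields); // the required header/post data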
// Run about 25 str_replace calls here, to clean up
// then run tidy.
$html = $response;

//
// Prepare some config for tidy
//
$config = array(
    'indent' => true,
    'output-xhtml' => true,
    'wrap' => 200);

//
// Tidy up the HTML
//
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = $tidy;
Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)
// Load the tidied markup into a DOMDocument
// (this step was elided above; the tidy object casts to the repaired markup)
$dom = new DOMDocument;
@$dom->loadHTML((string) $html);

// Get all of the tables in the page
$tables = $dom->getElementsByTagName('table');

// Create a buffer for the courses
$courses = array();

// Iterate
$numberOfTables = $tables->length;
for ($i = 1; $i < $numberOfTables; $i++) {
    $sectionTable = $tables->item($i);
    $courseTable = $tables->item($i - 1);
    // We've found a course table, parse it.
    if (elementIsACourseSectionTable($sectionTable)) {
        $course = courseFromTable($courseTable);
        $course = addSectionsToCourseUsingTable($course, $sectionTable);
        $courses[] = $course;
    }
}
For reference, here are the utility functions that I call:
//
// Tell us if a given element is
// a course section table.
//
function elementIsACourseSectionTable(DOMElement $element) {
    $tableHasClass = $element->hasAttribute('class');
    $tableIsCourseTable = $element->getAttribute("class") == "coursetable";
    return $tableHasClass && $tableIsCourseTable;
}
//
// Takes a table and parses it into an
// instance of the Course class.
//
function courseFromTable(DOMElement $table) {
    $secondRow = $table->getElementsByTagName('tr')->item(1);
    $cells = $secondRow->getElementsByTagName('td');
    $course = new Course;
    $course->startDate = valueForElementInList(0, $cells);
    $course->endDate = valueForElementInList(1, $cells);
    $course->name = valueForElementInList(2, $cells);
    $course->description = valueForElementInList(3, $cells);
    $course->credits = valueForElementInList(4, $cells);
    $course->hours = valueForElementInList(5, $cells);
    $course->division = valueForElementInList(6, $cells);
    $course->subject = valueForElementInList(7, $cells);
    return $course;
}
//
// Takes a table row and parses it into an
// instance of the Section class.
//
function sectionFromRow(DOMElement $row) {
    $cells = $row->getElementsByTagName('td');
    //
    // Skip any row with a single cell
    //
    if ($cells->length == 1) {
        return NULL;
    }
    //
    // Skip header rows
    //
    if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
        return NULL;
    }
    $section = new Section;
    $section->section = valueForElementInList(0, $cells);
    $section->code = valueForElementInList(1, $cells);
    $section->openSeats = valueForElementInList(2, $cells);
    $section->dayAndTime = valueForElementInList(3, $cells);
    $section->instructor = valueForElementInList(4, $cells);
    $section->buildingAndRoom = valueForElementInList(5, $cells);
    $section->isOnline = valueForElementInList(6, $cells);
    return $section;
}
//
// Takes a table containing course sections
// and puts the parsed results into a
// given Course object.
//
function addSectionsToCourseUsingTable(Course $course, DOMElement $table) {
    $rows = $table->getElementsByTagName('tr');
    $numRows = $rows->length;
    for ($i = 0; $i < $numRows; $i++) {
        $section = sectionFromRow($rows->item($i));
        // Make sure we have an array to put sections into
        if (is_null($course->sections)) {
            $course->sections = array();
        }
        // Skip "meta" rows, since they're not really sections
        if (is_null($section)) {
            continue;
        }
        $course->addSection($section);
    }
    return $course;
}
//
// Returns the trimmed text content of the
// cell at the given index in a node list.
//
function valueForElementInList($index, $list) {
    $value = $list->item($index)->nodeValue;
    $value = trim($value);
    return $value;
}
This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!
I've been advised to split up the workload of my main loop, but considering the homogeneous nature of my data, I'm not entirely sure how. Any suggestions for improving this code are greatly appreciated.
What can I do to improve my code execution time?
It turns out that my loop was terribly inefficient.
Using a foreach cut the time in half, to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers I know how to poke online. Here's what we found:
DOMNodeList's item() accessor is linear in its index, which makes loops built on it quadratic overall. So, removing the first element after each iteration makes the loop faster: we then always access the first element of the list. This brought me down to 8 seconds.
After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:
$table = $tables->item(0);
while ($table != NULL) {
    //
    // We've found a section table, parse it.
    //
    if (elementIsACourseSectionTable($table)) {
        $course = addSectionsToCourseUsingTable($course, $table);
    }
    //
    // Otherwise, if it's a course header table,
    // start a new course from it.
    //
    else if (elementIsCourseHeaderTable($table)) {
        $course = courseFromTable($table);
        $courses[] = $course;
    }
    //
    // Remove the first item from the list
    //
    $first = $tables->item(0);
    $first->parentNode->removeChild($first);
    //
    // Get the next table to parse
    //
    $table = $tables->item(0);
}
Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.
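For comparison, another way to sidestep item() entirely (a sketch I haven't benchmarked) is to copy the live DOMNodeList into a plain PHP array once and iterate that:
// Copy the live node list into a plain array once; array iteration
// doesn't pay item()'s per-call traversal cost.
$tableArray = iterator_to_array($dom->getElementsByTagName('table'));
foreach ($tableArray as $table) {
    // ... same course/section checks as above ...
}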
Related
I'm running the code below over a set of 25,000 results. I need to optimize it because I'm hitting the memory limit.
$oldproducts = Oldproduct::model()->findAll(); /* (here I have 25,000 results) */
foreach ($oldproducts as $oldproduct) :
    $criteria = new CDbCriteria;
    $criteria->compare('`someid`', $oldproduct->someid);
    $finds = Newproduct::model()->findAll($criteria);
    if (empty($finds)) {
        $new = new Newproduct;
        $new->someid = $oldproduct->someid;
        $new->save();
    } else {
        foreach ($finds as $find) :
            if ($find->price != $oldproduct->price) {
                $find->attributes = array('price' => $oldproduct->price);
                $find->save();
            }
        endforeach;
    }
endforeach;
The code compares the rows of two tables by someid. If it finds a match, it updates the price column; if not, it creates a new record.
Use CDataProviderIterator which:
... allows iteration over large data sets without holding the entire set in memory.
You first have to pass a CDataProvider instance to it:
$dataProvider = new CActiveDataProvider("Oldproduct");
$iterator = new CDataProviderIterator($dataProvider);
foreach ($iterator as $item) {
    // do stuff
}
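Applied to the loop in the question, that would look something like this (the inner compare/update logic is unchanged):
$dataProvider = new CActiveDataProvider("Oldproduct");
$iterator = new CDataProviderIterator($dataProvider);
foreach ($iterator as $oldproduct) {
    $criteria = new CDbCriteria;
    $criteria->compare('`someid`', $oldproduct->someid);
    $finds = Newproduct::model()->findAll($criteria);
    // ... same insert-or-update logic as in the question ...
}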
You could process the rows in chunks of ~5,000 instead of getting all the rows in one go!
$cnt = 5000;
$offset = 0;
do {
    $oldproducts = Oldproduct::model()->limit($cnt)->offset($offset)->findAll(); /* chunk of up to 5,000 rows */
    foreach ($oldproducts as $oldproduct) {
        // your code
    }
    $offset += $cnt;
} while (count($oldproducts) == $cnt);
I have the tree's node connections in the form of a linked list, plus the ID of the root node. I need to number the nodes so that any node on a lower level gets a higher number than any node on a higher level. Numbering starts from the ID value of the root node and increments by 1 for each subsequent node. The order of nodes within the same level is not important. What algorithms and data structures can I use to solve this problem before I start reinventing the wheel, and which pitfalls should I avoid? The language used will be pure PHP, and the data comes from a MySQL DB, but any solution such as pseudo-code or a plain explanation is welcome.
Edit:
So far I've come up with this (thanks Beta for helping me out):
<?php
$data = array(
array(551285, 551286),
array(551286, 551290),
array(551286, 551297),
array(551288, 551432),
array(551289, 552149),
array(551290, 551292),
array(551292, 551294),
array(551296, 551355),
array(551296, 552245),
array(551297, 551299),
array(551299, 551301),
array(551299, 551304),
array(551304, 551306),
array(551306, 551307),
array(551307, 551308),
array(551308, 551309),
array(551309, 551312),
array(551311, 551328),
array(551312, 551313),
array(551313, 551315),
array(551315, 551316),
array(551316, 551317),
array(551286, 551288),
array(551286, 551289),
array(551286, 551320),
array(551290, 551322),
array(551292, 551324),
array(551294, 551296),
array(551294, 551326),
array(551297, 551342),
array(551299, 551344),
array(551301, 551303),
array(551304, 551346),
array(551307, 551349),
array(551309, 551311),
array(551309, 551353),
array(551313, 551357),
array(551317, 552094),
array(551286, 551287),
array(551290, 551291),
array(551292, 551293),
array(551294, 551295),
array(551297, 551298),
array(551299, 551300),
array(551301, 551302),
array(551304, 551305),
array(551309, 551310),
array(551313, 551314)
);
var_dump(numerateTreeNodes($data, 551285));
function numerateTreeNodes($linked_nodes, $root_node)
{
    $numbered_nodes = array();
    $queue = new SplQueue();
    $queue->enqueue($root_node);
    $counter = $root_node;
    $children_grouped = groupDirectChildren($linked_nodes);
    while ($queue->count()) {
        $t = $queue->dequeue();
        $numbered_nodes[$counter++] = $t;
        if (isset($children_grouped[$t])) {
            foreach ($children_grouped[$t] as $t_child) {
                $queue->enqueue($t_child);
            }
        }
    }
    return $numbered_nodes;
}

function groupDirectChildren($nodes)
{
    $grouped = array();
    foreach ($nodes as $n) {
        $grouped[$n[0]][] = $n[1];
    }
    return $grouped;
}
Any suggestions/corrections?
TL;DR
I have this data: var_export and print_r.
And I need to narrow it down to: http://pastebin.com/EqwgpgAP ($data['Stock Information:'][0][0]);
How would one achieve it? (dynamically)
I'm working with the vTiger 5.4.0 CRM and am looking to implement a function that returns particular field information based on search criteria.
Well, vTiger is a pretty weakly written system; it looks and feels old, and everything comes out of hundreds of tables with multiple joins (that's actually not that bad), etc., but a job is a job.
The need arose from getting the usageunit picklist from the Products module, Stock Information block.
Since there is no such function as getField();, I am looking to filter it out of the blocks data, which also gathers information about the fields.
getBlocks(); calls something close to getFields();, which in turn calls something close to getValues();, and so on.
So...
$focus = new $currentModule(); // Products
$displayView = getView($focus->mode);
$productsBlocks = getBlocks($currentModule, $displayView, $focus->mode, $focus->column_fields); // in theory, $focus->column_fields should/could be narrowed down to my specific field, but vTiger doesn't work that way
echo "<pre>"; print_r($productsBlocks); echo "</pre>"; // = http://pastebin.com/3iTDUUgw (huge dump)
As you can see, the array under the key [Stock Information:] (which actually comes from translations, yada yada...) contains the information for usageunit under [0][0].
Now, I was trying to array_filter(); the data out of there, but the only thing I've managed to get is $productsBlocks stripped down to contain only [Stock Information:] with all its data:
$getUsageUnit = function($value) use (&$getUsageUnit) {
    if (is_array($value)) return array_filter($value, $getUsageUnit);
    if ($value == 'usageunit') return true;
};
$productsUsageUnit = array_filter($productsBlocks, $getUsageUnit);
echo "<pre>"; print_r($productsUsageUnit); echo "</pre>"; // = http://pastebin.com/LU6VRC4h (not that huge of a dump)
And the result I'm looking for is http://pastebin.com/EqwgpgAP, which I got manually with print_r($productsUsageUnit['Stock Information:'][0][0]);.
How do I achieve this? (dynamically...)
function helper($data, $query) {
    $result = array();
    $search = function ($data, &$stack) use (&$search, $query) {
        foreach ($data as $entry) {
            if (is_array($entry) && $search($entry, $stack) || $entry === $query) {
                $stack[] = $entry;
                return true;
            }
        }
        return false;
    };
    foreach ($data as $sub) {
        $parentStack = array();
        if ($search($sub, $parentStack)) {
            $result[] = $parentStack[sizeof($parentStack) - 2];
        }
    }
    return $result;
}
$node = helper($data, 'usageunit');
print_r($node);
I have the following data in a plain text file:
1. Value
Location : Value
Owner: Value
Architect: Value
2. Value
Location : Value
Owner: Value
Architect: Value
... up to 200+ ...
The numbering and the word Value change for each segment.
Now I need to insert this data into a MySQL database.
Do you have a suggestion on how I can traverse and scrape it, so that I get the value of the text beside the number and the values of "Location", "Owner", and "Architect"?
It seems hard to do with a DOM scraping class, since there are no HTML tags present.
If the data is consistently structured, you can use fscanf to scan it from the file.
/* Notice the newlines at the end! */
$format = <<<FORMAT
%d. %s
Location : %s
Owner: %s
Architect: %s

FORMAT;
$file = fopen('file.txt', 'r');
while ($data = fscanf($file, $format)) {
    list($number, $title, $location, $owner, $architect) = $data;
    // Insert the data into the database here
}
fclose($file);
More about fscanf in the docs.
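One caveat: %s stops at the first whitespace, so if a value can contain spaces you'll want a character class instead, something like:
/* %[^\n] reads everything up to the end of the line */
$format = <<<FORMAT
%d. %[^\n]
Location : %[^\n]
Owner: %[^\n]
Architect: %[^\n]

FORMAT;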
If every block has the same structure, you could do this with the file() function: http://nl.php.net/manual/en/function.file.php
$data = file('path/to/file.txt');
With this every row is an item in the array, and you could loop through it.
for ($i = 0; $i < count($data); $i += 5) {
    $valuerow = $data[$i];
    $locationrow = $data[$i+1];
    $ownerrow = $data[$i+2];
    $architectrow = $data[$i+3];
    // strip the data you don't want here (as sketched below), and insert it into the database.
}
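The stripping could be as simple as this (a sketch; it assumes every labelled line contains a colon):
for ($i = 0; $i < count($data); $i += 5) {
    // "1. Value" -> drop the leading number
    $title = trim(preg_replace('/^\d+\.\s*/', '', $data[$i]));
    // "Location : Value" -> keep what follows the colon
    $location  = trim(substr($data[$i+1], strpos($data[$i+1], ':') + 1));
    $owner     = trim(substr($data[$i+2], strpos($data[$i+2], ':') + 1));
    $architect = trim(substr($data[$i+3], strpos($data[$i+3], ':') + 1));
    // insert into the database here
}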
This can be done with a very simple stateful line-oriented parser. On every line, you accumulate parsed data into an array(). When something tells you that you're on a new record, you dump what you've parsed and start over.
Line-oriented parsers have a great property: they require little memory, and, most importantly, constant memory. They can chew through gigabytes of data without breaking a sweat. I manage a bunch of production servers, and there's nothing worse than scripts that slurp whole files into memory (and then stuff arrays with the parsed content, which needs more than twice the original file size in memory).
This works and is mostly unbreakable:
<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die();

function dump_record($r) {
    print_r($r);
}

$current = array();
while ($line = fgets($in)) {
    /* Skip empty lines (any number of whitespaces counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;
    /* Search for '123. <value>' stanzas */
    if (preg_match('/^(\d+)\.\s+(.*)\s*$/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);
        /* Let's start the new record */
        $current = array( 'id' => $start[1] );
    }
    else if (preg_match('/^(.*):\s+(.*)\s*/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}
/* Don't forget to dump the last parsed record, a situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);
fclose($in);
?>
Obviously you'll need something suited to your taste in function dump_record, like printing a correctly formatted INSERT SQL statement.
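For instance (a sketch; the table and column names are made up, and in real code a prepared statement would be safer than addslashes):
function dump_record($r) {
    // The keys come straight from the 'key: value' regex above, so trim them first.
    $clean = array();
    foreach ($r as $k => $v) {
        $clean[trim($k)] = addslashes(trim($v));
    }
    printf("INSERT INTO buildings (id, location, owner, architect) VALUES (%d, '%s', '%s', '%s');\n",
        (int) $clean['id'], $clean['Location'], $clean['Owner'], $clean['Architect']);
}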
This will give you what you want,
$array = explode("\n\n", $txt);
foreach ($array as $key => $value) {
    $id_pattern = '#'.($key+1).'\. (.*?)\n#';
    preg_match($id_pattern, $value, $id);
    $location_pattern = '#Location \: (.*?)\n#';
    preg_match($location_pattern, $value, $location);
    $owner_pattern = '#Owner\: (.*?)\n#';
    preg_match($owner_pattern, $value, $owner);
    $architect_pattern = '#Architect\: (.*)#';
    preg_match($architect_pattern, $value, $architect);
    $id = $id[1];
    $location = $location[1];
    $owner = $owner[1];
    $architect = $architect[1];
    mysql_query("INSERT INTO table (id, location, owner, architect) VALUES ('".$id."', '".$location."', '".$owner."', '".$architect."')");
    // Change the MySQL query to match your schema
}
Agreed with Topener's solution; here's an example if each block is 4 lines plus a blank line:
$data = file('path/to/file.txt');

$id = 0;
$parsedData = array();
foreach ($data as $n => $row) {
    if (($n % 5) == 0) {
        $id = (int) $row;
    } elseif (strpos($row, ':') !== false) {
        list($key, $value) = explode(':', $row, 2);
        $parsedData[$id][trim($key)] = trim($value);
    }
}
The structure will be convenient to use, for MySQL or whatever else. (I didn't add code to capture the title text that follows the number.)
Good luck!
preg_match_all("/(\d+)\.(.*?)\sLocation\s*\:\s*(.*?)\sOwner\s*\:\s*(.*?)\sArchitect\s*\:\s*(.*)/i", $txt, $m);

$matched = array();
foreach ($m[1] as $k => $v) {
    $matched[$v] = array(
        "title"     => trim($m[2][$k]),
        "location"  => trim($m[3][$k]),
        "owner"     => trim($m[4][$k]),
        "architect" => trim($m[5][$k])
    );
}
My script is a spider that checks whether a page is a "links page" or an "information page".
If the page is a "links page", it continues in a recursive manner (or a tree, if you will) until it finds an "information page".
I tried to make the script recursive, and it was easy, but I kept getting this error:
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 39 bytes) in /srv/www/loko/simple_html_dom.php on line 1316
I was told I would have to use the for-loop method, because the script won't free memory no matter whether I use the unset() function, and since there are only three levels I need to loop through, it makes sense. But after I changed the script, the error occurs again. Maybe I can free the memory now?
Something needs to die here, please help me destruct someone!
set_time_limit(0);
ini_set('memory_limit', '256M');
require("simple_html_dom.php");

$thelink = "http://www.somelink.com";
$html1 = file_get_html($thelink);
$ret1 = $html1->find('#idTabResults2');
// first inception level, we know the page has only links
if (!$ret1) {
    $es1 = $html1->find('table.litab a');
    //unset($html1);
    $countlinks1 = 0;
    foreach ($es1 as $aa1) {
        $links1[$countlinks1] = $aa1->href;
        $countlinks1++;
    }
    //unset($es1);
    // for every link in the array do the same
    for ($i = 0; $i < $countlinks1; $i++) {
        $html2 = file_get_html($links1[$i]);
        $ret2 = $html2->find('#idTabResults2');
        // if we got information, send it to the DB
        if ($ret2) {
            pullInfo($html2);
            //unset($html2);
        } else {
            // continue inception
            $es2 = $html2->find('table.litab a');
            $html2 = null;
            $countlinks2 = 0;
            foreach ($es2 as $aa2) {
                $links2[$countlinks2] = $aa2->href;
                $countlinks2++;
            }
            //unset($es2);
            for ($j = 0; $j < $countlinks2; $j++) {
                $html3 = file_get_html($links2[$j]);
                $ret3 = $html3->find('#idTabResults2');
                // if we got information, send it to the DB
                if ($ret3) {
                    pullInfo($html3);
                } else {
                    // inception level three
                    $es3 = $html3->find('table.litab a');
                    $html3 = null;
                    $countlinks3 = 0;
                    foreach ($es3 as $aa3) {
                        $links3[$countlinks3] = $aa3->href;
                        $countlinks3++;
                    }
                    for ($k = 0; $k < $countlinks3; $k++) {
                        echo memory_get_usage();
                        echo "\n";
                        $html4 = file_get_html($links3[$k]);
                        $ret4 = $html4->find('#idTabResults2');
                        // if we got information, send it to the DB
                        if ($ret4) {
                            pullInfo($html4);
                        }
                        unset($html4);
                    }
                    unset($html3);
                }
            }
        }
    }
}

function pullInfo($html)
{
    $tds = $html->find('td');
    $count = 0;
    foreach ($tds as $td) {
        $count++;
        if ($count == 1) {
            $name = html_entity_decode($td->innertext);
        }
        if ($count == 2) {
            $address = addslashes(html_entity_decode($td->innertext));
        }
        if ($count == 3) {
            $number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
        }
    }
    unset($tds, $td);
    $name = mysql_real_escape_string($name);
    $address = mysql_real_escape_string($address);
    $number = mysql_real_escape_string($number);
    $output = array();
    $inAlready = mysql_query("SELECT * FROM people WHERE phone=$number");
    while ($e = mysql_fetch_assoc($inAlready)) {
        $output[] = $e;
    }
    if (!empty($output)) {
        // the number is already in the DB
        //print(json_encode($output));
    } else {
        mysql_query("INSERT INTO people (name, area, phone)
            VALUES ('$name', '$address', '$number')");
    }
}
And here is a picture of the growth in memory size:
I modified the code a little bit to free as much memory as I saw could be freed.
I've added a comment above each modification. The added comments start with "#" so you can find them more easily.
This is not related to the question, but it is worth mentioning that your database insertion code is vulnerable to SQL injection.
<?php
require("simple_html_dom.php");

$thelink = "http://www.somelink.co.uk";
# do not keep the raw contents of the file in memory
#$data1 = file_get_contents($thelink);
#$html1 = str_get_html($data1);
$html1 = str_get_html(file_get_contents($thelink));
$ret1 = $html1->find('#idResults2');
// first inception level, we know the page has only links
if (!$ret1) {
    $es1 = $html1->find('table.litab a');
    # free $html1, not used anymore
    unset($html1);
    $countlinks1 = 0;
    foreach ($es1 as $aa1) {
        $links1[$countlinks1] = $aa1->href;
        $countlinks1++;
        // echo (addslashes($aa->href));
    }
    # free the memory used by the $es1 value, not used anymore
    unset($es1);
    // for every link in the array do the same
    for ($i = 0; $i < $countlinks1; $i++) {
        # do not keep the raw contents of the file in memory
        #$data2 = file_get_contents($links1[$i]);
        #$html2 = str_get_html($data2);
        $html2 = str_get_html(file_get_contents($links1[$i]));
        $ret2 = $html2->find('#idResults2');
        // if we got information, send it to the DB
        if ($ret2) {
            pullInfo($html2);
        } else {
            // continue inception
            $es2 = $html2->find('table.litab a');
            # free the memory used by $html2, not used anymore.
            # we unset it again at the end of the loop.
            $html2 = null;
            $countlinks2 = 0;
            foreach ($es2 as $aa2) {
                $links2[$countlinks2] = $aa2->href;
                $countlinks2++;
            }
            # free the memory used by $es2
            unset($es2);
            for ($j = 0; $j < $countlinks2; $j++) {
                # do not keep the raw contents of the file in memory
                #$data3 = file_get_contents($links2[$j]);
                #$html3 = str_get_html($data3);
                $html3 = str_get_html(file_get_contents($links2[$j]));
                $ret3 = $html3->find('#idResults2');
                // if we got information, send it to the DB
                if ($ret3) {
                    pullInfo($html3);
                }
                # free the memory used by $html3, or on the last iteration the memory would not get freed
                unset($html3);
            }
        }
        # free the memory used by $html2, or on the last iteration the memory would not get freed
        unset($html2);
    }
}
function pullInfo($html)
{
    $tds = $html->find('td');
    $count = 0;
    foreach ($tds as $td) {
        $count++;
        if ($count == 1) {
            $name = addslashes($td->innertext);
        }
        if ($count == 2) {
            $address = addslashes($td->innertext);
        }
        if ($count == 3) {
            $number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
        }
    }
    # check for available data:
    if ($count) {
        # free $tds and $td
        unset($tds, $td);
        mysql_query("INSERT INTO people (name, area, phone)
            VALUES ('$name', '$address', '$number')");
    }
}
Update:
You can trace your memory usage to see how much memory is being used in each section of your code. This can be done with memory_get_usage() calls, saving the result to a file. For example, place the line below at the end of each of your loops, or before creating objects or calling heavy methods:
file_put_contents('memory.log', 'memory used in line ' . __LINE__ . ' is: ' . memory_get_usage() . PHP_EOL, FILE_APPEND);
So you could trace the memory usage of each part of your code.
In the end, remember that all this tracing and optimization might not be enough, since your application might really need more memory than 32 MB. I've developed a system that analyzes several data sources and detects spammers, and then blocks their SMTP connections, and since the number of connected users is sometimes over 30,000, even after a lot of code optimization I had to increase the PHP memory limit to 768 MB on the server, which is not a common thing to do.
If your operation requires memory and your server has more memory available, you can call ini_set('memory_limit', '128M'); or something similar (depending your memory requirement) to increase the amount of memory available to the script.
This does not mean you should not optimise and refactor your code :-) but increasing the limit is just one part of the solution.
The solution was to use simple_html_dom's clear() method, e.g. $html4->clear();, to free the memory held by a DOM element when you are finished with it.
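In the loops above, that looks something like:
$html4 = file_get_html($links3[$k]);
// ... find() and pullInfo() work here ...
$html4->clear();  // drop simple_html_dom's internal circular references
unset($html4);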
Firstly, let's turn this into a truly recursive function; that should make it easier to modify the whole chain of events:
function findInfo($thelink)
{
    $data = file_get_contents($thelink); // Might want to make sure it's a valid link, i.e. that file_get_contents actually returned stuff, before trying to run further with it.
    $html = str_get_html($data);
    unset($data); // Finished using it, no reason to keep it around.
    $ret = $html->find('#idResults2');
    if ($ret)
    {
        pullInfo($html);
        return true; // Should stop once it finds it, right?
    }
    else
    {
        $es = $html->find('table.litab a'); // Might want a little error checking here to make sure it actually found links.
        unset($html); // Finished using it, no reason to keep it around.
        $countlinks = 0;
        foreach ($es as $aa)
        {
            $links[$countlinks] = $aa->href;
            $countlinks++;
        }
        unset($es); // Finished using it, no reason to keep it around.
        for ($i = 0; $i < $countlinks; $i++)
        {
            $result = findInfo($links[$i]);
            if ($result === true)
            {
                return true; // To break out of the above recursive calls if a lower call returns true
            }
            else
            {
                unset($links[$i]); // Finished using it, no reason to keep it around.
                continue;
            }
        }
    }
    return false; // Will return false if all else failed; we should hit a return true before this point if an info page is successfully found.
}
See if that helps at all with the cleanups. It will probably still run out of memory, but you shouldn't be holding onto the full HTML of each scanned webpage and whatnot with this.
Oh, and if you only want it to go so deep, change the function declaration to something like:
function findInfo($thelink, $depth = 1, $maxdepth = 3)
Then when calling the function within the function, call it like so:
findInfo($links[$i], $depth + 1, $maxdepth); // pass $maxdepth along so you can override it in the initial call, e.g. findInfo($thelink, 1, 4)
and then do a check on depth vs. maxdepth at the start of the function, having it return false if depth is greater than maxdepth.
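That check could be as simple as:
function findInfo($thelink, $depth = 1, $maxdepth = 3)
{
    if ($depth > $maxdepth) {
        return false; // too deep, give up on this branch
    }
    // ... rest of the function as above ...
}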
If memory usage is your primary concern, you may want to consider using a SAX-based parser. Coding using a SAX parser can be a bit more complicated, but it's not necessary to keep the entire document in memory.
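For what it's worth, a minimal SAX-style skeleton with PHP's built-in expat bindings looks like this (a sketch; expat wants well-formed markup, so you would likely need to Tidy the HTML first, and the handlers here are placeholders):
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) {
        // called for every opening tag; react to e.g. A or TABLE here
    },
    function ($parser, $name) {
        // called for every closing tag
    }
);
xml_set_character_data_handler($parser, function ($parser, $data) {
    // called with text content, possibly in chunks
});
xml_parse($parser, file_get_contents('page.xhtml'), true) or die('parse error');
xml_parser_free($parser);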