Pinterest style / PHP Image Scraper Crashing Server

Pinterest style / PHP Image Scraper Crashing Server - php

I must have a memory leak or something that is just eating memory on my server somewhere in this class. For example if I file_get_contents(http://www.theknot.com) it will not be able to connect to the server tho its not down, or mysql closes the connection, or in extreme situations completed knock out the server for a mount of time that we can not even get a ping. I know its somewhere within the preg_match_all if block, but I dont know what would get run away to what I can only assume is a lot of processing on the regex match due to whatever is within the content that is fetched from the remote site. Any ideas?
<?php
class Utils_Linkpreview extends Zend_Db_table
{
public function getPreviews($url) {
$link = $url;
$width = 200;
$height = 200;
$regex = '/<img[^\/]+src="([^"]+\.(jpe?g|gif|png))/';
/// $regex = '/<img[^\/]+src="([^"]+)/';
$thumbs = false;
try {
$data = file_get_contents($link);
} catch (Exception $e) {
print "Caught exception when attempting to find images: ". $e->getMessage(). "\n";
}
if (($data) && preg_match_all($regex, $data, $m, PREG_PATTERN_ORDER)) {
if (isset($m[1]) && is_array($m[1])) {
$thumbs = array();
foreach (array_unique($m[1]) as $url) {
if (
($url = $this->rel2abs($url, $link)) &&
($i = #getimagesize($url)) &&
$i[0] >= ($width-10) &&
$i[1] >= ($height-10)
) {
$thumbs[] = $url;
}
}
}
}
return $thumbs;
}
private function rel2abs($url, $host) {
if (substr($url, 0, 4) == 'http') {
return $url;
} else {
$hparts = explode('/', $host);
if ($url[0] == '/') {
return implode('/', array_slice($hparts, 0, 3)) . $url;
} else if ($url[0] != '.') {
array_pop($hparts);
return implode('/', $hparts) . '/' . $url;
}
}
}
}
?>
EDIT - Amal Murali's comment pointed me in a better direction using PHP's DomDocument. Thanks bud!
Here is the result:
public function getPreviews($url) {
$link = $url;
$thumbs = false;
try {
$html = file_get_contents($link);
} catch (Exception $e) {
print "Caught exception when attempting to find images: ". $e->getMessage(). "\n";
}
$dom = new DOMDocument();
#$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//img[#width > 200 or substring-before(#width, 'px') > 200 or #height > 200 or substring-before(#height, 'px') > 200]") as $node)
{
$url = $node->getAttribute("src");
$thumbs[] = $this->rel2abs($url, $link);
}
return $thumbs;
}

EDIT - Amal Murali's comment pointed me in a better direction using PHP's DomDocument. Thanks bud!
Here is the result:
public function getPreviews($url) {
$link = $url;
$thumbs = false;
try {
$html = file_get_contents($link);
} catch (Exception $e) {
print "Caught exception when attempting to find images: ". $e->getMessage(). "\n";
}
$dom = new DOMDocument();
#$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//img[#width > 200 or substring-before(#width, 'px') > 200 or #height > 200 or substring-before(#height, 'px') > 200]") as $node)
{
$url = $node->getAttribute("src");
$thumbs[] = $this->rel2abs($url, $link);
}
return $thumbs;
}

Related

How to refresh opencart modification by code

I need to refresh modifications after install module.
public function install() {
$this->load->controller('marketplace/modification/refresh');
}
I tried this. Its worked but the page redirected to modification listing. How can i do without redirect. I am using opencart 3.

If you don't want to edit modification.php or clone its refresh function, You can use this:
public function install(){
$data['redirect'] = 'extension/extension/module';
$this->load->controller('marketplace/modification/refresh', $data);
}

You could not controll by this way as you are doing:
You need to do this as
public function install() {
$this->refresh();
}
protected function refresh($data = array()) {
$this->load->language('marketplace/modification');
$this->document->setTitle($this->language->get('heading_title'));
$this->load->model('setting/modification');
if ($this->validate()) {
// Just before files are deleted, if config settings say maintenance mode is off then turn it on
$maintenance = $this->config->get('config_maintenance');
$this->load->model('setting/setting');
$this->model_setting_setting->editSettingValue('config', 'config_maintenance', true);
//Log
$log = array();
// Clear all modification files
$files = array();
// Make path into an array
$path = array(DIR_MODIFICATION . '*');
// While the path array is still populated keep looping through
while (count($path) != 0) {
$next = array_shift($path);
foreach (glob($next) as $file) {
// If directory add to path array
if (is_dir($file)) {
$path[] = $file . '/*';
}
// Add the file to the files to be deleted array
$files[] = $file;
}
}
// Reverse sort the file array
rsort($files);
// Clear all modification files
foreach ($files as $file) {
if ($file != DIR_MODIFICATION . 'index.html') {
// If file just delete
if (is_file($file)) {
unlink($file);
// If directory use the remove directory function
} elseif (is_dir($file)) {
rmdir($file);
}
}
}
// Begin
$xml = array();
// Load the default modification XML
$xml[] = file_get_contents(DIR_SYSTEM . 'modification.xml');
// This is purly for developers so they can run mods directly and have them run without upload after each change.
$files = glob(DIR_SYSTEM . '*.ocmod.xml');
if ($files) {
foreach ($files as $file) {
$xml[] = file_get_contents($file);
}
}
// Get the default modification file
$results = $this->model_setting_modification->getModifications();
foreach ($results as $result) {
if ($result['status']) {
$xml[] = $result['xml'];
}
}
$modification = array();
foreach ($xml as $xml) {
if (empty($xml)){
continue;
}
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->preserveWhiteSpace = false;
$dom->loadXml($xml);
// Log
$log[] = 'MOD: ' . $dom->getElementsByTagName('name')->item(0)->textContent;
// Wipe the past modification store in the backup array
$recovery = array();
// Set the a recovery of the modification code in case we need to use it if an abort attribute is used.
if (isset($modification)) {
$recovery = $modification;
}
$files = $dom->getElementsByTagName('modification')->item(0)->getElementsByTagName('file');
foreach ($files as $file) {
$operations = $file->getElementsByTagName('operation');
$files = explode('|', $file->getAttribute('path'));
foreach ($files as $file) {
$path = '';
// Get the full path of the files that are going to be used for modification
if ((substr($file, 0, 7) == 'catalog')) {
$path = DIR_CATALOG . substr($file, 8);
}
if ((substr($file, 0, 5) == 'admin')) {
$path = DIR_APPLICATION . substr($file, 6);
}
if ((substr($file, 0, 6) == 'system')) {
$path = DIR_SYSTEM . substr($file, 7);
}
if ($path) {
$files = glob($path, GLOB_BRACE);
if ($files) {
foreach ($files as $file) {
// Get the key to be used for the modification cache filename.
if (substr($file, 0, strlen(DIR_CATALOG)) == DIR_CATALOG) {
$key = 'catalog/' . substr($file, strlen(DIR_CATALOG));
}
if (substr($file, 0, strlen(DIR_APPLICATION)) == DIR_APPLICATION) {
$key = 'admin/' . substr($file, strlen(DIR_APPLICATION));
}
if (substr($file, 0, strlen(DIR_SYSTEM)) == DIR_SYSTEM) {
$key = 'system/' . substr($file, strlen(DIR_SYSTEM));
}
// If file contents is not already in the modification array we need to load it.
if (!isset($modification[$key])) {
$content = file_get_contents($file);
$modification[$key] = preg_replace('~\r?\n~', "\n", $content);
$original[$key] = preg_replace('~\r?\n~', "\n", $content);
// Log
$log[] = PHP_EOL . 'FILE: ' . $key;
}
foreach ($operations as $operation) {
$error = $operation->getAttribute('error');
// Ignoreif
$ignoreif = $operation->getElementsByTagName('ignoreif')->item(0);
if ($ignoreif) {
if ($ignoreif->getAttribute('regex') != 'true') {
if (strpos($modification[$key], $ignoreif->textContent) !== false) {
continue;
}
} else {
if (preg_match($ignoreif->textContent, $modification[$key])) {
continue;
}
}
}
$status = false;
// Search and replace
if ($operation->getElementsByTagName('search')->item(0)->getAttribute('regex') != 'true') {
// Search
$search = $operation->getElementsByTagName('search')->item(0)->textContent;
$trim = $operation->getElementsByTagName('search')->item(0)->getAttribute('trim');
$index = $operation->getElementsByTagName('search')->item(0)->getAttribute('index');
// Trim line if no trim attribute is set or is set to true.
if (!$trim || $trim == 'true') {
$search = trim($search);
}
// Add
$add = $operation->getElementsByTagName('add')->item(0)->textContent;
$trim = $operation->getElementsByTagName('add')->item(0)->getAttribute('trim');
$position = $operation->getElementsByTagName('add')->item(0)->getAttribute('position');
$offset = $operation->getElementsByTagName('add')->item(0)->getAttribute('offset');
if ($offset == '') {
$offset = 0;
}
// Trim line if is set to true.
if ($trim == 'true') {
$add = trim($add);
}
// Log
$log[] = 'CODE: ' . $search;
// Check if using indexes
if ($index !== '') {
$indexes = explode(',', $index);
} else {
$indexes = array();
}
// Get all the matches
$i = 0;
$lines = explode("\n", $modification[$key]);
for ($line_id = 0; $line_id < count($lines); $line_id++) {
$line = $lines[$line_id];
// Status
$match = false;
// Check to see if the line matches the search code.
if (stripos($line, $search) !== false) {
// If indexes are not used then just set the found status to true.
if (!$indexes) {
$match = true;
} elseif (in_array($i, $indexes)) {
$match = true;
}
$i++;
}
// Now for replacing or adding to the matched elements
if ($match) {
switch ($position) {
default:
case 'replace':
$new_lines = explode("\n", $add);
if ($offset < 0) {
array_splice($lines, $line_id + $offset, abs($offset) + 1, array(str_replace($search, $add, $line)));
$line_id -= $offset;
} else {
array_splice($lines, $line_id, $offset + 1, array(str_replace($search, $add, $line)));
}
break;
case 'before':
$new_lines = explode("\n", $add);
array_splice($lines, $line_id - $offset, 0, $new_lines);
$line_id += count($new_lines);
break;
case 'after':
$new_lines = explode("\n", $add);
array_splice($lines, ($line_id + 1) + $offset, 0, $new_lines);
$line_id += count($new_lines);
break;
}
// Log
$log[] = 'LINE: ' . $line_id;
$status = true;
}
}
$modification[$key] = implode("\n", $lines);
} else {
$search = trim($operation->getElementsByTagName('search')->item(0)->textContent);
$limit = $operation->getElementsByTagName('search')->item(0)->getAttribute('limit');
$replace = trim($operation->getElementsByTagName('add')->item(0)->textContent);
// Limit
if (!$limit) {
$limit = -1;
}
// Log
$match = array();
preg_match_all($search, $modification[$key], $match, PREG_OFFSET_CAPTURE);
// Remove part of the the result if a limit is set.
if ($limit > 0) {
$match[0] = array_slice($match[0], 0, $limit);
}
if ($match[0]) {
$log[] = 'REGEX: ' . $search;
for ($i = 0; $i < count($match[0]); $i++) {
$log[] = 'LINE: ' . (substr_count(substr($modification[$key], 0, $match[0][$i][1]), "\n") + 1);
}
$status = true;
}
// Make the modification
$modification[$key] = preg_replace($search, $replace, $modification[$key], $limit);
}
if (!$status) {
// Abort applying this modification completely.
if ($error == 'abort') {
$modification = $recovery;
// Log
$log[] = 'NOT FOUND - ABORTING!';
break 5;
}
// Skip current operation or break
elseif ($error == 'skip') {
// Log
$log[] = 'NOT FOUND - OPERATION SKIPPED!';
continue;
}
// Break current operations
else {
// Log
$log[] = 'NOT FOUND - OPERATIONS ABORTED!';
break;
}
}
}
}
}
}
}
}
// Log
$log[] = '----------------------------------------------------------------';
}
// Log
$ocmod = new Log('ocmod.log');
$ocmod->write(implode("\n", $log));
// Write all modification files
foreach ($modification as $key => $value) {
// Only create a file if there are changes
if ($original[$key] != $value) {
$path = '';
$directories = explode('/', dirname($key));
foreach ($directories as $directory) {
$path = $path . '/' . $directory;
if (!is_dir(DIR_MODIFICATION . $path)) {
#mkdir(DIR_MODIFICATION . $path, 0777);
}
}
$handle = fopen(DIR_MODIFICATION . $key, 'w');
fwrite($handle, $value);
fclose($handle);
}
}
// Maintance mode back to original settings
$this->model_setting_setting->editSettingValue('config', 'config_maintenance', $maintenance);
// Do not return success message if refresh() was called with $data
$this->session->data['success'] = $this->language->get('text_success');
$url = '';
if (isset($this->request->get['sort'])) {
$url .= '&sort=' . $this->request->get['sort'];
}
if (isset($this->request->get['order'])) {
$url .= '&order=' . $this->request->get['order'];
}
if (isset($this->request->get['page'])) {
$url .= '&page=' . $this->request->get['page'];
}
}
}
I hope it shouwl work for you.
This process is used to refresh the modification when your module installing.
if you need globally this then please tell me I will update you process.

Error PHP website crawler class using Simple HTML Dom

I'm getting the following error when trying to use Simple HTML Dom inside a web crawler class. The class seems to be working well but I get many errors in my error_log file.
[01-Apr-2016 23:16:51 UTC] PHP Warning: Invalid argument supplied for foreach() in /home/scrybs/public_html/order/uploader/php/simple_html_dom.php on line 357
If I check Simple HTML Dom, the error comes from here:
function innertext()
{
if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];
if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
$ret = '';
foreach ($this->nodes as $n)
$ret .= $n->outertext();
return $ret;
}
The crawler class in question is as following:
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_useHttpAuth = false;
protected $_user;
protected $_pass;
protected $_seen = array();
protected $_filter = array();
public $contenu = array();
public function __construct($url, $depth = 5)
{
$this->_url = $url;
$this->_depth = $depth;
$parse = parse_url($url);
$this->_host = $parse['host'];
$this->html = new simple_html_dom();
}
protected function _processAnchors($content, $url, $depth)
{
//$dom = new DOMDocument('1.0');
//#$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//$dom->formatOutput = true;
$this->html->load($content);
$metatitle = $this->html->find('title',0)->innertext;
foreach($this->html->find("meta[name='description']") as $element){
$metadescription = $element->content;
}
foreach($this->html->find("meta[name='keywords']") as $element){
$metakeywords = $element->content;
}
if(!empty($metatitle)){
$this->contenu['meta_titles'][] = $metatitle;
}
if(!empty($metadescription)){
$this->contenu['meta_titles'][] = $metadescription;
}
if(!empty($metakeywords)){
$this->contenu['meta_titles'][] = $metakeywords;
}
// IMAGE ALTS
foreach($this->html->find('img') as $e){
if(!empty($e->alt)){
if(!$this->search_array($e->alt, $this->contenu)){
$this->contenu['alt_images'][] = $e->alt;
}
}
}
// LINKS
$links = $this->html->find('a');
foreach($links as $element){
// GET LINK TEXTS
$a = $element->innertext;
$a = preg_replace("/<a.*?>(.*?)<\/a>/", '\1', $a);
$a = preg_replace("/<p.*?>.*?<\/p>/", "{{P}}", $a);
$a = preg_replace("/<img.*?>/", "{{IMG}}", $a);
$a = preg_replace('#(<br */?>\s*)+#i', "{{BR}}", $a);
$a = preg_replace('#<button.*?>.*?</button>#i', '{{BUTTON}}', $a);
$a = preg_replace('#<time.*?>(.*?)</time>#i', '{{TIME}}', $a);
$a = preg_replace('#<span.*?>(.*?)</span>#i', '{{SPAN}}\1{{/SPAN}}', $a);
$a = preg_replace('#<strong.*?>(.*?)</strong>#i', '{{STRONG}}\1{{/STRONG}}', $a);
$a = preg_replace('#<b.*?>(.*?)</b>#i', '{{B}}\1{{/B}}', $a);
$a = preg_replace('#<i.*?>(.*?)</i>#i', '{{I}}\1{{/I}}', $a);
$a = preg_replace('#<small.*?>(.*?)</small>#i', '{{SMALL}}\1{{/SMALL}}', $a);
$a = preg_replace('#<abbr.*?>(.*?)</abbr>#i', '{{ABBR}}\1{{/ABBR}}', $a);
$a = trim(strip_tags($a));
$a = preg_replace('/\s+/', ' ', $a);
// CHECK IF NOT ONLY VARIABLES AND SPACES
$atmp = strip_tags($a);
$atmp = preg_replace("/{{.*?}}/", '', $atmp);
$atmp = preg_replace('/\s+/', '', $atmp);
if(!empty($a) && $a != '' && $atmp != ''){
if(!$this->search_array($a, $this->contenu)){
$this->contenu['link_texts'][] = $a;
}
}
// GET LINK TITLES
$title = $element->title;
if(!empty($title)){
if(!$this->search_array($title, $this->contenu)){
$this->contenu['link_titles'][] = $title;
}
}
$href = $element->href;
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '#';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
// Crawl only link that belongs to the start domain
$this->crawl_page($href, $depth - 1);
}
return $this->contenu;
}
protected function _getContent($url)
{
$handle = curl_init($url);
if ($this->_useHttpAuth) {
curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
}
// follows 302 redirect, creates problem wiht authentication
// curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
// return the content
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
// response total time
$time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
return array($response, $httpCode, $time);
}
protected function _printResult($url, $depth, $httpcode, $time)
{
ob_end_flush();
$currentDepth = $this->_depth - $depth;
$count = count($this->_seen);
//echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
ob_start();
flush();
}
protected function isValid($url, $depth)
{
if (strpos($url, $this->_host) === false
|| $depth === 0
|| isset($this->_seen[$url])
|| preg_match("/#/i", $url)
|| preg_match("/.png/i", $url)
|| preg_match("/.jpg/i", $url)
|| preg_match("/.jpeg/i", $url)
|| preg_match("/.gif/i", $url)
|| preg_match("/.pdf/i", $url)
|| preg_match("/javascript/i", $url)
|| preg_match("/twitter.com/i", $url)
|| preg_match("/google.com/i", $url)
|| preg_match("/facebook.com/i", $url)
|| preg_match("/youtube.com/i", $url)
|| preg_match("/instagram.com/i", $url)
|| preg_match("/wp-login.php/i", $url)
) {
return false;
}
foreach ($this->_filter as $excludePath) {
if (strpos($url, $excludePath) !== false) {
return false;
}
}
return true;
}
public function search_array($needle, $haystack) {
if(in_array($needle, $haystack)) {
return true;
}
foreach($haystack as $element) {
if(is_array($element) && $this->search_array($needle, $element))
return true;
}
return false;
}
public function crawl_page($url, $depth)
{
if (!$this->isValid($url, $depth)) {
return;
}
// add to the seen URL
$this->_seen[$url] = true;
// get Content and Return Code
list($content, $httpcode, $time) = $this->_getContent($url);
// print Result for current Page
//$this->_printResult($url, $depth, $httpcode, $time);
// process subPages
$this->_processAnchors($content, $url, $depth, $contenu = array());
}
public function addFilterPath($path)
{
$this->_filter[] = $path;
}
public function run()
{
$this->crawl_page($this->_url, $this->_depth);
}
}
The error seems to be coming from this line related to innertext function:
// GET LINK TEXTS
$a = $element->innertext;
I don't get any error when I use:
$a = $element->innertext;
But not ideal as I would like to keep HTML tags. I don't get any error when I use Simple HTML Dom outside the class so does it have something to do with the fact that Simple HTML Dom is in a class? Do somebody have an idea?
Thanks for your help!

I have found the bug.
On my (limited) tests, the problem happens when you set depth > 1, so — seeing your code — when you load more than one page URL. One of the countless Simple HTML DOM problems, is that ->load() method doesn't work correctly on multiple loads.
Re-instantiating html object, the script seems work:
protected function _processAnchors( $content, $url, $depth )
{
$this->html = new simple_html_dom(); # <-----
$this->html->load( $content );
I tested also $this->html = str_get_html($content); but it works only on limited sites.
Additional Note: In HTML <title> tag is mandatory, but not all sites has well formatted HTML: consider checking for <title> tag (and for each tag) existence to avoid additional errors.

The PHP crawler I am using has a memory leak, what is causing this?

I am using a PHP crawler that has a memory leak. It is good for the first ~3125 links, then it runs out of memory.I tried getting rid of the MySQL insert, but that did not change anything. Can someone help me diagnose this problem? Thank you so much.
<?php
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
ini_set('max_execution_time', 0);
// USAGE
$startURL = $your_url;
$depth = 9999;
$crawler = new crawler($startURL, $depth);
// Exclude path with the following structure to be processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_seen = array();
protected $_filter = array();
public function __construct($url, $depth = 5)
{
$this->_url = $url;
$this->_depth = $depth;
$parse = parse_url($url);
$this->_host = $parse['host'];
}
protected function _processAnchors($content, $url, $depth)
{
$dom = new DOMDocument('1.0');
#$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '#';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
// Crawl only link that belongs to the start domain
$this->crawl_page($href, $depth - 1);
}
}
protected function _getContent($url)
{
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($handle);
curl_close($handle);
return array($response);
}
protected function _printResult($url, $depth)
{
ob_end_flush();
$currentDepth = $this->_depth - $depth;
$count = count($this->_seen);
echo "$url <br>";
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
$databaseconnect = new PDO("mysql:dbname=DB_NAME;host=$mysqlhost;charset=utf8", $mysqlusername, $mysqlpassword);
$statement = $databaseconnect->prepare("INSERT INTO data(url,name) VALUES(:url,:name)");
$statement->execute(array(':url' => $url,
':name' => $url));
ob_start();
flush();
}
protected function isValid($url, $depth)
{
if (strpos($url, $this->_host) === false
|| $depth === 0
|| isset($this->_seen[$url])
) {
return false;
}
foreach ($this->_filter as $excludePath) {
if (strpos($url, $excludePath) !== false) {
return false;
}
}
return true;
}
public function crawl_page($url, $depth)
{
if (!$this->isValid($url, $depth)) {
return;
}
// add to the seen URL
$this->_seen[$url] = true;
// get Content and Return Code
list($content) = $this->_getContent($url);
// print Result for current Page
$this->_printResult($url, $depth);
// process subPages
$this->_processAnchors($content, $url, $depth);
}
public function addFilterPath($path)
{
$this->_filter[] = $path;
}
public function run()
{
$this->crawl_page($this->_url
, $this->_depth);
}
}
?>

I'm not sure if this classifies as a memory leak exactly. You are essentially using recursion without a terminating case. Before the crawl_page() method finishes it calls _processAnchors(), which in turn may call crawl_page() again if it finds any links (very likely). Every recursive call eats up more memory because the originating crawl_page() call (and most thereafter) can't be removed from the call stack until all of its recursive calls terminate.

Loop through array values

I have a small problem:
I need to repeat this
do {
$QUERY = "/member?id=".$counter."&action=refresh";
$URL = $HTTP.$HTTPUSER.":".$HTTPPASS."#".$HTTPSERVER.":".$HTTPPORT.$QUERY;
$xml = file_get_contents($URL);
$data = new SimpleXMLElement($xml);
$test_ip = (string)$data->c1;
$dnsip = explode('<br>', $test_ip);
$ext_ip = strip_tags($dnsip[1]);
if ($ext_ip != "127.0.0.1" && $ext_ip != "localhost") {
$dns = strip_tags($dnsip[0]);
echo "$dns $ext_ip <br>";
}
$counter +=1;
} while (!empty($data));
as many times as there are values inside an array, so i tried to add this
$ports = array('2001','2002','2003');
foreach ($ports as $HTTPPORT) {
echo "$HTTPPORT<br>";
$counter = 1;
do {
$QUERY = "/member?id=".$counter."&action=refresh";
$URL = $HTTP.$HTTPUSER.":".$HTTPPASS."#".$HTTPSERVER.":".$HTTPPORT.$QUERY;
$xml = file_get_contents($URL);
$data = new SimpleXMLElement($xml);
$test_ip = (string)$data->c1;
$dnsip = explode('<br>', $test_ip);
$ext_ip = strip_tags($dnsip[1]);
if ($ext_ip != "127.0.0.1" && $ext_ip != "localhost") {
$dns = strip_tags($dnsip[0]);
echo "$dns $ext_ip <br>";
}
$counter +=1;
} while (!empty($data)); }
The problem is that it executes the script with only first port number (2001), and i can't discover why.
Any Ideas?

You could be getting an exception inside your 'do .. while' loop that is causing bother.
I have added a 'try .. catch' block and some 'echo' statements to ensure it always loops round all the 'ports' now. Changed 'catch' to mark $data as empty then continue.
here is tested code:
<?php
$ports = array('2001','2002','2003');
$counter = 0; // total count of documents
foreach($ports as $HTTPPORT) {
echo $HTTPPORT, ' start of process port loop<br/>';
try { // catch any error -- report it and loop round again
do {
$QUERY = "/member?id=".$counter."&action=refresh";
$URL = ''; // $HTTP.$HTTPUSER.":".$HTTPPASS."#".$HTTPSERVER.":".$HTTPPORT.$QUERY;
try {
$xml = file_get_contents($URL);
$data = new SimpleXMLElement($xml);
}
catch (Exception $e) { // ignore any errors
echo 'SimpleXMLElement : oops :', $e->getMessage(), '<br />';
$data = ''; // mark as empty
}
if (!empty($data)) { // process data
$test_ip = (string)$data->c1;
$dnsip = explode('<br>', $test_ip);
$ext_ip = strip_tags($dnsip[1]);
if ($ext_ip != "127.0.0.1" && $ext_ip != "localhost") {
$dns = strip_tags($dnsip[0]);
echo "$dns $ext_ip <br>";
}
}
$counter +=1;
} while (!empty($data));
} // end of try to get and process a document...
catch (Exception $e) { // catch all errors for now
echo 'Processing List of Ports: oops! :', $e->getMessage(), '<br />';
}
} // end of foreach port

PHP recursive variable replacement

I'm writing code to recursively replace predefined variables from inside a given string. The variables are prefixed with the character '%'. Input strings that start with '^' are to be evaluated.
For instance, assuming an array of variables such as:
$vars['a'] = 'This is a string';
$vars['b'] = '123';
$vars['d'] = '%c'; // Note that $vars['c'] has not been defined
$vars['e'] = '^5 + %d';
$vars['f'] = '^11 + %e + %b*2';
$vars['g'] = '^date(\'l\')';
$vars['h'] = 'Today is %g.';
$vars['input_digits'] = '*****';
$vars['code'] = '%input_digits';
The following code would result in:
a) $str = '^1 + %c';
$rc = _expand_variables($str, $vars);
// Result: $rc == 1
b) $str = '^%a != NULL';
$rc = _expand_variables($str, $vars);
// Result: $rc == 1
c) $str = '^3+%f + 3';
$rc = _expand_variables($str, $vars);
// Result: $rc == 262
d) $str = '%h';
$rc = _expand_variables($str, $vars);
// Result: $rc == 'Today is Monday'
e) $str = 'Your code is: %code';
$rc = _expand_variables($str, $vars);
// Result: $rc == 'Your code is: *****'
Any suggestions on how to do that? I've spent many days trying to do this, but only achieved partial success. Unfortunately, my last attempt managed to generate a 'segmentation fault'!!
Help would be much appreciated!

Note that there is no check against circular inclusion, which would simply lead to an infinite loop. (Example: $vars['s'] = '%s'; ..) So make sure your data is free of such constructs.
The commented code
// if(!is_numeric($expanded) || (substr($expanded.'',0,1)==='0'
// && strpos($expanded.'', '.')===false)) {
..
// }
can be used or skipped. If it is skipped, any replacement is quoted, if the string $str will be evaluated later on! But since PHP automatically converts strings to numbers (or should I say it tries to do so??) skipping the code should not lead to any problems.
Note that boolean values are not supported! (Also there is no automatic conversion done by PHP, that converts strings like 'true' or 'false' to the appropriate boolean values!)
<?
$vars['a'] = 'This is a string';
$vars['b'] = '123';
$vars['d'] = '%c';
$vars['e'] = '^5 + %d';
$vars['f'] = '^11 + %e + %b*2';
$vars['g'] = '^date(\'l\')';
$vars['h'] = 'Today is %g.';
$vars['i'] = 'Zip: %j';
$vars['j'] = '01234';
$vars['input_digits'] = '*****';
$vars['code'] = '%input_digits';
function expand($str, $vars) {
$regex = '/\%(\w+)/';
$eval = substr($str, 0, 1) == '^';
$res = preg_replace_callback($regex, function($matches) use ($eval, $vars) {
if(isset($vars[$matches[1]])) {
$expanded = expand($vars[$matches[1]], $vars);
if($eval) {
// Special handling since $str is going to be evaluated ..
// if(!is_numeric($expanded) || (substr($expanded.'',0,1)==='0'
// && strpos($expanded.'', '.')===false)) {
$expanded = "'$expanded'";
// }
}
return $expanded;
} else {
// Variable does not exist in $vars array
if($eval) {
return 'null';
}
return $matches[0];
}
}, $str);
if($eval) {
ob_start();
$expr = substr($res, 1);
if(eval('$res = ' . $expr . ';')===false) {
ob_end_clean();
die('Not a correct PHP-Expression: '.$expr);
}
ob_end_clean();
}
return $res;
}
echo expand('^1 + %c',$vars);
echo '<br/>';
echo expand('^%a != NULL',$vars);
echo '<br/>';
echo expand('^3+%f + 3',$vars);
echo '<br/>';
echo expand('%h',$vars);
echo '<br/>';
echo expand('Your code is: %code',$vars);
echo '<br/>';
echo expand('Some Info: %i',$vars);
?>
The above code assumes PHP 5.3 since it uses a closure.
Output:
1
1
268
Today is Tuesday.
Your code is: *****
Some Info: Zip: 01234
For PHP < 5.3 the following adapted code can be used:
function expand2($str, $vars) {
$regex = '/\%(\w+)/';
$eval = substr($str, 0, 1) == '^';
$res = preg_replace_callback($regex, array(new Helper($vars, $eval),'callback'), $str);
if($eval) {
ob_start();
$expr = substr($res, 1);
if(eval('$res = ' . $expr . ';')===false) {
ob_end_clean();
die('Not a correct PHP-Expression: '.$expr);
}
ob_end_clean();
}
return $res;
}
class Helper {
var $vars;
var $eval;
function Helper($vars,$eval) {
$this->vars = $vars;
$this->eval = $eval;
}
function callback($matches) {
if(isset($this->vars[$matches[1]])) {
$expanded = expand($this->vars[$matches[1]], $this->vars);
if($this->eval) {
// Special handling since $str is going to be evaluated ..
if(!is_numeric($expanded) || (substr($expanded . '', 0, 1)==='0'
&& strpos($expanded . '', '.')===false)) {
$expanded = "'$expanded'";
}
}
return $expanded;
} else {
// Variable does not exist in $vars array
if($this->eval) {
return 'null';
}
return $matches[0];
}
}
}

I now have written an evaluator for your code, which addresses the circular reference problem, too.
Use:
$expression = new Evaluator($vars);
$vars['a'] = 'This is a string';
// ...
$vars['circular'] = '%ralucric';
$vars['ralucric'] = '%circular';
echo $expression->evaluate('%circular');
I use a $this->stack to handle circular references. (No idea what a stack actually is, I simply named it so ^^)
class Evaluator {
private $vars;
private $stack = array();
private $inEval = false;
public function __construct(&$vars) {
$this->vars =& $vars;
}
public function evaluate($str) {
// empty string
if (!isset($str[0])) {
return '';
}
if ($str[0] == '^') {
$this->inEval = true;
ob_start();
eval('$str = ' . preg_replace_callback('#%(\w+)#', array($this, '_replace'), substr($str, 1)) . ';');
if ($error = ob_get_clean()) {
throw new LogicException('Eval code failed: '.$error);
}
$this->inEval = false;
}
else {
$str = preg_replace_callback('#%(\w+)#', array($this, '_replace'), $str);
}
return $str;
}
private function _replace(&$matches) {
if (!isset($this->vars[$matches[1]])) {
return $this->inEval ? 'null' : '';
}
if (isset($this->stack[$matches[1]])) {
throw new LogicException('Circular Reference detected!');
}
$this->stack[$matches[1]] = true;
$return = $this->evaluate($this->vars[$matches[1]]);
unset($this->stack[$matches[1]]);
return $this->inEval == false ? $return : '\'' . $return . '\'';
}
}
Edit 1: I tested the maximum recursion depth for this script using this:
$alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEF'; // GHIJKLMNOPQRSTUVWXYZ
$length = strlen($alphabet);
$vars['a'] = 'Hallo World!';
for ($i = 1; $i < $length; ++$i) {
$vars[$alphabet[$i]] = '%' . $alphabet[$i-1];
}
var_dump($vars);
$expression = new Evaluator($vars);
echo $expression->evaluate('%' . $alphabet[$length - 1]);
If another character is added to $alphabet maximum recursion depth of 100 is reached. (But probably you can modify this setting somewhere?)

I actually just did this while implementing a MVC framework.
What I did was create a "find-tags" function that uses a regular expression to find all things that should be replaced using preg_match_all and then iterated through the list and called the function recursively with the str_replaced code.
VERY Simplified Code
function findTags($body)
{
$tagPattern = '/{%(?P<tag>\w+) *(?P<inputs>.*?)%}/'
preg_match_all($tagPattern,$body,$results,PREG_SET_ORDER);
foreach($results as $command)
{
$toReturn[] = array(0=>$command[0],'tag'=>$command['tag'],'inputs'=>$command['inputs']);
}
if(!isset($toReturn))
$toReturn = array();
return $toReturn;
}
function renderToView($body)
{
$arr = findTags($body);
if(count($arr) == 0)
return $body;
else
{
foreach($arr as $tag)
{
$body = str_replace($tag[0],$LOOKUPARRY[$tag['tag']],$body);
}
}
return renderToView($body);
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Pinterest style / PHP Image Scraper Crashing Server - php

Related

How to refresh opencart modification by code

Error PHP website crawler class using Simple HTML Dom

The PHP crawler I am using has a memory leak, what is causing this?

Loop through array values

PHP recursive variable replacement

Categories

Resources