How to write a bot that does not consume much RAM? - php

I have a web bot that consumes a lot of memory: after a while, memory usage reaches 50% and the process gets killed. I have no idea why memory usage keeps growing like that. I did not include "para.php" below; it is a library for parallel curl requests. I would also like to learn more about web crawlers; I searched a lot but could not find any helpful documentation or methods that I can use.
This is the library from which I obtained para.php.
My code:
require_once "para.php";
class crawling{
public $montent;
public function crawl_page($url){
$m = new Mongo();
$muun = $m->howto->en->findOne(array("_id" => $url));
if (isset($muun)) {
return;
}
$m->howto->en->save(array("_id" => $url));
echo $url;
echo "\n";
$para = new ParallelCurl(10);
$para->startRequest($url, array($this,'on_request_done'));
$para->finishAllRequests();
preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);
foreach($matk[1] as $longu){
$href = $longu;
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
$this->crawl_page($href); // follow the resolved absolute URL, not the raw link
}
}
public function on_request_done($content) {
$this->montent = $content;
}
} // end of class crawling
$moj = new crawling;
$moj->crawl_page("http://www.example.com/");

You call the crawl_page function on one URL.
Its content is fetched ($this->montent) and scanned for links ($matk).
While these are still in memory, you recurse by starting a new call to crawl_page. $this->montent is overwritten with the new content (that's OK). A bit further down, $matk (a new local variable) is populated with the links from the new $this->montent. At this point there are two $matk arrays in memory: one with all links from the document you started with, and one with all links from the first document it links to. Every further level of recursion adds another array, so memory keeps growing.
I'd suggest finding all links and saving them to a database instead of recursing immediately. Then work through the queue of links in the database one by one, with each newly fetched document adding its own links as new entries.
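The queue-based approach can be sketched roughly as follows. This is a minimal illustration, not the asker's code: an in-memory SplQueue and a $seen array stand in for the database table, and extract_links(), crawl(), $fetch and $maxPages are hypothetical names introduced here. $fetch is assumed to be any callable that returns the HTML for a URL, for example a thin curl wrapper.

```php
<?php
// Hypothetical helper: return the href values of all anchor tags in $html.
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

// Iterative crawl: one loop, one queue, no recursion.
function crawl(string $start, callable $fetch, int $maxPages = 100): array
{
    $queue = new SplQueue();          // stands in for the queue table in the database
    $queue->enqueue($start);
    $seen = [$start => true];         // URLs that were already queued
    $visited = [];

    while (!$queue->isEmpty() && count($visited) < $maxPages) {
        $url = $queue->dequeue();
        $visited[] = $url;

        $html = $fetch($url);
        foreach (extract_links($html) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue($link);
            }
        }
        unset($html);                 // only one page body is held at a time
    }

    return $visited;
}
```

Because each page's link list is discarded before the next page is fetched, memory is bounded by the queue and the set of seen URLs rather than by the depth of the call stack. Moving the queue and the seen set into MongoDB, as suggested above, would shrink the in-process footprint further.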

Related

PHP image load function reads from the controller instead of domain/folder/to/image

Dear stackoverflow community,
Loading images through a function works when I am in the indexAction (the index of the controller, e.g. gallery). When I try to load images through the same function from a sub-page of the main controller (e.g. gallery/family), the images do not load.
So when going to domain.com/gallery, everything works.
When going to domain.com/gallery/family, the browser looks for the images at "domain.com/gallery/app/assets/image/gallery/family/" instead of "domain.com/app/assets/image/gallery/family/".
The images are located at "app/assets/image/gallery/family/", and I run this function:
public function insertFamilyImages()
{
$directory = 'app/assets/image/gallery/family/';
$extension = '.jpg';
$html = '';
if ( file_exists($directory) ) {
foreach ( glob($directory . '*' . $extension) as $file ) {
$html .= '<li>';
$html .= '<img src="' . $file . '" alt="">';
$html .= '</li>';
}
} else {
echo 'directory ' . $directory . ' doesn\'t exist!';
}
return $html;
}
I am stuck on this and Google has not helped. I also cannot find a similar post on Stack Overflow about this issue.
Thanks in advance,
Caleb
From looking at this, it seems that when you build your HTML you point the img src at the file system path and not at the URI. This HTML is loaded in a remote browser, and the browser can't access the server's file system. It needs a URI.
E.g.
Let's say your domain name is example.com and an example image is called eg.jpg.
Your image is located on disk at app/assets/image/gallery/family/eg.jpg (relative path).
However, your images are accessible externally at https://example.com/gallery/family/eg.jpg.
Your code would look something like this:
public function insertFamilyImages()
{
$directory = 'app/assets/image/gallery/family/';
$uri = 'https://example.com/gallery/family/';
$extension = '.jpg';
$html = '';
if ( file_exists($directory) ) {
foreach ( glob($directory . '*' . $extension) as $file ) {
$file = $uri.basename($file);
$html .= '<li>';
$html .= '<img src="' . $file . '" alt="">';
$html .= '</li>';
}
} else {
echo 'directory ' . $directory . ' doesn\'t exist!';
}
return $html;
}
You'll have to adjust this code to fit your case, but it should point you in the right direction where the problem lies. I used a complete uri here to point to the image to illustrate the point, but this could be made relative as well to the web root.
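A root-relative variant of the same idea might look like the sketch below. image_list_html() and its parameters are hypothetical names, and it assumes the directory on disk and the URL path under which the web server exposes it are known separately.

```php
<?php
// Hypothetical helper: list the images found in a directory on disk,
// but build each src from a separate web base path.
function image_list_html(string $diskDir, string $baseUri, string $extension = '.jpg'): string
{
    $html = '';
    $files = glob(rtrim($diskDir, '/') . '/*' . $extension) ?: [];
    foreach ($files as $file) {
        // basename() drops the file system directory; only the file name goes into the URL.
        $src = rtrim($baseUri, '/') . '/' . rawurlencode(basename($file));
        $html .= '<li><img src="' . htmlspecialchars($src, ENT_QUOTES) . '" alt=""></li>';
    }
    return $html;
}
```

Called as image_list_html('app/assets/image/gallery/family', '/app/assets/image/gallery/family'), the leading slash in the second argument makes the browser resolve every src from the domain root, so the same markup works on both domain.com/gallery and domain.com/gallery/family.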
Never load, read, or import a file using a relative path; otherwise your code will behave unpredictably.
Always use absolute paths. In the first PHP file that serves as the entry point to your application (e.g. index.php), define a global root path like this:
define('ROOT', __DIR__);
Then use the ROOT constant for all file access and manipulation in your application:
$directory = ROOT.'/app/assets/image/gallery/family/';
You will never have problems with paths.

PHP failing with Out of Memory: Kill process - is it possible for me to make loops more efficient?

My PHP script runs out of memory with the server error "Out of memory: Kill process.." about 25% of the way through the process. Although it searches through about 10,000 lines, fewer than 200 lines match the criteria and therefore need to be stored and written to the file at the end of the process. So I am not sure why it is running out of memory.
Am I receiving this error because I am not clearing variables after each loop iteration, or do I need to increase the memory on the server?
The process in brief is:
- LOOPA - loop through list of 400 zip codes
- using one api call for each zip - get list of all places within each zip (typically about 40-50)
-- SUBLOOP1 - for each place found, use an api call to get all events for that place
---- SUBLOOP1A loop through events to count the number for each place
$zips = file($configFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$dnis = file($dniFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$s3->registerStreamWrapper();
$file = fopen("s3://{$bucket}/{$key}", 'w') or die("Unable to open file!");
fwrite($file, $type . " id" . "\t" . $type . " name" . "\t" . "zip" . "\t" . "event count" . "\n" );
foreach($zips as $n => $zip){
//first line is the label that describes the zips, so skip it
if ($n < 1) continue;
$params = $url;
$params .= "&q=" . $zip;
$more_node_pages = true;
while ($more_node_pages){
$res = fetchEvents($params);
//Now find the number of events for each place
foreach($res->data as $node){
//first check if on Do Not Include list
$countevents = true;
foreach($dnis as $dni) {
if ($dni == $node->id) {
echo "Not going to get events for ". $node->name . " id# " . $dni . "\n\n";
$countevents = false;
break;
}
}
//if it found a match, skip this and go to the next
if (!$countevents) continue;
$params = $url . $node->id . "/events/?fields=start_time.order(reverse_chronological)&limit=" . $limit . "&access_token=". $access_token;
//Count the number of valid upcoming events for that node
$event_count = 0;
$more_pages = true;
$more_events = true;
while ($more_pages) {
$evResponse = fetchEvents($params);
if (!empty($evResponse->error)) {
checkError($evResponse->error->message, $evResponse->error->code, $file);
}
//if it finds any events for that place, go through each event for that place one by one to count until you reach today
foreach($evResponse->data as $event){
if(strtotime($event->start_time) > strtotime('now')){
$event_count++;
}
//else we have reached today's events for this node, so get out of this loop, and don't retrieve any more events for this node
else {
$more_events = false;
break;
}
}
if (!empty($evResponse->paging->next) and $more_events) $params = $evResponse->paging->next;
else $more_pages = false;
} //end while loop looking for more pages with more events for that node (page)
if ($event_count > 0) {
fwrite($file, $node->id . "\t" . $node->name . "\t" . $zip . "\t" . $event_count . "\n");
echo $event_count . "\n";
}
} // loop back to the next place until done
//test to see if there is an additional page
if (!empty($res->paging->next)) $params = $res->paging->next; else $more_node_pages = false;
} //close while loop for $more_node_pages containing additional nodes for that zip
} // loop back to the next zip until done
fclose($file);
I would highly recommend adding output to the beginning of each nested loop. I think you most likely have an infinite loop, which is causing the script to run out of memory.
If that isn't the case, then you can try increasing the memory limit for your PHP script by adding this line of PHP to the top of your script:
ini_set("memory_limit", "5G");
If it takes more than 5GB of RAM for your script to process the 400 zip codes, I would recommend breaking your script up so that you can run zip codes 0-10 and then 11-20, then 21-30, etc.
Hope this helps, cheers.
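Splitting the run into batches could be sketched like this. zip_batch() is a hypothetical helper, and the generated zip codes are placeholders for the list read from the config file.

```php
<?php
// Hypothetical helper: return the zip codes for one batch (batches are numbered from 0).
function zip_batch(array $zips, int $batchNumber, int $batchSize): array
{
    return array_slice($zips, $batchNumber * $batchSize, $batchSize);
}

$zips = range(10001, 10025);   // placeholder data: 25 zip codes

// Walk through the batches until an empty one is returned.
for ($b = 0; ($batch = zip_batch($zips, $b, 10)) !== []; $b++) {
    echo "batch $b: ", count($batch), " zip codes\n";
    // ... run the existing per-zip loop over $batch here ...
}
```

To actually release memory between batches, each batch would need to run as a separate PHP process, for example by passing the batch number on the command line, since a single long-running process keeps whatever it has leaked.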
You need to find out where the memory is being lost and then you can either take care of it or work around it. memory_get_usage() is your friend - print it at the top (or bottom) of each loop with some identifier so you can see when & where you are using up memory.
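A minimal sketch of that kind of instrumentation follows. memory_report() is a hypothetical helper, and the loop body only simulates work that allocates memory.

```php
<?php
// Hypothetical helper: format current and peak memory usage with a label.
function memory_report(string $label): string
{
    return sprintf(
        '[%s] current: %.2f MB, peak: %.2f MB',
        $label,
        memory_get_usage() / 1048576,
        memory_get_peak_usage() / 1048576
    );
}

$rows = [];
foreach (['90001', '90002', '90003'] as $zip) {   // placeholder zip codes
    $rows[] = str_repeat('x', 100000);            // stand-in for work that allocates memory
    echo memory_report("zip $zip"), "\n";
}
```

Placing one such call at the top of each nested loop, with a distinct label per loop, shows which level the growth comes from: a figure that rises on every iteration of a loop and never falls points at data that is being kept alive there.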

Why use DOMDocument instead of PHP with HTML?

I still can't really wrap my head around the built in DOMDocument class.
Why should I use that instead of just doing it similar to the following?
I would like to know the benefits.
$URI = $_SERVER['REQUEST_URI'];
$navArr = Config::get('navigation');
$navigation = '<ul id="nav">' . "\n";
foreach($navArr as $name => $path) {
$navigation .= ' <li' . ((in_array($URI, $path)) ? ' class="active"' : false) . '>' . $name . '</li>' . "\n";
}
$navigation .= '</ul>' . "\n\n";
return $navigation;
Here's the same example using DOMDocument:
$doc = new DOMDocument;
$list = $doc->appendChild($doc->createElement('ul'));
$list->setAttribute('id', 'nav');
foreach ($navArr as $name => $path) {
$listItem = $list->appendChild($doc->createElement('li'));
if (in_array($URI, $path)) {
$listItem->setAttribute('class', 'active');
}
$link = $listItem->appendChild($doc->createElement('a'));
$link->setAttribute('href', $path[1]);
$link->appendChild($doc->createTextNode($name));
}
return $doc->saveHTML();
It's more verbose, but not too much, and possibly clearer what's happening at each step.
One benefit is character escaping: createTextNode and setAttribute ensure that the special HTML characters (quotes, ampersands and angle brackets) are escaped properly.
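The escaping difference can be seen in a small comparison. This is an illustrative sketch; the label in $name is a made-up value, chosen because it contains special characters.

```php
<?php
$name = 'Fish & <b>Chips</b>';   // hypothetical label containing special characters

// String concatenation inserts the label verbatim, so the <b> tag becomes live markup.
$concatenated = '<li>' . $name . '</li>';

// DOMDocument treats the label as text and escapes it when serializing.
$doc = new DOMDocument;
$item = $doc->appendChild($doc->createElement('li'));
$item->appendChild($doc->createTextNode($name));
$viaDom = trim($doc->saveHTML());

echo $concatenated, "\n";
echo $viaDom, "\n";
```

With plain string building, the equivalent protection requires remembering to wrap every dynamic value in htmlspecialchars(); with DOMDocument it happens as part of serialization.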
In the end, though, for a larger application, you'd probably want to use an actual templating language like Twig for generating HTML, as the templates are more readable and extensible.

Assetic is generating multiple files with same content

I have a class that uses Assetic to generate some css files to disk. I'll jump right into the code.
In my layout header, I'm doing something like this:
$assetify = new Assetify();
$assetify->setDebug(true);
$assetify->setAssetDirectory(BASE_DIR . '/public/assets');
$assetify->setOutputDirectory(BASE_DIR . '/public/assets/generated');
$assetify
->addStylesheet('/assets/css/bootstrap-2.3.2.css')
->addStylesheet('/assets/css/select2-3.4.3.css')
->addStylesheet('/assets/css/main.css');
echo $assetify->dump();
My "Assetify" class runs this through Assetic. I'll paste what I hope are only the relevant portions from the dump() function:
// The Asset Factory allows us to not have to do all the hard work ourselves.
$factory = new AssetFactory($this->assetDirectory, $this->debug);
$factory->setDefaultOutput('/generated/*.css');
// The Filter Manager allows us to organize filters for the asset handling.
// For other filters, see: https://github.com/kriswallsmith/assetic
$fm = new FilterManager();
$fm->set('yui_css', new Yui\CssCompressorFilter('/usr/local/bin/yuicompressor-2.4.7.jar'));
$fm->set('yui_js', new Yui\JsCompressorFilter('/usr/local/bin/yuicompressor-2.4.7.jar'));
$factory->setFilterManager($fm);
// The Asset Manager allows us to keep our assets organized.
$am = new AssetManager();
$factory->setAssetManager($am);
// The cache-busting worker prefixes every css with what amounts to a version number.
$factory->addWorker(new CacheBustingWorker());
$assetCollection = array();
foreach ($assetGroups as $assetGroup) {
foreach ($assetGroup as $media => $items) {
$fileCollection = array();
foreach ($items as $item) {
// Add this asset to the asset collection.
$fileCollection[] = new FileAsset($item);
}
$assetCollection[] = new AssetCollection($fileCollection);
}
}
$assetCollection = new AssetCollection($assetCollection);
$am->set('base_css', $assetCollection);
// Generate the required assets. Prefixing a filter name with a question mark
// will cause that filter to be omitted in debug mode.
$asset = $factory->createAsset(
array('#base_css'),
array('?yui_css')
);
// Configure an internal file system cache so we don't regenerate this file on every load.
$cache = new AssetCache(
$asset,
new FilesystemCache($this->outputDirectory)
);
// And generate static versions of the files on disk.
$writer = new AssetWriter($this->assetDirectory);
$writer->writeAsset($cache);
This generates two different files, 87229eb-f47a352.css and a37c1589762f39aee5bd24e9405dbdf9. The contents of the files are exactly the same. The 87229eb-f47a352.css file seems to get generated every single time, and the other file is not regenerated unless the contents of the files change (this is what I would like). If I comment out the $writer->writeAsset($cache), no files are written to disk.
What obvious configuration am I missing? I appreciate the help, thank you.
I was able to roughly replicate your code and got the same results.
I was trying to get the same results as what I think you require but ended up writing my own code to cache and serve static files.
It's not complete by any means but it is working. It has the following features:
You can choose to cache files for different pages if you specify $filename
You can choose to create versions of your released files or delete previous versions
A cached file is generated in your target folder only if a source file has changed
You just need to put the code into a class or function and return the URL to serve.
Hope it helps :)
<?php
use Assetic\Factory\AssetFactory;
use Assetic\AssetManager;
use Assetic\FilterManager;
use Assetic\Asset\AssetCollection;
use Assetic\Asset\FileAsset;
use Assetic\Filter\JSMinFilter;
// JavaScript Collection
$js_collection[] = new FileAsset(SCRIPT_PATH . 'jquery.js');
$js_collection[] = new FileAsset(SCRIPT_PATH . 'production.js');
if (file_exists(SCRIPT_PATH . $page_info['name'] . '.js')) {
$js_collection[] = new FileAsset(SCRIPT_PATH . $page_info['name'] . '.js');
}
// CSS Collection
$css_collection[] = new FileAsset(STYLE_PATH . 'theme.css');
if (file_exists(STYLE_PATH . $page_info['name'] . '.css')) {
$css_collection[] = new FileAsset(STYLE_PATH . $page_info['name'] . '.css');
}
// The Filter Manager allows us to organize filters for the asset handling.
$fm = new FilterManager();
$fm->set('js', new JSMinFilter());
$js = new AssetCollection (
$js_collection
);
$js->setTargetPath(SCRIPT_PATH . 'static');
$css = new AssetCollection (
$css_collection
);
$css->setTargetPath(STYLE_PATH . 'static');
$am = new AssetManager();
$am->set('js', $js);
$am->set('css', $css);
//** TO DO: put the below in a class and return the static file names **//
// options
$seperator = '-';
$filename = $page_info['name'];
$versions = false;
// get a list of all collection names
$collections = $am->getNames();
// get each collection
foreach ($collections as $collection_name) {
// get the collection object
$collection = $am->get($collection_name);
// ensure file types are identical
$last_ext = false;
foreach ($collection as $leaf) {
$ext = strtolower(pathinfo($leaf->getSourcePath(), PATHINFO_EXTENSION));
if (!$last_ext || $ext == $last_ext) {
$last_ext = $ext;
} else {
throw new \RuntimeException('File type mismatch.');
}
}
// get the highest last-modified value of all assets in the current collection
$modified_time = $collection->getLastModified();
// get the target path
$path = $collection->getTargetPath();
// the target path must be set
if (!$path) {
throw new \RuntimeException('Target path not specified.');
}
// build the filename to check
$file = ($filename) ? $filename . $seperator . $modified_time . '.' . $ext : $modified_time . '.' . $ext;
$cached_file = $path . '/' . $file;
// the file doesn't exist so we need to minify, dump and save as new cached file
if (!file_exists($cached_file)) {
// create the output dir if it doesnt exist
if (!is_dir($path) && false === @mkdir($path, 0777, true)) {
throw new \RuntimeException('Unable to create directory ' . $path);
}
// apply the filters
if ($fm->has($collection_name)) {
$collection->ensureFilter($fm->get($collection_name));
}
// If not versioned, delete previous version of this file
if (!$versions) {
if ($filename) {
foreach (glob($path . '/' . $filename . $seperator . '*.' . $ext) as $searchfile) {
@unlink($searchfile);
}
} else {
foreach (glob($path . '/*.' . $ext) as $searchfile) {
@unlink($searchfile);
}
}
}
// put the contents in the file
if (false === @file_put_contents($cached_file, $collection->dump())) {
throw new \RuntimeException('Unable to write file ' . $cached_file);
}
}
// return the cached file
echo 'output: ' . $cached_file . '<br>';
}
exit;
?>
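The core of the "regenerate only when a source changed" idea is the file name, which can be sketched without Assetic. cached_file_name() is a hypothetical helper written in plain PHP, not part of the Assetic API; it mirrors what the code above does with getLastModified().

```php
<?php
// Hypothetical helper: derive a cache file name from the newest
// modification time among the source files.
function cached_file_name(array $sources, string $ext, string $prefix = ''): string
{
    clearstatcache();   // make sure filemtime() does not return a stale cached value
    $newest = 0;
    foreach ($sources as $source) {
        $mtime = filemtime($source);
        if ($mtime === false) {
            throw new RuntimeException('Cannot read modification time of ' . $source);
        }
        $newest = max($newest, $mtime);
    }
    return ($prefix !== '' ? $prefix . '-' : '') . $newest . '.' . $ext;
}
```

As long as no source file changes, the name stays the same and a file_exists() check on it is enough to skip the minify-and-dump step. Editing any source file changes the name, which also works as a cache buster for browsers.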

Selenium2 firefox: use the default profile

Selenium2, by default, starts firefox with a fresh profile. I like that for a default, but for some good reasons (access to my bookmarks, saved passwords, use my add-ons, etc.) I want to start with my default profile.
There is supposed to be a property controlling this but I think the docs are out of sync with the source, because as far as I can tell webdriver.firefox.bin is the only one that works. E.g. starting selenium with:
java -jar selenium-server-standalone-2.5.0.jar -Dwebdriver.firefox.bin=not-there
works (i.e. it complains). But this has no effect:
java -jar selenium-server-standalone-2.5.0.jar -Dwebdriver.firefox.profile=default
("default" is the name in profiles.ini, but I've also tried with "Profile0" which is the name of the section in profiles.ini).
I'm using PHPWebdriver (which uses JsonWireProtocol) to access:
$webdriver = new WebDriver("localhost", "4444");
$webdriver->connect("firefox");
I tried doing it from the PHP side:
$webdriver->connect("firefox","",array('profile'=>'default') );
or:
$webdriver->connect("firefox","",array('profile'=>'Profile0') );
with no success (firefox starts, but not using my profile).
I also tried the hacker's approach of creating a batch file:
#!/bin/bash
/usr/bin/firefox -P default
And then starting Selenium with:
java -jar selenium-server-standalone-2.5.0.jar -Dwebdriver.firefox.bin="/usr/local/src/selenium/myfirefox"
Firefox starts, but not using my default profile and, worse, everything hangs: selenium does not seem able to communicate with firefox when started this way.
P.S. I saw "Selenium - Custom Firefox profile" and tried this:
java -jar selenium-server-standalone-2.5.0.jar -firefoxProfileTemplate "not-there"
And it refuses to run! Excited, thinking I might be on to something, I tried:
java -jar selenium-server-standalone-2.5.0.jar -firefoxProfileTemplate /path/to/0abczyxw.default/
This does nothing. I.e. it still starts with a new profile :-(
Simon Stewart answered this on the mailing list for me.
To summarize his reply: you take your firefox profile, zip it up (zip, not tgz), base64-encode it, then send the whole thing as a /session json request (put the base64 string in the firefox_profile key of the Capabilities object).
An example way to do this on Linux:
cd /your/profile
zip -r profile *
base64 profile.zip > profile.zip.b64
And then if you're using PHPWebDriver when connecting do:
$webdriver->connect("firefox", "", array("firefox_profile" => file_get_contents("/your/profile/profile.zip.b64")));
NOTE: It still won't be my real profile, rather a copy of it. So bookmarks won't be remembered, the cache won't be filled, etc.
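For clients other than PHPWebDriver, the request body described above could be assembled along these lines. This is a sketch under assumptions: build_session_payload() is a hypothetical helper, and the desiredCapabilities wrapper key is assumed from the legacy JsonWireProtocol rather than taken from the mailing-list reply.

```php
<?php
// Hypothetical helper: build the JSON body for a new-session request that carries
// a base64-encoded, zipped Firefox profile in the firefox_profile capability.
function build_session_payload(string $profileZipBytes): string
{
    return json_encode([
        'desiredCapabilities' => [          // assumed legacy JsonWireProtocol key
            'browserName'     => 'firefox',
            'firefox_profile' => base64_encode($profileZipBytes),
        ],
    ]);
}
```

Note that this helper expects the raw zip bytes, e.g. build_session_payload(file_get_contents('/your/profile/profile.zip')), and does the base64 encoding itself, so the separate profile.zip.b64 file is not needed here.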
Here is the Java equivalent. I am sure there is something similar available in php.
ProfilesIni profile = new ProfilesIni();
FirefoxProfile ffprofile = profile.getProfile("default");
WebDriver driver = new FirefoxDriver(ffprofile);
If you want to add additional extensions, you can do something like this as well.
ProfilesIni profile = new ProfilesIni();
FirefoxProfile ffprofile = profile.getProfile("default");
ffprofile.addExtension(new File("path/to/my/firebug.xpi"));
WebDriver driver = new FirefoxDriver(ffprofile);
java -jar selenium-server-standalone-2.21.0.jar -Dwebdriver.firefox.profile=default
should work. The bug is fixed.
Just update your selenium-server.
I was curious about this as well and what I got to work was very simple.
I use the command /Applications/Firefox.app/Contents/MacOS/firefox-bin -P to bring up the Profile Manager. After I found which profile I needed, I used the following code to activate it: browser = Selenium::WebDriver.for :firefox, :profile => "batman".
This pulled all of my bookmarks and plug-ins that were associated with that profile.
Hope this helps.
From my understanding, it is not possible to use the -Dwebdriver.firefox.profile=<name> command line parameter since it will not be taken into account in your use case because of the current code design. Since I faced the same issue and did not want to upload a profile directory every time a new session is created, I've implemented this patch that introduces a new firefox_profile_name parameter that can be used in the JSON capabilities to target a specific Firefox profile on the remote server. Hope this helps.
I did it in Zend like this:
public function indexAction(){
$appdata = 'C:\Users\randomname\AppData\Roaming\Mozilla\Firefox' . "\\";
$temp = 'C:\Temp\\';
$hash = md5(rand(0, 999999999999999999));
if(!isset($this->params['p'])){
shell_exec("\"C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe\" -CreateProfile " . $hash);
}else{
$hash = $this->params['p'];
}
$ini = new Zend_Config_Ini('C:\Users\randomname\AppData\Roaming\Mozilla\Firefox\profiles.ini');
$path = false;
foreach ($ini as $key => $value){
if(isset($value->Name) && $value->Name == $hash){
$path = $value->Path;
break;
}
}
if($path === false){
die('<pre>No profile found with name: ' . $hash);
}
echo "<pre>Profile : $hash \nProfile Path : " . $appdata . "$path \n";
echo "Files: \n";
$filesAndDirs = $this->getAllFiles($appdata . $path);
$files = $filesAndDirs[0];
foreach ($files as $file){
echo " $file\n";
}
echo "Dirs : \n";
$dirs = array_reverse($filesAndDirs[1]);
foreach ($dirs as $dir){
echo " $dir\n";
}
echo 'Zipping : ';
$zip = new ZipArchive();
$zipPath = md5($path) . ".temp.zip";
$zipRet = $zip->open($temp .$zipPath, ZipArchive::CREATE);
echo ($zipRet === true)?"Succes\n":"Error $zipRet\n";
echo "Zip name : $zipPath\n";
foreach ($dirs as $dir){
$zipRet = $zip->addEmptyDir($dir);
if(!($zipRet === true) ){
echo "Error creating folder: $dir\n";
}
}
foreach ($files as $file){
$zipRet = $zip->addFile($appdata . $path ."\\". $file,$file);
if(!($zipRet === true && file_exists($appdata . $path . "\\". $file) && is_readable($appdata . $path . "\\". $file))){
echo "Error zipping file: $appdata$path/$file\n";
}
}
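// Add prefs.js a second time under the name user.js, and check that file (not the last $file from the loop above).
$zipRet = $zip->addFile($appdata . $path ."\\prefs.js",'user.js');
if(!($zipRet === true && file_exists($appdata . $path . "\\prefs.js") && is_readable($appdata . $path . "\\prefs.js"))){
echo "Error zipping file: $appdata$path/prefs.js\n";
}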
$zipRet = $zip->close();
echo "Closing zip : " . (($zipRet === true)?("Succes\n"):("Error:\n"));
if($zipRet !== true){
var_dump($zipRet);
}
echo "Reading zip in string\n";
$zipString = file_get_contents($temp .$zipPath);
echo "Encoding zip\n";
$zipString = base64_encode($zipString);
echo $zipString . "\n";
require 'webdriver.php';
echo "Connecting Selenium\n";
$webDriver = new WebDriver("localhost",'4444');
if(!$webDriver->connect("firefox","",array('firefox_profile'=>$zipString)))
{
die('Selenium is not running');
}
}
private function getAllFiles($path,$WithPath = false){
$return = array();
$dirs = array();
if (is_dir($path)) {
if ($dh = opendir($path)) {
while (($file = readdir($dh)) !== false) {
if(!in_array($file, array('.','..'))){
if(is_dir($path . "\\" . $file)){
$returned = $this->getAllFiles($path . "\\" . $file,(($WithPath==false)?'':$WithPath) . $file . "\\");
$return = array_merge($return,$returned[0]);
$dirs = array_merge($dirs,$returned[1]);
$dirs[] = (($WithPath==false)?'':$WithPath) . $file;
}else{
$return[] = (($WithPath==false)?'':$WithPath) . $file;
}
}
}
closedir($dh);
}
}
return array($return,$dirs);
}
The idea is that you pass the name of the profile in the GET/POST/Zend parameter p; if it is missing, a profile with a random name is created. The action then zips all the profile files, stores the archive in the temp folder, and sends it to Selenium.
