I want to implement the Aho-Corasick algorithm. I've built a trie, and insert, search and delete all work. But when I tried to insert my test data (~56k words), PHP threw an error:
php reached memory limit on one script (128 mb).
So I think the problem is my incorrect use of references. Here is the part of my code where I get the error:
public function insert($key, $value) {
$current_node = &$this->root; // setting current node as root node
for ($i = 0; $i < mb_strlen($key, 'UTF-8'); $i++) {
$char = mb_substr($key, $i, 1, 'UTF-8');
$parent = &$current_node; // setting parent node
if (isset($current_node['children'][(string)$char])) {
$current_node = &$current_node['children'][(string)$char];
if (isset($current_node['isLeaf']))
unset($current_node['isLeaf']);
} else {
$current_node['children'][(string)$char] = [];
$current_node = &$current_node['children'][(string)$char];
}
$current_node['parent'] = &$parent;
if ($i == (mb_strlen($key, 'UTF-8') - 1)) {
$current_node['value'] = $value;
if (!isset($current_node['children'])) {
$current_node['isLeaf'] = true;
}
}
}
}
I think the problem is that I store whole arrays by reference, rather than just an address as in C/C++. So, can I solve this problem in C/C++ style? Does PHP have a tool for that? What if I write a PHP extension in C and then just add it to my PHP interpreter?
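For comparison, here is a minimal sketch of the same trie built from objects instead of nested arrays with `&` references. PHP objects are passed around by handle, so `$node = $node->children[$char]` only copies a small handle, which is the closest PHP gets to the pointer style asked about. The class and function names (`TrieNode`, `trieInsert`, `trieSearch`, `utf8Chars`) are made up for illustration, and the sketch leaves out the parent link; whether that is acceptable depends on how the failure links are built later. It has not been measured against the 56k-word data set.

```php
<?php
// Hypothetical node class; the names are illustrative, not from the original code.
final class TrieNode
{
    /** @var TrieNode[] child nodes keyed by a single UTF-8 character */
    public $children = [];
    public $hasValue = false; // true when a key ends at this node
    public $value = null;
}

// Split a UTF-8 string into characters once instead of calling mb_substr() per index.
function utf8Chars(string $key): array
{
    $chars = preg_split('//u', $key, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

function trieInsert(TrieNode $root, string $key, $value): void
{
    $node = $root; // copies the object handle, not the node itself
    foreach (utf8Chars($key) as $char) {
        if (!isset($node->children[$char])) {
            $node->children[$char] = new TrieNode();
        }
        $node = $node->children[$char];
    }
    $node->hasValue = true;
    $node->value = $value;
}

function trieSearch(TrieNode $root, string $key)
{
    $node = $root;
    foreach (utf8Chars($key) as $char) {
        if (!isset($node->children[$char])) {
            return null; // key is not in the trie
        }
        $node = $node->children[$char];
    }
    return $node->hasValue ? $node->value : null;
}

$root = new TrieNode();
trieInsert($root, 'he', 1);
trieInsert($root, 'she', 2);
trieInsert($root, 'шея', 3);
var_dump(trieSearch($root, 'she')); // int(2)
```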
Good morning,
I'm currently going through some hard lessons while trying to handle huge CSV files of up to 4 GB.
The goal is to search a CSV file (an Amazon datafeed) for items by a given browse node and also by some given item IDs (ASINs), to get a mix of existing items (in my database) plus some additional new items, since items disappear from the marketplace from time to time. I also filter on the item titles because many items share the same title.
I have read lots of tips here and finally decided to use PHP's fgetcsv(), thinking this function would not exhaust memory since it reads the file line by line.
But no matter what I try, I always run out of memory.
I cannot understand why my code uses so much memory.
I set the memory limit to 4096 MB and the time limit to 0. The server has 64 GB of RAM and two SSDs.
Could someone please check my code and explain how it is possible that I'm running out of memory and, more importantly, how the memory is being used?
private function performSearchByASINs()
{
$found = 0;
$needed = 0;
$minimum = 84;
if(is_array($this->searchASINs) && !empty($this->searchASINs))
{
$needed = count($this->searchASINs);
}
if($this->searchFeed == NULL || $this->searchFeed == '')
{
return false;
}
$csv = fopen($this->searchFeed, 'r');
if($csv)
{
$l = 0;
$title_array = array();
while(($line = fgetcsv($csv, 0, ',', '"')) !== false)
{
$header = array();
if(trim($line[6]) != '')
{
if($l == 0)
{
$header = $line;
}
else
{
$asin = $line[0];
$title = $this->prepTitleDesc($line[6]);
if(is_array($this->searchASINs)
&& !empty($this->searchASINs)
&& in_array($asin, $this->searchASINs)) //search for existing items to get them updated
{
$add = true;
if(in_array($title, $title_array))
{
$add = false;
}
if($add === true)
{
$this->itemsByASIN[$asin] = new stdClass();
foreach($header as $k => $key)
{
if(isset($line[$k]))
{
$this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
}
}
$title_array[] = $title;
$found++;
}
}
if(($line[20] == $this->bnid || $line[21] == $this->bnid)
&& count($this->itemsByKey) < $minimum
&& !isset($this->itemsByASIN[$asin])) // searching for new items
{
$add = true;
if(in_array($title, $title_array))
{
$add = false;
}
if($add === true)
{
$this->itemsByKey[$asin] = new stdClass();
foreach($header as $k => $key)
{
if(isset($line[$k]))
{
$this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
}
}
$title_array[] = $title;
$found++;
}
}
}
$l++;
if($l > 200000 || $found == $minimum)
{
break;
}
}
}
fclose($csv);
}
}
I know my answer is a bit late, but I had a similar problem with fgets() and things based on fgets(), like SplFileObject->current(). In my case it was on a Windows system when trying to read a file of more than 800 MB. I think fgets() doesn't free the memory of the previous line in a loop, so every line that was read stayed in memory and led to a fatal out-of-memory error. I fixed it by using fread() with a fixed length instead, but that is a bit trickier since you must supply the length.
It is very hard to manage large data sets in arrays without running into timeout issues. Why not load this datafeed into a database table instead and do the heavy lifting from there?
Have you tried this? SplFileObject::fgetcsv
<?php
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
    // read one row per iteration; without a read the loop never advances
    $row = $file->fgetcsv();
    //your code here
}
?>
You are running out of memory because you keep filling variables without ever calling unset(), and you use too many nested foreach loops. You could split that code into several smaller functions.
One solution would be to use a real database instead.
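As a rough sketch of the line-by-line approach discussed in these answers: a generator can hand out one row at a time, and the ASIN and title checks can use array keys with isset() instead of in_array(), which scans the whole array on every call. The helper name `readCsvRows`, the two-column layout and the sample data are invented for the example. It is not a diagnosis of where the memory in the question's code actually goes.

```php
<?php
// Yield one CSV row at a time so only the current row is held in memory.
function readCsvRows(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row !== [null]) { // fgetcsv() returns [null] for a blank line
                yield $row;
            }
        }
    } finally {
        fclose($handle);
    }
}

// Demo data in a temporary file; the column layout is made up.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "asin,title\nA1,Red mug\nB2,Blue mug\nC3,Red mug\nA1,Red mug\n");

$wanted = array_flip(['A1', 'C3']); // ASINs as keys: isset() instead of in_array()
$seenTitles = [];                   // titles as keys, for the same reason
$found = [];

foreach (readCsvRows($path) as $i => $row) {
    if ($i === 0) {
        continue; // skip the header row
    }
    [$asin, $title] = $row;
    if (isset($wanted[$asin]) && !isset($seenTitles[$title])) {
        $found[$asin] = $title;
        $seenTitles[$title] = true;
    }
}
unlink($path);

print_r($found);
```

Only the rows that pass the filter are kept, so memory grows with the number of matches rather than with the size of the file.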
I'm running into a "Cannot redeclare" error and I can't figure out how to fix it. I have a few functions in a PHP file, shown below. These functions iterate over an array of data.
I think I've worked out that the problem is that I'm calling the function over and over again in the foreach loop. It seems that the function is written to memory the first time, and then for some reason PHP doesn't like it being invoked again.
Your help is appreciated.
P.S. I've seen a number of similar posts on the issue, such as Fatal error: Cannot redeclare, but what they suggest doesn't seem to work.
<?php
// *****Code Omitted from Stack****
function postHelper($data, $field1, $field2)
{ //TODO Abstract and make sure post Helper and modify Post can be the same thing.
$result = array();
for ($j = 0; $j < count($data); ++$j) { //iterator over array
if ($field2 == "") {
$result[$j] = $data[$j][$field1];
} else {
return $result[$j] = $data[$j][$field1][$field2];
}
}
return $result;
}
//returns an array with only # and # values
function modifyPost($data)
{
//puts symbol # before read data
function addSymbol($data, $field1, $field2)
{
$info = postHelper($data, $field1, $field2);
foreach ($info as &$n) {
$n = '#' . $n;
}
print_r($info);
}
/*
Parse texts and returns an array with only # or # signs used
*/
function parseText($data)
{
$newarr = array();
$text = postHelper($data, "text", "");
foreach ($text as &$s) { //separates into words
$ex = explode(" ", $s);
foreach ($ex as &$n) { //if text doesnt' begin with '#' or '#' then throw it out.
if (substr($n, 0, 1) === '#' || strpos($n, '#') !== false) {
array_push($newarr, $n . ',');
}
}
}
return $newarr;
}
}
foreach ($posts as $entry) {
if (!function_exists('modifyPost')) {
$nval = "hello";
modifyPost($entry);
$entry['mod_post'] = $nval;
}
}
?>
EDIT: I've solved the error. It turns out that the original posts did actually work; I messed up the naming. I will give points to anyone who can explain to me why this is necessary for a call. I will also update the post if I have any additional questions.
PHP doesn't support nested functions. Although you technically can declare a function within a function:
function modifyPost($data)
{
function addSymbol($data, $field1, $field2)
the inner function becomes global as soon as the outer function runs, and the second attempt to declare it (by calling the outer function once again) will fail.
This behaviour seems counter-intuitive, but this is how it works at the moment. There's an RFC about real nested functions, which also lists several workarounds for the problem.
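One common workaround is to keep the helper in a closure, so that nothing is declared in the global function table. The sketch below uses a simplified, hypothetical signature for modifyPost() (it just takes a flat array of words), not the one from the question.

```php
<?php
function modifyPost(array $data): array
{
    // A closure lives in a local variable, so nothing is declared globally
    // and calling modifyPost() repeatedly is safe.
    $addSymbol = function (string $word): string {
        return '#' . $word;
    };

    return array_map($addSymbol, $data);
}

$posts = [['php', 'trie'], ['memory']];
foreach ($posts as $entry) {
    // The second call does not trigger "Cannot redeclare".
    print_r(modifyPost($entry));
}
```

Moving the inner functions to the top level of the file works just as well if they do not need to be private to modifyPost().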
The error says it all. You have duplicate modifyData() & parseText functions.
Remove the top half of the php file so only one of each occurs.
Could anyone help me?
I need to return multiple img tags, but with this code only one of the two is returned.
What is the solution?
Thank you in advance.
$test = "/claim/img/box.png, /claim/img/box.png";
function test($test)
{
$photo = explode(',', $test);
for ($i = 0; $i < count($photo); $i++)
{
$returnas = "<img src=".$photo[$i].">";
return $returnas;
}
}
This might be a good opportunity to learn about array_map.
function test($test) {
return implode("",array_map(function($img) {
return "<img src='".trim($img)."' />";
},explode(",",$test)));
}
Many functions make writing code a lot simpler, and it's also faster because it uses lower-level code.
While we're on the subject of learning things, PHP 5.5 gives us generators. You could potentially use one here. For example:
function test($test) {
$pieces = explode(",",$test);
foreach($pieces as $img) {
yield "<img src='".trim($img)."' />";
}
}
That yield is where the magic happens. This makes your function behave like a generator. You can then do this:
$images = test($test);
foreach($images as $image) echo $image;
Personally, I think this generator solution is a lot cleaner than the array_map one I gave earlier, which in turn is tidier than manually iterating.
Modify your code this way:
function test($test)
{
$returnas = '';
$photo = explode(',', $test);
for ($i = 0; $i < count($photo); $i++)
{
$returnas .= "<img src=".$photo[$i].">";
}
return $returnas;
}
Your code didn't work because you were returning inside the loop immediately, on the first iteration. In every programming language a function call returns only once. In my solution you append an img tag to a string each time you go through the loop, and return the string only after every photo has been processed.
You could even use the foreach() construct, of course
Bonus answer
If you don't know the difference between ...
for ($i = 0; $i < count($photo); $i++)
and
for ($i = 0, $count = count($photo); $i < $count; $i++)
Well, in the first case count($photo) is evaluated on every iteration of the loop, whereas in the second case it is evaluated only once.
This can be used for optimization purposes (even though PHP internally stores the length of an array, so it is accessible in O(1)).
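The foreach variant mentioned above sidesteps the count() question entirely. A minimal sketch, with the function renamed to the made-up `renderImages` and with trim() and htmlspecialchars() added as extras that the original answer does not use:

```php
<?php
function renderImages(string $list): string
{
    $returnas = '';
    foreach (explode(',', $list) as $path) {
        // trim() drops the space left after each comma
        $returnas .= '<img src="' . htmlspecialchars(trim($path)) . '">';
    }
    return $returnas; // single return, after the loop has finished
}

echo renderImages('/claim/img/box.png, /claim/img/box.png');
```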
The function exits at the first return statement. You need to collect what you want to return in some structure, an array for example, and return that.
function test($test)
{
$result = array();
$photo = explode(',', $test);
for ($i = 0; $i < count($photo); $i++)
{
$returnas = "<img src=".$photo[$i].">";
$result[] = $returnas;
}
return $result;
}
I am trying to decode encrypted data in PHP, however the return value keeps coming back as null.
The data to be decrypted comes into the PHP file as a data argument.
$dataArg1 = $_REQUEST["data"];
// Retrieve $encryptedData from storage ...
//
// Load the private key and decrypt the encrypted data
$encryptedData = $dataArg1;
$privateKey = array ( array(123456,654321,123456), array(123456,1234),
array(1234567,4321)
);
openssl_private_decrypt($encryptedData, $sensitiveData, $privateKey);
The function above comes from the second response of another posting here on Stack Overflow:
How to encrypt data in javascript and decrypt in php?
I assume that the decrypted value is in the PHP variable, $sensitiveData.
When I echo that to the screen, I get nothing.
echo("sensitiveData=[$sensitiveData]<br />");
Thoughts?
UPDATE:
The return value from openssl_private_decrypt() is FALSE, and $sensitiveData is NULL.
UPDATE 2:
I created the public/private key from the following URL.
http://shop-js.sourceforge.net/crypto2.htm
At the bottom, there is the line:
And put the following in your private script (probably on your local hard disk -- not on the internet -- if your private key is found this whole thing is useless.)
<script>
function decrypt() {
// key = [ [d], [p], [q] ];
var key=[[123456789,123456789,123456789],[123456789,1234],[123456789,4321]];
document.form.text.value=rsaDecode(key, document.form.text.value);
}
</script>
(actual values changed)
I translated the "var key=" line to PHP (per my other posting), using nested arrays as shown above. I then passed that key to the decrypt function.
My thought is that the PHP documentation calls the private key "mixed". I am wondering if maybe I need a different format for the private key.
Here is the output:
dataArg1=[jmOdss9ktFc\"WO5eltUZXt0rpqS1NluNKa]
bResult=[]
sensitiveData=[]
var_dump=[NULL ]
$privateKey has to be in a certain format. You can't just throw in random data to it and magically expect it to know what to do with it.
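To illustrate the format point: PHP's OpenSSL functions accept, among other things, a PEM-encoded key string. The sketch below generates a throwaway key pair just to show a round trip. It assumes the openssl extension is loaded and can find its configuration file, and it has nothing to do with the key format produced by the JavaScript library in the question.

```php
<?php
// Generate a throwaway RSA key pair purely for demonstration.
$keyPair = openssl_pkey_new([
    'private_key_bits' => 2048,
    'private_key_type' => OPENSSL_KEYTYPE_RSA,
]);
if ($keyPair === false) {
    throw new RuntimeException('Key generation failed: ' . openssl_error_string());
}

// Export the private key as PEM text (a string starting with "-----BEGIN").
openssl_pkey_export($keyPair, $privateKeyPem);
// The matching public key, also PEM-encoded.
$publicKeyPem = openssl_pkey_get_details($keyPair)['key'];

openssl_public_encrypt('secret message', $encrypted, $publicKeyPem);

// Always check the return value instead of echoing the output variable blindly.
if (openssl_private_decrypt($encrypted, $decrypted, $privateKeyPem)) {
    echo $decrypted, PHP_EOL;
} else {
    echo 'Decryption failed: ', openssl_error_string(), PHP_EOL;
}
```

A nested array of integers, as in the question, is not one of the accepted key formats, which is consistent with the FALSE return value reported there.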
Also, looking at the js you're using, it's not just doing RSA. It has a function named base64ToText. It decodes the ciphertext with that, takes the first byte as the length of the "encrypted session key", extracts the "encrypted session key", decrypts that with RSA and then uses the result as the RC4 key to decrypt the rest. But there are a number of problems with that too. Among other things, base64ToText isn't the same thing as PHP's base64_decode, as the name might imply.
Anyway, I wasn't able to get it working. Personally, I'd recommend something more like this (which is interoperable with PHP / phpseclib's Crypt_RSA):
http://area51.phpbb.com/phpBB/viewtopic.php?p=208860
That said, I did manage to figure a few things out. Your js lib uses base-28. To convert numbers from that format to one phpseclib uses you'll need to use this function:
function conv_base($num)
{
$result = pack('N', $num[count($num) - 1]);
for ($i = count($num) - 2; $i >= 0; --$i) {
_base256_lshift($result, 28);
$result = $result | str_pad(pack('N', $num[$i]), strlen($result), chr(0), STR_PAD_LEFT);
}
return $result;
}
function _base256_lshift(&$x, $shift)
{
if ($shift == 0) {
return;
}
$num_bytes = $shift >> 3; // eg. floor($shift/8)
$shift &= 7; // eg. $shift % 8
$carry = 0;
for ($i = strlen($x) - 1; $i >= 0; --$i) {
$temp = ord($x[$i]) << $shift | $carry;
$x[$i] = chr($temp);
$carry = $temp >> 8;
}
$carry = ($carry != 0) ? chr($carry) : '';
$x = $carry . $x . str_repeat(chr(0), $num_bytes);
}
Here's the script I used to confirm the correctness of that:
<?php
include('Math/BigInteger.php');
$p = array(242843315,241756122,189);
$q = array(177094647,33319298,129);
$n = array(45173685,178043534,243390137,201366668,24520);
$p = new Math_BigInteger(conv_base($p), 256);
$q = new Math_BigInteger(conv_base($q), 256);
$n = new Math_BigInteger(conv_base($n), 256);
$test = $p->multiply($q);
echo $test . "\r\n" . $n;
ie. they match.
I also ported your js's base64ToText to PHP:
function decode($t)
{
static $b64s = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"';
$r = '';
$m = $a = 0;
for ($n = 0; $n < strlen($t); $n++) {
$c = strpos($b64s, $t[$n]);
if ($c !== false) { // strpos() returns false when the character is not found, and false >= 0 is true in PHP
if ($m) {
$r.= chr(($c << (8-$m))&255 | $a);
}
$a = $c >> $m;
$m+=2;
if ($m == 8) {
$m = 0;
}
}
}
return $r;
}
Among other potential problems I may have encountered... who knows if their RC4 implementation is correct? Their base64 implementation isn't so it wouldn't be without precedent for the RC4 implementation to be broken too.
My script is a spider that checks whether a page is a "links page" or an "information page".
If the page is a "links page", it continues recursively (walking a tree, if you will)
until it finds an "information page".
I tried to make the script recursive, which was easy, but I kept getting this error:
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to
allocate 39 bytes) in /srv/www/loko/simple_html_dom.php on line 1316
I was told I would have to use nested for loops instead, because the script won't free memory even if I use unset(), and since I only have three levels to loop through, that makes sense. But after I changed the script the error occurs again. Maybe I can free memory now?
Something needs to die here, please help me destruct someone!
set_time_limit(0);
ini_set('memory_limit', '256M');
require("simple_html_dom.php");
$thelink = "http://www.somelink.com";
$html1 = file_get_html($thelink);
$ret1 = $html1->find('#idTabResults2');
// first inception level, we know page has only links
if (!$ret1){
$es1 = $html1->find('table.litab a');
//unset($html1);
$countlinks1 = 0;
foreach ($es1 as $aa1) {
$links1[$countlinks1] = $aa1->href;
$countlinks1++;
}
//unset($es1);
//for every link in array do the same
for ($i = 0; $i < $countlinks1; $i++) {
$html2 = file_get_html($links1[$i]);
$ret2 = $html2->find('#idTabResults2');
// if got information then send to DB
if ($ret2){
pullInfo($html2);
//unset($html2);
} else {
// continue inception
$es2 = $html2->find('table.litab a');
$html2 = null;
$countlinks2 = 0;
foreach ($es2 as $aa2) {
$links2[$countlinks2] = $aa2->href;
$countlinks2++;
}
//unset($es2);
for ($j = 0; $j < $countlinks2; $j++) {
$html3 = file_get_html($links2[$j]);
$ret3 = $html3->find('#idTabResults2');
// if got information then send to DB
if ($ret3){
pullInfo($html3);
} else {
// inception level three
$es3 = $html3->find('table.litab a');
$html3 = null;
$countlinks3 = 0;
foreach ($es3 as $aa3) {
$links3[$countlinks3] = $aa3->href;
$countlinks3++;
}
for ($k = 0; $k < $countlinks3; $k++) {
echo memory_get_usage() ;
echo "\n";
$html4 = file_get_html($links3[$k]);
$ret4 = $html4->find('#idTabResults2');
// if got information then send to DB
if ($ret4){
pullInfo($html4);
}
unset($html4);
}
unset($html3);
}
}
}
}
}
function pullInfo($html)
{
$tds = $html->find('td');
$count =0;
foreach ($tds as $td) {
$count++;
if ($count==1){
$name = html_entity_decode($td->innertext);
}
if ($count==2){
$address = addslashes(html_entity_decode($td->innertext));
}
if ($count==3){
$number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
}
}
unset($tds, $td);
$name = mysql_real_escape_string($name);
$address = mysql_real_escape_string($address);
$number = mysql_real_escape_string($number);
$inAlready=mysql_query("SELECT * FROM people WHERE phone=$number");
while($e=mysql_fetch_assoc($inAlready))
$output[]=$e;
if (json_encode($output) != "null"){
//print(json_encode($output));
} else {
mysql_query("INSERT INTO people (name, area, phone)
VALUES ('$name', '$address', '$number')");
}
}
And here is a picture of the growth in memory size:
I modified the code a little to free as much memory as I could see a way to free.
I've added a comment above each modification. The added comments start with "#" so you could find them easier.
This is not related to this question, but worth mentioning that your database insertion code is vulnerable to SQL injection.
<?php
require("simple_html_dom.php");
$thelink = "http://www.somelink.co.uk";
# do not keep raw contents of the file on memory
#$data1 = file_get_contents($thelink);
#$html1 = str_get_html($data1);
$html1 = str_get_html(file_get_contents($thelink));
$ret1 = $html1->find('#idResults2');
// first inception level, we know page has only links
if (!$ret1){
$es1 = $html1->find('table.litab a');
# free $html1, not used anymore
unset($html1);
$countlinks1 = 0;
foreach ($es1 as $aa1) {
$links1[$countlinks1] = $aa1->href;
$countlinks1++;
// echo (addslashes($aa->href));
}
# free memory used by the $es1 value, not used anymore
unset($es1);
//for every link in array do the same
for ($i = 0; $i < $countlinks1; $i++) {
# do not keep raw contents of the file on memory
#$data2 = file_get_contents($links1[$i]);
#$html2 = str_get_html($data2);
$html2 = str_get_html(file_get_contents($links1[$i]));
$ret2 = $html2->find('#idResults2');
// if got information then send to DB
if ($ret2){
pullInfo($html2);
} else {
// continue inception
$es2 = $html2->find('table.litab a');
# free memory used by $html2, not used anymore.
# we would unset it at the end of the loop.
$html2 = null;
$countlinks2 = 0;
foreach ($es2 as $aa2) {
$links2[$countlinks2] = $aa2->href;
$countlinks2++;
}
# free memory used by $es2
unset($es2);
for ($j = 0; $j < $countlinks2; $j++) {
# do not keep raw contents of the file on memory
#$data3 = file_get_contents($links2[$j]);
#$html3 = str_get_html($data3);
$html3 = str_get_html(file_get_contents($links2[$j]));
$ret3 = $html3->find('#idResults2');
// if got information then send to DB
if ($ret3){
pullInfo($html3);
}
# free memory used by $html3, or on the last iteration the memory would not get freed
unset($html3);
}
}
# free memory used by $html2, or on the last iteration the memory would not get freed
unset($html2);
}
}
function pullInfo($html)
{
$tds = $html->find('td');
$count =0;
foreach ($tds as $td) {
$count++;
if ($count==1){
$name = addslashes($td->innertext);
}
if ($count==2){
$address = addslashes($td->innertext);
}
if ($count==3){
$number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
}
}
# check for available data:
if ($count) {
# free $tds and $td
unset($tds, $td);
mysql_query("INSERT INTO people (name, area, phone)
VALUES ('$name', '$address', '$number')");
}
}
Update:
You could trace your memory usage to see how much memory is being used in each section of your code. This can be done by calling memory_get_usage() and saving the result to a file, for example by placing the code below at the end of each of your loops, or before creating objects or calling heavy methods:
file_put_contents('memory.log', 'memory used in line ' . __LINE__ . ' is: ' . memory_get_usage() . PHP_EOL, FILE_APPEND);
So you could trace the memory usage of each part of your code.
In the end, remember that all this tracing and optimization might not be enough, since your application might really need more than 32 MB of memory. I've developed a system that analyzes several data sources, detects spammers and then blocks their SMTP connections. Since the number of connected users is sometimes over 30000, even after a lot of code optimization I had to increase the PHP memory limit to 768 MB on the server, which is not a common thing to do.
If your operation requires memory and your server has more memory available, you can call ini_set('memory_limit', '128M'); or something similar (depending on your memory requirement) to increase the amount of memory available to the script.
This does not mean you should not optimise and refactor your code :-) this is just one part.
The solution was to use the clear method, for example:
$html4->clear(); // a simple_html_dom method that frees memory when you are finished with the DOM object
If you want to learn more, enter this website.
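Why an explicit clear() can help where unset() alone does not: if the nodes of a tree point at each other (parent to child and child back to parent), unset() only drops one reference and the cycle keeps the objects alive until PHP's cycle collector runs. The sketch below demonstrates this with a stand-in class, `DemoNode`, which is NOT simple_html_dom's real implementation. That simple_html_dom's clear() works the same way internally is an assumption on my part.

```php
<?php
// Stand-in for a DOM node; NOT simple_html_dom's real class.
final class DemoNode
{
    public static $destroyed = 0;   // counts destructor calls
    public $parent = null;
    public $children = [];

    public function addChild(DemoNode $child): void
    {
        $child->parent = $this;     // child -> parent
        $this->children[] = $child; // parent -> child: a reference cycle
    }

    // Break the cycles so plain reference counting can free the tree.
    public function clear(): void
    {
        foreach ($this->children as $child) {
            $child->clear();
        }
        $this->children = [];
        $this->parent = null;
    }

    public function __destruct()
    {
        self::$destroyed++;
    }
}

gc_disable(); // make the effect visible without the cycle collector stepping in

$leaky = new DemoNode();
$leaky->addChild(new DemoNode());
unset($leaky);
echo DemoNode::$destroyed, PHP_EOL; // 0: the cycle keeps both nodes alive

$clean = new DemoNode();
$clean->addChild(new DemoNode());
$clean->clear();
unset($clean);
echo DemoNode::$destroyed, PHP_EOL; // 2: both nodes were freed immediately
gc_enable();
```

With the collector enabled, the first tree would be reclaimed eventually as well, but not necessarily before a tight memory limit is hit inside a loop.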
Firstly, let's turn this into a truly recursive function, should make it easier to modify the whole chain of events that way:
function findInfo($thelink)
{
$data = file_get_contents($thelink); //Might want to make sure that it's a valid link, i.e. that file get contents actually returned stuff, before trying to run further with it.
$html = str_get_html($data);
unset($data); //Finished using it, no reason to keep it around.
$ret = $html->find('#idResults2');
if($ret)
{
pullInfo($html);
return true; //Should stop once it finds it right?
}
else
{
$es = $html->find('table.litab a'); //Might want a little error checking here to make sure it actually found links.
unset($html); //Finished using it, no reason to keep it around
$countlinks = 0;
foreach($es as $aa)
{
$links[$countlinks] = $aa->href;
$countlinks++;
}
unset($es); //Finished using it, no reason to keep it around.
for($i = 0; $i < $countlinks; $i++)
{
$result = findInfo($links[$i]);
if($result === true)
{
return true; //To break out of above recursive functions if lower functions return true
}
else
{
unset($links[$i]); //Finished using it, no reason to keep it around.
continue;
}
}
}
return false; //Will return false if all else failed, should hit a return true before this point if it successfully finds an info page.
}
See if that helps at all with the cleanup. It will probably still run out of memory, but at least you shouldn't be holding on to the full HTML of every page scanned.
Oh, and if you only want it to go only so deep, change the function declaration to something like:
function findInfo($thelink, $depth = 1, $maxdepth = 3)
Then when calling the function within the function, call it like so:
findInfo($links[$i], $depth + 1, $maxdepth); //pass $maxdepth along so it can be overridden in the initial call, like findInfo($thelink, 1, 4)
and then check depth against maxdepth at the start of the function and have it return false if depth is greater than maxdepth.
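Here is a small self-contained sketch of that depth check. To keep it runnable without any network access or simple_html_dom, the "site" is a hypothetical in-memory array, and this findInfo() returns the information it finds instead of calling pullInfo(), so the signature differs from the one above.

```php
<?php
// Hypothetical in-memory "site": each key is a page, each value either
// a list of links (a links page) or a string (an information page).
$site = [
    'root' => ['a', 'b'],
    'a'    => ['c'],
    'b'    => 'info from b',
    'c'    => ['d'],
    'd'    => 'info from d',
];

function findInfo(array $site, string $link, int $depth = 1, int $maxdepth = 3)
{
    if ($depth > $maxdepth || !isset($site[$link])) {
        return false; // too deep, or a dead link
    }
    $page = $site[$link];
    if (is_string($page)) {
        return $page; // information page reached
    }
    foreach ($page as $next) {
        $result = findInfo($site, $next, $depth + 1, $maxdepth);
        if ($result !== false) {
            return $result; // stop at the first information page found
        }
    }
    return false;
}

var_dump(findInfo($site, 'root'));       // string(11) "info from b"
var_dump(findInfo($site, 'root', 1, 1)); // bool(false)
```

With the default limit of 3 the branch through 'a' is cut off before it reaches 'd', so the search falls through to 'b'.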
If memory usage is your primary concern, you may want to consider using a SAX-based parser. Coding using a SAX parser can be a bit more complicated, but it's not necessary to keep the entire document in memory.
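As a rough idea of what that looks like in PHP, the sketch below uses the event-based parser from the xml extension to collect link targets without building a tree. The function name `extractLinks` and the sample markup are made up, and there is a big caveat: this parser expects well-formed XML, which real-world HTML often is not, so treat it as an illustration of the style rather than a drop-in replacement for simple_html_dom.

```php
<?php
// Collect href attributes from <a> elements without building a DOM tree.
function extractLinks(string $xml): array
{
    $links = [];
    $parser = xml_parser_create('UTF-8');
    // Keep element and attribute names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every opening tag as the parser streams through the input.
        function ($parser, string $name, array $attrs) use (&$links) {
            if ($name === 'a' && isset($attrs['href'])) {
                $links[] = $attrs['href'];
            }
        },
        // Called for every closing tag; nothing to do here.
        function ($parser, string $name) {
        }
    );

    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $links;
}

$page = '<table class="litab"><tr>'
      . '<td><a href="/page1">One</a></td>'
      . '<td><a href="/page2">Two</a></td>'
      . '</tr></table>';

print_r(extractLinks($page));
```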