Updating an array before MySQL insert? - php

I'm receiving XML via php://input and using SimpleXML to break the elements down into variables. What I want to do is append to an array, or build up an array of these variables, every 30 seconds or so.
The reason is that this script will be receiving regular inputs, and rather than doing loads of MySQL updates or inserts, I assume batching them might be better for efficiency.
So, a couple of questions if anyone has a moment.
1) Is there a way to check for new input on php://input?
2) Is there a better way to do this repeated check than the sleep function?
3) How do I append/add these updating variables to an array?
I haven't gotten very far yet, so the code isn't much, but please forgive its simplicity:
function input() {
    // $input wasn't defined before; read the raw request body here
    $input = file_get_contents('php://input');
    $xml = new SimpleXMLElement($input);
    $session_id = (string) $xml->session_id;
    $ip = (string) $xml->ip;
    $browser = (string) $xml->browser;
    store($session_id, $ip, $browser);
}

function store($session_id, $ip, $browser) {
    // this is where I'd like to accumulate the incoming values into arrays
    $session_id = array();
    $ip = array();
    $browser = array();
}

If I understand you correctly, it seems that you are trying to use PHP for a long-running stateful program. I trust you know the following:
PHP programs generally don't run for longer than a few milliseconds, or at most a few seconds for a typical web application. Every time a resource is requested from the PHP handler, parsing begins anew and no program state remains from the previous execution. Being a stateless environment, it is up to you to maintain state. For this reason, PHP is not made to handle input that changes over time, or to maintain state between requests.
That being said, the simplest ways to append to an array are the following:
$myarray[] = "newvalue";
or
$myarray['newkey'] = "newvalue";
To process the stream:
$handle = fopen('php://input', 'r');
while (!feof($handle)) { $data = fgets($handle, 4096); /* process $data */ }
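Since the goal is to avoid one MySQL insert per input, here is a rough sketch of buffering rows and flushing them with a single multi-row INSERT via PDO. The connection details, the visits table and its columns are made up for illustration, and $xml stands for the parsed payload from the question; adjust to your schema:
// A sketch only: buffer parsed rows, then flush with one multi-row INSERT
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$buffer = array();

// Each time a payload is parsed, append a row to the buffer:
$buffer[] = array((string) $xml->session_id, (string) $xml->ip, (string) $xml->browser);

// Every so often (or once the buffer reaches a certain size), flush it in one statement:
if (count($buffer) >= 50) {
    $placeholders = implode(', ', array_fill(0, count($buffer), '(?, ?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO visits (session_id, ip, browser) VALUES $placeholders");
    $flat = array();
    foreach ($buffer as $row) {
        foreach ($row as $value) {
            $flat[] = $value;
        }
    }
    $stmt->execute($flat);
    $buffer = array();
}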


Can't get array after doing exec with OOP in PHP

I'm trying to get the values of an array using OOP, but I want to do it using exec (I have to).
This is my exec.php:
include('PriceList.php');
for ($i = 0; $i < 1100; $i++) {
    $tableau[] = $i;
}
$lstPrix = new PriceList($tableau);
exec("php execute.php ");
execute.php
include('PriceList.php');
call_user_func( 'PriceList::getLstPrix' );
and a simple class PriceList.class.php
class PriceList
{
    public static $_lstPrix = array();

    public function __construct($lstPrix) {
        self::$_lstPrix = $lstPrix;
    }

    public static function getLstPrix() {
        return self::$_lstPrix;
    }
}
I'm trying to get the values of my array but it doesn't work. Where am I going wrong? Some help please.
The problem in your case is that you set your prices in one process. Then, when the end of the code is reached, your static variable is destroyed.
Then, you launch a new process and try to get the prices. The static variable is empty as it wasn't persisted in any way and it's not the same process. That is why you get an empty array.
I know that what you are trying to achieve here would work in some other environments, like an ASP.NET website on IIS, as long as you use the same application pool: if you set the static variable in one request, you can get the value in another request later.
You should save your list somewhere and read it back later. I would consider a database, or maybe serializing the data and storing it in a file.
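As a rough illustration of the serialize-to-a-file option (the file path is just an example):
// exec.php: persist the list before launching the second process
file_put_contents('/tmp/lstPrix.dat', serialize($tableau));
exec('php execute.php');

// execute.php: read the list back instead of relying on the static property
$lstPrix = unserialize(file_get_contents('/tmp/lstPrix.dat'));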
For reasons well explained in a previous answer, you simply cannot do that.
However, you should know that while you cannot share data between processes, you can pass data from the caller process to the child process using various tricks (also known as inter-process communication, or IPC).
Passing data from caller to child
Here is one simple method, using argv:
exec.php
$array = range(0,1100);
exec('php execute.php '.json_encode($array));
execute.php
$array = json_decode($argv[1]);
Another method, using stdin:
exec.php
$array = range(0,1100);
exec('echo '.json_encode($array).' | php execute.php');
execute.php
$array = json_decode(fgets(STDIN));
Note: there might be some escaping needed in some cases; check escapeshellarg().
Another method on POSIX systems is to use proc_open instead of exec, and create a writeable pipe to the child stdin and write your data there (check the example in the doc, it's well explained).
Passing results back
Using exec, your child process can just echo JSON, which you can decode in the caller.
With proc_open, you need a pipe on stdout that the child will write to. Also pretty well explained in the PHP doc.
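For reference, a minimal proc_open() sketch of that pipes approach (error handling omitted; just to give the shape of it):
// Caller: send JSON on the child's stdin, read JSON back from its stdout
$descriptors = array(
    0 => array('pipe', 'r'),  // child stdin, we write to it
    1 => array('pipe', 'w'),  // child stdout, we read from it
);
$process = proc_open('php execute.php', $descriptors, $pipes);
fwrite($pipes[0], json_encode(range(0, 1100)));
fclose($pipes[0]);
$result = json_decode(stream_get_contents($pipes[1]));
fclose($pipes[1]);
proc_close($process);

// execute.php: read stdin, write stdout
$array = json_decode(stream_get_contents(STDIN));
echo json_encode($array);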
Taking into account all the comments above, you should be able to make this work by sending JSON and returning JSON. This is what you can do:
exec.php:
for ($i = 0; $i < 1100; $i++) {
$tableau[] = $i;
}
$json = json_encode($tableau);
$returned_json = exec("php execute.php $json");
$returned_array = json_decode($returned_json);
var_dump($returned_array);
execute.php
include('PriceList.php');
$array = json_decode($argv[1]);
$lstPrix = new PriceList($array);
$result = $lstPrix::getLstPrix();
echo json_encode($result);

PHP - Hit counter textfile reset

I have an issue with my non-unique hit counter.
The script is as below:
$filename = 'counter.txt';
if (file_exists($filename)) {
    $current_value = file_get_contents($filename);
} else {
    $current_value = 0;
}
$current_value++;
file_put_contents($filename, $current_value);
When I'm refreshing my website very often (like 10 times per second or even faster), the value in the text file gets reset to 0.
Any guesses for fixing this issue?
This is a pretty poor way to maintain a counter, but your problem is probably that when you fire multiple requests at the site, one of the calls to file_exists() returns false because one of the other processes is removing and recreating the file.
If you want this to work consistently you are going to have to lock the file between the read and the write; see flock() in the PHP manual.
Of course, without the file lock you would also be getting incorrect counts anyway, whenever two processes manage to read the same value from the file.
Locking the file would also potentially slow your system down as two or more processes queue for access to it.
It would probably be a better idea to store your counter in a database, as databases are designed to cope with this kind of rapid-fire access and to ensure every process is properly queued and released.
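If you do stay with the file, a minimal flock()-based version of the counter might look like this (a sketch, not battle-tested):
$filename = 'counter.txt';
$handle = fopen($filename, 'c+');          // create if missing, don't truncate
if ($handle && flock($handle, LOCK_EX)) {  // exclusive lock around the read-modify-write
    $current_value = (int) stream_get_contents($handle);
    $current_value++;
    ftruncate($handle, 0);
    rewind($handle);
    fwrite($handle, (string) $current_value);
    fflush($handle);
    flock($handle, LOCK_UN);
}
if ($handle) {
    fclose($handle);
}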
Does it help if you add a check that file_get_contents() isn't returning false?
$value = file_get_contents($filename);
if ($value !== false) {
    $current_value = $value;
}

What is the best way to process large data in PHP

I have a daily cron job which gets an XML file from a web service. Sometimes it is large, containing more than 10K products' information, and the XML can be around 14 MB in size.
What I need to do is parse the XML into objects and then process them. The processing is quite complicated. It's not just putting them straight into the database; I need to do a lot of operations on them and finally write them into many database tables.
It is all in one PHP script. I don't have any experience with dealing with large data.
The problem is that it takes a lot of memory and a very long time. I raised my localhost PHP memory_limit to 4G and it ran for 3.5 hours before succeeding, but my production host does not allow that much memory.
I have done some research, but I am confused about the right way to deal with this situation.
Here is a sample of my code:
function my_items_import($xml){
    $results = new SimpleXMLElement($xml);
    $results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
    // it will loop over 10K items
    foreach ($results->xpath('//i:Item') as $data) {
        $data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
        // my processing code here; it calls other functions to do a lot of things
        processing($data);
    }
    unset($results);
}
As a start, don't use SimpleXMLElement on the whole document. SimpleXMLElement loads everything into memory and is not efficient for large data. Here is a snippet from real code. You'll need to adapt it to your case, but hopefully you'll get the general idea.
$reader = new XMLReader();
$reader->xml($xml);
// Move the cursor to the first article
while ($reader->read() && $reader->name !== 'article');
// Iterate over articles
while ($reader->name === 'article') {
    $doc = new DOMDocument('1.0', 'UTF-8');
    $article = simplexml_import_dom($doc->importNode($reader->expand(), true));
    processing($article);
    $reader->next('article');
}
$reader->close();
$article is a SimpleXMLElement which can be processed further.
This way you save a lot of memory by bringing only single article nodes into memory.
Additionally, if each processing() call takes a long time, you can turn it into a background process which runs separately from the main script, so that several processing() calls can run in parallel.
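If you go the background route, one low-tech way is to launch a detached CLI process per chunk of work. This is a sketch: worker.php and its argument handling are made up, $chunk stands for whatever serializable slice of an item you hand off, and for very large payloads you'd pass a temp file path instead of the data itself:
// Hand a chunk of work to a background worker; the trailing "&" detaches it (POSIX shells),
// and redirecting output lets exec() return immediately instead of waiting.
$payload = escapeshellarg(json_encode($chunk));
exec("php worker.php $payload > /dev/null 2>&1 &");

// worker.php would then do roughly:
//   $data = json_decode($argv[1], true);
//   processing($data);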
Key hints:
Dispose of data as you process it.
By "dispose" I mean overwrite it with blank data; note that unset() is slower than overwriting with null.
Use functions or static methods; avoid creating more object instances than necessary.
One extra question: how long does it take to loop over your XML without doing all the heavy work? For example:
function my_items_import($xml){
    $results = new SimpleXMLElement($xml);
    $results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
    // it will loop over 10K items
    foreach ($results->xpath('//i:Item') as $data) {
        $data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
        // processing($data); // heavy processing commented out for the timing test
    }
    // unset($results); // no need
}

Storing data in /tmp from a forked process in php

For a while now, I've been storing serialized objects from forked processes in /tmp with file_put_contents.
Once all child processes wrap up, I'm simply using file_get_contents and unserializing the data to rebuild my object for processing.
So my question is: is there a better way of storing my data without writing to /tmp?
Outside of storing the data in a file, the only other native solutions that come to mind are shm (http://www.php.net/manual/en/function.shm-attach.php) or stream socket pairs (http://www.php.net/manual/en/function.stream-socket-pair.php).
Either of these should be doable if the data collected is unimportant after the script has run. The idea behind both of them is simply to open a communication channel between your parent and child processes. That said, my personal opinion is that unless the file system is causing some specific issue, it is by far the least complicated solution.
SHM
The idea with shm is that instead of storing the serialized objects in a file, you store them in an shm segment protected against concurrent access by a semaphore. Forgive the code, it is rough, but it should be enough to give you the general idea.
/*** Configuration ***/
$blockSize = 1024; // Size of the shm block in bytes
$shmVarKey = 1;    // An integer identifying the variable inside the shm segment

/*** In the child processes ***/
// First you need a semaphore; this is important to make sure you don't have
// multiple child processes accessing the shm segment at the same time.
// Note: the ftok() path must point at the same existing file in every process,
// so derive the keys from this script rather than from a fresh tempnam() call.
$sem = sem_get(ftok(__FILE__, 'a'));
// Then you need your shm segment
$shm = shm_attach(ftok(__FILE__, 'b'), $blockSize);
if (!$sem || !$shm) {
    // error handling goes here
}
// If multiple forks hit this line at roughly the same time, the first one gets the lock;
// everyone else waits until the lock is released before trying again.
sem_acquire($sem);
$data = shm_has_var($shm, $shmVarKey) ? shm_get_var($shm, $shmVarKey) : array();
// Here you could key the data array by whatever you currently use to build file names.
$data['child specific id'] = 'my data'; // can be an object, array, anything serializable (resources are wonky)
shm_put_var($shm, $shmVarKey, $data); // note that PHP handles the serialization for you
sem_release($sem);

/*** In the parent process ***/
$shm = shm_attach(ftok(__FILE__, 'b'), $blockSize);
$data = shm_get_var($shm, $shmVarKey);
foreach ($data as $key => $value) {
    // process your data
}
Stream Socket Pair
I personally love using these for inter-process communication. The idea is that prior to forking, you create a stream socket pair. This results in two connected read/write sockets; one of them should be used by the parent and one by the child. You would have to create a separate pair for each child, and it changes your parent's model a little in that it needs to manage the communication a bit more in real time.
Fortunately the PHP docs for this function have a great example: http://us2.php.net/manual/en/function.stream-socket-pair.php
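For completeness, a stripped-down sketch of that pattern (assumes the pcntl extension; error handling omitted):
// Create a connected pair of sockets before forking
$pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);

$pid = pcntl_fork();
if ($pid === 0) {
    // Child: use $pair[0] and close the other end
    fclose($pair[1]);
    fwrite($pair[0], serialize(array('child specific id' => 'my data')));
    fclose($pair[0]);
    exit(0);
}

// Parent: use $pair[1] and close the other end
fclose($pair[0]);
$data = unserialize(stream_get_contents($pair[1]));
fclose($pair[1]);
pcntl_waitpid($pid, $status);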
You could use a shared memory cache such as memcached, which would be faster, but depending on what you're doing and how sensitive/important the data is, a file-based solution may be your best option.
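A minimal sketch of the memcached idea, using the pecl Memcached extension and assuming a server on localhost:11211; the key names are made up for illustration:
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

// In a child process: stash this child's result under its own key
$memcached->set('child_result_' . getmypid(), $myObject, 3600);

// In the parent, once the children are done: collect each result
// ($childPid would be recorded from pcntl_fork() when spawning the child)
$result = $memcached->get('child_result_' . $childPid);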

Downloading pages in parallel using PHP

I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes something like this:
I fetch a base URL and get all secondary URLs from that page. Then for each secondary URL I fetch it, process the page I find, download some photos (which takes quite a long time) and store the data in the database, then fetch the next URL and repeat the process.
In this process, I think I am wasting some time fetching the secondary URL at the start of each iteration. So I am trying to fetch the next URLs in parallel while processing the first iteration.
The solution in my mind is to call a PHP script from the main process, say a downloader, which will download all the URLs (with curl_multi or wget) and store them in some database.
My questions are:
How do I call such a downloader asynchronously? I don't want my main script to wait until the downloader completes.
Is there any location to store the downloaded data, such as shared memory? Of course, other than the database.
Is there any chance that the data gets corrupted while storing and retrieving, and how do I avoid this?
Also, please let me know if anyone has a better plan.
When I hear that someone uses curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, then wait for all of them to complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, and it really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:
public function launch() {
    $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
    $activeJobs = array();
    $running = 0;
    do {
        // pick jobs for free channels:
        while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
            // take a free channel, (re)init the curl handle and let
            // the queued object set its options
            $chId = key($freeChannels);
            if (empty($channels[$chId])) {
                $channels[$chId] = curl_init();
            }
            $job = array_pop($this->jobQueue);
            $job->init($channels[$chId]);
            curl_multi_add_handle($this->master, $channels[$chId]);
            $activeJobs[$chId] = $job;
            unset($freeChannels[$chId]);
        }
        $pending = count($activeJobs);
        // launch them:
        if ($pending > 0) {
            // poke it while it wants
            while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
            // wait for some activity, don't eat CPU
            curl_multi_select($this->master);
            while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                // some connection(s) finished; locate that job and run its response handler:
                $pending--;
                $chId = array_search($info['handle'], $channels);
                $content = curl_multi_getcontent($channels[$chId]);
                curl_multi_remove_handle($this->master, $channels[$chId]);
                // free up this channel
                $freeChannels[$chId] = NULL;
                if ( !array_key_exists($chId, $activeJobs) ) {
                    // impossible, but...
                    continue;
                }
                $activeJobs[$chId]->onComplete($content);
                unset($activeJobs[$chId]);
            }
        }
    } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version, $jobs are actually instances of a separate class, not instances of controllers or models. They just handle setting cURL options, parsing the response and calling a given onComplete callback.
With this structure, new requests will start as soon as something out of the pool finishes.
Of course it doesn't really help if it's not just the retrieving that takes time but the processing as well... and it isn't true parallel handling. But I still hope it helps. :)
P.S. It did the trick for me. :) An 8-hour job now completes in 3-4 minutes using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP it rarely works exactly as supposed... That was like "ok, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client-side, using JavaScript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to; it then does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.
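A rough sketch of what such a huntergatherer endpoint could look like in PHP (simplified curl_multi usage, no error handling; the endpoint name and request format are just the example from above):
// Expects a JSON array of URLs in the POST body, fetches them in parallel,
// and returns a JSON object mapping each URL to its response body.
$urls = json_decode(file_get_contents('php://input'), true);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

$results = array();
foreach ($handles as $url => $ch) {
    $results[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

header('Content-Type: application/json');
echo json_encode($results);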
Aside from the curl_multi solution, another one is just having a batch of Gearman workers. If you go this route, I've found supervisord a nice way to start a load of daemon workers.
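If Gearman is an option, the shape of it with the pecl gearman extension is roughly this (the function name 'fetch_url' and server address are made up):
// Client side: queue URLs as background jobs and return immediately
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
foreach ($urls as $url) {
    $client->doBackground('fetch_url', $url);
}

// Worker side (run several of these under supervisord):
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('fetch_url', function (GearmanJob $job) {
    $html = file_get_contents($job->workload());
    // ... parse, download photos, store in the database ...
});
while ($worker->work());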
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, Ruby EventMachine or similar tools are great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
Try executing python-pycurl scripts from PHP. It's easier and faster than PHP's curl.
