Queued Laravel Notifications get stuck on AWS SQS - php

I have a worker on AWS that handles queued Laravel notifications. Some of the notifications get sent out, but others get stuck in the queue and I don't know why.
I've looked at the logs in Beanstalk and see three different types of error:
2020/11/03 09:22:34 [emerg] 10932#0: *30 malloc(4096) failed (12: Cannot allocate memory) while reading upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock:", host: "localhost"
I see an Out of Memory issue on Bugsnag too, but without any stacktrace.
Another error is this one:
2020/11/02 14:50:07 [error] 10241#0: *2623 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock", host: "localhost"
And this is the last one:
2020/11/02 15:00:24 [error] 10241#0: *2698 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock:", host: "localhost"
I don't really understand what I can do to resolve these errors. It's just a basic Laravel / EBS / SQS setup, and the only thing the queue has to do is handle notifications, sometimes a couple dozen at a time. I'm running a t2.micro and would assume that's enough to send a few e-mails? I've upped the environment to a t2.large, but to no avail.
I notice that messages end up in the queue and get the status 'Messages in flight', but then run into all sorts of trouble on the Laravel side. And I don't get any useful errors to work with.
All implementation code seems to be fine, because the first few notifications go out as expected and if I don't queue at all, all notifications get dispatched right away.
The queued notifications eventually generate two different exceptions: MaxAttemptsExceededException and an Out of Memory FatalError, but neither leads me to the actual underlying problem.
Where do I look further to debug?
UPDATE
See my answer for the problem and the solution. The database transaction hadn't finished before the worker tried to send a notification for the object that still had to be created.

What is the current memory_limit assigned to PHP? You can determine this by running this command:
php -i | grep memory_limit
You can increase this by running something like:
sed -i -e 's/memory_limit = [current-limit]/memory_limit = [new-limit]/g' [full-path-to-php-ini]
Just replace the [current-limit] with the value displayed in the first command, and [new-limit] with a new reasonable value. This might require trial and error. Replace [full-path-to-php-ini] with the full path to the php.ini that's used by the process that's failing. To find this, run:
php -i | grep php.ini

First, make sure that you have increased max_execution_time and also memory_limit.
Also make sure that you set the --timeout option.
Then make sure you follow the instructions for Amazon SQS as the Laravel docs say:
The only queue connection which does not contain a retry_after value is Amazon SQS. SQS will retry the job based on the Default Visibility Timeout which is managed within the AWS console.
Job Expirations & Timeouts
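For reference, on a stock Laravel install of that era the SQS connection in config/queue.php looks roughly like this (the values are the framework's placeholders, not anything from this setup); note that there is no retry_after key, and the --timeout flag is given to the worker command instead:

'sqs' => [
    'driver' => 'sqs',
    'key' => env('AWS_ACCESS_KEY_ID'),
    'secret' => env('AWS_SECRET_ACCESS_KEY'),
    'prefix' => env('SQS_PREFIX', 'https://sqs.us-east-1.amazonaws.com/your-account-id'),
    'queue' => env('SQS_QUEUE', 'your-queue-name'),
    'region' => env('AWS_DEFAULT_REGION', 'us-east-1'),
    // no 'retry_after' here: SQS relies on its Default Visibility Timeout instead
],

The worker would then be started with something like php artisan queue:work sqs --timeout=60, where 60 is just an example; keep it below the queue's visibility timeout.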

If you are sure that some of the queued events are correctly received and processed by the Laravel worker, then as others have said it's most likely a PHP memory issue.
On Beanstalk, here's what I added to my .ebextensions to give PHP more memory (it was for Composer memory issues):
Note that this is on a t3.medium EC2 instance with 4 GB of RAM, dedicated to the Laravel API only.
02-environment.config
commands:
  ...
option_settings:
  ...
  - namespace: aws:elasticbeanstalk:container:php:phpini
    option_name: memory_limit
    value: 4096M
  - namespace: aws:ec2:instances
    option_name: InstanceTypes
    value: t3.medium
So you can try to increase the limit to use more of your instance's available RAM, and deploy again so Beanstalk rebuilds the instance and sets up the PHP memory_limit.
Note: the real config contains other configuration files and more content, truncated here of course.
As you said, you are just sending an email, so it should be OK. Does it happen when there's a burst of queued emails? Are there, in the end, many events in the SQS dead-letter queue? If so, it may be because of a queued email burst, in which case SQS will "flood" the /worker route to execute your jobs. You could check server usage from the AWS console or with CLI tools like htop, and also check the SQS interface to see if many failed jobs arrive at the same moments (bursts).
Edit: for Elastic Beanstalk I use dusterio/laravel-aws-worker; maybe you do too, as your log mentions the /worker/queue route.

Memory
The default amount of memory allocated to PHP can often be quite small. When using EBS, you want to use config files as much as possible - any time you have to SSH in and change things on the server, you're going to have more issues when you need to redeploy. I have this in my EBS config /.ebextensions/01-php-settings.config:
option_settings:
  aws:elasticbeanstalk:container:php:phpini:
    memory_limit: 256M
That's been enough when running a t3.micro to do all my notification and import processing. For simple processing it doesn't usually need much more memory than the default, but it depends a fair bit on your use-case and how you've programmed your notifications.
Timeout
As pointed out in this answer already, the SQS queue operates a little differently when it comes to timeouts. This is a small trait that I wrote to help work around this issue:
<?php

namespace App\Jobs\Traits;

trait CanExtendSqsVisibilityTimeout
{
    /** NOTE: this needs to map to setting in AWS console */
    protected $defaultBackoff = 30;

    protected $backoff = 30;

    /**
     * Extend the time that the job is locked for processing
     *
     * SQS messages are managed via the default visibility timeout console setting; noted absence of retry_after config
     * @see https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts
     * AWS recommends to create a "heartbeat" in the consumer process in order to extend processing time:
     * @see https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#configuring-visibility-timeout
     *
     * @param int $delay Number of seconds to extend the processing time by
     *
     * @return void
     */
    public function extendBackoff($delay = 60)
    {
        if ($this->job) {
            // VisibilityTimeout has a 12 hour (43200s) maximum and will error above that; no extensions if close to it
            if ($this->backoff + $delay > 42300) {
                return;
            }
            // add the delay
            $this->backoff += $delay;

            $sqs = $this->job->getSqs();
            $sqsJob = $this->job->getSqsJob();

            $sqs->changeMessageVisibility([
                'QueueUrl' => $this->job->getQueue(),
                'ReceiptHandle' => $sqsJob['ReceiptHandle'],
                'VisibilityTimeout' => $this->backoff,
            ]);
        }
    }
}
Then for a queued job that was taking a long time, I changed the code a bit to work out where I could insert a sensible "heartbeat". In my case, I had a loop:
class LongRunningJob implements ShouldQueue
{
    use CanExtendSqsVisibilityTimeout;

    //...

    public function handle()
    {
        // some other processing, no loops involved

        // now the code that loops!
        $last_extend_at = time();
        foreach ($tasks as $task) {
            $task->doingSomething();

            // make sure the processing doesn't time out, but don't extend time too often:
            // only fire the heartbeat once we're within ~10s of the current visibility window
            if (time() > $last_extend_at + $this->defaultBackoff - 10) {
                // "heartbeat" to extend visibility timeout
                $this->extendBackoff();
                $last_extend_at = time();
            }
        }
    }
}
Supervisor
It sounds like you might need to look at how you're running your worker(s) in a bit more detail.
Having Supervisor running to help restart your workers is a must, I think. Otherwise if the worker(s) stop working, messages that are queued up will end up getting deleted as they expire. It's a bit fiddly to get working nicely with Laravel + EBS - there isn't really much good documentation around it, which is potentially why not having to manage it is one of the selling points for Vapor!
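For illustration, a minimal Supervisor program definition along the lines of the one in the Laravel docs; the artisan path, user, and process count below are assumptions you'd adjust for your EBS instance:

[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/artisan queue:work sqs --sleep=3 --tries=3
autostart=true
autorestart=true
numprocs=2
user=webapp
redirect_stderr=true
stdout_logfile=/var/log/laravel-worker.log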

We finally found out what the problem was, and it wasn't memory or execution time.
Right from the beginning I thought it was strange that either the default memory or the default execution time wasn't sufficient to send an e-mail or two.
Our use case is: a new Article is created and users receive a notification.
A few clues that led to the solution:
We noticed that we usually have problems with the first notification.
If we create 10 articles at the same time, we miss the first notification on every article.
We set the HTTP Max Connections in the Worker to 1. When creating 10 articles simultaneously, we noticed that only the first article missed the first notification.
We didn't get any useful error messages from the Worker, so we decided to set up our own EC2 instance and run php artisan queue:work manually.
What we then saw explained everything:
Illuminate\Database\Eloquent\ModelNotFoundException: No query results for model [App\Article]
This is an error that we never got from the EBS Worker / SQS, and it swiftly led to the solution:
The notification is handled before the article has made it to the database.
We added a delay to the worker and haven't had a problem since. We recently added a database transaction to the process of creating an article, and the notification is created within that transaction (at the very end). I think that's why we didn't have this problem before. We decided to leave the notification creation inside the transaction and just handle the notifications with a delay. This means we don't have to do a hotfix to get this solved.
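For anyone curious, a minimal sketch of one way to apply such a delay on the Laravel side; the ArticleCreated notification class and the 10-second value are hypothetical, and in our case we actually configured the delay on the worker itself:

// hypothetical: delay the queued notification so it is processed
// only after the surrounding database transaction has had time to commit
$article->author->notify(
    (new ArticleCreated($article))->delay(now()->addSeconds(10))
);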
Thanks to everyone who joined in to help!

Related

Server-sent events SSE in PhP 7.4 - Apache hanging and not registering/serving any new requests

Context
I am working on a PHP Server-Sent Events application running on PHP 7.4 and Apache 2.4 on Ubuntu 20.10. The app does what it's supposed to, but, presumably, an increased number of users (connections? SSE connections?) causes the server to hang. I expect to handle a relatively large number of users (~1000), but my SSE events fire rarely (~3x in 15 min) and only look for and send a few string values found in a text file on the server.
Problem
My problem is that under some circumstances, including an increased number of clients (~70 to 100), Apache starts hanging. New HTTP requests are not reported in the access log, no errors are reported in the error log, and any requests sent from the browser seem to load forever with no answer from the server. Server load (processor, RAM) at that moment is minimal, and I can access the server via SSH or FTP normally.
What I've tried
This happens with the default Apache configuration, so following online advice I tried turning off the mpm_prefork module and activating mpm_event and php7.4-fpm. Not much changed except that the number of supported clients went up by a few dozen, but even that might not be accurate, since I cannot test it manually; I can only have the application live-tested when I get a chance.
I've tried turning off the SSE element of the application, and in that case I have no Apache hanging issues (but then I can't update clients' info, for which I need SSE). That means SSE is probably overloading or hanging Apache in some way, but I don't know how.
I assume the Apache hang has to do with the number of open connections or processes. As far as I've learned, I can control that only in /etc/apache2/apache2.conf (I tried setting MaxKeepAliveRequests 0) and in /etc/php/7.4/fpm/pool.d/www.conf (I tried setting pm.max_children = 250, pm.start_servers = 10, pm.min_spare_servers = 5, pm.max_spare_servers = 15, pm.max_requests = 1000), but to no avail.
My questions
What can I do to increase the number of connections / SSE processes Apache supports?
What can I do to find out what causes Apache to hang, or what typically causes that?
Any other ideas/suggestions on how to solve the Apache hanging?
My server-side code is
<?php
header('Content-Type: text/event-stream; charset=utf-8');
header("Cache-Control: no-store");
header('Connection: keep-alive');
header('Content-Encoding: none;');
set_time_limit(0);

while (true) {
    if (configurationChanged()) {
        echo "data: " . newConfiguration() . "\n\n";
        ob_end_flush();
        flush();
    } else {
        sleep(3);
    }
    if (connection_aborted()) break;
}
?>
My client code is
var source = new EventSource('myScript.php', {withCredentials: false});

source.onopen = function (event) {
    console.log("Connection opened.");
};

source.onmessage = function (event) {
    console.log(event.data);
    // Do stuff with the obtained data here
};
Thanks for reading this.
The solution
My main problem was that I didn't expect Apache to hang while there were still resources available on my server. A lack of experience caused me to waste many hours before I realized I should look for causes in:
Apache error log /var/log/apache2/error.log
FPM log /var/log/php7.4-fpm.log
I tried re-configuring the mpm-event module according to the link given in the comment. While it helped to increase the number of concurrent users by a few dozen, the same problem started occurring when the number of users increased further.
What did help was setting pm = ondemand in /etc/php/7.4/fpm/pool.d/www.conf, to avoid having to define the parameters manually. I'm not sure why that is not the default or not more widely recommended. My problem seemed to be solved.
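For illustration, the relevant part of /etc/php/7.4/fpm/pool.d/www.conf then looks roughly like this (the idle-timeout value below is just the FPM default, not something I tuned):

; /etc/php/7.4/fpm/pool.d/www.conf (sketch)
pm = ondemand
pm.max_children = 250
; with ondemand, workers are spawned as needed and killed after this much idle time
pm.process_idle_timeout = 10s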
However, a new issue started occurring. The FPM log /var/log/php7.4-fpm.log started reporting two kinds of errors:
[mpm_event:error] ... AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
which would leave my web application hanging for users for a few minutes, then go back to normal without any intervention.
[proxy_fcgi:error] ... (70007)The timeout specified has expired: ... AH01075: Error dispatching request to : (polling), referer: ...
which would kill my web application for my users (so I added JavaScript that reloads the target PHP script for my users if the SSE connection ends)
For 1.
I tried to follow the error message's instruction to "Increase ServerLimit" and added ServerLimit 250 to /etc/apache2/mods-enabled/mpm_event.conf. That didn't solve the problem.
I found this Apache bug report, but I was using a version where that should have been fixed. I then found this page suggesting I should change mpm-event to mpm-worker. Worked like a charm and solved problem 1.
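For completeness, the module switch on Ubuntu looks roughly like this (assuming the stock Apache packages and that PHP keeps being served through php-fpm / proxy_fcgi):

sudo a2dismod mpm_event
sudo a2enmod mpm_worker
sudo systemctl restart apache2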
For 2.
Problem 2 was related to my PHP SSE application, specifically to the SSE script timeout. What did NOT help was simply adding set_time_limit(0); to my PHP script. The timeout was being reached by proxy_fcgi according to the error, so I had to edit /etc/apache2/apache2.conf and set
Timeout 3600
ProxyTimeout 3600
This increased the maximum execution time of any script to 1 hour (3600 seconds). It's not an ideal solution, but I haven't been able to find a way to raise the limit for only a particular script (in my case the SSE PHP script running in an infinite loop).
Hope this helps someone!

How to change URL timeout settings on linux webserver

I have some cron jobs set up on Linux through wget; those jobs run once every 24 hours. All the jobs basically call APIs, pull the data, and store it in the database. Now the issue is that some API calls are very slow and take a long time to respond, which eventually ends up with the error below.
--2017-07-24 06:00:02-- http://wwwin-cam-stage.cisco.com/cron/mh.php Resolving
wwwin-cam-stage.cisco.com (wwwin-cam-stage.cisco.com)... 171.70.100.25
Connecting to wwwin-cam-stage.cisco.com
(wwwin-cam-stage.cisco.com)|171.70.100.25|:80... connected. HTTP
request sent, awaiting response... Read error (Connection reset by
peer) in headers. Retrying.
--2017-07-24 06:05:03-- (try: 2) http://wwwin-cam-stage.cisco.com/cron/mh.php Connecting to
wwwin-cam-stage.cisco.com
(wwwin-cam-stage.cisco.com)|171.70.100.25|:80... connected. HTTP
request sent, awaiting response... Read error (Connection reset by
peer) in headers. Retrying.
--2017-07-24 06:10:05-- (try: 3) http://wwwin-cam-stage.cisco.com/cron/mh.php Connecting to
wwwin-cam-stage.cisco.com
(wwwin-cam-stage.cisco.com)|171.70.100.25|:80... connected. HTTP
request sent, awaiting response... 200 OK Length: 0 [text/html] Saving
to: ‘mh.php.6’
0K 0.00 =0s
2017-07-24 06:14:58 (0.00 B/s) - ‘mh.php.6’ saved [0/0]
Though on the third try it gave a 200 OK response, it messes up the actual data because it timed out on the first and second tries.
How can I change the URL timeout settings to unlimited time, or the highest possible limit, in order to complete the job in one try and without getting an error like
(Connection reset by peer)....
wget --timeout 10 http://url
The --timeout option can be used in the case of wget; increase the value (in seconds) as needed.
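Since the goal here is the opposite of cutting the request short, a cron entry along these lines is probably closer to what you want; the schedule, the 1800-second read timeout, and the single try are just example values:

# example cron entry: generous read timeout, no retries, discard the response body
0 6 * * * wget --read-timeout=1800 --tries=1 -O /dev/null http://wwwin-cam-stage.cisco.com/cron/mh.php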
EDIT
Or
Or, if you are asking about the keep-alive behaviour of the Linux machine, this might help.
On Red Hat Linux, modify the following kernel parameter by editing the /etc/sysctl.conf file, and restart the network daemon (/etc/rc.d/init.d/network restart).
"Connection reset by peer" is the TCP/IP equivalent of slamming the
phone back on the hook. It's more polite than merely not replying,
leaving one hanging. But it's not the FIN-ACK expected of the truly
polite TCP/IP converseur.
Code:
# Decrease the default value for tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 1800
EDIT
-T seconds
‘--timeout=seconds’
Set the network timeout to seconds seconds. This is equivalent to specifying ‘--dns-timeout’, ‘--connect-timeout’, and ‘--read-timeout’, all at the same time.
When interacting with the network, Wget can check for timeout and abort the operation if it takes too long. This prevents anomalies like hanging reads and infinite connects. The only timeout enabled by default is a 900-second read timeout. Setting a timeout to 0 disables it altogether. Unless you know what you are doing, it is best not to change the default timeout settings.
All timeout-related options accept decimal values, as well as subsecond values. For example, ‘0.1’ seconds is a legal (though unwise) choice of timeout. Subsecond timeouts are useful for checking server response times or for testing network latency.
‘--dns-timeout=seconds’
Set the DNS lookup timeout to seconds seconds. DNS lookups that don’t complete within the specified time will fail. By default, there is no timeout on DNS lookups, other than that implemented by system libraries.
‘--connect-timeout=seconds’
Set the connect timeout to seconds seconds. TCP connections that take longer to establish will be aborted. By default, there is no connect timeout, other than that implemented by system libraries.
‘--read-timeout=seconds’
Set the read (and write) timeout to seconds seconds. The “time” of this timeout refers to idle time: if, at any point in the download, no data is received for more than the specified number of seconds, reading fails and the download is restarted. This option does not directly affect the duration of the entire download.
Of course, the remote server may choose to terminate the connection sooner than this option requires. The default read timeout is 900 seconds.
Source Wget Manual.
See this wget Manual page for more information.

error connections would be exceeded: 300

When I try to connect to Aerospike (PHP client), I get this error:
object(Aerospike)#4 (2) {
["errorno":"Aerospike":private] =>
int(-7) ["error":"Aerospike":private] =>
string(59) "Max node BB93615E8270008 connections would be exceeded: 300"
}
The Aerospike client for PHP has a constructor config option max_threads that by default is set to 300. The PHP client is built around the C client, and passes that configuration down to the C client instance. Error status code -7 is AEROSPIKE_ERR_NO_MORE_CONNECTIONS. You could increase max_threads.
However, what I'm not sure of is how you're getting this error. The non-ZTS PHP client is a single execution thread, and those connections should be reused. It's really only an issue in multi-threaded environments such as HHVM, Java, C, etc., when multiple commands execute in parallel. Please give more information about your code and environment.
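A minimal sketch of raising that limit via the constructor config; the host details and the value of 500 are placeholders, and max_threads is the option discussed above:

// sketch: bump max_threads above the default of 300
$config = [
    'hosts' => [
        ['addr' => '127.0.0.1', 'port' => 3000],
    ],
    'max_threads' => 500,
];
$client = new Aerospike($config);
if (!$client->isConnected()) {
    echo $client->error(), ' [', $client->errorno(), "]\n";
}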

Twitter - twemproxy - memcached - Retry not working as expected

Simple setup:
1 node running twemproxy (vcache:22122)
2 nodes running memcached (vcache-1, vcache-2) both listening on 11211
I have the following twemproxy config:
default:
  auto_eject_hosts: true
  distribution: ketama
  hash: fnv1a_64
  listen: 0.0.0.0:22122
  server_failure_limit: 1
  server_retry_timeout: 600000 # 600sec, 10m
  timeout: 100
  servers:
    - vcache-1:11211:1
    - vcache-2:11211:1
The twemproxy node can resolve all hostnames. As part of testing I took down vcache-2. In theory for every attempt to interface with vcache:22122, twemproxy will contact a server from the pool to facilitate the attempt. However, if one of the cache nodes is down, then twemproxy is supposed to "auto eject" it from the pool, so subsequent requests will not fail.
It is up to the app layer to determine if a failed interface attempt with vcache:22122 was due to infrastructure issue, and if so, try again. However I am finding that on the retry, the same failed server is being used, so instead of subsequent attempts being passed to a known good cache node (in this case vcache-1) they are still being passed to the ejected cache node (vcache-2).
Here's the php code snippet which attempts the retry:
....
// $this is a Memcached object with vcache:22122 in the server list
$retryCount = 0;
do {
    $status = $this->set($key, $value, $expiry);
    if (Memcached::RES_SUCCESS === $this->getResultCode()) {
        return true;
    }
} while (++$retryCount < 3);

return false;
-- Update --
Link to Issue opened on Github for more info: Issue #427
I can't see anything wrong with your configuration. As you know the important settings are in place:
default:
  auto_eject_hosts: true
  server_failure_limit: 1
The documentation suggests connection timeouts might be an issue.
Relying only on client-side timeouts has the adverse effect of the original request having timedout on the client to proxy connection, but still pending and outstanding on the proxy to server connection. This further gets exacerbated when client retries the original request.
Is your PHP script closing the connection and retrying before twemproxy has failed its first attempt and removed the server from the pool? Perhaps setting a timeout value in twemproxy lower than the connection timeout used in PHP would solve the issue.
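As a sketch of that idea, you could make the client-side connect timeout comfortably larger than twemproxy's timeout: 100 (the 500 ms below is just an example value):

// sketch: client connect timeout (in ms) larger than twemproxy's `timeout: 100`
$mc = new Memcached();
$mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, 500);
$mc->addServer('vcache', 22122);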
From your discussion on GitHub, though, it sounds like support for health checks, and perhaps auto ejection, isn't stable in twemproxy. If you're building against old packages you might be better off finding a package that has been stable for some time. Is mcrouter (with an interesting article) suitable?
For this feature to work, please merge with this repo/branch:
https://github.com/charsyam/twemproxy/tree/feature/heartbeat
so that you have this specific commit:
https://github.com/charsyam/twemproxy/commit/4d49d2ecd9e1d60f18e665570e4ad1a2ba9b65b1
Here is the PR:
https://github.com/twitter/twemproxy/pull/428
After that, recompile it.

Browser shows time out while Server process is still running

I am having the following problem:
I am running a process that uses a lot of memory, but I have divided the load into smaller chunks, so there is no CPU time-out issue.
On the server I am creating .xml files of around 100 KB each, and more than 100 of them will be created.
Now the main problem is that the browser shows a response time-out, and IE (just above the status bar) shows a message about downloading the .php file.
During this, the backend (server-side) process is still running and continuously creating .xml files in incremental order, so there's no issue with that.
I have following php.ini configuration.
max_execution_time = 10000 ; Maximum execution time of each script, in seconds
max_input_time = 10000 ; Maximum amount of time each script may spend parsing request data
memory_limit = 2000M ; Maximum amount of memory a script may consume (128MB)
; Maximum allowed size for uploaded files.
upload_max_filesize = 2000M
I am viewing my site in IE, and I am using ZSCE with PHP 5.3.
Can anybody point me in the right direction on this issue?
Edit:
Uploaded an image of the time-out and of the prompt asking to download the .php file.
Edit 2:
Let me briefly explain my execution flow:
I have one PHP file with objects of class hierarchies, which will start executing Function1() from each class hierarchy.
I have a class file.
First, say, Function1() is executed, which contains the logic for creating the XML files in chunks.
Second, say, Function2() is executed, which displays the output generated by Function1().
Everything is done through the class hierarchies, so I can't terminate execution of Function1() part-way through; only after it has executed will Function2() be called.
Edit 3:
This is especially for @hakre.
As you asked some follow-up questions, and I agree with some points, let me describe the issue in more detail.
First, I was loading around 100+ MB of XML files at a time, which is why memory on my local setup was filling up, stopping everything on the machine, and the CPU was using most of its resources.
I then divided these big XML files into smaller ones (meaning I now load a single XML file at a time and unload it after use). This saved me from the memory overload and CPU issues on my local setup.
Now my backend process runs with no CPU or memory issues, but the issue is the browser timeout. I even tried cURL, but it doesn't seem to fit my current structure because of my class hierarchy. I have a set of classes in a hierarchy, and they all execute their Process functions first and then their Output functions. So until the Process functions have executed, the Output functions don't come into the picture, and that's why the browser shows a timeout.
I even followed the instructions suggested by @vortex and got a little success, but not what I am looking for. The reason I could not implement cURL is that my Process function creates the required XML files in one go, so it takes too much time before any output reaches the browser. As the Process function takes that much time, no output can be sent to the client until it has completed.
cURL Output:
URL....: myurl
Code...: 200 (0 redirect(s) in 0 secs)
Content: text/html Size: -1 (Own: 433) Filetime: -1
Time...: 60.437 Start # 60.437 (DNS: 0 Connect: 0.016 Request: 0.016)
Speed..: Down: 7 (avg.) Up: 0 (avg.)
Curl...: v7.20.0
Contents of test.txt file
* About to connect() to mylocalhost port 80 (#0)
* Trying 127.0.0.1... * connected
* Connected to mylocalhost (127.0.0.1) port 80 (#0)
> GET myurl HTTP/1.1
Host: mylocalhost
Accept: */*
< HTTP/1.1 200 OK
< Date: Tue, 06 Aug 2013 10:01:36 GMT
< Server: Apache/2.2.21 (Win32) mod_ssl/2.2.21 OpenSSL/0.9.8o
< X-Powered-By: PHP/5.3.9-ZS5.6.0 ZendServer
< Set-Cookie: ZDEDebuggerPresent=php,phtml,php3; path=/
< Cache-Control: private
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Connection #0 to host mylocalhost left intact
* Closing connection #0
Disclaimer: an answer to this question was chosen based on the first small success it produced. The solution from @hakre is also feasible when this type of problem occurs. Right now no answer has fully fixed my problem, only partially. @hakre's answer also goes into more detail for anyone looking for more information about this type of issue.
Assuming you made all the server-side modifications so you dodge a server timeout (I saw pretty much everything explained above), in order to dodge a browser timeout it is crucial that you do something like this:
<?php
set_time_limit(0);       // no PHP execution time limit
error_reporting(E_ALL);
ob_implicit_flush(TRUE); // flush the output buffer automatically after every output call
ob_end_flush();          // turn off (and flush) the top-level output buffer
I can tell you from experience that Internet Explorer doesn't have any issues as long as you output some content to it every now and then. I run a 30 GB database update every day (which takes around 2-4 hours), and Opera seems to be the only browser that ignores the content output.
If you don't set ob_implicit_flush, you need to do an ob_flush() after every piece of content.
References
ob_implicit_flush
ob_flush
If you don't use ob_implicit_flush at the top of your script as I wrote earlier, you need to do something like:
<?php
echo 'dummy text or execution stats';
ob_flush();
within your execution loop
1. I am running BIG memory process but have divided memory load into smaller chunks so no CPU time out issue.
Now that's a wild guess. How did you find out it was a CPU time-out issue in the first place? Did you even? If yes, what does your test give now? If not, how do you test now that this is not a time-out issue?
Despite stating there won't be a certain issue, you don't prove it, and many questions are still open. That invites guessing, which is counter-productive for trouble-shooting (which you are doing here).
What you write here just means that you wrote code to chunk memory; however, this is not a test for CPU time-out issues. The one part is writing code, the other part is testing. Don't mix the two. And don't draw wild assumptions. Issues are for the test; otherwise it didn't happen.
So much for your first point already, just to show you that when doing troubleshooting you should look for facts (monitor, test, profile, step-debug), not run on assumptions. This is crucial, otherwise you look in the wrong places and ask the wrong questions.
From what you describe about how the client (browser) behaves, this is not a time-out issue per se. The problem you've got is that the gap between the header response and the body response is taking too long for your browser's taste. One browser assumes a time-out (as some boundary value has been triggered, which looks more correct to me) and the other browser assumes something is coming, so why not wait for it.
So you merely have a processing issue here. Please consult the manual of your internet browsers (HTTP clients) for the configuration values you can change to alter this behavior. E.g. monitor with a curl request on the command line how long the request actually takes, then configure your browser not to time out when connecting to that server within the amount of time you just measured. For example, if you're using Internet Explorer: http://www.ehow.com/how_6186601_change-internet-timeout-options.html or if you're using Mozilla Firefox: http://forums.mozillazine.org/viewtopic.php?f=7&t=102322&start=0
As you didn't show any code on the server side, I assume you want to solve this problem with client settings. curl will help you measure how many seconds such a request takes. Use the -v (verbose) switch to obtain detailed information about the request.
In case you don't want to solve this on the client, curl will still help you measure important data and easily reproduce any underlying server-related timing issue. So you should go for curl on the command line in any case, especially as looking into the response headers might reveal what triggers the (again) esoteric Internet Explorer behavior. Again, the -v switch reveals the request and response headers.
If you like to automate such tests with a PHP script, it's also possible with the PHP Curl Extension. This has been outlined in:
Php - Debugging Curl
The problem is with your web server, not the browser.
If you're using Apache, you need to adjust the Timeout value in httpd.conf or in your virtual hosts config.
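For example, something along these lines (the value is in seconds; pick one comfortably above your longest expected run):

# httpd.conf or the relevant <VirtualHost> block
Timeout 600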
You have 3 pages:
1. Process - creates the XML files and then updates a database value saying that the process is done
2. A PHP page that returns {true} or {false} based on the status of the process-completion database value (see the sketch after this list)
3. An AJAX front end, polling page 2 every few seconds to check whether the process is done or not
Long Polling
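A minimal sketch of page 2 from the list above; the table and column names are made up for the example:

<?php
// status.php - hypothetical polling endpoint: reports whether the XML generation has finished
header('Content-Type: application/json');

$pdo = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'secret');
$stmt = $pdo->prepare('SELECT is_done FROM process_status WHERE id = ?');
$stmt->execute([1]);

echo json_encode(['done' => (bool) $stmt->fetchColumn()]);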
I have had this issue several times while reading a large CSV file and putting it into a database. I solved it by dividing the reading and database-inserting process into smaller parts: I created a new table to log how much data had been read and inserted, and next time the page reloads itself and starts from that position. You can do the same by creating one XML file per attempt, then reloading the page and starting on the next one. This way the memory used by the browser is refreshed.
Hope it will help.
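A rough sketch of that approach, with made-up table, column, and function names (one chunk of work per request, then the page reloads itself and continues from the stored position):

<?php
// chunked_process.php - hypothetical resumable processing script
$pdo = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'secret');

// where did the previous request stop?
$offset = (int) $pdo->query('SELECT last_offset FROM process_progress WHERE id = 1')->fetchColumn();
$chunkSize = 10; // e.g. build 10 XML files per request

$processed = processNextChunk($offset, $chunkSize); // your own chunk-processing function

if ($processed > 0) {
    // remember how far we got, then let the browser reload the page and continue
    $pdo->prepare('UPDATE process_progress SET last_offset = ? WHERE id = 1')
        ->execute([$offset + $processed]);
    echo '<meta http-equiv="refresh" content="1">Processed up to item ', $offset + $processed, '...';
} else {
    echo 'All done.';
}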
Is it possible to send some output to the browser from the script while it's still processing, even white space? If so, then do it; it should reset the timeout counter.
If it's not possible, you have to increase the timeout of IE in the registry:
HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings
You need ReceiveTimeout; if it's not there, create it as a DWORD and set the value in milliseconds.
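For example, a .reg sketch setting it to 10 minutes (0x000927C0 = 600000 ms; choose a value that suits your longest run):

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings]
"ReceiveTimeout"=dword:000927c0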
What is a "CPU time out issue"?
The right way to solve the problem is to run the heavy stuff asynchronously, in a separate session group (not in the webserver process tree).
Try to include set_time_limit(0); in your PHP script page.
The following links might help you.
http://php.net/manual/en/function.set-time-limit.php
http://php.net/manual/en/function.ignore-user-abort.php
