Does anyone know of an alternative way to convert a SimpleXMLElement into a string?
The standard string casting is very slow:
$myString = (string)$simpleXmlElement->someNode;
I needed to know which one is faster for finding an element with a specific text value: XPath or walking the nodes... So I wrote a simple script that would measure the duration of 1000 iterations for both approaches.
The first results were that XPath was much slower, but then I found out that I had forgotten the string cast in the node-walking part. When I fixed that, the node-walking was much, much slower.
So, only the cast-to-string flipped the entire outcome.
Please review the following code to understand the issue at hand:
<?php
//---------------------------------------------------------------------------
date_default_timezone_set('Europe/Amsterdam');
error_reporting(E_ALL);
//---------------------------------------------------------------------------
$data = <<<'EOD'
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<children>
<child><test>ads</test></child>
<child><test>sdf</test></child>
<child><test>dfg</test></child>
<child><test>fgh</test></child>
<child><test>ghj</test></child>
<child><test>hjk</test></child>
<child><test>jkl</test></child>
<child><test>ads</test></child>
<child><test>sdf</test></child>
<child><test>dfg</test></child>
<child><test>fgh</test></child>
<child><test>ghj</test></child>
<child><test>hjk</test></child>
<child><test>jkl</test></child>
<child><test>123</test></child>
<child><test>234</test></child>
<child><test>345</test></child>
<child><test>456</test></child>
<child><test>567</test></child>
<child><test>678</test></child>
<child><test>789</test></child>
<child><test>890</test></child>
<child><test>90-</test></child>
<child><test>0-=</test></child>
<child><test>qwe</test></child>
</children>
</root>
EOD;
$xml = new SimpleXMLElement($data);
$values = array('123', '234', '345', '456', '567', '678', '789', '890', '90-', '0-=', 'qwe');
$valCount = count($values);
$tries = 1000;
//---------------------------------------------------------------------------
echo("Running XPath... ");
$startTime = microtime(true);
for ($idx=0; $idx<$tries; $idx++)
    $xml->xpath('/root/children/child[test="'.$values[($idx % $valCount)].'"]');
$duration = microtime(true) - $startTime;
echo("Finished in: $duration\r\n");
//---------------------------------------------------------------------------
echo("Running NodeWalk... ");
$startTime = microtime(true);
for ($idx=0; $idx<$tries; $idx++)
{
    $nodes = $xml->children->child;
    foreach ($nodes as $node)
        if ((string)$node->test == $values[($idx % $valCount)])
            break;
}
$duration = microtime(true) - $startTime;
echo("Finished in: $duration\r\n");
//---------------------------------------------------------------------------
?>
When altering the line:
if ((string)$node->test == $values[($idx % $valCount)])
to:
if ($node->test == $values[($idx % $valCount)])
The code even looks at more nodes, but it's still a lot faster. So, it looks to me that the string cast here is very slow.
Does anyone have a faster alternative for the string cast?
Amazingly, as @Gordon pointed out, there is actually not that much difference in performance... on Linux.
As the original tests were done only on Windows, I retested it on Linux, and what do you know... The difference between the XPath-way and the node-walking-way is gone.
For me that's enough, because I'm actually writing this for Linux (the production platform). But it is still something to keep in mind when building for Windows.
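For completeness, one alternative to the plain cast that could be benchmarked is going through the underlying DOM node via dom_import_simplexml(). This is only a sketch of the idea — whether it is actually faster on any given platform would have to be measured:
<?php
// Minimal sketch, assuming $node is a SimpleXMLElement as in the loop above.
// dom_import_simplexml() exposes the underlying DOM node; its textContent
// property holds the same text that the (string) cast would return.
$text = dom_import_simplexml($node->test)->textContent;
?>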
Related
I have 6 nested loops in a PHP program, and the calculation time for the script is extremely slow. I would like to ask if there is a better way of implementing the 6 loops to reduce the computation time, even if it means switching to another language. The nature of the algorithm I'm implementing requires iteration, so I don't know how I can implement it better.
Here's the code.
<?php
$time1 = microtime(true);
$res = 16;
$imageres = 128;
for ($x = 0; $x < $imageres; ++$x) {
    for ($y = 0; $y < $imageres; ++$y) {
        $pixels[$x][$y] = 1;
    }
}
$quantizermatrix = 1;
$scalingcoefficient = 1 / ($res / 2);
for ($currentimagex = 0; $currentimagex < ($res * ($imageres / $res - 1) + 1); $currentimagex += $res) {
    for ($currentimagey = 0; $currentimagey < ($res * ($imageres / $res - 1) + 1); $currentimagey += $res) {
        for ($u = 0; $u < $res; ++$u) {
            for ($v = 0; $v < $res; ++$v) {
                for ($x = 0; $x < $res; ++$x) {
                    for ($y = 0; $y < $res; ++$y) {
                        if ($u == 0) { $a = 1 / sqrt(2); } else { $a = 1; }
                        if ($v == 0) { $b = 1 / sqrt(2); } else { $b = 1; }
                        $xes[$y] = $pixels[$x + $currentimagex][$y + $currentimagey]
                            * cos((M_PI / $res) * ($x + 0.5) * $u)
                            * cos((M_PI / $res) * ($y + 0.5) * $v);
                    }
                    $xes1[$x] = array_sum($xes);
                }
                $xes2 = array_sum($xes1) * $scalingcoefficient * $a * $b;
                $dctarray[$u + $currentimagex][$v + $currentimagey] = round($xes2 / $quantizermatrix) * $quantizermatrix;
            }
        }
    }
}
foreach ($dctarray as $dct) {
    foreach ($dct as $dc) {
        echo $dc . " ";
    }
    echo "<br>";
}
$time2 = microtime(true);
echo 'script execution time: ' . ($time2 - $time1);
?>
I've removed a large portion of the code that's irrelevant, since this is the section of the code that's problematic.
Essentially the code iterates through every pixel in a PNG image and outputs a computed matrix (2D array). This code takes around 2 seconds for a 128x128 image, which makes the program impractical for normal images larger than 128x128.
There is a function available in the Imagick library:
Imagick::exportImagePixels
Refer to the link below; it might help you out:
http://www.php.net/manual/en/imagick.exportimagepixels.php
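For illustration, a minimal sketch of how the pixel export could look (the file name input.png and the use of grayscale intensity values are assumptions, not part of the original question):
<?php
// Hypothetical example of Imagick::exportImagePixels.
$img    = new Imagick('input.png');
$width  = $img->getImageWidth();
$height = $img->getImageHeight();
// Export every pixel as a flat array of grayscale intensity values ("I" map).
$pixels = $img->exportImagePixels(0, 0, $width, $height, 'I', Imagick::PIXEL_CHAR);
// The value for pixel (x, y) is then $pixels[$y * $width + $x].
?>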
Some background
I was having a go at the common "MaxProfit" programming challenge. It basically goes like this:
Given a zero-indexed array A consisting of N integers containing daily
prices of a stock share for a period of N consecutive days, returns
the maximum possible profit from one transaction during this period.
I was quite pleased with this PHP algorithm I came up with, having avoided the naive brute-force attempt:
public function maxProfit($prices)
{
    $maxProfit = 0;
    $key = 0;
    $n = count($prices);
    while ($key < $n - 1) {
        $buyPrice = $prices[$key];
        $maxFuturePrice = max( array_slice($prices, $key+1) );
        $profit = $maxFuturePrice - $buyPrice;
        if ($profit > $maxProfit) $maxProfit = $profit;
        $key++;
    }
    return $maxProfit;
}
However, having tested my solution, it seems to perform badly, perhaps even in O(n^2) time.
I did a bit of reading around the subject and discovered a very similar Python solution. Python has some quite handy slicing abilities which allow splitting a list with the a[s:e] syntax, unlike PHP, where I used the array_slice function. I decided this must be the bottleneck, so I did some tests:
Tests
PHP array_slice()
$n = 10000;
$a = range(0,$n);
$start = microtime(1);
foreach ($a as $key => $elem) {
    $subArray = array_slice($a, $key);
}
$end = microtime(1);
echo sprintf("Time taken: %sms", round(1000 * ($end - $start), 4)) . PHP_EOL;
Results:
$ php phpSlice.php
Time taken: 4473.9199ms
Time taken: 4474.633ms
Time taken: 4499.434ms
Python a[s : e]
import time
n = 10000
a = range(0, n)
start = time.time()
for key, elem in enumerate(a):
    subArray = a[key:]
end = time.time()
print "Time taken: {0}ms".format(round(1000 * (end - start), 4))
Results:
$ python pySlice.py
Time taken: 213.202ms
Time taken: 212.198ms
Time taken: 215.7381ms
Time taken: 213.8121ms
Question
Why is PHP's array_slice() around 20x less efficient than Python?
Is there an equivalently efficient method in PHP that achieves the above and thus hopefully makes my maxProfit algorithm run in O(N) time? Edit: I realise my implementation above is not actually O(N), but my question still stands regarding the efficiency of slicing arrays.
I don't really know, but PHP's arrays are hybrid monsters, part list and part dictionary, and maybe that's why. Python's lists are really just lists, not dictionaries at the same time, so they might be simpler and faster because of that.
Yes, do an actual O(n) solution. Your solution isn't just slow because PHP's slicing is apparently slow; it's slow because you have an O(n^2) algorithm. Just walk over the array once, keep track of the minimum price found so far, and compare it with the current price — not something like taking the max over half the array in every single loop iteration.
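A sketch of that single-pass idea in PHP (the function name is kept the same as above only for comparison; this is an illustration, not the poster's code):
<?php
function maxProfit(array $prices)
{
    $maxProfit = 0;
    $minPrice  = PHP_INT_MAX;
    foreach ($prices as $price) {
        if ($price < $minPrice) {
            $minPrice = $price;                  // new cheapest buying price so far
        } elseif ($price - $minPrice > $maxProfit) {
            $maxProfit = $price - $minPrice;     // best profit selling at today's price
        }
    }
    return $maxProfit;                           // runs in O(n), no slicing needed
}
?>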
A client of mine has asked me to create a simple site that monitors files on another site. He needs to monitor the file names (unsure why?) and have them output to a file.
Here's the example source; http://pastebin.com/tyLUmCJr
I don't speak Russian, so I'm unaware of what the site's about. I apologize if it's anything that's 'less-than-suitable'.
Anyway, if you scroll to line 117, you will see a file name. I need to get all of the file names.
I've played around with DOMDocument and third-party tools, although I believe I could use a regex to speed this up. If anybody could point me in the right direction, it would be greatly appreciated.
Note: keep in mind that the source is stored within a string variable called $content.
Cheers!
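For reference, the DOMDocument route mentioned above could look roughly like this — a sketch only, assuming $content holds the fetched HTML and that the file names are simply the text of <a> tags (the real page will likely need a narrower XPath query):
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // the page is unlikely to be valid XHTML
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a') as $link) {
    echo $link->textContent . "\n"; // or write to the output file
}
?>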
After some more detailed, extensive research, I found a way to do it. Here's how I achieved it:
<?php
require_once("phpQuery.php");
$min = isset($_GET['min']) ? $_GET['min'] : 1;
$max = isset($_GET['max']) ? $_GET['max'] : 2;
$pages = [];
foreach (range($min, $max) as $page) {
    array_push($pages, iconv("CP1251", "UTF-8", file_get_contents("http://www.fayloobmennik.net/files/list/" . $page . ".html")));
}
$html = file_get_html("http://www.fayloobmennik.net/files/list/");
$elem = $html->find('div[id=info] table > tbody', 0);
$test = $elem->find('tr a');
foreach ($test as $test2) {
    $regex = '/<a href=\"([^\"]*)\">(.*)<\/a>/iU';
    $test2 = preg_match($regex, $test2, $match);
    print_r(iconv("CP1251", "UTF-8", $match[2]));
    echo "<br/>";
}
?>
The phpQuery.php class is simple_html_dom (I believe that's what it's called?).
Cheers.
Regarding performance, is there any difference between doing:
$message = "The request $request has $n errors";
and
$message = sprintf('The request %s has %d errors', $request, $n);
in PHP?
I would say that calling a function involves more overhead, but I do not know what PHP is doing behind the scenes to expand variable names.
Thanks!
It does not matter.
Any performance gain would be so minuscule that you would see it (as an improvement in the hundredths of seconds) only with tens or hundreds of thousands of iterations - if even then.
For specific numbers, see this benchmark. You can see it has to generate 1MB+ of data using 100,000 function calls to achieve a measurable difference in the hundreds of milliseconds. Hardly a real-life situation. Even the slowest method ("sprintf() with positional params") takes only 0.00456 milliseconds vs. 0.00282 milliseconds with the fastest. For any operation requiring 100,000 string output calls, you will have other factors (network traffic, for example) that will be an order of magnitude slower than the 100ms you may be able to save by optimizing this.
Use whatever makes your code most readable and maintainable for you and others. To me personally, the sprintf() method is a neat idea - I have to think about starting to use that myself.
In all cases the second won't be faster, since you are supplying a double-quoted string, which has to be parsed for variables as well. If you are going for micro-optimization, the proper way is:
$message = sprintf('The request %s has %d errors', $request, $n);
Still, I believe the second is slower (though as @Pekka pointed out, the difference does not actually matter), because of the overhead of a function call, parsing the format string, converting values, etc. But please note that the two lines of code are not equivalent, since in the second case $n is converted to an integer. If $n is "no error" then the first line will output:
The request $request has no error errors
While the second one will output:
The request $request has 0 errors
A performance analysis about "variable expansion vs. sprintf" was made here.
As @Pekka says, use "whatever makes your code most readable and maintainable for you and others". When the performance gains are "low" (roughly less than a factor of two), ignore them.
Summarizing the benchmark: PHP is optimized for double-quoted and heredoc resolution. The percentages below are relative to the average time for building a very long string using only:
double-quoted resolution: 75%
heredoc resolution: 82%
single-quote concatenation: 93%
sprintf formatting: 117%
sprintf formatting with indexed params: 133%
Note that only sprintf performs any formatting (see the benchmark's '%s%s%d%s%f%s'), and, as @Darhazer shows, it makes a difference in the output. A better test would be two benchmarks: one comparing only concatenation times (a '%s' format), the other including the formatting process — for example '%3d%2.2f' and its functional equivalents applied before expanding variables into double quotes... plus one more benchmark combination using short template strings.
PROS and CONS
The main advantage of sprintf is, as shown by the benchmarks, its very low-cost formatting (!). For generic templating I suggest using the vsprintf function (a short sketch follows this section).
The main advantages of double-quoted strings (and heredoc) are some performance, plus the readability and maintainability of named placeholders, which grows with the number of parameters (beyond 1) compared with the positional marks of sprintf.
Indexed placeholders with sprintf are a halfway point in maintainability.
NOTE: do not use single-quote concatenation unless really necessary. Remember that PHP enables safe syntax like "Hello {$user}_my_brother!" and references like "Hello {$this->name}!".
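A short sketch of the vsprintf() suggestion above, with a made-up template and parameter array:
<?php
// vsprintf() takes the parameters as an array, which is handy for generic templates.
$template = 'The request %s has %d errors';
$params   = array('XYZ', 3);
$message  = vsprintf($template, $params);   // "The request XYZ has 3 errors"
?>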
I am surprised, but for PHP 7.* "$variables replacement" is the fastest approach:
$message = "The request {$request} has {$n} errors";
You can simply prove it yourself:
$request = "XYZ";
$n = "0";
$mtime = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $message = "The request {$request} has {$n} errors";
}
$ctime = microtime(true);
echo '
"variable $replacement timing": '. ($ctime-$mtime);
$request = "XYZ";
$n = "0";
$mtime = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $message = 'The request '.$request.' has '.$n.' errors';
}
$ctime = microtime(true);
echo '
"concatenation" . $timing: '. ($ctime-$mtime);
$request = "XYZ";
$n = "0";
$mtime = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $message = sprintf('The request %s has %d errors', $request, $n);
}
$ctime = microtime(true);
echo '
sprintf("%s", $timing): '. ($ctime-$mtime);
The result for PHP 7.3.5:
"variable $replacement timing": 0.091434955596924
"concatenation" . $timing: 0.11175799369812
sprintf("%s", $timing): 0.17482495307922
Probably you have already found recommendations like "use sprintf instead of variables contained in double quotes; it's about 10x faster" (for example in What are some good PHP performance tips?).
That was true once upon a time — namely, before PHP 5.2.*.
Here is a sample of how it was in those days, on PHP 5.1.6:
"variable $replacement timing": 0.67681694030762
"concatenation" . $timing: 0.24738907814026
sprintf("%s", $timing): 0.61580610275269
For injecting multiple string variables into a string, the first one will be faster:
$message = "The request $request has $n errors";
And for a single injection, dot (.) concatenation will be faster:
$message = 'The request '.$request.' has 0 errors';
Run the iterations in a large loop (say a million or more) and measure the difference.
For example:
<?php
$request = "XYZ";
$n = "0";
$mtime = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $message = "The request {$request} has {$n} errors";
}
$ctime = microtime(true);
echo ($ctime-$mtime);
?>
Ultimately the first is the fastest when considering the context of a single variable assignment, as can be seen by looking at various benchmarks. Perhaps, though, using the sprintf flavor of core PHP functions could allow for more extensible code and be better suited to bytecode-level caching mechanisms like OPcache or APC. In other words, a particular-sized application could use less code when utilizing the sprintf method. The less code you have to cache into RAM, the more RAM you have for other things or more scripts. However, this only matters if your scripts wouldn't otherwise fit into RAM.
Is there a maximum file size the XMLReader can handle?
I'm trying to process an XML feed about 3GB large. There are certainly no PHP errors as the script runs fine and successfully loads to the database after it's been run.
The script also runs fine with smaller test feeds - 1GB and below. However, when processing larger feeds the script stops reading the XML File after about 1GB and continues running the rest of the script.
Has anybody experienced a similar problem? and if so how did you work around it?
Thanks in advance.
I had the same kind of problem recently, and I thought I would share my experience.
It seems the problem is in the way PHP was compiled: whether it was compiled with support for 64-bit file sizes/offsets or only with 32-bit support.
With 32 bits you can only address 4GB of data. You can find a somewhat confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html
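A quick (if indirect) way to see which kind of build you are running — on a 32-bit build PHP_INT_SIZE is 4 and PHP_INT_MAX is 2147483647. Note that 64-bit file offsets are a separate compile-time option, so this is only a hint, not a definitive test:
<?php
// Prints 8 / 9223372036854775807 on a 64-bit build, 4 / 2147483647 on a 32-bit one.
var_dump(PHP_INT_SIZE, PHP_INT_MAX);
?>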
I had to split my files with the Perl utility xml_split, which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split
I used it to split my huge XML file into manageable chunks. The good thing about the tool is that it splits XML files over whole elements. Unfortunately it's not very fast.
I needed to do this only once and it suited my needs, but I wouldn't recommend it for repetitive use. After splitting, I used XMLReader on the smaller files of about 1GB in size.
Splitting up the file will definitely help. Other things to try...
adjust the memory_limit variable in php.ini. http://php.net/manual/en/ini.core.php
rewrite your parser using SAX (http://php.net/manual/en/book.xml.php). This is a stream-oriented parser that doesn't need to parse the whole tree; it is much more memory-efficient but slightly harder to program (a minimal sketch follows below).
Depending on your OS, there might also be a 2GB limit on the RAM chunk that you can allocate. Very possible if you're running on a 32-bit OS.
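As an illustration of the SAX approach mentioned above (the file name huge.xml and the empty handlers are placeholders, not part of the original question):
<?php
// Stream the file through PHP's Expat-based SAX parser in small chunks,
// so the whole tree is never held in memory.
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) { /* react to an opening tag */ },
    function ($parser, $name) { /* react to a closing tag */ }
);
xml_set_character_data_handler($parser, function ($parser, $data) {
    // text content arrives here, possibly in several pieces
});

$fp = fopen('huge.xml', 'rb');
while (!feof($fp)) {
    $chunk = fread($fp, 8192);
    if (!xml_parse($parser, $chunk, feof($fp))) {
        die(xml_error_string(xml_get_error_code($parser)));
    }
}
fclose($fp);
xml_parser_free($parser);
?>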
It should be noted that PHP in general has a maximum file size. PHP does not allow for unsigned integers or long integers, meaning you're capped at 2^31 - 1 (or 2^63 - 1 on 64-bit systems) for integers. This is important because PHP uses an integer for the file pointer (your position in the file as you read through), meaning it cannot process a file larger than 2^31 bytes in size.
However, this should be more than 1 gigabyte. I ran into issues at two gigabytes (as expected, since 2^31 is roughly 2 billion).
I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks... So if you have a bunch of <record> tags that you are parsing, parse them out with string functions as a stream, and when you get a full record in the buffer, parse that using the XML functions... It sucks, but it works quite well (and is very memory-efficient, since you only have at most one record in memory at any one time)...
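A rough sketch of that buffering idea, assuming the interesting elements are literally <record>...</record> and that a single record comfortably fits in memory:
<?php
$fp  = fopen('huge.xml', 'rb');
$buf = '';
while (!feof($fp)) {
    $buf .= fread($fp, 8192);
    // pull every complete <record>...</record> out of the buffer
    while (($start = strpos($buf, '<record')) !== false
        && ($end = strpos($buf, '</record>', $start)) !== false) {
        $end      += strlen('</record>');
        $recordXml = substr($buf, $start, $end - $start);
        $record    = simplexml_load_string($recordXml);  // parse just this one record
        // ... process $record here ...
        $buf = substr($buf, $end);                       // discard what was consumed
    }
}
fclose($fp);
?>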
Do you get any errors with
libxml_use_internal_errors(true);
libxml_clear_errors();
// your parser stuff here....
$r = new XMLReader(...);
// ....
foreach( libxml_get_errors() as $err ) {
printf(". %d %s\n", $err->code, $err->message);
}
when the parser stops prematurely?
Using Windows XP, NTFS as the filesystem, and PHP 5.3.2, there was no problem with this test script:
<?php
define('SOURCEPATH', 'd:/test.xml');

if ( 0 ) {
    build();
}
else {
    echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
    timing('read');
}

function timing($fn) {
    $start = new DateTime();
    echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
    $fn();
    $end = new DateTime();
    echo 'end: ', $start->format('Y-m-d H:i:s'), "\n"; // note: should use $end; see the remark below the output
    echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}

function read() {
    $cnt = 0;
    $r = new XMLReader;
    $r->open(SOURCEPATH);
    while( $r->read() ) {
        if ( XMLReader::ELEMENT === $r->nodeType ) {
            if ( 0===++$cnt%500000 ) {
                echo '.';
            }
        }
    }
    echo "\n#elements: ", $cnt, "\n";
}

function build() {
    $fp = fopen(SOURCEPATH, 'wb');
    $s = '<catalogue>';
    //for($i = 0; $i < 500000; $i++) {
    for($i = 0; $i < 60000000; $i++) {
        $s .= sprintf('<item>%010d</item>', $i);
        if ( 0===$i%100000 ) {
            fwrite($fp, $s);
            $s = '';
            echo $i/100000, ' ';
        }
    }
    $s .= '</catalogue>';
    fwrite($fp, $s);
    fflush($fp); // fflush() flushes the file pointer; flush() only affects the output buffer
    fclose($fp);
}
output:
filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31
(as you can see I screwed up the output of the end-time but I don't want to run this script another 7+ minutes ;-))
Does this also work on your system?
As a side note: the corresponding C# test application took only 41 seconds instead of 7.5 minutes. And my slow hard drive might have been the (or at least a) limiting factor in this case.
filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:56:05
diff: 00:41
and the source:
using System;
using System.IO;
using System.Xml;

namespace ConsoleApplication1
{
    class SOTest
    {
        delegate void Foo();
        const string sourcepath = @"d:\test.xml";

        static void timing(Foo bar)
        {
            DateTime dtStart = DateTime.Now;
            System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
            bar();
            DateTime dtEnd = DateTime.Now;
            System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
            TimeSpan s = dtEnd.Subtract(dtStart);
            System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
        }

        static void readTest()
        {
            XmlTextReader reader = new XmlTextReader(sourcepath);
            int cnt = 0;
            while (reader.Read())
            {
                if (XmlNodeType.Element == reader.NodeType)
                {
                    if (0 == ++cnt % 500000)
                    {
                        System.Console.Write('.');
                    }
                }
            }
            System.Console.WriteLine("\n#elements: " + cnt + "\n");
        }

        static void Main()
        {
            FileInfo f = new FileInfo(sourcepath);
            System.Console.WriteLine("filesize: {0:N0}", f.Length);
            timing(readTest);
            return;
        }
    }
}