I tried a performance check tool, "DOM Monster", to analyze my PHP site. One of its findings says "50% of nodes are whitespace-only text nodes".
OK, I understand the problem, but what is the fastest way to clean up whitespace in PHP?
I think a good start is to use output control, i.e. ob_start(), and then replace the whitespace before releasing the buffer with ob_end_flush(). At the moment I do everything with echo after echo; I never read much about these ob_* functions. Are they useful?
I guess using preg_replace() would be a performance killer for this job, wouldn't it?
So what is the best practice for this?
The fastest way to remove whitespace-only nodes is to not create them in the first place. Just remove all the whitespace immediately before and after each HTML tag.
You certainly could remove the spaces from your code after the fact using an output handler (look at the callback bit in ob_start), but if your goal is performance, then that kind of defeats the purpose.
A whitespace-only node ends up in the DOM tree the browser builds when it parses your HTML. It appears wherever there is an HTML tag, then nothing but whitespace, then another HTML tag. It's a waste of browser resources, but not a huge deal.
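If you do want to try the output-handler route, here is a minimal sketch (my own illustration, not part of the original answer): ob_start() accepts a callback that receives the buffered output and can rewrite it before it is sent. The collapsing regex is deliberately naive and would also touch whitespace inside <pre> blocks.
// Hedged sketch: filter all buffered output through a whitespace-collapsing callback.
ob_start(function ($buffer) {
    // Remove whitespace-only runs between tags; naive, but it shows the mechanism.
    return preg_replace('/>\s+</', '><', $buffer);
});

echo "<ul>\n    <li>one</li>\n    <li>two</li>\n</ul>\n";

ob_end_flush(); // runs the callback and sends the filtered output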
The function trim() will solve your problem, won't it?
http://www.php.net/manual/en/function.trim.php
Well, I guess you are talking about HTML, and HTML is by nature a markup language full of whitespace (in attributes and in text).
Besides, you probably use newlines for readability anyway.
I would rather advise you to compress your pages with deflate/gzip via webserver rules, e.g. an .htaccess rule:
<FilesMatch "\.(js|css|html|htm|php|xml)$">
SetOutputFilter DEFLATE
</FilesMatch>
You can also take a look at Tidy, which is a library that helps you check and clean up your HTML code.
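If the Tidy extension is available (it is optional and may not be enabled on every host), a minimal sketch could look like this; the sample markup and configuration values are only illustrative:
// Hedged sketch using the optional Tidy extension.
$html = '<p>Some   <b> messy  </b>   markup</p>';

$config = array(
    'indent'         => false, // no pretty-print indentation
    'wrap'           => 0,     // do not re-wrap long lines
    'show-body-only' => true,  // return only the body fragment
);

$tidy = tidy_parse_string($html, $config, 'utf8');
$tidy->cleanRepair();
echo tidy_get_output($tidy);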
preg_replace() will of course slow things down a little, but it is probably still the fastest way. The bigger problem is that preg_replace() may be unreliable, because it is very hard to write a regular expression that works for all possible cases.
If you are creating XML/XHTML output, you could parse all your data using a fast stream parser (SAX or StAX style; PHP usually has both built in) and then write the data back to the output without the whitespace. That's simple, effective, reliable and at least reasonably fast. It's still not going to blow you away with speed.
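A minimal sketch of that stream-parsing idea, assuming well-formed XML/XHTML input (the file name "input.xhtml" is hypothetical); XMLReader and XMLWriter are PHP's built-in pull and push parsers:
// Stream the document and re-emit it, dropping whitespace-only text nodes.
$reader = new XMLReader();
$reader->open('input.xhtml');

$writer = new XMLWriter();
$writer->openMemory();

while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            $isEmpty = $reader->isEmptyElement;
            $writer->startElement($reader->name);
            while ($reader->moveToNextAttribute()) {
                $writer->writeAttribute($reader->name, $reader->value);
            }
            $reader->moveToElement();
            if ($isEmpty) {
                $writer->endElement();
            }
            break;
        case XMLReader::END_ELEMENT:
            $writer->endElement();
            break;
        case XMLReader::TEXT:
        case XMLReader::CDATA:
            if (trim($reader->value) !== '') {
                $writer->text($reader->value); // keep real text
            }
            break;
        // Whitespace-only nodes (WHITESPACE, SIGNIFICANT_WHITESPACE) are simply skipped.
    }
}

echo $writer->outputMemory();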
Another option would be to just use gzip (the call in PHP is ob_start('ob_gzhandler')). This will compress your data, and compression works extremely well on data that repeats a lot within a document. It comes with a little performance penalty as well, but the reduced size of the output document may make up for it.
Be aware, though, that the output will not be sent to the browser before all of it is available. This makes partial loading of web pages much harder ;-).
The problem with using ob_* and then trimming whitespace is that you'll have to make sure not to remove whitespace that is actually displayed, as inside <pre> tags or <textarea>s etc. You'd need a syntactical parser that understands where it must not trim.
With a (performance-)expensive parser you should also cache the output where possible.
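As a rough sketch of that idea (my own simplification, not a full parser), you could pull the whitespace-sensitive blocks out first, collapse whitespace in the rest, and then put them back:
// Hedged sketch: collapse whitespace everywhere except inside <pre> and
// <textarea> blocks. A real HTML parser is safer for complex markup.
function collapseWhitespacePreserving($html)
{
    $protected = array();

    // Pull out blocks whose whitespace must stay intact, replacing them with placeholders.
    $html = preg_replace_callback(
        '#<(pre|textarea)\b[^>]*>.*?</\1>#si',
        function ($m) use (&$protected) {
            $key = '@@PROTECTED_' . count($protected) . '@@';
            $protected[$key] = $m[0];
            return $key;
        },
        $html
    );

    // Collapse runs of whitespace in the remaining markup.
    $html = preg_replace('/\s+/', ' ', $html);

    // Put the protected blocks back.
    return $protected ? strtr($html, $protected) : $html;
}

echo collapseWhitespacePreserving("<p>some   text</p>\n<pre>  keep\n  this  </pre>");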
The following is code to remove all space characters but the first of a sequence of spaces. So 1 space will be kept, 3 spaces pruned to 1, etc.
At the top of your PHP file, do
ob_start();
At the end do
function StripExtraSpace($s)
{
    $newstr = "";
    for ($i = 0; $i < strlen($s); $i++)
    {
        // Copy the current character.
        $newstr = $newstr . substr($s, $i, 1);
        // If it was a space, skip over any spaces that immediately follow it.
        if (substr($s, $i, 1) == ' ')
            while (substr($s, $i + 1, 1) == ' ')
                $i++;
    }
    return $newstr;
}
$content = ob_get_clean();
echo StripExtraSpace($content);
I have a file with the following structure:
concept
    [at0000] -- Blood Pressure
language
    original_language =
    translations =
        author =
            ["organisation"] =
            ["email"] =
            >
        accreditation =
        >
    >
description
    original_author =
        ["organisation"] =
        ["email"] =
        ["date"] =
        >
    details =
    purpose =
I need to open and parse this file, and I have to take the indentation of each line into account, since the indentation represents the hierarchical structure. Is there any way in PHP to analyze, line by line, the indentation, whether at the beginning, middle or end of the line?
//rant on
It's unbelievable: who provides such a crappy data structure to parse?
It's 2014. XML is all over the place, plus lightweight JSON.
And what do we get? Not even CSV :)
//rant off
Maybe a fixed-column-width parser would fit:
https://github.com/t-geindre/fixed-column-width-parser
Basically, you get the lines with $lines = file("file.txt");
Then it's a matter of detecting the spaces or tabs in front of each line; see the sketch below.
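A minimal sketch of that line-by-line approach (the file name is just an example):
// Read the file line by line and measure the leading indentation of each line.
$lines = file('file.txt', FILE_IGNORE_NEW_LINES);

foreach ($lines as $line) {
    $trimmed = ltrim($line, " \t");
    $indent  = strlen($line) - strlen($trimmed); // number of leading whitespace characters
    echo $indent . ' => ' . $trimmed . PHP_EOL;
}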
Update
Turns out this "data" has a structure.
The data-structure "Archetype Definition Language" (ADL) is described in ISO 13606-2.
http://pangea.upv.es/en13606/index.php/resources/files/doc_download/2-en13606-part-2
This document contains a grammar description in Chapter 8.
You might use this grammar for parser construction.
Parsing the indentation is your smallest problem. Getting the data structure right is the real task.
Happy test writing - this will be a lot of work... be warned.
Let me also point to OpenEHR.
OpenEHR uses Java and Eiffel as programming languages.
The ADL parser is implemented in Java.
You might find it at https://github.com/openEHR/java-libs/blob/master/adl-parser/src/main/javacc/adl.jj
This is the parser ADL v1.4 in Ruby:
https://github.com/skoba/openehr-ruby/tree/master/lib/openehr/parser
This should get you pretty close to a solution.
Hope this helps a bit..
You can use the ltrim and rtrim functions.
For example, using the following code:
$line = ' concept';
echo strlen(ltrim($line));
echo strlen($line);
you can calculate the length of the string with and without the whitespace at the beginning of the line; the difference is the indentation depth.
However, I don't know what you mean by calculating indentation in the middle of the line. In that case you should probably use the substr function to jump to the place where you expect the indentation, and then again use ltrim and strlen to count the whitespace at the beginning of the substring.
You may also want to use the mbstring functions in case your text contains non-ASCII characters.
For reading the lines you can simply use the file() function.
There is no ADL parser for PHP, but ADL can be transformed to XML using the CKM (http://ckm.openehr.org/ckm/) or the Archetype Editor (http://www.openehr.org/downloads/modellingtools).
You can then process the XML in PHP.
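Once you have an XML export, a minimal sketch for loading it in PHP could look like this (the file name and element names depend on the actual export and are only placeholders):
// Hedged sketch: load the XML export of the archetype with SimpleXML.
$xml = simplexml_load_file('blood_pressure.xml');
if ($xml === false) {
    die('Could not parse the XML export.');
}
// Walk the top-level elements; the names depend on the export format.
foreach ($xml->children() as $child) {
    echo $child->getName() . PHP_EOL;
}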
I am about to write a character-counting function which counts input from a TinyMCE textarea.
I do server-side validation with code like this:
$string = "is<isvery interesting <thatthis willbe stripped";
$stripped = strip_tags($string);
$count = strlen($stripped); // This will return 2
You might notice that $string contains no actual tag at all, yet strip_tags() strips everything from the first less-than sign onwards.
Is this a bug or a feature?
This has been documented:
Because strip_tags() does not actually validate the HTML, partial or
broken tags can result in the removal of more text/data than expected.
http://php.net/manual/en/function.strip-tags.php
strip_tags is actually quite dumb. It strips everything that even remotely looks like an HTML tag: anything starting with < followed by an alphanumeric character, up to the closing > or as far as it can get.
The observed behavior is, in this context, arguably a bug. However, strip_tags is not the tool for doing error correction on input HTML. Its purpose is to strip things away so that the remainder is safe to embed in web pages. When in doubt, it strips more, which is a good thing.
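A quick demonstration of that greedy behavior:
// A broken tag with no closing ">" swallows everything after it,
// while a well-formed tag is removed cleanly.
echo strip_tags('before <b broken tag with no closing bracket'); // prints "before "
echo strip_tags('a <b>kept</b> word');                           // prints "a kept word"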
I was wondering if there was a way to slightly modify the require or include functionality so that it removes line breaks and whitespace, i.e. so that it minifies all the HTML/JS inside the documents I'm trying to grab.
I tried this:
trim(require('my document.php'));
That didn't work though. Is there a correct way to do this?
Cheers,
Doug
You probably want to do something like this:
ob_start();
require('my document.php');
echo minify(ob_get_clean()); // ob_get_clean() returns the buffer without sending it first
This will get all the output generated by my document.php and minify it. You have to find a minifying library to do that, though.
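For illustration only, a very naive stand-in for such a minify() function could look like the sketch below; in practice a proper HTML minifier library is safer, since this one does not protect <pre>, <textarea> or inline scripts:
// Hedged sketch of a placeholder minify() helper.
function minify($html)
{
    $html = preg_replace('/>\s+</', '><', $html); // drop whitespace between tags
    $html = preg_replace('/\s{2,}/', ' ', $html); // collapse remaining runs of whitespace
    return trim($html);
}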
You'd have to write or use a parser of some sort to do the job properly. If a rough-and-ready solution is enough, str_replace will take an array and replace each entry with a single character,
e.g. str_replace(array("\n", "\r", "\t" /* etc. */), " ", $mystring);
But that seems like a lot of processing for what you want to achieve.
This question makes no sense.
Removing whitespace makes little sense by itself. If bandwidth is such a big concern, you can set up your server to send compressed content; that will reduce bandwidth far more.
And why clean whitespace only in the output of the included files? Why not do it for the whole site's output at once?
What you're describing is potentially fragile and vastly inferior to simply enabling compression.
It would be helpful if you updated your question to say why you want to do this.
However, you can do it like this:
ob_start();
require('file');
echo preg_replace(array("#[\r\n\t]+#", '#>\s+<#', '#\s+#', '#\s?{\s?#', '#\s?}\s?#'), array('', '><', ' ', '{', '}'), ob_get_clean());
Even so, you'll likely find you manage to remove whitespace that you need, especially if you run it on JavaScript and miss a semicolon.
I wrote a caching class. It automatically gets the page content after many database queries and saves it as a .html file. Every 600 seconds, it reads from this .html page instead of querying.
To make reading even faster, I want to remove unwanted characters like " " (space), "\n" and the like. How can I do this?
I know I can do this in many ways: trim, str_replace and so on. I want to know the fastest and the safest way (so it won't break JavaScript) to rely on. :)
Thank you.
My advice to you: don't.
Unless your templates are made so poorly that they generate markup that consists mostly of spaces.
You want a way to remove spaces that is simple, fast and safe. But in reality you can choose only two of these three properties.
If you want simple and fast: use str_replace, but it breaks your javascript.
If you want simple and safe: edit files yourself and remove spaces manually.
If you want fast and safe: you'll have to use some complex parsers and/or optimization tools.
It's up to you!
Pay attention to your JavaScript code first: if it contains single-line JavaScript comments, stripping newlines can break it.
Have a look at the YUI Compressor and/or the Google Closure Compiler to optimize your JavaScript first.
For the rest of the page, you can pass it through these handy functions:
Compressing your HTML, CSS and Javascript using simple PHP Code
Hope it helps
I suppose using str_replace is the fastest way! An alternative would be preg_replace, but regular expressions are not as fast as simple string replacements.
Some time ago I wrote this method to tidy up some source code:
private function makeTiny($source, $type) {
    // Get replacements
    $replacements = array();
    if ($this->conf[$type]['stripTabs']) {
        $replacements[] = "\t";
    }
    if ($this->conf[$type]['stripNewLines']) {
        $replacements[] = "\n";
        $replacements[] = "\r";
    }
    // Do replacements
    $source = str_replace($replacements, '', $source);
    // Strip comments
    if ($this->conf[$type]['stripComments']) {
        $source = preg_replace('/<\!\-\-.*?\-\->/is', '', $source);
    }
    // Strip double spaces
    if ($this->conf[$type]['stripDoubleSpaces']) {
        $source = preg_replace('/( {2,})/is', ' ', $source);
    }
    if ($this->conf[$type]['stripTwoLinesToOne']) {
        $source = preg_replace('/(\n{2,})/is', "\n", $source);
    }
    return $source;
}
As far as I remember, this does not kill inline JavaScript! But you should test it first.
Try
$newcontent = preg_replace('/(\n|\s|\t)+/', '', $oldcontent);
This will remove spaces, tabs and newlines (note that \s already covers \n and \t, so /\s+/ would do the same).
In the database I have some code like this:
Some text
<pre>
#include <cstdio>
int x = 1;
</pre>
Some text
When I try to use phpQuery to do the parsing, it fails because <cstdio> is interpreted as a tag.
I could use htmlspecialchars, but to apply it only inside pre tags I would still need to do some parsing. I could use a regex, but it would be much more difficult (I would need to handle the possible attributes of the pre tag), and the whole point of using a parser was to avoid this kind of regex work.
What's the best way to do what I need to do ?
Remember to encode the HTML special characters (&, <, > and so on) before assembling the content.
I finally went the regex way, considering only simple attributes for the pre tag (no '>' inside the attributes):
foreach (array('pre', 'code') as $sTag) {
    $s = preg_replace_callback(
        "#\<($sTag)([^\>]*?)\>(.+?)\<\/$sTag\>#si",
        function ($matches) {
            // Decode entities that are already present so they don't get encoded twice.
            $matches[3] = str_replace(array('&amp;', '&lt;', '&gt;'), array('&', '<', '>'), $matches[3]);
            return "<{$matches[1]} {$matches[2]}>" . htmlentities($matches[3], ENT_COMPAT, "UTF-8") . "</{$matches[1]}>";
        },
        $s
    );
}
It also deals with characters that have already been converted to HTML entities (we don't want to encode them twice).
Not a perfect solution, but given the data I need to apply it to, it will do the job.
The real problem is that your database contains HTML in which some of the text is not correctly encoded.
So, if you want to save time and have a correct solution, you should make sure that the HTML in your database is correctly encoded. This means making sure that everything is correctly encoded (using htmlspecialchars()) before it is saved to your database!
Otherwise you are just saving garbage in your database, and you will have to write special code to "prettify that garbage".
Any other solution is a workaround, and workarounds will cost you precious time in the future.
So: the best solution is to make sure that anything you write to your database is correctly encoded.
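For example, a hedged sketch of encoding a snippet before it is stored (the SQLite table and column names here are made up purely for illustration, and the pdo_sqlite extension is assumed):
// Encode user-supplied code with htmlspecialchars() before saving, so the HTML
// stored in the database is already correctly encoded.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE posts (body TEXT)');

$code = "#include <cstdio>\nint x = 1;";
$safe = '<pre>' . htmlspecialchars($code, ENT_QUOTES, 'UTF-8') . '</pre>';

$stmt = $pdo->prepare('INSERT INTO posts (body) VALUES (:body)');
$stmt->execute(array(':body' => $safe));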