This question is about optimizing a part of a program that I add to many projects as a common tool.
This 'template parser' takes a text pattern containing HTML code (or anything else) with several specific tags, and replaces those tags with developer-supplied values when rendered.
The few classes involved do a great job and work as expected; when needed, they make it possible to isolate design elements and easily adapt or replace design blocks.
The patterns I use look like this (nothing exceptional I admit) :
<table class="{class}" id="{id}">
<block_row>
<tr>
<block_cell>
<td>{content}</td>
</block_cell>
</tr>
</block_row>
</table>
(The example code below is adapted from the real code.)
The parsing does things like this:
// Variables are sorted by position in pattern string
// Position is read once and stored in cache to avoid
// multiple calls to str_pos or str_replace
foreach ($this->aVars as $oVar) {
$sString = substr($sString, 0, $oVar->start) .
$oVar->value .
substr($sString, $oVar->end);
}
// Once pattern loaded, blocks look like --¤(<block_name>)¤--
foreach ($this->aBlocks as $sName=>$oBlock) {
$sBlockData = $oBlock->parse();
$sString = str_replace('--¤(' . $sName . ')¤--', $sBlockData, $sString);
}
Using the class instance, I call methods like 'addBlock' or 'setVar' to fill my pattern with data.
This system has several disadvantages, among them the many objects held in memory (one for each block instance) and the many calls to string manipulation functions during parsing (preg_replace in the past, now just a bunch of substr and pals).
The program I'm working on makes heavy use of these templates, and they are about to reach their limits.
My question is the following (no need for code, just ideas or a lead to follow):
Should I consider that I've overused this system, and try to arrange things so that I don't need so many calls to these templates (for instance by improving the cache, or using only simple view scripts)?
Do you know a technical solution for feeding a structure with data that would not be the resource hog I wrote? While writing this I thought of XSLT: would it be suitable, and if so, could it improve performance?
Thanks in advance for your advice.
Use the XDebug extension to profile your code and find out exactly which parts of the code are taking the most time.
I want to extract the content of a specific div in an external webpage. The div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm actually using this php code to extract the content:
function getvalue($parameter,$content){
preg_match($parameter, $content, $match);
return $match[1];
};
$parameter = '#<dt>Score</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine; the problem is that this method takes too much time, especially if I have to use it several times with different $content values.
I would like to know if there's a better (faster, simpler, etc.) way to accomplish the same thing. Thanks!
You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now to get to the desired node, you may use method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach($dds as $dd) {
// process each <dd> element here, extract inner div and its inner html...
}
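As a rough sketch of that navigation, assuming the page really uses the dt/dd markup shown in the question: the helper name findDefinitionValue and the sample HTML are made up for illustration, and libxml's warning handling is switched off because real pages are rarely valid.

```php
<?php
// Hypothetical helper: returns the text of the <dd> that follows a <dt>
// carrying the given label, or null when no such pair exists.
function findDefinitionValue(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('dt') as $dt) {
        if (trim($dt->textContent) !== $label) {
            continue;
        }
        // Walk forward to the next element sibling, skipping text nodes.
        $node = $dt->nextSibling;
        while ($node !== null && $node->nodeType !== XML_ELEMENT_NODE) {
            $node = $node->nextSibling;
        }
        if ($node !== null && $node->nodeName === 'dd') {
            return trim($node->textContent);
        }
    }
    return null;
}

// Sample markup standing in for the downloaded page.
$sample = '<html><body><dl><dt>Win rate</dt><dd><div>50%</div></dd></dl></body></html>';
echo findDefinitionValue($sample, 'Win rate'), "\n";
```

Matching on the label text rather than on the exact tag sequence means extra whitespace or attributes around the elements do not break the lookup.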
Edit: I see the point @pebbl has made about DOMDocument being slower. Indeed it is; however, parsing HTML with preg_match is asking for trouble. In that case, I'd also recommend looking at an event-driven SAX XML parser. It is much more lightweight, faster and less memory-intensive, as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.
There are basically three main things you can do to improve the speed of your code:
Off load the external page load to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but as you use Windows I'm not sure what the equivalent would be. Cron on Linux lets you run scripts on a schedule, in the background, without a browser. Basically, I would recommend that you create a script whose sole purpose is to go and fetch the website pages at a particular interval (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
'http://www.something.com/page.htm',
'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
$content = file_get_contents( $site );
/// i've just simply converted the URL into a filename here, there are
/// better ways of handling this, but this at least keeps things simple.
/// the following just converts any non letter or non number into an
/// underscore... so, http___www_something_com_page_htm
$file_name = preg_replace('/[^a-z0-9]/i','_', $site);
file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from local files, this would give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (but prone to possible problems) is just to re-step your array of sites, convert the URLs to file names using the preg_replace above, and then check for the file's existence in the folder.
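A minimal sketch of that simpler method, reusing the same preg_replace rule as the fetch script above. The function names siteToFileName and loadCachedSites are invented for this example; sites whose file has not been fetched yet are simply skipped.

```php
<?php
// Same URL-to-filename rule as the fetch script above.
function siteToFileName(string $site): string
{
    return preg_replace('/[^a-z0-9]/i', '_', $site);
}

// Hypothetical helper: returns the locally stored contents per site,
// leaving out any site that has no file in the folder yet.
function loadCachedSites(array $listOfSites, string $dirToContainSites): array
{
    $pages = array();
    foreach ($listOfSites as $site) {
        $path = $dirToContainSites . '/' . siteToFileName($site);
        if (is_file($path)) {
            $pages[$site] = file_get_contents($path);
        }
    }
    return $pages;
}

// Usage: $pages = loadCachedSites($listOfSites, getcwd() . '/sites');
```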
Cache the result of calculating your statistics
As this is a stats page, it's quite likely that you'll want to visit it fairly frequently (not as often as a public page, but still). If the page is visited more often than the cron-based script runs, there is no reason to do all the calculation again. So all you have to do to cache your output is something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
/// if so, load it and echo it to the browser
echo file_get_contents($cachedVersion);
}
else {
/// start output buffering so we can catch what we send to the browser
ob_start();
/// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
/// end output buffering and grab the contents so we now have a string
/// of the page we've just generated
$content = ob_get_contents(); ob_end_clean();
/// write the content to the cached file for next time
file_put_contents($cachedVersion, $content);
echo $content;
}
Once you start caching things you need to be aware of when you should delete or clear your cache, otherwise your stats output will never change. In this situation, the best time to clear your cache is at the point you go and fetch the external web pages again. So you should add these lines to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
unlink( $cachedVersion ); /// will delete the file
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and load only when they have been updated) but I've tried to keep things easy to explain.
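One possible variation along those lines, sketched under the assumption that the fetched pages are stored as local files: instead of deleting the cache from the cron script, the stats page compares modification times and rebuilds only when a source file is newer than the cached output. The helper name isCacheFresh is made up for this example.

```php
<?php
// Hypothetical helper: the cached page counts as fresh only if it exists
// and is at least as new as every fetched source file.
function isCacheFresh(string $cachedVersion, array $sourceFiles): bool
{
    clearstatcache(); // make sure filemtime() reads current values
    if (!is_file($cachedVersion)) {
        return false;
    }
    $cachedAt = filemtime($cachedVersion);
    foreach ($sourceFiles as $file) {
        if (is_file($file) && filemtime($file) > $cachedAt) {
            return false; // a page was fetched after the cache was written
        }
    }
    return true;
}

// Usage: if (isCacheFresh($cachedVersion, glob(getcwd() . '/sites/*'))) { ... }
```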
Don't use a HTML Parser for this situation
Scanning an HTML file for one particular unique value does not require a full-blown or even lightweight HTML parser. Using RegExp incorrectly seems to be one of those traps that lots of beginner programmers fall into, and it is a question that is always asked. This has led to automatic knee-jerk reactions from more experienced coders, who adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
$automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
$soundAdvice = $think->about( $theSituation );
print $soundAdvice;
}
HTMLParsers should be used when the target within the markup is not so unique, or your pattern to match relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not if you want to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library-code you are using has been compiled in an extremely optimised manner, it will not beat well coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously, if your HTML is more complex and might have multiple random occurrences of <dt>Win rate</dt><dd><div>50%</div></dd>, it will cause problems; but even so, an HTML parser would still have the same problem.
$offset = 0;
/// loop through the occurrences of 'Win rate'
while ( ($p = stripos ($html, 'win rate', $offset)) !== FALSE ) {
/// grab out a snippet of the surrounding HTML to speed up the RegExp
/// (the third argument of substr is a length, not an end position)
$snippet = substr($html, $p, 50);
/// I've extended your RegExp to try and account for 'white space' that could
/// occur around the elements. The following won't take into account any random
/// attributes that may appear, so if you find some pages aren't working - echo
/// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
/// and that should show you what is appearing that is breaking the RegExp.
if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
/// once you are here your % value will be in $regs[1];
break; /// exit the while loop as we have found our 'Win rate'
}
/// move our offset past this match for the next loop, otherwise
/// stripos would keep finding the same position forever
$offset = $p + 1;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated, which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow through the code I've given and research any bits that you aren't sure of or haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File Permission errors - in order to be able to read and write files to and from the local operating system you will need to have the correct permissions to do so. If you find you cannot write files to a particular directory, it might be that the host you are using won't allow you to do so. If this is the case you can either contact them to ask how to get write permission to a folder, or, if that isn't possible, you can easily change the code above to use a database instead.
I can't see my content - when using output buffering all the echo and print commands do not get sent to the browser, they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer' so all the content is erased. This can lead to confusing situations when you know you are echoing something.. but it just isn't appearing.
(Mini Disclaimer :) I've typed all the above manually so you may find there are PHP errors, if so, and they are baffling, just write them back here and StackOverflow can help you out)
Instead of trying to avoid preg_match, why not just trim your document contents down in size? For example, you could dump everything before <body and everything after </body>. Then preg_match will already be searching less content.
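A small sketch of that trimming step, assuming the page contains at most one body element. The helper name extractBody and the sample page are invented for illustration; the pattern is the one from the question with the label set to "Win rate".

```php
<?php
// Hypothetical helper: keeps only the part of the page from "<body"
// up to and including "</body>".
function extractBody(string $html): string
{
    $start = stripos($html, '<body');
    if ($start === false) {
        return $html; // no body tag found, leave the document untouched
    }
    $end = stripos($html, '</body>', $start);
    if ($end === false) {
        return substr($html, $start);
    }
    return substr($html, $start, $end - $start + strlen('</body>'));
}

// Sample page standing in for the downloaded content.
$page = '<html><head><title>Stats</title></head><body><dt>Win rate</dt><dd><div>50%</div></dd></body></html>';
$body = extractBody($page);
preg_match('#<dt>Win rate</dt><dd><div>(.*?)</div></dd>#', $body, $match);
echo $match[1], "\n";
```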
Also, you could try to do each one of these processes as a pseudo separate thread, so that way they aren't happening one at a time.
I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But tumblr has added a script that activates each time you enter a password-field. So my question is:
Can I remove certain parts with file_get_contents? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
edit:
I managed to do it the simple way, by skipping the first 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, null, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is kind of frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance to time spent ratio.
Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction) and then turn it back into a string and echo it. This can be more convenient to work with if your requirements are not dead simple and is easier to maintain, but it typically requires more code to be written.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), then don't be tempted by string functions; go with the second approach.
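The second approach might look roughly like this. It is only a sketch: the helper name removeElementById, the id "login" and the sample markup are assumptions, and the id is expected to contain no double quotes because it is pasted into the XPath expression as-is.

```php
<?php
// Hypothetical helper: removes every element with the given id and
// returns the document serialized back to a string.
function removeElementById(string $html, string $id): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the matches into a plain array before modifying the tree.
    $matches = iterator_to_array($xpath->query('//*[@id="' . $id . '"]'));
    foreach ($matches as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

// Sample markup standing in for the fetched blog post.
$blog = '<html><body><div id="login">password form</div><div id="post">Hello</div></body></html>';
$cleaned = removeElementById($blog, 'login');
echo $cleaned;
```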
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.
Is there a standard output library that "knows" that PHP outputs to HTML?
For instance:
1. var_dump - this should be wrapped in <pre> or maybe in a table if the variable is an array
2. a version of echo that adds a "<br/>\n" in the end
3. Somewhere in the middle of PHP code I want to add an H3 title:
.
?><h3><?= $title ?></h3><?
Out of php and then back in. I'd rather write:
tag_wrap($title, 'h3');
or
h3($title);
Obviously I can write a library myself, but I would prefer to use a conventional way if there is one.
Edit
3 Might not be a good example - I don't get much for using alternative syntax and I could have made it shorter.
1 and 2 are useful for debugging and quick testing.
I doubt that anyone would murder me for using some high-level html emitting functions of my own making when it saves a lot of writing.
In regards to #1, try xdebug's var_dump override, if you control your server and can install PHP extensions. The remote debugger and performance tools provided by xdebug are great additions to your arsenal. If you're looking only for pure PHP code, consider Kint or dBug to supplement var_dump.
In regards to #2 and #3, you don't need to do this. Rather, you probably shouldn't do this.
PHP makes a fine HTML templating language. Trying to create functions to emit HTML is going to lead you down a horrible road of basically implementing the DOM in a horribly awkward and backwards way. Considering how horribly awkward the DOM already is, that'll be quite an accomplishment. The future maintainers of your code are going to want to murder you for it.
There is no shame in escaping out of PHP to emit large blocks of HTML. Escaping out to emit a single tag, though, is completely silly. Don't do that, and don't create functions that do that. There are better ways.
First, don't forget that print and echo aren't functions; they're built in to the language parser. Because echo is a special snowflake, it can take a comma-separated list of arguments without parens (print accepts only a single argument). This can make some awkward HTML construction far less awkward. For example:
echo '<select name="', htmlspecialchars($select_name), '">';
foreach($list as $key => $value) {
echo '<option value="',
htmlspecialchars($key),
'">',
htmlspecialchars($value),
'</option>';
}
echo '</select>';
Next, PHP supports heredocs, a method of creating a double-quoted string without the double-quotes:
$snippet = <<<HERE
<h1>$heading</h1>
<p>
<span class="aside">$aside_content</span>
$actual_content
</p>
HERE;
With these two tools in your arsenal, you may find yourself breaking out of PHP far less frequently.
While there is a case for helper functions (there are only so many ways you can build a <select>, for example), you want to use these carefully and create them to reduce copy and paste, not simply to create them. The people that will be taking care of the code you're writing five years from now will appreciate you for it.
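For illustration only, here is what such a narrowly scoped helper could look like. The name html_select and its signature are made up for this sketch; it exists to remove copy and paste around one repetitive element, not to wrap every tag.

```php
<?php
// Hypothetical helper: builds a <select> from an array of value => label
// pairs, escaping everything that ends up in the markup.
function html_select(string $name, array $options, ?string $selected = null): string
{
    $html = '<select name="' . htmlspecialchars($name) . '">';
    foreach ($options as $value => $label) {
        $html .= '<option value="' . htmlspecialchars((string) $value) . '"'
            . ((string) $value === $selected ? ' selected' : '')
            . '>' . htmlspecialchars((string) $label) . '</option>';
    }
    return $html . '</select>';
}

$markup = html_select('size', array('s' => 'Small', 'l' => 'Large & tall'), 'l');
echo $markup, "\n";
```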
You should use a PHP template engine and just separate the presentation from the logic entirely. It makes no sense for an educated programmer to try to create a library like that.
I'm trying to figure out the most efficient way to implement RoR-style partials/collections for a PHP template class that I'm writing. For those who aren't familiar with rails, I want to iterate over a template fragment (say a table row or list item) located in a separate file. I want to do this without resorting to eval or placing an include within the loop.
I've seen a similar post that addresses single partials, which are trivial, but nothing that covers implementing partials in a collection. I've been thinking about this so long my head hurts and I'm afraid I'm overlooking an obvious solution. I'm hoping someone here can suggest an elegant solution that, again, doesn't require eval or include within the loop. TIA.
You need a templating engine that can process includes on its own and then eval the whole thing at once, much like the C preprocessor works.
Step 1 (source template):
$template = '
foreach($bigarray as $record) {
#include "template_for_record.php"
}
';
Step 2 (after preprocessing):
$template = '
foreach($bigarray as $record) {
// include statement replaced with file contents
echo $record["name"]; // etc.
}
';
Step 3 (final rendering)
// eval() only once
eval($template);
In this way you can avoid the overhead of evaling/including subtemplate on every loop step.
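The three steps can be put together as a runnable sketch. To keep it self-contained, the partial lives in an in-memory array ($partials) rather than on disk, and the preprocess function is a name invented for this example; a real implementation would read the partial with file_get_contents and must only ever eval trusted template code.

```php
<?php
// Hypothetical in-memory "files" so the sketch needs nothing on disk.
$partials = array(
    'template_for_record.php' => 'echo $record["name"], "\n";',
);

// Step 1: source template with a C-preprocessor-style include directive.
$template = '
foreach ($bigarray as $record) {
    #include "template_for_record.php"
}
';

// Step 2: replace every include directive with the partial's source code.
function preprocess(string $template, array $partials): string
{
    return preg_replace_callback(
        '/^[ \t]*#include "([^"]+)"[ \t]*$/m',
        function (array $m) use ($partials) {
            return $partials[$m[1]];
        },
        $template
    );
}

// Step 3: eval() only once for the whole collection.
$bigarray = array(array('name' => 'Alice'), array('name' => 'Bob'));
$compiled = preprocess($template, $partials);
ob_start();
eval($compiled);
$output = ob_get_clean();
echo $output;
```

The partial is inlined once during preprocessing, so the loop body runs as ordinary compiled code with no include or eval per record.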
You're asking how to do something without resorting to the solution.
Any template system you use is going to use an eval or an include within the loop, even if it's buried in abstraction 1000 layers deep.
That's just how it's done.
I'm searching for a PHP syntax highlighting engine that can be customized (i.e. I can provide my own tokenizers for new languages) and that can handle several languages simultaneously (i.e. on the same output page). This engine has to work well together with CSS classes, i.e. it should format the output by inserting <span> elements that are adorned with class attributes. Bonus points for an extensible schema.
I do not search for a client-side syntax highlighting script (JavaScript).
So far, I'm stuck with GeSHi. Unfortunately, GeSHi fails abysmally for several reasons. The main reason is that the different language files define completely different, inconsistent styles. I've worked hours trying to refactor the different language definitions down to a common denominator but since most definition files are in themselves quite bad, I'd finally like to switch.
Ideally, I'd like to have an API similar to CodeRay, Pygments or the JavaScript dp.SyntaxHighlighter.
Clarification:
I'm looking for a code highlighting software written in PHP, not for PHP (since I need to use it from inside PHP).
Since no existing tool satisfied my needs, I wrote my own. Lo and behold:
Hyperlight
Usage is extremely easy: just use
<?php hyperlight($code, 'php'); ?>
to highlight code. Writing new language definitions is relatively easy, too – using regular expressions and a powerful but simple state machine. By the way, I still need a lot of definitions so feel free to contribute.
[I marked this answer as Community Wiki because you're specifically not looking for Javascript]
http://softwaremaniacs.org/soft/highlight/ is a JavaScript syntax highlighting library that supports PHP, plus the following list of other languages:
Python, Ruby, Perl, PHP, XML, HTML, CSS, Django, Javascript, VBScript, Delphi, Java, C++, C#, Lisp, RenderMan (RSL and RIB), Maya Embedded Language, SQL, SmallTalk, Axapta, 1C, Ini, Diff, DOS .bat, Bash
It uses <span class="keyword"> style markup.
It has also been integrated in the dojo toolkit (as a dojox project: dojox.lang.highlight)
Though not the most popular way to run a webserver, strictly speaking, Javascript is not only implemented on the client-side, but there are also Server-Side Javascript engine/platform combinations too.
I found this simple generic syntax highlighter written in PHP here and modified it a bit:
<?php
/**
* Original => http://phoboslab.org/log/2007/08/generic-syntax-highlighting-with-regular-expressions
* Usage => `echo SyntaxHighlight::process('source code here');`
*/
class SyntaxHighlight {
public static function process($s) {
$s = htmlspecialchars($s);
// Workaround for escaped backslashes
$s = str_replace('\\\\','\\\\<e>', $s);
$tokens = array(); // This array will be filled from the regexp-callback
// Comments/Strings
// The /e modifier was removed in PHP 7, so this pattern goes through
// preg_replace_callback instead of the array below.
$s = preg_replace_callback('/(
\/\*.*?\*\/|
\/\/.*?\n|
\#.[^a-fA-F0-9]+?\n|
\<\!\-\-[\s\S]+\-\-\>|
(?<!\\\)".*?(?<!\\\)"|
(?<!\\\)\'(.*?)(?<!\\\)\'
)/isx', function ($m) use (&$tokens) {
return self::replaceId($tokens, $m[1]);
}, $s);
$regexp = array(
// Punctuations
'/([\-\!\%\^\*\(\)\+\|\~\=\`\{\}\[\]\:\"\'<>\?\,\.\/]+)/'
=> '<span class="P">$1</span>',
// Numbers (also look for Hex)
'/(?<!\w)(
(0x|\#)[\da-f]+|
\d+|
\d+(px|em|cm|mm|rem|s|\%)
)(?!\w)/ix'
=> '<span class="N">$1</span>',
// Make the bold assumption that an
// all uppercase word has a special meaning
'/(?<!\w|>|\#)(
[A-Z_0-9]{2,}
)(?!\w)/x'
=> '<span class="D">$1</span>',
// Keywords
'/(?<!\w|\$|\%|\#|>)(
and|or|xor|for|do|while|foreach|as|return|die|exit|if|then|else|
elseif|new|delete|try|throw|catch|finally|class|function|string|
array|object|resource|var|bool|boolean|int|integer|float|double|
real|string|array|global|const|static|public|private|protected|
published|extends|switch|true|false|null|void|this|self|struct|
char|signed|unsigned|short|long
)(?!\w|=")/ix'
=> '<span class="K">$1</span>',
// PHP/Perl-Style Vars: $var, %var, #var
'/(?<!\w)(
(\$|\%|\#)(\->|\w)+
)(?!\w)/ix'
=> '<span class="V">$1</span>'
);
$s = preg_replace(array_keys($regexp), array_values($regexp), $s);
// Paste the comments and strings back in again
$s = str_replace(array_keys($tokens), array_values($tokens), $s);
// Delete the "Escaped Backslash Workaround Token" (TM)
// and replace tabs with four spaces.
$s = str_replace(array('<e>', "\t"), array('', '    '), $s);
return '<pre><code>' . $s . '</code></pre>';
}
// Regexp-Callback to replace every comment or string with a uniqid and save
// the matched text in an array
// This way, strings and comments will be stripped out and wont be processed
// by the other expressions searching for keywords etc.
private static function replaceId(&$a, $match) {
$id = "##r" . uniqid() . "##";
// String or Comment?
if(substr($match, 0, 2) == '//' || substr($match, 0, 2) == '/*' || substr($match, 0, 2) == '##' || substr($match, 0, 7) == '<!--') {
$a[$id] = '<span class="C">' . $match . '</span>';
} else {
$a[$id] = '<span class="S">' . $match . '</span>';
}
return $id;
}
}
?>
Demo: http://phpfiddle.org/lite/code/1sf-htn
Update
I just created a PHP port of my own JavaScript generic syntax highlighter here → https://github.com/taufik-nurrohman/generic-syntax-highlighter/blob/master/generic-syntax-highlighter.php
How to use:
<?php require 'generic-syntax-highlighter.php'; ?>
<pre><code><?php echo SH('<div class="foo"></div>'); ?></code></pre>
It might be worth looking at Pear_TextHighlighter (documentation)
I think it won't by default output html exactly how you want it, but it does provide extensive capabilities for customisation (i.e. you can create different renderers/parsers)
I had exactly the same problem, but as I was very short on time and needed really good code coverage, I decided to write a PHP wrapper around the Pygments library.
It's called PHPygmentizator. It's really simple to use, and I wrote a very basic manual. As PHP is primarily a web development language, I shaped the structure around that fact and made it very easy to use in almost any kind of website.
It supports configuration files and if that isn't enough and somebody needs to modify stuff in the process it also fires events.
Demo of how it works can be found on basically any post of my blog which contains source code, this one for example.
With default config you can just provide it a string in this format:
Any text here.
[pygments=javascript]
var a = function(ar1, ar2) {
return null;
}
[/pygments]
Any text.
So it highlights code between tags (tags can be customized in configuration file) and leaves the rest untouched.
Additionally, I have already made a syntax recognition library (it uses an algorithm which would probably be classified as Bayesian probability) which automatically recognizes which language a code block is written in, and which can easily be hooked to one of the PHPygmentizator events to provide automatic language recognition. I will probably make it public some time this week, since I need to beautify the structure a bit and write some basic documentation. If you supply it with enough "learning" data it recognizes languages amazingly well; I tested even minified JavaScript and languages which have similar keywords and structures, and it has never made a mistake.
Another option is to use the GPL Highlight GUI program by Andre Simon which is available for most platforms. It converts PHP (and other languages) to HTML, RTF, XML, etc. which you can then cut and paste into the page you want. This way, the processing is only done once.
The HTML is also CSS based, so you can change the style as you please.
Personally, I use dp.SyntaxHighlighter, but that uses client side Javascript, so it doesn't meet your needs. It does have a nice Windows Live plugin though which I find useful.
A little late to chime in here, but I've been working on my own PHP syntax highlighting library. It is still in its early stages, but I am using it for all code samples on my blog.
Just checked out Hyperlight. It looks pretty cool, but it is doing some pretty crazy stuff. Nested loops, processing line by line, etc. The core class is over 1000 lines of code.
If you are interested in something simple and lightweight check out Nijikodo:
http://www.craigiam.com/nijikodo
Why not use PHP's built-in syntax highlighter?
http://php.net/manual/en/function.highlight-string.php
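A minimal usage sketch; the sample code string is made up. Note that the generated markup uses inline style attributes, with colours taken from the highlight.* ini settings, rather than CSS classes, and that it only understands PHP.

```php
<?php
// Sample source to highlight.
$code = '<?php echo "Hello"; ?>';
// With the second argument set to true the markup is returned instead of printed.
$html = highlight_string($code, true);
echo $html, "\n";
```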
PHP Prettify works fine so far, and has more customization than highlight_string.
Krijn Hoetmer's PHP Highlighter provides a completely customizable PHP class to highlight PHP syntax. The HTML it generates validates under a strict doctype and is completely stylable with CSS.