I have a file with the following structure:
concept
[at0000] -- Blood Pressure
language
original_language =
translations =
author =
["organisation"] =
["email"] =
>
accreditation =
>
>
description
original_author =
["organisation"] =
["email"] =
["date"] =
>
details =
purpose =
I need to open and parse this file, but I must take the indentation of each line into account, because the indentation represents the hierarchical structure. Is there any way in PHP to analyse the indentation line by line, whether it occurs at the beginning, in the middle or at the end of a line?
//rant on
It's simple: who provides such a crappy data structure to parse.
It's 2014. XML all over the place and lightweight JSON.
What do we get? Not even CSV :)
//rant off
Maybe a fixed column width parser would fit:
https://github.com/t-geindre/fixed-column-width-parser
Basically you get lines with $lines = file("file.txt");
Then it's a matter of detecting the spaces or tabs in front of each line.
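A minimal sketch of that detection step, assuming the hierarchy is expressed with leading tabs or spaces. The helper name indentLevels and the sample indentation are made up for illustration; strspn() is a standard PHP function that counts how many leading characters belong to a given set.

```php
<?php
// Hypothetical helper: returns [indentWidth, content] for every non-blank line.
function indentLevels(array $lines): array
{
    $result = [];
    foreach ($lines as $line) {
        $line = rtrim($line, "\r\n");
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        // strspn() counts how many leading characters are spaces or tabs.
        $indent = strspn($line, " \t");
        $result[] = [$indent, substr($line, $indent)];
    }
    return $result;
}

// In real use: $lines = file('file.txt');
// The tab indentation below is an assumption for illustration.
$lines = [
    "concept\n",
    "\t[at0000] -- Blood Pressure\n",
    "language\n",
    "\toriginal_language =\n",
];

foreach (indentLevels($lines) as [$indent, $content]) {
    echo $indent, ': ', $content, "\n";
}
```

The indent width only tells you the nesting depth of each line; turning that into a tree (for example with a stack of open parents) is a separate step.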
Update
Turns out this "data" has a structure.
The data-structure "Archetype Definition Language" (ADL) is described in ISO 13606-2.
http://pangea.upv.es/en13606/index.php/resources/files/doc_download/2-en13606-part-2
This document contains a grammar description in Chapter 8.
You might use this grammar for parser construction.
Parsing the indentation is your smallest problem. Getting the data structure right is the real task.
Happy test writing - this will be a lot of work... be warned.
Let me also point to OpenEHR.
OpenEHR uses Java and Eiffel as programming languages.
The ADL parser is implemented in Java.
You might find it at https://github.com/openEHR/java-libs/blob/master/adl-parser/src/main/javacc/adl.jj
This is an ADL v1.4 parser in Ruby:
https://github.com/skoba/openehr-ruby/tree/master/lib/openehr/parser
This should get you pretty close to a solution.
Hope this helps a bit..
You can use ltrim and rtrim functions.
For example using the following code:
$line = ' concept';
echo strlen(ltrim($line)), "\n"; // 7: length without the leading whitespace
echo strlen($line), "\n"; // 8: length including the leading whitespace
This way you can calculate the length of the string with and without the whitespace at the beginning of the line; the difference is the indentation width.
However, I don't know what you mean by calculating indentation in the middle of the line. In that case you should probably use the substr function to move to the place where you expect the indentation, and then use ltrim and strlen again to count the whitespace at the beginning of the substring.
You may also want to use the mbstring functions in case your file contains non-ASCII characters.
To read the file into lines you can simply use the file() function.
ADL doesn't have a parser for PHP. But ADL can be transformed to XML using the CKM (http://ckm.openehr.org/ckm/) or the Archetype Editor (http://www.openehr.org/downloads/modellingtools).
You should use the XML in PHP.
Related
I have a CSV parser, that takes Outlook 2010 Contact Export .CSV file, and produces an array of values.
I break each row on the new line symbol, and each column on the comma. It works fine, until someone puts a new line inside a field (typically Address). This new line, which I assume is "\n" or "\r\n", explodes the row where it shouldn't, and the whole file becomes messed up from there on.
In my case, it happens when Business Street is written in two lines:
123 Apple Dr.
Unit A
My code:
$file = file_get_contents("outlook.csv");
$rows = explode("\r\n",$file);
foreach($rows as $row)
{
$columns = explode(",",$row);
// Further manipulation here.
}
I have tried both "\n" and "\r\n", same result.
I figured I could calculate the number of columns in the first row (keys), and then find a way to not allow a new line until this many columns have been parsed, but it feels shady.
Is there another character for the new line that I can try, that would not be inside the data fields themselves?
The most common way of handling newlines in CSV files is to "quote" fields which contain significant characters such as newlines or commas. It may be worth looking into whether your CSV generator does this.
I recommend using PHP's fgetcsv() function, which is intended for this purpose. As you've discovered, splitting strings on commas works only in the most trivial cases.
In cases where that doesn't work, a more sophisticated, reportedly RFC 4180-compliant parser is available here.
I also recommend fgetcsv()
fgetcsv will also take care of commas inside strings ( between quotes ).
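A small sketch of what that looks like, assuming the export quotes multi-line fields as described above. The sample data and the in-memory stream are made up for illustration; a real script would fopen() the exported file instead.

```php
<?php
// Assumed sample data: the second record has a line break inside a quoted field.
$csv = "Name,Business Street\r\n"
     . "\"Tony\",\"123 Apple Dr.\r\nUnit A\"\r\n";

// Write the sample to an in-memory stream; a real script would fopen() the export file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, $csv);
rewind($handle);

$rows = [];
// Separator, enclosure and escape are passed explicitly rather than relying on defaults.
while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

The quoted address stays in a single field, line break included, instead of starting a new row.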
Interesting parsing tutorial
+1 to the previous answer ;)
PS: fgetcsv is a bit slower than reading the whole file and exploding its contents, but IMO it's worth it.
I tried a performance check tool "DOM Monster" to analyze my php site. There is one information which says "50% of nodes are whitespace-only text nodes".
OK, I understand the problem, but what is the fastest way to clean up whitespace in PHP?
I think a good start is to use output control, i.e. ob_start(), and then strip the whitespace before releasing the output with ob_end_flush(). At the moment I do everything with echo. I have never read much about the ob_* functions; are they useful?
I guess using preg_replace() for this job is a performance killer, isn't it?
So what is the best practice for this?
The fastest way to remove whitespace-only nodes is to not create them in the first place. Just remove all the whitespace immediately before and after each HTML tag.
You certainly could remove the spaces from your code after the fact using an output handler (look at the callback bit in ob_start), but if your goal is performance, then that kind of defeats the purpose.
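For reference, a minimal sketch of the output-handler route. The callback name stripInterTagWhitespace and the sample markup are assumptions for illustration, and the regex is deliberately naive: it would also strip significant whitespace between inline elements or inside <pre> blocks.

```php
<?php
// Hypothetical callback: drops whitespace-only runs between a closing '>' and the next '<'.
function stripInterTagWhitespace(string $buffer): string
{
    // Fall back to the untouched buffer if preg_replace() reports an error (null).
    return preg_replace('~>\s+<~', '><', $buffer) ?? $buffer;
}

ob_start('stripInterTagWhitespace');
echo "<ul>\n    <li>one</li>\n    <li>two</li>\n</ul>\n";
ob_end_flush(); // the callback runs on the buffered output before it is sent
```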
A whitespace-only node is in the DOM tree parsed by the browser when it reads your HTML. It's where there's an HTML tag, then nothing but whitespace, then another HTML tag. It's a waste of browser resources, but not a huge deal.
The function trim() will solve your problem, won't it?
http://www.php.net/manual/en/function.trim.php
Well, I guess you are talking about HTML, and HTML is by nature a markup language full of whitespace (attributes, text).
By the way, you probably use newlines for readability.
I would rather advise you to compress your pages with deflate/gzip via webserver rules, e.g. an .htaccess rule:
<FilesMatch "\\.(js|css|html|htm|php|xml)$">
SetOutputFilter DEFLATE
</FilesMatch>
You can also take a look at Tidy which is a library to help you to check and cleanup your HTML code.
preg_replace will of course slow things down a little, but it is probably the fastest way anyway. The bigger problem is that preg_replace may be unreliable, because it is very hard to write a regular expression that works in all possible cases.
If you are creating XML/XHTML output, you could parse all your data using a fast stream parser (SAX or StAX; PHP usually has both built in) and then write the data back to the output without the whitespace. That's simple, effective, reliable and at least moderately fast. It's still not going to blow you away with speed.
Another option would be to just use gzip (ob_start('ob_gzhandler') is the call in PHP, if I remember correctly). This will compress your data, and compression works extremely well on data that repeats a lot within a document. That comes with a little performance penalty as well, but the reduced size of the output document may make up for it.
Beware, though, that the output will not be sent to the browser before all output is available. This makes partial loading of webpages much harder ;-).
The problem with using ob_* and then trimming whitespace is that you’ll have to make sure to not remove displayed whitespace like in <pre> tags or <textarea>s etc. You’ll need a syntactical parser which understands where it should not trim.
With an (performance-)expensive parser you should also cache output where possible.
The following code removes all space characters but the first in each sequence of spaces. So 1 space will be kept, 3 spaces pruned to 1, etc.
At the top of your PHP file do
ob_start();
At the end do
function StripExtraSpace($s)
{
    $newstr = "";
    for ($i = 0; $i < strlen($s); $i++)
    {
        // Always copy the current character.
        $newstr = $newstr . substr($s, $i, 1);
        // After a space, skip every space that directly follows it.
        if (substr($s, $i, 1) == ' ')
            while (substr($s, $i + 1, 1) == ' ')
                $i++;
    }
    return $newstr;
}
$content = ob_get_clean();
echo StripExtraSpace($content);
Well, if I comment something out, it's skipped in all languages, but how is it skipped and what is actually read?
Example:
// This is commented out
Now, does PHP read the whole comment to get to the next line, or does it just read the //?
The script is parsed and split into tokens.
You can actually try this out yourself on any valid PHP source code using token_get_all(), it uses PHP's native tokenizer.
The example from the manual shows how a comment is dealt with:
<?php
$tokens = token_get_all('<?php echo; ?>'); /* => array(
array(T_OPEN_TAG, '<?php'),
array(T_ECHO, 'echo'),
';',
array(T_CLOSE_TAG, '?>') ); */
/* Note in the following example that the string is parsed as T_INLINE_HTML
rather than the otherwise expected T_COMMENT (T_ML_COMMENT in PHP <5).
This is because no open/close tags were used in the "code" provided.
This would be equivalent to putting a comment outside of <?php ?>
tags in a normal file. */
$tokens = token_get_all('/* comment */');
// => array(array(T_INLINE_HTML, '/* comment */'));
?>
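To see what happens to a // comment specifically, here is a small sketch along the same lines. The source string is made up for illustration; token_get_all() and token_name() are part of PHP's tokenizer extension.

```php
<?php
// The source string is an assumption for illustration.
$source = "<?php\n// This is commented out\necho 1;\n";

foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        // Array tokens are [id, text, line]; token_name() turns the id into a readable name.
        echo token_name($token[0]), ': ', var_export($token[1], true), "\n";
    } else {
        echo 'char: ', $token, "\n"; // single-character tokens such as ';'
    }
}

// Collect only the comment tokens.
$comments = array_filter(token_get_all($source), function ($t) {
    return is_array($t) && $t[0] === T_COMMENT;
});
```

The whole comment text arrives as one T_COMMENT token, so the tokenizer did read it to the end of the line; the later compilation stages simply do nothing with that token.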
There is a tokenization phase during compilation. During this phase, the tokenizer sees the // and then just ignores everything up to the end of the line. Compilers CAN get complicated, but for the most part they are pretty straightforward.
http://compilers.iecc.com/crenshaw/
Your question doesn't make sense. Having read the '//', it then has to keep reading to the newline to find it. There's no choice about this. There is no other way to find the newline.
Conceptually, compiling has several phases that are logically prior to parsing:
(1) Scanning.
(2) Screening.
(3) Tokenization.
(1) basically means reading the file character by character from left to right.
(2) means throwing things away of no interest, e.g. collapsing multiple newline/whitespace sequences to a single space.
(3) means combining what's left into tokens, e.g. identifiers, keywords, literals, punctuation.
Comments are screened out during (2). In modern compilers this is all done at once by a deterministic automaton.
I need a regex (to work in PHP) to replace American English words in HTML with British English words. So color would be replaced by colour, meters by metres and so on [I know that meters is also a British English word, but for the copy we'll be using it will always be referring to units of distance rather than measuring devices]. The pattern would need to work accurately in the following (slightly contrived) examples (although as I have no control over the actual input these could exist):
<span style="color:red">This is the color red</span>
[should not replace color in the HTML tag but should replace it in the sentence]
<p>Color: red</p>
[should replace word]
<p>Tony Brammeter lives 2000 meters from his sister</p>
[should replace meters for the word but not in the name]
I know there are edge cases where replacement wouldn't be useful (if his name was Tony Meter for example), but these are rare enough that we can deal with them when they come up.
HTML/XML should not be processed with regular expressions; it is really hard to write one that handles every possible input. But you can use the built-in DOM extension and process your document recursively:
# Warning: untested code!
function process($node, $replaceRules) {
    foreach ($node->childNodes as $childNode) {
        if ($childNode instanceof DOMText) {
            // Text node: apply every rule and update the text in place.
            $childNode->data = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->data
            );
        } elseif ($childNode->hasChildNodes()) {
            // Element with children: descend into it.
            process($childNode, $replaceRules);
        }
    }
}
$replaceRules = array(
    '/\bcolor\b/i' => 'colour',
    '/\bmeter(s?)\b/i' => 'metre$1',
);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();
I think you'd rather need a dictionary and maybe even some grammatical analysis in order to get this working correctly, since you don't have control over the input. A pure regex solution is not really going to be able to process this kind of data correctly.
So I'd suggest to first come up with a list of words that need to be replaced, those are not only "color" and "meter". Wikipedia has some information on the topic.
You do not want a regular expression for this. Regular expressions are by their very nature stateless, and you need some measure of state to be able to tell the difference between 'in a html tag' and 'in data'.
You want to use an HTML parser in combination with something like str_replace or, even better, a proper grammar dictionary and so on, as Lucero suggests.
The second problem is easier - you want to replace when there are word boundaries around the word: http://www.regular-expressions.info/wordboundaries.html -- this will make sure you don't replace the meter in Brammeter.
The first problem is much harder. You don't want to replace words inside HTML tags, i.e. nothing between < and > characters. So your match must make sure that the last bracket you saw was > (or that you saw none at all), but never <. This is either hard, requiring some combination of lookahead/lookbehind assertions, or just plain impossible with regular expressions.
A script implementing a state machine would work much better here.
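A rough sketch of such a state machine, combined with word-boundary patterns. The function name replaceOutsideTags and the two rules are assumptions for illustration; the scanner only tracks "inside a tag" versus "in text" and does not handle a > inside attribute values, comments or <script> blocks.

```php
<?php
// Hypothetical helper: applies $replaceText to text runs only, never to tag contents.
function replaceOutsideTags(string $html, callable $replaceText): string
{
    $out = '';
    $text = '';      // text collected since the last tag
    $inTag = false;  // state: are we between '<' and '>'?
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if (!$inTag && $ch === '<') {
            // Entering a tag: flush the collected text through the replacer.
            $out .= $replaceText($text) . $ch;
            $text = '';
            $inTag = true;
        } elseif ($inTag) {
            $out .= $ch; // copy tag characters untouched
            if ($ch === '>') {
                $inTag = false;
            }
        } else {
            $text .= $ch;
        }
    }
    return $out . $replaceText($text);
}

// \b keeps "meter" inside "Brammeter" from matching.
$rules = ['/\bcolor\b/i' => 'colour', '/\bmeter(s?)\b/i' => 'metre$1'];
$replace = function (string $text) use ($rules): string {
    return preg_replace(array_keys($rules), array_values($rules), $text);
};

echo replaceOutsideTags('<span style="color:red">This is the color red</span>', $replace), "\n";
echo replaceOutsideTags('<p>Tony Brammeter lives 2000 meters from his sister</p>', $replace), "\n";
```

Note that these rules do not preserve capitalisation, so "Color" at the start of a sentence would need an extra rule or a callback.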
You don't need to use a regex explicitly. You can try the str_replace function, or if you need it to be case insensitive use the str_ireplace function.
Example:
$str = "<p>Color: red</p>";
$new_str = str_ireplace('color', 'colour', $str);
You can pass an array with all the words that you want to search for, instead of the string.
First things first: Neither this, this, this nor this answered my question. So I'll open a new one.
Please read
Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!
Given the controlled syntax it is possible to parse the documents I have here using regexes.
I am not trying to download arbitrary documents from the web and parse them!
And if the parsing does fail, the document is edited, so it'll parse. The problem I am addressing here is more general than that (i.e. not replace patterns inside two other patterns).
A little bit of background (you can skip this...)
In our office we are supposed to "pretty print" our documentation. Hence why some came up with putting it all into Word documents. So far we're thankfully not quite there yet. And, if I get this done, we might not need to.
The current state (... and this)
The main part of the docs is stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents with a less than formal DOM.
Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.
The problem
Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).
Except while in a verbatim block!
I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.
I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.
Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.
My attempt
Example Input
The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>
Expected output
The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}
This is the best I could come up with so far:
<?php
$patterns = array(
"special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
$tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
Note that this is only an excerpt, and the [^$] is another LaTeX requirement.
Another attempt which seemed to work:
<?php
$patterns = array(
"special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
$tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
... in other words: leaving out the negative lookbehind.
But this looks more error-prone than with both lookbehind and lookahead.
A related question
As you may have noticed, the pattern is ungreedy (/.../U). So will this match as little as possible inside a <code> block, considering the look-arounds?
If it were me, I would try to find an HTML parser and do it with that.
Another option is to split the string into <code>.*?</code> chunks and the other parts,
update only the other parts, and then recombine everything.
$x="The Hello \ World document is located in:\n<br>
<code>C:\documents\hello_world.txt</code>";
$r=preg_split("/(<code>.*?<\/code>)/", $x,-1,PREG_SPLIT_DELIM_CAPTURE);
for($i=0;$i<count($r);$i+=2)
$r[$i]=str_replace("\\","$\\backslash$",$r[$i]);
$x=implode($r);
echo $x;
Here are the results:
The Hello $\backslash$ World document is located in:
C:\documents\hello_world.txt
Sorry, If my approach is not suitable for you.
I reckon I could solve this using negative LookBehinds and/or LookAheads.
You reckon wrong. Regular expressions are not a replacement for a parser.
I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser and then transform the DOM to your target output format. Is there anything preventing you from taking this route?
Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:
1. Find <code>.*?</code> sections of your file (you will probably need to turn on dot-matches-newline mode).
2. Replace all backslashes inside that section with something unique like #?#?#?#
3. Replace the section found in 1 with that new section
4. Replace all backslashes with $\backslash$
5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
6. Replace #?#?#?# with \
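The steps above can be sketched in PHP roughly as follows, assuming <code> tags are never nested, carry no attributes, and that the placeholder never occurs in the input. The function name is made up for illustration.

```php
<?php
// Hypothetical helper; assumes plain, non-nested <code>...</code> blocks.
function htmlToLatexBackslashes(string $html): string
{
    $placeholder = '#?#?#?#'; // assumed not to occur in the input

    // Steps 1-3: hide the backslashes inside each <code>...</code> section.
    // The "s" modifier lets "." match newlines.
    $text = preg_replace_callback(
        '~<code>.*?</code>~s',
        function (array $m) use ($placeholder) {
            return str_replace('\\', $placeholder, $m[0]);
        },
        $html
    );

    // Step 4: escape the remaining backslashes for LaTeX.
    $text = str_replace('\\', '$\\backslash$', $text);

    // Step 5: turn the code tags into verbatim environments.
    $text = str_replace(
        ['<code>', '</code>'],
        ['\\begin{verbatim}', '\\end{verbatim}'],
        $text
    );

    // Step 6: restore the hidden backslashes.
    return str_replace($placeholder, '\\', $text);
}

$input = 'The Hello \\ World document is located in:' . "\n"
       . '<code>C:\\documents\\hello_world.txt</code>';
echo htmlToLatexBackslashes($input), "\n";
```

Step 5 has to come after step 4; otherwise the backslashes of the freshly inserted \begin and \end would be escaped as well.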
FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.
Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?
With your example input and the command pandoc -o text.tex test.html the output is:
The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!
pandoc can read from stdin, write to stdout or pipe right into a file.
Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.
((?:^|</code>)(?:(?!<code>).)+?)\\
 |            |                 |
 |            |                 \-- backslash
 |            \-- least amount of anything not followed by <code>
 \-- start-of-string or </code>
And replace it with:
$1$\backslash$
You'd have to run this regex in "singleline" mode, so . matches newlines. You'd also have to run it multiple times, specifying global replacement is not enough. Each replacement will only replace the first eligible backslash after start-of-string or </code>.
Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM and replace the \ in every text node that is not a descendant of a code node with $\backslash$, and wrap every code node in \begin{verbatim} … \end{verbatim}.
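A minimal sketch of that traversal, assuming the input is a small, well-formed HTML fragment and that only the text content is wanted in the output. The function name toLatex and the wrapping <div> are assumptions for illustration; a real converter would also have to map the other tags to LaTeX.

```php
<?php
// Hypothetical converter: returns the text of $node with backslashes escaped
// everywhere except inside <code> elements, which become verbatim blocks.
function toLatex(DOMNode $node, bool $inCode = false): string
{
    if ($node instanceof DOMText) {
        return $inCode ? $node->data : str_replace('\\', '$\\backslash$', $node->data);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // ignore comments, processing instructions, etc.
    }

    $isCode = strtolower($node->nodeName) === 'code';
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= toLatex($child, $inCode || $isCode);
    }
    return $isCode ? '\\begin{verbatim}' . $out . '\\end{verbatim}' : $out;
}

$html = 'The Hello \\ World document is located in:' . "\n"
      . '<code>C:\\documents\\hello_world.txt</code>';

$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>'); // wrapped so the fragment has a single root
$latex = toLatex($doc->documentElement);
echo $latex, "\n";
```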