php function preg_replace regex not working, a syntax question - php

I'm trying to remove unnecessary comments with preg_replace in controlled script situations, but my regex is incorrect. Does anyone have any idea what's wrong with it? (I have Apache/2.0.54 and PHP/5.2.9.)
BEFORE:
// Bla Bli Blue Blow Bell Billy Bow Bye
script var etc (); // cangaroo cognac codified cilly celine cocktail couplet
script http://blaa.org // you get the idea!
AFTER:
script var etc ();
script http://blaa.org
PROBLEM: what regex to use?
# when comment starts on a new line, delete this entire line
# find [a new line] [//] [space or no space] [comment]
$buffer = preg_replace('??', '??', $buffer);
# when comment is halfway in script ( // comment)
# find [not beginning of a line] [1 TAB] [//] [1 space again] [comment]
$buffer = preg_replace('??', '??', $buffer);
Any and all suggestions will get a +1 from me, because I'm so close to solving this riddle!

Try this regex:
/(?<!http:)\/\/[^\r\n]*/
Be cautious though, consider strings like:
<!--
// not a comment -->
or
/*
// not a comment */
and
var s = "also // not // a // comment";
And you might want to work around https://... and ftp://... etc.
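A rough sketch of the two passes the question asks for, written in Python so it is easy to try out (the patterns should carry over to preg_replace, though that is untested here). The function name strip_line_comments is made up for illustration, and the lookbehind is widened from "http:" to any colon so that https:// and ftp:// survive too. It still ignores the caveats above: a // inside a string literal, an HTML comment, or a /* */ block will be stripped.

```python
import re

def strip_line_comments(text):
    # Pass 1: drop lines that contain nothing but a // comment,
    # including the line break that follows them.
    text = re.sub(r'(?m)^[ \t]*//[^\r\n]*(?:\r?\n|$)', '', text)
    # Pass 2: drop trailing // comments plus the whitespace before them,
    # unless the slashes directly follow a colon (http://, ftp://, ...).
    text = re.sub(r'[ \t]*(?<!:)//[^\r\n]*', '', text)
    return text
```

Running it on the BEFORE sample from the question should leave only the two script lines, with the URL intact.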


PHP preg_split is not returning a proper value

I'm trying to split a WordPress blog post title after a punctuation character is found, so that the character isn't cut off the way explode would do it, but it gives me the following var_dump:
array(2) { [0]=> string(0) "" [1]=> string(0) "" }
here is my code:
$title = $post['post_title'];
$titlepart = preg_split("/(.+)([.,?!]{1})(.+)/", $title);
var_dump($titlepart);
Any ideas?
My understanding is that you want to split the title at a control character; in your example you've chosen four: ",", ".", "!" and "?". You expect the title to be split in two, so a control character should definitely appear, ideally only once. We can guard against multiples, but we have to make an assumption: I put everything after the first one in the second line.
Here are two solutions - using preg_match and using strpbrk. I've included some example strings at the top ($ts) and the expected solutions in the comment immediately below.
<?php
$ts=[
    'Koningsdag 2017: Waar komt het vandaan? En wat u moet weten over deze dag',
    'King\'s Day 2017: You love it! You won\'t believe how old it is',
    'Queen of hearts: Foul-tempered. Find out what Alice did',
    'Wise owl: What goes up, must come down',
    'Shakespeare: Some are born great, some achieve greatness, and some have greatness thrust upon them',
    //'No delimiter',
    //''
];
/* Desired outputs:(?)
Koningsdag 2017: Waar komt het vandaan?
En wat u moet weten over deze dag
King's Day 2017: You love it!
You won't believe how old it is
Queen of hearts: Foul-tempered.
Find out what Alice did
Wise owl: What goes up,
must come down
Shakespeare: Some are born great,
some achieve greatness, and some have greatness thrust upon them
*/
function report($res) {
    if (count($res)!==2) {throw new Exception("Unexpected result");}
    echo $res[0]."\n".$res[1]."\n\n";
}
function splitByPreg($title,$chars='.,?!') {
    //Split by $chars, capture the split marker so we can append it back to the first match
    // - If we see more than one $char we assume all further examples should be in part 2
    $parts=preg_split('/(['.preg_quote($chars,'/').'])\s*/', $title,2,PREG_SPLIT_DELIM_CAPTURE);
    $n=count($parts);
    switch ($n) {
        case 3:
            //Expected
            return [$parts[0].$parts[1],$parts[2]];
        case 2:
            //Not sure how this could happen
            return $parts;
        case 1:
            //No delim found
            return [$parts[0],""];
    }
    return ["",""];
}
function splitByChar($title,$chars='.,?!') {
    //This returns the second line as a string starting with the break character
    $lineTwo=strpbrk($title,$chars);
    if ($lineTwo===false) return [$title,""];
    $n=strlen($title);
    $break=$n-strlen($lineTwo)+1;//+1 to move the break character to line1
    return [substr($title,0,$break),trim(substr($title,$break))];
}
foreach ($ts as $t) {
    //Pick one:
    $res=splitByPreg($t);
    //$res=splitByChar($t);
    report($res);
}
As you can see, you can pass the delimiters to either method (if you wish to use more than, or alternatives to, the original four). I've tested with a few, including / (which is only sometimes a control character in regex).
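For comparison, the same split-and-reattach idea as a short Python sketch. The name split_title is hypothetical, and re.split with a capture group stands in for preg_split with PREG_SPLIT_DELIM_CAPTURE. As far as I can tell, the pattern in the question yields two empty strings because it matches the whole title, so the whole title is consumed as the delimiter; splitting on just the punctuation avoids that.

```python
import re

def split_title(title, chars='.,?!'):
    # Split once on the first delimiter; the capture group keeps the
    # delimiter in the result so it can be glued back onto part one.
    pattern = '([' + re.escape(chars) + r'])\s*'
    parts = re.split(pattern, title, maxsplit=1)
    if len(parts) == 3:
        head, delim, tail = parts
        return [head + delim, tail]
    # No delimiter found: everything stays on the first line.
    return [title, '']
```

Calling split_title('Wise owl: What goes up, must come down') should give the two lines listed in the desired outputs above.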

php regex to python, preg_match($x,$y,$z) to re.search(x,y,z)

Mr. Wiktor, this question is by no means a duplicate of the question you juxtaposed in your justification of marking this a duplicate.
To wit, the question you pointed to asks what the counterpart of preg_match is in Python. Even in the TITLE ITSELF I mention "re.search", which was the answer in the thread you mentioned. I'm aware of re.search.
My question is SPECIFICALLY about how I can use the 3rd argument in re.search the way its counterpart in PHP is used in the example I provided. Mr. Wiktor, I respectfully request that you unmark my thread as a duplicate. Thank you in advance, sir.
What I'm trying to do is Stemming (NLP) for the Greek language in python. The php code is this:
protected static $step1list = array(
"φαγια"=>"φα",
"φαγιου"=>"φα",
"φαγιων"=>"φα",
"σκαγια"=>"σκα",
"σκαγιου"=>"σκα",
"σκαγιων"=>"σκα",
"ολογιου"=>"ολο",
"ολογια"=>"ολο",
"ολογιων"=>"ολο",
"σογιου"=>"σο",
"σογια"=>"σο",
"σογιων"=>"σο",
"τατογια"=>"τατο",
"τατογιου"=>"τατο",
"τατογιων"=>"τατο",
"κρεασ"=>"κρε",
"κρεατοσ"=>"κρε",
"κρεατα"=>"κρε",
"κρεατων"=>"κρε",
"περασ"=>"περ",
"περατοσ"=>"περ",
"περατα"=>"περ",
"περατων"=>"περ",
"τερασ"=>"τερ",
"τερατοσ"=>"τερ",
"τερατα"=>"τερ",
"τερατων"=>"τερ",
"φωσ"=>"φω",
"φωτοσ"=>"φω",
"φωτα"=>"φω",
"φωτων"=>"φω",
"καθεστωσ"=>"καθεστ",
"καθεστωτοσ"=>"καθεστ",
"καθεστωτα"=>"καθεστ",
"καθεστωτων"=>"καθεστ",
"γεγονοσ"=>"γεγον",
"γεγονοτοσ"=>"γεγον",
"γεγονοτα"=>"γεγον",
"γεγονοτων"=>"γεγον"
);
protected static $step1regexp="/(.*)(φαγια|φαγιου|φαγιων|σκαγια|σκαγιου|σκαγιων|ολογιου|ολογια|ολογιων|σογιου|σογια|σογιων|τατογια|τατογιου|τατογιων|κρεασ|κρεατοσ|κρεατα|κρεατων|περασ|περατοσ|περατα|περατων|τερασ|τερατοσ|τερατα|τερατων|φωσ|φωτοσ|φωτα|φωτων|καθεστωσ|καθεστωτοσ|καθεστωτα|καθεστωτων|γεγονοσ|γεγονοτοσ|γεγονοτα|γεγονοτων)$/u";
$w;
$stem="";
$suffix="";
$firstch="";
if (preg_match($step1regexp, $w, $fp)) {
$stem = $fp[1];
$suffix = $fp[2];
$w = $stem.$step1list[$suffix];
}
The latest thing I've tried is this (I don't really have "blah" in the dictionary; the entries are the same as in the PHP one):
import re
step1list = {
u"φαγια": u"φα",
blah blah blah blah
}
stem = ""
suffix=""
firstch=""
s = u"σογια"
reg = re.compile(r'/(.*)(φαγια|φαγιου|φαγιων|σκαγια|σκαγιου|σκαγιων|ολογιου|ολογια|ολογιων|σογιου|σογια|σογιων|τατογια|τατογιου|τατογιων|κρεασ|κρεατοσ|κρεατα|κρεατων|περασ|περατοσ|περατα|περατων|τερασ|τερατοσ|τερατα|τερατων|φωσ|φωτοσ|φωτα|φωτων|καθεστωσ|καθεστωτοσ|καθεστωτα|καθεστωτων|γεγονοσ|γεγονοτοσ|γεγονοτα|γεγονοτων)$');
m = reg.search(s)
if m:
    stem = m.group(1);
    suffix = m.group(2);
    s = "{0}{1}".format(stem, step1list[suffix])
print(s)
print(stem)
print(suffix)
what I get as a result is:
σογια
(with 2 blank lines after it) which means that the 2 groups are not successfully identified :(
How do I mend this?
From the docs (also see match vs search):
import re
p = re.compile(regex)
m = p.search('string goes here')  # use p.match() to match only from the start of the string
if m:
    print('Match found: ', m.group())  # group(1...n) for capture groups
else:
    print('No match')
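Applied to the stemmer, a minimal sketch might look like the following. The likely culprit in the attempt above is the leading "/" copied over from PHP: in PHP it is only a delimiter, but Python's re treats it as a literal character that the word must contain. The match object returned by search plays the role of preg_match's third argument. The dictionary here is a shortened subset of step1list, and the names STEP1, STEP1_RE and stem_step1 are made up for illustration.

```python
import re

# Shortened subset of the question's step1list, for illustration only.
STEP1 = {
    "φαγια": "φα",
    "σογια": "σο",
    "κρεατοσ": "κρε",
    "φωτων": "φω",
}

# No leading "/" and no trailing "/u": those are PHP delimiters and flags.
# Python 3 str patterns already work on Unicode text.
STEP1_RE = re.compile("(.*)(" + "|".join(map(re.escape, STEP1)) + ")$")

def stem_step1(word):
    m = STEP1_RE.search(word)
    if m is None:
        return word  # no known suffix, leave the word unchanged
    stem, suffix = m.group(1), m.group(2)  # same as $fp[1], $fp[2] in PHP
    return stem + STEP1[suffix]
```

With this, stem_step1("σογια") should return "σο", and a word without a listed suffix comes back unchanged.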

Text Processing with PHP

Stackoverflow: I need your help!
I've been tasked with turning some (fairly) complex work diagrams for railway staff extracted from a Word document into something more usable for further processing, such as into a PHP array.
Here is a sample of one of the work diagrams:
LTP BH 4000
( Link 5)
DVR Su
On 00.22 PASS Barnham 00+34 5H97
Off 08.03 Lham 00+42
Hrs 7:41 PPTC Lham (06+24) 5N08
Traction for the above Service is
Days Su class 377
From 18/05/2014 377 PC Lham 01+46 5S62 DOO
To 24/08/2014 (Via CET)
TC Lham O Sh 01+50
PNB
377 PC Lham O Sh 03+10 5W62 DOO
(Via CWM)
DTCS Lham 03+32
377 PP Lham Shed 04+10 5W00 DOO
(Via CWM)
DTCS Lham Shed 04+24
PPTC Lham Shed (07+39) 5E24
Traction for the above Service is
class 377
PPTC Lham (06+37) 5H92
Traction for the above service is
class 377
377 PP Lham Shed 05+45 5W01 DOO
(Via CET)
377 Lham O Sh 05+57 06+28 5W01 DOO
(Via CWM)
TC Lham Shed 06+42
PPTC Lham Shed (09+58) 5H67
Traction for the above Service is
class 377
PPTC Lham Shed (07+41) 5P29 RP MO
Traction for the above Service is
class 377
(Unit forms part of 22+17
attachment)
PASS Lham 07.54 2P31
(To Bognor Regis)
Barnham 08.02
Routes 919
I've managed to process some of the data using simple regular expressions, but where I am struggling is the "middle" data, which actually shows the work to be done. I am struggling because there is no real structure that defines what each line should look like; you will notice that many lines differ, and some even include free-text notes.
What I am looking to accomplish is to turn each row into an array that looks like the following:
$row = array("stock", "activity", "location", "departure_time", "arrival_time", "train_id", "notes");
The difficulty comes as not every line fits into this format - some lines have every "column", whereas others have one or more columns missing and other lines consist of free text.
I am by no means a text processing expert, but I cannot seem to find a solution to this problem. I'm not after a complete solution, just some pointers would be gratefully received!
Update Just for clarification, I'm not interested in the free text rows. The data they contain is not important for what I am trying to accomplish.
I'll refine this answer more as soon as more data comes in, but in the meantime I'd go with what amounts to a state machine.
You read the text one line after the other. Initially you are in the "WAITING FOR DIAGRAM" state:
$status = array(
    'file' => $fp, // file handle, assumed to be opened earlier with fopen()
    'manager' => 'waitForDiagram',
);
$chunk = 0;
$lineno = 0;
$manage = $status['manager'];
while (!feof($fp)) {
    $line = fgets($fp, 1024); // is 1 KB enough? Maybe not.
    $lineno++;
    $manage($status, $line);
    if ($status['manager'] != $manage) {
        // The state changed: reset the stuck-line counter and switch handler.
        $chunk = 0;
        if (!function_exists($status['manager'])) {
            trigger_error("{$manage}({$line}) -> {$status['manager']}: no such state");
        }
        $manage = $status['manager'];
    }
    // ALERT is assumed to be a constant defined elsewhere (maximum lines per state).
    if (++$chunk > ALERT) {
        trigger_error("Stuck in state {$manage} since {$chunk} lines!", E_USER_ERROR);
    }
}
Then you define a function for each state, beginning with the first:
function waitForDiagram(&$status, $line) {
    // Part common to most such state functions:
    $tokens = tokenise($line);
    // Quickly check whether anything needs doing.
    if (!in_array($tokens[0], [ "LTP" ])) {
        // if not, return.
        return;
    }
    $status['diagram'] = array(
        'title' => $tokens[0],
        'whatever' => $tokens[1],
        'comments' => '',
    );
    ...
    // In this case, all information is only in one line, so we can
    // continue to the next state, which in this case is always waitForOnAndGetComments.
    $status['manager'] = 'waitForOnAndGetComments';
}
function waitForOnAndGetComments(&$status, $line) {
    $tokens = tokenise($line);
    // If we get "On" it's the line, otherwise it is still the comment
    if (!in_array($tokens[0], [ "On" ])) {
        $status['diagram']['comments'] .= $line;
        return;
    }
    // Otherwise we have On 00.22 PASS Barnham 00+34
    // and always a next line.
    $offTok = tokenise(fgets($status['file'], 1024));
    if ($offTok[0] != "Off") {
        trigger_error("Found ON, but next row is not OFF, what gives?", E_USER_ERROR);
    }
    $status['diagram']['on'] = array(
        'time' => $tokens[1],
        ...
    );
    ...
    $status['diagram']['off'] = array(
        'time' => $offTok[1],
        'line' => $offTok[2],
        ...
    );
    $status['manager'] = 'waitForSomethingElse';
}
...and so on...
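To show the overall shape in one piece, here is a stripped-down sketch of the same state-machine idea in Python. It is not a port of the code above: the handler names are hypothetical, each handler returns the next handler instead of writing its name into a status array, tokenising is plain whitespace splitting, and only the header, the comment lines and the "On" time are collected.

```python
def wait_for_diagram(state, tokens):
    # Stay in this state until a line starting with "LTP" opens a diagram.
    if tokens and tokens[0] == "LTP":
        state["diagram"] = {"title": " ".join(tokens), "comments": [], "on": None}
        return wait_for_on
    return wait_for_diagram

def wait_for_on(state, tokens):
    # Everything before the "On" line is treated as a comment.
    if tokens and tokens[0] == "On":
        state["diagram"]["on"] = tokens[1]
        return done
    state["diagram"]["comments"].append(" ".join(tokens))
    return wait_for_on

def done(state, tokens):
    # Terminal state for this sketch: ignore the remaining lines.
    return done

def parse(lines):
    state = {"diagram": None}
    handler = wait_for_diagram
    for line in lines:
        handler = handler(state, line.split())
    return state["diagram"]
```

Feeding it the first few lines of the sample diagram should yield the title "LTP BH 4000", two comment lines and an "On" time of "00.22"; further states would be added the same way.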
One important thing is how you tokenise the lines. If you have a clear delimiter (such as a tab) and can use explode, all well and good. Otherwise you can try preg_split('#\\s{2,}#', $line), using runs of two or more whitespace characters to separate the "cells" in each "row".
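A small sketch of that tokenising step in Python, assuming the cells really are separated by a tab or by at least two spaces (the sample in the question shows single spaces, so the spacing in the usage example below is invented). The function name tokenise simply mirrors the helper used in the PHP above.

```python
import re

def tokenise(line):
    """Split a row into cells on tabs or on runs of two or more whitespace characters."""
    cells = re.split(r'\t+|\s{2,}', line.strip())
    # Drop empty cells that can appear around leading or doubled separators.
    return [cell for cell in cells if cell]
```

For instance, tokenise("377  PC  Lham O Sh\t03+10  5W62  DOO") keeps "Lham O Sh" together as one cell, because single spaces are not treated as separators.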
I found what was causing me grief in solving this. I'm loading the Word document using a tool called "antiword". Antiword seems to strip special characters such as tabs. However, I found that when I pass the "-w 0" switch these characters are preserved, and parsing the diagrams using simple regular expressions became trivial. Many thanks to @Iserni for taking the time to help me, nonetheless.

How to make tidy ignore TT code in html.tt templates?

I have some TT templates that I want to tidy up a little. I use tidy on the command line.
My command looks like:
$ tidy -utf8 --preserve-entities y -indent -wrap 120 file.html.tt
Unfortunately if I have code like:
[% aoh.unshift({ label => '', value => 'All types' }); %]
It ends up in the resulting file as:
[% aoh.unshift({ label => '', value => 'All types' }); %]
The same happens with Template Toolkit code in tag attributes, eg:
<a href="[%%20%20c.url_for('/content/edit').query('data_type'%20=%3Edata_type%20)%20%]" >
What would be the needed options to make tidy ignore everything between "[%" and "%]"?
Same question holds true for PHP start and end tags.
Thanks.
What if you temporarily disguise your TT tags?
$ perl -pi -e 's/\[%/<!--\[ %/g; s/%\]/% \]-->/g' file.html.tt
$ tidy -utf8 --preserve-entities y -indent -wrap 120 file.html.tt
$ perl -pi -e 's/<!--\[ %/\[%/g; s/% \]-->/%\]/g' file.html.tt
The first command will convert all your TT elements into HTML comments, the last command changes them back.
Extending the ideas here somewhat: why not replace the TT snippets with something completely harmless and put the originals back after tidy has run? In the code below, I replace them with comments like <!-- sn20 -->:
use File::Slurp;
my $template = read_file(shift);
# replace TT snippets with <!-- snNN -->
my %snip = ();
my $id = 0;
$template =~ s/ \[% (.*?) %\] / $snip{++$id} = $1; "<!-- sn$id -->" /gxse;
# run tidy
open my $tidy_fh, '|-', 'tidy -utf8 --preserve-entities y -indent -wrap 120 >tidy_out'
or die;
print $tidy_fh $template;
close $tidy_fh;
# fix code back
my $template_tidied = read_file('tidy_out');
$template_tidied =~ s/<!-- sn(\d+) -->/ "[%$snip{$1}%]" /ge;
# print the result
print $template_tidied;
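The stash-and-restore part of that approach can also be sketched in Python, leaving the tidy call itself out. The names protect and restore are hypothetical; the placeholder format follows the Perl code above, except that the whole tag including "[%" and "%]" is stored so it can be put back verbatim.

```python
import re

TT_RE = re.compile(r'\[%.*?%\]', re.S)  # one TT directive, possibly multi-line

def protect(template):
    """Replace every TT directive with a numbered HTML comment."""
    snippets = []
    def stash(match):
        snippets.append(match.group(0))
        return '<!-- sn%d -->' % len(snippets)
    return TT_RE.sub(stash, template), snippets

def restore(text, snippets):
    """Put the original TT directives back in place of the placeholders."""
    return re.sub(r'<!-- sn(\d+) -->',
                  lambda m: snippets[int(m.group(1)) - 1],
                  text)
```

The text returned by protect would be handed to tidy and the tidied result passed to restore. One caveat I have not verified: a placeholder that lands inside an attribute value is not a real comment there, so tidy may still rewrite it.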

Processing log files with PHP regular expressions

I'm going through a log file using PHP, and each line looks either like this:
11/06/05 09:17:59 TORMS068 11/06/05 09:17:59.234 TORMS068\Admin ... EPTH{2} ITEMIX{8} TELL{`` sdcsit49 - FileSystem /oracle/REF/sapdata2 Critical - MSGREC:1727:100 ``} USE{TELL} ATTACHMENT... xact{`NO`}
(where I used ellipses to show that there was a lot of other stuff)
or like this:
11/06/05 11:29:38 TORM ... H{3} ITEMIX{5} TELL{``marble: initiator SCSI ID now 7 } File={ /var/adm/messages } - MsgRec 5174:406``} USE{TELL} ATTACHMENT{} UserParms{} AnswerWait{`10`} BaudRate{`1200`} C... eviceId{``} TellExact{`NO`}
So it is either followed by a USE{TELL} or File={.*}
I want to extract what is inside the braces of TELL{} for every line in the log file.
Please help me, I'm going crazy lol
Thanks
if (preg_match("/USE\{(?<insideTheBrackets>[^\}]+)\}/", $line, $pat)) {
var_dump($pat['insideTheBrackets']);
}
PHP's regular expression documentation is pretty good at explaining this stuff...
<?php
$str = "11/06/05 11:29:38 TORM ... H{3} ITEMIX{5} TELL{``marble: initiator SCSI ID now 7 } File={ /var/adm/messages } - MsgRec 5174:406``} USE{TELL} ATTACHMENT{} UserParms{} AnswerWait{`10`} BaudRate{`1200`} C... eviceId{``} TellExact{`NO`}";
$pat = '/TELL\{(``.*?``)\}/'; // lazy match, so it stops at the first ``} instead of the last one
preg_match($pat,$str,$matches);
print_r($matches);
?>
Well I couldn't get a reliable method to work, but what I ended up with was
if (preg_match("/(TELL{)(.*)} (USE{T)/iU ",$data,$matches)) {
    $rowMsg=substr($matches[0],5,-7);
    $rowMsg=preg_replace("/[}\. ]{1}\$/",'',$rowMsg);
}
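A compact way to express the same extraction, sketched in Python. It assumes, based only on the two sample lines, that the TELL message is always wrapped in double backticks and that the sequence ``} does not occur inside the message itself; extract_tell is a made-up name.

```python
import re

# Lazy match between TELL{`` and the first following ``}
TELL_RE = re.compile(r'TELL\{``(.*?)``\}')

def extract_tell(line):
    """Return the text inside TELL{``...``}, or None if the line has no TELL block."""
    m = TELL_RE.search(line)
    return m.group(1).strip() if m else None
```

Because the match is lazy, braces inside the message (such as the File={ ... } part of the second sample) do not cut it short, and later fields with empty backticks are not swallowed.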
