Upgrade to PHP 5.4.29 (from 5.3?) broke my regex - php

My host just upgraded PHP to 5.4.29 from one of the 5.3 versions, I believe. This broke a very important regular expression that I use in a frequently used program.
I want to match variations on the following (each line is a separate example):
1 / AGGRAVATED ASSAULT Withdrawn 18 § 2702
1 / Simple Assault Guilty Plea 18 § 2701 §§ A
1 / Criminal Mischief Judgment of Acquittal 18 § 3304-12
This is my regex. It has worked for the last 2 years without fail:
/\d\s+\/\s+(.+)\s{12,}(\w.+?)(?=\s\s)\s{12,}(\w{0,2})\s+(\w{1,2}\s?\247\s?\d+(\-|\247|\w+)*)/
I use it as follows:
if (preg_match(self::$chargesSearch2, $line, $matches))
My expectation is that
matches[1] = the charge (Aggravated assault, etc...)
matches[2] = the grading (which often doesn't appear and isn't on any of these examples)
matches[3] = the disposition (Withdrawn, etc...)
matches[4] = the code section (18 § 2702)
For some reason it doesn't work now--it doesn't match the lines in question. Does anyone see the error?

While I cannot answer the question of why it doesn't work any more, I can tell you that regex is not the right tool for this job. It's kind of like using a flat-headed screwdriver to drive cross-headed screws. Sure, it'll work, but it's easier to use the right tool.
In this case, you should make some kind of basic parser.
$line = "1 / AGGRAVATED ASSAULT ...";
list($count, $rest) = explode("/", $line, 2);
$count = intval(trim($count));
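// Assumes the columns are separated by runs of two or more spaces.
list($crime, $verdict, $details) = array_pad(preg_split('/ {2,}/', trim($rest), 3), 3, '');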
$crime = trim($crime);
$verdict = trim($verdict);
$details = trim($details);
// I don't know what the significance of $details is.
// But using the above you should be able to figure out how to parse it :)
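A slightly fuller version of that idea might look like the sketch below. The function name parse_charge_line() is hypothetical, and two things are assumptions rather than facts from the question: that the columns are separated by runs of two or more spaces or tabs, and that the optional grading column sits between the disposition and the code section (the order of the capture groups in the original pattern). Adjust the index if your lines differ.

```php
<?php
// Hypothetical helper: split one docket line into named fields.
// Assumption: columns are separated by runs of two or more spaces/tabs.
function parse_charge_line(string $line): ?array
{
    $parts = explode('/', $line, 2);
    if (count($parts) < 2) {
        return null; // no "N / ..." prefix on this line
    }
    $cols = preg_split('/[ \t]{2,}/', trim($parts[1]), -1, PREG_SPLIT_NO_EMPTY);
    if (count($cols) === 3) {
        // Assumption: the grading column is the one that may be missing.
        array_splice($cols, 2, 0, ['']);
    }
    if (count($cols) !== 4) {
        return null; // unexpected layout; let the caller decide what to do
    }
    return [
        'count'       => (int) trim($parts[0]),
        'charge'      => $cols[0],
        'disposition' => $cols[1],
        'grading'     => $cols[2],
        'section'     => $cols[3],
    ];
}
```

Returning null for lines that do not fit keeps the caller in control, much like the original if (preg_match(...)) check.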

Related

Enforce English only on PHP form submission

I would like the contact form on my website to only accept text submitted in English. I've been dealing with a lot of spam recently that has appeared in multiple languages that is slipping right past the CAPTCHA. There is simply no reason for anyone to submit this form in a language other than English since it's not a business and more of a hobby for personal use.
I've been looking through this documentation and was hopeful that something like preg_match( '/[\p{Latin}]/u', $input) might work, but I'm not bilingual and don't understand all the nuances of character encoding, so while this will help filter out something like Russian it still allows languages like Vietnamese to slip through.
Ideally I would like it to accept:
Any Unicode symbol that might be used. I have frequently come across different styles of dashes, apostrophes, or things related to math, for example.
Common diacritical marks / accented characters found in words like "résumé."
And I would like it to reject:
Anything that appears to be something other than English, or uncommon. I'm not overly concerned with accents such as "naïve" or in words borrowed from other languages.
I'm thinking of simply stripping all potentially valid characters as follows:
$input = 'testing for English only!';
// reference: https://en.wikipedia.org/wiki/List_of_Unicode_characters
// allowed punctuation
$basic_latin = '`~!@#$%^&*()-_=+[{]}\\|;:\'",<.>/?';
$input = str_replace(str_split($basic_latin), '', $input);
// allowed symbols and accents
$latin1_supplement = '¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿É×é÷';
$input = str_replace(str_split($latin1_supplement), '', $input);
$unicode_symbols = '–—―‗‘’‚‛“”„†‡•…‰′″‹›‼‾⁄⁊';
$input = str_replace(str_split($unicode_symbols), '', $input);
// remove all spaces including tabs and end lines
$input = preg_replace('/\s+/', '', $input);
// check that remaining characters are alpha-numeric
if (strlen($input) > 0 && ctype_alnum($input)) {
echo 'this is English';
} else {
echo 'no bueno señor';
}
However, I'm afraid there might be some perfectly common and valid exceptions that I'm unwittingly leaving out. I'm hoping that someone might be able to offer a more elegant solution or approach?
There are no native PHP features that would provide language recognition. There's an abandoned Pear package and some classes floating around the cyberspace (I haven't tested). If an external API is fine, Google's Translation API Basic can detect language, 500K free characters per month.
There is however a very simple solution to all this. We don't really need to know what language it is. All we need to know is whether it's reasonably valid English. And not Swahili or Klingon or Russian or Gibberish. Now, there is a convenient PHP extension for this: PSpell.
Here's a sample function you might use:
/**
* Spell Check Stats.
* Returns an array with OK, FAIL spell check counts and their ratio.
* Use the ratio to filter out undesirable (non-English/garbled) content.
*
* @updated 2022-12-29 00:00:29 +07:00
* @author @cmswares
* @ref https://stackoverflow.com/q/74910421/4630325
*
* @param string $text
*
* @return array
*/
function spell_check_stats(string $text): array
{
$stats = [
'ratio' => null,
'ok' => 0,
'fail' => 0
];
// Split into words
$words = preg_split('~[^\w\']+~', $text, -1, PREG_SPLIT_NO_EMPTY);
// New PSpell instance:
$pspeller = pspell_new("en");
// Check spelling and build stats
foreach($words as $word) {
if(pspell_check($pspeller, $word)) {
$stats['ok']++;
} else {
$stats['fail']++;
}
}
// Calculate ratio of OK to FAIL
$stats['ratio'] = match(true) {
$stats['fail'] === 0 => count($words), // no failures: avoid division by zero
$stats['ok'] === 0 => 0,
default => $stats['ok'] / $stats['fail'],
};
return $stats;
}
Source at BitBucket. Function usage:
$stats = spell_check_stats('This starts in English, esto no se quiere, tätä ei haluta.');
// ratio: 0.7142857142857143, ok: 5, fail: 7
Then simply decide the threshold at which a submission is rejected. For example, if 20 words in 100 fail, that is an 80:20 split, i.e. "ratio = 4". The higher the ratio, the more (properly spelled) English the text contains.
The "ok" and "fail" counts are also returned in case you need to calibrate separately for very short strings. Run some tests on existing valid and spam content to see what sorts of figures you get, and then tune your rejection threshold accordingly.
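As a rough sketch of that calibration step, the helper below turns the "ok" and "fail" counts into an accept/reject decision. The name looks_english() and both default thresholds are assumptions for illustration, not part of PSpell; it works on the share of correctly spelled words (80% corresponds to "ratio = 4") so that it never divides by the fail count.

```php
<?php
// Hypothetical helper: accept or reject based on the 'ok' / 'fail' counts.
// Both thresholds are illustrative guesses; tune them on your own data.
function looks_english(array $stats, float $minOkShare = 0.8, int $minWords = 3): bool
{
    $total = $stats['ok'] + $stats['fail'];
    if ($total < $minWords) {
        // Too short to judge reliably. Assumption: reject; you might
        // prefer to send such submissions to manual review instead.
        return false;
    }
    // Share of words that passed the spell check.
    return ($stats['ok'] / $total) >= $minOkShare;
}
```

With the sample above (ok: 5, fail: 7) this returns false, while a 100-word message with 20 failures passes.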
The PSpell extension for PHP may not be installed by default on your server. On CentOS / RedHat, run yum install php-pspell aspell-en to install the PHP module (which pulls in the ASpell dependency) along with an English dictionary. For other platforms, install via your package manager.
For Windows and modern PHP, I can't find the extension DLL or a maintained Aspell port. Please share if you've found a solution; I would like to have this on my dev machine too.

Getting different output for same PHP code

(Can't paste the exact question as the contest is over and I am unable to access the question. Sorry.)
Hello, recently I took part in a programming contest (PHP). I tested the code on my PC and got the desired output but when I checked my code on the contest website and ideone, I got wrong output. This is the 2nd time the same thing has happened. Same PHP code but different output.
It takes input from the command line. The purpose is to count the substrings that contain only the characters 'A','B','C','a','b','c'.
For example: Consider the string 'AaBbCc' as CLI input.
Substrings: A,a,B,b,C,c,Aa,AaB,AaBb,AaBbC,AaBbCc,aB,aBb,aBbC,aBbCc,Bb,BbC,BbCc,bC,bCc,Cc.
Total substrings: 21 which is the correct output.
My machine:
Windows 7 64 Bit
PHP 5.3.13 (Wamp Server)
Following is the code:
<?php
$stdin = fopen('php://stdin', 'r');
while(true) {
$t = fread($stdin,3);
$t = trim($t);
$t = (int)$t;
while($t--) {
$sLen=0;
$subStringsNum=0;
$searchString="";
$searchString = fread($stdin,20);
$sLen=strlen($searchString);
$sLen=strlen(trim($searchString));
for($i=0;$i<$sLen;$i++) {
for($j=$i;$j<$sLen;$j++) {
if(preg_match("/^[A-C]+$/i",substr($searchString,$i,$sLen-$j))) {$subStringsNum++;}
}
}
echo $subStringsNum."\n";
}
die;
}
?>
Input:
2
AaBbCc
XxYyZz
Correct Output (My PC):
21
0
Ideone/Contest Website Output:
20
0
You have to keep in mind that your code is also processing the newline characters.
On Windows systems, a newline is composed of two characters, whose escaped representation is \r\n.
On UNIX systems including Linux (and modern macOS), only \n is used, while classic Mac OS (before OS X) used \r instead.
Since you are reading from standard input, the script is susceptible to those platform differences. Even if the input were a file, opening the handle with the flag "r" instead of "rb" means you are not explicitly asking to read it in binary-safe mode.
You can see in this Ideone.com version of your code how the PHP script there gives the expected output when you enforce the newline characters used by your home system, while in this other version using UNIX newlines it gives the "wrong" output.
I suppose you should be using fgets() to read each string separately instead of fread(), and then trim() them to remove those characters before processing.
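A minimal sketch of that suggestion follows. The function names count_abc_substrings() and solve() are made up for illustration, and the input layout (a count on the first line, then one string per line) is taken from the sample input in the question.

```php
<?php
// Count the substrings made up only of the letters A, B, C (any case).
function count_abc_substrings(string $s): int
{
    $n = strlen($s);
    $count = 0;
    for ($i = 0; $i < $n; $i++) {
        for ($len = 1; $i + $len <= $n; $len++) {
            if (preg_match('/^[A-C]+$/i', substr($s, $i, $len))) {
                $count++;
            }
        }
    }
    return $count;
}

// Read the test cases line by line from an open stream.
// trim() removes "\n" and "\r\n" alike, so the platform no longer matters.
function solve($stream): array
{
    $results = [];
    $t = (int) trim((string) fgets($stream));
    while ($t-- > 0 && ($line = fgets($stream)) !== false) {
        $results[] = count_abc_substrings(trim($line));
    }
    return $results;
}
```

In the contest script this would be called as foreach (solve(STDIN) as $n) { echo $n, "\n"; }. Because each line is read whole and trimmed, the fixed byte counts passed to fread() are no longer needed.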
I tried to analyse this code, and this is what I found:
There seem to be no problems with the input strings. If there were any, it would be impossible to get the result 20.
I don't see any problem with the loops. I usually use pre-increment, but that shouldn't affect the result at all.
I can see only 2 possibilities that would cause the unexpected result:
One of the loop iterations isn't executed. It could only be the last run of the inner loop (when $i == 5 and then $j == 5, because this loop runs just once), which would explain the difference between 21 and 20.
preg_match doesn't match the string in one of the checks (there are 21 calls to preg_match, and one of them, possibly the last one, doesn't match).
If I had to choose, I would go for the 1st possible cause. If I were you, I would contact the contest's author and ask about the PHP version and whether it is possible to test other code. The most important thing here is how many times preg_match() is called at all, 20 or 21 (a simple echo or an extra counter would tell us that), and which strings preg_match() checks. In my opinion, only this way can you find out why this code doesn't work.
It would be nice if you could post an update here when you find out something more.
PS. Of course I also get the result 21, so it's hard to say what could be wrong.

How to parse strings - detailed explanation and information on syntax

I would like to parse a string of data in a shell script with a simple one-line expression, but I do not know how it is done or where to find any information describing it. All the examples I can find just look like illegal math equations, and I cannot find any documentation describing how it works.
First, what exactly is this form of parsing called, so I know what I am talking about and what to search for? Secondly, where can I find what it all means, so I can learn how to use it correctly and not just copy someone else's work with little understanding of how it works?
/\.(\w+)/*.[0-9]/'s/" /"\n/g;s/=/\n/gp
I recall learning about this in perl a couple decades ago, but have long since forgotten what it all means. I have spent days searching for information on what this all means. All I can find are specific examples with no explanations of what it is technically called and how it works!
I want to separate each field then extract the key name and numerical data in a shell script. I realize some forms of parsing are done differently in shell scripts as opposed to php or perl scripts. But I need to learn the parsing syntax used to filter out the specific data sets that I could use in both, shell and php.
Currently I need to parse a single line of data from a file in a shell script for a set of conditionals required by other support scripts.
#!/bin/sh
Line=`cat ./dump.txt`
#Line = "V:12.46 A:3.427 AV:6.08 D:57.32 S:LOAD CT:45.00 P:42.71 AH:2016.80"
# for each field parse data ("/[A-Z]:[0-9]/}" < $Line)
# $val[$1] = $2
# $val["V"] = "12.46"
# $val["AV"] = "6.08"
if $val["V"] < 11.4
then
~/controls/stop.sh
else
~/controls/start.sh
fi
if $val["AV"] > 10.7
then
echo $val["AV"] > ./source.txt
else
echo "DOWN" > ./source.txt
fi
I need to identify and separate the difference between "V:" and "AV:".
In PHP I can use foreach and explode into an array. But I am tired of writing half a page of code for something that can be done in a single line. I need to learn a simpler and more efficient way to parse data from a string and extract it into a usable variable.
$Line = file_get_contents("./dump.txt");
$field = explode (' ' , $Line);
foreach($field as $arg)
{
$val = explode (':' , $arg);
$data[$val[0]] = $val[1];
}
# $data["V"] = "12.46"
# $data["AV"] = "6.08"
A quick shell example would be much appreciated, but I really need to know how to do this myself. Please give me some links or search terms to find the definitions and syntax of these parsing expressions.
Thank you in advance for your help.
The parsing patterns you're talking about are commonly referred to as regular expressions or regex.
For php you can find a lot of helpful information from http://au1.php.net/manual/en/book.pcre.php
Regex is quite hard, especially for complex expressions, so I usually search for an online regex tester, preferably one that highlights what is being matched. JavaScript ones are especially good, as the results are instant and the regex syntax is largely the same as PHP's.
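For the PHP side of the question, a regular expression can replace the explode/foreach loop. This is only a sketch: parse_fields() is a made-up name, and the pattern assumes each field looks like KEY:VALUE with no spaces inside the value, as in the sample line.

```php
<?php
// Hypothetical helper: turn "V:12.46 A:3.427 ..." into ['V' => '12.46', ...].
function parse_fields(string $line): array
{
    // (\w+) captures the key, (\S+) the value up to the next whitespace.
    preg_match_all('/(\w+):(\S+)/', $line, $m);
    // $m[1] holds all keys, $m[2] all values, in matching order.
    return array_combine($m[1], $m[2]);
}
```

Because the whole key is captured before the colon, "V" and "AV" end up as separate array keys; for example, parse_fields($Line)['AV'] gives "6.08" for the sample line in the question.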
Special thanks to James T for leading me in the right direction.
After reading up on regular expressions I have figured out the search pattern I need. Also included is a brief script to test the output. Since Bash arithmetic cannot handle decimal numbers, we need to convert each value to a whole number. The number of decimal places is always fixed at 2 or 3, so conversion is easy: just drop the decimal point. The order in which the fields are recorded also remains constant, so the order in which they are read will remain the same.
The regular expression that fits the search for each of the first 4 fields is:
\w+:([0-9]+)\.([0-9]+)\s
( ) = the items to search/parse; using 2 searches for each data set "V:12.46"
\w = for the word search and the " + " means any 1 or more letters
: = for the delimiter
( -search set 1:
[0-9] = search any numbers and the " + " means any 1 or more digits
) -end search set 1
\. = for the decimal point in the data
( -search set 2:
[0-9] = search any numbers and the " + " means any 1 or more ( second set after the decimal)
) -end search set 2
\s = white space (blank space)
Now duplicate the search 3 times for the first 3 fields, giving me 6 variables.
\w+:([0-9]+)\.([0-9]+)\s\w+:([0-9]+)\.([0-9]+)\s\w+:([0-9]+)\.([0-9]+)\s
And here is a simple script to test the output:
#!/bin/bash
Line="V:13.53 A:7.990 AV:13.65 D:100.00 S:BulkCharge CT:35.00 P:108.11 AH:2116.20"
regex="\w+:([0-9]+)\.([0-9]+)\s\w+:([0-9]+)\.([0-9]+)\s\w+:([0-9]+)\.([0-9]+)\s"
if [[ $Line =~ $regex ]]; then
echo "match found in $Line"
i=1
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo " capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
Volt=${BASH_REMATCH[1]}${BASH_REMATCH[2]}
Amp=${BASH_REMATCH[3]}${BASH_REMATCH[4]}
AVG=${BASH_REMATCH[5]}${BASH_REMATCH[6]}
else
echo "$Line does not match"
fi
if [ $Volt -gt 1200 ]
then
echo "Voltage is $Volt"
fi
resulting with an output of:
match found in V:13.53 A:7.990 AV:13.65 D:100.00 S:BulkCharge CT:35.00 P:108.11 AH:2116.20
capture[1]: 13
capture[2]: 53
capture[3]: 7
capture[4]: 990
capture[5]: 13
capture[6]: 65
Voltage is 1353

Parsing e-mail-like headers (similar to RFC822)

Problem / Question
There is a database of bot information that I would like to parse. It is said to be similar to RFC822 messages.
Before I re-invent the wheel and write a parser of my own, I figured I would see if something else was already available. I stumbled across imap_rfc822_parse_headers(), which seems to do exactly what I want. Unfortunately, the IMAP extension is not available in my environment.
I have seen many alternatives online and on Stack Overflow. Unfortunately, they are all built for e-mail and do more than I need... often times parsing out an entire e-mail and handling headers in special ways. I just want to simply parse those headers into a useful object or array.
Is there a straight PHP version of imap_rfc822_parse_headers() available, or something equivalent that will parse data like this? If not, I will write my own.
Sample Data
robot-id: abcdatos
robot-name: ABCdatos BotLink
robot-from: no
robot-useragent: ABCdatos BotLink/1.0.2 (test links)
robot-language: basic
robot-description: This robot is used to verify availability of the ABCdatos
    directory entries (http://www.abcdatos.com), checking
    HTTP HEAD. Robot runs twice a week. Under HTTP 5xx
    error responses or unable to connect, it repeats
    verification some hours later, verifiying if that was a
    temporary situation.
robot-history: This robot was developed by ABCdatos team to help
    working in the directory maintenance.
robot-environment: commercial
modified-date: Thu, 29 May 2003 01:00:00 GMT
modified-by: ABCdatos

robot-id: acme-spider
robot-name: Acme.Spider
robot-cover-url: http://www.acme.com/java/software/Acme.Spider.html
robot-exclusion: yes
robot-exclusion-useragent: Due to a deficiency in Java it's not currently possible to set the User-Agent.
robot-noindex: no
robot-host: *
robot-language: java
robot-description: A Java utility class for writing your own robots.
robot-history:
robot-environment:
modified-date: Wed, 04 Dec 1996 21:30:11 GMT
modified-by: Jef Poskanzer
...
Assuming that $data contains the sample data you pasted above, here is the parser:
<?php
/*
* $data = <<<'DATA'
* <put-sample-data-here>
* DATA;
*
*/
$parsed = array();
$blocks = preg_split('/\n\n/', $data);
$lines = array();
$matches = array();
foreach ($blocks as $i => $block) {
$parsed[$i] = array();
$lines = preg_split('/\n(([\w.-]+)\: *((.*\n\s+.+)+|(.*(?:\n))|(.*))?)/',
$block, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($lines as $line) {
if(preg_match('/^\n?([\w.-]+)\: *((.*\n\s+.+)+|(.*(?:\n))|(.*))?$/',
$line, $matches)) {
$parsed[$i][$matches[1]] = preg_replace('/\n +/', ' ',
trim($matches[2]));
}
}
}
print_r($parsed);
The message MIME type is pretty common. Plenty of parsers exist, but they are often hard to find with a search engine. Personally I resort to regex here if the format is reasonably consistent.
For example these two will do the trick:
// matches a consecutive RFC821 style key:value list
define("RX_RFC821_BLOCK", b"/(?:^\w[\w.-]*\w:.*\R(?:^[ \t].*\R)*)++\R*/m");
// break up Key: value lines
define("RX_RFC821_SPLIT", b"/^(\w+(?:[-.]?\w+)*)\s*:\s*(.*\n(?:^[ \t].*\n)*)/m");
The first breaks out coherent blocks of message/* lines, and the second can be used to split up each such block. It needs post-processing to strip leading indentation from continued value lines, though.
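If the regex route feels too dense, a line-by-line loop can do the same job and fold continuation lines as it goes. This is a sketch under assumptions, not a full RFC822 implementation: parse_header_blocks() is a made-up name, records are assumed to be separated by blank lines, and continuation lines are assumed to start with a space or tab.

```php
<?php
// Hypothetical helper: parse "key: value" records separated by blank lines.
function parse_header_blocks(string $data): array
{
    $records = [];
    // Normalise line endings, then split records on blank lines.
    $data = str_replace(["\r\n", "\r"], "\n", $data);
    foreach (preg_split('/\n\s*\n/', trim($data)) as $block) {
        $record = [];
        $key = null;
        foreach (explode("\n", $block) as $line) {
            if ($key !== null && preg_match('/^[ \t]+(\S.*)$/', $line, $m)) {
                // Indented line: continuation of the previous value.
                $record[$key] = trim($record[$key] . ' ' . $m[1]);
            } elseif (preg_match('/^([\w.-]+):[ \t]*(.*)$/', $line, $m)) {
                // New "key: value" line (the value may be empty).
                $key = $m[1];
                $record[$key] = rtrim($m[2]);
            }
        }
        if ($record !== []) {
            $records[] = $record;
        }
    }
    return $records;
}
```

Each record comes back as an associative array, so a repeated key within one record would overwrite the earlier value; collect values into arrays instead if your data needs that.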

Sorting Katakana names

If I have a list of Katakana names what is the best way to sort them?
Also, is it more common to sort names by {first name}{last name} or by {last name}{first name}?
Another question: how do we get the Hiragana representation of the first character of a Katakana name, the way the iPhone's contact list does for sorting?
Thanks.
In Japan it is common (if not expected) that a person's first name appear after their surname when written: {last} {first}. But this would also depend on the context. In a less formal context it would be acceptable for a name to appear {first} {last}.
http://en.wikipedia.org/wiki/Japanese_name
Not that it matters, but why would the names of individuals be written in Katakana and not in the traditional Kanji?
I think it's
sort($array,SORT_LOCALE_STRING);
Provide more information if it's not your case
This answer talks about using the system locale to sort Unicode strings in PHP. Besides threading issues, it is also dependent on your vendor having supplied you with a correct locale for what you want to use. I’ve had so much trouble with that particular issue that I’ve given up using vendor locales altogether.
If you’re worried about different pronunciations of Unihan ideographs, then you probably need access to the Unihan database — or its moral equivalent. A smaller subset may suffice.
For example, I know that in Perl, the JIS X 0208 standard is used when the Japanese "ja" locale is selected in the constructor for Unicode::Collate::Locale. This doesn’t depend on the system locale, so you can rely on it.
I’ve also had good luck in Perl with Lingua::JA::Romanize::Japanese, as that’s somewhat friendlier to use than accessing Unicode::Unihan directly.
Back to PHP. This article observes that you can’t get PHP to sort Japanese correctly.
I’ve taken his set of strings and run it through Perl’s sort, and I indeed get a different answer than he gets. If I use the default or English locale, I get in Perl what he gets in PHP. But if I use the Japanese locale for the collation module — which has nothing to do with the system locale and is completely thread-safe — then I get a rather different result. Watch:
JA Sort                          = EN Sort
------------------------------------------------------------
Java                               Java
NVIDIA                             NVIDIA
Windows ファイウォール             Windows ファイウォール
インターネット オプション          インターネット オプション
キーボード                         キーボード
システム                           システム
タスク                             タスク
フォント                           フォント
プログラムの追加と削除             プログラムの追加と削除
マウス                             マウス
メール                             メール
音声認識                         ! 地域と言語オプション
画面                             ! 日付と時刻
管理ツール                       ! 画面
自動更新                         ! 管理ツール
地域と言語オプション             ! 自動更新
電源オプション                     電源オプション
電話とモデムのオプション           電話とモデムのオプション
日付と時刻                       ! 音声認識
I don’t know whether this will help you at all, because I don’t know how to get at the Perl bits from PHP (can you?), but here is the program that generates that. It uses a couple of non-standard modules installed from CPAN to do its business.
#!/usr/bin/env perl
#
# jsort - demo showing how Perl sorts Japanese in a
# different way than PHP does.
#
# Data taken from http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-—-an-unsolved-problem/
#
# Program by Tom Christiansen <tchrist@perl.com>
# Saturday, April 9th, 2011
use utf8;
use 5.10.1;
use strict;
use autodie;
use warnings;
use open qw[ :std :utf8 ];
use Unicode::Collate::Locale;
use Unicode::GCString;
binmode(DATA, ":utf8");
my @data = <DATA>;
chomp @data;
my $ja_sorter = new Unicode::Collate::Locale locale => "ja";
my $en_sorter = new Unicode::Collate::Locale locale => "en";
my @en_data = $en_sorter->sort(@data);
my @ja_data = $ja_sorter->sort(@data);
my $gap = 8;
my $width = 0;
for my $datum (@data) {
my $columns = width($datum);
$width = $columns if $columns > $width;
}
my $bar = "-" x ( 2 + 2 * $width + $gap );
$width = -($width + $gap);
say justify($width => "JA Sort"), "= ", "EN Sort";
say $bar;
for my $i ( 0 .. $#data ) {
my $same = $ja_data[$i] eq $en_data[$i] ? " " : "!";
say justify($width => $ja_data[$i]), $same, " ", $en_data[$i];
}
sub justify {
my($len, $str) = @_;
my $alen = abs($len);
my $cols = width($str);
my $spacing = ($alen > $cols) && " " x ($alen - $cols);
return ($len < 0)
? $str . $spacing
: $spacing . $str
}
sub width {
return 0 unless @_;
my $str = shift();
return 0 unless length $str;
return Unicode::GCString->new($str)->columns;
}
__END__
システム
画面
Windows ファイウォール
インターネット オプション
キーボード
メール
音声認識
管理ツール
自動更新
日付と時刻
タスク
プログラムの追加と削除
フォント
電源オプション
マウス
地域と言語オプション
電話とモデムのオプション
Java
NVIDIA
Hope this helps. It shows that it is, at least theoretically, possible.
EDIT
This answer from How can I use Perl libraries from PHP? references this PHP package to do that for you. So if you don’t find a PHP library with the needed Japanese sorting stuff, you should be able to use the Perl module. The only one you need is Unicode::Collate::Locale. It comes standard as of release 5.14 (really 5.13.4, but that’s a devel version), but you can always install it from CPAN if you have an earlier version of Perl.
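For the narrower case in the question (names written purely in Katakana), PHP's own intl extension may already be enough, since its Collator class uses ICU collation rules rather than the system locale. The sketch below assumes the intl and mbstring extensions are installed; the function names sort_japanese() and index_kana() are made up for illustration. It does not address the Kanji-reading problem shown in the table above.

```php
<?php
// Sort names with ICU's Japanese collation rules (requires ext-intl).
function sort_japanese(array $names): array
{
    $collator = new Collator('ja_JP');
    $collator->sort($names); // sorts in place and reindexes the array
    return $names;
}

// Hiragana form of the first character, e.g. for a contact-list index
// (requires ext-mbstring). Mode 'c' maps full-width Katakana to Hiragana.
function index_kana(string $name): string
{
    return mb_convert_kana(mb_substr($name, 0, 1, 'UTF-8'), 'c', 'UTF-8');
}
```

Grouping names by index_kana() and sorting each group with sort_japanese() is one way to approximate a contact-list layout.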
