Problem with regular expression for some special pattern - php

I ran into a problem when I tried to find some characters with the following code:
$str = "统计类型目前分为0日Q统计,月统q计及287年7统1计三7种,如需63自定义时间段,点1击此hell处进入自o定w义统or计d!页面。其他统计:客服工作量统计 | 本周服务统计EXCEL";
preg_match_all('/[\w\uFF10-\uFF19\uFF21-\uFF3A\uFF41-\uFF5A]/',$str,$match); //line 5
print_r($match);
And I got error as below:
Warning: preg_match_all() [function.preg-match-all]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 4 in E:\mycake\app\webroot\re.php on line 5
I'm not very familiar with regular expressions and have no idea what this error means. How can I fix this? Thanks.

The problem is that the PCRE regular expression engine does not understand the \uXXXX syntax for denoting characters by their Unicode code points. Instead, PCRE uses the \x{XXXX} syntax combined with the u modifier:
preg_match_all('/[\w\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u',$str,$match);
print_r($match);
See my answer here for some more information.
EDIT:
$str = "统计类型目前分为0日Q统计,月统q计及287年7统1计三7种,如需63自定义时间段,点1击此hell处进入自o定w义统or计d!页面。其他统计:客服工作量统计 | 本周服务统计EXCEL";
preg_match_all('/[\w\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u',$str,$match);
// ^-- arrow: the u modifier goes after the closing delimiter, as at the end of the pattern above
print_r($match);
/* Array
(
[0] => Array
(
[0] => 0
[1] => Q
[2] => q
[3] => 2
[4] => 8
[5] => 7
[6] => 7
[7] => 1
[8] => 7
[9] => 6
[10] => 3
[11] => 1
[12] => h
[13] => e
[14] => l
[15] => l
[16] => o
[17] => w
[18] => o
[19] => r
[20] => d
[21] => E
[22] => X
[23] => C
[24] => E
[25] => L
)
) */
Are you sure that you used the u modifier (see the arrow above)? If so, check whether your PHP supports the u modifier at all (it requires PHP 4.1.0 or later on Unix and 4.2.3 or later on Windows).
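A side note, offered as an assumption about newer builds rather than something the answer states: where PCRE has Unicode property support, the u modifier can also make \w match letters from any script, in which case the Chinese characters would be matched too. A minimal sketch that sidesteps this by spelling out the ASCII ranges explicitly (the sample string is made up for illustration, and the \u{...} string escapes need PHP 7 or later):

```php
<?php
// Hypothetical sample: CJK text mixed with ASCII and full-width characters.
// "\u{FF11}" is FULLWIDTH DIGIT ONE, "\u{FF21}" is FULLWIDTH LATIN CAPITAL LETTER A.
$str = "统计0日Q统计,月q计\u{FF11}\u{FF21}!";

// ASCII word characters are listed explicitly instead of relying on \w,
// so the result does not depend on how \w behaves under the u modifier.
$pattern = '/[0-9A-Za-z_\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u';

preg_match_all($pattern, $str, $match);
print_r($match[0]);
```

If the output of the original pattern suddenly contains every Chinese character, this explicit form is one way to get back to ASCII plus full-width matches only.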

Related

Reg Exp - preg_match_all reduce array result

This is my Reg Exp "[c]?[\d+|\D+]\s*". My input is this "c7=c4/c5*100" and the result is :
Array
(
[0] => Array
(
[0] => c7
[1] => =
[2] => c5
[3] => +
[4] => c3
[5] => *
[6] => 1
[7] => 0
[8] => 0
)
)
But what I want is:
Array
(
[0] => Array
(
[0] => c7
[1] => =
[2] => c5
[3] => +
[4] => c3
[5] => *
[6] => 100
)
)
I can't seem to get the last part working, I'm lost as what to do next - Can anyone help?
Thanks,
Paul
You specified a character class, [\d+|\D+], which matches exactly one of the listed characters: any digit, any non-digit, + or |. Inside a class, + and | are literals, so the class consumes a single character at a time and 100 is split into 1, 0, 0. I think you meant alternation inside a group, c?(?:\d+|\D+)\s*, but that does not work either: \D+ also matches the letter c, so the = sign is matched together with the c that follows it, giving =c and /c as matches.
Instead, match an optional c (c?) followed by one or more digits (\d+), or (|) a single non-digit (\D):
c?\d+|\D
$re = '/c?\d+|\D/m';
$str = 'c7=c4/c5*100';
preg_match_all($re, $str, $matches);
print_r($matches);
That will result in:
Array
(
[0] => Array
(
[0] => c7
[1] => =
[2] => c4
[3] => /
[4] => c5
[5] => *
[6] => 100
)
)
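To see the difference between the three patterns discussed in this answer side by side, here is a small sketch. The variable names are only illustrative.

```php
<?php
$input = 'c7=c4/c5*100';

// Original pattern: the character class matches one character at a time.
preg_match_all('/c?[\d+|\D+]\s*/', $input, $classMatches);

// Group with \D+: the non-digit run swallows the following "c".
preg_match_all('/c?(?:\d+|\D+)\s*/', $input, $groupMatches);

// Suggested pattern: optional c plus digits, or a single non-digit.
preg_match_all('/c?\d+|\D/', $input, $fixedMatches);

echo implode(' ', $classMatches[0]), "\n";
echo implode(' ', $groupMatches[0]), "\n";
echo implode(' ', $fixedMatches[0]), "\n";
```

Only the last pattern keeps c4 and c5 intact while also returning 100 as one element.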

Splitting string into sections while maintaining all non-word characters

I'm working on an encryption function just for fun (for a non-production environment). Currently running my encrypt function like this:
encrypt("This is a string.");
Produces the following string:
GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.
This is perfect, exactly what I wanted and expected - however, now I'm trying to write a decrypt function. Every character that is encrypted will have a single capital letter followed by 3 non-capital letters (As you can see from the example above).
My plan was to run preg_split() to get the different letters of the string.
Here is my current PHP code (pattern ([A-Z][a-z]{3})):
print_r(preg_split("/([A-Z][a-z]{3})/", $string));
There are a couple of problems with this. While testing, I discovered that it is not returning what I expected, the return is:
Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] => .
)
(Via eval.in)
So this has the proper amount of returns, but they are all blank. Why are all the values blank?
Another thing that I thought of was that I needed to include other characters such as spaces, commas, periods etc in the preg_split() return. In the return I got from eval.in, it appears as though the final period has been included. Is this true for spaces and other characters as well, or do I need to do something special in cases of these characters?
It's "splitting" on those matches so they are removed. You want preg_match_all or use PREG_SPLIT_DELIM_CAPTURE with PREG_SPLIT_NO_EMPTY.
// -1 means "no limit"; newer PHP versions deprecate passing null here
print_r(preg_split("/([A-Z][a-z]{3})/",
$string,
-1,
PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY));
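As a quick check of what those two flags do, here is a sketch on a shortened input. The sample string is made up; it only follows the "one capital, three lowercase letters" scheme from the question.

```php
<?php
// Hypothetical short input: two encrypted letters, a space, one more letter, a period.
$string = "GnulHynk Afds.";

$parts = preg_split(
    '/([A-Z][a-z]{3})/',
    $string,
    -1, // no limit
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// DELIM_CAPTURE returns the captured four-letter chunks as elements,
// NO_EMPTY drops the empty strings between adjacent chunks,
// and the space and the period survive as the text between matches.
print_r($parts);
```

So spaces and punctuation need no special handling with this approach: anything that is not a four-letter chunk is returned as its own element.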
You should remove the capturing group () and use preg_match_all instead.
$text = "GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.";
preg_match_all("/[A-Z][a-z]{3}|(?: |,|\.)/", $text, $match);
print_r($match);
Output:
Array
(
[0] => Array
(
[0] => Gnul
[1] => Hynk
[2] => Afds
[3] => Gknp
[4] =>
[5] => Afds
[6] => Gknp
[7] =>
[8] => Wgbf
[9] =>
[10] => Gknp
[11] => Lnug
[12] => Buip
[13] => Afds
[14] => Cbhg
[15] => Byfg
[16] => .
)
)

PHP Array Keys to variables

I want a value of my array to be displayed when I use a variable whose name equals, or nearly equals, its key.
So for example, if I have the following array entry: [1] => g
I want to display 'g' when I use the variable $1 (or, even better, the variable $arr1, so it does not interfere with other things later on).
Here is my code (I'm uploading a simple .txt file with some letters and building an array of the individual characters):
$linearray = array();
$workingarray = array();
while(! feof($file)) {
$line = fgets($file);
$line = str_split(trim("$line"));
$linearray = array_merge($linearray, $line);
}
$workingarray[] = $linearray;
print_r($workingarray);
When I have done this I will get this outcome;
Array ( [0] => Array ( [0] => g [1] => g [2] => h [3] => o [4] => n
[5] => d [6] => x [7] => s [8] => v [9] => i [10] => s [11] => h [12]
=> f [13] => g [14] => f [15] => h [16] => m [17] => a [18] => g [19] => i [20] => e [21] => d [22] => h [23] => v [24] => b [25] => v [26] => m [27] => d [28] => o [29] => m [30] => v [31] => b [32] => ) )
I tried using the following to make it work:
extract($workingarray);
echo "$1";
But that sadly doesn't work. I just receive this:
$1
And I want to receive this:
g
It would be even better if I got the same effect with, for example, echo "$arr1" to receive g, echo "$arr2" to receive h, and so on.
This is simply impossible: http://php.net/manual/en/language.variables.basics.php
Variable names cannot start with a digit. The only characters allowed at the start of a variable name are letters and the underscore.
And don't use extract or similar constructs. All they do is litter your variable namespace with unpredictable/unknown junk - you could very easily overwrite some OTHER critical variable with this useless junk, making for very difficult/impossible bugs to diagnose.
You're not saving any time by making up these new variables.
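A minimal sketch of the alternative, assuming the characters are already in a flat array as in the question: read them by index, or, if names like arr1 are really wanted, keep them as string keys of a second array instead of separate variables. The sample data and the $named array are illustrative only.

```php
<?php
// Stands in for the characters read from the uploaded file.
$linearray = str_split('gghond');

// Direct access by index, no extra variables needed.
echo $linearray[1], "\n";

// Named keys live inside one array, so nothing else can be overwritten.
$named = array();
foreach ($linearray as $index => $char) {
    $named['arr' . $index] = $char;
}

echo $named['arr1'], "\n";
echo $named['arr2'], "\n";
```

This keeps every character reachable under a predictable name without touching the variable namespace.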

Problems with str_split

I'm new here, and I have a question. I'm writing some code that I'll need soon, but something has left me puzzled. I'm splitting a string into characters, but special characters that were converted to HTML entities get split apart as well. I'd like each of them to stay together as one element so that I can assign a color to each character. Is there any way to do this?
Code:
<?php
$text = "My nickname is: п€Яd Øwп€d"; #this would be the result of what i received via post
print_r(str_split($text));
?>
Result:
Array
(
[0] => M
[1] => y
[2] =>
[3] => n
[4] => i
[5] => c
[6] => k
[7] => n
[8] => a
[9] => m
[10] => e
[11] =>
[12] => i
[13] => s
[14] => :
[15] =>
[16] => &
[17] => #
[18] => 1
[19] => 0
[20] => 8
[21] => 7
[22] => ;
[23] => &
[24] => e
[25] => u
[26] => r
[27] => o
[28] => ;
[...]
)
I'd like to return this:
Array ( [0] => M
[1] => y
[2] =>
[3] => n
[4] => i
[5] => c
[6] => k
[7] => n
[8] => a
[9] => m
[10] => e
[11] =>
[12] => i
[13] => s
[14] => :
[15] =>
[16] => п
[17] => €
[...]
)
Thank you for the help.
[UPDATED]
I tested the functions that people suggested. Most of them assume UTF-8, but my default charset is ISO-8859-1. One more thing I forgot to mention: if I edit the phrase "My nickname is:" and add a bare &, for example "My nickname is & personal name", the result is wrong. I'd appreciate any further help.
You can try to write your own str_split, which could look like the following
function str_split_encodedTogether($text) {
$result = array();
$length = strlen($text);
$tmp = "";
for ($charAt=0; $charAt < $length; $charAt++) {
if ($text[ $charAt ] == '&') {//beginning of special char
$tmp = '&';
} elseif ($text[ $charAt ] == ';') {//end of special char
array_push($result, $tmp.';');
$tmp = "";
} elseif (!empty($tmp)) {//in midst of special char
$tmp .= $text[ $charAt ];
} else {//regular char
array_push($result, $text[ $charAt ]);
}
}
return $result;
}
Basically, it checks whether the current character is an &; if so, it collects all following characters (including the ampersand) in $tmp until it reaches a ;. This gives you the wanted result, but it fails whenever there is an & that doesn't belong to an encoded character.
Use preg_split():
<?php
$text = "My nickname is: п€Яd Øwп€d"; #this would be the result of what i received via post
print_r(preg_split('/(\&(?=[^;]*\s))|(\&[^;]*;)|/', $text, -1, PREG_SPLIT_DELIM_CAPTURE + PREG_SPLIT_NO_EMPTY));
?>
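Another way to get the same effect is to match instead of split: take either a complete entity or any single character. This is only a sketch; the function name and the sample string are made up, and the entity syntax it accepts (decimal, hexadecimal, or named) is an assumption about what the POST data contains. Without the u modifier the pattern works byte by byte, which suits a single-byte charset such as ISO-8859-1.

```php
<?php
// Hypothetical helper: split a string into characters, keeping HTML entities whole.
function split_keeping_entities($text)
{
    // First alternative: a well-formed entity such as &#1087; &#x41; or &euro;
    // Second alternative: any single byte, including a bare & that starts no entity.
    preg_match_all('/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);|./is', $text, $matches);
    return $matches[0];
}

$parts = split_keeping_entities('is & me: &#1087;&euro;d');
print_r($parts);
```

Because a bare & simply falls through to the single-character alternative, a phrase like "My nickname is & personal name" is split into ordinary characters.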

PHP regular expressions work differently on different servers

I am using regex to get URLs from a webpage.
On localhost (PHP 5.3.15 with Suhosin-Patch (cli) (built: Aug 24 2012 17:45:44)) code:
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$pattern = "/<a href=\"([^\"]*.pdf)\">(.*)<\/a>/iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
gives:
=> Array
(
[0] => Sem_IuE_E1a.pdf
[1] => Sem_IuE_E2a.pdf
[2] => Sem_IuE_E3a.pdf
[3] => Sem_IuE_E4a.pdf
[4] => Sem_IuE_E6AT.pdf
[5] => Sem_IuE_E7.pdf
[6] => Sem_IuE_E1b.pdf
[7] => Sem_IuE_E2b.pdf
[8] => Sem_IuE_E3b.pdf
[9] => Sem_IuE_E4b.pdf
[10] => Sem_IuE_E6II.pdf
[11] => Sem_IuE_E6KT.pdf
[12] => Sem_IuE_BMT1.pdf
[13] => Laborplan%20BMT1%20KoP%201.pdf
[14] => Sem_IuE_BMT2.pdf
[15] => Sem_IuE_BMT3.pdf
[16] => Sem_IuE_BMT4.pdf
[17] => Sem_IuE_BMT5.pdf
[18] => Sem_IuE_BMT6.pdf
[19] => Sem_IuE_IE2.pdf
[20] => Sem_IuE_IE4.pdf
[21] => Sem_IuE_IE6.pdf
[22] => Sem_IuE_AM.pdf
[23] => Sem_IuE_IKM1.pdf
[24] => Legende_Stud.pdf
[25] => Kalender.pdf
[26] => Doz.pdf
[27] => Doz.pdf
)
while, on the remote server (PHP 5.3.3 (cli) (built: Feb 22 2013 02:51:11)) the same code gives:
=> Array
(
[0] => Sem_IuE_E2a.pdf
[1] => Sem_IuE_E7.pdf
[2] => Sem_IuE_E1b.pdf
[3] => Sem_IuE_E2b.pdf
[4] => Sem_IuE_E3b.pdf
[5] => Sem_IuE_E6II.pdf
[6] => Sem_IuE_E6KT.pdf
[7] => Sem_IuE_BMT1.pdf
[8] => Laborplan%20BMT1%20KoP%201.pdf
[9] => Sem_IuE_BMT2.pdf
[10] => Sem_IuE_BMT3.pdf
[11] => Sem_IuE_BMT4.pdf
[12] => Sem_IuE_BMT5.pdf
[13] => Sem_IuE_BMT6.pdf
[14] => Sem_IuE_IE2.pdf
[15] => Sem_IuE_IE4.pdf
[16] => Sem_IuE_IE6.pdf
[17] => Sem_IuE_AM.pdf
[18] => Doz.pdf
[19] => Doz.pdf
)
What is the problem?
I have no precise answer. But in your question you mention that you have different results by using PHP 5.3.3 and PHP 5.3.15.
I took a look at PHP5 ChangeLog, where the answer probably lies, and saw the following possible explanations.
PHP 5.3.6:
Upgraded bundled PCRE to version 8.11. (Ilia)
PHP 5.3.7
Upgraded bundled PCRE to version 8.12. (Scott)
I read the release notes for both PCRE versions, and I am not sure what could affect matching in your case, except for a few corrections mentioning UTF8 encoding.
But, while looking at U modifier I noticed in PCRE Configuration Options that:
PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
My guess is that some fix in the U (PCRE_UNGREEDY) modifier changed the way the part between the <a> tags is matched. This makes sense because, looking at the source of the page you are scraping, the only links that match in the earlier PHP version are the <a> tags that don't contain inner HTML.
Example, this one matches:
E2a
This one doesn't:
<span lang=IT style='mso-ansi-language:IT'>E4a</span>
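One way to test the backtrack-limit part of this guess is to run something like the following on both servers and compare the output. It is a sketch, not a diagnosis: the inline HTML is a made-up stand-in for the scraped page, and the pattern is the one from the question with the dot before pdf escaped.

```php
<?php
// Show the configured limit so the two servers can be compared.
$limit = ini_get('pcre.backtrack_limit');
echo "pcre.backtrack_limit = $limit\n";

// Hypothetical stand-in for the downloaded page.
$html = '<a href="Sem_IuE_E2a.pdf">E2a</a>';

$result = preg_match_all('/<a href="([^"]*\.pdf)">(.*)<\/a>/iU', $html, $matches);

// preg_last_error() reports whether the engine gave up, for example
// with PREG_BACKTRACK_LIMIT_ERROR, instead of finishing the search.
if ($result === false || preg_last_error() !== PREG_NO_ERROR) {
    echo "PCRE error code: ", preg_last_error(), "\n";
} else {
    echo "Matches found: $result\n";
}
```

If the remote server reports an error code on the real page while localhost does not, the limit is a more likely culprit than a change in how U is handled.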
Very interesting, but how to fix it?
I don't have access to an earlier PHP version so I cannot test it, but I would say remove the greedy part of your regular expression, because you don't need to match the part inside the <a></a> tags, since the value is already contained in the PDF filename:
$pattern = "/<a href=\"([^\"]*\.pdf)\">/i";
Or
Use a DOM Parser.
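A minimal sketch of the DOM approach using PHP's bundled DOMDocument class. The inline HTML is a made-up stand-in for the page (in practice it would come from file_get_contents), and the $pdfLinks array is just an illustrative name.

```php
<?php
// Hypothetical stand-in for the downloaded page: one plain link,
// one link with inner HTML, and one link that is not a PDF.
$html = '<p><a href="Sem_IuE_E2a.pdf">E2a</a> '
      . '<a href="Sem_IuE_E4a.pdf"><span lang="IT">E4a</span></a> '
      . '<a href="index.html">Home</a></p>';

// Collect parser warnings about messy markup instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$pdfLinks = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if (preg_match('/\.pdf$/i', $href)) {
        // textContent ignores nested tags such as <span>.
        $pdfLinks[$href] = trim($anchor->textContent);
    }
}

print_r($pdfLinks);
```

Since the parser deals with nested tags itself, links with inner HTML are found the same way as plain ones, independent of the PCRE version.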
I've come up with a work-around. If you open the page, strip the tags, then parse you should get more consistent answers. Code from Microsoft apps (target page) is horrible.
<?php
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$file = strip_tags($file,'<a>');
$pattern = "!\<a href=[\"|']([^.]+\.pdf)[\"|']\>([^\<]+)\<\/a\>!iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
?>
