PHP regular expressions work differently on different servers - php

I am using a regex to get URLs from a webpage.
On localhost (PHP 5.3.15 with Suhosin-Patch (cli) (built: Aug 24 2012 17:45:44)), this code:
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$pattern = "/<a href=\"([^\"]*.pdf)\">(.*)<\/a>/iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
gives:
=> Array
(
[0] => Sem_IuE_E1a.pdf
[1] => Sem_IuE_E2a.pdf
[2] => Sem_IuE_E3a.pdf
[3] => Sem_IuE_E4a.pdf
[4] => Sem_IuE_E6AT.pdf
[5] => Sem_IuE_E7.pdf
[6] => Sem_IuE_E1b.pdf
[7] => Sem_IuE_E2b.pdf
[8] => Sem_IuE_E3b.pdf
[9] => Sem_IuE_E4b.pdf
[10] => Sem_IuE_E6II.pdf
[11] => Sem_IuE_E6KT.pdf
[12] => Sem_IuE_BMT1.pdf
[13] => Laborplan%20BMT1%20KoP%201.pdf
[14] => Sem_IuE_BMT2.pdf
[15] => Sem_IuE_BMT3.pdf
[16] => Sem_IuE_BMT4.pdf
[17] => Sem_IuE_BMT5.pdf
[18] => Sem_IuE_BMT6.pdf
[19] => Sem_IuE_IE2.pdf
[20] => Sem_IuE_IE4.pdf
[21] => Sem_IuE_IE6.pdf
[22] => Sem_IuE_AM.pdf
[23] => Sem_IuE_IKM1.pdf
[24] => Legende_Stud.pdf
[25] => Kalender.pdf
[26] => Doz.pdf
[27] => Doz.pdf
)
while, on the remote server (PHP 5.3.3 (cli) (built: Feb 22 2013 02:51:11)) the same code gives:
=> Array
(
[0] => Sem_IuE_E2a.pdf
[1] => Sem_IuE_E7.pdf
[2] => Sem_IuE_E1b.pdf
[3] => Sem_IuE_E2b.pdf
[4] => Sem_IuE_E3b.pdf
[5] => Sem_IuE_E6II.pdf
[6] => Sem_IuE_E6KT.pdf
[7] => Sem_IuE_BMT1.pdf
[8] => Laborplan%20BMT1%20KoP%201.pdf
[9] => Sem_IuE_BMT2.pdf
[10] => Sem_IuE_BMT3.pdf
[11] => Sem_IuE_BMT4.pdf
[12] => Sem_IuE_BMT5.pdf
[13] => Sem_IuE_BMT6.pdf
[14] => Sem_IuE_IE2.pdf
[15] => Sem_IuE_IE4.pdf
[16] => Sem_IuE_IE6.pdf
[17] => Sem_IuE_AM.pdf
[18] => Doz.pdf
[19] => Doz.pdf
)
What is the problem?

I don't have a precise answer, but your question mentions that the results differ between PHP 5.3.3 and PHP 5.3.15.
I took a look at the PHP 5 ChangeLog, where the answer probably lies, and found the following possible explanations.
PHP 5.3.6:
Upgraded bundled PCRE to version 8.11. (Ilia)
PHP 5.3.7
Upgraded bundled PCRE to version 8.12. (Scott)
I read the release notes for both PCRE versions, and I am not sure what could affect matching in your case, except for a few corrections mentioning UTF-8 encoding.
But while looking at the U modifier, I noticed this in the PCRE Configuration Options, under pcre.backtrack_limit:
PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
My guess is that some fix in the U (PCRE_UNGREEDY) modifier changed the way the part between the <a> tags is matched. This makes sense because, looking at the source of the page you are scraping, the only links that match in the earlier PHP version are the <a> tags that don't contain inner HTML.
Example, this one matches:
E2a
This one doesn't:
<span lang=IT style='mso-ansi-language:IT'>E4a</span>
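One way to check on the remote server whether PCRE's backtracking limit is involved is to inspect preg_last_error() right after the call. The sketch below is only a diagnostic aid, not a confirmed explanation of the difference; the helper name matchAllWithDiagnostics and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: runs preg_match_all and reports PCRE's error state alongside the matches.
function matchAllWithDiagnostics($pattern, $subject)
{
    $count = preg_match_all($pattern, $subject, $matches);
    return array(
        'count'           => $count,                          // number of matches, or false on error
        'error'           => preg_last_error(),               // PREG_NO_ERROR when nothing went wrong
        'backtrack_limit' => ini_get('pcre.backtrack_limit'), // current limit on this server
        'matches'         => $matches,
    );
}

// Made-up sample link in the style of the scraped page.
$html = '<a href="Sem_IuE_E2a.pdf">E2a</a>';
$report = matchAllWithDiagnostics('/<a href="([^"]*\.pdf)">(.*)<\/a>/iU', $html);

// True only if the backtracking limit was hit during matching.
var_dump($report['error'] === PREG_BACKTRACK_LIMIT_ERROR);
```

Running the same check against the real page on both servers would show whether the two installations report different error codes or limits.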
Very interesting, but how to fix it?
I don't have access to an earlier PHP version so I cannot test it, but I would remove the (.*) part of your regular expression (and with it the U modifier), because you don't need to match the content inside the <a></a> tags: the value is already contained in the PDF filename. Escaping the dot also ensures that only a literal .pdf extension matches:
$pattern = "/<a href=\"([^\"]*\\.pdf)\">/i";
Or
Use a DOM Parser.
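As a rough sketch of the DOM parser route, the snippet below uses PHP's bundled DOMDocument class to collect the href of every link ending in .pdf. The function name extractPdfLinks and the sample markup are made up for illustration, and the real page may need extra handling (relative URLs, character encoding).

```php
<?php
// Hypothetical helper: returns the href of every <a> whose target ends in ".pdf".
function extractPdfLinks($html)
{
    $previous = libxml_use_internal_errors(true); // collect warnings about sloppy markup instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('/\.pdf$/i', $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample markup, including a link that wraps its text in inner HTML.
$sample = '<html><body><p><a href="Sem_IuE_E2a.pdf">E2a</a> '
        . '<a href="Sem_IuE_E4a.pdf"><span lang="IT">E4a</span></a> '
        . '<a href="index.html">Home</a></p></body></html>';
$pdfLinks = extractPdfLinks($sample);
print_r($pdfLinks);
```

Because the parser works on the element tree, links with nested markup inside the <a> tag are handled the same way as plain ones.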

I've come up with a workaround. If you fetch the page, strip all tags except <a>, and then parse, you should get more consistent results. The HTML generated by Microsoft apps (the target page) is horrible.
<?php
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$file = strip_tags($file,'<a>');
$pattern = "!\<a href=[\"']([^.]+\.pdf)[\"']\>([^\<]+)\<\/a\>!iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
?>

Related

Splitting string into sections while maintaining all non-word characters

I'm working on an encryption function just for fun (for a non-production environment). Currently running my encrypt function like this:
encrypt("This is a string.");
Produces the following string:
GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.
This is perfect, exactly what I wanted and expected - however, now I'm trying to write a decrypt function. Every character that is encrypted will have a single capital letter followed by 3 non-capital letters (As you can see from the example above).
My plan was to run preg_split() to get the different letters of the string.
Here is my current PHP code (pattern ([A-Z][a-z]{3})):
print_r(preg_split("/([A-Z][a-z]{3})/", $string));
There are a couple of problems with this. While testing, I discovered that it is not returning what I expected, the return is:
Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] => .
)
(Via eval.in)
So this has the proper amount of returns, but they are all blank. Why are all the values blank?
Another thing that I thought of was that I needed to include other characters such as spaces, commas, periods etc in the preg_split() return. In the return I got from eval.in, it appears as though the final period has been included. Is this true for spaces and other characters as well, or do I need to do something special in cases of these characters?
It's "splitting" on those matches so they are removed. You want preg_match_all or use PREG_SPLIT_DELIM_CAPTURE with PREG_SPLIT_NO_EMPTY.
print_r(preg_split("/([A-Z][a-z]{3})/",
$string,
-1,
PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY));
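To see what the two flags do, here is a small sketch on a shortened sample string (the sample is made up, but follows the same format as the question's output). PREG_SPLIT_DELIM_CAPTURE keeps the captured four-letter groups in the result, and PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent groups.

```php
<?php
// Shortened, made-up sample in the same format as the question's encrypted output.
$string = "GnulHynk Afds.";

$parts = preg_split(
    "/([A-Z][a-z]{3})/",
    $string,
    -1, // no limit on the number of pieces
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($parts); // Gnul, Hynk, a single space, Afds, and the final period
```

Spaces and punctuation survive as their own elements, so a decrypt function can pass them through unchanged.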
You should remove capturing group () and use preg_match_all.
$text = "GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.";
preg_match_all("/[A-Z][a-z]{3}|(?: |,|\.)/", $text, $match);
print_r($match);
Output:
Array
(
[0] => Array
(
[0] => Gnul
[1] => Hynk
[2] => Afds
[3] => Gknp
[4] =>
[5] => Afds
[6] => Gknp
[7] =>
[8] => Wgbf
[9] =>
[10] => Gknp
[11] => Lnug
[12] => Buip
[13] => Afds
[14] => Cbhg
[15] => Byfg
[16] => .
)
)

PHP double sort array based on substring

I am building a custom switch manager for work, my current issue is more an aesthetic one but I think it is a good learning experience. I have posted the array below for clarity:
Array
(
[1] => FastEthernet0/1
[10] => FastEthernet0/10
[11] => FastEthernet0/11
[12] => FastEthernet0/12
[13] => FastEthernet0/13
[14] => FastEthernet0/14
[15] => FastEthernet0/15
[16] => FastEthernet0/16
[17] => FastEthernet0/17
[18] => FastEthernet0/18
[19] => FastEthernet0/19
[2] => FastEthernet0/2
[20] => FastEthernet0/20
[21] => FastEthernet0/21
[22] => FastEthernet0/22
[23] => FastEthernet0/23
[24] => FastEthernet0/24
[3] => FastEthernet0/3
[4] => FastEthernet0/4
[5] => FastEthernet0/5
[6] => FastEthernet0/6
[7] => FastEthernet0/7
[8] => FastEthernet0/8
[9] => FastEthernet0/9
[25] => Null0
)
On our bigger switches I am using asort($arr); to get GigabitEthernet1/1 to come before 2/1, etc...
My goal is to sort on the interface number (part after '/') so that 1/8 comes before 1/10.
Could someone point me in the right direction, I want to work for the results but I am not familiar enough with PHP to know exactly where to go.
Notes: On our larger multi-module switches the IDs are not in order, so a sort on $arr[key] won't work.
You can pass sort flags to asort(), like below.
asort($arr, SORT_NATURAL | SORT_FLAG_CASE);
print_r($arr);
It will sort and print the data as you need.
The SORT_NATURAL and SORT_FLAG_CASE flags require PHP 5.4+.
If you're using an older version of PHP, you could do it with uasort and a custom comparison callback function.
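$interfaces = array(...);
// Compare by the part before '/', then numerically by the part after it.
$if_cmp = function ($a, $b) {
// explode() replaces split(), which was removed in PHP 7.
// array_pad() covers names without a '/', such as Null0.
list($amaj, $amin) = array_pad(explode('/', $a), 2, 0);
list($bmaj, $bmin) = array_pad(explode('/', $b), 2, 0);
$maj = strcmp($amaj, $bmaj);
if ($maj != 0) return $maj;
// Assuming the right side is an int
return (int)$amin - (int)$bmin;
};
uasort($interfaces, $if_cmp);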
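Another option that should also work on older PHP versions is to hand the built-in strnatcasecmp function to uasort, which gives a natural-order comparison while keeping the array keys. This is a minimal sketch on a made-up subset of the interface list from the question, not a replacement for the callback above.

```php
<?php
// Made-up subset of the interface list from the question, deliberately out of order.
$ports = array(
    1  => 'FastEthernet0/1',
    10 => 'FastEthernet0/10',
    2  => 'FastEthernet0/2',
    25 => 'Null0',
);

// strnatcasecmp compares digit runs as numbers, so 0/2 sorts before 0/10.
uasort($ports, 'strnatcasecmp');

print_r($ports);
```

Since uasort preserves keys, the interface IDs stay attached to their names after sorting.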

PHP - Convert hex character to HTML entity

I'm using imagettftext to create images of certain text characters. To do this, I need to convert an array of hexadecimal character codes to their HTML equivalents, and I can't seem to find any functionality built into PHP to do this. What am I missing? (Through searching, I came across this: PHP function imagettftext() and unicode, but none of the answers seem to do what I need: some characters convert but most don't.)
Here's the resulting HTML representation (in the browser)
[characters] => Array
(
[33] => A
[34] => B
[35] => C
[36] => D
[37] => E
[38] => F
[39] => G
[40] => H
[41] => I
[42] => J
[43] => K
[44] => L
)
Which comes from this array (not capable of rendering in imagettftext):
[characters] => Array
(
[33] => &#x41
[34] => &#x42
[35] => &#x43
[36] => &#x44
[37] => &#x45
[38] => &#x46
[39] => &#x47
[40] => &#x48
[41] => &#x49
[42] => &#x4a
[43] => &#x4b
[44] => &#x4c
)
Based on a sample from the PHP manual, you could do this with a regex:
$newText = preg_replace('/&#x([a-f0-9]+)/mei', 'chr(0x\\1)', $oldText);
I'm not sure a raw html_entity_decode() would work in your case, as your array elements are missing the trailing ; -- a necessary part of these entities.
EDIT, July 2015:
In response to Ben's comment noting the /e modifier being deprecated, here's how to write this using preg_replace_callback() and an anonymous function:
$newText = preg_replace_callback(
'/&#x([a-f0-9]+)/mi',
function ($m) {
return chr(hexdec($m[1]));
},
$oldText
);
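Note that chr() returns a single byte, so the replacement above only covers code points up to 0xFF. A possible alternative, sketched below under the assumption that UTF-8 output is wanted, is to append the missing semicolon and let html_entity_decode() do the conversion. The helper name decodeHexEntities and the extra euro-sign entry are made up for illustration.

```php
<?php
// Hypothetical helper: decodes entities such as "&#x41" that lack the trailing ";".
function decodeHexEntities(array $entities)
{
    $decoded = array();
    foreach ($entities as $key => $entity) {
        // Append the missing ";" only when it is absent.
        if (substr($entity, -1) !== ';') {
            $entity .= ';';
        }
        $decoded[$key] = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
    }
    return $decoded;
}

// Two entries from the question plus a made-up multi-byte one (euro sign, U+20AC).
$characters = decodeHexEntities(array(33 => '&#x41', 42 => '&#x4a', 99 => '&#x20ac'));
print_r($characters);
```

The array keys are preserved, so the result can stand in for the original characters array.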
Well, you obviously haven't searched hard enough: html_entity_decode().

Extracting data from a csv file using php

I wrote this piece of code:
$handle = fopen('upload/EFT.csv', "r");
while (! feof($handle)) {
print_r(fgetcsv($handle));
}
fclose($handle);
This is the file:
AKRV0002,AKR,V0002,Akron
AKRV0006,AKR,V0006,Akron
AKRV0007,AKR,V0007,Akron
AKRV0011,AKR,V0011,Akron
AKRV0012,AKR,V0012,Akron
ATLV0019,ATL,V0019,ATLANTA
ATLV0021,ATL,V0021,ATLANTA
It returns this:
Array ( [0] => AKRV0002 [1] => AKR [2] => V0002 [3] => Akron AKRV0006 [4] => AKR [5] => V0006 [6] => Akron AKRV0007 [7] => AKR [8] => V0007 [9] => Akron AKRV0011 [10] => AKR [11] => V0011 [12] => Akron AKRV0012 [13] => AKR [14] => V0012 [15] => Akron ATLV0019 [16] => ATL [17] => V0019 [18] => ATLANTA ATLV0021 [19] => ATL [20] => V0021 [21] => ATLANTA
How can I have this return each line in a new array?
See how array position 3 is "Akron AKRV0006" — that is the last value of line 1 and the first value of line 2. It appears that the newlines aren't being read correctly. Without your raw file, I can't tell why.
Once you have fixed that, you will see that fgetcsv reads only one line at a time (in other words, returns an array with the data from only one row), not all lines at once. So, you will need to loop and add each array to another array until fgetcsv returns no more data:
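$data = array();
$handle = fopen('upload/EFT.csv', 'r');
// fgetcsv returns false at end of file (or on error), which ends the loop
while (($row = fgetcsv($handle)) !== false) {
$data[] = $row;
}
fclose($handle);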
Do you happen to use an old Mac computer?
"Note: If PHP is not properly recognizing the line endings when reading files either on or created by a Macintosh computer, enabling the auto_detect_line_endings run-time configuration option may help resolve the problem."
http://php.net/manual/en/function.fgetcsv.php
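If changing the configuration is not an option (auto_detect_line_endings is deprecated as of PHP 8.1), one workaround is to split the raw text on any kind of line break yourself and parse each line with str_getcsv(). This is a minimal sketch: the helper name parseCsvAnyLineEndings is made up, and it assumes no quoted field contains an embedded line break.

```php
<?php
// Hypothetical helper: parses CSV text whose lines may end in \r\n, \r, or \n.
function parseCsvAnyLineEndings($raw)
{
    $rows = array();
    foreach (preg_split('/\r\n|\r|\n/', $raw) as $line) {
        if ($line === '') {
            continue; // skip blank lines, e.g. after a trailing line break
        }
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}

// Two records from the question, separated by bare carriage returns (classic Mac line endings).
$raw = "AKRV0002,AKR,V0002,Akron\rAKRV0006,AKR,V0006,Akron\r";
$data = parseCsvAnyLineEndings($raw);
print_r($data);
```

For a real file, the raw text could come from file_get_contents('upload/EFT.csv').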

Problem with regular expression for some special pattern

I got a problem when I tried to find some characters with following code:
$str = "统计类型目前分为0日Q统计,月统q计及287年7统1计三7种,如需63自定义时间段,点1击此hell处进入自o定w义统or计d!页面。其他统计:客服工作量统计 | 本周服务统计EXCEL";
preg_match_all('/[\w\uFF10-\uFF19\uFF21-\uFF3A\uFF41-\uFF5A]/',$str,$match); //line 5
print_r($match);
And I got error as below:
Warning: preg_match_all() [function.preg-match-all]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 4 in E:\mycake\app\webroot\re.php on line 5
I'm not very familiar with regular expressions and have no idea what this error means. How can I fix this? Thanks.
The problem is that the PCRE regular expression engine does not understand the \uXXXX syntax for denoting characters by their Unicode code points. Instead, PCRE uses the \x{XXXX} syntax combined with the u modifier:
preg_match_all('/[\w\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u',$str,$match);
print_r($match);
See my answer here for some more information.
EDIT:
$str = "统计类型目前分为0日Q统计,月统q计及287年7统1计三7种,如需63自定义时间段,点1击此hell处进入自o定w义统or计d!页面。其他统计:客服工作量统计 | 本周服务统计EXCEL";
preg_match_all('/[\w\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u',$str,$match);
//                                                                       ^
//                                                                       | the u modifier
print_r($match);
/* Array
(
[0] => Array
(
[0] => 0
[1] => Q
[2] => q
[3] => 2
[4] => 8
[5] => 7
[6] => 7
[7] => 1
[8] => 7
[9] => 6
[10] => 3
[11] => 1
[12] => h
[13] => e
[14] => l
[15] => l
[16] => o
[17] => w
[18] => o
[19] => r
[20] => d
[21] => E
[22] => X
[23] => C
[24] => E
[25] => L
)
) */
Are you sure that you used the u modifier (see arrow above)? If so, check whether your PHP supports the u modifier at all (PHP 4.1.0 or later on Unix, 4.2.3 or later on Windows).
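A quick way to check whether a given PHP build accepts the u modifier together with the \x{XXXX} syntax is to run a tiny pattern and look at the return value. The sketch below assumes PHP 7 or later for the "\u{...}" string escapes used to build the sample; the sample string itself is made up.

```php
<?php
// Made-up sample: full-width digits U+FF10 and U+FF19 between ASCII letters.
$sample = "a\u{FF10}b\u{FF19}c";

// preg_match_all returns false when the pattern fails to compile or the subject is not valid UTF-8.
$count = preg_match_all('/[\x{FF10}-\x{FF19}]/u', $sample, $match);

if ($count === false) {
    echo "The u modifier or the \\x{XXXX} syntax is not usable here\n";
} else {
    echo "Matched $count full-width digits\n";
}
```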
