UTF-8 characters in preg_match_all (PHP) [duplicate]

I have preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);
If $in = 'hëllo' $out is:
array(1) {
[0]=>
array(2) {
[0]=>
array(2) {
[0]=>
string(2) "ë"
[1]=>
int(1)
}
[1]=>
array(2) {
[0]=>
string(1) "o"
[1]=>
int(5)
}
}
}
The position of o should be 4. I've read about this problem online (the ë gets counted as 2). Is there a solution for this? I've seen mb_substr and similar, but is there something like this for preg_match_all?
Kind of related: Is there an equivalent of preg_match_all in Python? (Returning an array of matches with their position in the string)

This is not a bug: PREG_OFFSET_CAPTURE reports the byte offset of the match in the string, not the character offset.
mb_ereg_search_pos behaves the same way. One possibility is to convert the string to UTF-32 first and then divide each position by 4 (because every Unicode code point is encoded as exactly 4 bytes in UTF-32):
mb_regex_encoding("UTF-32");
$string = mb_convert_encoding('hëllo', "UTF-32", "UTF-8");
$regex = mb_convert_encoding('[aäeëioöuáéíóú]', "UTF-32", "UTF-8");
mb_ereg_search_init ($string, $regex);
$positions = array();
while ($r = mb_ereg_search_pos()) {
$positions[] = reset($r)/4;
}
print_r($positions);
gives:
Array
(
[0] => 1
[1] => 4
)
You could also convert the byte offsets into code point offsets. For UTF-8, a suboptimal implementation is:
function utf8_byte_offset_to_unit($string, $boff) {
    $result = 0;
    for ($i = 0; $i < $boff; ) {
        $result++;
        $byte = $string[$i];
        // Binary form of the lead byte, zero-padded to 8 digits
        $base2 = str_pad(
            base_convert((string) ord($byte), 10, 2), 8, "0", STR_PAD_LEFT);
        // Number of leading 1 bits: 0 for ASCII, otherwise the sequence length
        $p = strpos($base2, "0");
        if ($p === 0) { $i++; }
        elseif ($p >= 2 && $p <= 4) { $i += $p; }
        else { return FALSE; } // continuation byte or invalid lead byte
    }
    return $result;
}
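The same conversion can be sketched more compactly by counting bytes instead of decoding them. This is only a sketch: the function name is made up for illustration, and it assumes the subject is valid UTF-8 and that the byte offset falls on a character boundary (which is the case for offsets returned by a /u pattern).

```php
<?php
// Hypothetical helper: convert a byte offset in a UTF-8 string to a code point offset.
// Assumes $subject is valid UTF-8 and $byteOffset lies on a character boundary.
function utf8_offset_from_bytes($subject, $byteOffset) {
    $prefix = substr($subject, 0, $byteOffset);
    // Every code point contributes exactly one non-continuation byte, so subtract
    // the continuation bytes (0x80-0xBF) from the byte length of the prefix.
    return strlen($prefix) - preg_match_all('/[\x80-\xBF]/', $prefix);
}

// "o" sits at byte offset 5 in 'hëllo' because "ë" takes two bytes.
var_dump(utf8_offset_from_bytes('hëllo', 5)); // int(4)
```

The pattern is deliberately used without the /u modifier so that it matches single bytes rather than characters.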

There is a simple workaround, to be applied after preg_match() has returned its matches. Iterate over every match and reassign its position as follows:
$utfPosition = mb_strlen(substr($wholeSubjectString, 0, $capturedEntryPosition), 'utf-8');
Tested on PHP 5.4 under Windows; it depends only on the mbstring (Multibyte String) extension.
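Applied to the preg_match_all() call from the question, the workaround could look roughly like this. The wrapper name is hypothetical (it is not a built-in), and the sketch assumes valid UTF-8 input, the default PREG_PATTERN_ORDER layout, and that the mbstring extension is loaded.

```php
<?php
// Hypothetical wrapper: run preg_match_all() with PREG_OFFSET_CAPTURE, then
// rewrite every byte offset as a character offset.
function preg_match_all_char_offsets($pattern, $subject) {
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches as &$group) {
        foreach ($group as &$match) {
            // $match is array(matched text, byte offset); unmatched groups report -1
            if (is_array($match) && $match[1] >= 0) {
                $match[1] = mb_strlen(substr($subject, 0, $match[1]), 'UTF-8');
            }
        }
        unset($match);
    }
    unset($group);
    return $matches;
}

$out = preg_match_all_char_offsets('/[aäeëioöuáéíóú]/u', 'hëllo');
print_r($out[0]); // "ë" at position 1, "o" at position 4
```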

PHP doesn't support Unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.
I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.
About the Python question: a Python regex match object exposes the match position through mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions

Another way to split a UTF-8 $string by a regular expression is to use preg_split(). Here is my working solution:
$result = preg_split('~\[img/\d{1,}/img\]\s?~', $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
PHP 5.3.17

Related

PHP : How to select specific parts of a string [duplicate]

I was wondering... I have two strings :
"CN=CMPPDepartemental_Direction,OU=1 - Groupes de sécurité,OU=CMPP_Departementale,OU=Pole_Ambulatoire,OU=Utilisateurs_ADEI,DC=doadei,DC=wan",
"CN=CMPPDepartemental_Secretariat,OU=1 - Groupes de sécurité,OU=CMPP_Departementale,OU=Pole_Ambulatoire,OU=Utilisateurs_ADEI,DC=doadei,DC=wan"
Is there a way in php to select only the first part of these strings ? I would like to just select CMPPDepartemental_Direction and CMPPDepartemental_Secretariat.
I had thought of trying with substr() or trim() but without success.
You should use preg_match with regex CN=(\w+_\w+) to extract needed parts:
$strs = [
"CN=CMPPDepartemental_Direction,OU=1 - Groupes de sécurité,OU=CMPP_Departementale,OU=Pole_Ambulatoire,OU=Utilisateurs_ADEI,DC=doadei,DC=wan",
"CN=CMPPDepartemental_Secretariat,OU=1 - Groupes de sécurité,OU=CMPP_Departementale,OU=Pole_Ambulatoire,OU=Utilisateurs_ADEI,DC=doadei,DC=wan"
];
foreach ($strs as $str) {
    $matches = null;
    // Only print when the pattern matched, and put each result on its own line
    if (preg_match('/CN=(\w+_\w+)/', $str, $matches)) {
        echo $matches[1], "\n";
    }
}
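The pattern above only matches names that contain an underscore. A looser sketch that takes everything up to the first comma is shown below; the function name is hypothetical, and it assumes the CN value contains no escaped commas (real LDAP distinguished names may escape them as "\,").

```php
<?php
// Hypothetical helper: return the value of the leading CN= component, or null.
// Assumes the value contains no escaped commas.
function first_cn($dn) {
    return preg_match('/^CN=([^,]+)/', $dn, $m) ? $m[1] : null;
}

echo first_cn('CN=CMPPDepartemental_Direction,OU=Pole_Ambulatoire,DC=doadei,DC=wan'), "\n";
// prints CMPPDepartemental_Direction
```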
If the strings always have the same structure, I recommend a custom function, find_by_keyword, so you can search for other keywords too.
function find_by_keyword( $string, $keyword ) {
    $parts = explode(",", $string);
    $found = [];
    // Loop through each comma-separated part and check for a match.
    foreach ( $parts as $part ) {
        // Only accept parts that start with the keyword, then strip it.
        if ( strpos( $part, $keyword ) === 0 ) {
            $found[] = substr($part, strlen($keyword));
        }
    }
    return $found;
}
$str2 = "CN=CMPPDepartemental_Direction,OU=1 - Groupes de sécurité,OU=CMPP_Departementale,OU=Pole_Ambulatoire,OU=Utilisateurs_ADEI,DC=doadei,DC=wan";
var_dump(find_by_keyword($str2, "CN="));
// array(1) {
//   [0]=>
//   string(27) "CMPPDepartemental_Direction"
// }
var_dump(find_by_keyword($str2, "OU="));
// array(4) {
//   [0]=>
//   string(25) "1 - Groupes de sécurité"
//   [1]=>
//   string(19) "CMPP_Departementale"
//   [2]=>
//   string(16) "Pole_Ambulatoire"
//   [3]=>
//   string(17) "Utilisateurs_ADEI"
// }

4 bytes big-endian to int

I have a file that contains data (keywords) to interpret and starts with 4-bytes big endian to determine number of keywords. I can't seem to get the proper integer value from it.
$bytes = "00000103";
$keywords = preg_replace("/(.{2})(.{2})(.{2})(.{2})/u", "\x$1\x$2\x$3\x$4", $bytes);
var_dump($keywords);
$unpacked = unpack("N", $keywords);
var_dump($unpacked);
Outputs (incorrect):
string(16) "\x00\x00\x01\x03"
array(1) {
[1]=>
int(1551380528)
}
For testing purposes, I change the $keywords variable to:
$bytes = "\x00\x00\x01\x03";
It outputs (correct):
string(4) ""
array(1) {
[1]=>
int(259)
}
How do I change the data-type of $keywords? Searched a lot, but can't get it to work unfortunately.
PS. After posting, the unprintable characters (boxes with question marks) in the correct string(4) output are not shown.
You can simply use the hexdec function:
$bytes = "00000103";
$dec = hexdec($bytes);
var_dump($dec); //int(259)
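If you would rather keep the unpack("N") approach from the question, the hex text has to become four raw bytes first; hex2bin() (PHP 5.4+) does that. A minimal sketch, in which the file name in the commented part is a made-up placeholder:

```php
<?php
// Turn the 8 hex digits into 4 raw bytes, then unpack them.
$raw = hex2bin("00000103");     // same bytes as "\x00\x00\x01\x03"
$unpacked = unpack("N", $raw);  // "N" = unsigned 32-bit integer, big-endian
var_dump($unpacked[1]);         // int(259)

// If the file already holds raw bytes, no hex conversion is needed at all.
// 'keywords.bin' is a hypothetical file name:
// $fh = fopen('keywords.bin', 'rb');
// $parts = unpack("N", fread($fh, 4));
// $count = $parts[1];
```

The preg_replace() attempt fails because "\x$1" is not a valid escape in a double-quoted string, so PHP keeps the backslash and "x" as literal characters, which is why the result is 16 characters long.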

Split a string on every nth character and ensure that all segment strings have the same length

I want to split the following string into 3-letter elements. Additionally, I want all elements to have 3 letters even when the number of characters in the input string cannot be split evenly.
Sample string with 10 characters:
$string = 'lognstring';
The desired output:
$output = ['log','nst','rin','ing'];
Notice how the "in" late in the input string is used a second time to make the last element "full length".
Hope this helps you.
$str = 'lognstring';
$arr = str_split($str, 3);
// If the last chunk is short, replace it with the last 3 characters of the string
if (strlen(end($arr)) < 3) {
    array_pop($arr);
    array_push($arr, substr($str, -3));
}
print_r($arr);
$str = 'lognstring';
$chunk = 3;
$arr = str_split($str, $chunk); //["log","nst","rin","g"]
if(strlen(end($arr)) < $chunk) //if last item string length is under $chunk
$arr[count($arr)-1] = substr($str, -$chunk); //replace last item to last $chunk size of $str
print_r($arr);
/**
array(4) {
[0]=>
string(3) "log"
[1]=>
string(3) "nst"
[2]=>
string(3) "rin"
[3]=>
string(3) "ing"
}
*/
Unlike the earlier answers, which blast the string with str_split() and then come back to mop up the last element if needed, I'll demonstrate a technique that populates the array of substrings in one clean pass.
To conditionally pull back the last iteration's starting point, either use a ternary condition or min(). I prefer the syntactic brevity of min().
Code:
$string = 'lognstring';
$segmentLength = 3;
$totalLength = strlen($string);
$result = [];
for ($i = 0; $i < $totalLength; $i += $segmentLength) {
    // On the last pass, min() moves the start back so the segment is full length
    $result[] = substr($string, min($totalLength - $segmentLength, $i), $segmentLength);
}
var_export($result);
Output:
array (
0 => 'log',
1 => 'nst',
2 => 'rin',
3 => 'ing',
)
Alternatively, you can prepare the string BEFORE splitting (instead of after).
Code:
$extra = strlen($string) % $segmentLength;
var_export(
str_split(
$extra
? substr($string, 0, -$extra) . substr($string, -$segmentLength)
: $string,
$segmentLength
)
);
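The one-pass loop above assumes the input is at least one segment long; for a shorter input the start position would go negative. One way to guard against that, as a sketch, is to clamp the last start position at 0. The function name is hypothetical, and the code assumes a single-byte encoding and a segment length of at least 1.

```php
<?php
// Hypothetical helper wrapping the one-pass technique.
// Assumes single-byte characters and $segmentLength >= 1.
function split_full_segments($string, $segmentLength) {
    $result = [];
    $totalLength = strlen($string);
    // Latest allowed start; clamped at 0 for inputs shorter than one segment
    $lastStart = max(0, $totalLength - $segmentLength);
    for ($i = 0; $i < $totalLength; $i += $segmentLength) {
        $result[] = substr($string, min($lastStart, $i), $segmentLength);
    }
    return $result;
}

var_export(split_full_segments('lognstring', 3)); // 'log', 'nst', 'rin', 'ing'
var_export(split_full_segments('ab', 3));         // 'ab' (too short to fill a segment)
```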

PHP Compress array of bits into shortest string possible

I have an array that contains values of 1 or 0 representing true or false values. e.g.
array(1,0,0,1,0,1,1,1,1);
I want to compress/encode this array into the shortest string possible so that it can be stored in a space-constrained place such as a cookie. It also needs to be decodable again later. How do I go about this?
ps. I am working in PHP
Here is my proposal:
$a = array(1,0,0,1,0,1,1,1,1,1,0,0,1,0,1,1,1,1,1,0,0,1,0,1,1,1,1);
$compressed = base64_encode(implode('', array_map(function($i) {
return chr(bindec(implode('', $i)));
}, array_chunk($a, 8))));
var_dump($compressed); // string(8) "l8vlBw=="
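So you take each group of 8 bits (which is in fact a binary number in 0..255), convert it to an integer, represent it as a single byte with chr(), implode the bytes into a string, and convert that to base64 so it can be saved as text.
UPD:
The opposite is pretty straightforward (note that decbin() drops leading zeros, so any byte below 128 comes back with fewer than 8 bits):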
$original = str_split(implode('', array_map(function($i) {
return decbin(ord($i));
}, str_split(base64_decode($compressed)))));
How exactly I wrote it (just in case anyone is interested in how to write such unreadable and barely maintainable code):
I wrote $original = $compressed; and started reversing the right-hand side of this expression step by step:
1. Decoded from base64 to a binary string
2. Split it into an array
3. Converted every character to its ASCII code
4. Converted the decimal ASCII code to binary
5. Joined all the binary numbers into a single string
6. Split the long binary string into an array
Don't use serialize. Just make a string of it:
<?php
$string = implode( '', $array );
?>
You are left with a string like this:
100101111
If you want an array again, just access the string like an array:
<?php
$string = '100101111';
echo $string[1]; // prints "0"
?>
Of course you could also convert it to a decimal and just store the number. That's even shorter than the "raw" bits, as long as the bit count fits in a PHP integer and leading zeros don't matter.
<?php
$dec = bindec( $string );
?>
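How about pack() and unpack()? Note that the "h*" format treats each digit as a hex nibble, so this stores two array values per byte rather than eight.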
$arr = array(1,1,1,1,0,0,1,1,0,1,0,0,1,1,0,0,1,1,1,1);
$str = implode($arr);
$res = pack("h*", $str);
var_dump($res);
$rev = unpack("h*", $res);
var_dump($rev);
output:
string(10) # Not visible here
array(1) {
[1]=>
string(20) "11110011010011001111"
}
Here is my solution, based on zerkms' answer. It deals with the loss of leading 0's when converting decimals back into binary.
function compressBitArray(array $bitArray){
$byteChunks = array_chunk($bitArray, 8);
$asciiString = implode('', array_map(function($i) {
return chr(bindec(implode('', $i)));
},$byteChunks));
$encoded = base64_encode($asciiString).'#'.count($bitArray);
return $encoded;
}
//decode
function decompressBitArray($compressedString){
//extract original length of the bit array
$parts = explode('#',$compressedString);
$origLength = $parts[1];
$asciiChars = str_split(base64_decode($parts[0]));
$bitStrings = array_map(function($i) {
return decbin(ord($i));
}, $asciiChars);
//pad lost leading 0's
for($i = 0; $i < count($bitStrings); $i++){
if($i == count($bitStrings)-1){
$toPad = strlen($bitStrings[$i]) + ($origLength - strlen(implode('', $bitStrings)));
$bitStrings[$i] = str_pad($bitStrings[$i], $toPad, '0', STR_PAD_LEFT);
}else{
if(strlen($bitStrings[$i]) < 8){
$bitStrings[$i] = str_pad($bitStrings[$i], 8, '0', STR_PAD_LEFT);
}
}
}
$bitArray = str_split(implode('', $bitStrings));
return $bitArray;
}
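A shorter take on the same idea, as a sketch only: pad the bit string to whole bytes before encoding, and decode each byte with sprintf('%08b'), which keeps the leading zeros that decbin() drops. The function names and the "count#payload" layout are assumptions of this example, not part of the answers above.

```php
<?php
// Hypothetical helpers; the "count#payload" layout is an assumption of this sketch.
function pack_bits(array $bits) {
    $count = count($bits);
    // Right-pad to a whole number of bytes so every chunk has exactly 8 bits
    $bitString = str_pad(implode('', $bits), (int) ceil($count / 8) * 8, '0', STR_PAD_RIGHT);
    $bytes = '';
    foreach ($bitString === '' ? [] : str_split($bitString, 8) as $chunk) {
        $bytes .= chr(bindec($chunk));
    }
    return $count . '#' . base64_encode($bytes);
}

function unpack_bits($packed) {
    list($count, $payload) = explode('#', $packed, 2);
    $count = (int) $count;
    if ($count === 0) {
        return [];
    }
    $bitString = '';
    foreach (str_split((string) base64_decode($payload)) as $byte) {
        // %08b always yields 8 digits, including leading zeros
        $bitString .= sprintf('%08b', ord($byte));
    }
    // Cut off the padding bits that were added before encoding
    return array_map('intval', str_split(substr($bitString, 0, $count)));
}

$bits = array(1,0,0,1,0,1,1,1,1);
$packed = pack_bits($bits);
var_dump($packed);                       // string(6) "9#l4A="
var_dump(unpack_bits($packed) === $bits); // bool(true)
```

Putting the count in front of the "#" keeps parsing simple, since base64 output never contains that character.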

Convert and reconvert a version to number to store in database

Is there any algorithm to convert a string like 1.0.0 to a sortable number in PHP?
It should be possible to convert it back to the same string again. Simply removing the dots is not an option. Also, the length of a version is unknown, for example 1.0.0, 11.222.0, 0.8.1526
If you just want to sort versions, there is no need to convert.
<?php
$versions = array('1.0.0', '11.222.0', '0.8.1256');
usort($versions, 'version_compare');
var_dump($versions);
array(3) {
[0]=>
string(8) "0.8.1256"
[1]=>
string(5) "1.0.0"
[2]=>
string(8) "11.222.0"
}
If you want to compare version numbers, you could just use the version_compare() function.
And if you have an array of versions that you need to sort, you could use a function such as usort() / uasort(), with a callback based on version_compare().
If you insist on arbitrary length, there is no way to map the versions uniquely to numbers while also preserving their order. Maybe you just want to sort the version numbers without conversion (see the other answers)?
If you expect version segments with large numbers like 12345 (e.g. 0.9.12345.2), you may be best off exploding the string and storing each segment in a separate field in SQL.
That way you can sort it however you wish.
One option would be using explode:
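function cmp($a, $b)
{
    $a = explode('.', $a);
    $b = explode('.', $b);
    $m = min(count($a), count($b));
    for ($i = 0; $i < $m; $i++) {
        // Decide only when the segments differ; equal segments fall through to the next one
        if (intval($a[$i]) < intval($b[$i]))
            return -1;
        if (intval($a[$i]) > intval($b[$i]))
            return 1;
    }
    return 0;
}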
EDIT: Didn't know about version_compare, that might be a better option if it works as you need.
Here are a couple of functions that convert a version to a string and vice versa.
That way you can store the strings in your database and sort them. I've used a width of 5 characters per segment (so each segment must stay below 100000), but you can adapt that to your needs.
function version_to_str($version) {
$list = explode('.', $version);
$str = '';
foreach ($list as $element) {
$str .= sprintf('%05d', $element);
}
return $str;
}
function str_to_version($str) {
$version = array();
for ($i=0; $i<strlen($str); $i+=5) {
$version[] = intval(substr($str, $i, 5));
}
return implode('.', $version);
}
$versions = array('1.0.0', '11.222.0', '0.8.1526');
$versions = array_map("version_to_str", $versions);
sort($versions);
$versions = array_map("str_to_version", $versions);
print_r($versions);
output:
Array
(
[0] => 0.8.1526
[1] => 1.0.0
[2] => 11.222.0
)
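If the segment width needs to be adjustable, the two functions above could be generalized roughly as follows. This is a sketch: the function names, the $width parameter and the exception are assumptions added for illustration, and a segment written with leading zeros (such as "08") comes back without them.

```php
<?php
// Hypothetical variants of version_to_str()/str_to_version() with a configurable width.
function version_to_key($version, $width = 5) {
    $key = '';
    foreach (explode('.', $version) as $segment) {
        // Accept only 1..$width digits, so no segment can overflow its slot
        if (!preg_match('/^\d{1,' . (int) $width . '}$/D', $segment)) {
            throw new InvalidArgumentException("Unsupported segment: $segment");
        }
        $key .= str_pad($segment, $width, '0', STR_PAD_LEFT);
    }
    return $key;
}

function key_to_version($key, $width = 5) {
    // Cut the key back into fixed-width slots and drop the zero padding
    return implode('.', array_map('intval', str_split($key, $width)));
}

$key = version_to_key('11.222.0');
var_dump($key);                 // string(15) "000110022200000"
var_dump(key_to_version($key)); // string(8) "11.222.0"
```

Rejecting oversized segments up front avoids silently producing keys that no longer sort or decode correctly.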
