strange behavior of preg_match_all() - php

Following code:
$string ='۱۲۳۴۵۶۷۸۹۰';
$regex ='#۱#';
preg_match_all($regex,$string,$match);
var_dump($match);
will output:
array(1) {
[0] =>
array(1) {
[0] =>
string(2) "۱"
}
}
but
$regex2 ='#[۱]#';
preg_match_all($regex2,$string,$match);
var_dump($match);
will output
array (size=1)
0 =>
array (size=11)
0 => string '�' (length=1)
1 => string '�' (length=1)
2 => string '�' (length=1)
3 => string '�' (length=1)
4 => string '�' (length=1)
5 => string '�' (length=1)
6 => string '�' (length=1)
7 => string '�' (length=1)
8 => string '�' (length=1)
9 => string '�' (length=1)
10 => string '�' (length=1)
Indeed I want use RegEx like [۱۲۳۴۵۶۷۸۹۰]‍‍‍‍‍‍, but the function output strange result with such RegEx's. I am using PHP 5.4

Try adding the Unicode flag:
$regex = '#[۱]#u';
The reason for this is because ۱ is actually several bytes long. On it's own, it's harmless because those exact bytes are either the symbol, or the individual bytes being there coincidentally. However, in a character class any of the individual bytes may match any of the individual bytes in the other characters, which is does because they are close together in the map.

Related

Can not match the last group of numbers using php preg_match()

preg_match_all("/(\d{12})
(?:,|$)/","111762396541,561572500056,561729950637,561135281443",$matches);
var_dump($mathes):
array (size=2)
0 =>
array (size=4)
0 => string '561762396543,' (length=13)
1 => string '561572500056,' (length=13)
2 => string '561729950637,' (length=13)
3 => string '561135281443' (length=12)
1 =>
array (size=4)
0 => string '561762396543' (length=12)
1 => string '561572500056' (length=12)
2 => string '561729950637' (length=12)
3 => string '561135281443' (length=12)
But I want the $matches like this:
array (size=4)
0 => string '561762396543,' (length=13)
1 => string '561572500056,' (length=13)
2 => string '561729950637,' (length=13)
3 => string '561135281443' (length=12)
I wanna match groups of numbers(each has 12 digits) and a suffix comma if there is one.The exeption is the last group of numbers,it doesnt have to match a comma,cause it reaches the end of the line.
Try this instead:
preg_match_all("/(\d{12}(?:,|$))/","111762396541,561572500056,561729950637,561135281443",$matches);
When the $ is inside your character range brackets [ ] it is looking for the $ characters not the end-of-line.
EDIT: If you want to include the comma in your matches, then just use the above code sample and look at $matches[0].
If you wanted an easier syntax that matches any sort of word boundary, the \b will match commas and end-of-line, too:
preg_match_all("/(\d{12}\b)/","111762396541,561572500056,561729950637,561135281443",$matches);

PHP str_split on string with decoded html_entity

If I run this code:
<?php
$string = 'My string ‘to parse’';
$string_decoded = html_entity_decode($string, ENT_QUOTES, 'utf-8');
$string_array = str_split($string_decoded);
var_dump($string_array);
?>
I get this result:
array (size=28)
0 => string 'M' (length=1)
1 => string 'y' (length=1)
2 => string ' ' (length=1)
3 => string 's' (length=1)
4 => string 't' (length=1)
5 => string 'r' (length=1)
6 => string 'i' (length=1)
7 => string 'n' (length=1)
8 => string 'g' (length=1)
9 => string ' ' (length=1)
10 => string '�' (length=1)
11 => string '�' (length=1)
12 => string '�' (length=1)
13 => string 't' (length=1)
14 => string 'o' (length=1)
15 => string ' ' (length=1)
16 => string 'p' (length=1)
17 => string 'a' (length=1)
18 => string 'r' (length=1)
19 => string 's' (length=1)
20 => string 'e' (length=1)
21 => string '�' (length=1)
22 => string '�' (length=1)
23 => string '�' (length=1)
As you can see, instead of the decoded single quotes (left/right), I'm getting these three characters for each quote...
I noticed that this happens with some entities, but not others. A few that present this issue are ‘ ” $copy;. Some that don't present the same problem are & $gt;.
I tried different charsets but couldn't find one that would work for all.
What am I doing wrong? Is there a way to make it work for all entities? Or at least all the "common" ones?
Thanks.
This should do well:
function mb_str_split($string) {
return preg_split('/(?<!^)(?!$)/u', $string );
}
$string = 'My string ‘to parse’';
$string = utf8_encode($string);
$string_decoded = html_entity_decode($string, ENT_QUOTES, 'utf-8');
$string_array = mb_str_split($string_decoded);
var_dump($string_array);
As mentioned in comments: you need to split the string with mb_split or by regex.
Proof: https://3v4l.org/3FRmG

What array function should I use for creating an index?

Hello guys I am trying to create an index of all words on html page that my crawler parses.
At this moment I have managed to breakdown the html page into an array of words and I have filtered out all the stop words.
At this stage I have a few problems.
The array of words from the parsed html page have words that are repeated, I like that because I still have to record how many times a word appeared in the page.
The array looks like this.
$wordsFromHTML =
array (size=119)
0 => string 'web' (length=3)
1 => string 'giants' (length=6)
2 => string 'vryheid' (length=7)
3 => string 'news' (length=4)
4 => string 'access' (length=6)
5 => string 'mails' (length=5)
6 => string 'mobile' (length=6)
7 => string 'february' (length=8)
8 => string 'access' (length=6)
9 => string 'mails' (length=5)
10 => string 'web' (length=3)
11 => string 'february' (length=8)
12 => string 'access' (length=6)
13 => string 'mails' (length=5)
14 => string 'desktop' (length=7)
15 => string 'february' (length=8)
16 => string 'hosting' (length=7)
17 => string 'web' (length=3)
18 => string 'giants' (length=6)
19 => string 'vryheid' (length=7)
20 => string 'february' (length=8)
22 => string 'us' (length=2)
Now I want to save all the words from the $wordsFromHTML to the $indesArray which is my final index.
It should look like this.
$indexArray = array('web'=>array('url'=>array(0,10,17)))
The problem is how to keep incrementing the position ($wordsFromHTML keys) for each word that was repeated from the $wordsFromHTML array in the final index array.
The index array should only have unique words and if another word that already exists try to come in, we use the already existing word which has the same URL and increment its position.
Hope you understand my question.

Count() returns different number of entries than var_dump()

Hey guys I'm tired and can't figure this one out so any help would be appreciated.
I have an array called $Row that I get from database table.
When I run var_dump($Row); I get the following:
array
0 => string '1' (length=1)
'id' => string '1' (length=1)
1 => string 'erik' (length=4)
'username' => string 'erik' (length=4)
2 => string 'd95eb19a15301089985ad6fd6ecbf2d7' (length=32)
'password' => string 'd95eb19a15301089985ad6fd6ecbf2d7' (length=32)
3 => string '' (length=0)
'email' => string '' (length=0)
4 => string '0' (length=1)
'date_join' => string '0' (length=1)
5 => string '0' (length=1)
'date_mod' => string '0' (length=1)
6 => string '1' (length=1)
'active' => string '1' (length=1)
7 => string '1' (length=1)
'admin' => string '1' (length=1)
8 => string '0' (length=1)
'deleted' => string '0' (length=1)
When I run echo count($Row); I get value 18.
Count and var_dump are right next to each other, there is no modification of $Row.
Question: Why does Count() return 18 entried when there are only 8 of them inside $Row shown by var_dump()? I guess I just don't understand count()... I checked http://php.net/manual/en/function.count.php, but still don't get it...
Edit: I understand whats wrong know, thanks everyone. Another question. How can I remove the ones that are in a string, for instance I want this kind of a table:
array
0 => string '1' (length=1)
1 => string 'erik' (length=4)
2 => string 'd95eb19a15301089985ad6fd6ecbf2d7' (length=32)
3 => string '' (length=0)
4 => string '0' (length=1)
5 => string '0' (length=1)
6 => string '1' (length=1)
7 => string '1' (length=1)
8 => string '0' (length=1)
* I'm using mysql_fetch_array() to retrieve the data and put it into a table.
You are having mixed array integer and associative and having 18 elements. This is I think returned from mysql_fetch_array which by default return associative and numeric indexed array.
Because your strings are also values in $Row they just seem to be indented because of the string char.
That means
'id' => string '1' (length=1)
'username' => string 'erik' (length=4)
'password' => string 'd95eb19a15301089985ad6fd6ecbf2d7' (length=32)
// ...
are also in your array
becuase you're getting back a mix of associative and numeric result set from you db ( you can precise which one you really want with FETCH_ASSOC or alike as a parameter to you dbvendor_query() function ).
There is very well 18 elements in your array.
I think that applying count to this:
0 => string '1' (length=1)
'id' => string '1' (length=1)
returns 2 instead of 1!
0 and id are two different items of the array!
This taken for the others 8 elements make the count return 18
Actually, there are indeed 18 elements in your array:
indices 0 through 8 (9 elements in total)
id, username, ... another 9 elements
9 + 9 = 18
If you only want the numeric indices, and supposing you're using MySQL:
$Row = mysql_fetch_array($result, MYSQL_NUM)
Source: http://php.net/manual/en/function.mysql-fetch-array.php

PHP range for hebrew alphabets

PHP has a function range('a','z') which prints the English alphabet a, b, c, d, etc.
Is there a similar function for hebrew alphabets?
You can do something like this:
function utfOrd($c) {
return intval(array_pop(unpack('H*', $c)),16);
}
function utfChr($c) {
return pack('H*', base_convert("$c", 10, 16));
}
var_dump(array_map('utfChr', range(utfOrd('א'), utfOrd('ת'))));
Prints:
array
0 => string 'א' (length=2)
1 => string 'ב' (length=2)
2 => string 'ג' (length=2)
3 => string 'ד' (length=2)
4 => string 'ה' (length=2)
5 => string 'ו' (length=2)
6 => string 'ז' (length=2)
7 => string 'ח' (length=2)
8 => string 'ט' (length=2)
9 => string 'י' (length=2)
10 => string 'ך' (length=2)
11 => string 'כ' (length=2)
12 => string 'ל' (length=2)
13 => string 'ם' (length=2)
14 => string 'מ' (length=2)
15 => string 'ן' (length=2)
16 => string 'נ' (length=2)
17 => string 'ס' (length=2)
18 => string 'ע' (length=2)
19 => string 'ף' (length=2)
20 => string 'פ' (length=2)
21 => string 'ץ' (length=2)
22 => string 'צ' (length=2)
23 => string 'ק' (length=2)
24 => string 'ר' (length=2)
25 => string 'ש' (length=2)
26 => string 'ת' (length=2)
If you need some more characters, you can use this to create your hardcoded array or merge few ranges.
Range can work with the standard western alphabet because the characters A thru Z are consecutive values in the ASCII (and UTF-8) character set.
Hebrew characters are not ASCII chars (see this list) but you could set an initial range of the UTF-8 numeric values and then just array_map that to characters.

Categories