How to split Chinese characters in PHP?


I need some help regarding how to split Chinese characters mixed with English words and numbers in PHP.
For example, if I read
FrontPage 2000中文版應用大全
I'm hoping to get
FrontPage, 2000, 中,文,版,應,用,大,全
or
FrontPage, 2,0,0,0, 中,文,版,應,用,大,全
How can I achieve this?
Thanks in advance :)

Assuming your string is UTF-8 (or you can convert it to UTF-8 with iconv or a similar tool), then using the u modifier (doc: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php ), which makes . match a whole character rather than a single byte,
<?php
$s = "FrontPage 2000中文版應用大全";
print_r(preg_match_all('/./u', $s, $matches));
echo "\n";
print_r($matches);
?>
will give
21
Array
(
[0] => Array
(
[0] => F
[1] => r
[2] => o
[3] => n
[4] => t
[5] => P
[6] => a
[7] => g
[8] => e
[9] =>
[10] => 2
[11] => 0
[12] => 0
[13] => 0
[14] => 中
[15] => 文
[16] => 版
[17] => 應
[18] => 用
[19] => 大
[20] => 全
)
)
Note that my source file is also saved as UTF-8; otherwise the string literal in $s would not contain these characters as UTF-8.
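If only the list of characters is needed, a common alternative is preg_split with an empty pattern and the u modifier. This is a minimal sketch, assuming the input is valid UTF-8; on invalid input preg_split returns false, so the check below is worth keeping.

```php
<?php
$s = "FrontPage 2000中文版應用大全";

// An empty pattern with the u modifier splits between every UTF-8 character.
// PREG_SPLIT_NO_EMPTY drops the empty strings at the start and end.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

if ($chars === false) {
    // preg_split fails when the subject is not valid UTF-8
    die("Input is not valid UTF-8\n");
}

print_r($chars);
```

The result is the same 21-element list as above, without the extra nesting of $matches.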
The following will match alphanumeric as a group:
<?php
$s = "FrontPage 2000中文版應用大全";
print_r(preg_match_all('/(\w+)|(.)/u', $s, $matches));
echo "\n";
print_r($matches[0]);
?>
result:
10
Array
(
[0] => FrontPage
[1] =>
[2] => 2000
[3] => 中
[4] => 文
[5] => 版
[6] => 應
[7] => 用
[8] => 大
[9] => 全
)

/**
* Reference: http://www.regular-expressions.info/unicode.html
* Korean: Hangul
* CJK: Han
* Japanese: Hiragana, Katakana
* Flag u required
*/
preg_match_all(
'/\p{Hangul}|\p{Hiragana}|\p{Han}|\p{Katakana}|(\p{Latin}+)|(\p{Cyrillic}+)/u',
$str,
$result
);
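As a usage sketch, here is the script-based pattern applied to the string from the question. The pattern as posted has no alternative for digits, so "2000" would be skipped; the extra ([0-9]+) branch is my addition, not part of the original answer.

```php
<?php
$str = "FrontPage 2000中文版應用大全";

// Same pattern as above, plus an assumed ([0-9]+) branch so digit runs are kept.
preg_match_all(
    '/\p{Hangul}|\p{Hiragana}|\p{Han}|\p{Katakana}|(\p{Latin}+)|(\p{Cyrillic}+)|([0-9]+)/u',
    $str,
    $result
);

// $result[0] holds the full matches; characters matched by no branch
// (here, the space) are simply left out.
print_r($result[0]);
```

Each Han character comes back as its own element, while Latin words and digit runs stay grouped.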
This one also works on PHP 7.0.
This one just does not work. I regret upvoting a non-working solution:
<?
$s = "FrontPage 2000中文版應用大全";
print_r(preg_match_all('/(\w+)|(.)/u', $s, $matches));
echo "\n";
print_r($matches[0]);
?>
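One possible explanation for the conflicting reports, offered as an assumption rather than something this thread confirms: on newer PHP builds the u modifier also gives \w Unicode semantics, so \w+ can match the Han characters together with the digits. A sketch that avoids depending on that behaviour spells out the ASCII class instead:

```php
<?php
$s = "FrontPage 2000中文版應用大全";

// [A-Za-z0-9_] is what \w means in ASCII-only mode; writing it out keeps
// the grouping limited to ASCII words and numbers on any PHP version.
// Everything else, including each Han character, falls through to ".".
preg_match_all('/[A-Za-z0-9_]+|./u', $s, $matches);

print_r($matches[0]);
```

This yields the ten elements shown in the earlier answer: FrontPage, a space, 2000, and then one element per Chinese character.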

With this code you can make Chinese text (UTF-8) wrap at the end of a line so that it stays readable:
// insert_wrap_points is a hypothetical name; the original snippet was a bare function body
function insert_wrap_points($str) {
    preg_match_all('/([\w]+)|(.)/u', $str, $matches);
    $out = '';
    foreach ($matches[0] as $val) {
        // append U+200B ZERO WIDTH SPACE so a line break is allowed after each token
        $out .= $val . "\xE2\x80\x8B";
    }
    return $out;
}

Related

preg_match to show lines containing one string and one of the other two

I've got an array in php:
Array
(
[0] => sth!Man!Tree!null
[1] => sth!Maning!AppTree!null
[2] => sth!Man!Lake!null
[3] => sth!Man!Tree!null
[4] => sth!Man!AppTree!null
[5] => sth!Maning!AppTree!null
[6] => sth!Man!Tree!null
[7] => sth!Maning!AppTree!null
[8] => sth!Maning!AppTree!null
[9] => sth!Man!Tree!null
[10] => sth!Man!Tree!null
[11] => sth!Man!Tree!null
[12] => sth!Man!Tree!null
[13] => sth!Man!Lake!null
[14] => sth!Maning!Tree!null
)
and this preg_match function:
preg_match("/Man/i", $line) && (preg_match("/!Tree!/i", $line) || preg_match("/!Lake!/i", $line))
My goal is to replace it with a single preg_match call that displays only lines containing Man and Tree, or Man and Lake. Is it possible?

You can use the following regex:
(?i)\b(?:Lake|Tree)\b.*\bMan\b|\bMan\b.*\b(?:Tree|Lake)\b
The word boundaries match only the whole words, (?i) inline mode option enables case-insensitive search, and we need at least two main alternatives to account for different positions of Man and Lake/Tree.
Sample code:
$re = "/(?i)\\b(?:Lake|Tree)\\b.*\\bMan\\b|\\bMan\\b.*\\b(?:Tree|Lake)\\b/";
$str = " Man and Tree or Man and Lake. Is it possible?";
preg_match($re, $str, $matches);

preg_match("/Man!(?:Tree|Lake)/i", $line, $matches) should do it most efficiently.
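To filter the whole array in one call, preg_grep can apply a pattern to every element. This is a sketch under the assumption that the fields are always separated by ! and that Man is immediately followed by the Tree/Lake field, as in the sample data; the shortened $lines array is illustrative.

```php
<?php
// A few lines taken from the sample data in the question.
$lines = [
    'sth!Man!Tree!null',
    'sth!Maning!AppTree!null',
    'sth!Man!Lake!null',
    'sth!Man!AppTree!null',
    'sth!Maning!Tree!null',
];

// The ! on both sides anchors each word to a whole field, so "Maning"
// and "AppTree" do not match. preg_grep keeps the original keys.
$matching = preg_grep('/!Man!(?:Tree|Lake)!/i', $lines);

print_r($matching);
```

Only the entries at keys 0 and 2 remain; wrap the call in array_values if the result should be re-indexed.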

Problems with str_split

I'm new here and I have a question. I'm writing some code that I'll use soon, but something has left me stuck. I'm splitting a string into characters, but the special characters, which arrive converted to HTML entities, are being split apart too. I'd like each entity to stay together as one element so that I can assign a color to each character. Is there any way to do this?
Code:
<?php
$text = "My nickname is: п€Яd Øwп€d"; #this would be the result of what i received via post
print_r(str_split($text));
?>
Result:
Array
(
[0] => M
[1] => y
[2] =>
[3] => n
[4] => i
[5] => c
[6] => k
[7] => n
[8] => a
[9] => m
[10] => e
[11] =>
[12] => i
[13] => s
[14] => :
[15] =>
[16] => &
[17] => #
[18] => 1
[19] => 0
[20] => 8
[21] => 7
[22] => ;
[23] => &
[24] => e
[25] => u
[26] => r
[27] => o
[28] => ;
[...]
)
I'd like to return this:
Array ( [0] => M
[1] => y
[2] =>
[3] => n
[4] => i
[5] => c
[6] => k
[7] => n
[8] => a
[9] => m
[10] => e
[11] =>
[12] => i
[13] => s
[14] => :
[15] =>
[16] => п
[17] => €
[...]
)
Thank you for the help.
[UPDATED]
I tested the functions that were posted. Note that I don't use UTF-8; my default charset is ISO-8859-1. One more thing I forgot to mention: if I edit the phrase "My nickname is:" to include a literal &, for example "My nickname is & personal name", the result is wrong. I'd appreciate any further help.

You can try to write your own str_split, which could look like the following
function str_split_encodedTogether($text) {
$result = array();
$length = strlen($text);
$tmp = "";
for ($charAt=0; $charAt < $length; $charAt++) {
if ($text[ $charAt ] == '&') {//beginning of special char
$tmp = '&';
} elseif ($text[ $charAt ] == ';') {//end of special char
array_push($result, $tmp.';');
$tmp = "";
} elseif (!empty($tmp)) {//in midst of special char
$tmp .= $text[ $charAt ];
} else {//regular char
array_push($result, $text[ $charAt ]);
}
}
return $result;
}
It checks whether the current character is a &; if so, it collects that character and everything after it in $tmp until the next ; and stores the whole run as one element. This gives the wanted result, but it fails whenever the text contains a & that does not start an encoded character.

Use preg_split():
<?php
$text = "My nickname is: п€Яd Øwп€d"; #this would be the result of what i received via post
print_r(preg_split('/(\&(?=[^;]*\s))|(\&[^;]*;)|/', $text, -1, PREG_SPLIT_DELIM_CAPTURE + PREG_SPLIT_NO_EMPTY));
?>
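A sketch of a stricter variant, assuming the special characters arrive as HTML entities such as &#1087; or &euro;. The helper name split_keep_entities is hypothetical. It only treats a & as the start of an entity when a well-formed entity body and a ; follow, so a bare & in the text is returned as an ordinary character. No u modifier is used, so it works byte by byte and does not require UTF-8 input.

```php
<?php
// Hypothetical helper: split text into single characters (bytes),
// keeping each HTML entity together as one element.
function split_keep_entities(string $text): array
{
    // First alternative: numeric (&#1087;), hex (&#x43F;) or named (&euro;) entity.
    // Second alternative: any single byte, including newlines (s modifier).
    $pattern = '/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);|./s';
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

$parts = split_keep_entities('is & me: &#1087;&euro;d');
print_r($parts);
```

One limitation of this approach: text such as "R&D;" still looks like an entity to the pattern, because it cannot know which names are real entities.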

How to stop splitting within a pair of second delimiter in preg_split (PHP)?

I need to generate an array with preg_split such that implode('', $array) re-generates the original string. For example, preg_split of
$str = 'this is a test "some quotations is here" and more';
$array = preg_split('/( |".*?")/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
generates an array of
Array
(
[0] => this
[1] =>
[2] => is
[3] =>
[4] => a
[5] =>
[6] => test
[7] =>
[8] =>
[9] => "some quotations is here"
[10] =>
[11] =>
[12] => and
[13] =>
[14] => more
)
I need to take care of the space before/after the quotation marks too, to generate an array with the exact pattern of the original string.
For example, if the string is test "some quotations is here"and, the array should be
Array
(
[0] => test
[1] =>
[2] => "some quotations is here"
[3] => and
)
Note: The edit has been made based on initial discussion with @mikel.

Will this work for you?
preg_split('/( ?".*?" ?| )/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

This should do the trick
$str = 'this is a test "some quotations is her" and more';
$result = preg_split('/(?:("[^"]+")|\b)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
$result = array_slice($result, 1,-1);
Output
Array
(
[0] => this
[1] =>
[2] => is
[3] =>
[4] => a
[5] =>
[6] => test
[7] =>
[8] => "some quotations is her"
[9] =>
[10] => and
[11] =>
[12] => more
)
Reconstruction
implode('', $result);
// => this is a test "some quotations is her" and more
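An alternative sketch that matches tokens with preg_match_all instead of splitting. It is not taken from either answer above; the idea is that every character of the input falls into exactly one token, so implode('', ...) always rebuilds the original string.

```php
<?php
// Tokens, in order of preference:
//   "[^"]*"   a complete quoted section
//   \s+       a run of whitespace
//   [^\s"]+   a run of other characters
//   "         a stray, unbalanced quote
$pattern = '/"[^"]*"|\s+|[^\s"]+|"/';

$str = 'test "some quotations is here"and';
preg_match_all($pattern, $str, $m);
$tokens = $m[0];

print_r($tokens);
```

For the second example in the question this gives test, a space, the quoted section, and and, with no empty elements to filter out.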

preg_split with regex giving incorrect output

I'm using preg_split on a string, but I'm not getting the desired output. For example
$string = 'Tachycardia limit_from:1900-01-01 limit_to:2027-08-29 numresults:10 sort:publication-date direction:descending facet-on-toc-section-id:Case Reports';
$vals = preg_split("/(\w*\d?):/", $string, NULL, PREG_SPLIT_DELIM_CAPTURE);
is generating output
Array
(
[0] => Tachycardia
[1] => limit_from
[2] => 1900-01-01
[3] => limit_to
[4] => 2027-08-29
[5] => numresults
[6] => 10
[7] => sort
[8] => publication-date
[9] => direction
[10] => descending facet-on-toc-section-
[11] => id
[12] => Case Reports
)
which is wrong. The desired output is
Array
(
[0] => Tachycardia
[1] => limit_from
[2] => 1900-01-01
[3] => limit_to
[4] => 2027-08-29
[5] => numresults
[6] => 10
[7] => sort
[8] => publication-date
[9] => direction
[10] => descending
[11] => facet-on-toc-section-id
[12] => Case Reports
)
There is something wrong with the regex, but I'm not able to fix it.

I would use
$vals = preg_split("/(\S+):/", $string, -1, PREG_SPLIT_DELIM_CAPTURE);
The output is exactly what you want.

It's because the \w class does not include the - character, so I would extend \w to cover it too:
/((?:\w|-)*\d?):/

Try this regex instead to include '-' or other characters in your splitting pattern: http://regexr.com?32qgs
((?:[\w\-])*\d?):
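A runnable sketch combining the suggestions above, using the string from the question. The leading \s* is my addition: it moves the space before each key into the delimiter, so the values come back without trailing whitespace.

```php
<?php
$string = 'Tachycardia limit_from:1900-01-01 limit_to:2027-08-29 numresults:10 '
        . 'sort:publication-date direction:descending '
        . 'facet-on-toc-section-id:Case Reports';

// [\w-]+ lets a key contain hyphens; the captured key is returned
// between the surrounding values because of PREG_SPLIT_DELIM_CAPTURE.
$vals = preg_split('/\s*([\w-]+):/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($vals);
```

The last key now comes back whole as facet-on-toc-section-id, and descending no longer carries part of it.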

How to preg_split using PREG_SPLIT_DELIM_CAPTURE

$str = "blabla and, some more blah";
$delimiters = " ,¶.\n";
$char_buff = preg_split("/(,) /", $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($char_buff);
I get:
Array (
[0] => blabla and
[1] => ,
[2] => some more blah
)
I was able to figure out how to use parentheses to get the comma to show up in its own array element, but how can I do this with multiple different delimiters (for example, those in the $delimiters variable)?

You need to create a character class by wrapping the delimiters with [ and ].
<?php
$str = "blabla and, some more blah. Blah.\nSecond line.";
$delimiters = " ,¶.\n";
$char_buff = preg_split('/([' . $delimiters . '])/', $str, -1,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($char_buff);
You also need to use PREG_SPLIT_NO_EMPTY so that in places where you get two matches in a row, for instance a comma followed by a space, you don't get an empty match.
Output
Array
(
[0] => blabla
[1] =>
[2] => and
[3] => ,
[4] =>
[5] => some
[6] =>
[7] => more
[8] =>
[9] => blah
[10] => .
[11] =>
[12] => Blah
[13] => .
[14] =>
[15] => Second
[16] =>
[17] => line
[18] => .
)
Depending on what you are doing, using strtok may be a more appropriate way of doing it though.
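A slightly more defensive sketch of the same character-class approach. Two things here are my assumptions rather than part of the answer: the delimiter string is passed through preg_quote in case it ever contains characters that are special inside a pattern (such as ], ^, - or /), and the u modifier is added because ¶ is a multi-byte character when the source is UTF-8.

```php
<?php
$str = "blabla and, some more blah. Blah.\nSecond line.";
$delimiters = " ,¶.\n";

// preg_quote escapes regex metacharacters and the "/" pattern delimiter.
// With the u modifier, "¶" is treated as one character, not two bytes.
$pattern = '/([' . preg_quote($delimiters, '/') . '])/u';

$parts = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

For this input the result has the same 19 elements as the output listed above, and implode('', $parts) gives back the original string.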

Use something like:
'/([,.])/'
That is, put each delimiter inside the square brackets.

Each delimiter expression needs to be inside its own group.
print_r(preg_split('/2\d4/' , '12345', -1, PREG_SPLIT_DELIM_CAPTURE));
Array ( [0] => 1 [1] => 5 )
print_r(preg_split('/(2)(\d)(4)/', '12345', -1, PREG_SPLIT_DELIM_CAPTURE));
Array ( [0] => 1 [1] => 2 [2] => 3 [3] => 4 [4] => 5 )
