preg_split treats apostrophe like HTML entity - PHP

I'm currently splitting utf8mb4_unicode_ci text outputted from my database by #, #, $, and spaces using the following method:
$textSplit = preg_split("/(?=[ ##$])/", $text, -1, PREG_SPLIT_NO_EMPTY);
However, when I split a piece of database text with an apostrophe, I get the following output:
// $text is a database value that equals "Is this John's text?"
$textSplit = preg_split("/(?=[ ##$])/", $text, -1, PREG_SPLIT_NO_EMPTY);
// Outputs array(5) { [0]=> string(2) "Is" [1]=> string(5) " this" [2]=> string(5) " John&" [3]=> string(6) "#039;s" [4]=> string(5) " text" }
var_dump($textSplit);
Is there any way to prevent preg_split from treating the apostrophe like an HTML entity so that it splits up the text like this?
array(4) { [0]=> string(2) "Is" [1]=> string(5) " this" [2]=> string(7) " John's" [3]=> string(5) " text" }

If anyone runs into this same issue, I was able to resolve it by using htmlspecialchars_decode($text, ENT_QUOTES). Thanks for everyone's help in getting to this solution!
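A minimal sketch of that fix, assuming the database value holds the apostrophe as the entity &#039; (the sample string and the variable name $decoded are illustrative, not taken from the real data):

```php
<?php
// Assumption: the stored text contains the apostrophe encoded as &#039;
$text = 'Is this John&#039;s text?';

// Decode the entity first; ENT_QUOTES makes sure &#039; becomes '
$decoded = htmlspecialchars_decode($text, ENT_QUOTES);

// Split before every space, # or $ (zero-width lookahead keeps the delimiter)
$textSplit = preg_split('/(?=[ #$])/', $decoded, -1, PREG_SPLIT_NO_EMPTY);

var_dump($textSplit);
```

Decoding before splitting means the # inside the entity no longer exists when the regex runs, so the pattern itself can stay unchanged.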

Try a lookbehind:
/(?<!&)(?=[ ##$])/
The lookbehind prevents a split at any position immediately preceded by &, so an entity such as &#039; is not broken apart.
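As a rough illustration of the lookbehind variant, assuming the text still contains the encoded entity (both sample strings are made up for this sketch):

```php
<?php
// Assumption: the text is still entity-encoded when it is split
$text = 'Is this John&#039;s text?';

// (?<!&) refuses to split at a position that directly follows an ampersand
$textSplit = preg_split('/(?<!&)(?=[ #$])/', $text, -1, PREG_SPLIT_NO_EMPTY);
var_dump($textSplit);

// Side effect: a literal "&" followed by a space also blocks the split
$other = preg_split('/(?<!&)(?=[ #$])/', 'Tom & Jerry', -1, PREG_SPLIT_NO_EMPTY);
var_dump($other);
```

The entity stays encoded in the result, and a plain "&" followed by a space is kept together with the next word, so this approach fits best when the output is going back into HTML anyway.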

Related

PHP - REGEX TO ARRAY like MP3TAG

I would like to ask how to convert a string to an array using
a string pattern, like Mp3tag does:
%ALBUM% - %SOMETHING% - %SOMETHING%,
The ' - ' separators are custom characters that are not static.
In case I didn't make myself clear:
I want to turn a custom string into an array,
but the pattern is custom, not static.
Is this possible in PHP, and if so, how?
$str = "%ALBUM% & %SOMETHING% (ノ゜-゜)ノ ︵ ┬──┬ %SOMETHING%,";
preg_match_all("/%([a-z]+)%/i", $str, $matches);
var_dump($matches);
Outputs
array(2) {
[0]=>
array(3) {
[0]=>
string(7) "%ALBUM%"
[1]=>
string(11) "%SOMETHING%"
[2]=>
string(11) "%SOMETHING%"
}
[1]=>
array(3) {
[0]=>
string(5) "ALBUM"
[1]=>
string(9) "SOMETHING"
[2]=>
string(9) "SOMETHING"
}
}
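The snippet above pulls the placeholder names out of the pattern. One possible way to go further and read values out of a real string is to turn the pattern into a regex with named groups. This is only a sketch: the helper names formatToRegex and parseByFormat are hypothetical, the sample tag values are invented, and it assumes placeholder names consist of letters only and are unique within one pattern (duplicate group names would not compile).

```php
<?php
// Hypothetical helper: convert "%ARTIST% - %ALBUM%" into a regex with named groups.
function formatToRegex(string $format): string
{
    // Split into %NAME% placeholders and the literal text between them
    $parts = preg_split('/(%[a-z]+%)/i', $format, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $regex = '';
    foreach ($parts as $part) {
        if (preg_match('/^%([a-z]+)%$/i', $part, $m)) {
            // Placeholder: lazy named group
            $regex .= '(?P<' . $m[1] . '>.+?)';
        } else {
            // Separator: match it literally, whatever characters it contains
            $regex .= preg_quote($part, '/');
        }
    }
    return '/^' . $regex . '$/u';
}

// Hypothetical helper: returns name => value pairs, or null if the input does not fit.
function parseByFormat(string $format, string $input): ?array
{
    if (!preg_match(formatToRegex($format), $input, $m)) {
        return null;
    }
    // Keep only the named groups, drop the numeric duplicates
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

$fields = parseByFormat('%ARTIST% - %ALBUM% - %TITLE%', 'Queen - Jazz - Bicycle Race');
var_dump($fields);
```

Because the separators are passed through preg_quote, they can be any custom characters, which is the non-static part the question asks about.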

preg_* functions matching subpattern with quantifier

I have a regex of this form:
/(?:^- (.*)$\r*\n*)+/m
The intention is to match one or more lines of text that start with -[space].
This works fine, except for when it comes to collecting the matched subpatterns (.*). Only the last one is returned, and any previous subpattern matches (which appear in the result array as part of index 0) are lost.
I really need some way of getting those subpatterns in an array, so I can pass them to implode and do what I'm trying to do with them.
Am I missing something obvious here?
Maybe you could use
preg_match_all('/^- (.*?)\r?\n/m', $subject, $result, PREG_PATTERN_ORDER);
var_dump($result);
For example:
<?php
$subject = "- some line
- some content
- some other content
nothing to match over here
- more things here
- more patterns
nothing to match here
";
// \r?\n accepts both LF and CRLF line endings; the lazy (.*?) keeps a trailing \r out of the capture
preg_match_all('/^- (.*?)\r?\n/m', $subject, $result, PREG_PATTERN_ORDER);
var_dump($result);
?>
Outcome:
array(2) {
[0]=>
array(5) {
[0]=>
string(12) "- some line
"
[1]=>
string(15) "- some content
"
[2]=>
string(21) "- some other content
"
[3]=>
string(19) "- more things here
"
[4]=>
string(16) "- more patterns
"
}
[1]=>
array(5) {
[0]=>
string(9) "some line"
[1]=>
string(12) "some content"
[2]=>
string(18) "some other content"
[3]=>
string(16) "more things here"
[4]=>
string(13) "more patterns"
}
}
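Since the goal in the question was to pass the captured lines to implode, here is a short sketch of that last step. The sample text and the ', ' glue are assumptions made for illustration.

```php
<?php
// Sample input for illustration only
$subject = "- some line\n- some content\nnothing here\n- more patterns\n";

// One match per line; group 1 holds the text after "- "
// The optional \r keeps a Windows line ending out of the capture
preg_match_all('/^- (.*?)\r?$/m', $subject, $result);

// $result[1] is a plain array of all captured subpatterns
$joined = implode(', ', $result[1]);
echo $joined, "\n";
```

Matching each line separately avoids the repeated group, which is why every capture survives instead of only the last one.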

How to match a "tag" list using regular expressions & PHP

I have a form input field that accepts multiple "tags" from a user, a bit like the one on this site! So, for example a user could enter something like:
php mysql regex
...which would be nice & simple to separate up the multiple tags, as I could explode() on the spaces. I would end up with:
array('php', 'mysql', 'regex')
However things get a little more complicated as the user can separate tags with commas or
spaces & use double quotes for multi-word tags.
So a user could also input:
php "mysql" regex, "zend framework", another "a, tag with punc $^&!)(123 *note the comma"
All of which would be valid. This should produce:
array('php', 'mysql', 'regex', 'zend framework', 'another', 'a, tag with punc $^&!)(123 *note the comma')
I don't know how to write a regular expression that would firstly match everything in double quotes, then explode the string on commas or spaces & finally match everything else. I guess I would use preg_match_all() for this?
Could anyone point me in the right direction!? Many thanks.
Try this regex out. I tested it against your string, and it correctly pulled out the individual tags:
("([^"]+)"|\s*([^,"\s]+),?\s*)
This code:
$string = 'php "mysql" regex, "zend framework", another "a, tag with punc $^&!)(123 *note the comma"';
$re = '("([^"]+)"|\s*([^,"\s]+),?\s*)';
$matches = array();
preg_match_all($re, $string, $matches);
var_dump($matches);
Yielded the following result for me:
array(3) {
[0]=>
array(6) {
[0]=>
string(4) "php "
[1]=>
string(7) ""mysql""
[2]=>
string(8) " regex, "
[3]=>
string(16) ""zend framework""
[4]=>
string(9) " another "
[5]=>
string(44) ""a, tag with punc $^&!)(123 *note the comma""
}
[1]=>
array(6) {
[0]=>
string(0) ""
[1]=>
string(5) "mysql"
[2]=>
string(0) ""
[3]=>
string(14) "zend framework"
[4]=>
string(0) ""
[5]=>
string(42) "a, tag with punc $^&!)(123 *note the comma"
}
[2]=>
array(6) {
[0]=>
string(3) "php"
[1]=>
string(0) ""
[2]=>
string(5) "regex"
[3]=>
string(0) ""
[4]=>
string(7) "another"
[5]=>
string(0) ""
}
}
Hope that helps.
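To get from those match groups to the flat array the question asks for, something like the following could work. It is a sketch using a slightly simplified variant of the regex above (no surrounding whitespace in the match), not the answer's exact pattern.

```php
<?php
$string = 'php "mysql" regex, "zend framework", another "a, tag with punc $^&!)(123 *note the comma"';

// Group 1: text inside double quotes; group 2: a bare tag without commas, quotes or spaces
preg_match_all('/"([^"]+)"|([^,"\s]+)/', $string, $matches, PREG_SET_ORDER);

$tags = [];
foreach ($matches as $m) {
    // Exactly one of the two groups is non-empty for each match
    $tags[] = ($m[1] !== '') ? $m[1] : $m[2];
}

var_dump($tags);
```

PREG_SET_ORDER groups the results per match, which makes it easy to pick whichever group actually captured something.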

preg_split using PREG_SPLIT_DELIM_CAPTURE

I was looking to split a string based on a regular expression but I also have interest in keeping the text we split on:
php > var_dump(preg_split("/(\^)/","category=Telecommunications & CATV^ORcategory!=ORtest^caused_byISEMPTY^EQ"), null, PREG_SPLIT_DELIM_CAPTURE);
array(4) {
[0]=> string(34) "category=Telecommunications & CATV"
[1]=> string(18) "ORcategory!=ORtest"
[2]=> string(16) "caused_byISEMPTY"
[3]=> string(2) "EQ"
}
NULL
int(2)
What I do not understand is why am I not getting an array such as:
array(4) {
[0]=> "category=Telecommunications & CATV"
[1]=> "^"
[2]=> "ORcategory!=ORtest"
[3]=> "^"
[4]=> "caused_byISEMPTY"
[5]=> "^"
[6]=> "EQ"
}
Additionally, how could I change my regular expression to match "^OR" and also "^". I was having trouble with a lookbehind assertion such as:
$regexp = "/(?<=\^)OR|\^/";
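This will work as expected:
var_dump(preg_split('/(\^)/','category=Telecommunications & CATV^ORcategory!=ORtest^caused_byISEMPTY^EQ', -1, PREG_SPLIT_DELIM_CAPTURE));
In your code, the closing parenthesis of preg_split() is in the wrong place, so null and PREG_SPLIT_DELIM_CAPTURE were passed to var_dump() instead of preg_split().
For the additional question, this pattern splits on "^OR" as well as "^":
/(\^OR|\^)/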
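For reference, a small sketch of the /(\^OR|\^)/ pattern combined with PREG_SPLIT_DELIM_CAPTURE, using the string from the question (the variable names are illustrative):

```php
<?php
$input = 'category=Telecommunications & CATV^ORcategory!=ORtest^caused_byISEMPTY^EQ';

// The alternation tries "^OR" first, then falls back to a lone "^".
// The capturing group makes preg_split return the delimiters as well.
$parts = preg_split('/(\^OR|\^)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($parts);
```

The order of the alternatives matters: with \^ listed first, the "OR" would stay attached to the following piece.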

regex breaking Chinese string

When I run this code on some Chinese text, the character ni (你), and maybe others, gets chopped off and broken.
$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\s,]+/", $sample);
var_dump($parts);
//outputs
array(4) {
[0]=>
string(2) "�"
[1]=>
string(9) "不喜欢"
[2]=>
string(6) "香蕉"
[3]=>
string(3) "吗"
}
//in 我觉得 你很 麻烦
//out
array(4) {
[0]=>
string(9) "我觉得"
[1]=>
string(2) "�"
[2]=>
string(3) "很"
[3]=>
string(6) "麻烦"
}
Is my regex wrong?
If your string is in UTF-8, you must use the u modifier:
$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\\s,]+/u", $sample);
var_dump($parts);
If it's in another encoding, see unicornaddict's answer.
Since the input string is multi-byte, I guess you'll have to use mb_split in place of preg_split.
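A likely explanation, offered as an assumption rather than something confirmed in the thread: without the u modifier the pattern works on single bytes, and the last byte of 你 in UTF-8 is 0xA0, which \s can treat as whitespace depending on the locale and PCRE character tables. The sketch below shows the byte sequence and the UTF-8 aware split.

```php
<?php
$sample = "你不喜欢 香蕉 吗";

// UTF-8 bytes of the character that gets broken
echo bin2hex("你"), "\n"; // e4bda0

// With /u the subject is read as UTF-8 code points, not as single bytes
$parts = preg_split('/[\s,]+/u', $sample, -1, PREG_SPLIT_NO_EMPTY);

if ($parts === false) {
    // With /u, preg_split returns false when the input is not valid UTF-8
    echo 'preg error code: ', preg_last_error(), "\n";
} else {
    var_dump($parts);
}
```

Checking for false is worthwhile here because the u modifier rejects input that is not valid UTF-8 instead of silently splitting it.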
