str_replace for UTF-16 characters - php

I have some strings containing characters such as \x{1f601} which I want to replace with some text.
When I do this using preg_replace, it would be something like:
preg_replace('/\x{1f601}/u', '######', $str)
However, this doesn't seem to work with str_replace:
str_replace("\x{1f601}", '######', $str)
How can I make such replacements work with str_replace?

preg_replace is a Regex parser/replacer, which is a Perl Regular expression engine, but str_replace is NOT and replaces things with a plaintext method
The Preg_replace you have got can be seen here in regex101, stating that:
matches the character 😁 with position 0x1f601 (128513 decimal or 373001 octal) in the character set
But this could be transferable to a non-regex find and replace,by copy and pasting that face smiley symbol into the str_replace directly.
$str = str_replace("😁", '######', $str)
Or, by reading deceze's comment which gives you a clean, small solution.
Additional:
You are using a character set that is non-standard so it may be useful for you to explore Mb_Str_replace (gitHub) which is an accompanyment (but not directly from) the mb_string collection of PHP functions.
Finally:
Why do you need to do string replace whe you are already doing regex preg_replace? Also please read the manual which states all of this fairly clearly.

Related

Multibyte regex and hashtag parsing in PHP

I'm currently working on a project where users can tag their content using hashtags in the text area. When saving the post, I go through the content to find any hashtags, save them and relate them to the post model. It's all working fine except for one flaw, there is no multibyte support, which is a bug issue since this project will be international and with broad language support.
For instance, lets say I have this content in my post:
$content = 'This is my testing string, look at the hashtags and see that the multibyte ones are ignored. #php #regex #my #multibyte #ÄÀö #öl #lÀsa #drickaöl #tags #are #being #ignored'
I'm currently using preg_match_all to fetch all the hashtags, like this:
preg_match_all('/(#\w+)/', $content, $matches);
Although, this will ignore any tag starting with a multibyte sign, such as Ä, À, or ö, or simply break each tag wherever it encounters one.
People have been recommending the mb_ereg() method, but as far as I can tell, this only has support for getting a boolean result, indication whether or not your string matches the pattern.
You can have a look at my simple regex here.
Please help me understand and potentially fix this so that I can get this feature working properly.
Many thanks!
You need to use the u flag with your regex:
$re = '/#\w+/u';
See IDEONE demo
$re = '/#\w+/u';
$str = "This is my testing string, look at the hashtags and see that the multibyte ones are ignored. #php #regex #my #multibyte #ÄÀö #öl #lÀsa #drickaöl #tags #are #being #ignored";
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Perhaps, you also might want to use \p{L} (a Unicode letter category), but it does not seem necessary since \w with the u Unicode flag already matches all Unicode letters.
Here is a regex version with \p{L}:
$re = '/#[0-9_\p{L}]+/u';
See IDEONE demo
You can also use PCRE unicode propertiez: \p{L} and \p{N} for this:
preg_match_all('/(#[\p{L}\p{N}_]+)/u', $content, $matches);
RegEx Demo

Split string on non-alphanumerics in PHP? Is it possible with php's native function?

I was trying to split a string on non-alphanumeric characters or simple put I want to split words. The approach that immediately came to my mind is to use regular expressions.
Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
But there are two problems that I see with this approach.
It is not a native php function, and is totally dependent on the PCRE Library running on server.
An equally important problem is that what if I have punctuation in a word
Example:
$string = 'U.S.A-men's-vote';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
Now this will spilt the string as [{U}{S}{A}{men}{s}{vote}]
But I want it as [{U.S.A}{men's}{vote}]
So my question is that:
How can we split them according to words?
Is there a possibility to do it with php native function or in some other way where we are not dependent?
Regards
Sounds like a case for str_word_count() using the oft forgotten 1 or 2 value for the second argument, and with a 3rd argument to include hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word-parts) as part of a word; followed by an array_walk() to trim those characters from the beginning or end of the resultant array values, so you only include them when they're actually embedded in the "word"
Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.
Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:
preg_split('/[^a-z0-9.\']+/i', $string);
If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:
preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
As per my comment, you might want to try (add as many separators as needed)
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).
So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling
they 're 'just friends'. Or that's what they say.
while having "'re" and a sequence of words of which the first is left-quoted and the last is right-quoted, the first not being a known sequence ('s, 're, 'll, 'd ...) may be handled at application level.
This is not a php-problem, but a logical one.
Words could be concatenated by a -. Abbrevations could look like short sentences.
You can match your example directly by creating a solution that fits only on this particular phrase. But you cant get a solution for all possible phrases. That would require a neuronal-computing based content-recognition.

PHP: is there an isLetter() function or equivalent?

I am no PHP expert. I am looking for the PHP equivalent of isLetter() in Java, but I can't find it. Does it exist?
I need to extract letters from a given string and make them lower case, for example: "Ap.ér4i5T i6f;" should give "apéritif'. So, yes, there are accentuated characters in my strings.
ctype_alpha().
In addition to regex / preg_replace, you can also use strtoupper($string) and strtolower($string), if you need to universally upper-case a string. As Konrad mentioned, preg_replace is probably your best bet though.
http://php.net/manual/en/function.strtoupper.php
http://www.php.net/manual/en/function.strtolower.php
In PHP (and in Java) you wouldn’t use isLetter to implement it, you’d rather replace all characters that aren’t letters using a regular expression:
echo preg_replace('/\P{L}/', '', input);
Loop up the documentation of preg_replace and the regex pattern syntax desciption, in particular the relevant Unicode character classes.
You could probably use the php-slugs source code, with appropriate modifications.

multi-byte function to replace preg_match_all?

I'm looking for a multi-byte function to replace preg_match_all(). I need one that will give me an array of matched strings, like the $matches argument from preg_match(). The function mb_ereg_match() doesn't seem to do it -- it only gives me a boolean indicating if there were any matches.
Looking at the mb_* functions page, I don't offhand see anythng that replaces the functionality of preg_match(). What do I use?
Edit I'm an idiot. I originally posted this question asking for a replacement for preg_match, which of course is ereg_match. However both those only return the first result. What I wanted was a replacement for preg_match_all, which returns all match texts. But anyways, the u modifier works in my case for preg_match_all, as hakre pointed out.
Have you taken a look into mb_ereg?
Additionally, you can pass an UTF-8 encoded string into preg_match using the u modifier, which might be the kind of multi-byte support you need. The other option is to encode into UTF-8 and then encode the results back.
See as well an answer to a related question: Are the PHP preg_functions multibyte safe?
PHP: preg_grep manual
$matches = preg_grep('/(needles|to|find)/u', $inputArray);
Returns an array indexed using the keys from the input array.
Note the /u modifier which enables multibyte support.
Hope it helps others.

PHP preg_replace/preg_match vs PHP str_replace

Can anyone give me a quick summary of the differences please?
To my mind, are they both doing the same thing?
str_replace replaces a specific occurrence of a string, for instance "foo" will only match and replace that: "foo". preg_replace will do regular expression matching, for instance "/f.{2}/" will match and replace "foo", but also "fey", "fir", "fox", "f12", etc.
[EDIT]
See for yourself:
$string = "foo fighters";
$str_replace = str_replace('foo','bar',$string);
$preg_replace = preg_replace('/f.{2}/','bar',$string);
echo 'str_replace: ' . $str_replace . ', preg_replace: ' . $preg_replace;
The output is:
str_replace: bar fighters, preg_replace: bar barhters
:)
str_replace will just replace a fixed string with another fixed string, and it will be much faster.
The regular expression functions allow you to search for and replace with a non-fixed pattern called a regular expression. There are many "flavors" of regular expression which are mostly similar but have certain details differ; the one we are talking about here is Perl Compatible Regular Expressions (PCRE).
If they look the same to you, then you should use str_replace.
str_replace searches for pure text occurences while preg_replace for patterns.
I have not tested by myself, but probably worth of testing. But according to some sources preg_replace is 2x faster on PHP 7 and above.
See more here: preg_replace vs string_replace.

Categories