Fix broken UTF-8 on PHP-substring - php

I have a small problem:
I wrote my own search engine for my Joomla-based website. I generate a preview of the article text using PHP's substr() function. It works fine, but it has issues when it has to split multibyte characters, since it does not take X characters but X bytes of the string. This means that any multibyte character can get split by this function, which doesn't look nice.
Does anyone know a good workaround other than reworking it with an additional wordwrap function?
Best wishes

mb_substr() performs a multibyte-safe substring operation.
For example:
mb_substr('Some string', 1, 3);
http://php.net/manual/en/function.mb-substr.php
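As a rough sketch of how this applies to the preview problem: the helper below cuts by characters rather than bytes. The name preview_text() and the sample string are made up for illustration, and the preg_match() fallback is only an assumption about what you might do when mbstring is not loaded.

```php
<?php
// Hypothetical helper: returns the first $chars characters (not bytes)
// of a UTF-8 string.
function preview_text(string $text, int $chars): string
{
    $chars = max(0, $chars);
    if (function_exists('mb_substr')) {
        return mb_substr($text, 0, $chars, 'UTF-8');
    }
    // Fallback without mbstring: with the /u modifier, "." matches one
    // whole code point instead of one byte.
    if (preg_match('/^.{0,' . $chars . '}/us', $text, $m) === 1) {
        return $m[0];
    }
    return ''; // the input was not valid UTF-8
}

$article = 'mókus és dió';
$broken  = substr($article, 0, 2);     // "m" plus only the first byte of "ó"
$safe    = preview_text($article, 5);  // "mókus"
```

Note that this counts code points; a character built from a base letter plus a combining mark can still be split.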

Related

How do I make the PHP similar_text() function work for Japanese Characters (Kanji, Katakana and Hiragana)?

I want to use the similar_text() function provided by PHP on Japanese characters, but unfortunately it gives the wrong answer. How can I make it work?
For example:
similar_text('土橋勇樹', '東日刷株式')
gives the output 3, but we can clearly see it should be 0.
You will need to handle the multiple bytes that make up each Kanji character. I am not 100% sure, but I suspect similar_text() does not support multibyte strings, so you need a similar function that does.
These links show people's attempts at a multibyte-aware version of the PHP function:
https://gist.github.com/soderlind/74a06f9408306cfc5de9
https://github.com/antalaron/mb-similar-text
I have not personally tested these, but the approach could be right or could inspire you to write a custom function.
Also covered in this other post:
how to use similar text php code in arabic
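For what it's worth, here is a minimal sketch of such a custom function. It is not the code from either link above; the names mb_similar_text_sketch() and similar_chars() are invented for this example. It follows the usual description of similar_text() (find the longest common run, then recurse on the parts to the left and right of it), but compares whole characters instead of bytes.

```php
<?php
// Split both strings into UTF-8 characters, then count matching characters.
function mb_similar_text_sketch(string $a, string $b): int
{
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return similar_chars($x, $y);
}

// Longest common run of characters, plus the same count for the pieces
// to the left and to the right of that run.
function similar_chars(array $x, array $y): int
{
    $n = count($x);
    $m = count($y);
    $best = 0;
    $px = 0;
    $py = 0;
    for ($i = 0; $i < $n; $i++) {
        for ($j = 0; $j < $m; $j++) {
            $k = 0;
            while ($i + $k < $n && $j + $k < $m && $x[$i + $k] === $y[$j + $k]) {
                $k++;
            }
            if ($k > $best) {
                $best = $k;
                $px = $i;
                $py = $j;
            }
        }
    }
    if ($best === 0) {
        return 0;
    }
    return $best
        + similar_chars(array_slice($x, 0, $px), array_slice($y, 0, $py))
        + similar_chars(array_slice($x, $px + $best), array_slice($y, $py + $best));
}

$japanese = mb_similar_text_sketch('土橋勇樹', '東日刷株式'); // 0, no shared characters
$english  = mb_similar_text_sketch('World', 'Word');          // 4
```

This is cubic in the string length, so it is only suitable for short strings such as names.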

Does a reliable way to capitalize Unicode text exist?

I recently had to deal with some complex problems working with Unicode strings (using PHP, a language I know pretty well). The mbstring extension was not really working properly, and we had huge pains trying to capitalize Unicode letters, which with ASCII text is a trivial problem already solved in a variety of ways.
If I had to solve this problem with ASCII text, I would probably just take the character, check if it is a letter and then subtract 32 from its ASCII value, for example! But as for now, I could not find anything explaining how the problem of capitalization of Unicode text has been solved: do I need to store a complete associative table to map every lowercase character to its related uppercase version? I suppose (and hope) I will hear a huge NO!
The heart of the question: does any method exist to correctly convert lowercase to uppercase (and back) when operating on Unicode characters? And if so, which strategies are applied?
For this test, suppose you do not have any module available, really ANY: no mbstring, no iconv, nothing. Moreover, for the sake of simplicity, suppose the problem of recognizing individual characters is already solved: our String object has a nextChar() method which can be used to find the next character, independently of its byte length. Suppose that what you want to do is take a string, iterate over it with nextChar() and, for each character, capitalize it if possible.
If anything is unclear or you need more information, simply comment and I will try to answer your doubts, if they are not even bigger than mine at the moment ;)
You can try the PortableUTF8 library, written as an alternative to mbstring and iconv.
http://pageconfig.com/post/portable-utf8
Another interesting library is Stringy. It works with mbstring by default, but if the module is not available it will use a polyfill package.
https://github.com/danielstjules/Stringy
To understand the problem better, it's worth reading:
What factors make PHP Unicode-incompatible?
I hope it will be useful for you.
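On the "do I need a table?" part of the question: a lookup table is in fact how this is usually done, and the Unicode Character Database publishes the mappings (UnicodeData.txt and SpecialCasing.txt). The sketch below only illustrates the idea with a tiny hand-written table; upper_with_table() and the three entries in $map are made up for this example and are nowhere near complete.

```php
<?php
// Uppercase a UTF-8 string using only core string functions and a table.
// $extraMap maps lowercase characters to their uppercase form.
function upper_with_table(string $text, array $extraMap): string
{
    // ASCII letters are mapped through a table as well, so the result
    // does not depend on the current locale.
    $ascii = array_combine(range('a', 'z'), range('A', 'Z'));
    // strtr() with an array replaces whole keys and never rescans its
    // own output, so multibyte keys are replaced safely.
    return strtr($text, $extraMap + $ascii);
}

// Illustrative entries only; a real table would be generated from the
// Unicode data files.
$map = ['ó' => 'Ó', 'é' => 'É', 'ß' => 'SS'];

$word   = upper_with_table('mókus', $map);   // "MÓKUS"
$street = upper_with_table('straße', $map);  // "STRASSE"
```

The 'ß' entry shows why a simple one-to-one character mapping is not enough: some characters change length when their case changes.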

Php highlight search terms using preg_replace - character encoding issues

Hi guys, I need to highlight some searched words in the results, and I was thinking of using preg_replace. It works just fine until I use accented characters.
So this is my code:
preg_replace("/(?<!\[)(\b{$search}\b)(?!\])/i", $replace, $string);
If I'm looking for the word "mokus", it finds it, but leaves out "mókus".
The same thing happens the other way around.
Any ideas? Thanks in advance.
You might want to research the term Accent Folding.
Here's a good article for understanding the problem. The proposed solutions are in JavaScript, but you can translate the logic to PHP.
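A rough PHP take on that idea, as a sketch only: instead of folding the text, expand each vowel of the search term into a character class of its accented variants. The function names and the $variants table are assumptions made for this example (the table covers just a few Latin vowels), and the bracket lookarounds and \b from the question are left out to keep it short.

```php
<?php
// Build a regex in which every vowel of $term also matches its accented forms.
function accent_insensitive_pattern(string $term): string
{
    $variants = [
        'a' => 'aáàâä', 'e' => 'eéèêë', 'i' => 'iíìîï',
        'o' => 'oóòôöő', 'u' => 'uúùûüű',
    ];
    // Reverse lookup: accented character => base letter.
    $fold = [];
    foreach ($variants as $base => $chars) {
        foreach (preg_split('//u', $chars, -1, PREG_SPLIT_NO_EMPTY) as $c) {
            $fold[$c] = $base;
        }
    }
    $body = '';
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $ch) {
        $base = $fold[$ch] ?? $ch;
        $key  = strlen($base) === 1 ? strtolower($base) : $base;
        $body .= isset($variants[$key])
            ? '[' . $variants[$key] . ']'
            : preg_quote($ch, '/');
    }
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return '/(' . $body . ')/iu';
}

function highlight(string $term, string $text): string
{
    if ($term === '') {
        return $text;
    }
    $result = preg_replace(accent_insensitive_pattern($term), '<mark>$1</mark>', $text);
    return $result ?? $text; // preg_replace() returns null on error
}

$a = highlight('mokus', 'A mókus fut');  // "A <mark>mókus</mark> fut"
$b = highlight('mókus', 'mokus');        // "<mark>mokus</mark>"
```

Because the match is wrapped rather than replaced, the original spelling in the text is preserved in the output.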

codeigniter disallowed characters error

If I try to access this URL, http://localhost/common/news/33/+%E0%B0%95%E0%B1%87%E0%B0%B8.html, it shows "An Error Was Encountered: The URI you submitted has disallowed characters." I set $config['permitted_uri_chars'] = 'a-z 0-9~%.:??_=+-?'; What should I do?
Yeah, if you want to allow non-ASCII bytes you would have to add them to permitted_uri_chars. This feature operates on URL-decoded strings (normally, unless there is something unusual about the environment), so you have to put the verbatim bytes you want in the string and not merely % and the hex digits. (Yes, I said bytes: _filter_uri doesn't use Unicode regex, so you can't use a Unicode range.)
Trying to filter incoming values (instead of encoding outgoing ones) is a ludicrously basic error that it is depressing to find in a popular framework. You can turn this misguided feature off by setting permitted_uri_chars to an empty string, or maybe you would like a range of all bytes except for control codes ("\x20-\xFF"). Unfortunately the _filter_uri function still does crazy, crazy, broken things with some input, HTML-encoding some punctuation on the way in for some unknown bizarre reason. And you don't get to turn this off.
This, along with the broken “anti-XSS” mangler, makes me believe the CodeIgniter team have quite a poor understanding of how string escaping and security issues actually work. I would not trust anything they say on security ever.
What to do?
Stop using Unicode characters in a URL, for the same reasons you shouldn't name files on a filesystem with Unicode characters.
But, if you really need it, I'll copy/paste some lines from the config:
Leave blank to allow all characters -- but only if you are insane.
I would NOT suggest trying to decode them or use any other tricks. Instead, I would suggest using the urlencode() and urldecode() functions.
Since I don't have a copy of your code, I can't add examples. If you can provide some, I can show you how to do it.
However, it's pretty straightforward to use, and it's built into PHP 4 and PHP 5.
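Without the original code, this is only a guess at what that would look like. The sketch takes the title bytes from the URL in the question and encodes them when building a link; rawurlencode() is used here instead of urlencode() because it encodes a space as %20, which suits path segments. Whether CodeIgniter then accepts the request still depends on permitted_uri_chars.

```php
<?php
// Raw UTF-8 bytes of the title, taken from the URL in the question.
$title = rawurldecode('%E0%B0%95%E0%B1%87%E0%B0%B8');

// Encode the title when building the link.
$segment = rawurlencode($title);
$url     = 'http://localhost/common/news/33/' . $segment . '.html';

// Decoding the segment on the receiving side gives the original bytes back.
$decoded = rawurldecode($segment);
```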
I had a similar problem and wanted to share the solution. It was for a password reset, and I had to send the user name and time, as the URL will be active for an hour only. CodeIgniter will not accept certain characters in the URL for security reasons, and I did not want to change that. So here is what I did:
Concatenate the user name, '__' and time() in a variable $str.
Encrypt $str using MCRYPT_BLOWFISH; the result may contain '/' and '+'.
Encode the result using str2hex (got it from here).
Put the encoded string as the 3rd argument in the link sent by email, like http://xyz.com/users/resetpassword/3123213213ABCDEF238746238469898. You can see that the URL contains only 0-9 and A-F.
When the link from the email is clicked, get the 3rd URI segment, use hex2str() to decode it back to the Blowfish-encrypted string, and then apply Blowfish decryption to get the original string.
Split on '__' to get the user name and time.
I know it's almost a year since this question was asked, but I am hoping that someone will find this solution helpful after coming here from Google.
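The hex step above can be sketched with PHP's built-in bin2hex() and hex2bin() instead of the custom str2hex()/hex2str() helpers, which are not shown here. The function names to_url_token() and from_url_token() and the sample bytes are invented for illustration, and the encryption step itself is left out (the mcrypt extension was removed from PHP in 7.2, so it would need a different library today).

```php
<?php
// Turn arbitrary bytes into a string that contains only 0-9 and A-F.
function to_url_token(string $bytes): string
{
    return strtoupper(bin2hex($bytes));
}

// Reverse of to_url_token(); rejects anything that is not an even-length hex string.
function from_url_token(string $token): string
{
    if (preg_match('/^(?:[0-9A-Fa-f]{2})+$/', $token) !== 1) {
        throw new InvalidArgumentException('Not a valid hex token');
    }
    return hex2bin($token);
}

// Stand-in for the encrypted bytes, including characters a URL filter dislikes.
$cipherBytes = "\x8F/+\x00\xFE=";

$token    = to_url_token($cipherBytes);   // "8F2F2B00FE3D"
$restored = from_url_token($token);
```

Hex doubles the length of the data, which is usually acceptable for a short reset token.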

Removing characters from a PHP String

I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.
This is an encoding problem; you shouldn't try to clean those bogus characters but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or make an agreement with your feed provider that you both use the same encoding.
Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
This is wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This:
s/[\u00FF-\uFFFF]//
which also uses the deprecated ereg form of regex, didn't work either when I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean those bogus characters but rather understand why you're receiving them scrambled.
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("£", "&pound;", $string);
$fixT = str_replace("€", "&euro;", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>##\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.
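One narrower idea, offered as a sketch: if the rubbish really is the Unicode replacement character U+FFFD (the � shown in the question), it can be removed on its own without touching anything else. This assumes the feed is otherwise valid UTF-8; the two function names are made up for this example, and the preg_match('//u', ...) call is just a common way to check validity without mbstring.

```php
<?php
// True if $s is well-formed UTF-8. With the /u modifier, preg_match()
// returns false when the subject contains invalid byte sequences.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Remove only U+FFFD, which is the three bytes EF BF BD in UTF-8.
function strip_replacement_char(string $s): string
{
    return str_replace("\xEF\xBF\xBD", '', $s);
}

$feed  = "abc\xEF\xBF\xBD \xC2\xA3";         // "abc", U+FFFD, space, pound sign
$clean = strip_replacement_char($feed);      // the pound sign survives

$ok  = is_valid_utf8("caf\xC3\xA9");         // true: "é" encoded as UTF-8
$bad = is_valid_utf8("caf\xE9");             // false: "é" as a single Latin-1 byte
```

If is_valid_utf8() returns false for the raw feed, the problem is more likely a mismatched encoding than stray characters.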
You are looking for characters that are outside of the range of glyphs that your font can display. You can find the maximum unicode value that your font can display, and then create a regex that will replace anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip character 255 and anything above it, up to U+FFFF.
That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP 5's filter_input is very good for filtering input strings and allows a fair amount of flexibility:
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
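Since the string here comes from a feed rather than from GET or POST, filter_var() is probably the closer fit than filter_input(). A small sketch, with a made-up sample string: FILTER_FLAG_STRIP_HIGH removes every byte above 127, so it also removes legitimate accented letters and currency symbols, not just the rubbish.

```php
<?php
// Sample feed text with U+FFFD (bytes EF BF BD) in the middle.
$feed = "Price: 5\xEF\xBF\xBD OK";

// FILTER_UNSAFE_RAW does nothing by itself; the flag strips bytes > 127.
$clean = filter_var($feed, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
```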
Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
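A quick way to look at the bytes, as a sketch (dump_bytes() is a name invented for this example):

```php
<?php
// Show each byte of a string as two hex digits, separated by spaces.
function dump_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$dump = dump_bytes("a\xEF\xBF\xBD");  // "61 ef bf bd"
```

If the dump shows ef bf bd, the feed already contains the replacement character U+FFFD, which means the original bytes were lost before the data reached you.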
Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
Hello friends,
Try this regular expression to remove Unicode escape sequences of the form \uXXXX from the string:
/\\u[0-9a-fA-F]{4}/
Thanks,
Chintu(prajapati.chintu.001#gmail.com)
