strlen & special chars - php

I'm having trouble finding a solution here. I'm developing a WordPress theme for a client that uses a for() loop to iterate through the page title so that each character can be wrapped in a <span> and displayed vertically. The loop uses strlen() to find the length of the title, but since some of the page titles include '...' or commas, the output contains the HTML entity characters instead. I can't figure out what is causing that, and every attempt with htmlspecialchars_decode() or html_entity_decode() has failed. Any suggestions? Is there something going on with the for loop that I'm not aware of?
Since it was requested, here is the actual code:
$p_title = get_the_title($port_page->ID);
$title = '';
for($i=0;$i<strlen($p_title);$i++){
    if(($p_title[$i])){
        $title .="<span>$p_title[$i]</span>";
    }
}
I've tried using mb_strlen as well. Searching for a specific character to replace doesn't necessarily solve the problem, since page titles are set arbitrarily by the site owner.
The weird thing is that the title is not encoded in any way and echoes normally before the for loop, so it's as if something is converting it.

This sounds a lot like a character encoding issue with multibyte characters. Can you try replacing strlen() with mb_strlen() and see if it does the job?
http://php.net/manual/en/function.mb-strlen.php

strlen() only returns the number of bytes in a string. Some special characters are represented by multiple bytes, and HTML entities make a single 'character' like the copyright symbol ("©") occupy several characters (e.g. &copy;).
Your "..." (ellipsis) can be a single special character in Unicode, for example.
The quick and dirty solution I suggest:
// Example string should be 1 character long after decoding, 6 bytes before
$text = "&copy;";
$bytes = strlen($text);
mb_internal_encoding('UTF-8');
$text = html_entity_decode($text, ENT_QUOTES, "UTF-8");
$length = mb_strlen($text);
print "String is ".$length." characters long, ".$bytes." bytes long";
Note that I'm assuming your string is already UTF-8. If it isn't, convert it first.
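To tie this back to the loop in the question, here is a minimal sketch of one way to wrap each character in a <span>. It assumes the title is UTF-8 and that the "..." reached the loop as an entity such as &#8230; (which would explain why the entity's characters were split apart). The function name wrap_chars_in_spans is made up for illustration; it is not a WordPress function.

```php
<?php
// Hypothetical helper: wrap every character of a title in its own <span>.
function wrap_chars_in_spans(string $title): string
{
    // Turn entities such as &#8230; or &amp; back into real characters first.
    $decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on character boundaries; the /u flag makes PCRE read the input as UTF-8.
    $chars = preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // the input was not valid UTF-8
    }

    $out = '';
    foreach ($chars as $char) {
        // Re-escape each character so &, < and > stay safe inside the markup.
        $out .= '<span>' . htmlspecialchars($char, ENT_QUOTES, 'UTF-8') . '</span>';
    }
    return $out;
}

// Three spans: "H", "i" and the single ellipsis character.
echo wrap_chars_in_spans('Hi&#8230;'), "\n";
```

Splitting with preg_split() avoids both strlen() and byte indexing, so a multi-byte character is never cut in half.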

Related

PHP strpos says different croatian chars are the same: š č

I have the following code:
$text = 'Tomáš';
echo strpos($text, "č");
# result is 4
I believe they are different chars so why is PHP telling me they are the same?
What is going on and how can I correct this?
The encoding you chose to save your source code file in cannot encode the characters you're trying to save. Whatever characters PHP is seeing, it's not comparing the strings you think it is. Save your source code in an encoding that can encode all characters, preferably UTF-8.
You should try the mb_strpos function.
Performs a multi-byte safe strpos() operation based on number of characters. The first character's position is 0, the second character position is 1, and so on.
With a regular setup, it returns false for me.
However, if you have trouble with such special characters, using mb_strpos instead of strpos should help.
http://php.net/manual/en/function.mb-strpos.php
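A small sketch of the difference, assuming the mbstring extension is loaded. The strings are written with \u{...} escapes (PHP 7+) so the result does not depend on how the source file is saved; the variable names are only for illustration.

```php
<?php
$text    = "Tom\u{00E1}\u{0161}"; // "Tomáš" as UTF-8
$s_caron = "\u{0161}";            // š
$c_caron = "\u{010D}";            // č

var_dump(strpos($text, $c_caron));                 // bool(false): č does not occur
var_dump(strpos($text, $s_caron));                 // int(5): byte offset, á takes two bytes
var_dump(mb_strpos($text, $s_caron, 0, 'UTF-8'));  // int(4): character offset
```

If the file had been saved in an encoding that cannot represent š and č, both could end up as the same replacement character, which would be consistent with the result of 4 in the question.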

Substr not working with html tags and entities

I have gone through the following question:
substr() not working, but it did not work for me :(
I am facing the same problem. I am using nicEditor, and at the time of insert I do htmlentities(addslashes(urlencode($description)))
When I view the description, it shows correctly, but when I use substr() it returns nothing.
like:
substr($description,0,10)
$description contains the content and it is fine, present in db, works without substr()
Please provide a var_dump() of $description and a bit more of the code before $description is filled in, so we can see if there is another problem.
Try this one
Use mb_substr for multibyte character encodings like UTF-8. substr just counts bytes while mb_substr counts characters.
substr() works with single-byte encodings only
http://php.net/manual/en/function.mb-substr.php
Source: PHP Substr Function Trimming Problem
This happens because in UTF-8 characters are not restricted to one byte, they have variable length to match Unicode characters, between 1 and 4 bytes.
A safe way of cutting these strings without losing anything is by using the mb_substr PHP function instead. It works almost the same way as substr but the difference is that you can add a new parameter to specify the encoding type, whether it is UTF-8 or a different encoding.
Source: http://osc.co.cr/extracting-a-substring-from-a-utf-8-string-in-php/
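Since the description in this question is stored entity-encoded, a byte cut can also land in the middle of an entity. A minimal sketch of the alternative, assuming you can get at the raw (unencoded) UTF-8 text and that mbstring is loaded: shorten first, escape afterwards. The function name excerpt_html is made up for illustration.

```php
<?php
// Hypothetical helper: shorten raw text first, escape it for HTML afterwards.
function excerpt_html(string $raw, int $chars): string
{
    $cut = mb_substr($raw, 0, $chars, 'UTF-8');          // counts characters, not bytes
    return htmlspecialchars($cut, ENT_QUOTES, 'UTF-8');  // entities are created after the cut
}

$raw = 'Fish & chips';

// Cutting the already-encoded string can stop in the middle of an entity:
echo substr(htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'), 0, 7), "\n"; // Fish &a

// Cutting the raw text and encoding afterwards keeps the entity whole:
echo excerpt_html($raw, 6), "\n"; // Fish &amp;
```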

substr doesn't work fine with utf8

I am using substr() to get the first 20 characters of a string. It works fine in normal situations, but with RTL languages (utf8) it gives me wrong results (only about 10 characters are shown). I have searched the web but found nothing useful to solve this issue. This is my line of code:
substr($article['CBody'],0,20);
Thanks in advance.
If you’re working with strings encoded as UTF-8 you may lose characters when you try to get a part of them using the PHP substr function. This happens because in UTF-8 characters are not restricted to one byte, they have variable length to match Unicode characters, between 1 and 4 bytes.
You can use mb_substr(). It works almost the same way as substr, but the difference is that you can add a new parameter to specify the encoding type, whether it is UTF-8 or a different encoding.
Try this:
$str = mb_substr($article['CBody'], 0, 20, 'UTF-8');
// Output as-is: utf8_decode() converts to ISO-8859-1, which cannot represent RTL scripts
echo $str;
Hope this helps.
Use mb_substr() instead. It will handle multi-byte characters.
http://php.net/manual/en/function.mb-substr.php
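The "about 10 characters" in the question fits the byte counting: many RTL letters take two bytes in UTF-8, so 20 bytes hold 10 of them. A quick sketch, assuming mbstring is loaded and using the Arabic letter alef (U+0627) as sample text:

```php
<?php
// 30 copies of the Arabic letter alef (U+0627), two bytes each in UTF-8.
$body = str_repeat("\u{0627}", 30);

$byBytes = substr($body, 0, 20);              // first 20 bytes
$byChars = mb_substr($body, 0, 20, 'UTF-8');  // first 20 characters

echo mb_strlen($byBytes, 'UTF-8'), "\n"; // 10
echo mb_strlen($byChars, 'UTF-8'), "\n"; // 20
echo strlen($byChars), "\n";             // 40
```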

Accessing chars in multibyte string php

I have mbstring.func_overload = 7 and am using UTF-8. Everything works fine except this:
$str = "ãçéíõ";
echo $str[0];
It prints a question mark in the browser.
This instead works normally:
echo substr($str,0,1);
Does anyone know why?
Indexing into the string with $str[0] pulls bytes out of it. It cannot be made aware of encodings, regardless of the mbstring.func_overload setting. You will need to use substr even if it is not as convenient.
Indexing into a string is a grievous coding error unless that string represents a blob, and you just came upon the reason.
Yes, it's because you are using multibyte strings, in which a single character is represented by one to four bytes. If you select just one byte (as in $str[0]) you probably have only part of a character selected.
substr(), on the other hand, is overloaded by mbstring.func_overload = 7 to behave like mb_substr(), so it is multibyte safe and counts characters rather than bytes.
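A short sketch of what the two accesses return, assuming mbstring is loaded. It calls mb_substr() directly instead of relying on mbstring.func_overload, and writes the string with \u{...} escapes (PHP 7+) so the file encoding does not matter.

```php
<?php
$str = "\u{E3}\u{E7}\u{E9}\u{ED}\u{F5}"; // "ãçéíõ" as UTF-8

$firstByte = $str[0];                         // one byte only
$firstChar = mb_substr($str, 0, 1, 'UTF-8');  // one whole character

echo bin2hex($firstByte), "\n"; // c3
echo bin2hex($firstChar), "\n"; // c3a3
var_dump(mb_check_encoding($firstByte, 'UTF-8')); // bool(false)
```

The lone byte c3 is only the first half of "ã", which is why the browser shows a question mark.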

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part; however, sometimes the editor(s) copied and pasted text from other sources into them, so now I come across some weird chars every now and then: for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(); however, this will only replace the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to UTF-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either utf8 or iso-8859-1.
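Your regex syntax looks a little problematic. The character classes are written one after another, so the pattern only matches five such characters in a row, and the last class is missing its backslash. Maybe merge them into a single class:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';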
Don't think of removing the invalid characters as the best option, this problem can be solved using htmlentities and html_entity_decode functions.
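For what it's worth, here is a minimal sketch of the stripping approach the question was after, with all the ranges from the code comment in one character class. It assumes the input is already valid UTF-8, so ISO-8859-1 files would need converting first. The function name strip_forbidden_controls is made up for illustration.

```php
<?php
// Hypothetical rewrite of fixNpChars(): every forbidden range in one character class.
function strip_forbidden_controls(string $utf8): string
{
    // U+0000-0008, U+000B, U+000C, U+000E-001F, U+007F and U+0080-009F.
    $pattern = '/[\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}\x{007F}-\x{009F}]/u';

    $clean = preg_replace($pattern, '', $utf8);
    if ($clean === null) {
        // With the /u flag, preg_replace() returns null for input that is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// U+0092 and the bell character are removed; the tab (U+0009) is kept.
echo strip_forbidden_controls("a\u{0092}b\x07c\td"), "\n";
```

Note that this throws characters away rather than translating them, so a Windows-1252 quote that ended up as U+0092 is lost, not replaced with a proper quote.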
