PHP - Strlen behaves very strangely, the same things - different results, lol numbers?

tresc and tresc_pelna
The same type, the same content
The same content. 876 characters in total.
Taken from db by ...AS data_dodania, p.data_modyfikacji, p.tresc, p.tresc_pelna, p.url, count(k.id)...
Echoed to the website by <?= strlen($post['tresc_pelna']).'----'.strlen($post['tresc']) ?>
And guess what?
This is the output
876----3248
What the...?
I have completely no idea what is happening here xD.
Please help guys :D
Both fields utf8_polish_ci and exactly same content
<?= mb_strlen($post['tresc_pelna'], 'utf-8').'----'.mb_strlen($post['tresc'], 'utf-8') ?>
Still bad result.
tresc over 3 thousands... what the... How? why?

MySQL has two built-in functions for determining the length of variable-length items. One, which counts distinct unicode characters, is called CHAR_LENGTH(). The other counts octets (bytes), and is called LENGTH().
In PHP, strlen() counts octets, like MySQL's LENGTH(). Many Unicode strings, especially those encoded in UTF-8, have a variable number of octets per character. You can use grapheme_strlen() to count characters rather than octets.
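To make the octet/character difference concrete, here is a minimal sketch. The sample word is made up for illustration (it is not from the question's data), and it assumes the mbstring extension is loaded; grapheme_strlen() additionally needs the intl extension, so the sketch only calls it when it is available.

```php
<?php
// Hypothetical sample: "zażółć" written with escapes so the source file's
// own encoding does not matter. 2 ASCII letters + 4 Polish letters.
$word = "za\u{017C}\u{00F3}\u{0142}\u{0107}";

// strlen() counts bytes: 2 * 1 byte + 4 * 2 bytes in UTF-8.
$bytes = strlen($word);

// mb_strlen() counts code points when told the string is UTF-8.
$chars = mb_strlen($word, 'UTF-8');

// grapheme_strlen() counts user-perceived characters (intl extension only).
$graphemes = function_exists('grapheme_strlen') ? grapheme_strlen($word) : null;

echo $bytes, ' bytes, ', $chars, " code points\n";
```

For plain Polish text the gap between the two counts stays small, which is why it cannot by itself account for the numbers in the question.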
I've found it's sometimes helpful to do SELECT HEX(unicode_column) to figure out what's stashed in MySQL. Just fetching the column data puts you at the mercy of the character rendering of the MySQL client you use, and can be very confusing.
It's also possible your database columns have entitized data in them (for example the string &eacute; rather than the Unicode character é). If that entity text gets sent to a web browser, it renders as the letter.

The difference between LENGTH and CHAR_LENGTH could explain a ratio of under 1.2x for most European text. It won't explain 3248:876, which is nearly 4x.
Perhaps these are part of the answer:
HTML entities, such as &oacute;, which takes 8 bytes to represent a 2-byte UTF-8 character. We can't see whether one of them has < and the other has &lt;.
Formatting tags, such as <p>. Again, possibly &lt;p&gt;
Still, that is not enough to explain nearly 4x. For example, a simple letter, such as a, will be one byte, regardless of how it is encoded. Please provide the HEX for a small sample.
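A rough way to check the entity theory from PHP, without depending on how a client renders the column, is to compare byte counts and hex dumps. This is only a sketch: the two strings below are stand-ins, not the asker's actual column values.

```php
<?php
// Stand-in values: the same letter stored raw and stored as an entity.
$raw    = "\u{00F3}";   // ó as a UTF-8 character
$entity = '&oacute;';   // ó written out as an HTML entity

$rawBytes    = strlen($raw);      // bytes used by the raw character
$entityBytes = strlen($entity);   // bytes used by the entity text

// bin2hex() shows exactly which bytes are stored, much like MySQL's HEX().
$rawHex    = bin2hex($raw);
$entityHex = bin2hex($entity);

// Decoding the entity should give back the raw character.
$decoded = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

echo "$rawBytes vs $entityBytes bytes; hex $rawHex vs $entityHex\n";
```

Running bin2hex() on a short slice of both real fields would show at a glance whether one of them contains ampersand sequences (hex 26 ... 3b).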

Related

PHP, uppercase a single character

Essentially, what I want to know is, will this:
echo chr(ord('a')-32);
Work for every a-z letter, on every possible PHP installation, every single time?
Read here for a bit of background information
After searching for a while, I realised that most of the questions for changing string case in PHP only apply to entire words or sentences.
I have a case where I only need to upper 1 single character.
I thought using functions like strtoupper, ucfirst and ucwords was overkill for single characters, seeing as they are designed to work with strings.
So after looking around php.net I found the functions chr and ord which convert chars to their ascii representation (and back).
After a little playing, I discovered I can convert a lower to an upper by doing
echo chr(ord('a')-32);
This simply offsets the character by 32 places in the ascii table. Which just happens to be the character's upper version.
The reason I'm posting this on stackoverflow, is because I want to know if there are any edge cases that could break this simple conversion.
Would changing the character set of the PHP script, or something like that, affect the outcome?
Is this $upper = chr(ord($lower)-$offset) the standard way to upper a char in PHP? or is there another?
The ASCII code doesn't change between PHP installations, because it is based on the ASCII table.
Quote from www.asciitable.com:
ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '#' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.
Quote from PHP documentation on chr():
Returns a one-character string containing the character specified by ascii.
In any case, I'd say it's more overkill to do it your way than do it with strtoupper().
strtoupper() is also faster.
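One edge case worth spelling out: the offset trick only makes sense for bytes in the range a-z. A small sketch of a guarded version follows; upperAsciiChar() is a hypothetical helper name used here for illustration, not a built-in.

```php
<?php
// Hypothetical helper: uppercase a single ASCII letter, leave anything else alone.
function upperAsciiChar(string $c): string
{
    $code = ord($c);
    // Only shift when the input is exactly one byte in 'a'..'z' (97..122).
    if (strlen($c) === 1 && $code >= 97 && $code <= 122) {
        return chr($code - 32);
    }
    return $c;
}

// The unguarded expression silently produces unrelated characters:
$unguarded = chr(ord('A') - 32);   // 65 - 32 = 33, which is '!'

$guarded = upperAsciiChar('A');    // already uppercase, returned unchanged
$builtin = strtoupper('a');        // the built-in handles a single character fine
```

For input that is guaranteed to be a lowercase ASCII letter the two approaches agree; the guard only matters when that guarantee does not hold.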

preg_match multiple and accentuated characters

The code below
preg_match('~\b(rain|dry|certain|clear)\b~i',$string);
works like a charm, but when I'm searching for words with accented characters it doesn't work.
Can somebody help me?
Well, technically a and á and à are all different characters to the interpreter. They are encoded differently and there is no way to know which different encodings represent a "similar" character (in some languages accented characters are radically different letters). So you would need to include all variants you want to match. However, if you need the actual offset within the string, you might encounter difficulties, because for UTF-8 strings the offset is given in bytes, not characters.
See this SO question for an example how to include all versions of a character.
And this bug report in case you encounter the problem with the wrong offsets.
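As a sketch of both points (listing the variants explicitly, and the byte-based offsets), assuming UTF-8 input and the mbstring extension. The accented spelling "ráin" and the subject string are invented purely for the demonstration; the u modifier is added so that the pattern and subject are treated as UTF-8.

```php
<?php
// Variants of "a" are listed explicitly in a character class.
// \x{E1} is á and \x{E0} is à; the u modifier makes PCRE read them as code points.
$pattern = '~\b(r[a\x{E1}\x{E0}]in|dry|certain|clear)\b~iu';

// Made-up subject: "été ráin" (escapes keep this file encoding-independent).
$subject = "\u{E9}t\u{E9} r\u{E1}in";

$found = preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);

// PREG_OFFSET_CAPTURE reports the offset in bytes, not characters.
$byteOffset = $m[0][1];

// Converting to a character offset: count the characters in the bytes before the match.
$charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

echo "byte offset $byteOffset, character offset $charOffset\n";
```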

PHP: How to properly split a unicode Korean string?

I have a problem where I can't seem to be able to write "certain" Korean characters. Let me try to explain. These are the steps I take.
MS Access DB file (US version) has a table with Korean in it. I export this table as a text file with UTF-8 encoding. Let's call it "A.txt"
When A.txt is read, stored in an array, then written to a new file (B.txt), all characters display properly. I'm using header("Content-Type: text/plain; charset=UTF-8"); at the very beginning of my PHP script. I simply use fwrite($fh, $someStr).
When I read B.txt in another script and write to yet another new file (C.txt), there's a certain column (obviously, in the PHP code I'm not working with a table or matrix, but effectively speaking, when output back to the original text file format) that causes the characters to show up something like this: ¸ì¹˜ ì–´ëœíŠ¸ 나ì¼ë¡. This entire column has broken characters, so if I have 5 columns in a text file, delimited by commas and encapsulated with double quotes, this column will break all of the other columns' Korean characters. If I omit this column in writing the text file, all is well.
Now, I noticed that certain PHP functions/operations break the Unicode characters. For example, if I use preg_replace() for any of the Korean strings and try to fwrite() that, it will break. However, I'm not performing anything that I'm not already doing on other fields/columns (speaking in terms of text file format), and other sections are not broken.
Does anyone have any idea on how to rectify this? I've tried utf8_encode() and mb_convert_encoding() in different ways with no success. I'm reading utf8_encode() wouldn't even be necessary if my file is UTF-8 to begin with. I've tried setting my computer language to Korean as well..
I've spent 2 days on this already, and it's becoming a huge waste of time. Please help!
UPDATE:
I think I may have found the culprit. In the script that creates B.txt, I split a long Korean string into two (using string ...<br /><br />... as indicator) and assign them to different columns. I think this splitting operation is ultimately causing the problem.
NEW QUESTION:
How do I go about splitting this long string into two while preserving the Unicode? Previously, I had used strpos() and substr(), but I am reading that the mb_*() functions might be what I need. Testing now.
Try the unicode modifier (u) for preg
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
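For the splitting step itself, here is a minimal sketch of two ways to cut a UTF-8 string at the <br /><br /> marker without landing in the middle of a character. It assumes the mbstring extension; the sample text is a few arbitrary Hangul syllables, not the asker's data.

```php
<?php
$delimiter = '<br /><br />';

// Arbitrary Hangul syllables on both sides of the marker (3 bytes each in UTF-8).
$text = "\u{D55C}\u{AD6D}\u{C5B4}" . $delimiter . "\u{BB38}\u{C7A5}";

// Option 1: explode() on the ASCII marker. The marker's bytes cannot occur
// inside a multi-byte UTF-8 character, so the cut falls on a character boundary.
[$first, $second] = explode($delimiter, $text, 2);

// Option 2: mb_* functions, which work in characters rather than bytes.
$pos      = mb_strpos($text, $delimiter, 0, 'UTF-8');
$mbFirst  = mb_substr($text, 0, $pos, 'UTF-8');
$mbSecond = mb_substr($text, $pos + mb_strlen($delimiter, 'UTF-8'), null, 'UTF-8');

// What goes wrong: a byte-based cut at an arbitrary position splits a character.
$broken     = substr($text, 0, 4);
$stillValid = mb_check_encoding($broken, 'UTF-8');
```

The thing to avoid is mixing the two families, for example feeding a character offset from mb_strpos() into the byte-based substr().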

Convert ASCII to plaintext in PHP

I am scraping some sites, and have ASCII text that I want to convert to plain text for storing in a DB. For example I want
I have got to tell anyone who will listen that this is
one of THE best adventure movies I've ever seen.
It's almost impossible to convey how pumped I am
now that I've seen it.
converted to
I have got to tell anyone who will listen that this is
one of THE best adventure movies I've ever seen. It's
almost impossible to convey how pumped I am now that
I've seen it.
I have googled my fingers bloody, any help?
You can use html_entity_decode:
echo html_entity_decode('...', ENT_QUOTES, 'UTF-8');
Few notes:
Please note that it looks like you actually want to convert from HTML-encoded string(with entities like ) to ASCII AKA plaintext.
This example converts to UTF-8, which is an ASCII-compatible character encoding for all ASCII characters (i.e. with char codes below 128). If you really want plain ASCII (thus losing all accented characters and characters from foreign languages) you should strip all offending characters separately.
Last argument ('UTF-8') is necessary to keep compatibility with different PHP versions since the default value has changed since PHP 5.4.0.
Update: Example with your text in ideone.
Update2: Changed ENT_COMPAT to ENT_QUOTES by @Daan's suggestion.
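A small sketch of the decoding, plus the optional "strip to plain ASCII" step from the notes. The input string is invented for illustration (the entities in the original question are not visible here), and the stripping regex is just one possible approach.

```php
<?php
// Made-up scraped text containing numeric and named entities.
$encoded = 'I&#39;ve seen it.&nbsp;It&rsquo;s &quot;great&quot; &amp; fun.';

// ENT_QUOTES decodes both quote styles; 'UTF-8' is passed explicitly
// so the result does not depend on the PHP version's default charset.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// Optional: reduce to plain ASCII. First turn non-breaking spaces into
// ordinary spaces, then drop every remaining non-ASCII character.
$spaced = str_replace("\u{A0}", ' ', $decoded);
$ascii  = preg_replace('/[^\x00-\x7F]/u', '', $spaced);

echo $ascii, "\n";
```

Note that the stripping step throws away the typographic apostrophe entirely; mapping such characters to ASCII equivalents before stripping may be preferable, depending on the data.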

Special characters in Flex

I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP then I am using AMFPHP to pass the data on to Flex
The problem that I am having is that the data is being copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quotes). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character to swap it for one of the normal characters seems to work, but obviously this requires compiling a list of all the characters that may appear and means there is scope for this continuing as new characters are used for the first time. It also seems like a bit of a brute force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character. As such a trivial ad-hoc replace for the smart quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode output you put into it at the document-creation phase to the encoding the consumer of that document expects, eg. using iconv.
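The "ever-increasing number of accented A's" can be reproduced in a few lines. This sketch assumes the iconv extension and uses a single é as a stand-in for any non-ASCII character; it only demonstrates the mechanism, not where in the Flex/AMFPHP/MySQL chain the mis-decoding actually happens.

```php
<?php
// Stand-in for any non-ASCII character: é, stored correctly as UTF-8 (bytes C3 A9).
$original = "\u{E9}";

// Mis-decoding: treat the UTF-8 bytes as ISO-8859-1 and convert "to UTF-8" again.
// Each original byte becomes its own character, the first being an accented capital A.
$once  = iconv('ISO-8859-1', 'UTF-8', $original);
$twice = iconv('ISO-8859-1', 'UTF-8', $once);

// The byte length doubles on every faulty round trip.
$lengths = [strlen($original), strlen($once), strlen($twice)];

// One round of damage can be undone by reversing the conversion.
$repaired = iconv('UTF-8', 'ISO-8859-1', $once);

echo implode(' -> ', $lengths), " bytes\n";
```

Repairing stored data this way is only a stopgap; the lasting fix is to find the step that declares or assumes the wrong encoding.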
