Optimizing conversion to ascii - php

I want to convert all characters in a string to their ASCII codes. Right now I do this with the ord() function, but it is pretty slow. Is there another, faster way to do this?
I have ~100GB of text to run this conversion on, so even a small difference matters a lot.
I've been thinking about creating a map of ASCII characters and using it instead, but I'm not sure it would be faster, and I couldn't find an ASCII map anywhere.
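One avenue that may be worth benchmarking is unpack('C*', ...), which returns the byte values of a whole string in a single call instead of one ord() call per character. The following is a minimal sketch, assuming the input is single-byte ASCII text. Both helper names are made up for illustration, and whether the second one is actually faster on real data has to be measured:

```php
<?php
// Hypothetical helper: byte values via one ord() call per character.
function codesWithOrd(string $text): array
{
    $codes = [];
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $codes[] = ord($text[$i]);
    }
    return $codes;
}

// Hypothetical helper: byte values via a single unpack() call.
// unpack('C*') returns a 1-indexed array, so array_values() re-indexes it from 0.
function codesWithUnpack(string $text): array
{
    if ($text === '') {
        return [];
    }
    return array_values(unpack('C*', $text));
}

print_r(codesWithUnpack('abc'));
```

For 100GB of input, reading the text in fixed-size chunks and converting each chunk would keep memory use bounded with either helper.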

Related

Fix broken UTF-8 on PHP-substring

I got a little problem:
I wrote my own search engine for my Joomla-based website. The problem is that I generate a preview of the article text using PHP's substr function. It works fine, but it has issues when it has to split multibyte characters, because it takes X bytes of the string rather than X characters. This means that any multibyte character can potentially get split by this function, which doesn't look nice.
Does anyone know a good workaround other than reworking it with an additional wordwrap call?
Best wishes
mb_substr will perform a multi-byte safe substring.
e.g.
mb_substr('Some string', 1, 3);
http://php.net/manual/en/function.mb-substr.php
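To make the difference concrete, here is a small sketch that assumes the mbstring extension is loaded and that the text (and the script file itself) is UTF-8. The sample word is only an illustration:

```php
<?php
// Assumes the mbstring extension is loaded and the text is UTF-8.
$text = 'Größe';

// substr() counts bytes: "ö" is two bytes in UTF-8, so the cut lands in the middle of it.
$byteCut = substr($text, 0, 3);

// mb_substr() counts characters when told the encoding.
$charCut = mb_substr($text, 0, 3, 'UTF-8');

// Check whether the byte-based cut is still valid UTF-8, then show the character-based cut.
var_dump(mb_check_encoding($byteCut, 'UTF-8'));
var_dump($charCut);
```

Passing the encoding explicitly avoids depending on the configured internal encoding.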

PHP, uppercase a single character

Essentially, what I want to know is, will this:
echo chr(ord('a')-32);
Work for every a-z letter, on every possible PHP installation, every single time?
Read here for a bit of background information
After searching for a while, I realised that most of the questions for changing string case in PHP only apply to entire words or sentences.
I have a case where I only need to upper 1 single character.
I thought using functions like strtoupper, ucfirst and ucwords was overkill for single characters, seeing as they are designed to work with strings.
So after looking around php.net I found the functions chr and ord which convert chars to their ascii representation (and back).
After a little playing, I discovered I can convert a lower to an upper by doing
echo chr(ord('a')-32);
This simply offsets the character by 32 places in the ASCII table, which just happens to land on the character's uppercase version.
The reason I'm posting this on stackoverflow, is because I want to know if there are any edge cases that could break this simple conversion.
Would changing the character set of the PHP script, or something like that, affect the outcome?
Is $upper = chr(ord($lower)-$offset) the standard way to uppercase a character in PHP, or is there another?
The ASCII code doesn't change between PHP installations, because it is based on the ASCII table.
Quote from www.asciitable.com:
ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '#' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.
Quote from PHP documentation on chr():
Returns a one-character string containing the character specified by ascii.
In any case, I'd say it's more overkill to do it your way than do it with strtoupper().
strtoupper() is also faster.
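The edge cases are in the input rather than in the installation: the offset only makes sense for the bytes a-z. A minimal sketch of a guarded version follows. The helper name upperAsciiChar is hypothetical and not something from the question:

```php
<?php
// Hypothetical helper: uppercase one ASCII letter with the ord()/chr() offset,
// leaving every other input untouched.
function upperAsciiChar(string $char): string
{
    if (strlen($char) !== 1) {
        return $char; // not a single byte, so the offset trick does not apply
    }
    $code = ord($char);
    if ($code >= 97 && $code <= 122) { // 'a' (97) through 'z' (122)
        return chr($code - 32);
    }
    return $char;
}

echo upperAsciiChar('a'), PHP_EOL;  // guarded offset on a lowercase letter
echo chr(ord('A') - 32), PHP_EOL;   // unguarded offset on an uppercase letter gives '!'
echo strtoupper('a'), PHP_EOL;      // built-in equivalent for the letter case
```

Without the range check, anything outside a-z is shifted to an unrelated character, which is one practical reason to prefer strtoupper().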

Sanitize/Replace all Japanese, Chinese, Korean, Russian etc. characters

I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because it's not easy to determine, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-latin" characters. But maybe there is another good solution to that?
Edit/addition
As asked in the comments: What is the purpose of my question? We had a client that had content in English, German and Russian language at first. Later on there came some chinese pages. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ascii-characters' and possibly returned 'blank' (invalid) clean-URLs
the client found that clean URLs with Chinese characters wouldn't work in some browsers
The first point led me to the idea of replacing those characters, which, as stated in the question and confirmed in the comments, is of course not possible. Maybe somebody will now answer that this isn't an issue anymore in modern browsers (starting with IE8). I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything, which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently enough for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a roman representation. Another common problem, though, is that there is no single romanization method; those languages usually have several, used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with. While you might be able to make it work for some languages, another problem would be detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and therefore romanizations are usually incompatible). Especially for simple sanitization of file names, I don’t think it is worth investing that much work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode file names. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to filter those out and otherwise support Unicode file names.
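A minimal sketch of that filter, assuming UTF-8 input. The helper name stripInvalidFilenameChars and the sample file name are made up for illustration. The listed characters are all single ASCII bytes, and such bytes never occur inside a UTF-8 multibyte sequence, so a byte-based str_replace() is safe here:

```php
<?php
// Hypothetical helper: drop only the characters the answer lists as invalid
// and keep everything else, including non-Latin text.
function stripInvalidFilenameChars(string $name): string
{
    $invalid = ['*', '|', '\\', '/', ':', '"', '<', '>', '?'];
    return str_replace($invalid, '', $name);
}

echo stripInvalidFilenameChars('日本語: report?.txt'), PHP_EOL;
```

A real sanitizer would probably also want to handle empty results and reserved names, which this sketch does not attempt.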
You could run it through your existing sanitizer, then convert anything that is not Latin to Punycode.
So, as I understand it, you need a character mapping table for every language, and you replace characters according to that table.
For example, to transliterate Russian symbols to their Latin equivalents, we use such tables =) Or classes that use these tables =)
It's interesting, I just found this: http://derickrethans.nl/projects.html#translit
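The table approach can be sketched with strtr(), which accepts an array of replacements. The table below is deliberately tiny and only covers the letters of the sample word; it is an illustration, not a complete or standard romanization:

```php
<?php
// Deliberately tiny, illustrative table. A real one would cover the whole alphabet
// and follow one chosen romanization scheme.
$cyrillicToLatin = [
    'п' => 'p', 'р' => 'r', 'и' => 'i',
    'в' => 'v', 'е' => 'e', 'т' => 't',
];

// strtr() with an array replaces each key found in the string by its value in one pass.
// Characters without an entry in the table are left unchanged.
echo strtr('привет', $cyrillicToLatin), PHP_EOL;
```

Because unmapped characters pass through untouched, a second step would still be needed to remove or encode whatever the table does not cover.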

Compress small string by converting its base in PHP?

I was wondering if there would be some way to compress a small ASCII string (~100 characters) by combining some of the native PHP compression and base converting functions to produce an even smaller string (~60 characters).
For example, could I take a string, gzcompress it, convert it to a number, and then change the base to a system with more values?
The goal is to have a smaller string that is ASCII (perhaps UTF-8) compatible for display.
You could try a dictionary compression like LZW or a Golomb code, but the compression depends on the data. Without the exact data it's not possible to answer the question.
base64_encode(gzcompress($input));
That should do it, but I don't think this will make your original string much smaller.
http://php.net/manual/en/function.base64-encode.php
http://php.net/manual/en/function.gzcompress.php
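Whether this pays off is easy to measure. The sketch below assumes the zlib extension is available and uses a made-up sample string. Base64 emits four output characters for every three input bytes, so the compressed data has to shrink by more than a quarter before the encoded result is shorter than the original:

```php
<?php
// Assumes the zlib extension is available. The sample text is made up.
$input = str_repeat('The quick brown fox jumps over the lazy dog. ', 2);

// Compress at the highest level, then make the binary result printable.
$packed = base64_encode(gzcompress($input, 9));

printf("original: %d bytes, packed: %d bytes\n", strlen($input), strlen($packed));

// Reversing the two steps restores the original string.
$restored = gzuncompress(base64_decode($packed));
var_dump($restored === $input);
```

Running it against representative strings shows directly whether the combination helps for that kind of data.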

Declaration to make PHP script completely Unicode-friendly

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).
That full-Unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
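As one small illustration of "using the right functions", the /u modifier has to be added to each pattern by hand. The sketch below assumes the script file is saved as UTF-8, and the variable names are only for illustration:

```php
<?php
// "é" is one character but two bytes in UTF-8.
$text = 'é';

// Without /u the pattern works on bytes, so "." covers only half of the character.
$byteMode = preg_match('/^.$/', $text);

// With /u the pattern works on code points, so "." covers the whole character.
$unicodeMode = preg_match('/^.$/u', $text);

var_dump($byteMode, $unicodeMode);
```

The same pattern gives different answers depending on the modifier, which is exactly the kind of detail that cannot be switched on globally.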
One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):
mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled.
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
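The Content-Length case can be sketched as follows, assuming the mbstring extension is loaded and the body is UTF-8 text. The sample string and variable names are made up:

```php
<?php
// Assumes mbstring is loaded and the body is UTF-8 text.
$body = 'naïve café';

$byteLength = strlen($body);             // bytes that actually go on the wire
$charLength = mb_strlen($body, 'UTF-8'); // characters a reader would count

// Content-Length is defined in bytes, so strlen() is the right call here.
$header = 'Content-Length: ' . $byteLength;

echo $header, PHP_EOL;
echo 'Characters: ', $charLength, PHP_EOL;
```

If strlen() were silently swapped for mb_strlen(), the header would announce fewer bytes than are sent.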
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)
