Ruby equivalent to PHP's utf8_encode and utf8_decode functions [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How can I convert a string from windows-1252 to utf-8 in Ruby?
How can i transform the utf8 chars to iso8859-1
So here's my problem. I have tools I've distributed across my internal LAN and my external web servers serve up jobs/data to the internal LAN. As a result I'm often passing data I'd rather snoopy people didn't see. In other words, it's not the end of the world if someone sees my SQL string but I'd rather they didn't. I know, it's like a deadbolt on a sliding glass door. It won't keep anyone truly determined out but it should discourage the random curious script kiddie.
So I have a set of simple ciphers I've written in PHP. Recently I've determined that the next extension to my toolset needs to be in Ruby, but those new tools need to communicate with my previously built set of PHP tools, and I don't want to rebuild all of the PHP tools. So I need my PHP ciphers to be exactly reproduced by my Ruby code, so that when Ruby encrypts a string my PHP tools can decipher it, then pass back an encrypted string for my Ruby tools to decipher.
My simple ciphers are just modified Caesar ciphers. A Caesar cipher (for those unfamiliar with the name) is where you shift all characters by a single known number of letters, i.e. a shift of 3 turns A into D, B into E, C into F, etc. A true Caesar cipher would require wrapping, so that a shift of 3 would turn a Z into a C. However, mine doesn't do that; it simply adds 3 to Z and uses the utf8_encode and utf8_decode functions in PHP.
Now I need an equivalent in Ruby. I thought I'd found it in
str.encode('utf-8')
But that returns this error
undefined method 'encode' for #
My Googling suggests there is no single solution to this in Ruby for some reason. Ruby needs to know the current encoding of the string before encoding it into UTF-8. At least that's the way I understood the issue.
So the string coming in would be whatever Ruby 1.8.7 defaults to. (In case this is useful: I'm using Ubuntu 12.04 desktop, US English, and I think I grabbed Ruby using apt-get with the default repositories.) I need to handle a variety of strings, such as SQL query statements like "SELECT * FROM table WHERE id = ?" and strings produced by PHP's md5() output. Every other string should fall into the category of upper case letters, lower case letters, and numbers, all of it in US English.
Thanks

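Check out Encoding and look at the documentation for the encode method you're using. I believe it is the first form that you need. Note that String#encode and the Encoding class only exist in Ruby 1.9 and later, which is why the call fails with "undefined method" on 1.8.7.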
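On Ruby 1.8.7 one possible route is pack/unpack, which is available there as well as in later versions. This is only a minimal sketch, assuming the input really is ISO-8859-1 (which is what PHP's utf8_encode assumes). The method names below just mirror the PHP functions for readability; they are hypothetical helpers, not part of Ruby.

```ruby
# Hypothetical helpers mirroring PHP's utf8_encode / utf8_decode.
# Assumption: "latin1" holds ISO-8859-1 bytes, "utf8" holds valid UTF-8.

def utf8_encode(latin1)
  # Read every byte as a code point (0..255) and write it back out as UTF-8.
  latin1.unpack('C*').pack('U*')
end

def utf8_decode(utf8)
  # Read the UTF-8 code points. Anything above 255 has no ISO-8859-1 byte,
  # so it is replaced with '?' (byte 63), as PHP's utf8_decode does.
  utf8.unpack('U*').map { |cp| cp > 255 ? 63 : cp }.pack('C*')
end

# Round trip of a plain ASCII string, such as a SQL statement.
sql = "SELECT * FROM table WHERE id = ?"
round_trip = utf8_decode(utf8_encode(sql))
```

Pure ASCII input passes through both helpers unchanged, so the difference only matters once the cipher pushes a character above byte value 127.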

Related

What is the difference between a string and a "binary" string in PHP?

In PHP, you can (since PHP 5.2.1) use "binary strings":
$binary = (binary) $string;
$binary = b"binary string";
What is the difference with a "normal" string?
The only meaningful insight I could find was this comment:
However, it will only have effect as of PHP 6.0.0, as noted on http://www.php.net/manual/en/function.is-binary.php .
The link is dead. It would actually make sense that binary strings were added in PHP while PHP 6.0 was being developed, since 6.0 was supposed to bring Unicode support. So it was sort of a premature feature.
However is there an official source that could confirm that? I.e. confirm that there is absolutely no difference between classic strings and binary strings?
I don't have any official source to back this up, but I believe the reason is simple:
The present PHP treats strings as byte arrays, i.e. as raw binary blobs. PHP 6 was slated to be this great new release with its biggest improvement being native Unicode handling. At that point, a string literal would actually be understood as a string of characters instead of as a string of bytes. Many string handling functions would break because of this and a lot of code would need to be retrofitted to continue to work in PHP 6.
As a migration path, strings could be declared as binary strings to keep the current behaviour. This was added early on to give developers ample time to prepare their code for PHP 6 compatibility. At the moment b doesn't do anything; the changed behaviour would only show up in PHP 6.
Well, PHP 6 never happened and is dead for now. So, b continues to do nothing for the time being and for now it's questionable if it will ever have any specific use in the future.

How to determine a word is English or any other language [duplicate]

This question already has answers here:
Detect language from string in PHP
(18 answers)
Closed 9 years ago.
I am developing a small library automation application and I need to determine whether a word is English or Turkish. An example scenario is like this:
User enters a book title.
Determine whether it's Turkish or English.
Set the language combobox to the respective language to help the user fill in the form.
A friend suggested that I "connect to Google Translate and use it", which seems reasonable, but an algorithm that does not depend on an external service or database would be more appropriate for me. (I also look for characters specific to Turkish or English, such as ç, ş, İ versus w, x, to decide.) I am therefore looking for an algorithm to do this job, perhaps based on letter frequencies or something similar. Is anything available in the literature? Thanks in advance. (I use PHP and MySQL, if that matters.)
If the sample you're testing is that small (a single word or phrase) then simple heuristics like letter frequency aren't going to be very useful, as the English phrase "Jazz Quizzes" would probably fit the profile of many languages more readily than English.
You might be able to use frequency of bigraphs and trigraphs (2- and 3-letter combinations), as English and Turkish are sufficiently unrelated as to have combinations which only occur in one.
More likely, however, you are going to have to use a database of actual words from the two languages. In that case, you are probably best off using a third-party API or database, rather than going to all the effort of building your own corpora, implementing the statistical algorithms, etc.
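As a rough illustration of the character check the question mentions, here is a minimal sketch in Ruby (the logic ports directly to PHP). The character sets and the guess_language name are assumptions made for this example, and a title containing none of the telltale letters simply comes back as unknown, which is exactly the weakness described above.

```ruby
# -*- coding: utf-8 -*-
# Assumption: letters that occur in Turkish but not English, and vice versa.
TURKISH_ONLY = 'çğıİöşüÇĞÖŞÜ'.chars.to_a
ENGLISH_ONLY = 'qwxQWX'.chars.to_a

# Hypothetical helper: returns :turkish, :english or :unknown.
def guess_language(title)
  chars = title.chars.to_a
  turkish_hits = chars.count { |c| TURKISH_ONLY.include?(c) }
  english_hits = chars.count { |c| ENGLISH_ONLY.include?(c) }
  return :turkish if turkish_hits > english_hits
  return :english if english_hits > turkish_hits
  :unknown # no telltale letters, or a tie
end

first_guess  = guess_language("Kürk Mantolu Madonna")
second_guess = guess_language("The Wind in the Willows")
third_guess  = guess_language("Anna Karenina")
```

An :unknown result is the point at which a word list or an external detector would have to take over.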
As per comment.
Please check:
Detect language from string in PHP
or:
http://wiki.apache.org/solr/LanguageDetection
Solr can report the detected language with a probability (for example, that a sentence is 90% English and 10% Turkish).

PHP string functions vs mbstring functions

I have an application that has so far been in English only. Content encoding throughout templates and database has been UTF-8. I am now looking to internationalize/translate the application into languages that have character sets absolutely needing UTF-8.
The application uses various PHP string functions such as strlen(), strpos(), substr(), etc., and my understanding is that I should switch these for multi-byte string functions such as mb_strlen(), mb_strpos(), mb_substr(), etc., in order for multi-byte characters to be handled correctly. I've tried to read around this topic a little, but virtually everything I can find goes deep into "encoding theory" and doesn't provide a simple answer to the question: if I'm using UTF-8 throughout, can I switch from using strlen() to mb_strlen() and expect things to work normally in, for example, both English and Arabic, or is there something else I still need to look out for?
Any insight would be welcome, and apologies if I'm offending someone who has encoding close to their heart with my relative ignorance.
No. Since byte arrays are also strings in PHP, a simple replacement of the 8-bit string functions with their mb_* counterparts will cause nothing but trouble. Functions like strlen() and substr() are probably used more frequently with bytes than with actual text strings.
At the place I last worked, we managed to build a multilingual web-site (Arabic, Hindi, among other languages) in PHP without using the mbstring library at all. Text string manipulation actually doesn't happen that often. When it does, it would require far more care than just changing a function name. Most of the challenges, I've found, lie on the HTML side. Getting a page layout to work with a RTL language is the non-trivial part.
I don't know if you're just using Arabic as an example. The difficulty of internationalization can vary quite substantially depending on whether "international" means European languages only (plus Russian), or if it's inclusive of Middle-Eastern, South-Asian, and Far-East languages.
Check the status of the mbstring.func_overload flag in php.ini
If (ini_get('mbstring.func_overload') & 2) then functions like strlen() (as listed here) are already overloaded by the mb_strlen() function, so there is no need for you to call the mb_* functions explicitly.
Fewer than 10 multibyte functions are really needed, so consider creating three to five separate questions, each asking whether a particular function usage or piece of logic is sound. This question is obscure and hard to answer. Small questions get quick answers, and concrete questions bring out good answers. Let me know when you create the other questions.
If you need use cases, see the fallback functions in CMSes such as WordPress, MediaWiki, and Drupal.
When you decide to start using mbstring, you should avoid the mbstring.func_overload directive. The mbstring maintainers are planning to deprecate mbstring.func_overload in PHP 5.5 or 5.6 (see the PHP core mailing list, April 2012). mbstring.func_overload breaks codebases that do not expect it to be enabled. You can see such cases in CakePHP and Zend Framework 1.x, which calculate Content-Length by using strlen().
I answered a similar question in another place: Should i refactor all my framework to use mbstring functions?
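The Content-Length problem comes down to byte count versus character count. The sketch below shows the distinction in Ruby, where String#bytesize and String#length play roughly the roles of strlen() and mb_strlen() in PHP. The variable names are made up for this example, and the sample text is five Arabic letters that each take two bytes in UTF-8.

```ruby
# -*- coding: utf-8 -*-
# Five Arabic letters, written as escapes so the source stays ASCII.
body = "\u0645\u0631\u062D\u0628\u0627"

char_count = body.length    # characters, like mb_strlen($s, 'UTF-8')
byte_count = body.bytesize  # bytes, like strlen($s) without func_overload

# An HTTP Content-Length header must carry the byte count, so code that
# calls strlen() for it breaks once strlen() is silently overloaded to
# count characters instead.
content_length_header = "Content-Length: #{byte_count}"
```

This is why a blanket search-and-replace is risky: each call site has to be checked for whether it means bytes or characters.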

Explain this XSS string, it uses perl

I am trying to test one of my php sanitization classes against a few xss scripts available on
http://ha.ckers.org/xss.html
So one of the scripts in there has perl in it, is this some kind of a perl statement?? And would this execute directly on the server, since perl is a server scripting language.
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
Is the script that I am trying to work with. I have not tested it yet though, but I want to understand before I use it.
The \0 is the string termination character in the C language. Since Perl is built on top of C, in the old days you could inject this "poisonous null byte" to make the C part read the line as
<IMG SRC=java instead of the whole string, and thus maybe let the whole thing through even though you were trying to strip things like SRC=javascript:
Mostly this doesn't work anymore, because higher-level languages have gotten pretty good at defeating attacks like this by stripping out stray control characters such as \0 before passing strings on to lower-level routines.
You can read more on the poison nullbyte here: http://insecure.org/news/P55-07.txt or here: http://hakipedia.com/index.php/Poison_Null_Byte
The Perl isn't the attack, it just demonstrates how to generate the attack, since you can't see it in a plain string.
The point is that there is a null character (represented in Perl as \0) in the data.
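To make the effect of the null byte concrete, here is a small Ruby sketch. It builds the same payload, shows what a C-style routine that stops at the first \0 would see, and strips control characters before any filtering. The strip_control_chars helper and the exact set of characters it removes are assumptions for illustration, not a complete sanitizer.

```ruby
# The same payload the Perl one-liner prints, with an embedded NUL byte.
payload = "<IMG SRC=java\0script:alert(\"XSS\")>"

# What a C-style string routine would see: everything before the first NUL.
c_view = payload[0, payload.index("\0")]

# Hypothetical helper: drop ASCII control characters except tab, LF and CR.
def strip_control_chars(input)
  input.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/, '')
end

# After stripping, the hidden "javascript:" scheme becomes visible, so a
# filter that looks for it has to run on the cleaned string.
cleaned = strip_control_chars(payload)
has_js_scheme = cleaned.include?("javascript:")
```

The order matters here: checking for "javascript:" before stripping would miss it, because the NUL splits the word in two.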

PHP String Length Without strlen()

Just browsing over the latest release of the PHP coding standards, and something caught my eye:
http://svn.php.net/viewvc/php/php-src/trunk/CODING_STANDARDS?revision=296679&view=markup
Coding standard #4 states that "When writing functions that deal with strings, be sure to remember that PHP holds the length property of each string, and that it shouldn't be calculated with strlen()..."
I've ALWAYS used strlen, and maybe it's just late, but how do you access the built-in length property of a string in PHP?
They're talking about the C function, not the PHP function. The C function will stop counting after the first \0, but PHP strings can contain \0 elsewhere other than the end.
It is clearly stated that they are talking about the PHP coding standards, not about a C function or an extension of the PHP engine.
======================== PHP Coding Standards========================
This file lists several standards that any programmer, adding or changing
code in PHP, should follow. Since this file was added at a very late
stage of the development of PHP v3.0, the code base does not (yet) fully
follow it, but it's going in that general direction. Since we are now
well into the version 4 releases, many sections have been recoded to use
these rules.
Still, I didn't find any relevant information about a string length property, but I think they might release that information in the future if it's related to a new version of PHP.
Please post if someone finds useful information about this.
To get the length of a string zval (a variable in C), use Z_STRLEN(your_zval).
See zend_operators.h at line 398 (PHP 5.4):
#define Z_STRLEN(zval) (zval).value.str.len
