Using ≠ instead of !=: pros/cons - PHP

Is it OK to use ≠ instead of !=? I know it's an extra alt code and I've never used this on a project, but I've tested it out and it works. Are there any pros/cons besides having to type Alt+8800?
Edit:
I'm not going to use this, I just want to know.
Tested language: PHP.

You have not mentioned which programming language your question is about, but ≠ has a number of disadvantages:
It does not exist in ASCII. Code written in anything but pure 7-bit ASCII is far more vulnerable to strange encoding errors, and it forces unnecessary requirements on the editing program. It might even be displayed incorrectly depending on the editor font etc., and you do NOT want that to happen when editing code.
Even if it does work, it is not widely used, which by itself is a good reason to avoid it.
It saves screen space, sure, at the expense of clarity. It might even be mistaken for an = if you are tired. Terse code is not always best.
It cannot be typed easily in a portable manner. As a matter of fact, I don't know how to produce it on my Linux system without a graphical character selector.
Except for the screen space (disk space is actually the same or worse than !=) I cannot think of any other "advantage", so why bother?
EDIT:
On my system (Mandriva Linux 2010.1) with PHP 5.3.4 the ≠ (U+2260, or 8800 in decimal) operator does not work. Are you certain that your editor does not implicitly convert it to !=?
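For reference, a quick check of what the parser actually accepts (a minimal sketch; the ≠ line stays commented out because it is a parse error on stock PHP):

// PHP's inequality operators are != (loose) and !== (strict).
// U+2260 is not a recognized token, so uncommenting the next line
// produces a parse error on stock PHP:
// var_dump(1 ≠ 2);
var_dump(1 != 2);    // bool(true)
var_dump(1 !== '1'); // bool(true): same value, but different type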


Simplest way to convert subscript numbers

We get book titles from different sources (library systems), with possibly different encodings, but mostly UTF-8. These strings are shown on the web and exported to EndNote and RefWorks. RefWorks (a Windows citation system) does not accept any encoding other than ANSI.
In the RIS/RefWorks export, activating the line
$smarty = iconv("UTF-8", "Windows-1252", $smarty);
Example string
Diphosphen-komplexes (CO) 5CrPhPPPhCr(CO) 5
suddenly cuts off everything after the first subscript character (the rectangles). These characters are also not printed correctly in HTML, but that output is okay because nothing is cut off. With UTF-8 export file encoding nothing is cut off either, but the Windows software can't read UTF-8.
The simplest solution would be to convert any subscript number to a regular number; everything would then work quite well. But I could not find any simple way to do this. Working with hex codes is the only thing I can imagine. This solution is also preferred for use in our Solr index.
Does anybody know a better solution?
The example string contains Private Use code points such as U+E5F8. By definition, no standard assigns any meaning to them; their use is purely by private agreements. It is thus impossible to convert them to anything, or to do anything with them, without knowing or inferring the private agreements involved. Some systems use Private Use code points to represent some symbols that are assigned to those points in some special font. Knowing what that font is and inspecting it may thus help to find out the agreement.
The conversion would need to be coded separately, in an ad hoc manner, since there is an ad hoc agreement involved.
"ANSI", which here means Windows-1252, does not contain any subscript characters. In the context of a chemical formula, replacing subscript digits with normal digits does not change the meaning, and the formula is understandable, though it looks unprofessional.
When converting to HTML (or another rich text format), you can use normal digits wrapped in elements that cause subscript rendering (or otherwise style them). HTML has the sub element for this, but its implementations differ between browsers and tend to be of poor quality, so a better approach is to generate <span class=sub>...</span> and use CSS to set the vertical position and font size.
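For the digit replacement itself, here is a minimal sketch, assuming the subscripts are the standard Unicode subscript digits U+2080..U+2089 (a Private Use agreement, as in the example string above, would need its own ad hoc table). Note also that iconv stops at the first unconvertible character, which explains the cut-off; appending //TRANSLIT//IGNORE to the target charset avoids that (behavior varies by iconv implementation):

function subscript_digits_to_ascii($s)
{
    // Map SUBSCRIPT ZERO..NINE (U+2080..U+2089) to ASCII digits.
    static $map = array(
        '₀' => '0', '₁' => '1', '₂' => '2', '₃' => '3', '₄' => '4',
        '₅' => '5', '₆' => '6', '₇' => '7', '₈' => '8', '₉' => '9',
    );
    return strtr($s, $map);
}

echo subscript_digits_to_ascii('Cr(CO)₅'); // Cr(CO)5
// To keep iconv from cutting off at unconvertible characters:
// $smarty = iconv('UTF-8', 'Windows-1252//TRANSLIT//IGNORE', $smarty);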

Sanitize/Replace all Japanese, Chinese, Korean, Russian etc. characters

I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, replacing them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese, … languages? And if replacing is not possible because an equivalent isn't easy to determine, how can I remove all those characters? Of course I could first sanitize as above and then remove all "non-Latin" characters. But maybe there is another good solution?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all non-ASCII characters and possibly returned 'blank' (invalid) clean URLs
the client found that in some browsers clean URLs with Chinese characters wouldn't work
The first point led me to try replacing those characters, which, as stated in the question and confirmed in the comments, is of course not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this isn't an issue anymore; I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is not a single romanization method; those languages usually have different ones which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and, as such, romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to simply filter those out and otherwise support Unicode file names.
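A minimal sketch of that direction, assuming the goal is just to drop the characters that common file systems actually reject while keeping everything else:

function sanitize_unicode_filename($name)
{
    // Strip only the characters invalid on common file systems
    // (plus ASCII control characters); CJK, Cyrillic etc. are kept.
    return preg_replace('/[*|\\\\\/:"<>?\x00-\x1F]/', '', $name);
}

echo sanitize_unicode_filename('漢字: report?.txt'); // 漢字 report.txt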
You could run it through your existing sanitizer, then convert anything non-Latin to Punycode.
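A sketch of that idea, assuming the intl extension (which provides idn_to_ascii()) is available:

function non_latin_to_punycode($label)
{
    // If anything non-ASCII survived the regular sanitizer, fall back
    // to the label's Punycode ("xn--...") form.
    if (preg_match('/[^\x20-\x7E]/', $label)) {
        $puny = idn_to_ascii($label, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($puny !== false) {
            return $puny;
        }
    }
    return $label;
}

echo non_latin_to_punycode('漢字');  // an "xn--..." ASCII form
echo non_latin_to_punycode('hello'); // hello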
So, as I understand it, you need a character mapping table for every language, and you replace characters according to that table.
For example, to transliterate Russian characters to Latin equivalents, we use such tables (or classes that use these tables).
Interestingly, I just found this: http://derickrethans.nl/projects.html#translit
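A related built-in option is iconv's //TRANSLIT mode, which approximates characters that have no ASCII equivalent; the exact output depends on the installed iconv implementation and the locale, so treat this as a sketch:

// Results vary by platform (glibc vs. libiconv) and locale settings.
setlocale(LC_CTYPE, 'en_US.UTF-8');
echo iconv('UTF-8', 'ASCII//TRANSLIT', 'éáßöäü');
// e.g. "eassoau"; characters iconv cannot approximate become "?"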

Is the TAB character bad in source code? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm pretty familiar, I guess, with both the Zend and PEAR PHP coding standards, and at my previous two employers no TAB characters were allowed in the code base, the claim being that they might be misinterpreted by the build script or something (something like that, I honestly can't remember the exact reason). We all set up our IDEs to explode TABs to 4 spaces.
I'm very used to this, but now my newest employer insists on using TABs and not spaces for indentation. I suppose I shouldn't really care, since I can just tell PhpStorm to use the TAB character when I hit the Tab key, but I do. I want spaces, and I'd like a valid argument for why spaces are better than TABs.
So, personal preferences aside, my question is, is there a legitimate reason to avoid using TABs in our code base?
Keep in mind this coding standard applies to PHP and JavaScript.
Tabs are better
Clearer to the reader what level each piece of code is on; spaces can be ambiguous, especially when it's unclear whether the codebase uses 2-space or 4-space indentation.
Conceptually makes more sense. When you indent, you expect a tab and not spaces. The document should represent what you did on your keyboard, and not in-document settings.
Tabs can't be confused with spaces in long, wrapped lines of code. Especially when word wrap is enabled, a wrapped series of words can contain spaces, which could obviously confuse readers and slow down pair programming.
Tabs are a special character. When viewing of special characters is enabled in your IDE, indentation levels can be identified more easily.
Note to all coders: you can easily convert between tabs and spaces using any of JetBrains' editors (e.g. PhpStorm, RubyMine, ReSharper, IntelliJ IDEA, etc.) by simply pressing Ctrl+Alt+L (Reformat Code) on Windows and Linux, or ⌘⌥L on a Mac.
is there a legitimate reason to avoid using TABs in our code base?
I have consulted at many a company, and not once have I run into a codebase that didn't have some sort of mixture of tabs and spaces among its various source files, and not once has it been a problem.
Preferred? Sure.
Legitimate, as in accordance with established rules, principles, or standards? No.
Edit
All I really want to know is, if there's nothing wrong with TABs, why would both Zend and PEAR specifically say they are not allowed?
Because it's their preference. A convention they wish to be followed to keep uniformity (along with things like naming and brace style). Nothing more.
Spaces are better than tabs because different editors and viewers, or different editor settings, might cause tabs to be displayed differently. That's the only legitimate reason for avoiding tabs if your programming language treats tabs and spaces the same. If some tool chokes on tabs, then that tool is broken from the language point of view.
Even when everybody in your team sets their editor to treat tabs as four spaces, you'll get a different display when you have to open up your source code in some tool that doesn't.
The most important thing to worry about is being consistent about always using the same indentation scheme; having a confused mix of tabs and spaces is living hell, and is worse than either pure tabs or pure spaces. Therefore, if the rest of the project is using tabs, you should use them too.
Anyway, there isn't a clear winner in tabs vs. spaces. Space supporters say that using only spaces for everything is a simpler rule to enforce, while tab supporters say that using tabs for indentation and spaces for alignment lets different developers display whatever tab width they find most comfortable.
In the end, tabs vs. spaces should not be a big deal. The only time I have seen people argue that one of the alternatives is strictly better than the other is in indentation-sensitive languages, like Python or Haskell. In these, mixing tabs and spaces can change the program's semantics in hard-to-see ways, instead of only making the source code look weird.
Ever since my first CS class, tabs have been taboo. The reason: tabs are basically like variables. Different IDEs can define a TAB as a different number of spaces. Speaking from a Visual Studio/NetBeans/Dev-C++ perspective, all have the capacity to change the 'definition' of a TAB based on the number of desired spaces. So if you have 4 spaces defined, there is no way you can know whether my IDE says 3 spaces or 5 spaces. So if anyone happens to use a space-based indentation style while someone else uses TABs, the formatting can get all jacked up.
As a counter-point, however, if the 'standard' is to always use tabs, then it really wouldn't matter, since the formatting will all appear the same regardless of the number of defined spaces. But all it takes is one person to use a space and the formatting can look horrid and get really confusing. That can't happen when using spaces. Also, what happens if you don't want to use the same spacing between functions/methods, etc.? What if you like using 4 spaces in some cases and only 2 in others?
I have seen build scripts that parse source code and generate documentation or even other code. These kinds of scripts usually depend on the code being in an expected format, and frequently that means using either spaces or (sometimes) tabs. Perhaps these scripts could be made more robust by checking for both tabs and spaces, but frequently you are stuck with what you've got. In that kind of environment, consistent formatting becomes even more important.
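Whichever convention wins, consistency is easy to enforce mechanically. A minimal sketch of a check for mixed indentation, the case that actually trips up readers and tools:

// Flag lines whose leading whitespace mixes tabs and spaces.
foreach (file($argv[1]) as $i => $line) {
    if (preg_match('/^(?: +\t|\t+ )/', $line)) {
        printf("%s:%d mixes tabs and spaces\n", $argv[1], $i + 1);
    }
}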

Conversion from Simplified to Traditional Chinese

If a website is localized/internationalized with a Simplified Chinese translation...
Is it possible to reliably automatically convert the text to Traditional Chinese in a high quality way?
If so, is it going to be extremely high quality or just a good starting point for a translator to tweak?
Are there open source tools (ideally in PHP) to do such a conversion?
Is the conversion better one way vs. the other (simplified -> traditional, or vice versa)?
Short answer: No, not reliably+high quality. I wouldn't recommend automated tools unless the market isn't that important to you and you can risk certain publicly embarrassing flubs. You may find some localization firms are happier to start with a quality simplified Chinese translation and adapt it to traditional, but you may also find that many companies prefer to start with the English source.
Longer answer: There are some cases where only the glyphs are different, and they have different Unicode code points. But there are also some idiomatic and vocabulary differences between the PRC and Taiwan/Hong Kong, and your quality will suffer if these aren't handled. Technical terms may be more or less problematic, depending on the era in which the terms became commonly used. Some of these issues may be caught by automated tools, but not all of them. Certainly, if you go the route of automatically converting things, make sure you get sign-off from QA teams based in each of your target markets.
Additionally, there are sociopolitical concerns as well. For example, you can use terms like "Republic of China" in Taiwan, but this will royally piss off the Chinese government if it appears in your simplified Chinese version (and sometimes your English version); if you have an actual subsidiary or partner in China, the staff may be arrested solely on the basis of subversive terminology. (This is not unique to China; Pakistan/India and Turkey have similar issues). You can get into similar trouble by referring to "Taiwan" as a "country."
As a native Hong Konger myself, I concur with @JasonTrue: don't do it. You risk angering and offending your potential users in Taiwan and Hong Kong.
BUT, if you still insist on doing so, have a look at how Wikipedia does it; here is one implementation (note license).
Is it possible to reliably automatically convert the text to Traditional Chinese in a high quality way?
Other answers are focused on the difficulties, but these are exaggerated. One thing is that a substantial portion of the characters are exactly the same. The second thing is that the 'simplified' forms are exactly that: simplified forms of the traditional characters. That means there is mostly a one-to-one relationship between traditional and simplified characters.
If so, is it going to be extremely high quality or just a good starting point for a translator to tweak?
A few things will need tweaking.
Are there open source tools (ideally in PHP) to do such a conversion?
Not that I am aware of, though you might want to check out the Google Translate API.
Is the conversion better one way vs. the other (simplified -> traditional, or vice versa)?
A few characters lost their distinction in the simplified script. For instance, 麵 (flour) was simplified to the same character as 面 (face, side). For this reason, traditional -> simplified would be slightly more accurate.
I'd also like to point out that traditional characters are not solely in use in Taiwan (they can be found in Hong Kong and occasionally even on the mainland).
I was able to find this and this. Need to create an account to download, though. Never used the site myself so I cannot vouch for it.
Fundamentally, many simplified Chinese characters collapse several distinct traditional characters (and meanings) into one form. No programming language in the world will be able to accurately convert simplified Chinese into traditional Chinese. You will just cause confusion for your intended audience (Hong Kong, Macau, Taiwan).
A perfect example of failed translation from simplified to traditional Chinese is the word "后". In the simplified form it has two meanings, "behind" or "queen". When you attempt to convert this back to traditional Chinese, however, there is more than one character choice: 後 "behind" or 后 "queen". One funny example I came across is a translator which converted "皇后大道" (Queen's Road) to "皇後大道", which literally means Queen's Behind Road.
Unless your translation algorithm is super smart, it is bound to produce errors. So you're better off hiring a very good translator who's fluent in both types of Chinese.
Short answer: Yes. And it's easy: you can first convert it from UTF-8 to Big5, then there are lots of tools to convert Big5 to GBK, then you can convert GBK back to UTF-8.
I know nothing about any form of Chinese, but by looking at the examples on this Wikipedia page I'm inclined to think that automatic conversion is possible, since many of the phrases seem to use the same number of characters and even some of the same characters.
I ran a quick test using a multibyte ord() function and I can't see any patterns that would allow the automatic conversion without the use of a (huge?) lookup translation table.
Traditional Chinese 漢字
Simplified Chinese 汉字
function mb_ord($string)
{
    // Decode the first UTF-8 character to its Unicode code point by
    // converting to big-endian UCS-4 and unpacking a 32-bit integer.
    if (is_array($result = unpack('N', iconv('UTF-8', 'UCS-4BE', $string))) === true) {
        return $result[1];
    }
    return false;
}
var_dump(mb_ord('漢'), mb_ord('字')); // 28450, 23383
var_dump(mb_ord('汉'), mb_ord('字')); // 27721, 23383
This might be a good place to start building the LUTT:
Simplified/Traditional Chinese Characters List
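A trivial sketch of what such a lookup-table conversion looks like (the entries here are placeholders; a real table has thousands of pairs, and one-to-many cases like 后/後 discussed above cannot be handled by a flat table at all):

// strtr() does longest-match substring replacement, so multi-character
// entries for regional vocabulary differences work as well.
$s2t = array(
    '汉' => '漢',
    '后' => '後', // wrong whenever 后 means "queen": the one-to-many problem
);
echo strtr('汉字', $s2t); // 漢字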
I got to this other linked answer that seems to agree (to some degree) with my reasoning:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc).

Are named entities in HTML still necessary in the age of Unicode aware browsers?

I did a lot of PHP programming in the last years, and one thing that keeps annoying me is the weak support for Unicode and multibyte strings (to be sure, natively there is none). For example, htmlentities seems to be a much-used function in the PHP world, and I found it absolutely annoying when you've put an effort into keeping every string localizable, only store UTF-8 in your database, only deliver UTF-8 web pages, etc. Suddenly, somewhere between your database and the browser there's this hopelessly naive function pretending every byte is a character, and it messes everything up.
I would just love to dump this kind of function; they seem totally superfluous. Is it still necessary these days to write '&auml;' instead of 'ä'? At least my Firefox seems perfectly happy to display even the strangest Asian glyphs as long as they're served in a proper encoding.
Update: To be more precise: are named entities necessary for anything other than escaping HTML-significant characters (as in "&lt;" for "<")?
Update 2:
@Konrad: Are you saying that, no, named entities are not needed?
@Ross: But wouldn't it be better to sanitize user input when it's entered, to keep my output logic free from such issues? (Assuming, of course, that reliable sanitizing on input is possible; but then, if it isn't, can it be on output?)
Named entities in "real" XHTML (i.e. served as application/xhtml+xml, rather than the more frequently used text/html compatibility mode) are discouraged. Aside from the five defined in XML itself (&lt;, &gt;, &amp;, &quot;, &apos;), they'd all have to be defined in the DTD of the particular DocType you're using. That means your browser has to explicitly support that DocType, which is far from a given. Numbered entities, on the other hand, obviously only require a lookup table to get the right Unicode character.
As for whether you need entities at all these days: you can pretty much expect any modern browser to support UTF-8. Therefore, as long as you can guarantee that the database, the markup and the web server all agree to serve that, ditch the entities.
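In PHP terms this usually means replacing htmlentities() with htmlspecialchars(), which escapes only the markup-significant characters and, given an explicit charset argument, leaves multibyte text alone; a sketch:

// With UTF-8 end to end, escaping &, <, >, " and ' is all that's needed.
$s = 'ä 漢字 & <tags>';
echo htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
// ä 漢字 &amp; &lt;tags&gt;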
If using XHTML, it's actually recommended not to use named entities ([citation needed]). Some browsers (Firefox …), when parsing this as XML (which they normally don't), don't read the DTD files and thus are unable to handle the entities.
As it's best practice anyway to use UTF-8 as the encoding if there are no compelling reasons to do otherwise, this only means that the creator of the documents needs a decent editor that can not only handle the documents but also provides a good way of entering the diverse glyphs. OS X doesn't really have this problem because most needed glyphs can be reached via "alt" keys, but Windows doesn't have this feature.
@Konrad: Are you saying that, no, named entities are not needed?
Precisely. Unless, of course, there are silly restrictions, e.g. legacy database drivers that choke on UTF-8 etc.
Safari seems to have issues with some glyphs but not others. Using entities may not be needed, but it's probably safest to do so. Of course, this is my opinion and not backed up by anything but my own observations.
