Mixing tabs and spaces without mixing tabs and spaces - php

I'm a Vim user who (due to my Python background) has grown accustomed to spaces instead of tabs. I'm working on a PHP project with another developer who uses Windows IDEs and really wants to use tabs. There really doesn't seem to be any hard and fast style guide for PHP that prefers a specific style, so I'm kind of stuck needing to deal with the tabs.
Is there some way, either through Git or Vim, that I can work with tabbed code and convert it to spaces while I'm editing it? Then perhaps on git add/commit it could be converted back to tabs?

The one really important thing is that the tracked content be standardized. You and the other developer are just going to have to agree on something there. Whichever of you wants to work in something other than the agreed-upon standard may end up with mixed results, because it's not always possible to cleanly convert back and forth.
I would really recommend just dealing with it. I have my own opinion about tabs and spaces, and I've worked on code using each. The best thing to do is just to set up your editor to match the standardized style, and go with it. In vim, turn off expandtab, set tabstop to what you like.
If you still want to try, there are two primary ways:
Use autocommands in Vim to convert on read/write. You'd probably need BufReadPost, BufWritePre, and BufWritePost. (For writing, you convert to the standard, let Vim write it, then convert back to the way you like to edit.) Make sure tabstop is set the way you like, then something like this (untested):
set tabstop=4
autocmd BufReadPost * set expandtab | retab
autocmd BufWritePre * set noexpandtab | retab!
autocmd BufWritePost * set expandtab | retab
The * is the filepattern that this will apply to; you may have to mess with that, or only add the autocommands for files within a certain directory, to make sure this doesn't happen for everything you edit. Note that this is dangerous; it will for example replace literal tab characters inside strings.
Use Git's smudge/clean filters. You can read about them in man gitattributes, or in Pro Git. You could use those to convert for editing, then back to the standard for committing. If there's never any weird indentation, it could be as simple as changing leading tabs to some number of spaces, and leading spaces to a fraction of that number of tabs. Do it with sed/perl/indent, whatever you're comfortable with.
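The leading-whitespace conversion described above could look roughly like this in Python. It is only a sketch that assumes regular indentation; the function names and the default width of 4 are made up for the example. Only leading whitespace is touched, so tabs inside string literals survive.

```python
import re


def tabs_to_spaces(text, width=4):
    """Editing direction: expand each leading tab into `width` spaces."""
    def expand(match):
        return ' ' * (width * len(match.group(0)))
    # ^\t+ with MULTILINE only matches tabs at the very start of a line.
    return re.sub(r'^\t+', expand, text, flags=re.MULTILINE)


def spaces_to_tabs(text, width=4):
    """Commit direction: collapse every full group of `width` leading spaces into a tab."""
    def collapse(match):
        tabs, rest = divmod(len(match.group(0)), width)
        # Leftover spaces (alignment) are kept after the tabs.
        return '\t' * tabs + ' ' * rest
    return re.sub(r'^ +', collapse, text, flags=re.MULTILINE)
```

Each function could be wrapped in a small script that reads stdin and writes stdout, then registered as the smudge and clean commands of a filter that .gitattributes assigns to the PHP files.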

coding style for HTML i18n/l10n with variables

I have been working on web development for quite some time now, and I have always struggled to find a clean solution for a problem I run into during i18n of HTML strings, mostly ones containing anchor tags.
First off, let me show you a typical problematic example. This is a frequently encountered string in HTML templates:
Welcome to my site. Check out our cool <a href="/products">products</a>
you should not miss.
How do I translate this string while still having the following properties:
Dynamic generation of the URL (e.g. using a router)
A translatable string that is as readable as possible (so translators can do it w/o looking at the code)
Because the string contains HTML, I probably want to escape some parts I insert (e.g. the URL), so I don't make myself vulnerable to XSS if this URL contains user input
It should look as good as possible in the code as well
How do you translate your strings when they contain dynamic content and HTML?
When I want to apply i18n to that string, I probably turn to gettext or a framework function. Since I come from the PHP/Joomla! world, I used JText::_ before, which acts very similarly to gettext. In Python I now use Babel. Both share the same problem, and the tools for other languages probably do, too. All code I share here is my way of doing it in Python, more specifically in my Mako templates.
Of course, the problem is: There is HTML in our string to be translated (and a URL, for that matter). Here are my options, which I will each explain afterwards:
Passing the raw string to gettext
Splitting the text into three bits
Surrounding linked word with variables
Using one variable that gets built separately
Passing the raw string to gettext
This one seems the first approach one might take, if not aware of the implications.
Approach 1:
_('Welcome to my site. Check out our cool <a href="/products">products</a> \
you should not miss.')
For this msgid you could now translate it, keeping the HTML intact.
Advantages:
This looks very clean in the code and is easy to understand
If the translator is keeping the HTML intact this does not produce any problems
Disadvantages:
The translator has to know at least a little HTML
The string is completely inflexible, e.g. if the URL changes, all translations have to be adjusted
It does not allow for dynamic generation of the URL using something like a router
So as a conclusion, while I used this I quickly hit my limit. My next idea was:
Splitting the text into three bits
Approach 2:
_('Welcome to my site. Check out our cool ') + '<a href="/products">' +\
_('products') + '</a>' + _(' you should not miss.')
Advantages:
The URL is completely flexible now
Only actual text for the translators
Disadvantages:
Splits a sentence into three parts
Translator has to know which parts relate together or he might not be able to produce meaningful sentences
Not very pretty in code
The msgid may be a single word, which can cause problems (beware of contexts) but can be fixed.
I used this technique for some time because I did not know about printf style strings in PHP (which I used back then). Because this looked so ugly, I tried a different approach:
Surrounding linked word with variables
Approach 3:
_('Welcome to my site. Check out our cool %sproducts%s you should not miss.') % \
('<a href="/products">', '</a>')
Advantages:
Single string to translate, a complete sentence
Translator gets the context right from the string
Code is not that ugly
Disadvantages:
Translator has to take care that no %s goes missing (this can be confusing, as it reads like sproducts)
Introduces two format string variables for every URL, one being only </a>
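The missing-%s risk can be caught mechanically before a catalog ships. A minimal sketch, assuming printf-style placeholders only; the helper names and the German msgstr are invented for the example and are not part of gettext or Babel:

```python
import re

# Matches %s, %d, %i, %f and named forms like %(name)s; %% is a literal percent sign.
PLACEHOLDER = re.compile(r'%(?:\(\w+\))?[sdif]|%%')


def placeholders(text):
    """Return the format placeholders in text, ignoring escaped percent signs."""
    return [p for p in PLACEHOLDER.findall(text) if p != '%%']


def same_placeholders(msgid, msgstr):
    """True if the translation uses exactly the placeholders of the original."""
    return sorted(placeholders(msgid)) == sorted(placeholders(msgstr))


msgid = 'Welcome to my site. Check out our cool %sproducts%s you should not miss.'
good = 'Willkommen. Sehen Sie sich unsere %sProdukte%s an.'   # hypothetical translation
bad = 'Willkommen. Sehen Sie sich unsere %sProdukte an.'      # closing %s lost
```

Running same_placeholders over every msgid/msgstr pair would flag the second translation without anyone having to read the HTML.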
Using one variable that gets built separately
From here I tried some different approaches, but I finally came up with the one I currently use (which might look like overkill, but I prefer it for now).
Approach 4:
_('Welcome to my site. Check out our cool %s \
you should not miss.') % ('<a href="%s">%s</a>' % ('/products', _('products')))
Let me take some time to explain the reasoning behind this (seemingly lunatic) approach. First of all, the actual translation string looks like this:
_('Welcome to my site. Check out our cool ${product_url} \
you should not miss.')
Which leaves a translator with the information what is inserted there (that's the translationstring version). Second, I want to ensure that I can manually escape all parts that are inserted into the HTML. While Mako provides automatic escaping, this does not make sense in a statement like this:
${'<a href="...">This is a url</a>'}
It would destroy the URL, so I have to apply the |n filter to remove any escaping. However, if any argument of that is user supplied, it also opens the page up to XSS, which I want to prevent. To avoid any risk, I can just escape every input (the same way good template engines do by default) and then remove Mako's escaping for this one string. So
'<a href="%s">%s</a>' % ('/products', _('products'))
actually looks like
'<a href="%s">%s</a>' % (escape('/products'), _('products'))
where escape is imported from markupsafe (see Markupsafe).
The final part now is dynamic URLs through a router: request.route_url('products_view')
To combine each of these possibilities, I have to produce something very ugly (note that this uses the mapping keyword argument of translationstring (translationstring.TranslationString)), but it combines all the benefits I want/need from translation:
Final result:
_('Welcome to my site. Check out our cool ${product_url} \
you should not miss.', mapping={'product_url': '<a href="%s">%s</a>' %\
(escape(request.route_url('products_view')), _('products'))})
Advantages:
Full HTML escaping
Fully dynamic
Very good msgids for translation
Disadvantages:
An extremely ugly construct in the template (or the program anyway)
The lingua extractor doesn't catch _('products'), so we have to extract that manually
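One way to contain the ugliness is to move the construct out of the template into a small helper. The following is only a sketch under several assumptions: `_` is a stand-in for the real gettext function, the standard library's html.escape and string.Template stand in for markupsafe and translationstring, and the helper names are invented for the example.

```python
from html import escape
from string import Template


def _(msgid):
    """Stand-in for gettext; a real setup would look the msgid up in a catalog."""
    return msgid


def link(url, label):
    """Build an anchor tag with both the URL and the label HTML-escaped."""
    return '<a href="%s">%s</a>' % (escape(url, quote=True), escape(label))


def translate_with_link(msgid, url, label):
    """Translate the sentence, then substitute the pre-escaped link for ${product_url}."""
    return Template(_(msgid)).substitute(product_url=link(url, _(label)))


rendered = translate_with_link(
    'Welcome to my site. Check out our cool ${product_url} you should not miss.',
    '/products?ref="x"',   # deliberately contains quotes to show the escaping
    'products',
)
```

The translated sentence itself is inserted unescaped here, so this assumes the catalog is trusted; the link label still has to be marked for extraction separately.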
So that is it; this concludes my approaches to this problem. Maybe I am doing something way too complicated and you have much better ideas, or maybe this is a problem that depends on the specific type of translatable text (and one has to choose the right approach).
Did I miss any solution or anything that would improve my approach?

How to force line-breaks on &nbsp;?

Sometimes text on my pages looks very strange, real example:
trained&nbsp;professionals&nbsp;and&nbsp;paraprofessionals&nbsp;coming&nbsp;together
...While the parent div is quite narrow so the text is just sticking out of it.
And it looks quite strange, because &nbsp; actually represents a space.
So, I wonder if it's possible to make the browser treat these characters as actual spaces and break the line where necessary, without actually replacing them?
EDIT
Why is blind replacing a problem?
Because &nbsp; may be needed sometimes.
Consider the following example:
Ranks:<br>
Marshall<br>
Leutenant<br>
Sergeant
If I just use a preg_replace on them it would look differently in the end.
(I would also consider some suggestions if you have any ideas on replacing them smartly (for php platform) If you could think of some algorithm that wouldn't affect formatting.)
By definition, &nbsp; is a non-breaking space. Its very purpose is to not be broken across line endings. If this is not what you intend, then I suggest fixing the HTML instead of trying to force the browser into non-standard behaviour.
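If fixing the HTML has to be done in bulk, one possible rule is to replace only a single &nbsp; that sits directly between two word characters, and to leave runs of them (typically used for indentation) alone. A sketch in Python; the function name is made up, and the same pattern should also work with preg_replace, since PCRE supports these lookarounds:

```python
import re

# A lone &nbsp; with a word character on each side. A preceding ";" (from another
# &nbsp;) or ">" (from a tag such as <br>) is not a word character, so runs and
# leading entities are left untouched.
LONE_NBSP = re.compile(r'(?<=\w)&nbsp;(?=\w)')


def soften_nbsp(html):
    """Turn isolated &nbsp; entities between words into ordinary spaces."""
    return LONE_NBSP.sub(' ', html)
```

This deliberately ignores cases such as a &nbsp; next to punctuation, or the literal U+00A0 character, which would need their own rules.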

Handle arabic string in PHP with Eclipse

I am currently working on the localization of a website, which was first in english only. A third party company did the translations, and provided us with an excel file with the translations. Which I successfully converted to a PHP array that I can use in my views. I'm using Eclipse for Windows to edit my PHP files.
All is fine, except that I need to add variables in my strings, ex:
'%1 is now following %2'
In arabic I was provided with strings like this one:
'_______الآن يتتبع _______'
I find that replacing __ with %1 and %2 is incredibly difficult because the Arabic part is a right-to-left string, and the %1, %2 will be considered left-to-right or right-to-left, and I'm not sure which. I rarely get the results I expect for the order of my parameters, because %1 will sometimes go to the left of the string, sometimes to the right, depending on where I start to type. Copy-pasting the replacement strings can also have the same strange effects.
Most of the times I end up with a string like this one:
%2الآن يتتبع %1
The %1 should be on the right-hand side, the %2 on the left-hand side. The %1 is obviously considered a right-to-left string because the % appears on the right. The %2 is considered left-to-right.
I'm sure someone has had this issue before. Is there any way it can be done easily in Eclipse? Or using a smarter editor for Arabic issues? Or maybe it is a Windows issue? Is there a workaround?
UPDATE
I also tried splitting my string into multiple strings, but this also changes the order of the parameters:
'%1' . 'الآن تتبع' . '%2'
UPDATE 2
It seems that changing the replacement string makes things better. It is probably linked to how numbers are handled in Arabic strings. This string was edited in Eclipse without any problem. The order of the parameter is correct, the string is handled correctly by PHP:
'{var2} الآن يتتبع {var1}'
If no other solution is found, this could be a good alternative.
Being an Arabic speaker, I get lots of localization tasks. Although I haven't faced this particular problem, I've had many left-to-right/right-to-left issues while editing. I've had success working with Notepad++.
So here's what I usually do when I want to edit Arabic text
Open empty Notepad++ *
Set encoding to UTF-8 (Encoding -> Encoding in UTF-8)
Enable RTL mode (View -> Text Direction RTL)
Paste your strings
*: for some reason, whenever I open an already existing file things go bananas. So maybe I'm being superstitious, but this has always worked for me.
Update: First time I did this I was skeptical because the strings looked wrong, but then I did this:
print_r(str_split($string));
and I saw that they're indeed in the correct order.
@Adnan helped me realize and later confirmed that there are issues when mixing Latin numbers with Arabic text.
Based on that conclusion, the solution is simply to stop using %1, %2, %3, ... as placeholders. I will be using more descriptive keywords instead, for example {USER}, {ALBUM}, {PHOTO}, ...
This shows the expected result in the PHP file and it is easily editable:
'ar' => '{USER} الآن يتابع {ALBUM}'
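To illustrate how such named placeholders get filled, and how to confirm their stored (logical) order regardless of how an editor draws the line, here is a small sketch in Python. The function name and the sample values are invented for the example; in PHP, strtr with an array of replacements would play the same role.

```python
import re


def fill(template, values):
    """Replace {NAME} placeholders with values; unknown names are left as-is."""
    return re.sub(
        r'\{([A-Z_]+)\}',
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )


template = '{USER} الآن يتابع {ALBUM}'

# Logical order is what the code sees, whatever the editor displays.
user_first = template.index('{USER}') < template.index('{ALBUM}')

result = fill(template, {'USER': 'Adnan', 'ALBUM': 'Holidays'})
```

Because the placeholders contain no digits, there is no number run for the bidirectional algorithm to reorder, which is why the line stays easy to edit.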
I would prefer the original Notepad for this kind of task.
Open Notepad, make sure you're in LTR mode
Type %1
Change mode to RTL by pressing CTRL + SHIFT
Paste the arabic string into the editor.
Revert back to LTR by pressing CTRL + SHIFT again.
Type %2
Select all with CTRL + A and copy with CTRL + C
Paste into the IDE. It should look weird but execute as expected.
Reason for using Notepad: more complex editors such as Notepad++, Sublime, Coda (Mac), and some IDEs (in your case Eclipse) may not use the correct encoding, while Notepad is simple yet works well for multilingual tasks.

Is the TAB character bad in source code? [closed]

I'm pretty familiar, I guess, with both the Zend and PEAR PHP coding standards, and at my previous two employers no TAB characters were allowed in the code base, on the claim that they might be misinterpreted by the build script or something (something like that, I honestly can't remember the exact reason). We all set up our IDEs to expand TABs to 4 spaces.
I'm very used to this, but now my newest employer insists on using TABs and not spaces for indentation. I suppose I shouldn't really care since I can just tell PHP Storm to use the TAB char when I hit the Tab key, but I do. I want spaces and I'd like a valid argument for why spaces are better than TABs.
So, personal preferences aside, my question is, is there a legitimate reason to avoid using TABs in our code base?
Keep in mind this coding standard applies to PHP and JavaScript.
Tabs are better
Clearer to the reader what level each piece of code is on; spaces can be ambiguous, especially when it's unclear whether you're using 2-space or 4-space indentation
Conceptually makes more sense. When you indent, you expect a tab and not spaces. The document should represent what you did on your keyboard, and not in-document settings.
Tabs can't be confused with spaces in extended lines of code. Especially when word wrap is enabled, there could be a space in a wrapped series of words. This could obviously confuse readers. It would slow down pair programming.
Tabs are a special character. When viewing of special characters is enabled on your IDE, levels can be more easily identified.
Note to all coders: you can easily switch between tabs and spaces using any of JetBrains' editors (ex. PHPStorm, RubyIDE, ReSharper, IntelliJIDEA, etc.) by simply pressing CTRL + ALT + L on Windows, Mac, or Linux.
is there a legitimate reason to avoid using TABs in our code base?
I have consulted at many a company and not once have I run into a codebase that didn't have some sort of mixture of tabs and spaces among various source files and not once has it been a problem.
Preferred? Sure.
Legitimate, as in accordance with established rules, principles, or standards? No.
Edit
All I really want to know is if there's nothing wrong with TABs, why
would both Zend and PEAR specifically say they are not allowed?
Because it's their preference. A convention they wish to be followed to keep uniformity (along with things like naming and brace style). Nothing more.
Spaces are better than tabs because different editors and viewers, or different editor settings, might cause tabs to be displayed differently. That's the only legitimate reason for avoiding tabs if your programming language treats tabs and spaces the same. If some tool chokes on tabs, then that tool is broken from the language point of view.
Even when everybody in your team sets their editor to treat tabs as four spaces, you'll get a different display when you have to open up your source code in some tool that doesn't.
The most important thing to worry about is being consistent about always using the same indentation scheme: having a confused mix of tabs and spaces is living hell, and is worse than either pure tabs or pure spaces. Therefore, if the rest of the project is using tabs you should use them too.
Anyway, there isn't a clear winner on tabs vs spaces. Space supporters say that using only spaces for everything is a simpler rule to enforce, while tab supporters say that using tabs for indentation and spaces for alignment lets each developer display the tab width they find most comfortable.
In the end, tabs-vs-spaces should not be a big deal. The only time I have seen people argue that one of the alternatives is strictly better than the other is in indentation-sensitive languages, like Python or Haskell. In these, mixing tabs and spaces can change the program semantics in hard-to-see ways, instead of only making the source code look weird.
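Consistency is easy to check automatically. A rough sketch in Python, with invented function names, that counts how each indented line is indented and flags a file that uses more than one scheme:

```python
def indentation_report(source):
    """Count indented lines by the kind of leading whitespace they use."""
    counts = {'tabs': 0, 'spaces': 0, 'mixed': 0}
    for line in source.splitlines():
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if not indent:
            continue  # unindented lines say nothing about the scheme
        if '\t' in indent and ' ' in indent:
            counts['mixed'] += 1
        elif '\t' in indent:
            counts['tabs'] += 1
        else:
            counts['spaces'] += 1
    return counts


def is_consistent(source):
    """True if every indented line uses only tabs, or every one uses only spaces."""
    report = indentation_report(source)
    return report['mixed'] == 0 and (report['tabs'] == 0 or report['spaces'] == 0)
```

Note that a project following the "tabs for indentation, spaces for alignment" style would be reported as mixed by this check, so the rule would need relaxing there.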
Ever since my first CS class, tabs have always been taboo. Reason being, tabs are basically like variables. Different IDE's can define a TAB as a different number of spaces. Speaking from a Visual Studio/NetBeans/DevC++ perspective, all have the capacity to change the 'definition' of a TAB based on number of desired spaces. So if you have 4 spaces defined, there is no way that you can know if my IDE says 3 spaces or 5 spaces. So if anyone happens to use a space-based indentation style and someone else uses TABS, the formatting can get all jacked up.
As a counter-point, however, if the 'standard' is to always use tabs, then it really wouldn't matter since the formatting will all appear the same - regardless of the number of defined spaces. But all it takes is one person to use a space and the formatting can look horrid and get really confusing. This can't happen when using spaces. Also, what happens if you don't want to use the same spacing between functions/methods, etc? What if you like using 4 spaces in some cases and only 2 in other cases?
I have seen build scripts that parse source code and generate documentation or even other code. These kinds of scripts usually depend on the code being in an expected format, and frequently that means using spaces (or sometimes tabs) exclusively. Perhaps these scripts could be modified to be more robust by accepting both tabs and spaces, but frequently you are stuck with what you've got. In that kind of environment, consistent formatting becomes more important.

Is it possible to write a regex which checks if a string (javascript & php code) is minified?

Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space chars before and after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high-accuracy regex; I just need one which checks if at least 100 chars of the string look like minified code.
Thanks in advance.
PURPOSES: web malware scanner
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
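For reference, the same check written in Python, as a sketch. fullmatch takes the place of the ^ and $ anchors, and the function name is made up for the example:

```python
import re

# One or more non-newline characters, optionally followed by a single line ending.
SINGLE_LINE = re.compile(r'[^\n\r]+(\r\n?|\n)?')


def is_single_line(code):
    """True if code has no line breaks except possibly one at the very end."""
    return SINGLE_LINE.fullmatch(code) is not None
```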
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No. The syntax/code and its intention don't change, and some people who are very familiar with PHP and/or JS will write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string though this would also be unreliable since for some stuff you simply need whitespace, like x instanceof y heh. Also not all code is minified and cramped into a single row (see jQuery UI) so you can't really count on that either....
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell if it was minified or just written like that by hand (this probably only applies to smaller scripts). But you can check whether it contains unnecessary whitespace.
Take a look at an open source obfuscator/minifier and see what rules it uses to remove whitespace. Validating whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
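The cruder "whitespace vs. document content" idea can be sketched without a parser. The following Python is a heuristic only: the function name, the 100-character window (taken from the question) and the 5% threshold are assumptions, and string literals are not excluded.

```python
def looks_minified(code, window=100, max_ws_ratio=0.05):
    """True if some run of `window` characters contains almost no whitespace."""
    if len(code) < window:
        return False
    for start in range(len(code) - window + 1):
        chunk = code[start:start + window]
        whitespace = sum(1 for ch in chunk if ch.isspace())
        if whitespace / window <= max_ws_ratio:
            return True
    return False


minified = "$a=array();if(is_array($a)){echo'ok';}" * 3
readable = "$a = array();\nif (is_array($a)) {\n    echo 'ok';\n}\n" * 3
```

A long string literal or a base64 blob in otherwise readable code would also trip this check, so the result is best treated as a hint for a closer look rather than a verdict.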
