urlencode to lower case in PHP - php

In PHP, when url encoding using urlencode(), the outputted characters are in upper case:
echo urlencode('MyString'.chr(31));
//returns 'MyString%1F'
I need PHP to give me back 'MyString%1f' for the above example, without lower-casing any other part of the string, in order to be consistent with other platforms. Is there any way I can do this without having to run through the string one character at a time, working out if I need to change the casing each time?

Why would you want to do this at all? F or f, it shouldn't make any difference, as percent-encoding is meant to be case-insensitive. The only case I can think of is when creating hashes; personally I would then convert the whole string to either uppercase or lowercase, i.e. treat it as case-insensitive.
Anyway, if you really need to do this, it is relatively easy using preg_replace_callback:
$original = 'MyString%1F%E2%FOO%22';
$modified = preg_replace_callback('/%[0-9A-F]{2}/', function(array $matches)
{
return strtolower($matches[0]);
},
$original);
var_dump($modified);
This should give you:
string(21) "MyString%1f%e2%FOO%22"
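If this is needed in more than one place, the callback can be wrapped in a small helper. The following is a minimal sketch, assuming PHP 7.4 or later; the name urlencodeLower is made up for illustration and is not a built-in function.

```php
<?php
// Hypothetical helper: urlencode() with lower-case hex digits in the escapes.
function urlencodeLower(string $value): string
{
    // urlencode() only emits upper-case hex, so matching A-F is sufficient.
    // Only the %XX escapes are lower-cased; the rest of the string is untouched.
    return preg_replace_callback(
        '/%[0-9A-F]{2}/',
        static fn (array $m): string => strtolower($m[0]),
        urlencode($value)
    );
}

echo urlencodeLower('MyString' . chr(31)), "\n"; // MyString%1f
echo urlencodeLower("Caf\xC3\xA9 Bar"), "\n";    // Caf%c3%a9+Bar
```

Because the pattern only matches a percent sign followed by two hex digits, upper-case letters in the unencoded part of the string keep their case.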

Related

How to Convert Arabic Characters to Unicode Using PHP

I want to know how I can convert a word into Unicode exactly like:
http://www.arabunic.free.fr/
Does anyone know how to do that using PHP, considering that Arabic text may contain ligatures?
Thanks
Edit
I'm not sure what that "unicode" is, but I need to have each Arabic character as its equivalent machine number, considering that Arabic characters have different contextual forms depending on their position - see here:
http://en.wikipedia.org/wiki/Arabic_alphabet#Table_of_basic_letters
the same character in different position:
ب‎ | ـب‎ | ـبـ‎ | بـ‎
I think there must be a way to convert each Arabic character into its equivalent number, but how?
Edit
I still believe there's a way to convert each character to its form depending on its position.
Any ideas are appreciated.
All you need is a function called utf8Glyphs, which you can find in ArGlyphs.class.php. Download it from ar-php, and visit Ar-PHP for more information about the project and its classes.
It returns the word in reversed (visual) order, with each character replaced by its positional glyph.
Example of usage:
<?php
include('Arabic.php');
$Arabic = new Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
I assume you want to convert بهروز to \u0628\u0647\u0631\u0648\u0632. Take a look at http://hsivonen.iki.fi/php-utf8/: after calling utf8ToUnicode() on the string, all you have to do is convert the integers in the returned array to hex, make sure each one has 4 digits, and prefix it with \u. You can also get the same result using json_encode:
json_encode('بهروز') // returns "\u0628\u0647\u0631\u0648\u0632"
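The manual route described above (code point, then 4-digit hex, then a \u prefix) can be sketched without the external library. This is only an illustration: it assumes the mbstring extension and PHP 7.2 or later, and the function name toUnicodeEscapes is hypothetical.

```php
<?php
// Hypothetical helper: turns each character of a UTF-8 string into \uXXXX.
function toUnicodeEscapes(string $text): string
{
    $out = '';
    // The /u flag makes preg_split() split on characters rather than bytes.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // mb_ord() returns the code point; %04x pads it to at least 4 hex digits.
        $out .= sprintf('\\u%04x', mb_ord($char, 'UTF-8'));
    }
    return $out;
}

// The word from the answer above, written with escapes so the file encoding does not matter.
echo toUnicodeEscapes("\u{0628}\u{0647}\u{0631}\u{0648}\u{0632}"), "\n";
// \u0628\u0647\u0631\u0648\u0632
```

Unlike json_encode, this sketch does not split characters outside the Basic Multilingual Plane into surrogate pairs; for Arabic letters and their presentation forms that makes no difference.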
EDIT:
It seems you want the character codes of بب, where the first one differs from the second. All you have to do is apply the bidi algorithm to your text using fribidi_log2vis, then get the character codes in one of the ways I described above.
Here's an example:
$string = 'بب'; // \u0628\u0628
$bidiString = fribidi_log2vis($string, FRIBIDI_LTR, FRIBIDI_CHARSET_UTF8);
json_encode($bidiString); // \ufe90\ufe91
EDIT:
I just remembered that TCPDF has a bidi algorithm implemented in pure PHP, so if you cannot get the fribidi extension for PHP to work, you can use TCPDF (utf8Bidi is protected by default, so you need to make it public):
require_once('utf8.inc'); // http://hsivonen.iki.fi/php-utf8/
require_once('tcpdf.php'); // http://www.tcpdf.org/
$t = new TCPDF();
$text = 'بب';
$t->utf8Bidi(utf8ToUnicode($text)); // will return an array like array(0 => 65168, 1 => 65169)
Just set the direction of the element containing the Arabic text to "rtl" (right to left), then input correctly spelled Arabic, and the text will flow with all the ligatures you are looking for.
div {
direction:rtl;
}
On a side note, don't forget to read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Think about this: the "ba" (ب) Arabic letter is a "ba" no matter where it appears in the sentence.
Try this:
<?php
$string = 'a';
$expanded = iconv('UTF-8', 'UTF-32', $string);
$arr = unpack('L*', $expanded);
print_r($arr);
?>
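One thing to watch with the snippet above: the plain 'UTF-32' target may prepend a byte-order mark, and its byte order depends on the iconv implementation, so the first array element is not necessarily a character. A possible variant, assuming the iconv extension is available, is to request 'UTF-32BE' explicitly; the function name codePoints is made up for this sketch.

```php
<?php
// Returns the Unicode code points of a UTF-8 string as a list of integers.
function codePoints(string $utf8): array
{
    // UTF-32BE has a fixed byte order and no BOM, so every 4 bytes is one code point.
    $utf32 = iconv('UTF-8', 'UTF-32BE', $utf8);
    // 'N*' reads unsigned 32-bit big-endian integers; unpack() keys start at 1,
    // so array_values() re-indexes the result from 0.
    return array_values(unpack('N*', $utf32));
}

print_r(codePoints('a'));                // 97
print_r(codePoints("\u{0628}\u{FE91}")); // 1576 and 65169
```

This shows that the base letter U+0628 and its initial presentation form U+FE91 are distinct code points, which is the distinction the question is about.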
I totally agree with FloatBird about using Arabic.php, which you will find at ar-php as he said. The thing is, as of version 4 the class has been renamed from Arabic to I18N_Arabic, so for the code to work with Arabic.php version 4.0 you need to change it to:
<?php
include('Arabic.php');
$Arabic = new I18N_Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
Also note that you need to put the PHP code file inside the I18N folder.
Anyway, it works fantastically. Thanks again, FloatBird.
I had a similar problem when I wanted to store an object that had values in Arabic: the Arabic text was stored as Unicode escape sequences. The solution was as follows:
$detailsLog = $product->only(['name', 'unit', 'quantity']);
$detailsLog = json_encode($detailsLog, JSON_UNESCAPED_UNICODE);
$log->details = $detailsLog;
$log->save();
When you pass JSON_UNESCAPED_UNICODE as the second parameter of json_encode, the Arabic words are returned without escaping.
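The effect of the flag can be seen without any framework. The array below is made-up sample data, not taken from the answer above.

```php
<?php
// Sample data (hypothetical); the name is written with escapes so the
// example does not depend on the encoding of the source file.
$details = ['name' => "\u{0628}\u{0647}\u{0631}\u{0648}\u{0632}", 'quantity' => 3];

// Default behaviour: non-ASCII characters become \uXXXX escapes.
echo json_encode($details), "\n";
// {"name":"\u0628\u0647\u0631\u0648\u0632","quantity":3}

// With the flag, the UTF-8 text is written as-is.
echo json_encode($details, JSON_UNESCAPED_UNICODE), "\n";

// Both forms decode to the same value.
var_dump(json_decode(json_encode($details), true) === $details); // bool(true)
```

Both outputs are valid JSON; the flag only changes how readable the stored string is.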
I think you could try:
<meta charset="utf-8" />
If this does not work, use FloatBird's answer.

Replace characters in a string with their HTML coding

I need to replace characters in a string with their HTML coding.
Ex. The "quick" brown fox, jumps over the lazy (dog).
I need to replace the quotation marks with &quot; and the brackets with &#40; and &#41;.
I have tried str_replace, but I can only get 1 character to be replaced. Is there a way to replace multiple characters using str_replace? Or is there a better way to do this?
Thanks!
I suggest using the function htmlentities().
Have a look at the Manual.
PHP has a number of functions to deal with this sort of thing:
Firstly, htmlentities() and htmlspecialchars().
But as you already found out, they won't deal with ( and ) characters, because these are not characters that ever need to be rendered as entities in HTML. I guess the question is why you want to convert these specific characters to entities? I can't really see a good reason for doing it.
If you really do need to do it, str_replace() will do multiple string replacements, using arrays in both the search and replace parameters:
$output = str_replace(array('(', ')'), array('&#40;', '&#41;'), $input);
You can also use the strtr() function in a similar way:
$conversions = array('(' => '&#40;', ')' => '&#41;');
$output = strtr($input, $conversions);
Either of these would do the trick for you. Again, I don't know why you'd want to though, because there's nothing special about ( and ) brackets in this context.
While you're looking into the above, you might also want to look up get_html_translation_table(), which returns an array of entity conversions as used in htmlentities() or htmlspecialchars(), in a format suitable for use with strtr(). You could load that array and add the extra characters to it before running the conversion; this would allow you to convert all normal entity characters as well, at the same time.
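A minimal sketch of that idea, using the sentence from the question as input. The choice of HTML_SPECIALCHARS and ENT_QUOTES here is an assumption; HTML_ENTITIES would give the larger htmlentities() table instead.

```php
<?php
// Start from the table htmlspecialchars() uses (", &, ', <, >).
$table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Add the extra characters the question asks for.
$table['('] = '&#40;';
$table[')'] = '&#41;';

$input  = 'The "quick" brown fox, jumps over the lazy (dog).';
// strtr() with an array never re-scans its own output, so the "&" in the
// inserted entities is not encoded a second time.
$output = strtr($input, $table);

echo $output, "\n";
// The &quot;quick&quot; brown fox, jumps over the lazy &#40;dog&#41;.
```

Doing everything in one strtr() call is what avoids the double-encoding problem that chained str_replace() calls can run into.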
I would point out that if you serve your page with the UTF8 character set, you won't need to convert any characters to entities (except for the HTML reserved characters <, > and &). This may be an alternative solution for you.
You also asked in a separate comment about converting line feeds. These can be converted with PHP's nl2br() function, but could also be done using str_replace() or strtr(), so could be added to a conversion array with everything else.

PHP string manipulation help with timestamped files

I'm trying to work out the best way to remove a timestamp from a filename using PHP's string functions. The timestamp is separated from the rest of the filename by an underscore on the left and by the dot that starts the file extension on the right (e.g. myfile_12343434.jpg). I only ever want the text prior to the underscore, although its length can vary. What's the best way to deal with this? Thanks!
Edit: to leave the extension intact (including e.g. .gd2 and .JPEG), do this:
$new = preg_replace("/_\\d+(\\.[a-z0-9]+)\$/i","\\1",$orig);
This effectively removes only the "_123" part, in a not-so-pretty way. For the purists among us, here is a version with a lookahead assertion, which only removes the timestamp:
$new = preg_replace("/_\\d+(?=\\.[0-9a-z]+\$)/i","",$orig);
You could use this:
$filename = explode("_", $orig_filename)[0];
The best way is to use preg_replace() to specify an exact match. A good start is something like the following (which will also preserve the extension):
$new = preg_replace("/_\d+/","",$orig);
But since this is a unix timestamp, we can do better by specifying the length of the numeric portion that it will match on:
$new = preg_replace("/_\d{1,11}/","",$orig);
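To compare the approaches on concrete input, here is a small sketch. The function names and the second filename are made up for illustration; the first function reuses the lookahead pattern from the earlier answer.

```php
<?php
// Removes a trailing "_<digits>" timestamp but keeps the extension.
function stripTimestamp(string $filename): string
{
    return preg_replace('/_\d+(?=\.[0-9a-z]+$)/i', '', $filename);
}

// Returns only the text before the timestamp, without the extension.
function nameBeforeTimestamp(string $filename): string
{
    // The greedy (.+) backtracks to the last underscore that is followed
    // by digits and an extension, so underscores in the name survive.
    if (preg_match('/^(.+)_\d+\.[0-9a-z]+$/i', $filename, $m)) {
        return $m[1];
    }
    // No timestamp found: fall back to the name without its extension.
    return pathinfo($filename, PATHINFO_FILENAME);
}

echo stripTimestamp('myfile_12343434.jpg'), "\n";        // myfile.jpg
echo nameBeforeTimestamp('myfile_12343434.jpg'), "\n";   // myfile
echo nameBeforeTimestamp('my_file_12343434.JPEG'), "\n"; // my_file
```

The last call is where the explode() approach differs: splitting on the first underscore would return only "my".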

What regex pattern do I need for this?

I need a regex (to work in PHP) to replace American English words in HTML with British English words. So color would be replaced by colour, meters by metres and so on [I know that meters is also a British English word, but for the copy we'll be using it will always be referring to units of distance rather than measuring devices]. The pattern would need to work accurately in the following (slightly contrived) examples (although as I have no control over the actual input these could exist):
<span style="color:red">This is the color red</span>
[should not replace color in the HTML tag but should replace it in the sentence]
<p>Color: red</p>
[should replace word]
<p>Tony Brammeter lives 2000 meters from his sister</p>
[should replace meters for the word but not in the name]
I know there are edge cases where replacement wouldn't be useful (if his name was Tony Meter for example), but these are rare enough that we can deal with them when they come up.
HTML/XML should not be processed with regular expressions; it is really hard to write one that handles every case. But you can use the built-in DOM extension and process your string recursively:
# Warning: untested code!
function process($node, $replaceRules) {
    // Nodes such as comments have no children to walk.
    if (!$node->hasChildNodes()) {
        return;
    }
    foreach ($node->childNodes as $childNode) {
        if ($childNode instanceof DOMText) {
            // Rewrite the text in place; tags and attributes are never touched.
            $childNode->data = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->data
            );
        } else {
            process($childNode, $replaceRules);
        }
    }
}
$replaceRules = array(
    '/\bcolor\b/i' => 'colour',
    '/\bmeter\b/i' => 'metre',
);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();
I think you'd rather need a dictionary and maybe even some grammatical analysis in order to get this working correctly, since you don't have control over the input. A pure regex solution is not really going to be able to process this kind of data correctly.
So I'd suggest to first come up with a list of words that need to be replaced, those are not only "color" and "meter". Wikipedia has some information on the topic.
You do not want a regular expression for this. Regular expressions are by their very nature stateless, and you need some measure of state to be able to tell the difference between 'in a html tag' and 'in data'.
You want to use an HTML parser in combination with something like str_replace, or even better, a proper grammar dictionary, as Lucero suggests.
The second problem is easier - you want to replace when there are word boundaries around the word: http://www.regular-expressions.info/wordboundaries.html -- this will make sure you don't replace the meter in Brammeter.
The first problem is much harder. You don't want to replace words inside HTML tags, i.e. nothing between < and > characters. So your match must make sure that you last saw > or nothing, but never just <. This is either hard, requiring some combination of lookahead/lookbehind assertions, or just plain impossible with regular expressions.
A script implementing a state machine would work much better here.
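A very small version of such a two-state approach ("inside a tag" versus "in text") can be sketched by splitting the input on tags and only rewriting the pieces in between. This is an illustration under simplifying assumptions, not a robust solution: it assumes no ">" inside attribute values and does not special-case comments, script or style blocks. The function name britishize and the word list are made up.

```php
<?php
// Replaces whole words in text, leaving everything inside <...> untouched.
function britishize(string $html, array $words): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are text, odd indexes are the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // state: inside a tag, leave as-is
        }
        foreach ($words as $us => $uk) {
            $part = preg_replace_callback(
                '/\b' . preg_quote($us, '/') . '\b/i',
                // Keep a leading capital ("Color" becomes "Colour").
                static function (array $m) use ($uk): string {
                    return ctype_upper($m[0][0]) ? ucfirst($uk) : $uk;
                },
                $part
            );
        }
        $parts[$i] = $part;
    }
    return implode('', $parts);
}

$words = ['color' => 'colour', 'meters' => 'metres'];
echo britishize('<span style="color:red">This is the color red</span>', $words), "\n";
// <span style="color:red">This is the colour red</span>
echo britishize('<p>Color: red</p>', $words), "\n";
// <p>Colour: red</p>
echo britishize('<p>Tony Brammeter lives 2000 meters from his sister</p>', $words), "\n";
// <p>Tony Brammeter lives 2000 metres from his sister</p>
```

The word boundaries keep "Brammeter" intact, and the split keeps the style attribute intact, which covers the three examples from the question.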
You don't need to use a regex explicitly. You can try the str_replace function, or, if you need it to be case-insensitive, the str_ireplace function.
Example:
$str = "<p>Color: red</p>";
$new_str = str_ireplace('color', 'colour', $str);
You can pass an array with all the words that you want to search for, instead of the string. Note that the search string is matched literally (no delimiters are needed), so this also replaces matches inside tags and inside longer words.

How to handle diacritics (accents) when rewriting 'pretty URLs'

I rewrite URLs to include the title of user generated travelblogs.
I do this for both readability of URLs and SEO purposes.
http://www.example.com/gallery/280-Gorges_du_Todra/
The first integer is the id, the rest is for us humans (but is irrelevant for requesting the resource).
Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
My audience is generally English speaking, but since they travel, they like to include names like
Aït Ben Haddou
What is the proper way to translate this for display in a URL using PHP on Linux?
So far I've seen several solutions:
1. Just strip all non-allowed characters and replace spaces.
This has strange results:
'Aït Ben Haddou' → /gallery/280-At_Ben_Haddou/
Not really helpful.
2. Just strip all non-allowed characters, replace spaces, and leave the character code (stackoverflow.com), most likely because of the 'regex-hammer' used.
This gives strange results:
'tést tést' → /questions/0000/t233st-t233st
3. Translate to the 'nearest equivalent'.
'Aït Ben Haddou' → /gallery/280-Ait_Ben_Haddou/
But this goes wrong for German; for example, 'ü' should be transliterated as 'ue'.
For me, as a Dutch person, the 3rd result 'looks' the best.
I'm quite sure, however, that (1) many people will have a different opinion and (2) it is just plain wrong in the German example.
Another problem with the 3rd option is: how do I find all possible characters that can be converted to a 7-bit equivalent?
So the question is:
What, in your opinion, is the most desirable result (within technical limits)?
How can it be solved technically (to reach the desired result) with PHP?
Ultimately, you're going to have to give up on the idea of "correct", for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best — option #3.
Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)
Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä→ae, ë→e, ï→i, ö→oe, ü→ue.
Edit:
Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:
$text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
To me the third is most readable.
You could use a little dictionary, e.g. ï -> i and ü -> ue, to specify how you'd like various characters to be translated.
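A dictionary-based version could look roughly like this. It is a sketch only: the map is deliberately tiny and would need extending, the function name slugify is made up, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: transliterates via an explicit map, then keeps
// only ASCII letters and digits, joined by underscores.
function slugify(string $title): string
{
    // Small illustrative dictionary; German-style rules for the umlauts.
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'ï' => 'i',  'é' => 'e',  'è' => 'e',
    ];
    $ascii = strtr($title, $map);
    // Anything left that is not alphanumeric (spaces, punctuation,
    // unmapped characters) collapses into a single underscore.
    $ascii = preg_replace('/[^A-Za-z0-9]+/', '_', $ascii);
    return trim($ascii, '_');
}

echo slugify('Aït Ben Haddou'), "\n";  // Ait_Ben_Haddou
echo slugify('Gorges du Todra'), "\n"; // Gorges_du_Todra
echo slugify('Zürich'), "\n";          // Zuerich
```

The advantage over iconv's transliteration is that the result does not depend on the server's locale; the cost is that every character you care about has to be listed.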
As an interesting side note, on SO nothing seems to really matter after the ID -- this is a link to this page:
How to handle diacritics (accents) when rewriting 'pretty URLs'
Obviously the motivation is to allow title changes without breaking links, and you may want to consider that feature as well.
Nice topic, I had the same problem a while ago.
Here's how I fixed it:
function title2url($string = null) {
    // return if empty
    if (empty($string)) return false;
    // replace spaces by "-" and convert accented characters to HTML entities
    // (the charset is passed explicitly, so utf8_decode() is not needed)
    $string = htmlentities(str_replace(' ', '-', $string), ENT_QUOTES, 'UTF-8');
    // remove the accent from the letter: &eacute; -> e, &aelig; -> ae, &euro; -> E
    $string = preg_replace(array('#&([a-zA-Z]{1,2})(acute|grave|circ|tilde|uml|ring|elig|zlig|slash|cedil|strok|lig);#', '#&euro;#'), array('${1}', 'E'), $string);
    // now, everything but alphanumeric and -_ can be removed
    // also collapse repeated dashes
    $string = preg_replace(array('#[^a-zA-Z0-9\-_]#', '#[\-]{2,}#'), array('', '-'), html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
    return $string;
}
Here's how my function works:
Convert it to html entities
Strip the accents
Remove all remaining weird chars
"Now people can write titles containing any UTF-8 character, but most are not allowed in the URL."
On the contrary, most are allowed. See for example Wikipedia's URLs - things like http://en.wikipedia.org/wiki/Café (aka http://en.wikipedia.org/wiki/Caf%C3%A9) display nicely - even if StackOverflow's highlighter doesn't pick them out correctly :-)
The trick is reading them reliably across any hosting environment; there are problems with CGI and Windows servers, particularly IIS, for example.
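For reference, the percent-encoded form shown in the Wikipedia example is simply what PHP's rawurlencode() produces for the UTF-8 bytes of the title. A short sketch, with the title written using an escape so the file encoding does not matter:

```php
<?php
$title   = "Caf\u{00E9}";          // "Café" as UTF-8
$encoded = rawurlencode($title);   // each non-ASCII byte becomes %XX

echo $encoded, "\n";               // Caf%C3%A9

// Decoding gives back the original bytes unchanged.
echo rawurldecode($encoded) === $title ? "round-trips\n" : "mismatch\n"; // round-trips
```

Whether the server then hands the decoded path to the script intact is the environment-dependent part mentioned above.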
This is a good function:
function friendlyURL($string) {
setlocale(LC_CTYPE, 'en_US.UTF8');
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$string = str_replace(' ', '-', $string);
$string = preg_replace('/\\s+/', '-', $string);
$string = strtolower($string);
return $string;
}
