translating letters vs special characters - php

I've got a bunch of data which could be mixed characters, special characters, and 'accent' characters, etc.
I've been using php inconv with translit, but noticed today that a bullet point gets converted to 'bull'. I don't know what other characters like this don't get converted or deleted.
$, *, %, etc do get removed.
Basically what I'm trying to do is keep letters, but remove just the 'non-language' bits.
This is the code I've been using
$slugIt = #iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
$slugIt = preg_replace("/[^a-zA-Z0-9 -]/", "", $slugIt);
of course, if I move the preg_replace to be above the inconv function, the accent characters will be removed before they are translated, so that doesn't work either.
Any ideas on this? or what non-letter characters are missed in the TRANSLIT?
---------------------Edited---------------------------------
Strangely, it doesn't appear to be the TRANSLIT which is changing a bullet to 'bull'. I commented out the preg-replace, and the 'bull' has been returned to a bullet point. Unfortunately I'm trying to use this to create readable urls, as well as a few other things, so I would still need to do url encoding.

Try adding the /u modifier to preg_replace.
See Pattern Modifers

you can try using the POSIX Regex:
$slugIt = ereg_replace('[^[:alnum:] -]', '', $slugIt);
$slugIt = #iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
[:alnum:] will match any alpha numeric character (including the ones with accent).
Take a look at http://php.net/manual/en/book.regex.php for more information on PHP's POSIX implementation.

In the end this turned out to be a combination of wrong character set in, AND how windows handles inconv.
First of all, i had an iso-8859 character set going in, and even though I was defining utf-8 in the head of the document, php was still treating the characterset as ISO.
Secondly, when using iconv in windows, you cannot apparently combine ASCII//TRANSLIT//IGNORE, which thankfully you can do in windows.
Now on linux, all accented characters are translated to their base character, and non-alpha numerics are removed.
Here's the new code
$slugIt = #iconv('iso-8859-1', 'ASCII//TRANSLIT//IGNORE', $slugIt);
$slugIt = preg_replace("/[^a-zA-Z0-9]/", "", $slugIt);

Related

PHP mb_ereg_replace deleting additional characters beyond each unicode search target

I have text with unicode soft hyphens (U+00AD) that I wish to remove. I'm attempting to do this with PHP's mb_ereg_replace() function. It's finding the soft hyphens, but the replace procedure is removing both the soft hyphens and the first character that immediately follows them.
My code is:
$text_cleansed = mb_ereg_replace('[\u00AD]', '', $text);
For example, if $text were "en-dur-ance" (with the hyphens shown here being the invisible unicode soft hyphens), then $text_cleansed would be "enurnce"; -d and -a have been removed, when for each soft hyphen only the soft hyphen should be removed. mb_ereg_replace has therefore removed each soft hyphen and the first character that follows it. Surely, I must be feeding incorrect arguments into the function.
What is causing this behavior, and what would be the correct arguments for the function?
PHP regex does not support \u notation. The symbols in your regex are treated as separate entities, not as a hex notation (as '\u', '0', 'A', 'D').
Use preg_replace with \x{} notation with /u modifier (necessary to interpret the pattern and the input string as Unicode strings):
preg_replace('~\x{00AD}~u', '', $s)
See IDEONE demo
What stribizhev has written is correct. It's a matter of syntax that isn't clear in the PHP mb_ereg documentation. I'm answering after already marking his second response as the answer, because his third response has the specific answer to my original question (re: multi-byte strings--mb--as opposed to regular strings).
1) If non-multi-byte strings are used, preg_replace('~\x{00AD}~u', '', $text) is the solution.
2) If multi-byte strings are used, mb_ereg_replace('[\x{00AD}]', '', $text) is the solution.
It's a matter of syntax, obscure to those not experienced in regex work. With luck, this will help someone else with a similar problem.
If your string is UTF8 encoded, you don't need to take in account that your string is multi-byte to remove a character from the ASCII range (00-7F) since these bytes aren't used to compose other multi-byte characters. In this case, you can use str_replace:
$result = str_replace('-', '', $text);

Regex to reject non-english characters?

Is there a simple regex that will catch all non-english characters? It would need to allow common punctation and symbols, but no special characters such as Russian, Japanese, etc.
Looking for something to work in PHP.
Since in your comment your referring to addresses, they might contain digits too. So:
preg_replace('/[^[:alpha:][:punct:][:digit:]]/u', utf8_encode($input), '');
Should replace your unwanted characters. The [:alpha:] class will only work, if your locale is set up correctly, though. If, for example, it's set to de_DE, not only "a" through "z" are regarded characters, but also "exotics" like "ä", "ö", "è", and the like.
Also, since you don't want "Russian, Japanese, etc.", note the u modifier. The input has to be UTF-8 encoded in order to not break it and give you wrong results.
Such as this one [^A-Za-z0-9\,\.\-]?
This q/a seemed to handle it: PHP Validate string characters are UK or US Keyboard characters
use hex codes, e.g. this cleans out all non-ascii characters as well as line endings, and replaces them with spaces. space (\x20) is deliberately left out of the range so that consecutive runs of spaces and/or special chars are replaced with a single space.
$clean = trim(preg_replace('/[^\x21-\x7E]+/', ' ', $input));
if (strlen($str) == strlen(utf8_decode($str))) {
}

PHP: URL friendly strings [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
How to handle diacritics (accents) when rewriting 'pretty URLs'
I want to replace special characters, such as Å Ä Ö Ü é, with "normal" characters (those between a-z and 0-9). And spaces should certainly be replaced with dashes, but that's not really a problem.
In other words, I want to turn this:
en räksmörgås
into this:
en-raksmorgas
What's the best way to do this?
Thank you in advance.
You can use iconv for the string replacement...
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
Basically, it'll transliterate the characters it can, and drop those it can't (that are not in the ASCII character set)...
Then, just replace the spaces with str_replace:
$string = str_replace(' ', '-', $string);
Or, if you want to get fancy, you can replace all consecutive white-space characters with a single dash using a simple regex:
$string = preg_replace('/\\s+/', '-', $string);
Edit As #Robert Ros points out, you need to set the locale prior to using iconv (Depending on the defaults of your system). Just execute this line prior to the iconv line:
setlocale(LC_CTYPE, 'en_US.UTF8');
Check out http://php.net/manual/en/function.strtr.php
<?php
$addr = strtr($addr, "äåö", "aao");
?>
A clever hack often used for this is calling htmlentitites, then running
preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $str);
to get rid of the diacritics. A more complete (but often unnecessarily complicated) solution is using a Unicode decomposition algorithm to split diacritics, then dropping everything that is not an ASCII letter or digit.

Remove control characters from PHP string

How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way to much. Is there a way to remove only
control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
(Note the single quotes: with double quotes the use of \x00 causes a parse error, somehow.)
The line feed and carriage return (often written \r and \n) may be saved from removal like so:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated in PHP >= 5.3.0 and removed in PHP >= 7.0.0!, please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
for more info on \p{C} see http://www.regular-expressions.info/unicode.html#category
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
A extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this Regex...
/[^\PCc^\PCn^\PCs]/u
Like this...
$text = preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $text);
TLDR Explanation
^\PCc : Do not match control characters.
^\PCn : Do not match unassigned characters.
^\PCs : Do not match UTF-8-invalid characters.
Working Demo
Simple demo to demonstrate: IDEOne Demo
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
^\PC : Match only visible characters. Do not match any invisible characters.
^\PCc : Match only non-control characters. Do not match any control characters.
^\PCc^\PCn : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
^\PCc^\PCn^\PCs : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
^\PCc^\PCn^\PCs^\PCf : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This regex will match anything visible, given in both its short-hand and long-hand form...
\PL\PM\PN\PP\PS\PZ
\PLetter\PMark\PNumber\PPunctuation\PSymbol\PSeparator
Normally, \p indicates that it's something we want to match and we use \P (capitalized) to indicate something that does not match. But PHP doesn't have this functionality, so we need to use ^ in the regex to do a manual negation.
A simpler regex then would be ^\PC, but this might be too restrictive in deleting invisible formatting. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\PL or \PLetter: any kind of letter from any language.
\PLl or \PLowercase_Letter: a lowercase letter that has an uppercase variant.
\PLu or \PUppercase_Letter: an uppercase letter that has a lowercase variant.
\PLt or \PTitlecase_Letter: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\PL& or \PCased_Letter: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\PLm or \PModifier_Letter: a special character that is used like a letter.
\PLo or \POther_Letter: a letter or ideograph that does not have lowercase and uppercase
\PM or \PMark: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\PMn or \PNon_Spacing_Mark: a character intended to be combined with another
character without taking up extra space (e.g. accents, umlauts, etc.).
\PMc or \PSpacing_Combining_Mark: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\PMe or \PEnclosing_Mark: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\PZ or \PSeparator: any kind of whitespace or invisible separator.
\PZs or \PSpace_Separator: a whitespace character that is invisible, but does take up space.
\PZl or \PLine_Separator: line separator character U+2028.
\PZp or \PParagraph_Separator: paragraph separator character U+2029.
\PS or \PSymbol: math symbols, currency signs, dingbats, box-drawing characters, etc.
\PSm or \PMath_Symbol: any mathematical symbol.
\PSc or \PCurrency_Symbol: any currency sign.
\PSk or \PModifier_Symbol: a combining character (mark) as a full character on its own.
\PSo or \POther_Symbol: various symbols that are not math symbols, currency signs, or combining characters.
\PN or \PNumber: any kind of numeric character in any script.
\PNd or \PDecimal_Digit_Number: a digit zero through nine in any script except ideographic scripts.
\PNl or \PLetter_Number: a number that looks like a letter, such as a Roman numeral.
\PNo or \POther_Number: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\PP or \PPunctuation: any kind of punctuation character.
\PPd or \PDash_Punctuation: any kind of hyphen or dash.
\PPs or \POpen_Punctuation: any kind of opening bracket.
\PPe or \PClose_Punctuation: any kind of closing bracket.
\PPi or \PInitial_Punctuation: any kind of opening quote.
\PPf or \PFinal_Punctuation: any kind of closing quote.
\PPc or \PConnector_Punctuation: a punctuation character such as an underscore that connects words.
\PPo or \POther_Punctuation: any kind of punctuation character that is not a dash, bracket, quote or connector.
\PC or \POther: invisible control characters and unused code points.
\PCc or \PControl: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\PCf or \PFormat: invisible formatting indicator.
\PCo or \PPrivate_Use: any code point reserved for private use.
\PCs or \PSurrogate: one half of a surrogate pair in UTF-16 encoding.
\PCn or \PUnassigned: any code point to which no character has been assigned.
To keep the control characters but make them compatible for JSON, I had to to
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for($control = 0; $control < 32; $control++) {
$pString = str_replace(chr($control), "", $pString;
}
$pString = str_replace(chr(127), "", $pString;
The loop gets rid of all but DEL, which we just add to the end.
I'm thinking this will be a lot less stressful on you and the script then dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);

regex to match any UTF character excluding punctuation

I'm preparing a function in PHP to automatically convert a string to be used as a filename in a URL (*.html). Although ASCII should be use to be on the safe side, for SEO needs I need to allow the filename to be in any language but I don't want it to include punctuation other than a dash (-) and underscore (_), chars like *%$##"' shouldn't be allowed.
Spaces should be converted to dashes.
I think that using Regex will be the easiest way, but I'm not sure it how to handle UTF8 strings.
My ASCII functions looks like this:
function convertToPath($string)
{
$string = strtolower(trim($string));
$string = preg_replace('/[^a-z0-9-]/', '-', $string);
$string = preg_replace('/-+/', "-", $string);
return $string;
}
Thanks,
Roy.
I think that for SEO needs you should stick to ASCII characters in the URL.
In theory, many more characters are allowed in URLs. In practice most systems only parse ASCII reliable.
Also, many automagically-parse-the-link scripts choke on non-ASCII characters. So allowing URLs with non-ASCII characters in your URLs drastically reduces the change of your link showing up (correctly) in user generated content.
(if you want an example of such a script, take a look at the stackoverflow script, it chokes on parenthesis for example)
You could also take a look at:
How to handle diacritics (accents) when rewriting ‘pretty URLs’
The accepted solution there is to transiterate the non-ASCII characters:
<?php
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
?>
Hope this helps
If UTF-8 mode is selected you can select all non-Letters (according to the Unicode general category - please refer to the PHP documentation Regular Expression Details) by using
/\P{L}+/
so I'd try the following (untested):
function convertToPath($string)
{
$string = mb_strtolower(trim($string), 'UTF-8');
$string = preg_replace('/\P{L}+/', '-', $string);
$string = preg_replace('/-+/', "-", $string);
return $string;
}
Be aware that you'll get prolems with strtolower() on UTF-8 strings as it'll mess with you multi-byte characters - use mb_strtolower() instead.

Categories