Making my script UTF-8 compatible? - php

I have quite a long script which involves chopping lots of large text files into individual words and processing them.
I lowercase everything then remove all characters except for letters and spaces with:
$content=preg_replace('/[^a-z\s]/', '', $content); // Remove non-letters
This is then exploded and each word goes into an associative array as the key, with the number of occurrences as the value:
$words=array_count_values($content);
I want to convert the script to be able to work with languages other than English. Is PHP going to be OK with this? Can I use UTF-8 characters as array keys? And how would I preg_replace to remove everything except letters from any language? (All numbers, punctuation and random characters still need to be removed.)

Yes you can use UTF-8 characters as keys (is there anything that can't be a key in a PHP array? :)). Your regexp might look something like:
/\pL+/u
EDIT:
Sorry, should be:
/[^\pL\p{Zs}]/u
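For illustration, here is a minimal sketch of how the whole word-counting pipeline might look with that idea applied. It assumes the input is valid UTF-8 and that the mbstring extension is available; the sample text is made up. One deliberate substitution: the class uses \s rather than \p{Zs}, because \p{Zs} covers only space separators, so tabs and line breaks would be deleted and the words on either side glued together.

```php
<?php
// Made-up sample text; in the real script this would come from a file.
$content = "Héllo wörld, héllo PHP 8! Ça va?";

// Multibyte-aware lowercasing (strtolower() would leave "Ç" untouched).
$content = mb_strtolower($content, 'UTF-8');

// Remove everything that is neither a letter (any script) nor whitespace.
$content = preg_replace('/[^\pL\s]+/u', '', $content);

// Split on runs of whitespace and drop empty pieces.
$words = preg_split('/\s+/u', $content, -1, PREG_SPLIT_NO_EMPTY);

// UTF-8 strings work as array keys like any other string.
$counts = array_count_values($words);
print_r($counts);
```

The keys of $counts are plain byte strings as far as PHP is concerned, which is why no special handling is needed for the array itself.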

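This should work for both of your problems.
<?php
$string = "Héllø";
// ASCII-only class without the u flag: the bytes of é and ø are stripped
echo preg_replace('/[^a-z\s]/i', '', $string) . "\n";
// Unicode letter class with the u flag: é and ø are kept
echo preg_replace('/[^\p{L}\s]/u', '', $string) . "\n";
$arr = array(
    $string => 5
);
print_r($arr);
?>
In preg_replace, the u flag makes the pattern and the subject be treated as UTF-8, and the i flag makes matching case-insensitive. \p{L} matches any letter in any script. (Note that \W matches non-word characters; \w is the class for word characters, and it also includes digits and the underscore.)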

Ultimately, you won't be able to create an algorithm that works reliably for all languages. Unicode Standard Annex #29 provides a "Default Word Boundary Specification" (which I'm not sure would be easy to implement in PHP, because the only source of character properties available in userland is PCRE; mbstring has this information, but it doesn't expose it), but it warns that the algorithm must be tailored for specific languages:
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. [...]
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. [...]

Related

From Postgres regexp_replace to an equivalent in PHP

I need some help.
I have a PostgreSQL regexp_replace pattern, like:
regexp_replace(lower(institution_title),'[[:cntrl:]]|[[[:digit:]]|[[:punct:]]|[[:blank:]]|[[:space:]|„|“|“|”"]','','g')
and I need an equivalent of it in PHP,
because half of the data comes from the Postgres database and I have to compare it with strings from PHP as well.
You may use the same POSIX character classes with PHP PCRE regex:
preg_replace('/[[:cntrl:][:digit:][:punct:][:blank:][:space:]„““”"]+/', '', strtolower($institution_title))
Besides, there are Unicode category classes in PCRE. Thus, you may also try
preg_replace('/[\p{Cc}\d\p{P}\s„““”"]+/u', '', mb_strtolower($institution_title, 'UTF-8'))
Where \p{Cc} stands for Control characters, \d for digits, \p{P} for punctuation, and \s for whitespace.
I am adding /u modifier to handle Unicode strings, too.
Thanks guys, but I bumped into another problem: I cannot match strings that contain special symbols.
Here is the output of my Postgres SQL:
SQL:
select regexp_replace(lower(title),'[[:cntrl:]]|[[[:digit:]]|[[:punct:]]|[[:blank:]]|[[:space:]|„|“|“|”"]','','g')
from cls_institutions
Output:
"oxforduniversity"
"šiauliųuniversitetas"
"harwarduniversity"
"internationalbusinessschool"
"vilniuscollege"
"žemaitijoskolegija"
"worldhealthorganization"
But in PHP the output is a little bit different. I build my array of institutions like this:
$institutions[] = "'".preg_replace('/[[:cntrl:][:digit:][:punct:][:blank:][:space:]„““”"]+/', '', strtolower($data[0]))."'";
And PHP outputs like this:
"oxforduniversity",
"Šiauliųuniversitetas",
"harwarduniversity",
"internationalbusinessschool",
"vilniuscollege",
"Žemaitijoskolegija",
"worldhealthorganization"
The first letter is not lowercased, somehow... Am I missing something?
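A likely explanation, offered as a sketch rather than a confirmed diagnosis: strtolower() works on single bytes, so it does not lowercase multibyte UTF-8 letters such as Š or Ž, while PostgreSQL's lower() does. The example below assumes the data is UTF-8 and that the mbstring extension is installed; normalize_title() is a hypothetical helper name. It uses \p{P}, which also covers the typographic quotes „ “ ”, so they need not be listed separately.

```php
<?php
// Hypothetical helper that mirrors the PostgreSQL expression from the question.
function normalize_title(string $title): string
{
    // mb_strtolower() is multibyte-aware, so "Š" becomes "š".
    $lower = mb_strtolower($title, 'UTF-8');

    // Strip control characters, digits, punctuation and whitespace.
    // The u flag makes PCRE treat pattern and subject as UTF-8.
    return preg_replace('/[\p{Cc}\d\p{P}\s]+/u', '', $lower);
}

echo normalize_title('Šiaulių universitetas'), "\n";
echo normalize_title('„Žemaitijos“ kolegija 2'), "\n";
```

Note that POSIX [[:punct:]] also includes some ASCII symbols such as + and $, which \p{P} does not; if titles can contain those, they would have to be added to the class.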

Why is ctype_alnum unhelpful in matching culture-agnostic alphanumerics?

Let's suppose that I have a text in a variable called $text and I want to validate it, so that it can contain spaces, underscores, dots, and any letters and digits from any language. Since I am a total noob with regular expressions, I thought I could work around learning them, like this:
if (!ctype_alnum(str_replace(".", "", str_replace(" ", "", str_replace("_", "", $text))))) {
//invalid
}
This correctly considers the following inputs as valid:
foobarloremipsum
foobarloremipsu1m
foobarloremi psu1m
foobar._remi psu1m
So far, so good. But if I enter my name, Lajos Árpád, which contains non-English letters, then it is considered to be invalid.
The PHP manual describes ctype_alnum as follows:
Returns TRUE if every character in text is either a letter or a digit, FALSE otherwise.
I suppose that a setting needs to be changed to allow non-English letters, but how can I use ctype_alnum to return true if and only if $text contains only letters or digits in a culture-agnostic fashion?
Alternatively, I am aware that some spooky regular expression can be used to resolve the issue, including things like \p{L} which is nice, but I am interested to know whether it is possible using ctype_alnum.
You need to use setlocale with category set to LC_CTYPE and the appropriate locale for the ctype_* family of functions to work on non-English characters.
Note that the locale that you're using with setlocale needs to actually be installed on the system, otherwise it won't work. The best way to remedy this situation is to use a portable solution, given in this answer to a similar question.
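As one possible portable approach (it does not use ctype_alnum, so it sidesteps the locale question rather than answering it), a Unicode-aware regular expression can express the same rule. This is a minimal sketch that assumes $text is valid UTF-8; is_valid_text() is a hypothetical name.

```php
<?php
// Hypothetical helper: true if $text consists only of letters and digits
// from any script, plus spaces, underscores and dots.
function is_valid_text(string $text): bool
{
    // \p{L} = any letter, \p{N} = any number; \A and \z anchor the whole string.
    return preg_match('/\A[\p{L}\p{N} ._]+\z/u', $text) === 1;
}

var_dump(is_valid_text('Lajos Árpád'));        // name with non-English letters
var_dump(is_valid_text('foobar._remi psu1m')); // input from the question
var_dump(is_valid_text('foo!bar'));            // contains a disallowed character
```

One behavioural difference from the str_replace() version: a string made up only of dots, spaces or underscores passes this check, whereas the original rejects it because ctype_alnum('') is false.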

Split string on non-alphanumerics in PHP? Is it possible with php's native function?

I was trying to split a string on non-alphanumeric characters or, simply put, I want to split it into words. The approach that immediately came to my mind is to use regular expressions.
Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
But there are two problems that I see with this approach.
It is not a native php function, and is totally dependent on the PCRE Library running on server.
An equally important problem: what if I have punctuation within a word?
Example:
$string = "U.S.A-men's-vote";
$splitArr = preg_split('/[^a-z0-9]/i', $string);
Now this will split the string as [{U}{S}{A}{men}{s}{vote}]
But I want it as [{U.S.A}{men's}{vote}]
So my questions are:
How can we split them according to words?
Is there a possibility to do it with php native function or in some other way where we are not dependent?
Regards
Sounds like a case for str_word_count(), using the oft-forgotten value 1 or 2 for the second argument and a third argument that lists the hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word parts) to accept within a word. Follow it with an array_walk() that trims those characters from the beginning and end of each resulting value, so that they are kept only when they are actually embedded in the "word".
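A minimal sketch of that suggestion, under a couple of assumptions: the sample sentence is made up, and only ASCII text is considered, since str_word_count() is not multibyte-aware. Worth knowing: str_word_count() already treats ' and - as word characters, so a string like U.S.A-men's-vote would come back as a single word; only the full stop has to be added through the third argument.

```php
<?php
// Made-up sample sentence.
$string = "The U.S.A. vote: men's rights, 'quoted' text.";

// Format 1 returns the words as an array; "." is accepted inside words.
$words = str_word_count($string, 1, '.');

// Trim the word-part characters from both ends of every word,
// so they survive only when embedded (as in "U.S.A" or "men's").
array_walk($words, function (&$word) {
    $word = trim($word, ".'-");
});

// Drop anything that became empty after trimming.
$words = array_values(array_filter($words, function ($word) {
    return $word !== '';
}));

print_r($words);
```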
Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.
Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:
preg_split('/[^a-z0-9.\']+/i', $string);
If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:
preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
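To make the difference between the two patterns concrete, here is a small sketch that runs both; the second sample sentence is made up for illustration, and PREG_SPLIT_NO_EMPTY is added only to discard empty pieces.

```php
<?php
// Pattern 1: dots and apostrophes are never delimiters.
$simple = preg_split('/[^a-z0-9.\']+/i', "U.S.A-men's-vote");
print_r($simple);

// Pattern 2: a dot followed by whitespace also ends a word,
// while a dot inside "U.S.A" is kept.
$sentence = "Hello world. U.S.A-men's-vote";
$contextual = preg_split('/\.\s+|[^a-z0-9.\']+/i', $sentence, -1, PREG_SPLIT_NO_EMPTY);
print_r($contextual);

// For contrast, pattern 1 on the same sentence keeps the dot on "world.".
$withDot = preg_split('/[^a-z0-9.\']+/i', $sentence, -1, PREG_SPLIT_NO_EMPTY);
print_r($withDot);
```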
As per my comment, you might want to try (add as many separators as needed)
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).
So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling
they 're 'just friends'. Or that's what they say.
while having "'re" and a sequence of words of which the first is left-quoted and the last is right-quoted, the first not being a known sequence ('s, 're, 'll, 'd ...) may be handled at application level.
This is not a PHP problem, but a logical one.
Words can be joined by a -. Abbreviations can look like short sentences.
You can match your example directly by creating a solution that fits only this particular phrase, but you can't get a solution for all possible phrases. That would require content recognition based on neural computing.

Regex to ignore accents? PHP

Is there any way to make a regex that ignores accents?
For example:
preg_replace("/$word/i", "<b>$word</b>", $str);
The "i" in the regex makes it case-insensitive, but is there any way to match, for example,
java with Jávã?
I did try to make a copy of $str, strip the accents from the copy, and find the index of all the occurrences. But the indexes in the two strings seem to be different, even though the only change is the removed accents.
(I did some research, but all I could find is how to remove accents from a string.)
I don't think there is such a way. It would be locale-dependent, and you probably want the "/u" switch first to enable UTF-8 in pattern strings.
I would probably do something like this.
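function prepare($pattern)
{
    // Each class lists the plain letter too, so unaccented text still matches.
    $replacements = array("a" => "[aáàäâã]",
                          "e" => "[eéèëê]" /* ... */);
    // strtr() substitutes in a single pass, so the inserted classes are not rewritten again.
    return strtr(preg_quote($pattern, '/'), $replacements);
}
echo preg_replace("/(" . prepare($word) . ")/ui", "<b>\\1</b>", $str);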
In your case, the index was different because, unless you used the mbstring functions, you were counting bytes, and UTF-8 uses more than one byte for each accented character.
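The offset mismatch can be seen in a small sketch like the following. It assumes UTF-8 strings with precomposed characters (written here with \u{...} escapes, which need PHP 7 or later) and the mbstring extension; the sample text is made up.

```php
<?php
$accented = "J\u{E1}v\u{E3} rocks"; // "Jávã rocks": á and ã take 2 bytes each in UTF-8
$plain    = "Java rocks";

// Byte-based functions count the extra bytes of the accented letters.
var_dump(strlen($accented));          // length in bytes
var_dump(strpos($accented, 'rocks')); // byte offset in the accented string
var_dump(strpos($plain, 'rocks'));    // byte offset in the unaccented copy

// The mbstring functions count characters, so the offsets line up again.
var_dump(mb_strpos($accented, 'rocks', 0, 'UTF-8'));
```

So offsets found in the unaccented copy can only be reused on the original string if both are measured in characters, for example with mb_strpos() and mb_substr().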
Java and Jávã are different words, there's no native support in regex for removing accents, but you can include all possible combinations of characters with or without accents that you want to replace in your regex.
Like preg_replace("/java|Jávã|jáva|javã/i", "<b>$word</b>", $str);.
Good luck!
Regex isn't the tool for you here.
The answer you're looking for is the strtr() function.
This function replaces specified characters in a string, and is exactly what you're looking for.
In your example, Jávã, you could use a strtr() call like this:
$replacements = array('á'=>'a', 'ã'=>'a');
$output = strtr("Jávã",$replacements);
$output will now contain Java.
Of course, you'll need a bigger $replacements array to deal with all the characters you want to work with. See the manual page for strtr() for some examples of how people are using it.
Note that there isn't a simple blanket list of characters, because firstly it would be huge, and secondly, the same starting character may need to be translated differently in different contexts or languages.
Hope that helps.
<?php
if (!function_exists('htmlspecialchars_decode')) {
function htmlspecialchars_decode($text) {
return str_replace(array('&lt;','&gt;','&quot;','&amp;'),array('<','>','"','&'),$text);
}
}
function removeMarkings($text)
{
$text=htmlentities($text);
// components (key+value = entity name, replace with key)
$table1=array(
'a'=>'grave|acute|circ|tilde|uml|ring',
'ae'=>'lig',
'c'=>'cedil',
'e'=>'grave|acute|circ|uml',
'i'=>'grave|acute|circ|uml',
'n'=>'tilde',
'o'=>'grave|acute|circ|tilde|uml|slash',
's'=>'zlig', // maybe szlig=>ss would be more accurate?
'u'=>'grave|acute|circ|uml',
'y'=>'acute'
);
// direct (key = entity, replace with value)
$table2=array(
'&ETH;'=>'D', // not sure about these character replacements
'&eth;'=>'d', // is an ð pronounced like a 'd'?
'&THORN;'=>'B', // is a þ pronounced like a 'b'?
'&thorn;'=>'b' // don't think so, but the symbols looked like a d,b so...
);
foreach ($table1 as $k=>$v) $text=preg_replace("/&($k)($v);/i",'\1',$text);
$text=str_replace(array_keys($table2),$table2,$text);
return htmlspecialchars_decode($text);
}
$text="Here two words, one in normal way and another in accent mode java and jává and me searched with java and it found both occurences(higlighted form this sentence) java and jává<br/>";
$find="java"; // The word to highlight; this search term should highlight both java and jává
$text=utf8_decode($text);
$find=removeMarkings(utf8_decode($find)); $len=strlen($find);
preg_match_all('/\b'.preg_quote($find).'\b/i', removeMarkings($text), $matches, PREG_OFFSET_CAPTURE);
$start=0; $newtext="";
foreach ($matches[0] as $m) {
$pos=$m[1];
$newtext.=substr($text,$start,$pos-$start);
$newtext.="<b>".substr($text,$pos,$len)."</b>";
$start=$pos+$len;
}
$newtext.=substr($text,$start);
echo "<blockquote>",$newtext,"</blockquote>";
?>
I think something like this will help you, I got this one from a forum.. just take a look.
Set an appropriate locale (such as fr_FR, for example) and use the strcoll function to compare a string ignoring accents.

filter non-alphanumeric "repeating" characters

What's the best way to filter non-alphanumeric "repeating" characters?
I would rather not build a list of characters to check for. Is there a good regex for this that I can use in PHP?
Examples:
...........
*****************
!!!!!!!!
###########
------------------
~~~~~~~~~~~~~
Special case patterns:
=*=*=*=*=*=
->->->->
Based on @sln's answer:
$str = preg_replace('~([^0-9a-zA-Z])\1+|(?:=[*])+|(?:->)+~', '', $str);
The pattern could be something like this : s/([\W_]|=\*|->)\1+//g
or, if you want to replace by just a single instance: s/([\W_]|=\*|->)\1+/$1/g
edit ... probably any special sequence should come first in the alternation; in case you need to make something like == special, it won't be grabbed by [\W_].
So something like s/(==>|=\*|->|[\W_])\1+/$1/g where special cases are first.
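Translated into PHP, that substitution might look like the sketch below. The sample string is made up, and the u flag is an addition of mine so that \W does not match the individual bytes of non-ASCII letters in UTF-8 text.

```php
<?php
// Made-up sample input with repeated symbols and special-case sequences.
$str = "Hello!!!!!! world ->->-> ok =*=*=*= end";

// Special sequences come first in the alternation; [\W_] is the fallback.
// \1+ requires at least one repetition of whatever group 1 matched,
// and '$1' keeps a single instance of it.
$collapsed = preg_replace('~(==>|=\*|->|[\W_])\1+~u', '$1', $str);

echo $collapsed, "\n";
```

Passing '' instead of '$1' removes the repeated run entirely rather than collapsing it. Because [\W_] also matches whitespace, runs of spaces are affected too.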
preg_replace('~\W+~', '', $str);
sln's solution is pretty good, but the \W "non-word" class includes whitespace. I don't think you want to be removing sequences of tabs or spaces! Using a negated class (something like '[^A-Za-z0-9\s]') would work better.
This will filter out all symbols:
$q = preg_replace('/[^A-Za-z0-9 ]/', '', $q);
replace(/([^A-Za-z0-9\s]+)\1+/, "")
will remove repeated patterns of non-alphanumeric non-whitespace strings.
However, this is a bad practice because you'll also be removing all non-ASCII European and other international language characters in the Unicode base.
The only place where you really won't ever care about internationalization is in processing source code, but then you are not handling text quoted in strings and you may also accidentally de-comment a block.
You may want to be more restrictive in what you try to remove by giving a list of characters to replace instead of the catch-all.
Edit: I have done similar things before when trying to process early-version ShoutCAST radio names. At that time, stations tried to call attention to themselves by having obnoxious names like: <<!!!!--- GREAT MUSIC STATION ---!!!!>>. I used similar code to get rid of repeated symbols, but then learnt (the hard way) to be careful about what I eventually remove.
This works for me:
preg_replace('/(.)\1{3,}/i', '', $sourceStr);
It removes any character that repeats 4 or more times in a row (the first occurrence plus at least 3 repetitions).
