My problem is that I am making a small search engine from scratch, but it gets messed up if I search in Russian or any language other than English. I was hoping someone could give me code with a regex that filters out (not just detects, but automatically removes) Russian letters, or any letters other than English ones, while keeping keyboard special characters (-/:;()$&#". - etc).
Later on, I will implement different language support for my engine, but for now, I want to finish the base of the engine.
Thanks in advance.
You may create an array of allowed characters and then filter those that are not allowed:
$allowed = array_merge(range('a', 'z'), range('A', 'Z'), range(0, 9), array(' ', '+', '/', '-', '*', '.')); // Create an array of allowed characters
$string = 'This is allowed and this not é Ó ½ and nothing 123.'; // test string
$array = str_split($string); // split the string (character length = 1)
echo implode('', array_intersect($array, $allowed)); // Filter and implode !
Why complicate? A regex will read the contents of the string, so better do it yourself. Read the characters of the string and check their corresponding ASCII value.
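Create a hash-set-like structure and check manually whether each character falls in the desired set. In PHP, an associative array keyed by character works well for this (SplObjectStorage is a set too, but it only stores objects). You can add any characters that you want to accept to this set.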
EDIT - You might want to use regex too - something like [a-zA-Z0-9,./+&-] but using a set could allow you to expand your search engine gradually by adding more characters to the known-characters set.
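A minimal sketch of the set-based approach, assuming the input is valid UTF-8. The helper name filterAllowed and the particular allow-list are illustrative assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: keep only characters that are keys of the allow-list set.
function filterAllowed(string $input, array $allowedSet): string
{
    $out = '';
    // Split into Unicode characters so a multibyte letter is dropped as a whole.
    foreach (preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (isset($allowedSet[$char])) {
            $out .= $char;
        }
    }
    return $out;
}

// Assumed allow-list: ASCII letters, digits and the punctuation from the regex above.
$allowedChars = array_merge(range('a', 'z'), range('A', 'Z'), range('0', '9'), str_split(' ,./+&-'));
$allowedSet = array_fill_keys($allowedChars, true);

echo filterAllowed('Привет, world 123!', $allowedSet);
```

Growing the engine's supported alphabet later is then a matter of adding keys to $allowedSet.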
This may not be the most efficient way, but it works :)
$str='"it is a simple test \ + - é Ó ½ 213 /:;()$&#".~" ';
$result= preg_replace('/[^\s\w\+\-\\\\":;#\(\)\$\&\.\/]+/', '', $str);
echo $result;
Note that you need to list every special character you want to keep.
I'm trying to use a script to compute keyword density. Everything works except for non-English letters (be it Swedish, Estonian, or anything else).
$file includes the text.
Here's where the problem comes in:
$testsource = explode(" ", $file); // This has no problems with non-english letters
FIRST WORD in array: "Mängi"
$source = preg_split("/[(\b\W+\b)]/", $file, 0, PREG_SPLIT_NO_EMPTY); // This removes the non-english letter sometimes and also a letter in front of it
FIRST WORD in array: "ngi"
In case of this specific word the problem seems to be the "ä" character (and in case of other words other non-english characters) as my current preg_split removes the "Mä" from the beginning of the word. Words with no special characters are ok.
Question: What can I add to the preg_split not to cause issues?
Ah, never mind, the answer is to change the preg_split line to the following:
$source = preg_split("/[(\b\+\b)\s!##$%*]/", $file, 0, PREG_SPLIT_NO_EMPTY);
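As an alternative to listing separators by hand, here is a sketch of a Unicode-aware split. It assumes $file holds valid UTF-8; the sample text is made up for illustration. With the u modifier, \p{L} and \p{N} match letters and numbers in any script, so "ä" is treated as part of the word.

```php
<?php
// Assumed sample input; in the question this text comes from $file.
$file = "Mängi mängu, palun!";

// Split on runs of anything that is not a Unicode letter or number.
$source = preg_split('/[^\p{L}\p{N}]+/u', $file, -1, PREG_SPLIT_NO_EMPTY);

print_r($source);
```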
I want to make a hyphen-separated string (for use in the URL) based on the user-submitted title of the post.
Suppose if the user entered the title of the post as:
$title = "USA is going to deport indians -- Breaking News / News India";
I want to convert it as below
$slug = "usa-is-going-to-deport-indians-breaking-news-news-india";
There could be some more characters that I also want converted, for example '&' to 'and', and '#' and '%' to a hyphen (-).
One of the ways that I tried was to use the str_replace() function, but with this method I have to call str_replace() too many times and it is time consuming.
Another problem is that there could be more than one consecutive hyphen (-) in the title string; I want to collapse any run of hyphens into a single hyphen (-).
Is there any robust and efficient way to solve this problem?
You can use the preg_replace() function to do this:
Input:
$string = "USA is going to deport indians -- Breaking News / News India";
$string = preg_replace("/[^\w]+/", "-", $string);
echo strtolower($string);
Output:
usa-is-going-to-deport-indians-breaking-news-news-india
If you are on WordPress, I would suggest using its sanitize_title() function.
Check the documentation.
There are three steps in this task (creating a "slug" string); each requires a separate pass over the input string.
Cast all characters to lowercase.
Replace ampersand symbols with [space]and[space] to ensure that the symbol is not consumed by a later replacement AND the replacement "and" is not prepended or appended to its neighboring words.
Replace sequences of one or more non-alphanumeric characters with a literal hyphen.
Multibyte-safe code:
$title = "ÛŞÃ is going to dèport 80% öf indians&citizens are #concerned -- Breaking News / News India";
echo preg_replace(
    '/[^\pL\pN]+/u',
    '-',
    str_replace(
        '&',
        ' and ',
        mb_strtolower($title)
    )
);
Output:
ûşã-is-going-to-dèport-80-öf-indians-and-citizens-are-concerned-breaking-news-news-india
Note that the replacement in str_replace() could be done within the preg_replace() call by forming an array of find strings and an array of replacement strings. However, this may be false economy -- although there would be fewer function calls, the more expensive regex-based function call would make two passes over the entire string.
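For illustration, the array form mentioned above might look like the sketch below. The sample title is an assumption. preg_replace() applies the patterns in array order, so the ampersand is expanded before the non-alphanumeric runs are collapsed.

```php
<?php
// Assumed sample title, shorter than the one in the answer.
$title = "Tom & Jerry -- Best Of";

$slug = preg_replace(
    ['/&/', '/[^\pL\pN]+/u'],   // patterns are applied in this order
    [' and ', '-'],             // matching replacements
    mb_strtolower($title)
);

echo $slug;
```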
If you wish to convert accented characters to ASCII characters, then perhaps read the different techniques at Convert accented characters to their plain ascii equivalents.
If you aren't worried about multibyte characters, then the simpler version of the same approach would be:
echo preg_replace(
    '/[^a-z\d]+/',
    '-',
    str_replace(
        '&',
        ' and ',
        strtolower($title)
    )
);
To mop up any leading or trailing hyphens in the result string, it may be a good idea to unconditionally call trim($resultstring, '-').
For a deeper dive on the subject of creating a slug string, read PHP function to make slug (URL string).
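Putting the three steps and the final trim together, a sketch could look like this. The function name slugify is hypothetical, and the code assumes the mbstring extension and valid UTF-8 input.

```php
<?php
// Hypothetical helper combining the steps described above.
function slugify(string $title): string
{
    $lower = mb_strtolower($title, 'UTF-8');         // 1. lowercase
    $expanded = str_replace('&', ' and ', $lower);   // 2. expand ampersands
    // 3. collapse each run of non-letter, non-number characters into one hyphen
    $slug = preg_replace('/[^\pL\pN]+/u', '-', $expanded);
    return trim($slug, '-');                         // drop leading/trailing hyphens
}

echo slugify("  Hello & World!  ");
```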
I am trying to create slugs. My string is like this: $string='möbel#*-jérôme-mp3-how?';
Step: 1
First, I want to remove special characters, non-alphanumeric and non-latin characters from this string.
Like this: $string='möbel-jérôme-mp3-how';
Previously, I used to have only english characters in the string.
So, I used to do like this: $string = preg_replace("([^a-z0-9])", "-", $string);
However, since I also want to retain foreign characters, this is not working.
Step: 2
Then, I want to remove all the words that have one or more numbers in them.
In this example string, I want to remove the word mp3 as it contains one or more numbers.
So, the final string looks like this: $string='möbel-jérôme-how';
I used to do like this:
$words = explode('-',$string);
$result = array();
foreach($words as $word)
{
if( ($word ==preg_replace("([^a-z])", "-", $word)) && strlen($word)>2)
$result[]=$word;
}
$string = implode(' ',$result);
This does not work now as it contains foreign characters.
In PHP, you have access to Unicode properties:
$result = preg_replace('/[^\p{L}\p{N}-]+/u', '', $subject);
will do step 1 for you. (\p{L} matches any Unicode letter, \p{N} matches any Unicode numeric character.)
Removing words with digits is just as easy:
$result2 = preg_replace('/\b\w*\d\w*\b-?/', '', $result);
(\b matches the start and end of a word).
I would strongly suggest transliterating the Unicode characters if you are actually making slugs for links. You can use PHP's iconv to achieve that.
Similar question here. The ingenuity and simplicity of the top voted answer, I think, is great:)
I would suggest doing this in multiple steps:
Create a string of all allowed characters and go through the input string, keeping only those characters. (It will take some time, but it's a one-time thing.)
Do an explode on - and go through all the words, keeping only the ones that don't contain numbers. Then implode it again.
I believe you can write the script on your own from here.
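The two steps can be sketched as follows. This variant uses a Unicode property class instead of a hand-written list of allowed characters, which is a swapped-in technique. The function name is hypothetical and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper; the name and the '-' separator are assumptions.
function slugWithoutNumberWords(string $input): string
{
    // Step 1: drop everything except Unicode letters, numbers and hyphens.
    $clean = preg_replace('/[^\p{L}\p{N}-]+/u', '', $input);

    // Step 2: split on hyphens and keep non-empty words with no numeric character.
    $words = array_filter(explode('-', $clean), function ($word) {
        return $word !== '' && !preg_match('/\p{N}/u', $word);
    });

    return implode('-', $words);
}

echo slugWithoutNumberWords('möbel#*-jérôme-mp3-how?');
```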
I have quite a long script which involves chopping lots of large text files into individual words and processing them.
I lowercase everything then remove all characters except for letters and spaces with:
$content=preg_replace('/[^a-z\s]/', '', $content); // Remove non-letters
This is then exploded and each word goes into an associative array as the key, with the number of occurrences as the value:
$words=array_count_values($content);
I want to convert the script to be able to work with languages other than English. Is PHP going to be OK with this? Can I use UTF-8 characters as array keys? And how would I preg_replace to remove everything except letters from any language? (All numbers, punctuation and random characters still need to be removed.)
Yes you can use UTF-8 characters as keys (is there anything that can't be a key in a PHP array? :)). Your regexp might look something like:
/\pL+/u
EDIT:
Sorry, should be:
/[^\pL\p{Zs}]/u
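A sketch of how that pattern could slot into the word-count pipeline from the question, assuming valid UTF-8 input and the mbstring extension. The function name countWords is hypothetical. Note that \p{Zs} covers space separators only, so tabs and newlines are stripped rather than treated as word breaks.

```php
<?php
// Hypothetical helper that counts words in text of any language.
function countWords(string $content): array
{
    $content = mb_strtolower($content, 'UTF-8');
    // Remove everything except letters and space separators.
    $content = preg_replace('/[^\pL\p{Zs}]/u', '', $content);
    // Split on runs of spaces and skip empty pieces.
    $words = preg_split('/\p{Zs}+/u', $content, -1, PREG_SPLIT_NO_EMPTY);
    // UTF-8 strings are fine as array keys.
    return array_count_values($words);
}

print_r(countWords('Mängi mängi, tere! Tere 123'));
```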
This should work, for both your problems.
<?php
$string = "Héllø";
echo preg_replace('/[^a-z\s]/i', '', $string) . "\n";
echo preg_replace('/[^a-z\W\s]/ui', '', $string) . "\n";
$arr = array(
$string => 5
);
print_r($arr);
?>
In the preg_replace call, the u flag makes PCRE treat the pattern and subject as UTF-8, and the i flag makes matching case-insensitive. \W matches any non-word character.
Ultimately, you won't be able to create an algorithm that works reliably for all languages. Unicode Standard Annex #29 provides a "Default Word Boundary Specification" (which I'm not sure would be easy to implement in PHP, because the only source of character properties available in userland is PCRE; mbstring has this information, but it doesn't expose it), but it warns the algorithm must be tailored for specific languages:
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. [...]
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. [...]
As part of my registration system I need to check for the existence of special characters in a variable. How can I perform this check? The person who gives the most precise answer gets best.
Assuming that you mean html entities when you say "special chars", you can use this:
<?php
$table = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
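// Escape regex metacharacters so the characters are safe inside a character class.
$chars = preg_quote(implode('', array_keys($table)), '/');
// The u modifier makes PCRE treat multibyte UTF-8 characters as single characters.
if (preg_match("/[{$chars}]/u", $string) === 1) {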
// special chars in string
}
get_html_translation_table gets all the characters that have an HTML entity. If you only want the characters that the function htmlspecialchars converts, then you can pass HTML_SPECIALCHARS instead of HTML_ENTITIES. The return value of get_html_translation_table is an array that maps each character to its entity (for example, & to &amp;).
Next, we want to put all of those characters in a regular expression like [&"']+, which will match any substring of length 1 or more made up of the characters inside the square brackets. So we use array_keys to get the keys of the translation table (the raw, unencoded characters), and implode them together into a single string.
Then we put them into the regular expression and use preg_match to see if the string contains any of those characters. You can read more about regular expression syntax at the PHP docs.
$special_chars = '[!@#$%^&*()]'; // a character class listing the special characters to check for (example set)
$string = 'pa$$word'; // the string you want to check (example value)
if (preg_match('/'.$special_chars.'/', $string))
{
// special characters exist in the string.
}
Check the manual of preg_match for more details
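If "special" simply means anything that is not a letter or a number, the check can be sketched without listing characters at all. That definition is an assumption, as is the helper name containsSpecialChars; adjust the character class to whatever the registration form should accept.

```php
<?php
// Hypothetical helper: "special" is assumed to mean any character
// that is not a Unicode letter or number.
function containsSpecialChars(string $value): bool
{
    return preg_match('/[^\p{L}\p{N}]/u', $value) === 1;
}

var_dump(containsSpecialChars('JohnDoe42'));
var_dump(containsSpecialChars('John_Doe!'));
```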
A quick google search for "php special characters" brings up some good info:
htmlentities() - http://php.net/manual/en/function.htmlentities.php
htmlspecialchars() - http://php.net/manual/en/function.htmlspecialchars.php