Preg_split loses foreign letters [duplicate] - php

This question already has answers here:
What is the best way to split a string into an array of Unicode characters in PHP?
(8 answers)
Closed 2 years ago.
I'm trying to use a script to compute keyword density. Everything works except for non-English letters (be it Swedish, Estonian, or anything else).
$file includes the text.
Here's where the problem comes in:
$testsource = explode(" ", $file); // This has no problems with non-english letters
FIRST WORD in array: "Mängi"
$source = preg_split("/[(\b\W+\b)]/", $file, 0, PREG_SPLIT_NO_EMPTY); // This removes the non-english letter sometimes and also a letter in front of it
FIRST WORD in array: "ngi"
In case of this specific word the problem seems to be the "ä" character (and in case of other words other non-english characters) as my current preg_split removes the "Mä" from the beginning of the word. Words with no special characters are ok.
Question: What can I add to the preg_split not to cause issues?

Ah, never mind, the answer is to change the preg_split line to the following:
$source = preg_split("/[(\b\+\b)\s!##$%*]/", $file, 0, PREG_SPLIT_NO_EMPTY);
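A likely cause of the original problem is that, without the u modifier, PCRE treats the string as single bytes, so \W matches the individual bytes of a multibyte letter such as "ä". The following is a minimal sketch of a Unicode-aware alternative, not the asker's final code; the function name split_words is hypothetical, and it assumes $text is valid UTF-8 (preg_split returns false otherwise).

```php
<?php
// Hypothetical helper: split text into words on anything that is
// not a Unicode letter (\p{L}) or digit (\p{N}).
function split_words(string $text): array
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8,
    // so "ä" is handled as one letter instead of two non-word bytes.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on invalid UTF-8 input.
    return $words === false ? [] : $words;
}

print_r(split_words('Mängi õues, sõber!'));
// Array ( [0] => Mängi [1] => õues [2] => sõber )
```

Punctuation is dropped here because it is neither a letter nor a digit; extend the character class if some symbols should stay inside words.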

Related

How to prevent zalgo text using php [duplicate]

This question already has answers here:
Remove special characters that mess with formating
(2 answers)
Closed 7 years ago.
I have some problems with Zalgo on my imageboard.
Texts like below mess up my imageboard. Is there a way to prevent these characters and "fix" or clean up the texts?
Example text Source:
ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
I tried to use this solution:
$cleanMessage = preg_replace("/[^\x20-\xAD\x7F]/", "", $input_lines);
Taken from here: Remove special characters that mess with formating
But it works only for Latin characters.
Can anyone help me?
This regular expression removes every combining mark (the "superscript" symbols, matched by \p{M}) from the $text variable:
$text = preg_replace("~[\p{M}]~uis","", $text);
If $text contains a character with a combining mark, for example กิ, this regex removes the mark and the resulting $text contains just ก.
I then improved this regex so that it filters only the second level of phonetic marks:
$text = preg_replace("~(?:[\p{M}]{1})([\p{M}])+?~uis","", $text);
This regex is meant to strip only stacked (second-level) marks.
Use it if you need to handle German or other languages that rely on legitimate diacritical marks.
This regex will transform this word -
͐̈ͩ̎Zͮ͌ͦ͆ͦͤÃ̉͛̄ͭ̈̚LͫG̉̋͂̉Oͨ͌̋͗!
into this: ZÄLͫGO!
I hope the second regex helps.
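A variation on the same idea, as a sketch rather than the answerer's exact regex: keep the first combining mark after each position and remove any further marks stacked on top of it. The function name strip_stacked_marks is hypothetical, the input is assumed to be valid UTF-8, and \K is the PCRE escape that discards what has been matched so far from the reported match.

```php
<?php
// Hypothetical helper: keep at most one combining mark in a row.
function strip_stacked_marks(string $text): string
{
    // \p{M}   one combining mark, which is kept
    // \K      forget the kept mark, so it is not part of the replaced text
    // \p{M}+  every additional mark that follows, which is removed
    $clean = preg_replace('/\p{M}\K\p{M}+/u', '', $text);

    // preg_replace returns null on error (for example invalid UTF-8).
    return $clean ?? $text;
}

// "e" followed by three combining marks keeps only the first one.
$input = "e\u{0301}\u{0302}\u{0303}";
var_dump(strip_stacked_marks($input) === "e\u{0301}"); // bool(true)
```

Text stored in decomposed form keeps its single legitimate accent this way; precomposed letters such as "ä" contain no separate mark and are left untouched.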

Get substring of characters up to the first occurrence of dot or dash [duplicate]

This question already has answers here:
PHP substring extraction. Get the string before the first '/' or the whole string
(14 answers)
Closed 7 years ago.
How do you specify that you want to return a substring containing all characters from the start of a string up to but not including the first dot or dash?
For example if the original string is:
'abcdefg.hij-k'
or if the original string is:
'abcdefg.hi-j.k.l.mn-op'
Then the same substring of:
'abcdefg'
should be returned.
The key thing here is that there may be multiple dots and dashes occurring randomly and we are only interested in the first chunk of characters.
EDIT:
A dot or a dash may occur first.
You could use preg_split:
$res = preg_split('/[-.]/', $string);
The string you want is in $res[0]
Or preg_match:
preg_match('/^([^-.]+)/', $string, $matches);
The result is in $matches[1]
Try using the explode function.
$final = explode(".", $originalstr);
The first part is then in $final[0]. Note that this splits on dots only, so it does not handle a dash that comes first.
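A regex-free alternative, sketched here as a swapped-in technique that none of the answers use: strcspn() returns the length of the initial segment that contains none of the given characters, so it can cut the string at the first dot or dash, whichever comes first. The function name before_first_dot_or_dash is hypothetical.

```php
<?php
// Hypothetical helper: everything before the first "." or "-".
function before_first_dot_or_dash(string $s): string
{
    // strcspn() counts leading characters that are NOT in the mask ".-".
    // If neither character occurs, it returns the full length of the string.
    return substr($s, 0, strcspn($s, '.-'));
}

echo before_first_dot_or_dash('abcdefg.hij-k'), "\n";          // abcdefg
echo before_first_dot_or_dash('abcdefg.hi-j.k.l.mn-op'), "\n"; // abcdefg
echo before_first_dot_or_dash('abc-def.g'), "\n";              // abc
```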

Only allow English characters/letters/numbers and a few special characters [duplicate]

This question already has answers here:
php POST and non-english language chars passes empty
(2 answers)
PHP: Allow only certain characters in string, without using regex
(1 answer)
Closed 9 years ago.
My problem is that I am making a small search engine from scratch, but it gets messed up if I search in Russian or any language other than English. I was hoping someone could give me code with a regex that filters out (not just detects, but automatically filters out) Russian letters, or any letters other than English letters and keyboard special characters (-/:;()$&#". - etc).
Later on, I will implement different language support for my engine, but for now, I want to finish the base of the engine.
Thanks in advance.
You may create an array of allowed characters and then filter those that are not allowed:
$allowed = array_merge(range('a', 'z'), range('A', 'Z'), range(0, 9), array(' ', '+', '/', '-', '*', '.')); // Create an array of allowed characters
$string = 'This is allowed and this not é Ó ½ and nothing 123.'; // test string
$array = str_split($string); // split the string (character length = 1)
echo implode('', array_intersect($array, $allowed)); // Filter and implode !
Why complicate things? A regex has to read the contents of the string anyway, so you may as well do it yourself: read the characters of the string and check each one against the set you accept, for example by its ASCII value.
Create a hashset-like structure and check manually whether each character falls in the desired set. (The SPL class is actually named SplObjectStorage and only stores objects, so for single characters an associative array keyed by character is the simpler choice.) You can add any characters that you want to accept to this set.
EDIT - You might want to use a regex too, something like [a-zA-Z0-9,./+&-], but using a set could allow you to expand your search engine gradually by adding more characters to the known-characters set.
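The set-based approach could be sketched as follows. This is an assumption about what the answer intends, not its actual code: the names build_allowed_set and filter_allowed are hypothetical, the set is a plain associative array, and the input is assumed to be valid UTF-8 so that a multibyte character is dropped as a whole.

```php
<?php
// Hypothetical helper: build a lookup set of allowed characters.
function build_allowed_set(string $extra): array
{
    $chars = array_merge(range('a', 'z'), range('A', 'Z'), range('0', '9'), str_split($extra));

    // Characters become array keys, so membership tests are O(1) via isset().
    return array_fill_keys($chars, true);
}

// Hypothetical helper: keep only characters present in the set.
function filter_allowed(string $input, array $allowedSet): string
{
    // Split into Unicode characters; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $kept = '';
    foreach ($chars as $ch) {
        if (isset($allowedSet[$ch])) {
            $kept .= $ch;
        }
    }
    return $kept;
}

$allowed = build_allowed_set(' ,./+&-');
echo filter_allowed('abc é 123.', $allowed), "\n"; // prints "abc  123." (two spaces)
```

Supporting another language later then only means adding its letters to the set, which is the gradual expansion the answer describes.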
This may not be the most efficient way, but it works :)
$str='"it is a simple test \ + - é Ó ½ 213 /:;()$&#".~" ';
$result= preg_replace('/[^\s\w\+\-\\":;#\(\)\$\&\.\/]*/', '', $str);
echo $result;
But you need to add every special character you want to allow to the character class.

Replace content between two words [duplicate]

This question already has answers here:
Get content between two strings PHP
(7 answers)
Closed 4 years ago.
I am trying to replace the content between two words using PHP. The content between the two words varies, so I can't use a traditional str_replace. I want to replace the content between two words, for example:
I would like to replace **some string of text** between two words
change to:
I would like to replace between two words
You can see that I removed all the wording between "some" and "text". Again I cannot use regular str_replace because the text between the two words may differ. For example it may say:
I would like to replace **some words of text** between two words
change to:
I would like to replace between two words
The regex is simple: /some .*? text/
Just replace it with the empty string.
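As a minimal sketch of that answer in PHP (the function name remove_between is hypothetical): preg_quote() escapes the two marker words so they are matched literally, .*? keeps the match lazy so it stops at the first end marker, and a leading \s* is added here, as an assumption about the desired output, to avoid leaving a double space behind.

```php
<?php
// Hypothetical helper: remove $start, $end and everything between them.
function remove_between(string $text, string $start, string $end): string
{
    $pattern = '/\s*' . preg_quote($start, '/') . '.*?' . preg_quote($end, '/') . '/s';

    // preg_replace returns null on error; fall back to the original text.
    return preg_replace($pattern, '', $text) ?? $text;
}

echo remove_between(
    'I would like to replace some string of text between two words',
    'some',
    'text'
), "\n";
// I would like to replace between two words
```

The s modifier lets the dot match newlines, so the removed section may span several lines.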
According to your question, only the inner part of your string changes. If that is the case, it's rather trivial, because you already have the solution: instead of replacing the middle, simply don't carry it over. Keep the fixed beginning and the fixed end, where $startlen and $endlen are the lengths of those two fixed parts:
$result = substr($string, 0, $startlen) . substr($string, -$endlen);
Perhaps this helps you find other angles of attack for problems like this.

PHP preg_replace only multiple occurrences [duplicate]

This question already has answers here:
PHP Preg-Replace more than one underscore
(7 answers)
Closed 1 year ago.
The following code replaces spaces correctly:
$string = preg_replace("/[[:blank:]]+/", "", $string);
But how can I make it replace only when there are two or more consecutive blanks? Right now it replaces every space; I only need it to act on runs of more than one space. I searched here and saw people using quite different preg_replace patterns, but those also remove newlines, so it would be great if the code I posted could simply be modified to match more than one blank. I remember a tutorial that used something like {2+} in the pattern to match two or more of something, but I'm not sure how to make it work correctly.
/[[:blank:]]{2,}/
That will make it replace sequences of two or more.
The php manual has a chapter about repetition/quantifiers.
$string = preg_replace("/[[:blank:]]+/", " ", $string);
Same as yours, but it replaces each run of blanks with a single space.
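Combining the two answers above, a minimal sketch (the function name collapse_blanks is hypothetical): the {2,} quantifier limits the match to runs of two or more blanks, and [[:blank:]] covers only spaces and tabs, so newlines are left alone.

```php
<?php
// Hypothetical helper: collapse runs of 2+ spaces/tabs into one space.
function collapse_blanks(string $s): string
{
    // [[:blank:]] matches space and tab only; {2,} means "two or more".
    return preg_replace('/[[:blank:]]{2,}/', ' ', $s) ?? $s;
}

var_dump(collapse_blanks("a  b \t c\n\nd") === "a b c\n\nd"); // bool(true)
```

A single space is never matched, so already clean text passes through unchanged.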
