Replace all kinds of dashes - PHP

I have an Excel document which I import into MySQL using a library.
Some of the texts in the document contain dashes which I thought I had replaced, but apparently not all of them.
-, –, - <- all of these are different.
Is there any way I could replace all kinds of dashes with this one: -
The main problem is that I don't know all of the dashes that exist in computers.

Just use a regex with the Unicode modifier u and the Unicode property \p{Pd}:
$output = preg_replace('#\p{Pd}#u', '-', $input);
From the manual: Pd = Dash punctuation
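As a minimal sketch of the above (the helper name normalizeDashes is made up for illustration): \p{Pd} does not cover U+2212 MINUS SIGN, which Unicode classifies as a math symbol (Sm). It is added to the class explicitly here, on the assumption that you want it normalized as well.

```php
<?php
// Hypothetical helper (not part of any library): maps every Unicode dash to "-".
function normalizeDashes(string $input): string
{
    // \p{Pd} = "Dash punctuation"; \x{2212} (MINUS SIGN, category Sm) is an
    // extra added by assumption. preg_replace() returns null on invalid UTF-8,
    // in which case the input is returned unchanged.
    return preg_replace('/[\p{Pd}\x{2212}]/u', '-', $input) ?? $input;
}

// en dash, em dash, minus sign and a plain hyphen-minus
echo normalizeDashes("a\u{2013}b\u{2014}c\u{2212}d-e"), PHP_EOL; // a-b-c-d-e
```

The "\u{...}" escapes need PHP 7 or later; on older versions you would paste the characters literally.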

How about:
$string = str_replace(array('-','–','-','—', ...), '-', $string);
Use the above code and see if it works. If you're still seeing some dashes not being replaced, you can just add them into the array, and it'll work.
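If you go the explicit-list route, writing the dashes as code point escapes makes it visible which ones are covered. This is only a sketch: the list below is a guess at common dashes, not an exhaustive one, and the function name is hypothetical.

```php
<?php
// Hypothetical helper: replaces a hand-picked (non-exhaustive) list of dashes.
function replaceListedDashes(string $string): string
{
    $dashes = [
        "\u{2010}", // hyphen
        "\u{2011}", // non-breaking hyphen
        "\u{2012}", // figure dash
        "\u{2013}", // en dash
        "\u{2014}", // em dash
        "\u{2015}", // horizontal bar
        "\u{2212}", // minus sign
    ];
    // str_replace() works on bytes, which is safe for valid UTF-8 input.
    return str_replace($dashes, '-', $string);
}

echo replaceListedDashes("2010\u{2013}2020"), PHP_EOL; // 2010-2020
```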

Related

using regex for filtering some words in persian in php

I'm working on a script that identifies offensive words in text messages. The problem is that users sometimes alter words to make them unidentifiable, and my code has to be able to identify those too, as far as possible.
First of all, I replace all non-alphanumeric characters with spaces.
Then I apply two regex patterns that I've written.
The first removes repeated characters from the string.
For example, if the user has written seeeeex, it replaces it with sex:
preg_replace('/(.)\1+/', '$1', $text)
This regex works fine for English words but not for Farsi words, which is my case.
For example, if you write:
امیییییییییین
it does nothing with it.
I also tried
mb_ereg_replace
But it didn't work either.
My other regex is meant to remove the spaces around one-letter words.
For example, I want it to convert S E X to sex:
preg_replace('/( [a-zA-Zآ-ی] )\1+/', trim('$1'), $text);
This regex doesn't work at all and needs to be corrected.
Thank you for your help
When working with multi-byte characters, you should enable the Unicode-aware modifier (u) so that tokens such as . match whole characters instead of single bytes. In your first case it should be:
/(.)\1+/u
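A small sketch of why the modifier matters (the function name is invented for this example): without u, the dot matches one byte, and each Persian letter is two different bytes in UTF-8, so the backreference never sees the same byte twice in a row. With u, the dot matches a whole character.

```php
<?php
// Hypothetical helper: collapses runs of the same character into one.
function collapseRepeats(string $text): string
{
    // The u modifier makes "." match a full UTF-8 character.
    return preg_replace('/(.)\1+/u', '$1', $text) ?? $text;
}

// Alef, Meem, Farsi Yeh repeated five times, Noon
$word = "\u{0627}\u{0645}" . str_repeat("\u{06CC}", 5) . "\u{0646}";
echo collapseRepeats($word), PHP_EOL;      // Alef, Meem, one Yeh, Noon
echo collapseRepeats('seeeeex'), PHP_EOL;  // sex
```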
Your second regex, however, has both syntax and semantic errors. You could change it to:
/\b(\pL)\s+/u
PHP:
preg_replace('/\b(\pL)\s+/u', '$1', $text);
Putting all together:
$text = 'سسس ککک سسس';
echo preg_replace(['/(.)\1+/u', '/\b(\pL)\s+/u'], '$1', $text); // outputs: سکس

Regex to replace punctuation

I've been trying for a few hours to get this to work to the effect I need but nothing works quite like it should. I'm building a discussion board type thing and have made a way to tag other users by putting #username in the post text.
Currently I have this code to strip anything that wouldn't be part of the username once the tags have already been pulled out of the entire text:
$name= preg_replace("/[^A-Za-z0-9_]/",'',$name);
This works well because it correctly captures names written as, for example, (#username), #username:, or #username, some text etc. (that is, it removes the ,, :, and )).
However, this does not work when the user has non-ASCII characters in their username. For example, if it's #üsername, the line above returns sername, which is not useful.
Is there a way to use preg_replace to still strip this additional punctuation but retain any non-ASCII letters?
Any help is much appreciated :)
You are entering the area of Unicode regexps.
$name= preg_replace('/[^\p{Letter}\p{Number}_]/u', '', $name);
or the other way round. The link I provided contains more examples.
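For illustration, here is that pattern wrapped in a function and run on the inputs from the question. The name cleanUsername is hypothetical, and \p{L} and \p{N} are the short forms of the letter and number properties.

```php
<?php
// Hypothetical helper: keeps Unicode letters, digits and underscores only.
function cleanUsername(string $name): string
{
    // Requires the u modifier so the subject is treated as UTF-8.
    return preg_replace('/[^\p{L}\p{N}_]/u', '', $name) ?? $name;
}

echo cleanUsername("(#\u{00FC}sername):"), PHP_EOL; // u-umlaut followed by "sername"
echo cleanUsername('#user_name42,'), PHP_EOL;        // user_name42
```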
To detect punctuation characters, you can use the Unicode property \p{P} instead (with the u modifier, so that multi-byte characters are not matched byte by byte):
$name = preg_replace('/[\p{P} ]+/u', '', $name);

preg_replace to convert a user input string to a link pattern

I want to convert user input string
"something ... un// important ,,, like-this"
to
"something-un-important-like-this"
So basically, replace each run of special characters with "-". I've googled and came up with this:
preg_replace('/[-]+/', '-', preg_replace('/[^a-zA-Z0-9_-]/s', '-', strtolower($string)));
I'm curious to know whether this can be done with a single preg_replace().
Just to be clear:
Replace all special characters and blank spaces with a hyphen (-). If several occur consecutively, replace them with a single hyphen.
My solution works exactly as I want, but I'm looking to do the same in a single call.
There was a similar question yesterday, but I don't have it at hand.
In your current first pattern:
[^a-zA-Z0-9_-]
you're looking for a single character only. If you make that a greedy match for one or more, the regular expression engine will automatically replace multiple of these with a single one:
[^a-zA-Z0-9_-]+
              ^- + = one or more
You then still have the problem that existing - inside the string are not caught, so you need to take them out of the "not-in" character class:
[^a-zA-Z0-9_]+
This then should do it:
preg_replace('/[^a-zA-Z0-9_]+/s', '-', strtolower($string));
And as it's only lowercase, you do not need to look for A-Z as well, just another reduction:
preg_replace('/[^a-z0-9_]+/s', '-', strtolower($string));
See also Repetition and Quantifiers, of which + is one (see Repetition in the PHP docs; Repetition with Star and Plus on regular-expressions.info).
Also, if you take a look at the pattern modifiers in the PHP docs, you'll see that the s (PCRE_DOTALL) modifier is not necessary:
$urlSlug = preg_replace('/[^a-z0-9_]+/', '-', strtolower($string));
Hope this helps and explains a little about the regular expression you're using, and also where you can find further documentation, which is always helpful.
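Put together as a runnable sketch (the function name slugify and the final trim() are additions, not part of the question): the single preg_replace() call collapses each run of non-matching characters into one hyphen, and trim() drops a hyphen left at either end when the input starts or ends with special characters.

```php
<?php
// Hypothetical helper built around the single-call pattern above.
function slugify(string $string): string
{
    // One or more characters outside a-z, 0-9 and _ become a single hyphen.
    $slug = preg_replace('/[^a-z0-9_]+/', '-', strtolower($string));
    // Assumption: leading and trailing hyphens are unwanted in a slug.
    return trim($slug, '-');
}

echo slugify('something ... un// important ,,, like-this'), PHP_EOL;
// something-un-important-like-this
echo slugify(' Hello, World! '), PHP_EOL; // hello-world
```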
Try This:
preg_replace('/[^a-zA-Z0-9_-]+/s', '-', strtolower($string));

Php - Group by similar words

I was just wondering how we could group or separate similar words in PHP or MySQL. For instance, I have Samsung Galaxy Ace; is it possible to recognize S120, S-120, s120, S-120 as the same?
Is this even possible?
Thanks
What you could do is strip all non alphanumeric characters and spaces, and strtoupper() the string.
$new_string = preg_replace("/[^a-zA-Z0-9]/", "", $string);
$new_string = strtoupper($new_string);
Only those? Easily.
/S-?120/i
But if you want to extend, you'll probably need to move from REGEX to something a little more sophisticated.
The best thing to do here is to pick a format and standardise on it. So for your example, you would just store S120, and when you get a value from a user, strip all non-alphanumeric characters from it and convert it to upper case.
You can do this in PHP with this code:
$result = strtoupper(preg_replace('/(\W|_)+/', '', $userInput));
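A rough sketch of how that normalisation can drive the grouping itself. Both function names are made up for this example, and it assumes the values are plain ASCII model codes.

```php
<?php
// Hypothetical helper: canonical form used as the grouping key.
function normalizeModel(string $raw): string
{
    // Drop everything except ASCII letters and digits, then upper-case.
    return strtoupper(preg_replace('/[^A-Za-z0-9]+/', '', $raw));
}

// Hypothetical helper: groups the original spellings under their key.
function groupByModel(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        $groups[normalizeModel($name)][] = $name;
    }
    return $groups;
}

print_r(groupByModel(['S120', 'S-120', 's120', 'Ace 5830']));
```

In MySQL you would store the normalised value in its own column and GROUP BY that column.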

PHP trim problem

I asked earlier how I can get rid of extra hyphens and whitespace added at the beginning and end of user-submitted text. For example, -ruby-on-rails- should be ruby-on-rails. You suggested trim(), which worked fine by itself, but when I added it to my code it did not work at all; it actually did some funky things to my code.
I tried placing the trim() call everywhere in my code but nothing worked. Can someone help me get rid of the extra hyphens and whitespace at the beginning and end of user-submitted text?
Here is my PHP code.
$tags = preg_split('/,/', strip_tags($_POST['tag']), -1, PREG_SPLIT_NO_EMPTY);
$tags = str_replace(' ', '-', $tags);
Update the trim statement to the following in order to update each item in the array:
foreach ($tags as $key => $value) {
    $tags[$key] = trim($value, '-');
}
That trims each value individually, since trim() expects a string rather than an array.
If you have a string you can do this to strip hyphens from the beginning and end:
$tag = trim($tag, '-');
Your problem is that preg_split returns an array, but trim takes a string. You need to do the above for every string in the array.
Regarding trimming whitespace: if you are first converting all whitespace to hyphens then it should not be necessary to trim whitespace afterwards - the whitespace will already be gone. But be careful because the terms "whitespace" and "space" have different meanings. Your question seems to muddle these two terms.
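As a sketch of the whole pipeline under these assumptions (parseTags is a hypothetical name, and tags are taken to be comma-separated as in the question): each piece is cleaned as a string, so trim() never receives an array, and runs of whitespace or hyphens are collapsed before the ends are trimmed.

```php
<?php
// Hypothetical helper: turns raw comma-separated input into clean tags.
function parseTags(string $raw): array
{
    $tags = [];
    foreach (explode(',', strip_tags($raw)) as $tag) {
        // Collapse each run of whitespace/hyphens into one hyphen,
        // then strip hyphens from both ends.
        $tag = trim(preg_replace('/[\s-]+/', '-', $tag), '-');
        if ($tag !== '') {
            $tags[] = $tag; // skip pieces that were only separators
        }
    }
    return $tags;
}

print_r(parseTags(' -ruby on rails- , php ,, - '));
// yields two tags: ruby-on-rails and php
```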
Verify that the hyphen character you're attempting to trim is the same hyphen character that is wrapping -ruby-on-rails-. For example, these are all different characters that look similar: -, –, —, ―.
I'm new to StackOverflow.com, so I hope the function I wrote helps you in some way. You can specify which characters to trim in the second parameter; for your example I've set it to remove spaces and 'dashes' by default. I've tested it using 'ruby-on-rails' and a somewhat extreme example of '- -- - - ruby-on-rails - -- - - -', and both produce the result 'ruby-on-rails'.
The regular expression might be a bit of a quick-and-dirty way of going about it, but I hope it helps. Just reply if you have any problems implementing it.
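function customTrim($s, $c = '- ')
{
    // Character class matching anything that is NOT one of the trim characters.
    $a = '[^' . preg_quote($c, '#') . ']';
    // Match from the first kept character through the last kept character.
    if (preg_match('#' . $a . '(?:.*' . $a . ')?#s', $s, $match)) {
        return $match[0];
    }
    return ''; // the string contained only trim characters
}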
