PHP highlight search terms using preg_replace - character encoding issues

Hi guys, I need to highlight searched words in the results, and I was thinking of using preg_replace. It works just fine until I use accented characters.
This is my code:
preg_replace("/(?<!\[)(\b{$search}\b)(?!\])/i", $replace, $string);
If I search for the word "mokus", it finds it but leaves out "mókus".
The same thing happens the other way around.
Any ideas? Thanks in advance.

You might want to research the term "accent folding".
Here's a good article for understanding the problem. The proposed solutions are in JavaScript, but you can translate the logic to PHP.
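As a rough idea of what that translation could look like, here is a minimal PHP sketch. The function name accentInsensitivePattern and the small vowel map are my own illustrative assumptions (they are not taken from the article), and the map only covers a handful of letters, so extend it for your language. It needs the mbstring extension and assumes UTF-8 input.

```php
<?php
// Hypothetical helper: builds a regex that matches $term whether or not
// its vowels carry accents. The map below is illustrative, not complete.
function accentInsensitivePattern(string $term): string
{
    $variants = [
        'a' => 'aáàâä',
        'e' => 'eéèêë',
        'i' => 'iíìîï',
        'o' => 'oóòôöő',
        'u' => 'uúùûüű',
    ];

    // Reverse lookup: every variant points back to its base letter.
    $baseOf = [];
    foreach ($variants as $base => $chars) {
        foreach (preg_split('//u', $chars, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $baseOf[$ch] = $base;
        }
    }

    // Replace each known letter with a character class of its variants.
    $body = '';
    $letters = preg_split('//u', mb_strtolower($term, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($letters as $ch) {
        $body .= isset($baseOf[$ch])
            ? '[' . $variants[$baseOf[$ch]] . ']'
            : preg_quote($ch, '/');
    }

    // Letter/digit lookarounds stand in for \b; /u makes the pattern UTF-8 aware.
    return '/(?<![\p{L}\p{N}])(' . $body . ')(?![\p{L}\p{N}])/iu';
}

$out = preg_replace(accentInsensitivePattern('mokus'), '<b>$1</b>', 'A mókus és a mokus');
echo $out, "\n"; // A <b>mókus</b> és a <b>mokus</b>
```

Because both the search term and the text are folded through the same classes, searching for "mókus" finds "mokus" as well.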

Related

How do I make the PHP similar_text() function work for Japanese Characters (Kanji, Katakana and Hiragana)?

I want to use the similar_text() function provided by PHP on Japanese characters, but it gives the wrong answer. How can I make it work?
For example:
similar_text('土橋勇樹', '東日刷株式')
returns 3, but the two strings clearly share no characters, so it should return 0.

You will need to handle the multibyte sequences that make up the Kanji characters. I am not 100% certain, but I suspect similar_text() compares bytes rather than characters, so you need a multibyte-aware equivalent.
These links show people's attempts at a multibyte version of the PHP function:
https://gist.github.com/soderlind/74a06f9408306cfc5de9
https://github.com/antalaron/mb-similar-text
I have not personally tested them, but the approach could be right or could inspire you to write a custom function.
This is also covered in another post:
how to use similar text php code in arabic
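If you do end up writing a custom function, a minimal sketch could look like the following. It is not the code from either link; the names mbSimilarText and similarChars are hypothetical, and it assumes UTF-8 input. The idea is the same one similar_text() is described as using (find the longest common run, then recurse on the pieces to its left and right), only applied to characters instead of bytes.

```php
<?php
// Hypothetical multibyte-aware counterpart to similar_text().
function mbSimilarText(string $a, string $b): int
{
    // Split into whole UTF-8 characters rather than bytes.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    return similarChars($x, $y);
}

// Counts matching characters: longest common run plus recursion on both sides.
function similarChars(array $x, array $y): int
{
    $best = 0;
    $posX = 0;
    $posY = 0;
    $nx = count($x);
    $ny = count($y);

    for ($i = 0; $i < $nx; $i++) {
        for ($j = 0; $j < $ny; $j++) {
            $k = 0;
            while ($i + $k < $nx && $j + $k < $ny && $x[$i + $k] === $y[$j + $k]) {
                $k++;
            }
            if ($k > $best) {
                $best = $k;
                $posX = $i;
                $posY = $j;
            }
        }
    }

    if ($best === 0) {
        return 0;
    }

    return $best
        + similarChars(array_slice($x, 0, $posX), array_slice($y, 0, $posY))
        + similarChars(array_slice($x, $posX + $best), array_slice($y, $posY + $best));
}

echo mbSimilarText('土橋勇樹', '東日刷株式'), "\n"; // 0
echo mbSimilarText('World', 'Word'), "\n";       // 4
```

This is quadratic or worse in the string lengths, so treat it as a starting point for short strings only.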

preg_match regex syntax

Tried different regex generators with no luck.
I have this string that I put in preg_match:
$search_string = "/^:([A-Za-z0-9_\-]+)[#!~a-zA-Z0-9#\.\-]+\s*([A-Z]+)\s*[:]*([\#a-zA-Z0-9\-]+)*\s*[:]*([!\#\-\.A-Za-z0-9 ]+)*/";
It's basically for usernames. Sadly, it fails when the username has an underscore in it; for example, iam_coolguy doesn't work.
How do I add the underscore to this search string?
I can't seem to figure out how regex works.
It's not a duplicate; I scrolled past all the preg_match threads.
/[a-z]/i seems easy and understandable to me, but my string is too advanced for my knowledge.
Thanks.

If you are just looking to grab something between the slashes,
I would just use this regex: \/(.*)\/
But as the others have said, you haven't given any limitations on what the username can and can't have in it.
If you need more, say something and I will adjust my answer.
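One possible reading of the original pattern, offered as a guess since the input format was never posted: the first capture group already allows the underscore, but the character class right after it does not, so a line where the name appears again after the first group would stop matching there. The sketch below adds "_" to that second class only. The sample line is entirely hypothetical.

```php
<?php
// Pattern exactly as posted in the question.
$original = '/^:([A-Za-z0-9_\-]+)[#!~a-zA-Z0-9#\.\-]+\s*([A-Z]+)\s*[:]*([\#a-zA-Z0-9\-]+)*\s*[:]*([!\#\-\.A-Za-z0-9 ]+)*/';

// Same pattern with one change: "_" added to the second character class.
$fixed = '/^:([A-Za-z0-9_\-]+)[#!~a-zA-Z0-9_#\.\-]+\s*([A-Z]+)\s*[:]*([\#a-zA-Z0-9\-]+)*\s*[:]*([!\#\-\.A-Za-z0-9 ]+)*/';

// Hypothetical input line (an assumption; the real format was not given).
$line = ':iam_coolguy!~iam_coolguy PRIVMSG #chan :hello';

var_dump(preg_match($original, $line));   // int(0)
var_dump(preg_match($fixed, $line, $m));  // int(1)
echo $m[1], "\n";                         // iam_coolguy
```

If your real input looks different, the underscore may need to go into one of the other classes instead; the rule is simply that every class the username passes through has to list "_".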

Fix broken UTF-8 on PHP-substring

I got a little problem:
I wrote my own search engine for my Joomla-based website. The problem is that I generate a preview of the article text using PHP's substr() function. It works fine, but it has issues when it has to split multibyte characters, since it takes X bytes of the string rather than X characters. This means any multibyte character can get split by this function, which doesn't look nice.
Does anyone know a good workaround other than reworking it with an additional wordwrap function?
Best wishes

mb_substr() performs a multibyte-safe substring.
For example:
mb_substr('Some string', 1, 3);
http://php.net/manual/en/function.mb-substr.php
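For the preview use case, a small sketch might look like this. The helper name previewText and the "..." suffix are my own assumptions; it needs the mbstring extension and assumes the article text is UTF-8.

```php
<?php
// Hypothetical helper: cut a preview at a character count, not a byte count.
function previewText(string $text, int $maxChars): string
{
    // Short texts are returned untouched.
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }
    // mb_substr counts characters, so multibyte characters are never split.
    return mb_substr($text, 0, $maxChars, 'UTF-8') . '...';
}

echo previewText('Árvíztűrő tükörfúrógép', 9), "\n"; // Árvíztűrő...
```

Passing the encoding explicitly avoids depending on the server's mbstring default.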

Regexp word boundaries in non-ASCII situations

I have a regular expression in my PHP script like this:
/(\b$term|$term\b)(?!([^<]+)?>)/iu
This matches the word contained in $term, as long as there's a word boundary before or after it and it's not inside an HTML tag.
However, this doesn't work in non-ASCII cases, for example with Russian text. Is there a way to make it work?
I can get almost as good result with
/(\s$term|$term\s)(?!([^<]+)?>)/iu
but this is obviously more limited and since this regexp is about highlighting search terms, it has the problem of including the space in the highlight.
I've read this StackOverflow question about the problem, but it doesn't help - doesn't work correctly. In that example the captures are the other way around (capture text outside the search term, when I need to capture the search term).
Any way to make this work? Thanks!

You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters.
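A minimal sketch of that idea, assuming UTF-8 input: \p{L} (any Unicode letter) inside negative lookarounds takes the place of \b, and the "not inside a tag" check from the question is kept. The function name highlightTerm and the <mark> wrapper are illustrative assumptions.

```php
<?php
// Hypothetical helper: highlight $term where it starts or ends a word,
// skipping occurrences inside HTML tags.
function highlightTerm(string $html, string $term): string
{
    $t = preg_quote($term, '/');

    // (?<!\p{L}) = no letter before, (?!\p{L}) = no letter after.
    // (?![^<]*>) rejects matches that are followed by ">" before any "<".
    $pattern = '/((?<!\p{L})' . $t . '|' . $t . '(?!\p{L}))(?![^<]*>)/iu';

    return preg_replace($pattern, '<mark>$1</mark>', $html);
}

echo highlightTerm('Это белка, <a title="белка">тут</a>', 'белка'), "\n";
// Это <mark>белка</mark>, <a title="белка">тут</a>
```

Since the lookarounds are zero-width, the surrounding spaces are not part of the highlight, which was the drawback of the \s version.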

The \b is certainly defined to work perfectly well on Unicode, as is required by UTS#18. What are you saying it is not doing? What are the exact text strings involved?

Removing characters from a PHP String

I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.

This is an encoding problem; you shouldn't try to clean up those bogus characters, but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or agree with your feed provider that you both use the same encoding.

Thanks for the responses, guys. Unfortunately, the suggestions submitted had the following problems.
This is wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This one, which also uses the deprecated ereg form of regex:
s/[\u00FF-\uFFFF]//
didn't work either when I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean that bogus characters but understand why you're receiving them scrambled.
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("£", "&pound;", $string);
$fixT = str_replace("€", "&euro;", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>##\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.

You are looking for characters that are outside the range of glyphs that your font can display. You can find the maximum Unicode value that your font can display, and then create a regex that replaces anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip character 255 and everything above it, up to U+FFFF.
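For what it's worth, PHP's preg functions do not understand the \uFFFF escape; code points are written as \x{FFFF} and the pattern needs the /u modifier. A minimal sketch of the same range in PHP, assuming the input is valid UTF-8 (the function name stripFromU00FF is hypothetical):

```php
<?php
// Hypothetical helper: remove every code point from U+00FF to U+FFFF.
function stripFromU00FF(string $s): string
{
    // With /u, preg_replace returns null when $s is not valid UTF-8;
    // in that case the input is handed back unchanged.
    return preg_replace('/[\x{00FF}-\x{FFFF}]/u', '', $s) ?? $s;
}

echo stripFromU00FF("5\u{20AC} caf\u{00E9}\u{FFFD}"), "\n"; // 5 café
```

Note that this also throws away legitimate characters such as € and most non-Latin scripts, so it only suits feeds that are expected to be Latin-1 text.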

That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.

If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP 5's filter_input is very good for filtering input strings and allows a fair amount of flexibility
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
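Since a feed string is already in a variable rather than in $_GET or $_POST, filter_var (the counterpart of filter_input for plain variables) is probably the closer fit. A minimal sketch, with the wrapper name stripHighBytes being my own assumption:

```php
<?php
// Hypothetical wrapper: drop every byte above 127 from a string.
function stripHighBytes(string $s): string
{
    // FILTER_UNSAFE_RAW leaves the string as is; the flag removes high bytes.
    return filter_var($s, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
}

echo stripHighBytes("caf\u{00E9} \u{FFFD}ok"), "\n"; // caf ok
```

This works on bytes, so every non-ASCII character disappears, accented letters included. That matches the "accept only standard characters" approach, not the "remove only the rubbish" one.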

Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
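A quick way to get at the bytes, as a minimal sketch (the helper name byteDump is hypothetical):

```php
<?php
// Hypothetical helper: show a string as space-separated hex bytes.
function byteDump(string $s): string
{
    // bin2hex gives two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}

echo byteDump("\u{FFFD}"), "\n";  // ef bf bd
echo byteDump("A\u{00E9}"), "\n"; // 41 c3 a9
```

If the dump shows ef bf bd, the feed already contains the Unicode replacement character, which means the original bytes were lost before the data reached you.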

Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
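Once the editor experiment tells you the feed's real encoding, the conversion itself could be sketched like this. The function name toUtf8 is hypothetical, ISO-8859-1 is only an example source encoding, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: convert a feed string to UTF-8 from a known encoding.
function toUtf8(string $raw, string $sourceEncoding): string
{
    // Already valid UTF-8: leave it alone. Caveat: some byte sequences are
    // valid in both encodings, so this check is a heuristic, not a proof.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $sourceEncoding);
}

echo toUtf8("caf\xE9", 'ISO-8859-1'), "\n"; // café
```

The validity check is there because the feed reportedly contains garbage only some of the time, so well-formed input should pass through untouched.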

Hello friends,
Try this regular expression to remove literal Unicode escape sequences (\u followed by four hex digits) from the string:
/\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/
Thanks,
Chintu(prajapati.chintu.001#gmail.com)
