Ignore accents in greek words when using elastic search

Ignore accents in greek words when using elastic search - php

I have a search input box that supports autosuggestion logic. The results are fetched from an elastic index whose analyzers (index_analyzer, search_analyzer) use as tokenizer: nGram, and as filter: standard, lowercase and asciifolding for the search_analyzer & lowercase and asciifolding for the index_analyzer respectively.
What I am struggling to achieve but without any effect/result yet is to get result(s) even if the user has given a greek word without the accent (tonos). Otherwise, the user gets proper results and the mechanism works as expected.
I have to mention that the given string is matched against a specific field to the document set that includes greek words with accent. Moreover, this field is of datatype string and enabled to get analyzed.
The query is formed (using example string without accent that highlights the problem):
$searchString = mb_strtolower('Προταση', 'UTF-8')
$queryText = new Elastica_Query_Text();
$queryText->setField('name', $searchString)
$query = new Elastica_Query();
$query->setQuery($queryText);
A quick solution but not appropriate cause it's kinda heavy for this purpose, is to form a fuzzy query with min_similarity set to 0.7. Then it works thoroughly but the cost is significant.
All the work has been done using the elastica, let alone php. Could you please help to solve my problem ? It is imperative for me that a solution be found.
Thank you in advance

Related

PHP: Converting scandic characters to hex unicode

I'm working on a Joomla site with Fabrik and problem is that Fabrik serializes some data using json_encode() but does not take into account the possibility of åäö and such. Now when a database search is made, it tries to find stuff with åäö, but doesn't find anything, because
everything is \u00e4 and \u00f6
and so forth.
I'm not much for digging into Fabrik's code and inserting one flag somewhere and worry about accidentally overwriting it when I update Fabrik. So I figured, since I'm disappointed in Fabrik anyway, I could just write around it completely in a custom template. Easy.
The problem is that I can't find a way or a function like htmlentities() that I can just feed the stuff to to make it match. I could just character replace them, but that's not a good solution.
Paraphrase: I wanna make word Mörkö into -> M\u00f6rk\u00f6. How?

Maybe there's another way but that works as excepted :
$encoded = substr(json_encode('Mörkö'), 1, -1);
json_encode('Mörkö') => "M\u00f6rk\u00f6"
substr() => M\u00f6rk\u00f6

Wrapping words in a SQL query string with regex

I have a database class that is written in PHP and it should take care of some things I don't want to care about. One of these features is handling the decryption of columns that are encoded with the AES function of MySQL.
This works perfect in normal cases (which in my opinion means that there is no alias in the query string "AS bla_bla"). Lets say that someone writes a query string that contains an alias, which contains the name of a column the script should decrypt, the query dies, because my regex wraps not only the column, but the alias as well. That is not how its supposed to be.
This is the regex I've written:
preg_replace("/(((\`|)\w+(\`|)\.|)[encrypted|column|list])/i", "AES_DECRYPT(${0},'the hash')"
The part with the grave accents is there because sometimes the query does contain the table name which is either inside of grave accents or not.
An example input:
SELECT encrypted, something AS 'a_column' FROM a_table;
An example output:
SELECT AES_DECRYPT(encrypted, 'the hash'), something AS 'a_AES_DECRYPT(column, 'the hash')' FROM a_table;
As you can see, this is not going to work, so my idea was to search only for words, that are not right after the word 'as' until a special character or a white space appears. Of course i tried it hours to work, but I don't get the correct syntax.
Is it possible to solve this with pure regex and if yes how would it look like?

This should get you started:
$quoted_name = '(\w+|`\w+`|"\w+"|\'\w+\')';
preg_match("/^SELECT ((, )?$quoted_name( AS $quoted_name)?)* FROM $quoted_name;$/", "SELECT encrypted, something AS 'a_column' FROM a_table;", $m);
var_dump($m);
The replacement parts should be easy to spot an write after you study the var_dump.

Replace '&' with 'and' on the fly in PHP

Is there a way to replace the character & with and in a PHP web form as the user types it rather than after submitting the form?
When & is inserted into our database our search engine doesn't interpret the & correctly replacing it with & returning an incorrect search result (i.e. not the result that included &).
Here is the field we would like to run this on:
<input type="text" name="project_title" id="project_title" value="<?php echo $project_title; ?>" size="60" class="btn_input2"/>

Is there a way to replace the character & with and in a PHP web form as the user types it rather than after submitting the form?
PHP is on the server, it has no control over anything taking place under any circumstances what-so-ever on the client-side. It sends raw text from the web server, a 100megaton thermonuclear device explodes, and PHP never exists anymore after the content is sent. Just the document received on your client side remains. To work with effects on your client side, you need to work with JavaScript.
To do that, you would pick your favorite JavaScript library and add an event listener for "keyup" events. Replace ampersands with "and", and drop the replacement text back in the box. mugur has posted an answer that shows you how to do this.
This is a horrible solution in practice because your users will be screaming for bloody justice to deliver them from such an awful user experience. What you've ended up doing is replacing the input text with something they didn't want. Other search tools do this, why can't yours? You hit backspace, then what? When you hit in the text, you probably lose your cursor position.
Not only that, you're treating a symptom rather than the cause. Look at why you're doing this:
The reason is when & is inserted into our database our search engine flips out and replaces it with & which then returns an incorrect result (i.e. not the result that included &).
No, your database and search engine do no such thing as "flipping out". You're not aware of what's going on and try to treat symptoms rather than learn the cause and fix it. Your symptom cure will create MORE issues down the road. Don't do it.
& is an HTML Entity Code. Every "special" charecter has one. This means your database also encodes > as > as well as characters with accents in them (such as French, German, or Spanish texts). You get "Wrong" results for all of these.
You didn't show any code so you don't get any code. But here's what your problem is.
Your code is converting raw text into HTML Entity codes where appropriate, you're searching against a non-encoded string.
Option 1: Fix the cause
Encode your search text with HTML entities so that it matches for all these cases. Match accent charecters with their non-accented cousins so searching for "francais" might return "français".
Option 2: Fix one symptom
Do a string replace for ampersands either on the client or server side, your search breaks for all other encodings. Never find texts such as "Bob > Sally". Never find "français".

Before submitting the form you'd need to use JavaScript to change as the user types it in. Not ideal since JS can be turned off.
You'd be much better to "clean" the ampersands after submitting but before inserting into the database.
A simple str_replace should work:
str_replace(' & ',' and ', $_POST['value']);
But as others have pointed out, this isn't a good solution. The best solution would be to encode the ampersands as they go into the database (which seems to be happening just now), then modify your search script to allow for this.

You can do that as they complete the form with jquery like this:
$('#input').change(function() { // edited conforming Icognito suggestion
var some_val = $('#input').val().replace('&', 'and');
$('#input').val( some_val );
});
EDIT: working example (http://jsfiddle.net/4gXZW/13/)
JS:
$('.target').change(function() {
$('.target').val($('.target').val().replace('&', 'and'));
});
HTML:
<input class="target" type="text" value="Field 1" />
Otherwise you can do that in PHP before the insert sql.
$to_insert = str_replace("&", "and", $_POST['your_variable']);

Wordpress Query £ signs failure

I have code that takes the name of a term and pulls in a post of a custom post type of the same name. This works well. Except when a £ character is in the title.
e.g. pseudocode
$q = new WP_Query (array( 'name' => "Insurance Rating £1K"));
if($q->have_posts()){
// expected path of logic flow
} else {
// nothing was found =s
}
This post does indeed exist, yet it is not found, and this problem only affects cases with a '£' character in the title. Since Wordpress already sanitizes the titles etc, what is happening? Why does this not work?
edit:
This is a general case, not specific to any codebase of mine. I want to know why this happens and how to avoid it, the codebase this first arose in is irrelevant. So I dont need an alternative solution, as I'm looking for Why it happened
edit 2:
The database tables are using utf8_general_ci encoding.
The £ character is also being saved as is, not as a html entity, here's a screenshot from phpmyadmin:

What encoding is your PHP file in, and does it match the encoding of the database? They need to match for this to work. (Check your IDE or this link provided by #Tom)
Failing that, make sure that the character isn't a £ entity instead of the literal character.

Blocking Cuss/Vulgar/Obscenity Terms in PHP

I know you might laugh, but actually this is a common need in most apps. Many apps that take in customer/visitor input may need to filter cuss words or vulgar terms.
Sometimes PHP changes and new stuff gets added in. For instance, just the other day I learned about MultiCurl API in PHP5. So, anyway, is there a new native function in PHP that lets me filter most common English-based cuss words in a string, as well as flip a boolean to say, "string had English-based cuss words in it"? It doesn't need to be perfect, obviously, but cut out a good bit of garbage and let me replace it with ### for instance.
If that's not part of PHP yet, then does anyone have a function that I can use which cloaks the cuss word list? For instance, I want it such that I can drop the class in a project and not have to worry about another programmer getting offended. In other words, a decently encoded cuss word list -- not one actually spelled out.
Now, obviously it needs to be flexible and let words like "rebuttal" get through.
tl;dr: Does PHP5 now have a native function that can filter obscene words? And if not, does anyone have a class that encodes a cuss word list so that it doesn't offend other programmers?

I doubt this is something that would be a high priority for the core PHP team since that treads dangerously close to censorship. Censorship in that they would have a 'master' list of 'inappropriate' language which should be filtered.
You can do this fairly simply. Make up an array of all the words you want filtered out and when a page is displayed that contains user input run a preg_filter() on the words.
$bad_words = array('bleeping', 'blooping');
$submitted_text = 'bleh blah....';
echo preg_filter($bad_words, $replace, $submitted_text);
Note: you will have to deal with the edge cases where a bad word might be inside of a good word (i.e.- 'shitzu[sic] dog')
EDIT
For the bad-words-inside-good-words issue, you can add to the regular expression to require space at the beginning and end of the bad word. If you have lots of submissions though, it's going to be a constant battle to keep up with the trolls.

<?php
$badwords = "fuc";
$replacebad = "****";
$string = $_POST['something'];
$filtered = str_ireplace($badwords, $replacebad, "$string");
echo $filtered;
?>
something like this ?
Edit:
sorry I didn't noticed the php5 part ..

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.