I'm not sure whether this question is specific to Cassandra or whether it also belongs to PHP, so I'm sorry for tagging PHP.
So basically I'm ordering some long-row columns by their column names, which looks like this:
2012-01-01_aa_99999 | 2012-01-01_aaa | 2012-01-12_aaaaa
So this is working the way I want it to, but I don't understand how it actually orders those strings.
What is not clear to me is that the first string, 2012-01-01_aa_99999, seems to be much longer than the other two, and I'm concerned that at some point the ordering might ignore the first part of the string (which is a date) and put some strings where they don't belong.
In my case those strings consist of quite a few parts, so I'm really concerned about this; basically I need an explanation of how this ordering happens internally.
I don't understand how it actually orders those strings.
The strings you provided appear to be lexicographically ordered.
I had the same question, as I want to construct a composite primary key index with well-understood sorting behaviour. It turns out Cassandra appears to compare UTF-8 strings using a byte-by-byte binary comparison... which is a broken sort function from a linguistic perspective. If you had mixed ASCII and Kanji characters in your string, for example, your sort order would follow raw byte values rather than anything linguistically meaningful. However, as long as this sort order is known, you can design your usage patterns around it.
This could easily be changed, of course; it would be nearly a one-line patch to plug in a "real" collation-aware comparator, at the cost of a bit of extra CPU time.
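If you want to see the effect locally, PHP's strcmp() also compares strings byte by byte, so a small script reproduces the order from the question (a rough illustration only, not Cassandra's actual comparator):

    <?php
    // Rough PHP illustration (not Cassandra itself): strcmp() compares
    // strings byte by byte, which mirrors the ordering described above.
    $names = [
        '2012-01-12_aaaaa',
        '2012-01-01_aaa',
        '2012-01-01_aa_99999',
    ];

    usort($names, 'strcmp');
    print_r($names);

    // '2012-01-01_aa_99999' sorts before '2012-01-01_aaa' because the
    // first differing byte is '_' (0x5F) vs 'a' (0x61); string length
    // only matters when one string is a prefix of the other.

Because the comparison works prefix-first, the date at the start of each name always dominates: strings are only compared further along when their date parts are byte-identical.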
Related
I have two words, nice and niece. How do I figure out that an e is missing to make both words the same?
Also, say I have from and form. How do I return/figure out that the letters r and o have to be swapped to make the words the same?
What I really want is a PHP function that does what's shown in the image below.
I tried using built-in PHP functions for string manipulation, but none seem to be able to accomplish what I want.
Any help will be greatly appreciated.
What you're looking for is optimal alignment. The simplest way to compute it is the Needleman–Wunsch algorithm. For bigger inputs you will need something better but more complicated: Hirschberg's algorithm (which has better memory complexity). Or simply grab a diff library written in your language.
However, you won't get an 'impossible' result, as any word can be changed into any other word using deletions, insertions and substitutions. If you need different constraints, you would have to modify the above algorithms and use your own metrics and/or operations.
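As a rough illustration of the idea (a Levenshtein-style dynamic-programming table with a backtrace, the same principle as Needleman–Wunsch restricted to insert/delete/substitute), here is a minimal PHP sketch; the function name editOps() is made up for this example:

    <?php
    // Minimal sketch: fill the edit-distance table, then walk back
    // through it to recover the individual operations.
    function editOps(string $a, string $b): array
    {
        $m = strlen($a);
        $n = strlen($b);

        // $d[$i][$j] = edit distance between the first $i characters
        // of $a and the first $j characters of $b.
        $d = [];
        for ($i = 0; $i <= $m; $i++) {
            for ($j = 0; $j <= $n; $j++) {
                if ($i === 0) {
                    $d[$i][$j] = $j;
                } elseif ($j === 0) {
                    $d[$i][$j] = $i;
                } else {
                    $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
                    $d[$i][$j] = min(
                        $d[$i - 1][$j] + 1,         // delete from $a
                        $d[$i][$j - 1] + 1,         // insert into $a
                        $d[$i - 1][$j - 1] + $cost  // keep or substitute
                    );
                }
            }
        }

        // Backtrace: recover which operations produced the distance.
        $ops = [];
        $i = $m;
        $j = $n;
        while ($i > 0 || $j > 0) {
            if ($i > 0 && $j > 0 && $a[$i - 1] === $b[$j - 1]
                    && $d[$i][$j] === $d[$i - 1][$j - 1]) {
                $i--; $j--;                          // characters match, no op
            } elseif ($i > 0 && $j > 0 && $d[$i][$j] === $d[$i - 1][$j - 1] + 1) {
                $ops[] = "substitute '" . $a[$i - 1] . "' with '" . $b[$j - 1] . "'";
                $i--; $j--;
            } elseif ($j > 0 && $d[$i][$j] === $d[$i][$j - 1] + 1) {
                $ops[] = "insert '" . $b[$j - 1] . "'";
                $j--;
            } else {
                $ops[] = "delete '" . $a[$i - 1] . "'";
                $i--;
            }
        }
        return array_reverse($ops);
    }

    print_r(editOps('nice', 'niece'));  // one insertion: 'e'
    print_r(editOps('from', 'form'));   // reported as two substitutions

Note that a swap such as from/form comes out as two substitutions here; if you want a swap counted as a single operation, you need the Damerau–Levenshtein variant.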
Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in set A and a string in set B.
Can you please help me in finding such an algorithm?
I normally program in PHP; an SQL-only solution would be even better, but any other language would do.
As Voitcus said before, you have to clean your data before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern, try to adjust them to it. Then you can look for duplicates.
Moreover, you should do the data cleaning before persisting the data, maybe into a separate column. You then don't have to care about it when looking for duplicates, which avoids performance spikes.
Algorithms like levenshtein() or similar_text() in PHP don't fit this use case very well.
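A minimal sketch of that cleaning step, assuming the numbers arrive as free text; the default country prefix '0039' and the leading-'00' convention are placeholders for the example:

    <?php
    // Cleaning sketch: keep digits only and, if no country prefix is
    // present, prepend a default one.
    function normalize_phone(string $raw, string $defaultCountry = '0039'): string
    {
        $digits = preg_replace('/\D+/', '', $raw);   // strip non-digits

        if (strpos($digits, '00') !== 0) {           // very naive prefix check
            $digits = $defaultCountry . $digits;
        }
        return $digits;
    }

    var_dump(normalize_phone('001-555-123456'));
    // string(12) "001555123456"
    var_dump(normalize_phone('555-123456 call me in the evening'));
    // string(13) "0039555123456"

Persisting the normalized value in its own column, as suggested above, means the cleaning runs once per write instead of on every comparison.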
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways; a regular expression would be best, but see below.
Then, if possible, you can determine the country dialling code, if the user's country is known. If there is none, assume a default and prepend it to the string. The same probably goes for city/area codes. You can also take a look at where the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
The other way is to compare strings with the country (and city) code removed.
About searching for "the longest common substring": the numbers thus filtered should already be the same; however, you might still need it, e.g. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, and that nobody typed something like 555-SUPERMAN (which translates to 555-78737626), there is also the possibility of removing everything from the first alphabetic character onward.
There is also the possibility of filtering such data in the SQL statement. Consider something like SELECT ..., [your trimming function(phone_number)] AS trimmed_phone ... WHERE (trimmed_phone is not numeric characters only) GROUP BY trimmed_phone. If the trimming function removes only whitespace and common separators like -, +, . (commonly used in Germany) and perhaps ,, this query would leave you with all phone numbers that are trimmed but still contain non-numeric characters -- take a look at the results, probably mostly digits and letters. How many of them are there? Maybe they have something in common? Maybe there are some typical phrases you can filter out too?
If such a query does not return very many rows, maybe it's easier just to clean them by hand?
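For the PHP side, here is a small sketch of the "longest common substring between a string in set A and a string in set B" part, assuming the numbers have already been normalized as above; the function names are made up for this example:

    <?php
    // Longest common substring between two strings (classic DP),
    // then the best match across two sets of normalized numbers.
    function longest_common_substring(string $a, string $b): string
    {
        $best = '';
        $lengths = [];
        for ($i = 1; $i <= strlen($a); $i++) {
            for ($j = 1; $j <= strlen($b); $j++) {
                if ($a[$i - 1] === $b[$j - 1]) {
                    $lengths[$i][$j] = ($lengths[$i - 1][$j - 1] ?? 0) + 1;
                    if ($lengths[$i][$j] > strlen($best)) {
                        $best = substr($a, $i - $lengths[$i][$j], $lengths[$i][$j]);
                    }
                } else {
                    $lengths[$i][$j] = 0;
                }
            }
        }
        return $best;
    }

    // Compare every number of profile A with every number of profile B
    // and keep the longest shared run of digits.
    function best_phone_match(array $setA, array $setB): string
    {
        $best = '';
        foreach ($setA as $a) {
            foreach ($setB as $b) {
                $common = longest_common_substring($a, $b);
                if (strlen($common) > strlen($best)) {
                    $best = $common;
                }
            }
        }
        return $best;
    }

    echo best_phone_match(['001555123456'], ['555123456', '555987654']); // 555123456

For large profiles you would want to stop early once a match is "long enough" (say, the full significant part of a number) rather than always scanning every pair.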
I am working on a regular expression that would grab the price in different formats. Since I don't know which format the string will arrive in, I am trying to cover as many variations as possible.
Here is what I came up with
\$\s*?(\d+\.?\d*?)+|usd\s*?(\d+\.?\d*?)+|(\d+\.?\d*?)\s*?usd+|(\d+\.?\d*?)\s*?dollars?+|dollars?\s*?(\d+\.?\d*?)+|(\d+\.?\d*?)\s*?bucks?+|bucks?\s*?(\d+\.?\d*?)+
I've tried the above on several examples and it hasn't failed so far.
Can anyone think of a better way to achieve that?
The real answer here is going to be achieved through normalization of the data. Start by removing every character except digits, the dot, and (if you expect negative values) the hyphen. Then you will have a character string that can be used as a number. When you have some test data available, try normalization first before you try to write regular expressions. Not only will the code be easier to write, but it will run faster, too!
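A minimal sketch of that normalization idea, assuming each input string contains one price and no other numeric data:

    <?php
    // Normalization sketch: keep only digits, the dot and the hyphen,
    // then let PHP cast the result to a number.
    function normalize_price(string $raw): ?float
    {
        $clean = preg_replace('/[^0-9.\-]/', '', $raw);
        return is_numeric($clean) ? (float) $clean : null;
    }

    var_dump(normalize_price('$ 12.50'));       // float(12.5)
    var_dump(normalize_price('12 USD'));        // float(12)
    var_dump(normalize_price('about 9 bucks')); // float(9)

If the text can contain other digits (dates, phone numbers), you still need to locate the price first; normalization then replaces only the format-parsing part of the job.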
I would advise using separate expressions for each variation and testing them in sequence (most likely ones first), applying the chain of responsibility pattern.
The advantage is maintainability. When you need to support a new variation (considering you don't know all possible variations beforehand), it'll simply be a matter of adding another member to the chain, rather than fiddling with the arcane complexities of what you have built now.
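A sketch of what that chain could look like in PHP; the patterns below are deliberately simplified examples, not a complete list:

    <?php
    // Chain-of-responsibility sketch: one simple pattern per variation,
    // tried in order of likelihood; the first match wins.
    $patterns = [
        '/\$\s*(\d+(?:\.\d+)?)/i',                      // $12.50
        '/(\d+(?:\.\d+)?)\s*(?:usd|dollars?|bucks?)/i', // 12 USD, 9 bucks
        '/(?:usd|dollars?|bucks?)\s*(\d+(?:\.\d+)?)/i', // USD 12
    ];

    function extract_price(string $text, array $patterns): ?string
    {
        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $text, $m)) {
                return $m[1];   // first handler in the chain that matches
            }
        }
        return null;            // no handler matched
    }

    echo extract_price('the price is $ 12.50', $patterns), "\n"; // 12.50
    echo extract_price('costs 9 bucks', $patterns), "\n";        // 9

Supporting a new variation is then a one-line change: append another pattern to the array.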
I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason I do this is that I find it makes storing foreign characters very simple. I then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while staying searchable with a search script? So far rawurlencode has been excellent for our system. Perhaps the practice can be improved by only encoding foreign letters and not common characters like spaces, which I totally agree is a waste of space.
Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) says that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
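A quick illustration of the size point (this assumes the script itself is saved as UTF-8):

    <?php
    // Every byte outside the unreserved set becomes a three-byte %XX
    // sequence, so a two-byte UTF-8 umlaut costs six bytes on disk.
    $city = 'Düsseldorf';

    echo strlen($city), "\n";                // 11 bytes (UTF-8)
    echo rawurlencode($city), "\n";          // D%C3%BCsseldorf
    echo strlen(rawurlencode($city)), "\n";  // 15 bytes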
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Inability to use full-text searches.
URL encoding is not necessarily unique (there are at least two RFCs). It may not lead to data loss, but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such a technique is completely arbitrary and unnecessary, since MySQL has no problem at all storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werld!. Your app would run fine, but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII codes of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
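A rough PHP approximation of that ORDER BY point, sorting the encoded values byte-wise instead of letting the database collate the raw strings:

    <?php
    // Sorting the encoded values compares '%' and hex digits instead of
    // the actual letters, so the order has nothing to do with the words.
    $cities = ['Ärzen', 'Berlin', 'Zürich', 'Aachen'];

    $encoded = array_map('rawurlencode', $cities);
    sort($encoded, SORT_STRING);
    print_r($encoded);   // %C3%84rzen, Aachen, Berlin, Z%C3%BCrich

    // A database collation on the raw strings would instead give
    // something like Aachen, Ärzen, Berlin, Zürich.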
I am using a fork to eat soup.
I am using banknotes to light the coals for a BBQ.
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for something it was not meant for. That is always a disadvantage.
A sane human being always uses the tool that is intended for the job at hand, not some randomly picked one. Especially when there is no shortage of the right tools.
URL encoding is not intended to be used with a database, as one can tell from the name. That alone is reason enough for a sane developer. Take a look around and find the proper tool.
There is a thing called "common sense" - a thing widely used in everyday life but for some reason always absent in the PHP world.
Common sense warns us: if we're using the wrong tool, it may spoil the work. Sooner or later it will spoil it. No need to ask for specific details - it's a general rule, one we learn at about the age of five.
Why not use it when playing with some web thingies too?
Why not ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
What hardships did you encounter without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to just storing and retrieving data; a plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used manipulations are sorting and filtering.
Such an intelligent thing as a database can sort and filter data case-insensitively, which is a very handy feature. But of course it can do that only if the characters are saved as-is, not as some random codes.
Sorting text may also use an ordering other than the plain binary order of the character table. Some umlaut characters sit in other parts of the table, but the database collation will put them in the right place. Again, this works only if the characters are saved as-is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database. Say, cut a piece out of a string and compare it with an entered value. How is that supposed to be done with urlencoded data?
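For completeness, a short sketch of the straightforward alternative: declare the connection charset and store the raw UTF-8 string untouched. The table and column names (customers, address) and the credentials are made up for the example:

    <?php
    // Tell MySQL the connection is utf8mb4 and store the text as-is.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=test;charset=utf8mb4',
        'user',
        'password'
    );

    $stmt = $pdo->prepare('INSERT INTO customers (address) VALUES (?)');
    $stmt->execute(['Königsallee 1, 40212 Düsseldorf']);

    // Sorting, LIKE, SUBSTRING() and full-text search now all operate
    // on the real characters, with no decode step anywhere.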
I am in need of a lightweight fast search solution.
Today I use fulltext in boolean mode, where every search word is mandatory in the results.
The function is fast, working and meets the requirements.
BUT some of the fulltext limitations, http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html, have turned out to be a problem. The site is on a hosted server and I'm not allowed to change the MySQL settings (e.g. the minimum word length).
E.g.
the search must be able to find red, 11 and ab.cd, which today's fulltext solution can't.
http://sphinxsearch.com/ is what you're looking for
though you have to understand that the smaller the words you want to find, the bigger the indexes you will use.
Use Lucene; it's very often used alongside MySQL and it'll be both faster and more featureful.
Using the built-in FTS engine is relatively bad practice, especially since it doesn't work with the slightly more reliable InnoDB engine.
The only thing that comes to mind would be basing your search on the number of occurrences you can find. Your actual indexing method could vary, depending on what the DB offers.
Assuming DB size isn't an issue, a (very) basic approach would be to break the searchable blobs (say, a post on stackoverflow) into individual words, normalize them (remove plurals, strip 'logic' words such as and, etc.), then insert each word as a new record, together with the ID that identifies your indexed resource.
Count the instances of the ID, order by count, higher number = more relevant.
Not exactly my field though, so tread carefully! =]
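A rough sketch of that "one row per word" idea in PHP; the stop-word list is just a placeholder:

    <?php
    // Split a blob into words, normalize them, drop stop words, and
    // count occurrences.
    function index_words(string $blob): array
    {
        $stopWords = ['and', 'or', 'the', 'a', 'of'];

        $words = preg_split('/[^a-z0-9.]+/i', strtolower($blob), -1, PREG_SPLIT_NO_EMPTY);
        $words = array_diff($words, $stopWords);

        return array_count_values($words);   // word => number of occurrences
    }

    print_r(index_words('Red and red cables, 11 of them, labelled ab.cd'));
    // red => 2, cables => 1, 11 => 1, them => 1, labelled => 1, ab.cd => 1

Each (resource ID, word, count) triple then goes into your own index table; at query time you look up each search word, group by resource ID and order by the summed counts, which is the "higher number = more relevant" part.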
I'd recommend you try distance searching: Levenshtein
Or search for "N-gram fulltext indexing".
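A small sketch of the n-gram idea, which keeps short terms and terms with punctuation findable because the grams go into your own index table rather than through MySQL's fulltext parser; the gram size (2 here) is a tuning choice:

    <?php
    // Generate the character n-grams of a single token.
    function ngrams(string $term, int $n = 2): array
    {
        $term = strtolower($term);
        $grams = [];
        for ($i = 0; $i <= strlen($term) - $n; $i++) {
            $grams[] = substr($term, $i, $n);
        }
        return array_unique($grams);
    }

    print_r(ngrams('ab.cd'));   // ab, b., .c, cd
    print_r(ngrams('red'));     // re, ed
    print_r(ngrams('11'));      // 11

Both the stored text and the query get the same n-gram treatment and are matched on shared grams, so the hoster's minimum-word-length setting never comes into play.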
I haven't mucked around with it, but I read up on the theory of fulltext searching (with MySQL at least) a little while back.
If memory serves me correctly, you can use fulltext search for what you want, but you need to configure it (and I think recompile) to get it to work on a smaller number of search characters. I think the default minimum is 4 characters. You'll want to change it to 2 characters, with a few other options thrown in, and test the results you get.
Someone correct me if this is wrong; I would rather not send the OP off chasing a red herring.