I am looking to generate a random number for every user contribution, to serve as the title of the contribution.
I could simply query the database each time and generate a number that does not match any existing entry, but I imagine this is inefficient and could become slow once the database gets big. I would also have to keep all the numbers from the database somewhere to manage the "not equal to" check, in an array or something similar, and that could end up being a giant one.
Excuse the layman's speech, I am new to this.
Any suggestions for how this can be solved efficiently without straining resources too much? You can explain it in words; you do not have to provide any scripts, I will figure it out.
You can use uniqid(). I'm not sure how portable it is.
Example:
printf("uniqid(): %s\r\n", uniqid());
Will output something like:
uniqid(): 4b3403665fea6
Note that uniqid() is derived from the current time in microseconds rather than being truly random, so the value it gives you can technically repeat.
Maybe you can apply a simple algorithm to an auto-increment field? n(n+1)/2 or something?
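A minimal sketch of that idea, assuming the value comes from an auto-increment column (the id here is hard-coded only for illustration):

<?php
// Derive the title from the auto-increment id, so it is unique by
// construction. $id would normally come from the DB, e.g. mysqli::$insert_id.
$id = 12345;                  // hypothetical auto-increment value
$title = $id * ($id + 1) / 2; // n(n+1)/2 is distinct for every positive n
echo $title;                  // prints 76205685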
Related
I have been using the function uniqid(), but I'm thinking, is it possible for the PHP function uniqid() to generate the same random name twice?
The answer I see everywhere is no, but why? Why isn't it possible for it to generate, by some kind of accident, a name it has generated before? I mean, it presumably could produce a previous name again; even with more entropy enabled, like uniqid('', true), it might generate the same name by coincidence.
Warning
This function does not guarantee uniqueness of return value. Since most systems adjust system clock by NTP or like, system time is changed constantly. Therefore, it is possible that this function does not return unique ID for the process/thread. Use more_entropy to increase likelihood of uniqueness.
— https://www.php.net/uniqid
My answer: it is possible, but not a practical concern.
Much like the "no two snowflakes are alike" statement, it might happen, but the odds of anyone noticing are vanishingly small.
When you look at the odds, we are talking about being struck by lightning while standing under a metal shelter on a sunny day type of odds.
https://www.php.net/manual/en/function.uniqid.php
If you want to do the math, feel free.
Unless you are dealing with the rather extreme edge case of two or more threads calling this function simultaneously within the same microsecond, I would say that your fears are comfortably unfounded.
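For what it's worth, a small sketch of the difference (output values are illustrative):

<?php
// Collisions come from two calls landing in the same microsecond; the
// more_entropy flag appends combined-LCG output to make that far less likely.
echo uniqid(), "\n";         // e.g. 4b3403665fea6
echo uniqid('', true), "\n"; // e.g. 4b340550242239.64159797
echo uniqid('user_'), "\n";  // an optional prefix further namespaces the id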
I have to create unique codes for each "company" in my database.
The only way I can see to do this is to create a random number with rand() and then check whether that number already exists for this "company" in the DB; if it does, generate a new one.
My question is: is there not a better, more efficient way to do this? If I am creating 10,000 codes and there are already 500,000 in the DB, it is going to get progressively slower and slower.
Any ideas or tips on perhaps a better way to do it?
EDIT:
Sorry, perhaps I can explain better. The codes will not all be generated at the same time; they can be created once a day/month/year, whenever.
Also, I need to be able to define the characters of the codes, for example alphanumeric or numbers only.
I recommend you use a "Universally Unique Identifier" (UUID): http://en.wikipedia.org/wiki/Universally_unique_identifier to generate your random codes for each company. This way you can avoid checking your database for duplicates:
Anyone can create a UUID and use it to identify something with
reasonable confidence that the same identifier will never be
unintentionally created by anyone to identify something else.
Information labeled with UUIDs can therefore be later combined into a
single database without needing to resolve identifier (ID) conflicts.
In PHP you can use the function uniqid for this purpose: http://es1.php.net/manual/en/function.uniqid.php
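If you want an actual RFC 4122 UUID rather than uniqid()'s timestamp string, here is a minimal sketch of a version-4 (random) UUID, assuming PHP 7+ for random_bytes(); libraries such as ramsey/uuid do the same thing for you:

<?php
// Build a version-4 UUID from 16 random bytes.
function uuid4(): string {
    $b = random_bytes(16);
    $b[6] = chr((ord($b[6]) & 0x0f) | 0x40); // set the version to 4
    $b[8] = chr((ord($b[8]) & 0x3f) | 0x80); // set the RFC 4122 variant
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($b), 4));
}
echo uuid4(); // e.g. 1ee9aa1b-6510-4105-92b9-7171bb2f3089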
MySQL's UUID Function should help. http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
INSERT INTO `table` (col1, col2) VALUES (UUID(), 'someValue');
If the codes are just integers, then use auto-increment, or get the current max value and start incrementing from it.
I recently bought myself a domain for personal URL shortening.
And I created a function to generate alphanumeric strings of 4 characters to use as references.
BUT
How do I check whether they are already used or not? I can't check every URL against the database... or is that just the way it works and I have to do it?
If so, what happens once I have 13,000,000 URLs generated (out of 14,776,336 possible)? Do I need to keep generating strings until I find one that is not in the DB yet?
This just doesn't seem like the right way to do it; can anyone give me some advice?
One memory-efficient and faster way I can think of is the following. This problem can be solved without using a database at all. The idea is that instead of storing used URLs in a database, you store them in memory, and since storing them outright could use a lot of memory, we use a bit set (an array of bits) so that each URL needs only one bit.
For each random string you generate, compute a hash code for it that lies between 0 and some maximum number K.
Create a bit set (basically a bit array). Whenever you use some URL, set the corresponding hash-code bit in the bit set to 1.
Whenever you generate a new URL, check whether its hash-code bit is set. If it is, discard that URL and generate a new one. Repeat the process until you get an unused one.
This way you avoid the DB entirely, your lookups are extremely fast, and it takes the least amount of memory; see the sketch below.
I borrowed the idea from this place
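A minimal sketch of the bit-set idea, assuming PHP 7.1+ and 4-character alphanumeric codes (62^4 = 14,776,336 possibilities, so the whole set fits in under 2 MB); all names are illustrative. A 4-character code maps one-to-one to a bit index here, so there are no hash collisions to worry about:

<?php
const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
const MAX_CODES = 62 ** 4; // 14,776,336 slots, packed 8 per byte below

// One bit per possible code.
$bitset = str_repeat("\0", intdiv(MAX_CODES + 7, 8));

// Map a 4-character code to its index in [0, 62^4).
function codeIndex(string $code): int {
    $n = 0;
    foreach (str_split($code) as $ch) {
        $n = $n * 62 + strpos(ALPHABET, $ch);
    }
    return $n;
}

function isUsed(string $bitset, int $i): bool {
    return (bool) ((ord($bitset[intdiv($i, 8)]) >> ($i % 8)) & 1);
}

function markUsed(string &$bitset, int $i): void {
    $bitset[intdiv($i, 8)] = chr(ord($bitset[intdiv($i, 8)]) | (1 << ($i % 8)));
}

// Keep generating until we hit an unused code, then reserve it.
do {
    $code = '';
    for ($k = 0; $k < 4; $k++) {
        $code .= ALPHABET[random_int(0, 61)];
    }
    $i = codeIndex($code);
} while (isUsed($bitset, $i));
markUsed($bitset, $i);
echo "reserved: $code\n";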
A compromise solution is to generate a random id and, if it is already in the database, find the first empty id that is bigger than it (wrapping around if you can't find any empty space in the range above it).
If you don't need the ids to be unguessable (you probably don't if you only use 4 characters), this approach works fine and is quick.
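A hedged sketch of that compromise using mysqli, assuming a urls table with an integer id column; connection details and names are illustrative, and the wrap-around case is left out for brevity:

<?php
$db = new mysqli('localhost', 'user', 'pass', 'shortener');
$max = 62 ** 4; // 14,776,336 possible 4-character codes
$candidate = random_int(0, $max - 1);

// Is the random id taken?
$stmt = $db->prepare('SELECT 1 FROM urls WHERE id = ?');
$stmt->bind_param('i', $candidate);
$stmt->execute();

if ($stmt->get_result()->num_rows > 0) {
    // Taken: take the first gap above it instead.
    $gap = $db->prepare(
        'SELECT t1.id + 1 AS free_id
           FROM urls t1
           LEFT JOIN urls t2 ON t2.id = t1.id + 1
          WHERE t2.id IS NULL AND t1.id >= ?
          ORDER BY t1.id
          LIMIT 1'
    );
    $gap->bind_param('i', $candidate);
    $gap->execute();
    $candidate = $gap->get_result()->fetch_assoc()['free_id'];
}
echo "next free id: $candidate\n";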
One algorithm is to try a few times to find a free URL of N characters; if none is found, increase N. Start with N = 4.
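A small sketch of that loop; db_code_exists() is a hypothetical stand-in for a real lookup against your URL table:

<?php
// Generate a random lowercase alphanumeric code of length $n.
function makeCode(int $n): string {
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyz';
    $code = '';
    for ($i = 0; $i < $n; $i++) {
        $code .= $alphabet[random_int(0, 35)];
    }
    return $code;
}

// Hypothetical lookup: replace with a real SELECT against your table.
function db_code_exists(string $code): bool {
    static $used = ['ab12' => true];
    return isset($used[$code]);
}

$n = 4;
do {
    for ($try = 0; $try < 10; $try++) {
        $code = makeCode($n);
        if (!db_code_exists($code)) {
            break 2; // found a free code
        }
    }
    $n++; // this length is too crowded, add a character
} while (true);
echo "free code: $code\n";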
Essentially what I want to do is search a number of MySQL databases and return results where a certain field is more than 50% similar to another record in the databases.
What am I trying to achieve?
I have a number of writers who add content to a network of websites that I own, I need a tool that will tell me if any of the pages they have written are too similar to any of the pages currently published on the network. This could run on post/update or as a cron... either way would work for me.
I've tried building something in PHP, pulling the records from the database and using the function similar_text(), which gives a percentage difference between two strings. This, however, is not a workable solution, because you have to compare every entry against every other entry, and I worked out with microtime() that it would take around 80 hours to search all of the entries!
Wondering if it's even possible!?
Thanks!
What you are probably looking for is SOUNDEX. It is the only sound-based search in MySQL. If you have a LOT of data to compare, you will probably need to pregenerate the SOUNDEX values and compare the soundex columns, or use it live like this:
SELECT * FROM data AS t1 LEFT JOIN data AS t2 ON SOUNDEX(t1.fieldtoanalyse) = SOUNDEX(t2.fieldtoanalyse)
Note that you can also use the
t1.fieldtoanalyze SOUNDS LIKE t2.fieldtoanalyze
syntax.
Finally, you can save the SOUNDEX to a column; just create the column and:
UPDATE data SET fieldsoundex = SOUNDEX(fieldtoanalyze)
and then compare live against the pregenerated values:
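For example, once the column is populated, the pairwise comparison becomes a plain join (the id column here is assumed):

SELECT t1.id, t2.id
FROM data AS t1
JOIN data AS t2
  ON t1.fieldsoundex = t2.fieldsoundex
 AND t1.id < t2.id -- lists each similar pair once and skips self-matches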
More on Soundex
Soundex is a function that analyzes the composition of a word, but in a very crude way. It is very useful for comparisons like "Color" vs "Colour" and "Armor" vs "Armour", but it can also dish out weird results with long words, because the SOUNDEX of a word is a letter plus a 3-digit code. Sadly, there is only so much you can do with those combinations.
Note that there is no Levenshtein or Metaphone implementation in MySQL... not yet. Levenshtein would probably have been the best fit for your case.
Anything is possible.
Without knowing your criteria for "similar", it's difficult to offer a specific solution. However, my suggestion would be to pre-build a similarity table using a function such as similar_text(), and use that as your index table when searching by term.
You'll take an initial hit to build such an index, but you can then maintain it incrementally as new records are added.
Thanks for your answers, guys. For anyone looking to solve a similar problem: I used the SOUNDEX function to pull out entries that had a similar title, then compared them with the similar_text() function. Not quite a complete database comparison, but as near as I could get!
I was wondering how I could quickly search a data string of up to 1 billion bytes. The data is all numeric. Currently we have the data split into 250k files, and the search uses strpos() (the fastest built-in function) on each file until it finds something.
Is there a way I can index to make it go faster? Any suggestions?
Eventually I would like to find multiple occurrences, which, as of now, would be done with the offset parameter of strpos().
Any help would surely lead to recognition where needed.
Thanks!
Well, your tags indicate what you should do (the tag I am referring to is "indexing").
Basically, you should have separate files that hold the indexes for the data: each entry would contain a data string you might look for, together with the file and byte position where it occurs.
You would then consult the index, look up your value, find the location(s) of the data string in the original file(s), and process from there.
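A sketch of that idea in PHP (7.1+), assuming fixed-length numeric queries (K = 6 digits here) and illustrative file names. For a billion bytes you would persist the index to disk or a DB table rather than hold it in RAM, but the shape is the same:

<?php
const K = 6;
$index = []; // substring => list of [file, byte offset]

// Pass 1: record where every K-digit substring starts.
foreach (glob('data/*.dat') as $file) {
    $data = file_get_contents($file);
    for ($i = 0, $limit = strlen($data) - K; $i <= $limit; $i++) {
        $index[substr($data, $i, K)][] = [$file, $i];
    }
}

// A lookup is now a hash access instead of a strpos() scan of every file.
foreach ($index['123456'] ?? [] as [$file, $offset]) {
    echo "found in $file at byte $offset\n";
}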
A good answer may require that you get a little more specific.
How long is the search query? 1 digit? 10 digits? Arbitrary length?
How "fast" is fast enough? 1 second? 10 seconds? 1 minute?
How many total queries per second / minute / hour do you expect?
How frequently does the data change? Every day? Hour? Continuously?
When you say "multiple occurrences" it sounds like you mean overlapping matches.
What is the "value" of the answer and to how many people?
A billion ain't what it used to be, so you could just index the crap out of the whole thing and have an index that is 10 or even 100 times the size of the original data. But if the data is changing by the minute, that would mean you were burning more cycles creating the index than searching it.
The amount of time and money you put into a solution is a function of the value of that solution.
You should definitely get a girlfriend. Besides helping you spend your time better it can grow fat without bursting. Oh, and the same goes for databases.
All of Peter Rowell's questions pertain. If you absolutely must have an out-of-the-box answer, then try grep. You can even exec it from PHP if you like. It is orders of magnitude faster than strpos(). We've actually used it quite successfully as a solution for something that couldn't deal with indexing.
But again, Peter's questions still all apply. I'd answer them before diving into a solution.
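For the record, a sketch of shelling out to grep from PHP, assuming GNU grep and data files under data/ (both illustrative):

<?php
// -F = fixed string, -o = print only the match, -b = print the byte offset.
$needle = escapeshellarg('123456');
exec("grep -obF $needle data/*.dat", $lines);
foreach ($lines as $line) {
    echo $line, "\n"; // output format: file:byte-offset:match
}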
Would a hash function/table work? Or a Suffix Array/Tree?