I have a CSV in the format:
Bill,Smith,123 Main Street,Smalltown,NY,5551234567
Jane,Smith,123 Main Street,Smalltown,NY,5551234567
John,Doe,85 Main Street,Smalltown,NY,5558901234
John,Doe,100 Foo Street,Bigtown,CA,5556789012
In other words, no one field is unique. Two people can have the same name, two people can have the same phone, etc., but each line is itself unique when you consider all of the fields.
I need to generate a unique ID for each row but it cannot be random. And I need to be able to take a line of the CSV at some time in the future and figure out what the unique ID was for that person without having to query a database.
What would be the fastest way of doing this in PHP? I need to do this for millions of rows, so md5()'ing the whole string for each row isn't really practical. Is there a better function I should use?
If you need to be able to later reconstruct the ID from only the text of the line, you will need a hash algorithm. It doesn't have to be MD5, though.
"Millions of IDs" isn't really a problem for modern CPUs (or, especially, GPUs. See Jeff's recent blog about Speed Hashing), so you might want to do the hashing in a different language than PHP. The only problem I can see is collisions. You need to be certain that your generated hashes really are unique, the chance of which depends on the number of entries, the used algorithm and the length of the hash.
According to Jeff's article, MD5 is already one of the fastest hash algorithms out there (with 10-20,000 million hashes per second), but NTLM appears to be twice as fast.
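For instance, a rough PHP sketch (assuming the six comma-separated columns from the question and a file called people.csv) could look like this:
// Derive a deterministic ID for each CSV line: the same line always
// produces the same ID, so it can be recomputed later without a database.
$in = fopen('people.csv', 'r');
while (($fields = fgetcsv($in)) !== false) {
    $line = implode(',', $fields);
    $id = md5($line);          // any stable hash works; md5 keeps collisions unlikely
    echo $id . ',' . $line . "\n";
}
fclose($in);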
Why not just
CREATE TABLE data (
first VARCHAR(50),
last VARCHAR(50),
addr VARCHAR(50),
city VARCHAR(50),
state VARCHAR(50),
phone VARCHAR(50),
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id)
);
LOAD DATA [LOCAL] INFILE 'file.csv'
INTO TABLE data
FIELDS TERMINATED BY ','
(first,last,addr,city,state,phone);
How about just adding the unique ID as a field?
$csv = file($file);        // read the original CSV into an array of lines
$i = 0;
$csv_new = array();
foreach ($csv as $val) {   // iterate over each line of the file
    $csv_new[] = $i . "," . $val;
    $i++;
}
And output the $csv_new as the new csv file..
Dirty but it may work for you.
I understand what you're saying, but I don't see the point. Creating a unique id that auto-increments in the database would be the best route. The second route would be creating something like cell=a1+1 in the CSV and dragging it down the entire column. In PHP you can read the file, prepend something such as date('ymd').$id, and then write it back to the file. Again, though, this seems silly to do and the database route would be best. Just keep in mind PCI compliance and always encrypt the data.
I'll post code later. I'm not at the PC at this time.
It's been a long time, but I found a situation sort of like this where I needed to prevent a duplicate row from being created in a database. I created another column called de_dup which was set to be unique. Then, for each row on creation, I used date('ymd').md5(implode($selected_csv_values)); this prevented a customer from creating two orders on any given day unless specific information was different, i.e. firstname, lastname, creditcardnum, billingaddress.
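Roughly, the idea looked something like this (sketch only; the table, column, and variable names are made up, and $pdo is assumed to be an open PDO connection set to throw exceptions):
// Build the de_dup value from the customer data plus today's date.
$selected_csv_values = array($firstname, $lastname, $creditcardnum, $billingaddress);
$de_dup = date('ymd') . md5(implode('|', $selected_csv_values));

// Because de_dup has a UNIQUE index, a second identical order today fails to insert.
$stmt = $pdo->prepare('INSERT INTO orders (firstname, lastname, de_dup) VALUES (?, ?, ?)');
try {
    $stmt->execute(array($firstname, $lastname, $de_dup));
} catch (PDOException $e) {
    if ($e->getCode() == '23000') {   // integrity constraint violation = duplicate de_dup
        echo 'Looks like a duplicate order for today.';
    } else {
        throw $e;
    }
}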
Until now I've always stored records in a MySQL database by generating an ID (varchar 32 primary key) with PHP, using a function like this:
$id = substr( str_shuffle( 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' ), 0, 8 );
But until now the MySQL DB has always used utf8_bin (which is case sensitive); now I'm using utf8_general_ci (case insensitive).
I have a table in my DB to store statistics, and in this table there are millions of records.
In this case, is it better to use 'id INT UNSIGNED AUTO_INCREMENT' as the primary key?
If yes, is it possible that the script crashes with a 'duplicate id' error when many users call it at the same time? And how can I avoid that?
Even though several people can access the site at once, MySQL will process inserts into the table sequentially and queue the requests it receives. So in the insert query, if an ID is not provided, an auto-incremented ID will be generated, the row saved and committed, and then the next request in the queue will be processed. There is no way an auto-incremented ID can be duplicated this way.
Additionally, your code generates a random string, not a unique string. There is a big difference between the two: it is quite possible to generate a random string that has already been generated earlier.
On the other hand, auto-increment is a steadily increasing sequential number, ensuring there is no chance of a duplicate key. As such, it is always advisable to use auto-increment to generate a primary key rather than generating your own.
To get the last generated MySQL ID you can use mysqli_insert_id() right after your insert query in PHP and use it in your code for subsequent interactions with MySQL with respect to the inserted row.
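For example (the table and column names here are hypothetical, and the connection details are placeholders):
// Let MySQL assign the auto-increment ID, then read it back for later use.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$stmt = $mysqli->prepare('INSERT INTO stats (page, hits) VALUES (?, ?)');
$page = '/home';
$hits = 1;
$stmt->bind_param('si', $page, $hits);
$stmt->execute();
$newId = $mysqli->insert_id;   // same value that mysqli_insert_id($mysqli) returns
echo "Inserted row got ID $newId";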
In my opinion an auto-increment in MySQL is better, because your PHP script could be visited by more than one person at the same time,
so the id might not be unique anymore.
And I am pretty sure that MySQL is well programmed enough to prevent duplicate ids ;)
In fact your current code has a bug: the same ID might be generated again. A MySQL-generated id doesn't have this problem. Even if you have a reason to generate your own ids, I would still use a MySQL auto-increment integer to link between tables, because of better indexing (speed).
And if, for example, you want to hide the sequence from the user, keep your own id in a separate column with a unique index, and do the id generation and insert in a do-while loop so that if you happen to generate the same id a second time, you can retry.
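A minimal sketch of that retry loop, assuming the visible id lives in a column called public_id with a UNIQUE index, $mysqli is an open connection, and $payload stands in for whatever row data you are saving:
mysqli_report(MYSQLI_REPORT_OFF);   // use return values instead of exceptions in this sketch
$payload = 'example row data';
do {
    $publicId = substr(str_shuffle('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'), 0, 8);
    $stmt = $mysqli->prepare('INSERT INTO stats (public_id, payload) VALUES (?, ?)');
    $stmt->bind_param('ss', $publicId, $payload);
    $ok = $stmt->execute();
    // errno 1062 = duplicate key: generate a new id and try again; any other error stops the loop
} while (!$ok && $mysqli->errno == 1062);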
I have a MySQL/PHP performance related question.
I need to store an index list associated with each record in a table. Each list contains 1000 indices. I need to be able to quickly access any index value in the list associated to a given record. I am not sure about the best way to go. I've thought of the following ways and would like your input on them:
Store the list in a string as a comma-separated value list or using JSON. Probably terrible performance, since I'd need to extract the whole list out of the DB into PHP only to retrieve a single value. Parsing the string won't exactly be fast either... I could keep a number of expanded lists in a Least Recently Used cache on the PHP side to reduce load.
Make a list table with 1001 columns that will store the list and its primary key. I'm not sure how costly this is in terms of storage, and it feels like abusing the system. And then, what if I need to store 100,000 indices?
Only store in SQL the name of a binary file containing my indices, and perform an fopen(); fseek(); fread(); fclose() cycle for each access? I'm not sure how the filesystem cache will react to that. If it goes badly, there are many solutions available to address the issues... but that sounds a bit like overkill, no?
What do you think of that?
What about a good old one-to-many relationship?
records
-------
id int
record ...
indices
-------
record_id int
index varchar
Then:
SELECT *
FROM records
LEFT JOIN indices
ON records.id = indices.record_id
WHERE indices.index = 'foo'
The standard solution is to create another table, with one row per (record, index), and add a MySQL Index to allow fast search
CREATE TABLE IF NOT EXISTS `table_list` (
`IDrecord` int(11) NOT NULL,
`item` int(11) NOT NULL,
KEY `IDrecord` (`IDrecord`)
)
Change the item's type according to your needs - I used int in my example.
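Usage would then look something like this (sketch; PDO assumed in $pdo, $recordId is a placeholder, names as in the table above):
// Fetch every item stored for one record; the KEY on IDrecord keeps the lookup fast.
$stmt = $pdo->prepare('SELECT item FROM table_list WHERE IDrecord = ?');
$stmt->execute(array($recordId));
$items = $stmt->fetchAll(PDO::FETCH_COLUMN);
If you also need to grab the n-th value of a record's list directly, an extra position column (with a key on IDrecord, position) would let you do that without pulling the whole list into PHP.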
The most logical solution would be to put each value in its own tuple (row). Adding a MySQL index to the table will enable the DBMS to quickly locate a given value, and should improve performance.
The reasons for not going with your other options are as follows:
Option 1
Storing multiple values in one MySQL cell violates the first normal form (the first stage of database normalisation). You can read up on it here.
Option 3
This relies heavily on files outside the database. You want to keep your data storage as localized as possible, to make it easier to maintain in the future.
I have keys for a project I made where I am trying to test a licensing system (just for fun, and for learning). One part that I thought I'd run into is how to distribute the keys. I have about 100 keys in a database, and I'm trying to figure out the best way to distribute them. The database is laid out as follows,
ID (Auto Increment) | key
Using the PDO library, what is the most effective way to hand them out? Should I just go in chronological order by ID? But even if I did, once I delete a key that has been given out, how would I keep going in chronological order? Or should I pick a random ID? I have no clue about the most effective way to distribute these keys.
If I understand your question correctly...
You might try this query through PDO:
SELECT * FROM `table-name`
ORDER BY `ID` ASC
Then when you step through the rows in a while() loop from the execution's return, it will be in chronological order like you asked.
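A rough sketch of that loop, assuming an existing PDO connection in $pdo and the ID/key columns from your table:
$stmt = $pdo->query('SELECT * FROM `table-name` ORDER BY `ID` ASC');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // hand out $row['key'] here, lowest ID first
    echo $row['ID'] . ' => ' . $row['key'] . "\n";
}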
As far as losing IDs: if you delete the key with ID #10, your table will jump from 9 to 11 in the returned rows' IDs. When you add a new key, #10 will not be reused unless you specifically specify that ID when inserting.
EDIT: From the phrasing of your question, it sounds like you may be concerned about how you set up the IDs for the keys. Maybe you understand this already, but since you have Auto Increment, your IDs will be automatically generated when you insert new keys, so a new key would be assigned an ID of (ID of last inserted key) + 1.
Chronology isn't exactly a feature of PDO, or for that matter whatever database driver you are using... it's more a matter of your schema.
Typically, a commonly employed field in any database structure is a "timestamp" or "created" field that holds the time the record was created in the database. These fields can be MySQL datatype TIMESTAMP (in which case the driver will return seconds since the Unix Epoch), or DATETIME (in which case most drivers will attempt to return the language's native DateTime object if one exists.) Even though monotonically-increasing primary keys imply a certain amount of chronological order when sorted, a timestamp field can record the exact time a record was created at the server, as well as update on change using ON UPDATE CURRENT_TIMESTAMP. So I would suggest adding this to your schema.
With such a field in your database, you can always sort your queries using:
ORDER BY timestamp_field_name ASC
Also, if by "distribute" you mean some data will be publicly accessible by using this key as query param of some sort, I wouldn't use the monotonic primary key for the exact reason you described, especially if this is a "licensing" proof of concept, which if you mean a DRM-type thing should probably produce a complex string. Hashed timestamps in a UNIQUE field, or the php uniqid function can produce values that can be stored in a VARCHAR database field with the UNIQUE key restraint. This is if I have understood your described goal.
I am creating a form that will capture all of the form data and store it in a database. What I would like to do is use some of the form data to create a custom md5 user id to prevent multiple entries. I know that this is probably not the most ideal way of doing it, so if there is a better way of creating a unique md5 uid that I can use for each user, please enlighten me.
I have considered just using a random number and checking the database against the email and first name, but I am curious to see if there is a better way of doing it.
Thanx in advance!
Wait ... you want to use a unique MD5 hash as a user id? Why not use an auto_increment integer field? Each time an INSERT is run, it will be increased by 1, therefore always being unique. And since it is an integer, any searches against it will be a lot faster.
You can let MySQL do the work for you using "UNIQUE". Assuming you have a table user_data(user_data_id, name, text, content, date) and want name and text to be UNIQUE as a tuple:
CREATE TABLE user_data (
user_data_id INTEGER,
name VARCHAR(50),
text TEXT,
content TEXT,
date DATE,
PRIMARY KEY (user_data_id),
UNIQUE (name, text)
);
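With that constraint in place, a duplicate submission can be handled roughly like this (sketch; PDO assumed in $pdo, and the values are placeholders):
// INSERT IGNORE turns the duplicate-key error into a no-op and inserts nothing.
$stmt = $pdo->prepare('INSERT IGNORE INTO user_data (user_data_id, name, text, content, date) VALUES (?, ?, ?, ?, CURDATE())');
$stmt->execute(array($id, $name, $text, $content));
if ($stmt->rowCount() === 0) {
    echo 'That name/text combination already exists.';
}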
I assume you mean you want to avoid having duplicate entries by the same person.
You should probably check the database for the input's email + firstname + lastname, after normalizing them with strtolower() and removing spaces etc.
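Something along these lines (sketch; the table and field names are made up, and $pdo is an open PDO connection):
// Normalize the submitted values before checking for an existing entry.
$email = strtolower(trim($_POST['email']));
$first = strtolower(preg_replace('/\s+/', '', $_POST['firstname']));
$last  = strtolower(preg_replace('/\s+/', '', $_POST['lastname']));

$stmt = $pdo->prepare('SELECT COUNT(*) FROM entries WHERE email = ? AND firstname = ? AND lastname = ?');
$stmt->execute(array($email, $first, $last));
if ($stmt->fetchColumn() > 0) {
    echo 'Looks like you have already submitted this form.';
}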
Apart from that, you can't know for sure whether the person entering data in the form has done it before. You can't safely rely on the IP being the same, due to proxies and gateways, or even on the computer being the same (shared computers). If you are aggressive on those two fronts you'll probably get a lot of frustrated legitimate users who can't use your system.
Your best bet is to assume your database has duplicate entries and then decide what to do with them: either flag the ones not being used, or, if they're accounts on the system, have some intelligence to check whether they're duplicates based on their behavior.
I want to build a database-wide unique id. That unique id should be one field of every row in every table of that database.
There are a few approaches I have considered:
Create one master-table with an auto-increment-field and a trigger in every other table, like:
"before insert here, insert in master-table -> get the auto-increment value -> and use this value as primary-key here"
I have seen this before, but instead of making one INSERT, it does 2 INSERTS, which I expect would not be that performant.
Add a field uniqueId to every table, and fill this field with a PHP-generated integer... something like unix-timestamp plus a random number.
But then I'd have to use BIGINT as the datatype, which means a big index_length and a big data_length.
Similar to the "uniqueId" idea, but instead of BIGINT I use VARCHAR and populate the value with uniqid().
Since you are looking for opinions... Of the three ideas you give, I would "vote" for the uniqid() solution. It seems pretty low cost in terms of execution (but possibly not implementation).
A simpler solution (I think) would be to just add a field to each table to store a guid and set the default value of the field to be MySQL's function that generates a guid (I think it is UUID). This lets the database do the work for you.
And in the spirit of coming up with random ideas... It would be possible to have some kind of offline process fill in the IDs asynchronously. Make sure every table has the appropriate field and make the default value be 0/empty. Then the offline process could simply run a query on each table to find the rows that do not yet have a unique id and it could fill them in. That would let you control the ID and even use some kind of incrementing integer. This, of course, requires that you do not need the unique ID instantly each time a record is inserted.
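A rough sketch of that offline process (all table and column names here are hypothetical, and the counter seed would come from somewhere persistent):
$tables = array('customers', 'orders', 'invoices');   // every table that needs the shared id
$next = 1000;                                         // last handed-out id + 1, loaded from storage
foreach ($tables as $table) {
    // find rows that still have the default/empty unique id
    $rows = $pdo->query("SELECT id FROM `$table` WHERE unique_id = 0")->fetchAll(PDO::FETCH_COLUMN);
    foreach ($rows as $id) {
        $upd = $pdo->prepare("UPDATE `$table` SET unique_id = ? WHERE id = ?");
        $upd->execute(array($next++, $id));
    }
}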